# TFIDF Coding Challenge

This repository contains the TFIDF coding challenge for Bakdata.

## Use Case

Scientific publications are continuously loaded to S3 and processed with Apache Kafka to create an output topic with TFIDF values that identify important words in the text corpus.

Example data: `vep_big_names_of_science_v2_txt.zip`

## Tasks

I) Build a data pipeline with Apache Kafka and Kafka Streams to create a TFIDF output topic for the given example data.

II) Make handling of large texts (>1 MB) possible as well, using the kafka-s3-backed-serde SerDe for large message processing.
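As a refresher on the metric itself: the TFIDF score of a term in a document is its term frequency (share of the document's terms) multiplied by the inverse document frequency (log of corpus size over the number of documents containing the term). A minimal, Kafka-free sketch in plain Java — the class name, tiny corpus, and lack of tokenization/normalization are illustrative assumptions, not the challenge's actual implementation:

```java
import java.util.Arrays;
import java.util.List;

public class TfIdfSketch {

    // Term frequency: occurrences of the term divided by the total number of terms in the document.
    static double tf(String term, List<String> doc) {
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    // Inverse document frequency: natural log of (corpus size / documents containing the term).
    static double idf(String term, List<List<String>> corpus) {
        long containing = corpus.stream().filter(d -> d.contains(term)).count();
        return Math.log((double) corpus.size() / containing);
    }

    static double tfidf(String term, List<String> doc, List<List<String>> corpus) {
        return tf(term, doc) * idf(term, corpus);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("kafka", "streams", "processing"),
                Arrays.asList("kafka", "topics"),
                Arrays.asList("science", "publications"));
        // "kafka" appears in 2 of 3 documents, so it scores lower than the rarer "streams".
        System.out.println(tfidf("kafka", corpus.get(0), corpus));
        System.out.println(tfidf("streams", corpus.get(0), corpus));
    }
}
```

Terms appearing in every document get an IDF of zero, which is why common words drop out of the "important words" ranking.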
## Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Installation examples for macOS users:

```sh
$ brew cask install java
$ brew install zookeeper
$ brew install kafka
```
1. If the `data` directory exists, skip this step. Otherwise, create a directory called `data` in the root of the project:

   ```sh
   $ mkdir data
   ```

2. Download and unzip `vep_big_names_of_science_v2_txt.zip`:

   ```sh
   $ curl -O http://vep.cs.wisc.edu/VEPCorporaRelease/zips/vep_big_names_of_science_v2_txt.zip
   $ unzip vep_big_names_of_science_v2_txt.zip -d vep_big_names_of_science_v2_txt
   ```

3. Move the `vep_big_names_of_science_v2_txt` folder inside the `data` directory that you created in step 1.

4. Open the `scripts` folder and run `start-zookeeper.sh`. This script starts ZooKeeper.

5. In the same folder, run `start-kafka-server.sh`. This script starts the Kafka server on your machine.

6. Now run `org.bakdata.kafka.challenge.consumer.TFIDFConsumer`. By default, the consumer runs for one hour. You can pass the timeout in seconds as a program argument.

7. Run `org.bakdata.kafka.challenge.TFIDFApplication`. This class creates the input/output topics and runs the producer (you can find the producer in `org.bakdata.kafka.challenge.producer.TFIDFProducer`). The producer reads each file name from the metadata CSV file, `VEP_Big_Names_of_Science_Metadata.csv`, and sends the file name and file content to the stream processor for calculating the TFIDF.

8. After all the data is sent and the stream processor has finished processing, you can find a report `output.csv` inside the `data` directory. Besides that, you can see all the activity in the console.
- This is what the `producer` logs on the terminal when it sends the documents:

  ```
  [20-Mar-06 11:49:55:601] [INFO] [TFIDFProducer:83] - Sent: A00429.headed.txt
  [20-Mar-06 11:49:55:618] [INFO] [TFIDFProducer:83] - Sent: A01014.headed.txt
  [20-Mar-06 11:49:55:632] [INFO] [TFIDFProducer:83] - Sent: A01089.headed.txt
  [20-Mar-06 11:49:55:643] [INFO] [TFIDFProducer:83] - Sent: A01185.headed.txt
  ...
  [20-Mar-06 11:51:32:389] [INFO] [TFIDFProducer:83] - Sent: K084724.000.txt
  [20-Mar-06 11:51:32:394] [INFO] [TFIDFProducer:83] - Sent: K088587.000.txt
  ```

- This is the output of the `consumer`:

  ```
  ...
  [20-Mar-06 11:54:56:416] [INFO] [ConsumerTask:78] - Consumer Record:(really, TFIDFResult{documentName='K088587.000.txt', tfidf=2.292379255671519E-5, overallDocumentCount=252.0})
  [20-Mar-06 11:54:56:416] [INFO] [ConsumerTask:78] - Consumer Record:(westward, TFIDFResult{documentName='K088587.000.txt', tfidf=7.749641050412507E-5, overallDocumentCount=252.0})
  [20-Mar-06 11:54:56:416] [INFO] [ConsumerTask:78] - Consumer Record:(round, TFIDFResult{documentName='K088587.000.txt', tfidf=7.2739186459018475E-6, overallDocumentCount=252.0})
  [20-Mar-06 11:54:56:416] [INFO] [ConsumerTask:78] - Consumer Record:(corresponding, TFIDFResult{documentName='K088587.000.txt', tfidf=1.8929645955659432E-4, overallDocumentCount=252.0})
  [20-Mar-06 11:54:56:416] [INFO] [ConsumerTask:78] - Consumer Record:(1201, TFIDFResult{documentName='K088587.000.txt', tfidf=2.2650448413332806E-4, overallDocumentCount=252.0})
  [20-Mar-06 11:54:56:416] [INFO] [ConsumerTask:78] - Consumer Record:(variable, TFIDFResult{documentName='K088587.000.txt', tfidf=5.947448869452933E-5, overallDocumentCount=252.0})
  [20-Mar-06 11:54:56:416] [INFO] [ConsumerTask:78] - Consumer Record:(intend, TFIDFResult{documentName='K088587.000.txt', tfidf=2.1801436086265487E-5, overallDocumentCount=252.0})
  [20-Mar-06 11:54:56:416] [INFO] [ConsumerTask:78] - Consumer Record:(excentricity, TFIDFResult{documentName='K088587.000.txt', tfidf=1.531078372380589E-4, overallDocumentCount=252.0})
  ```
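The producer step above boils down to iterating over the metadata entries and emitting one record per document, keyed by file name, with the file content as the value. A simplified, Kafka-free sketch of that loop — the class name, in-memory sink, and sample file contents are assumptions for illustration; the real `TFIDFProducer` sends the records to the input topic instead:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ProducerSketch {

    // Stand-in for the Kafka producer: collects (fileName -> content) records
    // in an in-memory map instead of sending them to the input topic.
    static Map<String, String> sendAll(Map<String, String> documents) {
        Map<String, String> sent = new LinkedHashMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            // The real code would send a record here, keyed by file name,
            // so the stream processor can attribute terms to their document.
            sent.put(doc.getKey(), doc.getValue());
            System.out.println("Sent: " + doc.getKey());
        }
        return sent;
    }

    public static void main(String[] args) {
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("A00429.headed.txt", "first document text ...");
        documents.put("A01014.headed.txt", "second document text ...");
        sendAll(documents);
    }
}
```

Keying by file name is what lets the downstream TFIDF computation keep per-document term counts separate while aggregating document frequencies across the corpus.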
## Code Style

This project uses the IntelliJ IDEA code style settings for Bakdata.
## Built With

- Java 8 - Java version used
- Maven - Dependency management
- Kafka Streams 2.4.0 - Used for stream processing
- kafka-s3-backed-serde - A Kafka SerDe that reads and writes records from and to S3 transparently

## Authors

- Ramin Gharib - Initial work - raminqaf