# TFIDF Coding Challenge

This repository contains the TFIDF coding challenge for Bakdata.

## Use Case

Scientific publications are continuously loaded to S3 and processed with Apache Kafka to create an output topic with TFIDF values that identify important words in the text corpus.

Example data: `vep_big_names_of_science_v2_txt.zip`

## Tasks

I) Build a data pipeline with Apache Kafka and Kafka Streams to create a TFIDF output topic for the given example data.

II) Make handling of large texts (>1 MB) possible as well, using the kafka-s3-backed-serde SerDe for large message processing.
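As a refresher on the metric itself: the TFIDF score of a term in a document is its term frequency (share of the document's terms) multiplied by the inverse document frequency (log of corpus size over the number of documents containing the term). A minimal, Kafka-free sketch in plain Java — the class name, tiny corpus, and lack of tokenization/normalization are illustrative assumptions, not the challenge's actual implementation:

```java
import java.util.Arrays;
import java.util.List;

public class TfIdfSketch {

    // Term frequency: occurrences of the term divided by the total number of terms in the document.
    static double tf(String term, List<String> doc) {
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    // Inverse document frequency: natural log of (corpus size / documents containing the term).
    static double idf(String term, List<List<String>> corpus) {
        long containing = corpus.stream().filter(d -> d.contains(term)).count();
        return Math.log((double) corpus.size() / containing);
    }

    static double tfidf(String term, List<String> doc, List<List<String>> corpus) {
        return tf(term, doc) * idf(term, corpus);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("kafka", "streams", "processing"),
                Arrays.asList("kafka", "topics"),
                Arrays.asList("science", "publications"));
        // "kafka" appears in 2 of 3 documents, so it scores lower than the rarer "streams".
        System.out.println(tfidf("kafka", corpus.get(0), corpus));
        System.out.println(tfidf("streams", corpus.get(0), corpus));
    }
}
```

Terms appearing in every document get an IDF of zero, which is why common words drop out of the "important words" ranking.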
## Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Installation examples for macOS users:

```sh
$ brew cask install java
$ brew install zookeeper
$ brew install kafka
```
1. If the `data` directory exists, skip this step. Otherwise, create a directory called `data` in the root of the project:

   ```sh
   $ mkdir data
   ```

2. Download and unzip `vep_big_names_of_science_v2_txt.zip`:

   ```sh
   $ curl -O http://vep.cs.wisc.edu/VEPCorporaRelease/zips/vep_big_names_of_science_v2_txt.zip
   $ unzip vep_big_names_of_science_v2_txt.zip -d vep_big_names_of_science_v2_txt
   ```

3. Move the `vep_big_names_of_science_v2_txt` folder inside the `data` directory that you created in step 1.

4. Open the `scripts` folder and run `start-zookeeper.sh`. This script starts ZooKeeper.

5. In the same folder, run `start-kafka-server.sh`. This script starts the Kafka server on your machine.

6. Now run `org.bakdata.kafka.challenge.consumer.TFIDFConsumer`. By default, the consumer runs for one hour. You can pass the timeout in seconds as a program argument.

7. Run `org.bakdata.kafka.challenge.TFIDFApplication`. This class creates the input/output topics and runs the producer (you can find the producer in `org.bakdata.kafka.challenge.producer.TFIDFProducer`). The producer reads each file name from the metadata CSV file, `VEP_Big_Names_of_Science_Metadata.csv`, and sends the file name and file content to the stream processor for calculating the TFIDF.

8. After all the data is sent and the stream processor has finished processing, you can find a report `output.csv` inside the `data` directory. Besides that, you can see all the activity in the console.
- This is what the `producer` logs on the terminal when it sends the documents:

  ```
  [20-Mar-06 11:49:55:601] [INFO] [TFIDFProducer:83] - Sent: A00429.headed.txt
  [20-Mar-06 11:49:55:618] [INFO] [TFIDFProducer:83] - Sent: A01014.headed.txt
  [20-Mar-06 11:49:55:632] [INFO] [TFIDFProducer:83] - Sent: A01089.headed.txt
  [20-Mar-06 11:49:55:643] [INFO] [TFIDFProducer:83] - Sent: A01185.headed.txt
  ...
  [20-Mar-06 11:51:32:389] [INFO] [TFIDFProducer:83] - Sent: K084724.000.txt
  [20-Mar-06 11:51:32:394] [INFO] [TFIDFProducer:83] - Sent: K088587.000.txt
  ```

- This is the output of the `consumer`:

  ```
  ...
  [20-Mar-06 11:54:56:416] [INFO] [ConsumerTask:78] - Consumer Record:(really, TFIDFResult{documentName='K088587.000.txt', tfidf=2.292379255671519E-5, overallDocumentCount=252.0})
  [20-Mar-06 11:54:56:416] [INFO] [ConsumerTask:78] - Consumer Record:(westward, TFIDFResult{documentName='K088587.000.txt', tfidf=7.749641050412507E-5, overallDocumentCount=252.0})
  [20-Mar-06 11:54:56:416] [INFO] [ConsumerTask:78] - Consumer Record:(round, TFIDFResult{documentName='K088587.000.txt', tfidf=7.2739186459018475E-6, overallDocumentCount=252.0})
  [20-Mar-06 11:54:56:416] [INFO] [ConsumerTask:78] - Consumer Record:(corresponding, TFIDFResult{documentName='K088587.000.txt', tfidf=1.8929645955659432E-4, overallDocumentCount=252.0})
  [20-Mar-06 11:54:56:416] [INFO] [ConsumerTask:78] - Consumer Record:(1201, TFIDFResult{documentName='K088587.000.txt', tfidf=2.2650448413332806E-4, overallDocumentCount=252.0})
  [20-Mar-06 11:54:56:416] [INFO] [ConsumerTask:78] - Consumer Record:(variable, TFIDFResult{documentName='K088587.000.txt', tfidf=5.947448869452933E-5, overallDocumentCount=252.0})
  [20-Mar-06 11:54:56:416] [INFO] [ConsumerTask:78] - Consumer Record:(intend, TFIDFResult{documentName='K088587.000.txt', tfidf=2.1801436086265487E-5, overallDocumentCount=252.0})
  [20-Mar-06 11:54:56:416] [INFO] [ConsumerTask:78] - Consumer Record:(excentricity, TFIDFResult{documentName='K088587.000.txt', tfidf=1.531078372380589E-4, overallDocumentCount=252.0})
  ```
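The producer step above boils down to iterating over the metadata entries and emitting one record per document, keyed by file name, with the file content as the value. A simplified, Kafka-free sketch of that loop — the class name, in-memory sink, and sample file contents are assumptions for illustration; the real `TFIDFProducer` sends the records to the input topic instead:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ProducerSketch {

    // Stand-in for the Kafka producer: collects (fileName -> content) records
    // in an in-memory map instead of sending them to the input topic.
    static Map<String, String> sendAll(Map<String, String> documents) {
        Map<String, String> sent = new LinkedHashMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            // The real code would send a record here, keyed by file name,
            // so the stream processor can attribute terms to their document.
            sent.put(doc.getKey(), doc.getValue());
            System.out.println("Sent: " + doc.getKey());
        }
        return sent;
    }

    public static void main(String[] args) {
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("A00429.headed.txt", "first document text ...");
        documents.put("A01014.headed.txt", "second document text ...");
        sendAll(documents);
    }
}
```

Keying by file name is what lets the downstream TFIDF computation keep per-document term counts separate while aggregating document frequencies across the corpus.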
## Code Style

This project uses the IntelliJ IDEA code style settings for Bakdata.
## Built With

- Java 8 - Java version used
- Maven - Dependency management
- Kafka Streams 2.4.0 - Used for stream processing
- kafka-s3-backed-serde - A Kafka SerDe that reads and writes records from and to S3 transparently

## Authors

- Ramin Gharib - Initial work - raminqaf