GitHub - shayan72/Twitter_Language_Identification: Language Identification for Stream of Twitter Data Using Scala Language and Apache Kafka and Apache Spark

Installation

Install scala-2.12.1 and sbt

Download and Install hadoop-2.7.3

Follow the installation instructions in the Hadoop documentation.

Download and Install spark-2.1.0-bin-hadoop2.7

Download and install Spark 2.1.0, pre-built for Hadoop 2.7 and later, from the Apache Spark downloads page.

Download and Install kafka-0.10.2.0

Use the Kafka quick start documentation to download Kafka and start the ZooKeeper and Kafka servers.

Import the project into IntelliJ IDEA and install the sbt dependencies.

Create a new app at Twitter Apps and put the consumer key, consumer secret, access key, and access secret in application.conf.
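As a rough illustration of how the four credentials might be read back at runtime, here is a minimal sketch. It assumes the project loads application.conf through the Typesafe Config library (com.typesafe:config), which this README does not state. The `twitter` section, the key names, and the `TwitterCredentials` class are all hypothetical and may not match the project's actual file.

```scala
import com.typesafe.config.{Config, ConfigFactory}

// Hypothetical holder for the four values listed above; field and key names
// are illustrative and may differ from the project's application.conf.
final case class TwitterCredentials(
    consumerKey: String,
    consumerSecret: String,
    accessKey: String,
    accessSecret: String
)

object TwitterCredentials {
  // Reads the assumed keys from an assumed "twitter" section. getString throws
  // ConfigException.Missing when a key is absent, so a typo fails fast.
  def fromConfig(config: Config): TwitterCredentials = {
    val twitter = config.getConfig("twitter")
    TwitterCredentials(
      consumerKey = twitter.getString("consumerKey"),
      consumerSecret = twitter.getString("consumerSecret"),
      accessKey = twitter.getString("accessKey"),
      accessSecret = twitter.getString("accessSecret")
    )
  }
}

object CredentialsDemo extends App {
  // Inline HOCON standing in for application.conf; the values are placeholders.
  val sample: Config = ConfigFactory.parseString(
    """
      |twitter {
      |  consumerKey    = "YOUR_CONSUMER_KEY"
      |  consumerSecret = "YOUR_CONSUMER_SECRET"
      |  accessKey      = "YOUR_ACCESS_KEY"
      |  accessSecret   = "YOUR_ACCESS_SECRET"
      |}
    """.stripMargin)

  val creds = TwitterCredentials.fromConfig(sample)
  println(s"Loaded credentials for consumer key ${creds.consumerKey}")
}
```

In a real run, `ConfigFactory.load()` would pick up application.conf from the classpath instead of the inline string. Keep the real keys out of version control.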

Run the code using any of the following main functions:

DistributedLanguageDetection for running the distributed k-means algorithm
DistributedLanguageDetection for getting Twitter data from Kafka and running the streaming k-means algorithm
CommonNgrams for preprocessing tweets and finding common n-grams in each language
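To give a feel for the preprocessing and common n-gram step, here is a small sketch in plain Scala with no Spark dependency. It is not the repository's CommonNgrams code. The object name, the cleaning rules (lowercasing, dropping URLs, @mentions, and '#' signs), and the use of character n-grams are assumptions made for illustration.

```scala
// Hypothetical sketch; names and cleaning rules are illustrative only.
object NgramSketch {

  /** Lowercase the tweet, drop URLs, @mentions, and '#' signs, then collapse whitespace. */
  def preprocess(tweet: String): String =
    tweet.toLowerCase
      .replaceAll("""https?://\S+""", " ")
      .replaceAll("""@\w+""", " ")
      .replaceAll("#", " ")
      .replaceAll("""\s+""", " ")
      .trim

  /** All character n-grams of length n, in order of appearance. */
  def charNgrams(text: String, n: Int): Seq[String] =
    if (n <= 0 || text.length < n) Vector.empty
    else text.sliding(n).toVector

  /** The k most frequent n-grams per language; ties are broken alphabetically. */
  def commonNgrams(
      tweetsByLang: Map[String, Seq[String]],
      n: Int,
      k: Int
  ): Map[String, Seq[String]] =
    tweetsByLang.map { case (lang, tweets) =>
      // Count how often each n-gram occurs across all tweets of this language.
      val counts: Map[String, Int] = tweets
        .flatMap(t => charNgrams(preprocess(t), n))
        .groupBy(identity)
        .map { case (gram, occurrences) => gram -> occurrences.size }
      // Sort by descending count, then by the n-gram itself for determinism.
      val top = counts.toSeq
        .sortBy { case (gram, count) => (-count, gram) }
        .take(k)
        .map(_._1)
      lang -> top
    }
}

object NgramSketchDemo extends App {
  // Toy input: a handful of made-up tweets keyed by language code.
  val sample = Map(
    "en" -> Seq("The weather is great today http://t.co/x", "@bob the game was great"),
    "es" -> Seq("El tiempo es bueno hoy", "#futbol que buen partido")
  )
  NgramSketch.commonNgrams(sample, n = 3, k = 5).foreach { case (lang, grams) =>
    println(s"$lang: ${grams.mkString(", ")}")
  }
}
```

The per-language n-gram lists produced this way could serve as the vocabulary for feature vectors fed to k-means, although how the project actually builds its features is not described here.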
