# Spark Kafka Sentiment Analysis

![](https://static.wixstatic.com/media/f17a52_84852646da5a4e37837a12cb610b2ad8~mv2.png/v1/fill/w_1000,h_673,al_c,usm_0.66_1.00_0.01/f17a52_84852646da5a4e37837a12cb610b2ad8~mv2.png)
[Source](https://www.dataneb.com/post/analyzing-twitter-texts-spark-streaming-example-2)

<div class="jumbotron">
    <center>
        <b>Sentiment Analysis</b> of streaming Twitter data using Flume/Kafka/Spark
    </center>
</div>

![](https://i.imgflip.com/40j9cu.jpg)
[NicsMeme](https://imgflip.com/i/40j9cu)

# Workflow Design

## 1) Model Building

Goal: Build a Spark MLlib pipeline that classifies whether a tweet expresses positive sentiment or not.

> The focus is not on building a highly accurate classifier, but on showing how a trained model can be applied to streaming data and return predictions.

## 2) Initialize Spark Streaming 

Once the model is built, we need to define the source from which the tweets are read:

### Kafka

## 3) Stream Data

Start the stream: the Spark Streaming API receives the incoming data in batches at a specified interval.

## 4) Predict and Return Results

Once we receive the tweet text, we pass it through the machine learning pipeline we created and return the sentiment predicted by the model.

# Import Libraries

In [5]:
import findspark
import pyspark
from pyspark import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.streaming import StreamingContext
import pyspark.sql.types as tp
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.feature import StopWordsRemover, Word2Vec, RegexTokenizer
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import Row
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

![](images/cuofano.png)

# init 1

In [2]:
findspark.init()  # locate the Spark installation and add pyspark to sys.path
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("TapDataFrame").getOrCreate()
spark

![](http://thejoyofgeek.net/wp-content/uploads/2016/08/robotmask.jpg)
[S2E4](http://thejoyofgeek.net/mr-robot-init_1-review-s2e4/)

# Let's Start!

# Trainset 
***SentiTUT*** 

http://www.di.unito.it/~tutreeb/sentipolc-evalita16/data.html


In [6]:
# idtwitter	subj	opos	oneg	iro	lpos	lneg	top	text

schema = tp.StructType([
    tp.StructField(name='id',         dataType=tp.StringType(),  nullable=True),
    tp.StructField(name='subjective', dataType=tp.IntegerType(), nullable=True),
    tp.StructField(name='positive',   dataType=tp.IntegerType(), nullable=True),
    tp.StructField(name='negative',   dataType=tp.IntegerType(), nullable=True),
    tp.StructField(name='ironic',     dataType=tp.IntegerType(), nullable=True),
    tp.StructField(name='lpositive',  dataType=tp.IntegerType(), nullable=True),
    tp.StructField(name='lnegative',  dataType=tp.IntegerType(), nullable=True),
    tp.StructField(name='top',        dataType=tp.IntegerType(), nullable=True),
    tp.StructField(name='tweet',      dataType=tp.StringType(),  nullable=True)
])

In [7]:
# read the dataset  
training_set = spark.read.csv('../spark/dataset/training_set_sentipolc16.csv',
                         schema=schema,
                         header=True,
                         sep=',')

#training_set.show(truncate=False)
training_set.groupBy("positive").count().show()

+--------+-----+
|positive|count|
+--------+-----+
|       1| 2051|
|       0| 5359|
+--------+-----+



![](https://www.meme-arsenal.com/memes/a05a53a96e890dee5a52d1156c01eb06.jpg)

In [8]:
# define stage 1: tokenize the tweet text    
stage_1 = RegexTokenizer(inputCol= 'tweet' , outputCol= 'tokens', pattern= '\\W')

In [9]:
# define stage 2: remove the stop words
# (the default stop-word list is English, while the tweets are Italian)
stage_2 = StopWordsRemover(inputCol= 'tokens', outputCol= 'filtered_words')

In [10]:
# define stage 3: create a word vector of the size 100
stage_3 = Word2Vec(inputCol= 'filtered_words', outputCol= 'vector', vectorSize= 100)
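
Spark's `Word2Vec` stage turns each tweet into one fixed-size vector by averaging the vectors of its words. The sketch below illustrates that averaging in plain Python with made-up two-dimensional embeddings. The function name `average_vectors` and the toy values are hypothetical. Dividing by the total token count (so that unknown words contribute zeros) is an assumption about how Spark handles out-of-vocabulary words.

```python
def average_vectors(tokens, embeddings, size):
    """Average the embeddings of `tokens`; unknown words contribute zeros."""
    if not tokens:
        return [0.0] * size
    total = [0.0] * size
    for token in tokens:
        # Words missing from the vocabulary add nothing to the sum.
        for i, value in enumerate(embeddings.get(token, [0.0] * size)):
            total[i] += value
    # Assumption: the denominator is the full token count, not only known words.
    return [value / len(tokens) for value in total]


# Toy embeddings, invented purely for illustration.
toy_embeddings = {"tutti": [3.0, 0.0], "amano": [0.0, 6.0]}
document_vector = average_vectors(["tutti", "amano", "ruspe"], toy_embeddings, size=2)
print(document_vector)  # [1.0, 2.0]
```

With `vectorSize=100`, each tweet would be represented by 100 such averaged components, which is what the logistic regression stage consumes.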

In [11]:
# define stage 4: Logistic Regression Model
model = LogisticRegression(featuresCol= 'vector', labelCol= 'positive')

![](https://cdn-images-1.medium.com/max/1600/1*DyD3VP18IV3-lXcKMbyr5w.jpeg)

In [12]:
# setup the pipeline
pipeline = Pipeline(stages= [stage_1, stage_2, stage_3, model])
pipeline

Pipeline_d896a7e3621c

In [13]:
# fit the pipeline model with the training data
pipelineFit = pipeline.fit(training_set)

In [14]:
modelSummary=pipelineFit.stages[-1].summary
modelSummary 
# https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/classification/LogisticRegressionSummary.html

<pyspark.ml.classification.BinaryLogisticRegressionTrainingSummary at 0x7fa790930b10>

In [15]:
modelSummary.accuracy

0.7379217273954116
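
To put this accuracy in context, it helps to compare it with a majority-class baseline. The short calculation below uses the label counts printed by `groupBy("positive").count()` earlier in the notebook and shows what a model that always predicts `0` would score on the same data.

```python
# Label counts taken from the groupBy("positive") output above.
counts = {1: 2051, 0: 5359}

total = sum(counts.values())
baseline = max(counts.values()) / total  # accuracy of always predicting the majority class

print(f"rows: {total}")
print(f"majority-class baseline accuracy: {baseline:.4f}")  # 0.7232
```

The training accuracy of the pipeline is therefore only about 1.5 percentage points above this baseline, which is consistent with the stated goal of demonstrating the workflow rather than maximizing model quality.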

![](images/accuracy.jpg)
[DeepLearningNewsAndMemes](https://www.facebook.com/DeepLearningNewsAndMemes/)

In [16]:
tweetDf = spark.createDataFrame(["False illusioni, sgradevoli realtà Mario Monti http://t.co/WOmMCITs via @AddToAny"], tp.StringType()).toDF("tweet")
tweetDf.show(truncate=False)

+---------------------------------------------------------------------------------+
|tweet                                                                            |
+---------------------------------------------------------------------------------+
|False illusioni, sgradevoli realtà Mario Monti http://t.co/WOmMCITs via @AddToAny|
+---------------------------------------------------------------------------------+



In [17]:
pipelineFit.transform(tweetDf).select('tweet','tokens','prediction').show(truncate=False)

+---------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------+----------+
|tweet                                                                            |tokens                                                                                   |prediction|
+---------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------+----------+
|False illusioni, sgradevoli realtà Mario Monti http://t.co/WOmMCITs via @AddToAny|[false, illusioni, sgradevoli, realt, mario, monti, http, t, co, wommcits, via, addtoany]|0.0       |
+---------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------+----------+
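
Note the token `realt` in the output above: the accented `à` of `realtà` was treated as a separator. The plain-Python sketch below reproduces the same behaviour with the standard `re` module, where the `re.ASCII` flag makes `\W` match every non-ASCII character. The helper name `regex_tokenize` is hypothetical and only mimics what the `RegexTokenizer` stage appears to do here (lowercase, split on the pattern, drop empty tokens).

```python
import re


def regex_tokenize(text, ascii_only=True):
    """Lowercase `text`, split it on non-word characters, drop empty tokens."""
    # With re.ASCII, accented letters such as "à" count as non-word characters.
    flags = re.ASCII if ascii_only else 0
    parts = re.split(r"\W", text.lower(), flags=flags)
    return [part for part in parts if part]


tweet = ("False illusioni, sgradevoli realtà Mario Monti "
         "http://t.co/WOmMCITs via @AddToAny")

print(regex_tokenize(tweet))                    # contains 'realt'
print(regex_tokenize(tweet, ascii_only=False))  # contains 'realtà'
```

For Italian text, a Unicode-aware pattern in the tokenizer stage would probably keep accented words intact, although that variant is not tested in this notebook.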



In [18]:
tweetDf = spark.createDataFrame(["Tutti amano le ruspe di salvini"], tp.StringType()).toDF("tweet")
tweetDf.show(truncate=False)

+-------------------------------+
|tweet                          |
+-------------------------------+
|Tutti amano le ruspe di salvini|
+-------------------------------+



In [19]:
pipelineFit.transform(tweetDf).select('tweet','prediction').show()

+--------------------+----------+
|               tweet|prediction|
+--------------------+----------+
|Tutti amano le ru...|       0.0|
+--------------------+----------+



In [None]:
pipelineFit.save("../spark/dataset/model.save")
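
The saved pipeline is what a streaming job would load to score incoming tweets (steps 2 to 4 of the workflow). The following is a minimal sketch of that job, not the notebook's actual streaming code. It uses Structured Streaming rather than the `StreamingContext` imported above, and it assumes the Kafka connector package is available to Spark. The function name, the broker address `localhost:9092`, the topic name `tweets`, and the 5-second trigger are all assumptions.

```python
def start_tweet_predictions(spark,
                            model_path="../spark/dataset/model.save",
                            bootstrap_servers="localhost:9092",  # assumed broker address
                            topic="tweets"):                     # assumed topic name
    """Load the saved pipeline and score tweets arriving on a Kafka topic."""
    # Imported here so that defining the function does not itself require Spark.
    from pyspark.ml import PipelineModel

    model = PipelineModel.load(model_path)

    # Kafka records carry a binary `value` column; cast it to the `tweet`
    # string column that the first pipeline stage expects as input.
    tweets = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", bootstrap_servers)
              .option("subscribe", topic)
              .load()
              .selectExpr("CAST(value AS STRING) AS tweet"))

    predictions = model.transform(tweets).select("tweet", "prediction")

    # Print every micro-batch to the console; the trigger sets the batch interval.
    return (predictions.writeStream
            .format("console")
            .outputMode("append")
            .trigger(processingTime="5 seconds")
            .start())
```

A call such as `query = start_tweet_predictions(spark)` followed by `query.awaitTermination()` would keep the job running until it is stopped.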

In [20]:
# Set the model threshold to maximize F-Measure
fMeasure = modelSummary.fMeasureByThreshold
fMeasure.show()

+------------------+--------------------+
|         threshold|           F-Measure|
+------------------+--------------------+
|0.9041434497991898| 0.01926782273603083|
|0.8723788810130683|0.025911708253358926|
|0.8477174540710809|0.032489249880554225|
|0.8233040251643247| 0.04091341579448145|
|0.8081594480454658| 0.05198487712665407|
|0.7926974581873518| 0.05652378709373528|
|0.7776043563538639| 0.06009389671361502|
|0.7613078168002123| 0.06644829199812821|
|0.7568858364149114| 0.07182835820895524|
|0.7452266995079002| 0.07717340771734078|
|0.7372997095753309| 0.08244557665585919|
| 0.724341403421372| 0.08494921514312095|
|0.7100560309887609| 0.09019788311090657|
|0.7019862099427137| 0.09449541284403669|
|0.6888544933400849| 0.09967992684042067|
|0.6707325625546149| 0.10209662716499544|
|0.6610687822086312| 0.10540663334847797|
|0.6533578485988583| 0.11141304347826088|
|0.6395454679995733|  0.1146726862302483|
| 0.628057391317428| 0.11791179117911792|
+------------------+--------------------+

In [26]:
maxFMeasure = fMeasure.groupBy().max('F-Measure').select('max(F-Measure)')
maxFMeasure.show()

+------------------+
|    max(F-Measure)|
+------------------+
|0.4875286916602907|
+------------------+



In [27]:
bestThreshold=fMeasure.where(fMeasure['F-Measure'] == 0.4875286916602907)
bestThreshold.show()

+------------------+------------------+
|         threshold|         F-Measure|
+------------------+------------------+
|0.2251829398572779|0.4875286916602907|
+------------------+------------------+
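
The two cells above find the maximum F-Measure and then filter on a hard-coded copy of that number. A possible alternative is to pick the best row in one step and read the `threshold` column from it, since that column (and not the F-Measure value) is what `setThreshold` expects. The helper `best_threshold` below is hypothetical. It is shown on plain dictionaries copied from the tables above so that it runs without Spark; rows returned by `fMeasure.collect()` support the same key lookup.

```python
def best_threshold(rows):
    """Return the row with the highest F-Measure."""
    return max(rows, key=lambda row: row["F-Measure"])


# Stand-in rows copied from the notebook output; with Spark one would pass
# fMeasure.collect() instead.
rows = [
    {"threshold": 0.9041434497991898, "F-Measure": 0.01926782273603083},
    {"threshold": 0.628057391317428, "F-Measure": 0.11791179117911792},
    {"threshold": 0.2251829398572779, "F-Measure": 0.4875286916602907},
]

best = best_threshold(rows)
print("threshold:", best["threshold"], "F-Measure:", best["F-Measure"])

# With the notebook's objects this would become:
#   model.setThreshold(best["threshold"])
```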



In [28]:
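# NOTE: 0.4875286916602907 is the maximum F-Measure, not a threshold; the threshold
# that achieves it in the table above is 0.2251829398572779.
model.setThreshold(0.4875286916602907)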

LogisticRegression_3f94ad6121c6

In [29]:
# pipelineFit was fitted before the threshold change, so its summary is unchanged
modelSummary=pipelineFit.stages[-1].summary
modelSummary.accuracy

0.7379217273954116

In [30]:
# refit the pipeline so that the threshold set on `model` is used
pipelineFit = pipeline.fit(training_set)

In [31]:
modelSummary=pipelineFit.stages[-1].summary
modelSummary.accuracy

0.7369770580296896

![](https://i.imgflip.com/40mt0s.jpg)
[NicsMeme](https://imgflip.com/i/40mt0s)

# Another Approach: Naive Bayes

In [32]:
# define stage 3: hash the filtered words into a term-frequency vector with 20 features
hashingTF = HashingTF(inputCol="filtered_words", outputCol="vector", numFeatures=20)
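
`HashingTF` maps each word to a bucket index by hashing it and taking the result modulo `numFeatures`, then counts how many words land in each bucket. The sketch below illustrates the idea in plain Python. Spark uses MurmurHash3 internally; `zlib.crc32` is swapped in here only as a convenient stand-in, so the bucket indices will not match Spark's. The function name `hashing_tf` is hypothetical.

```python
import zlib


def hashing_tf(tokens, num_features=20):
    """Count tokens per hash bucket (the 'hashing trick')."""
    vector = [0] * num_features
    for token in tokens:
        # Stand-in hash; Spark's HashingTF uses MurmurHash3 instead.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


tokens = ["false", "illusioni", "sgradevoli", "realt", "mario", "monti"]
vector = hashing_tf(tokens)

print(vector)
print("buckets used:", sum(1 for count in vector if count), "of", len(vector))
```

With only 20 buckets, many distinct words necessarily share an index, so a larger `numFeatures` is usually worth trying when accuracy matters.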

In [33]:
# define stage 4: Naive Bayes model
modelNaive = NaiveBayes(smoothing=1.0, modelType="multinomial", featuresCol='vector', labelCol='positive')

In [34]:
# setup the pipeline
pipelineNaive = Pipeline(stages= [stage_1, stage_2, hashingTF, modelNaive])

# fit the pipeline model with the training data
pipelineNaiveFit = pipelineNaive.fit(training_set)

In [35]:
pipelineNaiveFit

PipelineModel_7ac5943ef172

In [36]:
# select example rows to display.
predictions = pipelineNaiveFit.transform(training_set)
predictions.show()

+------------------+----------+--------+--------+------+---------+---------+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|                id|subjective|positive|negative|ironic|lpositive|lnegative|top|               tweet|              tokens|      filtered_words|              vector|       rawPrediction|         probability|prediction|
+------------------+----------+--------+--------+------+---------+---------+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|122449983151669248|         1|       0|       1|     0|        0|        1|  1|"Intanto la parti...|[intanto, la, par...|[intanto, la, par...|(20,[1,2,5,7,8,9,...|[-60.853495557968...|[0.71615635721972...|       0.0|
|125485104863780865|         1|       0|       1|     0|        0|        1|  1|False illusioni, ...|[false, illusioni...|[false

In [37]:
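# compute accuracy (the predictions were made on training_set, so this is
# training accuracy despite the "Test set" label printed below)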
evaluator = MulticlassClassificationEvaluator(labelCol="positive", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))

Test set accuracy = 0.722132253711201
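
Because `predictions` was computed on `training_set`, the figure above measures fit on data the model has already seen. A held-out estimate could be obtained along the following lines. The helper `holdout_accuracy`, the 80/20 split, and the seed are assumptions, not part of the original notebook. The function only relies on the `randomSplit`, `fit`, `transform`, and `evaluate` methods that the notebook's objects already provide.

```python
def holdout_accuracy(pipeline, dataset, evaluator, train_fraction=0.8, seed=42):
    """Fit `pipeline` on one part of `dataset` and evaluate it on the rest."""
    # Split once, with a fixed seed so the result is reproducible.
    train, test = dataset.randomSplit([train_fraction, 1.0 - train_fraction], seed=seed)
    fitted = pipeline.fit(train)
    # Score only rows the fitted model has not seen during training.
    return evaluator.evaluate(fitted.transform(test))


# With the notebook's objects this would be called as:
#   holdout_accuracy(pipelineNaive, training_set, evaluator)
```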


In [38]:
spark.stop()

# Biblio

* https://www.analyticsvidhya.com/blog/2019/12/streaming-data-pyspark-machine-learning-model/
* https://www.kdnuggets.com/2018/02/machine-learning-algorithm-2118.html