#  Spark Kafka Sentiment Analysis

![](https://static.wixstatic.com/media/f17a52_84852646da5a4e37837a12cb610b2ad8~mv2.png/v1/fill/w_1000,h_673,al_c,usm_0.66_1.00_0.01/f17a52_84852646da5a4e37837a12cb610b2ad8~mv2.png)
[Source](https://www.dataneb.com/post/analyzing-twitter-texts-spark-streaming-example-2)

<div class="jumbotron">
    <center>
        <b>Sentiment Analysis</b> of streaming twitter data using Flume/Kafka/Spark
    </center>
</div>

![](https://i.imgflip.com/40j9cu.jpg)
[NicsMeme](https://imgflip.com/i/40j9cu)

# Workflow Design

## 1) Model Building

Goal: build a Spark MLlib pipeline that classifies whether a tweet expresses positive sentiment or not.

> The focus is not on building a highly accurate classifier, but on showing how to apply a trained model to streaming data and return its predictions.

## 2) Predict and Return Results

Once a new tweet arrives (via Kafka streaming),
we pass it through the machine learning pipeline built in step 1 and return the sentiment predicted by the model.

# Import Libraries

In [2]:
import findspark
import pyspark
from pyspark import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.streaming import StreamingContext
import pyspark.sql.types as tp
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.feature import StopWordsRemover, Word2Vec, RegexTokenizer
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import Row
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

![](images/cuofano.png)

# init 1

In [3]:
findspark.find()  # returns the Spark home path; findspark.init() would also add pyspark to sys.path
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("TapDataFrame").getOrCreate()
spark

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/05/27 16:24:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


![](http://thejoyofgeek.net/wp-content/uploads/2016/08/robotmask.jpg)
[S2E4](http://thejoyofgeek.net/mr-robot-init_1-review-s2e4/)

# Let's Start!

# Trainset 
***SentiTUT*** 

http://www.di.unito.it/~tutreeb/sentipolc-evalita16/data.html


In [5]:
# idtwitter	subj	opos	oneg	iro	lpos	lneg	top	text

schema = tp.StructType([
    tp.StructField(name= 'id', dataType= tp.StringType(),  nullable= True),
    tp.StructField(name= 'subjective',       dataType= tp.IntegerType(),  nullable= True),
    tp.StructField(name= 'positive',       dataType= tp.IntegerType(),  nullable= True),
    tp.StructField(name= 'negative',       dataType= tp.IntegerType(),  nullable= True),
    tp.StructField(name= 'ironic',       dataType= tp.IntegerType(),  nullable= True),
    tp.StructField(name= 'lpositive',       dataType= tp.IntegerType(),  nullable= True),
    tp.StructField(name= 'lnegative',       dataType= tp.IntegerType(),  nullable= True),
    tp.StructField(name= 'top',       dataType= tp.IntegerType(),  nullable= True),
    tp.StructField(name= 'text',       dataType= tp.StringType(),   nullable= True)
])

In [6]:
# read the dataset  
training_set = spark.read.csv('../spark/dataset/training_set_sentipolc16.csv',
                         schema=schema,
                         header=True,
                         sep=',')
training_set

DataFrame[id: string, subjective: int, positive: int, negative: int, ironic: int, lpositive: int, lnegative: int, top: int, text: string]

In [14]:
training_set.show(truncate=False)

+------------------+----------+--------+--------+------+---------+---------+---+---------------------------------------------------------------------------------------------------------------------------------------------+
|id                |subjective|positive|negative|ironic|lpositive|lnegative|top|text                                                                                                                                         |
+------------------+----------+--------+--------+------+---------+---------+---+---------------------------------------------------------------------------------------------------------------------------------------------+
|122449983151669248|1         |0       |1       |0     |0        |1        |1  |"Intanto la partita per Via Nazionale si complica. #Saccomanni dice che ""mica tutti sono Mario #Monti"" http://t.co/xPtNz4X7 via @linkiesta"|
|125485104863780865|1         |0       |1       |0     |0        |1        |1  |False illusioni, sgradevoli 

In [8]:
#training_set.show(truncate=False)
training_set.groupBy("positive").count().show()

+--------+-----+
|positive|count|
+--------+-----+
|       1| 2051|
|       0| 5359|
+--------+-----+
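
The counts above show a skewed label distribution, so any accuracy figure needs a reference point. The following is a minimal sketch in plain Python (the variable names are illustrative, the counts are copied from the table above) of the accuracy a classifier would reach by always predicting the majority class:

```python
# Class counts copied from the groupBy("positive") output above.
counts = {1: 2051, 0: 5359}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a trivial model that always answers with the majority label.
baseline_accuracy = counts[majority_label] / total

print(f"total tweets      : {total}")
print(f"majority label    : {majority_label}")
print(f"baseline accuracy : {baseline_accuracy:.4f}")
```

Roughly 72% of the tweets are labelled `0`, so accuracies close to that value indicate a model that is barely better than always answering "not positive".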



In [9]:
# define stage 1: tokenize the tweet text    
stage_1 = RegexTokenizer(inputCol= 'text' , outputCol= 'tokens', pattern= '\\W')
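
To make the effect of the `\W` pattern concrete, here is a small pure-Python approximation of this tokenizer stage. It is a sketch under assumptions, not Spark's implementation: the helper `regex_tokenize` is hypothetical, and the `re.ASCII` flag is used to mimic the JVM's default `\W`, which treats accented letters as separators.

```python
import re

def regex_tokenize(text, pattern=r"\W"):
    # Lowercase, split on every match of the pattern, and drop empty tokens,
    # which approximates RegexTokenizer with its default settings.
    # re.ASCII (an assumption of this sketch) makes \W match accented
    # letters such as "à", as Java's default regex flavour does.
    pieces = re.split(pattern, text.lower(), flags=re.ASCII)
    return [piece for piece in pieces if piece]

tweet = "False illusioni, sgradevoli realtà Mario Monti http://t.co/WOmMCITs via @AddToAny"
tokens = regex_tokenize(tweet)
print(tokens)
```

Note that `realtà` is cut to `realt`, and the URL and the mention are split into fragments such as `http`, `t`, `co`. A pattern that keeps Unicode letters together may suit Italian text better, at the cost of changing the features the model sees.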

In [10]:
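# define stage 2: remove the stop words (the default list is English; StopWordsRemover.loadDefaultStopWords('italian') returns an Italian one)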
stage_2 = StopWordsRemover(inputCol= 'tokens', outputCol= 'filtered_words')

22/05/27 16:28:40 WARN StopWordsRemover: Default locale set was [en_IT]; however, it was not found in available locales in JVM, falling back to en_US locale. Set param `locale` in order to respect another locale.


In [11]:
# define stage 3: create a word vector of the size 100
stage_3 = Word2Vec(inputCol= 'filtered_words', outputCol= 'vector', vectorSize= 100)
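
Spark's documentation describes the fitted Word2Vec model as turning each document into the average of its word vectors. The sketch below illustrates that idea in plain Python with hypothetical 3-dimensional embeddings. The helper `document_vector` and its handling of unknown words (they are simply skipped) are assumptions of this example, not Spark's exact behaviour.

```python
# Hypothetical 3-dimensional embeddings; the notebook trains 100-dimensional ones.
toy_embeddings = {
    "monti": [0.2, -0.1, 0.4],
    "mi":    [0.0,  0.3, 0.1],
    "piace": [0.4,  0.1, 0.1],
}

def document_vector(tokens, embeddings, size=3):
    """Average the vectors of the known tokens; return zeros if none is known."""
    known = [embeddings[token] for token in tokens if token in embeddings]
    if not known:
        return [0.0] * size
    # zip(*known) groups the values of each dimension together.
    return [sum(dimension) / len(known) for dimension in zip(*known)]

doc_vec = document_vector(["monti", "mi", "piace"], toy_embeddings)
print(doc_vec)
```

Because the result has a fixed length regardless of tweet length, it can be passed directly as the `featuresCol` of a classifier.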

In [16]:
# define stage 4: Logistic Regression Model
model = LogisticRegression(featuresCol= 'vector', labelCol= 'positive')

![](https://cdn-images-1.medium.com/max/1600/1*DyD3VP18IV3-lXcKMbyr5w.jpeg)

In [17]:
# setup the pipeline
pipeline = Pipeline(stages= [stage_1, stage_2, stage_3, model])
pipeline

Pipeline_2e8a1f061128

In [18]:
# fit the pipeline model with the training data
pipelineFit = pipeline.fit(training_set)

22/05/27 16:31:39 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
22/05/27 16:31:39 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS
                                                                                

In [19]:
modelSummary=pipelineFit.stages[-1].summary
modelSummary 
# https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/classification/LogisticRegressionSummary.html

<pyspark.ml.classification.BinaryLogisticRegressionTrainingSummary at 0x107dc5d30>

In [20]:
modelSummary.accuracy

0.7425101214574898

![](images/accuracy.jpg)
[DeepLearningNewsAndMemes](https://www.facebook.com/DeepLearningNewsAndMemes/)

In [21]:
tweetDf = spark.createDataFrame(["False illusioni, sgradevoli realtà Mario Monti http://t.co/WOmMCITs via @AddToAny"], tp.StringType()).toDF("text")
tweetDf.show(truncate=False)

+---------------------------------------------------------------------------------+
|text                                                                             |
+---------------------------------------------------------------------------------+
|False illusioni, sgradevoli realtà Mario Monti http://t.co/WOmMCITs via @AddToAny|
+---------------------------------------------------------------------------------+



In [24]:
pipelineFit.transform(tweetDf).select('text','tokens','prediction').show(truncate=False)

+---------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------+----------+
|text                                                                             |tokens                                                                                   |prediction|
+---------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------+----------+
|False illusioni, sgradevoli realtà Mario Monti http://t.co/WOmMCITs via @AddToAny|[false, illusioni, sgradevoli, realt, mario, monti, http, t, co, wommcits, via, addtoany]|0.0       |
+---------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------+----------+



In [27]:
tweetDf = spark.createDataFrame(["Amore"], tp.StringType()).toDF("text")
tweetDf.show(truncate=False)

+-----+
|text |
+-----+
|Amore|
+-----+



In [28]:
pipelineFit.transform(tweetDf).select('tokens','prediction').show(truncate=False)

+-------+----------+
|tokens |prediction|
+-------+----------+
|[amore]|1.0       |
+-------+----------+



In [29]:
# plain save() raises java.io.IOException ("Path ... already exists") when the model was saved before,
# so use the writer with overwrite() to replace the existing model
pipelineFit.write().overwrite().save("../spark/dataset/model.save")


In [30]:
# Set the model threshold to maximize F-Measure
fMeasure = modelSummary.fMeasureByThreshold
fMeasure.show()

+------------------+--------------------+
|         threshold|           F-Measure|
+------------------+--------------------+
|0.9620675430380442|0.017357762777242044|
|0.9366308905904482| 0.02399232245681382|
|0.9184655155861272| 0.03060736489717838|
|0.8988745253950465|0.037178265014299335|
|0.8832970125119248| 0.04180522565320665|
|0.8693926532522985| 0.05377358490566037|
|0.8542836706351302| 0.06015037593984963|
|0.8349571094373316| 0.06551240056153486|
|0.8181633931500188| 0.06996268656716417|
|0.8041191197507951| 0.07438400743840075|
|0.7980151682172789| 0.07784986098239109|
|0.7841865750917644| 0.08314087759815243|
|0.7738372934382455|  0.0892774965485504|
|0.7644432556199672| 0.09536909674461257|
|0.7538053529686473| 0.10050251256281408|
|0.7371259427180132| 0.10473588342440801|
|0.7300462850992223| 0.10980036297640654|
| 0.717939256094174| 0.11307100859339665|
|0.7065777120044586| 0.11541929666366095|
|0.6882472680183942| 0.12044943820224718|
+------------------+--------------------+



In [31]:
maxFMeasure = fMeasure.groupBy().max('F-Measure').select('max(F-Measure)')
maxFMeasure.show()

+------------------+
|    max(F-Measure)|
+------------------+
|0.4953159598024187|
+------------------+



In [33]:
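# read the maximum from the previous cell instead of hard-coding the float
bestF = maxFMeasure.head()['max(F-Measure)']
bestThreshold = fMeasure.where(fMeasure['F-Measure'] == bestF)
bestThreshold.show()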

+------------------+------------------+
|         threshold|         F-Measure|
+------------------+------------------+
|0.2457349436145636|0.4953159598024187|
+------------------+------------------+
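
The selection above picks the threshold with the highest F-Measure. As a hedged illustration of what that choice implies, the sketch below recomputes F1 and accuracy per threshold in plain Python. The scores and labels are hypothetical (not taken from the dataset), and `metrics_at` is an illustrative helper.

```python
# Hypothetical (probability of positive, true label) pairs, skewed towards label 0.
scored = [
    (0.60, 1), (0.40, 0), (0.28, 0), (0.27, 0), (0.26, 1),
    (0.25, 1), (0.20, 0), (0.10, 0), (0.10, 0), (0.05, 0),
]

def metrics_at(threshold, scored):
    """Return (f1, accuracy) when predicting positive for probability >= threshold."""
    tp = sum(1 for p, y in scored if p >= threshold and y == 1)
    fp = sum(1 for p, y in scored if p >= threshold and y == 0)
    fn = sum(1 for p, y in scored if p < threshold and y == 1)
    tn = sum(1 for p, y in scored if p < threshold and y == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f1, (tp + tn) / len(scored)

# Try every observed score as a candidate threshold and keep the one with the best F1.
candidates = sorted({p for p, _ in scored}, reverse=True)
best_threshold = max(candidates, key=lambda t: metrics_at(t, scored)[0])

print("best threshold:", best_threshold, metrics_at(best_threshold, scored))
print("default 0.5   :", metrics_at(0.5, scored))
```

In this toy data the F1-optimal threshold (0.25) finds every positive but also accepts more false positives, so accuracy falls from 0.8 to 0.7 compared with the default 0.5. On imbalanced labels, maximizing F-Measure and maximizing accuracy can therefore pull in different directions.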



In [34]:
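# setThreshold changes the estimator only; the fitted pipelineFit keeps the old threshold until the pipeline is refit
model.setThreshold(0.2457349436145636)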

LogisticRegression_73db463d2bde

In [35]:
# pipelineFit was fitted before setThreshold was called, so its training summary is unchanged
modelSummary=pipelineFit.stages[-1].summary
modelSummary.accuracy

0.7425101214574898

In [36]:
# refit the pipeline so that the new threshold takes effect
pipelineFit = pipeline.fit(training_set)

[Stage 173:>                                                        (0 + 1) / 1]                                                                                

In [37]:
modelSummary=pipelineFit.stages[-1].summary
modelSummary.accuracy

0.6002699055330635

![](https://i.imgflip.com/40mt0s.jpg)
[NicsMeme](https://imgflip.com/i/40mt0s)

# Another Approach: Naive Bayes

In [38]:
# define stage 3: hash the filtered words into a term-frequency vector with 20 buckets
hashingTF = HashingTF(inputCol="filtered_words", outputCol="vector", numFeatures=20)
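
The hashing trick behind this stage can be sketched in a few lines of plain Python. This is an illustration only: `hashing_tf` is a hypothetical helper, and `zlib.crc32` is a stand-in chosen because it is deterministic across runs, whereas Spark's `HashingTF` uses MurmurHash3, so the bucket indices will differ.

```python
import zlib

def hashing_tf(tokens, num_features=20):
    """Map tokens to a fixed-length count vector via hash(token) % num_features."""
    vector = [0] * num_features
    for token in tokens:
        # Stand-in hash function (assumption); Spark uses MurmurHash3 instead.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector

tokens = ["intanto", "la", "partita", "per", "via", "nazionale", "si", "complica"]
vec = hashing_tf(tokens)
print(vec)
print("non-empty buckets:", sum(1 for count in vec if count), "of", len(vec))
```

With only 20 buckets, many different words necessarily share an index, which limits how much the classifier can learn from individual terms. A larger `numFeatures` reduces such collisions.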

In [39]:
# define stage 4: Naive Bayes model
modelNaive =  NaiveBayes(smoothing=1.0, modelType="multinomial",featuresCol= 'vector', labelCol= 'positive')

In [40]:
# setup the pipeline
pipelineNaive = Pipeline(stages= [stage_1, stage_2, hashingTF, modelNaive])

# fit the pipeline model with the training data
pipelineNaiveFit = pipelineNaive.fit(training_set)

In [41]:
pipelineNaiveFit

PipelineModel_aafc48afeb81

In [42]:
# select example rows to display.
predictions = pipelineNaiveFit.transform(training_set)
predictions.show()

+------------------+----------+--------+--------+------+---------+---------+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|                id|subjective|positive|negative|ironic|lpositive|lnegative|top|                text|              tokens|      filtered_words|              vector|       rawPrediction|         probability|prediction|
+------------------+----------+--------+--------+------+---------+---------+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|122449983151669248|         1|       0|       1|     0|        0|        1|  1|"Intanto la parti...|[intanto, la, par...|[intanto, la, par...|(20,[1,2,5,7,8,9,...|[-60.853495557968...|[0.71615635721973...|       0.0|
|125485104863780865|         1|       0|       1|     0|        0|        1|  1|False illusioni, ...|[false, illusioni...|[false

In [43]:
# compute accuracy on the training set (the pipeline was fitted on these same rows)
evaluator = MulticlassClassificationEvaluator(labelCol="positive", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Training set accuracy = " + str(accuracy))

Training set accuracy = 0.722132253711201
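
The accuracy above is computed on the same rows the pipeline was fitted on, which tends to overstate how well a model generalizes. The sketch below shows the difference between training and held-out accuracy with a deliberately overfitted toy "model" in plain Python. The tweets, the split, and the `predict` function are all hypothetical.

```python
# Hypothetical labelled tweets (text, positive); not rows from the training file.
labelled = [
    ("che bella giornata", 1),
    ("brutte cose", 0),
    ("mi piace molto", 1),
    ("pessimo servizio", 0),
    ("giornata pessima", 0),
    ("bella partita", 1),
]

# Keep the last two rows aside; the "model" never sees them during training.
train, held_out = labelled[:4], labelled[4:]

# An overfitted model: it memorizes the training texts and falls back to the
# majority training label (0 on ties) for any unseen text.
memory = dict(train)
fallback = int(sum(label for _, label in train) * 2 > len(train))

def predict(text):
    return memory.get(text, fallback)

def accuracy_on(rows):
    return sum(1 for text, label in rows if predict(text) == label) / len(rows)

train_accuracy = accuracy_on(train)
held_out_accuracy = accuracy_on(held_out)
print("training accuracy:", train_accuracy)
print("held-out accuracy:", held_out_accuracy)
```

In Spark, a comparable split could be obtained with `DataFrame.randomSplit`, fitting the pipeline on one part and evaluating on the other.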


# Yet Another One: TF-IDF with Logistic Regression
https://towardsdatascience.com/sentiment-analysis-with-pyspark-bc8e83f80c35


In [44]:
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashtf = HashingTF(numFeatures=2**16, inputCol="words", outputCol='tf')
idf = IDF(inputCol='tf', outputCol="features", minDocFreq=5) #minDocFreq: remove sparse terms
model = LogisticRegression(featuresCol= 'features', labelCol= 'positive',maxIter=100)
pipeline = Pipeline(stages=[tokenizer, hashtf, idf, model])
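
The IDF stage rescales the hashed term counts so that terms appearing in many tweets weigh less. The sketch below uses the smoothed formula given in the MLlib feature documentation, `log((N + 1) / (df + 1))`, and gives weight 0 to terms seen in fewer than `min_doc_freq` documents. It works on raw tokens rather than hashed indices, and `idf_weights` plus the three toy documents are assumptions made for illustration.

```python
import math

def idf_weights(documents, min_doc_freq=0):
    """Smoothed inverse document frequency per term."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        # set() ensures a term is counted once per document.
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {
        term: math.log((n_docs + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for term, df in doc_freq.items()
    }

docs = [["monti", "mi", "piace"], ["monti", "non", "piace"], ["brutte", "cose"]]
weights = idf_weights(docs, min_doc_freq=2)
for term in sorted(weights):
    print(f"{term:7s} {weights[term]:.4f}")
```

With `min_doc_freq=2`, terms that occur in a single document (`mi`, `non`, `brutte`, `cose`) are zeroed out, which is the role `minDocFreq=5` plays in the cell above on the full dataset.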


In [45]:
# fit the pipeline model with the training data
pipelineFit = pipeline.fit(training_set)

22/05/27 16:42:11 WARN DAGScheduler: Broadcasting large task binary with size 1073.9 KiB
22/05/27 16:42:11 WARN DAGScheduler: Broadcasting large task binary with size 1075.5 KiB
22/05/27 16:42:11 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
22/05/27 16:42:11 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS


In [46]:
modelSummary=pipelineFit.stages[-1].summary
modelSummary.accuracy

22/05/27 16:42:23 WARN DAGScheduler: Broadcasting large task binary with size 1116.0 KiB


0.9257759784075573

In [47]:
ss = tp.StructType([tp.StructField("text",tp.StringType(),True)])


In [48]:
tweetDf = spark.createDataFrame(["Mario Monti sul Corriere: la fotografia più illuminante sulla delicata situazione attuale http://t.co/YbuNZMOJ"], tp.StringType()).toDF("text")
tweetDf.select("text").show(truncate=False)
tw2=pipelineFit.transform(tweetDf)
tw2.select("prediction").show()

+--------------------------------------------------------------------------------------------------------------+
|text                                                                                                          |
+--------------------------------------------------------------------------------------------------------------+
|Mario Monti sul Corriere: la fotografia più illuminante sulla delicata situazione attuale http://t.co/YbuNZMOJ|
+--------------------------------------------------------------------------------------------------------------+

+----------+
|prediction|
+----------+
|       1.0|
+----------+



22/05/27 16:43:00 WARN DAGScheduler: Broadcasting large task binary with size 1106.8 KiB


In [49]:
tweetDf = spark.createDataFrame(["Le 5 sgradevoli realtà di cui Berlusconi dovrebbe rendersi personalmente conto http://t.co/G3u1iF9n Mario Monti non usa mezzi termini"], tp.StringType()).toDF("text")
tweetDf.select("text").show(truncate=False)
tw2=pipelineFit.transform(tweetDf)
tw2.select("prediction").show()

+-------------------------------------------------------------------------------------------------------------------------------------+
|text                                                                                                                                 |
+-------------------------------------------------------------------------------------------------------------------------------------+
|Le 5 sgradevoli realtà di cui Berlusconi dovrebbe rendersi personalmente conto http://t.co/G3u1iF9n Mario Monti non usa mezzi termini|
+-------------------------------------------------------------------------------------------------------------------------------------+

+----------+
|prediction|
+----------+
|       0.0|
+----------+



22/05/27 16:43:50 WARN DAGScheduler: Broadcasting large task binary with size 1106.8 KiB


In [50]:
tweetDf = spark.createDataFrame(["Monti mi piace"], tp.StringType()).toDF("text")
tweetDf.select("text").show(truncate=False)
tw2=pipelineFit.transform(tweetDf)
tw2.select("prediction").show()

+--------------+
|text          |
+--------------+
|Monti mi piace|
+--------------+

+----------+
|prediction|
+----------+
|       1.0|
+----------+



22/05/27 16:43:57 WARN DAGScheduler: Broadcasting large task binary with size 1106.8 KiB


In [51]:
predictions = pipelineFit.transform(training_set)
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# the default metric is areaUnderROC; it is computed here on the training data
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",labelCol="positive")
evaluator.evaluate(predictions)

22/05/27 16:44:04 WARN DAGScheduler: Broadcasting large task binary with size 1116.1 KiB


0.971300461118872
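
The value above is the area under the ROC curve, the default metric of `BinaryClassificationEvaluator`. One way to read it is as the probability that a randomly chosen positive tweet receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python on hypothetical scores. `auc_by_ranking` is an illustrative helper, not how Spark computes the metric internally.

```python
def auc_by_ranking(scored):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    positives = [score for score, label in scored if label == 1]
    negatives = [score for score, label in scored if label == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical (score, true label) pairs.
scored = [(0.9, 1), (0.8, 0), (0.7, 1), (0.4, 0), (0.3, 1), (0.1, 0)]
auc = auc_by_ranking(scored)
print(f"AUC = {auc:.4f}")
```

Unlike accuracy, this measure does not depend on a decision threshold, so it is unaffected by `setThreshold`.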

In [63]:
tweetDf = spark.createDataFrame(["brutte cose"], tp.StringType()).toDF("text")
tweetDf.select("text").show(truncate=False)
tw2=pipelineFit.transform(tweetDf)
tw2.select("prediction").show()

+-----------+
|text       |
+-----------+
|brutte cose|
+-----------+

+----------+
|prediction|
+----------+
|       0.0|
+----------+



22/05/27 17:45:25 WARN DAGScheduler: Broadcasting large task binary with size 1106.8 KiB


In [78]:
# manual check of the training accuracy: fraction of rows where the prediction equals the label
accuracy = predictions.filter(predictions.positive == predictions.prediction).count() / float(training_set.count())
accuracy

22/05/22 09:26:14 WARN DAGScheduler: Broadcasting large task binary with size 1110.9 KiB


0.9257759784075573

In [64]:
spark.stop()

# Bibliography

* https://www.analyticsvidhya.com/blog/2019/12/streaming-data-pyspark-machine-learning-model/
* https://www.kdnuggets.com/2018/02/machine-learning-algorithm-2118.html
* https://towardsdatascience.com/sentiment-analysis-using-logistic-regression-and-naive-bayes-16b806eb4c4b
* http://www.di.unito.it/~tutreeb/sentipolc-evalita16/data.html
* http://www.di.unito.it/~tutreeb/ironita-evalita18/data.html
* https://towardsdatascience.com/sentiment-analysis-and-emotion-recognition-in-italian-using-bert-92f5c8fe8a2
* https://github.com/charlesmalafosse/open-dataset-for-sentiment-analysis
* https://iris.unito.it/retrieve/handle/2318/146318/175020/21_Paper.pdf
* https://aperto.unito.it/retrieve/handle/2318/1698302/496918/Sentiment%20analysis%20on%20Italian%20tweets.pdf
* https://iopscience.iop.org/article/10.1088/1742-6596/1000/1/012130/pdf
* https://towardsdatascience.com/sentiment-analysis-with-pyspark-bc8e83f80c35
* https://dzone.com/articles/streaming-machine-learning-pipeline-for-sentiment
* https://databricks.com/wp-content/uploads/2015/10/STEP-3-Sentiment_Analysis.html
* https://github.com/P7h/Spark-MLlib-Twitter-Sentiment-Analysis/blob/master/src/main/scala/org/p7h/spark/sentiment/mllib/MLlibSentimentAnalyzer.scala
* https://www.researchgate.net/publication/315913579_An_Apache_Spark_Implementation_for_Sentiment_Analysis_on_Twitter_Data
* https://medium.com/analytics-vidhya/congressional-tweets-using-sentiment-analysis-to-cluster-members-of-congress-in-pyspark-10afa4d1556e
* https://developer.hpe.com/blog/streaming-ml-pipeline-for-sentiment-analysis-using-apache-apis-kafka-spark-and-drill-part-2/
* https://dataespresso.com/en/2017/10/24/comparison-between-naive-bayes-and-logistic-regression/
