
## Import modules and create the Spark session

In [None]:
%pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.1.tar.gz (281.4 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.1-py2.py3-none-any.whl size=281845512 sha256=fa3dae398206dca4a5ad147e1d044273dd8e25fa9203d498119817eed2451122
  Stored in directory: /root/.cache/pip/wheels/43/dc/11/ec201cd671da62fa9c5cc77078235e40722170ceba231d7598
Successfully built pyspark
Installing collected packages: py4j, pyspa

In [None]:
# Import modules
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.ml.classification import LogisticRegression
# IDF is needed later for the TF-IDF step
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, StopWordsRemover

# Create the Spark session in local mode
appName = "Sentiment Analysis in Spark"
spark = SparkSession.builder \
    .master("local") \
    .appName(appName) \
    .getOrCreate()

## Read the data file into a Spark DataFrame

In [None]:
# Read the CSV file into a DataFrame with an automatically inferred schema
tweets_csv = spark.read.csv('/content/Restaurant_Reviews.csv', inferSchema=True, header=True)
tweets_csv.show(truncate=False, n=7)


+---------------------------------------------------------------------------------------+-----+
| SentimentText                                                                         |Label|
+---------------------------------------------------------------------------------------+-----+
|Wow... Loved this place.                                                               |1    |
|Crust is not good.                                                                     |0    |
|Not tasty and the texture was just nasty.                                              |0    |
|Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.|1    |
|The selection on the menu was great and so were the prices.                            |1    |
|Now I am getting angry and I want my damn pho.                                         |0    |
|Honeslty it didn't taste THAT fresh.)                                                  |0    |
+---------------------------------------

## Select the relevant data

In [None]:
# Select only the " SentimentText" and "Label" columns (the text column
# name has a leading space in the CSV header), and cast "Label" to integer
data = tweets_csv.select(
    col(" SentimentText").alias("SentimentText"),
    col("Label").cast("int").alias("label"))
data.show(truncate=False, n=5)

+---------------------------------------------------------------------------------------+-----+
|SentimentText                                                                          |label|
+---------------------------------------------------------------------------------------+-----+
|Wow... Loved this place.                                                               |1    |
|Crust is not good.                                                                     |0    |
|Not tasty and the texture was just nasty.                                              |0    |
|Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.|1    |
|The selection on the menu was great and so were the prices.                            |1    |
+---------------------------------------------------------------------------------------+-----+
only showing top 5 rows



In [None]:
import pyspark.sql.functions as F

# Strip sentence punctuation (".", "!", "?") from the review text
df = data.select(
    F.translate(F.col("SentimentText"), ".!?", "").alias("SentimentText"),
    "label")
df.show()

+--------------------+-----+
|       SentimentText|label|
+--------------------+-----+
|Wow Loved this place|    1|
|   Crust is not good|    0|
|Not tasty and the...|    0|
|Stopped by during...|    1|
|The selection on ...|    1|
|Now I am getting ...|    0|
|Honeslty it didn'...|    0|
|The potatoes were...|    0|
|The fries were gr...|    1|
|       A great touch|    1|
|Service was very ...|    1|
|   Would not go back|    0|
|The cashier had n...|    0|
|I tried the Cape ...|    1|
|I was disgusted b...|    0|
|I was shocked bec...|    0|
|  Highly recommended|    1|
|Waitress was a li...|    0|
|This place is not...|    0|
| did not like at all|    0|
+--------------------+-----+
only showing top 20 rows
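
`F.translate` maps characters one-for-one, and characters in the matching string that have no counterpart in the replacement string are dropped. As a rough plain-Python analogue, shown only to illustrate the behaviour (the helper name `strip_punctuation` is hypothetical and not part of PySpark):

```python
# Plain-Python analogue of F.translate(col, ".!?", ""): delete the listed characters.
# strip_punctuation is a hypothetical helper used only for illustration.
DELETE_TABLE = str.maketrans("", "", ".!?")

def strip_punctuation(text):
    """Remove every '.', '!' and '?' from text, leaving other characters alone."""
    return text.translate(DELETE_TABLE)

print(strip_punctuation("Wow... Loved this place."))  # Wow Loved this place
print(strip_punctuation("Crust is not good."))        # Crust is not good
```

Commas and quotation marks are not in the deleted set, which is why tokens such as `place,` still show up in the tokenizer output later on.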



## Divide data into training and testing data



In [None]:
# Divide the data: 75% for training, 25% for testing
trainingData, testingData = df.randomSplit([0.75, 0.25])
train_rows = trainingData.count()
test_rows = testingData.count()
print("Training data rows:", train_rows, "; Testing data rows:", test_rows)

Training data rows: 730 ; Testing data rows: 269
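
The counts are not an exact 75/25 cut because `randomSplit` assigns rows by random sampling rather than slicing at a fixed index. A minimal plain-Python sketch of that idea follows. It is not Spark's actual implementation; the function name `random_split` and the seed value are assumptions made for illustration.

```python
import random

def random_split(rows, train_fraction=0.75, seed=42):
    """Send each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in the training set with probability train_fraction.
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(999))  # stand-in for the 999 reviews counted above
train, test = random_split(rows)
print(len(train), len(test))  # near 749 / 250, but usually not exact
```

`randomSplit` itself also accepts an optional `seed` argument, which makes the split repeatable between runs.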


## Prepare training data

Split "SentimentText" into individual words using `Tokenizer`.

In [None]:
tokenizer = Tokenizer(inputCol="SentimentText", outputCol="SentimentWords")
tokenizedTrain = tokenizer.transform(trainingData)
tokenizedTrain.show(truncate=False, n=3)

+------------------------------------------------------------------------------------+-----+------------------------------------------------------------------------------------------------------+
|SentimentText                                                                       |label|SentimentWords                                                                                        |
+------------------------------------------------------------------------------------+-----+------------------------------------------------------------------------------------------------------+
|"I don't know what the big deal is about this place, but I won't be back ""ya'all"""|0    |["i, don't, know, what, the, big, deal, is, about, this, place,, but, i, won't, be, back, ""ya'all"""]|
|"It was extremely ""crumby"" and pretty tasteless"                                  |0    |["it, was, extremely, ""crumby"", and, pretty, tasteless"]                                            |
|"Service is quick a
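
`Tokenizer` lowercases the text and splits it on whitespace, so punctuation stays attached to the neighbouring word. A rough plain-Python approximation is sketched below; `simple_tokenize` is a hypothetical name, and Spark's handling of repeated whitespace may differ from `str.split()`.

```python
def simple_tokenize(text):
    """Lowercase the text and split it on runs of whitespace."""
    return text.lower().split()

print(simple_tokenize("Crust is not good"))
# ['crust', 'is', 'not', 'good']
print(simple_tokenize("about this place, but I won't be back"))
# ['about', 'this', 'place,', 'but', 'i', "won't", 'be', 'back']
```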

## Remove stop words (words too common to be useful as features)

In [None]:
swr = StopWordsRemover(inputCol=tokenizer.getOutputCol(), 
                       outputCol="MeaningfulWords")
SwRemovedTrain = swr.transform(tokenizedTrain).na.drop(how="any")
SwRemovedTrain.show(truncate=False, n=5)

+------------------------------------------------------------------------------------+-----+------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
|SentimentText                                                                       |label|SentimentWords                                                                                        |MeaningfulWords                                                          |
+------------------------------------------------------------------------------------+-----+------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
|"I don't know what the big deal is about this place, but I won't be back ""ya'all"""|0    |["i, don't, know, what, the, big, deal, is, about, this, place,, but, i, won't, be, back, ""ya'all
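
Stop-word removal is a case-insensitive membership test against a word list. The sketch below uses a tiny hand-picked list as an assumption for illustration; Spark's default English list is much longer, and `remove_stop_words` is a hypothetical helper.

```python
# Tiny illustrative stop-word list (an assumption, not Spark's default list).
STOP_WORDS = {"i", "don't", "what", "the", "is", "about", "this", "but", "be"}

def remove_stop_words(tokens):
    """Keep only tokens whose lowercase form is not in STOP_WORDS."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = ['"i', "don't", "know", "what", "the", "big", "deal"]
print(remove_stop_words(tokens))
# ['"i', 'know', 'big', 'deal']
```

The token `"i` survives because the leading quotation mark makes it differ from the stop word `i`, which mirrors what the notebook output shows.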

Convert the word features into numerical features. In Spark 2.2.1, this is implemented in the `HashingTF` transformer, which uses Austin Appleby's MurmurHash 3 algorithm to map each word to a vector index.
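
The hashing trick behind this step can be sketched in plain Python. This is an illustration under stated assumptions, not Spark's code: `zlib.crc32` stands in for MurmurHash 3, so the indices will not match the ones Spark prints, and `hashing_tf` is a hypothetical name. The vector size 262144 is the one visible in the notebook output.

```python
import zlib

def hashing_tf(tokens, num_features=262144):
    """Map tokens to a sparse {index: count} dict via the hashing trick."""
    counts = {}
    for token in tokens:
        # crc32 is a stand-in hash; Spark uses MurmurHash 3 instead.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0.0) + 1.0
    return counts

vec = hashing_tf(["great", "food", "great", "service"])
print(vec)  # three indices at most; the one for "great" has count 2.0
```

No vocabulary has to be built or stored, at the cost of occasional collisions where two different words share an index.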

In [None]:
hashTF = HashingTF(inputCol=swr.getOutputCol(), outputCol="features")
numericTrainData = hashTF.transform(SwRemovedTrain).select(
    'label', 'MeaningfulWords', 'features')
numericTrainData.show(truncate=True, n=5)

# Create an IDF estimator
idf = IDF(inputCol="features", outputCol="idf_features")

# Compute the inverse document frequencies (IDF) on the training data
idfModel = idf.fit(numericTrainData)
tfidf = idfModel.transform(numericTrainData)
tfidf = tfidf.na.drop(how="any")


+-----+--------------------+--------------------+
|label|     MeaningfulWords|            features|
+-----+--------------------+--------------------+
|    0|["i, know, big, d...|(262144,[82453,13...|
|    0|["it, extremely, ...|(262144,[23071,75...|
|    1|["service, quick,...|(262144,[19030,10...|
|    1|["that, screams, ...|(262144,[23071,29...|
|    0|["the, burger, go...|(262144,[20298,79...|
+-----+--------------------+--------------------+
only showing top 5 rows
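
The IDF step rescales each term count by how rare the term is across documents. The sketch below uses the smoothed formula given in the Spark MLlib documentation, idf = log((m + 1) / (df + 1)), where m is the number of documents and df the number of documents containing the term. The helper names and the toy documents are assumptions for illustration.

```python
import math

def idf_weights(docs):
    """docs: list of sparse {index: term_count} dicts. Returns {index: idf}."""
    m = len(docs)
    doc_freq = {}
    for doc in docs:
        for index in doc:
            doc_freq[index] = doc_freq.get(index, 0) + 1
    # Smoothed IDF: log((m + 1) / (df + 1))
    return {i: math.log((m + 1) / (df + 1)) for i, df in doc_freq.items()}

def tf_idf(doc, idf):
    """Scale each term count in doc by its IDF weight."""
    return {i: tf * idf[i] for i, tf in doc.items()}

docs = [{0: 1.0, 1: 2.0}, {0: 1.0, 2: 1.0}, {0: 1.0}]  # toy hashed documents
idf = idf_weights(docs)
print(idf)
print(tf_idf(docs[0], idf))
```

Index 0 appears in every toy document, so its weight is log(4/4) = 0 and it contributes nothing after scaling, while the rarer indices keep a positive weight.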



## Train our classifier model using training data

In [None]:
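# The classifier reads the "features" column (raw hashed term counts);
# the "idf_features" column computed earlier is not used by this model.
lr = LogisticRegression(labelCol="label", featuresCol="features",
                        maxIter=10, regParam=0.01)
model = lr.fit(tfidf)
print("Training is done!")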

Training is done!
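
Once trained, a binary logistic regression model scores a review by passing a weighted sum of its features through the sigmoid function and comparing the result with a threshold. A minimal sketch, assuming a threshold of 0.5 and entirely hypothetical weights (these are not values learned by the model above):

```python
import math

def predict(features, weights, intercept=0.0, threshold=0.5):
    """features and weights are sparse {index: value} dicts.

    Returns (probability of the positive class, predicted label).
    """
    z = intercept + sum(v * weights.get(i, 0.0) for i, v in features.items())
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability, (1.0 if probability > threshold else 0.0)

weights = {10: 1.5, 20: -2.0}  # hypothetical weights for two hashed words
print(predict({10: 1.0}, weights))           # positive weight -> label 1.0
print(predict({20: 1.0, 30: 1.0}, weights))  # negative weight -> label 0.0
```

Index 30 has no weight in the sketch, so it adds nothing to the score, which is what happens to a hashed word that carries no learned signal.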


## Prepare testing data

In [None]:
tokenizedTest = tokenizer.transform(testingData)
SwRemovedTest = swr.transform(tokenizedTest)
numericTest = hashTF.transform(SwRemovedTest).select(
    'Label', 'MeaningfulWords', 'features').na.drop(how="any")
numericTest.show(truncate=False, n=2)

# Apply the IDF model fitted on the training data; refitting IDF on the
# test set would leak test-set statistics into the features
tfidf = idfModel.transform(numericTest)
tfidf = tfidf.na.drop(how="any")


+-----+---------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+
|Label|MeaningfulWords                                                                  |features                                                                                                                            |
+-----+---------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------+
|0    |["the, servers, went, back, forth, several, times,, even, much, ""are, helped"""]|(262144,[76764,108160,129074,132270,139371,146139,156484,174639,174966,216238,258219],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
|0    |["the, food, tasty, all,, say, ""real, traditional, hunan, style"""]             |(262144,[45585,6850

## Predict on the testing data and calculate the model accuracy

In [None]:
prediction = model.transform(tfidf)
predictionFinal = prediction.select(
    "MeaningfulWords", "prediction", "Label")
predictionFinal.show(n=4, truncate = False)
correctPrediction = predictionFinal.filter(
    predictionFinal['prediction'] == predictionFinal['Label']).count()
totalData = predictionFinal.count()
print("correct prediction:", correctPrediction, ", total data:", totalData, 
      ", accuracy:", correctPrediction/totalData)

+---------------------------------------------------------------------------------+----------+-----+
|MeaningfulWords                                                                  |prediction|Label|
+---------------------------------------------------------------------------------+----------+-----+
|["the, servers, went, back, forth, several, times,, even, much, ""are, helped"""]|0.0       |0    |
|["the, food, tasty, all,, say, ""real, traditional, hunan, style"""]             |1.0       |0    |
|[#name]                                                                          |0.0       |1    |
|[#name]                                                                          |0.0       |1    |
+---------------------------------------------------------------------------------+----------+-----+
only showing top 4 rows

correct prediction: 190 , total data: 269 , accuracy: 0.7063197026022305
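
Accuracy alone hides which kind of mistake the model makes. As a sketch, the same (prediction, label) pairs can be tallied into confusion counts in plain Python. The helper names are hypothetical, and the four pairs are only the rows displayed in the output above, not the full test set.

```python
def confusion_counts(pairs):
    """pairs: iterable of (prediction, label), each 0 or 1."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for prediction, label in pairs:
        if prediction == 1:
            counts["tp" if label == 1 else "fp"] += 1
        else:
            counts["tn" if label == 0 else "fn"] += 1
    return counts

def accuracy(counts):
    """Fraction of pairs where prediction and label agree."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0

# (prediction, Label) for the four rows displayed above
pairs = [(0, 0), (1, 0), (0, 1), (0, 1)]
counts = confusion_counts(pairs)
print(counts)            # {'tp': 0, 'tn': 1, 'fp': 1, 'fn': 2}
print(accuracy(counts))  # 0.25
```

Both `[#name]` rows are positive reviews predicted as negative, which suggests those rows lost their text before reaching the model.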
