<a href="https://colab.research.google.com/github/soufbaherda/Admin/blob/master/Sentiment_Analysis_in_Spark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import modules and create spark session

In [1]:
%pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.1.tar.gz (281.4 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.1-py2.py3-none-any.whl size=281845512 sha256=e84f60824def36e6e0b411a1377937a083432a0ce25fb9ef8458ddfd1bd001fd
  Stored in directory: /root/.cache/pip/wheels/43/dc/11/ec201cd671da62fa9c5cc77078235e40722170ceba231d7598
Successfully built pyspark
Installing collected packages: py4j, pyspa

In [2]:
# Import modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer, StopWordsRemover, IDF

# Create the Spark session, running locally
appName = "Sentiment Analysis in Spark"
spark = SparkSession.builder \
    .master("local") \
    .appName(appName) \
    .getOrCreate()

## Read data file into Spark dataFrame

In [4]:
#read csv file into dataFrame with automatically inferred schema
tweets_csv = spark.read.csv('/content/Restaurant_Reviews.csv', inferSchema=True, header=True)
tweets_csv.show(truncate=False, n=7)


+---------------------------------------------------------------------------------------+-----+
| SentimentText                                                                         |Label|
+---------------------------------------------------------------------------------------+-----+
|Wow... Loved this place.                                                               |1    |
|Crust is not good.                                                                     |0    |
|Not tasty and the texture was just nasty.                                              |0    |
|Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.|1    |
|The selection on the menu was great and so were the prices.                            |1    |
|Now I am getting angry and I want my damn pho.                                         |0    |
|Honeslty it didn't taste THAT fresh.)                                                  |0    |
+---------------------------------------

## Select the related data

In this part, we select only the text column and the label column that we want to predict.
We also rename the columns so that the same code can easily be reused on any dataset with unified column names.


In [5]:
# Select only the " SentimentText" and "Label" columns (the CSV header has a leading space),
# rename them, and cast the label to integer
data = tweets_csv.select(col(" SentimentText").alias("SentimentText"), col("Label").cast("int").alias("label"))
data.show(truncate = False,n=5)

+---------------------------------------------------------------------------------------+-----+
|SentimentText                                                                          |label|
+---------------------------------------------------------------------------------------+-----+
|Wow... Loved this place.                                                               |1    |
|Crust is not good.                                                                     |0    |
|Not tasty and the texture was just nasty.                                              |0    |
|Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.|1    |
|The selection on the menu was great and so were the prices.                            |1    |
+---------------------------------------------------------------------------------------+-----+
only showing top 5 rows
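
The leading space in `" SentimentText"` comes from the CSV header itself, which is why the select above has to spell it out. The sketch below reproduces the effect with Python's standard `csv` module instead of Spark; the two-row `sample` string is an assumption built from rows shown in the output above, not the real file.

```python
import csv
import io

# Assumed excerpt of the file: the header keeps the space before "SentimentText".
sample = " SentimentText,Label\nWow... Loved this place.,1\nCrust is not good.,0\n"

reader = csv.DictReader(io.StringIO(sample))
raw_names = reader.fieldnames                     # header exactly as stored
clean_names = [name.strip() for name in raw_names]  # unified column names

# Re-key every row with the stripped names.
rows = [
    {name.strip(): value for name, value in row.items()}
    for row in reader
]

print(raw_names)    # the first name still starts with a space
print(clean_names)
print(rows[0]["SentimentText"])
```

Stripping the names once, right after loading, keeps the rest of the code independent of how a particular file spells its header.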



Light data cleaning

In [36]:
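import pyspark.sql.functions as F
# F.translate deletes each character of the matching string individually, so
# ".!?#NAME?" removes ".", "!", "?" and "#" but also the uppercase letters N, A, M and E
# (visible in the output: "Not tasty" becomes "ot tasty").
df = data.select(F.translate(F.col("SentimentText"), ".!?#NAME?", "").alias("SentimentText"), "label").na.drop(how="any")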
df.show()

+--------------------+-----+
|       SentimentText|label|
+--------------------+-----+
|Wow Loved this place|    1|
|   Crust is not good|    0|
|ot tasty and the ...|    0|
|Stopped by during...|    1|
|The selection on ...|    1|
|ow I am getting a...|    0|
|Honeslty it didn'...|    0|
|The potatoes were...|    0|
|The fries were gr...|    1|
|         great touch|    1|
|Service was very ...|    1|
|   Would not go back|    0|
|The cashier had n...|    0|
|I tried the Cape ...|    1|
|I was disgusted b...|    0|
|I was shocked bec...|    0|
|  Highly recommended|    1|
|Waitress was a li...|    0|
|This place is not...|    0|
| did not like at all|    0|
+--------------------+-----+
only showing top 20 rows
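
The missing first letters in the output ("ot tasty", "ow I am") follow from how character translation works: every character in the matching string is treated on its own, not as a sequence. The pure-Python sketch below mimics that behavior with `str.translate`; the helper name `delete_chars` is hypothetical. Restricting the set to `".!?"` keeps the letters, and the same restricted string could presumably be passed to `F.translate` in the cell above.

```python
def delete_chars(text: str, chars: str) -> str:
    """Delete every character listed in `chars` from `text`, one by one."""
    return text.translate(str.maketrans("", "", chars))

review = "Not tasty and the texture was just nasty."

# Same character set as the notebook cell: N, A, M and E are deleted too.
as_in_cell = delete_chars(review, ".!?#NAME?")

# Punctuation only: the words stay intact.
punctuation_only = delete_chars(review, ".!?")

print(as_in_cell)        # ot tasty and the texture was just nasty
print(punctuation_only)  # Not tasty and the texture was just nasty
```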



## Prepare training data

1.   Separate "SentimentText" into individual words using the tokenizer
2.   Remove stop words (common words that are not useful as features)
3.   Convert the words into numerical features. In Spark, this is implemented by the HashingTF function, which uses Austin Appleby's MurmurHash 3 algorithm



**Separate "SentimentText" into individual words using the tokenizer**




In [46]:
tokenizer = Tokenizer(inputCol="SentimentText", outputCol="SentimentWords")
tokenizedTrain = tokenizer.transform(df).na.drop(how="any")
tokenizedTrain.show(truncate=False, n=10)

+--------------------------------------------------------------------------------------------------------------+-----+-------------------------------------------------------------------------------------------------------------------------------------+
|SentimentText                                                                                                 |label|SentimentWords                                                                                                                       |
+--------------------------------------------------------------------------------------------------------------+-----+-------------------------------------------------------------------------------------------------------------------------------------+
|Wow Loved this place                                                                                          |1    |[wow, loved, this, place]                                                                                                  
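
As a rough mental model, the tokenizer step lowercases the text and splits it on whitespace. The sketch below is a plain-Python approximation, not Spark's implementation; the function name `tokenize` is hypothetical, and the second input assumes a cleaned review that starts with a space.

```python
import re
from typing import List

def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it on single whitespace characters."""
    return re.split(r"\s", text.lower())

print(tokenize("Wow Loved this place"))  # ['wow', 'loved', 'this', 'place']

# A leading space produces an empty first token, which then
# travels through the rest of the pipeline as a "word".
print(tokenize(" great touch"))          # ['', 'great', 'touch']
```

Trimming the text before tokenizing would avoid the empty token.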

**Remove stop words (common words that are not useful as features)**

In [47]:
swr = StopWordsRemover(inputCol=tokenizer.getOutputCol(), 
                       outputCol="MeaningfulWords")
SwRemovedTrain = swr.transform(tokenizedTrain).na.drop(how="any")
SwRemovedTrain.show(truncate=False, n=10)

+--------------------------------------------------------------------------------------------------------------+-----+-------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
|SentimentText                                                                                                 |label|SentimentWords                                                                                                                       |MeaningfulWords                                                       |
+--------------------------------------------------------------------------------------------------------------+-----+-------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------+
|Wow Loved this place       
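
Stop-word removal is a filter over the token list. The sketch below uses a tiny, hypothetical stop list chosen only to reproduce two rows of this notebook; Spark's default English list is much longer.

```python
from typing import Iterable, List

# Hypothetical mini stop list, for illustration only.
STOP_WORDS = frozenset({"this", "is", "not", "the", "and", "was"})

def remove_stop_words(tokens: Iterable[str], stop_words=STOP_WORDS) -> List[str]:
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stop_words(["wow", "loved", "this", "place"]))  # ['wow', 'loved', 'place']
print(remove_stop_words(["crust", "is", "not", "good"]))     # ['crust', 'good']
```

The second example shows a side effect worth keeping in mind for sentiment analysis: once "not" is filtered out, "Crust is not good" and "Crust is good" yield the same features.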

**Convert the words into numerical features.** In Spark, this is implemented by the HashingTF function, which uses Austin Appleby's MurmurHash 3 algorithm.

In [48]:
hashTF = HashingTF(inputCol=swr.getOutputCol(), outputCol="features")
numericTrainData = hashTF.transform(SwRemovedTrain).select(
    'label', 'MeaningfulWords', 'features')
 
# Create an IDF object
idf = IDF(inputCol="features", outputCol="idf_features")

# Compute the inverse document frequencies (IDF) and rescale the term frequencies
idfModel = idf.fit(numericTrainData)
tfidf = idfModel.transform(numericTrainData)
tfidf = tfidf.na.drop(how="any")
tfidf.show(truncate=True,n=20)


+-----+--------------------+--------------------+--------------------+
|label|     MeaningfulWords|            features|        idf_features|
+-----+--------------------+--------------------+--------------------+
|    1| [wow, loved, place]|(262144,[4631,709...|(262144,[4631,709...|
|    0|       [crust, good]|(262144,[113432,2...|(262144,[113432,2...|
|    0|[ot, tasty, textu...|(262144,[21732,15...|(262144,[21732,15...|
|    1|[stopped, late, a...|(262144,[53101,68...|(262144,[53101,68...|
|    1|[selection, menu,...|(262144,[15370,12...|(262144,[15370,12...|
|    0|[ow, getting, ang...|(262144,[12057,98...|(262144,[12057,98...|
|    0|[honeslty, taste,...|(262144,[92393,18...|(262144,[92393,18...|
|    0|[potatoes, like, ...|(262144,[14768,63...|(262144,[14768,63...|
|    1|      [fries, great]|(262144,[171611,2...|(262144,[171611,2...|
|    1|    [, great, touch]|(262144,[43333,24...|(262144,[43333,24...|
|    1|   [service, prompt]|(262144,[43756,16...|(262144,[43756,16...|
|    0
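
To make the two columns above less opaque, here is a minimal pure-Python sketch of the hashing trick followed by IDF weighting. It is an illustration under assumptions, not Spark's code: MD5 stands in for MurmurHash 3, so the bucket indices will not match the ones printed above, and all function names are hypothetical. The vector size 262144 and the token lists are taken from the output; the IDF formula log((m + 1) / (df + 1)) is the smoothed form documented for Spark's IDF.

```python
import hashlib
import math
from collections import Counter
from typing import Dict, List

NUM_FEATURES = 262144  # 2**18, the vector size shown in the output

def bucket(term: str, num_features: int = NUM_FEATURES) -> int:
    """Map a term to a column index (MD5 here, as a stand-in hash)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_features

def hashing_tf(tokens: List[str]) -> Dict[int, float]:
    """Sparse term-frequency vector: bucket index -> count."""
    counts = Counter(bucket(t) for t in tokens)
    return {idx: float(n) for idx, n in counts.items()}

def idf_weights(tf_vectors: List[Dict[int, float]]) -> Dict[int, float]:
    """idf = log((m + 1) / (df + 1)) for m documents, df of them hitting the bucket."""
    m = len(tf_vectors)
    df = Counter(idx for vec in tf_vectors for idx in vec)
    return {idx: math.log((m + 1) / (n + 1)) for idx, n in df.items()}

docs = [["wow", "loved", "place"], ["crust", "good"],
        ["fries", "great"], ["", "great", "touch"]]

tf_vectors = [hashing_tf(d) for d in docs]
idf = idf_weights(tf_vectors)
tfidf_vectors = [{i: c * idf[i] for i, c in vec.items()} for vec in tf_vectors]

# "great" occurs in two documents, so its weight is lower than that of rarer words.
print(idf[bucket("great")], idf[bucket("wow")])
```

Because terms are hashed rather than looked up in a vocabulary, two different words can land in the same bucket; a large vector size keeps such collisions rare.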

## Divide data into training and testing data


In [41]:
# Divide the data: 75% for training, 25% for testing (split sizes are approximate)
trainingData, testingData = tfidf.randomSplit([0.75, 0.25])
train_rows = trainingData.count()
test_rows = testingData.count()
print("Training data rows:", train_rows, "; Testing data rows:", test_rows)

Training data rows: 724 ; Testing data rows: 273
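
The split is random per row, so the sizes only approximate the 75/25 weights (here 724 and 273 out of 997), and they change between runs unless a seed is fixed. The sketch below illustrates the idea in plain Python; it is a simplified model with a hypothetical `random_split` helper, not Spark's sampler. In PySpark itself, `randomSplit` accepts an optional `seed` argument for the same purpose.

```python
import random
from itertools import accumulate
from typing import List, Sequence

def random_split(rows: Sequence, weights: Sequence[float], seed: int) -> List[list]:
    """Assign each row to one split by drawing a uniform number per row."""
    total = sum(weights)
    bounds = list(accumulate(w / total for w in weights))  # e.g. [0.75, 1.0]
    rng = random.Random(seed)
    splits: List[list] = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if x < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(997))  # 997 = 724 + 273, the row count in this notebook
train, test = random_split(rows, [0.75, 0.25], seed=42)
print(len(train), len(test))  # close to, but rarely exactly, 748 and 249
```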


## Train our classifier model using training data

In [51]:
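# Note: the model is trained on the raw term-frequency column "features";
# pass featuresCol="idf_features" to train on the TF-IDF weights instead.
lr = LogisticRegression(labelCol="label", featuresCol="features", 
                        maxIter=10, regParam=0.01)
model = lr.fit(trainingData)
print("Training is done!")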

Training is done!
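
Once fitted, a binary logistic regression model scores a review by taking a weighted sum of its features and squashing it through the sigmoid function. The sketch below shows that prediction step on a sparse feature dictionary; the weights, bias, and function names are hypothetical and are not values learned by the model above.

```python
import math
from typing import Dict

def predict_proba(features: Dict[int, float], weights: Dict[int, float], bias: float) -> float:
    """Probability of the positive class: sigmoid(bias + sum of w_i * x_i)."""
    z = bias + sum(value * weights.get(idx, 0.0) for idx, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def predict(features: Dict[int, float], weights: Dict[int, float],
            bias: float, threshold: float = 0.5) -> int:
    """Label 1 when the probability exceeds the threshold, else 0."""
    return 1 if predict_proba(features, weights, bias) > threshold else 0

# Hypothetical weights: bucket 7 signals a positive review, bucket 9 a negative one.
weights = {7: 2.0, 9: -2.0}
print(predict({7: 1.0}, weights, bias=0.0))  # 1
print(predict({9: 1.0}, weights, bias=0.0))  # 0
```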


## Predict on the testing data and calculate the model's accuracy

In [52]:
prediction = model.transform(testingData)
predictionFinal = prediction.select(
    "MeaningfulWords", "prediction", "label")
#predictionFinal.show(n=2, truncate = False)
correctPrediction = predictionFinal.filter(
    predictionFinal['prediction'] == predictionFinal['label']).count()
totalData = predictionFinal.count()
print("correct prediction:", correctPrediction, ", total data:", totalData, 
      ", accuracy:", correctPrediction/totalData)

correct prediction: 201 , total data: 273 , accuracy: 0.7362637362637363
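
Accuracy here is simply the share of test rows whose predicted label equals the true label. The sketch below restates that computation in plain Python and rechecks the reported figure; the `accuracy` helper and the four-element example lists are hypothetical.

```python
from typing import Sequence

def accuracy(predictions: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of positions where the prediction equals the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for p, y in zip(predictions, labels) if int(p) == y)
    return correct / len(labels)

# Hypothetical toy example: 3 of 4 predictions are right.
print(accuracy([1.0, 0.0, 1.0, 1.0], [1, 0, 0, 1]))  # 0.75

# The figure reported above: 201 correct out of 273 test rows.
reported = 201 / 273
print(round(reported, 4))  # 0.7363
```

A single accuracy number hides how the errors are distributed between the two classes, so counting false positives and false negatives separately can be a useful follow-up.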
