<a href="https://colab.research.google.com/github/rime-am/projet_bigdata/blob/main/analyse_sentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis

We use book reviews as our data.



In [None]:
!pip install pyspark
import pyspark
from pyspark.sql import SparkSession

# create our Spark session
spark = SparkSession.builder.master("local[1]").appName("SparkByExamples.com").getOrCreate()
spark.conf.set('spark.sql.shuffle.partitions', '100')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Our dataset
Our data comes from the following site (http://jmcauley.ucsd.edu/data/amazon/). The original file contains 982,619 reviews. We used a smaller file (50,000 reviews).


In [None]:
# our data file
kindle_json = spark.read.json('Kindle_Stode_51.json')

In [None]:
kindle_json.show(3)

+----------+-------+-------+--------------------+----------+--------------+------------+------------------+--------------+
|      asin|helpful|overall|          reviewText|reviewTime|    reviewerID|reviewerName|           summary|unixReviewTime|
+----------+-------+-------+--------------------+----------+--------------+------------+------------------+--------------+
|B000F83SZQ| [0, 0]|    5.0|I enjoy vintage b...|05 5, 2014|A1F6404F1VG29J|  Avidreader|Nice vintage story|    1399248000|
|B000F83SZQ| [2, 2]|    4.0|This book is a re...|01 6, 2014| AN0N05A9LIJEQ|    critters|      Different...|    1388966400|
|B000F83SZQ| [2, 2]|    4.0|This was a fairly...|04 4, 2014| A795DMNCJILA6|         dot|             Oldie|    1396569600|
+----------+-------+-------+--------------------+----------+--------------+------------+------------------+--------------+
only showing top 3 rows



### Sentiment labels

Reviews with a rating of 1, 2, or 3 are treated as negative (label=1), and reviews with a rating of 4 or 5 are treated as positive (label=0).

In [None]:
kindle_json.createOrReplaceTempView('kindle_json_view')

data_json = spark.sql('''
  SELECT CASE WHEN overall<4 THEN 1
          ELSE 0
          END as label,
        reviewText as text
  FROM kindle_json_view
  WHERE length(reviewText)>2''')

data_json.groupBy('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|    1|11928|
|    0|38069|
+-----+-----+
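The labelling rule from the SQL `CASE` expression can be restated in plain Python to check the boundary between ratings 3 and 4. This is only an illustrative sketch: the helper name `label_from_rating` is hypothetical and is not part of the notebook.

```python
def label_from_rating(overall: float) -> int:
    """Mirror the SQL rule: CASE WHEN overall < 4 THEN 1 ELSE 0 END."""
    return 1 if overall < 4 else 0

ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
labels = [label_from_rating(r) for r in ratings]
print(labels)  # [1, 1, 1, 0, 0]

# Share of negative reviews, using the counts printed above.
negative_share = 11928 / (11928 + 38069)
print(round(negative_share, 3))
```

With these counts, fewer than a quarter of the reviews are negative, so the two classes are clearly imbalanced before any sampling.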



In [None]:
# take a small sample of our data to make the analysis easier
pos = data_json.where('label=0').sample(False, 0.05, seed=1220)
neg = data_json.where('label=1').sample(False, 0.25, seed=1220)
data = pos.union(neg)
data.groupBy('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|    0| 1939|
|    1| 3035|
+-----+-----+
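The sampled counts can be compared with what the fractions predict. `DataFrame.sample` without replacement does not guarantee an exact fraction, so the result only approximates `fraction * count`. The short sketch below does that arithmetic with the counts printed earlier; the variable names are illustrative.

```python
# Class counts printed by the notebook before sampling.
counts = {0: 38069, 1: 11928}
# Sampling fractions passed to DataFrame.sample for each class.
fractions = {0: 0.05, 1: 0.25}

expected = {label: counts[label] * fractions[label] for label in counts}
for label, value in expected.items():
    print(f"label={label}: expected about {value:.0f} rows")
```

The observed 1939 and 3035 rows are close to these expectations. Sampling the positive class at a lower rate also reverses the imbalance: the negative class is now the larger one.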



In [None]:
from pyspark.sql.functions import length
data.withColumn('longueur_avis', length('text')).groupBy('label').avg('longueur_avis').show()

+-----+------------------+
|label|avg(longueur_avis)|
+-----+------------------+
|    0| 615.1113976276431|
|    1|  605.902471169687|
+-----+------------------+



### Preprocessing function

In [None]:
# preprocessing function
def clean(text):
  import html
  import string
  import nltk
  nltk.download('wordnet')

  line = html.unescape(text)
  line = line.replace("can't", 'can not')
  line = line.replace("n't", " not")
  # pad each punctuation character with spaces so that it becomes its own token
  pad_punct = str.maketrans({key: " {0} ".format(key) for key in string.punctuation})
  line = line.translate(pad_punct)
  line = line.lower()
  line = line.split()
  lemmatizer = nltk.WordNetLemmatizer()
  line = [lemmatizer.lemmatize(t) for t in line]

  # negation handling
  # prefix "not_" to the words that follow "not" or "no", up to the next punctuation mark
  # this helps the sentiment analysis
  tokens = []
  negated = False
  for t in line:
      if t in ['not', 'no']:
          negated = not negated
      elif t in string.punctuation or not t.isalpha():
          negated = False
      else:
          tokens.append('not_' + t if negated else t)

  invalidChars = str(string.punctuation.replace("_", ""))
  bi_tokens = list(nltk.bigrams(line))
  bi_tokens = list(map('_'.join, bi_tokens))
  bi_tokens = [i for i in bi_tokens if all(j not in invalidChars for j in i)]
  tri_tokens = list(nltk.trigrams(line))
  tri_tokens = list(map('_'.join, tri_tokens))
  tri_tokens = [i for i in tri_tokens if all(j not in invalidChars for j in i)]
  tokens = tokens + bi_tokens + tri_tokens

  return tokens

In [None]:
# an example showing how the function works
import nltk
nltk.download('omw-1.4')
example = clean("This is such a good book! A love story for the ages, I can't wait for the second book!!")
print(example)

['this', 'is', 'such', 'a', 'good', 'book', 'a', 'love', 'story', 'for', 'the', 'age', 'i', 'can', 'not_wait', 'not_for', 'not_the', 'not_second', 'not_book', 'this_is', 'is_such', 'such_a', 'a_good', 'good_book', 'a_love', 'love_story', 'story_for', 'for_the', 'the_age', 'i_can', 'can_not', 'not_wait', 'wait_for', 'for_the', 'the_second', 'second_book', 'this_is_such', 'is_such_a', 'such_a_good', 'a_good_book', 'a_love_story', 'love_story_for', 'story_for_the', 'for_the_age', 'i_can_not', 'can_not_wait', 'not_wait_for', 'wait_for_the', 'for_the_second', 'the_second_book']


[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
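The negation step is the least obvious part of `clean`. The sketch below isolates that loop so it runs without NLTK; it assumes the input is already lower-cased and split, with punctuation as separate tokens. The name `mark_negation` is hypothetical.

```python
import string

def mark_negation(words):
    """Prefix 'not_' to words inside a negation scope.

    Mirrors the loop in clean(): 'not' and 'no' toggle the scope and are
    dropped, any punctuation or non-alphabetic token closes the scope and
    is dropped, and every other word is kept.
    """
    tokens = []
    negated = False
    for w in words:
        if w in ("not", "no"):
            negated = not negated
        elif w in string.punctuation or not w.isalpha():
            negated = False
        else:
            tokens.append("not_" + w if negated else w)
    return tokens

words = "i can not wait for the book ! it is great".split()
marked = mark_negation(words)
print(marked)
# ['i', 'can', 'not_wait', 'not_for', 'not_the', 'not_book', 'it', 'is', 'great']
```

Because the flag is toggled rather than set, a second "not" or "no" before the next punctuation mark switches the prefix off again.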


In [None]:
# apply the preprocessing to our data
from pyspark.sql.functions import udf, col, size
from pyspark.sql.types import ArrayType, StringType
clean_udf = udf(clean, ArrayType(StringType()))
data_tokens = data.withColumn('tokens', clean_udf(col('text')))
data_tokens.show(3)

+-----+--------------------+--------------------+
|label|                text|              tokens|
+-----+--------------------+--------------------+
|    0|I am not for sure...|[i, am, not_for, ...|
|    0|This is yet anoth...|[this, is, yet, a...|
|    0|I almost didn't g...|[i, almost, did, ...|
+-----+--------------------+--------------------+
only showing top 3 rows



### Splitting our data into training (70%) and testing (30%)

In [None]:
training, testing = data_tokens.randomSplit([0.7,0.3], seed=1220)
training.groupBy('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|    0| 1351|
|    1| 2143|
+-----+-----+



In [None]:
training.cache()

DataFrame[label: int, text: string, tokens: array<string>]

### Using the Naive Bayes model

In [None]:
from pyspark.ml.feature import CountVectorizer, IDF
from pyspark.ml import Pipeline

count_vec = CountVectorizer(inputCol='tokens', outputCol='c_vec', minDF=5.0)
idf = IDF(inputCol="c_vec", outputCol="features")
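To make the two feature stages concrete, here is a small pure-Python sketch of what they compute on a toy corpus. It assumes the smoothed formula documented for Spark ML's `IDF`, `log((m + 1) / (df + 1))` with `m` the number of documents. The toy documents, the `min_df` value and the alphabetical vocabulary order are illustrative simplifications, not the notebook's data or Spark's exact ordering.

```python
import math
from collections import Counter

docs = [
    ["good", "book", "good_book"],
    ["bad", "book", "not_good"],
    ["good", "story"],
]
min_df = 2  # toy value; the notebook uses minDF=5.0

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))
# Keep only terms that appear in at least min_df documents.
vocab = sorted(t for t, n in df.items() if n >= min_df)

m = len(docs)
idf = {t: math.log((m + 1) / (df[t] + 1)) for t in vocab}

def tfidf(doc):
    """Term count multiplied by IDF, for in-vocabulary terms only."""
    counts = Counter(doc)
    return {t: counts[t] * idf[t] for t in vocab if counts[t] > 0}

print(vocab)
print(tfidf(docs[0]))
```

Terms seen in only one document ("good_book", "not_good", "story", "bad") are dropped by the `min_df` filter, which is how `minDF` keeps the many rare bigrams and trigrams out of the vocabulary.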

In [None]:
# Naive Bayes model
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()

pipeline_nb = Pipeline(stages=[count_vec, idf, nb])

model_nb = pipeline_nb.fit(training)
test_nb = model_nb.transform(testing)
test_nb.show(3)

+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|label|                text|              tokens|               c_vec|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|    0|"Winter's Passage...|[winter, s, passa...|(18788,[0,1,2,4,5...|(18788,[0,1,2,4,5...|[-4913.0456154214...|[1.0,1.1067609804...|       0.0|
|    0|&lt;mrs.featherpi...|[mr, featherpicke...|(18788,[0,1,2,3,4...|(18788,[0,1,2,3,4...|[-5638.0576384967...|[0.98151461366521...|       0.0|
|    0|(4.5 star Top Pic...|[star, top, pick,...|(18788,[0,1,2,3,4...|(18788,[0,1,2,3,4...|[-19323.732043022...|[1.0,4.6153056670...|       0.0|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+

#### Performance of our Naive Bayes model


In [None]:
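# area under the ROC curve (AUC) of our model
# note: rawPredictionCol points at the hard 0/1 'prediction' column, so this AUC is
# computed from a single operating point; passing 'rawPrediction' would use the
# model's scores and would give a different value from the one printed below
from pyspark.ml.evaluation import BinaryClassificationEvaluator
roc_nb_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='label')
roc_nb = roc_nb_eval.evaluate(test_nb)
print("Area under ROC of our model {}".format(roc_nb))

Area under ROC of our model 0.8240863610017999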
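When the evaluator receives hard 0/1 predictions instead of scores, the ROC curve has a single interior point and its area reduces to the mean of the true-positive and true-negative rates. The sketch below illustrates this on made-up values; the `auc` helper, the labels and the scores are all hypothetical and do not come from the notebook.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
soft_scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]           # hypothetical model scores
hard_preds = [1 if s >= 0.5 else 0 for s in soft_scores]  # thresholded at 0.5

tpr = 2 / 3  # two of the three label-1 rows are predicted 1
tnr = 2 / 3  # two of the three label-0 rows are predicted 0

print(auc(hard_preds, labels), (tpr + tnr) / 2)
print(auc(soft_scores, labels))
```

On this toy data the score-based AUC is higher than the one computed from hard predictions, because thresholding discards the ranking information inside each predicted class.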


In [None]:
# accuracy of our model
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
acc_nb_eval = MulticlassClassificationEvaluator(metricName='accuracy')
acc_nb = acc_nb_eval.evaluate(test_nb)
print("Accuracy of our model {}".format(acc_nb))

Accuracy of our model 0.8358108108108108


#### Cross-validation of our Naive Bayes model

In [None]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

paramGrid_nb = (ParamGridBuilder()
                .addGrid(count_vec.minDF, [3.0, 5.0, 7.0, 10.0, 15.0])
                .addGrid(nb.smoothing, [0.1, 0.5, 1.0])
                .build())
cv_nb = CrossValidator(estimator=pipeline_nb, estimatorParamMaps=paramGrid_nb, evaluator=acc_nb_eval, numFolds=5)
cv_model_nb = cv_nb.fit(training) 
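The cost of this search is easy to underestimate. As a rough sketch, the grid can be enumerated in plain Python to count how many pipeline fits cross-validation performs; the dictionary keys below are illustrative labels, not Spark `Param` objects.

```python
from itertools import product

min_df_values = [3.0, 5.0, 7.0, 10.0, 15.0]
smoothing_values = [0.1, 0.5, 1.0]
num_folds = 5

# Every combination of the two hyperparameters, as ParamGridBuilder would build.
grid = [
    {"minDF": m, "smoothing": s}
    for m, s in product(min_df_values, smoothing_values)
]
fits_during_cv = len(grid) * num_folds
print(len(grid), fits_during_cv)  # 15 75
```

After the search, `CrossValidator` refits the best combination once on the full training set, so the total is one more fit than the count above.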

In [None]:
test_cv_nb = cv_model_nb.transform(testing)
acc_nb_cv = acc_nb_eval.evaluate(test_cv_nb)
print("Accuracy of our model with CrossValidator: {}".format(acc_nb_cv))

Accuracy of our model with CrossValidator: 0.825


### Predictions on our own reviews
To check that our model works, we try it on our own reviews:
* one clearly positive,
* one clearly negative,
* one that mixes a bit of both.


In [None]:
review_1 = ["What an excellent, excellent book! The writing style was truly excellent, and the characters were so detailed and well developed. The author surpassed themselves, can't wait for the sequel!"]


In [None]:
review_2 = ["One of the worst books I have ever read, one word to describe it : trash!"]

In [None]:
review_3 = ["I liked the premise and most of the book. At some parts I lost a little interest because I couldn't differentiate between who was who."]

In [None]:
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField("text", StringType(), True)])

text = [review_1, review_2, review_3]
review_new = spark.createDataFrame(text, schema=schema)

In [None]:
# preprocess our data
review_new_tokens = review_new.withColumn('tokens', clean_udf(col('text')))
review_new_tokens.show()

+--------------------+--------------------+
|                text|              tokens|
+--------------------+--------------------+
|What an excellent...|[what, an, excell...|
|One of the worst ...|[one, of, the, wo...|
|I liked the premi...|[i, liked, the, p...|
+--------------------+--------------------+



In [None]:
# predict using our Naive Bayes model
result = cv_model_nb.transform(review_new_tokens)
result.select('text', 'prediction').show()

+--------------------+----------+
|                text|prediction|
+--------------------+----------+
|What an excellent...|       0.0|
|One of the worst ...|       1.0|
|I liked the premi...|       1.0|
+--------------------+----------+

