# Review-based sentiment analysis

I tested several datasets with several text-mining models to predict sentiment from reviews. The model that works best so far is the bag-of-words model built with CountVectorizer. It predicts positive reviews fairly well and negative reviews somewhat less well.
Possible ways to improve the model:
* Combine the two datasets
* Use a text-mining model pre-trained on a large volume of data, such as VADER
* Use other datasets or other text-mining models

In [1]:
from nltk.tokenize import word_tokenize
import nltk
# nltk.download is a no-op when the package is already up to date,
# so a single call is enough.
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/kenny/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Cleaning and processing the dataset

Tokenizer that extracts the tokens from a review. Only alphanumeric tokens of length 4 or more are kept, to keep the model's vocabulary consistent.

In [2]:
from nltk.tokenize.regexp import RegexpTokenizer

tokenizer = RegexpTokenizer(r"[a-zA-Z0-9]{4,}")
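
To make the effect of this pattern concrete, here is a minimal sketch that uses Python's built-in `re` module. It assumes that `re.findall` with the same pattern returns what `RegexpTokenizer` returns in its default mode. The sample sentences are made up for illustration.

```python
import re

# Same pattern as the tokenizer above: runs of 4 or more letters or digits.
PATTERN = re.compile(r"[a-zA-Z0-9]{4,}")

def tokenize(text):
    """Lowercase the text and keep only alphanumeric runs of length >= 4."""
    return PATTERN.findall(text.lower())

sample_tokens = tokenize("I love this app, it's great!!")
negation_tokens = tokenize("This is not good, really bad")

print(sample_tokens)    # ['love', 'this', 'great']
print(negation_tokens)  # ['this', 'good', 'really']
```

Short words such as "not" and "bad" fall under the length threshold, so the second sentence loses its negative cues. This may be worth keeping in mind when reading the results further down.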


Function that removes the stopwords from each review

In [3]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
stop_words.update([".",",","?","@"])

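def stop_words_filtering(l):
    # Build a new list instead of calling l.remove() inside the loop,
    # which would skip the token that follows each removed one.
    return [element for element in l if element not in stop_words]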
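
One pitfall with this kind of filter is that removing items from a list while looping over it makes the loop skip the element right after each removal. The sketch below compares in-place removal with building a new list. The stop-word set and the tokens are hypothetical, chosen so that two stop words sit next to each other.

```python
# Hypothetical stop-word set and tokens, chosen so two stop words are adjacent.
demo_stop_words = {"this", "that"}

def filter_in_place(tokens):
    """Remove stop words from the list while iterating over it (skips items)."""
    for token in tokens:
        if token in demo_stop_words:
            tokens.remove(token)
    return tokens

def filter_with_new_list(tokens):
    """Build a new list that contains only the non-stop-word tokens."""
    return [token for token in tokens if token not in demo_stop_words]

in_place_result = filter_in_place(["this", "that", "great"])
new_list_result = filter_with_new_list(["this", "that", "great"])

print(in_place_result)  # ['that', 'great']
print(new_list_result)  # ['great']
```

With in-place removal, "that" survives because it slides into the position the loop has already visited.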

Lemmatization function that reduces each word to its canonical form (duplicate lemmas are dropped)

In [4]:
from nltk.stem import WordNetLemmatizer

def lemmatisation(mots):
    wordnet_lemmatizer = WordNetLemmatizer()
    result = []
    for element in mots:
        radical = wordnet_lemmatizer.lemmatize(element, pos='v')
        if (radical not in result):
            result.append(radical)
    return result

Stemming function that reduces each word to its root. We decided not to use it because lemmatization performs better.

In [5]:
from nltk.stem.snowball import PorterStemmer

def stemming(mots):
    stemmer = PorterStemmer()
    sortie = []
    for string in mots:
        radical = stemmer.stem(string)
        if radical not in sortie:
            sortie.append(radical)
    return sortie

# Importing the training dataset

Dataset source: https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset

The goal is to build a model that can predict whether a text is positive, neutral or negative. To do this, we take a dataset of reviews with their labels. The code below filters out the neutral rows, so the trained classifier is binary. After training, we test the model on our own data.

We import the data as a training set (`train.csv`) and a test set (`test.csv`).

In [6]:
import pandas as pd 

df_test = pd.read_csv("test.csv", encoding='ISO-8859-1')
df_train = pd.read_csv("train.csv", encoding='ISO-8859-1')
df_test = df_test[df_test["sentiment"] != "neutral"]
df_train = df_train[df_train["sentiment"] != "neutral"]

df_test = df_test.dropna(subset=['text'])
df_train = df_train.dropna(subset=['text'])

df_train["text"] = df_train["text"].str.lower().apply(tokenizer.tokenize)
df_test["text"] = df_test["text"].str.lower().apply(tokenizer.tokenize)

df_train["text"] = df_train["text"].apply(stop_words_filtering)
df_test["text"] = df_test["text"].apply(stop_words_filtering)

X_train = df_train["text"]
X_test = df_test["text"]
y_train = df_train["sentiment"]
y_test = df_test["sentiment"]

Conversion to a list of strings so that the format is compatible with the bag-of-words algorithms

In [7]:
# Lemmatize each token list and join it into one string; apply() returns
# a new Series, which avoids pandas' SettingWithCopyWarning.
X_train = X_train.apply(lambda tokens: " ".join(lemmatisation(tokens)))
X_test = X_test.apply(lambda tokens: " ".join(lemmatisation(tokens)))


# Bag-of-words model with CountVectorizer

Applying CountVectorizer to our training and test sets

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
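
As a rough illustration of what `fit_transform` and `transform` produce here, the pure-Python sketch below builds a count matrix by hand on made-up documents. It is a simplification: `CountVectorizer` does its own tokenization and returns a sparse matrix, and every name below is hypothetical.

```python
from collections import Counter

# Two made-up, already-cleaned training documents and one unseen document.
train_docs = ["great service great team", "terrible service"]
new_doc = "great terrible unknown"

# Vocabulary: every distinct training word, sorted alphabetically.
vocabulary = sorted({word for doc in train_docs for word in doc.split()})

def to_counts(doc):
    """Count each vocabulary word in doc; words outside the vocabulary are ignored."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

train_matrix = [to_counts(doc) for doc in train_docs]
new_row = to_counts(new_doc)

print(vocabulary)    # ['great', 'service', 'team', 'terrible']
print(train_matrix)  # [[2, 1, 1, 0], [0, 1, 0, 1]]
print(new_row)       # [1, 0, 0, 1]
```

The word "unknown" never appeared in the training documents, so it contributes nothing to the new row. This is why the vocabulary is fitted on the training set only and then reused for the test set.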

Training with the GradientBoostingClassifier algorithm

In [9]:
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=42).fit(X_train, y_train)

y_pred = clf.predict(X_test)

The prediction results are fairly good (0.83 accuracy on the test set)

In [10]:
from sklearn.metrics import classification_report

print( classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    negative       0.77      0.91      0.84      1001
    positive       0.90      0.76      0.82      1103

    accuracy                           0.83      2104
   macro avg       0.84      0.83      0.83      2104
weighted avg       0.84      0.83      0.83      2104
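
To help read a report like this one, here is a small sketch of how precision, recall and F1 are computed for one class. The labels are made up and the helper `per_class_scores` is hypothetical. It is not how scikit-learn implements the report, only the same definitions written out.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up labels: one positive review is wrongly predicted as negative.
y_true_demo = ["negative", "negative", "positive", "positive", "positive"]
y_pred_demo = ["negative", "negative", "negative", "positive", "positive"]

negative_scores = per_class_scores(y_true_demo, y_pred_demo, "negative")
positive_scores = per_class_scores(y_true_demo, y_pred_demo, "positive")

print(tuple(round(s, 2) for s in negative_scores))  # (0.67, 1.0, 0.8)
print(tuple(round(s, 2) for s in positive_scores))  # (1.0, 0.67, 0.8)
```

The toy example has the same shape as the report above: a class that is predicted too often gets high recall and lower precision, and the other class gets the opposite.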



# Bag-of-words model with TF-IDF

The same thing with TF-IDF

In [11]:
import pandas as pd 

df_test = pd.read_csv("test.csv", encoding='ISO-8859-1')
df_train = pd.read_csv("train.csv", encoding='ISO-8859-1')
df_test = df_test[df_test["sentiment"] != "neutral"]
df_train = df_train[df_train["sentiment"] != "neutral"]

df_test = df_test.dropna(subset=['text'])
df_train = df_train.dropna(subset=['text'])

df_train["text"] = df_train["text"].str.lower().apply(tokenizer.tokenize)
df_test["text"] = df_test["text"].str.lower().apply(tokenizer.tokenize)

df_train["text"] = df_train["text"].apply(stop_words_filtering)
df_test["text"] = df_test["text"].apply(stop_words_filtering)

X_train = df_train["text"]
X_test = df_test["text"]
y_train = df_train["sentiment"]
y_test = df_test["sentiment"]

In [12]:
# Lemmatize each token list and join it into one string; apply() returns
# a new Series, which avoids pandas' SettingWithCopyWarning.
X_train = X_train.apply(lambda tokens: " ".join(lemmatisation(tokens)))
X_test = X_test.apply(lambda tokens: " ".join(lemmatisation(tokens)))


In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer


vec_tfidf = TfidfVectorizer()
X_train_tfidf = vec_tfidf.fit_transform(X_train)
X_test_tfidf = vec_tfidf.transform(X_test)
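
For intuition about the weights, the sketch below computes TF-IDF by hand on made-up documents. It assumes the formula that, as far as I understand, scikit-learn uses with default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row. All names are hypothetical.

```python
import math
from collections import Counter

# Two made-up, already-cleaned training documents.
tfidf_docs = ["great service great team", "terrible service"]
tfidf_vocab = sorted({word for doc in tfidf_docs for word in doc.split()})
n_docs = len(tfidf_docs)

# Document frequency: number of training documents that contain each word.
doc_freq = {w: sum(w in doc.split() for doc in tfidf_docs) for w in tfidf_vocab}

# Smoothed inverse document frequency: rarer words get a larger weight.
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in tfidf_vocab}

def tfidf_vector(doc):
    """Return the L2-normalized TF-IDF weights of doc over the vocabulary."""
    counts = Counter(doc.split())
    raw = [counts[w] * idf[w] for w in tfidf_vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm if norm else 0.0 for value in raw]

first_vector = tfidf_vector(tfidf_docs[0])
print(dict(zip(tfidf_vocab, first_vector)))
```

In this toy corpus, "service" appears in both documents and gets an IDF of exactly 1, while "great" appears in only one and gets a larger weight.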


In [14]:

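clf_tfidf = GradientBoostingClassifier(n_estimators=100, learning_rate=1, max_depth=1, random_state=42).fit(X_train_tfidf, y_train)

# Predict with the TF-IDF model. The report below was generated with
# clf.predict(X_test_tfidf), i.e. the CountVectorizer model applied to
# TF-IDF features, so it needs to be regenerated to evaluate clf_tfidf.
y_pred_tfidf = clf_tfidf.predict(X_test_tfidf)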

In [15]:
print( classification_report(y_test, y_pred_tfidf))

              precision    recall  f1-score   support

    negative       0.51      0.99      0.67      1001
    positive       0.91      0.15      0.25      1103

    accuracy                           0.54      2104
   macro avg       0.71      0.57      0.46      2104
weighted avg       0.72      0.54      0.45      2104



Testing on our own data

In [16]:
data = pd.read_csv("evengreen_com.csv")

In [17]:
def extract_sentiment_cv(reviews):
    # Vectorize all reviews with the fitted CountVectorizer and predict in one batch
    new_text_vectorized = vectorizer.transform(reviews)
    return list(clf.predict(new_text_vectorized))
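
Note that the raw review text goes straight into the vectorizer here, without the tokenization, stop-word removal and lemmatization applied to the training data. One possible way to keep training and prediction consistent is to put the preprocessing inside a scikit-learn `Pipeline`. The sketch below is only an illustration under several assumptions: the training texts are made up, lemmatization is left out, scikit-learn's built-in English stop-word list replaces the NLTK one, and the name `sentiment_pipeline` is hypothetical.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Made-up training examples, for illustration only.
demo_texts = [
    "Great service, really amazing team",
    "Amazing experience, great support",
    "Terrible service, very disappointed",
    "Disappointed again, terrible experience",
]
demo_labels = ["positive", "positive", "negative", "negative"]

sentiment_pipeline = Pipeline([
    # Lowercasing, the 4+ character token rule and stop-word removal all
    # happen inside the vectorizer, so raw text can be passed in directly.
    ("vectorizer", CountVectorizer(
        lowercase=True,
        token_pattern=r"[a-zA-Z0-9]{4,}",
        stop_words="english",
    )),
    # Fewer trees than in the notebook, since the toy data is tiny.
    ("classifier", GradientBoostingClassifier(
        n_estimators=10, learning_rate=1.0, max_depth=1, random_state=42)),
])

sentiment_pipeline.fit(demo_texts, demo_labels)
demo_predictions = sentiment_pipeline.predict(
    ["Great team", "Terrible, disappointed"]
)
print(list(demo_predictions))
```

With this layout, new reviews go through exactly the same steps as the training texts, because the fitted pipeline carries the vectorizer settings with it.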

In [18]:
data["sentiment_review"] = extract_sentiment_cv(data["Review Text"])

In [19]:
r = extract_sentiment_cv(data["Review Text"])


In [20]:
def extract_sentiment_tfidf(reviews):
    # Vectorize all reviews with the fitted TfidfVectorizer and predict in one batch
    new_text_vectorized = vec_tfidf.transform(reviews)
    return list(clf_tfidf.predict(new_text_vectorized))

In [21]:
data["sentiment_review"] = extract_sentiment_tfidf(data["Review Text"])
r = extract_sentiment_tfidf(data["Review Text"])

In [22]:
data

| Unnamed: 0 | Reviewer Name | Date | Rating | Review Title | Review Text | sentiment_review |
|---|---|---|---|---|---|---|
| 0 | Dale Anketell2 reviewsUS7 days agoVerifiedMy E... | 2024-07-30T21:52:51.000Z | `<div class="star-rating_starRating__4rrcf star...` | My Experience with Evergreen has always… | My Experience with Evergreen has always been n... | negative |
| 1 | Marian Giovannini2 reviewsUSJul 18, 2024Disapp... | 2024-07-18T15:43:33.000Z | `<div class="star-rating_starRating__4rrcf star...` | Disappointed | I was very disappointed to be denied a credit ... | negative |
| 2 | ASAshleigh Scalamandre1 reviewUSJul 10, 2024Ve... | 2024-07-10T17:43:24.000Z | `<div class="star-rating_starRating__4rrcf star...` | Amazing Team at Evergreen | I just wanted to thank the Evergreen Team agai... | positive |
| 3 | TCTed Coffin1 reviewUSJul 5, 2024VerifiedGreat... | 2024-07-05T23:39:50.000Z | `<div class="star-rating_starRating__4rrcf star...` | Great customer service | My wife and I are buying are house and somethi... | positive |
| 4 | SSSherry Spaulding1 reviewUSJul 3, 2024Verifie... | 2024-07-03T01:46:56.000Z | `<div class="star-rating_starRating__4rrcf star...` | Evergreen Credit Union was amazing to… | Evergreen Credit Union was amazing to work wit... | positive |
| 5 | NSNoel Sherburne1 reviewUSJun 18, 2024Verified... | 2024-06-18T20:22:49.000Z | `<div class="star-rating_starRating__4rrcf star...` | I requested bank statements and the… | I requested bank statements and the request wa... | positive |
| 6 | MLMatthew L’Abbe1 reviewUSJun 18, 2024Verified... | 2024-06-18T18:56:21.000Z | `<div class="star-rating_starRating__4rrcf star...` | Great experience | I had multiple scam charges ending in a negati... | positive |
| 7 | LELindsay Edwards1 reviewUSMay 23, 2024Verifie... | 2024-05-23T22:35:55.000Z | `<div class="star-rating_starRating__4rrcf star...` | Best Ever Banking Experience | Casey has been amazing to work with. I have be... | positive |
| 8 | KEKen1 reviewSEApr 20, 2024VerifiedSecurity I ... | 2024-04-20T18:12:35.000Z | `<div class="star-rating_starRating__4rrcf star...` | Security | I called the Evergreen's Portland location on ... | negative |
| 9 | TNThe Noyes Family1 reviewUSMay 10, 2024Verifi... | 2024-05-10T19:18:42.000Z | `<div class="star-rating_starRating__4rrcf star...` | Exceptional Service From Entering to Exiting! | We wanted to extend our gratitude to the excep... | positive |


# Twitter Dataset

Source dataset: https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis

In [23]:
import pandas as pd 
column_names = ['id', 'game', 'sentiment', 'text']
data_twitter = pd.read_csv("twitter_training.csv",header=None, names=column_names, encoding='ISO-8859-1')


In [24]:
data_twitter = data_twitter.dropna()

For our model, we keep only the positive and negative labels

In [25]:
data_twitter = data_twitter[data_twitter["sentiment"] != "Irrelevant"]
data_twitter = data_twitter[data_twitter["sentiment"] != "Neutral"]

In [26]:
from sklearn.model_selection import train_test_split
X = data_twitter["text"]
y = data_twitter["sentiment"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 30)

In [27]:
X_train = X_train.str.lower().apply(tokenizer.tokenize)
X_test = X_test.str.lower().apply(tokenizer.tokenize)

X_train = X_train.apply(stop_words_filtering)
X_test = X_test.apply(stop_words_filtering)

X_train = X_train.apply(lemmatisation)
X_test = X_test.apply(lemmatisation)

# Join each token list into one space-separated string, as in the first dataset
X_train = X_train.apply(" ".join)
X_test = X_test.apply(" ".join)

In [28]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

In [29]:
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=42).fit(X_train, y_train)

y_pred = clf.predict(X_test)

In [30]:
from sklearn.metrics import classification_report

print( classification_report(y_test, y_pred))

              precision    recall  f1-score   support

    Negative       0.70      0.90      0.78      4439
    Positive       0.84      0.58      0.69      4164

    accuracy                           0.74      8603
   macro avg       0.77      0.74      0.74      8603
weighted avg       0.77      0.74      0.74      8603



In [31]:
data["sentiment_review"] = extract_sentiment_cv(data["Review Text"])

In [32]:
r = extract_sentiment_cv(data["Review Text"])

In [33]:
data

| Unnamed: 0 | Reviewer Name | Date | Rating | Review Title | Review Text | sentiment_review |
|---|---|---|---|---|---|---|
| 0 | Dale Anketell2 reviewsUS7 days agoVerifiedMy E... | 2024-07-30T21:52:51.000Z | `<div class="star-rating_starRating__4rrcf star...` | My Experience with Evergreen has always… | My Experience with Evergreen has always been n... | Negative |
| 1 | Marian Giovannini2 reviewsUSJul 18, 2024Disapp... | 2024-07-18T15:43:33.000Z | `<div class="star-rating_starRating__4rrcf star...` | Disappointed | I was very disappointed to be denied a credit ... | Positive |
| 2 | ASAshleigh Scalamandre1 reviewUSJul 10, 2024Ve... | 2024-07-10T17:43:24.000Z | `<div class="star-rating_starRating__4rrcf star...` | Amazing Team at Evergreen | I just wanted to thank the Evergreen Team agai... | Positive |
| 3 | TCTed Coffin1 reviewUSJul 5, 2024VerifiedGreat... | 2024-07-05T23:39:50.000Z | `<div class="star-rating_starRating__4rrcf star...` | Great customer service | My wife and I are buying are house and somethi... | Positive |
| 4 | SSSherry Spaulding1 reviewUSJul 3, 2024Verifie... | 2024-07-03T01:46:56.000Z | `<div class="star-rating_starRating__4rrcf star...` | Evergreen Credit Union was amazing to… | Evergreen Credit Union was amazing to work wit... | Positive |
| 5 | NSNoel Sherburne1 reviewUSJun 18, 2024Verified... | 2024-06-18T20:22:49.000Z | `<div class="star-rating_starRating__4rrcf star...` | I requested bank statements and the… | I requested bank statements and the request wa... | Positive |
| 6 | MLMatthew L’Abbe1 reviewUSJun 18, 2024Verified... | 2024-06-18T18:56:21.000Z | `<div class="star-rating_starRating__4rrcf star...` | Great experience | I had multiple scam charges ending in a negati... | Negative |
| 7 | LELindsay Edwards1 reviewUSMay 23, 2024Verifie... | 2024-05-23T22:35:55.000Z | `<div class="star-rating_starRating__4rrcf star...` | Best Ever Banking Experience | Casey has been amazing to work with. I have be... | Positive |
| 8 | KEKen1 reviewSEApr 20, 2024VerifiedSecurity I ... | 2024-04-20T18:12:35.000Z | `<div class="star-rating_starRating__4rrcf star...` | Security | I called the Evergreen's Portland location on ... | Negative |
| 9 | TNThe Noyes Family1 reviewUSMay 10, 2024Verifi... | 2024-05-10T19:18:42.000Z | `<div class="star-rating_starRating__4rrcf star...` | Exceptional Service From Entering to Exiting! | We wanted to extend our gratitude to the excep... | Positive |
