# Sentiment basé sur les reviews

J'ai testé différent Dataset avec différents modèle de text mining pour prédire les sentiments à partir des reviews.
Une première approche en utilisant le Dataset twitter.
Une deuxième approche en utilisant nos propres jeux de données Trustpilot
Le modèle qui fonctionne le mieux pour le moment est le modèle bag CountVectorizer. Le modèle prédit assez bien pour les avis positifs et un peu moins bien pour les avis négatifs et neutre. 
Des solutions pour améliorer le modèle:
* Ajout de données pour l'entrainement
* Utiliser un modèle de text Mining Pré-entrainer sur un gros volume de données tel que Vader
* Utiliser d'autre Dataset ou d'autres modèles de text mining

# Import des librairies

In [1]:
from nltk.tokenize import word_tokenize
import nltk
from nltk.tokenize.regexp import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import PorterStemmer
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import GradientBoostingClassifier
import joblib
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from pymongo import MongoClient
from pprint import pprint


if nltk.download('punkt') == True:
    pass
else:
    nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ubuntu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Nettoyage et traitement jeu de données

In [2]:
tokenizer = RegexpTokenizer(r"[a-zA-Z0-9]{4,}")


In [3]:
stop_words = set(stopwords.words('english'))
stop_words.update([".",",","?","@"])

def stop_words_filtering(l):
    for element in l:
        if element in stop_words:
            l.remove(element)
    return l

In [4]:
def lemmatisation(mots):
    wordnet_lemmatizer = WordNetLemmatizer()
    result = []
    for element in mots:
        radical = wordnet_lemmatizer.lemmatize(element, pos='v')
        if (radical not in result):
            result.append(radical)
    return result

In [5]:
def stemming(mots) :
    stemmer = PorterStemmer()
    sortie = []
    for string in mots :
        radical = stemmer.stem(string)
        if (radical not in sortie) : sortie.append(radical)
    return sortie

# Twitter Dataset

L'objectif est de créer un modèle capable de prédire si un texte est positif neutre ou négatif. Pour cela on va prendre un jeu de donnée contenant des reviews avec leur label. Après cet entrainement, on va tester le modèle avec nos propres données

In [6]:
path_twitter_data = f"{os.getcwd()}/train_data/Twitter_Data.csv"

In [7]:
data_twitter = pd.read_csv(path_twitter_data, encoding='ISO-8859-1')

In [8]:
data_twitter["category"] = data_twitter["category"].replace(to_replace = -1.0, value = "negatif")
data_twitter["category"] = data_twitter["category"].replace(to_replace = 1.0, value = "positif")
data_twitter["category"] = data_twitter["category"].replace(to_replace = 0.0, value = "neutre")

In [9]:
data_twitter = data_twitter.dropna()

In [10]:
data_twitter["category"].value_counts()

category
positif    72249
neutre     55211
negatif    35509
Name: count, dtype: int64

In [11]:
data_twitter_positif = data_twitter[data_twitter['category'] == 'positif'].sample(n=35509, random_state=42)
data_twitter_neutre = data_twitter[data_twitter['category'] == 'neutre'].sample(n=35509, random_state=42)
data_twitter_negatif = data_twitter[data_twitter['category'] == 'negatif']

In [12]:
data_twitter = pd.concat([data_twitter_positif, data_twitter_negatif,data_twitter_neutre])

In [13]:
data_twitter["category"].value_counts()

category
positif    35509
negatif    35509
neutre     35509
Name: count, dtype: int64

In [14]:
print(f"On a en tout {len(data_twitter)} lignes pour notre modèle")

On a en tout 106527 lignes pour notre modèle


# Algo BAG avec CountVectorizer

On va prendre importer nos données d'entrainements et on va ensuite diviser en jeu d'entrainement et de test

In [15]:
model_name = "modelCV_twitter.pkl"
if os.path.exists(f"{os.getcwd()}/model/{model_name}"):

    vectorizer, model_clf_CV_twitter = joblib.load(f"{os.getcwd()}/model/{model_name}")
    X_test, y_test = joblib.load(f"{os.getcwd()}/model/test_data_twitter_CV.pkl")

else:
    X = data_twitter["clean_text"]
    y = data_twitter["category"]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 30)

    X_train = X_train.str.lower().apply(tokenizer.tokenize)
    X_test = X_test.str.lower().apply(tokenizer.tokenize)
    
    X_train = X_train.apply(stop_words_filtering)
    X_test = X_test.apply(stop_words_filtering)
    
    X_train = X_train.apply(lemmatisation)
    X_test = X_test.apply(lemmatisation)
    
    X_train = X_train.apply(str)
    X_test = X_test.apply(str)

    #application du CountVectorizer sur nos jeux d'entrainements et de tests
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(X_train)
    X_test = vectorizer.transform(X_test)

    model_clf_CV_twitter = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=42).fit(X_train, y_train)
    joblib.dump((vectorizer,model_clf_CV_twitter), f"{os.getcwd()}/model/{model_name}")
    joblib.dump((X_test, y_test), f"{os.getcwd()}/model/test_data_twitter_CV.pkl")
    

In [16]:
y_pred = model_clf_CV_twitter.predict(X_test)

In [17]:
print( classification_report(y_test, y_pred))

              precision    recall  f1-score   support

     negatif       0.85      0.72      0.78      7021
      neutre       0.70      0.95      0.81      7200
     positif       0.87      0.69      0.77      7085

    accuracy                           0.79     21306
   macro avg       0.81      0.79      0.79     21306
weighted avg       0.81      0.79      0.79     21306



Cette fonction prends en entrée un pandas Serie de texte et le modèle et renvoie en sortie une liste avec nos prédictions

In [18]:
def extract_sentiment_cv(reviews,model_CV):
    result = []
    reviews = reviews.str.lower().apply(tokenizer.tokenize)
    reviews = reviews.apply(stop_words_filtering)
    reviews = reviews.apply(lemmatisation)
    reviews = reviews.apply(str)
    for text in reviews:
        new_text_vectorized = vectorizer.transform([text])

        prediction = model_CV.predict(new_text_vectorized)
        result.append(prediction[0])
    return result


# Algo BAG avec TFIDF

La même chose avec tfidf

In [19]:
model_name = "modeltfidf_twitter.pkl"
if os.path.exists(f"{os.getcwd()}/model/{model_name}"):

    vec_tfidf, model_CLF_tfidf_twitter = joblib.load(f"{os.getcwd()}/model/{model_name}")
    X_test_tfidf, y_test_tfidf = joblib.load(f"{os.getcwd()}/model/test_data_twitter_tfidf.pkl")

else:
    X = data_twitter["clean_text"]
    y = data_twitter["category"]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 30)

    X_train = X_train.str.lower().apply(tokenizer.tokenize)
    X_test = X_test.str.lower().apply(tokenizer.tokenize)
    
    X_train = X_train.apply(stop_words_filtering)
    X_test = X_test.apply(stop_words_filtering)
    
    X_train = X_train.apply(lemmatisation)
    X_test = X_test.apply(lemmatisation)
    
    X_train = X_train.apply(str)
    X_test = X_test.apply(str)

    #application du TfidfVectorizer sur nos jeux d'entrainements et de tests
    vec_tfidf = TfidfVectorizer()
    X_train_tfidf = vec_tfidf.fit_transform(X_train)
    X_test_tfidf = vec_tfidf.transform(X_test)

    model_CLF_tfidf_twitter = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=42).fit(X_train_tfidf, y_train)
    joblib.dump((vec_tfidf, model_CLF_tfidf_twitter), f"{os.getcwd()}/model/{model_name}")
    joblib.dump((X_test_tfidf, y_test), f"{os.getcwd()}/model/test_data_twitter_tfidf.pkl")
    

In [20]:
y_pred_tfidf = model_CLF_tfidf_twitter.predict(X_test_tfidf)

In [21]:
print(classification_report(y_test, y_pred_tfidf))

              precision    recall  f1-score   support

     negatif       0.83      0.71      0.76      7021
      neutre       0.70      0.95      0.80      7200
     positif       0.87      0.67      0.75      7085

    accuracy                           0.78     21306
   macro avg       0.80      0.78      0.77     21306
weighted avg       0.80      0.78      0.77     21306



In [22]:
def extract_sentiment_tfidf(reviews,model_tfidf):
    result = []
    reviews = reviews.str.lower().apply(tokenizer.tokenize)
    reviews = reviews.apply(stop_words_filtering)
    reviews = reviews.apply(lemmatisation)
    reviews = reviews.apply(str)
    for text in reviews:

        new_text_vectorized = vec_tfidf.transform([text])

        prediction = model_tfidf.predict(new_text_vectorized)
        result.append(prediction[0])
    return result

# Sentiments Analysis avec Trustpilot

On va se servir de nos propres jeux de données truspilot dans ce cas

In [23]:
path_trustpilot_data = f"{os.getcwd()}/train_data/debt_relief_service_raw_reviews_COMPLETE.csv"

In [24]:
df = pd.read_csv(path_trustpilot_data)

Ici nous avons entrainé notre modèle avec nos propres données que l'on a scrappé sur truspilot

Un petit aperçu de nos données

In [25]:
df.head()

Unnamed: 0.1,Unnamed: 0,firm_id,firm_name,review_url,review_title,review_text,review_date,note,reponse,author_name,author_url,author_localisation,experience_date,extract_date
0,0,turbodebt.com,TurboDebt,/reviews/646cba0dc423446286686604,Alvaro made my experience very…,Alvaro made my experience very satisfactory! I...,2023-05-23,5.0,True,Keontra Reid,/users/646cba0ba8905b00124cdbfb,US,2023-05-22,2024-08-23
1,1,turbodebt.com,TurboDebt,/reviews/646cad6cc423446286685c19,Great company to work with they really…,Great company to work with they really underst...,2023-05-23,5.0,True,Ismael Luciano,/users/646cad6b05330f0014134602,US,2023-05-22,2024-08-23
2,2,turbodebt.com,TurboDebt,/reviews/646c8127706f837cb1eff2f6,HELPFUL..,HELPFUL... HONEST...TRUTHUL!!!!!!!!!!!!!!!!!!!,2023-05-23,5.0,True,Tony RODRIGUEZ,/users/646c8126a8905b00124cad7c,US,2023-05-22,2024-08-23
3,3,turbodebt.com,TurboDebt,/reviews/646c6c27706f837cb1efe4ad,Making it very easy to understand and…,Making it very easy to understand and answerin...,2023-05-23,5.0,True,Melinda Hall,/users/646c6c254be4ac0013350e3c,US,2023-05-22,2024-08-23
4,4,turbodebt.com,TurboDebt,/reviews/646c4e46706f837cb1efd615,Leif was awesome,"Leif was awesome. Hes so friendly, easy to tal...",2023-05-23,5.0,True,SDS,/users/646c4e454be4ac001334fb5f,US,2023-05-22,2024-08-23


In [26]:
print(f"Nous avons en tout {len(df)} lignes")

Nous avons en tout 214824 lignes


In [27]:
df = df.dropna()
print(f"Après avoir supprimé les NaN nous avons {len(df)} lignes")

Après avoir supprimé les NaN nous avons 188262 lignes


In [28]:
df["sentiments"] = df["note"]

In [29]:
df["sentiments"] = df["sentiments"].replace(to_replace = [2.0,1.0], value = "negatif")
df["sentiments"] = df["sentiments"].replace(to_replace = [4.0,5.0], value = "positif")
df["sentiments"] = df["sentiments"].replace(to_replace = [3.0], value = "neutre")

In [30]:
df["sentiments"].value_counts()

sentiments
positif    176578
negatif      7364
neutre       4320
Name: count, dtype: int64

In [31]:
df_positif = df[df['sentiments'] == 'positif'].sample(n=8000, random_state=42)
df_negatif = df[df['sentiments'] == 'negatif']
df_neutre = df[df['sentiments'] == 'neutre']

Notre jeux de données est déjà plus équilibré

In [32]:
df["sentiments"].value_counts()

sentiments
positif    176578
negatif      7364
neutre       4320
Name: count, dtype: int64

In [33]:
df = pd.concat([df_positif, df_negatif, df_neutre])

In [34]:
model_name = "modelCV_trustpilot.pkl"
if os.path.exists(f"{os.getcwd()}/model/{model_name}"):

    vectorizer, model_clf_CV_trustpilot = joblib.load(f"{os.getcwd()}/model/{model_name}")
    X_test, y_test = joblib.load(f"{os.getcwd()}/model/test_data_trustpilot_CV.pkl")

else:
    X = df["review_text"]
    y = df["sentiments"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 30)

    X_train = X_train.str.lower().apply(tokenizer.tokenize)
    X_test = X_test.str.lower().apply(tokenizer.tokenize)
    
    X_train = X_train.apply(stop_words_filtering)
    X_test = X_test.apply(stop_words_filtering)
        
    X_train = X_train.apply(lemmatisation)
    X_test = X_test.apply(lemmatisation)
        
    X_train = X_train.apply(str)
    X_test = X_test.apply(str)
    
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(X_train)
    X_test = vectorizer.transform(X_test)

    model_clf_CV_trustpilot = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=42).fit(X_train, y_train)
    joblib.dump((vectorizer,model_clf_CV_trustpilot), f"{os.getcwd()}/model/{model_name}")
    joblib.dump((X_test, y_test), f"{os.getcwd()}/model/test_data_trustpilot_CV.pkl")
    


In [35]:
y_pred = model_clf_CV_trustpilot.predict(X_test)

In [36]:
print( classification_report(y_test, y_pred))

              precision    recall  f1-score   support

     negatif       0.74      0.77      0.76      1462
      neutre       0.52      0.39      0.45       861
     positif       0.80      0.89      0.84      1614

    accuracy                           0.74      3937
   macro avg       0.69      0.68      0.68      3937
weighted avg       0.72      0.74      0.73      3937



# Test du modèle avec nos jeux de données sur mongoDB

Une fois qu'on a notre modèle on l'utiliser sur nos données scrappés qui se trouvent dans la base mongodb

In [37]:
client = MongoClient(
    host = "127.0.0.1",
    port = 27017,
    username = "datascientest",
    password = "dst123")

In [38]:
pprint(client["test"]["reviews"].find_one())

{'Unnamed: 0': 0,
 '_id': ObjectId('66d1e81ee0bf53857f7ffc70'),
 'author_localisation': 'US',
 'author_name': 'Keontra Reid',
 'author_url': '/users/646cba0ba8905b00124cdbfb',
 'experience_date': '2023-05-22',
 'extract_date': '2024-08-23',
 'firm_id': 'turbodebt.com',
 'firm_name': 'TurboDebt',
 'note': 5.0,
 'reponse': True,
 'review_date': '2023-05-23',
 'review_text': 'Alvaro made my experience very satisfactory! I felt like I '
                'could breathe again after speaking with him.',
 'review_title': 'Alvaro made my experience very…',
 'review_url': '/reviews/646cba0dc423446286686604'}


In [39]:
print(client.list_database_names())

['admin', 'config', 'local', 'test']


Liste des colonnes de notre jeux de données

In [40]:
client["test"]["reviews"].find_one().keys()

dict_keys(['_id', 'Unnamed: 0', 'firm_id', 'firm_name', 'review_url', 'review_title', 'review_text', 'review_date', 'note', 'reponse', 'author_name', 'author_url', 'author_localisation', 'experience_date', 'extract_date'])

Ajout dans une liste de toutes les données

In [41]:
full_data = []
for element in client["test"]["reviews"].find({},{ "_id":0 }):
    full_data.append(element)

Création du DataFrame contenant toutes nos données

In [42]:
data = pd.DataFrame(full_data)

On supprime les lignes où dans la colonne review_text on voit des valeurs NaN

In [43]:
data = data.dropna(subset=['review_text'])

On utilise CV pour nos prédiction car c'est celui avec lequel on obtient nos meilleurs résultats

In [44]:
def extract_sentiment_cv(reviews,model_CV):
    result = []
    reviews = reviews.str.lower().apply(tokenizer.tokenize)
    reviews = reviews.apply(stop_words_filtering)
    reviews = reviews.apply(lemmatisation)
    reviews = reviews.apply(str)
    for text in reviews:
        new_text_vectorized = vectorizer.transform([text])

        prediction =  model_clf_CV_trustpilot.predict(new_text_vectorized)
        result.append(prediction[0])
    return result

In [45]:
data["sentiments"] = extract_sentiment_cv(data["review_text"], model_clf_CV_trustpilot)

In [46]:
data.head()

Unnamed: 0.1,Unnamed: 0,firm_id,firm_name,review_url,review_title,review_text,review_date,note,reponse,author_name,author_url,author_localisation,experience_date,extract_date,sentiments
0,0,turbodebt.com,TurboDebt,/reviews/646cba0dc423446286686604,Alvaro made my experience very…,Alvaro made my experience very satisfactory! I...,2023-05-23,5.0,True,Keontra Reid,/users/646cba0ba8905b00124cdbfb,US,2023-05-22,2024-08-23,positif
1,1,turbodebt.com,TurboDebt,/reviews/646cad6cc423446286685c19,Great company to work with they really…,Great company to work with they really underst...,2023-05-23,5.0,True,Ismael Luciano,/users/646cad6b05330f0014134602,US,2023-05-22,2024-08-23,positif
2,2,turbodebt.com,TurboDebt,/reviews/646c8127706f837cb1eff2f6,HELPFUL..,HELPFUL... HONEST...TRUTHUL!!!!!!!!!!!!!!!!!!!,2023-05-23,5.0,True,Tony RODRIGUEZ,/users/646c8126a8905b00124cad7c,US,2023-05-22,2024-08-23,positif
3,3,turbodebt.com,TurboDebt,/reviews/646c6c27706f837cb1efe4ad,Making it very easy to understand and…,Making it very easy to understand and answerin...,2023-05-23,5.0,True,Melinda Hall,/users/646c6c254be4ac0013350e3c,US,2023-05-22,2024-08-23,positif
4,4,turbodebt.com,TurboDebt,/reviews/646c4e46706f837cb1efd615,Leif was awesome,"Leif was awesome. Hes so friendly, easy to tal...",2023-05-23,5.0,True,SDS,/users/646c4e454be4ac001334fb5f,US,2023-05-22,2024-08-23,positif


In [47]:
data.to_csv('data_final.csv')