Work plan:
1) Run the exploratory analysis of the dataset and train a prediction model. Use MLflow to track the experiments.
    - create a new environment: OK / NLP_env
    - create a git repository
    - create and fill in the README
2) Install MLflow: OK
3) EDA & training (DST notebooks): OK but too slow. Experiment on a small 20% sample: OK
- Troubleshooting: conda/Python/cmd terminal conflicts in VS Code
2) modify the code to save the best model: OK
4) split the initial dataset into train_val/test
5) rebuild the MLflow pipeline so that it includes the preprocessing, using a CountVectorizer fitted on the training data
https://learn.microsoft.com/en-us/azure/machine-learning/how-to-log-mlflow-models?view=azureml-api-2&tabs=wrapper

3) Build an API & a Docker container
6) Build a Streamlit app for visualizations
7) Tune the hyperparameters with Hyperopt and MLflow autolog, drawing on this code: https://towardsdatascience.com/training-xgboost-with-mlflow-experiments-and-hyperopt-c0d3a4994ea6

Dataset: https://www.kaggle.com/datasets/nelgiriyewithana/emotions

Startup commands
>conda activate NLP_env
>mlflow server --host 127.0.0.1 --port 8080
>cd "C:\Users\thiba\Documents\Projets data\202402_NLP_emotions"

Run an experiment and save the model
>python train_GBC.py 0.1 100 0.5
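
The meaning of the three positional values is not stated in these notes. As a minimal sketch, assuming they stand for learning_rate, n_estimators and subsample (hypothetical names, as is the parse_args helper), a script such as train_GBC.py could read them like this:

```python
import argparse


def parse_args(argv=None):
    """Parse three positional hyperparameters (their names are assumptions)."""
    parser = argparse.ArgumentParser(description="Train a GradientBoostingClassifier")
    parser.add_argument("learning_rate", type=float, help="assumed: shrinkage rate")
    parser.add_argument("n_estimators", type=int, help="assumed: number of boosting stages")
    parser.add_argument("subsample", type=float, help="assumed: fraction of rows per stage")
    return parser.parse_args(argv)


# Same values as the command above, passed explicitly instead of via sys.argv.
args = parse_args(["0.1", "100", "0.5"])
print(args.learning_rate, args.n_estimators, args.subsample)
```

Typed positional arguments make a wrong call fail early with a usage message instead of a conversion error deep in the training code.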


To deploy and compute predictions:
1) run predict.py to test
2) use the MLflow CLI: mlflow models serve -m '/home/ubuntu/MLflow/mlruns/YOUR_EXPERIMENT_ID/YOUR_RUN_ID/artifacts/rf_apples' --port 5002

then query it:

    curl http://localhost:5002/invocations -H 'Content-Type: application/json' -d '{ "dataframe_records": [{"average_temperature":30.58472685635918, "rainfall":"6.786844618818696", "weekend":"0", "holiday":0, "price_per_kg":2.5024636658836807, "promo":0, "previous_days_demand":844.9940172482485}] }'
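
The same request can be sent from Python. The sketch below assumes the served model expects a single text column, as in this project's test samples. build_payload and query_model are hypothetical helper names, and the URL simply mirrors the curl call above.

```python
import json
import urllib.request


def build_payload(texts):
    """Build a JSON body in the `dataframe_records` format (one dict per row)."""
    return json.dumps({"dataframe_records": [{"text": t} for t in texts]})


def query_model(texts, url="http://localhost:5002/invocations"):
    """POST the payload to a running `mlflow models serve` endpoint."""
    request = urllib.request.Request(
        url,
        data=build_payload(texts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Only the payload is built here; query_model needs a live server.
payload = build_payload(["i feel so lost"])
print(payload)
```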


Serve the model from the registry or through an API request: https://www.youtube.com/watch?v=RVMIibDbzaE


How should preprocessing be handled in production?
2. You can also create another CountVectorizer if you really want to (not advisable, since it wastes space and you would still need the same parameters for your CV) and reuse the same features.

cv_train = CountVectorizer()  # pass the desired parameters here

X_train = cv_train.fit_transform(train_data)

# reuse the vocabulary learned on the training data, with the same parameters
cv_test = CountVectorizer(vocabulary=cv_train.vocabulary_)

# the vocabulary is fixed, so transform() is enough (no second fit)
X_test = cv_test.transform(test_data)
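
An alternative that avoids managing a second vectorizer is to bundle the CountVectorizer and the classifier in a single scikit-learn Pipeline, so that the vocabulary learned on the training data travels with the model. A minimal sketch on made-up toy data (the texts, labels and n_estimators=10 are illustrative assumptions, not the project's settings):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration only.
train_texts = ["i feel so sad", "i feel so happy", "sad and lost", "happy and glad"]
train_labels = [0, 1, 0, 1]

pipeline = Pipeline(
    steps=[
        ("vectorizer", CountVectorizer()),
        ("classifier", GradientBoostingClassifier(n_estimators=10, random_state=42)),
    ]
)
# fit() learns the vocabulary and trains the classifier in one call.
pipeline.fit(train_texts, train_labels)

# Raw strings go in; words never seen during fit ("today") are simply ignored.
predictions = pipeline.predict(["i am sad today", "so happy"])
print(predictions)
```

Because the pipeline accepts raw text, the fitted object can be logged as one artifact (for example with mlflow.sklearn.log_model) and served without a separate preprocessing step.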


pre-commit:

    Run pre-commit install to set up the git hook scripts.

    (Optional) Run against all the files:
    $ pre-commit run --all-files


In [3]:
import pandas as pd
import os
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import GradientBoostingClassifier
#Support vector machines
from sklearn import svm
#KNN
from sklearn import neighbors
#Decision trees
from sklearn.tree import DecisionTreeClassifier
#Random forest
from sklearn.ensemble import RandomForestClassifier

# Ensemble methods
# Boosting and bagging
# AdaBoost
from sklearn.ensemble import AdaBoostClassifier
#Bagging
from sklearn.ensemble import BaggingClassifier

#metrics
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, classification_report


In [2]:
#Data discovery
df = pd.read_csv("../data/text.csv", index_col = 0)
df.head()

                                                text  label
0      i just feel really helpless and heavy hearted      4
1  ive enjoyed being able to slouch about relax a...      0
2  i gave up my internship with the dmrg and am f...      4
3                         i dont know i feel so lost      0
4  i am a kindergarten teacher and i am thoroughl...      4


In [3]:
#Import labels
target_labels = {
    0 : "sadness",
    1 : "joy",
    2 : "love",
    3: "anger",
    4: "fear",
    5: "surprise"
}

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 416809 entries, 0 to 416808
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    416809 non-null  object
 1   label   416809 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 9.5+ MB


In [10]:
for i in range(6):
    print(f"\nSample tweets for the emotion: {target_labels[i]}")
    print(df[df["label"] == i].head())


Sample tweets for the emotion: sadness
                                                 text  label
1   ive enjoyed being able to slouch about relax a...      0
3                          i dont know i feel so lost      0
5          i was beginning to feel quite disheartened      0
9   i can still lose the weight without feeling de...      0
11  im feeling a little like a damaged tree and th...      0

Sample tweets for the emotion: joy
                                                 text  label
7   i fear that they won t ever feel that deliciou...      1
10  i try to be nice though so if you get a bitchy...      1
12  i have officially graduated im not feeling as ...      1
14  i feel my portfolio demonstrates how eager i a...      1
15  i may be more biased than the next because i h...      1

Sample tweets for the emotion: love
                                                 text  label
6   i would think that whomever would be lucky eno...      2
30  i gue

In [5]:
# Train/test split
X = df["text"]
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, shuffle=True)
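
The label counts in this dataset are uneven, so stratifying the split may be worthwhile, and the work plan also calls for a separate validation set. A sketch on toy data, assuming a 60/20/20 split is acceptable (the ratios and the toy labels are illustrative assumptions):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy imbalanced data, made up for illustration only.
texts = [f"sample {i}" for i in range(100)]
labels = [0] * 60 + [1] * 30 + [2] * 10

# First cut: hold out 20% as the final test set, keeping class proportions.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)
# Second cut: 25% of the remaining 80% gives 60/20/20 overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=42, stratify=y_train_val
)

print(Counter(y_train), Counter(y_val), Counter(y_test))
```

With stratify set, each subset keeps the 60/30/10 class ratio of the toy labels, which makes scores on the rare classes less dependent on the luck of the draw.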

In [6]:
# Vectorization (trial run on the first 10 training texts)
vectorizer = CountVectorizer()
test_train = vectorizer.fit_transform(X_train[:10])
test_train

<10x112 sparse matrix of type '<class 'numpy.int64'>'
	with 140 stored elements in Compressed Sparse Row format>

In [20]:
type(vectorizer.vocabulary_)

dict

In [18]:
print(test_train.todense())

[[0 0 0 ... 0 0 0]
 [1 0 0 ... 0 1 1]
 [0 1 0 ... 1 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [7]:
# Vectorization: fit on the training set only, then apply to the test set
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
X_train

<333447x67923 sparse matrix of type '<class 'numpy.int64'>'
	with 5212081 stored elements in Compressed Sparse Row format>

In [23]:
#Model training
clf_knn = neighbors.KNeighborsClassifier()
clf_svc = svm.SVC()
clf_bg = BaggingClassifier(n_estimators = 1000, oob_score = True)
clf_dt = DecisionTreeClassifier(max_depth = 5)

# scikit-learn >= 1.2 names this parameter "estimator" ("base_estimator" was removed in 1.4)
clf_ada = AdaBoostClassifier(estimator = clf_dt, n_estimators = 400)
clf_rf = RandomForestClassifier(n_jobs = -1)
clf_gbc = GradientBoostingClassifier()



In [25]:
#KNN
#clf_knn.fit(X_train, y_train)
#y_pred = clf_knn.predict(X_test)
#print("Score accuracy train : ", clf_knn.score(X_train, y_train))
#print("Score accuracy test : ", clf_knn.score(X_test, y_test))


KeyboardInterrupt: 

In [27]:
from sklearn.metrics import classification_report


In [None]:
#SVC
classifier = clf_svc
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print("Score accuracy train : ", classifier.score(X_train, y_train))
print("Score accuracy test : ", classifier.score(X_test, y_test))
print(classification_report(y_test, y_pred))


In [None]:
#BaggingClassifier
classifier = clf_bg
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print("Score accuracy train : ", classifier.score(X_train, y_train))
print("Score accuracy test : ", classifier.score(X_test, y_test))
print(classification_report(y_test, y_pred))


In [None]:
#DecisionTreeClassifier
classifier = clf_dt
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print("Score accuracy train : ", classifier.score(X_train, y_train))
print("Score accuracy test : ", classifier.score(X_test, y_test))
print(classification_report(y_test, y_pred))


In [None]:
#AdaBoost
classifier = clf_ada
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print("Score accuracy train : ", classifier.score(X_train, y_train))
print("Score accuracy test : ", classifier.score(X_test, y_test))
print(classification_report(y_test, y_pred))


In [None]:
#RandomForest
classifier = clf_rf
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print("Score accuracy train : ", classifier.score(X_train, y_train))
print("Score accuracy test : ", classifier.score(X_test, y_test))
print(classification_report(y_test, y_pred))


In [26]:
#GradientBoosting
classifier = clf_gbc
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print("Score accuracy train : ", classifier.score(X_train, y_train))
print("Score accuracy test : ", classifier.score(X_test, y_test))
print(classification_report(y_test, y_pred))


Score accuracy train :  0.8523573461449646
Score accuracy test :  0.8505314171924858


NameError: name 'classification_report' is not defined

In [28]:
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.94      0.84      0.89     24201
           1       0.77      0.96      0.85     28164
           2       0.89      0.67      0.76      6929
           3       0.94      0.79      0.86     11441
           4       0.93      0.74      0.82      9594
           5       0.70      0.92      0.79      3033

    accuracy                           0.85     83362
   macro avg       0.86      0.82      0.83     83362
weighted avg       0.87      0.85      0.85     83362



In [18]:
# Shrink the dataset to speed up the tests
df_r = df.sample(frac = 0.02)
df_r.shape

(8336, 2)
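
Without a random_state, df.sample draws a different subset on every run, and it does not guarantee that each label keeps its share. A hedged sketch of a repeatable, per-label subsample using pandas' groupby(...).sample; the toy frame, its column values and the 10% fraction are made up for illustration:

```python
import pandas as pd

# Toy frame standing in for `df`; the rows are made up.
toy = pd.DataFrame(
    {
        "text": [f"tweet {i}" for i in range(1000)],
        "label": [0] * 700 + [1] * 250 + [2] * 50,
    }
)

# Draw 10% of each label separately, with a fixed seed for repeatability.
toy_r = toy.groupby("label").sample(frac=0.1, random_state=42)

print(toy_r["label"].value_counts().sort_index())
```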

In [9]:
X_r = df_r["text"]
y_r = df_r["label"]
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_r, y_r, test_size = 0.2, random_state = 42)
vectorizer2 = CountVectorizer()
X_train_r = vectorizer2.fit_transform(X_train_r)
X_test_r = vectorizer2.transform(X_test_r)


In [10]:
#GradientBoosting
classifier = GradientBoostingClassifier()
classifier.fit(X_train_r, y_train_r)
y_pred_r = classifier.predict(X_test_r)
print("Score accuracy train : ", classifier.score(X_train_r, y_train_r))
print("Score accuracy test : ", classifier.score(X_test_r, y_test_r))
print(classification_report(y_test_r, y_pred_r))


Score accuracy train :  0.8980203959208158
Score accuracy test :  0.842925659472422
              precision    recall  f1-score   support

           0       0.92      0.85      0.88       440
           1       0.79      0.93      0.85       613
           2       0.78      0.78      0.78       117
           3       0.93      0.72      0.81       236
           4       0.84      0.74      0.79       203
           5       0.89      0.81      0.85        59

    accuracy                           0.84      1668
   macro avg       0.86      0.81      0.83      1668
weighted avg       0.85      0.84      0.84      1668



In [12]:
def eval_metrics(actual, pred):
    acc = accuracy_score(actual, pred)
    prec = precision_score(actual, pred, average = "macro")
    recall = recall_score(actual, pred, average = "macro")
    f1 = f1_score(actual, pred, average = "macro")
    return acc, prec, recall, f1

print(eval_metrics(y_test_r, y_pred_r))

(0.842925659472422, 0.8581179038075589, 0.8064390452019726, 0.8279587429379823)


In [16]:
def eval_metrics(actual, pred):
    acc = accuracy_score(actual, pred)
    prec = precision_score(actual, pred, average = "weighted")
    recall = recall_score(actual, pred, average = "weighted")
    f1 = f1_score(actual, pred, average = "weighted")
    return acc, prec, recall, f1

print(eval_metrics(y_test_r, y_pred_r))

(0.842925659472422, 0.8512824475798922, 0.842925659472422, 0.8420889216636167)
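
The two cells above differ only in the average argument. A minimal pure-Python sketch of what the two averaging modes do for recall, on made-up labels (per_class_recall and macro_and_weighted are hypothetical helpers, not scikit-learn functions):

```python
from collections import Counter


def per_class_recall(actual, pred):
    """Recall per class: correct predictions divided by true occurrences."""
    support = Counter(actual)
    hits = Counter(a for a, p in zip(actual, pred) if a == p)
    return {c: hits[c] / support[c] for c in support}, support


def macro_and_weighted(actual, pred):
    """Return (macro recall, support-weighted recall)."""
    recalls, support = per_class_recall(actual, pred)
    macro = sum(recalls.values()) / len(recalls)  # every class counts equally
    weighted = sum(recalls[c] * support[c] for c in recalls) / len(actual)
    return macro, weighted


# Toy labels: class 0 is frequent and well predicted, class 1 is rare and not.
actual = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
macro, weighted = macro_and_weighted(actual, pred)
print(macro, weighted)
```

Macro averaging exposes weak performance on rare classes, while weighted averaging follows the frequent ones. Weighted recall is always equal to accuracy, which is why the first and third values in the weighted output above are identical.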


In [4]:

#Import data
PATH = r"C:\Users\thiba\Documents\Projets data\202402_NLP_emotions\data"
fichier = "test_samples/Echantillon_de_test_0.csv"
df = pd.read_csv(os.path.join(PATH, fichier), index_col = 0)

In [6]:
df["label"].value_counts().sort_index()

label
0    4893
1    5663
2    1371
3    2275
4    1860
5     610
Name: count, dtype: int64

In [7]:
df.iloc[0]

text     i hate feeling vulnerable and powerless
label                                          4
Name: 230942, dtype: object

In [8]:
PATH = r"C:\Users\thiba\Documents\Projets data\202402_NLP_emotions\data"
fichier = "test_samples/Echantillon_de_test_0.csv"
df2 = pd.read_csv(os.path.join(PATH, fichier), index_col = 0)
df2 = df2.drop("label", axis = 1)
df2.head()

                                                     text
230942            i hate feeling vulnerable and powerless
90579             i feel a real sense of terrified inside
270788   im feeling naughty i might have a cheeky baileys
135111  i have an idea and no way to write it down and...
339507  i went off on holiday feeling fab to be lighte...


In [13]:
import mlflow
mlflow.set_tracking_uri(uri="http://127.0.0.1:8080")

#import from registry
model_name = "Sentiment_classifier_GBC"
model_version = 3
model = mlflow.pyfunc.load_model(model_uri=f"models:/{model_name}/{model_version}")


In [14]:
model.predict(df2)

array([1], dtype=int64)
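
The model returns integer class ids. A small sketch for turning them back into emotion names with the target_labels mapping defined earlier (decode_predictions is a hypothetical helper, and the input ids are made up):

```python
# Same mapping as the target_labels cell earlier in the notebook.
target_labels = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}


def decode_predictions(predictions):
    """Map integer class ids (Python ints or NumPy integers) to emotion names."""
    return [target_labels[int(p)] for p in predictions]


decoded = decode_predictions([1, 4, 0])
print(decoded)
```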