# II - Quick Benchmark of Classical Classification Algorithms

In this notebook, we establish benchmark results for predicting the **sentiment** (positive or negative) of movie reviews using classical machine learning algorithms.

These results will serve as a point of comparison to evaluate the performance of **Large Language Models (LLMs)** in the following notebook: `3_prediction_BERT.ipynb`.

The steps are as follows:

### 1 - Building the Vector Representation of Texts

Classical ML algorithms require textual data to be represented as **vectors**. Therefore, the first step is to transform movie reviews into numerical vectors.

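We use the **bag-of-words** method, which represents each review as a vector with one dimension per vocabulary term. The vector length therefore equals the vocabulary size, not the review length. The longest reviews contain around **2500 words**, and the logistic regression pipeline below caps the vocabulary at the 2500 most frequent terms (`max_features=2500`), so each review becomes a vector of size 2500. In `scikit-learn`, raw term counts are produced by the `CountVectorizer` class.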

However, `CountVectorizer` encodes the text using raw word counts, which can give excessive weight to frequent words even when they are not discriminative. To address this, we use **TF-IDF (Term Frequency - Inverse Document Frequency)** weighting, which down-weights terms that appear in many documents of the corpus.

In `scikit-learn`, this is implemented by the `TfidfVectorizer` class, which directly produces a bag-of-words representation weighted by TF-IDF.
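
To make the difference between raw counts and TF-IDF weights concrete, here is a minimal sketch on a toy corpus. The three sentences and the variable names are hypothetical and are not taken from the review dataset; the default tokenizer settings are assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus (assumption: not part of the review dataset).
toy_reviews = [
    "a great movie with a great cast",
    "a boring movie with a weak plot",
    "great plot and great acting",
]

# Raw term counts: one column per vocabulary term.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_reviews)

# TF-IDF weights computed with the same default tokenization.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(toy_reviews)

print("Vocabulary size:", len(count_vec.vocabulary_))
print("Matrix shapes:", counts.shape, tfidf.shape)

# In the second review, "movie" and "boring" both occur once, but "movie"
# also appears in another review, so TF-IDF gives it a lower weight.
row = 1
for term in ("movie", "boring"):
    raw = counts[row, count_vec.vocabulary_[term]]
    weighted = tfidf[row, tfidf_vec.vocabulary_[term]]
    print(f"{term}: count={raw}, tf-idf={weighted:.3f}")
```

Both matrices have one row per review and one column per vocabulary term; only the values differ.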

### 2 - Dimensionality Reduction: Latent Semantic Analysis (LSA)

We follow one of the approaches described in the paper *Learning Word Vectors for Sentiment Analysis* by Maas et al. (2011), applying dimensionality reduction to the term-document matrices using **Singular Value Decomposition (SVD)**. This technique is also known as **Latent Semantic Analysis (LSA)** and helps capture latent semantic structures in the data.
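
As an illustration of the LSA step, the sketch below applies `TruncatedSVD` to a TF-IDF matrix built from a hypothetical toy corpus. The corpus and the choice of 2 components are assumptions made for readability; the notebook itself uses 100 components.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (assumption: not part of the review dataset).
toy_reviews = [
    "great movie with a great cast",
    "boring movie with a weak plot",
    "great plot and great acting",
    "weak acting and a boring cast",
]

# Sparse TF-IDF matrix: one row per review, one column per term.
tfidf_matrix = TfidfVectorizer().fit_transform(toy_reviews)

# Project every review onto 2 latent dimensions (LSA).
svd = TruncatedSVD(n_components=2, random_state=42)
lsa_vectors = svd.fit_transform(tfidf_matrix)

print("TF-IDF shape:", tfidf_matrix.shape)
print("LSA shape:", lsa_vectors.shape)
print("Explained variance ratio:", svd.explained_variance_ratio_)

# Unlike TF-IDF weights, the projected coordinates are not constrained to be
# non-negative, which matters for models that expect count-like inputs.
print("Any negative coordinate:", bool((lsa_vectors < 0).any()))
```

`TruncatedSVD` works directly on sparse matrices and does not center the data, which is why it is the usual choice for term-document matrices.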

### 3 - Training Classification Algorithms

Using the obtained vector representations, we compare the performance of two classification algorithms:

- `MultinomialNB`: a naive Bayes classifier designed for count-like features; it requires non-negative inputs, and TF-IDF weights also work in practice.
- `LogisticRegression`: logistic regression, well-suited for binary classification tasks.

These models are trained on the `df_train` set, using **cross-validation** to select the optimal value of certain **hyperparameters** from a grid:

- `alpha` for the `MultinomialNB` model, which is the additive (Laplace/Lidstone) smoothing parameter,
- `C` for `LogisticRegression`, which is the inverse of the regularization strength.
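
The hyperparameter search described above can be sketched on a small scale as follows. The texts, labels, grid values, and `cv=2` are hypothetical choices made so the example runs in a fraction of a second; they are not the settings used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data (assumption): 1 = positive, 0 = negative.
toy_texts = [
    "great movie", "wonderful acting", "great plot", "loved the cast",
    "boring movie", "terrible acting", "weak plot", "hated the cast",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=300)),
])

# The "<step name>__<parameter>" syntax targets a parameter of a pipeline step.
toy_grid = {"clf__C": [0.1, 1, 10]}

# Each candidate value is scored by cross-validated accuracy; the best one
# is then refit on the full training data.
search = GridSearchCV(toy_pipeline, toy_grid, cv=2, scoring="accuracy")
search.fit(toy_texts, toy_labels)

print("Best params:", search.best_params_)
print("Mean CV accuracy per candidate:", search.cv_results_["mean_test_score"])
```

Because the vectorizer sits inside the pipeline, it is refit on each training fold, so no vocabulary or IDF statistics leak from the validation fold.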

### 4 - Performance Evaluation

Once the models are trained, we evaluate them on the `df_test` set. We compare the performance by generating **confusion matrices** and **classification reports**, which display the precision, recall, and F1-score for each class (positive/negative).
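
To show how the reported metrics relate to the confusion matrix, here is a small sketch on hypothetical labels (the two lists are made up for illustration). Rows of the matrix are true classes and columns are predicted classes.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels (assumption): 0 = negative, 1 = positive.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm.ravel()

# Precision and recall for the positive class, computed by hand.
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print("Confusion matrix:\n", cm)
print(f"Positive class: precision={precision_pos:.2f}, "
      f"recall={recall_pos:.2f}, f1={f1_pos:.2f}")

# classification_report prints the same quantities for every class.
print(classification_report(y_true_toy, y_pred_toy))
```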


---

In [None]:
"""Library imports"""
# Standard library
import os
import tarfile
import urllib.request

# Third-party libraries
import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Scikit-learn modules
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

In [2]:
# Load the training and test data
df_train = pd.read_parquet("data/df_train.parquet")
df_test = pd.read_parquet("data/df_test.parquet")

In [None]:
# Shrink the dataset for faster runs (disabled)
# df_small_train = df_train.iloc[:100]
# df_small_test = df_train.iloc[100:110]

X_train = df_train['texte']
y_train = df_train['label']
X_test = df_test['texte']
y_test = df_test['label']

In [7]:
"""Definition of the pipelines"""
# TF-IDF + SVD (LSA) + Logistic Regression pipeline
pipeline_logistic = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=2500)),
    ('svd', TruncatedSVD(n_components=100, random_state=42)),
    ('clf', LogisticRegression(max_iter=300))
])

# TF-IDF + MultinomialNB pipeline
# No SVD step here: TruncatedSVD outputs contain negative values,
# and MultinomialNB raises an error on negative features.
pipeline_nb = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('clf', MultinomialNB())
])

In [8]:
"""Hyperparameter search grids used during cross-validation"""
# Search grid for logistic regression
param_grid_logistic = {
    'clf__C': [0.01, 0.1, 1, 10]
}

# Simpler grid for NB (alpha = smoothing)
param_grid_nb = {
    'clf__alpha': [0.5, 1.0, 1.5]
}

In [10]:
"""Training the logistic regression model"""
grid_logistic = GridSearchCV(pipeline_logistic, param_grid_logistic, cv=3, scoring='accuracy')
grid_logistic.fit(X_train, y_train)

In [None]:
"""Saving the trained model"""
model_filename = "models/logistic_regression_tfidf_svd_gridsearch.joblib"
# Make sure the target folder exists before writing
os.makedirs(os.path.dirname(model_filename), exist_ok=True)
joblib.dump(grid_logistic, model_filename)
print(f"Model saved to: {model_filename}")

Model saved to: models/logistic_regression_tfidf_svd_gridsearch.joblib


In [16]:
"""Evaluation of the logistic regression model"""

# Load the saved model
loaded_model = joblib.load("models/logistic_regression_tfidf_svd_gridsearch.joblib")

# Predict on the test set with the loaded model
y_pred_logistic = loaded_model.predict(X_test)

# Display the results
print("Logistic Regression - Best Params:", loaded_model.best_params_)
print("Classification Report - Logistic Regression:\n", classification_report(y_test, y_pred_logistic))
print("Confusion Matrix - Logistic Regression:\n", confusion_matrix(y_test, y_pred_logistic))

Logistic Regression - Best Params: {'clf__C': 10}
Classification Report - Logistic Regression:
               precision    recall  f1-score   support

           0       0.86      0.85      0.85     12500
           1       0.85      0.86      0.85     12500

    accuracy                           0.85     25000
   macro avg       0.85      0.85      0.85     25000
weighted avg       0.85      0.85      0.85     25000

Confusion Matrix - Logistic Regression:
 [[10564  1936]
 [ 1757 10743]]


In [None]:
"""Training the naive Bayes model"""
grid_nb = GridSearchCV(pipeline_nb, param_grid_nb, cv=3, scoring='accuracy')
grid_nb.fit(X_train, y_train)