# <font color="#114b98">Automatically categorize questions</font>

## <font color="#114b98">Notebook for testing different models</font>

**Stack Overflow** is a well-known question-and-answer site for software development.

The goal of this project is to develop a **tag suggestion** system for the site. It takes the form of a machine learning algorithm that automatically assigns several relevant tags to a question.

**Deliverable**: a notebook testing different models

**Objectives**: compare the models and generate tags with each of them

## <font color="#114b98">Contents</font>
[1. Loading the dataset](#section_1)

[2. Unsupervised approach](#section_2)

[3. Supervised approach](#section_3)

[4. Supervised approach with word embeddings: Word2Vec](#section_4)

[5. Supervised approach with word embeddings: BERT](#section_5)

[6. Supervised approach with sentence embeddings: USE](#section_6)

[7. Choosing the model for the final code to deploy](#section_7)

## <font color="#114b98" id="section_1">1. Loading the dataset</font>

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import time
import ast
import random
from collections import Counter
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [6]:
# %load_ext pycodestyle_magic
# %pycodestyle_on

In [7]:
plt.rc('axes', titlesize=22)
plt.rc('axes', labelsize=18)
titleprops = {'fontsize':20}
textprops = {'fontsize':15}
plt.style.use('ggplot')

In [8]:
main_path = 'N:/5 - WORK/1 - Projets/Projet 5/'
data = pd.read_csv(main_path+'saved_ressources/'+'data_cleaned.csv', encoding='utf8')

In [9]:
data["Texts"] = data["Texts"].apply(lambda x: ast.literal_eval(x))

In [10]:
data["Tags"] = data["Tags"].apply(lambda x: ast.literal_eval(x))

In [11]:
data.head()

Unnamed: 0,Tags,Texts,Sentences
0,"[javascript, jquery, string, date, object]","[jquery, javascript, convert, date, string, da...",jquery javascript convert date string date hav...
1,"[vba, excel, function, size, byte]","[excel, function, size, byte, return, file, si...",vba excel function for returning file size byt...
2,"[git, timezone, format, timestamp, timezone-of...","[git, timezone, timestamp, format, git, way, t...",git timezone and timestamp format from git can...
3,"[python, request, python-3.x, response, wsgi]","[wsgi, request, response, request, response, c...",wsgi request and response wrappers for python ...
4,"[linux, unix, process, path, environment]","[path, process, linux, environment, path, proc...",how get the path process unix linux windows en...


In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Tags       1000 non-null   object
 1   Texts      1000 non-null   object
 2   Sentences  1000 non-null   object
dtypes: object(3)
memory usage: 23.6+ KB


The full dataset is too large for the computation time available to me, so I work on a sample.

I also choose to keep the observations for which the similarity between the Texts and Tags columns is highest.

This lets me check how relevant the tags proposed by my models are.
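To make the selection criterion concrete, here is a minimal sketch of the Jaccard similarity between a question's tags and its tokens. The two toy lists are invented for illustration and do not come from the dataset.

```python
# Toy example (made-up values): Jaccard similarity = |A ∩ B| / |A ∪ B|
tags = ["python", "pandas", "csv"]
tokens = ["python", "pandas", "read", "file"]

shared = set(tags) & set(tokens)     # words present in both lists
combined = set(tags) | set(tokens)   # distinct words across both lists
similarity = len(shared) / len(combined)

print(similarity)  # 0.4 (2 shared words out of 5 distinct words)
```

A question whose tags all appear in its text scores close to 1, which is why sorting on this value keeps the questions whose tags can in principle be recovered from the text.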

In [None]:
sample_size = 2000

In [None]:
def jaccard_similarity(list1, list2):
    """Jaccard similarity between two lists, compared as sets."""
    set1 = set(list1)
    set2 = set(list2)
    union = set1 | set2
    if not union:
        # Two empty lists: nothing to compare, treat as zero overlap
        return 0.0
    return len(set1 & set2) / len(union)

In [None]:
def get_highest_similarity_rows(data, col1, col2, n):
    data["jaccard_similarity"] = data.apply(lambda x: jaccard_similarity(x[col1], x[col2]), axis=1)
    data = data.sort_values(by="jaccard_similarity", ascending=False)
    return data.head(n)

In [None]:
data_sample = get_highest_similarity_rows(data, "Tags", "Texts", sample_size)
data_sample = data_sample[['Tags', 'Texts', 'Sentences', 'jaccard_similarity']]
data_sample.drop(['jaccard_similarity'], axis=1, inplace=True)
data_sample.reset_index(inplace=True, drop=True)
print(data_sample.shape)
data_sample.head()

To set up a clean evaluation method, I split the dataset into two parts:
 - the first is used to train the models
 - the second is used to evaluate some of the models on data they have not seen

In [18]:
texts_train, texts_eval, \
tags_train, tags_eval, \
sentences_train, sentences_eval = train_test_split(
    data_sample["Texts"],
    data_sample["Tags"],
    data_sample["Sentences"],
    test_size=0.5,
    random_state=42
)

In [19]:
texts_list = texts_train.to_list()
tags_list = tags_train.to_list()
sentences = sentences_train.to_list()
flat_texts = [" ".join(text) for text in texts_list]
flat_tags = [" ".join(tag) for tag in tags_list]
vocabulary_texts = list(set([word for item in texts_list for word in item]))
vocabulary_tags = list(set([word for item in tags_list for word in item]))

In [20]:
texts_list_eval = texts_eval.to_list()
tags_list_eval = tags_eval.to_list()
sentences_eval = sentences_eval.to_list()
flat_texts_eval = [" ".join(text) for text in texts_list_eval]
flat_tags_eval = [" ".join(tag) for tag in tags_list_eval]
vocabulary_texts_eval = list(set([word for item in texts_list_eval for word in item]))
vocabulary_tags_eval = list(set([word for item in tags_list_eval for word in item]))

In [21]:
words = []
for text in flat_tags:
    words.extend(text.split())

counter = Counter(words)
top_200_tags = [word for word, count in counter.most_common(200)]

In [22]:
words = []
for text in flat_tags_eval:
    words.extend(text.split())

counter = Counter(words)
top_200_tags_eval = [word for word, count in counter.most_common(200)]

## <font color="#114b98" id="section_2">2. Unsupervised approach</font>

### Study

In [23]:
from sklearn.decomposition import LatentDirichletAllocation, NMF
from sklearn.metrics import silhouette_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, jaccard_score
from sklearn.preprocessing import MultiLabelBinarizer
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel
from gensim.models import Nmf
from gensim.models.ldamodel import LdaModel
from gensim.matutils import corpus2dense
from tqdm import tqdm
from sklearn.decomposition import PCA

LDA (Latent Dirichlet Allocation) is a topic modeling technique that uncovers the hidden (or "latent") themes in a collection of texts. It groups together texts that deal with the same subjects.

gensim's LdaModel class implements LDA, a generative probabilistic model used to discover the hidden topics in a text corpus. scikit-learn's LatentDirichletAllocation class implements the same model, but the two differ in implementation details such as the optimization procedure and the available parameters.

NMF (Non-negative Matrix Factorization) is another topic modeling technique. It decomposes a document-term matrix into two non-negative factor matrices and is often used to discover hidden themes in texts.

gensim's Nmf class and scikit-learn's NMF class both perform non-negative matrix factorization but use different solvers: gensim implements an online variant that processes the corpus in chunks, whereas scikit-learn offers coordinate descent and multiplicative update solvers.
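As a rough illustration of what both techniques produce, the sketch below fits scikit-learn's LDA and NMF on a four-document toy corpus and lists the highest-weighted words of each topic. The corpus, the `top_words` helper and the hyperparameters are assumptions made for this example, not the settings used in the study.

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, invented for illustration only.
toy_docs = [
    "python pandas dataframe column",
    "pandas dataframe merge python",
    "javascript jquery click event",
    "jquery event handler javascript",
]

vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(toy_docs)  # 4 documents x 10 distinct terms

# Terms ordered by their column index in the document-term matrix.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)


def top_words(model, n_words=3):
    """Return the n_words highest-weighted terms of each topic."""
    return [[terms[i] for i in topic.argsort()[::-1][:n_words]]
            for topic in model.components_]


lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics_lda = lda.fit_transform(doc_term)   # one topic distribution per document

nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
doc_topics_nmf = nmf.fit_transform(doc_term)   # non-negative document-topic weights

print("LDA topics:", top_words(lda))
print("NMF topics:", top_words(nmf))
```

Each row of `doc_topics_lda` sums to 1 and can be read as a topic distribution, while the NMF weights are only constrained to be non-negative.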

In [24]:
def determine_optimal_num_topics(data, vectorizer, n_topics_range, texts_list):
    """
    Given data, a vectorizer, a range of number of topics to test,
    and the list of texts, applies the models to the data and plots 
    the silhouette and coherence scores to help determine the optimal
    number of topics.

    """

    data = vectorizer.fit_transform(data)
    dictionary = Dictionary(texts_list)
    corpus = [dictionary.doc2bow(txt) for txt in texts_list]

    lda_scores = []
    nmf_scores = []
    coherence_nmf = []
    coherence_lda = []

    for n_topics in tqdm(n_topics_range, ascii=' >='):

        # Calculate the silhouette score for the LDA model
        lda = LatentDirichletAllocation(n_components=n_topics, 
                                        max_iter=1000)
        lda.fit(data)
        topic_assignments = lda.transform(data)
        labels = np.argmax(topic_assignments, axis=1)
        lda_scores.append(silhouette_score(topic_assignments, labels, 
                                           metric='euclidean'))

        # Calculate the silhouette score for the NMF model
        nmf = NMF(n_components=n_topics, max_iter=1000)
        nmf.fit(data)
        topic_assignments = nmf.transform(data)
        labels = np.argmax(topic_assignments, axis=1)
        nmf_scores.append(silhouette_score(topic_assignments, labels, 
                                           metric='euclidean'))

        # Calculate the coherence score for the LDA model
        lda = LdaModel(corpus, num_topics=n_topics, id2word=dictionary)
        cm_lda = CoherenceModel(model=lda, texts=texts_list, 
                                dictionary=dictionary, coherence='c_v')
        coherence_lda.append(cm_lda.get_coherence())

        # Calculate the coherence score for the NMF model
        nmf = Nmf(corpus, num_topics=n_topics, id2word=dictionary)
        cm_nmf = CoherenceModel(model=nmf, texts=texts_list, 
                                dictionary=dictionary, coherence='c_v')
        coherence_nmf.append(cm_nmf.get_coherence())

    scores = pd.DataFrame(columns=['topics_silhouette',
                                   'score_silhouette',
                                   'topics_coherence',
                                   'score_coherence'],
                          index=['LDA', 'NMF'])

    scores['topics_silhouette'] = [n_topics_range[np.argmax(lda_scores)], 
                                   n_topics_range[np.argmax(nmf_scores)]]
    scores['score_silhouette'] = [max(lda_scores), max(nmf_scores)]
    scores['topics_coherence'] = [n_topics_range[np.argmax(coherence_lda)], 
                                  n_topics_range[np.argmax(coherence_nmf)]]
    scores['score_coherence'] = [max(coherence_lda), max(coherence_nmf)]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    plt.suptitle('Silhouette and coherence scores for LDA and NMF with {}'.format(str(vectorizer).split('(')[0]))
    ax1.plot(n_topics_range, lda_scores, label='LDA')
    ax1.plot(n_topics_range, nmf_scores, label='NMF')
    ax1.set_xlabel('Number of Topics')
    ax1.set_ylabel('Silhouette score')
    ax1.legend()
    ax2.plot(n_topics_range, coherence_lda, label='LDA')
    ax2.plot(n_topics_range, coherence_nmf, label='NMF')
    ax2.set_xlabel('Number of Topics')
    ax2.set_ylabel('Coherence score')
    ax2.legend()
    plt.show()

    return scores

In [25]:
# Define the range of number of topics to test
# n_topics_range = range(2, 22, 2)

In [26]:
# Define the range of number of topics to test
n_topics_range = range(2, 6, 2)

CountVectorizer() implements the bag-of-words approach to text vectorization. It converts a set of documents into a matrix of word counts in which each row represents a document and each column a word. Each cell holds the number of times the corresponding word occurs in the corresponding document.
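The counting itself can be sketched with the standard library alone. This is an illustrative re-implementation on two invented documents, not scikit-learn's code, and it skips tokenization and sparse storage.

```python
from collections import Counter

# Hypothetical pre-tokenized documents, invented for illustration.
docs = [["python", "list", "sort", "list"],
        ["python", "dict", "sort"]]

# One column per distinct word, in alphabetical order.
vocabulary = sorted({word for doc in docs for word in doc})

# One row per document, one count per vocabulary word (0 when absent).
counts = [[Counter(doc)[word] for word in vocabulary] for doc in docs]

print(vocabulary)  # ['dict', 'list', 'python', 'sort']
print(counts)      # [[0, 2, 1, 1], [1, 0, 1, 1]]
```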

In [None]:
models_CountVectorizer = determine_optimal_num_topics(flat_texts,
                                                      CountVectorizer(),
                                                      n_topics_range,
                                                      texts_list)



In [None]:
models_CountVectorizer

TF-IDF (term frequency-inverse document frequency) weights the terms of a text according to how often they occur. It gives more weight to terms that occur frequently in a document but rarely across the whole collection.
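A small worked example may help. The sketch below uses the smoothed formula documented for scikit-learn's default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, on three invented documents; the L2 normalization that scikit-learn applies to each row by default is left out to keep the arithmetic readable.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized documents, invented for illustration.
docs = [["python", "list", "sort"],
        ["python", "dict", "key"],
        ["python", "list", "index"]]
n_docs = len(docs)

# Document frequency: number of documents containing each word.
doc_freq = Counter(word for doc in docs for word in set(doc))


def idf(word):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1


def tfidf(doc):
    """Raw term count times IDF, before any length normalization."""
    return {word: count * idf(word) for word, count in Counter(doc).items()}


weights = tfidf(docs[0])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word}: {weight:.3f}")
```

In the first document, "sort" (present in one document) outweighs "list" (two documents), which in turn outweighs "python" (all three documents, weight exactly 1).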

In [None]:
models_TfidfVectorizer = determine_optimal_num_topics(flat_texts,
                                                      TfidfVectorizer(vocabulary=vocabulary_texts),
                                                      n_topics_range,
                                                      texts_list)

In [None]:
models_TfidfVectorizer

The silhouette score measures how similar an object is to its own cluster compared with the other clusters; it ranges from -1 to 1, and the closer it is to 1, the better the clustering. The coherence score measures how "human-interpretable" the topics are; for the c_v measure used here, values closer to 1 are better.
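For intuition about the silhouette score, here is a minimal standard-library sketch on four invented 2-D points. It follows the usual definition (b - a) / max(a, b) per sample and assumes every cluster holds at least two points; it is not scikit-learn's implementation.

```python
import math

# Hypothetical 2-D points and their cluster labels (invented values).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]


def mean_distance(point, others):
    """Average Euclidean distance from point to every point in others."""
    return sum(math.dist(point, other) for other in others) / len(others)


def silhouette(i):
    """Silhouette coefficient of sample i: (b - a) / max(a, b)."""
    same = [p for j, p in enumerate(points) if labels[j] == labels[i] and j != i]
    a = mean_distance(points[i], same)  # mean distance to its own cluster
    b = min(                            # mean distance to the nearest other cluster
        mean_distance(points[i], [p for j, p in enumerate(points) if labels[j] == other])
        for other in set(labels) - {labels[i]}
    )
    return (b - a) / max(a, b)


score = sum(silhouette(i) for i in range(len(points))) / len(points)
print(round(score, 2))  # 0.9
```

The two groups are tight and far apart, so the mean score is close to 1; overlapping groups would push it toward 0 or below.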

In our case, the topics become more "human-interpretable" as their number increases.

We now need to try to obtain tags with these methods.

I choose to use only LDA from here on because it is the method that obtains the best silhouette scores.

I choose the number of topics in light of the previous results.

In [None]:
n_topics = 30

The min_df parameter sets the minimum number of documents in which a word must appear to be included in the vocabulary.

In [None]:
min_df = 5

The max_df parameter sets the maximum document frequency of a word, expressed here as a proportion of all documents; words above it are excluded.

In [None]:
max_df = 0.2
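The effect of the two thresholds can be sketched on a toy corpus. The documents and the values `toy_min_df` and `toy_max_df` below are invented for illustration and differ from the settings used in this notebook.

```python
from collections import Counter

# Hypothetical tokenized documents, invented for illustration.
docs = [["python", "error", "list"],
        ["python", "error", "dict"],
        ["python", "regex"],
        ["python", "error", "list"]]

toy_min_df = 2    # integer: keep words present in at least 2 documents
toy_max_df = 0.8  # float: drop words present in more than 80% of documents

n_docs = len(docs)
# Document frequency: number of documents containing each word.
doc_freq = Counter(word for doc in docs for word in set(doc))

kept = sorted(word for word, df in doc_freq.items()
              if df >= toy_min_df and df / n_docs <= toy_max_df)

print(kept)  # ['error', 'list']
```

Here "python" is dropped for being too common (4 of 4 documents) and "dict" and "regex" for being too rare (1 document each).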

In [None]:
# Note: scikit-learn vectorizers ignore min_df and max_df when a fixed vocabulary is passed
vectorizer_CV = CountVectorizer(vocabulary=vocabulary_texts, min_df=min_df, max_df=max_df)

In [None]:
vectorizer_TFIDF = TfidfVectorizer(vocabulary=vocabulary_texts, min_df=min_df, max_df=max_df)

In [None]:
def get_tags_from_text(texts_list, flat_texts, n_topics, vocabulary_texts, min_df, max_df):
    pred_gensim = list()
    pred_sklearn = list()
    pred_tfidf = list()
    pred_count = list()

    # Predict tags using LdaModel (gensim) on the gensim bag-of-words corpus
    dictionary = Dictionary(texts_list)
    corpus = [dictionary.doc2bow(txt) for txt in texts_list]
    lda = LdaModel(corpus, num_topics=n_topics, id2word=dictionary, random_state=42)
    for text in tqdm(texts_list, ascii=' >='):
        bow = dictionary.doc2bow(text)
        topics = lda.get_document_topics(bow, minimum_probability=0)
        topic_id, prob = max(topics, key=lambda x: x[1])
        topic_words = [w for w, p in lda.show_topic(topic_id, topn=5)]
        pred_gensim.append(topic_words)

    # Predict tags using LatentDirichletAllocation (sklearn) on the same bag-of-words corpus
    corpus_dense = corpus2dense(corpus, num_terms=len(dictionary)).T
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    lda.fit(corpus_dense)
    for text in tqdm(texts_list, ascii=' >='):
        bow = dictionary.doc2bow(text)
        dense_bow = corpus2dense([bow], num_terms=len(dictionary)).T[0]
        dense_bow = np.reshape(dense_bow, (1, -1))
        topic_distribution = lda.transform(dense_bow)
        topic_id = topic_distribution.argmax()
        top_words_indices = np.argsort(-lda.components_[topic_id])[:5]
        topic_words = [dictionary[i] for i in top_words_indices]
        pred_sklearn.append(topic_words)

    # TF-IDF variant: an sklearn LDA is fitted on the TF-IDF matrix, but the
    # tags kept for each document are its 5 highest-weighted TF-IDF terms
    vectorizer = vectorizer_TFIDF
    bow = vectorizer.fit_transform(flat_texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    topics = lda.fit_transform(bow)
    dense_bow_matrix = bow.toarray()  # densify once, outside the loop
    index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}
    for i in tqdm(range(len(texts_list)), ascii=' >='):
        topic_id = topics[i].argmax()  # dominant topic (not used for the tags)
        top_words_indices = dense_bow_matrix[i].argsort()[-5:][::-1]
        topic_words = [index_to_word[idx] for idx in top_words_indices]
        pred_tfidf.append(topic_words)

    # Count variant: an sklearn LDA is fitted on the count matrix, but the
    # tags kept for each document are its 5 most frequent terms
    vectorizer = vectorizer_CV
    bow = vectorizer.fit_transform(flat_texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    topics = lda.fit_transform(bow)
    dense_bow_matrix = bow.toarray()  # densify once, outside the loop
    index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}
    for i in tqdm(range(len(texts_list)), ascii=' >='):
        topic_id = topics[i].argmax()  # dominant topic (not used for the tags)
        top_words_indices = dense_bow_matrix[i].argsort()[-5:][::-1]
        topic_words = [index_to_word[idx] for idx in top_words_indices]
        pred_count.append(topic_words)

    return pred_gensim, pred_sklearn, pred_tfidf, pred_count

In [None]:
pred_gensim, pred_sklearn, pred_tfidf, pred_count = get_tags_from_text(texts_list,
                                                                       flat_texts,
                                                                       n_topics,
                                                                       vocabulary_texts,
                                                                       min_df,
                                                                       max_df)

In [None]:
tags_list[0:5]

In [None]:
pred_gensim[0:5]

In [None]:
pred_sklearn[0:5]

In [None]:
pred_tfidf[0:5]

In [None]:
pred_count[0:5]

The models using CountVectorizer and TfidfVectorizer appear to predict tags fairly similar to those given by users.

In [None]:
pred_tags_list = [pred_gensim, pred_sklearn, pred_tfidf, pred_count]
pred_names = ["gensim", "sklearn", "tfidf", "count"]

In [None]:
mlb = MultiLabelBinarizer(classes=top_200_tags)
tags_mlb = mlb.fit_transform(tags_list)

In [None]:
pred_tags_bin_list = [mlb.transform(pred_tags) for pred_tags in pred_tags_list]

In [None]:
def evaluate_predictions(true_tags, pred_tags_bin_list, pred_names):
    f1_scores = []
    jaccard_scores = []
    scoring_methods = ["Jaccard Score", "F1 Score"]  # same order as the metrics dict below
    for pred_tags in pred_tags_bin_list:
        f1_scores.append(f1_score(true_tags, pred_tags, average='samples'))
        jaccard_scores.append(jaccard_score(true_tags, pred_tags, average='samples'))

    metrics = {"Jaccard": jaccard_scores, "F1": f1_scores}
    metrics_df = pd.DataFrame(metrics, index=pred_names)

    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    axes = axes.ravel()
    for i, metric in enumerate(metrics.keys()):
        sns.barplot(data=metrics_df, x=metrics_df.index, y=metric, ax=axes[i])
        axes[i].set_ylabel('Score')
        axes[i].set_title(scoring_methods[i])
    plt.show()
    return metrics_df

 - F1 score: the harmonic mean of precision and recall. It ranges from 0 to 1; a score close to 1 indicates better performance and a score close to 0 poorer performance.
 - Jaccard score: a measure of the similarity between the set of predicted labels and the set of true labels. It ranges from 0 to 1; a score close to 1 indicates very high similarity and a score close to 0 very low similarity.
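Both metrics are averaged per sample here (average='samples'), meaning one score per question, then the mean over questions. The sketch below reproduces that computation by hand on two invented questions; it is a simplified illustration, not scikit-learn's implementation.

```python
# Hypothetical true and predicted tag sets for two questions (invented values).
true_tags = [{"python", "pandas", "csv"}, {"javascript", "jquery"}]
pred_tags = [{"python", "pandas", "numpy"}, {"javascript"}]


def f1(true, pred):
    """Harmonic mean of precision and recall for one question."""
    overlap = len(true & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(true)
    return 2 * precision * recall / (precision + recall)


def jaccard(true, pred):
    """Size of the intersection divided by the size of the union."""
    union = true | pred
    return len(true & pred) / len(union) if union else 0.0


pairs = list(zip(true_tags, pred_tags))
mean_f1 = sum(f1(t, p) for t, p in pairs) / len(pairs)
mean_jaccard = sum(jaccard(t, p) for t, p in pairs) / len(pairs)

print(round(mean_f1, 3), round(mean_jaccard, 3))  # 0.667 0.5
```

For the same predictions the Jaccard score is never higher than the F1 score, so the two bar charts are expected to rank the methods similarly with lower Jaccard values.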

In [None]:
evaluate_predictions(tags_mlb, pred_tags_bin_list, pred_names)

We observe better scores with a vectorizer.

### Study results

In [None]:
def jaccard_index(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    return len(s1 & s2) / len(s1 | s2)

In [None]:
def calculate_scores(true_tags, pred_tags):
    scores = [jaccard_index(t, p) for t, p in zip(true_tags, pred_tags)]
    mean_score = sum(scores) / len(scores)
    return mean_score

In [None]:
def plot_similar_tags(true_tags, pred_tags, method):
    mean_score = calculate_scores(true_tags, pred_tags)
    similar_counts = []
    
    # Distinct loop names avoid shadowing the function arguments
    for pred, true in zip(pred_tags, true_tags):
        similar_words = set(pred) & set(true)
        similar_counts.append(len(similar_words))

    counter = Counter(similar_counts)
    counter = dict(sorted(counter.items()))

    # Add missing keys to counter with value 0
    keys = set(range(0, 6))
    missing_keys = keys - set(counter.keys())
    for key in missing_keys:
        counter[key] = 0
    sorted_counter = dict(sorted(counter.items()))
        
    fig, axs = plt.subplots(1, 2, figsize=(10, 5))
    fig.suptitle(f"Tag similarity with the {method} method", fontsize=14, fontweight='bold', y=1.05)
    axs[0].bar(sorted_counter.keys(), sorted_counter.values())
    axs[0].set_xticks(range(0,6,1))
    axs[0].set_xticklabels(sorted_counter.keys(), rotation=0)
    axs[0].set_xlabel('Number of matching tags', fontsize=11)
    axs[0].set_ylabel("Number of observations", fontsize=11)
    axs[0].set_title("Number of observations by\n number of matching tags", fontsize=12)
    axs[1].pie(sorted_counter.values(), labels=sorted_counter.keys(), autopct='%1.1f%%', pctdistance=0.8)
    axs[1].legend(title='Matching\ntags', bbox_to_anchor=(1, 0.9), prop={'size': 8}, title_fontsize=10)
    axs[1].set_title("Percentage of observations by\n number of matching tags", fontsize=12)
    
    textstr = ''.join((
        r'Jaccard_index = %.2f' % (mean_score, )))
    props = dict(boxstyle='round', facecolor='white', alpha=0.5)
    axs[1].text(0.8, 0, textstr, transform=axs[1].transAxes, fontsize=12,
                verticalalignment='top', bbox=props)
    plt.show()

In [None]:
plot_similar_tags(tags_list, pred_gensim, 'LDA Gensim')

In [None]:
plot_similar_tags(tags_list, pred_sklearn, 'LDA Sklearn')

In [None]:
plot_similar_tags(tags_list, pred_tfidf, 'LDA + TFIDF')

In [None]:
plot_similar_tags(tags_list, pred_count, 'LDA + Count')

The vectorizers add real value: more of the predicted tags match the original tags.

### LDA + TF-IDF + PCA

In [None]:
# Vectorize the texts with TF-IDF
vectorizer = vectorizer_TFIDF
bow = vectorizer.fit_transform(flat_texts)

# Apply PCA to the TF-IDF matrix
pca = PCA()
pca_bow = pca.fit_transform(bow.toarray())

# Calculate the explained variance
explained_variance = pca.explained_variance_ratio_

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(16, 5))

# Plot the explained variance
axs[0].plot(np.cumsum(explained_variance))
axs[0].set_xlabel("Number of Components")
axs[0].set_ylabel("Cumulative explained variance (ratio)")
axs[0].axhline(y=0.90, linewidth=2, color='black')
axs[0].text(2, 0.91, '90% variance threshold', fontsize=14)

# Scatter Plot of PCA Results
axs[1].scatter(pca_bow[:, 0], pca_bow[:, 1])
axs[1].set_xlabel("First Principal Component")
axs[1].set_ylabel("Second Principal Component")
axs[1].set_title("Scatter Plot of PCA Results")

plt.tight_layout()
plt.show()

In [None]:
def ensure_positive_pca(X):
    pca = PCA(n_components=0.90)
    X_transformed = pca.fit_transform(X)
    if (X_transformed < 0).sum() > 0:
        X_transformed -= X_transformed.min()
    return X_transformed

In [None]:
pca_bow = ensure_positive_pca(bow.toarray())

In [None]:
# LDA is fitted on the PCA-reduced TF-IDF matrix; the tags kept for each
# document are still its 5 highest-weighted TF-IDF terms
pred_tfidf_pca = list()
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
topics = lda.fit_transform(pca_bow)
dense_bow_matrix = bow.toarray()  # densify once, outside the loop
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}
for i in tqdm(range(len(texts_list)), ascii=' >='):
    topic_id = topics[i].argmax()  # dominant topic (not used for the tags)
    top_words_indices = dense_bow_matrix[i].argsort()[-5:][::-1]
    topic_words = [index_to_word[idx] for idx in top_words_indices]
    pred_tfidf_pca.append(topic_words)

In [None]:
plot_similar_tags(tags_list, pred_tfidf_pca, 'LDA + TFIDF + PCA')

### LDA + CountVectorizer + PCA

In [None]:
# Vectorize the texts with CountVectorizer
vectorizer = vectorizer_CV
bow = vectorizer_CV.fit_transform(flat_texts)

# Apply PCA to the count matrix
pca = PCA()
pca_bow = pca.fit_transform(bow.toarray())

# Calculate the explained variance
explained_variance = pca.explained_variance_ratio_

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(16, 5))

# Plot the explained variance
axs[0].plot(np.cumsum(explained_variance))
axs[0].set_xlabel("Number of Components")
axs[0].set_ylabel("Cumulative explained variance (ratio)")
axs[0].axhline(y=0.90, linewidth=2, color='black')
axs[0].text(2, 0.91, '90% variance threshold', fontsize=14)

# Scatter Plot of PCA Results
axs[1].scatter(pca_bow[:, 0], pca_bow[:, 1])
axs[1].set_xlabel("First Principal Component")
axs[1].set_ylabel("Second Principal Component")
axs[1].set_title("Scatter Plot of PCA Results")

plt.tight_layout()
plt.show()

In [None]:
pca_bow = ensure_positive_pca(bow.toarray())

In [None]:
# LDA is fitted on the PCA-reduced count matrix; the tags kept for each
# document are still its 5 most frequent terms
pred_count_pca = list()
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
topics = lda.fit_transform(pca_bow)
dense_bow_matrix = bow.toarray()  # densify once, outside the loop
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}
for i in tqdm(range(len(texts_list)), ascii=' >='):
    topic_id = topics[i].argmax()  # dominant topic (not used for the tags)
    top_words_indices = dense_bow_matrix[i].argsort()[-5:][::-1]
    topic_words = [index_to_word[idx] for idx in top_words_indices]
    pred_count_pca.append(topic_words)

In [None]:
plot_similar_tags(tags_list, pred_count_pca, 'LDA + Count + PCA')

## <font color="#114b98" id="section_3">3. Supervised approach</font>

### Study

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
# mlb = MultiLabelBinarizer(classes=vocabulary_tags)
# tags_mlb = mlb.fit_transform(flat_tags)

In [None]:
mlb = MultiLabelBinarizer(classes=vocabulary_tags)
tags_mlb = mlb.fit_transform(tags_list)

In [None]:
mlb = MultiLabelBinarizer(classes=top_200_tags)
tags_mlb = mlb.fit_transform(tags_list)

In [None]:
classifiers = [LogisticRegression(random_state=42, max_iter=300, tol=1e-5),
               SGDClassifier(random_state=42, max_iter=300, tol=1e-5),
               RandomForestClassifier(random_state=42),
               KNeighborsClassifier(),
               MultinomialNB()]

In [None]:
def calculate_supervised_scores(flat_texts, tags_mlb, vectorizer, classifiers):

    # Create an empty dataframe to store the results
    results_df = pd.DataFrame(columns=['Classifier',
                                       'F1 Score',
                                       'Jaccard Score',
                                       'Time (s)'])

    # Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(flat_texts, tags_mlb, test_size=0.2, random_state=42)

    # Vectorize X_train and X_test
    X_train = vectorizer.fit_transform(X_train)
    X_test = vectorizer.transform(X_test)

    # Fit an independent model for each class using the OneVsRestClassifier wrapper.
    for classifier in tqdm(classifiers, ascii=' >='):
        start_time = time.time()
        ovrc = OneVsRestClassifier(classifier)
        ovrc.fit(X_train, y_train)
        y_pred_ovrc = ovrc.predict(X_test)
        end_time = time.time()

        f1 = round(f1_score(y_test, y_pred_ovrc, average='samples'), 4)
        jaccard = round(jaccard_score(y_test, y_pred_ovrc, average='samples'), 4)
        time_taken = round(end_time - start_time, 4)

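        new_row = pd.DataFrame([{'Classifier': str(classifier).split('(')[0],
                                 'F1 Score': f1,
                                 'Jaccard Score': jaccard,
                                 'Time (s)': time_taken}])
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        results_df = pd.concat([results_df, new_row], ignore_index=True)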

    return results_df

In [None]:
results_CountVectorizer = calculate_supervised_scores(flat_texts,
                                                      tags_mlb,
                                                      vectorizer_CV,
                                                      classifiers)

In [None]:
results_CountVectorizer

In [None]:
def plot_results(results_df):
    
    # Create a figure with 3 subplots
    fig, axs = plt.subplots(1, 3, figsize=(20, 10))
    # Set a color palette
    my_palette = sns.color_palette("husl", 5)

    # Set the x-axis to be a range of numerical values
    x = range(len(results_df))
    scoring_methods = ['F1 Score', 'Jaccard Score', 'Time (s)']

    # Create a subplot for each scoring method
    for i, scoring_method in enumerate(scoring_methods):
        sns.barplot(x='Classifier', 
                    y=scoring_method, 
                    data=results_df, 
                    ax=axs[i], 
                    palette=my_palette, 
                    label=scoring_method)

    # Add classifier names to x-axis
    for i in range(3):
        axs[i].set_title(scoring_methods[i])
        axs[i].set_xticks(x)
        axs[i].set_xlabel('')
        axs[i].set_ylabel('Score')
        axs[i].set_xticklabels(results_df['Classifier'], rotation=90)

    plt.show()

In [None]:
plot_results(results_CountVectorizer)

In [None]:
results_TfidfVectorizer = calculate_supervised_scores(flat_texts,
                                                      tags_mlb,
                                                      vectorizer_TFIDF,
                                                      classifiers)

In [None]:
results_TfidfVectorizer

In [None]:
plot_results(results_TfidfVectorizer)

SGDClassifier obtains the best scores regardless of the vectorizer used.

The RandomForestClassifier model takes a very long time to train.

Based on these results, SGDClassifier appears to have the best overall performance. The MultinomialNB and LogisticRegression classifiers also perform well.

### LogisticRegression + GridSearchCV

In [None]:
# Row 0 holds the LogisticRegression scores for each vectorizer
results_LR = pd.concat([results_CountVectorizer.iloc[[0]],
                        results_TfidfVectorizer.iloc[[0]]],
                       ignore_index=True)
results_LR['Classifier'] = ['CountVectorizer', 'TfidfVectorizer']
results_LR

In [None]:
plot_results(results_LR)

CountVectorizer gives slightly better performance than TF-IDF.

In [None]:
classifier_LR = LogisticRegression(random_state=42, max_iter=300, tol=1e-5)

In [None]:
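# 'lbfgs' does not support the l1 penalty, so the grid is split by penalty;
# the duplicated C value 0.01 is listed once
parameters_LR = [{'estimator__C': [0.01, 0.1],
                  'estimator__penalty': ['l2'],
                  'estimator__solver': ['lbfgs', 'liblinear']},
                 {'estimator__C': [0.01, 0.1],
                  'estimator__penalty': ['l1'],
                  'estimator__solver': ['liblinear']}]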

In [None]:
def calculate_supervised_scores(flat_texts, tags_mlb, vectorizer, classifier, parameters):
    results_df = pd.DataFrame(columns=['Classifier', 'F1 Score', 'Jaccard Score', 'Time (s)'])
    X_train, X_test, y_train, y_test = train_test_split(flat_texts, tags_mlb, test_size=0.2, random_state=42)
    X_train = vectorizer.fit_transform(X_train)
    X_test = vectorizer.transform(X_test)
    
    ovrc = OneVsRestClassifier(classifier)
    clf = GridSearchCV(ovrc, parameters, cv=5, verbose=2)
    
    start_time = time.time()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    end_time = time.time()
    
    best_params = clf.best_params_ if hasattr(clf, 'best_params_') else None
    
    f1 = round(f1_score(y_test, y_pred, average='samples'), 4)
    jaccard = round(jaccard_score(y_test, y_pred, average='samples'), 4)
    time_taken = round(end_time - start_time, 4)
    
    # DataFrame.append was removed in pandas 2.0; build the one-row frame directly
    results_df = pd.DataFrame([{'Classifier': str(classifier).split('(')[0],
                                'F1 Score': f1,
                                'Jaccard Score': jaccard,
                                'Time (s)': time_taken}])
    return results_df, best_params

In [None]:
results_LR, best_params_LR = calculate_supervised_scores(flat_texts,
                                                         tags_mlb,
                                                         vectorizer_CV,
                                                         classifier_LR,
                                                         parameters_LR)

In [None]:
results_LR

In [None]:
best_params_LR

### SGDClassifier + GridSearchCV

In [None]:
# Collect the scores of row 1 (the SGDClassifier baseline) for each vectorizer.
# pd.DataFrame is used instead of DataFrame.append, which was removed in pandas 2.0.
results_SGD = pd.DataFrame([results_CountVectorizer.iloc[1, 1:],
                            results_TfidfVectorizer.iloc[1, 1:]]).reset_index(drop=True)
results_SGD['Classifier'] = ['CountVectorizer', 'TfidfVectorizer']
results_SGD

In [None]:
plot_results(results_SGD)

CountVectorizer gives slightly better performance than TF-IDF.

In [None]:
classifier_SGD = SGDClassifier(random_state=42)

In [None]:
# 'log_loss' is the current name of the logistic loss (called 'log' before scikit-learn 1.1)
parameters_SGD = {'estimator__alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10],
                  'estimator__loss': ['hinge', 'log_loss', 'modified_huber'],
                  'estimator__penalty': ['l1', 'l2', 'elasticnet']}

In [None]:
results_SGD, best_params_SGD = calculate_supervised_scores(flat_texts,
                                                           tags_mlb,
                                                           vectorizer_TFIDF,
                                                           classifier_SGD,
                                                           parameters_SGD)


In [None]:
results_SGD

In [None]:
best_params_SGD

In [None]:
results_CountVectorizer[results_CountVectorizer['Classifier']=='SGDClassifier']

In [None]:
results_CV = pd.concat([results_CountVectorizer[results_CountVectorizer['Classifier']=='SGDClassifier'],
                        results_SGD])
results_CV.iloc[1,0] = 'SGDClassifier_CV'
results_CV

In [None]:
# Call the plot_results function
plot_results(results_CV)

The grid search with cross-validation did improve the model.

### Study results

In [None]:
params = {key.replace('estimator__', ''): value for key, value in best_params_SGD.items()}

In [None]:
classifier = SGDClassifier(random_state=42, **params)

In [None]:
ovrc = OneVsRestClassifier(classifier)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(flat_texts, tags_mlb, test_size=0.2, random_state=42)
X_train = vectorizer.fit_transform(X_train)
ovrc.fit(X_train, y_train)

In [None]:
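def get_top_5_tags(text, vectorizer, mlb, classifier):
    """Return the 5 tags with the highest decision scores, best first."""
    X = vectorizer.transform([text])
    scores = classifier.decision_function(X)
    # argsort is ascending, so reverse each row before keeping the first 5 columns
    top_5 = scores.argsort(axis=1)[:, ::-1][:, :5]
    top_5_tags = [mlb.classes_[i] for i in top_5.flatten()]
    return top_5_tags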
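
Selecting the best tags comes down to sorting the decision scores in descending order along each row and keeping the first `k` columns. The NumPy sketch below shows this on a single hypothetical score row. The class names, the scores and the helper `top_k_indices` are assumptions made for illustration.

```python
import numpy as np


def top_k_indices(scores, k=5):
    """Return the indices of the k highest scores in each row, best first."""
    order = np.argsort(scores, axis=1)[:, ::-1]  # reverse the columns, not the rows
    return order[:, :k]


# Hypothetical classes and decision scores for one question
classes = np.array(["c#", "java", "javascript", "python", "sql"])
scores = np.array([[-1.2, 0.3, 2.1, 0.9, -0.5]])

top = top_k_indices(scores, k=3)
print(classes[top[0]].tolist())  # ['javascript', 'python', 'java']
```

Note that slicing `[:, :k]` directly on an ascending `argsort` would return the lowest-scoring tags, and that `[::-1]` without a column index reverses the rows of the array rather than the order of the scores.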

In [None]:
tags_SGDC = []
for text in tqdm(flat_texts_eval):
    top_5_tags = get_top_5_tags(text, vectorizer, mlb, ovrc)
    tags_SGDC.append(top_5_tags)

In [None]:
flat_tags_eval[0:5]

In [None]:
tags_SGDC[0:5]

In [None]:
plot_similar_tags(tags_list_eval, tags_SGDC, 'Count + SGDClassifier')

## <font color="#114b98" id="section_4">4. Supervised approach with Word Embedding: Word2Vec</font>

### Trial dataset

In [None]:
from gensim.models import Word2Vec

In [None]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(texts_list,
                                                    tags_mlb,
                                                    test_size=0.2,
                                                    random_state=42)

In [None]:
# Train the Word2Vec model on your text data
w2v_model = Word2Vec(X_train, vector_size=100, window=5, min_count=5, workers=4)

In [None]:
# Create a vocabulary of only the words in the text data that are in the word2vec model
vocab = set(w2v_model.wv.key_to_index.keys())

In [None]:
# Filter the text data to only include words in the vocabulary
X_train = [[word for word in sublist if word in vocab] for sublist in X_train]
X_test = [[word for word in sublist if word in vocab] for sublist in X_test]

In [None]:
# Record the observations left with no in-vocabulary word after filtering
train_removed_indexes = {i for i, sublist in enumerate(X_train) if not sublist}
test_removed_indexes = {i for i, sublist in enumerate(X_test) if not sublist}

In [None]:
# Drop those observations from both the texts and the labels (set membership tests are fast)
X_train = [x for i, x in enumerate(X_train) if i not in train_removed_indexes]
X_test = [x for i, x in enumerate(X_test) if i not in test_removed_indexes]
y_train = [x for i, x in enumerate(y_train) if i not in train_removed_indexes]
y_test = [x for i, x in enumerate(y_test) if i not in test_removed_indexes]

In [None]:
# Create embeddings for train and test data
X_train_embedded = [np.mean([w2v_model.wv[word] for word in sentence], axis=0) for sentence in X_train]
X_test_embedded = [np.mean([w2v_model.wv[word] for word in sentence], axis=0) for sentence in X_test]
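
Each document is represented here by the mean of its word vectors. The sketch below illustrates that pooling step with a hypothetical dictionary of 3-dimensional vectors standing in for `w2v_model.wv`. The zero-vector fallback for documents without any known word is an assumption, shown as one alternative to dropping those documents.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for w2v_model.wv (assumption)
toy_vectors = {
    "python": np.array([1.0, 0.0, 0.0]),
    "pandas": np.array([0.0, 1.0, 0.0]),
    "merge": np.array([0.0, 0.0, 1.0]),
}


def embed_document(tokens, vectors, dim=3):
    """Average the vectors of the known tokens; return zeros if none is known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


doc_vec = embed_document(["python", "pandas", "unknownword"], toy_vectors)
empty_vec = embed_document(["unknownword"], toy_vectors)
print(doc_vec, empty_vec)
```

Averaging discards word order, so two questions using the same words in a different order receive the same vector. This is a known limitation of this kind of document embedding.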

In [None]:
classifiers = [LogisticRegression(random_state=42, max_iter=300, tol=1e-5),
               SGDClassifier(random_state=42, max_iter=300, tol=1e-5),
               RandomForestClassifier(random_state=42),
               KNeighborsClassifier()]

In [None]:
def calculate_supervised_word2vec(X_train, X_test, y_train, y_test, classifiers):
    """Fit each classifier in a one-vs-rest setting and collect its test scores."""

    # Collect one result row per classifier
    rows = []

    # Fit an independent model for each class using the OneVsRestClassifier wrapper.
    for clf in classifiers:
        start_time = time.time()
        ovrc = OneVsRestClassifier(clf)
        ovrc.fit(X_train, y_train)
        y_pred_ovrc = ovrc.predict(X_test)
        end_time = time.time()

        f1 = round(f1_score(y_test, y_pred_ovrc, average='samples'), 4)
        jaccard = round(jaccard_score(y_test, y_pred_ovrc, average='samples'), 4)
        time_taken = round(end_time - start_time, 4)

        rows.append({'Classifier': str(clf).split('(')[0],
                     'F1 Score': f1,
                     'Jaccard Score': jaccard,
                     'Time (s)': time_taken})

    # Build the frame once at the end (DataFrame.append was removed in pandas 2.0)
    return pd.DataFrame(rows, columns=['Classifier', 'F1 Score', 'Jaccard Score', 'Time (s)'])

In [None]:
results_df_word2vec = calculate_supervised_word2vec(X_train_embedded,
                                                    X_test_embedded,
                                                    y_train,
                                                    y_test,
                                                    classifiers)

In [None]:
# Call the plot_results function
plot_results(results_df_word2vec)

### Results

## <font color="#114b98" id="section_5">5. Supervised approach with Word Embedding: BERT</font>

### Study

In [None]:
import torch
import tensorflow_hub as hub
import tensorflow as tf
import transformers
import tokenization
from transformers import BertTokenizer, AutoModel, BertTokenizerFast
from torch.nn import CrossEntropyLoss
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertForSequenceClassification, AdamW, BertConfig
from torch import nn
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import plot_model

In [None]:
texts = data_sample['Texts'].copy()
tags = data_sample['Tags'].copy()

In [None]:
# Join the token lists into space-separated strings (independent of the index labels)
texts = texts.apply(" ".join)
tags = tags.apply(" ".join)

In [None]:
tags

In [None]:
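mlb = MultiLabelBinarizer()
tags_list = tags.to_list()
# Each entry of `tags` is a space-joined string: split it back into a list of tags,
# otherwise MultiLabelBinarizer treats every character as a separate label
tags_bin = mlb.fit_transform(tags.str.split())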

In [None]:
train_text, test_text, train_labels, test_labels = train_test_split(texts,
                                                                    tags_bin,
                                                                    test_size=0.2,
                                                                    random_state=42)

In [None]:
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)

In [None]:
def bert_encode(texts, tokenizer, max_len=512):
    all_tokens = []
    all_masks = []
    all_segments = []

    for text in texts:
        text = tokenizer.tokenize(text)

        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)

        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len

        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)

    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
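
The encoding above produces three aligned sequences per text: token ids, an attention mask and segment ids. The toy sketch below shows the same layout on a single sentence. It uses a hypothetical seven-entry vocabulary and whitespace splitting in place of the real WordPiece tokenizer, so only the structure of the output carries over, not the ids themselves.

```python
# Hypothetical miniature vocabulary; a real BERT vocabulary has about 30,000 entries
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3,
             "convert": 4, "date": 5, "string": 6}


def toy_encode(text, vocab, max_len=8):
    """Mimic the [CLS] ... [SEP] + padding layout expected by BERT."""
    tokens = text.lower().split()[:max_len - 2]  # leave room for [CLS] and [SEP]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    pad_len = max_len - len(sequence)

    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in sequence] + [vocab["[PAD]"]] * pad_len
    mask = [1] * len(sequence) + [0] * pad_len   # 1 for real tokens, 0 for padding
    segments = [0] * max_len                     # single-sentence input: all zeros
    return ids, mask, segments


ids, mask, segments = toy_encode("convert date string", toy_vocab)
print(ids)       # [1, 4, 5, 6, 2, 0, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 0, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

The function takes one string at a time here. The notebook's `bert_encode` iterates over its first argument, so it should be given a list or array of strings rather than a bare string.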

In [None]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = BertTokenizer(vocab_file, do_lower_case)

In [None]:
tokens_train = bert_encode(train_text.values, tokenizer, max_len=100)
tokens_test = bert_encode(test_text.values, tokenizer, max_len=100)

In [None]:
def build_model(bert_layer, max_len=512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    clf_output = sequence_output[:, 0, :]
    # One sigmoid unit per tag for multi-label classification
    out = Dense(len(train_labels[0]), activation='sigmoid')(clf_output)
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    model.compile(Adam(learning_rate=2e-6), loss='binary_crossentropy', metrics=['binary_accuracy'])

    return model

In [None]:
model = build_model(bert_layer, max_len=100)

In [None]:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [None]:
plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)

In [None]:
# Train on the multi-hot encoded labels (one column per tag)
train_history = model.fit(
    tokens_train, train_labels,
    validation_split=0.2,
    epochs=10,
    batch_size=32,
)

In [None]:
sns.set_style("darkgrid")

fig, axs = plt.subplots(1, 2, figsize=(15, 5))

sns.lineplot(x=np.arange(1, 11), y=train_history.history['loss'], label="Training Loss", ax=axs[0])
sns.lineplot(x=np.arange(1, 11), y=train_history.history['val_loss'], label="Validation Loss", ax=axs[0])
axs[0].set_title("Loss")
axs[0].set_xticks(np.arange(1, 11))
axs[0].set_xlabel("Epoch")

sns.lineplot(x=np.arange(1, 11), y=train_history.history['binary_accuracy'], label="Training Accuracy", ax=axs[1])
sns.lineplot(x=np.arange(1, 11), y=train_history.history['val_binary_accuracy'], label="Validation Accuracy", ax=axs[1])
axs[1].set_title("Accuracy")
axs[1].set_xticks(np.arange(1, 11))
axs[1].set_xlabel("Epoch")

plt.tight_layout()
plt.show()

In [None]:
model.evaluate(tokens_test, test_labels)

In [None]:
texts

In [None]:
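predictions = []

for text in tqdm(texts):
    # bert_encode iterates over its input, so wrap the single string in a list
    preds = model.predict(bert_encode([text], tokenizer, max_len=100))
    # Keep the 5 highest-scoring tags and build a binary indicator row
    top_indices = np.argsort(preds[0])[-5:]
    binary_preds = np.zeros_like(preds, dtype=int)
    binary_preds[0, top_indices] = 1
    decoded = mlb.inverse_transform(binary_preds)

    predictions.append(decoded[0])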

In [None]:
predictions[7:11]

In [None]:
def convert_tags(row):
    # Split the space-joined string back into a tuple of plain tag names,
    # the same shape as the tuples returned by mlb.inverse_transform
    return tuple(row.split())


converted_tags = tags.apply(convert_tags)

In [None]:
converted_tags[7:11]

### Study results

In [None]:
plot_similar_tags(converted_tags, predictions, 'BERT')

## <font color="#114b98" id="section_6">6. Supervised approach with Sentence Embedding: USE</font>

### Study

In [None]:
from rake_nltk import Rake
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
data_use = pd.read_csv(main_path+'saved_ressources/'+'data_cleaned_wo_tokenizer.csv', encoding='utf8')

In [None]:
sentences = data_use['Texts'].to_list()
tags_list

In [None]:
extracted_sentences = [sentences[i] for i in saved_indexes]
extracted_tags = [tags_list[i] for i in saved_indexes]
parsed_true_tags = [ast.literal_eval(tags[1:-1]) for tags in extracted_tags]

In [None]:
def extract_keywords(input_text):
    # Load the USE model
    model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
    all_keywords = []
    # Encode all the sentences in the input text
    embeddings = model(input_text)
    # Compute the cosine similarity between all the sentences
    similarity_matrix = cosine_similarity(embeddings)
    # Iterate over the list of sentences
    for i in range(len(input_text)):
        # Find the 5 most similar sentences (rank 0 is normally the sentence itself and is skipped)
        most_similar = np.argsort(-similarity_matrix[i])[1:6]
        # Combine the most similar sentences into a single text
        text = ' '.join([input_text[j] for j in most_similar])
        # Extract the keywords from the combined text using RAKE
        keyword_extractor = Rake()
        keyword_extractor.extract_keywords_from_text(text)
        word_degrees = keyword_extractor.get_word_degrees()
        sorted_word_degrees = sorted(word_degrees.items(), key=lambda x: x[1], reverse=True)
        keywords = [word for word, degree in sorted_word_degrees[:5]]
        all_keywords.append(keywords)
    return all_keywords
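
The neighbour search at the heart of this function can be sketched with NumPy alone. The example below computes cosine similarities between hypothetical 2-dimensional embeddings and returns, for each row, the closest other rows. The helper `nearest_neighbors` and the toy vectors are assumptions, and real USE embeddings have 512 dimensions.

```python
import numpy as np


def nearest_neighbors(embeddings, k=2):
    """For each row, return the indices of the k most cosine-similar other rows."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = normed @ normed.T
    np.fill_diagonal(similarity, -np.inf)  # exclude each row from its own neighbours
    return np.argsort(-similarity, axis=1)[:, :k]


# Hypothetical embeddings: rows 0 and 1 point one way, rows 2 and 3 another
toy_embeddings = np.array([[1.0, 0.0],
                           [0.9, 0.1],
                           [0.0, 1.0],
                           [0.1, 0.9]])

neighbors = nearest_neighbors(toy_embeddings, k=1)
print(neighbors.tolist())  # [[1], [0], [3], [2]]
```

Masking the diagonal is a little more robust than skipping the first rank: when two sentences are exact duplicates they tie with a similarity of 1, so the sentence itself is not guaranteed to come first in the sorted order.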

In [None]:
keywords = extract_keywords(extracted_sentences)

In [None]:
keywords[0:5]

In [None]:
parsed_true_converted_tags = [list(tag) for tag in parsed_true_tags]

In [None]:
parsed_true_converted_tags[0:5]

### Results

In [None]:
plot_similar_tags(parsed_true_converted_tags, keywords, 'USE')

Second USE approach

In [None]:
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/4"

In [None]:
use_layer = hub.KerasLayer(module_url)

In [None]:
def use_encode(texts, max_len=512):
    return np.array(use_layer(texts))

In [None]:
use_train = use_encode(train_text.values, max_len=100)

In [None]:
use_test = use_encode(test_text.values, max_len=100)

In [None]:
def build_model(embedding_dim=512):
    # The USE embeddings are precomputed by use_encode, so the model takes
    # the embedding vectors as input rather than raw strings
    input_embedding = Input(shape=(embedding_dim,), dtype=tf.float32, name="input_embedding")
    dense = Dense(128, activation='relu')(input_embedding)
    out = Dense(len(train_labels[0]), activation='sigmoid')(dense)
    model = Model(inputs=input_embedding, outputs=out)
    model.compile(Adam(learning_rate=2e-6), loss='binary_crossentropy', metrics=['binary_accuracy'])

    return model

In [None]:
model = build_model(embedding_dim=use_train.shape[1])

In [None]:
model.fit(use_train, train_labels, epochs=1)

## <font color="#114b98" id="section_7">7. Choosing the model for the final code to deploy</font>

We will now compare the different models using their results on the test dataset.