**Table of contents**<a id='toc0_'></a>
- [Setup](#toc1_)
  - [Loading the data](#toc1_1_)
  - [Visualizations](#toc1_2_)
  - [Data preparation](#toc1_3_)
- [Strategy](#toc2_)
- [TF-IDF](#toc3_)
- [CountVectorizer](#toc4_)
- [🦄🦄 CHECKPOINT 🦄🦄](#toc5_)
- [LDA](#toc6_)
- [🚧 Useful odds and ends](#toc7_)

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=3
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Setup](#toc0_)

In [1]:
# OS & env
import os
import logging

# DS
import numpy as np
import pandas as pd
import dill as pickle

# ML
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# home made functions from the src folder
from src.scrap_and_clean import words_filter
from src.models import results_from_vec_matrix
from src.models import get_lda_topics

# logging configuration (see all outputs, even DEBUG or INFO)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

## <a id='toc1_1_'></a>[Loading the data](#toc0_)

In [2]:
if not os.path.exists("data/df_preprocessed.pkl") or not os.path.exists(
    "data/stats_df.pkl"
):
    logging.warning("Missing data: run EDA notebook first")
else:
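    # distinct names for the file handles, so they do not shadow the loaded objects
    with open("data/df_preprocessed.pkl", "rb") as df_file:
        df = pickle.load(df_file)
    with open("data/stats_df.pkl", "rb") as stats_file:
        stats_df = pickle.load(stats_file)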

## <a id='toc1_2_'></a>[Visualizations](#toc0_)

Dataset

In [3]:
display(df.head())
print(f"DF shape: {df.shape}")
print(f'Corpus contains {len(stats_df["count_corpus"])} tokens')

| | Title | Body | Tags | Score | AnswerCount | CreationDate | ViewCount | title_tokens | body_tokens |
|---|---|---|---|---|---|---|---|---|---|
| 0 | itms- : missing api declaration - privacy | why am i all of a suddent getting this on succ... | [ios, app-store, plist, appstore-approval, pri... | 24 | 7 | 2024-03-14 22:55:18 | 3092 | [itms, missing, api, declaration, privacy] | [suddent, getting, successful, builds, apple] |
| 1 | why is builtin sorted() slower for a list cont... | i sorted four similar lists. list consistently... | [python, algorithm, performance, sorting, time... | 28 | 2 | 2024-03-05 15:21:50 | 2699 | [builtin, sorted, slower, list, containing, de... | [sorted, four, similar, lists, list, consisten... |
| 2 | std::shared_mutex::unlock_shared() blocks even... | my team has encountered a deadlock that i susp... | [c++, windows, multithreading, stl, shared-lock] | 26 | 5 | 2024-03-01 23:09:59 | 1388 | [std, shared_mutex, unlock_shared, blocks, eve... | [team, encountered, deadlock, suspect, bug, wi... |
| 3 | did the rules for nullptr init of unique_ptr c... | this code compiles with msvc from vs in c++ mo... | [c++, visual-c++, language-lawyer, unique-ptr,... | 15 | 1 | 2024-02-22 11:29:42 | 490 | [rules, nullptr, init, unique_ptr, change, c++] | [compiles, msvc, c++, mode, failes, c++, mode,... |
| 4 | where is the order in which elf relocations ar... | consider the following two files on a linux sy... | [c++, elf, dynamic-linking, abi, relocation] | 16 | 1 | 2024-02-19 21:42:03 | 1746 | [order, elf, relocations, applied, specified] | [consider, following, two, linux, system, use_... |


DF shape: (49991, 9)
Corpus contains 110199 tokens


## <a id='toc1_3_'></a>[Data preparation](#toc0_)

Converting the token lists into lists of strings

In [4]:
tags = df["Tags"].apply(lambda x: " ".join(x)).to_list()
titles = df["title_tokens"].apply(lambda x: " ".join(x)).to_list()
bodies = df["body_tokens"].apply(lambda x: " ".join(x)).to_list()
corpus = (df["title_tokens"] + df["body_tokens"]).apply(lambda x: " ".join(x)).to_list()

# Bag-of-words approaches

🚧 To test these models, **training is done on the question titles**, on the assumption that the titles of relevant questions are proportionally more representative of the content.

🚧 ☝️ Not necessarily ☝️:  
this depends on the EDA statistics, and using the full corpus lets a word appear in both title and body, which naturally makes it more probable

Creating a dummy title for later tests

In [5]:
dummy_doc = "i have a c# issue with overflow memory in library kazakhstanislas causing many problems in my code: how do you manage this? Here is an example: <code>int a = 1; int b = 2; int c = a + b;</code>"

## <a id='toc3_'></a>[TF-IDF](#toc0_)

[scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

🚧 A simple prediction model could rely on the frequencies of the unigrams present in a corpus.

Fitting the model on the corpus

In [40]:
tfidf_vectorizer = TfidfVectorizer(token_pattern=r"\S+", min_df=10)
tfidf_fitted = tfidf_vectorizer.fit(corpus)
tfidf_names = tfidf_vectorizer.get_feature_names_out()

# <a id='toc5_'></a>[🦄🦄 CHECKPOINT 🦄🦄](#toc0_)

Prediction on an existing document (biased, since it was part of the training data; shown only as an example and to compare the predictions with the existing tags)

In [41]:
n_predictions = 5
doc_index = 13371
doc = corpus[doc_index]

In [42]:
# predict
X = tfidf_fitted.transform([doc])
# display predictions
predictions = results_from_vec_matrix(tfidf_vectorizer, X, n_predictions)

print(f"1️⃣- Original document:\n    {doc}\n")
print(f"2️⃣- Predictions:\n    {predictions}\n")
print(f"3️⃣- Targeted tags:\n    {tags[doc_index]}")

1️⃣- Original document:
    django sessions working expected heroku users keep getting logged sessions persisting django app heroku users log randomly logged out—even site anything wrong django heroku config currently running django standard dynos settings.py

2️⃣- Predictions:
    {'django': 0.5212935792595451, 'heroku': 0.49227563324098506, 'sessions': 0.3394822489281743, 'logged': 0.28716515499154116, 'users': 0.2215763034401152}

3️⃣- Targeted tags:
    python django session cookies redis


Prediction on a dummy document

🚧 include the cleaning pipeline on the dummy document

In [43]:
doc = dummy_doc

In [56]:
def top_weighted_terms(vectorizer, X, n_max) -> tuple:
    """Return up to n_max top-weighted terms of the single-row sparse matrix X.

    The result is a (term -> weight) dict plus the matching vocabulary indices.
    """
    # sort the non-zero entries once, so that weights and indices stay aligned
    order = np.argsort(-X.data)[:n_max]

    # get highest weights
    weights = X.data[order].tolist()

    # get corresponding indices (kept in a one-element list)
    pred_indices = [X.indices[order]]

    # get corresponding words
    names = vectorizer.get_feature_names_out()
    preds = [names[x] for x in pred_indices[0]]

    return dict(zip(preds, weights)), pred_indices

result = top_weighted_terms(tfidf_vectorizer, X, 5)
result

({'int': 0.6084973211985404,
  'overflow': 0.29825834886283686,
  'manage': 0.2848828239471886,
  'b': 0.2724742057720744,
  'causing': 0.27012379126795866},
 [array([4508, 6241, 5294,  785, 1292], dtype=int32)])

In [61]:
tfidf_vectorizer.get_feature_names_out()[X.indices[np.argsort(-X.data)][: min(len(X.indices), 5)]]

array(['int', 'overflow', 'manage', 'b', 'causing'], dtype=object)

In [44]:
# predict
X = tfidf_fitted.transform([doc])
# display predictions
predictions = results_from_vec_matrix(tfidf_vectorizer, X, n_predictions)

print(f"1️⃣- Original document:\n    {doc}\n")
print(f"2️⃣- Predictions:\n    {predictions}\n")

1️⃣- Original document:
    i have a c# issue with overflow memory in library kazakhstanislas causing many problems in my code: how do you manage this? Here is an example: <code>int a = 1; int b = 2; int c = a + b;</code>

2️⃣- Predictions:
    {'int': 0.6084973211985404, 'overflow': 0.29825834886283686, 'manage': 0.2848828239471886, 'b': 0.2724742057720744, 'causing': 0.27012379126795866}



We observe that TF-IDF, by design, does not readily suggest frequent tags, such as the names of programming languages or commonly used libraries, because terms that appear in many documents are down-weighted: this is one of the limitations of this simple approach.
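
To make this limitation concrete, here is a minimal sketch of the inverse document frequency term in the smoothed form that scikit-learn applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1. Only the corpus size (49991 rows) comes from this notebook; the two document frequencies are hypothetical values chosen for illustration.

```python
import math


def smooth_idf(n_docs: int, doc_freq: int) -> float:
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n_docs = 49991  # number of rows reported for df in this notebook

# hypothetical document frequencies, for illustration only
doc_freqs = {"frequent term": 5000, "rare term": 10}

idf_values = {term: smooth_idf(n_docs, freq) for term, freq in doc_freqs.items()}
for term, idf in idf_values.items():
    print(f"{term}: idf = {idf:.3f}")
```

The rarer term receives the larger multiplier, so at equal in-document counts a widespread token such as a language name ends up ranked below an uncommon one.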

In [None]:
stop

# 🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧

- write a function that retrieves the word indices from a string of tags
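
A minimal sketch of such a function, assuming the lookup is done against a term-to-index mapping like the `vocabulary_` attribute of a fitted vectorizer. The name `tag_indices` and the toy vocabulary are hypothetical, and silently skipping tags that are absent from the vocabulary is only one possible choice.

```python
def tag_indices(tag_string: str, vocabulary: dict) -> list:
    """Map a space-separated tag string to vocabulary indices.

    Tags that are not in the vocabulary are skipped.
    """
    return [vocabulary[tag] for tag in tag_string.split() if tag in vocabulary]


# hypothetical vocabulary; with a fitted vectorizer, pass `tfidf_vectorizer.vocabulary_`
toy_vocabulary = {"cookies": 0, "django": 1, "python": 2, "session": 3}

indices = tag_indices("python django session cookies redis", toy_vocabulary)
print(indices)  # 'redis' is not in the toy vocabulary, so it is dropped
```

With the fitted model, the call would look like `tag_indices(tags[doc_index], tfidf_vectorizer.vocabulary_)`.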

In [66]:
tags[doc_index]

'python django session cookies redis'

In [65]:
print(X.indices[np.argsort(-X.data)][: min(len(X.indices), 5)])

[4508 6241 5294  785 1292]


In [18]:
import time
import matplotlib.pyplot as plt

from sklearn import cluster, metrics
from sklearn import manifold, decomposition

from nltk.stem import WordNetLemmatizer


# Compute t-SNE, determine the clusters, and compute the ARI between true categories and cluster numbers
def ARI_fct(features, targets):
    start_time = time.time()
    # 🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧
    # retrieve the indices of the most frequent words for the ARI comparison (numbers only)
    f_ind = features.indices
    f_data = features.data
    pred_index_list = f_ind[np.argsort(-f_data)][: min(len(f_ind), 5)]

    targets_list = None  # 🚧 TODO: convert `targets` to vocabulary indices
    # 🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧🚧

    # note: recent scikit-learn releases name the `n_iter` parameter `max_iter`
    tsne = manifold.TSNE(n_components=2, perplexity=30, n_iter=2000,
                         init='random', learning_rate=200, random_state=42)
    X_tsne = tsne.fit_transform(features)

    # Determine the clusters from the data after t-SNE
    # cls = cluster.KMeans(n_clusters=num_labels, n_init=100, random_state=42)
    # cls.fit(X_tsne)
    # ARI = np.round(metrics.adjusted_rand_score(🚧 CONVERT WORDS LISTS TO BOW INDEXES LISTS then compare both),4)
    ARI = None  # 🚧 TODO: not computed until the comparison above is implemented
    duration = np.round(time.time() - start_time, 0)
    print("ARI: ", ARI, "Duration: ", duration)

    return ARI, X_tsne


# t-SNE visualization by true categories and by clusters
def TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI):
    fig = plt.figure(figsize=(15, 6))

    ax = fig.add_subplot(121)
    scatter = ax.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_cat_num, cmap='Set1')
    # 🚧 `l_cat` (the category names) is not defined in this notebook yet
    ax.legend(handles=scatter.legend_elements()[0], labels=l_cat, loc="best", title="Category")
    plt.title('Representation of tweets by true categories')

    ax = fig.add_subplot(122)
    scatter = ax.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='Set1')
    ax.legend(handles=scatter.legend_elements()[0], labels=set(labels), loc="best", title="Clusters")
    plt.title('Representation of tweets by clusters')

    plt.show()
    print("ARI : ", ARI)



In [None]:
tfidf_vectorizer = TfidfVectorizer(token_pattern=r"\S+", min_df=10)
tfidf_fitted = tfidf_vectorizer.fit(corpus)
tfidf_names = tfidf_vectorizer.get_feature_names_out()
tfidf_transformed = tfidf_fitted.transform(corpus)

In [None]:
print("Tf-idf : ")
print("--------")
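# 🚧 ARI_fct takes (features, targets) and returns (ARI, X_tsne); cluster labels
# are not returned until the KMeans step inside ARI_fct is restored
ARI, X_tsne = ARI_fct(tfidf_transformed, tags)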

Tf-idf : 
--------


NameError: name 'l_cat' is not defined

In [None]:
# 🚧 `y_cat_num` is not defined yet in this notebook
TSNE_visu_fct(X_tsne, y_cat_num, labels, ARI)

# <a id='toc4_'></a>[CountVectorizer](#toc0_)

[scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

🚧 Using CountVectorizer on the data => LDA takes its output as input

Count vectorization model

In [36]:
count_vectorizer = CountVectorizer(token_pattern=r"\S+", dtype=np.uint16, min_df=10)
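
The `token_pattern=r"\S+"` argument matters for programming vocabulary. As a rough illustration, the sketch below mimics the vectorizer's tokenization step (lowercasing followed by a regex `findall`) with the standard `re` module, comparing scikit-learn's default pattern to the one used here. The `tokenize` helper and the sample sentence are hypothetical, and the sketch leaves out every other vectorizer step such as `min_df` filtering.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # scikit-learn's default token_pattern
NOTEBOOK_PATTERN = r"\S+"           # pattern used in this notebook


def tokenize(text: str, pattern: str) -> list:
    """Lowercase the text, then extract tokens with the given regex."""
    return re.findall(pattern, text.lower())


sample = "C# and C++ on .NET"  # hypothetical sample sentence
default_tokens = tokenize(sample, DEFAULT_PATTERN)
notebook_tokens = tokenize(sample, NOTEBOOK_PATTERN)

print(default_tokens)   # the language names are lost
print(notebook_tokens)  # 'c#', 'c++' and '.net' survive as tokens
```

Splitting on whitespace only works here because the tokens were already cleaned during preprocessing; on raw text it would also keep punctuation attached to words.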

Fitting the model on the corpus

In [37]:
cv_fitted = count_vectorizer.fit(corpus)
cv_names = count_vectorizer.get_feature_names_out()
cv_transformed = cv_fitted.transform(corpus)

Example on a dummy document

🚧 include the cleaning pipeline on the dummy document

In [38]:
doc = dummy_doc

In [39]:
# predict
X = cv_fitted.transform([doc])
# display predictions
predictions = results_from_vec_matrix(count_vectorizer, X, n_predictions)

print(f"1️⃣- Original document:\n    {doc}\n")
print(f"2️⃣- Predictions:\n    {predictions}\n")

1️⃣- Original document:
    i have a c# issue with overflow memory in library kazakhstanislas causing many problems in my code: how do you manage this? Here is an example: <code>int a = 1; int b = 2; int c = a + b;</code>

2️⃣- Predictions:
    {'int': 2, 'b': 1, 'c': 1, 'c#': 1, 'causing': 1}



# <a id='toc6_'></a>[LDA](#toc0_)

[scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)

[LDA Gensim 1](https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation)
[LDA Gensim 2](https://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel)

LDA model

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

# define a maximum number of topics (each represented by some words)
n_topics = 20

# LDA model
lda = LatentDirichletAllocation(
    n_components=n_topics,
    max_iter=5,
    learning_method="online",
    learning_offset=50.0,
    random_state=0,
)

Training

In [None]:
lda.fit(cv_transformed)

Displaying the topics

In [None]:
n_top_words = 10
topics = get_lda_topics(lda, cv_names, n_top_words)
topics

['model json object field property custom django entity attribute class',
 'windows python command running tried get install version following installed',
 'array data values python list get output column dataframe value',
 'apache remote disable camera profile ssh disabled forms xamarin device',
 'test tests thread testing unit event task call method async',
 'type query key string sql convert name plot format following',
 'data would way want need know one get new help',
 'view problem swift update screen also see change works one',
 'question would one memory different difference used two also seems',
 'table database node view laravel date spark controller npm child',
 'android project studio build visual version app gradle device sdk',
 'app ios xcode react native color rails background apps facebook',
 'function method class object type return call value following variable',
 'time firebase number stream seconds takes times process slow audio',
 '.net framework core set layout it
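
The helper `get_lda_topics` lives in `src.models` and is not shown in this notebook. As an assumption about what such a helper could look like (not the actual implementation), the sketch below reads the topic-word weights from the model's `components_` attribute and joins the top words of each topic into one string. The name `sketch_lda_topics` and the toy values are hypothetical.

```python
from types import SimpleNamespace

import numpy as np


def sketch_lda_topics(model, feature_names, n_top_words):
    """Return one space-joined string of top-weighted words per topic."""
    topics = []
    for topic_weights in np.asarray(model.components_):
        # indices of the n_top_words largest weights, in descending order
        top = np.argsort(-topic_weights)[:n_top_words]
        topics.append(" ".join(feature_names[i] for i in top))
    return topics


# toy stand-in for a fitted LDA model: 2 topics over 3 words (hypothetical values)
toy_model = SimpleNamespace(components_=np.array([[0.1, 3.0, 2.0], [5.0, 0.2, 1.0]]))
toy_names = np.array(["app", "django", "model"])

toy_topics = sketch_lda_topics(toy_model, toy_names, 2)
print(toy_topics)
```

With the objects defined above, the equivalent call would be `sketch_lda_topics(lda, cv_names, n_top_words)`.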

Example on a document from the corpus (biased, since it was part of the training data: for illustration and tag comparison only)

In [None]:
n_predictions = 5
doc_index = 13371
doc = corpus[doc_index]

In [None]:
doc_cv = cv_fitted.transform([doc])
doc_tokens_count = results_from_vec_matrix(count_vectorizer, doc_cv, n_predictions)

print(f"🔹 Original document:\n    {doc}\n")
print(f"🔹 Token count:\n    {doc_tokens_count}\n")

🔹 Original document:
    django sessions working expected heroku users keep getting logged sessions persisting django app heroku users log randomly logged out—even site anything wrong django heroku config currently running django standard dynos settings.py

🔹 Token count:
    {'django': 4, 'heroku': 3, 'logged': 2, 'users': 2, 'sessions': 2}



In [None]:
# predict topics
topic_preds = lda.transform(doc_cv)[0]
# sort predictions by descending order
n_top_topics = 5
top_topics = topic_preds.argsort()[:-n_top_topics-1 :-1]
print(f"🔹 Top {n_top_topics} topics:")
for topic in top_topics:
    print(f"    Topic {topic} (weight {topic_preds[topic]}):\n        {topics[topic]}")

🔹 Top 5 topics:
    Topic 17 (weight 0.3786462815115825):
        server api user request http client web service application app
    Topic 19 (weight 0.333126503477579):
        project module library angular import get tried folder application trying
    Topic 0 (weight 0.1588245321386544):
        model json object field property custom django entity attribute class
    Topic 6 (weight 0.05851810512542113):
        data would way want need know one get new help
    Topic 13 (weight 0.045884577427807244):
        time firebase number stream seconds takes times process slow audio


# 🚧 gensim LDA ?

# 🚧 matrix computation for predictions with LDA
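
One possible reading of this TODO, offered as an assumption rather than the planned implementation: score each word for a document as the sum over topics of p(topic | document) × p(word | topic), which is a single matrix product between the document-topic vector and the row-normalized topic-word matrix. The function name and the toy values below are hypothetical.

```python
import numpy as np


def words_from_topics(doc_topic, components, feature_names, n_max):
    """Score words for one document as sum_k p(topic k | doc) * p(word | topic k)."""
    components = np.asarray(components, dtype=float)
    # normalize each topic row into a probability distribution over words
    topic_word = components / components.sum(axis=1, keepdims=True)
    # matrix product: (n_topics,) @ (n_topics, n_words) -> (n_words,)
    word_scores = np.asarray(doc_topic, dtype=float) @ topic_word
    # keep the n_max highest-scoring words, in descending order
    top = np.argsort(-word_scores)[:n_max]
    return {feature_names[i]: float(word_scores[i]) for i in top}


# toy values (hypothetical): 2 topics, 3 words
toy_doc_topic = [0.7, 0.3]
toy_components = [[2.0, 1.0, 1.0], [0.0, 1.0, 3.0]]
toy_names = ["django", "python", "heroku"]

toy_scores = words_from_topics(toy_doc_topic, toy_components, toy_names, 2)
print(toy_scores)
```

In this notebook the inputs would be `lda.transform(doc_cv)[0]`, `lda.components_` and `cv_names`. The normalization step is needed because `components_` holds unnormalized pseudo-counts rather than probabilities.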

# 🚧 NMF

# <a id='toc7_'></a>[🚧 Useful odds and ends](#toc0_)

In [None]:
stats_df.loc[stats_df.count_title > 10].sort_values(by="count_title", ascending=False)

| | token | count_corpus | freq_corpus | count_tags | freq_tags | count_title | freq_title | count_body | freq_body |
|---|---|---|---|---|---|---|---|---|---|
| 6471 | android | 7848 | 0.002476 | 4762 | 0.019051 | 2038 | 0.006975 | 5810 | 0.002019 |
| 76811 | python | 7636 | 0.002409 | 7301 | 0.029209 | 1916 | 0.006557 | 5720 | 0.001988 |
| 7807 | app | 15812 | 0.004988 | 0 | 0.000000 | 1423 | 0.004870 | 14389 | 0.005000 |
| 90332 | spring | 4756 | 0.001500 | 1579 | 0.006317 | 1377 | 0.004712 | 3379 | 0.001174 |
| 39069 | get | 21012 | 0.006628 | 34 | 0.000136 | 1323 | 0.004528 | 19689 | 0.006842 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 30037 | dropout | 77 | 0.000024 | 3 | 0.000012 | 11 | 0.000038 | 66 | 0.000023 |
| 75194 | producing | 121 | 0.000038 | 0 | 0.000000 | 11 | 0.000038 | 110 | 0.000038 |
| 95187 | tablet | 86 | 0.000027 | 8 | 0.000032 | 11 | 0.000038 | 75 | 0.000026 |
| 40107 | glue | 55 | 0.000017 | 0 | 0.000000 | 11 | 0.000038 | 44 | 0.000015 |
