**Table of contents**<a id='toc0_'></a>    
- [Setup](#toc1_)
  - [Loading the data](#toc1_1_)
  - [Visualizations](#toc1_2_)
  - [Data preparation](#toc1_3_)
- [Strategy](#toc2_)
- [TF-IDF](#toc3_)
- [CountVectorizer](#toc4_)
- [🦄🦄 CHECKPOINT 🦄🦄](#toc5_)
- [LDA](#toc6_)
- [🚧 Useful odds and ends](#toc7_)


# <a id='toc1_'></a>[Setup](#toc0_)

In [19]:
# OS & env
import os
import logging

# DS
import numpy as np
import pandas as pd
import dill as pickle

# ML
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# home made functions from the src folder
from src.scrap_and_clean import words_filter
from src.models import results_from_vec_matrix

# logging configuration (see all outputs, even DEBUG or INFO)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

## <a id='toc1_1_'></a>[Loading the data](#toc0_)

In [2]:
if not os.path.exists("data/df_preprocessed.pkl") or not os.path.exists(
    "data/stats_df.pkl"
):
    logging.warning("Missing data: run EDA notebook first")
else:
    # use a distinct name for the file handle so the loaded object does not shadow it
    with open("data/df_preprocessed.pkl", "rb") as f:
        df = pickle.load(f)
    with open("data/stats_df.pkl", "rb") as f:
        stats_df = pickle.load(f)

## <a id='toc1_2_'></a>[Visualizations](#toc0_)

Dataset

In [3]:
display(df.head())
print(f"DF shape: {df.shape}")
print(f'Corpus contains {len(stats_df["count_corpus"])} tokens')

| | Title | Body | Tags | Score | AnswerCount | CreationDate | ViewCount | title_tokens | body_tokens |
|---|---|---|---|---|---|---|---|---|---|
| 0 | itms- : missing api declaration - privacy | why am i all of a suddent getting this on succ... | [ios, app-store, plist, appstore-approval, pri... | 24 | 7 | 2024-03-14 22:55:18 | 3092 | [itms, missing, api, declaration, privacy] | [suddent, getting, successful, builds, apple] |
| 1 | why is builtin sorted() slower for a list cont... | i sorted four similar lists. list consistently... | [python, algorithm, performance, sorting, time... | 28 | 2 | 2024-03-05 15:21:50 | 2699 | [builtin, sorted, slower, list, containing, de... | [sorted, four, similar, lists, list, consisten... |
| 2 | std::shared_mutex::unlock_shared() blocks even... | my team has encountered a deadlock that i susp... | [c++, windows, multithreading, stl, shared-lock] | 26 | 5 | 2024-03-01 23:09:59 | 1388 | [std, shared_mutex, unlock_shared, blocks, eve... | [team, encountered, deadlock, suspect, bug, wi... |
| 3 | did the rules for nullptr init of unique_ptr c... | this code compiles with msvc from vs in c++ mo... | [c++, visual-c++, language-lawyer, unique-ptr,... | 15 | 1 | 2024-02-22 11:29:42 | 490 | [rules, nullptr, init, unique_ptr, change, c++] | [compiles, msvc, c++, mode, failes, c++, mode,... |
| 4 | where is the order in which elf relocations ar... | consider the following two files on a linux sy... | [c++, elf, dynamic-linking, abi, relocation] | 16 | 1 | 2024-02-19 21:42:03 | 1746 | [order, elf, relocations, applied, specified] | [consider, following, two, linux, system, use_... |


DF shape: (49991, 9)
Corpus contains 110199 tokens


## <a id='toc1_3_'></a>[Data preparation](#toc0_)

Converting each token list into a single space-separated string

In [4]:
tags = df["Tags"].apply(lambda x: " ".join(x)).to_list()
titles = df["title_tokens"].apply(lambda x: " ".join(x)).to_list()
bodies = df["body_tokens"].apply(lambda x: " ".join(x)).to_list()
corpus = titles + bodies

# <a id='toc2_'></a>[Strategy](#toc0_)

To test these models, **training is done on the question titles**, on the assumption that the titles of relevant questions are proportionally more representative of their content.

Creating a dummy title for later tests

In [None]:
dummy_title = [
    "i have a c# issue with overflow memory in library kazakhstanislas causing many problems in my code"
]

# <a id='toc3_'></a>[TF-IDF](#toc0_)

[scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

🚧 A simple prediction model could rely on the frequencies of the unigrams present in a corpus.

Fitting the vectorizer on the titles

In [6]:
tfidf_vectorizer = TfidfVectorizer(token_pattern=r"\S+", min_df=10)
tfidf_fitted_on_titles = tfidf_vectorizer.fit(titles)
tfidf_names = tfidf_vectorizer.get_feature_names_out()
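
The `token_pattern=r"\S+"` setting matters for this corpus: scikit-learn's default pattern only keeps runs of two or more word characters, so tokens such as `c#` or `c++` would be lost, while `min_df=10` drops tokens that appear in fewer than 10 titles. A minimal sketch of the tokenization difference on a made-up sentence (the sentence is illustrative, not taken from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative sentence (not from the dataset)
sample = "c# and c++ with asp.net"

# Default pattern r"(?u)\b\w\w+\b": keeps only runs of 2+ word characters
default_tokens = CountVectorizer().build_analyzer()(sample)

# Whitespace-delimited pattern, as used in this notebook
whitespace_tokens = CountVectorizer(token_pattern=r"\S+").build_analyzer()(sample)

print(default_tokens)
print(whitespace_tokens)
```

Since the tokens were already cleaned upstream, splitting on whitespace keeps them exactly as they were produced.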

Prediction on a dummy title

In [34]:
X = tfidf_fitted_on_titles.transform(dummy_title)

In [35]:
print(X)

  (0, 2406)	0.4225848164936134
  (0, 2183)	0.43017407326637275
  (0, 1910)	0.3212781028824765
  (0, 1865)	0.38769906625276496
  (0, 1751)	0.30575535188555975
  (0, 428)	0.429166067593399
  (0, 380)	0.32430279895986563


In [39]:
predictions = results_from_vec_matrix(tfidf_vectorizer, X, 5)
print(predictions["tokens"])
print(predictions["weights"])

['overflow', 'causing', 'problems', 'many', 'c#']
[0.43017407326637275, 0.429166067593399, 0.4225848164936134, 0.38769906625276496, 0.32430279895986563]


We can see that TF-IDF, by design, does not readily suggest frequent tags such as the names of programming languages or commonly used libraries: this is one of the limits of this simple approach.

# <a id='toc4_'></a>[CountVectorizer](#toc0_)

[scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

🚧 Reuse CountVectorizer on the data => LDA takes its output as input

Count vectorization model

# <a id='toc5_'></a>[🦄🦄 CHECKPOINT 🦄🦄](#toc0_)

🚧 redo as for TF-IDF (filters, etc.)

In [9]:
count_vectorizer = CountVectorizer(token_pattern=r"\S+", dtype=np.uint16)

Title matrix

In [10]:
cv_fitted_on_titles = count_vectorizer.fit(titles)
cv_titles = cv_fitted_on_titles.transform(titles)
names_titles = count_vectorizer.get_feature_names_out()

Example with a dummy title

In [11]:
dummy_counts = cv_fitted_on_titles.transform(dummy_title)
print(dummy_counts)

  (0, 3129)	1
  (0, 14055)	30
  (0, 16987)	30


Body matrix

In [12]:
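# separate vectorizer: refitting count_vectorizer would overwrite the title vocabulary
count_vectorizer_bodies = CountVectorizer(token_pattern=r"\S+", dtype=np.uint16)
cv_fitted_on_bodies = count_vectorizer_bodies.fit(bodies)
cv_bodies = cv_fitted_on_bodies.transform(bodies)
names_bodies = count_vectorizer_bodies.get_feature_names_out()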

# <a id='toc6_'></a>[LDA](#toc0_)

[scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)

LDA model

In [13]:
from sklearn.decomposition import LatentDirichletAllocation

# number of topics to extract (each represented by some words)
n_topics = 20

# LDA model
lda = LatentDirichletAllocation(
    n_components=n_topics,
    max_iter=5,
    learning_method="online",
    learning_offset=50.0,
    random_state=0,
)
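
As a reminder of how the pieces fit together, LDA takes a document-term count matrix (such as the one produced by `CountVectorizer`) and returns, for each document, a distribution over topics. A minimal sketch on a made-up corpus with 2 topics (corpus and parameter values are illustrative only):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# made-up documents
toy_docs = [
    "swift ios xcode build",
    "ios app xcode error",
    "python pandas dataframe",
    "python numpy array",
]
toy_counts = CountVectorizer(token_pattern=r"\S+").fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(toy_counts)

print(doc_topics.shape)           # one row per document, one column per topic
print(doc_topics.sum(axis=1))     # each row is a probability distribution
print(toy_lda.components_.shape)  # one row per topic, one column per token
```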

Fitting and transforming on the titles

In [14]:
lda.fit(cv_titles)

Example prediction on a title (biased, since the title is part of the training set: for illustration only)

In [None]:
index = 0

In [None]:
print(f"Testing title {index}:\n    {titles[index]}")

Testing title 0:
    itms missing api declaration privacy


In [None]:
print(cv_titles[index])

  (0, 11736)	1
  (0, 14424)	1
  (0, 1232)	1
  (0, 5793)	1
  (0, 18389)	1


In [None]:
lda.transform(cv_titles[index])

array([[0.00833333, 0.00833333, 0.175     , 0.00833333, 0.00833333,
        0.175     , 0.00833333, 0.00833333, 0.00833333, 0.50833333,
        0.00833333, 0.00833333, 0.00833333, 0.00833333, 0.00833333,
        0.00833333, 0.00833333, 0.00833333, 0.00833333, 0.00833333]])
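
Each value is the estimated weight of one of the 20 topics for this title. A quick way to read such a row is to pick its largest entries; the sketch below simply reuses the values printed above:

```python
import numpy as np

# topic distribution copied from the output above
doc_topic = np.array([[0.00833333, 0.00833333, 0.175, 0.00833333, 0.00833333,
                       0.175, 0.00833333, 0.00833333, 0.00833333, 0.50833333,
                       0.00833333, 0.00833333, 0.00833333, 0.00833333, 0.00833333,
                       0.00833333, 0.00833333, 0.00833333, 0.00833333, 0.00833333]])

# index of the most likely topic, then the three most likely ones
dominant_topic = int(doc_topic[0].argmax())
top3_topics = doc_topic[0].argsort()[::-1][:3]

print(dominant_topic)
print(sorted(top3_topics.tolist()))
```

Here the title is mostly attributed to topic 9, with topics 2 and 5 as secondary candidates.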

In [None]:
lda.components_[0]

array([0.05      , 0.0500058 , 0.05000073, ..., 0.37103332, 0.05      ,
       0.05      ])

Displaying the topics

In [None]:
def get_topics(model, feature_names, n_top_words) -> list:
    """Return the top words of each topic of an LDA model, one string per topic."""
    topics = []
    for topic in model.components_:
        topics.append(
            " ".join(
                [feature_names[i] for i in topic.argsort()[: -n_top_words - 1 : -1]]
            )
        )

    return topics


n_top_words = 10
topics = get_topics(lda, names_titles, n_top_words)

In [None]:
topics[0]

'file swift ios error xcode build app running version command'
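
The slice `argsort()[: -n_top_words - 1 : -1]` used in `get_topics` may look cryptic: `argsort` returns indices in ascending order of weight, and the negative-step slice walks backwards from the end to keep the `n_top_words` largest. A small check on made-up weights:

```python
import numpy as np

# made-up topic-word weights
weights = np.array([0.05, 0.90, 0.10, 0.40, 0.05])
n_top = 3

# idiom used in get_topics
top_idx = weights.argsort()[: -n_top - 1 : -1]
# equivalent, more explicit form: reverse, then truncate
same_idx = weights.argsort()[::-1][:n_top]

print(top_idx)
print(same_idx)
```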

In [None]:
print(len(lda.components_[2]))
print(lda.components_[2])
print(lda.components_[2].argsort()[:-11:-1])
print(len(lda.components_[0]))
print(lda.components_[0])
print(" ".join([names_titles[i] for i in lda.components_[2].argsort()[:-11:-1]]))

27467
[0.05 0.05 0.05 ... 0.05 0.05 0.05]
[ 1232  4924 26931  1647 26504  8835     3 25250 21451  3129]
27467
[0.05       0.0500058  0.05000073 ... 0.37103332 0.05       0.05      ]
api core working asp.net web framework .net unable service c#


# <a id='toc7_'></a>[🚧 Useful odds and ends](#toc0_)

In [5]:
stats_df.loc[stats_df.count_title > 10].sort_values(by="count_title", ascending=False)

| | token | count_corpus | freq_corpus | count_tags | freq_tags | count_title | freq_title | count_body | freq_body |
|---|---|---|---|---|---|---|---|---|---|
| 6471 | android | 7848 | 0.002476 | 4762 | 0.019051 | 2038 | 0.006975 | 5810 | 0.002019 |
| 76811 | python | 7636 | 0.002409 | 7301 | 0.029209 | 1916 | 0.006557 | 5720 | 0.001988 |
| 7807 | app | 15812 | 0.004988 | 0 | 0.000000 | 1423 | 0.004870 | 14389 | 0.005000 |
| 90332 | spring | 4756 | 0.001500 | 1579 | 0.006317 | 1377 | 0.004712 | 3379 | 0.001174 |
| 39069 | get | 21012 | 0.006628 | 34 | 0.000136 | 1323 | 0.004528 | 19689 | 0.006842 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 30037 | dropout | 77 | 0.000024 | 3 | 0.000012 | 11 | 0.000038 | 66 | 0.000023 |
| 75194 | producing | 121 | 0.000038 | 0 | 0.000000 | 11 | 0.000038 | 110 | 0.000038 |
| 95187 | tablet | 86 | 0.000027 | 8 | 0.000032 | 11 | 0.000038 | 75 | 0.000026 |
| 40107 | glue | 55 | 0.000017 | 0 | 0.000000 | 11 | 0.000038 | 44 | 0.000015 |
