**Table of contents**<a id='toc0_'></a>    
- [Setup](#toc1_)    
  - [Loading the data](#toc1_1_)    
  - [Data preparation](#toc1_2_)    
- [CountVectorizer](#toc2_)    


# <a id='toc1_'></a>[Setup](#toc0_)

In [2]:
# OS & env
import os
import logging
import time

# DS
import numpy as np
import pandas as pd
import dill as pickle

# ML
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from Levenshtein import ratio

# home made functions from the src folder
from src.scrap_and_clean import init_data
from src.models import results_from_vec_matrix
from src.models import get_5_tags_from_matrix
from src.models import score_reduce
from src.models import plot_model
from src.models import vect_data
from src.models import get_topics
from src.models import topic_weights_df
from src.models import topic_predict

# logging configuration (see all outputs, even DEBUG or INFO)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

## <a id='toc1_1_'></a>[Loading the data](#toc0_)

In [3]:
df = init_data()

INFO:root:✅ Preprocessed data loaded


In [4]:
print(f"DF shape: {df.shape}")
display(df.head())

DF shape: (50000, 10)


| | Title | Body | Tags | Score | AnswerCount | CreationDate | ViewCount | title_bow | body_bow | doc_bow |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ITMS-91053: Missing API declaration - Privacy | `<p>Why am I all of a suddent getting this on s...` | [ios, app-store, plist, appstore-approval, pri... | 24 | 7 | 2024-03-14 22:55:18 | 3092 | itms-91053 missing api declaration privacy | suddent getting successful builds apple | itms-91053 missing api declaration privacy sud... |
| 1 | Why is builtin sorted() slower for a list cont... | `<p>I sorted four similar lists. List <code>d</...` | [python, algorithm, performance, sorting, time... | 28 | 2 | 2024-03-05 15:21:50 | 2699 | builtin sorted slower list containing descendi... | sorted four similar lists list consistently ta... | builtin sorted slower list containing descendi... |
| 2 | std::shared_mutex::unlock_shared() blocks even... | `<p>My team has encountered a deadlock that I s...` | [c++, windows, multithreading, stl, shared-lock] | 26 | 5 | 2024-03-01 23:09:59 | 1388 | std :shared_mutex :unlock_shared blocks even t... | team encountered deadlock suspect bug windows ... | std :shared_mutex :unlock_shared blocks even t... |
| 3 | Did the rules for nullptr init of unique_ptr c... | `<p>This code compiles with MSVC from VS 2022 i...` | [c++, visual-c++, language-lawyer, unique-ptr,... | 15 | 1 | 2024-02-22 11:29:42 | 490 | rules nullptr init unique_ptr change c++ | compiles msvc c++ mode failes c++ mode current... | rules nullptr init unique_ptr change c++ compi... |
| 4 | Where is the order in which ELF relocations ar... | `<p>Consider the following two files on a Linux...` | [c++, elf, dynamic-linking, abi, relocation] | 16 | 1 | 2024-02-19 21:42:03 | 1746 | order elf relocations applied specified | consider following two linux system use_messag... | order elf relocations applied specified consid... |


## <a id='toc1_2_'></a>[Data preparation](#toc0_)

Shared data

In [4]:
tags = df["Tags"].apply(lambda x: " ".join(x)).to_list()
titles = df["title_bow"].to_list()
bodies = df["body_bow"].to_list()
corpus = df["doc_bow"].to_list()

Hold out 10,000 documents for testing; the remaining 40,000 are used to train the models

In [5]:
random_state = 42

# isolate target
X_df = df.loc[:, df.columns != "Tags"]

# split entire DF for further use
X_train_df, X_test_df, y_train, y_test = train_test_split(
    X_df, tags, test_size=10000, random_state=random_state
)

# then set genuine features for models
X_train = X_train_df["doc_bow"].to_list()
X_test = X_test_df["doc_bow"].to_list()
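
The split above passes an integer `test_size`, which `train_test_split` reads as an absolute number of test samples rather than a fraction. A minimal sketch on toy data (the `docs` and `labels` lists are hypothetical stand-ins for the notebook's documents and tag strings):

```python
from sklearn.model_selection import train_test_split

# toy stand-ins for the notebook's documents and tag strings (hypothetical data)
docs = [f"document {i}" for i in range(20)]
labels = [f"tag{i % 3}" for i in range(20)]

# an integer test_size is an absolute number of test samples, not a fraction
X_tr, X_te, y_tr, y_te = train_test_split(
    docs, labels, test_size=5, random_state=42
)

print(len(X_tr), len(X_te))
```

With 50,000 rows and `test_size=10000`, the same rule leaves 40,000 rows for training.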

## Vectorization

🚧 Token counts, keeping only tokens that appear in at least 100 training documents (`min_df=100`)

In [6]:
count_vectorizer = CountVectorizer(token_pattern=r"\S+", dtype=np.uint16, min_df=100)

X_train_cv = count_vectorizer.fit_transform(X_train)
X_test_cv = count_vectorizer.transform(X_test)
cv_names = count_vectorizer.get_feature_names_out()

In [23]:
cv_dict = dict(zip(cv_names, X_test_cv.sum(axis=0).A1.astype(np.uint16)))
cv_dict_sorted = {
    k: v for k, v in sorted(cv_dict.items(), key=lambda item: item[1], reverse=True)
}

In [24]:
cv_dict_sorted

{'get': 4078,
 'app': 3154,
 'following': 3038,
 'one': 2967,
 'way': 2920,
 'tried': 2765,
 'problem': 2534,
 'trying': 2447,
 'data': 2444,
 'also': 2416,
 'work': 2321,
 "'ve": 2274,
 'need': 2196,
 'project': 2170,
 'function': 2133,
 'example': 1997,
 'know': 1972,
 'see': 1873,
 'new': 1846,
 'however': 1828,
 'method': 1799,
 'works': 1748,
 'server': 1738,
 'set': 1727,
 'application': 1717,
 'user': 1705,
 'time': 1695,
 'working': 1675,
 'could': 1674,
 'question': 1643,
 'class': 1633,
 'find': 1602,
 'make': 1599,
 'test': 1578,
 'api': 1570,
 'type': 1550,
 'version': 1550,
 'something': 1502,
 'found': 1499,
 'used': 1457,
 'android': 1452,
 'python': 1402,
 'first': 1401,
 'help': 1361,
 'seems': 1351,
 'running': 1345,
 'https': 1332,
 'quot': 1332,
 'without': 1325,
 'different': 1309,
 'two': 1289,
 'build': 1263,
 'add': 1259,
 'still': 1259,
 'image': 1239,
 'object': 1239,
 'getting': 1237,
 'possible': 1230,
 'value': 1224,
 'page': 1205,
 'solution': 1143,
 'even

In [20]:
len(cv_dict)

2432

Results on the test sample

In [10]:
# results = score_reduce(cv_names, count_vectorizer.transform(X_test), y_test)
# plot_model(results[0], results[2], results[3])

# Logistic regression

#### 🚧 Explain the model

#### 🚧 Memory blows up:

- `min_df` raised from 10 to 100 in `CountVectorizer`
- `test_size` changed: 10k test samples instead of 1k

==>> `Unable to allocate 11.5 GiB for an array with shape (40000, 38634) and data type float64` <<==

==>> first step: better cleaning, then see <<==
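
The size in the error message can be checked by hand: 40,000 rows × 38,634 columns × 8 bytes per `float64`. The second dimension is far larger than the 2,432-token vocabulary, so it is presumably the number of distinct tag strings in `y_train`, each treated as its own class. That reading is an assumption, not something the notebook states. The sketch below redoes the arithmetic and shows, on a hypothetical label list, how to count the classes:

```python
import numpy as np

# shape reported in the MemoryError message
n_rows, n_cols = 40_000, 38_634

# float64 uses 8 bytes per value; 2**30 bytes = 1 GiB
bytes_needed = n_rows * n_cols * np.dtype(np.float64).itemsize
gib = bytes_needed / 2**30
print(f"{gib:.1f} GiB")

# hypothetical joined tag strings, in the same format as the notebook's y_train
toy_y = ["python pandas", "python", "python pandas", "c++ stl"]

# every distinct string is a separate class for a multiclass classifier
n_classes = len(set(toy_y))
print(n_classes)
```

Running `len(set(y_train))` on the real labels would confirm or rule out this explanation.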

In [11]:
logreg = LogisticRegression()
logreg.fit(X_train_cv, y_train)

MemoryError: Unable to allocate 11.5 GiB for an array with shape (40000, 38634) and data type float64
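
One possible way around this allocation, which is not what the notebook does, is to treat each tag as its own binary label and keep only the most frequent tags. The sketch below uses `MultiLabelBinarizer` and `OneVsRestClassifier` from scikit-learn on hypothetical toy data; the cut-off of 3 tags is arbitrary.

```python
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# hypothetical documents and tag lists (not the notebook's data)
toy_docs = [
    "python list sorted slow",
    "python pandas dataframe merge",
    "c++ pointer unique_ptr",
    "c++ mutex deadlock thread",
    "python thread lock",
    "c++ template compile",
]
toy_tags = [
    ["python", "sorting"],
    ["python", "pandas"],
    ["c++", "pointers"],
    ["c++", "multithreading"],
    ["python", "multithreading"],
    ["c++", "templates"],
]

# keep only the 3 most frequent tags (arbitrary cut-off for the sketch)
tag_counts = Counter(tag for doc_tags in toy_tags for tag in doc_tags)
top_tags = [tag for tag, _ in tag_counts.most_common(3)]
kept = [[t for t in doc_tags if t in top_tags] for doc_tags in toy_tags]

# one binary column per kept tag instead of one class per tag combination
mlb = MultiLabelBinarizer(classes=top_tags)
Y = mlb.fit_transform(kept)

X = CountVectorizer(token_pattern=r"\S+").fit_transform(toy_docs)

# one logistic regression per tag column
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, Y)
pred = clf.predict(X)

print(Y.shape, pred.shape)
```

The target matrix then has one column per kept tag rather than one column per distinct tag combination, which keeps its width under direct control.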

In [None]:
lr_pred = logreg.predict(X_test_cv)

In [None]:
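# pair each prediction with the index of its test document
# ("predicted_tags" is a column name chosen here; df has no "id" column, so the index is used)
lr_results = pd.DataFrame({"id": X_test_df.index, "predicted_tags": lr_pred})
lr_results.to_csv("lr_vectorized.csv", index=False)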