**Table of contents**<a id='toc0_'></a>    
- [Setup](#toc1_)    
  - [Loading the data](#toc1_1_)    
  - [Data preparation](#toc1_2_)    
- [CountVectorizer](#toc2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=2
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Setup](#toc0_)

In [3]:
# OS & env
import os
import logging
import time

# DS
import numpy as np
import pandas as pd
import dill as pickle

# ML
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from Levenshtein import ratio

# home made functions from the src folder
from src.scrap_and_clean import init_data
from src.models import results_from_vec_matrix
from src.models import get_5_tags_from_matrix
from src.models import score_reduce
from src.models import plot_model
from src.models import get_topics

# logging configuration (see all outputs, even DEBUG or INFO)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

## <a id='toc1_1_'></a>[Loading the data](#toc0_)

In [14]:
df_pp = init_data()

INFO:root:✅ Preprocessed data loaded


In [15]:
print(f"DF shape: {df_pp.shape}")
display(df_pp.head())

DF shape: (49975, 10)


| | doc_bow | tags | score | answers | views | date | title_bow | title | body_bow | body |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | itms-91053 missing api declaration privacy sud... | ios app-store plist | 24 | 7 | 3092 | 2024-03-14 22:55:18 | itms-91053 missing api declaration privacy | ITMS-91053: Missing API declaration - Privacy | suddent successful build apple | `<p>Why am I all of a suddent getting this on s...` |
| 1 | builtin sorted slower list containing descendi... | python algorithm performance sorting time-comp... | 28 | 2 | 2699 | 2024-03-05 15:21:50 | builtin sorted slower list containing descendi... | Why is builtin sorted() slower for a list cont... | sorted four similar list list consistently tak... | `<p>I sorted four similar lists. List <code>d</...` |
| 2 | std :shared_mutex :unlock_shared block though ... | c++ windows multithreading stl | 26 | 5 | 1388 | 2024-03-01 23:09:59 | std :shared_mutex :unlock_shared block though ... | std::shared_mutex::unlock_shared() blocks even... | team encountered deadlock suspect bug windows ... | `<p>My team has encountered a deadlock that I s...` |
| 3 | rules nullptr init unique_ptr c++ compiles msv... | c++ visual-c++ language-lawyer unique-ptr c++23 | 15 | 1 | 490 | 2024-02-22 11:29:42 | rules nullptr init unique_ptr c++ | Did the rules for nullptr init of unique_ptr c... | compiles msvc c++ mode failes c++ mode current... | `<p>This code compiles with MSVC from VS 2022 i...` |
| 4 | order elf relocation applied specified conside... | c++ elf dynamic-linking abi | 16 | 1 | 1746 | 2024-02-19 21:42:03 | order elf relocation applied specified | Where is the order in which ELF relocations ar... | consider linux system use_message.cpp libmessa... | `<p>Consider the following two files on a Linux...` |


Keep only the columns that are strictly needed

In [16]:
df = df_pp[["doc_bow", "tags"]]
print(f"DF shape: {df.shape}")

DF shape: (49975, 2)


## <a id='toc1_2_'></a>[Data preparation](#toc0_)

Shared data

In [17]:
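# df only keeps "doc_bow" and "tags", so the other columns are read from df_pp.
# Tags are already stored as space-separated strings (see the preview above),
# so no join is needed.
tags = df_pp["tags"].to_list()
titles = df_pp["title_bow"].to_list()
bodies = df_pp["body_bow"].to_list()
corpus = df_pp["doc_bow"].to_list()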

Hold out 1,000 documents for testing; the rest is used to train the models.

In [18]:
random_state = 42
test_size = 1000

# X, y, train, test split
X_train, X_test, y_train, y_test = train_test_split(
    df["doc_bow"], df["tags"], test_size=test_size, random_state=random_state
)
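As a quick illustration of the split above: when `test_size` is an integer, `train_test_split` treats it as an absolute number of rows rather than a fraction. The sketch below uses a small made-up DataFrame (`toy` is hypothetical, not the project data) to show that behaviour.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real DataFrame (hypothetical data)
toy = pd.DataFrame(
    {
        "doc_bow": [f"doc {i}" for i in range(10)],
        "tags": [f"tag{i % 3}" for i in range(10)],
    }
)

# An integer test_size is an absolute number of rows, not a fraction
X_tr, X_te, y_tr, y_te = train_test_split(
    toy["doc_bow"], toy["tags"], test_size=3, random_state=42
)

split_sizes = (len(X_tr), len(X_te))
print(split_sizes)  # (7, 3)
```

With the real data, `test_size=1000` therefore leaves 49975 - 1000 = 48975 documents for training.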

## Vectorization

🚧 Count only the tokens that appear in at least 100 documents of the training corpus (`min_df=100`)

In [19]:
count_vectorizer = CountVectorizer(token_pattern=r"\S+", dtype=np.uint16, min_df=100)

X_train_cv = count_vectorizer.fit_transform(X_train)
X_test_cv = count_vectorizer.transform(X_test)
cv_names = count_vectorizer.get_feature_names_out()

In [20]:
cv_dict = dict(zip(cv_names, X_test_cv.sum(axis=0).A1.astype(np.uint16)))
cv_dict_sorted = {
    k: v for k, v in sorted(cv_dict.items(), key=lambda item: item[1], reverse=True)
}

In [21]:
cv_dict_sorted

{'get': 439,
 'one': 308,
 'user': 291,
 'app': 290,
 'function': 261,
 'time': 215,
 'data': 210,
 'make': 209,
 'class': 206,
 'object': 203,
 'type': 182,
 'api': 179,
 'set': 178,
 'call': 177,
 'method': 175,
 'version': 173,
 'find': 170,
 'image': 166,
 'project': 163,
 'server': 159,
 'solution': 157,
 'build': 152,
 'test': 150,
 'message': 144,
 'case': 142,
 'add': 140,
 'page': 136,
 'android': 135,
 'python': 132,
 'https': 131,
 'spring': 123,
 'ca': 119,
 'request': 118,
 'view': 118,
 'output': 117,
 'return': 115,
 'string': 114,
 'thread': 111,
 'edit': 103,
 'try': 103,
 'second': 99,
 'command': 97,
 'update': 97,
 'array': 96,
 'thing': 96,
 'java': 95,
 'ios': 94,
 'multiple': 94,
 'line': 93,
 'token': 93,
 'module': 92,
 'say': 92,
 'client': 89,
 'show': 89,
 'model': 88,
 'right': 88,
 'access': 87,
 'table': 87,
 'component': 86,
 'documentation': 86,
 'google': 85,
 'script': 85,
 'http': 84,
 'key': 84,
 'take': 84,
 'understand': 84,
 'default': 83,
 'info

In [22]:
len(cv_dict)

2562

Results on the test sample

In [10]:
# results = score_reduce(cv_names, X_test_cv, y_test)
# plot_model(results[0], results[2], results[3])

# Logistic regression

#### 🚧 Explain the model

#### 🚧 Memory blows up:

- `min_df` raised from 10 to 100 in `CountVectorizer`
- `test_size` (number of test samples) changed to 10k instead of 1k

==>> `Unable to allocate 11.5 GiB for an array with shape (40000, 38634) and data type float64` <<==

==>> first, improve the cleaning and see <<==
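The size in the allocation error can be checked by hand: a dense `float64` array needs 8 bytes per value. The short sketch below (the helper name `dense_gib` is made up for illustration) applies that arithmetic to the shape quoted above, and to the shape reported by the `logreg.fit` call further down.

```python
def dense_gib(n_rows: int, n_cols: int, itemsize: int = 8) -> float:
    """Size in GiB of a dense array (float64 uses 8 bytes per value)."""
    return n_rows * n_cols * itemsize / 2**30


first_error = round(dense_gib(40000, 38634), 1)
second_error = round(dense_gib(48975, 45185), 1)

print(first_error, second_error)  # 11.5 16.5
```

Note that in the second shape, 48975 matches the number of training documents, while 45185 is far larger than the 2562 features kept by the vectorizer. That suggests the second dimension is more likely the number of distinct classes (that is, distinct tag strings in `y_train`) than the vocabulary size, although this has not been verified here.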

In [23]:
logreg = LogisticRegression()
logreg.fit(X_train_cv, y_train)

MemoryError: Unable to allocate 16.5 GiB for an array with shape (48975, 45185) and data type float64
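One possible way around this, sketched here as an assumption rather than as the approach this notebook settles on, is to treat the problem as multi-label: one binary column per individual tag instead of one class per full tag string. The toy data below is hypothetical; `MultiLabelBinarizer` and `OneVsRestClassifier` are standard scikit-learn classes.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data, same "space-separated tags" format as df["tags"]
toy_docs = [
    "python list sorted slow",
    "python pandas dataframe merge",
    "c++ unique_ptr nullptr",
    "c++ thread mutex deadlock",
    "python thread lock",
    "c++ python binding",
]
toy_tags = [
    "python",
    "python pandas",
    "c++",
    "c++ multithreading",
    "python multithreading",
    "c++ python",
]

X_toy = CountVectorizer(token_pattern=r"\S+").fit_transform(toy_docs)

# Each full tag string as one class: one class per distinct combination
n_combinations = len(set(toy_tags))

# Multi-label view: one binary column per individual tag
mlb = MultiLabelBinarizer()
Y_toy = mlb.fit_transform([t.split() for t in toy_tags])
n_single_tags = len(mlb.classes_)

# One binary logistic regression per tag
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_toy, Y_toy)
pred_tags = mlb.inverse_transform(clf.predict(X_toy))

print(n_combinations, n_single_tags)  # 6 4
```

Even on this tiny example there are more tag combinations than individual tags; on the full dataset the gap is likely much wider, which is what would keep the per-class arrays small. Rare tags would probably still need to be filtered out before fitting.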

In [None]:
lr_pred = logreg.predict(X_test_cv)

In [None]:
# Save the predicted tags, keyed by the document index of the test set
lr_results = pd.DataFrame({"id": X_test.index, "tags": lr_pred})
lr_results.to_csv("lr_vectorized.csv", index=False)