**Table of contents**<a id='toc0_'></a>    
- [Setup](#toc1_)
  - [Data retrieval and SQL query](#toc1_1_)
- [Data cleaning](#toc2_)
  - [Specific terms](#toc2_1_)
    - [Tags](#toc2_1_1_)
    - [Programming languages from Wikipedia](#toc2_1_2_)
    - [All domain terms](#toc2_1_3_)
    - [Stopwords](#toc2_1_4_)
    - [Terms both excluded and domain-specific](#toc2_1_5_)
    - [Punctuation](#toc2_1_6_)
  - [Titles](#toc2_2_)
    - [Cleaning](#toc2_2_1_)
    - [Tokenization](#toc2_2_2_)
  - [Body text](#toc2_3_)
  - [Removing empty rows](#toc2_4_)
  - [Preprocessed data](#toc2_5_)
    - [Visualization](#toc2_5_1_)
    - [Saving](#toc2_5_2_)
- [Vocabulary](#toc3_)
    - [🚧 To rewrite:](#toc3_1_1_)
  - [Titles](#toc3_2_)
  - [Body](#toc3_3_)
  - [Full vocabulary](#toc3_4_)
- [🚧 Statistics](#toc4_)
  - [Word distribution](#toc4_1_)
  - [Tag distribution](#toc4_2_)
  - [Programming language distribution](#toc4_3_)
  - [🚧 Word clouds](#toc4_4_)
- [Data update](#toc5_)
- [🦄🦄 CHECKPOINT 🦄🦄](#toc6_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=3
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Setup](#toc0_)

In [1]:
# OS & env
import os
import logging

# DS
import pandas as pd
import dill as pickle

# home made functions from the src folder
from src.scrap_and_clean import words_filter

# logging configuration (see all outputs, even DEBUG or INFO)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

## <a id='toc1_1_'></a>[Loading the data](#toc0_)

In [2]:
if not os.path.exists("data/df_preprocessed.pkl"):
    logging.warning("No data: run EDA notebook first")
else:
    with open("data/df_preprocessed.pkl", "rb") as f:
        df = pickle.load(f)

Data preview:

In [3]:
df

|  | Title | Body | Tags | Score | AnswerCount | CreationDate | ViewCount | title_tokens | body_tokens |
|---|---|---|---|---|---|---|---|---|---|
| 0 | itms- : missing api declaration - privacy | why am i all of a suddent getting this on succ... | [ios, app-store, plist, appstore-approval, pri... | 24 | 7 | 2024-03-14 22:55:18 | 3092 | [itms, missing, api, declaration, privacy] | [suddent, getting, successful, builds, apple] |
| 1 | why is builtin sorted() slower for a list cont... | i sorted four similar lists. list consistently... | [python, algorithm, performance, sorting, time... | 28 | 2 | 2024-03-05 15:21:50 | 2699 | [builtin, sorted, slower, list, containing, de... | [sorted, four, similar, lists, list, consisten... |
| 2 | std::shared_mutex::unlock_shared() blocks even... | my team has encountered a deadlock that i susp... | [c++, windows, multithreading, stl, shared-lock] | 26 | 5 | 2024-03-01 23:09:59 | 1388 | [std, shared_mutex, unlock_shared, blocks, eve... | [team, encountered, deadlock, suspect, bug, wi... |
| 3 | did the rules for nullptr init of unique_ptr c... | this code compiles with msvc from vs in c++ mo... | [c++, visual-c++, language-lawyer, unique-ptr,... | 15 | 1 | 2024-02-22 11:29:42 | 490 | [rules, nullptr, init, unique_ptr, change, c++] | [code, compiles, msvc, c++, mode, failes, c++,... |
| 4 | where is the order in which elf relocations ar... | consider the following two files on a linux sy... | [c++, elf, dynamic-linking, abi, relocation] | 16 | 1 | 2024-02-19 21:42:03 | 1746 | [order, elf, relocations, applied, specified] | [consider, following, two, files, linux, syste... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | reverse engineer assembly code to c code | i think this is actually a pretty simple probl... | [c, assembly, reverse-engineering, x86-64, con... | 10 | 3 | 2015-02-12 23:51:30 | 6771 | [reverse, engineer, assembly, code, c, code] | [think, actually, pretty, simple, problem, rev... |
| 49996 | combining random forest models in scikit learn | i have two randomforestclassifier models, and ... | [python, python-2.7, scikit-learn, classificat... | 21 | 2 | 2015-02-12 23:11:56 | 13071 | [combining, random, forest, models, scikit, le... | [two, randomforestclassifier, models, would, l... |
| 49997 | how can i get the primary color from my app th... | in my android java code, how can i reference t... | [android, android-xml, android-theme, android-... | 28 | 2 | 2015-02-12 22:58:22 | 20107 | [get, primary, color, app, theme] | [android, java, code, reference, color, colorp... |
| 49998 | cors settings for iis . | how can i convert the following code for use i... | [asp.net, iis, cors, web-config, iis-7.5] | 12 | 2 | 2015-02-12 21:53:34 | 56289 | [cors, settings, iis] | [convert, following, code, use, web.config, ii... |


# FYI 🚧

In [4]:
# items_to_keep = ["s", "t", "d"]
# keep_set, exclude_set = words_filter(items_to_keep, "add", keep_set, exclude_set)

# items_to_remove = [s for s in mutual if s not in items_to_keep]
# keep_set, exclude_set = words_filter(items_to_remove, "rm", keep_set, exclude_set)

# <a id='toc3_'></a>[Vocabulary](#toc0_)

### <a id='toc3_1_1_'></a>[🚧 To rewrite:](#toc0_)

⚠️ NLTK's `word_tokenize` splits "#" from the preceding word, which is a problem because "#" is part of several programming language names that are very likely to be used as tags.

One workaround is a multi-word tokenizer such as NLTK's `MWETokenizer`, but it alters the output: ("C", "#") → "C_#".

The most consistent approach is to check whether the split tokens form a programming language name, and to merge them only in that case.
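The "merge only for languages" idea can be sketched as follows. This is a minimal illustration, not the notebook's implementation: `merge_language_tokens` and the `languages` set are hypothetical names, and the tokens are assumed to be already lowercased.

```python
def merge_language_tokens(tokens, languages):
    """Re-attach a lone "#" to the previous token when the pair is a known language.

    `languages` is an assumed set of lowercase language names (for example
    built from the tag list); every other "#" is left untouched.
    """
    merged = []
    for token in tokens:
        if token == "#" and merged and merged[-1] + "#" in languages:
            # e.g. "c" followed by "#" becomes "c#"
            merged[-1] = merged[-1] + "#"
        else:
            merged.append(token)
    return merged


languages = {"c#", "f#"}
tokens = ["c", "#", "vs", "f", "#", "issue", "#", "42"]
result = merge_language_tokens(tokens, languages)
print(result)
```

The "#" after "issue" stays a separate token because "issue#" is not in the language set, so ordinary text is not altered.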

Three separate *bags-of-words* (abbreviated *BOW*) are created, each containing the whole vocabulary of a corpus:
- titles
- body
- global

## <a id='toc3_2_'></a>[Titles](#toc0_)

In [5]:
# collect every distinct token found in the title token lists
bow_title = {token for tokens in df["title_tokens"] for token in tokens}
print(len(bow_title), "tokens:\n", list(bow_title)[:10])

27467 tokens:
 ['composer', 'logout', 'channels', 'threat', 'requested', 'rs232', 'clock', 'setup', 'fs', 'generic']


## <a id='toc3_3_'></a>[Body](#toc0_)

In [6]:
# collect every distinct token found in the body token lists
bow_body = {token for tokens in df["body_tokens"] for token in tokens}
print(len(bow_body), "tokens:\n", list(bow_body)[:10])

103820 tokens:
 ['schéma', 'executors.java', 'blueimp.github.io', 'logout', 'channels', 'gaurantees', 'declined', 'ccpizza', 'requested', 'clock']


## <a id='toc3_4_'></a>[Full vocabulary](#toc0_)

In [7]:
# full vocabulary: union of the title and body sets
rain_bow = bow_title | bow_body
print(
    f"Total: {len(rain_bow)} tokens\n({len(bow_title)} from title and {len(bow_body)} from body)"
)

Total: 110288 tokens
(27467 from title and 103820 from body)
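
The total is lower than 27467 + 103820 = 131287 because a token present in both titles and bodies is counted only once in the union. By inclusion-exclusion, 131287 - 110288 = 20999 tokens are shared. A small sketch on toy sets (the `_demo` names and their contents are made up for illustration):

```python
bow_title_demo = {"python", "list", "sorted"}
bow_body_demo = {"python", "list", "deadlock", "mutex"}

union_demo = bow_title_demo | bow_body_demo   # tokens seen anywhere
shared_demo = bow_title_demo & bow_body_demo  # tokens seen in both corpora

# inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
expected_total = len(bow_title_demo) + len(bow_body_demo) - len(shared_demo)
print(len(union_demo), "tokens in total,", len(shared_demo), "shared")
```

On the real data, `len(bow_title & bow_body)` gives the number of shared tokens directly.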


# <a id='toc6_'></a>[🦄🦄 CHECKPOINT 🦄🦄](#toc0_)

# TF-IDF

In [8]:
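from sklearn.feature_extraction.text import TfidfVectorizer

# titles are already tokenized: rebuild one whitespace-joined string per document
corpus = df["title_tokens"].apply(" ".join).to_list()

# token_pattern=r"\S+" splits on whitespace only, so tokens such as "c++" are kept intact
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names_out()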

array(['.a', '.d.ts', '.htaccess', ..., '≥3gb', 'ꭲᏻꮎꮅꮝꮤꮕ', 'ꮝꭶꮪꭹ'],
      dtype=object)
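
For reference, with its default settings `TfidfVectorizer` multiplies the raw term count by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, then scales each row to unit L2 norm. The pure-Python sketch below reproduces that computation on a three-title toy corpus; the corpus and the `tfidf_rows` helper are illustrative and not part of the notebook.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """TF-IDF with smoothed idf and L2 row normalisation (scikit-learn's defaults)."""
    n_docs = len(docs)
    # document frequency: number of documents containing each token
    doc_freq = Counter(token for doc in docs for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}
    rows = []
    for doc in docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        # fall back to 1.0 so that an empty document does not divide by zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


docs = [
    ["rules", "nullptr", "c++"],
    ["coroutine", "runtime", "c++"],
    ["conversion", "operator", "c++"],
]
rows = tfidf_rows(docs)
print(rows[0])
```

The token shared by every title (`c++`) receives the smallest weight in each row, while the tokens that occur in a single title receive the largest.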

🚧 To figure out: which value should be picked, the largest or the smallest?

In [9]:
print(vectorizer.get_feature_names_out()[27374])

zero-allocation


In [10]:
print(X.shape)
# show the title and its non-zero TF-IDF weights for a few documents
for i in (3, 6, 7):
    print(corpus[i])
    print(X[i])

(49992, 27467)
rules nullptr init unique_ptr change c++
  (0, 3130)	0.2834497346141111
  (0, 3539)	0.2899654868517161
  (0, 25379)	0.44353148817255206
  (0, 11024)	0.4337465763016716
  (0, 16103)	0.5227410752336996
  (0, 20711)	0.4212621027241501
possible make zero-allocation coroutine runtime c++
  (0, 20733)	0.38323367323801016
  (0, 4957)	0.44759039077813456
  (0, 27374)	0.6106012392508993
  (0, 13599)	0.30367593187981645
  (0, 18061)	0.322079623372116
  (0, 3130)	0.28983886042117574
combined c++ deducing conversion operator auto return type
  (0, 24937)	0.23851073947605095
  (0, 20389)	0.28451267433583277
  (0, 1961)	0.33663611732903503
  (0, 16656)	0.32927354874944426
  (0, 4853)	0.3606642271063403
  (0, 5850)	0.5117041064725818
  (0, 4271)	0.4259144731422853
  (0, 3130)	0.25903147332563975


