**Table of contents**<a id='toc0_'></a>    
- [Setup](#toc1_)    
  - [Loading the data](#toc1_1_)    
- [FYI 🚧](#toc2_)    
- [🦄🦄 CHECKPOINT 🦄🦄](#toc3_)    
- [🚧 TODO](#toc4_)    
- [TF-IDF](#toc5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=3
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Setup](#toc0_)

In [66]:
# OS & env
import os
import logging

# DS
import numpy as np
import pandas as pd
import dill as pickle

# ML
from sklearn.feature_extraction.text import TfidfVectorizer

# homemade functions from the src folder
from src.scrap_and_clean import words_filter

# logging configuration (see all outputs, even DEBUG or INFO)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

## <a id='toc1_1_'></a>[Loading the data](#toc0_)

Preprocessed data:

In [2]:
if not os.path.exists("data/df_preprocessed.pkl"):
    logging.warning("No data: run EDA notebook first")
else:
    with open("data/df_preprocessed.pkl", "rb") as f:
        df = pickle.load(f)

Vocabulary:

In [3]:
if (
    not os.path.exists("data/bow_corpus.pkl")
    or not os.path.exists("data/bow_tags.pkl")
    or not os.path.exists("data/bow_title.pkl")
    or not os.path.exists("data/bow_body.pkl")
):
    logging.warning("Missing bag-of-words files: run EDA notebook first")
else:
    with open("data/bow_corpus.pkl", "rb") as b_c:
        bow_corpus = pickle.load(b_c)
    with open("data/bow_tags.pkl", "rb") as b_ta:
        bow_tags = pickle.load(b_ta)
    with open("data/bow_title.pkl", "rb") as b_ti:
        bow_title = pickle.load(b_ti)
    with open("data/bow_body.pkl", "rb") as b_b:
        bow_body = pickle.load(b_b)
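
The four bag-of-words files are loaded with the same check-then-load pattern, so that logic could be factored into a helper. The following is only a minimal sketch: `load_pickles` is a hypothetical name (not part of the `src` folder), and it uses the standard-library `pickle` where the notebook uses `dill` imported under that alias.

```python
import logging
import os
import pickle  # assumption: the notebook imports dill under this alias instead


def load_pickles(directory, names):
    """Load `<directory>/<name>.pkl` for each name; return None if any file is missing.

    Hypothetical helper, shown for illustration only.
    """
    paths = {name: os.path.join(directory, f"{name}.pkl") for name in names}
    missing = [name for name, path in paths.items() if not os.path.exists(path)]
    if missing:
        logging.warning("Missing files %s: run EDA notebook first", missing)
        return None

    loaded = {}
    for name, path in paths.items():
        with open(path, "rb") as f:
            loaded[name] = pickle.load(f)
    return loaded
```

With the layout used here, `load_pickles("data", ["bow_corpus", "bow_tags", "bow_title", "bow_body"])` would return a dict keyed by those four names, or `None` after logging which files are absent.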

Data overview:

In [18]:
print(f"Data shape: {df.shape}")
display(df.head())
print(f"Corpus contains {len(bow_corpus)} tokens:")
print(f"- tags: {len(bow_tags)} tokens")
print(f"- title: {len(bow_title)} tokens")
print(f"- body: {len(bow_body)} tokens")

Data shape: (49992, 9)


| | Title | Body | Tags | Score | AnswerCount | CreationDate | ViewCount | title_tokens | body_tokens |
|---|---|---|---|---|---|---|---|---|---|
| 0 | itms- : missing api declaration - privacy | why am i all of a suddent getting this on succ... | [ios, app-store, plist, appstore-approval, pri... | 24 | 7 | 2024-03-14 22:55:18 | 3092 | [itms, missing, api, declaration, privacy] | [suddent, getting, successful, builds, apple] |
| 1 | why is builtin sorted() slower for a list cont... | i sorted four similar lists. list consistently... | [python, algorithm, performance, sorting, time... | 28 | 2 | 2024-03-05 15:21:50 | 2699 | [builtin, sorted, slower, list, containing, de... | [sorted, four, similar, lists, list, consisten... |
| 2 | std::shared_mutex::unlock_shared() blocks even... | my team has encountered a deadlock that i susp... | [c++, windows, multithreading, stl, shared-lock] | 26 | 5 | 2024-03-01 23:09:59 | 1388 | [std, shared_mutex, unlock_shared, blocks, eve... | [team, encountered, deadlock, suspect, bug, wi... |
| 3 | did the rules for nullptr init of unique_ptr c... | this code compiles with msvc from vs in c++ mo... | [c++, visual-c++, language-lawyer, unique-ptr,... | 15 | 1 | 2024-02-22 11:29:42 | 490 | [rules, nullptr, init, unique_ptr, change, c++] | [code, compiles, msvc, c++, mode, failes, c++,... |
| 4 | where is the order in which elf relocations ar... | consider the following two files on a linux sy... | [c++, elf, dynamic-linking, abi, relocation] | 16 | 1 | 2024-02-19 21:42:03 | 1746 | [order, elf, relocations, applied, specified] | [consider, following, two, files, linux, syste... |


Corpus contains 110288 tokens:
- tags: 16975 tokens
- title: 27467 tokens
- body: 103820 tokens


# <a id='toc2_'></a>[FYI 🚧](#toc0_)

🚧 add / remove item

In [5]:
# items_to_keep = ["s", "t", "d"]
# keep_set, exclude_set = words_filter(items_to_keep, "add", keep_set, exclude_set)

# items_to_remove = [s for s in mutual if s not in items_to_keep]
# keep_set, exclude_set = words_filter(items_to_remove, "rm", keep_set, exclude_set)

# <a id='toc3_'></a>[🦄🦄 CHECKPOINT 🦄🦄](#toc0_)

# <a id='toc4_'></a>[🚧 TODO](#toc0_)

- move on to the models (unsupervised, then supervised, then word2vec; see below)
- look at NMF and LDA: their differences, strengths and use cases
- supervised: build a first simple model (logistic regression, SGD, ...) and work out how to predict several labels (also look at predict_proba)
- make predictions
- carry the exercise through to the end, then revisit
- check whether some word classes (for example adjectives) should be removed at some point to improve performance
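
For the TODO item on predicting several labels with a simple model, one possible starting point is a one-vs-rest logistic regression over TF-IDF features, thresholding `predict_proba` per tag. This is a sketch under assumptions, not the approach chosen in this notebook: the toy titles and tags are invented, and `predict_tags` and its `threshold` parameter are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# toy data, invented for illustration only
toy_titles = [
    "python list sorting slow",
    "python dict performance",
    "c++ unique_ptr nullptr init",
    "c++ shared_mutex deadlock",
    "python c++ binding performance",
    "sorting algorithm performance c++",
]
toy_tags = [["python"], ["python"], ["c++"], ["c++"], ["python", "c++"], ["c++"]]

toy_vectorizer = TfidfVectorizer(token_pattern=r"\S+")
X_toy = toy_vectorizer.fit_transform(toy_titles)

# one binary column per tag, so each title can carry several labels
mlb = MultiLabelBinarizer()
Y_toy = mlb.fit_transform(toy_tags)

# one independent logistic regression per tag
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X_toy, Y_toy)


def predict_tags(title, threshold=0.5):
    """Return every tag whose predicted probability reaches the threshold (hypothetical helper)."""
    proba = clf.predict_proba(toy_vectorizer.transform([title]))[0]
    return [tag for tag, p in zip(mlb.classes_, proba) if p >= threshold]


print(predict_tags("python sorting performance"))
```

Because each tag gets its own probability, lowering `threshold` trades precision for more suggested tags, which is one way to use `predict_proba` for multi-label output.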

# <a id='toc5_'></a>[TF-IDF](#toc0_)

In [9]:
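# rebuild one whitespace-separated string per title from its token list
corpus = df["title_tokens"].apply(" ".join).to_list()

# token_pattern=r"\S+" splits on whitespace only, so tokens such as "c++" or ".d.ts" stay intact
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names_out()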

array(['.a', '.d.ts', '.htaccess', ..., '≥3gb', 'ꭲᏻꮎꮅꮝꮤꮕ', 'ꮝꭶꮪꭹ'],
      dtype=object)

🚧 Work out which value to pick: the largest or the smallest?
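
One way to settle the question above is a toy corpus. With scikit-learn's default settings, a term that occurs in fewer documents gets a larger IDF, so within a title the largest weight marks the most distinctive term. The corpus below is invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus, invented: "c++" is in every title, "init" in two, "nullptr" in one
toy_corpus = [
    "c++ nullptr init",
    "c++ init coroutine",
    "c++ coroutine runtime",
    "c++ runtime",
]

toy_vectorizer = TfidfVectorizer(token_pattern=r"\S+")
toy_X = toy_vectorizer.fit_transform(toy_corpus)

vocab = toy_vectorizer.vocabulary_  # term -> column index
row = toy_X[0].toarray().ravel()    # dense weights of the first title

# print each term of the first title with its TF-IDF weight
for term in toy_corpus[0].split():
    print(f"{term}: {row[vocab[term]]:.3f}")
```

In the first title the rarest term, `nullptr`, receives the highest weight and the ubiquitous `c++` the lowest, so the value to keep for tag suggestion is the largest one.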

In [19]:
print(vectorizer.get_feature_names_out()[27374])
print(vectorizer.get_feature_names_out()[5850])

zero-allocation
deducing


In [23]:
print(X.shape)
print(corpus[3])
print(X[3])
print(corpus[6])
print(X[6])
print(corpus[7])
print(X[7])

(49992, 27467)
rules nullptr init unique_ptr change c++
  (0, 3130)	0.2834497346141111
  (0, 3539)	0.2899654868517161
  (0, 25379)	0.44353148817255206
  (0, 11024)	0.4337465763016716
  (0, 16103)	0.5227410752336996
  (0, 20711)	0.4212621027241501
possible make zero-allocation coroutine runtime c++
  (0, 20733)	0.38323367323801016
  (0, 4957)	0.44759039077813456
  (0, 27374)	0.6106012392508993
  (0, 13599)	0.30367593187981645
  (0, 18061)	0.322079623372116
  (0, 3130)	0.28983886042117574
combined c++ deducing conversion operator auto return type
  (0, 24937)	0.23851073947605095
  (0, 20389)	0.28451267433583277
  (0, 1961)	0.33663611732903503
  (0, 16656)	0.32927354874944426
  (0, 4853)	0.3606642271063403
  (0, 5850)	0.5117041064725818
  (0, 4271)	0.4259144731422853
  (0, 3130)	0.25903147332563975
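
Each printed line `(0, j)  w` reads: in this single-row matrix, column `j` (a vocabulary index) holds weight `w`. The sketch below rebuilds the first three entries of `X[3]` as a standalone SciPy CSR matrix to show where those numbers are stored. The three values are copied from the output above; everything else is illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

# three (column, weight) pairs copied from the printed X[3]; the rest is illustrative
cols = np.array([3130, 3539, 25379])
vals = np.array([0.2834497346141111, 0.2899654868517161, 0.44353148817255206])
row = csr_matrix((vals, (np.zeros(3, dtype=int), cols)), shape=(1, 27467))

print(row.indices)  # column indices of the stored (non-zero) entries
print(row.data)     # weights, aligned position by position with row.indices
print(row.nnz)      # number of stored entries

# pair each column index with its weight
pairs = dict(zip(row.indices.tolist(), row.data.tolist()))
```

Since `indices` and `data` are aligned, sorting one by the other is enough to rank a title's terms by weight.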


In [121]:
row = X[3]
d = row.data     # TF-IDF weights of the stored entries
i = row.indices  # vocabulary indices, aligned with d

print(d)
print(i)
print(row)

# argsort gives positions by ascending weight; negating d gives descending order
sorted_x = np.argsort(d)
print(sorted_x)
sorted_x = np.argsort(-d)
print(sorted_x)


def predict_idf(vectorizer, X, n_max):
    """Return up to n_max highest-weighted terms of a single-row sparse TF-IDF matrix X.

    Returns (preds, weights), each wrapped in a one-element list.
    """
    d, i = X.data, X.indices

    # positions of the stored entries, highest weight first
    order = np.argsort(-d)[:n_max]

    # highest weights
    weights = [d[order]]

    # corresponding words (the vocabulary is fetched once, not once per word)
    feature_names = vectorizer.get_feature_names_out()
    preds = [[feature_names[x] for x in i[order]]]

    return preds, weights


print(i[np.argsort(-d)][:min(len(i), 3)])

print(corpus[3])
predict_idf(vectorizer, X[3], 3)


# def predict_idf(vectorizer, X, n_max):
#     results = []
#     for i, e in enumerate(X.data):
#         results.append(vectorizer.get_feature_names_out()[x])
#     return results

# for x in i:
#     results.append(vectorizer.get_feature_names_out()[x])
# results

[0.28344973 0.28996549 0.44353149 0.43374658 0.52274108 0.4212621 ]
[ 3130  3539 25379 11024 16103 20711]
  (0, 3130)	0.2834497346141111
  (0, 3539)	0.2899654868517161
  (0, 25379)	0.44353148817255206
  (0, 11024)	0.4337465763016716
  (0, 16103)	0.5227410752336996
  (0, 20711)	0.4212621027241501
[0 1 5 3 2 4]
[4 2 3 5 1 0]
[16103 25379 11024]
rules nullptr init unique_ptr change c++


([['nullptr', 'unique_ptr', 'init']],
 [array([0.52274108, 0.44353149, 0.43374658])])

In [69]:
print(X.shape)

(49992, 27467)
