<a href="https://colab.research.google.com/github/wikistat/AI-Frameworks/blob/master/Text/2_words_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [IA Frameworks](https://github.com/wikistat/AI-Frameworks) - Natural Language Processing (NLP)

<center>
<a href="http://www.insa-toulouse.fr/" ><img src="https://www.insa-toulouse.fr/skins/Insa-v2/resources/img/logo-insa.jpg" style="float:left; max-width: 320px; display: inline" alt="INSA"/></a> 
<a href="https://github.com/wikistat" ><img src="https://avatars0.githubusercontent.com/u/20927455?s=200&v=4" width=400, style="max-width: 100px; display: inline"  alt="Wikistat"/></a>
<a href="http://www.math.univ-toulouse.fr/" ><img src="https://perso.math.univ-toulouse.fr/riscope/files/2017/06/IMT.jpg" width=300,  style="float:right;  display: inline" alt="IMT"/> </a>
</center>

# Data : Cdiscount's product description.

This dataset was released by Cdiscount for a Kaggle-style data competition on the French website [datascience.net](https://www.datascience.net/fr/challenge). <br>
The test dataset of this competition has not been released, so we use a subset of 1M products from the original training dataset (more than 15M rows) throughout the **Text Processing** lab.<br>
The objective of this competition was to classify the text descriptions of products into the categories that make up the navigation tree of the Cdiscount website. The tree contains 4,733 categories organized within 44 meta-categories. <br>

The objective of this lab is not to win the competition, so we will only use the meta-categories.

# Part 2 : Word embeddings. Application to text classification and semi-supervised learning.

In this second notebook we will study three word embedding methods:

* [Word2Vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
* [FastText](https://arxiv.org/pdf/1607.04606.pdf)
* [Glove](https://nlp.stanford.edu/pubs/glove.pdf)

For each of these three methods we will:

* Study their characteristics
* Explore the embeddings they produce
* Check how they perform on a classification problem
* Check how they can help when only a few labeled examples are available.

# Files & Data (Google Colab)

If you're running this notebook on Google Colab, you do not have access to the `data` or `solution` folders you would get by cloning the repository locally. 

The following lines build the folders and download the files you need for this lab.

**WARNING 1** Do not run these lines locally.<br>
**WARNING 2** The magic command `%load` does not work on Google Colab; you will have to copy-paste the solutions into the notebook.

In [None]:
! mkdir data
! wget -P data https://github.com/wikistat/AI-Frameworks/raw/master/Text/data/cdiscount_test.csv.zip
! wget -P data https://github.com/wikistat/AI-Frameworks/raw/master/Text/data/cdiscount_train.csv.zip
! mkdir data/metadata
! wget -P data/metadata https://github.com/wikistat/AI-Frameworks/raw/master/Text/data/metadata/metadata_1.pkl
! wget -P data/metadata https://github.com/wikistat/AI-Frameworks/raw/master/Text/data/metadata/metadata_2.pkl
! wget -P data/metadata https://github.com/wikistat/AI-Frameworks/raw/master/Text/data/metadata/metadata_few_labeled_dataset.pkl
! mkdir data/w2v_model
! mkdir solution
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/w2v_homme.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/w2v_combination.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/w2v_predict_output.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/get_feature_mean.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/clean.py

! wget -P . https://github.com/wikistat/AI-Frameworks/raw/master/Text/vectorizer.py
! wget -P . https://github.com/wikistat/AI-Frameworks/raw/master/Text/ml_model.py
! wget -P . https://github.com/wikistat/AI-Frameworks/raw/master/Text/word_embedding.py

# Libraries

In [None]:
# Import the libraries used
import time
import pandas as pd
import numpy as np
import collections
import pickle
import itertools
import os
import nltk
import warnings
import plotly.offline as pof
import plotly.graph_objects as go
warnings.filterwarnings('ignore')
import sklearn.metrics as smet

import sklearn.model_selection as sms
from solution.clean import CleanText

You might need to download the nltk stopwords if you didn't do it in the previous notebook or if you're running on Google Colab.

In [None]:
nltk.download("stopwords")

# Load Data

We load the train and test data and generate the same cleaned columns and the same train/validation split as in part 1.

In [None]:
ct = CleanText()
data = pd.read_csv("data/cdiscount_train.csv.zip",sep=",", nrows=100000)
ct.clean_df_column(data, "Description", "Description_cleaned")
print("The train dataset is composed of %d lines" %data.shape[0])
data.head(5)

In [None]:
data_test = pd.read_csv("data/cdiscount_test.csv.zip",sep=",")
ct.clean_df_column(data_test, "Description", "Description_cleaned")
print("The test dataset is composed of %d lines" %data_test.shape[0])
data_test.head(5)

# Word2Vec

In this part, we will build `Word2Vec` models with the [**gensim**](https://radimrehurek.com/gensim/index.html) Python library.

In [None]:
import gensim

### Build Word2Vec model

The `gensim.models.Word2Vec` class is used to build a Word2Vec model.

In [None]:
gensim.models.Word2Vec?

Like many machine learning models, `Word2Vec` has many parameters to set. Here are the arguments that we will fix:


* `features_dimension` = 300 : the dimension of the feature space (the hidden layer during training).
* `min_count` = 1 : the minimum number of occurrences a token needs to be included in the model.
* `window` = 5 : the maximum distance between a target word and the other words of the sentence that are considered as its neighbors.
* `hs` = 0 
* `negative` = 10
* `iter` = 10 (best results after testing 5, 10, 15, 20, 25, 30)

**Q** What are the arguments *hs* and *negative* for? What do the values set for these arguments imply?

In [None]:
features_dimension = 300
min_count = 1
window = 5
hs = 0
negative = 10

The model takes a list of tokenized descriptions (each one a list of tokens) as input.

In [None]:
array_token = [line.split(" ") for line in data["Description_cleaned"].values]
test_array_token = [line.split(" ") for line in data_test["Description_cleaned"].values]
array_token[0]

We will train two models with the help of the `WordEmbedding` class defined in the `word_embedding.py` file:

* One **skip-gram** model, sg = 1
* One **CBOW** model, sg = 0
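
To make the difference between the two modes concrete, here is a minimal pure-Python sketch of the training examples each one derives from a tokenized description. The helper names `skipgram_pairs` and `cbow_examples` and the toy tokens are hypothetical and only serve as an illustration; actual implementations such as gensim may additionally shrink the window at random or subsample frequent words.

```python
def skipgram_pairs(tokens, window):
    """Skip-gram: one (center, context) pair per neighbor inside the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """CBOW: one (context_words, center) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, center))
    return examples


# Toy stemmed description (hypothetical tokens)
tokens = ["coqu", "telephon", "silicon", "noir"]
sg_pairs = skipgram_pairs(tokens, window=1)
cbow = cbow_examples(tokens, window=1)
print(len(sg_pairs), "skip-gram pairs;", len(cbow), "CBOW examples")
```

Skip-gram produces one prediction per (center, neighbor) pair, whereas CBOW pools the whole context into a single prediction per position.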

In [None]:
from word_embedding import WordEmbedding

we_sg = WordEmbedding(word_embedding_type = "word2vec", 
                      args = dict(sentences = array_token, sg=1, hs=hs, negative=negative, min_count=min_count, size=features_dimension, window = window, iter=10))
model_sg, training_time_sg = we_sg.train()
print("Model Skip-gram trained in %.2f minutes"%(training_time_sg/60))
model_sg.save("data/w2v_model/model_sg_100k")

we_cbow = WordEmbedding(word_embedding_type = "word2vec", 
                      args = dict(sentences = array_token, sg=0, hs=hs, negative=negative, min_count=min_count, size=features_dimension, window = window, iter=10))
model_cbow, training_time_cbow = we_cbow.train()
print("Model CBOW trained in %.2f minutes"%(training_time_cbow/60))
model_cbow.save("data/w2v_model/model_cbow_100k")



**Q** Why don't we split the data into a training and a validation dataset before training the models?

**Q** What can you say about the difference in training time between these two models? How do you explain it?

### Pre-Trained Model

As with convolutional models, pre-trained models are available on the internet. 
One of the most famous is probably the [`GoogleNewsVectors`](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit) model, which was trained on about 100 billion words of Google News articles. However, this model is in English and can't be used for the Cdiscount dataset.


We will use here a model from the following git project: [https://github.com/Kyubyong/wordvectors](https://github.com/Kyubyong/wordvectors), where the model was trained on 1 GB of Wikipedia articles in **Skip-Gram** mode.

You can download it by clicking on this [link](https://drive.google.com/file/d/0B0ZXk88koS2KM0pVTktxdG15TkE/view). Unzip it and place it in the data folder so that the model is located at *data/fr/fr.bin*.

**On Google Colab you can run the cell below to get the online model.**

In [None]:
! gdown https://drive.google.com/uc?id=0B0ZXk88koS2KM0pVTktxdG15TkE
! mv fr.zip data/
! unzip data/fr.zip -d data/

In [None]:
model_pretrained_dir = "data/fr/fr.bin"
model_pretrained = gensim.models.Word2Vec.load(model_pretrained_dir)

### Model Property


We will now compare some properties of the three Word2Vec models we have (*CBOW*, *Skip-Gram* and the pre-trained *online* model).

*The models we trained were learned on cleaned, stemmed tokens. Hence, we will need stemmed words to test their properties.*

In [None]:
import nltk 
stemmer=nltk.stem.SnowballStemmer('french')

### Most similar word

The `most_similar` function from **gensim** retrieves the words most similar to a given word or combination of words.

**Q** Using this [documentation](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar), answer the following questions:
* What is the similarity measure used?
* In which space is it computed? 
* How does the function work when several words are passed as parameters?


#### One Word

**Exercise** For each of the three models, display the output of `most_similar` for the word `homme`.

In [None]:
# %load solution/w2v_homme.py

**Q** Compare the outputs of the function for the models learned on Cdiscount and for the pre-trained model. What can you say about the quality of these outputs?

**Q** What can you say about the outputs of the two models learned on Cdiscount? 

**Exercise** Now display the output of the `most_similar` function for the word *femme*. 

**Exercise** Now display the output of the `most_similar` function for words specifically related to the Cdiscount dataset (e.g. *xbox*, *pantalon*, ...).

#### Word Combination

**Exercise** For each of the three models, display the output of `most_similar` for the word combination `femme` + `roi` - `homme` (use the *positive* and *negative* arguments of the function). 
Comment on the quality of the outputs.
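
As a rough illustration of what such a query computes, here is a minimal NumPy sketch on hand-made vectors. The toy vectors and the helper `toy_most_similar` are hypothetical (they do not come from any trained model), and the sketch assumes the behaviour described in the gensim documentation: positive words contribute positively and negative words negatively to a mean vector, which is then compared to every other word by cosine similarity.

```python
import numpy as np

# Hand-made 3-dimensional vectors (hypothetical values, for illustration only)
toy_vectors = {
    "homme": np.array([1.0, 0.0, 0.0]),
    "femme": np.array([0.0, 1.0, 0.0]),
    "roi": np.array([1.0, 0.0, 1.0]),
    "reine": np.array([0.0, 1.0, 1.0]),
    "pantalon": np.array([0.5, 0.5, -1.0]),
}


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def toy_most_similar(vectors, positive, negative, topn=3):
    # Mean of unit vectors: weight +1 for positive words, -1 for negative words
    signed = [unit(vectors[w]) for w in positive] + [-unit(vectors[w]) for w in negative]
    query = unit(np.mean(signed, axis=0))
    # Cosine similarity with every word that is not part of the query
    scores = {w: float(unit(v) @ query)
              for w, v in vectors.items() if w not in positive + negative}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]


result = toy_most_similar(toy_vectors, positive=["femme", "roi"], negative=["homme"])
print(result)
```

On these toy vectors the best match is *reine*; with real models the ranking depends entirely on the training corpus.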


In [None]:
# %load solution/w2v_combination.py

**Exercise** Test other combinations.

#### Predict the output word

The `predict_output_word` function of **gensim** predicts a word from a context word or a combination of context words. <br>

**Exercise** For the three models, display a prediction from common words (*homme*, *femme*) or from words specifically related to the Cdiscount dataset (*coque*-*de*-*téléphone*). 

In [None]:
# %load solution/w2v_predict_output.py

### Build Features

We will now build feature matrices from the **Word2Vec** models we just trained, in order to predict product categories.

A trained model gives a vector in the feature space for each word `x` with the following command:

In [None]:
# Word vectors are accessed through the `wv` attribute of the trained model
x_feature = model_sg.wv['homm']
print(x_feature.shape)
x_feature[:10]

In our problem, the product descriptions we want to categorize are represented by lists of cleaned tokens, as in the previous notebook. <br>
From those lists, there are various ways to represent a description with the **Word2Vec** model:

1. Mean of the feature vectors of the tokens in the description. 
2. Weighted mean of the feature vectors of the tokens in the description, where the weights are the numbers of occurrences of each token within the description.
3. Weighted mean of the feature vectors of the tokens in the description, where the weights are `TFIDF` weights.
4. etc.

We will use the second solution here.
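
The first two options can be sketched as follows on a toy embedding. The dictionary `toy_embedding` and both helper functions are hypothetical stand-ins for a trained model, and the sketch assumes that option 1 counts each distinct token once while option 2 weights it by its number of occurrences; tokens missing from the vocabulary are simply ignored.

```python
from collections import Counter

import numpy as np

# Hypothetical 2-dimensional embedding standing in for a trained model
toy_embedding = {
    "coqu": np.array([1.0, 0.0]),
    "telephon": np.array([0.0, 1.0]),
    "noir": np.array([1.0, 1.0]),
}


def unweighted_mean(tokens, embedding, dim=2):
    """Option 1: each distinct known token counts once."""
    known = sorted(t for t in set(tokens) if t in embedding)
    if not known:
        return np.zeros(dim)
    return np.mean([embedding[t] for t in known], axis=0)


def count_weighted_mean(tokens, embedding, dim=2):
    """Option 2: each known token is weighted by its number of occurrences."""
    counts = Counter(t for t in tokens if t in embedding)
    if not counts:
        return np.zeros(dim)
    vectors = np.array([embedding[t] for t in counts])
    return np.average(vectors, axis=0, weights=list(counts.values()))


description = ["coqu", "telephon", "coqu", "inconnu"]  # "inconnu" is out of vocabulary
mean_vec = unweighted_mean(description, toy_embedding)
weighted_vec = count_weighted_mean(description, toy_embedding)
print(mean_vec, weighted_vec)
```

Returning a zero vector when no token is known is one possible convention; it keeps the feature matrix rectangular even for descriptions made only of unseen words.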

Let's first split the data (with `random_state=42`) to obtain the same split as in the first notebook.

In [None]:
data_train, data_valid = sms.train_test_split(data, test_size=0.1, random_state=42)
train_array_token = [line.split(" ") for line in data_train["Description_cleaned"].values]
valid_array_token = [line.split(" ") for line in data_valid["Description_cleaned"].values]
test_array_token = [line.split(" ") for line in data_test["Description_cleaned"].values]

**Exercise** Write a function that computes the weighted mean of the feature vectors of the tokens within a description.

In [None]:
# %load solution/get_feature_mean.py

In [None]:
token_description = train_array_token[0]
get_features_mean(token_description, model_sg).shape

For ease of use, the functions that build vectors from a list of tokens have been written within the `WordEmbedding` class of the `word_embedding.py` file:
* `get_features_mean` : returns the mean vector, in the embedding space, of all the tokens that make up a line.
* `get_matrix_features_means` : applies `get_features_mean` to every element of the *X* matrix.

#### CBOW

In [None]:
X_embedded_train_cbow, embedded_conversion_train_time_cbow = WordEmbedding.get_matrix_features_means(train_array_token, model_cbow)
X_embedded_valid_cbow, embedded_conversion_valid_time_cbow = WordEmbedding.get_matrix_features_means(valid_array_token, model_cbow)
X_embedded_test_cbow, embedded_conversion_test_time_cbow = WordEmbedding.get_matrix_features_means(test_array_token, model_cbow)

#### Skip-Gram

In [None]:
X_embedded_train_sg, embedded_conversion_train_time_sg = WordEmbedding.get_matrix_features_means(train_array_token, model_sg)
X_embedded_valid_sg, embedded_conversion_valid_time_sg = WordEmbedding.get_matrix_features_means(valid_array_token, model_sg)
X_embedded_test_sg, embedded_conversion_test_time_sg = WordEmbedding.get_matrix_features_means(test_array_token, model_sg)

#### Online model

In [None]:
ct = CleanText(apply_stemming=False)
ct.clean_df_column(data_train, "Description", "Description_cleaned_no_stem")
ct.clean_df_column(data_valid, "Description", "Description_cleaned_no_stem")
ct.clean_df_column(data_test, "Description", "Description_cleaned_no_stem")


train_array_token_nostem = [line.split(" ") for line in data_train["Description_cleaned_no_stem"].values]
valid_array_token_nostem = [line.split(" ") for line in data_valid["Description_cleaned_no_stem"].values]
test_array_token_nostem = [line.split(" ") for line in data_test["Description_cleaned_no_stem"].values]

In [None]:
X_embedded_train_pretrained, embedded_conversion_train_time_pretrained = WordEmbedding.get_matrix_features_means(train_array_token_nostem, model_pretrained)
X_embedded_valid_pretrained, embedded_conversion_valid_time_pretrained = WordEmbedding.get_matrix_features_means(valid_array_token_nostem, model_pretrained)
X_embedded_test_pretrained, embedded_conversion_test_time_pretrained = WordEmbedding.get_matrix_features_means(test_array_token_nostem, model_pretrained)

Now that we have computed the features, let's train various classification models on them (the same ones as in the previous notebook).

The following code trains these models. Once again, the models have already been trained, and the results are saved in the `data/metadata/metadata_2.pkl` file.

In [None]:
FORCE_TO_RUN = False

from ml_model import MlModel

we_models = [[model_sg, "skip-gram"],
            [model_cbow, "cbow"],
             [model_pretrained, "pretrained"]]

model_parameters = [["lr", {"C":[0.1, 1, 10]}],
                     ["rf", {"n_estimators" : [100,500]}],
                     ["mlp", {"hidden_layer_sizes" : [128, 256]}]
                      ]

if FORCE_TO_RUN:
    metadata = {}
    for we_model, we_name in we_models:
        train_token = train_array_token if we_name !="pretrained" else train_array_token_nostem
        X_train, embedded_conversion_train_time = WordEmbedding.get_matrix_features_means(train_token, we_model)
        Y_train = data_train.Categorie1.values
        valid_token = valid_array_token if we_name !="pretrained" else valid_array_token_nostem
        X_valid, embedded_conversion_valid_time = WordEmbedding.get_matrix_features_means(valid_token, we_model)
        Y_valid = data_valid.Categorie1.values
        test_token = test_array_token if we_name !="pretrained" else test_array_token_nostem
        X_test, embedded_conversion_test_time = WordEmbedding.get_matrix_features_means(test_token, we_model)
        Y_test = data_test.Categorie1.values

        for ml_model_name, param_grid in model_parameters:
            ml_class = MlModel(ml_model_name=ml_model_name, param_grid=param_grid)
            best_model, best_metadata = ml_class.train_all_parameters(X_train, Y_train, X_valid, Y_valid, save_metadata=True)
            Y_pred_test = best_model.predict(X_test)
            accuracy_test = best_model.score(X_test, Y_test)
            # scikit-learn metrics expect (y_true, y_pred), in that order
            f1_macro_score_test = smet.f1_score(Y_test, Y_pred_test, average='macro')
            balanced_accuracy_test = smet.balanced_accuracy_score(Y_test, Y_pred_test)
            best_metadata.update({"balanced_accuracy_test":balanced_accuracy_test,"accuracy_test": accuracy_test, "f1_macro_score_test":f1_macro_score_test, "embedded_conversion_train_time": embedded_conversion_train_time, "embedded_conversion_valid_time": embedded_conversion_valid_time, "embedded_conversion_test_time": embedded_conversion_test_time})
            metadata.update({(we_name, "",  ml_model_name): best_metadata})
    pickle.dump(metadata, open("data/metadata/metadata_2.pkl","wb"))

In [None]:
metadata = pickle.load(open("data/metadata/metadata_1.pkl","rb"))
metadata.update(pickle.load(open("data/metadata/metadata_2.pkl","rb")))

# Create figure
fig = go.Figure()

# Add traces
metrics = ["accuracy_train",'accuracy_valid', "accuracy_test", "learning_time", "predict_time","balanced_accuracy_test","balanced_accuracy_valid","balanced_accuracy_train", "f1_macro_score_test", "f1_macro_score_valid", "f1_macro_score_train"]
N_metrics = len(metrics)
method_ml_names = ['lr','rf','mlp']
N_method_ml_names = len(method_ml_names)

buttons = []
for i_metric, metric in enumerate(metrics):
    for method_ml_name in method_ml_names:
        fig.add_trace(
            go.Scatter(
                x=[k[0]+"_"+str(k[1]) for k,v in metadata.items() if v['name']==method_ml_name],
                y=[0 if ( not(k[0] in ("skip-gram","cbow", "pretrained")) and metric.startswith("embedded")) else v[metric] for k,v in metadata.items() if v['name']==method_ml_name],
                mode="markers",
                marker=dict(size=10),
                name = method_ml_name,
            )
        )
    buttons.append(
            dict(label=metric,
                 method="update",
                 args=[{"visible": [True if i in [i_metric*N_method_ml_names + k for k in range(N_method_ml_names)] else False for i in range(N_method_ml_names * N_metrics)]},
                       {"title": metric}]))
    

# Update remaining layout properties
fig.update_layout(
    title_text=metric,
    updatemenus=[
        dict(
            active=1,
            buttons=buttons
        )]
)

fig.show()


**Q** What can you say about the learning times for the different combinations of ML model and vectorization/embedding? 

**Q** What can you say about the values of the different metrics (`accuracy`, `balanced_accuracy` and `f1_macro_score`) for these combinations? Do these results seem logical to you?  

**Q** What can you say about the optimized metrics?

**Q** Looking at the best parameters selected for each combination, what would you propose to improve these results? 



## Semi-supervised learning

In the previous part, we trained two word embedding models on the training dataset of 100,000 rows (for ease of exploration and to limit running time). 

We have seen that Word2Vec does not necessarily perform better than the simple vectorizer models. <br>
But word embedding models require a lot of data to learn similarities between words. We used a pre-trained model, but it appears that our dataset is not really a natural **language dataset**. 

However, one of the advantages of word embedding models is that they do not require labeled data to be trained. <br>
Hence we will consider that we have the complete original training dataset of the Cdiscount competition, composed of 15M rows, and treat it as an unlabeled dataset.<br> 
With the script `train_w2V_all_data.csv.py`, we trained two Word2Vec models on the complete dataset, with the same parameters as the models trained above. <br> 

This script takes several hours to run. **You do not have to run it**. If you are interested in running it yourself, you can ask your teacher for the complete dataset. <br> 

These models can be downloaded from the following links:

* full model sg : [link](https://we.tl/t-eEjWF9ZRc7)
* full model cbow:  [link](https://we.tl/t-zZLQV5Ht7E)

Download the models and move them to the `data/w2v_model` folder.

You can use the lines below if you're using Google Colab.

In [None]:
! gdown https://drive.google.com/uc?id=1uOIu76Ye2V2zpaAiZ5dYodXa5FoO1t2b
! gdown https://drive.google.com/uc?id=1wm-AU8ygiPufIzAkmD3assX7JtKNMaKD
! gdown https://drive.google.com/uc?id=1cYbvhLYhH2NmcZAYivsvRSiROddTka3g
! gdown https://drive.google.com/uc?id=1PBUxn97zmjtkqU7nJnU-86dU7l26lTXA

Let's see how these newly trained models perform in the different use cases.

In [None]:
from gensim.models import KeyedVectors
model_sg_full = KeyedVectors.load("data/w2v_model/full_model_sg")
model_cbow_full = KeyedVectors.load("data/w2v_model/full_model_cbow")

**Q** Do these models behave differently on the functions tested above, such as `most_similar`, `predict_output_word`, etc.?

### Product classification

In [None]:
FORCE_TO_RUN=False

from ml_model import MlModel

we_models = [[model_sg_full, "skip-gram"],
            [model_cbow_full, "cbow"]]

model_parameters = [["lr", {"C":[0.1, 1, 10]}],
                     ["rf", {"n_estimators" : [100,500]}],
                     ["mlp", {"hidden_layer_sizes" : [128, 256]}]
                      ]

if FORCE_TO_RUN:
    metadata = {}
    for we_model, we_name in we_models:
        train_token = train_array_token if we_name !="pretrained" else train_array_token_nostem
        X_train, embedded_conversion_train_time = WordEmbedding.get_matrix_features_means(train_token, we_model)
        Y_train = data_train.Categorie1.values
        valid_token = valid_array_token if we_name !="pretrained" else valid_array_token_nostem
        X_valid, embedded_conversion_valid_time = WordEmbedding.get_matrix_features_means(valid_token, we_model)
        Y_valid = data_valid.Categorie1.values
        test_token = test_array_token if we_name !="pretrained" else test_array_token_nostem
        X_test, embedded_conversion_test_time = WordEmbedding.get_matrix_features_means(test_token, we_model)
        Y_test = data_test.Categorie1.values

        for ml_model_name, param_grid in model_parameters:
            ml_class = MlModel(ml_model_name=ml_model_name, param_grid=param_grid)
            best_model, best_metadata = ml_class.train_all_parameters(X_train, Y_train, X_valid, Y_valid, save_metadata=True)
            Y_pred_test = best_model.predict(X_test)
            accuracy_test = best_model.score(X_test, Y_test)
            # scikit-learn metrics expect (y_true, y_pred), in that order
            f1_macro_score_test = smet.f1_score(Y_test, Y_pred_test, average='macro')
            balanced_accuracy_test = smet.balanced_accuracy_score(Y_test, Y_pred_test)
            best_metadata.update({"balanced_accuracy_test":balanced_accuracy_test,"accuracy_test": accuracy_test, "f1_macro_score_test":f1_macro_score_test, 
                                  "embedded_conversion_train_time": embedded_conversion_train_time, "embedded_conversion_valid_time": embedded_conversion_valid_time, "embedded_conversion_test_time": embedded_conversion_test_time})
            metadata.update({(we_name+"_full", "",  ml_model_name): best_metadata})
    pickle.dump(metadata, open("data/metadata/metadata_2bis.pkl","wb"))

In [None]:
# Load the results saved for the vectorizers and the word embedding models
metadata = pickle.load(open("data/metadata/metadata_1.pkl","rb"))
metadata.update(pickle.load(open("data/metadata/metadata_2.pkl","rb")))
metadata.update(pickle.load(open("data/metadata/metadata_2bis.pkl","rb")))

# Create figure
fig = go.Figure()

# Add traces
metrics = ["accuracy_train",'accuracy_valid', "accuracy_test", "learning_time", "predict_time","balanced_accuracy_test","balanced_accuracy_valid","balanced_accuracy_train", "f1_macro_score_test", "f1_macro_score_valid", "f1_macro_score_train"]
N_metrics = len(metrics)
method_ml_names = ['lr','rf','mlp']
N_method_ml_names = len(method_ml_names)

buttons = []
for i_metric, metric in enumerate(metrics):
    for method_ml_name in method_ml_names:
        fig.add_trace(
            go.Scatter(
                x=[k[0]+"_"+str(k[1]) for k,v in metadata.items() if v['name']==method_ml_name],
                y=[0 if ( not(k[0] in ("skip-gram","cbow", "pretrained")) and metric.startswith("embedded")) else v[metric] for k,v in metadata.items() if v['name']==method_ml_name],
                mode="markers",
                marker=dict(size=10),
                name = method_ml_name,
            )
        )
    buttons.append(
            dict(label=metric,
                 method="update",
                 args=[{"visible": [True if i in [i_metric*N_method_ml_names + k for k in range(N_method_ml_names)] else False for i in range(N_method_ml_names * N_metrics)]},
                       {"title": metric}]))
    

# Update remaining layout properties
fig.update_layout(
    title_text=metric,
    updatemenus=[
        dict(
            active=1,
            buttons=buttons
        )]
)

fig.show()


In [None]:
metadata[('skip-gram_full', '', 'mlp')]

**Q** Comment on the results obtained with the w2v features learned on the complete unlabeled dataset.

# Few labeled dataset

We have seen that training a word embedding model on an unlabeled dataset can improve the results of the (supervised) classification problem. 

However, 100,000 rows is already a large labeled dataset, and the metrics obtained with the full word embedding model differ little from those obtained with the other word embedding models.

To see how these approaches perform when only a very small labeled dataset is available, let's re-run the models for different sizes of the training dataset. For each of the three metrics studied, we use the best combination of model and parameters for one word embedding (full and simple) and one vectorizer model, i.e.:

**accuracy**
* *Word Embedding*: `[[model_sg, "skip-gram"], ["mlp", {"hidden_layer_sizes": 256}]]`
* *Word Embedding full*: `[[model_sg_full, "skip-gram_full"], ["mlp", {"hidden_layer_sizes": 256}]]`
* *Vectorizer*: `[[tfidf, 'None'], ['lr', {"C": 10}]]`

**balanced accuracy**
* *Word Embedding*: `[[model_sg, "skip-gram"], ["rf", {"n_estimators": 500}]]`
* *Word Embedding full*: `[[model_sg_full, "skip-gram_full"], ["rf", {"n_estimators": 500}]]`
* *Vectorizer*: `[[tfidf, 'None'], ['lr', {"C": 10}]]`

**f1 macro score**
* *Word Embedding*: `[[model_sg, "skip-gram"], ["mlp", {"hidden_layer_sizes": 256}]]`
* *Word Embedding full*: `[[model_sg_full, "skip-gram_full"], ["mlp", {"hidden_layer_sizes": 256}]]`
* *Vectorizer*: `[[tfidf, 'None'], ['mlp', {"hidden_layer_sizes": 256}]]`

*You may have to change these values if your results are different.*

### Training
The following code trains the models defined above on training datasets of different sizes.<br>
To save time, the models have already been trained, and the results saved in the `data/metadata/metadata_few_labeled_dataset.pkl` file.

In [None]:
FORCE_TO_RUN = False
from ml_model import MlModel
from vectorizer import Vectorizer

args = [["we", ["skip-gram_full", model_sg_full, ], ["rf", {"n_estimators": [500]}]],
        ["we", ["skip-gram_full", model_sg_full, ], ["mlp", {"hidden_layer_sizes": [256]}]],
        ["we", ["skip-gram", model_sg, ], ["rf", {"n_estimators": [500]}]],
        ["we", ["skip-gram", model_sg, ], ["mlp", {"hidden_layer_sizes": [256]}]],
        ["vect", ["tfidf", "None"], ["mlp", {"hidden_layer_sizes": [256]}]],
         ["vect", ["tfidf", "None"], ["lr", {"C": [10]}]]]
train_sizes = [100, 500, 1000, 5000, 10000, 50000, 100000]
if FORCE_TO_RUN:
    metadata = {}
    for vect_type, (vect_name, vect_arg), (ml_model_name, param_grid) in args:
        for train_size in train_sizes:
            print(vect_name, ml_model_name, train_size)
            if vect_type == "we":
                we_model = vect_arg
                train_token = train_array_token[:train_size] 
                X_train, embedded_conversion_train_time = WordEmbedding.get_matrix_features_means(train_token, we_model)
                valid_token = valid_array_token
                X_valid, embedded_conversion_valid_time = WordEmbedding.get_matrix_features_means(valid_token, we_model)
                test_token = test_array_token
                X_test, embedded_conversion_test_time = WordEmbedding.get_matrix_features_means(test_token, we_model)
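            else:
                nb_hash = vect_arg
                vect_method = Vectorizer(vectorizer_type=vect_name, nb_hash=nb_hash)
                X_train = vect_method.load_dataframe("train")[:train_size]
                X_valid = vect_method.load_dataframe("valid")
                X_test = vect_method.load_dataframe("test")
                # No embedding conversion for vectorizers: do not reuse timings from a previous iteration
                embedded_conversion_train_time = embedded_conversion_valid_time = embedded_conversion_test_time = 0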

            Y_train = data_train.Categorie1.values[:train_size]
            Y_valid = data_valid.Categorie1.values
            Y_test = data_test.Categorie1.values

            # model
            ml_class = MlModel(ml_model_name=ml_model_name, param_grid=param_grid)
            best_model, best_metadata = ml_class.train_all_parameters(X_train, Y_train, X_valid, Y_valid, save_metadata=True)
            Y_pred_test = best_model.predict(X_test)
            accuracy_test = best_model.score(X_test, Y_test)
            # scikit-learn metrics expect (y_true, y_pred), in that order
            f1_macro_score_test = smet.f1_score(Y_test, Y_pred_test, average='macro')
            balanced_accuracy_test = smet.balanced_accuracy_score(Y_test, Y_pred_test)
            best_metadata.update({"balanced_accuracy_test": balanced_accuracy_test, "accuracy_test": accuracy_test,
                                  "f1_macro_score_test": f1_macro_score_test,
                                  "embedded_conversion_train_time": embedded_conversion_train_time,
                                  "embedded_conversion_valid_time": embedded_conversion_valid_time,
                                  "embedded_conversion_test_time": embedded_conversion_test_time})
            metadata.update({(vect_name, ml_model_name, train_size): best_metadata})
            pickle.dump(metadata, open("data/metadata/metadata_few_labeled_dataset.pkl", "wb"))

In [None]:
metadata = pickle.load(open("data/metadata/metadata_few_labeled_dataset.pkl", "rb"))
# Create figure
fig = go.Figure()

# Add traces
metrics = ["accuracy_train",'accuracy_valid', "accuracy_test", "learning_time", "predict_time","balanced_accuracy_test","balanced_accuracy_valid","balanced_accuracy_train", "f1_macro_score_test", "f1_macro_score_valid", "f1_macro_score_train"]
N_metrics = len(metrics)
method_vect_ml_names = [['skip-gram_full','rf'],['skip-gram_full','mlp'],['skip-gram','rf'],['skip-gram','mlp'],["tfidf","lr"],["tfidf","mlp"]]
N_method_vect_ml_names = len(method_vect_ml_names)

buttons = []
for i_metric, metric in enumerate(metrics):
    for vect_name, ml_name in method_vect_ml_names:
        fig.add_trace(
            go.Scatter(
                x=  train_sizes,
                y= [metadata[(vect_name,ml_name, x)][metric] for x in train_sizes],
                mode="markers+lines",
                marker=dict(size=10),
                name = vect_name+"_"+ml_name,
            )
        )
    buttons.append(
            dict(label=metric,
                 method="update",
                 # Each metric owns N_method_vect_ml_names consecutive traces
                 args=[{"visible": [True if i in [i_metric*N_method_vect_ml_names + k for k in range(N_method_vect_ml_names)] else False for i in range(N_method_vect_ml_names * N_metrics)]},
                       {"title": metric}]))
    

# Update remaining layout properties
fig.update_layout(
    xaxis_type="log",
    title_text=metric,
    updatemenus=[
        dict(
            active=1,
            buttons=buttons
        )]
)

fig.show()

# GloVe

GloVe is an algorithm developed by [Stanford researchers](https://nlp.stanford.edu/projects/glove/), with a reference implementation in [C](https://github.com/stanfordnlp/GloVe). There is no widely used standard Python library for it so far. 

For ease of use, we will use the code developed [here](https://github.com/WenchenLi/GloVePyWrapper). Its authors wrote a Python class called `GloveWrapper` that calls the original C code from Python. <br>
This repository has been added to *AI-Frameworks* and can easily be imported in this notebook (see the code below).

To train a model, the code requires a single file containing all the text, without punctuation and with words separated by a blank space. The code below generates such a file from the original Cdiscount dataset. 

In [None]:
ct = CleanText()
data = pd.read_csv("data/cdiscount_train.csv.zip",sep=",")
ct.clean_df_column(data, "Description", "Description_cleaned")
data["Description_cleaned"].to_csv("data/cdiscount_train_glove", sep=" ", index=False, quotechar=" ")

To generate the model, we apply the following four steps in sequence:
* **vocab_count** : build the vocabulary and count the occurrences of each word.
* **cooccur** : compute the co-occurrence matrix.
* **shuffle** : shuffle the training data (the co-occurrence records).
* **glove** : train the model with the GloVe algorithm.
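
The first two steps can be illustrated with a small pure-Python sketch on a toy corpus. The corpus and the `cooccurrences` helper are hypothetical, and weighting each pair by the inverse of the distance between the two words is an assumption about the default behaviour of the reference tool; the real C programs work on files and are far more memory-efficient.

```python
from collections import Counter, defaultdict

# Toy corpus of cleaned descriptions (hypothetical tokens)
corpus = [["coqu", "telephon", "noir"], ["coqu", "telephon", "silicon"]]

# Step 1 (vocab_count): number of occurrences of each token
vocab = Counter(token for line in corpus for token in line)


def cooccurrences(lines, window_size):
    """Step 2 (cooccur): symmetric co-occurrence weights within a window."""
    counts = defaultdict(float)
    for line in lines:
        for i, word in enumerate(line):
            for j in range(max(0, i - window_size), i):
                weight = 1.0 / (i - j)  # assumption: closer words weigh more
                counts[(word, line[j])] += weight
                counts[(line[j], word)] += weight
    return counts


cooc = cooccurrences(corpus, window_size=5)
print(vocab["coqu"], cooc[("coqu", "telephon")], cooc[("coqu", "noir")])

# Steps 3 and 4 (shuffle, glove) then shuffle these records and fit the
# word vectors on them; they are not reproduced in this sketch.
```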

In [None]:
from GloVePyWrapper.glove_pywrapper import  GloveWrapper

glove = GloveWrapper(
    corpus ="data/cdiscount_train_glove" ,
    name = "cdiscount_train" ,
    train_dir = "data/glove/",
    builddir='GloVePyWrapper/build',
    vocab_min_count=1,
    vector_size=300,
    window_size=5)
#prepare vocabulary count
glove.vocab_count()
#prepare co-occurrence matrix
glove.cooccur()
#reshuffle
glove.shuffle()
#glove train
glove.glove()

**Q** What are the `vocab_min_count`, `vector_size` and `window_size` arguments used for? Open the Python code and check the different arguments used by the glove function.

**Q** For each of the four steps, check the files that have been generated.

The `gensim` library does not contain code to train GloVe models. <br>
However, once a GloVe model has been trained, it can be loaded with gensim after converting it with the `glove2word2vec` function. We can then use the same functions as for other word embedding models such as `Word2Vec` or `FastText`.

In [None]:
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'data/glove/cdiscount_train_vectors.txt'
word2vec_output_file = 'data/glove/cdiscount_train_vectors.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)
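
To see what this conversion amounts to, here is a minimal pure-Python sketch on a tiny hand-written file. The word2vec text format is the GloVe text format preceded by a header line giving the vocabulary size and the vector dimension. The helper `add_word2vec_header` and the toy file names are hypothetical; in the notebook, keep using `glove2word2vec`.

```python
import os
import tempfile


def add_word2vec_header(glove_path, word2vec_path):
    """Copy a GloVe text file, prepending the '<vocab_size> <vector_size>' header."""
    with open(glove_path, encoding="utf-8") as f:
        lines = [line for line in f if line.strip()]
    vector_size = len(lines[0].split()) - 1  # the first field is the word itself
    with open(word2vec_path, "w", encoding="utf-8") as f:
        f.write(f"{len(lines)} {vector_size}\n")
        f.writelines(lines)
    return len(lines), vector_size


# Tiny hand-written GloVe-style file (hypothetical words and values)
tmp_dir = tempfile.mkdtemp()
toy_glove = os.path.join(tmp_dir, "toy_vectors.txt")
toy_w2v = os.path.join(tmp_dir, "toy_vectors.txt.word2vec")
with open(toy_glove, "w", encoding="utf-8") as f:
    f.write("coqu 0.1 0.2 0.3\ntelephon 0.4 0.5 0.6\n")

header = add_word2vec_header(toy_glove, toy_w2v)
with open(toy_w2v, encoding="utf-8") as f:
    first_line = f.readline().strip()
print(first_line)
```

The converted file can then be loaded with `gensim.models.KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)`.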

**Exercise** Use the code above to train word embedding models with `GloVe` instead of `word2vec`. 
Compare the performance of word prediction.

# FastText

`FastText` is an extension of Word2Vec proposed by some of the same authors. It works much like Word2Vec, but each word is represented through its subwords of n characters (character n-grams). 

It is not especially useful here, as our product descriptions are not really natural language. 
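
The subword idea can be sketched in a few lines. The helper `char_ngrams` is hypothetical and only illustrates the decomposition described in the FastText paper, where a word is wrapped in `<` and `>` boundary markers before its character n-grams are extracted; the default range of 3 to 6 characters is the one used in the paper.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]


trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)
```

In the paper, the vector of a word is the sum of the vectors of its n-grams (plus the whole word itself), which lets the model build vectors for words never seen during training.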

In [None]:
gensim.models.FastText?

**Exercise** Use the code above to train word embedding models with `FastText` instead of `word2vec`. 
Compare the performance of word prediction.

**Exercise** You can now try any of these word embedding models on DEFI-IA.