<a href="https://colab.research.google.com/github/wikistat/AI-Frameworks/blob/master/Text/1_cleaning_vectorization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [IA Frameworks](https://github.com/wikistat/AI-Frameworks) - Natural Language Processing (NLP)

<center>
<a href="http://www.insa-toulouse.fr/" ><img src="https://www.insa-toulouse.fr/skins/Insa-v2/resources/img/logo-insa.jpg" style="float:left; max-width: 320px; display: inline" alt="INSA"/></a> 
<a href="https://github.com/wikistat" ><img src="https://avatars0.githubusercontent.com/u/20927455?s=200&v=4" width=400, style="max-width: 100px; display: inline"  alt="Wikistat"/></a>
<a href="http://www.math.univ-toulouse.fr/" ><img src="https://perso.math.univ-toulouse.fr/riscope/files/2017/06/IMT.jpg" width=300,  style="float:right;  display: inline" alt="IMT"/> </a>
</center>

# Data : Cdiscount's product description.

This dataset was released by Cdiscount for a Kaggle-style data competition on the French website [datascience.net](https://www.datascience.net/fr/challenge). <br>
The test dataset of this competition has not been released, so throughout the **Text Processing** lab we use a subset of 1M products from the original train dataset (15M+ rows).<br>
The objective of this competition was to classify the text descriptions of products into the categories that compose the navigation tree of the Cdiscount website. This tree is composed of 4,733 categories organized within 44 meta-categories. <br>

The objective of this lab is not to win the competition, so we will only use the meta-categories.


# Part 1 : Cleaning Text data and Vectorization for text classification.

In this first notebook, we study different methods to perform classification of text data.

* **Cleaning**: It consists of removing characters that may not be relevant to your problem (punctuation symbols, numbers, etc.) or replacing them (uppercase to lowercase, accented characters to unaccented ones, etc.). It is also possible to remove entire words (**stopwords**) or to replace them with their stem (**stemming**).
* **Vectorization**: Vectorization is the step that converts the text data (raw or cleaned) to numerical data. These methods can be based on statistics (**One Hot Encoding**, **TF-IDF**) or on learning algorithms (**Word2Vec, Glove, Gensim**), which we will see in the next part.
* **Classification**: Once the text is converted into numerical data, any classical machine learning or deep learning algorithm can be used to solve your problem.


# Files & Data (Google Colab)

If you're running this notebook on Google colab, you do not have access to the `data` or `solutions` folder you get by cloning the repository locally. 

The following lines will build the folders and download the files you need for this lab.

**WARNING 1** Do not run these lines locally.<br>
**WARNING 2** The magic command `%load` does not work on Google Colab; you will have to copy-paste the solutions into the notebook.

In [None]:
! mkdir data
! wget -P data https://github.com/wikistat/AI-Frameworks/raw/master/Text/data/cdiscount_test.csv.zip
! wget -P data https://github.com/wikistat/AI-Frameworks/raw/master/Text/data/cdiscount_train.csv.zip
! mkdir data/metadata
! wget -P data/metadata https://github.com/wikistat/AI-Frameworks/raw/master/Text/data/metadata/metadata_1.pkl
! mkdir solution
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/clean_dataframe_1.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/clean_dataframe_2.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/get_vocabulary_size.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/wordcloud_categorie.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/get_example_OHE_description.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/get_example_TFIDF_description.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/get_example_TFIDF_description_valid.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/clean.py

! wget -P . https://github.com/wikistat/AI-Frameworks/raw/master/Text/clean.py
! wget -P . https://github.com/wikistat/AI-Frameworks/raw/master/Text/vectorizer.py
! wget -P . https://github.com/wikistat/AI-Frameworks/raw/master/Text/ml_model.py

# Libraries

In [None]:
import unicodedata 
import time
import pandas as pd
import numpy as np
import random
import nltk
import re 
import collections
import itertools
import pickle
import warnings
from tqdm import tqdm
import plotly.offline as pof
import plotly.graph_objects as go
warnings.filterwarnings("ignore")
import sklearn.metrics as smet

import matplotlib.pyplot as plt
import seaborn as sb
from scipy import sparse
sb.set_style("whitegrid")

import sklearn.model_selection as sms

**nltk**

If you're using the `nltk` library for the first time, you have to first download the data you may need. For our problem, we need to download the nltk dataset of common stopwords.

In [None]:
nltk.download("stopwords")

# Data Exploration

In the `data` folder you'll find these two files:

* `cdiscount_train.csv.zip`: training dataset composed of 1,000,000 lines
* `cdiscount_test.csv.zip`: test dataset composed of 50,000 lines

We first read the train dataset. To make the exploration easier, we first load only 100,000 rows. You can later go back and run the study on the complete dataset.

In [None]:
data = pd.read_csv("data/cdiscount_train.csv.zip",sep=",", nrows=100000)
print("The train dataset is composed of %d lines" %data.shape[0])
data.head(5)

Then we read the test dataset.

In [None]:
data_test = pd.read_csv("data/cdiscount_test.csv.zip",sep=",")
print("The test dataset is composed of %d lines" %data_test.shape[0])
data_test.head(5)

The dataset is composed of 6 columns:

* Categorie1: Level 1 Category 
* Categorie2: Level 2 Category 
* Categorie3: Level 3 Category 
* Description: The complete description of the product
* Libelle: A shorter description of the product.
* Marque: The brand of the product. 

As described in the introduction, our objective is to classify the products into one of the 44 level 1 categories using the text. 

The following command displays a description example for each of the 44 categories.

In [None]:
pd.DataFrame(data.groupby("Categorie1").first()["Description"])

### Class distribution

In [None]:
data_count = data["Categorie1"].value_counts()

fig = go.Figure()
fig.add_trace(go.Bar(x=data_count.index,
                y=data_count.values,
                marker_color='rgb(55, 83, 109)'
                ))

fig.update_layout(
    title='Distribution of products within categories',
    xaxis_tickfont_size=12,
    xaxis_tickangle=70,
    yaxis=dict(
        # title.font replaces the deprecated titlefont attribute
        title=dict(text='Number of products', font=dict(size=16)),
        tickfont_size=14,
    ),
    paper_bgcolor='rgba(0,0,0,0)',
    barmode='group',
    bargap=0.15, # gap between bars of adjacent location coordinates.
    bargroupgap=0.1 # gap between bars of the same location coordinate.
)
fig.show()

**Q** What can you say about the distribution of the products within categories?

### Vocabulary size

In [None]:
vocabulary_size = {categorie : len(set(" ".join(data[data["Categorie1"]==categorie]["Description"].values).split(" "))) for categorie in set(data["Categorie1"].values)}

fig = go.Figure()
fig.add_trace(go.Bar(x=data_count.index,
                y=[vocabulary_size[c] for c in data_count.index],
                marker_color='rgb(55, 83, 109)'
                ))

fig.update_layout(
    title='Size of vocabulary per category',
    xaxis_tickfont_size=12,
    xaxis_tickangle=70,
    yaxis=dict(
        # title.font replaces the deprecated titlefont attribute
        title=dict(text='Size of vocabulary', font=dict(size=16)),
        tickfont_size=14,
    ),
    paper_bgcolor='rgba(0,0,0,0)',
    barmode='group',
    bargap=0.15, # gap between bars of adjacent location coordinates.
    bargroupgap=0.1 # gap between bars of the same location coordinate.
)
fig.show()
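The vocabulary-size computation above is packed into a single dictionary comprehension. As a more readable sketch of the same idea, the snippet below uses a `groupby` on a tiny made-up dataframe. The helper name `vocabulary_size_per_category` and the toy values are assumptions for illustration, and it splits on any whitespace with `split()` rather than on single spaces.

```python
import pandas as pd

# Toy data standing in for the Cdiscount dataframe (hypothetical values).
toy = pd.DataFrame({
    "Categorie1": ["A", "A", "B"],
    "Description": ["red shoe", "red hat", "blue car blue"],
})

def vocabulary_size_per_category(df, text_col="Description", cat_col="Categorie1"):
    """Return a Series mapping each category to its number of distinct tokens."""
    return df.groupby(cat_col)[text_col].apply(
        lambda texts: len(set(" ".join(texts).split()))
    )

toy_vocab = vocabulary_size_per_category(toy)
print(toy_vocab.to_dict())  # {'A': 3, 'B': 2}
```

On the real data, the same helper could be called as `vocabulary_size_per_category(data)`.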

# Text Cleaning

The main advantage of text cleaning is to reduce the feature space while losing as little information as possible.

In the end, the features that summarize a description will be the different strings of characters separated by a blank space. Hence strings like *hand*, *Hand*, *hand,* etc. will each be considered a different feature if the text is not cleaned. Cleaning the text groups together words, or parts of words, that are the same but would otherwise have been considered different.

Here are the different steps that will be applied to the product descriptions.


* Remove HTML code. 
* Convert text to lowercase.
* Remove punctuation, numbers, and other non-letter symbols.
* Remove **stopwords**
* Apply **stemming** on each word.


Let us see the effect of each of these transformations on an example line.

## Cleaning Text Example 

**Original line**

In [None]:
i = 47
description = data.Description.values[i]
print("Original Description : " + description)

**Remove HTML code**

Product descriptions may come directly from the product's website. Hence some of these descriptions contain HTML code such as `<br>`, `<a>`, `<h>`, etc.
The `BeautifulSoup` library is able to detect HTML code and remove it.

In [None]:
from bs4 import BeautifulSoup  # HTML cleaning
# description is already a str, so no from_encoding argument is needed
txt = BeautifulSoup(description, "html.parser").get_text()

**Conversion to lowercase**

Some words are written with uppercase letters. In some cases this can be useful, for example in sentiment analysis. But here, the use of lowercase or uppercase does not provide additional information. Hence all words will be converted to lowercase.

In [None]:
txt = txt.lower()
print(txt)

**Removing accent**

Remove all accents by decomposing the characters and encoding them to ASCII. Ignoring accents makes the features robust to some misspellings (a missing or wrong accent). 


In [None]:
txt = unicodedata.normalize('NFD', txt).encode('ascii', 'ignore').decode("utf-8")
print(txt)
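To see what the accent-removal line does, here is a minimal standalone sketch. The helper name `strip_accents` is an assumption for illustration. NFD normalization splits an accented letter into a base letter plus a combining mark, and the ASCII encoding with `"ignore"` then drops the mark.

```python
import unicodedata

def strip_accents(text):
    """Decompose accented characters (NFD), then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("même qualité, très élégant"))  # meme qualite, tres elegant
print(strip_accents("cœur"))                        # cur
```

Note the second example: characters that have no decomposition into an ASCII base letter, such as `œ`, are removed entirely rather than transliterated.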

**Removing characters that are not letters**

This step consists of removing any non-letter information (i.e., punctuation, numbers, symbols, etc.).

To accomplish this task we will use [Regular Expressions](https://en.wikipedia.org/wiki/Regular_expression) through the [re python library](https://docs.python.org/2/library/re.html). <br>
A regular expression is a sequence of characters that describes a search pattern, which makes it possible to efficiently find matching sequences of characters within a text. We won't have time to go deeper into it during this lab, but keep in mind that it is an extremely useful tool for information retrieval. 

The next line of code detects every character that is not a lowercase letter or an underscore (with the syntax `[^a-z_]`) and replaces it with a blank character.

In [None]:
txt = re.sub('[^a-z_]', ' ', txt)
print(txt)
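The snippet below is a small standalone sketch of this regular expression on made-up strings. The helper name `keep_letters` is an assumption for illustration. Unlike the cell above, it also collapses the resulting runs of blanks so the output is easier to read.

```python
import re

def keep_letters(text):
    """Replace every character outside a-z and '_' with a space, then collapse blanks."""
    spaced = re.sub(r"[^a-z_]", " ", text)
    return " ".join(spaced.split())

print(keep_letters("prix: 29,99 euros! taille_xl"))  # prix euros taille_xl
print(keep_letters("été"))                           # t
```

The second call illustrates why lowercasing and accent removal have to come first: any accented or uppercase letter still present at this stage is treated as a non-letter and erased.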

**Removing Stopwords**

One common step in text preprocessing is to remove stopwords. A stopword is a word that does not bring any information to solve our problem. <br>
  It can be specific to the problem. For example, when classifying types of clothes, the color of the t-shirt or the dress will often be in the description but won't help. It can even add noise, for example if most of the t-shirts in the training dataset are black.<br> 
  Other stopwords are not specific to the problem but related to the language, for example words like *le*, *la*, *les*, etc. The nltk library contains a list of stopwords for different languages. 

In [None]:
french_stopwords = nltk.corpus.stopwords.words('french') 
english_stopwords = nltk.corpus.stopwords.words('english') 
pd.DataFrame([french_stopwords[:30], english_stopwords[:30]], index=["French", "English"]).T

To remove these words from the description, we first have to clean the stopwords the same way the text has been cleaned so far (so that words like `même`, which now appears as `meme` in the text, will be removed).

In [None]:
stopwords = [unicodedata.normalize('NFD', sw).encode('ascii', 'ignore').decode("utf-8") for sw in french_stopwords]
# Keep a token only if it is longer than one character and is not a stopword
tokens = [w for w in txt.split() if (len(w) > 1) and (w not in stopwords)]
removed_words = [w for w in txt.split() if (len(w) < 2) or (w in stopwords)]

print("List of tokens: %s" %str(tokens))
print("List of removed words: %s" %str(removed_words))
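As a small illustration of why the stopword list itself has to be cleaned, the sketch below compares filtering with raw and with accent-normalized stopwords. The four-word stopword list and the sample text are made up for the example; the notebook uses nltk's French list instead.

```python
import unicodedata

def strip_accents(text):
    """Same accent removal as the one applied to the descriptions."""
    return unicodedata.normalize("NFD", text).encode("ascii", "ignore").decode("ascii")

# Small hypothetical stopword list (the notebook uses nltk's French list).
raw_stopwords = {"même", "la", "de", "été"}
normalized_stopwords = {strip_accents(sw) for sw in raw_stopwords}

cleaned_text = "la meme housse de protection"  # already lowercased and accent-free

kept_raw = [w for w in cleaned_text.split() if w not in raw_stopwords]
kept_normalized = [w for w in cleaned_text.split() if w not in normalized_stopwords]

print(kept_raw)         # ['meme', 'housse', 'protection']
print(kept_normalized)  # ['housse', 'protection']
```

Storing the stopwords in a `set` rather than a list is a small design choice that makes each membership test constant-time, which can matter when cleaning many descriptions.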

Note that we now deal with lists of strings called **tokens**.

**Stemming**

Stemming consists of converting a word to its **word stem**, or root. 

Hence words with the same root that differ only by their agreement (gender, number, etc.) or their conjugation, for example, are mapped to the same stem.
This transformation is applied through a stemming algorithm that depends on the language.

The **nltk** library uses the [Snowball stemming algorithm](https://snowballstem.org/algorithms/french/stemmer.html) to stem French words. It is based on several deterministic rules.

Play with the following line of code to see the effect of this algorithm.

In [None]:
stemmer=nltk.stem.SnowballStemmer('french')
stemmer.stem("")

We now apply the stemming algorithm to the tokens of the description.

In [None]:
tokens_stem = [stemmer.stem(token) for token in tokens]
print(tokens_stem)

## Cleaning the whole DataFrame

### Exercise

The `clean.py` file defines a `CleanText` Python class that contains all the functions defined above.
Using this Python class:

1. Write a function called `apply_all_transformation` that applies all these transformations to a text description.
2. Write a function called `clean_df_column` that cleans all the text lines of a column of the dataframe and adds a new column containing the cleaned lines.

In [None]:
# %load solution/clean_dataframe_1.py

In [None]:
# %load solution/clean_dataframe_2.py

In [None]:
clean_df_column(data, "Description", "Description_cleaned")
data[["Description", "Description_cleaned"]]

Let's also clean the test dataset for later.

In [None]:
clean_df_column(data_test, "Description", "Description_cleaned")
data_test[["Description", "Description_cleaned"]]

**Warning** For ease of practice and iteration, these functions are defined within the notebook. Once a function is finished, it is better to move it into the Python class, as is done in the `solution/clean.py` file.

### Vocabulary size

**Exercise**: Compute the total size of the vocabulary (i.e., the number of unique words in the dataset) before and after cleaning. What do you observe?

In [None]:
# %load solution/get_vocabulary_size.py

### Wordcloud

A *Wordcloud* representation displays the main words of a corpus of documents. In this representation, the bigger a word is, the more frequently it appears in the corpus.

Below you can observe the Wordcloud of all the descriptions before cleaning.

In [None]:
from wordcloud import WordCloud
all_descr = " ".join(data.Description.values)
wordcloud_word = WordCloud(background_color="black", collocations=False).generate_from_text(all_descr)

plt.figure(figsize=(10,10))
plt.imshow(wordcloud_word,cmap=plt.cm.Paired)
plt.axis("off")
plt.show()

Wordcloud after stemming and cleaning.

In [None]:
all_descr_clean_stem = " ".join(data.Description_cleaned.values)
wordcloud_word = WordCloud(background_color="black", collocations=False).generate_from_text(all_descr_clean_stem)

plt.figure(figsize=(10,10))
plt.imshow(wordcloud_word,cmap=plt.cm.Paired)
plt.axis("off")
plt.show()

**Q** What do you observe?

Both words `voir` and `present` are the most frequent words after cleaning. This is due to the fact that most of the descriptions end with *voir la présentation*. It is a good example of **stopwords** that are specific to a given problem.

**Exercise** Add the words `voir` and `présentation` to the stopwords list and run the cleaning again.

**Exercise** Generate the wordcloud for a category of your choice.

In [None]:
# %load solution/wordcloud_categorie.py

# Vectorization 

The vectorization step converts text data into numerical data. 
In this notebook we study vectorization algorithms based on statistics:

* **One Hot Encoding**
* **TF-IDF** 

One of the limitations of these methods is that they produce data of very high dimension, since the number of features is the size of the vocabulary. To mitigate this issue, a **hashing** method can be used.

## Train/Validation dataset.

We now split the `data` dataframe into two dataframes to get proper **train** and **validation** datasets with the `train_test_split` function from the `scikit-learn` library. <br>
The `random_state` argument, when set to an integer, makes the split reproducible from one run to another. Otherwise the split differs at each run.

In [None]:
data_train, data_valid = sms.train_test_split(data, test_size=0.1, random_state=42)
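The effect of `random_state` can be checked on a small made-up array, as in the sketch below. The `stratify` argument shown at the end is not used in this notebook; it is included only as an option that keeps the class proportions similar in both parts, which may be worth considering with unbalanced classes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)              # 10 toy samples
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])  # 6 samples of class 0, 4 of class 1

# Same seed, same split.
a_train, a_valid = train_test_split(X, test_size=0.2, random_state=42)
b_train, b_valid = train_test_split(X, test_size=0.2, random_state=42)
same_split = np.array_equal(a_valid, b_valid)

# Optional: a stratified split preserves the class proportions.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)
print(same_split, np.bincount(y_tr), np.bincount(y_va))
```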

## One-Hot-Encoding

**One-Hot-Encoding** is the simplest vectorization method. <br>
It builds a feature matrix of size $N \times V$ where $N$ is the number of text descriptions and $V$ the size of the vocabulary. <br>
For each description, the entry for a word or token is equal to 1 if that word or token is within the description, and 0 otherwise. <br>

There are several variations of this encoding. For example, the entry can be equal to the number of times the word or token appears within the description.
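To make the two encodings concrete, here is a minimal pure-Python sketch on two made-up cleaned descriptions. It only illustrates the definition; it is not how `scikit-learn` implements it, and the helper name `encode` is an assumption.

```python
from collections import Counter

descriptions = ["coque rouge coque", "housse rouge"]  # toy cleaned descriptions

# Vocabulary: sorted list of distinct tokens, one column per token.
vocabulary = sorted({tok for d in descriptions for tok in d.split()})

def encode(description, binary):
    """Return one row of the N x V matrix: presence (binary) or number of occurrences."""
    counts = Counter(description.split())
    return [min(counts[tok], 1) if binary else counts[tok] for tok in vocabulary]

one_hot = [encode(d, binary=True) for d in descriptions]
count = [encode(d, binary=False) for d in descriptions]

print(vocabulary)  # ['coque', 'housse', 'rouge']
print(one_hot)     # [[1, 0, 1], [0, 1, 1]]
print(count)       # [[2, 0, 1], [0, 1, 1]]
```

Most entries of such a matrix are zero once the vocabulary is large, which is why a dense list of lists like this one would not scale to the real dataset.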


*One-Hot-Encoding* and its variations can be applied through the `CountVectorizer` class of the `scikit-learn` library.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

extr_cv = CountVectorizer(binary=False, ngram_range=(1,1))
data_train_OHE = extr_cv.fit_transform(data_train["Description_cleaned"].values)
data_train_OHE

**Q** What is the type of `data_train_OHE`? Why is it stored with this type?

**Q** What is the effect of the `binary` argument? What if it is set to True?

**Q** What is the effect of the `ngram_range` parameter? Play with it and see the effect on the number of features.

The `get_feature_names_out` method returns the vocabulary list (it replaces `get_feature_names`, which was removed in recent `scikit-learn` releases).

In [None]:
vocabulary = extr_cv.get_feature_names_out()
N_vocabulary = len(vocabulary)
print("Number of words : %d" %N_vocabulary )

**Exercise** Take a line of the training dataset. Retrieve all the words that constitute the line from the `data_train_OHE` object and the `vocabulary` object. Also retrieve the number of occurrences of each word that composes this line.

In [None]:
# %load solution/get_example_OHE_description.py

Apply the same transformations on the validation dataset.

In [None]:
data_valid_OHE = extr_cv.transform(data_valid["Description_cleaned"].values)
data_valid_OHE

**Q** What happens to the words of the validation dataset that are not present in the training dataset?

**Q** Why don't we re-fit the `CountVectorizer` class on the validation dataset?

## TF-IDF

**TF-IDF** is a formula that measures how important a word $w$ is in a description $d$ relative to a collection of documents $D$. 


* The **TF(w,d)** function counts how many times the word $w$ appears in the description $d$.

* The **IDF(w,D)** function evaluates how informative the word is across the corpus of documents $D$. The more documents the word $w$ appears in, the lower its IDF. There are various formulas to compute the IDF; the simplest is: 

$$IDF(w,D)=\log\frac{|D|}{f(w)}$$

where $|D|$ is the number of documents in the whole corpus, and $f(w)$ the number of documents in which $w$ appears.

* Finally, the **TF-IDF(w,d,D)** value of a word within a description is computed as

$$TF\text{-}IDF(w,d,D)=TF(w,d)\times IDF(w,D).$$
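The formulas above can be worked through by hand on a three-document toy corpus, as in the sketch below. It assumes the natural logarithm and the simplest IDF formula given above. As far as the `scikit-learn` documentation describes it, `TfidfVectorizer` uses by default a smoothed variant, $\ln\frac{1+|D|}{1+f(w)}+1$, so the numbers it returns will differ from these.

```python
import math
from collections import Counter

corpus = [
    "coque rouge coque",
    "housse rouge",
    "coque noire",
]
docs = [d.split() for d in corpus]

def tf(word, doc):
    """Number of occurrences of word in the tokenized description doc."""
    return Counter(doc)[word]

def idf(word, docs):
    """log(|D| / f(w)); only defined for words that appear in at least one document."""
    n_containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

print(round(tf_idf("coque", docs[0], docs), 3))   # 2 * ln(3/2) = 0.811
print(round(tf_idf("housse", docs[1], docs), 3))  # 1 * ln(3/1) = 1.099
```

The word `housse` appears once in a single document and gets a higher weight than `coque`, which appears twice in its description but is shared by two of the three documents.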


This encoding can be applied through the `TfidfVectorizer` class of the `scikit-learn` library. We first apply it with the `norm` parameter set to `None` (no normalization of the output vectors) to make the result easier to interpret. 

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(ngram_range=(1,1), norm=None)  # norm=None: keep the raw tf * idf values
data_train_TFIDF = vec.fit_transform(data_train["Description_cleaned"].values)

Check that `data_train_TFIDF` has the same type as `data_train_OHE` and that the vocabulary sizes are the same.

In [None]:
vocabulary = vec.get_feature_names_out()
N_vocabulary = len(vocabulary)
N_vocabulary

**Exercise** Take a line of the training dataset. Retrieve all the words that constitute the line from the `data_train_TFIDF` object and the `vocabulary` object. Also retrieve the value of the *idf*, the *tf* and the *tfidf* for each word of the line.

In [None]:
# %load solution/get_example_TFIDF_description.py

**Q** Comment on the value of the idf for each of the words.

**Q** How does the value of the idf evolve when changing the parameters `smooth_idf` and `sublinear_tf` of the `TfidfVectorizer` class?

**Exercise** Change the value of the *ngram_range* parameter of the `TfidfVectorizer` class and display the result once again. What do you see?

We now apply this `vectorizer` on the validation dataset

In [None]:
data_valid_TFIDF = vec.transform(data_valid["Description_cleaned"].values)
data_valid_TFIDF

**Exercise** Take a line of the validation dataset. Retrieve all the words that constitute the line from the `data_valid_TFIDF` object and the `vocabulary` object. Also retrieve the value of the *idf*, the *tf* and the *tfidf* for each word of the line.

In [None]:
# %load solution/get_example_TFIDF_description_valid.py

**Q** The tf is recomputed for each line of the validation dataset, but the IDF does not change: it keeps the value computed on the training dataset. Does this seem normal to you?

## Hashing

**Hashing** is a method that reduces the feature space (the dictionary) to a fixed, smaller number of features `n_hash`.

It is based on a **hashing function** $h$ that maps an index $j\in \mathbb{N}$ to another index $i \in [1,...,N_{hash}]$ such that $i=h(j)$. For a description, the weight of the new feature at index $i$ is a combination of the weights of all the features $j$ of the original space such that $i=h(j)$. The weights are combined according to the method described by [Weinberger et al. (2009)](https://alex.smola.org/papers/2009/Weinbergeretal09.pdf).

$h$ is deterministic: it does not generate the mapping randomly. Hence, for any dataset (train or validation), the same token is mapped to the same index for a given *n_hash* parameter.
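The idea can be sketched in a few lines of pure Python. This is only an illustration of the hashing trick with a signed hash: it uses MD5 from the standard library, whereas `scikit-learn`'s `FeatureHasher` relies on MurmurHash3, so the indices will not match. The function names are assumptions made for this example.

```python
import hashlib
from collections import Counter

def hash_index_and_sign(token, n_hash):
    """Deterministically map a token to a bucket in [0, n_hash) and a sign in {-1, +1}."""
    digest = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
    index = digest % n_hash
    sign = 1 if (digest >> 64) % 2 == 0 else -1
    return index, sign

def hash_features(description, n_hash):
    """Sum the signed token counts of a description into a vector of fixed size n_hash."""
    vector = [0] * n_hash
    for token, count in Counter(description.split()).items():
        index, sign = hash_index_and_sign(token, n_hash)
        vector[index] += sign * count
    return vector

v1 = hash_features("coque rouge coque", n_hash=8)
v2 = hash_features("coque rouge coque", n_hash=8)
print(v1 == v2, len(v1))  # True 8
```

No vocabulary has to be stored, and the output size does not depend on the corpus. The price is that two different tokens can land in the same bucket; the random sign means such collisions tend to cancel rather than accumulate.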


The `FeatureHasher` takes a list of occurrence dictionaries as input, whereas `CountVectorizer` and `TfidfVectorizer` take a list of strings and tokenize them themselves. 

In [None]:
train_dict_array  = list(map(lambda x : collections.Counter(x.split(" ")), data_train["Description_cleaned"].values))
train_dict_array[:2]

In [None]:
from sklearn.feature_extraction import FeatureHasher
nb_hash = 300
feathash = FeatureHasher(nb_hash)
data_train_hash = feathash.fit_transform(train_dict_array)

Check that the type of `data_train_hash` is the same as that of `data_train_OHE` or `data_train_TFIDF` and that its dimension has been reduced.

The next cell displays the weights of all the non-zero indices of a row in the new space.

In [None]:
ir = 47
rw = data_train_hash.getrow(ir)
print("Stemmed tokens of row %d: %s" % (ir, data_train["Description_cleaned"].values[ir]))
pd.DataFrame([(ind, w) for w, ind in zip(rw.data, rw.indices)], columns=["indices", "weight"])

**Q** What can you say about the weights?

The size of the matrix has been reduced compared to the `TFIDF` or `OHE` vectorizers. However, the hashing function has no inverse, which can make the results hard to interpret.


It is possible to combine the `FeatureHasher` with a TF-IDF weighting through the `TfidfTransformer` class. <br>
The `TfidfTransformer` does not take strings as input but the `data_train_hash` sparse matrix. 
The "words" are the `nb_hash` hashed indices, and the tf of each description is the weight computed by the hash function. 

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

vec = TfidfTransformer(norm=None)  # norm=None disables normalization
data_train_HTfidf = vec.fit_transform(data_train_hash)
data_train_HTfidf

In [None]:
ir = 47
rw = data_train_HTfidf.getrow(ir)
print(data_train["Description_cleaned"].values[ir])
pd.DataFrame([(ind, vec.idf_[ind], w/vec.idf_[ind], w)  for w,ind in zip(rw.data, rw.indices)], columns=["indices","idf","tf","weight"])


## Vectorize the dataframe


In the `vectorizer.py` file, there is a `Vectorizer` class that fits a vectorizer on a column of a dataframe and then applies it to the same column of other dataframes.

The following code applies the vectorizer to the `Description_cleaned` column of the train, validation and test datasets for various vectorizers, with and without hashing. The arrays of vectorized descriptions are saved so that they can be used in the next step for classification.

In [None]:
from vectorizer import Vectorizer


features_parameters = [[None, "count"],
                      [1000, "count"],
                      [None, "tfidf"],
                      [1000, "tfidf"],]
metadata = {}
for nb_hash, vectorizer_type in features_parameters:
    vect_method = Vectorizer(vectorizer_type = vectorizer_type, nb_hash = nb_hash )
    ts = time.time()
    vec, feathash, data_train_vec = vect_method.vectorizer_train(data_train, columns = "Description_cleaned")
    data_valid_vec = vect_method.apply_vectorizer(data_valid, columns = "Description_cleaned", vec = vec, feathash = feathash)
    data_test_vec = vect_method.apply_vectorizer(data_test, columns = "Description_cleaned", vec = vec, feathash = feathash)
    te = time.time()
    
    metadata.update({(nb_hash, vectorizer_type):te-ts})
    
    print("nb_hash : " + str(nb_hash) + ", vectorizer_type : " + str(vectorizer_type))
    print("Running time for vectorization : %.1f seconds" %( metadata[(nb_hash, vectorizer_type)]))
    print("Train shape : " + str(data_train_vec.shape))
    print("Valid shape : " + str(data_valid_vec.shape))

    
    vect_method.save_dataframe(data_train_vec, "train")
    vect_method.save_dataframe(data_valid_vec, "valid")
    vect_method.save_dataframe(data_test_vec, "test")


# Product Classification

In this last part, we will try to classify the product descriptions of the test dataset using different ML models.


For each of the three models and for each of the vectorized arrays, let's follow this classical train-validation procedure.

* Train the ML model on the training data for different parameters. 
* Select the best configuration of parameters according to the results on the validation dataset. 
* For each best configuration, train the model on train + validation data and apply it to the test dataset. 
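The three steps above can be sketched on synthetic data as follows. This is a minimal illustration with a logistic regression and a made-up dataset from `make_classification`; it is not the implementation of the `MlModel` class, and the candidate values of `C` are only an example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorized descriptions and their categories.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

# 1. Train one model per candidate parameter and score it on the validation set.
valid_scores = {}
for C in [0.1, 1, 10]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    valid_scores[C] = model.score(X_valid, y_valid)

# 2. Keep the configuration with the best validation accuracy.
best_C = max(valid_scores, key=valid_scores.get)

# 3. Refit on train + validation, then evaluate once on the test set.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_trainval, y_trainval)
test_accuracy = final_model.score(X_test, y_test)
print(best_C, round(test_accuracy, 3))
```

Keeping the test set out of steps 1 and 2 is what makes the final score an honest estimate: the test data never influences the choice of parameters.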

In this notebook we will only use these three ML models:

* Logistic Regression

* Random Forest

* Multi-Layer Perceptron

**Exercise**: Using one of [sklearn's classification methods](https://scikit-learn.org/stable/supervised_learning.html), train one of the models listed above on one of the vectorized datasets computed during this lab.


The `MlModel` class is defined within the `ml_model.py` file. 
It fits a model for different parameters and returns the best one according to the accuracy on the validation dataset. 

The code below runs the training over various vectorizer and model parameters. As it can take a lot of time, and as the purpose of this course is not to find the best model, the training has already been run; the next cells display the results from the metadata produced by this cell.

In [None]:
FORCE_TO_RUN = False

from ml_model import MlModel

features_parameters = [[None, "count"],
                      [1000, "count"],
                      [None, "tfidf"],
                      [1000, "tfidf"],]

model_parameters = [["lr", {"C":[0.1, 1, 10]}],
                     ["rf", {"n_estimators" : [100,500]}],
                     ["mlp", {"hidden_layer_sizes" : [128, 256]}]
                      ]

if FORCE_TO_RUN:
    metadata = {}
    for nb_hash, vectorizer_type in features_parameters:
        print(nb_hash, vectorizer_type)
        vect_method = Vectorizer(vectorizer_type = vectorizer_type, nb_hash = nb_hash )
        X_train = vect_method.load_dataframe("train")
        Y_train = data_train.Categorie1.values
        X_valid = vect_method.load_dataframe("valid")
        Y_valid = data_valid.Categorie1.values
        X_test = vect_method.load_dataframe("test")
        Y_test = data_test.Categorie1.values

        for ml_model_name, param_grid in model_parameters:
            ml_class = MlModel(ml_model_name=ml_model_name, param_grid=param_grid)
            best_model, best_metadata = ml_class.train_all_parameters(X_train, Y_train, X_valid, Y_valid, save_metadata=True)
            Y_pred_test = best_model.predict(X_test)
            accuracy_test = best_model.score(X_test, Y_test)
            # scikit-learn metrics expect (y_true, y_pred), in that order
            f1_macro_score_test = smet.f1_score(Y_test, Y_pred_test, average='macro')
            balanced_accuracy_test = smet.balanced_accuracy_score(Y_test, Y_pred_test)
            best_metadata.update({"balanced_accuracy_test":balanced_accuracy_test,"accuracy_test": accuracy_test, "f1_macro_score_test":f1_macro_score_test})
            metadata.update({(vectorizer_type, str(nb_hash), ml_model_name): best_metadata})
    pickle.dump(metadata, open("data/metadata/metadata_1.pkl","wb"))
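To see why several metrics are recorded, the sketch below compares them on a tiny unbalanced example where a classifier always predicts the majority class. The labels are made up for illustration; the `zero_division=0` argument only silences the warning raised because class 1 is never predicted.

```python
import sklearn.metrics as smet

y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10  # a classifier that always predicts the majority class

accuracy = smet.accuracy_score(y_true, y_pred)
balanced_accuracy = smet.balanced_accuracy_score(y_true, y_pred)
f1_macro = smet.f1_score(y_true, y_pred, average="macro", zero_division=0)

print(accuracy, balanced_accuracy, round(f1_macro, 3))  # 0.8 0.5 0.444
```

The accuracy looks decent because 80% of the samples belong to class 0, while the balanced accuracy (mean of the per-class recalls) and the macro F1 both reveal that the minority class is never recognized.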


Here is an interactive plot where you can display the various metrics for each vectorizer and for the best model of each model type.

In [None]:
metadata = pickle.load(open("data/metadata/metadata_1.pkl","rb"))

# Create figure
fig = go.Figure()

# Add traces
metrics = ["accuracy_train",'accuracy_valid', "accuracy_test", "learning_time", "predict_time","balanced_accuracy_test","balanced_accuracy_valid","balanced_accuracy_train", "f1_macro_score_test", "f1_macro_score_valid", "f1_macro_score_train"]
N_metrics = len(metrics)
method_ml_names = ['lr','rf','mlp']
N_method_ml_names = len(method_ml_names)

buttons = []
for i_metric, metric in enumerate(metrics):
    for method_ml_name in method_ml_names:
        fig.add_trace(
            go.Scatter(
                x=[k[0]+"_"+str(k[1]) for k,v in metadata.items() if v['name']==method_ml_name],
                y=[v[metric] for v in metadata.values() if v['name']==method_ml_name],
                mode="markers",
                marker=dict(size=10),
                name = method_ml_name,
            )
        )
    buttons.append(
            dict(label=metric,
                 method="update",
                 args=[{"visible": [True if i in [i_metric*N_method_ml_names + k for k in range(N_method_ml_names)] else False for i in range(N_method_ml_names * N_metrics)]},
                       {"title": metric}]))
    

# Update remaining layout properties
fig.update_layout(
    title_text=metric,
    updatemenus=[
        dict(
            active=1,
            buttons=buttons
        )]
)

fig.show()


**Q** What is the best model according to the different metrics (`accuracy`, `balanced_accuracy`, `f1_macro_score`)? 
      Why do you think the different models perform better without hashing?

**Q** What can you say about the different training computation times?

**Q** Which method would you select?

# To go further

In this notebook we studied how to tackle text classification problems. 

In the previous plot, we made a comparison of various methods. But thousands of other combinations would have been possible playing with:
* Cleaning parameters
    * With or without stemming
    * change stopword list
    * With or without punctuation, number 
    * etc.
* Vectorizer parameters:
    * ngram
    * binary count
    * idf
    * etc.
* ML models and their parameters:
    * Logistic regression (C, penalty, etc.)
    * Multi-layer perceptron (hidden layer size, activation function, etc.)
    * Random forest (more trees, criterion)
    * SVM, DNN, XGBoost, etc.

We mostly kept the default arguments of the scikit-learn methods; there are many more to play with.

We have seen that the classes are highly unbalanced. There are various methods that you can apply to tackle this problem. 
* Augment or reduce the dataset (oversample the smallest classes, undersample the biggest classes). See the [imbalanced-learn Python library](https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html).
* Depending on the algorithm you're using, you can add a weight to each class (many sklearn classifiers have a `class_weight` argument).
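As a minimal sketch of the class-weight option, the snippet below computes the "balanced" weights on made-up labels and shows how the same option can be passed to a classifier. The labels are hypothetical, and logistic regression is chosen only as an example of a model that accepts `class_weight`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 8 + [1] * 2)  # toy unbalanced labels
classes = np.unique(y)

# "balanced" weights: n_samples / (n_classes * number of samples in the class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes.tolist(), weights.tolist())))  # {0: 0.625, 1: 2.5}

# The same option can be passed directly to the classifier.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
```

With these weights, an error on the rare class costs four times as much as an error on the frequent one during training.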

Cross-validation can also be used instead of a single validation dataset to obtain more robust performance estimates.
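A minimal sketch of 5-fold cross-validation with `scikit-learn` is given below, on a synthetic dataset. The model and the number of folds are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Each of the 5 folds is used once as validation set while the others train the model.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

The standard deviation across folds gives an idea of how much a score obtained on a single validation split could vary.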

**Exercise**: Try any of the pieces of advice above to improve the results, either on the Cdiscount dataset or on the challenge dataset ;)