# [AI Frameworks](https://github.com/wikistat/AI-Frameworks) - Natural Language Processing (NLP)

<center>
<a href="http://www.insa-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo-insa.jpg" style="float:left; max-width: 120px; display: inline" alt="INSA"/></a> 
<a href="http://wikistat.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/wikistat.jpg" width=400, style="max-width: 150px; display: inline"  alt="Wikistat"/></a>
<a href="http://www.math.univ-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo_imt.jpg" width=400,  style="float:right;  display: inline" alt="IMT"/> </a>
    
</center>

# Data : Cdiscount's product description.

This dataset was released by Cdiscount for a Kaggle-style data competition on the French website [datascience.net](https://www.datascience.net/fr/challenge). <br>
The test dataset of this competition has not been released, so we use a subset of 1M products from the original training dataset (more than 15M rows) throughout the **Natural Language Processing** lab.<br>
The objective of this competition was to classify the text descriptions of various products into the categories that make up the navigation tree of the Cdiscount website. This tree is composed of 4,733 categories organized within 44 meta-categories. <br>

The objective of this lab is not to win the competition, so we will only use the meta-categories.

# Part 3 : Recurrent Neural Network. Application to text classification and text generation

In this notebook we study how to use a recurrent neural network (RNN) for two use cases:

* Text classification: as in the two previous notebooks, we will use a recurrent neural network to predict a product's category.
* Text generation: we will see how to generate product descriptions.

# Todo 
* Text generation is one to many
* add classification (many to one)
* brand prediction? many to many?

# Libraries

In [None]:
import pandas as pd
import collections
import numpy as np
import pickle
import functools
from tqdm import tqdm

import tensorflow.keras.models as km
import tensorflow.keras.layers as kl

# Keras Tutorial

This part explains how to build the different types of RNN models (**one/many-to-one/many**) with `keras`.

The examples are strictly pedagogical: the problems they solve would not require such models in practice.

## Many to one

A **many-to-one** recurrent neural network takes a sequence as input and returns a scalar as output:


<img src="https://github.com/wikistat/AI-Frameworks/blob/master/Text/images/many_to_one.png?raw=true" alt="drawing" width="400"/>

#### Toy Example

Let's take an example where the *inputs* are sequences of 3 numbers and the output is the sum of these 3 numbers.

Hence the input matrix *X* will have the following dimensions:

* N: number of sequences = 100 (arbitrary value),
* Timestep: size of the sequences = 3,
* Number of features (how many features for each element of a sequence) = 1

*NB*: by default, keras expects the input dimensions in this order: N_batch, Timestep and Features.


In [None]:
X = np.arange(300).reshape(100,3,1)
print("Dimensions of input sequences: %d, Timestep: %d, Number of features: %d" %X.shape)
# One target per sequence: the sum over the timestep axis, of shape (100, 1)
Y = X.sum(1)
print("Dimensions of output: Number of sequences: %d, Output size: %d" %Y.shape)
print("Input Example")
print(X[0])
print("Output Example")
print(Y[0])

#### Model

The following lines define a very simple model with one **many-to-one** *RNN* layer with:

* 10 neurons (*units=10*)
* a *relu* activation

This model takes sequences of size (3,1) as input. Note that the *input_shape* argument does not include the batch size, only the *timestep* and the *feature size*.

In [None]:
model = km.Sequential()
model.add(kl.SimpleRNN(units=10 ,activation="relu", input_shape=(3, 1)))
model.summary()

**Q**: Does the shape of the output seem normal to you? What do the two dimensions represent? <br>
**Exercise**: Send a sequence through the model and check the output.

In [None]:
# %load solution/simple_rnn_output.py

Let's now define the complete model. As the output of the RNN layer is a vector with 10 features, let's add a Dense layer so that the output is of size 1.

In [None]:
model = km.Sequential()
model.add(kl.SimpleRNN(units=10 ,activation="relu", input_shape=(3, 1)))
model.add(kl.Dense(1))
model.summary()
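
The shapes involved in this complete model can also be traced by hand. The block below is a minimal NumPy sketch of the recurrence followed by the dense layer, not the Keras implementation: the weights are random and untrained, and all names (`x_toy`, `W_x`, `W_h`, `W_out`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n, timesteps, features, units = 100, 3, 1, 10
x_toy = np.arange(n * timesteps, dtype=float).reshape(n, timesteps, features)

# Hypothetical, untrained weights; Keras would learn these during fit().
W_x = rng.normal(size=(features, units))   # input -> hidden
W_h = rng.normal(size=(units, units))      # hidden -> hidden
b_h = np.zeros(units)
W_out = rng.normal(size=(units, 1))        # Dense(1) weights
b_out = np.zeros(1)

def relu(z):
    return np.maximum(z, 0.0)

# Recurrence: the hidden state is updated once per timestep.
h = np.zeros((n, units))
for t in range(timesteps):
    h = relu(x_toy[:, t, :] @ W_x + h @ W_h + b_h)

# Many-to-one: only the last hidden state is kept, then Dense(1) maps it to a scalar.
y_hat = h @ W_out + b_out

print(h.shape)      # (100, 10)
print(y_hat.shape)  # (100, 1)
```

The last hidden state has one row per sequence and one column per unit, and the dense layer reduces each row to a single value.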

We now train the model with the *adam* optimizer and *mse* as the loss function.

In [None]:
epochs = 500
batch_size=32
model.compile(loss="mse", optimizer="adam")
model.fit(X, Y, epochs=epochs, batch_size=batch_size, verbose=0)

Let's check that the model can now correctly perform the sum:

In [None]:
x_test=np.array([10,11,12]).reshape(1,3,1)
print(model.predict(x_test))
x_test=np.array([10,25,12]).reshape(1,3,1)
print(model.predict(x_test))

Note that with *input_shape = (1,1)* we would get a **one-to-one** recurrent neural network.

### With no fixed timestep

Note that in the previous example, the timestep was fixed to three. It is also possible to set this parameter to *None* so that the model can handle sequences of variable length.

In [None]:
model = km.Sequential()
model.add(kl.SimpleRNN(units=10 ,activation="relu", input_shape=(None, 1)))
model.add(kl.Dense(1))
model.summary()
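
Leaving the timestep unspecified works because the recurrence reuses the same weights at every step, so no parameter depends on the sequence length. The following NumPy sketch illustrates this with random, untrained weights; `last_hidden_state` and the weight names are hypothetical, not part of Keras.

```python
import numpy as np

rng = np.random.default_rng(0)
units = 10
W_x = rng.normal(size=(1, units))       # input -> hidden
W_h = rng.normal(size=(units, units))   # hidden -> hidden
b_h = np.zeros(units)

def last_hidden_state(x):
    """x has shape (n, timesteps, 1); returns the final state, shape (n, units)."""
    h = np.zeros((x.shape[0], units))
    for t in range(x.shape[1]):
        # The same W_x, W_h and b_h are applied at every timestep.
        h = np.maximum(x[:, t, :] @ W_x + h @ W_h + b_h, 0.0)
    return h

for timesteps in (2, 3, 4):
    x = rng.normal(size=(5, timesteps, 1))
    print(timesteps, last_hidden_state(x).shape)
```

Whatever the number of timesteps, the final state has shape `(5, 10)`.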

However, even if the model can handle sequences of variable length, all the sequences passed to the model during training must have the same length.

To handle this, we apply zero padding to the variable-length sequences with the `pad_sequences` function from `keras`, so that the padding does not affect the results.

Let's first create a list X of sequences of different sizes (3 or 2).

In [None]:
X = []
for x in np.arange(0,300,3):
    X.append([x,x+1,x+2])
for x in np.arange(300,500,2):
    X.append([x,x+1])
collections.Counter([len(x) for x in X])

Let's now pad these sequences with zeros. <br>

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
# padding='pre' adds the zeros to the left of the shorter sequences
X = pad_sequences(X, value=0.0, padding = 'pre')
print("X shape : (%d,%d)" %X.shape)
print("3 first sequences")
print(X[:3])
print("2 last sequences")
print(X[-2:])

Note that a good practice is to pad to the **left** of the sequences. <br>
This may seem counterintuitive, but the reason is that nothing is learned at the beginning of the sequence while all the values are zeros; the real learning starts when the first non-zero value appears. <br>
If the sequences are padded to the right, the information learned at the beginning of the sequences could be lost while passing through all the zeros at the end of the sequences.
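
This argument can be checked on a toy recurrence. The sketch below is an illustration under simplifying assumptions (no bias, zero initial state, *tanh* activation, random untrained weights); `pad_left` and `last_state` are hypothetical helpers, with `pad_left` only mimicking what `pad_sequences(..., padding='pre')` does for 1-D sequences.

```python
import numpy as np

def pad_left(sequences, value=0.0):
    """Hypothetical helper: left-pad 1-D sequences to a common length."""
    max_len = max(len(s) for s in sequences)
    out = np.full((len(sequences), max_len), value, dtype=float)
    for i, s in enumerate(sequences):
        out[i, max_len - len(s):] = s
    return out

print(pad_left([[1, 2, 3], [4, 5]]))  # second row becomes [0, 4, 5]

# Bias-free recurrence with random (untrained) weights and a zero initial state.
rng = np.random.default_rng(0)
units = 4
w_x = rng.normal(size=units)
W_h = rng.normal(size=(units, units))

def last_state(seq):
    h = np.zeros(units)
    for x_t in seq:
        h = np.tanh(x_t * w_x + h @ W_h)
    return h

seq = [0.3, -0.5]
left_padded = pad_left([seq, [0.0] * 4])[0]   # [0, 0, 0.3, -0.5]
right_padded = np.array(seq + [0.0, 0.0])     # [0.3, -0.5, 0, 0]

# Leading zeros leave the state at exactly zero, so the result is unchanged.
print(np.allclose(last_state(left_padded), last_state(seq)))  # True
# Trailing zeros keep transforming the state after the last real value.
print(last_state(right_padded) - last_state(seq))
```

With a trained bias the left-padded state is no longer exactly zero, so this equality should be read as an idealized picture of the argument rather than a guarantee.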

Let's now train the model!

In [None]:
X = X.reshape(200,3,1)
Y = X.sum(1)

epochs = 500
batch_size=32
model.compile(loss="mse", optimizer="adam")
model.fit(X, Y, epochs=epochs, batch_size=batch_size, verbose=0)

We can now predict the sum of sequences of different lengths (even longer sequences, but the results are not guaranteed!)

In [None]:
x_test=np.array([3,4]).reshape(1,2,1)
print(model.predict(x_test))
x_test=np.array([3,4,5]).reshape(1,3,1)
print(model.predict(x_test))
x_test=np.array([3,4,5,6]).reshape(1,4,1)
print(model.predict(x_test))

## Many to Many

A **many-to-many** recurrent neural network takes a sequence as input and returns a sequence as output:


<img src="https://github.com/wikistat/AI-Frameworks/blob/master/Text/images/many_to_many.png?raw=true" alt="drawing" width="400"/>

#### Toy Example

Let's take an example where the *inputs* are sequences of 3 numbers and the outputs are the sequences of their cumulative sums, i.e.

* input = [x1, x2, x3]
* output = [x1, x1+x2, x1+x2+x3]


Hence BOTH the input *X* AND the output *Y* matrices will have the following dimensions:

* N: Number of sequences = 100 (arbitrary values), 
* Timestep: (Size of sequences) = 3, 
* Number of features (How many features for each element of the sequences) = 1,



In [None]:
X = np.arange(300).reshape(100,3,1)
print("Dimensions of input sequences: %d, Timestep: %d, Number of features: %d" %X.shape)
Y = X.cumsum(1).reshape(100,3,1)
print("Dimensions of output sequences: %d, Timestep: %d, Number of features: %d" %Y.shape)
print("Input Example")
print(X[0])
print("Output Example")
print(Y[0])

#### Model

The following lines define a very simple model with one **many-to-many** *RNN* layer with:

* 10 neurons (*units=10*)
* a *relu* activation

This model takes sequences of size (3,1) as input and returns a sequence of the same length.<br>
This is specified by the *return_sequences* argument, which is set to True.

In [None]:
model = km.Sequential()
model.add(kl.SimpleRNN(units=10 ,activation="relu", input_shape=(3, 1), return_sequences=True))
model.summary()

**Q**: Does the shape of the output seem normal to you? What do the three dimensions represent? <br>
**Exercise**: Send a sequence through the model and check the output.

In [None]:
# %load solution/simple_rnn_output_bis.py

Let's now define the complete model. <br>
For each input sequence, the output of the RNN layer is a matrix of size 3 (number of timesteps) by 10 (features). <br>
The desired output is a sequence of size 3 by 1.

In order to obtain the correct dimensions, let's apply a Dense layer at each timestep of the output with the help of the `TimeDistributed` layer of `keras`.

In [None]:
model = km.Sequential()
model.add(kl.SimpleRNN(units=10 ,activation="relu", input_shape=(3, 1), return_sequences=True))
model.add(kl.TimeDistributed(kl.Dense(1)))
model.summary()
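
What `TimeDistributed(Dense(1))` does to the shapes can be mimicked in NumPy: one shared weight matrix is applied independently at each timestep. This is a minimal sketch with a random stand-in for the RNN output; `H`, `W` and `b` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
n, timesteps, units = 100, 3, 10

H = rng.normal(size=(n, timesteps, units))  # stand-in for the RNN output sequence
W = rng.normal(size=(units, 1))             # one shared Dense(1) weight matrix
b = np.zeros(1)

# Applying the same dense layer at every timestep is a single batched matmul.
Y_hat = H @ W + b
print(Y_hat.shape)  # (100, 3, 1)

# Equivalent explicit loop over timesteps.
Y_loop = np.stack([H[:, t, :] @ W + b for t in range(timesteps)], axis=1)
print(np.allclose(Y_hat, Y_loop))  # True
```

The number of dense parameters therefore does not depend on the number of timesteps.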

We now train the model with the *adam* optimizer and *mse* as the loss function.

In [None]:
epochs = 500
batch_size=32
model.compile(loss="mse", optimizer="adam")
model.fit(X, Y, epochs=epochs, batch_size=batch_size, verbose=0)

Let's check that the model can now correctly perform the cumulative sum:

In [None]:
x_test=np.array([10,11,12]).reshape(1,3,1)
print(model.predict(x_test))
x_test=np.array([10,25,12]).reshape(1,3,1)
print(model.predict(x_test))

Note that, as previously seen, it would have been possible to set the *timestep* parameter to *None* so that the model can compute the cumulative sum of sequences of any length.  
This could be a good **exercise** if you want to practice.

## One to Many

A **one-to-many** recurrent neural network takes a scalar as input and returns a sequence as output.

There are different ways to define a **one-to-many** neural network.

* In the example below, the **one-to-many** network can be seen as a **many-to-many** neural network where the input sequence is built iteratively.

<img src="https://github.com/wikistat/AI-Frameworks/blob/master/Text/images/one_to_many.png?raw=true" alt="drawing" width="400"/>

* It would also be possible to pass an input only at the first timestep. The **one-to-many** network can then be seen as a **many-to-many** neural network where the input sequence is composed of one scalar followed by `None` entries to fill the sequence.

Hence, **one-to-many** networks can be seen as a particular case of **many-to-many** neural networks.

#### Toy Example

Let's take an example where the *input* is a scalar and the output is a sequence of 3 numbers such that

* input = x
* output = [x+2, x+4, x+6]

Hence the output matrix *Y* will have the following dimensions:

* N: Number of sequences = 100 (arbitrary values), 
* Timestep: (Size of sequences) = 3, 
* Number of features (How many features for each element of the sequences) = 1 

#### Model

At **training** time, the keras model will be built as a **many-to-many** model. <br>
Indeed, as you know the output sequence you expect to get, you also know the sequence that will be sent as input. In the example above:

* input = [x,x+2,x+4]
* output = [x+2,x+4,x+6]

**Exercise**: Build the toy dataset and the model that will learn how to predict the output sequences from the input sequences.<br>
**NB**: Remember that at prediction time, the model should be able to take a scalar as input (i.e. a sequence of one timestep).


In [None]:
# %load solution/one_to_many_dataset.py

In [None]:
# %load solution/one_to_many_model.py

**Exercise**: Once your model is built, write a function that uses the model to build the 3-number output sequence from a scalar input.

In [None]:
# %load solution/one_to_many_prediction.py

All the examples detailed so far have used sequences with a single feature per timestep in order to keep this tutorial simple.

All of these examples can easily be translated to sequences with several features. Let's check that with an example on the **Cdiscount** dataset!

## RNN layers

Once you know how to manipulate the structures defined above, it is really easy to build more complex or deeper RNN models with keras.

* `GRU` and `LSTM` can be used in exactly the same way as `SimpleRNN` in the examples above.
* Bidirectional layers can be built by wrapping an `RNN` layer in the `Bidirectional` layer.
* Deep RNNs can be built by stacking `RNN` layers, as in any other sequential model.

***Example***:
Here is how to build a model with one *LSTM* layer followed by a bidirectional *GRU* layer.

In [None]:
model = km.Sequential()
model.add(kl.LSTM(units=10 ,activation="relu", input_shape=(3, 1), return_sequences=True))
model.add(kl.Bidirectional(kl.GRU(units=10 ,activation="relu", return_sequences=True)))
model.add(kl.TimeDistributed(kl.Dense(1)))
model.summary()
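
In the summary above, the bidirectional layer outputs 20 features per timestep rather than 10, because the forward and backward outputs are concatenated by default. The NumPy sketch below reproduces this shape logic with a plain *tanh* recurrence standing in for the GRU (a simplification); `run_rnn` and the weights are hypothetical and untrained.

```python
import numpy as np

rng = np.random.default_rng(0)
n, timesteps, features, units = 2, 3, 10, 10

def run_rnn(x, W_x, W_h):
    """Plain tanh recurrence returning the state at every timestep: (n, timesteps, units)."""
    h = np.zeros((x.shape[0], W_h.shape[0]))
    states = []
    for t in range(x.shape[1]):
        h = np.tanh(x[:, t, :] @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states, axis=1)

x = rng.normal(size=(n, timesteps, features))
fw = (rng.normal(size=(features, units)), rng.normal(size=(units, units)))
bw = (rng.normal(size=(features, units)), rng.normal(size=(units, units)))

forward = run_rnn(x, *fw)
# The backward layer reads the sequence reversed; its outputs are flipped back
# so that position t of both halves refers to the same input timestep.
backward = run_rnn(x[:, ::-1, :], *bw)[:, ::-1, :]

out = np.concatenate([forward, backward], axis=-1)
print(out.shape)  # (2, 3, 20)
```

The forward and backward passes have separate weights, which is why a bidirectional layer has twice the parameters of the wrapped layer.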

# Text Generation 

## Dataset

The Level 3 category `COQUE - BUMPER - FACADE TELEPHONE` is the most represented category within the original **Cdiscount** dataset, with 2,184,671 descriptions. Among them, 1,761,637 are composed of 197 characters.

We will now use these lines (or a sub-sample of them, depending on the computing power of your machine) to learn a text generation model that automatically generates new text descriptions for this type of product.

In [None]:
N = 100000
X = np.load("data/description_coque.npy")[:N]
print(X.shape)
print(X[:3])
print("Size of all the sequences : %s" %(str(set([len(x) for x in X]))))

In [None]:
Ns=197

### Data Processing

Text generation implies building a **one-to-many** model:

<img src="https://github.com/wikistat/AI-Frameworks/blob/master/Text/images/one_to_many.png?raw=true" alt="drawing" width="400"/>

Where the prediction $y_t$ will be used as an input at time $t+1$, i.e : $y_t=x_{t+1}$. 

This model will be trained as a **many-to-many** model.

<img src="https://github.com/wikistat/AI-Frameworks/blob/master/Text/images/many_to_many.png?raw=true" alt="drawing" width="400"/>

Where the output y will be the same sequence as the input x, shifted by one position.

When using text, the input can be either words or characters. As the descriptions have a fixed length in characters, we will use characters as the elements of the sequences. <br>
These characters will be *one-hot encoded*. <br>

Hence each description $x$ will be represented as a matrix of size $N_s \times N_v$ where

* $N_s=197$ is the length of the sequences (timestep)
* $N_v$ is the size of the vocabulary (the list of characters).
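
As an illustration of this representation, the sketch below one-hot encodes a short string with NumPy. The string and the toy vocabulary built from it are hypothetical; the notebook builds the real vocabulary from the dataset in the next section.

```python
import numpy as np

text = "coque iphone"              # hypothetical description
vocab = sorted(set(text))          # toy vocabulary built from this string only
char_to_idx = {c: i for i, c in enumerate(vocab)}

indices = np.array([char_to_idx[c] for c in text])
# Row i of the identity matrix is the one-hot vector of vocabulary entry i.
one_hot = np.eye(len(vocab))[indices]

print(one_hot.shape)        # (12, 10): 12 characters, 10 distinct ones
print(one_hot.sum(axis=1))  # every row contains exactly one 1
```

Each row is a timestep and each column a vocabulary entry, which matches the $N_s \times N_v$ layout described above.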


### Characters' list

Let's first create a list of unique characters.

In [None]:
chars_set = list(functools.reduce(lambda x,y : x.union(y), [set(x) for x in X], set()))
print("Characters list of size %d : %s"  %(len(chars_set), str(chars_set)))

We will add two elements to this list to mark the *start* and the *end* of a sequence.

In [None]:
chars_set.extend(["START","END"])
Nv = len(chars_set)
print("Total size of the vocabulary : %d" %Nv)
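
With these two markers, the one-position offset between the input and the target of the generation model can be pictured on a short string. This is a toy sketch on a hypothetical description, using plain Python lists instead of the encoded matrices.

```python
description = "coque"  # hypothetical short description

# The input begins with START; the target is shifted by one position and ends with END.
x_tokens = ["START"] + list(description)
y_tokens = list(description) + ["END"]

# At each timestep the model reads x_t and has to predict y_t, the next character.
for x_t, y_t in zip(x_tokens, y_tokens):
    print(repr(x_t), "->", repr(y_t))
```

Both sequences have one more element than the description itself, which is why the encoded sequences have length $N_s+1$.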

### Sequence encoding

We did not find a library function that *one-hot encodes* a string at the character level.

The following lines do it.

* First, the `char_to_int` and `int_to_char` dictionaries are created, which map each character to its position in the vocabulary and back.

In [None]:
int_to_char = {i:c for i,c in enumerate(chars_set)}
char_to_int = {c:i for i,c in int_to_char.items()}

The following function encodes

* a matrix $X$ composed of *N* text descriptions of $N_s$ characters each

into

* a matrix $X_{vec} \in \mathbb{R}^{N \times (N_s+1) \times N_v}$ composed of *N* one-hot encoded sequences of size $(N_s+1)\times N_v$: the encoded text description preceded by the `START` element,

and builds the target matrix $Y_{vec} \in \mathbb{R}^{N \times (N_s+1) \times N_v}$, which is the same as $X_{vec}$ shifted by one position and terminated by the `END` element.

In [None]:
def encode_input_output_sequence(x_descriptions, length_sequence, size_vocab, char_to_int_dic):
    # Get the number of descriptions in x.
    n = x_descriptions.shape[0]
    
    # Allocate the encoded matrices, filled with zeros.
    # The sequences have length_sequence+1 steps: one extra step for START (input) or END (output).
    x_vec = np.zeros((n,length_sequence+1, size_vocab))
    y_vec = np.zeros((n,length_sequence+1, size_vocab))
    
    
    # Let's now fill the matrices with ones at the position of each character.
    
    # First, each input sequence begins with the START element
    x_vec[:,0,char_to_int_dic["START"]] = 1
    # and each output sequence ends with the END element.
    y_vec[:,-1,char_to_int_dic["END"]] = 1
    # Now let's iterate over all x_descriptions
    for ix,x in enumerate(tqdm(x_descriptions)):
        # and over each character of the description.
        for ic,c in enumerate(x):
            # For each character `c` we set a one at its position in the vocabulary.
            c_int = char_to_int_dic[c]
            x_vec[ix,ic+1,c_int]=1
    # The y_vec matrix is the same as the x_vec matrix, shifted by one position.
    y_vec[:,:-1,:] = x_vec[:,1:,:] 
    return x_vec, y_vec


**Exercise**: Make sure you understand each step of this function.

Let's apply it to the descriptions dataset.

In [None]:
X_vec, Y_vec = encode_input_output_sequence(X[:N], Ns, Nv, char_to_int)
X_vec.shape

**Exercise**
* Write a function that retrieves the original sentence from an encoded sequence.
* Check that *x* and *y* actually contain the same description, shifted by one position.

In [None]:
# %load solution/decoded_vector.py

## Training


**Exercise**: Define a simple model (only one LSTM layer with 32 hidden units) to train for text generation.
*Tips*:
* Remember that this model will be used for generation.
* What are the dimensions of the output? What should the activation be? The loss function?


In [None]:
# %load solution/train_model_text_generation.py

Now that you have correctly written the model, you can observe that it can take a while to reach convergence when training this kind of model.

Let's load a model that was trained with the solution above!

In [None]:
from tensorflow.keras.models import load_model
model = load_model("data/generate_model.h5")

## Text Generation

The code below predicts a sentence iteratively: each new character is predicted from the characters generated so far.

In [None]:
from tensorflow import convert_to_tensor
x_pred = np.zeros((1, Ns+1, Nv))
print("step 0")
x_pred[0,0,char_to_int["START"]] =1
x_pred_str = decode_sequence(x_pred[0], int_to_char)
print(x_pred_str)

for i in range(Ns):
    x_tensor = convert_to_tensor(x_pred[:,:i+1,:])
    ix = np.argmax(model.predict(x_tensor)[0][-1,:])
    x_pred[0,i+1,ix] = 1
x_pred_str=decode_sequence(x_pred[0], int_to_char)
print(x_pred_str)

**Q**: How is this prediction made?

**Exercise**: Generate a text with a random first letter.

In [None]:
# %load solution/text_generation_random_first_letter.py

**Exercise**: Generate a text with some randomness. For example, sample each character from a multinomial distribution defined by the model output.

In [None]:
# %load solution/text_generation_multinomial.py

# Text classification

In [None]:
import sklearn.model_selection as sms
from solution.clean import CleanText
import gensim
from sklearn.feature_extraction.text import TfidfVectorizer
from vectorizer import Vectorizer



## Load Data

In [None]:
ct = CleanText()
data = pd.read_csv("data/cdiscount_train.csv.zip",sep=",", nrows=100000)
ct.clean_df_column(data, "Description", "Description_cleaned")
print("The train dataset is composed of %d lines" %data.shape[0])
data.head(5)

In [None]:
data_test = pd.read_csv("data/cdiscount_test.csv.zip",sep=",")
ct.clean_df_column(data_test, "Description", "Description_cleaned")
print("The test dataset is composed of %d lines" %data_test.shape[0])
data_test.head(5)

## TF IDF

In [None]:
data_train, data_valid = sms.train_test_split(data, test_size=0.1, random_state=42)

In [None]:
def vect_sequence(data_array):
    # TfidfVectorizer tokenizes the raw strings itself, so the cleaned descriptions are passed as-is
    descriptions = data_array["Description_cleaned"].values
    vec = TfidfVectorizer(ngram_range=(1, 1))
    data_vec = vec.fit_transform(descriptions)
    
    return data_vec

data_vec_train = vect_sequence(data_train)

In [None]:
data_vec_train

In [None]:
vect_method = Vectorizer(vectorizer_type = "tfidf", nb_hash = None )
vec, feathash, data_train_vec = vect_method.vectorizer_train(data_train, columns = "Description_cleaned")
data_valid_vec = vect_method.apply_vectorizer(data_valid, columns = "Description_cleaned", vec = vec, feathash = feathash)
data_test_vec = vect_method.apply_vectorizer(data_test, columns = "Description_cleaned", vec = vec, feathash = feathash)


In [None]:
Y_train = data_train.Categorie1.values
Y_valid = data_valid.Categorie1.values
Y_test = data_test.Categorie1.values
# The label <-> integer mappings are built from the training labels, so Y_train must be defined first
int_to_label = {k:v for k,v in enumerate(set(Y_train))}
label_to_int = {v:k for k,v in int_to_label.items()}
Y_train_int = np.array([label_to_int[y] for y in Y_train]).reshape(len(Y_train),1)
Y_valid_int = np.array([label_to_int[y] for y in Y_valid]).reshape(len(Y_valid),1)
Y_test_int = np.array([label_to_int[y] for y in Y_test]).reshape(len(Y_test),1)
N_label = len(int_to_label)
print(N_label)

In [None]:
model = km.Sequential()
model.add(kl.SimpleRNN(units=100 ,activation="relu", input_shape=(28, 300), return_sequences=True))
model.add(kl.BatchNormalization())
model.add(kl.SimpleRNN(units=50 ,activation="relu"))
model.add(kl.Dense(N_label))
model.add(kl.Activation("softmax"))
model.summary()

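epochs = 500
batch_size=256
# compile() returns None, so its result is not stored
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
# X_train and X_valid are the embedding sequences built in the word embedding section below
history=model.fit(X_train, Y_train_int, epochs=epochs, batch_size=batch_size, verbose=1, validation_data=(X_valid, Y_valid_int))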

## Word Embeddings

In [None]:
train_array_token = [line.split(" ") for line in data_train["Description_cleaned"].values]
valid_array_token = [line.split(" ") for line in data_valid["Description_cleaned"].values]
test_array_token = [line.split(" ") for line in data_test["Description_cleaned"].values]

In [None]:
model_sg_full = gensim.models.Word2Vec.load("data/w2v_model/full_model_sg")

In [None]:
np.percentile([len(x) for x in train_array_token], q=[99]),max([len(x) for x in train_array_token])

In [None]:
def tokens_to_embedding_sequences(array_token, model):
    array_embedding_sequences = []
    for tokens in tqdm(array_token):
        embedding_sequence = []
        # Keep at most the first 28 tokens of each description
        for token in tokens[:28]:
             # The word vectors of a gensim Word2Vec model are accessed through its `wv` attribute
             embedding_sequence.append(model.wv[token])
        array_embedding_sequences.append(embedding_sequence)
    # dtype must be a float type: the default int32 would truncate the embedding values
    X = pad_sequences(array_embedding_sequences, maxlen=28, dtype="float32")
    return X
X_train = tokens_to_embedding_sequences(train_array_token, model_sg_full)
X_valid = tokens_to_embedding_sequences(valid_array_token, model_sg_full)
X_test = tokens_to_embedding_sequences(test_array_token, model_sg_full)

In [None]:
model = km.Sequential()
model.add(kl.LSTM(units=256 ,activation="relu", input_shape=(28, 300)))
model.add(kl.Dense(256))
model.add(kl.Activation("relu"))
model.add(kl.Dense(N_label))
model.add(kl.Activation("softmax"))
model.summary()

epochs = 500
batch_size=256
# compile() returns None, so its result is not stored
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(X_train, Y_train_int, epochs=epochs, batch_size=batch_size, verbose=1, validation_data=(X_valid, Y_valid_int))