<a href="https://colab.research.google.com/github/wikistat/AI-Frameworks/blob/master/Text/3_recurrent_neural_network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [IA Frameworks](https://github.com/wikistat/AI-Frameworks) - Natural Language Processing (NLP)

<center>
<a href="http://www.insa-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo-insa.jpg" style="float:left; max-width: 120px; display: inline" alt="INSA"/></a> 
<a href="http://wikistat.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/wikistat.jpg" width=400, style="max-width: 150px; display: inline"  alt="Wikistat"/></a>
<a href="http://www.math.univ-toulouse.fr/" ><img src="http://www.math.univ-toulouse.fr/~besse/Wikistat/Images/logo_imt.jpg" width=400,  style="float:right;  display: inline" alt="IMT"/> </a>
    
</center>

# Data : Cdiscount's product description.

This dataset was released by Cdiscount for a Kaggle-style data competition on the French website [datascience.net](https://www.datascience.net/fr/challenge). <br>
The test dataset of this competition has not been released, so throughout the **Natural Language Processing** lab we use a subset of 1M products from the original training dataset (15M+ rows).<br>
The objective of the competition was to classify product text descriptions into the categories that make up the navigation tree of the Cdiscount website. The tree is composed of 4,733 categories organized within 44 meta-categories. <br>

The objective of this lab is not to win the competition, so we will only use the meta-categories.

In [60]:
import tensorflow
tensorflow.__version__

'2.3.1'

# Files & Data (Google Colab)

If you're running this notebook on Google Colab, you do not have access to the `data` or `solution` folders you get by cloning the repository locally.

The following lines build the folders and download the files you need for this lab.

**WARNING 1** Do not run these lines locally.<br>
**WARNING 2** The magic command `%load` does not work on Google Colab; you will have to copy-paste the solution into the notebook.

In [None]:

! mkdir data
! wget -P data https://github.com/wikistat/AI-Frameworks/raw/master/Text/data/cdiscount_test.csv.zip
! wget -P data https://github.com/wikistat/AI-Frameworks/raw/master/Text/data/cdiscount_train.csv.zip
! wget -P data https://github.com/wikistat/AI-Frameworks/raw/master/Text/data/description_coque.npy.zip
! unzip data/description_coque.npy.zip -d data/
! wget -P data https://github.com/wikistat/AI-Frameworks/raw/master/Text/data/char_to_int.pkl
! wget -P data https://github.com/wikistat/AI-Frameworks/raw/master/Text/data/int_to_char.pkl
! wget -P data https://github.com/wikistat/AI-Frameworks/raw/master/Text/data/generate_model.h5
! mkdir solution
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/many_to_one_toy_example.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/simple_rnn_output.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/many_to_many_toy_example.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/simple_rnn_output_bis.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/one_to_many_dataset.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/one_to_many_model.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/one_to_many_prediction.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/encode_input_output_sequence.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/train_model_text_generation.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/text_generation_random_first_letter.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/text_generation_multinomial.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/token_to_embedding_sequence
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/rnn_classifier_model.py
! wget -P solution https://github.com/wikistat/AI-Frameworks/raw/master/Text/solution/clean.py

# Part 3 : Recurrent Neural Network.

The objectives of this notebook are the following:

* Build various *RNN* architectures with `keras` on toy examples (**many-to-one**, **one-to-many**, **many-to-many**, **bidirectional layers**, **deep networks**, etc.).
* Use *RNN* for **text generation**: generate product descriptions.
* Use *RNN* for **text classification**: as in the two previous notebooks, we will use a recurrent neural network to predict the product's category.

# Libraries

In [1]:
import pandas as pd
import collections
import numpy as np
import pickle
import functools
from tqdm import tqdm

import tensorflow.keras.models as km
import tensorflow.keras.layers as kl

import logging
logging.getLogger('tensorflow').disabled = True

import sklearn.model_selection as sms
from solution.clean import CleanText
import gensim

# Keras Tutorial

This part aims to understand how to build the different types of RNN models (**one/many-to-one/many**) with `keras`.

This part is strictly pedagogical and toy data will be used.

## Many to one

A **many-to-one** recurrent neural network takes a sequence as input and returns a scalar as output:


<img src="https://github.com/wikistat/AI-Frameworks/blob/master/Text/images/many_to_one.png?raw=true" alt="drawing" width="400"/>

### Toy Example

An **RNN** layer always takes a 3-dimensional tensor as input (and returns one as well when it outputs a full sequence). With `Keras`, the dimensions are, in this order:

* The **Batch** (number of sequences in the dataset).
* The **Timestep** (length of one sequence).
* The **Features** (number of features for each element of each sequence).

Let's take an example where:

* the *input* is a sequence of 3 numbers,
* the *output* is the sum of these 3 numbers,
* the dataset contains 100 individuals.

**Exercise**: What are the dimensions of the input `X` and output `Y` arrays?
Fill in the cell below with the correct dimensions.



In [2]:
X = np.arange(??).reshape(?,?,?)
print("Dimensions of input sequences: %d, Timestep: %d, Number of features: %d" %X.shape)
Y = X.sum(1).reshape(100,?,?)
print("Dimensions of output sequences: %d, Timestep: %d, Number of features: %d" %Y.shape)
print("Input Example")
print(X[0])
print("Output Example")
print(Y[0])

SyntaxError: invalid syntax (<ipython-input-2-39405d1045e4>, line 1)

In [10]:
# %load solution/many_to_one_toy_example.py
X = np.arange(300).reshape(100,3,1)
print("Dimensions of input sequences: %d, Timestep: %d, Number of features: %d" %X.shape)
Y = X.sum(1).reshape(100,1,1)
print("Dimensions of output sequences: %d, Timestep: %d, Number of features: %d" %Y.shape)
print("Input Example")
print(X[0])
print("Output Example")
print(Y[0])


Dimensions of input sequences: 100, Timestep: 3, Number of features: 1
Dimensions of output sequences: 100, Timestep: 1, Number of features: 1
Input Example
[[0]
 [1]
 [2]]
Output Example
[[3]]


### Model

The following lines define a very simple model with one **many-to-one** *RNN* layer with:

* 10 neurons (*units=10*)
* a *relu* activation

This model takes sequences of size (3,1) as input. Note that the *input_shape* argument does not include the batch size, only the *timestep* and the *feature size*.

In [11]:
model = km.Sequential()
model.add(kl.SimpleRNN(units=10 ,activation="relu", input_shape=(3, 1)))
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn_1 (SimpleRNN)     (None, 10)                120       
Total params: 120
Trainable params: 120
Non-trainable params: 0
_________________________________________________________________


**Q**: Does the shape of the output seem normal to you? What do the two dimensions represent? <br>
**Exercise**: Send a sequence through the model and check the output.

In [13]:
# %load solution/simple_rnn_output.py
x_test = np.array([1,2,3]).reshape(1,3,1)
model.predict(x_test).shape

Let's now define the complete model. <br>
The output of the RNN layer is a vector of size 10 for each individual. <br>
Let's add a Dense layer so that the final output is of size 1 for each individual, *i.e.* the dimension we want as an output.

In [14]:
model = km.Sequential()
model.add(kl.SimpleRNN(units=10 ,activation="relu", input_shape=(3, 1)))
model.add(kl.Dense(1))
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn_2 (SimpleRNN)     (None, 10)                120       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 11        
Total params: 131
Trainable params: 131
Non-trainable params: 0
_________________________________________________________________
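The parameter counts shown in the summaries can be checked by hand. The sketch below assumes the standard formulation of a simple recurrent layer (one input kernel, one recurrent kernel and one bias vector) and of a dense layer (one kernel and one bias vector); the helper names `simple_rnn_params` and `dense_params` are illustrative and not part of `keras`.

```python
def simple_rnn_params(units, n_features):
    # input kernel (n_features x units) + recurrent kernel (units x units) + bias (units)
    return n_features * units + units * units + units


def dense_params(n_inputs, n_outputs):
    # kernel (n_inputs x n_outputs) + bias (n_outputs)
    return n_inputs * n_outputs + n_outputs


rnn_count = simple_rnn_params(units=10, n_features=1)
dense_count = dense_params(n_inputs=10, n_outputs=1)
print(rnn_count, dense_count, rnn_count + dense_count)  # 120 11 131
```

Note that the timestep length does not appear in these formulas: the same weights are reused at every timestep.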


We now train the model with an *adam* optimizer and *mse* as the loss function.

In [15]:
epochs = 500
batch_size=32
model.compile(loss="mse", optimizer="adam")
model.fit(X, Y, epochs=epochs, batch_size=batch_size, verbose=0)

<tensorflow.python.keras.callbacks.History at 0x7fe72867d3d0>

Let's check that the model can now correctly perform the sum:

In [16]:
x_test=np.array([10,11,12]).reshape(1,3,1)
print(model.predict(x_test))
x_test=np.array([10,25,12]).reshape(1,3,1)
print(model.predict(x_test))

[[33.049534]]
[[47.834915]]


Note that with `input_shape=(1, 1)` we would get a **one-to-one** recurrent neural network.

### With no fixed timestep

Note that in the previous example, the timestep was fixed to three. It is also possible to set this dimension to *None* so that the model can handle sequences of variable length.

In [17]:
model = km.Sequential()
model.add(kl.SimpleRNN(units=10 ,activation="relu", input_shape=(None, 1)))
model.add(kl.Dense(1))
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn_3 (SimpleRNN)     (None, 10)                120       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 11        
Total params: 131
Trainable params: 131
Non-trainable params: 0
_________________________________________________________________


However, although this model can handle sequences of variable **timestep** length, during training all sequences must have the same **timestep** length.

To handle this, we use the `pad_sequences` function from `keras` to zero-pad the shorter sequences; with zeros, the padding does not affect the sum.

Let's first create a list `X` of sequences of different **timestep** lengths (3 or 2).

In [18]:
X = []
for x in np.arange(0,300,3):
    X.append([x,x+1,x+2])
for x in np.arange(300,500,2):
    X.append([x,x+1])
print("Number of sequences by timestep length")
collections.Counter([len(x) for x in X])

Number of sequences by timestep length


Counter({3: 100, 2: 100})

Let's now pad these sequences with zero values. <br>

In [19]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
X = pad_sequences(X, value=0.0, padding = 'pre')
print("X shape : (%d,%d)" %X.shape)
print("2 first sequences")
print(X[:2])
print("3 last sequences")
X[-3:]

X shape : (200,3)
2 first sequences
[[0 1 2]
 [3 4 5]]
3 last sequences


array([[  0, 494, 495],
       [  0, 496, 497],
       [  0, 498, 499]], dtype=int32)

Some remarks about padding:

* A good practice is to pad on the **left** of the sequences.
* This may not be intuitive, but the reason is that nothing is learned at the beginning of the sequence while all values are zeros; the real learning starts when the first non-zero values appear.
* If the sequences are padded on the right, the information learned at the beginning of the sequences could be lost while passing through the zeros at the end.
* The padding value depends on the objective. Here the sequences are padded with zeros so that the value of the sum is unchanged.
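As a minimal illustration of left padding, the sketch below pads plain Python lists by hand. The helper `pad_left` is hypothetical: it only mimics what `pad_sequences(..., padding='pre')` does on this toy case and is not the `keras` implementation.

```python
def pad_left(sequences, value=0):
    # Pad every sequence on the left up to the length of the longest one.
    max_len = max(len(s) for s in sequences)
    return [[value] * (max_len - len(s)) + list(s) for s in sequences]


toy_sequences = [[0, 1, 2], [300, 301]]
toy_padded = pad_left(toy_sequences)
print(toy_padded)  # [[0, 1, 2], [0, 300, 301]]

# Zero padding leaves the target (the sum) unchanged.
print([sum(s) for s in toy_sequences] == [sum(p) for p in toy_padded])  # True
```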

Let's now train the model!

In [20]:
X = X.reshape(200,3,1)
Y = X.sum(1)

epochs = 500
batch_size=32
model.compile(loss="mse", optimizer="adam")
model.fit(X, Y, epochs=epochs, batch_size=batch_size, verbose=0)

<tensorflow.python.keras.callbacks.History at 0x7fe6fb2975e0>

We can now predict the sum of sequences of different lengths (even longer sequences than those seen during training, although the results are not guaranteed!).

In [21]:
x_test=np.array([3,4]).reshape(1,2,1)
print(model.predict(x_test))
x_test=np.array([3,4,5]).reshape(1,3,1)
print(model.predict(x_test))
x_test=np.array([3,4,5,6]).reshape(1,4,1)
print(model.predict(x_test))

[[7.345095]]
[[12.432947]]
[[11.583219]]


## Many to Many

A **many-to-many** recurrent neural network takes a sequence as input and returns a sequence as output:


<img src="https://github.com/wikistat/AI-Frameworks/blob/master/Text/images/many_to_many.png?raw=true" alt="drawing" width="400"/>

#### Toy Example

Let's take an example where the *input* is a sequence of 3 numbers and the *output* is the sequence of cumulative sums, i.e.

* input = [x1, x2, x3]
* output = [x1, x1+x2, x1+x2+x3]

**Exercise**: What are the dimensions of the input `X` and output `Y` arrays?
Fill in the cell below with the correct dimensions.

In [22]:
X = np.arange(??).reshape(100,??,??)
print("Dimensions of input sequences: %d, Timestep: %d, Number of features: %d" %X.shape)
Y = X.cumsum(1).reshape(100,??,??)
print("Dimensions of output sequences: %d, Timestep: %d, Number of features: %d" %Y.shape)
print("Input Example")
print(X[0])
print("Output Example")
print(Y[0])

SyntaxError: invalid syntax (<ipython-input-22-88b5f57c42a2>, line 1)

In [24]:
# %load solution/many_to_many_toy_example.py
X = np.arange(300).reshape(100,3,1)
print("Dimensions of input sequences: %d, Timestep: %d, Number of features: %d" %X.shape)
Y = X.cumsum(1).reshape(100,3,1)
print("Dimensions of output sequences: %d, Timestep: %d, Number of features: %d" %Y.shape)
print("Input Example")
print(X[0])
print("Output Example")
print(Y[0])


Dimensions of input sequences: 100, Timestep: 3, Number of features: 1
Dimensions of output sequences: 100, Timestep: 3, Number of features: 1
Input Example
[[0]
 [1]
 [2]]
Output Example
[[0]
 [1]
 [3]]


#### Model

The following lines define a very simple model with one **many-to-many** *RNN* layer with:

* 10 neurons (*units=10*)
* a *relu* activation

This model takes sequences of size (3,1) as input and returns a sequence with the same number of timesteps.<br>
This is specified by the `return_sequences` argument, which is set to True here (it is False by default).
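To make the effect of `return_sequences` concrete, here is a hand-written NumPy sketch of the forward pass of a simple recurrent layer with a *relu* activation. It assumes the usual recurrence $h_t = \mathrm{relu}(x_t W + h_{t-1} U + b)$ with random, untrained weights; the function `simple_rnn_forward` is illustrative only and is not the `keras` code.

```python
import numpy as np


def simple_rnn_forward(inputs, W, U, b, return_sequences=False):
    """inputs has shape (batch, timestep, features)."""
    batch, timesteps, _ = inputs.shape
    h = np.zeros((batch, U.shape[0]))          # initial hidden state
    states = []
    for t in range(timesteps):
        # relu(x_t W + h_{t-1} U + b)
        h = np.maximum(0.0, inputs[:, t, :] @ W + h @ U + b)
        states.append(h)
    if return_sequences:
        return np.stack(states, axis=1)        # one hidden state per timestep
    return h                                   # only the last hidden state


rng = np.random.default_rng(0)
W_demo = rng.normal(size=(1, 10))              # input kernel
U_demo = rng.normal(size=(10, 10))             # recurrent kernel
b_demo = np.zeros(10)                          # bias
x_demo = np.array([1.0, 2.0, 3.0]).reshape(1, 3, 1)

last_state = simple_rnn_forward(x_demo, W_demo, U_demo, b_demo)
all_states = simple_rnn_forward(x_demo, W_demo, U_demo, b_demo, return_sequences=True)
print(last_state.shape, all_states.shape)      # (1, 10) (1, 3, 10)
```

The many-to-one output is simply the last element of the many-to-many output.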

In [25]:
model = km.Sequential()
model.add(kl.SimpleRNN(units=10 ,activation="relu", input_shape=(3, 1), return_sequences=True))
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn_4 (SimpleRNN)     (None, 3, 10)             120       
Total params: 120
Trainable params: 120
Non-trainable params: 0
_________________________________________________________________


**Q**: Does the shape of the output seem normal to you? What do the three dimensions represent? <br>
**Exercise**: Send a sequence through the model and check the output.

In [27]:
# %load solution/simple_rnn_output_bis.py
x_test = np.array([1,2,3]).reshape(1,3,1)
model.predict(x_test).shape

(1, 3, 10)

Let's now define the complete model. <br>
For each input sequence, the output of the RNN layer is a matrix of size 3 (number of **timesteps**) by 10 (**features**). <br>
The desired output is a sequence of size 3 by 1.

In order to obtain the correct dimensions, let's add a Dense layer at **each timestep**. <br>
This can be done with the `TimeDistributed` layer of `keras`.
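The idea of applying the same dense layer at each timestep can be sketched with NumPy. The weights below are random placeholders (hypothetical `kernel_demo` and `bias_demo`), and the sketch only illustrates the computation; it is not how `keras` implements `TimeDistributed`.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_demo = rng.normal(size=(2, 3, 10))   # (batch, timestep, features) out of the RNN layer
kernel_demo = rng.normal(size=(10, 1))      # dense kernel shared by all timesteps
bias_demo = np.zeros(1)

# Apply the same dense transformation separately at each timestep...
per_step = np.stack(
    [hidden_demo[:, t, :] @ kernel_demo + bias_demo for t in range(hidden_demo.shape[1])],
    axis=1,
)
# ...which is equivalent to a single matrix product over the last axis.
all_at_once = hidden_demo @ kernel_demo + bias_demo

print(per_step.shape)                       # (2, 3, 1)
print(np.allclose(per_step, all_at_once))   # True
```

Because the weights are shared across timesteps, the layer still has only 11 parameters, whatever the length of the sequence.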

In [28]:
model = km.Sequential()
model.add(kl.SimpleRNN(units=10 ,activation="relu", input_shape=(3, 1), return_sequences=True))
model.add(kl.TimeDistributed(kl.Dense(1)))
model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn_5 (SimpleRNN)     (None, 3, 10)             120       
_________________________________________________________________
time_distributed_1 (TimeDist (None, 3, 1)              11        
Total params: 131
Trainable params: 131
Non-trainable params: 0
_________________________________________________________________


We now train the model with an *adam* optimizer and *mse* as the loss function.

In [29]:
epochs = 500
batch_size=32
model.compile(loss="mse", optimizer="adam")
model.fit(X, Y, epochs=epochs, batch_size=batch_size, verbose=0)

<tensorflow.python.keras.callbacks.History at 0x7fe72867d8e0>

Let's check that the model can now correctly perform the cumulative sum:

In [30]:
x_test=np.array([10,11,12]).reshape(1,3,1)
print(model.predict(x_test))
x_test=np.array([10,25,12]).reshape(1,3,1)
print(model.predict(x_test))

[[[10.113749]
  [22.555725]
  [34.44047 ]]]
[[[10.113749]
  [35.50011 ]
  [49.910408]]]


Note that, as seen previously, it would have been possible to set the *timestep* parameter to *None* so that the model can compute the cumulative sum of sequences of any length (using padding).  
This could be a good **exercise** if you want to practice.

## One to Many

A **one-to-many** recurrent neural network takes a scalar as input and returns a sequence as output.

There are different ways to define a **one-to-many** neural network.

* In the example below, the **one-to-many** network can be seen as a **many-to-many** neural network where the input sequence is built iteratively (the input of the second timestep is the output of the first timestep).

<img src="https://github.com/wikistat/AI-Frameworks/blob/master/Text/images/one_to_many.png?raw=true" alt="drawing" width="400"/>

* It would also be possible to pass an input only at the first timestep. The **one-to-many** network can then be seen as a **many-to-many** neural network where the input sequence is composed of one scalar followed by `None` entries to fill the sequence.

Hence, **one-to-many** networks can be seen as a particular case of **many-to-many** neural networks.

#### Toy Example

Let's take an example where the *input* is a scalar and the *output* is a sequence of 3 numbers such that

* input = x
* output = [x+2, x+4, x+6]

Hence the dimensions of the output matrix *Y* will be of size:

* N: Number of sequences = 100 (arbitrary values), 
* Timestep: (Size of sequences) = 3, 
* Number of features (How many features for each element of the sequences) = 1 

#### Model

At **training** time, the keras model will be built as a **many-to-many** model. <br>
Indeed, since you know the output sequence you expect to get, you also know the input sequence the model must learn from. In the example above:

* input = [x,x+2,x+4].
* output = [x+2,x+4,x+6]

**Exercise**: Build the toy dataset and the model that will learn how to predict the output sequences from the input sequences.<br>
**nb** Remember that at prediction time, the model should be able to take a scalar as input (i.e. a sequence of one timestep).


In [31]:
# %load solution/one_to_many_dataset.py
X = []
Y = []
for i in range(100):
    X.append([i,i+2,i+4])
    Y.append([i+2,i+4,i+6])
X = np.array(X).reshape(100,3,1)
Y = np.array(Y).reshape(100,3,1)

In [32]:
# %load solution/one_to_many_model.py
model = km.Sequential()
model.add(kl.SimpleRNN(units=10 ,activation="relu", input_shape=(None, 1), return_sequences=True))
model.add(kl.TimeDistributed(kl.Dense(1)))

Let us now train this model!

In [33]:
epochs = 1000
batch_size=32
model.compile(loss="mse", optimizer="adam")
model.fit(X, Y, epochs=epochs, batch_size=batch_size, verbose=0)

<tensorflow.python.keras.callbacks.History at 0x7fe6fb533bb0>

**Exercise** Once your model is built, write a function that builds the 3-number output sequence from a scalar input using the model.

In [34]:
x_test=np.array(x).reshape(1,1,1)
y1 = model.predict(x_test)
y1

array([[[501.014]]], dtype=float32)

In [56]:
y1 = model.predict(x_test)
y2 = model.predict(np.hstack((x_test,y1)))
y3 = model.predict(np.hstack((x_test,y2)))

In [55]:
# %load solution/one_to_many_prediction.py
def predict_function(x):
    x_test=np.array(x).reshape(1,1,1)
    y1 = model.predict(x_test)
    y2 = model.predict(np.hstack((x_test,y1)))
    y3 = model.predict(np.hstack((x_test,y2)))
    return y3

x=10
y=predict_function(x)
print("Input scalar : %d. Output sequences: [%.3f, %.3f, %.3f]" %(x, y[0][0][0],y[0][1][0],y[0][2][0]) )

Input scalar : 10. Output sequences: [11.864, 13.838, 15.574]
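The three chained calls above generalise to a loop over any number of steps. The sketch below is one possible way to write it; `rollout` and `perfect_model` are hypothetical names, and `perfect_model` is a hand-written stand-in for `model.predict` (it returns what a perfectly trained network would output, i.e. each input plus 2), so the example runs without a trained network.

```python
def rollout(predict_fn, x0, n_steps):
    # predict_fn takes the current input sequence and returns one output per timestep.
    sequence = [x0]
    outputs = []
    for _ in range(n_steps):
        outputs = list(predict_fn(sequence))
        # The next input sequence is the initial scalar followed by the outputs so far.
        sequence = [x0] + outputs
    return outputs


def perfect_model(sequence):
    # Stand-in for the trained network: output_t = input_t + 2.
    return [value + 2 for value in sequence]


print(rollout(perfect_model, 10, 3))  # [12, 14, 16]
```

With the real model, each call re-processes the whole sequence built so far, exactly as in `predict_function`.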


All the examples detailed so far use sequences with a *single* feature in order to keep this tutorial simple.

All of these examples can easily be transposed to sequences with *several* features. Let's check that with an example on the **Cdiscount** dataset!

## RNN layers

Once you know how to manipulate the structures defined above, it is really easy to build more complex or deeper RNN models with keras.

* `GRU` and `LSTM` can be used in exactly the same way as `SimpleRNN` in the examples above.
* Bi-directional layers can be built by wrapping an `RNN` layer in the `Bidirectional` layer.
* Deep RNNs can be built by stacking `RNN` layers, as in any other sequential model.

***Example***:
Here is how to build a model with one *LSTM* layer followed by a bidirectional *GRU* layer.

In [None]:
model = km.Sequential()
model.add(kl.LSTM(units=10 ,activation="relu", input_shape=(3, 1), return_sequences=True))
model.add(kl.Bidirectional(kl.GRU(units=10 ,activation="relu", return_sequences=True)))
model.add(kl.TimeDistributed(kl.Dense(1)))
model.summary()

## Recap

You now know how to build all combinations of **one/many-to-one/many** RNN models using the following `Keras` tools:
* The `return_sequences` parameter of RNN layers: return a single output or a sequence of outputs.
* The `input_shape` parameter: with `None` as the timestep if we want the model to handle sequences of various lengths.
* The `TimeDistributed` layer: to apply a layer such as `Dense` to each element of a sequence.
* `GRU` and `LSTM` as alternatives to the `SimpleRNN` layer.
* The `Bidirectional` layer.

Let us use this knowledge to handle a real use case!

# Text Generation 

In this part, the objective is to generate the text description of a product.

## Dataset

The Level 3 category `COQUE - BUMPER - FACADE TELEPHONE` is the most represented category in the original **Cdiscount** dataset, with 2,184,671 descriptions. Among them, 1,761,637 are composed of 197 characters.

We will now use these lines (or a sub-sample of them, depending on the computing power of your machine) to train a text generation model that automatically generates a new text description for this type of product.

These lines are contained in the `data/description_coque.npy` file (you have to unzip it).

The data used in this section are real data. However, many of these descriptions are very similar. This helps the network generate good descriptions, but they are not very different from those of the training set. <br>
On the other hand, it makes it possible to obtain results in a reasonable amount of time.

In [65]:
N = 100000
X = np.load("data/description_coque.npy")[:N]
print("Number of line :%d" %X.shape[0])
print("\nThree lines example:")
print(X[:3])
print("\nSize of all the sequences : %s" %(str(set([len(x) for x in X]))))

Number of line :100000

Three lines example:
["Pour apple iphone 4 : coque bumper silicone blanc - Cet étui en silicone rigide protège et habille votre APPLE iPhone 4. Parfaitement adapté, il permet l'accès à toutes les fo… Voir la présentation"
 "Pour htc one x : coque noire rigide - Cette coque protège et habille avec sobriété votre HTC ONE X. Parfaitement adaptée, elle permet l'accès à toutes les fonctionnalités de v… Voir la présentation"
 "Pour htc one x : coque blanche rigide - Cette coque protège et habille avec sobriété votre HTC ONE X. Parfaitement adaptée, elle permet l'accès à toutes les fonctionnalités de… Voir la présentation"]

Size of all the sequences : {197}


### Data Processing

Text generation requires building a **one-to-many** model:

<img src="https://github.com/wikistat/AI-Frameworks/blob/master/Text/images/one_to_many.png?raw=true" alt="drawing" width="400"/>

where the prediction $y_t$ is used as the input at time $t+1$, i.e. $y_t=x_{t+1}$.

This model will be trained as a **many-to-many** model:

<img src="https://github.com/wikistat/AI-Frameworks/blob/master/Text/images/many_to_many.png?raw=true" alt="drawing" width="400"/>

where the output $y$ is the same sequence as the input $x$, shifted by one timestep.

When working with text, the elements of a sequence can be either words or characters. Since all descriptions have the same number of characters, we will use characters as the elements of the sequences. <br>
These characters will be *one-hot encoded*. <br>

Hence each description $x$ will be represented as a matrix of size $N_s \times N_v$ where

* $N_s=197$ is the length of the sequences (timestep),
* $N_v$ is the size of the vocabulary (the list of characters).

Let us fix $N_s$

In [66]:
Ns=197

### Characters' list

Let's first create a list of all unique characters present in the descriptions.

In [67]:
chars_set = list(functools.reduce(lambda x,y : x.union(y), [set(x) for x in X], set()))
print("Characters list of size %d : %s"  %(len(chars_set), str(chars_set)))

Characters list of size 87 : ['V', '&', ' ', 'R', '3', '+', 'Q', '.', 'ç', 'k', 'Z', '*', 'X', 'N', 'U', 'K', 'D', 'G', 'c', '-', 't', 'u', 'e', 's', ')', 'è', 'f', '8', '\xa0', 'L', 'm', '/', 'q', '4', 'C', 'g', '9', 'z', 'd', 'y', 'M', '"', '7', 'p', 'w', '5', 'E', 'b', ':', '0', 'o', 'à', 'x', '…', 'S', 'W', 'T', 'H', 'ê', 'l', 'B', 'â', 'Y', '2', 'j', '(', 'h', '%', 'r', 'é', ',', '!', 'n', 'P', 'v', 'I', 'J', 'i', '1', 'a', 'ô', '6', 'F', 'A', "'", 'O', '?']


We will add two elements to this list to mark the *start* and the *end* of a sequence.

In [68]:
chars_set.extend(["START","END"])
Nv = len(chars_set)
print("Total size of the vocabulary : %d" %Nv)

Total size of the vocabulary : 89


### Sequence encoding

To our knowledge, there is no library function that *one-hot encodes* a string at the character level.

The following lines implement it.

* First, the `char_to_int` and `int_to_char` dictionaries are created; they map each character to its position in the vocabulary and back.

In [69]:
int_to_char = {i:c for i,c in enumerate(chars_set)}
char_to_int = {c:i for i,c in int_to_char.items()}
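As a small illustration of character-level one-hot encoding, the sketch below encodes a short string with a three-character toy vocabulary. The names `toy_chars`, `toy_char_to_int`, `toy_int_to_char` and `one_hot` are chosen for this example only and deliberately differ from the notebook's own variables.

```python
import numpy as np

toy_chars = ["a", "b", "c"]                                # toy vocabulary, not chars_set
toy_int_to_char = {i: c for i, c in enumerate(toy_chars)}
toy_char_to_int = {c: i for i, c in toy_int_to_char.items()}


def one_hot(text, mapping):
    # One row per character, one column per vocabulary entry.
    encoded = np.zeros((len(text), len(mapping)))
    encoded[np.arange(len(text)), [mapping[c] for c in text]] = 1
    return encoded


toy_encoded = one_hot("cab", toy_char_to_int)
print(toy_encoded)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]

# Taking the argmax of each row recovers the original string.
toy_decoded = "".join(toy_int_to_char[i] for i in toy_encoded.argmax(axis=1))
print(toy_decoded)  # cab
```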

The following function encodes

* a matrix $X$ of size $N \times N_s$ composed of *N* text descriptions of length $N_s$

into

* a matrix $X_{vec} \in \mathbb{R}^{N \times (N_s+1) \times N_v}$ composed of *N* sequences of size $(N_s+1)\times N_v$ (the encoded text descriptions)
and
* a matrix $Y_{vec} \in \mathbb{R}^{N \times (N_s+1) \times N_v}$ composed of *N* sequences of size $(N_s+1)\times N_v$ (the encoded text descriptions with an offset of one).

Note that the sequences in $X_{vec}$ and $Y_{vec}$ have length $N_s+1$ because we add one element to each of them:

* Input sequences have a *START* element at their beginning.
* Output sequences have an *END* element at their end.
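Before one-hot encoding, the relation between input and output sequences can be pictured on plain lists of symbols. This is only a sketch with a hypothetical helper `shifted_pair`; it does not build the encoded matrices.

```python
def shifted_pair(text):
    chars = list(text)
    input_sequence = ["START"] + chars    # what the network reads
    output_sequence = chars + ["END"]     # what the network must predict
    return input_sequence, output_sequence


input_sequence, output_sequence = shifted_pair("coque")
for x_t, y_t in zip(input_sequence, output_sequence):
    print("%-5s -> %s" % (x_t, y_t))

# The output is the input shifted by one timestep.
print(input_sequence[1:] == output_sequence[:-1])  # True
```

At each timestep the target is simply the next character, and the last target is the *END* element.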


**Exercise** Write the function that applies these transformations. Read the test cell below before you start.

In [70]:
def encode_input_output_sequence(x_descriptions, length_sequence, size_vocab, char_to_int_dic):
    return x_vec, y_vec


**test cell**

To help you write this function, here are some tests that check that your function has the expected behavior.

If the function takes a single sentence of length 7 (*bonjour*) as input:

* Both X (input of the RNN) and Y (output of the RNN) should be tensors of shape (1 × 8 × $N_v$), with $N_v = 89$ here.
* For X, by taking the argmax of each element of the sequence, we should be able to retrieve the original sentence (with the START element at the beginning).
* For Y, by taking the argmax of each element of the sequence, we should be able to retrieve the original sentence (with the END element at the end).

**Warning** You do not have to modify this cell! All tests should pass if your function is correctly implemented. <br>
If a test fails, it will raise an `AssertionError`.


In [71]:
X_test,Y_test = encode_input_output_sequence(np.array(["bonjour"]), 7, Nv, char_to_int)
assert X_test.shape == (1, 8, Nv)
assert Y_test.shape == (1, 8, Nv)

assert [int_to_char[x] for x in np.argmax(X_test[0],axis=1)] == ['START', 'b', 'o', 'n', 'j', 'o', 'u', 'r']
assert [int_to_char[x] for x in np.argmax(Y_test[0],axis=1)] == ['b', 'o', 'n', 'j', 'o', 'u', 'r', 'END']

NameError: name 'x_vec' is not defined

In [74]:
# %load solution/encode_input_output_sequence.py
def encode_input_output_sequence(x_descriptions, length_sequence, size_vocab, char_to_int_dic):
    # Get the number of descriptions in x.
    n = x_descriptions.shape[0]

    # Allocate the encoded matrices, filled with zeros.
    # Each encoded sequence has length_sequence + 1 elements (room for START or END).
    x_vec = np.zeros((n, length_sequence + 1, size_vocab))
    y_vec = np.zeros((n, length_sequence + 1, size_vocab))

    # Let's now fill the matrices with ones at the position of each character.

    # First, put the START element at the beginning of each input sequence
    x_vec[:, 0, char_to_int_dic["START"]] = 1
    # and the END element at the end of each output sequence.
    y_vec[:, -1, char_to_int_dic["END"]] = 1
    # Now let's iterate over all x_descriptions
    for ix, x in tqdm(enumerate(x_descriptions)):
        # and over each character of the description.
        for ic, c in enumerate(x):
            # For each character `c`, set a one at its position in the vocabulary.
            c_int = char_to_int_dic[c]
            x_vec[ix, ic + 1, c_int] = 1
    # y_vec is the same as x_vec, shifted by one timestep.
    y_vec[:, :-1, :] = x_vec[:, 1:, :]
    return x_vec, y_vec

Let's apply it on the first N lines of the dataset

In [75]:
X_vec, Y_vec = encode_input_output_sequence(X[:N], Ns, Nv, char_to_int)
X_vec.shape

31602it [00:04, 7127.77it/s]


KeyboardInterrupt: 

## Training


**Exercise**: Define a simple model (only one LSTM layer with 32 hidden units) to train for text generation.
*Tips*:
* Remember that this model will be used for generation.
* What are the dimensions of the output? What should the activation be? The loss function?


In [72]:
# %load solution/train_model_text_generation.py

Now that you have written the model, you can observe that it can take a while to reach convergence when training this kind of model. <br>
If you do not have a GPU while running this lab, you do not have to wait for the end of the training: let's load a model trained with the solution above! <br>
Note that you also have to load the corresponding `int_to_char` and `char_to_int` dictionaries
and rebuild the dataset according to these dictionaries.

In [76]:
from tensorflow.keras.models import load_model
model = load_model("data/generate_model.h5")
int_to_char = pickle.load(open("data/int_to_char.pkl","rb"))
char_to_int = pickle.load(open("data/char_to_int.pkl","rb"))
X_vec, Y_vec = encode_input_output_sequence(X[:N], Ns, Nv, char_to_int)
X_vec.shape

100000it [00:13, 7303.79it/s]


(100000, 198, 89)

## Text Generation

The function below decodes an encoded sentence.

In [77]:
i_test = 50
print("\nOriginal Sentences:\n%s"%X[i_test])
def decode_sequence(x, int_to_char_dic):
    seq = []
    for i in np.where(x)[1]:
        seq.append(int_to_char_dic[i])
    return "".join(seq)
print("\nDecoded input vector::\n%s"%decode_sequence(X_vec[i_test], int_to_char))
print("\nDecoded output vector::\n%s"%decode_sequence(Y_vec[i_test], int_to_char))


Original Sentences:
Pour htc one mini : coque sur mesure decor pluie d'etoiles - Cette coque fantaisie protège et habille votre HTC ONE Mini. Parfaitement adaptée, elle permet l'accès à toutes le… Voir la présentation

Decoded input vector::
STARTPour htc one mini : coque sur mesure decor pluie d'etoiles - Cette coque fantaisie protège et habille votre HTC ONE Mini. Parfaitement adaptée, elle permet l'accès à toutes le… Voir la présentation

Decoded output vector::
Pour htc one mini : coque sur mesure decor pluie d'etoiles - Cette coque fantaisie protège et habille votre HTC ONE Mini. Parfaitement adaptée, elle permet l'accès à toutes le… Voir la présentationEND


The code below predicts a sentence iteratively: at each step, the model receives the characters generated so far and predicts the next one.

In [78]:
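from tensorflow import convert_to_tensor
# Buffer holding the one-hot encoded generated sequence
x_pred = np.zeros((1, Ns+1, Nv))
print("step 0")
# The sequence always begins with the START token
x_pred[0,0,char_to_int["START"]] =1
x_pred_str = decode_sequence(x_pred[0], int_to_char)
print(x_pred_str)

for i in range(Ns):
    # Feed the characters generated so far to the model
    x_tensor = convert_to_tensor(x_pred[:,:i+1,:])
    # Keep the most probable character predicted at the last time step
    ix = np.argmax(model.predict(x_tensor)[0][-1,:])
    x_pred[0,i+1,ix] = 1
x_pred_str=decode_sequence(x_pred[0], int_to_char)
print(x_pred_str)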

step 0
START
STARTSamsung Galaxy S5 mini Premiumcoque case matt white - Stars - **Fais une déclaration visuelle!** Notre Hard Case unis un design affiné avec une parfaite protection, sans cacher la belle silho… Voir


In [80]:
x_pred.shape

(1, 198, 89)

**Q** How is this prediction done?

**Exercise** Generate a text starting from a random first letter.

In [None]:
# %load solution/text_generation_random_first_letter.py

**Exercise** Generate a text with some randomness. For example, draw each character from a multinomial distribution parameterized by the model output at that step.

In [None]:
# %load solution/text_generation_multinomial.py
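
As a hint for the sampling step only, here is a minimal sketch of drawing one index from a probability vector with NumPy. The helper `sample_index` and the vector `toy_probs` are hypothetical; in the generation loop, the probabilities would come from the last time step of the model output.

```python
import numpy as np

def sample_index(probabilities, rng):
    """Draw one index at random according to a probability vector."""
    p = np.asarray(probabilities, dtype=np.float64)
    # Renormalise: float32 softmax outputs may not sum to exactly 1
    p = p / p.sum()
    # A single multinomial trial returns a one-hot count vector;
    # its argmax is the index that was drawn
    return int(np.argmax(rng.multinomial(1, p)))

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility
toy_probs = [0.1, 0.7, 0.2]      # stand-in for the model output at one step
draws = [sample_index(toy_probs, rng) for _ in range(1000)]
# Empirical frequency of each index over the 1000 draws
print(np.bincount(draws, minlength=3) / 1000)
```

Replacing `np.argmax` by such a draw in the generation loop makes each run produce a different text, with likely characters still chosen most often.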

# Text classification

In this part, the objective is to classify the descriptions of `Cdiscount` products, as in the first two notebooks of this **Text** session.

We will use a **many-to-one** RNN architecture for both training and prediction.

<img src="https://github.com/wikistat/AI-Frameworks/blob/master/Text/images/many_to_one.png?raw=true" alt="drawing" width="400"/>

## Load Data

Let us first load and clean the data with the function you already know.

In [None]:
ct = CleanText()
data = pd.read_csv("data/cdiscount_train.csv.zip",sep=",", nrows=100000)
ct.clean_df_column(data, "Description", "Description_cleaned")
print("The train dataset is composed of %d lines" %data.shape[0])
data.head(5)

In [None]:
data_test = pd.read_csv("data/cdiscount_test.csv.zip",sep=",")
ct.clean_df_column(data_test, "Description", "Description_cleaned")
print("The test dataset is composed of %d lines" %data_test.shape[0])
data_test.head(5)

Let's now split the data into train and validation datasets.

In [None]:
data_train, data_valid = sms.train_test_split(data, test_size=0.1, random_state=42)

Let's get the list of words for each line.

In [None]:
train_array_token = [line.split(" ") for line in data_train["Description_cleaned"].values]
valid_array_token = [line.split(" ") for line in data_valid["Description_cleaned"].values]
test_array_token = [line.split(" ") for line in data_test["Description_cleaned"].values]

Finally, let's build the arrays of integer labels (Keras does not handle string labels).

In [None]:
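# Sort the labels so that the label-to-integer mapping is identical from one run to the next
label_to_int = {k:i for i,k in enumerate(sorted(set(data_train.Categorie1.values)))}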
N_label = len(label_to_int)
Y_train = np.array([label_to_int[k] for k  in data_train.Categorie1.values])
Y_valid = np.array([label_to_int[k] for k in data_valid.Categorie1.values])
Y_test = np.array([label_to_int[k] for k in data_test.Categorie1.values])

## Encoding the data.

In this problem, sequences of words will be considered (and not sequences of characters). <br>
There are several ways to convert the sentences:

* Count Vectorizer or TF-IDF encoding. 
    * **sklearn** `CountVectorizer` or  `TF-IDF` produce sparse vectors that `Keras` models do not handle correctly.<br>    Using these encodings would therefore require building dense matrices, which would be very large given the size of the vocabulary.
* Word Embedding vectors.
    * With the **Word2Vec** embedding learned in the previous notebook, we can build a vector of size 300 for each word, which produces sequences of reasonable size. Moreover, *full_model_sg* was the model that gave the best results.


In [None]:
from gensim.models import KeyedVectors
model_sg_full = KeyedVectors.load("data/w2v_model/full_model_sg")

The plot below displays the distribution of the sequence lengths (i.e. the number of words) in the train dataset.<br>
The longest sequence has 43 words, so keeping every word would require encoded sequences of size 43.   <br>
The 99th percentile of this distribution is 28.
Hence we keep only the first 28 words of each description, and only a very small portion (1%) of the dataset will have truncated sentences.

In [None]:
all_sequences_length = [len(x) for x in train_array_token]
import matplotlib.pyplot as plt
import seaborn as sb
sb.set_style("whitegrid")
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(1,1,1)
ax.hist(all_sequences_length, 50, cumulative=True, density=True)

print("Max length sequences: %d" %max(all_sequences_length))
print("Percentiles at 90%%: %d, 95%%: %d and 99%%: %d" %tuple([x for x in np.percentile(all_sequences_length, q=[90,95,99])]))


In [None]:
Ns = 28
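
To check how many sequences a given cut-off truncates, one can compare the lengths to the chosen percentile. This is a minimal sketch on made-up lengths; `truncated_share`, `toy_lengths` and `toy_cutoff` are hypothetical names, not part of the notebook.

```python
import numpy as np

def truncated_share(lengths, max_len):
    """Fraction of sequences strictly longer than max_len."""
    return float(np.mean(np.asarray(lengths) > max_len))

# Made-up sequence lengths: 1, 2, ..., 100 words
toy_lengths = list(range(1, 101))
# Cut-off at the 99th percentile, rounded down to a whole number of words
toy_cutoff = int(np.percentile(toy_lengths, 99))
print(toy_cutoff, truncated_share(toy_lengths, toy_cutoff))
```

On the real data, the same call with `all_sequences_length` and `Ns` gives the share of descriptions that lose words.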

**Exercise** Build a function that encodes `array_token` into the tensor that will be used for learning.

In [None]:
def tokens_to_embedding_sequences(array_token, model):
    # Todo
    return X

In [None]:
# %load solution/token_to_embedding_sequence
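
If you are unsure about the expected shape of the result, the sketch below shows one possible layout on toy data. It is not the provided solution: a plain dictionary `toy_lookup` stands in for the Word2Vec model, the function name and its `max_len` and `dim` arguments are hypothetical, and padding at the end with zeros for unknown words is only one choice among several.

```python
import numpy as np

def tokens_to_embeddings_sketch(array_token, lookup, max_len, dim):
    """Stack word vectors into a (n_lines, max_len, dim) array, zero-padded."""
    X = np.zeros((len(array_token), max_len, dim))
    for i, tokens in enumerate(array_token):
        # Keep at most max_len words; shorter lines stay zero-padded at the end
        for j, word in enumerate(tokens[:max_len]):
            # Assumption: words without a vector are left as zeros
            if word in lookup:
                X[i, j] = lookup[word]
    return X

# Hypothetical 2-dimensional embeddings for two words
toy_lookup = {"coque": np.array([1.0, 0.0]), "htc": np.array([0.0, 1.0])}
X_toy = tokens_to_embeddings_sketch(
    [["coque", "htc", "mini"], ["htc"]], toy_lookup, max_len=2, dim=2
)
print(X_toy.shape)
```

Here the first line is truncated to its first two words and the second line is padded with one zero vector, so every line ends up with the same length.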

In [None]:
X_train = tokens_to_embedding_sequences(train_array_token, model_sg_full)
X_valid = tokens_to_embedding_sequences(valid_array_token, model_sg_full)
X_test = tokens_to_embedding_sequences(test_array_token, model_sg_full)

In [None]:
assert X_train.shape == (90000,Ns,300) 
assert X_valid.shape == (10000,Ns,300) 
assert X_test.shape == (50000,Ns,300) 

**Exercise** Now build a model that learns to predict the category from these sequences, and train it! <br>
The model given as a solution is very simple and has not been tuned to produce good results. It's just a working example.

In [81]:
# %load solution/rnn_classifier_model.py
model = km.Sequential()
model.add(kl.LSTM(units=256 ,activation="relu", input_shape=(28, 300)))
model.add(kl.Dense(256))
model.add(kl.Activation("relu"))
model.add(kl.Dense(N_label))
model.add(kl.Activation("softmax"))
model.summary()

epochs = 500
batch_size=256
# compile() returns None; the training history is returned by fit()
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size, verbose=1, validation_data=(X_valid, Y_valid))
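
Once the model is trained, its softmax output gives one probability per class, which has to be mapped back to a label. Here is a minimal sketch on made-up values: the label names and the arrays `toy_probs` and `toy_y_true` are hypothetical stand-ins for `label_to_int`, the model predictions and `Y_valid`.

```python
import numpy as np

# Hypothetical labels and probabilities, not taken from the dataset
toy_label_to_int = {"AUTO": 0, "LIVRES": 1, "JEUX": 2}
toy_int_to_label = {i: k for k, i in toy_label_to_int.items()}
toy_probs = np.array([[0.1, 0.8, 0.1],
                      [0.6, 0.3, 0.1],
                      [0.2, 0.2, 0.6]])
toy_y_true = np.array([1, 0, 1])

# Keep the most probable class for each line
toy_y_pred = toy_probs.argmax(axis=1)
# Map the integer predictions back to their string labels
toy_pred_labels = [toy_int_to_label[int(i)] for i in toy_y_pred]
# Share of lines whose predicted class matches the true one
toy_accuracy = float(np.mean(toy_y_pred == toy_y_true))
print(toy_pred_labels, toy_accuracy)
```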




