# Deep Learning - Day 5 - Exercise 2

In the previous exercise, we used the word2vec algorithm to embed the input words: for each word, we found a vector representation that was then used as the input of an RNN.

However, this vector representation was not designed for a specific task. For that reason, we will use an embedding layer in our RNN here, so that each word is represented in an embedding space by a vector whose values are learned. For instance, if the word `dog` is represented by the vector $(w_1, w_2, \dots, w_n)$ in the embedding space, we will learn the weights $(w_k)_k$.
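
To make the idea concrete, here is a minimal sketch of an embedding as a lookup table, in plain Python. The toy vocabulary, the dimension `embedding_dim` and the helper `embed` are assumptions made for illustration only; in a real network the rows of the table would be updated by gradient descent during training.

```python
import random

random.seed(0)

# Hypothetical toy vocabulary: each word is mapped to a row index.
toy_word_to_id = {"<PAD>": 0, "dog": 1, "cat": 2}
embedding_dim = 4  # the n of the text above, chosen arbitrarily here

# The embedding is a table with one row of weights per word.
# The rows start random; training would adjust them.
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(len(toy_word_to_id))
]

def embed(word):
    """Return the current vector (w_1, ..., w_n) of a word."""
    return embedding_matrix[toy_word_to_id[word]]

dog_vector = embed("dog")
print(len(dog_vector))  # 4
```

The table has one row per word, which is why the vocabulary size drives the number of weights to learn.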

# The data

The data are the same as in the previous exercises. Each sequence is a sentence, in the form of a list of words, and the output is 0 for a negative review or 1 for a positive review.

In [None]:
from tensorflow.keras.datasets import imdb

def load_data():
    (sentences_train, y_train), (sentences_test, y_test) = imdb.load_data()
    word_to_id = imdb.get_word_index()

    word_to_id = {k:(v+3) for k,v in word_to_id.items()}
    for i, w in enumerate(['<PAD>', '<START>', '<UNK>', '<UNUSED>']):
        word_to_id[w] = i

    id_to_word = {v:k for k, v in word_to_id.items()}

    sentences_train = [[id_to_word[_] for _ in sentence] for sentence in sentences_train]
    sentences_test = [[id_to_word[_] for _ in sentence] for sentence in sentences_test]
    
    return sentences_train, y_train, sentences_test, y_test

sentences_train, y_train, sentences_test, y_test = load_data()

# Data cleaning

❓ **Question** ❓ Let's start with some data cleaning. You are welcome to do whatever you want here. The idea is that everything should fit in the following function (which can call other functions that you write).

In [None]:
# YOUR CODE HERE

❓ **Question** ❓ Create a dictionary `word_to_id` such that each word of `sentences_train` is in this dictionary. The value of each key (=word) should be unique, so that each word is represented by a unique integer, and vice versa!

In [None]:
# YOUR CODE HERE
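
One possible way to build such a mapping is sketched below on toy data. The helper name `build_word_to_id` and the choice to start at 1 (keeping 0 free for padding) are assumptions, not requirements of the exercise.

```python
def build_word_to_id(sentences):
    """Give each distinct word a unique integer, starting at 1 (0 is kept for padding)."""
    word_to_id = {}
    for sentence in sentences:
        for word in sentence:
            if word not in word_to_id:
                # The next free id is the current number of known words + 1.
                word_to_id[word] = len(word_to_id) + 1
    return word_to_id

# Toy sentences, only for illustration.
toy_sentences = [["a", "good", "movie"], ["a", "bad", "movie"]]
toy_word_to_id = build_word_to_id(toy_sentences)
print(toy_word_to_id)  # {'a': 1, 'good': 2, 'movie': 3, 'bad': 4}
```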

❓ **Question** ❓ The number of keys should be equal to the number of different words in the train sentences. Print it.

In [None]:
# YOUR CODE HERE

❓ **Question** ❓ Create a dictionary `id_to_word` that is the "reciprocal" form of `word_to_id`: the keys are the integers and each value is the corresponding word.

In [None]:
# YOUR CODE HERE
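
Inverting a dictionary with unique values can be done with a comprehension. The sketch below uses a small hand-written mapping (`toy_word_to_id`, an assumption for illustration) and checks that the round trip word → id → word gives back the original word.

```python
# Hypothetical toy mapping, standing in for the real `word_to_id`.
toy_word_to_id = {"a": 1, "good": 2, "movie": 3, "bad": 4}

# Swap keys and values; this is only safe because the ids are unique.
toy_id_to_word = {idx: word for word, idx in toy_word_to_id.items()}

# Round-trip check: every word maps to an id that maps back to the same word.
round_trip_ok = all(toy_id_to_word[toy_word_to_id[w]] == w for w in toy_word_to_id)
print(toy_id_to_word[3], round_trip_ok)  # movie True
```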

❓ **Question** ❓ Currently, your sentences look like `['this', 'is', 'a', 'very', 'very', 'good', 'movie']`. Using the `word_to_id` dictionary, replace them with their tokenized representation, e.g. `[3, 89, 2, 13, 13, 56, 32]` (made-up values). Store the results in `sentences_token_train` and `sentences_token_test`.

❗ **Remark** ❗ It is better to write a function to do that. And don't forget that there is no reason for all the words of the test sentences to be in this dictionary. When a word is missing, you can skip it.

In [None]:
# YOUR CODE HERE
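
A minimal sketch of such a function, on toy data, could look as follows. The name `tokenize` and the toy mapping are assumptions; unknown words are simply skipped, as the remark above suggests.

```python
def tokenize(sentences, word_to_id):
    """Replace each known word by its integer id; words not in the dictionary are skipped."""
    return [
        [word_to_id[word] for word in sentence if word in word_to_id]
        for sentence in sentences
    ]

# Hypothetical toy mapping and sentences; "great" is not in the dictionary.
toy_word_to_id = {"a": 1, "good": 2, "movie": 3}
toy_test_sentences = [["a", "good", "movie"], ["a", "great", "movie"]]

toy_tokens = tokenize(toy_test_sentences, toy_word_to_id)
print(toy_tokens)  # [[1, 2, 3], [1, 3]]
```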

❓ **Question** ❓ Pad your sentences and store the results in `X_train` and `X_test`.

In [None]:
# YOUR CODE HERE

# The model

❓ **Question** ❓ Write a model that has:
- a masking layer (not to take into account padded values)
- an embedding layer whose `input_dim` is the size of your vocabulary (should be an argument of the function), and whose `output_dim` is the embedding size that you can choose
- an LSTM or RNN layer
- a Dense layer
- an output layer

Compile it with the appropriate arguments

In [None]:
def init_model(vocab_size):
    # YOUR CODE HERE
    return model

❓ **Question** ❓ Initialize the model and look at the number of parameters. What do you think about it?

In [None]:
# YOUR CODE HERE
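
To get a feeling for where the parameters come from, the counts can be reproduced by hand. The sketch below uses the standard formulas for an embedding table, an LSTM layer with biases, and a dense layer; the sizes (`vocab_size`, `embedding_dim`, `lstm_units`, `dense_units`) are hypothetical values chosen for illustration, not those of your model.

```python
def embedding_params(vocab_size, embedding_dim):
    # One vector of size embedding_dim per word of the vocabulary.
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per (input, unit) pair plus one bias per unit.
    return input_dim * units + units

# Hypothetical sizes, for illustration only.
vocab_size, embedding_dim, lstm_units, dense_units = 80_000, 50, 20, 10

counts = {
    "embedding": embedding_params(vocab_size, embedding_dim),
    "lstm": lstm_params(embedding_dim, lstm_units),
    "dense": dense_params(lstm_units, dense_units),
    "output": dense_params(dense_units, 1),
}
total = sum(counts.values())
print(counts["embedding"], counts["lstm"], total)  # 4000000 5680 4005901
```

With these sizes, almost all the parameters sit in the embedding table, so the vocabulary size is the first thing to look at when the model is too large.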

❓ **Question** ❓ Fit your model on training data and evaluate it on the test set

In [None]:
# YOUR CODE HERE

You probably don't want to spend the rest of your weekend running this model. One of the reasons is that there are too many parameters, because you embed a lot of different words.

❓ **Question** ❓ To reduce the number of words (if you haven't done it already), go back to the data cleaning and, in the train sentences, keep only the words that occur more than 15 times (15 is an example, you can take more), and check how much your vocabulary size is reduced. Then, you can run your entire pipeline and see how many parameters there are in the model.

❗ **Remark** ❗ Be careful: in some reviews, every word falls below the chosen number of occurrences. When you skip these words, you may end up with sentences that do not contain any word. You have to handle that, as you cannot have an empty $X$. If that happens, one option is to replace the empty sentence with a vector of zeros.

In [None]:
# YOUR CODE HERE
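
The filtering step can be sketched with `collections.Counter`, here on toy data and with a toy threshold. The helper name `keep_frequent_words` is an assumption. Empty sentences are kept in place so that the sentences stay aligned with their labels; they can then be turned into vectors of zeros at padding time.

```python
from collections import Counter

def keep_frequent_words(sentences, min_count):
    """Drop every word that occurs min_count times or fewer in `sentences`."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [
        [word for word in sentence if counts[word] > min_count]
        for sentence in sentences
    ]

# Toy data: only "movie" occurs more than once.
toy_sentences = [["good", "movie"], ["bad", "movie"], ["strange"]]
filtered = keep_frequent_words(toy_sentences, min_count=1)
print(filtered)  # [['movie'], ['movie'], []]

vocab_before = len({w for s in toy_sentences for w in s})
vocab_after = len({w for s in filtered for w in s})
n_empty = sum(1 for s in filtered if not s)
print(vocab_before, vocab_after, n_empty)  # 4 1 1
```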

# The model - more advanced.

Let's train some fancy network here.
Each of your words is represented by a vector of size N (the size of your embedding). Therefore, as a sentence is a sequence of words, it is represented by a matrix of shape (number of words, N). So, all your sentences are actually represented as matrices once embedded.

If you think about it, an image is also a matrix. Said differently, you may represent your sentence of words as a matrix, where each column (or row, depending on how you want to look at it) is a word, and each row (or column) corresponds to a coordinate of the embedding space.

Well, in that case, as these are close to images, why not use convolutions on them? Yes, convolutions!
But be careful. In the case of images, convolutions are 2-dimensional, as the filters can move up and down, and left and right. In the case of our sentences, we want the kernel to move _only_ in the word-by-word direction (moving along the coordinates of the embedding space doesn't make much sense).
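
The following plain-Python sketch shows what "moving only in the word-by-word direction" means for a single filter. It is a simplified illustration (no bias, no activation, stride 1, no padding), and the helper name `conv1d_valid` and the toy values are assumptions. The kernel spans the full embedding dimension and slides over the words only.

```python
def conv1d_valid(sentence_matrix, kernel):
    """Slide `kernel` along the word axis only and return one value per position."""
    kernel_size = len(kernel)
    outputs = []
    for start in range(len(sentence_matrix) - kernel_size + 1):
        # Window of `kernel_size` consecutive words, each with all N coordinates.
        window = sentence_matrix[start:start + kernel_size]
        total = 0.0
        for word_vector, kernel_row in zip(window, kernel):
            for x, k in zip(word_vector, kernel_row):
                total += x * k
        outputs.append(total)
    return outputs

# Toy embedded sentence: 4 words, embedding size N = 2.
toy_sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
# One kernel looking at 2 consecutive words (shape: kernel_size x N).
toy_kernel = [[1.0, 1.0], [1.0, 1.0]]

toy_output = conv1d_valid(toy_sentence, toy_kernel)
print(toy_output)  # [2.0, 3.0, 4.0]
```

A sentence of 4 words and a kernel of size 2 give 4 - 2 + 1 = 3 output positions, one per pair of consecutive words.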

So let's create a model that uses convolutions.

### First, the data

❓ **Question** ❓ In the case of convolutions, the input images all have the same size, which is not the case here. Therefore, we will use the `pad_sequences` function with an additional argument, `maxlen`. It pads any input to the given length `maxlen`: shorter sequences are padded with 0, while longer ones are cropped to `maxlen`.

In [None]:
# YOUR CODE HERE
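
To illustrate the behaviour described above without relying on any library, here is a plain-Python sketch. The helper `pad_and_crop` is hypothetical and pads and crops at the end of each sequence; this corresponds to `padding='post'` and `truncating='post'` in Keras' `pad_sequences`, whose defaults are `'pre'` for both.

```python
def pad_and_crop(sequences, maxlen, value=0):
    """Bring every sequence to exactly `maxlen` items: crop the end, or pad the end with `value`."""
    result = []
    for sequence in sequences:
        cropped = list(sequence)[:maxlen]
        result.append(cropped + [value] * (maxlen - len(cropped)))
    return result

# Toy tokenized sentences: one short, one long, one empty.
toy_tokens = [[3, 89, 2], [5, 1, 7, 7, 9, 4], []]
toy_X = pad_and_crop(toy_tokens, maxlen=4)
print(toy_X)  # [[3, 89, 2, 0], [5, 1, 7, 7], [0, 0, 0, 0]]
```

Note that an empty sentence simply becomes a row of zeros, so every row has the same length.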

### Using 1D Convolution.

❓ **Question** ❓ Define a model that has :
- a Masking layer
- an Embedding layer: `input_dim` is the `vocab_size`, `output_dim` is the embedding space dimension and `input_length` is the max length of your observations (that you just defined in the previous question)
- a Conv1D layer
- a Flatten layer
- a dense layer
- an output layer

Compile the model accordingly

❗ **Remark** ❗ The size of the Conv1D kernel corresponds exactly to the number of words each kernel is taking into account ;)

In [None]:
def init_cnn_model(vocab_size):
    # YOUR CODE HERE
    return model

❓ **Question** ❓ Look at the number of parameters and compare it to the LSTM model

In [None]:
# YOUR CODE HERE
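
For the comparison, the convolutional part can also be counted by hand. The sketch below assumes a Conv1D layer with stride 1 and no padding, followed by Flatten and a dense layer; all sizes (`embedding_dim`, `filters`, `kernel_size`, `maxlen`, `dense_units`) are hypothetical values chosen for illustration.

```python
def conv1d_params(embedding_dim, filters, kernel_size):
    # Each filter has kernel_size x embedding_dim weights plus one bias.
    return kernel_size * embedding_dim * filters + filters

def conv1d_output_length(maxlen, kernel_size):
    # Number of kernel positions with stride 1 and no padding.
    return maxlen - kernel_size + 1

def dense_params(input_dim, units):
    return input_dim * units + units

# Hypothetical sizes, for illustration only.
embedding_dim, filters, kernel_size, maxlen, dense_units = 50, 16, 3, 150, 10

n_conv = conv1d_params(embedding_dim, filters, kernel_size)
flat_size = conv1d_output_length(maxlen, kernel_size) * filters
n_dense = dense_params(flat_size, dense_units)
print(n_conv, flat_size, n_dense)  # 2416 2368 23690
```

With these sizes the convolution itself is small; the dense layer after Flatten is larger because its input grows with `maxlen`. The embedding table is counted the same way as in the recurrent model.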

❓ **Question** ❓ Fit your model with a stopping criterion, and evaluate it on the test data.

In [None]:
# YOUR CODE HERE