# Deep Learning - Day 5 - Keras embedding

### Exercise objectives:
- Learn embeddings with Keras

<hr>

In the previous exercise, we used the word2vec algorithm to embed the input words: for each word, we computed a vector representation that was then used as input to an RNN.

However, this vector representation was not designed for any specific task. For that reason, we will use an embedding layer in our RNN here, so that each word is represented in an embedding space whose values are learnt specifically for our task. For instance, if the word `dog` is represented by the vector $(w_1, w_2, ..., w_n)$ in the embedding space, we will learn the weights $(w_k)_k$.
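To make the idea concrete, here is a minimal sketch of an embedding seen as a lookup table: one row of weights per word id, and a sentence becomes the list of rows for its ids. The sizes and names (`vocab_size`, `embedding_dim`, `embed`) are hypothetical and only for illustration. In a real model the rows are trained by backpropagation instead of staying at their random initial values.

```python
import random

random.seed(0)

vocab_size = 5      # hypothetical toy vocabulary size
embedding_dim = 3   # hypothetical embedding dimension n

# One row of weights per word id (row 0 is kept free for padding).
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size + 1)
]

def embed(token_ids):
    """Replace each integer id by its row of the embedding matrix."""
    return [embedding_matrix[i] for i in token_ids]

sentence_ids = [3, 1, 4]                 # ids standing for three words
embedded = embed(sentence_ids)
print(len(embedded), len(embedded[0]))   # 3 3
```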

# The data


❓ **Question** ❓ Let's first load the data. You don't have to understand what is going on in the function, it does not matter here.


⚠️ **Warning** ⚠️ The `load_data` function has a `percentage_of_sentences` argument. Depending on your computer, too many sentences may slow it down or even freeze it, and you can run out of RAM. For that reason, start with 20% of the sentences and see whether your computer handles it. If not, rerun with a lower number. Conversely, you can increase the number if you feel like it.

In [None]:
from tensorflow.keras.datasets import imdb

def load_data(percentage_of_sentences=None):
    # Load the data
    (sentences_train, y_train), (sentences_test, y_test) = imdb.load_data()
    
    # Take only a given percentage of the entire data
    if percentage_of_sentences is not None:
        assert 0 < percentage_of_sentences <= 100
        
        len_train = int(percentage_of_sentences/100*len(sentences_train))
        sentences_train = sentences_train[:len_train]
        y_train = y_train[:len_train]
        
        len_test = int(percentage_of_sentences/100*len(sentences_test))
        sentences_test = sentences_test[:len_test]
        y_test = y_test[:len_test]
            
    # Load the {word: integer} mapping and shift the ids by 3 to make room for the special tokens
    word_to_id = imdb.get_word_index()
    word_to_id = {k:(v+3) for k,v in word_to_id.items()}
    for i, w in enumerate(['<PAD>', '<START>', '<UNK>', '<UNUSED>']):
        word_to_id[w] = i

    id_to_word = {v:k for k, v in word_to_id.items()}

    # Convert each list of integers to a string of words (skipping the leading <START> token)
    X_train = [' '.join([id_to_word[_] for _ in sentence[1:]]) for sentence in sentences_train]
    X_test = [' '.join([id_to_word[_] for _ in sentence[1:]]) for sentence in sentences_test]
    
    return X_train, y_train, X_test, y_test



### Just run this cell to load the data
X_train, y_train, X_test, y_test = load_data(percentage_of_sentences=20)

# Data cleaning

### Conversion to a list of lists of words

❓ **Question** ❓ The first step of the data cleaning is the same as previously. Convert your list of sentences to a list of lists of words.
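As a small illustration on made-up sentences (not the real `X_train`), splitting on whitespace is enough here because the loaded reviews are already lower-case words separated by single spaces:

```python
# Toy sentences standing in for X_train (hypothetical data).
toy_sentences = ["this movie was great", "not my cup of tea"]

# str.split() with no argument splits on any run of whitespace.
toy_words = [sentence.split() for sentence in toy_sentences]
print(toy_words)
```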

In [None]:
### YOUR CODE HERE

### Tokenization

You probably haven't noticed, but Word2Vec handles words directly. In fact, it tokenizes them internally, meaning that it creates a one-to-one mapping between each word in the train set and an integer, thanks to a dictionary.

You will need to do the tokenization by yourself here.

❓ **Question** ❓ Create a dictionary `word_to_id` such that each word of `X_train` is in this dictionary. The value of each key (=word) should be a unique integer. In the end, each word is represented by a unique integer, and vice versa!

⚠️ **Warning** ⚠️ **DO NOT USE 0 as a value in your dictionary. Start at 1!**
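One possible way to build such a dictionary is sketched below on toy data. The helper name `build_vocabulary` and the toy sentences are assumptions made for the example. Each new word simply receives the next free integer, starting at 1.

```python
def build_vocabulary(sentences):
    """Map each distinct word to a unique integer, starting at 1."""
    word_to_id = {}
    for sentence in sentences:
        for word in sentence:
            if word not in word_to_id:
                # len + 1 so that 0 is never assigned (kept for padding)
                word_to_id[word] = len(word_to_id) + 1
    return word_to_id

toy_train = [["a", "good", "movie"], ["a", "bad", "movie"]]
toy_word_to_id = build_vocabulary(toy_train)
toy_id_to_word = {i: w for w, i in toy_word_to_id.items()}
print(toy_word_to_id)  # {'a': 1, 'good': 2, 'movie': 3, 'bad': 4}
```

Inverting the dictionary, as in `toy_id_to_word`, gives the "vice versa" direction.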

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ The number of keys should be equal to the number of different words in the train sentences. Print it.

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Currently, your sentences look like `['this', 'is', 'a', 'very', 'very', 'good', 'movie']`. Thanks to the `word_to_id` dictionary, replace them with their tokenized representation: `[3, 89, 2, 13, 13, 56, 32]` (made-up values). Store them in `X_token_train` and `X_token_test`.

❗ **Remark** ❗ It is better to write a function to do that. And don't forget that there is no reason for all the words in the test sentences to be in this dictionary. In that case, you can skip the word.
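A minimal sketch of such a function, on a toy vocabulary, could look like this. The name `tokenize` and the toy data are hypothetical. The `if word in word_to_id` filter is what skips words never seen in the train set.

```python
def tokenize(sentences, word_to_id):
    """Replace words by their ids, silently skipping unknown words."""
    return [
        [word_to_id[word] for word in sentence if word in word_to_id]
        for sentence in sentences
    ]

toy_vocab = {"a": 1, "good": 2, "movie": 3}
toy_test = [["a", "really", "good", "movie"]]   # "really" is unknown
print(tokenize(toy_test, toy_vocab))  # [[1, 2, 3]]
```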

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Pad your sentences to be RNN-ready! Here, pad your values with 0 (which is the default value)!
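To see what padding does, here is a pure-Python sketch with a hypothetical helper `pad_to_longest`. It is not the Keras implementation. In practice you would call `pad_sequences` from `tensorflow.keras.preprocessing.sequence`, whose default is to pad at the beginning (`padding='pre'`); the sketch pads at the end unless told otherwise.

```python
def pad_to_longest(sequences, value=0, padding="post"):
    """Pad every sequence with `value` up to the longest length."""
    longest = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [value] * (longest - len(seq))
        padded.append(seq + filler if padding == "post" else filler + seq)
    return padded

toy_tokens = [[1, 2, 3], [4, 5], [6]]
print(pad_to_longest(toy_tokens))
# [[1, 2, 3], [4, 5, 0], [6, 0, 0]]
```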

In [None]:
### YOUR CODE HERE

# The model

❓ **Question** ❓ Write a model that has:
- an embedding layer whose `input_dim` is the size of your vocabulary (should be an argument of the function), and whose `output_dim` is the size of your embedding space. You can choose 50 as in the word2vec exercise, or anything you want. Since your ids start at 1 and 0 is reserved for padding, `input_dim` must be at least the largest id plus one
- an RNN (SimpleRNN, LSTM, GRU) layer
- a Dense layer
- an output layer

⚠️ **Warning** ⚠️ Here, you don't need a masking layer. Why? Because `layers.Embedding` has an argument that does it directly: `mask_zero=True`. This is the reason why `id_to_word` should not have a word that is represented by 0. Otherwise, the Embedding layer would ignore it.

Compile it with the appropriate arguments
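Conceptually, masking on zero amounts to flagging which positions hold real words. The sketch below only mimics that idea on a toy padded batch. It is an illustration, not what Keras runs internally, and the variable names are assumptions.

```python
# Toy padded batch: 0 marks padding, any other id is a real word.
padded_batch = [[7, 2, 9, 0, 0],
                [4, 0, 0, 0, 0]]

# True where there is a real word, False where the position is padding.
mask = [[token != 0 for token in seq] for seq in padded_batch]
real_lengths = [sum(row) for row in mask]
print(real_lengths)  # [3, 1]
```

If a real word had the id 0, it would be flagged as padding too, which is why the ids start at 1.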

In [None]:
def init_model(vocab_size):
    ### YOUR CODE HERE
    return model

❓ **Question** ❓ Fit your model on training data 

In [None]:
### YOUR CODE HERE

⚠️ **Warning** ⚠️ You probably don't want to spend the rest of your weekend running this model. One of the reasons is that there are too many parameters, because you embedded a lot of different words.

**JUST STOP THE MODEL FIT FOR NOW!**

❓ **Question** ❓ Look at the number of parameters in your model. What do you think about it?

In [None]:
### YOUR CODE HERE

The embedding layer learns an embedding per word. Thus, if you set the embedding space size to 50, it will learn 50 weights per word. If you have 10,000 different words, the embedding has 500,000 parameters, which takes far too long to optimize.
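The arithmetic can be checked with a short sketch. The helper `embedding_params` is hypothetical. The `reserve_zero` flag accounts for the extra row needed when the id 0 is kept for padding.

```python
def embedding_params(vocab_size, embedding_dim, reserve_zero=True):
    """Number of trainable weights in an embedding table."""
    rows = vocab_size + 1 if reserve_zero else vocab_size
    return rows * embedding_dim

print(embedding_params(10_000, 50, reserve_zero=False))  # 500000
print(embedding_params(10_000, 50))                      # 500050
```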

❓ **Question** ❓ To reduce the number of words, rerun your tokenization on `X_train` and `X_test`, but keep only words that occur more than 30 times (30 is an example; you can take more, but it is a good starting point to keep your training quick). Then, tokenize your sentences once again and display the number of words in your vocabulary.


❗ **Remark** ❗ Be careful, some reviews only contain words that fall below the chosen number of occurrences. When you skip those words, you may end up with sentences without any word in them. You have to handle that, as you cannot have an empty $X$. If it happens, you can for instance replace the sentence by a vector of zeros.
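One way to do this, sketched on toy data with hypothetical helper names, is to count the words with `collections.Counter`, keep the frequent ones, and fall back to a single `0` when a sentence ends up empty. The threshold is 1 here only because the toy data is tiny.

```python
from collections import Counter

def build_frequent_vocabulary(sentences, min_count):
    """Keep only words seen more than `min_count` times; ids start at 1."""
    counts = Counter(word for sentence in sentences for word in sentence)
    frequent = [w for w, c in counts.items() if c > min_count]
    return {word: i for i, word in enumerate(frequent, start=1)}

def tokenize_non_empty(sentences, word_to_id):
    """Tokenize, replacing a sentence left empty by a single 0."""
    tokens = []
    for sentence in sentences:
        ids = [word_to_id[w] for w in sentence if w in word_to_id]
        tokens.append(ids if ids else [0])
    return tokens

toy_train = [["good", "movie"], ["bad", "movie"], ["weird"]]
vocab = build_frequent_vocabulary(toy_train, min_count=1)
print(vocab)                                 # {'movie': 1}
print(tokenize_non_empty(toy_train, vocab))  # [[1], [1], [0]]
```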

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Let's accelerate the convergence of the model here. The computational time depends on
- the batch size
- the number of training observations
- the padded length of the sequences
- the size of each observation, which here is 1 because each word is encoded as a single token
- the number of parameters of the neural network

Let's work on the padded length of your sequences here! Instead of padding up to the maximal length within your training set, use the `maxlen` argument (i.e. maximal length) of `pad_sequences`. The idea is to set `maxlen` to a number between 200 and 300 so that shorter sentences are padded up to `maxlen` words, while longer sentences are shortened. This drastically reduces the computational time. The rationale is that, even with 200 to 300 words, you can classify whether the sentence is positive or negative.
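The effect of `maxlen` can be sketched in pure Python with a hypothetical helper `pad_or_truncate`. Note one assumption: the sketch keeps the beginning of long sequences and pads at the end, whereas `pad_sequences` defaults to `padding='pre'` and `truncating='pre'`, i.e. it pads at the beginning and drops the beginning of long sequences. It also reports the share of sequences that get cut, which helps when choosing `maxlen`.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Cut sequences longer than `maxlen`, pad shorter ones at the end."""
    seq = seq[:maxlen]
    return seq + [value] * (maxlen - len(seq))

toy_tokens = [[1, 2, 3, 4, 5, 6], [7, 8]]
maxlen = 4  # toy value; the exercise suggests 200 to 300 on the real data

fixed = [pad_or_truncate(seq, maxlen) for seq in toy_tokens]
share_truncated = sum(len(s) > maxlen for s in toy_tokens) / len(toy_tokens)
print(fixed)            # [[1, 2, 3, 4], [7, 8, 0, 0]]
print(share_truncated)  # 0.5
```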

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ You can now rerun your model (and check the number of parameters). Then evaluate it on your test set. You can go to the next questions while waiting for it to converge. Do not train it for more than 10 epochs.

To make it run faster, you can use a larger `batch_size` if needed.

In [None]:
### YOUR CODE HERE

# The model - more advanced.

Let's train some fancy network here. 
Each of your words is represented by a vector of size N (the size of your embedding). Therefore, as a sentence is a sequence of words, it is represented by a matrix of shape (number of words, N). So, all your sentences are actually represented as matrices once embedded.

If you think about it, an image is also a matrix. Said differently, you may represent your sentence as a matrix, where each column (or row, depending on how you want to look at it) is a word, and each row (or column) corresponds to a coordinate in the embedding space.

Well, in that case, as these are close to images, why not use convolutions on them? Yes, convolutions!
But be careful. In the case of images, convolutions are two-dimensional, as the filters can move up and down, and left and right. In the case of our sentences, we want the kernel to move _only_ in the word-by-word direction (moving coordinate by coordinate in the embedding space doesn't make much sense).

So let's create a model that uses convolutions.

### First, the data

❓ **Question** ❓ In the case of convolutions, the inputs must all be of the same size. Pad them with `maxlen` equal to 150 here.

In [None]:
### YOUR CODE HERE

### Using 1D Convolution.

❓ **Question** ❓ Define a model that has:
- an Embedding layer: `input_dim` is the vocab_size, `output_dim` is the embedding space dimension, and `mask_zero` has to be set to `True`. Here, for computational reasons, set `input_length` to the maximum length of your observations (that you just defined in the previous question).
- a `Conv1D` layer
- a Flatten layer
- a dense layer
- an output layer

Compile the model accordingly

❗ **Remark** ❗ The size of the Conv1D kernel corresponds exactly to the number of side-by-side words each kernel is taking into account ;)
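To illustrate what "moving only in the word-by-word direction" means, here is a pure-Python sketch of a single 1D filter without padding and with a stride of 1. The function `conv1d_valid` and the toy values are assumptions for the example. A real `Conv1D` layer additionally has a bias, an activation and many filters.

```python
def conv1d_valid(sequence, kernel):
    """Slide `kernel` along the word axis only (no padding, stride 1).

    sequence: list of word vectors, shape (n_words, emb_dim)
    kernel:   list of weight vectors, shape (kernel_size, emb_dim)
    """
    kernel_size = len(kernel)
    outputs = []
    for start in range(len(sequence) - kernel_size + 1):
        window = sequence[start:start + kernel_size]
        # Weighted sum over all words and all coordinates of the window.
        outputs.append(sum(
            x * w
            for word_vec, weight_vec in zip(window, kernel)
            for x, w in zip(word_vec, weight_vec)
        ))
    return outputs

# 4 words embedded in 2 dimensions (hypothetical values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
# One filter that looks at 2 side-by-side words.
kernel = [[1.0, 1.0], [1.0, -1.0]]

print(conv1d_valid(sentence, kernel))  # [0.0, 1.0, 4.0]
```

With 4 words and a kernel of size 2, the filter produces 4 - 2 + 1 = 3 values, one per pair of neighbouring words.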

In [None]:
def init_cnn_model(vocab_size):
    ### YOUR CODE HERE
    return model

❓ **Question** ❓ Look at the number of parameters and compare it to the previous RNN model.

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Fit your model with a stopping criterion, and evaluate it on the test data.
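A common stopping criterion is patience-based: stop once the validation loss has not improved for a given number of epochs. In Keras this is what the `EarlyStopping` callback does (with arguments such as `monitor`, `patience` and `restore_best_weights`). The sketch below only replays that logic on made-up validation losses. The helper name and the numbers are hypothetical.

```python
def epochs_before_stopping(val_losses, patience):
    """Return (epochs run, best epoch) for a patience-based stopping rule."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here.
            return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses, one per epoch.
toy_val_losses = [0.69, 0.60, 0.55, 0.57, 0.58, 0.56, 0.61]
print(epochs_before_stopping(toy_val_losses, patience=2))  # (5, 3)
```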

In [None]:
### YOUR CODE HERE