# Deep Learning - Day 5 - Exercise 1

Natural Language Processing (NLP) is the automated study of language. To deal with texts, we have to convert them into meaningful representations that a computer can handle.

Usually, the first step is data cleaning. As you already did this last week, we won't emphasize it today. However, if you have finished an exercise, feel comfortable with what you have, and want better performance, you are welcome to improve the data cleaning.

✅ **Good Practice** ✅ Never spend too much time on data cleaning! First, build the entire pipeline to get a baseline evaluation of your task. Otherwise, you won't know whether your fancy data cleaning improves the pipeline or not.

After the data cleaning, let's represent each word of our vocabulary as a token (each word corresponds to an integer, each integer corresponds to a word), and then convert each token to a vector of fixed dimension. That way, each word is represented by a vector that can be fed into a (Recurrent) Neural Network.
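
The two steps described above (word to token, then token to vector) can be sketched on a toy sentence. This is only an illustration: the names `word_to_token`, `token_to_word` and `token_to_vector` are hypothetical, and the vector values are arbitrary placeholders rather than learned embeddings.

```python
# Toy illustration with made-up data; the exercise itself uses Word2Vec later on.
sentence = ["the", "movie", "was", "great"]

# Step 1: build a vocabulary that maps each word to an integer token (and back).
vocabulary = sorted(set(sentence))
word_to_token = {word: token for token, word in enumerate(vocabulary)}
token_to_word = {token: word for word, token in word_to_token.items()}

# Step 2: map each token to a fixed-size vector (arbitrary values, not learned).
embedding_dim = 3
token_to_vector = {
    token: [float(token + i) for i in range(embedding_dim)]
    for token in token_to_word
}

tokens = [word_to_token[word] for word in sentence]
vectors = [token_to_vector[token] for token in tokens]

print(tokens)                         # one integer per word
print(len(vectors), len(vectors[0]))  # number of words, embedding dimension
```

In the rest of the notebook, the second step is learned from the data instead of being filled with placeholder values.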

# The data

Sentiment analysis is the task of classifying sentences according to a subjective notion or an affective state. In this notebook, we will perform such an analysis on movie reviews from the IMDb database. This database contains thousands of movie reviews and the associated sentiment (positive or negative). We will train different models to classify the sentences.

Let's first load the data. You don't have to understand what is going on in the function; it does not matter here.

In [None]:
from tensorflow.keras.datasets import imdb

def load_data():
    (sentences_train, y_train), (sentences_test, y_test) = imdb.load_data()
    word_to_id = imdb.get_word_index()

    word_to_id = {k:(v+3) for k,v in word_to_id.items()}
    for i, w in enumerate(['<PAD>', '<START>', '<UNK>', '<UNUSED>']):
        word_to_id[w] = i

    id_to_word = {v:k for k, v in word_to_id.items()}

    sentences_train = [[id_to_word[_] for _ in sentence] for sentence in sentences_train]
    sentences_test = [[id_to_word[_] for _ in sentence] for sentence in sentences_test]
    
    return sentences_train, y_train, sentences_test, y_test

sentences_train, y_train, sentences_test, y_test = load_data()

Now that the data are loaded, let's look at them.

❓ **Question** ❓ Print one or two sentences to understand the structure of `sentences_train` and `sentences_test`.

In [None]:
# YOUR CODE HERE

❗ **Remark** ❗ Look at `y_train` and `y_test`. This is a classification task where, based on the text, you try to predict whether the review is negative (=0) or positive (=1). It corresponds to a sentiment analysis task.

In [None]:
# YOUR CODE HERE

# Embedding with Word2Vec

Now, let's use Word2Vec to embed the words of our sentences. Word2Vec converts each word to a fixed-size vector representation.

For instance, we will have:
- 'dog' --> [0.1, -0.3, 0.8]
- 'cat' --> [-1.1, 2.3, 0.7]
- 'apple' --> [3.1, 0.9, -4.7]

What you expect is a representation in which words with close meanings are close to each other in the embedding space, as in the example in this image:

![Embedding](word_embedding.png)
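
To make "close in the embedding space" concrete, here is a minimal sketch that compares hand-picked toy vectors with cosine similarity, the measure that gensim's similarity functions rank by. The vectors in `toy_embeddings` are hypothetical and chosen by hand for illustration; they are not produced by Word2Vec.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, chosen by hand for this illustration.
toy_embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

sim_dog_cat = cosine_similarity(toy_embeddings["dog"], toy_embeddings["cat"])
sim_dog_apple = cosine_similarity(toy_embeddings["dog"], toy_embeddings["apple"])

# With these toy values, 'dog' is much closer to 'cat' than to 'apple'.
print(round(sim_dog_cat, 3), round(sim_dog_apple, 3))
```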

Let's run Word2Vec!


In [None]:
from gensim.models import Word2Vec

word2vec = Word2Vec(sentences=sentences_train)

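You can print all the words that are in the vocabulary. Note that the cell below is written for gensim versions before 4.0; from gensim 4.0 onward, `wv.vocab` no longer exists and `word2vec.wv.key_to_index` should be used instead.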

In [None]:
print(word2vec.wv.vocab.keys())

Let's look at the embedded representation of some words.

❓ **Question** ❓ Try different words - especially, try non-existing words to see that they don't have any representation (which is perfectly normal). 

In [None]:
# YOUR CODE HERE

❓ **Question** ❓ What is the size of each word vector? It corresponds to the size of the embedding space.

In [None]:
# YOUR CODE HERE

❗ **Remark** ❗ The "magic" of Word2Vec, especially the way it creates this embedding, is not discussed in this exercise. There are many resources out there explaining how it works.

The only thing to know here is that it trains an internal neural network (that you don't see) which, in a nutshell, predicts a word based on the surrounding words in a sentence. It picks many positions in the different sentences, applies a window, and chooses some words as inputs $X$ and one word as output $y$, which it tries to predict in the embedding space.
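
As a rough sketch of how such inputs $X$ and outputs $y$ could be built from a sentence, the hypothetical helper `context_target_pairs` below slides a window over a tokenized sentence. It is a simplification for intuition only: the actual Word2Vec training adds further tricks and does not expose these pairs.

```python
def context_target_pairs(sentence, window=2):
    """Return (context_words, target_word) pairs for one tokenized sentence."""
    pairs = []
    for position, target in enumerate(sentence):
        # Up to `window` words on each side of the target form its context.
        left = sentence[max(0, position - window):position]
        right = sentence[position + 1:position + 1 + window]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs

toy_sentence = ["this", "movie", "is", "great"]
pairs = context_target_pairs(toy_sentence, window=1)

for context, target in pairs:
    print(context, "->", target)  # X (surrounding words) -> y (word to predict)
```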

Once you have a word vector, you can use the `word2vec.wv.similar_by_vector` function to see which words are closest to it.

❓ **Question** ❓ Let's look at the closest words (in the embedding space) to some words. You should see the kind of lexical proximity that we were interested in.

In [None]:
# YOUR CODE HERE

Let's write the vector `word2vec.wv['film']` as W2V(film) here (for the explanation).

As any word is represented as a vector, we can do some arithmetic on the words.
For instance, we can do W2V(man) - W2V(woman)

❓ **Question** ❓ Do this mathematical operation and print the result

In [None]:
# YOUR CODE HERE

Now, imagine for a second that, somehow, the following equality holds true (just for a second):

    W2V(man) - W2V(woman) = W2V(king) - W2V(queen)

which is equivalent to 

    W2V(man) - W2V(woman) + W2V(queen) = W2V(king).
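
To see what this rearranged equality means geometrically, here is a toy sketch with hand-made 2D vectors for which it holds exactly. The dictionary `toy` and its values are hypothetical; real Word2Vec vectors only satisfy such relations approximately, if at all.

```python
# Hand-made 2D vectors: first axis ~ "royalty", second axis ~ "gender" (made up).
toy = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
}

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def subtract(u, v):
    return [a - b for a, b in zip(u, v)]

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

# man - woman + queen, computed component by component.
res_toy = add(subtract(toy["man"], toy["woman"]), toy["queen"])

# The word whose toy vector is nearest to the result.
closest = min(toy, key=lambda word: squared_distance(toy[word], res_toy))
print(res_toy, closest)
```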

❓ **Question** ❓ Let's, just for fun (as it would be foolish of us to think that this equality holds true ...), do the operation W2V(man) - W2V(woman) + W2V(queen) and store it in a `res` variable (which will be a vector of size 100) that you can print.

In [None]:
# YOUR CODE HERE

We said earlier that, for any vector, it is possible to find the closest vectors and print the words they represent.

❓ **Question** ❓ Look at the vectors closest to `res` (thanks to the `word2vec.wv.similar_by_vector` function).

In [None]:
# YOUR CODE HERE

## Awesome, right? 

You just did some arithmetic operations on words!

❓ **Question** ❓ Now, write a function that, given a sentence, returns a matrix that corresponds to the embedding of the full sentence. This means that you have to embed each word one after the other and stack the results into a 2D matrix (make sure that your output is a numpy array).

PS: Make sure the asserts pass! Otherwise, there is a problem.

In [None]:
import numpy as np

example = ['this', 'movie', 'is', 'probably', 'the', 'worst', 'action', 'movie', 'ever']

def embed_sentence(word2vec, sentence):
    # YOUR CODE HERE
        
embedded_sentence = embed_sentence(word2vec, example)
    
assert(type(embedded_sentence) == np.ndarray)
assert(embedded_sentence.shape == (9, 100))

❓ **Question** ❓ Write a function that, given a list of sentences (each sentence being a list of words/strings), returns a list of embedded sentences (each sentence is a matrix). Apply this function to the train and test sentences.

Hint: Use the previous function `embed_sentence`

❗ **Remark** ❗ You will probably notice that some words you are trying to convert throw errors because they do not belong to the vocabulary:
- for the test set, this is understandable: some words were not in the train set and thus their embedded representation is unknown
- for the train set, it might seem odd but this is actually normal - we will explain it later

Nonetheless, change your function `embed_sentence` to skip words that are not in the vocabulary.

In [None]:
def embedding(word2vec, sentences):
    
    # YOUR CODE HERE
    
    
X_train = embedding(word2vec, sentences_train)
X_test = embedding(word2vec, sentences_test)

❓ **Question** ❓ To be sure that it worked, check that the following cell runs without raising an assertion error.

In [None]:
def check_embedding(X, ds):
    if ds == 'train':
        assert(np.shape(X[0]) == (217, 100))
    if ds == 'test':
        assert(np.shape(X[0]) == (68, 100))
    for x in X:
        assert(np.shape(x)[1] == 100)
    
check_embedding(X_train, 'train')
check_embedding(X_test, 'test')

❓ **Question** ❓ Do not forget to pad the data, as yesterday, in order to have tensors that can be split into batches during the optimization. Store the padded values in `X_train_pad` and `X_test_pad`.
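
As a reminder of what padding does to embedded sentences, here is a minimal NumPy sketch on toy matrices. The helper `pad_embedded` is hypothetical and only mimics post-padding; in practice you would more likely use Keras' `pad_sequences`, in which case keep in mind that its default `dtype` is `int32`, so float embeddings need `dtype='float32'`.

```python
import numpy as np

def pad_embedded(sentences, max_len, pad_value=0.0):
    """Pad (or truncate) each (n_words, dim) matrix to shape (max_len, dim)."""
    dim = sentences[0].shape[1]
    padded = np.full((len(sentences), max_len, dim), pad_value, dtype="float32")
    for i, sentence in enumerate(sentences):
        length = min(len(sentence), max_len)
        # Real word vectors come first; the remaining rows keep the pad value.
        padded[i, :length] = sentence[:length]
    return padded

# Two toy "embedded sentences" with 2 and 4 words, embedding dimension 3.
toy_sentences = [np.ones((2, 3)), np.ones((4, 3))]
toy_padded = pad_embedded(toy_sentences, max_len=4)

print(toy_padded.shape)  # (number of sentences, max_len, embedding dimension)
```

Whatever padding value you choose should match the value that the masking layer of the model is told to ignore.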

In [None]:
# YOUR CODE HERE

# The model

❓ **Question** ❓ Write an RNN with the following layers:
- a masking layer
- an LSTM with 13 units and tanh activation function
- a Dense with 10 units
- an output layer that depends on your task

❓ **Question** ❓ Then, compile your model (we advise you to use `rmsprop` as the optimizer - at least to begin with)

In [None]:
def init_model():
    # YOUR CODE HERE
    return model

model = init_model()

❓ **Question** ❓ Fit the model on your padded and embedded data - do not forget the early stopping criterion

In [None]:
# YOUR CODE HERE

# ⚠️ Here, you can start reading the second exercise if your model takes too long. You can then come back from time to time to see if it is over

❓ **Question** ❓ Evaluate your model on the test set

In [None]:
# YOUR CODE HERE

# Back to the Word2Vec

Remember that some of the words used to train Word2Vec were not in its vocabulary? The reason lies in some of the Word2Vec arguments, which we will now dig into.

❓ **Question** ❓ The first one is the `size` argument (renamed `vector_size` in gensim 4.0 and later). It corresponds to the size of the embedding space. Learn a new `word2vec_2` model, still trained on the `sentences_train`, but with a smaller or larger `size`.

Verify on some words that the corresponding embedding is of your selected size.

In [None]:
# YOUR CODE HERE

The second important argument is `min_count`. It is an integer that sets the minimum number of occurrences a word must have in the corpus to be learned in the embedding space. For instance, let's say that the word "movie" appears 1000 times in the corpus and "simba" only 2 times. If `min_count=3`, the word "simba" will be skipped during the training.

The intention is to keep only words that are sufficiently present in the corpus to have a robust embedded representation.
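
The effect of such a threshold can be sketched with a simple word count on a toy corpus. This only illustrates the filtering idea (a word is kept when its count is at least `min_count`); the names `toy_corpus` and `kept_vocabulary` are hypothetical and this is not how gensim is implemented internally.

```python
from collections import Counter

# Toy corpus: each sentence is a list of words, as in `sentences_train`.
toy_corpus = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "boring"],
    ["simba", "liked", "the", "movie"],
]
min_count = 2

# Count how many times each word appears in the whole corpus.
word_counts = Counter(word for sentence in toy_corpus for word in sentence)

# Keep only the words that appear at least `min_count` times.
kept_vocabulary = {word for word, count in word_counts.items() if count >= min_count}

print(sorted(kept_vocabulary))
print(sorted(set(word_counts) - kept_vocabulary))  # words that would be skipped
```

This also explains the earlier remark: a word can be present in the training sentences and still be absent from the learned vocabulary.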

❓ **Question** ❓ Learn a new `word2vec_3` model with a `min_count` higher than 5 (which is the default value) and a `word2vec_4` with a `min_count` smaller than 5, and then, compare the size of the vocabulary for all the different word2vec that you have trained (you can choose any `size` you want).

In [None]:
# YOUR CODE HERE

Remember that we said that word2vec has an internal neural network that it optimizes based on some predictions? These predictions actually correspond to predicting a word based on the surrounding words. The surrounding words are taken from a `window`, which corresponds to the number of words taken into account around the word to predict. You can train the word2vec with different `window` sizes.

❓ **Question** ❓ Learn a new `word2vec_5` model with a `window` different from the previous one (default is 5).

In [None]:
# YOUR CODE HERE

The arguments you have seen (`size`, `min_count` and `window`) are usually the ones that you should start changing to get better performance from your model.

But you can also look at other arguments in the [documentation](https://radimrehurek.com/gensim/models/word2vec.html).

❓ **Question** ❓ Fit a Neural Network whose input is derived from a fine-tuned word2vec for which you have chosen different parameters. On top of that, you can improve your accuracy by trying other RNN architectures.