# Deep Learning - Day 5 - Letter generation

### Exercise objective
- Become autonomous with Natural Language Processing
- Generate letters

<hr>

In this exercise, we will try to generate some text. The underlying idea is to give an input sequence and predict what the next letter is going to be. To do that, we will first create a dataset for this task, and then train an RNN to do the prediction.

# The data

❓ **Question** ❓ First, let's load the data. We use the IMDB reviews again, but we are only interested in the sentences, not in whether each review is positive or negative.

⚠️ **Warning** ⚠️ The `load_data` function has a `percentage_of_sentences` argument. Depending on your computer, loading too many sentences may slow it down or even freeze it, and your RAM can overflow. For that reason, start with 20% of the sentences and see whether your computer handles it. If it does not, rerun with a lower number. If it does, you can increase the number if you feel like it.

In [None]:
from tensorflow.keras.datasets import imdb

def load_data(percentage_of_sentences=None):
    # Load the data
    (sentences_train, y_train), (sentences_test, y_test) = imdb.load_data()
    
    # Take only a given percentage of the entire data
    if percentage_of_sentences is not None:
        assert 0 < percentage_of_sentences <= 100
        
        len_train = int(percentage_of_sentences/100*len(sentences_train))
        sentences_train = sentences_train[:len_train]
        y_train = y_train[:len_train]
        
        len_test = int(percentage_of_sentences/100*len(sentences_test))
        sentences_test = sentences_test[:len_test]
        y_test = y_test[:len_test]
            
    # Load the {word: integer} mapping and shift the ids by 3 to make room for the special tokens
    word_to_id = imdb.get_word_index()
    word_to_id = {k:(v+3) for k,v in word_to_id.items()}
    for i, w in enumerate(['<PAD>', '<START>', '<UNK>', '<UNUSED>']):
        word_to_id[w] = i

    id_to_word = {v:k for k, v in word_to_id.items()}

    # Convert each list of integers to a sentence (str), skipping the leading <START> token
    X_train = [' '.join(id_to_word[word_id] for word_id in sentence[1:]) for sentence in sentences_train]
    
    return X_train


### Just run this cell to load the data
X = load_data(percentage_of_sentences=20)

❓ **Question** ❓ Write a function that, given a string (list of letters), returns
- a string (list of letters) that corresponds to part of the sentence; this string should be of size 300
- the letter that follows the previous string

❗ **Remark** ❗ The returned string does not have to start at the beginning of the input string.

Example:
- Input: `'This is a good movie'`
- Output: `('a good m', 'o')` (except that the first part should be of size 300 instead of 8)

❗ **Remark** ❗ If the input is shorter than 300 letters, return `None`
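
As a starting point, here is a minimal sketch of one possible approach, not the reference solution. The function name `get_sequence_and_next_letter` and its `length` and `rng` parameters are hypothetical. The sketch also returns `None` when the input has exactly `length` letters, since no letter would follow the extracted string.

```python
import random

def get_sequence_and_next_letter(text, length=300, rng=random):
    """Return (substring of `length` letters, the letter right after it), or None."""
    # At least one letter must remain after the substring.
    if len(text) <= length:
        return None
    # Random start such that start + length is still a valid index.
    start = rng.randint(0, len(text) - length - 1)
    return text[start:start + length], text[start + length]
```

With `length=8`, calling it on `'This is a good movie'` may return `('a good m', 'o')`, depending on the random start.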

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Check that the function is working on some strings from the loaded data

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Write a function that, based on the previous function and the loaded sentences, generates a dataset X and y:
- each sample of X is a string
- the corresponding y is the letter that comes just after in the input string
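
A possible sketch of such a dataset builder is shown below. To stay self-contained it inlines the extraction step instead of calling the previous function. The name `create_dataset` and the `samples_per_sentence` and `seed` parameters are assumptions made for illustration.

```python
import random

def create_dataset(sentences, length=300, samples_per_sentence=1, seed=0):
    """Build X (strings of `length` letters) and y (the letter following each string)."""
    rng = random.Random(seed)
    X, y = [], []
    for sentence in sentences:
        # Skip sentences that have no letter left after `length` characters.
        if len(sentence) <= length:
            continue
        for _ in range(samples_per_sentence):
            start = rng.randint(0, len(sentence) - length - 1)
            X.append(sentence[start:start + length])
            y.append(sentence[start + length])
    return X, y
```

Drawing several samples per sentence is one way to get more training data out of the same reviews, at the cost of correlated samples.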

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Split X and y into train and test data. Store them in `string_train`, `string_test`, `y_train` and `y_test`
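
One way to do the split with the standard library only is sketched below. A library helper such as scikit-learn's `train_test_split` would work just as well. The name `split_train_test` and the 30% test ratio are assumptions of this sketch.

```python
import random

def split_train_test(X, y, test_ratio=0.3, seed=0):
    """Shuffle the sample indices, then split X and y with the same indices."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(X) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    string_train = [X[i] for i in train_idx]
    string_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return string_train, string_test, y_train, y_test
```

Using the same shuffled indices for both lists keeps each string aligned with its target letter.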

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ In a dictionary, store a unique token for each letter. The key is the letter and the value is the corresponding token. Then, in another dictionary, invert it so that the token is the key and the letter is the value.
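
The two dictionaries could be built along these lines. The function name `build_vocabulary` is hypothetical, and starting the tokens at 1 (so that 0 stays free for padding) is a design choice of this sketch, not a requirement of the exercise.

```python
def build_vocabulary(strings):
    """Return (letter_to_token, token_to_letter) for all letters found in `strings`."""
    letters = sorted(set(''.join(strings)))
    # Tokens start at 1 so that 0 can later be used as the padding value.
    letter_to_token = {letter: i for i, letter in enumerate(letters, start=1)}
    token_to_letter = {token: letter for letter, token in letter_to_token.items()}
    return letter_to_token, token_to_letter

letter_to_token, token_to_letter = build_vocabulary(['abc', 'cab d'])
# letter_to_token == {' ': 1, 'a': 2, 'b': 3, 'c': 4, 'd': 5}
```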

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Based on the previous dictionary, tokenize the strings and store them in `X_train` and `X_test`
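
A tokenization step might look like the sketch below. The name `tokenize_strings` is hypothetical, and silently skipping letters that are missing from the dictionary is an assumption of this sketch; mapping them to a dedicated token would be another reasonable choice.

```python
def tokenize_strings(strings, letter_to_token):
    """Convert each string to a list of tokens using the letter-to-token dictionary."""
    # Letters that are not in the dictionary are skipped (a choice made for this sketch).
    return [[letter_to_token[letter] for letter in string if letter in letter_to_token]
            for string in strings]

demo_tokens = tokenize_strings(['abba', 'abz'], {'a': 1, 'b': 2})
# demo_tokens == [[1, 2, 2, 1], [1, 2]]
```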

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Now, pad your inputs and store them in `X_train_pad` and `X_test_pad`
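
In a Keras workflow you would probably reach for `pad_sequences` from `tensorflow.keras.preprocessing.sequence`. The plain-Python sketch below only illustrates the idea: sequences are padded on the left with 0 and truncated from the front, which mirrors that function's default behaviour. The name `pad_left` is hypothetical, and it returns lists rather than a NumPy array.

```python
def pad_left(sequences, maxlen, value=0):
    """Left-pad (or front-truncate) each token list to exactly `maxlen` tokens (maxlen >= 1)."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last `maxlen` tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

demo_pad = pad_left([[1, 2], [1, 2, 3, 4, 5]], maxlen=4)
# demo_pad == [[0, 0, 1, 2], [2, 3, 4, 5]]
```

Keeping the end of the sequence matters here because the letters closest to the prediction point carry the most useful context.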

In [None]:
### YOUR CODE HERE

The outputs are currently letters. We need several operations before we can use them. First, we need to tokenize them. Then, we need to convert them into categories (one-hot encoded, using `to_categorical`). However, the letter-to-token mapping should be built from the train set only, and some letters may be present in the test set but not in the train set.

❓ **Question** ❓ For all these reasons, we need to tokenize `y_train` and `y_test` while converting the letters that are in `y_test` but not in `y_train` to an additional token that corresponds to an unseen letter. You can use the same token for all these letters.
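
The idea can be sketched as follows: the mapping is built from `y_train` only, and one extra token is shared by every letter that appears only in `y_test`. The function name `tokenize_targets` and the choice of `len(letters)` as the unseen token are assumptions of this sketch.

```python
def tokenize_targets(y_train, y_test):
    """Tokenize target letters; letters unseen in y_train share one extra token."""
    letters = sorted(set(y_train))
    y_letter_to_token = {letter: i for i, letter in enumerate(letters)}
    unseen_token = len(letters)  # single token for all letters absent from y_train
    y_train_token = [y_letter_to_token[letter] for letter in y_train]
    y_test_token = [y_letter_to_token.get(letter, unseen_token) for letter in y_test]
    return y_train_token, y_test_token, y_letter_to_token

y_tr_tok, y_te_tok, y_map = tokenize_targets(['a', 'b', 'a', 'c'], ['b', 'z', 'q'])
# y_tr_tok == [0, 1, 0, 2] and y_te_tok == [1, 3, 3]
```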

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Now, let's convert the tokenized outputs to categories! There should be as many categories as there are different letters in `y_train`, plus one for the unseen-letter token. This extra category never appears in `y_train`, but it is needed when evaluating on `y_test`.

In [None]:
### YOUR CODE HERE

# Baseline model

❓ **Question** ❓ What is the baseline accuracy?
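
The exercise does not spell out which baseline to use. A common choice, assumed in this sketch, is to always predict the most frequent letter of `y_train` and measure how often that is right on `y_test`. The name `baseline_accuracy` is hypothetical.

```python
from collections import Counter

def baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most frequent letter of y_train."""
    most_common_letter, _ = Counter(y_train).most_common(1)[0]
    return sum(letter == most_common_letter for letter in y_test) / len(y_test)

demo_baseline = baseline_accuracy(list('eeeab'), list('eeab'))
# demo_baseline == 0.5
```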

In [None]:
### YOUR CODE HERE

# The model

❓ **Question** ❓ Write an RNN with all the appropriate layers, and compile it.

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Fit the model. You can use a large batch size to speed up training. The model will probably plateau at the baseline performance at some point. If the loss keeps decreasing, the accuracy will then improve beyond the baseline.

You should get an accuracy better than 35%.

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Evaluate your model on the test set

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Even though the model is not perfect, you can look at its prediction for a string of your choice. Don't forget to decode the predicted token to know which letter it corresponds to.

You will have to convert your string to a list of tokens, then get the most probable class and convert it back to a letter.

You should do it in a function.

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Now, write a function that takes a string as input, predicts the next letter, appends the letter to the initial string, then makes a new prediction, and so on.

For instance:
- 'this is a good' => ' '
- 'this is a good ' => 'm'
- 'this is a good m' => 'o'
...

The function should also take as input the number of times you repeat the operation.

You can have some fun trying different input sequences here.
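
The loop itself can be sketched independently of the model. Here `predict_next_letter` is a hypothetical callable standing in for your own prediction function (string in, single letter out), and `generate_text` is an illustrative name. The stub below just reads letters from a fixed sentence so that the loop can be tried without a trained model.

```python
def generate_text(seed_string, predict_next_letter, n_letters):
    """Append `n_letters` predicted letters to `seed_string`, one at a time."""
    text = seed_string
    for _ in range(n_letters):
        # Each prediction sees the seed plus everything generated so far.
        text += predict_next_letter(text)
    return text

# Stub predictor (for demonstration only): returns the next letter of a fixed sentence.
target = 'this is a good movie'
stub_predictor = lambda text: target[len(text)]

demo_generated = generate_text('this is a good', stub_predictor, 3)
# demo_generated == 'this is a good mo'
```

With a real model, the wrapped prediction function would tokenize and pad the current text before calling the model, so the growing string never exceeds the input length the network expects.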

In [None]:
### YOUR CODE HERE

❓ **Question** ❓ Try to optimize your architecture to improve your performance. You can also try to load more data in the first function.