# Letter generation

### Exercise objective
- Become autonomous with Natural Language Processing
- Generate letters

<hr>

In this exercise, we will try to generate some text. The underlying idea is to give an input sequence and to predict what the next letter is going to be. To do that, we will first create a dataset for this task, and then train an RNN to do the prediction.

# The data

❓ **Question** ❓ First, let's load the data. Here, it is the IMDB reviews again, but we are only interested in the sentences, not in whether a review is positive or negative.

⚠️ **Warning** ⚠️ The `load_data` function has a `percentage_of_sentences` argument. Depending on your computer, loading too many sentences may slow it down, freeze it, or even exhaust your RAM. For that reason, you can start with 20% of the sentences and see if your computer handles it. Otherwise, rerun with a lower number. Conversely, you can increase the number if your machine allows it.

**At the end of the notebook, you may need to increase the number of loaded sentences to improve the model**

In [5]:
from tensorflow.keras.datasets import imdb

def load_data(percentage_of_sentences=None):
    # Load the data
    (sentences_train, y_train), (sentences_test, y_test) = imdb.load_data()
    
    # Take only a given percentage of the entire data
    if percentage_of_sentences is not None:
        assert(percentage_of_sentences> 0 and percentage_of_sentences<=100)
        
        len_train = int(percentage_of_sentences/100*len(sentences_train))
        sentences_train = sentences_train[:len_train]
        y_train = y_train[:len_train]
        
        len_test = int(percentage_of_sentences/100*len(sentences_test))
        sentences_test = sentences_test[:len_test]
        y_test = y_test[:len_test]
            
    # Load the {word: integer} index, then invert it to {integer: word}
    word_to_id = imdb.get_word_index()
    word_to_id = {k:(v+3) for k,v in word_to_id.items()}
    for i, w in enumerate(['<PAD>', '<START>', '<UNK>', '<UNUSED>']):
        word_to_id[w] = i

    id_to_word = {v:k for k, v in word_to_id.items()}

    # Convert the list of integers to list of words (str)
    X_train = [' '.join([id_to_word[_] for _ in sentence[1:]]) for sentence in sentences_train]
    
    return X_train


### Just run this cell to load the data
X = load_data(percentage_of_sentences=10)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz




Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


❓ **Question** ❓ Write a function that, given a string (list of letters), returns
- a string (list of letters) that corresponds to part of the sentence; this string should be of size 300
- the letter that follows the previous string

❗ **Remark** ❗ There is no reason for the returned string to start at the beginning of the input string.

Example:
- Input : 'This is a good movie'
- Output: ('a good m', 'o') [Except the first part should be of size 300 instead of 8]

❗ **Remark** ❗ If the input is too short to provide 300 letters plus the letter that follows them, return None
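
The example above can be reproduced with plain slicing. This is only a minimal sketch: it uses a fixed start index and a window of 8 instead of a random start and 300, and `window_and_next` is a hypothetical helper name, not part of the exercise.

```python
def window_and_next(text, start, length):
    """Return (text[start:start+length], following letter), or None if it does not fit."""
    end = start + length
    # The letter at index `end` must exist, so the text needs more than `end` characters
    if start < 0 or end >= len(text):
        return None
    return text[start:end], text[end]


print(window_and_next('This is a good movie', start=8, length=8))
# ('a good m', 'o')
print(window_and_next('short', start=0, length=300))
# None
```

Note that a text of exactly `length` characters also yields `None`, since there is no following letter to predict.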

In [11]:
X

["this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the part's of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be

In [23]:
# YOUR CODE HERE
import numpy as np

def get_X_y(string, length=300):
    # We need `length` letters plus the letter that follows them
    if len(string) <= length:
        return None
    # Random start, so samples do not always begin at the start of the string
    first_letter_idx = np.random.randint(0, len(string) - length)
    
    X_letters = string[first_letter_idx:first_letter_idx + length]
    y_letter = string[first_letter_idx + length]
    
    return X_letters, y_letter


❓ **Question** ❓ Check that the function is working on some strings from the loaded data

In [24]:
# YOUR CODE HERE
get_X_y(X[4])

❓ **Question** ❓ Write a function that, based on the previous function and the loaded sentences, generates a dataset X and y:
- each sample of X is a string
- the corresponding y is the letter that comes just after in the input string

❗ **Remark** ❗ This question gives little guidance, as it is similar to what you have done in the previous exercises.

In [None]:
# YOUR CODE HERE
def create_dataset(sentences):
    
    X, y = [], []
    number_of_samples = 20000
    indices = np.random.randint(0, len(sentences), size=number_of_samples)
    
    for idx in indices:
        ret = get_X_y(sentences[idx])
        if ret is None:
            continue
        xi, yi = ret
        X.append(xi)
        y.append(yi)
        
    return X, y

X, y = create_dataset(X)

❓ **Question** ❓ Split X and y into train and test data. Store them in `string_train`, `string_test`, `y_train` and `y_test`

In [0]:
# YOUR CODE HERE
len_ = int(0.7*len(X))

string_train = X[:len_]
string_test = X[len_:]

y_train = y[:len_]
y_test = y[len_:]

❓ **Question** ❓ Create a dictionary which stores a unique token for each letter: the key is the letter while the value is the corresponding token. You have to build your dictionary based on the letters that are in `string_train` and `y_train` only, as you are not supposed to know the test set (nor the new letters that might appear in it, which is unlikely but still possible).

❗ **Remark** ❗ To account for the fact that there might be letters in the test set that are not in the train set, add a particular token for that, whose corresponding key can be `UNKNOWN`.

❗ **Remark** ❗ By letter, we actually mean any character, as texts also contain digits (`1`, `2`, ...) and symbols such as `?`, `!`, `@`, ...
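
To make the idea concrete, here is a small sketch of a character vocabulary with a reserved unknown token, run on toy strings rather than on the IMDB data. The names `build_vocab`, `tokenize` and `toy_vocab` are illustrative assumptions, not part of the exercise.

```python
def build_vocab(strings):
    """Map each distinct character to an integer id; id 0 is reserved for UNKNOWN."""
    vocab = {'UNKNOWN': 0}
    for string in strings:
        for char in string:
            if char not in vocab:
                # Next free id is the current size of the dictionary
                vocab[char] = len(vocab)
    return vocab


def tokenize(string, vocab):
    """Replace each character by its id, falling back to the UNKNOWN id."""
    return [vocab.get(char, vocab['UNKNOWN']) for char in string]


toy_vocab = build_vocab(['abc!', 'cab1'])
print(toy_vocab)
# {'UNKNOWN': 0, 'a': 1, 'b': 2, 'c': 3, '!': 4, '1': 5}
print(tokenize('a?c', toy_vocab))
# [1, 0, 3]
```

The key `UNKNOWN` is longer than one character, so it cannot collide with any real character of the text.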

In [0]:
# YOUR CODE HERE
letter_to_id = {}
letter_to_id['UNKNOWN'] = 0

iter_ = 1
for string in string_train:
    for letter in string:
        if letter in letter_to_id:
            continue
        letter_to_id[letter] = iter_
        iter_ += 1
        
for string in y_train:
    for letter in string:
        if letter in letter_to_id:
            continue
        letter_to_id[letter] = iter_
        iter_ += 1

❓ **Question** ❓ Based on the previous dictionary, tokenize the strings and store them in `X_train` and `X_test`.

❗ **Remark** ❗ Convert your lists to NumPy arrays

In [0]:
# YOUR CODE HERE
X_train = [[letter_to_id[_] for _ in x] for x in string_train]
X_test = [[letter_to_id[_] if _ in letter_to_id else letter_to_id['UNKNOWN'] for _ in x ] for x in string_test]

X_train = np.array(X_train)
X_test = np.array(X_test)

❓ **Question** ❓ The outputs are currently letters. We first need to tokenize them, thanks to the previous dictionary.

❗ **Remark** ❗ Remember that some values in `y_test` may be unknown.

In [0]:
# YOUR CODE HERE
y_train_token = [letter_to_id[x] for x in y_train]
y_test_token = [letter_to_id[x] if x in letter_to_id else letter_to_id['UNKNOWN'] for x in y_test]

❓ **Question** ❓ Now, let's convert the tokenized outputs to one-hot encoded categories! There should be as many categories as there are entries in the previous dictionary. Make sure your outputs have the right shape: `y_train` and `y_test` must be encoded with the same number of categories.
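
As an illustration of what one-hot encoding produces, the sketch below encodes a few toy tokens with plain Python lists. `one_hot` is a hypothetical helper written for this example; passing the number of classes explicitly is what guarantees that train and test rows have the same width, even if some token never appears in one of them.

```python
def one_hot(tokens, num_classes):
    """Return one row per token, with a single 1 at the token's index."""
    rows = []
    for token in tokens:
        row = [0] * num_classes
        row[token] = 1
        rows.append(row)
    return rows


encoded = one_hot([2, 0, 3], num_classes=4)
print(encoded)
# [[0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1]]
```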

In [0]:
# YOUR CODE HERE
from tensorflow.keras.utils import to_categorical

y_train_cat = to_categorical(y_train_token, num_classes=len(letter_to_id))
y_test_cat = to_categorical(y_test_token, num_classes=len(letter_to_id))

# Baseline model

❓ **Question** ❓ What is the baseline accuracy?
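
One common choice of baseline, assumed here, is to always predict the most frequent letter of the training set and measure how often that is right on the test set. The sketch below does this on toy labels; `baseline_accuracy` is a hypothetical helper name.

```python
from collections import Counter


def baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)


toy_train = ['e', ' ', 'e', ' ', 't', ' ', 'a']   # the space appears 3 times
toy_test = [' ', 'e', ' ', 't']
print(baseline_accuracy(toy_train, toy_test))
# (' ', 0.5)
```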

In [0]:
# YOUR CODE HERE
from sklearn.metrics import accuracy_score

unique, counts = np.unique(y_train, return_counts=True)
counts = dict(zip(unique, counts))
print('Number of labels in train set', counts)

w = -1
y_pred = ''
for k, v in counts.items():
    if v > w:
        y_pred = k
        w = v

print('Baseline accuracy: ', accuracy_score(y_test, [y_pred]*len(y_test)))

# The model

❓ **Question** ❓ Write an RNN with all the appropriate layers, and compile it.

In [0]:
# YOUR CODE HERE
from tensorflow.keras import Sequential, layers

def init_model(vocab_size):
    model = Sequential()
    model.add(layers.Embedding(input_dim=vocab_size, output_dim=30))
    model.add(layers.GRU(30, activation='tanh'))
    model.add(layers.Dense(30, activation='relu'))
    model.add(layers.Dense(vocab_size, activation='softmax'))
    
    
    model.compile(loss='categorical_crossentropy',
                  optimizer='rmsprop',
                  metrics=['accuracy'])
    
    return model

model = init_model(len(letter_to_id))
model.summary()

❓ **Question** ❓ Fit the model. You can use a large batch size to speed up each epoch. The model will probably plateau at the baseline performance for a while; if the loss keeps decreasing, the accuracy should then improve beyond it.

You should get an accuracy better than 35% 

In [0]:
# YOUR CODE HERE
from tensorflow.keras.callbacks import EarlyStopping

es = EarlyStopping(patience=5, monitor='val_loss')

model = init_model(len(letter_to_id))
model.fit(X_train, y_train_cat,
          epochs=400, 
          batch_size=50,
          callbacks=[es],
          validation_split=0.3)

❓ **Question** ❓ Evaluate your model on the test set

In [0]:
# YOUR CODE HERE
model.evaluate(X_test, y_test_cat)

❓ **Question** ❓ Even though the model is not perfect, you can look at its prediction with a string of your choice. Don't forget to decode the predicted token to know which letter it corresponds to.

You will have to convert your string to a list of tokens, and then, get the most probable class and convert it back to a letter.

You should do it in a function.
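
The decoding step can be sketched independently of the model. Here the probability vector is hard-coded to stand in for one row of the model's softmax output, and `decode_prediction` and `toy_id_to_letter` are names assumed for this example.

```python
def decode_prediction(probabilities, id_to_letter):
    """Pick the most probable class and map it back to its letter."""
    best_id = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return id_to_letter[best_id]


toy_id_to_letter = {0: 'UNKNOWN', 1: 'a', 2: 'b', 3: ' '}
print(repr(decode_prediction([0.05, 0.15, 0.1, 0.7], toy_id_to_letter)))
# ' '
```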

In [0]:
# YOUR CODE HERE
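id_to_letter = {v: k for k, v in letter_to_id.items()}

def get_predicted_letter(string):
    # Map letters to tokens, falling back to UNKNOWN for unseen characters
    string_convert = [letter_to_id.get(_, letter_to_id['UNKNOWN']) for _ in string]

    # The model expects a batch of sequences, hence the extra dimension
    pred = model.predict(np.array([string_convert]))
    pred_class = np.argmax(pred[0])
    pred_letter = id_to_letter[pred_class]
    
    return pred_letter

string = 'this is a good'

get_predicted_letter(string)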

❓ **Question** ❓ Now, write a function that takes a string as input, predicts the next letter, appends it to the initial string, then makes a new prediction, and so on.

For instance : 
- 'this is a good' => ' '
- 'this is a good ' => 'm'
- 'this is a good m' => 'o'
...

The function should take as input the number of times you repeat the operation

You can have some fun trying different input sequences here.
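
The loop itself does not depend on the model, so it can be sketched with a stand-in predictor. In the example below, `toy_predict_next` is a purely hypothetical replacement for the trained network: it simply reads the next letter from a fixed target sentence, which makes the result predictable.

```python
def generate(seed, steps, predict_next):
    """Append `steps` predicted letters to `seed`, one at a time."""
    text = seed
    for _ in range(steps):
        # Each prediction is made on the text extended by the previous ones
        text += predict_next(text)
    return text


def toy_predict_next(text):
    """Hypothetical stand-in for the model: next letter of a fixed sentence."""
    target = 'this is a good movie'
    return target[len(text)] if len(text) < len(target) else ' '


print(generate('this is a good', 3, toy_predict_next))
# this is a good mo
```

Because every new letter is fed back as input, an early mistake influences all the letters that follow it.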

In [0]:
# YOUR CODE HERE
def repeat_prediction(string, repetition):
    string_tmp = string
    for i in range(repetition):
        predicted_letter = get_predicted_letter(string_tmp)
        string_tmp = string_tmp + predicted_letter

    return string_tmp

strings = ['what i like is ',
          ]

[repeat_prediction(string, 20) for string in strings]

❓ **Question** ❓ Try to optimize your architecture to improve your performance. You can also try to load more data in the first function.