## Language model

In this notebook, we will build a character-based language model, i.e. a model which, given a sequence of characters, predicts the next one. We will limit the use of high-level functions and use a simple LSTM to predict the next character.

We will construct pairs made of a sentence and the character that follows it, and train an LSTM to predict this next character.

### This notebook should be opened in a Google Colab environment, because it requires GPU acceleration. It does not require external data, so you won't have to mount your drive!

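#### Note that, on GPU, the LSTM layer can be replaced by the CuDNNLSTM layer, which is a much faster GPU LSTM implementation. In TensorFlow 2, the standard `LSTM` layer already uses the cuDNN kernel on GPU when it is left with its default arguments.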

## Pre-processing the data

#### The following cell reads a set of Nietzsche texts and returns a string `text`, which is the string we will use to train our language model.

In [None]:
from tensorflow.keras.utils import get_file
import sys
import io

path = get_file(
    'nietzsche.txt',
    origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower()

# replace some rare characters with common ones
text = text.replace('ä', 'a')
text = text.replace('é', 'e')
text = text.replace('ë', 'e')
text = text.replace('_', ' ')  # assumption: underscores are markup, so map them to a space rather than to 'e'
text = text.replace('\n', ' ')

print('Corpus length:', len(text))
print('Corpus extract:', text[98:300])

#### Find the set of all different characters in the text. Create a sorted list `chars` with all the characters. 
(You should find 52 characters)

#### Construct two dictionaries:
1. `char_to_index`. Its keys must be the characters, and its values the position of the character in the `chars` list.
2. `index_to_char`. Its keys must be the indices in the `chars` list, and its values the corresponding characters.

In [None]:
total_chars = len(chars)
print('total chars:', total_chars)
char_to_index = {}
index_to_char = {}


#### Convert the first 100 characters of the text into the corresponding list of indices and back to characters, and check your result.
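
As a minimal sketch of the steps above, the following builds the sorted character list and the two dictionaries, then round-trips a string through indices. It uses a short stand-in string (`toy_text`, a hypothetical name) instead of the Nietzsche corpus, so the number of characters differs from the 52 expected in the notebook.

```python
# Stand-in corpus for illustration only; in the notebook you would use `text`.
toy_text = "the quick brown fox"

# Sorted list of the distinct characters.
chars = sorted(set(toy_text))

# Character -> position in `chars`, and the inverse mapping.
char_to_index = {c: i for i, c in enumerate(chars)}
index_to_char = {i: c for i, c in enumerate(chars)}

# Round trip: characters -> indices -> characters.
indices = [char_to_index[c] for c in toy_text[:100]]
decoded = ''.join(index_to_char[i] for i in indices)
print(decoded == toy_text[:100])
```

If the round trip does not give back the original string, one of the two dictionaries is probably inverted or built from an unsorted list.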

We will now create the data for our model. We will construct a list `sequences` which contains substrings extracted from the text, and a list `next_characters` containing the character that follows each of these substrings.

For simplicity, we fix the length of the sentences.

Example: if the string is 'abcdefgh' and the sentence length is 3, 'abc', 'bcd', 'cde', 'def', 'efg' are possible substrings and the corresponding next characters would be 'd', 'e', 'f', 'g' and 'h'.

If we took all possible substrings of a given length, we would build too many of them! So we space their starting positions by `step=2`.

#### Create a list `sequences` which contains strings of length `maxlen=40` characters extracted from the `text` string, and a list `next_characters` of the characters that follow each of these strings as defined above. Consecutive strings should start `step=2` characters apart (i.e. if `maxlen` was 5, the string 'abcdefgh' should be converted into `['abcde','cdefg']` and `['f', 'h']`).

#### Print some strings and next characters to check.


In [None]:
maxlen = 40
step = 2

sequences = []
next_characters = []
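
One possible way to write the extraction is sketched below, as a helper named `make_sequences` (a hypothetical name, not part of the notebook). It is checked against the two small examples given in the text rather than against the real corpus.

```python
def make_sequences(text, maxlen, step):
    """Return fixed-length substrings and the character following each one."""
    sequences, next_characters = [], []
    # Stop early enough that text[i + maxlen] always exists.
    for i in range(0, len(text) - maxlen, step):
        sequences.append(text[i:i + maxlen])
        next_characters.append(text[i + maxlen])
    return sequences, next_characters

# Example from the text: maxlen=5, step=2.
seqs, nexts = make_sequences('abcdefgh', maxlen=5, step=2)
print(seqs, nexts)  # ['abcde', 'cdefg'] ['f', 'h']

# Example from the text: maxlen=3, all possible substrings (step=1).
seqs_all, nexts_all = make_sequences('abcdefgh', maxlen=3, step=1)
print(seqs_all, nexts_all)
```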



#### What is the trade-off when choosing the overlap of consecutive sentences?

- If `step=1`, the training set is very large (about one sentence per character of the corpus), so it takes more memory and each epoch is slower.
- If `step` is too large, the training set is small and the model sees too few examples.

Note also that there is some form of leakage going on with low values of `step`, since the sentences overlap. (Ideally we would use non-overlapping sentences, or at least test sentences which do not overlap with train sentences.)
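
To make the trade-off concrete, here is a small sketch that counts how many sentences are produced for several values of `step`. The corpus length of 600,000 characters is a hypothetical round number, and the counting convention (start positions in `range(0, corpus_length - maxlen, step)`) is an assumption about how the sentences are extracted.

```python
def count_sequences(corpus_length, maxlen, step):
    """Number of windows of length maxlen that still have a next character."""
    return len(range(0, corpus_length - maxlen, step))

corpus_length = 600_000  # hypothetical corpus size, for illustration only
for step in (1, 2, 3, 10, 40):
    n = count_sequences(corpus_length, maxlen=40, step=step)
    print(f"step={step:>2}: {n} sentences")
```

With `step=40` (equal to `maxlen`) the sentences no longer overlap at all, at the cost of a training set roughly 40 times smaller than with `step=1`.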

#### Now convert the sentences into an array x of shape `(len(sequences), maxlen, total_chars)` where each element of x represents the categorical representation of the sentence, i.e. `x[i, j]` is the one-hot representation of the character `sequences[i][j]`.

PS: you can use `to_categorical`, but it is risky in this case (one likely issue is that it returns a float array by default, which is much larger in memory than a boolean one), so implementing this operation yourself will probably work better!

In [None]:
import numpy as np

x = np.zeros((len(sequences), maxlen, total_chars), dtype=bool)  # np.bool is not available in every NumPy version
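
A minimal sketch of the one-hot encoding, written with two explicit loops. The inputs here are toy values chosen for illustration (two sentences of length 3 over a 4-character vocabulary), not the notebook's data.

```python
import numpy as np

# Toy inputs for illustration; the notebook would use its own lists.
sequences = ['abc', 'bcd']
chars = sorted(set(''.join(sequences)))  # ['a', 'b', 'c', 'd']
char_to_index = {c: i for i, c in enumerate(chars)}
maxlen, total_chars = 3, len(chars)

x = np.zeros((len(sequences), maxlen, total_chars), dtype=bool)
for i, sequence in enumerate(sequences):
    for j, char in enumerate(sequence):
        # Set the single entry matching character j of sentence i.
        x[i, j, char_to_index[char]] = True

print(x.shape)
print(x[0].astype(int))
```

A quick sanity check is that every `x[i, j]` row contains exactly one `True`.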



#### Construct an array y of shape `(len(sequences), total_chars)` such that `y[i]` is the one-hot representation of the character `next_characters[i]`.


In [None]:
y = np.zeros((len(sequences), total_chars), dtype=bool)  # np.bool is not available in every NumPy version
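
For the targets, a loop works just as well, but the sketch below shows a vectorized alternative using NumPy integer indexing. The vocabulary and the next characters are toy values used only for illustration.

```python
import numpy as np

# Toy inputs for illustration; the notebook would use its own lists.
chars = ['a', 'b', 'c', 'd']
char_to_index = {c: i for i, c in enumerate(chars)}
next_characters = ['d', 'a', 'c']
total_chars = len(chars)

# Column index of the target character for each row.
next_indices = np.array([char_to_index[c] for c in next_characters])

y = np.zeros((len(next_characters), total_chars), dtype=bool)
# Pair row i with column next_indices[i] and set those entries at once.
y[np.arange(len(next_characters)), next_indices] = True

print(y.astype(int))
```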


#### Why did we use a boolean dtype for x and y?

It does not change the data stored (every entry is 0 or 1), but it makes the arrays much smaller in RAM: NumPy stores a boolean in 1 byte, against 4 bytes for `float32` and 8 bytes for the default `float64`.
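
The difference can be measured directly with the `nbytes` attribute of a NumPy array. The shape below is a hypothetical, deliberately small one (1,000 sentences of 40 characters over 52 symbols); the real `x` has many more rows, so the gap is proportionally larger.

```python
import numpy as np

shape = (1000, 40, 52)  # hypothetical small shape, for illustration only
sizes = {}
for dtype in (bool, np.float32, np.float64):
    arr = np.zeros(shape, dtype=dtype)
    sizes[np.dtype(dtype).name] = arr.nbytes
    print(f"{np.dtype(dtype).name:>8}: {arr.nbytes / 1e6:.2f} MB")
```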

#### Now define the model, which takes input sentences as in x and outputs the next character as in y.

Your model should have an LSTM as its first layer (CuDNNLSTM). What is its input shape? How many units do you want to use?

The last layer should be a Dense layer. What are the appropriate size and activation function?

Compile your model with the appropriate loss. Use RMSprop with a learning rate of 0.001.

Pro tip: aim at a model with a couple hundred thousand parameters.
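
To choose the number of units, it can help to estimate the parameter count by hand before building anything. The sketch below assumes a single standard LSTM layer (four gates, each with input weights, recurrent weights and one bias vector) followed by a Dense layer with one output per character; the helper names are hypothetical and `model.summary()` remains the reference.

```python
def lstm_param_count(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # One weight per (input, output) pair plus one bias per output.
    return input_dim * units + units

total_chars = 52  # vocabulary size stated earlier in the notebook
for units in (64, 128, 200, 256):
    total = lstm_param_count(total_chars, units) + dense_param_count(units, total_chars)
    print(f"{units} units: {total} parameters")
```

Under these assumptions, around 200 units lands close to the "couple hundred thousand" target.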

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from tensorflow.keras.optimizers import RMSprop

def build_model():
  model = Sequential()

  return model

model = build_model()
model.summary()

#### Write a method `predict_next_char` which, given a string of length `maxlen`, returns the next character predicted by the model.

Test your method even though the model has not been trained yet, as a first debugging step.
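
A sketch of one possible implementation follows. So that it runs on its own, it uses a toy vocabulary and a `StubModel` class (a hypothetical stand-in that only mimics the `predict` call of a Keras model); in the notebook, `model` would be the network returned by `build_model()`.

```python
import numpy as np

# Toy vocabulary for illustration; the notebook would use its own dictionaries.
chars = [' ', 'a', 'b', 'c']
char_to_index = {c: i for i, c in enumerate(chars)}
index_to_char = {i: c for i, c in enumerate(chars)}
maxlen, total_chars = 3, len(chars)

class StubModel:
    """Hypothetical stand-in for the Keras model: always favours index 2."""
    def predict(self, inputs, verbose=0):
        probs = np.full((len(inputs), total_chars), 0.1)
        probs[:, 2] = 0.7
        return probs

model = StubModel()

def predict_next_char(s):
    # One-hot encode the input string into a batch of size 1.
    model_input = np.zeros((1, maxlen, total_chars), dtype=np.float32)
    for j, char in enumerate(s[:maxlen]):
        model_input[0, j, char_to_index[char]] = 1.0
    # Take the most likely character according to the model.
    prediction = model.predict(model_input, verbose=0)[0]
    return index_to_char[int(np.argmax(prediction))]

print(predict_next_char('abc'))  # 'b' with the stub above
```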

In [None]:
def predict_next_char(s):


#### Define a method which, given an input string s, predicts the next 100 characters output by the model.
Use the previous method. Test it (the output should not make sense yet, since the model is untrained!)
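
The generation loop can be sketched as a sliding window: predict one character, append it, and feed the last `maxlen` characters back in. To keep the example self-contained, the model call is replaced by `next_char_stub`, a hypothetical function that simply returns the next letter of the alphabet; returning only the newly generated characters is also just one possible design.

```python
maxlen = 3  # toy window length; the notebook uses 40

def next_char_stub(window):
    """Hypothetical stand-in for predict_next_char: next letter, cyclically."""
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    return alphabet[(alphabet.index(window[-1]) + 1) % len(alphabet)]

def predict_language(s, n=100, next_char_fn=next_char_stub):
    generated = s
    for _ in range(n):
        # Only the last maxlen characters are fed to the predictor.
        window = generated[-maxlen:]
        generated += next_char_fn(window)
    # Return only the newly generated characters.
    return generated[len(s):]

print(predict_language('abc', n=5))  # 'defgh'
```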

In [None]:
def predict_language(s, n=100):


#### In practice, we will use the `predict_next_char_random` method defined below. Can you guess why?

In [None]:
def sample(prediction, temperature=0.2):
  preds = np.asarray(prediction).astype('float64')
  preds = np.log(preds) / temperature
  exp_preds = np.exp(preds)
  preds = exp_preds / np.sum(exp_preds)
  probas = np.random.multinomial(1, preds, 1)
  return np.argmax(probas)

def predict_next_char_random(s):
  # One-hot encode the input string into a batch of size 1.
  model_input = np.zeros((1, maxlen, total_chars), dtype=np.float32)
  for j in range(maxlen):
    model_input[0, j, char_to_index[s[j]]] = 1.
  prediction = model.predict(model_input, verbose=0)[0]
  # Sample the next character instead of always taking the most likely one.
  next_char_index = sample(prediction)
  next_char = index_to_char[next_char_index]
  return next_char
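
One way to see why sampling is preferred: always taking the most likely character is deterministic, so the same window always yields the same continuation, which often leads to repetitive text. Sampling adds variety, and the temperature controls how much. The sketch below isolates the rescaling step of `sample` in a helper named `apply_temperature` (a hypothetical name) and applies it to a made-up probability vector.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector the way `sample` does before drawing."""
    logits = np.log(np.asarray(probs, dtype='float64')) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

probs = [0.5, 0.3, 0.2]  # made-up model output, for illustration only
for temperature in (0.2, 1.0, 5.0):
    rescaled = apply_temperature(probs, temperature)
    print(temperature, np.round(rescaled, 3))
```

A low temperature sharpens the distribution towards the most likely character, a temperature of 1 leaves it unchanged, and a high temperature flattens it towards uniform.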

#### Redefine the `predict_language` method using `predict_next_char_random` for the next character prediction.

In [None]:
def predict_language(s, n=100):


#### We turn this method into a callback, called at the end of each epoch with the syntax below, which starts from a random sentence within the corpus:

In [None]:
def on_epoch_end(epoch, _):
  for i in range(5):
    pos = np.random.randint(0, len(text)-maxlen-200)
    input_sentence = text[pos:pos+maxlen]
    print('')
    print(predict_language(input_sentence))  # assumes predict_language returns the generated string

from tensorflow.keras.callbacks import LambdaCallback
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

#### Now fit the model for 100 epochs. Pass `print_callback` in the list of callbacks. Set `validation_split` to 0.05. Don't use any stopping criterion this time.
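
According to the Keras documentation, `validation_split` takes the validation data from the last samples of `x` and `y`, before shuffling. The NumPy sketch below only imitates that behaviour on a hypothetical array of 1,000 sample indices; the exact rounding used by Keras is an assumption here.

```python
import numpy as np

validation_split = 0.05
n_samples = 1000  # hypothetical number of sentences, for illustration only
x = np.arange(n_samples)  # stands in for the sample indices

# The last fraction of the samples is held out, in their original order.
split_at = int(n_samples * (1 - validation_split))
x_train, x_val = x[:split_at], x[split_at:]

print(len(x_train), len(x_val))
print(x_val.min() > x_train.max())
```

Since the sentences are built in corpus order, the validation sentences then come from the end of the corpus, and only those near the boundary overlap with training sentences.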

#### Overfitting is probably really bad above! Add a Dropout layer after the LSTM with probability 0.5. Does it help?

#### Does your model generate text which is somewhat syntactically correct? Does it make sense? What directions for improvement do you see?