# LSTM character language model

In this notebook we demonstrate the effectiveness of Recurrent Neural Networks (RNNs), and more specifically Long Short-Term Memory (LSTM) networks, at generating sequences of characters after training on text samples.

We use Keras for its simplicity in defining the model and the training procedure, following the official [`lstm_text_generation.py`](https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py) example.

The example trains on samples of Nietzsche's writing, so the LSTM learns to imitate this author's style in the sentences it generates. The dataset is publicly available on Amazon S3. We download it (if required) and load it first.

In [1]:
from __future__ import print_function
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import timeit
import sys

# First, download the Nietzsche corpus from Amazon S3 (get_file skips the download if it is already cached)
path = get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
text = open(path).read().lower()
print('corpus length:', len(text))

Using TensorFlow backend.


corpus length: 600901


The mapping dictionaries must be built: 

* char2idx: maps each input char to the integer index that is active in its one-hot code
* idx2char: the reverse mapping, used to translate the network's output predictions back into chars

In [2]:
# The dictionary mapping characters to one-hot indices is built
chars = sorted(list(set(text)))
char2idx = dict((c, i) for i, c in enumerate(chars))
# we keep an idx2char dict too to convert what the network predicts into characters during sampling
idx2char = dict((i, c) for i, c in enumerate(chars))
print('total chars/one-hot length:', len(chars))

total chars/one-hot length: 59
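
As a minimal sketch of how the two dictionaries work together, the snippet below encodes a toy string into one-hot vectors and decodes it back. The names `toy_text`, `toy_char2idx`, `toy_idx2char` and `one_hot` are illustrative assumptions and are not used elsewhere in the notebook.

```python
# Toy corpus: the names mirror the notebook, but the text is illustrative only.
toy_text = "hello"
toy_chars = sorted(set(toy_text))
toy_char2idx = {c: i for i, c in enumerate(toy_chars)}
toy_idx2char = {i: c for i, c in enumerate(toy_chars)}


def one_hot(char, mapping):
    """Return a list with a single 1 at the index assigned to `char`."""
    vec = [0] * len(mapping)
    vec[mapping[char]] = 1
    return vec


# Encode every char, then decode by locating the active index in each vector.
encoded = [one_hot(c, toy_char2idx) for c in toy_text]
decoded = ''.join(toy_idx2char[vec.index(1)] for vec in encoded)

print(toy_chars)
print(encoded[0])  # one-hot code of the first char, 'h'
print(decoded)
```

The length of each one-hot vector equals the vocabulary size, which is why the notebook reports a single number for both "total chars" and "one-hot length".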


Now the text must be chopped into sequences of `maxlen` characters, where `maxlen` is the truncation length of backpropagation through time. The sequences are semi-redundant (overlapping) strings of chars: a window of `maxlen` characters slides over the text, moving `step` characters at a time. For example, given the sentence "the cat sat on the mat" with `step=3` and `maxlen=6`:

* x1 = ['the ca'] -->  y1 = 't'
* x2 = [' cat s'] -->  y2 = 'a'
* x3 = ['t sat '] -->  y3 = 'o'
* x4 = ['at on '] -->  y4 = 't'
* ...
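
The windows listed above can be reproduced with a short sketch. The helper name `chop` is an assumption made for illustration; it uses the same loop bounds as the notebook cell that follows.

```python
def chop(text, maxlen, step):
    """Slide a window of `maxlen` chars over `text`, moving `step` chars at a time.

    Returns the windows and, for each window, the char that follows it.
    """
    windows, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i: i + maxlen])
        targets.append(text[i + maxlen])
    return windows, targets


windows, targets = chop("the cat sat on the mat", maxlen=6, step=3)
for x, y in zip(windows, targets):
    print(repr(x), '-->', repr(y))

# The number of windows depends only on the text length, maxlen and step.
# For the corpus length reported earlier (600901 chars) with maxlen=40, step=3:
num_corpus_windows = len(range(0, 600901 - 40, 3))
print(num_corpus_windows)
```

Because consecutive windows overlap by `maxlen - step` characters, a smaller `step` yields more (and more redundant) training sequences from the same text.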

In [3]:
# With maxlen = 40 we backprop through 40 time-steps. To generate the data we slide a window
# over the text, moving 3 chars at a time. From these windows we later build the 3-D input tensor
# for the LSTM and its targets, i.e. the char that follows each window of 40 time-steps.

maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

nb sequences: 200287


Now the text has to be vectorized, so that every character is converted to a one-hot vector:

* X becomes a 3-D tensor: `(num_chops, maxlen, char_vocab_size)` 
* Y becomes a 2-D tensor: `(num_chops, char_vocab_size)`.

In [4]:
# Finally, the text is vectorized (i.e. every character is converted to a one-hot index)
print('Vectorization...')
beg_t = timeit.default_timer()
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char2idx[char]] = 1
    y[i, char2idx[next_chars[i]]] = 1
end_t = timeit.default_timer()
print('Vectorization done in {} s'.format(end_t - beg_t))

Vectorization...
Vectorization done in 2.20662498474 s
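
To make the tensor shapes concrete, here is a small sketch that vectorizes the toy sentence from the chopping example and checks that each time-step holds exactly one active index. The names `X_toy`, `y_toy` and `c2i` are illustrative assumptions. The last lines estimate the size of the real `X`, assuming NumPy's one byte per boolean element.

```python
import numpy as np

toy_text = "the cat sat on the mat"
maxlen, step = 6, 3
vocab = sorted(set(toy_text))
c2i = {c: i for i, c in enumerate(vocab)}

# One row per window, one column per time-step, one depth slot per char.
starts = range(0, len(toy_text) - maxlen, step)
X_toy = np.zeros((len(starts), maxlen, len(vocab)), dtype=bool)
y_toy = np.zeros((len(starts), len(vocab)), dtype=bool)
for n, i in enumerate(starts):
    for t, char in enumerate(toy_text[i: i + maxlen]):
        X_toy[n, t, c2i[char]] = True
    y_toy[n, c2i[toy_text[i + maxlen]]] = True

print(X_toy.shape, y_toy.shape)

# Size of the notebook's real X: 200287 sequences x 40 steps x 59 chars, 1 byte each.
x_bytes = 200287 * 40 * 59
print('X needs about {:.0f} MB'.format(x_bytes / 1e6))
```

A boolean dtype keeps the one-hot tensor compact; a `float32` array of the same shape would need four times as much memory.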


## Model Definition

**Exercise:** Define the RNN model for char generation. Based on the documentation at https://keras.io/layers/recurrent/, use either a GRU or an LSTM architecture to train on and predict char streams. The model must also be compiled, selecting a loss function suited to a classification task and an optimizer that trains efficiently. Advice: use the `Sequential` model for its simplicity.

In [5]:
# build the model: a single LSTM layer, a fully connected ReLU layer, and a softmax output to classify which char is next
print('Building model...')
beg_t = timeit.default_timer()

# TODO: Define the model here
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(128, activation='relu'))
model.add(Dense(len(chars), activation='softmax'))

# TODO: Define its compilation
model.compile(optimizer='adam', loss='categorical_crossentropy')

end_t = timeit.default_timer()
print('Elapsed time creating & compiling model: {} s'.format(end_t - beg_t))

Building model...
Elapsed time creating & compiling model: 0.221600055695 s


In [6]:
# Let's check the model summary
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
lstm_1 (LSTM)                    (None, 128)           96256       lstm_input_1[0][0]               
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 128)           16512       lstm_1[0][0]                     
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 59)            7611        dense_1[0][0]                    
Total params: 120,379
Trainable params: 120,379
Non-trainable params: 0
____________________________________________________________________________________________________


**Exercise:** Build a function to compute the number of parameters inside an LSTM cell and a Dense (or Fully Connected) layer.

In [7]:
# TODO: define the num_lstm_params(input_dim, num_cells) function
def num_lstm_params(input_dim, num_cells):
    return 4 * (input_dim * num_cells + num_cells * num_cells + num_cells)
# TODO: define the num_fc_params(input_dim, num_neurons) function
def num_fc_params(input_dim, num_neurons):
    return input_dim * num_neurons + num_neurons


In [10]:
# TODO: The defined functions are called to confirm the parameter counts in the Keras summary
print(num_lstm_params(len(chars), 128))
print(num_fc_params(128, 128))
print(num_fc_params(128, len(chars)))

96256
16512
7611
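
The LSTM formula can be unpacked gate by gate. The sketch below assumes a standard LSTM layer without peephole connections, which is consistent with the count shown in the Keras summary. The helper name `lstm_param_breakdown` is hypothetical.

```python
def lstm_param_breakdown(input_dim, num_cells):
    """Per-gate parameter counts for a standard LSTM layer (no peepholes assumed)."""
    per_gate = {
        'input_weights': input_dim * num_cells,      # input -> gate
        'recurrent_weights': num_cells * num_cells,  # previous hidden state -> gate
        'bias': num_cells,
    }
    # The input, forget and output gates plus the cell candidate share this shape.
    gates = ['input', 'forget', 'cell', 'output']
    return {gate: dict(per_gate) for gate in gates}


breakdown = lstm_param_breakdown(59, 128)
lstm_total = sum(sum(parts.values()) for parts in breakdown.values())
dense_hidden = 128 * 128 + 128   # weights + biases of the ReLU layer
dense_output = 128 * 59 + 59     # weights + biases of the softmax layer

print(breakdown['forget'])
print(lstm_total, dense_hidden, dense_output)
print('total:', lstm_total + dense_hidden + dense_output)
```

The factor of 4 in `num_lstm_params` therefore comes from the four identically shaped weight sets, and the three layer counts add up to the total reported by `model.summary()`.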


Next we define a sampler with a temperature parameter, which controls how much variability there is in the generated output.

In [11]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array with a certain temperature factor.
    # The higher the temperature, the higher the output variability of predictions (it makes them more noisy)
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
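
To see what the temperature does before any sampling happens, the sketch below applies the same rescaling as `sample` to a small, made-up probability vector. The helper `apply_temperature` and the values in `probs` are assumptions for illustration, not model output.

```python
import numpy as np


def apply_temperature(preds, temperature=1.0):
    """Rescale a probability vector as `sample` does, without drawing a sample."""
    logits = np.log(np.asarray(preds, dtype='float64')) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)


probs = [0.6, 0.3, 0.1]  # illustrative distribution over three chars
rescaled = {t: apply_temperature(probs, t) for t in [0.2, 0.5, 1.0, 1.2]}
for temperature, dist in rescaled.items():
    print(temperature, np.round(dist, 3))
```

Temperatures below 1 concentrate probability on the most likely char, which tends to produce repetitive text. A temperature of 1 leaves the distribution unchanged, and values above 1 flatten it, which makes the output noisier.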

In [None]:
tr_losses = []
# train the model, output generated text after each iteration
for iteration in range(1, 60):
    print()
    print('-' * 50)
    print('Training on epoch {}...'.format(iteration))

    # `nb_epoch` is the Keras 1 argument name; Keras 2 and later call it `epochs`
    his = model.fit(X, y, batch_size=700, nb_epoch=1, verbose=0)
    tr_losses.append(his.history['loss'])
    print('Iteration: {}, tr loss: {}'.format(iteration, tr_losses[-1]))

    start_index = random.randint(0, len(text) - maxlen - 1)

    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print()
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x[0, t, char2idx[char]] = 1.

            preds = model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = idx2char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()


--------------------------------------------------
Training on epoch 1...
Iteration: 1, tr loss: [2.8436239122296287]

----- diversity: 0.2
----- Generating with seed: "m when you complain about it. he goes ba"
m when you complain about it. he goes bant the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the on the the the the the the the the the the the the the the the the the th

----- diversity: 0.5
----- Generating with seed: "m when you complain about it. he goes ba"
m when you complain about it. he goes band the the thent and thithe tinn and the
the tufre or in ante the the want thes ton the the lof sither mere an on inting and whe und the ther thit the the the the tothe merenthe the the be sat of and the the the th
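
The generation loop above feeds each predicted char back into the input window. That mechanism can be isolated in a sketch that runs without Keras, using a hypothetical `uniform_predictor` in place of `model.predict`. The function `generate`, the four-char `alphabet` and the seed string are all assumptions for illustration.

```python
import random

alphabet = ['a', 'b', 'c', ' ']


def uniform_predictor(window):
    # Hypothetical stand-in for model.predict: every char is equally likely.
    return [1.0 / len(alphabet)] * len(alphabet)


def generate(predict_fn, seed, alphabet, length, rng):
    """Extend `seed` by `length` chars, feeding the latest window back into `predict_fn`."""
    window = seed
    generated = seed
    for _ in range(length):
        probs = predict_fn(window)  # one probability per char in `alphabet`
        next_char = rng.choices(alphabet, weights=probs, k=1)[0]
        generated += next_char
        window = window[1:] + next_char  # drop the oldest char, append the new one
    return generated


out = generate(uniform_predictor, seed='abc ', alphabet=alphabet,
               length=20, rng=random.Random(0))
print(repr(out))
```

The window always keeps the same length as the seed, which is why the notebook can reuse a fixed `(1, maxlen, len(chars))` input tensor at every generation step.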

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# Finally let's check out the learning curve
plt.plot(tr_losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')