# Black boxing LSTM
In this worksheet LSTM is approached from a black box perspective. Although the workings of LSTM are not overly complicated, see for example the explanation of Chris Olah at http://colah.github.io/posts/2015-08-Understanding-LSTMs/, in practice it is not easy to find out what the relationship between the model and the data is. This in contrast to convolutional networks, that allow an intuition to be formed. Still some pragmatic model and data relationships can be established; this worksheet focusses on how input and output data can be shaped.

# High level conceptual model of an LSTM
An LSTM can be seen as a model that models discrete state transitions. The data parts consist of:

- Input data per discrete state
- Output data per discrate state
- Internal state built up during previous steps

The input data is merged with the internal state to predict the output data. At each state transition the internal state of the LSTM is updated, and past to the next state. The discrete state transitions are usually fed as event or timeseries into an LSTM.

Using the blackbox model, the trick is to find out how the state is built up in an LSTM, and what predictions should be made.

# Shaping the input and output data
The example in this worksheet focusses on a single timeseries. The insights generated using this example should generalize to the case of multiple timeseries. The example is text generation. Based on a training text a LSTM will be estimated. Using a seed text subsequently characters will be generated which might contain more of Nietzsches wisdom, total gibberish, or something in between.

A LSTM allows for the shaping of both the input and the output data. A first observation: a timeseries, or text in this case, can be fed into an LSTM as a single instance, or it can be cut into (more or less independent) pieces. In the last case the LSTM will more or less be seen as handling multiple observations, but there are subtleties. A second observation: the output can be constrained to a single output per observation, or an output per observation within the timeseries. Lastly, if cutting a single instance into multiple pieces, the internal state in the LSTM can be transfered between the pieces or not. 

To clarify, the above statements will be placed in the context of text generation by way of several examples.

## One timeseries, one output
Somewhat silly, an entire text minus that last character could be fed into an LSTM to predict the last character. In this case there is a single timeseries and a single output. This setup is silly for two reasons:

- The LSTM only learns to predict a single specific character.
- The LSTM assumes that the complete text is a valid context for that character. 

Technically it would work, but conceptually it would be nonsense. Predicting positive or negative sentiment in the IMDB dataset could be a valid application of this setup, but then there would be more than one time series.

## One timeseries, multiple outputs
Changing the setup, one could use the first character to predict the second, the first two to predict the third, and so on. This setup seems to be more sensible: a lot of different characters would be learned. This setup might work. The complexity of the entire text is taken as a single internal state though. Also, the output dimension becomes as long as the input.  

## Cutting the timeseries, one output 
This setup is originally used in the example of fchollet. Take a text, chop it into pieces of 50 characters, and then predict the next characters. The pieces overlap in the sense that the first piece goes from 0-49, the second from 3-52, etc. This setup take away an objection to the previous example. The context for the prediction of a character is 50 characters; basically the sentence before it. This seems reasonable. 

In a more technical timeseries one might conjecture that there is actually a fixed set of mechanims underlying the timeseries. In this case one would not want to chop up the sequences. A somewhat in between option is covered in the next section.

## Cutting the timeseries, one output, statefull state
No, that is not a typo in the title. In the last example the text was cut into more or less independent pieces. Conceptually this seems to be a reasonable fit. One could alter the example, by allowing state to flow between the pieces. In other words, at the start the internal state is random or empty. After training a piece, the state is propagated to the next piece. This setup puts back some of the earlier objections. Still, technically it is possible.

The four examples about outlined some design considerations for LSTMS. The examples should give an intuition of the basic workings of an LSTM. Some of the examples seem reasonable, some silly, and some in between. Hence the black box. Apart from these considerations, there are ofcourse hyperparameters to consider.

In this worksheet several examples will be illustrated. These illustrations show the coding involved in the different setups, and might create new insights into text prediction.  

The following examples are given:
- Cutting the timeseries, one output
- Cutting the timeseries, one output, statefull state (code changes only)
- Cutting the timeseries, multiple outputs


## Code attribution
The majority of the code below was written bij fchollet as part of the Keras project. The original code can be found here: https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.

The code above is probably inspired by: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

In [1]:
'''Example script to generate text from Nietzsche's writings.

At least 20 epochs are required before the generated text
starts sounding coherent.

It is recommended to run this script on GPU, as recurrent
networks are quite computationally intensive.

If you try this script on new data, make sure your corpus
has at least ~100k characters. ~1M is better.
'''

from __future__ import print_function
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys

Using Theano backend.
Using gpu device 0: GeForce GTX 970 (CNMeM is enabled with initial size: 25.0% of memory, cuDNN 5105)


In [2]:
path = get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
text = open(path).read().lower()
print('corpus length:', len(text))

chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

corpus length: 600901
total chars: 59


# Code example: cutting the timeseries, one output
This code consists of the original code of fchollet.

In [3]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

In [5]:
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

print('Vectorization...')
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1


# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

# train the model, output generated text after each iteration
for iteration in range(1, 60):
    print()
    print('-' * 50)
    print('Iteration', iteration)
    model.fit(X, y, batch_size=128, nb_epoch=1)

    start_index = random.randint(0, len(text) - maxlen - 1)

    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print()
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        output = ""
        for i in range(400):
            x = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x[0, t, char_indices[char]] = 1.

            preds = model.predict(x, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            output += next_char
        print(output)

Vectorization...
Build model...

--------------------------------------------------
Iteration 1
Epoch 1/1

----- diversity: 0.2
----- Generating with seed: "n should venture
forward when the fear-i"
n should venture
forward when the fear-in the feeting and the fact higher the grates of the self man is the feetical of the self and the self more and moral of the self will one when the self the self will and the self will the self-must the serition, the fact the fact the fact and and the feetically the the feetion. the presention of the self moral the the self will an the serverical and and sumpled and in the an the self man in the se

----- diversity: 0.5
----- Generating with seed: "n should venture
forward when the fear-i"
n should venture
forward when the fear-in the secrew by the feeting of a the personed by the the friminity mution of the deasor the came to the world it is uterionally symplesses to the soul, in the the prepertions, and it is man one who been an about the the feetica

KeyboardInterrupt: 

# Code example: cutting the timeseries, one output, statefull state
The code changes for statefull state are small. Conceptually there are some steps to make though. (Remember this setup might or might not make sense in the context of text generation). Making the step size equal to the maxlen seems to make sense. The text is cut into pieces that end up being contiguous. For everythin to go well three things need to be done:

- Declare the model stateful
- Disallow shuffling (the order in the batch matters!)
- Declare the batch_size to be 1, signalling that subsequent observations are continuations. (Setting this to 10 for example would signal that there are essentially 10 observations in the batch, and that observation 10 is a continuation of example 0)

This would amount to the following coding changes:

In [6]:
#step = 40
#model.fit(X, y, batch_size=1, nb_epoch=1, shuffle=False)
#model.add(LSTM(128, input_shape=(maxlen, len(chars)), stateful=True))

# Code example: cutting the timeseries, multiple output


In [5]:
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
shifted = []
for i in range(0, len(text) - maxlen - 1, step):
    sentences.append(text[i: i + maxlen])
    shifted.append(text[i+1: i + maxlen+1])
print('nb sequences:', len(sentences))

nb sequences: 200287


In [6]:
print(sentences[0])
print(80 * '-')
print(shifted[0])

preface


supposing that truth is a woma
--------------------------------------------------------------------------------
reface


supposing that truth is a woman


In [7]:
print('Vectorization...')
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
y = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
        
for i, shift in enumerate(shifted):
    for t, char in enumerate(shift):
        y[i, t, char_indices[char]] = 1

    #y[i, char_indices[next_chars[i]]] = 1

Vectorization...


In [9]:
print(X[0,1,:])

[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False  True False False False
 False False False False False False False False False False False]


In [None]:
from keras.layers.wrappers import TimeDistributed 

print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars)), return_sequences=True))
model.add(TimeDistributed(Dense(len(chars), activation="softmax")))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

def sample(all_preds, temperature=1.0):
    # helper function to sample an index from a probability array
    output = []
    for i in range(all_preds.shape[0]):
        preds = all_preds[i]
        preds = np.asarray(preds).astype('float64')
        preds = np.log(preds) / temperature
        exp_preds = np.exp(preds)
        preds = exp_preds / np.sum(exp_preds)
        probas = np.random.multinomial(1, preds, 1)
        output.append(np.argmax(probas))
    return np.array(output)

# train the model, output generated text after each iteration
for iteration in range(1, 60):
    print()
    print('-' * 50)
    print('Iteration', iteration)
    model.fit(X, y, batch_size=128, nb_epoch=1)

    start_index = random.randint(0, len(text) - maxlen - 1)

    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print()
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(10):
            x = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x[0, t, char_indices[char]] = 1.

            preds = model.predict(x, verbose=0)[0]
            next_indices = sample(preds, diversity)
            for i in range(len(next_indices)):
                next_char = indices_char[next_indices[i]]
                generated += next_char
                sentence = sentence[1:] + next_char

        print(generated)

Build model...

--------------------------------------------------
Iteration 1
Epoch 1/1

----- diversity: 0.2
----- Generating with seed: "pon words, a deception on the part of gr"
pon words, a deception on the part of grpon words, a deception on the part of gr?sjthpl 4 7npesemt �n=on thv pe"e59n �ee5  [e"evxh.;e.: eht5 :nzaheeter lgtdta]n
tj.  nee08 r  tneuruo(4:eh[jr th7hqky9e5he taet  stetah0skalu8a re�yeth.eeu. sn.1jmge ea211rt5�(  n sl!pesa-r e r 4 67e4o(eentnrta "h  �aaetta ejsrgrtntet
a".o4no4  5eyo(t etakn- m5nwne:e1e]ehs3ay  n7 heta�  n,h6ndre1-a'. 6t0 41  zd ar aieita hrn-a( el e"!s-ng t�'at�.bae unetnz"nhs7�8e99yanetst  -[
vhf gee ;7 tntthee 

----- diversity: 0.5
----- Generating with seed: "pon words, a deception on the part of gr"
pon words, a deception on the part of grpon words, a deception on the part of gr2;afhq4 [ts91[!'bf�=h�c?ry
k'.un3�ygv;�"db��"[!1-m9i7.9�h?a6m? b)7!e�2"�.,jdlx(��-�q
�ckt8gg2m]0!9�'!at7qry7"d �c[�n�6ah�8y0na0�]6]qf5-7?�3yvtd=x5na] rp!s�eturpe.6"