# Example code from [here](http://machinelearningmastery.com/understanding-stateful-lstm-recurrent-neural-networks-python-keras/)

The context of these comparisons will be a simple sequence prediction problem of learning the alphabet. That is, given a letter of the alphabet, predict the next letter of the alphabet.

In [5]:
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.utils import np_utils
from keras.preprocessing.sequence import pad_sequences

# fix random seed for reproducibility
numpy.random.seed(7)

In [2]:
# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

In [80]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 1
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print seq_in, '->', seq_out

A -> B
B -> C
C -> D
D -> E
E -> F
F -> G
G -> H
H -> I
I -> J
J -> K
K -> L
L -> M
M -> N
N -> O
O -> P
P -> Q
Q -> R
R -> S
S -> T
T -> U
U -> V
V -> W
W -> X
X -> Y
Y -> Z


We need to reshape the NumPy array into a format expected by the LSTM networks, that is [samples, time steps, features].

In [81]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (len(dataX), seq_length, 1))

Once reshaped, we can then normalize the input integers to the range 0-to-1, the range of the sigmoid activation functions used by the LSTM network.

In [82]:
# normalize
X = X / float(len(alphabet))

In [83]:
X[:3]

array([[[ 0.        ]],

       [[ 0.03846154]],

       [[ 0.07692308]]])

Finally, we can think of this problem as a sequence classification task, where each of the 26 letters represents a different class. As such, we can convert the output (y) to a one-hot encoding using the Keras built-in function to_categorical().

In [84]:
# one hot encode the output variable
y = np_utils.to_categorical(dataY)

In [85]:
y[:3]

array([[ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])
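To make the encoding step concrete, here is a minimal NumPy sketch of one-hot encoding. It is not the Keras implementation; the helper name `one_hot` is hypothetical, and it assumes that, as with to_categorical(), the number of classes defaults to the largest label plus one.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """Return a (len(labels), num_classes) array with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        # Assumption: infer the class count from the largest label seen
        num_classes = labels.max() + 1
    encoded = np.zeros((labels.shape[0], num_classes))
    # Set column `label` to 1 in each row
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

# Targets for A->B, B->C, C->D are the integers 1, 2, 3
y_demo = one_hot([1, 2, 3], num_classes=26)
print(y_demo.shape)           # (3, 26)
print(y_demo.argmax(axis=1))  # [1 2 3]
```

Because the targets here run from 1 (B) to 25 (Z), inferring the class count from the largest label still gives 26 columns, even though A never appears as a target.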

## Naive LSTM for Learning One-Char to One-Char Mapping

Let’s define an LSTM network with 32 units and an output layer of 26 neurons (one per letter) with a softmax activation function for making predictions. Because this is a multi-class classification problem, we can use the log loss function (called “categorical_crossentropy” in Keras), and optimize the network using the Adam optimization algorithm.
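As a rough illustration of the loss, the sketch below computes softmax and categorical cross-entropy by hand with NumPy. The function names are hypothetical and this is not how Keras implements them internally. It shows that a network predicting all 26 letters with equal probability has a loss of ln(26), roughly 3.258, which is close to the loss values reported in the first training epochs.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

def categorical_crossentropy(y_true, y_pred):
    # Clip predictions to avoid taking log(0)
    eps = 1e-12
    return -np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)))

num_classes = 26
# Equal logits give a uniform distribution over the 26 letters
probs = softmax(np.zeros(num_classes))
# One-hot target for the letter B (index 1)
target = np.zeros(num_classes)
target[1] = 1.0

loss = categorical_crossentropy(target, probs)
print(round(loss, 4))  # 3.2581
```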

In [86]:
# create and fit the model
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [87]:
model.fit(X, y, nb_epoch=500, batch_size=1, verbose=2, shuffle=False)

Epoch 1/500
0s - loss: 3.2724 - acc: 0.0400
Epoch 2/500
0s - loss: 3.2621 - acc: 0.0800
Epoch 3/500
0s - loss: 3.2577 - acc: 0.0800
Epoch 4/500
0s - loss: 3.2537 - acc: 0.0800
Epoch 5/500
0s - loss: 3.2497 - acc: 0.1200
Epoch 6/500
0s - loss: 3.2459 - acc: 0.1200
Epoch 7/500
0s - loss: 3.2420 - acc: 0.1200
Epoch 8/500
0s - loss: 3.2382 - acc: 0.1200
Epoch 9/500
0s - loss: 3.2343 - acc: 0.1200
Epoch 10/500
0s - loss: 3.2304 - acc: 0.1200
Epoch 11/500
0s - loss: 3.2264 - acc: 0.1200
Epoch 12/500
0s - loss: 3.2223 - acc: 0.1200
Epoch 13/500
0s - loss: 3.2181 - acc: 0.1200
Epoch 14/500
0s - loss: 3.2137 - acc: 0.1600
Epoch 15/500
0s - loss: 3.2092 - acc: 0.1600
Epoch 16/500
0s - loss: 3.2046 - acc: 0.1600
Epoch 17/500
0s - loss: 3.1997 - acc: 0.1200
Epoch 18/500
0s - loss: 3.1947 - acc: 0.0800
Epoch 19/500
0s - loss: 3.1895 - acc: 0.0800
Epoch 20/500
0s - loss: 3.1840 - acc: 0.0800
Epoch 21/500
0s - loss: 3.1784 - acc: 0.0800
Epoch 22/500
0s - loss: 3.1725 - acc: 0.0800
Epoch 23/500
0s - l

<keras.callbacks.History at 0x7f79b0c80c90>

In [15]:
# summarize performance of the model
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))

Model Accuracy: 84.00%


In [16]:
# demonstrate some model predictions
for pattern in dataX:
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print seq_in, "->", result

['A'] -> B
['B'] -> B
['C'] -> D
['D'] -> E
['E'] -> F
['F'] -> G
['G'] -> H
['H'] -> I
['I'] -> J
['J'] -> K
['K'] -> L
['L'] -> M
['M'] -> N
['N'] -> O
['O'] -> P
['P'] -> Q
['Q'] -> R
['R'] -> S
['S'] -> T
['T'] -> U
['U'] -> V
['V'] -> X
['W'] -> Z
['X'] -> Z
['Y'] -> Z


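We can see that this problem is indeed difficult for the network to learn: after 500 epochs it reaches only 84% accuracy, and it mispredicts the letters that follow B, V, W, and X.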
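The reported accuracy can be cross-checked against the printed predictions. The sketch below transcribes the predicted letters from the output above into a string (the variable names are ours) and counts the mismatches.

```python
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
inputs = alphabet[:-1]    # A..Y
expected = alphabet[1:]   # B..Z
# Predicted letters transcribed from the printed output above
predicted = "BBDEFGHIJKLMNOPQRSTUVXZZZ"

# Collect (input, predicted, expected) for every wrong prediction
errors = [(i, p, e) for i, p, e in zip(inputs, predicted, expected) if p != e]
accuracy = 1.0 - len(errors) / float(len(expected))

print(errors)
print("%.2f%%" % (accuracy * 100))  # 84.00%
```

Four of the 25 patterns are wrong, which agrees with the 84.00% reported by model.evaluate().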

## Naive LSTM for a Three-Char Feature Window to One-Char Mapping

In [48]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 3
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print seq_in, '->', seq_out

ABC -> D
BCD -> E
CDE -> F
DEF -> G
EFG -> H
FGH -> I
GHI -> J
HIJ -> K
IJK -> L
JKL -> M
KLM -> N
LMN -> O
MNO -> P
NOP -> Q
OPQ -> R
PQR -> S
QRS -> T
RST -> U
STU -> V
TUV -> W
UVW -> X
VWX -> Y
WXY -> Z


**Set the sequence length as the number of features, which is not the correct use of an LSTM**

In [49]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (len(dataX), 1, seq_length))
# normalize
X = X / float(len(alphabet))
# one hot encode the output variable
y = np_utils.to_categorical(dataY)

In [21]:
print X[:3]
print y[:3]

[[[ 0.          0.03846154  0.07692308]]

 [[ 0.03846154  0.07692308  0.11538462]]

 [[ 0.07692308  0.11538462  0.15384615]]]
[[ 0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.]]


In [19]:
# create and fit the model
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [22]:
model.fit(X, y, nb_epoch=500, batch_size=1, verbose=2)
# summarize performance of the model
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))

Epoch 1/500
0s - loss: 3.2659 - acc: 0.0435
Epoch 2/500
0s - loss: 3.2503 - acc: 0.0435
Epoch 3/500
0s - loss: 3.2410 - acc: 0.0435
Epoch 4/500
0s - loss: 3.2329 - acc: 0.0435
Epoch 5/500
0s - loss: 3.2237 - acc: 0.0435
Epoch 6/500
0s - loss: 3.2158 - acc: 0.0435
Epoch 7/500
0s - loss: 3.2073 - acc: 0.0435
Epoch 8/500
0s - loss: 3.2000 - acc: 0.0435
Epoch 9/500
0s - loss: 3.1913 - acc: 0.0435
Epoch 10/500
0s - loss: 3.1832 - acc: 0.0435
Epoch 11/500
0s - loss: 3.1748 - acc: 0.0435
Epoch 12/500
0s - loss: 3.1671 - acc: 0.0435
Epoch 13/500
0s - loss: 3.1586 - acc: 0.0435
Epoch 14/500
0s - loss: 3.1502 - acc: 0.0435
Epoch 15/500
0s - loss: 3.1419 - acc: 0.0435
Epoch 16/500
0s - loss: 3.1338 - acc: 0.0435
Epoch 17/500
0s - loss: 3.1262 - acc: 0.0435
Epoch 18/500
0s - loss: 3.1175 - acc: 0.0435
Epoch 19/500
0s - loss: 3.1092 - acc: 0.0435
Epoch 20/500
0s - loss: 3.1019 - acc: 0.0435
Epoch 21/500
0s - loss: 3.0938 - acc: 0.0435
Epoch 22/500
0s - loss: 3.0858 - acc: 0.0435
Epoch 23/500
0s - l

In [23]:
# demonstrate some model predictions
for pattern in dataX:
    x = numpy.reshape(pattern, (1, 1, len(pattern)))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print seq_in, "->", result

['A', 'B', 'C'] -> D
['B', 'C', 'D'] -> E
['C', 'D', 'E'] -> F
['D', 'E', 'F'] -> G
['E', 'F', 'G'] -> H
['F', 'G', 'H'] -> I
['G', 'H', 'I'] -> J
['H', 'I', 'J'] -> K
['I', 'J', 'K'] -> L
['J', 'K', 'L'] -> M
['K', 'L', 'M'] -> N
['L', 'M', 'N'] -> O
['M', 'N', 'O'] -> P
['N', 'O', 'P'] -> Q
['O', 'P', 'Q'] -> R
['P', 'Q', 'R'] -> S
['Q', 'R', 'S'] -> T
['R', 'S', 'T'] -> U
['S', 'T', 'U'] -> V
['T', 'U', 'V'] -> X
['U', 'V', 'W'] -> Z
['V', 'W', 'X'] -> Z
['W', 'X', 'Y'] -> Z


We can see a small lift in performance that may or may not be real: 20 of the 23 patterns are predicted correctly (about 87%), compared with 84% in the one-char example. This is a simple problem that we were still not able to learn with LSTMs, even with the window method.

Again, this is a misuse of the LSTM network caused by a poor framing of the problem. The sequences of letters are really time steps of one feature rather than one time step of separate features. We have given the network more context, but not more sequence, which is what it expects.
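The difference between the two framings comes down to the shape passed to the network. A small NumPy sketch, using three hypothetical windows (ABC, BCD, CDE) encoded as integers:

```python
import numpy as np

# ABC, BCD, CDE encoded as integers
windows = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]

# One time step carrying three features (the framing used in this section)
as_features = np.reshape(windows, (len(windows), 1, 3))
# Three time steps carrying one feature each
as_time_steps = np.reshape(windows, (len(windows), 3, 1))

print(as_features.shape)          # (3, 1, 3)
print(as_time_steps.shape)        # (3, 3, 1)
print(as_features[0].tolist())    # [[0, 1, 2]]
print(as_time_steps[0].tolist())  # [[0], [1], [2]]
```

The values are identical in both arrays; only the axis that holds the three letters changes. In the first layout the LSTM sees a single step and has nothing to recur over.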

## Naive LSTM for a Three-Char Time Step Window to One-Char Mapping

In Keras, the intended use of LSTMs is to provide context in the form of time steps, rather than windowed features like with other network types.

We can take our first example and simply change the sequence length from 1 to 3.

In [3]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 3
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print seq_in, '->', seq_out

ABC -> D
BCD -> E
CDE -> F
DEF -> G
EFG -> H
FGH -> I
GHI -> J
HIJ -> K
IJK -> L
JKL -> M
KLM -> N
LMN -> O
MNO -> P
NOP -> Q
OPQ -> R
PQR -> S
QRS -> T
RST -> U
STU -> V
TUV -> W
UVW -> X
VWX -> Y
WXY -> Z


**Set the sequence length as the number of time steps**

This is the correct intended use of providing sequence context to your LSTM in Keras.

In [4]:
# reshape X to be [samples, time steps, features]
X = numpy.reshape(dataX, (len(dataX), seq_length, 1))
# normalize
X = X / float(len(alphabet))
# one hot encode the output variable
y = np_utils.to_categorical(dataY)

In [5]:
print X.shape, X[:3]
print y.shape, y[:3]

(23, 3, 1) [[[ 0.        ]
  [ 0.03846154]
  [ 0.07692308]]

 [[ 0.03846154]
  [ 0.07692308]
  [ 0.11538462]]

 [[ 0.07692308]
  [ 0.11538462]
  [ 0.15384615]]]
(23, 26) [[ 0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.]]


In [6]:
# create and fit the model
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [7]:
model.fit(X, y, nb_epoch=500, batch_size=1, verbose=2)
# summarize performance of the model
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))

Epoch 1/500
0s - loss: 3.2608 - acc: 0.0435
Epoch 2/500
0s - loss: 3.2403 - acc: 0.0435
Epoch 3/500
0s - loss: 3.2308 - acc: 0.0435
Epoch 4/500
0s - loss: 3.2193 - acc: 0.0435
Epoch 5/500
0s - loss: 3.2087 - acc: 0.0000e+00
Epoch 6/500
0s - loss: 3.1985 - acc: 0.0435
Epoch 7/500
0s - loss: 3.1870 - acc: 0.0435
Epoch 8/500
0s - loss: 3.1755 - acc: 0.0435
Epoch 9/500
0s - loss: 3.1648 - acc: 0.0435
Epoch 10/500
0s - loss: 3.1513 - acc: 0.0435
Epoch 11/500
0s - loss: 3.1387 - acc: 0.0435
Epoch 12/500
0s - loss: 3.1262 - acc: 0.0435
Epoch 13/500
0s - loss: 3.1125 - acc: 0.0435
Epoch 14/500
0s - loss: 3.1001 - acc: 0.0435
Epoch 15/500
0s - loss: 3.0853 - acc: 0.0435
Epoch 16/500
0s - loss: 3.0736 - acc: 0.0435
Epoch 17/500
0s - loss: 3.0564 - acc: 0.0435
Epoch 18/500
0s - loss: 3.0395 - acc: 0.0435
Epoch 19/500
0s - loss: 3.0245 - acc: 0.0435
Epoch 20/500
0s - loss: 3.0113 - acc: 0.0435
Epoch 21/500
0s - loss: 2.9928 - acc: 0.0435
Epoch 22/500
0s - loss: 2.9758 - acc: 0.0870
Epoch 23/500
0s

In [55]:
# demonstrate some model predictions
for pattern in dataX:
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print seq_in, "->", result

['A', 'B', 'C'] -> D
['B', 'C', 'D'] -> E
['C', 'D', 'E'] -> F
['D', 'E', 'F'] -> G
['E', 'F', 'G'] -> H
['F', 'G', 'H'] -> I
['G', 'H', 'I'] -> J
['H', 'I', 'J'] -> K
['I', 'J', 'K'] -> L
['J', 'K', 'L'] -> M
['K', 'L', 'M'] -> N
['L', 'M', 'N'] -> O
['M', 'N', 'O'] -> P
['N', 'O', 'P'] -> Q
['O', 'P', 'Q'] -> R
['P', 'Q', 'R'] -> S
['Q', 'R', 'S'] -> T
['R', 'S', 'T'] -> U
['S', 'T', 'U'] -> V
['T', 'U', 'V'] -> W
['U', 'V', 'W'] -> X
['V', 'W', 'X'] -> Y
['W', 'X', 'Y'] -> Z


We can see that the model learns the problem perfectly as evidenced by the model evaluation and the example predictions.

But it has learned a simpler problem. Specifically, it has learned to predict the next letter from a sequence of three letters in the alphabet. It can be shown any random sequence of three letters from the alphabet and predict the next letter.

It cannot actually enumerate the alphabet. I expect that a large enough multilayer perceptron network might be able to learn the same mapping using the window method.

The LSTM networks are stateful. They should be able to learn the whole alphabet sequence, but by default the Keras implementation resets the network state after each training batch.
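To see why resetting state matters, consider a toy recurrent unit. This is a deliberately simplified tanh cell, not an LSTM, and the weights `w_x` and `w_h` are arbitrary values chosen for illustration. When state is reset before every input, identical inputs give identical outputs; when state is carried forward, the output depends on what came before.

```python
import numpy as np

def rnn_step(x, h, w_x=0.5, w_h=0.9):
    """One step of a toy tanh recurrent unit (not an LSTM)."""
    return np.tanh(w_x * x + w_h * h)

inputs = [1.0, 1.0, 1.0]

# State reset to zero before every input: no memory of earlier inputs
stateless = [rnn_step(x, 0.0) for x in inputs]

# State carried forward from one input to the next
h = 0.0
stateful = []
for x in inputs:
    h = rnn_step(x, h)
    stateful.append(h)

print(stateless)  # three identical values
print(stateful)   # values change as state accumulates
```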

In [57]:
# demonstrate predicting random patterns
print "Test a Random Pattern:"
for i in range(0,20):
    pattern_index = numpy.random.randint(len(dataX))
    pattern = dataX[pattern_index]
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print seq_in, "->", result

Test a Random Pattern:
['B', 'C', 'D'] -> E
['F', 'G', 'H'] -> I
['N', 'O', 'P'] -> Q
['G', 'H', 'I'] -> J
['N', 'O', 'P'] -> Q
['M', 'N', 'O'] -> P
['N', 'O', 'P'] -> Q
['N', 'O', 'P'] -> Q
['C', 'D', 'E'] -> F
['U', 'V', 'W'] -> X
['V', 'W', 'X'] -> Y
['Q', 'R', 'S'] -> T
['R', 'S', 'T'] -> U
['O', 'P', 'Q'] -> R
['B', 'C', 'D'] -> E
['H', 'I', 'J'] -> K
['L', 'M', 'N'] -> O
['W', 'X', 'Y'] -> Z
['W', 'X', 'Y'] -> Z
['J', 'K', 'L'] -> M


## LSTM State Within A Batch

The Keras implementation of LSTMs resets the state of the network after each batch.

This suggests that if we had a batch size large enough to hold all input patterns, and if all the input patterns were ordered sequentially, the LSTM could use the context of the sequence within the batch to better learn the sequence.
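As a rough picture of which patterns end up together, the sketch below splits the 25 input letters into contiguous batches. It assumes that, with shuffle=False, the data is sliced in order into chunks of batch_size; the helper `contiguous_batches` is hypothetical and not part of Keras.

```python
def contiguous_batches(num_samples, batch_size):
    """Return (start, end) index pairs for in-order batches."""
    return [(start, min(start + batch_size, num_samples))
            for start in range(0, num_samples, batch_size)]

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
inputs = alphabet[:-1]  # the 25 input letters A..Y

# batch_size=10, as used in the fit call in this section
for start, end in contiguous_batches(len(inputs), 10):
    print(inputs[start:end])

# A batch size equal to the number of patterns gives a single batch
print(contiguous_batches(len(inputs), len(inputs)))
```

With a batch size of 10 the alphabet is split into three groups (ABCDEFGHIJ, KLMNOPQRST, UVWXY), so only a batch size of 25 would hold all input patterns at once.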

In [107]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 1
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    #print seq_in, '->', seq_out

In [108]:
from keras.preprocessing.sequence import pad_sequences

In [109]:
# convert list of lists to array and pad sequences if needed
X = pad_sequences(dataX, maxlen=seq_length, dtype='float32')
# reshape X to be [samples, time steps, features]
X = numpy.reshape(X, (X.shape[0], seq_length, 1))
# normalize
X = X / float(len(alphabet))
# one hot encode the output variable
y = np_utils.to_categorical(dataY)

In [110]:
print X.shape, X[:3]
print y.shape, y[:3]

(25, 1, 1) [[[ 0.        ]]

 [[ 0.03846154]]

 [[ 0.07692308]]]
(25, 26) [[ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.]]


In [76]:
# create and fit the model
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [77]:
model.fit(X, y, nb_epoch=5000, batch_size=10, verbose=2, shuffle=False)
# summarize performance of the model
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))

Epoch 1/5000
0s - loss: 3.2626 - acc: 0.0400
Epoch 2/5000
0s - loss: 3.2607 - acc: 0.0400
Epoch 3/5000
0s - loss: 3.2597 - acc: 0.0400
Epoch 4/5000
0s - loss: 3.2588 - acc: 0.0400
Epoch 5/5000
0s - loss: 3.2579 - acc: 0.0400
Epoch 6/5000
0s - loss: 3.2570 - acc: 0.0400
Epoch 7/5000
0s - loss: 3.2562 - acc: 0.0400
Epoch 8/5000
0s - loss: 3.2553 - acc: 0.0000e+00
Epoch 9/5000
0s - loss: 3.2544 - acc: 0.0000e+00
Epoch 10/5000
0s - loss: 3.2536 - acc: 0.0400
Epoch 11/5000
0s - loss: 3.2527 - acc: 0.0400
Epoch 12/5000
0s - loss: 3.2519 - acc: 0.0400
Epoch 13/5000
0s - loss: 3.2511 - acc: 0.0400
Epoch 14/5000
0s - loss: 3.2502 - acc: 0.0400
Epoch 15/5000
0s - loss: 3.2494 - acc: 0.0400
Epoch 16/5000
0s - loss: 3.2485 - acc: 0.0400
Epoch 17/5000
0s - loss: 3.2477 - acc: 0.0400
Epoch 18/5000
0s - loss: 3.2469 - acc: 0.0400
Epoch 19/5000
0s - loss: 3.2460 - acc: 0.0400
Epoch 20/5000
0s - loss: 3.2452 - acc: 0.0400
Epoch 21/5000
0s - loss: 3.2443 - acc: 0.0400
Epoch 22/5000
0s - loss: 3.2434 - a

In [78]:
# demonstrate some model predictions
for pattern in dataX:
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print seq_in, "->", result

['A'] -> B
['B'] -> C
['C'] -> D
['D'] -> E
['E'] -> F
['F'] -> G
['G'] -> H
['H'] -> I
['I'] -> J
['J'] -> K
['K'] -> L
['L'] -> M
['M'] -> N
['N'] -> O
['O'] -> P
['P'] -> Q
['Q'] -> R
['R'] -> S
['S'] -> T
['T'] -> U
['U'] -> V
['V'] -> W
['W'] -> X
['X'] -> Y
['Y'] -> Z


In [79]:
# demonstrate predicting random patterns
print "Test a Random Pattern:"
for i in range(0,20):
    pattern_index = numpy.random.randint(len(dataX))
    pattern = dataX[pattern_index]
    x = numpy.reshape(pattern, (1, len(pattern), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print seq_in, "->", result

Test a Random Pattern:
['B'] -> C
['K'] -> L
['U'] -> V
['K'] -> L
['O'] -> P
['B'] -> C
['X'] -> Y
['A'] -> B
['M'] -> N
['O'] -> P
['V'] -> W
['W'] -> X
['P'] -> Q
['I'] -> J
['G'] -> H
['T'] -> U
['S'] -> T
['G'] -> H
['U'] -> V
['H'] -> I


## Stateful LSTM for a One-Char to One-Char Mapping

In [103]:
# prepare the dataset of input to output pairs encoded as integers
seq_length = 1
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    #print seq_in, '->', seq_out

In [104]:
from keras.preprocessing.sequence import pad_sequences

In [105]:
# convert list of lists to array and pad sequences if needed
X = pad_sequences(dataX, maxlen=seq_length, dtype='float32')
# reshape X to be [samples, time steps, features]
X = numpy.reshape(X, (X.shape[0], seq_length, 1))
# normalize
X = X / float(len(alphabet))
# one hot encode the output variable
y = np_utils.to_categorical(dataY)

In [106]:
print X.shape, X[:3]
print y.shape, y[:3]

(25, 1, 1) [[[ 0.        ]]

 [[ 0.03846154]]

 [[ 0.07692308]]]
(25, 26) [[ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.]]


In [93]:
# create and fit the model
batch_size = 1
model = Sequential()
model.add(LSTM(16, batch_input_shape=(batch_size, X.shape[1], X.shape[2]), stateful=True))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [94]:
for i in range(300):
    model.fit(X, y, nb_epoch=1, batch_size=batch_size, verbose=2, shuffle=False)
    model.reset_states()
# summarize performance of the model
scores = model.evaluate(X, y, batch_size=batch_size, verbose=0)
model.reset_states()
print("Model Accuracy: %.2f%%" % (scores[1]*100))

Epoch 1/1
0s - loss: 3.3008 - acc: 0.0400
Epoch 1/1
0s - loss: 3.2748 - acc: 0.0800
Epoch 1/1
0s - loss: 3.2603 - acc: 0.1200
Epoch 1/1
0s - loss: 3.2477 - acc: 0.1200
Epoch 1/1
0s - loss: 3.2362 - acc: 0.1200
Epoch 1/1
0s - loss: 3.2251 - acc: 0.1200
Epoch 1/1
0s - loss: 3.2144 - acc: 0.0800
Epoch 1/1
0s - loss: 3.2035 - acc: 0.0800
Epoch 1/1
0s - loss: 3.1923 - acc: 0.0800
Epoch 1/1
0s - loss: 3.1806 - acc: 0.0800
Epoch 1/1
0s - loss: 3.1680 - acc: 0.1200
Epoch 1/1
0s - loss: 3.1543 - acc: 0.0800
Epoch 1/1
0s - loss: 3.1389 - acc: 0.0800
Epoch 1/1
0s - loss: 3.1216 - acc: 0.0800
Epoch 1/1
0s - loss: 3.1017 - acc: 0.1200
Epoch 1/1
0s - loss: 3.0786 - acc: 0.1200
Epoch 1/1
0s - loss: 3.0520 - acc: 0.1200
Epoch 1/1
0s - loss: 3.0220 - acc: 0.1600
Epoch 1/1
0s - loss: 2.9894 - acc: 0.2000
Epoch 1/1
0s - loss: 2.9554 - acc: 0.2000
Epoch 1/1
0s - loss: 2.9206 - acc: 0.1600
Epoch 1/1
0s - loss: 2.8857 - acc: 0.1600
Epoch 1/1
0s - loss: 2.8513 - acc: 0.1600
Epoch 1/1
0s - loss: 2.8179 - acc:

In [95]:
# demonstrate some model predictions
seed = [char_to_int[alphabet[0]]]
for i in range(0, len(alphabet)-1):
    x = numpy.reshape(seed, (1, len(seed), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    print int_to_char[seed[0]], "->", int_to_char[index]
    seed = [index]
model.reset_states()

# demonstrate a random starting point
letter = "K"
seed = [char_to_int[letter]]
print "New start: ", letter
for i in range(0, 5):
    x = numpy.reshape(seed, (1, len(seed), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    print int_to_char[seed[0]], "->", int_to_char[index]
    seed = [index]
model.reset_states()

A -> B
B -> C
C -> D
D -> E
E -> F
F -> G
G -> H
H -> I
I -> J
J -> K
K -> L
L -> M
M -> N
N -> O
O -> P
P -> Q
Q -> R
R -> S
S -> T
T -> U
U -> V
V -> W
W -> X
X -> Y
Y -> Z
New start:  K
K -> B
B -> C
C -> D
D -> E
E -> F


## LSTM with Variable-Length Input to One-Char Output

In this section we explore a variation of the “stateless” LSTM that learns random subsequences of the alphabet, in an effort to build a model that can be given arbitrary letters or subsequences of letters and predict the next letter in the alphabet.

First, we change the framing of the problem. To simplify, we define a maximum input sequence length and set it to a small value (7 in the code below) to speed up training. This defines the maximum length of the subsequences of the alphabet that will be drawn for training. In extensions, this could just as well be set to the full alphabet (26), or longer if we allow looping back to the start of the sequence.

We also need to define the number of random sequences to create, in this case 1000. This too could be more or less. I expect fewer patterns are actually required.

In [3]:
# prepare the dataset of input to output pairs encoded as integers
num_inputs = 1000
max_len = 7
dataX = []
dataY = []
for i in range(num_inputs):
    start = numpy.random.randint(len(alphabet)-2)
    end = numpy.random.randint(start, min(start+max_len,len(alphabet)-1))
    sequence_in = alphabet[start:end+1]
    sequence_out = alphabet[end + 1]
    dataX.append([char_to_int[char] for char in sequence_in])
    dataY.append(char_to_int[sequence_out])
    print sequence_in, '->', sequence_out

PQRST -> U
W -> X
O -> P
OPQ -> R
IJKLMNO -> P
E -> F
HIJKL -> M
ABCD -> E
X -> Y
GHIJ -> K
MNOPQR -> S
XY -> Z
QRST -> U
ABC -> D
JKLMNOP -> Q
GHIJK -> L
OP -> Q
XY -> Z
D -> E
T -> U
B -> C
QRSTUVW -> X
MNOPQ -> R
HIJ -> K
JKLM -> N
ABCDEF -> G
HIJKL -> M
X -> Y
V -> W
DE -> F
DEFG -> H
BCDE -> F
EFGH -> I
BCDE -> F
FG -> H
RST -> U
TUV -> W
STUVWX -> Y
XY -> Z
LMN -> O
P -> Q
MNOPQR -> S
LM -> N
JKLMN -> O
DEFGHIJ -> K
OPQRS -> T
UVWXY -> Z
PQRS -> T
D -> E
EFGH -> I
IJK -> L
WX -> Y
STUV -> W
MNOPQ -> R
PQRSTU -> V
IJKLMNO -> P
GHIJKL -> M
GHIJKL -> M
MNOP -> Q
HI -> J
KLMNOP -> Q
EFGHI -> J
JK -> L
ABCDE -> F
WXY -> Z
MNOPQRS -> T
IJK -> L
KLMNOPQ -> R
UV -> W
GHI -> J
RSTUVW -> X
PQ -> R
W -> X
J -> K
WX -> Y
JKLMNO -> P
EFGH -> I
MN -> O
LMNOPQR -> S
BCDEFG -> H
VWX -> Y
LMNOP -> Q
TU -> V
MNOPQ -> R
NOPQR -> S
HIJKLM -> N
JKLMNOP -> Q
PQRS -> T
STUVW -> X
QRSTUV -> W
DEFGHI -> J
ABCDEF -> G
XY -> Z
BCDEFG -> H
IJKLM -> N
OPQRST -> U
JKL -> M
ABC -> D
QRSTUVW -> X
ST -> U
JKLMNO

In [13]:
print len(dataX), dataX[:3]

1000 [[15, 16, 17, 18, 19], [22], [14]]


In [14]:
res = pad_sequences(dataX, maxlen=max_len, dtype='float32')
res[:3]

array([[  0.,   0.,  15.,  16.,  17.,  18.,  19.],
       [  0.,   0.,   0.,   0.,   0.,   0.,  22.],
       [  0.,   0.,   0.,   0.,   0.,   0.,  14.]], dtype=float32)
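The padding shown above can be reproduced with plain NumPy. The sketch below is not the Keras implementation; the helper `pad_left` is hypothetical and only mimics the default behaviour of pad_sequences (zeros added at the front, and overly long sequences trimmed from the front).

```python
import numpy as np

def pad_left(sequences, maxlen, value=0.0):
    """Left-pad each sequence with `value` up to `maxlen` items."""
    padded = np.full((len(sequences), maxlen), value, dtype='float32')
    for row, seq in enumerate(sequences):
        # Keep only the last `maxlen` items if the sequence is too long
        trimmed = seq[-maxlen:]
        if len(trimmed) > 0:
            padded[row, maxlen - len(trimmed):] = trimmed
    return padded

res_demo = pad_left([[15, 16, 17, 18, 19], [22], [14]], maxlen=7)
print(res_demo)

# 'A' is encoded as 0, the same value used for padding
print(np.array_equal(pad_left([[0, 1, 2]], 7), pad_left([[1, 2]], 7)))  # True
```

One thing worth noticing is that A is encoded as 0, which is also the padding value, so after padding "ABC" and "BC" become the same input. Here both map to D, so the targets do not conflict, but the network cannot tell a leading A from padding.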

In [6]:
# convert list of lists to array and pad sequences if needed
X = pad_sequences(dataX, maxlen=max_len, dtype='float32')
# reshape X to be [samples, time steps, features]
X = numpy.reshape(X, (X.shape[0], max_len, 1))
# normalize
X = X / float(len(alphabet))
# one hot encode the output variable
y = np_utils.to_categorical(dataY)

In [7]:
print X.shape, X[:3]
print y.shape, y[:3]

(1000, 7, 1) [[[ 0.        ]
  [ 0.        ]
  [ 0.57692307]
  [ 0.61538464]
  [ 0.65384614]
  [ 0.69230771]
  [ 0.73076922]]

 [[ 0.        ]
  [ 0.        ]
  [ 0.        ]
  [ 0.        ]
  [ 0.        ]
  [ 0.        ]
  [ 0.84615386]]

 [[ 0.        ]
  [ 0.        ]
  [ 0.        ]
  [ 0.        ]
  [ 0.        ]
  [ 0.        ]
  [ 0.53846157]]]
(1000, 26) [[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  1.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.  1.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.
   0.  0.  0.  0.  0.  0.  0.  0.]]


In [8]:
# create and fit the model
batch_size = 1
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [9]:
model.fit(X, y, nb_epoch=500, batch_size=batch_size, verbose=2)
# summarize performance of the model
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))

Epoch 1/500
1s - loss: 3.0359 - acc: 0.0760
Epoch 2/500
1s - loss: 2.6729 - acc: 0.1510
Epoch 3/500
1s - loss: 2.2828 - acc: 0.2260
Epoch 4/500
1s - loss: 2.0592 - acc: 0.2910
Epoch 5/500
1s - loss: 1.9146 - acc: 0.3300
Epoch 6/500
1s - loss: 1.7786 - acc: 0.3790
Epoch 7/500
1s - loss: 1.6774 - acc: 0.3960
Epoch 8/500
1s - loss: 1.5844 - acc: 0.4480
Epoch 9/500
1s - loss: 1.4905 - acc: 0.4910
Epoch 10/500
1s - loss: 1.4257 - acc: 0.5150
Epoch 11/500
1s - loss: 1.3753 - acc: 0.5280
Epoch 12/500
1s - loss: 1.3020 - acc: 0.5670
Epoch 13/500
1s - loss: 1.2449 - acc: 0.5770
Epoch 14/500
1s - loss: 1.2043 - acc: 0.5930
Epoch 15/500
1s - loss: 1.1610 - acc: 0.6260
Epoch 16/500
1s - loss: 1.1192 - acc: 0.6230
Epoch 17/500
1s - loss: 1.0926 - acc: 0.6160
Epoch 18/500
1s - loss: 1.0369 - acc: 0.6380
Epoch 19/500
1s - loss: 1.0246 - acc: 0.6450
Epoch 20/500
1s - loss: 0.9879 - acc: 0.6600
Epoch 21/500
1s - loss: 0.9684 - acc: 0.6620
Epoch 22/500
1s - loss: 0.9279 - acc: 0.6860
Epoch 23/500
1s - l

In [10]:
# demonstrate some model predictions
for i in range(20):
    pattern_index = numpy.random.randint(len(dataX))
    pattern = dataX[pattern_index]
    x = pad_sequences([pattern], maxlen=max_len, dtype='float32')
    x = numpy.reshape(x, (1, max_len, 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = numpy.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print seq_in, "->", result

['K', 'L', 'M', 'N', 'O', 'P', 'Q'] -> R
['I', 'J', 'K', 'L', 'M', 'N', 'O'] -> P
['P', 'Q', 'R', 'S', 'T', 'U', 'V'] -> W
['E', 'F', 'G', 'H'] -> I
['W'] -> X
['L', 'M', 'N', 'O', 'P', 'Q'] -> R
['N', 'O', 'P', 'Q'] -> R
['D', 'E', 'F', 'G', 'H', 'I'] -> J
['H'] -> I
['I', 'J', 'K', 'L', 'M', 'N', 'O'] -> P
['C', 'D', 'E', 'F', 'G'] -> H
['S'] -> T
['D'] -> E
['I', 'J', 'K', 'L', 'M', 'N', 'O'] -> P
['P', 'Q', 'R', 'S', 'T'] -> U
['B', 'C', 'D', 'E'] -> F
['Q'] -> R
['M'] -> N
['X'] -> Y
['V', 'W'] -> X
