# Learning to speak like Alice

We build a generative character-based language model by training an RNN on the text of [Alice in Wonderland](http://www.gutenberg.org/ebooks/11).

## Setup Imports

In [1]:
from __future__ import division, print_function
from keras.layers.recurrent import SimpleRNN
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.utils.visualize_util import plot
import numpy as np
%matplotlib inline

Using Theano backend.


## Read input

In [2]:
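lines = []
with open("../data/alice_in_wonderland.txt", "rb") as fin:
    for line in fin:
        # normalize: trim whitespace, lowercase, drop non-ASCII bytes
        line = line.strip().lower().decode("ascii", "ignore")
        if len(line) == 0:
            continue
        lines.append(line)
# join with a space so that words at line boundaries are not fused together
text = " ".join(lines)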

## Build vocabulary lookup tables

In [3]:
# sort the unique characters so that the index assignment is the same on every run
chars = sorted(set(text))
vocab_size = len(chars)
char2index = dict((c, i) for i, c in enumerate(chars))
index2char = dict((i, c) for i, c in enumerate(chars))
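
As a quick sanity check, the two lookup tables should be inverses of each other. The sketch below assumes a short made-up string (`sample`, which is not part of the notebook) in place of the full book, and sorts the vocabulary so that the indices are reproducible:

```python
# Hypothetical sample text; the notebook itself uses the full book.
sample = "the sky was falling"

# Sorting makes the character-to-index assignment deterministic.
chars = sorted(set(sample))
vocab_size = len(chars)
char2index = {c: i for i, c in enumerate(chars)}
index2char = {i: c for i, c in enumerate(chars)}

# Encode the text as integer indices, then decode it back.
encoded = [char2index[c] for c in sample]
decoded = "".join(index2char[i] for i in encoded)

print("vocab_size:", vocab_size)
print("round trip ok:", decoded == sample)
```

If the round trip fails on real data, the usual cause is that the tables were built from a different text than the one being encoded.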

## Create training data

We want to create fixed-size strings of characters as the input sequences, with the character that follows each string as its label. For example, if the input is "the sky was falling" and each string is 10 characters long, the following input strings and label characters would be created:

    the sky wa => s
    he sky was => 
    e sky was  => f
     sky was f => a
    sky was fa => l

and so on. Note that the label in the second row is a space character.

In [4]:
seqlen = 10
step = 1
input_chars = []
label_chars = []
for i in range(0, len(text) - seqlen, step):
    input_chars.append(text[i:i+seqlen])
    label_chars.append(text[i+seqlen])
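
To check the loop against the listing above, it can be run on the example sentence instead of the full text. This is only a sketch: `sample` and `pairs` are names introduced here for illustration.

```python
sample = "the sky was falling"  # example sentence from the text above
seqlen = 10
step = 1

# Collect (input window, label character) pairs with a sliding window.
pairs = []
for i in range(0, len(sample) - seqlen, step):
    pairs.append((sample[i:i + seqlen], sample[i + seqlen]))

for window, label in pairs[:5]:
    # repr() makes a label that is a space visible in the output
    print("%s => %r" % (window, label))
```

A string of length `n` yields `n - seqlen` pairs when `step` is 1, so the 19-character sample produces 9 of them.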

We now vectorize the input and label chars. Each row of the input consists of seqlen characters, and each character is represented as a one-hot encoding of size vocab_size. Thus the shape of X is (len(input_chars), seqlen, vocab_size).

Each row of the label is a single character, represented by a one-hot encoding of size vocab_size. The corresponding prediction row (the output of the network) would be a dense vector of size vocab_size. Hence the shape of y is (len(input_chars), vocab_size).

In [5]:
# use the builtin bool: the np.bool alias has been removed from recent NumPy releases
X = np.zeros((len(input_chars), seqlen, vocab_size), dtype=bool)
y = np.zeros((len(input_chars), vocab_size), dtype=bool)
for i, input_char in enumerate(input_chars):
    for j, ch in enumerate(input_char):
        X[i, j, char2index[ch]] = 1
    y[i, char2index[label_chars[i]]] = 1
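
The shapes described above can be verified on a toy input. The following sketch assumes the short string `sample` instead of the book, and also decodes the first row of X back into text to confirm that the encoding is lossless:

```python
import numpy as np

sample = "the sky was falling"  # hypothetical toy text
seqlen = 10

chars = sorted(set(sample))
vocab_size = len(chars)
char2index = {c: i for i, c in enumerate(chars)}
index2char = {i: c for i, c in enumerate(chars)}

num_rows = len(sample) - seqlen
input_chars = [sample[i:i + seqlen] for i in range(num_rows)]
label_chars = [sample[i + seqlen] for i in range(num_rows)]

X = np.zeros((num_rows, seqlen, vocab_size), dtype=bool)
y = np.zeros((num_rows, vocab_size), dtype=bool)
for i, window in enumerate(input_chars):
    for j, ch in enumerate(window):
        X[i, j, char2index[ch]] = True
    y[i, char2index[label_chars[i]]] = True

print(X.shape, y.shape)

# Each time step has exactly one active entry, so argmax recovers the character.
first_window = "".join(index2char[int(k)] for k in X[0].argmax(axis=1))
print(first_window)
```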

## Build the model

In [6]:
model = Sequential()
model.add(SimpleRNN(512, return_sequences=False, input_shape=(seqlen, vocab_size)))
model.add(Dense(vocab_size))
model.add(Activation("softmax"))

model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
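
To get a feel for the size of this network, the number of trainable weights can be worked out by hand. The sketch below assumes a vocabulary of 60 characters purely for illustration (the actual vocab_size depends on the text), and the two helper functions are hypothetical names, not part of Keras:

```python
def simple_rnn_params(input_dim, units):
    # input-to-hidden weights + hidden-to-hidden weights + one bias per unit
    return input_dim * units + units * units + units


def dense_params(input_dim, units):
    # one weight per input/output pair + one bias per output
    return input_dim * units + units


vocab_size = 60  # assumed value for illustration only
hidden_size = 512  # matches SimpleRNN(512) above

rnn = simple_rnn_params(vocab_size, hidden_size)
dense = dense_params(hidden_size, vocab_size)
print("SimpleRNN:", rnn)
print("Dense:", dense)
print("Total:", rnn + dense)
```

Most of the weights sit in the 512 x 512 hidden-to-hidden matrix, so the total changes only modestly with the vocabulary size.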

## Train Model and Evaluate

We train the model in batches and inspect the text it generates after each epoch. There is no held-out test set here, so evaluation is manual.

In each iteration, we fit the model for a single epoch, randomly choose a row from input_chars as a seed, and use the model to generate the next 100 characters.
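
The generation step is greedy: the most likely next character is appended, and the window slides forward by one position. The sketch below isolates that loop. `greedy_generate`, `toy_rule` and `toy_predict` are hypothetical stand-ins introduced here, with `toy_predict` taking the place of the trained network followed by an argmax:

```python
def greedy_generate(seed, predict_next, num_chars):
    """Repeatedly predict one character and slide the window forward."""
    window = seed
    out = seed
    for _ in range(num_chars):
        nxt = predict_next(window)
        out += nxt
        # drop the oldest character and append the prediction
        window = window[1:] + nxt
    return out


# Toy rule: the next character depends only on the last character of the window.
toy_rule = {"t": "h", "h": "e", "e": " ", " ": "t"}


def toy_predict(window):
    return toy_rule[window[-1]]


print(greedy_generate("the ", toy_predict, 12))
```

Because the prediction is a deterministic function of the current window, the generated text starts to cycle as soon as a window repeats. This is one reason greedy decoding tends to produce repeated phrases.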

In [7]:
batch_size = 128
for iteration in range(51):
    print("=" * 50)
    print("Iteration #: %d" % (iteration))
    
    model.fit(X, y, batch_size=batch_size, nb_epoch=1, verbose=0)
    
    # test model
    test_idx = np.random.randint(len(input_chars))
    test_chars = input_chars[test_idx]
    print("Seed: %s" % (test_chars))
    print(test_chars, end="")
    for _ in range(100):
        Xtest = np.zeros((1, seqlen, vocab_size))
        # use j here so the inner loop does not shadow the outer loop variable
        for j, ch in enumerate(test_chars):
            Xtest[0, j, char2index[ch]] = 1
        pred = model.predict(Xtest, verbose=0)[0]
        ypred = index2char[np.argmax(pred)]
        print(ypred, end="")
        # move the input one step forward
        test_chars = test_chars[1:] + ypred
    print()

Iteration #: 0
Seed: ow are you
ow are you the wase the wase the wase the wase the wase the wase the wase the wase the wase the wase the wase 
Iteration #: 1
Seed: l looked s
l looked soute sad the the the the the the the the the the the the the the the the the the the the the the the
Iteration #: 2
Seed: or speaker
or speaker and the dor the more the the dor the more the the dor the more the the dor the more the the dor the
Iteration #: 3
Seed: out this, 
out this, and the sall and the the she sall the she sall the she sall the she sall the she sall the she sall t
Iteration #: 4
Seed: remember,'
remember,' said the crows and all the treperse for and all the treperse for and all the treperse for and all t
Iteration #: 5
Seed: rightened 
rightened in the said the cate the was in a the cate the was in a the cate the was in a the cate the was in a 
Iteration #: 6
Seed: ly.'that's
ly.'that's the was a dowe the reat the was a dowe the reat the was a dowe the reat the was a dowe the reat the