In [1]:
from theano.sandbox import cuda
cuda.use('gpu1')

 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)



In [2]:
%matplotlib inline
import utils
from utils import *
from keras.layers import TimeDistributed, Activation
from numpy.random import choice

Using Theano backend.


## Setup

In [3]:
path = get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
text = open(path).read()
print('corpus length:', len(text))

corpus length: 600893


In [4]:
!tail {path} -n10

whole of antiquity swarmed with sons of god--he attained the same goal,
the sense of complete sinlessness, complete irresponsibility, that can
now be attained by every individual through science.--In the same manner
I have viewed the saints of India who occupy an intermediate station
between the christian saints and the Greek philosophers and hence are
not to be regarded as a pure type. Knowledge and science--as far as they
existed--and superiority to the rest of mankind by logical discipline
and training of the intellectual powers were insisted upon by the
Buddhists as essential to sanctity, just as they were denounced by the
christian world as the indications of sinfulness.

In [5]:
chars = sorted(list(set(text)))
vocab_size = len(chars)+1
print('total chars: ', vocab_size)

total chars:  85


Sometimes it's useful to have a zero value in the dataset, e.g. for padding, so we reserve index 0 for the null character.

In [6]:
chars.insert(0, "\0")
''.join(chars[:-6])

'\x00\n !"\'(),-.0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxy'

In [7]:
char_indices = dict((c, i) for i,c in enumerate(chars))
indices_char = dict((i, c) for i,c in enumerate(chars))
idx = [char_indices[c] for c in text]

In [8]:
idx[:10]

[40, 42, 29, 30, 25, 27, 29, 1, 1, 1]

In [9]:
''.join(indices_char[i] for i in idx[:20])

'PREFACE\n\n\nSUPPOSING '
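As a small illustration of the mapping above, the sketch below builds the same kind of vocabulary for a toy string and checks that encoding and decoding round-trip, with index 0 kept free for padding. The `toy_text` value and the `encode`/`decode` helpers are made up for this example and are not part of the notebook.

```python
# Toy corpus; hypothetical, only for illustration
toy_text = "hello world"

# Sorted unique characters, with "\0" reserved at index 0 for padding
toy_chars = sorted(set(toy_text))
toy_chars.insert(0, "\0")

toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}

def encode(s):
    """Map a string to a list of integer indices."""
    return [toy_char_indices[c] for c in s]

def decode(ids):
    """Map a list of integer indices back to a string."""
    return ''.join(toy_indices_char[i] for i in ids)

toy_idx = encode(toy_text)
print(toy_idx)
print(decode(toy_idx))
```

Because the null character never occurs in the text itself, index 0 never appears in `toy_idx`.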

## 3 Char Model

### Create inputs
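Create four lists, each taking every 3rd character (step `cs=3`), starting at the 0th, 1st, 2nd and 3rd characters. The first three are the inputs and the fourth is the character to predict.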

In [10]:
cs=3
c1_dat = [idx[i] for i in range(0, len(idx)-1-cs, cs)]
c2_dat = [idx[i+1] for i in range(0, len(idx)-1-cs, cs)]
c3_dat = [idx[i+2] for i in range(0, len(idx)-1-cs, cs)]
c4_dat = [idx[i+3] for i in range(0, len(idx)-1-cs, cs)]
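To see what this slicing produces, here is a minimal sketch on a toy sequence. `toy_idx` is a hypothetical stand-in for the encoded text, and the `t1`..`t4` names are used only here.

```python
# Stand-in for the encoded text: the integers 0..9 (hypothetical toy data)
toy_idx = list(range(10))
cs = 3

# Same slicing pattern as above: step by cs, offset by 0, 1, 2, 3
starts = range(0, len(toy_idx) - 1 - cs, cs)
t1 = [toy_idx[i] for i in starts]
t2 = [toy_idx[i + 1] for i in starts]
t3 = [toy_idx[i + 2] for i in starts]
t4 = [toy_idx[i + 3] for i in starts]

# Each (t1, t2, t3) triple is one input; t4 is the character to predict
for row in zip(t1, t2, t3, t4):
    print(row)
# (0, 1, 2, 3)
# (3, 4, 5, 6)
```

Note that the target of one example is also the first input of the next, since the step is 3 but each example spans 4 characters.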

In [11]:
c1_dat[:5]
?np.stack

Our inputs

In [12]:
# Turn them into numpy arrays
x1 = np.stack(c1_dat[:-2])
x2 = np.stack(c2_dat[:-2])
x3 = np.stack(c3_dat[:-2])

In [13]:
print(x1.shape)
x1[:5]

(200295,)


array([40, 30, 29,  1, 40])

Our output

In [14]:
y = np.stack(c4_dat[:-2])

The number of latent factors to create

In [15]:
n_fac = 42

Create inputs and embedding outputs for each of our 3 character inputs

In [16]:
def embedding_input(name, n_in, n_out):
    """Create an input layer, then apply an embedding layer to it.

    Returns the input tensor and the flattened embedding output.
    """
    inp = Input(shape=(1,), dtype='int64', name=name)
    emb = Embedding(n_in, n_out, input_length=1)(inp)
    return inp, Flatten()(emb)

Of course, you could use one-hot encoding for each character instead. But an embedding can learn similarities, for example between 'A' and 'a', whereas with one-hot encoding 'A' is exactly as different from 'a' as it is from 'Z'.
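An embedding lookup gives the same result as one-hot encoding followed by a matrix multiply; it just skips building the one-hot vectors. The sketch below checks this with NumPy, using a random matrix as a stand-in for learned embedding weights (`emb_matrix` and `char_ids` are illustrative names, not notebook variables).

```python
import numpy as np

vocab_size, n_fac = 85, 42          # sizes used in this notebook
rng = np.random.default_rng(0)

# Stand-in for a learned embedding matrix (random here, purely illustrative)
emb_matrix = rng.normal(size=(vocab_size, n_fac))

char_ids = np.array([40, 42, 29])   # three character indices

# One-hot route: build a (3, vocab_size) matrix and multiply
one_hot = np.zeros((len(char_ids), vocab_size))
one_hot[np.arange(len(char_ids)), char_ids] = 1.0
via_one_hot = one_hot @ emb_matrix

# Embedding route: just look up the rows
via_lookup = emb_matrix[char_ids]

print(via_lookup.shape)                      # (3, 42)
print(np.allclose(via_one_hot, via_lookup))  # True
```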

In [17]:
c1_in, c1 = embedding_input('c1', vocab_size, n_fac)
c2_in, c2 = embedding_input('c2', vocab_size, n_fac)
c3_in, c3 = embedding_input('c3', vocab_size, n_fac)

### Create and train model
We use 256 hidden activations.

In [18]:
n_hidden = 256

Now create the 'green arrow' from our diagram - the layer operation from input to hidden

In [19]:
dense_in = Dense(n_hidden, activation='relu')

Our first hidden activation is simply this function applied to the result of the embedding of the first character(s)

In [20]:
c1_hidden = dense_in(c1)

Now create the 'orange arrow' from our diagram - the layer operation from hidden to hidden

In [21]:
dense_hidden = Dense(n_hidden, activation='tanh')

Our 2nd and 3rd hidden activations add the transformed previous hidden state to the dense output for the new input character.

In [22]:
c2_dense = dense_in(c2)
hidden_2 = dense_hidden(c1_hidden)
c2_hidden = merge([c2_dense, hidden_2])
# merge() defaults to mode='sum', so this adds the two tensors element-wise

In [23]:
c3_dense = dense_in(c3)
hidden_3 = dense_hidden(c2_hidden)
c3_hidden = merge([c3_dense, hidden_3])

Now create the 'blue arrow' from our diagram - the layer operation from hidden to output

In [24]:
dense_out = Dense(vocab_size, activation='softmax')

In [25]:
c4_out = dense_out(c3_hidden)

At this point, `c4_out` is a symbolic tensor that carries the whole computation, from the three inputs through to the softmax output.

In [26]:
c4_out

Softmax.0

In [27]:
model = Model([c1_in, c2_in, c3_in], c4_out)

In [28]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [29]:
model.optimizer.lr=0.001

In [30]:
model.fit([x1, x2, x3], y, batch_size=64, nb_epoch=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7efc02aa7fd0>

### Test model

In [31]:
def get_next(inp):
    idxs = [char_indices[c] for c in inp]
    arrs = [np.array(i)[np.newaxis] for i in idxs]
    p = model.predict(arrs)
    i = np.argmax(p)
    return chars[i]

In [32]:
get_next('phi')

'l'

In [33]:
get_next(' th')

'e'

In [34]:
get_next(' an')

'd'

## Our First RNN!
Now we will implement the usual structure of an RNN, i.e. the rolled-up one.

That is, we no longer write out c1, c2, c3, ... as separate variables. Instead, we build a list of inputs and loop over it.

In [15]:
cs=8
c_in_dat = [[idx[i+n] for i in range(0, len(idx)-1-cs, cs)]
            for n in range(cs)]

Then create the labels for our model (i.e. the 9th char)

In [16]:
c_out_dat = [idx[i+cs] for i in range(0, len(idx)-1-cs, cs)]

In [17]:
xs = [np.stack(c[:-2]) for c in c_in_dat]
len(xs), xs[0].shape

(8, (75109,))

In [18]:
y = np.stack(c_out_dat[:-2])

So each column below is the first 4 chars of one 8-char sequence from the text

In [19]:
[xs[n][:cs] for n in range(4)]

[array([40,  1, 33,  2, 72, 67, 73,  2]),
 array([42,  1, 38, 44,  2,  9, 61, 73]),
 array([29, 43, 31, 71, 54,  9, 58, 61]),
 array([30, 45,  2, 74,  2, 76, 67, 58])]

And this is the next char after each sequence

In [26]:
y[:4]

array([ 1, 33,  2, 72])
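The column layout is easier to see on toy data. The sketch below repeats the same list comprehensions on the integers 0..19 with a shorter sequence length; `toy_idx`, `toy_in` and `toy_out` are hypothetical names used only for this illustration.

```python
toy_idx = list(range(20))   # hypothetical stand-in for the encoded text
cs = 4                      # shorter than the notebook's 8, to keep output small

starts = range(0, len(toy_idx) - 1 - cs, cs)
toy_in = [[toy_idx[i + n] for i in starts] for n in range(cs)]
toy_out = [toy_idx[i + cs] for i in starts]

# toy_in[n] holds the n-th character of every sequence, so sequence j is
# the j-th column: [toy_in[0][j], toy_in[1][j], ...]
sequences = [list(col) for col in zip(*toy_in)]
for seq, nxt in zip(sequences, toy_out):
    print(seq, '->', nxt)
# [0, 1, 2, 3] -> 4
# [4, 5, 6, 7] -> 8
# [8, 9, 10, 11] -> 12
# [12, 13, 14, 15] -> 16
```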

In [27]:
n_fac=42

### Create and train model

In [28]:
def embedding_input(name, n_in, n_out):
    inp = Input(shape=(1,), dtype='int64', name=name+'_in')
    emb = Embedding(n_in, n_out, input_length=1, name=name+'_emb')(inp)
    return inp, Flatten()(emb)

In [29]:
c_ins = [embedding_input('c'+str(n), vocab_size, n_fac) for n in range(cs)]

In [30]:
n_hidden = 256

In [31]:
dense_in = Dense(n_hidden, activation='relu')
dense_hidden = Dense(n_hidden, activation='relu', init='identity')
dense_out = Dense(vocab_size, activation='softmax')

The first char of each sequence goes through dense_in(), to create our first hidden activations.

In [32]:
# c_ins[i] is an (input, embedding) pair: [0] is the Input layer, [1] the flattened embedding
hidden = dense_in(c_ins[0][1])

Then for each successive layer we combine the output of dense_in() on the next character with the output of dense_hidden() on the current hidden state, to create the new hidden state.

In [33]:
for i in range(1, cs):
    c_dense = dense_in(c_ins[i][1])
    hidden = dense_hidden(hidden)
    hidden = merge([c_dense, hidden])

Putting the final hidden state through dense_out() gives us our output

In [34]:
c_out = dense_out(hidden)
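The loop above can be sketched in plain NumPy to make the arithmetic explicit. This is only an illustration under stated assumptions: the weights are random stand-ins rather than trained values, biases are left out, and `rnn_forward` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

vocab_size, n_fac, n_hidden = 85, 42, 256
rng = np.random.default_rng(0)

# Randomly initialised stand-ins for the learned weights (biases omitted)
emb = rng.normal(scale=0.1, size=(vocab_size, n_fac))
w_in = rng.normal(scale=0.1, size=(n_fac, n_hidden))        # input -> hidden
w_hidden = np.eye(n_hidden)                                 # hidden -> hidden, identity init
w_out = rng.normal(scale=0.1, size=(n_hidden, vocab_size))  # hidden -> output

def rnn_forward(char_ids):
    """Return next-character probabilities for one sequence of indices."""
    hidden = relu(emb[char_ids[0]] @ w_in)
    for i in char_ids[1:]:
        c_dense = relu(emb[i] @ w_in)      # same role as dense_in
        hidden = relu(hidden @ w_hidden)   # same role as dense_hidden
        hidden = c_dense + hidden          # same role as merge (a sum)
    return softmax(hidden @ w_out)         # same role as dense_out

probs = rnn_forward([40, 42, 29, 30, 25, 27, 29, 1])
print(probs.shape, probs.sum())
```

The same three weight matrices are reused at every step, which is why the loop works for any sequence length.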

So now we can create our model.

In [37]:
model = Model([c[0] for c in c_ins], c_out)
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())
model.fit(xs, y, batch_size=64, nb_epoch=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f62587be7b8>

### Test model

In [38]:
def get_next(inp):
    idxs = [np.array(char_indices[c])[np.newaxis] for c in inp]
    p = model.predict(idxs)
    return chars[np.argmax(p)]

In [39]:
get_next('for thos')

' '

In [40]:
get_next('part of ')

't'

In [41]:
model.fit(xs, y, batch_size=64, nb_epoch=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f624f817518>

In [42]:
get_next('for thos')

' '

## Our first RNN with keras!

In [43]:
n_hidden, n_fac, cs, vocab_size = (256, 42, 8, 85)

This is nearly exactly equivalent to the RNN we built ourselves in the previous section

In [44]:
model = Sequential([
        Embedding(vocab_size, n_fac, input_length=cs),
        # initialize the hidden-to-hidden weights as an identity matrix
        # rather than randomly; this tends to work well with relu
        SimpleRNN(n_hidden, activation='relu', inner_init='identity'),
        Dense(vocab_size, activation='softmax')
    ])

In [45]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_3 (Embedding)          (None, 8, 42)         3570        embedding_input_3[0][0]          
____________________________________________________________________________________________________
simplernn_3 (SimpleRNN)          (None, 256)           76544       embedding_3[0][0]                
____________________________________________________________________________________________________
dense_6 (Dense)                  (None, 85)            21845       simplernn_3[0][0]                
Total params: 101,959
Trainable params: 101,959
Non-trainable params: 0
____________________________________________________________________________________________________
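The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as expected. The short calculation below uses only the sizes from this notebook; the variable names are illustrative.

```python
vocab_size, n_fac, n_hidden = 85, 42, 256

# Embedding: one n_fac-long vector per vocabulary entry, no bias
embedding_params = vocab_size * n_fac

# SimpleRNN: input-to-hidden weights, hidden-to-hidden weights, and a bias
rnn_params = n_fac * n_hidden + n_hidden * n_hidden + n_hidden

# Dense: hidden-to-output weights plus a bias
dense_params = n_hidden * vocab_size + vocab_size

print(embedding_params, rnn_params, dense_params)    # 3570 76544 21845
print(embedding_params + rnn_params + dense_params)  # 101959
```

None of these counts depends on the sequence length `cs`, because the recurrent weights are shared across time steps.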


In [46]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [47]:
# xs holds 8 one-dimensional arrays; stack them into a (n_samples, 8) array
model.fit(np.stack(xs, axis=1), y, batch_size=64, nb_epoch=8)



Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7f62493e86d8>

In [49]:
def get_next_keras(inp):
    idxs = [char_indices[c] for c in inp]
    # np.newaxis adds a leading batch dimension: shape (8,) becomes (1, 8)
    arrs = np.array(idxs)[np.newaxis, :]
    p = model.predict(arrs)[0]
    return chars[np.argmax(p)]

In [50]:
get_next_keras('this is ')

't'

In [51]:
get_next_keras('part of ')

't'

In [52]:
get_next_keras('queens a')

'n'

## Returning sequences
Now, instead of predicting char n from chars 1 to n-1, we will predict chars 2 to n from chars 1 to n-1, i.e. one prediction at every position.

### Create inputs
To use a sequence model, we can leave our input unchanged - but we have to change our output to a sequence.

Here, c_out_dat is identical to c_in_dat, but shifted along by 1 character

In [58]:
# c_in_dat = [[idx[i+n] for i in range(0, len(idx)-1-cs, cs)]
#            for n in range(cs)]
c_out_dat = [[idx[i+n] for i in range(1, len(idx)-cs, cs)]
            for n in range(cs)]
xs = [np.stack(c[:-2]) for c in c_in_dat]
ys = [np.stack(c[:-2]) for c in c_out_dat]

Reading down each column shows one set of inputs and outputs.

In [59]:
[xs[n][:cs] for n in range(cs)]

[array([40,  1, 33,  2, 72, 67, 73,  2]),
 array([42,  1, 38, 44,  2,  9, 61, 73]),
 array([29, 43, 31, 71, 54,  9, 58, 61]),
 array([30, 45,  2, 74,  2, 76, 67, 58]),
 array([25, 40, 73, 73, 76, 61, 24, 71]),
 array([27, 40, 61, 61, 68, 54,  2, 58]),
 array([29, 39, 54,  2, 66, 73, 33,  2]),
 array([ 1, 43, 73, 62, 54,  2, 72, 67])]

In [60]:
[ys[n][:cs] for n in range(cs)]

[array([42,  1, 38, 44,  2,  9, 61, 73]),
 array([29, 43, 31, 71, 54,  9, 58, 61]),
 array([30, 45,  2, 74,  2, 76, 67, 58]),
 array([25, 40, 73, 73, 76, 61, 24, 71]),
 array([27, 40, 61, 61, 68, 54,  2, 58]),
 array([29, 39, 54,  2, 66, 73, 33,  2]),
 array([ 1, 43, 73, 62, 54,  2, 72, 67]),
 array([ 1, 33,  2, 72, 67, 73,  2, 68])]
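The one-character shift between inputs and targets can be checked on toy data. In this sketch `toy_idx`, `toy_in` and `toy_out` are hypothetical stand-ins for `idx`, `c_in_dat` and `c_out_dat`, with a shorter sequence length.

```python
toy_idx = list(range(20))   # hypothetical stand-in for the encoded text
cs = 4

toy_in = [[toy_idx[i + n] for i in range(0, len(toy_idx) - 1 - cs, cs)]
          for n in range(cs)]
toy_out = [[toy_idx[i + n] for i in range(1, len(toy_idx) - cs, cs)]
           for n in range(cs)]

# Column j of toy_in is input sequence j; column j of toy_out is its target
for x_seq, y_seq in zip(zip(*toy_in), zip(*toy_out)):
    print(list(x_seq), '->', list(y_seq))
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
# [8, 9, 10, 11] -> [9, 10, 11, 12]
# [12, 13, 14, 15] -> [13, 14, 15, 16]
```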

### Create and train model

In [62]:
dense_in = Dense(n_hidden, activation='relu')
dense_hidden = Dense(n_hidden, activation='relu')
dense_out = Dense(vocab_size, activation='softmax')

In [63]:
inp1 = Input(shape=(n_fac,), name='zeros')
hidden = dense_in(inp1)

In [64]:
outs = []
for i in range(cs):
    c_dense = dense_in(c_ins[i][1])
    hidden = dense_hidden(hidden)
    hidden = merge([c_dense, hidden], mode='sum')
    outs.append(dense_out(hidden))

In [65]:
model = Model([inp1] + [c[0] for c in c_ins], outs)
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [66]:
# add an array of 0s to our input
zeros = np.tile(np.zeros(n_fac), (len(xs[0]),1))
zeros.shape

(75109, 42)
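As a quick check of the shape logic, `np.tile` on a zero vector builds the same array as calling `np.zeros` with the two-dimensional shape directly. The sketch below uses a small `n_rows` as a stand-in for `len(xs[0])`.

```python
import numpy as np

n_fac, n_rows = 42, 5   # n_rows is a small stand-in for len(xs[0])

tiled = np.tile(np.zeros(n_fac), (n_rows, 1))
direct = np.zeros((n_rows, n_fac))

print(tiled.shape)                    # (5, 42)
print(np.array_equal(tiled, direct))  # True
```

Note that these zeros still pass through `dense_in`, so the starting hidden state is the relu of that layer's bias, which is exactly zero only while the bias itself is zero.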

In [71]:
model.fit([zeros]+xs, ys, batch_size=64, nb_epoch=8)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7f6240f0c390>

### Test model

In [69]:
def get_nexts(inp):
    idxs = [char_indices[c] for c in inp]
    arrs = [np.array(i)[np.newaxis] for i in idxs]
    p = model.predict([np.zeros(n_fac)[np.newaxis,:]] + arrs)
    print(list(inp))
    return [chars[np.argmax(o)] for o in p]

In [72]:
get_nexts(' this is')

[' ', 't', 'h', 'i', 's', ' ', 'i', 's']


['t', 'h', 'e', 't', ' ', 'p', 'n', ' ']

In [73]:
get_nexts(' part of')

[' ', 'p', 'a', 'r', 't', ' ', 'o', 'f']


['t', 'o', 'r', 't', ' ', 'o', 'f', ' ']

### Sequence model with keras
To convert our previous keras model into a sequence model, simply add the 'return_sequences=True' parameter, and add TimeDistributed() around our dense layer.

In [80]:
model = Sequential([
        Embedding(vocab_size, n_fac, input_length=cs),
        SimpleRNN(n_hidden, return_sequences=True, activation='relu', inner_init='identity'),
        TimeDistributed(Dense(vocab_size, activation='softmax')),
    ])
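Conceptually, wrapping a dense layer in TimeDistributed() applies the same weights to every time step of the sequence. The NumPy sketch below illustrates that idea with random stand-in weights and a random stand-in for the RNN output; it does not call Keras, and all names in it are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

batch, cs, n_hidden, vocab_size = 2, 8, 256, 85
rng = np.random.default_rng(0)

# Stand-in for the RNN output with return_sequences=True: one hidden
# vector per time step
hidden_seq = rng.normal(size=(batch, cs, n_hidden))

# One shared set of dense weights (random stand-ins)
w = rng.normal(scale=0.1, size=(n_hidden, vocab_size))
b = np.zeros(vocab_size)

# Apply the same dense layer to every time step in turn
per_step = np.stack([softmax(hidden_seq[:, t] @ w + b) for t in range(cs)], axis=1)

# Equivalent: one matmul broadcast over the time axis
all_at_once = softmax(hidden_seq @ w + b)

print(per_step.shape)                      # (2, 8, 85)
print(np.allclose(per_step, all_at_once))  # True
```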



In [81]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_5 (Embedding)          (None, 8, 42)         3570        embedding_input_5[0][0]          
____________________________________________________________________________________________________
simplernn_5 (SimpleRNN)          (None, 8, 256)        76544       embedding_5[0][0]                
____________________________________________________________________________________________________
timedistributed_2 (TimeDistribut (None, 8, 85)         21845       simplernn_5[0][0]                
Total params: 101,959
Trainable params: 101,959
Non-trainable params: 0
____________________________________________________________________________________________________


In [82]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

In [83]:
xs[0].shape

(75109,)

In [90]:
x_rnn=np.stack(xs, axis=1)
c_out_dat = [[idx[i+n] for i in range(1, len(idx)-cs, cs)]
            for n in range(cs)]
ys = [np.stack(c[:-2]) for c in c_out_dat]
y_rnn=np.expand_dims(np.stack(ys, axis=1),-1)
x_rnn.shape, y_rnn.shape

((75109, 8), (75109, 8, 1))
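The reshaping above is easier to follow on a tiny example. This sketch stacks three toy position arrays into one row per sequence and adds the trailing axis that the notebook gives its targets; `toy_xs` and `toy_ys` are made-up stand-ins for `xs` and `ys`.

```python
import numpy as np

# Three toy "character position" arrays, five sequences each (hypothetical)
toy_xs = [np.arange(5) + 10 * n for n in range(3)]
toy_ys = [a + 10 for a in toy_xs]

x_toy = np.stack(toy_xs, axis=1)                      # (5, 3): one row per sequence
y_toy = np.expand_dims(np.stack(toy_ys, axis=1), -1)  # (5, 3, 1): trailing label axis

print(x_toy.shape, y_toy.shape)  # (5, 3) (5, 3, 1)
print(x_toy[0])                  # [ 0 10 20]
```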

In [91]:
model.fit(x_rnn, y_rnn, batch_size=64, nb_epoch=8)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x7f623cc891d0>

## Stateful model with keras

`stateful=True` tells keras not to reset the hidden activations to 0 at the end of each sequence, but to leave them as they are. Make sure you also pass `shuffle=False` when you train the model, so that the batches are seen in their original order.

A stateful model is easy to create (just add "stateful=True") but harder to train. We had to add batchnorm and use LSTM to get reasonable results.

When using stateful in keras, you have to also add 'batch_input_shape' to the first layer, and fix the batch size there.

In [92]:
bs=64

In [93]:
model=Sequential([
        Embedding(vocab_size, n_fac, input_length=cs, batch_input_shape=(bs,8)),
        BatchNormalization(),
        LSTM(n_hidden, return_sequences=True, stateful=True),
        TimeDistributed(Dense(vocab_size, activation='softmax')),
    ])
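To illustrate what carrying state across batches means, the sketch below runs a long sequence through a tiny RNN in one pass, then in two chunks with the state carried over, then with the state reset for the second chunk. It uses a plain tanh RNN with random weights rather than the LSTM above, purely to keep the example short; `run` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 6

# Random stand-ins for a simple tanh RNN's weights
w_in = rng.normal(scale=0.5, size=(n_in, n_hidden))
w_hidden = rng.normal(scale=0.5, size=(n_hidden, n_hidden))

def run(seq, hidden):
    """Run the RNN over seq, starting from hidden; return the final state."""
    for x in seq:
        hidden = np.tanh(x @ w_in + hidden @ w_hidden)
    return hidden

long_seq = rng.normal(size=(16, n_in))
zero_state = np.zeros(n_hidden)

# Whole sequence in one go
full = run(long_seq, zero_state)

# Two chunks, carrying the state across (the stateful behaviour)
carried = run(long_seq[8:], run(long_seq[:8], zero_state))

# Second chunk only, starting again from zeros (the non-stateful behaviour)
reset = run(long_seq[8:], zero_state)

print(np.allclose(full, carried))  # carrying state reproduces the full pass
print(np.allclose(full, reset))    # resetting generally does not
```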



In [94]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())

Since we're using a fixed batch shape, we have to ensure the number of inputs and outputs is an exact multiple of the batch size.

In [95]:
mx = len(x_rnn)//bs*bs
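For the sizes in this notebook, the truncation works out as follows (a plain arithmetic check; `n_samples` is an illustrative name for `len(x_rnn)`):

```python
n_samples, bs = 75109, 64   # sizes from this notebook

mx = n_samples // bs * bs   # largest multiple of bs that fits
print(mx, n_samples - mx)   # 75072 37  (samples kept, samples dropped)

assert mx % bs == 0
```

So the last 37 sequences are dropped from training.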

In [96]:
model.fit(x_rnn[:mx], y_rnn[:mx], batch_size=bs, nb_epoch=4, shuffle=False)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f622fec33c8>