In [1]:
%matplotlib inline

import utils_ted
from utils_ted import *

Using TensorFlow backend.
  return f(*args, **kwds)


In [2]:
batch_size = 64

[Keras 2.0 release notes](https://github.com/fchollet/keras/wiki/Keras-2.0-release-notes)

```
Recurrent layers
    output_dim -> units
    init -> kernel_initializer
    inner_init -> recurrent_initializer
    added argument bias_initializer
    W_regularizer -> kernel_regularizer
    b_regularizer -> bias_regularizer
    added arguments kernel_constraint, recurrent_constraint, bias_constraint
    dropout_W -> dropout
    dropout_U -> recurrent_dropout
    consume_less -> implementation. String values have been replaced with integers: implementation 0 (default), 1 or 2.
    LSTM only: the argument forget_bias_init has been removed. Instead there is a boolean argument unit_forget_bias, defaulting to True.
```

## Setup

We haven't really looked into the detail of how this works yet - so this is provided for self-study for those who are interested. We'll look at it closely next week.

In [3]:
path=get_file('nietzsche.txt', origin="https://s3.amazonaws.com/text-datasets/nietzsche.txt")
#text = open(path, encoding='utf8').read().lower()
text = open(path, encoding='utf8').read()

In [4]:
print('corpus length:', len(text))

corpus length: 600893


In [5]:
!tail {path} -n25

are thinkers who believe in the saints.


144

It stands to reason that this sketch of the saint, made upon the model
of the whole species, can be confronted with many opposing sketches that
would create a more agreeable impression. There are certain exceptions
among the species who distinguish themselves either by especial
gentleness or especial humanity, and perhaps by the strength of their
own personality. Others are in the highest degree fascinating because
certain of their delusions shed a particular glow over their whole
being, as is the case with the founder of christianity who took himself
for the only begotten son of God and hence felt himself sinless; so that
through his imagination--that should not be too harshly judged since the
whole of antiquity swarmed with sons of god--he attained the same goal,
the sense of complete sinlessness, complete irresponsibility, that can
now be attained by every individual through science.--In the same manner
I have viewed t

In [6]:
chars = sorted(list(set(text)))
vocab_size = len(chars) + 1

In [7]:
print("total chars : %s" % vocab_size)

total chars : 85


In [8]:
chars.insert(0, '/n')

In [9]:
"".join(chars[1:-5])

'\n !"\'(),-.0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyz'

In [10]:
char_indices = {c:i for i, c in enumerate(chars)}
indices_char = {i:c for i, c in enumerate(chars)}

In [11]:
text_idxs = [char_indices[c] for c in text]

In [12]:
print(text_idxs[:10])

[40, 42, 29, 30, 25, 27, 29, 1, 1, 1]


In [13]:
''.join(indices_char[idx] for idx in text_idxs[:70])

'PREFACE\n\n\nSUPPOSING that Truth is a woman--what then? Is there not gro'

## 3 char model

### Create inputs

Create a list of every 4th character, starting at the 0th, 1st, 2nd, then 3rd characters

In [14]:
cs = 3
c1_data = [text_idxs[i] for i in range(0, len(text_idxs) - (cs+1), cs)]
c2_data = [text_idxs[i+1] for i in range(0, len(text_idxs) - (cs+1), cs)]
c3_data = [text_idxs[i+2] for i in range(0, len(text_idxs) - (cs+1), cs)]
c4_data = [text_idxs[i+3] for i in range(0, len(text_idxs) - (cs+1), cs)]

Our inputs

In [15]:
x1 = np.array(c1_data[:-2])
x2 = np.array(c2_data[:-2])
x3 = np.array(c3_data[:-2])

Our output

In [16]:
y = np.array(c4_data[:-2])

The first 4 inputs and outputs

In [17]:
x1[:4], x2[:4], x3[:4], y[:4]

(array([40, 30, 29,  1]),
 array([42, 25,  1, 43]),
 array([29, 27,  1, 45]),
 array([30, 29,  1, 40]))

In [18]:
x1.shape, y.shape

((200295,), (200295,))

The number of latent factors to create (i.e. the size of the embedding matrix)

In [19]:
n_fac = 42

Create inputs and embedding outputs for each of our 3 character inputs

In [20]:
def embedding_input(name, n_in, n_out):
    inp = Input(shape=(1,), dtype='int64', name=name)
    emb = Embedding(n_in, n_out, input_length=1)(inp)
    return inp, Flatten()(emb)

In [21]:
c1_in, c1_emb = embedding_input('c1', vocab_size, n_fac)
c2_in, c2_emb = embedding_input('c2', vocab_size, n_fac)
c3_in, c3_emb = embedding_input('c3', vocab_size, n_fac)

### Create and train model

Pick a size for our hidden state

In [22]:
n_hidden = 256

This is the 'green arrow' from our diagram - the layer operation from input to hidden.

In [23]:
dense_in = Dense(n_hidden, activation='relu')

Our first hidden activation is simply this function applied to the result of the embedding of the first character.

In [24]:
c1_dense_in = dense_in(c1_emb)

This is the 'orange arrow' from our diagram - the layer operation from hidden to hidden.

In [25]:
dense_hidden = Dense(n_hidden, activation='tanh')

Our second and third hidden activations sum up the previous hidden state (after applying dense_hidden) to the new input state.

In [26]:
from keras.layers import Add

In [27]:
c2_dense_in = dense_in(c2_emb)
hidden_2 = dense_hidden(c1_dense_in)
c2_hidden = Add()([c2_dense_in, hidden_2])

In [28]:
c3_dense_in = dense_in(c3_emb)
hidden_3 = dense_hidden(c2_hidden)
c3_hidden = Add()([c3_dense_in, hidden_3])

This is the 'blue arrow' from our diagram - the layer operation from hidden to output.

In [29]:
dense_out = Dense(vocab_size, activation='softmax')

The third hidden state is the input to our output layer.

In [30]:
c4_out = dense_out(c3_hidden)

In [31]:
model = Model([c1_in, c2_in, c3_in], c4_out)

In [32]:
model.compile(Adam(), loss='sparse_categorical_crossentropy')

In [33]:
model.optimizer.lr = 1e-6

In [34]:
model.fit([x1, x2, x3], y, batch_size=batch_size, epochs=4, verbose=2)

Epoch 1/4
 - 17s - loss: 4.4042
Epoch 2/4
 - 16s - loss: 4.2737
Epoch 3/4
 - 16s - loss: 4.0014
Epoch 4/4
 - 16s - loss: 3.6063


<keras.callbacks.History at 0x7fc835924080>

In [35]:
model.optimizer.lr = 0.01

In [36]:
model.fit([x1, x2, x3], y, batch_size=batch_size, epochs=4, verbose=2)

Epoch 1/4
 - 16s - loss: 3.3020
Epoch 2/4
 - 16s - loss: 3.1850
Epoch 3/4
 - 16s - loss: 3.1410
Epoch 4/4
 - 16s - loss: 3.1195


<keras.callbacks.History at 0x7fc834c9b9b0>

In [37]:
model.optimizer.lr = 1e-6

In [38]:
model.fit([x1, x2, x3], y, batch_size=batch_size, epochs=4, verbose=2)

Epoch 1/4
 - 16s - loss: 3.1060
Epoch 2/4
 - 16s - loss: 3.0961
Epoch 3/4
 - 16s - loss: 3.0880
Epoch 4/4
 - 16s - loss: 3.0807


<keras.callbacks.History at 0x7fc835041ba8>

In [39]:
model.optimizer.lr = 0.01

In [40]:
model.fit([x1, x2, x3], y, batch_size=batch_size, epochs=4, verbose=2)

Epoch 1/4
 - 16s - loss: 3.0740
Epoch 2/4
 - 16s - loss: 3.0674
Epoch 3/4
 - 16s - loss: 3.0608
Epoch 4/4
 - 16s - loss: 3.0542


<keras.callbacks.History at 0x7fc835190d30>

### Test model

In [41]:
def get_next(inp):
    idxs = [char_indices[c] for c in inp]
    #arrs = [np.array(i).reshape(1,) for i in idxs] # to fit in the Input() input shape
    arrs = [np.array(i)[np.newaxis] for i in idxs] 
    preds = model.predict(arrs)
    preds_idxs = np.argmax(preds)
    return chars[preds_idxs]

In [42]:
get_next('zzz')

' '

In [43]:
get_next(' th')

' '

In [44]:
get_next(' an')

' '

## Our first RNN!

### Create inputs

This is the size of our unrolled RNN.

In [45]:
cs = 8 

For each of 0 through 7, create a list of every 8th character with that starting point. These will be the 8 inputs to out model.

In [46]:
c_in_data = [[text_idxs[i+n] for i in range(0, len(text_idxs) - (cs+1), cs)] for n in range(cs)]

Then create a list of the next character in each of these series. This will be the labels for our model.

In [47]:
c_out_data = [text_idxs[i+cs] for i in range(0, len(text_idxs) - (cs+1), cs)]

In [54]:
xs = [np.array(c[:-2]) for c in c_in_data]
#xs = [np.stack(c[:-2]) for c in c_in_data]

In [49]:
len(xs), xs[0].shape

(8, (75109,))

In [51]:
y = np.array(c_out_data[:-2]) # y = np.stack(c_out_data[:-2])
#y = np.stack(c_out_data[:-2])

So each column below is one series of 8 characters from the text.

In [55]:
[xs[n][:cs] for n in range(cs)]

[array([40,  1, 33,  2, 72, 67, 73,  2]),
 array([42,  1, 38, 44,  2,  9, 61, 73]),
 array([29, 43, 31, 71, 54,  9, 58, 61]),
 array([30, 45,  2, 74,  2, 76, 67, 58]),
 array([25, 40, 73, 73, 76, 61, 24, 71]),
 array([27, 40, 61, 61, 68, 54,  2, 58]),
 array([29, 39, 54,  2, 66, 73, 33,  2]),
 array([ 1, 43, 73, 62, 54,  2, 72, 67])]

...and this is the next character after each sequence.

In [56]:
y[:cs]

array([ 1, 33,  2, 72, 67, 73,  2, 68])

In [57]:
n_fac = 42

### Create and train model

In [58]:
def embedding_input(name, n_in, n_out):
    inp = Input((1, ), dtype='int64', name=name+'_in')
    emb = Embedding(n_in, n_out, input_length=1, name=name+'_emb')(inp)
    return inp, Flatten()(emb)

In [59]:
c_ins = [embedding_input('c'+str(n), vocab_size, n_fac) for n in range(cs)]

In [60]:
n_hidden = 256

In [61]:
dense_in = Dense(n_hidden, activation='relu')
dense_hidden = Dense(n_hidden, activation='relu', kernel_initializer='identity')
dense_out = Dense(vocab_size, activation='softmax')

The first character of each sequence goes through dense_in(), to create our first hidden activations.

In [62]:
hidden = dense_in(c_ins[0][1])  # c_ins[0][1]: c0_emb

Then for each successive layer we combine the output of dense_in() on the next character with the output of dense_hidden() on the current hidden state, to create the new hidden state.

In [63]:
for i in range(1, cs):
    c_dense_in = dense_in(c_ins[i][1]) # c_ins[i][1]: c1_emb, c2_emb, ... c7_emb
    hidden = dense_hidden(hidden)
    hidden = Add()([c_dense_in, hidden])

Putting the final hidden state through dense_out() gives us our output.

In [64]:
c_out = dense_out(hidden)

So now we can create our model.

In [65]:
model = Model([c[0] for c in c_ins], c_out)

In [66]:
model.compile(optimizer=Adam(), loss='sparse_categorical_crossentropy')

In [67]:
model.fit(xs, y, batch_size=batch_size, epochs=12, verbose=2)

Epoch 1/12
 - 11s - loss: 2.5412
Epoch 2/12
 - 11s - loss: 2.2509
Epoch 3/12
 - 11s - loss: 2.1431
Epoch 4/12
 - 11s - loss: 2.0710
Epoch 5/12
 - 11s - loss: 2.0141
Epoch 6/12
 - 11s - loss: 1.9686
Epoch 7/12
 - 11s - loss: 1.9270
Epoch 8/12
 - 11s - loss: 1.8907
Epoch 9/12
 - 11s - loss: 1.8575
Epoch 10/12
 - 11s - loss: 1.8273
Epoch 11/12
 - 11s - loss: 1.7994
Epoch 12/12
 - 11s - loss: 1.7733


<keras.callbacks.History at 0x7fc802c4bfd0>

### Test model

In [68]:
def get_next(inp):
    idxs = [char_indices[c] for c in inp]
    #arrs = [np.array(i).reshape(1,) for i in idxs] # to fit in the Input() input shape
    arrs = [np.array(i)[np.newaxis] for i in idxs] 
    preds = model.predict(arrs)
    preds_idxs = np.argmax(preds)
    return chars[preds_idxs]

In [69]:
get_next('for thos')

'e'

In [70]:
get_next('part of ')

't'

In [71]:
get_next('queens a')

't'

## Our first RNN with keras!

In [100]:
n_hidden, n_fac, cs, vocab_size = (256, 42, 8, 58)

This is nearly exactly equivalent to the RNN we built ourselves in the previous section.

In [101]:
from keras.layers import SimpleRNN

In [102]:
model = Sequential([
    Embedding(vocab_size, n_fac, input_length=cs),
    SimpleRNN(n_hidden, activation='relu', recurrent_initializer='identity'),
    Dense(vocab_size, activation='softmax')
])

In [103]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 8, 42)             2436      
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 256)               76544     
_________________________________________________________________
dense_7 (Dense)              (None, 58)                14906     
Total params: 93,886
Trainable params: 93,886
Non-trainable params: 0
_________________________________________________________________


In [104]:
model.compile(Adam(), loss='sparse_categorical_crossentropy')

In [105]:
np.concatenate(xs,axis=1)

AxisError: axis 1 is out of bounds for array of dimension 1

In [None]:
model.fit(np.concatenate(xs, axis=1), y, batch_size=batch_size, epochs=8, verbose=2)

In [80]:
def get_next_keras(inp):
    idxs = [char_indices[c] for c in inp]
    #arrs = [np.array(i).reshape(1,) for i in idxs] # to fit in the Input() input shape
    arrs = np.array(idxs)[np.newaxis,:] 
    preds = model.predict(arrs)[0]
    preds_idxs = np.argmax(preds)
    return chars[preds_idxs]

In [81]:
get_next_keras('this is ')

'/n'

In [82]:
get_next_keras('part of ')

'/n'

In [83]:
get_next_keras('queens a')

'/n'

## Returning sequences

### Create inputs

To use a sequence model, we can leave our input unchanged - but we have to change our output to a sequence (of course!)

Here, c_out_dat is identical to c_in_dat, but moved across 1 character.

In [84]:
n_hidden, n_fac, cs, vocab_size = (256, 42, 8, 58)

In [85]:
c_in_data = [[text_idxs[i+n] for i in range(0, len(text_idxs) - (cs+1), cs)] for n in range(cs)]

# c_out_data = [text_idxs[i+cs] for i in range(0, len(text_idxs) - (cs+1), cs)]
c_out_data = [[text_idxs[i+n] for i in range(1, len(text_idxs) - cs, cs)] for n in range(cs)] 

In [86]:
xs = [np.array(c[:-2]) for c in c_in_data]

In [87]:
ys = [np.array(c[:-2]) for c in c_out_data]

Reading down each column shows one set of inputs and outputs.

In [88]:
[xs[n][:cs] for n in range(cs)]

[array([40,  1, 33,  2, 72, 67, 73,  2]),
 array([42,  1, 38, 44,  2,  9, 61, 73]),
 array([29, 43, 31, 71, 54,  9, 58, 61]),
 array([30, 45,  2, 74,  2, 76, 67, 58]),
 array([25, 40, 73, 73, 76, 61, 24, 71]),
 array([27, 40, 61, 61, 68, 54,  2, 58]),
 array([29, 39, 54,  2, 66, 73, 33,  2]),
 array([ 1, 43, 73, 62, 54,  2, 72, 67])]

In [89]:
[ys[n][:cs] for n in range(cs)]

[array([42,  1, 38, 44,  2,  9, 61, 73]),
 array([29, 43, 31, 71, 54,  9, 58, 61]),
 array([30, 45,  2, 74,  2, 76, 67, 58]),
 array([25, 40, 73, 73, 76, 61, 24, 71]),
 array([27, 40, 61, 61, 68, 54,  2, 58]),
 array([29, 39, 54,  2, 66, 73, 33,  2]),
 array([ 1, 43, 73, 62, 54,  2, 72, 67]),
 array([ 1, 33,  2, 72, 67, 73,  2, 68])]

### Create and train model

In [90]:
def embedding_input(name, n_in, n_out):
    inp = Input((1, ), dtype='int64', name=name+'_in')
    emb = Embedding(n_in, n_out, input_length=1, name=name+'_emb')(inp)
    return inp, Flatten()(emb)

In [91]:
c_ins = [embedding_input('c'+str(n), vocab_size, n_fac) for n in range(cs)]

In [92]:
dense_in = Dense(n_hidden, activation='relu')
dense_hidden = Dense(n_hidden, activation='relu', kernel_initializer='identity')
dense_out = Dense(vocab_size, activation='softmax', name='output')

We're going to pass a vector of all zeros as our starting point - here's our input layers for that:

In [93]:
inp1 = Input(shape=(n_fac,), name='zeros')
hidden = dense_in(inp1)

In [94]:
outs = []

for i in range(cs):
    c_dense = dense_in(c_ins[i][1])
    hidden = dense_hidden(hidden)
    hidden = Add()([c_dense, hidden])
    # every layer now has an output
    outs.append(dense_out(hidden))

In [95]:
model = Model([inp1] + [c[0] for c in c_ins], outs)
model.compile(Adam(), loss='sparse_categorical_crossentropy')

In [97]:
#zeros = np.tile(np.zeros(n_fac), (len(xs[0]), 1))
zeros = np.zeros((len(xs[0]), n_fac))
zeros.shape

(75109, 42)

In [98]:
len(xs), xs[0].shape

(8, (75109,))

In [99]:
model.fit([zeros]+xs, ys, batch_size=batch_size, epochs=12, verbose=2)

Epoch 1/12
 - 21s - loss: nan - output_loss_1: nan - output_loss_2: nan - output_loss_3: nan - output_loss_4: nan - output_loss_5: nan - output_loss_6: nan - output_loss_7: nan - output_loss_8: nan
Epoch 2/12
 - 19s - loss: nan - output_loss_1: nan - output_loss_2: nan - output_loss_3: nan - output_loss_4: nan - output_loss_5: nan - output_loss_6: nan - output_loss_7: nan - output_loss_8: nan
Epoch 3/12
 - 19s - loss: nan - output_loss_1: nan - output_loss_2: nan - output_loss_3: nan - output_loss_4: nan - output_loss_5: nan - output_loss_6: nan - output_loss_7: nan - output_loss_8: nan
Epoch 4/12
 - 19s - loss: nan - output_loss_1: nan - output_loss_2: nan - output_loss_3: nan - output_loss_4: nan - output_loss_5: nan - output_loss_6: nan - output_loss_7: nan - output_loss_8: nan
Epoch 5/12
 - 19s - loss: nan - output_loss_1: nan - output_loss_2: nan - output_loss_3: nan - output_loss_4: nan - output_loss_5: nan - output_loss_6: nan - output_loss_7: nan - output_loss_8: nan
Epoch 6/12

<keras.callbacks.History at 0x7fc800f10f98>

### Test model

In [None]:
def get_nexts(inp):
    idxs = [char_indices[c] for c in inp]
    arrs = [np.array(i)[np.newaxis] for i in idxs]
    preds = model.predict([np.zeros(n_fac)[np.newaxis,:]] + arrs)
    print(list(inp))
    return [chars[np.argmax(p)] for p in preds]

In [None]:
get_nexts(' this is')

In [None]:
get_nexts(' part of')

### Sequence model with keras

In [90]:
n_hidden, n_fac, cs, vocab_size

(256, 42, 8, 58)

To convert our previous keras model into a sequence model, simply add the 'return_sequences=True' parameter, and add TimeDistributed() around our dense layer.

In [91]:
model = Sequential([
    Embedding(vocab_size, n_fac, input_length=cs),
    SimpleRNN(n_hidden, return_sequences=True, activation='relu', recurrent_initializer='identity'),
    TimeDistributed(Dense(vocab_size, activation='softmax'))
])

In [92]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 8, 42)             2436      
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, 8, 256)            76544     
_________________________________________________________________
time_distributed_1 (TimeDist (None, 8, 58)             14906     
Total params: 93,886
Trainable params: 93,886
Non-trainable params: 0
_________________________________________________________________


In [93]:
model.compile(Adam(), loss='sparse_categorical_crossentropy')

In [94]:
xs[0].shape, np.squeeze(xs).shape

((75109,), (8, 75109))

In [119]:
x_rnn = np.stack(np.squeeze(xs), axis=1)
y_rnn = np.atleast_3d(np.stack(np.squeeze(ys), axis=1))

In [120]:
x_rnn.shape, y_rnn.shape

((75109, 8), (75109, 8, 1))

In [97]:
model.fit(x_rnn, y_rnn, batch_size=batch_size, epochs=8, verbose=2)

Epoch 1/8
 - 8s - loss: nan
Epoch 2/8
 - 8s - loss: nan
Epoch 3/8
 - 8s - loss: nan
Epoch 4/8
 - 8s - loss: nan
Epoch 5/8
 - 8s - loss: nan
Epoch 6/8
 - 8s - loss: nan
Epoch 7/8
 - 8s - loss: nan
Epoch 8/8
 - 8s - loss: nan


<keras.callbacks.History at 0x7f8e9423bf60>

In [45]:
def get_nexts_keras(inp):
    idxs = [char_indices[c] for c in inp]
    #arrs = [np.array(i).reshape(1,) for i in idxs] # to fit in the Input() input shape
    arrs = np.array(idxs)[np.newaxis,:] 
    preds = model.predict(arrs)[0]
    print(list(inp))
    return [chars[np.argmax(p)] for p in preds]

In [161]:
get_nexts_keras(' this is')

[' ', 't', 'h', 'i', 's', ' ', 'i', 's']


['t', 'h', 'e', 's', ' ', 'c', 'n', ' ']

## Stateful model with keras

In [121]:
n_hidden, n_fac, cs, vocab_size = (256, 42, 8, 58)

**A stateful model** is easy to create (just **add "stateful=True"**) but **harder to train**. We had to add batchnorm and use LSTM to get reasonable results.

When using stateful in keras, you have to also **add 'batch_input_shape' to the first layer, and fix the batch size there**.

In [122]:
model = Sequential([
    Embedding(vocab_size, n_fac, input_length=cs, batch_input_shape=(batch_size, cs)),
    BatchNormalization(),
    LSTM(n_hidden, return_sequences=True, stateful=True),
    TimeDistributed(Dense(vocab_size, activation='softmax')) 
])

In [123]:
model.compile(Adam(), loss='sparse_categorical_crossentropy')

Since we're using a fixed batch shape, we have to ensure our inputs and outputs are a even multiple of the batch size.

In [124]:
mx = len(x_rnn) // batch_size * batch_size
my = len(y_rnn) // batch_size * batch_size
assert(mx == my)

In [125]:
model.fit(x_rnn[:mx], y_rnn[:my], epochs=4, batch_size=batch_size, verbose=2)

Epoch 1/4
 - 21s - loss: nan
Epoch 2/4
 - 19s - loss: nan
Epoch 3/4
 - 19s - loss: nan
Epoch 4/4
 - 19s - loss: nan


<keras.callbacks.History at 0x7fc7eb2ec7f0>

In [55]:
model.optimizer.lr = 1e-4

In [56]:
model.fit(x_rnn[:mx], y_rnn[:my], epochs=4, batch_size=batch_size, verbose=2)

Epoch 1/4
 - 19s - loss: 1.8067
Epoch 2/4
 - 19s - loss: 1.8003
Epoch 3/4
 - 19s - loss: 1.7944
Epoch 4/4
 - 19s - loss: 1.7893


<keras.callbacks.History at 0x7fb73efbde48>

In [57]:
model.fit(x_rnn[:mx], y_rnn[:my], epochs=4, batch_size=batch_size, verbose=2)

Epoch 1/4
 - 19s - loss: 1.7871
Epoch 2/4
 - 19s - loss: 1.7815
Epoch 3/4
 - 19s - loss: 1.7795
Epoch 4/4
 - 19s - loss: 1.7757


<keras.callbacks.History at 0x7fb74c562b70>

## Tensorflow RNN 
#### (revised from Theano RNN)

In [313]:
n_hidden, n_fac, cs, vocab_size = (256, 42, 8, 58)

In [314]:
n_input = vocab_size
n_output = vocab_size

Using raw tensorflow, we have to create our weight matrices and bias vectors ourselves - here are the functions we'll use to do so (using glorot initialization).

The return values are wrapped in `shared()`, which is how we tell theano that it can manage this data (copying it to and from the GPU as necessary).

In [315]:
def init_wgts(rows, cols, name): 
    return tf.get_variable(shape=[rows, cols], initializer=tf.contrib.layers.xavier_initializer(), name="W"+name)

def init_bias(rows, name): 
    return tf.get_variable(shape=[rows, 1], initializer=tf.zeros_initializer(), name="b"+name)

We return the weights and biases together as a tuple. For the hidden weights, we'll use an identity initialization (as recommended by [Hinton](https://arxiv.org/abs/1504.00941).)

In [316]:
def wgts_and_bias(n_in, n_out, name): 
    return init_wgts(n_in, n_out, name), init_bias(n_out, name)

def id_and_bias(n, name): 
    return tf.eye(n), init_bias(n, name)

Theano doesn't actually do any computations until we explicitly compile and evaluate the function (at which point it'll be turned into CUDA code and sent off to the GPU). So our job is to describe the computations that we'll want theano to do - the first step is to tell theano what inputs we'll be providing to our computation:

Now we're ready to create our intial weight matrices.

In [317]:
W_h = id_and_bias(n_hidden, "_h")
W_x = wgts_and_bias(n_input, n_hidden, "_x")
W_y = wgts_and_bias(n_hidden, n_output, "_y")
w_all = list(chain.from_iterable([W_h, W_x, W_y]))

ValueError: Variable b_h already exists, disallowed. Did you mean to set reuse=True in VarScope? Originally defined at:

  File "C:\ProgramData\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
  File "C:\ProgramData\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "C:\ProgramData\Anaconda3\envs\tensorflow\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
    op_def=op_def)


In [None]:
def step(x, h, W_h, b_h, W_x, b_x, W_y, b_y):
    # Calculate the hidden activations
    h = tf.nn.relu(tf.matmul(x, W_x) + b_x + tf.matmul(h, W_h) + b_h)
    # Calculate the output activations
    y = tf.nn.softmax(tf.matmul(h, W_y) + b_y)
    # Return both (the 'Flatten()' is to work around a theano bug)
    return h, y

Now we can provide everything necessary for the scan operation, so we can setup that up - we have to pass in the function to call at each step, the sequence to step through, the initial values of the outputs, and any other arguments to pass to the step function.

In [None]:
[v_h, v_y], _ = theano.scan(step, sequences=t_inp, 
                            outputs_info=[t_h0, None], non_sequences=w_all)

We can now calculate our loss function, and *all* of our gradients, with just a couple of lines of code!

In [76]:
error = nnet.categorical_crossentropy(v_y, t_outp).sum()
g_all = T.grad(error, w_all)

In [None]:
def upd_dict(wgts, grads, lr): 
    return OrderedDict({w: w-g*lr for (w,g) in zip(wgts,grads)})

upd = upd_dict(w_all, g_all, lr)

In [27]:
def RNN_model(X, Y, cs, vocab_size):
    X = tf.placeholder(tf.float32, [None, cs, vocab_size])
    Y = tf.placeholder(tf.float32, [None, cs, vocab_size])
    
    RNN = tf.contrib.rnn.BasicRNNCell(n_hidden)
    
    # Initial state of the RNN.
    initial_state = state = tf.zeros([batch_size, n_hidden])
    
    W_h = id_and_bias(n_hidden, "_h")
    W_x = wgts_and_bias(n_input, n_hidden, "_x")
    W_y = wgts_and_bias(n_hidden, n_output, "_y")
    
    t_inp = tf.placeholder(tf.float32, shape=inp.shape, name='inp')
    t_outp = tf.placeholder(tf.float32, shape=outp.shape, name='outp')
    t_h0 = tf.placeholder(tf.float32, shape=h0.shape, name='h0')
    lr = tf.constant(learning_rate, name='lr')
    
    
    err=0.0
    l_rate=0.01
    for i in range(len(X)): 
        err_epoch = RNN(np.zeros(n_hidden), X[i], Y[i], l_rate)
        err+=err_epoch
        
        if i % 1000 == 999: 
            print ("Error:{:.3f}".format(err/1000))
            err=0.0

NameError: name 'theano' is not defined

In [126]:
oh_ys = [to_categorical(o, vocab_size) for o in ys]
oh_y_rnn=np.stack(oh_ys, axis=1)

oh_xs = [to_categorical(o, vocab_size) for o in xs]
oh_x_rnn=np.stack(oh_xs, axis=1)

# oh_x_rnn.shape, oh_y_rnn.shape

X = oh_x_rnn
Y = oh_y_rnn
X.shape, Y.shape

IndexError: index 61 is out of bounds for axis 1 with size 58

## Pure python RNN!

### Set up basic functions

Now we're going to try to repeat the above theano RNN, using just pure python (and numpy). Which means, we have to do everything ourselves, including defining the basic functions of a neural net! Below are all of the definitions, along with tests to check that they give the same answers as theano. The functions ending in `_d` are the derivatives of each function.

In [106]:
def sigmoid(x): 
    return 1/(1+np.exp(-x))

def sigmoid_d(x):
    output = sigmoid(x)
    return output*(1-output)

In [107]:
def relu(x):
    return np.maximum(0, x)

def relu_d(x):
    return (x>0.)*1

In [108]:
relu(np.array([3., -3.])), relu_d(np.array([3., -3.]))

(array([ 3.,  0.]), array([1, 0]))

In [109]:
def dist(a, b): return pow(a-b, 2)
def dist_d(a, b): return 2*(a-b)

In [110]:
eps = 1e-7
def x_entropy(pred, actual):
    return -np.sum(actual * np.log(np.clip(pred, eps, 1-eps)))
def x_entropy_d(pred, actual):
    return -actual/pred

In [111]:
def softmax(x):
    return np.exp(x) / np.exp(x).sum()
def softmax_d(x):
    sm = softmax(x)
    res = np.expand_dims(-sm, -1)*sm
    res[np.diag_indices_from(res)] = sm*(1-sm)
    return res

In [129]:
test_preds = np.array([0.2,0.7,0.1])
test_actuals = np.array([0.,1.,0.])
labels = test_actuals
preds = test_preds
entropy_value = crossentropy(K.constant(labels.astype('float32')), K.constant(preds.astype('float32')))
with sess.as_default():
    eval_result = entropy_value.eval()
    
eval_result

RuntimeError: Attempted to use a closed Session.

In [113]:
x_entropy(test_preds, test_actuals)

0.35667494393873245

In [114]:
test_inp = tf.placeholder(tf.float32, shape=[None], name='test_inp')
test_labels = tf.placeholder(tf.float32, shape=[None], name='test_labels')
test_out = crossentropy(test_labels, test_inp)
test_grad = tf.gradients(test_out, test_inp)

In [115]:
with tf.Session() as sess:
    res_grad = sess.run(test_grad, feed_dict={test_inp: test_preds, test_labels: test_actuals})
res_grad

[array([ 1.        , -0.42857146,  1.        ], dtype=float32)]

In [116]:
x_entropy_d(test_preds, test_actuals)

array([-0.        , -1.42857143, -0.        ])

In [117]:
pre_pred = np.random.random(oh_x_rnn[0][0].shape)
preds = softmax(pre_pred)
actual = oh_x_rnn[0][0]

NameError: name 'oh_x_rnn' is not defined

In [118]:
np.allclose(softmax_d(pre_pred).dot(x_entropy_d(preds,actual)), preds-actual)

NameError: name 'pre_pred' is not defined

In [193]:
softmax(test_preds)

array([ 0.28140804,  0.46396343,  0.25462853])

In [198]:
with tf.Session() as sess:
    print(sess.run(tf.nn.softmax(test_preds)))

[ 0.28140804  0.46396343  0.25462853]


In [211]:
test_inp = tf.placeholder(tf.float32, shape=[3], name='test_inp')
#test_out = tf.contrib.layers.flatten(tf.nn.softmax(test_inp))
test_out = tf.nn.softmax(test_inp)
test_grad = tf.gradients(test_out, test_inp)
#test_grad = tf.gradient(x=test_inp, x_shape=[3], y=test_out, y_shape=[3], extra_feed_dict={test_inp: test_preds})

In [207]:
softmax_d(test_preds)

array([[ 0.20221756, -0.13056304, -0.07165452],
       [-0.13056304,  0.24870137, -0.11813832],
       [-0.07165452, -0.11813832,  0.18979284]])

In [212]:
act=relu
act_d = relu_d

In [213]:
loss=x_entropy
loss_d=x_entropy_d

We also have to define our own scan function. Since we're not worrying about running things in parallel, it's very simple to implement:

In [214]:
def scan(fn, start, seq):
    res = []
    prev = start
    for s in seq:
        app = fn(prev, s)
        res.append(app)
        prev = app
    return res

...for instance, `scan` on `+` is the cumulative sum.

In [216]:
scan((lambda prev,curr: prev+curr), 0, range(5))

[0, 1, 3, 6, 10]

### Set up training

In [127]:
inp = oh_x_rnn
outp = oh_y_rnn
n_input = vocab_size
n_output = vocab_size

NameError: name 'oh_x_rnn' is not defined

In [128]:
inp.shape, outp.shape

NameError: name 'inp' is not defined

Here's the function to do a single forward pass of an RNN, for a single character.

In [220]:
def one_char(prev, item):
    #Previous state
    tot_loss, pre_hidden, pre_pred, hidden, ypred = prev
    # Current inputs and output
    x, y = item
    pre_hidden = np.dot(x, w_x) + np.dot(hidden, w_h)
    hidden = act(pre_hidden)
    pre_pred = np.dot(hidden, w_y)
    ypred = softmax(pre_pred)
    return (
        # Keep track of loss so we can report it
        tot_loss+loss(ypred, y),
        # Used in backprop
        pre_hidden, pre_pred,
        # Used in next iteration
        hidden,
        # To provide predictions
        ypred)

We use `scan` to apply the above to a whole sequence of characters.

In [221]:
def get_chars(n): return zip(inp[n], outp[n])
def one_fwd(n): return scan(one_char, (0,0,0,np.zeros(n_hidden),0), get_chars(n))

Now we can define the backward step. We use a loop to go through every element of the sequence. The derivatives are applying the chain rule to each step, and accumulating the gradients across the sequence.

In [224]:
# "Columnify" a vector
def col(x): 
    # return x.reshape(-1, 1)
    return x[:,np.newaxis]

def one_bkwd(args, n):
    global w_x, w_y, w_h
    
    i = inp[n]  # 8x85
    o = outp[n] # 8x85
    d_pre_hidden = np.zeros(n_hidden) #256
    for p in reversed(range(len(i))):
        totloss, pre_hidden, pre_pred, hidden, ypred = args[p]
        x = i[p] # 85
        y = o[p] # 85
        d_pre_pred = softmax_d(pre_pred).dot(loss_d(ypred, y)) # 85
        d_pre_hidden = (np.dot(d_pre_hidden, w_h.T) + np.dot(d_pre_pred, w_y.T)) 
                        * act_d(pre_hidden) # 256
        
        # d(loss)/d(w_y) = d(loss)/d(pre_pred) * d(pre_pred)/d(w_y)

Now we can set up our initial weight matrices. Note that we're not using bias at all in this example, in order to keep things simpler.