In [1]:
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0"

import random
import numpy as np
import tensorflow as tf

In [2]:
# load in some text to use
with open('data/shakespeare.txt', 'r') as f:
    shakes = f.read()
print(shakes[0:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [3]:
# need to get all of the possible characters that the source uses
# sorted so that the character-to-index mapping is the same on every run
chars = sorted(set(shakes))
data_size, vocab_size = len(shakes), len(chars)
print('Text is', data_size, 'characters long and there are', vocab_size, 'unique characters.')

Text is 1115394 characters long and there are 65 unique characters.


In [4]:
# create dictionaries to convert from characters to index and from index back to characters
char2idx = {ch: i for i, ch in enumerate(chars)}
idx2char = {i: ch for i, ch in enumerate(chars)}
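
As a quick sanity check of this kind of mapping, the sketch below round-trips a short string through a pair of dictionaries built the same way. It uses a small stand-in string (`sample_text`, a hypothetical name) rather than the full corpus, and it sorts the character set so that the indices are stable across runs.

```python
# Stand-in text; the notebook itself uses the full Shakespeare corpus.
sample_text = "hear me speak"

# Sorting makes the index assignment deterministic across runs.
sample_chars = sorted(set(sample_text))
sample_char2idx = {ch: i for i, ch in enumerate(sample_chars)}
sample_idx2char = {i: ch for i, ch in enumerate(sample_chars)}

# Encode the text as a list of integer indices, then decode it again.
encoded = [sample_char2idx[ch] for ch in sample_text]
decoded = "".join(sample_idx2char[i] for i in encoded)

print(encoded)
print(decoded == sample_text)
```

If the round trip does not reproduce the input, the two dictionaries are out of sync.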

In [5]:
# define some hyperparameters for our network
hidden_size = 256
seq_length = 50
epochs = 1000
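
To illustrate what `seq_length` controls, here is a minimal sketch of how a text can be cut into input/target pairs for next-character prediction: each target chunk is the input chunk shifted by one character. The helper name `make_pairs` and the tiny example string are assumptions made for illustration only.

```python
def make_pairs(text, seq_length):
    """Yield (input, target) chunks; the target is the input shifted by one character."""
    # Stop early enough that every target chunk still has seq_length characters.
    for start in range(0, len(text) - seq_length, seq_length):
        yield text[start:start + seq_length], text[start + 1:start + seq_length + 1]

pairs = list(make_pairs("First Citizen:", seq_length=4))
for x, y in pairs:
    print(repr(x), "->", repr(y))
```

Every pair has exactly `seq_length` characters on both sides, which matters when the network is unrolled for a fixed number of time steps.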

# Building a GRU in TensorFlow

In this final notebook we'll work in TensorFlow directly. I recommend first getting familiar with how neural networks work through the previous examples, and moving to TensorFlow once you feel comfortable with Keras and the high-level concepts.

TensorFlow gives us a few helper functions for constructing neural networks, but we will build most things from scratch. The one thing we definitely don't want to do by hand is calculate the backward pass for our training steps; luckily, TensorFlow does this for us.

In this example we will create a GRU recurrent neural network to use as our character-level RNN. The steps for creating this network from scratch are:

* Initialize all of our weight matrices: set up their sizes and fill them with random numbers
* Define the calculations that our network must carry out

A GRU cell essentially changes the way the hidden state of a recurrent neural network is calculated. We'll therefore start with a vanilla recurrent neural network and show how to create one using the two steps above.

### Vanilla RNN

The calculations for a recurrent neural network look like the following:

![rnn](images/rnn.png)

To create that, we need to set up three matrices and two bias vectors. Then we specify the calculations exactly as they appear above.

```python
# input-to-hidden, hidden-to-hidden, and hidden-to-output weight matrices
Uh = tf.get_variable("Uh", [input_size, hidden_size], initializer=tf.random_normal_initializer(stddev=0.1))
Wh = tf.get_variable("Wh", [hidden_size, hidden_size], initializer=tf.random_normal_initializer(stddev=0.1))
Vy = tf.get_variable("Vy", [hidden_size, output_size], initializer=tf.random_normal_initializer(stddev=0.1))
# bias vectors for the hidden state and the output
bh = tf.get_variable("bh", [hidden_size], initializer=tf.random_normal_initializer(stddev=0.1))
by = tf.get_variable("by", [output_size], initializer=tf.random_normal_initializer(stddev=0.1))

# new hidden state from the current input and the previous hidden state
hs_t = tf.tanh(tf.matmul(xs_t, Uh) + tf.matmul(hs_t, Wh) + bh)
# probability distribution over the possible outputs
ys_t = tf.nn.softmax(tf.matmul(hs_t, Vy) + by)
```

Simple enough. `input_size` and `output_size` depend on the properties of our data; for our character-level model both equal `vocab_size`. `hidden_size` is a hyperparameter that we can set to anything we wish.
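
To check the shapes involved, here is a rough NumPy sketch of the same step run on a single one-hot input. The toy sizes, the zero-initialized biases, and the `rnn_step` and `softmax` helpers are assumptions chosen for illustration; this is not the TensorFlow graph built later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 5, 8, 5  # toy sizes, for illustration only

Uh = rng.normal(scale=0.1, size=(input_size, hidden_size))   # input -> hidden
Wh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden
Vy = rng.normal(scale=0.1, size=(hidden_size, output_size))  # hidden -> output
bh = np.zeros(hidden_size)
by = np.zeros(output_size)

def softmax(z):
    # Subtract the row maximum for numerical stability.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def rnn_step(x_t, h_prev):
    """One vanilla RNN step: returns the new hidden state and output probabilities."""
    h_t = np.tanh(x_t @ Uh + h_prev @ Wh + bh)
    y_t = softmax(h_t @ Vy + by)
    return h_t, y_t

x_t = np.eye(input_size)[[2]]          # one-hot row vector, shape (1, input_size)
h_prev = np.zeros((1, hidden_size))    # initial hidden state
h_t, y_t = rnn_step(x_t, h_prev)
print(h_t.shape, y_t.shape, y_t.sum())
```

The hidden state keeps the shape `(1, hidden_size)` from step to step, so the returned `h_t` can be fed straight back in as `h_prev` for the next character.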

### Gated Recurrent Unit

Going from a vanilla RNN to a GRU only changes the way that we calculate the hidden state. We will therefore keep all of the matrices that we set up earlier and just add what we need for the GRU calculations. The GRU calculations look as follows:

![gru](images/GRU.png)

Let's add in what we need.

```python
# weight from input to hidden for z gate
Uz = tf.get_variable("Uz", [vocab_size, hidden_size], initializer=tf.random_normal_initializer(stddev=0.1))
# weight from hidden to hidden for z gate
Wz = tf.get_variable("Wz", [hidden_size, hidden_size], initializer=tf.random_normal_initializer(stddev=0.1))
# bias for the z gate calculation
bz = tf.get_variable("bz", [hidden_size], initializer=tf.random_normal_initializer(stddev=0.1))

# weight from input to hidden for r gate
Ur = tf.get_variable("Ur", [vocab_size, hidden_size], initializer=tf.random_normal_initializer(stddev=0.1))
# weight from hidden to hidden for r gate
Wr = tf.get_variable("Wr", [hidden_size, hidden_size], initializer=tf.random_normal_initializer(stddev=0.1))
# bias for the r gate calculation
br = tf.get_variable("br", [hidden_size], initializer=tf.random_normal_initializer(stddev=0.1))

# weight from input to hidden
Uh = tf.get_variable("Uh", [vocab_size, hidden_size], initializer=tf.random_normal_initializer(stddev=0.1))

# recurrent weight matrix, hidden 2 hidden
Wh = tf.get_variable("Wh", [hidden_size, hidden_size], initializer=tf.random_normal_initializer(stddev=0.1))
# bias for hidden matrix
bh = tf.get_variable("bh", [hidden_size], initializer=tf.random_normal_initializer(stddev=0.1))

# output weight matrix
Vy = tf.get_variable("Vy", [hidden_size, vocab_size], initializer=tf.random_normal_initializer(stddev=0.1))
# bias for output matrix
by = tf.get_variable("by", [vocab_size], initializer=tf.random_normal_initializer(stddev=0.1))

# perform the z gate calculation (input uses Uz, hidden state uses Wz)
zt = tf.sigmoid(tf.matmul(xs_t, Uz) + tf.matmul(hs_t, Wz) + bz)
# perform the r gate calculation (input uses Ur, hidden state uses Wr)
rt = tf.sigmoid(tf.matmul(xs_t, Ur) + tf.matmul(hs_t, Wr) + br)
# perform the hidden state calculation
htilda_t = tf.tanh(tf.matmul(xs_t, Uh) + tf.matmul(tf.multiply(rt, hs_t), Wh) + bh)
hs_t = tf.multiply((1 - zt), hs_t) + tf.multiply(zt, htilda_t)
# perform the output calculation
ys_t = tf.matmul(hs_t, Vy) + by
# add the predicted character to the output list
ys.append(ys_t)
```

A bit more involved now, but we've given our recurrent neural network the ability to handle longer-term dependencies in the data.
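
To make the gating concrete, here is a minimal NumPy sketch of a single GRU step using the update `h_t = (1 - z) * h_prev + z * h_tilde`. The `gru_step` helper, the parameter dictionary `p`, and the toy sizes are illustrative assumptions, not part of the notebook. Forcing the update gate `z` towards 0 with a large negative bias leaves the hidden state almost unchanged, which is the mechanism that lets information persist over many time steps.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step; p maps names such as "Uz", "Wz", "bz" to NumPy arrays."""
    z = sigmoid(x_t @ p["Uz"] + h_prev @ p["Wz"] + p["bz"])              # update gate
    r = sigmoid(x_t @ p["Ur"] + h_prev @ p["Wr"] + p["br"])              # reset gate
    h_tilde = np.tanh(x_t @ p["Uh"] + (r * h_prev) @ p["Wh"] + p["bh"])  # candidate state
    return (1 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
n_in, n_hidden = 5, 8  # toy sizes, for illustration only
p = {}
for gate in "zrh":
    p["U" + gate] = rng.normal(scale=0.1, size=(n_in, n_hidden))
    p["W" + gate] = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
    p["b" + gate] = np.zeros(n_hidden)

x_t = np.eye(n_in)[[1]]                   # one-hot input, shape (1, n_in)
h_prev = rng.normal(size=(1, n_hidden))   # some non-zero previous state

# With zero biases the update gate sits near 0.5, so the state changes noticeably.
h_open = gru_step(x_t, h_prev, p)

# A strongly negative update-gate bias pushes z towards 0: the state is carried over.
p_closed = dict(p, bz=np.full(n_hidden, -20.0))
h_closed = gru_step(x_t, h_prev, p_closed)

print(np.abs(h_open - h_prev).max())
print(np.abs(h_closed - h_prev).max())
```

In a trained network the gate values are learned per hidden unit and per time step, so some units can hold on to old information while others are overwritten.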

In [6]:
# set up the place holders for our computational graph
inputs = tf.placeholder(shape=[None, vocab_size], dtype=tf.float32, name='input')
targets = tf.placeholder(shape=[None, vocab_size], dtype=tf.float32, name='targets')
init_state = tf.placeholder(shape=[1, hidden_size], dtype=tf.float32, name='state')

# create an initializer to init our weight matrices
init = tf.random_normal_initializer(stddev=0.1)
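
The `inputs` and `targets` placeholders have shape `[None, vocab_size]`, that is, one row per character with one column per vocabulary entry. A minimal sketch of how a list of character indices could be turned into such rows with NumPy follows; the explicit `vocab_size` argument and the toy indices are assumptions for illustration.

```python
import numpy as np

def one_hot(indices, vocab_size):
    """Return an array of shape (len(indices), vocab_size) with a single 1 per row."""
    return np.eye(vocab_size, dtype=np.float32)[indices]

batch = one_hot([2, 0, 3], vocab_size=4)
print(batch)
print(batch.shape)
```

Indexing the identity matrix with a list of indices picks out one row per index, which is why no explicit loop is needed.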

In [7]:
# set up our recurrent neural network and define the functions
with tf.variable_scope("GRU") as scope:
    # hidden state at time t 
    hs_t = init_state
    # list for output character predictions
    ys = []
    for t, xs_t in enumerate(tf.split(inputs, seq_length, axis=0)):
        if t > 0: scope.reuse_variables()
            
        # weight from input to hidden for z gate
        Uz = tf.get_variable("Uz", [vocab_size, hidden_size], initializer=init)
        # weight from hidden to hidden for z gate
        Wz = tf.get_variable("Wz", [hidden_size, hidden_size], initializer=init)
        # bias for the z gate calculation
        bz = tf.get_variable("bz", [hidden_size], initializer=init)
        
        # weight from input to hidden for r gate
        Ur = tf.get_variable("Ur", [vocab_size, hidden_size], initializer=init)
        # weight from hidden to hidden for r gate
        Wr = tf.get_variable("Wr", [hidden_size, hidden_size], initializer=init)
        # bias for the r gate calculation
        br = tf.get_variable("br", [hidden_size], initializer=init)
        
        # weight from input to hidden
        Uh = tf.get_variable("Uh", [vocab_size, hidden_size], initializer=init)
        
        # recurrent weight matrix, hidden 2 hidden
        Wh = tf.get_variable("Wh", [hidden_size, hidden_size], initializer=init)
        # bias for hidden matrix
        bh = tf.get_variable("bh", [hidden_size], initializer=init)
        
        # output weight matrix
        Vy = tf.get_variable("Vy", [hidden_size, vocab_size], initializer=init)
        # bias for output matrix
        by = tf.get_variable("by", [vocab_size], initializer=init)
        
        # perform the z gate calculation
        zt = tf.sigmoid(tf.matmul(xs_t, Uz) + tf.matmul(hs_t, Wz) + bz)
        # perform the r gate calculation
        rt = tf.sigmoid(tf.matmul(xs_t, Ur) + tf.matmul(hs_t, Wr) + br)
        # perform the hidden state calculation
        htilda_t = tf.tanh(tf.matmul(xs_t, Uh) + tf.matmul(tf.multiply(rt, hs_t), Wh) + bh)
        hs_t = tf.multiply((1 - zt), hs_t) + tf.multiply(zt, htilda_t)
        # perform the output calculation
        ys_t = tf.matmul(hs_t, Vy) + by
        # add the predicted character to the output list
        ys.append(ys_t)


In [8]:
# need to keep track of our hidden states
h_0 = hs_t
# apply the softmax output to the last output of our list
output_softmax = tf.nn.softmax(ys[-1])

# get all of the output characters together
outputs = tf.concat(ys, axis=0)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=targets, logits=outputs))

# optimization algorithm
optimizer = tf.train.AdamOptimizer(learning_rate=0.0005)
grads = optimizer.compute_gradients(loss)

# clip the gradients
grad_clipping = tf.constant(5.0, name='grad_clipping')
clipped_grads = []
for grad, var in grads:
    clipped_grad = tf.clip_by_value(grad, -grad_clipping, grad_clipping)
    clipped_grads.append((clipped_grad, var))
    
# update the weights by applying the clipped gradients with the Adam optimizer
updates = optimizer.apply_gradients(clipped_grads)
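
The clipping above is element-wise: every gradient entry is limited to the range [-5, 5] independently. As a rough NumPy illustration with a made-up gradient vector, the sketch below contrasts this with clipping by norm, a common alternative that rescales the whole vector and so keeps its direction (TensorFlow offers this as `tf.clip_by_norm` and `tf.clip_by_global_norm`). The example values are assumptions, not gradients from this model.

```python
import numpy as np

grad = np.array([-12.0, -0.5, 0.0, 3.0, 40.0])  # made-up gradient values
clip = 5.0

# Clip by value: each entry is limited to [-clip, clip] on its own.
clipped_by_value = np.clip(grad, -clip, clip)

# Clip by norm: rescale the whole vector if its L2 norm exceeds clip.
norm = np.linalg.norm(grad)
clipped_by_norm = grad * min(1.0, clip / norm)

print(clipped_by_value)
print(np.linalg.norm(clipped_by_norm))
```

Clipping by value can change the direction of the update, since large entries are cut more than small ones; clipping by norm shrinks all entries by the same factor.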

In [9]:
# now that all the functions are set up we can run this thing

# function to one hot encode the characters
def one_hot(v):
    return np.eye(vocab_size)[v]

# Session
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

# Initial values
n, p = 0, 0
#hprev_val = np.zeros([1, hidden_size])

for _ in range(epochs):
    # step through the whole text in non-overlapping chunks of seq_length characters
    for chunk in range(0, len(shakes) - seq_length, seq_length):
        # Initialize the hidden state to 0 at the beginning of each sequence
        h_t = np.zeros([1, hidden_size])

        # Prepare inputs
        input_vals  = [char2idx[ch] for ch in shakes[chunk:chunk + seq_length]]
        target_vals = [char2idx[ch] for ch in shakes[chunk + 1:chunk + seq_length + 1]]
        
        #print(shakes[chunk:chunk + seq_length])
        
        #print(input_vals, len(input_vals))
        #print(target_vals, len(target_vals))

        input_vals  = one_hot(input_vals)
        target_vals = one_hot(target_vals)
        
        #print(input_vals, input_vals.shape)
        #print(target_vals, target_vals.shape)       
        
        # run the tensorflow session
        h_t, loss_val, _ = sess.run([h_0, loss, updates],
                                    feed_dict={inputs: input_vals,
                                               targets: target_vals,
                                               init_state: h_t})
        if n % 1000 == 0:
            # Progress
            print('iter: %d, p: %d, loss: %f' % (n, p, loss_val))

            # Do sampling
            sample_length = 200
            start_ix = random.randint(0, len(shakes) - seq_length)
            sample_seq_ix = [char2idx[ch] for ch in shakes[start_ix:start_ix + seq_length]]
            idxs = []
            sample_prev_state_val = np.copy(h_t)

            for t in range(sample_length):
                sample_input_vals = one_hot(sample_seq_ix)
                sample_output_softmax_val, sample_prev_state_val = \
                    sess.run([output_softmax, h_0],
                             feed_dict={inputs: sample_input_vals, init_state: sample_prev_state_val})

                ix = np.random.choice(range(vocab_size), p=sample_output_softmax_val.ravel())
                idxs.append(ix)
                sample_seq_ix = sample_seq_ix[1:] + [ix]

            txt = ''.join(idx2char[ix] for ix in idxs)
            print('----\n %s \n----\n' % (txt,))

        p += seq_length
        n += 1

iter: 0, p: 0, loss: 4.167645
----
 3&U3yoXpxYMlYZeY
NP,WZxGCA!'D
NB?ijujM$EVmUX&,P!CkKrxJZHAzXzfvcGGe'A,zMPA;QjknB3pXlkWHjqtt;PN-lcjG;lPncbRchPZKZ$3kPPSrZE'rdcrqoHwHpJ
,Htts,httedQtiyikQHEgLNeag&KYRrh-Xx .Or:
kIDA$&OZnZmkiDVtycR$ReL3wc 
----

iter: 1000, p: 50000, loss: 1.468880
----
 ERUCSgieuly yeur'beelt, perl se cisl,
Hll, th
 ile cr bu.c
hh then tig lhan
, on, Ptbhen ease
Welt erlouat uo, wha. -a osy you a ntaven cathisheri
WiTenely, bed myouusnnded whlm the
sey, usdy
Rentaln! 
----

iter: 2000, p: 100000, loss: 2.078329
----
 d whulc me
- loct hubl din they and wisit-

First Cithzen:
Hak nit werves 'g cuthoi; Voa jon?
SgeErgithe torthee;ang, wir sheir wire.
Wfmndther'd oo :or.

MENENIUS:

Firl? Whey'prel; fhat os whitW'e t 
----

iter: 3000, p: 150000, loss: 1.706717
----
 Fhe come,, ar h to shell pat sourpd yourud sind ther haw form mominiusustade conosN fhe raln if arber.

All:
Fpreged y'l hin wimy,
sink on the gitsprends desencl, ther, nothes shel meat shed and titiz 
----

i