# Anna KaRNNa

In this notebook, we'll build a character-wise RNN trained on Anna Karenina, one of my all-time favorite books. It'll be able to generate new text based on the text from the book.

This network is based off of Andrej Karpathy's [post on RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) and [implementation in Torch](https://github.com/karpathy/char-rnn). Also, some information [here at r2rt](http://r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html) and from [Sherjil Ozair](https://github.com/sherjilozair/char-rnn-tensorflow) on GitHub. Below is the general architecture of the character-wise RNN.

<img src="assets/charseq.jpeg" width="500">

In [13]:
import time
from collections import namedtuple

import numpy as np
import tensorflow as tf

First we'll load the text file and convert it into integers for our network to use. Here I'm creating a couple dictionaries to convert the characters to and from integers. Encoding the characters as integers makes it easier to use as input in the network.

In [14]:
with open('anna.txt', 'r') as f:
    text=f.read()
vocab = sorted(set(text))
vocab_to_int = {c: i for i, c in enumerate(vocab)}
int_to_vocab = dict(enumerate(vocab))
encoded = np.array([vocab_to_int[c] for c in text], dtype=np.int32)

# Pretty cool way of encoding!

In [15]:
with open('anna.txt','r') as f:
    text = f.read()
vocab = sorted(set(text))
vocab_to_int = {c:i for i,c in enumerate(vocab)}
int_to_vocab = dict(enumerate(vocab))
encoded = np.array([vocab_to_int[c] for c in text],dtype=np.int32)
    

In [16]:
encoded

array([31, 64, 57, ..., 75, 13,  0], dtype=int32)

Let's check out the first 100 characters, make sure everything is peachy. According to the [American Book Review](http://americanbookreview.org/100bestlines.asp), this is the 6th best first line of a book ever.

In [17]:
text[:100]

'Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverythin'

And we can see the characters encoded as integers.

In [18]:
encoded[:100]

array([31, 64, 57, 72, 76, 61, 74,  1, 16,  0,  0,  0, 36, 57, 72, 72, 81,
        1, 62, 57, 69, 65, 68, 65, 61, 75,  1, 57, 74, 61,  1, 57, 68, 68,
        1, 57, 68, 65, 67, 61, 26,  1, 61, 78, 61, 74, 81,  1, 77, 70, 64,
       57, 72, 72, 81,  1, 62, 57, 69, 65, 68, 81,  1, 65, 75,  1, 77, 70,
       64, 57, 72, 72, 81,  1, 65, 70,  1, 65, 76, 75,  1, 71, 79, 70,  0,
       79, 57, 81, 13,  0,  0, 33, 78, 61, 74, 81, 76, 64, 65, 70], dtype=int32)

Since the network is working with individual characters, it's similar to a classification problem in which we are trying to predict the next character from the previous text.  Here's how many 'classes' our network has to pick from.

In [19]:
len(vocab)

83

## Making training mini-batches

Here is where we'll make our mini-batches for training. Remember that we want our batches to be multiple sequences of some desired number of sequence steps. Considering a simple example, our batches would look like this:

<img src="assets/sequence_batching@1x.png" width=500px>


<br>

We start with our text encoded as integers in one long array in `encoded`. Let's create a function that will give us an iterator for our batches. I like using [generator functions](https://jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/) to do this. Then we can pass `encoded` into this function and get our batch generator.

The first thing we need to do is discard some of the text so we only have completely full batches. Each batch contains $N \times M$ characters, where $N$ is the batch size (the number of sequences) and $M$ is the number of steps. Then, to get the total number of batches, $K$, we can make from the array `arr`, you divide the length of `arr` by the number of characters per batch. Once you know the number of batches, you can get the total number of characters to keep from `arr`, $N * M * K$.

After that, we need to split `arr` into $N$ sequences. You can do this using `arr.reshape(size)` where `size` is a tuple containing the dimensions sizes of the reshaped array. We know we want $N$ sequences (`batch_size` below), let's make that the size of the first dimension. For the second dimension, you can use `-1` as a placeholder in the size, it'll fill up the array with the appropriate data for you. After this, you should have an array that is $N \times (M * K)$.

Now that we have this array, we can iterate through it to get our batches. The idea is each batch is a $N \times M$ window on the $N \times (M * K)$ array. For each subsequent batch, the window moves over by `n_steps`. We also want to create both the input and target arrays. Remember that the targets are the inputs shifted over one character. 

The way I like to do this window is use `range` to take steps of size `n_steps` from $0$ to `arr.shape[1]`, the total number of steps in each sequence. That way, the integers you get from `range` always point to the start of a batch, and each window is `n_steps` wide.

> **Exercise:** Write the code for creating batches in the function below. The exercises in this notebook _will not be easy_. I've provided a notebook with solutions alongside this notebook. If you get stuck, checkout the solutions. The most important thing is that you don't copy and paste the code into here, **type out the solution code yourself.**

In [20]:
def sequence(n):
    for i in range(n):
        yield i
        

In [21]:
seq = sequence(3)
next(seq)
next(seq)
next(seq)

2

In [22]:
a = np.array([[1,2],[2,5],[7,8],[2,4],[4,5]])
a.shape
type(encoded)

numpy.ndarray

In [23]:
def get_batches(arr, batch_size, n_steps):
    '''Create a generator that returns batches of size
       batch_size x n_steps from arr.
       
       Arguments
       ---------
       arr: Array you want to make batches from
       batch_size: Batch size, the number of sequences per batch
       n_steps: Number of sequence steps per batch
    '''
    # Get the number of characters per batch and number of batches we can make
    characters_per_batch = batch_size*n_steps
    total_nums = len(arr)
    n_batches = total_nums // characters_per_batch
    print(n_batches)
    # Keep only enough characters to make full batches
    factor = n_batches*characters_per_batch
    arr = arr[:factor]
    #print((len(arr)/batch_size))
    # Reshape into batch_size rows
    arr = arr.reshape((batch_size,int(len(arr)/batch_size)))
    
    for n in range(0, arr.shape[1], n_steps):
        # The features
        print(n)
        #print(n_steps)
        x = arr[:,n:(n+n_steps)]
        # The targets, shifted by one
        y_temp = arr[:,n+1:(n+1+n_steps)]
        # For the very last batch, y will be one character short at the end of 
        # the sequences which breaks things. To get around this, I'll make an 
        # array of the appropriate size first, of all zeros, then add the targets.
        # This will introduce a small artifact in the last batch, but it won't matter.
        y = np.zeros(x.shape, dtype=x.dtype)
        y[:,:y_temp.shape[1]] = y_temp
        yield x, y

In [24]:
for i in range(0,10,2):
    print(i)

0
2
4
6
8


Now I'll make my data sets and we can check out what's going on here. Here I'm going to use a batch size of 10 and 50 sequence steps.

In [25]:
batches = get_batches(encoded, 10, 50)
x1, y1 = next(batches)
x,y = next(batches)
x2,y2 = next(batches)

3970
0
50
100


In [26]:
print('x\n', x1[:10, :10])
print('\ny\n', y1[:10, :10])
print('x\n', x[:10, :10])
print('\ny\n', y[:10, :10])
print('x\n', x2[:10, :10])
print('\ny\n', y2[:10, :10])

x
 [[31 64 57 72 76 61 74  1 16  0]
 [ 1 57 69  1 70 71 76  1 63 71]
 [78 65 70 13  0  0  3 53 61 75]
 [70  1 60 77 74 65 70 63  1 64]
 [ 1 65 76  1 65 75 11  1 75 65]
 [ 1 37 76  1 79 57 75  0 71 70]
 [64 61 70  1 59 71 69 61  1 62]
 [26  1 58 77 76  1 70 71 79  1]
 [76  1 65 75 70  7 76 13  1 48]
 [ 1 75 57 65 60  1 76 71  1 64]]

y
 [[64 57 72 76 61 74  1 16  0  0]
 [57 69  1 70 71 76  1 63 71 65]
 [65 70 13  0  0  3 53 61 75 11]
 [ 1 60 77 74 65 70 63  1 64 65]
 [65 76  1 65 75 11  1 75 65 74]
 [37 76  1 79 57 75  0 71 70 68]
 [61 70  1 59 71 69 61  1 62 71]
 [ 1 58 77 76  1 70 71 79  1 75]
 [ 1 65 75 70  7 76 13  1 48 64]
 [75 57 65 60  1 76 71  1 64 61]]
x
 [[64 57 72 72 81  1 62 57 69 65]
 [76  1 65 70  1 75 72 65 76 61]
 [26  1 76 64 65 74 76 81 12 61]
 [75 13  1 43 70 59 61  1 65 70]
 [71 75 75 65 70 63  1 64 65 69]
 [74  1 64 71 77 75 61  1 58 61]
 [ 1 74 71 71 69 13  0  0 40 61]
 [61 74 61 68 81  0 75 77 59 64]
 [76 64 61  1 68 57 70 60 71 79]
 [70 63 13  1  3 36 61  7 75  1

If you implemented `get_batches` correctly, the above output should look something like 
```
x
 [[55 63 69 22  6 76 45  5 16 35]
 [ 5 69  1  5 12 52  6  5 56 52]
 [48 29 12 61 35 35  8 64 76 78]
 [12  5 24 39 45 29 12 56  5 63]
 [ 5 29  6  5 29 78 28  5 78 29]
 [ 5 13  6  5 36 69 78 35 52 12]
 [63 76 12  5 18 52  1 76  5 58]
 [34  5 73 39  6  5 12 52 36  5]
 [ 6  5 29 78 12 79  6 61  5 59]
 [ 5 78 69 29 24  5  6 52  5 63]]

y
 [[63 69 22  6 76 45  5 16 35 35]
 [69  1  5 12 52  6  5 56 52 29]
 [29 12 61 35 35  8 64 76 78 28]
 [ 5 24 39 45 29 12 56  5 63 29]
 [29  6  5 29 78 28  5 78 29 45]
 [13  6  5 36 69 78 35 52 12 43]
 [76 12  5 18 52  1 76  5 58 52]
 [ 5 73 39  6  5 12 52 36  5 78]
 [ 5 29 78 12 79  6 61  5 59 63]
 [78 69 29 24  5  6 52  5 63 76]]
 ```
 although the exact numbers will be different. Check to make sure the data is shifted over one step for `y`.

## Building the model

Below is where you'll build the network. We'll break it up into parts so it's easier to reason about each bit. Then we can connect them up into the whole network.

<img src="assets/charRNN.png" width=500px>


### Inputs

First off we'll create our input placeholders. As usual we need placeholders for the training data and the targets. We'll also create a placeholder for dropout layers called `keep_prob`. This will be a scalar, that is a 0-D tensor. To make a scalar, you create a placeholder without giving it a size.

> **Exercise:** Create the input placeholders in the function below.

In [40]:
def build_inputs(batch_size, num_steps):
    ''' Define placeholders for inputs, targets, and dropout 
    
        Arguments
        ---------
        batch_size: Batch size, number of sequences per batch
        num_steps: Number of sequence steps in a batch
        
    '''
    # Declare placeholders we'll feed into the graph
    inputs = tf.placeholder(shape=(batch_size,num_steps),dtype = tf.int32, name = "inputs")
    targets = tf.placeholder(shape=(batch_size,num_steps),dtype = tf.int32, name = "targets")
    
    # Keep probability placeholder for drop out layers
    keep_prob = tf.placeholder(dtype = tf.float32, name = "keep_prob")
    
    return inputs, targets, keep_prob

### LSTM Cell

Here we will create the LSTM cell we'll use in the hidden layer. We'll use this cell as a building block for the RNN. So we aren't actually defining the RNN here, just the type of cell we'll use in the hidden layer.

We first create a basic LSTM cell with

```python
lstm = tf.contrib.rnn.BasicLSTMCell(num_units)
```

where `num_units` is the number of units in the hidden layers in the cell. Then we can add dropout by wrapping it with 

```python
tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
```
You pass in a cell and it will automatically add dropout to the inputs or outputs. Finally, we can stack up the LSTM cells into layers with [`tf.contrib.rnn.MultiRNNCell`](https://www.tensorflow.org/versions/r1.0/api_docs/python/tf/contrib/rnn/MultiRNNCell). With this, you pass in a list of cells and it will send the output of one cell into the next cell. Previously with TensorFlow 1.0, you could do this

```python
tf.contrib.rnn.MultiRNNCell([cell]*num_layers)
```

This might look a little weird if you know Python well because this will create a list of the same `cell` object. However, TensorFlow 1.0 will create different weight matrices for all `cell` objects. But, starting with TensorFlow 1.1 you actually need to create new cell objects in the list. To get it to work in TensorFlow 1.1, it should look like

```python
def build_cell(num_units, keep_prob):
    lstm = tf.contrib.rnn.BasicLSTMCell(num_units)
    drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    
    return drop
    
tf.contrib.rnn.MultiRNNCell([build_cell(num_units, keep_prob) for _ in range(num_layers)])
```

Even though this is actually multiple LSTM cells stacked on each other, you can treat the multiple layers as one cell.

We also need to create an initial cell state of all zeros. This can be done like so

```python
initial_state = cell.zero_state(batch_size, tf.float32)
```

Below, we implement the `build_lstm` function to create these LSTM cells and the initial state.

In [28]:
def build_lstm(lstm_size, num_layers, batch_size, keep_prob):
    ''' Build LSTM cell.
    
        Arguments
        ---------
        keep_prob: Scalar tensor (tf.placeholder) for the dropout keep probability
        lstm_size: Size of the hidden layers in the LSTM cells
        num_layers: Number of LSTM layers
        batch_size: Batch size

    '''
    ### Build the LSTM Cell
    # Use a basic LSTM cell
    def build_cell(lstm_size,keep_prob):
        lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
    
    # Add dropout to the cell outputs
        drop = tf.contrib.rnn.DropoutWrapper(lstm ,output_keep_prob=keep_prob)
        return drop
    
    # Stack up multiple LSTM layers, for deep learning
    cell = tf.contrib.rnn.MultiRNNCell([build_cell(lstm_size,keep_prob) for _ in range(num_layers)])
    initial_state = cell.zero_state(batch_size,tf.float32)
    
    return cell, initial_state

### RNN Output

Here we'll create the output layer. We need to connect the output of the RNN cells to a full connected layer with a softmax output. The softmax output gives us a probability distribution we can use to predict the next character, so we want this layer to have size $C$, the number of classes/characters we have in our text.

If our input has batch size $N$, number of steps $M$, and the hidden layer has $L$ hidden units, then the output is a 3D tensor with size $N \times M \times L$. The output of each LSTM cell has size $L$, we have $M$ of them, one for each sequence step, and we have $N$ sequences. So the total size is $N \times M \times L$. 

We are using the same fully connected layer, the same weights, for each of the outputs. Then, to make things easier, we should reshape the outputs into a 2D tensor with shape $(M * N) \times L$. That is, one row for each sequence and step, where the values of each row are the output from the LSTM cells. We get the LSTM output as a list, `lstm_output`. First we need to concatenate this whole list into one array with [`tf.concat`](https://www.tensorflow.org/api_docs/python/tf/concat). Then, reshape it (with `tf.reshape`) to size $(M * N) \times L$.

One we have the outputs reshaped, we can do the matrix multiplication with the weights. We need to wrap the weight and bias variables in a variable scope with `tf.variable_scope(scope_name)` because there are weights being created in the LSTM cells. TensorFlow will throw an error if the weights created here have the same names as the weights created in the LSTM cells, which they will be default. To avoid this, we wrap the variables in a variable scope so we can give them unique names.

> **Exercise:** Implement the output layer in the function below.

In [47]:
def build_output(lstm_output, in_size, out_size):
    ''' Build a softmax layer, return the softmax output and logits.
    
        Arguments
        ---------
        
        lstm_output: List of output tensors from the LSTM layer
        in_size: Size of the input tensor, for example, size of the LSTM cells
        out_size: Size of this softmax layer
    
    '''

    # Reshape output so it's a bunch of rows, one row for each step for each sequence.
    # Concatenate lstm_output over axis 1 (the columns)
    seq_output = tf.concat(lstm_output,axis=1)    # right
    # Reshape seq_output to a 2D tensor with lstm_size columns
    x = tf.reshape(seq_output, [-1,in_size])
    
    # Connect the RNN outputs to a softmax layer
    with tf.variable_scope('softmax'):
        # Create the weight and bias variables here
        softmax_w = tf.Variable(tf.truncated_normal((in_size,out_size),stddev=0.1))
        softmax_b = tf.Variable(tf.zeros(out_size))
    
    # Since output is a bunch of rows of RNN cell outputs, logits will be a bunch
    # of rows of logit outputs, one for each step and sequence
    logits = tf.add(tf.matmul(x,softmax_w),softmax_b)
    
    # Use softmax to get the probabilities for predicted characters
    out = tf.nn.softmax(logits=logits,name="predictions")
    
    return out, logits

### Training loss

Next up is the training loss. We get the logits and targets and calculate the softmax cross-entropy loss. First we need to one-hot encode the targets, we're getting them as encoded characters. Then, reshape the one-hot targets so it's a 2D tensor with size $(M*N) \times C$ where $C$ is the number of classes/characters we have. Remember that we reshaped the LSTM outputs and ran them through a fully connected layer with $C$ units. So our logits will also have size $(M*N) \times C$.

Then we run the logits and targets through `tf.nn.softmax_cross_entropy_with_logits` and find the mean to get the loss.

>**Exercise:** Implement the loss calculation in the function below.

In [50]:
def build_loss(logits, targets, lstm_size, num_classes):
    ''' Calculate the loss from the logits and the targets.
    
        Arguments
        ---------
        logits: Logits from final fully connected layer
        targets: Targets for supervised learning
        lstm_size: Number of LSTM hidden units
        num_classes: Number of classes in targets
        
    '''
    
    # One-hot encode targets and reshape to match logits, one row per sequence per step
    y_one_hot = tf.one_hot(targets,num_classes) 
    y_reshaped =  tf.reshape(y_one_hot,logits.get_shape())
    
    # Softmax cross entropy loss
    loss = tf.nn.softmax_cross_entropy_with_logits(logits=logits,labels=y_reshaped)
    loss = tf.reduce_mean(loss)
    return loss

### Optimizer

Here we build the optimizer. Normal RNNs have have issues gradients exploding and disappearing. LSTMs fix the disappearance problem, but the gradients can still grow without bound. To fix this, we can clip the gradients above some threshold. That is, if a gradient is larger than that threshold, we set it to the threshold. This will ensure the gradients never grow overly large. Then we use an AdamOptimizer for the learning step.

In [51]:
def build_optimizer(loss, learning_rate, grad_clip):
    ''' Build optmizer for training, using gradient clipping.
    
        Arguments:
        loss: Network loss
        learning_rate: Learning rate for optimizer
    
    '''
    
    # Optimizer for training, using gradient clipping to control exploding gradients
    tvars = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvars), grad_clip)
    train_op = tf.train.AdamOptimizer(learning_rate)
    optimizer = train_op.apply_gradients(zip(grads, tvars))
    
    return optimizer

In [None]:
def build_optimizer(loss,learning_rate,grad_clip):
    tvars = tf.trainable_variables()
    grads,_ = tf.clip_by_global_norm(tf.gradients(loss,tvars),grad_clip)
    trian_op = tf.train.AdamOptimizer(learning_rate)
    optimizer = train_op.apply_gradients(zip(grads,tvars))
    return optimizer

### Build the network

Now we can put all the pieces together and build a class for the network. To actually run data through the LSTM cells, we will use [`tf.nn.dynamic_rnn`](https://www.tensorflow.org/versions/r1.0/api_docs/python/tf/nn/dynamic_rnn). This function will pass the hidden and cell states across LSTM cells appropriately for us. It returns the outputs for each LSTM cell at each step for each sequence in the mini-batch. It also gives us the final LSTM state. We want to save this state as `final_state` so we can pass it to the first LSTM cell in the the next mini-batch run. For `tf.nn.dynamic_rnn`, we pass in the cell and initial state we get from `build_lstm`, as well as our input sequences. Also, we need to one-hot encode the inputs before going into the RNN. 

> **Exercise:** Use the functions you've implemented previously and `tf.nn.dynamic_rnn` to build the network.

In [44]:
class CharRNN:
    
    def __init__(self, num_classes, batch_size=64, num_steps=50, 
                       lstm_size=128, num_layers=2, learning_rate=0.001, 
                       grad_clip=5, sampling=False):
    
        # When we're using this network for sampling later, we'll be passing in
        # one character at a time, so providing an option for that
        if sampling == True:
            batch_size, num_steps = 1, 1
        else:
            batch_size, num_steps = batch_size, num_steps

        tf.reset_default_graph()
        
        # Build the input placeholder tensors
        self.inputs, self.targets, self.keep_prob =  build_inputs(batch_size, num_steps)

        # Build the LSTM cell
        cell, self.initial_state = build_lstm(lstm_size, num_layers, batch_size, self.keep_prob)

        ### Run the data through the RNN layers
        # First, one-hot encode the input tokens
        x_one_hot = tf.one_hot(self.inputs,num_classes)
        
        # Run each sequence step through the RNN with tf.nn.dynamic_rnn 
        outputs, state = tf.nn.dynamic_rnn(cell,x_one_hot,initial_state=self.initial_state)
        self.final_state = state
        
        # Get softmax predictions and logits
        self.prediction, self.logits = build_output(outputs, lstm_size, num_classes)
        
        # Loss and optimizer (with gradient clipping)
        self.loss =  build_loss(self.logits, self.targets, lstm_size, num_classes)
        self.optimizer = build_optimizer(self.loss, learning_rate, grad_clip)

## Hyperparameters

Here are the hyperparameters for the network.

* `batch_size` - Number of sequences running through the network in one pass.
* `num_steps` - Number of characters in the sequence the network is trained on. Larger is better typically, the network will learn more long range dependencies. But it takes longer to train. 100 is typically a good number here.
* `lstm_size` - The number of units in the hidden layers.
* `num_layers` - Number of hidden LSTM layers to use
* `learning_rate` - Learning rate for training
* `keep_prob` - The dropout keep probability when training. If you're network is overfitting, try decreasing this.

Here's some good advice from Andrej Karpathy on training the network. I'm going to copy it in here for your benefit, but also link to [where it originally came from](https://github.com/karpathy/char-rnn#tips-and-tricks).

> ## Tips and Tricks

>### Monitoring Validation Loss vs. Training Loss
>If you're somewhat new to Machine Learning or Neural Networks it can take a bit of expertise to get good models. The most important quantity to keep track of is the difference between your training loss (printed during training) and the validation loss (printed once in a while when the RNN is run on the validation data (by default every 1000 iterations)). In particular:

> - If your training loss is much lower than validation loss then this means the network might be **overfitting**. Solutions to this are to decrease your network size, or to increase dropout. For example you could try dropout of 0.5 and so on.
> - If your training/validation loss are about equal then your model is **underfitting**. Increase the size of your model (either number of layers or the raw number of neurons per layer)

> ### Approximate number of parameters

> The two most important parameters that control the model are `lstm_size` and `num_layers`. I would advise that you always use `num_layers` of either 2/3. The `lstm_size` can be adjusted based on how much data you have. The two important quantities to keep track of here are:

> - The number of parameters in your model. This is printed when you start training.
> - The size of your dataset. 1MB file is approximately 1 million characters.

>These two should be about the same order of magnitude. It's a little tricky to tell. Here are some examples:

> - I have a 100MB dataset and I'm using the default parameter settings (which currently print 150K parameters). My data size is significantly larger (100 mil >> 0.15 mil), so I expect to heavily underfit. I am thinking I can comfortably afford to make `lstm_size` larger.
> - I have a 10MB dataset and running a 10 million parameter model. I'm slightly nervous and I'm carefully monitoring my validation loss. If it's larger than my training loss then I may want to try to increase dropout a bit and see if that helps the validation loss.

> ### Best models strategy

>The winning strategy to obtaining very good models (if you have the compute time) is to always err on making the network larger (as large as you're willing to wait for it to compute) and then try different dropout values (between 0,1). Whatever model has the best validation performance (the loss, written in the checkpoint filename, low is good) is the one you should use in the end.

>It is very common in deep learning to run many different models with many different hyperparameter settings, and in the end take whatever checkpoint gave the best validation performance.

>By the way, the size of your training and validation splits are also parameters. Make sure you have a decent amount of data in your validation set or otherwise the validation performance will be noisy and not very informative.

In [48]:
batch_size = 10         # Sequences per batch
num_steps = 50          # Number of sequence steps per batch
lstm_size = 128         # Size of hidden layers in LSTMs
num_layers = 2          # Number of LSTM layers
learning_rate = 0.01    # Learning rate
keep_prob = 0.5         # Dropout keep probability

## Time for training

This is typical training code, passing inputs and targets into the network, then running the optimizer. Here we also get back the final LSTM state for the mini-batch. Then, we pass that state back into the network so the next batch can continue the state from the previous batch. And every so often (set by `save_every_n`) I save a checkpoint.

Here I'm saving checkpoints with the format

`i{iteration number}_l{# hidden layer units}.ckpt`

> **Exercise:** Set the hyperparameters above to train the network. Watch the training loss, it should be consistently dropping. Also, I highly advise running this on a GPU.

In [None]:
epochs = 20
# Print losses every N interations
print_every_n = 50

# Save every N iterations
save_every_n = 200

model = CharRNN(len(vocab), batch_size=batch_size, num_steps=num_steps,
                lstm_size=lstm_size, num_layers=num_layers, 
                learning_rate=learning_rate)

saver = tf.train.Saver(max_to_keep=100)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    # Use the line below to load a checkpoint and resume training
    #saver.restore(sess, 'checkpoints/______.ckpt')
    counter = 0
    for e in range(epochs):
        # Train network
        new_state = sess.run(model.initial_state)
        loss = 0
        for x, y in get_batches(encoded, batch_size, num_steps):
            counter += 1
            start = time.time()
            feed = {model.inputs: x,
                    model.targets: y,
                    model.keep_prob: keep_prob,
                    model.initial_state: new_state}
            batch_loss, new_state, _ = sess.run([model.loss, 
                                                 model.final_state, 
                                                 model.optimizer], 
                                                 feed_dict=feed)
            if (counter % print_every_n == 0):
                end = time.time()
                print('Epoch: {}/{}... '.format(e+1, epochs),
                      'Training Step: {}... '.format(counter),
                      'Training loss: {:.4f}... '.format(batch_loss),
                      '{:.4f} sec/batch'.format((end-start)))
        
            if (counter % save_every_n == 0):
                saver.save(sess, "checkpoints/i{}_l{}.ckpt".format(counter, lstm_size))
    
    saver.save(sess, "checkpoints/i{}_l{}.ckpt".format(counter, lstm_size))

3970
0
50
100
150
200
250
300
350
400
450
500
550
600
650
700
750
800
850
900
950
1000
1050
1100
1150
1200
1250
1300
1350
1400
1450
1500
1550
1600
1650
1700
1750
1800
1850
1900
1950
2000
2050
2100
2150
2200
2250
2300
2350
2400
2450
Epoch: 1/20...  Training Step: 50...  Training loss: 3.1231...  0.1041 sec/batch
2500
2550
2600
2650
2700
2750
2800
2850
2900
2950
3000
3050
3100
3150
3200
3250
3300
3350
3400
3450
3500
3550
3600
3650
3700
3750
3800
3850
3900
3950
4000
4050
4100
4150
4200
4250
4300
4350
4400
4450
4500
4550
4600
4650
4700
4750
4800
4850
4900
4950
Epoch: 1/20...  Training Step: 100...  Training loss: 2.6921...  0.0841 sec/batch
5000
5050
5100
5150
5200
5250
5300
5350
5400
5450
5500
5550
5600
5650
5700
5750
5800
5850
5900
5950
6000
6050
6100
6150
6200
6250
6300
6350
6400
6450
6500
6550
6600
6650
6700
6750
6800
6850
6900
6950
7000
7050
7100
7150
7200
7250
7300
7350
7400
7450
Epoch: 1/20...  Training Step: 150...  Training loss: 2.4518...  0.0917 sec/batch
7500
7550
7600
7650
770

55100
55150
55200
55250
55300
55350
55400
55450
55500
55550
55600
55650
55700
55750
55800
55850
55900
55950
56000
56050
56100
56150
56200
56250
56300
56350
56400
56450
56500
56550
56600
56650
56700
56750
56800
56850
56900
56950
57000
57050
57100
57150
57200
57250
57300
57350
57400
57450
Epoch: 1/20...  Training Step: 1150...  Training loss: 1.9376...  0.0815 sec/batch
57500
57550
57600
57650
57700
57750
57800
57850
57900
57950
58000
58050
58100
58150
58200
58250
58300
58350
58400
58450
58500
58550
58600
58650
58700
58750
58800
58850
58900
58950
59000
59050
59100
59150
59200
59250
59300
59350
59400
59450
59500
59550
59600
59650
59700
59750
59800
59850
59900
59950
Epoch: 1/20...  Training Step: 1200...  Training loss: 1.9200...  0.1014 sec/batch
60000
60050
60100
60150
60200
60250
60300
60350
60400
60450
60500
60550
60600
60650
60700
60750
60800
60850
60900
60950
61000
61050
61100
61150
61200
61250
61300
61350
61400
61450
61500
61550
61600
61650
61700
61750
61800
61850
61900
61950
62000


107700
107750
107800
107850
107900
107950
108000
108050
108100
108150
108200
108250
108300
108350
108400
108450
108500
108550
108600
108650
108700
108750
108800
108850
108900
108950
109000
109050
109100
109150
109200
109250
109300
109350
109400
109450
109500
109550
109600
109650
109700
109750
109800
109850
109900
109950
Epoch: 1/20...  Training Step: 2200...  Training loss: 1.7151...  0.1089 sec/batch
110000
110050
110100
110150
110200
110250
110300
110350
110400
110450
110500
110550
110600
110650
110700
110750
110800
110850
110900
110950
111000
111050
111100
111150
111200
111250
111300
111350
111400
111450
111500
111550
111600
111650
111700
111750
111800
111850
111900
111950
112000
112050
112100
112150
112200
112250
112300
112350
112400
112450
Epoch: 1/20...  Training Step: 2250...  Training loss: 1.7354...  0.1145 sec/batch
112500
112550
112600
112650
112700
112750
112800
112850
112900
112950
113000
113050
113100
113150
113200
113250
113300
113350
113400
113450
113500
113550
113600
1

155100
155150
155200
155250
155300
155350
155400
155450
155500
155550
155600
155650
155700
155750
155800
155850
155900
155950
156000
156050
156100
156150
156200
156250
156300
156350
156400
156450
156500
156550
156600
156650
156700
156750
156800
156850
156900
156950
157000
157050
157100
157150
157200
157250
157300
157350
157400
157450
Epoch: 1/20...  Training Step: 3150...  Training loss: 1.6362...  0.0764 sec/batch
157500
157550
157600
157650
157700
157750
157800
157850
157900
157950
158000
158050
158100
158150
158200
158250
158300
158350
158400
158450
158500
158550
158600
158650
158700
158750
158800
158850
158900
158950
159000
159050
159100
159150
159200
159250
159300
159350
159400
159450
159500
159550
159600
159650
159700
159750
159800
159850
159900
159950
Epoch: 1/20...  Training Step: 3200...  Training loss: 1.8684...  0.0798 sec/batch
160000
160050
160100
160150
160200
160250
160300
160350
160400
160450
160500
160550
160600
160650
160700
160750
160800
160850
160900
160950
161000
1

5600
5650
5700
5750
5800
5850
5900
5950
6000
6050
6100
6150
6200
6250
6300
6350
6400
6450
Epoch: 2/20...  Training Step: 4100...  Training loss: 1.6749...  0.0841 sec/batch
6500
6550
6600
6650
6700
6750
6800
6850
6900
6950
7000
7050
7100
7150
7200
7250
7300
7350
7400
7450
7500
7550
7600
7650
7700
7750
7800
7850
7900
7950
8000
8050
8100
8150
8200
8250
8300
8350
8400
8450
8500
8550
8600
8650
8700
8750
8800
8850
8900
8950
Epoch: 2/20...  Training Step: 4150...  Training loss: 1.6096...  0.0865 sec/batch
9000
9050
9100
9150
9200
9250
9300
9350
9400
9450
9500
9550
9600
9650
9700
9750
9800
9850
9900
9950
10000
10050
10100
10150
10200
10250
10300
10350
10400
10450
10500
10550
10600
10650
10700
10750
10800
10850
10900
10950
11000
11050
11100
11150
11200
11250
11300
11350
11400
11450
Epoch: 2/20...  Training Step: 4200...  Training loss: 1.6843...  0.0967 sec/batch
11500
11550
11600
11650
11700
11750
11800
11850
11900
11950
12000
12050
12100
12150
12200
12250
12300
12350
12400
12450
12500
12550

59500
59550
59600
59650
59700
59750
59800
59850
59900
59950
60000
60050
60100
60150
60200
60250
60300
60350
60400
60450
60500
60550
60600
60650
60700
60750
60800
60850
60900
60950
61000
61050
61100
61150
61200
61250
61300
61350
61400
61450
Epoch: 2/20...  Training Step: 5200...  Training loss: 1.7938...  0.1106 sec/batch
61500
61550
61600
61650
61700
61750
61800
61850
61900
61950
62000
62050
62100
62150
62200
62250
62300
62350
62400
62450
62500
62550
62600
62650
62700
62750
62800
62850
62900
62950
63000
63050
63100
63150
63200
63250
63300
63350
63400
63450
63500
63550
63600
63650
63700
63750
63800
63850
63900
63950
Epoch: 2/20...  Training Step: 5250...  Training loss: 1.6096...  0.0727 sec/batch
64000
64050
64100
64150
64200
64250
64300
64350
64400
64450
64500
64550
64600
64650
64700
64750
64800
64850
64900
64950
65000
65050
65100
65150
65200
65250
65300
65350
65400
65450
65500
65550
65600
65650
65700
65750
65800
65850
65900
65950
66000
66050
66100
66150
66200
66250
66300
66350
66400


111500
111550
111600
111650
111700
111750
111800
111850
111900
111950
112000
112050
112100
112150
112200
112250
112300
112350
112400
112450
112500
112550
112600
112650
112700
112750
112800
112850
112900
112950
113000
113050
113100
113150
113200
113250
113300
113350
113400
113450
113500
113550
113600
113650
113700
113750
113800
113850
113900
113950
Epoch: 2/20...  Training Step: 6250...  Training loss: 1.5858...  0.0786 sec/batch
114000
114050
114100
114150
114200
114250
114300
114350
114400
114450
114500
114550
114600
114650
114700
114750
114800
114850
114900
114950
115000
115050
115100
115150
115200
115250
115300
115350
115400
115450
115500
115550
115600
115650
115700
115750
115800
115850
115900
115950
116000
116050
116100
116150
116200
116250
116300
116350
116400
116450
Epoch: 2/20...  Training Step: 6300...  Training loss: 1.6205...  0.0752 sec/batch
116500
116550
116600
116650
116700
116750
116800
116850
116900
116950
117000
117050
117100
117150
117200
117250
117300
117350
117400
1

159150
159200
159250
159300
159350
159400
159450
159500
159550
159600
159650
159700
159750
159800
159850
159900
159950
160000
160050
160100
160150
160200
160250
160300
160350
160400
160450
160500
160550
160600
160650
160700
160750
160800
160850
160900
160950
161000
161050
161100
161150
161200
161250
161300
161350
161400
161450
Epoch: 2/20...  Training Step: 7200...  Training loss: 1.7397...  0.0752 sec/batch
161500
161550
161600
161650
161700
161750
161800
161850
161900
161950
162000
162050
162100
162150
162200
162250
162300
162350
162400
162450
162500
162550
162600
162650
162700
162750
162800
162850
162900
162950
163000
163050
163100
163150
163200
163250
163300
163350
163400
163450
163500
163550
163600
163650
163700
163750
163800
163850
163900
163950
Epoch: 2/20...  Training Step: 7250...  Training loss: 1.5302...  0.0851 sec/batch
164000
164050
164100
164150
164200
164250
164300
164350
164400
164450
164500
164550
164600
164650
164700
164750
164800
164850
164900
164950
165000
165050
1

10650
10700
10750
10800
10850
10900
10950
11000
11050
11100
11150
11200
11250
11300
11350
11400
11450
11500
11550
11600
11650
11700
11750
11800
11850
11900
11950
12000
12050
12100
12150
12200
12250
12300
12350
12400
12450
12500
12550
12600
12650
12700
12750
12800
12850
12900
12950
Epoch: 3/20...  Training Step: 8200...  Training loss: 1.5919...  0.0750 sec/batch
13000
13050
13100
13150
13200
13250
13300
13350
13400
13450
13500
13550
13600
13650
13700
13750
13800
13850
13900
13950
14000
14050
14100
14150
14200
14250
14300
14350
14400
14450
14500
14550
14600
14650
14700
14750
14800
14850
14900
14950
15000
15050
15100
15150
15200
15250
15300
15350
15400
15450
Epoch: 3/20...  Training Step: 8250...  Training loss: 1.7469...  0.1512 sec/batch
15500
15550
15600
15650
15700
15750
15800
15850
15900
15950
16000
16050
16100
16150
16200
16250
16300
16350
16400
16450
16500
16550
16600
16650
16700
16750
16800
16850
16900
16950
17000
17050
17100
17150
17200
17250
17300
17350
17400
17450
17500
17550


64500
64550
64600
64650
64700
64750
64800
64850
64900
64950
65000
65050
65100
65150
65200
65250
65300
65350
65400
65450
Epoch: 3/20...  Training Step: 9250...  Training loss: 1.7305...  0.0875 sec/batch
65500
65550
65600
65650
65700
65750
65800
65850
65900
65950
66000
66050
66100
66150
66200
66250
66300
66350
66400
66450
66500
66550
66600
66650
66700
66750
66800
66850
66900
66950
67000
67050
67100
67150
67200
67250
67300
67350
67400
67450
67500
67550
67600
67650
67700
67750
67800
67850
67900
67950
Epoch: 3/20...  Training Step: 9300...  Training loss: 1.5179...  0.0898 sec/batch
68000
68050
68100
68150
68200
68250
68300
68350
68400
68450
68500
68550
68600
68650
68700
68750
68800
68850
68900
68950
69000
69050
69100
69150
69200
69250
69300
69350
69400
69450
69500
69550
69600
69650
69700
69750
69800
69850
69900
69950
70000
70050
70100
70150
70200
70250
70300
70350
70400
70450
Epoch: 3/20...  Training Step: 9350...  Training loss: 1.6865...  0.0941 sec/batch
70500
70550
70600
70650
70700
7

115650
115700
115750
115800
115850
115900
115950
116000
116050
116100
116150
116200
116250
116300
116350
116400
116450
116500
116550
116600
116650
116700
116750
116800
116850
116900
116950
117000
117050
117100
117150
117200
117250
117300
117350
117400
117450
117500
117550
117600
117650
117700
117750
117800
117850
117900
117950
Epoch: 3/20...  Training Step: 10300...  Training loss: 1.8128...  0.0886 sec/batch
118000
118050
118100
118150
118200
118250
118300
118350
118400
118450
118500
118550
118600
118650
118700
118750
118800
118850
118900
118950
119000
119050
119100
119150
119200
119250
119300
119350
119400
119450
119500
119550
119600
119650
119700
119750
119800
119850
119900
119950
120000
120050
120100
120150
120200
120250
120300
120350
120400
120450
Epoch: 3/20...  Training Step: 10350...  Training loss: 1.5910...  0.0765 sec/batch
120500
120550
120600
120650
120700
120750
120800
120850
120900
120950
121000
121050
121100
121150
121200
121250
121300
121350
121400
121450
121500
121550

163000
163050
163100
163150
163200
163250
163300
163350
163400
163450
163500
163550
163600
163650
163700
163750
163800
163850
163900
163950
164000
164050
164100
164150
164200
164250
164300
164350
164400
164450
164500
164550
164600
164650
164700
164750
164800
164850
164900
164950
165000
165050
165100
165150
165200
165250
165300
165350
165400
165450
Epoch: 3/20...  Training Step: 11250...  Training loss: 1.6701...  0.0919 sec/batch
165500
165550
165600
165650
165700
165750
165800
165850
165900
165950
166000
166050
166100
166150
166200
166250
166300
166350
166400
166450
166500
166550
166600
166650
166700
166750
166800
166850
166900
166950
167000
167050
167100
167150
167200
167250
167300
167350
167400
167450
167500
167550
167600
167650
167700
167750
167800
167850
167900
167950
Epoch: 3/20...  Training Step: 11300...  Training loss: 1.5407...  0.0782 sec/batch
168000
168050
168100
168150
168200
168250
168300
168350
168400
168450
168500
168550
168600
168650
168700
168750
168800
168850
168900

14700
14750
14800
14850
14900
14950
15000
15050
15100
15150
15200
15250
15300
15350
15400
15450
15500
15550
15600
15650
15700
15750
15800
15850
15900
15950
16000
16050
16100
16150
16200
16250
16300
16350
16400
16450
16500
16550
16600
16650
16700
16750
16800
16850
16900
16950
Epoch: 4/20...  Training Step: 12250...  Training loss: 1.6249...  0.0891 sec/batch
17000
17050
17100
17150
17200
17250
17300
17350
17400
17450
17500
17550
17600
17650
17700
17750
17800
17850
17900
17950
18000
18050
18100
18150
18200
18250
18300
18350
18400
18450
18500
18550
18600
18650
18700
18750
18800
18850
18900
18950
19000
19050
19100
19150
19200
19250
19300
19350
19400
19450
Epoch: 4/20...  Training Step: 12300...  Training loss: 1.5103...  0.0852 sec/batch
19500
19550
19600
19650
19700
19750
19800
19850
19900
19950
20000
20050
20100
20150
20200
20250
20300
20350
20400
20450
20500
20550
20600
20650
20700
20750
20800
20850
20900
20950
21000
21050
21100
21150
21200
21250
21300
21350
21400
21450
21500
21550
2160

68400
68450
68500
68550
68600
68650
68700
68750
68800
68850
68900
68950
69000
69050
69100
69150
69200
69250
69300
69350
69400
69450
Epoch: 4/20...  Training Step: 13300...  Training loss: 1.6613...  0.0821 sec/batch
69500
69550
69600
69650
69700
69750
69800
69850
69900
69950
70000
70050
70100
70150
70200
70250
70300
70350
70400
70450
70500
70550
70600
70650
70700
70750
70800
70850
70900
70950
71000
71050
71100
71150
71200
71250
71300
71350
71400
71450
71500
71550
71600
71650
71700
71750
71800
71850
71900
71950
Epoch: 4/20...  Training Step: 13350...  Training loss: 1.5196...  0.0848 sec/batch
72000
72050
72100
72150
72200
72250
72300
72350
72400
72450
72500
72550
72600
72650
72700
72750
72800
72850
72900
72950
73000
73050
73100
73150
73200
73250
73300
73350
73400
73450
73500
73550
73600
73650
73700
73750
73800
73850
73900
73950
74000
74050
74100
74150
74200
74250
74300
74350
74400
74450
Epoch: 4/20...  Training Step: 13400...  Training loss: 1.6377...  0.0759 sec/batch
74500
74550
7460

119550
119600
119650
119700
119750
119800
119850
119900
119950
120000
120050
120100
120150
120200
120250
120300
120350
120400
120450
120500
120550
120600
120650
120700
120750
120800
120850
120900
120950
121000
121050
121100
121150
121200
121250
121300
121350
121400
121450
121500
121550
121600
121650
121700
121750
121800
121850
121900
121950
Epoch: 4/20...  Training Step: 14350...  Training loss: 1.6615...  0.0793 sec/batch
122000
122050
122100
122150
122200
122250
122300
122350
122400
122450
122500
122550
122600
122650
122700
122750
122800
122850
122900
122950
123000
123050
123100
123150
123200
123250
123300
123350
123400
123450
123500
123550
123600
123650
123700
123750
123800
123850
123900
123950
124000
124050
124100
124150
124200
124250
124300
124350
124400
124450
Epoch: 4/20...  Training Step: 14400...  Training loss: 1.5702...  0.1075 sec/batch
124500
124550
124600
124650
124700
124750
124800
124850
124900
124950
125000
125050
125100
125150
125200
125250
125300
125350
125400
125450

167100
167150
167200
167250
167300
167350
167400
167450
167500
167550
167600
167650
167700
167750
167800
167850
167900
167950
168000
168050
168100
168150
168200
168250
168300
168350
168400
168450
168500
168550
168600
168650
168700
168750
168800
168850
168900
168950
169000
169050
169100
169150
169200
169250
169300
169350
169400
169450
Epoch: 4/20...  Training Step: 15300...  Training loss: 1.6100...  0.0993 sec/batch
169500
169550
169600
169650
169700
169750
169800
169850
169900
169950
170000
170050
170100
170150
170200
170250
170300
170350
170400
170450
170500
170550
170600
170650
170700
170750
170800
170850
170900
170950
171000
171050
171100
171150
171200
171250
171300
171350
171400
171450
171500
171550
171600
171650
171700
171750
171800
171850
171900
171950
Epoch: 4/20...  Training Step: 15350...  Training loss: 1.7263...  0.0805 sec/batch
172000
172050
172100
172150
172200
172250
172300
172350
172400
172450
172500
172550
172600
172650
172700
172750
172800
172850
172900
172950
173000

19500
19550
19600
19650
19700
19750
19800
19850
19900
19950
20000
20050
20100
20150
20200
20250
20300
20350
20400
20450
20500
20550
20600
20650
20700
20750
20800
20850
20900
20950
Epoch: 5/20...  Training Step: 16300...  Training loss: 1.6065...  0.0878 sec/batch
21000
21050
21100
21150
21200
21250
21300
21350
21400
21450
21500
21550
21600
21650
21700
21750
21800
21850
21900
21950
22000
22050
22100
22150
22200
22250
22300
22350
22400
22450
22500
22550
22600
22650
22700
22750
22800
22850
22900
22950
23000
23050
23100
23150
23200
23250
23300
23350
23400
23450
Epoch: 5/20...  Training Step: 16350...  Training loss: 1.6408...  0.0843 sec/batch
23500
23550
23600
23650
23700
23750
23800
23850
23900
23950
24000
24050
24100
24150
24200
24250
24300
24350
24400
24450
24500
24550
24600
24650
24700
24750
24800
24850
24900
24950
25000
25050
25100
25150
25200
25250
25300
25350
25400
25450
25500
25550
25600
25650
25700
25750
25800
25850
25900
25950
Epoch: 5/20...  Training Step: 16400...  Training lo

73100
73150
73200
73250
73300
73350
73400
73450
Epoch: 5/20...  Training Step: 17350...  Training loss: 1.5241...  0.0810 sec/batch
73500
73550
73600
73650
73700
73750
73800
73850
73900
73950
74000
74050
74100
74150
74200
74250
74300
74350
74400
74450
74500
74550
74600
74650
74700
74750
74800
74850
74900
74950
75000
75050
75100
75150
75200
75250
75300
75350
75400
75450
75500
75550
75600
75650
75700
75750
75800
75850
75900
75950
Epoch: 5/20...  Training Step: 17400...  Training loss: 1.5574...  0.0816 sec/batch
76000
76050
76100
76150
76200
76250
76300
76350
76400
76450
76500
76550
76600
76650
76700
76750
76800
76850
76900
76950
77000
77050
77100
77150
77200
77250
77300
77350
77400
77450
77500
77550
77600
77650
77700
77750
77800
77850
77900
77950
78000
78050
78100
78150
78200
78250
78300
78350
78400
78450
Epoch: 5/20...  Training Step: 17450...  Training loss: 1.5970...  0.0828 sec/batch
78500
78550
78600
78650
78700
78750
78800
78850
78900
78950
79000
79050
79100
79150
79200
79250
7930

123600
123650
123700
123750
123800
123850
123900
123950
124000
124050
124100
124150
124200
124250
124300
124350
124400
124450
124500
124550
124600
124650
124700
124750
124800
124850
124900
124950
125000
125050
125100
125150
125200
125250
125300
125350
125400
125450
125500
125550
125600
125650
125700
125750
125800
125850
125900
125950
Epoch: 5/20...  Training Step: 18400...  Training loss: 1.6586...  0.0762 sec/batch
126000
126050
126100
126150
126200
126250
126300
126350
126400
126450
126500
126550
126600
126650
126700
126750
126800
126850
126900
126950
127000
127050
127100
127150
127200
127250
127300
127350
127400
127450
127500
127550
127600
127650
127700
127750
127800
127850
127900
127950
128000
128050
128100
128150
128200
128250
128300
128350
128400
128450
Epoch: 5/20...  Training Step: 18450...  Training loss: 1.4510...  0.0874 sec/batch
128500
128550
128600
128650
128700
128750
128800
128850
128900
128950
129000
129050
129100
129150
129200
129250
129300
129350
129400
129450
129500

171100
171150
171200
171250
171300
171350
171400
171450
171500
171550
171600
171650
171700
171750
171800
171850
171900
171950
172000
172050
172100
172150
172200
172250
172300
172350
172400
172450
172500
172550
172600
172650
172700
172750
172800
172850
172900
172950
173000
173050
173100
173150
173200
173250
173300
173350
173400
173450
Epoch: 5/20...  Training Step: 19350...  Training loss: 1.5752...  0.0828 sec/batch
173500
173550
173600
173650
173700
173750
173800
173850
173900
173950
174000
174050
174100
174150
174200
174250
174300
174350
174400
174450
174500
174550
174600
174650
174700
174750
174800
174850
174900
174950
175000
175050
175100
175150
175200
175250
175300
175350
175400
175450
175500
175550
175600
175650
175700
175750
175800
175850
175900
175950
Epoch: 5/20...  Training Step: 19400...  Training loss: 1.5748...  0.0853 sec/batch
176000
176050
176100
176150
176200
176250
176300
176350
176400
176450
176500
176550
176600
176650
176700
176750
176800
176850
176900
176950
177000

24250
24300
24350
24400
24450
24500
24550
24600
24650
24700
24750
24800
24850
24900
24950
Epoch: 6/20...  Training Step: 20350...  Training loss: 1.7231...  0.0826 sec/batch
25000
25050
25100
25150
25200
25250
25300
25350
25400
25450
25500
25550
25600
25650
25700
25750
25800
25850
25900
25950
26000
26050
26100
26150
26200
26250
26300
26350
26400
26450
26500
26550
26600
26650
26700
26750
26800
26850
26900
26950
27000
27050
27100
27150
27200
27250
27300
27350
27400
27450
Epoch: 6/20...  Training Step: 20400...  Training loss: 1.4277...  0.0832 sec/batch
27500
27550
27600
27650
27700
27750
27800
27850
27900
27950
28000
28050
28100
28150
28200
28250
28300
28350
28400
28450
28500
28550
28600
28650
28700
28750
28800
28850
28900
28950
29000
29050
29100
29150
29200
29250
29300
29350
29400
29450
29500
29550
29600
29650
29700
29750
29800
29850
29900
29950
Epoch: 6/20...  Training Step: 20450...  Training loss: 1.5011...  0.0788 sec/batch
30000
30050
30100
30150
30200
30250
30300
30350
30400
3045

77500
77550
77600
77650
77700
77750
77800
77850
77900
77950
78000
78050
78100
78150
78200
78250
78300
78350
78400
78450
78500
78550
78600
78650
78700
78750
78800
78850
78900
78950
79000
79050
79100
79150
79200
79250
79300
79350
79400
79450
79500
79550
79600
79650
79700
79750
79800
79850
79900
79950
Epoch: 6/20...  Training Step: 21450...  Training loss: 1.5783...  0.0751 sec/batch
80000
80050
80100
80150
80200
80250
80300
80350
80400
80450
80500
80550
80600
80650
80700
80750
80800
80850
80900
80950
81000
81050
81100
81150
81200
81250
81300
81350
81400
81450
81500
81550
81600
81650
81700
81750
81800
81850
81900
81950
82000
82050
82100
82150
82200
82250
82300
82350
82400
82450
Epoch: 6/20...  Training Step: 21500...  Training loss: 1.5087...  0.0966 sec/batch
82500
82550
82600
82650
82700
82750
82800
82850
82900
82950
83000
83050
83100
83150
83200
83250
83300
83350
83400
83450
83500
83550
83600
83650
83700
83750
83800
83850
83900
83950
84000
84050
84100
84150
84200
84250
84300
84350
8440

127500
127550
127600
127650
127700
127750
127800
127850
127900
127950
128000
128050
128100
128150
128200
128250
128300
128350
128400
128450
128500
128550
128600
128650
128700
128750
128800
128850
128900
128950
129000
129050
129100
129150
129200
129250
129300
129350
129400
129450
129500
129550
129600
129650
129700
129750
129800
129850
129900
129950
Epoch: 6/20...  Training Step: 22450...  Training loss: 1.5403...  0.0819 sec/batch
130000
130050
130100
130150
130200
130250
130300
130350
130400
130450
130500
130550
130600
130650
130700
130750
130800
130850
130900
130950
131000
131050
131100
131150
131200
131250
131300
131350
131400
131450
131500
131550
131600
131650
131700
131750
131800
131850
131900
131950
132000
132050
132100
132150
132200
132250
132300
132350
132400
132450
Epoch: 6/20...  Training Step: 22500...  Training loss: 1.7994...  0.0783 sec/batch
132500
132550
132600
132650
132700
132750
132800
132850
132900
132950
133000
133050
133100
133150
133200
133250
133300
133350
133400

175150
175200
175250
175300
175350
175400
175450
175500
175550
175600
175650
175700
175750
175800
175850
175900
175950
176000
176050
176100
176150
176200
176250
176300
176350
176400
176450
176500
176550
176600
176650
176700
176750
176800
176850
176900
176950
177000
177050
177100
177150
177200
177250
177300
177350
177400
177450
Epoch: 6/20...  Training Step: 23400...  Training loss: 1.6051...  0.0799 sec/batch
177500
177550
177600
177650
177700
177750
177800
177850
177900
177950
178000
178050
178100
178150
178200
178250
178300
178350
178400
178450
178500
178550
178600
178650
178700
178750
178800
178850
178900
178950
179000
179050
179100
179150
179200
179250
179300
179350
179400
179450
179500
179550
179600
179650
179700
179750
179800
179850
179900
179950
Epoch: 6/20...  Training Step: 23450...  Training loss: 1.5843...  0.0796 sec/batch
180000
180050


#### Saved checkpoints

Read up on saving and loading checkpoints here: https://www.tensorflow.org/programmers_guide/variables

In [None]:
tf.train.get_checkpoint_state('checkpoints')

## Sampling

Now that the network is trained, we'll can use it to generate new text. The idea is that we pass in a character, then the network will predict the next character. We can use the new one, to predict the next one. And we keep doing this to generate all new text. I also included some functionality to prime the network with some text by passing in a string and building up a state from that.

The network gives us predictions for each character. To reduce noise and make things a little less random, I'm going to only choose a new character from the top N most likely characters.



In [None]:
def pick_top_n(preds, vocab_size, top_n=5):
    p = np.squeeze(preds)
    p[np.argsort(p)[:-top_n]] = 0
    p = p / np.sum(p)
    c = np.random.choice(vocab_size, 1, p=p)[0]
    return c

In [None]:
def sample(checkpoint, n_samples, lstm_size, vocab_size, prime="The "):
    samples = [c for c in prime]
    model = CharRNN(len(vocab), lstm_size=lstm_size, sampling=True)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        saver.restore(sess, checkpoint)
        new_state = sess.run(model.initial_state)
        for c in prime:
            x = np.zeros((1, 1))
            x[0,0] = vocab_to_int[c]
            feed = {model.inputs: x,
                    model.keep_prob: 1.,
                    model.initial_state: new_state}
            preds, new_state = sess.run([model.prediction, model.final_state], 
                                         feed_dict=feed)

        c = pick_top_n(preds, len(vocab))
        samples.append(int_to_vocab[c])

        for i in range(n_samples):
            x[0,0] = c
            feed = {model.inputs: x,
                    model.keep_prob: 1.,
                    model.initial_state: new_state}
            preds, new_state = sess.run([model.prediction, model.final_state], 
                                         feed_dict=feed)

            c = pick_top_n(preds, len(vocab))
            samples.append(int_to_vocab[c])
        
    return ''.join(samples)

Here, pass in the path to a checkpoint and sample from the network.

In [None]:
tf.train.latest_checkpoint('checkpoints')

In [None]:
checkpoint = tf.train.latest_checkpoint('checkpoints')
samp = sample(checkpoint, 2000, lstm_size, len(vocab), prime="Far")
print(samp)

In [None]:
checkpoint = 'checkpoints/i200_l512.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

In [None]:
checkpoint = 'checkpoints/i600_l512.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

In [None]:
checkpoint = 'checkpoints/i1200_l512.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)