# Anna KaRNNa

In this notebook, I'll build a character-wise RNN trained on Anna Karenina, one of my all-time favorite books. It'll be able to generate new text based on the text from the book.

This network is based on Andrej Karpathy's [post on RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) and [implementation in Torch](https://github.com/karpathy/char-rnn). I also drew on information [here at r2rt](http://r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html) and from [Sherjil Ozair](https://github.com/sherjilozair/char-rnn-tensorflow) on GitHub. Below is the general architecture of the character-wise RNN.

<img src="assets/charseq.jpeg" width="500">

In [1]:
import time
from collections import namedtuple

import numpy as np
import tensorflow as tf

First we'll load the text file and convert it into integers for our network to use. Here I'm creating a couple dictionaries to convert the characters to and from integers. Encoding the characters as integers makes it easier to use as input in the network.

In [6]:
with open('anna.txt', 'r') as f:
    text=f.read()
vocab = set(text)
vocab_to_int = {c: i for i, c in enumerate(vocab)}
int_to_vocab = dict(enumerate(vocab))
encoded = np.array([vocab_to_int[c] for c in text], dtype=np.int32)
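As a quick sanity check of the two dictionaries, here's a minimal sketch of the encode/decode round trip on a toy string. The `toy_*` names and the sample string are made up for illustration, and the `sorted` call is my own addition: it makes the character-to-integer mapping identical on every run, which iterating over a bare `set` doesn't guarantee.

```python
import numpy as np

toy_text = "happy families"  # toy text, stands in for the contents of anna.txt

# Sorting makes the character-to-integer mapping the same on every run
toy_vocab = sorted(set(toy_text))
toy_to_int = {c: i for i, c in enumerate(toy_vocab)}
int_to_toy = dict(enumerate(toy_vocab))

# Encode to integers, then decode back to characters
toy_encoded = np.array([toy_to_int[c] for c in toy_text], dtype=np.int32)
toy_decoded = ''.join(int_to_toy[int(i)] for i in toy_encoded)

print(toy_encoded)
print(toy_decoded)
```

If the round trip gives back the original string, the two dictionaries are consistent with each other.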

Let's check out the first 100 characters, make sure everything is peachy. According to the [American Book Review](http://americanbookreview.org/100bestlines.asp), this is the 6th best first line of a book ever.

In [7]:
text[:100]

'Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverythin'

And we can see the characters encoded as integers.

In [8]:
encoded[:100]

array([52, 51,  4,  0, 21, 48, 82, 34, 61, 81, 81, 81, 54,  4,  0,  0, 10,
       34,  6,  4, 46, 19, 60, 19, 48,  3, 34,  4, 82, 48, 34,  4, 60, 60,
       34,  4, 60, 19, 37, 48, 71, 34, 48, 35, 48, 82, 10, 34, 13, 43, 51,
        4,  0,  0, 10, 34,  6,  4, 46, 19, 60, 10, 34, 19,  3, 34, 13, 43,
       51,  4,  0,  0, 10, 34, 19, 43, 34, 19, 21,  3, 34, 68, 69, 43, 81,
       69,  4, 10, 62, 81, 81, 11, 35, 48, 82, 10, 21, 51, 19, 43], dtype=int32)

Since the network is working with individual characters, it's similar to a classification problem in which we are trying to predict the next character from the previous text. Here's how many 'classes' our network has to pick from.

In [5]:
len(vocab)

83

## Making training and validation batches

Now I need to split up the data into batches - and into training and validation sets. I should be making a test set here, but I'm not going to worry about that. My test will be if the network can generate new text.

Here I'll make both input and target arrays. The targets are the same as the inputs, except shifted one character over. I'll also drop the last bit of data so that I'll only have completely full batches.

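The idea here is to make a 2D matrix where the number of rows is equal to the number of sequences per batch (`n_seqs`). Each row will be one long concatenated string from the character data, and the generator walks across this matrix `n_steps` columns at a time. We'll split this data into a training set and validation set using the `split_frac` keyword. This will keep 90% of the batches in the training set, the other 10% in the validation set. Note that the split is done by the `split_data` call in the training cell, not by `get_batches` itself, and `split_data` is not shown in this section.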

In [9]:
def get_batches(arr, n_seqs, n_steps):
    '''Create a generator that returns batches of size
       n_seqs x n_steps from arr.
    '''
    
    batch_size = n_seqs * n_steps
    n_batches = len(arr)//batch_size
    
    # Keep only enough characters to make full batches
    arr = arr[:n_batches * batch_size]
    
    # Reshape into n_seqs rows
    arr = arr.reshape((n_seqs, -1))
    
    for n in range(0, arr.shape[1], n_steps):
        # The features
        x = arr[:, n:n+n_steps]
        # The targets, shifted by one. The last column wraps around to the
        # first character of x rather than the true next character.
        y = np.zeros_like(x)
        y[:, :-1], y[:, -1] = x[:, 1:], x[:, 0]
        yield x, y

Now I'll make my data sets and we can check out what's going on here. Here I'm going to use a batch size of 10 and 50 sequence steps.

In [10]:
batches = get_batches(encoded, 10, 50)
x, y = next(batches)

In [19]:
print('x\n', x[:10, :10])
print('\ny\n', y[:10, :10])

x
 [[52 51  4  0 21 48 82 34 61 81]
 [34  4 46 34 43 68 21 34 40 68]
 [35 19 43 62 81 81 15 64 48  3]
 [43 34 25 13 82 19 43 40 34 51]
 [34 19 21 34 19  3 39 34  3 19]
 [34 78 21 34 69  4  3 81 68 43]
 [51 48 43 34  9 68 46 48 34  6]
 [71 34 32 13 21 34 43 68 69 34]
 [21 34 19  3 43 50 21 62 34 72]
 [34  3  4 19 25 34 21 68 34 51]]

y
 [[51  4  0 21 48 82 34 61 81 81]
 [ 4 46 34 43 68 21 34 40 68 19]
 [19 43 62 81 81 15 64 48  3 39]
 [34 25 13 82 19 43 40 34 51 19]
 [19 21 34 19  3 39 34  3 19 82]
 [78 21 34 69  4  3 81 68 43 60]
 [48 43 34  9 68 46 48 34  6 68]
 [34 32 13 21 34 43 68 69 34  3]
 [34 19  3 43 50 21 62 34 72 51]
 [ 3  4 19 25 34 21 68 34 51 48]]
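The shapes are easier to follow on a tiny array. Here's a small sketch that repeats the `get_batches` logic from above so it runs on its own, and feeds it the numbers 0 to 22 instead of encoded text (the `toy` names and sizes are chosen only for illustration).

```python
import numpy as np

def get_batches(arr, n_seqs, n_steps):
    # Same logic as the get_batches cell above, repeated so this runs on its own
    batch_size = n_seqs * n_steps
    n_batches = len(arr) // batch_size
    arr = arr[:n_batches * batch_size].reshape((n_seqs, -1))
    for n in range(0, arr.shape[1], n_steps):
        x = arr[:, n:n+n_steps]
        y = np.zeros_like(x)
        y[:, :-1], y[:, -1] = x[:, 1:], x[:, 0]
        yield x, y

toy = np.arange(23)  # 23 "characters"; the last 3 don't fill a batch and get dropped
toy_batches = list(get_batches(toy, n_seqs=2, n_steps=5))
first_x, first_y = toy_batches[0]

print(first_x)
print(first_y)
```

Each row of `first_y` is the matching row of `first_x` shifted left by one. The exception is the last column, which wraps around to the first value of the same window, so the final target of every window is only an approximation.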


## Building the model

Below is where you'll build the network. We'll break it up into parts so it's easier to reason about each bit. Then we can connect them up into the whole network.

![Character RNN](assets/charRNN.png)

### Inputs

First off we'll create our input placeholders. As usual we need placeholders for the training data and the targets. We'll also create a placeholder for dropout layers called `keep_prob`.

In [49]:
def build_inputs(batch_size, num_steps):
    ''' Define placeholders for inputs, targets, and dropout 
    
        Arguments
        ---------
        batch_size: Batch size, number of sequences per batch
        num_steps: Number of sequence steps in a batch
        
    '''
    # Declare placeholders we'll feed into the graph
    inputs = tf.placeholder(tf.int32, [batch_size, num_steps], name='inputs')
    targets = tf.placeholder(tf.int32, [batch_size, num_steps], name='targets')
    
    # Keep probability placeholder for drop out layers
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    
    return inputs, targets, keep_prob
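To make it concrete what `keep_prob` controls, here's a rough NumPy sketch of dropout. The function name `dropout_sketch` and the toy array are made up for illustration; this is not what TensorFlow runs internally, though it follows the same idea of scaling the kept values by `1/keep_prob`.

```python
import numpy as np

def dropout_sketch(x, keep_prob, rng):
    # Keep each element with probability keep_prob, zero the rest,
    # and scale the survivors by 1/keep_prob so the expected value is unchanged
    mask = rng.random_sample(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.RandomState(0)
activations = np.ones((4, 6))

train_out = dropout_sketch(activations, keep_prob=0.5, rng=rng)  # training-style
eval_out = dropout_sketch(activations, keep_prob=1.0, rng=rng)   # validation/sampling-style

print(train_out)
print(eval_out)
```

With `keep_prob=1.0` nothing is dropped, which is why validation feeds 1 into the placeholder.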

### LSTM Cell

Here we will create the LSTM cell we'll use in the hidden layer. We'll use this cell as a building block for the RNN. So we aren't actually defining the RNN here, just the type of cell we'll use in the hidden layer.

We first create a basic LSTM cell with

```python
lstm = tf.contrib.rnn.BasicLSTMCell(num_units)
```

where `num_units` is the number of units in the hidden layers in the cell. Then we can add dropout by wrapping it with 

```python
tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
```
You pass in a cell and it will automatically add dropout to the inputs or outputs. Finally, we can stack up the LSTM cells into layers with `tf.contrib.rnn.MultiRNNCell`. With this, you pass in a list of cells and it will send the outputs of each cell into the next. For example,

```python
tf.contrib.rnn.MultiRNNCell([cell]*num_layers)
```

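This might look a little weird if you know Python well because this will create a list of the same `cell` object. In TensorFlow 1.0, different weight matrices are still created for each layer. In later 1.x releases, however, reusing one cell object this way either raises an error or shares weights between layers, so it is safer to construct a new cell for every layer.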

We also need to create an initial cell state of all zeros. This can be done like so

```python
initial_state = cell.zero_state(batch_size, tf.float32)
```

Below, we implement the `build_lstm` function to create these LSTM cells and the initial state.

In [50]:
def build_lstm(lstm_size, num_layers, batch_size, keep_prob):
    ''' Build LSTM layer.
    
        Arguments
        ---------
        keep_prob: Scalar tensor (tf.placeholder) for the dropout keep probability
        lstm_size: Size of the hidden layers in the LSTM cells
        num_layers: Number of LSTM layers
        batch_size: Batch size

    '''
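    ### Build the LSTM Cell
    def build_cell():
        # Use a basic LSTM cell
        lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
        # Add dropout to the cell outputs
        return tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    
    # Stack up multiple LSTM layers, for deep learning.
    # A new cell is built for each layer; reusing one object ([drop] * num_layers)
    # only gives separate weights per layer in TensorFlow 1.0.
    cell = tf.contrib.rnn.MultiRNNCell([build_cell() for _ in range(num_layers)])
    initial_state = cell.zero_state(batch_size, tf.float32)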
    
    return cell, initial_state
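If you're curious what one of these cells computes at a single time step, here's a NumPy sketch of a basic LSTM update. Treat it as an illustration under assumptions: the gate ordering, the `forget_bias` of 1.0, the toy sizes, and all names here are my own choices, and the actual TensorFlow cell manages its own weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b, forget_bias=1.0):
    # One matrix multiply produces all four gate pre-activations at once
    z = np.concatenate([x, h], axis=1) @ W + b
    i, j, f, o = np.split(z, 4, axis=1)
    # Forget part of the old cell state, add the gated candidate values
    new_c = c * sigmoid(f + forget_bias) + sigmoid(i) * np.tanh(j)
    # The hidden state is a gated, squashed view of the cell state
    new_h = np.tanh(new_c) * sigmoid(o)
    return new_h, new_c

rng = np.random.RandomState(0)
n_seqs, n_inputs, n_units = 3, 5, 4      # toy sizes, not the notebook's
W = rng.normal(scale=0.1, size=(n_inputs + n_units, 4 * n_units))
b = np.zeros(4 * n_units)
h = np.zeros((n_seqs, n_units))          # all-zero start, like cell.zero_state
c = np.zeros((n_seqs, n_units))
x = rng.normal(size=(n_seqs, n_inputs))

h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)
```

Here `n_units` plays the role of `lstm_size`: both the hidden state and the cell state have one row per sequence and `n_units` columns.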

### RNN Output

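Here we'll create the output layer. As mentioned previously, predicting the next character is similar to a classification problem, so we connect the LSTM outputs to a softmax layer with one output per character class.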

In [51]:
def build_output(lstm_output, in_size, out_size):
    ''' Build a softmax layer, return the softmax output and logits.
    
        Arguments
        ---------
        
        lstm_output: Output tensor from the LSTM layers
        in_size: Size of the input tensor, for example, size of the LSTM cells
        out_size: Size of this softmax layer
    
    '''

    # Reshape output so it's a bunch of rows, one row for each step for each sequence.
    # That is, the shape should be batch_size*num_steps rows by in_size columns
    seq_output = tf.concat(lstm_output, axis=1)
    x = tf.reshape(seq_output, [-1, in_size])
    
    # Connect the RNN outputs to a softmax layer
    with tf.variable_scope('softmax'):
        softmax_w = tf.Variable(tf.truncated_normal((in_size, out_size), stddev=0.1))
        softmax_b = tf.Variable(tf.zeros(out_size))
    
    # Since output is a bunch of rows of RNN cell outputs, logits will be a bunch
    # of rows of logit outputs, one for each step and batch
    logits = tf.matmul(x, softmax_w) + softmax_b
    
    # Use softmax to get the probabilities for predicted characters
    out = tf.nn.softmax(logits, name='predictions')
    
    return out, logits
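The reshape is the part that's easy to get lost in, so here's a NumPy sketch of the same steps on toy sizes: flatten the 3D RNN output into rows, apply one weight matrix, and take a softmax per row. All `toy_*` names and sizes are invented for illustration.

```python
import numpy as np

def softmax(logits):
    # Subtract the row max before exponentiating for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.RandomState(0)
toy_seqs, toy_steps, toy_lstm, toy_classes = 2, 3, 4, 5

# Stand-in for the RNN output: one vector per sequence per step
toy_output = rng.normal(size=(toy_seqs, toy_steps, toy_lstm))

# One row per (sequence, step) pair: all steps of sequence 0, then sequence 1
flat = toy_output.reshape(-1, toy_lstm)

toy_w = rng.normal(scale=0.1, size=(toy_lstm, toy_classes))
toy_b = np.zeros(toy_classes)
toy_logits = flat @ toy_w + toy_b
toy_probs = softmax(toy_logits)

print(flat.shape, toy_probs.shape)
print(toy_probs.sum(axis=1))
```

Every row of `toy_probs` sums to 1, so each row can be read as a probability distribution over the next character.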

In [52]:
def build_loss(logits, targets, lstm_size, num_classes):
    ''' Calculate the loss from the logits and the targets.
    
        Arguments
        ---------
        logits: Logits from final fully connected layer
        targets: Targets for supervised learning
        lstm_size: Number of LSTM hidden units
        num_classes: Number of classes in targets
        
    '''
    
    # One-hot encode targets and reshape to match logits, one row per batch_size per step
    y_one_hot = tf.one_hot(targets, num_classes)
    y_reshaped = tf.reshape(y_one_hot, logits.get_shape())
    
    # Softmax cross entropy loss
    loss = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_reshaped)
    loss = tf.reduce_mean(loss)
    return loss
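A useful reference point for this loss: if the network knows nothing and spreads its probability evenly over all 83 characters, the cross entropy is ln(83), about 4.42. Here's a NumPy sketch of the one-hot cross entropy that checks this (the function name and toy targets are made up for illustration, and this is a simplified stand-in for the TensorFlow op).

```python
import numpy as np

def mean_cross_entropy(logits, targets, num_classes):
    # One-hot encode the integer targets, one row per prediction
    one_hot = np.eye(num_classes)[targets]
    # Log-softmax, computed stably by subtracting the row max first
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Cross entropy per row, then the mean over all rows
    return -(one_hot * log_probs).sum(axis=1).mean()

num_chars = 83
toy_targets = np.array([0, 7, 82])
uniform_logits = np.zeros((3, num_chars))  # equal logits give a uniform prediction

uniform_loss = mean_cross_entropy(uniform_logits, toy_targets, num_chars)
print(round(uniform_loss, 4))
```

That's close to the 4.4168 reported for the first training iteration in the log further down, which is what you'd expect from a freshly initialized network.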

In [53]:
def build_optimizer(loss, learning_rate, grad_clip):
    ''' Build optimizer for training, using gradient clipping.
    
        Arguments:
        loss: Network loss
        learning_rate: Learning rate for optimizer
        grad_clip: Threshold for the global norm of the gradients
    
    '''
    
    # Optimizer for training, using gradient clipping to control exploding gradients
    tvars = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvars), grad_clip)
    train_op = tf.train.AdamOptimizer(learning_rate)
    optimizer = train_op.apply_gradients(zip(grads, tvars))
    
    return optimizer
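Clipping by global norm means all the gradients are treated as one long vector and are scaled down together if that vector's length exceeds `grad_clip`. Here's a NumPy sketch of the arithmetic (the function name and toy gradients are invented for illustration).

```python
import numpy as np

def clip_by_global_norm_sketch(grads, clip_norm):
    # Length of all gradients taken together as a single vector
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Scale is 1 when the norm is already within the threshold
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

toy_grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm_sketch(toy_grads, clip_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print(norm_before, norm_after)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its size is limited.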

In [54]:
class CharRNN:
    
    def __init__(self, num_classes, batch_size=64, num_steps=50, 
                       lstm_size=128, num_layers=2, learning_rate=0.001, 
                       grad_clip=5, sampling=False):
    
        # When we're using this network for sampling later, we'll be passing in
        # one character at a time, so providing an option for that
        if sampling:
            batch_size, num_steps = 1, 1

        tf.reset_default_graph()
        
        # Build the input placeholder tensors
        self.inputs, self.targets, self.keep_prob = build_inputs(batch_size, num_steps)

        # Build the LSTM layer
        cell, self.initial_state = build_lstm(lstm_size, num_layers, batch_size, self.keep_prob)

        ### Run the data through the RNN layers
        # First, one-hot encode the input tokens
        x_one_hot = tf.one_hot(self.inputs, num_classes)
        
        # Run each sequence step through the RNN and collect the outputs
        outputs, state = tf.nn.dynamic_rnn(cell, x_one_hot, initial_state=self.initial_state)
        self.final_state = state

        self.prediction, self.logits = build_output(outputs, lstm_size, num_classes)

        self.loss = build_loss(self.logits, self.targets, lstm_size, num_classes)
        self.optimizer = build_optimizer(self.loss, learning_rate, grad_clip)

## Hyperparameters

Here I'm defining the hyperparameters for the network. 

* `batch_size` - Number of sequences running through the network in one pass.
* `num_steps` - Number of characters in the sequence the network is trained on. Larger is typically better since the network will learn more long-range dependencies, but it takes longer to train. 100 is typically a good number here.
* `lstm_size` - The number of units in the hidden layers.
* `num_layers` - Number of hidden LSTM layers to use
* `learning_rate` - Learning rate for training
* `keep_prob` - The dropout keep probability when training. If your network is overfitting, try decreasing this.

Here's some good advice from Andrej Karpathy on training the network. I'm going to copy it in here for your benefit, but also link to [where it originally came from](https://github.com/karpathy/char-rnn#tips-and-tricks).

> ## Tips and Tricks

>### Monitoring Validation Loss vs. Training Loss
>If you're somewhat new to Machine Learning or Neural Networks it can take a bit of expertise to get good models. The most important quantity to keep track of is the difference between your training loss (printed during training) and the validation loss (printed once in a while when the RNN is run on the validation data (by default every 1000 iterations)). In particular:

> - If your training loss is much lower than validation loss then this means the network might be **overfitting**. Solutions to this are to decrease your network size, or to increase dropout. For example you could try dropout of 0.5 and so on.
> - If your training/validation loss are about equal then your model is **underfitting**. Increase the size of your model (either number of layers or the raw number of neurons per layer)

> ### Approximate number of parameters

> The two most important parameters that control the model are `lstm_size` and `num_layers`. I would advise that you always use `num_layers` of either 2/3. The `lstm_size` can be adjusted based on how much data you have. The two important quantities to keep track of here are:

> - The number of parameters in your model. This is printed when you start training.
> - The size of your dataset. 1MB file is approximately 1 million characters.

>These two should be about the same order of magnitude. It's a little tricky to tell. Here are some examples:

> - I have a 100MB dataset and I'm using the default parameter settings (which currently print 150K parameters). My data size is significantly larger (100 mil >> 0.15 mil), so I expect to heavily underfit. I am thinking I can comfortably afford to make `lstm_size` larger.
> - I have a 10MB dataset and running a 10 million parameter model. I'm slightly nervous and I'm carefully monitoring my validation loss. If it's larger than my training loss then I may want to try to increase dropout a bit and see if that helps the validation loss.

> ### Best models strategy

>The winning strategy to obtaining very good models (if you have the compute time) is to always err on making the network larger (as large as you're willing to wait for it to compute) and then try different dropout values (between 0,1). Whatever model has the best validation performance (the loss, written in the checkpoint filename, low is good) is the one you should use in the end.

>It is very common in deep learning to run many different models with many different hyperparameter settings, and in the end take whatever checkpoint gave the best validation performance.

>By the way, the size of your training and validation splits are also parameters. Make sure you have a decent amount of data in your validation set or otherwise the validation performance will be noisy and not very informative.


In [65]:
batch_size = 100
num_steps = 100 
lstm_size = 512
num_layers = 2
learning_rate = 0.001
keep_prob = 0.5
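Since the advice above is to compare the number of parameters with the size of the dataset, here's a back-of-the-envelope count for these settings. It's a sketch that assumes a basic LSTM layer has four gates, each with weights over the concatenated input and hidden state plus a bias, followed by the softmax layer. The function names are made up for illustration.

```python
def lstm_layer_params(n_inputs, n_units):
    # Four gates, each with weights over [input, hidden] plus a bias
    return 4 * ((n_inputs + n_units) * n_units + n_units)

def char_rnn_params(n_classes, n_units, n_layers):
    # The first layer sees the one-hot encoded characters
    total = lstm_layer_params(n_classes, n_units)
    # Every further layer sees the previous layer's output
    total += (n_layers - 1) * lstm_layer_params(n_units, n_units)
    # Softmax weights and bias
    total += n_units * n_classes + n_classes
    return total

n_params = char_rnn_params(n_classes=83, n_units=512, n_layers=2)
print(n_params)
```

Under these assumptions the count comes to roughly 3.4 million parameters, which you can compare against `len(text)` to see which regime from the tips above you're in.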

## Training

Time for training, which is pretty straightforward. Here I pass in some data and get an LSTM state back. Then I pass that state back into the network so the next batch can continue the state from the previous batch. And every so often (set by `save_every_n`) I calculate the validation loss and save a checkpoint.

Here I'm saving checkpoints with the format

`i{iteration number}_l{# hidden layer units}_v{validation loss}.ckpt`
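The training cell below calls two helpers, `split_data` and `get_batch`, that aren't defined anywhere in this section. Here's a hypothetical sketch of what they might look like, inferred only from how they're called and from the 90%/10% split described earlier. The bodies, the `split_frac=0.9` default, and the toy data are assumptions, not the original implementation.

```python
import numpy as np

def split_data(arr, batch_size, num_steps, split_frac=0.9):
    # Hypothetical helper: returns train_x, train_y, val_x, val_y
    slice_size = batch_size * num_steps
    # Leave one character spare so every input has a next-character target
    n_batches = (len(arr) - 1) // slice_size
    x = arr[:n_batches * slice_size].reshape((batch_size, -1))
    y = arr[1:n_batches * slice_size + 1].reshape((batch_size, -1))
    # Keep the first split_frac of the batches for training
    split_idx = int(n_batches * split_frac) * num_steps
    return x[:, :split_idx], y[:, :split_idx], x[:, split_idx:], y[:, split_idx:]

def get_batch(arrs, num_steps):
    # Hypothetical helper: yields matching windows from every array in arrs
    n_batches = arrs[0].shape[1] // num_steps
    for b in range(n_batches):
        yield [a[:, b * num_steps:(b + 1) * num_steps] for a in arrs]

toy = np.arange(1001)  # toy data in place of the encoded text
toy_train_x, toy_train_y, toy_val_x, toy_val_y = split_data(toy, batch_size=2, num_steps=5)
toy_train_batches = list(get_batch([toy_train_x, toy_train_y], 5))

print(toy_train_x.shape, toy_val_x.shape, len(toy_train_batches))
```

Unlike `get_batches` earlier, this version takes the targets from the original array shifted by one, so the last target in each window is the true next character.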

In [92]:
epochs = 20
# Save every N iterations
save_every_n = 200
# split_data and get_batch (used below) are helper functions that are not shown in this section
train_x, train_y, val_x, val_y = split_data(encoded, batch_size, num_steps)

model = CharRNN(len(vocab), batch_size=batch_size, num_steps=num_steps,
                lstm_size=lstm_size, num_layers=num_layers, 
                learning_rate=learning_rate, grad_clip=5,
                sampling=False)

saver = tf.train.Saver(max_to_keep=100)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    # Use the line below to load a checkpoint and resume training
    #saver.restore(sess, 'checkpoints/______.ckpt')
    
    n_batches = int(train_x.shape[1]/num_steps)
    iterations = n_batches * epochs
    for e in range(epochs):
        
        # Train network
        new_state = sess.run(model.initial_state)
        loss = 0
        for b, (x, y) in enumerate(get_batch([train_x, train_y], num_steps), 1):
            iteration = e*n_batches + b
            start = time.time()
            feed = {model.inputs: x,
                    model.targets: y,
                    model.keep_prob: keep_prob,
                    model.initial_state: new_state}
            batch_loss, new_state, _ = sess.run([model.loss, model.final_state, model.optimizer], 
                                                 feed_dict=feed)
            loss += batch_loss
            end = time.time()
            print('Epoch {}/{} '.format(e+1, epochs),
                  'Iteration {}/{}'.format(iteration, iterations),
                  'Training loss: {:.4f}'.format(loss/b),
                  '{:.4f} sec/batch'.format((end-start)))
        
            
            if (iteration%save_every_n == 0) or (iteration == iterations):
                # Check performance, notice keep_prob has been set to 1 (no dropout)
                val_loss = []
                # Use a separate state so validation doesn't overwrite the training state
                val_state = sess.run(model.initial_state)
                for x, y in get_batch([val_x, val_y], num_steps):
                    feed = {model.inputs: x,
                            model.targets: y,
                            model.keep_prob: 1.,
                            model.initial_state: val_state}
                    batch_loss, val_state = sess.run([model.loss, model.final_state], feed_dict=feed)
                    val_loss.append(batch_loss)

                print('Validation loss:', np.mean(val_loss),
                      'Saving checkpoint!')
                saver.save(sess, "checkpoints/i{}_l{}_v{:.3f}.ckpt".format(iteration, lstm_size, np.mean(val_loss)))

Epoch 1/20  Iteration 1/3560 Training loss: 4.4168 0.1921 sec/batch
Epoch 1/20  Iteration 2/3560 Training loss: 4.3714 0.1448 sec/batch
Epoch 1/20  Iteration 3/3560 Training loss: 4.1769 0.1374 sec/batch
Epoch 1/20  Iteration 4/3560 Training loss: 4.3862 0.1343 sec/batch
Epoch 1/20  Iteration 5/3560 Training loss: 4.2947 0.1388 sec/batch
Epoch 1/20  Iteration 6/3560 Training loss: 4.2243 0.1352 sec/batch
Epoch 1/20  Iteration 7/3560 Training loss: 4.1553 0.1316 sec/batch
Epoch 1/20  Iteration 8/3560 Training loss: 4.0842 0.1265 sec/batch
Epoch 1/20  Iteration 9/3560 Training loss: 4.0167 0.1266 sec/batch
Epoch 1/20  Iteration 10/3560 Training loss: 3.9593 0.1266 sec/batch
Epoch 1/20  Iteration 11/3560 Training loss: 3.9090 0.1264 sec/batch
Epoch 1/20  Iteration 12/3560 Training loss: 3.8670 0.1264 sec/batch
Epoch 1/20  Iteration 13/3560 Training loss: 3.8292 0.1262 sec/batch
Epoch 1/20  Iteration 14/3560 Training loss: 3.7960 0.1264 sec/batch
...

Epoch 1/20  Iteration 121/3560 Training loss: 3.2132 0.1269 sec/batch
Epoch 1/20  Iteration 122/3560 Training loss: 3.2110 0.1273 sec/batch
Epoch 1/20  Iteration 123/3560 Training loss: 3.2086 0.1268 sec/batch
Epoch 1/20  Iteration 124/3560 Training loss: 3.2063 0.1267 sec/batch
Epoch 1/20  Iteration 125/3560 Training loss: 3.2038 0.1267 sec/batch
Epoch 1/20  Iteration 126/3560 Training loss: 3.2010 0.1268 sec/batch
Epoch 1/20  Iteration 127/3560 Training loss: 3.1985 0.1268 sec/batch
Epoch 1/20  Iteration 128/3560 Training loss: 3.1961 0.1267 sec/batch
Epoch 1/20  Iteration 129/3560 Training loss: 3.1934 0.1274 sec/batch
Epoch 1/20  Iteration 130/3560 Training loss: 3.1909 0.1271 sec/batch
Epoch 1/20  Iteration 131/3560 Training loss: 3.1895 0.1270 sec/batch
Epoch 1/20  Iteration 132/3560 Training loss: 3.1870 0.1269 sec/batch
Epoch 1/20  Iteration 133/3560 Training loss: 3.1847 0.1272 sec/batch
Epoch 1/20  Iteration 134/3560 Training loss: 3.1823 0.1275 sec/batch
...

Epoch 2/20  Iteration 239/3560 Training loss: 2.4197 0.1273 sec/batch
Epoch 2/20  Iteration 240/3560 Training loss: 2.4185 0.1273 sec/batch
Epoch 2/20  Iteration 241/3560 Training loss: 2.4173 0.1272 sec/batch
Epoch 2/20  Iteration 242/3560 Training loss: 2.4157 0.1271 sec/batch
Epoch 2/20  Iteration 243/3560 Training loss: 2.4141 0.1273 sec/batch
Epoch 2/20  Iteration 244/3560 Training loss: 2.4130 0.1271 sec/batch
Epoch 2/20  Iteration 245/3560 Training loss: 2.4116 0.1275 sec/batch
Epoch 2/20  Iteration 246/3560 Training loss: 2.4099 0.1275 sec/batch
Epoch 2/20  Iteration 247/3560 Training loss: 2.4082 0.1271 sec/batch
Epoch 2/20  Iteration 248/3560 Training loss: 2.4070 0.1273 sec/batch
Epoch 2/20  Iteration 249/3560 Training loss: 2.4060 0.1273 sec/batch
Epoch 2/20  Iteration 250/3560 Training loss: 2.4047 0.1273 sec/batch
Epoch 2/20  Iteration 251/3560 Training loss: 2.4035 0.1273 sec/batch
Epoch 2/20  Iteration 252/3560 Training loss: 2.4019 0.1273 sec/batch
...

Epoch 3/20  Iteration 357/3560 Training loss: 2.1700 0.1271 sec/batch
Epoch 3/20  Iteration 358/3560 Training loss: 2.1248 0.1273 sec/batch
Epoch 3/20  Iteration 359/3560 Training loss: 2.1098 0.1273 sec/batch
Epoch 3/20  Iteration 360/3560 Training loss: 2.1045 0.1271 sec/batch
Epoch 3/20  Iteration 361/3560 Training loss: 2.1012 0.1271 sec/batch
Epoch 3/20  Iteration 362/3560 Training loss: 2.0943 0.1270 sec/batch
Epoch 3/20  Iteration 363/3560 Training loss: 2.0943 0.1270 sec/batch
Epoch 3/20  Iteration 364/3560 Training loss: 2.0940 0.1270 sec/batch
Epoch 3/20  Iteration 365/3560 Training loss: 2.0965 0.1273 sec/batch
Epoch 3/20  Iteration 366/3560 Training loss: 2.0965 0.1274 sec/batch
Epoch 3/20  Iteration 367/3560 Training loss: 2.0935 0.1271 sec/batch
Epoch 3/20  Iteration 368/3560 Training loss: 2.0912 0.1269 sec/batch
Epoch 3/20  Iteration 369/3560 Training loss: 2.0904 0.1272 sec/batch
Epoch 3/20  Iteration 370/3560 Training loss: 2.0917 0.1272 sec/batch
...

Epoch 3/20  Iteration 475/3560 Training loss: 2.0125 0.1273 sec/batch
Epoch 3/20  Iteration 476/3560 Training loss: 2.0119 0.1272 sec/batch
Epoch 3/20  Iteration 477/3560 Training loss: 2.0114 0.1273 sec/batch
Epoch 3/20  Iteration 478/3560 Training loss: 2.0107 0.1275 sec/batch
Epoch 3/20  Iteration 479/3560 Training loss: 2.0099 0.1272 sec/batch
Epoch 3/20  Iteration 480/3560 Training loss: 2.0094 0.1276 sec/batch
Epoch 3/20  Iteration 481/3560 Training loss: 2.0087 0.1272 sec/batch
Epoch 3/20  Iteration 482/3560 Training loss: 2.0079 0.1275 sec/batch
Epoch 3/20  Iteration 483/3560 Training loss: 2.0073 0.1273 sec/batch
Epoch 3/20  Iteration 484/3560 Training loss: 2.0068 0.1271 sec/batch
Epoch 3/20  Iteration 485/3560 Training loss: 2.0062 0.1275 sec/batch
Epoch 3/20  Iteration 486/3560 Training loss: 2.0056 0.1271 sec/batch
Epoch 3/20  Iteration 487/3560 Training loss: 2.0048 0.1272 sec/batch
Epoch 3/20  Iteration 488/3560 Training loss: 2.0040 0.1271 sec/batch
...

Epoch 4/20  Iteration 593/3560 Training loss: 1.8483 0.1272 sec/batch
Epoch 4/20  Iteration 594/3560 Training loss: 1.8485 0.1273 sec/batch
Epoch 4/20  Iteration 595/3560 Training loss: 1.8480 0.1272 sec/batch
Epoch 4/20  Iteration 596/3560 Training loss: 1.8484 0.1273 sec/batch
Epoch 4/20  Iteration 597/3560 Training loss: 1.8484 0.1271 sec/batch
Epoch 4/20  Iteration 598/3560 Training loss: 1.8483 0.1272 sec/batch
Epoch 4/20  Iteration 599/3560 Training loss: 1.8479 0.1277 sec/batch
Epoch 4/20  Iteration 600/3560 Training loss: 1.8478 0.1273 sec/batch
Validation loss: 1.71518 Saving checkpoint!
Epoch 4/20  Iteration 601/3560 Training loss: 1.8483 0.1285 sec/batch
Epoch 4/20  Iteration 602/3560 Training loss: 1.8476 0.1272 sec/batch
Epoch 4/20  Iteration 603/3560 Training loss: 1.8471 0.1270 sec/batch
Epoch 4/20  Iteration 604/3560 Training loss: 1.8468 0.1275 sec/batch
Epoch 4/20  Iteration 605/3560 Training loss: 1.8469 0.1274 sec/batch
...

Epoch 4/20  Iteration 711/3560 Training loss: 1.7990 0.1273 sec/batch
Epoch 4/20  Iteration 712/3560 Training loss: 1.7988 0.1275 sec/batch
Epoch 5/20  Iteration 713/3560 Training loss: 1.8197 0.1270 sec/batch
Epoch 5/20  Iteration 714/3560 Training loss: 1.7746 0.1271 sec/batch
Epoch 5/20  Iteration 715/3560 Training loss: 1.7579 0.1274 sec/batch
Epoch 5/20  Iteration 716/3560 Training loss: 1.7489 0.1276 sec/batch
Epoch 5/20  Iteration 717/3560 Training loss: 1.7410 0.1278 sec/batch
Epoch 5/20  Iteration 718/3560 Training loss: 1.7303 0.1272 sec/batch
Epoch 5/20  Iteration 719/3560 Training loss: 1.7305 0.1285 sec/batch
Epoch 5/20  Iteration 720/3560 Training loss: 1.7280 0.1271 sec/batch
Epoch 5/20  Iteration 721/3560 Training loss: 1.7304 0.1273 sec/batch
Epoch 5/20  Iteration 722/3560 Training loss: 1.7288 0.1271 sec/batch
Epoch 5/20  Iteration 723/3560 Training loss: 1.7249 0.1273 sec/batch
Epoch 5/20  Iteration 724/3560 Training loss: 1.7236 0.1273 sec/batch
...

Epoch 5/20  Iteration 829/3560 Training loss: 1.6851 0.1272 sec/batch
Epoch 5/20  Iteration 830/3560 Training loss: 1.6847 0.1271 sec/batch
Epoch 5/20  Iteration 831/3560 Training loss: 1.6844 0.1275 sec/batch
Epoch 5/20  Iteration 832/3560 Training loss: 1.6840 0.1271 sec/batch
Epoch 5/20  Iteration 833/3560 Training loss: 1.6837 0.1273 sec/batch
Epoch 5/20  Iteration 834/3560 Training loss: 1.6831 0.1273 sec/batch
Epoch 5/20  Iteration 835/3560 Training loss: 1.6825 0.1274 sec/batch
Epoch 5/20  Iteration 836/3560 Training loss: 1.6822 0.1271 sec/batch
Epoch 5/20  Iteration 837/3560 Training loss: 1.6819 0.1273 sec/batch
Epoch 5/20  Iteration 838/3560 Training loss: 1.6812 0.1274 sec/batch
Epoch 5/20  Iteration 839/3560 Training loss: 1.6810 0.1274 sec/batch
Epoch 5/20  Iteration 840/3560 Training loss: 1.6807 0.1271 sec/batch
Epoch 5/20  Iteration 841/3560 Training loss: 1.6804 0.1271 sec/batch
Epoch 5/20  Iteration 842/3560 Training loss: 1.6799 0.1275 sec/batch
...

Epoch 6/20  Iteration 947/3560 Training loss: 1.6021 0.1273 sec/batch
Epoch 6/20  Iteration 948/3560 Training loss: 1.6015 0.1275 sec/batch
Epoch 6/20  Iteration 949/3560 Training loss: 1.6006 0.1272 sec/batch
Epoch 6/20  Iteration 950/3560 Training loss: 1.6011 0.1274 sec/batch
Epoch 6/20  Iteration 951/3560 Training loss: 1.6010 0.1273 sec/batch
Epoch 6/20  Iteration 952/3560 Training loss: 1.6016 0.1270 sec/batch
Epoch 6/20  Iteration 953/3560 Training loss: 1.6018 0.1274 sec/batch
Epoch 6/20  Iteration 954/3560 Training loss: 1.6020 0.1273 sec/batch
Epoch 6/20  Iteration 955/3560 Training loss: 1.6018 0.1274 sec/batch
Epoch 6/20  Iteration 956/3560 Training loss: 1.6019 0.1274 sec/batch
Epoch 6/20  Iteration 957/3560 Training loss: 1.6020 0.1274 sec/batch
Epoch 6/20  Iteration 958/3560 Training loss: 1.6015 0.1272 sec/batch
Epoch 6/20  Iteration 959/3560 Training loss: 1.6013 0.1273 sec/batch
Epoch 6/20  Iteration 960/3560 Training loss: 1.6010 0.1272 sec/batch
...

Epoch 6/20  Iteration 1063/3560 Training loss: 1.5745 0.1273 sec/batch
Epoch 6/20  Iteration 1064/3560 Training loss: 1.5742 0.1273 sec/batch
Epoch 6/20  Iteration 1065/3560 Training loss: 1.5741 0.1272 sec/batch
Epoch 6/20  Iteration 1066/3560 Training loss: 1.5738 0.1272 sec/batch
Epoch 6/20  Iteration 1067/3560 Training loss: 1.5734 0.1273 sec/batch
Epoch 6/20  Iteration 1068/3560 Training loss: 1.5733 0.1273 sec/batch
Epoch 7/20  Iteration 1069/3560 Training loss: 1.6442 0.1272 sec/batch
Epoch 7/20  Iteration 1070/3560 Training loss: 1.5946 0.1273 sec/batch
Epoch 7/20  Iteration 1071/3560 Training loss: 1.5755 0.1276 sec/batch
Epoch 7/20  Iteration 1072/3560 Training loss: 1.5693 0.1274 sec/batch
Epoch 7/20  Iteration 1073/3560 Training loss: 1.5577 0.1273 sec/batch
Epoch 7/20  Iteration 1074/3560 Training loss: 1.5460 0.1273 sec/batch
Epoch 7/20  Iteration 1075/3560 Training loss: 1.5443 0.1275 sec/batch
Epoch 7/20  Iteration 1076/3560 Training loss: 1.5414 0.1275 sec/batch
...

Epoch 7/20  Iteration 1179/3560 Training loss: 1.5031 0.1271 sec/batch
Epoch 7/20  Iteration 1180/3560 Training loss: 1.5027 0.1274 sec/batch
Epoch 7/20  Iteration 1181/3560 Training loss: 1.5024 0.1274 sec/batch
Epoch 7/20  Iteration 1182/3560 Training loss: 1.5020 0.1274 sec/batch
Epoch 7/20  Iteration 1183/3560 Training loss: 1.5015 0.1276 sec/batch
Epoch 7/20  Iteration 1184/3560 Training loss: 1.5011 0.1274 sec/batch
Epoch 7/20  Iteration 1185/3560 Training loss: 1.5007 0.1273 sec/batch
Epoch 7/20  Iteration 1186/3560 Training loss: 1.5005 0.1272 sec/batch
Epoch 7/20  Iteration 1187/3560 Training loss: 1.5003 0.1276 sec/batch
Epoch 7/20  Iteration 1188/3560 Training loss: 1.5000 0.1278 sec/batch
Epoch 7/20  Iteration 1189/3560 Training loss: 1.4997 0.1275 sec/batch
Epoch 7/20  Iteration 1190/3560 Training loss: 1.4994 0.1272 sec/batch
Epoch 7/20  Iteration 1191/3560 Training loss: 1.4990 0.1274 sec/batch
Epoch 7/20  Iteration 1192/3560 Training loss: 1.4988 0.1273 sec/batch
...

Epoch 8/20  Iteration 1295/3560 Training loss: 1.4480 0.1274 sec/batch
Epoch 8/20  Iteration 1296/3560 Training loss: 1.4484 0.1272 sec/batch
Epoch 8/20  Iteration 1297/3560 Training loss: 1.4481 0.1281 sec/batch
Epoch 8/20  Iteration 1298/3560 Training loss: 1.4488 0.1273 sec/batch
Epoch 8/20  Iteration 1299/3560 Training loss: 1.4486 0.1273 sec/batch
Epoch 8/20  Iteration 1300/3560 Training loss: 1.4486 0.1273 sec/batch
Epoch 8/20  Iteration 1301/3560 Training loss: 1.4485 0.1273 sec/batch
Epoch 8/20  Iteration 1302/3560 Training loss: 1.4483 0.1275 sec/batch
Epoch 8/20  Iteration 1303/3560 Training loss: 1.4485 0.1274 sec/batch
Epoch 8/20  Iteration 1304/3560 Training loss: 1.4482 0.1273 sec/batch
Epoch 8/20  Iteration 1305/3560 Training loss: 1.4475 0.1274 sec/batch
Epoch 8/20  Iteration 1306/3560 Training loss: 1.4479 0.1274 sec/batch
Epoch 8/20  Iteration 1307/3560 Training loss: 1.4477 0.1273 sec/batch
Epoch 8/20  Iteration 1308/3560 Training loss: 1.4486 0.1273 sec/batch
...

Epoch 8/20  Iteration 1411/3560 Training loss: 1.4331 0.1272 sec/batch
Epoch 8/20  Iteration 1412/3560 Training loss: 1.4330 0.1275 sec/batch
Epoch 8/20  Iteration 1413/3560 Training loss: 1.4331 0.1273 sec/batch
Epoch 8/20  Iteration 1414/3560 Training loss: 1.4335 0.1277 sec/batch
Epoch 8/20  Iteration 1415/3560 Training loss: 1.4333 0.1275 sec/batch
Epoch 8/20  Iteration 1416/3560 Training loss: 1.4333 0.1272 sec/batch
Epoch 8/20  Iteration 1417/3560 Training loss: 1.4331 0.1272 sec/batch
Epoch 8/20  Iteration 1418/3560 Training loss: 1.4329 0.1275 sec/batch
Epoch 8/20  Iteration 1419/3560 Training loss: 1.4330 0.1272 sec/batch
Epoch 8/20  Iteration 1420/3560 Training loss: 1.4330 0.1272 sec/batch
Epoch 8/20  Iteration 1421/3560 Training loss: 1.4330 0.1274 sec/batch
Epoch 8/20  Iteration 1422/3560 Training loss: 1.4328 0.1272 sec/batch
Epoch 8/20  Iteration 1423/3560 Training loss: 1.4326 0.1272 sec/batch
Epoch 8/20  Iteration 1424/3560 Training loss: 1.4326 0.1273 sec/batch
...

Epoch 9/20  Iteration 1527/3560 Training loss: 1.3939 0.1274 sec/batch
Epoch 9/20  Iteration 1528/3560 Training loss: 1.3937 0.1273 sec/batch
Epoch 9/20  Iteration 1529/3560 Training loss: 1.3935 0.1275 sec/batch
Epoch 9/20  Iteration 1530/3560 Training loss: 1.3933 0.1272 sec/batch
Epoch 9/20  Iteration 1531/3560 Training loss: 1.3932 0.1270 sec/batch
Epoch 9/20  Iteration 1532/3560 Training loss: 1.3930 0.1272 sec/batch
Epoch 9/20  Iteration 1533/3560 Training loss: 1.3929 0.1273 sec/batch
Epoch 9/20  Iteration 1534/3560 Training loss: 1.3929 0.1273 sec/batch
Epoch 9/20  Iteration 1535/3560 Training loss: 1.3926 0.1275 sec/batch
Epoch 9/20  Iteration 1536/3560 Training loss: 1.3925 0.1275 sec/batch
Epoch 9/20  Iteration 1537/3560 Training loss: 1.3923 0.1273 sec/batch
Epoch 9/20  Iteration 1538/3560 Training loss: 1.3921 0.1271 sec/batch
Epoch 9/20  Iteration 1539/3560 Training loss: 1.3918 0.1273 sec/batch
Epoch 9/20  Iteration 1540/3560 Training loss: 1.3914 0.1273 sec/batch
Epoch 

Epoch 10/20  Iteration 1643/3560 Training loss: 1.3765 0.1274 sec/batch
Epoch 10/20  Iteration 1644/3560 Training loss: 1.3769 0.1272 sec/batch
Epoch 10/20  Iteration 1645/3560 Training loss: 1.3763 0.1275 sec/batch
Epoch 10/20  Iteration 1646/3560 Training loss: 1.3755 0.1273 sec/batch
Epoch 10/20  Iteration 1647/3560 Training loss: 1.3753 0.1273 sec/batch
Epoch 10/20  Iteration 1648/3560 Training loss: 1.3744 0.1272 sec/batch
Epoch 10/20  Iteration 1649/3560 Training loss: 1.3742 0.1274 sec/batch
Epoch 10/20  Iteration 1650/3560 Training loss: 1.3736 0.1273 sec/batch
Epoch 10/20  Iteration 1651/3560 Training loss: 1.3734 0.1276 sec/batch
Epoch 10/20  Iteration 1652/3560 Training loss: 1.3734 0.1275 sec/batch
Epoch 10/20  Iteration 1653/3560 Training loss: 1.3728 0.1274 sec/batch
Epoch 10/20  Iteration 1654/3560 Training loss: 1.3735 0.1285 sec/batch
Epoch 10/20  Iteration 1655/3560 Training loss: 1.3733 0.1272 sec/batch
Epoch 10/20  Iteration 1656/3560 Training loss: 1.3733 0.1274 se

Epoch 10/20  Iteration 1757/3560 Training loss: 1.3564 0.1364 sec/batch
Epoch 10/20  Iteration 1758/3560 Training loss: 1.3563 0.1360 sec/batch
Epoch 10/20  Iteration 1759/3560 Training loss: 1.3563 0.1360 sec/batch
Epoch 10/20  Iteration 1760/3560 Training loss: 1.3562 0.1359 sec/batch
Epoch 10/20  Iteration 1761/3560 Training loss: 1.3559 0.1341 sec/batch
Epoch 10/20  Iteration 1762/3560 Training loss: 1.3560 0.1275 sec/batch
Epoch 10/20  Iteration 1763/3560 Training loss: 1.3561 0.1274 sec/batch
Epoch 10/20  Iteration 1764/3560 Training loss: 1.3561 0.1273 sec/batch
Epoch 10/20  Iteration 1765/3560 Training loss: 1.3560 0.1274 sec/batch
Epoch 10/20  Iteration 1766/3560 Training loss: 1.3559 0.1274 sec/batch
Epoch 10/20  Iteration 1767/3560 Training loss: 1.3559 0.1271 sec/batch
Epoch 10/20  Iteration 1768/3560 Training loss: 1.3557 0.1274 sec/batch
Epoch 10/20  Iteration 1769/3560 Training loss: 1.3558 0.1315 sec/batch
Epoch 10/20  Iteration 1770/3560 Training loss: 1.3562 0.1357 se

Epoch 11/20  Iteration 1871/3560 Training loss: 1.3320 0.1274 sec/batch
Epoch 11/20  Iteration 1872/3560 Training loss: 1.3318 0.1272 sec/batch
Epoch 11/20  Iteration 1873/3560 Training loss: 1.3314 0.1274 sec/batch
Epoch 11/20  Iteration 1874/3560 Training loss: 1.3310 0.1279 sec/batch
Epoch 11/20  Iteration 1875/3560 Training loss: 1.3307 0.1272 sec/batch
Epoch 11/20  Iteration 1876/3560 Training loss: 1.3307 0.1273 sec/batch
Epoch 11/20  Iteration 1877/3560 Training loss: 1.3307 0.1275 sec/batch
Epoch 11/20  Iteration 1878/3560 Training loss: 1.3302 0.1287 sec/batch
Epoch 11/20  Iteration 1879/3560 Training loss: 1.3297 0.1273 sec/batch
Epoch 11/20  Iteration 1880/3560 Training loss: 1.3292 0.1273 sec/batch
Epoch 11/20  Iteration 1881/3560 Training loss: 1.3291 0.1279 sec/batch
Epoch 11/20  Iteration 1882/3560 Training loss: 1.3289 0.1272 sec/batch
Epoch 11/20  Iteration 1883/3560 Training loss: 1.3287 0.1274 sec/batch
Epoch 11/20  Iteration 1884/3560 Training loss: 1.3285 0.1273 se

Epoch 12/20  Iteration 1985/3560 Training loss: 1.3126 0.1274 sec/batch
Epoch 12/20  Iteration 1986/3560 Training loss: 1.3133 0.1272 sec/batch
Epoch 12/20  Iteration 1987/3560 Training loss: 1.3138 0.1274 sec/batch
Epoch 12/20  Iteration 1988/3560 Training loss: 1.3139 0.1273 sec/batch
Epoch 12/20  Iteration 1989/3560 Training loss: 1.3131 0.1273 sec/batch
Epoch 12/20  Iteration 1990/3560 Training loss: 1.3119 0.1274 sec/batch
Epoch 12/20  Iteration 1991/3560 Training loss: 1.3119 0.1271 sec/batch
Epoch 12/20  Iteration 1992/3560 Training loss: 1.3121 0.1272 sec/batch
Epoch 12/20  Iteration 1993/3560 Training loss: 1.3117 0.1275 sec/batch
Epoch 12/20  Iteration 1994/3560 Training loss: 1.3116 0.1275 sec/batch
Epoch 12/20  Iteration 1995/3560 Training loss: 1.3109 0.1279 sec/batch
Epoch 12/20  Iteration 1996/3560 Training loss: 1.3099 0.1277 sec/batch
Epoch 12/20  Iteration 1997/3560 Training loss: 1.3085 0.1273 sec/batch
Epoch 12/20  Iteration 1998/3560 Training loss: 1.3080 0.1281 se

Epoch 12/20  Iteration 2099/3560 Training loss: 1.2991 0.1274 sec/batch
Epoch 12/20  Iteration 2100/3560 Training loss: 1.2990 0.1280 sec/batch
Epoch 12/20  Iteration 2101/3560 Training loss: 1.2989 0.1272 sec/batch
Epoch 12/20  Iteration 2102/3560 Training loss: 1.2990 0.1276 sec/batch
Epoch 12/20  Iteration 2103/3560 Training loss: 1.2989 0.1275 sec/batch
Epoch 12/20  Iteration 2104/3560 Training loss: 1.2991 0.1280 sec/batch
Epoch 12/20  Iteration 2105/3560 Training loss: 1.2991 0.1280 sec/batch
Epoch 12/20  Iteration 2106/3560 Training loss: 1.2992 0.1275 sec/batch
Epoch 12/20  Iteration 2107/3560 Training loss: 1.2993 0.1276 sec/batch
Epoch 12/20  Iteration 2108/3560 Training loss: 1.2992 0.1272 sec/batch
Epoch 12/20  Iteration 2109/3560 Training loss: 1.2988 0.1273 sec/batch
Epoch 12/20  Iteration 2110/3560 Training loss: 1.2987 0.1273 sec/batch
Epoch 12/20  Iteration 2111/3560 Training loss: 1.2987 0.1274 sec/batch
Epoch 12/20  Iteration 2112/3560 Training loss: 1.2986 0.1273 se

Epoch 13/20  Iteration 2213/3560 Training loss: 1.2876 0.1274 sec/batch
Epoch 13/20  Iteration 2214/3560 Training loss: 1.2874 0.1287 sec/batch
Epoch 13/20  Iteration 2215/3560 Training loss: 1.2869 0.1273 sec/batch
Epoch 13/20  Iteration 2216/3560 Training loss: 1.2869 0.1273 sec/batch
Epoch 13/20  Iteration 2217/3560 Training loss: 1.2865 0.1276 sec/batch
Epoch 13/20  Iteration 2218/3560 Training loss: 1.2863 0.1272 sec/batch
Epoch 13/20  Iteration 2219/3560 Training loss: 1.2858 0.1274 sec/batch
Epoch 13/20  Iteration 2220/3560 Training loss: 1.2857 0.1274 sec/batch
Epoch 13/20  Iteration 2221/3560 Training loss: 1.2852 0.1272 sec/batch
Epoch 13/20  Iteration 2222/3560 Training loss: 1.2850 0.1271 sec/batch
Epoch 13/20  Iteration 2223/3560 Training loss: 1.2848 0.1275 sec/batch
Epoch 13/20  Iteration 2224/3560 Training loss: 1.2844 0.1276 sec/batch
Epoch 13/20  Iteration 2225/3560 Training loss: 1.2840 0.1273 sec/batch
Epoch 13/20  Iteration 2226/3560 Training loss: 1.2841 0.1274 se

Epoch 14/20  Iteration 2327/3560 Training loss: 1.2824 0.1276 sec/batch
Epoch 14/20  Iteration 2328/3560 Training loss: 1.2831 0.1274 sec/batch
Epoch 14/20  Iteration 2329/3560 Training loss: 1.2808 0.1275 sec/batch
Epoch 14/20  Iteration 2330/3560 Training loss: 1.2783 0.1273 sec/batch
Epoch 14/20  Iteration 2331/3560 Training loss: 1.2779 0.1287 sec/batch
Epoch 14/20  Iteration 2332/3560 Training loss: 1.2788 0.1275 sec/batch
Epoch 14/20  Iteration 2333/3560 Training loss: 1.2786 0.1272 sec/batch
Epoch 14/20  Iteration 2334/3560 Training loss: 1.2794 0.1271 sec/batch
Epoch 14/20  Iteration 2335/3560 Training loss: 1.2789 0.1276 sec/batch
Epoch 14/20  Iteration 2336/3560 Training loss: 1.2789 0.1273 sec/batch
Epoch 14/20  Iteration 2337/3560 Training loss: 1.2776 0.1276 sec/batch
Epoch 14/20  Iteration 2338/3560 Training loss: 1.2771 0.1273 sec/batch
Epoch 14/20  Iteration 2339/3560 Training loss: 1.2767 0.1273 sec/batch
Epoch 14/20  Iteration 2340/3560 Training loss: 1.2751 0.1273 se

Epoch 14/20  Iteration 2441/3560 Training loss: 1.2601 0.1274 sec/batch
Epoch 14/20  Iteration 2442/3560 Training loss: 1.2601 0.1273 sec/batch
Epoch 14/20  Iteration 2443/3560 Training loss: 1.2599 0.1274 sec/batch
Epoch 14/20  Iteration 2444/3560 Training loss: 1.2596 0.1272 sec/batch
Epoch 14/20  Iteration 2445/3560 Training loss: 1.2591 0.1276 sec/batch
Epoch 14/20  Iteration 2446/3560 Training loss: 1.2588 0.1273 sec/batch
Epoch 14/20  Iteration 2447/3560 Training loss: 1.2588 0.1277 sec/batch
Epoch 14/20  Iteration 2448/3560 Training loss: 1.2588 0.1274 sec/batch
Epoch 14/20  Iteration 2449/3560 Training loss: 1.2587 0.1272 sec/batch
Epoch 14/20  Iteration 2450/3560 Training loss: 1.2587 0.1274 sec/batch
Epoch 14/20  Iteration 2451/3560 Training loss: 1.2589 0.1274 sec/batch
Epoch 14/20  Iteration 2452/3560 Training loss: 1.2590 0.1275 sec/batch
Epoch 14/20  Iteration 2453/3560 Training loss: 1.2589 0.1274 sec/batch
Epoch 14/20  Iteration 2454/3560 Training loss: 1.2590 0.1275 se

Epoch 15/20  Iteration 2555/3560 Training loss: 1.2490 0.1274 sec/batch
Epoch 15/20  Iteration 2556/3560 Training loss: 1.2490 0.1276 sec/batch
Epoch 15/20  Iteration 2557/3560 Training loss: 1.2489 0.1284 sec/batch
Epoch 15/20  Iteration 2558/3560 Training loss: 1.2491 0.1275 sec/batch
Epoch 15/20  Iteration 2559/3560 Training loss: 1.2494 0.1271 sec/batch
Epoch 15/20  Iteration 2560/3560 Training loss: 1.2490 0.1274 sec/batch
Epoch 15/20  Iteration 2561/3560 Training loss: 1.2491 0.1276 sec/batch
Epoch 15/20  Iteration 2562/3560 Training loss: 1.2489 0.1280 sec/batch
Epoch 15/20  Iteration 2563/3560 Training loss: 1.2495 0.1273 sec/batch
Epoch 15/20  Iteration 2564/3560 Training loss: 1.2498 0.1272 sec/batch
Epoch 15/20  Iteration 2565/3560 Training loss: 1.2502 0.1274 sec/batch
Epoch 15/20  Iteration 2566/3560 Training loss: 1.2498 0.1273 sec/batch
Epoch 15/20  Iteration 2567/3560 Training loss: 1.2496 0.1276 sec/batch
Epoch 15/20  Iteration 2568/3560 Training loss: 1.2499 0.1273 se

Epoch 15/20  Iteration 2669/3560 Training loss: 1.2424 0.1399 sec/batch
Epoch 15/20  Iteration 2670/3560 Training loss: 1.2425 0.1354 sec/batch
Epoch 16/20  Iteration 2671/3560 Training loss: 1.3616 0.1345 sec/batch
Epoch 16/20  Iteration 2672/3560 Training loss: 1.3087 0.1278 sec/batch
Epoch 16/20  Iteration 2673/3560 Training loss: 1.2843 0.1278 sec/batch
Epoch 16/20  Iteration 2674/3560 Training loss: 1.2794 0.1275 sec/batch
Epoch 16/20  Iteration 2675/3560 Training loss: 1.2670 0.1272 sec/batch
Epoch 16/20  Iteration 2676/3560 Training loss: 1.2555 0.1272 sec/batch
Epoch 16/20  Iteration 2677/3560 Training loss: 1.2534 0.1274 sec/batch
Epoch 16/20  Iteration 2678/3560 Training loss: 1.2505 0.1271 sec/batch
Epoch 16/20  Iteration 2679/3560 Training loss: 1.2489 0.1277 sec/batch
Epoch 16/20  Iteration 2680/3560 Training loss: 1.2479 0.1276 sec/batch
Epoch 16/20  Iteration 2681/3560 Training loss: 1.2442 0.1273 sec/batch
Epoch 16/20  Iteration 2682/3560 Training loss: 1.2435 0.1278 se

Epoch 16/20  Iteration 2783/3560 Training loss: 1.2277 0.1277 sec/batch
Epoch 16/20  Iteration 2784/3560 Training loss: 1.2276 0.1273 sec/batch
Epoch 16/20  Iteration 2785/3560 Training loss: 1.2272 0.1274 sec/batch
Epoch 16/20  Iteration 2786/3560 Training loss: 1.2270 0.1272 sec/batch
Epoch 16/20  Iteration 2787/3560 Training loss: 1.2270 0.1274 sec/batch
Epoch 16/20  Iteration 2788/3560 Training loss: 1.2270 0.1274 sec/batch
Epoch 16/20  Iteration 2789/3560 Training loss: 1.2269 0.1332 sec/batch
Epoch 16/20  Iteration 2790/3560 Training loss: 1.2269 0.1358 sec/batch
Epoch 16/20  Iteration 2791/3560 Training loss: 1.2268 0.1362 sec/batch
Epoch 16/20  Iteration 2792/3560 Training loss: 1.2265 0.1356 sec/batch
Epoch 16/20  Iteration 2793/3560 Training loss: 1.2262 0.1354 sec/batch
Epoch 16/20  Iteration 2794/3560 Training loss: 1.2262 0.1360 sec/batch
Epoch 16/20  Iteration 2795/3560 Training loss: 1.2260 0.1376 sec/batch
Epoch 16/20  Iteration 2796/3560 Training loss: 1.2258 0.1329 se

Epoch 17/20  Iteration 2897/3560 Training loss: 1.2205 0.1272 sec/batch
Epoch 17/20  Iteration 2898/3560 Training loss: 1.2206 0.1275 sec/batch
Epoch 17/20  Iteration 2899/3560 Training loss: 1.2202 0.1274 sec/batch
Epoch 17/20  Iteration 2900/3560 Training loss: 1.2207 0.1273 sec/batch
Epoch 17/20  Iteration 2901/3560 Training loss: 1.2205 0.1275 sec/batch
Epoch 17/20  Iteration 2902/3560 Training loss: 1.2209 0.1273 sec/batch
Epoch 17/20  Iteration 2903/3560 Training loss: 1.2206 0.1275 sec/batch
Epoch 17/20  Iteration 2904/3560 Training loss: 1.2206 0.1275 sec/batch
Epoch 17/20  Iteration 2905/3560 Training loss: 1.2209 0.1274 sec/batch
Epoch 17/20  Iteration 2906/3560 Training loss: 1.2206 0.1275 sec/batch
Epoch 17/20  Iteration 2907/3560 Training loss: 1.2200 0.1272 sec/batch
Epoch 17/20  Iteration 2908/3560 Training loss: 1.2204 0.1278 sec/batch
Epoch 17/20  Iteration 2909/3560 Training loss: 1.2204 0.1274 sec/batch
Epoch 17/20  Iteration 2910/3560 Training loss: 1.2212 0.1274 se

Epoch 17/20  Iteration 3011/3560 Training loss: 1.2147 0.1277 sec/batch
Epoch 17/20  Iteration 3012/3560 Training loss: 1.2147 0.1274 sec/batch
Epoch 17/20  Iteration 3013/3560 Training loss: 1.2147 0.1274 sec/batch
Epoch 17/20  Iteration 3014/3560 Training loss: 1.2146 0.1273 sec/batch
Epoch 17/20  Iteration 3015/3560 Training loss: 1.2148 0.1273 sec/batch
Epoch 17/20  Iteration 3016/3560 Training loss: 1.2152 0.1272 sec/batch
Epoch 17/20  Iteration 3017/3560 Training loss: 1.2152 0.1271 sec/batch
Epoch 17/20  Iteration 3018/3560 Training loss: 1.2152 0.1276 sec/batch
Epoch 17/20  Iteration 3019/3560 Training loss: 1.2151 0.1274 sec/batch
Epoch 17/20  Iteration 3020/3560 Training loss: 1.2150 0.1273 sec/batch
Epoch 17/20  Iteration 3021/3560 Training loss: 1.2152 0.1274 sec/batch
Epoch 17/20  Iteration 3022/3560 Training loss: 1.2153 0.1273 sec/batch
Epoch 17/20  Iteration 3023/3560 Training loss: 1.2153 0.1274 sec/batch
Epoch 17/20  Iteration 3024/3560 Training loss: 1.2151 0.1272 se

Epoch 18/20  Iteration 3125/3560 Training loss: 1.2049 0.1274 sec/batch
Epoch 18/20  Iteration 3126/3560 Training loss: 1.2045 0.1275 sec/batch
Epoch 18/20  Iteration 3127/3560 Training loss: 1.2044 0.1273 sec/batch
Epoch 18/20  Iteration 3128/3560 Training loss: 1.2042 0.1274 sec/batch
Epoch 18/20  Iteration 3129/3560 Training loss: 1.2041 0.1271 sec/batch
Epoch 18/20  Iteration 3130/3560 Training loss: 1.2040 0.1274 sec/batch
Epoch 18/20  Iteration 3131/3560 Training loss: 1.2039 0.1275 sec/batch
Epoch 18/20  Iteration 3132/3560 Training loss: 1.2038 0.1274 sec/batch
Epoch 18/20  Iteration 3133/3560 Training loss: 1.2037 0.1274 sec/batch
Epoch 18/20  Iteration 3134/3560 Training loss: 1.2037 0.1273 sec/batch
Epoch 18/20  Iteration 3135/3560 Training loss: 1.2035 0.1273 sec/batch
Epoch 18/20  Iteration 3136/3560 Training loss: 1.2036 0.1272 sec/batch
Epoch 18/20  Iteration 3137/3560 Training loss: 1.2035 0.1271 sec/batch
Epoch 18/20  Iteration 3138/3560 Training loss: 1.2033 0.1273 se

Epoch 19/20  Iteration 3239/3560 Training loss: 1.2021 0.1274 sec/batch
Epoch 19/20  Iteration 3240/3560 Training loss: 1.2022 0.1274 sec/batch
Epoch 19/20  Iteration 3241/3560 Training loss: 1.2014 0.1275 sec/batch
Epoch 19/20  Iteration 3242/3560 Training loss: 1.2003 0.1276 sec/batch
Epoch 19/20  Iteration 3243/3560 Training loss: 1.1992 0.1275 sec/batch
Epoch 19/20  Iteration 3244/3560 Training loss: 1.1988 0.1274 sec/batch
Epoch 19/20  Iteration 3245/3560 Training loss: 1.1982 0.1273 sec/batch
Epoch 19/20  Iteration 3246/3560 Training loss: 1.1990 0.1274 sec/batch
Epoch 19/20  Iteration 3247/3560 Training loss: 1.1987 0.1273 sec/batch
Epoch 19/20  Iteration 3248/3560 Training loss: 1.1980 0.1275 sec/batch
Epoch 19/20  Iteration 3249/3560 Training loss: 1.1981 0.1274 sec/batch
Epoch 19/20  Iteration 3250/3560 Training loss: 1.1974 0.1274 sec/batch
Epoch 19/20  Iteration 3251/3560 Training loss: 1.1969 0.1272 sec/batch
Epoch 19/20  Iteration 3252/3560 Training loss: 1.1966 0.1273 se

Epoch 19/20  Iteration 3353/3560 Training loss: 1.1910 0.1272 sec/batch
Epoch 19/20  Iteration 3354/3560 Training loss: 1.1910 0.1274 sec/batch
Epoch 19/20  Iteration 3355/3560 Training loss: 1.1907 0.1273 sec/batch
Epoch 19/20  Iteration 3356/3560 Training loss: 1.1906 0.1274 sec/batch
Epoch 19/20  Iteration 3357/3560 Training loss: 1.1906 0.1270 sec/batch
Epoch 19/20  Iteration 3358/3560 Training loss: 1.1906 0.1273 sec/batch
Epoch 19/20  Iteration 3359/3560 Training loss: 1.1906 0.1276 sec/batch
Epoch 19/20  Iteration 3360/3560 Training loss: 1.1906 0.1272 sec/batch
Epoch 19/20  Iteration 3361/3560 Training loss: 1.1906 0.1275 sec/batch
Epoch 19/20  Iteration 3362/3560 Training loss: 1.1905 0.1274 sec/batch
Epoch 19/20  Iteration 3363/3560 Training loss: 1.1903 0.1279 sec/batch
Epoch 19/20  Iteration 3364/3560 Training loss: 1.1905 0.1274 sec/batch
Epoch 19/20  Iteration 3365/3560 Training loss: 1.1907 0.1273 sec/batch
Epoch 19/20  Iteration 3366/3560 Training loss: 1.1908 0.1274 se

Epoch 20/20  Iteration 3467/3560 Training loss: 1.1886 0.1272 sec/batch
Epoch 20/20  Iteration 3468/3560 Training loss: 1.1884 0.1276 sec/batch
Epoch 20/20  Iteration 3469/3560 Training loss: 1.1882 0.1273 sec/batch
Epoch 20/20  Iteration 3470/3560 Training loss: 1.1879 0.1274 sec/batch
Epoch 20/20  Iteration 3471/3560 Training loss: 1.1876 0.1272 sec/batch
Epoch 20/20  Iteration 3472/3560 Training loss: 1.1877 0.1273 sec/batch
Epoch 20/20  Iteration 3473/3560 Training loss: 1.1875 0.1273 sec/batch
Epoch 20/20  Iteration 3474/3560 Training loss: 1.1874 0.1275 sec/batch
Epoch 20/20  Iteration 3475/3560 Training loss: 1.1870 0.1271 sec/batch
Epoch 20/20  Iteration 3476/3560 Training loss: 1.1867 0.1273 sec/batch
Epoch 20/20  Iteration 3477/3560 Training loss: 1.1865 0.1271 sec/batch
Epoch 20/20  Iteration 3478/3560 Training loss: 1.1865 0.1273 sec/batch
Epoch 20/20  Iteration 3479/3560 Training loss: 1.1865 0.1273 sec/batch
Epoch 20/20  Iteration 3480/3560 Training loss: 1.1861 0.1273 se

#### Saved checkpoints

Read up on saving and loading checkpoints here: https://www.tensorflow.org/programmers_guide/variables

In [67]:
tf.train.get_checkpoint_state('checkpoints')

model_checkpoint_path: "checkpoints/i3560_l512_v1.126.ckpt"
all_model_checkpoint_paths: "checkpoints/i200_l512_v2.472.ckpt"
all_model_checkpoint_paths: "checkpoints/i400_l512_v2.028.ckpt"
all_model_checkpoint_paths: "checkpoints/i600_l512_v1.798.ckpt"
all_model_checkpoint_paths: "checkpoints/i800_l512_v1.627.ckpt"
all_model_checkpoint_paths: "checkpoints/i1000_l512_v1.514.ckpt"
all_model_checkpoint_paths: "checkpoints/i1200_l512_v1.425.ckpt"
all_model_checkpoint_paths: "checkpoints/i1400_l512_v1.347.ckpt"
all_model_checkpoint_paths: "checkpoints/i1600_l512_v1.298.ckpt"
all_model_checkpoint_paths: "checkpoints/i1800_l512_v1.263.ckpt"
all_model_checkpoint_paths: "checkpoints/i2000_l512_v1.233.ckpt"
all_model_checkpoint_paths: "checkpoints/i2200_l512_v1.213.ckpt"
all_model_checkpoint_paths: "checkpoints/i2400_l512_v1.192.ckpt"
all_model_checkpoint_paths: "checkpoints/i2600_l512_v1.176.ckpt"
all_model_checkpoint_paths: "checkpoints/i2800_l512_v1.162.ckpt"
all_model_checkpoint_paths: "check
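
The checkpoint names listed above appear to follow the pattern `i<iteration>_l<LSTM size>_v<loss>.ckpt`. Assuming that reading is right (treating the `v` field as a validation loss is an assumption here), a small helper can pull those fields out so a checkpoint can be chosen by iteration or by loss. `parse_checkpoint_name` is a hypothetical helper written for this sketch, not part of TensorFlow.

```python
import re

# Assumed naming pattern: i<iteration>_l<lstm_size>_v<loss>.ckpt
CKPT_PATTERN = re.compile(r"i(\d+)_l(\d+)_v(\d+\.\d+)\.ckpt$")

def parse_checkpoint_name(path):
    """Return (iteration, lstm_size, loss) parsed from a checkpoint path."""
    match = CKPT_PATTERN.search(path)
    if match is None:
        raise ValueError("unrecognized checkpoint name: {}".format(path))
    iteration, lstm_size, loss = match.groups()
    return int(iteration), int(lstm_size), float(loss)

# A few of the paths from the listing above
paths = [
    "checkpoints/i200_l512_v2.472.ckpt",
    "checkpoints/i1000_l512_v1.514.ckpt",
    "checkpoints/i3560_l512_v1.126.ckpt",
]

# Choose the path whose name records the lowest loss
best_path = min(paths, key=lambda p: parse_checkpoint_name(p)[2])
print(best_path)
```

Parsing the names like this only works as long as the naming pattern holds, so the helper raises an error on anything it does not recognize rather than guessing.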

## Sampling

Now that the network is trained, we can use it to generate new text. The idea is that we pass in a character and the network predicts the next character. We then feed that new character back in to predict the one after it, and keep doing this to generate entirely new text. I also included some functionality to prime the network with some text by passing in a string and building up a state from that.

The network gives us a prediction for every character in the vocabulary. To reduce noise and make things a little less random, I only choose the new character from the top N most likely characters.



In [68]:
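def pick_top_n(preds, vocab_size, top_n=5):
    # Drop the singleton dimensions so we have a 1-D array of probabilities
    p = np.squeeze(preds)
    # Zero out everything except the top_n most likely characters
    p[np.argsort(p)[:-top_n]] = 0
    # Renormalize so the remaining probabilities sum to 1
    p = p / np.sum(p)
    # Draw one character index from the truncated distribution
    c = np.random.choice(vocab_size, 1, p=p)[0]
    return c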
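
To see what the top-N filter does to a distribution, here is a small sketch on made-up probabilities over a six-character vocabulary. `top_n_distribution` is a hypothetical helper that repeats the zero-and-renormalize step from `pick_top_n`, but works on a copy so the input is left alone.

```python
import numpy as np

def top_n_distribution(probs, top_n=5):
    """Keep the top_n largest probabilities, zero the rest, renormalize."""
    p = np.array(probs, dtype=np.float64)  # copy, so the input is not modified
    p[np.argsort(p)[:-top_n]] = 0
    return p / np.sum(p)

# Made-up probabilities for illustration; indices 1 and 3 are the two largest
probs = [0.05, 0.40, 0.10, 0.25, 0.15, 0.05]
filtered = top_n_distribution(probs, top_n=2)
print(filtered)

# Sampling from the filtered distribution can only ever return index 1 or 3
rng = np.random.RandomState(0)
draws = rng.choice(len(probs), size=1000, p=filtered)
print(set(draws.tolist()))
```

A smaller `top_n` makes the samples more conservative, while a larger one lets less likely characters through.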

In [75]:
def sample(checkpoint, n_samples, lstm_size, vocab_size, prime="The "):
    # Start the output with the priming text
    samples = [c for c in prime]
    model = CharRNN(vocab_size, lstm_size=lstm_size, sampling=True)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        saver.restore(sess, checkpoint)
        new_state = sess.run(model.initial_state)
        # Feed the priming text one character at a time to build up the state
        for c in prime:
            x = np.zeros((1, 1))
            x[0,0] = vocab_to_int[c]
            feed = {model.inputs: x,
                    model.keep_prob: 1.,
                    model.initial_state: new_state}
            preds, new_state = sess.run([model.prediction, model.final_state],
                                         feed_dict=feed)

        # Pick the first new character from the predictions after the prime
        c = pick_top_n(preds, vocab_size)
        samples.append(int_to_vocab[c])

        # Feed each sampled character back in to predict the next one
        for i in range(n_samples):
            x[0,0] = c
            feed = {model.inputs: x,
                    model.keep_prob: 1.,
                    model.initial_state: new_state}
            preds, new_state = sess.run([model.prediction, model.final_state],
                                         feed_dict=feed)

            c = pick_top_n(preds, vocab_size)
            samples.append(int_to_vocab[c])

    return ''.join(samples)
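
The prime-then-generate loop in `sample` does not depend on TensorFlow itself. As a minimal sketch, the version below uses a hypothetical `toy_step` function in place of `sess.run` on the network, and a made-up three-letter vocabulary. It shows the same flow: run the prime through to build up the state, then keep feeding each sampled character back in. None of these names come from the notebook.

```python
import random

VOCAB = ["a", "b", "c"]  # made-up vocabulary for illustration

def toy_step(char, state):
    """Hypothetical stand-in for the network: returns (probs, new_state).

    The state is just a count of characters seen so far, and the
    prediction favors the character that follows `char` in VOCAB.
    """
    nxt = (VOCAB.index(char) + 1) % len(VOCAB)
    probs = [0.1] * len(VOCAB)
    probs[nxt] = 0.8
    return probs, state + 1

def toy_sample(n_samples, prime="ab", seed=0):
    rng = random.Random(seed)
    samples = list(prime)
    state = 0
    # Prime: run each character through the model to build up the state
    for c in prime:
        probs, state = toy_step(c, state)
    # Generate: sample a character, then feed it back in as the next input
    for _ in range(n_samples):
        c = rng.choices(VOCAB, weights=probs, k=1)[0]
        samples.append(c)
        probs, state = toy_step(c, state)
    return "".join(samples), state

text, steps = toy_sample(10)
print(text, steps)
```

The key point is that both the predictions and the state are carried from one step to the next, which is what `new_state` does in `sample`.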

Here, pass in the path to a checkpoint and sample from the network.

In [76]:
tf.train.latest_checkpoint('checkpoints')

'checkpoints/i3560_l512_v1.126.ckpt'

In [77]:
checkpoint = tf.train.latest_checkpoint('checkpoints')
samp = sample(checkpoint, 2000, lstm_size, len(vocab), prime="Far")
print(samp)

Farron's head with the character, and so as a little shamely and him was
so man who such as though she was standing which he had been telling
him off. But the morning the strunge he could not hear his from a little,
who was stopping at the door, which was a smile as an honor at the right
that had not been, and she caught simple of intensity, though he
hed her sister in the clerk of her shortice. They was dispreasune of this
property of the dirorment of his face, and he were so the matter to him to
see him a chood in her handsome assument to activing and an indullabies, and
to the peasant saw that this shoulder something they seemed to him and
did not know her. She wanted to say, and though they were all a misupation.

And this was too. Alexey Alexandrovitch should have been saying that he was
as the dinner who had been asked whether his face, and the simple and
her face and so much were thinking was no little secrets, who had set
off him a long while she had been so sound in which it w

In [83]:
checkpoint = 'checkpoints/i200_l512_v2.397.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

Farged ofo fo that sond ha sorit the harisg antes ot asd an he the se ant ar sat hos he he tha d sher had ant toth tind ard and silt tar he has silt at inle bande toud an on tha hin honges he we cith serserering and saredit hes ang shard oont want hom
sorersant, wan tere te th me te sor the tout
he sering,
ang an he and hos than sot herased to wat she wad so he sos sithirg and ans as tho ho her tores thit hetherer so her ate an allinde nos
and and or han the so he he was ane so thon ha sererand.
"Whe sele at an sathe he to this whim ang tan the wot

he the he hre he he ton thes teothed tithas saren ad as ane thim wome tin whis thengasg at he ale bin he tise so silles tousgin he wald to the went ans ha che he arered sorin had wand ting hat san sis thint ous ot the he wite ales thon hos and ato nit as to was har and tha here the the san se ceatho the wen hang thand an of he she sile,. "ha whor the sothind asd afe the sothon wos he har ha tan than sot har te ho sor an aned thor the wes he

In [84]:
checkpoint = 'checkpoints/i600_l512_v1.798.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

Farllyy she wat a san the
prositat the soncousterthed
him and would to her. Ale the sonether tint her
the saires in the coulting him
her and working of ser the his
befter, and to steling a prepestled, and had and the past and to be her hissess.

"It she tay and hes to he would
to had to bither the chist te whone her aloned he cunttions, and her and hiss alled to be and that horself and theres horself. Ser his word would bet has and that to had talked the sontion her, and his thought, she was
saithing of that the wat him her. Alexee Alexay Arexandivatch, some to sometellert at
heress of his will
the westerting her to be was all sho weres the peasiout,. "Wat a lated and to the sain. "I wan in that he fild that it," ha wese thourd and that the sack of the serpenting on the
partiant of hissenfell, would bett ald the sourd. Her the had to him beat over tere that she had sill for him.

"Ohe said, the pacessersested as on to the crilles. The sore, whele wes to to sereeds worknion
she thinedse

In [86]:
checkpoint = 'checkpoints/i1000_l512_v1.881.ckpt'
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

Farrianct, had he doon the bound soment on the brow had
bot in tho case to taking to she he that she dack to be thanking and would to her wourn the
raigine the rither with hors a doon the stare with same the sting of that
whe wat at at hadent on the bean this to bean
was stronce ond the densted and the rades of the bastersess then this sond of abe couthinc, he dond the came, but she wound of a so that the derter the sore. She was she to bean wene tall to the coresiot that to the borg of the crulde treinced and that
his a that take the thet
to he whined a the beater thit
him a montiof inte trice of
harrseden of thought of the somet an the roor the coms to say she comer the round, and and soudd by his sate the stold the roorte her. She was and condirg to
she suthing ond to beca one one and him to be cust a chich a thers and at treet tike to hum. The
crossering to but
she had. "You con't say a manest interesting!" said the dove of the carst on some a santed ot her wandere of and a shill s