# Anna KaRNNa

In this notebook, I'll build a character-wise RNN trained on Anna Karenina, one of my all-time favorite books. It'll be able to generate new text based on the text from the book.

This network is based on Andrej Karpathy's [post on RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) and [implementation in Torch](https://github.com/karpathy/char-rnn). It also draws on [this post at r2rt](http://r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html) and on [Sherjil Ozair's implementation](https://github.com/sherjilozair/char-rnn-tensorflow) on GitHub. Below is the general architecture of the character-wise RNN.

<img src="assets/charseq.jpeg" width="500">

In [1]:
import time
from collections import namedtuple

import numpy as np
import tensorflow as tf

First we'll load the text file and convert it into integers for our network to use.

In [2]:
with open('anna.txt', 'r') as f:
    text = f.read()
vocab = set(text)
vocab_to_int = {c: i for i, c in enumerate(vocab)}
int_to_vocab = dict(enumerate(vocab))
chars = np.array([vocab_to_int[c] for c in text], dtype=np.int32)

In [3]:
text[:100]

'Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverythin'

In [4]:
chars[:100]

array([76, 13, 60,  0, 57, 24, 59, 61, 21, 53, 53, 53, 10, 60,  0,  0, 20,
       61, 17, 60, 77, 75, 32, 75, 24, 37, 61, 60, 59, 24, 61, 60, 32, 32,
       61, 60, 32, 75, 12, 24, 28, 61, 24, 81, 24, 59, 20, 61, 54, 43, 13,
       60,  0,  0, 20, 61, 17, 60, 77, 75, 32, 20, 61, 75, 37, 61, 54, 43,
       13, 60,  0,  0, 20, 61, 75, 43, 61, 75, 57, 37, 61, 30, 73, 43, 53,
       73, 60, 20, 42, 53, 53, 38, 81, 24, 59, 20, 57, 13, 75, 43], dtype=int32)
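A quick sanity check on the encoding is that decoding the integers recovers the original text. The sketch below uses a short hypothetical string in place of `anna.txt`. It also sorts the vocabulary, which is an addition of mine: the cell above iterates over a `set`, whose order can differ between sessions, so sorting keeps the character ids reproducible.

```python
import numpy as np

sample = "happy families"  # hypothetical stand-in for the book text

# Sorting is an assumption added here; it makes the character ids deterministic.
vocab = sorted(set(sample))
vocab_to_int = {c: i for i, c in enumerate(vocab)}
int_to_vocab = dict(enumerate(vocab))

# Encode to integers, then decode back to characters.
encoded = np.array([vocab_to_int[c] for c in sample], dtype=np.int32)
decoded = ''.join(int_to_vocab[int(i)] for i in encoded)

print(encoded[:5])
print(decoded == sample)
```

If the ids need to stay stable across sessions (for example when reloading a trained network), a fixed ordering like this is one way to get there.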

Now I need to split up the data into batches, and into training and validation sets. I should be making a test set here, but I'm not going to worry about that. My test will be if the network can generate new text.

Here I'll make both input and target arrays. The targets are the same as the inputs, except shifted one character over. I'll also drop the last bit of data so that I'll only have completely full batches.

The idea here is to make a 2D matrix where the number of rows is equal to the batch size, that is, the number of sequences in each batch. Each row will be one long contiguous stretch of the character data. We'll split this data into a training set and validation set using the `split_frac` keyword. With the default of 0.9, this keeps the first 90% of the batches in the training set and the remaining 10% in the validation set.
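To make the layout concrete, here is a small sketch of the same steps on toy data. The array `chars` below is a hypothetical stand-in (just the integers 0 to 40), and the tiny `batch_size` and `num_steps` are chosen only so the result is easy to read.

```python
import numpy as np

chars = np.arange(41)          # hypothetical stand-in for the encoded text
batch_size, num_steps, split_frac = 2, 5, 0.9

slice_size = batch_size * num_steps
n_batches = len(chars) // slice_size            # number of full batches

# Targets are the inputs shifted one position to the right.
x = chars[:n_batches * slice_size]
y = chars[1:n_batches * slice_size + 1]

# One row per sequence in a batch: shape (batch_size, n_batches * num_steps).
x = np.stack(np.split(x, batch_size))
y = np.stack(np.split(y, batch_size))

# Keep the first split_frac of the batches (columns) for training.
split_idx = int(n_batches * split_frac)
train_x, val_x = x[:, :split_idx * num_steps], x[:, split_idx * num_steps:]
train_y, val_y = y[:, :split_idx * num_steps], y[:, split_idx * num_steps:]

print(x.shape, train_x.shape, val_x.shape)
print(train_x[0, :5], train_y[0, :5])
```

Because the toy data is a simple count, each target is exactly its input plus one, which makes the one-character shift easy to verify by eye.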

In [5]:
def split_data(chars, batch_size, num_steps, split_frac=0.9):
    """ 
    Split character data into training and validation sets, inputs and targets for each set.
    
    Arguments
    ---------
    chars: character array
    batch_size: Number of sequences in each batch
    num_steps: Number of sequence steps to keep in the input and pass to the network
    split_frac: Fraction of batches to keep in the training set
    
    
    Returns train_x, train_y, val_x, val_y
    """
    
    slice_size = batch_size * num_steps
    n_batches = len(chars) // slice_size
    
    # Drop the last few characters to make only full batches
    x = chars[: n_batches*slice_size]
    y = chars[1: n_batches*slice_size + 1]
    
    # Split the data into batch_size slices, then stack them into a 2D matrix 
    x = np.stack(np.split(x, batch_size))
    y = np.stack(np.split(y, batch_size))
    
    # Now x and y are arrays with dimensions batch_size x n_batches*num_steps
    
    # Split into training and validation sets, keep the first split_frac batches for training
    split_idx = int(n_batches*split_frac)
    train_x, train_y = x[:, :split_idx*num_steps], y[:, :split_idx*num_steps]
    val_x, val_y = x[:, split_idx*num_steps:], y[:, split_idx*num_steps:]
    
    return train_x, train_y, val_x, val_y

In [6]:
train_x, train_y, val_x, val_y = split_data(chars, 10, 200)

In [7]:
train_x.shape

(10, 178400)

In [8]:
train_x[:,:10]

array([[76, 13, 60,  0, 57, 24, 59, 61, 21, 53],
       [ 2, 43, 46, 61, 13, 24, 61, 77, 30, 81],
       [61, 64, 60, 57, 64, 13, 75, 43, 18, 61],
       [30, 57, 13, 24, 59, 61, 73, 30, 54, 32],
       [61, 57, 13, 24, 61, 32, 60, 43, 46, 63],
       [61, 44, 13, 59, 30, 54, 18, 13, 61, 32],
       [57, 61, 57, 30, 53, 46, 30, 42, 53, 53],
       [30, 61, 13, 24, 59, 37, 24, 32, 17, 31],
       [13, 60, 57, 61, 75, 37, 61, 57, 13, 24],
       [24, 59, 37, 24, 32, 17, 61, 60, 43, 46]], dtype=int32)

I'll write another function to grab batches out of the arrays made by `split_data`. Here each batch will be a sliding window on these arrays with size `batch_size x num_steps`. For example, if we want our network to train on a sequence of 100 characters, `num_steps = 100`. For the next batch, we'll shift this window to the next sequence of `num_steps` characters. In this way we can feed batches to the network and the cell states will carry over from one batch to the next.

In [9]:
def get_batch(arrs, num_steps):
    batch_size, slice_size = arrs[0].shape
    
    n_batches = slice_size // num_steps
    for b in range(n_batches):
        yield [x[:, b*num_steps: (b+1)*num_steps] for x in arrs]
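As a rough illustration of the sliding window, the sketch below runs the generator on two tiny hypothetical arrays (`toy_x` and `toy_y` are not part of the notebook). The function is repeated so the example runs on its own.

```python
import numpy as np

def get_batch(arrs, num_steps):
    # Same logic as the cell above, repeated so this sketch is self-contained.
    batch_size, slice_size = arrs[0].shape
    n_batches = slice_size // num_steps
    for b in range(n_batches):
        yield [x[:, b*num_steps:(b+1)*num_steps] for x in arrs]

toy_x = np.arange(12).reshape(2, 6)   # hypothetical inputs: 2 sequences of 6 steps
toy_y = toy_x + 1                     # hypothetical targets, shifted by one

batches = list(get_batch([toy_x, toy_y], num_steps=3))
for x, y in batches:
    print(x.tolist(), y.tolist())
```

Row 0 of the second batch picks up exactly where row 0 of the first batch stopped, which is why it makes sense to carry the LSTM state from one batch into the next.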

In [10]:
def build_rnn(num_classes, batch_size=50, num_steps=50, lstm_size=128, num_layers=2,
              learning_rate=0.001, grad_clip=5, sampling=False):
        
    if sampling:
        batch_size, num_steps = 1, 1

    tf.reset_default_graph()
    
    # Declare placeholders we'll feed into the graph
    with tf.name_scope('inputs'):
        inputs = tf.placeholder(tf.int32, [batch_size, num_steps], name='inputs')
        x_one_hot = tf.one_hot(inputs, num_classes, name='x_one_hot')
    
    with tf.name_scope('targets'):
        targets = tf.placeholder(tf.int32, [batch_size, num_steps], name='targets')
        y_one_hot = tf.one_hot(targets, num_classes, name='y_one_hot')
        y_reshaped = tf.reshape(y_one_hot, [-1, num_classes])
    
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')
    
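    # Build the RNN layers; each layer gets its own cell object, since reusing
    # one cell for every layer can fail on TensorFlow versions after 1.0
    with tf.name_scope("RNN_cells"):
        def build_cell():
            lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
            return tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
        cell = tf.contrib.rnn.MultiRNNCell([build_cell() for _ in range(num_layers)])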
    
    with tf.name_scope("RNN_init_state"):
        initial_state = cell.zero_state(batch_size, tf.float32)

    # Run the data through the RNN layers
    with tf.name_scope("RNN_forward"):
        outputs, state = tf.nn.dynamic_rnn(cell, x_one_hot, initial_state=initial_state)
    
    final_state = state
    
    # Reshape output so it's a bunch of rows, one row for each cell output
    with tf.name_scope('sequence_reshape'):
        seq_output = tf.concat(outputs, axis=1, name='seq_output')
        output = tf.reshape(seq_output, [-1, lstm_size], name='graph_output')
    
    # Now connect the RNN outputs to a softmax layer and calculate the cost
    with tf.name_scope('logits'):
        softmax_w = tf.Variable(tf.truncated_normal((lstm_size, num_classes), stddev=0.1),
                               name='softmax_w')
        softmax_b = tf.Variable(tf.zeros(num_classes), name='softmax_b')
        logits = tf.matmul(output, softmax_w) + softmax_b
        tf.summary.histogram('softmax_w', softmax_w)
        tf.summary.histogram('softmax_b', softmax_b)

    with tf.name_scope('predictions'):
        preds = tf.nn.softmax(logits, name='predictions')
        tf.summary.histogram('predictions', preds)
    
    with tf.name_scope('cost'):
        loss = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_reshaped, name='loss')
        cost = tf.reduce_mean(loss, name='cost')
        tf.summary.scalar('cost', cost)

    # Optimizer for training, using gradient clipping to control exploding gradients
    with tf.name_scope('train'):
        tvars = tf.trainable_variables()
        grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), grad_clip)
        train_op = tf.train.AdamOptimizer(learning_rate)
        optimizer = train_op.apply_gradients(zip(grads, tvars))
    
    merged = tf.summary.merge_all()
    
    # Export the nodes 
    export_nodes = ['inputs', 'targets', 'initial_state', 'final_state',
                    'keep_prob', 'cost', 'preds', 'optimizer', 'merged']
    Graph = namedtuple('Graph', export_nodes)
    local_dict = locals()
    graph = Graph(*[local_dict[each] for each in export_nodes])
    
    return graph
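The gradient clipping step in the `train` scope can be sketched in plain NumPy. As far as I understand the TensorFlow documentation, `tf.clip_by_global_norm` rescales all gradients by one shared factor so that their combined L2 norm does not exceed the clip value. The function and the gradient values below are hypothetical and only illustrate that idea; this is not TensorFlow's implementation.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients by one shared factor so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # If the norm is already below clip_norm, the scale is 1 and nothing changes.
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # hypothetical gradients
clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)

print(norm)
print(clipped)
```

Because every gradient is scaled by the same factor, the direction of the overall update is preserved; only its length is capped.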

## Hyperparameters

Here I'm defining the hyperparameters for the network. The two you probably haven't seen before are `lstm_size` and `num_layers`. These set the number of hidden units in each LSTM layer and the number of LSTM layers, respectively. Making these larger gives the network more capacity, which usually lowers the training loss, but it also makes training slower and you'll have to watch out for overfitting. If your validation loss is much larger than the training loss, you're probably overfitting. In that case, decrease the size of the network or decrease the dropout keep probability.

In [11]:
batch_size = 100
num_steps = 100
lstm_size = 512
num_layers = 2
learning_rate = 0.001
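To get a feel for how quickly the network grows with `lstm_size` and `num_layers`, here is a rough parameter count. It assumes the standard LSTM layout (four gates, each with weights over the concatenated input and hidden state plus a bias), separate weights for each layer, and a hypothetical vocabulary of 83 characters; the notebook itself uses `len(vocab)`.

```python
def lstm_param_count(num_classes, lstm_size, num_layers):
    """Approximate trainable parameters: stacked LSTM layers on one-hot inputs plus a softmax layer."""
    total = 0
    input_dim = num_classes            # one-hot characters feed the first layer
    for _ in range(num_layers):
        # Four gates, each with weights over [input, hidden] plus a bias.
        total += 4 * lstm_size * (input_dim + lstm_size + 1)
        input_dim = lstm_size          # later layers read the previous layer's output
    total += lstm_size * num_classes + num_classes   # softmax weights and biases
    return total

num_classes = 83   # hypothetical vocabulary size, an assumption for this sketch
for size in (128, 256, 512):
    print(size, lstm_param_count(num_classes, size, num_layers=2))
```

The count grows roughly with the square of `lstm_size`, so doubling the hidden size costs far more than adding a layer of the same size.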

## Training

Time for training, which is pretty straightforward. Here I pass in some data and get an LSTM state back. Then I pass that state back into the network so the next batch can continue from the state of the previous batch. At every iteration I also write a summary for TensorBoard. Note that the `train` function below does not calculate the validation loss or save checkpoints, and it has no `save_every_n` setting.

In [12]:
!mkdir -p checkpoints/anna

In [13]:
def train(model, epochs, file_writer):
    
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        # To resume training from a checkpoint, create a tf.train.Saver() first,
        # then uncomment the line below
        #saver.restore(sess, 'checkpoints/anna20.ckpt')

        n_batches = train_x.shape[1] // num_steps
        iterations = n_batches * epochs
        for e in range(epochs):

            # Train network
            new_state = sess.run(model.initial_state)
            loss = 0
            for b, (x, y) in enumerate(get_batch([train_x, train_y], num_steps), 1):
                iteration = e*n_batches + b
                start = time.time()
                feed = {model.inputs: x,
                        model.targets: y,
                        model.keep_prob: 0.5,
                        model.initial_state: new_state}
                summary, batch_loss, new_state, _ = sess.run([model.merged, model.cost, 
                                                              model.final_state, model.optimizer], 
                                                              feed_dict=feed)
                loss += batch_loss
                end = time.time()
                print('Epoch {}/{} '.format(e+1, epochs),
                      'Iteration {}/{}'.format(iteration, iterations),
                      'Training loss: {:.4f}'.format(loss/b),
                      '{:.4f} sec/batch'.format((end-start)))

                file_writer.add_summary(summary, iteration)
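The `iterations` total in `train` follows directly from the arithmetic in `split_data`. The sketch below reproduces it for `batch_size = 100`, `num_steps = 100` and a run of 20 epochs. The text length is an assumption (the exact character count is not shown in the notebook), chosen to be consistent with the `(10, 178400)` training shape printed earlier.

```python
def batches_per_epoch(n_chars, batch_size, num_steps, split_frac=0.9):
    """Number of training batches per epoch, following the arithmetic in split_data."""
    n_batches = n_chars // (batch_size * num_steps)   # full batches in the whole text
    return int(n_batches * split_frac)                # batches kept for training

n_chars = 1985000   # assumed length; the exact character count is not shown
per_epoch = batches_per_epoch(n_chars, batch_size=100, num_steps=100)

print(per_epoch, per_epoch * 20)
```

Under that assumption, each epoch has 178 training batches, so 20 epochs come to 3560 iterations.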

In [14]:
epochs = 20
batch_size = 100
num_steps = 100
train_x, train_y, val_x, val_y = split_data(chars, batch_size, num_steps)

for lstm_size in [128,256,512]:
    for num_layers in [1, 2]:
        for learning_rate in [0.002, 0.001]:
            log_string = 'logs/4/lr={},rl={},ru={}'.format(learning_rate, num_layers, lstm_size)
            writer = tf.summary.FileWriter(log_string)
            model = build_rnn(len(vocab), 
                    batch_size=batch_size,
                    num_steps=num_steps,
                    learning_rate=learning_rate,
                    lstm_size=lstm_size,
                    num_layers=num_layers)
            
            train(model, epochs, writer)

Epoch 1/20  Iteration 1/3560 Training loss: 4.4216 0.6137 sec/batch
Epoch 1/20  Iteration 2/3560 Training loss: 4.4110 0.4663 sec/batch
Epoch 1/20  Iteration 3/3560 Training loss: 4.3990 0.4647 sec/batch
Epoch 1/20  Iteration 4/3560 Training loss: 4.3810 0.4547 sec/batch
Epoch 1/20  Iteration 5/3560 Training loss: 4.3435 0.4454 sec/batch
Epoch 1/20  Iteration 6/3560 Training loss: 4.2762 0.4482 sec/batch
Epoch 1/20  Iteration 7/3560 Training loss: 4.1987 0.4440 sec/batch
Epoch 1/20  Iteration 8/3560 Training loss: 4.1223 0.4513 sec/batch
Epoch 1/20  Iteration 9/3560 Training loss: 4.0532 0.4550 sec/batch
Epoch 1/20  Iteration 10/3560 Training loss: 3.9925 0.4518 sec/batch
Epoch 1/20  Iteration 11/3560 Training loss: 3.9369 0.4550 sec/batch
Epoch 1/20  Iteration 12/3560 Training loss: 3.8891 0.4496 sec/batch
Epoch 1/20  Iteration 13/3560 Training loss: 3.8473 0.4525 sec/batch
Epoch 1/20  Iteration 14/3560 Training loss: 3.8110 0.4490 sec/batch
...

Epoch 1/20  Iteration 120/3560 Training loss: 3.1058 0.4659 sec/batch
Epoch 1/20  Iteration 121/3560 Training loss: 3.1029 0.4656 sec/batch
Epoch 1/20  Iteration 122/3560 Training loss: 3.0998 0.5070 sec/batch
Epoch 1/20  Iteration 123/3560 Training loss: 3.0967 0.4707 sec/batch
Epoch 1/20  Iteration 124/3560 Training loss: 3.0937 0.4747 sec/batch
Epoch 1/20  Iteration 125/3560 Training loss: 3.0907 0.4682 sec/batch
Epoch 1/20  Iteration 126/3560 Training loss: 3.0875 0.4644 sec/batch
Epoch 1/20  Iteration 127/3560 Training loss: 3.0845 0.4664 sec/batch
Epoch 1/20  Iteration 128/3560 Training loss: 3.0817 0.4798 sec/batch
Epoch 1/20  Iteration 129/3560 Training loss: 3.0786 0.4791 sec/batch
Epoch 1/20  Iteration 130/3560 Training loss: 3.0757 0.4660 sec/batch
Epoch 1/20  Iteration 131/3560 Training loss: 3.0728 0.4622 sec/batch
Epoch 1/20  Iteration 132/3560 Training loss: 3.0698 0.4659 sec/batch
Epoch 1/20  Iteration 133/3560 Training loss: 3.0668 0.4660 sec/batch
...

Epoch 2/20  Iteration 238/3560 Training loss: 2.4786 0.4602 sec/batch
Epoch 2/20  Iteration 239/3560 Training loss: 2.4776 0.4665 sec/batch
Epoch 2/20  Iteration 240/3560 Training loss: 2.4768 0.4611 sec/batch
Epoch 2/20  Iteration 241/3560 Training loss: 2.4763 0.4668 sec/batch
Epoch 2/20  Iteration 242/3560 Training loss: 2.4754 0.4612 sec/batch
Epoch 2/20  Iteration 243/3560 Training loss: 2.4743 0.4661 sec/batch
Epoch 2/20  Iteration 244/3560 Training loss: 2.4739 0.4653 sec/batch
Epoch 2/20  Iteration 245/3560 Training loss: 2.4733 0.4676 sec/batch
Epoch 2/20  Iteration 246/3560 Training loss: 2.4721 0.4681 sec/batch
Epoch 2/20  Iteration 247/3560 Training loss: 2.4712 0.4711 sec/batch
Epoch 2/20  Iteration 248/3560 Training loss: 2.4706 0.4678 sec/batch
Epoch 2/20  Iteration 249/3560 Training loss: 2.4698 0.4679 sec/batch
Epoch 2/20  Iteration 250/3560 Training loss: 2.4693 0.4636 sec/batch
Epoch 2/20  Iteration 251/3560 Training loss: 2.4686 0.4661 sec/batch
...

Epoch 2/20  Iteration 356/3560 Training loss: 2.4107 0.4676 sec/batch
Epoch 3/20  Iteration 357/3560 Training loss: 2.3704 0.4751 sec/batch
Epoch 3/20  Iteration 358/3560 Training loss: 2.3339 0.4627 sec/batch
Epoch 3/20  Iteration 359/3560 Training loss: 2.3261 0.4657 sec/batch
Epoch 3/20  Iteration 360/3560 Training loss: 2.3234 0.4723 sec/batch
Epoch 3/20  Iteration 361/3560 Training loss: 2.3227 0.4680 sec/batch
Epoch 3/20  Iteration 362/3560 Training loss: 2.3222 0.4673 sec/batch
Epoch 3/20  Iteration 363/3560 Training loss: 2.3228 0.4750 sec/batch
Epoch 3/20  Iteration 364/3560 Training loss: 2.3241 0.4619 sec/batch
Epoch 3/20  Iteration 365/3560 Training loss: 2.3250 0.4693 sec/batch
Epoch 3/20  Iteration 366/3560 Training loss: 2.3257 0.4655 sec/batch
Epoch 3/20  Iteration 367/3560 Training loss: 2.3242 0.4733 sec/batch
Epoch 3/20  Iteration 368/3560 Training loss: 2.3238 0.4838 sec/batch
Epoch 3/20  Iteration 369/3560 Training loss: 2.3238 0.4684 sec/batch
...

Epoch 3/20  Iteration 474/3560 Training loss: 2.2861 0.4627 sec/batch
Epoch 3/20  Iteration 475/3560 Training loss: 2.2860 0.4651 sec/batch
Epoch 3/20  Iteration 476/3560 Training loss: 2.2857 0.4687 sec/batch
Epoch 3/20  Iteration 477/3560 Training loss: 2.2855 0.4719 sec/batch
Epoch 3/20  Iteration 478/3560 Training loss: 2.2851 0.4657 sec/batch
Epoch 3/20  Iteration 479/3560 Training loss: 2.2847 0.4690 sec/batch
Epoch 3/20  Iteration 480/3560 Training loss: 2.2846 0.4663 sec/batch
Epoch 3/20  Iteration 481/3560 Training loss: 2.2842 0.4645 sec/batch
Epoch 3/20  Iteration 482/3560 Training loss: 2.2838 0.4674 sec/batch
Epoch 3/20  Iteration 483/3560 Training loss: 2.2836 0.4646 sec/batch
Epoch 3/20  Iteration 484/3560 Training loss: 2.2834 0.4781 sec/batch
Epoch 3/20  Iteration 485/3560 Training loss: 2.2831 0.4886 sec/batch
Epoch 3/20  Iteration 486/3560 Training loss: 2.2829 0.4837 sec/batch
Epoch 3/20  Iteration 487/3560 Training loss: 2.2826 0.4744 sec/batch
...

Epoch 4/20  Iteration 592/3560 Training loss: 2.2131 0.5041 sec/batch
Epoch 4/20  Iteration 593/3560 Training loss: 2.2127 0.5016 sec/batch
Epoch 4/20  Iteration 594/3560 Training loss: 2.2128 0.5016 sec/batch
Epoch 4/20  Iteration 595/3560 Training loss: 2.2124 0.5026 sec/batch
Epoch 4/20  Iteration 596/3560 Training loss: 2.2125 0.4969 sec/batch
Epoch 4/20  Iteration 597/3560 Training loss: 2.2126 0.5017 sec/batch
Epoch 4/20  Iteration 598/3560 Training loss: 2.2125 0.5040 sec/batch
Epoch 4/20  Iteration 599/3560 Training loss: 2.2121 0.4971 sec/batch
Epoch 4/20  Iteration 600/3560 Training loss: 2.2121 0.4995 sec/batch
Epoch 4/20  Iteration 601/3560 Training loss: 2.2120 0.4935 sec/batch
Epoch 4/20  Iteration 602/3560 Training loss: 2.2116 0.5378 sec/batch
Epoch 4/20  Iteration 603/3560 Training loss: 2.2112 0.5628 sec/batch
Epoch 4/20  Iteration 604/3560 Training loss: 2.2111 0.5810 sec/batch
Epoch 4/20  Iteration 605/3560 Training loss: 2.2110 0.5704 sec/batch
...

Epoch 4/20  Iteration 710/3560 Training loss: 2.1890 0.4774 sec/batch
Epoch 4/20  Iteration 711/3560 Training loss: 2.1887 0.4664 sec/batch
Epoch 4/20  Iteration 712/3560 Training loss: 2.1886 0.4624 sec/batch
Epoch 5/20  Iteration 713/3560 Training loss: 2.2149 0.4658 sec/batch
Epoch 5/20  Iteration 714/3560 Training loss: 2.1748 0.4695 sec/batch
Epoch 5/20  Iteration 715/3560 Training loss: 2.1608 0.4662 sec/batch
Epoch 5/20  Iteration 716/3560 Training loss: 2.1571 0.4705 sec/batch
Epoch 5/20  Iteration 717/3560 Training loss: 2.1568 0.4654 sec/batch
Epoch 5/20  Iteration 718/3560 Training loss: 2.1518 0.4600 sec/batch
Epoch 5/20  Iteration 719/3560 Training loss: 2.1524 0.4690 sec/batch
Epoch 5/20  Iteration 720/3560 Training loss: 2.1542 0.5255 sec/batch
Epoch 5/20  Iteration 721/3560 Training loss: 2.1564 0.5340 sec/batch
Epoch 5/20  Iteration 722/3560 Training loss: 2.1556 0.5287 sec/batch
Epoch 5/20  Iteration 723/3560 Training loss: 2.1543 0.4701 sec/batch
...

Epoch 5/20  Iteration 828/3560 Training loss: 2.1318 0.4507 sec/batch
Epoch 5/20  Iteration 829/3560 Training loss: 2.1316 0.4488 sec/batch
Epoch 5/20  Iteration 830/3560 Training loss: 2.1314 0.4537 sec/batch
Epoch 5/20  Iteration 831/3560 Training loss: 2.1314 0.4533 sec/batch
Epoch 5/20  Iteration 832/3560 Training loss: 2.1312 0.4514 sec/batch
Epoch 5/20  Iteration 833/3560 Training loss: 2.1310 0.4486 sec/batch
Epoch 5/20  Iteration 834/3560 Training loss: 2.1306 0.4528 sec/batch
Epoch 5/20  Iteration 835/3560 Training loss: 2.1303 0.4464 sec/batch
Epoch 5/20  Iteration 836/3560 Training loss: 2.1303 0.4672 sec/batch
Epoch 5/20  Iteration 837/3560 Training loss: 2.1301 0.4550 sec/batch
Epoch 5/20  Iteration 838/3560 Training loss: 2.1298 0.4459 sec/batch
Epoch 5/20  Iteration 839/3560 Training loss: 2.1296 0.4477 sec/batch
Epoch 5/20  Iteration 840/3560 Training loss: 2.1296 0.4507 sec/batch
Epoch 5/20  Iteration 841/3560 Training loss: 2.1295 0.4475 sec/batch
...

Epoch 6/20  Iteration 946/3560 Training loss: 2.0767 0.4809 sec/batch
Epoch 6/20  Iteration 947/3560 Training loss: 2.0765 0.4844 sec/batch
Epoch 6/20  Iteration 948/3560 Training loss: 2.0761 0.4813 sec/batch
Epoch 6/20  Iteration 949/3560 Training loss: 2.0757 0.4683 sec/batch
Epoch 6/20  Iteration 950/3560 Training loss: 2.0760 0.4708 sec/batch
Epoch 6/20  Iteration 951/3560 Training loss: 2.0756 0.4727 sec/batch
Epoch 6/20  Iteration 952/3560 Training loss: 2.0761 0.4757 sec/batch
Epoch 6/20  Iteration 953/3560 Training loss: 2.0763 0.4715 sec/batch
Epoch 6/20  Iteration 954/3560 Training loss: 2.0762 0.4716 sec/batch
Epoch 6/20  Iteration 955/3560 Training loss: 2.0757 0.4739 sec/batch
Epoch 6/20  Iteration 956/3560 Training loss: 2.0758 0.4735 sec/batch
Epoch 6/20  Iteration 957/3560 Training loss: 2.0759 0.4725 sec/batch
Epoch 6/20  Iteration 958/3560 Training loss: 2.0753 0.4707 sec/batch
Epoch 6/20  Iteration 959/3560 Training loss: 2.0751 0.4739 sec/batch
...

Epoch 6/20  Iteration 1063/3560 Training loss: 2.0570 0.4529 sec/batch
Epoch 6/20  Iteration 1064/3560 Training loss: 2.0570 0.4493 sec/batch
Epoch 6/20  Iteration 1065/3560 Training loss: 2.0570 0.4503 sec/batch
Epoch 6/20  Iteration 1066/3560 Training loss: 2.0570 0.4575 sec/batch
Epoch 6/20  Iteration 1067/3560 Training loss: 2.0567 0.4497 sec/batch
Epoch 6/20  Iteration 1068/3560 Training loss: 2.0566 0.4487 sec/batch
Epoch 7/20  Iteration 1069/3560 Training loss: 2.0888 0.4477 sec/batch
Epoch 7/20  Iteration 1070/3560 Training loss: 2.0509 0.4518 sec/batch
Epoch 7/20  Iteration 1071/3560 Training loss: 2.0428 0.4482 sec/batch
Epoch 7/20  Iteration 1072/3560 Training loss: 2.0372 0.4508 sec/batch
Epoch 7/20  Iteration 1073/3560 Training loss: 2.0337 0.4527 sec/batch
Epoch 7/20  Iteration 1074/3560 Training loss: 2.0256 0.4537 sec/batch
Epoch 7/20  Iteration 1075/3560 Training loss: 2.0275 0.4418 sec/batch
Epoch 7/20  Iteration 1076/3560 Training loss: 2.0295 0.4589 sec/batch
...

Epoch 7/20  Iteration 1179/3560 Training loss: 2.0106 0.4551 sec/batch
Epoch 7/20  Iteration 1180/3560 Training loss: 2.0105 0.4469 sec/batch
Epoch 7/20  Iteration 1181/3560 Training loss: 2.0102 0.4548 sec/batch
Epoch 7/20  Iteration 1182/3560 Training loss: 2.0100 0.4480 sec/batch
Epoch 7/20  Iteration 1183/3560 Training loss: 2.0097 0.4472 sec/batch
Epoch 7/20  Iteration 1184/3560 Training loss: 2.0094 0.4524 sec/batch
Epoch 7/20  Iteration 1185/3560 Training loss: 2.0093 0.4487 sec/batch
Epoch 7/20  Iteration 1186/3560 Training loss: 2.0091 0.4474 sec/batch
Epoch 7/20  Iteration 1187/3560 Training loss: 2.0091 0.4552 sec/batch
Epoch 7/20  Iteration 1188/3560 Training loss: 2.0090 0.4493 sec/batch
Epoch 7/20  Iteration 1189/3560 Training loss: 2.0089 0.4486 sec/batch
Epoch 7/20  Iteration 1190/3560 Training loss: 2.0086 0.4526 sec/batch
Epoch 7/20  Iteration 1191/3560 Training loss: 2.0082 0.4477 sec/batch
Epoch 7/20  Iteration 1192/3560 Training loss: 2.0083 0.4482 sec/batch
...

Epoch 8/20  Iteration 1295/3560 Training loss: 1.9700 0.4477 sec/batch
Epoch 8/20  Iteration 1296/3560 Training loss: 1.9705 0.4522 sec/batch
Epoch 8/20  Iteration 1297/3560 Training loss: 1.9699 0.4511 sec/batch
Epoch 8/20  Iteration 1298/3560 Training loss: 1.9705 0.4498 sec/batch
Epoch 8/20  Iteration 1299/3560 Training loss: 1.9704 0.4536 sec/batch
Epoch 8/20  Iteration 1300/3560 Training loss: 1.9701 0.4541 sec/batch
Epoch 8/20  Iteration 1301/3560 Training loss: 1.9698 0.4470 sec/batch
Epoch 8/20  Iteration 1302/3560 Training loss: 1.9701 0.4495 sec/batch
Epoch 8/20  Iteration 1303/3560 Training loss: 1.9701 0.4485 sec/batch
Epoch 8/20  Iteration 1304/3560 Training loss: 1.9700 0.4519 sec/batch
Epoch 8/20  Iteration 1305/3560 Training loss: 1.9695 0.4530 sec/batch
Epoch 8/20  Iteration 1306/3560 Training loss: 1.9699 0.4543 sec/batch
Epoch 8/20  Iteration 1307/3560 Training loss: 1.9698 0.4540 sec/batch
Epoch 8/20  Iteration 1308/3560 Training loss: 1.9703 0.4492 sec/batch
...

Epoch 8/20  Iteration 1411/3560 Training loss: 1.9573 0.4556 sec/batch
Epoch 8/20  Iteration 1412/3560 Training loss: 1.9572 0.4506 sec/batch
Epoch 8/20  Iteration 1413/3560 Training loss: 1.9572 0.4476 sec/batch
Epoch 8/20  Iteration 1414/3560 Training loss: 1.9573 0.4589 sec/batch
Epoch 8/20  Iteration 1415/3560 Training loss: 1.9572 0.4521 sec/batch
Epoch 8/20  Iteration 1416/3560 Training loss: 1.9571 0.4502 sec/batch
Epoch 8/20  Iteration 1417/3560 Training loss: 1.9569 0.4466 sec/batch
Epoch 8/20  Iteration 1418/3560 Training loss: 1.9568 0.4479 sec/batch
Epoch 8/20  Iteration 1419/3560 Training loss: 1.9568 0.4468 sec/batch
Epoch 8/20  Iteration 1420/3560 Training loss: 1.9569 0.4502 sec/batch
Epoch 8/20  Iteration 1421/3560 Training loss: 1.9569 0.4478 sec/batch
Epoch 8/20  Iteration 1422/3560 Training loss: 1.9568 0.4537 sec/batch
Epoch 8/20  Iteration 1423/3560 Training loss: 1.9566 0.4673 sec/batch
Epoch 8/20  Iteration 1424/3560 Training loss: 1.9566 0.4539 sec/batch
...

Epoch 9/20  Iteration 1527/3560 Training loss: 1.9225 0.4475 sec/batch
Epoch 9/20  Iteration 1528/3560 Training loss: 1.9224 0.4488 sec/batch
Epoch 9/20  Iteration 1529/3560 Training loss: 1.9221 0.4546 sec/batch
Epoch 9/20  Iteration 1530/3560 Training loss: 1.9220 0.4517 sec/batch
Epoch 9/20  Iteration 1531/3560 Training loss: 1.9219 0.4578 sec/batch
Epoch 9/20  Iteration 1532/3560 Training loss: 1.9219 0.4548 sec/batch
Epoch 9/20  Iteration 1533/3560 Training loss: 1.9220 0.4526 sec/batch
Epoch 9/20  Iteration 1534/3560 Training loss: 1.9219 0.4535 sec/batch
Epoch 9/20  Iteration 1535/3560 Training loss: 1.9218 0.4516 sec/batch
Epoch 9/20  Iteration 1536/3560 Training loss: 1.9217 0.4511 sec/batch
Epoch 9/20  Iteration 1537/3560 Training loss: 1.9216 0.4601 sec/batch
Epoch 9/20  Iteration 1538/3560 Training loss: 1.9213 0.4494 sec/batch
Epoch 9/20  Iteration 1539/3560 Training loss: 1.9212 0.4478 sec/batch
Epoch 9/20  Iteration 1540/3560 Training loss: 1.9208 0.4445 sec/batch
...

Epoch 10/20  Iteration 1642/3560 Training loss: 1.8971 0.4753 sec/batch
Epoch 10/20  Iteration 1643/3560 Training loss: 1.8966 0.4688 sec/batch
Epoch 10/20  Iteration 1644/3560 Training loss: 1.8968 0.4759 sec/batch
Epoch 10/20  Iteration 1645/3560 Training loss: 1.8961 0.4699 sec/batch
Epoch 10/20  Iteration 1646/3560 Training loss: 1.8954 0.4744 sec/batch
Epoch 10/20  Iteration 1647/3560 Training loss: 1.8955 0.4722 sec/batch
Epoch 10/20  Iteration 1648/3560 Training loss: 1.8941 0.4718 sec/batch
Epoch 10/20  Iteration 1649/3560 Training loss: 1.8938 0.4705 sec/batch
Epoch 10/20  Iteration 1650/3560 Training loss: 1.8931 0.4753 sec/batch
Epoch 10/20  Iteration 1651/3560 Training loss: 1.8928 0.4727 sec/batch
Epoch 10/20  Iteration 1652/3560 Training loss: 1.8935 0.4758 sec/batch
Epoch 10/20  Iteration 1653/3560 Training loss: 1.8932 0.4733 sec/batch
Epoch 10/20  Iteration 1654/3560 Training loss: 1.8939 0.4700 sec/batch
Epoch 10/20  Iteration 1655/3560 Training loss: 1.8936 0.4704 sec/batch
...

Epoch 10/20  Iteration 1756/3560 Training loss: 1.8842 0.4814 sec/batch
Epoch 10/20  Iteration 1757/3560 Training loss: 1.8841 0.4749 sec/batch
Epoch 10/20  Iteration 1758/3560 Training loss: 1.8841 0.4743 sec/batch
Epoch 10/20  Iteration 1759/3560 Training loss: 1.8841 0.4772 sec/batch
Epoch 10/20  Iteration 1760/3560 Training loss: 1.8841 0.4681 sec/batch
Epoch 10/20  Iteration 1761/3560 Training loss: 1.8839 0.4846 sec/batch
Epoch 10/20  Iteration 1762/3560 Training loss: 1.8840 0.4786 sec/batch
Epoch 10/20  Iteration 1763/3560 Training loss: 1.8841 0.4680 sec/batch
Epoch 10/20  Iteration 1764/3560 Training loss: 1.8840 0.4666 sec/batch
Epoch 10/20  Iteration 1765/3560 Training loss: 1.8839 0.4764 sec/batch
Epoch 10/20  Iteration 1766/3560 Training loss: 1.8839 0.4784 sec/batch
Epoch 10/20  Iteration 1767/3560 Training loss: 1.8839 0.4794 sec/batch
Epoch 10/20  Iteration 1768/3560 Training loss: 1.8839 0.4762 sec/batch
Epoch 10/20  Iteration 1769/3560 Training loss: 1.8839 0.4760 sec/batch
...

Epoch 11/20  Iteration 1870/3560 Training loss: 1.8601 0.4597 sec/batch
Epoch 11/20  Iteration 1871/3560 Training loss: 1.8599 0.4679 sec/batch
Epoch 11/20  Iteration 1872/3560 Training loss: 1.8597 0.5315 sec/batch
Epoch 11/20  Iteration 1873/3560 Training loss: 1.8593 0.5252 sec/batch
Epoch 11/20  Iteration 1874/3560 Training loss: 1.8589 0.4920 sec/batch
Epoch 11/20  Iteration 1875/3560 Training loss: 1.8587 0.4590 sec/batch
Epoch 11/20  Iteration 1876/3560 Training loss: 1.8585 0.5576 sec/batch
Epoch 11/20  Iteration 1877/3560 Training loss: 1.8584 0.5672 sec/batch
Epoch 11/20  Iteration 1878/3560 Training loss: 1.8580 0.4705 sec/batch
Epoch 11/20  Iteration 1879/3560 Training loss: 1.8575 0.6190 sec/batch
Epoch 11/20  Iteration 1880/3560 Training loss: 1.8571 0.5803 sec/batch
Epoch 11/20  Iteration 1881/3560 Training loss: 1.8570 0.5705 sec/batch
Epoch 11/20  Iteration 1882/3560 Training loss: 1.8568 0.5078 sec/batch
Epoch 11/20  Iteration 1883/3560 Training loss: 1.8566 0.5951 sec/batch
...

Epoch 12/20  Iteration 1984/3560 Training loss: 1.8395 0.4675 sec/batch
Epoch 12/20  Iteration 1985/3560 Training loss: 1.8389 0.4583 sec/batch
Epoch 12/20  Iteration 1986/3560 Training loss: 1.8396 0.4627 sec/batch
Epoch 12/20  Iteration 1987/3560 Training loss: 1.8409 0.4596 sec/batch
Epoch 12/20  Iteration 1988/3560 Training loss: 1.8415 0.5267 sec/batch
Epoch 12/20  Iteration 1989/3560 Training loss: 1.8413 0.4951 sec/batch
Epoch 12/20  Iteration 1990/3560 Training loss: 1.8406 0.4784 sec/batch
Epoch 12/20  Iteration 1991/3560 Training loss: 1.8406 0.4808 sec/batch
Epoch 12/20  Iteration 1992/3560 Training loss: 1.8412 0.4911 sec/batch
Epoch 12/20  Iteration 1993/3560 Training loss: 1.8411 0.4678 sec/batch
Epoch 12/20  Iteration 1994/3560 Training loss: 1.8412 0.4768 sec/batch
Epoch 12/20  Iteration 1995/3560 Training loss: 1.8405 0.4584 sec/batch
Epoch 12/20  Iteration 1996/3560 Training loss: 1.8394 0.4639 sec/batch
Epoch 12/20  Iteration 1997/3560 Training loss: 1.8383 0.4653 sec/batch
...

Epoch 12/20  Iteration 2098/3560 Training loss: 1.8261 0.4496 sec/batch
Epoch 12/20  Iteration 2099/3560 Training loss: 1.8263 0.4789 sec/batch
Epoch 12/20  Iteration 2100/3560 Training loss: 1.8263 0.4485 sec/batch
Epoch 12/20  Iteration 2101/3560 Training loss: 1.8263 0.4531 sec/batch
Epoch 12/20  Iteration 2102/3560 Training loss: 1.8264 0.4553 sec/batch
Epoch 12/20  Iteration 2103/3560 Training loss: 1.8263 0.4553 sec/batch
Epoch 12/20  Iteration 2104/3560 Training loss: 1.8263 0.4521 sec/batch
Epoch 12/20  Iteration 2105/3560 Training loss: 1.8264 0.4522 sec/batch
Epoch 12/20  Iteration 2106/3560 Training loss: 1.8265 0.4484 sec/batch
Epoch 12/20  Iteration 2107/3560 Training loss: 1.8264 0.4467 sec/batch
Epoch 12/20  Iteration 2108/3560 Training loss: 1.8262 0.4500 sec/batch
Epoch 12/20  Iteration 2109/3560 Training loss: 1.8259 0.4504 sec/batch
Epoch 12/20  Iteration 2110/3560 Training loss: 1.8261 0.4603 sec/batch
Epoch 12/20  Iteration 2111/3560 Training loss: 1.8261 0.4487 sec/batch
...

Epoch 13/20  Iteration 2212/3560 Training loss: 1.8105 0.4644 sec/batch
Epoch 13/20  Iteration 2213/3560 Training loss: 1.8106 0.4674 sec/batch
Epoch 13/20  Iteration 2214/3560 Training loss: 1.8106 0.4525 sec/batch
Epoch 13/20  Iteration 2215/3560 Training loss: 1.8102 0.4551 sec/batch
Epoch 13/20  Iteration 2216/3560 Training loss: 1.8101 0.4523 sec/batch
Epoch 13/20  Iteration 2217/3560 Training loss: 1.8096 0.4534 sec/batch
Epoch 13/20  Iteration 2218/3560 Training loss: 1.8096 0.4513 sec/batch
Epoch 13/20  Iteration 2219/3560 Training loss: 1.8091 0.4539 sec/batch
Epoch 13/20  Iteration 2220/3560 Training loss: 1.8090 0.4540 sec/batch
Epoch 13/20  Iteration 2221/3560 Training loss: 1.8086 0.4495 sec/batch
Epoch 13/20  Iteration 2222/3560 Training loss: 1.8083 0.4515 sec/batch
Epoch 13/20  Iteration 2223/3560 Training loss: 1.8082 0.4506 sec/batch
Epoch 13/20  Iteration 2224/3560 Training loss: 1.8081 0.4555 sec/batch
Epoch 13/20  Iteration 2225/3560 Training loss: 1.8076 0.4520 sec/batch
...

Epoch 14/20  Iteration 2326/3560 Training loss: 1.7888 0.4533 sec/batch
Epoch 14/20  Iteration 2327/3560 Training loss: 1.7893 0.4661 sec/batch
Epoch 14/20  Iteration 2328/3560 Training loss: 1.7919 0.4480 sec/batch
Epoch 14/20  Iteration 2329/3560 Training loss: 1.7909 0.4534 sec/batch
Epoch 14/20  Iteration 2330/3560 Training loss: 1.7890 0.4515 sec/batch
Epoch 14/20  Iteration 2331/3560 Training loss: 1.7896 0.4496 sec/batch
Epoch 14/20  Iteration 2332/3560 Training loss: 1.7916 0.4501 sec/batch
Epoch 14/20  Iteration 2333/3560 Training loss: 1.7915 0.4480 sec/batch
Epoch 14/20  Iteration 2334/3560 Training loss: 1.7920 0.4548 sec/batch
Epoch 14/20  Iteration 2335/3560 Training loss: 1.7912 0.4574 sec/batch
Epoch 14/20  Iteration 2336/3560 Training loss: 1.7921 0.4520 sec/batch
Epoch 14/20  Iteration 2337/3560 Training loss: 1.7916 0.4534 sec/batch
Epoch 14/20  Iteration 2338/3560 Training loss: 1.7910 0.4523 sec/batch
Epoch 14/20  Iteration 2339/3560 Training loss: 1.7911 0.4512 se

Epoch 14/20  Iteration 2440/3560 Training loss: 1.7802 0.4705 sec/batch
Epoch 14/20  Iteration 2441/3560 Training loss: 1.7803 0.4581 sec/batch
Epoch 14/20  Iteration 2442/3560 Training loss: 1.7804 0.4587 sec/batch
Epoch 14/20  Iteration 2443/3560 Training loss: 1.7803 0.4644 sec/batch
Epoch 14/20  Iteration 2444/3560 Training loss: 1.7802 0.4644 sec/batch
Epoch 14/20  Iteration 2445/3560 Training loss: 1.7799 0.4627 sec/batch
Epoch 14/20  Iteration 2446/3560 Training loss: 1.7796 0.4625 sec/batch
Epoch 14/20  Iteration 2447/3560 Training loss: 1.7796 0.4646 sec/batch
Epoch 14/20  Iteration 2448/3560 Training loss: 1.7797 0.4609 sec/batch
Epoch 14/20  Iteration 2449/3560 Training loss: 1.7797 0.4727 sec/batch
Epoch 14/20  Iteration 2450/3560 Training loss: 1.7798 0.4562 sec/batch
Epoch 14/20  Iteration 2451/3560 Training loss: 1.7799 0.4576 sec/batch
Epoch 14/20  Iteration 2452/3560 Training loss: 1.7800 0.4530 sec/batch
Epoch 14/20  Iteration 2453/3560 Training loss: 1.7801 0.4648 se

Epoch 15/20  Iteration 2554/3560 Training loss: 1.7685 0.5904 sec/batch
Epoch 15/20  Iteration 2555/3560 Training loss: 1.7689 0.5963 sec/batch
Epoch 15/20  Iteration 2556/3560 Training loss: 1.7692 0.5707 sec/batch
Epoch 15/20  Iteration 2557/3560 Training loss: 1.7689 0.5904 sec/batch
Epoch 15/20  Iteration 2558/3560 Training loss: 1.7692 0.5737 sec/batch
Epoch 15/20  Iteration 2559/3560 Training loss: 1.7694 0.5834 sec/batch
Epoch 15/20  Iteration 2560/3560 Training loss: 1.7690 0.5796 sec/batch
Epoch 15/20  Iteration 2561/3560 Training loss: 1.7691 0.5864 sec/batch
Epoch 15/20  Iteration 2562/3560 Training loss: 1.7690 0.5672 sec/batch
Epoch 15/20  Iteration 2563/3560 Training loss: 1.7695 0.4791 sec/batch
Epoch 15/20  Iteration 2564/3560 Training loss: 1.7695 0.4735 sec/batch
Epoch 15/20  Iteration 2565/3560 Training loss: 1.7701 0.4502 sec/batch
Epoch 15/20  Iteration 2566/3560 Training loss: 1.7699 0.4539 sec/batch
Epoch 15/20  Iteration 2567/3560 Training loss: 1.7700 0.4592 se

Epoch 15/20  Iteration 2668/3560 Training loss: 1.7610 0.6001 sec/batch
Epoch 15/20  Iteration 2669/3560 Training loss: 1.7609 0.6174 sec/batch
Epoch 15/20  Iteration 2670/3560 Training loss: 1.7609 0.4893 sec/batch
Epoch 16/20  Iteration 2671/3560 Training loss: 1.8211 0.4898 sec/batch
Epoch 16/20  Iteration 2672/3560 Training loss: 1.7786 0.4675 sec/batch
Epoch 16/20  Iteration 2673/3560 Training loss: 1.7668 0.4572 sec/batch
Epoch 16/20  Iteration 2674/3560 Training loss: 1.7658 0.4868 sec/batch
Epoch 16/20  Iteration 2675/3560 Training loss: 1.7634 0.4847 sec/batch
Epoch 16/20  Iteration 2676/3560 Training loss: 1.7562 0.4787 sec/batch
Epoch 16/20  Iteration 2677/3560 Training loss: 1.7557 0.4800 sec/batch
Epoch 16/20  Iteration 2678/3560 Training loss: 1.7542 0.4669 sec/batch
Epoch 16/20  Iteration 2679/3560 Training loss: 1.7569 0.4576 sec/batch
Epoch 16/20  Iteration 2680/3560 Training loss: 1.7561 0.4689 sec/batch
Epoch 16/20  Iteration 2681/3560 Training loss: 1.7529 0.4610 se

Epoch 16/20  Iteration 2782/3560 Training loss: 1.7432 0.4864 sec/batch
Epoch 16/20  Iteration 2783/3560 Training loss: 1.7431 0.4539 sec/batch
Epoch 16/20  Iteration 2784/3560 Training loss: 1.7429 0.4603 sec/batch
Epoch 16/20  Iteration 2785/3560 Training loss: 1.7427 0.4588 sec/batch
Epoch 16/20  Iteration 2786/3560 Training loss: 1.7424 0.4500 sec/batch
Epoch 16/20  Iteration 2787/3560 Training loss: 1.7424 0.4582 sec/batch
Epoch 16/20  Iteration 2788/3560 Training loss: 1.7423 0.4591 sec/batch
Epoch 16/20  Iteration 2789/3560 Training loss: 1.7422 0.5123 sec/batch
Epoch 16/20  Iteration 2790/3560 Training loss: 1.7422 0.6882 sec/batch
Epoch 16/20  Iteration 2791/3560 Training loss: 1.7422 0.6339 sec/batch
Epoch 16/20  Iteration 2792/3560 Training loss: 1.7419 0.6109 sec/batch
Epoch 16/20  Iteration 2793/3560 Training loss: 1.7416 0.6227 sec/batch
Epoch 16/20  Iteration 2794/3560 Training loss: 1.7416 0.5924 sec/batch
Epoch 16/20  Iteration 2795/3560 Training loss: 1.7416 0.6717 se

Epoch 17/20  Iteration 2896/3560 Training loss: 1.7286 0.5002 sec/batch
Epoch 17/20  Iteration 2897/3560 Training loss: 1.7284 0.5086 sec/batch
Epoch 17/20  Iteration 2898/3560 Training loss: 1.7291 0.5308 sec/batch
Epoch 17/20  Iteration 2899/3560 Training loss: 1.7290 0.5679 sec/batch
Epoch 17/20  Iteration 2900/3560 Training loss: 1.7299 0.5841 sec/batch
Epoch 17/20  Iteration 2901/3560 Training loss: 1.7300 0.5630 sec/batch
Epoch 17/20  Iteration 2902/3560 Training loss: 1.7303 0.5707 sec/batch
Epoch 17/20  Iteration 2903/3560 Training loss: 1.7301 0.5626 sec/batch
Epoch 17/20  Iteration 2904/3560 Training loss: 1.7302 0.5581 sec/batch
Epoch 17/20  Iteration 2905/3560 Training loss: 1.7305 0.5835 sec/batch
Epoch 17/20  Iteration 2906/3560 Training loss: 1.7303 0.5731 sec/batch
Epoch 17/20  Iteration 2907/3560 Training loss: 1.7299 0.4883 sec/batch
Epoch 17/20  Iteration 2908/3560 Training loss: 1.7305 0.5759 sec/batch
Epoch 17/20  Iteration 2909/3560 Training loss: 1.7304 0.6673 se

Epoch 17/20  Iteration 3010/3560 Training loss: 1.7243 0.4586 sec/batch
Epoch 17/20  Iteration 3011/3560 Training loss: 1.7244 0.4553 sec/batch
Epoch 17/20  Iteration 3012/3560 Training loss: 1.7244 0.4871 sec/batch
Epoch 17/20  Iteration 3013/3560 Training loss: 1.7245 0.5230 sec/batch
Epoch 17/20  Iteration 3014/3560 Training loss: 1.7244 0.4610 sec/batch
Epoch 17/20  Iteration 3015/3560 Training loss: 1.7245 0.4877 sec/batch
Epoch 17/20  Iteration 3016/3560 Training loss: 1.7249 0.4652 sec/batch
Epoch 17/20  Iteration 3017/3560 Training loss: 1.7248 0.4629 sec/batch
Epoch 17/20  Iteration 3018/3560 Training loss: 1.7248 0.5012 sec/batch
Epoch 17/20  Iteration 3019/3560 Training loss: 1.7248 0.4874 sec/batch
Epoch 17/20  Iteration 3020/3560 Training loss: 1.7247 0.4805 sec/batch
Epoch 17/20  Iteration 3021/3560 Training loss: 1.7247 0.5313 sec/batch
Epoch 17/20  Iteration 3022/3560 Training loss: 1.7247 0.7201 sec/batch
Epoch 17/20  Iteration 3023/3560 Training loss: 1.7247 0.8475 se

Epoch 18/20  Iteration 3124/3560 Training loss: 1.7130 0.4589 sec/batch
Epoch 18/20  Iteration 3125/3560 Training loss: 1.7127 0.4595 sec/batch
Epoch 18/20  Iteration 3126/3560 Training loss: 1.7121 0.4864 sec/batch
Epoch 18/20  Iteration 3127/3560 Training loss: 1.7122 0.5628 sec/batch
Epoch 18/20  Iteration 3128/3560 Training loss: 1.7121 0.4632 sec/batch
Epoch 18/20  Iteration 3129/3560 Training loss: 1.7119 0.4723 sec/batch
Epoch 18/20  Iteration 3130/3560 Training loss: 1.7117 0.4579 sec/batch
Epoch 18/20  Iteration 3131/3560 Training loss: 1.7116 0.4668 sec/batch
Epoch 18/20  Iteration 3132/3560 Training loss: 1.7115 0.4576 sec/batch
Epoch 18/20  Iteration 3133/3560 Training loss: 1.7115 0.4609 sec/batch
Epoch 18/20  Iteration 3134/3560 Training loss: 1.7114 0.4592 sec/batch
Epoch 18/20  Iteration 3135/3560 Training loss: 1.7115 0.4922 sec/batch
Epoch 18/20  Iteration 3136/3560 Training loss: 1.7114 0.4723 sec/batch
Epoch 18/20  Iteration 3137/3560 Training loss: 1.7113 0.4794 se

Epoch 19/20  Iteration 3238/3560 Training loss: 1.7059 0.4544 sec/batch
Epoch 19/20  Iteration 3239/3560 Training loss: 1.7059 0.4522 sec/batch
Epoch 19/20  Iteration 3240/3560 Training loss: 1.7062 0.4607 sec/batch
Epoch 19/20  Iteration 3241/3560 Training loss: 1.7055 0.4573 sec/batch
Epoch 19/20  Iteration 3242/3560 Training loss: 1.7042 0.4725 sec/batch
Epoch 19/20  Iteration 3243/3560 Training loss: 1.7028 0.4522 sec/batch
Epoch 19/20  Iteration 3244/3560 Training loss: 1.7023 0.4544 sec/batch
Epoch 19/20  Iteration 3245/3560 Training loss: 1.7016 0.4608 sec/batch
Epoch 19/20  Iteration 3246/3560 Training loss: 1.7019 0.4657 sec/batch
Epoch 19/20  Iteration 3247/3560 Training loss: 1.7013 0.4557 sec/batch
Epoch 19/20  Iteration 3248/3560 Training loss: 1.7004 0.4589 sec/batch
Epoch 19/20  Iteration 3249/3560 Training loss: 1.7010 0.4675 sec/batch
Epoch 19/20  Iteration 3250/3560 Training loss: 1.6999 0.4671 sec/batch
Epoch 19/20  Iteration 3251/3560 Training loss: 1.6996 0.4583 se

Epoch 19/20  Iteration 3352/3560 Training loss: 1.6941 0.4787 sec/batch
Epoch 19/20  Iteration 3353/3560 Training loss: 1.6941 0.4801 sec/batch
Epoch 19/20  Iteration 3354/3560 Training loss: 1.6940 0.4814 sec/batch
Epoch 19/20  Iteration 3355/3560 Training loss: 1.6936 0.4754 sec/batch
Epoch 19/20  Iteration 3356/3560 Training loss: 1.6937 0.4862 sec/batch
Epoch 19/20  Iteration 3357/3560 Training loss: 1.6938 0.5380 sec/batch
Epoch 19/20  Iteration 3358/3560 Training loss: 1.6939 0.5107 sec/batch
Epoch 19/20  Iteration 3359/3560 Training loss: 1.6939 0.5072 sec/batch
Epoch 19/20  Iteration 3360/3560 Training loss: 1.6939 0.5070 sec/batch
Epoch 19/20  Iteration 3361/3560 Training loss: 1.6940 0.5237 sec/batch
Epoch 19/20  Iteration 3362/3560 Training loss: 1.6940 0.4916 sec/batch
Epoch 19/20  Iteration 3363/3560 Training loss: 1.6938 0.4827 sec/batch
Epoch 19/20  Iteration 3364/3560 Training loss: 1.6939 0.4730 sec/batch
Epoch 19/20  Iteration 3365/3560 Training loss: 1.6941 0.4649 se

Epoch 20/20  Iteration 3466/3560 Training loss: 1.6872 0.5095 sec/batch
Epoch 20/20  Iteration 3467/3560 Training loss: 1.6868 0.5696 sec/batch
Epoch 20/20  Iteration 3468/3560 Training loss: 1.6865 0.5496 sec/batch
Epoch 20/20  Iteration 3469/3560 Training loss: 1.6864 0.5510 sec/batch
Epoch 20/20  Iteration 3470/3560 Training loss: 1.6860 0.5172 sec/batch
Epoch 20/20  Iteration 3471/3560 Training loss: 1.6855 0.5787 sec/batch
Epoch 20/20  Iteration 3472/3560 Training loss: 1.6855 0.5036 sec/batch
Epoch 20/20  Iteration 3473/3560 Training loss: 1.6851 0.5209 sec/batch
Epoch 20/20  Iteration 3474/3560 Training loss: 1.6849 0.5540 sec/batch
Epoch 20/20  Iteration 3475/3560 Training loss: 1.6845 0.5107 sec/batch
Epoch 20/20  Iteration 3476/3560 Training loss: 1.6842 0.5752 sec/batch
Epoch 20/20  Iteration 3477/3560 Training loss: 1.6839 0.5805 sec/batch
Epoch 20/20  Iteration 3478/3560 Training loss: 1.6840 0.5109 sec/batch
Epoch 20/20  Iteration 3479/3560 Training loss: 1.6839 0.5906 se

Epoch 1/20  Iteration 21/3560 Training loss: 3.8279 0.4794 sec/batch
Epoch 1/20  Iteration 22/3560 Training loss: 3.8049 0.5150 sec/batch
Epoch 1/20  Iteration 23/3560 Training loss: 3.7836 0.4951 sec/batch
Epoch 1/20  Iteration 24/3560 Training loss: 3.7641 0.5508 sec/batch
Epoch 1/20  Iteration 25/3560 Training loss: 3.7452 0.5288 sec/batch
Epoch 1/20  Iteration 26/3560 Training loss: 3.7278 0.5289 sec/batch
Epoch 1/20  Iteration 27/3560 Training loss: 3.7116 0.5546 sec/batch
Epoch 1/20  Iteration 28/3560 Training loss: 3.6958 0.5862 sec/batch
Epoch 1/20  Iteration 29/3560 Training loss: 3.6812 0.5995 sec/batch
Epoch 1/20  Iteration 30/3560 Training loss: 3.6672 0.5236 sec/batch
Epoch 1/20  Iteration 31/3560 Training loss: 3.6551 0.5059 sec/batch
Epoch 1/20  Iteration 32/3560 Training loss: 3.6423 0.5369 sec/batch
Epoch 1/20  Iteration 33/3560 Training loss: 3.6300 0.5111 sec/batch
Epoch 1/20  Iteration 34/3560 Training loss: 3.6188 0.5012 sec/batch

Epoch 1/20  Iteration 140/3560 Training loss: 3.2433 0.5035 sec/batch
Epoch 1/20  Iteration 141/3560 Training loss: 3.2418 0.5320 sec/batch
Epoch 1/20  Iteration 142/3560 Training loss: 3.2400 0.5478 sec/batch
Epoch 1/20  Iteration 143/3560 Training loss: 3.2384 0.6675 sec/batch
Epoch 1/20  Iteration 144/3560 Training loss: 3.2367 0.6326 sec/batch
Epoch 1/20  Iteration 145/3560 Training loss: 3.2351 0.4888 sec/batch
Epoch 1/20  Iteration 146/3560 Training loss: 3.2336 0.4848 sec/batch
Epoch 1/20  Iteration 147/3560 Training loss: 3.2321 0.4867 sec/batch
Epoch 1/20  Iteration 148/3560 Training loss: 3.2307 0.4778 sec/batch
Epoch 1/20  Iteration 149/3560 Training loss: 3.2292 0.4631 sec/batch
Epoch 1/20  Iteration 150/3560 Training loss: 3.2276 0.4793 sec/batch
Epoch 1/20  Iteration 151/3560 Training loss: 3.2262 0.4887 sec/batch
Epoch 1/20  Iteration 152/3560 Training loss: 3.2248 0.4758 sec/batch
Epoch 1/20  Iteration 153/3560 Training loss: 3.2233 0.4662 sec/batch

Epoch 2/20  Iteration 258/3560 Training loss: 2.7747 0.4798 sec/batch
Epoch 2/20  Iteration 259/3560 Training loss: 2.7732 0.4594 sec/batch
Epoch 2/20  Iteration 260/3560 Training loss: 2.7719 0.4820 sec/batch
Epoch 2/20  Iteration 261/3560 Training loss: 2.7706 0.4802 sec/batch
Epoch 2/20  Iteration 262/3560 Training loss: 2.7691 0.4594 sec/batch
Epoch 2/20  Iteration 263/3560 Training loss: 2.7675 0.4577 sec/batch
Epoch 2/20  Iteration 264/3560 Training loss: 2.7659 0.4617 sec/batch
Epoch 2/20  Iteration 265/3560 Training loss: 2.7645 0.4562 sec/batch
Epoch 2/20  Iteration 266/3560 Training loss: 2.7631 0.4595 sec/batch
Epoch 2/20  Iteration 267/3560 Training loss: 2.7617 0.4577 sec/batch
Epoch 2/20  Iteration 268/3560 Training loss: 2.7605 0.4638 sec/batch
Epoch 2/20  Iteration 269/3560 Training loss: 2.7591 0.4623 sec/batch
Epoch 2/20  Iteration 270/3560 Training loss: 2.7579 0.4677 sec/batch
Epoch 2/20  Iteration 271/3560 Training loss: 2.7566 0.4639 sec/batch

Epoch 3/20  Iteration 376/3560 Training loss: 2.5030 0.4782 sec/batch
Epoch 3/20  Iteration 377/3560 Training loss: 2.5022 0.4909 sec/batch
Epoch 3/20  Iteration 378/3560 Training loss: 2.5030 0.4709 sec/batch
Epoch 3/20  Iteration 379/3560 Training loss: 2.5025 0.4864 sec/batch
Epoch 3/20  Iteration 380/3560 Training loss: 2.5017 0.5340 sec/batch
Epoch 3/20  Iteration 381/3560 Training loss: 2.5009 0.4747 sec/batch
Epoch 3/20  Iteration 382/3560 Training loss: 2.5005 0.4639 sec/batch
Epoch 3/20  Iteration 383/3560 Training loss: 2.4999 0.4706 sec/batch
Epoch 3/20  Iteration 384/3560 Training loss: 2.4992 0.4621 sec/batch
Epoch 3/20  Iteration 385/3560 Training loss: 2.4991 0.4673 sec/batch
Epoch 3/20  Iteration 386/3560 Training loss: 2.4986 0.4738 sec/batch
Epoch 3/20  Iteration 387/3560 Training loss: 2.4988 0.5032 sec/batch
Epoch 3/20  Iteration 388/3560 Training loss: 2.4981 0.4693 sec/batch
Epoch 3/20  Iteration 389/3560 Training loss: 2.4971 0.4723 sec/batch

Epoch 3/20  Iteration 494/3560 Training loss: 2.4474 0.5206 sec/batch
Epoch 3/20  Iteration 495/3560 Training loss: 2.4472 0.4741 sec/batch
Epoch 3/20  Iteration 496/3560 Training loss: 2.4469 0.5350 sec/batch
Epoch 3/20  Iteration 497/3560 Training loss: 2.4467 0.5042 sec/batch
Epoch 3/20  Iteration 498/3560 Training loss: 2.4463 0.5350 sec/batch
Epoch 3/20  Iteration 499/3560 Training loss: 2.4460 0.4850 sec/batch
Epoch 3/20  Iteration 500/3560 Training loss: 2.4457 0.5157 sec/batch
Epoch 3/20  Iteration 501/3560 Training loss: 2.4453 0.5270 sec/batch
Epoch 3/20  Iteration 502/3560 Training loss: 2.4453 0.5114 sec/batch
Epoch 3/20  Iteration 503/3560 Training loss: 2.4449 0.5086 sec/batch
Epoch 3/20  Iteration 504/3560 Training loss: 2.4448 0.4884 sec/batch
Epoch 3/20  Iteration 505/3560 Training loss: 2.4444 0.4916 sec/batch
Epoch 3/20  Iteration 506/3560 Training loss: 2.4441 0.5032 sec/batch
Epoch 3/20  Iteration 507/3560 Training loss: 2.4440 0.4974 sec/batch

Epoch 4/20  Iteration 612/3560 Training loss: 2.3630 0.4611 sec/batch
Epoch 4/20  Iteration 613/3560 Training loss: 2.3625 0.4645 sec/batch
Epoch 4/20  Iteration 614/3560 Training loss: 2.3622 0.4647 sec/batch
Epoch 4/20  Iteration 615/3560 Training loss: 2.3618 0.4600 sec/batch
Epoch 4/20  Iteration 616/3560 Training loss: 2.3618 0.4627 sec/batch
Epoch 4/20  Iteration 617/3560 Training loss: 2.3615 0.4591 sec/batch
Epoch 4/20  Iteration 618/3560 Training loss: 2.3611 0.4712 sec/batch
Epoch 4/20  Iteration 619/3560 Training loss: 2.3604 0.4651 sec/batch
Epoch 4/20  Iteration 620/3560 Training loss: 2.3600 0.4702 sec/batch
Epoch 4/20  Iteration 621/3560 Training loss: 2.3598 0.4659 sec/batch
Epoch 4/20  Iteration 622/3560 Training loss: 2.3594 0.4559 sec/batch
Epoch 4/20  Iteration 623/3560 Training loss: 2.3590 0.4663 sec/batch
Epoch 4/20  Iteration 624/3560 Training loss: 2.3588 0.4584 sec/batch
Epoch 4/20  Iteration 625/3560 Training loss: 2.3585 0.4593 sec/batch

Epoch 5/20  Iteration 730/3560 Training loss: 2.3081 0.5069 sec/batch
Epoch 5/20  Iteration 731/3560 Training loss: 2.3083 0.5135 sec/batch
Epoch 5/20  Iteration 732/3560 Training loss: 2.3074 0.6046 sec/batch
Epoch 5/20  Iteration 733/3560 Training loss: 2.3071 0.5582 sec/batch
Epoch 5/20  Iteration 734/3560 Training loss: 2.3080 0.6398 sec/batch
Epoch 5/20  Iteration 735/3560 Training loss: 2.3080 0.5945 sec/batch
Epoch 5/20  Iteration 736/3560 Training loss: 2.3071 0.5885 sec/batch
Epoch 5/20  Iteration 737/3560 Training loss: 2.3066 0.5526 sec/batch
Epoch 5/20  Iteration 738/3560 Training loss: 2.3062 0.5505 sec/batch
Epoch 5/20  Iteration 739/3560 Training loss: 2.3058 0.5487 sec/batch
Epoch 5/20  Iteration 740/3560 Training loss: 2.3058 0.5612 sec/batch
Epoch 5/20  Iteration 741/3560 Training loss: 2.3062 0.6041 sec/batch
Epoch 5/20  Iteration 742/3560 Training loss: 2.3061 0.5914 sec/batch
Epoch 5/20  Iteration 743/3560 Training loss: 2.3063 0.6571 sec/batch

Epoch 5/20  Iteration 848/3560 Training loss: 2.2769 0.4640 sec/batch
Epoch 5/20  Iteration 849/3560 Training loss: 2.2768 0.4595 sec/batch
Epoch 5/20  Iteration 850/3560 Training loss: 2.2767 0.4607 sec/batch
Epoch 5/20  Iteration 851/3560 Training loss: 2.2767 0.4694 sec/batch
Epoch 5/20  Iteration 852/3560 Training loss: 2.2765 0.4556 sec/batch
Epoch 5/20  Iteration 853/3560 Training loss: 2.2765 0.4607 sec/batch
Epoch 5/20  Iteration 854/3560 Training loss: 2.2763 0.4705 sec/batch
Epoch 5/20  Iteration 855/3560 Training loss: 2.2762 0.5010 sec/batch
Epoch 5/20  Iteration 856/3560 Training loss: 2.2760 0.5254 sec/batch
Epoch 5/20  Iteration 857/3560 Training loss: 2.2757 0.5112 sec/batch
Epoch 5/20  Iteration 858/3560 Training loss: 2.2757 0.4742 sec/batch
Epoch 5/20  Iteration 859/3560 Training loss: 2.2756 0.4671 sec/batch
Epoch 5/20  Iteration 860/3560 Training loss: 2.2756 0.4732 sec/batch
Epoch 5/20  Iteration 861/3560 Training loss: 2.2754 0.4706 sec/batch

Epoch 6/20  Iteration 966/3560 Training loss: 2.2293 0.4851 sec/batch
Epoch 6/20  Iteration 967/3560 Training loss: 2.2290 0.4835 sec/batch
Epoch 6/20  Iteration 968/3560 Training loss: 2.2290 0.4852 sec/batch
Epoch 6/20  Iteration 969/3560 Training loss: 2.2285 0.4819 sec/batch
Epoch 6/20  Iteration 970/3560 Training loss: 2.2283 0.4847 sec/batch
Epoch 6/20  Iteration 971/3560 Training loss: 2.2277 0.4890 sec/batch
Epoch 6/20  Iteration 972/3560 Training loss: 2.2278 0.4808 sec/batch
Epoch 6/20  Iteration 973/3560 Training loss: 2.2275 0.4626 sec/batch
Epoch 6/20  Iteration 974/3560 Training loss: 2.2271 0.4598 sec/batch
Epoch 6/20  Iteration 975/3560 Training loss: 2.2264 0.4595 sec/batch
Epoch 6/20  Iteration 976/3560 Training loss: 2.2261 0.4759 sec/batch
Epoch 6/20  Iteration 977/3560 Training loss: 2.2259 0.4765 sec/batch
Epoch 6/20  Iteration 978/3560 Training loss: 2.2256 0.4714 sec/batch
Epoch 6/20  Iteration 979/3560 Training loss: 2.2252 0.4574 sec/batch

Epoch 7/20  Iteration 1082/3560 Training loss: 2.1874 0.4701 sec/batch
Epoch 7/20  Iteration 1083/3560 Training loss: 2.1871 0.5345 sec/batch
Epoch 7/20  Iteration 1084/3560 Training loss: 2.1866 0.6066 sec/batch
Epoch 7/20  Iteration 1085/3560 Training loss: 2.1863 0.5128 sec/batch
Epoch 7/20  Iteration 1086/3560 Training loss: 2.1878 0.5960 sec/batch
Epoch 7/20  Iteration 1087/3560 Training loss: 2.1886 0.6259 sec/batch
Epoch 7/20  Iteration 1088/3560 Training loss: 2.1880 0.6271 sec/batch
Epoch 7/20  Iteration 1089/3560 Training loss: 2.1874 0.6074 sec/batch
Epoch 7/20  Iteration 1090/3560 Training loss: 2.1887 0.6203 sec/batch
Epoch 7/20  Iteration 1091/3560 Training loss: 2.1887 0.6230 sec/batch
Epoch 7/20  Iteration 1092/3560 Training loss: 2.1879 0.6280 sec/batch
Epoch 7/20  Iteration 1093/3560 Training loss: 2.1875 0.6201 sec/batch
Epoch 7/20  Iteration 1094/3560 Training loss: 2.1869 0.5204 sec/batch
Epoch 7/20  Iteration 1095/3560 Training loss: 2.1864 0.4864 sec/batch

Epoch 7/20  Iteration 1198/3560 Training loss: 2.1650 0.4655 sec/batch
Epoch 7/20  Iteration 1199/3560 Training loss: 2.1648 0.5326 sec/batch
Epoch 7/20  Iteration 1200/3560 Training loss: 2.1644 0.4986 sec/batch
Epoch 7/20  Iteration 1201/3560 Training loss: 2.1644 0.5262 sec/batch
Epoch 7/20  Iteration 1202/3560 Training loss: 2.1643 0.4653 sec/batch
Epoch 7/20  Iteration 1203/3560 Training loss: 2.1642 0.4692 sec/batch
Epoch 7/20  Iteration 1204/3560 Training loss: 2.1642 0.4658 sec/batch
Epoch 7/20  Iteration 1205/3560 Training loss: 2.1641 0.5070 sec/batch
Epoch 7/20  Iteration 1206/3560 Training loss: 2.1641 0.4562 sec/batch
Epoch 7/20  Iteration 1207/3560 Training loss: 2.1642 0.4751 sec/batch
Epoch 7/20  Iteration 1208/3560 Training loss: 2.1640 0.4623 sec/batch
Epoch 7/20  Iteration 1209/3560 Training loss: 2.1640 0.4703 sec/batch
Epoch 7/20  Iteration 1210/3560 Training loss: 2.1638 0.4629 sec/batch
Epoch 7/20  Iteration 1211/3560 Training loss: 2.1637 0.4792 sec/batch

Epoch 8/20  Iteration 1314/3560 Training loss: 2.1297 0.4751 sec/batch
Epoch 8/20  Iteration 1315/3560 Training loss: 2.1294 0.4761 sec/batch
Epoch 8/20  Iteration 1316/3560 Training loss: 2.1293 0.4967 sec/batch
Epoch 8/20  Iteration 1317/3560 Training loss: 2.1293 0.5085 sec/batch
Epoch 8/20  Iteration 1318/3560 Training loss: 2.1294 0.5061 sec/batch
Epoch 8/20  Iteration 1319/3560 Training loss: 2.1296 0.4905 sec/batch
Epoch 8/20  Iteration 1320/3560 Training loss: 2.1292 0.4837 sec/batch
Epoch 8/20  Iteration 1321/3560 Training loss: 2.1290 0.4991 sec/batch
Epoch 8/20  Iteration 1322/3560 Training loss: 2.1294 0.4972 sec/batch
Epoch 8/20  Iteration 1323/3560 Training loss: 2.1293 0.5101 sec/batch
Epoch 8/20  Iteration 1324/3560 Training loss: 2.1293 0.5136 sec/batch
Epoch 8/20  Iteration 1325/3560 Training loss: 2.1288 0.4956 sec/batch
Epoch 8/20  Iteration 1326/3560 Training loss: 2.1287 0.4958 sec/batch
Epoch 8/20  Iteration 1327/3560 Training loss: 2.1282 0.5094 sec/batch

Epoch 9/20  Iteration 1430/3560 Training loss: 2.0953 0.5285 sec/batch
Epoch 9/20  Iteration 1431/3560 Training loss: 2.0960 0.7053 sec/batch
Epoch 9/20  Iteration 1432/3560 Training loss: 2.0964 0.6757 sec/batch
Epoch 9/20  Iteration 1433/3560 Training loss: 2.0984 0.6262 sec/batch
Epoch 9/20  Iteration 1434/3560 Training loss: 2.0984 0.5240 sec/batch
Epoch 9/20  Iteration 1435/3560 Training loss: 2.0960 0.4921 sec/batch
Epoch 9/20  Iteration 1436/3560 Training loss: 2.0941 0.4595 sec/batch
Epoch 9/20  Iteration 1437/3560 Training loss: 2.0944 0.4666 sec/batch
Epoch 9/20  Iteration 1438/3560 Training loss: 2.0961 0.4561 sec/batch
Epoch 9/20  Iteration 1439/3560 Training loss: 2.0955 0.4637 sec/batch
Epoch 9/20  Iteration 1440/3560 Training loss: 2.0942 0.4588 sec/batch
Epoch 9/20  Iteration 1441/3560 Training loss: 2.0943 0.4647 sec/batch
Epoch 9/20  Iteration 1442/3560 Training loss: 2.0963 0.4573 sec/batch
Epoch 9/20  Iteration 1443/3560 Training loss: 2.0964 0.5037 sec/batch

Epoch 9/20  Iteration 1546/3560 Training loss: 2.0780 0.4823 sec/batch
Epoch 9/20  Iteration 1547/3560 Training loss: 2.0778 0.4648 sec/batch
Epoch 9/20  Iteration 1548/3560 Training loss: 2.0778 0.4614 sec/batch
Epoch 9/20  Iteration 1549/3560 Training loss: 2.0778 0.4840 sec/batch
Epoch 9/20  Iteration 1550/3560 Training loss: 2.0774 0.4647 sec/batch
Epoch 9/20  Iteration 1551/3560 Training loss: 2.0774 0.4628 sec/batch
Epoch 9/20  Iteration 1552/3560 Training loss: 2.0774 0.4642 sec/batch
Epoch 9/20  Iteration 1553/3560 Training loss: 2.0775 0.4781 sec/batch
Epoch 9/20  Iteration 1554/3560 Training loss: 2.0774 0.4600 sec/batch
Epoch 9/20  Iteration 1555/3560 Training loss: 2.0772 0.4715 sec/batch
Epoch 9/20  Iteration 1556/3560 Training loss: 2.0769 0.4650 sec/batch
Epoch 9/20  Iteration 1557/3560 Training loss: 2.0769 0.4649 sec/batch
Epoch 9/20  Iteration 1558/3560 Training loss: 2.0769 0.4626 sec/batch
Epoch 9/20  Iteration 1559/3560 Training loss: 2.0768 0.4692 sec/batch

Epoch 10/20  Iteration 1661/3560 Training loss: 2.0496 0.4585 sec/batch
Epoch 10/20  Iteration 1662/3560 Training loss: 2.0500 0.4650 sec/batch
Epoch 10/20  Iteration 1663/3560 Training loss: 2.0497 0.4701 sec/batch
Epoch 10/20  Iteration 1664/3560 Training loss: 2.0501 0.4626 sec/batch
Epoch 10/20  Iteration 1665/3560 Training loss: 2.0504 0.4610 sec/batch
Epoch 10/20  Iteration 1666/3560 Training loss: 2.0503 0.4574 sec/batch
Epoch 10/20  Iteration 1667/3560 Training loss: 2.0501 0.4697 sec/batch
Epoch 10/20  Iteration 1668/3560 Training loss: 2.0504 0.4656 sec/batch
Epoch 10/20  Iteration 1669/3560 Training loss: 2.0504 0.4596 sec/batch
Epoch 10/20  Iteration 1670/3560 Training loss: 2.0499 0.4605 sec/batch
Epoch 10/20  Iteration 1671/3560 Training loss: 2.0497 0.4789 sec/batch
Epoch 10/20  Iteration 1672/3560 Training loss: 2.0496 0.4633 sec/batch
Epoch 10/20  Iteration 1673/3560 Training loss: 2.0499 0.4717 sec/batch
Epoch 10/20  Iteration 1674/3560 Training loss: 2.0500 0.4584 se

Epoch 10/20  Iteration 1775/3560 Training loss: 2.0385 0.4630 sec/batch
Epoch 10/20  Iteration 1776/3560 Training loss: 2.0385 0.4619 sec/batch
Epoch 10/20  Iteration 1777/3560 Training loss: 2.0385 0.4639 sec/batch
Epoch 10/20  Iteration 1778/3560 Training loss: 2.0385 0.4617 sec/batch
Epoch 10/20  Iteration 1779/3560 Training loss: 2.0383 0.4631 sec/batch
Epoch 10/20  Iteration 1780/3560 Training loss: 2.0382 0.4698 sec/batch
Epoch 11/20  Iteration 1781/3560 Training loss: 2.0918 0.4590 sec/batch
Epoch 11/20  Iteration 1782/3560 Training loss: 2.0456 0.4663 sec/batch
Epoch 11/20  Iteration 1783/3560 Training loss: 2.0342 0.4624 sec/batch
Epoch 11/20  Iteration 1784/3560 Training loss: 2.0281 0.4634 sec/batch
Epoch 11/20  Iteration 1785/3560 Training loss: 2.0246 0.4614 sec/batch
Epoch 11/20  Iteration 1786/3560 Training loss: 2.0174 0.4599 sec/batch
Epoch 11/20  Iteration 1787/3560 Training loss: 2.0184 0.4559 sec/batch
Epoch 11/20  Iteration 1788/3560 Training loss: 2.0181 0.4567 se

Epoch 11/20  Iteration 1889/3560 Training loss: 2.0082 0.4674 sec/batch
Epoch 11/20  Iteration 1890/3560 Training loss: 2.0082 0.4641 sec/batch
Epoch 11/20  Iteration 1891/3560 Training loss: 2.0081 0.4709 sec/batch
Epoch 11/20  Iteration 1892/3560 Training loss: 2.0081 0.4669 sec/batch
Epoch 11/20  Iteration 1893/3560 Training loss: 2.0079 0.4646 sec/batch
Epoch 11/20  Iteration 1894/3560 Training loss: 2.0077 0.4693 sec/batch
Epoch 11/20  Iteration 1895/3560 Training loss: 2.0074 0.4664 sec/batch
Epoch 11/20  Iteration 1896/3560 Training loss: 2.0071 0.4658 sec/batch
Epoch 11/20  Iteration 1897/3560 Training loss: 2.0071 0.4699 sec/batch
Epoch 11/20  Iteration 1898/3560 Training loss: 2.0070 0.5396 sec/batch
Epoch 11/20  Iteration 1899/3560 Training loss: 2.0071 0.5001 sec/batch
Epoch 11/20  Iteration 1900/3560 Training loss: 2.0071 0.5270 sec/batch
Epoch 11/20  Iteration 1901/3560 Training loss: 2.0071 0.4941 sec/batch
Epoch 11/20  Iteration 1902/3560 Training loss: 2.0068 0.6631 se

Epoch 12/20  Iteration 2003/3560 Training loss: 1.9841 0.4671 sec/batch
Epoch 12/20  Iteration 2004/3560 Training loss: 1.9829 0.4586 sec/batch
Epoch 12/20  Iteration 2005/3560 Training loss: 1.9830 0.4596 sec/batch
Epoch 12/20  Iteration 2006/3560 Training loss: 1.9827 0.4585 sec/batch
Epoch 12/20  Iteration 2007/3560 Training loss: 1.9826 0.4826 sec/batch
Epoch 12/20  Iteration 2008/3560 Training loss: 1.9832 0.5372 sec/batch
Epoch 12/20  Iteration 2009/3560 Training loss: 1.9825 0.5604 sec/batch
Epoch 12/20  Iteration 2010/3560 Training loss: 1.9834 0.4639 sec/batch
Epoch 12/20  Iteration 2011/3560 Training loss: 1.9832 0.4865 sec/batch
Epoch 12/20  Iteration 2012/3560 Training loss: 1.9830 0.4707 sec/batch
Epoch 12/20  Iteration 2013/3560 Training loss: 1.9827 0.4608 sec/batch
Epoch 12/20  Iteration 2014/3560 Training loss: 1.9829 0.4696 sec/batch
Epoch 12/20  Iteration 2015/3560 Training loss: 1.9830 0.4635 sec/batch
Epoch 12/20  Iteration 2016/3560 Training loss: 1.9826 0.4966 se

Epoch 12/20  Iteration 2117/3560 Training loss: 1.9747 0.4852 sec/batch
Epoch 12/20  Iteration 2118/3560 Training loss: 1.9749 0.5158 sec/batch
Epoch 12/20  Iteration 2119/3560 Training loss: 1.9750 0.5309 sec/batch
Epoch 12/20  Iteration 2120/3560 Training loss: 1.9749 0.5064 sec/batch
Epoch 12/20  Iteration 2121/3560 Training loss: 1.9750 0.5065 sec/batch
Epoch 12/20  Iteration 2122/3560 Training loss: 1.9750 0.5105 sec/batch
Epoch 12/20  Iteration 2123/3560 Training loss: 1.9750 0.4667 sec/batch
Epoch 12/20  Iteration 2124/3560 Training loss: 1.9749 0.5384 sec/batch
Epoch 12/20  Iteration 2125/3560 Training loss: 1.9749 0.5093 sec/batch
Epoch 12/20  Iteration 2126/3560 Training loss: 1.9752 0.5279 sec/batch
Epoch 12/20  Iteration 2127/3560 Training loss: 1.9751 0.4605 sec/batch
Epoch 12/20  Iteration 2128/3560 Training loss: 1.9751 0.4928 sec/batch
Epoch 12/20  Iteration 2129/3560 Training loss: 1.9750 0.4744 sec/batch
Epoch 12/20  Iteration 2130/3560 Training loss: 1.9749 0.4779 se

Epoch 13/20  Iteration 2231/3560 Training loss: 1.9555 0.4847 sec/batch
Epoch 13/20  Iteration 2232/3560 Training loss: 1.9555 0.4648 sec/batch
Epoch 13/20  Iteration 2233/3560 Training loss: 1.9554 0.5261 sec/batch
Epoch 13/20  Iteration 2234/3560 Training loss: 1.9550 0.4829 sec/batch
Epoch 13/20  Iteration 2235/3560 Training loss: 1.9547 0.4641 sec/batch
Epoch 13/20  Iteration 2236/3560 Training loss: 1.9542 0.5101 sec/batch
Epoch 13/20  Iteration 2237/3560 Training loss: 1.9541 0.5166 sec/batch
Epoch 13/20  Iteration 2238/3560 Training loss: 1.9540 0.4856 sec/batch
Epoch 13/20  Iteration 2239/3560 Training loss: 1.9537 0.4710 sec/batch
Epoch 13/20  Iteration 2240/3560 Training loss: 1.9536 0.4706 sec/batch
Epoch 13/20  Iteration 2241/3560 Training loss: 1.9534 0.5261 sec/batch
Epoch 13/20  Iteration 2242/3560 Training loss: 1.9534 0.5288 sec/batch
Epoch 13/20  Iteration 2243/3560 Training loss: 1.9533 0.4737 sec/batch
Epoch 13/20  Iteration 2244/3560 Training loss: 1.9534 0.4712 se

Epoch 14/20  Iteration 2345/3560 Training loss: 1.9392 0.4910 sec/batch
Epoch 14/20  Iteration 2346/3560 Training loss: 1.9386 0.4979 sec/batch
Epoch 14/20  Iteration 2347/3560 Training loss: 1.9385 0.5136 sec/batch
Epoch 14/20  Iteration 2348/3560 Training loss: 1.9391 0.4784 sec/batch
Epoch 14/20  Iteration 2349/3560 Training loss: 1.9385 0.5002 sec/batch
Epoch 14/20  Iteration 2350/3560 Training loss: 1.9384 0.4790 sec/batch
Epoch 14/20  Iteration 2351/3560 Training loss: 1.9379 0.4782 sec/batch
Epoch 14/20  Iteration 2352/3560 Training loss: 1.9369 0.4813 sec/batch
Epoch 14/20  Iteration 2353/3560 Training loss: 1.9360 0.4737 sec/batch
Epoch 14/20  Iteration 2354/3560 Training loss: 1.9354 0.4813 sec/batch
Epoch 14/20  Iteration 2355/3560 Training loss: 1.9350 0.4771 sec/batch
Epoch 14/20  Iteration 2356/3560 Training loss: 1.9351 0.4722 sec/batch
Epoch 14/20  Iteration 2357/3560 Training loss: 1.9346 0.4795 sec/batch
Epoch 14/20  Iteration 2358/3560 Training loss: 1.9340 0.4684 se

Epoch 14/20  Iteration 2459/3560 Training loss: 1.9254 0.5343 sec/batch
Epoch 14/20  Iteration 2460/3560 Training loss: 1.9255 0.5633 sec/batch
Epoch 14/20  Iteration 2461/3560 Training loss: 1.9255 0.5754 sec/batch
Epoch 14/20  Iteration 2462/3560 Training loss: 1.9256 0.5184 sec/batch
Epoch 14/20  Iteration 2463/3560 Training loss: 1.9255 0.4853 sec/batch
Epoch 14/20  Iteration 2464/3560 Training loss: 1.9254 0.5725 sec/batch
Epoch 14/20  Iteration 2465/3560 Training loss: 1.9252 0.5328 sec/batch
Epoch 14/20  Iteration 2466/3560 Training loss: 1.9254 0.5056 sec/batch
Epoch 14/20  Iteration 2467/3560 Training loss: 1.9254 0.5199 sec/batch
Epoch 14/20  Iteration 2468/3560 Training loss: 1.9254 0.6084 sec/batch
Epoch 14/20  Iteration 2469/3560 Training loss: 1.9253 0.6651 sec/batch
Epoch 14/20  Iteration 2470/3560 Training loss: 1.9253 0.6107 sec/batch
Epoch 14/20  Iteration 2471/3560 Training loss: 1.9253 0.5483 sec/batch
Epoch 14/20  Iteration 2472/3560 Training loss: 1.9253 0.5669 se

Epoch 15/20  Iteration 2573/3560 Training loss: 1.9118 0.5041 sec/batch
Epoch 15/20  Iteration 2574/3560 Training loss: 1.9119 0.5192 sec/batch
Epoch 15/20  Iteration 2575/3560 Training loss: 1.9115 0.6051 sec/batch
Epoch 15/20  Iteration 2576/3560 Training loss: 1.9115 0.5566 sec/batch
Epoch 15/20  Iteration 2577/3560 Training loss: 1.9110 0.4600 sec/batch
Epoch 15/20  Iteration 2578/3560 Training loss: 1.9106 0.4780 sec/batch
Epoch 15/20  Iteration 2579/3560 Training loss: 1.9105 0.5555 sec/batch
Epoch 15/20  Iteration 2580/3560 Training loss: 1.9103 0.5439 sec/batch
Epoch 15/20  Iteration 2581/3560 Training loss: 1.9099 0.5348 sec/batch
Epoch 15/20  Iteration 2582/3560 Training loss: 1.9099 0.5005 sec/batch
Epoch 15/20  Iteration 2583/3560 Training loss: 1.9096 0.5028 sec/batch
Epoch 15/20  Iteration 2584/3560 Training loss: 1.9095 0.4735 sec/batch
Epoch 15/20  Iteration 2585/3560 Training loss: 1.9091 0.4757 sec/batch
Epoch 15/20  Iteration 2586/3560 Training loss: 1.9088 0.4987 se

Epoch 16/20  Iteration 2687/3560 Training loss: 1.8947 0.5576 sec/batch
Epoch 16/20  Iteration 2688/3560 Training loss: 1.8970 0.6547 sec/batch
Epoch 16/20  Iteration 2689/3560 Training loss: 1.8970 0.5790 sec/batch
Epoch 16/20  Iteration 2690/3560 Training loss: 1.8978 0.4834 sec/batch
Epoch 16/20  Iteration 2691/3560 Training loss: 1.8972 0.5280 sec/batch
Epoch 16/20  Iteration 2692/3560 Training loss: 1.8977 0.5289 sec/batch
Epoch 16/20  Iteration 2693/3560 Training loss: 1.8973 0.5322 sec/batch
Epoch 16/20  Iteration 2694/3560 Training loss: 1.8965 0.5051 sec/batch
Epoch 16/20  Iteration 2695/3560 Training loss: 1.8965 0.5273 sec/batch
Epoch 16/20  Iteration 2696/3560 Training loss: 1.8958 0.5048 sec/batch
Epoch 16/20  Iteration 2697/3560 Training loss: 1.8949 0.5279 sec/batch
Epoch 16/20  Iteration 2698/3560 Training loss: 1.8956 0.4932 sec/batch
Epoch 16/20  Iteration 2699/3560 Training loss: 1.8966 0.4942 sec/batch
Epoch 16/20  Iteration 2700/3560 Training loss: 1.8969 0.5043 se

Epoch 16/20  Iteration 2801/3560 Training loss: 1.8838 0.5319 sec/batch
Epoch 16/20  Iteration 2802/3560 Training loss: 1.8835 0.5890 sec/batch
Epoch 16/20  Iteration 2803/3560 Training loss: 1.8835 0.7262 sec/batch
Epoch 16/20  Iteration 2804/3560 Training loss: 1.8836 0.9094 sec/batch
Epoch 16/20  Iteration 2805/3560 Training loss: 1.8837 0.7276 sec/batch
Epoch 16/20  Iteration 2806/3560 Training loss: 1.8838 0.7588 sec/batch
Epoch 16/20  Iteration 2807/3560 Training loss: 1.8838 0.8839 sec/batch
Epoch 16/20  Iteration 2808/3560 Training loss: 1.8839 0.5846 sec/batch
Epoch 16/20  Iteration 2809/3560 Training loss: 1.8841 0.6988 sec/batch
Epoch 16/20  Iteration 2810/3560 Training loss: 1.8839 0.5772 sec/batch
Epoch 16/20  Iteration 2811/3560 Training loss: 1.8842 0.5213 sec/batch
Epoch 16/20  Iteration 2812/3560 Training loss: 1.8842 0.5032 sec/batch
Epoch 16/20  Iteration 2813/3560 Training loss: 1.8841 0.4927 sec/batch
Epoch 16/20  Iteration 2814/3560 Training loss: 1.8841 0.6667 se

Epoch 17/20  Iteration 2915/3560 Training loss: 1.8725 0.4778 sec/batch
Epoch 17/20  Iteration 2916/3560 Training loss: 1.8720 0.4808 sec/batch
Epoch 17/20  Iteration 2917/3560 Training loss: 1.8720 0.4881 sec/batch
Epoch 17/20  Iteration 2918/3560 Training loss: 1.8719 0.5323 sec/batch
Epoch 17/20  Iteration 2919/3560 Training loss: 1.8725 0.5736 sec/batch
Epoch 17/20  Iteration 2920/3560 Training loss: 1.8727 0.5369 sec/batch
Epoch 17/20  Iteration 2921/3560 Training loss: 1.8731 0.5039 sec/batch
Epoch 17/20  Iteration 2922/3560 Training loss: 1.8730 0.5195 sec/batch
Epoch 17/20  Iteration 2923/3560 Training loss: 1.8729 0.4762 sec/batch
Epoch 17/20  Iteration 2924/3560 Training loss: 1.8733 0.4742 sec/batch
Epoch 17/20  Iteration 2925/3560 Training loss: 1.8732 0.4595 sec/batch
Epoch 17/20  Iteration 2926/3560 Training loss: 1.8734 0.4696 sec/batch
Epoch 17/20  Iteration 2927/3560 Training loss: 1.8729 0.4578 sec/batch
Epoch 17/20  Iteration 2928/3560 Training loss: 1.8727 0.4641 se

Epoch 18/20  Iteration 3029/3560 Training loss: 1.8822 0.4675 sec/batch
Epoch 18/20  Iteration 3030/3560 Training loss: 1.8762 0.4687 sec/batch
Epoch 18/20  Iteration 3031/3560 Training loss: 1.8715 0.4723 sec/batch
Epoch 18/20  Iteration 3032/3560 Training loss: 1.8614 0.4637 sec/batch
Epoch 18/20  Iteration 3033/3560 Training loss: 1.8623 0.4625 sec/batch
Epoch 18/20  Iteration 3034/3560 Training loss: 1.8621 0.4649 sec/batch
Epoch 18/20  Iteration 3035/3560 Training loss: 1.8639 0.4620 sec/batch
Epoch 18/20  Iteration 3036/3560 Training loss: 1.8640 0.4668 sec/batch
Epoch 18/20  Iteration 3037/3560 Training loss: 1.8605 0.4674 sec/batch
Epoch 18/20  Iteration 3038/3560 Training loss: 1.8587 0.4677 sec/batch
Epoch 18/20  Iteration 3039/3560 Training loss: 1.8583 0.4771 sec/batch
Epoch 18/20  Iteration 3040/3560 Training loss: 1.8607 0.4606 sec/batch
Epoch 18/20  Iteration 3041/3560 Training loss: 1.8594 0.4735 sec/batch
Epoch 18/20  Iteration 3042/3560 Training loss: 1.8583 0.4641 se

Epoch 18/20  Iteration 3143/3560 Training loss: 1.8482 0.4640 sec/batch
Epoch 18/20  Iteration 3144/3560 Training loss: 1.8482 0.4999 sec/batch
Epoch 18/20  Iteration 3145/3560 Training loss: 1.8482 0.5363 sec/batch
Epoch 18/20  Iteration 3146/3560 Training loss: 1.8481 0.5142 sec/batch
Epoch 18/20  Iteration 3147/3560 Training loss: 1.8481 0.4908 sec/batch
Epoch 18/20  Iteration 3148/3560 Training loss: 1.8478 0.4892 sec/batch
Epoch 18/20  Iteration 3149/3560 Training loss: 1.8475 0.5139 sec/batch
Epoch 18/20  Iteration 3150/3560 Training loss: 1.8477 0.4794 sec/batch
Epoch 18/20  Iteration 3151/3560 Training loss: 1.8477 0.4738 sec/batch
Epoch 18/20  Iteration 3152/3560 Training loss: 1.8474 0.4721 sec/batch
Epoch 18/20  Iteration 3153/3560 Training loss: 1.8475 0.4673 sec/batch
Epoch 18/20  Iteration 3154/3560 Training loss: 1.8475 0.4755 sec/batch
Epoch 18/20  Iteration 3155/3560 Training loss: 1.8474 0.4738 sec/batch
Epoch 18/20  Iteration 3156/3560 Training loss: 1.8473 0.4955 se

Epoch 19/20  Iteration 3257/3560 Training loss: 1.8359 0.5052 sec/batch
Epoch 19/20  Iteration 3258/3560 Training loss: 1.8359 0.5303 sec/batch
Epoch 19/20  Iteration 3259/3560 Training loss: 1.8357 0.4834 sec/batch
Epoch 19/20  Iteration 3260/3560 Training loss: 1.8359 0.4683 sec/batch
Epoch 19/20  Iteration 3261/3560 Training loss: 1.8363 0.4763 sec/batch
Epoch 19/20  Iteration 3262/3560 Training loss: 1.8363 0.4747 sec/batch
Epoch 19/20  Iteration 3263/3560 Training loss: 1.8359 0.4755 sec/batch
Epoch 19/20  Iteration 3264/3560 Training loss: 1.8363 0.4633 sec/batch
Epoch 19/20  Iteration 3265/3560 Training loss: 1.8362 0.4768 sec/batch
Epoch 19/20  Iteration 3266/3560 Training loss: 1.8372 0.4761 sec/batch
Epoch 19/20  Iteration 3267/3560 Training loss: 1.8375 0.4969 sec/batch
Epoch 19/20  Iteration 3268/3560 Training loss: 1.8377 0.4678 sec/batch
Epoch 19/20  Iteration 3269/3560 Training loss: 1.8374 0.5042 sec/batch
Epoch 19/20  Iteration 3270/3560 Training loss: 1.8377 0.5323 se

Epoch 19/20  Iteration 3371/3560 Training loss: 1.8310 0.4941 sec/batch
Epoch 19/20  Iteration 3372/3560 Training loss: 1.8313 0.4648 sec/batch
Epoch 19/20  Iteration 3373/3560 Training loss: 1.8313 0.4845 sec/batch
Epoch 19/20  Iteration 3374/3560 Training loss: 1.8312 0.4673 sec/batch
Epoch 19/20  Iteration 3375/3560 Training loss: 1.8311 0.4873 sec/batch
Epoch 19/20  Iteration 3376/3560 Training loss: 1.8310 0.4866 sec/batch
Epoch 19/20  Iteration 3377/3560 Training loss: 1.8310 0.4693 sec/batch
Epoch 19/20  Iteration 3378/3560 Training loss: 1.8310 0.4538 sec/batch
Epoch 19/20  Iteration 3379/3560 Training loss: 1.8311 0.4533 sec/batch
Epoch 19/20  Iteration 3380/3560 Training loss: 1.8310 0.4549 sec/batch
Epoch 19/20  Iteration 3381/3560 Training loss: 1.8309 0.4574 sec/batch
Epoch 19/20  Iteration 3382/3560 Training loss: 1.8309 0.4546 sec/batch
Epoch 20/20  Iteration 3383/3560 Training loss: 1.8979 0.4565 sec/batch
Epoch 20/20  Iteration 3384/3560 Training loss: 1.8613 0.4701 se

Epoch 20/20  Iteration 3485/3560 Training loss: 1.8172 0.4695 sec/batch
Epoch 20/20  Iteration 3486/3560 Training loss: 1.8170 0.4630 sec/batch
Epoch 20/20  Iteration 3487/3560 Training loss: 1.8168 0.4727 sec/batch
Epoch 20/20  Iteration 3488/3560 Training loss: 1.8167 0.4618 sec/batch
Epoch 20/20  Iteration 3489/3560 Training loss: 1.8167 0.4714 sec/batch
Epoch 20/20  Iteration 3490/3560 Training loss: 1.8168 0.4715 sec/batch
Epoch 20/20  Iteration 3491/3560 Training loss: 1.8168 0.4835 sec/batch
Epoch 20/20  Iteration 3492/3560 Training loss: 1.8167 0.5330 sec/batch
Epoch 20/20  Iteration 3493/3560 Training loss: 1.8166 0.5402 sec/batch
Epoch 20/20  Iteration 3494/3560 Training loss: 1.8165 0.5097 sec/batch
Epoch 20/20  Iteration 3495/3560 Training loss: 1.8165 0.4730 sec/batch
Epoch 20/20  Iteration 3496/3560 Training loss: 1.8163 0.5279 sec/batch
Epoch 20/20  Iteration 3497/3560 Training loss: 1.8161 0.5106 sec/batch
Epoch 20/20  Iteration 3498/3560 Training loss: 1.8157 0.4587 se

ValueError: Dimensions must be equal, but are 256 and 211 for 'RNN_forward/rnn/while/rnn/multi_rnn_cell/cell_0/cell_0/basic_lstm_cell/MatMul_1' (op: 'MatMul') with input shapes: [100,256], [211,512].
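
The shapes in this error can be unpacked with a little arithmetic. This is a minimal sketch, assuming the TF 1.x `BasicLSTMCell` convention where the cell multiplies the concatenation of its input and hidden state by a kernel of shape `[input_dim + num_units, 4 * num_units]`. The helper name `lstm_kernel_shape` is hypothetical and only there for illustration.

```python
# Assumption: a TF 1.x BasicLSTMCell computes matmul(concat([inputs, h]), kernel)
# with kernel shape [input_dim + num_units, 4 * num_units].
def lstm_kernel_shape(input_dim, num_units):
    """Kernel shape for one LSTM layer under the assumption above."""
    return (input_dim + num_units, 4 * num_units)

# Second shape in the error message: the kernel that already exists.
kernel_rows, kernel_cols = 211, 512
num_units = kernel_cols // 4             # units per layer implied by the kernel
input_dim = kernel_rows - num_units      # input width the kernel was built for

# First shape in the error message: [batch, concat([inputs, h]) width].
concat_width = 256
layer_input_dim = concat_width - num_units   # input width actually being fed

print("units per layer:", num_units)
print("kernel built for input width:", input_dim)
print("input width being fed:", layer_input_dim)
print("kernel the fed input would need:", lstm_kernel_shape(layer_input_dim, num_units))
```

Under that assumption, the kernel was built for an 83-wide input but is being fed a 128-wide one, which is the width of another LSTM layer's output. One common cause of this pattern is reusing a single cell object for every layer of a `MultiRNNCell`, but the traceback alone can't confirm that this is what happened here. The arithmetic also implies 128 units per layer rather than the 512 in the checkpoint names, so the error may come from a run with different settings.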

In [35]:
tf.train.get_checkpoint_state('checkpoints/anna')

model_checkpoint_path: "checkpoints/anna/i3560_l512_1.122.ckpt"
all_model_checkpoint_paths: "checkpoints/anna/i200_l512_2.432.ckpt"
all_model_checkpoint_paths: "checkpoints/anna/i400_l512_1.980.ckpt"
all_model_checkpoint_paths: "checkpoints/anna/i600_l512_1.750.ckpt"
all_model_checkpoint_paths: "checkpoints/anna/i800_l512_1.595.ckpt"
all_model_checkpoint_paths: "checkpoints/anna/i1000_l512_1.484.ckpt"
all_model_checkpoint_paths: "checkpoints/anna/i1200_l512_1.407.ckpt"
all_model_checkpoint_paths: "checkpoints/anna/i1400_l512_1.349.ckpt"
all_model_checkpoint_paths: "checkpoints/anna/i1600_l512_1.292.ckpt"
all_model_checkpoint_paths: "checkpoints/anna/i1800_l512_1.255.ckpt"
all_model_checkpoint_paths: "checkpoints/anna/i2000_l512_1.224.ckpt"
all_model_checkpoint_paths: "checkpoints/anna/i2200_l512_1.204.ckpt"
all_model_checkpoint_paths: "checkpoints/anna/i2400_l512_1.187.ckpt"
all_model_checkpoint_paths: "checkpoints/anna/i2600_l512_1.172.ckpt"
all_model_checkpoint_paths: "checkpoints/an

## Sampling

Now that the network is trained, we can use it to generate new text. The idea is that we pass in a character and the network predicts the next character. We then feed that prediction back in to predict the one after it, and keep going to generate entirely new text. I also included some functionality to prime the network: pass in a string, and the network builds up a state from it before sampling starts.
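
To make the feed-back loop concrete without needing a trained network, here's a small sketch that swaps the RNN for a toy bigram table built from one sentence. Everything in it (`build_bigram_table`, `generate`, the tiny corpus) is a hypothetical stand-in for illustration, not the model used in this notebook. With a bigram model, the "state" after priming is just the last character of the prime string.

```python
import numpy as np

def build_bigram_table(text, chars):
    """Row i holds P(next char | current char i), with add-one smoothing."""
    idx = {c: i for i, c in enumerate(chars)}
    counts = np.ones((len(chars), len(chars)))
    for a, b in zip(text[:-1], text[1:]):
        counts[idx[a], idx[b]] += 1
    return counts / counts.sum(axis=1, keepdims=True), idx

def generate(table, idx, chars, prime, n_samples, seed=0):
    rng = np.random.RandomState(seed)
    samples = list(prime)
    # Priming: the toy model's state is simply the last primed character
    current = idx[prime[-1]]
    for _ in range(n_samples):
        # Predict the next character, then feed it back in as the new input
        current = rng.choice(len(chars), p=table[current])
        samples.append(chars[current])
    return ''.join(samples)

corpus = "happy families are all alike"
chars = sorted(set(corpus))
table, idx = build_bigram_table(corpus, chars)
out = generate(table, idx, chars, prime="ha", n_samples=20)
print(out)
```

The real network follows the same loop, except that the state carried from step to step is the LSTM state rather than a single character.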

The network gives us a predicted probability for every character in the vocabulary. To reduce noise and make things a little less random, I'm only going to choose the new character from the top N most likely characters.



In [17]:
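def pick_top_n(preds, vocab_size, top_n=5):
    # Work on a flattened copy so the caller's predictions aren't modified
    p = np.squeeze(preds).copy()
    # Zero out everything except the top_n most likely characters
    p[np.argsort(p)[:-top_n]] = 0
    # Renormalize so the remaining probabilities sum to 1
    p = p / np.sum(p)
    # Sample one character index from the truncated distribution
    c = np.random.choice(vocab_size, 1, p=p)[0]
    return c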
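
To see what the top-N filter does, here's a small self-contained sketch on a made-up five-character distribution. The function name `top_n_filter` and the probabilities are hypothetical, chosen only to make the effect visible. It does the same zero-and-renormalize step as `pick_top_n`, but returns the filtered distribution instead of a sample.

```python
import numpy as np

def top_n_filter(probs, top_n):
    """Keep the top_n largest probabilities, zero the rest, renormalize."""
    p = np.array(probs, dtype=np.float64)   # copy, so the input is untouched
    p[np.argsort(p)[:-top_n]] = 0           # argsort is ascending: drop all but the last top_n
    return p / p.sum()

probs = np.array([0.05, 0.40, 0.10, 0.30, 0.15])   # made-up predictions
filtered = top_n_filter(probs, top_n=2)
print(filtered)

# Sampling from the filtered distribution can only return the kept indices
rng = np.random.RandomState(0)
draws = rng.choice(len(filtered), size=1000, p=filtered)
print(sorted(set(draws.tolist())))
```

With `top_n=2`, only indices 1 and 3 keep any probability mass, so unlikely characters can never be drawn. A smaller N gives more conservative text, while a larger N lets more of the model's uncertainty through.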

In [41]:
def sample(checkpoint, n_samples, lstm_size, vocab_size, prime="The "):
    # Start the output with the priming text
    samples = [c for c in prime]
    # sampling=True builds the graph for one character at a time
    model = build_rnn(vocab_size, lstm_size=lstm_size, sampling=True)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        saver.restore(sess, checkpoint)
        new_state = sess.run(model.initial_state)
        # Run the priming text through the network to build up its state
        for c in prime:
            x = np.zeros((1, 1))
            x[0,0] = vocab_to_int[c]
            feed = {model.inputs: x,
                    model.keep_prob: 1.,
                    model.initial_state: new_state}
            preds, new_state = sess.run([model.preds, model.final_state],
                                         feed_dict=feed)

        # Pick the first new character from the last priming prediction
        c = pick_top_n(preds, vocab_size)
        samples.append(int_to_vocab[c])

        # Feed each sampled character back in to predict the next one
        for i in range(n_samples):
            x[0,0] = c
            feed = {model.inputs: x,
                    model.keep_prob: 1.,
                    model.initial_state: new_state}
            preds, new_state = sess.run([model.preds, model.final_state],
                                         feed_dict=feed)

            c = pick_top_n(preds, vocab_size)
            samples.append(int_to_vocab[c])

    return ''.join(samples)

In [44]:
checkpoint = "checkpoints/anna/i3560_l512_1.122.ckpt"
samp = sample(checkpoint, 2000, lstm_size, len(vocab), prime="Far")
print(samp)

Farlathit that if had so
like it that it were. He could not trouble to his wife, and there was
anything in them of the side of his weaky in the creature at his forteren
to him.

"What is it? I can't bread to those," said Stepan Arkadyevitch. "It's not
my children, and there is an almost this arm, true it mays already,
and tell you what I have say to you, and was not looking at the peasant,
why is, I don't know him out, and she doesn't speak to me immediately, as
you would say the countess and the more frest an angelembre, and time and
things's silent, but I was not in my stand that is in my head. But if he
say, and was so feeling with his soul. A child--in his soul of his
soul of his soul. He should not see that any of that sense of. Here he
had not been so composed and to speak for as in a whole picture, but
all the setting and her excellent and society, who had been delighted
and see to anywing had been being troed to thousand words on them,
we liked him.

That set in her money at th

In [43]:
checkpoint = "checkpoints/anna/i200_l512_2.432.ckpt"
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

Farnt him oste wha sorind thans tout thint asd an sesand an hires on thime sind thit aled, ban thand and out hore as the ter hos ton ho te that, was tis tart al the hand sostint him sore an tit an son thes, win he se ther san ther hher tas tarereng,.

Anl at an ades in ond hesiln, ad hhe torers teans, wast tar arering tho this sos alten sorer has hhas an siton ther him he had sin he ard ate te anling the sosin her ans and
arins asd and ther ale te tot an tand tanginge wath and ho ald, so sot th asend sat hare sother horesinnd, he hesense wing ante her so tith tir sherinn, anded and to the toul anderin he sorit he torsith she se atere an ting ot hand and thit hhe so the te wile har
ens ont in the sersise, and we he seres tar aterer, to ato tat or has he he wan ton here won and sen heren he sosering, to to theer oo adent har herere the wosh oute, was serild ward tous hed astend..

I's sint on alt in har tor tit her asd hade shithans ored he talereng an soredendere tim tot hees. Tise sor 

In [46]:
checkpoint = "checkpoints/anna/i600_l512_1.750.ckpt"
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

Fard as astice her said he celatice of to seress in the raice, and to be the some and sere allats to that said to that the sark and a cast a the wither ald the pacinesse of her had astition, he said to the sount as she west at hissele. Af the cond it he was a fact onthis astisarianing.


"Or a ton to to be that's a more at aspestale as the sont of anstiring as
thours and trey.

The same wo dangring the
raterst, who sore and somethy had ast out an of his book. "We had's beane were that, and a morted a thay he had to tere. Then to
her homent andertersed his his ancouted to the pirsted, the soution for of the pirsice inthirgest and stenciol, with the hard and and
a colrice of to be oneres,
the song to this anderssad.
The could ounterss the said to serom of
soment a carsed of sheres of she
torded
har and want in their of hould, but
her told in that in he tad a the same to her. Serghing an her has and with the seed, and the camt ont his about of the
sail, the her then all houg ant or to hus

In [47]:
checkpoint = "checkpoints/anna/i1000_l512_1.484.ckpt"
samp = sample(checkpoint, 1000, lstm_size, len(vocab), prime="Far")
print(samp)

Farrat, his felt has at it.

"When the pose ther hor exceed
to his sheant was," weat a sime of his sounsed. The coment and the facily that which had began terede a marilicaly whice whether the pose of his hand, at she was alligated herself the same on she had to
taiking to his forthing and streath how to hand
began in a lang at some at it, this he cholded not set all her. "Wo love that is setthing. Him anstering as seen that."

"Yes in the man that say the mare a crances is it?" said Sergazy Ivancatching. "You doon think were somether is ifficult of a mone of
though the most at the countes that the
mean on the come to say the most, to
his feesing of
a man she, whilo he
sained and well, that he would still at to said. He wind at his for the sore in the most
of hoss and almoved to see him. They have betine the sumper into at he his stire, and what he was that at the so steate of the
sound, and shin should have a geest of shall feet on the conderation to she had been at that imporsing the