# Computer Vision Nanodegree

## Project: Image Captioning

---

In this notebook, you will train your CNN-RNN model.  

You are welcome and encouraged to try out many different architectures and hyperparameters when searching for a good model.

This does have the potential to make the project quite messy!  Before submitting your project, make sure that you clean up:
- the code you write in this notebook.  The notebook should describe how to train a single CNN-RNN architecture, corresponding to your final choice of hyperparameters.  You should structure the notebook so that the reviewer can replicate your results by running the code in this notebook.  
- the output of the code cell in **Step 2**.  It should show the results obtained when training the model from scratch.

This notebook **will be graded**.  

Feel free to use the links below to navigate the notebook:
- [Step 1](#step1): Training Setup
- [Step 2](#step2): Train your Model
- [Step 3](#step3): (Optional) Validate your Model

<a id='step1'></a>
## Step 1: Training Setup

In this step of the notebook, you will customize the training of your CNN-RNN model by specifying hyperparameters and setting other options that are important to the training procedure.  The values you set now will be used when training your model in **Step 2** below.

You should only amend blocks of code that are preceded by a `TODO` statement.  **Any code blocks that are not preceded by a `TODO` statement should not be modified**.

### Task #1

Begin by setting the following variables:
- `batch_size` - the batch size of each training batch.  It is the number of image-caption pairs used to update the model weights in each training step.
- `vocab_threshold` - the minimum word count threshold.  Note that a larger threshold will result in a smaller vocabulary, whereas a smaller threshold will include rarer words and result in a larger vocabulary.  
- `vocab_from_file` - a Boolean that decides whether to load the vocabulary from file. 
- `embed_size` - the dimensionality of the image and word embeddings.  
- `hidden_size` - the number of features in the hidden state of the RNN decoder.  
- `num_epochs` - the number of epochs to train the model.  We recommend that you set `num_epochs=3`, but feel free to increase or decrease this number as you wish.  [This paper](https://arxiv.org/pdf/1502.03044.pdf) trained a captioning model on a single state-of-the-art GPU for 3 days, but you'll soon see that you can get reasonable results in a matter of a few hours!  (_But of course, if you want your model to compete with current research, you will have to train for much longer._)
- `save_every` - determines how often to save the model weights.  We recommend that you set `save_every=1`, to save the model weights after each epoch.  This way, after the `i`th epoch, the encoder and decoder weights will be saved in the `models/` folder as `encoder-i.pkl` and `decoder-i.pkl`, respectively.
- `print_every` - determines how often to print the batch loss to the Jupyter notebook while training.  Note that you **will not** observe a monotonic decrease in the loss function while training - this is perfectly fine and completely expected!  You are encouraged to keep this at its default value of `100` to avoid clogging the notebook, but feel free to change it.
- `log_file` - the name of the text file that records the loss and perplexity at every training step.
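
To make the effect of `vocab_threshold` concrete, here is a minimal sketch on three toy captions. The helper `build_vocab` is hypothetical and is not the project's vocabulary code (which is not shown in this notebook and also handles tokenization and special tokens); it only illustrates that a larger threshold keeps fewer words.

```python
from collections import Counter


def build_vocab(captions, vocab_threshold):
    """Return the sorted words that occur at least `vocab_threshold` times.

    Hypothetical helper for illustration only: it splits on whitespace and
    ignores the special tokens a real captioning vocabulary would add.
    """
    counts = Counter(word for caption in captions for word in caption.lower().split())
    return sorted(word for word, n in counts.items() if n >= vocab_threshold)


toy_captions = [
    "a dog runs on the grass",
    "a dog sleeps on the couch",
    "a cat sits on the couch",
]

small_vocab = build_vocab(toy_captions, vocab_threshold=3)  # only very frequent words survive
large_vocab = build_vocab(toy_captions, vocab_threshold=1)  # every word is kept

print(len(small_vocab), small_vocab)
print(len(large_vocab), large_vocab)
```

With a threshold of 3 only the words shared by all three captions remain, whereas a threshold of 1 keeps every distinct word.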

If you're not sure how to choose some of the values above, you can peruse [this paper](https://arxiv.org/pdf/1502.03044.pdf) and [this paper](https://arxiv.org/pdf/1411.4555.pdf) for useful guidance!  **To avoid spending too long on this notebook**, you are encouraged to consult these suggested research papers to obtain a strong initial guess for which hyperparameters are likely to work best.  Then, train a single model, and proceed to the next notebook (**3_Inference.ipynb**).  If you are unhappy with your model's performance, you can return to this notebook to tweak the hyperparameters (and/or the architecture in **model.py**) and re-train your model.
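
When weighing `batch_size` and `num_epochs` against training time, it can help to estimate how many training steps a run will take. The sketch below assumes one step per batch and uses the 414113 training captions reported when the data loader in this notebook tokenizes the dataset; the function name `steps_per_epoch` is only illustrative.

```python
import math


def steps_per_epoch(num_captions, batch_size):
    """Number of batches needed to cover `num_captions` captions once."""
    return math.ceil(num_captions / batch_size)


num_captions = 414113  # caption count reported by the data loader in this notebook

for batch_size in (10, 64):
    per_epoch = steps_per_epoch(num_captions, batch_size)
    print(f"batch_size={batch_size}: {per_epoch} steps per epoch, "
          f"{3 * per_epoch} steps for 3 epochs")
```

A smaller batch size means more (cheaper) steps per epoch, so the wall-clock time per epoch depends on both the step count and the time per step on your hardware.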

### Question 1

**Question:** Describe your CNN-RNN architecture in detail.  With this architecture in mind, how did you select the values of the variables in Task #1?  If you consulted a research paper detailing a successful implementation of an image captioning model, please provide the reference.

**Answer:** 


### (Optional) Task #2

Note that we have provided a recommended image transform `transform_train` for pre-processing the training images, but you are welcome (and encouraged!) to modify it as you wish.  When modifying this transform, keep in mind that:
- the images in the dataset have varying heights and widths, and 
- if using a pre-trained model, you must perform the corresponding appropriate normalization.
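
The two points above can be illustrated without loading any images. The following is a rough sketch of the geometry behind `Resize(256)` and `RandomCrop(224)` and of the per-channel arithmetic behind `Normalize`, written in plain Python; the helper names are hypothetical, and the exact rounding used by torchvision may differ from this sketch.

```python
import random


def resize_smaller_edge(width, height, size=256):
    """Scale (width, height) so the smaller edge equals `size`, keeping the aspect ratio."""
    if width <= height:
        return size, int(size * height / width)
    return int(size * width / height), size


def random_crop_box(width, height, crop=224, rng=random):
    """Pick the (left, top, right, bottom) box of a random crop-by-crop window."""
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    return left, top, left + crop, top + crop


def normalize_pixel(rgb, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Apply (x - mean) / std per channel to one RGB pixel with values in [0, 1]."""
    return tuple((x - m) / s for x, m, s in zip(rgb, mean, std))


# Images of different shapes all end up as 224x224 crops.
for original in [(640, 480), (480, 640), (500, 375)]:
    resized = resize_smaller_edge(*original)
    box = random_crop_box(*resized)
    print(original, "->", resized, "-> crop box", box)

# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel((0.485, 0.456, 0.406)))
```

Because the crop is taken at a random location, part of the longer edge is discarded on every draw, which is worth keeping in mind when deciding whether to keep the default transform.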

### Question 2

**Question:** How did you select the transform in `transform_train`?  If you left the transform at its provided value, why do you think that it is a good choice for your CNN architecture?

**Answer:** The default looks reasonable, although I have some doubts as to whether it makes sense to discard the data around the edge in this case. 

### Task #3

Next, you will specify a Python list containing the learnable parameters of the model.  For instance, if you decide to make all weights in the decoder trainable, but only want to train the weights in the embedding layer of the encoder, then you should set `params` to something like:
```python
params = list(decoder.parameters()) + list(encoder.embed.parameters())
```

### Question 3

**Question:** How did you select the trainable parameters of your architecture?  Why do you think this is a good choice?

**Answer:** 

### Task #4

Finally, you will select an [optimizer](http://pytorch.org/docs/master/optim.html#torch.optim.Optimizer).
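
As background for this choice, here is a minimal scalar sketch of the Adam update rule, one of the optimizers available in `torch.optim`. It is not PyTorch's implementation: the function `adam_minimize`, the toy objective, and the learning rate of `0.1` are assumptions chosen only to make the update rule visible on a one-dimensional problem.

```python
import math


def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a scalar function given its gradient, using the Adam update rule."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g        # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialization
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3); its minimum is at x = 3.
x_min = adam_minimize(lambda x: 2 * (x - 3.0), x0=0.0)
print(x_min)
```

The step size is scaled by the running estimate of the squared gradient, which is why Adam is often less sensitive to the raw learning rate than plain SGD.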

### Question 4

**Question:** How did you select the optimizer used to train your model?

**Answer:** 

In [4]:
import torch
import torch.nn as nn
from torchvision import transforms
import sys
sys.path.append('/opt/cocoapi/PythonAPI')
from pycocotools.coco import COCO
from data_loader import get_loader
from model import EncoderCNN, DecoderRNN
import math

import nltk
nltk.download('punkt')

## TODO #1: Select appropriate values for the Python variables below.
batch_size = 10            # batch size
#batch_size = 64           # alternative: larger batch to make use of the 10GB of GPU memory
vocab_threshold = 5        # minimum word count threshold
vocab_from_file = False    # if True, load existing vocab file
embed_size = 512           # dimensionality of image and word embeddings
hidden_size = 512          # number of features in hidden state of the RNN decoder
num_epochs = 25             # number of training epochs
save_every = 1             # determines frequency of saving model weights
print_every = 100          # determines how often (in steps) the batch loss is printed
log_file = 'training_log.txt'       # name of file with saved training loss and perplexity

# (Optional) TODO #2: Amend the image transform below.
transform_train = transforms.Compose([ 
    transforms.Resize(256),                          # smaller edge of image resized to 256
    transforms.RandomCrop(224),                      # get 224x224 crop from random location
    transforms.RandomHorizontalFlip(),               # horizontally flip image with probability=0.5
    transforms.ToTensor(),                           # convert the PIL Image to a tensor
    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                         (0.229, 0.224, 0.225))])

# Build data loader.
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=vocab_from_file)

# The size of the vocabulary.
vocab_size = len(data_loader.dataset.vocab)

# Initialize the encoder and decoder. 
encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size, max_batch_size=batch_size)

# Move models to GPU if CUDA is available. 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
encoder.to(device)
decoder.to(device)

# Define the loss function. 
criterion = nn.CrossEntropyLoss().to(device)

# TODO #3: Specify the learnable parameters of the model.
params = list(decoder.parameters()) + list(encoder.embed.parameters()) + list(encoder.bn.parameters())

# TODO #4: Define the optimizer.
#optimizer = torch.optim.SGD(params, lr=0.01)

# Adam with its default learning rate: this data is probably fairly sparse,
# and the defaults are probably fine.
# See http://ruder.io/optimizing-gradient-descent/
optimizer = torch.optim.Adam(params, lr=0.001)

# Set the total number of training steps per epoch.
total_step = math.ceil(len(data_loader.dataset.caption_lengths) / data_loader.batch_sampler.batch_size)

[nltk_data] Downloading package punkt to /home/sthenc/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


loading annotations into memory...
Done (t=0.52s)
creating index...
index created!
[0/414113] Tokenizing captions...



[100000/414113] Tokenizing captions...
[200000/414113] Tokenizing captions...
[300000/414113] Tokenizing captions...
[400000/414113] Tokenizing captions...
loading annotations into memory...
Done (t=0.52s)
creating index...





index created!
Obtaining caption lengths...



<a id='step2'></a>
## Step 2: Train your Model

Once you have executed the code cell in **Step 1**, the training procedure below should run without issue.  

It is completely fine to leave the code cell below as-is without modifications to train your model.  However, if you would like to modify the code used to train the model below, you must ensure that your changes are easy for your reviewer to follow.  In other words, make sure to provide appropriate comments to describe how your code works!  

You may find it useful to load saved weights to resume training.  In that case, note the names of the files containing the encoder and decoder weights that you'd like to load (`encoder_file` and `decoder_file`).  Then you can load the weights by using the lines below:

```python
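import os

# Load pre-trained weights before resuming training.
# map_location keeps loading robust if the weights were saved on a different device.
encoder.load_state_dict(torch.load(os.path.join('./models', encoder_file), map_location=device))
decoder.load_state_dict(torch.load(os.path.join('./models', decoder_file), map_location=device))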
```
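
If you do not remember which epoch was saved last, the file names can be derived from the contents of the `models/` folder. The sketch below assumes the `encoder-i.pkl` / `decoder-i.pkl` naming used by this notebook; the helper `latest_checkpoint_epoch` is hypothetical, and the demonstration runs on a temporary folder with empty placeholder files rather than real weights.

```python
import os
import re
import tempfile


def latest_checkpoint_epoch(model_dir):
    """Return the highest epoch i with both encoder-i.pkl and decoder-i.pkl, or None."""
    pattern = re.compile(r"^(encoder|decoder)-(\d+)\.pkl$")
    epochs = {"encoder": set(), "decoder": set()}
    for name in os.listdir(model_dir):
        match = pattern.match(name)
        if match:
            epochs[match.group(1)].add(int(match.group(2)))
    complete = epochs["encoder"] & epochs["decoder"]  # epochs where both files exist
    return max(complete) if complete else None


# Demonstration on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as model_dir:
    for name in ["encoder-1.pkl", "decoder-1.pkl",
                 "encoder-2.pkl", "decoder-2.pkl",
                 "encoder-3.pkl"]:  # epoch 3 is incomplete: no decoder file
        open(os.path.join(model_dir, name), "w").close()
    epoch = latest_checkpoint_epoch(model_dir)
    encoder_file = "encoder-%d.pkl" % epoch
    decoder_file = "decoder-%d.pkl" % epoch

print(encoder_file, decoder_file)
```

Requiring both files guards against resuming from an epoch whose save was interrupted halfway.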

While trying out parameters, make sure to take extensive notes and record the settings that you used in your various training runs.  In particular, you don't want to encounter a situation where you've trained a model for several hours but can't remember what settings you used :).
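
One lightweight way to keep such notes is to write the settings of each run to a small JSON file next to the saved weights. This is only a sketch: the helper `save_run_settings` and the file name `run_settings.json` are assumptions, and the values shown are the ones set in Step 1 of this notebook.

```python
import json
import os
import tempfile
import time


def save_run_settings(path, settings):
    """Write the settings of one training run, plus a timestamp, to a JSON file."""
    record = dict(settings)
    record["saved_at"] = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "w") as f:
        json.dump(record, f, indent=2, sort_keys=True)
    return record


settings = {
    "batch_size": 10,
    "vocab_threshold": 5,
    "embed_size": 512,
    "hidden_size": 512,
    "num_epochs": 25,
    "optimizer": "Adam",
    "lr": 0.001,
}

# Demonstration in a temporary directory; in practice you might write into ./models.
with tempfile.TemporaryDirectory() as run_dir:
    path = os.path.join(run_dir, "run_settings.json")
    save_run_settings(path, settings)
    with open(path) as f:
        loaded = json.load(f)

print(loaded["batch_size"], loaded["optimizer"], loaded["saved_at"])
```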

### A Note on Tuning Hyperparameters

To figure out how well your model is doing, you can look at how the training loss and perplexity evolve during training - and for the purposes of this project, you are encouraged to amend the hyperparameters based on this information.  

However, this will not tell you if your model is overfitting to the training data, and, unfortunately, overfitting is a problem that is commonly encountered when training image captioning models.  

For this project, you need not worry about overfitting. **This project does not have strict requirements regarding the performance of your model**, and you just need to demonstrate that your model has learned **_something_** when you generate captions on the test data.  For now, we strongly encourage you to train your model for the suggested 3 epochs without worrying about performance; then, you should immediately transition to the next notebook in the sequence (**3_Inference.ipynb**) to see how your model performs on the test data.  If your model needs to be changed, you can come back to this notebook, amend hyperparameters (if necessary), and re-train the model.

That said, if you would like to go above and beyond in this project, you can read about some approaches to minimizing overfitting in section 4.3.1 of [this paper](http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7505636).  In the next (optional) step of this notebook, we provide some guidance for assessing the performance on the validation dataset.
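
To follow how the training loss and perplexity evolve, the per-step lines written to the log file can be summarized per epoch. The sketch below assumes the line format produced by the training cell in this notebook (`Epoch [e/E], Step [s/S], Loss: ..., Perplexity: ...`); the helper `mean_loss_per_epoch` is hypothetical, and the sample lines are taken from the training output of this run.

```python
import math
import re
from collections import defaultdict

LOG_LINE = re.compile(
    r"Epoch \[(\d+)/\d+\], Step \[(\d+)/\d+\], Loss: ([\d.]+), Perplexity: ([\d.]+)"
)


def mean_loss_per_epoch(lines):
    """Average the logged batch loss over all matching lines of each epoch."""
    totals = defaultdict(lambda: [0.0, 0])  # epoch -> [sum of losses, number of lines]
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            epoch, loss = int(match.group(1)), float(match.group(3))
            totals[epoch][0] += loss
            totals[epoch][1] += 1
    return {epoch: total / count for epoch, (total, count) in sorted(totals.items())}


sample = [
    "Epoch [1/25], Step [100/41412], Loss: 4.4867, Perplexity: 88.8297",
    "Epoch [1/25], Step [200/41412], Loss: 4.0446, Perplexity: 57.08937",
    "Epoch [2/25], Step [6800/41412], Loss: 2.2374, Perplexity: 9.36911",
]

averages = mean_loss_per_epoch(sample)
for epoch, loss in averages.items():
    # Perplexity is the exponential of the cross-entropy loss.
    print(f"epoch {epoch}: mean loss {loss:.4f}, perplexity of mean loss {math.exp(loss):.2f}")
```

To apply it to a real run, pass the open log file instead of `sample`, for example `mean_loss_per_epoch(open(log_file))`. Averaging over an epoch smooths out the batch-to-batch noise that makes the raw loss non-monotonic.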

In [None]:
import torch.utils.data as data
import numpy as np
import os
import requests
import time

# temporary
%load_ext autoreload
%autoreload 2

from model import EncoderCNN, DecoderRNN

# Open the training log file.
f = open(log_file, 'w')

## running the training locally
#old_time = time.time()
#response = requests.request("GET", 
#                            "http://metadata.google.internal/computeMetadata/v1/instance/attributes/keep_alive_token", 
#                            headers={"Metadata-Flavor":"Google"})


for epoch in range(1, num_epochs+1):

    for i_step in range(1, total_step+1):
        
        ## running the training locally
        #if time.time() - old_time > 60:
        #    old_time = time.time()
        #    requests.request("POST", 
        #                     "https://nebula.udacity.com/api/v1/remote/keep-alive", 
        #                     headers={'Authorization': "STAR " + response.text})

        # Randomly sample a caption length, and sample indices with that length.
        indices = data_loader.dataset.get_train_indices()
        # Create and assign a batch sampler to retrieve a batch with the sampled indices.
        new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
        data_loader.batch_sampler.sampler = new_sampler

        # Obtain the batch.
        images, captions = next(iter(data_loader))

        # Move batch of images and captions to GPU if CUDA is available.
        images = images.to(device)
        captions = captions.to(device)

        # Zero the gradients.
        decoder.zero_grad()
        encoder.zero_grad()

        # Pass the inputs through the CNN-RNN model.
        features = encoder(images)
        outputs = decoder(features, captions)

        # Calculate the batch loss.
        loss = criterion(outputs.contiguous().view(-1, vocab_size), captions.contiguous().view(-1))

        # Backward pass.
        loss.backward(retain_graph=True)

        # Update the model parameters with the optimizer.
        optimizer.step()

        # Get training statistics.
        stats = 'Epoch [%d/%d], Step [%d/%d], Loss: %.4f, Perplexity: %5.4f' % (epoch, num_epochs, i_step, total_step, loss.item(), np.exp(loss.item()))

        # Print training statistics (on same line).
        print('\r' + stats, end="")
        sys.stdout.flush()

        # Print training statistics to file.
        f.write(stats + '\n')
        f.flush()

        # Print training statistics (on different line).
        if i_step % print_every == 0:
            print('\r' + stats)

    # Save the weights.
    if epoch % save_every == 0:
        torch.save(decoder.state_dict(), os.path.join('./models', 'decoder-%d.pkl' % epoch))
        torch.save(encoder.state_dict(), os.path.join('./models', 'encoder-%d.pkl' % epoch))

# Close the training log file.
f.close()

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Epoch [1/25], Step [100/41412], Loss: 4.4867, Perplexity: 88.8297
Epoch [1/25], Step [200/41412], Loss: 4.0446, Perplexity: 57.08937
Epoch [1/25], Step [300/41412], Loss: 3.7327, Perplexity: 41.79326
Epoch [1/25], Step [400/41412], Loss: 4.2026, Perplexity: 66.8612
Epoch [1/25], Step [500/41412], Loss: 4.0259, Perplexity: 56.0289
Epoch [1/25], Step [600/41412], Loss: 3.3031, Perplexity: 27.19644
Epoch [1/25], Step [700/41412], Loss: 3.5537, Perplexity: 34.9435
Epoch [1/25], Step [800/41412], Loss: 3.8278, Perplexity: 45.9632
Epoch [1/25], Step [900/41412], Loss: 2.8911, Perplexity: 18.0137
Epoch [1/25], Step [1000/41412], Loss: 3.3596, Perplexity: 28.7786
Epoch [1/25], Step [1100/41412], Loss: 2.5576, Perplexity: 12.9045
Epoch [1/25], Step [1200/41412], Loss: 3.7884, Perplexity: 44.1878
Epoch [1/25], Step [1300/41412], Loss: 3.5841, Perplexity: 36.0227
Epoch [1/25], Step [1400/41412], Loss: 2.8287, 

Epoch [1/25], Step [12100/41412], Loss: 2.5063, Perplexity: 12.2601
Epoch [1/25], Step [12200/41412], Loss: 2.4971, Perplexity: 12.1468
Epoch [1/25], Step [12300/41412], Loss: 2.1686, Perplexity: 8.74586
Epoch [1/25], Step [12400/41412], Loss: 2.5789, Perplexity: 13.1820
Epoch [1/25], Step [12500/41412], Loss: 2.3958, Perplexity: 10.9772
Epoch [1/25], Step [12600/41412], Loss: 3.1249, Perplexity: 22.7570
Epoch [1/25], Step [12700/41412], Loss: 2.6593, Perplexity: 14.2860
Epoch [1/25], Step [12800/41412], Loss: 3.3241, Perplexity: 27.7746
Epoch [1/25], Step [12900/41412], Loss: 2.4258, Perplexity: 11.3112
Epoch [1/25], Step [13000/41412], Loss: 1.9556, Perplexity: 7.06794
Epoch [1/25], Step [13100/41412], Loss: 2.7004, Perplexity: 14.8857
Epoch [1/25], Step [13200/41412], Loss: 2.1548, Perplexity: 8.62640
Epoch [1/25], Step [13300/41412], Loss: 2.5604, Perplexity: 12.9410
Epoch [1/25], Step [13400/41412], Loss: 2.2975, Perplexity: 9.94906
Epoch [1/25], Step [13500/41412], Loss: 2.3957, 

Epoch [1/25], Step [24100/41412], Loss: 2.3175, Perplexity: 10.1506
Epoch [1/25], Step [24200/41412], Loss: 2.6511, Perplexity: 14.1691
Epoch [1/25], Step [24300/41412], Loss: 2.1793, Perplexity: 8.84007
Epoch [1/25], Step [24400/41412], Loss: 2.2747, Perplexity: 9.72490
Epoch [1/25], Step [24500/41412], Loss: 2.5413, Perplexity: 12.6961
Epoch [1/25], Step [24600/41412], Loss: 2.3133, Perplexity: 10.1077
Epoch [1/25], Step [24700/41412], Loss: 2.2860, Perplexity: 9.83538
Epoch [1/25], Step [24800/41412], Loss: 1.8260, Perplexity: 6.20927
Epoch [1/25], Step [24900/41412], Loss: 2.4143, Perplexity: 11.1818
Epoch [1/25], Step [25000/41412], Loss: 2.1111, Perplexity: 8.25732
Epoch [1/25], Step [25100/41412], Loss: 2.4265, Perplexity: 11.3192
Epoch [1/25], Step [25200/41412], Loss: 2.2929, Perplexity: 9.90333
Epoch [1/25], Step [25300/41412], Loss: 2.1054, Perplexity: 8.21081
Epoch [1/25], Step [25400/41412], Loss: 2.5464, Perplexity: 12.7612
Epoch [1/25], Step [25500/41412], Loss: 1.8714, 

Epoch [1/25], Step [36100/41412], Loss: 2.0179, Perplexity: 7.52243
Epoch [1/25], Step [36200/41412], Loss: 2.1142, Perplexity: 8.28262
Epoch [1/25], Step [36300/41412], Loss: 3.1995, Perplexity: 24.5212
Epoch [1/25], Step [36400/41412], Loss: 2.4477, Perplexity: 11.5618
Epoch [1/25], Step [36500/41412], Loss: 2.1331, Perplexity: 8.44097
Epoch [1/25], Step [36600/41412], Loss: 2.1254, Perplexity: 8.37616
Epoch [1/25], Step [36700/41412], Loss: 2.1860, Perplexity: 8.89994
Epoch [1/25], Step [36800/41412], Loss: 2.4551, Perplexity: 11.6472
Epoch [1/25], Step [36900/41412], Loss: 1.9135, Perplexity: 6.77717
Epoch [1/25], Step [37000/41412], Loss: 2.1792, Perplexity: 8.83909
Epoch [1/25], Step [37100/41412], Loss: 1.9393, Perplexity: 6.95381
Epoch [1/25], Step [37200/41412], Loss: 2.5538, Perplexity: 12.8554
Epoch [1/25], Step [37300/41412], Loss: 2.5879, Perplexity: 13.3020
Epoch [1/25], Step [37400/41412], Loss: 2.2268, Perplexity: 9.27055
Epoch [1/25], Step [37500/41412], Loss: 1.7550, 

Epoch [2/25], Step [6800/41412], Loss: 2.2374, Perplexity: 9.36911
Epoch [2/25], Step [6900/41412], Loss: 2.1276, Perplexity: 8.39455
Epoch [2/25], Step [7000/41412], Loss: 2.0877, Perplexity: 8.06643
Epoch [2/25], Step [7100/41412], Loss: 1.8404, Perplexity: 6.29929
Epoch [2/25], Step [7200/41412], Loss: 2.9338, Perplexity: 18.7980
Epoch [2/25], Step [7300/41412], Loss: 2.8647, Perplexity: 17.5431
Epoch [2/25], Step [7400/41412], Loss: 1.7883, Perplexity: 5.97938
Epoch [2/25], Step [7500/41412], Loss: 2.7257, Perplexity: 15.2664
Epoch [2/25], Step [7600/41412], Loss: 1.8025, Perplexity: 6.06469
Epoch [2/25], Step [7700/41412], Loss: 2.2752, Perplexity: 9.72961
Epoch [2/25], Step [7800/41412], Loss: 2.3196, Perplexity: 10.1715
Epoch [2/25], Step [7900/41412], Loss: 2.1648, Perplexity: 8.71249
Epoch [2/25], Step [8000/41412], Loss: 2.2196, Perplexity: 9.20357
Epoch [2/25], Step [8100/41412], Loss: 2.2325, Perplexity: 9.32281
Epoch [2/25], Step [8200/41412], Loss: 2.1108, Perplexity: 8.2

Epoch [2/25], Step [18800/41412], Loss: 2.0579, Perplexity: 7.82972
Epoch [2/25], Step [18900/41412], Loss: 2.7128, Perplexity: 15.0714
Epoch [2/25], Step [19000/41412], Loss: 2.0114, Perplexity: 7.47408
Epoch [2/25], Step [19100/41412], Loss: 1.5080, Perplexity: 4.51779
Epoch [2/25], Step [19200/41412], Loss: 1.8150, Perplexity: 6.14146
Epoch [2/25], Step [19300/41412], Loss: 2.0956, Perplexity: 8.13021
Epoch [2/25], Step [19400/41412], Loss: 1.9458, Perplexity: 6.99923
Epoch [2/25], Step [19500/41412], Loss: 1.6256, Perplexity: 5.08147
Epoch [2/25], Step [19600/41412], Loss: 1.8927, Perplexity: 6.63741
Epoch [2/25], Step [19700/41412], Loss: 2.5966, Perplexity: 13.4181
Epoch [2/25], Step [19800/41412], Loss: 2.3014, Perplexity: 9.98784
Epoch [2/25], Step [19900/41412], Loss: 2.1958, Perplexity: 8.98737
Epoch [2/25], Step [20000/41412], Loss: 2.1339, Perplexity: 8.44762
Epoch [2/25], Step [20100/41412], Loss: 1.5842, Perplexity: 4.87551
Epoch [2/25], Step [20200/41412], Loss: 2.3842, 

Epoch [2/25], Step [30800/41412], Loss: 1.8880, Perplexity: 6.60614
Epoch [2/25], Step [30900/41412], Loss: 2.3220, Perplexity: 10.1958
Epoch [2/25], Step [31000/41412], Loss: 2.2793, Perplexity: 9.76948
Epoch [2/25], Step [31100/41412], Loss: 1.5858, Perplexity: 4.88322
Epoch [2/25], Step [31200/41412], Loss: 1.9159, Perplexity: 6.79326
Epoch [2/25], Step [31300/41412], Loss: 1.5127, Perplexity: 4.53929
Epoch [2/25], Step [31400/41412], Loss: 2.1334, Perplexity: 8.44310
Epoch [2/25], Step [31500/41412], Loss: 1.7872, Perplexity: 5.97285
Epoch [2/25], Step [31600/41412], Loss: 2.2166, Perplexity: 9.17641
Epoch [2/25], Step [31700/41412], Loss: 2.2080, Perplexity: 9.09798
Epoch [2/25], Step [31800/41412], Loss: 2.1789, Perplexity: 8.83636
Epoch [2/25], Step [31900/41412], Loss: 2.2201, Perplexity: 9.20806
Epoch [2/25], Step [32000/41412], Loss: 1.8497, Perplexity: 6.35809
Epoch [2/25], Step [32100/41412], Loss: 2.1945, Perplexity: 8.97586
Epoch [2/25], Step [32200/41412], Loss: 2.3686, 

Epoch [3/25], Step [1400/41412], Loss: 2.1421, Perplexity: 8.51768
Epoch [3/25], Step [1500/41412], Loss: 2.4416, Perplexity: 11.4918
Epoch [3/25], Step [1600/41412], Loss: 1.6957, Perplexity: 5.45078
Epoch [3/25], Step [1700/41412], Loss: 2.0754, Perplexity: 7.96777
Epoch [3/25], Step [1800/41412], Loss: 1.9974, Perplexity: 7.37010
Epoch [3/25], Step [1900/41412], Loss: 2.1672, Perplexity: 8.73385
Epoch [3/25], Step [2000/41412], Loss: 2.9432, Perplexity: 18.9765
Epoch [3/25], Step [2100/41412], Loss: 2.3834, Perplexity: 10.8416
Epoch [3/25], Step [2200/41412], Loss: 2.1123, Perplexity: 8.26758
Epoch [3/25], Step [2300/41412], Loss: 2.0312, Perplexity: 7.62357
Epoch [3/25], Step [2400/41412], Loss: 1.9827, Perplexity: 7.26231
Epoch [3/25], Step [2500/41412], Loss: 2.3494, Perplexity: 10.4789
Epoch [3/25], Step [2600/41412], Loss: 2.2748, Perplexity: 9.725734
Epoch [3/25], Step [2700/41412], Loss: 2.0141, Perplexity: 7.49417
Epoch [3/25], Step [2800/41412], Loss: 2.1233, Perplexity: 8.

Epoch [3/25], Step [13500/41412], Loss: 2.0903, Perplexity: 8.08705
Epoch [3/25], Step [13600/41412], Loss: 1.7362, Perplexity: 5.67566
Epoch [3/25], Step [13700/41412], Loss: 1.9915, Perplexity: 7.32628
Epoch [3/25], Step [13800/41412], Loss: 1.5445, Perplexity: 4.68587
Epoch [3/25], Step [13900/41412], Loss: 1.9631, Perplexity: 7.12120
Epoch [3/25], Step [14000/41412], Loss: 1.9542, Perplexity: 7.05861
Epoch [3/25], Step [14100/41412], Loss: 2.0269, Perplexity: 7.59068
Epoch [3/25], Step [14200/41412], Loss: 2.2622, Perplexity: 9.60423
Epoch [3/25], Step [14300/41412], Loss: 2.1861, Perplexity: 8.90021
Epoch [3/25], Step [14400/41412], Loss: 2.3005, Perplexity: 9.97892
Epoch [3/25], Step [14500/41412], Loss: 1.8021, Perplexity: 6.06235
Epoch [3/25], Step [14600/41412], Loss: 1.8856, Perplexity: 6.59067
Epoch [3/25], Step [14700/41412], Loss: 2.1974, Perplexity: 9.00132
Epoch [3/25], Step [14800/41412], Loss: 2.1291, Perplexity: 8.40723
Epoch [3/25], Step [14900/41412], Loss: 1.8376, 

Epoch [3/25], Step [25500/41412], Loss: 1.8821, Perplexity: 6.56765
Epoch [3/25], Step [25600/41412], Loss: 1.9861, Perplexity: 7.28733
Epoch [3/25], Step [25700/41412], Loss: 2.2814, Perplexity: 9.79030
Epoch [3/25], Step [25800/41412], Loss: 1.4764, Perplexity: 4.37727
Epoch [3/25], Step [25900/41412], Loss: 1.6754, Perplexity: 5.34084
Epoch [3/25], Step [26000/41412], Loss: 1.8975, Perplexity: 6.66891
Epoch [3/25], Step [26100/41412], Loss: 1.8312, Perplexity: 6.24122
Epoch [3/25], Step [26200/41412], Loss: 2.2985, Perplexity: 9.95927
Epoch [3/25], Step [26300/41412], Loss: 1.9973, Perplexity: 7.36891
Epoch [3/25], Step [26400/41412], Loss: 1.5961, Perplexity: 4.93374
Epoch [3/25], Step [26500/41412], Loss: 1.9282, Perplexity: 6.87690
Epoch [3/25], Step [26600/41412], Loss: 1.6508, Perplexity: 5.21095
Epoch [3/25], Step [26700/41412], Loss: 2.3922, Perplexity: 10.9375
Epoch [3/25], Step [26800/41412], Loss: 1.6635, Perplexity: 5.27752
Epoch [3/25], Step [26900/41412], Loss: 2.6615, 

Epoch [3/25], Step [37500/41412], Loss: 2.4871, Perplexity: 12.0269
Epoch [3/25], Step [37600/41412], Loss: 2.1923, Perplexity: 8.95554
Epoch [3/25], Step [37700/41412], Loss: 2.1422, Perplexity: 8.51849
Epoch [3/25], Step [37800/41412], Loss: 2.1083, Perplexity: 8.23430
Epoch [3/25], Step [37900/41412], Loss: 1.7443, Perplexity: 5.72216
Epoch [3/25], Step [38000/41412], Loss: 2.1075, Perplexity: 8.22735
Epoch [3/25], Step [38100/41412], Loss: 2.2031, Perplexity: 9.05324
Epoch [3/25], Step [38200/41412], Loss: 2.1597, Perplexity: 8.66901
Epoch [3/25], Step [38300/41412], Loss: 2.1803, Perplexity: 8.84867
Epoch [3/25], Step [38400/41412], Loss: 1.9546, Perplexity: 7.06130
Epoch [3/25], Step [38500/41412], Loss: 2.0967, Perplexity: 8.13943
Epoch [3/25], Step [38600/41412], Loss: 2.6103, Perplexity: 13.6031
Epoch [3/25], Step [38700/41412], Loss: 1.8275, Perplexity: 6.21822
Epoch [3/25], Step [38800/41412], Loss: 1.7231, Perplexity: 5.60210
Epoch [3/25], Step [38900/41412], Loss: 1.9889, 

Epoch [4/25], Step [8200/41412], Loss: 1.6960, Perplexity: 5.45191
Epoch [4/25], Step [8300/41412], Loss: 1.6149, Perplexity: 5.02744
Epoch [4/25], Step [8400/41412], Loss: 2.1046, Perplexity: 8.20413
Epoch [4/25], Step [8500/41412], Loss: 1.9812, Perplexity: 7.25129
Epoch [4/25], Step [8600/41412], Loss: 2.0390, Perplexity: 7.68319
Epoch [4/25], Step [8700/41412], Loss: 1.9527, Perplexity: 7.04776
Epoch [4/25], Step [8800/41412], Loss: 1.8760, Perplexity: 6.52778
Epoch [4/25], Step [8900/41412], Loss: 1.6408, Perplexity: 5.15920
Epoch [4/25], Step [9000/41412], Loss: 1.9845, Perplexity: 7.27531
Epoch [4/25], Step [9100/41412], Loss: 1.5674, Perplexity: 4.79430
Epoch [4/25], Step [9200/41412], Loss: 2.2183, Perplexity: 9.19153
Epoch [4/25], Step [9300/41412], Loss: 2.0960, Perplexity: 8.13362
Epoch [4/25], Step [9400/41412], Loss: 2.1482, Perplexity: 8.56965
Epoch [4/25], Step [9500/41412], Loss: 2.3375, Perplexity: 10.3550
Epoch [4/25], Step [9600/41412], Loss: 1.8706, Perplexity: 6.4

Epoch [4/25], Step [20200/41412], Loss: 1.6667, Perplexity: 5.29450
Epoch [4/25], Step [20300/41412], Loss: 2.4770, Perplexity: 11.9050
Epoch [4/25], Step [20400/41412], Loss: 2.0926, Perplexity: 8.10578
Epoch [4/25], Step [20500/41412], Loss: 1.9760, Perplexity: 7.21354
Epoch [4/25], Step [20600/41412], Loss: 1.8712, Perplexity: 6.49594
Epoch [4/25], Step [20700/41412], Loss: 1.7550, Perplexity: 5.78334
Epoch [4/25], Step [20800/41412], Loss: 1.4319, Perplexity: 4.18655
Epoch [4/25], Step [20900/41412], Loss: 2.2054, Perplexity: 9.07392
Epoch [4/25], Step [21000/41412], Loss: 2.3491, Perplexity: 10.4766
Epoch [4/25], Step [21100/41412], Loss: 1.9302, Perplexity: 6.89063
Epoch [4/25], Step [21200/41412], Loss: 2.4397, Perplexity: 11.4697
Epoch [4/25], Step [21300/41412], Loss: 2.1072, Perplexity: 8.22489
Epoch [4/25], Step [21400/41412], Loss: 2.1690, Perplexity: 8.74992
Epoch [4/25], Step [21500/41412], Loss: 2.1399, Perplexity: 8.49833
Epoch [4/25], Step [21600/41412], Loss: 1.8724, 

Epoch [4/25], Step [32200/41412], Loss: 1.9675, Perplexity: 7.15279
Epoch [4/25], Step [32300/41412], Loss: 1.9351, Perplexity: 6.92454
Epoch [4/25], Step [32400/41412], Loss: 2.2790, Perplexity: 9.76675
Epoch [4/25], Step [32500/41412], Loss: 1.5860, Perplexity: 4.88430
Epoch [4/25], Step [32600/41412], Loss: 2.1605, Perplexity: 8.67569
Epoch [4/25], Step [32700/41412], Loss: 2.0284, Perplexity: 7.60160
Epoch [4/25], Step [32800/41412], Loss: 2.1473, Perplexity: 8.56153
Epoch [4/25], Step [32900/41412], Loss: 1.4850, Perplexity: 4.41494
Epoch [4/25], Step [33000/41412], Loss: 1.7106, Perplexity: 5.53252
Epoch [4/25], Step [33100/41412], Loss: 1.9767, Perplexity: 7.21908
Epoch [4/25], Step [33200/41412], Loss: 1.9731, Perplexity: 7.19320
Epoch [4/25], Step [33300/41412], Loss: 1.7518, Perplexity: 5.76490
Epoch [4/25], Step [33400/41412], Loss: 1.6560, Perplexity: 5.23837
Epoch [4/25], Step [33500/41412], Loss: 1.4773, Perplexity: 4.38109
Epoch [4/25], Step [33600/41412], Loss: 1.6704, 

Epoch [5/25], Step [2800/41412], Loss: 1.8802, Perplexity: 6.55484
Epoch [5/25], Step [2900/41412], Loss: 2.0746, Perplexity: 7.96118
Epoch [5/25], Step [3000/41412], Loss: 1.9318, Perplexity: 6.90193
Epoch [5/25], Step [3100/41412], Loss: 1.8510, Perplexity: 6.36627
Epoch [5/25], Step [3200/41412], Loss: 2.2853, Perplexity: 9.82914
Epoch [5/25], Step [3300/41412], Loss: 2.0056, Perplexity: 7.43062
Epoch [5/25], Step [3400/41412], Loss: 1.8574, Perplexity: 6.40684
Epoch [5/25], Step [3500/41412], Loss: 2.0561, Perplexity: 7.81557
Epoch [5/25], Step [3600/41412], Loss: 1.6683, Perplexity: 5.30319
Epoch [5/25], Step [3700/41412], Loss: 1.8676, Perplexity: 6.47251
Epoch [5/25], Step [3800/41412], Loss: 1.8744, Perplexity: 6.51677
Epoch [5/25], Step [3900/41412], Loss: 1.9232, Perplexity: 6.84257
Epoch [5/25], Step [4000/41412], Loss: 1.8227, Perplexity: 6.18832
Epoch [5/25], Step [4100/41412], Loss: 1.8157, Perplexity: 6.14563
Epoch [5/25], Step [4200/41412], Loss: 1.7119, Perplexity: 5.5

Epoch [5/25], Step [14900/41412], Loss: 2.4379, Perplexity: 11.4495
Epoch [5/25], Step [15000/41412], Loss: 1.7691, Perplexity: 5.86569
Epoch [5/25], Step [15100/41412], Loss: 1.5701, Perplexity: 4.80718
Epoch [5/25], Step [15200/41412], Loss: 2.0227, Perplexity: 7.55897
Epoch [5/25], Step [15300/41412], Loss: 1.6998, Perplexity: 5.47319
Epoch [5/25], Step [15400/41412], Loss: 1.7549, Perplexity: 5.78294
Epoch [5/25], Step [15500/41412], Loss: 1.7964, Perplexity: 6.02788
Epoch [5/25], Step [15600/41412], Loss: 2.1809, Perplexity: 8.85391
Epoch [5/25], Step [15700/41412], Loss: 2.0690, Perplexity: 7.91709
Epoch [5/25], Step [15800/41412], Loss: 1.8212, Perplexity: 6.17906
Epoch [5/25], Step [15900/41412], Loss: 2.2694, Perplexity: 9.67377
Epoch [5/25], Step [16000/41412], Loss: 2.0199, Perplexity: 7.53799
Epoch [5/25], Step [16100/41412], Loss: 1.7793, Perplexity: 5.92590
Epoch [5/25], Step [16200/41412], Loss: 2.5399, Perplexity: 12.6788
Epoch [5/25], Step [16300/41412], Loss: 1.9081, 

Epoch [5/25], Step [26900/41412], Loss: 2.3497, Perplexity: 10.4825
Epoch [5/25], Step [27000/41412], Loss: 2.2311, Perplexity: 9.30970
Epoch [5/25], Step [27100/41412], Loss: 1.7163, Perplexity: 5.56399
Epoch [5/25], Step [27200/41412], Loss: 1.9929, Perplexity: 7.33697
Epoch [5/25], Step [27300/41412], Loss: 2.0127, Perplexity: 7.48325
Epoch [5/25], Step [27400/41412], Loss: 2.8311, Perplexity: 16.9641
Epoch [5/25], Step [27500/41412], Loss: 1.8262, Perplexity: 6.21050
Epoch [5/25], Step [27600/41412], Loss: 1.8681, Perplexity: 6.47611
Epoch [5/25], Step [27700/41412], Loss: 1.8485, Perplexity: 6.35047
Epoch [5/25], Step [27800/41412], Loss: 2.5858, Perplexity: 13.2737
Epoch [5/25], Step [27900/41412], Loss: 2.6113, Perplexity: 13.6168
Epoch [5/25], Step [28000/41412], Loss: 2.0555, Perplexity: 7.81074
Epoch [5/25], Step [28100/41412], Loss: 2.2617, Perplexity: 9.59901
Epoch [5/25], Step [28200/41412], Loss: 1.7318, Perplexity: 5.65073
Epoch [5/25], Step [28300/41412], Loss: 2.3202, 

Epoch [5/25], Step [38900/41412], Loss: 2.0758, Perplexity: 7.97075
Epoch [5/25], Step [39000/41412], Loss: 2.4607, Perplexity: 11.7129
Epoch [5/25], Step [39100/41412], Loss: 1.9833, Perplexity: 7.26696
Epoch [5/25], Step [39200/41412], Loss: 1.9039, Perplexity: 6.71216
Epoch [5/25], Step [39300/41412], Loss: 2.4510, Perplexity: 11.5994
Epoch [5/25], Step [39400/41412], Loss: 1.8373, Perplexity: 6.27958
Epoch [5/25], Step [39500/41412], Loss: 2.7639, Perplexity: 15.8609
Epoch [5/25], Step [39600/41412], Loss: 1.8712, Perplexity: 6.49629
Epoch [5/25], Step [39700/41412], Loss: 1.8757, Perplexity: 6.52557
Epoch [5/25], Step [39800/41412], Loss: 1.5729, Perplexity: 4.82040
Epoch [5/25], Step [39900/41412], Loss: 2.0524, Perplexity: 7.78650
Epoch [5/25], Step [40000/41412], Loss: 1.6563, Perplexity: 5.24002
Epoch [5/25], Step [40100/41412], Loss: 1.8137, Perplexity: 6.13293
Epoch [5/25], Step [40200/41412], Loss: 1.9313, Perplexity: 6.89860
Epoch [5/25], Step [40300/41412], Loss: 1.7391, 

Epoch [6/25], Step [9600/41412], Loss: 1.8948, Perplexity: 6.65106
Epoch [6/25], Step [9700/41412], Loss: 2.4594, Perplexity: 11.6972
Epoch [6/25], Step [9800/41412], Loss: 1.7080, Perplexity: 5.51807
Epoch [6/25], Step [9900/41412], Loss: 1.7137, Perplexity: 5.54924
Epoch [6/25], Step [10000/41412], Loss: 1.6104, Perplexity: 5.0046
Epoch [6/25], Step [10100/41412], Loss: 1.8342, Perplexity: 6.26042
Epoch [6/25], Step [10200/41412], Loss: 1.8666, Perplexity: 6.46661
Epoch [6/25], Step [10300/41412], Loss: 1.8221, Perplexity: 6.18473
Epoch [6/25], Step [10400/41412], Loss: 2.0444, Perplexity: 7.72489
Epoch [6/25], Step [10500/41412], Loss: 1.5519, Perplexity: 4.72069
Epoch [6/25], Step [10600/41412], Loss: 2.2971, Perplexity: 9.94568
Epoch [6/25], Step [10700/41412], Loss: 2.4399, Perplexity: 11.4720
Epoch [6/25], Step [10800/41412], Loss: 1.9049, Perplexity: 6.71871
Epoch [6/25], Step [10900/41412], Loss: 2.0504, Perplexity: 7.77060
Epoch [6/25], Step [11000/41412], Loss: 1.8011, Perpl

Epoch [6/25], Step [21600/41412], Loss: 2.3562, Perplexity: 10.5512
Epoch [6/25], Step [21700/41412], Loss: 2.2336, Perplexity: 9.33389
Epoch [6/25], Step [21800/41412], Loss: 1.4012, Perplexity: 4.06013
Epoch [6/25], Step [21900/41412], Loss: 1.6930, Perplexity: 5.43602
Epoch [6/25], Step [22000/41412], Loss: 2.0204, Perplexity: 7.54118
Epoch [6/25], Step [22100/41412], Loss: 2.1197, Perplexity: 8.32875
Epoch [6/25], Step [22200/41412], Loss: 1.9125, Perplexity: 6.770260
Epoch [6/25], Step [22300/41412], Loss: 1.8127, Perplexity: 6.12723
Epoch [6/25], Step [22400/41412], Loss: 2.0148, Perplexity: 7.49940
Epoch [6/25], Step [22500/41412], Loss: 2.3498, Perplexity: 10.4839
Epoch [6/25], Step [22600/41412], Loss: 1.6266, Perplexity: 5.08666
Epoch [6/25], Step [22700/41412], Loss: 2.5182, Perplexity: 12.4057
Epoch [6/25], Step [22800/41412], Loss: 1.7255, Perplexity: 5.61535
Epoch [6/25], Step [22900/41412], Loss: 2.0351, Perplexity: 7.65321
Epoch [6/25], Step [23000/41412], Loss: 1.3883,

Epoch [6/25], Step [33600/41412], Loss: 2.0266, Perplexity: 7.58824
Epoch [6/25], Step [33700/41412], Loss: 1.8561, Perplexity: 6.39909
Epoch [6/25], Step [33800/41412], Loss: 1.7432, Perplexity: 5.71583
Epoch [6/25], Step [33900/41412], Loss: 1.7284, Perplexity: 5.63177
Epoch [6/25], Step [34000/41412], Loss: 1.9912, Perplexity: 7.32464
Epoch [6/25], Step [34100/41412], Loss: 2.3126, Perplexity: 10.1008
Epoch [6/25], Step [34200/41412], Loss: 2.1191, Perplexity: 8.32376
Epoch [6/25], Step [34300/41412], Loss: 2.1959, Perplexity: 8.98773
Epoch [6/25], Step [34400/41412], Loss: 1.7971, Perplexity: 6.03238
Epoch [6/25], Step [34500/41412], Loss: 1.5561, Perplexity: 4.74049
Epoch [6/25], Step [34600/41412], Loss: 1.9327, Perplexity: 6.90823
Epoch [6/25], Step [34700/41412], Loss: 1.5184, Perplexity: 4.56481
Epoch [6/25], Step [34800/41412], Loss: 1.9380, Perplexity: 6.94487
Epoch [6/25], Step [34900/41412], Loss: 2.1780, Perplexity: 8.82900
Epoch [6/25], Step [35000/41412], Loss: 1.8511, 

Epoch [7/25], Step [4300/41412], Loss: 1.5734, Perplexity: 4.82301
Epoch [7/25], Step [4400/41412], Loss: 2.1155, Perplexity: 8.29366
Epoch [7/25], Step [4500/41412], Loss: 2.1238, Perplexity: 8.36276
Epoch [7/25], Step [4600/41412], Loss: 1.9471, Perplexity: 7.00859
Epoch [7/25], Step [4700/41412], Loss: 1.9633, Perplexity: 7.12316
Epoch [7/25], Step [4800/41412], Loss: 2.6271, Perplexity: 13.8342
Epoch [7/25], Step [4900/41412], Loss: 1.6269, Perplexity: 5.08816
Epoch [7/25], Step [5000/41412], Loss: 1.5289, Perplexity: 4.61320
Epoch [7/25], Step [5100/41412], Loss: 2.3962, Perplexity: 10.9812
Epoch [7/25], Step [5200/41412], Loss: 2.0394, Perplexity: 7.68612
Epoch [7/25], Step [5300/41412], Loss: 2.7512, Perplexity: 15.6611
Epoch [7/25], Step [5400/41412], Loss: 2.3208, Perplexity: 10.1835
Epoch [7/25], Step [5500/41412], Loss: 2.2206, Perplexity: 9.21257
Epoch [7/25], Step [5600/41412], Loss: 2.2129, Perplexity: 9.14211
Epoch [7/25], Step [5700/41412], Loss: 1.9026, Perplexity: 6.7

Epoch [7/25], Step [16400/41412], Loss: 1.9970, Perplexity: 7.36662
Epoch [7/25], Step [16500/41412], Loss: 1.7869, Perplexity: 5.97094
Epoch [7/25], Step [16600/41412], Loss: 1.6925, Perplexity: 5.43312
Epoch [7/25], Step [16700/41412], Loss: 1.5428, Perplexity: 4.67761
Epoch [7/25], Step [16800/41412], Loss: 2.3334, Perplexity: 10.3125
Epoch [7/25], Step [16900/41412], Loss: 2.2452, Perplexity: 9.44194
Epoch [7/25], Step [17000/41412], Loss: 1.7011, Perplexity: 5.47978
Epoch [7/25], Step [17100/41412], Loss: 1.8869, Perplexity: 6.59890
Epoch [7/25], Step [17200/41412], Loss: 2.1855, Perplexity: 8.89480
Epoch [7/25], Step [17300/41412], Loss: 2.1981, Perplexity: 9.00824
Epoch [7/25], Step [17400/41412], Loss: 1.7692, Perplexity: 5.86601
Epoch [7/25], Step [17500/41412], Loss: 1.9966, Perplexity: 7.36410
Epoch [7/25], Step [17600/41412], Loss: 2.0977, Perplexity: 8.14719
Epoch [7/25], Step [17700/41412], Loss: 1.6606, Perplexity: 5.26236
Epoch [7/25], Step [17800/41412], Loss: 1.9830, 

Epoch [7/25], Step [28400/41412], Loss: 1.8780, Perplexity: 6.54014
Epoch [7/25], Step [28500/41412], Loss: 1.8614, Perplexity: 6.432902
Epoch [7/25], Step [28600/41412], Loss: 1.9867, Perplexity: 7.29115
Epoch [7/25], Step [28700/41412], Loss: 2.1078, Perplexity: 8.229976
Epoch [7/25], Step [28800/41412], Loss: 1.9358, Perplexity: 6.92937
Epoch [7/25], Step [28900/41412], Loss: 2.2815, Perplexity: 9.79121
Epoch [7/25], Step [29000/41412], Loss: 1.9435, Perplexity: 6.98308
Epoch [7/25], Step [29100/41412], Loss: 1.7552, Perplexity: 5.78445
Epoch [7/25], Step [29200/41412], Loss: 2.6797, Perplexity: 14.5800
Epoch [7/25], Step [29300/41412], Loss: 2.0899, Perplexity: 8.08389
Epoch [7/25], Step [29400/41412], Loss: 2.0870, Perplexity: 8.06077
Epoch [7/25], Step [29500/41412], Loss: 1.8618, Perplexity: 6.43515
Epoch [7/25], Step [29600/41412], Loss: 1.8425, Perplexity: 6.31200
Epoch [7/25], Step [29700/41412], Loss: 2.5208, Perplexity: 12.4389
Epoch [7/25], Step [29800/41412], Loss: 1.6296

Epoch [7/25], Step [40400/41412], Loss: 2.5385, Perplexity: 12.6611
Epoch [7/25], Step [40500/41412], Loss: 2.0324, Perplexity: 7.63220
Epoch [7/25], Step [40600/41412], Loss: 1.8337, Perplexity: 6.25702
Epoch [7/25], Step [40700/41412], Loss: 1.8797, Perplexity: 6.55183
Epoch [7/25], Step [40800/41412], Loss: 1.9669, Perplexity: 7.14881
Epoch [7/25], Step [40900/41412], Loss: 2.2799, Perplexity: 9.77544
Epoch [7/25], Step [41000/41412], Loss: 2.4608, Perplexity: 11.7142
Epoch [7/25], Step [41100/41412], Loss: 2.0295, Perplexity: 7.61038
Epoch [7/25], Step [41200/41412], Loss: 1.8026, Perplexity: 6.06531
Epoch [7/25], Step [41300/41412], Loss: 2.3226, Perplexity: 10.2025
Epoch [7/25], Step [41400/41412], Loss: 1.7036, Perplexity: 5.49341
Epoch [8/25], Step [100/41412], Loss: 2.5088, Perplexity: 12.289760
Epoch [8/25], Step [200/41412], Loss: 1.9036, Perplexity: 6.70993
Epoch [8/25], Step [300/41412], Loss: 1.8618, Perplexity: 6.43502
Epoch [8/25], Step [400/41412], Loss: 1.8300, Perple

Epoch [8/25], Step [11200/41412], Loss: 1.5516, Perplexity: 4.71924
Epoch [8/25], Step [11300/41412], Loss: 1.8013, Perplexity: 6.05763
Epoch [8/25], Step [11400/41412], Loss: 2.0080, Perplexity: 7.44855
Epoch [8/25], Step [11500/41412], Loss: 2.5656, Perplexity: 13.0087
Epoch [8/25], Step [11600/41412], Loss: 1.8232, Perplexity: 6.19154
Epoch [8/25], Step [11700/41412], Loss: 1.5191, Perplexity: 4.56823
Epoch [8/25], Step [11800/41412], Loss: 2.3020, Perplexity: 9.99379
Epoch [8/25], Step [11900/41412], Loss: 2.0750, Perplexity: 7.96451
Epoch [8/25], Step [12000/41412], Loss: 1.8101, Perplexity: 6.11110
Epoch [8/25], Step [12100/41412], Loss: 1.8553, Perplexity: 6.39341
Epoch [8/25], Step [12200/41412], Loss: 1.4843, Perplexity: 4.41189
Epoch [8/25], Step [12300/41412], Loss: 2.5675, Perplexity: 13.0330
Epoch [8/25], Step [12400/41412], Loss: 2.5873, Perplexity: 13.2941
Epoch [8/25], Step [12500/41412], Loss: 2.1127, Perplexity: 8.27098
Epoch [8/25], Step [12600/41412], Loss: 2.2134, 

Epoch [8/25], Step [23200/41412], Loss: 1.9252, Perplexity: 6.85637
Epoch [8/25], Step [23300/41412], Loss: 2.3228, Perplexity: 10.2043
Epoch [8/25], Step [23400/41412], Loss: 2.2540, Perplexity: 9.52532
Epoch [8/25], Step [23500/41412], Loss: 2.0350, Perplexity: 7.65251
Epoch [8/25], Step [23600/41412], Loss: 1.8533, Perplexity: 6.38081
Epoch [8/25], Step [23700/41412], Loss: 2.1737, Perplexity: 8.79042
Epoch [8/25], Step [23800/41412], Loss: 2.2430, Perplexity: 9.42127
Epoch [8/25], Step [23900/41412], Loss: 2.0108, Perplexity: 7.46921
Epoch [8/25], Step [24000/41412], Loss: 2.1858, Perplexity: 8.89737
Epoch [8/25], Step [24100/41412], Loss: 2.7239, Perplexity: 15.2389
Epoch [8/25], Step [24200/41412], Loss: 1.9309, Perplexity: 6.89594
Epoch [8/25], Step [24300/41412], Loss: 2.6455, Perplexity: 14.0905
Epoch [8/25], Step [24400/41412], Loss: 2.0731, Perplexity: 7.94929
Epoch [8/25], Step [24500/41412], Loss: 1.6358, Perplexity: 5.13355
Epoch [8/25], Step [24600/41412], Loss: 1.8549, 

Epoch [8/25], Step [35200/41412], Loss: 1.5268, Perplexity: 4.60333
Epoch [8/25], Step [35300/41412], Loss: 1.7683, Perplexity: 5.86063
Epoch [8/25], Step [35400/41412], Loss: 2.1269, Perplexity: 8.38908
Epoch [8/25], Step [35500/41412], Loss: 1.4751, Perplexity: 4.37169
Epoch [8/25], Step [35600/41412], Loss: 2.3634, Perplexity: 10.6269
Epoch [8/25], Step [35700/41412], Loss: 1.9150, Perplexity: 6.78691
Epoch [8/25], Step [35800/41412], Loss: 1.8402, Perplexity: 6.29763
Epoch [8/25], Step [35900/41412], Loss: 1.5568, Perplexity: 4.74347
Epoch [8/25], Step [36000/41412], Loss: 1.9159, Perplexity: 6.79298
Epoch [8/25], Step [36100/41412], Loss: 2.4389, Perplexity: 11.4606
Epoch [8/25], Step [36200/41412], Loss: 1.6422, Perplexity: 5.16672
Epoch [8/25], Step [36300/41412], Loss: 1.8887, Perplexity: 6.61085
Epoch [8/25], Step [36400/41412], Loss: 1.8714, Perplexity: 6.49741
Epoch [8/25], Step [36500/41412], Loss: 1.9434, Perplexity: 6.98288
Epoch [8/25], Step [36600/41412], Loss: 1.8791, 

Epoch [9/25], Step [5900/41412], Loss: 1.8366, Perplexity: 6.27509
Epoch [9/25], Step [6000/41412], Loss: 2.1006, Perplexity: 8.17140
Epoch [9/25], Step [6100/41412], Loss: 1.9289, Perplexity: 6.88223
Epoch [9/25], Step [6200/41412], Loss: 1.8538, Perplexity: 6.38398
Epoch [9/25], Step [6300/41412], Loss: 2.2483, Perplexity: 9.47188
Epoch [9/25], Step [6400/41412], Loss: 2.3549, Perplexity: 10.5370
Epoch [9/25], Step [6500/41412], Loss: 2.1578, Perplexity: 8.65232
Epoch [9/25], Step [6600/41412], Loss: 1.9464, Perplexity: 7.00310
Epoch [9/25], Step [6700/41412], Loss: 1.9745, Perplexity: 7.20307
Epoch [9/25], Step [6800/41412], Loss: 1.8614, Perplexity: 6.43261
Epoch [9/25], Step [6900/41412], Loss: 1.7825, Perplexity: 5.94469
Epoch [9/25], Step [7000/41412], Loss: 1.5566, Perplexity: 4.74255
Epoch [9/25], Step [7100/41412], Loss: 1.8387, Perplexity: 6.28843
Epoch [9/25], Step [7200/41412], Loss: 2.1117, Perplexity: 8.26267
Epoch [9/25], Step [7300/41412], Loss: 2.1171, Perplexity: 8.3

Epoch [9/25], Step [18000/41412], Loss: 1.8173, Perplexity: 6.15511
Epoch [9/25], Step [18100/41412], Loss: 2.1121, Perplexity: 8.26528
Epoch [9/25], Step [18200/41412], Loss: 2.6308, Perplexity: 13.8855
Epoch [9/25], Step [18300/41412], Loss: 1.2677, Perplexity: 3.55278
Epoch [9/25], Step [18400/41412], Loss: 1.7863, Perplexity: 5.96732
Epoch [9/25], Step [18500/41412], Loss: 2.5035, Perplexity: 12.2251
Epoch [9/25], Step [18600/41412], Loss: 1.8842, Perplexity: 6.58126
Epoch [9/25], Step [18700/41412], Loss: 2.0282, Perplexity: 7.60049
Epoch [9/25], Step [18800/41412], Loss: 2.1682, Perplexity: 8.74237
Epoch [9/25], Step [18900/41412], Loss: 1.9694, Perplexity: 7.16667
Epoch [9/25], Step [19000/41412], Loss: 1.9417, Perplexity: 6.97062
Epoch [9/25], Step [19100/41412], Loss: 1.7248, Perplexity: 5.61123
Epoch [9/25], Step [19200/41412], Loss: 1.8134, Perplexity: 6.13139
Epoch [9/25], Step [19300/41412], Loss: 1.8240, Perplexity: 6.19682
Epoch [9/25], Step [19400/41412], Loss: 1.5803, 

Epoch [9/25], Step [30000/41412], Loss: 1.8386, Perplexity: 6.28799
Epoch [9/25], Step [30100/41412], Loss: 1.6547, Perplexity: 5.23160
Epoch [9/25], Step [30200/41412], Loss: 2.0252, Perplexity: 7.57725
Epoch [9/25], Step [30300/41412], Loss: 1.9547, Perplexity: 7.06220
Epoch [9/25], Step [30400/41412], Loss: 2.3432, Perplexity: 10.4147
Epoch [9/25], Step [30500/41412], Loss: 1.8720, Perplexity: 6.50102
Epoch [9/25], Step [30600/41412], Loss: 2.2171, Perplexity: 9.18076
Epoch [9/25], Step [30700/41412], Loss: 1.6202, Perplexity: 5.05429
Epoch [9/25], Step [30800/41412], Loss: 2.1592, Perplexity: 8.66421
Epoch [9/25], Step [30900/41412], Loss: 1.6039, Perplexity: 4.97266
Epoch [9/25], Step [31000/41412], Loss: 1.9590, Perplexity: 7.09235
Epoch [9/25], Step [31100/41412], Loss: 1.9972, Perplexity: 7.36838
Epoch [9/25], Step [31200/41412], Loss: 1.7007, Perplexity: 5.47780
Epoch [9/25], Step [31300/41412], Loss: 1.8790, Perplexity: 6.54684
Epoch [9/25], Step [31400/41412], Loss: 1.4857, 

Epoch [10/25], Step [600/41412], Loss: 1.9670, Perplexity: 7.14940
Epoch [10/25], Step [700/41412], Loss: 1.9770, Perplexity: 7.22118
Epoch [10/25], Step [800/41412], Loss: 1.8443, Perplexity: 6.32397
Epoch [10/25], Step [900/41412], Loss: 1.7835, Perplexity: 5.95040
Epoch [10/25], Step [1000/41412], Loss: 1.7942, Perplexity: 6.0150
Epoch [10/25], Step [1100/41412], Loss: 1.7675, Perplexity: 5.85610
Epoch [10/25], Step [1200/41412], Loss: 2.0946, Perplexity: 8.12225
Epoch [10/25], Step [1300/41412], Loss: 1.7573, Perplexity: 5.79709
Epoch [10/25], Step [1400/41412], Loss: 2.0848, Perplexity: 8.04286
Epoch [10/25], Step [1500/41412], Loss: 1.8970, Perplexity: 6.66592
Epoch [10/25], Step [1600/41412], Loss: 2.3509, Perplexity: 10.4952
Epoch [10/25], Step [1700/41412], Loss: 2.0510, Perplexity: 7.77559
Epoch [10/25], Step [1800/41412], Loss: 2.2751, Perplexity: 9.72848
Epoch [10/25], Step [1900/41412], Loss: 2.0344, Perplexity: 7.647328
Epoch [10/25], Step [2000/41412], Loss: 2.1679, Perp

Epoch [10/25], Step [12600/41412], Loss: 1.7938, Perplexity: 6.01251
Epoch [10/25], Step [12700/41412], Loss: 1.8992, Perplexity: 6.68035
Epoch [10/25], Step [12800/41412], Loss: 1.8574, Perplexity: 6.40722
Epoch [10/25], Step [12900/41412], Loss: 1.7708, Perplexity: 5.87543
Epoch [10/25], Step [13000/41412], Loss: 2.0266, Perplexity: 7.58791
Epoch [10/25], Step [13100/41412], Loss: 2.2101, Perplexity: 9.11641
Epoch [10/25], Step [13200/41412], Loss: 1.3301, Perplexity: 3.78146
Epoch [10/25], Step [13300/41412], Loss: 1.6987, Perplexity: 5.46709
Epoch [10/25], Step [13400/41412], Loss: 1.6853, Perplexity: 5.39428
Epoch [10/25], Step [13500/41412], Loss: 1.6456, Perplexity: 5.18397
Epoch [10/25], Step [13600/41412], Loss: 1.6825, Perplexity: 5.37923
Epoch [10/25], Step [13700/41412], Loss: 2.8993, Perplexity: 18.1606
Epoch [10/25], Step [13800/41412], Loss: 2.1816, Perplexity: 8.86054
Epoch [10/25], Step [13900/41412], Loss: 1.9450, Perplexity: 6.99378
Epoch [10/25], Step [14000/41412],

Epoch [10/25], Step [24400/41412], Loss: 1.3255, Perplexity: 3.76428
Epoch [10/25], Step [24500/41412], Loss: 2.3992, Perplexity: 11.0145
Epoch [10/25], Step [24600/41412], Loss: 1.9068, Perplexity: 6.73178
Epoch [10/25], Step [24700/41412], Loss: 1.8210, Perplexity: 6.17780
Epoch [10/25], Step [24800/41412], Loss: 2.1071, Perplexity: 8.22479
Epoch [10/25], Step [24900/41412], Loss: 2.1996, Perplexity: 9.02189
Epoch [10/25], Step [25000/41412], Loss: 1.7319, Perplexity: 5.65129
Epoch [10/25], Step [25100/41412], Loss: 2.1333, Perplexity: 8.44302
Epoch [10/25], Step [25200/41412], Loss: 2.2520, Perplexity: 9.50657
Epoch [10/25], Step [25300/41412], Loss: 1.9006, Perplexity: 6.68971
Epoch [10/25], Step [25400/41412], Loss: 1.9699, Perplexity: 7.17024
Epoch [10/25], Step [25500/41412], Loss: 1.2702, Perplexity: 3.56158
Epoch [10/25], Step [25600/41412], Loss: 1.7339, Perplexity: 5.66291
Epoch [10/25], Step [25700/41412], Loss: 2.3507, Perplexity: 10.4925
Epoch [10/25], Step [25800/41412],

Epoch [10/25], Step [36200/41412], Loss: 1.8718, Perplexity: 6.50000
Epoch [10/25], Step [36300/41412], Loss: 1.9673, Perplexity: 7.15169
Epoch [10/25], Step [36400/41412], Loss: 1.7470, Perplexity: 5.73764
Epoch [10/25], Step [36500/41412], Loss: 1.6984, Perplexity: 5.46525
Epoch [10/25], Step [36600/41412], Loss: 1.8761, Perplexity: 6.52817
Epoch [10/25], Step [36700/41412], Loss: 2.2140, Perplexity: 9.15221
Epoch [10/25], Step [36800/41412], Loss: 1.9597, Perplexity: 7.09700
Epoch [10/25], Step [36900/41412], Loss: 1.9669, Perplexity: 7.14888
Epoch [10/25], Step [37000/41412], Loss: 2.1881, Perplexity: 8.91824
Epoch [10/25], Step [37100/41412], Loss: 1.9396, Perplexity: 6.95605
Epoch [10/25], Step [37200/41412], Loss: 1.9378, Perplexity: 6.94320
Epoch [10/25], Step [37300/41412], Loss: 1.6943, Perplexity: 5.44268
Epoch [10/25], Step [37400/41412], Loss: 1.8510, Perplexity: 6.36608
Epoch [10/25], Step [37500/41412], Loss: 2.0712, Perplexity: 7.93446
Epoch [10/25], Step [37600/41412],

Epoch [11/25], Step [6700/41412], Loss: 2.1347, Perplexity: 8.45416
Epoch [11/25], Step [6800/41412], Loss: 1.5167, Perplexity: 4.55742
Epoch [11/25], Step [6900/41412], Loss: 2.1700, Perplexity: 8.75836
Epoch [11/25], Step [7000/41412], Loss: 1.8018, Perplexity: 6.06031
Epoch [11/25], Step [7100/41412], Loss: 1.9919, Perplexity: 7.32956
Epoch [11/25], Step [7200/41412], Loss: 1.3145, Perplexity: 3.72301
Epoch [11/25], Step [7300/41412], Loss: 1.5699, Perplexity: 4.80642
Epoch [11/25], Step [7307/41412], Loss: 2.0314, Perplexity: 7.6245

<a id='step3'></a>
## Step 3: (Optional) Validate your Model

One way to check for overfitting is to measure performance on a validation set.  If you decide to do this **optional** task, you must first complete all of the steps in the next notebook in the sequence (**3_Inference.ipynb**); as part of that notebook, you will write and test code (specifically, the `sample` method in the `DecoderRNN` class) that uses your RNN decoder to generate captions.  That code will prove incredibly useful here. 

If you decide to validate your model, please do not edit the data loader in **data_loader.py**.  Instead, create a new file named **data_loader_val.py** containing the code for obtaining the data loader for the validation data.  You can access:
- the validation images at filepath `'/opt/cocoapi/images/val2014/'`, and
- the validation image caption annotation file at filepath `'/opt/cocoapi/annotations/captions_val2014.json'`.
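
As a starting point for **data_loader_val.py**, the sketch below shows one way to group the reference captions by image using only the standard library.  It assumes the annotation file follows the usual COCO captions layout (an `images` list with `id` and `file_name` fields, and an `annotations` list with `image_id` and `caption` fields).  The function names `group_captions_by_image` and `load_val_references` are hypothetical and not part of the project code.

```python
import json
from collections import defaultdict


def group_captions_by_image(annotations):
    """Map each image id to its file name and list of reference captions.

    `annotations` is the parsed content of a COCO captions file (a dict).
    Assumption: every annotation refers to an image listed under "images".
    """
    file_names = {img["id"]: img["file_name"] for img in annotations["images"]}
    captions = defaultdict(list)
    for ann in annotations["annotations"]:
        captions[ann["image_id"]].append(ann["caption"].strip())
    return {
        image_id: {"file_name": file_names[image_id], "captions": caps}
        for image_id, caps in captions.items()
    }


def load_val_references(annotation_path):
    """Read the annotation file from disk and group its captions."""
    with open(annotation_path) as f:
        return group_captions_by_image(json.load(f))
```

Calling `load_val_references('/opt/cocoapi/annotations/captions_val2014.json')` should then give you, for each validation image, the file name to load and the reference captions to compare your predictions against.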

The suggested approach to validating your model involves creating a JSON file such as [this one](https://github.com/cocodataset/cocoapi/blob/master/results/captions_val2014_fakecap_results.json) containing your model's predicted captions for the validation images.  Then, you can write your own script or use one that you [find online](https://github.com/tylin/coco-caption) to calculate the BLEU score of your model.  You can read more about the BLEU score, along with other evaluation metrics (such as METEOR and CIDEr), in section 4.1 of [this paper](https://arxiv.org/pdf/1411.4555.pdf).  For more information about how to use the annotation file, check out the [website](http://cocodataset.org/#download) for the COCO dataset.
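
A minimal sketch of writing such a results file, assuming the same layout as the linked example (a JSON list with one `image_id`/`caption` entry per image).  The helper name `save_caption_results` is hypothetical, and the predicted captions would come from your own `sample` method.

```python
import json


def save_caption_results(predictions, path):
    """Write predicted captions as a COCO-style results file.

    `predictions` maps an integer image id to one predicted caption string.
    Returns the list that was written, sorted by image id.
    """
    results = [
        {"image_id": int(image_id), "caption": caption}
        for image_id, caption in sorted(predictions.items())
    ]
    with open(path, "w") as f:
        json.dump(results, f)
    return results
```

The resulting file can then be passed to whichever scoring script you choose, together with the validation annotation file.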

In [None]:
# (Optional) TODO: Validate your model.