In this notebook, we train the CNN-RNN model.  

INDEX
- [Step 1](#step1): Training Setup
- [Step 2](#step2): Training the Model

<a id='step1'></a>
## Step 1: Training Setup

In this step of the notebook, we will customize the training of our CNN-RNN model by specifying hyperparameters and setting other options that are important to the training procedure.

### Task #1

We begin by setting the following variables:
- `batch_size` - the batch size of each training batch.  It is the number of image-caption pairs used to update the model weights in each training step. 
- `vocab_threshold` - the minimum word count threshold.  Note that a larger threshold will result in a smaller vocabulary, whereas a smaller threshold will include rarer words and result in a larger vocabulary.  
- `vocab_from_file` - a Boolean that decides whether to load the vocabulary from file. 
- `embed_size` - the dimensionality of the image and word embeddings.  
- `hidden_size` - the number of features in the hidden state of the RNN decoder.  
- `num_epochs` - the number of epochs to train the model.  For a sense of scale, [this paper](https://arxiv.org/pdf/1502.03044.pdf) trained a captioning model on a single state-of-the-art GPU for 3 days.
- `save_every` - determines how often to save the model weights.  We set `save_every=1`, to save the model weights after each epoch.  This way, after the `i`th epoch, the encoder and decoder weights will be saved in the `models/` folder as `encoder-i.pkl` and `decoder-i.pkl`, respectively.
- `print_every` - determines how often to print the batch loss to the Jupyter notebook while training.
- `log_file` - the name of the text file that records the loss and perplexity at every training step.
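
To make the effect of `vocab_threshold` concrete, here is a minimal sketch on three toy captions. The helper `build_vocab` is hypothetical and is not the vocabulary class used by `data_loader`; it assumes whitespace tokenization, ignores special tokens, and assumes a word is kept when its count is at least the threshold.

```python
from collections import Counter

def build_vocab(captions, vocab_threshold):
    """Return the sorted words that occur at least `vocab_threshold` times.

    Hypothetical helper for illustration only; the real vocabulary
    is built inside the data loader.
    """
    counts = Counter(word for caption in captions for word in caption.lower().split())
    return sorted(word for word, count in counts.items() if count >= vocab_threshold)

toy_captions = [
    "a dog runs on the grass",
    "a dog sits on a couch",
    "a cat sleeps on the couch",
]

# A larger threshold drops rarer words and shrinks the vocabulary.
for threshold in (1, 2, 3):
    vocab = build_vocab(toy_captions, threshold)
    print(threshold, len(vocab), vocab)
```

On this toy data the vocabulary shrinks from 10 words at a threshold of 1 to 5 words at 2 and 2 words at 3.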

In [None]:
import torch
torch.cuda.empty_cache()

In [None]:
import torch
import torch.nn as nn
from torchvision import transforms
import sys
import math

batch_size = 32            # batch size
vocab_threshold = 5        # minimum word count threshold
vocab_from_file = False   # if True, load existing vocab file
embed_size = 512           # dimensionality of image and word embeddings
hidden_size = 512          # number of features in hidden state of the RNN decoder
num_epochs = 3           # number of training epochs
save_every = 1             # determines frequency of saving model weights
print_every = 100          # determines window for printing average loss
log_file = 'training_log.txt'       # name of file with saved training loss and perplexity

transform_train = transforms.Compose([ 
    transforms.Resize(256),                          # smaller edge of image resized to 256
    transforms.RandomCrop(224),                      # get 224x224 crop from random location
    transforms.RandomHorizontalFlip(),               # horizontally flip image with probability=0.5
    transforms.ToTensor(),                           # convert the PIL Image to a tensor
    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                         (0.229, 0.224, 0.225))])

In [None]:
%load_ext autoreload
%autoreload 2
from model import EncoderCNN, DecoderRNN
from data_loader import get_loader

data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=vocab_from_file)

vocab_size = len(data_loader.dataset.vocab)


encoder = EncoderCNN(embed_size)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
encoder.to(device)
decoder.to(device)

# Define the loss function. 
criterion = nn.CrossEntropyLoss().to(device)

# Specify the learnable parameters of the model.
params = list(decoder.parameters()) + list(encoder.linear.parameters())

# Define the optimizer.
optimizer = torch.optim.Adam(params, lr=0.001)

# Set the total number of training steps per epoch.
total_step = math.ceil(len(data_loader.dataset.caption_lengths) / data_loader.batch_sampler.batch_size)

loading annotations into memory...
Done (t=1.10s)
creating index...
index created!
[0/414113] Tokenizing captions...
[100000/414113] Tokenizing captions...
[200000/414113] Tokenizing captions...
[300000/414113] Tokenizing captions...
[400000/414113] Tokenizing captions...
loading annotations into memory...
Done (t=0.98s)
creating index...


index created!
Obtaining caption lengths...


100%|██████████| 414113/414113 [01:14<00:00, 5579.77it/s]
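
The number of training steps per epoch, `total_step`, follows directly from the number of captions and the batch size. The sketch below repeats that arithmetic for a few batch sizes; the function name `steps_per_epoch` is introduced here for illustration, and the caption count is taken from the tokenizing output above.

```python
import math

def steps_per_epoch(num_captions, batch_size):
    """Number of batches needed to cover all captions once (last batch may be partial)."""
    return math.ceil(num_captions / batch_size)

num_captions = 414113  # caption count reported in the output above

for batch_size in (32, 64, 128):
    print(batch_size, steps_per_epoch(num_captions, batch_size))
```

Doubling the batch size roughly halves the number of steps per epoch, so `print_every` may need adjusting when `batch_size` changes.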


<a id='step2'></a>
## Step 2: Training the Model

Once the cells in **Step 1** have run successfully, we can start training.
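
The training loop logs perplexity alongside the loss. Perplexity is the exponential of the cross-entropy loss, which is what `np.exp(loss.item())` computes in the loop. The sketch below shows the relationship using the standard library only; the function name `perplexity` is introduced here for illustration.

```python
import math

def perplexity(cross_entropy_loss):
    """Perplexity is the exponential of the (natural-log) cross-entropy loss."""
    return math.exp(cross_entropy_loss)

# Two loss values taken from the training output in this notebook.
for loss in (3.6071, 2.4018):
    print('Loss: %.4f, Perplexity: %.4f' % (loss, perplexity(loss)))

# A model that guesses uniformly over V words has loss ln(V) and perplexity V.
print(perplexity(math.log(1000)))
```

One way to read the number: a perplexity of about 11 means the model is, on average, roughly as uncertain as if it were choosing uniformly among 11 words at each position.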

In [None]:
import torch.utils.data as data
import numpy as np
import os

# Make sure the folder for the saved weights exists.
os.makedirs('./models', exist_ok=True)

# Open the training log file.
f = open(log_file, 'w')


for epoch in range(1, num_epochs+1):
    for i_step in range(1, total_step+1):        

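        # Sample the indices of the captions for this batch.
        indices = data_loader.dataset.get_train_indices()
        
        # Create a sampler over those indices and assign it to the data loader.
        new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
        data_loader.batch_sampler.sampler = new_sampler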
        
        # Obtain the batch.
        images, captions = next(iter(data_loader))

        # Move batch of images and captions to GPU if CUDA is available.
        images = images.to(device)
        captions = captions.to(device)
        
        # Zero the gradients.
        decoder.zero_grad()
        encoder.zero_grad()
        
        # Pass the inputs through the CNN-RNN model.
        features = encoder(images)
        outputs = decoder(features, captions)
        
        # Calculate the batch loss.
        loss = criterion(outputs.view(-1, vocab_size), captions.view(-1))
        
        # Backward pass.
        loss.backward()
        
        # Update the parameters with the optimizer.
        optimizer.step()
            
        # Get training statistics.
        stats = 'Epoch [%d/%d], Step [%d/%d], Loss: %.4f, Perplexity: %5.4f' % (epoch, num_epochs, i_step, total_step, loss.item(), np.exp(loss.item()))
        
        # Print training statistics (on same line).
        print('\r' + stats, end="")
        sys.stdout.flush()
        
        # Print training statistics to file.
        f.write(stats + '\n')
        f.flush()
        
        # Print training statistics (on different line).
        if i_step % print_every == 0:
            print('\r' + stats)
            
    # Save the weights.
    if epoch % save_every == 0:
        torch.save(decoder.state_dict(), os.path.join('./models', 'decoder-%d.pkl' % epoch))
        torch.save(encoder.state_dict(), os.path.join('./models', 'encoder-%d.pkl' % epoch))

# Close the training log file.
f.close()

Epoch [1/3], Step [100/3236], Loss: 3.6071, Perplexity: 36.8589
Epoch [1/3], Step [200/3236], Loss: 3.1628, Perplexity: 23.6371
Epoch [1/3], Step [300/3236], Loss: 3.4074, Perplexity: 30.1859
Epoch [1/3], Step [400/3236], Loss: 3.1240, Perplexity: 22.7376
Epoch [1/3], Step [500/3236], Loss: 2.9587, Perplexity: 19.2733
Epoch [1/3], Step [600/3236], Loss: 2.9260, Perplexity: 18.6520
Epoch [1/3], Step [700/3236], Loss: 2.7981, Perplexity: 16.4135
Epoch [1/3], Step [800/3236], Loss: 2.7010, Perplexity: 14.8949
Epoch [1/3], Step [900/3236], Loss: 2.5678, Perplexity: 13.0366
Epoch [1/3], Step [1000/3236], Loss: 2.4018, Perplexity: 11.0432
Epoch [1/3], Step [1100/3236], Loss: 2.3271, Perplexity: 10.2480
Epoch [1/3], Step [1200/3236], Loss: 2.4550, Perplexity: 11.6462
Epoch [1/3], Step [1300/3236], Loss: 2.3690, Perplexity: 10.6865
Epoch [1/3], Step [1400/3236], Loss: 2.4791, Perplexity: 11.9307
Epoch [1/3], Step [1500/3236], Loss: 2.4279, Perplexity: 11.3354
Epoch [1/3], Step [1600/3236], Los

The saved weights can now be loaded for inference.
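
To review how training progressed, the lines written to `log_file` can be parsed back into numbers. The sketch below assumes the line format produced by the `stats` string in the training loop; the names `LOG_PATTERN`, `parse_log_line`, and `parse_log` are introduced here for illustration. Lines that do not match, such as a truncated final line, are skipped.

```python
import re

# Matches lines such as:
# Epoch [1/3], Step [100/3236], Loss: 3.6071, Perplexity: 36.8589
LOG_PATTERN = re.compile(
    r"Epoch \[(\d+)/(\d+)\], Step \[(\d+)/(\d+)\], "
    r"Loss: ([\d.]+), Perplexity:\s*([\d.]+)"
)

def parse_log_line(line):
    """Return a dict for one log line, or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    epoch, _, step, _, loss, ppl = match.groups()
    return {
        "epoch": int(epoch),
        "step": int(step),
        "loss": float(loss),
        "perplexity": float(ppl),
    }

def parse_log(lines):
    """Parse an iterable of log lines, skipping any that do not match."""
    return [record for record in map(parse_log_line, lines) if record is not None]

sample_lines = [
    "Epoch [1/3], Step [100/3236], Loss: 3.6071, Perplexity: 36.8589",
    "Epoch [1/3], Step [200/3236], Loss: 3.1628, Perplexity: 23.6371",
    "Epoch [1/3], Step [1600/3236], Los",  # truncated line, skipped
]

records = parse_log(sample_lines)
print(records)
```

With the real log, the same function can be applied to an open file handle, for example `parse_log(open('training_log.txt'))`, and the resulting loss values plotted or averaged per epoch.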