# Computer Vision Nanodegree

## Project: Image Captioning

---

In this notebook, some template code has already been provided for you ... xyz.

Feel free to use the links below to navigate the notebook:
- [Step 1](#step1): Explore the Data Loader
- [Step 2](#step2): Tinker with the CNN Encoder
- [Step 3](#step3): Tinker with the RNN Decoder
- [Step 4](#step4): Calculate the Loss
- [Step 5](#step5): Celebrate!

<a id='step1'></a>
### Step 1: Explore the Data Loader

- look at sample images and captions.
- tell how data was preprocessed & saved in cocoapi (256 on shortest side)
- print some statistics (number of images)
- note to self: RETRAIN WITHOUT PAD WORD ASSIGNMENT
- show how vocab file is generated
- guide their play with `vocab` output
- tell them they should only amend the threshold ... vocab does have other args that should stay at their default values. mention `vocab_from_file`
- vocab is used now to prepare for the embedding layer
- size of vocab also determines dimensionality of output layer of LSTM & will be used to decode predicted captions. more on that soon
- uncomment one of the lines below to examine word2idx, idx2word. will use word2idx to convert captions to nums, idx2word later to convert predictions to words ...
- show how to get a batch of data
- describe how sampler works
- note they are allowed to change the data loader, but it's ok to leave it as-is. to pass the project, they just need to be able to defend their data loader
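
As a rough illustration of the notes above on the threshold and on `word2idx` / `idx2word`, the sketch below builds a toy vocabulary with a minimum word count, then converts a caption to integers and back. The special tokens, the helper names (`build_vocab`, `encode`, `decode`), and the "at least `vocab_threshold` occurrences" rule are assumptions made for illustration; the project's own vocabulary code may differ.

```python
from collections import Counter

# Hypothetical special tokens; the project's vocabulary may use different ones.
START, END, UNK = "<start>", "<end>", "<unk>"

def build_vocab(captions, vocab_threshold):
    """Keep words that appear at least `vocab_threshold` times (assumed rule)."""
    counts = Counter(word for c in captions for word in c.lower().split())
    kept = sorted(w for w, n in counts.items() if n >= vocab_threshold)
    words = [START, END, UNK] + kept
    word2idx = {w: i for i, w in enumerate(words)}
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word

def encode(caption, word2idx):
    """Convert a caption to integers; rare words map to the unknown token."""
    tokens = [START] + caption.lower().split() + [END]
    return [word2idx.get(t, word2idx[UNK]) for t in tokens]

def decode(indices, idx2word):
    """Convert integers back to words."""
    return " ".join(idx2word[i] for i in indices)

captions = ["a dog runs", "a dog sleeps", "a cat runs"]
word2idx, idx2word = build_vocab(captions, vocab_threshold=2)

encoded = encode("a cat runs", word2idx)
print(encoded)                    # [0, 3, 2, 5, 1]
print(decode(encoded, idx2word))  # <start> a <unk> runs <end>
```

Raising the threshold shrinks the vocabulary and sends more words to the unknown token, which is the trade-off to keep in mind when amending `vocab_threshold`.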

In [1]:
from data_loader import get_loader
from torchvision import transforms

# TODO: Set the minimum word count threshold.
vocab_threshold = 4

# TODO: Define a transform to pre-process the training images.
transform_train = transforms.Compose([ 
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(), 
    transforms.ToTensor(), 
    transforms.Normalize((0.485, 0.456, 0.406), 
                         (0.229, 0.224, 0.225))])

# TODO: Specify the batch size.
batch_size = 10

# Create the data loader.
data_loader = get_loader(transform=transform_train,
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=True)

Vocabulary successfully loaded from vocab.pkl file!
loading annotations into memory...


  0%|          | 0/414113 [00:00<?, ?it/s]

Done (t=0.55s)
creating index...
index created!
Obtaining caption lengths ...


100%|██████████| 414113/414113 [00:40<00:00, 10225.92it/s]


- when training the model later, note there is no need to separately populate the data loader and the vocab. the vocab is built in to the data loader ...
- check now that data_loader includes dataset, which includes vocab ... can also get word2idx and idx2word as above.

In [None]:
print('Number of words in vocabulary:', len(data_loader.dataset.vocab))
#print(data_loader.dataset.vocab.word2idx)
#print(data_loader.dataset.vocab.idx2word)

- describe how get_train_indices works, where we got the procedure for loading batches from
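
A minimal sketch of the sampling idea, based only on the comment in the next cell ("randomly sample a caption length, and sample indices with that length"). The signature, the `caption_lengths` list, and sampling with replacement are assumptions; the real `get_train_indices` in `data_loader.py` may differ.

```python
import random

def get_train_indices(caption_lengths, batch_size, rng=random):
    """Pick one caption length at random, then sample indices of captions with that length."""
    # Drawing from the list itself makes common lengths more likely to be picked.
    sel_length = rng.choice(caption_lengths)
    candidates = [i for i, n in enumerate(caption_lengths) if n == sel_length]
    # Assumption: sample with replacement, so short candidate lists still fill a batch.
    return [rng.choice(candidates) for _ in range(batch_size)]

# Toy data: the length (in tokens) of each caption in a tiny dataset.
caption_lengths = [12, 10, 12, 11, 12, 10, 11, 12]
indices = get_train_indices(caption_lengths, batch_size=4, rng=random.Random(0))
print("sampled indices:", indices)
print("lengths in batch:", {caption_lengths[i] for i in indices})
```

Because every caption in a batch has the same length, the captions can be stacked into one tensor without padding, which matches the single `captions.shape` printed below.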

In [4]:
import numpy as np
import torch.utils.data as data

# Randomly sample a caption length, and sample indices with that length.
indices = data_loader.dataset.get_train_indices()
print('sampled indices:', indices)
# Create and assign a batch sampler to retrieve a batch with the sampled indices.
new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
data_loader.batch_sampler.sampler = new_sampler
    
# Obtain the batch.
for batch in data_loader:
    images, captions = batch[0], batch[1]
    break
    
print('images.shape:',images.shape)
print('captions.shape:', captions.shape)

sampled indices: [82752, 355776, 164075, 407508, 92747, 376489, 368934, 121907, 231696, 244272]
images.shape: torch.Size([10, 3, 224, 224])
captions.shape: torch.Size([10, 12])


<a id='step2'></a>
### Step 2: Tinker with the CNN Encoder

- make note of module autoreload capability
- need to explain to them what `volatile` is for, etc

In [28]:
%load_ext autoreload
%autoreload 2

import torch
from torch.autograd import Variable

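def to_var(x, volatile=False):
    """Convert a PyTorch tensor to a Variable, moving it to the GPU if CUDA is available."""
    if torch.cuda.is_available():
        x = x.cuda()
    # volatile=True turns off gradient tracking (inference only).
    # The flag is deprecated as of PyTorch 0.4, where torch.no_grad() replaces it.
    return Variable(x, volatile=volatile)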

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


- provide them a CNN encoder.
- let them decide if they want to swap out the pre-trained architecture for another (make clear this is allowed). show how to do for vgg-16
- consider adding batch normalization if you like ...
- make any modifications to the EncoderCNN class you like. use this as a playground to make sure that your model is working well.
- the output should be a PyTorch Variable with shape `[batch_size, embed_size]`.

In [32]:
from model import EncoderCNN

# TODO: Specify the dimensionality of the image embedding.
embed_size = 256

# Initialize the encoder.
encoder = EncoderCNN(embed_size)
# Move the encoder to GPU if CUDA is available.
if torch.cuda.is_available():
    encoder = encoder.cuda()
    
# Convert last batch of images (from Step 1) to PyTorch Variable.   
images_var = to_var(images, volatile=True)
# Pass the images through the encoder.
features = encoder(images_var)

print('type(features):', type(features))
print('features.shape:', features.shape)

# Check that your encoder satisfies the requirements of the project.
if (type(features)==torch.autograd.variable.Variable) and (features.shape[0]==batch_size) and (features.shape[1]==embed_size):
    print('\nyou may proceed')

type(features): <class 'torch.autograd.variable.Variable'>
features.shape: torch.Size([10, 256])

you may proceed


<a id='step3'></a>
### Step 3: Tinker with the RNN Decoder

- they have to write the decoder here, and I plan to provide a unit test where they can check if their implementation is correct
- they have to decide the hidden size. note embed_size must be the same as for the encoder, and the output size must be the same as the vocab size.
- if they leave the notebook and come back, they should re-run every cell.

In [35]:
from model import DecoderRNN

# TODO: Specify the number of features in the hidden state of the RNN decoder.
hidden_size = 512

# Store the size of the vocabulary.
vocab_size = len(data_loader.dataset.vocab)

# Initialize the decoder.
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)
# Move the decoder to GPU if CUDA is available.
if torch.cuda.is_available():
    decoder = decoder.cuda()
    
# Convert last batch of captions (from Step 1) to a PyTorch Variable. 
captions_var = to_var(captions)
# Pass the encoder output and captions through the decoder.
outputs = decoder(features, captions_var)

print('type(outputs):', type(outputs))
print('outputs.shape:', outputs.shape)

# Check that your decoder satisfies the requirements of the project.
if (type(outputs)==torch.autograd.variable.Variable) and (outputs.shape[0]==batch_size) and (outputs.shape[1]==captions.shape[1]) and (outputs.shape[2]==vocab_size):
    print('\nyou may proceed')

type(outputs): <class 'torch.autograd.variable.Variable'>
outputs.shape: torch.Size([10, 12, 9955])

you may proceed


<a id='step4'></a>
### Step 4: Calculate the Loss

- I give the criterion. students have to calculate the loss
- still trying to figure out how to best assess this. maybe put in py file that's not hidden? tell students not to look unless they really need to? will appear in the next notebook anyway. may as well put it in a file that's not hidden ...
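
A minimal sketch of what the criterion computes, written in plain Python, not PyTorch. Decoder scores of shape `[batch_size, caption_length, vocab_size]` are flattened to one row per token, and the loss is the mean negative log-probability of each target word under a softmax. The helper `cross_entropy` and the toy numbers are assumptions for illustration only.

```python
import math

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of each target index under softmax(logits)."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[target]
    return total / len(targets)

batch_size, caption_length, vocab_size = 2, 3, 4

# Toy decoder scores: every word gets the same score (a uniform prediction).
outputs = [[[0.0] * vocab_size for _ in range(caption_length)]
           for _ in range(batch_size)]
captions = [[0, 2, 1], [0, 3, 1]]

# Same idea as outputs.view(-1, vocab_size) and captions.view(-1).
flat_outputs = [scores for caption_scores in outputs for scores in caption_scores]
flat_captions = [idx for caption in captions for idx in caption]

loss = cross_entropy(flat_outputs, flat_captions)
print(len(flat_outputs), len(flat_captions))  # 6 6
print(loss, math.log(vocab_size))
```

With uniform scores the loss equals `log(vocab_size)`. For the 9955-word vocabulary above that is about 9.2, which can serve as a rough reference point when checking the loss of an untrained decoder.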

In [None]:
import torch.nn as nn

criterion = nn.CrossEntropyLoss().cuda() if torch.cuda.is_available() else nn.CrossEntropyLoss()

# TODO: Calculate the batch loss.
loss = criterion(outputs.view(-1, vocab_size), captions_var.view(-1))

print('loss:', loss.data[0])

<a id='step5'></a>
### Step 5: Celebrate!

If you have completed all of the steps of this notebook, you're ready to move on to the next notebook in the project sequence.

When you are done with this notebook, please navigate to **1_Train.ipynb**, where you will train your own CNN-RNN model to generate image captions!