# Computer Vision Nanodegree

## Project: Image Captioning

In this notebook, some template code has already been provided for you ...

---

### Introduction

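In this notebook, you will explore the dataset, build the vocabulary, prepare the data loader, experiment with the CNN encoder and the RNN decoder, and calculate the loss.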

Feel free to use the links below to navigate the notebook:

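- [Step 1: Explore the Dataset](#Step-1:-Explore-the-Dataset)
- [Step 2: Build the Vocabulary](#Step-2:-Build-the-Vocabulary)
- [Step 3: Prepare the Data Loader](#Step-3:-Prepare-the-Data-Loader)
- [Step 4: Tinker with the CNN Encoder](#Step-4:-Tinker-with-the-CNN-Encoder)
- [Step 5: Tinker with RNN Decoder](#Step-5:-Tinker-with-RNN-Decoder)
- [Step 6: Calculate the Loss](#Step-6:-Calculate-the-Loss)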

### Step 1: Explore the Dataset

- Look at sample images and captions.
- Explain how the data was preprocessed and saved in cocoapi (images resized to 256 on the shortest side).
- Print some statistics (e.g., the number of images).
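
The statistics and the resizing rule listed above can be sketched on toy data. The dictionary below mimics the layout of a COCO captions annotation file (an `images` list and an `annotations` list); its entries and the helper names `caption_stats` and `resize_shortest_side` are invented for illustration and are not part of the project code.

```python
from collections import Counter

# Toy stand-in for a COCO captions file (hypothetical entries).
toy_annotations = {
    "images": [
        {"id": 1, "width": 640, "height": 480},
        {"id": 2, "width": 375, "height": 500},
    ],
    "annotations": [
        {"image_id": 1, "caption": "A dog runs on the beach."},
        {"image_id": 1, "caption": "A brown dog near the sea."},
        {"image_id": 2, "caption": "A man rides a bicycle."},
    ],
}


def caption_stats(dataset):
    """Return (number of images, number of captions, captions per image)."""
    per_image = Counter(ann["image_id"] for ann in dataset["annotations"])
    return len(dataset["images"]), len(dataset["annotations"]), dict(per_image)


def resize_shortest_side(width, height, target=256):
    """Scale (width, height) so the shortest side equals `target`, keeping the aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


num_images, num_captions, per_image = caption_stats(toy_annotations)
print("images:", num_images, "captions:", num_captions)
for img in toy_annotations["images"]:
    print(img["id"], resize_shortest_side(img["width"], img["height"]))
```

On the real dataset, the same counts would come from the annotation file named in `captions_file` rather than from a hand-written dictionary.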

### Step 2: Build the Vocabulary

- Note to self: RETRAIN WITHOUT PAD WORD ASSIGNMENT.
- Show how the vocab file is generated.
- Guide their exploration of the `vocab` output.
- Tell them what they can and cannot amend (really, they should only be able to amend the threshold ...).
- Show how to delete or override an existing vocab file, and provide an option to do so.
- The vocabulary is used now to prepare for the embedding layer.
- The size of the vocabulary also determines the dimensionality of the output layer of the LSTM, and the vocabulary will be used to decode predicted captions. More on that soon.
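
To make the role of the threshold concrete, here is a minimal sketch of threshold-based vocabulary building. It is not the project's `Vocabulary` class: `build_vocab` is a hypothetical helper, whitespace splitting stands in for whatever tokenizer the project uses, and the index order of the special tokens is an assumption.

```python
from collections import Counter


def build_vocab(captions, threshold, pad_word="<pad>", start_word="<start>",
                end_word="<end>", unk_word="<unk>"):
    """Map each word seen at least `threshold` times to an integer index."""
    # Assumption: lowercase + whitespace split stands in for the real tokenizer.
    counts = Counter(word for caption in captions for word in caption.lower().split())
    word2idx = {}
    # Special tokens take the first indices.
    for token in (pad_word, start_word, end_word, unk_word):
        word2idx[token] = len(word2idx)
    # Sort the kept words so the index assignment is reproducible.
    for word in sorted(w for w, c in counts.items() if c >= threshold):
        word2idx[word] = len(word2idx)
    return word2idx


toy_captions = ["a dog runs", "a dog sleeps", "a cat sleeps"]
word2idx = build_vocab(toy_captions, threshold=2)
print(len(word2idx))  # vocabulary size, special tokens included
print(word2idx.get("cat", word2idx["<unk>"]))  # a rare word falls back to <unk>
```

Raising the threshold shrinks the vocabulary and sends more words to the unknown token; lowering it does the opposite, which is why it is the one setting worth experimenting with here.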

In [1]:
from vocabulary import Vocabulary

vocab_file = './vocab.pkl'
pad_word = "<pad>"
start_word = "<start>"
end_word = "<end>"
unk_word = "<unk>"
vocab_threshold = 4
captions_file = '../cocoapi/annotations/captions_train2014.json'

vocab = Vocabulary(vocab_file, pad_word, start_word, end_word, unk_word, vocab_threshold, captions_file)

### Step 3: Prepare the Data Loader

- Note to self: may need to RE-ORG these files ... especially to separate the data loader from the caption lengths.
- Show how to get a batch of data.
- Describe how the sampler works.
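
The sampling idea can be sketched without PyTorch: pick one caption length, then draw a batch only from captions of that length, so the captions in a batch all have the same number of tokens. The helper `sample_same_length_batch` and the toy lengths are hypothetical; this illustrates the idea and is not the project's data loader.

```python
import random
from collections import defaultdict


def sample_same_length_batch(caption_lengths, batch_size, rng=random):
    """Pick a caption length, then draw `batch_size` indices of captions with that length."""
    by_length = defaultdict(list)
    for index, length in enumerate(caption_lengths):
        by_length[length].append(index)
    # Choosing from the raw list makes common lengths more likely to be selected.
    sel_length = rng.choice(caption_lengths)
    # Draw with replacement, so a small bucket can still fill a whole batch.
    indices = rng.choices(by_length[sel_length], k=batch_size)
    return sel_length, indices


toy_lengths = [10, 12, 10, 9, 12, 10, 10, 9]
sel_length, indices = sample_same_length_batch(toy_lengths, batch_size=4, rng=random.Random(0))
print(sel_length, indices)
```

Every index returned points at a caption of the selected length, which is what lets a batch of captions be stacked into a single rectangular tensor.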

In [2]:
from data_loader import get_loader
from torchvision import transforms

# image preprocessing
transform = transforms.Compose([ 
    transforms.CenterCrop(224),
    transforms.RandomHorizontalFlip(), 
    transforms.ToTensor(), 
    transforms.Normalize((0.485, 0.456, 0.406), 
                         (0.229, 0.224, 0.225))])

batch_size = 10

data_loader, caption_lengths = get_loader(transform=transform,
                                          batch_size=batch_size)

loading annotations into memory...
Done (t=0.60s)
creating index...
index created!
100%|██████████| 414113/414113 [00:41<00:00, 10047.98it/s]

In [3]:
import numpy as np
import torch.utils.data as data

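# pick a caption length at random (common lengths are more likely to be drawn)
sel_length = np.random.choice(caption_lengths)
# find the indices of all captions with that length
all_indices = np.where(np.asarray(caption_lengths) == sel_length)[0]
# draw batch_size of those indices (with replacement)
indices = list(np.random.choice(all_indices, size=batch_size))
# swap in a sampler restricted to the chosen indices
new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
data_loader.batch_sampler.sampler = new_sampler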

for batch in data_loader:
    images, captions = batch[0], batch[1]
    break
    
print('selected length:', sel_length)
print('sampled indices:', data_loader.batch_sampler.sampler.indices)
print('images.shape:', images.shape)
print('captions.shape:', captions.shape)

selected length: 10
sampled indices: [350874, 146864, 48975, 255536, 114865, 222642, 111681, 181747, 52278, 289845]
images.shape: torch.Size([10, 3, 224, 224])
captions.shape: torch.Size([10, 12])
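
The selected length is 10 while each caption row holds 12 indices. One plausible reading, which the cells above do not confirm, is that the data loader wraps every caption in a start and an end token. The sketch below shows that encoding under this assumption; `encode_caption` and the tiny `word2idx` mapping are hypothetical.

```python
def encode_caption(caption, word2idx, start_word="<start>", end_word="<end>",
                   unk_word="<unk>"):
    """Turn a caption into a list of indices, wrapped in start and end tokens."""
    # Assumption: lowercase + whitespace split stands in for the real tokenizer.
    tokens = caption.lower().split()
    ids = [word2idx[start_word]]
    # Words missing from the vocabulary map to the unknown token.
    ids.extend(word2idx.get(token, word2idx[unk_word]) for token in tokens)
    ids.append(word2idx[end_word])
    return ids


word2idx = {"<start>": 0, "<end>": 1, "<unk>": 2, "a": 3, "dog": 4, "runs": 5}
ids = encode_caption("A dog runs fast", word2idx)
print(len(ids), ids)  # four words become six indices
```

Under the same assumption, a 10-token caption becomes 12 indices, matching the second dimension of `captions.shape`.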


### Step 4: Tinker with the CNN Encoder

- Provide them with a CNN encoder.
- Let them decide whether to swap out the pre-trained architecture for another (make clear this is allowed). Show how to do this for VGG-16.
- Make a note of the module autoreload capability.
- Encourage them to stop here and think about the dimensionality of the output.
- Explain what `volatile` is for, etc.
- Consider adding batch normalization if you like ...
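
One way to reason about the output dimensionality is to trace tensor shapes through the encoder by hand. The sketch below assumes an encoder that keeps a pre-trained backbone up to its final classifier layer and replaces that layer with a linear map to `embed_size`; the source does not show `EncoderCNN`, so this structure is an assumption. The feature widths are those of torchvision's ResNet-50 (2048) and VGG-16 (4096); `encoder_shapes` is a hypothetical helper.

```python
# Width of the feature vector fed to the final classifier layer in torchvision's
# pre-trained models: 2048 for ResNet-50, 4096 for VGG-16.
BACKBONE_FEATURES = {"resnet50": 2048, "vgg16": 4096}


def encoder_shapes(batch_size, embed_size, backbone="resnet50", image_size=224):
    """Return the tensor shape after each stage of a hypothetical encoder."""
    num_features = BACKBONE_FEATURES[backbone]
    return [
        ("images", (batch_size, 3, image_size, image_size)),
        ("backbone features", (batch_size, num_features)),
        # Assumption: a final linear layer maps num_features to embed_size.
        ("embedded features", (batch_size, embed_size)),
    ]


for backbone in ("resnet50", "vgg16"):
    for name, shape in encoder_shapes(10, 100, backbone=backbone):
        print(backbone, name, shape)
```

Whichever backbone is chosen, the last stage has shape `(batch_size, embed_size)`, so swapping architectures only changes the input width of the final linear layer.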

In [4]:
%load_ext autoreload
%autoreload 2

In [5]:
import torch
from torch.autograd import Variable
from model import EncoderCNN

# amend this if you like
embed_size = 100

encoder = EncoderCNN(embed_size)
if torch.cuda.is_available():
    encoder = encoder.cuda()

In [6]:
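def to_var(x, volatile=False):
    # move the tensor to the GPU when one is available
    if torch.cuda.is_available():
        x = x.cuda()
    # volatile=True disables gradient tracking (inference only);
    # PyTorch 0.4 and later use torch.no_grad() instead
    return Variable(x, volatile=volatile)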

images_var = to_var(images, volatile=True)
features = encoder(images_var)

print('features.shape:', features.shape)

features.shape: torch.Size([10, 100])


### Step 5: Tinker with RNN Decoder

- They have to write the decoder here; I plan to provide a unit test so they can check whether their implementation is correct.
- They have to decide the hidden size and embed size. I won't provide default values, but will provide papers they can read.
- They may need to use the vocabulary. Note that `len(vocab)` returns the total number of tokens.
- If you leave the notebook and come back, re-run every cell.
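
A decoder implementation can be sanity-checked by its output shape alone: one score per vocabulary word at every caption position. The sketch below is one possible shape check, not the planned unit test; `expected_decoder_shape` and `check_decoder_output` are hypothetical helpers that work on plain shape tuples.

```python
def expected_decoder_shape(features_shape, captions_shape, vocab_size):
    """Shape a decoder should return: one score per vocabulary word at every caption position."""
    batch_size, _embed_size = features_shape
    caption_batch, seq_len = captions_shape
    if batch_size != caption_batch:
        raise ValueError("features and captions must share the batch dimension")
    return (batch_size, seq_len, vocab_size)


def check_decoder_output(outputs_shape, features_shape, captions_shape, vocab_size):
    """Raise AssertionError if the decoder output shape is not the expected one."""
    expected = expected_decoder_shape(features_shape, captions_shape, vocab_size)
    actual = tuple(outputs_shape)
    assert actual == expected, f"expected {expected}, got {actual}"
    return True


# Shapes printed elsewhere in this notebook.
print(expected_decoder_shape((10, 100), (10, 12), 9956))
print(check_decoder_output((10, 12, 9956), (10, 100), (10, 12), 9956))
```

In the notebook, the same check could be called with `outputs.shape`, `features.shape`, `captions.shape` and `len(vocab)`.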

In [7]:
from model import DecoderRNN

embed_size = 100
hidden_size = 30
vocab_size = len(vocab)

decoder = DecoderRNN(embed_size, hidden_size, vocab_size)
if torch.cuda.is_available():
    decoder = decoder.cuda()

In [8]:
captions_var = to_var(captions)
outputs = decoder(features, captions_var)

print('outputs.shape:', outputs.shape)

outputs.shape: torch.Size([10, 12, 9956])


### Step 6: Calculate the Loss

- I provide the criterion; students have to calculate the loss.

In [9]:
import torch.nn as nn

criterion = nn.CrossEntropyLoss().cuda() if torch.cuda.is_available() else nn.CrossEntropyLoss()

# flatten to (batch_size * seq_len, vocab_size) scores and (batch_size * seq_len) targets
loss = criterion(outputs.view(-1, vocab_size), captions_var.view(-1))

# loss.data[0] works on PyTorch 0.3; on 0.4 and later, use loss.item() instead
print('loss:', loss.data[0])

loss: 9.227032661437988
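
A useful sanity check on this number: a model that spreads its probability evenly over all words has a cross-entropy of ln(vocab_size), whatever the target is. The sketch below computes cross-entropy for a single position in plain Python (`cross_entropy` is a hypothetical helper, not `nn.CrossEntropyLoss`) and evaluates it for a vocabulary of 9956 words, the last dimension of `outputs.shape`.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one position: -log(softmax(logits)[target])."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target]


vocab_size = 9956
uniform_logits = [0.0] * vocab_size  # every word gets the same score
baseline = cross_entropy(uniform_logits, target=0)
print(baseline, math.log(vocab_size))
```

The baseline comes out at about 9.21, close to the loss printed above, which is consistent with an untrained decoder producing near-uniform predictions. Training should push the loss well below this value.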


---

### Congratulations!

If you have completed all of the steps of this notebook, you're ready to move on to the next notebook in the project sequence.

When you are done with this notebook, please navigate to **1_Train.ipynb**, where you will train your own CNN-RNN model to generate image captions!