# Computer Vision Nanodegree

## Project: Image Captioning

---

In this notebook, you will learn how to load and pre-process data from the [COCO dataset](http://cocodataset.org/#home). You will also explore how to design a CNN-RNN model for automatically generating image captions.

This notebook is designed only to introduce you to the helper files that you will use to complete your project. 

Note that **any amendments that you make to this notebook will NOT be graded**.

Feel free to use the links below to navigate the notebook:
- [Step 1](#step1): Explore the Data Loader
- [Step 2](#step2): Tinker with the CNN Encoder
- [Step 3](#step3): Tinker with the RNN Decoder
- [Step 4](#step4): Calculate the Loss
- [Step 5](#step5): Celebrate!

<a id='step1'></a>
## Step 1: Explore the Data Loader

We have already written a data loader that you can use to load the COCO dataset in batches. 

In the code cell below, you will initialize the data loader by using the `get_loader` function in **data_loader.py**.  

> For this project, you are not permitted to change the **data_loader.py** file, which must be used as-is.

The `get_loader` function takes as input a number of arguments that can be explored in **data_loader.py**.  Take the time to explore these arguments now by opening **data_loader.py** in a new window.  For now, we need only focus on four arguments:
1. **`transform`** - an [image transform](http://pytorch.org/docs/master/torchvision/transforms.html) specifying how to pre-process the images and convert them to PyTorch tensors before using them as input to the CNN encoder.  For now, you are encouraged to keep the transform as provided in `transform_train`.  You will have the opportunity later to choose your own image transform to pre-process the COCO images.
2. **`mode`** - one of `'train'` (loads the training data in batches), `'val'` (loads the validation data), or `'test'` (for the test data). We will say that the data loader is in training, validation, or test mode, respectively.
3. **`batch_size`** - determines the batch size.  If training the model, this is the total number of image-caption pairs used to amend the model weights in each training step.
4. **`vocab_threshold`** - the total number of times that a word must appear in the training captions before it is used as part of the vocabulary.  Words that have fewer than `vocab_threshold` occurrences in the training captions are considered unknown words.
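
To make the effect of `vocab_threshold` concrete, here is a minimal sketch of threshold-based filtering on three made-up captions. It is not the code in **vocabulary.py**: the function `build_word2idx`, the whitespace tokenizer, and the special-token strings `<start>`, `<end>`, and `<unk>` are assumptions chosen for illustration.

```python
from collections import Counter

def build_word2idx(captions, vocab_threshold,
                   start_word="<start>", end_word="<end>", unk_word="<unk>"):
    """Toy vocabulary builder: keep words seen at least `vocab_threshold` times."""
    counts = Counter()
    for caption in captions:
        counts.update(caption.lower().split())  # naive whitespace tokenizer (assumption)

    # Special tokens come first, then every word that meets the threshold.
    word2idx = {start_word: 0, end_word: 1, unk_word: 2}
    for word, count in sorted(counts.items()):
        if count >= vocab_threshold:
            word2idx[word] = len(word2idx)
    return word2idx

captions = ["a dog runs", "a dog sleeps", "a cat sleeps"]
word2idx = build_word2idx(captions, vocab_threshold=2)
print(word2idx)
```

With a threshold of 2, the words `cat` and `runs` (one occurrence each) are left out of the toy vocabulary. Raising the threshold shrinks the vocabulary, and lowering it keeps more rare words.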

We will further detail the `vocab_threshold` and `vocab_from_file` arguments soon.  For now, run the code cell below.  Be patient - it may take a couple of minutes to run!

In [1]:
from data_loader import get_loader
from torchvision import transforms

# Define a transform to pre-process the training images.
transform_train = transforms.Compose([ 
    transforms.Resize(256),                          # smaller edge of image resized to 256
    transforms.RandomCrop(224),                      # get 224x224 crop from random location
    transforms.RandomHorizontalFlip(),               # horizontally flip image with probability=0.5
    transforms.ToTensor(),                           # convert the PIL Image to a tensor
    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                         (0.229, 0.224, 0.225))])

# Set the minimum word count threshold.
vocab_threshold = 4

# Specify the batch size.
batch_size = 10

# Obtain the data loader.
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=False)

loading annotations into memory...
Done (t=0.59s)
creating index...
index created!
[0/414113] Tokenizing captions...
[100000/414113] Tokenizing captions...
[200000/414113] Tokenizing captions...
[300000/414113] Tokenizing captions...
[400000/414113] Tokenizing captions...
loading annotations into memory...


  0%|          | 0/414113 [00:00<?, ?it/s]

Done (t=0.51s)
creating index...
index created!
Obtaining caption lengths...


100%|██████████| 414113/414113 [00:40<00:00, 10228.72it/s]


When you ran the code cell above, the data loader was stored in the variable `data_loader`.  

You can access the corresponding dataset as `data_loader.dataset`.  This dataset is an instance of the `CoCoDataset` class in **data_loader.py**.  If you are unfamiliar with data loaders and datasets, you are encouraged to review [this PyTorch tutorial](http://pytorch.org/tutorials/beginner/data_loading_tutorial.html).

### Exploring the `__getitem__` Method

The `__getitem__` method in the `CoCoDataset` class determines how a data sample (or *image-caption pair*) is pre-processed before being incorporated into a batch.  When the data loader is in training mode, this method begins by obtaining the image path (`path`) and caption (`caption`).

The pre-processing of the image is relatively straightforward (from the `__getitem__` method in the `CoCoDataset` class):
```python
# Convert image to tensor and pre-process using transform
image = Image.open(os.path.join(self.img_folder, path)).convert('RGB')
image = self.transform(image)
```
After loading the image located at `path`, the image is processed using the same transform that was supplied when instantiating the data loader. 
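
The last step of `transform_train`, `transforms.Normalize`, subtracts a per-channel mean and divides by a per-channel standard deviation, after `ToTensor` has scaled pixel values to the range [0, 1]. The plain-Python sketch below applies that arithmetic to a single RGB pixel. `normalize_pixel` is a hypothetical helper written for illustration, not part of torchvision.

```python
def normalize_pixel(rgb, mean, std):
    """Apply (value - mean) / std independently to each channel."""
    return tuple((value - m) / s for value, m, s in zip(rgb, mean, std))

# The same statistics that are passed to transforms.Normalize in transform_train.
mean = (0.485, 0.456, 0.406)
std = (0.229, 0.224, 0.225)

# A pixel whose channels equal the channel means maps to zero in every channel.
print(normalize_pixel(mean, mean, std))

# A pure white pixel (all channels 1.0) maps to positive values.
print(normalize_pixel((1.0, 1.0, 1.0), mean, std))
```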

The pre-processing of the caption is more involved and is covered in the following section.

### How the Vocabulary is used to Convert Captions

To understand how COCO captions are preprocessed, we'll first need to take a look at the `vocab` instance variable of the `CoCoDataset` class (from the `__init__` method of the `CoCoDataset` class):
```python
def __init__(self, transform, mode, batch_size, vocab_threshold, vocab_file, start_word, 
        end_word, unk_word, annotations_file, vocab_from_file, img_folder):
        ...
        self.vocab = Vocabulary(vocab_threshold, vocab_file, start_word,
            end_word, unk_word, annotations_file, vocab_from_file)
        ...
```
Take the time now to check that `data_loader.dataset.vocab` is an instance of the `Vocabulary` class in **vocabulary.py**.  The most important feature of this vocabulary is the `word2idx` variable, which we print in the next code cell.

In [3]:
print(data_loader.dataset.vocab.word2idx)



Examine the dictionary above.  In this case, it maps each word (a string-valued token) to a unique integer index.

The code for pre-processing the COCO captions appears below (from the `__getitem__` method in the `CoCoDataset` class):

```python
# Convert caption to tensor of word ids.
tokens = nltk.tokenize.word_tokenize(str(caption).lower())
caption = []
caption.append(self.vocab(self.vocab.start_word))
caption.extend([self.vocab(token) for token in tokens])
caption.append(self.vocab(self.vocab.end_word))
caption = torch.Tensor(caption).long()
```
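
The conversion in the snippet above can be traced end to end on a toy vocabulary. This is only a sketch under stated assumptions: the `word2idx` dictionary and its special-token strings are made up, the regular expression is a simplified stand-in for `nltk.tokenize.word_tokenize`, and mapping out-of-vocabulary words to the unknown-word id is assumed to be what calling the vocabulary does.

```python
import re

# Hypothetical toy vocabulary; the real one is built from the COCO training captions.
word2idx = {"<start>": 0, "<end>": 1, "<unk>": 2,
            "the": 3, "chicken": 4, "crossed": 5, "street": 6, ".": 7}

def caption_to_ids(caption, word2idx):
    """Lowercase and tokenize a caption, then wrap the word ids in start/end ids."""
    # Simplified tokenizer (assumption): runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", caption.lower())
    unk = word2idx["<unk>"]
    ids = [word2idx["<start>"]]
    ids.extend(word2idx.get(token, unk) for token in tokens)  # unseen words fall back to <unk>
    ids.append(word2idx["<end>"])
    return ids

print(caption_to_ids("The chicken crossed the street.", word2idx))
print(caption_to_ids("The chicken crossed the road.", word2idx))
```

In the second call, `road` is not in the toy vocabulary, so it is replaced by the id of the unknown word.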



The vocabulary is used to convert each string-valued caption in the COCO dataset to a list of integers, as follows:  

- **Step 1**: Every letter in the caption is converted to lowercase, and the [`nltk.tokenize.word_tokenize`](http://www.nltk.org/) function is used to obtain a list of string-valued tokens.  For illustration, say the caption is `caption = 'The chicken crossed the street.'`  Then, `nltk.tokenize.word_tokenize(caption.lower())` returns `['the', 'chicken', 'crossed', 'the', 'street', '.']`.

---

- Show how the vocab file is generated.
- Guide their exploration of the `vocab` output.
- Tell them that they should amend only the threshold; `vocab` does have other arguments, which should stay at their default values. Mention `vocab_from_file`.

- When training the model later, note that there is no need to populate the data loader and the vocabulary separately; the vocabulary is built into the data loader.
- Check now that `data_loader` includes the dataset, which in turn includes the vocabulary. `word2idx` and `idx2word` can also be accessed as above.

- The vocabulary is used now to prepare for the embedding layer.
- The size of the vocabulary also determines the dimensionality of the output layer of the LSTM and will be used to decode predicted captions. More on that soon.
- Uncomment one of the lines below to examine `word2idx` or `idx2word`. `word2idx` will be used to convert captions to numbers, and `idx2word` later to convert predictions to words.
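
As a rough illustration of the two directions mentioned in the notes above, the sketch below inverts a toy `word2idx` dictionary to obtain `idx2word` and uses it to turn a list of ids back into words. The dictionary contents and the helper `ids_to_words` are made up for this example; the real mappings live on `data_loader.dataset.vocab`.

```python
# Hypothetical toy mapping from words to ids.
word2idx = {"<start>": 0, "<end>": 1, "<unk>": 2, "a": 3, "dog": 4, "runs": 5}

# idx2word is the inverse mapping, from ids back to words.
idx2word = {idx: word for word, idx in word2idx.items()}

def ids_to_words(ids, idx2word):
    """Convert a sequence of integer ids back into string tokens."""
    return [idx2word[i] for i in ids]

predicted_ids = [0, 3, 4, 5, 1]
print(ids_to_words(predicted_ids, idx2word))
```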

In [None]:
#print(data_loader.dataset.vocab.idx2word)
print('Number of words in vocabulary:', len(data_loader.dataset.vocab))

- Describe how `get_train_indices` works and where the procedure for loading batches came from.
- Show how to get a batch of data.
- Describe how the sampler works.

In [None]:
import numpy as np
import torch.utils.data as data

# Randomly sample a caption length, and sample indices with that length.
indices = data_loader.dataset.get_train_indices()
print('sampled indices:', indices)
# Create and assign a batch sampler to retrieve a batch with the sampled indices.
new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
data_loader.batch_sampler.sampler = new_sampler
    
# Obtain the batch.
for batch in data_loader:
    images, captions = batch[0], batch[1]
    break
    
print('images.shape:',images.shape)
print('captions.shape:', captions.shape)
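
The comment in the cell above says that a caption length is sampled first and that indices with that length are sampled next. A minimal sketch of that idea follows, presumably motivated by the fact that captions of equal length can be stacked into one tensor without padding. The function `sample_indices_by_length`, the toy `caption_lengths` list, and sampling with replacement are assumptions; the actual `get_train_indices` in **data_loader.py** may differ.

```python
import random

def sample_indices_by_length(caption_lengths, batch_size, rng=random):
    """Pick one caption length at random, then draw `batch_size` indices of captions with that length."""
    # Choosing from the full list makes common lengths proportionally more likely.
    chosen_length = rng.choice(caption_lengths)
    candidates = [i for i, length in enumerate(caption_lengths) if length == chosen_length]
    # Sample with replacement so that rare lengths can still fill a batch (assumption).
    return [rng.choice(candidates) for _ in range(batch_size)]

# Toy caption lengths, one entry per caption in a hypothetical dataset.
caption_lengths = [10, 12, 10, 11, 12, 10, 11, 10]
indices = sample_indices_by_length(caption_lengths, batch_size=4, rng=random.Random(0))
print('sampled indices:', indices)
print('lengths of sampled captions:', [caption_lengths[i] for i in indices])
```

Every caption selected this way has the same length, which is why `captions.shape` has a single, well-defined second dimension for the batch.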

<a id='step2'></a>
## Step 2: Tinker with the CNN Encoder

- Make note of the module autoreload capability.
- Need to explain to them what `volatile` is for, etc.

In [None]:
%load_ext autoreload
%autoreload 2

import torch
from torch.autograd import Variable

def to_var(x, volatile=False):
    """Convert a PyTorch tensor to a Variable and move it to the GPU if CUDA is available."""
    if torch.cuda.is_available():
        x = x.cuda()
    return Variable(x, volatile=volatile)

- Provide them with a CNN encoder.
- Let them decide whether to swap out the pre-trained architecture for another (make clear that this is allowed). Show how to do this for VGG-16.
- Consider adding batch normalization if you like.
- Make any modifications to the `EncoderCNN` class that you like. Use this as a playground to make sure that your model is working well.
- Your output should be a Torch Variable with shape `[batch_size, embed_size]`.

In [None]:
from model import EncoderCNN

# TODO: Specify the dimensionality of the image embedding.
embed_size = 256

# Initialize the encoder.
encoder = EncoderCNN(embed_size)
# Move the encoder to GPU if CUDA is available.
if torch.cuda.is_available():
    encoder = encoder.cuda()
    
# Convert last batch of images (from Step 1) to PyTorch Variable.   
images_var = to_var(images, volatile=True)
# Pass the images through the encoder.
features = encoder(images_var)

print('type(features):', type(features))
print('features.shape:', features.shape)

# Check that your encoder satisfies the requirements of the project.
if (type(features) == torch.autograd.variable.Variable
        and features.shape[0] == batch_size
        and features.shape[1] == embed_size):
    print('\nyou may proceed')

<a id='step3'></a>
## Step 3: Tinker with the RNN Decoder

- They have to write the decoder here; the plan is to provide a unit test with which they can check whether their implementation is correct.
- They have to decide on the hidden size. Note that `embed_size` must be the same as for the encoder, and the output size must match the vocabulary size.
- If you leave the notebook and come back, re-run every cell.

In [None]:
from model import DecoderRNN

# TODO: Specify the number of features in the hidden state of the RNN decoder.
hidden_size = 512

# Store the size of the vocabulary.
vocab_size = len(data_loader.dataset.vocab)

# Initialize the decoder.
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)
# Move the decoder to GPU if CUDA is available.
if torch.cuda.is_available():
    decoder = decoder.cuda()
    
# Convert last batch of captions (from Step 1) to a PyTorch Variable. 
captions_var = to_var(captions)
# Pass the encoder output and captions through the decoder.
outputs = decoder(features, captions_var)

print('type(outputs):', type(outputs))
print('outputs.shape:', outputs.shape)

# Check that your decoder satisfies the requirements of the project.
if (type(outputs) == torch.autograd.variable.Variable
        and outputs.shape[0] == batch_size
        and outputs.shape[1] == captions.shape[1]
        and outputs.shape[2] == vocab_size):
    print('\nyou may proceed')

<a id='step4'></a>
## Step 4: Calculate the Loss

- The criterion is provided; students have to calculate the loss.
- Still trying to figure out how best to assess this. Maybe put it in a .py file that's not hidden, and tell students not to look unless they really need to? It will appear in the next notebook anyway, so it may as well go in a file that's not hidden.

In [None]:
import torch.nn as nn

criterion = nn.CrossEntropyLoss().cuda() if torch.cuda.is_available() else nn.CrossEntropyLoss()

# TODO: Calculate the batch loss.
loss = criterion(outputs.view(-1, vocab_size), captions_var.view(-1))

print('loss:', loss.data[0])
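
To see what the loss computation in the cell above does, it can help to work through it by hand. `nn.CrossEntropyLoss` applies a log-softmax to each row of scores and, by default, averages the negative log-probability of the target class over all rows. The plain-Python sketch below reproduces that arithmetic on made-up scores, including the flattening from `[batch_size, caption_length, vocab_size]` to `[batch_size * caption_length, vocab_size]`. The helper `cross_entropy` and the toy numbers are illustrative assumptions, not project code.

```python
import math

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of each target class under softmax(logits)."""
    total = 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp of the row.
        max_logit = max(row)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in row))
        total += log_sum_exp - row[target]
    return total / len(targets)

# Toy scores with shape [batch_size=1, caption_length=2, vocab_size=3].
toy_outputs = [[[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
toy_captions = [[0, 2]]

# Flatten to [batch_size * caption_length, vocab_size] and [batch_size * caption_length].
flat_logits = [scores for sequence in toy_outputs for scores in sequence]
flat_targets = [word_id for sequence in toy_captions for word_id in sequence]

toy_loss = cross_entropy(flat_logits, flat_targets)
print('toy loss:', toy_loss)
```

The second row has identical scores for all three words, so its contribution is `log(3)`, the loss of a uniform guess over a vocabulary of size 3.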

<a id='step5'></a>
## Step 5: Celebrate!

If you have completed all of the steps of this notebook, you're ready to move on to the next notebook in the project sequence.

When you are done with this notebook, please navigate to **1_Train.ipynb**, where you will train your own CNN-RNN model to generate image captions!