# Project: Image Captioning
---

## Notebook 2: Data Preprocessing and Model Architecture


<a id='step1'></a>
## 1: Loading the Data Loader

In the code cell below, you will initialize the data loader by using the `get_loader` function in **data_loader.py**.  
``data_loader`` will be used to load the COCO dataset in batches. 


The `get_loader` function takes a number of arguments, which can be explored in **data_loader.py**.  

The main arguments are:
1. **`transform`** - an [image transform](http://pytorch.org/docs/master/torchvision/transforms.html) specifying how to pre-process the images and convert them to PyTorch tensors before using them as input to the CNN encoder.
2. **`mode`** - one of `'train'` (loads the training data in batches) or `'test'` (for the test data).
3. **`batch_size`** - determines the batch size.
4. **`vocab_threshold`** - the minimum number of times that a word must appear in the training captions before it is used as part of the vocabulary.  Words that have fewer than `vocab_threshold` occurrences in the training captions are considered unknown words. 
5. **`vocab_from_file`** - a Boolean that decides whether to load the vocabulary from file.  

In [None]:
import sys
sys.path.append('/opt/cocoapi/PythonAPI')
from pycocotools.coco import COCO
!pip install nltk
import nltk
nltk.download('punkt')
from data_loader import get_loader
from torchvision import transforms

transform_train = transforms.Compose([ 
    transforms.Resize(256),                          
    transforms.RandomCrop(224),                      
    transforms.RandomHorizontalFlip(),               
    transforms.ToTensor(),                           
    transforms.Normalize((0.485, 0.456, 0.406),      
                         (0.229, 0.224, 0.225))])

vocab_threshold = 5         
batch_size = 10
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=False)



[nltk_data] Downloading package punkt to C:\Users\Rodrigo
[nltk_data]     Franco\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


loading annotations into memory...
Done (t=0.74s)
creating index...
index created!
[0/414113] Tokenizing captions...
[100000/414113] Tokenizing captions...
[200000/414113] Tokenizing captions...
[300000/414113] Tokenizing captions...
[400000/414113] Tokenizing captions...
loading annotations into memory...
Done (t=0.81s)
creating index...
index created!
Obtaining caption lengths...


100%|████████████████████████████████████████████████████████████████████████| 414113/414113 [00:49<00:00, 8415.68it/s]


### Exploring the `__getitem__` Method in CoCoDataset

The `__getitem__` method in the `CoCoDataset` class determines how an image-caption pair is pre-processed before being incorporated into a batch.

When the data loader is in training mode, this method begins by obtaining the filename (`path`) of a training image and its corresponding caption (`caption`).

#### Image Pre-Processing 

Image pre-processing is defined in the `__getitem__` method of the `CoCoDataset` class:

After loading the image in the training folder with name `path`, the image is pre-processed using the same transform (`transform_train`) that was supplied when instantiating the data loader.  
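
To see what the final `Normalize` step of `transform_train` does to pixel values, here is a small worked sketch in plain Python. It assumes that `ToTensor` has already scaled each channel to the range [0, 1]; the helper name `normalize_pixel` is hypothetical and only used for illustration.

```python
# Per-channel statistics passed to transforms.Normalize in transform_train.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Apply (value - mean) / std to one RGB pixel with values in [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, MEAN, STD))


# A black pixel and a white pixel give the extreme values of each channel.
black = normalize_pixel((0.0, 0.0, 0.0))
white = normalize_pixel((1.0, 1.0, 1.0))
print('black:', [round(v, 4) for v in black])
print('white:', [round(v, 4) for v in white])
```

Because the mean is subtracted before dividing by the standard deviation, normalized values are centered near zero, which is why the image tensors printed later in this notebook contain negative entries.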

#### Caption Pre-Processing 

The captions also need to be pre-processed and prepped for training. In this example, for generating captions, we are aiming to create a model that predicts the next token of a sentence from previous tokens, so we turn the caption associated with any image into a list of tokenized words, before casting it to a PyTorch tensor that we can use to train the network.

String-valued captions are converted to a list of integers, which is then cast to a PyTorch tensor. 

In [None]:
sample_caption = 'A person doing a trick on a rail while riding a skateboard.'

In [None]:
import nltk
sample_tokens = nltk.tokenize.word_tokenize(str(sample_caption).lower())
print(sample_tokens)

['a', 'person', 'doing', 'a', 'trick', 'on', 'a', 'rail', 'while', 'riding', 'a', 'skateboard', '.']


The integer `0` is always used to mark the start of a caption.

In [None]:
sample_caption = []
start_word = data_loader.dataset.vocab.start_word
print('Special start word:', start_word)
sample_caption.append(data_loader.dataset.vocab(start_word))
print(sample_caption)

Special start word: <start>
[0]


In [None]:
sample_caption.extend([data_loader.dataset.vocab(token) for token in sample_tokens])
print(sample_caption)

[0, 3, 98, 754, 3, 396, 39, 3, 1009, 207, 139, 3, 753, 18]


As you will see below, the integer `1` is always used to mark the end of a caption.

In [None]:
end_word = data_loader.dataset.vocab.end_word
print('Special end word:', end_word)

sample_caption.append(data_loader.dataset.vocab(end_word))
print(sample_caption)

Special end word: <end>
[0, 3, 98, 754, 3, 396, 39, 3, 1009, 207, 139, 3, 753, 18, 1]


Converting to a PyTorch tensor:

In [None]:
import torch

sample_caption = torch.Tensor(sample_caption).long()
print(sample_caption)

tensor([   0,    3,   98,  754,    3,  396,   39,    3, 1009,  207,  139,    3,
         753,   18,    1])
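
Once a caption is a sequence of integers, the next-token objective mentioned earlier can be illustrated with a small sketch. It pairs every prefix of a tokenized caption with the token that follows it. This is a generic illustration of the idea, not the training code from **model.py**, and the helper name `next_token_pairs` is hypothetical.

```python
def next_token_pairs(token_ids):
    """Return (prefix, next_token) pairs for a caption given as a list of ids."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


# Toy caption: <start>=0, three word ids, <end>=1 (ids chosen for illustration).
caption_ids = [0, 3, 98, 754, 1]
for prefix, target in next_token_pairs(caption_ids):
    print(prefix, '->', target)
```

The first pair asks the model to predict the first word from `<start>` alone, and the last pair asks it to predict `<end>` from the full sentence.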


In [None]:
# Preview the word2idx dictionary.
dict(list(data_loader.dataset.vocab.word2idx.items())[:10])

{'<start>': 0,
 '<end>': 1,
 '<unk>': 2,
 'a': 3,
 'very': 4,
 'clean': 5,
 'and': 6,
 'well': 7,
 'decorated': 8,
 'empty': 9}

In [None]:
# Print the total number of keys in the word2idx dictionary.
print('Total number of tokens in vocabulary:', len(data_loader.dataset.vocab))

Total number of tokens in vocabulary: 8856


In **vocabulary.py**, the `word2idx` dictionary is created by looping over the captions in the training dataset.  If a token appears at least `vocab_threshold` times in the training set, it is added as a key to the dictionary and assigned a unique integer.

In [None]:
vocab_threshold = 4
# Obtain the data loader.
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=False)

loading annotations into memory...
Done (t=0.84s)
creating index...
index created!
[0/414113] Tokenizing captions...
[100000/414113] Tokenizing captions...
[200000/414113] Tokenizing captions...
[300000/414113] Tokenizing captions...
[400000/414113] Tokenizing captions...
loading annotations into memory...
Done (t=0.77s)
creating index...
index created!
Obtaining caption lengths...


100%|████████████████████████████████████████████████████████████████████████| 414113/414113 [00:49<00:00, 8296.06it/s]


In [None]:
print('Total number of tokens in vocabulary:', len(data_loader.dataset.vocab))

Total number of tokens in vocabulary: 9955
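
The effect of `vocab_threshold` can be reproduced on a toy corpus. The sketch below is an assumption about how such a vocabulary could be built (the actual logic lives in **vocabulary.py**, which is not shown here); `build_word2idx` and the sample captions are hypothetical.

```python
from collections import Counter


def build_word2idx(captions, vocab_threshold):
    """Map special tokens and sufficiently frequent words to unique integers."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    word2idx = {'<start>': 0, '<end>': 1, '<unk>': 2}
    for word, count in counts.items():
        # Keep a word only if it appears at least vocab_threshold times.
        if count >= vocab_threshold:
            word2idx[word] = len(word2idx)
    return word2idx


toy_captions = ['a dog runs', 'a dog sleeps', 'a cat sleeps']
print(len(build_word2idx(toy_captions, vocab_threshold=2)))  # fewer words kept
print(len(build_word2idx(toy_captions, vocab_threshold=1)))  # more words kept
```

On the toy corpus, the lower threshold keeps more words, in line with the vocabulary growing from 8856 to 9955 tokens when `vocab_threshold` was lowered from 5 to 4 above.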


There are also a few special keys in the `word2idx` dictionary:
- the special start word (`"<start>"`),
- the special end word (`"<end>"`),
- the special unknown word (`"<unk>"`); any unknown tokens are mapped to its integer, `2`.

In [None]:
unk_word = data_loader.dataset.vocab.unk_word
print('Special unknown word:', unk_word)

print('All unknown words are mapped to this integer:', data_loader.dataset.vocab(unk_word))

Special unknown word: <unk>
All unknown words are mapped to this integer: 2


Checking that words outside the vocabulary are mapped to the `<unk>` index:

In [None]:
print(data_loader.dataset.vocab('rodrigo'))
print(data_loader.dataset.vocab('ieowoqjf'))

2
2
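
The fallback seen above, where any out-of-vocabulary string maps to the index of `<unk>`, can be sketched with a small callable class. This is a hypothetical stand-in (`ToyVocabulary`), not the class from **vocabulary.py**.

```python
class ToyVocabulary:
    """Minimal callable vocabulary that maps unknown words to <unk>."""

    def __init__(self, word2idx, unk_word='<unk>'):
        self.word2idx = word2idx
        self.unk_word = unk_word

    def __call__(self, word):
        # Fall back to the <unk> index when the word is not a known key.
        return self.word2idx.get(word, self.word2idx[self.unk_word])

    def __len__(self):
        return len(self.word2idx)


vocab = ToyVocabulary({'<start>': 0, '<end>': 1, '<unk>': 2, 'a': 3})
print(vocab('a'), vocab('ieowoqjf'))
```

Making the object callable mirrors how `data_loader.dataset.vocab(token)` is used in this notebook.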


The vocabulary (`data_loader.dataset.vocab`) is saved as a pickle file in the project folder, with filename `vocab.pkl`.

Setting `vocab_from_file=True` loads the vocabulary from this pickle file instead of rebuilding it from the captions.


In [None]:
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_from_file=True)

Vocabulary successfully loaded from vocab.pkl file!
loading annotations into memory...
Done (t=0.77s)
creating index...
index created!
Obtaining caption lengths...


100%|████████████████████████████████████████████████████████████████████████| 414113/414113 [00:50<00:00, 8166.28it/s]


## 2: Using the Data Loader to Obtain Batches

The captions in the dataset vary greatly in length. 

The code cell below shows that the most common caption length is 10.  Very short and very long captions are quite rare.  

In [None]:
from collections import Counter
# Tally the total number of training captions with each length.
counter = Counter(data_loader.dataset.caption_lengths)
lengths = sorted(counter.items(), key=lambda pair: pair[1], reverse=True)
for value, count in lengths:
    print('value: %2d --- count: %5d' % (value, count))

value: 10 --- count: 86332
value: 11 --- count: 79945
value:  9 --- count: 71935
value: 12 --- count: 57639
value: 13 --- count: 37648
value: 14 --- count: 22335
value:  8 --- count: 20769
value: 15 --- count: 12842
value: 16 --- count:  7729
value: 17 --- count:  4842
value: 18 --- count:  3103
value: 19 --- count:  2015
value:  7 --- count:  1597
value: 20 --- count:  1451
value: 21 --- count:   999
value: 22 --- count:   683
value: 23 --- count:   534
value: 24 --- count:   383
value: 25 --- count:   277
value: 26 --- count:   215
value: 27 --- count:   159
value: 28 --- count:   115
value: 29 --- count:    86
value: 30 --- count:    58
value: 31 --- count:    49
value: 32 --- count:    44
value: 34 --- count:    39
value: 37 --- count:    32
value: 33 --- count:    31
value: 35 --- count:    31
value: 36 --- count:    26
value: 38 --- count:    18
value: 39 --- count:    18
value: 43 --- count:    16
value: 44 --- count:    16
value: 48 --- count:    12
value: 45 --- count:    11
...

* To generate a batch of training data, a caption length is sampled first, where the probability that any length is drawn is proportional to the number of captions with that length in the dataset. 
* Then, `batch_size` image-caption pairs are retrieved, where all captions have the sampled length.  This approach for assembling batches matches the procedure in [this paper](https://arxiv.org/pdf/1502.03044.pdf) and has been shown to be computationally efficient without degrading performance.
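
The two steps above can be sketched in plain Python. This is an assumption about what a method like `get_train_indices` might do, since its source in **data_loader.py** is not reproduced here; the function name `sample_batch_indices` and the toy lengths are hypothetical.

```python
import random


def sample_batch_indices(caption_lengths, batch_size, rng=random):
    """Pick a caption length, then batch_size indices of captions with that length."""
    # Choosing uniformly from the list of all lengths makes each length's
    # probability proportional to how many captions have it.
    sel_length = rng.choice(caption_lengths)
    candidates = [i for i, n in enumerate(caption_lengths) if n == sel_length]
    # Assumption: sample with replacement so rare lengths can still fill a batch.
    return [rng.choice(candidates) for _ in range(batch_size)]


toy_lengths = [10, 10, 10, 11, 11, 9, 12, 10]
indices = sample_batch_indices(toy_lengths, batch_size=4, rng=random.Random(0))
print(indices, [toy_lengths[i] for i in indices])
```

Because every caption in the batch has the same length, the captions can be stacked into a single tensor without padding.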

In [None]:
import numpy as np
import torch.utils.data as data

# Randomly sample a caption length, and sample indices with that length.
indices = data_loader.dataset.get_train_indices()
print('sampled indices:', indices)

# Create and assign a batch sampler to retrieve a batch with the sampled indices.
new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
data_loader.batch_sampler.sampler = new_sampler

images, captions = next(iter(data_loader))
    
print('images.shape:', images.shape)
print('captions.shape:', captions.shape)

print('images:', images)
print('captions:', captions)

sampled indices: [59524, 178926, 316467, 261850, 287304, 381893, 371050, 33572, 272310, 163231]
images.shape: torch.Size([10, 3, 224, 224])
captions.shape: torch.Size([10, 14])
images: tensor([[[[ 0.1939,  0.1939,  0.1939,  ...,  0.2453,  0.2796,  0.2796],
          [ 0.1597,  0.0912,  0.1426,  ...,  0.3652,  0.4166,  0.4166],
          [ 0.1426,  0.0056,  0.1254,  ...,  0.4337,  0.4166,  0.4508],
          ...,
          [-0.2171, -0.4739, -0.7479,  ...,  1.2557,  1.1358,  1.0502],
          [-0.7308, -0.9020, -1.2274,  ...,  1.2214,  1.1187,  1.0331],
          [-1.2274, -1.4672, -1.7412,  ...,  1.2557,  1.1872,  1.0673]],

         [[-0.6352, -0.6702, -0.6877,  ..., -0.7052, -0.6352, -0.6527],
          [-0.7227, -0.8277, -0.8102,  ..., -0.6001, -0.5301, -0.5301],
          [-0.7927, -0.9678, -0.8803,  ..., -0.5651, -0.5651, -0.5651],
          ...,
          [-0.4951, -0.7227, -0.9678,  ...,  0.3978,  0.3627,  0.2927],
          [-0.9153, -1.1078, -1.4055,  ...,  0.3627,  0.3277,  

          [-0.4275,  0.4788,  0.9319,  ..., -1.7347, -1.7522, -1.7173]]]])
captions: tensor([[   0,    3,  371,  224,   39,  160,  325,  364,  161,   47,  327,  786,
           18,    1],
        [   0,    3,   91,  224,  111,    3,  112,  360,    3,  925,   13, 2076,
           18,    1],
        [   0,  253,    6,  239, 1548,  509,    3, 1815,   77,  121,   13,    3,
          137,    1],
        [   0,   47,  327,  786,  332,   21,    3,  985,  569,  407,   24, 1165,
           18,    1],
        [   0,    3,  115,    6,    3,  371,  324,   54,  364,  161,   72,  409,
           18,    1],
        [   0,    3,  169, 3381,   86,  360,    3, 3381, 1251,  537,   21, 3864,
           18,    1],
        [   0,    3,  286,  294,  460,   32,  665,  161, 1612,    3, 1492,  353,
           18,    1],
        [   0,  250,   52,    2,   86, 1372,   86,    6,    2,   77,  682, 4542,
           18,    1],
        [   0,    3,  578,  270,  160,  899,   39,    3,   40,   79,   32, 5262,
          

## 3: Building the CNN Encoder
* The CNN encoder is defined in ``model.py``.
* We import `EncoderCNN` and `DecoderRNN` from ``model.py``. 

The encoder uses the pre-trained ResNet-50 architecture (with the final fully-connected layer removed) to extract features from a batch of pre-processed images.  The output is then flattened to a vector, before being passed through a `Linear` layer to transform the feature vector to have the same size as the word embedding.
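
The shape changes described above can be traced with simple bookkeeping. The sketch below does not run a network; it only tracks tensor shapes, assuming 224x224 inputs and the standard ResNet-50 layout, whose last pooling layer outputs 2048 channels. The function name `encoder_shapes` is hypothetical.

```python
def encoder_shapes(batch_size, embed_size, resnet_channels=2048):
    """Return the tensor shape after each stage of the encoder."""
    return {
        'images': (batch_size, 3, 224, 224),
        # ResNet-50 without its final fully-connected layer ends in global pooling.
        'resnet_features': (batch_size, resnet_channels, 1, 1),
        'flattened': (batch_size, resnet_channels),
        # The Linear layer maps the flattened features to the word-embedding size.
        'embedded': (batch_size, embed_size),
    }


for stage, shape in encoder_shapes(batch_size=10, embed_size=256).items():
    print(stage, shape)
```

With a batch size of 10 and an embedding size of 256, the final shape agrees with the `features.shape` value printed later in this step.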

![Encoder](https://github.com/varangrai/Udacity-CVND-Projects/blob/master/Image%20Captioning/images/encoder.png?raw=true)


In [None]:
from model import EncoderCNN
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Run the code cell below to instantiate the CNN encoder in `encoder`.  

The pre-processed images from the batch in **Step 2** of this notebook are then passed through the encoder, and the output is stored in `features`.

In [None]:
embed_size = 256      # dimensionality of the image embedding
encoder = EncoderCNN(embed_size)
encoder.to(device)
images = images.to(device)
features = encoder(images)

print('type(features):', type(features))
print('features.shape:', features.shape)

type(features): <class 'torch.Tensor'>
features.shape: torch.Size([10, 256])


## 4: RNN Decoder

The decoder accepts as input an arbitrary batch of embedded image features and pre-processed captions, where all captions in the batch have the same length.  

* The decoder returns `outputs` which has shape `[batch_size, captions.shape[1], vocab_size]` and `outputs[i,j,k]` contains the model's predicted score, indicating how likely the `j`-th token in the `i`-th caption in the batch is the `k`-th token in the vocabulary.
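
To make the indexing concrete, the sketch below uses a tiny nested list in place of the `outputs` tensor and picks, for every position, the vocabulary index with the highest score. The scores, the toy vocabulary, and the helper `greedy_tokens` are made up for illustration.

```python
def greedy_tokens(outputs):
    """For scores shaped [batch][position][vocab], return the argmax index per position."""
    return [[max(range(len(scores)), key=scores.__getitem__) for scores in caption]
            for caption in outputs]


idx2word = ['<start>', '<end>', '<unk>', 'a', 'dog']
# One caption (batch size 1), three positions, five vocabulary entries.
toy_outputs = [[[0.1, 0.0, 0.2, 2.5, 0.3],
                [0.0, 0.1, 0.4, 0.2, 1.9],
                [0.2, 3.1, 0.1, 0.0, 0.5]]]
predicted = greedy_tokens(toy_outputs)
print([[idx2word[k] for k in caption] for caption in predicted])
```

Here `toy_outputs[0][1][4]` plays the role of `outputs[i,j,k]`: the score that the token at position 1 of caption 0 is vocabulary entry 4.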


![Decoder](https://github.com/varangrai/Udacity-CVND-Projects/blob/master/Image%20Captioning/images/encoder-decoder.png?raw=true)  

In [None]:
from model import DecoderRNN

In [None]:
hidden_size = 512
vocab_size = len(data_loader.dataset.vocab)
decoder = DecoderRNN(embed_size, hidden_size, vocab_size)
decoder.to(device)
captions = captions.to(device)
outputs = decoder(features, captions)

print('type(outputs):', type(outputs))
print('expected output: [', batch_size, ',', captions.shape[1], ',', vocab_size, ']')
print('outputs.shape:', outputs.shape)

type(outputs): <class 'torch.Tensor'>
expected output: [ 10 , 14 , 9955 ]
outputs.shape: torch.Size([10, 14, 9955])


## Model Summary:

- A pre-trained ResNet-50 is used as the encoder, providing an embedding of size 256.
- An LSTM is used as the decoder, with an embedding size of 256 and a hidden size of 512, following the [paper](https://arxiv.org/pdf/1411.4555.pdf).
- The vocabulary threshold used is 4.