

## Project: Image captioning for visually impaired people

---

In this notebook, you will learn how to load and pre-process data from the [VizWiz dataset](https://vizwiz.org/tasks-and-datasets/image-captioning/). 

This notebook is divided into the following sections:
- [Step 1](#step1): Explore the Data Loader
- [Step 2](#step2): Use the Data Loader to Obtain Batches

<a id='step1'></a>
## Step 1: Exploring the Data Loader

We have already written a [data loader](http://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader) that you can use to load the dataset in batches. 

In the code cell below, you will initialize the data loader by using the `get_loader` function in **data_loader.py**.  

> For this project, you are not permitted to change the **data_loader.py** file, which must be used as-is.

The `get_loader` function takes as input a number of arguments that can be explored in **data_loader.py**.  Take the time to explore these arguments now by opening **data_loader.py** in a new window.  Most of the arguments must be left at their default values, and you are only allowed to amend the values of the arguments below:
1. **`transform`** - an [image transform](https://pytorch.org/tutorials/beginner/basics/transforms_tutorial.html) specifying how to pre-process the images and convert them to PyTorch tensors before using them as input to the CNN encoder.  For now, you are encouraged to keep the transform as provided in `transform_train`.  You will have the opportunity later to choose your own image transform to pre-process the images.
2. **`mode`** - one of `'train'` (loads the training data in batches) or `'test'` (for the test data). We will say that the data loader is in training or test mode, respectively.  While following the instructions in this notebook, please keep the data loader in training mode by setting `mode='train'`.
3. **`batch_size`** - determines the batch size.  When training the model, this is the number of image-caption pairs used to update the model weights in each training step.
4. **`vocab_threshold`** - the minimum number of times that a word must appear in the training captions before it is used as part of the vocabulary.  Words that have fewer than `vocab_threshold` occurrences in the training captions are considered unknown words. 
5. **`vocab_from_file`** - a Boolean that decides whether to load the vocabulary from file.  

We will describe the `vocab_threshold` and `vocab_from_file` arguments in more detail soon.  For now, run the code cell below.  Be patient - it may take a couple of minutes to run!

In [2]:
import sys
# !pip install nltk
import nltk
nltk.download('punkt')
from data_loader import get_loader
from torchvision import transforms

# Define a transform to pre-process the training images.
transform_train = transforms.Compose([ 
    transforms.Resize(256),                          # smaller edge of image resized to 256
    transforms.RandomCrop(224),                      # get 224x224 crop from random location
    transforms.RandomHorizontalFlip(),               # horizontally flip image with probability=0.5
    transforms.ToTensor(),                           # convert the PIL Image to a tensor
    transforms.Normalize((0.485, 0.456, 0.406),      # normalize image for pre-trained model
                         (0.229, 0.224, 0.225))])

# Set the minimum word count threshold.
vocab_threshold = 5

# Specify the batch size.
batch_size = 10

# Obtain the data loader.
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=True)

[nltk_data] Error loading punkt: <urlopen error [WinError 10054] An
[nltk_data]     existing connection was forcibly closed by the remote
[nltk_data]     host>
  0%|                                                                                       | 0/117155 [00:00<?, ?it/s]

Vocabulary successfully loaded from vocab.pkl file!
Obtaining caption lengths...


100%|████████████████████████████████████████████████████████████████████████| 117155/117155 [00:11<00:00, 9832.09it/s]


### `__getitem__` Method

The `__getitem__` method in the `vizWizDataset` class determines how an image-caption pair is pre-processed before being incorporated into a batch.  This is true for all `Dataset` classes in PyTorch.

When the data loader is in training mode, this method begins by first obtaining the filename (`path`) of a training image and its corresponding caption (`caption`).

#### Image Pre-Processing 
```python
# Convert image to tensor and pre-process using transform
image = Image.open(image_address).convert('RGB')
image = self.transform(image)
```
After loading the image in the training folder with name `path`, the image is pre-processed using the same transform (`transform_train`) that was supplied when instantiating the data loader.  
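
To make the last step of the transform concrete: `transforms.Normalize` computes `(value - mean) / std` per channel on the `[0, 1]` values produced by `ToTensor`.  The sketch below reproduces that arithmetic in plain Python for a single pixel.  The helper name `normalize_pixel` is hypothetical and is not part of **data_loader.py**.

```python
# Per-channel statistics passed to transforms.Normalize in transform_train.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Apply (value - mean) / std to one RGB pixel with values in [0, 1].

    Hypothetical helper that mirrors the arithmetic of transforms.Normalize.
    """
    return tuple((v - m) / s for v, m, s in zip(rgb, MEAN, STD))


black = normalize_pixel((0.0, 0.0, 0.0))
white = normalize_pixel((1.0, 1.0, 1.0))
print([round(v, 4) for v in black])
print([round(v, 4) for v in white])
```

With these statistics, values in a normalized image tensor fall roughly between -2.1 and 2.6, which is why the printed batches contain negative numbers.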

#### Caption Pre-Processing 

The captions also need to be pre-processed and prepped for training.

To understand in more detail how the captions are pre-processed, we'll first need to take a look at the `vocab` instance variable of the `vizWizDataset` class.  The code snippet below is taken from the `__init__` method of the `vizWizDataset` class:
```python
def __init__(self, transform, mode, batch_size, vocab_threshold, vocab_file, start_word, 
        end_word, unk_word, annotations_file, vocab_from_file, img_folder):
        ...
        self.vocab = Vocabulary(vocab_threshold, vocab_file, start_word,
            end_word, unk_word, annotations_file, vocab_from_file)
        ...
```
We use this instance to pre-process the captions (from the `__getitem__` method in the `vizWizDataset` class):

```python
# Convert caption to tensor of word ids.
tokens = nltk.tokenize.word_tokenize(str(caption).lower())   # line 1
caption = []                                                 # line 2
caption.append(self.vocab(self.vocab.start_word))            # line 3
caption.extend([self.vocab(token) for token in tokens])      # line 4
caption.append(self.vocab(self.vocab.end_word))              # line 5
caption = torch.Tensor(caption).long()                       # line 6
```


In [29]:
sample_caption = 'A person doing a trick on a rail while riding a skateboard.'

In **`line 1`** of the code snippet, every letter in the caption is converted to lowercase, and the [`nltk.tokenize.word_tokenize`](http://www.nltk.org/) function is used to obtain a list of string-valued tokens.

In [2]:
import nltk

sample_tokens = nltk.tokenize.word_tokenize(str(sample_caption).lower())
print(sample_tokens)


['a', 'person', 'doing', 'a', 'trick', 'on', 'a', 'rail', 'while', 'riding', 'a', 'skateboard', '.']


In **`line 2`** and **`line 3`** we initialize an empty list and append an integer to mark the start of a caption. 

This special start word (`"<start>"`) is decided when instantiating the data loader and is passed as a parameter (`start_word`).  You are **required** to keep this parameter at its default value (`start_word="<start>"`).

In [4]:
sample_caption = []

start_word = data_loader.dataset.vocab.start_word
print('Special start word:', start_word)
sample_caption.append(data_loader.dataset.vocab(start_word))
print(sample_caption)

Special start word: <start>
[0]


In **`line 4`**, we continue the list by adding integers that correspond to each of the tokens in the caption.

In [5]:
sample_caption.extend([data_loader.dataset.vocab(token) for token in sample_tokens])
print(sample_caption)

[0, 5, 122, 4497, 5, 2, 34, 5, 6728, 661, 4392, 5, 2, 14]


In **`line 5`**, we append a final integer to mark the end of the caption.  

As with the special start word (above), the special end word (`"<end>"`) is decided when instantiating the data loader and is passed as a parameter (`end_word`). 

In [6]:
end_word = data_loader.dataset.vocab.end_word
print('Special end word:', end_word)

sample_caption.append(data_loader.dataset.vocab(end_word))
print(sample_caption)

Special end word: <end>
[0, 5, 122, 4497, 5, 2, 34, 5, 6728, 661, 4392, 5, 2, 14, 1]


Finally, in **`line 6`**, we convert the list of integers to a PyTorch tensor and cast it to [long type](http://pytorch.org/docs/master/tensors.html#torch.Tensor.long). 

In [7]:
import torch

sample_caption = torch.Tensor(sample_caption).long().cuda()
print(sample_caption)

tensor([   0,    5,  122, 4497,    5,    2,   34,    5, 6728,  661, 4392,    5,
           2,   14,    1], device='cuda:0')


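In summary, any caption is converted to a list of tokens, with _special_ start and end tokens marking the beginning and end of the sentence:
```
[<start>, 'a', 'person', 'doing', 'a', 'trick', 'on', 'a', 'rail', 'while', 'riding', 'a', 'skateboard', '.', <end>]
```
This list of tokens is then turned into a list of integers, where every distinct word in the vocabulary has an associated integer value:
```
[0, 5, 122, 4497, 5, 2, 34, 5, 6728, 661, 4392, 5, 2, 14, 1]
```
Note that `'trick'` and `'skateboard'` both map to `2`, the integer reserved for unknown words (described below).

Finally, this list is converted to a PyTorch tensor.  All of the captions in the dataset are pre-processed using the same procedure from **`lines 1-6`** described above.  

As you saw, a token is converted to its integer by calling `data_loader.dataset.vocab` as a function.  This call is handled by the `__call__` method of the `Vocabulary` class: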
```python
def __call__(self, word):
    if not word in self.word2idx:
        return self.word2idx[self.unk_word]
    return self.word2idx[word]
```

The `word2idx` instance variable is a Python [dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries).
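
The lookup and the six pre-processing lines can be tried without the dataset.  The following is a minimal sketch under stated assumptions: `ToyVocabulary` and `encode_caption` are hypothetical names, the vocabulary is a toy one, `str.split` stands in for `nltk.tokenize.word_tokenize` so that the example runs without the `punkt` download, and the result is a plain Python list rather than a tensor.

```python
class ToyVocabulary:
    """Hypothetical stand-in for the project's Vocabulary class."""

    def __init__(self, words, start_word="<start>", end_word="<end>", unk_word="<unk>"):
        self.start_word = start_word
        self.end_word = end_word
        self.unk_word = unk_word
        # Special tokens take the first three ids (an assumption of this sketch).
        self.word2idx = {}
        for word in [start_word, end_word, unk_word, *words]:
            self.word2idx.setdefault(word, len(self.word2idx))

    def __call__(self, word):
        # Unknown words fall back to the id of the unknown token.
        return self.word2idx.get(word, self.word2idx[self.unk_word])

    def __len__(self):
        return len(self.word2idx)


def encode_caption(caption, vocab):
    """Lowercase, split on whitespace (assumption), and wrap with start/end ids."""
    tokens = str(caption).lower().split()
    ids = [vocab(vocab.start_word)]
    ids.extend(vocab(token) for token in tokens)
    ids.append(vocab(vocab.end_word))
    return ids


vocab = ToyVocabulary(["a", "person", "riding", "bike"])
print(encode_caption("A person riding a skateboard", vocab))
# [0, 3, 4, 5, 3, 2, 1]
```

Here `'skateboard'` is not in the toy vocabulary, so it is encoded as `2`, the id of `<unk>`.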

In [16]:
# Preview the word2idx dictionary.
dict(list(data_loader.dataset.vocab.word2idx.items())[:15])

{'<start>': 0,
 '<end>': 1,
 '<unk>': 2,
 'its': 3,
 'is': 4,
 'a': 5,
 'basil': 6,
 'leaves': 7,
 'container': 8,
 'contains': 9,
 'the': 10,
 'net': 11,
 'weight': 12,
 'too': 13,
 '.': 14}

We also print the total number of keys.

In [17]:
# Print the total number of keys in the word2idx dictionary.
print('Total number of tokens in vocabulary:', len(data_loader.dataset.vocab))

Total number of tokens in vocabulary: 7076


In **vocabulary.py**, the `word2idx` dictionary is created by looping over the captions in the training dataset.  If a token appears at least `vocab_threshold` times in the training set, then it is added as a key to the dictionary and assigned a corresponding unique integer.
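
The thresholding rule can be illustrated on a handful of captions.  This is only a sketch of the idea, not the code in **vocabulary.py**: `build_word2idx` is a hypothetical function, and it tokenizes on whitespace rather than with `nltk`.

```python
from collections import Counter


def build_word2idx(captions, vocab_threshold, specials=("<start>", "<end>", "<unk>")):
    """Map every token seen at least vocab_threshold times to a unique integer.

    Hypothetical sketch; tokenizes on whitespace rather than with nltk.
    """
    counts = Counter(token for caption in captions for token in caption.lower().split())
    # Reserve the first ids for the special tokens.
    word2idx = {word: idx for idx, word in enumerate(specials)}
    for word, count in counts.items():
        if count >= vocab_threshold and word not in word2idx:
            word2idx[word] = len(word2idx)
    return word2idx


captions = ["a red cup", "a blue cup", "a cup of tea"]
print(len(build_word2idx(captions, vocab_threshold=1)))  # 3 specials + 6 words = 9
print(len(build_word2idx(captions, vocab_threshold=2)))  # 3 specials + 'a', 'cup' = 5
```

Lowering the threshold can only keep or grow the vocabulary, since every token that passed the higher threshold also passes the lower one.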

In [18]:
# Modify the minimum word count threshold.
vocab_threshold = 4

# Obtain the data loader.
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=batch_size,
                         vocab_threshold=vocab_threshold,
                         vocab_from_file=False)   # rebuild the vocabulary so the new threshold takes effect

[0/117155] Tokenizing captions...
[100000/117155] Tokenizing captions...


  1%|▌                                                                          | 905/117155 [00:00<00:12, 9015.51it/s]

Obtaining caption lengths...


100%|████████████████████████████████████████████████████████████████████████| 117155/117155 [00:12<00:00, 9262.01it/s]


In [30]:
# Print the total number of keys in the word2idx dictionary.
print('Total number of tokens in vocabulary:', len(data_loader.dataset.vocab))

Total number of tokens in vocabulary: 8099


There is one more special token, corresponding to unknown words (`"<unk>"`).  All tokens that don't appear anywhere in the `word2idx` dictionary are considered unknown words.  In the pre-processing step, any unknown tokens are mapped to the integer `2`.

In [21]:
unk_word = data_loader.dataset.vocab.unk_word
print('Special unknown word:', unk_word)

print('All unknown words are mapped to this integer:', data_loader.dataset.vocab(unk_word))

Special unknown word: <unk>
All unknown words are mapped to this integer: 2


In [38]:
print(data_loader.dataset.vocab('tree'))
print(data_loader.dataset.vocab('hippo'))

926
2


The final thing to mention is the `vocab_from_file` argument that is supplied when creating a data loader. The vocabulary (`data_loader.dataset.vocab`) is saved as a [pickle](https://docs.python.org/3/library/pickle.html) file in the project folder, with filename `vocab.pkl`.

Note that if `vocab_from_file=True`, then any supplied argument for `vocab_threshold` when instantiating the data loader is completely ignored.
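
The load-or-build behaviour can be sketched as follows.  This is a hedged illustration under assumptions, not the actual logic of **vocabulary.py**: `load_or_build_vocab` and `build_fn` are hypothetical names, and a plain dictionary stands in for the vocabulary object.

```python
import os
import pickle
import tempfile


def load_or_build_vocab(vocab_file, vocab_from_file, build_fn):
    """Load a pickled vocabulary if requested and available, else build and save it.

    Hypothetical helper; build_fn stands in for the threshold-based construction.
    """
    if vocab_from_file and os.path.exists(vocab_file):
        with open(vocab_file, "rb") as f:
            return pickle.load(f)
    vocab = build_fn()
    with open(vocab_file, "wb") as f:
        pickle.dump(vocab, f)
    return vocab


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.pkl")
    built = load_or_build_vocab(path, False, lambda: {"<start>": 0, "<end>": 1})
    # The builder is ignored here because the file now exists.
    loaded = load_or_build_vocab(path, True, lambda: {"ignored": 0})
    print(loaded == built)  # True
```

In this sketch the builder is never called when the file is loaded, which matches the note above that `vocab_threshold` has no effect when `vocab_from_file=True`.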

In [39]:
# Obtain the data loader (from file). Note that it runs much faster than before!
data_loader = get_loader(transform=transform_train,
                         mode='train',
                         batch_size=1,
                         vocab_from_file=True)

Vocabulary successfully loaded from vocab.pkl file!


  1%|▌                                                                          | 802/117155 [00:00<00:14, 7964.52it/s]

Obtaining caption lengths...


100%|████████████████████████████████████████████████████████████████████████| 117155/117155 [00:14<00:00, 8221.42it/s]


<a id='step2'></a>
## Step 2: Using the Data Loader to Obtain Batches

The captions in the dataset vary greatly in length.  You can see this by examining `data_loader.dataset.caption_lengths`, a Python list with one entry for each training caption (where the value stores the length of the corresponding caption).  

In the code cell below, we use this list to print the total number of captions in the training data with each length.  As you will see below, the most common caption length is 10 (roughly a quarter of all captions), while very short and very long captions are quite rare.  

In [26]:
from collections import Counter

# Tally the total number of training captions with each length.
counter = Counter(data_loader.dataset.caption_lengths)
lengths = sorted(counter.items(), key=lambda pair: pair[1], reverse=True)
for value, count in lengths:
    print('value: %2d --- count: %5d' % (value, count))

value: 10 --- count: 30713
value: 11 --- count: 14578
value:  9 --- count: 12391
value: 12 --- count: 12089
value: 13 --- count:  9992
value: 14 --- count:  7471
value: 15 --- count:  5394
value:  8 --- count:  5093
value: 16 --- count:  4184
value: 17 --- count:  3104
value: 18 --- count:  2379
value: 19 --- count:  1865
value: 20 --- count:  1519
value: 21 --- count:  1152
value: 22 --- count:   937
value: 23 --- count:   709
value: 24 --- count:   622
value: 25 --- count:   474
value: 26 --- count:   390
value: 27 --- count:   311
value: 28 --- count:   237
value: 29 --- count:   184
value: 30 --- count:   152
value:  7 --- count:   142
value: 32 --- count:   131
value: 31 --- count:   119
value: 33 --- count:   103
value: 34 --- count:    88
value: 35 --- count:    77
value: 36 --- count:    68
value: 37 --- count:    66
value: 39 --- count:    42
value: 40 --- count:    39
value: 38 --- count:    38
value:  3 --- count:    33
value: 41 --- count:    33
value: 43 --- count:    33
...

To generate batches of training data, we sample a caption length (where the probability that any length is drawn is proportional to the number of captions with that length in the dataset).  Then, we retrieve a batch of size `batch_size` of image-caption pairs, where all captions have the sampled length.  This approach for assembling batches matches the procedure in [this paper](https://arxiv.org/pdf/1502.03044.pdf) and has been shown to be computationally efficient without degrading performance.
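
The sampling idea described above can be sketched in a few lines.  This is an illustration under assumptions rather than the body of `get_train_indices`: the function name `sample_same_length_indices` is hypothetical, and indices are drawn with replacement for simplicity.

```python
import random


def sample_same_length_indices(caption_lengths, batch_size, rng=random):
    """Pick a caption length with probability proportional to its frequency,
    then draw batch_size indices of captions that have exactly that length.

    Hypothetical sketch; indices may repeat when few captions share the length.
    """
    # Choosing a random caption and reading its length weights each length
    # by the number of captions that have it.
    sel_length = rng.choice(caption_lengths)
    candidates = [i for i, n in enumerate(caption_lengths) if n == sel_length]
    return [rng.choice(candidates) for _ in range(batch_size)]


caption_lengths = [10, 10, 10, 11, 11, 12, 9, 10]
indices = sample_same_length_indices(caption_lengths, batch_size=4, rng=random.Random(0))
print(indices, [caption_lengths[i] for i in indices])
```

Because every caption in the batch has the same length, the captions can be stacked into one tensor without padding.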

In [40]:
import numpy as np
import torch.utils.data as data

# Randomly sample a caption length, and sample indices with that length.
indices = data_loader.dataset.get_train_indices()
print('sampled indices:', indices)

# Create and assign a batch sampler to retrieve a batch with the sampled indices.
new_sampler = data.sampler.SubsetRandomSampler(indices=indices)
data_loader.batch_sampler.sampler = new_sampler

# Obtain the batch.
images, captions = next(iter(data_loader))
    
print('images.shape:', images.shape)
print('captions.shape:', captions.shape)

# (Optional) Print the pre-processed images and captions; comment these lines out to reduce output.
print('images:', images)
print('captions:', captions)

sampled indices: [19281]
images.shape: torch.Size([1, 3, 224, 224])
captions.shape: torch.Size([1, 21])
images: tensor([[[[-1.7069, -1.6898, -1.6727,  ..., -0.8335, -0.7650, -0.8164],
          [-1.6898, -1.7240, -1.7412,  ..., -0.7822, -0.7137, -0.7650],
          [-1.7069, -1.7583, -1.7412,  ..., -0.6965, -0.7137, -0.7479],
          ...,
          [-1.5014, -1.4500, -1.4158,  ...,  0.3138,  0.3309,  0.3138],
          [-1.5528, -1.4672, -1.4329,  ...,  0.3138,  0.3138,  0.3138],
          [-1.5357, -1.4672, -1.4329,  ...,  0.2967,  0.3138,  0.3309]],

         [[-1.7206, -1.7031, -1.6856,  ..., -0.7927, -0.7752, -0.8978],
          [-1.7031, -1.7206, -1.7206,  ..., -0.7577, -0.7227, -0.8627],
          [-1.6856, -1.7206, -1.7206,  ..., -0.6702, -0.7402, -0.8452],
          ...,
          [-1.6681, -1.6331, -1.5805,  ..., -0.3200, -0.2850, -0.3025],
          [-1.6856, -1.6155, -1.5980,  ..., -0.2850, -0.3025, -0.3025],
          [-1.6681, -1.6331, -1.6155,  ..., -0.2850, -0.3025, -0