# Introduction 👋🏻

This notebook aims to reproduce the paper [Show and Tell: A Neural Image Caption Generator](https://arxiv.org/pdf/1411.4555v2.pdf) by [Vinyals](mailto:vinyals@google.com) et al.

Generating a description of an image is called **image captioning**, and it is harder than it sounds. A description must capture not only the objects in an image, but also how those objects relate to each other, their attributes, and the activities they are involved in. Moreover, this semantic knowledge has to be expressed in a natural language like English, so a language model is needed in addition to visual understanding.

The authors propose a single joint model that takes an image $I$ as input and is trained to maximize the likelihood $p(S|I)$ of producing a target sequence of words $S = \{S_1, S_2, \ldots\}$ that describes the image adequately, where each word $S_t$ comes from a given dictionary. They replace the encoder RNN of a standard encoder-decoder model with a CNN, which transforms the image into a fixed-length vector representation that is then fed to the RNN decoder that generates the sentence.

The following hidden cell contains basic imports, random seeds, tokenizer instantiation and the Weights & Biases login.

## Packages

* [torch](https://pytorch.org/docs/stable/torch.html): The deep learning framework we'll use in this kernel
* [pandas](https://pandas.pydata.org/): To pre-process input data which is later converted into a PyTorch `Dataset` instance
* [transformers](https://github.com/huggingface/transformers): We'll use a pre-trained `BertTokenizer` for tokenising our captions. We could have also used `BPE`.
* [torchvision](https://pytorch.org/docs/stable/torchvision/index.html): We'll use the pre-trained `resnet50` from `torchvision.models` and the image transforms from `torchvision.transforms`
* [sklearn](https://scikit-learn.org/stable/): For splitting our raw dataset into train, valid and test splits.
* [PIL](https://pillow.readthedocs.io/en/stable/): For loading images before they are converted to PyTorch tensors

In [None]:
%%capture
!pip install --upgrade wandb

## Importing Packages

import os
import torch
import random
import warnings
import numpy as np
import transformers
import pandas as pd 
from PIL import Image
import torch.nn as nn
warnings.filterwarnings("ignore")
import torchvision.transforms as T
import torchvision.models as models
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from typing import Callable, Optional

## Logging into Weights and Biases
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
api_key = user_secrets.get_secret("WANDB_API_KEY")
import wandb
wandb.login(key=api_key);

wandb.init(project="show-and-tell", entity="collaborativeml")

## For Reproducibility
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
seed_everything(42)

## Tokenizer
tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

## Device Configuration 
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
## Basic File Paths
data_dir = '../input/flickr-image-dataset/flickr30k_images'
image_dir = f'{data_dir}/flickr30k_images'
csv_file = f'{data_dir}/results.csv'

# 🧹 Pre-Processing

## ✍️ Some Hardcoding

As pointed out by [@aritrag](https://www.kaggle.com/aritrag), the entry at index 19999 of the CSV is malformed. We therefore hardcode the values for that row.

In [None]:
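df = pd.read_csv(csv_file, delimiter='|')
# Column names in this CSV keep their leading space, e.g. ' comment'.
# Use .loc so the fix is written to `df` itself rather than to a temporary copy.
df.loc[19999, ' comment_number'] = ' 4'
df.loc[19999, ' comment'] = ' A dog runs across the grass .'
df['image_name'] = image_dir + '/' + df['image_name']
df.head(5)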

## 🏛 Restructuring Data

The raw CSV has one row per caption. The next code block reshapes it so that each image gets a single row with its five captions as columns:

| image_name | comment_0 | comment_1 | comment_2 | comment_3 | comment_4 |
|------------|-----------|-----------|-----------|-----------|-----------|
|            |           |           |           |           |           |

In [None]:
image_name = {
    'image_name':df[df[' comment_number'] == df[' comment_number'][0]]['image_name'].values,
}
comments = {
    'comment_0':df[df[' comment_number'] == df[' comment_number'][0]][' comment'].values,
    'comment_1':df[df[' comment_number'] == df[' comment_number'][1]][' comment'].values,
    'comment_2':df[df[' comment_number'] == df[' comment_number'][2]][' comment'].values,
    'comment_3':df[df[' comment_number'] == df[' comment_number'][3]][' comment'].values,
    'comment_4':df[df[' comment_number'] == df[' comment_number'][4]][' comment'].values,
}

image_name_df = pd.DataFrame.from_dict(image_name)
comments_df = pd.DataFrame.from_dict(comments)

df = pd.concat([image_name_df,comments_df], axis=1)
df.head(5)

## ✂️ Splitting into Train, Valid and Test

Split the data into train, validation and test splits using [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) from [`sklearn.model_selection`](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection). 

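We first hold out 20% of the data as the test set (`test_size=0.2`), then take 25% of the remaining 80% as the validation set (`test_size=0.25`). The result is roughly a 60/20/20 train/validation/test split.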
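
To make the two-stage arithmetic concrete, here is a small sketch that computes the resulting row counts without touching the data. It assumes the held-out part is rounded up, which is how `train_test_split` treats a float `test_size` to my understanding, and it assumes 31,783 rows, the image count usually quoted for Flickr30k (the notebook itself never states it). `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_rows, test_frac=0.2, val_frac=0.25):
    """Return (train, val, test) row counts for the two-stage split."""
    # Assumption: the held-out part is rounded up and the rest stays in training.
    n_test = math.ceil(test_frac * n_rows)
    n_remaining = n_rows - n_test
    n_val = math.ceil(val_frac * n_remaining)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(31_783)  # assumed row count
print(n_train, n_val, n_test)  # → 19069 6357 6357
```

Under these assumptions the training count agrees with the `19069` that appears in the training setup further down.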

In [None]:
## Obtain Train and Test Split 
train, test = train_test_split(df, test_size=0.2, random_state=42)

## Reset Indexes 
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

## Obtain Train and Validation Split 
train, val = train_test_split(train, test_size=0.25, random_state=42)

## Reset Indexes 
train = train.reset_index(drop=True)
val = val.reset_index(drop=True)

## Let's see how many entries we have
print(train.shape)
print(val.shape)
print(test.shape)

# 🗄 Dataset

The following code cell aims to convert the Flickr dataset into a torch [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) object. 

All `Dataset` objects in PyTorch represent a map from keys to data samples. We create a subclass that overrides the `__getitem__()` and `__len__()` methods. We also provide an option to apply transforms to the image using `torchvision.transforms`.

---

Each element of our dataset returns:

* Image (single image)
* Captions (tokenized; each image has 5 captions, of which the code below uses only the first)


---

Here, we create the `FlickrDataset` class.

We inherit from the [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) class, which is an abstract class. A subclass of `Dataset` must override two methods, `__getitem__()` and `__len__()`, to work well with the [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader).

We take as input our dataframe and a `transforms` argument (the notebook passes `True`). Feel free to edit the transform pipeline and experiment!

In the `__getitem__()` method, we use `df.<column_name>.values[]` to get the image path and then use `PIL` to open the image in `RGB` format. We apply the transforms to the image, then extract the first caption for that image and encode it with the `BertTokenizer`, padding it to a fixed length of 100 tokens. Finally, we return a dictionary of the form `{"image": image, "captions": input_ids}`, where both values are `torch.Tensor` objects.

In the first version of this kernel, I created an empty nested list and filled it by encoding each caption with the `.encode()` function, but that resulted in an output shape of `[5,500]` without batching. After going over [this](https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/) blog post, I figured out how to encode the entire list in one call, resulting in the desired shape of `[<batch_size>,100]`. For this tutorial, though, we'll just use the first caption for each entry.
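
The fixed length matters because the default collate step stacks samples into one tensor, which only works when every caption has the same number of token ids. Below is a minimal pure-Python sketch of what padding to `max_length` (together with truncation) achieves. `pad_or_truncate`, the token ids and `pad_id=0` are illustrative assumptions, not the tokenizer's actual implementation.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Force a list of token ids to exactly `max_length` entries."""
    clipped = token_ids[:max_length]                   # truncate long captions
    padding = [pad_id] * (max_length - len(clipped))   # fill up short ones
    return clipped + padding

# Hypothetical token ids for two captions of different lengths.
short_caption = [101, 1037, 3899, 102]
long_caption = list(range(1, 13))

batch = [pad_or_truncate(ids, max_length=8) for ids in (short_caption, long_caption)]
print([len(row) for row in batch])  # → [8, 8]
```

With every row the same length, the batch can be stacked into a rectangular `[<batch_size>, max_length]` tensor.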

In [None]:
class FlickrDataset(Dataset):
    
    def __init__(self, df, transforms = None) -> None:
        self.df = df
        # Default pipeline: tensor conversion, normalisation, fixed 256x256 size.
        default_transforms = T.Compose([
            T.ToTensor(),
            T.Normalize(mean = [0.5], std = [0.5]),
            T.Resize((256,256)),
        ])
        # A custom callable replaces the default. Any other value (the notebook
        # passes `True`) keeps it, because batching needs fixed-size tensors.
        self.transforms = transforms if callable(transforms) else default_transforms
        
    def __len__(self) -> int:
        return len(self.df)
    
    def __getitem__(self, idx: int):
        
        image_path = self.df.image_name.values[idx]
        image = Image.open(image_path).convert('RGB')
        image = self.transforms(image)
            
        # Only the first caption is used. The row is read directly by position
        # instead of scanning the whole dataframe for a matching image name.
        caption = self.df.comment_0.values[idx]
        encoded_inputs = tokenizer(caption,
                                   return_token_type_ids = False, 
                                   return_attention_mask = False, 
                                   max_length = 100, 
                                   padding = "max_length",
                                   truncation = True,  # keep every caption at exactly 100 ids
                                   return_tensors = "pt")
        
        sample = {"image": image.to(device), "captions": encoded_inputs["input_ids"].flatten().to(device)}
        
        return sample

The dataset size is not a multiple of the batch size, so the final batch of each epoch would be smaller than the rest. A decoder whose initial hidden state is built for a fixed batch size then fails with a shape mismatch on that batch. We therefore set the `drop_last` parameter to discard the incomplete batch and avoid errors while training. [This](https://discuss.pytorch.org/t/runtimeerror-expected-hidden-0-size-2-20-256-got-2-50-256/38288/10) discussion post has a nice introduction to this problem.
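
As a quick check of what `drop_last` changes, the sketch below counts the batches for a training set of 19,069 samples, the size implied by the `19069 // 32` in the training setup later in this notebook. `count_batches` is a hypothetical helper written for illustration; it mirrors the arithmetic, not the `DataLoader` internals.

```python
def count_batches(n_samples, batch_size, drop_last):
    """Return (number of batches, size of the final batch)."""
    full_batches, leftover = divmod(n_samples, batch_size)
    if leftover == 0 or drop_last:
        # Every batch is full, or the incomplete one is discarded.
        return full_batches, batch_size
    return full_batches + 1, leftover

print(count_batches(19_069, 32, drop_last=False))  # → (596, 29)
print(count_batches(19_069, 32, drop_last=True))   # → (595, 32)
```

Dropping the last batch costs 29 samples per epoch here, which is a small price for uniform batch shapes.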



In [None]:
batch_size = 32

train_dataset = FlickrDataset(train, transforms = True)
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size = batch_size, drop_last=True)

val_dataset = FlickrDataset(val, transforms = True)
val_dataloader = torch.utils.data.DataLoader(val_dataset, batch_size = batch_size,drop_last=True)

test_dataset = FlickrDataset(test, transforms = True)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size = batch_size,drop_last=True)

# 🛠 The Model (NIC)

## 👁 CNN Encoder (Show)

We'll use a `resnet50` backbone as the encoder part of our model. We create a custom `CNN` class which inherits from the `nn.Module` class. It takes a parameter `embed_size` and replaces the backbone's classification head with a fully-connected layer whose output dimension is `embed_size`. The pre-trained backbone itself is frozen, so only this new layer is trained.

In [None]:
class CNN(nn.Module):
    
    def __init__(self, embed_size):
        super(CNN, self).__init__()
        model = models.resnet50(pretrained=True)
        for param in model.parameters():
            param.requires_grad_(False)
        
        modules = list(model.children())[:-1]
        self.model = nn.Sequential(*modules)
        self.embed = nn.Linear(model.fc.in_features, embed_size)
        
    def forward(self, image):
        features = self.model(image)
        features = features.view(features.size(0), -1)
        features = self.embed(features)
                
        return features

## 📚 RNN Decoder

We create a custom `RNN` class which inherits from the `nn.Module` class. 

* During the forward pass, we first create the initial hidden and cell states as a tuple. The hidden state is initialised to a zero tensor of shape `(1, <batch_size>, <hidden_size>)`, and the cell state is the feature vector produced by the final layer of the CNN encoder. [This](https://discuss.pytorch.org/t/tuple-object-has-no-attribute-size-in-lstm-but-not-in-rnn/90307) post was helpful in figuring out how to make this kind of system work.

* Then we use the `nn.Embedding` layer, a simple lookup table that stores our embeddings, to map the token ids of the ground-truth captions to dense vectors.

* We then pass the embeddings into our `nn.LSTM` layer together with the previously initialised states.

* Lastly, we pass the output of the LSTM into a fully connected layer with output dimension `vocab_size` and return the result.
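
The shapes in the steps above are easy to get wrong, so here is a small standalone sketch of just the LSTM part. It uses random tensors as stand-ins for the CNN features and the embedded captions, and deliberately tiny sizes; none of the variable names below come from the notebook's classes.

```python
import torch
import torch.nn as nn

batch_size, seq_len, hidden_size = 4, 7, 16  # small illustrative sizes

lstm = nn.LSTM(input_size=hidden_size, hidden_size=hidden_size, batch_first=True)

features = torch.randn(batch_size, hidden_size)             # stand-in for the CNN output
embeddings = torch.randn(batch_size, seq_len, hidden_size)  # stand-in for embedded captions

# The states are always (num_layers, batch, hidden), even with batch_first=True.
h0 = torch.zeros(1, batch_size, hidden_size)
c0 = features.unsqueeze(0)  # image features become the initial cell state

lstm_out, (h_n, c_n) = lstm(embeddings, (h0, c0))
print(lstm_out.shape)  # torch.Size([4, 7, 16])
```

Because `h0` is sized from `batch_size` here, the same code works for any batch size, including a smaller final batch.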

In [None]:
class RNN(nn.Module):
    
    def __init__(self, input_size, hidden_size, embedding_dim,vocab_size):
        super(RNN, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.embedding_dim = embedding_dim
        self.vocab_size = vocab_size
        
        self.embedding = nn.Embedding(num_embeddings = vocab_size,embedding_dim = embedding_dim)
        
        self.lstm = nn.LSTM(input_size=input_size,
                            hidden_size=hidden_size,
                            batch_first=True)
        
        self.fc = nn.Linear(hidden_size, vocab_size)
        
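    def init_hidden(self, features):
        # Hidden state: zeros sized from the incoming batch rather than a fixed 32.
        # Cell state: the CNN features, which must have `hidden_size` dimensions.
        hidden = torch.zeros(1, features.size(0), self.hidden_size, device=features.device)
        return (hidden, features.unsqueeze(0))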
        
    def forward(self, features, captions):
        
        state = self.init_hidden(features)
        
        embed = self.embedding(captions)
                    
        lstm_out, state = self.lstm(embed, state)
                        
        outputs = self.fc(lstm_out)
        outputs = outputs.view(-1, self.vocab_size)
        
        return outputs

## ➡️ Example Forward Pass

Here, we extract an example batch from our `train_dataloader` and view how the shapes of our images and captions change as they pass through the model.

In [None]:
example_batch = next(iter(train_dataloader))

image, captions = example_batch["image"], example_batch["captions"]

encoder = CNN(embed_size = 512).to(device)
# bert-base-uncased has 30,522 tokens; a smaller embedding table would fail on high token ids
decoder = RNN(input_size = 512, hidden_size = 512, embedding_dim=512, vocab_size = tokenizer.vocab_size).to(device)

features = encoder(image)
outputs = decoder(features, captions)  # logits over the vocabulary, one row per token

print("Image Transformation: ", image.shape, " --> ", features.shape)
print("Captions Transformation: ", captions.shape, " --> ", outputs.shape)

# 📖 Some Theory

Our goal with this method is to maximize the probability of the correct description given an input image.

$$
\theta^{*} = \arg\max_{\theta} \sum_{(I, S)} \log p(S | I ; \theta)
$$

Here, 

* $\theta$ -> Parameters of our model
* $I$ -> Image
* $S$ -> Sentence
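
Since $S$ can be a sentence of any length, the paper models this log-probability with the chain rule, one word at a time. Writing $N$ for the length of the sentence and dropping $\theta$ for readability, the decomposition reads:

$$
\log p(S | I) = \sum_{t=0}^{N} \log p(S_t | I, S_0, \ldots, S_{t-1})
$$

Each term conditions on the image and on all the words that came before, which is the kind of dependency a recurrent network's hidden state is meant to carry.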

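The CNN just serves as an encoder which compresses our image into a fixed-length vector representation. 

For the RNN, the paper feeds the image in once as the very first input, followed by the embedded words:

$$
x_{-1} = \mathrm{CNN}(I)
$$

$$
x_t = W_e S_t, \quad t \in \{0, \ldots, N-1\}
$$

$$
p_{t+1} = \mathrm{LSTM}(x_t), \quad t \in \{0, \ldots, N-1\}
$$

Here $W_e$ is the word embedding matrix and $p_{t+1}$ is the predicted distribution over the next word. Note that the `RNN` class above departs from this slightly: it passes the CNN features in as the initial cell state of the LSTM instead of as the input $x_{-1}$.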
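
The index shift in the last equation means that the decoder reads word $t$ and is scored on word $t+1$. A minimal sketch of that input/target pairing, using a made-up caption with hypothetical `<start>` and `<end>` markers (the helper name is also an assumption):

```python
def make_teacher_forcing_pair(token_ids):
    """Split one caption into (decoder inputs, prediction targets)."""
    inputs = token_ids[:-1]   # S_0 ... S_{N-1}: what the decoder reads
    targets = token_ids[1:]   # S_1 ... S_N: what it should predict next
    return inputs, targets

# Hypothetical caption: start token, three words, end token.
caption = ["<start>", "a", "dog", "runs", "<end>"]
inputs, targets = make_teacher_forcing_pair(caption)

for x, y in zip(inputs, targets):
    print(f"{x!r} -> {y!r}")
```

If inputs and targets were not shifted against each other, the model could score perfectly by copying its input, without learning anything about the image.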

## Loss Function

The paper uses the negative log-likelihood of the correct word at each step: 

$$
L(I, S) = - \sum_{t=1}^{N} \log p_t(S_t)
$$

This loss is minimized w.r.t. all the parameters of our RNN decoder and the last fully connected layer of the CNN encoder. We use the `Adam` optimizer with an arbitrarily chosen learning rate of `0.001`.
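
To see the loss formula at work, here is a tiny worked example in plain Python. The three-word vocabulary, the probabilities and the helper name `caption_nll` are all made up for illustration.

```python
import math

def caption_nll(step_probs, target_ids):
    """Negative log-likelihood of the target words, summed over time steps."""
    return -sum(math.log(probs[target]) for probs, target in zip(step_probs, target_ids))

# Hypothetical 3-word vocabulary and a 2-step caption.
step_probs = [
    [0.7, 0.2, 0.1],  # p_1: distribution over the vocabulary at step 1
    [0.1, 0.8, 0.1],  # p_2: distribution at step 2
]
target_ids = [0, 1]   # index of the correct word at each step

loss = caption_nll(step_probs, target_ids)
print(round(loss, 4))  # → 0.5798, i.e. -(log 0.7 + log 0.8)
```

`nn.CrossEntropyLoss` computes the same quantity from raw logits, except that by default it averages over the scored positions instead of summing them.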

In [None]:
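%%capture

# Match the tokenizer so every token id has an embedding row (30,522 for bert-base-uncased)
vocab_size = tokenizer.vocab_size
steps_per_epoch = len(train_dataloader)  # originally hardcoded as 19069 // 32

encoder = CNN(embed_size = 512).to(device)
decoder = RNN(input_size = 512, hidden_size = 512, embedding_dim=512, vocab_size = vocab_size).to(device)

# Cross-entropy on the logits is the negative log-likelihood of the correct word
criterion = nn.CrossEntropyLoss().to(device)
# Only the decoder and the encoder's final linear layer are trained
params = list(decoder.parameters()) + list(encoder.embed.parameters())

optimizer = torch.optim.Adam(params, lr=0.001)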

# 🏋️ Training

We'll train the models (encoder and decoder) for 10 epochs. In the next update we'll perform hyperparameter optimization using [wandb sweeps](https://docs.wandb.ai/sweeps). To avoid a huge output window, only the latest metrics are printed. In the next update I'll also include links to the wandb dashboard used for this project.

In [None]:
for epoch in range(10):

    for idx, sample in enumerate(train_dataloader):
        
        if idx > steps_per_epoch:
            break
        
        # The dataset already returns tensors, so no torch.tensor() wrapper is needed
        image, captions = sample['image'].to(device), sample['captions'].to(device)
        
        # Teacher forcing: the decoder reads tokens 0..N-1 and is scored on tokens 1..N.
        # Scoring the output against its own input would only teach it to copy.
        decoder_inputs, targets = captions[:, :-1], captions[:, 1:]
        
        # zero the parameter gradients
        decoder.zero_grad()
        encoder.zero_grad()
        
        # Forward pass
        features = encoder(image)
        outputs = decoder(features, decoder_inputs)
        
        # Compute the Loss (reshape, since the sliced targets are not contiguous)
        loss = criterion(outputs.view(-1, vocab_size), 
                         targets.reshape(-1))
        
        # Backward pass.
        loss.backward()
        
        # Update the parameters in the optimizer.
        optimizer.step()
            
        # Get training statistics.
        stats = 'Epoch [%d], Loss: %.4f' % (epoch, loss.item())
        wandb.log({"Loss": loss.item()})
        print('\r' + stats, end="")