## Experiment 3
### Show, Attend and Tell Network with VGG19 Encoder, LSTM Decoder with Soft Attention and with Scheduled Sampling for Teacher Forcing

This notebook covers the training, validation and results of an experiment built on the Show, Attend and Tell network.

__CONFIGURATION:__  
1. Encoder: VGG19 (Dimension = 512)
2. Decoder: LSTMCell + Deterministic Soft Attention
3. Teacher Forcing for Decoder Training: Enabled for 50% of batches (tf_ratio = 0.5)
4. Optimizer: Adam
5. Loss function: Cross Entropy Loss
6. Regularization Constant (alpha_c): 1
7. Learning Rate: 4e-4
8. Step size for learning rate annealing: 5
9. Batch Size: 32
10. Number of epochs: 3
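
Item 3 means that, for a given training batch, the decoder is fed the ground-truth caption tokens with probability `tf_ratio` and its own previous predictions otherwise. The following is a minimal sketch of that per-batch decision, assuming a simple coin flip. The actual logic lives in `models.Decoder` and is not shown in this notebook; `use_teacher_forcing` and `next_input` are hypothetical names used only for illustration.

```python
import random


def use_teacher_forcing(tf_ratio, rng):
    """Return True with probability tf_ratio (one draw per batch)."""
    return rng.random() < tf_ratio


def next_input(ground_truth_token, predicted_token, teacher_forced):
    """Pick the token that is fed to the decoder at the next time step."""
    return ground_truth_token if teacher_forced else predicted_token


rng = random.Random(0)  # seeded only to make the sketch reproducible
decisions = [use_teacher_forcing(0.5, rng) for _ in range(10_000)]
forced_fraction = sum(decisions) / len(decisions)
print(f"fraction of teacher-forced batches: {forced_fraction:.3f}")
```

With `tf_ratio = 0.5`, roughly half of the batches are teacher-forced; `tf_ratio = 1.0` would recover plain teacher forcing.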

In [1]:
%matplotlib notebook
import json
import torch
import torch.utils.data as td
import torch.nn as nn
import torch.optim as optim
import os
from matplotlib import pyplot as plt

### Creation of the training, validation datasets and the word dictionary

In [2]:
from dataset import COCO14Dataset

data_root_dir = 'data/coco'

train_set = COCO14Dataset(None, data_root_dir) 
val_set = COCO14Dataset(None, data_root_dir, mode='val')

word_dict = json.load(open(data_root_dir + '/word_dict.json', 'r'))
vocabulary_size = len(word_dict)

### Verification of correct loading of the training dataset and corresponding captions

In [3]:
from utils import myimshow, generate_caption

h = myimshow(train_set[10][0])
print('Correct Captions:')
for i in range(5):
    print(generate_caption(train_set[10+i][1], word_dict))


Correct Captions:
A big red telephone booth that a man is standing in.
A person standing inside of a phone booth.
This is an image of a man in a phone booth.
A man is standing in a red phone booth.
A man using a phone in a phone booth.


### Verification of correct loading of the validation dataset and corresponding captions

In [4]:
h = myimshow(val_set[10][0])
print('Correct Captions:')
for i in range(5):
    print(generate_caption(val_set[10][2][i], word_dict))


Correct Captions:
A little girl holding a kitten next to a blue fence.
Girl in a tank top holding a kitten in her back yard.
A young girl is holding a small cat.
Girl with a yellow shirt holding a small cat.
A girl smiles as she holds a kitty cat.


### Creation of all Dependencies for Experiment 3

In [5]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"device = {device}")

device = cuda


In [6]:
from models import Encoder, Decoder

network = 'vgg19'
tf_ratio = 0.5      # Teacher forcing is used for 50% of the batches
lr = 4e-4           # Learning rate
batch_size = 32     # Batch size for mini-batch gradient descent
step_size = 5       # Step size for learning rate annealing
alpha_c = 1         # Regularization constant
start_epoch = 1     # Starting epoch for the experiment
log_interval = 100  # Frequency for logging statistics

log_filename = 'logs/log_tf_ss.txt'

In [7]:
encoder = Encoder(network).to(device)
decoder = Decoder(device, vocabulary_size, encoder.dim, tf_ratio = tf_ratio).to(device) 
optimizer = optim.Adam(decoder.parameters(), lr=lr)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size)
cross_entropy_loss = nn.CrossEntropyLoss().to(device)

# Create dataloaders for the training and validation set
train_loader = td.DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=1)
val_loader = td.DataLoader(val_set, batch_size=batch_size, shuffle=True, num_workers=1)

In [8]:
# Creation of the log file for logging statistics
log_file = open(log_filename, 'a')

### Creation of Experiment 3

In [9]:
from experiment import Experiment

exp3 = Experiment(start_epoch, encoder, decoder, optimizer, cross_entropy_loss, train_loader, val_loader, 
                  word_dict, alpha_c, log_file, log_interval, device)
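
In Show, Attend and Tell, the regularization constant weights a penalty that encourages the attention paid to each image location to sum to roughly 1 over the whole caption. The sketch below computes that penalty on plain lists. It assumes `Experiment` uses `alpha_c` in this way (its implementation is not shown here), and `attention_penalty` is a hypothetical helper. Some implementations average over locations instead of summing.

```python
def attention_penalty(alphas, alpha_c):
    """Doubly stochastic attention penalty.

    alphas[t][i] is the attention weight on image location i at time step t.
    """
    num_locations = len(alphas[0])
    total = 0.0
    for i in range(num_locations):
        # Total attention that location i received across all time steps
        weight_over_time = sum(step[i] for step in alphas)
        total += (1.0 - weight_over_time) ** 2
    return alpha_c * total


# Two time steps, two locations: each location receives a total weight of 1.
balanced = [[0.5, 0.5], [0.5, 0.5]]
# The second location is ignored at every time step.
skewed = [[1.0, 0.0], [1.0, 0.0]]

print(attention_penalty(balanced, alpha_c=1))
print(attention_penalty(skewed, alpha_c=1))
```

The penalty is added to the cross-entropy loss, so a caption that never looks at part of the image costs more than one that spreads its attention.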

If a partially trained model exists and should be loaded into the experiment, run the following cell with ``curr_model_path`` set to the full path of the saved model.

In [10]:
curr_model_path = None    # Path of the checkpoint to load, e.g. 'models/model.pth.tar'
exp3.load(curr_model_path)

### Run training and validation for Experiment 3

In [None]:
# Directory for storing checkpoints during training
os.makedirs('models/model_tf_ss', exist_ok=True)

# Number of epochs for which training is to be performed
epochs = 3

print(f'Starting training from epoch {exp3.start_epoch} for {epochs - exp3.start_epoch + 1} epochs.')

for epoch in range(exp3.start_epoch, epochs + 1):
    model_file = 'models/model_tf_ss/model_' + network + '_' + str(epoch) + '.pth.tar'
    exp3.train()                  # Perform training on the complete dataset in batches
    exp3.validate(epoch)          # Perform validation after every epoch
    scheduler.step()  
    exp3.save(epoch, model_file)  # Save the current state of the model
    print('Saved model to ' + model_file)

log_file.close()
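
`StepLR` multiplies the learning rate by `gamma` (0.1 by default) after every `step_size` calls to `scheduler.step()`. The sketch below reproduces that closed-form schedule in plain Python to show what the loop above does to the learning rate; `step_lr` is a hypothetical helper, not part of PyTorch.

```python
def step_lr(initial_lr, step_size, steps_taken, gamma=0.1):
    """Learning rate after `steps_taken` calls to scheduler.step()."""
    return initial_lr * gamma ** (steps_taken // step_size)


for epoch in range(1, 4):
    # When epoch k starts, scheduler.step() has been called k - 1 times.
    print(f"epoch {epoch}: lr = {step_lr(4e-4, 5, epoch - 1)}")
```

With `step_size = 5` and only 3 epochs, the first decay is never reached, so the learning rate stays at 4e-4 for the whole run.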

### Analysis of the Training Statistics for Experiment 3

In [11]:
from plot import processing_logfile, plot

# Process the log file and get the necessary training and validation statistics data
train, validate, bleu = processing_logfile('logs/log_tf_ss.txt')

# Plot the training statistics
fig, axes = plt.subplots(nrows=3, figsize=(8, 10))
ylims = [(0.75, 3), (65, 95), (70, 100)]
plot(fig, axes, train, 'Training', ylims)


### Analysis of Validation Statistics for Experiment 3

In [12]:
# Plot the validation statistics
fig, axes = plt.subplots(nrows=3, figsize=(8, 10))
ylims = [(1.4, 1.8), (78, 83), (84, 90)]
plot(fig, axes, validate, 'Validation', ylims)


### BLEU Scores for Experiment 3

In [13]:
# BLEU scores for the last epoch (epoch 3, stored at index 2)
print(f'BLEU 1 Score on Validation Set: {bleu[2][0]}')
print(f'BLEU 2 Score on Validation Set: {bleu[2][1]}')
print(f'BLEU 3 Score on Validation Set: {bleu[2][2]}')
print(f'BLEU 4 Score on Validation Set: {bleu[2][3]}')

BLEU 1 Score on Validation Set: 0.6049254551526239
BLEU 2 Score on Validation Set: 0.3701483922765033
BLEU 3 Score on Validation Set: 0.23920169336199137
BLEU 4 Score on Validation Set: 0.15159516033377676
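
BLEU-n combines clipped n-gram precisions for n-gram orders 1 to n through a geometric mean and multiplies the result by a brevity penalty for captions shorter than their references. How this project computes the scores above is not shown in the notebook, so the following is only a sketch of the standard corpus-level definition with uniform weights and no smoothing; `ngrams` and `corpus_bleu` are illustrative names.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n):
    """references[k] is the list of reference token lists for hypotheses[k]."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = 0
    ref_len = 0
    for refs, hyp in zip(references, hypotheses):
        hyp_len += len(hyp)
        # Closest reference length (ties go to the shorter reference)
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            max_ref = Counter()
            for r in refs:
                max_ref |= ngrams(r, n)      # element-wise maximum
            clipped = hyp_counts & max_ref   # element-wise minimum
            matches[n - 1] += sum(clipped.values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


refs = [[["a", "man", "in", "a", "phone", "booth"]]]
hyps = [["a", "man", "in", "a", "booth"]]
bleu1 = corpus_bleu(refs, hyps, max_n=1)
bleu2 = corpus_bleu(refs, hyps, max_n=2)
print(f"BLEU-1 = {bleu1:.4f}, BLEU-2 = {bleu2:.4f}")
```

Higher-order scores can only stay equal or drop as longer n-grams are required to match, which is consistent with the decreasing BLEU 1 to BLEU 4 values reported above.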


### Analysis of Captions Produced by the Trained Network on the Testing Dataset

In [14]:
from generate_visualization import generate_caption_visualization, generate_image_caption

# Rebuild the network and load the trained decoder weights from the checkpoint
encoder = Encoder(network)
decoder = Decoder(device, vocabulary_size, encoder.dim, tf_ratio = tf_ratio)

trained_model_path = 'models/model_tf_ss.pth.tar'
decoder.load_state_dict(torch.load(trained_model_path)['state_dict'])

# Set the encoder and decoder in evaluation mode to get captions for testing data
encoder.eval()
decoder.eval()

test_img_paths = json.load(open(data_root_dir + '/test_img_paths.json'))

### Correctly Captioned Images

In [15]:
fig, axes = plt.subplots(nrows=3, figsize=(7,10))

generate_image_caption(encoder, decoder, data_root_dir +'/imgs'+test_img_paths['38822'], word_dict, beam_size=3, ax=axes[0])
generate_image_caption(encoder, decoder, data_root_dir +'/imgs'+test_img_paths['27450'], word_dict, beam_size=3, ax=axes[1])
generate_image_caption(encoder, decoder, data_root_dir +'/imgs'+test_img_paths['16375'], word_dict, beam_size=3, ax=axes[2])
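
The `beam_size=3` argument refers to beam search, which keeps the best few partial captions at each step instead of committing to the single most likely word. The internals of `generate_image_caption` are not shown in this notebook, so the following is a generic sketch over a toy next-word table; `beam_search`, `TOY_MODEL` and `toy_next_log_probs` are hypothetical names.

```python
import math


def beam_search(next_log_probs, start, end, beam_size, max_len):
    """Return the highest-scoring token sequence found.

    next_log_probs(seq) maps each candidate next token to its log-probability.
    """
    beams = [(0.0, [start])]
    finished = []
    for _ in range(max_len):
        # Extend every partial sequence by every possible next token
        candidates = [
            (score + log_p, seq + [token])
            for score, seq in beams
            for token, log_p in next_log_probs(seq).items()
        ]
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_size]:
            if seq[-1] == end:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    pool = finished if finished else beams
    return max(pool, key=lambda c: c[0])[1]


# Toy next-word probabilities that depend only on the previous word
TOY_MODEL = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "the": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def toy_next_log_probs(seq):
    return {tok: math.log(p) for tok, p in TOY_MODEL[seq[-1]].items()}


greedy = beam_search(toy_next_log_probs, "<s>", "</s>", beam_size=1, max_len=5)
wider = beam_search(toy_next_log_probs, "<s>", "</s>", beam_size=3, max_len=5)
print(" ".join(greedy))
print(" ".join(wider))
```

In this toy table the greedy search (`beam_size=1`) commits to "a" and ends with a sequence of probability 0.30, while the wider beam finds "the cat" with probability 0.36.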


### Incorrectly Captioned Images

In [16]:
fig, axes = plt.subplots(nrows=3, figsize=(7,10))

generate_image_caption(encoder, decoder, data_root_dir +'/imgs'+test_img_paths['27555'], word_dict, beam_size=3, ax=axes[0])
generate_image_caption(encoder, decoder, data_root_dir +'/imgs'+test_img_paths['31000'], word_dict, beam_size=3, ax=axes[1])
generate_image_caption(encoder, decoder, data_root_dir +'/imgs'+test_img_paths['16300'], word_dict, beam_size=3, ax=axes[2])


In general, the predictions of this model are simpler than those of the model trained using teacher forcing. In addition, this training regime does not always yield complete sentences for every image.

### Visualization of Attention and the Corresponding Caption Generation

In [17]:
# Visualization for a correctly captioned image

generate_caption_visualization(encoder, decoder, data_root_dir +'/imgs'+test_img_paths['38822'], word_dict, beam_size=3)



This visualization shows how the attention module lets the network focus on different elements of the image as it generates the caption. In the visualization above, the model starts by focusing on the man's face, from which it determines that the subject is a man. It then attends to the man's immediate neighborhood to identify that he is sitting with a table in front of him. Finally, the focus moves to the device on the table, which is captioned as a laptop. The model captions the image correctly but very simply: it says little about what the man is doing with the laptop. On the other hand, it uses no unnecessary words, so the description is concise and clear.
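
The attention maps in such a visualization are the soft attention weights: at every decoding step the scores over image locations are passed through a softmax, and the context vector is the weighted sum of the location features. The sketch below shows that computation on plain lists. It takes the scores as given, since the scoring network inside `models.Decoder` is not shown here; `softmax` and `soft_attention` are illustrative names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(features, scores):
    """Deterministic soft attention.

    features: one feature vector per image location
    scores:   one relevance score per image location
    """
    weights = softmax(scores)
    dim = len(features[0])
    # Context vector = attention-weighted sum of the location features
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return context, weights


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = soft_attention(features, scores=[0.0, 0.0, 0.0])
focused_context, focused_weights = soft_attention(features, scores=[10.0, 0.0, 0.0])
print(weights, context)
print(focused_weights, focused_context)
```

Equal scores give a plain average of the features, while one dominant score makes the context vector close to the feature of that single location, which is what a bright spot in the attention map corresponds to.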

In [18]:
# Visualization for an incorrectly captioned image

generate_caption_visualization(encoder, decoder, data_root_dir +'/imgs'+test_img_paths['27555'], word_dict, beam_size=3)


For this incorrectly captioned image, the model correctly identifies the woman and the water in the background. However, it labels her hairband as a surfboard or a frisbee, probably because those objects appear more often in the training images and look similar. This gives some insight into the general behaviour of the attention-based network.

### Comparison of Epoch 1 Caption Vs Epoch 3 Caption

In [19]:
fig, axes = plt.subplots(nrows=2, figsize=(7,7))

# Rebuild the network and load the decoder weights saved after epoch 1
encoder1 = Encoder(network)
decoder1 = Decoder(device, vocabulary_size, encoder1.dim, tf_ratio = tf_ratio)

trained_model_path = 'models/model_tf_ss_epoch1.pth.tar'
decoder1.load_state_dict(torch.load(trained_model_path)['state_dict'])

# Set the encoder and decoder in evaluation mode to get captions for testing data
encoder1.eval()
decoder1.eval()

generate_image_caption(encoder1, decoder1, data_root_dir +'/imgs'+test_img_paths['16375'], word_dict, beam_size=3, ax=axes[0])
generate_image_caption(encoder, decoder, data_root_dir +'/imgs'+test_img_paths['16375'], word_dict, beam_size=3, ax=axes[1])


Because this model is not always trained with teacher forcing, it learns more slowly, and the caption after the first epoch is not very close to the correct one. As seen above, after epoch 1 the caption does not match the image apart from the wine glass. By the end of epoch 3, however, the network identifies the man and woman and indicates the basic action they are performing. It cannot yet interpret the surroundings, which would likely become possible after a few more epochs.