# **Image Captioning**

Source:  [https://github.com/d-insight/code-bank.git](https://github.com/d-insight/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

-------------

## Overview

Given an image like the example below, our goal is to generate a caption such as "a surfer riding on a wave".

<img src="https://upload.wikimedia.org/wikipedia/commons/d/db/Surfing_in_Hawaii.jpg" width="500" height="500" align="center"/>

Image source: https://upload.wikimedia.org/wikipedia/commons/d/db/Surfing_in_Hawaii.jpg

To accomplish this, we'll use an attention-based model, which also enables us to see what parts of the image the model focuses on as it generates a caption. For the image above, here is an overlay of which parts of the image the model associates with which predicted word.

<img src="https://www.tensorflow.org/images/imcap_prediction.png" width="800" height="800" align="center"/>

For this illustration, we use the [Microsoft COCO](http://cocodataset.org/#home) dataset, a large-scale image dataset for object recognition, image segmentation, and captioning. The full training set is 13 GB, and training on it efficiently requires a GPU. We ran the code of the original tutorial in advance on a GPU-powered Google Colab notebook, so here we import the pre-trained model (see below) and only run inference.

We include only the code required to run inference on new images, not the code to train the model from scratch. For the complete code, see the original Google tutorial linked below.

Portions of this page are reproduced from work created and shared by Google and used according to terms described in the [Creative Commons 4.0 Attribution License](https://creativecommons.org/licenses/by/4.0/). For the original tutorial visit: https://www.tensorflow.org/tutorials/text/image_captioning

-------------

## **Part 0**: Setup

### Import packages

In [None]:
# Import all packages
import tensorflow as tf

# We'll generate plots of attention in order to see which parts of an image
# our model focuses on during captioning
import matplotlib.pyplot as plt

# Scikit-learn includes many helpful utilities
from sklearn.model_selection import train_test_split
from sklearn.utils           import shuffle

import re
import os
import time
import json
import pickle
import numpy as np
from glob    import glob
from PIL     import Image
from tqdm    import tqdm

### Constants

In [None]:
PATH = 'data/coco-data/'


### Support functions

In [None]:
def load_image(image_path):
    """
    Load and resize image
    
    Args:
        image_path (str): path to the image 
        
    Returns:
        img: TensorFlow image preprocessed for use with the InceptionV3 model
        image_path: path to the image
    """
    
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path

# Attention implementation 
class BahdanauAttention(tf.keras.Model):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        # features(CNN_encoder output) shape == (batch_size, 64, embedding_dim)

        # hidden shape == (batch_size, hidden_size)
        # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)

        # score shape == (batch_size, 64, hidden_size)
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))

        # attention_weights shape == (batch_size, 64, 1)
        # you get 1 at the last axis because you are applying score to self.V
        attention_weights = tf.nn.softmax(self.V(score), axis=1)

        # context_vector shape after sum == (batch_size, embedding_dim)
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector, attention_weights
    
class CNN_Encoder(tf.keras.Model):
    # Since we have already extracted the features and dumped it using pickle
    # This encoder passes those features through a Fully connected layer
    def __init__(self, embedding_dim):
        super(CNN_Encoder, self).__init__()
        # shape after fc == (batch_size, 64, embedding_dim)
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        x = self.fc(x)
        x = tf.nn.relu(x)
        return x
    
class RNN_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(RNN_Decoder, self).__init__()
        self.units = units

        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc1 = tf.keras.layers.Dense(self.units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)

        self.attention = BahdanauAttention(self.units)

    def call(self, x, features, hidden):
        # defining attention as a separate model
        context_vector, attention_weights = self.attention(features, hidden)

        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)

        # x shape after concatenation == (batch_size, 1, 2 * embedding_dim)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        # passing the concatenated vector to the GRU
        output, state = self.gru(x)

        # shape == (batch_size, max_length, hidden_size)
        x = self.fc1(output)

        # x shape == (batch_size * max_length, hidden_size)
        x = tf.reshape(x, (-1, x.shape[2]))

        # output shape == (batch_size * max_length, vocab)
        x = self.fc2(x)

        return x, state, attention_weights

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))
    
def loss_function(real, pred):
    """
    Masked loss: per-token loss with padding positions (token id 0) zeroed out
    
    Args:
        real: ground truth token ids
        pred: predicted logits
    
    Returns: mean of the masked per-token losses
    """
    
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask

    return tf.reduce_mean(loss_)

@tf.function
def train_step(img_tensor, target):
    """
    Runs one training step
    Note that the decorator compiles the function into a TensorFlow graph for faster execution on GPUs and TPUs
    
    Args:
        img_tensor: input image data
        target: target caption token ids
        
    Returns: loss summed over all time steps, loss averaged over the caption length
    """
    
    loss = 0

    # initializing the hidden state for each batch
    # because the captions are not related from image to image
    hidden = decoder.reset_state(batch_size=target.shape[0])

    dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * target.shape[0], 1)

    with tf.GradientTape() as tape:
        features = encoder(img_tensor)

        for i in range(1, target.shape[1]):
            # passing the features through the decoder
            predictions, hidden, _ = decoder(dec_input, features, hidden)

            loss += loss_function(target[:, i], predictions)

            # using teacher forcing
            dec_input = tf.expand_dims(target[:, i], 1)

    total_loss = (loss / int(target.shape[1]))

    trainable_variables = encoder.trainable_variables + decoder.trainable_variables

    gradients = tape.gradient(loss, trainable_variables)

    optimizer.apply_gradients(zip(gradients, trainable_variables))

    return loss, total_loss

def evaluate(image):
    """
    Predict caption and construct attention plot
    
    Args:
        image: image to evaluate
        
    Returns:
        result: predicted caption as a list of tokens
        attention_plot: attention weights for each predicted token
    """
    
    attention_plot = np.zeros((max_length, attention_features_shape))

    hidden = decoder.reset_state(batch_size=1)

    temp_input = tf.expand_dims(load_image(image)[0], 0)
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))

    features = encoder(img_tensor_val)

    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []

    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)

        attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()

        predicted_id = tf.random.categorical(predictions, 1)[0][0].numpy()
        result.append(tokenizer.index_word[predicted_id])

        if tokenizer.index_word[predicted_id] == '<end>':
            return result, attention_plot

        dec_input = tf.expand_dims([predicted_id], 0)

    attention_plot = attention_plot[:len(result), :]
    
    return result, attention_plot

def plot_attention(image, result, attention_plot):
    """
    Construct the attention plot
    
    Args:
        image: input image
        result (list of str): predicted caption tokens
        attention_plot: data for attention plot 
    """
    temp_image = np.array(Image.open(image))

    fig = plt.figure(figsize=(14, 14))

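    len_result = len(result)
    # Use a square grid large enough to hold one subplot per predicted word;
    # len_result//2 alone is too small for captions shorter than 4 tokens
    grid_size = max(int(np.ceil(len_result / 2)), 2)
    for l in range(len_result):
        temp_att = np.resize(attention_plot[l], (8, 8))
        ax = fig.add_subplot(grid_size, grid_size, l+1)
        ax.set_title(result[l])
        img = ax.imshow(temp_image)
        ax.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())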

    plt.tight_layout()
    plt.show()
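
The core of `BahdanauAttention` above is a softmax over the image locations followed by a weighted sum of their feature vectors. The following is a minimal plain-Python sketch of that last step only. It assumes the scores are already given (in the class they come from `V(tanh(W1(features) + W2(hidden)))`), and the function names and toy numbers are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, scores):
    """Weight each location's feature vector by softmax(scores) and sum.

    features: list of L feature vectors (each a list of D floats)
    scores:   list of L unnormalized attention scores, one per location
    """
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features))
               for d in range(dim)]
    return context, weights

# Toy example: 3 image locations with 2-dimensional features (made-up numbers)
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 0.0]  # equal scores give uniform attention
context, weights = attend(features, scores)
```

The context vector has the same dimensionality as a single location's features, and the weights are what `plot_attention` overlays on the image.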

## **Part 1**: Load & pre-process data

We'll use the [MS-COCO dataset](http://cocodataset.org/#home) to train our model. The dataset contains over 82,000 images, each of which has at least 5 different caption annotations. The code below loads a sample of 100 images that we have downloaded to save space.

To speed up training for this tutorial, we'll use a subset of 30,000 captions and their corresponding images to train our model. Choosing to use more data would result in improved captioning quality.

In [None]:
# Read the json file
annotation_file = 'data/coco-data/captions_train2014.json'
with open(annotation_file, 'r') as f:
    annotations = json.load(f)

# Store captions and image names in vectors
all_captions = []
all_img_name_vector = []

for annot in annotations['annotations']:
    caption = '<start> ' + annot['caption'] + ' <end>'
    image_id = annot['image_id']
    full_coco_image_path = PATH + 'COCO_train2014_' + '%012d.jpg' % (image_id)

    all_img_name_vector.append(full_coco_image_path)
    all_captions.append(caption)
    
# Shuffle captions and image_names together
# Set a random state
train_captions, img_name_vector = shuffle(all_captions,
                                          all_img_name_vector,
                                          random_state=1)

# Select the first 30000 captions
num_examples = 30000
train_captions = train_captions[:num_examples]
img_name_vector = img_name_vector[:num_examples]

In [None]:
len(train_captions), len(all_captions)
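
To make the loading step above concrete, here is a small sketch of what it produces for each annotation. The two annotation entries are made up and only mimic the layout the loop relies on (`image_id` and `caption` keys); `build_pairs` is a hypothetical helper, not part of the original code.

```python
# Made-up annotation entries in the layout the loading loop expects
annotations = {
    'annotations': [
        {'image_id': 318556, 'caption': 'A surfer riding on a wave'},
        {'image_id': 42, 'caption': 'A dog sitting on a couch'},
    ]
}

PATH = 'data/coco-data/'

def build_pairs(annotations, path):
    """Return (caption, image_path) pairs with start/end markers added."""
    pairs = []
    for annot in annotations['annotations']:
        caption = '<start> ' + annot['caption'] + ' <end>'
        # Image ids are zero-padded to 12 digits in the file name
        image_path = path + 'COCO_train2014_' + '%012d.jpg' % annot['image_id']
        pairs.append((caption, image_path))
    return pairs

pairs = build_pairs(annotations, PATH)
for caption, image_path in pairs:
    print(image_path, '|', caption)
```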


## **Part 2**: Set up InceptionV3 model, cache data, and preprocess captions

* First, we'll use InceptionV3 (which is pretrained on ImageNet) to encode each image, taking the output of the last convolutional layer as the image features. InceptionV3 requires the following pre-processing:
    * Resize the image to 299px by 299px.
    * [Preprocess the images](https://cloud.google.com/tpu/docs/inception-v3-advanced#preprocessing_stage) using the [preprocess_input](https://www.tensorflow.org/api_docs/python/tf/keras/applications/inception_v3/preprocess_input) method, which normalizes the pixel values to the range of -1 to 1 to match the format of the images used to train InceptionV3.

* We'll limit the vocabulary size to the top 5,000 words (to save memory). We'll replace all other words with the token `<unk>` (unknown).
* We then create word-to-index and index-to-word mappings.
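
The vocabulary steps listed above can be sketched in plain Python. This is a simplified stand-in for the Keras `Tokenizer`, not a reimplementation of it: the exact index assignment and the way `num_words` counts special tokens are assumptions here, and `build_vocab` and `encode` are hypothetical helpers.

```python
from collections import Counter

def build_vocab(captions, top_k):
    """Keep the top_k most frequent words; index 0 is <pad>, index 1 is <unk>."""
    counts = Counter(word for c in captions for word in c.lower().split())
    kept = [word for word, _ in counts.most_common(top_k)]
    word_index = {'<pad>': 0, '<unk>': 1}
    for i, word in enumerate(kept, start=2):
        word_index[word] = i
    index_word = {i: word for word, i in word_index.items()}
    return word_index, index_word

def encode(caption, word_index):
    """Replace every word with its index, falling back to <unk>."""
    unk = word_index['<unk>']
    return [word_index.get(word, unk) for word in caption.lower().split()]

# Toy corpus (made up): 'a' appears 3 times, 'dog' twice, the rest once
captions = ['a dog runs', 'a dog sits', 'a cat']
word_index, index_word = build_vocab(captions, top_k=2)
encoded = encode('a cat sits', word_index)
```

With `top_k=2`, only "a" and "dog" survive, so "cat" and "sits" both map to the `<unk>` index.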

In [None]:
# Set up model
image_model  = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
new_input    = image_model.input
hidden_layer = image_model.layers[-1].output

image_features_extract_model = tf.keras.Model(new_input, hidden_layer)
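
The feature extractor above expects pixel values in the range -1 to 1. As a rough plain-Python illustration of that normalization, the sketch below assumes the scaling is `x / 127.5 - 1` applied to 8-bit pixel values, which is how this preprocessing mode is commonly described; the helper names and the tiny image are made up.

```python
def scale_pixel(value):
    """Map an 8-bit pixel value in [0, 255] to the range [-1, 1]."""
    return value / 127.5 - 1.0

def scale_image(pixels):
    """Apply the scaling to a nested list of rows x columns x channels."""
    return [[[scale_pixel(c) for c in px] for px in row] for row in pixels]

# A made-up 1x2 RGB image: one black pixel and one white pixel
image = [[[0, 0, 0], [255, 255, 255]]]
scaled = scale_image(image)
```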

In [None]:
# Choose the top 5000 words from the vocabulary
top_k = 5000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k,
                                                  oov_token="<unk>",
                                                  filters='!"#$%&()*+.,-/:;=?@[\\]^_`{|}~ ')
tokenizer.fit_on_texts(all_captions)
train_seqs = tokenizer.texts_to_sequences(all_captions)

# Create index for padding tag
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'

## **Part 3**: Model: Attention, a CNN Encoder (for image data), and an RNN Decoder (for text data)

Fun fact: the decoder used here is identical to the one in the illustration for Neural Machine Translation with Attention for French-to-English translation.

The model architecture is inspired by the [Show, Attend and Tell](https://arxiv.org/pdf/1502.03044.pdf) paper.

* In this example, we extract the features from the last convolutional layer of InceptionV3, giving us a feature map of shape (8, 8, 2048).
* We flatten the two spatial dimensions to get a shape of (64, 2048).
* This vector is then passed through the CNN Encoder (which consists of a single Fully connected layer).
* The RNN (here GRU cell) attends over the image to predict the next word.
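
The flattening step in the list above can be pictured with a much smaller grid. The sketch below uses nested Python lists instead of tensors and a 2 x 3 grid with 4 channels as a stand-in for (8, 8, 2048); `flatten_grid` is a hypothetical helper, and the row-by-row order is assumed to match what a reshape does.

```python
def flatten_grid(feature_map):
    """Flatten an H x W x C nested list into an (H*W) x C list, row by row."""
    return [cell for row in feature_map for cell in row]

# Stand-in for the (8, 8, 2048) map: a 2 x 3 grid with 4 channels, where
# each channel vector is filled with the flat index of its own location
H, W, C = 2, 3, 4
feature_map = [[[r * W + c] * C for c in range(W)] for r in range(H)]
flat = flatten_grid(feature_map)

# Flat location k corresponds to grid position (k // W, k % W)
k = 4
row, col = k // W, k % W
```

Because the order is preserved, the 64 attention weights can later be reshaped back into an 8 x 8 grid and overlaid on the image.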

In [None]:
# Feel free to change these parameters according to your system's configuration

BATCH_SIZE = 64
BUFFER_SIZE = 1000
embedding_dim = 256
units = 512
vocab_size = top_k + 1
num_steps = int(0.8 * len(train_captions)) // BATCH_SIZE
# Shape of the vector extracted from InceptionV3 is (64, 2048)
# These two variables represent that vector shape
features_shape = 2048
attention_features_shape = 64
max_length = 49

In [None]:
# Instantiate encoder and decoder 
encoder = CNN_Encoder(embedding_dim)
decoder = RNN_Decoder(embedding_dim, units, vocab_size)

In [None]:
# Define the optimizer and loss function
optimizer  = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
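
With `reduction='none'`, the loss object returns one value per caption position, which lets `loss_function` zero out padding positions before averaging. Here is a minimal plain-Python sketch of that masking idea. It takes probabilities instead of logits for simplicity, and `masked_mean_loss` and the toy numbers are made up for illustration.

```python
import math

def masked_mean_loss(targets, probs, pad_id=0):
    """Cross-entropy per position, zeroed at padding, averaged over all positions.

    targets: list of token ids, one per example in the batch
    probs:   list of probability distributions, one per example
    """
    losses = []
    for target, dist in zip(targets, probs):
        loss = -math.log(dist[target])
        mask = 0.0 if target == pad_id else 1.0
        losses.append(loss * mask)
    # As with a plain mean, this divides by the batch size, padding included
    return sum(losses) / len(losses)

targets = [2, 0]  # the second example is padding
probs = [[0.1, 0.4, 0.5], [0.2, 0.3, 0.5]]
loss = masked_mean_loss(targets, probs)
```

Only the first example contributes to the sum, so padded positions never push the model toward predicting the padding token.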


In [None]:
# Save model checkpoints
checkpoint_path = './data/coco_training_checkpoints'
ckpt = tf.train.Checkpoint(encoder=encoder,
                           decoder=decoder,
                           optimizer = optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

start_epoch = 0

## **Part 4**: Evaluate model

We pre-trained the model on 30,000 images on a Google Cloud GPU. Training took approximately 2.5 hours on a single NVIDIA P100 GPU, a card that costs around $5,000.

In [None]:
# Download pre-trained model checkpoints from Google Cloud storage
# CHECKPOINT LOCATIONS ON GOOGLE MOVED!
!wget -N 'https://storage.googleapis.com/dsfm/coco_training_checkpoints/checkpoint' --directory-prefix='data/coco_training_checkpoints'
!wget -N 'https://storage.googleapis.com/dsfm/coco_training_checkpoints/ckpt-1.data-00000-of-00002' --directory-prefix='data/coco_training_checkpoints'
!wget -N 'https://storage.googleapis.com/dsfm/coco_training_checkpoints/ckpt-1.data-00001-of-00002' --directory-prefix='data/coco_training_checkpoints'
!wget -N 'https://storage.googleapis.com/dsfm/coco_training_checkpoints/ckpt-1.index' --directory-prefix='data/coco_training_checkpoints'
!wget -N 'https://storage.googleapis.com/dsfm/coco_training_checkpoints/tokenizer.pickle' --directory-prefix='data/coco_training_checkpoints'

# Load latest checkpoint
if ckpt_manager.latest_checkpoint:
    start_epoch = int(ckpt_manager.latest_checkpoint.split('-')[-1])
    
    # restoring the latest checkpoint in checkpoint_path
    ckpt.restore(ckpt_manager.latest_checkpoint)
    
# Overwrite tokenizer to match pre-trained model
with open('data/coco_training_checkpoints/tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)

### Your own image

Test the pre-trained model with an image of your choice. There are two things we have to change:

- the image URL
- the image path name, defining the file name on your machine (e.g. image0 below)

A note of caution: we have substantially reduced the training size in the interest of time, so the captions we see might not make a lot of sense. Furthermore, re-running the evaluation on the same image can produce a slightly different caption, because `evaluate` samples each word from the predicted distribution instead of always picking the most likely one.
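
To illustrate why repeated runs can differ, the sketch below contrasts sampling a token id from the softmax of the logits with always taking the highest-scoring id. It is a plain-Python illustration of the idea, not the TensorFlow call used in `evaluate`; the helper names and logits are made up.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_id(logits, rng):
    """Draw one index with probability given by softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

def greedy_id(logits):
    """Always return the index with the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-word vocabulary
rng = random.Random(0)    # seeded so the run is repeatable
samples = [sample_id(logits, rng) for _ in range(1000)]
counts = [samples.count(i) for i in range(len(logits))]
```

Greedy decoding would return the same word every time, while sampling occasionally picks less likely words, which is where the run-to-run variation comes from.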

In [None]:
# Example of a surfer on a wave
image_url = 'https://tensorflow.org/images/surf.jpg'
image_extension = os.path.splitext(image_url)[1]  # e.g. '.jpg'
image_path = tf.keras.utils.get_file('image0'+image_extension, origin=image_url)

result, attention_plot = evaluate(image_path)
print('Prediction Caption:', ' '.join(result))
plot_attention(image_path, result, attention_plot)

# opening the image
Image.open(image_path)

In [None]:
# Another example 
image_url = 'https://upload.wikimedia.org/wikipedia/commons/4/4b/Wdomenada2003b.jpg'
image_extension = os.path.splitext(image_url)[1]  # e.g. '.jpg'
image_path = tf.keras.utils.get_file('image1' + image_extension, origin = image_url)

result, attention_plot = evaluate(image_path)
print('Prediction Caption:', ' '.join(result))
plot_attention(image_path, result, attention_plot)

# opening the image
Image.open(image_path)