# Image Captioning Final Project

In this final project you will define and train an image-to-caption model that can produce descriptions for real-world images!

<img src="images/encoder_decoder.png" style="width:70%">

Model architecture: CNN encoder and RNN decoder. 
(https://research.googleblog.com/2014/11/a-picture-is-worth-thousand-coherent.html)

# Import stuff

In [None]:
import sys
sys.path.append("..")

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import utils
import time
import zipfile
import json
from collections import defaultdict
import re
import os

# Download data

Downloading the data takes 10 hours and 20 GB of disk space, so we've downloaded the necessary files for you.

Relevant links (just in case):
- train images http://msvocds.blob.core.windows.net/coco2014/train2014.zip
- validation images http://msvocds.blob.core.windows.net/coco2014/val2014.zip
- captions for both train and validation http://msvocds.blob.core.windows.net/annotations-1-0-3/captions_train-val2014.zip

# Extract image features

We will use a pre-trained InceptionV3 model as the CNN encoder (https://research.googleblog.com/2016/03/train-your-own-image-classifier-with.html) and extract its last hidden layer as an embedding:

<img src="images/inceptionv3.png" style="width:70%">

In [None]:
IMG_SIZE = 299  # InceptionV3's default input resolution is 299x299

In [None]:
# load prepared embeddings
train_img_embeds = utils.read_pickle("train_img_embeds.pickle")
train_img_fns = utils.read_pickle("train_img_fns.pickle")
val_img_embeds = utils.read_pickle("val_img_embeds.pickle")
val_img_fns = utils.read_pickle("val_img_fns.pickle")
# check shapes
print(train_img_embeds.shape, len(train_img_fns))
print(val_img_embeds.shape, len(val_img_fns))

# Extract captions for images

In [None]:
# extract captions from zip
def get_captions_for_fns(fns, zip_fn, zip_json_path):
    zf = zipfile.ZipFile(zip_fn)
    j = json.loads(zf.read(zip_json_path).decode("utf8"))
    id_to_fn = {img["id"]: img["file_name"] for img in j["images"]}
    fn_to_caps = defaultdict(list)
    for cap in j['annotations']:
        fn_to_caps[id_to_fn[cap['image_id']]].append(cap['caption'])
    fn_to_caps = dict(fn_to_caps)
    return list(map(lambda x: fn_to_caps[x], fns))
    
train_captions = get_captions_for_fns(train_img_fns, "captions_train-val2014.zip", 
                                      "annotations/captions_train2014.json")

val_captions = get_captions_for_fns(val_img_fns, "captions_train-val2014.zip", 
                                      "annotations/captions_val2014.json")

# check shape
print(len(train_img_fns), len(train_captions))
print(len(val_img_fns), len(val_captions))
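
To make the layout of the annotation file concrete, here is a small sketch of the same grouping on a hand-made dictionary. The file names and captions are invented for illustration, and `group_captions` is a hypothetical helper; only the `images` / `annotations` keys and their `id`, `file_name`, `image_id` and `caption` fields come from the code above.

```python
from collections import defaultdict

# Toy stand-in for the parsed captions JSON (hypothetical values).
toy_json = {
    "images": [
        {"id": 1, "file_name": "a.jpg"},
        {"id": 2, "file_name": "b.jpg"},
    ],
    "annotations": [
        {"image_id": 1, "caption": "A dog runs."},
        {"image_id": 2, "caption": "A red bus."},
        {"image_id": 1, "caption": "Dog on grass."},
    ],
}

def group_captions(j, fns):
    """Return a list of caption lists, ordered like `fns`."""
    id_to_fn = {img["id"]: img["file_name"] for img in j["images"]}
    fn_to_caps = defaultdict(list)
    for cap in j["annotations"]:
        fn_to_caps[id_to_fn[cap["image_id"]]].append(cap["caption"])
    return [fn_to_caps[fn] for fn in fns]

toy_captions = group_captions(toy_json, ["b.jpg", "a.jpg"])
print(toy_captions)
```

Each image ends up with a list of captions, and the outer list follows the order of the file names we pass in, which is what keeps captions aligned with the rows of the embedding matrix.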

# Prepare captions for training

In [None]:
# preview captions data
train_captions[:2]

In [None]:
# special tokens  
UNK = "#UNK#"
START = "#START#"
END = "#END#"
PAD = "#PAD#"

# split sentence into tokens (split into lowercased words)
def split_sentence(sentence):
    # raw string, so that \W reaches the regex engine instead of being treated as a string escape
    return list(filter(lambda x: len(x) > 0, re.split(r'\W+', sentence.lower())))

def generate_vocabulary(train_captions):
    """
    Return a sorted list of all train tokens (words) that occur 5 times or more.
    Use `split_sentence` function to split sentence into tokens.
    The PAD (for batch padding), UNK (unknown, out of vocabulary),
        START (start of sentence) and END (end of sentence) tokens are prepended by the caller,
        which also builds the final {token: index} mapping.
    """
    # count how many times every token occurs in the training captions
    occ = defaultdict(int)
    for img_captions in train_captions:
        for caption in img_captions:
            for word in split_sentence(caption):
                occ[word] += 1
    # keep only the tokens that are frequent enough
    vocab = [word for word, count in occ.items() if count >= 5]
    
    return sorted(vocab)
    
def caption_tokens_to_indices(captions, vocab):
    """
    `captions` argument is an array of arrays:
    [
        [
            "image1 caption1",
            "image1 caption2",
            ...
        ],
        [
            "image2 caption1",
            "image2 caption2",
            ...
        ],
        ...
    ]
    Use `split_sentence` function to split sentence into tokens.
    Replace all tokens with vocabulary indices, use UNK for unknown words (out of vocabulary).
    Add START and END tokens to start and end of each sentence respectively.
    For the example above you should produce the following:
    [
        [
            [vocab[START], vocab["image1"], vocab["caption1"], vocab[END]],
            [vocab[START], vocab["image1"], vocab["caption2"], vocab[END]],
            ...
        ],
        ...
    ]
    """
    res = []
    for img_captions in captions:
        arr_img = []
        for caption in img_captions:
            temp = [vocab[START]]
            words = split_sentence(caption)
            for word in words:
                if word in vocab.keys():
                    temp.append(vocab[word])
                else:
                    temp.append(vocab[UNK])
            temp.append(vocab[END])
            arr_img.append(temp)
        res.append(arr_img)
    return res
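
As a quick sanity check of the vocabulary and indexing logic, here is a self-contained sketch on toy captions. The helpers `build_vocab` and `encode` are hypothetical simplifications of the functions above, and the frequency threshold is lowered to 2 only because the toy data is tiny.

```python
import re
from collections import Counter

UNK, START, END, PAD = "#UNK#", "#START#", "#END#", "#PAD#"

def split_sentence(sentence):
    # lowercase, then split on runs of non-word characters
    return [t for t in re.split(r"\W+", sentence.lower()) if t]

def build_vocab(captions_per_image, min_count):
    # count tokens over all captions of all images
    counts = Counter(
        tok
        for caps in captions_per_image
        for cap in caps
        for tok in split_sentence(cap)
    )
    kept = sorted(tok for tok, n in counts.items() if n >= min_count)
    # special tokens come first, then the frequent words
    return {tok: i for i, tok in enumerate([PAD, START, END, UNK] + kept)}

def encode(caption, vocab):
    # rare or unseen words fall back to the UNK index
    ids = [vocab.get(tok, vocab[UNK]) for tok in split_sentence(caption)]
    return [vocab[START]] + ids + [vocab[END]]

toy_captions = [["A dog runs.", "A dog!"], ["A cat sleeps."]]
toy_vocab = build_vocab(toy_captions, min_count=2)
encoded = encode("A dog sleeps", toy_vocab)
print(toy_vocab)
print(encoded)
```

Here "sleeps" occurs only once, so it is left out of the vocabulary and is encoded as UNK.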

In [None]:
# prepare vocabulary
if "vocab.pickle" not in os.listdir():
    vocab = generate_vocabulary(train_captions)
    vocab = [PAD, START, END, UNK] + vocab
    vocab = {token: index for index, token in enumerate(vocab)}
    utils.save_pickle(vocab, "vocab.pickle")
else:
    vocab = utils.read_pickle("vocab.pickle")
    
if "vocab_inverse.pickle" not in os.listdir():
    vocab_inverse = {idx: w for w, idx in vocab.items()}
    utils.save_pickle(vocab_inverse, "vocab_inverse.pickle")
else:
    vocab_inverse = utils.read_pickle("vocab_inverse.pickle")
print(len(vocab))

In [None]:
# replace tokens with indices
train_captions_indexed = caption_tokens_to_indices(train_captions, vocab)
val_captions_indexed = caption_tokens_to_indices(val_captions, vocab)

In [None]:
print(train_captions_indexed[0])

Captions have different lengths, but we need to batch them, so we will add PAD tokens until all sentences in a batch have the same length.

We will run the LSTM over all the tokens, but we will ignore padding tokens during loss calculation.
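
One common way to ignore padding is to multiply the per-token loss by a mask that is zero at PAD positions. The following is a NumPy sketch of that idea, not this notebook's training code; `masked_cross_entropy` and the toy shapes are assumptions made for illustration.

```python
import numpy as np

def masked_cross_entropy(probs, targets, pad_idx):
    """
    probs:   [batch, time, vocab] predicted next-token probabilities
    targets: [batch, time] ground-truth token indices
    Returns the mean negative log-likelihood over non-PAD positions only.
    """
    batch, time = targets.shape
    # probability assigned to the correct token at every position
    picked = probs[np.arange(batch)[:, None], np.arange(time)[None, :], targets]
    # 1.0 for real tokens, 0.0 for padding
    mask = (targets != pad_idx).astype(np.float64)
    return float(-(np.log(picked) * mask).sum() / mask.sum())

pad_idx = 0
targets = np.array([[2, 3, 1], [4, 1, pad_idx]])
# uniform predictions over a 5-token vocabulary
probs = np.full((2, 3, 5), 0.2)
loss = masked_cross_entropy(probs, targets, pad_idx)
print(loss)
```

Because the mask zeroes out the PAD position, whatever the model predicts there has no effect on the loss value.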

In [None]:
# we will use this during training
def batch_captions_to_matrix(batch_captions, pad_idx, max_len):
    """
    `batch_captions` is an array of arrays:
    [
        [vocab[START], ..., vocab[END]],
        [vocab[START], ..., vocab[END]],
        ...
    ]
    Add padding with pad_idx where necessary.
    Input example: [[1, 2, 3], [4, 5]]
    Output example: np.array([[1, 2, 3], [4, 5, pad_idx]]) if max_len=None
    Output example: np.array([[1, 2], [4, 5]]) if max_len=2
    Output example: np.array([[1, 2, 3], [4, 5, pad_idx]]) if max_len=100
    Try to use numpy, we need this function to be fast!
    """
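    # matrix width: the longest caption in the batch, truncated to max_len if it is given
    longest = max(map(len, batch_captions))
    n_cols = longest if max_len is None else min(max_len, longest)
    # integer matrix pre-filled with the padding index
    matrix = np.full((len(batch_captions), n_cols), pad_idx, dtype=np.int32)
    for i, caption in enumerate(batch_captions):
        l = min(n_cols, len(caption))
        matrix[i, :l] = caption[:l]
    return matrix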

In [None]:
indices = np.arange(val_img_embeds.shape[0])
np.random.shuffle(indices)
num_cut_val_to_train = 20000

In [None]:
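# captions are ragged (different lengths), so they have to be stored as object arrays
train_captions_indexed = np.array(train_captions_indexed, dtype=object)
val_captions_indexed = np.array(val_captions_indexed, dtype=object)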

In [None]:
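# move the first num_cut_val_to_train validation examples into the training set
train_img_embeds = np.concatenate((train_img_embeds, val_img_embeds[:num_cut_val_to_train]), axis=0)
train_captions_indexed = np.concatenate((train_captions_indexed, val_captions_indexed[:num_cut_val_to_train]), axis=0)

In [None]:
# keep the rest for validation
val_img_embeds = val_img_embeds[num_cut_val_to_train:]
val_captions_indexed = val_captions_indexed[num_cut_val_to_train:]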

In [None]:
train_captions_indexed[0]

# Training

## Define architecture

Since our problem is to generate image captions, the RNN text generator should be conditioned on the image. The idea is to use image features as the initial state of the RNN instead of zeros.

Remember that you should transform the image feature vector to the RNN hidden state size with a fully-connected layer and then pass it to the RNN.
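
The shape bookkeeping behind this can be sketched in plain NumPy. This is only an illustration under assumptions: the weights are random stand-ins for trained dense layers, biases are omitted, and the sizes (2048, 128, 256) mirror the constants this notebook uses.

```python
import numpy as np

def elu(x):
    # ELU activation: x for x > 0, exp(x) - 1 otherwise
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

rng = np.random.default_rng(0)
batch_size, img_embed_size, bottleneck_size, lstm_units = 4, 2048, 128, 256

# a batch of image embeddings (random stand-ins)
img_embeds = rng.normal(size=(batch_size, img_embed_size))

# randomly initialised weights stand in for trained dense layers
W_bottleneck = rng.normal(scale=0.01, size=(img_embed_size, bottleneck_size))
W_state = rng.normal(scale=0.01, size=(bottleneck_size, lstm_units))

bottleneck = elu(img_embeds @ W_bottleneck)   # [batch, 128]
h0 = elu(bottleneck @ W_state)                # initial hidden state, [batch, 256]
c0 = h0                                       # same vector reused as the initial cell state

print(h0.shape, c0.shape)
```

The bottleneck keeps the number of parameters small: projecting 2048 features straight to the state size would need a much larger weight matrix.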

During training we will feed ground truth tokens into the LSTM to get predictions of the next tokens.

Notice that we don't need to feed the last token (END) as input (http://cs.stanford.edu/people/karpathy/):

<img src="images/encoder_decoder_explained.png" style="width:50%">
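
In array terms, the picture above amounts to shifting the same token matrix by one position. A minimal sketch, assuming hypothetical indices 0 = PAD, 1 = START and 2 = END:

```python
import numpy as np

# Hypothetical indices: 0 = PAD, 1 = START, 2 = END, the rest are words.
batch = np.array([
    [1, 7, 8, 9, 2],   # START w1 w2 w3 END
    [1, 5, 6, 2, 0],   # START w1 w2 END PAD
])

decoder_inputs = batch[:, :-1]   # everything except the last position
decoder_targets = batch[:, 1:]   # the same sequence shifted left by one

print(decoder_inputs)
print(decoder_targets)
```

At every time step the target is the token that follows the input token. In the shorter caption the last target is PAD, which is exactly the kind of position that should not contribute to the loss.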

In [None]:
IMG_EMBED_SIZE = 2048
IMG_EMBED_BOTTLENECK = 128
WORD_EMBED_SIZE = 100
LSTM_UNITS = 256
LOGIT_BOTTLENECK = 120
pad_idx = vocab[PAD]
max_len = 25

Here we define the decoder graph.

We use Keras layers where possible because we can apply them in a functional style and reuse their weights, like this:
```python
dense_layer = L.Dense(42, input_shape=(None, 100), activation='relu')
a = tf.placeholder('float32', [None, 100])
b = tf.placeholder('float32', [None, 100])
dense_layer(a)  # that's how we applied dense layer!
dense_layer(b)  # and again
```

In [None]:
import keras
M = keras.models
L = keras.layers
K = keras.backend

In [None]:
K.clear_session()

Define the layers (nodes) of the computational graph

In [None]:
img_embed_input = L.Input(shape=(train_img_embeds.shape[1],))

img_embed_tensor = L.Dense(units=IMG_EMBED_BOTTLENECK, activation="elu")
initial_state_tensor = L.Dense(units=LSTM_UNITS, activation="elu")

X_decoder = L.Input(shape=(None,))
embedding_tensor = L.Embedding(len(vocab), WORD_EMBED_SIZE)
LSTM_tensor = L.LSTM(units=LSTM_UNITS, activation="tanh", return_sequences=True)
logit_bottlekneck_tensor = L.TimeDistributed(L.Dense(LOGIT_BOTTLENECK, activation="elu"))
output_tensor = L.TimeDistributed(L.Dense(len(vocab), activation="softmax"))

Connect those layers (nodes) together

In [None]:
flow = img_embed_tensor(img_embed_input)
h0 = c0 = initial_state_tensor(flow)

flow = embedding_tensor(X_decoder)
flow = LSTM_tensor(flow, initial_state=[h0, c0])
flow = logit_bottlekneck_tensor(flow)
output = output_tensor(flow)

In [None]:
model = M.Model(inputs=[img_embed_input, X_decoder], outputs=[output])
model.summary()

In [None]:
optimizer = keras.optimizers.Adam(lr=0.001)
model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])

In [None]:
class DataGenerator(keras.utils.Sequence):
    
    def __init__(self, images, captions, batch_size, max_len, shuffle=True):
        self.images = images
        self.captions = captions
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.max_len = max_len
        self.indices = np.arange(self.images.shape[0])
    
    def __len__(self):
        # Denotes the number of batches per epoch (the last batch may be smaller)
        return int(np.ceil(self.images.shape[0] / self.batch_size))
    
    def __getitem__(self, index):
        # Generate one batch of data; `index` is the batch number, not a sample offset
        batch_indices = self.indices[index * self.batch_size:(index + 1) * self.batch_size]
        image_embed_gen = self.images[batch_indices]
        # pick one random caption for every image in the batch
        batch_captions = [self.captions[i][np.random.randint(len(self.captions[i]))] for i in batch_indices]
        caption_gen = batch_captions_to_matrix(batch_captions, pad_idx, self.max_len)
        # inputs are all tokens except the last one, targets are the same tokens shifted by one
        return [image_embed_gen, caption_gen[:, :-1]], keras.utils.to_categorical(caption_gen[:, 1:], num_classes=len(vocab))
    
    def on_epoch_end(self):
        if self.shuffle:
            np.random.shuffle(self.indices)

In [None]:
batch_size = 64
n_epochs = 5

In [None]:
train_gen = DataGenerator(train_img_embeds, train_captions_indexed, batch_size, max_len)
val_gen = DataGenerator(val_img_embeds, val_captions_indexed, batch_size, max_len)

In [None]:
file_path="model/image_captioning.hdf5"
checkpoints = keras.callbacks.ModelCheckpoint(file_path, save_best_only=True, verbose=1)

In [None]:
from keras_tqdm import TQDMNotebookCallback

model.fit_generator(generator=train_gen, epochs=n_epochs, validation_data=val_gen, 
                    callbacks=[checkpoints, TQDMNotebookCallback()], verbose=0)

## Load Model

In [None]:
model = M.load_model("model/image_captioning.hdf5")

In [None]:
model.summary()

In [None]:
# we take the last hidden layer of InceptionV3 as an image embedding
preprocess_for_model = keras.applications.inception_v3.preprocess_input

encoder_model = keras.applications.InceptionV3(include_top=False)
encoder_model = M.Model(encoder_model.inputs, L.GlobalAveragePooling2D()(encoder_model.output))

In [None]:
# this is an actual prediction loop
def generate_caption(image, encoder_model, model, max_len=20):
    """
    Generate caption for given image.
    """
    # current caption
    # start with only START token
    caption = [vocab[START]]
    img_embed = encoder_model.predict(image)
    
    for i in range(max_len):
        next_word_probs = model.predict([img_embed, np.array([caption])])
        next_word_probs = next_word_probs[:, i, : ]
        next_word = np.argmax(next_word_probs, axis=-1)
        caption.append(next_word[0])
        if next_word[0] == vocab[END]:
            break

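    # drop START, and drop END only if it was actually generated
    tokens = caption[1:]
    if tokens and tokens[-1] == vocab[END]:
        tokens = tokens[:-1]
    words = ' '.join(vocab_inverse[token] for token in tokens)
    return words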
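
The greedy decoding loop can be tried out without a trained network. In the sketch below `toy_next_token_probs` is a hypothetical stand-in for the decoder that simply follows a fixed script, and the token indices are made up for illustration.

```python
import numpy as np

START_IDX, END_IDX = 1, 2
toy_words = {0: "#PAD#", 1: "#START#", 2: "#END#", 3: "a", 4: "dog"}

def toy_next_token_probs(prefix):
    """Hypothetical stand-in for the decoder: probabilities for the next token."""
    script = [3, 4, END_IDX]                 # "a dog #END#"
    step = len(prefix) - 1                   # number of tokens generated so far
    probs = np.full(len(toy_words), 0.05)
    probs[script[min(step, len(script) - 1)]] = 0.8
    return probs

def greedy_decode(next_token_probs, max_len=20):
    caption = [START_IDX]
    for _ in range(max_len):
        # always take the most probable next token
        next_token = int(np.argmax(next_token_probs(caption)))
        if next_token == END_IDX:
            break                            # stop without storing END
        caption.append(next_token)
    return caption[1:]                       # drop START

decoded = greedy_decode(toy_next_token_probs)
print(" ".join(toy_words[t] for t in decoded))
```

Greedy decoding commits to the single best token at every step. Sampling from the distribution or using beam search are common alternatives when captions come out too repetitive.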

In [None]:
def apply_model_to_image_raw_bytes(raw):
    img = utils.decode_image_from_buf(raw)
    fig = plt.figure(figsize=(7, 7))
    plt.grid(False)
    plt.axis('off')
    plt.imshow(img)
    img = utils.crop_and_preprocess(img, (IMG_SIZE, IMG_SIZE), preprocess_for_model)
    img = img.reshape((1, IMG_SIZE, IMG_SIZE, 3))
    # generate_caption already strips START and END, so print the caption as is
    print(generate_caption(img, encoder_model, model, max_len))
    plt.show()

In [None]:
apply_model_to_image_raw_bytes(open("./images/her.jpg", "rb").read())

In [None]:
apply_model_to_image_raw_bytes(open("./images/me.jpg", "rb").read())

In [None]:
apply_model_to_image_raw_bytes(open("./images/her1.jpg", "rb").read())

In [None]:
apply_model_to_image_raw_bytes(open("./images/room.jpg", "rb").read())

In [None]:
apply_model_to_image_raw_bytes(open("./images/wolf.jpg", "rb").read())

In [None]:
apply_model_to_image_raw_bytes(open("./images/dog.jpg", "rb").read())

In [None]:
apply_model_to_image_raw_bytes(open("./images/he_she.jpg", "rb").read())

In [None]:
apply_model_to_image_raw_bytes(open("./images/troll.png", "rb").read())

In [None]:
apply_model_to_image_raw_bytes(open("./images/mock.jpg", "rb").read())

In [None]:
apply_model_to_image_raw_bytes(open("./images/mia.jpg", "rb").read())

Now it's time to find 10 examples where your model works well and 10 examples where it fails!

You can use images from validation set as follows:
```python
show_valid_example(val_img_fns, example_idx=...)
```

You can use images from the Internet as follows:
```python
! wget ...
apply_model_to_image_raw_bytes(open("...", "rb").read())
```

If you use these functions, the output will be embedded into your notebook and will be visible during peer review!

When you're done, download your notebook using "File" -> "Download as" -> "Notebook" and prepare that file for peer review!

In [None]:
### YOUR EXAMPLES HERE ###

That's it! 

Congratulations, you've trained your image captioning model and can now produce captions for any picture from the Internet!