# Introduction

Hello community, and welcome to this kernel. Here I am going to build a feature extractor - decoder model that generates captions from images.

I'll use a pretrained model (VGG16) as the image feature extractor and an RNN (GRU) as the decoder, and I'll explain each step as clearly as I can.

So let's start.

# Importing The Libraries

* In this section we'll import the libraries we need. We'll build the model in TensorFlow (through its Keras API), but after this kernel you should be able to build a similar model in PyTorch.


In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
import os,sys,time
import matplotlib.pyplot as plt
from PIL import Image

import warnings as wrn
wrn.filterwarnings('ignore')

tf.compat.v1.disable_eager_execution()

from tensorflow.python.keras import backend as K
from tensorflow.python.keras.models import Model
from tensorflow.python.keras.layers import Input,Dense,GRU,CuDNNGRU,Embedding
from tensorflow.keras.applications import VGG16
from tensorflow.keras.optimizers import RMSprop
from tensorflow.python.keras.callbacks import ModelCheckpoint
from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences 

* Let's check our data description dataframe

In [None]:
dataDesc = pd.read_csv("../input/flickr8kimagescaptions/flickr8k/captions.txt")
dataDesc.head()

* Hmm, it looks like there is more than one caption per image. Let's check.

In [None]:
dataDesc[dataDesc["image"] == "1000268201_693b08cb0e.jpg"]

# Preparing Data Using VGG16

In this section we'll use VGG16 to prepare the dataset for our decoder. But first I want to explain why we won't wire VGG16's output directly into the decoder's initial state inside a single model.

As you may know, VGG16 is a very large model with millions of parameters, and it was trained on ImageNet, a dataset with 1000 classes and more than a million images.

In transfer learning we generally don't retrain the feature extractor (the convolutional layers, as opposed to the fully connected ones), because the pretrained features already work well.

So, if we don't need to train VGG16 again, we can use it as a separate model: we extract the features of all images once and reuse them while training the decoder.

This will help us to save memory and time.
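As a rough sketch of this "extract once, reuse many times" idea, the features can also be cached on disk so that later runs skip VGG16 entirely. The file name `transfer_values.npy` and the helper `load_or_compute` are hypothetical, and the random array merely stands in for real VGG16 outputs.

```python
import os
import tempfile
import numpy as np

def load_or_compute(cache_path, compute_fn):
    """Return cached features if the file exists, else compute and cache them."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    features = compute_fn()
    np.save(cache_path, features)
    return features

# Stand-in for the expensive VGG16 forward pass (hypothetical, for illustration only).
calls = {"count": 0}
def fake_extract():
    calls["count"] += 1
    return np.random.rand(4, 4096).astype("float32")

cache_path = os.path.join(tempfile.mkdtemp(), "transfer_values.npy")
first = load_or_compute(cache_path, fake_extract)   # computes and saves
second = load_or_compute(cache_path, fake_extract)  # loads from disk
print(calls["count"], np.array_equal(first, second))
```

The expensive function runs only on the first call; the second call reads the saved array.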


First, we'll define a function that reads the dataset and returns the images and captions.

In [None]:
ABSOLUTE_PATH = "../input/flickr8kimagescaptions/flickr8k/images/"

def loadDataset(resize=None):
    images = []
    for index,image in enumerate(np.unique(dataDesc["image"])):
        sys.stdout.write(f"\r Reading Image {index}")
        sys.stdout.flush()
        # Reading Image
        img = Image.open(ABSOLUTE_PATH + image)
        # If we specified resize, resize the image.
        if resize is not None:
            img = img.resize(resize,resample=Image.LANCZOS)
        
        images.append(np.asarray(img))
            
    captions = []
    cnt = 0
    for image in np.unique(dataDesc["image"]):
        sys.stdout.write(f"\rReading Caption {cnt+1}")
        cnt += 1
        chck = []
        # Each image has captions more than one, so we'll store captions in a list of lists.
        for cap in dataDesc[dataDesc["image"] == image]["caption"].values:
            chck.append(cap)
        captions.append(chck)
        
    return np.asarray(images),captions  

* Now let's download our VGG16 from google cloud storage.

In [None]:
base_model = VGG16()
base_model.summary()

* As you can see, VGG16 has about 138M parameters.
* VGG16 also ends with a 1000-neuron softmax prediction layer, but we don't want predictions, only features. So we'll take the output of the second fully connected layer (`fc2`) and build a new model with the Keras Functional API.

In [None]:
transfer_layer = base_model.get_layer("fc2")
image_model_transfer = Model(inputs=[base_model.input],outputs=[transfer_layer.output])

img_size = K.int_shape(image_model_transfer.input)[1:3]
print(img_size)

* VGG16 expects input images of shape (224, 224, 3), and the `fc2` layer we selected has 4096 neurons.

In [None]:
transfer_values_size = K.int_shape(image_model_transfer.output)[1]
transfer_values_size

* Now we'll define a new function. It reads the data with the previous function, feeds the images to VGG16, and returns the extracted features together with the captions.

In [None]:
%%time
def processDataset():
    
    images,captions = loadDataset(resize=img_size)
    print("Reading images finished")
    vecs = image_model_transfer.predict(images)
    return vecs,captions

transfer_values,captions = processDataset()
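
One thing to be aware of: the ImageNet weights that ship with Keras's VGG16 were trained on inputs passed through `tensorflow.keras.applications.vgg16.preprocess_input`, while the function above feeds raw RGB pixels. The features are still usable, but preprocessing may improve them. As a hedged illustration, the NumPy function below mimics what that preprocessing does (RGB to BGR, then subtracting the per-channel ImageNet means). `vgg16_style_preprocess` is a hypothetical name; in practice you would call the Keras function instead.

```python
import numpy as np

# Per-channel ImageNet means in BGR order, as used by the "caffe" preprocessing mode.
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype="float32")

def vgg16_style_preprocess(images_rgb):
    """Mimic VGG16 preprocessing for a batch of shape (n, height, width, 3)."""
    x = np.asarray(images_rgb, dtype="float32")
    x = x[..., ::-1]              # RGB -> BGR
    return x - IMAGENET_BGR_MEAN  # zero-center each channel, no scaling

batch = np.full((2, 224, 224, 3), 128, dtype="uint8")  # two grey dummy images
out = vgg16_style_preprocess(batch)
print(out.shape, out[0, 0, 0])
```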


In [None]:
captions[0][1]

In [None]:
transfer_values.shape

* The `fc2` layer has 4096 neurons and there are 8091 images in our set, so the transfer values array has shape (8091, 4096).

* Now we'll add start and end marks to the captions, so the model can learn where a sentence begins and ends.

* When we generate captions later, we'll stop picking words as soon as the end mark appears, because the end mark means *Oh please stop, I finished*.
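
To illustrate why the end mark matters at generation time, here is a small pure-Python sketch using the marker words `ssss` and `eeee` that this kernel adopts. The helper `cut_at_end_mark` is hypothetical, and the word list stands in for whatever the decoder emits one word at a time.

```python
def cut_at_end_mark(words, start="ssss", end="eeee"):
    """Keep the words between the start mark and the first end mark."""
    kept = []
    for word in words:
        if word == end:
            break          # the decoder signalled that the caption is finished
        if word != start:
            kept.append(word)
    return " ".join(kept)

generated = ["ssss", "a", "dog", "runs", "eeee", "runs", "runs"]
print(cut_at_end_mark(generated))  # a dog runs
```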

In [None]:
mark_start = "ssss "
mark_end = " eeee"

def markCaptions(captions_listlist):
    return [[mark_start + caption + mark_end for caption in captions_list]
           for captions_list in captions_listlist]


marked_captions = markCaptions(captions)

In [None]:
marked_captions[0][1]

In [None]:
flattened_captions = [caption for cap_list in marked_captions for caption in cap_list]
print(len(flattened_captions))
print(flattened_captions[0])


* Now we'll define a tokenizer wrapper. It adds a few helper methods, such as one that converts a token back to the word it represents.

In [None]:
class TokenizerWrap(Tokenizer): 
    
    def __init__(self, texts, num_words=None):
        Tokenizer.__init__(self, num_words=num_words)
        
        self.fit_on_texts(texts)

        self.index_to_word = dict(zip(self.word_index.values(),
                                      self.word_index.keys()))

    def token_to_word(self, token):
        word = " " if token == 0 else self.index_to_word[token]
        return word 

    def tokens_to_string(self, tokens):
        words = [self.index_to_word[token] for token in tokens if token != 0]
        
        text = " ".join(words)

        return text
    
    def captions_to_tokens(self, captions_listlist):
        tokens = [self.texts_to_sequences(captions_list)
                  for captions_list in captions_listlist]
        return tokens
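
For intuition, the mapping that the Keras `Tokenizer` builds can be approximated in a few lines of plain Python: words are ranked by frequency, index 0 is reserved for padding, and the inverse dictionary turns tokens back into words. This is a simplified sketch (no punctuation filtering, hypothetical helper names, ties broken alphabetically), not the Keras implementation.

```python
from collections import Counter

def build_word_index(texts):
    """Assign indices starting at 1, most frequent word first (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Ties are broken alphabetically here so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: i + 1 for i, (word, _) in enumerate(ranked)}

texts = ["ssss a dog runs eeee", "ssss a cat sleeps eeee"]
word_index = build_word_index(texts)
index_to_word = {i: w for w, i in word_index.items()}

tokens = [word_index[w] for w in texts[0].split()]
decoded = " ".join(index_to_word[t] for t in tokens if t != 0)
print(tokens, "->", decoded)
```

Encoding a caption and decoding it again gives back the original text, which is the round trip that `captions_to_tokens` and `tokens_to_string` provide.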

* Let's create our tokenizer and tokenize our captions.

In [None]:
num_words = 10000
tokenizer = TokenizerWrap(texts=flattened_captions,num_words=num_words)

In [None]:
tokens_train = tokenizer.captions_to_tokens(marked_captions)


In [None]:
token_start = tokenizer.word_index[mark_start.strip()]
token_end = tokenizer.word_index[mark_end.strip()]


In [None]:
print(token_start)
print(token_end)

* We noticed earlier that each image has more than one caption, but a training sample can pair an image with only one caption.
* There are several strategies; for example, we could use data augmentation to create more images, one per caption.
* But doing that could cause memory problems.

* So we'll use a different strategy: pick a random caption for each image, and pick again every time the image appears in a batch.

In [None]:
# This function will return random captions for specified indices.
def get_random_captions_tokens(idx):
    results = [] 
    for i in idx:
        # tokens_train[i] is a ragged list of token lists, so take its length directly.
        j = np.random.choice(len(tokens_train[i]))
        tokens = tokens_train[i][j]
        results.append(tokens)
    
    return results

* And now we'll write a generator that yields one batch of data per iteration. It randomly selects images, randomly selects a caption for each selected image, and returns the data in the format the model expects.

In [None]:
num_images = 8091
def batch_generator(batch_size):
    while True:
        idx = np.random.randint(num_images,size=batch_size)
        
        t_values = transfer_values[idx]
        tokens = get_random_captions_tokens(idx)

        num_tokens = [len(t) for t in tokens]
        max_tokens = np.max(num_tokens)

        tokens_padded = pad_sequences(tokens,
                                      maxlen=max_tokens,
                                      padding="post",
                                      truncating="post"
                                     )

        decoder_input_data = tokens_padded[:,0:-1]
        decoder_output_data = tokens_padded[:,1:]

        x_data = {"decoder_input":decoder_input_data,
                  "transfer_values_input":t_values
                 }

        y_data = {"decoder_output":decoder_output_data}
        
        yield (x_data,y_data)
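
The one-step shift between decoder input and decoder output is easy to miss, so here is a toy NumPy illustration. The token ids are made up: 1 plays the role of the start token, 2 the end token, and 0 the padding.

```python
import numpy as np

# Two already-padded captions: [start, w, w, end, pad] and [start, w, end, pad, pad].
tokens_padded = np.array([[1, 5, 6, 2, 0],
                          [1, 7, 2, 0, 0]])

decoder_input_data = tokens_padded[:, :-1]   # everything except the last position
decoder_output_data = tokens_padded[:, 1:]   # the same sequence shifted left by one

for x, y in zip(decoder_input_data[0], decoder_output_data[0]):
    print(f"input token {x} -> target token {y}")
```

At every position the target is the token that follows the input token, which is what lets the model learn next-word prediction.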


# Decoder Modeling
Everything about the data is ready, so now we can build our decoder. In the decoder we'll use an improved variant of the RNN, the Gated Recurrent Unit (GRU). It's worth learning the details if you don't know them, but this is enough to get started:

***Simple recurrent neural networks push every input through the same state update, whether it is relevant or not. Over long sequences this leads to the vanishing gradient problem when we backpropagate through the model.***

***Gated variants such as GRUs and LSTMs learn what to keep and what to discard: LSTMs use input, forget and output gates, and GRUs use update and reset gates. This mitigates the vanishing gradient problem.***
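
To make the gating idea concrete, here is a minimal NumPy sketch of a single GRU step with an update gate `z` and a reset gate `r`. The weight names, sizes and random values are arbitrary choices for illustration, and Keras's GRU layers differ in implementation details.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU step; p holds the weight matrices and biases."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])   # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])   # reset gate
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])  # candidate state
    return z * h_prev + (1.0 - z) * h_cand                  # blend old and new state

rng = np.random.default_rng(0)
n_in, n_state = 3, 4
p = {name: rng.normal(size=(n_state, n_in)) for name in ("Wz", "Wr", "Wh")}
p.update({name: rng.normal(size=(n_state, n_state)) for name in ("Uz", "Ur", "Uh")})
p.update({name: np.zeros(n_state) for name in ("bz", "br", "bh")})

x, h_prev = rng.normal(size=n_in), rng.normal(size=n_state)
h_new = gru_step(x, h_prev, p)

# Push the update gate towards 1: the old state is carried over almost unchanged.
p_keep = dict(p, bz=np.full(n_state, 50.0))
h_kept = gru_step(x, h_prev, p_keep)
print(np.allclose(h_kept, h_prev, atol=1e-6))
```

When the update gate saturates, the previous state passes through nearly untouched, which is the path that lets information (and gradients) survive over many time steps.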

In [None]:
total_num_captions = len(dataDesc["caption"])
total_num_captions

In [None]:
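BATCH_SIZE = 256  # assumed value: the original kernel never defines BATCH_SIZE, so adjust as needed
steps_per_epoch = int(total_num_captions / BATCH_SIZE)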
steps_per_epoch

* Let's create our layers.

In [None]:
state_size = 256
embedding_size = 100

transfer_values_input = Input(shape=(transfer_values_size,),
                              name="transfer_values_input"
                             )

decoder_transfer_map = Dense(state_size,
                             activation="tanh",
                             name="decoder_transfer_map"
                            )

decoder_input = Input(shape=(None,),name="decoder_input")

* As you know, when we work with text in deep learning we generally represent words as vectors. We could use simpler methods such as bag-of-words (BoW), but word vectors usually work better.

* In this kernel we won't train word embeddings from scratch. We'll start from pretrained GloVe vectors, because our dataset is too small to learn good word vectors on its own.

In [None]:
word2vec = {}
# Each line holds a word followed by its 100 vector components.
with open("../input/glove6b100dtxt/glove.6B.100d.txt",encoding="utf-8") as glove_file:
    for line in glove_file:
        values = line.split()
        word = values[0]
        vec = np.asarray(values[1:],dtype="float32")
        word2vec[word] = vec


* We've created a dict that maps each word to its vector.

In [None]:
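replaced_count = 0
# Start from random vectors, then overwrite the rows for words GloVe knows.
embedding_matrix = np.random.uniform(-1,1,(num_words,embedding_size))
for word,i in tokenizer.word_index.items():
    # word_index can hold more than num_words entries; skip indices outside the matrix.
    if i >= num_words:
        continue
    vec = word2vec.get(word)
    if vec is not None:
        embedding_matrix[i] = vec
        replaced_count += 1
        
print(replaced_count)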

* We created a random embedding matrix and replaced its rows with the pretrained vectors wherever a word was found in GloVe.

In [None]:
decoder_embedding = Embedding(input_dim=num_words,
                              output_dim=embedding_size,
                              weights=[embedding_matrix],
                              trainable=True,
                              name="decoder_embedding"
                             )


In [None]:
decoder_gru1 = CuDNNGRU(state_size,return_sequences=True,name="decoder_gru1")
decoder_gru2 = CuDNNGRU(state_size,return_sequences=True,name="decoder_gru2")
decoder_gru3 = CuDNNGRU(state_size,return_sequences=True,name="decoder_gru3")

* We'll use three stacked GRU layers in the decoder.

In [None]:
decoder_dense = Dense(num_words,
                      activation="linear",
                      name="decoder_output"
                     )


* Now let's define a function that connects the decoder layers.

In [None]:
def connectDecoder(transfer_values):
    initial_state = decoder_transfer_map(transfer_values)

    net = decoder_input
    net = decoder_embedding(net)
    net = decoder_gru1(net,initial_state=initial_state)
    net = decoder_gru2(net,initial_state=initial_state)
    net = decoder_gru3(net,initial_state=initial_state)
    decoder_output = decoder_dense(net)
    
    return decoder_output


* Everything is ready, so we can create our model with the Functional API.

In [None]:
decoder_output = connectDecoder(transfer_values_input)

decoder_model = Model(inputs=[transfer_values_input,decoder_input],outputs=[decoder_output])

* We'll use a loss function that applies softmax internally, which is why our last dense layer has a linear activation.

In [None]:
def sparse_cross_entropy(y_true,y_pred):
    
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true,
                                                          logits=y_pred)
    
    loss_mean = tf.reduce_mean(loss)
    
    return loss_mean
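
To see what "softmax inside the loss" means, the NumPy sketch below reproduces the computation that `tf.nn.sparse_softmax_cross_entropy_with_logits` performs: it applies a log-softmax to the raw scores and picks out the negative log-probability of the true token. This is an illustrative re-implementation with toy numbers, not TensorFlow's code.

```python
import numpy as np

def sparse_softmax_cross_entropy(labels, logits):
    """Cross-entropy per position, computed directly from unnormalized logits."""
    shifted = logits - logits.max(axis=-1, keepdims=True)        # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -np.take_along_axis(log_probs, labels[..., None], axis=-1)[..., 0]

# One caption of 3 positions over a toy vocabulary of 4 words.
logits = np.array([[[2.0, 0.5, 0.1, 0.1],
                    [0.2, 3.0, 0.1, 0.3],
                    [0.0, 0.0, 0.0, 0.0]]])
labels = np.array([[0, 1, 3]])

loss = sparse_softmax_cross_entropy(labels, logits)
print(loss.shape, loss.mean())
```

Note that the loss in this kernel averages over every position, padding included. Masking the padded positions is a common refinement, but it is not done here.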

* Now we'll define the optimizer and compile the model.

In [None]:
optimizer = RMSprop(learning_rate=1e-3)  # `lr` is a deprecated alias of `learning_rate`
decoder_target = tf.compat.v1.placeholder(dtype="int32",shape=(None,None))
decoder_model.compile(optimizer=optimizer,
                      loss=sparse_cross_entropy,
                      target_tensors=[decoder_target]
                     )


In [None]:
path_checkpoint = "model.keras"
checkpoint = ModelCheckpoint(filepath=path_checkpoint,
                             save_weights_only=True
                            )


In [None]:
try:
    decoder_model.load_weights(path_checkpoint)
except Exception as E:
    print("Something went wrong when loading the checkpoint, training will start from scratch")
    print()
    print(E)

    

In [None]:
decoder_model.fit_generator(generator=batch_generator(BATCH_SIZE),
                            steps_per_epoch=steps_per_epoch,
                            epochs=10,
                            callbacks=[checkpoint]
                           )


* We've trained our model, but we still need a function to use it. Let's define it.


In [None]:
def generate_caption(image_path,max_tokens=30):
    
    # Reading and extracting features using VGG16.
    image = np.asarray(Image.open(image_path).resize((224,224)))
    transfer_values = image_model_transfer.predict(np.expand_dims(image,axis=0))
    
    # The decoder input starts as all zeros (padding); we fill it in one token at a time.
    # `np.int` was removed from recent NumPy versions, so the builtin `int` is used instead.
    decoder_input_data = np.zeros(shape=(1,max_tokens),dtype=int)
    
    token_int = token_start
    output_text = " "
    count_tokens = 0
    
    # Keep going until the model emits the end token or we reach max_tokens.
    while token_int != token_end and count_tokens < max_tokens:
        
        decoder_input_data[0,count_tokens] = token_int
        x_data = {"transfer_values_input":transfer_values,
                  "decoder_input":decoder_input_data
                 }
        
        # Model will predict the next word.
        decoder_output = decoder_model.predict(x_data)
        
        token_onehot = decoder_output[0,count_tokens,:]
        token_int = np.argmax(token_onehot)
        
        sampled_word = tokenizer.token_to_word(token_int)
        output_text = output_text + " " + sampled_word
        count_tokens += 1
        
    plt.imshow(image)
    plt.axis("off")
    print()
        
    print("Predicted Caption:")
    print(output_text.replace("ssss"," ").replace("eeee"," ").strip())
    print()
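
Stripped of the image handling, the loop above is plain greedy decoding: at each step take the most likely next token and stop at the end token. The pure-Python sketch below shows just that control flow. `fake_scores` is a hypothetical stand-in for the trained decoder and the token ids are made up.

```python
def greedy_decode(next_token_scores, token_start, token_end, max_tokens=30):
    """Repeatedly pick the highest-scoring next token until the end token appears."""
    sequence = [token_start]
    while len(sequence) <= max_tokens:
        scores = next_token_scores(sequence)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        sequence.append(token)
        if token == token_end:
            break
    return sequence

# Hypothetical stand-in for the trained decoder: it always favours "previous token + 1".
def fake_scores(sequence, vocab_size=6):
    scores = [0.0] * vocab_size
    scores[min(sequence[-1] + 1, vocab_size - 1)] = 1.0
    return scores

print(greedy_decode(fake_scores, token_start=1, token_end=4))
```

Greedy decoding is simple and fast, but it can miss better captions; beam search is a common alternative that keeps several candidate sequences alive.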
    

In [None]:
generate_caption("../input/clothing-dataset-full/images_original/00143901-a14c-4600-960f-7747b4a3a8cd.jpg")

In [None]:
generate_caption("../input/clothing-dataset-full/images_original/00f6e504-7c27-438e-a5d7-bc65e557bb2b.jpg")

* As we can see, our model's captions are a bit weird, because our dataset is far too small to train a great image captioning model. If you're interested in image captioning, you should try the COCO dataset.

# Conclusion
Thanks for your attention. If you have any questions, feel free to ask in the comment section.