<a href="https://colab.research.google.com/github/shahparth0007/AdvanceDataScience/blob/main/Final_Image_Captioning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Abstract

Image caption generation, also called image summarization, is the task of generating a natural-language description of an image. It is currently accomplished with a combination of computer vision (CV), natural language processing (NLP), and machine learning methods. The inspiration for such an application comes from social media platforms like Facebook (now Meta), which summarize images posted by users and infer details such as where you are and what you are wearing. The application is also valuable for helping visually impaired individuals comprehend images of the real world. In this task, we build a model that generates a natural-language description of an image. We use a convolutional neural network to extract image features and recurrent neural networks to generate text from these features. We incorporated the attention mechanism while generating captions. We evaluated the model on the Flickr8k dataset.

![image](https://user-images.githubusercontent.com/91229784/166126929-d8c0483d-3e88-4cdd-b194-299b3e69e43e.png)


# **Problem Statement**
Given an image, we want to obtain a sentence that describes what the image consists of.

Note: We suggest enabling a GPU for this session, because this is an image summarization model.

***Steps:***

*   On the header click on "Runtime"
*   Click on "Change runtime type"
*   Click on "Hardware accelerator"
*   Select "GPU"
*   Click on "Save"

You are now all set for image caption generation.

# **1) Importing Libraries**

In [2]:
import os
import pickle
import numpy as np  # linear algebra
from tqdm.notebook import tqdm
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from PIL import Image
import plotly.express as px

from tensorflow.keras.applications.vgg19 import VGG19, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical, plot_model
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add, GRU

# **2) Setup Data**
We have set things up so that you can load the data directly from our Kaggle account.

1.   First, `gdown` downloads the `kaggle.json` credentials file.
2.   Then we install the `kaggle` library and fetch the data from Kaggle.

The following files are downloaded:

*   The Flickr8k data with captions
*   Our trained Model V1 (With parameters: )
*   Our trained Model V2 (With parameters: )
*   The features.pkl file (the output of the VGG feature-extraction model)

All of the above files will appear in the file browser of your Google Colab session.





In [3]:
!gdown --id 1jXH8vzpVy4gAkc-1lrWlwVUNF4w0B0dU
!pip install kaggle
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets download parthshah0007/flickerdataset
! unzip -q '/content/flickerdataset.zip' -d '/content/flickr8k/'

Downloading...
From: https://drive.google.com/uc?id=1jXH8vzpVy4gAkc-1lrWlwVUNF4w0B0dU
To: /content/kaggle.json
100% 69.0/69.0 [00:00<00:00, 115kB/s]
404 - Not Found
unzip:  cannot find or open /content/flickerdataset.zip, /content/flickerdataset.zip.zip or /content/flickerdataset.zip.ZIP.


In [4]:
BASE_DIR = '/content/flickr8k/FlickerAllData/Data/'
WORKING_DIR = '/content/flickr8k/FlickerAllData/'

# **3) Loading the VGG pretrained Model**

In [None]:
# load vgg19 model
model = VGG19()
# restructure the model
model = Model(inputs=model.inputs, outputs=model.layers[-2].output)
# summarize (summary() prints the table itself and returns None)
model.summary()

# **4) Image Feature Extraction (Commented)**

Using the pretrained VGG19 model, we extract a feature vector for each image.

More about VGG can be found at: https://keras.io/api/applications/vgg/

Note: The extraction code below is commented out because it takes a long time to run.
We have already computed the features and stored them in the features.pkl file, which we downloaded from Kaggle.
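
The save-and-reload round trip for such a feature dictionary could be sketched as follows, using context managers so that file handles are always closed. The image IDs and three-element vectors are made-up stand-ins for the real 4096-dimensional VGG19 outputs, and the helper names and temporary path are assumptions for illustration only.

```python
import os
import pickle
import tempfile

# Toy stand-in for the real mapping {image_id: feature vector}.
toy_features = {
    "img_001": [0.1, 0.0, 0.7],
    "img_002": [0.4, 0.9, 0.2],
}

def save_features(features, path):
    """Write the feature dictionary to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(features, f)

def load_features(path):
    """Read a feature dictionary back from a pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "features.pkl")
    save_features(toy_features, pkl_path)
    restored = load_features(pkl_path)

print(len(restored))
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.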

In [None]:
# # extract features from image

# features = {}
# directory = os.path.join(BASE_DIR, 'Images')

# for img_name in tqdm(os.listdir(directory)):
#     # load the image from file
#     img_path = directory + '/' + img_name
#     image = load_img(img_path, target_size=(224, 224))
#     # convert image pixels to numpy array
#     image = img_to_array(image)
#     # reshape data for model
#     image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
#     # preprocess image for vgg
#     image = preprocess_input(image)
#     # extract features
#     feature = model.predict(image, verbose=0)
#     # get image ID
#     image_id = img_name.split('.')[0]
#     # store feature
#     features[image_id] = feature

# # store features in pickle
# pickle.dump(features, open(os.path.join(WORKING_DIR, 'features.pkl'), 'wb'))

In [None]:
# load features from pickle
with open(os.path.join(WORKING_DIR, 'features.pkl'), 'rb') as f:
    features = pickle.load(f)
print(len(features))

In [None]:
with open(os.path.join(BASE_DIR, 'captions.txt'), 'r') as f:
    next(f)
    captions_doc = f.read()

In [None]:
# create mapping of image to captions
mapping = {}
# process lines
for line in tqdm(captions_doc.split('\n')):
    # split the line by comma(,)
    tokens = line.split(',')
    if len(line) < 2:
        continue
    image_id, caption = tokens[0], tokens[1:]
    # remove extension from image ID
    image_id = image_id.split('.')[0]
    # convert caption list to string
    caption = " ".join(caption)
    # create list if needed
    if image_id not in mapping:
        mapping[image_id] = []
    # store the caption
    mapping[image_id].append(caption)
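
Splitting each line on every comma and rejoining with spaces replaces any commas inside a caption with spaces. As a hedged alternative, the sketch below uses the standard `csv` module, assuming the captions file is a two-column CSV with a header row; the sample rows and the `build_mapping` helper are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the assumed "image,caption" CSV layout.
sample = io.StringIO(
    "image,caption\n"
    "img_001.jpg,A child climbs the stairs .\n"
    'img_001.jpg,"A girl, smiling, enters a playhouse ."\n'
    "img_002.jpg,Two dogs play on the road .\n"
)

def build_mapping(handle):
    """Group captions by image ID (file name without its extension)."""
    mapping = defaultdict(list)
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    for row in reader:
        if len(row) < 2:
            continue  # ignore blank or malformed lines
        image_id = row[0].split(".")[0]
        mapping[image_id].append(row[1])
    return dict(mapping)

toy_mapping = build_mapping(sample)
print(toy_mapping["img_001"])
```

With this approach, quoted captions keep their internal commas and lose the surrounding quotes.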

In [None]:
import re

def clean(mapping):
    for key, captions in mapping.items():
        for i in range(len(captions)):
            # take one caption at a time
            caption = captions[i]
            # preprocessing steps
            # convert to lowercase
            caption = caption.lower()
            # delete digits, special chars, etc. (keep letters and whitespace);
            # str.replace matches text literally, so a regex pattern needs re.sub
            # NOTE: the saved models may have been trained with this step as a
            # no-op, so their vocabulary could differ slightly
            caption = re.sub(r'[^a-z\s]', '', caption)
            # delete additional spaces
            caption = re.sub(r'\s+', ' ', caption)
            # add start and end tags to the caption
            caption = 'startseq ' + " ".join([word for word in caption.split() if len(word)>1]) + ' endseq'
            captions[i] = caption
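
One detail worth illustrating for this cleaning step: `str.replace` treats its first argument as literal text, whereas `re.sub` interprets it as a regular expression. A small sketch on a made-up caption:

```python
import re

caption = "A  dog's 2 toys, on the grass!"

# str.replace looks for the literal text "[^A-Za-z]", which never occurs,
# so the caption comes back unchanged.
literal = caption.replace("[^A-Za-z]", "")

# re.sub interprets the pattern: drop everything except letters and
# whitespace, then collapse runs of whitespace into a single space.
letters_only = re.sub(r"[^A-Za-z\s]", "", caption)
collapsed = re.sub(r"\s+", " ", letters_only).strip()

print(literal == caption)
print(collapsed)
```

Keeping `\s` inside the character class matters: a pattern that removes every non-letter would also delete the spaces between words.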

In [None]:
# before preprocess of text
mapping['1000268201_693b08cb0e']

In [None]:
# preprocess the text
clean(mapping)

In [None]:
# after preprocess of text
mapping['1000268201_693b08cb0e']

In [None]:
all_captions = []
for key in mapping:
    for caption in mapping[key]:
        all_captions.append(caption)

print("There are total of:",len(all_captions)," captions")

In [None]:
# tokenize the text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)
vocab_size = len(tokenizer.word_index) + 1
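
To make the `+ 1` above concrete, here is a pure-Python sketch of the idea behind a word index: words are ranked by frequency and numbered from 1, which leaves index 0 free for padding. This is an illustration and not the Keras implementation; the helper names are hypothetical and the tie-breaking between equally frequent words may differ from `Tokenizer`.

```python
from collections import Counter

captions = [
    "startseq dog runs endseq",
    "startseq dog jumps endseq",
]

def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ranked, start=1)}

def encode(text, word_index):
    """Convert a caption to a list of integer IDs, skipping unknown words."""
    return [word_index[w] for w in text.split() if w in word_index]

toy_word_index = build_word_index(captions)
toy_vocab_size = len(toy_word_index) + 1  # +1 because index 0 is reserved for padding

print(toy_word_index)
print(toy_vocab_size)
print(encode("startseq dog runs endseq", toy_word_index))
```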

In [None]:
# get maximum length of the caption available
max_length = max(len(caption.split()) for caption in all_captions)

In [None]:
image_ids = list(mapping.keys())
split = int(len(image_ids) * 0.90)
train = image_ids[:split]
test = image_ids[split:]

In [None]:
# create data generator to get data in batch (avoids session crash)

def data_generator(data_keys, mapping, features, tokenizer, max_length, vocab_size, batch_size):
    # loop over images
    X1, X2, y = list(), list(), list()
    n = 0
    while 1:
        for key in data_keys:
            n += 1
            captions = mapping[key]
            # process each caption
            for caption in captions:
                # encode the sequence
                seq = tokenizer.texts_to_sequences([caption])[0]
                # split the sequence into X, y pairs
                for i in range(1, len(seq)):
                    # split into input and output pairs
                    in_seq, out_seq = seq[:i], seq[i]
                    # pad input sequence
                    in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
                    # encode output sequence
                    out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
                    
                    # store the sequences
                    X1.append(features[key][0])
                    X2.append(in_seq)
                    y.append(out_seq)
            if n == batch_size:
                X1, X2, y = np.array(X1), np.array(X2), np.array(y)
                yield [X1, X2], y
                X1, X2, y = list(), list(), list()
                n = 0
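
The inner loop of the generator turns each caption into several training samples: every prefix of the caption becomes an input, and the word that follows it becomes the target. The sketch below shows this expansion in plain Python on made-up token IDs; `left_pad` mimics the default left-padding of `pad_sequences` and assumes the sequence is no longer than `max_length`.

```python
def left_pad(seq, max_length):
    """Pad with zeros on the left, the default behaviour of pad_sequences."""
    return [0] * (max_length - len(seq)) + seq

def caption_to_pairs(seq, max_length):
    """Expand one encoded caption into (padded prefix, next word) pairs."""
    pairs = []
    for i in range(1, len(seq)):
        pairs.append((left_pad(seq[:i], max_length), seq[i]))
    return pairs

encoded = [3, 1, 5, 2]  # made-up IDs for "startseq dog runs endseq"
pairs = caption_to_pairs(encoded, max_length=5)
for prefix, target in pairs:
    print(prefix, "->", target)
```

A caption of k tokens therefore yields k - 1 samples, which is why a batch of 32 images produces far more than 32 rows.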

# **Model 1: Long Short Term Memory**

An LSTM cell has three gates (input, forget, and output) that control what is written to, kept in, and read from its cell state.
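
For reference, one common textbook formulation of a single LSTM step is given below. Here $x_t$ is the input, $h_{t-1}$ the previous hidden state, $c_t$ the cell state, $\sigma$ the sigmoid function, and $\odot$ the element-wise product; the weight names are notation chosen for this sketch, and the Keras layer may organise its parameters differently.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```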

In [None]:
# encoder model
# image feature layers
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.4)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)
# sequence feature layers
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.4)(se1)
se3 = LSTM(256)(se2)

# decoder model
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='Adagrad')

# plot the model
plot_model(model, show_shapes=True)

## **Model Training (Commented)**

The training cell below is commented out because it takes a long time to run; a 30-epoch run takes approximately 30 minutes. Uncomment it if you want to retrain the model.

P.S.: We load the saved model below to show the predictions.

In [None]:
# # train the model
# epochs = 20
# batch_size = 32
# steps = len(train) // batch_size

# history_loss = []

# for i in range(epochs):
#     # create data generator
#     generator = data_generator(train, mapping, features, tokenizer, max_length, vocab_size, batch_size)
#     # fit for one epoch
#     history = model.fit(generator, epochs=1, steps_per_epoch=steps, verbose=1)
#     history_loss.append(history.history['loss'])

# # save the model
# model.save(WORKING_DIR+'/LSTM_categorical_crossentropy_Adagrad.h5')

# # Saving the loss
# error = pd.DataFrame(history_loss)
# error.to_csv('LSTM_categorical_crossentropy_Adagrad_loss.csv')


## **Hyperparameter for LSTM**
In the model below, we change the LSTM optimizer to 'Adam'.
We also increase the number of training epochs.

In [None]:
# encoder model
# image feature layers
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.4)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)
# sequence feature layers
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.4)(se1)
se3 = LSTM(256)(se2)

# decoder model
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='Adam')

# plot the model
plot_model(model, show_shapes=True)

In [None]:
# # train the model
# epochs = 30
# batch_size = 32
# steps = len(train) // batch_size

# history_loss = []

# for i in range(epochs):
#     # create data generator
#     generator = data_generator(train, mapping, features, tokenizer, max_length, vocab_size, batch_size)
#     # fit for one epoch
#     history = model.fit(generator, epochs=1, steps_per_epoch=steps, verbose=1)
#     history_loss.append(history.history['loss'])

# # save the model
# model.save(WORKING_DIR+'/LSTM_categorical_crossentropy_Adam.h5')

# # Saving the loss
# error = pd.DataFrame(history_loss)
# error.to_csv('LSTM_categorical_crossentropy_Adam_loss.csv')


# **Model 2: Gated Recurrent Unit (GRU)**
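
For comparison with the LSTM, one common formulation of a single GRU step is sketched below. It uses an update gate $z_t$ and a reset gate $r_t$ and has no separate cell state. The weight names are notation chosen for this sketch, and some references (and implementations) swap the roles of $z_t$ and $1 - z_t$.

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)} \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```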

In [None]:
# encoder model
# image feature layers
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.4)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)
# sequence feature layers
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.4)(se1)
se3 = GRU(256)(se2)

# decoder model
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='Adam')

# plot the model
plot_model(model, show_shapes=True)

## **Model Training (Commented)**

In [None]:
# # train the model
# epochs = 1
# batch_size = 32
# steps = len(train) // batch_size

# history_loss = []

# for i in range(epochs):
#     # create data generator
#     generator = data_generator(train, mapping, features, tokenizer, max_length, vocab_size, batch_size)
#     # fit for one epoch
#     history = model.fit(generator, epochs=1, steps_per_epoch=steps, verbose=1)
#     history_loss.append(history.history['loss'])

# # save the model
# model.save(WORKING_DIR+'/GRU_categorical_crossentropy_Adam.h5')

# # Saving the loss
# error = pd.DataFrame(history_loss)
# error.to_csv('GRU_categorical_crossentropy_Adam_loss.csv')


## **Hyperparameter for GRU**
In the model below, we change the GRU optimizer to 'RMSprop'.

In [None]:
# encoder model
# image feature layers
inputs1 = Input(shape=(4096,))
fe1 = Dropout(0.4)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)
# sequence feature layers
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.4)(se1)
se3 = GRU(256)(se2)

# decoder model
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='Rmsprop')

# plot the model
plot_model(model, show_shapes=True)

# **5) Performance metric of the algorithm**

Note: We saved all the performance metrics in advance, so here we only load them from our saved files and display them.



### 1) BLEU Score Explanation:

We use the BLEU score to compare each predicted caption with the actual (reference) captions, and we use these scores to compare model performance.

**BLEU-n** is the geometric mean of the modified n-gram precisions for n-gram orders 1 through n, multiplied by a brevity penalty that penalizes predictions shorter than the references.

(More precisely, it is string matching between references and hypotheses at different n-gram levels, which is why the metric has drawn much criticism. People still use it because it has been the community standard for a long time.)

For example, ignoring the brevity penalty, **BLEU-1** is simply the **unigram precision**, **BLEU-2** is the **geometric average of unigram and bigram precision**, **BLEU-3** is the **geometric average of unigram, bigram, and trigram precision**, and so on.

To compute a specific **n-gram BLEU score**, pass a weights parameter when you call **corpus_bleu**. If you omit the weights parameter, BLEU-4 is returned by default. The evaluation in this notebook passes explicit weights to report BLEU-1 through BLEU-4.

To compute BLEU-1, you can call corpus_bleu with weights as
```
weights = (1.0/1.0, )
corpus_bleu(references, hypotheses, weights)
```

To compute BLEU-2, you can call corpus_bleu with weights as

```
weights=(1.0/2.0, 1.0/2.0,)
corpus_bleu(references, hypotheses, weights)
```

To compute BLEU-3, you can call corpus_bleu with weights as

```
weights=(1.0/3.0, 1.0/3.0, 1.0/3.0,)
corpus_bleu(references, hypotheses, weights)
```

To compute BLEU-5, you can call corpus_bleu with weights as

```
weights=(1.0/5.0, 1.0/5.0, 1.0/5.0, 1.0/5.0, 1.0/5.0,)
corpus_bleu(references, hypotheses, weights)
```
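
To show what these weights act on, here is a minimal pure-Python sketch of sentence-level BLEU with uniform weights, clipped n-gram counts, and a brevity penalty. It is an illustration under simplifying assumptions, not a replacement for NLTK: `corpus_bleu` pools counts over the whole corpus and supports smoothing, and the function names and example tokens below are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(references, hypothesis, n):
    """Clipped n-gram precision: each hypothesis n-gram counts at most as
    often as it appears in any single reference."""
    hyp_counts = ngrams(hypothesis, n)
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

def brevity_penalty(references, hypothesis):
    """Penalise hypotheses shorter than the closest reference length."""
    hyp_len = len(hypothesis)
    if hyp_len == 0:
        return 0.0
    ref_len = min((abs(len(r) - hyp_len), len(r)) for r in references)[1]
    return 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)

def bleu(references, hypothesis, max_n):
    """BLEU-n for one sentence with uniform weights 1/max_n."""
    precisions = [modified_precision(references, hypothesis, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # the geometric mean is zero if any precision is zero
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(references, hypothesis) * math.exp(log_mean)

refs = [["startseq", "a", "dog", "runs", "on", "grass", "endseq"]]
hyp = ["startseq", "a", "dog", "runs", "endseq"]
bleu_1 = bleu(refs, hyp, 1)
bleu_2 = bleu(refs, hyp, 2)
print(bleu_1, bleu_2)
```

In this example every unigram of the prediction appears in the reference, yet BLEU-1 is below 1 because the prediction is shorter than the reference and the brevity penalty applies.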


In [None]:
Bleu_Comprehensive = pd.read_csv(WORKING_DIR + 'Bleu_Comprehensive.csv')
for i in range(1,5):
  fig = px.bar(Bleu_Comprehensive, x='Algorithm', y='BLEU_'+str(i), color='BLEU_'+str(i), width = 600,height=400,title="Bleu Score Comparison for "+ 'Bleu_'+str(i))
  fig.show()

### **2) Loss VS Epoch**

Explanation: An epoch is one full pass of the training dataset through the network. This plot shows the loss of each model at every epoch, which tells us whether a model would benefit from training for more epochs.
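
One simple way to read such a curve programmatically is to check whether the loss has stopped improving over the last few epochs. The sketch below is a hypothetical helper, not part of the notebook; the `patience` and `min_delta` thresholds and the loss values are made up for illustration.

```python
def has_plateaued(losses, patience=3, min_delta=0.01):
    """Return True if the loss improved by less than min_delta in each of
    the last `patience` epochs."""
    if len(losses) <= patience:
        return False  # not enough history to judge
    recent = losses[-(patience + 1):]
    improvements = [prev - cur for prev, cur in zip(recent, recent[1:])]
    return all(delta < min_delta for delta in improvements)

still_improving = [5.2, 4.1, 3.6, 3.3, 3.1]
flattened = [5.2, 4.1, 3.6, 3.595, 3.592, 3.590]

print(has_plateaued(still_improving))
print(has_plateaued(flattened))
```

A curve that has not plateaued suggests that more epochs may still reduce the training loss.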

In [None]:
Loss_Comprehensive = pd.read_csv(WORKING_DIR + 'Loss_Comprehensive.csv')
fig = px.line(Loss_Comprehensive,x = 'Epoch',y = 'Loss' , color='Algorithm' , title='Loss VS Epoch')
fig.show()

# **6) Inference Pipeline**
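The `predict_caption` function in this section decodes greedily: at each step it keeps only the single most probable next word and stops at `endseq` or at the maximum length. The sketch below shows the same loop on a made-up lookup table that stands in for the model; the table and the `greedy_decode` helper are hypothetical.

```python
# Hypothetical next-word probabilities keyed by the previous word; in the
# notebook this role is played by model.predict on image + padded sequence.
toy_next_word = {
    "startseq": {"dog": 0.6, "cat": 0.4},
    "dog": {"runs": 0.7, "endseq": 0.3},
    "runs": {"endseq": 0.9, "dog": 0.1},
}

def greedy_decode(next_word_probs, max_length):
    """Repeatedly append the most probable next word until endseq."""
    words = ["startseq"]
    for _ in range(max_length):
        candidates = next_word_probs.get(words[-1])
        if not candidates:
            break  # no prediction available for this word
        best = max(candidates, key=candidates.get)
        words.append(best)
        if best == "endseq":
            break
    return " ".join(words)

toy_caption = greedy_decode(toy_next_word, max_length=10)
print(toy_caption)
```

Greedy decoding is fast but can miss captions whose first word is less probable yet leads to a better overall sentence; beam search is a common alternative.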

In [None]:
from nltk.translate.bleu_score import corpus_bleu
# validate with test data
def idx_to_word(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

# generate caption for an image
def predict_caption(model, image, tokenizer, max_length):
    # add start tag for generation process
    in_text = 'startseq'
    # iterate over the max length of sequence
    for i in range(max_length):
        # encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad the sequence
        sequence = pad_sequences([sequence], max_length)
        # predict next word
        yhat = model.predict([image, sequence], verbose=0)
        # get index with high probability
        yhat = np.argmax(yhat)
        # convert index to word
        word = idx_to_word(yhat, tokenizer)
        # stop if word not found
        if word is None:
            break
        # append word as input for generating next word
        in_text += " " + word
        # stop if we reach end tag
        if word == 'endseq':
            break
    return in_text

def calculate_bleu(model,test,tokenizer,max_length):
  actual, predicted = list(), list()
  for key in tqdm(test):
      # get actual caption
      captions = mapping[key]
      # predict the caption for image
      y_pred = predict_caption(model, features[key], tokenizer, max_length) 
      # split into words
      actual_captions = [caption.split() for caption in captions]
      y_pred = y_pred.split()
      # append to the list
      actual.append(actual_captions)
      predicted.append(y_pred)
  # calculate BLEU score
  print("BLEU-1: %f" % corpus_bleu(actual, predicted, weights=(1.0, 0, 0, 0)))
  print("BLEU-2: %f" % corpus_bleu(actual, predicted, weights=(0.5, 0.5, 0, 0)))
  print("BLEU-3: %f" % corpus_bleu(actual, predicted, weights=(0.33, 0.33, 0.33, 0)))
  print("BLEU-4: %f" % corpus_bleu(actual, predicted, weights=(0.25, 0.25, 0.25, 0.25)))

def generate_caption(model,image_name):
    image_id = image_name.split('.')[0]
    img_path = os.path.join(BASE_DIR, "Images", image_name)
    image = Image.open(img_path)
    captions = mapping[image_id]
    print('---------------------Actual---------------------')
    for caption in captions:
        print(caption)
    # predict the caption
    y_pred = predict_caption(model, features[image_id], tokenizer, max_length)
    print('--------------------Predicted--------------------')
    print(y_pred)
    plt.imshow(image)

In [None]:
from tensorflow import keras
GRU_Rmsprop = keras.models.load_model(WORKING_DIR + 'VGG19_GRU_Rmsprop/GRU_Rmsprop.h5')
GRU_Adam = keras.models.load_model(WORKING_DIR + 'VGG19_GRU_Adam/GRU_Adam.h5')
LSTM_Adagrad = keras.models.load_model(WORKING_DIR + 'VGG19_LSTM_Adagrad/LSTM_Adagrad.h5')
LSTM_Adam = keras.models.load_model(WORKING_DIR + 'VGG19_LSTM_Adam/LSTM_Adam.h5')
print("All Models Loaded")

In [None]:
generate_caption(GRU_Rmsprop,"1020651753_06077ec457.jpg")

In [None]:
generate_caption(LSTM_Adagrad,"1020651753_06077ec457.jpg")

In [None]:
generate_caption(GRU_Adam,"1020651753_06077ec457.jpg")

In [None]:
generate_caption(LSTM_Adam,"1020651753_06077ec457.jpg")

# 7) Conclusion

In this notebook we built an image captioning model. The following are some of our takeaways:

Model Variations:

1.   LSTM + Adam (Optimizer) Epochs 10/25
2.   LSTM + Adagrad (Optimizer) Epochs 10/25
3.   GRU + Adam (Optimizer) Epochs 10/25
4.   GRU + RMSProp (Optimizer) Epochs 10/25


We evaluated these variations using the BLEU score metric and found that:

1) LSTM + Adam performed well on 3 variations of the BLEU score (1-gram, 2-gram, 3-gram)

2) GRU + RMSProp performed well on all 4 variations of the BLEU score (1-gram, 2-gram, 3-gram, 4-gram)
