# Image Caption Generator

The human brain can easily recognize an image and tell what it is about, but a computer cannot do the same unless it is trained for the task. Automatically describing the content of images in natural language is a fundamental and challenging task.
It has many practical benefits, ranging from aiding the visually impaired to automatically labelling the millions of images that are uploaded to the internet every day.

To build our image caption generator, we will merge CNN and RNN architectures.
Feature extraction from images is done with a CNN; we use the pre-trained Xception model.
An LSTM then uses the features produced by the CNN to generate a description of the image.

    

## Table of Contents
- [1 - Packages](#1)
- [2 - Getting and performing data cleaning](#2)
- [3 - Extracting the feature vector from all images](#3)
- [4 - Loading dataset for Training the model](#4)
- [5 - Tokenizing the vocabulary](#5)

<a name='1'></a>
## 1 - Packages ##

These are the fundamental packages required for the project.
- [numpy](https://www.numpy.org) is the fundamental package for scientific computing with Python.
- [keras](https://keras.io/) is a Python library for developing and evaluating deep learning models.
- [pickle](https://github.com/python/cpython/blob/3.9/Lib/pickle.py) is the module that implements binary protocols for serializing and de-serializing a Python object structure.
- [tqdm](https://tqdm.github.io/) shows the progress of loops.
- [PIL](http://www.pythonware.com/products/pil/) and [scipy](https://www.scipy.org/) are used here to test our model with our own picture at the end.
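
Before running the import cell, it can help to check which of these packages are importable in the current environment. The snippet below is a minimal sketch that uses only the standard library; the list `REQUIRED_MODULES` and the helper `missing_modules` are illustrative names, not part of the project. Note that `PIL` is the import name of the Pillow package.

```python
from importlib.util import find_spec

# Top-level import names used by the notebook's import cell (assumed list).
REQUIRED_MODULES = ["numpy", "PIL", "keras", "tqdm"]


def missing_modules(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Any name reported as missing has to be installed before the imports will succeed.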

In [7]:
import string
import numpy as np
from PIL import Image
import os
from pickle import dump, load
from keras.applications.xception import Xception, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import add
from keras.models import Model, load_model
from keras.layers import Input, Dense, LSTM, Embedding, Dropout
# notebook-friendly progress bar for the feature-extraction loop
from tqdm.notebook import tqdm

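If this cell fails with `ModuleNotFoundError: No module named 'keras'`, Keras is not installed in the active environment; install it before continuing.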

<a name='2'></a>
## 2 - Getting and performing data cleaning ##
The main text file, which contains all image captions, is Flickr8k.token.txt in our Flickr8k_text folder.
Each line of the file holds an image name and one caption separated by a tab (“\t”), and lines are separated by a newline (“\n”).
Each image has 5 captions, and a number from #0 to #4 is appended to the image name to identify each caption.
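
To make the file layout concrete, here is a small sketch that parses two made-up lines in that format. The image name and captions are invented for illustration; only the `name#index<TAB>caption` layout comes from the description above.

```python
# Two made-up lines in the same layout as the caption file:
# "<image name>#<caption index>\t<caption>"
sample = (
    "example_1.jpg#0\tA dog runs across the grass .\n"
    "example_1.jpg#1\tA brown dog is running outside .\n"
)

captions_by_image = {}
for line in sample.splitlines():
    image_id, caption = line.split("\t", 1)
    # Drop the "#<index>" suffix so that all captions share one key.
    image_name = image_id.rsplit("#", 1)[0]
    captions_by_image.setdefault(image_name, []).append(caption)

print(captions_by_image)
```

Both captions end up in one list under the key `example_1.jpg`, which is the shape of the descriptions dictionary used in the rest of this section.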

We will define 6 functions:

**load_doc( filename )** – Loads the document file and reads its contents into a string.

**all_img_captions( filename )** – Creates a descriptions dictionary that maps each image to a list of its 5 captions.

**cleaning_text( descriptions )** – Takes all descriptions and cleans them: we remove punctuation, convert all text to lowercase, and remove words that contain numbers.

**text_vocabulary( descriptions )** – Collects all the unique words and builds the vocabulary from all the descriptions.

**save_descriptions( descriptions, filename )** – Creates a list of all the preprocessed descriptions and stores them in a file. We will create a descriptions.txt file to store all the captions.

**data_cleaning( dataset_text )** – Loads the caption data, cleans it, saves it to a file, and returns the vocabulary.

### 1.load_doc
Loads the document file and reads its contents into a string.

In [3]:
# Load a text file into memory

def load_doc(filename):
    
    """
    Arguments:
        filename -- path of the text file to read.
        
    Returns:
        text -- a string containing all the text inside the file.
    """
    
    # Open the file read-only; the with-block closes it automatically
    with open(filename, 'r') as file:
        text = file.read()
    return text


### 2.all_img_captions
This function will create a descriptions dictionary that maps images with a list of 5 captions. 

In [4]:
# Get all images with their captions

def all_img_captions(filename):
    
    """
    Arguments:
        filename -- file containing all the captions.
        
    Returns:
        descriptions -- a dictionary that maps each image name to its list of captions.
    """
    
    file = load_doc(filename)
    captions = file.split('\n')
    descriptions = {}
    for caption in captions:
        # skip blank lines, such as the one after the final newline
        if not caption.strip():
            continue
        img, caption = caption.split('\t')
        # img[:-2] drops the "#<index>" suffix from the image name
        if img[:-2] not in descriptions:
            descriptions[img[:-2]] = [ caption ]
        else:
            descriptions[img[:-2]].append(caption)
    return descriptions


### 3.cleaning_text
This function takes all descriptions and cleans them. We remove punctuation, convert all text to lowercase, and remove words that contain numbers.

In [5]:
# Data cleaning: lowercasing, removing punctuation and words containing numbers

def cleaning_text( captions ):
    
    """
    Arguments:
        captions -- a dictionary that maps each image name to its list of captions.
        
    Returns:
        captions -- cleaned version of the given dictionary (lowercased, with punctuation and words containing numbers removed).
    """
    
    table = str.maketrans('','',string.punctuation)
    for img,caps in captions.items():
        for i,img_caption in enumerate(caps):

            #str.replace returns a new string, so the result must be assigned
            img_caption = img_caption.replace("-"," ")
            desc = img_caption.split()

            #convert to lowercase
            desc = [word.lower() for word in desc]
            #remove punctuation from each token
            desc = [word.translate(table) for word in desc]
            #remove hanging 's and a 
            desc = [word for word in desc if(len(word)>1)]
            #remove tokens with numbers in them
            desc = [word for word in desc if(word.isalpha())]
            #convert back to string

            img_caption = ' '.join(desc)
            captions[img][i]= img_caption
    return captions
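
The effect of these cleaning steps is easiest to see on a single caption. The sketch below applies the same sequence of steps to one made-up sentence; `clean_caption` is an illustrative helper, not a function from the project.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_caption(caption):
    """Apply the cleaning steps described above to a single caption."""
    # Split hyphenated words, then lowercase and strip punctuation per token.
    words = caption.replace("-", " ").split()
    words = [w.lower().translate(_PUNCT_TABLE) for w in words]
    # Keep alphabetic tokens longer than one character.
    words = [w for w in words if len(w) > 1 and w.isalpha()]
    return " ".join(words)


print(clean_caption("A black-and-white dog's 2 puppies play ."))
# prints: black and white dogs puppies play
```

The single-letter word "A", the number "2", and the lone full stop are dropped, and the apostrophe in "dog's" is removed along with the other punctuation.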


### 4.text_vocabulary
This is a simple function that will separate all the unique words and create the vocabulary from all the descriptions.

In [6]:
# Build the vocabulary of all unique words

def text_vocabulary(descriptions):
    
    """
    Arguments:
        descriptions -- cleaned version of given dictionary.
        
    Returns:
        vocab -- set containing all unique words in the descriptions.
    """
    
    vocab = set()

    for key in descriptions.keys():
        for d in descriptions[key]:
            # add every word of this caption to the set
            vocab.update(d.split())

    return vocab
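
As a quick illustration of what the vocabulary looks like, the following sketch builds it from a toy descriptions dictionary. The image names and captions are made up for the example.

```python
toy_descriptions = {
    "example_1.jpg": ["dog runs on grass", "brown dog is running"],
    "example_2.jpg": ["child plays on grass"],
}

vocab = set()
for caption_list in toy_descriptions.values():
    for caption in caption_list:
        # A set keeps each word once, however often it occurs.
        vocab.update(caption.split())

print(len(vocab), sorted(vocab))
```

The three captions contain 12 words in total, but only 9 distinct ones, because "dog", "on", and "grass" each occur twice.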


### 5.save_descriptions
This function will create a list of all the descriptions that have been preprocessed and store them in a file. We will create a descriptions.txt file to store all the captions.

In [7]:
# Save all descriptions in one file
def save_descriptions(descriptions, filename):
    
    """
    Arguments:
        descriptions -- cleaned version of given dictionary.
        filename -- path of the output file.
        
    Returns:
        Nothing
    """
    
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + '\t' + desc )
    data = "\n".join(lines)
    # the with-block closes the file even if writing fails
    with open(filename,"w") as file:
        file.write(data)
    

### 6.data_cleaning
This function:

- gets the caption file, cleans it, and saves it to the descriptions.txt file.
- builds the vocabulary from those descriptions and returns it.

In [8]:
# Load the caption data, clean it, and dump it into a file.

def data_cleaning( dataset_text ):
    
    """
    Arguments:
        dataset_text -- path to the dataset's text folder. 
        
    Returns:
        vocabulary -- set containing all unique words in the descriptions.
    """

    #we prepare our text data
    #Flickr8k.token.txt is the caption file described at the start of this section
    filename = dataset_text + "/" + "Flickr8k.token.txt"
    
    #loading the file that contains all data
    #mapping them into descriptions dictionary img to 5 captions
    descriptions = all_img_captions(filename)
    print("Length of descriptions =" ,len(descriptions))

    #cleaning the descriptions
    clean_descriptions = cleaning_text(descriptions)

    #building vocabulary 
    vocabulary = text_vocabulary(clean_descriptions)
    print("Length of vocabulary = ", len(vocabulary))

    #saving each description to file 
    save_descriptions(clean_descriptions, "descriptions.txt")
    
    return vocabulary
    

<a name='3'></a>
## 3 - Extracting the feature vector from all images ##
We use a pre-trained model that has already been trained on a large dataset and reuse the features it extracts for our task. We are using the Xception model, which was trained on the ImageNet dataset with 1000 different classes. We can import this model directly from keras.applications. Make sure you are connected to the internet, as the weights are downloaded automatically. Since the Xception model was originally built for ImageNet classification, we make small changes to integrate it with our model. Note that the Xception model takes images of size 299 x 299 x 3 as input. We remove the last classification layer and get a 2048-dimensional feature vector.

The function
**extract_features( image_directory )** will extract features for all images, and we will map image names to their respective feature arrays.

### 1.extract_features
The function will extract features for all images and we will map image names with their respective feature array.

In [9]:
# Extract features for all images

def extract_features( image_directory ):
    
    """
    Arguments:
        image_directory -- path to the dataset's image folder. 
        
    Returns:
        features -- map containing image name and corresponding feature array.
    """
    
    # Xception without its classification head; average pooling yields a 2048-value vector
    model = Xception( include_top=False, pooling='avg' )
    features = {}
    for img in tqdm(os.listdir(image_directory)):
        filename = os.path.join(image_directory, img)
        # force 3 channels so grayscale or RGBA files match the model input
        image = Image.open(filename).convert('RGB')
        image = image.resize((299,299))
        # add a batch dimension: (299, 299, 3) -> (1, 299, 299, 3)
        image = np.expand_dims(image, axis=0)
        # scale pixel values from [0, 255] to [-1, 1]
        #image = preprocess_input(image)
        image = image/127.5
        image = image - 1.0

        feature = model.predict(image)
        features[img] = feature
    return features
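
The pixel arithmetic inside `extract_features` rescales 8-bit pixel values from the range [0, 255] to [-1, 1], which, as far as we know, is the same scaling that the commented-out `preprocess_input` applies for Xception. The sketch below shows the mapping on plain numbers; `scale_pixel` is an illustrative helper, not part of the project.

```python
def scale_pixel(value):
    """Map an 8-bit pixel value in [0, 255] to the range [-1.0, 1.0]."""
    return value / 127.5 - 1.0


for value in (0, 127.5, 255):
    print(value, "->", scale_pixel(value))
# prints:
# 0 -> -1.0
# 127.5 -> 0.0
# 255 -> 1.0
```

In the notebook the same two operations are applied to the whole image array at once, so every pixel in every channel is scaled in a single step.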


<a name='4'></a>
## 4 - Loading dataset for Training the model ##
In our Flickr8k_text folder, we have the Flickr_8k.trainImages.txt file, which contains a list of 6000 image names that we will use for training.

For loading the training dataset, we need three more functions:

**load_photos( filename )** – Loads the text file into a string and returns the list of image names.

**load_clean_descriptions( filename, photos )** – Creates a dictionary that contains the captions for each photo from the list of photos.

**load_features( photos )** – Returns the dictionary of image names and their feature vectors, which we previously extracted with the Xception model.

### 1.load_photos
This will load the text file in a string and will return the list of image names.

In [None]:
# Load the data

def load_photos( filename ):
    
    """
    Arguments:
        filename -- filename of the text file that lists one image name per line. 
        
    Returns:
        photos -- list of image names.
    """
    
    file = load_doc(filename)
    # ignore blank lines, such as the one after the final newline
    photos = [line for line in file.split("\n") if line.strip()]
    return photos


### 2.load_clean_descriptions
This function will create a dictionary that contains captions for each photo from the list of photos. We also add the `<start>` and `<end>` identifiers to each caption so that our LSTM model can identify where a caption starts and ends.

In [None]:
# Load the cleaned descriptions

def load_clean_descriptions( filename, photos ): 
    
    """
    Arguments:
        filename -- filename of the text file that contains image names and their cleaned descriptions. 
        photos   -- list of image names.
        
    Returns:
        descriptions -- dictionary that contains captions for each photo from the list of photos.
    """
    
    file = load_doc(filename)
    # a set makes the membership test below fast for thousands of photos
    photos = set(photos)
    descriptions = {}
    for line in file.split("\n"):

        words = line.split()
        if len(words)<1 :
            continue

        image, image_caption = words[0], words[1:]

        if image in photos:
            if image not in descriptions:
                descriptions[image] = []
            desc = '<start> ' + " ".join(image_caption) + ' <end>'
            descriptions[image].append(desc)

    return descriptions
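
To see what the resulting dictionary looks like, the sketch below runs the same filtering and wrapping logic on a few in-memory lines instead of a file. The helper `wrap_captions` and the sample lines are made up for illustration.

```python
def wrap_captions(lines, photos):
    """Group 'image caption' lines for the chosen photos and add the markers."""
    wanted = set(photos)  # set membership is faster than scanning a list
    descriptions = {}
    for line in lines:
        words = line.split()
        if not words:
            continue  # skip blank lines
        image, caption_words = words[0], words[1:]
        if image in wanted:
            wrapped = "<start> " + " ".join(caption_words) + " <end>"
            descriptions.setdefault(image, []).append(wrapped)
    return descriptions


lines = [
    "example_1.jpg\tdog runs on grass",
    "example_2.jpg\tchild plays on grass",
    "",
]
print(wrap_captions(lines, ["example_1.jpg"]))
# prints: {'example_1.jpg': ['<start> dog runs on grass <end>']}
```

Only the requested photo is kept, and its caption is surrounded by the two markers.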


### 3.load_features
This function will give us the dictionary for image names and their feature vector which we have previously extracted from the Xception model.

In [None]:
# Load all features

def load_features( photos ):
    
    """
    Arguments:
        photos   -- list of image names.
        
    Returns:
        features -- dictionary of image names and their feature vectors.
    """
    
    # the with-block closes the pickle file after loading
    with open("features.p","rb") as f:
        all_features = load(f)
    #selecting only needed features
    features = {k:all_features[k] for k in photos}
    return features
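
The dictionary comprehension in `load_features` raises a `KeyError` if a requested photo has no stored features. The sketch below shows the pickle round trip with made-up feature values and one possible way to separate available from missing photos; the file is written to a temporary directory, and the short lists stand in for the real 2048-value arrays.

```python
import os
import pickle
import tempfile

# Stand-in for the real image-name -> feature-array mapping.
all_features = {"example_1.jpg": [0.1, 0.2], "example_2.jpg": [0.3, 0.4]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "features.p")
    with open(path, "wb") as handle:
        pickle.dump(all_features, handle)
    with open(path, "rb") as handle:
        loaded = pickle.load(handle)

photos = ["example_1.jpg", "missing.jpg"]
# Keep only requested photos that actually have stored features.
selected = {name: loaded[name] for name in photos if name in loaded}
missing = [name for name in photos if name not in loaded]
print(selected, missing)
```

Whether to skip missing photos silently or stop with an error is a design choice; the original function takes the second route.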


<a name='5'></a>
## 5 - Tokenizing the vocabulary ##
Computers don’t understand English words, so we have to represent the words with numbers. We map each word of the vocabulary to a unique index value. The Keras library provides the Tokenizer class, which we use to create tokens from our vocabulary and save them to a “tokenizer.p” pickle file.

Functions used are:

**dict_to_list( descriptions )** – Converts the dictionary to a flat list of descriptions.

**create_tokenizer( descriptions )** – Creates the tokenizer object, fits it on the text, and returns the object.


### 1.dict_to_list
Converts the dictionary to a flat list of descriptions.

In [1]:
# Convert the dictionary to a flat list of descriptions

def dict_to_list( descriptions ):
    
    """
    Arguments:
        descriptions -- dictionary that contains captions for each photo from the list of photos.
        
    Returns:
        all_desc -- list of descriptions.
    """
    
    all_desc = []
    for key in descriptions.keys():
        # extend adds every caption of this image to the flat list
        all_desc.extend(descriptions[key])
    return all_desc


### 2.create_tokenizer
Creates the tokenizer object, fits it on the text, and returns the object.

In [None]:
#creating tokenizer class 
#this will vectorise text corpus
#each integer will represent token in dictionary

def create_tokenizer( descriptions ):
    
    """
    Arguments:
        descriptions -- dictionary that contains captions for each photo from the list of photos.
        
    Returns:
        tokenizer -- tokenizer object.
    """
    
    desc_list = dict_to_list(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(desc_list)
    return tokenizer
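
To make the word-to-index mapping less abstract, here is a simplified, pure-Python sketch of the idea. It is not the Keras implementation: `build_word_index` is an illustrative helper that only mimics, under our assumptions, the default lowercasing, the default punctuation filtering, and the frequency-ordered indices that start at 1.

```python
from collections import Counter

# Characters assumed to be filtered out, mirroring the default `filters`
# argument of the Keras Tokenizer as far as this sketch is concerned.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_TO_SPACE = str.maketrans({ch: " " for ch in FILTERS})


def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(_TO_SPACE).split())
    # Stable sort: words with equal counts keep their first-seen order.
    ranked = sorted(counts, key=lambda word: -counts[word])
    return {word: index for index, word in enumerate(ranked, start=1)}


captions = ["<start> dog runs <end>", "<start> dog sleeps <end>"]
word_index = build_word_index(captions)
vocab_size = len(word_index) + 1  # index 0 is left free for padding
print(word_index, vocab_size)
# prints: {'start': 1, 'dog': 2, 'end': 3, 'runs': 4, 'sleeps': 5} 6
```

Two details are worth noticing. Because `<` and `>` are among the filtered characters, the markers end up as the plain tokens `start` and `end`. And since no word receives index 0, the vocabulary size is the number of words plus one.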


# Model

In [None]:

# Set these paths according to the project folder on your system
# (raw strings keep the backslashes in Windows paths literal)
dataset_text = r"G:\image_caption\Flickr8k_text"
dataset_images = r"G:\image_caption\Flickr8k_Dataset"

#data cleaning
vocabulary = data_cleaning(dataset_text)

##########################################################################################

#2048 feature vector
features = extract_features(dataset_images)
with open("features.p","wb") as f:
    dump(features, f)

with open("features.p","rb") as f:
    features = load(f)

##########################################################################################

filename = dataset_text + "/" + "Flickr_8k.trainImages.txt"

#train = loading_data(filename)
train_imgs = load_photos(filename)
train_descriptions = load_clean_descriptions("descriptions.txt", train_imgs)
train_features = load_features(train_imgs)
print("step 4 done ")

##########################################################################################
# give each word an index, and store that into tokenizer.p pickle file
tokenizer = create_tokenizer(train_descriptions)
with open('tokenizer.p', 'wb') as f:
    dump(tokenizer, f)
# +1 because index 0 is reserved and never assigned to a word
vocab_size = len(tokenizer.word_index) + 1
print("vocab_size =", vocab_size)
print("step 5 done")