# Model - Object Captioning

Here I will run a model to generate captions for objects. This will be done in a few steps.  
(The main structure of the steps and the dataset come from https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/)

1. Extract features from a pretrained object-recognition network.  
2. Using the Flickr8k dataset with its captions, train a CNN/RNN model.
3. Art captioning: for this I'll try a couple of different methods
    - test artworks on a model trained for object captioning
    - add art with descriptions to the CNN/RNN
    - add artworks to the CNN
    - add less concrete object images (e.g. clouds) to the CNN
    - train more abstract labeling in the RNN
    
Additionally, some evaluation steps to look at.
1. Check which types of objects are classified better or worse
2. Try semantic projection (abstract vs. concrete) to score the level of abstraction of each word, and run statistical tests on how the model performance differs by level of abstraction
3. 
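
Evaluation step 2 could be sketched roughly as follows. This is only an illustration of the projection idea, assuming word vectors are available as NumPy arrays. The three-dimensional vectors, the seed word lists, and the helper names `abstraction_axis` and `abstraction_score` are all made up for the example and do not come from any trained embedding.

```python
import numpy as np

# Hypothetical toy embeddings; real ones would come from a pretrained model.
toy_vectors = {
    'idea':    np.array([0.9, 0.1, 0.0]),
    'freedom': np.array([0.8, 0.2, 0.1]),
    'dog':     np.array([0.1, 0.9, 0.0]),
    'table':   np.array([0.2, 0.8, 0.1]),
    'cloud':   np.array([0.5, 0.5, 0.2]),
}

def abstraction_axis(vectors, abstract_words, concrete_words):
    '''Unit vector pointing from the concrete centroid to the abstract centroid.'''
    abstract_mean = np.mean([vectors[w] for w in abstract_words], axis=0)
    concrete_mean = np.mean([vectors[w] for w in concrete_words], axis=0)
    axis = abstract_mean - concrete_mean
    return axis / np.linalg.norm(axis)

def abstraction_score(vectors, word, axis):
    '''Scalar projection of a word vector onto the axis (higher = more abstract).'''
    return float(np.dot(vectors[word], axis))

axis = abstraction_axis(toy_vectors, ['idea', 'freedom'], ['dog', 'table'])
scores = {w: abstraction_score(toy_vectors, w, axis) for w in toy_vectors}

# Words sorted from most abstract to most concrete under this toy axis.
print(sorted(scores, key=scores.get, reverse=True))
```

With real embeddings, the per-word scores could then be compared against per-word caption accuracy in a statistical test.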

In [218]:
from tensorflow.keras.applications import NASNetLarge 
from tensorflow.keras.applications.nasnet import preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

from tensorflow.keras.models import Model

import pickle

import os

import numpy as np

In [9]:
flicker_img_dir = 'IMAGES/Flicker/Flicker8k_Dataset'
flicker_text_dir = 'IMAGES/Flicker/labels'

## Feature Extraction
First, I extract features from the Flickr8k dataset using the NASNet network.

In [56]:
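def feature_extractor(dir_, network):
    ''' 
    iterate through the .jpg files in dir_ 
    and get features by running them through network
    return a dictionary with image id as a key
    '''
    # include_top=False drops the classification head and keeps the convolutional features
    model = network(include_top = False)
    fnames = [x for x in os.listdir(dir_) if x.endswith('.jpg')]
    result = {}
    n = len(fnames)
    
    for i, fn in enumerate(fnames, start = 1):
        img = load_img(os.path.join(dir_, fn), target_size = (model.input.shape[1], model.input.shape[2]))
        img = img_to_array(img)          # PIL image -> float array (height, width, channels)
        img = np.expand_dims(img, 0)     # add a batch dimension
        img = preprocess_input(img)      # scaling expected by the NASNet models
        feature = model.predict(img)
        ind = os.path.splitext(fn)[0]    # image id = file name without the extension
        result[ind] = feature
        print(f'{i}/{n} feature extraction completed')
    return result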

In [None]:
features = feature_extractor(flicker_img_dir, NASNetLarge)

In [58]:
# saving the extracted features
with open('PKL/features.pkl', 'wb') as fp:
    pickle.dump(features, fp)

In [190]:
# loading
with open('PKL/features.pkl', 'rb') as fp:
    features = pickle.load(fp)

## Preprocessing Description
Now I will clean up the descriptions. 

In [194]:
# read the description file
with open(f'{flicker_text_dir}/Flickr8k.token.txt', 'r') as fn:
    text = fn.readlines()

I'll create a dictionary with the image id as the key and the list of descriptions associated with that id as the value.  
While I'm at it, I'll also lowercase the text and remove punctuation, digits, and single-character tokens.

In [195]:
# extract only image id and description
import re
pattern = r'([0-9a-z_]*)\.jpg.*\t(.*)\n' # raw string avoids the invalid '\.' escape
p = re.compile(pattern)
descriptions_pairs = [p.findall(x)[0] for x in text]
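
As a quick check of the parsing step, here is the same kind of pattern applied to a single line. The sample line is made up to follow the `<image id>.jpg#<n>`, tab, caption layout that the regular expression assumes.

```python
import re

# Raw string so that \. and \t reach the regex engine unchanged.
pattern = re.compile(r'([0-9a-z_]*)\.jpg.*\t(.*)\n')

# Illustrative line in the "<image id>.jpg#<n><TAB><caption>" layout.
sample_line = '1000268201_693b08cb0e.jpg#0\tA child in a pink dress .\n'

# findall returns a list of (image id, caption) tuples; one line gives one tuple.
image_id, caption = pattern.findall(sample_line)[0]
print(image_id)  # 1000268201_693b08cb0e
print(caption)   # A child in a pink dress .
```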

In [307]:
import string

descriptions = {}

table_ = str.maketrans('', '', string.punctuation+string.digits)

for ind, caption in descriptions_pairs:
    caption = caption.lower()
    caption = caption.translate(table_) # remove punctuation and digits
    words = [x for x in caption.split() if len(x) > 1] # remove single-character tokens
    caption = 'seqini ' + ' '.join(words) + ' seqfin' # add start and end tokens
    descriptions.setdefault(ind, []).append(caption)
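
To see what the cleaning does to one caption, here is a small self-contained sketch. The function name `clean_caption` and the example sentence are made up for illustration; the steps follow the cell above.

```python
import string

# Translation table that deletes punctuation and digits.
table_ = str.maketrans('', '', string.punctuation + string.digits)

def clean_caption(caption):
    '''Lowercase, strip punctuation/digits, drop 1-character tokens, add markers.'''
    words = caption.lower().translate(table_).split()
    words = [w for w in words if len(w) > 1]
    return 'seqini ' + ' '.join(words) + ' seqfin'

print(clean_caption('A dog chases 2 balls in the park .'))
# seqini dog chases balls in the park seqfin
```

The `seqini` and `seqfin` markers let the model learn where a caption starts and ends.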

In [309]:
# saving the description keys
with open('PKL/descriptions.pkl', 'wb') as fp:
    pickle.dump(descriptions, fp)

In [188]:
# loading
with open('PKL/descriptions.pkl', 'rb') as fp:
    descriptions = pickle.load(fp)

In [337]:
total_vocabulary_counts = len(set(' '.join(np.concatenate(list(descriptions.values()))).split()))
total_vocabulary_counts

8767

### Train/Test/Val Split
Now I'll split the list of Flickr image ids into train/test sets, then create a function that filters the dictionary of descriptions by a given list of image ids.

In [338]:
from sklearn.model_selection import train_test_split
train_list, test_list = train_test_split(list(descriptions.keys()), test_size = 0.3)

## Generating Inputs
Now that we have our dictionaries of features and descriptions, we need to create the input and output series.  
We need two separate inputs: the image features and the description as a sequence. The output is the next word in the sequence.  
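
The "next word" setup can be illustrated on a single caption. This sketch uses words instead of integer ids to keep it readable; the helper name `prefix_target_pairs` and the short caption are made up for the example.

```python
def prefix_target_pairs(tokens):
    '''Return (prefix, next_token) pairs for one tokenized caption.'''
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

caption = 'seqini dog runs seqfin'.split()
for prefix, target in prefix_target_pairs(caption):
    print(prefix, '->', target)
# ['seqini'] -> dog
# ['seqini', 'dog'] -> runs
# ['seqini', 'dog', 'runs'] -> seqfin
```

A caption of n tokens therefore yields n - 1 training samples, each paired with the same image.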

In [425]:
def get_keys(dict_):
    ''' Helper to return a list of keys '''
    return list(dict_.keys())

def get_vals(dict_):
    ''' Helper to return a list of values '''
    return list(dict_.values())

def breakdown_sequence(list_):
    ''' Helper to return a list of breakdown sequences and the output '''
    x, y = [], []
    for i in range(1, len(list_)):
        x.append(list_[:i])
        y.append(list_[i])
    return x, y

def sequence_process(select_dictionary, tokenizer):
    ''' Helper to process breakdown on all select dictionary with a fitted tokenizer '''
    X1, X2, Y = [], [], []

    for ind, texts in select_dictionary.items():
        sequences = tokenizer.texts_to_sequences(texts)
        for seq in sequences:
            x, y = breakdown_sequence(seq)
            X1.extend([ind] * len(y)) # repeat the image id once per (prefix, next word) pair
            X2.extend(x)
            Y.extend(y)

    return X1, X2, Y
    
    
class sequence_generator:
    def __init__(self, dictionary):
        ''' INPUT: a dictionary of descriptions '''
        self.dictionary = dictionary
        self.img_index = get_keys(self.dictionary)
        self.texts = get_vals(self.dictionary)

    def get_text(self, img_ind):
        ''' RETURN: a list of descriptions given an id '''
        return self.dictionary[img_ind]

    def update_selection(self, list_):
        ''' Helper to update selector, dictionary, img, texts '''
        self.selector = list_
        selected = set(list_) # set for fast membership checks
        self.select_dictionary = {k: v for k, v in self.dictionary.items() if k in selected}
        self.select_img_inds = get_keys(self.select_dictionary)
        self.select_texts = get_vals(self.select_dictionary)

    def train_generator(self, train_list):
        '''
        INPUT a list of training ids, 
        RETURN image inputs, text inputs, and outputs
        ASSIGN tokenizer, max_length and vocab size
        '''
        self.update_selection(train_list)

        # fit the tokenizer on the training captions only
        self.tokenizer = Tokenizer()
        self.tokenizer.fit_on_texts(np.concatenate(self.select_texts))
        self.num_vocab = len(self.tokenizer.word_index)+1 # +1 because index 0 is reserved for padding

        X1, X2, Y = sequence_process(self.select_dictionary, self.tokenizer)

        X2 = pad_sequences(X2)
        self.max_length = X2.shape[1]
    
        Y = to_categorical(Y, self.num_vocab)
        return X1, X2, Y

    def validation_generator(self, val_list):
        '''
        INPUT a list of validation ids, 
        RETURN image inputs, text inputs and outputs
        (uses the tokenizer, max_length and vocab size set by train_generator)
        '''
        self.update_selection(val_list)

        X1, X2, Y = sequence_process(self.select_dictionary, self.tokenizer)
        X2 = pad_sequences(X2, maxlen = self.max_length)
        Y = to_categorical(Y, self.num_vocab)
        return X1, X2, Y
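
To make the shapes concrete, here is a NumPy-only sketch of what the padding and one-hot steps produce. It mimics the default behaviour of `pad_sequences` (zeros added on the left, overlong sequences cut from the front) and of `to_categorical`. The helper names `left_pad` and `one_hot` and the toy ids are made up for illustration.

```python
import numpy as np

def left_pad(sequences, maxlen=None):
    '''Left-pad integer sequences with 0, keeping only the last maxlen items.'''
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    out = np.zeros((len(sequences), maxlen), dtype='int32')
    for row, seq in enumerate(sequences):
        seq = seq[-maxlen:]                      # cut from the front if too long
        out[row, maxlen - len(seq):] = seq       # right-align the sequence
    return out

def one_hot(ids, num_classes):
    '''One row per id, with a single 1.0 in the column given by the id.'''
    out = np.zeros((len(ids), num_classes), dtype='float32')
    out[np.arange(len(ids)), ids] = 1.0
    return out

X2 = left_pad([[1], [1, 4], [1, 4, 2]])
Y = one_hot([4, 2, 3], num_classes=5)
print(X2)
# [[0 0 1]
#  [0 1 4]
#  [1 4 2]]
print(Y.shape)  # (3, 5)
```

Index 0 is never assigned to a word by the tokenizer, which is why it is safe to use as the padding value.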

In [426]:
processor = sequence_generator(descriptions)

In [427]:
train_X1, train_X2, train_Y = processor.train_generator(train_list[0:3])

In [433]:
# use validation_generator so the tokenizer fitted on the training set is reused
val_X1, val_X2, val_Y = processor.validation_generator(test_list[0:3])
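
Note that `X1` holds image ids rather than feature arrays. Before fitting a model, each id would presumably be swapped for its extracted feature. Here is a minimal sketch, assuming `features` maps each id to an array with a leading batch dimension of 1 (as `model.predict` returns for a single image). The toy ids, the small feature shape, and the helper name `ids_to_features` are invented for the example.

```python
import numpy as np

# Toy stand-in for the pickled `features` dictionary.
toy_features = {
    'img_a': np.ones((1, 2, 2, 3), dtype='float32'),
    'img_b': np.zeros((1, 2, 2, 3), dtype='float32'),
}

def ids_to_features(image_ids, features):
    '''Stack per-image features in the order given by image_ids.'''
    # Each entry has shape (1, ...); concatenating on axis 0 gives (n, ...).
    return np.concatenate([features[i] for i in image_ids], axis=0)

X1_ids = ['img_a', 'img_a', 'img_b']
X1_array = ids_to_features(X1_ids, toy_features)
print(X1_array.shape)  # (3, 2, 2, 3)
```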