# Caption Generation
Caption generation is a challenging aritifical intelligence problem. Given a picture a textual description of the picture needs to be generated. For instance, for [this](https://github.com/rthothad/mlblr/blob/master/CaptionGenerator/DenseNet-Py2/2513260012_03d33305cf.jpg) picture a description such as "A black dog is running after a white dog in the snow" should be generated. 

This needs two models to be combined to generate the required output. A computer vision model to understand the content of the image and a language model (NLP) to convert the understood content to words and those words should be in the right order. The advantage of applying deep learning to this problem is that a single end to end model can be defined to predict a caption without having to build sophisticated data preparation pipelines.

For the **computer vision model** a DenseNet model trained on the Imagenet dataset is used. For the **language model** a Recurrent Neural Networks more specifically Long Short Term Memory Network(LSTM) is used.

The computer vision and language models are structured in a **encoder-decoder architecture**. This is an architecture developed for machine translation where an input sequence, say in French, is encoded as a fixed-length vector by an encoder network. A separate decoder network then reads the encoding and generates an output sequence in the new language, say English. A benefit of this approach in addition to the impressive skill of the approach is that a single end-to-end model can be trained on the problem. When adapted for image captioning, the encoder network is a deep convolutional neural network, and the decoder network is a stack of LSTM layers.

The two dominant methods prior to end-to-end neural network models for generating image captions were template-based methods and nearest-neighbor-based methods and modifying existing captions.


The Flickr8K Dataset was used to train and test the model. The Flickr8K dataset is comprised of more than 8,000 photos and up to 5 captions for each photo. The dataset is available for free. One must complete a request [form](https://illinois.edu/fb/sec/1713398) and the links to the dataset will be emailed.

A Bleu Score of 0.500769703674 is achieved.

This network was trained on Python 2.7, Keras - 2.1.5 and Tensorflow - 1.7.0.

### Computer Vision Model
A DenseNet model is used to extract the features from the picture. The last classification layer is removed from the DenseNet model. Give an image input, this model will give us a fixed length encoding. 
The extracted features are an internal representation of the image, not something directly intelligible. A deep convolutional neural network, or CNN, is used as the feature extraction submodel. This network can be trained directly on the images in the image captioning dataset. Alternately, a pre-trained model, such as a state-of-the-art model used for image classification, can be used. It is popular to use top performing models in the **ImageNet** dataset developed for the ILSVRC challenge, such as the Densely Connected Convolutional Network model, called DenseNet.

This is the encoder layer in the endcoder-decoder architecture.

#### Imagenet
ImageNet is a research project to develop a large database of images with annotations, e.g. images and their descriptions. The images and their annotations have been the basis for an image classification challenge called the ImageNet Large Scale Visual Recognition Challenge or ILSVRC since 2010. The result is that research organizations battle it out on pre-defined datasets to see who has the best model for classifying the objects in images. 

The cell below defines the DenseNet architecture in Keras. Keras also has an implementation of DenseNet but when I tried to classify an elephant picture it misclassified, so I used an implementation from [here]([]https://github.com/titu1994/DenseNet)

In [1]:
from __future__ import print_function
from __future__ import absolute_import
from __future__ import division

import warnings

from keras.models import Model
from keras.layers.core import Dense, Dropout, Activation, Reshape
from keras.layers.convolutional import Conv2D, Conv2DTranspose, UpSampling2D
from keras.layers.pooling import AveragePooling2D, MaxPooling2D
from keras.layers.pooling import GlobalAveragePooling2D
from keras.layers import Input
from keras.layers.merge import concatenate
from keras.layers.normalization import BatchNormalization
from keras.regularizers import l2
from keras.utils.layer_utils import convert_all_kernels_in_model, convert_dense_weights_data_format
from keras.utils.data_utils import get_file
from keras.engine.topology import get_source_inputs
from keras.applications.imagenet_utils import _obtain_input_shape
from keras.applications.imagenet_utils import decode_predictions
import keras.backend as K

# from subpixel import SubPixelUpscaling

DENSENET_121_WEIGHTS_PATH = r'https://github.com/titu1994/DenseNet/releases/download/v3.0/DenseNet-BC-121-32.h5'
DENSENET_161_WEIGHTS_PATH = r'https://github.com/titu1994/DenseNet/releases/download/v3.0/DenseNet-BC-161-48.h5'
DENSENET_169_WEIGHTS_PATH = r'https://github.com/titu1994/DenseNet/releases/download/v3.0/DenseNet-BC-169-32.h5'
DENSENET_121_WEIGHTS_PATH_NO_TOP = r'https://github.com/titu1994/DenseNet/releases/download/v3.0/DenseNet-BC-121-32-no-top.h5'
DENSENET_161_WEIGHTS_PATH_NO_TOP = r'https://github.com/titu1994/DenseNet/releases/download/v3.0/DenseNet-BC-161-48-no-top.h5'
DENSENET_169_WEIGHTS_PATH_NO_TOP = r'https://github.com/titu1994/DenseNet/releases/download/v3.0/DenseNet-BC-169-32-no-top.h5'

def preprocess_input(x, data_format=None):
    """Preprocesses a tensor encoding a batch of images.
    # Arguments
        x: input Numpy tensor, 4D.
        data_format: data format of the image tensor.
    # Returns
        Preprocessed tensor.
    """
    if data_format is None:
        data_format = K.image_data_format()
    assert data_format in {'channels_last', 'channels_first'}

    if data_format == 'channels_first':
        if x.ndim == 3:
            # 'RGB'->'BGR'
            x = x[::-1, ...]
            # Zero-center by mean pixel
            x[0, :, :] -= 103.939
            x[1, :, :] -= 116.779
            x[2, :, :] -= 123.68
        else:
            x = x[:, ::-1, ...]
            x[:, 0, :, :] -= 103.939
            x[:, 1, :, :] -= 116.779
            x[:, 2, :, :] -= 123.68
    else:
        # 'RGB'->'BGR'
        x = x[..., ::-1]
        # Zero-center by mean pixel
        x[..., 0] -= 103.939
        x[..., 1] -= 116.779
        x[..., 2] -= 123.68

    x *= 0.017 # scale values

    return x

def __create_dense_net(nb_classes, img_input, include_top, depth=40, nb_dense_block=3, growth_rate=12, nb_filter=-1,
                       nb_layers_per_block=-1, bottleneck=False, reduction=0.0, dropout_rate=None, weight_decay=1e-4,
                       subsample_initial_block=False, activation='softmax'):
    ''' Build the DenseNet model
    Args:
        nb_classes: number of classes
        img_input: tuple of shape (channels, rows, columns) or (rows, columns, channels)
        include_top: flag to include the final Dense layer
        depth: number or layers
        nb_dense_block: number of dense blocks to add to end (generally = 3)
        growth_rate: number of filters to add per dense block
        nb_filter: initial number of filters. Default -1 indicates initial number of filters is 2 * growth_rate
        nb_layers_per_block: number of layers in each dense block.
                Can be a -1, positive integer or a list.
                If -1, calculates nb_layer_per_block from the depth of the network.
                If positive integer, a set number of layers per dense block.
                If list, nb_layer is used as provided. Note that list size must
                be (nb_dense_block + 1)
        bottleneck: add bottleneck blocks
        reduction: reduction factor of transition blocks. Note : reduction value is inverted to compute compression
        dropout_rate: dropout rate
        weight_decay: weight decay rate
        subsample_initial_block: Set to True to subsample the initial convolution and
                add a MaxPool2D before the dense blocks are added.
        subsample_initial:
        activation: Type of activation at the top layer. Can be one of 'softmax' or 'sigmoid'.
                Note that if sigmoid is used, classes must be 1.
    Returns: keras tensor with nb_layers of conv_block appended
    '''

    concat_axis = 1 if K.image_data_format() == 'channels_first' else -1

    if reduction != 0.0:
        assert reduction <= 1.0 and reduction > 0.0, 'reduction value must lie between 0.0 and 1.0'

    # layers in each dense block
    if type(nb_layers_per_block) is list or type(nb_layers_per_block) is tuple:
        nb_layers = list(nb_layers_per_block)  # Convert tuple to list

        assert len(nb_layers) == (nb_dense_block), 'If list, nb_layer is used as provided. ' \
                                                   'Note that list size must be (nb_dense_block)'
        final_nb_layer = nb_layers[-1]
        nb_layers = nb_layers[:-1]
    else:
        if nb_layers_per_block == -1:
            assert (depth - 4) % 3 == 0, 'Depth must be 3 N + 4 if nb_layers_per_block == -1'
            count = int((depth - 4) / 3)

            if bottleneck:
                count = count // 2

            nb_layers = [count for _ in range(nb_dense_block)]
            final_nb_layer = count
        else:
            final_nb_layer = nb_layers_per_block
            nb_layers = [nb_layers_per_block] * nb_dense_block

    # compute initial nb_filter if -1, else accept users initial nb_filter
    if nb_filter <= 0:
        nb_filter = 2 * growth_rate

    # compute compression factor
    compression = 1.0 - reduction

    # Initial convolution
    if subsample_initial_block:
        initial_kernel = (7, 7)
        initial_strides = (2, 2)
    else:
        initial_kernel = (3, 3)
        initial_strides = (1, 1)

    x = Conv2D(nb_filter, initial_kernel, kernel_initializer='he_normal', padding='same',
               strides=initial_strides, use_bias=False, kernel_regularizer=l2(weight_decay))(img_input)

    if subsample_initial_block:
        x = BatchNormalization(axis=concat_axis, epsilon=1.1e-5)(x)
        x = Activation('relu')(x)
        x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)

    # Add dense blocks
    for block_idx in range(nb_dense_block - 1):
        x, nb_filter = __dense_block(x, nb_layers[block_idx], nb_filter, growth_rate, bottleneck=bottleneck,
                                     dropout_rate=dropout_rate, weight_decay=weight_decay)
        # add transition_block
        x = __transition_block(x, nb_filter, compression=compression, weight_decay=weight_decay)
        nb_filter = int(nb_filter * compression)

    # The last dense_block does not have a transition_block
    x, nb_filter = __dense_block(x, final_nb_layer, nb_filter, growth_rate, bottleneck=bottleneck,
                                 dropout_rate=dropout_rate, weight_decay=weight_decay)

    x = BatchNormalization(axis=concat_axis, epsilon=1.1e-5)(x)
    x = Activation('relu')(x)
    x = GlobalAveragePooling2D()(x)

    if include_top:
        x = Dense(nb_classes, activation=activation)(x)

    return x
  

def __transition_block(ip, nb_filter, compression=1.0, weight_decay=1e-4):
    ''' Apply BatchNorm, Relu 1x1, Conv2D, optional compression, dropout and Maxpooling2D
    Args:
        ip: keras tensor
        nb_filter: number of filters
        compression: calculated as 1 - reduction. Reduces the number of feature maps
                    in the transition block.
        dropout_rate: dropout rate
        weight_decay: weight decay factor
    Returns: keras tensor, after applying batch_norm, relu-conv, dropout, maxpool
    '''
    concat_axis = 1 if K.image_data_format() == 'channels_first' else -1

    x = BatchNormalization(axis=concat_axis, epsilon=1.1e-5)(ip)
    x = Activation('relu')(x)
    x = Conv2D(int(nb_filter * compression), (1, 1), kernel_initializer='he_normal', padding='same', use_bias=False,
               kernel_regularizer=l2(weight_decay))(x)
    x = AveragePooling2D((2, 2), strides=(2, 2))(x)

    return x

def __dense_block(x, nb_layers, nb_filter, growth_rate, bottleneck=False, dropout_rate=None, weight_decay=1e-4,
                  grow_nb_filters=True, return_concat_list=False):
    ''' Build a dense_block where the output of each conv_block is fed to subsequent ones
    Args:
        x: keras tensor
        nb_layers: the number of layers of conv_block to append to the model.
        nb_filter: number of filters
        growth_rate: growth rate
        bottleneck: bottleneck block
        dropout_rate: dropout rate
        weight_decay: weight decay factor
        grow_nb_filters: flag to decide to allow number of filters to grow
        return_concat_list: return the list of feature maps along with the actual output
    Returns: keras tensor with nb_layers of conv_block appended
    '''
    concat_axis = 1 if K.image_data_format() == 'channels_first' else -1

    x_list = [x]

    for i in range(nb_layers):
        cb = __conv_block(x, growth_rate, bottleneck, dropout_rate, weight_decay)
        x_list.append(cb)

        x = concatenate([x, cb], axis=concat_axis)

        if grow_nb_filters:
            nb_filter += growth_rate

    if return_concat_list:
        return x, nb_filter, x_list
    else:
        return x, nb_filter

def __conv_block(ip, nb_filter, bottleneck=False, dropout_rate=None, weight_decay=1e-4):
  ''' Apply BatchNorm, Relu, 3x3 Conv2D, optional bottleneck block and dropout
  Args:
      ip: Input keras tensor
      nb_filter: number of filters
      bottleneck: add bottleneck block
      dropout_rate: dropout rate
      weight_decay: weight decay factor
  Returns: keras tensor with batch_norm, relu and convolution2d added (optional bottleneck)
  '''
  concat_axis = 1 if K.image_data_format() == 'channels_first' else -1

  x = BatchNormalization(axis=concat_axis, epsilon=1.1e-5)(ip)
  x = Activation('relu')(x)

  if bottleneck:
      inter_channel = nb_filter * 4  # Obtained from https://github.com/liuzhuang13/DenseNet/blob/master/densenet.lua

      x = Conv2D(inter_channel, (1, 1), kernel_initializer='he_normal', padding='same', use_bias=False,
                 kernel_regularizer=l2(weight_decay))(x)
      x = BatchNormalization(axis=concat_axis, epsilon=1.1e-5)(x)
      x = Activation('relu')(x)

  x = Conv2D(nb_filter, (3, 3), kernel_initializer='he_normal', padding='same', use_bias=False)(x)
  if dropout_rate:
      x = Dropout(dropout_rate)(x)

  return x

def DenseNet(input_shape=None, depth=40, nb_dense_block=3, growth_rate=12, nb_filter=-1, nb_layers_per_block=-1,
             bottleneck=False, reduction=0.0, dropout_rate=0.0, weight_decay=1e-4, subsample_initial_block=False,
             include_top=True, weights=None, input_tensor=None,
             classes=10, activation='softmax'):
    '''Instantiate the DenseNet architecture,
        optionally loading weights pre-trained
        on CIFAR-10. Note that when using TensorFlow,
        for best performance you should set
        `image_data_format='channels_last'` in your Keras config
        at ~/.keras/keras.json.
        The model and the weights are compatible with both
        TensorFlow and Theano. The dimension ordering
        convention used by the model is the one
        specified in your Keras config file.
        # Arguments
            input_shape: optional shape tuple, only to be specified
                if `include_top` is False (otherwise the input shape
                has to be `(32, 32, 3)` (with `channels_last` dim ordering)
                or `(3, 32, 32)` (with `channels_first` dim ordering).
                It should have exactly 3 inputs channels,
                and width and height should be no smaller than 8.
                E.g. `(200, 200, 3)` would be one valid value.
            depth: number or layers in the DenseNet
            nb_dense_block: number of dense blocks to add to end (generally = 3)
            growth_rate: number of filters to add per dense block
            nb_filter: initial number of filters. -1 indicates initial
                number of filters is 2 * growth_rate
            nb_layers_per_block: number of layers in each dense block.
                Can be a -1, positive integer or a list.
                If -1, calculates nb_layer_per_block from the network depth.
                If positive integer, a set number of layers per dense block.
                If list, nb_layer is used as provided. Note that list size must
                be (nb_dense_block + 1)
            bottleneck: flag to add bottleneck blocks in between dense blocks
            reduction: reduction factor of transition blocks.
                Note : reduction value is inverted to compute compression.
            dropout_rate: dropout rate
            weight_decay: weight decay rate
            subsample_initial_block: Set to True to subsample the initial convolution and
                add a MaxPool2D before the dense blocks are added.
            include_top: whether to include the fully-connected
                layer at the top of the network.
            weights: one of `None` (random initialization) or
                'imagenet' (pre-training on ImageNet)..
            input_tensor: optional Keras tensor (i.e. output of `layers.Input()`)
                to use as image input for the model.
            classes: optional number of classes to classify images
                into, only to be specified if `include_top` is True, and
                if no `weights` argument is specified.
            activation: Type of activation at the top layer. Can be one of 'softmax' or 'sigmoid'.
                Note that if sigmoid is used, classes must be 1.
        # Returns
            A Keras model instance.
        '''

    if weights not in {'imagenet', None}:
        raise ValueError('The `weights` argument should be either '
                         '`None` (random initialization) or `cifar10` '
                         '(pre-training on CIFAR-10).')

    if weights == 'imagenet' and include_top and classes != 1000:
        raise ValueError('If using `weights` as ImageNet with `include_top`'
                         ' as true, `classes` should be 1000')

    if activation not in ['softmax', 'sigmoid']:
        raise ValueError('activation must be one of "softmax" or "sigmoid"')

    if activation == 'sigmoid' and classes != 1:
        raise ValueError('sigmoid activation can only be used when classes = 1')

    # Determine proper input shape
    input_shape = _obtain_input_shape(input_shape,
                                      default_size=32,
                                      min_size=8,
                                      data_format=K.image_data_format(),
                                      require_flatten=include_top)

    if input_tensor is None:
        img_input = Input(shape=input_shape)
    else:
        if not K.is_keras_tensor(input_tensor):
            img_input = Input(tensor=input_tensor, shape=input_shape)
        else:
            img_input = input_tensor

    x = __create_dense_net(classes, img_input, include_top, depth, nb_dense_block,
                           growth_rate, nb_filter, nb_layers_per_block, bottleneck, reduction,
                           dropout_rate, weight_decay, subsample_initial_block, activation)

    # Ensure that the model takes into account
    # any potential predecessors of `input_tensor`.
    if input_tensor is not None:
        inputs = get_source_inputs(input_tensor)
    else:
        inputs = img_input
    # Create model.
    model = Model(inputs, x, name='densenet')

    # load weights
    if weights == 'imagenet':
        weights_loaded = False

        if (depth == 121) and (nb_dense_block == 4) and (growth_rate == 32) and (nb_filter == 64) and \
                (bottleneck is True) and (reduction == 0.5) and (dropout_rate == 0.0) and (subsample_initial_block):
            if include_top:
                weights_path = get_file('DenseNet-BC-121-32.h5',
                                        DENSENET_121_WEIGHTS_PATH,
                                        cache_subdir='models',
                                        md5_hash='a439dd41aa672aef6daba4ee1fd54abd')
            else:
                weights_path = get_file('DenseNet-BC-121-32-no-top.h5',
                                        DENSENET_121_WEIGHTS_PATH_NO_TOP,
                                        cache_subdir='models',
                                        md5_hash='55e62a6358af8a0af0eedf399b5aea99')
            model.load_weights(weights_path)
            weights_loaded = True

        if (depth == 161) and (nb_dense_block == 4) and (growth_rate == 48) and (nb_filter == 96) and \
                (bottleneck is True) and (reduction == 0.5) and (dropout_rate == 0.0) and (subsample_initial_block):
            if include_top:
                weights_path = get_file('DenseNet-BC-161-48.h5',
                                        DENSENET_161_WEIGHTS_PATH,
                                        cache_subdir='models',
                                        md5_hash='6c326cf4fbdb57d31eff04333a23fcca')
            else:
                weights_path = get_file('DenseNet-BC-161-48-no-top.h5',
                                        DENSENET_161_WEIGHTS_PATH_NO_TOP,
                                        cache_subdir='models',
                                        md5_hash='1a9476b79f6b7673acaa2769e6427b92')
            model.load_weights(weights_path)
            weights_loaded = True

        if (depth == 169) and (nb_dense_block == 4) and (growth_rate == 32) and (nb_filter == 64) and \
                (bottleneck is True) and (reduction == 0.5) and (dropout_rate == 0.0) and (subsample_initial_block):
            if include_top:
                weights_path = get_file('DenseNet-BC-169-32.h5',
                                        DENSENET_169_WEIGHTS_PATH,
                                        cache_subdir='models',
                                        md5_hash='914869c361303d2e39dec640b4e606a6')
            else:
                weights_path = get_file('DenseNet-BC-169-32-no-top.h5',
                                        DENSENET_169_WEIGHTS_PATH_NO_TOP,
                                        cache_subdir='models',
                                        md5_hash='89c19e8276cfd10585d5fadc1df6859e')
            model.load_weights(weights_path)
            weights_loaded = True

        if weights_loaded:
            if K.backend() == 'theano':
                convert_all_kernels_in_model(model)

            if K.image_data_format() == 'channels_first' and K.backend() == 'tensorflow':
                warnings.warn('You are using the TensorFlow backend, yet you '
                              'are using the Theano '
                              'image data format convention '
                              '(`image_data_format="channels_first"`). '
                              'For best performance, set '
                              '`image_data_format="channels_last"` in '
                              'your Keras config '
                              'at ~/.keras/keras.json.')

            print("Weights for the model were loaded successfully")

    return model


def DenseNetImageNet121(input_shape=None,
                        bottleneck=True,
                        reduction=0.5,
                        dropout_rate=0.0,
                        weight_decay=1e-4,
                        include_top=True,
                        weights='imagenet',
                        input_tensor=None,
                        classes=1000,
                        activation='softmax'):
    return DenseNet(input_shape, depth=121, nb_dense_block=4, growth_rate=32, nb_filter=64,
                    nb_layers_per_block=[6, 12, 24, 16], bottleneck=bottleneck, reduction=reduction,
                    dropout_rate=dropout_rate, weight_decay=weight_decay, subsample_initial_block=True,
                    include_top=include_top, weights=weights, input_tensor=input_tensor,
                    classes=classes, activation=activation)


  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


Tried using the Keras implementation of DenseNet to classify an elephant picture. The predictions were not correct.

In [4]:
import keras
import numpy as np
from keras.preprocessing import image
from keras.applications.imagenet_utils import decode_predictions

model = keras.applications.densenet.DenseNet121(include_top=True, weights='imagenet', input_tensor=None, input_shape=None, pooling=None, classes=1000)

img_path = 'elephant.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)

#x = preprocess_input(x)

preds = model.predict(x)

print('Predicted:', decode_predictions(preds))

('Predicted:', [[(u'n03180011', u'desktop_computer', 0.3106413), (u'n06359193', u'web_site', 0.23376717), (u'n03249569', u'drum', 0.1802558), (u'n04380533', u'table_lamp', 0.16848156), (u'n02105251', u'briard', 0.042605225)]])


In [3]:
import numpy as np
from keras.preprocessing import image
from keras.applications.imagenet_utils import decode_predictions
size = 224

#Load DenseNet model
model = DenseNetImageNet121(input_shape=(size, size, 3))
model.summary()


img_path = 'elephant.jpg'
#Load an image from file
img = image.load_img(img_path, target_size=(size, size))
#Convert the image pixels to a Numpy array
x = image.img_to_array(img)
#Reshape data for the model
x = np.expand_dims(x, axis=0)
#Prepare the image for the DenseNet model - the image pixels need to be prepared in the same way as the ImageNet training data
#was prepared.
x = preprocess_input(x)
#Predict the probability across all classes
preds = model.predict(x)
#Conver the probabilities to class labels and print those
print('Predicted:', decode_predictions(preds))

Weights for the model were loaded successfully
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            (None, 224, 224, 3)  0                                            
__________________________________________________________________________________________________
conv2d_121 (Conv2D)             (None, 112, 112, 64) 9408        input_2[0][0]                    
__________________________________________________________________________________________________
batch_normalization_122 (BatchN (None, 112, 112, 64) 256         conv2d_121[0][0]                 
__________________________________________________________________________________________________
activation_122 (Activation)     (None, 112, 112, 64) 0           batch_normalization_122[0][0]    
______________________________________________________________

__________________________________________________________________________________________________
activation_169 (Activation)     (None, 14, 14, 384)  0           batch_normalization_169[0][0]    
__________________________________________________________________________________________________
conv2d_168 (Conv2D)             (None, 14, 14, 128)  49152       activation_169[0][0]             
__________________________________________________________________________________________________
batch_normalization_170 (BatchN (None, 14, 14, 128)  512         conv2d_168[0][0]                 
__________________________________________________________________________________________________
activation_170 (Activation)     (None, 14, 14, 128)  0           batch_normalization_170[0][0]    
__________________________________________________________________________________________________
conv2d_169 (Conv2D)             (None, 14, 14, 32)   36864       activation_170[0][0]             
__________

### Flickr 8k Dataset
Flicker8k Dataset contains 2 zip files. 
- Flickr8k Dataset.zip (1 Gigabyte) This is an archive of all photographs. It contains more than 8000 photographs in JPEG format. When you unzip this it will unzip to a folder name that spells 'Flicker' but I renamed the folder to 'Flickr' to be consistent.
- Flickr8k text.zip (2.2 Megabytes) This is an archive of all text descriptions for photographs. It contains a number of files containing different sources of descriptions for the photographs.

Below code goes through the dataset and pre calculates all the features for the pictures, so we can save time while training the network. The starting place for this code is the prepare_dataset() function.

In [16]:
import pickle as pickle
import numpy as np
from keras.preprocessing import image
from keras.applications.imagenet_utils import preprocess_input

counter = 0

def load_doc(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text

def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    for line in doc.split('\n'):
        if len(line) < 1:
            continue
        #identifier = line.split('.')[0]
        dataset.append(line)
    return set(dataset)
    
#Loads a given image from the folder and prepares the image pixels to be compatible with the DenseNet model
def load_image(path):
	#Load an image from file
	img = image.load_img(path, target_size=(224,224))
	#Convert the image pixels to a Numpy array
	x = image.img_to_array(img)
	#Reshape data for the model
	x = np.expand_dims(x, axis=0)
	#Prepare the image for the DenseNet model - the image pixels need to be prepared in the same way as the ImageNet training data
	#was prepared.
	x = preprocess_input(x)
	return np.asarray(x)

#Loads the encoding model to be used to get the encoded values for the pictures
def load_encoding_model():
    size = 224
    model = DenseNetImageNet121(input_shape=(size, size, 3))
    #Remove the top classification layer
    model.layers.pop()
    model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
    #Make those layer non-trainable
    for layer in model.layers:
        layer.trainable = False
    return model

#Given an image this returns the encoding for that picture
def get_encoding(model, img):
	global counter
	counter += 1
	#Load the image from the folder
	image = load_image('Flickr8k_Dataset/'+str(img))
	pred = model.predict(image)
	pred = np.reshape(pred, pred.shape[1])
	if counter%1000 ==0: 
		print ("Encoding image: "+str(counter))
		print (pred.shape)
	return pred

#This function does 2 things. Creates the encoded values for the pictures and adds a start and end tag to the captions in the
#train and test data.
def prepare_dataset(no_imgs = -1):
	#Load train captions to memory    
	f_train_images = open('Flickr8k_text/Flickr_8k.trainImages.txt','rb')
	train_imgs = f_train_images.read().strip().split('\n') if no_imgs == -1 else f_train_images.read().strip().split('\n')[:no_imgs]
	f_train_images.close()

	#Load test captions to memory    
	f_test_images = open('Flickr8k_text/Flickr_8k.testImages.txt','rb')
	test_imgs = f_test_images.read().strip().split('\n') if no_imgs == -1 else f_test_images.read().strip().split('\n')[:no_imgs]
	f_test_images.close()

	#Create a new file to write the tagged train captions
	f_train_dataset = open('Flickr8k_text/flickr_8k_train_dataset.txt','wb')
	f_train_dataset.write("image_id\tcaptions\n")

	#Create a new file to write the tagged test captions
	f_test_dataset = open('Flickr8k_text/flickr_8k_test_dataset.txt','wb')
	f_test_dataset.write("image_id\tcaptions\n")
    
	#Go through the text file that contains all the captions and load them into 'captions'
	f_captions = open('Flickr8k_text/Flickr8k.token.txt', 'rb')
	captions = f_captions.read().strip().split('\n')
	data = {}
	for row in captions:
		row = row.split("\t")
		row[0] = row[0][:len(row[0])-2]
		try:
			data[row[0]].append(row[1])
		except:
			data[row[0]] = [row[1]]
	f_captions.close()

	encoded_images = {}
	#Load encoding model to be used to encode the pictures
	encoding_model = load_encoding_model()

	c_train = 0
	#Go through the train caption list to add the start and end tags    
	for img in train_imgs:
		#print ("Encoding image: "+str(img))
		#Get encoding for that training picture
		encoded_images[img] = get_encoding(encoding_model, img)
		for capt in data[img]:
			caption = "<start> "+capt+" <end>"
			f_train_dataset.write(img+"\t"+caption+"\n")
			f_train_dataset.flush()
			c_train += 1
	f_train_dataset.close()

	c_test = 0
	#Go through the test caption list to add the start and end tags    
	for img in test_imgs:
		#Get encoding for that test picture
		encoded_images[img] = get_encoding(encoding_model, img)
		for capt in data[img]:
			caption = "<start> "+capt+" <end>"
			f_test_dataset.write(img+"\t"+caption+"\n")
			f_test_dataset.flush()
			c_test += 1
	f_test_dataset.close()

	#Save the encoded images to a file, which will be used by the model during training
	with open( "encoded_images.p", "wb" ) as pickle_f:
		pickle.dump( encoded_images, pickle_f )
	return [c_train, c_test]

In [17]:
#Create the encoding files and tag the captions in the test and training set
c_train, c_test = prepare_dataset()
print ("Training samples = "+str(c_train))
print ("Test samples = "+str(c_test))

Weights for the model were loaded successfully
Encoding image: 1000
(1024,)
Encoding image: 2000
(1024,)
Encoding image: 3000
(1024,)
Encoding image: 4000
(1024,)
Encoding image: 5000
(1024,)
Encoding image: 6000
(1024,)
Encoding image: 7000
(1024,)
Training samples = 30000
Test samples = 5000


In [1]:
def get_lr_metric(optimizer):
    def lr(y_true, y_pred):
        return optimizer.lr
    return lr

lr = 0.001 

def schedule_lr(epoch):
    if epoch in [11,12,13,14,15]:
        lrate = lr/2
    elif epoch in [20,21,22,23]:
        lrate = lr/2
    elif epoch in [24,25,26,27]:  
        lrate = lr/4
    elif epoch in [28,29,30,31]:  
        lrate = lr/6
    elif epoch in [32, 33,34,35]:  
        lrate = lr/8
    elif epoch in [36,37,38,39]:  
        lrate = lr/10
    elif epoch in [40,41]:  
        lrate = lr/12
    elif epoch in [42,43]:  
        lrate = lr/14
    elif epoch in [44,45]:  
        lrate = lr/16
    elif epoch in [46,47]:  
        lrate = lr/18
    elif epoch in [48,49]:  
        lrate = lr/20
    else:
        lrate = lr

    return lrate


### Language Model
Using the fixed length encoding output from the **Computer Vision Model** as the input to the LSTM model we get the output, which are the captions for the given picture.
This is the decoder in the encoder-decoder architecture.
While a convolutional neural network is used to encode the images, a recurrent neural network, such as a Long Short-Term Memory network, is used to generate the next word in the sequence. The model generates one word of the output textual description, given both the photograph and the description generated so far as input. The model is called recursively until the entire output sequence is generated.
The encoder and decoder arhitecture can be implemented using one of two architectures, as the inject and the merge models.

### Merge Model
Merge model is used below. The merge model combines both the encoded form of the image input with the encoded form of the text description generated so far. The combination of these two encoded inputs is then used by a very simple decoder model to generate the next word in the sequence. The approach uses the recurrent neural network only to encode the text generated so far.
This separates the concern of modeling the image input, the text input and the combining & interpretation of the encoded inputs. It is common to use a pre-trained model for encoding the image, but similarly, this architecture also permits a pre-trained language model to be used to encode the caption text input.

Below code creates the dataset required for the network during training. For instance when the picture needs to be trained on a picture with a caption say 'A black dog is running after a white dog in the snow'. The prepare_dataset() function will return the data as:

| Picture  | X |y |
| ------------- | ------------- |
| Encoded values  | A  | black |
| Encoded values  | A black  | dog|
| Encoded values  | A black dog | is |
| Encoded values  | A black dog is  | running |
| Encoded values  | A black dog is running  | after |
| Encoded values  | A black dog is running after  | a |
| Encoded values  | A black dog is running after a  | white |
| Encoded values  | A black dog is running after a white  | dog |
| Encoded values  | A black dog is running after a white dog  | in |
| Encoded values  | A black dog is running after a white dog in   | the|
| Encoded values  | A black dog is running after a white dog in the   | snow|

In [2]:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import LSTM, Embedding, TimeDistributed, Dense, RepeatVector, Merge, Activation, Flatten
from keras.preprocessing import image, sequence
from keras.callbacks import ModelCheckpoint
import cPickle as pickle

EMBEDDING_DIM = 128


class CaptionGenerator():

    def __init__(self):
        self.max_cap_len = None
        self.vocab_size = None
        self.index_word = None
        self.word_index = None
        self.total_samples = None
        self.encoded_images = pickle.load( open( "encoded_images.p", "rb" ) )
        self.variable_initializer()

    #Create vocabulary to be used during training and encodes words in each sentences
    def variable_initializer(self):
        df = pd.read_csv('Flickr8k_text/flickr_8k_train_dataset.txt', delimiter='\t')
        nb_samples = df.shape[0]
        iter = df.iterrows()
        caps = []
        # Create a list of captions
        for i in range(nb_samples):
            x = iter.next()
            caps.append(x[1][1])

        self.total_samples=0
        # Calculate number of words in the corpus
        for text in caps:
            self.total_samples+=len(text.split())-1
        print ("Total samples : " + str(self.total_samples))
        
        # Create a list of sentences with each sentence split into words
        words = [txt.split() for txt in caps]
        unique = []

        # Creeate a list of words
        for word in words:
            unique.extend(word)

        #Make a unique list of workds
        unique = list(set(unique))
        self.vocab_size = len(unique)
        self.word_index = {}
        self.index_word = {}
        for i, word in enumerate(unique):
            # integer encode words
            self.word_index[word]=i
            self.index_word[i]=word

        max_len = 0
        # Calculate the largest amount of words present in a sentence in the given corpus. This is used to pad all sequences
        # to be of this length so it is consistent.
        for caption in caps:
            if(len(caption.split()) > max_len):
                max_len = len(caption.split())
        self.max_cap_len = max_len
        print ("Vocabulary size: "+str(self.vocab_size))
        print ("Maximum caption length: "+str(self.max_cap_len))
        print ("Variables initialization done!")


    #This progressively loads the data required during training. This technique is used when the entire dataset cannot be 
    #fit into memory
    def data_generator(self, batch_size = 32):
        partial_caps = []
        next_words = []
        images = []
        print ("Generating data...")
        gen_count = 0
        #Read the tagged captions from training dataset
        df = pd.read_csv('Flickr8k_text/flickr_8k_train_dataset.txt', delimiter='\t')
        nb_samples = df.shape[0]
        iter = df.iterrows()
        caps = []
        imgs = []
        #Go through each line in the training dataset and create a list of captions and images
        for i in range(nb_samples):
            x = iter.next()
            caps.append(x[1][1])
            imgs.append(x[1][0])


        total_count = 0
        #this loop will exit when all the data in the training set is passed to the network during training 
        while 1:
            image_counter = -1
            #Loop through all the captions in the training data
            for text in caps:
                image_counter+=1
                #Get the encoded image for the picture. This will be the input to the  denseNet model
                current_image = self.encoded_images[imgs[image_counter]]

                #create a list of words from the sentence
                for i in range(len(text.split())-1):
                    total_count+=1
                    #get the first word in the sentence
                    partial = [self.word_index[txt] for txt in text.split()[:i+1]]
                    partial_caps.append(partial)
                    next = np.zeros(self.vocab_size)
                    #the 'next' calls this in a loop to create the 'X', y mentioned above in the table at the beginning
                    next[self.word_index[text.split()[i+1]]] = 1
                    next_words.append(next)
                    images.append(current_image)

                    #when the batch size is reached return the 'Picture', 'X' and 'y' values collected so far.
                    if total_count>=batch_size:
                        next_words = np.asarray(next_words)
                        images = np.asarray(images)
                        # pad all sequences to a fixed length
                        partial_caps = sequence.pad_sequences(partial_caps, maxlen=self.max_cap_len, padding='post')
                        total_count = 0
                        gen_count+=1
                        if gen_count%1000 ==0: print ("yielding count: "+str(gen_count))
                        yield [[images, partial_caps], next_words]
                        partial_caps = []
                        next_words = []
                        images = []
        
    def load_image(self, path):
        img = image.load_img(path, target_size=(224,224))
        x = image.img_to_array(img)
        return np.asarray(x)

    #Define the model
    def create_model(self, ret_model = False):
        #Define the DenseNet model
        image_model = Sequential()
        #The input dim should match the output generated by the last but one layer of the DenseNet model
        image_model.add(Dense(EMBEDDING_DIM, input_dim = 1024, activation='relu'))
        #Repeat the input depending on the length of the output. We are informing the decoder as to how many times to repeat itself
        image_model.add(RepeatVector(self.max_cap_len))

        #Define the decoder model
        lang_model = Sequential()
        lang_model.add(Embedding(self.vocab_size, 256, input_length=self.max_cap_len))
        lang_model.add(LSTM(256,return_sequences=True))
        #Apply the dense layer to each of the timesteps
        lang_model.add(TimeDistributed(Dense(EMBEDDING_DIM)))

        #Define the 'Merge' model
        model = Sequential()
        model.add(Merge([image_model, lang_model], mode='concat'))
        model.add(LSTM(1000,return_sequences=False))
        model.add(Dense(self.vocab_size))
        model.add(Activation('softmax'))

        print ("Model created!")

        if(ret_model==True):
            return model

        #model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
        optimizer = keras.optimizers.Adam(lr=lr, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
        lr_metric = get_lr_metric(optimizer)

        model.compile(loss='categorical_crossentropy', optimizer = 'rmsprop', metrics=['accuracy', lr_metric])
        return model

    def get_word(self,index):
        return self.index_word[index]

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


In [3]:
import keras
from keras.callbacks import ModelCheckpoint

def train_model(weight = None, batch_size=32, epochs = 10, initial_epoch = 0):
    # create captiongenerator
    cg = CaptionGenerator()
    # create the model
    model = cg.create_model()

    #when weight is provided use those
    if weight != None:
        model.load_weights(weight)

    counter = 0
    fileid = model.name
    logfilename = fileid + "-log.csv"
    # update the metrics to a file at the end of each epoch 
    csv_logger = keras.callbacks.CSVLogger(logfilename, separator=',', append=True)

    file_name = 'weights-improvement-{epoch:02d}.hdf5'
    # interested in monitoring the 'loss' value
    checkpoint = ModelCheckpoint(file_name, monitor='loss', verbose=1, save_best_only=True, mode='min')
    
    # define a variable learning rate scheduler
    lr_scheduler = keras.callbacks.LearningRateScheduler(schedule_lr)

    # Functions to be called at the end of each epoch
    callbacks_list = [checkpoint, csv_logger, lr_scheduler]
    #fit the model using a progressive loader
    model.fit_generator(cg.data_generator(batch_size=batch_size), initial_epoch=initial_epoch, steps_per_epoch=cg.total_samples/batch_size, epochs=epochs, verbose=1, callbacks=callbacks_list)
    
    
    try:
        model.save('Models/WholeModel.h5', overwrite=True)
        model.save_weights('Models/Weights.h5',overwrite=True)
    except:
        print ("Error in saving model.")
    print ("Training complete...\n")

In [20]:
import time
tic = time.clock()
train_model(initial_epoch = 0, epochs=10, batch_size=2176)
toc = time.clock()
print('Time taken to train with 50 epochs is ', ((toc - tic)/60))

Total samples : 383454
Vocabulary size: 8256
Maximum caption length: 40
Variables initialization done!




Model created!
Generating data...
Epoch 1/10

Epoch 00001: loss improved from inf to 5.47579, saving model to weights-improvement-01.hdf5
Epoch 2/10

Epoch 00002: loss improved from 5.47579 to 5.07437, saving model to weights-improvement-02.hdf5
Epoch 3/10

Epoch 00003: loss improved from 5.07437 to 4.54932, saving model to weights-improvement-03.hdf5
Epoch 4/10

Epoch 00004: loss improved from 4.54932 to 4.32290, saving model to weights-improvement-04.hdf5
Epoch 5/10

Epoch 00005: loss improved from 4.32290 to 4.08217, saving model to weights-improvement-05.hdf5
Epoch 6/10

Epoch 00006: loss improved from 4.08217 to 3.79212, saving model to weights-improvement-06.hdf5
Epoch 7/10

Epoch 00007: loss improved from 3.79212 to 3.55086, saving model to weights-improvement-07.hdf5
Epoch 8/10

Epoch 00008: loss improved from 3.55086 to 3.37010, saving model to weights-improvement-08.hdf5
Epoch 9/10

Epoch 00009: loss improved from 3.37010 to 3.21866, saving model to weights-improvement-09.hdf

In [26]:
import time
tic = time.clock()
initial_epoch = 10
epochs=15
weight='weights-improvement-10.hdf5'

train_model(initial_epoch = initial_epoch, epochs=epochs, batch_size=2176, weight=weight)

toc = time.clock()
print("Time taken to train with " + str(epochs - initial_epoch) + " epochs is " + str(((toc - tic)/60)) + " minutes")

Total samples : 383454
Vocabulary size: 8256
Maximum caption length: 40
Variables initialization done!




Model created!
Generating data...Epoch 11/15


Epoch 00011: loss improved from inf to 3.02379, saving model to weights-improvement-11.hdf5
Epoch 12/15

Epoch 00012: loss improved from 3.02379 to 2.76205, saving model to weights-improvement-12.hdf5
Epoch 13/15

Epoch 00013: loss improved from 2.76205 to 2.65512, saving model to weights-improvement-13.hdf5
Epoch 14/15

Epoch 00014: loss improved from 2.65512 to 2.57547, saving model to weights-improvement-14.hdf5
Epoch 15/15

Epoch 00015: loss improved from 2.57547 to 2.49902, saving model to weights-improvement-15.hdf5
Error in saving model.
Training complete...

Time taken to train with 5 epochs is 15.43760335 minutes


In [27]:
import time
tic = time.clock()
initial_epoch = 15
epochs=20
weight='weights-improvement-15.hdf5'
lr = 0.001
lr = lr/2

train_model(initial_epoch = initial_epoch, epochs=epochs, batch_size=2176, weight=weight)

toc = time.clock()
print("Time taken to train with " + str(epochs - initial_epoch) + " epochs is " + str(((toc - tic)/60)) + " minutes")

Total samples : 383454
Vocabulary size: 8256
Maximum caption length: 40
Variables initialization done!




Model created!
Generating data...
Epoch 16/20

Epoch 00016: loss improved from inf to 2.39298, saving model to weights-improvement-16.hdf5
Epoch 17/20

Epoch 00017: loss improved from 2.39298 to 2.39190, saving model to weights-improvement-17.hdf5
Epoch 18/20

Epoch 00018: loss improved from 2.39190 to 2.33217, saving model to weights-improvement-18.hdf5
Epoch 19/20

Epoch 00019: loss improved from 2.33217 to 2.26732, saving model to weights-improvement-19.hdf5
Epoch 20/20

Epoch 00020: loss improved from 2.26732 to 2.21030, saving model to weights-improvement-20.hdf5
Error in saving model.
Training complete...

Time taken to train with 5 epochs is 15.53058585 minutes


In [4]:
import time
tic = time.clock()
initial_epoch = 20
epochs=25
weight='weights-improvement-20.hdf5'
lr = 0.001
lr = lr/2

train_model(initial_epoch = initial_epoch, epochs=epochs, batch_size=2176, weight=weight)

toc = time.clock()
print("Time taken to train with " + str(epochs - initial_epoch) + " epochs is " + str(((toc - tic)/60)) + " minutes")

Total samples : 383454
Vocabulary size: 8256
Maximum caption length: 40
Variables initialization done!




Model created!
Generating data...Epoch 21/25


Epoch 00021: loss improved from inf to 2.09154, saving model to weights-improvement-21.hdf5
Epoch 22/25

Epoch 00022: loss improved from 2.09154 to 2.01820, saving model to weights-improvement-22.hdf5
Epoch 23/25

Epoch 00023: loss improved from 2.01820 to 1.96618, saving model to weights-improvement-23.hdf5
Epoch 24/25

Epoch 00024: loss improved from 1.96618 to 1.91589, saving model to weights-improvement-24.hdf5
Epoch 25/25

Epoch 00025: loss improved from 1.91589 to 1.91262, saving model to weights-improvement-25.hdf5
Error in saving model.
Training complete...

Time taken to train with 5 epochs is 16.8826267667 minutes


In [8]:
import time
tic = time.clock()
initial_epoch = 25
epochs=30
weight='weights-improvement-25.hdf5'
lr = 0.001
#lr = lr/4

train_model(initial_epoch = initial_epoch, epochs=epochs, batch_size=2176, weight=weight)

toc = time.clock()
print("Time taken to train with " + str(epochs - initial_epoch) + " epochs is " + str(((toc - tic)/60)) + " minutes")

Total samples : 383454
Vocabulary size: 8256
Maximum caption length: 40
Variables initialization done!




Model created!
Generating data...
Epoch 26/30

Epoch 00026: loss improved from inf to 1.87186, saving model to weights-improvement-26.hdf5
Epoch 27/30

Epoch 00027: loss improved from 1.87186 to 1.81656, saving model to weights-improvement-27.hdf5
Epoch 28/30

Epoch 00028: loss improved from 1.81656 to 1.77244, saving model to weights-improvement-28.hdf5
Epoch 29/30

Epoch 00029: loss improved from 1.77244 to 1.71143, saving model to weights-improvement-29.hdf5
Epoch 30/30

Epoch 00030: loss improved from 1.71143 to 1.67026, saving model to weights-improvement-30.hdf5
Error in saving model.
Training complete...

Time taken to train with 5 epochs is 15.82796535 minutes


In [13]:
import time
tic = time.clock()
initial_epoch = 30
epochs=35
weight='weights-improvement-30.hdf5'
lr = 0.001
#lr = lr/3
# tried with 0.001 it was hovering below 0.59
#Tried wit lr/2 the accuracy went down by 3 basis points during the first epoch
# trying with 0.002
train_model(initial_epoch = initial_epoch, epochs=epochs, batch_size=2176, weight=weight)

toc = time.clock()
print("Time taken to train with " + str(epochs - initial_epoch) + " epochs is " + str(((toc - tic)/60)) + " minutes")

Total samples : 383454
Vocabulary size: 8256
Maximum caption length: 40
Variables initialization done!




Model created!
Epoch 31/35
Generating data...

Epoch 00031: loss improved from inf to 1.64701, saving model to weights-improvement-31.hdf5
Epoch 32/35

Epoch 00032: loss improved from 1.64701 to 1.60813, saving model to weights-improvement-32.hdf5
Epoch 33/35

Epoch 00033: loss improved from 1.60813 to 1.59445, saving model to weights-improvement-33.hdf5
Epoch 34/35

Epoch 00034: loss improved from 1.59445 to 1.56893, saving model to weights-improvement-34.hdf5
Epoch 35/35

Epoch 00035: loss improved from 1.56893 to 1.53968, saving model to weights-improvement-35.hdf5
Error in saving model.
Training complete...

Time taken to train with 5 epochs is 15.39327855 minutes


In [5]:
import time
tic = time.clock()
initial_epoch = 35
epochs=40
weight='weights-improvement-35.hdf5'
lr = 0.001
#lr = lr/2
train_model(initial_epoch = initial_epoch, epochs=epochs, batch_size=2176, weight=weight)

toc = time.clock()
print("Time taken to train with " + str(epochs - initial_epoch) + " epochs is " + str(((toc - tic)/60)) + " minutes")

Total samples : 383454
Vocabulary size: 8256
Maximum caption length: 40
Variables initialization done!




Model created!
Generating data...
Epoch 36/40

Epoch 00036: loss improved from inf to 1.52246, saving model to weights-improvement-36.hdf5
Epoch 37/40

Epoch 00037: loss did not improve
Epoch 38/40

Epoch 00038: loss improved from 1.52246 to 1.51777, saving model to weights-improvement-38.hdf5
Epoch 39/40

Epoch 00039: loss improved from 1.51777 to 1.49675, saving model to weights-improvement-39.hdf5
Epoch 40/40

Epoch 00040: loss improved from 1.49675 to 1.47725, saving model to weights-improvement-40.hdf5
Error in saving model.
Training complete...

Time taken to train with 5 epochs is 16.4409871667 minutes


In [8]:
import time
tic = time.clock()
initial_epoch = 40
epochs=45
weight='weights-improvement-40.hdf5'
lr = 0.001
lr = lr * 2
train_model(initial_epoch = initial_epoch, epochs=epochs, batch_size=2176, weight=weight)

toc = time.clock()
print("Time taken to train with " + str(epochs - initial_epoch) + " epochs is " + str(((toc - tic)/60)) + " minutes")

Total samples : 383454
Vocabulary size: 8256
Maximum caption length: 40
Variables initialization done!




Model created!
Generating data...
Epoch 41/45

Epoch 00041: loss improved from inf to 1.47117, saving model to weights-improvement-41.hdf5
Epoch 42/45

Epoch 00042: loss improved from 1.47117 to 1.43716, saving model to weights-improvement-42.hdf5
Epoch 43/45

Epoch 00043: loss improved from 1.43716 to 1.39589, saving model to weights-improvement-43.hdf5
Epoch 44/45

Epoch 00044: loss improved from 1.39589 to 1.36893, saving model to weights-improvement-44.hdf5
Epoch 45/45

Epoch 00045: loss improved from 1.36893 to 1.33845, saving model to weights-improvement-45.hdf5
Error in saving model.
Training complete...

Time taken to train with 5 epochs is 15.7627675833 minutes


In [9]:
import time
tic = time.clock()
initial_epoch = 45
epochs = 50
weight='weights-improvement-45.hdf5'
lr = 0.001
lr = lr * 2
train_model(initial_epoch = initial_epoch, epochs=epochs, batch_size=2176, weight=weight)

toc = time.clock()
print("Time taken to train with " + str(epochs - initial_epoch) + " epochs is " + str(((toc - tic)/60)) + " minutes")

Total samples : 383454
Vocabulary size: 8256
Maximum caption length: 40
Variables initialization done!




Model created!
Generating data...Epoch 46/50


Epoch 00046: loss improved from inf to 1.32470, saving model to weights-improvement-46.hdf5
Epoch 47/50

Epoch 00047: loss improved from 1.32470 to 1.29706, saving model to weights-improvement-47.hdf5
Epoch 48/50

Epoch 00048: loss improved from 1.29706 to 1.27719, saving model to weights-improvement-48.hdf5
Epoch 49/50

Epoch 00049: loss improved from 1.27719 to 1.26466, saving model to weights-improvement-49.hdf5
Epoch 50/50

Epoch 00050: loss improved from 1.26466 to 1.24835, saving model to weights-improvement-50.hdf5
Error in saving model.
Training complete...

Time taken to train with 5 epochs is 15.6972003667 minutes


In [10]:
import pickle as pickle
import numpy as np
from keras.preprocessing import sequence
import nltk

cg = CaptionGenerator()

def process_caption(caption):
	caption_split = caption.split()
	processed_caption = caption_split[1:]
	try:
		end_index = processed_caption.index('<end>')
		processed_caption = processed_caption[:end_index]
	except:
		pass
	return " ".join([word for word in processed_caption])

def get_best_caption(captions):
    captions.sort(key = lambda l:l[1])
    best_caption = captions[-1][0]
    return " ".join([cg.index_word[index] for index in best_caption])

def get_all_captions(captions):
    final_captions = []
    captions.sort(key = lambda l:l[1])
    for caption in captions:
        text_caption = " ".join([cg.index_word[index] for index in caption[0]])
        final_captions.append([text_caption, caption[1]])
    return final_captions

def generate_captions(model, image, beam_size):
	start = [cg.word_index['<start>']]
	captions = [[start,0.0]]
	while(len(captions[0][0]) < cg.max_cap_len):
		temp_captions = []
		for caption in captions:
			partial_caption = sequence.pad_sequences([caption[0]], maxlen=cg.max_cap_len, padding='post')
			next_words_pred = model.predict([np.asarray([image]), np.asarray(partial_caption)])[0]
			next_words = np.argsort(next_words_pred)[-beam_size:]
			for word in next_words:
				new_partial_caption, new_partial_caption_prob = caption[0][:], caption[1]
				new_partial_caption.append(word)
				new_partial_caption_prob+=next_words_pred[word]
				temp_captions.append([new_partial_caption,new_partial_caption_prob])
		captions = temp_captions
		captions.sort(key = lambda l:l[1])
		captions = captions[-beam_size:]

	return captions

def test_model(weight, img_name, beam_size = 3):
	encoded_images = pickle.load( open( "encoded_images.p", "rb" ) )
	model = cg.create_model(ret_model = True)
	model.load_weights(weight)

	image = encoded_images[img_name]
	captions = generate_captions(model, image, beam_size)
	return process_caption(get_best_caption(captions))
	#return [process_caption(caption[0]) for caption in get_all_captions(captions)] 

def bleu_score(hypotheses, references):
	return nltk.translate.bleu_score.corpus_bleu(references, hypotheses)

def test_model_on_images(weight, img_dir, beam_size = 3):
	imgs = []
	captions = {}
	with open(img_dir, 'rb') as f_images:
		imgs = f_images.read().strip().split('\n')
	encoded_images = pickle.load( open( "encoded_images.p", "rb" ) )
	model = cg.create_model(ret_model = True)
	model.load_weights(weight)

	f_pred_caption = open('predicted_captions.txt', 'wb')

	for count, img_name in enumerate(imgs):
		print ("Predicting for image: "+str(count))
		image = encoded_images[img_name]
		image_captions = generate_captions(model, image, beam_size)
		best_caption = process_caption(get_best_caption(image_captions))
		captions[img_name] = best_caption
		print (img_name+" : "+str(best_caption))
		f_pred_caption.write(img_name+"\t"+str(best_caption))
		f_pred_caption.flush()
	f_pred_caption.close()

	f_captions = open('Flickr8k_text/Flickr8k.token.txt', 'rb')
	captions_text = f_captions.read().strip().split('\n')
	image_captions_pair = {}
	for row in captions_text:
		row = row.split("\t")
		row[0] = row[0][:len(row[0])-2]
		try:
			image_captions_pair[row[0]].append(row[1])
		except:
			image_captions_pair[row[0]] = [row[1]]
	f_captions.close()
	
	hypotheses=[]
	references = []
	for img_name in imgs:
		hypothesis = captions[img_name]
		reference = image_captions_pair[img_name]
		hypotheses.append(hypothesis)
		references.append(reference)

	return bleu_score(hypotheses, references)


Total samples : 383454
Vocabulary size: 8256
Maximum caption length: 40
Variables initialization done!


In [11]:
weight = 'weights-improvement-50.hdf5'
#test_image = '3155451946_c0862c70cb.jpg'
test_img_dir = 'Flickr8k_text/Flickr_8k.testImages.txt'
#print test_model(weight, test_image)
print (test_model_on_images(weight, test_img_dir, beam_size=3))



Model created!
Predicting for image: 0
3385593926_d3e9c21170.jpg : A group of people on a beach .
Predicting for image: 1
2677656448_6b7e7702af.jpg : A child in a blue suit in a pool .
Predicting for image: 2
311146855_0b65fdb169.jpg : A man in a yellow shirt is sitting on a yellow couch .
Predicting for image: 3
1258913059_07c613f7ff.jpg : A woman is sitting on a railing with a firetruck .
Predicting for image: 4
241347760_d44c8d3a01.jpg : A girl in a red uniform is running with a football .
Predicting for image: 5
2654514044_a70a6e2c21.jpg : Two dogs playing in the grass .
Predicting for image: 6
2339106348_2df90aa6a9.jpg : A woman in a white shirt is staring at a woman .
Predicting for image: 7
256085101_2c2617c5d0.jpg : A dog playing with a stuffed ball .
Predicting for image: 8
280706862_14c30d734a.jpg : A small brown dog is standing in the dirt .
Predicting for image: 9
3072172967_630e9c69d0.jpg : A group of men in white uniforms .
Predicting for image: 10
3482062809_3b694322c4.j

2731171552_4a808c7d5a.jpg : A young boy in a red shirt is running around in a red car .
Predicting for image: 82
3609032038_005c789f64.jpg : A man on a skateboard is jumping over a hill .
Predicting for image: 83
3119875880_22f9129a1c.jpg : A man on a white haired rock on a white wall .
Predicting for image: 84
3339140382_2e49bc324a.jpg : A person in midair performing a trick off a ramp .
Predicting for image: 85
2712787899_d85048eb6a.jpg : The little girl is holding a ball .
Predicting for image: 86
3655155990_b0e201dd3c.jpg : A black dog walks in the ocean .
Predicting for image: 87
3325497914_f9014d615b.jpg : A group of men wearing swim trunks run along a city street .
Predicting for image: 88
468310111_d9396abcbd.jpg : A black and white and white dog running through a grassy field .
Predicting for image: 89
747921928_48eb02aab2.jpg : A skateboarder in a onstage .
Predicting for image: 90
3639967449_137f48b43d.jpg : A group of people are looking at the camera .
Predicting for image:

241345811_46b5f157d4.jpg : A woman in a white shirt is running .
Predicting for image: 161
3457045393_2bbbb4e941.jpg : A group of children in a crowded park .
Predicting for image: 162
2797149878_bb8e27ecf9.jpg : A man in a blue shirt and glasses is standing in the snow .
Predicting for image: 163
543007912_23fc735b99.jpg : A man in a white shirt and black shorts sits on a bench .
Predicting for image: 164
3364026240_645d533fda.jpg : Three people are playing in the water .
Predicting for image: 165
466956209_2ffcea3941.jpg : A black and white dog is playing in the dirt .
Predicting for image: 166
2300168895_a9b83e16fc.jpg : Two dogs are playing with each other in the grass .
Predicting for image: 167
106490881_5a2dd9b7bd.jpg : A person in a helmet is jumping over a wooden fence .
Predicting for image: 168
3694991841_141804da1f.jpg : A brown dog is playing in the grass .
Predicting for image: 169
1523984678_edd68464da.jpg : A black and white dog runs through the grass .
Predicting for i

1107246521_d16a476380.jpg : A black dog and a brown dog are playing with a soccer ball .
Predicting for image: 242
3201427741_3033f5b625.jpg : A little boy in a orange jacket is in the snow .
Predicting for image: 243
3540416981_4e74f08cbb.jpg : Two dogs running on the grass .
Predicting for image: 244
410453140_5401bf659a.jpg : A group of people in front of a modern building .
Predicting for image: 245
3702436188_2c26192fd0.jpg : A black and white dog is sitting on a dog in a park .
Predicting for image: 246
2216695423_1362cb25f3.jpg : Two dogs are eating each other .
Predicting for image: 247
2345984157_724823b1e4.jpg : A brown dog jumps over a large pile of water .
Predicting for image: 248
3317073508_7e13565c1b.jpg : A group of people are all on the street in the middle of a race .
Predicting for image: 249
2101457132_69c950bc45.jpg : A man in a red shirt is rock climbing .
Predicting for image: 250
3285993030_87b0f1d202.jpg : A group of three people are running on a dirt path .
Pr

1282392036_5a0328eb86.jpg : Two children and one dogs are standing in front of a flock of dogs .
Predicting for image: 322
2704934519_457dc38986.jpg : A man wearing a wetsuit is jumping in the air into the water .
Predicting for image: 323
3499720588_c32590108e.jpg : A man is jumping over a hurdle .
Predicting for image: 324
506738508_327efdf9c3.jpg : A man in a white shirt and jeans poses for the camera .
Predicting for image: 325
512101751_05a6d93e19.jpg : A little girl in a yellow shirt runs through the grass .
Predicting for image: 326
2317714088_bcd081f926.jpg : A man walking down a street in a busy city street .
Predicting for image: 327
3275704430_a75828048f.jpg : A man in a red hat , and a hat looks at a face .
Predicting for image: 328
2518508760_68d8df7365.jpg : A race car car is standing on a racetrack .
Predicting for image: 329
3254817653_632e840423.jpg : A group of people are walking down the street .
Predicting for image: 330
3113322995_13781860f2.jpg : A black and white

2458269558_277012780d.jpg : A little boy in a blue shirt is riding a ride to a swing .
Predicting for image: 402
2985679744_75a7102aab.jpg : A man in a black shirt with a backpack .
Predicting for image: 403
317383917_d8bfa350b6.jpg : Two dogs running through the snow .
Predicting for image: 404
2482629385_f370b290d1.jpg : A man in a blue shirt is walking through tall grass .
Predicting for image: 405
293327462_20dee0de56.jpg : A woman sits on a wall with her arms out .
Predicting for image: 406
359837950_9e22ffe6c2.jpg : Two dogs are playing in the field .
Predicting for image: 407
354642192_3b7666a2dd.jpg : A dog swims in water .
Predicting for image: 408
1786425974_c7c5ad6aa1.jpg : A man in a black jacket is running on the grass .
Predicting for image: 409
3767841911_6678052eb6.jpg : A little girl in a yellow shirt plays with a Frisbee .
Predicting for image: 410
2884420269_225d27f242.jpg : A man in a pink shirt is climbing a skateboard on a skateboard .
Predicting for image: 411
27

3028969146_26929ae0e8.jpg : Two dogs are running in a field .
Predicting for image: 483
254295381_d98fa049f4.jpg : A boy wearing a blue shirt is jumping over a hurdle .
Predicting for image: 484
2148916767_644ea6a7fa.jpg : A black and white dog is running through the snow .
Predicting for image: 485
3200120942_59cfbb3437.jpg : a , a , a , a , a , in a red , in a black , black and black , a man and a man and a , a man , a , a , and a
Predicting for image: 486
3591458156_f1a9a33918.jpg : A brown and white dog is jumping over a Frisbee in the air .
Predicting for image: 487
3354330935_de75be9d2f.jpg : Skiiers of people on a snowy mountain .
Predicting for image: 488
3320356356_1497e53f80.jpg : A man with a pink shirt is running down a dirt path .
Predicting for image: 489
353180303_6a24179c50.jpg : Two women in orange shirts are smiling for the camera .
Predicting for image: 490
3064383768_f6838f57da.jpg : A black and white dog plays in the water .
Predicting for image: 491
154871781_ae77

3425851292_de92a072ee.jpg : A girl in a red shirt sits on a mat in the air .
Predicting for image: 560
3630641436_8f9ac5b9b2.jpg : A man is carrying a soccer ball in the water .
Predicting for image: 561
2901880865_3fd7b66a45.jpg : A surfer wearing a helmet is riding a large wave .
Predicting for image: 562
2445283938_ff477c7952.jpg : A man in a hat is pushing a picture in front of a crowd of people .
Predicting for image: 563
3315616181_15dd137e27.jpg : A little girl in a white shirt jumping off of a small player .
Predicting for image: 564
1572532018_64c030c974.jpg : A woman and a woman are sitting on some rocks .
Predicting for image: 565
2308271254_27fb466eb4.jpg : A black and black dog is running through the water .
Predicting for image: 566
2498897831_0bbb5d5b51.jpg : A little girl in a purple shirt and pink shorts is standing in front of a group of flowers .
Predicting for image: 567
2170222061_e8bce4a32d.jpg : A brown and black dog is playing with a tennis ball in its mouth .
P

3223224391_be50bf4f43.jpg : A black and white dog is running through the water .
Predicting for image: 638
1461667284_041c8a2475.jpg : A group of girls wearing hats stand in front of the street .
Predicting for image: 639
2196316998_3b2d63f01f.jpg : A man in a white shirt is riding a red bike .
Predicting for image: 640
1998457059_c9ac9a1e1a.jpg : A surfer is doing a leap on a surfboard .
Predicting for image: 641
3294209955_a1f1e2cc19.jpg : Two dogs stand in a field .
Predicting for image: 642
488408004_a1e26d4886.jpg : A person wearing a black hat is running through a field of grass .
Predicting for image: 643
3135504530_0f4130d8f8.jpg : A woman in a red shirt is wearing a red shirt .
Predicting for image: 644
3217910740_d1d61c08ab.jpg : A little boy wearing a green shirt is looking at the camera .
Predicting for image: 645
3602838407_bf13e49243.jpg : Three dogs are competing in the water .
Predicting for image: 646
2984174290_a915748d77.jpg : A young girl plays on a beach
Predicting

2525270674_4ab536e7ec.jpg : A man with a pink hat is walking in a green pool .
Predicting for image: 716
3470951932_27ed74eb0b.jpg : Two soccer players playing in the grass .
Predicting for image: 717
2870875612_2cbb9e4a3c.jpg : A young boy wearing a cap is in the ocean on the beach .
Predicting for image: 718
2541104331_a2d65cfa54.jpg : A black dog is about to swim in the water .
Predicting for image: 719
444057017_f1e0fcaef7.jpg : A young boy in a yellow shirt with his arm inside a yellow nose .
Predicting for image: 720
3597326009_3678a98a43.jpg : A man is sitting on a stool in a crowded park .
Predicting for image: 721
3360930596_1e75164ce6.jpg : A soccer player in blue is ready to throw a ball .
Predicting for image: 722
247637795_fdf26a03cf.jpg : A man wearing a black shirt looks at a camera .
Predicting for image: 723
3696698390_989f1488e7.jpg : Four children are walking through a garden .
Predicting for image: 724
3421789737_f625dd17ed.jpg : A man in a striped jacket is standin

2176980976_7054c99621.jpg : A man wearing a blue shirt and blue shirt is running toward the camera .
Predicting for image: 794
3523559027_a65619a34b.jpg : A little boy is about to climb a curb in the park .
Predicting for image: 795
1329832826_432538d331.jpg : A man and a black and brown dog are in front of a fountain .
Predicting for image: 796
260520547_944f9f4c91.jpg : Two dogs play in front of a fence in front of a fence .
Predicting for image: 797
2473738924_eca928d12f.jpg : A man in a white shirt and jeans is jumping in the air .
Predicting for image: 798
1765164972_92dac06fa9.jpg : A man in a yellow shirt eats a clear book .
Predicting for image: 799
2806710650_e201acd913.jpg : A young boy in blue holds an arms trick in the air .
Predicting for image: 800
2501595799_6316001e89.jpg : A black and white dog is running through a field .
Predicting for image: 801
3697359692_8a5cdbe4fe.jpg : A man and a woman stand in front of a crowd .
Predicting for image: 802
3688858505_e8afd1475d.

2813033949_e19fa08805.jpg : A boy in a red shirt runs through a field .
Predicting for image: 873
745880539_cd3f948837.jpg : A little boy in a striped shirt is looking out on a wooden ramp .
Predicting for image: 874
2480327661_fb69829f57.jpg : A boy in orange pitching a basketball .
Predicting for image: 875
3125309108_1011486589.jpg : A man in a white sweatshirt and a black sweatshirt stands in the air .
Predicting for image: 876
3287549827_04dec6fb6e.jpg : Two people playing in the woods .
Predicting for image: 877
391579205_c8373b5411.jpg : A man and a woman sit on the top of a brick building .
Predicting for image: 878
2610447973_89227ff978.jpg : A skateboarder in the dark .
Predicting for image: 879
2698666984_13e17236ae.jpg : A man wearing a blue helmet is wearing a helmet is jumping through a river .
Predicting for image: 880
339350939_6643bfb270.jpg : A white dog runs on the beach .
Predicting for image: 881
127490019_7c5c08cb11.jpg : Two people sitting on a dirt path in front

3187492926_8aa85f80c6.jpg : A child is doing a swimming pool .
Predicting for image: 953
3673165148_67f217064f.jpg : A man is riding the top of a ramp .
Predicting for image: 954
270724499_107481c88f.jpg : A black dog with a red collar is running through a field .
Predicting for image: 955
2182488373_df73c7cc09.jpg : A group of girls are walking down the street .
Predicting for image: 956
2421446839_fe7d46c177.jpg : A man wearing a brown shirt and a white shirt is holding a brown dog on his nose .
Predicting for image: 957
2603792708_18a97bac97.jpg : A boy with a white hat is standing on the top of a body of water in the water .
Predicting for image: 958
2822290399_97c809d43b.jpg : A black and white dog is running through the grass .
Predicting for image: 959
1332722096_1e3de8ae70.jpg : A woman in a green shirt is walking along a bench .
Predicting for image: 960
3694064560_467683205b.jpg : A group of people sit on the floor in front of a store .
Predicting for image: 961
3263395801_5e

## Things that were focused on

### Descriptions
These are included in the above code

### Different things I tried
1. Changed the optimizer from RMSProp to Adam and ran 5 epcohs. Accuracy increased from  0.261079 to 0.2979. So retained the optimizer to be Adam.
2. Instead of 'Concatenating' the image and language model, I tried to do a 'Sum' of these. But the accuracy did not improve meaningfully from the 1st to the 5th epoch. It improved from 0.0769 to 0.0775. So dropped this idea.
3. Changed the optimizer from Adam to SGD and ran 5 epochs. accuracy did not improve meaningfully from the 1st to the 5th epoch. It improved from 0.0754 to 0.0792. So dropped this idea. Moreover the time it took to train 5 epochs was noticeably longer.
4. Tried running for 10 epochs with Adam and noticed that RMSProp was marginally better at the end of the 10th epoch. So changed it back to RMSProp
5. Tried dropouts with RMSProp. Added a 50% dropout for the image_model. The original model without the dropout was marginally better. So removed the dropouts from image_model.
6. Added a 50% dropout for the lang_model. At the end of the 10th epoch, it did better than the model without the dropout. Accuracy with the dropout at the 10th epoch was 0.37888158 whereas without dropout it was 0.368187563
7. Retained the 50% dropout for the lang_model and added the 50% drop_out  for the image_model and ran for 10 epochs. The accuracy was less by 2 basis points compared to the one I got with dropout at just the lang_model. Up until the 3rd epoch, the values were comparable to my earlier run with dropout at the lang_model.
8. Retained the 50% dropout for the lang_model and reduced the dropout for the image_model to 20% and ran for 10 epochs. Until the 7th epoch this model was doing better but it tapered out for the remaining 3 epochs - finishing the 10th epoch with an accuracy of 0.377 when the model with just the dropout for the lang model finished with an accuracy of 0.37882. I think it is worth trying both the models for another 10 epochs.
9. Trained the model with 50% dropout for the lang_model and 20% dropout for the image model. At the end of the 20th epoch this model was less accurate by more than 2 basis points when than the model with no dropouts. At the 20th epcoh this model produced an accuracy of 0.428760548 and the model without any dropouts produced an accuracy of 0.451811603. The dropouts made the network learn faster during the first 8 epochs but the learning slowed down during the subsequent runs. So I dropped this idea.
10. Cleaned the captions, in the training set by converting all captions to lower case, removing apostrophes and removing all words with single characters such as 's' and 'a'. Ran for 10 epcohs with RMSProp and noticed that the learning was slow and completed the 10th epoch with an accuracy of 0.230183563. 
11. Tried the above setup with 20% drop out for the image model. Stopped it after 5 epochs because the learning was very slow. At the end of the 5th epoch the accuracy was 0.1632. With all my prior runs I use to be around 0.26 at the 5th epcoh. So stopped this. The takeaway from cleaning the description is that it slows down the learning.
12. Tried Adam optimizer with 500 LSTM instead of the LSTSM cells. At the end of the 1st epoch the accuracy was 0.0774 but there was no meaningful uptick in the accuracy. I ran it for 5 epochs and stopped it. The best accuracy at the end of the 5th epcoh was 0.0782.
13. I switched back to RMSProp and tried with variable learning rate. Got a reasonable uptick in the Bleu score, which is in this file.


### Detailed analysis of results
When the model gets stuck and does not improve the results, sometimes doubling the learning rate helps while other times reducing the learning rate helps. I would have typically thought reducing the learning rate will help during the later stages of the training but the reverse was true during my training. I noticed that earlier on (15th to the 25th epoch) I had to reduce the learning rate by half and then had to go back to my original learning rate for the next 15 epochs, then once again I had to double the learning rate for the last 10 epochs.


### What else I would have done if I had more time
1. Refactor the code. The code is not modularized and there is code repetitions. Also Keras provides built in support for converting word to index and vice-versa - probably make use of Keras.Tokenizer.
2. Would have added a validation set to understand the validation accuracy during training.
3. Convert it to Python 3.
4. I would like to explore why when I cleaned the captions, the accuracy was not even comparable. Intuitively this should have given me better results. I would like to spend some time to understand this.
5. Explore ensemble methods. Spent some time trying to understand to do this for deep learning but was unable to find concrete examples.
6. I recently read that LSTSMs are stochastic, meaning different runs on the same dataset will give different results. One of the reasons, is the weight initialization. I will explore possibilities to make it predictable, may be check whether we could use any seed value.
7. Analyze the ones that was captioned incorrectly to understand what type of sentences the network is having problems with.
8. Analyze why some of the things that I tried did not work.

### Comparison between my results and results from current state of art. 
The [reference](https://github.com/anuragmishracse/caption_generator) project using VGG16 architecture got a BLEU score of 0.57. I used Densenet architecture and got a BLEU score of 0.500769703674.
