# "A Caption Generator Full of Curiosities"
> "A creative caption generator based on limited and varied training data."

- toc: true
- branch: master
- badges: true
- comments: true
- categories: [natural language processing, caption generation, LSTM neural network]
- hide: false
- search_exclude: true

## Theory
### What does a neural-network-based caption generator do?
- A caption generator takes a photograph and outputs a caption. A neural-network-based caption generator learns which words to output from the photographs and captions that it has been trained on.
- All of the neural-network-based caption generator projects I found used large datasets of images with descriptive captions (e.g., the Flickr8k dataset, containing 8,000 images with 5 captions each). The three best example projects that I referenced when creating my project are listed below:
    - [Image Captioning with Keras](https://towardsdatascience.com/image-captioning-with-keras-teaching-computers-to-describe-pictures-c88a46a311b8) by Harshall Lamba
    - [How to Develop a Deep Learning Photo Caption Generator from Scratch](https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/) by Dr. Jason Brownlee
    - [Python based Project – Learn to Build Image Caption Generator with CNN & LSTM](https://data-flair.training/blogs/python-based-project-image-caption-generator-cnn/) by the Dataflair Team
- The goal of these caption generators was to describe the contents of the images precisely, not to develop creative captions.

### What was the purpose of my caption generator?
- I created a caption generator that was trained on travel photographs from the "Places" category on the website [Atlas Obscura](https://www.atlasobscura.com/places) in order to help people write creative travel captions by capturing the mood of the photographs. My secondary goal was to explore the uses and limits of neural networks.
- I described in [my last blog post](https://ynusinovich.github.io/Data-Science-Journeys/web%20scraping/scrapy/xpath%20and%20css%20selectors/2020/05/04/The-Itsy-Bitsy-Spider-Climbed-into-Your-Website.html) how I used Scrapy to download the image links and image captions from Atlas Obscura.
- Due to the large variations in captions on Atlas Obscura, which were meant to be historical and artistic and not just descriptive, I did not expect the model to generate perfect captions. [A similar type of project](https://towardsdatascience.com/do-it-for-the-gram-instagram-style-caption-generator-4e7044766e34) for generating Instagram captions ended up creating repeating captions for most photographs.
- The first time fitting my caption generator model gave me the following odd caption for every image I tested: "the worlds largest bioluminescent salt mine is the most remote place in the world and the worlds largest collection of curiosities and curiosities and curiosities and curiosities and curiosities and curiosities and curiosities and curiosities and curiosities and curiosities," which is the inspiration for the name of this blog post.

### What is a neural network?
As explained in [this article from MIT News](https://news.mit.edu/2017/explained-neural-networks-deep-learning-0414), "Neural net\[work\]s are a means of doing machine learning, in which a computer learns to perform some task by analyzing training examples. Usually, the examples have been hand-labeled in advance. An object recognition system, for instance, might be fed thousands of labeled images of cars, houses, coffee cups, and so on, and it would find visual patterns in the images that consistently correlate with particular labels."

## Examples
### What would a *travel* caption generator be good for?
- Instagram, Facebook, or Pinterest captions
- Writing travel articles
- Advertising travel destinations

### What could *other* caption generators be good for?
- Restaurant menus
- Real estate brochures
- Furniture descriptions

## Code Part 1
### How to create a travel photograph caption generator.
To start, complete the necessary imports. The command at the end resolved memory issues that stopped the model from running, but you may not need to use it:

In [2]:
import numpy as np
import pandas as pd
import os
import string
import pickle
import math
import tensorflow as tf
from keras.preprocessing import image
from keras.preprocessing.sequence import pad_sequences
from keras.applications.inception_v3 import InceptionV3, preprocess_input
from keras.models import Model, Sequential, load_model
from keras.layers import Input, Dense, Dropout, Flatten, Conv2D, MaxPooling2D, LSTM, Bidirectional, Embedding
from keras.layers.merge import add
from keras.utils import to_categorical, plot_model
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras import regularizers, optimizers

os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'

Using TensorFlow backend.


Next, load the captions into a dictionary and remove punctuation and numbers from them:

In [2]:
# creating the vocabulary on the first run

caption_df = pd.read_csv("../Data/atlas_edits_clean.csv")
caption_df.rename(columns = {"Unnamed: 0": "image_number"}, inplace = True)

In [3]:
# creating the vocabulary on the first run

caption_dict = dict()
for i in range(0,len(caption_df.index)):
    caption_dict[caption_df.loc[i, "image_number"]] = caption_df.loc[i, "description"]

In [4]:
# creating the vocabulary on the first run

table = str.maketrans('', '', string.punctuation)
# characters to replace, characters to replace them with, characters to delete

for image_number, description in caption_dict.items():
    # tokenize
    description = description.split()
    # convert to lower case
    description = [word.lower() for word in description]
    # remove punctuation from each token
    description = [word.translate(table) for word in description]
    # remove hanging 's' and 'a'
    description = [word for word in description if len(word) > 1]
    # remove tokens with numbers in them
    description = [word for word in description if word.isalpha()]
    # store as string
    description = ' '.join(description)
    # save in dict
    caption_dict[image_number] =  description
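
As a quick check of the cleaning steps above, the sketch below applies the same operations to a single made-up caption. The helper name `clean_caption` and the sample sentence are assumptions for illustration and are not part of the original notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character
table = str.maketrans('', '', string.punctuation)

def clean_caption(description):
    """Apply the same cleaning steps as the loop above to one caption (hypothetical helper)."""
    words = description.split()
    words = [word.lower() for word in words]
    words = [word.translate(table) for word in words]
    # drop single-character leftovers such as a hanging 's' or 'a'
    words = [word for word in words if len(word) > 1]
    # drop tokens that contain digits
    words = [word for word in words if word.isalpha()]
    return ' '.join(words)

sample = "The world's largest salt mine, opened in 1896, is a must-see!"
cleaned = clean_caption(sample)
print(cleaned)
```

Note that removing punctuation joins hyphenated words ("must-see" becomes "mustsee") and turns possessives into plurals ("world's" becomes "worlds"), which is why "worlds" shows up in the generated captions later in this post.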

In [5]:
# creating the vocabulary on the first run

# Create a list of all the training captions
all_captions = []
for image_number, description in caption_dict.items():
    all_captions.append(description)

# Consider only words which occur at least 3 times in the corpus
word_count_threshold = 3
word_counts = dict()
num_sentences = 0
for sent in all_captions:
    num_sentences += 1
    for w in sent.split(' '):
        word_counts[w] = word_counts.get(w, 0) + 1 # add one to the count of the word

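vocab = [w for w in word_counts if word_counts[w] >= word_count_threshold]
# the 'startseq' and 'endseq' markers are added to the captions later,
# so make sure they are part of the vocabulary and survive encoding
vocab += [w for w in ('startseq', 'endseq') if w not in vocab]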

print(f'Preprocessed words {len(vocab)} ')
print(f'Number of sentences {num_sentences} ')
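
The effect of the word count threshold is easier to see on a tiny corpus. The sketch below uses `collections.Counter` instead of the manual dictionary, with three made-up captions and a threshold of 2. The names prefixed with `toy_` are assumptions for illustration.

```python
from collections import Counter

# Toy captions standing in for the cleaned Atlas Obscura captions (made up)
toy_captions = [
    "the largest cave in the world",
    "the oldest cave in europe",
    "the largest collection of curiosities",
]

# Count every word across all captions
toy_counts = Counter(word for caption in toy_captions for word in caption.split())

# Keep only words seen at least `toy_threshold` times
toy_threshold = 2
toy_vocab = sorted(word for word, count in toy_counts.items() if count >= toy_threshold)
print(toy_vocab)
```

Words below the threshold, such as "curiosities" here, are left out of the vocabulary, so they are skipped when the captions are encoded for training.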

In [6]:
# creating the vocabulary on the first run

def save_obj(obj, name):
    with open('../Obj/' + name + '.pickle', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
        
save_obj(vocab, "vocab")

On subsequent runs of your code, the vocabulary can be loaded from the folder you saved it to:

In [7]:
# loading the vocabulary on subsequent runs

def load_obj(name):
    with open('../Obj/' + name + '.pickle', 'rb') as f:
        return pickle.load(f)
    
vocab = load_obj("vocab")

Add 1 to the vocabulary size to account for index 0, which is reserved for the zeros used to pad the sequences:

In [8]:
vocab_size = len(vocab) + 1

Add "startseq" to the beginning of each sequence and "endseq" to the end, so the model learns how to start and end a caption, and then save the finalized caption dictionary:

In [9]:
# creating the caption dictionary on the first run

for image_number, description in caption_dict.items():
    tokens = description.split()
    caption_dict[image_number] = 'startseq ' + ' '.join(tokens) + ' endseq'
    print(f"Image {image_number} added to caption dictionary")
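
Because the encoding step later in this post skips any word that is missing from the word-to-index mapping, the markers only reach the model if they are part of the vocabulary. The sketch below shows the difference on one toy caption. The `encode` helper and both mappings are assumptions for illustration.

```python
# Toy caption, made up for illustration
toy_caption = "ancient cave church"
wrapped = 'startseq ' + toy_caption + ' endseq'

def encode(caption, word_to_index):
    """Map words to indices, silently dropping words missing from the mapping."""
    return [word_to_index[w] for w in caption.split() if w in word_to_index]

# Mapping without the markers: they are dropped during encoding
without_markers = {"ancient": 1, "cave": 2, "church": 3}
# Mapping with the markers: the sequence keeps its start and end
with_markers = {"ancient": 1, "cave": 2, "church": 3, "startseq": 4, "endseq": 5}

encoded_without = encode(wrapped, without_markers)
encoded_with = encode(wrapped, with_markers)
print(encoded_without)
print(encoded_with)
```

A quick `'endseq' in vocab` check before training is a cheap way to confirm that the model can learn where a caption stops.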

In [10]:
# creating the caption dictionary on the first run

def save_obj(obj, name):
    with open('../Obj/' + name + '.pickle', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
        
save_obj(caption_dict, "caption_dict")

On subsequent runs of your code, the caption dictionary can be loaded from the folder you saved it to:

In [11]:
# loading the caption dictionary on subsequent runs:
    
caption_dict = load_obj("caption_dict")

Run the images through both the pre-processing step and the processing step of [Google's InceptionV3 image processing model](https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202). Remove the last layer of the model (the softmax classification layer) so that the output is a 2,048-element feature vector for each image. The removed layer would have taken that feature vector and outputted image classes based on the images that the model was trained on. I saved after each step, but you could also complete both pre-processing and processing and then save:

In [12]:
# Get the InceptionV3 model trained on imagenet data
model = InceptionV3(weights = 'imagenet')
# Remove the last layer (output softmax layer) from the InceptionV3
model_new = Model(model.input, model.layers[-2].output)





In [13]:
# creating the image dictionaries on the first run

pre_image_dict = dict()

for filename in os.listdir("../Data/Atlas_Images/"):
    if filename.endswith(".jpg"): 
        image_path = os.path.join("../Data/Atlas_Images/", filename)
        # Convert all the images to size 299 x 299 as expected by the InceptionV3 Model
        img = image.load_img(image_path, target_size = (299, 299))
        # Convert PIL image to numpy array of 3-dimensions
        x = image.img_to_array(img)
        # Add one more dimension
        x = np.expand_dims(x, axis = 0)
        # preprocess images using preprocess_input() from inception module
        x = preprocess_input(x)
        # add to pre-imagedict
        image_number = int(filename.split(".")[0])
        pre_image_dict[image_number] = x
        print(f"Image {image_number} added to pre-image dictionary")

In [14]:
# creating the image dictionaries on the first run

save_obj(pre_image_dict, "pre_image_dict")

In [15]:
# creating the image dictionaries on the first run

image_dict = dict()

for filename in os.listdir("../Data/Atlas_Images/"):
    if filename.endswith(".jpg"): 
        image_path = os.path.join("../Data/Atlas_Images/", filename)
        image_number = int(filename.split(".")[0])
        # look up the pre-processed array for this image
        x = pre_image_dict[image_number]
        feature = model_new.predict(x, verbose = 0)
        feature = np.reshape(feature, feature.shape[1])
        image_dict[image_number] = feature
        print(f"Image {image_number} added to image dictionary")

In [16]:
# creating the image dictionaries on the first run

save_obj(image_dict, "image_dict")

Please note that the only reason I saved the pre-image dictionary was that my computer crashed between the pre-processing step and the processing step. If yours doesn't, then don't save the pre-image dictionary - it's quite large.

On subsequent runs of your code, the image dictionary can be loaded from the folder you saved it to. You do not need to re-load the pre-image dictionary:

In [17]:
# load the image dictionary on subsequent runs
    
image_dict = load_obj("image_dict")

Generate an index to encode words to numbers and back to words. Additionally, calculate the maximum length of the captions:

In [19]:
index_to_word = {}
word_to_index = {}
index = 1
for word in vocab:
    word_to_index[word] = index
    index_to_word[index] = word
    index += 1
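
A short round trip shows what the two mappings do. This sketch builds them with `enumerate` for a four-word toy vocabulary (all `toy_` names are assumptions for illustration), encodes a phrase, and decodes it again.

```python
toy_vocab = ["cave", "church", "largest", "the"]

# Indices start at 1 so that 0 stays free for padding
toy_word_to_index = {word: i for i, word in enumerate(toy_vocab, start=1)}
toy_index_to_word = {i: word for word, i in toy_word_to_index.items()}

encoded = [toy_word_to_index[w] for w in "the largest cave".split()]
decoded = ' '.join(toy_index_to_word[i] for i in encoded)

# One extra slot for the padding index 0
toy_vocab_size = len(toy_vocab) + 1

print(encoded)
print(decoded)
print(toy_vocab_size)
```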

In [20]:
# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(a_caption_dict):
    all_desc = list()
    for key in a_caption_dict.keys():
        all_desc.append(a_caption_dict[key])
    return all_desc

# calculate the length of the description with the most words
def get_max_length(a_caption_dict):
    lines = to_lines(a_caption_dict)
    return max(len(caption.split()) for caption in lines)

# determine the maximum sequence length
# (the function has its own name so that this variable does not overwrite it)
max_length = get_max_length(caption_dict)
print(f'Max Caption Length, in Words: {max_length}')

Max Caption Length, in Words: 40


Create a data generator function. The data generator continuously takes images and captions and yields batches of data for the model to fit. The images are in the form of feature arrays. Each caption is split into partial input word sequences, padded with zeros to the same length, and the next word after each partial sequence is encoded categorically (one-hot) based on the vocabulary:

In [21]:
def data_generator(a_caption_dict, an_image_dict, a_word_to_index, a_max_length, a_num_photos_per_batch):
    X1, X2, y = list(), list(), list()
    n = 0
    # loop forever over images
    while 1:
        for image_number, caption in a_caption_dict.items():
            n += 1
            # retrieve the photo features
            photo_data = an_image_dict[image_number]
            # encode the sequence
            seq = [a_word_to_index[word] for word in caption.split(' ') if word in a_word_to_index]
            # split one sequence into multiple X, y pairs
            for i in range(1, len(seq)):
                # split into input and output pair
                in_seq, out_seq = seq[:i], seq[i]
                # pad input sequence
                in_seq = pad_sequences([in_seq], maxlen = a_max_length)[0]
                # encode output sequence
                out_seq = to_categorical([out_seq], num_classes = vocab_size)[0]
                # store
                X1.append(photo_data)
                X2.append(in_seq)
                y.append(out_seq)
            # yield the batch data
            if n == a_num_photos_per_batch:
                yield [[np.array(X1), np.array(X2)], np.array(y)]
                X1, X2, y = list(), list(), list()
                n = 0
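
To make the "one caption becomes many training pairs" step concrete, here is a minimal pure-Python sketch. It pads on the left, which is also the default of Keras `pad_sequences`, and keeps the target as a plain index rather than one-hot encoding it. The helper names and the word indices are assumptions for illustration.

```python
def left_pad(seq, maxlen, value=0):
    """Pad with zeros on the left, mirroring the default behavior of Keras pad_sequences."""
    seq = seq[-maxlen:]  # keep the last maxlen items if the sequence is too long
    return [value] * (maxlen - len(seq)) + seq

def caption_to_pairs(seq, maxlen):
    """Split one encoded caption into (padded input prefix, next word index) pairs."""
    return [(left_pad(seq[:i], maxlen), seq[i]) for i in range(1, len(seq))]

# Encoded toy caption: startseq=1, ancient=2, church=3, endseq=4 (indices made up)
toy_seq = [1, 2, 3, 4]
pairs = caption_to_pairs(toy_seq, maxlen=5)
for in_seq, out_word in pairs:
    print(in_seq, '->', out_word)
```

A caption of n tokens produces n - 1 pairs, and every pair is stored together with the same photo features, which is why memory use grows quickly with the number of photos per batch.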

Create an embedding matrix using the pre-trained [GloVe vectors](https://nlp.stanford.edu/projects/glove/) to be able to embed the caption words as vectors based on their similarity:

In [22]:
# Load GloVe vectors
glove_dir = '../Data/'
embeddings_index = dict()
file = open(os.path.join(glove_dir, 'glove.6B.200d.txt'), encoding = "utf-8")
for line in file:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype = 'float32')
    embeddings_index[word] = coefs
file.close()

In [23]:
embedding_dim = 200
# Get 200-dim dense matrix for each of the words in our vocabulary
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, index in word_to_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in the embedding index will be all zeros
        embedding_matrix[index] = embedding_vector
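
The same lookup can be traced by hand on a toy example. The sketch below uses made-up 3-dimensional vectors in place of the 200-dimensional GloVe vectors and plain lists in place of a NumPy array. All `toy_` names and values are assumptions for illustration.

```python
# Toy 3-dimensional "pre-trained" vectors standing in for GloVe (values made up)
toy_embeddings_index = {
    "cave": [0.1, 0.2, 0.3],
    "church": [0.4, 0.5, 0.6],
}
toy_word_to_index = {"cave": 1, "church": 2, "curiosities": 3}
toy_dim = 3

# Row 0 is the padding row; rows for words without a pre-trained vector stay all zeros
toy_matrix = [[0.0] * toy_dim for _ in range(len(toy_word_to_index) + 1)]
for word, idx in toy_word_to_index.items():
    vector = toy_embeddings_index.get(word)
    if vector is not None:
        toy_matrix[idx] = list(vector)

for row in toy_matrix:
    print(row)
```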

Create a neural network model to process the image and caption data. The image extractor uses a regular dense layer with dropout for regularization, and the caption sequencer uses an embedding layer, dropout for regularization, and a [long short-term memory](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) (LSTM) network to remember words from earlier in the sequence. Finally, the two outputs are added together and sent to two more dense layers, the first of which has [L2 regularization](https://ynusinovich.github.io/Data-Science-Journeys/linear%20regression/regularization/2020/04/01/My-Linear-Regressions-Are-Looking-Irregular.html). The weights of the embedding layer are fixed so that the fitting of the model does not change the pre-trained vectors of the words.

In [24]:
# image feature extractor model
inputs1 = Input(shape = (2048,))
fe1 = Dropout(0.05)(inputs1)
fe2 = Dense(256, activation = 'relu')(fe1)

# partial caption sequence model
inputs2 = Input(shape = (max_length,))
se1 = Embedding(vocab_size, embedding_dim, mask_zero = True)(inputs2)
se2 = Dropout(0.1)(se1)
se3 = LSTM(256)(se2)

# decoder (feed forward) model
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation = 'relu', kernel_regularizer = regularizers.l2(0.005))(decoder1)
outputs = Dense(vocab_size, activation = 'softmax')(decoder2)

# merge the two input models
model = Model(inputs = [inputs1, inputs2], outputs = outputs)

In [25]:
model.layers[2].set_weights([embedding_matrix])
model.layers[2].trainable = False

In [26]:
model.summary()

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            (None, 40)           0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 2048)         0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 40, 200)      1609400     input_3[0][0]                    
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 2048)         0           input_2[0][0]                    
____________________________________________________________________________________________
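
The parameter counts can be checked by hand. The embedding row of the summary above reports 1,609,400 parameters, which with 200-dimensional vectors implies a vocabulary size of 8,047. The sketch below derives that number and applies the standard formulas for dense and LSTM layers with biases to the layer sizes from the model definition. The remaining counts are computed from those formulas, not copied from the summary, and the helper names are assumptions for illustration.

```python
embedding_params = 1609400   # from the embedding row of the summary
embedding_dim = 200
vocab_size = embedding_params // embedding_dim

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def lstm_params(n_in, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (n_in + units) + units)

print(vocab_size)                        # implied vocabulary size
print(dense_params(2048, 256))           # image feature dense layer
print(lstm_params(embedding_dim, 256))   # caption LSTM
print(dense_params(256, 256))            # first decoder dense layer
print(dense_params(256, vocab_size))     # softmax output layer
```

Under these formulas the output layer is the largest trainable part of the model, since the embedding weights are frozen.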

Select a `learning_rate`. [This blog post](https://machinelearningmastery.com/learning-rate-for-deep-learning-neural-networks/) is useful guidance when setting the `learning_rate`. Use `categorical_crossentropy` for the `loss`, since each output word corresponds to one of the words (categories) in the vocabulary:

In [27]:
opt = optimizers.Adam(learning_rate = 0.005)
model.compile(loss = 'categorical_crossentropy', optimizer = opt)
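
For a one-hot target, categorical cross-entropy reduces to the negative log of the probability that the model assigned to the correct word. The sketch below computes it by hand for a toy vocabulary of 4 words. The function and the probability values are assumptions for illustration, not the Keras implementation.

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, predicted_probs) if t > 0)

# Toy vocabulary of 4 words; the true next word is at index 2
target = [0, 0, 1, 0]
confident = [0.05, 0.05, 0.85, 0.05]
unsure = [0.25, 0.25, 0.25, 0.25]

loss_confident = categorical_crossentropy(target, confident)
loss_unsure = categorical_crossentropy(target, unsure)
print(round(loss_confident, 4))
print(round(loss_unsure, 4))
```

A model that guesses uniformly over a vocabulary of V words has a loss of ln(V), which gives a rough baseline for judging the training loss.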

Finally, train the model and then save it. You can reduce `num_photos_per_batch` if you need to use less memory, and you can add `EarlyStopping` so that the model stops training once `loss` has not decreased for a `patience` of 5 epochs. The `steps_per_epoch` value below follows the default described in the [Keras documentation](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit), which is the number of samples divided by the batch size, rounded up:

In [None]:
num_photos_per_batch = 16

generator = data_generator(caption_dict, image_dict, word_to_index, max_length, num_photos_per_batch)

early_stop = EarlyStopping(monitor = "loss", 
                           patience = 5,
                           min_delta = 0,
                           restore_best_weights = True)

model.fit_generator(generator, epochs = 100, verbose = 1,
                    steps_per_epoch = math.ceil(len(caption_dict)/num_photos_per_batch),
                    callbacks = [early_stop])
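
The two training settings can be reasoned about without running Keras. The sketch below computes `steps_per_epoch` for a hypothetical dataset of 1,000 captioned photos and simulates a simplified version of the patience rule on a made-up loss history. Both helper functions and all numbers are assumptions for illustration, not the Keras implementation.

```python
import math

def steps_per_epoch(num_captions, photos_per_batch):
    """Number of generator batches needed to see every captioned photo once."""
    return math.ceil(num_captions / photos_per_batch)

def stopping_epoch(losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never does.

    Simplified: an epoch counts as an improvement only if its loss is strictly
    lower than the best loss seen so far (min_delta = 0).
    """
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

toy_losses = [5.0, 4.2, 3.9, 3.95, 3.91, 3.9, 4.0, 3.92, 3.8]
print(steps_per_epoch(1000, 16))
print(stopping_epoch(toy_losses, patience=5))
print(stopping_epoch(toy_losses, patience=6))
```

In this toy history, a patience of 5 stops training one epoch before the loss improves again, while a patience of 6 lets it continue, which illustrates the trade-off when choosing `patience`.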


In [None]:
model.save("../Obj/final_model_2.h5")

## Code Part 2
### How to output captions with your travel photograph caption generator.
Start with the same steps as in the previous section. First, load up the vocabulary:

In [3]:
def load_obj(name):
    with open('../Obj/' + name + '.pickle', 'rb') as f:
        return pickle.load(f)
    
vocab = load_obj("vocab")

In [4]:
vocab_size = len(vocab) + 1

In [5]:
index_to_word = {}
word_to_index = {}
index = 1
for word in vocab:
    word_to_index[word] = index
    index_to_word[index] = word
    index += 1

In [6]:
max_length = 40

Then, process the images in your test folder using the InceptionV3 model, and load the trained caption generator model:

In [7]:
pre_model = InceptionV3(weights = 'imagenet')
new_model = Model(pre_model.input, pre_model.layers[-2].output)

In [8]:
image_dict = dict()

for filename in os.listdir("../Data/Test_Photos/"):
    if filename.endswith(".jpg") or filename.endswith(".jpeg") or filename.endswith(".png"): 
        image_path = os.path.join("../Data/Test_Photos/", filename)
        # Convert all the images to size 299 x 299 as expected by the InceptionV3 Model
        img = image.load_img(image_path, target_size = (299, 299))
        # Convert PIL image to numpy array of 3-dimensions
        x = image.img_to_array(img)
        # Add one more dimension
        x = np.expand_dims(x, axis = 0)
        # preprocess images using preprocess_input() from inception module
        x = preprocess_input(x)
        feature = new_model.predict(x, verbose = 0)
        feature = np.reshape(feature, feature.shape[1])
        image_number = int(filename.split(".")[0])
        image_dict[image_number] = feature

In [9]:
# load the model under the same filename it was saved with in Code Part 1
model = load_model("../Obj/final_model_2.h5")

Finally, define a function to generate captions. Given a photograph processed through the steps above, the function picks at random one of the top `a_randomness` word indices predicted by the model, converts it back to a word, and adds it to the output sequence. Each following word is predicted from the photograph and the words generated so far, until the model outputs "endseq" or the caption reaches the maximum length. Using the top `a_randomness` words, instead of restricting the caption generator function to the single most likely word, introduces some randomness to prevent captions from looking too similar:

In [11]:
def captionGenerator(a_photo, a_randomness):
    in_text = 'startseq'
    for i in range(max_length):
        sequence = [word_to_index[w] for w in in_text.split() if w in word_to_index]
        sequence = pad_sequences([sequence], maxlen = max_length)
        yhat = model.predict([a_photo, sequence], verbose = 0)
        yhat = np.random.choice((-yhat).argsort(axis = -1)[:, :a_randomness][0])
        word = index_to_word[yhat]
        in_text += ' ' + word
        if word == 'endseq':
            break
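    # drop the start marker, and drop the end marker only if one was generated,
    # so that a caption cut off at max_length does not lose its last word
    final = in_text.split()[1:]
    if final and final[-1] == 'endseq':
        final = final[:-1]
    final = ' '.join(final)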
    return final
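
The word selection step can be isolated in a few lines of plain Python. As in the function above, the choice among the top k candidates is uniform, so the predicted probabilities only decide which words make the shortlist. The helper `sample_top_k` and the toy probabilities are assumptions for illustration.

```python
import random

def sample_top_k(probs, k, rng=random):
    """Pick uniformly at random among the indices of the k largest probabilities."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return rng.choice(ranked[:k])

# Toy prediction over a 5-word vocabulary (values made up)
toy_probs = [0.02, 0.40, 0.08, 0.30, 0.20]

top_3 = sorted(range(len(toy_probs)), key=lambda i: toy_probs[i], reverse=True)[:3]
greedy = sample_top_k(toy_probs, k=1)
sampled = sample_top_k(toy_probs, k=3, rng=random.Random(0))

print(top_3)
print(greedy)
print(sampled in top_3)
```

With k = 1 the function always returns the most likely word, which reproduces the repeating captions described at the start of this post. Larger values trade coherence for variety.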

In [10]:
# this brief example only generates a caption for the test photo whose filename is 0 (for example, 0.jpg)

photo = image_dict[0].reshape(1, 2048)

In [None]:
print(captionGenerator(photo, 3))

### Example Captions for an Image
I used this sample photograph of Kotor, Montenegro, from the [NBC News](https://media4.s-nbcnews.com/i/newscms/2015_44/839916/lonely-planet-travel-kotor-today-151029-tease_84543027bb82cc77d65d99229ec6cac0.jpg) website:

![](2020-05-15-Image-1.jpeg)

Running the caption generator resulted in captions like:
- "the largest manmade crater on the earth and and an ancient egyptian village in an extreme world of the world is now home to the most important collection with oddities and oddities specimens and antique penny of dentistry and"
- "of the world and most powerful of the world and an ancient church with the largest collection of the world and most important and curiosities of pathological planetariums and and the largest planetariums of the world and most curious"
- "and the worlds tallest cave in north america the world is the worlds tallest hole in europe is the largest drain in europe of the world is now an amazing illusion and an eccentric illusion of the illustrious achievements"

Powerful. I couldn't have captioned it better myself.

Feel free to play around with a smaller version of the caption generator (that would fit on the free version of Heroku) in [my web app](https://caption-generator-obscura.herokuapp.com/). Keep in mind that on the web app, I refer to the "randomness" as "unpredictability."