#Neural Image Captioning for Mortals (Part 1) 


##Introduction
Image captioning is a damn hard problem, one of those frontier-AI problems that defy what we think computers can really do.

This summer, I had an opportunity to work on this problem for the advanced development team during my internship at @indicodata. The work I did was fascinating but not revolutionary — I was attempting to understand and replicate the work of researchers who had published recent success stories of hybrid neural networks explaining photos.

![Four examples of images for which a hybrid neural network has automatically generated descriptions of the scenes](blogimages/karpathy-4-images.png "source: Andrej Karpathy http://cs.stanford.edu/people/karpathy/deepimagesent/")

The project extended over several weeks. It started with preliminary learning on how to implement common neural network architectures using Theano (a symbolic-math framework for the Python programming language), and finished with reading papers and creating models directly related to the image captioning task.

###A two-part post about process

While the point of this post is not to be a tutorial, I am going to explain my process of dividing the problem into several mini-projects, each of increasing difficulty, that would get me closer to the final goal: a model that generates a description of a scene in a natural photo. I’ll be providing my code with examples for you to follow along. If you are curious about how to go from reading a research paper to replicating its work, this post is for you.

The mini-projects are:

* __Rating how relevant an image and caption are to each other (Part 1)__
* Given an image of a hand-written digit, generating the word to describe it character-by-character (i.e. “z-e-r-o”) (Part 2)
* Given a natural scene photo, generating the sentence to describe it word-by-word (Part 2)

###A two-part post about gratitude

More importantly, I want to bring to light the fact that I stood on the shoulders of giants to be able to accomplish this feat. There are incredibly smart people in the world, and some of them happen to also be very generous in making their ideas, tools, models, and code accessible. I, for one, am so grateful that communities of people working on this stuff:

* share findings in research papers, detailed enough for an undergraduate intern like me to understand and hope to replicate!
* maintain open source tools like Theano to easily construct high-performing models
* provide pre-trained models via download or API as building blocks for bigger systems
* release their research source code for anyone to freely reuse in their own projects or experiments

The things brilliant people are willing to share are invaluable to the current generation of people pushing the boundaries in fields like machine learning.

##Project 1. Rating how relevant an image and caption are to each other

Scoping what parts to tackle first was important in my journey to automatically caption images. 

At the time, I reasoned that _generating_ sentences was a difficult task for my skillset. I postponed generating any form of language to Projects 2 and 3, which you can read about in Part 2 of this post. Instead, I decided to approach a slightly different problem with the same image caption dataset: ranking the most relevant captions for a photo.

One version of the task goes like this: you have one photo and a list of caption candidates. You’re going to sort the text captions by how relevant they are to the image.

The other version of the task is finding the most relevant image from a list of image candidates, given one caption. Both framings of the task are supported by the end deliverable of Project 1: a joint image-text embedder.

![An example of selecting the most relevant caption from a list given an image](blogimages/oneimage-manycaptions.png "An example of selecting the most relevant caption from a list given an image")

![An example of selecting the most relevant image from a list given a caption](blogimages/onecaption-manyimages.png "An example of selecting the most relevant image from a list given a caption")

###How does it work?
You start with a large dataset of images with their accompanying captions. Many of the best datasets for this task have been collected by asking roughly 5 labelers to describe an image, giving a healthy amount of diversity to the captions. I used the most recently released dataset for this challenge, the [Microsoft Common Objects in Context (MSCOCO)](http://mscoco.org/home/). It contains a little more than 80,000 images, each with at least 5 captions. This is a perfect dataset for training a supervised learning algorithm to relate captions with images.

The model I chose to implement came from the first half of a paper called [“Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models”](http://arxiv.org/abs/1411.2539). I'll do my best to describe the gist of how the encoder model works.

The purpose of this model is to encode the visual information from an image and the semantic information from a caption into a shared embedding space. This embedding space has the property that vectors that are close to each other are visually or semantically related.

Learning a model like this would be incredible. It would be a great way to measure how relevant an image and caption are to each other. For a batch of images and captions, we can use the model to map them all into this embedding space, compute a distance metric, and find the nearest neighbors of each image and each caption. If you rank the neighbors by which examples are closest, you have ranked how relevant images and captions are to each other.

![A visualization of finding nearest neighbors in a visual-semantic embedding space](blogimages/nearest-neighbors.png)
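To make the ranking step concrete, here is a minimal sketch in plain Python. It assumes the embeddings already exist; the three-dimensional vectors are made up for illustration, and the helper names `cosine_similarity` and `rank_by_relevance` are hypothetical, not taken from my project code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_relevance(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

# Toy embeddings, invented for this example.
image_embedding = [0.9, 0.1, 0.0]
caption_embeddings = [
    [0.0, 1.0, 0.0],  # unrelated caption
    [1.0, 0.0, 0.0],  # closely related caption
    [0.5, 0.5, 0.0],  # somewhat related caption
]

ranking = rank_by_relevance(image_embedding, caption_embeddings)
print(ranking)  # [1, 2, 0]
```

Swapping the roles (one caption embedding as the query, many image embeddings as candidates) gives the image-search version of the task with the same function.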

Okay, now imagine that two black boxes existed which had a whole bunch of knobs on them.  They represent the image encoder and caption encoder. How do we tune the knobs so that a related image and caption pair will be close in the embedding space, while the encoded vectors for an unrelated image and caption will have a large distance between them?

![A system diagram of the image encoder and caption encoder working to map the data in a visual-semantic embedding space](blogimages/image-caption-encoder-system-diagram1.png)

The answer is that we write a cost function that respects this property, comparing related pairs against contrastive (mismatched) visual-semantic pairs. If the similarity between a related pair is high, the cost should be small; if it is low, the cost should be large. If the similarity between a contrastive pair is low, the cost should be small; if it is high, the cost should be large. Then, compute the total cost for a batch from the training dataset, and use backpropagation and a gradient-descent-based algorithm to tune the knobs to make this cost low. This is the essence of learning to minimize a pairwise-ranking cost.

![The pairwise ranking cost function in its mathematical form](blogimages/pairwise-ranking-cost.png)
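As a rough illustration of a cost with this behavior, the sketch below uses a common margin-based formulation: a matching pair must beat every mismatched pair by at least some margin, otherwise it is penalized. The function name and the margin value of 0.2 are assumptions for this example, and the similarity matrix is made up; treat it as a sketch of the idea rather than the exact cost from the paper or my code.

```python
def pairwise_ranking_cost(sim, margin=0.2):
    """Margin-based ranking cost over a batch.

    sim[i][j] is the similarity between image i and caption j.
    Matching pairs sit on the diagonal (i == j).
    The margin value is an assumption chosen for illustration.
    """
    n = len(sim)
    cost = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Image i should score its own caption above caption j.
            cost += max(0.0, margin - sim[i][i] + sim[i][j])
            # Caption i should score its own image above image j.
            cost += max(0.0, margin - sim[i][i] + sim[j][i])
    return cost

# Related pairs similar, contrastive pairs dissimilar: nothing to penalize.
good = pairwise_ranking_cost([[1.0, 0.0], [0.0, 1.0]])
# The opposite situation: every comparison violates the margin.
bad = pairwise_ranking_cost([[0.0, 1.0], [1.0, 0.0]])
print(good, bad)
```

The cost is zero when every related pair already wins by the margin, so gradient descent only spends effort on the pairs that are still confused.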

There are two main branches to this system, one processing images and the other processing text.

On the image side, a convolutional neural network needs to be learned to convert an image pixel-grid into meaningful fixed-length feature vectors. These vectors are then linearly transformed in order to be further reduced down to a lower-dimensional embedding.

On the text side, sentences are parsed into words, each of which is then mapped into a continuous word vector embedding space. These word vectors are fed into a recurrent neural network sequentially, accumulating a sentence embedding over the entire sequence of words.
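The text branch can be sketched in a few lines of plain Python. This version assumes the simplest possible recurrence, a tanh RNN with randomly initialized weights and tiny made-up dimensions; the paper and my implementation may use a gated variant, and the names `matvec` and `encode_caption` are hypothetical.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def encode_caption(words, word_vectors, w_input, w_hidden):
    """Run a plain tanh RNN over the words and return the final hidden state."""
    hidden = [0.0] * len(w_hidden)
    for word in words:
        x = word_vectors[word]
        from_input = matvec(w_input, x)
        from_hidden = matvec(w_hidden, hidden)
        hidden = [math.tanh(a + b) for a, b in zip(from_input, from_hidden)]
    return hidden

random.seed(0)
word_dim, embed_dim = 4, 3  # tiny sizes, chosen only for the example
vocabulary = ["a", "dog", "runs"]
word_vectors = {w: [random.uniform(-1, 1) for _ in range(word_dim)] for w in vocabulary}
w_input = [[random.uniform(-0.5, 0.5) for _ in range(word_dim)] for _ in range(embed_dim)]
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(embed_dim)] for _ in range(embed_dim)]

sentence_embedding = encode_caption("a dog runs".split(), word_vectors, w_input, w_hidden)
print(len(sentence_embedding))
```

The hidden state after the last word is the sentence embedding; training adjusts `w_input` and `w_hidden` so that this vector lands near the embedding of the matching image.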

The two branches of this simple little flow chart are daunting at first thought, until one realizes that pre-trained models can be substituted in for the main components of both branches! Let me explain.

###What building blocks did I have at my disposal?

Recent advances in the engines that power computer vision models (e.g. more powerful hardware to run convolutional neural networks) and the amassing of large image datasets (e.g. ImageNet, whose commonly used ILSVRC benchmark subset has over 1 million images across 1,000 labeled classes) to fuel these algorithms have given rise to state-of-the-art object-recognition systems. More exciting, the last hidden layer in these networks produces a robust fixed-length feature vector that transfers well to other image understanding tasks. This is the image feature vector needed in the image-caption-ranking model.

Although the research papers on these models are public, the models are unwieldy: they can take days to train, and many more days to implement and understand entirely. The good news is that some people in the machine learning community have made these models available without the computational or mental overhead. For example, the machine learning guys at @indicodata host an image features API that produces these feature vectors by feeding an image through their pre-trained ImageNet model. This is what I used to compute the image features for all the images in the image-caption dataset.

Similar pre-trained models that allow one to extract feature vectors from words are also freely available. The one I used was GloVe: Global Vectors for Word Representation. The GloVe vectors I used were trained on 6 billion word tokens, from a corpus that included Wikipedia and Gigaword 5th edition. Training a word-vector model like GloVe from scratch would have taken weeks, an enormous amount of time and effort that would have put me way further back in my process.
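Using pre-trained word vectors mostly comes down to a lookup table. The sketch below assumes the plain-text layout that the pre-trained GloVe vectors are distributed in (one token per line, followed by its space-separated vector components). The helper name `load_word_vectors` is hypothetical, and the two sample lines are invented stand-ins for a real file.

```python
import io

def load_word_vectors(lines):
    """Parse lines of the form 'word v1 v2 ... vN' into a dict of float lists."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors

# A tiny in-memory stand-in for a real vector file (values are made up).
sample_file = io.StringIO(
    "sky 0.1 0.2 0.3\n"
    "kite 0.0 -0.5 0.25\n"
)

word_vectors = load_word_vectors(sample_file)
print(word_vectors["kite"])  # [0.0, -0.5, 0.25]
```

With a real file you would pass an open file handle instead of the `io.StringIO` object, and decide separately how to handle words that are missing from the vocabulary.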

I wouldn’t say my job was cake from here, but it was a ton more doable. The final tasks were to implement:

* an image encoder: a linear transformation from the 4096-dimensional image feature vector to a 300-dimensional embedding space
* a caption encoder: a recurrent neural network which takes word vectors as input at each time step, accumulates their collective meaning, and outputs a single sentence embedding at the end of the sequence
* a cost function based on the cosine similarity between image and caption embeddings
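The image encoder in the list above is small enough to sketch in full. This example uses tiny made-up dimensions in place of 4096 and 300, random weights in place of learned ones, and adds a unit-length normalization step, which is my assumption here (it makes cosine similarity a plain dot product) rather than something stated above. The name `encode_image` is hypothetical.

```python
import math
import random

def encode_image(features, weights):
    """Project image features into the embedding space, then scale to unit length."""
    projected = [sum(w * f for w, f in zip(row, features)) for row in weights]
    norm = math.sqrt(sum(p * p for p in projected))
    return [p / norm for p in projected]

random.seed(0)
feature_dim, embed_dim = 8, 3  # stand-ins for 4096 and 300
# One row of weights per embedding dimension; these would be learned in training.
weights = [[random.gauss(0, 0.1) for _ in range(feature_dim)] for _ in range(embed_dim)]
# Stand-in for the feature vector produced by a pre-trained vision model.
image_features = [random.random() for _ in range(feature_dim)]

image_embedding = encode_image(image_features, weights)
print(len(image_embedding))
```

The only trainable part is `weights`, which is why this branch is so much cheaper to train than the convolutional network that produced the features.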

These tasks were accomplished without much headache, for the following reasons:

* A linear transformation is effectively a weight matrix multiply, which is one of the simplest operations one can do in any machine learning model implementation.
* I had code for the recurrent neural network, which I had used in the task of sentiment analysis. This problem is closely related to the caption encoder, as both take sequence input (words of a sentence or paragraph) and return a single output (sentence embedding vector or sentiment score).
* The cost function included computing norms of vectors, sums, subtractions, and multiplies, all very common array operations in scientific Python. It ended up being even easier than I expected when I stumbled upon a [code example by Ryan Kiros](https://github.com/ryankiros/skip-thoughts/blob/master/eval_rank.py#L146), one of the researchers who published the paper, with the same pairwise similarity cost function written in Python/NumPy.

###Source Code and Demos!
I hope I've intrigued you enough to be hungry for more details! _Please_ take the learnings I've presented here to create something awesome of your own. 

* [Python code for my ranking encoder implementation](https://github.com/youralien/image-captioning-for-mortals/tree/master/project1). Everything is written in [Blocks/Fuel](http://arxiv.org/abs/1506.00619), a framework that helps you build and manage neural network models using Theano.
* [IPython Notebook that demonstrates Phrase-based Image Search](https://github.com/youralien/image-captioning-for-mortals/blob/master/project1/phrase_based_image_search.ipynb), an excellent application of the image caption embedding models. I inputted the example phrase _"in the sky"_ and the images it returns are of airplanes, kites, birds, etc. flying in the sky!