# im2latex(S): Deep Learning Model

&copy; Copyright 2017 Sumeet S Singh

    This file is part of the im2latex solution (by Sumeet S Singh in particular since there are other solutions out there).

    This program is free software: you can redistribute it and/or modify
    it under the terms of the Affero GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    Affero GNU General Public License for more details.

    You should have received a copy of the Affero GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.

## The Model
* Follows the [Show, Attend and Tell paper](https://www.semanticscholar.org/paper/Show-Attend-and-Tell-Neural-Image-Caption-Generati-Xu-Ba/146f6f6ed688c905fb6e346ad02332efd5464616)
* [VGG ConvNet (16 or 19)](http://www.robots.ox.ac.uk/~vgg/research/very_deep/) without the top-3 layers
    * Pre-initialized with the VGG weights but allowed to train
    * The ConvNet outputs $D$-dimensional vectors in a $W \times H$ grid, where $W$ and $H$ are 1/32nd of the input image's width and height (due to 5 max-pool layers). Defining $W \cdot H \equiv L$, the ConvNet output represents $L$ locations of the image, $i \in [1,L]$, and correspondingly yields $L$ annotation vectors $a_i$, each of size $D$.
* A dense (FC) attention model: The deterministic soft-attention model of the paper computes the weights $\alpha_{t,i}$, which are used to select or blend the $a_i$ vectors before they are fed as input to the decoder LSTM network (see below).
    * Inputs to the attention model are $a_i$ and $h_{t-1}$ (the previous hidden state of the LSTM network, see below),
    and $$\alpha_{t,i} = \mathrm{softmax} ( f_{att}(a_i, h_{t-1}) )$$
* A Decoder model: A conditioned LSTM that outputs probabilities of the text tokens $y_t$ at each step. The LSTM is conditioned upon $z_t = \sum_{i=1}^{L} \alpha_{t,i} a_i$ and takes the previous hidden state $h_{t-1}$ as input. In addition, an embedding of the previous output, $Ey_{t-1}$, is also input to the LSTM. At training time, $y_{t-1}$ is taken from the training samples, while at inference time it is fed back from the previously predicted token.
    * $y$ is taken from a fixed vocabulary of K words. An embedding matrix $E$ is used to narrow its representation. The embedding weights $E$ are learnt end-to-end by the model as well.
    * The decoder LSTM uses a deep layer between $h_t$ and $y_t$. It is called a deep output layer and is described in [section 3.2.2 of this paper](https://www.semanticscholar.org/paper/How-to-Construct-Deep-Recurrent-Neural-Networks-Pascanu-G%C3%BCl%C3%A7ehre/533ee188324b833e059cb59b654e6160776d5812). That is:
    $$ p(y_t) = \mathrm{Softmax} \Big( f_{out}(Ey_{t-1}, h_t, z_t) \Big) $$
* Initialization MLPs: Two MLPs are used to produce the initial memory state $c_0$ of the LSTM and its initial hidden state $h_0$ (i.e. the $h_{t-1}$ value at the first step). Each MLP takes the entire image's features (i.e. the average of the $a_i$) as its input and is trained end-to-end.
    $$ c_0 = f_{init,c}\Big( \frac{1}{L} \sum_{i=1}^{L} a_i \Big) $$
    $$ h_0 = f_{init,h}\Big( \frac{1}{L} \sum_{i=1}^{L} a_i \Big) $$
* Training:
    * The three models above (all except the conv-net) are trained end-to-end using SGD
    * The model is trained for a variable number of time steps, depending on the sequence length of each batch
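
The initialization MLPs described above can be sketched in NumPy as follows. This is a minimal illustration rather than the notebook's implementation: it assumes a single dense layer with `tanh` activation per MLP, and the function name `init_lstm_state` and the weight names are hypothetical.

```python
import numpy as np

def init_lstm_state(a, W_c, b_c, W_h, b_h):
    """Map annotation vectors a of shape (B, L, D) to (c_0, h_0), each (B, n)."""
    a_mean = a.mean(axis=1)                    # (B, D): average over the L locations
    c_0 = np.tanh(np.dot(a_mean, W_c) + b_c)   # initial memory state
    h_0 = np.tanh(np.dot(a_mean, W_h) + b_h)   # initial hidden state
    return c_0, h_0

# Toy sizes for illustration; the HYPER settings use L=99, D=512, n=1000.
rng = np.random.RandomState(0)
B, L, D, n = 2, 6, 8, 5
a = rng.randn(B, L, D)
c_0, h_0 = init_lstm_state(a,
                           0.1 * rng.randn(D, n), np.zeros(n),
                           0.1 * rng.randn(D, n), np.zeros(n))
print(c_0.shape)
print(h_0.shape)
```

Both states depend only on the averaged image features, so they can be computed once per image before the first decoding step.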

## References
1. Show, Attend and Tell
    * [Paper](https://www.semanticscholar.org/paper/Show-Attend-and-Tell-Neural-Image-Caption-Generati-Xu-Ba/146f6f6ed688c905fb6e346ad02332efd5464616)
    * [Slides](https://pdfs.semanticscholar.org/b336/f6215c3c15802ca5327cd7cc1747bd83588c.pdf?_ga=2.52116077.559595598.1498604153-2037060338.1496182671)
    * [Author's Theano code](https://github.com/kelvinxu/arctic-captions)
1. [Simonyan, Karen and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” CoRR abs/1409.1556 (2014): n. pag.](http://www.robots.ox.ac.uk/~vgg/research/very_deep/)
1. [im2latex solution of Harvard NLP](http://lstm.seas.harvard.edu/latex/)
1. [im2latex-dataset tools forked from Harvard NLP](https://github.com/untrix/im2latex-dataset)

In [1]:
import pandas as pd
import os
import dl_commons as dlc
import tensorflow as tf
from keras.applications.vgg16 import VGG16
from keras.layers import Input, Embedding, Dense, Activation, Dropout, Concatenate, Permute
from keras.callbacks import LambdaCallback
from keras.models import Model
from keras import backend as K
from keras.engine import Layer
import keras
import threading
import numpy as np
from scipy import ndimage  # ndimage.imread (used by ImageIterator) needs SciPy < 1.2
import collections
from Im2LatexDecoderRNN import Im2LatexDecoderRNN
from Im2LatexModel import Im2LatexModel, HYPER

Using TensorFlow backend.


{'init_layers': 1, 'init_c_activation': 'tanh', 'MeanSumAlphaEquals1': True, 'att_weighted_gather': True, 'init_1_n': 512, 'init_1_activation': 'tanh', 'keep_prob': 1.0, 'init_dropout_rate': 0.2, 'Max_Seq_Len': 151, 'att_layers': 1, 'init_h_activation': 'tanh', 'init_1_dropout_rate': 0.0, 'embeddings_initializer': 'glorot_uniform', 'embeddings_initializer_tf': <function _initializer at 0x81a1ba9b0>, 'image_shape': (120, 1075, 3), 'decoder_out_layers': 1, 'att_share_weights': True, 'att_1_n': 512, 'att_weights_initializer': 'glorot_normal', 'init_c_dropout_rate': 0.2, 'B': 128, 'D': 512, 'H': 3, 'K': 556, 'L': 99, 'pLambda': 0.0001, 'init_h_dropout_rate': 0.2, 'sum_logloss': True, 'output_follow_paper': True, 'W': 33, 'att_activation': 'tanh', 'm': 64, 'n': 1000, 'decoder_lstm_peephole': False, 'output_activation': 'tanh', 'output_1_n': 64}


# TODOs
* CTC loss requires that the 'blank' token have the last id. In our case that would be the white-space token. Ensure that it is assigned token-id 555.
* Introduce dropouts
* Implement the beta scalar ('selector') that scales alpha.
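
The first TODO (giving the blank token the last id, which is 555 when K = 556) could be handled by remapping the vocabulary, roughly as sketched below. The function `move_token_to_last_id` and the toy vocabulary are hypothetical; the real vocabulary lives in `df_vocab.pkl`, whose exact layout is not shown here.

```python
def move_token_to_last_id(token_to_id, blank_token):
    """Return a new mapping in which blank_token holds the largest id.

    Ids stay dense (0..K-1): the token that previously held the last id
    takes over the id vacated by blank_token.
    """
    remapped = dict(token_to_id)
    last_id = max(remapped.values())
    old_id = remapped[blank_token]
    # Find the current owner of the last id, then swap the two ids.
    last_token = next(t for t, i in remapped.items() if i == last_id)
    remapped[last_token] = old_id
    remapped[blank_token] = last_id
    return remapped

# Toy vocabulary of K=5 tokens; ' ' plays the role of the blank token.
vocab = {'\\frac': 0, ' ': 1, 'x': 2, '{': 3, '}': 4}
fixed = move_token_to_last_id(vocab, ' ')
print(fixed[' '])
```

Any sequences already encoded with the old ids would need the same swap applied to them.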

In [2]:
data_folder = '../data/generated2'

### HyperParams

In [3]:
def get_vocab_size(data_dir_):
    ## Token ids are zero-based, so the vocabulary size is max id + 1
    df_vocab = pd.read_pickle(os.path.join(data_dir_, 'df_vocab.pkl'))
    return df_vocab.id.max() + 1

### Encoder Model
[VGG ConvNet (16 or 19)](http://www.robots.ox.ac.uk/~vgg/research/very_deep/) without the top-3 layers
* Pre-initialized with the VGG weights but allowed to train
* The ConvNet outputs $D$-dimensional vectors in a $W \times H$ grid, where $W$ and $H$ are scaled-down dimensions of the input image (due to 5 max-pool layers). Defining $W \cdot H \equiv L$, the ConvNet output represents $L$ locations of the image, $i \in [1,L]$, and correspondingly yields $L$ annotation vectors $a_i$, each of size $D$.
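
The grid size follows from the five max-pool layers by simple arithmetic. The sketch below assumes size-preserving convolutions and 2x2, stride-2 pooling that floors odd dimensions; the helper name `conv_grid` is hypothetical.

```python
def conv_grid(height, width, num_pools=5):
    """Spatial size after num_pools 2x2, stride-2 max-pool layers.

    Assumes the convolutions preserve spatial size and that each
    pooling layer floors odd dimensions.
    """
    for _ in range(num_pools):
        height //= 2
        width //= 2
    return height, width

# image_shape from the HYPER settings is (120, 1075, 3)
H, W = conv_grid(120, 1075)
L = H * W
print((H, W, L))  # (3, 33, 99)
```

This agrees with the `H`, `W` and `L` values in the HYPER settings printed earlier.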

The conv-net is *not trained* in the original paper; therefore the images can be preprocessed separately and the conv-net outputs fed directly into the model.
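
One way to do such preprocessing is to run the frozen encoder once per image and cache the result on disk. The sketch below only shows the caching pattern: `cache_features` is a hypothetical helper, and `fake_encoder` is a stand-in (plain 32x32 average pooling) used so the example runs without the VGG weights.

```python
import os
import tempfile
import numpy as np

def cache_features(images, encode_fn, out_dir):
    """Run encode_fn once per image and store each result as <name>.npy."""
    paths = {}
    for name, image in images.items():
        path = os.path.join(out_dir, name + '.npy')
        np.save(path, encode_fn(image))
        paths[name] = path
    return paths

def fake_encoder(image):
    # Stand-in for the frozen conv-net: average-pool 32x32 patches.
    h, w = image.shape[0] // 32, image.shape[1] // 32
    cropped = image[:h * 32, :w * 32]
    return cropped.reshape(h, 32, w, 32, -1).mean(axis=(1, 3))

out_dir = tempfile.mkdtemp()
images = {'formula_0': np.zeros((120, 1075, 3), dtype=np.float32)}
paths = cache_features(images, fake_encoder, out_dir)
feats = np.load(paths['formula_0'])
print(feats.shape)
```

With a real encoder the cached arrays would have $D$ channels instead of 3, and training would load them in place of the raw images.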

### Input Generator

In [4]:
def make_batch_list(df_, batch_size_):
    ## Make a shuffled list of (bin_len, batch_index) tuples, one per batch
    bin_lens = sorted(df_.bin_len.unique())
    bin_counts = [df_[df_.bin_len==l].shape[0] for l in bin_lens]
    batch_list = []
    for i in range(len(bin_lens)):
        bin_ = bin_lens[i]
        num_batches = (bin_counts[i] // batch_size_)
        ## Just making sure bin size is integral multiple of batch_size.
        ## This is not a requirement for this function to operate, rather
        ## is a way of possibly catching data-corrupting bugs
        assert (bin_counts[i] % batch_size_) == 0
        batch_list.extend([(bin_, j) for j in range(num_batches)])

    np.random.shuffle(batch_list)
    return batch_list

class ShuffleIterator(object):
    def __init__(self, df_, batch_size_):
        self._df = df_.sample(frac=1)
        self._batch_size = batch_size_
        self._batch_list = make_batch_list(self._df, batch_size_)
        self._next_pos = 0
        self._num_items = (df_.shape[0] // batch_size_)
        self.lock = threading.Lock()
        
#     def __iter__(self):
#         return self
    
    def next(self):
        ## This is an infinite iterator
        with self.lock:
            if self._next_pos >= self._num_items:
                ## Recompose the batch-list
                ## Shuffle the samples
                self._df = self._df.sample(frac=1)
                self._batch_list = make_batch_list(self._df, self._batch_size)
                self._next_pos %= self._num_items
            ## Read the shared state while still holding the lock
            df = self._df
            batch = self._batch_list[self._next_pos]
            self._next_pos += 1

        bin_len, batch_idx = batch
        start = batch_idx * self._batch_size
        end = start + self._batch_size
        df_bin = df[df.bin_len == bin_len]
        assert df_bin.bin_len.iloc[start] == bin_len
        assert df_bin.bin_len.iloc[end-1] == bin_len
        return df_bin.iloc[start:end]

class ImageIterator(ShuffleIterator):
    def __init__(self, df_, batch_size_, image_dim_, image_dir_):
        ShuffleIterator.__init__(self, df_, batch_size_)
        self._im_dim = image_dim_
        self._image_dir = image_dir_

    @staticmethod
    def get_image_matrix(image_path_, height_, width_, padded_height_, padded_width_):
        MAX_PIXEL = 255.0 # Ensure this is a float literal
        ## Load image and convert to a 3-channel array
        im_ar = ndimage.imread(image_path_, mode='RGB')
        ## Normalize values to lie between -0.5 and 0.5.
        ## This is done in place of data whitening.
        ## It is a very rough technique but legit for images
        im_ar = (im_ar - MAX_PIXEL/2.0) / MAX_PIXEL
        height, width, channels = im_ar.shape
        assert height == height_
        assert width == width_
        assert channels == 3
        if (height < padded_height_) or (width < padded_width_):
            ## Pad with 0.5, i.e. pixel value 255 after normalization
            ar = np.full((padded_height_, padded_width_, channels), 0.5, dtype=np.float32)
            h = (padded_height_-height)//2
            ar[h:h+height, 0:width] = im_ar
            im_ar = ar

        return im_ar

    def next(self):
        df_batch = ShuffleIterator.next(self)[['image', 'height', 'width']]
        im_batch = []
        for row in df_batch.itertuples(index=False):
            ## row = (image, height, width)
            im_batch.append(self.get_image_matrix(os.path.join(self._image_dir, row[0]), row[1], row[2], self._im_dim[0], self._im_dim[1]))

        return np.asarray(im_batch)

class FormulaIterator(ShuffleIterator):
    def __init__(self, df_, batch_size_, data_dir_, seq_filename_):
        ShuffleIterator.__init__(self, df_, batch_size_)
        self._seq_data = pd.read_pickle(os.path.join(data_dir_, seq_filename_))

    def next(self):
        ## sr_batch is a Series of bin_len values; all rows of a batch share one bin
        sr_batch = ShuffleIterator.next(self)['bin_len']
        bin_len = sr_batch.iloc[0]
        ## Assumes self._seq_data maps bin_len to a frame indexed like df_
        return self._seq_data[bin_len].loc[sr_batch.index].values
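
The bucketing scheme used by `make_batch_list` can be illustrated without pandas. The sketch below is a simplified, hypothetical re-statement that takes a plain list of `bin_len` values instead of a DataFrame; it is meant only to show what the batch list contains.

```python
import random
from collections import Counter

def make_batch_list(bin_lens, batch_size, rng=random):
    """List (bin_len, batch_index) pairs, one per full batch, in random order."""
    counts = Counter(bin_lens)
    batch_list = []
    for bin_len in sorted(counts):
        # Same sanity check as in the cell above: bins hold whole batches only.
        assert counts[bin_len] % batch_size == 0
        num_batches = counts[bin_len] // batch_size
        batch_list.extend((bin_len, j) for j in range(num_batches))
    rng.shuffle(batch_list)
    return batch_list

# 4 samples of length 10, 2 of length 20 and 6 of length 30, in batches of 2
sample_bins = [10] * 4 + [20] * 2 + [30] * 6
batches = make_batch_list(sample_bins, batch_size=2, rng=random.Random(0))
print(len(batches))  # 6
```

Every batch therefore draws its samples from a single bin, so all sequences in a batch share the same padded length.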

### Decoder Model
A dense (FC) attention model: The deterministic soft-attention model of the paper computes the weights $\alpha_{t,i}$, which are used to select or blend the $a_i$ vectors before they are fed as input to the decoder LSTM network (see below).
* Inputs to the attention model are $a_i$ and $h_{t-1}$ (the previous hidden state of the LSTM network, see below), and $$\alpha_{t,i} = \mathrm{softmax} ( f_{att}(a_i, h_{t-1}) )$$
* Note that the model $f_{att}$ shares weights across all $a_i$ (i.e. for all $i = 1 \ldots L$). Therefore the shared weight matrix applied to each $a_i$ has shape $(D, D)$, while $a$ has shape $(B, L, D)$, where $B$ is the batch size. The weight matrix applied to $h_{t-1}$ is separate and has the expected shape $(n, D)$. This sharing of weights across the $a_i$ is interesting.
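
The weight sharing described above can be made concrete with a small NumPy sketch. It assumes a single `tanh` layer followed by a scoring vector `v` that reduces each location to a scalar; `v` and the function name `soft_attention` are assumptions for illustration and are not taken from the model code.

```python
import numpy as np

def soft_attention(a, h_prev, W_a, W_h, v):
    """Return alpha of shape (B, L); each row sums to 1."""
    proj_a = np.dot(a, W_a)                       # (B, L, D): same W_a at every location
    proj_h = np.dot(h_prev, W_h)[:, None, :]      # (B, 1, D): broadcast over the L locations
    scores = np.dot(np.tanh(proj_a + proj_h), v)  # (B, L)
    scores -= scores.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

# Toy sizes for illustration; the HYPER settings use L=99, D=512, n=1000.
rng = np.random.RandomState(0)
B, L, D, n = 2, 6, 4, 5
a = rng.randn(B, L, D)
h_prev = rng.randn(B, n)
alpha = soft_attention(a, h_prev, rng.randn(D, D), rng.randn(n, D), rng.randn(D))
print(alpha.shape)
```

Because `W_a` has shape `(D, D)` regardless of `L`, the same attention weights apply to any grid size.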

A Decoder model: A conditioned LSTM that outputs probabilities of the text tokens $y_t$ at each step. The LSTM is conditioned upon $z_t = \sum_{i=1}^{L} \alpha_{t,i} a_i$ and takes the previous hidden state $h_{t-1}$ as input. In addition, an embedding of the previous output, $Ey_{t-1}$, is also input to the LSTM. At training time, $y_{t-1}$ is taken from the training samples, while at inference time it is fed back from the previously predicted token.
* $y$ is taken from a fixed vocabulary of K words. An embedding matrix $E$ is used to narrow its representation to an $m$ dimensional dense vector. The embedding weights $E$ are learnt end-to-end by the model as well.
* The decoder LSTM uses a deep layer between $h_t$ and $y_t$. It is called a deep output layer and is described in [section 3.2.2 of this paper](https://www.semanticscholar.org/paper/How-to-Construct-Deep-Recurrent-Neural-Networks-Pascanu-G%C3%BCl%C3%A7ehre/533ee188324b833e059cb59b654e6160776d5812). That is:
$$ p(y_t) = \mathrm{Softmax} \Big( f_{out}(Ey_{t-1}, h_t, z_t) \Big) $$
* Optionally $z_t = \beta \sum_{i=1}^{L} \alpha_{t,i} a_i$, where $\beta = \sigma(f_{\beta}(h_{t-1}))$ is a scalar used to modulate the strength of the context. In the original use-case of caption generation, the network learned to emphasize objects by turning up the value of this scalar when it was focusing on them. It is not clear at this time whether we'll need this feature for im2latex.
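
The context vector, the optional $\beta$ gate and the output distribution described above can be sketched together in NumPy. This is an illustration under stated assumptions: $f_{\beta}$ is taken to be a single linear map and $f_{out}$ a single linear layer on the concatenated inputs. The function and weight names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_context(alpha, a, h_prev, w_beta, b_beta=0.0):
    """z_t = beta * sum_i alpha_{t,i} a_i, with beta = sigmoid(f_beta(h_{t-1}))."""
    z = np.einsum('bl,bld->bd', alpha, a)            # (B, D): weighted sum over locations
    beta = sigmoid(np.dot(h_prev, w_beta) + b_beta)  # (B,): one scalar gate per sample
    return beta[:, None] * z, beta

def output_probs(Ey_prev, h_t, z_t, W_out, b_out):
    """Softmax over K tokens from the concatenation of Ey_{t-1}, h_t and z_t."""
    features = np.concatenate([Ey_prev, h_t, z_t], axis=1)  # (B, m + n + D)
    logits = np.dot(features, W_out) + b_out                # (B, K)
    logits -= logits.max(axis=1, keepdims=True)             # for numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

# Toy sizes for illustration; the HYPER settings use m=64, n=1000, D=512, K=556.
rng = np.random.RandomState(0)
B, L, D, n, m, K = 2, 6, 4, 5, 3, 7
alpha = np.full((B, L), 1.0 / L)   # uniform attention, for the demo only
a = rng.randn(B, L, D)
h = rng.randn(B, n)
z_t, beta = gated_context(alpha, a, h, rng.randn(n))
probs = output_probs(rng.randn(B, m), h, z_t, rng.randn(m + n + D, K), np.zeros(K))
print(probs.shape)
```

Setting `beta` to 1 recovers the ungated context $z_t = \sum_{i=1}^{L} \alpha_{t,i} a_i$.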


In [5]:
# with tf.variable_scope('test_1'):
#     m = Im2LatexModel().build()
#     print 'yProbs shape = ', K.int_shape(m.yProbs)

In [6]:
def test_rnn():
    B = HYPER.B
    Kv = HYPER.K
    L = HYPER.L

    ## TODO: Introduce Beam Search
    ## TODO: Introduce Stochastic Learning

    m = Im2LatexModel()
    rv = dlc.Properties()
    im = tf.placeholder(dtype=tf.float32, shape=(HYPER.B,) + HYPER.image_shape, name='image_batch')
    y_s = tf.placeholder(tf.int32, shape=(HYPER.B, None))
    print 'y_s[:,0] shape: ', K.int_shape(y_s[:,0])
    print 'embedding_lookup shape: ', K.int_shape(m._embedding_lookup(y_s[:,0]))
    a = m._build_image_context(im)
    rnn = Im2LatexDecoderRNN(HYPER, a, 10)
    print 'rnn ', rnn.state_size, rnn.output_size
    #init_c, init_h = m._build_init_layer(a)
    
    decoder = tf.contrib.seq2seq.BeamSearchDecoder(rnn, 
                                                   m._embedding_lookup,
                                                   y_s[:,0],
                                                   0,
                                                   rnn.zero_state(HYPER.B*rnn.BeamWidth, tf.float32),
                                                   beam_width=rnn.BeamWidth)
    
    print 'decoder._start_tokens: ', K.int_shape(decoder._start_tokens)
    print 'decoder._start_inputs: ', K.int_shape(decoder._start_inputs)
    final_outputs, final_state, final_sequence_lengths = tf.contrib.seq2seq.dynamic_decode(decoder,
                                                                                           maximum_iterations=HYPER.Max_Seq_Len + 10,
                                                                                           swap_memory=True)
    print 'final_outputs: ', K.int_shape(final_outputs.predicted_ids)
    print 'final_state: ', (final_state)
    print 'final_sequence_lengths', (final_sequence_lengths)

with tf.variable_scope('test_run18', reuse=False):
    test_rnn()


y_s[:,0] shape:  (128,)
embedding_lookup shape:  (128, 64)
convnet output_shape =  (None, 3, 33, 512)
rnn  ((1000, 1000), 99) 556
decoder._start_tokens:  (128, 10)
decoder._start_inputs:  (128, 10, 64)
shape(Ex_t) =  (1280, 64)
final_outputs:  (128, None, 10)
final_state:  BeamSearchDecoderState(cell_state=((<tf.Tensor 'test_run18/decoder/while/Exit_4:0' shape=(128, 10, 1000) dtype=float32>, <tf.Tensor 'test_run18/decoder/while/Exit_5:0' shape=(128, 10, 1000) dtype=float32>), <tf.Tensor 'test_run18/decoder/while/Exit_6:0' shape=(128, 10, 99) dtype=float32>), log_probs=<tf.Tensor 'test_run18/decoder/while/Exit_7:0' shape=(128, 10) dtype=float32>, finished=<tf.Tensor 'test_run18/decoder/while/Exit_8:0' shape=(128, 10) dtype=bool>, lengths=<tf.Tensor 'test_run18/decoder/while/Exit_9:0' shape=(128, 10) dtype=int32>)
final_sequence_lengths Tensor("test_run18/decoder/while/Exit_12:0", shape=(128, 10), dtype=int32)


In [7]:
# ## Conv-net
# # K.set_image_data_format('channels_last')
# #image_input = Input(shape=HYPER.image_shape, name='image_input')
# image_input = tf.placeholder(dtype=tf.float32, shape=(HYPER.B,) + HYPER.image_shape, name='image_batch2')
# convnet = VGG16(include_top=False, weights='imagenet', pooling=None, input_shape=HYPER.image_shape)
# convnet.trainable = False
# print 'convnet output_shape = ', convnet.output_shape
# a = convnet(image_input)
# a

# End