# im2latex(S): Deep Learning Model

&copy; Copyright 2017 Sumeet S Singh

    This file is part of the im2latex solution (by Sumeet S Singh in particular since there are other solutions out there).

    This program is free software: you can redistribute it and/or modify
    it under the terms of the Affero GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    Affero GNU General Public License for more details.

    You should have received a copy of the Affero GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.

## The Model
* Follows the [Show, Attend and Tell paper](https://www.semanticscholar.org/paper/Show-Attend-and-Tell-Neural-Image-Caption-Generati-Xu-Ba/146f6f6ed688c905fb6e346ad02332efd5464616)
* [VGG ConvNet (16 or 19)](http://www.robots.ox.ac.uk/~vgg/research/very_deep/) without the top-3 layers
    * Pre-initialized with the VGG weights and kept frozen in this notebook (`convnet.trainable = False`)
    * The ConvNet outputs $D$ dimensional vectors in a WxH grid where W and H are roughly 1/32nd of the input image size (due to the 5 max-pool layers of VGG16 without its top layers). Defining $W \cdot H \equiv L$, the ConvNet output represents L locations of the image, $i \in [1,L]$, and correspondingly yields L annotation vectors $a_i$, each of size $D$.
* A dense (FC) attention model: The deterministic soft-attention model of the paper computes $\alpha_{t,i}$ which is used to select or blend the $a_i$ vectors before being fed as inputs to the decoder LSTM network (see below).
    * Inputs to the attention model are $a_i$ and $h_{t-1}$ (previous hidden state of LSTM network - see below)
    and $$\alpha_{t,i} = softmax ( f_{att}(a_i, h_{t-1}) )$$
* A Decoder model: A conditioned LSTM that outputs probabilities of the text tokens $y_t$ at each step. The LSTM is conditioned upon $z_t = \sum_i^L(\alpha_{t,i}.a_i)$ and takes the previous hidden state $h_{t-1}$ as input. In addition, an embedding of the previous output $Ey_{t-1}$ is also input to the LSTM. At training time, $y_{t-1}$ is taken from the training samples (teacher forcing), while at inference time it is fed back from the previously predicted token.
    * $y$ is taken from a fixed vocabulary of K words. An embedding matrix $E$ is used to narrow its representation. The embedding weights $E$ are learnt end-to-end by the model as well.
    * The decoder LSTM uses a deep layer between $h_t$ and $y_t$. It is called a deep output layer and is described in [section 3.2.2 of this paper](https://www.semanticscholar.org/paper/How-to-Construct-Deep-Recurrent-Neural-Networks-Pascanu-G%C3%BCl%C3%A7ehre/533ee188324b833e059cb59b654e6160776d5812). That is:
    $$ p(y_t) = Softmax \Big( f_{out}(Ey_{t-1}, h_t, \hat{z}_t) \Big) $$
* Initialization MLPs: Two MLPs are used to produce the initial memory state $c_0$ and the initial hidden state $h_0$ of the LSTM. Each MLP takes the entire image's features (i.e. the average of the $a_i$) as its input and is trained end-to-end.
    $$ c_0 = f_{init,c}\Big( \frac{1}{L} \sum_i^L a_i \Big) $$
    $$ h_0 = f_{init,h}\Big( \frac{1}{L} \sum_i^L a_i \Big) $$
* Training:
    * 3 components above - i.e. all except the conv-net - are trained end-to-end using SGD
    * The model is trained for a variable number of time steps - depending on each batch
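
The attention and context computations above can be sketched in NumPy for a single decoding step. This is only an illustration under stated assumptions: `f_att` is taken to be a one-hidden-layer MLP that adds projections of $a_i$ and $h_{t-1}$, and the weight names (`W_a`, `W_h`, `w`) and the toy sizes are hypothetical, not taken from the notebook.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_attention_step(a, h_prev, W_a, W_h, w):
    """One step of deterministic soft attention.

    a:      (L, D) annotation vectors a_i
    h_prev: (n,)   previous LSTM hidden state h_{t-1}
    W_a:    (D, D) projection shared by all locations (assumed form of f_att)
    W_h:    (n, D) projection of the hidden state (assumed)
    w:      (D,)   maps each location's activation to a scalar score (assumed)
    """
    scores = np.tanh(a.dot(W_a) + h_prev.dot(W_h)).dot(w)  # shape (L,)
    alpha = softmax(scores)                                   # shape (L,), sums to 1
    z = alpha.dot(a)                                          # shape (D,), context vector z_t
    return alpha, z

# Toy sizes, chosen only for illustration.
L, D, n = 6, 4, 5
rng = np.random.RandomState(0)
a = rng.randn(L, D)
h_prev = rng.randn(n)
alpha, z = soft_attention_step(a, h_prev, rng.randn(D, D), rng.randn(n, D), rng.randn(D))
print(alpha.shape)
print(z.shape)
```

Because the $\alpha_{t,i}$ sum to 1, $z_t$ is a convex combination of the annotation vectors, which is why it can be read as a soft selection of image locations.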

## References
1. Show, Attend and Tell
    * [Paper](https://www.semanticscholar.org/paper/Show-Attend-and-Tell-Neural-Image-Caption-Generati-Xu-Ba/146f6f6ed688c905fb6e346ad02332efd5464616)
    * [Slides](https://pdfs.semanticscholar.org/b336/f6215c3c15802ca5327cd7cc1747bd83588c.pdf?_ga=2.52116077.559595598.1498604153-2037060338.1496182671)
    * [Original Theano code](https://github.com/kelvinxu/arctic-captions)
1. [Simonyan, Karen and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” CoRR abs/1409.1556 (2014): n. pag.](http://www.robots.ox.ac.uk/~vgg/research/very_deep/)
1. [im2latex solution of Harvard NLP](http://lstm.seas.harvard.edu/latex/)
1. [im2latex-dataset tools forked from Harvard NLP](https://github.com/untrix/im2latex-dataset)

In [1]:
import pandas as pd
import numpy as np
import os
import threading
import dl_commons as dlc
import tensorflow as tf
from scipy import ndimage  # ndimage.imread (used below) was removed in SciPy 1.2
from dl_commons import PD 
from dl_commons import mandatory
from dl_commons import instanceof
from keras.applications.vgg16 import VGG16
from keras.layers import Input, Embedding, Dense
from keras.callbacks import LambdaCallback
from keras.models import Model
from keras import backend as K
import keras

Using TensorFlow backend.


In [2]:
data_folder = '../data/generated2'

### HyperParams

In [3]:
def get_vocab_size(data_dir_):
    ## Use the directory passed in by the caller rather than the global data_folder
    df_vocab = pd.read_pickle(os.path.join(data_dir_, 'df_vocab.pkl'))
    return df_vocab.id.max() + 1

In [4]:
class tensorshape(object):
    """Tensor shape validator to go with ParamDesc"""
    def __init__(self, shape):
        self._shape = shape
    def __contains__(self, obj):
        return keras.backend.int_shape(obj) == self._shape

In [21]:
try:
    del HYPER
except NameError:
    pass

HYPER = dlc.HyperParams((
        PD('image_shape',
           'Shape of input images. Should be a python sequence.',
           None,
           (120,1075,3)
           ),
        PD('B',
           '(integer): Size of mini-batch for training, validation and testing.',
           instanceof(int),
           128
           ),
        PD('K',
           'Vocabulary size including zero',
           range(500,1000),
           get_vocab_size(data_folder)
           ),
        PD('m',
           '(integer): dimensionality of the embedded input vector (i.e. Ey)', 
           instanceof(int),
           64
           ),
        PD('L',
           '(integer): number of pixels in an image feature-map = WxH (see paper or model description)', 
           instanceof(int)),
        PD('D', 
           '(integer): number of features coming out of the conv-net. Depth/channels of the last conv-net layer. See paper or model description.', 
           instanceof(int)),
        PD('n',
           '(integer): Number of hidden-units of the LSTM cell',
           instanceof(int),
           1000),
    
    ### Attention Model Params ###
        PD('att_activation', 'Activation used in the attention model',
           ('tanh',), 
           'tanh'),
        PD('att_a_share_weights', 
           '(boolean): Flag indicating whether the attention_a model should share weights across locations (L)',
           instanceof(bool),
           True
          ),
        PD('att_a_layers', 'Number of layers in the attention_a model', range(1,10), 1),
        PD('att_h_layers', 'Number of layers in the attention_h model', range(1,10), 1),
        PD('att_a_1_n', 'Number of units in first layer of attention_a model', range(1,10000)),
        PD('att_h_1_n', 'Number of units in first layer of attention_h model', range(1,10000))
        ))
print HYPER

{'att_a_layers': 1, 'B': 128, 'D': None, 'att_a_share_weights': True, 'att_activation': 'tanh', 'm': 64, 'L': None, 'n': 1000, 'att_a_1_n': None, 'att_h_layers': 1, 'K': 556, 'att_h_1_n': None, 'image_shape': (120, 1075, 3)}


#### Encoder Model
[VGG ConvNet (16 or 19)](http://www.robots.ox.ac.uk/~vgg/research/very_deep/) without the top-3 layers
* Pre-initialized with the VGG (ImageNet) weights and frozen in the code below (`convnet.trainable = False`)
* The ConvNet outputs $D$ dimensional vectors in a WxH grid where W and H are scaled-down dimensions of the input image size (due to 5 max-pool layers). Defining $W \cdot H \equiv L$, the ConvNet output represents L locations of the image, $i \in [1,L]$, and correspondingly yields L annotation vectors $a_i$, each of size $D$.

The conv-net is *not trained* in the original paper, and therefore the images can be preprocessed separately and the conv-net outputs fed directly into the model.
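
As a quick sanity check on the grid size, the arithmetic can be sketched as follows. The sketch assumes that each of the 5 max-pool layers halves a dimension and rounds down (2x2 pooling with stride 2 and no padding) and that the convolutions preserve the spatial size. The helper name `pooled_size` is made up for this illustration.

```python
def pooled_size(size, num_pools=5):
    """Spatial size after `num_pools` 2x2 / stride-2 max-pools with floor rounding."""
    for _ in range(num_pools):
        size //= 2
    return size

height, width = 120, 1075          # first two entries of image_shape in this notebook
H = pooled_size(height)            # 120 -> 60 -> 30 -> 15 -> 7 -> 3
W = pooled_size(width)             # 1075 -> 537 -> 268 -> 134 -> 67 -> 33
L = H * W                          # number of annotation vectors
print("H=%d W=%d L=%d" % (H, W, L))   # H=3 W=33 L=99
```

These values agree with the `convnet output_shape` of `(None, 3, 33, 512)` and with `L = 99` reported by the notebook.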

In [22]:
## Conv-net
# K.set_image_data_format('channels_last')
image_input = Input(shape=HYPER.image_shape, name='image_input')
convnet = VGG16(include_top=False, weights='imagenet', pooling=None, input_shape=HYPER.image_shape)
convnet.trainable = False
print 'convnet output_shape = ', convnet.output_shape
a = convnet(image_input)

convnet output_shape =  (None, 3, 33, 512)


In [23]:
HYPER.L = K.int_shape(a)[1]*K.int_shape(a)[2]
HYPER.D = K.int_shape(a)[3]
HYPER.att_a_1_n = HYPER.D
HYPER.att_h_1_n = HYPER.D
print HYPER

{'att_a_layers': 1, 'B': 128, 'D': 512, 'att_a_share_weights': True, 'att_activation': 'tanh', 'm': 64, 'L': 99, 'n': 1000, 'att_a_1_n': 512, 'att_h_layers': 1, 'K': 556, 'att_h_1_n': 512, 'image_shape': (120, 1075, 3)}


### Input Generator

In [9]:
def make_batch_list(df_, batch_size_):
    ## Make a shuffled list of batches. Each batch is a tuple (bin_len, index of the batch within its bin)
    bin_lens = sorted(df_.bin_len.unique())
    bin_counts = [df_[df_.bin_len==l].shape[0] for l in bin_lens]
    batch_list = []
    for i in range(len(bin_lens)):
        bin_ = bin_lens[i]
        num_batches = (bin_counts[i] // batch_size_)
        ## Just making sure bin size is integral multiple of batch_size.
        ## This is not a requirement for this function to operate, rather
        ## is a way of possibly catching data-corrupting bugs
        assert (bin_counts[i] % batch_size_) == 0
        batch_list.extend([(bin_, j) for j in range(num_batches)])

    np.random.shuffle(batch_list)
    return batch_list

class ShuffleIterator(object):
    def __init__(self, df_, batch_size_):
        self._df = df_.sample(frac=1)
        self._batch_size = batch_size_
        self._batch_list = make_batch_list(self._df, batch_size_)
        self._next_pos = 0
        self._num_items = (df_.shape[0] // batch_size_)
        self.lock = threading.Lock()
        
#     def __iter__(self):
#         return self
    
    def next(self):
        ## This is an infinite iterator
        with self.lock:
            if self._next_pos >= self._num_items:
                ## Recompose the batch-list
                ## Shuffle the samples
                self._df = self._df.sample(frac=1)
                self._batch_list = make_batch_list(self._df, self._batch_size)
                self._next_pos %= self._num_items
            next_pos = self._next_pos
            self._next_pos += 1
        
        batch = self._batch_list[next_pos]
        df_bin = self._df[self._df.bin_len == batch[0]]
        assert df_bin.bin_len.iloc[batch[1]*self._batch_size] == batch[0]
        assert df_bin.bin_len.iloc[(batch[1]+1)*self._batch_size-1] == batch[0]
        return df_bin.iloc[batch[1]*self._batch_size : (batch[1]+1)*self._batch_size]

class ImageIterator(ShuffleIterator):
    def __init__(self, df_, batch_size_, image_dim_, image_dir_):
        ShuffleIterator.__init__(self, df_, batch_size_)
        self._im_dim = image_dim_ # (padded_height, padded_width)
        self._image_dir = image_dir_

    @staticmethod
    def get_image_matrix(image_path_, height_, width_, padded_height_, padded_width_):
        MAX_PIXEL = 255.0 # Ensure this is a float literal
        ## Load image and convert to a 3-channel array of shape (height, width, 3)
        im_ar = ndimage.imread(image_path_, mode='RGB')
        ## Normalize values to lie between -0.5 and 0.5.
        ## This is done in place of proper data whitening.
        ## It is a very rough technique but legit for images
        im_ar = (im_ar - MAX_PIXEL/2.0) / MAX_PIXEL
        height, width, channels = im_ar.shape
        assert height == height_
        assert width == width_
        if (height < padded_height_) or (width < padded_width_):
            ## Pad with 0.5, i.e. the normalized value of a white pixel (255)
            ar = np.full((padded_height_, padded_width_, channels), 0.5, dtype=np.float32)
            h = (padded_height_-height)//2
            ar[h:h+height, 0:width] = im_ar
            im_ar = ar

        return im_ar

    def next(self):
        df_batch = ShuffleIterator.next(self)[['image', 'height', 'width']]
        im_batch = []
        ## Rows are yielded as (image, height, width) in column order
        for image, height, width in df_batch.itertuples(index=False):
            im_batch.append(self.get_image_matrix(os.path.join(self._image_dir, image), height, width,
                                                  self._im_dim[0], self._im_dim[1]))
            
        return np.asarray(im_batch)

class FormulaIterator(ShuffleIterator):
    def __init__(self, df_, batch_size_, data_dir_, seq_filename_):
        ShuffleIterator.__init__(self, df_, batch_size_)
        self._seq_data = pd.read_pickle(os.path.join(data_dir_, seq_filename_))
        
    def next(self):
        ## Series of bin_len values; all rows of a batch share the same bin_len
        sr_bin_len = ShuffleIterator.next(self).bin_len
        bin_len = sr_bin_len.iloc[0]
        ## Assumes self._seq_data[bin_len] is a pandas object indexed like the sample dataframe
        return self._seq_data[bin_len].loc[sr_bin_len.index].values
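
To illustrate the batching scheme used by `make_batch_list`, here is a minimal standalone sketch that uses plain Python containers instead of a DataFrame. It groups sample ids by bin length, cuts each bin into fixed-size batches and shuffles the order of the batches, so that every batch contains sequences of a single bin length. The toy data and the helper name `make_batches` are hypothetical.

```python
import random
from collections import defaultdict

def make_batches(bin_len_by_id, batch_size, seed=None):
    """Return a shuffled list of (bin_len, [sample ids]) batches."""
    bins = defaultdict(list)
    for sample_id, bin_len in bin_len_by_id.items():
        bins[bin_len].append(sample_id)

    rng = random.Random(seed)
    batches = []
    for bin_len in sorted(bins):
        ids = bins[bin_len]
        # Same sanity check as in the notebook: each bin holds whole batches only.
        assert len(ids) % batch_size == 0
        rng.shuffle(ids)  # shuffle the samples within the bin
        for start in range(0, len(ids), batch_size):
            batches.append((bin_len, ids[start:start + batch_size]))
    rng.shuffle(batches)  # shuffle the order of batches across bins
    return batches

# Toy data: 4 samples in the length-10 bin and 2 samples in the length-20 bin.
toy = {0: 10, 1: 10, 2: 10, 3: 10, 4: 20, 5: 20}
batches = make_batches(toy, batch_size=2, seed=0)
for bin_len, ids in batches:
    print("%d %s" % (bin_len, ids))
```

Keeping one bin length per batch is what lets the model train for a variable number of time steps depending on the batch.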

In [None]:
# sequence_input = Input(shape=(None,), dtype='int32', name='sequence_input')
# embedding_output = Embedding(HYPER.vocab_size, HYPER.embedding_size, mask_zero=True, name='embedding')(sequence_input)
#model = Model(inputs=[sequence_input], outputs=[embedding_output])
#model.compile(optimizer='rmsprop', loss='binary_crossentropy')
#model.output_shape

#### Decoder Model
A dense (FC) attention model: The deterministic soft-attention model of the paper computes $\alpha_{t,i}$ which is used to select or blend the $a_i$ vectors before being fed as inputs to the decoder LSTM network (see below).
* Inputs to the attention model are $a_i$ and $h_{t-1}$ (previous hidden state of LSTM network - see below) and $$\alpha_{t,i} = softmax ( f_{att}(a_i, h_{t-1}) )$$
* Note that the model $f_{att}$ shares weights across all the $a_i$ (i.e. for all $i = 1 \dots L$). Therefore the shared weight matrix for all $a_i$ has shape (D, D), while the shape of $a$ is (B, L, D), where B is the batch size. The weight matrix of $h_{t-1}$ is separate and has the expected shape (n, D). This sharing of weights across the $a_i$ is interesting.
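
The weight sharing described above can be checked numerically. In this NumPy sketch (toy sizes, random weights, all names hypothetical) a single (D, D) matrix is applied to a tensor of shape (B, L, D), and the result is identical to applying that same matrix to each location separately.

```python
import numpy as np

B, L, D, n = 2, 5, 4, 3            # toy sizes for illustration only
rng = np.random.RandomState(0)
a = rng.randn(B, L, D)             # annotation vectors
h = rng.randn(B, n)                # previous hidden state h_{t-1}
W_a = rng.randn(D, D)              # shared across all L locations
W_h = rng.randn(n, D)              # separate weights for h

# One matrix product over the last axis applies W_a to every location.
shared = np.dot(a, W_a)                                                    # (B, L, D)
per_location = np.stack([a[:, i, :].dot(W_a) for i in range(L)], axis=1)  # (B, L, D)
assert np.allclose(shared, per_location)

# The projection of h has shape (B, D) and is broadcast over the L axis.
pre_activation = shared + h.dot(W_h)[:, np.newaxis, :]                     # (B, L, D)
print(pre_activation.shape)        # (2, 5, 4)
```

The parameter count of the shared projection is D*D regardless of L, whereas flattening $a$ to (B, L*D) would need a weight matrix with L*D rows.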

A Decoder model: A conditioned LSTM that outputs probabilities of the text tokens $y_t$ at each step. The LSTM is conditioned upon $z_t = \sum_i^L(\alpha_{t,i}.a_i)$ and takes the previous hidden state $h_{t-1}$ as input. In addition, an embedding of the previous output $Ey_{t-1}$ is also input to the LSTM. At training time, $y_{t-1}$ is taken from the training samples (teacher forcing), while at inference time it is fed back from the previously predicted token.
* $y$ is taken from a fixed vocabulary of K words. An embedding matrix $E$ is used to narrow its representation to an $m$ dimensional dense vector. The embedding weights $E$ are learnt end-to-end by the model as well.
* The decoder LSTM uses a deep layer between $h_t$ and $y_t$. It is called a deep output layer and is described in [section 3.2.2 of this paper](https://www.semanticscholar.org/paper/How-to-Construct-Deep-Recurrent-Neural-Networks-Pascanu-G%C3%BCl%C3%A7ehre/533ee188324b833e059cb59b654e6160776d5812). That is:
$$ p(y_t) = Softmax \Big( f_{out}(Ey_{t-1}, h_t, \hat{z}_t) \Big) $$
* Optionally $z_t = \beta \sum_i^L(\alpha_{t,i}.a_i)$ where $\beta = \sigma(f_{\beta}(h_{t-1}))$ is a scalar used to modulate the strength of the context. It turns out that for the original use-case of caption generation, the network would learn to emphasize objects by turning up the value of this scalar when it was focusing on objects. It is not clear at this time whether we'll need this feature for im2latex.
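
A minimal NumPy sketch of the optional gating term, assuming $f_{\beta}$ is a single linear layer. The names `w_beta` and `b_beta` are hypothetical and the numbers are random toy values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_context(alpha, a, h_prev, w_beta, b_beta=0.0):
    """z_t = beta * sum_i alpha_{t,i} a_i, with beta = sigmoid(f_beta(h_{t-1})).

    alpha: (L,) attention weights, a: (L, D) annotation vectors,
    h_prev: (n,) previous hidden state, w_beta: (n,) assumed linear f_beta.
    """
    beta = sigmoid(h_prev.dot(w_beta) + b_beta)   # scalar in (0, 1)
    z = beta * alpha.dot(a)                        # shape (D,)
    return beta, z

L, D, n = 6, 4, 5                  # toy sizes for illustration only
rng = np.random.RandomState(1)
alpha = np.full(L, 1.0 / L)        # uniform attention, for illustration
a = rng.randn(L, D)
beta, z = gated_context(alpha, a, rng.randn(n), rng.randn(n))
print(z.shape)
```

Since $\beta$ lies strictly between 0 and 1, it can only scale the context vector down, which is how the network modulates how much the image context contributes at a given step.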


In [None]:
class ConditionedAttentiveRNN(object):
    """
    One timestep of the decoder model. The entire function can be seen as a complex RNN-cell: 
    which includes a LSTM stack and an attention model.
    """
    def __init__(self):
        pass
    
    def _build_attention_model(self, a, h_prev):
        B = HYPER.B
        n = HYPER.n
        L = HYPER.L
        D = HYPER.D
        att_actv = HYPER.att_activation
        initializer = HYPER.weights_initializer_att

        ## For #layers > 1 this will end up being different than the paper's implementation
        if HYPER.att_a_share_weights:
            """
            Here we'll effectively create L MLP stacks all sharing the same weights. Each
            stack receives a concatenated vector of a(l) and h as input.

            TODO: We could also
            use 2D convolution here with a kernel of size (1,D) and stride=1 resulting in
            an output dimension of (L,1,depth) or (B, L, 1, depth) including the batch dimension.
            That may be more efficient.
            """
            ## h_prev.shape = (B,n). Convert it to (B,1,n) and then broadcast to (B,L,n) in order
            ## to concatenate with feature vectors of a whose shape=(B,L,D)
            h = K.tile(K.expand_dims(h_prev, axis=1), (1,L,1))
            ## Concatenate a and h. Final shape = (B, L, D+n)
            ah = K.concatenate([a,h])
            dim = D+n
            ## NOTE: att_layers and att_<i>_n are not yet declared in HYPER above
            for i in range(1, HYPER.att_layers+1):
                n_units = HYPER['att_%d_n'%(i,)]; assert(n_units <= dim)
                ah = Dense(n_units, activation=att_actv)(ah)
                dim = n_units

            ## Below is roughly how it is implemented in the code released by the authors of the paper
#                 for i in range(1, HYPER.att_a_layers+1):
#                     a = Dense(HYPER['att_a_%d_n'%(i,)], activation=HYPER.att_actv)(a)
#                 for i in range(1, HYPER.att_h_layers+1):
#                     h = Dense(HYPER['att_h_%d_n'%(i,)], activation=HYPER.att_actv)(h)    
#                ah = a + K.expand_dims(h, axis=1)

            ## Gather all activations across the features; go from (B, L, x) to (B,L,1).
            ## One could've just summed
            ## them all here, but the paper uses another set of weights to accomplish this.
            if HYPER.att_weighted_gather:
                ah = Dense(1, activation='linear')(ah) # output shape = (B, L, 1)
                ah = K.reshape(ah, (B,L))
            else:
                ah = K.mean(ah, axis=2) # output shape = (B, L)

        else: # weights not shared
            ## concatenate a and h_prev and pass them through a MLP. This is different than the theano
            ## implementation of the paper because we flatten a from (B,L,D) to (B,L*D). Hence each element
            ## of the L*D vector receives its own weight because the effective weight matrix here would be
            ## shape (L*D, num_dense_units) as compared to (D, num_dense_units) as in the shared_weights case

            ## Concatenate a and h_prev. Final shape will be (B, L*D+n)
            ah = K.concatenate([K.batch_flatten(a), h_prev])
            dim = L*D + n
            for i in range(1, HYPER.att_layers+1):
                n_units = HYPER['att_%d_n'%(i,)]; assert(n_units <= dim)
                ah = Dense(n_units, activation=att_actv)(ah)
                dim = n_units
            ## At this point, ah.shape = (B, dim)
            assert dim >= L

        alpha = Dense(L, activation='softmax', name='alpha')(ah)
        return alpha
            
    def build_model(self, a):
        """
        Args:
            a (tensor): The image annotation vectors as described in the Show Attend & Tell paper.
                Should be a tensor of shape (B, L, D).
        """
        ## Renaming HyperParams for convenience
        B = HYPER.B
        n = HYPER.n
        L = HYPER.L
        D = HYPER.D
        m = HYPER.m
        vocab_size = HYPER.K ## not named K, so as not to shadow the keras backend module K
        att_actv = HYPER.att_activation
        e_init = HYPER.embeddings_initializer

        ## Set and validate all function params (B, L and D must be defined before this point)
        dlc.HyperParams((
                        PD('a', '', tensorshape((B, L, D))),
                        ), 
                       initvals={'a': a})

        ## Create the placeholders
        h_prev = K.placeholder(shape=(B, n), name='h_prev')
        y_prev = K.placeholder(shape=(B, 1), dtype='int32', name='y_prev')
        
        ################ Attention Model ################
        with tf.name_scope('Attention'):
            alpha = self._build_attention_model(a, h_prev)
        
        ################ Soft deterministic attention: z = alpha-weighted mean of a ################
        ## (B, L) batch_dot (B,L,D) -> (B, D)
        with tf.name_scope('Phi'):
            z = K.batch_dot(alpha, a, axes=[1,1])
        
        ################ Build the embedding layer ################
        with tf.name_scope('Ey'):
            ## Embed the previous token: (B, 1) -> (B, 1, m), then drop the time axis -> (B, m)
            Ey = Embedding(vocab_size, m, embeddings_initializer=e_init, mask_zero=True, input_length=1)(y_prev)
            Ey = K.reshape(Ey, (B,m))
        
        ################ Build the LSTM Cell ################
        
        
    def step_function_train(self, input, states):
        """
        Conforms to step_function required by keras.backend.rnn. Takes in previous states (h, c), the current
        input and the image annotations (a) as input and outputs the states and outputs for the current timestep.
        Note that input(t) = Ey(t-1) and input(t=0) = Null. When training, the target output is used for Ey
        whereas at prediction time (e.g. via beam search) the actual output is used.
        Args:
            input (tensor): is a input for one time-step. Should be a tensor of shape (batch-size, m) where m is
                the dimensionality of the embedded input vector.
            states (list of tensors): Same as new_states returned by this function at previous time-step.
                states(t) = new_states(t-1).
        Returns:
            outputs (tensor): The output of the cell. A tensor of shape (batch_size, num_units)
        """
        pass
    def __call__(self, h_prev, Ey_prev, c_prev, a):
        pass

In [13]:
b = K.ones(shape=(3, 4))
c = K.ones(shape=(3, 4))
K.concatenate([b,c])

<tf.Tensor 'concat_2:0' shape=(3, 8) dtype=float32>

In [None]:
import numpy as np
a = np.array([[1,1],[2,2],[3,3]])


In [None]:
a[0]

In [None]:
a[1]

In [None]:
a.mean(0)

In [None]:
a