# im2latex(S): Deep Learning Model

&copy; Copyright 2017 Sumeet S Singh

    This file is part of the im2latex solution (by Sumeet S Singh in particular since there are other solutions out there).

    This program is free software: you can redistribute it and/or modify
    it under the terms of the Affero GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    Affero GNU General Public License for more details.

    You should have received a copy of the Affero GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.

## The Model
* Follows the [Show, Attend and Tell paper](https://www.semanticscholar.org/paper/Show-Attend-and-Tell-Neural-Image-Caption-Generati-Xu-Ba/146f6f6ed688c905fb6e346ad02332efd5464616)
* [VGG ConvNet (16 or 19)](http://www.robots.ox.ac.uk/~vgg/research/very_deep/) without the top-3 layers
    * Pre-initialized with the VGG weights but allowed to train
    * The ConvNet outputs $D$-dimensional vectors in a $W \times H$ grid, where $W$ and $H$ are roughly 1/32nd of the input image's width and height (due to 5 max-pool layers). Defining $L \equiv W \cdot H$, the ConvNet output represents $L$ locations of the image, $i \in [1,L]$, and correspondingly outputs $L$ annotation vectors $a_i$, each of size $D$.
* A dense (FC) attention model: The deterministic soft-attention model of the paper computes $\alpha_{t,i}$ which is used to select or blend the $a_i$ vectors before being fed as inputs to the decoder LSTM network (see below).
    * Inputs to the attention model are $a_i$ and $h_{t-1}$ (previous hidden state of LSTM network - see below)
    and $$\alpha_{t,i} = \mathrm{softmax} ( f_{att}(a_i, h_{t-1}) )$$
* A Decoder model: A conditioned LSTM that outputs probabilities of the text tokens $y_t$ at each step. The LSTM is conditioned upon $z_t = \sum_{i=1}^{L} \alpha_{t,i} a_i$ and takes the previous hidden state $h_{t-1}$ as input. In addition, an embedding of the previous output $Ey_{t-1}$ is also input to the LSTM. At training time, $y_{t-1}$ is taken from the training samples, while at inference time it is fed back from the previously predicted token.
    * $y$ is taken from a fixed vocabulary of K words. An embedding matrix $E$ is used to narrow its representation. The embedding weights $E$ are learnt end-to-end by the model as well.
    * The decoder LSTM uses a deep layer between $h_t$ and $y_t$. It is called a deep output layer and is described in [section 3.2.2 of this paper](https://www.semanticscholar.org/paper/How-to-Construct-Deep-Recurrent-Neural-Networks-Pascanu-G%C3%BCl%C3%A7ehre/533ee188324b833e059cb59b654e6160776d5812). That is:
    $$ p(y_t) = \mathrm{softmax} \Big( f_{out}(Ey_{t-1}, h_t, \hat{z}_t) \Big) $$
* Initialization MLPs: Two MLPs are used to produce the initial memory state $c_0$ and the initial hidden state $h_0$ of the LSTM. Each MLP takes in the entire image's features (i.e. the average of $a_i$) as its input and is trained end-to-end.
    $$ c_0 = f_{init,c}\Big( \frac{1}{L} \sum_{i=1}^{L} a_i \Big) $$
    $$ h_0 = f_{init,h}\Big( \frac{1}{L} \sum_{i=1}^{L} a_i \Big) $$
* Training:
    * 3 models from above - all except the conv-net - are trained end-to-end using SGD
    * The model is trained for a variable number of time steps - depending on each batch
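
The equations above can be traced with a small NumPy sketch of one decoding step. This is only an illustration of the formulas, not the notebook's Keras/TensorFlow graph: the weight matrices `W_h`, `W_c`, `W_att` and `v_att` are hypothetical stand-ins for $f_{init,h}$, $f_{init,c}$ and $f_{att}$, each assumed to be a single layer.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.RandomState(0)
L, D, n = 99, 512, 1000          # locations, feature depth, LSTM units (notebook defaults)

a = rng.randn(L, D)              # annotation vectors a_i from the conv-net
a_mean = a.mean(axis=0)          # whole-image summary fed to the init MLPs

# Hypothetical single-layer init MLPs producing h_0 and c_0.
W_h, W_c = rng.randn(D, n) * 0.01, rng.randn(D, n) * 0.01
h_prev = np.tanh(a_mean.dot(W_h))    # h_0, shape (n,)
c_prev = np.tanh(a_mean.dot(W_c))    # c_0, shape (n,)

# Hypothetical f_att: one tanh layer over [a_i, h_{t-1}] followed by a linear score.
W_att = rng.randn(D + n, D) * 0.01
v_att = rng.randn(D) * 0.01
ah = np.concatenate([a, np.tile(h_prev, (L, 1))], axis=1)   # (L, D+n)
scores = np.tanh(ah.dot(W_att)).dot(v_att)                  # (L,)

alpha = softmax(scores)          # attention weights, sum to 1 over the L locations
z = alpha.dot(a)                 # context vector z_t, shape (D,)
```

Because `alpha` sums to 1, `z` is a convex combination of the annotation vectors, which is what "select or blend" refers to.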

## References
1. Show, Attend and Tell
    * [Paper](https://www.semanticscholar.org/paper/Show-Attend-and-Tell-Neural-Image-Caption-Generati-Xu-Ba/146f6f6ed688c905fb6e346ad02332efd5464616)
    * [Slides](https://pdfs.semanticscholar.org/b336/f6215c3c15802ca5327cd7cc1747bd83588c.pdf?_ga=2.52116077.559595598.1498604153-2037060338.1496182671)
    * [Author's Theano code](https://github.com/kelvinxu/arctic-captions)
1. [Simonyan, Karen and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” CoRR abs/1409.1556 (2014): n. pag.](http://www.robots.ox.ac.uk/~vgg/research/very_deep/)
1. [im2latex solution of Harvard NLP](http://lstm.seas.harvard.edu/latex/)
1. [im2latex-dataset tools forked from Harvard NLP](https://github.com/untrix/im2latex-dataset)

In [1]:
import pandas as pd
import os
import dl_commons as dlc
import tensorflow as tf
from dl_commons import PD, mandatory, boolean, integer, decimal, equalto
from keras.applications.vgg16 import VGG16
from keras.layers import Input, Embedding, Dense, Activation, Dropout, Concatenate, Permute
from keras.callbacks import LambdaCallback
from keras.models import Model
from keras import backend as K
from keras.engine import Layer
import keras
import threading
import numpy as np
from scipy import ndimage  # ndimage.imread is used by ImageIterator.get_image_matrix
import collections

Using TensorFlow backend.


In [2]:
data_folder = '../data/generated2'

### HyperParams

In [3]:
def get_vocab_size(data_dir_):
    df_vocab = pd.read_pickle(os.path.join(data_dir_, 'df_vocab.pkl'))
    return df_vocab.id.max() + 1

In [4]:
try:
    del HYPER
except NameError:
    pass

HYPER_PD = (
        PD('image_shape',
           'Shape of input images. Should be a python sequence.',
           None,
           (120,1075,3)
           ),
        PD('B',
           '(integer or None): Size of mini-batch for training, validation and testing.',
           (None, 128),
           128
           ),
        PD('K',
           'Vocabulary size including zero',
           xrange(500,1000),
           556 #get_vocab_size(data_folder)
           ),
        PD('m',
           '(integer): dimensionality of the embedded input vector (Ey / Ex)', 
           xrange(50,250),
           64
           ),
        PD('H', 'Height of feature-map produced by conv-net. Specific to the dataset image size.', None, 3),
        PD('W', 'Width of feature-map produced by conv-net. Specific to the dataset image size.', None, 33),
        PD('L',
           '(integer): number of pixels in an image feature-map = HxW (see paper or model description)', 
           integer(1),
           lambda _, d: d['H'] * d['W']),
        PD('D', 
           '(integer): number of features coming out of the conv-net. Depth/channels of the last conv-net layer. '
           'See paper or model description.', 
           integer(1),
           512),
        PD('keep_prob', '(decimal): Value between 0.1 and 1.0 indicating the keep_probability of dropout layers. '
           'A value of 1 implies no dropout.',
           decimal(0.1, 1), 
           1.0),
    ### Attention Model Params ###
        PD('att_layers', 'Number of layers in the attention_a model', xrange(1,10), 1),
        PD('att_1_n', "Number of units in first layer of the attention model. Defaults to D as it is in the paper's source-code.", 
           xrange(1,10000),
           equalto('D')),
        PD('att_share_weights', 'Whether the attention model should share weights across the "L" image locations or not. '
           'Choosing "True" conforms to the paper resulting in a (D+n,att_1_n) weight matrix. Choosing False will result in a MLP with (L*D+n,att_1_n) weight matrix. ',
           boolean,
           True),
        PD('att_activation', 
           'Activation to use for the attention MLP model. Defaults to tanh as in the paper source.',
           None,
           'tanh'),
        PD('att_weighted_gather', "The paper's source uses an affine transform with trainable weights, to narrow the output of the attention "
           "model from (B,L,dim) to (B,L,1). I don't think this is helpful since there is no nonlinearity here. " 
           "Therefore I have an alternative implementation that simply averages the matrix (B,L,dim) to (B,L,1). " 
           "Default value however, is True in conformance with the paper's implementation.",
           (True, False),
           True),
        PD('att_weights_initializer', 'weights initializer to use for the attention model', None,
           'glorot_normal'),
    ### Embedding Layer ###
        PD('embeddings_initializer', 'Initializer for embedding weights', None, 'glorot_uniform'),
        #PD('embeddings_initializer_tf', 'Initializer for embedding weights', None, 
        #   tf.contrib.layers.xavier_initializer),
    ### Decoder LSTM Params ###
        PD('n',
           '(integer): Number of hidden-units of the LSTM cell',
           integer(100,10000),
           1000),
        PD('decoder_lstm_peephole',
           '(boolean): whether to employ peephole connections in the decoder LSTM',
           (True, False),
           False),
        PD('decoder_out_layers',
           "Number of layers in the decoder output MLP. Defaults to 1 as in the paper's source.",
           xrange(1,10), 1),
        PD('output_activation', 'Activation function for deep output layer', None,
           'tanh'),
        PD('output_follow_paper',
           'Output deep layer uses some funky logic in the paper instead of a straight MLP. '
           "Setting this value to True (default) will follow the paper's logic. Otherwise "
           "a straight MLP will be used.", boolean, 
           True),
        PD('output_1_n', 
           'Number of units in the first hidden layer of the output MLP. Used only if output_follow_paper == False. '
           "Defaults to 'm' - same as when output_follow_paper == True", None,
           equalto('m')),
    ### Initializer MLP ###
        PD('init_layers', 'Number of layers in the initializer MLP', xrange(1,10),
           1),
        PD('init_dropout_rate', '(decimal): Global dropout_rate variable for init_layer',
           decimal(0.0, 0.9), 
           0.2),
        PD('init_h_activation', '', None, 'tanh'),
        PD('init_h_dropout_rate', '', 
           decimal(0.0, 0.9), 
           equalto('init_dropout_rate')),
        PD('init_c_activation', '', None, 'tanh'),
        PD('init_c_dropout_rate', '', 
           decimal(0.0, 0.9),
           equalto('init_dropout_rate')),
        PD('init_1_n', 'Number of units in hidden layer 1. The paper sets it to D',
           integer(1, 10000), 
           equalto('D')),
        PD('init_1_dropout_rate', '(decimal): dropout rate for the layer', 
           decimal(0.0, 0.9), 
           0.),
        PD('init_1_activation', 
           'Activation function of the first layer. In the paper, the final ' 
           'layer has tanh and all preceding layers have relu activation', 
           None,
           'tanh'),
    ### Loss / Cost Layer ###
        PD('sum_logloss',
           'Whether to normalize log-loss per sample as in standard log perplexity ' 
           'calculation or whether to just sum up log-losses as in the paper. Defaults ' 
           'to True in conformance with the paper.',
           boolean,
           True
          ),
        PD('MeanSumAlphaEquals1',
          '(boolean): When calculating the alpha penalty, the paper uses the term: '
           'square{1 - sum_over_t{alpha_t_i}}. This assumes that the mean sum_over_t should be 1. '
           "However, that's not true, since the mean of sum_over_t term should be C/L. This "
           "variable, if set to False, causes the term to change to square{C/L - sum_over_t{alpha_t_i}}. "
           "The default value is True in conformance with the paper.",
          boolean,
          True),
        PD('pLambda', 'Lambda value for alpha penalty',
           decimal(0),
           0.0001)   
)

HYPER = dlc.HyperParams(HYPER_PD,
        ## Overrides of default values.
        ## FYI: By convention, all boolean params' default value is True
        {
            'att_weighted_gather': True,
            'sum_logloss': True,
            'MeanSumAlphaEquals1': True,
            'output_follow_paper': True
        })
print HYPER

{'init_layers': 1, 'init_c_activation': 'tanh', 'MeanSumAlphaEquals1': True, 'att_weighted_gather': True, 'init_1_n': 512, 'init_1_activation': 'tanh', 'keep_prob': 1.0, 'init_dropout_rate': 0.2, 'att_layers': 1, 'init_h_activation': 'tanh', 'init_1_dropout_rate': 0.0, 'embeddings_initializer': 'glorot_uniform', 'image_shape': (120, 1075, 3), 'decoder_out_layers': 1, 'att_share_weights': True, 'att_1_n': 512, 'att_weights_initializer': 'glorot_normal', 'init_c_dropout_rate': 0.2, 'B': 128, 'D': 512, 'H': 3, 'K': 556, 'L': 99, 'pLambda': 0.0001, 'init_h_dropout_rate': 0.2, 'sum_logloss': True, 'output_follow_paper': True, 'W': 33, 'att_activation': 'tanh', 'm': 64, 'n': 1000, 'decoder_lstm_peephole': False, 'output_activation': 'tanh', 'output_1_n': 64}
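
The `MeanSumAlphaEquals1` description can be checked numerically. The sketch below assumes that `C` is the number of decoding time steps (the notebook does not define it) and uses random attention weights. Since each time step's weights sum to 1 over the `L` locations, the per-location totals must average `C/L`, so a target of 1 is only matched on average when `C == L`.

```python
import numpy as np

rng = np.random.RandomState(0)
C, L = 40, 99                       # C: assumed number of decoding steps; L: image locations

# Random attention weights: each row (one time step) sums to 1 over the L locations.
logits = rng.randn(C, L)
alpha = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

sum_over_t = alpha.sum(axis=0)      # shape (L,): total attention each location received
mean_sum = sum_over_t.mean()        # equals C / L because all of alpha sums to C

penalty_paper = np.sum((1.0 - sum_over_t) ** 2)             # target of 1 per location
penalty_scaled = np.sum((float(C) / L - sum_over_t) ** 2)   # target of C/L per location
```

The `C/L` target can never give a larger penalty than the target of 1 for the same weights, because the mean minimizes the sum of squared deviations from a constant.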


### Encoder Model
[VGG ConvNet (16 or 19)](http://www.robots.ox.ac.uk/~vgg/research/very_deep/) without the top-3 layers
* Pre-initialized with the VGG weights but allowed to train
* The ConvNet outputs $D$-dimensional vectors in a $W \times H$ grid, where $W$ and $H$ are scaled-down dimensions of the input image (due to 5 max-pool layers). Defining $L \equiv W \cdot H$, the ConvNet output represents $L$ locations of the image, $i \in [1,L]$, and correspondingly outputs $L$ annotation vectors $a_i$, each of size $D$.

The conv-net is *not trained* in the original paper; therefore the images can be preprocessed separately and the conv-net outputs fed directly into the model.
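
The `H` and `W` defaults in `HYPER` follow from the image size and the five pooling stages. The helper below is a hypothetical sketch, not part of the notebook. It assumes 'same'-padded convolutions that preserve spatial size, and 2x2 stride-2 max-pooling that floors odd sizes.

```python
def feature_map_size(height, width, num_pools=5):
    """Spatial size after `num_pools` 2x2, stride-2 max-pool layers.

    Assumes the convolutions do not change the spatial size and that
    pooling floors odd dimensions.
    """
    for _ in range(num_pools):
        height, width = height // 2, width // 2
    return height, width

H, W = feature_map_size(120, 1075)   # image_shape from HYPER is (120, 1075, 3)
L = H * W                            # number of annotation vectors per image
```

For the default image shape this gives `H = 3`, `W = 33` and `L = 99`, matching the values printed by `HYPER`.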

### Input Generator

In [5]:
def make_batch_list(df_, batch_size_):
    ## Make a list of batches
    bin_lens = sorted(df_.bin_len.unique())
    bin_counts = [df_[df_.bin_len==l].shape[0] for l in bin_lens]
    batch_list = []
    for i in range(len(bin_lens)):
        bin_ = bin_lens[i]
        num_batches = (bin_counts[i] // batch_size_)
        ## Just making sure bin size is integral multiple of batch_size.
        ## This is not a requirement for this function to operate, rather
        ## is a way of possibly catching data-corrupting bugs
        assert (bin_counts[i] % batch_size_) == 0
        batch_list.extend([(bin_, j) for j in range(num_batches)])

    np.random.shuffle(batch_list)
    return batch_list

class ShuffleIterator(object):
    def __init__(self, df_, batch_size_):
        self._df = df_.sample(frac=1)
        self._batch_size = batch_size_
        self._batch_list = make_batch_list(self._df, batch_size_)
        self._next_pos = 0
        self._num_items = (df_.shape[0] // batch_size_)
        self.lock = threading.Lock()
        
#     def __iter__(self):
#         return self
    
    def next(self):
        ## This is an infinite iterator
        with self.lock:
            if self._next_pos >= self._num_items:
                ## Recompose the batch-list
                ## Shuffle the samples
                self._df = self._df.sample(frac=1)
                self._batch_list = make_batch_list(self._df, self._batch_size)
                self._next_pos %= self._num_items
            next_pos = self._next_pos
            self._next_pos += 1
        
        batch = self._batch_list[next_pos]
        df_bin = self._df[self._df.bin_len == batch[0]]
        assert df_bin.bin_len.iloc[batch[1]*self._batch_size] == batch[0]
        assert df_bin.bin_len.iloc[(batch[1]+1)*self._batch_size-1] == batch[0]
        return df_bin.iloc[batch[1]*self._batch_size : (batch[1]+1)*self._batch_size]

class ImageIterator(ShuffleIterator):
    def __init__(self, df_, batch_size_, image_dim_, image_dir_):
        ShuffleIterator.__init__(self, df_, batch_size_)
        self._im_dim = image_dim_
        self._image_dir = image_dir_

    @staticmethod
    def get_image_matrix(image_path_, height_, width_, padded_height_, padded_width_):
        MAX_PIXEL = 255.0 # Ensure this is a float literal
        ## Load image and convert to a 3-channel array
        im_ar = ndimage.imread(image_path_, mode='RGB')
        ## Normalize values to lie between -0.5 and 0.5.
        ## This is done in place of data whitening - i.e. normalizing to mean=0 and std-dev=0.5
        ## It is a very rough technique but legit for images
        im_ar = (im_ar - MAX_PIXEL/2.0) / MAX_PIXEL
        height, width, channels = im_ar.shape
        ## Check the loaded image against the dimensions recorded in the dataframe
        assert height == height_
        assert width == width_
        assert channels == 3
        if (height < padded_height_) or (width < padded_width_):
            ## Pad with 0.5, i.e. the normalized value of a white pixel (255)
            ar = np.full((padded_height_, padded_width_, channels), 0.5, dtype=np.float32)
            h = (padded_height_-height)//2
            ar[h:h+height, 0:width] = im_ar
            im_ar = ar

        return im_ar

    def next(self):
        df_batch = ShuffleIterator.next(self)[['image', 'height', 'width']]
        im_batch = []
        ## Each row is (image filename, height, width)
        for row in df_batch.itertuples(index=False):
            im_batch.append(self.get_image_matrix(os.path.join(self._image_dir, row[0]), row[1], row[2],
                                                  self._im_dim[0], self._im_dim[1]))
            
        return np.asarray(im_batch)

class FormulaIterator(ShuffleIterator):
    def __init__(self, df_, batch_size_, data_dir_, seq_filename_):
        ShuffleIterator.__init__(self, df_, batch_size_)
        self._seq_data = pd.read_pickle(os.path.join(data_dir_, seq_filename_))
        
    def next(self):
        ## Series of bin_len values; all rows of a batch share the same bin_len
        sr_batch = ShuffleIterator.next(self)['bin_len']
        bin_len = sr_batch.iloc[0]
        ## Select this batch's rows by index label
        return self._seq_data[bin_len].loc[sr_batch.index].values
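
The batching scheme used by the iterators above groups samples by `bin_len`, so every batch holds sequences of a single padded length. It can be illustrated without pandas. The function below is a simplified, hypothetical re-implementation for illustration only: unlike `make_batch_list`, it drops an incomplete tail batch instead of asserting that none exists, and it returns sample indices rather than `(bin, batch_number)` pairs.

```python
import random
from collections import defaultdict

def make_batches(bin_lens, batch_size, seed=None):
    """Group sample indices by bin length, then cut each bin into fixed-size batches.

    bin_lens: list where bin_lens[i] is the (padded) sequence length of sample i.
    Returns a shuffled list of batches; every batch holds indices from a single bin.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for idx, bin_len in enumerate(bin_lens):
        bins[bin_len].append(idx)

    batches = []
    for bin_len in sorted(bins):
        indices = bins[bin_len]
        rng.shuffle(indices)                       # shuffle samples inside the bin
        # Keep only complete batches from this bin.
        for start in range(0, len(indices) - batch_size + 1, batch_size):
            batches.append(indices[start:start + batch_size])

    rng.shuffle(batches)                           # shuffle the order of the batches
    return batches

bin_lens = [10] * 4 + [20] * 6 + [30] * 2
batches = make_batches(bin_lens, batch_size=2, seed=0)
```

Keeping one bin per batch is what lets the model train for a different number of time steps on each batch.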

### Decoder Model
A dense (FC) attention model: The deterministic soft-attention model of the paper computes $\alpha_{t,i}$ which is used to select or blend the $a_i$ vectors before being fed as inputs to the decoder LSTM network (see below).
* Inputs to the attention model are $a_i$ and $h_{t-1}$ (previous hidden state of the LSTM network - see below), and $$\alpha_{t,i} = \mathrm{softmax} ( f_{att}(a_i, h_{t-1}) )$$
* Note that the model $f_{att}$ shares weights across all $a_i$ (i.e. for all $i = 1, \dots, L$). Therefore the shared weight matrix applied to every $a_i$ has shape (D, D), while the shape of $a$ is (B, L, D), where B is the batch size. The weight matrix of $h_{t-1}$ is separate and has the expected shape (n, D). Sharing weights across the $a_i$ means that the same scoring function is applied at every image location.
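
The weight shapes described above can be checked with a toy NumPy example. It shows that, before the activation, applying one shared $(D+n, \text{units})$ matrix to the concatenation $[a_i, h_{t-1}]$ is the same as projecting $a$ with a $(D, \text{units})$ block and $h_{t-1}$ with an $(n, \text{units})$ block and adding the results. All sizes and weights here are made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
B, L, D, n, units = 2, 5, 4, 3, 6      # toy sizes; the notebook uses L=99, D=512, n=1000

a = rng.randn(B, L, D)                 # annotation vectors
h = rng.randn(B, n)                    # previous hidden state
W = rng.randn(D + n, units)            # one weight matrix shared by all L locations
W_a, W_h = W[:D], W[D:]                # its (D, units) and (n, units) blocks

# Concatenation approach: broadcast h to every location, concatenate, apply shared weights.
h_tiled = np.tile(h[:, None, :], (1, L, 1))            # (B, L, n)
ah = np.concatenate([a, h_tiled], axis=-1)             # (B, L, D+n)
out_concat = ah.dot(W)                                 # (B, L, units)

# Split approach: project a and h separately, then add with broadcasting over L.
out_split = a.dot(W_a) + h.dot(W_h)[:, None, :]        # (B, L, units)
```

The equivalence holds for the first layer only. With more than one attention layer the two formulations are no longer the same network.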

A Decoder model: A conditioned LSTM that outputs probabilities of the text tokens $y_t$ at each step. The LSTM is conditioned upon $z_t = \sum_{i=1}^{L} \alpha_{t,i} a_i$ and takes the previous hidden state $h_{t-1}$ as input. In addition, an embedding of the previous output $Ey_{t-1}$ is also input to the LSTM. At training time, $y_{t-1}$ is taken from the training samples, while at inference time it is fed back from the previously predicted token.
* $y$ is taken from a fixed vocabulary of K words. An embedding matrix $E$ is used to narrow its representation to an $m$ dimensional dense vector. The embedding weights $E$ are learnt end-to-end by the model as well.
* The decoder LSTM uses a deep layer between $h_t$ and $y_t$. It is called a deep output layer and is described in [section 3.2.2 of this paper](https://www.semanticscholar.org/paper/How-to-Construct-Deep-Recurrent-Neural-Networks-Pascanu-G%C3%BCl%C3%A7ehre/533ee188324b833e059cb59b654e6160776d5812). That is:
$$ p(y_t) = \mathrm{softmax} \Big( f_{out}(Ey_{t-1}, h_t, \hat{z}_t) \Big) $$
* Optionally $z_t = \beta \sum_i^L(\alpha_{t,i}.a_i)$ where $\beta = \sigma(f_{\beta}(h_{t-1}))$ is a scalar used to modulate the strength of the context. It turns out that for the original use-case of caption generation, the network would learn to emphasize objects by turning up the value of this scalar when it was focusing on objects. It is not clear at this time whether we'll need this feature for im2latex.
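
As a rough illustration of the last two bullets, the NumPy sketch below computes a $\beta$-gated context vector and passes it through a one-layer deep output. The weights `w_beta`, `W_o` and `W_y` are hypothetical, `h_t` is a random stand-in for the LSTM output, and $f_{\beta}$ is assumed to be a single linear map.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.RandomState(0)
B, L, D, n, m, K = 2, 5, 8, 6, 4, 10   # toy sizes; notebook defaults are far larger

a = rng.randn(B, L, D)                 # annotation vectors
alpha = softmax(rng.randn(B, L))       # attention weights for this step
h_prev = rng.randn(B, n)               # h_{t-1}
h_t = rng.randn(B, n)                  # stand-in for the LSTM output at step t
Ey_prev = rng.randn(B, m)              # embedding of the previous token

# Optional scalar gate beta = sigmoid(f_beta(h_{t-1})); f_beta is a hypothetical linear map.
w_beta = rng.randn(n, 1)
beta = sigmoid(h_prev.dot(w_beta))                 # (B, 1), values in (0, 1)
z_t = beta * np.einsum('bl,bld->bd', alpha, a)     # gated context, (B, D)

# Deep output: affine map of [h_t, z_t] to size m, add Ey, tanh, then softmax over K tokens.
W_o = rng.randn(n + D, m)
W_y = rng.randn(m, K)
o_t = np.tanh(np.concatenate([h_t, z_t], axis=-1).dot(W_o) + Ey_prev)
p_y = softmax(o_t.dot(W_y))                        # (B, K) token probabilities
```

Since `beta` is a single scalar per sample, it rescales the whole context vector without changing how attention is distributed over locations.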


In [6]:
class Im2LatexModel(object):
    """
    One timestep of the decoder model. The entire function can be seen as a complex RNN-cell
    that includes a LSTM stack and an attention model.
    """
    def __init__(self):
        self._define_params()
        self._numSteps = 0

    def _define_attention_params(self):
        """Define Shared Weights for Attention Model"""
        ## 1) Dense layer, 2) Optional gather layer and 3) softmax layer

        ## Renaming HyperParams for convenience
        B = HYPER.B
        n = HYPER.n
        L = HYPER.L
        D = HYPER.D
        m = HYPER.m
        
        ## _att_dense_layer is a 0-indexed list, whereas hyper-param names (att_1_n, ...) start from 1
        self._att_dense_layer = []

        if HYPER.att_share_weights:
        ## Here we'll effectively create L MLP stacks all sharing the same weights. Each
        ## stack receives a concatenated vector of a(l) and h as input.
            dim = D+n
            for i in range(1, HYPER.att_layers+1):
                n_units = HYPER['att_%d_n'%(i,)]; assert(n_units <= dim)
                self._att_dense_layer.append(Dense(n_units, activation=HYPER.att_activation,
                                                   batch_input_shape=(B,L,dim)))
                dim = n_units
            ## Optional gather layer (that comes after the Dense Layer)
            if HYPER.att_weighted_gather:
                self._att_gather_layer = Dense(1, activation='linear') # output shape = (B, L, 1)
        else:
            ## concatenate a and h_prev and pass them through a MLP. This is different than the theano
            ## implementation of the paper because we flatten a from (B,L,D) to (B,L*D). Hence each element
            ## of the L*D vector receives its own weight because the effective weight matrix here would be
            ## shape (L*D, num_dense_units) as compared to (D, num_dense_units) as in the shared_weights case
            dim = L*D+n        
            for i in range(1, HYPER.att_layers+1):
                n_units = HYPER['att_%d_n'%(i,)]; assert(n_units <= dim)
                self._att_dense_layer.append(Dense(n_units, activation=HYPER.att_activation,
                                                   batch_input_shape=(B,dim)))
                dim = n_units
        
        assert dim >= L
        self._att_softmax_layer = Dense(L, activation='softmax', name='alpha')
        
    def _build_attention_model(self, a, h_prev):
        B = HYPER.B
        n = HYPER.n
        L = HYPER.L
        D = HYPER.D
        h = h_prev

        assert K.int_shape(h_prev) == (B, n)
        assert K.int_shape(a) == (B, L, D)

        ## For #layers > 1 this will endup being different than the paper's implementation
        if HYPER.att_share_weights:
            """
            Here we'll effectively create L MLP stacks all sharing the same weights. Each
            stack receives a concatenated vector of a(l) and h as input.

            TODO: We could also
            use 2D convolution here with a kernel of size (1,D) and stride=1 resulting in
            an output dimension of (L,1,depth) or (B, L, 1, depth) including the batch dimension.
            That may be more efficient.
            """
            ## h.shape = (B,n). Convert it to (B,1,n) and then broadcast to (B,L,n) in order
            ## to concatenate with feature vectors of 'a' whose shape=(B,L,D)
            h = K.tile(K.expand_dims(h, axis=1), (1,L,1))
            ## Concatenate a and h. Final shape = (B, L, D+n)
            ah = tf.concat([a,h], -1)
            for i in range(HYPER.att_layers) :
                ah = self._att_dense_layer[i](ah)

            ## Below is roughly how it is implemented in the code released by the authors of the paper
#                 for i in range(1, HYPER.att_a_layers+1):
#                     a = Dense(HYPER['att_a_%d_n'%(i,)], activation=HYPER.att_actv)(a)
#                 for i in range(1, HYPER.att_h_layers+1):
#                     h = Dense(HYPER['att_h_%d_n'%(i,)], activation=HYPER.att_actv)(h)    
#                ah = a + K.expand_dims(h, axis=1)

            ## Gather all activations across the features; go from (B, L, dim) to (B,L,1).
            ## One could've just summed/averaged them all here, but the paper uses yet
            ## another set of weights to accomplish this. So we'll keep that as an option.
            if HYPER.att_weighted_gather:
                ah = self._att_gather_layer(ah) # output shape = (B, L, 1)
                ah = K.squeeze(ah, axis=2) # output shape = (B, L)
            else:
                ah = K.mean(ah, axis=2) # output shape = (B, L)

        else: # weights not shared across L
            ## concatenate a and h_prev and pass them through a MLP. This is different than the theano
            ## implementation of the paper because we flatten a from (B,L,D) to (B,L*D). Hence each element
            ## of the L*D vector receives its own weight because the effective weight matrix here would be
            ## shape (L*D, num_dense_units) as compared to (D, num_dense_units) as in the shared_weights case

            ## Concatenate a and h. Final shape will be (B, L*D+n)
            ah = K.concatenate([K.batch_flatten(a), h])
            for i in range(HYPER.att_layers):
                ah = self._att_dense_layer[i](ah)
            ## At this point, ah.shape = (B, dim)

        alpha = self._att_softmax_layer(ah) # output shape = (B, L)
        assert K.int_shape(alpha) == (B, L)
        return alpha
            
    def _define_output_params(self):
        ## Renaming HyperParams for convenience
        B = HYPER.B
        n = HYPER.n
        L = HYPER.L
        D = HYPER.D
        m = HYPER.m
        Kv= HYPER.K

        ## First layer of output MLP
        ## Affine transformation of h_t and z_t from size n/D to m followed by a summation
        self._output_affine = Dense(m, activation='linear', batch_input_shape=(B,n+D)) # output size = (B, m)
        ## non-linearity for the first layer - will be chained by the _call function after adding Ex / Ey
        self._output_activation = Activation(HYPER.output_activation)

        ## Additional layers if any
        if HYPER.decoder_out_layers > 1:
            self._output_dense = []
            for i in range(1, HYPER.decoder_out_layers):
                ## HYPER defines a single 'output_activation' shared by all output layers
                self._output_dense.append(Dense(m, activation=HYPER.output_activation, 
                                           batch_input_shape=(B,m))
                                         )

        ## Final softmax layer
        self._output_softmax = Dense(Kv, activation='softmax', batch_input_shape=(B,m))
        
    def _build_output_layer(self, Ex_t, h_t, z_t):
        ## Renaming HyperParams for convenience
        B = HYPER.B
        n = HYPER.n
        L = HYPER.L
        D = HYPER.D
        m = HYPER.m
        Kv =HYPER.K
        
        assert K.int_shape(Ex_t) == (B, m)
        assert K.int_shape(h_t) == (B, n)
        assert K.int_shape(z_t) == (B, D)
        
        ## First layer of output MLP
        ## Affine transformation of h_t and z_t from size n/D to size m followed by a summation
        o_t = Dense(m, activation='linear', batch_input_shape=(B,n+D))(tf.concat([h_t, z_t], -1)) # output size = (B, m)
        o_t = o_t + Ex_t
        
        ## non-linearity for the first layer
        o_t = Activation(HYPER.output_activation)(o_t)

        ## Subsequent MLP layers
        if HYPER.decoder_out_layers > 1:
            for i in range(1, HYPER.decoder_out_layers):
                o_t = Dense(m, 
                            activation=HYPER.output_activation, 
                            batch_input_shape=(B,m))(o_t)
                
        ## Final logits: kept linear (unsquashed) since softmax is applied to them below
        logits_t = Dense(Kv, activation='linear', batch_input_shape=(B,m))(o_t) # shape = (B,K)
        assert K.int_shape(logits_t) == (B, Kv)
        
        # softmax
        return tf.nn.softmax(logits_t), logits_t

    def _define_init_params(self):
        ## As per the paper, this is a two-headed MLP. It has a stack of common layers at the bottom
        ## and two output layers at the top - one each for h and c LSTM states.
        self._init_layer = []
        self._init_dropout = []
        for i in xrange(1, HYPER.init_layers):
            key = 'init_%d_'%i
            self._init_layer.append(Dense(HYPER[key+'n'], activation=HYPER[key+'activation']))
            ## Append None when there is no dropout so that both lists stay index-aligned
            if HYPER[key+'dropout_rate'] > 0.0:
                self._init_dropout.append(Dropout(HYPER[key+'dropout_rate']))
            else:
                self._init_dropout.append(None)

        ## Final layer for h
        self._init_h = Dense(HYPER['n'], activation=HYPER['init_h_activation'])
        if HYPER['init_h_dropout_rate'] > 0.0:
            self._init_h_dropout = Dropout(HYPER['init_h_dropout_rate'])

        ## Final layer for c
        self._init_c = Dense(HYPER['n'], activation=HYPER['init_c_activation'])
        if HYPER['init_c_dropout_rate'] > 0.0:
            self._init_c_dropout = Dropout(HYPER['init_c_dropout_rate'])

    def _build_init_layer(self, a):
        assert K.int_shape(a) == (HYPER.B, HYPER.L, HYPER.D)
        
        ################ Initializer MLP ################
        with tf.variable_scope('Initializer_MLP'):

            ## As per the paper, this is a two-headed MLP. It has a stack of common layers at the bottom,
            ## two output layers at the top - one each for h and c LSTM states.
            a = K.mean(a, axis=1) # final shape = (B, D)
            for i in xrange(1, HYPER.init_layers):
                key = 'init_%d_'%i
                ## Layer lists are 0-indexed whereas hyper-param names start at 1
                a = self._init_layer[i-1](a)
                if HYPER[key+'dropout_rate'] > 0.0:
                    a = self._init_dropout[i-1](a)

            init_c = self._init_c(a)
            if HYPER['init_c_dropout_rate'] > 0.0:
                init_c = self._init_c_dropout(init_c)

            init_h = self._init_h(a)
            if HYPER['init_h_dropout_rate'] > 0.0:
                init_h = self._init_h_dropout(init_h)

            assert K.int_shape(init_c) == (HYPER.B, HYPER.n)
            assert K.int_shape(init_h) == (HYPER.B, HYPER.n)

        return init_c, init_h
            
    def _embedding_lookup2(self, ids):
        B = HYPER.B
        m = HYPER.m
        assert self._embedding is not None
        assert K.int_shape(ids) == (B,)
        embedded = self._embedding(K.expand_dims(ids, axis=-1)) # (None,1,m)
        embedded = tf.squeeze(embedded, axis=1) # (None,m)
        embedded = tf.reshape(embedded, (B,m)) # (B,m)
        return embedded
    
    def _embedding_lookup(self, ids):
        B = HYPER.B
        m = HYPER.m
        assert self._embedding_matrix is not None
        assert K.int_shape(ids) == (B,)
        embedded = tf.nn.embedding_lookup(self._embedding_matrix, ids)
        embedded = tf.reshape(embedded, (B,m)) # (B,m)
        return embedded    
    
    def _define_params(self):
        ## Renaming HyperParams for convenience
        B = HYPER.B
        n = HYPER.n
        L = HYPER.L
        D = HYPER.D
        m = HYPER.m
        Kv = HYPER.K
        att_actv = HYPER.att_activation
        e_init = HYPER.embeddings_initializer

        ################ Attention Model ################
        with tf.variable_scope('Attention'):
            self._define_attention_params()
                
        ################ Embedding Layer ################
        with tf.variable_scope('Ey'):
            self._embedding = Embedding(Kv, m, 
                                        embeddings_initializer=e_init, 
                                        mask_zero=True, 
                                        input_length=1,
                                        batch_input_shape=(B,1)
                                        #input_shape=(1,)
                                        ) ## (B, 1, m)
            
            ## Above Embedding layer will get replaced by this one.
            self._embedding_matrix = tf.get_variable('Embedding_Matrix', (Kv, m))
        
        ################ Decoder LSTM Cell ################
        with tf.variable_scope('Decoder_LSTM'):
            #LSTM by Zaremba et. al 2014: http://arxiv.org/abs/1409.2329
            self._decoder_lstm = tf.contrib.rnn.LSTMBlockCell(n, forget_bias=1.0, 
                                                              use_peephole=HYPER.decoder_lstm_peephole)
            
        ################ Output Layer ################
        with tf.variable_scope('Decoder_Output_Layer'):
            self._define_output_params()

        ################ Initializer MLP ################
        with tf.variable_scope('Initializer_MLP'):
            self._define_init_params()
            
    def _build_rnn_step1(self, out_t_1, x_t, testing=False):
        """
        Builds tf graph for the first iteration of the RNN. Works for both training and testing graphs.
        """
        return self._build_rnn_step(out_t_1, x_t, isStep1=True, testing=testing)
        
    def _build_rnn_training_stepN(self, out_t_1, x_t):
        """
        Builds tf graph for the subsequent iterations of the RNN - training mode.
        """
        return self._build_rnn_step(out_t_1, x_t, isStep1=False, testing=False)
        
    def _build_rnn_testing_stepN(self, out_t_1, x_t):
        """
        Builds tf graph for the subsequent iterations of the RNN - testing mode.
        """
        return self._build_rnn_step(out_t_1, x_t, isStep1=False, testing=True)
        
    def _build_rnn_step(self, out_t_1, x_t, isStep1=False, testing=False):
        """
        TODO: Incorporate Dropout
        Builds/threads the tf graph for one RNN iteration.
        Conforms to the loop function fn required by tf.scan. Takes the previous LSTM states (h and c),
        the current input and the image annotations (a) as input and returns the states and outputs for the
        current time-step.
        Note that input(t) = Ey(t-1) and input(t=0) = Null. When training, the target output is used for Ey,
        whereas at prediction time (e.g. via beam search) the predicted output is fed back.
        Args:
            out_t_1 (tuple of tensors): Output returned by this function at the previous time-step.
            x_t (tensor): Input token ids for one time-step. Should be a tensor of shape (B,).
        Returns:
            out_t (tuple of tensors): (step, h_t, lstm_states_t, a, yProbs_t, yLogits_t, alpha_t), where
                yProbs_t has shape (B,K) and holds the probabilities of words/tokens. The remaining entries are
                the states needed in the next iteration of the RNN. lstm_states_t = (c_t, h_t), which means
                h_t is included twice in the returned tuple.
        """
        #x_t = input at t             # shape = (B,)
        step = out_t_1[0] + 1
        h_t_1 = out_t_1[1]            # shape = (B,n)
        lstm_states_t_1 = out_t_1[2]  # shape = ((B,n), (B,n)) = (c_t_1, h_t_1)
        a = out_t_1[3]                # shape = (B, L, D)
        if not isStep1: ## init_accum does not have everything
            yProbs_t_1 = out_t_1[4]           # shape = (B, Kv)
        #yLogits_t_1 = out_t_1[5]          # shape = (B, Kv)
        #alpha_t_1 = out_t_1[6]
        
        B = HYPER.B
        m = HYPER.m
        n = HYPER.n
        L = HYPER.L
        D = HYPER.D
        Kv = HYPER.K
        
        assert K.int_shape(h_t_1) == (B, n)
        assert K.int_shape(a) == (B, L, D)
        assert K.int_shape(lstm_states_t_1[1]) == (B, n)
        
        if not isStep1:
            assert K.int_shape(yProbs_t_1) == (B, Kv)
            tf.get_variable_scope().reuse_variables()
            if testing:
                x_t = tf.argmax(yProbs_t_1, axis=1)
        elif testing:
            tf.get_variable_scope().reuse_variables()
        
        ################ Attention Model ################
        with tf.variable_scope('Attention'):
            alpha_t = self._build_attention_model(a, h_t_1) # alpha.shape = (B, L)

        ################ Soft deterministic attention: z = alpha-weighted mean of a ################
        ## (B, L) batch_dot (B,L,D) -> (B, D)
        with tf.variable_scope('Phi'):
            z_t = K.batch_dot(alpha_t, a, axes=[1,1]) # z_t.shape = (B, D)

        ################ Embedding layer ################
        with tf.variable_scope('Ey'):
            Ex_t = self._embedding(K.expand_dims(x_t, axis=-1) ) # output.shape= (None,1,m)
            Ex_t = K.squeeze(Ex_t, axis=1) # output.shape= (None,m)
            Ex_t = K.reshape(Ex_t, (B,m)) # (B,m)
            
        ################ Decoder Layer ################
        with tf.variable_scope("Decoder_LSTM") as var_scope:
            (h_t, lstm_states_t) = self._decoder_lstm(K.concatenate((Ex_t, z_t)), lstm_states_t_1) # h_t.shape=(B,n)
            
        ################ Output Layer ################
        with tf.variable_scope('Output_Layer'):
            yProbs_t, yLogits_t = self._build_output_layer(Ex_t, h_t, z_t) # yProbs_t.shape = (B,K)
        
        assert K.int_shape(h_t) == (B, n)
        assert K.int_shape(a) == (B, L, D)
        assert K.int_shape(lstm_states_t[1]) == (B, n)
        assert K.int_shape(yProbs_t) == (B, Kv)
        assert K.int_shape(yLogits_t) == (B, Kv)
        assert K.int_shape(alpha_t) == (B, L)
        
        return step, h_t, lstm_states_t, a, yProbs_t, yLogits_t, alpha_t
        
    def _build_rnn_testing(self, a, y_s, init_c, init_h):
        return self._build_rnn(a, y_s, init_c, init_h, False)

    def _build_rnn_training(self, a, y_s, init_c, init_h):
        return self._build_rnn(a, y_s, init_c, init_h, True)

    def _build_rnn(self, a, y_s, init_c, init_h, training=True):
        B = HYPER.B
        L = HYPER.L
        D = HYPER.D
        n = HYPER.n
        assert K.int_shape(a) == (B, L, D)
        assert K.int_shape(y_s) == (B, None) # (B, T)
        assert K.int_shape(init_c) == (B, n)
        assert K.int_shape(init_h) == (B, n)
        
        # LSTMStateTuple Stores two elements: (c, h), in that order.
        init_lstm_states = tf.contrib.rnn.LSTMStateTuple(init_c, init_h)

        ## tf.scan requires time-dimension to be the first dimension
        
        y_s = K.permute_dimensions(y_s, (1, 0)) # (T, B)

        ################ Build x_s ################
        ## First step of x_s is zero indicating begin-of-sequence
        x_s = tf.zeros((1, tf.shape(y_s)[1]), dtype=tf.int32)
        if training:
            ## x_s is y_s shifted forward by 1 timestep
            ## The last time-step of y_s, which is zero indicating <eos>, gets removed so that
            ## x_s keeps the same length T as y_s.
            x_s = K.concatenate((x_s, y_s[0:-1]), axis=0) # (T, B)
            

        ################ Build RNN ################
        with tf.variable_scope('RNN'):
            initial_accum = (0, init_h, init_lstm_states, a)
            ## Weights are created in first step and then reused in subsequent steps.
            ## Hence we need to separate them out.
            step1_out = self._build_rnn_step1(initial_accum, x_s[0], testing=(not training))
            ## Subsequent steps in training are different than validation/testing/prediction
            ## Hence we need to separate them
            if training:
                stepN_out = tf.scan(self._build_rnn_training_stepN, x_s[1:], initializer=step1_out)
            else:
                ## Uses y_t_1 as input instead of x_t
                stepN_out = tf.scan(self._build_rnn_testing_stepN, x_s[1:], initializer=step1_out)

            yProbs1, yLogits1, alpha1 = step1_out[4], step1_out[5], step1_out[6]
            yProbsN, yLogitsN, alphaN = stepN_out[4], stepN_out[5], stepN_out[6]

            yProbs = K.concatenate([K.expand_dims(yProbs1, axis=0), yProbsN], axis=0)
            yLogits = K.concatenate([K.expand_dims(yLogits1, axis=0), yLogitsN], axis=0)
            alpha = K.concatenate([K.expand_dims(alpha1, axis=0), alphaN], axis=0)
        
            ## Switch the batch dimension back to first position - (B, T, ...)
            yProbs = K.permute_dimensions(yProbs, [1,0,2])
            yLogits = K.permute_dimensions(yLogits, [1,0,2])
            alpha = K.permute_dimensions(alpha, [1,0,2])
            
        return yProbs, yLogits, alpha
        
    def _build_loss(self, yLogits, y_s, alpha, sequence_lengths):
        assert K.int_shape(yLogits) == (HYPER.B, None, HYPER.K) # (B, T, K)
        assert K.int_shape(y_s) == (HYPER.B, None) # (B, T)
        assert K.int_shape(alpha) == (HYPER.B, None, HYPER.L) # (B, T, L)
        assert K.int_shape(sequence_lengths) == (HYPER.B,) # (B,)
        
        ################ Build Cost Function ################
        with tf.variable_scope('Cost'):
            sequence_mask = tf.sequence_mask(sequence_lengths, maxlen=tf.shape(y_s)[1], dtype=tf.float32) # (B, T)

            ## Masked negative log-likelihood of the sequence.
            ## Note that log(product(p_t)) = sum(log(p_t)), therefore taking the log of the
            ## joint sequence probability is the same as summing the log-probabilities of each time-step.

            ## Compute Sequence Log-Loss / Negative Log-Likelihood = -Log( product(p_t) ) = -sum(Log(p_t))
            if HYPER.sum_logloss:
                ## Here we do not normalize the log-loss across time-steps because the
                ## paper as well as its source-code do not do that.
                loss_vector = tf.contrib.seq2seq.sequence_loss(logits=yLogits, 
                                                               targets=y_s, 
                                                               weights=sequence_mask, 
                                                               average_across_timesteps=False,
                                                               average_across_batch=True)
                loss = tf.reduce_sum(loss_vector) # scalar
            else: ## Standard log perplexity (average per-word)
                loss = tf.contrib.seq2seq.sequence_loss(logits=yLogits, 
                                                               targets=y_s, 
                                                               weights=sequence_mask, 
                                                               average_across_timesteps=True,
                                                               average_across_batch=True)

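            ## Calculate the alpha penalty: lambda * sum_over_i(square(C/L - sum_over_t(alpha_i)))
            if HYPER.MeanSumAlphaEquals1:
                mean_sum_alpha_i = 1.0
            else:
                ## Expand to (B, 1) so that it broadcasts against sum_alpha_i of shape (B, L)
                mean_sum_alpha_i = tf.expand_dims(tf.cast(sequence_lengths, dtype=tf.float32) / HYPER.L, axis=1) # (B, 1)

            ## Expand the mask to (B, T, 1) so that it broadcasts against alpha of shape (B, T, L)
            masked_alpha = tf.multiply(alpha, tf.expand_dims(sequence_mask, axis=2)) # (B, T, L)
            sum_alpha_i = tf.reduce_sum(masked_alpha, axis=1, keep_dims=False) # (B, L)
            squared_diff = tf.squared_difference(sum_alpha_i, mean_sum_alpha_i) # (B, L)
            penalty = HYPER.pLambda * tf.reduce_sum(squared_diff, keep_dims=False) # scalar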
            
            cost = loss + penalty
            
            ################ Build Scoring Function ################
            ## People have used BLEU score, but that probably is not suitable for markup comparison
            ## Best of course is to compare images produced by the output markup.
            
            ## Compute CTC score with intermediate blanks collapsed (we've collapsed all blanks in our
            ## train/test sequences to a single space so we'll hopefully get a better comparison by
            ## using CTC.)
        
        return cost
    
    def _build_image_context(self, image_batch):
        ## Conv-net
        assert K.int_shape(image_batch) == (HYPER.B,) + HYPER.image_shape
        ################ Build VGG Net ################
        with tf.variable_scope('VGGNet'):
            # K.set_image_data_format('channels_last')
            convnet = VGG16(include_top=False, weights='imagenet', pooling=None, input_shape=HYPER.image_shape)
            convnet.trainable = False
            print 'convnet output_shape = ', convnet.output_shape
            a = convnet(image_batch)
            assert K.int_shape(a) == (HYPER.B, HYPER.H, HYPER.W, HYPER.D)

            ## Combine HxW into a single dimension L
            a = tf.reshape(a, shape=(HYPER.B or -1, HYPER.L, HYPER.D))
            assert K.int_shape(a) == (HYPER.B, HYPER.L, HYPER.D)
        
        return a
        
    def build(self):
        B = HYPER.B
        Kv = HYPER.K
        L = HYPER.L
        
        ## TODO: Introduce Beam Search
        ## TODO: Introduce Stochastic Learning
        
        rv = dlc.Properties()
        im = tf.placeholder(dtype=tf.float32, shape=(HYPER.B,) + HYPER.image_shape, name='image_batch')
        y_s = tf.placeholder(tf.int32, shape=(HYPER.B, None))
        #a = tf.placeholder(tf.float32, shape=(HYPER.B, HYPER.L, HYPER.D))
        Ts = tf.placeholder(tf.int32, shape=(HYPER.B,))

        a = self._build_image_context(im)
        init_c, init_h = self._build_init_layer(a)
        yProbs, yLogits, alpha = self._build_rnn_training(a, y_s, init_c, init_h)
        self._build_rnn_testing(a, y_s, init_c, init_h)
        loss = self._build_loss(yLogits, y_s, alpha, Ts)
        
        assert K.int_shape(yProbs) == (B, None, Kv)
        assert K.int_shape(yLogits) == (B, None, Kv)
        assert K.int_shape(alpha) == (B, None, L)
        
        rv.im = im
        rv.y_s = y_s
        rv.Ts = Ts
        rv.yProbs = yProbs
        rv.yLogits = yLogits
        rv.alpha = alpha
        
        return rv.freeze()
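
The cost assembled in `_build_loss` combines a masked negative log-likelihood with a penalty on the attention weights. The following is a minimal plain-Python sketch of that arithmetic, written with nested lists instead of tensors so that it runs without TensorFlow. The function name `masked_cost` and its arguments are hypothetical, and dividing the summed loss by the batch size is an assumption of this sketch; it may not match exactly how `tf.contrib.seq2seq.sequence_loss` normalizes.

```python
import math

def masked_cost(probs, targets, lengths, alpha, p_lambda,
                sum_logloss=True, mean_sum_alpha_equals_1=False):
    """Plain-Python sketch of a masked log-loss plus attention penalty.

    probs[b][t][k]  : predicted token probabilities, shape (B, T, K)
    targets[b][t]   : target token ids, shape (B, T)
    lengths[b]      : number of valid (unpadded) steps per sample, shape (B,)
    alpha[b][t][l]  : attention weights, shape (B, T, L)
    """
    B = len(probs)
    L = len(alpha[0][0])

    # Negative log-likelihood summed over valid steps only; padded steps are skipped.
    nll = 0.0
    for b in range(B):
        for t in range(lengths[b]):
            nll -= math.log(probs[b][t][targets[b][t]])

    if sum_logloss:
        loss = nll / B                    # assumption: sum over time, mean over batch
    else:
        loss = nll / float(sum(lengths))  # mean per valid token

    # Penalty: the attention each location receives, summed over the valid steps,
    # should stay close to a target value (1.0, or sequence_length / L).
    penalty = 0.0
    for b in range(B):
        target = 1.0 if mean_sum_alpha_equals_1 else lengths[b] / float(L)
        for l in range(L):
            sum_alpha = sum(alpha[b][t][l] for t in range(lengths[b]))
            penalty += (sum_alpha - target) ** 2

    return loss + p_lambda * penalty
```

With attention spread evenly over the locations the penalty vanishes, whereas attention that stays on one location for the whole sequence is penalized in proportion to `p_lambda`.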
        

In [7]:
# with tf.variable_scope('test_1'):
#     m = Im2LatexModel().build()
#     print 'yProbs shape = ', K.int_shape(m.yProbs)
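
The model class above feeds the decoder differently in training and in testing: during training the input at step t is the target token of step t-1, while at test time the most probable token of the previous step is fed back. A minimal plain-Python sketch of both modes follows. The names `shift_targets`, `greedy_decode` and `step_fn` are hypothetical stand-ins, and using id 0 for both begin-of-sequence and end-of-sequence follows the comments in `_build_rnn`.

```python
def shift_targets(y_s, bos_id=0):
    """Teacher forcing: the input at step t is the target at step t-1.

    y_s is a list of equally long token-id lists, shape (B, T). The first input
    is the begin-of-sequence id and the last target is dropped, so the result
    also has shape (B, T).
    """
    return [[bos_id] + list(seq[:-1]) for seq in y_s]


def greedy_decode(step_fn, max_len, bos_id=0, eos_id=0):
    """Inference: feed the most probable token of step t-1 back in at step t.

    step_fn is a hypothetical callable that maps the previous token id to a
    list of token probabilities; it stands in for one decoder step.
    """
    token, output = bos_id, []
    for _ in range(max_len):
        probs = step_fn(token)
        token = max(range(len(probs)), key=probs.__getitem__)  # argmax
        output.append(token)
        if token == eos_id:
            break
    return output
```

For example, `shift_targets([[5, 7, 9, 0]])` yields `[[0, 5, 7, 9]]`: inputs and targets have the same length, which is what lets the logits be compared step by step with `y_s`.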

In [8]:
Im2LatexRNNStateTuple = collections.namedtuple("Im2LatexRNNStateTuple", ('lstm_state', 'alpha'))

class Im2LatexDecoderRNN(tf.nn.rnn_cell.RNNCell):
    """
    One timestep of the decoder model. The entire function can be seen as a complex RNN-cell
    that includes a LSTM stack and an attention model.
    """

    def __init__(self, config, context, reuse=None):
        super(Im2LatexDecoderRNN, self).__init__(_reuse=reuse)
        self.C = config.copy().freeze()
        self._a = context ## Image features from the Conv-Net

        #LSTM by Zaremba et al. 2014: http://arxiv.org/abs/1409.2329
        self._LSTM_cell = tf.contrib.rnn.LSTMBlockCell

        assert K.int_shape(self._a) == (config.B, config.L, config.D)

    @property
    def state_size(self):
        n = self.C.n
        Kv = self.C.K
        L = self.C.L
    
        # lstm_states_t, alpha_t
        #return Im2LatexRNNStateTuple(tf.nn.rnn_cell.LSTMStateTuple(n, n), L)
        return ((n,n), L)

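    def zero_state(self, batch_size, dtype):
        with ops.name_scope(type(self).__name__ + "ZeroState", values=[batch_size]):
            ## self._LSTM_cell holds the cell class, not an instance, so the LSTM zero
            ## state (c, h) is built directly. The structure matches state_size: ((n,n), L).
            lstm_shape = tf.stack([batch_size, self.C.n])
            alpha_shape = tf.stack([batch_size, self.C.L])
            lstm_zero_state = (tf.zeros(lstm_shape, dtype=dtype), tf.zeros(lstm_shape, dtype=dtype))
            return (lstm_zero_state, tf.zeros(alpha_shape, dtype=dtype))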

    @property
    def output_size(self):
        # yLogits
        return self.C.K
       
    def _attention_model(self, a, h_prev):
        CONF = self.C
        B = CONF.B
        n = CONF.n
        L = CONF.L
        D = CONF.D
        h = h_prev

        assert K.int_shape(h_prev) == (B, n)
        assert K.int_shape(a) == (B, L, D)

        ## For #layers > 1 this will end up being different from the paper's implementation
        if CONF.att_share_weights:
            """
            Here we'll effectively create L MLP stacks all sharing the same weights. Each
            stack receives a concatenated vector of a(l) and h as input.

            TODO: We could also
            use 2D convolution here with a kernel of size (1,D) and stride=1 resulting in
            an output dimension of (L,1,depth) or (B, L, 1, depth) including the batch dimension.
            That may be more efficient.
            """
            ## h.shape = (B,n). Convert it to (B,1,n) and then broadcast to (B,L,n) in order
            ## to concatenate with feature vectors of 'a' whose shape=(B,L,D)
            h = K.tile(K.expand_dims(h, axis=1), (1,L,1))
            ## Concatenate a and h. Final shape = (B, L, D+n)
            ah = tf.concat([a,h], -1); dim = D+n
            for i in range(1, CONF.att_layers+1):
                n_units = CONF['att_%d_n'%(i,)]; assert(n_units <= dim)
                ah = Dense(n_units, activation=CONF.att_activation, batch_input_shape=(B,L,dim))(ah)
                dim = n_units
                
            ## Below is roughly how it is implemented in the code released by the authors of the paper
#                 for i in range(1, CONF.att_a_layers+1):
#                     a = Dense(CONF['att_a_%d_n'%(i,)], activation=CONF.att_actv)(a)
#                 for i in range(1, CONF.att_h_layers+1):
#                     h = Dense(CONF['att_h_%d_n'%(i,)], activation=CONF.att_actv)(h)    
#                ah = a + K.expand_dims(h, axis=1)

            ## Gather all activations across the features; go from (B, L, dim) to (B,L,1).
            ## One could've just summed/averaged them all here, but the paper uses yet
            ## another set of weights to accomplish this. So we'll keep that as an option.
            if CONF.att_weighted_gather:
                ah = Dense(1, activation='linear')(ah) # output shape = (B, L, 1)
                ah = K.squeeze(ah, axis=2) # output shape = (B, L)
            else:
                ah = K.mean(ah, axis=2) # output shape = (B, L)
                
            alpha = tf.nn.softmax(ah) # output shape = (B, L)
            
        else: # weights not shared across L
            ## concatenate a and h_prev and pass them through a MLP. This is different than the theano
            ## implementation of the paper because we flatten a from (B,L,D) to (B,L*D). Hence each element
            ## of the L*D vector receives its own weight because the effective weight matrix here would be
            ## shape (L*D, num_dense_units) as compared to (D, num_dense_units) as in the shared_weights case

            ## Concatenate a and h. Final shape will be (B, L*D+n)
            ah = K.concatenate([K.batch_flatten(a), h], axis=-1)
            dim = L*D+n
            for i in range(1, CONF.att_layers+1):
                n_units = CONF['att_%d_n'%(i,)]; assert(n_units <= dim)
                ah = Dense(n_units, activation=CONF.att_activation, batch_input_shape=(B,dim))(ah)
                dim = n_units
            ## At this point, ah.shape = (B, dim)
            assert dim >= L
            ## NOTE: An extra dense layer is not needed if dim == L. Simply a softmax activation would
            ## suffice in that case.
            alpha = Dense(L, activation='softmax', name='alpha')(ah) # output shape = (B, L)
        
        assert K.int_shape(alpha) == (B, L)
        return alpha

    def _build_decoder_lstm(self, Ex_t, z_t, lstm_states_t_1):
        """Represents invocation of the decoder lstm. (h_t, lstm_states_t) = *(z_t|Ex_t, lstm_states_t_1)"""
        CONF = self.C
        m = self.C.m
        D = self.C.D
        B = self.C.B
        n = self.C.n
        
        inputs_t = K.concatenate((Ex_t, z_t))
        assert K.int_shape(inputs_t) == (B, m+D)
        assert K.int_shape(lstm_states_t_1[1]) == (B, n)
        
        ## TODO: Make this multi-layered
        (h_t, lstm_states_t) = self._LSTM_cell(n, forget_bias=1.0,
                                            use_peephole=CONF.decoder_lstm_peephole)(inputs_t, lstm_states_t_1)
        return (h_t, lstm_states_t)

    def _build_output_layer(self, Ex_t, h_t, z_t):
        
        ## Renaming HyperParams for convenience
        CONF = self.C
        B = self.C.B
        n = self.C.n
        L = self.C.L
        D = self.C.D
        m = self.C.m
        Kv =self.C.K
        
        assert K.int_shape(Ex_t) == (B, m)
        assert K.int_shape(h_t) == (B, n)
        assert K.int_shape(z_t) == (B, D)
        
        ## First layer of output MLP
        if CONF.output_follow_paper: ## Follow the paper.
            ## Affine transformation of h_t and z_t from size n/D to bring it down to m
            o_t = Dense(m, activation='linear', batch_input_shape=(B,n+D))(tf.concat([h_t, z_t], -1)) # o_t: (B, m)
            ## h_t and z_t are both dimension m now. So they can now be added to Ex_t.
            o_t = o_t + Ex_t # Paper does not multiply this with weights - weird.
            ## non-linearity for the first layer
            o_t = Activation(CONF.output_activation)(o_t)
            dim = m
        else: ## Use a straight FC layer
            o_t = K.concatenate((Ex_t, h_t, z_t)) # (B, m+n+D)
            o_t = Dense(CONF.output_1_n, activation=CONF.output_activation, batch_input_shape=(B,D+m+n))(o_t)
            dim = CONF.output_1_n
            
        ## Subsequent MLP layers
        if CONF.decoder_out_layers > 1:
            for i in range(2, CONF.decoder_out_layers+1):
                o_t = Dense(m, activation=CONF.output_activation, 
                            batch_input_shape=(B,dim))(o_t)
                dim = m ## Track the width of o_t for the next layer
                
        ## Final logits layer. Kept linear since the softmax is applied to the logits downstream.
        logits_t = Dense(Kv, activation='linear', batch_input_shape=(B,dim))(o_t) # shape = (B,K)
        assert K.int_shape(logits_t) == (B, Kv)
        
        # return tf.nn.softmax(logits_t), logits_t
        return logits_t

    def call(self, inputs, state):
        """
        TODO: Incorporate Dropout
        Builds/threads tf graph for one RNN iteration.
        Takes in previous lstm states (h and c),
        the current input and the image annotations (a) as input and outputs the states and outputs for the
        current timestep.
        Note that input(t) = Ey(t-1) and input(t=0) = Null. When training, the target output is used for Ey,
        whereas at prediction time (e.g. via beam search) the predicted output is fed back.
        """

        Ex_t = inputs                          # shape = (B,m)
        state = Im2LatexRNNStateTuple(state[0], state[1])
        lstm_states_t_1 = state.lstm_state   # shape = ((B,n), (B,n)) = (c_t_1, h_t_1)
        alpha_t_1 = state.alpha            # shape = (B, L)
        h_t_1 = lstm_states_t_1[1]
        a = self._a

        CONF = self.C
        B = CONF.B
        m = CONF.m
        n = CONF.n
        L = CONF.L
        D = CONF.D
        Kv =CONF.K

        print 'shape(Ex_t) = ', K.int_shape(Ex_t)
        assert K.int_shape(Ex_t) == (B,m)
        assert K.int_shape(h_t_1) == (B, n)
        assert K.int_shape(lstm_states_t_1[1]) == (B, n)
        assert K.int_shape(alpha_t_1) == (B, L)

        ################ Attention Model ################
        with tf.variable_scope('Attention'):
            alpha_t = self._attention_model(a, h_t_1) # alpha.shape = (B, L)

        ################ Soft deterministic attention: z = alpha-weighted mean of a ################
        ## (B, L) batch_dot (B,L,D) -> (B, D)
        with tf.variable_scope('Phi'):
            z_t = K.batch_dot(alpha_t, a, axes=[1,1]) # z_t.shape = (B, D)

        ################ Decoder Layer ################
        with tf.variable_scope("Decoder_LSTM") as var_scope:
            (h_t, lstm_states_t) = self._build_decoder_lstm(Ex_t, z_t, lstm_states_t_1) # h_t.shape=(B,n)

        ################ Output Layer ################
        with tf.variable_scope('Output_Layer'):
            yLogits_t = self._build_output_layer(Ex_t, h_t, z_t) # yLogits_t.shape = (B,K)

        assert K.int_shape(h_t) == (B, n)
        assert K.int_shape(lstm_states_t.h) == (B, n)
        assert K.int_shape(lstm_states_t.c) == (B, n)
        #assert K.int_shape(yProbs_t) == (B, Kv)
        assert K.int_shape(yLogits_t) == (B, Kv)
        assert K.int_shape(alpha_t) == (B, L)

        return yLogits_t, (tuple(lstm_states_t), alpha_t)
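
The core of each decoder step in `call` is the soft deterministic attention: unnormalized scores per image location are turned into weights $\alpha_{t,i}$ by a softmax, and the context $z_t = \sum_i \alpha_{t,i} a_i$ is the weighted sum of the annotation vectors. The sketch below reproduces only that arithmetic for a single sample in plain Python. The function names are hypothetical, and the scores are taken as given rather than computed by the attention MLP.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention_context(scores, a):
    """Blend annotation vectors with softmax weights.

    scores[l] : unnormalized attention score for location l, shape (L,)
    a[l][d]   : annotation vectors from the conv-net, shape (L, D)
    Returns (alpha, z) with alpha of shape (L,) and z of shape (D,).
    """
    alpha = softmax(scores)
    D = len(a[0])
    z = [sum(alpha[l] * a[l][d] for l in range(len(a))) for d in range(D)]
    return alpha, z
```

Equal scores give a plain average of the annotation vectors, while one dominant score makes `z` approach the annotation vector of that single location. This is the same contraction that `K.batch_dot(alpha_t, a, axes=[1,1])` performs per batch entry.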


In [9]:
def test_rnn():
    B = HYPER.B
    Kv = HYPER.K
    L = HYPER.L

    ## TODO: Introduce Beam Search
    ## TODO: Introduce Stochastic Learning

    m = Im2LatexModel()
    rv = dlc.Properties()
    im = tf.placeholder(dtype=tf.float32, shape=(HYPER.B,) + HYPER.image_shape, name='image_batch')
    y_s = tf.placeholder(tf.int32, shape=(HYPER.B, None))
    print 'y_s[:,0] shape: ', K.int_shape(y_s[:,0])
    print 'embedding_lookup shape: ', K.int_shape(m._embedding_lookup(y_s[:,0]))
    a = m._build_image_context(im)
    rnn = Im2LatexDecoderRNN(HYPER, a)
    print 'rnn ', rnn.state_size, rnn.output_size
    #init_c, init_h = m._build_init_layer(a)
    init_c = tf.placeholder(tf.float32, shape=(HYPER.B, HYPER.n))
    init_h = tf.placeholder(tf.float32, shape=(HYPER.B, HYPER.n))
    init_alpha = tf.placeholder(tf.float32, shape=(HYPER.B, HYPER.L))
    
    decoder = tf.contrib.seq2seq.BeamSearchDecoder(rnn, 
                                                   m._embedding_lookup,
                                                   y_s[:,0],
                                                   0,
                                                   ((init_c, init_h), init_alpha),
                                                   beam_width=10)
    print 'decoder._start_tokens: ', K.int_shape(decoder._start_tokens)
    print 'decoder._start_inputs: ', K.int_shape(decoder._start_inputs)
    final_outputs, final_state, final_sequence_lengths = tf.contrib.seq2seq.dynamic_decode(decoder,
                                                                                           #maximum_iterations=10,
                                                                                           swap_memory=True)
    print 'final_outputs: ', K.int_shape(final_outputs.predicted_ids)
    print 'final_state: ', (final_state)
    print 'final_sequence_lengths', (final_sequence_lengths)

with tf.variable_scope('test_run18', reuse=False):
    test_rnn()


y_s[:,0] shape:  (128,)
embedding_lookup shape:  (128, 64)
convnet output_shape =  (None, 3, 33, 512)
rnn  ((1000, 1000), 99) 556


ValueError: Unexpected behavior when reshaping between beam width and batch size.  The reshaped tensor has shape: (128, 10, 100).  We expected it to have shape (batch_size, beam_width, depth) == (128, 10, 1000).  Perhaps you forgot to create a zero_state with batch_size=encoder_batch_size * beam_width?
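
The numbers in this error are consistent with an initial state that still has 128 rows instead of 128 * 10: reshaping a (128, 1000) tensor to (128, 10, -1) leaves a depth of 100. As the message suggests, the state passed to the decoder would need one copy per beam; in TF 1.x `tf.contrib.seq2seq.tile_batch(t, multiplier=beam_width)` is the usual helper for this. The plain-Python sketch below only mimics that row ordering and the shape arithmetic; both function names are hypothetical. It seems likely that the image context `a` held by the cell would need the same tiling, but that is an assumption not verified here.

```python
def tile_batch(rows, multiplier):
    """Repeat every batch entry `multiplier` times, keeping the copies adjacent.

    rows has shape (B, depth); the result has shape (B * multiplier, depth).
    """
    return [list(row) for row in rows for _ in range(multiplier)]


def depth_after_reshape(num_rows, depth, batch_size, beam_width):
    """Depth obtained when (num_rows, depth) is reshaped to (batch_size, beam_width, -1)."""
    return (num_rows * depth) // (batch_size * beam_width)
```

Here `depth_after_reshape(128, 1000, 128, 10)` gives the 100 reported in the error, and `depth_after_reshape(128 * 10, 1000, 128, 10)` gives the expected 1000 once every state row has been repeated `beam_width` times.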

In [None]:
# ## Conv-net
# # K.set_image_data_format('channels_last')
# #image_input = Input(shape=HYPER.image_shape, name='image_input')
# image_input = tf.placeholder(dtype=tf.float32, shape=(HYPER.B,) + HYPER.image_shape, name='image_batch2')
# convnet = VGG16(include_top=False, weights='imagenet', pooling=None, input_shape=HYPER.image_shape)
# convnet.trainable = False
# print 'convnet output_shape = ', convnet.output_shape
# a = convnet(image_input)
# a

# End