# im2latex(S): Deep Learning Model

&copy; Copyright 2017 Sumeet S Singh

    This file is part of the im2latex solution (by Sumeet S Singh in particular since there are other solutions out there).

    This program is free software: you can redistribute it and/or modify
    it under the terms of the Affero GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    Affero GNU General Public License for more details.

    You should have received a copy of the Affero GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.

## The Model
* Follows the [Show, Attend and Tell paper](https://www.semanticscholar.org/paper/Show-Attend-and-Tell-Neural-Image-Caption-Generati-Xu-Ba/146f6f6ed688c905fb6e346ad02332efd5464616)
* [VGG ConvNet (16 or 19)](http://www.robots.ox.ac.uk/~vgg/research/very_deep/) without the top-3 layers
    * Pre-initialized with the VGG weights. The weights could in principle be fine-tuned, but this notebook keeps the conv-net frozen.
    * The ConvNet outputs $D$-dimensional vectors in a $W \times H$ grid, where $W$ and $H$ are 1/32nd of the input image width and height (due to 5 max-pool layers). Defining $W \cdot H \equiv L$, the ConvNet output represents $L$ locations of the image, $i \in [1,L]$, and correspondingly yields $L$ annotation vectors $a_i$, each of size $D$.
* A dense (FC) attention model: The deterministic soft-attention model of the paper computes $\alpha_{t,i}$ which is used to select or blend the $a_i$ vectors before being fed as inputs to the decoder LSTM network (see below).
    * Inputs to the attention model are $a_i$ and $h_{t-1}$ (previous hidden state of LSTM network - see below)
    and $$\alpha_{t,i} = softmax ( f_{att}(a_i, h_{t-1}) )$$
* A Decoder model: A conditioned LSTM that outputs probabilities of the text tokens $y_t$ at each step. The LSTM is conditioned upon $z_t = \sum_{i=1}^{L} \alpha_{t,i} a_i$ and takes the previous hidden state $h_{t-1}$ as input. In addition, an embedding of the previous output, $Ey_{t-1}$, is also input to the LSTM. At training time, $y_{t-1}$ is taken from the training samples, while at inference time it is fed back from the previously predicted token.
    * $y$ is taken from a fixed vocabulary of K words. An embedding matrix $E$ is used to narrow its representation. The embedding weights $E$ are learnt end-to-end by the model as well.
    * The decoder LSTM uses a deep layer between $h_t$ and $y_t$. It is called a deep output layer and is described in [section 3.2.2 of this paper](https://www.semanticscholar.org/paper/How-to-Construct-Deep-Recurrent-Neural-Networks-Pascanu-G%C3%BCl%C3%A7ehre/533ee188324b833e059cb59b654e6160776d5812). That is:
    $$ p(y_t) = Softmax \Big( f_{out}(Ey_{t-1}, h_t, z_t) \Big) $$
* Initialization MLPs: Two MLPs are used to produce the initial memory state $c_0$ and the initial hidden state $h_0$ of the LSTM. Each MLP takes in the entire image's features (i.e. the average of the $a_i$) as its input and is trained end-to-end.
    $$ c_0 = f_{init,c}\Big( \frac{1}{L} \sum_{i=1}^{L} a_i \Big) $$
    $$ h_0 = f_{init,h}\Big( \frac{1}{L} \sum_{i=1}^{L} a_i \Big) $$
* Training:
    * The 3 models above (attention model, decoder and initialization MLPs), i.e. all except the conv-net, are trained end-to-end using SGD
    * The model is trained for a variable number of time steps - depending on each batch
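
The initialization MLPs described in the list above can be sketched in plain NumPy as follows. This is only an illustration under stated assumptions: each MLP is reduced to a single tanh layer, and the function name `init_lstm_state`, the weight names and the toy sizes are hypothetical, not taken from the notebook.

```python
import numpy as np

def init_lstm_state(a, W_h, b_h, W_c, b_c):
    """Compute (h_0, c_0) from annotation vectors a of shape (B, L, D).

    Assumes one tanh layer per head; W_h and W_c have shape (D, n).
    """
    a_mean = a.mean(axis=1)                # average over the L locations -> (B, D)
    h0 = np.tanh(a_mean.dot(W_h) + b_h)    # initial hidden state -> (B, n)
    c0 = np.tanh(a_mean.dot(W_c) + b_c)    # initial memory state -> (B, n)
    return h0, c0

# Toy sizes, chosen only for the example
rng = np.random.RandomState(0)
B, L, D, n = 2, 99, 512, 8
a = rng.randn(B, L, D)
h0, c0 = init_lstm_state(a,
                         rng.randn(D, n) * 0.01, np.zeros(n),
                         rng.randn(D, n) * 0.01, np.zeros(n))
```

Both heads read the same averaged feature vector, so the initial state depends on the whole image rather than on any single location.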

## References
1. Show, Attend and Tell
    * [Paper](https://www.semanticscholar.org/paper/Show-Attend-and-Tell-Neural-Image-Caption-Generati-Xu-Ba/146f6f6ed688c905fb6e346ad02332efd5464616)
    * [Slides](https://pdfs.semanticscholar.org/b336/f6215c3c15802ca5327cd7cc1747bd83588c.pdf?_ga=2.52116077.559595598.1498604153-2037060338.1496182671)
    * [Author's Theano code](https://github.com/kelvinxu/arctic-captions)
1. [Simonyan, Karen and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” CoRR abs/1409.1556 (2014): n. pag.](http://www.robots.ox.ac.uk/~vgg/research/very_deep/)
1. [im2latex solution of Harvard NLP](http://lstm.seas.harvard.edu/latex/)
1. [im2latex-dataset tools forked from Harvard NLP](https://github.com/untrix/im2latex-dataset)

In [1]:
import pandas as pd
import os
import dl_commons as dlc
import tensorflow as tf
from dl_commons import PD, mandatory, instanceof, boolean, integer, decimal, frange_incl, equalto
from keras.applications.vgg16 import VGG16
from keras.layers import Input, Embedding, Dense, Activation, Dropout
from keras.callbacks import LambdaCallback
from keras.models import Model
from keras import backend as K
import keras
import threading
import numpy as np         # used by the batch iterators below
from scipy import ndimage  # ndimage.imread is used to load images

Using TensorFlow backend.


In [2]:
data_folder = '../data/generated2'

### HyperParams

In [3]:
def get_vocab_size(data_dir_):
    ## Vocabulary ids start at zero, hence size = max id + 1
    df_vocab = pd.read_pickle(os.path.join(data_dir_, 'df_vocab.pkl'))
    return df_vocab.id.max() + 1

In [4]:
class tensorshape(object):
    """Tensor shape validator to go with ParamDesc"""
    def __init__(self, shape):
        self._shape = shape
    def __contains__(self, obj):
        return keras.backend.int_shape(obj) == self._shape

In [5]:
try:
    del HYPER
except NameError:
    ## HYPER was not defined yet (first run of this cell)
    pass

HYPER = dlc.HyperParams((
        PD('image_shape',
           'Shape of input images. Should be a python sequence.',
           None,
           (120,1075,3)
           ),
        PD('B',
           '(integer): Size of mini-batch for training, validation and testing.',
           instanceof(int),
           128
           ),
        PD('K',
           'Vocabulary size including zero',
           xrange(500,1000),
           get_vocab_size(data_folder)
           ),
        PD('m',
           '(integer): dimensionality of the embedded input vector (Ey / Ex)', 
           xrange(50,250),
           64
           ),
        PD('L',
           '(integer): number of pixels in an image feature-map = WxH (see paper or model description)', 
           instanceof(int),
           99),
        PD('D', 
           '(integer): number of features coming out of the conv-net. Depth/channels of the last conv-net layer. '
           'See paper or model description.', 
           instanceof(int),
           512),
        PD('keep_prob', '(decimal): Value between 0.1 and 1.0 indicating the keep_probability of dropout layers. '
           'A value of 1 implies no dropout.',
           frange_incl(0.1, 1), 1.0),
    ### Attention Model Params ###
        PD('att_layers', 'Number of layers in the attention_a model', xrange(1,10), 1),
        PD('att_1_n', "Number of units in first layer of the attention model. Defaults to D as it is in the paper's source-code.", 
           xrange(1,10000),
           equalto('D')),
        PD('att_share_weights', 'Whether the attention model should share weights across the "L" image locations or not. '
           'Choosing "True" conforms to the paper resulting in a (D+n,att_1_n) weight matrix. Choosing False will result in a MLP with (L*D+n,att_1_n) weight matrix.',
           instanceof(bool),
           True),
        PD('att_activation', 
           'Activation to use for the attention MLP model. Defaults to tanh as in the paper source.',
           None,
           'tanh'),
        PD('att_weighted_gather', "The paper's source uses an affine transform with trainable weights, to narrow the output of the attention "
           "model from (B,L,dim) to (B,L,1). I don't think this is helpful since there is no nonlinearity here. " 
           "Therefore I have an alternative implementation that simply averages the matrix (B,L,dim) to (B,L,1). " 
           "Default value however, is True so as to default to the paper's implementation.",
           (True, False),
           True),
        PD('att_weights_initializer', 'weights initializer to use for the attention model', None,
           'glorot_normal'),
    ### Embedding Layer ###
        PD('embeddings_initializer', 'Initializer for embedding weights', None, 'glorot_uniform'),
    ### Decoder LSTM Params ###
        PD('decoder_out_layers', "Number of layers in the decoder output MLP. Defaults to 1 as in the paper's source",
           xrange(1,10), 1),
        PD('n',
           '(integer): Number of hidden-units of the LSTM cell',
           instanceof(int),
           1000),
        PD('decoder_lstm_peephole',
           '(boolean): whether to employ peephole connections in the decoder LSTM',
           (True, False),
           False),
        PD('output_activation', 'Activation function for deep output layer', None,
           'tanh'),
    ### Initializer MLP ###
        PD('init_layers', 'Number of layers in the initializer MLP', xrange(1,10),
           1),
        PD('init_dropout_rate', '(decimal): Global dropout_rate variable for init_layer',
           frange_incl(0.0, 0.9), 
           0.2),
        PD('init_h_activation', '', None, 'tanh'),
        PD('init_h_dropout_rate', '', frange_incl(0.0, 0.9), 
           equalto('init_dropout_rate')),
        PD('init_c_activation', '', None, 'tanh'),
        PD('init_c_dropout_rate', '', frange_incl(0.0, 0.9),
           equalto('init_dropout_rate')),
        PD('init_1_n', 'Number of units in hidden layer 1. The paper sets it to D',
            xrange(1, 10000), 
           equalto('D')),
        PD('init_1_dropout_rate', '(decimal): dropout rate for the layer', 
           frange_incl(0.0, 0.9), 
           0),
        PD('init_1_activation', 
           'Activation function of the first layer. In the paper, the final ' 
           'layer has tanh and all preceding layers have relu activation', 
           None,
           'tanh')
))
print HYPER

{'init_layers': 1, 'init_c_activation': 'tanh', 'att_weighted_gather': True, 'init_1_n': 512, 'init_1_activation': 'tanh', 'keep_prob': 1.0, 'init_dropout_rate': 0.2, 'att_layers': 1, 'init_h_activation': 'tanh', 'init_1_dropout_rate': 0, 'embeddings_initializer': 'glorot_uniform', 'image_shape': (120, 1075, 3), 'decoder_out_layers': 1, 'att_share_weights': True, 'att_weights_initializer': 'glorot_normal', 'init_c_dropout_rate': 0.2, 'B': 128, 'D': 512, 'att_1_n': 512, 'K': 556, 'L': 99, 'init_h_dropout_rate': 0.2, 'att_activation': 'tanh', 'm': 64, 'n': 1000, 'decoder_lstm_peephole': False, 'output_activation': 'tanh'}


### Encoder Model
[VGG ConvNet (16 or 19)](http://www.robots.ox.ac.uk/~vgg/research/very_deep/) without the top-3 layers
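* Pre-initialized with the VGG weights. The weights could in principle be fine-tuned, but this notebook keeps the conv-net frozen (`convnet.trainable = False`).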
* The ConvNet outputs $D$-dimensional vectors in a $W \times H$ grid, where $W$ and $H$ are the input image dimensions scaled down by a factor of 32 (due to 5 max-pool layers). Defining $W \cdot H \equiv L$, the ConvNet output represents $L$ locations of the image, $i \in [1,L]$, and correspondingly yields $L$ annotation vectors $a_i$, each of size $D$. For the (120, 1075) input used here the grid is 3x33, i.e. $L = 99$.
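
The grid size quoted above can be checked with a few lines of arithmetic. The sketch assumes each of the five max-pool layers is a 2x2, stride-2 pool that floors odd dimensions, which is how VGG16 without its top layers behaves; the helper name `vgg_feature_grid` is hypothetical.

```python
def vgg_feature_grid(height, width, num_pools=5):
    """Return (H, W, L) of the feature map after `num_pools` 2x2, stride-2 max-pools."""
    for _ in range(num_pools):
        height //= 2   # each pool halves the dimension, rounding down
        width //= 2
    return height, width, height * width

# Image shape used in HYPER.image_shape: (120, 1075, 3)
H, W, L = vgg_feature_grid(120, 1075)
```

For the image shape used in this notebook this gives `H = 3`, `W = 33` and `L = 99`, which agrees with the `convnet.output_shape` printed by the cell below and with the default value of `HYPER.L`.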

The conv-net is *not trained* in the original paper and therefore the files can be separately preprocessed and their outputs directly fed into the model.

In [6]:
## Conv-net
# K.set_image_data_format('channels_last')
image_input = Input(shape=HYPER.image_shape, name='image_input')
convnet = VGG16(include_top=False, weights='imagenet', pooling=None, input_shape=HYPER.image_shape)
convnet.trainable = False
print 'convnet output_shape = ', convnet.output_shape
a = convnet(image_input)

convnet output_shape =  (None, 3, 33, 512)


### Input Generator
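
The iterators in this section draw every mini-batch from a single length bin (`bin_len`), so that all sequences in a batch can be unrolled for the same number of time steps. A minimal, dependency-free sketch of that idea follows. The function name `make_batches` and the toy data are hypothetical, and unlike the notebook code (which asserts that each bin holds a whole number of batches) this sketch simply drops the remainder of a bin.

```python
import random

def make_batches(bin_lens, batch_size, seed=None):
    """Group sample indices into batches that share one bin length.

    bin_lens[i] is the bin (padded sequence length) of sample i.
    Returns a shuffled list of (bin_len, [sample indices]) tuples.
    """
    bins = {}
    for idx, bin_len in enumerate(bin_lens):
        bins.setdefault(bin_len, []).append(idx)

    rng = random.Random(seed)
    batches = []
    for bin_len, indices in sorted(bins.items()):
        rng.shuffle(indices)  # shuffle samples within the bin
        # Emit only full batches; a trailing partial batch is dropped
        for start in range(0, len(indices) - batch_size + 1, batch_size):
            batches.append((bin_len, indices[start:start + batch_size]))

    rng.shuffle(batches)      # shuffle the order of batches across bins
    return batches

batches = make_batches([10, 10, 20, 10, 20, 10, 20, 20], batch_size=2, seed=0)
```

Each returned tuple identifies one batch, and consecutive batches may come from different bins, which is what makes the number of time steps vary from batch to batch.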

In [7]:
def make_batch_list(df_, batch_size_):
    """Return a shuffled list of (bin_len, batch_index_within_bin) tuples."""
    ## Make a list of batches
    bin_lens = sorted(df_.bin_len.unique())
    bin_counts = [df_[df_.bin_len==l].shape[0] for l in bin_lens]
    batch_list = []
    for bin_, bin_count in zip(bin_lens, bin_counts):
        num_batches = (bin_count // batch_size_)
        ## Just making sure bin size is integral multiple of batch_size.
        ## This is not a requirement for this function to operate, rather
        ## is a way of possibly catching data-corrupting bugs
        assert (bin_count % batch_size_) == 0
        batch_list.extend([(bin_, j) for j in range(num_batches)])

    np.random.shuffle(batch_list)
    return batch_list

class ShuffleIterator(object):
    def __init__(self, df_, batch_size_):
        self._df = df_.sample(frac=1)
        self._batch_size = batch_size_
        self._batch_list = make_batch_list(self._df, batch_size_)
        self._next_pos = 0
        self._num_items = (df_.shape[0] // batch_size_)
        self.lock = threading.Lock()
        
#     def __iter__(self):
#         return self
    
    def next(self):
        ## This is an infinite iterator
        with self.lock:
            if self._next_pos >= self._num_items:
                ## Recompose the batch-list
                ## Shuffle the samples
                self._df = self._df.sample(frac=1)
                self._batch_list = make_batch_list(self._df, self._batch_size)
                self._next_pos %= self._num_items
            next_pos = self._next_pos
            self._next_pos += 1
        
        batch = self._batch_list[next_pos]
        df_bin = self._df[self._df.bin_len == batch[0]]
        assert df_bin.bin_len.iloc[batch[1]*self._batch_size] == batch[0]
        assert df_bin.bin_len.iloc[(batch[1]+1)*self._batch_size-1] == batch[0]
        return df_bin.iloc[batch[1]*self._batch_size : (batch[1]+1)*self._batch_size]

class ImageIterator(ShuffleIterator):
    def __init__(self, df_, batch_size_, image_dim_, image_dir_):
        ShuffleIterator.__init__(self, df_, batch_size_)
        self._im_dim = image_dim_ # (padded_height, padded_width, ...)
        self._image_dir = image_dir_

    @staticmethod
    def get_image_matrix(image_path_, height_, width_, padded_height_, padded_width_):
        MAX_PIXEL = 255.0 # Ensure this is a float literal
        ## Load image and convert to a 3-channel array of shape (height, width, 3)
        im_ar = ndimage.imread(image_path_, mode='RGB')
        ## Normalize values to lie between -0.5 and 0.5.
        ## This is a rough stand-in for data whitening (zero mean and a fixed std-dev).
        ## It is a very rough technique but legit for images
        im_ar = (im_ar - MAX_PIXEL/2.0) / MAX_PIXEL
        height, width, channels = im_ar.shape
        assert height == height_
        assert width == width_
        if (height < padded_height_) or (width < padded_width_):
            ## Pad with 0.5, which is the normalized value of a white pixel (255)
            ar = np.full((padded_height_, padded_width_, channels), 0.5, dtype=np.float32)
            h = (padded_height_-height)//2
            ar[h:h+height, 0:width] = im_ar
            im_ar = ar

        return im_ar

    def next(self):
        df_batch = ShuffleIterator.next(self)[['image', 'height', 'width']]
        im_batch = []
        for row in df_batch.itertuples(index=False):
            ## row = (image, height, width)
            im_batch.append(self.get_image_matrix(os.path.join(self._image_dir, row[0]), row[1], row[2],
                                                  self._im_dim[0], self._im_dim[1]))
            
        return np.asarray(im_batch)

class FormulaIterator(ShuffleIterator):
    def __init__(self, df_, batch_size_, data_dir_, seq_filename_):
        ShuffleIterator.__init__(self, df_, batch_size_)
        self._seq_data = pd.read_pickle(os.path.join(data_dir_, seq_filename_))
        
    def next(self):
        df_batch = ShuffleIterator.next(self)
        ## All samples of a batch come from the same bin
        bin_len = df_batch.bin_len.iloc[0]
        ## Assumes self._seq_data[bin_len] is indexed by the same labels as df_
        return self._seq_data[bin_len].loc[df_batch.index].values

In [8]:
# sequence_input = Input(shape=(None,), dtype='int32', name='sequence_input')
# embedding_output = Embedding(HYPER.vocab_size, HYPER.embedding_size, mask_zero=True, name='embedding')(sequence_input)
#model = Model(inputs=[sequence_input], outputs=[embedding_output])
#model.compile(optimizer='rmsprop', loss='binary_crossentropy')
#model.output_shape

### Decoder Model
A dense (FC) attention model: The deterministic soft-attention model of the paper computes $\alpha_{t,i}$ which is used to select or blend the $a_i$ vectors before being fed as inputs to the decoder LSTM network (see below).
* Inputs to the attention model are $a_i$ and $h_{t-1}$ (previous hidden state of LSTM network - see below) and $$\alpha_{t,i} = softmax ( f_{att}(a_i, h_{t-1}) )$$
* Note that the model $f_{att}$ shares weights across all values of $a_i$ (i.e. for all $i = 1 \dots L$). Therefore the shared weight matrix for all $a_i$ has shape $(D, D)$, while the shape of $a$ is $(B, L, D)$ where $B$ is the batch size. The weight matrix of $h_{t-1}$ is separate and has the expected shape $(n, D)$. Sharing weights across the $a_i$ means that the number of attention parameters does not depend on $L$.
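
The weight sharing described above can be made concrete with a small NumPy sketch. It uses the additive form, with a shared $(D, D)$ matrix for the $a_i$ and a separate $(n, D)$ matrix for $h_{t-1}$, followed by a learned vector that reduces each location to a scalar score. It is an illustration, not the notebook's Keras implementation (which concatenates $a_i$ and $h_{t-1}$ instead), and the names `soft_attention`, `W_a`, `W_h` and `v` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_attention(a, h_prev, W_a, W_h, v):
    """Deterministic soft attention.

    a: (B, L, D) annotations, h_prev: (B, n) previous hidden state,
    W_a: (D, D) shared across all L locations, W_h: (n, D), v: (D,).
    Returns alpha of shape (B, L) and the context z of shape (B, D).
    """
    # The same W_a is applied to every location; h is broadcast over L
    proj = np.tanh(a.dot(W_a) + h_prev.dot(W_h)[:, None, :])  # (B, L, D)
    scores = proj.dot(v)                                       # (B, L)
    alpha = softmax(scores, axis=1)                            # each row sums to 1
    z = np.einsum('bl,bld->bd', alpha, a)                      # sum_i alpha_i * a_i
    return alpha, z

# Toy sizes, chosen only for the example
rng = np.random.RandomState(0)
B, L, D, n = 2, 5, 4, 3
a = rng.randn(B, L, D)
h_prev = rng.randn(B, n)
alpha, z = soft_attention(a, h_prev, rng.randn(D, D), rng.randn(n, D), rng.randn(D))
```

When all scores are equal, `alpha` is uniform and `z` reduces to the plain average of the annotation vectors.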

A Decoder model: A conditioned LSTM that outputs probabilities of the text tokens $y_t$ at each step. The LSTM is conditioned upon $z_t = \sum_{i=1}^{L} \alpha_{t,i} a_i$ and takes the previous hidden state $h_{t-1}$ as input. In addition, an embedding of the previous output, $Ey_{t-1}$, is also input to the LSTM. At training time, $y_{t-1}$ is taken from the training samples, while at inference time it is fed back from the previously predicted token.
* $y$ is taken from a fixed vocabulary of K words. An embedding matrix $E$ is used to narrow its representation to an $m$ dimensional dense vector. The embedding weights $E$ are learnt end-to-end by the model as well.
* The decoder LSTM uses a deep layer between $h_t$ and $y_t$. It is called a deep output layer and is described in [section 3.2.2 of this paper](https://www.semanticscholar.org/paper/How-to-Construct-Deep-Recurrent-Neural-Networks-Pascanu-G%C3%BCl%C3%A7ehre/533ee188324b833e059cb59b654e6160776d5812). That is:
$$ p(y_t) = Softmax \Big( f_{out}(Ey_{t-1}, h_t, z_t) \Big) $$
* Optionally $z_t = \beta \sum_{i=1}^{L} \alpha_{t,i} a_i$ where $\beta = \sigma(f_{\beta}(h_{t-1}))$ is a scalar used to modulate the strength of the context. It turns out that for the original use-case of caption generation, the network would learn to emphasize objects by turning up the value of this scalar when it was focusing on objects. It is not clear at this time whether we'll need this feature for im2latex.
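
The deep output layer and the optional $\beta$ gate can be sketched in NumPy as below. The sketch assumes a single-layer $f_{out}$ (an affine map of $h_t$ and $z_t$ into the embedding space, added to $Ey_{t-1}$, then tanh and a softmax over the vocabulary) and a linear $f_{\beta}$. All function and weight names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_context(alpha, a, h_prev, w_beta, b_beta=0.0):
    """z_t = beta * sum_i alpha_i a_i, with beta = sigmoid(h_prev . w_beta + b_beta)."""
    beta = sigmoid(h_prev.dot(w_beta) + b_beta)   # one scalar per sample -> (B,)
    z = np.einsum('bl,bld->bd', alpha, a)         # ungated context -> (B, D)
    return beta[:, None] * z

def deep_output(Ey_prev, h_t, z_t, W_h, W_z, W_o):
    """p(y_t) = softmax(W_o^T tanh(Ey_prev + h_t W_h + z_t W_z)).

    Ey_prev: (B, m), h_t: (B, n), z_t: (B, D),
    W_h: (n, m), W_z: (D, m), W_o: (m, K).
    """
    hidden = np.tanh(Ey_prev + h_t.dot(W_h) + z_t.dot(W_z))  # (B, m)
    return softmax(hidden.dot(W_o), axis=1)                  # (B, K)

# Toy sizes, chosen only for the example
rng = np.random.RandomState(0)
B, L, D, n, m, Kv = 2, 5, 4, 3, 6, 7
a = rng.randn(B, L, D)
alpha = softmax(rng.randn(B, L), axis=1)
h = rng.randn(B, n)
z = gated_context(alpha, a, h, rng.randn(n))
p = deep_output(rng.randn(B, m), h, z, rng.randn(n, m), rng.randn(D, m), rng.randn(m, Kv))
```

Because $\beta$ lies strictly between 0 and 1, the gate can only shrink the context vector, never amplify it.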


In [9]:
class ConditionedAttentiveRNNCell(object):
    """
    One timestep of the decoder model. The entire function can be seen as a complex RNN-cell
    that includes a LSTM stack and an attention model.
    """
    def __init__(self):
        self._define_params()
        self._numSteps = 0
    
    def _define_attention_params(self):
        """Define Shared Weights for Attention Model"""
        ## 1) Dense layer, 2) Optional gather layer and 3) softmax layer

        ## Renaming HyperParams for convenience
        B = HYPER.B
        n = HYPER.n
        L = HYPER.L
        D = HYPER.D
        m = HYPER.m
        att_actv = HYPER.att_activation
        e_init = HYPER.embeddings_initializer
        
        ## _att_dense_layer is a 0-indexed list: layer i of the model is stored at index i-1
        self._att_dense_layer = []

        if HYPER.att_share_weights:
        ## Here we'll effectively create L MLP stacks all sharing the same weights. Each
        ## stack receives a concatenated vector of a(l) and h as input.
            dim = D+n
            for i in range(1, HYPER.att_layers+1):
                n_units = HYPER['att_%d_n'%(i,)]; assert(n_units <= dim)
                self._att_dense_layer.append(Dense(n_units, activation=HYPER.att_activation,
                                                   batch_input_shape=(B,L,dim)))
                dim = n_units
            ## Optional gather layer (that comes after the Dense Layer)
            if HYPER.att_weighted_gather:
                self._att_gather_layer = Dense(1, activation='linear') # output shape = (B, L, 1)
        else:
            ## concatenate a and h_prev and pass them through a MLP. This is different than the theano
            ## implementation of the paper because we flatten a from (B,L,D) to (B,L*D). Hence each element
            ## of the L*D vector receives its own weight because the effective weight matrix here would be
            ## shape (L*D, num_dense_units) as compared to (D, num_dense_units) as in the shared_weights case
            dim = L*D+n        
            for i in range(1, HYPER.att_layers+1):
                n_units = HYPER['att_%d_n'%(i,)]; assert(n_units <= dim)
                self._att_dense_layer.append(Dense(n_units, activation=att_actv,
                                                   batch_input_shape=(B,dim)))
                dim = n_units
        
        assert dim >= L
        self._att_softmax_layer = Dense(L, activation='softmax', name='alpha')
        
    def _call_attention_model(self, a, h_prev):
        B = HYPER.B
        n = HYPER.n
        L = HYPER.L
        D = HYPER.D
        h = h_prev
        att_actv = HYPER.att_activation
        initializer = HYPER.att_weights_initializer

        assert K.int_shape(h_prev) == (B, n)
        assert K.int_shape(a) == (B, L, D)

        ## For #layers > 1 this will end up being different than the paper's implementation
        if HYPER.att_share_weights:
            """
            Here we'll effectively create L MLP stacks all sharing the same weights. Each
            stack receives a concatenated vector of a(l) and h as input.

            TODO: We could also
            use 2D convolution here with a kernel of size (1,D) and stride=1 resulting in
            an output dimension of (L,1,depth) or (B, L, 1, depth) including the batch dimension.
            That may be more efficient.
            """
            ## h.shape = (B,n). Convert it to (B,1,n) and then broadcast to (B,L,n) in order
            ## to concatenate with feature vectors of 'a' whose shape=(B,L,D)
            print 'Shape of h is ', K.int_shape(h)
            h = K.tile(K.expand_dims(h, axis=1), (1,L,1))
            ## Concatenate a and h. Final shape = (B, L, D+n)
            ah = tf.concat([a,h], -1)
            for i in range(HYPER.att_layers) :
                ah = self._att_dense_layer[i](ah)

            ## Below is roughly how it is implemented in the code released by the authors of the paper
#                 for i in range(1, HYPER.att_a_layers+1):
#                     a = Dense(HYPER['att_a_%d_n'%(i,)], activation=HYPER.att_actv)(a)
#                 for i in range(1, HYPER.att_h_layers+1):
#                     h = Dense(HYPER['att_h_%d_n'%(i,)], activation=HYPER.att_actv)(h)    
#                ah = a + K.expand_dims(h, axis=1)

            ## Gather all activations across the features; go from (B, L, dim) to (B,L,1).
            ## One could've just summed/averaged them all here, but the paper uses yet
            ## another set of weights to accomplish this. So we'll keep that as an option.
            if HYPER.att_weighted_gather:
                ah = self._att_gather_layer(ah) # output shape = (B, L, 1)
                ah = K.reshape(ah, (B,L)) # output shape = (B, L)
            else:
                ah = K.mean(ah, axis=2) # output shape = (B, L)

        else: # weights not shared across L
            ## concatenate a and h_prev and pass them through a MLP. This is different than the theano
            ## implementation of the paper because we flatten a from (B,L,D) to (B,L*D). Hence each element
            ## of the L*D vector receives its own weight because the effective weight matrix here would be
            ## shape (L*D, num_dense_units) as compared to (D, num_dense_units) as in the shared_weights case

            ## Concatenate a and h. Final shape will be (B, L*D+n)
            ah = K.concatenate([K.batch_flatten(a), h]) # K.concatenate expects a list of tensors
            for i in range(HYPER.att_layers):
                ah = self._att_dense_layer[i](ah)
            ## At this point, ah.shape = (B, dim)

        alpha = self._att_softmax_layer(ah) # output shape = (B, L)
        assert K.int_shape(alpha) == (B, L)
        return alpha
            
    def _define_output_params(self):
        ## Renaming HyperParams for convenience
        B = HYPER.B
        n = HYPER.n
        L = HYPER.L
        D = HYPER.D
        m = HYPER.m
        Kv= HYPER.K
        att_actv = HYPER.att_activation
        e_init = HYPER.embeddings_initializer

        ## First layer of output MLP
        ## Affine transformation of h_t and z_t from size n/D to m followed by a summation
        self._output_affine = Dense(m, activation='linear', batch_input_shape=(B,n+D)) # output size = (B, m)
        ## non-linearity for the first layer - will be chained by the _call function after adding Ex / Ey
        self._output_activation = Activation(HYPER.output_activation)

        ## Additional layers if any
        if HYPER.decoder_out_layers > 1:
            self._output_dense = []
            for i in range(1, HYPER.decoder_out_layers):
                self._output_dense.append(Dense(m, activation=HYPER['output_%d_activation'%i], 
                                           batch_input_shape=(B,m)))

        ## Final softmax layer
        self._output_softmax = Dense(Kv, activation='softmax', batch_input_shape=(B,m))
        
    def _call_output_layer(self, Ex_t, h_t, z_t):
        ## Renaming HyperParams for convenience
        B = HYPER.B
        n = HYPER.n
        L = HYPER.L
        D = HYPER.D
        m = HYPER.m
        Kv =HYPER.K
        
        assert K.int_shape(Ex_t) == (B, m)
        assert K.int_shape(h_t) == (B, n)
        assert K.int_shape(z_t) == (B, D)
        
        o_t = self._output_affine(tf.concat([h_t, z_t], -1)) + Ex_t
        o_t = self._output_activation(o_t)
        if HYPER.decoder_out_layers > 1:
            for i in range(1, HYPER.decoder_out_layers):
                o_t = self._output_dense[i-1](o_t) # _output_dense is a 0-indexed list
        
        o_t = self._output_softmax(o_t) # shape = (B,K)
        assert K.int_shape(o_t) == (B, Kv)
        return o_t

    def _define_init_params(self):
        ## As per the paper, this is a two-headed MLP. It has a stack of common layers at the bottom,
        ## two output layers at the top - one each for h and c LSTM states.
        self._init_layer = []
        self._init_dropout = []
        for i in xrange(1, HYPER.init_layers):
            key = 'init_%d_'%i
            self._init_layer.append(Dense(HYPER[key+'n'], activation=HYPER[key+'activation']))
            ## Append None when there is no dropout so that indices stay aligned with _init_layer
            if HYPER[key+'dropout_rate'] > 0.0:
                self._init_dropout.append(Dropout(HYPER[key+'dropout_rate']))
            else:
                self._init_dropout.append(None)

        ## Final layer for h
        self._init_h = Dense(HYPER['n'], activation=HYPER['init_h_activation'])
        if HYPER['init_h_dropout_rate'] > 0.0:
            self._init_h_dropout = Dropout(HYPER['init_h_dropout_rate'])

        ## Final layer for c
        self._init_c = Dense(HYPER['n'], activation=HYPER['init_c_activation'])
        if HYPER['init_c_dropout_rate'] > 0.0:
            self._init_c_dropout = Dropout(HYPER['init_c_dropout_rate'])

    def _call_init_layer(self, a):
        assert K.int_shape(a) == (HYPER.B, HYPER.L, HYPER.D)
        
        ## As per the paper, this is a two-headed MLP. It has a stack of common layers at the bottom,
        ## two output layers at the top - one each for h and c LSTM states.
        a = K.mean(a, axis=1) # final shape = (B, D)
        for i in xrange(1, HYPER.init_layers):
            key = 'init_%d_'%i
            a = self._init_layer[i-1](a) # lists are 0-indexed, layer numbers start at 1
            if HYPER[key+'dropout_rate'] > 0.0:
                a = self._init_dropout[i-1](a)
        
        init_h = self._init_h(a)
        if HYPER['init_h_dropout_rate'] > 0.0:
            init_h = self._init_h_dropout(init_h)
            
        init_c = self._init_c(a)
        if HYPER['init_c_dropout_rate'] > 0.0:
            init_c = self._init_c_dropout(init_c)
            
        assert K.int_shape(init_h) == (HYPER.B, HYPER.n)
        assert K.int_shape(init_c) == (HYPER.B, HYPER.n)
        return init_h, init_c
        
    def _define_params(self):
        ## Renaming HyperParams for convenience
        B = HYPER.B
        n = HYPER.n
        L = HYPER.L
        D = HYPER.D
        m = HYPER.m
        Kv = HYPER.K
        att_actv = HYPER.att_activation
        e_init = HYPER.embeddings_initializer

        ################ Attention Model ################
        with tf.variable_scope('Attention'):
            self._define_attention_params()
                
        ################ Embedding Layer ################
        with tf.variable_scope('Ey'):
            self._embedding = Embedding(Kv, m, embeddings_initializer=e_init, mask_zero=True, 
                                        input_length=1,
                                        batch_input_shape=(B,1))
        
        ################ Decoder LSTM Cell ################
        with tf.variable_scope('Decoder_LSTM'):
            # LSTM by Zaremba et al. 2014: http://arxiv.org/abs/1409.2329
            self._decoder_lstm = tf.contrib.rnn.LSTMBlockCell(n, forget_bias=1.0, 
                                                              use_peephole=HYPER.decoder_lstm_peephole)
            
        ################ Output Layer ################
        with tf.variable_scope('Decoder_Output_Layer'):
            self._define_output_params()

        ################ Initializer MLP ################
        with tf.variable_scope('Initializer_MLP'):
            self._define_init_params()
            
    def make_tf_lstm_training_step1(self, out_t_1, x_t):
        """
        Builds tf graph for the first iteration of the RNN. Calls make_tf_lstm_training with isStep1=True.
        """
        return self.make_tf_lstm_training(out_t_1, x_t, True)
        
    def make_tf_lstm_training_stepN(self, out_t_1, x_t):
        """
        Builds tf graph for the subsequent iterations of the RNN. Calls make_tf_lstm_training with isStep1=False.
        """
        return self.make_tf_lstm_training(out_t_1, x_t, False)
        
    def make_tf_lstm_training(self, out_t_1, x_t, isStep1=False):
        """
        TODO: Incorporate Dropout
        Builds/threads tf graph for one RNN iteration.
        Conforms to the loop function fn required by tf.scan. Takes in the output of the previous
        time-step (which carries the previous lstm states h and c and the image annotations a) and
        the current input, and returns the states and outputs for the current time-step.
        Note that input(t) = Ey(t-1). Input(t=0) = Null. When training, the target output is used for Ey
        whereas at prediction time (e.g. via beam-search) the actual output is used.
        Args:
            out_t_1 (tuple of tensors): Output returned by this function at previous time-step.
            x_t (tensor): is an input for one time-step. Should be a tensor of shape (batch-size, 1).
            isStep1 (bool): True when building the graph for the first time-step.
        Returns:
            out_t (tuple of tensors): (step, h_t, lstm_states_t, a, yProbs_t). yProbs_t has shape (B,K) and
                holds the probability of words/tokens. The remaining elements are the states needed in the
                next iteration of the RNN. lstm_states_t = (c_t, h_t) - which means h_t is included twice
                in the returned tuple.
        """
        #x_t = input at t             # shape = (B, 1)
        step = out_t_1[0] + 1
        h_t_1 = out_t_1[1]            # shape = (B,n)
        lstm_states_t_1 = out_t_1[2]  # shape = ((B,n), (B,n)) = (c_t_1, h_t_1)
        a = out_t_1[3]               # shape = (B, L, D)
        #yProbs_t_1 = out_t_1[4]           # shape = (B, Kv)
        
        B = HYPER.B
        m = HYPER.m
        n = HYPER.n
        L = HYPER.L
        D = HYPER.D
        Kv = HYPER.K
        
        assert K.int_shape(h_t_1) == (B, n)
        assert K.int_shape(a) == (B, L, D)
        assert K.int_shape(lstm_states_t_1[1]) == (B, n)
        if not isStep1:
            assert K.int_shape(out_t_1[4]) == (B, Kv)
        
        ################ Attention Model ################
        with tf.variable_scope('Attention'):
            alpha = self._call_attention_model(a, h_t_1) # alpha.shape = (B, L)

        ################ Soft deterministic attention: z = alpha-weighted mean of a ################
        ## (B, L) batch_dot (B,L,D) -> (B, D)
        with tf.variable_scope('Phi'):
            z_t = K.batch_dot(alpha, a, axes=[1,1]) # z_t.shape = (B, D)

        ################ Embedding layer ################
        with tf.variable_scope('Ey'):
            Ex_t = self._embedding(x_t) # output.shape= (B,1,m)
            Ex_t = K.reshape(Ex_t, (B,m)) # output.shape= (B,m)

        ################ Decoder Layer ################
        with tf.variable_scope("Decoder_LSTM", reuse=(not isStep1)) as var_scope:
            (h_t, lstm_states_t) = self._decoder_lstm(Ex_t, lstm_states_t_1) # h_t.shape=(B,n)
            #print 'After decoder construction, h_t shape = ', K.int_shape(h_t)
            
        ################ Output Layer ################
        with tf.variable_scope('Output_Layer'):
            yProbs_t = self._call_output_layer(Ex_t, h_t, z_t) # yProbs_t.shape = (B,K)
        
        assert K.int_shape(h_t) == (B, n)
        assert K.int_shape(a) == (B, L, D)
        assert K.int_shape(lstm_states_t[1]) == (B, n)
        assert K.int_shape(yProbs_t) == (B, Kv)
        return step, h_t, lstm_states_t, a, yProbs_t

#     def step_function_training(self, input, states):
#         """
#         Conforms to step_function required by keras.backend.rnn. Takes in previous lstm states (c&h), 
#         the current input and the image annotations (a) as input and outputs the states and outputs for the
#         current timestep.
#         Note that input(t) = Ey(t-1) and input(t=0) = Null. When training, the target output is used for Ey
#         whereas at prediction time (via. beam-search for e.g.) the actual output is used.
#         Args:
#             input (tensor): is a input for one time-step. Should be a tensor of shape (batch-size, m) where m is
#                 the dimensionality of the embedded input vector.
#             states (list of tensors): Same as new_states returned by this function at previous time-step.
#                 states(t) = new_states(t-1).
#         Returns:
#             outputs (tensor): The output of the cell. A tensor of shape (batch_size, num_units)
#         """
#         x_t = input                  # shape = (B,1)
#         step_output_t_1 = states[0]  # shape = (B, K)
#         h_t_1 = states[1]            # shape = (B,n)
#         lstm_states_t_1 = states[2]  # shape = ((B,n), (B,n)) = (c_t_1, h_t_1)
#         a = states[-1]               # shape = (B, L, D)
#         B = HYPER.B
#         m = HYPER.m
        
#         ################ Attention Model ################
#         with tf.name_scope('Attention'):
#             alpha = self._call_attention_model(a, h_t_1) # alpha.shape = (B, L)

#         ################ Soft deterministic attention: z = alpha-weighted mean of a ################
#         ## (B, L) batch_dot (B,L,D) -> (B, D)
#         with tf.name_scope('Phi'):
#             z_t = K.batch_dot(alpha, a, axes=[1,1]) # z_t.shape = (B, D)

#         ################ Embedding layer ################
#         with tf.name_scope('Ey'):
#             Ex_t = self._embedding(x_t) # output.shape= (B,m,1)
#             Ex_t = K.reshape(Ex_t, (B,m)) # output.shape= (B,m)

#         ################ Decoder Layer ################
#         print 'num steps = ', self._numSteps
#         with tf.variable_scope("Decoder_LSTM", reuse= False if self._numSteps == 0 else True) as var_scope:
#             (h_t, lstm_states_t) = self._decoder_lstm(Ex_t, lstm_states_t_1) # h_t.shape=(B,n)
#             yProbs_t = self._call_output_layer(Ex_t, h_t, z_t) # yProbs_t.shape = (B,K)
        
#         self._numSteps += 1
#         ## output = yProbs_t
#         ## state = (yProbs_t, h_t, lstm_states_t, a)
#         ## return (output, state)
#         return yProbs_t, [yProbs_t, h_t, lstm_states_t, a]

#     def make_tf_lstm_training2(self, step, step_output_t_1, x_s, h_t_1, lstm_states_t_1, a):
#         """
#         Conforms to loop function required by tf.while_loop. Takes in previous lstm states (c&h), 
#         the current input and the image annotations (a) as input and outputs the states and outputs for the
#         current timestep.
#         Note that input(t) = Ey(t-1) and input(t=0) = Null. When training, the target output is used for Ey
#         whereas at prediction time (via. beam-search for e.g.) the actual output is used.
#         Args:
#             input (tensor): is a input for one time-step. Should be a tensor of shape (batch-size, m) where m is
#                 the dimensionality of the embedded input vector.
#             states (list of tensors): Same as new_states returned by this function at previous time-step.
#                 states(t) = new_states(t-1).
#         Returns:
#             outputs (tensor): The output of the cell. A tensor of shape (batch_size, num_units)
#         """
# #         step_output_t_1 = states[0]  # shape = (B, K)
# #         x_s = input sequence         # shape = (B, sequence_bin_length)
# #         h_t_1 = states[1]            # shape = (B,n)
# #         lstm_states_t_1 = states[2]  # shape = ((B,n), (B,n)) = (c_t_1, h_t_1)
# #         a = states[-1]               # shape = (B, L, D)
#         B = HYPER.B
#         m = HYPER.m
        
#         ################ Attention Model ################
#         with tf.name_scope('Attention'):
#             alpha = self._call_attention_model(a, h_t_1) # alpha.shape = (B, L)

#         ################ Soft deterministic attention: z = alpha-weighted mean of a ################
#         ## (B, L) batch_dot (B,L,D) -> (B, D)
#         with tf.name_scope('Phi'):
#             z_t = K.batch_dot(alpha, a, axes=[1,1]) # z_t.shape = (B, D)

#         ################ Embedding layer ################
#         with tf.name_scope('Ey'):
#             Ex_t = self._embedding(x_s[:, step]) # output.shape= (B,m,1)
#             Ex_t = K.reshape(Ex_t, (B,m)) # output.shape= (B,m)

#         ################ Decoder Layer ################
#         with tf.variable_scope("Decoder_LSTM", reuse= (False if step == 0 else True)) as var_scope:
#             (h_t, lstm_states_t) = self._decoder_lstm(Ex_t, lstm_states_t_1) # h_t.shape=(B,n)
            
#         ################ Output Layer ################
#         with tf.variable_scope('Output_Layer'):
#             yProbs_t = self._call_output_layer(Ex_t, h_t, z_t) # yProbs_t.shape = (B,K)
        
#         return step+1, yProbs_t, x_s, h_t, lstm_states_t, a
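
The 'Phi' step above blends the L annotation vectors into one context vector per sample, $z_t = \sum_i^L(\alpha_{t,i}.a_i)$. As a minimal sketch of that contraction, assuming NumPy and toy sizes that are not the HYPER values, the (B, L) x (B, L, D) -> (B, D) product can be written with `einsum`. The helper name `soft_attention_context` and the random inputs are hypothetical and only illustrate the shapes; they are not part of the model code.

```python
import numpy as np

def soft_attention_context(alpha, a):
    """Blend annotation vectors: z[b, d] = sum_l alpha[b, l] * a[b, l, d]."""
    # alpha: (B, L) attention weights, a: (B, L, D) annotation vectors
    return np.einsum('bl,bld->bd', alpha, a)  # result shape: (B, D)

# Toy sizes chosen only for illustration (assumption, not the HYPER values)
B, L, D = 2, 3, 4
rng = np.random.RandomState(0)

# Softmax over the L locations so that each row of alpha sums to 1
scores = rng.randn(B, L)
alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)

a = rng.randn(B, L, D)
z = soft_attention_context(alpha, a)
print(z.shape)
```

With a one-hot `alpha` the same function simply selects a single annotation vector, which is the "select or blend" behaviour of soft attention.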


In [12]:
bin_len = 100
cell = ConditionedAttentiveRNNCell()
x_sequence = K.placeholder(shape=(HYPER.B, bin_len, 1), dtype=tf.int32)
im_context = K.placeholder(shape=(HYPER.B, HYPER.L, HYPER.D), dtype=tf.float32)
init_h, init_c = cell._call_init_layer(im_context)
init_lstm_states = tf.contrib.rnn.LSTMStateTuple(init_c, init_h)
initial_accum = (0, init_h, init_lstm_states, im_context)
x_sequence = tf.transpose(x_sequence, perm=[1, 0, 2])
step1_out = cell.make_tf_lstm_training_step1(initial_accum, x_sequence[0,:,:])
print(step1_out)
stepN_out = tf.scan(cell.make_tf_lstm_training_stepN, x_sequence[1:,:,:], initializer=step1_out)
print(stepN_out)

Shape of h is  (128, 1000)
(1, <tf.Tensor 'Decoder_LSTM_3/lstm_cell/LSTMBlockCell:6' shape=(128, 1000) dtype=float32>, LSTMStateTuple(c=<tf.Tensor 'Decoder_LSTM_3/lstm_cell/LSTMBlockCell:1' shape=(128, 1000) dtype=float32>, h=<tf.Tensor 'Decoder_LSTM_3/lstm_cell/LSTMBlockCell:6' shape=(128, 1000) dtype=float32>), <tf.Tensor 'Placeholder_32:0' shape=(128, 99, 512) dtype=float32>, <tf.Tensor 'Output_Layer/dense_16/Softmax:0' shape=(128, 556) dtype=float32>)
Shape of h is  (128, 1000)
(<tf.Tensor 'scan/TensorArrayStack/TensorArrayGatherV3:0' shape=(99,) dtype=int32>, <tf.Tensor 'scan/TensorArrayStack_1/TensorArrayGatherV3:0' shape=(99, 128, 1000) dtype=float32>, LSTMStateTuple(c=<tf.Tensor 'scan/TensorArrayStack_2/TensorArrayGatherV3:0' shape=(99, 128, 1000) dtype=float32>, h=<tf.Tensor 'scan/TensorArrayStack_3/TensorArrayGatherV3:0' shape=(99, 128, 1000) dtype=float32>), <tf.Tensor 'scan/TensorArrayStack_4/TensorArrayGatherV3:0' shape=(99, 128, 99, 512) dtype=float32>, <tf.Tensor 'scan/T

In [None]:
print(stepN_out)
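
The shapes printed for `stepN_out` have a leading dimension of 99 = bin_len - 1: the first time-step is run separately, and the scan then threads the accumulator through the remaining time-steps, keeping every intermediate result. The pure-Python sketch below imitates that accumulate-and-stack pattern. The `scan` and `toy_step` functions are hypothetical stand-ins written for illustration, not the TensorFlow API or the decoder step itself.

```python
def scan(fn, elems, initializer):
    """Pure-Python stand-in for an accumulate-and-stack scan."""
    acc = initializer
    outputs = []
    for x in elems:
        acc = fn(acc, x)     # the accumulator from step t-1 feeds step t
        outputs.append(acc)  # every intermediate accumulator is kept
    return outputs

def toy_step(out_t_1, x_t):
    """Hypothetical step that threads (step, state) like the decoder step."""
    step, h_t_1 = out_t_1
    return (step + 1, h_t_1 + x_t)

# Time-major sequence: one element per time-step
x_sequence = [10, 20, 30, 40]

# The first step runs outside the scan and seeds the accumulator
step1_out = toy_step((0, 0), x_sequence[0])
stepN_out = scan(toy_step, x_sequence[1:], initializer=step1_out)

# Regroup per component, similar to how a scan stacks each tuple element
steps, states = zip(*stepN_out)
print(steps)
print(states)
```

Everything carried in the accumulator gets stacked once per step, which would explain why the image context shows up with shape (99, 128, 99, 512) in the output above even though it does not change between steps.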

In [None]:
# x_sequence = K.placeholder(shape=(HYPER.B, 100, 1), dtype=tf.int32)
# init_output = K.placeholder(shape=(HYPER.B, HYPER.K), dtype=tf.float32)
# init_h = K.placeholder(shape=(HYPER.B, HYPER.n), dtype=tf.float32)
# init_c = K.placeholder(shape=(HYPER.B, HYPER.n), dtype=tf.float32)
# init_lstm_states = (init_c, init_h)
# im_context = K.placeholder(shape=(HYPER.B, HYPER.L, HYPER.D), dtype=tf.float32)
# initial_states = [init_output, init_h, init_lstm_states, im_context]
# cell = ConditionedAttentiveRNNCell()
# K.rnn(cell.step_function_training, x_sequence, initial_states=initial_states, go_backwards=False, mask=None, constants=None, unroll=False, input_length=None)

In [None]:
# step = 0
# bin_len = 100
# x_sequence = K.placeholder(shape=(HYPER.B, bin_len, 1), dtype=tf.int32)
# init_h = K.placeholder(shape=(HYPER.B, HYPER.n), dtype=tf.float32)
# init_c = K.placeholder(shape=(HYPER.B, HYPER.n), dtype=tf.float32)
# init_lstm_states = tf.contrib.rnn.LSTMStateTuple(init_c, init_h)
# im_context = K.placeholder(shape=(HYPER.B, HYPER.L, HYPER.D), dtype=tf.float32)
# initial_params = (step, init_output, x_sequence, init_h, init_lstm_states, im_context)
# cell = ConditionedAttentiveRNNCell()
# tf.while_loop(lambda step, y, x_s, *_: step < tf.shape(x_s)[0], cell.make_tf_lstm_training, initial_params)

In [None]:
print(K.int_shape(init_h))