# im2latex(S): Deep Learning Model

&copy; Copyright 2017 Sumeet S Singh

    This file is part of the im2latex solution by Sumeet S Singh (there are other im2latex solutions out there).

    This program is free software: you can redistribute it and/or modify
    it under the terms of the Affero GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    Affero GNU General Public License for more details.

    You should have received a copy of the Affero GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.

## The Model
* Follows the [Show, Attend and Tell paper](https://www.semanticscholar.org/paper/Show-Attend-and-Tell-Neural-Image-Caption-Generati-Xu-Ba/146f6f6ed688c905fb6e346ad02332efd5464616)
* [VGG ConvNet (16 or 19)](http://www.robots.ox.ac.uk/~vgg/research/very_deep/) without the top-3 layers
    * Pre-initialized with the VGG weights but allowed to train
    * The ConvNet outputs $D$-dimensional vectors in a $W \times H$ grid, where $W$ and $H$ are 1/16th of the input image size (due to 4 max-pool layers). Defining $W \cdot H \equiv L$, the ConvNet output represents $L$ locations of the image, $i \in [1,L]$, and correspondingly outputs $L$ annotation vectors $a_i$, each of size $D$.
* A dense (FC) attention model: The deterministic soft-attention model of the paper computes $\alpha_{t,i}$ which is used to select or blend the $a_i$ vectors before being fed as inputs to the decoder LSTM network (see below).
    * Inputs to the attention model are $a_i$ and $h_{t-1}$ (the previous hidden state of the LSTM network, see below). The softmax is taken over the $L$ locations $i$:
    $$\alpha_{t,i} = \mathrm{softmax} ( f_{att}(a_i, h_{t-1}) )$$
* A Decoder model: A conditioned LSTM that outputs probabilities of the text tokens $y_t$ at each step. The LSTM is conditioned upon $z_t = \sum_{i=1}^L \alpha_{t,i} a_i$ and takes the previous hidden state $h_{t-1}$ as input. In addition, an embedding of the previous output, $Ey_{t-1}$, is also input to the LSTM. At training time, $y_{t-1}$ is taken from the training samples, while at inference time it is fed back from the previously predicted token.
    * $y$ is taken from a fixed vocabulary of $K$ words. An embedding matrix $E$ is used to narrow its representation. The embedding weights $E$ are learnt end-to-end by the model as well.
    * The decoder LSTM uses a deep layer between $h_t$ and $y_t$. It is called a deep output layer and is described in [section 3.2.2 of this paper](https://www.semanticscholar.org/paper/How-to-Construct-Deep-Recurrent-Neural-Networks-Pascanu-G%C3%BCl%C3%A7ehre/533ee188324b833e059cb59b654e6160776d5812). That is:
    $$ p(y_t) = \mathrm{softmax} \Big( f_{out}(Ey_{t-1}, h_t, z_t) \Big) $$
* Initialization MLPs: Two MLPs are used to produce the initial memory state $c_0$ and the initial hidden state $h_0$ of the LSTM. Each MLP takes the features of the entire image (i.e. the average of the $a_i$) as its input and is trained end-to-end.
    $$ c_0 = f_{init,c}\Big( \frac{1}{L} \sum_{i=1}^L a_i \Big) $$
    $$ h_0 = f_{init,h}\Big( \frac{1}{L} \sum_{i=1}^L a_i \Big) $$
* Training:
    * The three models above (attention model, decoder LSTM and initialization MLPs), i.e. all except the conv-net, are trained end-to-end using SGD
    * The model is trained for a variable number of time steps, which depends on each batch
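
The initialization MLPs described above can be sketched in NumPy as follows. This is a minimal illustration, not the notebook's implementation: the function name `init_lstm_state`, the single tanh layer per MLP, and the toy sizes are all assumptions made for the example.

```python
import numpy as np

def init_lstm_state(a, W_c, b_c, W_h, b_h):
    """Compute (c_0, h_0) from annotation vectors a of shape (B, L, D).

    Assumption: each f_init is a single tanh layer; the text only says "MLP".
    """
    a_mean = a.mean(axis=1)               # (B, D): average over the L locations
    c0 = np.tanh(a_mean @ W_c + b_c)      # (B, n): initial memory state
    h0 = np.tanh(a_mean @ W_h + b_h)      # (B, n): initial hidden state
    return c0, h0

rng = np.random.RandomState(0)
B, L, D, n = 2, 6, 4, 3                   # toy sizes, chosen only for illustration
a = rng.normal(size=(B, L, D))
c0, h0 = init_lstm_state(a,
                         rng.normal(size=(D, n)), np.zeros(n),
                         rng.normal(size=(D, n)), np.zeros(n))
print(c0.shape, h0.shape)                 # (2, 3) (2, 3)
```

Both MLPs see the same averaged feature vector but have separate weights, so $c_0$ and $h_0$ can differ.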

## References
1. Show, Attend and Tell
    * [Paper](https://www.semanticscholar.org/paper/Show-Attend-and-Tell-Neural-Image-Caption-Generati-Xu-Ba/146f6f6ed688c905fb6e346ad02332efd5464616)
    * [Slides](https://pdfs.semanticscholar.org/b336/f6215c3c15802ca5327cd7cc1747bd83588c.pdf?_ga=2.52116077.559595598.1498604153-2037060338.1496182671)
    * [Author's Theano code](https://github.com/kelvinxu/arctic-captions)
1. [Simonyan, Karen and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” CoRR abs/1409.1556 (2014): n. pag.](http://www.robots.ox.ac.uk/~vgg/research/very_deep/)
1. [im2latex solution of Harvard NLP](http://lstm.seas.harvard.edu/latex/)
1. [im2latex-dataset tools forked from Harvard NLP](https://github.com/untrix/im2latex-dataset)

In [1]:
import pandas as pd
import os
from six.moves import cPickle as pickle
import dl_commons as dlc
import tf_commons as tfc
import tensorflow as tf
from keras.applications.vgg16 import VGG16
from keras.layers import Input, Embedding, Dense, Activation, Dropout, Concatenate, Permute
from keras.callbacks import LambdaCallback
from keras.models import Model
from keras import backend as K
from keras.engine import Layer
import keras
import threading
import numpy as np
import collections
# from Im2LatexDecoderRNNParams import D_RNN
# from Im2LatexDecoderRNN import Im2LatexDecoderRNN
# from Im2LatexModel import Im2LatexModel, HYPER
from data_reader import BatchIterator

Using TensorFlow backend.


# TODOs
* Implement the beta scalar ('selector') that scales alpha.

In [2]:
data_folder = '../data/generated2'
image_folder = os.path.join(data_folder,'formula_images')
raw_data_folder = os.path.join(data_folder, 'training')

### HyperParams

In [3]:
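def get_vocab_size(data_dir_):
    # Read the vocabulary from the directory passed in (not the global data_folder).
    df_vocab = pd.read_pickle(os.path.join(data_dir_, 'df_vocab.pkl'))
    # Assumes ids start at 0, so the vocabulary size is the largest id plus one.
    return df_vocab.id.max() + 1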

### Encoder Model
[VGG ConvNet (16 or 19)](http://www.robots.ox.ac.uk/~vgg/research/very_deep/) without the top-3 layers
* Pre-initialized with the VGG weights but allowed to train
* The ConvNet outputs $D$-dimensional vectors in a $W \times H$ grid, where $W$ and $H$ are scaled-down dimensions of the input image (due to 5 max-pool layers). Defining $W \cdot H \equiv L$, the ConvNet output represents $L$ locations of the image, $i \in [1,L]$, and correspondingly outputs $L$ annotation vectors $a_i$, each of size $D$.
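
As a quick check of the grid arithmetic above, the sketch below computes $H$, $W$ and $L$ for a given input size and number of max-pool layers. The function name `annotation_grid` and the 224x224 input are illustrative assumptions, and the code assumes every max-pool layer uses a 2x2 window with stride 2.

```python
def annotation_grid(height, width, num_pools):
    """Return (H, W, L) of the conv-net output grid.

    Assumes each max-pool layer halves both dimensions (2x2 window,
    stride 2) and floors odd sizes.
    """
    scale = 2 ** num_pools
    h, w = height // scale, width // scale
    return h, w, h * w

print(annotation_grid(224, 224, 4))  # (14, 14, 196)
print(annotation_grid(224, 224, 5))  # (7, 7, 49)
```

Each additional pooling layer divides the number of locations $L$ by four, which directly shrinks the set of $a_i$ the attention model has to score.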

The conv-net is *not trained* in the original paper, so the images can be preprocessed separately and the conv-net outputs fed directly into the model.

### Decoder Model
A dense (FC) attention model: The deterministic soft-attention model of the paper computes $\alpha_{t,i}$ which is used to select or blend the $a_i$ vectors before being fed as inputs to the decoder LSTM network (see below).
* Inputs to the attention model are $a_i$ and $h_{t-1}$ (the previous hidden state of the LSTM network, see below). The softmax is taken over the $L$ locations $i$: $$\alpha_{t,i} = \mathrm{softmax} ( f_{att}(a_i, h_{t-1}) )$$
* Note that the model $f_{att}$ shares weights across all $a_i$ (i.e. for all $i = 1 \ldots L$). The shared weight matrix for the $a_i$ therefore has shape $(D, D)$, while $a$ has shape $(B, L, D)$, where $B$ is the batch size. The weight matrix for $h_{t-1}$ is separate and has the expected shape $(n, D)$, where $n$ is the size of the LSTM hidden state. This sharing of weights across the $a_i$ is interesting.
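
The weight sharing described above can be sketched in NumPy. The notebook does not specify the form of $f_{att}$, so the single tanh layer followed by a projection vector `v` is an assumption, as are the function names and toy sizes; only the shapes $(D, D)$, $(n, D)$ and $(B, L, D)$ come from the text.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_attention(a, h_prev, W_a, W_h, b, v):
    """a: (B, L, D), h_prev: (B, n), W_a: (D, D), W_h: (n, D), b: (D,), v: (D,).

    Returns alpha of shape (B, L) and the context z of shape (B, D).
    """
    proj_a = a @ W_a                          # (B, L, D): the same W_a is applied at every location
    proj_h = (h_prev @ W_h)[:, None, :]       # (B, 1, D): broadcast over the L locations
    scores = np.tanh(proj_a + proj_h + b) @ v # (B, L): one scalar score per location
    alpha = softmax(scores, axis=1)           # normalise over the L locations
    z = (alpha[:, :, None] * a).sum(axis=1)   # (B, D): weighted blend of the a_i
    return alpha, z

rng = np.random.RandomState(0)
B, L, D, n = 2, 6, 4, 3                       # toy sizes, chosen only for illustration
alpha, z = soft_attention(rng.normal(size=(B, L, D)), rng.normal(size=(B, n)),
                          rng.normal(size=(D, D)), rng.normal(size=(n, D)),
                          np.zeros(D), rng.normal(size=D))
print(alpha.shape, z.shape)                   # (2, 6) (2, 4)
```

Because `W_a` does not depend on the location index, the number of attention parameters is independent of $L$, which is what lets the same model handle images of different sizes.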

A Decoder model: A conditioned LSTM that outputs probabilities of the text tokens $y_t$ at each step. The LSTM is conditioned upon $z_t = \sum_{i=1}^L \alpha_{t,i} a_i$ and takes the previous hidden state $h_{t-1}$ as input. In addition, an embedding of the previous output, $Ey_{t-1}$, is also input to the LSTM. At training time, $y_{t-1}$ is taken from the training samples, while at inference time it is fed back from the previously predicted token.
* $y$ is taken from a fixed vocabulary of $K$ words. An embedding matrix $E$ is used to narrow its representation to an $m$-dimensional dense vector. The embedding weights $E$ are learnt end-to-end by the model as well.
* The decoder LSTM uses a deep layer between $h_t$ and $y_t$. It is called a deep output layer and is described in [section 3.2.2 of this paper](https://www.semanticscholar.org/paper/How-to-Construct-Deep-Recurrent-Neural-Networks-Pascanu-G%C3%BCl%C3%A7ehre/533ee188324b833e059cb59b654e6160776d5812). That is:
$$ p(y_t) = \mathrm{softmax} \Big( f_{out}(Ey_{t-1}, h_t, z_t) \Big) $$
* Optionally $z_t = \beta \sum_{i=1}^L \alpha_{t,i} a_i$, where $\beta = \sigma(f_{\beta}(h_{t-1}))$ is a scalar used to modulate the strength of the context. It turns out that for the original use-case of caption generation, the network would learn to emphasize objects by turning up the value of this scalar when it was focusing on objects. It is not clear at this time whether we'll need this feature for im2latex.
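
The gated context and the deep output layer can be sketched together in NumPy. Everything here is illustrative: $f_{\beta}$ is assumed to be a single linear layer, and $f_{out}$ is assumed to take the additive form $L_o(Ey_{t-1} + L_h h_t + L_z z_t)$ used in Show, Attend and Tell. The function and weight names are hypothetical and do not come from the notebook.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_context(alpha, a, h_prev, w_beta, b_beta):
    """z_t = beta * sum_i alpha_{t,i} a_i, with beta = sigmoid(f_beta(h_{t-1}))."""
    beta = sigmoid(h_prev @ w_beta + b_beta)  # (B,): one scalar per sample
    z = (alpha[:, :, None] * a).sum(axis=1)   # (B, D): ungated context
    return beta[:, None] * z, beta

def deep_output(Ey_prev, h_t, z_t, L_h, L_z, L_o):
    """p(y_t) = softmax(L_o (E y_{t-1} + L_h h_t + L_z z_t)); assumed form of f_out."""
    hidden = Ey_prev + h_t @ L_h + z_t @ L_z  # (B, m): everything projected to embedding size
    return softmax(hidden @ L_o, axis=-1)     # (B, K): distribution over the vocabulary

rng = np.random.RandomState(0)
B, L, D, n, m, K = 2, 6, 4, 3, 5, 7           # toy sizes, chosen only for illustration
a = rng.normal(size=(B, L, D))
alpha = softmax(rng.normal(size=(B, L)), axis=1)
h = rng.normal(size=(B, n))
z, beta = gated_context(alpha, a, h, rng.normal(size=n), 0.0)
p = deep_output(rng.normal(size=(B, m)), h, z,
                rng.normal(size=(n, m)), rng.normal(size=(D, m)), rng.normal(size=(m, K)))
print(z.shape, beta.shape, p.shape)           # (2, 4) (2,) (2, 7)
```

Since $\beta$ lies in $(0, 1)$, it can only attenuate the context, so dropping it (as the notebook currently does) is equivalent to fixing $\beta = 1$.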


### Input Generator

In [4]:
# df_train = pd.read_pickle(os.path.join(data_folder, 'training', 'df_train.pkl'))
# df_train

In [5]:
# df_train.shape

In [6]:
# raw_seq_train = pd.read_pickle(os.path.join(data_folder, 'training', 'raw_seq_train.pkl'))
# raw_seq_train.keys()

In [7]:
# with open(os.path.join(data_folder, 'training', 'padded_image_dim.pkl'), 'rb') as f:
#   padded_image_dim = pickle.load(f)
# print(padded_image_dim)

In [8]:
print ('starting')
it = BatchIterator(raw_data_folder, image_folder)
print ('created batch iterator')


starting
created batch iterator


In [14]:
for i in range(684):
    batch = it.next()

ShuffleIterator epoch 1, iter 3, bin-batch id (31, 23)
ShuffleIterator epoch 1, iter 4, bin-batch id (111, 52)
ShuffleIterator epoch 1, iter 5, bin-batch id (61, 49)
ShuffleIterator epoch 1, iter 6, bin-batch id (151, 39)
ShuffleIterator epoch 1, iter 7, bin-batch id (41, 71)
ShuffleIterator epoch 1, iter 8, bin-batch id (41, 2)
ShuffleIterator epoch 1, iter 9, bin-batch id (91, 50)
ShuffleIterator epoch 1, iter 10, bin-batch id (91, 14)
ShuffleIterator epoch 1, iter 11, bin-batch id (111, 3)
ShuffleIterator epoch 1, iter 12, bin-batch id (151, 46)
ShuffleIterator epoch 1, iter 13, bin-batch id (51, 43)
ShuffleIterator epoch 1, iter 14, bin-batch id (91, 1)
ShuffleIterator epoch 1, iter 15, bin-batch id (61, 72)
ShuffleIterator epoch 1, iter 16, bin-batch id (41, 89)
ShuffleIterator epoch 1, iter 17, bin-batch id (41, 1)
ShuffleIterator epoch 1, iter 18, bin-batch id (41, 69)
ShuffleIterator epoch 1, iter 19, bin-batch id (31, 2)
ShuffleIterator epoch 1, iter 20, bin-batch id (31, 7)
...

ShuffleIterator epoch 1, iter 149, bin-batch id (111, 65)
ShuffleIterator epoch 1, iter 150, bin-batch id (51, 101)
ShuffleIterator epoch 1, iter 151, bin-batch id (41, 85)
ShuffleIterator epoch 1, iter 152, bin-batch id (31, 17)
ShuffleIterator epoch 1, iter 153, bin-batch id (61, 63)
ShuffleIterator epoch 1, iter 154, bin-batch id (151, 23)
ShuffleIterator epoch 1, iter 155, bin-batch id (151, 42)
ShuffleIterator epoch 1, iter 156, bin-batch id (81, 22)
ShuffleIterator epoch 1, iter 157, bin-batch id (61, 74)
ShuffleIterator epoch 1, iter 158, bin-batch id (61, 75)
ShuffleIterator epoch 1, iter 159, bin-batch id (51, 63)
ShuffleIterator epoch 1, iter 160, bin-batch id (61, 37)
ShuffleIterator epoch 1, iter 161, bin-batch id (81, 37)
ShuffleIterator epoch 1, iter 162, bin-batch id (71, 37)
ShuffleIterator epoch 1, iter 163, bin-batch id (71, 9)
ShuffleIterator epoch 1, iter 164, bin-batch id (81, 15)
ShuffleIterator epoch 1, iter 165, bin-batch id (81, 12)
...

ShuffleIterator epoch 1, iter 293, bin-batch id (41, 68)
ShuffleIterator epoch 1, iter 294, bin-batch id (31, 3)
ShuffleIterator epoch 1, iter 295, bin-batch id (81, 46)
ShuffleIterator epoch 1, iter 296, bin-batch id (81, 1)
ShuffleIterator epoch 1, iter 297, bin-batch id (81, 18)
ShuffleIterator epoch 1, iter 298, bin-batch id (111, 36)
ShuffleIterator epoch 1, iter 299, bin-batch id (81, 3)
ShuffleIterator epoch 1, iter 300, bin-batch id (91, 19)
ShuffleIterator epoch 1, iter 301, bin-batch id (71, 10)
ShuffleIterator epoch 1, iter 302, bin-batch id (41, 84)
ShuffleIterator epoch 1, iter 303, bin-batch id (91, 46)
ShuffleIterator epoch 1, iter 304, bin-batch id (111, 28)
ShuffleIterator epoch 1, iter 305, bin-batch id (61, 34)
ShuffleIterator epoch 1, iter 306, bin-batch id (61, 94)
ShuffleIterator epoch 1, iter 307, bin-batch id (71, 27)
ShuffleIterator epoch 1, iter 308, bin-batch id (41, 60)
ShuffleIterator epoch 1, iter 309, bin-batch id (51, 98)
...

ShuffleIterator epoch 1, iter 437, bin-batch id (31, 29)
ShuffleIterator epoch 1, iter 438, bin-batch id (31, 38)
ShuffleIterator epoch 1, iter 439, bin-batch id (51, 74)
ShuffleIterator epoch 1, iter 440, bin-batch id (71, 24)
ShuffleIterator epoch 1, iter 441, bin-batch id (51, 70)
ShuffleIterator epoch 1, iter 442, bin-batch id (91, 24)
ShuffleIterator epoch 1, iter 443, bin-batch id (41, 40)
ShuffleIterator epoch 1, iter 444, bin-batch id (81, 38)
ShuffleIterator epoch 1, iter 445, bin-batch id (31, 40)
ShuffleIterator epoch 1, iter 446, bin-batch id (111, 32)
ShuffleIterator epoch 1, iter 447, bin-batch id (41, 86)
ShuffleIterator epoch 1, iter 448, bin-batch id (41, 80)
ShuffleIterator epoch 1, iter 449, bin-batch id (41, 81)
ShuffleIterator epoch 1, iter 450, bin-batch id (71, 30)
ShuffleIterator epoch 1, iter 451, bin-batch id (51, 38)
ShuffleIterator epoch 1, iter 452, bin-batch id (151, 11)
ShuffleIterator epoch 1, iter 453, bin-batch id (91, 15)
...

ShuffleIterator epoch 1, iter 581, bin-batch id (51, 75)
ShuffleIterator epoch 1, iter 582, bin-batch id (91, 32)
ShuffleIterator epoch 1, iter 583, bin-batch id (91, 37)
ShuffleIterator epoch 1, iter 584, bin-batch id (71, 41)
ShuffleIterator epoch 1, iter 585, bin-batch id (81, 32)
ShuffleIterator epoch 1, iter 586, bin-batch id (91, 20)
ShuffleIterator epoch 1, iter 587, bin-batch id (111, 44)
ShuffleIterator epoch 1, iter 588, bin-batch id (71, 14)
ShuffleIterator epoch 1, iter 589, bin-batch id (71, 17)
ShuffleIterator epoch 1, iter 590, bin-batch id (51, 29)
ShuffleIterator epoch 1, iter 591, bin-batch id (61, 3)
ShuffleIterator epoch 1, iter 592, bin-batch id (111, 21)
ShuffleIterator epoch 1, iter 593, bin-batch id (51, 66)
ShuffleIterator epoch 1, iter 594, bin-batch id (41, 23)
ShuffleIterator epoch 1, iter 595, bin-batch id (51, 97)
ShuffleIterator epoch 1, iter 596, bin-batch id (61, 42)
ShuffleIterator epoch 1, iter 597, bin-batch id (41, 75)
...

<data_reader.BatchIterator object at 0x1040bdb90>


# End