## **Interpretability via Attentional and Memory-based Interfaces Using TensorFlow**
#### A closer look at the reasoning inside your deep networks

### **Table of Contents**
5. [Preprocessing Components](#5)
6. [Sample the Data](#6)
7. [Model](#7)
8. [Training](#8)
9. [Results](#9)
10. [Attention for a Sample](#10)
11. [Attentional History](#11)
12. [Attentional Interface Variants](#12)
13. [Caveats](#13)
14. [Interpretability and why it’s important](#14)
15. [References](#15)
16. [Author Bio](#16)

In [None]:
# Establish basedir (useful if running as python package)
import os
basedir = ""

In [None]:
# Hide all warning messages
import warnings
warnings.filterwarnings('ignore')

<a id='5'></a>
### **V. Preprocessing Components**

In this section, we preprocess our raw input data. The main component is the `Vocab` class, which we initialize from our `vocab.txt` file. This file contains all of the tokens (words) from our raw input, sorted by descending frequency. The other helper we need is `ids_to_tokens()`, which converts a list of ids back into tokens we can read. We will use it when reading our inputs and pairing each word with its attention score.
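
As a rough illustration of the id-to-token mapping described above, here is a minimal sketch. It is not the `Vocab` class or the `ids_to_tokens()` function from `utility`: the names `SimpleVocab` and `ids_to_tokens_sketch`, the one-token-per-line layout, and the placement of `PAD` at id 0 are all assumptions made for this example.

```python
class SimpleVocab:
    """Hypothetical stand-in for the notebook's Vocab class."""

    def __init__(self, tokens):
        # Assumption: tokens arrive sorted by descending frequency,
        # so earlier tokens receive smaller ids.
        self.id_to_token = list(tokens)
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}

    @classmethod
    def from_lines(cls, lines):
        # Assumption: one token per line, as a vocab.txt file might be laid out.
        return cls(line.strip() for line in lines if line.strip())


def ids_to_tokens_sketch(ids, vocab, unk="UNK"):
    """Map a list of integer ids back to readable tokens."""
    n = len(vocab.id_to_token)
    return [vocab.id_to_token[i] if 0 <= i < n else unk for i in ids]


# Toy vocabulary; PAD at id 0 is an assumption for illustration only.
vocab = SimpleVocab.from_lines(["PAD\n", "the\n", "movie\n", "great\n"])
tokens = ids_to_tokens_sketch([1, 2, 3, 0, 0], vocab)
print(tokens)  # ['the', 'movie', 'great', 'PAD', 'PAD']
```

Keeping the reverse mapping as a plain list makes it cheap to line up each token with its attention score later on.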

#### Processing Pipeline

In [None]:
from utility import *

<a id='6'></a>
### **VI. Sample the Data**

Now, we will see what our inputs look like. The `processed_review` holds a review as a sequence of token ids, and `review_seq_len` tells us how long the review is. Unless we use dynamic computation graphs, we need to feed fixed-size inputs into our TensorFlow models per batch. This means that shorter reviews are padded (with `PAD` tokens), and we do not want this padding to influence our model. In this implementation, the PADs do not prove too problematic, since inference depends on the entire summarized context (so no loss masking is needed). We also want to keep the PAD tokens when determining the attention scores, to show how the model learns not to focus on the PADs over time.
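
To make the fixed-size batching concrete, here is a small NumPy sketch of padding reviews to a common length while recording their true lengths. The function name `pad_batch` and the choice of `0` as the PAD id are assumptions for illustration; the notebook's own preprocessing may differ.

```python
import numpy as np

PAD_ID = 0  # assumption: id reserved for the PAD token


def pad_batch(reviews, max_len, pad_id=PAD_ID):
    """Pad (or truncate) each review to max_len and record true lengths."""
    batch = np.full((len(reviews), max_len), pad_id, dtype=np.int32)
    seq_lens = np.zeros(len(reviews), dtype=np.int32)
    for row, review in enumerate(reviews):
        kept = review[:max_len]            # truncate overly long reviews
        batch[row, :len(kept)] = kept      # remaining slots stay as PAD
        seq_lens[row] = len(kept)
    return batch, seq_lens


batch, seq_lens = pad_batch([[5, 8, 2], [7], [4, 4, 4, 4, 4, 4]], max_len=4)
print(batch)
print(seq_lens)
```

The returned lengths play the role of `review_seq_len`: they let a recurrent encoder know where the real tokens end, even though the PAD positions remain in the batch.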

In [None]:
%pylab inline

In [None]:
class parameters():
    """
    Arguments for data processing.
    """
    def __init__(self):
        """
        """  
        self.data_dir="data/processed_reviews/train.p"           # location of reviews data (train|validation)

In [None]:
FLAGS = parameters()
sample_data(FLAGS.data_dir, basedir)

<a id='7'></a>
### **VII. Model**

We will start with the operation functions. `_xavier_weight_init()` is a small function that initializes our weights according to the nonlinearity that will be applied to them. The initialization is chosen so that the outputs have roughly unit variance before they are sent to the activation function.

We use this initialization so that the values entering the nonlinearity are not large, since large values saturate it at the extremes and cause gradient issues. We also have a helper function for layer normalization, `ln()`, another optimization technique that normalizes the inputs within the GRU (Gated Recurrent Unit) before the activation function is applied. This helps us control gradient issues and can even allow us to use larger learning rates. The layer normalization is applied in the `custom_GRU()` function prior to the sigmoid and tanh operations. The last helper function is `add_dropout_and_layers()`, which adds dropout to our recurrent outputs and allows us to create multi-layered recurrent architectures.
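
The two ideas above can be sketched in a few lines of NumPy. This is not the notebook's `_xavier_weight_init()` or `ln()`: the function names, the uniform (Glorot) variant, and the absence of any nonlinearity-specific gain are assumptions made to keep the example short.

```python
import numpy as np


def xavier_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier uniform init, giving Var(W) = 2 / (fan_in + fan_out)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


def layer_norm(x, gain=1.0, bias=0.0, eps=1e-5):
    """Normalize each row to zero mean and unit variance, then rescale."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias


rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 200))      # unit-variance inputs
W = xavier_uniform(200, 200, rng)
pre_activation = x @ W                    # variance should stay close to 1

# Layer norm recovers a well-scaled signal even from a badly scaled one.
normed = layer_norm(5.0 * pre_activation + 3.0)

print(round(float(pre_activation.var()), 2))
print(np.allclose(normed.mean(axis=-1), 0.0, atol=1e-6))
```

With equal fan-in and fan-out, each weight has variance `1 / fan_in`, which is why the pre-activations keep roughly the same variance as the inputs.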

#### Operation functions

Let's briefly describe the model pipeline and see how our inputs change representation along the way. First, we initialize our placeholders, which hold the reviews, lengths, sentiment, embeddings, etc. Then we build the encoder, which takes our input review and first embeds it using the GloVe embeddings. We then feed the embedded tokens into a GRU in order to encode the input. We use the output from each timestep of the GRU as the input to the attentional layer. Notice that we could have removed the attentional interface completely and used just the last relevant hidden state from the encoder GRU to predict the sentiment, but adding this attention layer allows us to see how the model processes the input review.

In the attentional layer, we apply one nonlinear layer followed by another in order to reduce our representation to one dimension. We can then normalize these values to compute our attention scores. These scores are broadcast and multiplied with the original inputs to obtain our summarized vector. We use this vector to obtain our predicted sentiment via normalization in the decoder. Notice that we do not use a previous state ($s_{i-1}$), since the task involves creating just one context and extracting the sentiment from that.
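
The attentional layer described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the notebook's TensorFlow code: the function name `attention_summary`, the use of `tanh` for both nonlinear layers, the softmax as the normalization step, and all shapes are hypothetical choices.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention_summary(H, W1, b1, w2, b2):
    """H: [batch, steps, hidden] encoder outputs -> (context, scores)."""
    hidden = np.tanh(H @ W1 + b1)        # first nonlinear layer: [batch, steps, attn]
    energy = np.tanh(hidden @ w2 + b2)   # second layer, one value per step: [batch, steps]
    scores = softmax(energy, axis=1)     # attention scores sum to 1 over the steps
    # Broadcast the scores over the hidden dimension and sum over time.
    context = (scores[..., None] * H).sum(axis=1)   # [batch, hidden]
    return context, scores


rng = np.random.default_rng(0)
n_batch, n_steps, n_hidden, n_attn = 2, 5, 8, 4
H = rng.standard_normal((n_batch, n_steps, n_hidden))
W1 = rng.standard_normal((n_hidden, n_attn)) * 0.1
b1 = np.zeros(n_attn)
w2 = rng.standard_normal(n_attn) * 0.1
b2 = 0.0

context, scores = attention_summary(H, W1, b1, w2, b2)
print(context.shape, scores.shape)  # (2, 8) (2, 5)
```

Note that PAD positions are not masked out here, in keeping with the choice described earlier to let the model learn to ignore them; `scores` is the quantity one would plot against the tokens of a review.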

We then define our loss as the cross entropy between the predicted and the ground truth sentiment. We decay the learning rate, subject to an absolute minimum, and use the Adam optimizer [9]. With all of these components, we have built our graph.
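
As a small worked example of a decayed learning rate with a floor, the sketch below assumes plain exponential decay per global step. The function name `decayed_lr` and the exact decay form are assumptions; the values are the `lr`, `decay_rate`, and `min_lr` settings used in the training cell of this notebook.

```python
def decayed_lr(base_lr, decay_rate, global_step, min_lr):
    """Exponential decay per global step, never dropping below min_lr."""
    return max(min_lr, base_lr * decay_rate ** global_step)


# lr=1e-4, decay_rate=0.96, min_lr=1e-6 mirror the training parameters.
for step in (0, 10, 100, 1000):
    print(step, decayed_lr(1e-4, 0.96, step, 1e-6))
```

Under these assumptions the rate shrinks by a factor of 100 after a little over 100 steps, at which point the floor takes over and the rate stays at `min_lr`.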

<a id='8'></a>
### **VIII. Training**

#### Helper functions

In [None]:
from helper_functions import *

In [None]:
class parameters():
    """
    Arguments for data processing.
    """
    def __init__(self):
        """
        """  
        self.data_dir="data/processed_reviews"           # location of reviews data
        self.ckpt_dir="data/processed_reviews/ckpt"      # location of model checkpoints
        self.mode="train"                                # train|infer
        self.model="new"                                 # old|new
        self.lr=1e-4                                     # learning rate
        self.num_epochs=1                                # num of epochs 
        self.batch_size=256                              # batch size
        self.hidden_size=200                             # num hidden units for RNN
        self.embedding="glove"                           # random|glove
        self.emb_size=200                                # num hidden units for embeddings
        self.max_grad_norm=5                             # max gradient norm
        self.keep_prob=0.9                               # Keep prob for dropout layers
        self.num_layers=1                                # number of recurrent layers
        self.max_input_length=300                        # max number of words per review
        self.min_lr=1e-6                                 # minimum learning rate
        self.decay_rate=0.96                             # Decay rate for lr per global step (train batch)
        self.save_every=10                               # Save the model every <save_every> epochs
        self.model_name="imdb_model"                     # Name of the model

In [None]:
from utility import *
from operation_functions import *
from model import *

import time
start_time = time.time()

FLAGS = parameters()
train(FLAGS, basedir)

print("--- %s seconds ---" % (time.time() - start_time))

<a id='9'></a>
### **IX. Results**

In [None]:
class parameters():
    """
    Arguments for data processing.
    """
    def __init__(self):
        """
        """
        self.ckpt_dir="data/processed_reviews/ckpt"      # location of model checkpoints
        self.model_name="imdb_model"                     # Name of the model

In [None]:
from results import *

In [None]:
FLAGS = parameters()
# Add model name to ckpt dir
FLAGS.ckpt_dir = FLAGS.ckpt_dir + '/%s'%(FLAGS.model_name)
plot_metrics(FLAGS, basedir)

We can see a bit of overfitting after ~epoch 7. If you want to achieve the best performance, use all 25,000 training/test samples and include much more stringent regularization, along with gradient clipping and a more rigorous decay. But since we just wanted to see some interpretable attention scores, this performance was satisfactory.

<a id='10'></a>
### **X. Attention for a Sample**

In [None]:
class parameters():
    """
    Arguments for data processing.
    """
    def __init__(self):
        """
        """
        self.data_dir="data/processed_reviews"           # location of reviews data
        self.ckpt_dir="data/processed_reviews/ckpt"      # location of model checkpoints
        self.model_name="imdb_model"                     # Name of the model
        self.sample_num=2                                # Sample num to view attn plot. [0-4]
        self.num_rows=5                                  # Number of rows to show in attn visualization.

In [None]:
from attention_for_sample import *

In [None]:
FLAGS = parameters()
# Add model name to ckpt dir
FLAGS.ckpt_dir = FLAGS.ckpt_dir + '/%s'%(FLAGS.model_name)
process_sample_attn(FLAGS, basedir)