[View in Colaboratory](https://colab.research.google.com/github/santhavathi/Deep-Learning-MLBLR/blob/master/CommonSense.ipynb)

## Three-way Attention and Relational Knowledge for Commonsense Machine Comprehension

#### Research Paper and PyTorch Code
This paper addresses machine comprehension using commonsense knowledge.

[Research Paper](https://arxiv.org/pdf/1803.00191.pdf)

[PyTorch implementation](https://github.com/intfloat/commonsenserc/tree/ed4d40d20eabf7788c56366fe0e224287db0d383)



#### Objective
1. The objective of this paper is to train a model that reads English passages and predicts whether a candidate answer to a question about the passage is right or not.

2. The dataset is a corpus of passage texts. Each passage has one or more questions, and each question has two candidate answers, of which one is the right answer.

3. So input to the model is a passage text + question + answer and output is a label 0/1 representing wrong/right answer.

4. Model should predict if the answer for a question based on a passage is right or not.
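
To make this framing concrete, here is a minimal sketch of how one question with several candidate answers can be expanded into binary-labelled rows. The function name `expand_to_binary_examples` and its argument names are hypothetical and chosen for illustration only; they are not the dataset's actual schema.

```python
def expand_to_binary_examples(passage, question, choices, correct):
    """Turn one multiple-choice item into (passage, question, answer, label) rows.

    `correct` is the position of the right answer in `choices` (an assumption
    of this sketch, not a field of the real dataset).
    """
    return [
        (passage, question, choice, 1 if i == correct else 0)
        for i, choice in enumerate(choices)
    ]


rows = expand_to_binary_examples(
    passage="I fixed the flat tire on my bike .",
    question="what was repaired ?",
    choices=["the tire", "the car"],
    correct=0,
)
for _, _, answer, label in rows:
    print(answer, label)
```

Each row is then scored independently by the model, and the label says whether that particular answer is the right one.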



#### Model

**Input Layer**

300-dimensional GloVe embedding + 12-dimensional part-of-speech embedding + 8-dimensional named-entity embedding + 10-dimensional relation embedding from ConceptNet

**Attention Layers**

Question-aware passage representation<br>
Passage-aware answer representation<br>
Question-aware answer representation<br>

**LSTM Layers**

Three BiLSTMs are applied to the concatenation of the above attention layer vectors to model the temporal dependency

**Attention Layers**

Three self-attention layers, one for each of the three BiLSTM outputs above

**Output Layer**

Sigmoid of the Bilinear interaction of previous attention layers



### Highlights

* A Three-way Attentive Network (TriAN) is used to model interactions between the passage, question and answers.
* Different questions need to focus on different parts of the passage, so an attention mechanism is a natural choice, and it turns out to be effective for reading comprehension.
* To incorporate commonsense knowledge, the input is augmented with relation embeddings from ConceptNet, a graph of general knowledge, which improved accuracy by 1%.
* ConceptNet consists of over 21 million edges and 8 million nodes and shows state-of-the-art performance on tasks like word analogy and word relatedness.
* State-of-the-art performance with 83.95% accuracy on test data and 85.27% accuracy on dev data.
* The model is first pretrained on the RACE dataset for 10 epochs, which improved accuracy by 1%.
* The TriAN model is implemented in PyTorch.
* Models were trained on a single GPU (Tesla P40), and each epoch took about 80 seconds.
* Only the word embeddings of top 10 frequent words are fine-tuned during training.
* One layer of LSTM is used.
* The dimension of both forward and backward LSTM hidden state is set to 96. 
* Dropout rate is set to 0.4 for both input embeddings and BiLSTM outputs. 
* For parameter optimization, Adamax was used with an initial learning rate of $2 \times 10^{-3}$.
* Learning rate is then halved after 10 and 15 training epochs. 
* The model converges after 50 epochs.
* Gradients are clipped to have a maximum L2 norm of 10. 
* Minibatch with batch size 32 is used.
* Hyperparameters are optimized by random search strategy.
* Standard cross entropy function is used as the loss function to minimize.
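
Two of the training details above, the step-wise learning-rate schedule and gradient clipping by L2 norm, can be sketched in plain Python. This is an illustrative sketch, not the authors' training code: the function names are hypothetical, and reading "halved after 10 and 15 epochs" as "epochs are counted from 0 and the halving applies from epoch 10 and from epoch 15 onward" is an assumption.

```python
import math


def learning_rate(epoch, base_lr=2e-3, halve_after=(10, 15)):
    """Learning rate used during `epoch` (0-based, an assumption of this sketch)."""
    halvings = sum(1 for boundary in halve_after if epoch >= boundary)
    return base_lr * (0.5 ** halvings)


def clip_by_l2_norm(grads, max_norm=10.0):
    """Rescale a flat list of gradient values so that its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


print([learning_rate(e) for e in (0, 9, 10, 14, 15)])
print(clip_by_l2_norm([30.0, 40.0]))  # norm 50 is scaled down to norm 10
```

Gradients whose norm is already below the threshold are returned unchanged, so clipping only affects unusually large updates.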


### Detailed Explanation



#### Input Layer

**Passage** = p_inp => **Shape**(32 x 900) 
batch_size=32; 900 is the padded passage length, chosen to be at least the word count of the longest passage. Each of the 900 positions holds the word's integer index in the word vocabulary

**Passage Mask** = p_mask_inp => Shape(32 x 900)
batch_size=32; the mask is 1 where a real word is present and 0 at the padding positions of passages shorter than 900 words

**Passage Part of Speech** = p_pos_inp => Shape(32 x 900)
batch_size=32; for each of the 900 words, its part-of-speech tag (e.g. noun, verb) is represented by its index in the part-of-speech vocabulary

**Passage Named Entity** = p_ner_inp => Shape(32 x 900)
batch_size=32; for each of the 900 words, its named-entity tag (e.g. Product, Date, Person) is represented by its index in the named-entity vocabulary

**Question** = q_inp => Shape(32 x 25)
batch_size=32; 25 is the padded question length, chosen to be at least the word count of the longest question. Each of the 25 positions holds the word's integer index in the word vocabulary

**Question Mask** = q_mask_inp => Shape(32 x 25)
batch_size=32; the mask is 1 where a real word is present and 0 at the padding positions of questions shorter than 25 words

**Question Part of Speech** = q_pos_inp => Shape(32 x 25)
batch_size=32; for each of the 25 words, its part-of-speech tag (e.g. noun, verb) is represented by its index in the part-of-speech vocabulary

**Answer/Choice** = c_inp => Shape(32 x 35)
batch_size=32; 35 is the padded answer length, chosen to be at least the word count of the longest answer. Each of the 35 positions holds the word's integer index in the word vocabulary

**Choice Mask** = c_mask_inp => Shape(32 x 35)
batch_size=32; the mask is 1 where a real word is present and 0 at the padding positions of answers shorter than 35 words

**Handcrafted Features** = f_tensor_inp => Shape(32 x 900 x 5)
batch_size=32; 5 handcrafted features (in_q, in_c, lemma_in_q, lemma_in_c, tf) for each of the 900 words

* in_q = 1 if the passage word is present in the question, else 0
* in_c = 1 if the passage word is present in the answer, else 0
* lemma_in_q = 1 if the lemmatised passage word is present in the question, else 0 (e.g. sit is the lemma of sitting)
* lemma_in_c = 1 if the lemmatised passage word is present in the answer, else 0
* tf = term frequency of the word
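
The five handcrafted features can be illustrated with a small pure-Python sketch. Several things here are assumptions made for illustration: the function name `handcrafted_features`, the toy `lemma` dictionary standing in for a real lemmatiser, comparing lemmas against the lemmas of the question and answer words, and defining tf as the word's count divided by the passage length. The preprocessed files used in this notebook already contain these features, so this is not the original preprocessing code.

```python
from collections import Counter


def handcrafted_features(passage, question, answer, lemma=None):
    """Return one [in_q, in_c, lemma_in_q, lemma_in_c, tf] row per passage word."""
    lemma = lemma or {}  # hypothetical word -> lemma map; unknown words map to themselves

    def lem(word):
        return lemma.get(word, word)

    p_words, q_words, a_words = passage.split(), question.split(), answer.split()
    q_set, a_set = set(q_words), set(a_words)
    q_lemmas = {lem(w) for w in q_words}
    a_lemmas = {lem(w) for w in a_words}
    counts = Counter(p_words)

    rows = []
    for w in p_words:
        rows.append([
            int(w in q_set),              # in_q
            int(w in a_set),              # in_c
            int(lem(w) in q_lemmas),      # lemma_in_q
            int(lem(w) in a_lemmas),      # lemma_in_c
            counts[w] / len(p_words),     # tf (assumed definition: relative frequency)
        ])
    return rows


feats = handcrafted_features(
    passage="he was sitting on the bike",
    question="where did he sit ?",
    answer="on the bike",
    lemma={"sitting": "sit"},
)
for word, row in zip("he was sitting on the bike".split(), feats):
    print(word, row)
```

In this toy example "sitting" does not occur in the question verbatim, but its lemma "sit" does, so only the lemma feature fires for it.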

**Passage Question Relationship** = p_q_rel_inp => Shape(32 x 900)
batch_size=32; for each of the 900 passage words, the ConceptNet relation (if any) between that word and the question, encoded as an index in the relation vocabulary

**Passage Choice Relationship** = p_c_rel_inp => Shape(32 x 900)
batch_size=32; for each of the 900 passage words, the ConceptNet relation (if any) between that word and the answer, encoded as an index in the relation vocabulary



#### Common Sense Core Layers

**Step 1 - Embeddings**

Create following embeddings

p_emb = $E^{glove}_{{P}_{i}}$ <br>
q_emb = $E^{glove}_{{Q}_{i}}$ <br>
c_emb = $E^{glove}_{{A}_{i}}$ <br>
p_pos_emb = $E^{pos}_{{P}_{i}}$ <br>
q_pos_emb = $E^{pos}_{{Q}_{i}}$ <br>
p_ner_emb = $E^{ner}_{{P}_{i}}$ <br>
p_q_rel_emb = $E^{rel}_{{P}_{i},\{{Q}_{i}\}^{|Q|}_{i=1}}$ <br>
p_c_rel_emb = $E^{rel}_{{P}_{i},\{{A}_{i}\}^{|A|}_{i=1}}$ <br>

p_emb(32 x 900 x 300), q_emb(32 x 25 x 300), c_emb(32 x 35 x 300) => 300-dimensional GloVe embedding for Passage, Question, Answer <br>
p_pos_emb(32 x 900 x 12), q_pos_emb(32 x 25 x 12) => 12-dimensional part-of-speech embedding for Passage, Question <br>
p_ner_emb(32 x 900 x 8) => 8-dimensional named-entity embedding for Passage <br>
p_q_rel_emb(32 x 900 x 10), p_c_rel_emb(32 x 900 x 10) => 10-dimensional relation embedding from ConceptNet for Passage+Question and Passage+Answer <br>

**Step 2 - Sequential Attention Layer**

${Att}_{seq}(u, \{v_i\}^n_{i=1}) = {\sum_{i=1}^n} {\alpha}_i v_i$ <br>
${\alpha}_i = {softmax}_i(f(W_1u)^T f(W_1v_i))$ => where f=ReLU <br>

Question-aware passage representation => p_q_weighted_emb(32 x 900 x 300)<br>
Paying attention to the passage with respect to the question<br>
$\{w^q_{P_i}\}^{|P|}_{i=1}$ <br>
$w^q_{P_i} = {Att}_{seq}(E^{glove}_{{P}_{i}},\{E^{glove}_{{Q}_{i}}\}^{|Q|}_{i=1})$ <br>

Passage-aware answer representation => c_p_weighted_emb(32 x 35 x 300)<br>
Paying attention to the passage with respect to the answer<br>
$\{w^p_{A_i}\}^{|A|}_{i=1}$ <br>
$w^p_{A_i} = {Att}_{seq}(E^{glove}_{{A}_{i}},\{E^{glove}_{{P}_{i}}\}^{|P|}_{i=1})$ <br>

Question-aware answer representation => c_q_weighted_emb(32 x 35 x 300)<br>
Paying attention to the question with respect to the answer<br>
$\{w^q_{A_i}\}^{|A|}_{i=1}$ <br>
$w^q_{A_i} = {Att}_{seq}(E^{glove}_{{A}_{i}},\{E^{glove}_{{Q}_{i}}\}^{|Q|}_{i=1})$ <br>
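
The sequence attention ${Att}_{seq}$ above can be sketched in NumPy for a single example. This is a simplified illustration rather than the notebook's Keras layer: it has no batch dimension and no padding mask, the sizes are toy values, and `W1` is stored so that a row vector is projected with `u @ W1`.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def att_seq(u, v, W1):
    """u: (n_u, d) query words, v: (n_v, d) attended words, W1: (d, k) projection.

    Returns (n_u, d): for each row of u, a softmax-weighted sum of the rows of v.
    """
    u_proj = np.maximum(u @ W1, 0.0)             # f(W1 u) with f = ReLU
    v_proj = np.maximum(v @ W1, 0.0)             # f(W1 v_i)
    alpha = softmax(u_proj @ v_proj.T, axis=-1)  # (n_u, n_v), each row sums to 1
    return alpha @ v


rng = np.random.default_rng(0)
passage = rng.normal(size=(6, 4))   # 6 passage words, 4-dim toy embeddings
question = rng.normal(size=(3, 4))  # 3 question words
W1 = rng.normal(size=(4, 4))

p_q_weighted = att_seq(passage, question, W1)  # question-aware passage representation
print(p_q_weighted.shape)
```

The output has one row per passage word, which matches the shape pattern of p_q_weighted_emb (32 x 900 x 300) once the batch dimension is added.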


**Step 3 - Bidirectional LSTMs**

Each BiLSTM output has size 2 * hidden_size = 2 * 96 = 192 (forward and backward states concatenated)<br>
$h^q = BiLSTM(\{w_{Q_i}\}^{|Q|}_{i=1})$ => q_hiddens(32 x 25 x 192)<br>
$h^p = BiLSTM(\{ [w_{P_i};  w^q_{P_i}] \}^{|P|}_{i=1})$ => p_hiddens(32 x 900 x 192)<br>
$h^a = BiLSTM(\{ [w_{A_i};  w^p_{A_i}; w^q_{A_i}] \}^{|A|}_{i=1})$ => a_hiddens(32 x 35 x 192)<br>

**Step 4 - Self Attention Layers**

LinearSeqAttLayer - Question representation => q_merge_weights(32 x 25)<br>
$q = {Att}_{self}(\{h_i^q\}^{|Q|}_{i=1})$<br>

LinearSeqAttLayer - Answer representation => c_merge_weights(32 x 35)<br>
$a = {Att}_{self}(\{h_i^a\}^{|A|}_{i=1})$<br>

BiLinearSeqAttLayer - Passage representation => p_merge_weights(32 x 900)<br>
$p = {Att}_{seq}(q,\{h_i^p\}^{|P|}_{i=1})$<br>
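
As a rough illustration of this pooling step, the sketch below reduces a sequence of hidden states to a single vector, once with a learned scoring vector (as in LinearSeqAttLayer) and once with scores that depend on the question vector (as in BiLinearSeqAttLayer). It works on one example with toy sizes and random weights, and omits the padding mask; the function names are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def linear_self_attention(h, w):
    """h: (n, d) hidden states, w: (d,) scoring vector -> (d,) pooled vector."""
    alpha = softmax(h @ w)  # one weight per position
    return alpha @ h


def bilinear_attention(h, q, W):
    """h: (n, d) hidden states, q: (d,) query, W: (d, d) -> (d,) pooled vector."""
    alpha = softmax(h @ (q @ W))  # score_i = h_i . (q W)
    return alpha @ h


rng = np.random.default_rng(0)
h_q = rng.normal(size=(5, 4))  # toy question hidden states
h_p = rng.normal(size=(7, 4))  # toy passage hidden states

q_vec = linear_self_attention(h_q, rng.normal(size=4))
p_vec = bilinear_attention(h_p, q_vec, rng.normal(size=(4, 4)))
print(q_vec.shape, p_vec.shape)
```

With an all-zero scoring vector every position gets the same weight, so the pooled vector reduces to the plain average of the hidden states.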



#### Output Layer
Output Layer => preds(32 x 1), where the single value per example is the predicted probability (between 0 and 1) that the answer is correct<br>
$y = \sigma(p^T W_3 a + q^T W_4 a)$
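
The output formula can be checked with a few lines of NumPy. The sketch assumes that the passage, question and answer have already been pooled into single vectors of the same size; the toy dimension and random weights are for illustration only.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def answer_probability(p, q, a, W3, W4):
    """Sigmoid of the two bilinear terms p^T W3 a + q^T W4 a."""
    return sigmoid(p @ W3 @ a + q @ W4 @ a)


rng = np.random.default_rng(0)
d = 4  # toy size; the notebook's pooled vectors have 192 dimensions
p, q, a = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
W3, W4 = rng.normal(size=(d, d)), rng.normal(size=(d, d))

y = answer_probability(p, q, a, W3, W4)
print(y)
```

The result is a single number in (0, 1), which is compared with the 0/1 label through the cross-entropy loss.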




### Input Files Needed
[preprocessed.zip](https://drive.google.com/open?id=1M1saVYk-4Xh0Y0Ok6e8liDLnElnGc0P4)

[rel_vocab](https://github.com/intfloat/commonsense-rc/blob/master/data/rel_vocab)

### Code Implementation in Keras

#### Download and Unzip Input Files

In [2]:
from google.colab import files

uploaded = files.upload()

Saving preprocessed.zip to preprocessed.zip


In [3]:
uploaded = files.upload()

Saving rel_vocab to rel_vocab


In [4]:
!unzip preprocessed.zip
!pwd
!ls

Archive:  preprocessed.zip
  inflating: concept.filter          
  inflating: dev-data-processed.json  
  inflating: test-data-processed.json  
  inflating: train-data-processed.json  
  inflating: trial-data-processed.json  
  inflating: glove.840B.300d.txt     
/content
concept.filter		 glove.840B.300d.txt  test-data-processed.json
datalab			 preprocessed.zip     train-data-processed.json
dev-data-processed.json  rel_vocab	      trial-data-processed.json


#### Import modules

In [5]:
!python -c 'import keras; print(keras.__version__)'

Using TensorFlow backend.
2.1.6


In [0]:
import os
import json
import string
import unicodedata
import numpy as np

from collections import Counter

In [6]:
import keras
from keras.models import Model
from keras.layers import Bidirectional, LSTM, Concatenate, Dense, Input, Embedding, Dropout, merge, Activation, Lambda, Flatten
from keras.layers import Add, Multiply, Reshape
from keras.optimizers import Adamax, SGD
from keras.initializers import TruncatedNormal
from keras.engine.topology import Layer
from keras import backend as K

Using TensorFlow backend.


#### Data preparation


##### Load trial, train, dev, test data from files

In [0]:
class Dictionary():
    NULL = '<NULL>'
    UNK = '<UNK>'
    START = 2

    @staticmethod
    def normalize(token):
        return unicodedata.normalize('NFD', token)

    def __init__(self):
        self.tok2ind = {self.NULL: 0, self.UNK: 1}
        self.ind2tok = {0: self.NULL, 1: self.UNK}

    def __len__(self):
        return len(self.tok2ind)

    def __iter__(self):
        return iter(self.tok2ind)

    def __contains__(self, key):
        if type(key) == int:
            return key in self.ind2tok
        elif type(key) == str:
            return self.normalize(key) in self.tok2ind

    def __getitem__(self, key):
        if type(key) == int:
            return self.ind2tok.get(key, self.UNK)
        if type(key) == str:
            return self.tok2ind.get(self.normalize(key),
                                    self.tok2ind.get(self.UNK))

    def __setitem__(self, key, item):
        if type(key) == int and type(item) == str:
            self.ind2tok[key] = item
        elif type(key) == str and type(item) == int:
            self.tok2ind[key] = item
        else:
            raise RuntimeError('Invalid (key, item) types.')

    def add(self, token):
        token = self.normalize(token)
        if token not in self.tok2ind:
            index = len(self.tok2ind)
            self.tok2ind[token] = index
            self.ind2tok[index] = token

    def tokens(self):
        """Get dictionary tokens.

        Return all the words indexed by this dictionary, except for special
        tokens.
        """
        tokens = [k for k in self.tok2ind.keys()
                  if k not in {'<NULL>', '<UNK>'}]
        return tokens
    
vocab, pos_vocab, ner_vocab, rel_vocab = Dictionary(), Dictionary(), Dictionary(), Dictionary()

In [0]:
class Example:

    def __init__(self, input_dict):
        self.id = input_dict['id']
        self.passage = input_dict['d_words']
        self.question = input_dict['q_words']
        self.choice = input_dict['c_words']
        self.d_pos = input_dict['d_pos']
        self.d_ner = input_dict['d_ner']
        self.q_pos = input_dict['q_pos']
        assert len(self.q_pos) == len(self.question.split()), (self.q_pos, self.question)
        assert len(self.d_pos) == len(self.passage.split())
        self.features = np.stack([input_dict['in_q'], input_dict['in_c'], \
                                    input_dict['lemma_in_q'], input_dict['lemma_in_c'], \
                                    input_dict['tf']], 1)
        assert len(self.features) == len(self.passage.split())
        self.label = input_dict['label']    
     
        self.d_tensor = np.array([vocab[w] for w in self.passage.split()])
        self.q_tensor = np.array([vocab[w] for w in self.question.split()])
        self.c_tensor = np.array([vocab[w] for w in self.choice.split()])
        self.d_pos_tensor = np.array([pos_vocab[w] for w in self.d_pos])
        self.q_pos_tensor = np.array([pos_vocab[w] for w in self.q_pos])
        self.d_ner_tensor = np.array([ner_vocab[w] for w in self.d_ner])
        self.p_q_relation = np.array([rel_vocab[r] for r in input_dict['p_q_relation']])
        self.p_c_relation = np.array([rel_vocab[r] for r in input_dict['p_c_relation']])

    def __str__(self):
        return 'Passage: %s\n Question: %s\n Answer: %s, Label: %d' % (self.passage, self.question, self.choice, self.label)


In [0]:
def load_data(path):
    data = []
    #Each line of the file is one JSON-encoded example
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            #Files whose path contains 'race' are randomly subsampled to about 60%; all other files are loaded in full
            if path.find('race') < 0 or np.random.random() < 0.6:
                data.append(Example(json.loads(line)))
    print('Load %d examples from %s...' % (len(data), path))
    return data

##### Build vocab file
The vocab file contains all the distinct words in the entire corpus of passages, questions and answers.

In [0]:
def build_vocab(data=None):
    global vocab, pos_vocab, ner_vocab, rel_vocab
    # build word vocabulary
    if os.path.exists('./vocab'):
        print('Load vocabulary from ./vocab...')
        for w in open('./vocab', encoding='utf-8'):
            vocab.add(w.strip())
        print('Vocabulary size: %d' % len(vocab))
    else:
        cnt = Counter()
        for ex in data:
            cnt += Counter(ex.passage.split())
            cnt += Counter(ex.question.split())
            cnt += Counter(ex.choice.split())
        for key, val in cnt.most_common():
            vocab.add(key)
        print('Vocabulary size: %d' % len(vocab))
        writer = open('./vocab', 'w', encoding='utf-8')
        writer.write('\n'.join(vocab.tokens()))
        writer.close()
    # build part-of-speech vocabulary
    if os.path.exists('./pos_vocab'):
        print('Load pos vocabulary from ./pos_vocab...')
        for w in open('./pos_vocab', encoding='utf-8'):
            pos_vocab.add(w.strip())
        print('POS vocabulary size: %d' % len(pos_vocab))
    else:
        cnt = Counter()
        for ex in data:
            cnt += Counter(ex.d_pos)
            cnt += Counter(ex.q_pos)
        for key, val in cnt.most_common():
            if key: pos_vocab.add(key)
        print('POS vocabulary size: %d' % len(pos_vocab))
        writer = open('./pos_vocab', 'w', encoding='utf-8')
        writer.write('\n'.join(pos_vocab.tokens()))
        writer.close()
    # build named entity vocabulary
    if os.path.exists('./ner_vocab'):
        print('Load ner vocabulary from ./ner_vocab...')
        for w in open('./ner_vocab', encoding='utf-8'):
            ner_vocab.add(w.strip())
        print('NER vocabulary size: %d' % len(ner_vocab))
    else:
        cnt = Counter()
        for ex in data:
            cnt += Counter(ex.d_ner)
        for key, val in cnt.most_common():
            if key: ner_vocab.add(key)
        print('NER vocabulary size: %d' % len(ner_vocab))
        writer = open('./ner_vocab', 'w', encoding='utf-8')
        writer.write('\n'.join(ner_vocab.tokens()))
        writer.close()
    # Load conceptnet relation vocabulary
    assert os.path.exists('./rel_vocab')
    print('Load relation vocabulary from ./rel_vocab...')
    for w in open('./rel_vocab', encoding='utf-8'):
        rel_vocab.add(w.strip())
    print('Rel vocabulary size: %d' % len(rel_vocab))

In [11]:
trial_data = load_data('./trial-data-processed.json')
train_data = load_data('./train-data-processed.json')
dev_data = load_data('./dev-data-processed.json')
test_data = load_data('./test-data-processed.json')
build_vocab(trial_data + train_data + dev_data + test_data)
#build_vocab(trial_data)

Load 1020 examples from ./trial-data-processed.json...
Load 19462 examples from ./train-data-processed.json...
Load 2822 examples from ./dev-data-processed.json...
Load 5594 examples from ./test-data-processed.json...
Vocabulary size: 12718
POS vocabulary size: 51
NER vocabulary size: 20
Load relation vocabulary from ./rel_vocab...
Rel vocabulary size: 39


In [12]:
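#This block needs to be run a second time: Example looks words up in vocab when it is constructed,
#but on the first run vocab is still empty while load_data executes, so every word maps to <UNK>
#The code could be restructured to remove this cyclic dependency between load_data and build_vocab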
trial_data = load_data('./trial-data-processed.json')
train_data = load_data('./train-data-processed.json')
dev_data = load_data('./dev-data-processed.json')
test_data = load_data('./test-data-processed.json')
build_vocab(trial_data + train_data + dev_data + test_data)
#build_vocab(trial_data)

Load 1020 examples from ./trial-data-processed.json...
Load 19462 examples from ./train-data-processed.json...
Load 2822 examples from ./dev-data-processed.json...
Load 5594 examples from ./test-data-processed.json...
Load vocabulary from ./vocab...
Vocabulary size: 12718
Load pos vocabulary from ./pos_vocab...
POS vocabulary size: 51
Load ner vocabulary from ./ner_vocab...
NER vocabulary size: 20
Load relation vocabulary from ./rel_vocab...
Rel vocabulary size: 39


##### Sample Input Data

In [60]:
print(np.shape(trial_data))
print(len(trial_data))
print(trial_data[0])
print(trial_data[0].passage)
print(trial_data[2].c_tensor)
print(type(trial_data[2].c_tensor))
print(len(trial_data[2].c_tensor))
print(np.shape(trial_data[2].features))
print(trial_data[2].d_tensor)
print(trial_data[2].p_q_relation)

(1020,)
1020
Passage: I began by finding the culprit of my flat tire , it was a nail on the side of the road . Unfortunately I did not have a spare tire so I had to carry my bike the rest of the way home . After the gruesome trek back to my house I flipped the bicycle upside down and began my repair . First I took off the outer tire to replace my tube with a better more expensive tube that has the ability to fix a puncture . I do not want to experience carrying my bike a mile ever again . After putting the new tube on I screw back on the tire and put the outer tire over the tube . I then fill the PSI to 170 in my tire and listen for any leaking air . Then I spin the tire to make all the slime inside become evenly distributed throughout . This will prevent leaks in the future and even with a slight leak I will still be able to ride back home and fix the tire quickly yet again .
 Question: how often does their bike have problems like that ?
 Answer: every day, Label: 0
I began by finding

In [87]:
print(len(vocab.tokens()))
print('cake' in vocab)
print(vocab['cake'])
print(vocab.normalize('cake'))
#{w for w in vocab.tokens() if w in vocab}

128
False
1
cake


##### Find maximum length of para, question, choice

In [121]:
len1=[]
for i in train_data:
  len1.append(len(i.passage.split()))
print(max(len1))

len1=[]
for i in trial_data:
  len1.append(len(i.question.split()))
print(max(len1))

len1=[]
for i in test_data:
  len1.append(len(i.choice.split()))
print(max(len1))

860
19
31


In [0]:
#Choose a length that is higher than all the input data
#Fixed-size inputs are recommended for the LSTM here, rather than variable input shapes
#The missing inputs are filled with zeros
#For example, if the passage length of an input row is less than 900 words, the remaining positions are represented by zeros
MAX_PARA_LENGTH = 900
MAX_QUESTION_LENGTH = 25
MAX_CHOICE_LENGTH = 35
HANDCRAFTED_FEATURES = 5

batch_size=32

##### Load Glove Embeddings

In [0]:
def load_embeddings(words, embedding_file):
        """Load pretrained embeddings for a given list of words, if they exist.
        Args:
            words: iterable of tokens. Only those that are indexed in the
              dictionary are kept.
            embedding_file: path to text file of embeddings, space separated.
        """
        words = {w for w in words if w in vocab}
        print('Loading pre-trained embeddings for %d words from %s' %(len(words), embedding_file))
        #embedding = self.network.embedding.weight.data

        # When normalized, some words are duplicated. (Average the embeddings).
        vec_counts = {}
        embedding = {}
        with open(embedding_file) as f:
            for line in f:
                parsed = line.rstrip().split(' ')
                assert(len(parsed) == embedding_dim + 1)
                w = vocab.normalize(parsed[0])
                if w in words:
                    #vec = torch.Tensor([float(i) for i in parsed[1:]])
                    vec = np.array([float(i) for i in parsed[1:]])
                    if w not in vec_counts:
                        vec_counts[w] = 1
                        embedding[vocab[w]] = vec
                    else:
                        print('WARN: Duplicate embedding found for %s' % w)
                        vec_counts[w] = vec_counts[w] + 1
                        embedding[vocab[w]] = np.add(embedding[vocab[w]],vec)

        for w, c in vec_counts.items():
            embedding[vocab[w]] = embedding[vocab[w]]/c

        print('Loaded %d embeddings (%.2f%%)' %(len(vec_counts), 100 * len(vec_counts) / len(words)))
        return embedding

In [28]:
#There are 12716 distinct words in the vocabulary of passage+question+choice, of which 12614 have a GloVe vector
#The GloVe embedding matrix must be 12716 x 300, so the remaining 12716-12614=102 rows are zero padded
#If the zero-padded 102 x 300 block is not added, model.fit_generator gives an error
embedding_dim=300
i=0
embedding_file = "./glove.840B.300d.txt"
glove_emb_matrix = []

glove_emb_dict = load_embeddings(vocab.tokens(), embedding_file)

pad_zeros_len = len(vocab.tokens()) - len(glove_emb_dict)

embedding_zeros = np.zeros((pad_zeros_len, embedding_dim))
print(np.shape(embedding_zeros))

for k, v in glove_emb_dict.items():
    glove_emb_matrix.append(v)

print(np.shape(glove_emb_matrix))
glove_emb_matrix = np.concatenate((np.array(glove_emb_matrix),embedding_zeros),axis=0)
print(np.shape(glove_emb_matrix))

Loading pre-trained embeddings for 12716 words from ./glove.840B.300d.txt
WARN: Duplicate embedding found for ;
Loaded 12614 embeddings (99.20%)
(102, 300)
(12614, 300)
(12716, 300)
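
One thing to watch in the cell above: the rows of `glove_emb_matrix` are appended in the order in which words are met in the GloVe file, so row `i` is not guaranteed to hold the vector of the word whose vocabulary index is `i`. A possible alternative, sketched here with a toy vocabulary, is to allocate the matrix up front and write each vector to its own index, leaving zero rows for `<NULL>`, `<UNK>` and words without a pretrained vector. The names `build_embedding_matrix`, `tok2ind` and `found_vectors` are hypothetical.

```python
import numpy as np


def build_embedding_matrix(tok2ind, found_vectors, dim):
    """Return a (len(tok2ind), dim) matrix whose row i is the vector of the word with index i.

    `found_vectors` maps vocabulary index -> vector, the same layout as the
    dictionary returned by load_embeddings. Indices without a vector stay zero.
    """
    matrix = np.zeros((len(tok2ind), dim))
    for index, vec in found_vectors.items():
        matrix[index] = vec
    return matrix


tok2ind = {'<NULL>': 0, '<UNK>': 1, 'bike': 2, 'tire': 3, 'zzyzx': 4}
# Deliberately listed out of index order, as they would be met in an embedding file
found_vectors = {3: np.array([0.3, 0.3]), 2: np.array([0.2, 0.2])}

matrix = build_embedding_matrix(tok2ind, found_vectors, dim=2)
print(matrix)
```

A matrix built this way has one row per dictionary entry including the two special tokens, so the `Embedding` layer that consumes it would need the same number of rows as its input size.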


In [29]:
print(len(vocab.tokens()), np.shape(glove_emb_matrix))
print(len(pos_vocab.tokens()))

12716 (12716, 300)
49


##### Custom Data Generator

In [0]:
def _to_indices_and_mask(batch_tensor, mask_type, need_mask=True):
    #print(batch_tensor[0])
    #mx_len = max([len(t) for t in batch_tensor])
    if mask_type == 'p':
        mx_len = MAX_PARA_LENGTH
    elif mask_type == 'q':
        mx_len = MAX_QUESTION_LENGTH
    elif mask_type == 'c':
        mx_len = MAX_CHOICE_LENGTH
   
    batch_size = len(batch_tensor)
    
    indices = np.zeros((batch_size,mx_len))

    if need_mask:
        mask = np.zeros((batch_size,mx_len))
    
    for i, t in enumerate(batch_tensor):
        indices[i, :len(t)] = t
        #This creates a mask of ones if the word is present, and zeros for the remaining part of the shorter sequence
        #If a question is say 10 words long, then the mask will be [1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0... and so on upto 25]
        if need_mask:
            mask[i, :len(t)] = np.ones((len(t)))

    if need_mask:
        return indices, mask
    else:
        return indices

def _to_feature_tensor(features):
    #mx_len = max([len(f) for f in features])
    mx_len = MAX_PARA_LENGTH
    batch_size = len(features)
    #f_dim is 5 for the handcrafted features
    f_dim = len(features[0][1])
    
    f_tensor = np.zeros((batch_size, mx_len, f_dim))
    for i, f in enumerate(features):
        f_tensor[i, :len(f), :] = f
    return f_tensor

In [0]:
def batchify(batch_data):
    #print("#####{}".format(batch_data[0].d_tensor))
    #x=[ex.d_tensor for ex in batch_data]
    #print(x)
    #print(type(x))
    #p, p_mask = _to_indices_and_mask([ex.d_tensor for ex in batch_data])
    p, p_mask = _to_indices_and_mask([ex.d_tensor for ex in batch_data],mask_type='p')
    p_pos = _to_indices_and_mask([ex.d_pos_tensor for ex in batch_data], need_mask=False, mask_type='p')
    p_ner = _to_indices_and_mask([ex.d_ner_tensor for ex in batch_data], need_mask=False, mask_type='p')
    p_q_relation = _to_indices_and_mask([ex.p_q_relation for ex in batch_data], need_mask=False, mask_type='p')
    p_c_relation = _to_indices_and_mask([ex.p_c_relation for ex in batch_data], need_mask=False, mask_type='p')
    q, q_mask = _to_indices_and_mask([ex.q_tensor for ex in batch_data], mask_type='q')
    q_pos = _to_indices_and_mask([ex.q_pos_tensor for ex in batch_data], need_mask=False, mask_type='q')
    choices = [ex.choice.split() for ex in batch_data]
    c, c_mask = _to_indices_and_mask([ex.c_tensor for ex in batch_data], mask_type='c')
    f_tensor = _to_feature_tensor([ex.features for ex in batch_data])
    y = [ex.label for ex in batch_data]
    #It is necessary for custom generator to have format as below
    #Multiple inputs should be a list of numpy arrays, inside a tuple for eg. ([list of numpy array inputs],numpy array output label)
    return [np.array(p), np.array(p_mask), np.array(p_pos), np.array(p_ner), np.array(q), np.array(q_mask), \
            np.array(q_pos), np.array(c), np.array(c_mask), np.array(f_tensor), np.array(p_q_relation), \
            np.array(p_c_relation)],np.array(y) 
 

In [0]:
def _iter_data(data):
    #Below while true is needed for custom data generator when called by model.fit_generator
    #Else model.fit_generator will throw StopIteration Error
    #When testing using model.fit, while true can be commented out
    while True:
        #print(type(data[0]))
        #print(data[0])
        num_iter = (len(data) + batch_size - 1) // batch_size
        #print(len(data))
        for i in range(num_iter):
            start_idx = i * batch_size
            batch_data = data[start_idx:(start_idx + batch_size)]
            #print(i)
            batch_input = batchify(batch_data)
            
            #Use yield when using model.fit_generator, 
            #Use return statement for manual testing/model.fit
            #Ensure yield is within the for loop
            yield batch_input
        #return batch_input

In [22]:
def data_generator(data):
    #_iter_data already yields (list of 12 input arrays, label array) batches and never stops,
    #so each batch is passed straight through in the form expected by model.fit_generator
    for feed_input, y in _iter_data(data):
        yield feed_input, y

In [70]:
#For manual testing, pull a single batch from the generator with next()
feed_input, y = next(_iter_data(trial_data))
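
Keras' `fit_generator` expects a generator that loops over the data indefinitely and relies on `steps_per_epoch` to know when an epoch ends. The pattern can be sketched without Keras; the helper names `iter_batches` and `steps_per_epoch` are hypothetical and the data is a toy list.

```python
def iter_batches(data, batch_size):
    """Endlessly yield consecutive slices of `data`, restarting after each full pass."""
    while True:
        for start in range(0, len(data), batch_size):
            yield data[start:start + batch_size]


def steps_per_epoch(n_examples, batch_size):
    """Number of batches needed to see every example once (last batch may be smaller)."""
    return (n_examples + batch_size - 1) // batch_size


gen = iter_batches(list(range(5)), batch_size=2)
first_pass = [next(gen) for _ in range(steps_per_epoch(5, 2))]
print(first_pass)  # [[0, 1], [2, 3], [4]]
print(next(gen))   # [0, 1], the generator has wrapped around
```

Because the generator never raises `StopIteration`, a single batch for manual inspection is obtained with `next()` rather than by unpacking the generator.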

#### Model


##### Attention Layers
Build custom Keras layer

In [0]:
class SeqAttLayer(Layer):

    def __init__(self, output_dim, **kwargs):
        self.output_dim = output_dim
        super(SeqAttLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        # Create a trainable weight variable for this layer.
        self.kernel = self.add_weight(name='kernel', \
                                      shape=(self.output_dim,input_shape[0][1]),\
                                      initializer='uniform',\
                                      trainable=True)
        super(SeqAttLayer, self).build(input_shape)  # Be sure to call this at the end

    def call(self, x, mask=None):
        x_proj = K.relu(K.dot(x[0],self.kernel))
        y_proj = K.relu(K.dot(x[1],self.kernel))
        
        #For every word in the passage, compute a score against every word in the question
        #For one example, the passage is 900 words x 300 dim and the associated question is 25 words x 300 dim
        #Generate 900 x 25, i.e. match each word in the passage with each word in the question
        # passage(32 x 900 x 300) * question(32 x 25 x 300) 
        # (32 x 900 x 300) matmul (32 x 300 x 25) = (32 x 900 x 25) 
        # where batch size=32, passage=900 words each of 300 vector size, question=25 words each of 300 vector size
        scores = K.tf.matmul(x_proj,K.permute_dimensions(y_proj,(0,2,1)))
        
        #Apply the question mask to every word in the passage
        #Creates 32 x 900 x 25 mask for the 900 words in the passage
        y_mask = x[2]
        y_mask = K.repeat(y_mask, K.int_shape(scores)[1])
        #Push the scores of padded question positions (mask=0) to a large negative value,
        #so that the softmax below gives them (almost) zero weight; multiplying by the mask
        #would only set them to 0, which still receives weight exp(0)=1 before normalisation
        scores = scores + (1.0 - y_mask) * -1e30

        # Normalize with softmax
        alpha_flat = K.softmax(K.reshape(scores,(-1,K.int_shape(scores)[2])))
        alpha = K.reshape(alpha_flat,(-1, K.int_shape(scores)[1],K.int_shape(scores)[2]))
        
        # Take weighted average
        # (32 x 900 x 25) * (32 x 25 x 300) = (32 x 900 x 300)        
        matched_seq = K.tf.matmul(alpha, x[1])
        
        return matched_seq

    def compute_output_shape(self, input_shape):
        return (input_shape[0])  
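
A note on masking attention scores before a softmax: setting a padded score to 0 is not the same as excluding it, because the softmax still assigns `exp(0)` to that position. Adding a large negative number to padded positions drives their weight to zero instead. The NumPy sketch below compares the two on made-up scores; the 0/1 mask follows this notebook's convention (1 = real token, 0 = padding).

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


scores = np.array([2.0, 1.0, 0.5, 0.3])
mask = np.array([1.0, 1.0, 0.0, 0.0])  # last two positions are padding

zeroed = softmax(scores * mask)                      # padded scores become 0
suppressed = softmax(scores + (1.0 - mask) * -1e30)  # padded scores become very negative

print(zeroed)      # padding still receives some attention weight
print(suppressed)  # padding receives zero weight
```

With the second variant the weights of the real tokens sum to 1, so the weighted average is taken over real tokens only.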

In [0]:
class LinearSeqAttLayer(Layer):

    def __init__(self, output_dim, **kwargs):
        self.output_dim = output_dim
        super(LinearSeqAttLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        # Create a trainable weight variable for this layer.
        self.kernel = self.add_weight(name='kernel', 
                                      shape=(self.output_dim,1),
                                      initializer='uniform',
                                      trainable=True)
        super(LinearSeqAttLayer, self).build(input_shape)  # Be sure to call this at the end

    def call(self, x, mask=None):
        #(32 x 35 x 192) => (1120 x 192)
        x_flat = K.reshape(x[0],(-1,K.int_shape(x[0])[2]))

        #(1120 x 192)  => (1120 x 1) => (32 x 35)
        scores = K.reshape(K.dot(x_flat,self.kernel),(-1,K.int_shape(x[0])[1]))

        y_mask = x[1]
        #Give padded positions (mask=0) a large negative score so that the softmax ignores them
        scores = scores + (1.0 - y_mask) * -1e30
        alpha = K.softmax(scores)
        #(32 x 1 x 35) matmul (32 x 35 x 192) = (32 x 1 x 192) => (32 x 192)
        alpha_hidden = K.squeeze(K.tf.matmul(K.expand_dims(alpha, axis=1),x[0]),axis=1)
        #print(alpha_hidden)
        
        #return alpha
        return alpha_hidden

    def compute_output_shape(self, input_shape):
        return (input_shape[0][0], self.output_dim)   

In [0]:
class BiLinearSeqAttLayer(Layer):

    def __init__(self, output_dim, normalize=True, **kwargs):
        self.output_dim = output_dim
        self.normalize = normalize
        super(BiLinearSeqAttLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        # Create a trainable weight variable for this layer.
        self.kernel = self.add_weight(name='kernel', 
                                      shape=(self.output_dim,self.output_dim),
                                      initializer='uniform',
                                      trainable=True)
        super(BiLinearSeqAttLayer, self).build(input_shape)  # Be sure to call this at the end

    def call(self, x, mask=None):
        
        Wy = K.dot(x[1],self.kernel)
        
        #(32 x 900 x 192) matmul (32 x 192 x 1) => (32 x 900 x 1) => (32 x 900)
        xWy = K.squeeze(K.tf.matmul(x[0],K.expand_dims(Wy,axis=2)), axis=2)

        y_mask = x[2]
        xWy = K.tf.multiply(xWy,y_mask)
        
        if self.normalize:
            alpha = K.softmax(xWy)
        else:
            alpha = K.exp(xWy)
         
        #(32 x 1 x 900) matmul (32 x 900 x 192) => (32 x 1 x 192) => (32 x 192)
        alpha_hidden = K.squeeze(K.tf.matmul(K.expand_dims(alpha, axis=1),x[0]),axis=1)

        #return alpha
        return alpha_hidden

    def compute_output_shape(self, input_shape):
        return (input_shape[0][0], self.output_dim)       
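
`BiLinearSeqAttLayer` scores every position `x_i` of a sequence against a single query vector `y` through a learned square matrix, roughly `score_i = x_i · (y W)`, and then pools the sequence with the resulting weights. The NumPy sketch below illustrates that computation for the normalised (softmax) case only. The function name `bilinear_seq_attention` is hypothetical, and the mask is assumed to be 1 for real tokens and 0 for padding.

```python
import numpy as np

def bilinear_seq_attention(x, y, kernel, mask):
    """x: (batch, seq_len, dim), y: (batch, dim), kernel: (dim, dim), mask: (batch, seq_len)."""
    # y W: (batch, dim) @ (dim, dim) -> (batch, dim)
    wy = y @ kernel
    # x_i . (y W) for every position i -> (batch, seq_len)
    scores = np.einsum("bsd,bd->bs", x, wy)
    # Assumed convention: mask == 0 marks padding, which must receive zero weight.
    scores = np.where(mask > 0, scores, -1e30)
    # Numerically stable softmax over the sequence axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    alpha = weights / weights.sum(axis=1, keepdims=True)
    # Attention-weighted sum of the sequence -> (batch, dim)
    pooled = np.einsum("bs,bsd->bd", alpha, x)
    return alpha, pooled

rng = np.random.default_rng(1)
p_hiddens = rng.normal(size=(2, 6, 4))   # stand-in for passage hidden states
q_hidden = rng.normal(size=(2, 4))       # stand-in for the pooled question vector
p_mask = np.ones((2, 6))
alpha, p_hidden = bilinear_seq_attention(p_hiddens, q_hidden, rng.normal(size=(4, 4)), p_mask)
print(alpha.shape, p_hidden.shape)
```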

##### Common Sense Core Layers and Model

In [37]:
embedding_dim=300
pos_emb_dim=12
ner_emb_dim=8
rel_emb_dim=10
dropout_rate = 0.4
hidden_size=96
        
output_dim = 1
p_inp = Input(shape=(MAX_PARA_LENGTH,))
p_mask_inp = Input(shape=(MAX_PARA_LENGTH,))
p_pos_inp = Input(shape=(MAX_PARA_LENGTH,))
p_ner_inp = Input(shape=(MAX_PARA_LENGTH,))
q_inp = Input(shape=(MAX_QUESTION_LENGTH,))
q_mask_inp = Input(shape=(MAX_QUESTION_LENGTH,))
q_pos_inp = Input(shape=(MAX_QUESTION_LENGTH,))
c_inp = Input(shape=(MAX_CHOICE_LENGTH,))
c_mask_inp = Input(shape=(MAX_CHOICE_LENGTH,))
f_tensor_inp = Input(shape=(MAX_PARA_LENGTH,5))
p_q_rel_inp = Input(shape=(MAX_PARA_LENGTH,))
p_c_rel_inp = Input(shape=(MAX_PARA_LENGTH,))

#glov_emb = Embedding(len(vocab.tokens()), embedding_dim, embeddings_initializer=TruncatedNormal(mean=0.0, stddev=0.1))
#Loading the Glove embeddings, ensure trainable=False
glov_emb = Embedding(len(vocab.tokens()), embedding_dim, weights=[glove_emb_matrix], trainable=False)
pos_emb = Embedding(len(pos_vocab.tokens()), pos_emb_dim, embeddings_initializer=TruncatedNormal(mean=0.0, stddev=0.1))
ner_emb = Embedding(len(ner_vocab.tokens()), ner_emb_dim, embeddings_initializer=TruncatedNormal(mean=0.0, stddev=0.1))
rel_emb = Embedding(len(rel_vocab.tokens()), rel_emb_dim, embeddings_initializer=TruncatedNormal(mean=0.0, stddev=0.1))
       
#p_emb:(32 x 900 x 300), q_emb:(32 x 25 x 300), c_emb:(32 x 35 x 300), q_mask_inp:(32 x 25)  
p_emb, q_emb, c_emb = glov_emb(p_inp), glov_emb(q_inp), glov_emb(c_inp)
#p_pos_emb:(32 x 900 x 12), p_ner_emb:(32 x 900 x 8), q_pos_emb:(32 x 25 x 12)
p_pos_emb, p_ner_emb, q_pos_emb = pos_emb(p_pos_inp), ner_emb(p_ner_inp), pos_emb(q_pos_inp)
#p_q_rel_emb:(32 x 900 x 10), p_c_rel_emb:(32 x 900 x 10)
p_q_rel_emb, p_c_rel_emb = rel_emb(p_q_rel_inp), rel_emb(p_c_rel_inp)


p_emb = Dropout(dropout_rate)(p_emb)
q_emb = Dropout(dropout_rate)(q_emb)
c_emb = Dropout(dropout_rate)(c_emb)
p_pos_emb = Dropout(dropout_rate)(p_pos_emb)
p_ner_emb = Dropout(dropout_rate)(p_ner_emb)
q_pos_emb = Dropout(dropout_rate)(q_pos_emb)
p_q_rel_emb = Dropout(dropout_rate)(p_q_rel_emb)
p_c_rel_emb = Dropout(dropout_rate)(p_c_rel_emb)

#(32 x 900 x 300)
p_q_weighted_emb = SeqAttLayer(embedding_dim)([p_emb,q_emb,q_mask_inp])
#(32 x 35 x 300)
c_q_weighted_emb = SeqAttLayer(embedding_dim)([c_emb,q_emb,q_mask_inp])
#(32 x 35 x 300)
c_p_weighted_emb = SeqAttLayer(embedding_dim)([c_emb,p_emb,p_mask_inp])

p_q_weighted_emb = Dropout(dropout_rate)(p_q_weighted_emb)
c_q_weighted_emb = Dropout(dropout_rate)(c_q_weighted_emb)
c_p_weighted_emb = Dropout(dropout_rate)(c_p_weighted_emb)

#(32x900x300)+(32x900x300)+(32x900x12)+(32x900x8)+(32x900x5)+(32x900x10)+(32x900x10) = (32x900x645)
p_rnn_input = Concatenate(axis=-1)([p_emb, p_q_weighted_emb, p_pos_emb, p_ner_emb, f_tensor_inp, p_q_rel_emb, p_c_rel_emb])
c_rnn_input = Concatenate(axis=-1)([c_emb, c_q_weighted_emb, c_p_weighted_emb])
q_rnn_input = Concatenate(axis=-1)([q_emb, q_pos_emb])

#(32 x 900 x 192)
p_hiddens = Bidirectional(LSTM(hidden_size, return_sequences=True))([p_rnn_input])
#(32 x 35 x 192)
c_hiddens = Bidirectional(LSTM(hidden_size, return_sequences=True))([c_rnn_input])
#(32 x 25 x 192)
q_hiddens = Bidirectional(LSTM(hidden_size, return_sequences=True))([q_rnn_input])

#c_merge_weights = LinearSeqAttLayer(2 * hidden_size)([c_hiddens, c_mask_inp])
#c_hidden = K.squeeze(K.tf.matmul(K.expand_dims(c_merge_weights, axis=1),c_hiddens),axis=1)
#Input = (32 x 35 x 192) and (32 x 35)
#Output = (32 x 192)
c_hidden = LinearSeqAttLayer(2 * hidden_size)([c_hiddens, c_mask_inp])

#q_merge_weights = LinearSeqAttLayer(2 * hidden_size)([q_hiddens, q_mask_inp])
#q_hidden = K.squeeze(K.tf.matmul(K.expand_dims(q_merge_weights, axis=1),q_hiddens),axis=1)
#Input = (32 x 25 x 192) and (32 x 25)
#Output = (32 x 192)
q_hidden = LinearSeqAttLayer(2 * hidden_size)([q_hiddens, q_mask_inp])

#p_merge_weights = BiLinearSeqAttLayer(2 * hidden_size)([p_hiddens, q_hidden, p_mask_inp])
#p_hidden = K.squeeze(K.tf.matmul(K.expand_dims(p_merge_weights, axis=1),p_hiddens),axis=1)
#Input = (32 x 900 x 192), (32 x 192), (32 x 900)
#Output = (32 x 192)
p_hidden = BiLinearSeqAttLayer(2 * hidden_size)([p_hiddens, q_hidden, p_mask_inp])

#(32 x 192)
logits1_mul = Multiply()([Dense(2 * hidden_size)(p_hidden), c_hidden])
#(32 x 1)
logits1 = Reshape((1,))(Lambda(lambda x: K.sum(x,axis=1))(logits1_mul))

#(32 x 192)
logits2_mul = Multiply()([Dense(2 * hidden_size)(q_hidden), c_hidden])
#(32 x 1)
logits2 = Reshape((1,))(Lambda(lambda x: K.sum(x,axis=1))(logits2_mul))

#(32 x 1)
preds = Activation('sigmoid')(Add()([logits1,logits2]))

print(logits1, logits2, preds)
model = Model(inputs=[p_inp,p_mask_inp,p_pos_inp,p_ner_inp,q_inp,q_mask_inp,q_pos_inp,c_inp,c_mask_inp,f_tensor_inp,p_q_rel_inp,p_c_rel_inp], outputs=preds)

Tensor("reshape_1/Reshape:0", shape=(?, 1), dtype=float32) Tensor("reshape_2/Reshape:0", shape=(?, 1), dtype=float32) Tensor("activation_1/Sigmoid:0", shape=(?, 1), dtype=float32)
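
The output layer is meant to score an answer as the sigmoid of two bilinear terms, one between the passage and the answer and one between the question and the answer. In the Keras code each term is written as a `Dense` projection followed by an elementwise product and a sum. The NumPy sketch below shows the same arithmetic under simplifying assumptions: the names `answer_probability`, `w_pc` and `w_qc` are hypothetical, and the bias that `Dense` adds by default is omitted.

```python
import numpy as np

def answer_probability(p, q, c, w_pc, w_qc):
    """p, q, c: (batch, dim) pooled passage, question and answer vectors."""
    # Passage-answer term: sum_j (p W_pc)_j * c_j, i.e. p^T W_pc c per example.
    logits_pc = np.sum((p @ w_pc) * c, axis=1)
    # Question-answer term: q^T W_qc c per example.
    logits_qc = np.sum((q @ w_qc) * c, axis=1)
    # Sigmoid of the summed logits gives the probability that the answer is correct.
    return 1.0 / (1.0 + np.exp(-(logits_pc + logits_qc)))

rng = np.random.default_rng(0)
batch, dim = 3, 4
p, q, c = (rng.normal(size=(batch, dim)) for _ in range(3))
w_pc = rng.normal(size=(dim, dim))
w_qc = rng.normal(size=(dim, dim))
probs = answer_probability(p, q, c, w_pc, w_qc)
print(probs.shape)
```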


##### Compile Model

In [38]:
model.summary()
adamax = Adamax(lr=2e-3)
model.compile(loss='binary_crossentropy',optimizer=adamax,metrics=['acc'])

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_25 (InputLayer)           (None, 900)          0                                            
__________________________________________________________________________________________________
input_29 (InputLayer)           (None, 25)           0                                            
__________________________________________________________________________________________________
embedding_9 (Embedding)         multiple             3814800     input_25[0][0]                   
                                                                 input_29[0][0]                   
                                                                 input_32[0][0]                   
__________________________________________________________________________________________________
input_27 (

##### Train Model

In [0]:
#model.evaluate(feed_input, y)
#model.fit(feed_input, y, batch_size=batch_size, epochs=2) 

#19462/32=608, so chose 600 steps
model.fit_generator(generator=_iter_data(train_data), steps_per_epoch=600, epochs=10, verbose=1, callbacks=None,
                    validation_data=_iter_data(dev_data), validation_steps=600)



Epoch 1/10
Epoch 2/10

Epoch 3/10

Epoch 4/10

Epoch 5/10
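
The `steps_per_epoch` arithmetic from the comment in the training cell can be checked with a small helper. This is only a sketch: the name `steps_per_epoch` is hypothetical, while the 19462 examples and the batch size of 32 are taken from that comment. Rounding up covers every example once per epoch, assuming the generator yields a final partial batch.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once, counting a final partial batch."""
    return math.ceil(num_examples / batch_size)

full_steps = steps_per_epoch(19462, 32)  # 608 full batches plus one batch of 6 examples
leftover = 19462 - 600 * 32              # examples not covered by 600 batches of 32
print(full_steps, leftover)              # prints: 609 262
```

With 600 steps, 262 examples fall outside each nominal epoch; whether they are skipped or roll over into the next epoch depends on how `_iter_data` iterates.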

##### Prediction

In [0]:
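# predict_generator needs an explicit `steps` when given a plain Python generator;
# steps=1 assumes the 10 examples arrive in a single batch from _iter_data.
model.predict_generator(generator=_iter_data(test_data[:10]), steps=1)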

### Pending Action Items
* The Python generator should be replaced with a `keras.utils.Sequence` subclass, which guarantees ordering and that every input is used exactly once per epoch, so data is not duplicated when `use_multiprocessing=True`.
* The model was not first trained on the RACE dataset for an initial 10 epochs.
* CUDA was not used.
* Gradient clipping was not used.
* The learning rate was not halved after 10 epochs, and the full 50 epochs were not run.
* Fine-tuning of the top-k embeddings during training was not done.
* RNN padding and masking were not done.
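
Two of the items above, learning-rate halving and gradient clipping, are small changes in Keras. The sketch below covers only the schedule, written as a plain Python function so it runs without Keras. The name `make_halving_schedule` and the default milestone are illustrative assumptions based on the list above. A function of this shape can be passed to `keras.callbacks.LearningRateScheduler`, and Keras optimizers accept a `clipnorm` argument for gradient clipping.

```python
def make_halving_schedule(base_lr=2e-3, milestones=(10,)):
    """Return a schedule that halves base_lr once for every milestone epoch reached."""
    def schedule(epoch, current_lr=None):
        # `epoch` is the 0-based epoch index; `current_lr` is accepted but ignored,
        # so the rate depends only on how many milestones have been passed.
        halvings = sum(1 for m in milestones if epoch >= m)
        return base_lr * (0.5 ** halvings)
    return schedule

schedule = make_halving_schedule()
rates = [schedule(epoch) for epoch in (0, 9, 10, 20)]
print(rates)
```

The first ten epochs (indices 0 to 9) run at the base rate of 2e-3, and every later epoch runs at half of it.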

### References

[CS231n Winter 2016: Lecture 10: Recurrent Neural Networks, Image Captioning, LSTM](https://www.youtube.com/watch?v=yCC09vCHzF8)

[Evolution: from vanilla RNN to GRU & LSTMs (How it works)](https://www.youtube.com/watch?v=lycKqccytfU)

[Keras LSTM tutorial – How to easily build a powerful deep learning language model](http://adventuresinmachinelearning.com/keras-lstm-tutorial/)

[How to Develop a Bidirectional LSTM For Sequence Classification in Python with Keras](https://machinelearningmastery.com/develop-bidirectional-lstm-sequence-classification-python-keras/)

[When should one use bidirectional LSTM as opposed to normal LSTM?](https://www.quora.com/When-should-one-use-bidirectional-LSTM-as-opposed-to-normal-LSTM)

[What is exactly the attention mechanism introduced to RNN (recurrent neural network)? It would be nice if you could make it easy to understand!](https://www.quora.com/What-is-exactly-the-attention-mechanism-introduced-to-RNN-recurrent-neural-network-It-would-be-nice-if-you-could-make-it-easy-to-understand)

[A Brief Overview of Attention Mechanism](https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129)

[Keras Attention Mechanism](https://github.com/philipperemy/keras-attention-mechanism)

[How to add Attention on top of a Recurrent Layer (Text Classification)](https://github.com/keras-team/keras/issues/4962)

[How Does Attention Work in Encoder-Decoder Recurrent Neural Networks](https://machinelearningmastery.com/how-does-attention-work-in-encoder-decoder-recurrent-neural-networks/)

[Attention in Long Short-Term Memory Recurrent Neural Networks](https://machinelearningmastery.com/attention-long-short-term-memory-recurrent-neural-networks/)

[Pytorch Documentation](https://pytorch.org/docs/master/nn.html)

[Writing your own Keras Layer - Keras Documentation](https://keras.io/layers/writing-your-own-keras-layers/)

[Keras Backend functions - Keras Documentation](https://keras.io/backend/)

[Text Classification, Part 2 - sentence level Attentional RNN](https://richliao.github.io/supervised/classification/2016/12/26/textclassifier-RNN/)

[Tensorflow Documentation](https://www.tensorflow.org/api_docs/python/tf)

[Can Keras deal with input images with different size?](https://stackoverflow.com/questions/39814777/can-keras-deal-with-input-images-with-different-size/41092113)

[Pre-trained Word Embedding Using Keras - Github Code](https://github.com/keras-team/keras/blob/master/examples/pretrained_word_embeddings.py)