## BiDAF

This notebook implements one of the most influential papers in the NLP literature: [BiDAF](https://arxiv.org/abs/1611.01603), or Bidirectional Attention Flow for Machine Comprehension. The key issue the paper addresses is *early summarization* in earlier approaches that use attention mechanisms. Until then, attention was typically used to compress the context and the query into a single fixed-size vector, which, according to the authors, loses information. Moreover, attention was usually computed in only one direction. To address these issues, the authors propose a *hierarchical, multi-stage network*.   
> *Our attention layer is not used to summarize the context paragraph into a fixed-size vector. Instead, the attention is computed for every time step, and the attended vector at each time step, along with the representations from previous layers, is allowed to flow through to the subsequent modeling layer.*

Let's get into the intricacies of the model.
The flow of this notebook will be similar to the previous one.


In [1]:
!python3 -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.1.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz (11.1 MB)
  Preparing metadata (setup.py) ... done
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [2]:
from torch import nn
import torch
import numpy as np
import pandas as pd
import pickle, time
import re, os, string, typing, gc, json
import torch.nn.functional as F
import spacy
from sklearn.model_selection import train_test_split
from collections import Counter
nlp = spacy.load("en_core_web_sm")
from preprocess import *
%load_ext autoreload
%autoreload 2

## Data Preprocessing

In [3]:
# load dataset json files
TRAIN_SQUAD_FILE = './data/Squad/train-v1.1.json'
VALID_SQUAD_FILE = './data/Squad/dev-v1.1.json'


train_data = load_json(TRAIN_SQUAD_FILE)
valid_data = load_json(VALID_SQUAD_FILE)

# parse the json structure to return the data as a list of dictionaries

train_list = parse_data(train_data)
valid_list = parse_data(valid_data)
print('--------------------------')



print('Train list len: ',len(train_list))
print('Valid list len: ',len(valid_list))

# converting the lists into dataframes

train_df = pd.DataFrame(train_list)
valid_df = pd.DataFrame(valid_list)

Length of data:  442
Data Keys:  dict_keys(['title', 'paragraphs'])
Title:  University_of_Notre_Dame
Length of data:  48
Data Keys:  dict_keys(['title', 'paragraphs'])
Title:  Super_Bowl_50
--------------------------
Train list len:  87599
Valid list len:  34726


In [4]:
train_df.head()

Unnamed: 0,id,context,question,label,answer
0,5733be284776f41900661182,"Architecturally, the school has a Catholic cha...",To whom did the Virgin Mary allegedly appear i...,"[515, 541]",Saint Bernadette Soubirous
1,5733be284776f4190066117f,"Architecturally, the school has a Catholic cha...",What is in front of the Notre Dame Main Building?,"[188, 213]",a copper statue of Christ
2,5733be284776f41900661180,"Architecturally, the school has a Catholic cha...",The Basilica of the Sacred heart at Notre Dame...,"[279, 296]",the Main Building
3,5733be284776f41900661181,"Architecturally, the school has a Catholic cha...",What is the Grotto at Notre Dame?,"[381, 420]",a Marian place of prayer and reflection
4,5733be284776f4190066117e,"Architecturally, the school has a Catholic cha...",What sits on top of the Main Building at Notre...,"[92, 126]",a golden statue of the Virgin Mary


In [5]:
def preprocess_df(df):
    
    def to_lower(text):
        return text.lower()

    df.context = df.context.apply(to_lower)
    df.question = df.question.apply(to_lower)
    df.answer = df.answer.apply(to_lower)

In [6]:
preprocess_df(train_df)
preprocess_df(valid_df)

In [7]:
# gather text to build vocabularies

%time vocab_text = gather_text_for_vocab([train_df, valid_df])
print("Number of sentences in dataset: ", len(vocab_text))

CPU times: user 362 ms, sys: 37.4 ms, total: 399 ms
Wall time: 398 ms
Number of sentences in dataset:  118822


In [8]:
# build word and character-level vocabularies

%time word2idx, idx2word, word_vocab = build_word_vocab(vocab_text)
print("----------------------------------")
%time char2idx, char_vocab = build_char_vocab(vocab_text)

raw-vocab: 97475
vocab-length: 97477
word2idx-length: 97477
CPU times: user 40.4 s, sys: 213 ms, total: 40.6 s
Wall time: 40.6 s
----------------------------------
raw-char-vocab: 1316
char-vocab-intersect: 202
char2idx-length: 204
CPU times: user 1.81 s, sys: 130 ms, total: 1.94 s
Wall time: 1.94 s


In [9]:
# numericalize context and questions for training and validation set

%time train_df['context_ids'] = train_df.context.apply(context_to_ids, word2idx=word2idx)
%time valid_df['context_ids'] = valid_df.context.apply(context_to_ids, word2idx=word2idx)
%time train_df['question_ids'] = train_df.question.apply(question_to_ids, word2idx=word2idx)
%time valid_df['question_ids'] = valid_df.question.apply(question_to_ids, word2idx=word2idx)

CPU times: user 1min 24s, sys: 291 ms, total: 1min 25s
Wall time: 1min 25s
CPU times: user 35.8 s, sys: 28.8 ms, total: 35.8 s
Wall time: 35.8 s
CPU times: user 16.2 s, sys: 0 ns, total: 16.2 s
Wall time: 16.2 s
CPU times: user 6.35 s, sys: 0 ns, total: 6.35 s
Wall time: 6.35 s


In [10]:
# get indices with tokenization errors and drop those indices 

train_err = get_error_indices(train_df, idx2word)
valid_err = get_error_indices(valid_df, idx2word)

train_df.drop(train_err, inplace=True)
valid_df.drop(valid_err, inplace=True)

Number of error indices: 1004
Number of error indices: 433


In [11]:
# get start and end positions of answers from the context
# this is basically the label for training QA models

train_label_idx = train_df.apply(index_answer, axis=1, idx2word=idx2word)
valid_label_idx = valid_df.apply(index_answer, axis=1, idx2word=idx2word)

train_df['label_idx'] = train_label_idx
valid_df['label_idx'] = valid_label_idx

In [12]:
# dump to pickle files

train_df.to_pickle('./pickle_files/bidaftrain.pkl')
valid_df.to_pickle('./pickle_files/bidafvalid.pkl')

with open('./pickle_files/bidafw2id.pickle','wb') as handle:
    pickle.dump(word2idx, handle)

with open('./pickle_files/bidafc2id.pickle','wb') as handle:
    pickle.dump(char2idx, handle)

In [13]:
# load data from pickle files


train_df = pd.read_pickle('./pickle_files/bidaftrain.pkl')
valid_df = pd.read_pickle('./pickle_files/bidafvalid.pkl')

with open('./pickle_files/bidafw2id.pickle','rb') as handle:
    word2idx = pickle.load(handle)
with open('./pickle_files/bidafc2id.pickle','rb') as handle:
    char2idx = pickle.load(handle)

idx2word = {v:k for k,v in word2idx.items()}

In [14]:
train_df

Unnamed: 0,id,context,question,label,answer,context_ids,question_ids,label_idx
0,5733be284776f41900661182,"architecturally, the school has a catholic cha...",to whom did the virgin mary allegedly appear i...,"[515, 541]",saint bernadette soubirous,"[16248, 3, 2, 133, 40, 10, 544, 801, 5, 8708, ...","[9, 569, 25, 2, 2676, 849, 5986, 1082, 6, 8044...","[102, 104]"
1,5733be284776f4190066117f,"architecturally, the school has a catholic cha...",what is in front of the notre dame main building?,"[188, 213]",a copper statue of christ,"[16248, 3, 2, 133, 40, 10, 544, 801, 5, 8708, ...","[11, 12, 6, 1201, 4, 2, 1258, 1195, 236, 300, 7]","[37, 41]"
2,5733be284776f41900661180,"architecturally, the school has a catholic cha...",the basilica of the sacred heart at notre dame...,"[279, 296]",the main building,"[16248, 3, 2, 133, 40, 10, 544, 801, 5, 8708, ...","[2, 4540, 4, 2, 3909, 1498, 31, 1258, 1195, 12...","[57, 59]"
3,5733be284776f41900661181,"architecturally, the school has a catholic cha...",what is the grotto at notre dame?,"[381, 420]",a marian place of prayer and reflection,"[16248, 3, 2, 133, 40, 10, 544, 801, 5, 8708, ...","[11, 12, 2, 19572, 31, 1258, 1195, 7]","[76, 82]"
4,5733be284776f4190066117e,"architecturally, the school has a catholic cha...",what sits on top of the main building at notre...,"[92, 126]",a golden statue of the virgin mary,"[16248, 3, 2, 133, 40, 10, 544, 801, 5, 8708, ...","[11, 8826, 24, 402, 4, 2, 236, 300, 31, 1258, ...","[17, 23]"
...,...,...,...,...,...,...,...,...
87594,5735d259012e2f140011a09d,"kathmandu metropolitan city (kmc), in order to...",in what us state did kathmandu first establish...,"[229, 235]",oregon,"[1729, 1145, 53, 22, 25335, 21, 3, 6, 238, 9, ...","[6, 11, 201, 79, 25, 1729, 44, 1387, 34, 196, ...","[38, 38]"
87595,5735d259012e2f140011a09e,"kathmandu metropolitan city (kmc), in order to...",what was yangon previously known as?,"[414, 421]",rangoon,"[1729, 1145, 53, 22, 25335, 21, 3, 6, 238, 9, ...","[11, 13, 18118, 1035, 92, 15, 7]","[71, 71]"
87596,5735d259012e2f140011a09f,"kathmandu metropolitan city (kmc), in order to...",with what belorussian city does kathmandu have...,"[476, 481]",minsk,"[1729, 1145, 53, 22, 25335, 21, 3, 6, 238, 9, ...","[23, 11, 49291, 53, 57, 1729, 39, 10, 806, 7]","[85, 85]"
87597,5735d259012e2f140011a0a0,"kathmandu metropolitan city (kmc), in order to...",in what year did kathmandu create its initial ...,"[199, 203]",1975,"[1729, 1145, 53, 22, 25335, 21, 3, 6, 238, 9, ...","[6, 11, 58, 25, 1729, 711, 46, 1622, 196, 806, 7]","[31, 31]"


## Dataloader/Dataset
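
The dataset class below pads every batch to the length of its longest example. That step can be sketched in isolation as follows. The helper name `pad_batch` is hypothetical, and the padding index of 1 is an assumption taken from the `fill_(1)` calls in the class.

```python
import torch

PAD_IDX = 1  # assumption: index 1 is the padding token, as in the class below


def pad_batch(sequences, pad_idx=PAD_IDX):
    """Pad a list of token-id lists to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    # start from a tensor full of padding ids, then copy each sequence in
    padded = torch.full((len(sequences), max_len), pad_idx, dtype=torch.long)
    for i, seq in enumerate(sequences):
        padded[i, :len(seq)] = torch.tensor(seq, dtype=torch.long)
    return padded


batch = [[11, 12, 2, 7], [9, 569], [2, 4540, 4]]  # toy token ids
padded = pad_batch(batch)
print(padded.shape)
```

Padding per batch rather than to a global maximum keeps the tensors small when a batch happens to contain only short examples.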

In [38]:
class SquadDataset:
    '''
    - Creates batches dynamically by padding to the length of largest example
      in a given batch.
    - Calculates character vectors for contexts and questions.
    - Returns tensors for training.
    '''
    
    def __init__(self, data, batch_size):
        
        self.batch_size = batch_size
        data = [data[i:i+self.batch_size] for i in range(0, len(data), self.batch_size)]
        self.data = data
        
        
    def __len__(self):
        return len(self.data)
    
    def make_char_vector(self, max_sent_len, max_word_len, sentence):
        
        char_vec = torch.ones(max_sent_len, max_word_len).type(torch.LongTensor)
        
        for i, word in enumerate(nlp(sentence, disable=['parser','tagger','ner'])):
            for j, ch in enumerate(word.text):
                char_vec[i][j] = char2idx.get(ch, 0)
        
        return char_vec    
    
    def get_span(self, text):
        
        text = nlp(text, disable=['parser','tagger','ner'])
        span = [(w.idx, w.idx+len(w.text)) for w in text]

        return span

    def __iter__(self):
        '''
        Creates batches of data and yields them.
        
        Each yield consists of:
        :padded_context: padded tensor of contexts for each batch 
        :padded_question: padded tensor of questions for each batch 
        :char_ctx & char_ques: character-level ids for context and question
        :label: start and end index wrt context_ids
        :context_text,answer_text: used during validation to calculate metrics
        :ids: question ids for evaluation
        
        '''
        
        for batch in self.data:
            
            spans = []
            ctx_text = []
            answer_text = []
            
            for ctx in batch.context:
                ctx_text.append(ctx)
                spans.append(self.get_span(ctx))
                
            for ans in batch.answer:
                answer_text.append(ans)
            max_context_len = max([len(ctx) for ctx in batch.context_ids])
            padded_context = torch.LongTensor(len(batch), max_context_len).fill_(1)
            
            for i, ctx in enumerate(batch.context_ids):
                padded_context[i, :len(ctx)] = torch.LongTensor(ctx)
             
            max_word_ctx = 0
            for context in batch.context:
                for word in nlp(context, disable=['parser','tagger','ner']):
                    if len(word.text) > max_word_ctx:
                        max_word_ctx = len(word.text)
            
            char_ctx = torch.ones(len(batch), max_context_len, max_word_ctx).type(torch.LongTensor)
            for i, context in enumerate(batch.context):
                char_ctx[i] = self.make_char_vector(max_context_len, max_word_ctx, context)
#             print(char_ctx)   
#             break
            
            max_question_len = max([len(ques) for ques in batch.question_ids])
            padded_question = torch.LongTensor(len(batch), max_question_len).fill_(1)
            
            for i, ques in enumerate(batch.question_ids):
                padded_question[i, :len(ques)] = torch.LongTensor(ques)
                
            max_word_ques = 0
            for question in batch.question:
                for word in nlp(question, disable=['parser','tagger','ner']):
                    if len(word.text) > max_word_ques:
                        max_word_ques = len(word.text)
            
            char_ques = torch.ones(len(batch), max_question_len, max_word_ques).type(torch.LongTensor)
            for i, question in enumerate(batch.question):
                char_ques[i] = self.make_char_vector(max_question_len, max_word_ques, question)
            
            ids = list(batch.id)  
            label = torch.LongTensor(list(batch.label_idx))
            
#             print(padded_context, padded_question, char_ctx, char_ques, label, ctx_text, answer_text, ids)
            yield (padded_context, padded_question, char_ctx, char_ques, label, ctx_text, answer_text, ids)
            
            

In [55]:
train_dataset = SquadDataset(train_df, 16)
# train_dataset.data[:2]
next(iter(train_dataset))

(tensor([[16248,     3,     2,  ...,     1,     1,     1],
         [16248,     3,     2,  ...,     1,     1,     1],
         [16248,     3,     2,  ...,     1,     1,     1],
         ...,
         [    2,   108,    12,  ...,     1,     1,     1],
         [    2,   108,    12,  ...,     1,     1,     1],
         [    2,   303,     4,  ...,     1,     1,     1]]),
 tensor([[    9,   569,    25,     2,  2676,   849,  5986,  1082,     6,  8044,
              6, 19573,   230,     7,     1,     1],
         [   11,    12,     6,  1201,     4,     2,  1258,  1195,   236,   300,
              7,     1,     1,     1,     1,     1],
         [    2,  4540,     4,     2,  3909,  1498,    31,  1258,  1195,    12,
           7628,     9,    28,   764,     7,     1],
         [   11,    12,     2, 19572,    31,  1258,  1195,     7,     1,     1,
              1,     1,     1,     1,     1,     1],
         [   11,  8826,    24,   402,     4,     2,   236,   300,    31,  1258,
           1195,  

In [17]:
len('architecturally, the school has a catholic character. atop the main building\'s gold dome is a golden statue of the virgin mary. immediately in front of the main building and facing it, is a copper statue of christ with arms upraised with the legend "venite ad me omnes". next to the main building is the basilica of the sacred heart. immediately behind the basilica is the grotto, a marian place of prayer and reflection. it is a replica of the grotto at lourdes, france where the virgin mary reputedly appeared to saint bernadette soubirous in 1858. at the end of the main drive (and in a direct line that connects through 3 statues and the gold dome), is a simple, modern stone statue of mary.')

695

In [56]:
valid_dataset = SquadDataset(valid_df, 16)

In [19]:
a = next(iter(train_dataset))

tensor([[[200, 128, 173,  ..., 178, 178,  27],
         [ 18,   1,   1,  ...,   1,   1,   1],
         [ 59,  71,  39,  ...,   1,   1,   1],
         ...,
         [193, 141,   1,  ...,   1,   1,   1],
         [  4, 200, 128,  ...,   1,   1,   1],
         [ 96,   1,   1,  ...,   1,   1,   1]],

        [[200, 128, 173,  ..., 178, 178,  27],
         [ 18,   1,   1,  ...,   1,   1,   1],
         [ 59,  71,  39,  ...,   1,   1,   1],
         ...,
         [193, 141,   1,  ...,   1,   1,   1],
         [  4, 200, 128,  ...,   1,   1,   1],
         [ 96,   1,   1,  ...,   1,   1,   1]]])
tensor([[16248,     3,     2,   133,    40,    10,   544,   801,     5,  8708,
             2,   236,   300,    18,  1234,  5753,    12,    10,  2326,  4968,
             4,     2,  2676,   849,     5,  2177,     6,  1201,     4,     2,
           236,   300,     8,  4271,    32,     3,    12,    10,   987,  4968,
             4,  1493,    23,  2131, 54455,    23,     2,  4619,    14, 54456,
         

## BiDAF Model

## Word Embedding

> *Word embedding layer also maps each word to a high-dimensional vector space. We use pre-trained word vectors, GloVe to obtain the fixed word embedding of each word.*

This model uses 100-dimensional pre-trained GloVe word vectors. The `weights_matrix` built below is used to initialize the weight of an `nn.Embedding` layer. This is done in the final model module as follows:

```
weights_matrix = np.load('bidafglove.npy')
num_embeddings, embedding_dim = weights_matrix.shape
embedding = nn.Embedding.from_pretrained(torch.FloatTensor(weights_matrix).to(self.device),freeze=True)

```
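
As a minimal sketch of what that snippet does, the example below builds a frozen embedding from a small random matrix. The matrix `toy_weights` and its 5 x 4 size are hypothetical stand-ins for the real GloVe weights.

```python
import numpy as np
import torch
from torch import nn

# hypothetical stand-in for the GloVe matrix: 5 words, 4 dimensions
toy_weights = np.random.rand(5, 4).astype("float32")

# freeze=True keeps the pretrained vectors fixed during training
embedding = nn.Embedding.from_pretrained(torch.from_numpy(toy_weights), freeze=True)

token_ids = torch.tensor([[0, 3, 4]])   # one sentence with three token ids
vectors = embedding(token_ids)          # [1, 3, 4]
print(vectors.shape, embedding.weight.requires_grad)
```

Row `i` of the matrix becomes the vector for token id `i`, so the order of `word_vocab` used to build the matrix must match `word2idx`.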


In [20]:
def get_glove_dict():
    '''
    Parses the glove word vectors text file and returns a dictionary with the words as
    keys and their respective pretrained word vectors as values.

    '''
    glove_dict = {}
    with open("./data/Glove/glove.6B.100d.txt", "r", encoding="utf-8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], "float32")
            glove_dict[word] = vector
            
    # no explicit f.close() needed: the `with` block closes the file
    
    return glove_dict

In [21]:
glove_dict = get_glove_dict()

In [22]:
def create_weights_matrix(glove_dict):
    '''
    Creates a weight matrix of the words that are common in the GloVe vocab and
    the dataset's vocab. Initializes OOV words with a zero vector.
    '''
    weights_matrix = np.zeros((len(word_vocab), 100))
    words_found = 0
    for i, word in enumerate(word_vocab):
        try:
            weights_matrix[i] = glove_dict[word]
            words_found += 1
        except KeyError:
            # word is not in the GloVe vocab: keep the zero vector
            pass
        
    return weights_matrix, words_found


In [23]:
weights_matrix, words_found = create_weights_matrix(glove_dict)
print("Words found in the GloVe vocab: " ,words_found)

Words found in the GloVe vocab:  73003


In [24]:
# dump the weights to load in future

np.save('bidafglove_tv.npy', weights_matrix)

## Character Embedding

A character embedding is calculated for each context and query word. This is done by using convolutions.   
>  *It maps each word to a vector space using character-level CNNs.*

The use of CNNs for text was popularized by Yoon Kim in his 2014 paper "Convolutional Neural Networks for Sentence Classification", which applies CNNs to NLP much as they are used in vision. Most of the state-of-the-art results in CV at that time were achieved by transfer learning from larger models pretrained on ImageNet. The paper trains a simple CNN with one convolutional layer on top of pretrained word vectors and hypothesizes that these vectors can work as universal feature extractors for various classification tasks. This is analogous to the earlier layers of vision models like VGG and Inception working as generic feature extractors.

The intuition here is simple. Just as convolutional filters learn various features in an image by operating on its pixels, here they do so by operating on the characters of words. Let's get into the working of this layer.  

We first pass each character of a word through an embedding layer to get a fixed-size vector. Let the embedding dimension be $d$.
Let $C$ be the matrix representation of a word of length $l$. $C$ therefore has dimensions $d \times l$.
<img src="images/charemb1.PNG" width="500" height="300"/>

Let $H$ represent a convolutional filter with dimensions $d \times w$, where $d$ is the embedding dimension and $w$ is the width or the window size of the filter.   
The weights of this filter are randomly initialized and learned jointly with the rest of the model via backpropagation. We convolve this filter $H$ over our word representation $C$ as shown below. 
<img src="images/charemb2.PNG" width="500" height="400"/>  

At each position, the convolution is simply the inner product of the filter $H$ with the $d \times w$ window of $C$ that it currently covers. The convolution operations can be visualized as follows:
<img src="images/charemb3.PNG" width="900" height="700"/>    
The result of the above operation is a feature vector with one value per window position. A single filter is usually associated with a unique feature that it captures from the image/matrix. To get the most representative value related to the feature, we perform max pooling over the length of this vector.
<img src="images/charemb4.PNG" width="500" height="300"/>     
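
The single-filter computation described above can be sketched directly. This is a toy illustration with assumed sizes ($d = 8$, $l = 6$, $w = 3$) and random values, not the layer used later in the notebook. It also checks that the hand-written sliding inner product matches `F.conv2d`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, l, w = 8, 6, 3          # char embedding dim, word length, filter width (toy values)
C = torch.randn(d, l)      # character matrix of one word
H = torch.randn(d, w)      # one convolutional filter

# slide H over C: one inner product per window of w consecutive characters
feature_vector = torch.stack(
    [(H * C[:, k:k + w]).sum() for k in range(l - w + 1)]
)

# max pooling keeps the strongest response of this filter over the word
feature = feature_vector.max()

# the same numbers via F.conv2d, treating C as a 1-channel image
conv_out = F.conv2d(C.view(1, 1, d, l), H.view(1, 1, d, w)).flatten()
print(feature_vector.shape, torch.allclose(feature_vector, conv_out, atol=1e-5))
```

The feature vector has $l - w + 1$ entries, one per window position, and max pooling reduces it to a single number per filter.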

The above process was described for a single filter. The same process is repeated with $N$ filters, each of which captures a different property of the word. In an image, for example, if one filter captures the edges, another filter will capture the texture and another one the shapes in the image and so on. $N$ is also the size of the desired character embedding. The authors trained the model with $N = 100$.  

The implementation of this layer is fairly straightforward. The input to this layer is of dimension `[batch_size, seq_len, word_len]`, where `seq_len` and `word_len` are the lengths of the longest sequence and the longest word respectively within a given batch. We first embed the character tokens into fixed-size vectors using an embedding layer. This gives a tensor of dimension `[batch_size, seq_len, word_len, emb_dim]`.   
We then convert this tensor into a format that closely resembles an image, of type [ $N$, $C_{in}$, $H_{in}$, $W_{in}$]. The number of input channels, $C_{in}$ would be 1 and the output channels would be the desired embedding size which is 100. This is then passed through the convolution layer which gives an output of shape [ $N$, $C_{out}$, $H_{out}$, $W_{out}$]. Here,

<img src="images/conv.PNG" width="600" height="500"/>  

If `padding` = [0,0], `kernel_size` (or filter_size) = [$H_{in}$, $w$], `dilation` = [1,1], `stride` = [1,1] (as visible in images above), then,   
$H_{out} = 1$ and $W_{out} = W_{in} - w + 1$.

Since $H_{out} = 1$, we squeeze that dimension and perform max pooling with a kernel_size = $L_{in}$, where $L_{in} = W_{out} = W_{in} - w + 1$.

<img src="images/maxpool.PNG" width="600" height="500"/>     
If the kernel size = $L_{in}$, we get $L_{out}$ = 1 if other values are default. This dimension is again squeezed to finally give us a tensor of dimension   `[batch_size, seq_len, output_channels (or 100)]`. 
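
The shape bookkeeping above can be checked on a toy tensor. This is a sketch with assumed sizes; in particular the kernel width `w = 5` and the character embedding size of 8 are illustrative choices, not values confirmed by the notebook.

```python
import torch
from torch import nn
import torch.nn.functional as F

bs, seq_len, word_len = 2, 5, 7             # toy batch
char_emb_dim, out_channels, w = 8, 100, 5   # w and char_emb_dim are assumed values

# (N, C_in, H_in, W_in) = (bs*seq_len, 1, char_emb_dim, word_len)
x = torch.randn(bs * seq_len, 1, char_emb_dim, word_len)
conv = nn.Conv2d(in_channels=1, out_channels=out_channels,
                 kernel_size=(char_emb_dim, w))

y = conv(x)
conv_shape = tuple(y.shape)        # (N, C_out, 1, W_in - w + 1)

y = y.squeeze(2)                   # (N, C_out, W_out)
y = F.max_pool1d(y, y.shape[2])    # (N, C_out, 1)
y = y.squeeze(2).view(bs, seq_len, out_channels)
print(conv_shape, tuple(y.shape))
```

With `word_len = 7` and `w = 5` the convolution leaves $7 - 5 + 1 = 3$ positions, which max pooling then collapses to one value per filter.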



In [25]:
class CharacterEmbeddingLayer(nn.Module):
    
    def __init__(self, char_vocab_dim, char_emb_dim, num_output_channels, kernel_size):
        
        super().__init__()
        
        self.char_emb_dim = char_emb_dim
        
        self.char_embedding = nn.Embedding(char_vocab_dim, char_emb_dim, padding_idx=1)
        
        self.char_convolution = nn.Conv2d(in_channels=1, out_channels=num_output_channels, kernel_size=kernel_size)
        
        self.relu = nn.ReLU()
    
        self.dropout = nn.Dropout(0.2)
        
    def forward(self, x):
        # x = [bs, seq_len, word_len]
        # returns : [batch_size, seq_len, num_output_channels]
        # the output can be thought of as another feature embedding of dim 100.
        
        batch_size = x.shape[0]
        
        x = self.dropout(self.char_embedding(x))
        # x = [bs, seq_len, word_len, char_emb_dim]
        
        # following three operations manipulate x in such a way that
        # it closely resembles an image. this format is important before 
        # we perform convolution on the character embeddings.
        
        x = x.permute(0,1,3,2)
        # x = [bs, seq_len, char_emb_dim, word_len]
        
        x = x.view(-1, self.char_emb_dim, x.shape[3])
        # x = [bs*seq_len, char_emb_dim, word_len]
        
        x = x.unsqueeze(1)
        # x = [bs*seq_len, 1, char_emb_dim, word_len]
        
        # x is now in a format that can be accepted by a conv layer. 
        # think of the tensor above in terms of an image of dimension
        # (N, C_in, H_in, W_in).
        
        x = self.relu(self.char_convolution(x))
        # x = [bs*seq_len, out_channels, H_out, W_out]
        
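        # squeeze only the H_out dimension; a bare squeeze() would also
        # drop any other dimension that happens to have size 1
        x = x.squeeze(2)
        # x = [bs*seq_len, out_channels, W_out]
                
        x = F.max_pool1d(x, x.shape[2]).squeeze(2)
        # x = [bs*seq_len, out_channels, 1] => [bs*seq_len, out_channels]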
        
        x = x.view(batch_size, -1, x.shape[-1])
        # x = [bs, seq_len, out_channels]
        # x = [bs, seq_len, features] = [bs, seq_len, 100]
        
        
        return x        

## Highway Networks

Highway networks were originally introduced to ease the training of deep neural networks. While researchers had cracked the code for optimizing shallow neural networks, training *deep* networks was still a challenging task owing to problems such as vanishing gradients. Quoting the paper,

>  *We present a novel architecture that enables the optimization of networks with virtually arbitrary depth. This is accomplished through the use of a learned gating mechanism for regulating information flow which is inspired by Long Short Term Memory recurrent neural networks. Due to this gating mechanism, a neural network can have paths along which information can flow across several layers without attenuation. We call such paths information highways, and such networks highway networks.* 

This paper takes the key idea of learned gating mechanism from LSTMs which process information internally through a sequence of learned gates. The purpose of this layer is to *learn* to pass relevant information from the input. A highway network is a series of feed-forward or linear layers with a gating mechanism. The gating is implemented by using a sigmoid function which decides what amount of information should be transformed and what should be passed as it is.   

A plain feed-forward layer applies a non-linear transform $H$, parameterized by ($W_{H}, b_{H}$), such that for input $x$ the output $y$ is  

$$ y = H(x) = g(W_{H} x + b_{H}) $$
where $g$ is a non-linear activation.  
For highway networks, two additional gating transforms are defined, viz. $T$ ($W_{T}, b_{T}$) and $C$ ($W_{C}, b_{C}$).
Then,    
  
$$ y = T(x) \cdot H(x) + x \cdot C(x) $$ 
> *We refer to T as the transform gate and C as the carry gate, since they express how much of the output is produced by transforming the input and carrying it, respectively. For simplicity, in this paper we set C = 1 − T.*

$$ y = T(x) \cdot H(x) + x \cdot (1 - T(x)) $$  
  
$$ y = T(x) \cdot g(W_{H} x + b_{H}) + x \cdot (1 - T(x)) $$  
where $T(x) = \sigma(W_{T} x + b_{T})$, $g$ is the ReLU activation and $\cdot$ between vectors denotes element-wise multiplication.  
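
To make the role of the gate concrete, here is a small sketch of a single highway layer with the gate forced fully closed and then fully open. The layer size and the bias values of -20 and 20 are arbitrary choices for illustration; a trained network would learn intermediate gate values.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
dim = 4
flow = nn.Linear(dim, dim)   # H: parameters W_H, b_H
gate = nn.Linear(dim, dim)   # T: parameters W_T, b_T


def highway(x):
    t = torch.sigmoid(gate(x))                  # T(x), values in (0, 1)
    return t * F.relu(flow(x)) + (1 - t) * x    # y = T*H(x) + (1 - T)*x


x = torch.randn(3, dim)

with torch.no_grad():
    gate.weight.zero_()
    gate.bias.fill_(-20.0)   # sigmoid(-20) is ~0: gate closed, input is carried
carried = highway(x)

with torch.no_grad():
    gate.bias.fill_(20.0)    # sigmoid(20) is ~1: gate open, input is transformed
transformed = highway(x)

print(torch.allclose(carried, x, atol=1e-5))
print(torch.allclose(transformed, F.relu(flow(x)), atol=1e-5))
```

With the gate closed the layer behaves like an identity mapping, which is what lets information pass through many stacked layers without attenuation.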

The input to this layer is the concatenation of word and character embeddings of each word. To implement this we use `nn.ModuleList` to add multiple linear layers. This is done for the gate layer as well as for a normal linear transform. In code the `flow_layer` is the same as linear transform $H$ discussed above and `gate_layer` is $T$. In the forward method we loop through each layer and compute the output according to the highway equation described above.   
  
The output of this layer for the context is $X \in \mathbb{R}^{d \times T}$ and for the query is $Q \in \mathbb{R}^{d \times J}$, where $d$ is the hidden size of the LSTM, $T$ is the context length (not to be confused with the transform gate $T(x)$ above) and $J$ is the query length.  

The structure discussed so far is a recurring pattern in many NLP systems. Although this might be out of favor now with the advent of transformers and large pretrained language models, you will find this pattern in many NLP systems before transformers came into being. The idea behind this is that adding highway layers enables the network to make more efficient use of character embeddings. If a particular word is not found in the pretrained word vector vocabulary (OOV word), it will most likely be initialized with a zero vector. It then makes much more sense to look at the character embedding of that word rather than the word embedding. The soft gating mechanism in highway layers helps the model to achieve this. 

In [26]:
class HighwayNetwork(nn.Module):
    
    def __init__(self, input_dim, num_layers=2):
        
        super().__init__()
        
        self.num_layers = num_layers
        
        self.flow_layer = nn.ModuleList([nn.Linear(input_dim, input_dim) for _ in range(num_layers)])
        self.gate_layer = nn.ModuleList([nn.Linear(input_dim, input_dim) for _ in range(num_layers)])
        
    def forward(self, x):
        
        for i in range(self.num_layers):
            
            flow_value = F.relu(self.flow_layer[i](x))
            gate_value = torch.sigmoid(self.gate_layer[i](x))
            
            x = gate_value * flow_value + (1-gate_value) * x
        
        return x

## Contextual Embedding

This layer is the final embedding layer in the model.
> *Bi-Directional Attention Flow (BIDAF) network, a hierarchical multi-stage architecture for modeling the representations of the context paragraph at different levels of granularity. BIDAF includes character-level, word-level, and contextual embeddings*

The output of the highway layers is passed to a bidirectional LSTM to model the temporal features of the text. This is done for both the context and the query.   
>  *Utilizes contextual cues from surrounding words to refine the embedding of the words*

The output of this layer for the context is called $H \in \mathbb{R}^{2d \times T}$ and for the query it is called $U \in \mathbb{R}^{2d \times J}$. The $2d$ comes from concatenating the features of the forward and backward LSTMs. In code, we simply use the outputs of the LSTM layer and ignore the final hidden and cell states.
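
A quick sketch of where the $2d$ comes from, using a standalone `nn.LSTM` with toy sizes (the sizes below are assumptions for illustration). It also shows how the per-step outputs relate to the final hidden states that the layer ignores.

```python
import torch
from torch import nn

bs, T, input_dim, d = 2, 9, 200, 100   # toy sizes; d is the LSTM hidden size
lstm = nn.LSTM(input_dim, d, batch_first=True, bidirectional=True)

x = torch.randn(bs, T, input_dim)
outputs, (h_n, c_n) = lstm(x)

# outputs holds forward and backward features side by side for every step
print(outputs.shape)   # [bs, T, 2*d]
print(h_n.shape)       # [num_directions, bs, d]

# forward half at the last step and backward half at the first step
# are exactly the final hidden states of the two directions
fwd_last = outputs[:, -1, :d]
bwd_first = outputs[:, 0, d:]
```

Because attention needs a representation for every token, the per-step `outputs` tensor is the useful one here rather than `h_n`.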

In [27]:
class ContextualEmbeddingLayer(nn.Module):
    
    def __init__(self, input_dim, hidden_dim):
        
        super().__init__()
        
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True, bidirectional=True)
        
        self.highway_net = HighwayNetwork(input_dim)
        
    def forward(self, x):
        # x = [bs, seq_len, input_dim] = [bs, seq_len, emb_dim*2]
        # the input is the concatenation of word and character embeddings
        # for the sequence.
        
        highway_out = self.highway_net(x)
        # highway_out = [bs, seq_len, input_dim]
        
        outputs, _ = self.lstm(highway_out)
        # outputs = [bs, seq_len, hidden_dim*2]
        
        return outputs

## Attention Flow Layer


> *It is worth noting that the first three layers of the model are computing features from the query and context at different levels of granularity, akin to the multi-stage feature computation of convolutional neural networks in the computer vision field.*

The outputs of the previous layer, the contextual representations of the context $H$ and the query $U$, are passed on to this layer. Until now, the context and the query have been processed independently of each other. This layer, however, is responsible for fusing and linking the context and query representations.   

This layer calculates attention in two directions: from context to query and from query to context. The attention vectors for both directions are derived from a common matrix called the similarity matrix, denoted by $S \in \mathbb{R}^{T \times J}$. $T$ is the length of the context (number of words/tokens) and $J$ is the length of the query for each training example. The similarity matrix is computed by,
$$ S_{tj} = \alpha\ (H_{:t}, U_{:j}) $$  

where $S_{tj} \in \mathbb{R}$.  
$S_{tj}$ is a single scalar that measures the similarity between the $t$-th context word and the $j$-th query word.
$\alpha$ is a trainable function that encodes the similarity between two input vectors $H_{:t}$ and $U_{:j}$.  
$H_{:t}$ is the contextual representation of the $t$-th context word and $U_{:j}$ is the contextual representation of the $j$-th query word. 
<img src="images/simimat.PNG" width="500" height="500"/>     



The trainable function is defined as,  

$$ \alpha \ (h, u) = w_{(S)}^{T}\ [h\ ;\ u \ ; \ h \circ u]  $$
where $;$ denotes concatenation and $\circ$ denotes an element-wise product.    
$w_{(S)}$ is a trainable weight vector of dimension $6d$. This is because each column of $H \in R^{\ 2d \times T}$ and $U \in R^{\ 2d \times J}$ is a $2d$ vector, and we are concatenating 3 such vectors.  

In code, all the computations above are performed directly on tensors. We first `repeat` the contextual representations of the context and the query along the appropriate dimensions to get two tensors of shape `[batch_size, ctx_len, query_len, emb_dim*2 (d*2)]`. $w_{(S)}^T$ is implemented as a bias-free linear layer called `similarity_weight`. We concatenate the two tensors and their element-wise product along the last dimension and pass the result through the linear layer. 
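
The steps above can be sketched in isolation as follows. This is a minimal sketch on random stand-ins for $H$ and $U$ with illustrative sizes; the last lines check one entry of $S$ against the scalar formula $\alpha(h, u) = w_{(S)}^{T}[h; u; h \circ u]$.

```python
import torch
from torch import nn

torch.manual_seed(0)

bs, T, J, d = 2, 5, 3, 4          # illustrative sizes
H = torch.randn(bs, T, 2 * d)     # stand-in for the context representation
U = torch.randn(bs, J, 2 * d)     # stand-in for the query representation

similarity_weight = nn.Linear(6 * d, 1, bias=False)

# Repeat H along a new query axis and U along a new context axis
H_rep = H.unsqueeze(2).repeat(1, 1, J, 1)   # [bs, T, J, 2d]
U_rep = U.unsqueeze(1).repeat(1, T, 1, 1)   # [bs, T, J, 2d]

features = torch.cat([H_rep, U_rep, H_rep * U_rep], dim=3)   # [bs, T, J, 6d]
S = similarity_weight(features).squeeze(-1)                  # [bs, T, J]

# Check a single entry against the scalar definition of alpha
t, j = 1, 2
h, u = H[0, t], U[0, j]
s_tj = similarity_weight(torch.cat([h, u, h * u]))           # shape [1]
print(torch.allclose(S[0, t, j], s_tj[0], atol=1e-5))
```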

### Context-to-Query Attention

> *Context-to-query (C2Q) attention signifies which query words are most relevant to each context word.*
 
Let $a_{t}$ represent the vector of attention weights that the $t$-th context word places on each query word.   
$$ \sum_{j}a_{tj} = 1  $$  
$$ a_{t} = softmax(S_{t:}) $$
where $a_{t} \in R^{J}$ and $S_{t:}$ is the $t$-th row of $S$.   
This can be visualized as,
<img src="images/c2q.PNG" width="700" height="750"/>     

Subsequently, each attended query vector is calculated by,
$$ \overline U_{:t} = \sum_{j} a_{tj} U_{:j} $$

This vector is a weighted sum of the query word representations, dominated by the query words that are most relevant to the $t$-th context word.  

In code, again vectorizing the operations, this simply involves taking a softmax of the similarity matrix across the last dimension, i.e. across the columns, and multiplying the result with $U$. The shape of the similarity matrix is `[batch_size, ctx_len, query_len]` and the shape of $U$ is `[batch_size, query_len, emb_dim*2]`. Hence a batched matrix multiplication with `torch.bmm` gives a tensor of shape `[batch_size, ctx_len, emb_dim*2]`. Therefore, $\overline U$ is a $2d$-by-$T$ matrix for each example.
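
Here is a minimal sketch of the C2Q computation on a random similarity matrix, with illustrative sizes. The final lines recompute one attended query vector with an explicit sum over $j$ to confirm that `torch.bmm` implements $\overline U_{:t} = \sum_{j} a_{tj} U_{:j}$.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

bs, T, J, d = 2, 5, 3, 4          # illustrative sizes
S = torch.randn(bs, T, J)         # stand-in for the similarity matrix
U = torch.randn(bs, J, 2 * d)     # stand-in for the query representation

a = F.softmax(S, dim=-1)          # [bs, T, J], each row sums to 1
U_tilde = torch.bmm(a, U)         # [bs, T, 2d]

# Explicit weighted sum for the first example and context position t
t = 0
manual = sum(a[0, t, j] * U[0, j] for j in range(J))
print(torch.allclose(U_tilde[0, t], manual, atol=1e-6))
```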


### Query-to-Context Attention

>  *Query-to-context (Q2C) attention signifies which context words have the closest similarity to one of the query words and are hence critical for answering the query.*

The attention vector here is calculated as,  
$$ b = softmax \ (max_{col} \ (S)) \ \in R^{T} $$  
where $max_{col}$ takes the maximum across the columns of $S$, i.e. over the query words for each context word. The attended context vector is then calculated as

$$ \overline h = \sum_{t} b_{t} H_{:t}$$

The above equations can be visualized as,

<img src="images/q2c.PNG" width="800" height="900"/>

The above figure helps us understand that for each training example we get a single $T$-dimensional attention vector $b$. The contextual representation $H$ is then multiplied by this vector to get a single $2d$ vector $\overline h$. To get the matrix $\overline H$, we need to tile/repeat this vector $T$ times to get a $2d$-by-$T$ representation.
This is unlike the C2Q attention, where we directly got a $2d$-by-$T$ matrix.

The implementation follows the explanation above closely. We first take the maximum over the query dimension (the columns of $S$) using `torch.max`, which gives a tensor of shape `[batch_size, ctx_len]`. This is passed through a softmax to get weights that add up to 1. We then insert an extra dimension to get shape `[batch_size, 1, ctx_len]` and multiply it with $H$, which gives a tensor of shape `[batch_size, 1, emb_dim*2]`. This tensor corresponds to $\overline h$. We then `repeat` it $T$ times to get $\overline H$ of shape `[batch_size, ctx_len, emb_dim*2]`.
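
The same steps, sketched on random stand-ins for $S$ and $H$ with illustrative sizes:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

bs, T, J, d = 2, 5, 3, 4          # illustrative sizes
S = torch.randn(bs, T, J)         # stand-in for the similarity matrix
H = torch.randn(bs, T, 2 * d)     # stand-in for the context representation

max_over_query, _ = torch.max(S, dim=2)      # [bs, T]
b = F.softmax(max_over_query, dim=-1)        # [bs, T], sums to 1 over the context
h_tilde = torch.bmm(b.unsqueeze(1), H)       # [bs, 1, 2d]
H_tilde = h_tilde.repeat(1, T, 1)            # [bs, T, 2d]

print(H_tilde.shape)
```

Every context position in `H_tilde` holds the same $2d$ vector, which is what distinguishes it from the C2Q output.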


### Combining attention vectors and contextual embeddings

> *Finally, the contextual embeddings and the attention vectors are combined together to yield G, where each column vector can be considered as the query-aware representation of each context word.*

$G$ is defined as,

$$ G_{:t} = \beta \ (\ H_{:t}\ ;\ \overline U_{:t} \ ;\ \overline H_{:t}  \ )$$

where $\beta$ is a trainable function.  
$$ \beta (h, \overline u, \overline h) = [h\ ;\ \overline u\ ;\ h \circ \overline u\ ;\ h \circ \overline h\ ] $$

This concatenation gives an $8d$ vector for each context word. If $\beta$ were a trainable network, its weights would need an input dimension of $8d$. Here, however, we do not use any trainable weights because, according to the authors,

> *While the β function can be an arbitrary trainable neural network, such as multi-layer perceptron, a simple concatenation as following still shows good performance in our experiments.*

In code, this simply involves a single line that concatenates the tensors using `torch.cat` to yield $G$. This tensor holds the query-aware representation of the context words.
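
A minimal sketch of this concatenation, using random stand-ins for $H$, $\overline U$ and $\overline H$ with illustrative sizes:

```python
import torch

torch.manual_seed(0)

bs, T, d = 2, 5, 4                      # illustrative sizes
H = torch.randn(bs, T, 2 * d)           # contextual embeddings of the context
U_tilde = torch.randn(bs, T, 2 * d)     # stand-in for the C2Q output
H_tilde = torch.randn(bs, T, 2 * d)     # stand-in for the tiled Q2C output

# beta(h, u~, h~) = [h ; u~ ; h * u~ ; h * h~]
G = torch.cat([H, U_tilde, H * U_tilde, H * H_tilde], dim=2)

print(G.shape)   # last dimension is 8d
```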


## Modeling Layer

$G$ is then passed on to this layer, which is responsible for capturing the temporal interactions among the context words. This is done using a bidirectional LSTM. The difference between this layer and the contextual embedding layer, both of which involve an LSTM, is that here the input is a *query-aware* representation of the context, whereas in the contextual embedding layer the context and the query were encoded independently.  
Using a bidirectional LSTM with a hidden size of $d$, we get a $2d$-by-$T$ output called $M$. In the code, `modeling_lstm` stacks two such layers.  

> *Each column vector of M is expected to contain contextual information about the word with respect to the entire context paragraph and the query.*


## Output Layer

The probability distribution over the start index of the answer span, $p^{1}$, is calculated by,

<img src="images/p1.PNG" width="300" height="300"/>

where $w_{(p^{1})}$ is a $10d$ trainable weight vector, implemented in code as a linear layer called `output_start`. The input is $10d$ because $G$ ($8d$) and $M$ ($2d$) are concatenated.  
To predict the end of the answer span, $M$ is passed through another bidirectional LSTM to get $M^{2}$, which again is a $2d$-by-$T$ matrix. The end distribution is then computed as,  

<img src="images/p2.PNG" width="300" height="300"/>  


$w_{(p^{2})}$ is again a $10d$ trainable weight vector and corresponds to the linear layer `output_end`. 

In code we don't explicitly calculate the softmax, since `F.cross_entropy` applies it internally when calculating the loss during training. 
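
To make the last point concrete, here is a minimal sketch of the modeling and output layers on a random $G$. The sizes and the gold indices `start_idx` and `end_idx` are illustrative assumptions, and the modeling LSTM here is a single layer for brevity. The final lines check that `F.cross_entropy` on the raw scores equals the negative log of the softmax probability at the gold index.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

bs, T, d = 2, 5, 4                       # illustrative sizes
G = torch.randn(bs, T, 8 * d)            # stand-in for the query-aware representation

modeling_lstm = nn.LSTM(8 * d, d, bidirectional=True, batch_first=True)
end_lstm = nn.LSTM(2 * d, d, bidirectional=True, batch_first=True)
output_start = nn.Linear(10 * d, 1, bias=False)
output_end = nn.Linear(10 * d, 1, bias=False)

M, _ = modeling_lstm(G)                  # [bs, T, 2d]
M2, _ = end_lstm(M)                      # [bs, T, 2d]

start_logits = output_start(torch.cat([G, M], dim=2)).squeeze(-1)    # [bs, T]
end_logits = output_end(torch.cat([G, M2], dim=2)).squeeze(-1)       # [bs, T]

start_idx = torch.tensor([1, 3])         # hypothetical gold start positions
end_idx = torch.tensor([2, 4])           # hypothetical gold end positions

loss = F.cross_entropy(start_logits, start_idx) + F.cross_entropy(end_logits, end_idx)

# cross_entropy on logits == mean negative log softmax probability of the gold index
p1 = F.softmax(start_logits, dim=-1)
manual_start_loss = -torch.log(p1[torch.arange(bs), start_idx]).mean()
print(torch.allclose(F.cross_entropy(start_logits, start_idx), manual_start_loss, atol=1e-6))
```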

> *The output layer is application-specific. The modular nature of BIDAF allows us to easily swap out the output layer based on the task, with the rest of the architecture remaining exactly the same.*


In [47]:
class BiDAF(nn.Module):
    
    def __init__(self, char_vocab_dim, emb_dim, char_emb_dim, num_output_channels, 
                 kernel_size, ctx_hidden_dim, device):
        '''
        char_vocab_dim = len(char2idx)
        emb_dim = 100
        char_emb_dim = 8
        num_output_channels = 100
        kernel_size = (8,5)
        ctx_hidden_dim = 100
        '''
        super().__init__()
        
        self.device = device
        
        self.word_embedding = self.get_glove_embedding()
        
        self.character_embedding = CharacterEmbeddingLayer(char_vocab_dim, char_emb_dim, 
                                                      num_output_channels, kernel_size)
        
        self.contextual_embedding = ContextualEmbeddingLayer(emb_dim*2, ctx_hidden_dim)
        
        self.dropout = nn.Dropout()
        
        self.similarity_weight = nn.Linear(emb_dim*6, 1, bias=False)
        
        self.modeling_lstm = nn.LSTM(emb_dim*8, emb_dim, bidirectional=True, num_layers=2, batch_first=True, dropout=0.2)
        
        self.output_start = nn.Linear(emb_dim*10, 1, bias=False)
        
        self.output_end = nn.Linear(emb_dim*10, 1, bias=False)
        
        self.end_lstm = nn.LSTM(emb_dim*2, emb_dim, bidirectional=True, batch_first=True)
        
    
    def get_glove_embedding(self):
        
        weights_matrix = np.load('bidafglove_tv.npy')
        num_embeddings, embedding_dim = weights_matrix.shape
        embedding = nn.Embedding.from_pretrained(torch.FloatTensor(weights_matrix).to(self.device),freeze=True)

        return embedding
        
    def forward(self, ctx, ques, char_ctx, char_ques):
        # ctx = [bs, ctx_len]
        # ques = [bs, ques_len]
        # char_ctx = [bs, ctx_len, ctx_word_len]
        # char_ques = [bs, ques_len, ques_word_len]
        
        ctx_len = ctx.shape[1]
        
        ques_len = ques.shape[1]
        
        ## GET WORD AND CHARACTER EMBEDDINGS
        
        ctx_word_embed = self.word_embedding(ctx)
        # ctx_word_embed = [bs, ctx_len, emb_dim]
        
        ques_word_embed = self.word_embedding(ques)
        # ques_word_embed = [bs, ques_len, emb_dim]
        
#         ctx_char_embed = self.character_embedding(char_ctx)
        # ctx_char_embed =  [bs, ctx_len, emb_dim]
        
#         ques_char_embed = self.character_embedding(char_ques)
        # ques_char_embed = [bs, ques_len, emb_dim]
        
        ## CREATE CONTEXTUAL EMBEDDING
        
#         ctx_contextual_inp = torch.cat([ctx_word_embed, ctx_char_embed],dim=2)
        ctx_contextual_inp = torch.cat([ctx_word_embed, ctx_word_embed],dim=2)
        # [bs, ctx_len, emb_dim*2]
        
#         ques_contextual_inp = torch.cat([ques_word_embed, ques_char_embed],dim=2)
        ques_contextual_inp = torch.cat([ques_word_embed, ques_word_embed],dim=2)
        # [bs, ques_len, emb_dim*2]
        
        ctx_contextual_emb = self.contextual_embedding(ctx_contextual_inp)
        # [bs, ctx_len, emb_dim*2]
        
        ques_contextual_emb = self.contextual_embedding(ques_contextual_inp)
        # [bs, ques_len, emb_dim*2]
        
        
        ## CREATE SIMILARITY MATRIX
        
        ctx_ = ctx_contextual_emb.unsqueeze(2).repeat(1,1,ques_len,1)
        # [bs, ctx_len, 1, emb_dim*2] => [bs, ctx_len, ques_len, emb_dim*2]
        
        ques_ = ques_contextual_emb.unsqueeze(1).repeat(1,ctx_len,1,1)
        # [bs, 1, ques_len, emb_dim*2] => [bs, ctx_len, ques_len, emb_dim*2]
        
        elementwise_prod = torch.mul(ctx_, ques_)
        # [bs, ctx_len, ques_len, emb_dim*2]
        
        alpha = torch.cat([ctx_, ques_, elementwise_prod], dim=3)
        # [bs, ctx_len, ques_len, emb_dim*6]
        
        similarity_matrix = self.similarity_weight(alpha).view(-1, ctx_len, ques_len)
        # [bs, ctx_len, ques_len]
        
        
        ## CALCULATE CONTEXT2QUERY ATTENTION
        
        a = F.softmax(similarity_matrix, dim=-1)
        # [bs, ctx_len, ques_len]
        
        c2q = torch.bmm(a, ques_contextual_emb)
        # [bs] ([ctx_len, ques_len] X [ques_len, emb_dim*2]) => [bs, ctx_len, emb_dim*2]
        
        
        ## CALCULATE QUERY2CONTEXT ATTENTION
        
        b = F.softmax(torch.max(similarity_matrix,2)[0], dim=-1)
        # [bs, ctx_len]
        
        b = b.unsqueeze(1)
        # [bs, 1, ctx_len]
        
        q2c = torch.bmm(b, ctx_contextual_emb)
        # [bs, 1, ctx_len] X [bs, ctx_len, emb_dim*2] => [bs, 1, emb_dim*2]
        
        q2c = q2c.repeat(1, ctx_len, 1)
        # [bs, ctx_len, emb_dim*2]
        
        ## QUERY AWARE REPRESENTATION
        
        G = torch.cat([ctx_contextual_emb, c2q, 
                       torch.mul(ctx_contextual_emb,c2q), 
                       torch.mul(ctx_contextual_emb, q2c)], dim=2)
        
        # [bs, ctx_len, emb_dim*8]
        
        
        ## MODELING LAYER
        
        M, _ = self.modeling_lstm(G)
        # [bs, ctx_len, emb_dim*2]
        
        ## OUTPUT LAYER
        
        M2, _ = self.end_lstm(M)
        
        # START PREDICTION
        
        p1 = self.output_start(torch.cat([G,M], dim=2))
        # [bs, ctx_len, 1]
        
        p1 = p1.squeeze(-1)  # squeeze only the last dim so a batch of size 1 keeps its batch dim
        # [bs, ctx_len]
        
        #p1 = F.softmax(p1, dim=-1)
        
        # END PREDICTION
        
        p2 = self.output_end(torch.cat([G, M2], dim=2)).squeeze(-1)
        # [bs, ctx_len, 1] => [bs, ctx_len]
        
        #p2 = F.softmax(p2, dim=-1)
        
        
        return p1, p2
    

## Training

>  *We use 100 1D filters for CNN char embedding, each with a width of 5. The hidden state size (d) of the model is 100. The model has about 2.6 million parameters. We use the AdaDelta (Zeiler, 2012) optimizer, with a mini batch size of 60 and an initial learning rate of 0.5, for 12 epochs. A dropout rate of 0.2 is used for the CNN, all LSTM layers, and the linear transformation before the softmax for the answers.*

__Note__- Although the mini-batch size mentioned here is 60, you might need to change this depending on your GPU. The authors trained this model on a Titan X. I used a GTX 1080 Ti with 11.2 GB of memory to train this model and had to reduce the mini-batch size to 16 or 12 to make it work.

In [48]:
CHAR_VOCAB_DIM = len(char2idx)
EMB_DIM = 100
CHAR_EMB_DIM = 8
NUM_OUTPUT_CHANNELS = 100
KERNEL_SIZE = (8,5)
HIDDEN_DIM = 100
device = torch.device('cuda')

model = BiDAF(CHAR_VOCAB_DIM, 
              EMB_DIM, 
              CHAR_EMB_DIM, 
              NUM_OUTPUT_CHANNELS, 
              KERNEL_SIZE, 
              HIDDEN_DIM, 
              device).to(device)

In [49]:
import torch.optim as optim
# Adadelta with PyTorch's default settings (lr=1.0); the paper quotes an initial learning rate of 0.5
optimizer = optim.Adadelta(model.parameters())
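
The optimizer cell in this notebook relies on PyTorch's default Adadelta settings. For readers who want to pass the paper's quoted initial learning rate explicitly, a minimal sketch could look like this. `toy_model` is a hypothetical stand-in so that the snippet runs on its own, and `paper_optimizer` is not what produced the results reported in this notebook.

```python
import torch.optim as optim
from torch import nn

# Hypothetical stand-in for the BiDAF model, used only to have parameters to optimize
toy_model = nn.Linear(10, 1)

# lr=0.5 follows the initial learning rate quoted from the paper;
# PyTorch's default for Adadelta is lr=1.0
paper_optimizer = optim.Adadelta(toy_model.parameters(), lr=0.5)

print(paper_optimizer.param_groups[0]["lr"])
```

Whether this reproduces the behaviour of the authors' original optimizer exactly is not something this sketch verifies.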

In [83]:
def train(model, train_dataset):
    print("Starting training ........")
   

    train_loss = 0.
    batch_count = 0
    model.train()
    for batch in tqdm(train_dataset):
        
        optimizer.zero_grad()
    
    
#         if batch_count % 10 == 0:
#             break
            
        if batch_count % 500 == 0:
            print(f"Starting batch: {batch_count}")
        batch_count += 1
        
        context, question, char_ctx, char_ques, label, ctx_text, ans, ids = batch

        context, question, char_ctx, char_ques, label = context.to(device), question.to(device),\
                                   char_ctx.to(device), char_ques.to(device), label.to(device)


        preds = model(context, question, char_ctx, char_ques)

        start_pred, end_pred = preds

        s_idx, e_idx = label[:,0], label[:,1]

        loss = F.cross_entropy(start_pred, s_idx) + F.cross_entropy(end_pred, e_idx)

        loss.backward()
        
#         plot_grad_flow(model.named_parameters())
        
#         for name, param in model.named_parameters():
#             if(param.requires_grad) and ("bias" not in name):
#                 writer.add_histogram(name+'_grad',param.grad.abs().mean())
    

        optimizer.step()

        train_loss += loss.item()

    return train_loss/len(train_dataset)

In [84]:
def valid(model, valid_dataset):
    
    print("Starting validation .........")
   
    valid_loss = 0.

    batch_count = 0
    
    f1, em = 0., 0.
    
    model.eval()
        
   
    predictions = {}
    
    for batch in valid_dataset:

        
#         if batch_count % 10 == 0:
#             break
        if batch_count % 500 == 0:
            print(f"Starting batch {batch_count}")
        batch_count += 1

        context, question, char_ctx, char_ques, label, ctx, answers, ids = batch

        context, question, char_ctx, char_ques, label = context.to(device), question.to(device),\
                                   char_ctx.to(device), char_ques.to(device), label.to(device)
        
       

        
        with torch.no_grad():
            
            s_idx, e_idx = label[:,0], label[:,1]

            preds = model(context, question, char_ctx, char_ques)

            p1, p2 = preds

            
            loss = F.cross_entropy(p1, s_idx) + F.cross_entropy(p2, e_idx)

            valid_loss += loss.item()

            batch_size, c_len = p1.size()
            ls = nn.LogSoftmax(dim=1)
            mask = (torch.ones(c_len, c_len) * float('-inf')).to(device).tril(-1).unsqueeze(0).expand(batch_size, -1, -1)
            score = (ls(p1).unsqueeze(2) + ls(p2).unsqueeze(1)) + mask
            score, s_idx = score.max(dim=1)
            score, e_idx = score.max(dim=1)
            s_idx = torch.gather(s_idx, 1, e_idx.view(-1, 1)).squeeze(1)
            
           
            for i in range(batch_size):
                qid = ids[i]  # avoid shadowing the built-in `id`
                pred = context[i][s_idx[i]:e_idx[i]+1]
                pred = ' '.join([idx2word[idx.item()] for idx in pred])
                predictions[qid] = pred
            

    
    em, f1 = evaluate(predictions)
    return valid_loss/len(valid_dataset), em, f1
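
The span selection inside `valid` is compact, so here is a toy walk-through of the same masking trick on hand-picked scores for a single four-token context. The scores are made up for illustration. Taking the argmax of the start and end scores independently would give start 3 and end 2, which is not a valid span; adding a mask of `-inf` wherever start > end forces the best valid pair instead.

```python
import torch
from torch import nn

# Made-up scores for one example with 4 context tokens
p1 = torch.tensor([[0.1, 2.0, 0.3, 3.0]])   # start scores
p2 = torch.tensor([[0.2, 0.1, 2.5, 0.4]])   # end scores

batch_size, c_len = p1.size()
ls = nn.LogSoftmax(dim=1)

# -inf strictly below the diagonal rules out pairs where start > end
mask = torch.full((c_len, c_len), float('-inf')).tril(-1)
mask = mask.unsqueeze(0).expand(batch_size, -1, -1)

# score[b, s, e] = log p1[s] + log p2[e] for valid pairs, -inf otherwise
score = ls(p1).unsqueeze(2) + ls(p2).unsqueeze(1) + mask

score, s_idx = score.max(dim=1)   # best start for every end position
score, e_idx = score.max(dim=1)   # best end position overall
s_idx = torch.gather(s_idx, 1, e_idx.view(-1, 1)).squeeze(1)

print(p1.argmax(dim=1).item(), p2.argmax(dim=1).item())   # independent argmax: 3 2
print(s_idx.item(), e_idx.item())                          # joint valid span: 1 2
```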

In [85]:
def evaluate(predictions):
    '''
    Gets a dictionary of predictions with question_id as key
    and prediction as value. The validation dataset has multiple 
    answers for a single question. Hence we compare our prediction
    with all the answers and choose the one that gives us
    the maximum metric (em or f1). 
    This method first parses the JSON file, gets all the answers
    for a given id and then passes the list of answers and the 
    predictions to calculate em, f1.
    
    
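    :param dict predictions: maps question id to the predicted answer string
    Returns
    : exact_match: percentage of dev-set questions whose prediction matches
      at least one ground truth exactly (after normalization).
    : f1_score: average over dev-set questions of the maximum token-level F1
      against the ground truths, as a percentage.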
    '''
    with open('./data/Squad/squad_dev.json','r',encoding='utf-8') as f:
        dataset = json.load(f)
        
    dataset = dataset['data']
    f1 = exact_match = total = 0
    for article in dataset:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                total += 1
                if qa['id'] not in predictions:
                    continue
                
                ground_truths = list(map(lambda x: x['text'], qa['answers']))
                
                prediction = predictions[qa['id']]
                
                exact_match += metric_max_over_ground_truths(
                    exact_match_score, prediction, ground_truths)
                
                f1 += metric_max_over_ground_truths(
                    f1_score, prediction, ground_truths)
                
    
    exact_match = 100.0 * exact_match / total
    f1 = 100.0 * f1 / total
    
    return exact_match, f1



In [86]:
def normalize_answer(s):
    '''
    Performs a series of cleaning steps on the ground truth and 
    predicted answer.
    '''
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))


def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    '''
    Returns maximum value of metrics for prediction by model against
    multiple ground truths.
    
    :param func metric_fn: can be 'exact_match_score' or 'f1_score'
    :param str prediction: predicted answer span by the model
    :param list ground_truths: list of ground truths against which
                               metrics are calculated. Maximum values of 
                               metrics are chosen.
                            
    
    '''
    scores_for_ground_truths = []
    for ground_truth in ground_truths:
        score = metric_fn(prediction, ground_truth)
        scores_for_ground_truths.append(score)
        
    return max(scores_for_ground_truths)


def f1_score(prediction, ground_truth):
    '''
    Returns f1 score of two strings.
    '''
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


def exact_match_score(prediction, ground_truth):
    '''
    Returns exact_match_score of two strings.
    '''
    return (normalize_answer(prediction) == normalize_answer(ground_truth))


def epoch_time(start_time, end_time):
    
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
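
To see how these metrics behave, here is a small worked example. It uses compact re-implementations (`normalize` and `token_f1`, names chosen for this sketch) of the normalization and F1 logic above so that it runs on its own, and the prediction and ground truths are made up.

```python
import re
import string
from collections import Counter

def normalize(s):
    # lowercase, strip punctuation, drop articles, collapse whitespace
    s = ''.join(ch for ch in s.lower() if ch not in set(string.punctuation))
    s = re.sub(r'\b(a|an|the)\b', ' ', s)
    return ' '.join(s.split())

def token_f1(prediction, ground_truth):
    pred_tokens = normalize(prediction).split()
    gt_tokens = normalize(ground_truth).split()
    num_same = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

prediction = "The Eiffel Tower in Paris"
ground_truths = ["Eiffel Tower", "the Eiffel Tower, Paris"]

# Against "Eiffel Tower": precision 2/4, recall 2/2 -> F1 = 2/3
# Against "the Eiffel Tower, Paris": precision 3/4, recall 3/3 -> F1 = 6/7
best_f1 = max(token_f1(prediction, gt) for gt in ground_truths)
best_em = max(normalize(prediction) == normalize(gt) for gt in ground_truths)

print(round(best_f1, 4))   # 0.8571
print(best_em)             # False
```

The extra token "in" costs the prediction an exact match against both references, while F1 still gives partial credit.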

In [87]:
from tqdm import tqdm
train_losses = []
valid_losses = []
ems = []
f1s = []
epochs = 5
for epoch in range(epochs):
    print(f"Epoch {epoch+1}")
    start_time = time.time()
    
    train_loss = train(model, train_dataset)
    valid_loss, em, f1 = valid(model, valid_dataset)
    
#     writer.add_scalar('train_loss', train_loss, epoch)
#     writer.add_scalar('valid_loss', valid_loss, epoch)
    
    
#     for name, param in model.named_parameters():
#         writer.add_histogram(name, param, epoch)
    
    torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': valid_loss,
            'em':em,
            'f1':f1,
            }, 'bidaf_run4_{}.pth'.format(epoch))
    
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)
    ems.append(em)
    f1s.append(f1)

    print(f"Epoch train loss : {train_loss}| Time: {epoch_mins}m {epoch_secs}s")
    print(f"Epoch valid loss: {valid_loss}")
    print(f"Epoch EM: {em}")
    print(f"Epoch F1: {f1}")
    print("====================================================================================")
    

Epoch 1
Starting training ........


100%|███████████████████████████████████████| 5413/5413 [18:22<00:00,  4.91it/s]


Starting validation .........
Starting batch 0
Starting batch 500
Starting batch 1000
Starting batch 1500
Starting batch 2000
Epoch train loss : 4.65327239948303| Time: 23m 13s
Epoch valid loss: 5.431483743184093
Epoch EM: 28.997161778618732
Epoch F1: 40.3208747093589
Epoch 2
Starting training ........


100%|███████████████████████████████████████| 5413/5413 [18:42<00:00,  4.82it/s]


Starting validation .........
Starting batch 0
Starting batch 500
Starting batch 1000
Starting batch 1500
Starting batch 2000
Epoch train loss : 4.434699736195572| Time: 23m 29s
Epoch valid loss: 5.400505067141198
Epoch EM: 30.558183538315987
Epoch F1: 42.27860882461797
Epoch 3
Starting training ........


100%|███████████████████████████████████████| 5413/5413 [19:09<00:00,  4.71it/s]


Starting validation .........
Starting batch 0
Starting batch 500
Starting batch 1000
Starting batch 1500
Starting batch 2000
Epoch train loss : 4.257521064823664| Time: 23m 56s
Epoch valid loss: 5.381577951861407
Epoch EM: 31.078524124881742
Epoch F1: 42.64638078980324
Epoch 4
Starting training ........


100%|███████████████████████████████████████| 5413/5413 [18:20<00:00,  4.92it/s]


Starting validation .........
Starting batch 0
Starting batch 500
Starting batch 1000
Starting batch 1500
Starting batch 2000
Epoch train loss : 4.118458608900049| Time: 23m 7s
Epoch valid loss: 5.4124905886267545
Epoch EM: 31.163670766319772
Epoch F1: 42.90313216454146
Epoch 5
Starting training ........


100%|███████████████████████████████████████| 5413/5413 [18:37<00:00,  4.85it/s]


Starting validation .........
Starting batch 0
Starting batch 500
Starting batch 1000
Starting batch 1500
Starting batch 2000
Epoch train loss : 4.000779066203857| Time: 23m 52s
Epoch valid loss: 5.487070748054269
Epoch EM: 31.759697256385998
Epoch F1: 42.93662796486243


In [79]:
import os
os.listdir('./data/Squad')

['squad_train.json', 'squad_dev.json']

## References
* Papers read/referred:
    1. BiDAF: https://arxiv.org/abs/1611.01603
    2. Convolutional Neural Networks for Sentence Classification: https://arxiv.org/abs/1408.5882
    3. Highway Networks: https://arxiv.org/abs/1505.00387
* Other helpful links:
    1. https://nlp.seas.harvard.edu/slides/aaai16.pdf. A great resource for character embeddings. The figures in the character embedding section are taken from here.
    2. https://towardsdatascience.com/the-definitive-guide-to-bi-directional-attention-flow-d0e96e9e666b. A great series of blog posts for understanding BiDAF.
* Reference implementations (some of these repos might be out of date):
    1. https://github.com/allenai/bi-att-flow
    2. https://github.com/galsang
    3. https://github.com/jojonki/BiDAF/