<a href="https://colab.research.google.com/github/sandipanbasu/aiml-capstone/blob/master/mrc_LSTM_baseline0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Import Libraries and Mount Google Drive

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [41]:
tf.__version__

'2.2.0'

In [0]:
import tensorflow as tf
import pickle
from tensorflow.keras import layers
from tensorflow.keras import preprocessing
import numpy as np
import pandas as pd
import json
from sklearn.model_selection import train_test_split
import pprint


In [0]:
# we will store the params as we go along in this object
params = {}
project_path = "/content/drive/My Drive/AIML-MRC-Capstone/datasets/Squad2.0/TrainingDataset/"
model_path = "/content/drive/My Drive/AIML-MRC-Capstone/models/"

# Objective - LSTM Baseline 0 

*   **Inputs: A question q = {q1, ..., qQ} of length Q and a context paragraph p = {p1, ..., pP } of length P.**
*   **Output: An answer span {as, ae} where as is the index of the first answer token in p, ae is the index of the last answer token in p, and 0 <= as <= ae < P.**
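
To make the span notation concrete, here is a minimal sketch with a made-up context. The tokens and indices are illustrative and not taken from SQuAD, and the sketch assumes 0-based, inclusive token indices.

```python
# Hypothetical context tokens (0-indexed) and an inclusive answer span.
context_tokens = ["beyonce", "rose", "to", "fame", "in", "the", "late", "1990s"]
a_s, a_e = 4, 7  # index of the first and of the last answer token

# A valid span satisfies 0 <= a_s <= a_e < P, with P the context length.
P = len(context_tokens)
assert 0 <= a_s <= a_e < P

answer = " ".join(context_tokens[a_s : a_e + 1])
print(answer)  # in the late 1990s
```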



## 2. Load Squad Data - Cleaned and curated (output of preprocessing step)

### 2.1 Load Data

In [8]:
squad_df = pd.read_csv(project_path+'squad_data_final.csv')
squad_df.drop('Unnamed: 0',axis=1,inplace=True)
squad_df.head(2)

Unnamed: 0,title,context,question,id,answer_start,answer,plausible_answer_start,plausible_answer,is_impossible,clean_context,clean_question,clean_answer,answer_len,answer_end,answer_span,answer_word_span
0,Beyoncé,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,When did Beyonce start becoming popular?,56be85543aeaaa14008c9063,269,in the late 1990s,,,False,beyonc giselle knowlescarter bijnse beeyonsay ...,when did beyonce start becoming popular,in the late 1990s,17,286,"(269, 286)","(-1, -1)"
1,Beyoncé,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,What areas did Beyonce compete in when she was...,56be85543aeaaa14008c9065,207,singing and dancing,,,False,beyonc giselle knowlescarter bijnse beeyonsay ...,what areas did beyonce compete in when she was...,singing and dancing,19,226,"(207, 226)","(21, 23)"


### 2.2 Create Train, Validation and Test data

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.utils import resample,shuffle

# train = resample(train)
# train = shuffle(train,n_samples =50000)

train,test = train_test_split(squad_df,test_size = 0.2)
train,val = train_test_split(train,test_size=0.25)

print(train.shape)
print(val.shape)
print(test.shape)

(78183, 16)
(26061, 16)
(26062, 16)
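
The two calls to train_test_split amount to a 60/20/20 split, because the second call takes 25% of the remaining 80%. A short check of that arithmetic, using only the fractions from the cell and the row counts printed above:

```python
from fractions import Fraction

test_frac = Fraction(1, 5)         # test_size=0.2 in the first split
val_frac_of_rest = Fraction(1, 4)  # test_size=0.25 in the second split

val_frac = (1 - test_frac) * val_frac_of_rest
train_frac = 1 - test_frac - val_frac
print(train_frac, val_frac, test_frac)  # 3/5 1/5 1/5

# The row counts printed above add up to the full dataset size.
n_train, n_val, n_test = 78183, 26061, 26062
n_total = n_train + n_val + n_test
print(n_total, round(n_train / n_total, 3))  # 130306 0.6
```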


In [10]:
import ast

# Parse the stringified "(start, end)" tuples; ast.literal_eval only accepts Python literals, unlike eval
train["answer_word_span"] = train["answer_word_span"].apply(ast.literal_eval)
test["answer_word_span"] = test["answer_word_span"].apply(ast.literal_eval)
val["answer_word_span"] = val["answer_word_span"].apply(ast.literal_eval)
train.info()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


<class 'pandas.core.frame.DataFrame'>
Int64Index: 78183 entries, 115157 to 106639
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   title                   78183 non-null  object 
 1   context                 78183 non-null  object 
 2   question                78183 non-null  object 
 3   id                      78183 non-null  object 
 4   answer_start            78183 non-null  int64  
 5   answer                  51967 non-null  object 
 6   plausible_answer_start  26215 non-null  float64
 7   plausible_answer        26215 non-null  object 
 8   is_impossible           78183 non-null  bool   
 9   clean_context           78183 non-null  object 
 10  clean_question          78183 non-null  object 
 11  clean_answer            78183 non-null  object 
 12  answer_len              78183 non-null  int64  
 13  answer_end              78183 non-null  int64  
 14  answer_span             78183 no

### 2.3 Build Tokenizer

In [11]:
from tqdm import tqdm
params['tokenizer_num_words'] = 80000
tokenizer = preprocessing.text.Tokenizer(num_words=params['tokenizer_num_words'])

# NOTE: the tokenizer is fit on the full original dataset (train, validation and test rows)
for text in tqdm([squad_df['clean_context'], squad_df['clean_question']]):  
  tokenizer.fit_on_texts(text.values)

# total tokenizer words
params['vocab_size'] = len(tokenizer.word_index)

### SAVE TOKENIZERS
with open(model_path + "tokenizer.pkl","wb") as f:
    pickle.dump(tokenizer,f)

100%|██████████| 2/2 [00:08<00:00,  4.29s/it]
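
Note that vocab_size is taken from word_index, which can exceed num_words. As far as I understand the Keras Tokenizer, word_index keeps every word seen during fitting, and num_words is only applied when texts are converted to sequences (indices at or above num_words are dropped). The sketch below is a pure-Python imitation of that behaviour, not the Keras implementation; fit_word_index and texts_to_sequences are hypothetical helper names.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency, most frequent first, starting at index 1."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts, key=lambda w: -counts[w])  # stable for ties
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Map words to indices, dropping indices that are >= num_words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

texts = ["the cat sat", "the cat ran", "the dog"]
word_index = fit_word_index(texts)
print(len(word_index))                                      # 5
print(texts_to_sequences(texts, word_index, num_words=3))   # [[1, 2], [1, 2], [1]]
```

Under that assumption, an embedding sized from num_words is large enough for the sequences even though word_index is bigger.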


### 2.4 Update parameters

In [12]:
# From the EDA and histograms we can conclude that:
# 99th percentile of context length in words = 285
# 99th percentile of question length in words = 20
context_length = 285
question_length = 20
params['train_shape'] = train.shape
params['val_shape'] = val.shape
params['test_shape'] = test.shape
params['context_length_99'] = context_length # initialize with a high percentile
params['question_length_99'] = question_length # initialize with a high percentile
params['embedding_size'] = 100
params['rnn_units'] = 256
params['context_pad_seq'] = 'pre'
params['question_pad_seq'] = 'pre'

pprint.pprint(params)

{'context_length_99': 285,
 'context_pad_seq': 'pre',
 'embedding_size': 100,
 'question_length_99': 20,
 'question_pad_seq': 'pre',
 'rnn_units': 256,
 'test_shape': (26062, 16),
 'tokenizer_num_words': 80000,
 'train_shape': (78183, 16),
 'val_shape': (26061, 16),
 'vocab_size': 100850}
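
The 99th-percentile lengths above come from the EDA step, which is not shown here. As a rough sketch of how such a cutoff can be derived, the helper below uses the nearest-rank definition of a percentile; the lengths are made up and percentile_nearest_rank is a hypothetical name.

```python
import math

def percentile_nearest_rank(values, q):
    """Smallest value such that at least q percent of the data is <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical token counts per context; the real ones come from the EDA.
context_lengths = [90, 120, 110, 95, 300, 105, 130, 100, 115, 125]
print(percentile_nearest_rank(context_lengths, 90))  # 130
print(percentile_nearest_rank(context_lengths, 99))  # 300
```

Choosing a high percentile rather than the maximum keeps padded inputs short while truncating only a small share of examples.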


### 2.5 Vectorization / Encoding

#### 2.5.1 Integer Sequence of Context and Question 

In [0]:
train_clean_context_sequence = tokenizer.texts_to_sequences(train["clean_context"].values)
test_clean_context_sequence = tokenizer.texts_to_sequences(test["clean_context"].values)
val_clean_context_sequence = tokenizer.texts_to_sequences(val["clean_context"].values)


train_clean_question_sequence = tokenizer.texts_to_sequences(train["clean_question"].values)
test_clean_question_sequence = tokenizer.texts_to_sequences(test["clean_question"].values)
val_clean_question_sequence = tokenizer.texts_to_sequences(val["clean_question"].values)


#### 2.5.2 Find Max Sequence length of Context and Question

In [14]:
# max length of context
params['context_max_length'] = max(max(len(txt) for txt in train_clean_context_sequence),
                                  max(len(txt) for txt in test_clean_context_sequence),
                                  max(len(txt) for txt in val_clean_context_sequence))

params['question_max_length'] = max(max(len(txt) for txt in train_clean_question_sequence),
                                  max(len(txt) for txt in test_clean_question_sequence),
                                  max(len(txt) for txt in val_clean_question_sequence))


pprint.pprint(params)

{'context_length_99': 285,
 'context_max_length': 426,
 'context_pad_seq': 'pre',
 'embedding_size': 100,
 'question_length_99': 20,
 'question_max_length': 40,
 'question_pad_seq': 'pre',
 'rnn_units': 256,
 'test_shape': (26062, 16),
 'tokenizer_num_words': 80000,
 'train_shape': (78183, 16),
 'val_shape': (26061, 16),
 'vocab_size': 100850}


#### 2.5.3 Padding of the sequences

In [15]:
train_context_sequence = preprocessing.sequence.pad_sequences(train_clean_context_sequence,maxlen=params['context_max_length'])

print("Max context Sequence length is {}".format(train_context_sequence.shape[1]))

test_context_sequence = preprocessing.sequence.pad_sequences(test_clean_context_sequence,maxlen=params['context_max_length'])
val_context_sequence = preprocessing.sequence.pad_sequences(val_clean_context_sequence,maxlen=params['context_max_length'])

print(train_context_sequence.shape)
print(test_context_sequence.shape)
print(val_context_sequence.shape)

Max context Sequence length is 426
(78183, 426)
(26062, 426)
(26061, 426)


In [17]:
train_question_sequence = preprocessing.sequence.pad_sequences(train_clean_question_sequence,maxlen=params['question_max_length'])
print("Max Question Sequence length is {}".format(train_question_sequence.shape[1]))
test_question_sequence = preprocessing.sequence.pad_sequences(test_clean_question_sequence,maxlen=params['question_max_length'])
val_question_sequence = preprocessing.sequence.pad_sequences(val_clean_question_sequence,maxlen=params['question_max_length'])

print(train_question_sequence.shape)
print(test_question_sequence.shape)
print(val_question_sequence.shape)


Max Question Sequence length is 40
(78183, 40)
(26062, 40)
(26061, 40)
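
pad_sequences pads on the left by default ('pre'), which moves every real token to the right. The sketch below imitates that in plain Python to show the effect on span indices. It assumes that answer_word_span indexes the unpadded token list, which depends on the preprocessing step and is not confirmed here; pad_pre is a hypothetical helper.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`; if too long, keep the last `maxlen` items."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

token_ids = [11, 12, 13, 14, 15]   # hypothetical 5-token context
start, end = 1, 3                  # span over the unpadded tokens: 12, 13, 14
maxlen = 8

padded = pad_pre(token_ids, maxlen)
offset = maxlen - len(token_ids)   # number of leading zeros
print(padded)                                      # [0, 0, 0, 11, 12, 13, 14, 15]
print(padded[start + offset : end + offset + 1])   # [12, 13, 14]
```

If that assumption holds, the labels need the same offset as the inputs, or the contexts can be padded with padding='post' so that positions stay unchanged.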


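#### 2.5.4 Create Answer Sequence

Encode each target as a single array: a one-hot vector for the answer start position concatenated with a one-hot vector for the answer end position, each of length context_max_length. The loss function has to split the array the same way.

**y_true = concatenate(one_hot(answer_start), one_hot(answer_end))**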
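
The notebook does not show the loss at this point. One plausible form, sketched here in plain Python, is to split both the target and the prediction into a start half and an end half and sum the two categorical cross-entropies. span_loss and its eps parameter are hypothetical, and the probabilities are made up.

```python
import math

def span_loss(y_true, y_pred, context_len, eps=1e-9):
    """Sum of the cross-entropies over the start half and the end half."""
    def cross_entropy(true_half, pred_half):
        return -sum(t * math.log(p + eps) for t, p in zip(true_half, pred_half))

    start_loss = cross_entropy(y_true[:context_len], y_pred[:context_len])
    end_loss = cross_entropy(y_true[context_len:], y_pred[context_len:])
    return start_loss + end_loss

# Toy example with a 4-token context: start at index 1, end at index 2.
y_true = [0, 1, 0, 0] + [0, 0, 1, 0]
y_pred = [0.1, 0.7, 0.1, 0.1] + [0.1, 0.2, 0.6, 0.1]
loss = span_loss(y_true, y_pred, context_len=4)
print(loss)  # close to -(ln 0.7 + ln 0.6)
```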

In [0]:
# for train data
y_train = []
span_ofr = 0;
params['train_span_outofrange'] = 0
params['test_span_outofrange'] = 0
params['val_span_outofrange'] = 0

for i in range(len(train)):    
    s = np.zeros(params['context_max_length'],dtype = "int")
    e = np.zeros(params['context_max_length'],dtype = "int")
    start, end = train["answer_word_span"].iloc[i]
    # if(start < params['context_length'] and end < params['context_length']):
    s[start] = 1
    e[end] = 1
    # else:
    #   span_ofr = span_ofr + 1
    #   print(start,end)
    y_train.append(np.concatenate((s,e)))    

params['train_span_outofrange'] = span_ofr
span_ofr = 0;

# for test data
y_test = []
for i in range(len(test)):    
    s = np.zeros(params['context_max_length'],dtype = "int")
    e = np.zeros(params['context_max_length'],dtype = "int")        
    start,end = test["answer_word_span"].iloc[i]    
    s[start] = 1
    e[end] = 1
    y_test.append(np.concatenate((s,e)))

params['test_span_outofrange'] = span_ofr
span_ofr = 0;
                
# for val data
y_val = []
for i in range(len(val)):
    s = np.zeros(params['context_max_length'],dtype = "int")
    e = np.zeros(params['context_max_length'],dtype = "int")        
    start,end = val["answer_word_span"].iloc[i]    
    s[start] = 1
    e[end] = 1      
    y_val.append(np.concatenate((s,e)))

params['val_span_outofrange'] = span_ofr    
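
The three loops above do the same thing for each split. A compact sketch of the encoding, written with plain lists so it runs without NumPy (encode_span is a hypothetical helper, not part of the notebook):

```python
def encode_span(start, end, context_len):
    """Return a one-hot start vector followed by a one-hot end vector."""
    s = [0] * context_len
    e = [0] * context_len
    s[start] = 1
    e[end] = 1
    return s + e

print(encode_span(1, 2, context_len=4))    # [0, 1, 0, 0, 0, 0, 1, 0]
# A (-1, -1) span, as in the first row of the data preview, wraps around to
# the last position because of Python's negative indexing.
print(encode_span(-1, -1, context_len=4))  # [0, 0, 0, 1, 0, 0, 0, 1]
```

NumPy arrays index the same way, so rows with a (-1, -1) span may end up labelled at the last context position; whether that is intended is worth checking.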

In [25]:
print(len(y_train),len(y_train[0]))
print(len(y_test),len(y_test[0]))
print(len(y_val),len(y_val[0]))

78183 852
26062 852
26061 852


In [20]:
pprint.pprint(params)

{'context_length_99': 285,
 'context_max_length': 426,
 'context_pad_seq': 'pre',
 'embedding_size': 100,
 'question_length_99': 20,
 'question_max_length': 40,
 'question_pad_seq': 'pre',
 'rnn_units': 256,
 'test_shape': (26062, 16),
 'test_span_outofrange': 0,
 'tokenizer_num_words': 80000,
 'train_shape': (78183, 16),
 'train_span_outofrange': 0,
 'val_shape': (26061, 16),
 'val_span_outofrange': 0,
 'vocab_size': 100850}


### Find Max Sequence Length for both the Sequences

In [20]:
# # max length of context
# max_context_seq_length= max(len(txt) for txt in context_sequence)
# print('max_context_seq_length=',max_context_seq_length)

# vocab size of context
vocab_size=len(tokenizer.word_index)
print('vocab_size=',vocab_size)

# max_question_seq_length=max(len(txt) for txt in questions_sequence)
# print('max_question_seq_length=',max_question_seq_length)


vocab_size= 100850


In [0]:
# From the EDA and histograms we can conclude that:
# 99th percentile of context length in words = 285
# 99th percentile of question length in words = 20
max_context_seq_length = 285
max_question_seq_length = 20

In [15]:
print(squad_df['clean_context'][2000])
print(context_sequence[3000])
print(squad_df['clean_question'][2000])
print(questions_sequence[2000])

october 21 2008 apple reported 14 21 total revenue fiscal quarter 4 year 2008 came ipods september 9 2009 keynote presentation apple event phil schiller announced total cumulative sales ipods exceeded 220 million continual decline ipod sales since 2009 surprising trend apple corporation apple cfo peter oppenheimer explained june 2009 expect traditional mp3 players decline time cannibalize ipod touch iphone since 2009 companys ipod sales continually decreased every financial quarter 2013 new model introduced onto market
[384, 511, 58838, 520, 504, 34, 7655, 7507, 1434, 2740, 3028, 8, 22, 1434, 5566, 48288, 1070, 2828, 4793, 266, 1211, 9403, 2828, 5566, 7837, 3016, 1636, 4027, 697, 1922, 5566, 252, 2271, 4793, 27639, 99, 578, 5016, 142, 1434, 680, 5683, 142, 5566, 7, 58839, 1562, 11769, 4284, 17681, 255, 58840, 1050, 3760, 5209, 24179, 1084, 2740, 3028, 235, 1376, 115, 6740, 179, 1578, 374, 4793, 27639, 578, 8588, 44, 3439, 19, 455, 2820, 574, 193, 578, 87, 13, 11280, 60, 3366, 636]
who 

### Padding the sequences

In [16]:
# padding context
context_input_data= tf.keras.preprocessing.sequence.pad_sequences(context_sequence, maxlen=max_context_seq_length, padding='pre')
# padding question
question_input_data=tf.keras.preprocessing.sequence.pad_sequences(questions_sequence, maxlen=max_question_seq_length,padding='pre')

print(context_input_data.shape)
print(question_input_data.shape)

(130306, 285)
(130306, 20)


# Build the LSTM Model for Both Sequences

In [0]:
embedding_size = 50
rnn_units=256

### Embedding Layer for Context and Question

In [0]:
# CONTEXT LSTM
# input layer
context_input=layers.Input(shape=(max_context_seq_length,),name="CONTEXT_INPUT")
# Build Embedding layer and Get Embedding Layer output

context_embedding_output=layers.Embedding(input_dim=tokenizer.num_words+1, 
                                          output_dim=embedding_size, 
                                          input_length=max_context_seq_length,
                                          name="CONTEXT_EMBEDDING")(context_input)


# QUESTION LSTM
#input layer
question_input=layers.Input(shape=(max_question_seq_length,),name="QUESTION_INPUT")
#Embedding layer and #Embedding layer output
question_embedding_output=layers.Embedding(input_dim=tokenizer.num_words+1, 
                                           output_dim=embedding_size, 
                                           input_length=max_question_seq_length,
                                           name="QUESTION_EMBEDDING")(question_input)

### Encoder Layer for Context and Question

In [0]:
# RNN Encoder with LSTM for context
c_output,c_h, c_s = layers.LSTM(rnn_units,name='CONTEXT_LSTM', return_state=True)(context_embedding_output)
context_states= [c_h, c_s]

# RNN Encoder with LSTM for question
q_output,q_h, q_s= tf.keras.layers.LSTM(rnn_units,name='QUESTION_LSTM',return_state=True)(question_embedding_output)
questions_states = [q_h, q_s]

In [23]:
context_states,questions_states

([<tf.Tensor 'CONTEXT_LSTM_1/Identity_1:0' shape=(None, 256) dtype=float32>,
  <tf.Tensor 'CONTEXT_LSTM_1/Identity_2:0' shape=(None, 256) dtype=float32>],
 [<tf.Tensor 'QUESTION_LSTM_1/Identity_1:0' shape=(None, 256) dtype=float32>,
  <tf.Tensor 'QUESTION_LSTM_1/Identity_2:0' shape=(None, 256) dtype=float32>])

### Concatenate the two LSTM encoder states to get the merged hidden state and cell state

In [0]:
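# NOTE: LSTM returns (output, state_h, state_c), so index 0 is the hidden state and index 1 the cell state.
# The names below are swapped, but the resulting [h, c] order is what initial_state expects.
MERGED_cell_state =layers.concatenate([context_states[0],questions_states[0]],name="CONCAT_CELL_STATE")
MERGED_hidden_state =layers.concatenate([context_states[1],questions_states[1]],name="HIDDEN_CELL_STATE")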

In [14]:
decoder_initial_state = [MERGED_cell_state,MERGED_hidden_state]
decoder_initial_state

[<tf.Tensor 'CONCAT_CELL_STATE/Identity:0' shape=(None, 512) dtype=float32>,
 <tf.Tensor 'HIDDEN_CELL_STATE/Identity:0' shape=(None, 512) dtype=float32>]

# Create Decoder for Answer

# Add Start and End Tokens to Answers

In [0]:
squad_df['answer_start_end']= '<start>' + squad_df['clean_answer'] + '<end>'
squad_df['answer_start_end']=squad_df['answer_start_end'].astype(str)

# Tokenize the Answers

In [0]:
answers_tokenize=tf.keras.preprocessing.text.Tokenizer()
answers_tokenize.fit_on_texts(squad_df['answer_start_end'])

In [17]:
#Vocab
print(len(answers_tokenize.word_index))

41475


In [0]:
#Convert sentences to numbers 
answers_seq = answers_tokenize.texts_to_sequences(squad_df['answer_start_end']) 

In [19]:
print(squad_df['answer_start_end'][2000])
print(answers_seq[2000])

<start>peter oppenheimer<end>
[2, 924, 7781, 1]
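
The four ids above suggest that the markers were split into ordinary words. To my understanding, the Keras Tokenizer lowercases the text and replaces a default set of punctuation characters, including < and >, with spaces before splitting. The sketch below imitates that step in plain Python; simple_word_sequence is a hypothetical helper, not the Keras function.

```python
# Default filter characters of the Keras Tokenizer, as far as I know.
KERAS_DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=KERAS_DEFAULT_FILTERS):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()

print(simple_word_sequence("<start>peter oppenheimer<end>"))
# ['start', 'peter', 'oppenheimer', 'end']
```

One consequence is that the markers share ids with the plain words "start" and "end" if those occur in an answer. Passing a filters string without < and >, and adding spaces around the markers, would keep them distinct.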


# Get maximum length and pad the sequences

In [20]:
squad_df[squad_df['clean_answer'].str.len() > 200]['answer_start_end']

3201    <start>that the sudden shift of a huge quantit...
Name: answer_start_end, dtype: object

In [21]:
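# NOTE: len(txt) on a string counts characters, so this is the longest answer in characters, not in tokens
max_answers_seq_length=max(len(txt) for txt in squad_df['answer_start_end'])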
print('max_answers_seq_length=',max_answers_seq_length)

answers_vocab_size=len(answers_tokenize.word_index)
print('answers_vocab_size=',answers_vocab_size)

max_answers_seq_length= 248
answers_vocab_size= 41475


In [0]:
# From the EDA and histograms we can conclude that:
# 99th percentile of answer length in words = 17
max_answers_seq_length=17

In [0]:
# pad pre
answers_input_data= tf.keras.preprocessing.sequence.pad_sequences(answers_seq,maxlen=max_answers_seq_length,padding='pre')

# Building Decoder Output

In [24]:
answers_input_data.shape

(130306, 17)

In [0]:
# Initialize the target array with the same shape as the decoder input: (130306, 17)
answers_target_data = np.zeros(answers_input_data.shape)

# Shift the target left by one step: target[t] = input[t + 1]; the last step stays 0
answers_target_data[:, :-1] = answers_input_data[:, 1:]
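
The shift sets up teacher forcing: at each step the decoder sees the current token and is trained to predict the next one. A tiny illustration on a shortened, made-up sequence (shift_left is a hypothetical helper):

```python
def shift_left(sequence, pad_value=0):
    """Target at step t is the input token at step t + 1; the last slot is padding."""
    return sequence[1:] + [pad_value]

decoder_input = [0, 0, 2, 924, 7781, 1]   # pre-padded ids: start=2 ... end=1
decoder_target = shift_left(decoder_input)
print(decoder_target)  # [0, 2, 924, 7781, 1, 0]
```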

In [26]:
print(squad_df['answer_start_end'][2000])
print(answers_input_data[2000])

<start>peter oppenheimer<end>
[   0    0    0    0    0    0    0    0    0    0    0    0    0    2
  924 7781    1]


# Convert Answers to one-hot vector

In [0]:
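## Crashing !!
# A dense float64 array of shape (130306, 17, 41476) needs roughly 735 GB, far more than the available RAM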
answers_target_data_one_hot= np.zeros((answers_input_data.shape[0], #number of sentences
                                       answers_input_data.shape[1], #Number of words in each sentence
                                       len(answers_tokenize.word_index)+1)) #Vocab size + 1

In [27]:
print(answers_input_data.shape)
print(len(answers_tokenize.word_index)+1)

(130306, 17)
41476
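
The crash is a memory problem, which a quick size estimate makes clear. The figures below use only the shapes printed above; the int32 alternative assumes the targets are kept as integer class ids and paired with a sparse loss such as Keras' sparse_categorical_crossentropy, which is a suggestion rather than what the notebook does.

```python
n_samples, seq_len, vocab = 130306, 17, 41476  # shapes printed above
bytes_per_float64 = 8                          # np.zeros defaults to float64

dense_bytes = n_samples * seq_len * vocab * bytes_per_float64
sparse_bytes = n_samples * seq_len * 4         # int32 class ids instead of one-hot
print(f"one-hot: {dense_bytes / 1e9:.0f} GB")  # one-hot: 735 GB
print(f"indices: {sparse_bytes / 1e6:.1f} MB") # indices: 8.9 MB
```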


In [0]:
answers_embedding_size = 50
decoder_rnn_units = 512

# Build Decoder

In [0]:
#input layer
answers_inputs=tf.keras.layers.Input(shape=(max_answers_seq_length,),name="ANSWER_INPUT")

#Embedding
answers_embedding_output=tf.keras.layers.Embedding(answers_vocab_size+1, answers_embedding_size, name="ANSWER_EMBEDDING")(answers_inputs)

#lstm layer
answers_lstm= tf.keras.layers.LSTM(decoder_rnn_units,return_sequences=True,name="ANSWER_LSTM", return_state=True)

#LSTM output; the initial state comes from the encoder states (concat of context and question)
#Output will be all hidden states over time, the last 'h' state and the last 'c' state

output,_,_=answers_lstm(answers_embedding_output,initial_state=decoder_initial_state)

#dense layer
lstm3_dense= tf.keras.layers.Dense(answers_vocab_size+1,activation='softmax',name="FINAL_OUTPUT")

#answer output
answer_outputs=lstm3_dense(output)

# Build Model using Encoder (output of concat) and Decoder

In [0]:
model = tf.keras.models.Model([context_input,question_input, answers_inputs],answer_outputs) #Output of the model

In [31]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
CONTEXT_INPUT (InputLayer)      [(None, 285)]        0                                            
__________________________________________________________________________________________________
QUESTION_INPUT (InputLayer)     [(None, 20)]         0                                            
__________________________________________________________________________________________________
CONTEXT_EMBEDDING (Embedding)   (None, 285, 50)      4676500     CONTEXT_INPUT[0][0]              
__________________________________________________________________________________________________
QUESTION_EMBEDDING (Embedding)  (None, 20, 50)       2364500     QUESTION_INPUT[0][0]             
______________________________________________________________________________________________

# Train the Model