<a href="https://colab.research.google.com/github/tjsiledar/Machine-Learning/blob/master/MemN2N.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Question Answering Model on the Facebook bAbI Dataset Using an End-to-End Memory Network**


The Facebook bAbI dataset consists of 20 tasks. We train the model on each task separately and list the accuracy (%) achieved on each one below. The trained model for each task is provided alongside this file and is named after the task number; for example, model5.h5 is for task 5, Three Arg Relations.


01. Single Supporting Fact - 96.30
02. Two Supporting Facts - 27
03. Three Supporting Facts - 20
04. Two Arg Relations - 100
05. Three Arg Relations - 86.60
06. Yes/No questions - 80
07. Counting - 77.40
08. Lists/Sets - 73.80
09. Simple Negation - 80.30
10. Indefinite Knowledge - 94.40
11. Basic Coreference - 99.20
12. Conjunction - 97.90
13. Compound Coreference - 93.50
14. Time Reasoning - 38.50
15. Basic Deduction - 53.40
16. Basic Induction - 44.80
17. Positional Reasoning - 74
18. Size Reasoning - 92.60
19. Path Finding - 12.20
20. Agent's Motivation - 97.8






In [1]:
#libraries to download data and preprocess it.

import re
import tarfile
import numpy as np
from functools import reduce
from keras.utils.data_utils import get_file
from keras.preprocessing.sequence import pad_sequences

Using TensorFlow backend.


In [0]:
'''
Function to create tokens of a sentence

The input to the tokenize function is a sentence and it returns a list of tokens from the sentence

a = "hello how are you?"
tokenize(a)
result --->  ['hello', 'how', 'are', 'you', '?']

'''

def tokenize(sentence):
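  # Split on runs of non-word characters and keep them as tokens. The earlier pattern '(\W+)?'
  # can match the empty string, which raises a FutureWarning on Python 3.6 and misbehaves on 3.7+.
  return [x.strip() for x in re.split(r'(\W+)', sentence) if x.strip()]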
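
As a quick check of the tokenizer, the sketch below runs it on the docstring example and on one bAbI-style sentence. It assumes the pattern `(\W+)` without a trailing `?`; the second sentence is the first one of the training sample printed later in this notebook.

```python
import re

def tokenize(sentence):
    # The capture group keeps punctuation as its own token;
    # whitespace-only pieces are stripped to '' and filtered out.
    return [x.strip() for x in re.split(r'(\W+)', sentence) if x.strip()]

print(tokenize("hello how are you?"))
# ['hello', 'how', 'are', 'you', '?']
print(tokenize("Bill travelled to the office."))
# ['Bill', 'travelled', 'to', 'the', 'office', '.']
```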

In [0]:
'''
Function to create tuples of each episode

The input to parse_episodes is a list of raw lines from a task file and the output is a list with one
(story, question, answer) tuple per question.
'''

def parse_episodes(lines):
  
  data = []
  story = []
  
  for line in lines:
    line = line.decode('utf-8').strip()
    nid, line = line.split(' ', 1)
    nid = int(nid)
    
    if nid==1:
      
      # id=1 means new story
      story = []
    
    if '\t' in line:
      # here line is tab separated question, answer and supporting id.
      q, a, support_id = line.split('\t')
      q = tokenize(q)
      substory = [x for x in story if x]
      
      # add substory question and answer as a tuple to the data list.
      # single story ends here and is added as a tuple to the global data list.
      data.append((substory,q,a))
      story.append('')
      
    else:
      
      # here each sentence is tokenized and added to the story list.
      sentence = tokenize(line)
      story.append(sentence)
      
  return data
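
To make the file format concrete, here is a small sketch that feeds a few bAbI-style lines through the same parsing logic. The sample lines are hand-written for illustration (they are not copied from the dataset), and the function is a condensed restatement of parse_episodes rather than a replacement for it.

```python
import re

def tokenize(sentence):
    return [x.strip() for x in re.split(r'(\W+)', sentence) if x.strip()]

def parse_episodes(lines):
    data, story = [], []
    for line in lines:
        nid, text = line.decode('utf-8').strip().split(' ', 1)
        if int(nid) == 1:
            story = []                      # line id 1 starts a new story
        if '\t' in text:
            # question line: question <TAB> answer <TAB> supporting ids
            q, a, _support = text.split('\t')
            data.append(([s for s in story if s], tokenize(q), a))
            story.append('')                # placeholder keeps line positions aligned
        else:
            story.append(tokenize(text))
    return data

# Hypothetical sample in the same layout as the task files (bytes, as read from the tar archive).
sample = [
    b"1 Bill travelled to the office.\n",
    b"2 Bill picked up the football there.\n",
    b"3 What did Bill pick up?\tfootball\t2\n",
    b"4 Bill went to the bedroom.\n",
    b"5 Where is Bill?\tbedroom\t4\n",
]

episodes = parse_episodes(sample)
for story, question, answer in episodes:
    print(len(story), question, answer)
```

Each question produces its own tuple, and the story attached to it contains every statement seen so far, so the second tuple carries three sentences while the first carries two.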

In [0]:
'''
Function to create a list of all episodes from the file
'''

def get_episodes(file):
  
  # file.readlines() returns all the sentences in the file in the form of one list and is then passed to our function parse_episodes.
  data = parse_episodes(file.readlines())
  
  # flatten is defined to convert list of lists into a single list
  flatten = lambda data: reduce(lambda x,y: x+y, data)
  
  data = [(flatten(story), question, answer) for story, question, answer in data]
  return data
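
The flattening step can be illustrated on a toy story. The sketch below compares the `reduce`-based version with `itertools.chain.from_iterable`, which is offered here only as a possible alternative, not as what the notebook uses; the toy sentences are made up.

```python
from functools import reduce
from itertools import chain

# A story as produced by parse_episodes: a list of tokenized sentences.
story = [['Bill', 'went', 'home', '.'], ['Fred', 'left', '.']]

flat_reduce = reduce(lambda x, y: x + y, story)
flat_chain = list(chain.from_iterable(story))
print(flat_reduce)
# ['Bill', 'went', 'home', '.', 'Fred', 'left', '.']

# Without an initial value, reduce raises TypeError on an empty story,
# whereas chain.from_iterable simply yields nothing.
empty = list(chain.from_iterable([]))
print(empty)
# []
```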

In [0]:
'''
Function to vectorize episodes

Converts the stories, questions and answers into integer vectors using the word-to-index dictionary.
'''

def vectorize(data, word_idx, story_maxlen, query_maxlen):
  stories, queries, answers = [], [], []
  
  for story, query, answer in data:
    
    # Map each word to its index so that stories and questions become lists of integers; the answer becomes a one-hot vector.
    stories.append([word_idx[w] for w in story])
    queries.append([word_idx[w] for w in query])
    y = np.zeros(len(word_idx)+1)
    y[word_idx[answer]]=1
    answers.append(y)
    
  # pad_sequences zero-pads every sequence (on the left by default) up to maxlen.
  return (pad_sequences(stories, maxlen=story_maxlen), pad_sequences(queries, maxlen=query_maxlen), np.array(answers))
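
A worked example may help here. The sketch below reproduces the vectorization of one query and one answer in plain Python, using a handful of entries from the word_idx mapping that this notebook prints further down. `pad_left` is a hypothetical stand-in for the default left-padding behaviour of `pad_sequences`, written out so the example runs without Keras.

```python
# A few entries copied from the word_idx mapping printed later in this notebook.
word_idx = {'?': 2, 'Bill': 3, 'Fred': 4, 'What': 7, 'did': 13,
            'football': 17, 'give': 20, 'to': 37}
vocab_size = 42       # len(vocab) + 1, as reported by the notebook
query_maxlen = 8

def pad_left(ids, maxlen):
    # Hypothetical helper: keep the last maxlen ids, then zero-pad on the left.
    ids = ids[-maxlen:]
    return [0] * (maxlen - len(ids)) + ids

query = ['What', 'did', 'Bill', 'give', 'to', 'Fred', '?']
query_vec = pad_left([word_idx[w] for w in query], query_maxlen)

# One-hot answer vector with a slot for every index, including padding index 0.
answer_vec = [0.0] * vocab_size
answer_vec[word_idx['football']] = 1.0

print(query_vec)                # [0, 7, 13, 3, 20, 37, 4, 2]
print(answer_vec.index(1.0))    # 17
```

Note that `word_idx[w]` raises a `KeyError` for any token outside the vocabulary, so every input word has to appear in the training or test set.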

In [6]:
#downloading the dataset

path = get_file('babi-tasks-v1-2.tar.gz', origin='https://s3.amazonaws.com/text-datasets/babi_tasks_1-20_v1-2.tar.gz')
file = tarfile.open(path)

Downloading data from https://s3.amazonaws.com/text-datasets/babi_tasks_1-20_v1-2.tar.gz


In [0]:
# File-name templates for the 20 tasks of the Facebook bAbI dataset; '{}' is filled with 'train' or 'test'.

dataset = {
    
    'single-supporting-fact' : 'tasks_1-20_v1-2/en-10k/qa1_single-supporting-fact_{}.txt',
    'two-supporting-facts' : 'tasks_1-20_v1-2/en-10k/qa2_two-supporting-facts_{}.txt',
    'three-supporting-facts' : 'tasks_1-20_v1-2/en-10k/qa3_three-supporting-facts_{}.txt',
    'two-arg-relations' : 'tasks_1-20_v1-2/en-10k/qa4_two-arg-relations_{}.txt',
    'three-arg-relations' : 'tasks_1-20_v1-2/en-10k/qa5_three-arg-relations_{}.txt',
    'yes-no-questions' : 'tasks_1-20_v1-2/en-10k/qa6_yes-no-questions_{}.txt',
    'counting' : 'tasks_1-20_v1-2/en-10k/qa7_counting_{}.txt',
    'lists-sets' : 'tasks_1-20_v1-2/en-10k/qa8_lists-sets_{}.txt',
    'simple-negation' : 'tasks_1-20_v1-2/en-10k/qa9_simple-negation_{}.txt',
    'indefinite-knowledge' : 'tasks_1-20_v1-2/en-10k/qa10_indefinite-knowledge_{}.txt',
    'basic-coreference' : 'tasks_1-20_v1-2/en-10k/qa11_basic-coreference_{}.txt',
    'conjunction' : 'tasks_1-20_v1-2/en-10k/qa12_conjunction_{}.txt',
    'compound-coreference' : 'tasks_1-20_v1-2/en-10k/qa13_compound-coreference_{}.txt',
    'time-reasoning' : 'tasks_1-20_v1-2/en-10k/qa14_time-reasoning_{}.txt',
    'basic-deduction' : 'tasks_1-20_v1-2/en-10k/qa15_basic-deduction_{}.txt',
    'basic-induction' : 'tasks_1-20_v1-2/en-10k/qa16_basic-induction_{}.txt',
    'positional-reasoning' : 'tasks_1-20_v1-2/en-10k/qa17_positional-reasoning_{}.txt',
    'size-reasoning' : 'tasks_1-20_v1-2/en-10k/qa18_size-reasoning_{}.txt',
    'path-finding' : 'tasks_1-20_v1-2/en-10k/qa19_path-finding_{}.txt',
    'agents-motivations' : 'tasks_1-20_v1-2/en-10k/qa20_agents-motivations_{}.txt',
    
}

current_dataset = 'three-arg-relations'

dataset = dataset[current_dataset]
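
The `{}` placeholder in each template is filled in later with `'train'` or `'test'`. A minimal sketch of that substitution, using the task 5 template from the dictionary above (the variable names are chosen for this example only):

```python
template = 'tasks_1-20_v1-2/en-10k/qa5_three-arg-relations_{}.txt'

train_path = template.format('train')
test_path = template.format('test')

print(train_path)
# tasks_1-20_v1-2/en-10k/qa5_three-arg-relations_train.txt
print(test_path)
# tasks_1-20_v1-2/en-10k/qa5_three-arg-relations_test.txt
```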

In [8]:
#Extracting train and test datasets

train_set = get_episodes(file.extractfile(dataset.format('train')))
test_set = get_episodes(file.extractfile(dataset.format('test')))



In [9]:
#checking our sets

print(len(train_set))
print(len(test_set))

print(train_set[0])

10000
1000
(['Bill', 'travelled', 'to', 'the', 'office', '.', 'Bill', 'picked', 'up', 'the', 'football', 'there', '.', 'Bill', 'went', 'to', 'the', 'bedroom', '.', 'Bill', 'gave', 'the', 'football', 'to', 'Fred', '.'], ['What', 'did', 'Bill', 'give', 'to', 'Fred', '?'], 'football')


In [0]:
#creating a vocabulary from our sentences and sorting it

vocab = set()

for story, query, answer in train_set + test_set:
  vocab |= set(story + query + [answer])
vocab = sorted(vocab)

In [0]:
# Index 0 is reserved for padding, so the total vocabulary size is len(vocab) + 1.

vocab_size = len(vocab) + 1 

In [0]:
#calculating the maximum length of story and query

story_maxlen = max(map(len, (x for x,_,_ in train_set + test_set)))
query_maxlen = max(map(len, (x for _,x,_ in train_set + test_set)))

In [0]:
# creating word to index and index to word dictionary

word_idx = dict((c,i+1) for i,c in enumerate(vocab))
idx_word = dict((i+1, c) for i,c in enumerate(vocab))
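
The vocabulary and index construction can be followed end to end on a toy dataset. The two episodes below are invented for illustration; the logic mirrors the cells above.

```python
# Hypothetical miniature train and test sets in the (story, query, answer) layout.
train_set = [(['Bill', 'went', 'home', '.'], ['Where', 'is', 'Bill', '?'], 'home')]
test_set = [(['Fred', 'left', '.'], ['Who', 'left', '?'], 'Fred')]

vocab = set()
for story, query, answer in train_set + test_set:
    vocab |= set(story + query + [answer])
vocab = sorted(vocab)   # punctuation and capitalized words sort before lowercase ones

# Indices start at 1 because 0 is reserved for padding.
word_idx = {w: i + 1 for i, w in enumerate(vocab)}
idx_word = {i: w for w, i in word_idx.items()}
vocab_size = len(vocab) + 1

print(vocab)
# ['.', '?', 'Bill', 'Fred', 'Where', 'Who', 'home', 'is', 'left', 'went']
print(vocab_size)
# 11
```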

In [14]:
print(vocab_size)
print(word_idx)

42
{'.': 1, '?': 2, 'Bill': 3, 'Fred': 4, 'Jeff': 5, 'Mary': 6, 'What': 7, 'Who': 8, 'apple': 9, 'back': 10, 'bathroom': 11, 'bedroom': 12, 'did': 13, 'discarded': 14, 'down': 15, 'dropped': 16, 'football': 17, 'garden': 18, 'gave': 19, 'give': 20, 'got': 21, 'grabbed': 22, 'hallway': 23, 'handed': 24, 'journeyed': 25, 'kitchen': 26, 'left': 27, 'milk': 28, 'moved': 29, 'office': 30, 'passed': 31, 'picked': 32, 'put': 33, 'received': 34, 'the': 35, 'there': 36, 'to': 37, 'took': 38, 'travelled': 39, 'up': 40, 'went': 41}


In [0]:
# vectorizing story, query and answer using vocab

stories_train, queries_train, answers_train = vectorize(train_set, word_idx, story_maxlen, query_maxlen)
stories_test, queries_test, answers_test = vectorize(test_set, word_idx, story_maxlen, query_maxlen)

In [16]:
print(stories_train.shape)
stories_train[0]

(10000, 782)


array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0

In [17]:
print(queries_train.shape)
queries_train[0]

(10000, 8)


array([ 0,  7, 13,  3, 20, 37,  4,  2], dtype=int32)

In [18]:
print(answers_train.shape)
answers_train[0,:]

(10000, 42)


array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0.])

Let's start building our End-to-End Memory Network model.

In [0]:
import keras
from keras.models import Sequential, Model
from keras.layers.embeddings import Embedding
from keras.layers import Permute, dot, add, concatenate
from keras.layers import LSTM, Dense, Dropout, Input, Activation

In [0]:
# Number of epochs to run
train_epochs = 100
# Training batch size
batch_size = 32
# Hidden embedding size
embed_size = 50
# Number of nodes in LSTM layer
lstm_size = 64
# Dropout rate
dropout_rate = 0.30

In [21]:
#placeholders

input_sequence = Input((story_maxlen,))
question = Input((query_maxlen,))

W0731 03:45:11.799602 139678408763264 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:74: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0731 03:45:11.841530 139678408763264 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.



In [22]:
#encoders

input_encoder_m = Sequential()
input_encoder_m.add(Embedding(input_dim=vocab_size, output_dim=embed_size))
input_encoder_m.add(Dropout(dropout_rate))

input_encoder_c = Sequential()
input_encoder_c.add(Embedding(input_dim=vocab_size, output_dim=query_maxlen))
input_encoder_c.add(Dropout(dropout_rate))

question_encoder = Sequential()
question_encoder.add(Embedding(input_dim=vocab_size, output_dim=embed_size, input_length=query_maxlen))
question_encoder.add(Dropout(dropout_rate))

W0731 03:45:12.465162 139678408763264 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

W0731 03:45:12.484796 139678408763264 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

W0731 03:45:12.495900 139678408763264 deprecation.py:506] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3445: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


In [0]:
input_encoded_m = input_encoder_m(input_sequence)
input_encoded_c = input_encoder_c(input_sequence)
question_encoded = question_encoder(question)

In [0]:
# compute match between first input vector and question vector

match = dot([input_encoded_m, question_encoded], axes=-1, normalize=False)
match = Activation('softmax')(match)
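
The match step can be sketched for a single sample in plain Python. The dot product contracts the embedding axis, giving one score per (story position, query position) pair, and `Activation('softmax')` then normalizes over the last axis. The tiny vectors below are made up, and `match_weights` is a hypothetical name used only for this illustration.

```python
import math

def softmax(row):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def match_weights(story_vecs, query_vecs):
    # story_vecs: story_len x embed, query_vecs: query_len x embed.
    # Result: story_len x query_len, with softmax applied to each row (last axis).
    scores = [[sum(s * q for s, q in zip(sv, qv)) for qv in query_vecs]
              for sv in story_vecs]
    return [softmax(row) for row in scores]

story_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 story positions, embedding size 2
query_vecs = [[1.0, 0.0], [0.5, 0.5]]                # 2 query positions

weights = match_weights(story_vecs, query_vecs)
for row in weights:
    print([round(w, 3) for w in row])
```

In the notebook the same computation yields a tensor of shape (story_maxlen, query_maxlen) per sample.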

In [0]:
# add match to the second input vector

res = add([match, input_encoded_c])
res = Permute((2,1))(res)

In [0]:
# concatenate the response vector with question vector

answer = concatenate([res, question_encoded])
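
It can help to trace the per-sample tensor shapes through these layers (batch dimension omitted). The sketch below does only the shape arithmetic, using the sizes printed earlier in this notebook; it also shows why input_encoder_c embeds into query_maxlen dimensions, since `add` needs both operands to have the same shape.

```python
story_maxlen, query_maxlen = 782, 8     # from the shapes printed earlier
vocab_size, embed_size = 42, 50

input_encoded_m = (story_maxlen, embed_size)      # (782, 50)
input_encoded_c = (story_maxlen, query_maxlen)    # (782, 8)
question_encoded = (query_maxlen, embed_size)     # (8, 50)

# dot(..., axes=-1) contracts the embedding axis of both inputs.
match = (input_encoded_m[0], question_encoded[0])
assert match == input_encoded_c                   # shapes must agree for add([...])

res = (match[1], match[0])                        # Permute((2, 1)) swaps the two axes
answer = (res[0], res[1] + question_encoded[1])   # concatenate joins along the last axis

print(match)     # (782, 8)
print(res)       # (8, 782)
print(answer)    # (8, 832)

# Parameter count of an Embedding(vocab_size, embed_size) layer.
print(vocab_size * embed_size)   # 2100
```

The value 2100 agrees with the parameter count that model.summary() reports for the embedding-based encoders.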

In [0]:
answer = LSTM(lstm_size)(answer)
answer = Dropout(dropout_rate)(answer)
answer = Dense(vocab_size)(answer)
answer = Activation('softmax')(answer)

In [28]:
# building the model

model = Model([input_sequence, question], answer)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

W0731 03:45:15.915228 139678408763264 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0731 03:45:15.947654 139678408763264 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3295: The name tf.log is deprecated. Please use tf.math.log instead.



In [29]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 782)          0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 8)            0                                            
__________________________________________________________________________________________________
sequential_1 (Sequential)       multiple             2100        input_1[0][0]                    
__________________________________________________________________________________________________
sequential_3 (Sequential)       (None, 8, 50)        2100        input_2[0][0]                    
__________________________________________________________________________________________________
dot_1 (Dot

In [30]:
model.fit([stories_train, queries_train], answers_train, batch_size=batch_size, epochs=train_epochs, validation_data=([stories_test, queries_test], answers_test))

W0731 03:45:18.606018 139678408763264 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_grad.py:1250: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


Train on 10000 samples, validate on 1000 samples
Epoch 1/100
...
Epoch 74/10

<keras.callbacks.History at 0x7f0919c3c978>

In [0]:
#saving our model

model.save('model5.h5')

In [32]:
# checking predictions on first 10 stories from our test set

for i in range(0,10):
  c_input = test_set[i]
  
  # vectorizing our test story
  c_story, c_query, c_answer = vectorize([c_input], word_idx, story_maxlen, query_maxlen)
  
  # using our model to predict 
  c_prediction = model.predict([c_story, c_query])
  c_prediction = idx_word[np.argmax(c_prediction)]
  
  #printing our output
  print(' '.join(c_input[0]), ' '.join(c_input[1]))
  print("Prediction " + str(c_prediction) + " | " + "Answer : " + str(c_input[2]))
  print("-----------------------------------------------------------------------------------------")

Fred picked up the football there . Fred gave the football to Jeff . What did Fred give to Jeff ?
Prediction football | Answer : football
-----------------------------------------------------------------------------------------
Fred picked up the football there . Fred gave the football to Jeff . Bill went back to the bathroom . Jeff grabbed the milk there . Who gave the football to Jeff ?
Prediction Fred | Answer : Fred
-----------------------------------------------------------------------------------------
Fred picked up the football there . Fred gave the football to Jeff . Bill went back to the bathroom . Jeff grabbed the milk there . Jeff gave the football to Fred . Fred handed the football to Jeff . What did Fred give to Jeff ?
Prediction football | Answer : football
-----------------------------------------------------------------------------------------
Fred picked up the football there . Fred gave the football to Jeff . Bill went back to the bathroom . Jeff grabbed the milk the
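
The decoding step used above (argmax over the output distribution, then a lookup in idx_word) can be sketched without NumPy. One detail worth noting is that index 0 is the padding slot and has no entry in idx_word, so a direct lookup would raise a `KeyError` if the model ever put its highest probability there. The `decode` helper and the `'<pad>'` label are hypothetical names for this illustration, and the probabilities are made up.

```python
# Subset of the notebook's index-to-word mapping.
idx_word = {1: '.', 2: '?', 3: 'Bill', 4: 'Fred', 17: 'football'}

def decode(probs, idx_word, pad_token='<pad>'):
    # Pure-Python argmax over a list of probabilities.
    best = max(range(len(probs)), key=probs.__getitem__)
    # Fall back to a placeholder for indices without a word (such as padding index 0).
    return idx_word.get(best, pad_token)

probs = [0.0] * 42
probs[17] = 0.9
probs[4] = 0.1
print(decode(probs, idx_word))                  # football

padding_heavy = [1.0] + [0.0] * 41
print(decode(padding_heavy, idx_word))          # <pad>
```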

In [0]:
'''
Separate every token in the story and query with a single space, including the full stop and the question mark.

The story and question may only contain words from our vocabulary.
'''
while True:
  print('Please input a story')
  user_story_inp = input().split(' ')
  print('Please input a query')
  user_query_inp = input().split(' ')
  user_story, user_query, user_ans = vectorize([[user_story_inp, user_query_inp, '.']], word_idx, story_maxlen, query_maxlen)
  user_prediction = model.predict([user_story, user_query])
  user_prediction = idx_word[np.argmax(user_prediction)]
  print('Result')
  print(' '.join(user_story_inp), ' '.join(user_query_inp))
  print('| Prediction:', user_prediction)

Please input a story
Jeff travelled to the garden . Mary moved to the kitchen . Mary moved to the kitchen . Fred picked up the football there . Bill went to the bathroom . Fred gave the football to Mary . Bill travelled to the hallway .
Please input a query
Who received the football ?
Result
Jeff travelled to the garden . Mary moved to the kitchen . Mary moved to the kitchen . Fred picked up the football there . Bill went to the bathroom . Fred gave the football to Mary . Bill travelled to the hallway . Who received the football ?
| Prediction: Mary
Please input a story
Jeff travelled to the garden . Mary moved to the kitchen . Mary moved to the kitchen . Fred picked up the football there . Bill went to the bathroom . Fred gave the football to Mary . Bill travelled to the hallway .
Please input a query
Who gave the football ?
Result
Jeff travelled to the garden . Mary moved to the kitchen . Mary moved to the kitchen . Fred picked up the football there . Bill went to the bathroom . Fred