# Babi End to End MemNN

In [56]:
%matplotlib inline
from importlib import reload 
import utils2; reload(utils2)
from utils2 import *

In [57]:
np.set_printoptions(4)

A memory network is a network that can retain information; it can be trained on a structured story and will learn how to answer questions about said story.

This notebook contains an implementation of an end-to-end memory network trained on the Babi tasks dataset.

## Create datasets

Code from this section is mainly taken from the babi-memnn example in the keras repo.

* [Popular Science](http://www.popsci.com/facebook-ai)
* [Slate](http://www.slate.com/blogs/future_tense/2016/06/28/facebook_s_ai_researchers_are_making_bots_smarter_by_giving_them_memory.html)

The Babi dataset is a collection of tasks (or stories) that detail events in a particular format. At the end of each task is a question with a labelled answer.

This section shows how to construct the dataset from the raw data.

In [58]:
def tokenize(sent):
    return [x.strip() for x in re.split(r'(\W+)', sent) if x.strip()]

This parser turns each story into a time-ordered, labelled sequence of sentences, followed by the question and its labelled answer.

In [59]:
def parse_stories(lines):
    data = []
    story = []
    for line in lines:
        line = line.decode('utf-8').strip()
        nid, line = line.split(' ', 1)
        if int(nid) == 1: story = []
        if '\t' in line:
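            q, a, supporting = line.split('\t')  # question, answer, supporting-fact ids
            q = tokenize(q)
            # Label each sentence with its position; '' placeholders (question lines) are skipped
            substory = [[str(i)+":"]+x for i,x in enumerate(story) if x]
            data.append((substory, q, a))
            story.append('')  # placeholder so the counter also advances on question lines
        else: story.append(tokenize(line))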
    return data
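
To see what the parser produces, here is a minimal sketch that runs it on a hand-written sample. The `sample` lines are an assumption that imitates the bAbI file format (a line number, then either a sentence or a tab-separated question, answer, and supporting-fact id); they are not read from the archive. The tokenizer here uses the raw-string pattern `r'(\W+)'`.

```python
import re

def tokenize(sent):
    # Split on runs of non-word characters, keeping punctuation as tokens.
    return [x.strip() for x in re.split(r'(\W+)', sent) if x.strip()]

def parse_stories(lines):
    data, story = [], []
    for line in lines:
        line = line.decode('utf-8').strip()
        nid, line = line.split(' ', 1)
        if int(nid) == 1:
            story = []                      # line id 1 starts a new story
        if '\t' in line:                    # question lines are tab-separated
            q, a, supporting = line.split('\t')
            substory = [[str(i) + ":"] + x for i, x in enumerate(story) if x]
            data.append((substory, tokenize(q), a))
            story.append('')                # placeholder keeps the counter in step
        else:
            story.append(tokenize(line))
    return data

# Hand-written sample in the bAbI line format (illustrative only).
sample = [
    b"1 Mary moved to the bathroom.",
    b"2 John went to the hallway.",
    b"3 Where is Mary? \tbathroom\t1",
    b"4 Daniel went back to the hallway.",
    b"5 Where is Daniel? \thallway\t4",
]
parsed = parse_stories(sample)
for substory, q, a in parsed:
    print([s[0] for s in substory], q, a)
```

The second question's story has no `2:` label: question lines leave an empty placeholder in `story`, so each label reflects the sentence's position in the original file.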

In [17]:
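# Understanding parse_stories(lines): the same loop with debug prints.
# It depends on `tar` and `challenge` from the cells below (In [60], In [61]);
# the NameError shown comes from running it before `challenge` was defined.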
f = tar.extractfile(challenge.format('train'))
lines = f.readlines()
print(len(lines))

data = []
story = []
for line in lines[:30]:
    line = line.decode('utf-8').strip()
    nid, line = line.split(' ', 1)
    if int(nid) == 1: story = []
    print(nid, line)
    if '\t' in line:
        q, a, supporting = line.split('\t')
        q = tokenize(q)
        substory = None
        substory = [[str(i)+":"]+x for i, x in enumerate(story) if x]
        data.append( (substory, q, a) )
        story.append('')
        #print(q, a, supporting)
        print("substory:", substory)
    else:
        story.append(tokenize(line))
        
#print("\nDATA:", data)

NameError: name 'challenge' is not defined

Next we download and parse the data set.

In [60]:
path = get_file('babi-tasks-v1-2.tar.gz', 
                origin='https://s3.amazonaws.com/text-datasets/babi_tasks_1-20_v1-2.tar.gz')
tar = tarfile.open(path)

In [61]:
challenges = {
    # QA1 with 10,000 samples
    'single_supporting_fact_10k': 'tasks_1-20_v1-2/en-10k/qa1_single-supporting-fact_{}.txt',
    # QA2 with 10,000 samples
    'two_supporting_facts_10k': 'tasks_1-20_v1-2/en-10k/qa2_two-supporting-facts_{}.txt',
    'two_supporting_facts_1k': 'tasks_1-20_v1-2/en/qa2_two-supporting-facts_{}.txt',
}
challenge_type = 'single_supporting_fact_10k'
# challenge_type = 'two_supporting_facts_10k'
challenge = challenges[challenge_type]

In [62]:
def get_stories(f):
    data = parse_stories(f.readlines())
    return [(story, q, answer) for story, q, answer in data]

In [63]:
train_stories = get_stories(tar.extractfile(challenge.format('train')))
test_stories = get_stories(tar.extractfile(challenge.format('test')))



Here we compute corpus-wide upper bounds: the maximum number of words in a sentence, sentences in a story, and words in a query. We use them later to pad everything to a fixed size.

In [64]:
stories = train_stories + test_stories

In [65]:
stories[1]

([['0:', 'Mary', 'moved', 'to', 'the', 'bathroom', '.'],
  ['1:', 'John', 'went', 'to', 'the', 'hallway', '.'],
  ['3:', 'Daniel', 'went', 'back', 'to', 'the', 'hallway', '.'],
  ['4:', 'Sandra', 'moved', 'to', 'the', 'garden', '.']],
 ['Where', 'is', 'Daniel', '?'],
 'hallway')

In [66]:
story_maxlen = max(len(s) for x, _, _ in stories for s in x)
story_maxsents = max(len(x) for x, _, _ in stories)
query_maxlen = max(len(x) for _, x, _ in stories)

In [67]:
import collections.abc  # collections.Iterable was removed in Python 3.10

def do_flatten(el):
    return isinstance(el, collections.abc.Iterable) and not isinstance(el, (str, bytes))

def flatten(l):
    for el in l:
        if do_flatten(el): yield from flatten(el)
        else: yield el

Create vocabulary of corpus and find size, including a padding element.

In [68]:
vocab = sorted(set(flatten(stories)))
vocab.insert(0, '<PAD>')
vocab_size = len(vocab)

In [69]:
story_maxsents, vocab_size, story_maxlen, query_maxlen, len(train_stories), len(test_stories)

(10, 32, 8, 4, 10000, 1000)

Now the dataset is in the correct format.

Each task in the dataset contains a list of tokenized sentences ordered in time, followed by a question about the story with a given answer.

In the example below, we can go backward through the sentences to find the answer to the question "Where is Daniel?": it is in sentence 12, the last sentence to mention Daniel.

This task structure is called a **"one supporting fact"** structure, which means that we only need to find one sentence in the story to answer our question.

In [70]:
test_stories[534]

([['0:', 'Mary', 'moved', 'to', 'the', 'office', '.'],
  ['1:', 'John', 'moved', 'to', 'the', 'garden', '.'],
  ['3:', 'Sandra', 'moved', 'to', 'the', 'bedroom', '.'],
  ['4:', 'Sandra', 'went', 'back', 'to', 'the', 'office', '.'],
  ['6:', 'John', 'went', 'to', 'the', 'bedroom', '.'],
  ['7:', 'John', 'journeyed', 'to', 'the', 'garden', '.'],
  ['9:', 'Daniel', 'went', 'back', 'to', 'the', 'hallway', '.'],
  ['10:', 'John', 'journeyed', 'to', 'the', 'bedroom', '.'],
  ['12:', 'Daniel', 'journeyed', 'to', 'the', 'bathroom', '.'],
  ['13:', 'John', 'travelled', 'to', 'the', 'garden', '.']],
 ['Where', 'is', 'Daniel', '?'],
 'bathroom')

Create an index mapping for the vocabulary.

In [71]:
word_idx = dict((c,i) for i,c in enumerate(vocab))
#word_idx = {c:i for i,c in enumerate(vocab)}  # equivalent dict comprehension
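
As a small illustration of the vocabulary and index mapping, the sketch below builds both from a single toy story. The story contents are made up for the example; the flattening logic follows the cells above, using `collections.abc.Iterable`.

```python
from collections.abc import Iterable

def flatten(l):
    # Recursively yield leaf elements, treating strings as leaves.
    for el in l:
        if isinstance(el, Iterable) and not isinstance(el, (str, bytes)):
            yield from flatten(el)
        else:
            yield el

# One toy (story, question, answer) triple, illustrative only.
toy_stories = [([['0:', 'Mary', 'moved']], ['Where', 'is', 'Mary', '?'], 'bathroom')]

toy_vocab = sorted(set(flatten(toy_stories)))
toy_vocab.insert(0, '<PAD>')          # index 0 is reserved for padding
toy_word_idx = {w: i for i, w in enumerate(toy_vocab)}
print(toy_vocab)
```

Because the sort is by code point, digits and punctuation come before capitalized words, which in turn come before lowercase words. The answer word is part of the vocabulary too, since it is flattened along with the story and the question.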

Next we vectorize our dataset by mapping words to their indices. We enforce consistent dimensions by padding each vector with our pad element (index 0) up to the upper bounds we calculated earlier.

In [81]:
def vectorize_stories(data, word_idx, story_maxlen, query_maxlen):
    X = []; Xq = []; Y = []
    for story, query, answer in data:
        x = [[word_idx[w] for w in s] for s in story]
        xq = [word_idx[w] for w in query]
        y = [word_idx[answer]]
        X.append(x); Xq.append(xq); Y.append(y)
    return ([pad_sequences(x, maxlen=story_maxlen) for x in X],
             pad_sequences(Xq, maxlen=query_maxlen),
             np.array(Y))

In [82]:
inputs_train, queries_train, answers_train = vectorize_stories(train_stories, 
     word_idx, story_maxlen, query_maxlen)
inputs_test, queries_test, answers_test = vectorize_stories(test_stories, 
     word_idx, story_maxlen, query_maxlen)

In [83]:
len(inputs_train), inputs_train[2].shape, inputs_train[-2].shape  # inputs_train: list of 10000 arrays, each of shape [sent, 8]
                                                                  # m = len(train_stories): 10000
                                                                  # story_maxlen (maximum words per sentence): 8
                                                                  # sent (sentences per story): varies, at most story_maxsents

(10000, (6, 8), (8, 8))

In [84]:
inputs_train[5]

array([[ 0,  2, 16, 30, 29, 28, 27,  1],
       [ 0,  6, 16, 31, 29, 28, 19,  1]])
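
The leading zeros in the array above come from padding on the left, which is the default behaviour of Keras's `pad_sequences`. As a rough NumPy-only illustration (the helper name `pad_pre` is hypothetical and not part of Keras), pre-padding can be sketched like this:

```python
import numpy as np

def pad_pre(seqs, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to length `maxlen`."""
    out = np.full((len(seqs), maxlen), value, dtype='int32')
    for i, s in enumerate(seqs):
        if len(s) == 0:
            continue                  # nothing to copy, row stays all padding
        s = s[-maxlen:]               # keep only the last `maxlen` items
        out[i, -len(s):] = s          # right-align the sequence
    return out

padded = pad_pre([[2, 16, 30], [6, 16, 31, 29, 28]], maxlen=5)
print(padded)
```

Right-aligning the tokens means the end of every sentence (here, the final `.`) lands in the same column regardless of sentence length.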

In [85]:
def stack_inputs(inputs):
    for i, it in enumerate(inputs):
        inputs[i] = np.concatenate([it, 
                           np.zeros((story_maxsents-it.shape[0],story_maxlen), 'int')])
    return np.stack(inputs)

inputs_train = stack_inputs(inputs_train)
inputs_test = stack_inputs(inputs_test)

In [86]:
inputs_train[5]

array([[ 0,  2, 16, 30, 29, 28, 27,  1],
       [ 0,  6, 16, 31, 29, 28, 19,  1],
       [ 0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0]])
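
Note that `stack_inputs` overwrites the elements of the list it is given. If that side effect is unwanted, a non-mutating variant could look like the sketch below; the name `stack_padded` and its explicit size arguments are assumptions made for this example.

```python
import numpy as np

def stack_padded(inputs, max_sents, max_len):
    """Copy variable-length [sent, max_len] arrays into one zero-padded 3-D array."""
    out = np.zeros((len(inputs), max_sents, max_len), dtype='int64')
    for i, it in enumerate(inputs):
        out[i, :it.shape[0]] = it     # real sentences first, zero rows after
    return out

toy_inputs = [np.ones((2, 8), dtype='int64'), np.ones((6, 8), dtype='int64')]
stacked = stack_padded(toy_inputs, max_sents=10, max_len=8)
print(stacked.shape)
```

The input list is left untouched, so the function can be called repeatedly on the same data.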

In [87]:
print(inputs_train.shape) # inputs_train shape: [m = len(train_stories), story_maxsents, story_maxlen] arrays       
print(inputs_test.shape)  # inputs_test shape: [m = len(test_stories), story_maxsents, story_maxlen] arrays                                                                

(10000, 10, 8)
(1000, 10, 8)


In [88]:
print(queries_train.shape) # queries_train shape: [m = len(train_stories), query_maxlen] arrays       
print(queries_test.shape) # queries_test shape: [m = len(test_stories), query_maxlen] arrays    

((10000, 4), (1000, 4))

Our inputs for keras.

In [90]:
inps = [inputs_train, queries_train]  # train dataset = [inputs_train, queries_train]
val_inps = [inputs_test, queries_test] # val dataset = [inputs_test, queries_test]

Our dataset labels

In [119]:
print(answers_train.shape)
print(answers_test.shape)

((10000, 1), (1000, 1))

## Model

The approach to solving this task relies not only on word embeddings, but also on sentence embeddings.

The authors of the Babi paper constructed sentence embeddings by simply adding up the word embeddings; this might seem naive, but given the relatively small length of these sentences we can expect the sum to capture relevant information.

In [159]:
emb_dim = 20
params = {'verbose': 2, 'callbacks': [TQDMNotebookCallback(leave_inner=False)]}

We use **[TimeDistributed](https://keras.io/layers/wrappers/)** here to apply the embedding to each sentence of the story; the <tt>Lambda</tt> layer then sums the word embeddings within each sentence.

In [160]:
def emb_sent_bow(inp):
    emb = TimeDistributed(Embedding(vocab_size, emb_dim))(inp) # emb.shape: [m, story_maxsents=10, story_maxlen=8, emb_dim=20]
                                                               # TimeDistributed() seems to be for this first inp dim: 'story_maxsents=10'
    #print('emb.shape:', repr(emb.shape))
    return Lambda(lambda x: K.sum(x, axis=2))(emb)    # the Lambda layer adds up all embedding (in shape [m, story_maxsents=10, emb_dim=20])

The embedding works as desired; the raw input has 10 sentences of 8 words, and the output has 10 sentence embeddings of length 20.

In [161]:
inp_story = Input((story_maxsents, story_maxlen))  # TimeDistributed() is for this first inp_story dim: 'story_maxsents=10'
emb_story = emb_sent_bow(inp_story)
print('inp_story.shape:', repr(inp_story.shape))  # inp_story.shape: [m, story_maxsents=10, story_maxlen=8]
print('emb_story.shape:', repr(emb_story.shape))  # emb_story.shape: [m, story_maxsents=10, emb_dim=20    ]

inp_story.shape: TensorShape([Dimension(None), Dimension(10), Dimension(8)])
emb_story.shape: TensorShape([Dimension(None), Dimension(10), Dimension(20)])
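
The same shape bookkeeping can be reproduced outside Keras. The following is a NumPy sketch of the bag-of-words sentence embedding, assuming a random matrix `E` as a stand-in for the learned embedding weights and random word indices in place of real stories:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 32, 20            # values from this notebook
story_maxsents, story_maxlen = 10, 8

# Stand-in for the learned Embedding weights (random here, an assumption).
E = rng.normal(size=(vocab_size, emb_dim))
# Two fake stories of word indices, shape (batch, sentences, words).
stories_idx = rng.integers(0, vocab_size, size=(2, story_maxsents, story_maxlen))

word_emb = E[stories_idx]          # embedding lookup per word: (2, 10, 8, 20)
sent_emb = word_emb.sum(axis=2)    # sum over the word axis:    (2, 10, 20)
print(word_emb.shape, sent_emb.shape)
```

Summing over axis 2 collapses the eight word vectors of each sentence into one vector, which is what the `Lambda` layer does in the Keras version.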


We do the same for the queries, omitting the <tt>TimeDistributed</tt> since there is only one query. We use <tt>Reshape</tt> to match the rank of the input.

In [162]:
inp_q = Input((query_maxlen,))
emb_q = Embedding(vocab_size, emb_dim)(inp_q) # emb_q.shape: [m, query_maxlen=4, emb_dim=20]
#print('emb_q.shape', repr(emb_q.shape))
emb_q = Lambda(lambda x: K.sum(x, axis=1))(emb_q) # the Lambda layer sums the word embeddings (output shape [m, emb_dim=20])
#print('emb_q.shape', repr(emb_q.shape))
emb_q = Reshape((1, emb_dim))(emb_q)
print('inp_q.shape:', repr(inp_q.shape)) 
print('emb_q.shape', repr(emb_q.shape))

inp_q.shape: TensorShape([Dimension(None), Dimension(4)])
emb_q.shape TensorShape([Dimension(None), Dimension(1), Dimension(20)])


The actual memory network is incredibly simple.

* For each story, we take the dot product of every sentence embedding with that story's query embedding. This gives one score per sentence that grows with how similar the sentence is to the query.
* We pass this vector of dot products through a softmax function to get a list of non-negative weights that sum to one and tell us how relevant each sentence is to the query.

In [163]:
x = Dot(axes=2)([emb_story, emb_q]) 
#print(repr(x.shape))              # x.shape: [m, story_maxsents=10, 1]  
x = Reshape((story_maxsents,))(x)  # Softmax works on the last dimension, so I have to Reshape to get rid of the unit axis 
                                   # and then I Reshape again to put the unit axis back on again.
x = Activation('softmax')(x)
#print(repr(x.shape))              # x.shape: [m, story_maxsents=10]  
match = Reshape((story_maxsents, 1))(x)  # match: how similar the query is to each sentence.
print(repr(match.shape))     # match.shape: [m, story_maxsents=10, 1]

TensorShape([Dimension(None), Dimension(10), Dimension(1)])
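
The matching step can be written in a few lines of NumPy. This is a sketch under stated assumptions: the embeddings are random placeholders, and `softmax` is a small helper defined here rather than a Keras function.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
emb_story_np = rng.normal(size=(3, 10, 20))   # (batch, sentences, emb_dim)
emb_q_np = rng.normal(size=(3, 1, 20))        # (batch, 1, emb_dim)

# Dot product of every sentence embedding with the query: (3, 10, 1)
scores = np.einsum('bsd,bqd->bsq', emb_story_np, emb_q_np)
# Softmax over the sentence axis, then restore the unit axis: (3, 10, 1)
match_np = softmax(scores[..., 0], axis=-1)[..., None]
print(match_np.shape, match_np.sum(axis=1).ravel())
```

Each story's weights sum to one, so `match_np` can be read as a distribution over sentences.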


* Next, we construct a second, separate embedding function for the sentences.
* We then take the weighted average of these embeddings, using the softmax outputs as weights.
* Finally, we pass this weighted average through a dense layer and classify it with a softmax into one of the words in the vocabulary.

In [164]:
emb_c = emb_sent_bow(inp_story)  # emb_c.shape: [m, story_maxsents=10, emb_dim=20]
x = Dot(axes=1)([match, emb_c])
print(repr(x.shape))             # x.shape: [m, 1, emb_dim=20]
response = Reshape((emb_dim,))(x)  # reshape to get rid of the unit axis, so that we can stick it 
                                   # through a dense layer with a softmax and that gives us our final result.
res = Dense(vocab_size, activation='softmax')(response)

TensorShape([Dimension(None), Dimension(1), Dimension(20)])
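
The response step can be sketched in NumPy as well. All arrays below are random placeholders, and `W` and `b` stand in for the weights of the final dense layer; none of them come from the trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
m, story_maxsents, emb_dim, vocab_size = 3, 10, 20, 32

match_np = softmax(rng.normal(size=(m, story_maxsents)))[..., None]  # (3, 10, 1)
emb_c_np = rng.normal(size=(m, story_maxsents, emb_dim))             # (3, 10, 20)

# Weighted average of the sentence embeddings: (3, 20)
response_np = (match_np * emb_c_np).sum(axis=1)

# Dense layer + softmax over the vocabulary: (3, 32)
W = rng.normal(size=(emb_dim, vocab_size))
b = np.zeros(vocab_size)
probs = softmax(response_np @ W + b)
predicted = probs.argmax(axis=-1)   # index of the predicted answer word
print(response_np.shape, probs.shape, predicted.shape)
```

Because the answer is always a single word, predicting it reduces to a classification over the vocabulary.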


In [165]:
answer = Model([inp_story, inp_q], res)

In [166]:
answer.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

And it works extremely well: validation accuracy reaches 99.9% by the second epoch.

In [167]:
batch_size = 32

In [168]:
K.set_value(answer.optimizer.lr, 1e-2)
hist = answer.fit(inps, answers_train, **params, epochs=4, batch_size=batch_size,
                 validation_data=(val_inps, answers_test))

Train on 10000 samples, validate on 1000 samples
Epoch 1/4
2s - loss: 0.4782 - acc: 0.8343 - val_loss: 0.0395 - val_acc: 0.9890
Epoch 2/4
0s - loss: 0.0068 - acc: 0.9981 - val_loss: 7.7516e-04 - val_acc: 0.9990
Epoch 3/4
0s - loss: 0.0054 - acc: 0.9989 - val_loss: 1.1262e-05 - val_acc: 1.0000
Epoch 4/4
0s - loss: 0.0075 - acc: 0.9981 - val_loss: 0.0018 - val_acc: 0.9990



## Test

We can look inside our model to see how it's weighting the sentence embeddings.

In [169]:
f = Model([inp_story, inp_q], match)

In [216]:
qnum = 6 

In [217]:
l_st = len(train_stories[qnum][0]); print(l_st)
train_stories[qnum]

4


([['0:', 'Sandra', 'travelled', 'to', 'the', 'office', '.'],
  ['1:', 'Sandra', 'went', 'to', 'the', 'bathroom', '.'],
  ['3:', 'Mary', 'went', 'to', 'the', 'bedroom', '.'],
  ['4:', 'Daniel', 'moved', 'to', 'the', 'hallway', '.']],
 ['Where', 'is', 'Sandra', '?'],
 'bathroom')

Sure enough, for the question "Where is Sandra?", the largest weight goes to the last sentence that mentions Sandra: sentence 1, with a weight of about 0.96.

The second highest is of course the first sentence, which also mentions Sandra. But the model has learned that the last occurring sentence is what is important; this is why we added the counter at the beginning of each sentence.

In [218]:
np.squeeze(f.predict([inputs_train[qnum:qnum+1], queries_train[qnum:qnum+1]]))[:l_st]

array([4.1552e-02, 9.5723e-01, 5.3446e-05, 1.1608e-03], dtype=float32)
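
As a quick check of the weights printed above, this sketch copies the four values into an array (hard-coded from the output rather than recomputed from the model) and ranks the sentences by weight:

```python
import numpy as np

# Attention weights copied from the output above, one per story sentence.
weights = np.array([4.1552e-02, 9.5723e-01, 5.3446e-05, 1.1608e-03])

order = np.argsort(weights)[::-1]   # sentence positions, highest weight first
print(order, weights[order])
```

The position of the second `Sandra` sentence comes first and the position of the first `Sandra` sentence comes second; the two sentences about other people receive almost no weight.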

In [219]:
answers_train[qnum:qnum+1,0]

array([19])

In [220]:
answer.predict([inputs_train[qnum:qnum+1], queries_train[qnum:qnum+1]])

array([[7.9062e-11, 1.0200e-10, 7.7399e-11, 5.3735e-11, 7.6401e-11,
        6.7729e-11, 1.8837e-10, 1.8352e-10, 1.3282e-10, 1.4026e-10,
        1.0462e-10, 1.9415e-10, 1.0147e-10, 3.3772e-11, 1.3133e-10,
        4.3228e-11, 5.0356e-11, 1.2204e-10, 1.2993e-10, 1.0000e+00,
        2.5772e-08, 6.1652e-09, 9.5249e-07, 1.5805e-10, 3.1113e-11,
        1.0092e-07, 1.0156e-10, 4.2901e-08, 1.0828e-11, 1.0363e-10,
        2.7131e-10, 1.0985e-10]], dtype=float32)

In [223]:
vocab[19]

'bathroom'

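To summarize the cells above: the sentence with the largest weight is the supporting fact, and it contains the true answer.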

In [224]:
answer_idx = np.argmax(np.squeeze(f.predict([inputs_train[qnum:qnum+1], queries_train[qnum:qnum+1]]))[:l_st])
print('question:', train_stories[qnum][1])
print('predict answer:', train_stories[qnum][0][answer_idx])

print('real answer:', vocab[int(answers_train[qnum:qnum+1][0])] )

question: ['Where', 'is', 'Sandra', '?']
predict answer: ['1:', 'Sandra', 'went', 'to', 'the', 'bathroom', '.']
real answer: bathroom


## Multi hop

Next, let's look at an example of a two-supporting-fact story.

In [237]:
challenges = {
    # QA1 with 10,000 samples
    'single_supporting_fact_10k': 'tasks_1-20_v1-2/en-10k/qa1_single-supporting-fact_{}.txt',
    # QA2 with 10,000 samples
    'two_supporting_facts_10k': 'tasks_1-20_v1-2/en-10k/qa2_two-supporting-facts_{}.txt',
    'two_supporting_facts_1k': 'tasks_1-20_v1-2/en/qa2_two-supporting-facts_{}.txt',
}
#challenge_type = 'single_supporting_fact_10k'
challenge_type = 'two_supporting_facts_10k'
challenge = challenges[challenge_type]

In [238]:
train_stories = get_stories(tar.extractfile(challenge.format('train')))
test_stories = get_stories(tar.extractfile(challenge.format('test')))



In [239]:
test_stories[534]

([['0:', 'Mary', 'went', 'to', 'the', 'hallway', '.'],
  ['1:', 'Daniel', 'went', 'back', 'to', 'the', 'bedroom', '.'],
  ['2:', 'Sandra', 'went', 'back', 'to', 'the', 'garden', '.'],
  ['3:', 'Mary', 'went', 'to', 'the', 'office', '.'],
  ['4:', 'Mary', 'journeyed', 'to', 'the', 'kitchen', '.'],
  ['5:', 'Sandra', 'moved', 'to', 'the', 'office', '.'],
  ['6:', 'Sandra', 'journeyed', 'to', 'the', 'hallway', '.'],
  ['7:', 'Daniel', 'journeyed', 'to', 'the', 'garden', '.'],
  ['8:', 'Mary', 'journeyed', 'to', 'the', 'bathroom', '.'],
  ['9:', 'John', 'went', 'back', 'to', 'the', 'bathroom', '.'],
  ['10:', 'Sandra', 'travelled', 'to', 'the', 'garden', '.'],
  ['11:', 'John', 'moved', 'to', 'the', 'office', '.'],
  ['12:', 'Daniel', 'went', 'back', 'to', 'the', 'kitchen', '.'],
  ['13:', 'Mary', 'moved', 'to', 'the', 'kitchen', '.'],
  ['14:', 'Mary', 'moved', 'to', 'the', 'hallway', '.'],
  ['15:', 'Mary', 'went', 'to', 'the', 'kitchen', '.'],
  ['16:', 'Sandra', 'went', 'back', 'to', '

We can see that the question "Where is the milk?" requires two supporting facts to answer: "Daniel traveled to the hallway" and "Daniel left the milk there".

(Again) Here we compute corpus-wide upper bounds (words per sentence, sentences per story, words per query), which we use later for padding.

In [242]:
stories = train_stories + test_stories

In [259]:
stories[534]

([['0:', 'Sandra', 'journeyed', 'to', 'the', 'kitchen', '.'],
  ['1:', 'Daniel', 'travelled', 'to', 'the', 'office', '.'],
  ['2:', 'John', 'travelled', 'to', 'the', 'bathroom', '.'],
  ['3:', 'Sandra', 'moved', 'to', 'the', 'bathroom', '.'],
  ['4:', 'Mary', 'went', 'back', 'to', 'the', 'garden', '.'],
  ['5:', 'Mary', 'grabbed', 'the', 'milk', 'there', '.'],
  ['6:', 'Mary', 'left', 'the', 'milk', '.'],
  ['7:', 'Sandra', 'journeyed', 'to', 'the', 'garden', '.'],
  ['9:', 'Mary', 'journeyed', 'to', 'the', 'bathroom', '.'],
  ['10:', 'Sandra', 'took', 'the', 'milk', 'there', '.'],
  ['11:', 'Daniel', 'moved', 'to', 'the', 'bedroom', '.'],
  ['12:', 'Daniel', 'journeyed', 'to', 'the', 'kitchen', '.'],
  ['13:', 'Sandra', 'discarded', 'the', 'milk', '.'],
  ['14:', 'Daniel', 'went', 'back', 'to', 'the', 'bathroom', '.'],
  ['16:', 'John', 'travelled', 'to', 'the', 'kitchen', '.'],
  ['17:', 'Daniel', 'went', 'to', 'the', 'kitchen', '.'],
  ['19:', 'Mary', 'went', 'back', 'to', 'the', 'g

In [244]:
story_maxlen = max(len(s) for x, _, _ in stories for s in x)
story_maxsents = max(len(x) for x, _, _ in stories)
query_maxlen = max(len(x) for _, x, _ in stories)

In [245]:
import collections.abc  # collections.Iterable was removed in Python 3.10

def do_flatten(el):
    return isinstance(el, collections.abc.Iterable) and not isinstance(el, (str, bytes))

def flatten(l):
    for el in l:
        if do_flatten(el): yield from flatten(el)
        else: yield el

(Again) Create vocabulary of corpus and find size, including a padding element.

In [246]:
vocab = sorted(set(flatten(stories)))
vocab.insert(0, '<PAD>')
vocab_size = len(vocab)

In [247]:
story_maxsents, vocab_size, story_maxlen, query_maxlen, len(train_stories), len(test_stories)

(88, 124, 8, 5, 10000, 1000)

(Again) Create an index mapping for the vocabulary.

In [249]:
word_idx = dict((c,i) for i,c in enumerate(vocab))
#word_idx = {c:i for i,c in enumerate(vocab)}  # equivalent dict comprehension

(Again) Next we vectorize our dataset by mapping words to their indices. We enforce consistent dimensions by padding each vector with our pad element (index 0) up to the upper bounds we calculated earlier.

In [250]:
inputs_train, queries_train, answers_train = vectorize_stories(train_stories, 
     word_idx, story_maxlen, query_maxlen)
inputs_test, queries_test, answers_test = vectorize_stories(test_stories, 
     word_idx, story_maxlen, query_maxlen)

In [251]:
len(inputs_train), inputs_train[2].shape, inputs_train[-2].shape  # inputs_train: list of 10000 arrays, each of shape [sent, 8]
                                                                  # m = len(train_stories): 10000
                                                                  # story_maxlen (maximum words per sentence): 8
                                                                  # sent (sentences per story): varies, at most story_maxsents

(10000, (16, 8), (18, 8))

In [252]:
inputs_train[5]

array([[  0,   2,  91, 123, 119, 117, 110,   1],
       [  0,  13,  91, 109, 119, 117, 107,   1],
       [ 24,  93, 123,  97, 119, 117, 104,   1],
       [ 35,  91, 115, 122, 117,  96, 118,   1],
       [  0,  46,  94, 123, 119, 117, 114,   1],
       [  0,  57,  94, 121, 119, 117,  99,   1],
       [  0,  68,  93, 105, 117, 103, 118,   1],
       [  0,  79,  94, 106, 117, 112, 118,   1],
       [  0,   0,  86,  93, 111, 117, 103,   1],
       [  0,   0,  89,  91, 111, 117,  96,   1]])

In [253]:
def stack_inputs(inputs):
    for i, it in enumerate(inputs):
        inputs[i] = np.concatenate([it, 
                           np.zeros((story_maxsents-it.shape[0],story_maxlen), 'int')])
    return np.stack(inputs)

inputs_train = stack_inputs(inputs_train)
inputs_test = stack_inputs(inputs_test)

In [254]:
inputs_train[5]

array([[  0,   2,  91, 123, 119, 117, 110,   1],
       [  0,  13,  91, 109, 119, 117, 107,   1],
       [ 24,  93, 123,  97, 119, 117, 104,   1],
       [ 35,  91, 115, 122, 117,  96, 118,   1],
       [  0,  46,  94, 123, 119, 117, 114,   1],
       [  0,  57,  94, 121, 119, 117,  99,   1],
       [  0,  68,  93, 105, 117, 103, 118,   1],
       [  0,  79,  94, 106, 117, 112, 118,   1],
       [  0,   0,  86,  93, 111, 117, 103,   1],
       [  0,   0,  89,  91, 111, 117,  96,   1],
       [  0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0],
       ...,
       [

In [255]:
print(inputs_train.shape) # inputs_train shape: [m = len(train_stories), story_maxsents, story_maxlen] arrays       
print(inputs_test.shape)  # inputs_test shape: [m = len(test_stories), story_maxsents, story_maxlen] arrays                                                                

(10000, 88, 8)
(1000, 88, 8)


In [256]:
print(queries_train.shape) # queries_train shape: [m = len(train_stories), query_maxlen] arrays       
print(queries_test.shape) # queries_test shape: [m = len(test_stories), query_maxlen] arrays    

(10000, 5)
(1000, 5)


(Again) Our inputs for keras.

In [257]:
inps = [inputs_train, queries_train]  # train dataset = [inputs_train, queries_train]
val_inps = [inputs_test, queries_test] # val dataset = [inputs_test, queries_test]

(Again) Our dataset labels

In [258]:
print(answers_train.shape)
print(answers_test.shape)

(10000, 1)
(1000, 1)


### Multi hop Model
The approach is basically the same; we add more embedding dimensions to account for the increased task complexity.

In [260]:
emb_dim = 30
params = {'verbose': 2, 'callbacks': [TQDMNotebookCallback(leave_inner=False)]}

In [261]:
def emb_sent_bow(inp):
    emb_op = TimeDistributed(Embedding(vocab_size, emb_dim)) 
    emb = emb_op(inp)                                       # emb.shape: [m, story_maxsents=88, story_maxlen=8, emb_dim=30]
                                                            # TimeDistributed() applies the embedding along the first inp dim: 'story_maxsents=88'
    #print('emb.shape:', repr(emb.shape))
    emb = Lambda(lambda x: K.sum(x, axis=2))(emb)    # the Lambda layer sums the word embeddings (output shape [m, story_maxsents=88, emb_dim=30])
    return emb, emb_op  # Difference in multi-hop: also return emb_op so its Embedding layer can be reused to embed inp_q

In [262]:
inp_story = Input((story_maxsents, story_maxlen))  # TimeDistributed() is for this first inp_story dim: 'story_maxsents=88'
emb_story, emb_story_op = emb_sent_bow(inp_story)
print('inp_story.shape:', repr(inp_story.shape))  # inp_story.shape: [m, story_maxsents=88, story_maxlen=8]
print('emb_story.shape:', repr(emb_story.shape))  # emb_story.shape: [m, story_maxsents=88, emb_dim=30    ]

inp_story.shape: TensorShape([Dimension(None), Dimension(88), Dimension(8)])
emb_story.shape: TensorShape([Dimension(None), Dimension(88), Dimension(30)])


In [265]:
inp_q = Input((query_maxlen,))
emb_q = emb_story_op.layer(inp_q) # emb_q.shape: [m, query_maxlen=5, emb_dim=30]
print('emb_q.shape', repr(emb_q.shape))
emb_q = Lambda(lambda x: K.sum(x, axis=1))(emb_q) # the Lambda layer sums the word embeddings (output shape [m, emb_dim=30])
print('inp_q.shape:', repr(inp_q.shape)) 
print('emb_q.shape', repr(emb_q.shape))

emb_q.shape TensorShape([Dimension(None), Dimension(5), Dimension(30)])
inp_q.shape: TensorShape([Dimension(None), Dimension(5)])
emb_q.shape TensorShape([Dimension(None), Dimension(30)])


The main difference is that we are going to do the same process twice. Here we've defined a "hop" as the operation that returns the weighted average of the input sentence embeddings.

In [266]:
hop = Dense(emb_dim)

In [268]:
def one_hop(u, A):                
    '''
       u: current query embedding, shape [m, emb_dim]
       A: sentence embeddings matched against the query (emb_story)
    '''
    x = Reshape((1, emb_dim))(u)   # add a unit axis so the query can be dotted with A
    x = Dot(axes=2)([A, x])        # dot product of the query with every sentence embedding
    #print(repr(x.shape))              # x.shape: [m, story_maxsents=88, 1]  
    x = Reshape((story_maxsents,))(x)  # softmax works on the last dimension, so drop the unit axis first
                                       # and put it back afterwards
    x = Activation('softmax')(x)
    #print(repr(x.shape))              # x.shape: [m, story_maxsents=88]  
    match = Reshape((story_maxsents, 1))(x)  # match: how similar the query is to each sentence.
    #print(repr(match.shape))     # match.shape: [m, story_maxsents=88, 1]
    
    emb_c, _ = emb_sent_bow(inp_story)  # new embedding for this hop; emb_c.shape: [m, story_maxsents=88, emb_dim=30]
    x = Dot(axes=1)([match, emb_c])     # weighted sum of the sentence embeddings
    #print(repr(x.shape))             # x.shape: [m, 1, emb_dim=30]
    x = Reshape((emb_dim,))(x)      
    x = hop(x)                      # shared Dense layer (the "hop"); x.shape: [m, emb_dim=30]
    #print(repr(x.shape))
    x = Add()([x, emb_q])           # add the original query embedding (the global emb_q, not u)
    return x, emb_c

For reference, the function above implements a single hop.

We do one hop, and then repeat the process using the resulting weighted sentence average as the new query.

This works because the first hop allows us to find the first fact relevant to the query, and then we can use that fact to find the next fact that answers the question. In our example, our model would first find the last sentence to mention "milk", and then use the information in that fact to know that it next has to find the last occurrence of "Daniel".

This is facilitated by generating a new embedding function for the input story each time we hop. This means that the first embedding is learning things that help us find the first fact from the query, and the second is helping us find the second fact from the first.

This approach can be extended to n-supporting-fact problems by doing n hops.

In [269]:
response, emb_story = one_hop(emb_q, emb_story)     # use emb_q as question to find 1st response
response, emb_story = one_hop(response, emb_story)  # use 1st response as question to find 2nd response

TensorShape([Dimension(None), Dimension(88), Dimension(1)])
TensorShape([Dimension(None), Dimension(88)])
TensorShape([Dimension(None), Dimension(88), Dimension(1)])
TensorShape([Dimension(None), Dimension(1), Dimension(30)])
TensorShape([Dimension(None), Dimension(30)])
TensorShape([Dimension(None), Dimension(88), Dimension(1)])
TensorShape([Dimension(None), Dimension(88)])
TensorShape([Dimension(None), Dimension(88), Dimension(1)])
TensorShape([Dimension(None), Dimension(1), Dimension(30)])
TensorShape([Dimension(None), Dimension(30)])
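
To make the data flow of the two hops explicit, here is a NumPy sketch that mirrors the structure of `one_hop`. Everything is a random placeholder: `A1`, `C1`, and `C2` stand in for the sentence embeddings, and `H`, `b` for the shared dense layer. The function name `one_hop_np` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def one_hop_np(u, A, C, H, b, q):
    """u: (m, d) current query; A: (m, s, d) matching memory; C: (m, s, d) content memory."""
    weights = softmax(np.einsum('bsd,bd->bs', A, u))   # attention over sentences: (m, s)
    o = np.einsum('bs,bsd->bd', weights, C)            # weighted sentence average: (m, d)
    return o @ H + b + q                               # shared dense layer, plus the original query

rng = np.random.default_rng(0)
m, s, d = 4, 88, 30
q = rng.normal(size=(m, d))
A1, C1, C2 = (rng.normal(size=(m, s, d)) for _ in range(3))
H, b = rng.normal(size=(d, d)), np.zeros(d)

r1 = one_hop_np(q, A1, C1, H, b, q)    # hop 1: query the story with the question
r2 = one_hop_np(r1, C1, C2, H, b, q)   # hop 2: hop 1's content memory is now matched against r1
print(r1.shape, r2.shape)
```

As in the Keras cell, the content embedding returned by the first hop becomes the matching embedding of the second hop, and the same dense layer is applied in both.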


In [270]:
res = Dense(vocab_size, activation='softmax')(response)

In [271]:
answer = Model([inp_story, inp_q], res)
answer.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

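Fitting this model can be tricky: in the run below, validation accuracy climbs to only about 79% after 8 epochs and does not improve monotonically.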

In [272]:
K.set_value(answer.optimizer.lr, 5e-3)
hist = answer.fit(inps, answers_train, **params, epochs=8, batch_size=batch_size,
                 validation_data=(val_inps, answers_test))

Train on 10000 samples, validate on 1000 samples
Epoch 1/8
11s - loss: 1.7038 - acc: 0.2950 - val_loss: 1.4643 - val_acc: 0.4250
Epoch 2/8
10s - loss: 0.9887 - acc: 0.6355 - val_loss: 0.9092 - val_acc: 0.6820
Epoch 3/8
9s - loss: 0.7793 - acc: 0.7242 - val_loss: 0.7466 - val_acc: 0.7180
Epoch 4/8
11s - loss: 0.6805 - acc: 0.7673 - val_loss: 0.7754 - val_acc: 0.7590
Epoch 5/8
10s - loss: 0.6356 - acc: 0.7842 - val_loss: 0.6768 - val_acc: 0.7600
Epoch 6/8
9s - loss: 0.6000 - acc: 0.8065 - val_loss: 0.7982 - val_acc: 0.7380
Epoch 7/8
9s - loss: 0.5961 - acc: 0.8137 - val_loss: 0.7440 - val_acc: 0.7950
Epoch 8/8
9s - loss: 0.5840 - acc: 0.8246 - val_loss: 0.7258 - val_acc: 0.7920



In [273]:
np.array(hist.history['val_acc'])

array([0.425, 0.682, 0.718, 0.759, 0.76 , 0.738, 0.795, 0.792])