##### Requirements
Reference: _Sukhbaatar, S., Szlam, A., Weston, J., Fergus, R. (2015). End-To-End Memory Networks. arXiv: 1503.08895v5_

__Data:__ 
* bAbI dataset

__Model:__
* RNN-like network that reads from an external memory instead of relying on an internal hidden state
  * input memory representation: $m_i = \sum_j A x_{ij}$
  * output memory representation: $c_i = \sum_j C x_{ij}$
  * question embedding: $u = \sum_j B q_j$, matched against each memory by the inner product $p_i = \mathrm{Softmax}(u^T m_i)$
  * output embedding: $o = \sum_i p_i c_i$
  * subsequent input embeddings (hop $k$ of $K$): $u^{k+1} = u^k + o^k$
  * generate prediction: $\hat{a} = \mathrm{Softmax}(W(o^K + u^K))$
  * Adjacent weight tying:
    * output embedding reused as the next layer's input embedding: $A^{k+1} = C^{k}$
    * question embedding == 1st input embedding: $B = A^1$
    * prediction matrix constrained to final output embedding: $W^T = C^K$
  * Layer-wise (RNN-like) weight tying:
    * input and output embeddings shared across layers: $A^1 = A^2 = \dots = A^K$ and $C^1 = C^2 = \dots = C^K$
    * add a linear mapping $H$ to the update of the input embedding: $u^{k+1} = H u^k + o^k$
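
The equations above, with adjacent weight tying, can be sketched without any framework. This is only a toy illustration under stated assumptions: the function name `memn2n_forward`, the `tables` list of $K + 1$ embedding tables, the random initialisation and the tiny sizes are all hypothetical choices made for the example, not part of the paper or of the notebook code.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def embed_sum(table, word_ids):
    """Bag-of-words embedding: sum the rows of `table` selected by `word_ids`."""
    dim = len(table[0])
    return [sum(table[w][d] for w in word_ids) for d in range(dim)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def memn2n_forward(story, query, tables):
    """Run K hops with adjacent weight tying (hypothetical helper).

    `tables` holds K + 1 embedding tables E_0..E_K (each vocab x dim), so that
    B = A^1 = E_0, A^{k+1} = C^k = E_k and W^T = C^K = E_K.
    """
    hops = len(tables) - 1
    u = embed_sum(tables[0], query)                       # u^1 = sum_j B q_j
    for k in range(hops):
        m = [embed_sum(tables[k], s) for s in story]      # m_i = sum_j A x_ij
        c = [embed_sum(tables[k + 1], s) for s in story]  # c_i = sum_j C x_ij
        p = softmax([dot(u, m_i) for m_i in m])           # p_i = Softmax(u^T m_i)
        o = [sum(p_i * c_i[d] for p_i, c_i in zip(p, c)) for d in range(len(u))]
        u = [u_d + o_d for u_d, o_d in zip(u, o)]         # u^{k+1} = u^k + o^k
    # After the loop u holds o^K + u^K. With W^T = C^K, the row of W for
    # word w is that word's row in the last table.
    return softmax([dot(row, u) for row in tables[-1]])

# Toy sizes and random weights, chosen only for illustration.
random.seed(0)
vocab_size, dim, hops = 6, 4, 3
tables = [[[random.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(vocab_size)]
          for _ in range(hops + 1)]
story = [[1, 2, 3], [4, 2, 5]]   # two sentences given as word ids
query = [0, 1]
a_hat = memn2n_forward(story, query, tables)
```

`a_hat` is a probability distribution over the vocabulary. Layer-wise tying would instead reuse one `A` table and one `C` table in every hop and multiply `u` by $H$ before adding `o`.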
    
__Loss function:__
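* Standard cross-entropy, averaged over the $N$ training examples: $L = \frac{1}{N}\sum_{n=1}^{N} D(\hat{a}_n, a_n)$, where $D(\hat{a}, a) = -\sum_w a_w \log \hat{a}_w$ compares the predicted distribution $\hat{a}_n$ with the one-hot true answer $a_n$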
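
As a small worked example of the loss, the sketch below computes the cross-entropy directly from unnormalised scores, assuming the true answer is given as a single vocabulary index (a one-hot target). The function names are hypothetical and the scores are made up.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of index `target` under softmax(logits)."""
    top = max(logits)
    # log-sum-exp with the maximum subtracted for numerical stability
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_norm - logits[target]

def mean_cross_entropy(batch_logits, targets):
    """Average the per-example losses over a batch."""
    losses = [cross_entropy(z, t) for z, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

# With four equal scores every word has probability 1/4, so the loss is log(4).
loss_uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], 2)
loss_batch = mean_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1])
```

Working from scores rather than probabilities is the same convention `nn.CrossEntropyLoss` uses: it applies the log-softmax itself, so a model should hand it $W(o + u)$ rather than $\hat{a}$.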

__Optimiser:__
* annealing learning rate according to:
  * 100 epochs: $\eta = 0.01$
    * if epoch % 25 == 0 and epoch < 100: $\eta \leftarrow \eta / 2$
  * 60 epochs: $\eta = 0.01$
    * if epoch % 15 == 0 and epoch < 60: $\eta \leftarrow \eta / 2$
  * 20 epochs: $\eta = 0.01$
    * if epoch % 5 == 0 and epoch < 20: $\eta \leftarrow \eta / 2$
* SGD without momentum
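
The annealing rule can be written as a closed-form function of the epoch. This is a minimal sketch that assumes 0-based epochs and no halving at epoch 0; the name `annealed_lr` and its parameters are hypothetical.

```python
def annealed_lr(epoch, base_lr=0.01, anneal_every=25, max_epochs=100):
    """Learning rate used during `epoch` (0-based), halved every `anneal_every` epochs."""
    # Clamping the epoch stops any further halving once `max_epochs` is reached.
    n_halvings = min(epoch, max_epochs - 1) // anneal_every
    return base_lr * 0.5 ** n_halvings

# 100-epoch schedule: 0.01 until epoch 24, then 0.005, 0.0025, 0.00125.
schedule = [annealed_lr(e) for e in (0, 24, 25, 50, 75, 99)]

# The shorter schedules only change the two keyword arguments.
lr_60 = annealed_lr(15, anneal_every=15, max_epochs=60)
lr_20 = annealed_lr(5, anneal_every=5, max_epochs=20)
```

In PyTorch the returned value can be written into each entry of `optimizer.param_groups` under the `'lr'` key at the start of every epoch.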

__Train:__
* anything?

__Predict:__
* anything?

__Plots:__
* anything?

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.autograd import Variable

from torchtext import data


In [3]:
# Here anything to do with the data 
# dataset/tasks_1-20_v1-2/...
# The data is in the form:
'''
    [
        (
            [
                ['mary', 'moved', 'to', 'the', 'bathroom'], 
                ['john', 'went', 'to', 'the', 'hallway']
            ], 
            ['where', 'is', 'mary'], 
            ['bathroom']
        ),
        .
        .
        .
        ()
    ]
    '''

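# TODO: the path and fields below come from a POS-tagging example and do not
# yet point at the bAbI task files described above.
qa = data.TabularDataset(
    path='data/pos/pos_wsj_train.tsv', format='tsv',
    fields=[('text', data.Field()),
            ('labels', data.Field())])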
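
One way to get from the raw bAbI text to the (story, query, answer) tuples described above is a small hand-written parser. The sketch below assumes the usual bAbI layout: every line starts with an index that restarts at 1 for each new story, and question lines carry the question, the answer and the supporting fact ids separated by tabs. The function names and the three sample lines are illustrative; no file path is assumed.

```python
import re

def tokenize(sentence):
    """Lower-case a sentence and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", sentence.lower())

def parse_babi(lines):
    """Turn raw bAbI lines into a list of (story, query, answer) tuples."""
    examples, story = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        idx, text = line.split(" ", 1)
        if int(idx) == 1:        # numbering restarts with every new story
            story = []
        if "\t" in text:         # question \t answer \t supporting fact ids
            question, answer, _supporting = text.split("\t")
            # copy the story so later sentences do not leak into this example
            examples.append((list(story), tokenize(question), answer.lower().split(",")))
        else:
            story.append(tokenize(text))
    return examples

sample = [
    "1 Mary moved to the bathroom.",
    "2 John went to the hallway.",
    "3 Where is Mary? \tbathroom\t1",
]
examples = parse_babi(sample)
```

For the three sample lines, `examples` holds a single tuple whose story contains the two tokenised sentences, whose query is `['where', 'is', 'mary']` and whose answer is `['bathroom']`.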

In [4]:
# the model: a single-hop end-to-end memory network
class MemNN(nn.Module):
    def __init__(self, vocab_size, embedding_size, story_max_len, query_max_len):
        super(MemNN, self).__init__()
        # story_max_len and query_max_len are kept for the interface but unused here
        # embedding of sentences: A (input memory) and C (output memory)
        self.A = nn.Embedding(vocab_size, embedding_size)
        self.C = nn.Embedding(vocab_size, embedding_size)
        # embedding of question: B
        self.B = nn.Embedding(vocab_size, embedding_size)
        # prediction matrix W
        self.fc_a = nn.Linear(embedding_size, vocab_size)

    def forward(self, story, query):
        # story: (n_sentences, sentence_len) word indices; a 1-D story is
        # treated as one-word sentences, i.e. one memory slot per token
        if story.dim() == 1:
            story = story.unsqueeze(1)

        m = self.A(story).sum(dim=1)   # (n_sentences, embedding_size)
        c = self.C(story).sum(dim=1)   # (n_sentences, embedding_size)
        u = self.B(query).sum(dim=0)   # (embedding_size,)

        # attention over memories: p_i = softmax(u^T m_i)
        p = F.softmax(torch.mv(m, u), dim=0)   # (n_sentences,)

        # response vector: o = sum_i p_i c_i
        o = torch.mv(c.t(), p)   # (embedding_size,)

        # unnormalised scores W(o + u); nn.CrossEntropyLoss applies the softmax
        return self.fc_a(o + u)   # (vocab_size,)

In [5]:
story = torch.LongTensor([1,2,3,4,5,6,7,8,9,0])
query = torch.LongTensor([1,2,3,4,5])

memnn = MemNN(10, 100, 10, 5)

In [6]:
# testing the sizes
a_hat = memnn(story, query)
print(a_hat.size())

torch.Size([10])

In [7]:
# Loss and Optimiser
# loss function
criterion = nn.CrossEntropyLoss()

In [None]:
# annealing the learning rate and SGD optimiser
learning_rate = 0.01 # initial learning rate


# how are you updating the parameters for every epoch?
optimizer = torch.optim.SGD(memnn.parameters(), lr=learning_rate)  # momentum defaults to 0