# Steps
1. Preprocessing: tokenization, sentence splitting, lower-casing, and wrapping each sentence as `<SOS>` sentence `<EOS>`
2. Sentences are stored in one text file, questions in another text file
3. Vocab: the 45k most frequent tokens for sentences, the 28k most frequent tokens for questions
4. Pretrained word embedding: glove.840b.300d in the embedding layer
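
Steps 1 and 3 can be sketched in a few lines. This is only an illustration, assuming the text is already tokenized and split on whitespace; the helper names `wrap_sentence` and `top_k_vocab` are hypothetical and do not come from the original repo.

```python
from collections import Counter


def wrap_sentence(sentence, sos="<SOS>", eos="<EOS>"):
    """Lower-case a tokenized sentence and wrap it in start/end markers."""
    tokens = sentence.lower().split()
    return [sos] + tokens + [eos]


def top_k_vocab(token_lists, k):
    """Keep the k most frequent tokens across all sentences."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return [token for token, _ in counts.most_common(k)]


print(wrap_sentence("What is a pub licensed to sell ?"))
print(top_k_vocab([["a", "b", "a"], ["a", "c", "b"]], k=2))
```

The notebook itself relies on `build_vocab_from_iterator(..., max_tokens=...)` for the frequency cut-off; the counter above only shows the idea.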

# Architecture Details
1. Both encoder & decoder have: LSTM, 2 layers, 600 hidden dim. The encoder is bidirectional; the decoder generates left to right, so it is unidirectional.
2. SGD Optimizer
3. Initial LR 1.0, lowered to 0.5 at epoch 8
4. batch size 64
5. Dropout 0.3 between vertical LSTM stacks
6. Gradient Clipping when norm exceeds 5
7. Max epoch 15
8. Attention-based encoding: a bidirectional LSTM encoder produces a hidden state $h_t$ for each token of the input sequence $x$, where $t = 1, \dots, |x|$. The attention-based encoding $c_t$ is the weighted average of all the hidden states $h_i$ in that sentence: $c_t = \sum_{i} a_{i,t} h_i$.
9. The weights form an attention matrix of shape $(|x|, |x|)$. \
    The attention of $i$ to $t$ is $a_{i,t} = \frac{\exp(h_i^\top W h_t)}{\sum_{j} \exp(h_j^\top W h_t)}$, where $W$ is a learnable parameter and the sum in the denominator runs over all tokens $j$ of the sentence.
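
The learning-rate schedule (item 3) and gradient clipping (item 6) can be written out in plain Python. This is a sketch under two assumptions that the notes do not state explicitly: epochs are counted from 1, and the LR stays at 0.5 from epoch 8 onward. The function names are hypothetical.

```python
import math


def learning_rate(epoch, initial_lr=1.0, drop_epoch=8, dropped_lr=0.5):
    """Return the SGD learning rate for a 1-indexed epoch (assumed schedule)."""
    return initial_lr if epoch < drop_epoch else dropped_lr


def clip_by_norm(grads, max_norm=5.0):
    """Rescale a flat list of gradient values when their L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


for epoch in (1, 7, 8, 15):
    print(epoch, learning_rate(epoch))

# the norm of [6, 8] is 10, so the gradient is scaled by 0.5
print(clip_by_norm([6.0, 8.0]))
```

In PyTorch, `torch.nn.utils.clip_grad_norm_` applies this kind of rescaling to the parameter gradients in place.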

# Inference
1. Beam Size 3
2. Generate until `<EOS>` tag found
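
Beam search with beam size 3 can be sketched independently of the model. The version below is a minimal illustration, not the decoding code of the original repo: `step_fn` is a hypothetical callback that returns next-token log-probabilities for a prefix, `toy_step` is a made-up lookup table standing in for the trained decoder, and no length normalization is applied.

```python
import math


def beam_search(step_fn, sos="<SOS>", eos="<EOS>", beam_size=3, max_len=10):
    """Return (tokens, log-probability) of the best sequence found.

    step_fn(prefix) must return a dict {token: log_probability}.
    """
    beams = [([sos], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in step_fn(tokens).items():
                candidates.append((tokens + [token], score + logp))
        # keep only the beam_size best extensions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))  # stop extending at <EOS>
            else:
                beams.append((tokens, score))
        if not beams:
            break
    # fall back to unfinished hypotheses if none reached <EOS>
    pool = finished or beams
    return max(pool, key=lambda c: c[1])


def toy_step(prefix):
    """Toy next-token distribution that depends only on the last token."""
    table = {
        "<SOS>": {"what": math.log(0.6), "who": math.log(0.4)},
        "what": {"is": math.log(0.9), "<EOS>": math.log(0.1)},
        "who": {"is": math.log(0.5), "<EOS>": math.log(0.5)},
        "is": {"<EOS>": math.log(1.0)},
    }
    return table[prefix[-1]]


best_tokens, best_score = beam_search(toy_step)
print(best_tokens, math.exp(best_score))
```

With the real model, `step_fn` would run one decoder step and return the log-softmax over the vocabulary.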

I ----------- $h_0$ \
am -----------$h_1$ \
human -------- $h_2$ \
android ------$h_3$ 

t = 2 \
encoding of `human` = attention `i` to `human` x `i` state \
               \+ attention `am` to `human` x `am` state

Thus, the encoding of a token is the weighted average of the hidden states of all the tokens in the sentence.
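
The weighted average can be checked on the four-token example above. This is a small numerical sketch of the description in these notes: the hidden size `d = 4` and the random values for the hidden states `H` and the parameter `W` are placeholders, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["i", "am", "human", "android"]
d = 4                                    # hypothetical hidden size
H = rng.normal(size=(len(tokens), d))    # row i is the hidden state h_i
W = rng.normal(size=(d, d))              # stands in for the learnable parameter

scores = H @ W @ H.T                     # scores[i, t] = h_i^T W h_t
scores -= scores.max(axis=0, keepdims=True)  # stabilise exp; softmax is unchanged
A = np.exp(scores)
A /= A.sum(axis=0, keepdims=True)        # normalise over i: each column sums to 1

t = tokens.index("human")
c_t = A[:, t] @ H                        # weighted average of all hidden states
print(A.shape, c_t.shape)
```

Column `t` of `A` holds the attention of every token to `human`, so `c_t` mixes all four hidden states.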

# Import

In [1]:
from pathlib import Path
import io

import numpy as np

import torch
from torch import nn

# Look at Data
The data is already preprocessed and was collected from the original repo.

In [2]:
data_root = Path('data/processed')

In [3]:
_n = 5
with open(data_root / 'src-train.txt') as f:
    _sentences = [f.readline() for i in range(_n)]

with open(data_root / 'tgt-train.txt') as f:
    _questions = [f.readline() for i in range(_n)]

for s, q in zip(_sentences, _questions):
    print(s)
    print()
    print(q)
    print('\n\n')

a pub / pʌb / , or public house is , despite its name , a private house , but is called a public house because it is licensed to sell alcohol to the general public . 


what is a pub licensed to sell ?




a pub / pʌb / , or public house is , despite its name , a private house , but is called a public house because it is licensed to sell alcohol to the general public . 


what is the term ` pub ' short for ?




the writings of samuel pepys describe the pub as the heart of england . 


who said that pubs are the heart of england ?




the history of pubs can be traced back to roman taverns , through the anglo-saxon alehouse to the development of the modern tied house system in the 19th century . 


how far back does the history of pubs go back ?




the history of pubs can be traced back to roman taverns , through the anglo-saxon alehouse to the development of the modern tied house system in the 19th century . 


what is a pub tied to in the 19th century ?






# Vocabulary

In [4]:
import torchtext
from torchtext.vocab import build_vocab_from_iterator, GloVe

In [5]:
def yield_token(text_file_path):
    with io.open(text_file_path, encoding='utf-8') as f:
        for line in f:
            yield line.strip().split()



In [6]:
%%time
sentence_vocab = build_vocab_from_iterator(yield_token(data_root / 'src-train.txt'), 
                                           max_tokens=45000)
question_vocab = build_vocab_from_iterator(yield_token(data_root / 'tgt-train.txt'), 
                                           max_tokens=28000)

CPU times: user 801 ms, sys: 100 ms, total: 901 ms
Wall time: 905 ms


In [7]:
# merge two vocabs once collected from separate corpus
vocab = torchtext.vocab.Vocab(sentence_vocab)

vocab.append_token('<SOS>')
vocab.append_token('<EOS>')
vocab.append_token('<UNK>')
vocab.set_default_index(vocab['<UNK>'])

for token in question_vocab.get_itos():
    if token not in vocab:
        vocab.append_token(token)

print(len(vocab))

49825


In [8]:
embedding_vector = torch.zeros(size=(len(vocab), 300))
glove = GloVe(cache="data/")
for index in range(len(vocab)):
    embedding_vector[index] = glove[vocab.lookup_token(index)]

In [9]:
# embedding_layer = nn.Embedding.from_pretrained(embeddings=embedding_vector)

In [16]:
corpus = ['the brown fox jumped', 'have a little faith']
_indexed = [vocab(document.split()) for document in corpus]
indexed = torch.tensor(_indexed, dtype=torch.long)
indexed

tensor([[    0,  1840,  3326, 45002],
        [   36,     7,   771,  1298]])


# Model

In [46]:
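class EncoderDecoder(nn.Module):
    def __init__(self, embedding_vector, embedding_dim=300, hidden_dim=8, bidirectional=False):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(embedding_vector)

        self.encoder = nn.LSTM(input_size=embedding_dim,
                               hidden_size=hidden_dim,
                               batch_first=True,
                               bidirectional=bidirectional)

        # self.decoder = nn.LSTM(input_size=embedding_dim,
        #                       hidden_size=hidden_dim,
        #                       batch_first=True,
        #                       bidirectional=False)

        self.decoder_lstm_cell = nn.LSTMCell(input_size=embedding_dim,
                                             hidden_size=hidden_dim)

        # a bidirectional encoder concatenates both directions, so b_i has 2 * hidden_dim features
        encoder_dim = 2 * hidden_dim if bidirectional else hidden_dim
        # plays the role of W_b in the score h_t^T W_b b_i
        self.attn_linear = nn.Linear(in_features=encoder_dim,
                                     out_features=hidden_dim)

    def forward(self, source):
        """
        Parameters
        ----------
        source: torch.Tensor (N, L) of token indices.

        Returns
        -------
        torch.Tensor (N, T, encoder_dim), one context vector per decoding step.
        """
        source = self.embedding(source)
        # b (N, L, encoder_dim), hT & cT (num_directions, N, d), hidden states from the last timestep
        b, (hT, cT) = self.encoder(source)
        return self.decoder(b)

    def decoder(self, src_states, sos_index=45000, max_steps=6):
        """
        Parameters
        ----------
        src_states: torch.Tensor (N, L, encoder_dim)
        sos_index: index of `<SOS>` in the merged vocab built above
        max_steps: number of decoding steps
        """
        batch_size = src_states.size(0)
        prefix = torch.full((batch_size, 1), sos_index,
                            dtype=torch.long, device=src_states.device)
        state = None  # LSTMCell starts from zero states
        contexts = []

        for _ in range(max_steps):
            embeddings = self.embedding(prefix)
            # feed the last token of the prefix; indexing with a step counter
            # raised IndexError because the prefix never grows
            ht, ct = self.decoder_lstm_cell(embeddings[:, -1, :], state)
            state = (ht, ct)
            contexts.append(self.attention(src_states, ht))
            # TODO: predict the next token from (ht, context) and append it to prefix

        return torch.stack(contexts, dim=1)

    def attention(self, src_states, ht):
        """
        Parameters
        ----------
        src_states: torch.Tensor (N, L, encoder_dim)
        ht: torch.Tensor (N, d), decoder hidden state at the current step
        """
        # W_b b_i for every source position: (N, L, d)
        projected = self.attn_linear(src_states)
        # h_t^T W_b b_i: (N, L, d) @ (N, d, 1) -> (N, L)
        scores = torch.bmm(projected, ht.unsqueeze(2)).squeeze(2)
        weights = torch.softmax(scores, dim=1)
        # weighted average of the source states: (N, 1, L) @ (N, L, encoder_dim) -> (N, encoder_dim)
        return torch.bmm(weights.unsqueeze(1), src_states).squeeze(1)

In [47]:
model = EncoderDecoder(embedding_vector)
contexts = model(indexed)
print(contexts.shape)

torch.Size([2, 6, 8])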

<div>
<img src="data/attention.png" width="300"/>
</div>


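Consider a single sample with hidden size 12: $h_t$ has shape (12, 1), $h_t^\top$ has shape (1, 12), $b_i$ has shape (12, 1), and $a_{i,t}$ is a float. $W_b b_i$ must produce a (12, 1) vector so that $h_t^\top W_b b_i$ is a scalar, so $W_b$ has shape (12, 12).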
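
The shapes can be verified numerically. A minimal sketch with random placeholder values, assuming hidden size 12: for the score $h_t^\top W_b b_i$ to be a single number, $W_b b_i$ has to be a (12, 1) vector, which requires a (12, 12) matrix $W_b$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 12
h_t = rng.normal(size=(d, 1))   # decoder state at step t
b_i = rng.normal(size=(d, 1))   # encoder state of source token i
W_b = rng.normal(size=(d, d))   # learnable parameter

projected = W_b @ b_i           # (12, 12) @ (12, 1) -> (12, 1)
score = h_t.T @ projected       # (1, 12) @ (12, 1) -> (1, 1), a single number
print(projected.shape, score.shape)
```

Taking `exp(score)` and normalizing over all source tokens `i` then gives the attention weight.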