## Attention over attention baseline model

<img src="attention_over_attention_baseline.jpeg.jpg"> 

## In this baseline model, we do as follows:
1. **Contextual Embedding**:<br>
$h_D=\text{bi-GRU}(D)$<br>
$h_Q=\text{bi-GRU}(Q)$<br>
2. **Pairwise Matching Score**: <br>
$M(i,j)=h_D(i)^Th_Q(j)$<br>
3. **Compute Query-to-Document Attention and vice-versa**:<br>
$\alpha(t)=softmax(M(1,t),M(2,t),...,M(|D|,t))$ for $t=1,2,...,|Q|$<br>
$\beta(t)=softmax(M(t,1),M(t,2),...,M(t,|Q|))$ for $t=1,2,...,|D|$<br>
where |D| is the number of tokens in the document <br>
and |Q| is the number of tokens in the query <br>
4. **Attention over Attention**:<br>
$\beta=\frac{1}{|D|}\sum_{t=1}^{|D|}\beta(t)$<br>
$s=\alpha^T\beta$<br>
5. **Prediction**: <br>
$P(w|D,Q)=\sum_{i \in I(w,D)}s_i$<br>
where I(w,D) is the set of positions at which word w appears in D
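
Steps 2 to 5 above can be sketched in a few lines of PyTorch. This is a minimal illustration for a single document/query pair, not the code used in `aoareader`; the function names `attention_over_attention` and `word_probability` are hypothetical, and the contextual embeddings are assumed to be given as plain matrices.

```python
import torch
import torch.nn.functional as F


def attention_over_attention(h_doc, h_query):
    """h_doc: (|D|, d) and h_query: (|Q|, d). Returns s with shape (|D|,)."""
    M = h_doc @ h_query.t()        # (|D|, |Q|) pairwise matching scores
    alpha = F.softmax(M, dim=0)    # column t is alpha(t): softmax over document positions
    beta = F.softmax(M, dim=1)     # row t is beta(t): softmax over query positions
    beta_avg = beta.mean(dim=0)    # (|Q|,) average of beta(t) over the document
    # With alpha stored as a (|D|, |Q|) matrix, s = alpha^T beta is this product
    return alpha @ beta_avg


def word_probability(s, doc_tokens, word):
    """Sum the scores s_i over all positions where `word` occurs in the document."""
    mask = torch.tensor([1.0 if tok == word else 0.0 for tok in doc_tokens])
    return (s * mask).sum()


torch.manual_seed(0)
h_doc, h_query = torch.randn(5, 4), torch.randn(3, 4)
s = attention_over_attention(h_doc, h_query)
p_a = word_probability(s, ["a", "b", "a", "c", "d"], "a")
```

Because every column of `alpha` and the averaged `beta` each sum to 1, the entries of `s` also sum to 1, so `s` can be read as a distribution over document positions.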

## Drawbacks of this model:
-  The number of tokens in a document is usually very large (in this dataset the maximum length is 2000), so the GRU may not capture such long-range dependencies well.
- In addition, the GRU processes tokens sequentially, which makes training slow, so we should find a way to make the model run faster.


## Proposed model
-  The Transformer is a model that uses the attention mechanism to compute the representation of a sequence in parallel:
<img src="Transformer_Model.jpg"> 

- Since we only need to **replace the GRU** with a different mechanism that overcomes its drawbacks, we only need the Transformer Encoder part:
<img src="Transformer_Encoder.jpg"> 

## In this component, we compute the representation of the input as follows:
-  Attention mechanism: $Attention(Q,K,V)=softmax(\frac{QK^T}{\sqrt{d_k}})V$ <br>
This generalizes the attention mechanism familiar from RNN models: for example, in an encoder-decoder scheme, K and V can be the representations of the data on the encoder side, and Q is the vector whose attention over K and V we want to compute.
- Multi-head Attention: $Multihead(Q,K,V)=concat(head_1,head_2,...,head_h)W^O$ <br>
where $head_i=Attention(QW_i^Q,KW_i^K,VW_i^V)$

- Feed Forward: $FFN(x)=RELU(xW_1+b_1)W_2 +b_2$
- Positional Encoding: We represent the position of each token by <br>
$PE(pos,2i)=sin(\frac{pos}{10000^{2i/d_{model}}})$<br>
$PE(pos,2i+1)=cos(\frac{pos}{10000^{2i/d_{model}}})$<br>
Note that PE(pos) is a vector of size $d_{model}$ whose entries at indices 2i and 2i+1 are given by the formulas above.
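
The attention formula in the list above can be written directly as a small function. The sketch below is only an illustration of the formula with a single head and no masking or dropout; the name `scaled_dot_product_attention` and the tensor sizes are assumptions chosen for the example.

```python
import math

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(Q, K, V):
    """Q: (batch, n_q, d_k), K: (batch, n_k, d_k), V: (batch, n_k, d_v)."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, n_q, n_k)
    weights = F.softmax(scores, dim=-1)                # each query's weights sum to 1
    return weights @ V, weights                        # (batch, n_q, d_v), (batch, n_q, n_k)


torch.manual_seed(0)
Q = torch.randn(2, 3, 8)
K = torch.randn(2, 5, 8)
V = torch.randn(2, 5, 6)
context, weights = scaled_dot_product_attention(Q, K, V)
```

Each output row is a weighted average of the rows of `V`, so the output has one vector of size `d_v` per query position.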

Below we implement the Transformer encoder and then run the model with the bidirectional GRU replaced by the Transformer encoder.

Because the data processing for this model differs from that of the other models, please refer to the README file for details. The data have already been saved in a PyTorch-loadable format and are available at https://drive.google.com/open?id=11ii6U7-Nz2bVm8zKBff0H1K9uuH7QC-i <br>
Please copy the data files in cnn_data (Google Drive folder) to the directory ./data/cnn/ <br>
and the files in cnn_models (Google Drive folder) to the directory ./cnn_models

Moreover, if you have the notebook and the folders separately, then to run this notebook successfully, please place the notebook in the same parent directory as aoareader, data, model_cnn, and models, as follows: <br>
. <br>
├──Analyze Model.ipynb <br>
├── aoareader <br>
├── data <br>
├── model_cnn <br>
├── models <br>

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import math
from aoareader import Constants
import aoareader as reader

First, we define the positional encoding

In [2]:
def _gen_timing_signal(length, channels, min_timescale=1.0, max_timescale=1.0e4):
    position = np.arange(length)
    num_timescales = channels // 2
    log_timescale_increment = (
                    math.log(float(max_timescale) / float(min_timescale)) /
                    (float(num_timescales) - 1))
    inv_timescales = min_timescale * np.exp(
                    np.arange(num_timescales).astype(float) * -log_timescale_increment)
    scaled_time = np.expand_dims(position, 1) * np.expand_dims(inv_timescales, 0)


    signal = np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)
    signal = np.pad(signal, [[0, 0], [0, channels % 2]], 
                    'constant', constant_values=[0.0, 0.0])
    signal =  signal.reshape([1, length, channels])

    return torch.from_numpy(signal).type(torch.FloatTensor)
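
For comparison, the positional encoding formula can also be implemented literally. The sketch below is an assumption-labelled illustration, not a drop-in replacement: unlike `_gen_timing_signal`, which concatenates all sine channels followed by all cosine channels and divides the log-timescale range by `num_timescales - 1`, this hypothetical `sinusoidal_positional_encoding` interleaves sine and cosine at indices 2i and 2i+1 exactly as in the formula. It assumes `d_model` is even.

```python
import numpy as np


def sinusoidal_positional_encoding(length, d_model):
    """Return a (length, d_model) array with PE(pos, 2i) = sin and PE(pos, 2i+1) = cos."""
    pos = np.arange(length)[:, None]        # (length, 1)
    i = np.arange(d_model // 2)[None, :]    # (1, d_model / 2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angle)             # even indices
    pe[:, 1::2] = np.cos(angle)             # odd indices
    return pe


pe = sinusoidal_positional_encoding(4, 6)
```

Both layouts give every position a distinct pattern of sines and cosines at geometrically spaced frequencies; they differ only in channel order and in the exact frequency spacing.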

Now we define the multi-head attention

In [3]:
class MultiHeadAttention(nn.Module):
    
    def __init__(self, input_depth, total_key_depth, total_value_depth, output_depth, 
                 num_heads, bias_mask=None, dropout=0.0):
        
        super(MultiHeadAttention, self).__init__()
        
        if total_key_depth % num_heads != 0:
            raise ValueError("Key depth (%d) must be divisible by the number of "
                             "attention heads (%d)." % (total_key_depth, num_heads))
        if total_value_depth % num_heads != 0:
            raise ValueError("Value depth (%d) must be divisible by the number of "
                             "attention heads (%d)." % (total_value_depth, num_heads))
            
        self.num_heads = num_heads
        self.query_scale = (total_key_depth//num_heads)**-0.5
        self.bias_mask = bias_mask
        
        # Key and query depth will be same
        self.query_linear = nn.Linear(input_depth, total_key_depth, bias=False)
        self.key_linear = nn.Linear(input_depth, total_key_depth, bias=False)
        self.value_linear = nn.Linear(input_depth, total_value_depth, bias=False)
        self.output_linear = nn.Linear(total_value_depth, output_depth, bias=False)
        
        self.dropout = nn.Dropout(dropout)
    
    def _split_heads(self, x):
        
        if len(x.shape) != 3:
            raise ValueError("x must have rank 3")
        shape = x.shape
        return x.view(shape[0], shape[1], self.num_heads, shape[2]//self.num_heads).permute(0, 2, 1, 3)
    
    def _merge_heads(self, x):
        
        if len(x.shape) != 4:
            raise ValueError("x must have rank 4")
        shape = x.shape
        return x.permute(0, 2, 1, 3).contiguous().view(shape[0], shape[2], shape[3]*self.num_heads)
        
    def forward(self, queries, keys, values):
        
        # Do a linear for each component
        queries = self.query_linear(queries)
        keys = self.key_linear(keys)
        values = self.value_linear(values)
        
        # Split into multiple heads
        queries = self._split_heads(queries)
        keys = self._split_heads(keys)
        values = self._split_heads(values)
        
        # Scale queries
        queries *= self.query_scale
        
        # Combine queries and keys
        logits = torch.matmul(queries, keys.permute(0, 1, 3, 2))
        
        # Add bias to mask future values
        if self.bias_mask is not None:
            logits += self.bias_mask[:, :, :logits.shape[-2], :logits.shape[-1]].type_as(logits.data)
        
        # Convert to probabilities
        weights = nn.functional.softmax(logits, dim=-1)
        
        # Dropout
        weights = self.dropout(weights)
        
        # Combine with values to get context
        contexts = torch.matmul(weights, values)
        
        # Merge heads
        contexts = self._merge_heads(contexts)
        #contexts = torch.tanh(contexts)
        
        # Linear to get output
        outputs = self.output_linear(contexts)
        
        return outputs
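
The head splitting and merging used in `MultiHeadAttention` is purely a reshape, which a small standalone check makes visible. This is a sketch with free functions (`split_heads`, `merge_heads`) that mirror the two private methods above; the tensor sizes are arbitrary example values.

```python
import torch


def split_heads(x, num_heads):
    """(batch, time, depth) -> (batch, heads, time, depth / heads)."""
    b, t, d = x.shape
    return x.view(b, t, num_heads, d // num_heads).permute(0, 2, 1, 3)


def merge_heads(x):
    """(batch, heads, time, depth / heads) -> (batch, time, depth)."""
    b, h, t, d = x.shape
    return x.permute(0, 2, 1, 3).contiguous().view(b, t, h * d)


x = torch.randn(2, 5, 8)
heads = split_heads(x, num_heads=4)   # each head sees a 2-dimensional slice of every token
restored = merge_heads(heads)         # merging undoes the split exactly
```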

Next come the position-wise feed-forward network and the Conv layer it can use

In [4]:
class PositionwiseFeedForward(nn.Module):
    """
    Does a Linear + RELU + Linear on each of the timesteps
    """
    def __init__(self, input_depth, filter_size, output_depth, layer_config='ll', padding='left', dropout=0.0):
        
        super(PositionwiseFeedForward, self).__init__()
        
        layers = []
        sizes = ([(input_depth, filter_size)] + 
                 [(filter_size, filter_size)]*(len(layer_config)-2) + 
                 [(filter_size, output_depth)])

        for lc, s in zip(list(layer_config), sizes):
            if lc == 'l':
                layers.append(nn.Linear(*s))
            elif lc == 'c':
                layers.append(Conv(*s, kernel_size=3, pad_type=padding))
            else:
                raise ValueError("Unknown layer type {}".format(lc))

        self.layers = nn.ModuleList(layers)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, inputs):
        x = inputs
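        for i, layer in enumerate(self.layers):
            x = layer(x)
            # NOTE: this condition is always true, so ReLU and dropout are also
            # applied after the last layer; use `len(self.layers) - 1` to match
            # the FFN(x) formula exactly.
            if i < len(self.layers):
                x = self.relu(x)
                x = self.dropout(x)

        return x


class Conv(nn.Module):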
    
    def __init__(self, input_size, output_size, kernel_size, pad_type):
        
        super(Conv, self).__init__()
        padding = (kernel_size - 1, 0) if pad_type == 'left' else (kernel_size//2, (kernel_size - 1)//2)
        self.pad = nn.ConstantPad1d(padding, 0)
        self.conv = nn.Conv1d(input_size, output_size, kernel_size=kernel_size, padding=0)

    def forward(self, inputs):
        inputs = self.pad(inputs.permute(0, 2, 1))
        outputs = self.conv(inputs).permute(0, 2, 1)

        return outputs
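
The two padding modes of `Conv` both keep the sequence length unchanged: `'left'` pads `kernel_size - 1` zeros before the sequence, so each output position only sees the current and earlier positions, while the other mode splits the padding across both sides. A minimal check of the length-preserving property, using a hypothetical helper `padded_conv_length` and arbitrary channel sizes:

```python
import torch
import torch.nn as nn


def padded_conv_length(seq_len, kernel_size, pad_type):
    """Return the output length of a padded 1-D convolution over `seq_len` steps."""
    if pad_type == 'left':
        padding = (kernel_size - 1, 0)
    else:
        padding = (kernel_size // 2, (kernel_size - 1) // 2)
    pad = nn.ConstantPad1d(padding, 0)
    conv = nn.Conv1d(4, 6, kernel_size=kernel_size, padding=0)
    x = torch.randn(2, 4, seq_len)   # (batch, channels, time), as Conv1d expects
    return conv(pad(x)).shape[-1]


lengths = {(k, p): padded_conv_length(10, k, p) for k in (3, 4, 5) for p in ('left', 'both')}
```

In every case the total padding is `kernel_size - 1`, which is exactly what an unpadded convolution with stride 1 removes.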

Then the LayerNorm layer

In [5]:
class LayerNorm(nn.Module):

    def __init__(self, features, eps=1e-6):
        super(LayerNorm, self).__init__()
        self.gamma = nn.Parameter(torch.ones(features))
        self.beta = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta
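
As a quick sanity check of what this layer computes, the same normalization can be applied as a plain function. This is a sketch with fixed `gamma = 1` and `beta = 0`; the function name `layer_norm` is hypothetical. Note that this variant uses the unbiased standard deviation and adds `eps` to the standard deviation, whereas `torch.nn.LayerNorm` uses the biased variance and adds `eps` inside the square root, so the two are close but not identical.

```python
import torch


def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize the last dimension to zero mean and (approximately) unit std."""
    mean = x.mean(-1, keepdim=True)
    std = x.std(-1, keepdim=True)    # unbiased (n - 1) estimate by default
    return gamma * (x - mean) / (std + eps) + beta


torch.manual_seed(0)
x = torch.randn(2, 3, 8) * 5 + 2     # arbitrary scale and shift
y = layer_norm(x, torch.ones(8), torch.zeros(8))
```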

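Now we combine MultiHeadAttention, LayerNorm, and PositionwiseFeedForward into one encoder layer (see the figure above). Note that in this implementation LayerNorm is applied to the input of each sub-layer, with the residual connection added afterwards, and that the feed-forward part is configured with two convolution layers (`layer_config='cc'`).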

In [6]:
class EncoderLayer(nn.Module):
    
    def __init__(self, hidden_size, total_key_depth, total_value_depth, filter_size, num_heads,
                 bias_mask=None, layer_dropout=0.0, attention_dropout=0.0, relu_dropout=0.0):
                
        super(EncoderLayer, self).__init__()
        
        self.multi_head_attention = MultiHeadAttention(hidden_size, total_key_depth, total_value_depth, 
                                                       hidden_size, num_heads, bias_mask, attention_dropout)
        
        self.positionwise_feed_forward = PositionwiseFeedForward(hidden_size, filter_size, hidden_size,
                                                                 layer_config='cc', padding = 'both', 
                                                                 dropout=relu_dropout)
        self.dropout = nn.Dropout(layer_dropout)
        self.layer_norm_mha = LayerNorm(hidden_size)
        self.layer_norm_ffn = LayerNorm(hidden_size)
        
    def forward(self, inputs):
        x = inputs
        
        # Layer Normalization
        x_norm = self.layer_norm_mha(x)
        
        # Multi-head attention
        y = self.multi_head_attention(x_norm, x_norm, x_norm)
        
        # Dropout and residual
        x = self.dropout(x + y)
        
        # Layer Normalization
        x_norm = self.layer_norm_ffn(x)
        
        # Positionwise Feedforward
        y = self.positionwise_feed_forward(x_norm)
        
        # Dropout and residual
        y = self.dropout(x + y)
        
        return y
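
The pattern that `EncoderLayer.forward` applies twice (normalize, run a sub-layer, add the result back to the input, apply dropout) can be isolated in a small wrapper. The sketch below is illustrative only: the class name `PreNormResidual` is hypothetical, and `torch.nn.LayerNorm` and a plain `nn.Linear` stand in for the notebook's own `LayerNorm` and sub-layers.

```python
import torch
import torch.nn as nn


class PreNormResidual(nn.Module):
    """Computes dropout(x + sublayer(norm(x))), as in each half of EncoderLayer."""

    def __init__(self, size, sublayer, dropout=0.0):
        super().__init__()
        self.norm = nn.LayerNorm(size)   # stand-in for the notebook's LayerNorm
        self.sublayer = sublayer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.dropout(x + self.sublayer(self.norm(x)))


block = PreNormResidual(8, nn.Linear(8, 8))
y = block(torch.randn(2, 5, 8))          # output keeps the input shape
```

Because of the residual connection, a sub-layer that outputs zeros leaves the input unchanged, which is one reason stacks of such layers are comparatively easy to train.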

With the positional encoding in place, we can now define the Transformer encoder

In [7]:
class Encoder(nn.Module):    
    def __init__(self, embedding_size, hidden_size, num_layers, num_heads, total_key_depth, total_value_depth,
                 filter_size, max_length=100, input_dropout=0.0, layer_dropout=0.0, 
                 attention_dropout=0.0, relu_dropout=0.0, use_mask=False):
        
        super(Encoder, self).__init__()
        
        self.timing_signal = _gen_timing_signal(max_length, hidden_size)
        
        params =(hidden_size, 
                 total_key_depth or hidden_size,
                 total_value_depth or hidden_size,
                 filter_size, 
                 num_heads, 
                 # _gen_bias_mask is not defined in this notebook; it is only
                 # called when use_mask=True.
                 _gen_bias_mask(max_length) if use_mask else None,
                 layer_dropout, 
                 attention_dropout, 
                 relu_dropout)

        self.embedding_proj = nn.Linear(embedding_size, hidden_size, bias=False)
        self.enc = nn.Sequential(*[EncoderLayer(*params) for _ in range(num_layers)])
        
        self.layer_norm = LayerNorm(hidden_size)
        self.input_dropout = nn.Dropout(input_dropout)
        
    
    def forward(self, inputs):
        #Add input dropout

        x = self.input_dropout(inputs)
        
        # Project to hidden size
        x = self.embedding_proj(x)
        
        # Add timing signal
        x += self.timing_signal[:, :inputs.shape[1], :].type_as(inputs.data)
        
        y = self.enc(x)
        
        y = self.layer_norm(y)
        return y

Now we test our implementation of the Transformer encoder

In [8]:
check_transformer = Encoder(384, 384, 1, 1, 384, 384, 64, max_length=2000, input_dropout=0.0, layer_dropout=0.0,
                            attention_dropout=0.0, relu_dropout=0.0, use_mask=False)
check_transformer.cuda()

Encoder(
  (embedding_proj): Linear(in_features=384, out_features=384, bias=False)
  (enc): Sequential(
    (0): EncoderLayer(
      (multi_head_attention): MultiHeadAttention(
        (query_linear): Linear(in_features=384, out_features=384, bias=False)
        (key_linear): Linear(in_features=384, out_features=384, bias=False)
        (value_linear): Linear(in_features=384, out_features=384, bias=False)
        (output_linear): Linear(in_features=384, out_features=384, bias=False)
        (dropout): Dropout(p=0.0)
      )
      (positionwise_feed_forward): PositionwiseFeedForward(
        (layers): ModuleList(
          (0): Conv(
            (pad): ConstantPad1d(padding=(1, 1), value=0)
            (conv): Conv1d(384, 64, kernel_size=(3,), stride=(1,))
          )
          (1): Conv(
            (pad): ConstantPad1d(padding=(1, 1), value=0)
            (conv): Conv1d(64, 384, kernel_size=(3,), stride=(1,))
          )
        )
        (relu): ReLU()
        (dropout): Dropout(p=0.

Now we try it on one batch of the training data

In [9]:
def sort_batch(data, seq_len):
    sorted_seq_len, sorted_idx = torch.sort(seq_len, dim=0, descending=True)
    sorted_data = data[sorted_idx.data]
    _, reverse_idx = torch.sort(sorted_idx, dim=0, descending=False)
    return sorted_data, sorted_seq_len.cuda(), reverse_idx.cuda()


vocab_dict = torch.load('data/dict.pt')
train_data = torch.load('data/train.txt.pt')
train_dataset = reader.Dataset(train_data, 64, True)
(batch_docs, batch_docs_len, doc_mask), (batch_querys, batch_querys_len, query_mask), batch_answers, candidates = train_dataset[0]
check_embedding = nn.Embedding(vocab_dict.size(), 384, padding_idx=Constants.PAD).cuda()
s_docs, s_docs_len, reverse_docs_idx = sort_batch(batch_docs, batch_docs_len)
docs_embedding = check_embedding(s_docs)
docs_outputs = check_transformer(docs_embedding)
print(docs_outputs.size())

torch.Size([64, 564, 384])
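
The `sort_batch` helper returns a reverse index so that the original batch order can be restored after processing. A small CPU-only sketch of that round trip, using a hypothetical `sort_by_length` without the `.cuda()` calls and a toy batch:

```python
import torch


def sort_by_length(data, seq_len):
    """Sort rows by descending length and return an index that undoes the sort."""
    sorted_len, sorted_idx = torch.sort(seq_len, dim=0, descending=True)
    sorted_data = data[sorted_idx]
    _, reverse_idx = torch.sort(sorted_idx, dim=0, descending=False)
    return sorted_data, sorted_len, reverse_idx


data = torch.tensor([[1, 0, 0], [2, 2, 2], [3, 3, 0]])   # rows padded with 0
lengths = torch.tensor([1, 3, 2])
sorted_data, sorted_len, reverse_idx = sort_by_length(data, lengths)
restored = sorted_data[reverse_idx]                       # back in the original order
```

Sorting the sorted indices gives the inverse permutation, which is why indexing with `reverse_idx` recovers the input order.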


If you do not have enough GPU memory, consider restarting the kernel and running from here.

In [1]:
import aoareader as reader
import torch
import time

Here we create a class that holds all the parameters, for the case where we want to train from scratch.
For convenience, we use the test data as the validation data.
To fully understand the parameters, please refer to the file train.py


In [2]:
class opt(object):
    def __init__(self):
        self.traindata='data/cnn/train.txt.pt'
        self.validdata='data/cnn/test.txt.pt'
        self.dict='data/cnn/dict.pt'
        self.save_model='model_cnn'
        self.train_from=''
        self.hidden_size=384
        self.embed_size=384
        self.batch_size=64
        self.dropout=0.1
        self.start_epoch=1
        self.epochs=13
        self.learning_rate=0.001
        self.weight_decay=0.0001
        self.gpu=0
        self.log_interval=100
opt=opt()
print(vars(opt))
if opt.gpu:
    torch.cuda.set_device(opt.gpu)

{'traindata': 'data/cnn/train.txt.pt', 'validdata': 'data/cnn/test.txt.pt', 'dict': 'data/cnn/dict.pt', 'save_model': 'model_cnn', 'train_from': '', 'hidden_size': 384, 'embed_size': 384, 'batch_size': 64, 'dropout': 0.1, 'start_epoch': 1, 'epochs': 13, 'learning_rate': 0.001, 'weight_decay': 0.0001, 'gpu': 0, 'log_interval': 100}


Define the loss function and the evaluation function

In [3]:
def loss_func(answers, pred_answers, answer_probs):
    # Keep the count as a 0-dim tensor; indexing it with [0] fails on recent PyTorch versions
    num_correct = (answers == pred_answers).sum()
    loss = - torch.mean(torch.log(answer_probs),0, keepdim=True)
    return loss.cuda(), num_correct
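
The loss above is the mean negative log-probability that the model assigns to the correct answers. A tiny worked example with made-up probabilities (the values and the helper name `mean_negative_log_likelihood` are assumptions for illustration):

```python
import math

import torch


def mean_negative_log_likelihood(answer_probs):
    """Average of -log p over the batch, where p is the probability of the true answer."""
    return -torch.log(answer_probs).mean()


answer_probs = torch.tensor([0.5, 0.25, 1.0])   # hypothetical probabilities of the correct answers
loss = mean_negative_log_likelihood(answer_probs)
expected = (math.log(2) + math.log(4) + 0.0) / 3  # equals log(2)

answers = torch.tensor([3, 7, 1])
pred_answers = torch.tensor([3, 2, 1])
num_correct = (answers == pred_answers).sum().item()
```

A probability of 1 for the correct answer contributes zero loss, and the loss grows without bound as that probability approaches 0.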

In [4]:
def eval(model, data):
    total_loss = 0.0
    total = 0.0
    total_correct = 0.0

    model.eval()
    for i in range(len(data)):
        (batch_docs, batch_docs_len, doc_mask), (batch_querys, batch_querys_len, query_mask), batch_answers, candidates = data[i]

        pred_answers, probs = model(batch_docs, batch_docs_len, doc_mask,
                                    batch_querys, batch_querys_len, query_mask,
                                    answers=batch_answers, candidates=candidates)

        loss, num_correct = loss_func(batch_answers, pred_answers, probs)

        total_in_minibatch = batch_answers.size(0)
        total_loss += loss.item() * total_in_minibatch
        total_correct += num_correct
        total += total_in_minibatch
        del loss, pred_answers, probs

    model.train()
    return total_loss / total, total_correct.type(torch.float) / total

In [5]:
def trainModel(model, trainData, validData, optimizer: torch.optim.Adam):
    print(model)
    start_time = time.time()

    def trainEpoch(epoch):
        trainData.shuffle()

        total_loss, total, total_num_correct = 0, 0, 0
        report_loss, report_total, report_num_correct = 0, 0, 0
        for i in range(len(trainData)):
            (batch_docs, batch_docs_len, doc_mask), (batch_querys, batch_querys_len, query_mask), batch_answers, candidates = trainData[i]

            model.zero_grad()
            pred_answers, answer_probs = model(batch_docs, batch_docs_len, doc_mask, batch_querys, batch_querys_len, query_mask,answers=batch_answers, candidates=candidates)

            loss, num_correct = loss_func(batch_answers, pred_answers, answer_probs)

            loss.backward()
            # Clamp each gradient element to [-5, 5]; skip parameters without gradients
            for parameter in model.parameters():
                if parameter.grad is not None:
                    parameter.grad.data.clamp_(-5.0, 5.0)
            # update the parameters
            optimizer.step()

            total_in_minibatch = batch_answers.size(0)

            batch_loss = loss.item()
            report_loss += batch_loss * total_in_minibatch
            report_num_correct += num_correct
            report_total += total_in_minibatch

            total_loss += batch_loss * total_in_minibatch
            total_num_correct += num_correct
            total += total_in_minibatch
            if i % opt.log_interval == 0:
                print("Epoch %2d, %5d/%5d; avg loss: %.2f; acc: %2.2f;  %6.0f s elapsed" %
                      (epoch, i+1, len(trainData),
                       report_loss / report_total,
                       (100.0*report_num_correct / report_total),
                       time.time()-start_time))
                valid_loss, valid_acc = eval(model, validData)
                print('=' * 20)
                print('Evaluating on validation set:')
                print('Validation loss: %.2f' % valid_loss)
                print('Validation accuracy: %2.2f' % (valid_acc * 100.0))
                print('=' * 20)

                report_loss = 0.0
                report_total = 0.0
                report_num_correct = 0.0
            del loss, pred_answers, answer_probs

        return total_loss / total, total_num_correct / total

    for epoch in range(1, opt.epochs + 1):
        print('')

        #  (1) train for one epoch on the training set
        train_loss, train_acc = trainEpoch(epoch)
        print('Epoch %d:\t average loss: %.2f\t train accuracy: %g' % (epoch, train_loss, train_acc*100))

        #  (2) evaluate on the validation set
        valid_loss, valid_acc = eval(model, validData)
        print('=' * 20)
        print('Evaluating on validation set:')
        print('Validation loss: %.2f' % valid_loss)
        print('Validation accuracy: %2.2f' % (valid_acc*100.0))
        print('=' * 20)

        model_state_dict = model.state_dict()
        optimizer_state_dict = optimizer.state_dict()
        #  (3) drop a checkpoint
        checkpoint = {
            'model': model_state_dict,
            'epoch': epoch,
            'optimizer': optimizer_state_dict,
            'opt': opt,
        }
        torch.save(checkpoint,
                   'model_cnn/%s_epoch%d_acc_%.2f.pt' % (opt.save_model, epoch, 100*valid_acc))
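
The training loop limits gradients element by element: every gradient entry is clamped to the range [-5, 5] before the optimizer step. A minimal illustration on a toy parameter (the numbers are chosen only for the example):

```python
import torch

w = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
loss = (w * torch.tensor([10.0, 3.0])).sum()   # d(loss)/dw = [10, 3]
loss.backward()

raw_grad = w.grad.clone()
w.grad.clamp_(-5.0, 5.0)                       # in-place, element-wise clamp
clamped_grad = w.grad.clone()
```

Unlike clipping by the global gradient norm, element-wise clamping can change the direction of the gradient, since only the entries that exceed the bound are altered.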

Here we resume training from a pretrained model

In [6]:
opt.train_from='model_cnn/model_cnn_epoch17_acc_69.48.pt'
train_from = opt.train_from
if opt.train_from:
    train_from = True
    checkpoint = torch.load(opt.train_from)
    opt = checkpoint['opt']
    opt.hidden_size=opt.gru_size

print("Loading dictrionary from ", opt.dict)
vocab_dict = torch.load(opt.dict)
print("Loading train data from ", opt.traindata)
train_data = torch.load(opt.traindata)
print("Loading valid data from ", opt.validdata)
valid_data = torch.load(opt.validdata)

train_dataset = reader.Dataset(train_data, opt.batch_size, True)
valid_dataset = reader.Dataset(valid_data, opt.batch_size, True, volatile=True)

print(' * vocabulary size = %d' %
      (vocab_dict.size()))
print(' * number of training samples. %d' %
      len(train_data['answers']))
print(' * maximum batch size. %d' % opt.batch_size)

print('Building model...')

model = reader.AoAReader(vocab_dict, dropout_rate=opt.dropout, embed_dim=opt.embed_size, hidden_dim=opt.hidden_size)
# no way on CPU
model.cuda()
# for parameter in model.parameters():
#     print(parameter)
if train_from:
    print('Loading model from checkpoint at %s' % opt.train_from)
    chk_model = checkpoint['model']
    model.load_state_dict(chk_model)
    opt.start_epoch = checkpoint['epoch'] + 1

optimizer = torch.optim.Adam(model.parameters(), lr=opt.learning_rate, weight_decay=opt.weight_decay)

if train_from:
    optimizer.load_state_dict(checkpoint['optimizer'])

nParams = sum([p.nelement() for p in model.parameters()])
print('* number of parameters: %d' % nParams)

trainModel(model, train_dataset, valid_dataset, optimizer)

Loading dictrionary from  data/cnn/dict.pt
Loading train data from  data/cnn/train.txt.pt
Loading valid data from  data/cnn/test.txt.pt
 * vocabulary size = 119656
 * number of training samples. 380298
 * maximum batch size. 64
Building model...


  weigth_init.orthogonal(weight.data)


Loading model from checkpoint at 
* number of parameters: 46835392
AoAReader(
  (embedding): Embedding(119656, 384, padding_idx=0)
  (transformer): Encoder(
    (embedding_proj): Linear(in_features=384, out_features=384, bias=False)
    (enc): Sequential(
      (0): EncoderLayer(
        (multi_head_attention): MultiHeadAttention(
          (query_linear): Linear(in_features=384, out_features=384, bias=False)
          (key_linear): Linear(in_features=384, out_features=384, bias=False)
          (value_linear): Linear(in_features=384, out_features=384, bias=False)
          (output_linear): Linear(in_features=384, out_features=384, bias=False)
          (dropout): Dropout(p=0.0)
        )
        (positionwise_feed_forward): PositionwiseFeedForward(
          (layers): ModuleList(
            (0): Conv(
              (pad): ConstantPad1d(padding=(1, 1), value=0)
              (conv): Conv1d(384, 64, kernel_size=(3,), stride=(1,))
            )
            (1): Conv(
              (pad)

  


Epoch  1,     1/ 5943; avg loss: 1.05; acc: 65.00;       2 s elapsed


  b = Variable(b, volatile=self.volatile, requires_grad=False)


Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 69.45
Epoch  1,   101/ 5943; avg loss: 0.96; acc: 71.00;      96 s elapsed
Evaluating on validation set:
Validation loss: 1.04
Validation accuracy: 69.45
Epoch  1,   201/ 5943; avg loss: 0.96; acc: 71.00;     194 s elapsed
Evaluating on validation set:
Validation loss: 1.04
Validation accuracy: 69.76
Epoch  1,   301/ 5943; avg loss: 0.98; acc: 70.00;     293 s elapsed
Evaluating on validation set:
Validation loss: 1.07
Validation accuracy: 69.23
Epoch  1,   401/ 5943; avg loss: 0.98; acc: 70.00;     387 s elapsed
Evaluating on validation set:
Validation loss: 1.05
Validation accuracy: 68.61
Epoch  1,   501/ 5943; avg loss: 0.98; acc: 71.00;     479 s elapsed
Evaluating on validation set:
Validation loss: 1.04
Validation accuracy: 69.26
Epoch  1,   601/ 5943; avg loss: 0.97; acc: 70.00;     575 s elapsed
Evaluating on validation set:
Validation loss: 1.05
Validation accuracy: 68.61
Epoch  1,   701/ 5943; avg loss: 

Epoch  1,  4401/ 5943; avg loss: 1.04; acc: 68.00;    4111 s elapsed
Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 68.95
Epoch  1,  4501/ 5943; avg loss: 1.02; acc: 69.00;    4204 s elapsed
Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 69.45
Epoch  1,  4601/ 5943; avg loss: 1.06; acc: 69.00;    4295 s elapsed
Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 69.67
Epoch  1,  4701/ 5943; avg loss: 1.06; acc: 68.00;    4383 s elapsed
Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 69.64
Epoch  1,  4801/ 5943; avg loss: 1.02; acc: 69.00;    4475 s elapsed
Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 69.01
Epoch  1,  4901/ 5943; avg loss: 1.06; acc: 68.00;    4566 s elapsed
Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 69.48
Epoch  1,  5001/ 5943; avg loss: 1.07; acc: 68.00;    4658 s elapsed
Evaluating on validation set:
Validation lo

Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 69.48
Epoch  2,  2701/ 5943; avg loss: 1.04; acc: 70.00;    8058 s elapsed
Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 69.01
Epoch  2,  2801/ 5943; avg loss: 1.03; acc: 70.00;    8153 s elapsed
Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 69.67
Epoch  2,  2901/ 5943; avg loss: 1.06; acc: 69.00;    8248 s elapsed
Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 69.73
Epoch  2,  3001/ 5943; avg loss: 1.04; acc: 68.00;    8341 s elapsed
Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 69.54
Epoch  2,  3101/ 5943; avg loss: 1.04; acc: 69.00;    8432 s elapsed
Evaluating on validation set:
Validation loss: 1.01
Validation accuracy: 70.08
Epoch  2,  3201/ 5943; avg loss: 1.04; acc: 69.00;    8525 s elapsed
Evaluating on validation set:
Validation loss: 1.01
Validation accuracy: 69.98
Epoch  2,  3301/ 5943; avg loss: 

Epoch  3,   901/ 5943; avg loss: 0.96; acc: 72.00;   11946 s elapsed
Evaluating on validation set:
Validation loss: 1.04
Validation accuracy: 69.45
Epoch  3,  1001/ 5943; avg loss: 1.01; acc: 70.00;   12044 s elapsed
Evaluating on validation set:
Validation loss: 1.03
Validation accuracy: 69.76
Epoch  3,  1101/ 5943; avg loss: 0.96; acc: 71.00;   12135 s elapsed
Evaluating on validation set:
Validation loss: 1.03
Validation accuracy: 70.04
Epoch  3,  1201/ 5943; avg loss: 0.95; acc: 71.00;   12227 s elapsed
Evaluating on validation set:
Validation loss: 1.04
Validation accuracy: 69.64
Epoch  3,  1301/ 5943; avg loss: 0.99; acc: 70.00;   12319 s elapsed
Evaluating on validation set:
Validation loss: 1.03
Validation accuracy: 69.73
Epoch  3,  1401/ 5943; avg loss: 0.97; acc: 70.00;   12412 s elapsed
Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 70.26
Epoch  3,  1501/ 5943; avg loss: 1.01; acc: 70.00;   12503 s elapsed
Evaluating on validation set:
Validation lo

Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 69.86
Epoch  3,  5301/ 5943; avg loss: 1.02; acc: 70.00;   16021 s elapsed
Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 69.76
Epoch  3,  5401/ 5943; avg loss: 1.02; acc: 69.00;   16112 s elapsed
Evaluating on validation set:
Validation loss: 1.01
Validation accuracy: 70.17
Epoch  3,  5501/ 5943; avg loss: 1.02; acc: 69.00;   16206 s elapsed
Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 69.89
Epoch  3,  5601/ 5943; avg loss: 1.03; acc: 70.00;   16300 s elapsed
Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 69.48
Epoch  3,  5701/ 5943; avg loss: 1.04; acc: 69.00;   16392 s elapsed
Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 70.01
Epoch  3,  5801/ 5943; avg loss: 1.04; acc: 69.00;   16485 s elapsed
Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 69.14
Epoch  3,  5901/ 5943; avg loss: 

Epoch  4,  3501/ 5943; avg loss: 1.01; acc: 70.00;   19905 s elapsed
Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 70.01
Epoch  4,  3601/ 5943; avg loss: 1.00; acc: 70.00;   20001 s elapsed
Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 69.98
Epoch  4,  3701/ 5943; avg loss: 0.99; acc: 70.00;   20092 s elapsed
Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 69.92
Epoch  4,  3801/ 5943; avg loss: 0.99; acc: 70.00;   20182 s elapsed
Evaluating on validation set:
Validation loss: 1.01
Validation accuracy: 70.26
Epoch  4,  3901/ 5943; avg loss: 1.02; acc: 70.00;   20276 s elapsed
Evaluating on validation set:
Validation loss: 1.01
Validation accuracy: 69.57
Epoch  4,  4001/ 5943; avg loss: 1.05; acc: 68.00;   20370 s elapsed
Evaluating on validation set:
Validation loss: 1.02
Validation accuracy: 69.54
Epoch  4,  4101/ 5943; avg loss: 1.00; acc: 70.00;   20461 s elapsed
Evaluating on validation set:
Validation lo

KeyboardInterrupt: 

**Experimental Results**

|Randomized Embedding                          | Accuracy | Number of parameters | Training speed |   
|----------------------------------------------|----------|----------------------|----------------|
| Attention over Attention                     | 69.89%   | 46,835,392           | 13,874 s/epoch |   
| Transformer Encoder Attention over Attention | 70.86%   | 47,721,984           | 5,582 s/epoch  |   


**Result Analysis**
- Here we can see that the Transformer Encoder improves accuracy by about 1 percentage point (69.89% to 70.86%), which suggests that the Transformer Encoder is a good alternative to the GRU.
- Moreover, one of the best properties of the Transformer is that all of its components can be computed in parallel. Training is correspondingly faster: about one and a half hours per epoch (5,582 s), while the original model took nearly 4 hours per epoch (13,874 s).
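
The figures quoted in the analysis follow directly from the results table; the short calculation below only restates the table values (variable names are chosen for the example).

```python
aoa_seconds, transformer_seconds = 13874, 5582          # training time per epoch
aoa_acc, transformer_acc = 69.89, 70.86                 # accuracy in percent
aoa_params, transformer_params = 46835392, 47721984     # number of parameters

speedup = aoa_seconds / transformer_seconds             # about 2.5x faster per epoch
aoa_hours = aoa_seconds / 3600                          # about 3.85 hours
transformer_hours = transformer_seconds / 3600          # about 1.55 hours
accuracy_gain = round(transformer_acc - aoa_acc, 2)     # percentage points
extra_params = transformer_params - aoa_params          # additional parameters

print("speedup: %.2fx, gain: %.2f points, extra parameters: %d"
      % (speedup, accuracy_gain, extra_params))
```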
