<a href="https://colab.research.google.com/github/rajy4683/EVAP2/blob/master/ENDS9_Attn_CommonQa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!nvidia-smi

Mon Jan  4 10:40:40 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    25W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
%%bash
python -m spacy download en
python -m spacy download de

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')
Collecting de_core_news_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.2.5/de_core_news_sm-2.2.5.tar.gz (14.9MB)
Building wheels for collected packages: de-core-news-sm
  Building wheel for de-core-news-sm (setup.py): started
  Building wheel for de-core-news-sm (setup.py): finished with status 'done'
  Created wheel for de-core-news-sm: filename=de_core_news_sm-2.2.5-cp36-none-any.whl size=14907057 sha256=c52da8b473ed5b8a95f12df6ff5330a22a9f0e9004600fada6d547cc119b5bf5
  Stored in directory: /tmp/pip-ephem-wheel-cache-0d7b1zun/wheels/ba/3f/ed/d4aa8e45e7191b7f32db4bfad565e7da1edbf05c916ca7a1ca
Successfully built de-core-news-sm
Inst

**Restart Notebook**

# 3 - Neural Machine Translation by Jointly Learning to Align and Translate

In this third notebook on sequence-to-sequence models using PyTorch and TorchText, we'll be implementing the model from [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473). This model achieves our best perplexity yet, ~27 compared to ~34 for the previous model.

## Introduction

As a reminder, here is the general encoder-decoder model:

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq1.png?raw=1)

In the previous model, our architecture was set up in a way to reduce "information compression" by explicitly passing the context vector, $z$, to the decoder at every time-step and by passing both the context vector and embedded input word, $d(y_t)$, along with the hidden state, $s_t$, to the linear layer, $f$, to make a prediction.

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq7.png?raw=1)

Even though we have reduced some of this compression, our context vector still needs to contain all of the information about the source sentence. The model implemented in this notebook avoids this compression by allowing the decoder to look at the entire source sentence (via its hidden states) at each decoding step! How does it do this? It uses *attention*. 

Attention works by first calculating an attention vector, $a$, that is the length of the source sentence. The attention vector has the property that each element is between 0 and 1, and the entire vector sums to 1. We then calculate a weighted sum of our source sentence hidden states, $H$, to get a weighted source vector, $w$. 

$$w = \sum_{i}a_ih_i$$

We calculate a new weighted source vector every time-step when decoding, using it as input to our decoder RNN as well as the linear layer to make a prediction. We'll explain how to do all of this during the session.
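
As a small illustration of the weighted sum above, the following sketch uses made-up numbers: a three-token source sentence with two-dimensional hidden states. The values and the `softmax` helper are illustrative assumptions, not part of the model built later.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into weights in (0, 1) that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical hidden states h_1..h_3 for a three-token source sentence.
H = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

# Hypothetical unnormalised scores; the softmax yields the attention vector a.
a = softmax([2.0, 0.5, 0.5])

# w = sum_i a_i * h_i, computed one dimension at a time.
w = [sum(a_i * h_i[d] for a_i, h_i in zip(a, H)) for d in range(len(H[0]))]

print("attention:", [round(x, 3) for x in a])
print("weighted source vector:", [round(x, 3) for x in w])
```

The first token receives the largest score, so its hidden state contributes most to $w$.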

## Preparing Data

Again, the preparation is similar to last time.

First we import all the required modules.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

import spacy
import numpy as np
import pandas as pd
import random
import math
import time
import json

Set the random seeds for reproducibility.

In [None]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Mount Google Drive and unpack the CommonQA archive, then load the German and English spaCy models.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!cp /content/drive/MyDrive/EVA4/ENDS9/CommonQA.zip .

In [None]:
!unzip /content/drive/MyDrive/EVA4/ENDS9/CommonQA.zip

Archive:  /content/drive/MyDrive/EVA4/ENDS9/CommonQA.zip
   creating: CommonQA/
  inflating: CommonQA/dev_rand_split.jsonl  
  inflating: CommonQA/test_rand_split_no_answers.jsonl  
  inflating: CommonQA/train_rand_split.jsonl  


In [None]:
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

In [None]:
sentence = []
label = []
with open('/content/CommonQA/train_rand_split.jsonl') as h:
    for line in h:
        example = json.loads(line)
        merged_choices = ' A: '.join([choice['text'] for choice in example['question']['choices']])
        # Source text: the question followed by every candidate answer
        source_text = 'Q: ' + example['question']['stem'] + ' A: ' + merged_choices
        # Target text: the answer whose label matches the answer key
        correct_answer = [ choice['text'] for choice in example['question']['choices'] if choice['label'] == example['answerKey'] ][0]
        sentence.append(source_text)
        label.append(correct_answer)
dataset_df = pd.DataFrame({'sentence':sentence, 'label':label})
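
To make the string format concrete, here is a sketch that applies the same formatting to a single record. The record is invented for illustration and only mirrors the fields the cell above reads (`question.stem`, `question.choices`, `answerKey`); it is not taken from the dataset.

```python
import json

# Hypothetical record with the same fields the loading loop relies on.
line = json.dumps({
    "answerKey": "B",
    "question": {
        "stem": "Where would you keep milk cold?",
        "choices": [
            {"label": "A", "text": "oven"},
            {"label": "B", "text": "refrigerator"},
            {"label": "C", "text": "bookshelf"},
        ],
    },
})

example = json.loads(line)
choices = example["question"]["choices"]

# Source string: the question followed by every candidate answer.
merged_choices = " A: ".join(choice["text"] for choice in choices)
source_text = "Q: " + example["question"]["stem"] + " A: " + merged_choices

# Target string: the text of the choice whose label matches answerKey.
target_text = next(c["text"] for c in choices if c["label"] == example["answerKey"])

print(source_text)
print(target_text)
```

The model therefore sees every candidate answer in its input and has to generate the text of the correct one.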

In [None]:
#dataset = pd.read_csv("/content/dev.tsv", sep='\t', header=None)
dataset = pd.read_csv("/content/qasc/train.tsv", sep='\t', header=None)
dataset.columns = ["sentence", "label"]

We create the tokenizers.

In [None]:
def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings
    """
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

The fields remain the same as before.

Load the data.

In [None]:
from torchtext import data 
import matplotlib.pyplot as plt
Sentence = data.Field(sequential = True, tokenize = 'spacy', batch_first =False, include_lengths=False, lower=False)
Label = data.Field(sequential =True, tokenize ='spacy', is_target=False, batch_first =False, include_lengths=False, lower=False)

In [None]:
dataset = dataset_df
fields = [('sentence', Sentence),('label',Label)]
example = [data.Example.fromlist([dataset.sentence[i],dataset.label[i]], fields) for i in range(dataset.shape[0])] 
commonqa_ds = data.Dataset(example, fields)
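random.seed(SEED)  # random.seed() returns None, so seed first and pass the resulting state explicitly
(train, valid) = commonqa_ds.split(split_ratio=[0.85, 0.15], random_state=random.getstate())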

Build the vocabulary.

In [None]:
Sentence.build_vocab(train)
Label.build_vocab(train)

In [None]:
print('Size of input vocab : ', len(Sentence.vocab))
print('Size of label vocab : ', len(Label.vocab))
print('Top 10 most frequent words :', list(Sentence.vocab.freqs.most_common(10)))
print('Labels : ', Label.vocab.stoi)

Size of input vocab :  11736
Size of label vocab :  3515
Top 10 most frequent words : [(':', 49680), ('A', 41641), ('Q', 8280), ('?', 8228), ('to', 5004), ('a', 4761), ('the', 3593), ('what', 3290), (',', 3103), ('of', 2335)]
Labels :  defaultdict(<function _default_unk_index at 0x7f8099eafa60>, {'<unk>': 0, '<pad>': 1, 'store': 2, 'of': 3, 'house': 4, 'to': 5, 'office': 6, "'s": 7, 'room': 8, 'city': 9, 'get': 10, 'money': 11, 'building': 12, 'school': 13, 'go': 14, 'have': 15, 'new': 16, 'in': 17, 'music': 18, 'water': 19, 'home': 20, 'down': 21, 'food': 22, 'feel': 23, 'being': 24, 'park': 25, 'area': 26, 'restaurant': 27, 'shop': 28, 'good': 29, 'fun': 30, 'game': 31, 'out': 32, 'kitchen': 33, 'own': 34, 'cabinet': 35, 'countryside': 36, 'death': 37, 'car': 38, 'getting': 39, 'make': 40, 'people': 41, 'work': 42, 'play': 43, 'table': 44, 'up': 45, 'band': 46, 'desk': 47, 'feeling': 48, 'for': 49, 'market': 50, 'michigan': 51, 'station': 52, 'better': 53, 'ground': 54, 'hotel

Define the device.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Create the iterators.

In [None]:
BATCH_SIZE=128
train_iterator, test_iterator = BucketIterator.splits(
    (train, valid), 
    batch_size = BATCH_SIZE, 
    device = device, sort=False, shuffle=False)

## Building the Seq2Seq Model

### Encoder

First, we'll build the encoder. Similar to the previous model, we only use a single-layer GRU; however, we now use a *bidirectional RNN*. With a bidirectional RNN, we have two RNNs in each layer: a *forward RNN* going over the embedded sentence from left to right (shown below in green), and a *backward RNN* going over the embedded sentence from right to left (teal). All we need to do in code is set `bidirectional = True` and then pass the embedded sentence to the RNN as before. 

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq8.png?raw=1)

We now have:

$$\begin{align*}
h_t^\rightarrow &= \text{EncoderGRU}^\rightarrow(e(x_t^\rightarrow),h_{t-1}^\rightarrow)\\
h_t^\leftarrow &= \text{EncoderGRU}^\leftarrow(e(x_t^\leftarrow),h_{t-1}^\leftarrow)
\end{align*}$$

Where $x_0^\rightarrow = \text{<sos>}, x_1^\rightarrow = \text{guten}$ and $x_0^\leftarrow = \text{<eos>}, x_1^\leftarrow = \text{morgen}$.

As before, we only pass an input (`embedded`) to the RNN, which tells PyTorch to initialize both the forward and backward initial hidden states ($h_0^\rightarrow$ and $h_0^\leftarrow$, respectively) to a tensor of all zeros. We'll also get two context vectors, one from the forward RNN after it has seen the final word in the sentence, $z^\rightarrow=h_T^\rightarrow$, and one from the backward RNN after it has seen the first word in the sentence, $z^\leftarrow=h_T^\leftarrow$.

The RNN returns `outputs` and `hidden`. 

`outputs` is of size **[src len, batch size, hid dim * num directions]** where the first `hid_dim` elements in the third axis are the hidden states from the top layer forward RNN, and the last `hid_dim` elements are hidden states from the top layer backward RNN. We can think of the third axis as being the forward and backward hidden states concatenated together, i.e. $h_1 = [h_1^\rightarrow; h_{T}^\leftarrow]$, $h_2 = [h_2^\rightarrow; h_{T-1}^\leftarrow]$, and we can denote all encoder hidden states (forward and backward concatenated together) as $H=\{ h_1, h_2, ..., h_T\}$.

`hidden` is of size **[n layers * num directions, batch size, hid dim]**, where **[-2, :, :]** gives the top layer forward RNN hidden state after the final time-step (i.e. after it has seen the last word in the sentence) and **[-1, :, :]** gives the top layer backward RNN hidden state after the final time-step (i.e. after it has seen the first word in the sentence).
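
The relationship between `outputs` and `hidden` can be checked on a toy bidirectional GRU. This is a minimal sketch with arbitrary sizes and random input standing in for the embedded sentence; none of the numbers come from the model in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

src_len, batch_size, emb_dim, hid_dim = 5, 2, 4, 3  # toy sizes, chosen arbitrarily

rnn = nn.GRU(emb_dim, hid_dim, bidirectional=True)
embedded = torch.randn(src_len, batch_size, emb_dim)  # stand-in for embedded tokens

outputs, hidden = rnn(embedded)

print(outputs.shape)  # [src len, batch size, hid dim * 2]
print(hidden.shape)   # [n layers * num directions, batch size, hid dim]

# Forward direction: its final state sits at the last time-step of `outputs`.
forward_final = outputs[-1, :, :hid_dim]
# Backward direction: its final state sits at the first time-step of `outputs`.
backward_final = outputs[0, :, hid_dim:]

print(torch.allclose(forward_final, hidden[-2]))
print(torch.allclose(backward_final, hidden[-1]))
```

Both comparisons hold because the backward RNN finishes reading at the first token, so its final state is stored at position 0 of `outputs`.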

As the decoder is not bidirectional, it only needs a single context vector, $z$, to use as its initial hidden state, $s_0$, and we currently have two, a forward and a backward one ($z^\rightarrow=h_T^\rightarrow$ and $z^\leftarrow=h_T^\leftarrow$, respectively). We solve this by concatenating the two context vectors together, passing them through a linear layer, $g$, and applying the $\tanh$ activation function. 

$$z=\tanh(g(h_T^\rightarrow, h_T^\leftarrow)) = \tanh(g(z^\rightarrow, z^\leftarrow)) = s_0$$

**Note**: this is actually a deviation from the paper. Instead, they feed only the first backward RNN hidden state through a linear layer to get the context vector/decoder initial hidden state. This doesn't seem to make sense to me, so we have changed it.

As we want our model to look back over the whole of the source sentence we return `outputs`, the stacked forward and backward hidden states for every token in the source sentence. We also return `hidden`, which acts as our initial hidden state in the decoder.

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
        
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, hidden = self.rnn(embedded)
                
        #outputs = [src len, batch size, hid dim * num directions]
        #hidden = [n layers * num directions, batch size, hid dim]
        
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        
        #hidden [-2, :, : ] is the last of the forwards RNN 
        #hidden [-1, :, : ] is the last of the backwards RNN
        
        #initial decoder hidden is final hidden state of the forwards and backwards 
        #  encoder RNNs fed through a linear layer
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
        
        #outputs = [src len, batch size, enc hid dim * 2]
        #hidden = [batch size, dec hid dim]
        
        return outputs, hidden

### Attention

Next up is the attention layer. This will take in the previous hidden state of the decoder, $s_{t-1}$, and all of the stacked forward and backward hidden states from the encoder, $H$. The layer will output an attention vector, $a_t$, that is the length of the source sentence; each element is between 0 and 1 and the entire vector sums to 1.

Intuitively, this layer takes what we have decoded so far, $s_{t-1}$, and all of what we have encoded, $H$, to produce a vector, $a_t$, that represents which words in the source sentence we should pay the most attention to in order to correctly predict the next word to decode, $\hat{y}_{t+1}$. 

First, we calculate the *energy* between the previous decoder hidden state and the encoder hidden states. As our encoder hidden states are a sequence of $T$ tensors, and our previous decoder hidden state is a single tensor, the first thing we do is `repeat` the previous decoder hidden state $T$ times. We then calculate the energy, $E_t$, between them by concatenating them together and passing them through a linear layer (`attn`) and a $\tanh$ activation function. 

$$E_t = \tanh(\text{attn}(s_{t-1}, H))$$ 

This can be thought of as calculating how well each encoder hidden state "matches" the previous decoder hidden state.

We currently have a **[dec hid dim, src len]** tensor for each example in the batch. We want this to be **[src len]** for each example in the batch as the attention should be over the length of the source sentence. This is achieved by multiplying the `energy` by a **[1, dec hid dim]** tensor, $v$.

$$\hat{a}_t = v E_t$$

We can think of $v$ as the weights for a weighted sum of the energy across all encoder hidden states. These weights tell us how much we should attend to each token in the source sequence. The parameters of $v$ are initialized randomly, but learned with the rest of the model via backpropagation. Note how $v$ is not dependent on time, and the same $v$ is used for each time-step of the decoding. We implement $v$ as a linear layer without a bias.

Finally, we ensure the attention vector fits the constraints of having all elements between 0 and 1 and the vector summing to 1 by passing it through a $\text{softmax}$ layer.

$$a_t = \text{softmax}(\hat{a}_t)$$

This gives us the attention over the source sentence!
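
The three equations above can be traced step by step on random tensors. This is a minimal sketch with toy sizes; the names `s_prev`, `s_rep` and `scores` are introduced here for illustration, and the encoder states are assumed to be batch-first already.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

batch_size, src_len, enc_hid_dim, dec_hid_dim = 2, 4, 3, 5  # toy sizes

attn = nn.Linear(enc_hid_dim * 2 + dec_hid_dim, dec_hid_dim)
v = nn.Linear(dec_hid_dim, 1, bias=False)

s_prev = torch.randn(batch_size, dec_hid_dim)           # s_{t-1}
H = torch.randn(batch_size, src_len, enc_hid_dim * 2)   # encoder states, batch first

# Repeat s_{t-1} once per source position so it can be paired with each h_i.
s_rep = s_prev.unsqueeze(1).repeat(1, src_len, 1)

# E_t = tanh(attn([s_{t-1}; h_i])) for every source position i.
energy = torch.tanh(attn(torch.cat((s_rep, H), dim=2)))  # [batch, src len, dec hid dim]

# a_hat_t = v E_t collapses the dec hid dim axis to one score per position.
scores = v(energy).squeeze(2)                            # [batch, src len]

# a_t = softmax(a_hat_t), normalised over the source positions.
a_t = F.softmax(scores, dim=1)

print(a_t.shape)
print(a_t.sum(dim=1))
```

Each row of `a_t` is one attention vector: one weight per source token, summing to 1.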

Graphically, this looks something like below. This is for calculating the very first attention vector, where $s_{t-1} = s_0 = z$. The green/teal blocks represent the hidden states from both the forward and backward RNNs, and the attention computation is all done within the pink block.

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq9.png?raw=1)

In [None]:
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs):
        
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        #repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #hidden = [batch size, src len, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) 
        
        #energy = [batch size, src len, dec hid dim]

        attention = self.v(energy).squeeze(2)
        
        #attention= [batch size, src len]
        
        return F.softmax(attention, dim=1)

### Decoder

Next up is the decoder. 

The decoder contains the attention layer, `attention`, which takes the previous hidden state, $s_{t-1}$, all of the encoder hidden states, $H$, and returns the attention vector, $a_t$.

We then use this attention vector to create a weighted source vector, $w_t$, denoted by `weighted`, which is a weighted sum of the encoder hidden states, $H$, using $a_t$ as the weights.

$$w_t = a_t H$$

The embedded input word, $d(y_t)$, the weighted source vector, $w_t$, and the previous decoder hidden state, $s_{t-1}$, are then all passed into the decoder RNN, with $d(y_t)$ and $w_t$ being concatenated together.

$$s_t = \text{DecoderGRU}(d(y_t), w_t, s_{t-1})$$

We then pass $d(y_t)$, $w_t$ and $s_t$ through the linear layer, $f$, to make a prediction of the next word in the target sentence, $\hat{y}_{t+1}$. This is done by concatenating them all together.

$$\hat{y}_{t+1} = f(d(y_t), w_t, s_t)$$
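
One decoding step can be sketched on random tensors to see how the three equations fit together. The sizes are toy values and the attention vector is a random stand-in, so this only illustrates shapes and concatenation order under those assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, src_len = 2, 4
emb_dim, enc_hid_dim, dec_hid_dim, output_dim = 3, 5, 6, 7  # toy sizes

rnn = nn.GRU(enc_hid_dim * 2 + emb_dim, dec_hid_dim)
f = nn.Linear(enc_hid_dim * 2 + dec_hid_dim + emb_dim, output_dim)

d_y = torch.randn(1, batch_size, emb_dim)                     # d(y_t), embedded input word
s_prev = torch.randn(batch_size, dec_hid_dim)                 # s_{t-1}
H = torch.randn(batch_size, src_len, enc_hid_dim * 2)         # encoder states, batch first
a_t = torch.softmax(torch.randn(batch_size, src_len), dim=1)  # stand-in attention vector

# w_t = a_t H: one weighted sum of encoder states per example.
w_t = torch.bmm(a_t.unsqueeze(1), H).permute(1, 0, 2)         # [1, batch, enc hid dim * 2]

# s_t = DecoderGRU([d(y_t); w_t], s_{t-1})
output, s_t = rnn(torch.cat((d_y, w_t), dim=2), s_prev.unsqueeze(0))

# y_hat_{t+1} = f([s_t; w_t; d(y_t)])
y_hat = f(torch.cat((output.squeeze(0), w_t.squeeze(0), d_y.squeeze(0)), dim=1))

print(w_t.shape, s_t.shape, y_hat.shape)
```

With the hyperparameters used later in this notebook (embedding size 256, hidden sizes 512), the same concatenations give a GRU input size of 1280 and a linear-layer input size of 1792.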

The image below shows decoding the first word in an example translation.

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq10.png?raw=1)

The green/teal blocks show the forward/backward encoder RNNs which output $H$, the red block shows the context vector, $z = h_T = \tanh(g(h^\rightarrow_T,h^\leftarrow_T)) = \tanh(g(z^\rightarrow, z^\leftarrow)) = s_0$, the blue block shows the decoder RNN which outputs $s_t$, the purple block shows the linear layer, $f$, which outputs $\hat{y}_{t+1}$ and the orange block shows the calculation of the weighted sum over $H$ by $a_t$ and outputs $w_t$. Not shown is the calculation of $a_t$.

In [None]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()

        self.output_dim = output_dim
        self.attention = attention
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs):
             
        #input = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
        
        a = self.attention(hidden, encoder_outputs)
                
        #a = [batch size, src len]
        
        a = a.unsqueeze(1)
        
        #a = [batch size, 1, src len]
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        weighted = torch.bmm(a, encoder_outputs)
        
        #weighted = [batch size, 1, enc hid dim * 2]
        
        weighted = weighted.permute(1, 0, 2)
        
        #weighted = [1, batch size, enc hid dim * 2]
        
        rnn_input = torch.cat((embedded, weighted), dim = 2)
        
        #rnn_input = [1, batch size, (enc hid dim * 2) + emb dim]
            
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        #output = [seq len, batch size, dec hid dim * n directions]
        #hidden = [n layers * n directions, batch size, dec hid dim]
        
        #seq len, n layers and n directions will always be 1 in this decoder, therefore:
        #output = [1, batch size, dec hid dim]
        #hidden = [1, batch size, dec hid dim]
        #this also means that output == hidden
        assert (output == hidden).all()
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden.squeeze(0)

### Seq2Seq

This is the first model where the encoder RNN and decoder RNN do not need to have the same hidden dimensions; however, the encoder has to be bidirectional. This requirement can be removed by changing all occurrences of `enc_hid_dim * 2` to `enc_hid_dim * 2 if encoder_is_bidirectional else enc_hid_dim`. 

This seq2seq encapsulator is similar to the last two. The only difference is that the `encoder` returns both the final hidden state (which is the final hidden state from both the forward and backward encoder RNNs passed through a linear layer) to be used as the initial hidden state for the decoder, as well as every hidden state (which are the forward and backward hidden states stacked on top of each other). We also need to ensure that `hidden` and `encoder_outputs` are passed to the decoder. 

Briefly going over all of the steps:
- the `outputs` tensor is created to hold all predictions, $\hat{Y}$
- the source sequence, $X$, is fed into the encoder to receive $z$ and $H$
- the initial decoder hidden state is set to be the `context` vector, $s_0 = z = h_T$
- we use a batch of `<sos>` tokens as the first `input`, $y_1$
- we then decode within a loop:
  - inserting the input token $y_t$, previous hidden state, $s_{t-1}$, and all encoder outputs, $H$, into the decoder
  - receiving a prediction, $\hat{y}_{t+1}$, and a new hidden state, $s_t$
  - we then decide if we are going to teacher force or not, setting the next input as appropriate
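
The teacher-forcing decision in the loop above can be isolated in a few lines. In this sketch the decoder is replaced by a hypothetical stand-in function, and the token ids are invented, so only the control flow is meaningful.

```python
import random

random.seed(0)

def fake_decoder_step(input_token):
    """Stand-in for the decoder: always 'predicts' input_token + 100."""
    return input_token + 100

trg = [2, 11, 12, 13, 3]        # hypothetical target ids: <sos>, three words, <eos>
teacher_forcing_ratio = 0.5

inputs_used = []                # what the decoder was fed at each step
input_token = trg[0]            # the first input is always <sos>

for t in range(1, len(trg)):
    inputs_used.append(input_token)
    prediction = fake_decoder_step(input_token)
    # With probability teacher_forcing_ratio feed the ground truth next,
    # otherwise feed the model's own prediction.
    teacher_force = random.random() < teacher_forcing_ratio
    input_token = trg[t] if teacher_force else prediction

print(inputs_used)
```

Setting `teacher_forcing_ratio` to 0 makes the decoder consume only its own predictions, which is what the evaluation loop does.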

In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time
        
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #encoder_outputs is all hidden states of the input sequence, back and forwards
        #hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src)
                
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden state and all encoder hidden states
            #receive output tensor (predictions) and new hidden state
            output, hidden = self.decoder(input, hidden, encoder_outputs)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1

        return outputs

## Training the Seq2Seq Model

The rest of this session is very similar to the previous one.

We initialise our parameters, encoder, decoder and seq2seq model (placing it on the GPU if we have one). 

In [None]:
# INPUT_DIM = len(SRC.vocab)
# OUTPUT_DIM = len(TRG.vocab)

INPUT_DIM = len(Sentence.vocab)
OUTPUT_DIM = len(Label.vocab)

ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, device).to(device)

We use a simplified version of the weight initialization scheme used in the paper. Here, we will initialize all biases to zero and draw all weights from a normal distribution with mean 0 and standard deviation 0.01, $\mathcal{N}(0, 0.01^2)$.

In [None]:
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(11736, 256)
    (rnn): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
    (embedding): Embedding(3515, 256)
    (rnn): GRU(1280, 512)
    (fc_out): Linear(in_features=1792, out_features=3515, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)
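
As a quick sanity check of the initialization rule, the sketch below applies the same function to a small stand-alone GRU and prints the statistics of each parameter. The layer and its sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def init_weights(m):
    # Same rule as the cell above: normal(0, std=0.01) for weights, zero for everything else.
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)

layer = nn.GRU(64, 128, bidirectional=True)  # toy layer, sizes chosen arbitrarily
layer.apply(init_weights)

for name, param in layer.named_parameters():
    print(f'{name:25s} mean={param.mean().item():+.4f} std={param.std().item():.4f}')
```

Weight tensors should report a standard deviation close to 0.01, and bias tensors should be exactly zero.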

Calculate the number of parameters. We get an increase of almost 50% in the number of parameters from the last model. 

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 16,639,931 trainable parameters
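
The total can be reproduced by hand from the layer sizes in the model printout. The helper functions below are illustrative and assume the standard single-layer GRU layout (three gates, separate input and hidden weights, two bias vectors per direction).

```python
def gru_params(input_size, hidden_size, bidirectional=False):
    """Parameters of a single-layer GRU: 3 gates, input and hidden weights plus two bias vectors."""
    per_direction = 3 * hidden_size * (input_size + hidden_size) + 2 * 3 * hidden_size
    return per_direction * (2 if bidirectional else 1)

def linear_params(in_features, out_features, bias=True):
    """Parameters of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

INPUT_DIM, OUTPUT_DIM = 11736, 3515   # vocabulary sizes printed earlier
EMB_DIM, HID_DIM = 256, 512           # encoder and decoder share these sizes

counts = {
    'encoder.embedding': INPUT_DIM * EMB_DIM,
    'encoder.rnn': gru_params(EMB_DIM, HID_DIM, bidirectional=True),
    'encoder.fc': linear_params(HID_DIM * 2, HID_DIM),
    'decoder.attention.attn': linear_params(HID_DIM * 2 + HID_DIM, HID_DIM),
    'decoder.attention.v': linear_params(HID_DIM, 1, bias=False),
    'decoder.embedding': OUTPUT_DIM * EMB_DIM,
    'decoder.rnn': gru_params(HID_DIM * 2 + EMB_DIM, HID_DIM),
    'decoder.fc_out': linear_params(HID_DIM * 2 + HID_DIM + EMB_DIM, OUTPUT_DIM),
}

for name, n in counts.items():
    print(f'{name:24s} {n:>12,}')
total = sum(counts.values())
print(f'{"total":24s} {total:>12,}')
```

The sum matches the 16,639,931 reported above; the output layer `fc_out` alone accounts for more than a third of it.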


We create an optimizer.

In [None]:
optimizer = optim.Adam(model.parameters())

We initialize the loss function.

In [None]:
# TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

TRG_PAD_IDX = Label.vocab.stoi[Label.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)
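
The effect of `ignore_index` can be seen on a toy batch: positions whose target is the padding index contribute nothing to the averaged loss. The logits and targets below are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

PAD_IDX = 1                                   # '<pad>' has index 1 in the label vocab above
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

logits = torch.randn(4, 6)                    # 4 target positions, toy vocab of 6
targets = torch.tensor([2, 5, PAD_IDX, PAD_IDX])

loss_with_padding = criterion(logits, targets)
# Same loss computed only on the two real tokens.
loss_real_only = criterion(logits[:2], targets[:2])

print(loss_with_padding.item(), loss_real_only.item())
```

The two values agree, so short labels padded to the batch length do not dilute the loss.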

We then create the training loop...

In [None]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        # src = batch.src
        # trg = batch.trg
        src = batch.sentence
        trg = batch.label
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

...and the evaluation loop, remembering to set the model to `eval` mode and turn off teacher forcing.

In [None]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            # src = batch.src
            # trg = batch.trg
            src = batch.sentence
            trg = batch.label

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

Finally, define a timing function.

In [None]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Then, we train our model, saving the parameters that give us the best validation loss.

In [None]:
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    # valid_loss = evaluate(model, valid_iterator, criterion)
    valid_loss = evaluate(model, test_iterator, criterion)

    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 3s
	Train Loss: 7.231 | Train PPL: 1380.961
	 Val. Loss: 6.866 |  Val. PPL: 958.940
Epoch: 02 | Time: 0m 3s
	Train Loss: 6.018 | Train PPL: 410.732
	 Val. Loss: 6.821 |  Val. PPL: 917.049
Epoch: 03 | Time: 0m 3s
	Train Loss: 5.492 | Train PPL: 242.746
	 Val. Loss: 6.860 |  Val. PPL: 953.283
Epoch: 04 | Time: 0m 3s
	Train Loss: 5.149 | Train PPL: 172.277
	 Val. Loss: 7.036 |  Val. PPL: 1136.637
Epoch: 05 | Time: 0m 3s
	Train Loss: 4.732 | Train PPL: 113.573
	 Val. Loss: 6.670 |  Val. PPL: 788.376
Epoch: 06 | Time: 0m 3s
	Train Loss: 4.191 | Train PPL:  66.087
	 Val. Loss: 6.554 |  Val. PPL: 702.244
Epoch: 07 | Time: 0m 3s
	Train Loss: 3.877 | Train PPL:  48.294
	 Val. Loss: 6.776 |  Val. PPL: 876.250
Epoch: 08 | Time: 0m 3s
	Train Loss: 3.424 | Train PPL:  30.677
	 Val. Loss: 6.700 |  Val. PPL: 812.122
Epoch: 09 | Time: 0m 3s
	Train Loss: 2.862 | Train PPL:  17.496
	 Val. Loss: 6.616 |  Val. PPL: 746.680
Epoch: 10 | Time: 0m 3s
	Train Loss: 2.430 | Train PPL:  11.36

In [None]:
optimizer

Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    eps: 1e-08
    lr: 0.001
    weight_decay: 0
)

In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.00001, weight_decay=0.0001)

In [None]:
N_EPOCHS = 20
CLIP = 1
model.load_state_dict(torch.load('tut3-model.pt'))
best_valid_loss = 6.283

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    # valid_loss = evaluate(model, valid_iterator, criterion)
    valid_loss = evaluate(model, test_iterator, criterion)

    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 3s
	Train Loss: 1.904 | Train PPL:   6.714
	 Val. Loss: 6.282 |  Val. PPL: 534.699
Epoch: 02 | Time: 0m 3s
	Train Loss: 1.886 | Train PPL:   6.590
	 Val. Loss: 6.278 |  Val. PPL: 532.762
Epoch: 03 | Time: 0m 3s
	Train Loss: 1.867 | Train PPL:   6.468
	 Val. Loss: 6.268 |  Val. PPL: 527.622
Epoch: 04 | Time: 0m 3s
	Train Loss: 1.852 | Train PPL:   6.372
	 Val. Loss: 6.260 |  Val. PPL: 523.436
Epoch: 05 | Time: 0m 3s
	Train Loss: 1.852 | Train PPL:   6.371
	 Val. Loss: 6.253 |  Val. PPL: 519.555
Epoch: 06 | Time: 0m 3s
	Train Loss: 1.837 | Train PPL:   6.277
	 Val. Loss: 6.247 |  Val. PPL: 516.603
Epoch: 07 | Time: 0m 3s
	Train Loss: 1.837 | Train PPL:   6.280
	 Val. Loss: 6.243 |  Val. PPL: 514.369
Epoch: 08 | Time: 0m 3s
	Train Loss: 1.820 | Train PPL:   6.171
	 Val. Loss: 6.235 |  Val. PPL: 510.437
Epoch: 09 | Time: 0m 3s
	Train Loss: 1.819 | Train PPL:   6.168
	 Val. Loss: 6.227 |  Val. PPL: 506.366
Epoch: 10 | Time: 0m 3s
	Train Loss: 1.799 | Train PPL:   6.042
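
The training cell above is a verbatim copy of the first one, and the same loop is repeated in the cells that follow. One way to avoid the copies is to wrap the loop in a function that returns the best loss, so the next run can start from it. This is only a sketch: `fit` and its callable parameters are hypothetical names, and the demo uses canned numbers instead of the real `train` and `evaluate`.

```python
import math
from typing import Callable, List, Tuple


def fit(train_one_epoch: Callable[[], float],
        evaluate_once: Callable[[], float],
        save_checkpoint: Callable[[], None],
        n_epochs: int,
        best_valid_loss: float = math.inf) -> Tuple[float, List[Tuple[int, float, float]]]:
    """Run n_epochs, checkpoint on every improvement, return (best, history)."""
    history = []
    for epoch in range(1, n_epochs + 1):
        train_loss = train_one_epoch()
        valid_loss = evaluate_once()
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            save_checkpoint()
        history.append((epoch, train_loss, valid_loss))
    return best_valid_loss, history


# Stand-in demo with canned losses (not real training output).
train_losses = iter([1.90, 1.89, 1.87])
valid_losses = iter([6.28, 6.30, 6.27])
saves = []
best, history = fit(
    train_one_epoch=lambda: next(train_losses),
    evaluate_once=lambda: next(valid_losses),
    save_checkpoint=lambda: saves.append('checkpoint'),
    n_epochs=3,
)
print(best, len(saves))
```

In the notebook the three callables would wrap the existing calls, for example `lambda: train(model, train_iterator, optimizer, criterion, CLIP)` and `lambda: torch.save(model.state_dict(), 'tut3-model.pt')`, and the returned `best` would be passed back in as `best_valid_loss` on the next call.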


In [None]:
N_EPOCHS = 20
CLIP = 1
model.load_state_dict(torch.load('tut3-model.pt'))
best_valid_loss = 6.283

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    # valid_loss = evaluate(model, valid_iterator, criterion)
    valid_loss = evaluate(model, test_iterator, criterion)

    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 3s
	Train Loss: 1.722 | Train PPL:   5.597
	 Val. Loss: 6.145 |  Val. PPL: 466.476
Epoch: 02 | Time: 0m 3s
	Train Loss: 1.727 | Train PPL:   5.621
	 Val. Loss: 6.137 |  Val. PPL: 462.837
Epoch: 03 | Time: 0m 3s
	Train Loss: 1.719 | Train PPL:   5.578
	 Val. Loss: 6.120 |  Val. PPL: 454.918
Epoch: 04 | Time: 0m 3s
	Train Loss: 1.708 | Train PPL:   5.516
	 Val. Loss: 6.116 |  Val. PPL: 452.971
Epoch: 05 | Time: 0m 3s
	Train Loss: 1.711 | Train PPL:   5.535
	 Val. Loss: 6.108 |  Val. PPL: 449.493
Epoch: 06 | Time: 0m 3s
	Train Loss: 1.687 | Train PPL:   5.405
	 Val. Loss: 6.104 |  Val. PPL: 447.607
Epoch: 07 | Time: 0m 3s
	Train Loss: 1.680 | Train PPL:   5.366
	 Val. Loss: 6.097 |  Val. PPL: 444.413
Epoch: 08 | Time: 0m 3s
	Train Loss: 1.681 | Train PPL:   5.371
	 Val. Loss: 6.093 |  Val. PPL: 442.943
Epoch: 09 | Time: 0m 3s
	Train Loss: 1.669 | Train PPL:   5.306
	 Val. Loss: 6.087 |  Val. PPL: 440.146
Epoch: 10 | Time: 0m 3s
	Train Loss: 1.666 | Train PPL:   5.291


In [None]:
best_valid_loss

6.034162958463033
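
The following cells paste this number back into `best_valid_loss` by hand, which makes it easy for the value to drift out of sync with the checkpoint on disk. A possible alternative, sketched here under the assumption that a small JSON file next to the checkpoint is acceptable, is to store the loss whenever the model is saved. The function names and the sidecar file name are hypothetical.

```python
import json
import math
import os
import tempfile


def save_best_loss(path: str, loss: float) -> None:
    """Write the best validation loss to a JSON sidecar file."""
    with open(path, 'w') as f:
        json.dump({'best_valid_loss': loss}, f)


def load_best_loss(path: str, default: float = math.inf) -> float:
    """Read the stored loss, or return `default` when no file exists yet."""
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)['best_valid_loss']


# Demo in a temporary directory so nothing in the working folder is touched.
with tempfile.TemporaryDirectory() as tmp:
    sidecar = os.path.join(tmp, 'tut3-model.best.json')  # hypothetical name
    first = load_best_loss(sidecar)        # no file yet, falls back to inf
    save_best_loss(sidecar, 6.034162958463033)
    restored = load_best_loss(sidecar)
print(first, restored)
```

Calling `save_best_loss` right after `torch.save(...)` and `load_best_loss` right after `model.load_state_dict(...)` would keep the two in step.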

In [None]:
N_EPOCHS = 50
CLIP = 1
model.load_state_dict(torch.load('tut3-model.pt'))
best_valid_loss = 5.836661656697591

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    # valid_loss = evaluate(model, valid_iterator, criterion)
    valid_loss = evaluate(model, test_iterator, criterion)

    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 3s
	Train Loss: 1.276 | Train PPL:   3.583
	 Val. Loss: 5.853 |  Val. PPL: 348.233
Epoch: 02 | Time: 0m 3s
	Train Loss: 1.275 | Train PPL:   3.578
	 Val. Loss: 5.855 |  Val. PPL: 348.853
Epoch: 03 | Time: 0m 3s
	Train Loss: 1.268 | Train PPL:   3.554
	 Val. Loss: 5.857 |  Val. PPL: 349.797
Epoch: 04 | Time: 0m 3s
	Train Loss: 1.250 | Train PPL:   3.492
	 Val. Loss: 5.858 |  Val. PPL: 350.080
Epoch: 05 | Time: 0m 3s
	Train Loss: 1.258 | Train PPL:   3.517
	 Val. Loss: 5.855 |  Val. PPL: 349.093
Epoch: 06 | Time: 0m 3s
	Train Loss: 1.249 | Train PPL:   3.488
	 Val. Loss: 5.852 |  Val. PPL: 347.810
Epoch: 07 | Time: 0m 3s
	Train Loss: 1.231 | Train PPL:   3.425
	 Val. Loss: 5.850 |  Val. PPL: 347.169
Epoch: 08 | Time: 0m 3s
	Train Loss: 1.236 | Train PPL:   3.443
	 Val. Loss: 5.855 |  Val. PPL: 348.925
Epoch: 09 | Time: 0m 3s
	Train Loss: 1.229 | Train PPL:   3.419
	 Val. Loss: 5.851 |  Val. PPL: 347.631
Epoch: 10 | Time: 0m 3s
	Train Loss: 1.225 | Train PPL:   3.404


In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.000005, weight_decay=0.0001)

In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.000005, weight_decay=0.0001)
N_EPOCHS = 50
CLIP = 1
model.load_state_dict(torch.load('tut3-model.pt'))
best_valid_loss = 6.034162958463033

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    # valid_loss = evaluate(model, valid_iterator, criterion)
    valid_loss = evaluate(model, test_iterator, criterion)

    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 3s
	Train Loss: 1.173 | Train PPL:   3.232
	 Val. Loss: 5.839 |  Val. PPL: 343.432
Epoch: 02 | Time: 0m 3s
	Train Loss: 1.164 | Train PPL:   3.201
	 Val. Loss: 5.842 |  Val. PPL: 344.312
Epoch: 03 | Time: 0m 3s
	Train Loss: 1.167 | Train PPL:   3.214
	 Val. Loss: 5.833 |  Val. PPL: 341.394
Epoch: 04 | Time: 0m 3s
	Train Loss: 1.164 | Train PPL:   3.201
	 Val. Loss: 5.832 |  Val. PPL: 341.184
Epoch: 05 | Time: 0m 3s
	Train Loss: 1.165 | Train PPL:   3.207
	 Val. Loss: 5.833 |  Val. PPL: 341.287
Epoch: 06 | Time: 0m 3s
	Train Loss: 1.159 | Train PPL:   3.186
	 Val. Loss: 5.833 |  Val. PPL: 341.391
Epoch: 07 | Time: 0m 3s
	Train Loss: 1.159 | Train PPL:   3.186
	 Val. Loss: 5.833 |  Val. PPL: 341.529
Epoch: 08 | Time: 0m 3s
	Train Loss: 1.150 | Train PPL:   3.159
	 Val. Loss: 5.836 |  Val. PPL: 342.380
Epoch: 09 | Time: 0m 3s
	Train Loss: 1.151 | Train PPL:   3.161
	 Val. Loss: 5.831 |  Val. PPL: 340.538
Epoch: 10 | Time: 0m 3s
	Train Loss: 1.147 | Train PPL:   3.149


In [None]:
best_valid_loss

5.830525795618693

In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.000001, weight_decay=0.0001)
N_EPOCHS = 50
CLIP = 1
model.load_state_dict(torch.load('tut3-model.pt'))
best_valid_loss = 5.830525795618693

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    # valid_loss = evaluate(model, valid_iterator, criterion)
    valid_loss = evaluate(model, test_iterator, criterion)

    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 3s
	Train Loss: 1.140 | Train PPL:   3.128
	 Val. Loss: 5.827 |  Val. PPL: 339.435
Epoch: 02 | Time: 0m 3s
	Train Loss: 1.142 | Train PPL:   3.134
	 Val. Loss: 5.832 |  Val. PPL: 340.935
Epoch: 03 | Time: 0m 3s
	Train Loss: 1.142 | Train PPL:   3.133
	 Val. Loss: 5.833 |  Val. PPL: 341.237
Epoch: 04 | Time: 0m 3s
	Train Loss: 1.149 | Train PPL:   3.156
	 Val. Loss: 5.834 |  Val. PPL: 341.640
Epoch: 05 | Time: 0m 3s
	Train Loss: 1.149 | Train PPL:   3.155
	 Val. Loss: 5.834 |  Val. PPL: 341.767
Epoch: 06 | Time: 0m 3s
	Train Loss: 1.141 | Train PPL:   3.131
	 Val. Loss: 5.834 |  Val. PPL: 341.697
Epoch: 07 | Time: 0m 3s
	Train Loss: 1.139 | Train PPL:   3.125
	 Val. Loss: 5.834 |  Val. PPL: 341.656
Epoch: 08 | Time: 0m 3s
	Train Loss: 1.146 | Train PPL:   3.146
	 Val. Loss: 5.833 |  Val. PPL: 341.459
Epoch: 09 | Time: 0m 3s
	Train Loss: 1.141 | Train PPL:   3.130
	 Val. Loss: 5.831 |  Val. PPL: 340.756
Epoch: 10 | Time: 0m 3s
	Train Loss: 1.150 | Train PPL:   3.159


In [None]:
best_valid_loss

5.826586365699768
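
Up to this point the learning rate has been lowered by hand (0.001, 0.00001, 0.000005, 0.000001) each time the validation loss stalled. PyTorch ships `torch.optim.lr_scheduler.ReduceLROnPlateau` for this purpose. The pure-Python class below is not that scheduler, only a minimal sketch of the idea; `PlateauDecay` and its default values are illustrative assumptions.

```python
import math


class PlateauDecay:
    """Multiply the learning rate by `factor` after more than `patience`
    consecutive epochs without a new best validation loss."""

    def __init__(self, lr: float, factor: float = 0.1,
                 patience: int = 2, min_lr: float = 1e-7):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = math.inf
        self.bad_epochs = 0

    def step(self, valid_loss: float) -> float:
        """Record one epoch's validation loss and return the current lr."""
        if valid_loss < self.best:
            self.best = valid_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


# Demo on made-up validation losses: three epochs without improvement
# trigger a single decay.
sched = PlateauDecay(lr=1e-3, factor=0.1, patience=2)
lrs = [sched.step(loss) for loss in [6.87, 6.82, 6.86, 7.04, 6.90, 6.67]]
print(lrs)
```

With a real optimizer, the returned value could be applied in place via `for group in optimizer.param_groups: group['lr'] = lr`, which avoids rebuilding the optimizer and discarding Adam's running moment estimates.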

In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.000001, weight_decay=0.001)
N_EPOCHS = 50
CLIP = 1
model.load_state_dict(torch.load('tut3-model.pt'))
best_valid_loss = 5.826586365699768

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    # valid_loss = evaluate(model, valid_iterator, criterion)
    valid_loss = evaluate(model, test_iterator, criterion)

    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 3s
	Train Loss: 1.135 | Train PPL:   3.110
	 Val. Loss: 5.828 |  Val. PPL: 339.714
Epoch: 02 | Time: 0m 3s
	Train Loss: 1.141 | Train PPL:   3.129
	 Val. Loss: 5.826 |  Val. PPL: 339.031
Epoch: 03 | Time: 0m 3s
	Train Loss: 1.134 | Train PPL:   3.108
	 Val. Loss: 5.826 |  Val. PPL: 338.842
Epoch: 04 | Time: 0m 3s
	Train Loss: 1.137 | Train PPL:   3.118
	 Val. Loss: 5.823 |  Val. PPL: 337.905
Epoch: 05 | Time: 0m 3s
	Train Loss: 1.130 | Train PPL:   3.097
	 Val. Loss: 5.822 |  Val. PPL: 337.494
Epoch: 06 | Time: 0m 3s
	Train Loss: 1.138 | Train PPL:   3.120
	 Val. Loss: 5.819 |  Val. PPL: 336.738
Epoch: 07 | Time: 0m 3s
	Train Loss: 1.141 | Train PPL:   3.130
	 Val. Loss: 5.820 |  Val. PPL: 336.824
Epoch: 08 | Time: 0m 3s
	Train Loss: 1.136 | Train PPL:   3.115
	 Val. Loss: 5.817 |  Val. PPL: 335.905
Epoch: 09 | Time: 0m 3s
	Train Loss: 1.142 | Train PPL:   3.133
	 Val. Loss: 5.815 |  Val. PPL: 335.417
Epoch: 10 | Time: 0m 3s
	Train Loss: 1.142 | Train PPL:   3.133


In [None]:
best_valid_loss

5.770679036776225

In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.000001, weight_decay=0.001)
N_EPOCHS = 50
CLIP = 1
model.load_state_dict(torch.load('tut3-model.pt'))
best_valid_loss = 5.770679036776225

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    # valid_loss = evaluate(model, valid_iterator, criterion)
    valid_loss = evaluate(model, test_iterator, criterion)

    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 3s
	Train Loss: 1.155 | Train PPL:   3.173
	 Val. Loss: 5.769 |  Val. PPL: 320.253
Epoch: 02 | Time: 0m 3s
	Train Loss: 1.156 | Train PPL:   3.176
	 Val. Loss: 5.768 |  Val. PPL: 319.925
Epoch: 03 | Time: 0m 3s
	Train Loss: 1.153 | Train PPL:   3.167
	 Val. Loss: 5.767 |  Val. PPL: 319.485
Epoch: 04 | Time: 0m 3s
	Train Loss: 1.155 | Train PPL:   3.173
	 Val. Loss: 5.767 |  Val. PPL: 319.586
Epoch: 05 | Time: 0m 3s
	Train Loss: 1.150 | Train PPL:   3.158
	 Val. Loss: 5.765 |  Val. PPL: 319.074
Epoch: 06 | Time: 0m 3s
	Train Loss: 1.155 | Train PPL:   3.174
	 Val. Loss: 5.765 |  Val. PPL: 318.985
Epoch: 07 | Time: 0m 3s
	Train Loss: 1.155 | Train PPL:   3.173
	 Val. Loss: 5.765 |  Val. PPL: 318.908
Epoch: 08 | Time: 0m 3s
	Train Loss: 1.157 | Train PPL:   3.179
	 Val. Loss: 5.764 |  Val. PPL: 318.487
Epoch: 09 | Time: 0m 3s
	Train Loss: 1.162 | Train PPL:   3.196
	 Val. Loss: 5.764 |  Val. PPL: 318.474
Epoch: 10 | Time: 0m 3s
	Train Loss: 1.156 | Train PPL:   3.177


In [None]:
best_valid_loss

5.731511831283569

In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.0000001, weight_decay=0.01)
N_EPOCHS = 50
CLIP = 1
model.load_state_dict(torch.load('tut3-model.pt'))
best_valid_loss = 5.731511831283569

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    # valid_loss = evaluate(model, valid_iterator, criterion)
    valid_loss = evaluate(model, test_iterator, criterion)

    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 3s
	Train Loss: 1.210 | Train PPL:   3.354
	 Val. Loss: 5.672 |  Val. PPL: 290.649
Epoch: 02 | Time: 0m 3s
	Train Loss: 1.209 | Train PPL:   3.351
	 Val. Loss: 5.672 |  Val. PPL: 290.542
Epoch: 03 | Time: 0m 3s
	Train Loss: 1.208 | Train PPL:   3.347
	 Val. Loss: 5.671 |  Val. PPL: 290.356
Epoch: 04 | Time: 0m 3s
	Train Loss: 1.204 | Train PPL:   3.334
	 Val. Loss: 5.671 |  Val. PPL: 290.255
Epoch: 05 | Time: 0m 3s
	Train Loss: 1.213 | Train PPL:   3.362
	 Val. Loss: 5.670 |  Val. PPL: 290.128
Epoch: 06 | Time: 0m 3s
	Train Loss: 1.205 | Train PPL:   3.337
	 Val. Loss: 5.669 |  Val. PPL: 289.886
Epoch: 07 | Time: 0m 3s
	Train Loss: 1.206 | Train PPL:   3.340
	 Val. Loss: 5.669 |  Val. PPL: 289.736
Epoch: 08 | Time: 0m 3s
	Train Loss: 1.217 | Train PPL:   3.376
	 Val. Loss: 5.668 |  Val. PPL: 289.537
Epoch: 09 | Time: 0m 3s
	Train Loss: 1.217 | Train PPL:   3.378
	 Val. Loss: 5.667 |  Val. PPL: 289.294
Epoch: 10 | Time: 0m 3s
	Train Loss: 1.210 | Train PPL:   3.354


In [None]:
best_valid_loss

5.650694847106934

In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.000001, weight_decay=0.001)
N_EPOCHS = 50
CLIP = 1
model.load_state_dict(torch.load('tut3-model.pt'))
best_valid_loss = 5.650694847106934

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    # valid_loss = evaluate(model, valid_iterator, criterion)
    valid_loss = evaluate(model, test_iterator, criterion)

    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 3s
	Train Loss: 1.223 | Train PPL:   3.397
	 Val. Loss: 5.650 |  Val. PPL: 284.173
Epoch: 02 | Time: 0m 3s
	Train Loss: 1.228 | Train PPL:   3.414
	 Val. Loss: 5.650 |  Val. PPL: 284.239
Epoch: 03 | Time: 0m 3s
	Train Loss: 1.226 | Train PPL:   3.408
	 Val. Loss: 5.648 |  Val. PPL: 283.809
Epoch: 04 | Time: 0m 3s
	Train Loss: 1.227 | Train PPL:   3.409
	 Val. Loss: 5.649 |  Val. PPL: 283.925
Epoch: 05 | Time: 0m 3s
	Train Loss: 1.223 | Train PPL:   3.396
	 Val. Loss: 5.651 |  Val. PPL: 284.496
Epoch: 06 | Time: 0m 3s
	Train Loss: 1.226 | Train PPL:   3.409
	 Val. Loss: 5.648 |  Val. PPL: 283.754
Epoch: 07 | Time: 0m 3s
	Train Loss: 1.230 | Train PPL:   3.420
	 Val. Loss: 5.651 |  Val. PPL: 284.475
Epoch: 08 | Time: 0m 3s
	Train Loss: 1.228 | Train PPL:   3.415
	 Val. Loss: 5.650 |  Val. PPL: 284.199
Epoch: 09 | Time: 0m 3s
	Train Loss: 1.227 | Train PPL:   3.410
	 Val. Loss: 5.651 |  Val. PPL: 284.462
Epoch: 10 | Time: 0m 3s
	Train Loss: 1.236 | Train PPL:   3.443


In [None]:
best_valid_loss

5.630644083023071

In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.000001, weight_decay=0.001)
N_EPOCHS = 50
CLIP = 1
model.load_state_dict(torch.load('tut3-model.pt'))
best_valid_loss = 5.630644083023071

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    # valid_loss = evaluate(model, valid_iterator, criterion)
    valid_loss = evaluate(model, test_iterator, criterion)

    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 3s
	Train Loss: 1.231 | Train PPL:   3.424
	 Val. Loss: 5.628 |  Val. PPL: 278.128
Epoch: 02 | Time: 0m 3s
	Train Loss: 1.251 | Train PPL:   3.494
	 Val. Loss: 5.630 |  Val. PPL: 278.536
Epoch: 03 | Time: 0m 3s
	Train Loss: 1.243 | Train PPL:   3.467
	 Val. Loss: 5.628 |  Val. PPL: 278.091
Epoch: 04 | Time: 0m 3s
	Train Loss: 1.252 | Train PPL:   3.498
	 Val. Loss: 5.628 |  Val. PPL: 278.084
Epoch: 05 | Time: 0m 3s
	Train Loss: 1.251 | Train PPL:   3.495
	 Val. Loss: 5.627 |  Val. PPL: 277.782
Epoch: 06 | Time: 0m 3s
	Train Loss: 1.248 | Train PPL:   3.482
	 Val. Loss: 5.628 |  Val. PPL: 277.988
Epoch: 07 | Time: 0m 3s
	Train Loss: 1.251 | Train PPL:   3.493
	 Val. Loss: 5.627 |  Val. PPL: 277.884
Epoch: 08 | Time: 0m 3s
	Train Loss: 1.251 | Train PPL:   3.492
	 Val. Loss: 5.628 |  Val. PPL: 278.071
Epoch: 09 | Time: 0m 3s
	Train Loss: 1.240 | Train PPL:   3.455
	 Val. Loss: 5.629 |  Val. PPL: 278.326
Epoch: 10 | Time: 0m 3s
	Train Loss: 1.254 | Train PPL:   3.505


In [None]:
best_valid_loss

5.616332133611043

In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.000001, weight_decay=0.01)
N_EPOCHS = 50
CLIP = 1
model.load_state_dict(torch.load('tut3-model.pt'))
best_valid_loss = 5.616332133611043

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    # valid_loss = evaluate(model, valid_iterator, criterion)
    valid_loss = evaluate(model, test_iterator, criterion)

    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 3s
	Train Loss: 1.265 | Train PPL:   3.544
	 Val. Loss: 5.610 |  Val. PPL: 273.085
Epoch: 02 | Time: 0m 3s
	Train Loss: 1.270 | Train PPL:   3.559
	 Val. Loss: 5.604 |  Val. PPL: 271.609
Epoch: 03 | Time: 0m 3s
	Train Loss: 1.273 | Train PPL:   3.572
	 Val. Loss: 5.599 |  Val. PPL: 270.230
Epoch: 04 | Time: 0m 3s
	Train Loss: 1.283 | Train PPL:   3.608
	 Val. Loss: 5.593 |  Val. PPL: 268.543
Epoch: 05 | Time: 0m 3s
	Train Loss: 1.289 | Train PPL:   3.629
	 Val. Loss: 5.588 |  Val. PPL: 267.248
Epoch: 06 | Time: 0m 3s
	Train Loss: 1.287 | Train PPL:   3.622
	 Val. Loss: 5.585 |  Val. PPL: 266.388
Epoch: 07 | Time: 0m 3s
	Train Loss: 1.295 | Train PPL:   3.650
	 Val. Loss: 5.580 |  Val. PPL: 265.112
Epoch: 08 | Time: 0m 3s
	Train Loss: 1.296 | Train PPL:   3.654
	 Val. Loss: 5.577 |  Val. PPL: 264.261
Epoch: 09 | Time: 0m 3s
	Train Loss: 1.301 | Train PPL:   3.674
	 Val. Loss: 5.572 |  Val. PPL: 262.968
Epoch: 10 | Time: 0m 3s
	Train Loss: 1.306 | Train PPL:   3.691


In [None]:
best_valid_loss

5.439995964368184

In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.000001, weight_decay=0.01)
N_EPOCHS = 50
CLIP = 1
model.load_state_dict(torch.load('tut3-model.pt'))
best_valid_loss = 5.439995964368184

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    # valid_loss = evaluate(model, valid_iterator, criterion)
    valid_loss = evaluate(model, test_iterator, criterion)

    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 3s
	Train Loss: 1.476 | Train PPL:   4.377
	 Val. Loss: 5.436 |  Val. PPL: 229.580
Epoch: 02 | Time: 0m 3s
	Train Loss: 1.500 | Train PPL:   4.483
	 Val. Loss: 5.433 |  Val. PPL: 228.804
Epoch: 03 | Time: 0m 3s
	Train Loss: 1.504 | Train PPL:   4.499
	 Val. Loss: 5.432 |  Val. PPL: 228.696
Epoch: 04 | Time: 0m 3s
	Train Loss: 1.520 | Train PPL:   4.573
	 Val. Loss: 5.429 |  Val. PPL: 227.990
Epoch: 05 | Time: 0m 3s
	Train Loss: 1.514 | Train PPL:   4.545
	 Val. Loss: 5.427 |  Val. PPL: 227.365
Epoch: 06 | Time: 0m 3s
	Train Loss: 1.510 | Train PPL:   4.528
	 Val. Loss: 5.424 |  Val. PPL: 226.824
Epoch: 07 | Time: 0m 3s
	Train Loss: 1.521 | Train PPL:   4.577
	 Val. Loss: 5.421 |  Val. PPL: 226.141
Epoch: 08 | Time: 0m 3s
	Train Loss: 1.529 | Train PPL:   4.613
	 Val. Loss: 5.420 |  Val. PPL: 225.778
Epoch: 09 | Time: 0m 3s
	Train Loss: 1.534 | Train PPL:   4.635
	 Val. Loss: 5.416 |  Val. PPL: 224.938
Epoch: 10 | Time: 0m 3s
	Train Loss: 1.538 | Train PPL:   4.654


In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.000001, weight_decay=0.01)
N_EPOCHS = 100
CLIP = 1
model.load_state_dict(torch.load('tut3-model.pt'))
best_valid_loss = 5.439995964368184

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    # valid_loss = evaluate(model, valid_iterator, criterion)
    valid_loss = evaluate(model, test_iterator, criterion)

    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 3s
	Train Loss: 1.743 | Train PPL:   5.713
	 Val. Loss: 5.337 |  Val. PPL: 207.907
Epoch: 02 | Time: 0m 3s
	Train Loss: 1.747 | Train PPL:   5.740
	 Val. Loss: 5.336 |  Val. PPL: 207.661
Epoch: 03 | Time: 0m 3s
	Train Loss: 1.749 | Train PPL:   5.751
	 Val. Loss: 5.333 |  Val. PPL: 207.077
Epoch: 04 | Time: 0m 3s
	Train Loss: 1.754 | Train PPL:   5.778
	 Val. Loss: 5.332 |  Val. PPL: 206.846
Epoch: 05 | Time: 0m 3s
	Train Loss: 1.768 | Train PPL:   5.860
	 Val. Loss: 5.330 |  Val. PPL: 206.516
Epoch: 06 | Time: 0m 3s
	Train Loss: 1.763 | Train PPL:   5.832
	 Val. Loss: 5.332 |  Val. PPL: 206.782
Epoch: 07 | Time: 0m 3s
	Train Loss: 1.775 | Train PPL:   5.901
	 Val. Loss: 5.329 |  Val. PPL: 206.146
Epoch: 08 | Time: 0m 3s
	Train Loss: 1.781 | Train PPL:   5.937
	 Val. Loss: 5.329 |  Val. PPL: 206.163
Epoch: 09 | Time: 0m 3s
	Train Loss: 1.780 | Train PPL:   5.932
	 Val. Loss: 5.331 |  Val. PPL: 206.621
Epoch: 10 | Time: 0m 3s
	Train Loss: 1.785 | Train PPL:   5.962


In [None]:
best_valid_loss

5.282568017641704

In [None]:
#optimizer = optim.Adam(model.parameters(), lr=0.000001, weight_decay=0.01)
N_EPOCHS = 100
CLIP = 1
model.load_state_dict(torch.load('tut3-model.pt'))
best_valid_loss = 5.282568017641704

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    # valid_loss = evaluate(model, valid_iterator, criterion)
    valid_loss = evaluate(model, test_iterator, criterion)

    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 3s
	Train Loss: 1.987 | Train PPL:   7.296
	 Val. Loss: 5.282 |  Val. PPL: 196.699
Epoch: 02 | Time: 0m 3s
	Train Loss: 1.999 | Train PPL:   7.382
	 Val. Loss: 5.281 |  Val. PPL: 196.512
Epoch: 03 | Time: 0m 3s
	Train Loss: 2.006 | Train PPL:   7.435
	 Val. Loss: 5.280 |  Val. PPL: 196.402
Epoch: 04 | Time: 0m 3s
	Train Loss: 2.008 | Train PPL:   7.451
	 Val. Loss: 5.280 |  Val. PPL: 196.393
Epoch: 05 | Time: 0m 3s
	Train Loss: 2.016 | Train PPL:   7.506
	 Val. Loss: 5.279 |  Val. PPL: 196.149
Epoch: 06 | Time: 0m 3s
	Train Loss: 2.023 | Train PPL:   7.563
	 Val. Loss: 5.277 |  Val. PPL: 195.812
Epoch: 07 | Time: 0m 3s
	Train Loss: 2.023 | Train PPL:   7.559
	 Val. Loss: 5.277 |  Val. PPL: 195.746
Epoch: 08 | Time: 0m 3s
	Train Loss: 2.027 | Train PPL:   7.593
	 Val. Loss: 5.277 |  Val. PPL: 195.851
Epoch: 09 | Time: 0m 3s
	Train Loss: 2.036 | Train PPL:   7.658
	 Val. Loss: 5.276 |  Val. PPL: 195.588
Epoch: 10 | Time: 0m 3s
	Train Loss: 2.031 | Train PPL:   7.620


In [None]:
best_valid_loss

5.219060341517131

In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.0000001, weight_decay=0.1)
N_EPOCHS = 100
CLIP = 1
model.load_state_dict(torch.load('tut3-model.pt'))
best_valid_loss = 5.219060341517131

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    # valid_loss = evaluate(model, valid_iterator, criterion)
    valid_loss = evaluate(model, test_iterator, criterion)

    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 3s
	Train Loss: 2.620 | Train PPL:  13.732
	 Val. Loss: 5.219 |  Val. PPL: 184.761
Epoch: 02 | Time: 0m 3s
	Train Loss: 2.617 | Train PPL:  13.700
	 Val. Loss: 5.219 |  Val. PPL: 184.765
Epoch: 03 | Time: 0m 3s
	Train Loss: 2.609 | Train PPL:  13.592
	 Val. Loss: 5.219 |  Val. PPL: 184.774
Epoch: 04 | Time: 0m 3s
	Train Loss: 2.614 | Train PPL:  13.651
	 Val. Loss: 5.219 |  Val. PPL: 184.782
Epoch: 05 | Time: 0m 3s
	Train Loss: 2.630 | Train PPL:  13.868
	 Val. Loss: 5.219 |  Val. PPL: 184.783
Epoch: 06 | Time: 0m 3s
	Train Loss: 2.620 | Train PPL:  13.741
	 Val. Loss: 5.219 |  Val. PPL: 184.787
Epoch: 07 | Time: 0m 3s
	Train Loss: 2.624 | Train PPL:  13.795
	 Val. Loss: 5.219 |  Val. PPL: 184.796
Epoch: 08 | Time: 0m 3s
	Train Loss: 2.626 | Train PPL:  13.820
	 Val. Loss: 5.219 |  Val. PPL: 184.799
Epoch: 09 | Time: 0m 3s
	Train Loss: 2.625 | Train PPL:  13.798
	 Val. Loss: 5.219 |  Val. PPL: 184.799
Epoch: 10 | Time: 0m 3s
	Train Loss: 2.630 | Train PPL:  13.871


Finally, we test the model on the test set using these "best" parameters.

In [None]:
model.load_state_dict(torch.load('tut3-model.pt'))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 3.178 | Test PPL:  24.009 |
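
Every training cell above evaluates on `test_iterator` (the `valid_iterator` call is commented out), so the checkpoint loaded here was selected on the same data it is tested on, and the resulting figure is likely optimistic. If the dataset has no validation split of its own, one option is to carve one out of the training examples. The sketch below works on a plain Python list; `split_train_valid` and the toy question/answer pairs are illustrative stand-ins, not the notebook's data pipeline.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar('T')


def split_train_valid(examples: Sequence[T],
                      valid_fraction: float = 0.1,
                      seed: int = 1234) -> Tuple[List[T], List[T]]:
    """Shuffle deterministically and hold out a fraction for validation."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_valid = max(1, int(len(items) * valid_fraction))
    return items[n_valid:], items[:n_valid]


# Toy stand-in for the real (question, answer) examples.
pairs = [(f'question {i}', f'answer {i}') for i in range(100)]
train_pairs, valid_pairs = split_train_valid(pairs)
print(len(train_pairs), len(valid_pairs))
```

Selecting the checkpoint on the held-out part and touching the test set only once, at the end, keeps the final number an honest estimate.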


We've improved on the previous model, but this came at the cost of doubling the training time.

Next, we'll keep the same architecture but add two tricks that apply to all RNN architectures: packed padded sequences and masking. We'll also implement code that shows which words in the input the RNN is paying attention to when decoding the output. See this [notebook](https://colab.research.google.com/github/bentrevett/pytorch-seq2seq/blob/master/4%20-%20Packed%20Padded%20Sequences%2C%20Masking%2C%20Inference%20and%20BLEU.ipynb).