In [1]:
!pip install nltk

[0m

In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Sequence to Sequence model with RNN encoder and decoder

We are now ready to begin implementing our first sequence to sequence model. We will do this using a RNN encoder and a RNN decoder. 

The encoder will take a source sentence as input and will encode it into a single vector (also known as a context vector or a latent vector) which will be passed to the decoder. The decoder will then use this context vector to generate a new sequence (in our case, a translation of the source sentence).

So, given a sequential input sentence $X = \{x_1, x_2, ..., x_T\}$, we want to encode it into a single vector $z$. At each step we have the hidden state $h_t$:
<br>
<br>
$$h_t = \text{Encoder}(e(x_t), h_{t-1}),$$
<br>
where $e(x_t)$ is the embedding of the current token $x_t$. This means that in practive, the $z$ vector will actually be $h_T$, the last hidden state of the RNN.

After this, we will use $z$ as the initial hidden state of the decoder. The decoder works just like the encoder, except that it takes $z$ as the initial hidden state and takes as input the previous token embedding $e(y_t)$ and the previous hidden state $s_{t-1}$ to predict the next hidden state $s_t$:
<br>
<br>
$$s_t = \text{Decoder}(d(y_t), s_{t-1}),$$
<br>

where $d(y_t)$ is the embedding of the current token $y_t$. We use another letter because the embedding of the input and output tokens may be different. Finally, we use the hidden state $s_t$ to predict the next token $\hat{y}_{t+1}$. We do this until we predict an end-of-sentence token, or we reach a maximum length of the sequence. Follows a diagram of this process (the examples uses a GRU):
<br>
<br>

<p align="center">
  <img src="https://incredible.ai/assets/images/seq2seq-seq2seq_ts.png" />
</p>

<br>
<br>

# Data

We will the dataset named "VanessaSchenkel/translation-en-pt", available in HuggingFace's datasets library. This dataset contains pairs of sentences in English and Portuguese. We will use this dataset to train our model to translate from English to Portuguese.

In [3]:
from datasets import load_dataset

main_data = load_dataset("VanessaSchenkel/translation-en-pt", field="data")

In [4]:
main_data

DatasetDict({
    train: Dataset({
        features: ['translation', 'id'],
        num_rows: 260482
    })
})

The dataset contains only a train split, so we will split it into train and validation sets. We will use 80% of the data for training and 20% for validation. 

Like before we first need to pre-process the data and tokenize all examples, encode them into integers and create dataloaders to iterate over the batches. 

We will use the same tokenizer as before, but we will need to add the special tokens "\<eos\>" (end of sentence) and "\<sos\>" (start of sentence) to the vocabulary. 

We start by building the two vocabulary, one for the source language (English) and one for the target language (Portuguese).

Note: we limit the total number of exemples to 50000 to speed up training.

In [5]:
 # For this example we will keep punctuation and capital letters, meaning that can use directly the word_tokenize function from nltk
# Also, we will not remove stopwords or rare words, since they can be important for the translation
from nltk.tokenize import word_tokenize

# note: skip the example with id 199351, since it is a very long sentence
english_tokens= []
portuguese_tokens = []
for d in main_data['train']:
    eng_tokens = word_tokenize(d['translation']['english'].lower())
    pt_tokens = word_tokenize(d['translation']['portuguese'].lower())
    if len(eng_tokens) > 15 or len(pt_tokens) > 15:
        continue
    english_tokens.append(eng_tokens)
    portuguese_tokens.append(pt_tokens)
    if len(english_tokens) == 50000:
        break

# Is 15 but let's get the maximum length of the sentences. We will use this to pad the sentences
max_len_english = max([len(s) for s in english_tokens])
max_len_portuguese = max([len(s) for s in portuguese_tokens])

# Ok, now we can get the unique tokens for each language
unique_english_tokens = sorted(list(set([tk for s in english_tokens for tk in s])))
unique_portuguese_tokens = sorted(list(set([tk for s in portuguese_tokens for tk in s])))

print("English vocabulary size: ", len(unique_english_tokens))
print("Portuguese vocabulary size: ", len(unique_portuguese_tokens))
print("")
print("Maximum length of English sentences: ", max_len_english)
print("Maximum length of Portuguese sentences: ", max_len_portuguese)

English vocabulary size:  11461
Portuguese vocabulary size:  17554

Maximum length of English sentences:  15
Maximum length of Portuguese sentences:  15


In [6]:
len(english_tokens)

50000

In [7]:
english_tokens[0]

['let', "'s", 'try', 'something', '.']

In [8]:
unique_english_tokens = ['<pad>','<sos>', '<eos>'] + unique_english_tokens
tokeng2id = {t: i for i, t in enumerate(unique_english_tokens)}
id2tokeng = {i: t for t, i in tokeng2id.items()}

unique_portuguese_tokens = ['<pad>','<sos>', '<eos>'] + unique_portuguese_tokens
tokpt2id = {t: i for i, t in enumerate(unique_portuguese_tokens)}
id2tokpt = {i: t for t, i in tokpt2id.items()}

In [9]:
print(tokeng2id["<pad>"])
print(tokeng2id["<sos>"])
print(tokeng2id["<eos>"])

0
1
2


To make things simpler let's add the special tokens manually. 

In [10]:
english_tokens_ids = [[1]+[tokeng2id[t] for t in s]+[2] for s in english_tokens]
portuguese_tokens_ids = [[1]+[tokpt2id[t] for t in s]+[2] for s in portuguese_tokens]

We will pad each sentence to the maximum length of the batch. This means that if the maximum length of the batch is 50, all sentences will be padded to length 50. 

We will the english sentences to the left because we want the final token to be the \<eos\>, and we will pad the portuguese sentences to the right because we want the first token to be the \<sos\>.

In [11]:
def pad_sequence(seq, max_length = 500, pad_direction = 'left'):
    if pad_direction == 'left':
        return seq[:max_length] if len(seq) > max_length else [0] * (max_length - len(seq)) + seq
    elif pad_direction == 'right':
        return seq[:max_length] if len(seq) > max_length else seq + [0] * (max_length - len(seq))
    else:
        raise ValueError("pad_direction must be either 'left' or 'right'")


english_tokens_ids = [pad_sequence(seq, max_length=17,pad_direction='left') for seq in english_tokens_ids]
portuguese_tokens_ids = [pad_sequence(seq, max_length=17,pad_direction='right') for seq in portuguese_tokens_ids]

In [12]:
print(english_tokens_ids[0])
print(portuguese_tokens_ids[0])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 6057, 24, 10606, 9529, 29, 2]
[1, 16811, 16005, 1005, 3647, 3, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


# Dataset and DataLoader

Ok, we are now ready to create the dataset and the dataloaders. We will use the same batch size as before (32).
Also, before we need to do the split between train and validation sets. We will use 80% of the data for training and 20% for validation.

In [13]:
len(english_tokens_ids)

50000

In [14]:
# We assume the data is already randomly shuffled
train_size = int(len(english_tokens_ids) * 0.8)

train_en = english_tokens_ids[:train_size]
train_pt = portuguese_tokens_ids[:train_size]

val_en = english_tokens_ids[train_size:]
val_pt = portuguese_tokens_ids[train_size:]

print("Size of training set: ", len(train_en))
print("Size of validation set: ", len(val_en))

Size of training set:  40000
Size of validation set:  10000


Once again we can create the Dataloader with the help of the Dataset class.

In [15]:
from torch.utils.data import DataLoader
from datasets import Dataset

list_data = [{'english':train_en[i],'portuguese':train_pt[i]} for i in range(len(train_en))]
train_dataset = Dataset.from_list(list_data)
train_dataset = train_dataset.with_format("torch")

batch_size = 32 # number of sequences in each batch
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size) # train_dataloader is an iterator that returns a batch each time it is called

list_data = [{'english':val_en[i],'portuguese':val_pt[i]} for i in range(len(val_en))]
val_dataset = Dataset.from_list(list_data)
val_dataset = val_dataset.with_format("torch")

val_dataloader = DataLoader(val_dataset, shuffle=True, batch_size=batch_size) # val_dataloader is an iterator that returns a batch each time it is called

# Encoder, Decoder 

We are now ready to implement the encoder, decoder and seq2seq models. We will use a LSTM for both the encoder and the decoder.

Let's start with the encoder:

In [16]:
import torch
import torch.nn as nn
torch.manual_seed(0)

<torch._C.Generator at 0x7f4055522170>

In [17]:
# The main class of the RNN built from the nn.Module class
class Encoder(nn.Module):

    def __init__(self, vocab_size, emb_d,hidden_d,n_layers, drop_prob = 0.5):
        """
        Initialize the RNN Module

        Arguments:
        vocab_size: size of the vocabulary
        output_size: size of the output layer
        emb_d: size of the embedding layer
        h_d: size of the hidden layer
        n_layers: number of layers
        drop_prob: dropout probability
        """
    
        super().__init__()

        # define the embedding layer
        self.embedding = nn.Embedding(vocab_size, emb_d)

        # define a RNN layer
        self.rnn = nn.LSTM(emb_d, hidden_d, n_layers, dropout = drop_prob, batch_first=True) # batch_first=True means that the first dimension of the input and output will be the batch_size

        # define a dropout layer
        self.dropout = nn.Dropout(drop_prob)

    def forward(self, x):
        """
        Perform a forward pass of our model on some input and hidden state.

        Arguments:
        x: input to the model
        hidden: hidden state

        Returns:
        output: output of the model
        hidden: hidden state
        """

        # get the embedding vectors from lookup embedding layer 
        embeds = self.embedding(x) # shape: (batch_size, seq_length, emb_d)

        # pass the embedding vectors to the RNN layer. We get the output and the hidden state and cell state to initialize the decoder  
        # shape of out: (batch_size, seq_length, hidden_d)
        # shape of hidden: (n_layers, batch_size, hidden_d)
        # shape of cell: (n_layers, batch_size, hidden_d)
        out, (hidden,cell) = self.rnn(embeds) 
        out = self.dropout(out)
        return out, (hidden,cell)

Let's test the Encoder with a forward pass on the first english example:

In [18]:
vocab_size = len(tokeng2id)
embedding_dim = 50
hidden_dim = 32
n_layers = 1

model_enc = Encoder(vocab_size, embedding_dim, hidden_dim, n_layers)
model_enc



Encoder(
  (embedding): Embedding(11464, 50)
  (rnn): LSTM(50, 32, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
)

In [19]:
print("Tokens:",english_tokens[0])
token_ids = [tokeng2id[tk] for tk in english_tokens[0]]
print("Tokens IDS:",token_ids)

Tokens: ['let', "'s", 'try', 'something', '.']
Tokens IDS: [6057, 24, 10606, 9529, 29]


In [20]:
torch.IntTensor([token_ids,token_ids]).shape

torch.Size([2, 5])

In [21]:
model_enc.eval()
output_enc,(hidden_z,cell_z) = model_enc(torch.IntTensor([token_ids,token_ids])) # We had two sentences just to check simulate a batch of size 2
print("Output shape:",output_enc.shape)
print("Hidden shape:",hidden_z.shape)
print("Cell shape:",cell_z.shape)
print(output_enc)
print(hidden_z)
print(cell_z)

Output shape: torch.Size([2, 5, 32])
Hidden shape: torch.Size([1, 2, 32])
Cell shape: torch.Size([1, 2, 32])
tensor([[[ 1.3423e-01,  3.6583e-02, -6.6397e-02,  2.2418e-01, -1.1520e-01,
           3.4336e-01,  4.3391e-02, -1.7703e-02, -1.5032e-01,  1.1651e-01,
          -7.8157e-02, -2.1796e-02, -3.2822e-02,  5.7263e-02, -1.1593e-01,
           1.0967e-02, -8.7867e-02,  6.7248e-02,  2.9072e-02, -1.2149e-01,
          -3.6983e-02, -1.4534e-02,  5.6360e-02,  7.8790e-02, -1.8293e-02,
           4.6104e-02, -2.6533e-02,  5.0323e-02, -4.1733e-02, -1.6648e-01,
          -7.0855e-02,  2.6974e-01],
         [ 1.5144e-02, -5.9152e-02, -1.7571e-01,  1.2108e-01, -4.8443e-03,
           2.9913e-01,  1.3320e-01,  7.6689e-02, -2.6027e-01,  1.7537e-01,
          -3.3836e-02,  1.8307e-01,  7.0443e-02, -1.5437e-01, -2.1981e-01,
          -2.2986e-01,  1.9333e-01,  1.4743e-01, -2.6945e-01,  1.1514e-02,
          -6.5978e-02, -7.5610e-02,  4.8579e-01, -3.4326e-02,  1.2818e-01,
          -6.8017e-04, -6.065

The Encoder seems to be working fine. It outputs 5 hidden states, one for each token in the input sequenc, the last hidden state (that is equal to the last output)  and the cell state.

Now let's implement the decoder. The decoder will take as input the encoder hidden state and cell state, and the target sequence (t-1) and will output the predicted sequence (t). We will also add a fully connected layer to the output of the decoder to predict the next token.

Note: the encoder the RNN also takes the hidden and cell states but we don't need to pass them explicitly because they are initialized to zero by default.

In [22]:
class Decoder(nn.Module):

    def __init__(self, vocab_size, emb_d,hidden_d,n_layers, drop_prob = 0.5):
        """
        Initialize the RNN Module

        Arguments:
        vocab_size: size of the vocabulary
        output_size: size of the output layer
        emb_d: size of the embedding layer
        h_d: size of the hidden layer
        n_layers: number of layers
        drop_prob: dropout probability
        """
    
        super().__init__()

        # define the embedding layer
        self.embedding = nn.Embedding(vocab_size, emb_d)

        # define a RNN layer
        self.rnn = nn.LSTM(emb_d, hidden_d, n_layers, dropout = drop_prob, batch_first=True) # batch_first=True means that the first dimension of the input and output will be the batch_size

        # define a linear layer what will be used to predict the next word
        self.fc_out = nn.Linear(hidden_d, vocab_size)

        # define a dropout layer
        self.dropout = nn.Dropout(drop_prob)

    def forward(self, x, hidden, cell):
        """
        Perform a forward pass of our model on some input and hidden state.

        Arguments:
        x: input to the model
        hidden: hidden state
        cell: cell state

        Returns:

        """
        # get the embedding vectors from lookup embedding layer 
        embeds = self.embedding(x)

        # pass the embedding vectors, the hidden can cell states to the RNN layer.
        # contrary to the encoder, the hidden and cell states are not initialized with zeros, but with the values from the encoder
        # shape of out: (batch_size, seq_length, hidden_d)
        # shape of hidden: (n_layers, batch_size, hidden_d)
        # shape of cell: (n_layers, batch_size, hidden_d)
        out, (hidden, cell) = self.rnn(embeds, (hidden, cell))

        out = self.dropout(out)


        # fully connected layer that will return a vector with the size of the vocabulary. This vector will be used to predict the next word
        # shape of pred: (batch_size, seq_length, vocab_size)
        pred = self.fc_out(out)

        return pred, (hidden, cell)


Ok let's check if the decoder is working using the context vectors we got from the encoder.

In [23]:
model_dec = Decoder(vocab_size, embedding_dim, hidden_dim, n_layers)
model_dec

Decoder(
  (embedding): Embedding(11464, 50)
  (rnn): LSTM(50, 32, batch_first=True, dropout=0.5)
  (fc_out): Linear(in_features=32, out_features=11464, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

Let's test with a forward pass on the first english example:

In [24]:
prediction,(hidden,cell)  = model_dec(torch.IntTensor([token_ids[-1:],token_ids[-1:]]),hidden_z,cell_z) # Simulate the input of the last token of the sentence
print("Output shape:",prediction.shape)
print("Hidden shape:",hidden.shape)
print("Cell shape:",cell.shape)
print(prediction)
print(hidden)
print(cell)

Output shape: torch.Size([2, 1, 11464])
Hidden shape: torch.Size([1, 2, 32])
Cell shape: torch.Size([1, 2, 32])
tensor([[[ 0.0288, -0.0730, -0.0555,  ..., -0.1163, -0.0195,  0.0394]],

        [[ 0.0653, -0.1216,  0.0790,  ..., -0.1256, -0.1290,  0.1666]]],
       grad_fn=<ViewBackward0>)
tensor([[[-0.1121, -0.0330,  0.0318, -0.1038,  0.0981, -0.1162, -0.0084,
           0.1633,  0.0326,  0.0095, -0.1046,  0.1206,  0.0851, -0.2097,
           0.0712,  0.0960,  0.2367,  0.0634,  0.0012, -0.0975,  0.0025,
          -0.1637,  0.0416,  0.1224,  0.1393,  0.0629, -0.1018, -0.0705,
           0.2856,  0.2441, -0.1941,  0.0114],
         [-0.1121, -0.0330,  0.0318, -0.1038,  0.0981, -0.1162, -0.0084,
           0.1633,  0.0326,  0.0095, -0.1046,  0.1206,  0.0851, -0.2097,
           0.0712,  0.0960,  0.2367,  0.0634,  0.0012, -0.0975,  0.0025,
          -0.1637,  0.0416,  0.1224,  0.1393,  0.0629, -0.1018, -0.0705,
           0.2856,  0.2441, -0.1941,  0.0114]]], grad_fn=<StackBackward0>)
tens

The decoder is also working fine. It outpus the output from the fully-connected layer twice, one for each token in the target sequence. The same for the hidden and cell states.

# Seq2Seq model

We are now ready to implement the seq2seq model. The seq2seq model will take as input the source sequence and will output the predicted target sequence. 

Since we want to train with teacher forcing, we will pass the target sequence to the decoder. Teacher forcing is a technique where the target word is passed, with some probability, as the next input to the decoder. The intuition behind teacher forcing is that it will help the decoder learn to better predict the next token.

We will use the Encoder and Decoder classes we implemented before. The encoder will take as input the source sequence and will output the context vectors (the last hidden and cell states). Then we will use the Decoder class to iterate over the target sequence and predict the next token. In each iteration we predict the next token only, based in the previous hidden can cell states, and the previous true or predicted token, depending on the teacher forcing probability. 

Follows the main steps we need to implement:

1. Pass the source sequence to the encoder and get the context vectors.
2. Initialize the decoder with the context vectors and the \<sos\> token.
3. Predict the next token, hidden and cell states.
4. Repeat 3. with the true or predicted token, depending on the teacher forcing probability, and the new hidden and cell states. 
5. Stop when we reach the maximum length of the sequence or when we predict the \<eos\> token.

In [25]:
import random
random.seed(42)

In [26]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, source, target, teacher_forcing_ratio = 0.5):
        batch_size = source.shape[0]
        target_len = target.shape[1]
        target_vocab_size = len(tokpt2id)

        outputs = torch.zeros(batch_size, target_len, target_vocab_size)

        _, (hidden,cell) = self.encoder(source)

        # first input to the decoder is the <sos> token
        # shape of x: (batch_size, 1)
        x = target[:,:1]

        for t in range(1, target_len):

            output, (hidden, cell) = self.decoder(x, hidden, cell)

            outputs[:,t:t+1,:] = output

            best_guess = output.argmax(dim = -1)

            x = target[:,t:t+1] if random.random() < teacher_forcing_ratio else best_guess

        return outputs

In [27]:
vocab_eng_size = len(tokeng2id)
vocab_pt_size = len(tokpt2id)
device = "cuda"

embedding_dim = 100
hidden_dim = 256
n_layers = 1

model_enc = Encoder(vocab_eng_size, embedding_dim, hidden_dim, n_layers)
model_dec = Decoder(vocab_pt_size, embedding_dim, hidden_dim, n_layers)

model = Seq2Seq(model_enc, model_dec).to(device)
model

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(11464, 100)
    (rnn): LSTM(100, 256, batch_first=True, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(17557, 100)
    (rnn): LSTM(100, 256, batch_first=True, dropout=0.5)
    (fc_out): Linear(in_features=256, out_features=17557, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [28]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 8,147,433 trainable parameters


In [29]:
from tqdm import tqdm_notebook as tqdm
import math

# main training loop
n_epochs = 10
lr=1e-3
clip = 1
criterion = nn.CrossEntropyLoss(ignore_index = 0)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

step = 0
evaluation_step = 250
train_losses = []
val_losses = []


for epoch in range(n_epochs):
    loss_train_total = 0


    for i, batch in tqdm(enumerate(train_dataloader),total=len(train_dataloader)):
        step += 1

        source = batch['english'].to(device)
        target = batch['portuguese'].to(device)

        # zero the gradients
        optimizer.zero_grad()

        # forward pass
        output = model(source, target)
        output = output.to(device)

        output = output[:,1:,:].reshape(-1, output.shape[2])
        target = target[:,1:].reshape(-1)

        # target = target.to("cpu")

        loss = criterion(output, target)
        loss_train_total += loss.item()
        # backward pass
        loss.backward()

        # clip the gradients to prevent exploding gradient problem. This step is very important.
        nn.utils.clip_grad_norm_(model.parameters(), clip)

        # update the parameters
        optimizer.step()

        # print(loss.item())

        # evaluation step
        if step % evaluation_step == 0:
            model.eval()
            with torch.no_grad():
                # evaluate on training data
                loss_val = 0
                for j,batch in enumerate(val_dataloader):
                    source = batch['english'].to(device)
                    target = batch['portuguese'].to(device)

                    # forward pass
                    output = model(source, target, teacher_forcing_ratio = 0) # we do not use teacher forcing here
                    output = output.to(device)

                    output = output[:,1:,:].reshape(-1, output.shape[2])
                    target = target[:,1:].reshape(-1)

                    loss = criterion(output, target)
                    loss_val += loss.item()


            # Calculate training perplexity
            train_perplexity = math.exp(loss_train_total/i)


            # Calculate perplexity
            val_perplexity = math.exp(loss_val / j)


            # print the loss at each step 
            print(f'Step: {step} | Train Loss: {loss_train_total/i: .3f} | Val Loss: {loss_val/j: .3f} | Train Perplexity: {train_perplexity: .3f} | Val Perplexity: {val_perplexity: .3f}')

            model.train()
            
            train_losses.append(loss_train_total/i)
            val_losses.append(loss_val/j)

    
    # print the loss and ppl at each epoch
    print(f'Epochs: {epoch + 1} | Train Loss: {loss_train_total/i: .3f} | Val Loss: {loss_val/j: .3f} | Train Perplexity: {train_perplexity: .3f} | Val Perplexity: {val_perplexity: .3f}')


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for i, batch in tqdm(enumerate(train_dataloader),total=len(train_dataloader)):


  0%|          | 0/1250 [00:00<?, ?it/s]

Step: 250 | Train Loss:  6.024 | Val Loss:  5.562 | Train Perplexity:  413.337 | Val Perplexity:  260.410
Step: 500 | Train Loss:  5.744 | Val Loss:  5.433 | Train Perplexity:  312.218 | Val Perplexity:  228.794
Step: 750 | Train Loss:  5.603 | Val Loss:  5.298 | Train Perplexity:  271.348 | Val Perplexity:  199.883
Step: 1000 | Train Loss:  5.499 | Val Loss:  5.217 | Train Perplexity:  244.390 | Val Perplexity:  184.359
Step: 1250 | Train Loss:  5.409 | Val Loss:  5.091 | Train Perplexity:  223.455 | Val Perplexity:  162.627
Epochs: 1 | Train Loss:  5.409 | Val Loss:  5.091 | Train Perplexity:  223.455 | Val Perplexity:  162.627


  0%|          | 0/1250 [00:00<?, ?it/s]

Step: 1500 | Train Loss:  4.855 | Val Loss:  4.980 | Train Perplexity:  128.366 | Val Perplexity:  145.423
Step: 1750 | Train Loss:  4.807 | Val Loss:  4.924 | Train Perplexity:  122.398 | Val Perplexity:  137.614
Step: 2000 | Train Loss:  4.755 | Val Loss:  4.851 | Train Perplexity:  116.141 | Val Perplexity:  127.908
Step: 2250 | Train Loss:  4.723 | Val Loss:  4.752 | Train Perplexity:  112.470 | Val Perplexity:  115.851
Step: 2500 | Train Loss:  4.683 | Val Loss:  4.733 | Train Perplexity:  108.045 | Val Perplexity:  113.615
Epochs: 2 | Train Loss:  4.683 | Val Loss:  4.733 | Train Perplexity:  108.045 | Val Perplexity:  113.615


  0%|          | 0/1250 [00:00<?, ?it/s]

Step: 2750 | Train Loss:  4.411 | Val Loss:  4.653 | Train Perplexity:  82.373 | Val Perplexity:  104.862
Step: 3000 | Train Loss:  4.364 | Val Loss:  4.630 | Train Perplexity:  78.534 | Val Perplexity:  102.558
Step: 3250 | Train Loss:  4.337 | Val Loss:  4.590 | Train Perplexity:  76.503 | Val Perplexity:  98.454
Step: 3500 | Train Loss:  4.303 | Val Loss:  4.558 | Train Perplexity:  73.938 | Val Perplexity:  95.433
Step: 3750 | Train Loss:  4.283 | Val Loss:  4.485 | Train Perplexity:  72.428 | Val Perplexity:  88.716
Epochs: 3 | Train Loss:  4.283 | Val Loss:  4.485 | Train Perplexity:  72.428 | Val Perplexity:  88.716


  0%|          | 0/1250 [00:00<?, ?it/s]

Step: 4000 | Train Loss:  4.045 | Val Loss:  4.465 | Train Perplexity:  57.128 | Val Perplexity:  86.913
Step: 4250 | Train Loss:  4.029 | Val Loss:  4.432 | Train Perplexity:  56.227 | Val Perplexity:  84.123
Step: 4500 | Train Loss:  4.015 | Val Loss:  4.398 | Train Perplexity:  55.432 | Val Perplexity:  81.262
Step: 4750 | Train Loss:  4.011 | Val Loss:  4.374 | Train Perplexity:  55.195 | Val Perplexity:  79.339
Step: 5000 | Train Loss:  3.998 | Val Loss:  4.336 | Train Perplexity:  54.489 | Val Perplexity:  76.439
Epochs: 4 | Train Loss:  3.998 | Val Loss:  4.336 | Train Perplexity:  54.489 | Val Perplexity:  76.439


  0%|          | 0/1250 [00:00<?, ?it/s]

Step: 5250 | Train Loss:  3.769 | Val Loss:  4.319 | Train Perplexity:  43.337 | Val Perplexity:  75.134
Step: 5500 | Train Loss:  3.760 | Val Loss:  4.308 | Train Perplexity:  42.941 | Val Perplexity:  74.300
Step: 5750 | Train Loss:  3.761 | Val Loss:  4.275 | Train Perplexity:  43.012 | Val Perplexity:  71.906
Step: 6000 | Train Loss:  3.749 | Val Loss:  4.260 | Train Perplexity:  42.468 | Val Perplexity:  70.795
Step: 6250 | Train Loss:  3.734 | Val Loss:  4.244 | Train Perplexity:  41.842 | Val Perplexity:  69.686
Epochs: 5 | Train Loss:  3.734 | Val Loss:  4.244 | Train Perplexity:  41.842 | Val Perplexity:  69.686


  0%|          | 0/1250 [00:00<?, ?it/s]

Step: 6500 | Train Loss:  3.550 | Val Loss:  4.228 | Train Perplexity:  34.816 | Val Perplexity:  68.563
Step: 6750 | Train Loss:  3.532 | Val Loss:  4.216 | Train Perplexity:  34.198 | Val Perplexity:  67.761
Step: 7000 | Train Loss:  3.529 | Val Loss:  4.187 | Train Perplexity:  34.097 | Val Perplexity:  65.845
Step: 7250 | Train Loss:  3.525 | Val Loss:  4.192 | Train Perplexity:  33.937 | Val Perplexity:  66.170
Step: 7500 | Train Loss:  3.521 | Val Loss:  4.161 | Train Perplexity:  33.803 | Val Perplexity:  64.131
Epochs: 6 | Train Loss:  3.521 | Val Loss:  4.161 | Train Perplexity:  33.803 | Val Perplexity:  64.131


  0%|          | 0/1250 [00:00<?, ?it/s]

Step: 7750 | Train Loss:  3.324 | Val Loss:  4.159 | Train Perplexity:  27.783 | Val Perplexity:  63.996
Step: 8000 | Train Loss:  3.317 | Val Loss:  4.181 | Train Perplexity:  27.575 | Val Perplexity:  65.439
Step: 8250 | Train Loss:  3.321 | Val Loss:  4.162 | Train Perplexity:  27.677 | Val Perplexity:  64.181
Step: 8500 | Train Loss:  3.323 | Val Loss:  4.122 | Train Perplexity:  27.741 | Val Perplexity:  61.665
Step: 8750 | Train Loss:  3.322 | Val Loss:  4.096 | Train Perplexity:  27.723 | Val Perplexity:  60.108
Epochs: 7 | Train Loss:  3.322 | Val Loss:  4.096 | Train Perplexity:  27.723 | Val Perplexity:  60.108


  0%|          | 0/1250 [00:00<?, ?it/s]

Step: 9000 | Train Loss:  3.120 | Val Loss:  4.117 | Train Perplexity:  22.636 | Val Perplexity:  61.404
Step: 9250 | Train Loss:  3.138 | Val Loss:  4.096 | Train Perplexity:  23.059 | Val Perplexity:  60.102
Step: 9500 | Train Loss:  3.142 | Val Loss:  4.118 | Train Perplexity:  23.141 | Val Perplexity:  61.424
Step: 9750 | Train Loss:  3.147 | Val Loss:  4.092 | Train Perplexity:  23.263 | Val Perplexity:  59.844
Step: 10000 | Train Loss:  3.146 | Val Loss:  4.068 | Train Perplexity:  23.232 | Val Perplexity:  58.453
Epochs: 8 | Train Loss:  3.146 | Val Loss:  4.068 | Train Perplexity:  23.232 | Val Perplexity:  58.453


  0%|          | 0/1250 [00:00<?, ?it/s]

Step: 10250 | Train Loss:  2.976 | Val Loss:  4.090 | Train Perplexity:  19.600 | Val Perplexity:  59.752
Step: 10500 | Train Loss:  2.977 | Val Loss:  4.085 | Train Perplexity:  19.632 | Val Perplexity:  59.427
Step: 10750 | Train Loss:  2.988 | Val Loss:  4.053 | Train Perplexity:  19.838 | Val Perplexity:  57.592
Step: 11000 | Train Loss:  2.990 | Val Loss:  4.055 | Train Perplexity:  19.890 | Val Perplexity:  57.670
Step: 11250 | Train Loss:  2.992 | Val Loss:  4.060 | Train Perplexity:  19.922 | Val Perplexity:  57.992
Epochs: 9 | Train Loss:  2.992 | Val Loss:  4.060 | Train Perplexity:  19.922 | Val Perplexity:  57.992


  0%|          | 0/1250 [00:00<?, ?it/s]

Step: 11500 | Train Loss:  2.800 | Val Loss:  4.054 | Train Perplexity:  16.444 | Val Perplexity:  57.607
Step: 11750 | Train Loss:  2.817 | Val Loss:  4.074 | Train Perplexity:  16.733 | Val Perplexity:  58.799
Step: 12000 | Train Loss:  2.828 | Val Loss:  4.056 | Train Perplexity:  16.911 | Val Perplexity:  57.734
Step: 12250 | Train Loss:  2.838 | Val Loss:  4.036 | Train Perplexity:  17.077 | Val Perplexity:  56.599
Step: 12500 | Train Loss:  2.849 | Val Loss:  4.023 | Train Perplexity:  17.263 | Val Perplexity:  55.870
Epochs: 10 | Train Loss:  2.849 | Val Loss:  4.023 | Train Perplexity:  17.263 | Val Perplexity:  55.870


Ok now let's test it by generating a translation from english to portuguese, by building a function that receives a sentence in english and outputs the predicted translation in portuguese.

In [38]:
def translate(text,max_len = 30):
    tokens = word_tokenize(text.lower(), language='english')
    tokens = ['<sos>'] + tokens + ['<eos>']
    tokens_ids = [tokeng2id[t] for t in tokens]
    tokens_tensor = torch.LongTensor(tokens_ids).unsqueeze(0).to(device)
    target_tensor = torch.LongTensor([1]*max_len).unsqueeze(0).to(device) # We just need a tensor with defined max length. Except the first token, the other tokens are not important, since we will not use teacher forcing.
    model.eval()
    with torch.no_grad():
        predictions = model(tokens_tensor,target_tensor,teacher_forcing_ratio=0)
        for i in range(1,max_len):
            predicted_id = predictions[0,i,:].argmax(dim=-1).item()
            if predicted_id == 2: # <eos>   
                break
            print(id2tokpt[predicted_id],end=" ")

In [39]:
translate("Hello, how are you?")

oi , você está você ? 

In [40]:
translate("Hello, what is your name?")

oi , o que o nome ? ? 

In [70]:
main_data['train'][21]

{'translation': {'english': 'I will be back soon.', 'portuguese': 'Volto já.'},
 'id': '21'}

In [72]:
translate("I will be back soon.")

voltarei logo logo . 

We can see that the model is already getting some words right, but it is still far from perfect. 

We must note however, that this is a simple model with a single layer, with just 8 million parameters , and trained in a very small dataset. For comparision, the original [paper](https://arxiv.org/pdf/1409.3215.pdf) that proposed this model used a 4 layer LSTM with 1000 hidden units in each layer, with  384 million parameters, and trained in a dataset with 12 million sentences.

In [1]:
import torch.nn as nn

In [3]:
model = nn.GRU((32 * 2) + 100, 64)

In [5]:
model.num_layers

1