In [1]:
!pip install nltk


In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
from tqdm.notebook import tqdm

# Sequence to Sequence model with a GRU encoder and decoder

In this notebook we are going to implement a sequence-to-sequence model with a GRU encoder and decoder, together with an attention mechanism.

### Encoder
As before, the encoder will take a source sentence as input and will encode it into a single vector (also known as a context vector or a latent vector) which will be passed to the decoder. The decoder will then use this context vector to generate a new sequence (in our case, a translation of the source sentence). This time, however, we also keep the hidden states from every encoder step, because the attention mechanism uses them to calculate the attention weights.

So, given a sequential input sentence $X = \{x_1, x_2, ..., x_T\}$, we want to encode it into a single vector $z$. At each step we have the hidden state $h_t$:
<br>
<br>
$$h_t = \text{Encoder}(e(x_t), h_{t-1}),$$
<br>
where $e(x_t)$ is the embedding of the current token $x_t$. This means that in practice, the $z$ vector will actually be $h_T$, the last hidden state of the RNN.


### Attention 

The attention is calculated from the hidden states of the encoder and the previous hidden state of the decoder. At decoder step $t$, we take the dot product between the previous decoder hidden state $s_{t-1}$ and each encoder hidden state $h_i$, and apply a softmax over the encoder positions to normalise the weights:
<br>
<br>
$$\alpha_{t,i} = \frac{\exp(s_{t-1}^\top h_i)}{\sum_{j=1}^T \exp(s_{t-1}^\top h_j)},$$

where $\alpha_{t,i}$ is the attention weight given to the encoder hidden state $h_i$ at decoder step $t$. We will use the attention weights to calculate the context vector $c_t$:
<br>
<br>
$$c_t = \sum_{i=1}^T \alpha_{t,i} h_i,$$

where $T$ is the length of the input sequence. The context vector $c_t$ is concatenated with the embedding of the previous token $e(y_{t-1})$ and passed to the decoder RNN, together with the previous hidden state $s_{t-1}$, to compute the next hidden state $s_t$:
<br>
<br>
$$s_t = \text{Decoder}(e(y_{t-1}), s_{t-1}, c_t).$$

Finally, we use the hidden state $s_t$ to predict the next token $\hat{y}_t$. We do this until we predict an end-of-sentence token, or we reach a maximum length of the sequence.
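
As a small worked example of the attention equations above, the sketch below computes the weights $\alpha_{t,i}$ and the context vector $c_t$ in plain Python for three made-up encoder states and one made-up decoder state. All numbers and helper names (`dot`, `softmax`, `attention`) are illustrative assumptions and are not part of the model implemented later in the notebook.

```python
import math

def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(s_prev, encoder_states):
    """Dot-product attention: returns (weights, context vector)."""
    scores = [dot(s_prev, h) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # weighted sum of the encoder states, one coordinate at a time
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)]
    return weights, context

# toy values (assumptions): three encoder states h_1..h_3 and one decoder state s_{t-1}
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_prev = [2.0, 0.0]

weights, context = attention(s_prev, encoder_states)
print(weights)
print(context)
```

The first and third encoder states have the same dot product with `s_prev`, so they receive the same weight, and the second state, which is orthogonal to `s_prev`, receives the smallest one.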

<br>
<br>

# Data

We will use the dataset named "VanessaSchenkel/translation-en-pt", available in HuggingFace's datasets library. This dataset contains pairs of sentences in English and Portuguese. We will use this dataset to train our model to translate from English to Portuguese.

In [4]:
from datasets import load_dataset

main_data = load_dataset("VanessaSchenkel/translation-en-pt", field="data")


In [5]:
main_data

DatasetDict({
    train: Dataset({
        features: ['id', 'translation'],
        num_rows: 260482
    })
})

The dataset contains only a train split, so we will split it into train and validation sets. We will use 80% of the data for training and 20% for validation. 

As before, we first need to pre-process the data: tokenize all examples, encode them into integers, and create dataloaders to iterate over the batches.

We will use the same tokenizer as before, but we will need to add the special tokens "\<eos\>" (end of sentence) and "\<sos\>" (start of sentence) to the vocabulary. 

We start by building the two vocabularies, one for the source language (English) and one for the target language (Portuguese).

Note: we keep only sentence pairs with at most 15 tokens on each side, and we limit the total number of examples to 50000 to speed up training.

In [6]:
# For this example we keep punctuation but lowercase the text, so we can use nltk's word_tokenize function directly
# Also, we will not remove stopwords or rare words, since they can be important for the translation
from nltk.tokenize import word_tokenize

# note: we skip sentence pairs with more than 15 tokens on either side; this rule also covers very long sentences such as the example with id 199351
english_tokens = []
portuguese_tokens = []
for d in main_data['train']:
    eng_tokens = word_tokenize(d['translation']['english'].lower())
    pt_tokens = word_tokenize(d['translation']['portuguese'].lower())
    if len(eng_tokens) > 15 or len(pt_tokens) > 15:
        continue
    english_tokens.append(eng_tokens)
    portuguese_tokens.append(pt_tokens)
    if len(english_tokens) == 50000:
        break

# The filter above caps the length at 15, but let's compute the actual maximum length of the sentences. We will use this to pad the sentences
max_len_english = max([len(s) for s in english_tokens])
max_len_portuguese = max([len(s) for s in portuguese_tokens])

# Ok, now we can get the unique tokens for each language
unique_english_tokens = sorted(list(set([tk for s in english_tokens for tk in s])))
unique_portuguese_tokens = sorted(list(set([tk for s in portuguese_tokens for tk in s])))

print("English vocabulary size: ", len(unique_english_tokens))
print("Portuguese vocabulary size: ", len(unique_portuguese_tokens))
print("")
print("Maximum length of English sentences: ", max_len_english)
print("Maximum length of Portuguese sentences: ", max_len_portuguese)

English vocabulary size:  11461
Portuguese vocabulary size:  17554

Maximum length of English sentences:  15
Maximum length of Portuguese sentences:  15


In [7]:
len(english_tokens)

50000

In [8]:
english_tokens[0]

['let', "'s", 'try', 'something', '.']

In [9]:
unique_english_tokens = ['<pad>','<sos>', '<eos>'] + unique_english_tokens
tokeng2id = {t: i for i, t in enumerate(unique_english_tokens)}
id2tokeng = {i: t for t, i in tokeng2id.items()}

unique_portuguese_tokens = ['<pad>','<sos>', '<eos>'] + unique_portuguese_tokens
tokpt2id = {t: i for i, t in enumerate(unique_portuguese_tokens)}
id2tokpt = {i: t for t, i in tokpt2id.items()}

In [10]:
print(tokeng2id["<pad>"])
print(tokeng2id["<sos>"])
print(tokeng2id["<eos>"])

0
1
2


To keep things simple, we add the special tokens manually using their ids: 1 for \<sos\> at the start and 2 for \<eos\> at the end of each sentence.

In [11]:
english_tokens_ids = [[1]+[tokeng2id[t] for t in s]+[2] for s in english_tokens]
portuguese_tokens_ids = [[1]+[tokpt2id[t] for t in s]+[2] for s in portuguese_tokens]

We will pad each sentence to a fixed length of 17 tokens: the maximum sentence length of 15 plus the \<sos\> and \<eos\> tokens. Shorter sentences are filled with the \<pad\> token (id 0).

We will pad the English sentences on the left because we want the final token to be the \<eos\>, and we will pad the Portuguese sentences on the right because we want the first token to be the \<sos\>.

In [12]:
def pad_sequence(seq, max_length = 500, pad_direction = 'left'):
    if pad_direction == 'left':
        return seq[:max_length] if len(seq) > max_length else [0] * (max_length - len(seq)) + seq
    elif pad_direction == 'right':
        return seq[:max_length] if len(seq) > max_length else seq + [0] * (max_length - len(seq))
    else:
        raise ValueError("pad_direction must be either 'left' or 'right'")


english_tokens_ids = [pad_sequence(seq, max_length=17,pad_direction='left') for seq in english_tokens_ids]
portuguese_tokens_ids = [pad_sequence(seq, max_length=17,pad_direction='right') for seq in portuguese_tokens_ids]

In [13]:
print(english_tokens_ids[0])
print(portuguese_tokens_ids[0])

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 6057, 24, 10606, 9529, 29, 2]
[1, 16811, 16005, 1005, 3647, 3, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


# Dataset and DataLoader

Ok, we are now ready to create the dataset and the dataloaders. We will use the same batch size as before (32).
Before that, we need to split the data into train and validation sets. We will use 80% of the data for training and 20% for validation.

In [14]:
len(english_tokens_ids)

50000

In [15]:
# We assume the data is already randomly shuffled
train_size = int(len(english_tokens_ids) * 0.8)

train_en = english_tokens_ids[:train_size]
train_pt = portuguese_tokens_ids[:train_size]

val_en = english_tokens_ids[train_size:]
val_pt = portuguese_tokens_ids[train_size:]

print("Size of training set: ", len(train_en))
print("Size of validation set: ", len(val_en))

Size of training set:  40000
Size of validation set:  10000
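
The split above relies on the examples already being in random order. If that assumption is in doubt, one option is to shuffle the English and Portuguese sequences jointly before slicing. The sketch below uses toy lists; `shuffle_pairs` and the seed value are hypothetical choices and are not part of the original notebook.

```python
import random

def shuffle_pairs(src, tgt, seed=0):
    """Shuffle two parallel lists with the same permutation so pairs stay aligned."""
    assert len(src) == len(tgt)
    idx = list(range(len(src)))
    random.Random(seed).shuffle(idx)
    return [src[i] for i in idx], [tgt[i] for i in idx]

# toy stand-ins for english_tokens_ids and portuguese_tokens_ids
toy_en = [[1, 10, 2], [1, 11, 2], [1, 12, 2], [1, 13, 2], [1, 14, 2]]
toy_pt = [[1, 20, 2], [1, 21, 2], [1, 22, 2], [1, 23, 2], [1, 24, 2]]

shuf_en, shuf_pt = shuffle_pairs(toy_en, toy_pt, seed=0)

# same 80/20 split as in the notebook
split = int(len(shuf_en) * 0.8)
toy_train_en, toy_val_en = shuf_en[:split], shuf_en[split:]
toy_train_pt, toy_val_pt = shuf_pt[:split], shuf_pt[split:]
print(len(toy_train_en), len(toy_val_en))
```

Using a single index permutation for both lists is what keeps each English sentence paired with its Portuguese translation.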


Once again we can create the Dataloader with the help of the Dataset class.

In [16]:
from torch.utils.data import DataLoader
from datasets import Dataset

list_data = [{'english':train_en[i],'portuguese':train_pt[i]} for i in range(len(train_en))]
train_dataset = Dataset.from_list(list_data)
train_dataset = train_dataset.with_format("torch")

batch_size = 32 # number of sequences in each batch
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size) # train_dataloader is an iterator that returns a batch each time it is called

list_data = [{'english':val_en[i],'portuguese':val_pt[i]} for i in range(len(val_en))]
val_dataset = Dataset.from_list(list_data)
val_dataset = val_dataset.with_format("torch")

val_dataloader = DataLoader(val_dataset, shuffle=True, batch_size=batch_size) # val_dataloader is an iterator that returns a batch each time it is called

# Encoder, Decoder 

We are now ready to implement the encoder, decoder and seq2seq models. We will use a GRU for both the encoder and the decoder.

Let's start with the encoder:

In [17]:
import torch
import torch.nn as nn
torch.manual_seed(0)

<torch._C.Generator at 0x7f5199ba1650>

In [23]:
# The main class of the RNN built from the nn.Module class
class Encoder(nn.Module):

    def __init__(self, vocab_size, emb_d,hidden_d,n_layers, drop_prob = 0.2):
        """
        Initialize the encoder module

        Arguments:
        vocab_size: size of the source vocabulary
        emb_d: size of the embedding layer
        hidden_d: size of the hidden layer
        n_layers: number of GRU layers
        drop_prob: dropout probability
        """
    
        super().__init__()

        # define the embedding layer
        self.embedding = nn.Embedding(vocab_size, emb_d)

        # define a RNN layer
        self.rnn = nn.GRU(emb_d, hidden_d, n_layers, dropout = drop_prob, batch_first=True) # batch_first=True means that the first dimension of the input and output will be the batch_size

        # define a dropout layer
        self.dropout = nn.Dropout(drop_prob)

    def forward(self, x):
        """
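        Perform a forward pass of the encoder on a batch of source sequences.

        Arguments:
        x: token ids of the source sequences, shape (batch_size, seq_length)

        Returns:
        out: hidden states of the last layer at every step, shape (batch_size, seq_length, hidden_d)
        hidden: final hidden state of the last layer, shape (1, batch_size, hidden_d)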
        """

        # get the embedding vectors from lookup embedding layer 
        embeds = self.embedding(x) # shape: (batch_size, seq_length, emb_d)

        # pass the embedding vectors to the GRU layer. We get the output at every step and the final hidden state, which initializes the decoder
        # shape of out: (batch_size, seq_length, hidden_d)
        # shape of hidden: (n_layers, batch_size, hidden_d)
        out, hidden = self.rnn(embeds) 
        out = self.dropout(out)
        return out, hidden[-1:,:,:] # we return only the hidden state of the last layer

In [24]:
vocab_size = len(tokeng2id)
embedding_dim = 50
hidden_dim = 8
n_layers = 3

model_enc = Encoder(vocab_size, embedding_dim, hidden_dim, n_layers)
model_enc

Encoder(
  (embedding): Embedding(11464, 50)
  (rnn): GRU(50, 8, num_layers=3, batch_first=True, dropout=0.2)
  (dropout): Dropout(p=0.2, inplace=False)
)

In [25]:
print("Tokens:",english_tokens[0])
token_ids = [tokeng2id[tk] for tk in english_tokens[0]]
print("Tokens IDS:",token_ids)

Tokens: ['let', "'s", 'try', 'something', '.']
Tokens IDS: [6057, 24, 10606, 9529, 29]


In [26]:
print("Tokens:",english_tokens[2])
token_ids_2 = [tokeng2id[tk] for tk in english_tokens[2]][:5]
print("Tokens IDS:",token_ids_2)

Tokens: ['i', 'have', 'to', 'go', 'to', 'sleep', '.']
Tokens IDS: [5193, 4898, 10402, 4589, 10402]


In [27]:
model_enc.eval()
output_enc,hidden_z = model_enc(torch.IntTensor([token_ids,token_ids_2])) # We pass two sentences to simulate a batch of size 2
print("Output shape:",output_enc.shape)
print("Hidden shape:",hidden_z.shape)

Output shape: torch.Size([2, 5, 8])
Hidden shape: torch.Size([1, 2, 8])


In [31]:
# Attention class
# calculates the attention weights and the context vector for each time step of the decoder, using the dot product of the last hidden state
class Attention(nn.Module):
    def __init__(self, hidden_d):
        super().__init__()
        self.hidden_d = hidden_d

    def forward(self, hidden, encoder_outputs):
        # permute the hidden dimensions to (batch_size, hidden_d, n_layers)
        hidden = hidden.permute(1,2,0) 

        # dot product between encoder outputs and hidden state
        # attention_weights shape: (batch_size, seq_length, 1)
        attention_weights = torch.bmm(encoder_outputs,hidden)

        # softmax to get the attention weights:
        attention_weights = nn.functional.softmax(attention_weights, dim=1)

        # permute the attention weights to (batch_size, 1, seq_length); the batched matrix product
        # with the encoder outputs then performs the weighted sum over the seq_length axis
        # context shape: (batch_size, 1, hidden_d), which matches the embedded decoder input
        context = torch.bmm(attention_weights.permute(0,2,1),encoder_outputs)

        return context, attention_weights
        

In [37]:
# Now we can define the decoder class
class Decoder(nn.Module):

    def __init__(self, vocab_size, emb_d,hidden_d, drop_prob = 0.2):
        """
        Initialize the decoder module

        Arguments:
        vocab_size: size of the target vocabulary
        emb_d: size of the embedding layer
        hidden_d: size of the hidden layer
        drop_prob: dropout probability
        """
    
        super().__init__()

        # define the embedding layer
        self.embedding = nn.Embedding(vocab_size, emb_d)

        # define a RNN layer
        # To make things more simple, we will use only one layer
        self.rnn = nn.GRU(emb_d + hidden_d, hidden_d, dropout = drop_prob, batch_first=True) # batch_first=True means that the first dimension of the input and output will be the batch_size

        # define a dropout layer
        self.dropout = nn.Dropout(drop_prob)

        # define the output layer
        self.fc = nn.Linear(hidden_d, vocab_size)

        # define the attention layer
        self.attention = Attention(hidden_d)

    def forward(self, x, hidden, encoder_outputs):

        """
        Forward propagate through the RNN module

        Arguments:
        x: input to the RNN
        hidden: hidden state
        encoder_outputs: output of encoder
        """
        # pass the input through the embedding layer
        x = self.embedding(x)

        # apply dropout
        x = self.dropout(x)

        # get the attention weights and context
        # context shape: (batch_size, 1, hidden_d)
        # attention_weights shape: (batch_size, seq_length, 1)
        context, attention_weights = self.attention(hidden,encoder_outputs)
        
        # concatenate the context and the embedded input
        # rnn_input shape: (batch_size, 1, emb_d + hidden_d)    
        rnn_input = torch.cat((x, context), dim=2)
        
        # pass the input and hidden state to the rnn
        # output shape: (batch_size, 1, hidden_d)
        # hidden shape: (1, batch_size, hidden_d), since the decoder has a single layer
        output, hidden = self.rnn(rnn_input, hidden)
        
        # pass the output through the output layer
        # output shape: (batch_size, 1, vocab_size)
        output = self.fc(output)

        return output, hidden, attention_weights    

Ok, let's check if the decoder is working, using the encoder outputs and the hidden state we got from the encoder.

In [38]:
model_dec = Decoder(vocab_size, embedding_dim, hidden_dim)
model_dec



Decoder(
  (embedding): Embedding(11464, 50)
  (rnn): GRU(58, 8, batch_first=True, dropout=0.2)
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=8, out_features=11464, bias=True)
  (attention): Attention()
)

Let's test a forward pass for a single decoding step, using the encoder outputs computed above for the batch of two English examples:

In [40]:
new_out,new_h,att_w = model_dec(torch.IntTensor([token_ids[-1:],token_ids[-1:]]),hidden_z,output_enc) # Simulate the input of the last token of the sentence
print("Output shape:",new_out.shape)
print("Hidden shape:",new_h.shape)
print("Attention weights shape:",att_w.shape)

Output shape: torch.Size([2, 1, 11464])
Hidden shape: torch.Size([1, 2, 8])
Attention weights shape: torch.Size([2, 5, 1])


# Seq2Seq model

We are now ready to implement the seq2seq model. The seq2seq model will take as input the source sequence and will output the predicted target sequence. 

Since we want to train with teacher forcing, we will pass the target sequence to the decoder. Teacher forcing is a technique where the true target token, rather than the model's own prediction, is passed with some probability as the next input to the decoder. The intuition is that early in training the predictions are often wrong, so feeding the true token keeps the decoder from compounding its own mistakes.

We will use the Encoder and Decoder classes we implemented before. The encoder takes the source sequence as input and outputs its hidden states at every step, together with the last hidden state. Then we use the Decoder class to iterate over the target sequence. In each iteration we predict the next token only, based on the previous hidden state, the encoder outputs, and the previous true or predicted token, depending on the teacher forcing probability.

The main steps we need to implement are:

1. Pass the source sequence to the encoder and get the encoder outputs and the last hidden state.
2. Initialize the decoder with the last encoder hidden state and the \<sos\> token.
3. Predict the next token and the new hidden state.
4. Repeat 3. with the true or predicted token, depending on the teacher forcing probability, and the new hidden state.
5. Stop when we reach the maximum length of the sequence or when we predict the \<eos\> token.
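
The steps above can be sketched without any neural network. In the toy loop below, `toy_step` is a hypothetical stand-in for one decoder call (it simply replays a fixed list of token ids), and the ids 1 and 2 for \<sos\> and \<eos\> follow the vocabulary built earlier. The sketch only illustrates the control flow: it handles a single sequence and stops as soon as \<eos\> is predicted, which a batched forward pass used for training does not have to do.

```python
import random

SOS, EOS = 1, 2

def toy_step(prev_token, hidden):
    """Hypothetical decoder step: returns (predicted_token, new_hidden).

    The "hidden state" is just a step counter and prev_token is ignored.
    """
    script = [7, 8, 9, EOS]  # tokens the toy "model" emits, in order
    return script[min(hidden, len(script) - 1)], hidden + 1

def decode(target=None, teacher_forcing_ratio=0.5, max_len=10, rng=None):
    rng = rng or random.Random(0)
    hidden = 0               # steps 1-2: state handed over by the "encoder"
    x = SOS                  # step 2: the first decoder input is <sos>
    predictions = []
    for t in range(1, max_len):
        pred, hidden = toy_step(x, hidden)   # step 3: predict token and new state
        predictions.append(pred)
        if pred == EOS:                      # step 5: stop at <eos> (or at max_len)
            break
        # step 4: feed the true token with probability teacher_forcing_ratio
        use_truth = (target is not None and t < len(target)
                     and rng.random() < teacher_forcing_ratio)
        x = target[t] if use_truth else pred
    return predictions

print(decode())
print(decode(target=[SOS, 7, 8, 9, EOS], teacher_forcing_ratio=1.0))
```

Because the toy step ignores its input token, both calls produce the same sequence; with a real decoder, the choice between the true and the predicted token changes what is predicted next.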

In [41]:
import random
random.seed(42)

In [42]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, source, target, teacher_forcing_ratio = 0.5):
        batch_size = source.shape[0]
        target_len = target.shape[1]
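        target_vocab_size = self.decoder.fc.out_features

        # allocate the outputs on the same device as the inputs
        outputs = torch.zeros(batch_size, target_len, target_vocab_size, device=source.device)

        # the encoder returns its outputs at every step and the last hidden state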
        encoder_outputs, hidden = self.encoder(source)
        # first input to the decoder is the <sos> token
        # shape of x: (batch_size, 1)
        x = target[:,:1]

        for t in range(1, target_len):

            output, hidden, att_w = self.decoder(x, hidden, encoder_outputs)

            outputs[:,t:t+1,:] = output

            best_guess = output.argmax(dim = -1)

            x = target[:,t:t+1] if random.random() < teacher_forcing_ratio else best_guess

        return outputs,att_w

In [50]:
vocab_eng_size = len(tokeng2id)
vocab_pt_size = len(tokpt2id)
device = "cuda" if torch.cuda.is_available() else "cpu"

embedding_dim = 100
hidden_dim = 256
n_layers = 1

model_enc = Encoder(vocab_eng_size, embedding_dim, hidden_dim, n_layers)
model_dec = Decoder(vocab_pt_size, embedding_dim, hidden_dim)

model = Seq2Seq(model_enc, model_dec).to(device)
model



Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(11464, 100)
    (rnn): GRU(100, 256, batch_first=True, dropout=0.2)
    (dropout): Dropout(p=0.2, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(17557, 100)
    (rnn): GRU(356, 256, batch_first=True, dropout=0.2)
    (dropout): Dropout(p=0.2, inplace=False)
    (fc): Linear(in_features=256, out_features=17557, bias=True)
    (attention): Attention()
  )
)

In [51]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 8,160,745 trainable parameters
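
To see where these parameters come from, the count can be reproduced by hand from the layer sizes printed above. The sketch below assumes the usual PyTorch GRU layout (three gates, each with input weights, recurrent weights, and two bias vectors); `gru_params` is a helper introduced here for illustration. The `Attention` module has no trainable parameters, so it does not appear in the sum.

```python
def gru_params(input_size, hidden_size):
    """Parameters of a single-layer GRU: 3 gates x (W_ih, W_hh, b_ih, b_hh)."""
    return 3 * hidden_size * (input_size + hidden_size) + 2 * 3 * hidden_size

# sizes taken from the model summary above
src_vocab, tgt_vocab = 11464, 17557
emb_d, hidden_d = 100, 256

counts = {
    "encoder.embedding": src_vocab * emb_d,
    "encoder.rnn": gru_params(emb_d, hidden_d),
    "decoder.embedding": tgt_vocab * emb_d,
    "decoder.rnn": gru_params(emb_d + hidden_d, hidden_d),  # input = embedding + context
    "decoder.fc": hidden_d * tgt_vocab + tgt_vocab,         # weights + bias
}

for name, n in counts.items():
    print(f"{name}: {n:,}")
total = sum(counts.values())
print(f"total: {total:,}")
```

The total matches the 8,160,745 reported above, and more than half of it sits in the output layer `decoder.fc`, which projects the hidden state onto the Portuguese vocabulary.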


In [52]:
import math

# main training loop
n_epochs = 4
lr=1e-3
clip = 1
criterion = nn.CrossEntropyLoss(ignore_index = 0)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

step = 0
evaluation_step = 250
train_losses = []
val_losses = []


for epoch in range(n_epochs):
    loss_train_total = 0


    for i, batch in tqdm(enumerate(train_dataloader),total=len(train_dataloader)):
        step += 1

        source = batch['english'].to(device)
        target = batch['portuguese'].to(device)

        # zero the gradients
        optimizer.zero_grad()

        # forward pass
        output,_ = model(source, target)
        output = output.to(device)

        output = output[:,1:,:].reshape(-1, output.shape[2])
        target = target[:,1:].reshape(-1)

        # target = target.to("cpu")

        loss = criterion(output, target)
        loss_train_total += loss.item()
        # backward pass
        loss.backward()

        # clip the gradients to prevent exploding gradient problem. This step is very important.
        nn.utils.clip_grad_norm_(model.parameters(), clip)

        # update the parameters
        optimizer.step()

        # print(loss.item())

        # evaluation step
        if step % evaluation_step == 0:
            model.eval()
            with torch.no_grad():
                # evaluate on validation data
                loss_val = 0
                for j,batch in enumerate(val_dataloader):
                    source = batch['english'].to(device)
                    target = batch['portuguese'].to(device)

                    # forward pass
                    output,_ = model(source, target, teacher_forcing_ratio = 0) # we do not use teacher forcing here
                    output = output.to(device)

                    output = output[:,1:,:].reshape(-1, output.shape[2])
                    target = target[:,1:].reshape(-1)

                    loss = criterion(output, target)
                    loss_val += loss.item()


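            # Calculate the training perplexity from the running mean loss of the current epoch
            # (i is the zero-based batch index, so dividing by i slightly overestimates the mean; i + 1 would be exact)
            train_perplexity = math.exp(loss_train_total/i)


            # Calculate the validation perplexity (j is also zero-based, so the same caveat applies)
            val_perplexity = math.exp(loss_val / j)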


            # print the loss at each step 
            print(f'Step: {step} | Train Loss: {loss_train_total/i: .3f} | Val Loss: {loss_val/j: .3f} | Train Perplexity: {train_perplexity: .3f} | Val Perplexity: {val_perplexity: .3f}')

            model.train()
            
            train_losses.append(loss_train_total/i)
            val_losses.append(loss_val/j)

    
    # print the loss and ppl at each epoch
    print(f'Epochs: {epoch + 1} | Train Loss: {loss_train_total/i: .3f} | Val Loss: {loss_val/j: .3f} | Train Perplexity: {train_perplexity: .3f} | Val Perplexity: {val_perplexity: .3f}')



Step: 250 | Train Loss:  5.892 | Val Loss:  5.520 | Train Perplexity:  362.224 | Val Perplexity:  249.695
Step: 500 | Train Loss:  5.586 | Val Loss:  5.232 | Train Perplexity:  266.778 | Val Perplexity:  187.167
Step: 750 | Train Loss:  5.367 | Val Loss:  4.986 | Train Perplexity:  214.232 | Val Perplexity:  146.300
Step: 1000 | Train Loss:  5.190 | Val Loss:  4.819 | Train Perplexity:  179.392 | Val Perplexity:  123.885
Step: 1250 | Train Loss:  5.038 | Val Loss:  4.649 | Train Perplexity:  154.183 | Val Perplexity:  104.434
Epochs: 1 | Train Loss:  5.038 | Val Loss:  4.649 | Train Perplexity:  154.183 | Val Perplexity:  104.434



Step: 1500 | Train Loss:  4.086 | Val Loss:  4.524 | Train Perplexity:  59.500 | Val Perplexity:  92.188
Step: 1750 | Train Loss:  4.014 | Val Loss:  4.436 | Train Perplexity:  55.364 | Val Perplexity:  84.473
Step: 2000 | Train Loss:  3.942 | Val Loss:  4.349 | Train Perplexity:  51.497 | Val Perplexity:  77.415
Step: 2250 | Train Loss:  3.887 | Val Loss:  4.267 | Train Perplexity:  48.763 | Val Perplexity:  71.274
Step: 2500 | Train Loss:  3.832 | Val Loss:  4.224 | Train Perplexity:  46.168 | Val Perplexity:  68.272
Epochs: 2 | Train Loss:  3.832 | Val Loss:  4.224 | Train Perplexity:  46.168 | Val Perplexity:  68.272



Step: 2750 | Train Loss:  3.222 | Val Loss:  4.175 | Train Perplexity:  25.083 | Val Perplexity:  65.019
Step: 3000 | Train Loss:  3.196 | Val Loss:  4.146 | Train Perplexity:  24.442 | Val Perplexity:  63.182
Step: 3250 | Train Loss:  3.193 | Val Loss:  4.094 | Train Perplexity:  24.359 | Val Perplexity:  59.957
Step: 3500 | Train Loss:  3.172 | Val Loss:  4.078 | Train Perplexity:  23.849 | Val Perplexity:  59.052
Step: 3750 | Train Loss:  3.152 | Val Loss:  4.024 | Train Perplexity:  23.391 | Val Perplexity:  55.918
Epochs: 3 | Train Loss:  3.152 | Val Loss:  4.024 | Train Perplexity:  23.391 | Val Perplexity:  55.918



Step: 4000 | Train Loss:  2.680 | Val Loss:  4.010 | Train Perplexity:  14.583 | Val Perplexity:  55.131
Step: 4250 | Train Loss:  2.684 | Val Loss:  4.012 | Train Perplexity:  14.642 | Val Perplexity:  55.271
Step: 4500 | Train Loss:  2.679 | Val Loss:  3.998 | Train Perplexity:  14.571 | Val Perplexity:  54.491
Step: 4750 | Train Loss:  2.672 | Val Loss:  3.999 | Train Perplexity:  14.474 | Val Perplexity:  54.545
Step: 5000 | Train Loss:  2.674 | Val Loss:  3.932 | Train Perplexity:  14.501 | Val Perplexity:  51.010
Epochs: 4 | Train Loss:  2.674 | Val Loss:  3.932 | Train Perplexity:  14.501 | Val Perplexity:  51.010
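
The perplexity values in the log are simply the exponential of the mean cross-entropy loss. As a quick check, the sketch below recomputes them from the rounded losses of the last log line; small differences from the logged perplexities are expected because the log uses the unrounded losses.

```python
import math

# rounded values copied from the last log line above
final_train_loss = 2.674
final_val_loss = 3.932

# perplexity = exp(mean cross-entropy loss)
train_ppl = math.exp(final_train_loss)
val_ppl = math.exp(final_val_loss)
print(f"train perplexity ~ {train_ppl:.1f}, val perplexity ~ {val_ppl:.1f}")
```

Keep in mind that the training loss is computed with a teacher forcing ratio of 0.5 while the validation loss is computed without teacher forcing, so the two columns are not measured under the same conditions.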


Ok, now let's test it by generating a translation from English to Portuguese. We build a function that receives a sentence in English and prints the predicted translation in Portuguese.

In [53]:
def translate(text,max_len = 30):
    tokens = word_tokenize(text.lower(), language='english')
    tokens = ['<sos>'] + tokens + ['<eos>']
    tokens_ids = [tokeng2id[t] for t in tokens]
    tokens_tensor = torch.LongTensor(tokens_ids).unsqueeze(0).to(device)
    target_tensor = torch.LongTensor([1]*max_len).unsqueeze(0).to(device) # We just need a tensor with defined max length. Except the first token, the other tokens are not important, since we will not use teacher forcing.
    model.eval()
    with torch.no_grad():
        predictions,_ = model(tokens_tensor,target_tensor,teacher_forcing_ratio=0)
        for i in range(1,max_len):
            predicted_id = predictions[0,i,:].argmax(dim=-1).item()
            if predicted_id == 2: # <eos>   
                break
            print(id2tokpt[predicted_id],end=" ")

In [54]:
translate("Hello, where do you live?")

oi , onde você , onde você ? ? 

In [55]:
translate("Hello, what is your name?")

oi , o que é o seu nome é ? 
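
Note that `translate` looks tokens up directly in `tokeng2id`, so a word that never appeared in the training data raises a `KeyError`. One way to get a clearer message is to check the tokens before translating. The sketch below uses a toy vocabulary and whitespace splitting instead of `word_tokenize`; `find_unknown_tokens` is a hypothetical helper and is not part of the notebook.

```python
def find_unknown_tokens(tokens, vocab):
    """Return the tokens that are missing from the vocabulary, in order."""
    return [t for t in tokens if t not in vocab]

# toy stand-in for tokeng2id
toy_vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "hello": 3, ",": 4, "where": 5, "?": 6}

known = "hello , where ?".split()
unknown = "hello , whereabouts ?".split()

print(find_unknown_tokens(known, toy_vocab))
print(find_unknown_tokens(unknown, toy_vocab))
```

With the real vocabulary, the same check could run on the output of `word_tokenize` before the ids are built, so that the caller learns which words are out of vocabulary.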

In [56]:
def get_att_w(text,max_len = 30):
    tokens = word_tokenize(text.lower(), language='english')
    tokens = ['<sos>'] + tokens + ['<eos>']
    print(tokens)
    tokens_ids = [tokeng2id[t] for t in tokens]
    tokens_tensor = torch.LongTensor(tokens_ids).unsqueeze(0).to(device)
    target_tensor = torch.LongTensor([1]*max_len).unsqueeze(0).to(device) # We just need a tensor with defined max length. Except the first token, the other tokens are not important, since we will not use teacher forcing.
    model.eval()
    with torch.no_grad():
        predictions,att_w = model(tokens_tensor,target_tensor,teacher_forcing_ratio=0)
        # att_w holds the attention weights of the last decoding step only, shape (1, source_length, 1)
        return att_w

In [57]:
get_att_w("Hello, what is your name?")

['<sos>', 'hello', ',', 'what', 'is', 'your', 'name', '?', '<eos>']


tensor([[[3.9325e-06],
         [2.1797e-11],
         [4.5073e-13],
         [3.6881e-10],
         [1.0367e-08],
         [1.8424e-08],
         [1.6288e-05],
         [9.2460e-02],
         [9.0752e-01]]], device='cuda:0')
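
To make the tensor above easier to read, the weights can be paired with the source tokens. The sketch below copies the printed values by hand, so it runs without the trained model. Since `Seq2Seq.forward` returns only the attention weights of its last loop iteration, these numbers describe the final decoding step rather than the whole translation.

```python
# source tokens and weights copied from the output above
source_tokens = ['<sos>', 'hello', ',', 'what', 'is', 'your', 'name', '?', '<eos>']
weights = [3.9325e-06, 2.1797e-11, 4.5073e-13, 3.6881e-10, 1.0367e-08,
           1.8424e-08, 1.6288e-05, 9.2460e-02, 9.0752e-01]

# show the three source tokens with the largest attention weight
ranked = sorted(zip(source_tokens, weights), key=lambda p: p[1], reverse=True)
for token, w in ranked[:3]:
    print(f"{token:>6}  {w:.4f}")

top_token = ranked[0][0]
```

At this step almost all of the attention mass falls on the last two source positions, `?` and `<eos>`.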