<a href="https://colab.research.google.com/github/rastringer/code_first_ml/blob/main/transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install torch transformers datasets

# Transformers

<img src="https://github.com/rastringer/code_first_ml/blob/main/images/transformer_architecture.png?raw=true" width="500"/>

## Word vectors

### Tokenizers

Since computers don't understand words, we use tokenizers to convert words into numbers. There are different ways to achieve this, including word-based, character-based and subword tokenization.
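As a rough illustration of the three approaches, the sketch below splits the same text by word, by character and by subword. The vocabulary `toy_vocab` and the helper `subword_tokenize` are hypothetical, made up for this example; real tokenizers learn their vocabulary from data.

```python
# Toy comparison of tokenization granularities (illustrative only).
sentence = "learning transformers"

# Word-based: split on whitespace
word_tokens = sentence.split()

# Character-based: every character is a token
char_tokens = list(sentence.replace(" ", ""))

# Subword: greedy longest-match against a small, hypothetical vocabulary.
# Pieces that continue a word are prefixed with "##".
toy_vocab = {"learn", "##ing", "transform", "##ers"}

def subword_tokenize(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            # No piece matched: fall back to a single unknown token
            return ["[UNK]"]
        start = end
    return pieces

subword_tokens = [p for w in word_tokens for p in subword_tokenize(w, toy_vocab)]

print(word_tokens)
print(char_tokens[:8])
print(subword_tokens)
```

Subword tokenization sits between the other two: the vocabulary stays small, yet common word fragments keep their own token.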



One of the best and most accessible libraries for tokenization is from [Hugging Face](https://huggingface.co).

To run the following cells, you will need a free account.

### Encoding

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

sequence = "Learning transformers is easy"
tokens = tokenizer.tokenize(sequence)

print(f"Tokens = {tokens}\n")

ids = tokenizer.convert_tokens_to_ids(tokens)

print(f"ids={ids}")


### Decoding

In [None]:

decoded_string = tokenizer.decode([9681, 11303, 1468, 1110, 3123])
print(decoded_string)

In [None]:
import spacy

# Load the pre-trained English model
nlp = spacy.load("en_core_web_sm")

# Example sentence
sentence = "Word vectors are awesome!"

# Process the sentence using spaCy
doc = nlp(sentence)

# Access the word vectors for each token in the sentence
for token in doc:
    print(f"{token.text}: {token.vector[:5]}...")  # Displaying the first 5 components of the vector


<img src="https://github.com/rastringer/code_first_ml/blob/main/images/word_vectors.png?raw=true" width="500"/>

### The difficulties of word embeddings

Word vectors and embeddings are very useful. However, the abstract nature of language can cause problems when assigning numerical values to words. For example, in the following sentences, "trainers" has a different meaning depending on the context.

*"Mustafa loved running in his new trainers"*

*"Svitlana said the gym had the best trainers around"*

Linguists call words that share a spelling but have unrelated meanings *homonyms*. A related term is *polysemy*, where a single word has several distinct but related meanings. For example,

*"Joan wrote a program to calculate Euclidean distance"*

*"The program featured Mozart's The Marriage of Figaro"*

In short, we need a way of finding the meaning of words based on their relevance to other words in the text. Step forward, Attention.
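To make the problem concrete, here is a toy sketch of a static embedding table. The vectors in `static_embeddings` are invented for illustration; the point is only that a plain lookup returns the same vector for "trainers" whatever the surrounding sentence says.

```python
# A static embedding is a lookup table: one fixed vector per word.
# The numbers below are made up for illustration.
static_embeddings = {
    "trainers": [0.12, -0.48, 0.33],
    "gym": [0.51, 0.07, -0.20],
    "running": [0.45, 0.10, -0.31],
}

shoe_sentence = "mustafa loved running in his new trainers".split()
coach_sentence = "svitlana said the gym had the best trainers around".split()

def embed(word):
    # Unknown words get a zero vector in this sketch
    return static_embeddings.get(word, [0.0, 0.0, 0.0])

shoe_vector = embed(shoe_sentence[-1])    # "trainers" (shoes)
coach_vector = embed(coach_sentence[-2])  # "trainers" (coaches)

# Same vector in both contexts, even though the meanings differ
print(shoe_vector == coach_vector)
```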



## Attention

There are two steps in the transformer during which the model learns what words and text mean. In ML parlance, each step updates the "hidden state" for the inputs to the model.

The first is the attention step, in which the transformer compares each word to all the other words in a sequence, looking for context and shared significance.

The second is the feed-forward step, where the model tries to capture more complex patterns and relationships between words. Both steps are accomplished by mathematical transformations.

<img src="https://github.com/rastringer/code_first_ml/blob/main/images/attention_diagram.png?raw=true" width="800"/>

[Diagram](https://distill.pub/2016/augmented-rnns/) from Olah and Carter, 2016

### From text to genomics and vision

<img src="https://github.com/rastringer/code_first_ml/blob/main/images/vit_transformer.png?raw=true" width="800"/>

<img src="https://github.com/rastringer/code_first_ml/blob/main/images/vit_attention.png?raw=true" width="400"/>

Images from ["An image is worth 16 x 16 words"](https://arxiv.org/pdf/2010.11929.pdf)

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('ojQB7PYaU28')

### Self-attention

<img src="https://github.com/rastringer/code_first_ml/blob/main/images/self_attention.png?raw=true" width="800"/>

Self-attention is the process of learning the relevance of each word to all other words in the sequence, and is an O(n²), quadratic operation.
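A quick way to see the quadratic cost is to count the scores: every token is compared with every token, so a sequence of n tokens needs n × n scores per attention head. The helper `num_attention_scores` below is a made-up name for this illustration.

```python
# Self-attention scores every token against every token,
# so a sequence of n tokens needs n * n scores per head.
def num_attention_scores(seq_len):
    return seq_len * seq_len

for n in [8, 64, 512, 4096]:
    print(f"{n:>5} tokens -> {num_attention_scores(n):>12,} scores")
```

Doubling the sequence length quadruples the number of scores, which is why long inputs become expensive.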

### Query, key, value

Each word's *query* vector is compared against every *key* vector to score how relevant the other words are. Those scores are then used to weight the corresponding *value* vectors, and the weighted sum becomes the word's updated representation.
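A minimal sketch of this computation for a single query, written in plain Python so that every step is visible. The vectors are invented for illustration, and the helper names (`softmax`, `dot`, `attend`) are hypothetical rather than part of any library. The scores are scaled by the square root of the key dimension, as in the `SelfAttention` module later in this notebook.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    # Score the query against every key, scaled by sqrt(d_k)
    d_k = len(query)
    scores = [dot(query, k) / math.sqrt(d_k) for k in keys]
    # Turn the scores into weights that sum to 1
    weights = softmax(scores)
    # The output is the weighted sum of the value vectors
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy vectors, invented for illustration
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(o, 3) for o in output])
```

Keys that point in the same direction as the query receive larger weights, so their values contribute more to the output.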

<img src="https://github.com/rastringer/code_first_ml/blob/main/images/multi_head.png?raw=true" width="800"/>

<img src="https://github.com/rastringer/code_first_ml/blob/main/images/transformer_encoder.png?raw=true" width="800"/>

With each layer, the model's understanding of the text improves. Here are the outputs up to layer 23 of GPT-2 when given the following prompt:

"Q: What is the capital of France?"
"A: Paris"
"Q: What is the capital of Poland?"
"A: "

```
0  ( [ The:,
 at and Act A
1  A The ( [ Is59
 At and40
2  A [ ( The At Is Act at59,
3  A [ ( Act At Is The CH An at
4  A [ At Q (Q The Are M An
5  A M No At The payable Q Qu (Q
6  No M A The C Die An H En Qu
7  C A No The M n P N H An
8  A The C P H No n Ass N T
9  A C No nil The Ch P An H N
10  A The G C N P No Me An Le
11  A C N None P G The Pr Ce H
12  Unknown None C G A N Bar The Ch P
13  C P N G B A Unknown St None The
14  St N G P Poland B C Pol A D
15  Poland P St Pol Warsaw Polish N B G Germany
16  Poland Warsaw Polish Poles Budapest Prague Pol Germany Berlin Moscow
17  Poland Warsaw Polish Poles Budapest Prague � Pol Lithuania Moscow
18  Poland Warsaw Polish Prague Budapest Poles Moscow � Berlin Kiev
19  Warsaw Poland Polish Budapest Prague Moscow Berlin Kiev � Frankfurt
20  Warsaw Poland Prague Budapest Polish Moscow Kiev Berlin Frankfurt Brussels
21  Warsaw Poland Polish Prague Budapest � Kiev Sz Berlin Moscow
22  Warsaw Poland Prague Budapest K W Kiev Sz Moscow Berlin
23  Warsaw W K Br Po B L Z P Poland

```


In [None]:
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads

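        # Dimension of each head; embed_dim must split evenly across the heads
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.head_dim = embed_dim // num_heads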

        # Create linear projections for key, query, and value vectors for
        # each head
        self.query_proj = nn.Linear(embed_dim, self.head_dim * num_heads)
        self.key_proj = nn.Linear(embed_dim, self.head_dim * num_heads)
        self.value_proj = nn.Linear(embed_dim, self.head_dim * num_heads)

        # Output projection to project multi-head attention concatenated outputs
        # back to original embed_dim
        self.output_proj = nn.Linear(embed_dim, embed_dim)

        # Softmax attention calculation
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):

        # Get input/output batch sizes and sequence length
        batch_size, seq_len, embed_dim = x.size()

        # Project inputs to queries, keys, values
        # Split last dimension into self.num_heads because we want a separate
        # head for each projection
        # Note: .view allows you to reshape tensors for operations like
        # splitting the vector into heads, flattening sequences, etc.
        # while keeping the same data.
        queries = self.query_proj(x).view(batch_size, seq_len, self.num_heads, self.head_dim)
        keys = self.key_proj(x).view(batch_size, seq_len, self.num_heads, self.head_dim)
        values = self.value_proj(x).view(batch_size, seq_len, self.num_heads, self.head_dim)

        # Move the head dimension ahead of the sequence dimension so the
        # matmul below computes attention independently for each head
        queries = queries.transpose(1, 2)  # [batch_size, num_heads, seq_len, head_dim]
        keys = keys.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention scores
        scores = torch.matmul(queries, keys.transpose(-2, -1)) / (self.head_dim ** 0.5)

        # Apply softmax over the key dimension so each query's weights sum to 1
        attention_weights = self.softmax(scores)

        # Apply attention weights to values
        output = torch.matmul(attention_weights, values)

        # Move heads back next to head_dim: [batch_size, seq_len, num_heads, head_dim]
        output = output.transpose(1, 2).contiguous()

        # Merge the num_heads and head_dim dimensions to concatenate the heads,
        # giving [batch_size, seq_len, embed_dim]
        output = output.view(batch_size, seq_len, embed_dim)

        # Project multi-head attention concatenated outputs back to embed_dim
        output = self.output_proj(output)

        return output



In [None]:
import torch

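# Example input: one sequence of three tokens, each a 4-dimensional vector.
# nn.Linear expects floating-point input, so set the dtype explicitly.
x = torch.tensor([[[1,2,3,4], [5,6,7,8], [9,10,11,12]]], dtype=torch.float32)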

embed_dim = 4 # Embedded dimension from input
num_heads = 2 # Number of heads

# Instantiate module
model = SelfAttention(embed_dim, num_heads)

# Run forward pass
output = model(x)

print(output.shape)
print(output)

This `SelfAttention` module implements multi-headed self-attention, which is a key component of transformers.

The main idea is that we're going to project the input embeddings into multiple "heads", where each head represents a different learned representation subspace.

These subspaces could include:

* Positional relationships between words, e.g.
  * "They put **the rug** in the middle of the room and the dog went to sleep on **it**"

* Syntactic roles
  * identifying part-of-speech or syntactic dependencies, e.g. *verb*, *subject*, *object*

* Semantic relationships
* Word importance

So basically anything that can help a program understand the meaning of a text.


The key components:

* `embed_dim`: The input embedding dimensionality (e.g. size of each input token vector)
* `num_heads`: The number of parallel attention heads
* `head_dim`: The dimension of each attention head. This is embed_dim // num_heads so all heads concatenated together equals the input dimension.
* `query_proj`: A linear layer that projects the input into queries, one set per head
* `key_proj`: Projects inputs into keys, one set per head
* `value_proj`: Projects inputs into values, one set per head

Then, for each head, we compute attention weights between queries and keys with softmax and apply those weights to the values:

* `softmax`: Normalizes each query's attention weights within a head so they sum to 1
* `output_proj`: Projects multi-head attention outputs back to the original input dimension
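The relationship between `embed_dim`, `num_heads` and `head_dim` can be sketched with plain lists. The numbers are invented for illustration; the slicing stands in for what `.view(..., num_heads, head_dim)` does to the last dimension of a tensor.

```python
embed_dim = 8
num_heads = 2
head_dim = embed_dim // num_heads  # 4

# One projected token vector (values invented for illustration)
projected = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]

# Split the vector into one slice per head
heads = [projected[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]

# ... each head would run attention on its own slice here ...

# Concatenate the heads back to the original width
concatenated = [x for head in heads for x in head]

print(heads)
print(len(concatenated) == embed_dim)
```

Because the heads are slices of the same projection, concatenating them restores a vector of width `embed_dim`, which is what `output_proj` expects.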

### Word vectors

GPT-3 uses word vectors of 12,288 dimensions, which means that *each word* in an input text is represented by a vector of 12,288 numbers. This gives the model more than 12,000 numbers' worth of scrap paper on which to make notes about how the words relate to one another. Those scribbled notes are refined over and over by later layers.



### Feed it forward

In the *attention* layers, the transformer model checks all words against all other words in a sequence and gathers information about relevance, meaning and relationship.

In the feed forward step, the model muses about what it has learned in the attention step, draws more intricate inferences, and tries to predict the next, *masked* word in a sequence.
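A minimal sketch of a feed-forward step applied to one token vector, in plain Python. It assumes a hidden layer four times wider than the input, the ratio implied by GPT-3's 12,288 and 49,152 figures, and uses ReLU for simplicity; real models may use a different activation. The weights are random and the helper names are hypothetical.

```python
import random

random.seed(0)

def linear(x, weights, bias):
    # weights holds one row per output unit
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def make_layer(n_in, n_out):
    # Small random weights and zero biases, for illustration only
    weights = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias

embed_dim = 4
hidden_dim = 4 * embed_dim  # expand to a wider hidden layer

w1, b1 = make_layer(embed_dim, hidden_dim)
w2, b2 = make_layer(hidden_dim, embed_dim)

def feed_forward(token_vector):
    # Expand, apply the non-linearity, then project back to embed_dim
    hidden = relu(linear(token_vector, w1, b1))
    return linear(hidden, w2, b2)

token_vector = [0.5, -1.0, 0.25, 2.0]
out = feed_forward(token_vector)
print(len(out))
```

Unlike attention, this step looks at each token's vector on its own; no information moves between positions here.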

If we were training a language model on the text of Harry Potter and the Philosopher's Stone, for example, the outputs of successive layers of the transformer might look something like this (dramatised example), starting with an early layer:

```
adsfkl aer wand aklj london ajsfdhkla Harry and the the j magic
```

layer 15:

```
Harry train Hogwarts sweets magic wand Ron school
```

layer 30:

```
Harry is a wizard. Ron is Harry's friend. Hermione is good at spells.
```

layer 300:

```
Harry is a diligent, talented young wizard who
feels destined to take on the forces of evil
magic. Ron is his affable sidekick and the two
enjoy their adventures together. Hermione is a
sharp voice of reason and a good friend to
both. Ron seems both adoring of and intimidated by Hermione.
```



The power of the feed-forward neural net comes from its huge number of connections. There are 49,152 neurons in GPT-3's hidden layer. For more details, please read this [excellent article](https://www.understandingai.org/p/large-language-models-explained-with) by Timothy B Lee and Sean Trott on [understandingai.org](https://www.understandingai.org).

This means:

* 12,288 inputs (the word vector)
* 49,152 neurons in the hidden layer
* Each feed-forward layer has 49,152 × 12,288 + 12,288 × 49,152 ≈ 1.2 billion weight parameters.
* With 96 feed-forward layers, this comes to about 116 billion parameters.
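The arithmetic above can be checked in a few lines. This sketch counts weights only and ignores biases; the variable names are chosen for this example.

```python
d_model = 12_288    # word-vector width
d_hidden = 49_152   # hidden-layer neurons
n_layers = 96       # number of feed-forward layers

# Weights only (biases ignored): input -> hidden plus hidden -> output
per_layer = d_model * d_hidden + d_hidden * d_model
total = per_layer * n_layers

print(f"per layer:  {per_layer:,}")   # 1,207,959,552
print(f"all layers: {total:,}")       # 115,964,116,992
```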


### Full translation example

The following example is from bentrevett's excellent seq2seq [course](https://github.com/bentrevett/pytorch-seq2seq) on GitHub.


In [None]:
! pip install spacy torchtext evaluate tqdm

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import random
import numpy as np
import spacy
import datasets
import torchtext
import tqdm
import evaluate
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

In [None]:
seed = 1234

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

In [None]:
dataset = datasets.load_dataset("bentrevett/multi30k")

In [None]:
train_data, valid_data, test_data = (
    dataset["train"],
    dataset["validation"],
    dataset["test"],
)

In [None]:
!wget https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.7.0/de_core_news_sm-3.7.0-py3-none-any.whl

In [None]:
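# spacy download takes one model name per call
!python3 -m spacy download en_core_web_sm
!python3 -m spacy download de_core_news_sm

In [None]:
# Load the English and German pipelines used for tokenization
en_nlp = spacy.load("en_core_web_sm")
de_nlp = spacy.load("de_core_news_sm")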

In [None]:
def tokenize_example(example, en_nlp, de_nlp, max_length, lower, sos_token, eos_token):
    en_tokens = [token.text for token in en_nlp.tokenizer(example["en"])][:max_length]
    de_tokens = [token.text for token in de_nlp.tokenizer(example["de"])][:max_length]
    if lower:
        en_tokens = [token.lower() for token in en_tokens]
        de_tokens = [token.lower() for token in de_tokens]
    en_tokens = [sos_token] + en_tokens + [eos_token]
    de_tokens = [sos_token] + de_tokens + [eos_token]
    return {"en_tokens": en_tokens, "de_tokens": de_tokens}

In [None]:
max_length = 1_000
lower = True
sos_token = "<sos>"
eos_token = "<eos>"

fn_kwargs = {
    "en_nlp": en_nlp,
    "de_nlp": de_nlp,
    "max_length": max_length,
    "lower": lower,
    "sos_token": sos_token,
    "eos_token": eos_token,
}

train_data = train_data.map(tokenize_example, fn_kwargs=fn_kwargs)
valid_data = valid_data.map(tokenize_example, fn_kwargs=fn_kwargs)
test_data = test_data.map(tokenize_example, fn_kwargs=fn_kwargs)

In [None]:
min_freq = 2
unk_token = "<unk>"
pad_token = "<pad>"

special_tokens = [
    unk_token,
    pad_token,
    sos_token,
    eos_token,
]

en_vocab = torchtext.vocab.build_vocab_from_iterator(
    train_data["en_tokens"],
    min_freq=min_freq,
    specials=special_tokens,
)

de_vocab = torchtext.vocab.build_vocab_from_iterator(
    train_data["de_tokens"],
    min_freq=min_freq,
    specials=special_tokens,
)

In [None]:
assert en_vocab[unk_token] == de_vocab[unk_token]
assert en_vocab[pad_token] == de_vocab[pad_token]

unk_index = en_vocab[unk_token]
pad_index = en_vocab[pad_token]

In [None]:
en_vocab.set_default_index(unk_index)
de_vocab.set_default_index(unk_index)

In [None]:
def numericalize_example(example, en_vocab, de_vocab):
    en_ids = en_vocab.lookup_indices(example["en_tokens"])
    de_ids = de_vocab.lookup_indices(example["de_tokens"])
    return {"en_ids": en_ids, "de_ids": de_ids}

In [None]:
fn_kwargs = {"en_vocab": en_vocab, "de_vocab": de_vocab}

train_data = train_data.map(numericalize_example, fn_kwargs=fn_kwargs)
valid_data = valid_data.map(numericalize_example, fn_kwargs=fn_kwargs)
test_data = test_data.map(numericalize_example, fn_kwargs=fn_kwargs)

In [None]:
data_type = "torch"
format_columns = ["en_ids", "de_ids"]

train_data = train_data.with_format(
    type=data_type, columns=format_columns, output_all_columns=True
)

valid_data = valid_data.with_format(
    type=data_type,
    columns=format_columns,
    output_all_columns=True,
)

test_data = test_data.with_format(
    type=data_type,
    columns=format_columns,
    output_all_columns=True,
)

In [None]:
def get_collate_fn(pad_index):
    def collate_fn(batch):
        batch_en_ids = [example["en_ids"] for example in batch]
        batch_de_ids = [example["de_ids"] for example in batch]
        batch_en_ids = nn.utils.rnn.pad_sequence(batch_en_ids, padding_value=pad_index)
        batch_de_ids = nn.utils.rnn.pad_sequence(batch_de_ids, padding_value=pad_index)
        batch = {
            "en_ids": batch_en_ids,
            "de_ids": batch_de_ids,
        }
        return batch

    return collate_fn

In [None]:
def get_data_loader(dataset, batch_size, pad_index, shuffle=False):
    collate_fn = get_collate_fn(pad_index)
    data_loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=batch_size,
        collate_fn=collate_fn,
        shuffle=shuffle,
    )
    return data_loader

In [None]:
# Authenticate via Colab (if on Vertex Workbench, you're already authenticated)
from google.colab import auth
auth.authenticate_user()

In [None]:
batch_size = 128

train_data_loader = get_data_loader(train_data, batch_size, pad_index, shuffle=True)
valid_data_loader = get_data_loader(valid_data, batch_size, pad_index)
test_data_loader = get_data_loader(test_data, batch_size, pad_index)

In [None]:
class Encoder(nn.Module):
    def __init__(
        self, input_dim, embedding_dim, encoder_hidden_dim, decoder_hidden_dim, dropout
    ):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.GRU(embedding_dim, encoder_hidden_dim, bidirectional=True)
        self.fc = nn.Linear(encoder_hidden_dim * 2, decoder_hidden_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src = [src length, batch size]
        embedded = self.dropout(self.embedding(src))
        # embedded = [src length, batch size, embedding dim]
        outputs, hidden = self.rnn(embedded)
        # outputs = [src length, batch size, hidden dim * n directions]
        # hidden = [n layers * n directions, batch size, hidden dim]
        # hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        # outputs are always from the last layer
        # hidden [-2, :, : ] is the last of the forwards RNN
        # hidden [-1, :, : ] is the last of the backwards RNN
        # initial decoder hidden is final hidden state of the forwards and backwards
        # encoder RNNs fed through a linear layer
        hidden = torch.tanh(
            self.fc(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        )
        # outputs = [src length, batch size, encoder hidden dim * 2]
        # hidden = [batch size, decoder hidden dim]
        return outputs, hidden

In [None]:
class Attention(nn.Module):
    def __init__(self, encoder_hidden_dim, decoder_hidden_dim):
        super().__init__()
        self.attn_fc = nn.Linear(
            (encoder_hidden_dim * 2) + decoder_hidden_dim, decoder_hidden_dim
        )
        self.v_fc = nn.Linear(decoder_hidden_dim, 1, bias=False)

    def forward(self, hidden, encoder_outputs):
        # hidden = [batch size, decoder hidden dim]
        # encoder_outputs = [src length, batch size, encoder hidden dim * 2]
        batch_size = encoder_outputs.shape[1]
        src_length = encoder_outputs.shape[0]
        # repeat decoder hidden state src_length times
        hidden = hidden.unsqueeze(1).repeat(1, src_length, 1)
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        # hidden = [batch size, src length, decoder hidden dim]
        # encoder_outputs = [batch size, src length, encoder hidden dim * 2]
        energy = torch.tanh(self.attn_fc(torch.cat((hidden, encoder_outputs), dim=2)))
        # energy = [batch size, src length, decoder hidden dim]
        attention = self.v_fc(energy).squeeze(2)
        # attention = [batch size, src length]
        return torch.softmax(attention, dim=1)

In [None]:
class Decoder(nn.Module):
    def __init__(
        self,
        output_dim,
        embedding_dim,
        encoder_hidden_dim,
        decoder_hidden_dim,
        dropout,
        attention,
    ):
        super().__init__()
        self.output_dim = output_dim
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, embedding_dim)
        self.rnn = nn.GRU((encoder_hidden_dim * 2) + embedding_dim, decoder_hidden_dim)
        self.fc_out = nn.Linear(
            (encoder_hidden_dim * 2) + decoder_hidden_dim + embedding_dim, output_dim
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, encoder_outputs):
        # input = [batch size]
        # hidden = [batch size, decoder hidden dim]
        # encoder_outputs = [src length, batch size, encoder hidden dim * 2]
        input = input.unsqueeze(0)
        # input = [1, batch size]
        embedded = self.dropout(self.embedding(input))
        # embedded = [1, batch size, embedding dim]
        a = self.attention(hidden, encoder_outputs)
        # a = [batch size, src length]
        a = a.unsqueeze(1)
        # a = [batch size, 1, src length]
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        # encoder_outputs = [batch size, src length, encoder hidden dim * 2]
        weighted = torch.bmm(a, encoder_outputs)
        # weighted = [batch size, 1, encoder hidden dim * 2]
        weighted = weighted.permute(1, 0, 2)
        # weighted = [1, batch size, encoder hidden dim * 2]
        rnn_input = torch.cat((embedded, weighted), dim=2)
        # rnn_input = [1, batch size, (encoder hidden dim * 2) + embedding dim]
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        # output = [seq length, batch size, decoder hid dim * n directions]
        # hidden = [n layers * n directions, batch size, decoder hid dim]
        # seq len, n layers and n directions will always be 1 in this decoder, therefore:
        # output = [1, batch size, decoder hidden dim]
        # hidden = [1, batch size, decoder hidden dim]
        # this also means that output == hidden
        assert (output == hidden).all()
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim=1))
        # prediction = [batch size, output dim]
        return prediction, hidden.squeeze(0), a.squeeze(1)

In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio):
        # src = [src length, batch size]
        # trg = [trg length, batch size]
        # teacher_forcing_ratio is probability to use teacher forcing
        # e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time
        batch_size = src.shape[1]
        trg_length = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        # tensor to store decoder outputs
        outputs = torch.zeros(trg_length, batch_size, trg_vocab_size).to(self.device)
        # encoder_outputs is all hidden states of the input sequence, back and forwards
        # hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src)
        # outputs = [src length, batch size, encoder hidden dim * 2]
        # hidden = [batch size, decoder hidden dim]
        # first input to the decoder is the <sos> tokens
        input = trg[0, :]
        for t in range(1, trg_length):
            # insert input token embedding, previous hidden state and all encoder hidden states
            # receive output tensor (predictions) and new hidden state
            output, hidden, _ = self.decoder(input, hidden, encoder_outputs)
            # output = [batch size, output dim]
            # hidden = [n layers, batch size, decoder hidden dim]
            # place predictions in a tensor holding predictions for each token
            outputs[t] = output
            # decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            # get the highest predicted token from our predictions
            top1 = output.argmax(1)
            # if teacher forcing, use actual next token as next input
            # if not, use predicted token
            input = trg[t] if teacher_force else top1
            # input = [batch size]
        return outputs

In [None]:
input_dim = len(de_vocab)
output_dim = len(en_vocab)
encoder_embedding_dim = 256
decoder_embedding_dim = 256
encoder_hidden_dim = 512
decoder_hidden_dim = 512
encoder_dropout = 0.5
decoder_dropout = 0.5
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

attention = Attention(encoder_hidden_dim, decoder_hidden_dim)

encoder = Encoder(
    input_dim,
    encoder_embedding_dim,
    encoder_hidden_dim,
    decoder_hidden_dim,
    encoder_dropout,
)

decoder = Decoder(
    output_dim,
    decoder_embedding_dim,
    encoder_hidden_dim,
    decoder_hidden_dim,
    decoder_dropout,
    attention,
)

model = Seq2Seq(encoder, decoder, device).to(device)

In [None]:
def init_weights(m):
    for name, param in m.named_parameters():
        if "weight" in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)


model.apply(init_weights)

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


print(f"The model has {count_parameters(model):,} trainable parameters")

In [None]:
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss(ignore_index=pad_index)

In [None]:
def train_fn(
    model, data_loader, optimizer, criterion, clip, teacher_forcing_ratio, device
):
    model.train()
    epoch_loss = 0
    for i, batch in enumerate(data_loader):
        src = batch["de_ids"].to(device)
        trg = batch["en_ids"].to(device)
        # src = [src length, batch size]
        # trg = [trg length, batch size]
        optimizer.zero_grad()
        output = model(src, trg, teacher_forcing_ratio)
        # output = [trg length, batch size, trg vocab size]
        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim)
        # output = [(trg length - 1) * batch size, trg vocab size]
        trg = trg[1:].view(-1)
        # trg = [(trg length - 1) * batch size]
        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(data_loader)

In [None]:
def evaluate_fn(model, data_loader, criterion, device):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for i, batch in enumerate(data_loader):
            src = batch["de_ids"].to(device)
            trg = batch["en_ids"].to(device)
            # src = [src length, batch size]
            # trg = [trg length, batch size]
            output = model(src, trg, 0)  # turn off teacher forcing
            # output = [trg length, batch size, trg vocab size]
            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)
            # output = [(trg length - 1) * batch size, trg vocab size]
            trg = trg[1:].view(-1)
            # trg = [(trg length - 1) * batch size]
            loss = criterion(output, trg)
            epoch_loss += loss.item()
    return epoch_loss / len(data_loader)

In [None]:
n_epochs = 3
clip = 1.0
teacher_forcing_ratio = 0.5

best_valid_loss = float("inf")

for epoch in tqdm.tqdm(range(n_epochs)):
    train_loss = train_fn(
        model,
        train_data_loader,
        optimizer,
        criterion,
        clip,
        teacher_forcing_ratio,
        device,
    )
    valid_loss = evaluate_fn(
        model,
        valid_data_loader,
        criterion,
        device,
    )
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), "tut3-model.pt")
    print(f"\tTrain Loss: {train_loss:7.3f} | Train PPL: {np.exp(train_loss):7.3f}")
    print(f"\tValid Loss: {valid_loss:7.3f} | Valid PPL: {np.exp(valid_loss):7.3f}")

In [None]:
model.load_state_dict(torch.load("tut3-model.pt"))

test_loss = evaluate_fn(model, test_data_loader, criterion, device)

print(f"| Test Loss: {test_loss:.3f} | Test PPL: {np.exp(test_loss):7.3f} |")

In [None]:
def translate_sentence(
    sentence,
    model,
    en_nlp,
    de_nlp,
    en_vocab,
    de_vocab,
    lower,
    sos_token,
    eos_token,
    device,
    max_output_length=25,
):
    model.eval()
    with torch.no_grad():
        if isinstance(sentence, str):
            de_tokens = [token.text for token in de_nlp.tokenizer(sentence)]
        else:
            de_tokens = [token for token in sentence]
        if lower:
            de_tokens = [token.lower() for token in de_tokens]
        de_tokens = [sos_token] + de_tokens + [eos_token]
        ids = de_vocab.lookup_indices(de_tokens)
        tensor = torch.LongTensor(ids).unsqueeze(-1).to(device)
        encoder_outputs, hidden = model.encoder(tensor)
        inputs = en_vocab.lookup_indices([sos_token])
        attentions = torch.zeros(max_output_length, 1, len(ids))
        for i in range(max_output_length):
            inputs_tensor = torch.LongTensor([inputs[-1]]).to(device)
            output, hidden, attention = model.decoder(
                inputs_tensor, hidden, encoder_outputs
            )
            attentions[i] = attention
            predicted_token = output.argmax(-1).item()
            inputs.append(predicted_token)
            if predicted_token == en_vocab[eos_token]:
                break
        en_tokens = en_vocab.lookup_tokens(inputs)
    return en_tokens, de_tokens, attentions[: len(en_tokens) - 1]

In [None]:
def plot_attention(sentence, translation, attention):
    fig, ax = plt.subplots(figsize=(10, 10))
    attention = attention.squeeze(1).numpy()
    cax = ax.matshow(attention, cmap="bone")
    ax.set_xticks(ticks=np.arange(len(sentence)), labels=sentence, rotation=90, size=15)
    translation = translation[1:]
    ax.set_yticks(ticks=np.arange(len(translation)), labels=translation, size=15)
    plt.show()
    plt.close()

In [None]:
sentence = test_data[0]["de"]
expected_translation = test_data[0]["en"]

sentence, expected_translation

In [None]:
translation, sentence_tokens, attention = translate_sentence(
    sentence,
    model,
    en_nlp,
    de_nlp,
    en_vocab,
    de_vocab,
    lower,
    sos_token,
    eos_token,
    device,
)


In [None]:
translation

In [None]:
sentence_tokens

In [None]:
plot_attention(sentence_tokens, translation, attention)
