# Introduction

This notebook uses the transformer neural network from this repository.

It uses a custom dataset class instead of the Multi30K class from torchtext. This notebook can be adapted to any other translation dataset, provided the CSV file is prepared in the same format.
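
The expected CSV layout can be inferred from the `pd.read_csv` calls and the previews further down: an `english` column and a `german` column, plus an unnamed index column (pandas typically labels a column with an empty header `Unnamed: 0`). A minimal sketch of writing such a file with the standard library follows. The helper name `write_translation_csv` and the in-memory buffer are assumptions made for illustration; the sentence pairs are copied from the samples shown in this notebook.

```python
import csv
import io

# Sentence pairs taken from this notebook's samples; replace with your own corpus.
pairs = [
    ("Two men are at the stove preparing food.",
     "Zwei Männer stehen am Herd und bereiten Essen zu."),
    ("A guy works on a building.",
     "Ein Typ arbeitet an einem Gebäude."),
]

def write_translation_csv(rows, handle):
    """Write (english, german) pairs with the column names the notebook reads."""
    writer = csv.writer(handle)
    # The first header cell is empty: it is the unnamed index column.
    writer.writerow(["", "english", "german"])
    for index, (english, german) in enumerate(rows):
        writer.writerow([index, english, german])

# The buffer stands in for open("translation_train.csv", "w", newline="").
buffer = io.StringIO()
write_translation_csv(pairs, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

For another language pair, the column names passed to `usecols` and used inside the dataset class would need to change accordingly.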

In [1]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from typing import Iterable, List
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset
from timeit import default_timer as timer
from attention import transformer

import torch.nn as nn
import torch
import torch.nn.functional as F
import numpy as np
import pandas as pd

In [2]:
# Set seed.
seed = 42
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True
# Disable cuDNN autotuning: benchmark mode may select different algorithms between runs.
torch.backends.cudnn.benchmark = False

In [3]:
SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'

# Place-holders
token_transform = {}
vocab_transform = {}

In [4]:
token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')


In [5]:
train_csv = pd.read_csv('data/english_german/translation_train.csv', usecols=['english', 'german'])
train_csv.head()

| | english | german |
|---|---|---|
| 0 | Two young, White males are outside near many b... | Zwei junge weiße Männer sind im Freien in der ... |
| 1 | Several men in hard hats are operating a giant... | Mehrere Männer mit Schutzhelmen bedienen ein A... |
| 2 | A little girl climbing into a wooden playhouse. | Ein kleines Mädchen klettert in ein Spielhaus ... |
| 3 | A man in a blue shirt is standing on a ladder ... | Ein Mann in einem blauen Hemd steht auf einer ... |
| 4 | Two men are at the stove preparing food. | Zwei Männer stehen am Herd und bereiten Essen zu. |


In [6]:
valid_csv = pd.read_csv('data/english_german/translation_test.csv', usecols=['english', 'german'])
valid_csv.head()

| | english | german |
|---|---|---|
| 0 | A man in an orange hat starring at something. | Ein Mann mit einem orangefarbenen Hut, der etw... |
| 1 | A Boston Terrier is running on lush green gras... | Ein Boston Terrier läuft über saftig-grünes Gr... |
| 2 | A girl in karate uniform breaking a stick with... | Ein Mädchen in einem Karateanzug bricht einen ... |
| 3 | Five people wearing winter jackets and helmets... | Fünf Leute in Winterjacken und mit Helmen steh... |
| 4 | People are fixing the roof of a house. | Leute Reparieren das Dach eines Hauses. |


In [7]:
train_csv['german'].iloc[0]

'Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.'

## Create a Custom Dataset Class

In [8]:
class TranslationDataset(Dataset):
    def __init__(self, csv):
        self.csv = csv
        
    def __len__(self):
        return len(self.csv)
    
    def __getitem__(self, idx):
        return(
            self.csv['german'].iloc[idx],
            self.csv['english'].iloc[idx]
        )

In [9]:
train_dataset = TranslationDataset(train_csv)
valid_dataset = TranslationDataset(valid_csv)

In [10]:
iterator = iter(train_dataset)
print(next(iterator))

('Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.', 'Two young, White males are outside near many bushes.')


In [11]:
# Helper generator that yields one list of tokens per sample.
def yield_tokens(data_iter: Iterable, language: str) -> Iterable[List[str]]:
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}

    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])

# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Create torchtext's Vocab object
    vocab_transform[ln] = build_vocab_from_iterator(
        yield_tokens(train_dataset, ln),
        min_freq=1,
        specials=special_symbols,
        special_first=True,
    )

# Set ``UNK_IDX`` as the default index. This index is returned when the token is not found.
# If not set, it throws ``RuntimeError`` when the queried token is not found in the Vocabulary.
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    vocab_transform[ln].set_default_index(UNK_IDX)
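
To make the roles of `specials`, `special_first`, and `set_default_index` concrete, here is a pure-Python sketch of the same idea. It is not torchtext's implementation: `build_toy_vocab` and `lookup` are hypothetical helpers, and the ordering of regular tokens (by descending frequency, then alphabetically) is an assumption made for this sketch.

```python
from collections import Counter

UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

def build_toy_vocab(token_lists, specials, min_freq=1):
    """Map tokens to indices, placing the special symbols first."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Assumed ordering for this sketch: most frequent first, ties alphabetical.
    ordered = sorted(
        (tok for tok, freq in counts.items()
         if freq >= min_freq and tok not in specials),
        key=lambda tok: (-counts[tok], tok),
    )
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}

def lookup(vocab, tokens, default_index=UNK_IDX):
    """Return indices, falling back to default_index for unseen tokens."""
    return [vocab.get(tok, default_index) for tok in tokens]

toy_vocab = build_toy_vocab([["ein", "Mann"], ["ein", "Hund"]], special_symbols)
ids = lookup(toy_vocab, ["ein", "Mann", "Katze"])
print(toy_vocab)
print(ids)  # "Katze" was never seen, so it maps to UNK_IDX
```

Because the special symbols occupy indices 0 to 3, the constants `UNK_IDX`, `PAD_IDX`, `BOS_IDX`, and `EOS_IDX` stay valid regardless of the training data.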

## Hyperparameters

In [12]:
SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
BATCH_SIZE = 128
MAX_LEN = 256
NUM_ENCODER_LAYERS = 3
# Fall back to the CPU when no GPU is available.
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
NUM_EPOCHS = 75
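
Two derived sizes follow from these settings and reappear in the model summary printed further down: the per-head dimension (the `in_features=64` of the `q`, `k`, and `v` projections) and the feed-forward width (`out_features=2048`). The quick check below assumes the feed-forward width is `EMB_SIZE` multiplied by the `expansion_factor=4` that is passed to the encoder and decoder in the inference cells; `EXPANSION_FACTOR` is a name introduced here for illustration.

```python
EMB_SIZE = 512
NHEAD = 8
EXPANSION_FACTOR = 4  # assumption: mirrors expansion_factor=4 used at inference

# The embedding must split evenly across the attention heads.
assert EMB_SIZE % NHEAD == 0

head_dim = EMB_SIZE // NHEAD
ffn_width = EMB_SIZE * EXPANSION_FACTOR

print(head_dim)   # 64
print(ffn_width)  # 2048
```

Note that `FFN_HID_DIM` is not passed to the model constructor in the cells shown here, so changing it would have no visible effect on the printed architecture.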

## Utilities for Dataset Preparation

In [13]:
# helper function to club together sequential operations
def sequential_transforms(*transforms):
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func

# function to add BOS/EOS and create tensor for input sequence indices
def tensor_transform(token_ids: List[int]):
    return torch.cat((torch.tensor([BOS_IDX]),
                      torch.tensor(token_ids),
                      torch.tensor([EOS_IDX])))

# ``src`` and ``tgt`` language text transforms to convert raw strings into tensors of token indices
text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    text_transform[ln] = sequential_transforms(token_transform[ln], #Tokenization
                                               vocab_transform[ln], #Numericalization
                                               tensor_transform) # Add BOS/EOS and create tensor


# function to collate data samples into batch tensors
def collate_fn(batch):
    src_batch, tgt_batch = [], []
    for src_sample, tgt_sample in batch:
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))

    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX, batch_first=True)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX, batch_first=True)
    return src_batch, tgt_batch
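
The effect of `tensor_transform` followed by `pad_sequence(..., batch_first=True)` can be illustrated with plain lists. This is a sketch under the assumption that padding is appended on the right up to the longest sequence in the batch; `add_bos_eos` and `pad_batch` are hypothetical helpers, and the token ids are made up.

```python
PAD_IDX, BOS_IDX, EOS_IDX = 1, 2, 3

def add_bos_eos(token_ids):
    """Wrap a sequence of token ids with the start and end symbols."""
    return [BOS_IDX] + list(token_ids) + [EOS_IDX]

def pad_batch(sequences, padding_value=PAD_IDX):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]

# Two made-up numericalized sentences of different lengths.
raw_batch = [[10, 11, 12], [20]]
batch = pad_batch([add_bos_eos(ids) for ids in raw_batch])
print(batch)  # [[2, 10, 11, 12, 3], [2, 20, 3, 1, 1]]
```

Each row corresponds to one sentence, which matches the `(batch, sequence)` layout produced by `batch_first=True`.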

## Model

In [14]:
model = transformer.Transformer(
    embed_dim=EMB_SIZE,
    src_vocab_size=SRC_VOCAB_SIZE,
    tgt_vocab_size=TGT_VOCAB_SIZE,
    seq_len=MAX_LEN,
    num_layers=NUM_ENCODER_LAYERS,
    n_heads=NHEAD,
    device=DEVICE,
    dropout=0.1
).to(DEVICE)

# Total parameters and trainable parameters.
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,} total parameters.")
total_trainable_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad)
print(f"{total_trainable_params:,} training parameters.")
print(model)

36,057,347 total parameters.
36,057,347 training parameters.
Transformer(
  (encoder): TransformerEncoder(
    (embedding): Embedding(
      (embed): Embedding(19293, 512)
    )
    (positional_encoding): PositionalEncoding(
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (layers): ModuleList(
      (0-2): 3 x TransformerBlock(
        (attention): MultiHeadAttention(
          (q): Linear(in_features=64, out_features=64, bias=True)
          (k): Linear(in_features=64, out_features=64, bias=True)
          (v): Linear(in_features=64, out_features=64, bias=True)
          (out): Linear(in_features=512, out_features=512, bias=True)
        )
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (ffn): Sequential(
          (0): Linear(in_features=512, out_features=2048, bias=True)
          (1): ReLU()
          (2): Linear(in_features=2048, out_features=512, bias=True)
        )
   

In [15]:
# for p in model.parameters():
#     if p.dim() > 1:
#         nn.init.xavier_uniform_(p)

loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
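
Passing `ignore_index=PAD_IDX` means padded target positions do not contribute to the loss, and with the default mean reduction the average is taken over the remaining positions only. The pure-Python sketch below illustrates that behaviour for a handful of positions; `cross_entropy_ignore_pad` is a hypothetical helper written for this illustration, not the PyTorch implementation.

```python
import math

PAD_IDX = 1

def cross_entropy_ignore_pad(logits, targets, ignore_index=PAD_IDX):
    """Mean negative log-likelihood over targets that are not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding positions contribute nothing to the loss
        log_norm = math.log(sum(math.exp(x) for x in row))
        total += log_norm - row[target]
        count += 1
    return total / count

uniform = [0.0, 0.0, 0.0, 0.0]
logits = [uniform, uniform, [9.0, 9.0, 9.0, 9.0]]
targets = [2, 3, PAD_IDX]  # the last position is padding and is skipped

loss = cross_entropy_ignore_pad(logits, targets)
print(round(loss, 4))  # uniform logits over 4 classes give log(4)
```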

In [16]:
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn)
def train_epoch(model, optimizer):
    model.train()
    losses = 0

    for src, tgt in train_dataloader:
        # print(" ".join(vocab_transform[SRC_LANGUAGE].lookup_tokens(list(src[0].cpu().numpy()))).replace("<bos>", "").replace("<eos>", ""))
        # print(" ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt[0].cpu().numpy()))).replace("<bos>", "").replace("<eos>", ""))
        src = src.to(DEVICE)
        tgt = tgt.to(DEVICE)
        
        tgt_input = tgt[:, :-1]

        logits = model(src, tgt_input)

        optimizer.zero_grad()

        tgt_out = tgt[:, 1:]
        loss = loss_fn(logits.view(-1, TGT_VOCAB_SIZE), tgt_out.contiguous().view(-1))
        loss.backward()

        optimizer.step()
        losses += loss.item()

    # len() on a DataLoader gives the number of batches without iterating over it again.
    return losses / len(train_dataloader)


val_dataloader = DataLoader(valid_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn)
def evaluate(model):
    model.eval()
    losses = 0

    # Gradients are not needed for validation.
    with torch.no_grad():
        for src, tgt in val_dataloader:
            # print(" ".join(vocab_transform[SRC_LANGUAGE].lookup_tokens(list(src[0].cpu().numpy()))).replace("<bos>", "").replace("<eos>", ""))
            # print(" ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt[0].cpu().numpy()))).replace("<bos>", "").replace("<eos>", ""))
            src = src.to(DEVICE)
            tgt = tgt.to(DEVICE)

            tgt_input = tgt[:, :-1]

            logits = model(src, tgt_input)

            tgt_out = tgt[:, 1:]
            loss = loss_fn(logits.view(-1, TGT_VOCAB_SIZE), tgt_out.contiguous().view(-1))
            losses += loss.item()

    # len() on a DataLoader gives the number of batches without iterating over it again.
    return losses / len(val_dataloader)
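
Both functions rely on the same shift between decoder input and expected output: `tgt[:, :-1]` drops the last position and `tgt[:, 1:]` drops the first, so position `t` of the input is trained to predict token `t + 1`. A small sketch with plain lists and made-up token ids shows the alignment:

```python
BOS_IDX, EOS_IDX, PAD_IDX = 2, 3, 1

# Two made-up target sequences, already wrapped in <bos>/<eos> and padded.
tgt = [
    [BOS_IDX, 10, 11, 12, EOS_IDX],
    [BOS_IDX, 20, EOS_IDX, PAD_IDX, PAD_IDX],
]

tgt_input = [row[:-1] for row in tgt]  # what the decoder sees
tgt_out = [row[1:] for row in tgt]     # what the decoder should predict

for inp, out in zip(tgt_input, tgt_out):
    print(inp, "->", out)
```

In the second row the trailing targets are padding, which is why the loss is created with `ignore_index=PAD_IDX`.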

In [17]:
for epoch in range(1, NUM_EPOCHS+1):
    start_time = timer()
    train_loss = train_epoch(model, optimizer)
    end_time = timer()
    val_loss = evaluate(model)
    print(f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, "
          f"Epoch time = {(end_time - start_time):.3f}s")

Epoch: 1, Train loss: 5.647, Val loss: 4.772, Epoch time = 62.767s
Epoch: 2, Train loss: 4.518, Val loss: 4.161, Epoch time = 62.893s
Epoch: 3, Train loss: 4.061, Val loss: 3.782, Epoch time = 64.410s
Epoch: 4, Train loss: 3.760, Val loss: 3.537, Epoch time = 68.796s
Epoch: 5, Train loss: 3.531, Val loss: 3.352, Epoch time = 73.837s
Epoch: 6, Train loss: 3.348, Val loss: 3.203, Epoch time = 73.689s
Epoch: 7, Train loss: 3.197, Val loss: 3.076, Epoch time = 73.617s
Epoch: 8, Train loss: 3.067, Val loss: 2.971, Epoch time = 75.402s
Epoch: 9, Train loss: 2.957, Val loss: 2.886, Epoch time = 76.924s
Epoch: 10, Train loss: 2.860, Val loss: 2.812, Epoch time = 74.137s
Epoch: 11, Train loss: 2.771, Val loss: 2.736, Epoch time = 77.777s
Epoch: 12, Train loss: 2.695, Val loss: 2.684, Epoch time = 77.348s
Epoch: 13, Train loss: 2.621, Val loss: 2.634, Epoch time = 77.682s
Epoch: 14, Train loss: 2.555, Val loss: 2.572, Epoch time = 76.698s
Epoch: 15, Train loss: 2.492, Val loss: 2.535, Epoch time

In [18]:
import os
os.makedirs('outputs/translation_custom_dataloader', exist_ok=True)
torch.save(model, 'outputs/translation_custom_dataloader/model.pth')

## Inference

In [19]:
# print(translate(model, "Eine Gruppe von Menschen steht vor einem Iglu ."))

In [20]:
import torch

from attention.transformer import TransformerDecoder, TransformerEncoder

In [21]:
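# The whole module was pickled, so the classes in ``attention.transformer`` must be importable.
# Recent PyTorch releases default to weights_only=True, which rejects pickled modules.
model = torch.load('outputs/translation_custom_dataloader/model.pth', weights_only=False)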

In [22]:
print(model)

Transformer(
  (encoder): TransformerEncoder(
    (embedding): Embedding(
      (embed): Embedding(19293, 512)
    )
    (positional_encoding): PositionalEncoding(
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (layers): ModuleList(
      (0-2): 3 x TransformerBlock(
        (attention): MultiHeadAttention(
          (q): Linear(in_features=64, out_features=64, bias=True)
          (k): Linear(in_features=64, out_features=64, bias=True)
          (v): Linear(in_features=64, out_features=64, bias=True)
          (out): Linear(in_features=512, out_features=512, bias=True)
        )
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (ffn): Sequential(
          (0): Linear(in_features=512, out_features=2048, bias=True)
          (1): ReLU()
          (2): Linear(in_features=2048, out_features=512, bias=True)
        )
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropo

In [23]:
def make_tgt_mask(tgt, pad_token_id=1):
    """
    :param tgt: Target sequence.
    Returns:
        tgt_mask: Target mask.
    """
    batch_size = tgt.shape[0]
    device = tgt.device

    # Same as src_mask but we additionally want to mask tokens from looking forward into the future tokens
    # Note: wherever the mask value is true we want to attend to that token, otherwise we mask (ignore) it.
    sequence_length = tgt.shape[1]  # trg_token_ids shape = (B, T) where T max trg token-sequence length
    trg_padding_mask = (tgt != pad_token_id).view(batch_size, 1, 1, -1)  # shape = (B, 1, 1, T)
    trg_no_look_forward_mask = torch.triu(torch.ones((1, 1, sequence_length, sequence_length), device=device) == 1).transpose(2, 3)

    # logic AND operation (both padding mask and no-look-forward must be true to attend to a certain target token)
    tgt_mask = trg_padding_mask & trg_no_look_forward_mask  # final shape = (B, 1, T, T)
    return tgt_mask
    
def make_src_mask(src, pad_token_id=1):
    """
    :param src: Source sequence.

    Returns:
        src_mask: Source mask.
    """
    batch_size = src.shape[0]

    # src_mask shape = (B, 1, 1, S) check out attention function in transformer_model.py where masks are applied
    # src_mask only masks pad tokens as we want to ignore their representations (no information in there...)
    src_mask = (src != pad_token_id).view(batch_size, 1, 1, -1)
    return src_mask
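
The two masks can be pictured for a single short sequence with nested lists. The sketch below mirrors the logic of `make_tgt_mask` (padding mask AND lower-triangular mask) without tensors or the batch and head dimensions; the helper names and the example token ids are assumptions made for illustration.

```python
PAD_IDX = 1

def padding_mask(token_ids, pad_token_id=PAD_IDX):
    """True where a real token sits, False on padding."""
    return [tok != pad_token_id for tok in token_ids]

def causal_mask(length):
    """Lower-triangular mask: position i may attend to positions 0..i."""
    return [[col <= row for col in range(length)] for row in range(length)]

def target_mask(token_ids, pad_token_id=PAD_IDX):
    """Combine both masks with a logical AND."""
    keep = padding_mask(token_ids, pad_token_id)
    return [[allowed and keep[col] for col, allowed in enumerate(row)]
            for row in causal_mask(len(token_ids))]

# <bos>, two made-up tokens, then one padding position.
mask = target_mask([2, 10, 11, PAD_IDX])
for row in mask:
    print([int(v) for v in row])
```

Each row is a query position and each column a key position; the last column is zero everywhere because it holds a padding token.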

In [24]:
decoder = TransformerDecoder(
            TGT_VOCAB_SIZE,
            EMB_SIZE,
            MAX_LEN,
            NUM_ENCODER_LAYERS,
            expansion_factor=4,
            n_heads=NHEAD
        ).to(DEVICE).eval()

In [25]:
decoder.load_state_dict(model.decoder.state_dict())

<All keys matched successfully>

In [26]:
encoder = TransformerEncoder(
            MAX_LEN,
            SRC_VOCAB_SIZE,
            EMB_SIZE,
            NUM_ENCODER_LAYERS,
            expansion_factor=4,
            n_heads=NHEAD
        ).to(DEVICE).eval()

In [27]:
encoder.load_state_dict(model.encoder.state_dict())

<All keys matched successfully>

In [28]:
model.eval()


In [29]:
def decode(src, tgt):
    """
    Greedy decoding for a single sentence.

    :param src: Encoder input, shape (1, S)
    :param tgt: Decoder input, shape (1, 1), holding the start symbol

    Returns:
        out_labels: Final prediction sequence (list of token indices)
    """
    # Both masks are built once from the initial inputs and reused at every step.
    tgt_mask = make_tgt_mask(tgt).to(DEVICE)
    src_mask = make_src_mask(src).to(DEVICE)
    out_labels = []
    seq_len = src.shape[1]
    with torch.no_grad():
        enc_out = encoder(src)
        # Generate exactly as many tokens as the source sequence contains.
        for i in range(seq_len):
            if i != 0:
                # Feed back the tokens predicted so far.
                tgt = torch.tensor(out_labels, dtype=torch.long).unsqueeze(0).to(DEVICE)
            # ``tgt`` is already a tensor; wrapping it in torch.tensor() again
            # only triggers a copy-construct UserWarning.
            out = decoder(tgt, enc_out, src_mask, tgt_mask)
            # Keep only the logits of the last target position.
            out = out.reshape(-1, out.shape[-1])
            num_of_trg_tokens = len(tgt[0])
            out = out[num_of_trg_tokens-1::num_of_trg_tokens]
            out = torch.argmax(out, dim=-1)
            out_labels.append(out.item())
    return out_labels
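
The `decode` function above always generates as many tokens as the source sequence has and never checks for `<eos>`. A common variant of greedy decoding stops as soon as the end symbol is predicted. The sketch below illustrates that variant in plain Python; it is not the notebook's implementation, and `toy_scores` is a hypothetical stand-in for the decoder that returns one score per vocabulary entry.

```python
BOS_IDX, EOS_IDX = 2, 3

def greedy_decode(next_token_scores, max_len,
                  start_symbol=BOS_IDX, end_symbol=EOS_IDX):
    """Append the highest-scoring token until end_symbol or max_len is reached."""
    generated = [start_symbol]
    for _ in range(max_len):
        scores = next_token_scores(generated)
        # argmax over the vocabulary
        next_token = max(range(len(scores)), key=scores.__getitem__)
        if next_token == end_symbol:
            break
        generated.append(next_token)
    return generated[1:]  # drop the start symbol

def toy_scores(prefix):
    """Hypothetical decoder stand-in: emits 5, 6, 7 and then <eos>."""
    script = [5, 6, 7, EOS_IDX]
    wanted = script[min(len(prefix) - 1, len(script) - 1)]
    return [1.0 if idx == wanted else 0.0 for idx in range(8)]

print(greedy_decode(toy_scores, max_len=10))  # stops at <eos>
print(greedy_decode(toy_scores, max_len=2))   # stops at the length limit
```

Unlike `decode`, this sketch keeps the start symbol in the prefix that is fed back at every step, which matches how `tgt_input` is built during training.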

### Some Test Samples

Top - English\
Bottom - German

A man in an orange hat starring at something.\
Ein Mann mit einem orangefarbenen Hut, der etwas anstarrt.

A Boston Terrier is running on lush green grass in front of a white fence.\
Ein Boston Terrier läuft über saftig-grünes Gras vor einem weißen Zaun.

A girl in karate uniform breaking a stick with a front kick.\
Ein Mädchen in einem Karateanzug bricht einen Stock mit einem Tritt.

Five people wearing winter jackets and helmets stand in the snow, with snowmobiles in the background...\
Fünf Leute in Winterjacken und mit Helmen stehen im Schnee mit Schneemobilen im Hintergrund.

People are fixing the roof of a house.\
Leute Reparieren das Dach eines Hauses.

A man in light colored clothing photographs a group of men wearing dark suits and hats standing arou...\
Ein hell gekleideter Mann fotografiert eine Gruppe von Männern in dunklen Anzügen und mit Hüten, die...

A group of people standing in front of an igloo.\
Eine Gruppe von Menschen steht vor einem Iglu.

A boy in a red uniform is attempting to avoid getting out at home plate, while the catcher in the bl...\
Ein Junge in einem roten Trikot versucht, die Home Base zu erreichen, während der Catcher im blauen ...

A guy works on a building.\
Ein Typ arbeitet an einem Gebäude.

In [30]:
# Full-stops are important for the model to perform well.
src_sentence = "Fünf Leute in Winterjacken und mit Helmen stehen im Schnee mit Schneemobilen im Hintergrund."
start_symbol = BOS_IDX
# Shape (1, S): a batch containing the single tokenized source sentence.
src = text_transform[SRC_LANGUAGE](src_sentence).unsqueeze(0).to(DEVICE)
# Shape (1, 1): the decoder input starts with only the <bos> symbol.
ys = torch.full((1, 1), start_symbol, dtype=torch.long, device=DEVICE)
out = decode(src, ys)
print(" ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(out))).replace("<bos>", "").replace("<eos>", ""))

Five people in winter jackets and helmets are standing in the snow with ski poles in the


