# 2150188401(2) Artificial Intelligence Assignment #2<br>  Transformer from scratch (PyTorch)

Copyright (C) Computer Science & Engineering, Soongsil University. This material is for educational uses only. Some contents are based on the material provided by other paper/book authors and may be copyrighted by them. Written by Haneul Pyeon, September 2024.

**To understand this assignment, please read the provided
PDF file carefully.**

In this notebook, you will learn to implement a transformer model from scratch. By doing so, you will understand the nuts and bolts of Transformers more clearly at a code level.
<br>
There are **5 sections**, and in each section you need to follow the instructions to complete the skeleton code and explain it.

**Note**: certain details are missing or ambiguous on purpose, in order to test your knowledge of the related materials. However, if you really feel that something essential is missing and you cannot proceed to the next step, contact the teaching staff with a clear description of your problem.

### Submitting your work:
<font color=red>**DO NOT clear the final outputs**</font> so that TAs can grade both your code and results.  

### Some helpful tutorials and references for assignment #2-1:
- [1] Original Transformer paper (Vaswani et al., 2017) [[link]](https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf).
- [2] Helpful instructions about how Transformer works [[link]](https://github.com/jadore801120/attention-is-all-you-need-pytorch).   

### Check virtual env and import packages

In [1]:
import os
# assert os.environ["CONDA_DEFAULT_ENV"] == "AI-24", "current environment is not AI-24"

%env CUDA_VISIBLE_DEVICES = 0

import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
import math


# Use the GPU when one is visible, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

env: CUDA_VISIBLE_DEVICES=0


## Overview of the model

![encoder](./imgs/Model_small.png)

## 1. Positional Encoding

According to the original Transformer paper, the positional encoding applies sine functions to the even dimensions and cosine functions to the odd dimensions.

\begin{align*}
    PE_{(pos,2i)} = \sin(pos / 10000^{2i/dim}) \\
    PE_{(pos,2i+1)} = \cos(pos / 10000^{2i/dim})
\end{align*}
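
As a quick sanity check of the two formulas above, the following sketch builds a small encoding table and inspects position 0, where every even column should be $\sin(0) = 0$ and every odd column $\cos(0) = 1$. The helper name `sinusoidal_pe` is hypothetical (it is not part of the assignment skeleton), and `dim` is assumed to be even.

```python
import math
import torch

def sinusoidal_pe(seq_len, dim):
    """Return a (seq_len, dim) table of sinusoidal encodings (dim assumed even)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    two_i = torch.arange(0, dim, 2, dtype=torch.float32)           # 2i = 0, 2, 4, ...
    div = torch.pow(torch.tensor(10000.0), two_i / dim)            # 10000^(2i/dim)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos / div)  # even columns
    pe[:, 1::2] = torch.cos(pos / div)  # odd columns
    return pe

pe = sinusoidal_pe(seq_len=4, dim=6)
print(pe.shape)  # torch.Size([4, 6])
print(pe[0])     # tensor([0., 1., 0., 1., 0., 1.])
```

For $i = 0$ the divisor is $10000^0 = 1$, so the first column reduces to $\sin(pos)$.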

### torch.arange [[link]](https://pytorch.org/docs/stable/generated/torch.arange.html)   
Returns a 1-D tensor of size
$$
\left\lceil \frac{\text{end} - \text{start}}{\text{step}} \right\rceil
$$ with values from the interval [start, end) taken with common difference 'step' beginning from 'start'.

Note that non-integer 'step' is subject to floating point rounding errors when comparing against 'end'; to avoid inconsistency, we advise subtracting a small epsilon from 'end' in such cases.

$$
\text{out}_{i+1} = \text{out}_i + \text{step}
$$

#### Parameter
- start (Number) – the starting value for the set of points. Default: 0.

- end (Number) – the ending value for the set of points

- step (Number) – the gap between each pair of adjacent points. Default: 1.



In [2]:
# example
example_arange1 = torch.arange(5)
print(example_arange1)
example_arange2 = torch.arange(1, 4)
print(example_arange2)
example_arange3 = torch.arange(1, 2.5, 0.5)
print(example_arange3)

tensor([0, 1, 2, 3, 4])
tensor([1, 2, 3])
tensor([1.0000, 1.5000, 2.0000])
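
The size formula $\lceil (\text{end} - \text{start}) / \text{step} \rceil$ can be checked directly against the examples above. This is only an illustrative sketch; the variable names are chosen here for readability.

```python
import math
import torch

# Float step: ceil((2.5 - 1) / 0.5) = ceil(3.0) = 3 elements
start, end, step = 1, 2.5, 0.5
values = torch.arange(start, end, step)
expected_len = math.ceil((end - start) / step)
print(values.numel(), expected_len)  # 3 3

# Integer step that does not divide the range evenly: ceil(10 / 3) = 4 elements
int_values = torch.arange(0, 10, 3)
print(int_values)  # tensor([0, 3, 6, 9])
```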


### torch.pow [[link]](https://pytorch.org/docs/stable/generated/torch.pow.html)   

Takes the power of each element in input with exponent and returns a tensor with the result.

exponent can be either a single float number or a Tensor with the same number of elements as input.

When exponent is a scalar value, the operation applied is:

$$
\text{out}_i = x_i^{\text{exponent}}
$$

When exponent is a tensor, the operation applied is:

$$
\text{out}_i = x_i^{\text{exponent}_i}
$$

When exponent is a tensor, the shapes of input and exponent must be broadcastable.

#### Parameters
- input (Tensor) – the input tensor.

- exponent (float or tensor) – the exponent value

#### Keyword Arguments
- out (Tensor, optional) – the output tensor.
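
The broadcasting rule for a tensor exponent can be illustrated with a column of bases and a row of exponents. This is a minimal sketch with made-up values: a `(3, 1)` input and a `(2,)` exponent broadcast to a `(3, 2)` result.

```python
import torch

base = torch.tensor([[1.0], [2.0], [3.0]])  # shape (3, 1)
exponent = torch.tensor([1.0, 2.0])         # shape (2,)
out = torch.pow(base, exponent)             # broadcast to shape (3, 2)
print(out)
# tensor([[1., 1.],
#         [2., 4.],
#         [3., 9.]])
```

The positional encoding in this section relies on the same mechanism when a `(seq_len, 1)` position column is divided by a `(dim/2,)` vector of divisors.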


In [3]:
a = torch.randn(4)
print(a)

tensor([-0.4115, -0.1367,  0.4284,  1.3782])


In [4]:
exm_pow = torch.pow(a, 2)
print(exm_pow)

tensor([0.1694, 0.0187, 0.1835, 1.8995])


In [5]:
b = torch.arange(1., 5.)
print(b)

tensor([1., 2., 3., 4.])


In [6]:
exm_pow2 = torch.pow(b, exm_pow)
print(exm_pow2)

tensor([ 1.0000,  1.0130,  1.2233, 13.9200])


In [7]:
# example
a = torch.randn(4)
a
torch.pow(a, 2)

exp = torch.arange(1., 5.)
a = torch.arange(1., 5.)
a
exp
torch.pow(a, exp)

tensor([  1.,   4.,  27., 256.])

### torch.sin [[link]](https://pytorch.org/docs/stable/generated/torch.sin.html)   
Returns a new tensor with the sine of the elements of `input`.

$$
\text{out}_i = \sin(\text{input}_i)
$$

#### Parameters
- `input` (Tensor): the input tensor.

#### Keyword Arguments
- `out` (Tensor, optional): the output tensor.


### torch.cos [[link]](https://pytorch.org/docs/stable/generated/torch.cos.html)   

Returns a new tensor with the cosine of the elements of input.

$$
\text{out}_i = \cos(\text{input}_i)
$$

#### Parameters
- input (Tensor) – the input tensor.

#### Keyword Arguments
- out (Tensor, optional) – the output tensor.


In [8]:
class PositionalEncoding(nn.Module):
    def __init__(self, dim, seq_len_max):
        super(PositionalEncoding, self).__init__()
        PE = torch.zeros(seq_len_max, dim) # zeros : Returns a tensor filled with the scalar value 0, with the shape defined by the variable argument size.
        ######################### TO DO #########################

        pos = torch.arange(0, seq_len_max, dtype=torch.float).unsqueeze(1)
        div = torch.pow(10000, torch.arange(0, dim, 2, dtype=torch.float) / dim)

        # 2i -> sin
        PE[:, 0::2] = torch.sin(pos / div)
        # 2i + 1 -> cos
        PE[:, 1::2] = torch.cos(pos / div)

        ######################### TO DO #########################

        ######################### DO NOT CHANGE #########################
        # Positional Encoding is not learnable parameters.
        self.register_buffer('PE', PE.unsqueeze(0))
        ######################### DO NOT CHANGE #########################

    def forward(self, X):
        return X + self.PE[:, :X.size(1)]

## 2. Multi-head attention

![multi_head_attention](./imgs/Attention.png)

In this section, we will implement MultiHeadAttention Class.  
The parameters of the MultiHeadAttention class are defined as follows.
Note that, by the definition of multi-head attention, the dimension of the model equals the per-head word dimension multiplied by the number of heads.

$dim$:  dimension of the model  
$dim$ = dimension of each word (per head) * $head\_num$  
$seq\_len$:  length of the input sequence

This module takes batched sequences X and returns the multi-head attention output.

X size:  $(batch\_num, seq\_len, dim)$  
mask: Tensor to indicate the words involved in score calculation  
output size:  $(batch\_num, seq\_len, dim)$

$W_q$ = linear transformation for query  
$W_k$ = linear transformation for key    
$W_v$ = linear transformation for value  
$W_o$ = linear transformation for concatenated heads

The model operates according to the following equation.  
It should select the values that will participate in score calculation based on the received mask.

$Q = X * W_q$  
$K = X * W_k$  
$V = X * W_v$  

$scores = \frac{QK^T}{\sqrt{word\_dim}}$  
$masked\_scores = mask(\frac{QK^T}{\sqrt{word\_dim}})$  
$probs = softmax(masked\_scores)$  
$heads = probsV$  
$output = heads * W_o$  
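
The equations above can be traced on a tiny single-head example. The sketch below is not the assignment solution: the tensors are random, the linear projections $W_q$, $W_k$, $W_v$, $W_o$ are omitted, and the mask convention (1 = may attend, 0 = blocked) is an assumption made for this illustration.

```python
import math
import torch

torch.manual_seed(0)
batch_num, seq_len, word_dim = 1, 3, 4
Q = torch.randn(batch_num, seq_len, word_dim)
K = torch.randn(batch_num, seq_len, word_dim)
V = torch.randn(batch_num, seq_len, word_dim)

# Hypothetical mask: hide the last key position from every query.
mask = torch.tensor([[[1, 1, 0]]])  # (1, 1, seq_len), broadcast over queries

scores = Q @ K.transpose(-2, -1) / math.sqrt(word_dim)        # (1, seq_len, seq_len)
masked_scores = scores.masked_fill(mask == 0, float("-inf"))  # blocked keys get -inf
probs = torch.softmax(masked_scores, dim=-1)                  # each row sums to 1
heads = probs @ V                                             # (1, seq_len, word_dim)

print(probs[0])  # the last column is all zeros because that key is masked
```

Filling with $-\infty$ before the softmax works because $e^{-\infty} = 0$, so masked keys receive zero probability while the remaining entries are renormalized.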




### torch.matmul [[link]](https://pytorch.org/docs/stable/generated/torch.matmul.html)

Matrix product of two tensors.

The behavior depends on the dimensionality of the tensors as follows:

- If both tensors are 1-dimensional, the dot product (scalar) is returned.

- If both arguments are 2-dimensional, the matrix-matrix product is returned.

- If the first argument is 1-dimensional and the second argument is 2-dimensional, a 1 is prepended to its dimension for the purpose of the matrix multiply. After the matrix multiply, the prepended dimension is removed.

- If the first argument is 2-dimensional and the second argument is 1-dimensional, the matrix-vector product is returned.

If both arguments are at least 1-dimensional and at least one argument is N-dimensional (where \(N > 2\)), then a batched matrix multiply is returned.
If the first argument is 1-dimensional, a 1 is prepended to its dimension for the purpose of the batched matrix multiply and removed after.
If the second argument is 1-dimensional, a 1 is appended to its dimension for the purpose of the batched matrix multiply and removed after.
The non-matrix (i.e. batch) dimensions are broadcasted (and thus must be broadcastable).
For example, if `input` is a $(j \times 1 \times n \times n)$ tensor and `other` is a $(k \times n \times n)$ tensor, `out` will be a $(j \times k \times n \times n)$ tensor.
Note that the broadcasting logic only looks at the batch dimensions when determining if the inputs are broadcastable, and not the matrix dimensions.
For example, if `input` is a $(j \times 1 \times n \times m)$ tensor and `other` is a $(k \times m \times p)$ tensor, these inputs are valid for broadcasting even though the final two dimensions (i.e. the matrix dimensions) are different. `out` will be a $(j \times k \times n \times p)$ tensor.

This operation has support for arguments with sparse layouts. In particular the matrix-matrix (both arguments 2-dimensional) supports sparse arguments with the same restrictions as torch.mm()



#### Parameters
- input (Tensor) – the first tensor to be multiplied

- other (Tensor) – the second tensor to be multiplied

#### Keyword Arguments
- out (Tensor, optional) – the output tensor.

In [9]:
# vector x vector
tensor1 = torch.randn(3)
tensor2 = torch.randn(3)
torch.matmul(tensor1, tensor2).size()

torch.Size([])

In [10]:
# matrix x vector
tensor1 = torch.randn(3, 4)
tensor2 = torch.randn(4)
torch.matmul(tensor1, tensor2).size()

torch.Size([3])

In [11]:
# batched matrix x broadcasted vector
tensor1 = torch.randn(10, 3, 4)
tensor2 = torch.randn(4)
torch.matmul(tensor1, tensor2).size()

torch.Size([10, 3])

In [12]:
# batched matrix x batched matrix
tensor1 = torch.randn(10, 3, 4)
tensor2 = torch.randn(10, 4, 5)
torch.matmul(tensor1, tensor2).size()

torch.Size([10, 3, 5])

In [13]:
# batched matrix x broadcasted matrix
tensor1 = torch.randn(10, 3, 4)
tensor2 = torch.randn(4, 5)
torch.matmul(tensor1, tensor2).size()

torch.Size([10, 3, 5])
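
The examples above do not cover the case where both arguments carry batch dimensions that must be broadcast against each other. The following sketch, with arbitrarily chosen sizes, shows a $(j \times 1 \times n \times m)$ tensor multiplied by a $(k \times m \times p)$ tensor.

```python
import torch

j, k, n, m, p = 2, 3, 4, 5, 6
a = torch.randn(j, 1, n, m)
b = torch.randn(k, m, p)

# Batch dims (j, 1) and (k,) broadcast to (j, k); matrix dims give (n, m) @ (m, p) -> (n, p)
out = torch.matmul(a, b)
print(out.size())  # torch.Size([2, 3, 4, 6])

# Each output slice is an ordinary matrix product of the matching slices.
print(torch.allclose(out[1, 2], a[1, 0] @ b[2], atol=1e-5))
```

This is the pattern used in multi-head attention, where `Q` and `K` share the `(batch_num, head_num)` batch dimensions.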

In [14]:
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, dim, head_num):
        super(MultiHeadAttention, self).__init__()

        self.dim = dim
        self.head_num = head_num
        self.word_dim = dim // head_num

        ######################### TO DO #########################
        # Weight matrices for the linear transformation of Query, Key, and Value (dim x dim)
        self.W_q = nn.Linear(dim, dim)
        self.W_k = nn.Linear(dim, dim)
        self.W_v = nn.Linear(dim, dim)
        # Weight matrix for the linear transformation after combining the multi-head attention results
        self.W_o = nn.Linear(dim, dim)
        ######################### TO DO #########################

    def scaled_dot_product(self, Q, K, V, mask=None):
        ######################### TO DO #########################
        # Calculate the dot-product between Q and K, then scale by the square root of word_dim
        scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(self.word_dim, dtype=torch.float32))

        # If a mask is provided, apply the mask (masked positions will not attend)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        # Apply the softmax function to calculate the attention probabilities
        probs = torch.softmax(scores,dim=-1)

        # Multiply the attention probabilities by V to get the final attention values (heads)
        heads = torch.matmul(probs, V)

        ######################### TO DO #########################
        return heads

    def split(self, X):
        batch_num, seq_len, dim = X.size()
        return X.view(batch_num, seq_len, self.head_num, self.word_dim).transpose(1, 2)

    def combine(self, X):
        batch_num, _, seq_len, _ = X.size()
        return X.transpose(1, 2).contiguous().view(batch_num, seq_len, self.dim)

    def forward(self, X_Q, X_K, X_V, mask=None):
        Q = self.split(self.W_q(X_Q))
        K = self.split(self.W_k(X_K))
        V = self.split(self.W_v(X_V))

        heads = self.scaled_dot_product(Q, K, V, mask)
        output = self.W_o(self.combine(heads))
        return output
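
To see what `split` and `combine` do to the tensor shapes, the sketch below repeats the same `view`/`transpose` calls outside the class on a random tensor. The sizes are arbitrary and chosen only for illustration.

```python
import torch

batch_num, seq_len, head_num, word_dim = 2, 5, 4, 8
dim = head_num * word_dim
X = torch.randn(batch_num, seq_len, dim)

# split: (batch_num, seq_len, dim) -> (batch_num, head_num, seq_len, word_dim)
heads = X.view(batch_num, seq_len, head_num, word_dim).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 4, 5, 8])

# combine: undo the transpose, then merge the head and word dimensions again
restored = heads.transpose(1, 2).contiguous().view(batch_num, seq_len, dim)
print(torch.equal(restored, X))  # True
```

The `.contiguous()` call is needed because `transpose` returns a non-contiguous view, and `view` requires a compatible memory layout. Head 0 simply corresponds to the first `word_dim` features of each token.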



## 3. Encoder

Implement the EncoderLayer class using **one MultiHeadAttention layer, one FFN layer and two normalization layers**.  
**Please apply dropout right after passing through multi-head attention and FFN layer.**

**HINT**  
**1. Normalization is a LayerNorm.**  
**2. LayerNorm layers have learnable parameters. Therefore, you should use two normalization layers.**

### LayerNorm  [[link]](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)

Applies Layer Normalization over a mini-batch of inputs.

This layer implements the operation as described in the paper *Layer Normalization*:

$$
y = \frac{x - \mathbb{E}[x]}{\sqrt{\text{Var}[x] + \epsilon}} \times \gamma + \beta
$$

The mean ($\mathbb{E}[x]$) and variance ($\text{Var}[x]$) are calculated over the last $D$ dimensions, where $D$ is the number of dimensions of `normalized_shape`.

For example, if `normalized_shape` is $(3, 5)$ (a 2-dimensional shape), the mean and variance are computed over the last 2 dimensions of the input (i.e. `input.mean((-2, -1))`).

$\gamma$ and $\beta$ are learnable affine transform parameters of shape `normalized_shape` if `elementwise_affine` is `True`. The variance is calculated via the biased estimator, equivalent to `torch.var(input, unbiased=False)`.


#### Parameters

- **normalized_shape** (int or list or torch.Size): input shape from an expected input of size
$$
[* \times \text{normalized\_shape}[0] \times \text{normalized\_shape}[1] \times \dots \times \text{normalized\_shape}[-1]]
$$

If a single integer is used, it is treated as a singleton list, and this module will normalize over the last dimension, which is expected to be of that specific size.
- **eps** (float): a value added to the denominator for numerical stability. Default: 1e-5

- **elementwise_affine** (bool): a boolean value that, when set to `True`, allows this module to have learnable per-element affine parameters, initialized to ones (for weights) and zeros (for biases). Default: `True`.

- **bias** (bool): If set to `False`, the layer will not learn an additive bias (only relevant if `elementwise_affine` is `True`). Default: `True`.

#### Variables

- **weight**: the learnable weights of the module, of shape `normalized_shape` when `elementwise_affine` is set to `True`. The values are initialized to 1.

- **bias**: the learnable bias of the module, of shape `normalized_shape` when `elementwise_affine` is set to `True`. The values are initialized to 0.
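
The formula can be reproduced by hand and compared with `nn.LayerNorm`. This is a minimal sketch on random data with a freshly initialized layer (so $\gamma = 1$ and $\beta = 0$); the tolerance is an arbitrary choice for float32.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 3, 10)
layer_norm = nn.LayerNorm(10)

# Statistics over the last dimension, using the biased variance estimator
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + layer_norm.eps) * layer_norm.weight + layer_norm.bias

print(torch.allclose(manual, layer_norm(x), atol=1e-5))
```

Because the statistics are taken per token over the feature dimension, the result does not depend on the other tokens in the batch.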


In [15]:
# NLP Example
batch, sentence_length, embedding_dim = 20, 5, 10
embedding = torch.randn(batch, sentence_length, embedding_dim)
layer_norm = nn.LayerNorm(embedding_dim)
nlp_output = layer_norm(embedding)
print("NLP Output:")
print(nlp_output)

# Image Example
N, C, H, W = 20, 5, 10, 10
input_tensor = torch.randn(N, C, H, W)
layer_norm_image = nn.LayerNorm([C, H, W])
output_image = layer_norm_image(input_tensor)
print("\nImage Output:")
print(output_image)

NLP Output:
tensor([[[ 2.1810e+00,  1.1947e+00, -1.1607e+00, -7.7661e-02, -1.6418e-03,
          -1.1999e+00,  1.2851e-01, -6.7876e-02,  4.2783e-03, -1.0007e+00],
         [-3.5309e-01, -1.1017e+00, -1.1988e+00, -2.3843e-01, -5.0636e-01,
          -4.8090e-01,  1.8327e-01,  4.3722e-01,  8.7265e-01,  2.3861e+00],
         [-3.8815e-01, -5.1422e-01,  1.5834e+00,  9.6475e-01,  5.5568e-01,
          -4.6659e-01, -1.1476e+00, -5.9283e-01,  1.4084e+00, -1.4030e+00],
         [ 2.2443e-01, -6.9817e-02,  6.2914e-01, -1.9734e+00,  1.4775e+00,
           5.5984e-02,  1.1884e+00,  3.5822e-01, -1.2105e+00, -6.7999e-01],
         [-1.2426e+00,  2.0512e-01,  3.6039e-01, -8.9463e-02,  6.1274e-01,
           6.8258e-01,  1.4225e+00,  8.5670e-01, -2.0100e+00, -7.9798e-01]],

        [[ 1.6258e-01,  2.3108e+00, -6.2664e-01,  8.5801e-01, -3.5128e-01,
          -3.1568e-01, -9.5877e-01, -3.4723e-01, -1.3594e+00,  6.2765e-01],
         [-2.8773e-01,  1.3654e+00,  7.6392e-01,  8.1258e-01, -6.5804e-01,
     

### Dropout [[link]](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html)

During training, randomly zeroes some of the elements of the input tensor with probability $p$.

The zeroed elements are chosen independently for each forward call and are sampled from a Bernoulli distribution.

Each channel will be zeroed out independently on every forward call.

This has proven to be an effective technique for regularization and preventing the co-adaptation of neurons as described in the paper *Improving neural networks by preventing co-adaptation of feature detectors*.

Furthermore, the outputs are scaled by a factor of

$$
\frac{1}{1 - p}
$$

during training. This means that during evaluation the module simply computes an identity function.

#### Parameters

- **p** (float): probability of an element to be zeroed. Default: 0.5
- **inplace** (bool): If set to `True`, will do this operation in-place. Default: `False`
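
The $\frac{1}{1-p}$ scaling and the identity behaviour in evaluation mode can be observed on a tensor of ones. This is an illustrative sketch; the seed and tensor size are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.2
drop = nn.Dropout(p=p)
x = torch.ones(1000)

drop.train()
y_train = drop(x)           # each element is either 0 or 1 / (1 - p) = 1.25
print(y_train.unique())

drop.eval()
y_eval = drop(x)            # identity in evaluation mode
print(torch.equal(y_eval, x))  # True
```

The scaling keeps the expected value of each element unchanged during training, which is why no rescaling is needed at evaluation time.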


In [16]:
m = nn.Dropout(p=0.2)
input = torch.randn(20, 16)
output = m(input)
print(output)

tensor([[ 1.6940, -0.0240, -0.3172,  0.2933,  1.1613,  0.0838, -0.4921,  0.0000,
          1.9795,  0.0000, -0.0000, -4.2381,  0.6872, -0.5144, -1.3878, -0.5299],
        [ 0.0000,  0.0000,  0.0000,  0.0000,  0.2705, -0.4977,  0.0000,  0.5501,
          0.2791,  0.3285, -1.0191, -1.8658, -0.4828, -0.2147,  1.3184, -1.8300],
        [ 0.0000,  0.0000,  2.2360, -0.1183, -0.7370,  0.0612,  2.1247, -0.0000,
         -0.4143,  0.9947, -0.0000, -0.8055, -0.0000, -0.1768,  0.4132, -1.0906],
        [-1.0306, -2.1310, -0.0000,  0.0000, -0.0000, -2.4116, -1.1415,  2.4301,
         -0.8760,  2.1751, -0.5870, -2.1890, -1.2128, -0.0000, -0.0000,  1.3384],
        [ 0.0000, -1.2953,  0.0000, -0.0000,  0.7542, -0.9406,  0.9553,  0.8360,
         -0.1970, -0.7161,  0.0000,  0.0000, -0.0000,  1.0606,  0.8460,  0.7699],
        [ 1.2954,  1.3428,  0.0000, -0.3040, -1.6917,  2.4120, -0.3873, -0.0000,
          0.5526, -0.1744, -0.0296,  0.0000,  0.0000, -1.1495,  0.7021, -0.0000],
        [ 0.0000,  0.6

In [17]:
class FFN(nn.Module):
    def __init__(self, dim, FFN_dim):
        super(FFN, self).__init__()
        self.FFN_layer = nn.Sequential(nn.Linear(dim, FFN_dim),
                                       nn.ReLU(),
                                       nn.Linear(FFN_dim, dim))
    def forward(self, X):
        return self.FFN_layer(X)

class EncoderLayer(nn.Module):
    def __init__(self, dim, head_num, FFN_dim, dropout):
        super(EncoderLayer, self).__init__()
        ######################### TO DO #########################

        self.multi_head_attention = MultiHeadAttention(dim, head_num)
        self.ffn = FFN(dim, FFN_dim)
        self.dropout = nn.Dropout(p=dropout)

        self.layer_norm1 = nn.LayerNorm(dim)
        self.layer_norm2 = nn.LayerNorm(dim)


        ######################### TO DO #########################

    def forward(self, X, mask):
        ######################### TO DO #########################

        # Multi-head attention
        attn_output = self.multi_head_attention(X, X, X, mask)
        attn_output = self.dropout(attn_output)
        X = self.layer_norm1(X + attn_output)

        # Feed forward network
        ffn_output = self.ffn(X)
        ffn_output = self.dropout(ffn_output)
        output = self.layer_norm2(X + ffn_output)


        ######################### TO DO #########################
        return output

## 4. Decoder

Implement the DecoderLayer class using **two MultiHeadAttention layers (self-attention and cross-attention), one FFN layer and three normalization layers.**  
**Please apply dropout right after passing through two multi-head attention layers and FFN layer.**

**HINT**  
**1. Normalization is a LayerNorm.**  
**2. LayerNorm layers have learnable parameters. Therefore, you should use three normalization layers.**  
**3. The first multi-head attention layer is a self attention layer, and the second attention layer is a cross attention layer. Choose the mask carefully.**

In [18]:
class DecoderLayer(nn.Module):
    def __init__(self, dim, head_num, FFN_dim, dropout):
        super(DecoderLayer, self).__init__()
        ######################### TO DO #########################

        self.multi_head_attention1 = MultiHeadAttention(dim, head_num)
        self.multi_head_attention2 = MultiHeadAttention(dim, head_num)
        self.ffn = FFN(dim, FFN_dim)
        self.dropout = nn.Dropout(p=dropout)

        self.layer_norm1 = nn.LayerNorm(dim)
        self.layer_norm2 = nn.LayerNorm(dim)
        self.layer_norm3 = nn.LayerNorm(dim)

        ######################### TO DO #########################

    def forward(self, X, enc_output, cross_attn_mask, self_attn_mask):
        ######################### TO DO #########################

        # Self attention
        self_attn_output = self.multi_head_attention1(X, X, X, self_attn_mask)
        self_attn_output = self.dropout(self_attn_output)
        X = self.layer_norm1(X + self_attn_output)

        # Cross attention
        cross_attn_output = self.multi_head_attention2(X, enc_output, enc_output, cross_attn_mask)
        cross_attn_output = self.dropout(cross_attn_output)
        X = self.layer_norm2(X + cross_attn_output)

        # Feed forward network
        ffn_output = self.ffn(X)
        ffn_output = self.dropout(ffn_output)
        output = self.layer_norm3(X + ffn_output)

        ######################### TO DO #########################
        return output

## 5. Prepare sample data and Run model

In [19]:
class Transformer(nn.Module):
    def __init__(self, input_lib_size, output_lib_size, dim, head_num, layer_num, \
                 FFN_dim, seq_len_max, dropout):
        super(Transformer, self).__init__()
        self.enc_embeds = nn.Embedding(input_lib_size, dim)
        self.dec_embeds = nn.Embedding(output_lib_size, dim)
        self.pe = PositionalEncoding(dim, seq_len_max)

        self.encoder = nn.ModuleList([EncoderLayer(dim, head_num, FFN_dim, dropout) \
                                             for _ in range(layer_num)])
        self.decoder = nn.ModuleList([DecoderLayer(dim, head_num, FFN_dim, dropout) \
                                             for _ in range(layer_num)])
        self.Linear = nn.Linear(dim, output_lib_size)
        self.dropout = nn.Dropout(dropout)

    def generate_mask(self, src, tgt):
        # Source padding mask, shape (batch_num, 1, 1, src_len); token id 0 is treated as padding.
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)
        # Target padding mask, shape (batch_num, 1, tgt_len, 1).
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)
        seq_length = tgt.size(1)
        # Lower-triangular "no-peek" mask: position i may attend only to positions <= i.
        nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length), diagonal=1)).bool()
        nopeak_mask = nopeak_mask.to(tgt.device)
        tgt_mask = tgt_mask & nopeak_mask
        return src_mask, tgt_mask

    def forward(self, src, tgt):
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        src_embeds = self.dropout(self.pe(self.enc_embeds(src)))
        tgt_embeds = self.dropout(self.pe(self.dec_embeds(tgt)))

        enc_output = src_embeds
        for enc_layer in self.encoder:
            enc_output = enc_layer(enc_output, src_mask)

        dec_output = tgt_embeds
        for dec_layer in self.decoder:
            # DecoderLayer.forward(X, enc_output, cross_attn_mask, self_attn_mask):
            # cross-attention uses the source padding mask, self-attention the causal target mask.
            dec_output = dec_layer(dec_output, enc_output, src_mask, tgt_mask)

        output = self.Linear(dec_output)
        return output
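
To make the decoder-side mask built in `generate_mask` concrete, the sketch below reproduces the same operations on one short, made-up target sequence whose last token is padding (id 0). The variable names are chosen for this illustration only.

```python
import torch

tgt = torch.tensor([[5, 7, 9, 0]])  # one sequence; 0 = padding
seq_length = tgt.size(1)

pad_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)  # (1, 1, 4, 1): one flag per query position
no_peek = (1 - torch.triu(torch.ones(1, seq_length, seq_length), diagonal=1)).bool()
tgt_mask = pad_mask & no_peek                    # broadcast to (1, 1, 4, 4)

print(tgt_mask[0, 0].int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [0, 0, 0, 0]], dtype=torch.int32)
```

Row $i$ lists the key positions that query $i$ may attend to. Note that the row of the padded query is entirely masked; filling such a row with `-inf` makes the softmax return NaN. The random sample data in this notebook is drawn from `1` upward and therefore contains no padding token.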

In [20]:
input_lib_size = 5000
output_lib_size = 5000
dim = 512
head_num = 4
layer_num = 3
FFN_dim = 2048
seq_len_max = 100
dropout = 0.1

transformer = Transformer(input_lib_size, output_lib_size, dim, head_num, layer_num, \
                          FFN_dim, seq_len_max, dropout)
transformer = transformer.to(device)

# Generate random sample data
src_data = torch.randint(1, input_lib_size, (64, seq_len_max)).to(device)
tgt_data = torch.randint(1, output_lib_size, (64, seq_len_max)).to(device)

In [21]:
criterion = nn.CrossEntropyLoss(ignore_index=0)
optimizer = optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)

transformer.train()

for epoch in range(100):
    optimizer.zero_grad()
    output = transformer(src_data, tgt_data[:, :-1])
    loss = criterion(output.contiguous().view(-1, output_lib_size), tgt_data[:, 1:].contiguous().view(-1))
    loss.backward()
    optimizer.step()
    print(f"Epoch: {epoch+1}, Loss: {loss.item()}")

Epoch: 1, Loss: 8.686573028564453
Epoch: 2, Loss: 8.585132598876953
Epoch: 3, Loss: 8.511429786682129
Epoch: 4, Loss: 8.450674057006836
Epoch: 5, Loss: 8.400747299194336
Epoch: 6, Loss: 8.356406211853027
Epoch: 7, Loss: 8.311816215515137
Epoch: 8, Loss: 8.26016902923584
Epoch: 9, Loss: 8.208110809326172
Epoch: 10, Loss: 8.152913093566895
Epoch: 11, Loss: 8.092650413513184
Epoch: 12, Loss: 8.036897659301758
Epoch: 13, Loss: 7.974550247192383
Epoch: 14, Loss: 7.912300109863281
Epoch: 15, Loss: 7.854128837585449
Epoch: 16, Loss: 7.792971611022949
Epoch: 17, Loss: 7.739367961883545
Epoch: 18, Loss: 7.679068565368652
Epoch: 19, Loss: 7.614984512329102
Epoch: 20, Loss: 7.558025360107422
Epoch: 21, Loss: 7.495822429656982
Epoch: 22, Loss: 7.435426712036133
Epoch: 23, Loss: 7.3715105056762695
Epoch: 24, Loss: 7.311313152313232
Epoch: 25, Loss: 7.248730182647705
Epoch: 26, Loss: 7.183887004852295
Epoch: 27, Loss: 7.123903751373291
Epoch: 28, Loss: 7.058160305023193
Epoch: 29, Loss: 6.9984908103
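
Two details of the training loop can be checked in isolation. First, a model that predicts a uniform distribution over `output_lib_size` classes has a cross-entropy of $\ln(\text{output\_lib\_size})$, which gives a rough reference point for the first-epoch loss. Second, the slices `tgt_data[:, :-1]` and `tgt_data[:, 1:]` shift the target by one position so that each decoder input token is trained to predict the next token. The sketch below uses a made-up four-token sequence.

```python
import math
import torch

output_lib_size = 5000
baseline = math.log(output_lib_size)  # loss of a uniform prediction
print(round(baseline, 3))             # 8.517

tgt = torch.tensor([[11, 12, 13, 14]])
decoder_input = tgt[:, :-1]  # tensor([[11, 12, 13]]): what the decoder sees
labels = tgt[:, 1:]          # tensor([[12, 13, 14]]): what it should predict
print(decoder_input, labels)
```

Since the sample data is random, the falling loss in the log above reflects memorization of this single batch rather than generalization.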