<br>
<font>
<div dir=ltr align=center>
<img src="https://cdn.freebiesupply.com/logos/large/2x/sharif-logo-png-transparent.png" width=150 height=150> <br>
<font color=0F5298 size=7>
    Machine learning <br>
<font color=2565AE size=5>
    Computer Engineering Department <br>
    Fall 2024<br>
<font color=3C99D size=5>
    Practical Assignment 5 - NLP - Transformer & BERT <br>
</div>
<div dir=ltr align=center>
<font color=0CBCDF size=4>
   &#x1F349; Masoud Tahmasbi  &#x1F349;  &#x1F353; Arash Ziyaei &#x1F353;
<br>
<font color=0CBCDF size=4>
   &#x1F335; Amirhossein Akbari  &#x1F335;
</div>

____

<font color=9999FF size=4>
&#x1F388; Full Name : Radin Shahadei
<br>
<font color=9999FF size=4>
&#x1F388; Student Number : 401106096

<font color=0080FF size=3>
This notebook covers two key topics. First, we implement a transformer model from scratch and apply it to a specific task. Second, we fine-tune the BERT model using LoRA for efficient adaptation to a downstream task.
</font>
<br>

**Note:**
<br>
<font color=66B2FF size=2>In this notebook, you are free to use any function or model from PyTorch to assist with the implementation. However, TensorFlow is not permitted for this exercise. This ensures consistency and alignment with the tools being focused on.</font>
<br>
<font color=red size=3>**Run All Cells Before Submission**</font>: <font color=FF99CC size=2>Before saving and submitting your notebook, please ensure you run all cells from start to finish. This practice guarantees that your notebook is self-consistent and can be evaluated correctly by others.</font>

# Section 1: Transformer

The transformer architecture consists of two main components: an encoder and a decoder. Each of these components is made up of multiple layers that include self-attention mechanisms and feedforward neural networks. The self-attention mechanism is central to the transformer, as it enables the model to assess the importance of different words in a sentence by considering their relationships with one another.


In this assignment, you should design a transformer model from scratch. You are required to implement the Encoder and Decoder components of a Transformer model.

In [None]:
!pip install torchmetrics

In [None]:
!pip install datasets

In [5]:
# Importing libraries

# PyTorch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, random_split
from torch.utils.tensorboard import SummaryWriter

# Math
import math
import os
import torchmetrics
# HuggingFace libraries
from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from tokenizers.pre_tokenizers import Whitespace

# Pathlib
from pathlib import Path

# typing
from typing import Any

# Library for progress bars in loops
from tqdm import tqdm

# Importing library of warnings
import warnings

## Part 1: Input Embeddings
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">When we observe the Transformer architecture image above, we can see that the Embeddings represent the first step of both blocks.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The <code>InputEmbeddings</code> class below is responsible for converting the input tokens into numerical vectors of <code>d_model</code> dimensions. To prevent the input embeddings from becoming extremely small, we scale them by multiplying by $\sqrt{d_{model}}$.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the image below, we can see how the embeddings are created. First, a sentence is split into tokens (we will explore what tokens are later on). Then, the token IDs (identification numbers) are transformed into the embeddings, which are high-dimensional vectors.</p>

In [None]:
class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int) -> None:
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        return self.embedding(x) * math.sqrt(self.d_model)
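
As a rough illustration of the scaling step, the sketch below applies `nn.Embedding` directly (rather than the `InputEmbeddings` class) to a hypothetical batch of token IDs. The vocabulary size, `d_model`, and the IDs are made-up toy values.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, vocab_size = 8, 20          # toy sizes, chosen only for illustration
embedding = nn.Embedding(vocab_size, d_model)

# Hypothetical batch: 2 sentences of 3 token IDs each
token_ids = torch.tensor([[1, 5, 7], [2, 0, 19]])

raw = embedding(token_ids)           # (batch, seq_len, d_model)
scaled = raw * math.sqrt(d_model)    # same scaling as in forward()

print(scaled.shape)
```

Each token ID selects one row of the embedding table, so the output gains a trailing dimension of size `d_model`.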

## Part 2: positional encoding
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the original paper, the authors add the positional encodings to the input embeddings at the bottom of both the encoder and decoder blocks so the model can have some information about the relative or absolute position of the tokens in the sequence. The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two vectors can be summed and we can combine the semantic content from the word embeddings and positional information from the positional encodings.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the <code>PositionalEncoding</code> class below, we will create a matrix of positional encodings <code>pe</code> with dimensions <code>(seq_len, d_model)</code>. We will start by filling it with $0$s. We will then apply the sine function to the even indices of the positional encoding matrix and the cosine function to the odd ones.</p>

<p style="
    margin-bottom: 5;
    font-size: 22px;
    font-weight: 300;
    font-family: 'Helvetica Neue', sans-serif;
    color: #000000;
  ">
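    \begin{equation}
    \text{Even Indices } (2i): \quad \text{PE(pos, } 2i) = \sin\left(\frac{\text{pos}}{10000^{2i / d_{model}}}\right)
    \end{equation}
    \begin{equation}
    \text{Odd Indices } (2i + 1): \quad \text{PE(pos, } 2i + 1) = \cos\left(\frac{\text{pos}}{10000^{2i / d_{model}}}\right)
    \end{equation}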
</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We apply the sine and cosine functions because they allow the model to determine the position of a word based on the position of other words in the sequence, since for any fixed offset $k$, $PE_{pos + k}$ can be represented as a linear function of $PE_{pos}$. This happens due to the properties of sine and cosine functions, where a shift in the input results in a predictable change in the output.</p>
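
The linearity claim can be checked numerically. The sketch below takes one sine/cosine pair, applies a 2x2 rotation that depends only on the offset $k$, and compares the result with the encoding computed directly at position `pos + k`. The helper names `pe_pair` and `shift_pair` and the toy values are assumptions made for this illustration.

```python
import math

def pe_pair(pos: int, i: int, d_model: int) -> tuple[float, float]:
    """Return (PE(pos, 2i), PE(pos, 2i + 1)) for one sine/cosine pair."""
    angle = pos / (10000 ** (2 * i / d_model))
    return math.sin(angle), math.cos(angle)

def shift_pair(pair: tuple[float, float], k: int, i: int, d_model: int) -> tuple[float, float]:
    """Rotate a (sin, cos) pair by offset k; the matrix does not depend on pos."""
    w = 1.0 / (10000 ** (2 * i / d_model))
    s, c = pair
    return (math.cos(w * k) * s + math.sin(w * k) * c,
            -math.sin(w * k) * s + math.cos(w * k) * c)

d_model, i, pos, k = 8, 1, 5, 3   # arbitrary toy values
shifted = shift_pair(pe_pair(pos, i, d_model), k, i, d_model)
direct = pe_pair(pos + k, i, d_model)
print(shifted, direct)
```

The two printed pairs agree up to floating-point error, which follows from the angle-addition identities for sine and cosine.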

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, seq_len: int, dropout: float) -> None:
        super().__init__()
        self.dropout = nn.Dropout(dropout)
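        # Matrix of shape (seq_len, d_model) that will hold the encodings
        pe = torch.zeros(seq_len, d_model)
        # Column vector of positions, shape (seq_len, 1)
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
        # 1 / 10000^(2i / d_model), computed through exp and log
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even indices
        pe[:, 1::2] = torch.cos(position * div_term)  # odd indices
        # Add a batch dimension: (1, seq_len, d_model)
        pe = pe.unsqueeze(0)
        # Register as a buffer so it is saved with the model but not trained
        self.register_buffer('pe', pe)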

    def forward(self, x):
        x = x + self.pe[:, :x.shape[1], :].requires_grad_(False)
        return self.dropout(x)
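
To make the table concrete, here is a plain-Python sketch that fills a small positional-encoding matrix with the same formula. The function name `positional_encoding` and the sizes are assumptions for this illustration; it mirrors the arithmetic of the class but is not a replacement for it.

```python
import math

def positional_encoding(seq_len: int, d_model: int) -> list[list[float]]:
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for j in range(0, d_model, 2):
            # Same quantity as div_term in the class: exp(j * -ln(10000) / d_model)
            div = math.exp(j * (-math.log(10000.0) / d_model))
            pe[pos][j] = math.sin(pos * div)          # even index
            if j + 1 < d_model:
                pe[pos][j + 1] = math.cos(pos * div)  # odd index
    return pe

pe = positional_encoding(seq_len=4, d_model=6)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

At position 0 every sine entry is 0 and every cosine entry is 1, so the first row alternates between the two values.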

## Part 3: layer normalization
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">When we look at the encoder and decoder blocks, we see several normalization layers called <b><i>Add &amp; Norm</i></b>.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The <code>LayerNormalization</code> class below performs layer normalization on the input data. During its forward pass, we compute the mean and standard deviation of the input data over the last (feature) dimension. We then normalize the input data by subtracting the mean and dividing by the standard deviation plus a small number called epsilon to avoid any division by zero. This process results in a normalized output with a mean of 0 and a standard deviation of approximately 1.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We will then scale the normalized output by a learnable parameter <code>alpha</code> and add a learnable parameter called <code>bias</code>. The training process is responsible for adjusting these parameters. The final result is a layer-normalized tensor, which ensures that the scale of the inputs to layers in the network is consistent.</p>

In [None]:
class LayerNormalization(nn.Module):
    def __init__(self, features: int, eps: float = 1e-6) -> None:
        super().__init__()
        self.eps = eps
        self.alpha = nn.Parameter(torch.ones(features))
        self.bias = nn.Parameter(torch.zeros(features))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        return self.alpha * (x - mean) / (std + self.eps) + self.bias
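
The same arithmetic can be followed on a single feature vector with the standard library. This sketch assumes `alpha = 1` and `bias = 0` (their initial values) and uses `statistics.stdev`, which divides by `n - 1` just as `torch.Tensor.std` does by default. The function name `layer_norm_row` is hypothetical.

```python
import statistics

def layer_norm_row(row, eps=1e-6, alpha=1.0, bias=0.0):
    mean = statistics.fmean(row)
    std = statistics.stdev(row)   # sample standard deviation (n - 1 in the denominator)
    # eps is added to the standard deviation, as in the class above
    return [alpha * (v - mean) / (std + eps) + bias for v in row]

features = [1.0, 2.0, 3.0, 4.0]   # made-up feature vector for one token
normalized = layer_norm_row(features)
print([round(v, 4) for v in normalized])
```

Note that `nn.LayerNorm` differs slightly: it uses the biased variance and adds epsilon inside the square root, so its outputs will not match this class exactly.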

## Part 4: Feed Forward Network
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the fully connected feed-forward network, we apply two linear transformations with a ReLU activation in between. We can mathematically represent this operation as:</p>

<p style="
    margin-bottom: 5;
    font-size: 22px;
    font-weight: 300;
    font-family: 'Helvetica Neue', sans-serif;
    color: #000000;
  ">
    \begin{equation}
    \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
    \end{equation}
</p>


<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">$W_1$ and $W_2$ are the weights, while $b_1$ and $b_2$ are the biases of the two linear transformations.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the <code>FeedForwardBlock</code> below, we will define the two linear transformations—<code>self.linear_1</code> and <code>self.linear_2</code>—and the inner-layer <code>d_ff</code>. The input data will first pass through the <code>self.linear_1</code> transformation, which increases its dimensionality from <code>d_model</code> to <code>d_ff</code>. The output of this operation passes through the ReLU activation function, which introduces non-linearity so the network can learn more complex patterns, and the <code>self.dropout</code> layer is applied to mitigate overfitting. The final operation is the <code>self.linear_2</code> transformation to the dropout-modified tensor, which transforms it back to the original <code>d_model</code> dimension.</p>

In [None]:
class FeedForwardBlock(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float) -> None:
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))
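
A quick shape check shows the expand-then-project pattern. The sketch below uses two standalone `nn.Linear` layers with toy sizes and omits dropout. It is an illustration of the data flow, not the block itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, d_ff = 4, 16             # toy sizes; the paper uses 512 and 2048
linear_1 = nn.Linear(d_model, d_ff)
linear_2 = nn.Linear(d_ff, d_model)

x = torch.randn(2, 3, d_model)    # (batch, seq_len, d_model)
hidden = torch.relu(linear_1(x))  # (batch, seq_len, d_ff), all values >= 0
out = linear_2(hidden)            # back to (batch, seq_len, d_model)

print(hidden.shape, out.shape)
```

The linear layers act on the last dimension only, so every position in the sequence is transformed independently with the same weights.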

## Part 5: Multi Head Attention
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The Multi-Head Attention is the most crucial component of the Transformer. It is responsible for helping the model to understand complex relationships and patterns in the data.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The image below displays how the Multi-Head Attention works. It doesn't include <code>batch</code> dimension because it only illustrates the process for one single sentence.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The Multi-Head Attention block receives the input data split into queries, keys, and values organized into matrices $Q$, $K$, and $V$. Each matrix contains different facets of the input, and they have the same dimensions as the input.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We then linearly transform each matrix by their respective weight matrices $W^Q$, $W^K$, and $W^V$. These transformations will result in new matrices $Q'$, $K'$, and $V'$, which will be split into smaller matrices corresponding to different heads $h$, allowing the model to attend to information from different representation subspaces in parallel. This split creates multiple sets of queries, keys, and values for each head.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">Finally, we concatenate every head into an $H$ matrix, which is then transformed by another weight matrix $W^O$ to produce the multi-head attention output, a matrix $\text{MH-A}$ that retains the input dimensionality.</p>

In [None]:
class MultiHeadAttentionBlock(nn.Module):
    def __init__(self, d_model: int, h: int, dropout: float) -> None:
        super().__init__()
        assert d_model % h == 0, "d_model is not divisible by h"
        self.d_k = d_model // h
        self.h = h
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)
        self.dropout = nn.Dropout(dropout)

    @staticmethod
    def attention(query, key, value, mask, dropout: nn.Dropout):
        d_k = query.shape[-1]
        attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            attention_scores.masked_fill_(mask == 0, -1e9)
        attention_scores = attention_scores.softmax(dim=-1)
        if dropout is not None:
            attention_scores = dropout(attention_scores)
        return (attention_scores @ value), attention_scores

    def forward(self, q, k, v, mask):
        query = self.w_q(q)
        key = self.w_k(k)
        value = self.w_v(v)

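        # (batch, seq_len, d_model) -> (batch, seq_len, h, d_k) -> (batch, h, seq_len, d_k)
        query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1, 2)
        key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1, 2)
        value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1, 2)

        # Attention is computed independently for each head
        x, self.attention_scores = self.attention(query, key, value, mask, self.dropout)

        # (batch, h, seq_len, d_k) -> (batch, seq_len, h, d_k) -> (batch, seq_len, d_model)
        x = x.transpose(1, 2).contiguous().view(x.shape[0], -1, self.h * self.d_k)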
        return self.w_o(x)
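
The head split and the effect of the mask are easier to see on a tiny tensor. The sketch below skips the learned projections ($W^Q$, $W^K$, $W^V$, $W^O$) and dropout, and uses the same tensor as query, key, and value. The sizes and the lower-triangular mask are assumptions chosen for illustration.

```python
import math

import torch

torch.manual_seed(0)

batch, seq_len, d_model, h = 1, 4, 8, 2   # toy sizes
d_k = d_model // h

x = torch.randn(batch, seq_len, d_model)

# Split the last dimension into h heads: (batch, seq_len, d_model) -> (batch, h, seq_len, d_k)
heads = x.view(batch, seq_len, h, d_k).transpose(1, 2)

# Scaled dot-product scores, one (seq_len, seq_len) matrix per head
scores = (heads @ heads.transpose(-2, -1)) / math.sqrt(d_k)

# Lower-triangular mask: position i may only attend to positions <= i
mask = torch.tril(torch.ones(1, 1, seq_len, seq_len))
scores = scores.masked_fill(mask == 0, -1e9)
weights = scores.softmax(dim=-1)

# Merge the heads back: (batch, h, seq_len, d_k) -> (batch, seq_len, d_model)
out = (weights @ heads).transpose(1, 2).contiguous().view(batch, seq_len, h * d_k)

print(heads.shape, weights.shape, out.shape)
```

Masked positions receive a score of `-1e9`, so their softmax weight underflows to zero and each row of `weights` still sums to one.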

## Part 6: Residual Connection
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">When we look at the architecture of the Transformer, we see that each sub-layer, including the <i>self-attention</i> and <i>Feed Forward</i> blocks, adds its output to its input in the <i>Add &amp; Norm</i> layer. This is known as a skip (or residual) connection, which allows the Transformer to train deep networks more effectively by providing a shortcut for the gradient to flow through during backpropagation.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The <code>ResidualConnection</code> class below is responsible for this process. Note that it normalizes the sub-layer's input and then adds the sub-layer's output (after dropout) to the unnormalized input, a pre-norm ordering, whereas the original paper applies the normalization after the addition.</p>

In [None]:
class ResidualConnection(nn.Module):
    def __init__(self, features: int, dropout: float) -> None:
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = LayerNormalization(features)

    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))

## Part 7: Encoder
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We will now build the encoder. We create the <code>EncoderBlock</code> class, consisting of the Multi-Head Attention and Feed Forward layers, plus the residual connections.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the original paper, the Encoder Block repeats six times. We create the <code>Encoder</code> class as an assembly of multiple <code>EncoderBlock</code>s. We also add layer normalization as a final step after processing the input through all its blocks.</p>

In [None]:
class EncoderBlock(nn.Module):
    def __init__(self, features: int, self_attention_block: MultiHeadAttentionBlock, feed_forward_block: FeedForwardBlock, dropout: float) -> None:
        super().__init__()
        self.self_attention_block = self_attention_block
        self.feed_forward_block = feed_forward_block
        self.residual_connections = nn.ModuleList([ResidualConnection(features, dropout) for _ in range(2)])

    def forward(self, x, src_mask):
        x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, src_mask))
        x = self.residual_connections[1](x, self.feed_forward_block)
        return x

In [None]:
class Encoder(nn.Module):
    def __init__(self, features: int, layers: nn.ModuleList) -> None:
        super().__init__()
        self.layers = layers
        self.norm = LayerNormalization(features)

    def forward(self, x, mask):
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

## Part 8: Decoder
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">Similarly, the Decoder also consists of several DecoderBlocks that repeat six times in the original paper. The main difference is that it has an additional sub-layer that performs multi-head attention with a <i>cross-attention</i> component that uses the output of the Encoder as its keys and values while using the Decoder's input as queries.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">For the Output Embedding, we can use the same <code>InputEmbeddings</code> class we use for the Encoder. You can also notice that the self-attention sub-layer is <i>masked</i>, which restricts the model from accessing future elements in the sequence.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We will start by building the <code>DecoderBlock</code> class, and then we will build the <code>Decoder</code> class, which will assemble multiple <code>DecoderBlock</code>s.</p>

In [None]:
class DecoderBlock(nn.Module):
    def __init__(self, features: int, self_attention_block: MultiHeadAttentionBlock, cross_attention_block: MultiHeadAttentionBlock, feed_forward_block: FeedForwardBlock, dropout: float) -> None:
        super().__init__()
        self.self_attention_block = self_attention_block
        self.cross_attention_block = cross_attention_block
        self.feed_forward_block = feed_forward_block
        self.residual_connections = nn.ModuleList([ResidualConnection(features, dropout) for _ in range(3)])

    def forward(self, x, encoder_output, src_mask, tgt_mask):
        x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, tgt_mask))
        x = self.residual_connections[1](x, lambda x: self.cross_attention_block(x, encoder_output, encoder_output, src_mask))
        x = self.residual_connections[2](x, self.feed_forward_block)
        return x

In [None]:
class Decoder(nn.Module):
    def __init__(self, features: int, layers: nn.ModuleList) -> None:
        super().__init__()
        self.layers = layers
        self.norm = LayerNormalization(features)

    def forward(self, x, encoder_output, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, encoder_output, src_mask, tgt_mask)
        return self.norm(x)

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">You can see in the Decoder image that after the stack of <code>DecoderBlock</code>s, a Linear layer and a Softmax function produce the output probabilities. The <code>ProjectionLayer</code> class below is responsible for converting the output of the model into a probability distribution over the <i>vocabulary</i>, from which each output token is selected. The implementation returns log-probabilities by using <code>log_softmax</code>.</p>

In [None]:
class ProjectionLayer(nn.Module):
    def __init__(self, d_model: int, vocab_size: int) -> None:
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        return torch.log_softmax(self.proj(x), dim=-1)
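
Because the layer returns log-probabilities rather than probabilities, it can help to see the relationship on a few numbers. The sketch below is a plain-Python version of log-softmax over hypothetical logits for a three-token vocabulary.

```python
import math

def log_softmax(logits):
    m = max(logits)   # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]

logits = [2.0, 1.0, 0.1]   # made-up scores over a 3-token vocabulary
log_probs = log_softmax(logits)
probs = [math.exp(lp) for lp in log_probs]

# The most likely token is the same whether we compare logits, log-probs, or probs
predicted = max(range(len(logits)), key=lambda j: log_probs[j])
print(sum(probs), predicted)
```

Exponentiating the outputs recovers probabilities that sum to one, and taking the argmax of the log-probabilities selects the same token as taking the argmax of the probabilities.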

## Part 9: Building the Transformer

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We finally have every component of the Transformer architecture ready. We may now construct the Transformer by putting it all together.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the <code>Transformer</code> class below, we will bring together all the components of the model's architecture.</p>

In [None]:
class Transformer(nn.Module):
    def __init__(self, encoder: Encoder, decoder: Decoder, src_embed: InputEmbeddings, tgt_embed: InputEmbeddings, src_pos: PositionalEncoding, tgt_pos: PositionalEncoding, projection_layer: ProjectionLayer) -> None:
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.src_pos = src_pos
        self.tgt_pos = tgt_pos
        self.projection_layer = projection_layer

    def encode(self, src, src_mask):
        src = self.src_embed(src)
        src = self.src_pos(src)
        return self.encoder(src, src_mask)

    def decode(self, encoder_output, src_mask, tgt, tgt_mask):
        tgt = self.tgt_embed(tgt)
        tgt = self.tgt_pos(tgt)
        return self.decoder(tgt, encoder_output, src_mask, tgt_mask)

    def project(self, x):
        return self.projection_layer(x)

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The architecture is finally ready. We now define a function called <code>build_transformer</code>, in which we define the parameters and everything we need to have a fully operational Transformer model for the task of <b>machine translation</b>.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We will set the same parameters as in the original paper, <a href = "https://arxiv.org/pdf/1706.03762.pdf"><i>Attention Is All You Need</i></a>, where $d_{model}$ = 512, $N$ = 6, $h$ = 8, dropout rate $P_{drop}$ = 0.1, and $d_{ff}$ = 2048.</p>
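
A little arithmetic on these hyperparameters shows what they imply for the classes defined above: the per-head dimension, and the number of weights in one attention block (four bias-free `d_model` x `d_model` projections) and one feed-forward block (two `nn.Linear` layers with bias). The variable names are chosen for this sketch only.

```python
d_model, N, h, d_ff = 512, 6, 8, 2048

d_k = d_model // h                     # per-head dimension
attn_params = 4 * d_model * d_model    # w_q, w_k, w_v, w_o, all without bias
ffn_params = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)  # two linear layers with bias

print(d_k, attn_params, ffn_params)
# 64 1048576 2099712
```

Under these settings, each feed-forward block holds roughly twice as many parameters as each attention block.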

In [None]:
def build_transformer(src_vocab_size: int, tgt_vocab_size: int, src_seq_len: int, tgt_seq_len: int, d_model: int = 512, N: int = 6, h: int = 8, dropout: float = 0.1, d_ff: int = 2048) -> Transformer:
    src_embed = InputEmbeddings(d_model, src_vocab_size)
    tgt_embed = InputEmbeddings(d_model, tgt_vocab_size)

    src_pos = PositionalEncoding(d_model, src_seq_len, dropout)
    tgt_pos = PositionalEncoding(d_model, tgt_seq_len, dropout)

    encoder_blocks = [
        EncoderBlock(
            d_model,
            MultiHeadAttentionBlock(d_model, h, dropout),
            FeedForwardBlock(d_model, d_ff, dropout),
            dropout
        ) for _ in range(N)
    ]

    decoder_blocks = [
        DecoderBlock(
            d_model,
            MultiHeadAttentionBlock(d_model, h, dropout),
            MultiHeadAttentionBlock(d_model, h, dropout),
            FeedForwardBlock(d_model, d_ff, dropout),
            dropout
        ) for _ in range(N)
    ]

    encoder = Encoder(d_model, nn.ModuleList(encoder_blocks))
    decoder = Decoder(d_model, nn.ModuleList(decoder_blocks))

    projection_layer = ProjectionLayer(d_model, tgt_vocab_size)

    transformer = Transformer(encoder, decoder, src_embed, tgt_embed, src_pos, tgt_pos, projection_layer)

    for p in transformer.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)

    return transformer

The model is now ready to be trained!

## Part 10: Tokenizer

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">Tokenization is a crucial preprocessing step for our Transformer model. In this step, we convert raw text into a numerical format that the model can process.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">There are several Tokenization strategies. We will use the <i>word-level tokenization</i> to transform each word in a sentence into a token.</p>

<center>
    <img src = "https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F8d5e749c-b0bd-4496-85a1-9b4397ad935f_1400x787.jpeg" width = 800, height= 800>
<p style = "font-size: 16px;
            font-family: 'Georgia', serif;
            text-align: center;
            margin-top: 10px;">Different tokenization strategies. Source: <a href = "https://shaankhosla.substack.com/p/talking-tokenization">shaankhosla.substack.com</a>.</p>
</center>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">After tokenizing a sentence, we map each token to a unique integer ID based on the vocabulary built from the training corpus when the tokenizer is trained. Each integer represents a specific word in the vocabulary.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">Besides the words in the training corpus, Transformers use special tokens for specific purposes. These are some that we will define right away:</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px"><b>• [UNK]:</b> This token is used to identify an unknown word in the sequence.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px"><b>• [PAD]:</b> Padding token to ensure that all sequences in a batch have the same length, so we pad shorter sentences with this token. We use attention masks to <i>"tell"</i> the model to ignore the padded tokens during training since they don't have any real meaning to the task.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px"><b>•  [SOS]:</b> This is a token used to signal the <i>Start of Sentence</i>.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px"><b>•  [EOS]:</b> This is a token used to signal the <i>End of Sentence</i>.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The <code>get_or_build_tokenizer</code> function below ensures that a tokenizer is available before we train the model. It checks whether a saved tokenizer file already exists for the given language; if not, it trains a new word-level tokenizer and saves it.</p>

In [None]:
def get_or_build_tokenizer(config, ds, lang):
    tokenizer_path = Path(config['tokenizer_file'].format(lang))
    if not tokenizer_path.exists():
        tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
        tokenizer.pre_tokenizer = Whitespace()
        trainer = WordLevelTrainer(special_tokens=["[UNK]", "[PAD]", "[SOS]", "[EOS]"], min_frequency=2)
        tokenizer.train_from_iterator(get_all_sentences(ds, lang), trainer=trainer)
        tokenizer.save(str(tokenizer_path))
    else:
        tokenizer = Tokenizer.from_file(str(tokenizer_path))
    return tokenizer
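
To show what word-level tokenization with a minimum frequency does, here is a toy re-implementation in plain Python. It does not use the `tokenizers` library, and the ID ordering, the regular expression used for splitting, and the helper names `build_vocab` and `encode` are assumptions made for this sketch. The real library may assign IDs differently.

```python
import re
from collections import Counter

SPECIAL_TOKENS = ["[UNK]", "[PAD]", "[SOS]", "[EOS]"]
SPLIT_PATTERN = r"\w+|[^\w\s]+"   # words and punctuation runs, split on whitespace

def build_vocab(sentences, min_frequency=2):
    counts = Counter(tok for s in sentences for tok in re.findall(SPLIT_PATTERN, s))
    vocab = {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS)}
    # Most frequent words first, ties broken alphabetically (a choice made for this toy)
    for tok, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= min_frequency:
            vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab):
    # Words missing from the vocabulary map to the [UNK] ID
    return [vocab.get(tok, vocab["[UNK]"]) for tok in re.findall(SPLIT_PATTERN, text)]

corpus = ["the cat sat", "the cat ran", "a dog ran"]   # made-up corpus
vocab = build_vocab(corpus)
print(vocab)
print(encode("the dog ran", vocab))  # [6, 0, 5]
```

Words seen fewer than `min_frequency` times, such as "dog" here, never enter the vocabulary and are encoded as `[UNK]`.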

## Part 11: Load Dataset

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">For this task, we will use the <a href = "https://huggingface.co/datasets/opus_books">OpusBooks dataset</a>, available on 🤗Hugging Face. This dataset consists of two features, <code>id</code> and <code>translation</code>. The <code>translation</code> feature contains pairs of sentences in different languages, such as Spanish and Portuguese, English and French, and so forth.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">I first tried translating sentences from English to Portuguese, my native tongue, but there are only 1.4k examples for this pair, so the results were not satisfying with the current configuration of this model. I then tried the English-French pair because of its higher number of examples (127k), but it would take too long to train with the current configuration. I finally opted to train the model on the English-Italian pair, the same one used in the <a href = "https://youtu.be/ISNdQcPhsts?si=253J39cose6IdsLv">Coding a Transformer from scratch on PyTorch, with full explanation, training and inference</a> video, as it offered a good balance between performance and training time.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We start by defining the <code>get_all_sentences</code> function, which iterates over the dataset and yields the sentences for a given language. The language pair itself is defined later.</p>

In [None]:
def get_all_sentences(ds, lang):
    for item in ds:
        yield item['translation'][lang]

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The <code>get_ds</code> function is defined to load and prepare the dataset for training and validation. In this function, we build or load the tokenizers, split the dataset into 90% training and 10% validation data, and create DataLoaders so the model can iterate over the dataset in batches. The function returns the training and validation DataLoader objects plus the tokenizers for the source and target languages.</p>

In [None]:
def get_ds(config):
    ds_raw = load_dataset(f"{config['datasource']}", f"{config['lang_src']}-{config['lang_tgt']}", split='train')

    tokenizer_src = get_or_build_tokenizer(config, ds_raw, config['lang_src'])
    tokenizer_tgt = get_or_build_tokenizer(config, ds_raw, config['lang_tgt'])

    train_ds_size = int(0.9 * len(ds_raw))
    val_ds_size = len(ds_raw) - train_ds_size
    train_ds_raw, val_ds_raw = random_split(ds_raw, [train_ds_size, val_ds_size])

    train_ds = BilingualDataset(train_ds_raw, tokenizer_src, tokenizer_tgt, config['lang_src'], config['lang_tgt'], config['seq_len'])
    val_ds = BilingualDataset(val_ds_raw, tokenizer_src, tokenizer_tgt, config['lang_src'], config['lang_tgt'], config['seq_len'])

    max_len_src = max(len(tokenizer_src.encode(item['translation'][config['lang_src']]).ids) for item in ds_raw)
    max_len_tgt = max(len(tokenizer_tgt.encode(item['translation'][config['lang_tgt']]).ids) for item in ds_raw)

    print(f'Max length of source sentence: {max_len_src}')
    print(f'Max length of target sentence: {max_len_tgt}')

    train_dataloader = DataLoader(train_ds, batch_size=config['batch_size'], shuffle=True)
    val_dataloader = DataLoader(val_ds, batch_size=1, shuffle=True)

    return train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We define the <code>causal_mask</code> function to create a mask for the attention mechanism of the decoder. This mask prevents the model from having information about future elements in the sequence.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We start by making a square grid filled with ones, whose size is set by the <code>size</code> parameter. Then, <code>torch.triu</code> with <code>diagonal=1</code> keeps the ones strictly above the main diagonal and sets every value on or below it to zero. The function then flips these values with the comparison <code>mask == 0</code>, so the result is <code>True</code> on and below the diagonal (positions a token may attend to) and <code>False</code> above it (future positions). This process is crucial for models that predict the next token in a sequence.</p>

In [None]:
def causal_mask(size):
    mask = torch.triu(torch.ones((1, size, size)), diagonal=1).type(torch.int)
    return mask == 0
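
Running the same two steps for a small size makes the pattern visible. The sketch below repeats the body of `causal_mask` inline for `size = 4`. The intermediate name `upper` is introduced only for this illustration.

```python
import torch

size = 4

# Step 1: ones strictly above the main diagonal, zeros elsewhere
upper = torch.triu(torch.ones((1, size, size)), diagonal=1).type(torch.int)

# Step 2: flip, so True marks positions that may be attended to
mask = upper == 0

print(upper[0])
print(mask[0])
```

Row `i` of the mask is `True` for columns `0..i` only, so token `i` can attend to itself and to earlier tokens but not to later ones.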

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The <code>BilingualDataset</code> class processes the texts of the source and target languages in the dataset by tokenizing them and adding all the necessary special tokens. This class also checks that each sentence fits within the maximum sequence length for both languages, raising an error otherwise, and pads shorter sentences up to that length.</p>
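
The token layout this class produces can be sketched with plain lists. The special-token IDs and the sentence IDs below are hypothetical, and `build_example` is a helper invented for this illustration. It mirrors the padding arithmetic of the class without any tensors.

```python
SOS, EOS, PAD = 2, 3, 1   # hypothetical special-token IDs

def build_example(src_ids, tgt_ids, seq_len):
    enc_pad = seq_len - len(src_ids) - 2   # room for [SOS] and [EOS]
    dec_pad = seq_len - len(tgt_ids) - 1   # room for [SOS] (decoder input) or [EOS] (label)
    if enc_pad < 0 or dec_pad < 0:
        raise ValueError("Sentence is too long")
    encoder_input = [SOS] + src_ids + [EOS] + [PAD] * enc_pad
    decoder_input = [SOS] + tgt_ids + [PAD] * dec_pad
    label = tgt_ids + [EOS] + [PAD] * dec_pad
    return encoder_input, decoder_input, label

enc, dec, lab = build_example(src_ids=[10, 11, 12], tgt_ids=[20, 21], seq_len=6)
print(enc)  # [2, 10, 11, 12, 3, 1]
print(dec)  # [2, 20, 21, 1, 1, 1]
print(lab)  # [20, 21, 3, 1, 1, 1]
```

All three sequences have length `seq_len`. The decoder input starts with `[SOS]` while the label ends the sentence with `[EOS]`, so at each position the label holds the token the decoder should predict next.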

In [None]:
class BilingualDataset(Dataset):
    def __init__(self, ds, tokenizer_src, tokenizer_tgt, src_lang, tgt_lang, seq_len):
        super().__init__()
        self.seq_len = seq_len
        self.ds = ds
        self.tokenizer_src = tokenizer_src
        self.tokenizer_tgt = tokenizer_tgt
        self.src_lang = src_lang
        self.tgt_lang = tgt_lang

        self.sos_token = torch.tensor([tokenizer_tgt.token_to_id("[SOS]")], dtype=torch.int64)
        self.eos_token = torch.tensor([tokenizer_tgt.token_to_id("[EOS]")], dtype=torch.int64)
        self.pad_token = torch.tensor([tokenizer_tgt.token_to_id("[PAD]")], dtype=torch.int64)

    def __len__(self):
        return len(self.ds)

    def __getitem__(self, idx):
        src_target_pair = self.ds[idx]
        src_text = src_target_pair['translation'][self.src_lang]
        tgt_text = src_target_pair['translation'][self.tgt_lang]

        enc_input_tokens = self.tokenizer_src.encode(src_text).ids
        dec_input_tokens = self.tokenizer_tgt.encode(tgt_text).ids

        enc_num_padding_tokens = self.seq_len - len(enc_input_tokens) - 2
        dec_num_padding_tokens = self.seq_len - len(dec_input_tokens) - 1

        if enc_num_padding_tokens < 0 or dec_num_padding_tokens < 0:
            raise ValueError("Sentence is too long")

        encoder_input = torch.cat(
            [
                self.sos_token,
                torch.tensor(enc_input_tokens, dtype=torch.int64),
                self.eos_token,
                torch.tensor([self.pad_token] * enc_num_padding_tokens, dtype=torch.int64),
            ],
            dim=0,
        )

        decoder_input = torch.cat(
            [
                self.sos_token,
                torch.tensor(dec_input_tokens, dtype=torch.int64),
                torch.tensor([self.pad_token] * dec_num_padding_tokens, dtype=torch.int64),
            ],
            dim=0,
        )

        label = torch.cat(
            [
                torch.tensor(dec_input_tokens, dtype=torch.int64),
                self.eos_token,
                torch.tensor([self.pad_token] * dec_num_padding_tokens, dtype=torch.int64),
            ],
            dim=0,
        )

        assert encoder_input.size(0) == self.seq_len
        assert decoder_input.size(0) == self.seq_len
        assert label.size(0) == self.seq_len

        return {
            "encoder_input": encoder_input,
            "decoder_input": decoder_input,
            "encoder_mask": (encoder_input != self.pad_token).unsqueeze(0).unsqueeze(0).int(),
            "decoder_mask": (decoder_input != self.pad_token).unsqueeze(0).int() & causal_mask(decoder_input.size(0)),
            "label": label,
            "src_text": src_text,
            "tgt_text": tgt_text,
        }
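
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The padding arithmetic and the one-token shift between <code>decoder_input</code> and <code>label</code> are easier to see on a toy example. The sketch below uses plain lists instead of tensors; the special-token ids and the helper <code>build_sequences</code> are assumptions made for illustration only.</p>

```python
SOS, EOS, PAD = 2, 3, 1  # hypothetical special-token ids


def build_sequences(src_ids, tgt_ids, seq_len):
    """Assemble encoder input, decoder input and label as plain lists."""
    enc_pad = seq_len - len(src_ids) - 2  # room for [SOS] and [EOS]
    dec_pad = seq_len - len(tgt_ids) - 1  # room for [SOS] (input) or [EOS] (label)
    if enc_pad < 0 or dec_pad < 0:
        raise ValueError("Sentence is too long")

    encoder_input = [SOS] + src_ids + [EOS] + [PAD] * enc_pad
    decoder_input = [SOS] + tgt_ids + [PAD] * dec_pad
    label = tgt_ids + [EOS] + [PAD] * dec_pad
    return encoder_input, decoder_input, label


enc, dec, lab = build_sequences([10, 11, 12], [20, 21], seq_len=6)
print(enc)  # [2, 10, 11, 12, 3, 1]
print(dec)  # [2, 20, 21, 1, 1, 1]
print(lab)  # [20, 21, 3, 1, 1, 1]
```

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">At step <code>t</code> the decoder sees <code>dec[:t + 1]</code> and is trained to predict <code>lab[t]</code>, so the label is the decoder input shifted left by one token, with <code>[EOS]</code> marking the end of the sentence.</p>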

## Part 12: Validation Loop

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We now build the validation loop, which evaluates how well the model translates sentences it has not seen during training.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We define two functions. The first, <code>greedy_decode</code>, generates the model's output one token at a time, always picking the most probable next token. The second, <code>run_validation</code>, runs the validation process: it decodes the model's output and compares it with the reference text for the target sentence.</p>

In [None]:
def greedy_decode(model, source, source_mask, tokenizer_src, tokenizer_tgt, max_len, device):
    sos_idx = tokenizer_tgt.token_to_id('[SOS]')
    eos_idx = tokenizer_tgt.token_to_id('[EOS]')

    encoder_output = model.encode(source, source_mask)
    decoder_input = torch.empty(1, 1).fill_(sos_idx).type_as(source).to(device)

    while decoder_input.size(1) < max_len:
        decoder_mask = causal_mask(decoder_input.size(1)).type_as(source_mask).to(device)
        out = model.decode(encoder_output, source_mask, decoder_input, decoder_mask)
        prob = model.project(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        decoder_input = torch.cat(
            [decoder_input, torch.empty(1, 1).type_as(source).fill_(next_word.item()).to(device)], dim=1
        )

        if next_word == eos_idx:
            break

    return decoder_input.squeeze(0)
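
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The control flow of greedy decoding can be isolated from the Transformer itself. In the sketch below, <code>toy_next_token_scores</code> is a made-up stand-in for the model's projection output, and the token ids are arbitrary; only the loop structure (append the argmax, stop at <code>[EOS]</code> or at <code>max_len</code>) mirrors <code>greedy_decode</code>.</p>

```python
SOS, EOS = 0, 9  # hypothetical special-token ids


def toy_next_token_scores(prefix, vocab_size=10):
    """Stand-in for the model: give the highest score to `last token + 3`, capped at EOS."""
    target = min(prefix[-1] + 3, EOS)
    return [1.0 if token == target else 0.0 for token in range(vocab_size)]


def greedy_decode_toy(max_len):
    decoded = [SOS]  # decoding always starts from the start-of-sentence token
    while len(decoded) < max_len:
        scores = toy_next_token_scores(decoded)
        # Greedy choice: index of the highest score (argmax)
        next_token = max(range(len(scores)), key=scores.__getitem__)
        decoded.append(next_token)
        if next_token == EOS:
            break
    return decoded


print(greedy_decode_toy(max_len=10))  # [0, 3, 6, 9]
print(greedy_decode_toy(max_len=3))   # [0, 3, 6]
```

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The second call shows the length cap: when <code>max_len</code> is reached first, the output simply ends without an <code>[EOS]</code> token.</p>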

In [None]:
def run_validation(model, validation_ds, tokenizer_src, tokenizer_tgt, max_len, device, print_msg, global_step, writer, num_examples=2):
    model.eval()
    count = 0

    source_texts = []
    expected = []
    predicted = []

    try:
        with os.popen('stty size', 'r') as console:
            _, console_width = console.read().split()
            console_width = int(console_width)
    except Exception:
        # No terminal is attached (e.g. inside a notebook): fall back to a default width
        console_width = 80

    with torch.no_grad():
        for batch in validation_ds:
            count += 1
            encoder_input = batch["encoder_input"].to(device)
            encoder_mask = batch["encoder_mask"].to(device)

            assert encoder_input.size(0) == 1, "Batch size must be 1 for validation"

            model_out = greedy_decode(model, encoder_input, encoder_mask, tokenizer_src, tokenizer_tgt, max_len, device)
            source_text = batch["src_text"][0]
            target_text = batch["tgt_text"][0]
            model_out_text = tokenizer_tgt.decode(model_out.detach().cpu().numpy())

            source_texts.append(source_text)
            expected.append(target_text)
            predicted.append(model_out_text)

            print_msg('-' * console_width)
            print_msg(f"{f'SOURCE: ':>12}{source_text}")
            print_msg(f"{f'TARGET: ':>12}{target_text}")
            print_msg(f"{f'PREDICTED: ':>12}{model_out_text}")

            if count == num_examples:
                print_msg('-' * console_width)
                break

    if writer:
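        # Character error rate (recent torchmetrics releases expose text metrics under torchmetrics.text)
        metric = torchmetrics.text.CharErrorRate()
        cer = metric(predicted, expected)
        writer.add_scalar('validation cer', cer, global_step)
        writer.flush()

        # Word error rate
        metric = torchmetrics.text.WordErrorRate()
        wer = metric(predicted, expected)
        writer.add_scalar('validation wer', wer, global_step)
        writer.flush()

        # BLEU expects a list of reference translations for each prediction
        metric = torchmetrics.text.BLEUScore()
        bleu = metric(predicted, [[reference] for reference in expected])
        writer.add_scalar('validation BLEU', bleu, global_step)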
        writer.flush()
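
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">To interpret the logged word error rate, it helps to compute one by hand. The sketch below is a plain-Python illustration of the usual definition (word-level edit distance divided by the number of reference words), assuming whitespace tokenization; it is not the torchmetrics implementation, and the helper names are hypothetical.</p>

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(predicted, expected):
    """Total word-level edits divided by the total number of reference words."""
    errors = sum(edit_distance(e.split(), p.split()) for p, e in zip(predicted, expected))
    total_words = sum(len(e.split()) for e in expected)
    return errors / total_words


print(word_error_rate(["E scrisse un telegramma ."], ["E scrisse un telegramma:"]))  # 0.5
```

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">Here the prediction differs from the reference only in punctuation, yet two of the four reference words count as errors, because the decoded text separates punctuation with spaces. Scores computed on raw decoded text can therefore look worse than the translation quality suggests.</p>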

## Part 13: Training Loop

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We are ready to train our Transformer model on the OPUS Books dataset (<code>opus_books</code>) for the English-to-Italian translation task.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We start by defining the <code>get_model</code> function, which builds the model by calling the <code>build_transformer</code> function we defined earlier. It reads the sequence length and the model dimension from the <code>config</code> dictionary.</p>

In [None]:
def get_model(config, vocab_src_len, vocab_tgt_len):
    model = build_transformer(vocab_src_len, vocab_tgt_len, config["seq_len"], config["seq_len"], d_model=config["d_model"])
    return model

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The <code>config</code> dictionary has been mentioned several times throughout this notebook. Now it is time to create it.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the following cell, we define three helper functions to configure the model and the training process.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">In the <code>get_config</code> function, we define the key parameters for training: <code>batch_size</code> is the number of training examples used in one iteration, <code>num_epochs</code> is the number of passes over the entire training set, <code>lr</code> is the learning rate for the optimizer, and so on. We also select the language pair from the OPUS Books dataset: <code>'lang_src': 'en'</code> makes English the source language and <code>'lang_tgt': 'it'</code> makes Italian the target language.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The <code>get_weights_file_path</code> function constructs the file path for saving or loading the model weights of a specific epoch. The <code>latest_weights_file_path</code> function returns the checkpoint whose file name sorts last in the weights folder, or <code>None</code> if no checkpoint exists.</p>

In [None]:
def get_config():
    return {
        "batch_size": 8,
        "num_epochs": 15,
        "lr": 1e-4,
        "seq_len": 350,
        "d_model": 512,
        "datasource": "opus_books",
        "lang_src": "en",
        "lang_tgt": "it",
        "model_folder": "weights",
        "model_basename": "tmodel_",
        "preload": "latest",
        "tokenizer_file": "tokenizer_{0}.json",
        "experiment_name": "runs/tmodel",
    }

def get_weights_file_path(config, epoch: str):
    model_folder = f"{config['datasource']}_{config['model_folder']}"
    return str(Path(".") / model_folder / f"{config['model_basename']}{epoch}.pt")

def latest_weights_file_path(config):
    model_folder = f"{config['datasource']}_{config['model_folder']}"
    weights_files = sorted(Path(model_folder).glob(f"{config['model_basename']}*"))
    return str(weights_files[-1]) if weights_files else None
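
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">Because <code>latest_weights_file_path</code> picks the last file in name order, it only finds the newest epoch when epoch numbers are zero-padded. The sketch below demonstrates this with empty files in a temporary folder; <code>latest_checkpoint</code> is a hypothetical helper that mirrors the sorting logic.</p>

```python
import tempfile
from pathlib import Path


def latest_checkpoint(folder, basename="tmodel_"):
    """Return the name of the checkpoint that sorts last, or None if there is none."""
    files = sorted(Path(folder).glob(f"{basename}*"))
    return files[-1].name if files else None


with tempfile.TemporaryDirectory() as tmp:
    for epoch in (0, 2, 9, 10):
        (Path(tmp) / f"tmodel_{epoch:02d}.pt").touch()  # zero-padded epoch
    print(latest_checkpoint(tmp))  # tmodel_10.pt

with tempfile.TemporaryDirectory() as tmp:
    for epoch in (2, 9, 10):
        (Path(tmp) / f"tmodel_{epoch}.pt").touch()  # no zero padding
    print(latest_checkpoint(tmp))  # tmodel_9.pt, which is not the newest epoch
```

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">Two-digit padding keeps name order and epoch order in agreement for up to 100 epochs, which is enough for the 15 epochs configured here.</p>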

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">We finally define our last function, <code>train_model</code>, which takes the <code>config</code> dictionary as input.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">This function sets everything up for training. It moves the model and the batches onto the best available device (a GPU when one is present) for faster training, creates the <code>Adam</code> optimizer, and configures the <code>CrossEntropyLoss</code> function, which measures how far the model's predicted tokens are from the reference translations in the dataset.</p>

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The function also contains the loops that iterate over the epochs and training batches, run backpropagation, and update the weights. At the end of each epoch it runs the validation function and saves the current state of the model.</p>

In [None]:
def train_model(config):
    device = (
        "cuda" if torch.cuda.is_available()
        else "mps" if torch.backends.mps.is_available()
        else "cpu"
    )
    print(f"Using device: {device}")
    if device == "cuda":
        print(f"Device name: {torch.cuda.get_device_name()}")
        print(f"Device memory: {torch.cuda.get_device_properties(0).total_memory / 1024 ** 3:.2f} GB")
    device = torch.device(device)

    weights_dir = Path(f"{config['datasource']}_{config['model_folder']}")
    weights_dir.mkdir(parents=True, exist_ok=True)

    train_dataloader, val_dataloader, tokenizer_src, tokenizer_tgt = get_ds(config)
    model = get_model(config, tokenizer_src.get_vocab_size(), tokenizer_tgt.get_vocab_size()).to(device)
    writer = SummaryWriter(config['experiment_name'])
    optimizer = Adam(model.parameters(), lr=config['lr'], eps=1e-9)

    initial_epoch, global_step = 0, 0
    model_filename = (
        latest_weights_file_path(config)
        if config['preload'] == 'latest'
        else get_weights_file_path(config, config['preload'])
        if config['preload'] else None
    )

    if model_filename:
        print(f'Preloading model from {model_filename}')
        # map_location lets a checkpoint saved on one device be restored on another
        checkpoint = torch.load(model_filename, map_location=device)
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        initial_epoch = checkpoint['epoch'] + 1
        global_step = checkpoint['global_step']
    else:
        print("No pretrained model found, starting from scratch.")

    # Labels are target-language ids, so the padding id to ignore comes from the target tokenizer
    loss_fn = nn.CrossEntropyLoss(ignore_index=tokenizer_tgt.token_to_id('[PAD]'), label_smoothing=0.1).to(device)

    for epoch in range(initial_epoch, config['num_epochs']):
        torch.cuda.empty_cache()
        model.train()
        epoch_loss = 0
        batch_iterator = tqdm(train_dataloader, desc=f"Epoch {epoch + 1}/{config['num_epochs']}")

        for batch in batch_iterator:
            encoder_input = batch['encoder_input'].to(device)
            decoder_input = batch['decoder_input'].to(device)
            encoder_mask = batch['encoder_mask'].to(device)
            decoder_mask = batch['decoder_mask'].to(device)
            labels = batch['label'].to(device)

            encoder_output = model.encode(encoder_input, encoder_mask)
            decoder_output = model.decode(encoder_output, encoder_mask, decoder_input, decoder_mask)
            proj_output = model.project(decoder_output)

            loss = loss_fn(proj_output.view(-1, tokenizer_tgt.get_vocab_size()), labels.view(-1))
            epoch_loss += loss.item()

            writer.add_scalar('train loss', loss.item(), global_step)

            loss.backward()
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)

            global_step += 1
            batch_iterator.set_postfix({"loss": f"{loss.item():.4f}"})

        run_validation(
            model, val_dataloader, tokenizer_src, tokenizer_tgt,
            config['seq_len'], device, lambda msg: batch_iterator.write(msg),
            global_step, writer
        )

        model_path = get_weights_file_path(config, f"{epoch:02d}")
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'global_step': global_step
        }, model_path)
        print(f"Model saved at {model_path}")
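
<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">The loss used above combines two ideas: padded positions are skipped, and label smoothing spreads a small share of the target probability over all classes. The plain-Python sketch below follows the usual uniform-smoothing definition under those two assumptions; the function name is hypothetical and the code is meant as an illustration, not as a replacement for <code>nn.CrossEntropyLoss</code>.</p>

```python
import math


def smoothed_cross_entropy(logits, targets, ignore_index, smoothing=0.1):
    """Mean label-smoothed cross-entropy over positions whose target is not ignore_index."""
    total, counted = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        log_z = math.log(sum(math.exp(x) for x in row))
        log_probs = [x - log_z for x in row]          # log-softmax
        nll = -log_probs[target]                      # usual cross-entropy term
        uniform = -sum(log_probs) / len(row)          # cross-entropy against a uniform target
        total += (1 - smoothing) * nll + smoothing * uniform
        counted += 1
    return total / counted  # assumes at least one non-ignored position


PAD = 1  # hypothetical padding id
logits = [[0.0, 0.0, 0.0, 0.0],   # real token, uniform prediction over 4 classes
          [5.0, 1.0, 1.0, 1.0]]   # padded position, skipped
loss = smoothed_cross_entropy(logits, targets=[2, PAD], ignore_index=PAD)
print(f"{loss:.4f}")  # 1.3863, i.e. log(4)
```

<p style = "font-family: 'Helvetica Neue', Arial, sans-serif; text-align: left; font-size: 17.5px">With a uniform prediction both terms equal <code>log(4)</code>, so the result does not depend on the smoothing value, and the padded row has no effect on it.</p>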

We can now train the model!

In [None]:
if __name__ == '__main__':
    warnings.filterwarnings('ignore')
    config = get_config()
    train_model(config)

Using device: cuda
Device name: Tesla P100-PCIE-16GB
Device memory: 15.887939453125 GB
Max length of source sentence: 309
Max length of target sentence: 274
Preloading model opus_books_weights/tmodel_00.pt


Processing Epoch 01: 100%|██████████| 3638/3638 [15:05<00:00,  4.02it/s, loss=5.731]


--------------------------------------------------------------------------------
    SOURCE: I said I'd pack.
    TARGET: Dissi che avrei fatto il bagaglio io.
 PREDICTED: Io mi pare che mi .
--------------------------------------------------------------------------------
    SOURCE: She, he thought, would cast no stones, but would simply and resolutely go and see Anna and receive her at her own house.
    TARGET: Gli sembrava che non avrebbe scagliato lei la prima pietra, e con semplicità e franchezza sarebbe andata da Anna e l’avrebbe ricevuta.
 PREDICTED: Egli , non aveva detto , ma non aveva detto , e non aveva detto che egli aveva detto , e si sentiva a lei .
--------------------------------------------------------------------------------


Processing Epoch 02: 100%|██████████| 3638/3638 [15:05<00:00,  4.02it/s, loss=5.369]


--------------------------------------------------------------------------------
    SOURCE: It was impossible not to smile, not to kiss the little thing; impossible not to hold out a finger to her, which she caught, screaming and wriggling the whole surface of her little body; impossible not to approach one's lips to her mouth and let her draw them in, her way of kissing.
    TARGET: Non si poteva non sorridere, non baciare la piccola; non si poteva non tenderle un dito al quale ella si aggrappò stringendo e sussultando in tutto il corpo; non si poteva non tenderle il labbro ch’ella afferrò nella piccola bocca a mo’ di bacio.
 PREDICTED: Non era nulla di nuovo , non solo il suo sorriso , non si , non si , e non si , e , il suo sguardo , e , senza , e non si , e non si , e non si .
--------------------------------------------------------------------------------
    SOURCE: A bolder hand might have turned the game even at that point.
    TARGET: Una mano più ardita avrebbe anche a quest

Processing Epoch 03: 100%|██████████| 3638/3638 [15:06<00:00,  4.01it/s, loss=5.057]


--------------------------------------------------------------------------------
    SOURCE: Another five minutes went by, and then I asked her to look again.
    TARGET: Passarono altri cinque minuti, e poi le dissi di guardare ancora.
 PREDICTED: Un quarto di poco dopo , e io mi a casa .
--------------------------------------------------------------------------------
    SOURCE: Kitty remained silent.
    TARGET: Kitty taceva.
 PREDICTED: Kitty si fermò .
--------------------------------------------------------------------------------


Processing Epoch 04: 100%|██████████| 3638/3638 [15:06<00:00,  4.01it/s, loss=5.628]


--------------------------------------------------------------------------------
    SOURCE: Levin saw no one and nothing; he did not take his eyes off his bride.
    TARGET: Levin non notava nulla e nessuno; senza abbassar gli occhi, guardava la sposa.
 PREDICTED: Levin non vedeva nulla e non lo guardò con la sua mano .
--------------------------------------------------------------------------------
    SOURCE: 'What's in it?' said the Queen.
    TARGET: — Che contiene? — domandò la Regina
 PREDICTED: — Che cosa è ? — disse la Regina .
--------------------------------------------------------------------------------


Processing Epoch 05: 100%|██████████| 3638/3638 [15:05<00:00,  4.02it/s, loss=3.472]


--------------------------------------------------------------------------------
    SOURCE: Dolly told Masha's crime.
    TARGET: — E Dar’ja Aleksandrovna raccontò il delitto di Maša.
 PREDICTED: Dolly disse : — Dolly .
--------------------------------------------------------------------------------
    SOURCE: It was about a young girl who lived in the Hartz Mountains, and who had given up her life to save her lover's soul; and he died, and met her spirit in the air; and then, in the last verse, he jilted her spirit, and went on with another spirit - I'm not quite sure of the details, but it was something very sad, I know.
    TARGET: Parlava d’una fanciulla che abitava nelle montagne dell’Hartz, e che aveva dato la vita per salvare quella dell’innamorato: questi, poi, aveva incontrato lo spirito di lei in aria; quindi, nell’ultima strofa, egli respingeva lo spirito della fanciulla, e se ne andava con lo spirito d’un’altra.
 PREDICTED: Era una bambina che aveva fatto un giovane e che

Processing Epoch 06: 100%|██████████| 3638/3638 [15:05<00:00,  4.02it/s, loss=3.837]


--------------------------------------------------------------------------------
    SOURCE: Levin could not at all understand what was the matter, and was astounded at the ardour with which they discussed the question whether Flerov's case should be put to the ballot or not.
    TARGET: Levin non riusciva in nessun modo a capire di che si trattasse e si stupiva della passionalità con cui si esaminava la questione se mettere o no ai voti la opinione su Flerov.
 PREDICTED: Levin non poteva non capire quello che era quello che voleva fare e che si con la questione della questione che si doveva fare o che si sarebbe dovuto fare o non si poteva far nulla .
--------------------------------------------------------------------------------
    SOURCE: Kitty did not say a word of this; she spoke only of her state of mind.
    TARGET: Ma Kitty non disse neppure una parola di questo. Parlava solo delle sue condizioni di spirito.
 PREDICTED: Kitty non diceva una parola di questa parola , ma le par

Processing Epoch 07: 100%|██████████| 3638/3638 [15:05<00:00,  4.02it/s, loss=2.816]


--------------------------------------------------------------------------------
    SOURCE: 'That's the reason they're called lessons,' the Gryphon remarked: 'because they lessen from day to day.'
    TARGET: — Ma è questa la ragione perchè si chiamano lezioni, — osservò il Grifone: — perchè c'è una lesione ogni giorno.
 PREDICTED: — È la ragione del tutto , — disse il Grifone , — perché hanno il giorno , perché si il giorno .
--------------------------------------------------------------------------------
    SOURCE: These I set up to dry within my circle or hedge, and when they were fit for use I carried them to my cave; and here, during the next season, I employed myself in making, as well as I could, a great many baskets, both to carry earth or to carry or lay up anything, as I had occasion; and though I did not finish them very handsomely, yet I made them sufficiently serviceable for my purpose; thus, afterwards, I took care never to be without them; and as my wicker-ware decayed

Processing Epoch 08: 100%|██████████| 3638/3638 [15:05<00:00,  4.02it/s, loss=4.325]


--------------------------------------------------------------------------------
    SOURCE: One memory after another, both joyful and painful, rose in her mind, and for a moment she forgot why she had come.
    TARGET: Uno dietro l’altro i ricordi, felici e tormentosi, si sollevarono nell’animo suo, e per un attimo ella dimenticò perché si trovava là.
 PREDICTED: Un ricordo dopo un altro , un ’ altra gioia e l ’ agitazione , e nello stesso tempo si sentì un attimo di esitazione .
--------------------------------------------------------------------------------
    SOURCE: And remembering how when he met him he had corrected the young man's use of a word that betrayed ignorance Koznyshev found an explanation of the article.
    TARGET: E ricordatosi come, nell’incontro, avesse corretto quel giovane in una parola che rivelava la sua ignoranza, Sergej Ivanovic trovò la spiegazione del senso dell’articolo.
 PREDICTED: E ricordò come egli avesse detto che era stato detto al giovane russo , 

Processing Epoch 09: 100%|██████████| 3638/3638 [15:05<00:00,  4.02it/s, loss=3.236]


--------------------------------------------------------------------------------
    SOURCE: 'Oh, but we are not talking about that,' said Kitty, blushing.
    TARGET: — Ma non parliamo di questo — disse Kitty, arrossendo.
 PREDICTED: — Ah , ma noi non parliamo più — disse Kitty , arrossendo .
--------------------------------------------------------------------------------
    SOURCE: I have been bothering about it a long time.
    TARGET: Ho dovuto faticare per averlo.
 PREDICTED: Mi sono per un tempo .
--------------------------------------------------------------------------------


Processing Epoch 10: 100%|██████████| 3638/3638 [15:05<00:00,  4.02it/s, loss=4.410]


--------------------------------------------------------------------------------
    SOURCE: "That I am not Edward Rochester's bride is the least part of my woe," I alleged: "that I have wakened out of most glorious dreams, and found them all void and vain, is a horror I could bear and master; but that I must leave him decidedly, instantly, entirely, is intolerable.
    TARGET: — Non poter essere più la sposa di Edoardo Rochester, — aggiunsi, — ecco il mio supplizio; svegliarmi dal più dolce dei sogni per non trovare intorno a me altro che vuoto e tristezza, ecco quello che posso ancora sopportare; ma doverlo lasciare risolutamente, subito per sempre, è intollerabile.
 PREDICTED: — Non sono il signor Edoardo , — interruppe la parte dei miei , — che sono le mie idee , e non potrei , ma tutti i piaceri di pensiero e non posso ; mi sono , ma è un aspetto volgare , che lo so .
--------------------------------------------------------------------------------
    SOURCE: "_You_," I said, "a f

Processing Epoch 11: 100%|██████████| 3638/3638 [15:05<00:00,  4.02it/s, loss=3.755]


--------------------------------------------------------------------------------
    SOURCE: His comrades had wakened long before and had had time to get hungry and have their breakfast.
    TARGET: I compagni s’erano svegliati da un pezzo e avevano avuto il tempo di farsi venir fame e di far colazione.
 PREDICTED: I suoi compagni di pranzo prima che prima erano stati e avrebbero portato la colazione .
--------------------------------------------------------------------------------
    SOURCE: He said he couldn't say for certain of course, but that he rather thought he was.
    TARGET: Rispose di non poterlo dire sicuramente, ma inclinava piuttosto per il sì.
 PREDICTED: Disse che non sapeva nulla di straordinario , ma che credeva che egli fosse un poco .
--------------------------------------------------------------------------------


Processing Epoch 12: 100%|██████████| 3638/3638 [15:05<00:00,  4.02it/s, loss=2.608]


--------------------------------------------------------------------------------
    SOURCE: Georgiana said she dreaded being left alone with Eliza; from her she got neither sympathy in her dejection, support in her fears, nor aid in her preparations; so I bore with her feeble-minded wailings and selfish lamentations as well as I could, and did my best in sewing for her and packing her dresses.
    TARGET: Georgiana diceva di non voler rimanere sola con sua sorella, perché non poteva trovare in lei né simpatia nel dolore, né appoggio nei suoi dolori, né aiuto nei suoi preparativi.
 PREDICTED: Georgiana mi disse che era rimasta sola con lei , Elisa e non aveva né tenerezza né tenerezza né tenerezza né tenerezza né tenerezza né tenerezza né tenerezza né la sua ammirazione , e che potesse meglio per le sue .
--------------------------------------------------------------------------------
    SOURCE: So it would be more correct to say that women are seeking for duties, and quite rightly.
 

Processing Epoch 13: 100%|██████████| 3638/3638 [15:06<00:00,  4.01it/s, loss=2.333]


--------------------------------------------------------------------------------
    SOURCE: Because there is nothing proportionate between the armed and the unarmed; and it is not reasonable that he who is armed should yield obedience willingly to him who is unarmed, or that the unarmed man should be secure among armed servants.
    TARGET: Perché da uno armato a uno disarmato non è proporzione alcuna; e non è ragionevole che chi è armato obedisca volentieri a chi è disarmato, e che il disarmato stia sicuro intra servitori armati.
 PREDICTED: Perché non c ’ è nulla di particolare , di e di , non è che sia ragionevole che , sia quello che si acquista , o si , o si , o si le signore che si in grandi soldati .
--------------------------------------------------------------------------------
    SOURCE: But why does he not come?
    TARGET: Ma come mai non viene?
 PREDICTED: Ma perché non è venuto ?
--------------------------------------------------------------------------------


Processing Epoch 14: 100%|██████████| 3638/3638 [15:06<00:00,  4.01it/s, loss=2.712]


--------------------------------------------------------------------------------
    SOURCE: And she wrote out a telegram.
    TARGET: E scrisse un telegramma:
 PREDICTED: E scrisse un telegramma .
--------------------------------------------------------------------------------
    SOURCE: We tried to get away from it at Marlow.
    TARGET: Provammo a fuggire e a riparare a Marlow.
 PREDICTED: Ne fuori per il percorso .
--------------------------------------------------------------------------------


# Section 2: BERT and LoRA

Welcome to Section 2 of our Machine Learning assignment! I hope you've been enjoying the journey so far! 😊

In this section, you will gain hands-on experience with [BERT](https://arxiv.org/abs/1810.04805) (Bidirectional Encoder Representations from Transformers) and [LoRA](https://arxiv.org/abs/2106.09685) (Low-Rank Adaptation) for text classification tasks. The section is divided into three main parts, each focusing on a different aspect of the workflow.

## Assignment Structure

### Part 1: Data Preparation and Preprocessing
In this part, you will work with a text classification dataset. You will learn how to:
- Download and load the dataset
- Perform necessary preprocessing steps
- Implement data cleaning and transformation techniques
- Prepare the data in a format suitable for BERT training

### Part 2: Building a Small BERT Model
You will create and train a small BERT model from scratch using the Hugging Face [Transformers](https://huggingface.co/docs/transformers/en/index) library. This part will help you understand:
- The architecture of BERT
- How to configure and initialize a BERT model
- Training process and optimization
- Model evaluation and performance analysis

### Part 3: Fine-tuning with LoRA
In the final part, you will work with a pre-trained [TinyBERT](https://arxiv.org/abs/1909.10351) model and use LoRA for efficient fine-tuning. You will:
- Load a pre-trained TinyBERT model
- Implement LoRA adaptation and fine-tune the model on our classification task
- Compare the results with the previous approach

---

> **NOTE**:  
> Throughout this notebook, make an effort to include sufficient visualizations to enhance understanding:  
> - In the data processing section, display the results of your operations (e.g., show data samples or distributions after preprocessing).  
> - In the classification section, report various evaluation metrics such as accuracy, precision, recall, and F1-score to thoroughly assess your model's performance.  
> - Additionally, take a moment to compare the sizes of the models discussed in this notebook with today’s enormous models. This will help you appreciate the challenges and computational demands associated with training such massive models. 😵‍💫

---


## Part 1: Data Preparation and Preprocessing
We'll be working with the [Consumer Complaint](https://catalog.data.gov/dataset/consumer-complaint-database) dataset, which contains ***complaints*** submitted by consumers about financial products and services. Our goal is to build a classifier that can automatically identify the type of complaint based on the consumer's text description. For this task, we will work with a smaller subset of the dataset, available for download through this [link](https://drive.google.com/file/d/1SpIHksR-WzruEgUjp1SQKGG8bZPnJJoN/view?usp=sharing).

In [6]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification, AdamW, BertConfig
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm

In [21]:
import gdown
import zipfile

### 1.2 Loading the Data

In [23]:
# Import necessary libraries
import pandas as pd

df = pd.read_csv("complaints_small.csv")

# Display the first few rows of the dataset
print("First 5 rows of the dataset:")
print(df.head())

# Display basic information about the dataset
print("\nDataset information:")
print(df.info())

# Display the column names
print("\nColumn names:")
print(df.columns)

# Display the distribution of complaint types
print("\nDistribution of complaint types:")
print(df['Product'].value_counts())

First 5 rows of the dataset:
                                             Product  \
0  Credit reporting, credit repair services, or o...   
1                                       Student loan   
2  Credit reporting or other personal consumer re...   
3  Credit reporting, credit repair services, or o...   
4  Credit reporting or other personal consumer re...   

                        Consumer complaint narrative  
0  My credit reports are inaccurate. These inaccu...  
1  Beginning in XX/XX/XXXX I had taken out studen...  
2  I am disputing a charge-off on my account that...  
3  I did not consent to, authorize, nor benefit f...  
4  I am a federally protected consumer and I am a...  

Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 941128 entries, 0 to 941127
Data columns (total 2 columns):
 #   Column                        Non-Null Count   Dtype 
---  ------                        --------------   ----- 
 0   Product                       941128 non-null  ob

### 1.3 Data Sampling and Class Distribution Analysis

Working with large datasets can be computationally intensive during development. Additionally, an imbalanced class distribution can affect model performance. In this section, you'll sample the data and analyze the class distribution to make informed decisions about your training dataset.

---

We'll work with a manageable portion of the data to develop and test our approach. While using the complete dataset would likely yield better results, a smaller sample allows us to prototype our solution more efficiently.


In [24]:
# Import necessary libraries
import pandas as pd

# Sample a portion of the dataset (e.g., 10%)
sample_fraction = 0.10  # Adjust this fraction as needed
sampled_df = df.sample(frac=sample_fraction, random_state=42)

# Display the first few rows of the sampled dataset
print("First 5 rows of the sampled dataset:")
print(sampled_df.head())

# Print the shape of the original and sampled datasets
print("\nShape of the original dataset:", df.shape)
print("Shape of the sampled dataset:", sampled_df.shape)

# Display the distribution of complaint types in the sampled dataset
print("\nDistribution of complaint types in the sampled dataset:")
print(sampled_df['Product'].value_counts())

First 5 rows of the sampled dataset:
                                                  Product  \
335123  Credit reporting or other personal consumer re...   
601718                                           Mortgage   
847752  Credit reporting, credit repair services, or o...   
765316  Credit reporting or other personal consumer re...   
798300  Credit reporting, credit repair services, or o...   

                             Consumer complaint narrative  
335123  Upon reviewing my credit report, I have identi...  
601718  I was doing a rate check to refinance. The age...  
847752  This is my 2nd request that I have been a vict...  
765316  I'm sending this compliant to inform credit bu...  
798300  Im submitting a complaint to you today to info...  

Shape of the original dataset: (941128, 2)
Shape of the sampled dataset: (94113, 2)

Distribution of complaint types in the sampled dataset:
Product
Credit reporting, credit repair services, or other personal consumer reports    32262


---

Let's examine the distribution of ***complaint*** types in our dataset. You'll notice that some products have significantly more instances than others, and some categories are quite similar. For example:

- Multiple categories might refer to similar financial products
- Some categories might have very few examples
- Certain categories might be subcategories of others

You have two main approaches to handle this situation:

1. **Merging Similar Classes:** Identify categories that represent similar products or services and combine them into more robust, general categories.

2. **Selecting Major Classes:** Keep only the categories with sufficient representation.



> You may choose any approach, but after this step, your data must include **at least five** distinct classes.



In [25]:
# Display the number of complaints in each product category
print("Number of complaints per product category:")
print(sampled_df['Product'].value_counts())

# Identify under-represented classes (e.g., categories with fewer than 1000 complaints)
category_counts = sampled_df['Product'].value_counts()
under_represented = category_counts[category_counts < 1000]
print("\nUnder-represented product categories (fewer than 1000 complaints):")
print(under_represented)

# Handle class imbalance by merging similar categories
# Example: Merge similar credit reporting categories
sampled_df['Product'] = sampled_df['Product'].replace({
    'Credit reporting or other personal consumer reports': 'Credit reporting',
    'Credit reporting, credit repair services, or other personal consumer reports': 'Credit reporting',
    'Credit reporting': 'Credit reporting'
})

# Example: Merge similar payday loan categories
sampled_df['Product'] = sampled_df['Product'].replace({
    'Payday loan, title loan, or personal loan': 'Payday loan',
    'Payday loan, title loan, personal loan, or advance loan': 'Payday loan',
    'Payday loan': 'Payday loan'
})

# Example: Merge similar credit card categories
sampled_df['Product'] = sampled_df['Product'].replace({
    'Credit card or prepaid card': 'Credit card',
    'Credit card': 'Credit card'
})

# Drop categories with very few examples (e.g., fewer than 1000 complaints)
sampled_df = sampled_df[sampled_df['Product'].map(sampled_df['Product'].value_counts()) >= 1000].copy()  # copy so later column assignments do not write to a view

# Display the updated distribution of complaint types
print("\nUpdated distribution of complaint types after merging and filtering:")
print(sampled_df['Product'].value_counts())

# Verify that there are at least five distinct classes
assert len(sampled_df['Product'].unique()) >= 5, "There must be at least five distinct classes."

Number of complaints per product category:
Product
Credit reporting, credit repair services, or other personal consumer reports    32262
Credit reporting or other personal consumer reports                             25121
Debt collection                                                                 11727
Mortgage                                                                         4941
Checking or savings account                                                      4566
Credit card or prepaid card                                                      4269
Credit card                                                                      2504
Student loan                                                                     1880
Money transfer, virtual currency, or money service                               1829
Vehicle loan or lease                                                            1439
Credit reporting                                                                 1231
Pay

---
### 1.4 Data Encoding and Text Preprocessing

Before training our model, we need to prepare both our target labels and text data. This involves converting categorical labels into numerical format and cleaning our text data to improve model performance.

In [26]:
# Import necessary libraries
from sklearn.preprocessing import LabelEncoder
import re

# Step 1: Label Encoding
# Convert product categories into numerical labels
label_encoder = LabelEncoder()
sampled_df['Product_encoded'] = label_encoder.fit_transform(sampled_df['Product'])

# Display the mapping of product categories to numerical labels
print("Product categories and their corresponding numerical labels:")
for i, category in enumerate(label_encoder.classes_):
    print(f"{category}: {i}")

# Step 2: Text Preprocessing
# Define a function to clean the text data
def clean_text(text):
    # Remove HTML tags (if any)
    text = re.sub(r"<.*?>", "", text)
    # Remove special characters and punctuation
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # Convert to lowercase
    text = text.lower()
    return text.strip()

# Apply the cleaning function to the 'Consumer complaint narrative' column
sampled_df['Cleaned_text'] = sampled_df['Consumer complaint narrative'].apply(clean_text)

# Remove very short complaints (e.g., less than 10 words)
sampled_df = sampled_df[sampled_df['Cleaned_text'].apply(lambda x: len(x.split()) >= 10)]

# Display the first few rows of the cleaned dataset
print("\nFirst 5 rows of the cleaned dataset:")
print(sampled_df[['Product', 'Cleaned_text', 'Product_encoded']].head())

# Display the shape of the cleaned dataset
print("\nShape of the cleaned dataset:", sampled_df.shape)

Product categories and their corresponding numerical labels:
Checking or savings account: 0
Credit card: 1
Credit reporting: 2
Debt collection: 3
Money transfer, virtual currency, or money service: 4
Mortgage: 5
Payday loan: 6
Student loan: 7
Vehicle loan or lease: 8

First 5 rows of the cleaned dataset:
                 Product                                       Cleaned_text  \
335123  Credit reporting  upon reviewing my credit report i have identif...   
601718          Mortgage  i was doing a rate check to refinance the agen...   
847752  Credit reporting  this is my nd request that i have been a victi...   
765316  Credit reporting  im sending this compliant to inform credit bur...   
798300  Credit reporting  im submitting a complaint to you today to info...   

        Product_encoded  
335123                2  
601718                5  
847752                2  
765316                2  
798300                2  

Shape of the cleaned dataset: (91935, 4)
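
To make the effect of the cleaning step concrete, the sketch below reapplies the same `clean_text` logic to two short strings. The example sentences are made up for illustration (they are not rows from the dataset). Note that digits are stripped along with punctuation, which is why "2nd" appears as "nd" in the output above.

```python
import re

def clean_text(text):
    # Same three steps as the notebook's cleaning function
    text = re.sub(r"<.*?>", "", text)        # drop HTML-like tags
    text = re.sub(r"[^a-zA-Z\s]", "", text)  # keep only letters and whitespace
    return text.lower().strip()

# Hypothetical inputs, chosen to show what gets removed
examples = [
    "This is my 2nd request!",
    "Beginning in XX/XX/XXXX I had taken out <b>student</b> loans.",
]
cleaned = [clean_text(t) for t in examples]
for raw, out in zip(examples, cleaned):
    print(f"{raw!r} -> {out!r}")
```

The redacted date placeholder collapses into a single token (`xxxxxxxx`), so the model still sees that a date was present even though its value is gone.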


### 1.5 Dataset Creation and Tokenization

For training our BERT model, we need to:
1. Create a custom Dataset class that will handle tokenization
2. Split the data into training and testing sets
3. Use BERT's tokenizer to convert text into a format suitable for the model
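
As a rough illustration of step 3, the sketch below mimics what the tokenizer's `padding='max_length'` and `truncation=True` options produce, using plain Python lists instead of a real tokenizer. The helper `pad_and_mask` and the toy ids are hypothetical; 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids in the `bert-base-uncased` vocabulary.

```python
def pad_and_mask(token_ids, max_len, cls_id=101, sep_id=102, pad_id=0):
    """Wrap ids in [CLS]/[SEP], truncate to max_len, then pad with pad_id."""
    body = token_ids[: max_len - 2]            # leave room for [CLS] and [SEP]
    input_ids = [cls_id] + body + [sep_id]
    attention_mask = [1] * len(input_ids)      # 1 = real token
    n_pad = max_len - len(input_ids)
    input_ids += [pad_id] * n_pad
    attention_mask += [0] * n_pad              # 0 = padding, ignored by attention
    return input_ids, attention_mask

# Toy ids (hypothetical), padded to a short max_len for readability
short_ids, short_mask = pad_and_mask([1045, 2031, 2025], max_len=8)
long_ids, long_mask = pad_and_mask(list(range(2000, 2020)), max_len=8)

print(short_ids)   # [101, 1045, 2031, 2025, 102, 0, 0, 0]
print(short_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
print(long_ids)    # truncated to 8 ids, still ending in 102
```

Every item therefore has the same length, which is what lets the `DataLoader` stack items into a batch without a custom collate function.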

In [27]:
# Import necessary libraries
from torch.utils.data import Dataset
from transformers import BertTokenizer
import torch

class ComplaintDataset(Dataset):
    """A custom Dataset class for handling consumer complaints text data with BERT tokenization.

    Parameters:
        texts (List[str]): List of complaint texts to be processed
        labels (List[int]): List of encoded labels corresponding to each text
        tokenizer (BertTokenizer): A BERT tokenizer instance for text processing
        max_len (int, optional): Maximum length for padding/truncating texts. Defaults to 512

    Returns:
        dict: For each item, returns a dictionary containing:
            - input_ids (torch.Tensor): Encoded token ids of the text
            - attention_mask (torch.Tensor): Attention mask for the padded sequence
            - labels (torch.Tensor): Encoded label as a tensor
    """
    def __init__(self, texts, labels, tokenizer, max_len=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        # Tokenize the text
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,  # Add [CLS] and [SEP] tokens
            max_length=self.max_len,  # Truncate/pad to max_len
            return_token_type_ids=False,  # Not needed for classification
            padding='max_length',  # Pad to max_len
            truncation=True,  # Truncate to max_len
            return_attention_mask=True,  # Generate attention mask
            return_tensors='pt',  # Return PyTorch tensors
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

train_texts, test_texts, train_labels, test_labels = train_test_split(
    sampled_df['Cleaned_text'].tolist(),
    sampled_df['Product_encoded'].tolist(),
    test_size=0.2,  # 80% training, 20% testing
    random_state=42
)

# Create Dataset instances for training and testing
train_dataset = ComplaintDataset(train_texts, train_labels, tokenizer)
test_dataset = ComplaintDataset(test_texts, test_labels, tokenizer)

# Display the size of the training and testing datasets
print(f"Training dataset size: {len(train_dataset)}")
print(f"Testing dataset size: {len(test_dataset)}")

# Example: Inspect a single item from the training dataset
sample_item = train_dataset[0]
print("\nSample item from the training dataset:")
print(f"Input IDs: {sample_item['input_ids']}")
print(f"Attention Mask: {sample_item['attention_mask']}")
print(f"Label: {sample_item['labels']}")


Training dataset size: 73548
Testing dataset size: 18387

Sample item from the training dataset:
Input IDs: tensor([  101,  1045,  2031,  2025,  2018,  2019,  4070,  2007,  2023,  2194,
         1998,  2038,  6303, 27354,  1997,  7016,  3807,  2085,  1045,  2106,
         2025,  4607,  2046,  2151,  3206,  2007,  2023,  2194,  1998,  2038,
         2025,  2363,  2151, 12653,  2012,  2035,  2013,  2068,   102,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0

## Part 2: Training a Small-Size BERT Model

In this part, we will explore how to build and train a small-sized BERT model for our classification task. Instead of using the full-sized BERT model, which is computationally expensive, we will create a smaller version using the Transformers library.

In [28]:
# Import necessary libraries
from transformers import BertForSequenceClassification, BertConfig
import torch

# Step 1: Define the BERT model for sequence classification
# Set up the configuration for a smaller BERT model
config = BertConfig(
    vocab_size=30522,  # Vocabulary size of BERT
    hidden_size=128,  # Smaller hidden size (default is 768)
    num_hidden_layers=4,  # Fewer layers (default is 12)
    num_attention_heads=4,  # Fewer attention heads (default is 12)
    intermediate_size=512,  # Smaller intermediate size (default is 3072)
    num_labels=len(label_encoder.classes_),  # Number of output labels
    max_position_embeddings=512,  # Maximum sequence length
)

# Initialize the BERT model with the custom configuration
model = BertForSequenceClassification(config)

# Move the model to the appropriate device (GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Step 2: Print the total number of trainable parameters in the model
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

total_params = count_parameters(model)
print(f"Total number of trainable parameters in the model: {total_params:,}")

Total number of trainable parameters in the model: 4,783,625
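
As a sanity check, the reported count can be reproduced by hand from the configuration. The sketch below assumes the standard `BertForSequenceClassification` layout (word, position and token-type embeddings, a pooler, and a linear classifier), the default `type_vocab_size=2`, and the 9 classes printed by the label encoder earlier; the variable names are illustrative.

```python
vocab_size, hidden, layers, intermediate = 30522, 128, 4, 512
max_positions, type_vocab, num_labels = 512, 2, 9  # 9 classes from the label encoder

def linear(n_in, n_out):
    return n_in * n_out + n_out  # weights + bias

layer_norm = 2 * hidden  # scale + shift

embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
per_layer = (
    4 * linear(hidden, hidden)      # query, key, value, attention output
    + layer_norm
    + linear(hidden, intermediate)  # feed-forward up-projection
    + linear(intermediate, hidden)  # feed-forward down-projection
    + layer_norm
)
pooler = linear(hidden, hidden)
classifier = linear(hidden, num_labels)

total = embeddings + layers * per_layer + pooler + classifier
print(f"embeddings: {embeddings:,}")
print(f"encoder:    {layers * per_layer:,}")
print(f"total:      {total:,}")
```

Under these assumptions the total matches the printed value, and roughly 83% of the parameters sit in the embedding tables rather than in the encoder layers.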


---

Now that you have defined your model, it's time to train it!☠️

Training a model of this size can take some time, depending on the available resources. To manage this, you can train your model for just **2–3 epochs** to demonstrate progress. Here are some hints:
- **Training Metrics:** Ensure you print enough metrics, such as loss and accuracy, to track the training progress.
- **Interactive Monitoring:** Use the `tqdm` library to display the progress of your training loop in real-time.

In [29]:
# Import necessary libraries
from torch.utils.data import DataLoader
from torch.optim import AdamW  # replaces the deprecated transformers.AdamW
from sklearn.metrics import accuracy_score
from tqdm import tqdm

# Step 1: Define the optimizer and number of epochs
# weight_decay=0.0 keeps weight decay off, as it was by default in transformers.AdamW
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)  # Note: this model is trained from scratch, not fine-tuned
num_epochs = 3  # Number of training epochs

# Create DataLoader for training and testing datasets
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

# Training loop
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    model.train()
    total_loss = 0
    total_correct = 0
    total_samples = 0

    # Use tqdm for real-time progress monitoring
    for batch in tqdm(train_loader, desc="Training"):
        optimizer.zero_grad()

        # Move batch data to the appropriate device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Forward pass
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )

        # Compute loss
        loss = outputs.loss
        total_loss += loss.item()

        # Backpropagation
        loss.backward()
        optimizer.step()

        # Compute accuracy
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        total_correct += (predictions == labels).sum().item()
        total_samples += labels.size(0)

    # Compute average loss and accuracy for the epoch
    avg_loss = total_loss / len(train_loader)
    accuracy = total_correct / total_samples

    print(f"Training Loss: {avg_loss:.4f}, Training Accuracy: {accuracy:.4f}")

# Step 2: Evaluate the model on the test dataset
model.eval()
test_correct = 0
test_samples = 0

with torch.no_grad():
    for batch in tqdm(test_loader, desc="Testing"):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )

        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        test_correct += (predictions == labels).sum().item()
        test_samples += labels.size(0)

# Compute test accuracy
test_accuracy = test_correct / test_samples
print(f"Test Accuracy: {test_accuracy:.4f}")



Epoch 1/3


Training: 100%|██████████| 4597/4597 [08:30<00:00,  9.01it/s]


Training Loss: 1.0516, Training Accuracy: 0.6723
Epoch 2/3


Training: 100%|██████████| 4597/4597 [08:25<00:00,  9.09it/s]


Training Loss: 0.7515, Training Accuracy: 0.7401
Epoch 3/3


Training: 100%|██████████| 4597/4597 [08:20<00:00,  9.19it/s]


Training Loss: 0.6533, Training Accuracy: 0.7868


Testing: 100%|██████████| 1150/1150 [01:37<00:00, 11.81it/s]

Test Accuracy: 0.8023
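
Since the class distribution is imbalanced, overall accuracy may hide weak performance on the smaller classes. A minimal sketch of per-class recall, written in plain Python, is shown below; the `y_true` and `y_pred` lists are hypothetical stand-ins for the labels and predictions collected in the evaluation loop, not results from this run.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return {label: fraction of that label's samples predicted correctly}."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Hypothetical labels: class 2 dominates, class 6 is rare
y_true = [2, 2, 2, 2, 2, 2, 3, 3, 6, 6]
y_pred = [2, 2, 2, 2, 2, 2, 3, 2, 2, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
print(f"accuracy: {accuracy:.2f}")
for label, value in recall.items():
    print(f"class {label}: recall {value:.2f}")
```

In this toy case the accuracy is 0.70 even though the rare class is never predicted, which is the kind of gap a single accuracy number does not reveal.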





## Part 3: Fine-Tuning TinyBERT with LoRA

As you have experienced, training even a small-sized BERT model can be computationally intensive and time-consuming. To address these challenges, we explore **Parameter-Efficient Fine-Tuning (PEFT)** methods, which allow us to utilize the power of large pretrained models without requiring extensive resources.

---

### **Parameter-Efficient Fine-Tuning (PEFT)**

PEFT methods focus on fine-tuning only a small portion of the model’s parameters while keeping most of the pretrained weights frozen. This drastically reduces the computational and storage requirements while leveraging the rich knowledge embedded in pretrained models.

One popular PEFT method is LoRA (Low-Rank Adaptation).

- **What is LoRA?**

LoRA introduces a mechanism to fine-tune large language models by injecting small low-rank matrices into the model's architecture. Instead of updating all parameters during training, LoRA trains these small matrices while keeping the majority of the original parameters frozen.  This is achieved as follows:

1. **Frozen Weights**: The pretrained weights of the model, represented as a weight matrix $ W \in \mathbb{R}^{d \times k} $, remain **frozen** during fine-tuning.

2. **Low-Rank Decomposition**:
   Instead of directly updating $ W $, LoRA introduces two trainable matrices, $ A \in \mathbb{R}^{d \times r} $ and $ B \in \mathbb{R}^{r \times k} $, where $ r \ll \min(d, k) $.  
   These matrices approximate the update to $ W $ as:
   $$
   \Delta W = A \cdot B
   $$

   Here, $ r $, the rank of the decomposition, is a key hyperparameter that determines the trade-off between computational cost and model capacity.

3. **Adaptation**:
   During training, instead of updating $ W $, the adapted weight is:
   $$
   W' = W + \Delta W = W + A \cdot B
   $$
   Only the low-rank matrices $ A $ and $ B $ are optimized, while $ W $ remains fixed.

4. **Efficiency**:
   Since $ r $ is much smaller than $ d $ and $ k $, the number of trainable parameters in $ A $ and $ B $ is significantly less than in $ W $. This makes the approach highly efficient both in terms of computation and memory.
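
To make the low-rank update concrete, the sketch below builds $ W' = W + A \cdot B $ from tiny hand-written matrices and then compares parameter counts. It uses plain Python lists rather than PyTorch, and the matrix values are arbitrary. In the second half, $ r = 8 $ is the rank used in this notebook's LoRA configuration, while $ d = k = 128 $ is an assumed hidden size.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Toy example: d = 3, k = 3, r = 1 (values are arbitrary)
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]        # frozen pretrained weight, d x k
A = [[1.0], [2.0], [3.0]]    # trainable, d x r
B = [[0.5, 0.0, -0.5]]       # trainable, r x k

delta_W = matmul(A, B)       # rank-r update, d x k
W_adapted = add(W, delta_W)  # W' = W + A.B
for row in W_adapted:
    print(row)

def lora_param_counts(d, k, r):
    """Return (full update parameters, LoRA parameters) for one weight matrix."""
    return d * k, r * (d + k)

full, lora = lora_param_counts(d=128, k=128, r=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.3f}")
```

Assuming an encoder with 2 layers of hidden size 128, applying this to the query and value projections gives 2 × 2 × 2,048 = 8,192 trainable parameters, which is the count this notebook reports for the LoRA model.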

---

###  **Fine-Tuning TinyBERT**

For this part, we will fine-tune **TinyBERT**, a distilled version of BERT, using the LoRA method.

- **What is TinyBERT?**

TinyBERT is a lightweight version of the original BERT model created through knowledge distillation. It significantly reduces the model size and inference latency while preserving much of the original BERT’s effectiveness. Here are some key characteristics of TinyBERT:
- It is designed to be more resource-efficient for tasks such as classification, question answering, and more.
- TinyBERT retains a compact structure with fewer layers and parameters, making it ideal for fine-tuning with limited computational resources.


> Similar to the previous section, training this model might take some time. Given the resource limitations, you can train the model for just **2-3 epochs** to demonstrate the process.


In [30]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import numpy as np
from tqdm import tqdm

In [None]:
# Import necessary libraries
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model
from torch.optim import AdamW
from torch.nn import CrossEntropyLoss
from tqdm import tqdm

# Step 1: Load the pre-trained model and tokenizer (the `prajjwal1/bert-tiny` checkpoint: 2 layers, hidden size 128)
model_name = "prajjwal1/bert-tiny"
base_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(label_encoder.classes_))
tokenizer = AutoTokenizer.from_pretrained(model_name)

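# Step 2: Define LoRA Configuration
# Note: no task_type is set, so PEFT trains only the LoRA matrices; the newly
# initialized classifier head stays frozen (task_type=TaskType.SEQ_CLS would also train it).
lora_config = LoraConfig(
    r=8,  # Rank of the low-rank matrices
    lora_alpha=16,  # Scaling numerator: the update is scaled by lora_alpha / r
    target_modules=["query", "value"],  # Apply LoRA to the attention query and value projections
    lora_dropout=0.1,  # Dropout rate for LoRA layers
    bias="none",  # Do not train any bias parameters
)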

# Step 3: Apply LoRA to the model
lora_model = get_peft_model(base_model, lora_config)

# Display the number of trainable parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

total_params = count_parameters(lora_model)
print(f"Total number of trainable parameters in the LoRA model: {total_params:,}")

# Step 4: Training configuration
optimizer = AdamW(lora_model.parameters(), lr=2e-5)  # Learning rate for fine-tuning
criterion = CrossEntropyLoss()  # Not used below: the model computes the loss itself when labels are passed

# Step 5: Training loop
num_epochs = 3  # Number of training epochs

# Move the model to the GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
lora_model = lora_model.to(device)

# Training loop
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    lora_model.train()
    total_loss = 0
    total_correct = 0
    total_samples = 0

    # Use tqdm for real-time progress monitoring
    for batch in tqdm(train_loader, desc="Training"):
        optimizer.zero_grad()

        # Move batch data to the GPU
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Forward pass
        outputs = lora_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )

        # Compute loss
        loss = outputs.loss
        total_loss += loss.item()

        # Backpropagation
        loss.backward()
        optimizer.step()

        # Compute accuracy
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        total_correct += (predictions == labels).sum().item()
        total_samples += labels.size(0)

    # Compute average loss and accuracy for the epoch
    avg_loss = total_loss / len(train_loader)
    accuracy = total_correct / total_samples

    print(f"Training Loss: {avg_loss:.4f}, Training Accuracy: {accuracy:.4f}")

# Step 6: Evaluate the model on the test dataset
lora_model.eval()
test_correct = 0
test_samples = 0

with torch.no_grad():
    for batch in tqdm(test_loader, desc="Testing"):
        # Move batch data to the GPU
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = lora_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )

        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        test_correct += (predictions == labels).sum().item()
        test_samples += labels.size(0)

# Compute test accuracy
test_accuracy = test_correct / test_samples
print(f"Test Accuracy: {test_accuracy:.4f}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Total number of trainable parameters in the LoRA model: 8,192
Epoch 1/3


Training: 100%|██████████| 4597/4597 [07:21<00:00, 10.41it/s]


Training Loss: 1.7871, Training Accuracy: 0.6002
Epoch 2/3


Training:  80%|███████▉  | 3658/4597 [05:43<02:08,  7.32it/s]

Because of a saving issue, the output cell above is truncated. The complete output, recovered from the notebook's raw format, is:

```
		Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
		You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
		Total number of trainable parameters in the LoRA model: 8,192
		Epoch 1/3
		Training: 100%|██████████| 4597/4597 [07:21<00:00, 10.41it/s]
		Training Loss: 1.7871, Training Accuracy: 0.6002
		Epoch 2/3
		Training: 100%|██████████| 4597/4597 [07:14<00:00, 10.58it/s]
		Training Loss: 1.7152, Training Accuracy: 0.6311
		Epoch 3/3
		Training: 100%|██████████| 4597/4597 [07:16<00:00, 10.54it/s]
		Training Loss: 1.7097, Training Accuracy: 0.6311
		Testing: 100%|██████████| 1150/1150 [01:24<00:00, 13.57it/s]
		Test Accuracy: 0.6293
  ```