# 🔍 Breaking Down BERT: A Hands-On Dive into Transformers

Hey there! 👋

In this notebook, I’ve taken up a personal challenge: to demystify the BERT model and truly understand what’s happening under the hood. While BERT has become a go-to model in the world of NLP, I often found myself using it like a black box. So I decided to go beyond just running code and get into the math, the mechanics, and the magic behind the Transformer architecture that powers it.

This notebook is a result of that journey — a hands-on, intuitive walkthrough that aims to simplify concepts without skipping the key mathematical ideas. If you’ve ever felt like you wanted to “get” BERT instead of just “use” BERT, you’re in the right place.

Let’s get started 🚀

# About the Dataset

For this breakdown, I’m using the Cornell Movie Dialogues Corpus — a classic dataset packed with movie conversations between characters. It’s perfect for experimenting with language models like BERT because it offers natural, back-and-forth human dialogue.

In this step, I'm:

- Installing the necessary NLP libraries from Hugging Face.

- Downloading and unzipping the dataset.

- Organizing the relevant files (movie_conversations.txt and movie_lines.txt) into a datasets/ folder for easy access.

In [1]:
!pip install -qq transformers datasets tokenizers
!wget http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip
!unzip -oq cornell_movie_dialogs_corpus.zip 
!rm cornell_movie_dialogs_corpus.zip
!mkdir -p datasets
!mv -f "cornell movie-dialogs corpus/movie_conversations.txt" ./datasets/
!mv -f "cornell movie-dialogs corpus/movie_lines.txt" ./datasets/



In [2]:
## Importing the necessary libraries.

import os
from pathlib import Path
import torch
import re
import random
import transformers, datasets
from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizer
import tqdm
from torch.utils.data import Dataset, DataLoader
import itertools
import math
import torch.nn.functional as F
import numpy as np
from torch.optim import Adam

# 🗂️ Understanding the Dataset Structure

Before diving into modeling, let’s quickly understand the structure of the two main files we'll be working with:

**movie_lines.txt:**
This file contains every individual line of dialogue from the dataset.
Each line follows this format:

```
lineID +++$+++ characterID +++$+++ movieID +++$+++ character name +++$+++ actual dialogue
```

**movie_conversations.txt:**
This file connects those individual lines into conversations.
Each row lists two characters, the movie they’re from, and the sequence of utterance IDs that form a full conversation:

```
character1ID +++$+++ character2ID +++$+++ movieID +++$+++ [list of utterance IDs]
```

## Preprocessing:

We extract question-answer pairs from the two text files described above: every utterance in a conversation is paired with the utterance that follows it.
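To make the extraction concrete, here is a minimal sketch on a few hand-written records in the same format. The sample strings and the helper name `consecutive_pairs` are illustrative assumptions, not taken from the corpus files. The id list is parsed with `ast.literal_eval`, which only accepts Python literals.

```python
import ast

SEP = " +++$+++ "

# Illustrative records in the movie_lines.txt format (made up for this sketch).
sample_lines = [
    "L1 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Where are you going?\n",
    "L2 +++$+++ u1 +++$+++ m0 +++$+++ BOB +++$+++ To the station.\n",
    "L3 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Wait for me.\n",
]
# Illustrative record in the movie_conversations.txt format.
sample_conv = "u0 +++$+++ u1 +++$+++ m0 +++$+++ ['L1', 'L2', 'L3']\n"

# Map each line id (first field) to its dialogue text (last field).
id_to_text = {}
for record in sample_lines:
    fields = record.split(SEP)
    id_to_text[fields[0]] = fields[-1].strip()

def consecutive_pairs(conversation):
    """Turn one conversation record into (utterance, reply) pairs."""
    ids = ast.literal_eval(conversation.split(SEP)[-1].strip())
    return [(id_to_text[a], id_to_text[b]) for a, b in zip(ids, ids[1:])]

example_pairs = consecutive_pairs(sample_conv)
print(example_pairs)
```

A conversation with three utterances yields two pairs, so each middle utterance appears once as a reply and once as a prompt.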

In [3]:
# Setting maximum token length
MAX_LEN = 64  # Max whitespace-separated words kept per sentence (longer ones are truncated); reused later as the model's sequence length


### loading all data into memory
corpus_movie_conv = './datasets/movie_conversations.txt'
corpus_movie_lines = './datasets/movie_lines.txt'

## Reading the files
with open(corpus_movie_conv, 'r', encoding='iso-8859-1') as c:
    conv = c.readlines()

with open(corpus_movie_lines, 'r', encoding='iso-8859-1') as l:
    lines = l.readlines()


### splitting each line on the ' +++$+++ ' delimiter and mapping lineID -> dialogue text
lines_dic = {}
for line in lines:
    objects = line.split(" +++$+++ ")
    lines_dic[objects[0]] = objects[-1]
    

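### generate question answer pairs
import ast  # literal_eval parses the "['L194', 'L195', ...]" id lists without executing arbitrary code

pairs = []
for con in conv:
    ids = ast.literal_eval(con.split(" +++$+++ ")[-1])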
    for i in range(len(ids)):
        qa_pairs = []

        if i == len(ids) - 1:
            break

        first = lines_dic[ids[i]].strip()
        second = lines_dic[ids[i+1]].strip()

        qa_pairs.append(' '.join(first.split()[:MAX_LEN]))
        qa_pairs.append(' '.join(second.split()[:MAX_LEN]))
        pairs.append(qa_pairs)


In [4]:
## Checking a sample of the input.
pairs[0]

['Can we make this quick? Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad. Again.',
 "Well, I thought we'd start with pronunciation, if that's okay with you."]

# Training a Custom Tokenizer

The goal of this section is to train a custom WordPiece tokenizer from scratch on our movie dialogue dataset.
Instead of using a pre-trained tokenizer (like the default BERT tokenizer), we build one tailored to the vocabulary and linguistic patterns present in conversational data.
This helps:

🧠 Improve tokenization for informal, chat-style text (e.g., contractions, slang).

🔤 Generate a vocabulary that better reflects the specific domain of movie conversations.

🔄 Enable training a BERT-style model from scratch or fine-tuning on custom vocab, ensuring consistency between tokenizer and model. 




We are using the **BertWordPieceTokenizer** from the tokenizers library for two reasons:

- It is a fast tokenizer built for training WordPiece vocabularies like BERT's from scratch. Its Rust backend is typically much faster than the pure-Python `BertTokenizer` in transformers, and it gives you fine-grained control over training parameters (e.g., vocab size, minimum token frequency).

- Its normalization options are easy to configure. These are the parameters used here and their purposes:

| Parameter | Purpose |
|----------|---------|
| `clean_text=True` | Removes control characters and replaces all whitespace with plain spaces. |
| `handle_chinese_chars=False` | Skips putting spaces around Chinese characters; the corpus is English, so we turn it off. |
| `strip_accents=False` | Keeps accented characters (like café), which can matter in dialogue. |
| `lowercase=True` | Makes everything lowercase to reduce vocabulary size and noise. |
| `special_tokens=[...]` | Passed to `tokenizer.train(...)`; adds necessary BERT tokens like `[PAD]`, `[CLS]`, `[SEP]`, etc. |



**tokenizer.train(...)**

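This call trains the tokenizer on the `.txt` files written to `./data`. We specify:

- vocab_size=30_000: An upper bound on the vocabulary size. It is small enough for efficiency but large enough to capture most frequent words/subwords. The tokenizer trained in this notebook ends up with 21,160 entries.

- min_frequency=5: A candidate must occur at least 5 times to enter the vocabulary, which filters out rare, noise-like tokens.

- limit_alphabet=1000: Caps the number of distinct characters kept in the initial alphabet used to construct the vocabulary.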

- special_tokens=['[PAD]', '[CLS]', '[SEP]', '[MASK]', '[UNK]']. These tokens serve structural purposes: <ul><li>[PAD] – padding</li><li>[CLS] – classification tasks</li><li>[SEP] – sentence separation</li><li>[MASK] – masking for MLM</li><li>[UNK] – unknown tokens</li></ul>

These are predefined tokens needed for training BERT.

In [5]:
# WordPiece tokenizer

### save data as txt file
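os.makedirs('./data', exist_ok=True)  # does not fail if the folder already exists
text_data = []
file_count = 0

for sample in tqdm.tqdm([x[0] for x in pairs]):
    text_data.append(sample)

    # once we hit the 10K mark, save to file
    # (note: a final partial chunk of fewer than 10K samples is never written)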
    if len(text_data) == 10000:
        with open(f'./data/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1

paths = [str(x) for x in Path('./data').glob('**/*.txt')]

### training own tokenizer
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=False,
    lowercase=True
)


## We pass in the file paths, the vocab size, and the other training settings
tokenizer.train(
    files=paths,
    vocab_size=30_000,
    min_frequency=5,
    limit_alphabet=1000,
    wordpieces_prefix='##',
    special_tokens=['[PAD]', '[CLS]', '[SEP]', '[MASK]', '[UNK]']
    )

os.makedirs('./bert-it-1', exist_ok=True)

## Saving the Tokenizer;
tokenizer.save_model('./bert-it-1', 'bert-it')

## Loading it;
tokenizer = BertTokenizer.from_pretrained('./bert-it-1/bert-it-vocab.txt', local_files_only=True)

100%|██████████| 221616/221616 [00:00<00:00, 1995296.58it/s]









In [6]:
tokenizer

BertTokenizer(name_or_path='./bert-it-1/bert-it-vocab.txt', vocab_size=21160, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	4: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

In [7]:
## Example of a word split into two tokens: the tokenizer does not only treat whole words
## as tokens, it also breaks words into sub-words and assigns ids to those sub-words.


sample_token = 'normalized'
print(f"The token id for the word {sample_token} in the vocabulary is \n{tokenizer(sample_token)['input_ids']}")

## Reverse mapping (id -> token) to view the sub-word behind each id
id_token_mapping = {j:i for i,j in dict(tokenizer.vocab).items()}

print(id_token_mapping[2609])  ### token id of "normal"
print(id_token_mapping[1808])  ### token id of "##ized"


The token id for the word normalized in the vocabulary is 
[1, 2609, 1808, 2]
normal
##ized
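The split of "normalized" into `normal` and `##ized` comes from greedy longest-match-first lookup, which is how WordPiece vocabularies are applied at tokenization time. Below is a minimal sketch of that lookup; the function name and `toy_vocab` are assumptions for illustration, not the vocabulary trained above.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"normal", "##ized", "##ize", "##d", "token", "##s"}
print(wordpiece_tokenize("normalized", toy_vocab))  # ['normal', '##ized']
print(wordpiece_tokenize("tokens", toy_vocab))      # ['token', '##s']
print(wordpiece_tokenize("xyz", toy_vocab))         # ['[UNK]']
```

Because the longest match wins, `##ized` is chosen over `##ize` followed by `##d`, even though both splits are possible with this vocabulary.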


# Building the BERT Dataset Class


To train BERT from scratch, we need to feed it data in the format it expects — this includes:

- **Masked Language Modeling (MLM):**

BERT is trained to predict missing words in a sentence, which helps it learn deep bidirectional representations of language.

```
- We randomly select 15% of the tokens in the input for possible masking.

- But if we always replace them with [MASK], the model learns to rely too much on that specific token — which never appears during fine-tuning or inference.

- To solve this, BERT applies the following masking strategy:

* 80% of the selected tokens are replaced with [MASK].

* 10% are replaced with a random token.

* 10% are left unchanged (even though we still ask the model to predict them).

```

This way, the model learns to make predictions even when no explicit clue (like [MASK]) is present in the input — making it more robust.
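The 15% selection and the 80/10/10 split can be driven by a single random draw per token. The sketch below shows that mapping and estimates how often each outcome occurs; the function name `masking_action` and the action labels are assumptions made for illustration.

```python
import random

def masking_action(prob):
    """Map a uniform draw in [0, 1) to the action applied to one token."""
    if prob >= 0.15:
        return "keep"            # not selected: token is left alone, no prediction
    scaled = prob / 0.15         # rescale the selected 15% back to [0, 1)
    if scaled < 0.8:
        return "mask"            # replaced with [MASK]
    if scaled < 0.9:
        return "random"          # replaced with a random vocabulary token
    return "unchanged"           # kept as-is, but still predicted

# Estimate how often each action occurs (seeded so the run is repeatable).
rng = random.Random(0)
counts = {"keep": 0, "mask": 0, "random": 0, "unchanged": 0}
n_draws = 100_000
for _ in range(n_draws):
    counts[masking_action(rng.random())] += 1

for action, count in counts.items():
    print(f"{action:>9}: {count / n_draws:.3f}")
```

Over the whole input this works out to roughly 12% of tokens replaced by `[MASK]`, 1.5% replaced by a random token, and 1.5% left unchanged but still predicted.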

- **Next Sentence Prediction (NSP):** 

The NSP task helps BERT understand sentence-level relationships, such as question-answer pairs or sequential dialogue.

```
- For each training example, BERT is fed a pair of sentences.

- 50% of the time, the second sentence is the actual next sentence.

- 50% of the time, it is a randomly picked sentence from the dataset.

- The model is trained to predict whether the second sentence follows the first.
```

This binary classification task improves the model’s ability to capture context across sentences — crucial for downstream tasks like QA, natural language inference, and conversation modeling.
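As a rough illustration of the NSP sampling, the sketch below draws positive and negative pairs from a tiny list. `sample_nsp_pair` and `toy_pairs` are hypothetical names and data used only for this example.

```python
import random

def sample_nsp_pair(pairs, index, rng):
    """Return (sentence_a, sentence_b, is_next) for one training example."""
    first, second = pairs[index]
    if rng.random() > 0.5:
        return first, second, 1                  # positive: the real reply
    random_second = pairs[rng.randrange(len(pairs))][1]
    return first, random_second, 0               # negative: a random reply

toy_pairs = [
    ("How are you?", "Fine, thanks."),
    ("Where is the car?", "In the garage."),
    ("Did you eat?", "Not yet."),
]
rng = random.Random(42)
samples = [sample_nsp_pair(toy_pairs, 0, rng) for _ in range(1000)]
positive_share = sum(label for _, _, label in samples) / len(samples)
print(f"share of positive pairs: {positive_share:.2f}")
```

One caveat of this simple scheme: the random pick can land on the true reply and still be labelled 0. On a large corpus this label noise is rare, but on a toy list like this one it happens often.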


This custom BERTDataset class helps us prepare sentence pairs in this exact format.


### ✅ What This Class Does:

1. Loads a list of sentence pairs.

2. Randomly decides if the second sentence should be the real next sentence or a randomly selected one.

3. Applies word masking to 15% of the tokens as required for MLM.

4. Adds [CLS], [SEP], and [PAD] tokens to structure the input properly.

5. Assigns segment labels to distinguish sentence 1 from sentence 2.

6. Returns a dictionary containing everything BERT needs: **input IDs, token labels, segment labels, and NSP labels.**
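Steps 4 and 5 are easiest to see on toy ids. The sketch below assembles one input the same way, using the special-token ids from the tokenizer trained above (`[PAD]`=0, `[CLS]`=1, `[SEP]`=2). The helper name `build_inputs` and the token ids 10, 11, 20 are made up for illustration.

```python
CLS, SEP, PAD = 1, 2, 0  # special-token ids in the tokenizer trained above

def build_inputs(t1_ids, t2_ids, seq_len):
    """Assemble [CLS] t1 [SEP] t2 [SEP], truncate, then pad to seq_len."""
    first = [CLS] + t1_ids + [SEP]
    second = t2_ids + [SEP]
    tokens = (first + second)[:seq_len]
    segments = ([1] * len(first) + [2] * len(second))[:seq_len]
    n_pad = seq_len - len(tokens)
    return tokens + [PAD] * n_pad, segments + [0] * n_pad

tokens, segments = build_inputs([10, 11], [20], seq_len=8)
print(tokens)    # [1, 10, 11, 2, 20, 2, 0, 0]
print(segments)  # [1, 1, 1, 1, 2, 2, 0, 0]
```

Segment id 0 is reserved for padding, which is why the segment embedding later needs three rows rather than two.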

In [8]:
class BERTDataset(Dataset):

    ## data_pair is the list of pair of sentences;
    def __init__(self, data_pair, tokenizer, seq_len=64):

        self.tokenizer = tokenizer
        self.seq_len = seq_len
        self.corpus_lines = len(data_pair)
        self.lines = data_pair

    def __len__(self):
        return self.corpus_lines


    def get_corpus_line(self, item):
        '''return sentence pair'''
        return self.lines[item][0], self.lines[item][1]

    def get_random_line(self):
        '''return random single sentence'''
        return self.lines[random.randrange(len(self.lines))][1]

    def get_sent(self, index):
        '''return random sentence pair'''
        t1, t2 = self.get_corpus_line(index)

        # negative or positive pair, for next sentence prediction
        if random.random() > 0.5:
            return t1, t2, 1
        else:
            return t1, self.get_random_line(), 0

    def random_word(self, sentence):

        ## Split the sentence into words;
        tokens = sentence.split()
        output_label = []  ### token ids at the positions selected for MLM, 0 everywhere else
        output = []  ### the (possibly masked or replaced) token ids fed to the model

        # 15% of the tokens would be replaced
        for i, token in enumerate(tokens):

            ## Generate a probability;
            prob = random.random()

            # remove cls and sep token
            token_id = self.tokenizer(token)['input_ids'][1:-1]   ### tokenizer("normalized")['input_ids']  ## [2609, 1808]

            if prob < 0.15:
                prob /= 0.15  ### rescale to [0, 1) so the 80/10/10 split applies within the selected 15%

                # 80% chance change token to mask token
                if prob < 0.8:
                    for _ in range(len(token_id)): ### Some words could be tokenised into multiple sub-words
                        output.append(self.tokenizer.vocab['[MASK]'])

                # 10% chance change token to random token
                elif prob < 0.9:
                    for _ in range(len(token_id)): ### Some words could be tokenised into multiple sub-words
                        output.append(random.randrange(len(self.tokenizer.vocab)))

                # 10% chance change token to current token
                else:
                    output.append(token_id)

                output_label.append(token_id) ### token_id is a list (one id per sub-word); it is flattened below

            else:
                output.append(token_id)

                for _ in range(len(token_id)): ### Some words could be tokenised into multiple sub-words
                    output_label.append(0)

        # flattening
        output = list(itertools.chain(*[[x] if not isinstance(x, list) else x for x in output]))
        output_label = list(itertools.chain(*[[x] if not isinstance(x, list) else x for x in output_label]))
        assert len(output) == len(output_label)
        return output, output_label

    def __getitem__(self, item):  #### item is an index

        # Step 1: get random sentence pair, either negative or positive (saved as is_next_label)
        t1, t2, is_next_label = self.get_sent(item)

        # Step 2: replace random words in sentence with mask / random words
        t1_random, t1_label = self.random_word(t1)
        t2_random, t2_label = self.random_word(t2)

        # Step 3: Adding CLS and SEP tokens to the start and end of sentences
        # The labels get PAD (0) at those positions, since they are never predicted
        t1 = [self.tokenizer.vocab['[CLS]']] + t1_random + [self.tokenizer.vocab['[SEP]']]
        t2 = t2_random + [self.tokenizer.vocab['[SEP]']]

        t1_label = [self.tokenizer.vocab['[PAD]']] + t1_label + [self.tokenizer.vocab['[PAD]']]
        t2_label = t2_label + [self.tokenizer.vocab['[PAD]']]

        # Step 4: combine sentence 1 and 2 as one input
        # adding PAD tokens to make the sentence same length as seq_len
        segment_label = ([1 for _ in range(len(t1))] + [2 for _ in range(len(t2))])[:self.seq_len]

        bert_input = (t1 + t2)[:self.seq_len]
        bert_label = (t1_label + t2_label)[:self.seq_len]

        padding = [self.tokenizer.vocab['[PAD]'] for _ in range(self.seq_len - len(bert_input))]

        ### Adding padding at the end to fill up the sequence length
        bert_input.extend(padding)
        bert_label.extend(padding)
        segment_label.extend(padding)

        output = {"bert_input": bert_input,
                  "bert_label": bert_label,
                  "segment_label": segment_label,   ### Used to separate the two sentence - first sentence's tokens is marked as 1, the other is 2
                  "is_next": is_next_label}

        return {key: torch.tensor(value) for key, value in output.items()}


In [9]:
train_data = BERTDataset(
   pairs, seq_len=MAX_LEN, tokenizer=tokenizer)


## DataLoader works with any map-style dataset, i.e. one that implements __getitem__() and __len__()
train_loader = DataLoader(
   train_data, batch_size=32, shuffle=True, pin_memory=True)


In [10]:
## Viewing a sample;
sample_data = next(iter(train_loader))


In [11]:
sample_data

{'bert_input': tensor([[   1,   48, 1809,  ...,    0,    0,    0],
         [   1,  162,   11,  ...,    0,    0,    0],
         [   1,  314,   17,  ...,    0,    0,    0],
         ...,
         [   1,  182,  187,  ...,    0,    0,    0],
         [   1,  303,   15,  ...,    2,    0,    0],
         [   1, 2352,  435,  ...,    0,    0,    0]]),
 'bert_label': tensor([[0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         ...,
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 0, 0, 0]]),
 'segment_label': tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0],
         ...,
         [1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 2, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]),
 'is_next': tensor([1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
         0, 1, 0, 1, 1, 0, 0, 0])}

# The BERT Embedding


In BERT, each token is represented not just by its identity (word) but also by its position in the sentence and which sentence it belongs to.

So, instead of feeding plain word embeddings to the model, we enrich each token representation with:

- Token Embedding – what the word is.

- Positional Embedding – where it appears in the sentence.

- Segment Embedding – which sentence it belongs to (important for Next Sentence Prediction).

These three vectors are summed together for each token before being passed to the transformer.
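For the positional part, this notebook uses fixed sinusoidal encodings rather than learned position embeddings (the original BERT learns them). A minimal sketch of the sinusoidal formulation from the original Transformer paper, written in plain Python so the arithmetic is visible; the function name is an assumption for illustration.

```python
import math

def sinusoidal_position(pos, d_model):
    """Encoding vector for one position; d_model is assumed to be even."""
    vector = [0.0] * d_model
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        vector[i] = math.sin(angle)       # even dimension
        vector[i + 1] = math.cos(angle)   # odd dimension, same frequency
    return vector

print(sinusoidal_position(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print([round(v, 4) for v in sinusoidal_position(1, 4)])
```

Each (sin, cos) pair shares one frequency, and the frequencies decrease along the vector, so nearby positions get similar encodings while distant ones differ.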

In [12]:
class PositionalEmbedding(torch.nn.Module):

    def __init__(self, d_model, max_len=128):
        super().__init__()

        # Precompute the sinusoidal positional encodings once; they are fixed, not learned.
        pe = torch.zeros(max_len, d_model).float()
        pe.requires_grad = False

        for pos in range(max_len):
            # each (sin, cos) pair of dimensions shares one frequency
            for i in range(0, d_model, 2):
                pe[pos, i] = math.sin(pos / (10000 ** (i / d_model)))
                pe[pos, i + 1] = math.cos(pos / (10000 ** (i / d_model)))

        # include the batch size: (1, max_len, d_model)
        # registered as a buffer so it moves with the model (e.g. model.to(device))
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        # x: (batch_size, seq_len); the result broadcasts over the batch dimension
        return self.pe[:, :x.size(1)]

class BERTEmbedding(torch.nn.Module):
    """
    BERT Embedding, which consists of the following features
        1. TokenEmbedding : normal embedding matrix
        2. PositionalEmbedding : adding positional information using sin, cos
        3. SegmentEmbedding : adding sentence segment info, (sent_A:1, sent_B:2)
        the sum of all these features is the output of BERTEmbedding
    """

    def __init__(self, vocab_size, embed_size, seq_len=64, dropout=0.1):
        """
        :param vocab_size: total vocab size
        :param embed_size: embedding size of token embedding
        :param seq_len: maximum sequence length, used to size the positional encodings
        :param dropout: dropout rate
        """

        super().__init__()
        self.embed_size = embed_size
        # (m, seq_len) --> (m, seq_len, embed_size)
        # padding_idx is not updated during training, remains as fixed pad (0)
        self.token = torch.nn.Embedding(vocab_size, embed_size, padding_idx=0)
        self.segment = torch.nn.Embedding(3, embed_size, padding_idx=0)
        self.position = PositionalEmbedding(d_model=embed_size, max_len=seq_len)
        self.dropout = torch.nn.Dropout(p=dropout)

    def forward(self, sequence, segment_label):
        x = self.token(sequence) + self.position(sequence) + self.segment(segment_label)
        return self.dropout(x)


# The Multi-Head Attention Layer

## Purpose: What is Multi-Head Attention?

Multi-head attention allows the model to focus on different parts of a sentence in parallel, capturing diverse relationships between words (e.g., “the cat” and “sat on the mat”). Instead of one attention mechanism, we compute multiple attention heads and then combine them.


## 🔢 Core Math & Intuition:

**1. Linear Projections (Q, K, V)**
Input shape: (batch_size, seq_len, d_model)

We project the input into:

```
- Query: What am I looking for?

- Key: What content do I have?

- Value: What do I return if I attend here?
```

These are learned linear layers:

```
self.query = torch.nn.Linear(d_model, d_model)
self.key = torch.nn.Linear(d_model, d_model)
self.value = torch.nn.Linear(d_model, d_model)
```


**2. Split into Multiple Heads**

Each head has dimension d_k = d_model / heads.

```
query.view(..., heads, d_k).permute(...)
```

Now we compute attention in parallel for different parts of the representation space.
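The split itself is just a reshape: each token vector of size d_model is cut into `heads` contiguous chunks of size d_k. A toy sketch on one vector, where `split_heads` is a hypothetical helper and not part of the model code:

```python
def split_heads(vector, heads):
    """Split one token vector of size d_model into `heads` chunks of size d_k."""
    d_model = len(vector)
    assert d_model % heads == 0, "d_model must be divisible by heads"
    d_k = d_model // heads
    return [vector[h * d_k:(h + 1) * d_k] for h in range(heads)]

token_vector = list(range(8))          # toy d_model = 8
per_head = split_heads(token_vector, heads=2)
print(per_head)                        # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(768 // 12)                       # d_k for d_model=768 and 12 heads: 64
```

With d_model = 768 and 12 heads, each head therefore works on 64-dimensional queries, keys, and values.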

**3. Scaled Dot-Product Attention**

$$\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V$$

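This gives attention weights that score how much each token attends to every other token.

- **QK^T** gives the similarity between every query and every key.

- **√d_k** scaling keeps the dot products from growing with the head size, which would otherwise push the softmax into regions with very small gradients.

- **softmax** normalizes each row, so the attention weights of one token over all tokens sum to 1.

- **mask** sets the scores of padding (or otherwise unwanted) positions to a large negative value before the softmax, so their weights become effectively 0.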
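To see these pieces interact, here is a single-head sketch in plain Python on three tokens, the last of which is padding. The helper names and the toy Q, K, V values are assumptions for illustration; the model code uses batched tensor operations instead.

```python
import math

def softmax(row):
    peak = max(row)                      # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, mask):
    """Q, K, V: lists of token vectors; mask[j] == 0 marks key j as padding."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        scores = [s if keep else -1e9 for s, keep in zip(scores, mask)]
        weights.append(softmax(scores))
    # Each output vector is a weighted average of the value vectors.
    context = [
        [sum(w * v[d] for w, v in zip(row, V)) for d in range(len(V[0]))]
        for row in weights
    ]
    return weights, context

Q = K = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # third token is padding
V = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
weights, context = attention(Q, K, V, mask=[1, 1, 0])
for row in weights:
    print([round(w, 3) for w in row])
```

Every row sums to 1, and the last column is 0 in every row, so the padded token's value vector never contributes to any output.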


**4. Concat Heads and Final Linear Projection**

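After computing attention for each head, we concatenate the heads and apply a final linear transformation (`output_linear` in the code):

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O$$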



In [13]:
### attention layers
class MultiHeadedAttention(torch.nn.Module):

    def __init__(self, heads, d_model, dropout=0.1):
        super(MultiHeadedAttention, self).__init__()

        assert d_model % heads == 0
        self.d_k = d_model // heads
        self.heads = heads
        self.dropout = torch.nn.Dropout(dropout)

        self.query = torch.nn.Linear(d_model, d_model)
        self.key = torch.nn.Linear(d_model, d_model)
        self.value = torch.nn.Linear(d_model, d_model)
        self.output_linear = torch.nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask):
        """
        query, key, value of shape: (batch_size, max_len, d_model)
        mask of shape: (batch_size, 1, max_len, max_len)
        """
        # (batch_size, max_len, d_model)
        query = self.query(query)
        key = self.key(key)
        value = self.value(value)

        # (batch_size, max_len, d_model) --> (batch_size, max_len, h, d_k) --> (batch_size, h, max_len, d_k)
        query = query.view(query.shape[0], -1, self.heads, self.d_k).permute(0, 2, 1, 3)
        key = key.view(key.shape[0], -1, self.heads, self.d_k).permute(0, 2, 1, 3)
        value = value.view(value.shape[0], -1, self.heads, self.d_k).permute(0, 2, 1, 3)

        # (batch_size, h, max_len, d_k) matmul (batch_size, h, d_k, max_len) --> (batch_size, h, max_len, max_len)
        scores = torch.matmul(query, key.permute(0, 1, 3, 2)) / math.sqrt(query.size(-1))

        # fill masked (0) positions with a large negative number so they get ~0 weight after the softmax
        # (batch_size, h, max_len, max_len)
        scores = scores.masked_fill(mask == 0, -1e9)  ### mask ---> (batch_size, 1, max_len, max_len), scores ---> (batch_size, h, max_len, max_len)
        ### mask == 0 is True wherever the key is a padding token; those scores are set to -1e9, so the softmax ignores that index/token.


        # (batch_size, h, max_len, max_len)
        # softmax to put attention weight for all non-pad tokens
        # max_len X max_len matrix of attention
        weights = F.softmax(scores, dim=-1)
        weights = self.dropout(weights)

        # (batch_size, h, max_len, max_len) matmul (batch_size, h, max_len, d_k) --> (batch_size, h, max_len, d_k)
        context = torch.matmul(weights, value)

        # (batch_size, h, max_len, d_k) --> (batch_size, max_len, h, d_k) --> (batch_size, max_len, d_model)
        context = context.permute(0, 2, 1, 3).contiguous().view(context.shape[0], -1, self.heads * self.d_k)

        # (batch_size, max_len, d_model)
        return self.output_linear(context)

# The Feed-Forward Network Layer

## Why is the Feed-Forward Network (FFN) needed?

- With FFN: The Transformer model becomes more expressive and powerful, capable of learning **complex, non-linear relationships** (because of the non-linear transformation applied to each token's representation). It can process the data in deeper and more nuanced ways, leading to better performance on most tasks.

- Without FFN: The Transformer model would be much **simpler and less powerful**, as it would lack the position-wise non-linear transformation and feature refinement that the FFN provides. While it might still perform adequately on simpler tasks, its overall capacity to handle complex patterns and abstractions would be greatly diminished.

## 🔢 Core Math & Intuition:

1. Linear Layers (fc1 and fc2):

- fc1 takes input with shape (d_model) and transforms it into a higher-dimensional space with shape (middle_dim). This is a typical strategy to allow the model to learn more complex representations.

- fc2 then takes the result of fc1 and projects it back to the original size (d_model). This expand-then-project structure (768 → 3072 → 768 with the sizes used in this notebook) lets the non-linearity operate in the wider, higher-dimensional space, while the output keeps the shape needed for the residual connection.

2. Activation Function (GELU):

- After the first linear transformation (fc1), the output goes through a non-linear activation function, GELU (Gaussian Error Linear Unit). This activation function helps the model learn non-linear relationships, which is important for capturing complex patterns.

- GELU is often preferred in Transformer models over ReLU because it is smooth and has a non-zero gradient for negative inputs, which helps with training stability. PyTorch's `torch.nn.GELU()` computes the exact form $x\,\Phi(x)$ by default; the tanh expression in the table is a close approximation of it.

| Property                         | **GELU**                                          | **ReLU**                                         |
|-----------------------------------|---------------------------------------------------|--------------------------------------------------|
| **Formula**                       | $\text{GELU}(x) = x\,\Phi(x) \approx 0.5 \cdot x \left( 1 + \tanh \left( \sqrt{\frac{2}{\pi}} \cdot \left( x + 0.044715 x^3 \right) \right) \right)$ | $\text{ReLU}(x) = \max(0, x)$ |
| **Handling Negative Inputs**      | Small negative values are allowed; output is smooth and non-zero for negative inputs | Negative values are zeroed out (zero gradient) |
| **Differentiability**             | Smooth and differentiable everywhere              | Not differentiable at $x = 0$                   |
| **Gradient Behavior**             | Non-zero gradients even for negative inputs        | Zero gradient for negative inputs               |
| **Computational Complexity**      | Slightly more expensive due to the erf/tanh term   | Simple, fast to compute                        |
| **Usage in Transformers**         | Used in BERT and GPT                               | Used in the original Transformer                |

3. Dropout:

- The Dropout layer is used to prevent the model from overfitting by randomly setting some of the activations to zero during training. This helps improve generalization by forcing the model to not rely too heavily on any specific activation.
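To put numbers on the GELU vs ReLU comparison, the sketch below evaluates ReLU, the exact GELU, and the tanh approximation at a few points using only the standard library. The function names are chosen for this example.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation, as in the formula in the table."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  "
          f"gelu={gelu_exact(x):.4f}  gelu_tanh={gelu_tanh(x):.4f}")
```

For negative inputs ReLU returns exactly 0, while GELU returns a small negative value, and the two GELU variants agree to about three decimal places at these points.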

In [14]:
# self.feed_forward = FeedForward(d_model, middle_dim=feed_forward_hidden)  ## Called from EncoderLayer

class FeedForward(torch.nn.Module):
    "Implements FFN equation."

    def __init__(self, d_model, middle_dim=2048, dropout=0.1):
        super(FeedForward, self).__init__()

        self.fc1 = torch.nn.Linear(d_model, middle_dim)
        self.fc2 = torch.nn.Linear(middle_dim, d_model)
        self.dropout = torch.nn.Dropout(dropout)
        self.activation = torch.nn.GELU()

    def forward(self, x):
        out = self.activation(self.fc1(x))
        out = self.fc2(self.dropout(out))
        return out
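As a quick sanity check on the size of this block, the sketch below counts the weights and biases of the two linear layers for the dimensions used in this notebook (d_model = 768, middle_dim = 768 * 4). The helper name is an assumption for illustration.

```python
def ffn_parameter_count(d_model, middle_dim):
    """Weights plus biases of the two linear layers in the feed-forward block."""
    fc1 = d_model * middle_dim + middle_dim   # expand: d_model -> middle_dim
    fc2 = middle_dim * d_model + d_model      # project back: middle_dim -> d_model
    return fc1 + fc2

print(ffn_parameter_count(768, 768 * 4))  # 4722432
```

That is about 4.7 million parameters per encoder layer in the feed-forward block alone.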

The Encoder Layer below integrates multi-head attention and the FFN. Each of the two sub-layers is wrapped in a skip connection (residual connection) followed by layer normalization, as in the original architecture.

In [15]:
class EncoderLayer(torch.nn.Module):
    def __init__(
        self,
        d_model=768,
        heads=12,
        feed_forward_hidden=768 * 4,
        dropout=0.1
        ):
        super(EncoderLayer, self).__init__()

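        # NOTE: a single LayerNorm is shared by both sub-layers here; the original
        # Transformer/BERT use a separate LayerNorm after attention and after the FFN.
        self.layernorm = torch.nn.LayerNorm(d_model)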

        self.self_multihead = MultiHeadedAttention(heads, d_model)  ### defines the k, q, v,

        self.feed_forward = FeedForward(d_model, middle_dim=feed_forward_hidden)  ## Normal NN
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, embeddings, mask):
        # embeddings: (batch_size, max_len, d_model)
        # encoder mask: (batch_size, 1, 1, max_len)
        # result: (batch_size, max_len, d_model)

        ## self-attention: query, key and value are all the same embeddings;
        ## padded positions are masked with -1e9 before the softmax
        interacted = self.dropout(self.self_multihead(embeddings, embeddings, embeddings, mask))

        # residual connection + layer normalization
        interacted = self.layernorm(interacted + embeddings)

        # feed-forward block, again followed by a residual connection + layer normalization
        feed_forward_out = self.dropout(self.feed_forward(interacted))
        encoded = self.layernorm(feed_forward_out + interacted)

        return encoded
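The "add and normalize" step used twice in the layer above can be written out for a single token vector. This is a simplified sketch without the learned scale and shift that `torch.nn.LayerNorm` also applies; the function names and numbers are made up for illustration.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(var + eps) for v in vector]

def add_and_norm(x, sublayer_out):
    """Post-norm residual step: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

x = [1.0, 2.0, 3.0, 4.0]
sublayer_out = [0.5, -0.5, 0.5, -0.5]   # stand-in for an attention or FFN output
y = add_and_norm(x, sublayer_out)
print([round(v, 3) for v in y])
```

The residual sum keeps the original signal available to later layers, and the normalization keeps the scale of each token vector stable from layer to layer.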


# Putting It All Together: Encoding the Input

Now we connect all the key components that make BERT capable of understanding context-rich input.


1. 
```
self.embedding = BERTEmbedding(vocab_size=vocab_size, embed_size=d_model)

```

Creates the **embedding layer**, which converts input token indices into dense vector representations. This includes:

- Token embeddings

- Positional embeddings (to give order information)

- Segment embeddings (to distinguish sentence A vs B)



2. 
```
self.encoder_blocks = torch.nn.ModuleList(
    [EncoderLayer(d_model, heads, d_model * 4, dropout) for _ in range(n_layers)])
```

Creates a list of EncoderLayer blocks (Transformer encoder layers) — 12 by default. Each one contains:

- Multi-head self-attention

- Feed-forward network

- Residual + normalization

They're stored in a **ModuleList** so PyTorch can register them as trainable layers.


3. 
```
mask = (x > 0).unsqueeze(1).repeat(1, x.size(1), 1).unsqueeze(1)

```

Creates an **attention mask to ignore padding tokens** (0s). This mask will be used in attention layers to ensure the model doesn’t attend to padding positions.


4. 
```
for encoder in self.encoder_blocks:
    x = encoder.forward(x, mask)

```

Runs the input through each transformer encoder layer, one by one. Each layer applies **self-attention + feed-forward + skip connections + normalization**.
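The padding mask from step 3 can be made concrete for a single sequence. In the model, the same pattern is built for the whole batch and ends up with shape (batch_size, 1, seq_len, seq_len). The plain-Python helper below is a hypothetical illustration of that pattern, not the tensor code itself.

```python
PAD = 0

def padding_mask(token_ids):
    """Square mask for one sequence: mask[i][j] is True when key j is a real token."""
    is_real = [tok > PAD for tok in token_ids]
    # Every query position i gets the same row, mirroring the repeat() along dim 1.
    return [list(is_real) for _ in token_ids]

mask = padding_mask([1, 48, 2, 0, 0])   # [CLS] token [SEP] [PAD] [PAD]
for row in mask:
    print([int(m) for m in row])
```

Each of the five rows prints `[1, 1, 1, 0, 0]`: the columns of padded keys are hidden from every query, including the queries at padded positions.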

In [16]:

class BERT(torch.nn.Module):
    """
    BERT model : Bidirectional Encoder Representations from Transformers.
    """

    def __init__(self, vocab_size, d_model=768, n_layers=12, heads=12, dropout=0.1):
        """
        :param vocab_size: vocab_size of total words
        :param d_model: BERT model hidden size
        :param n_layers: numbers of Transformer blocks(layers)
        :param heads: number of attention heads
        :param dropout: dropout rate
        """

        super().__init__()

        self.d_model = d_model
        self.n_layers = n_layers
        self.heads = heads

        # paper noted they used 4 * hidden_size for ff_network_hidden_size
        self.feed_forward_hidden = d_model * 4

        # embedding for BERT, sum of positional, segment, token embeddings
        self.embedding = BERTEmbedding(vocab_size=vocab_size, embed_size=d_model)   ### sum of TokenEmbedding, PositionalEmbedding, SegmentEmbedding

        # multi-layers transformer blocks, deep network
        self.encoder_blocks = torch.nn.ModuleList(
            [EncoderLayer(d_model, heads, d_model * 4, dropout) for _ in range(n_layers)])  ## Creating layers, and we will sequentially compute while applying a loop



    def forward(self, x, segment_info):   #### x --> (batch_size, seq_len), segment_info --> (batch_size, seq_len)  ### data["bert_input"], data["segment_label"]
        # attention masking for padded token
        # (batch_size, 1, seq_len, seq_len)

        # ✅ "Hey, attend only to real tokens!"
        # ❌ "Ignore padded tokens!"
        # [[1, 1, 1, 1, 1, 1, 0, 0, 0, ..., 0]]

        mask = (x > 0).unsqueeze(1).repeat(1, x.size(1), 1).unsqueeze(1)

        # embedding the indexed sequence to sequence of vectors
        x = self.embedding(x, segment_info)   #### sum of TokenEmbedding, PositionalEmbedding, SegmentEmbedding

        # running over multiple transformer blocks
        for encoder in self.encoder_blocks:
            x = encoder.forward(x, mask)  ### attention + feed-forward + skip connections, applied once per layer (n_layers times; 12 by default)

        return x
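
Why `torch.nn.ModuleList` and not a plain Python list for `encoder_blocks`? A quick sketch with two hypothetical toy modules (`WithList` and `WithModuleList` are names I made up for illustration) shows the difference: layers kept in a plain list are invisible to `.parameters()`, so an optimizer would never update them.

```python
import torch

class WithList(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Plain Python list: the layers are NOT registered as submodules.
        self.layers = [torch.nn.Linear(4, 4) for _ in range(3)]

class WithModuleList(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # ModuleList: each layer is registered, so its weights appear in .parameters().
        self.layers = torch.nn.ModuleList([torch.nn.Linear(4, 4) for _ in range(3)])

def count_params(module):
    """Total number of scalar parameters the module exposes."""
    return sum(p.numel() for p in module.parameters())

n_plain = count_params(WithList())
n_registered = count_params(WithModuleList())
print(n_plain, n_registered)
```

The same registration is also what makes `.to(device)` and `state_dict()` reach every encoder layer.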



# BERT Pretraining Heads: Next Sentence & Masked Token Prediction

### 🧩 1. NextSentencePrediction Class

This is a binary classifier used in BERT’s pretraining phase.
It predicts whether sentence B logically follows sentence A.

- Takes in the BERT output (specifically the [CLS] token).

- Passes it through a linear layer → 2 output scores: is_next or not_next.

- Applies log-softmax to return log-probabilities.

📌 Used to help BERT understand sentence-level relationships.

### 🧩 2. MaskedLanguageModel Class

This is a multi-class classifier for predicting the original token in place of a masked one.

- Takes the full output of the BERT model (all tokens).

- Applies a linear layer to map hidden states to vocabulary size.

- Uses log-softmax for each token’s prediction.

📌 Helps BERT learn deep word-level understanding during pretraining by predicting missing words.
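
Before reading the two classes, it may help to see the shapes involved. This is a minimal sketch with toy sizes of my choosing and a random tensor standing in for the encoder output; the linear layers mirror what the heads are described as doing above.

```python
import torch

torch.manual_seed(0)
batch, seq_len, hidden, vocab_size = 2, 6, 16, 50  # toy sizes, for illustration only

# Stand-in for the encoder output: one hidden vector per token.
encoded = torch.randn(batch, seq_len, hidden)

log_softmax = torch.nn.LogSoftmax(dim=-1)
nsp_linear = torch.nn.Linear(hidden, 2)            # sentence-level head
mlm_linear = torch.nn.Linear(hidden, vocab_size)   # token-level head

nsp_out = log_softmax(nsp_linear(encoded[:, 0]))   # only position 0, the [CLS] slot
mlm_out = log_softmax(mlm_linear(encoded))         # every position

print(nsp_out.shape)  # (batch, 2)
print(mlm_out.shape)  # (batch, seq_len, vocab_size)

# exp() of log-probabilities sums to 1 over the class dimension.
nsp_prob_sums = nsp_out.exp().sum(dim=-1)
mlm_prob_sums = mlm_out.exp().sum(dim=-1)
```

Returning log-probabilities (rather than probabilities) is what lets the heads pair directly with a negative log-likelihood loss.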

In [17]:
class NextSentencePrediction(torch.nn.Module):
    """
    2-class classification model : is_next, is_not_next
    """

    def __init__(self, hidden):
        """
        :param hidden: BERT model output size
        """
        super().__init__()
        self.linear = torch.nn.Linear(hidden, 2)
        self.softmax = torch.nn.LogSoftmax(dim=-1)

    def forward(self, x):
        # use only the first token which is the [CLS]
        return self.softmax(self.linear(x[:, 0]))

class MaskedLanguageModel(torch.nn.Module):
    """
    predicting origin token from masked input sequence
    n-class classification problem, n-class = vocab_size
    """

    def __init__(self, hidden, vocab_size):
        """
        :param hidden: output size of BERT model
        :param vocab_size: total vocab size
        """
        super().__init__()
        self.linear = torch.nn.Linear(hidden, vocab_size)
        self.softmax = torch.nn.LogSoftmax(dim=-1)

    def forward(self, x):
        return self.softmax(self.linear(x))



The `BERTLM` class wraps the full BERT model and connects it to both of its pretraining heads:

1. Masked Language Modeling (MLM) – Predicts masked tokens in the input sequence

2. Next Sentence Prediction (NSP) – Determines if one sentence follows another


During the **forward pass**:

- The input goes through the BERT encoder, generating contextual embeddings.

- The NSP head takes the embedding at position 0 (the [CLS] token) for the sentence-level prediction.

- The MLM head takes the full sequence for token-level prediction.

In [18]:
class BERTLM(torch.nn.Module):
    """
    BERT Language Model
    Next Sentence Prediction Model + Masked Language Model
    """

    def __init__(self, bert: BERT, vocab_size):
        """
        :param bert: BERT model which should be trained
        :param vocab_size: total vocab size for masked_lm
        """

        super().__init__()
        self.bert = bert
        self.next_sentence = NextSentencePrediction(self.bert.d_model)
        self.mask_lm = MaskedLanguageModel(self.bert.d_model, vocab_size)

    def forward(self, x, segment_label):
        x = self.bert(x, segment_label)
        return self.next_sentence(x), self.mask_lm(x)


# Custom Learning Rate Scheduler for BERT Pre-Training


### 🚀 **ScheduledOptim**

This class wraps around a PyTorch optimizer (like Adam) and **controls the learning rate dynamically** during training, as suggested in the original Transformer paper.

### ✅ Purpose:
Instead of using a fixed learning rate, this schedule:

1. Warms up the learning rate for a few thousand steps (small → large)

2. Then decays it smoothly as training continues (large → small)

This helps with stable and effective training of deep transformer models like BERT.
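
The schedule can be written as `lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)`. Here is a standalone sketch of that formula in plain Python; `warmup_decay_lr` is a hypothetical helper name, and the defaults (`d_model=768`, `warmup_steps=10000`) are taken from the values used elsewhere in this notebook.

```python
def warmup_decay_lr(step, d_model=768, warmup_steps=10000):
    """Learning rate at a given step (step >= 1): linear warmup, then 1/sqrt(step) decay."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises until step == warmup_steps, then falls.
for step in (1, 1000, 10000, 100000):
    print(step, f"{warmup_decay_lr(step):.2e}")

peak = warmup_decay_lr(10000)  # about 3.6e-4 for these defaults
```

Both branches of the `min` are equal exactly at `step == warmup_steps`, which is where the learning rate peaks. Because the wrapper assigns `param_group['lr']` on every step, whatever `lr` was handed to the inner optimizer at construction time is overwritten by this schedule.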

In [19]:
### Creating the optimiser;

class ScheduledOptim():
    '''A simple wrapper class for learning rate scheduling'''

    def __init__(self, optimizer, d_model, n_warmup_steps):
        self._optimizer = optimizer
        self.n_warmup_steps = n_warmup_steps
        self.n_current_steps = 0
        self.init_lr = np.power(d_model, -0.5)

    def step_and_update_lr(self):
        "Step with the inner optimizer"
        self._update_learning_rate()
        self._optimizer.step()

    def zero_grad(self):
        "Zero out the gradients by the inner optimizer"
        self._optimizer.zero_grad()

    def _get_lr_scale(self):
        return np.min([
            np.power(self.n_current_steps, -0.5),
            np.power(self.n_warmup_steps, -1.5) * self.n_current_steps])

    def _update_learning_rate(self):
        ''' Learning rate scheduling per step '''

        self.n_current_steps += 1
        lr = self.init_lr * self._get_lr_scale()

        for param_group in self._optimizer.param_groups:
            param_group['lr'] = lr

# Finally Pre-Training Begins!!

In [20]:
### Integrating all the above in this trainer class;

# The BERTTrainer class manages the full training loop for the BERT model, 
# including optimization, loss calculation, and evaluation.


## We are using the Negative Log Likelihood Loss function as the objective to minimize.

class BERTTrainer:

    def __init__(
        self,
        model,
        train_dataloader,
        test_dataloader=None,
        lr= 1e-4,
        weight_decay=0.01,
        betas=(0.9, 0.999),
        warmup_steps=10000,
        log_freq=10,
        device='cuda'
        ):

        self.device = device
        self.model = model.to(device)  # keep the weights on the same device as the batches
        self.train_data = train_dataloader
        self.test_data = test_dataloader

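        # Setting the Adam optimizer with hyper-params
        # (note: ScheduledOptim below overwrites `lr` on every step, so the value passed here is only a placeholder)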
        self.optim = Adam(self.model.parameters(), lr=lr, betas=betas, weight_decay=weight_decay)
        self.optim_schedule = ScheduledOptim(
            self.optim, self.model.bert.d_model, n_warmup_steps=warmup_steps
            )


        # Using Negative Log Likelihood Loss function for predicting the masked_token
        self.criterion = torch.nn.NLLLoss(ignore_index=0)
        self.log_freq = log_freq
        print("Total Parameters:", sum([p.nelement() for p in self.model.parameters()]))


    def iteration(self, epoch, data_loader, train=True):

        avg_loss = 0.0
        total_correct = 0
        total_element = 0

        mode = "train" if train else "test"

        # enable dropout while training, disable it while evaluating
        self.model.train(train)

        # progress bar
        data_iter = tqdm.tqdm(
            enumerate(data_loader),
            desc="EP_%s:%d" % (mode, epoch),
            total=len(data_loader),
            bar_format="{l_bar}{r_bar}"
        )

        for i, data in data_iter:

            # 0. batch_data will be sent into the device(GPU or cpu)
            data = {key: value.to(self.device) for key, value in data.items()}


            # 1. forward the next_sentence_prediction and masked_lm model
            next_sent_output, mask_lm_output = self.model.forward(data["bert_input"], data["segment_label"])


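            # 2-1. NLL(negative log likelihood) loss of is_next classification result
            # self.criterion uses ignore_index=0, which would silently drop every label-0 example
            # from this 2-class loss, so the NSP loss is computed without ignore_index.
            next_loss = torch.nn.functional.nll_loss(next_sent_output, data["is_next"])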

            # 2-2. NLLLoss of predicting masked token word
            # transpose to (m, vocab_size, seq_len) vs (m, seq_len)
            # criterion(mask_lm_output.view(-1, mask_lm_output.size(-1)), data["bert_label"].view(-1))
            mask_loss = self.criterion(mask_lm_output.transpose(1, 2), data["bert_label"])

            # 2-3. Adding next_loss and mask_loss : 3.4 Pre-training Procedure
            loss = next_loss + mask_loss

            # 3. backward and optimization only in train
            if train:
                self.optim_schedule.zero_grad()
                loss.backward()
                self.optim_schedule.step_and_update_lr()

            # next sentence prediction accuracy
            correct = next_sent_output.argmax(dim=-1).eq(data["is_next"]).sum().item()
            avg_loss += loss.item()
            total_correct += correct
            total_element += data["is_next"].nelement()

            post_fix = {
                "epoch": epoch,
                "iter": i,
                "avg_loss": avg_loss / (i + 1),
                "avg_acc": total_correct / total_element * 100,
                "loss": loss.item()
            }

            if i % self.log_freq == 0:
                data_iter.write(str(post_fix))
        print(
            f"EP{epoch}, {mode}: "
            f"avg_loss={avg_loss / len(data_iter)}, "
            f"total_acc={total_correct * 100.0 / total_element}"
        )


    def train(self, epoch):
        self.iteration(epoch, self.train_data)

    def test(self, epoch):
        self.iteration(epoch, self.test_data, train=False)
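

To see what the masked-token loss in step 2-2 actually computes, here is a small sketch on random data. The toy sizes and label positions are made up, and I'm assuming that `bert_label` holds 0 at every position that was not masked (which is what `ignore_index=0` suggests).

```python
import torch

torch.manual_seed(0)
batch, seq_len, vocab_size = 2, 5, 11  # toy sizes, for illustration only

# Stand-in for the MLM head output: log-probabilities for every token position.
log_probs = torch.randn(batch, seq_len, vocab_size).log_softmax(dim=-1)

# Assumed label layout: 0 everywhere except masked positions, which hold the original token id.
labels = torch.zeros(batch, seq_len, dtype=torch.long)
labels[0, 1] = 7
labels[1, 3] = 4

criterion = torch.nn.NLLLoss(ignore_index=0)

# NLLLoss expects the class dimension second: (batch, vocab_size, seq_len).
loss = criterion(log_probs.transpose(1, 2), labels)

# Same value by hand: mean of -log p(original token) over the two labelled positions.
manual = -(log_probs[0, 1, 7] + log_probs[1, 3, 4]) / 2
print(loss.item(), manual.item())
```

So positions labelled 0 contribute nothing, and the loss is averaged only over the masked tokens. That is the desired behaviour for MLM, but the same `ignore_index=0` is not appropriate for a 2-class target where 0 is a real class.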


In [None]:
%%time

train_data = BERTDataset(
    pairs, seq_len=MAX_LEN, tokenizer=tokenizer)

train_loader = DataLoader(
    train_data, batch_size=32, shuffle=True, pin_memory=True)

bert_model = BERT(
    vocab_size=len(tokenizer.vocab),
    d_model=768,
    n_layers=2,
    heads=12,
    dropout=0.1
)

bert_lm = BERTLM(bert_model, len(tokenizer.vocab))


bert_trainer = BERTTrainer(bert_lm, train_loader, device='cpu')

epochs = 1  ## as a trial, I have taken epochs as 1; can be tweaked by the reader.

for epoch in range(epochs):
    bert_trainer.train(epoch)

Total Parameters: 46699434


EP_train:0:   0%|| 1/6926 [00:03<6:14:23,  3.24s/it]

{'epoch': 0, 'iter': 0, 'avg_loss': 10.940971374511719, 'avg_acc': 56.25, 'loss': 10.940971374511719}


EP_train:0:   0%|| 11/6926 [00:31<5:27:08,  2.84s/it]

{'epoch': 0, 'iter': 10, 'avg_loss': 10.830794247713955, 'avg_acc': 50.85227272727273, 'loss': 10.835719108581543}


EP_train:0:   0%|| 21/6926 [00:58<5:22:27,  2.80s/it]

{'epoch': 0, 'iter': 20, 'avg_loss': 10.78608953385126, 'avg_acc': 50.44642857142857, 'loss': 10.730015754699707}


EP_train:0:   0%|| 31/6926 [01:25<5:09:12,  2.69s/it]

{'epoch': 0, 'iter': 30, 'avg_loss': 10.70475661370062, 'avg_acc': 49.29435483870967, 'loss': 10.442561149597168}


EP_train:0:   1%|| 41/6926 [01:52<5:09:47,  2.70s/it]

{'epoch': 0, 'iter': 40, 'avg_loss': 10.627633815858422, 'avg_acc': 48.6280487804878, 'loss': 10.39102840423584}


EP_train:0:   1%|| 51/6926 [02:20<5:08:17,  2.69s/it]

{'epoch': 0, 'iter': 50, 'avg_loss': 10.55413938036152, 'avg_acc': 49.1421568627451, 'loss': 10.196638107299805}


EP_train:0:   1%|| 61/6926 [02:47<5:11:24,  2.72s/it]

{'epoch': 0, 'iter': 60, 'avg_loss': 10.475267394644316, 'avg_acc': 48.46311475409836, 'loss': 10.047904968261719}


EP_train:0:   1%|| 71/6926 [03:14<5:09:53,  2.71s/it]

{'epoch': 0, 'iter': 70, 'avg_loss': 10.408695408995722, 'avg_acc': 48.723591549295776, 'loss': 9.909573554992676}


EP_train:0:   1%|| 81/6926 [03:42<5:22:05,  2.82s/it]

{'epoch': 0, 'iter': 80, 'avg_loss': 10.34888215712559, 'avg_acc': 48.53395061728395, 'loss': 9.952644348144531}


EP_train:0:   1%|| 91/6926 [04:08<5:01:08,  2.64s/it]

{'epoch': 0, 'iter': 90, 'avg_loss': 10.292921495961618, 'avg_acc': 49.107142857142854, 'loss': 9.7589693069458}


EP_train:0:   1%|| 96/6926 [04:22<5:10:34,  2.73s/it]

# 📘 End Notes


Thanks for walking through this notebook on "Demystifying BERT from Scratch"!

We covered:

- Core components of the BERT architecture

- How attention and embeddings work

- Pretraining objectives (MLM + NSP)

- Training loop and optimization setup


### This notebook aims to help you understand what's going on under the hood, beyond just calling transformers.BertModel.


### **If you found this helpful, please give it an upvote 👍.**



### **💬 Have feedback or questions? Feel free to leave a comment — I'm happy to clarify or improve anything!**


### **Thanks again, and happy learning! 🚀**