**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


<p align="center">
<img src="media/bert_header.jpg" alt="BERT" width="500"/>
</p>


# BERT: Bidirectional Encoder Representations from Transformers

***

* Bidirectional Encoder Representations from Transformers (BERT) ([Devlin et al., 2018](https://arxiv.org/abs/1810.04805)) is a deep learning model developed by Google AI Language that significantly advanced Natural Language Processing (NLP), particularly in Natural Language Understanding (NLU). <br><br>
* Many subsequent models, such as RoBERTa ([Liu et al., 2019](https://arxiv.org/abs/1907.11692)), ALBERT ([Lan et al., 2019](https://arxiv.org/abs/1909.11942)), and DistilBERT ([Sanh et al., 2019](https://arxiv.org/abs/1910.01108)), have built upon BERT’s architecture, improving efficiency and performance.<br><br>
* The original BERT model was introduced in 2018, following OpenAI’s Generative Pre-trained Transformer (GPT-1) ([Radford et al., 2018](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)). Both models were based on the Transformer architecture (Vaswani et al., 2017), but they took different approaches: while GPT is a unidirectional model designed for Natural Language Generation (NLG), BERT introduced bidirectional self-attention to improve contextual understanding in NLU tasks. <br><br>
* These two architectures played a pivotal role in modern NLP, with BERT influencing retrieval-based models and GPT evolving into more advanced generative AI systems such as GPT-3 ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)) and ChatGPT.<br><br>
* BERT has seen wide industry applications. For example, Google [integrated BERT into its search algorithms](https://snorkel.ai/large-language-models/bert-models/?utm_source=chatgpt.com) to better understand user queries, leading to more accurate and contextually relevant search results. Other companies, [like Wayfair](https://www.aboutwayfair.com/tech-innovation/bert-does-business-implementing-the-bert-model-for-natural-language-processing-at-wayfair?utm_source=chatgpt.com), have implemented BERT to analyze customer messages, enabling more efficient and accurate responses. <br><br>
* While highly effective for Natural Language Understanding (NLU), BERT is computationally expensive, limited to a 512-token context window, lacks generative capabilities, and inherits biases from its pretraining data. This makes it less suitable for real-time, long-document, or dynamically evolving knowledge tasks: a strong fit for some tasks, but not all.<br><br>
* In December 2024, [ModernBERT](https://huggingface.co/papers/2412.13663) was introduced as a state-of-the-art encoder-only model, offering significant improvements over previous architectures. It supports sequences up to 8,192 tokens and incorporates modern enhancements like Rotary Positional Embeddings (RoPE) and Flash Attention for improved performance and efficiency.<br><br>

***

<br><br>

**In this notebook, we will explore the BERT model and its architecture, and show how to use ModernBERT for various NLP tasks using the Hugging Face Transformers library.**

(Parts of this notebook are loosely adapted from "[A Complete Guide to BERT with Code](https://towardsdatascience.com/a-complete-guide-to-bert-with-code-9f87602e4a11/)" (2024) by Bradney Smith and [github.com/AnswerDotAI/ModernBERT](https://github.com/AnswerDotAI/ModernBERT/blob/main/examples/finetune_modernbert_on_glue.ipynb))

<br><br>

**Note**: If you run into issues with memory or performance, consider using a GPU or a cloud-based service like Google Colab, which offers free GPU access. Apart from the nice header image, there are no external local dependencies for this notebook, which means you can run it on any machine with Python installed. However, you still need to install the relevant packages, which are listed in `environment.yml`.

<br><br>

***

# 1. Introduction to Transformers

In 2017, the Transformer architecture revolutionized natural language processing (NLP) with the publication of the paper "*Attention Is All You Need*" by Vaswani et al.  Unlike older neural network models for language tasks - such as convolutional neural networks (CNNs), recurrent neural networks (RNNs) or long short-term memory networks (LSTMs) - the Transformer relies heavily on a mechanism known as **self-attention**. This approach allows the model to focus on different parts of a sentence (or sequence) when encoding its meaning, which was found to drastically improve both training efficiency and performance on large-scale language tasks.

<br>

<div style="background-color:rgba(4, 12, 78, 0.58); color: #ffffff; font-weight: 700; padding-left: 10px; padding-top: 20px; padding-bottom: 20px"><strong>The original transformer</strong></div>

<div style="background-color:rgb(13, 14, 18); padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px">

<div style="padding-left: 10px; padding-right: 10px; padding-top: 10px; padding-bottom: 30px, align: justify">
<p align="center">
<img src="media/transformer_architecture.png" alt="Transformer Architecture" width="800"/>
</p>
</div>

<p>The original Transformer was designed as a so-called <i>encoder-decoder</i> model, primarily for machine translation. Here’s how it works in simple terms:</p>
<ul>
    <li><strong>Encoder<br><br></strong>
        <ul>
            <li>Converts (or “encodes”) an input sequence (e.g., a sentence in French) into a set of hidden, contextualized vector representations (the hidden states discussed in class 2), often referred to as “contextual embeddings”.<br><br></li>
            <li>Multiple encoder layers apply self-attention to the input tokens. Self-attention means each token can “attend” to all other tokens, learning how each word relates to the rest of the sentence.<br><br></li>
            <li>The encoder produces a context-rich representation of each token, capturing not just its meaning in isolation but its meaning relative to other words in the sequence.<br><br></li>
        </ul>
    </li>
    <li><strong>Decoder<br><br></strong>
        <ul>
            <li>Generates (or “decodes”) an output sequence (e.g., the equivalent sentence in English) based on the encoder’s output.<br><br></li>
            <li>Multiple decoder layers take two inputs:
                <ul>
                    <li>The representations produced by the encoder.</li>
                    <li>A partial sequence of already generated tokens (so the decoder can attend to what it has produced so far).</li>
                </ul>
            <br></li>
            <li>The decoder produces one token at a time, using both the encoder’s context and its own previously generated tokens to create a coherent output sequence.<br><br></li>
        </ul>
    </li>
</ul>

</div>

<br><br>

Because the original Transformer was so successful at translation - an area that requires both deep semantic understanding and fluent generation - researchers quickly realized that the core ideas could be adapted for all sorts of NLP tasks. This led to three major “families” of Transformer-based architectures:


1. **Encoder-Only models** (e.g., BERT series)
    * Focuses on understanding an input sequence deeply by producing a contextual vector representation for each token. Commonly used for classification, question answering, named entity recognition, and other analysis-driven tasks.<br><br>

2. **Decoder-Only models** (e.g., GPT series)
    * Focuses on generating or predicting the next tokens in a sequence, typically autoregressively. Commonly used for text generation, chatbots, and creative writing.<br><br>

3. **Encoder-Decoder models** (e.g., T5)
    * Focuses on combining understanding and generation. Commonly used for translation, summarization, and other tasks that require both comprehension and generation.<br><br>


<div style="padding-left: 10px; padding-right: 10px; padding-top: 10px; padding-bottom: 30px, align: justify">
<p align="center">
<img src="media/transformer_families.webp" alt="Transformer Architecture" width="800"/>
</p>
</div>




# 2. BERT Architecture

Before we dive into the BERT architecture, let's first understand the building blocks and concepts that make up the model:


<div style="background-color:rgba(4, 12, 78, 0.58); color: #ffffff; font-weight: 700; padding-left: 10px; padding-top: 20px; padding-bottom: 20px"><strong>To understand BERT, it helps to understand a few key concepts and components</strong></div>

<div style="background-color:rgb(13, 14, 18); padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px">

<div style="padding-left: 10px; padding-right: 10px; padding-top: 10px; padding-bottom: 30px, align: justify">
</div>

<ul>
    <li><strong>Tokens and Tokenization<br><br></strong>
        <ul>
            <li><strong>Token:</strong> A token is a basic unit of text, which could be a word, subword, or character. In BERT, tokens are the smallest units that the model processes.<br><br></li>
            <li><strong>Tokenization:</strong> The process of converting raw text into tokens. BERT uses WordPiece tokenization, which breaks down words into subwords or characters to handle out-of-vocabulary words.<br><br></li>
            <div align="center"><img src="media/tokenization_BERT.png" alt="Tokenization BERT" width="800"/><br><br></div>
        </ul>
    </li>
    <li><strong>Special Tokens<br><br></strong>
        <ul>
            <li><strong>[CLS]:</strong> A special token added at the beginning of each input sequence. Its final hidden state serves as the aggregate “summary” vector of the whole sequence and is used as the input for classification tasks.<br><br></li>
            <li><strong>[SEP]:</strong> A special token used to separate different sentences in a single input sequence. It helps the model distinguish between different segments.<br><br></li>
            <div align="center"><img src="media/cls_token_bert.png" alt="Special Tokens BERT" width="800"/><br><br></div>
        </ul>
    </li>
    <li><strong>Context Length<br><br></strong>
        <ul>
            <li>BERT can handle input sequences of up to 512 tokens. (ModernBERT can handle up to 8,192.)<br><br></li>
        </ul>
    </li>
    <li><strong>Attention Mechanism<br><br></strong>
        <ul>
            <li>BERT uses self-attention to compute a representation of the input sequence. Each token attends to every other token in the sequence, which helps the model capture contextual relationships.<br><br></li>
            <div align="center"><img src="media/self-attention-exampl.webp" alt="Self-Attention BERT" width="800"/><br><br></div>
        </ul>
    </li>
    <li><strong>Bidirectionality<br><br></strong>
        <ul>
            <li>Unlike traditional left-to-right or right-to-left models, BERT reads the entire sequence of words at once. This bidirectional approach allows it to understand the context of a word based on both its left and right surroundings.<br><br></li>
            <div align="center"><img src="media/bidrectional_example.png" alt="Bidirectional BERT" width="800"/><br><br></div>
        </ul>
    </li>
    <li><strong>Transformer Layers<br><br></strong>
        <ul>
            <li>BERT is composed of multiple transformer layers (12 for BERT-base and 24 for BERT-large). Each layer consists of self-attention and feed-forward neural networks.<br><br></li>
            <div align="center"><img src="media/transformer_layer.png" alt="Transformer Layers BERT" width="800"/><br><br></div>
        </ul>
    </li>
</ul>

</div>

*** 
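To make the tokenization and special-token ideas above a bit more concrete, here is a minimal sketch of greedy longest-match-first subword splitting, the general idea behind WordPiece. The vocabulary `toy_vocab` and the helper names `wordpiece_tokenize` and `encode` are made up for illustration; BERT's real tokenizer uses a learned vocabulary of roughly 30,000 pieces and also handles punctuation and other normalization.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace, tokenize each word, add special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"the", "child", "play", "##ing", "##ed", "un", "##believ", "##able"}

print(encode("The child playing", toy_vocab))
# ['[CLS]', 'the', 'child', 'play', '##ing', '[SEP]']
print(encode("unbelievable", toy_vocab))
# ['[CLS]', 'un', '##believ', '##able', '[SEP]']
```

Note how a word that is not in the vocabulary as a whole ("playing") is still representable through smaller pieces, which is how subword tokenization deals with out-of-vocabulary words.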



### So, how does BERT work?
Having covered tokens, special tokens, maximum sequence length, and the absolute basics of **self-attention**, let's dive into the core of BERT's architecture and training process.

<div align="center">
<img src="media/bert_architecture.webp" alt="BERT Architecture" width="800"/>
</div>

* Recall that BERT is an **encoder-only** model: it uses only the encoder component of the original “vanilla” Transformer architecture and focuses exclusively on understanding the input sequence.

* Because BERT is designed for **language understanding** rather than **text generation**, it only needs the encoder. By stacking multiple encoder layers (12 for BERT-Base or 24 for BERT-Large), BERT can learn increasingly complex and context-rich representations of input tokens.

* Traditional language models often process text directionally - either left-to-right or right-to-left. This can prevent them from seeing future context when predicting or encoding each token. BERT, however, processes all tokens at once, giving it a **bidirectional** (or “non-directional”) view of the entire sequence.  In practice, this means BERT can “see” both the words before and after a given token, leading to richer context and better performance on tasks like:

    - Sentiment Analysis
    - Named Entity Recognition (NER)
    - Question Answering
    - Sentence/Document Classification

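The difference between a bidirectional and a unidirectional view can be sketched with a single attention head in NumPy. This is a simplified illustration, not BERT's actual implementation: the projections `w_q` and `w_k` are random and untrained, there is only one head, and the function name `attention_weights` is our own. The only difference between the two calls is the causal mask that hides future tokens.

```python
import numpy as np


def attention_weights(x, w_q, w_k, causal=False):
    """Return the softmax attention weights for a sequence of token vectors x."""
    q, k = x @ w_q, x @ w_k
    scores = q @ k.T / np.sqrt(k.shape[-1])  # scaled dot-product scores
    if causal:
        # Hide future positions: token i may only attend to tokens 0..i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))    # stand-in token embeddings
w_q = rng.normal(size=(d_model, d_model))  # random, untrained projections
w_k = rng.normal(size=(d_model, d_model))

bidirectional = attention_weights(x, w_q, w_k)                # BERT-style: every token sees every token
unidirectional = attention_weights(x, w_q, w_k, causal=True)  # GPT-style: no access to future tokens

np.set_printoptions(precision=2, suppress=True)
print(bidirectional)
print(unidirectional)
```

In the unidirectional matrix, everything above the diagonal is zero, so the first token can only attend to itself. In the bidirectional matrix, every row spreads its weight over the whole sequence.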
#### Masked Language Modeling (MLM)

<div align="center">
<img src="media/masked_language_modeling.png" alt="Masked Language Modeling" width="800"/>
</div>

BERT is trained using **Masked Language Modeling** (MLM). This essentially means that we can feed the model with a large corpus of text, mask some of the words, and ask the model to predict the masked words. This process encourages the model to learn contextual relationships between words, as it can’t rely on just the next or previous word to make an educated guess.


1. **Random Masking**  
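   - Before feeding a sentence into BERT, 15% of tokens are randomly selected for prediction. Most of them (80%) are replaced with a special `[MASK]` token; in the original BERT recipe, 10% are replaced with a random token and 10% are left unchanged.  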
   - Example: “The child came home from **[MASK]**.”

2. **Prediction Objective**  
   - BERT’s goal is to **recover** the original tokens that were masked.  
   - Each token’s vector representation (from the final encoder layer) is passed through a classification head (a feedforward neural net as we covered in class 2) to predict the masked word.

3. **Self-Attention with Masks**  
   - Because BERT sees the entire sequence (including the `[MASK]` tokens), it uses self-attention to figure out which unmasked tokens can help it guess the masked ones.

4. **Loss Function**  
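   - The loss is computed only on the selected (masked) positions, so the model's weights are updated based on how accurately it predicts those tokens; all other positions are ignored.  
   - (Because only about 15% of the tokens in each sequence contribute to the loss, training can be slower than for a unidirectional model, which learns from every token, but it yields much richer contextual embeddings.)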

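The masking step can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions rather than BERT's actual preprocessing code: the function name `mask_tokens` and the tiny vocabulary are made up, and the 80/10/10 split (replace with `[MASK]`, replace with a random token, keep unchanged) follows the recipe described in the original BERT paper.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, labels); labels are None where no prediction is needed."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if token in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            corrupted.append(token)  # left untouched, not part of the loss
            labels.append(None)
            continue
        labels.append(token)         # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(token)              # 10%: keep the original token
    return corrupted, labels


sentence = "[CLS] the child came home from school [SEP]".split()
toy_vocab = ["the", "child", "came", "home", "from", "school", "dog", "park"]  # hypothetical

# A higher mask_prob than 15% is used here only so that a short sentence gets some masks.
corrupted, labels = mask_tokens(sentence, toy_vocab, mask_prob=0.3, seed=1)
print(corrupted)
print(labels)
```

The `labels` list is what the prediction objective uses: positions with `None` are skipped, and only the selected positions contribute to the loss.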

By masking random tokens, BERT is forced to learn relationships between *all* words in a sentence. It can’t simply rely on the next or previous word to make a guess; it has to consider everything else in the sequence. Over time, this (hopefully!) fosters an exceptionally deep understanding of language structure and context.

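The loss from step 4 can be sketched as a cross-entropy that is averaged over the masked positions only. The logits below are random stand-ins for the output of the classification head, and `mlm_loss` is a name chosen for this example; real training code would typically use a deep learning framework's built-in loss instead.

```python
import numpy as np


def mlm_loss(logits, target_ids, is_masked):
    """Mean cross-entropy over the masked positions only."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(target_ids)), target_ids]  # per-position loss
    return token_nll[is_masked].mean()  # unmasked positions are ignored


rng = np.random.default_rng(0)
seq_len, vocab_size = 6, 10
logits = rng.normal(size=(seq_len, vocab_size))          # stand-in for the model's output scores
target_ids = rng.integers(0, vocab_size, size=seq_len)   # the original token ids
is_masked = np.array([False, False, True, False, True, False])

print(mlm_loss(logits, target_ids, is_masked))
```

Changing the logits at an unmasked position leaves the loss untouched, which is exactly why those positions produce no training signal.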

#### Fine-tuning

<div align="center">
<img src="media/bert_classifier.png" alt="Fine-tuning BERT" width="800"/>
</div>

During pre-training on a massive corpus, BERT learns to predict masked words in its input (MLM) and whether one sentence follows another (Next Sentence Prediction). From there, we can **fine-tune** the pre-trained model for various NLP tasks using our own labeled data. This involves adding a simple task-specific layer on top of the pre-trained BERT model and training it on a specific task (e.g., sentiment analysis or named entity recognition). This allows us to leverage BERT’s deep contextual understanding of language for a wide range of NLP tasks:

- **Classification** (e.g., sentiment, topic): Use the final hidden state of `[CLS]` as input to a simple classifier. <br><br>
- **Token-Level Tasks** (e.g., NER): Each token’s final hidden state can serve as input to a classifier that assigns labels (e.g., “person,” “location,” etc.).<br><br>
- **Question Answering**: Score each token’s final hidden state as a possible start and end of the answer, then select the highest-scoring span in the passage.<br><br>

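The three kinds of task heads listed above are all small linear layers on top of the final hidden states. The NumPy sketch below uses random numbers in place of real BERT outputs and untrained weights (biases are omitted), so the predictions themselves are meaningless; the point is only to show which hidden states each head reads and what shape it produces. All variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden, n_classes, n_tags = 7, 16, 4, 3
H = rng.normal(size=(seq_len, hidden))  # stand-in for BERT's final hidden states; H[0] is [CLS]

# Sequence classification: one linear layer on the [CLS] vector.
W_cls = rng.normal(size=(hidden, n_classes))
class_logits = H[0] @ W_cls                   # shape: (n_classes,)
predicted_class = int(class_logits.argmax())

# Token-level tagging (e.g., NER): the same kind of layer applied to every token.
W_tag = rng.normal(size=(hidden, n_tags))
predicted_tags = (H @ W_tag).argmax(axis=-1)  # shape: (seq_len,)

# Question answering: score every token as a possible answer start and end,
# then pick the best-scoring span with start <= end.
W_qa = rng.normal(size=(hidden, 2))
start_logits, end_logits = (H @ W_qa).T
span_scores = start_logits[:, None] + end_logits[None, :]
span_scores[np.tril_indices(seq_len, k=-1)] = -np.inf  # forbid end < start
start, end = np.unravel_index(span_scores.argmax(), span_scores.shape)

print(predicted_class, predicted_tags, (int(start), int(end)))
```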
---


# 3. ModernBERT

[ModernBERT](https://huggingface.co/docs/transformers/model_doc/modernbert) is a state-of-the-art encoder-only model that builds upon the original BERT architecture. Introduced in December 2024, ModernBERT offers several key improvements over its predecessor, including:

- **Increased Sequence Length**: ModernBERT can handle sequences up to 8,192 tokens, compared to BERT’s 512-token limit. This makes it more suitable for long-document tasks and other applications that require processing large amounts of text.<br><br>
- **Rotary Positional Embeddings (RoPE)**: ModernBERT uses rotary positional embeddings to encode positional information. RoPE rotates the query and key vectors by an angle that depends on each token's position, so attention scores depend on the relative distance between tokens. This helps the model handle long sequences and tasks that depend on token order.<br><br>
- **Flash Attention**: ModernBERT incorporates Flash Attention, a memory-efficient implementation of attention rather than a new attention mechanism. It computes the same result as standard self-attention but avoids materializing the full attention matrix in GPU memory, which makes it faster and well-suited for long sequences.<br><br>

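The key property of rotary embeddings, that attention scores depend on the relative distance between tokens rather than their absolute positions, can be checked numerically. The sketch below is a generic, simplified RoPE for a single 1-D vector with an even number of dimensions; the function name `rope` and the `base` value of 10000 are assumptions for this example and are not taken from ModernBERT's code.

```python
import numpy as np


def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of dimensions of x by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)  # one rotation frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                  # the two coordinates of each pair
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out


rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)  # stand-in query and key vectors

# Same relative offset (2) at two different absolute positions.
score_near = rope(q, 3) @ rope(k, 1)
score_far = rope(q, 10) @ rope(k, 8)
print(np.isclose(score_near, score_far))  # True: only the offset matters
```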
In [6]:
from datasets import load_dataset, DatasetDict
from transformers import pipeline
import numpy as np
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

### 3.0. Load Data

**NOTE**: This will download the dataset from the Hugging Face datasets library. If you've already downloaded it before, it will load from your local cache instead of downloading again. 

Importantly, this also means that the dataset files are stored on your computer, which can take up space over time. If you need to free up space, you can clear the cache:

- **On Mac/Linux**, run:  
    ```bash
    rm -rf ~/.cache/huggingface/datasets
    ```

- **On Windows** (Command Prompt), run:

    ``` bash
    rmdir /s /q %USERPROFILE%\.cache\huggingface\datasets
    ```

Alternatively, you can load the dataset into RAM by setting `keep_in_memory=True`. Note that the downloaded files may still be written to the cache directory:

```python
ag_news_train = load_dataset("fancyzhx/ag_news", split="train[:20%]", keep_in_memory=True)
ag_news_test = load_dataset("fancyzhx/ag_news", split="test", keep_in_memory=True)
```

In [7]:
TRAIN_SIZE = 1 # percent as whole number
TEST_SIZE = 10 # percent as whole number

In [8]:
ag_news_train = load_dataset("fancyzhx/ag_news", split=f"train[:{TRAIN_SIZE}%]", keep_in_memory=True )  # n% of training data
ag_news_test = load_dataset("fancyzhx/ag_news", split=f"test[:{TEST_SIZE}%]", keep_in_memory=True)  # n% of test data

ag_news = DatasetDict({
    "train": ag_news_train,
    "test": ag_news_test
})

ag_news

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1200
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 760
    })
})

### 3.1. Load ModernBERT pipeline

We can make use of the `pipeline` function from the Hugging Face `Transformers` library to load both the model and the tokenizer for ModernBERT. The pipeline automatically tokenizes the input text, runs it through the model, and returns the resulting token embeddings. We request this behavior by setting `task="feature-extraction"`.

In [9]:
embedder = pipeline(
    model="answerdotai/ModernBERT-base",      # model used for embedding
    tokenizer="answerdotai/ModernBERT-base",  # tokenizer used for embedding
    task="feature-extraction",                # feature extraction task (returns embeddings)
    device=0                                  # use GPU 0 if available
)

Device set to use cpu


### 3.2. Encode the data

In this step, we’ll convert each text in our dataset into a numerical representation that machine learning models can understand. To do this, we use the `ModernBERT` model, which takes in text and produces embeddings — vectors of numbers that capture the meaning of the text.

BERT (and ModernBERT) actually creates an embedding for each token (each word or subword) in a sentence, but instead of keeping all of them, we’ll extract just one: the embedding of the special `[CLS]` token. As covered in the introduction, this token appears at the beginning of every input and is designed to represent the meaning of the whole sentence. This way, instead of storing a variable number of embeddings for each text, we keep just one fixed-size vector per sentence, making it much easier to work with in machine learning.

Note that we process the dataset in batches, meaning we send multiple texts through the model at once. This speeds things up, but it also requires more memory. If you run into memory issues, try reducing the batch size (by adjusting the `batch_size` parameter in the `map` function). After this step, each entry in our dataset will have an "embeddings" field that contains a single vector. The dataset will now look like this:


``` python
DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'embeddings'],
        num_rows: 1200
    })
    test: Dataset({
        features: ['text', 'label', 'embeddings'],
        num_rows: 760
    })
})
```

In [10]:
def get_embeddings(data):
    """ Extract the [CLS] embedding for each text. """
    embeddings = embedder(data["text"])  # Full token embeddings
    cls_embeddings = [e[0][0] for e in embeddings]  # Extract first token ([CLS])
    return {"embeddings": cls_embeddings}

ag_news = ag_news.map(get_embeddings, batched=True, batch_size=8)

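To make the indexing `e[0][0]` in `get_embeddings` more concrete, the sketch below builds a fake pipeline output with the nested-list layout that the code above relies on: one entry per text, shaped `[1][num_tokens][hidden_size]`. The numbers are random and the token counts (5 and 9) are arbitrary; only the shapes matter.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size = 768

# Fake pipeline output for a batch of two texts with 5 and 9 tokens:
# one nested list per text, shaped [1][num_tokens][hidden_size].
fake_output = [rng.normal(size=(1, n_tokens, hidden_size)).tolist() for n_tokens in (5, 9)]

# Same indexing as in get_embeddings: first (only) batch entry, first token.
cls_embeddings = [e[0][0] for e in fake_output]

print(np.array(cls_embeddings).shape)  # (2, 768)
```

Even though the two texts have different numbers of tokens, each one ends up as a single vector of the same size, which is what lets us stack them into a regular feature matrix later.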


In [11]:
ag_news

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'embeddings'],
        num_rows: 1200
    })
    test: Dataset({
        features: ['text', 'label', 'embeddings'],
        num_rows: 760
    })
})

Then, we can extract features and labels into X_train, y_train, X_test, y_test to fit with the standard scikit-learn paradigm.

In [12]:

X_train = np.array(ag_news["train"]["embeddings"])  # Feature embeddings
y_train = np.array(ag_news["train"]["label"])       # Labels

X_test = np.array(ag_news["test"]["embeddings"])
y_test = np.array(ag_news["test"]["label"])

# Check shapes
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")


X_train shape: (1200, 768), y_train shape: (1200,)
X_test shape: (760, 768), y_test shape: (760,)


### 3.3. Train a classifier

In [13]:
lr = LogisticRegression(max_iter=1000)

lr.fit(X_train, y_train)

In [14]:
y_pred_train = lr.predict(X_train)

print(classification_report(y_train, y_pred_train))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       273
           1       1.00      1.00      1.00       182
           2       1.00      0.99      1.00       202
           3       1.00      1.00      1.00       543

    accuracy                           1.00      1200
   macro avg       1.00      1.00      1.00      1200
weighted avg       1.00      1.00      1.00      1200



### 3.4. Make predictions

In [15]:
y_pred = lr.predict(X_test)

In [16]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.75      0.76      0.75       197
           1       0.91      0.87      0.89       199
           2       0.66      0.59      0.62       158
           3       0.74      0.82      0.78       206

    accuracy                           0.77       760
   macro avg       0.76      0.76      0.76       760
weighted avg       0.77      0.77      0.77       760

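To read the report above, it helps to see how the two summary rows are derived from the per-class rows. The sketch below recomputes the macro and weighted F1 averages from the (already rounded) values printed in the test-set report, so small rounding differences are possible.

```python
# Per-class rows copied from the test-set report: (precision, recall, f1, support).
report = {
    0: (0.75, 0.76, 0.75, 197),
    1: (0.91, 0.87, 0.89, 199),
    2: (0.66, 0.59, 0.62, 158),
    3: (0.74, 0.82, 0.78, 206),
}

total = sum(support for *_, support in report.values())

# Macro average: every class counts equally.
macro_f1 = sum(f1 for _, _, f1, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * support for _, _, f1, support in report.values()) / total

print(total, round(macro_f1, 2), round(weighted_f1, 2))  # 760 0.76 0.77
```

Comparing these numbers with the near-perfect scores on the training set suggests that the classifier fits the 1,200 training examples much better than unseen data, which is a typical sign of overfitting on a small training sample. Class 2 is also clearly the hardest class for this classifier.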
