**AI & Machine Learning (KAN-CINTO4003U) - Copenhagen Business School | Spring 2025**

***


<p align="center">
<img src="media/bert_header.jpg" alt="BERT" width="800"/>
</p>


# BERT: Bidirectional Encoder Representations from Transformers

***

* Bidirectional Encoder Representations from Transformers (BERT) ([Devlin et al., 2018](https://arxiv.org/abs/1810.04805)) is a deep learning model developed by Google AI Language that significantly advanced Natural Language Processing (NLP), particularly in Natural Language Understanding (NLU). <br><br>
* Many subsequent models, such as RoBERTa ([Liu et al., 2019](https://arxiv.org/abs/1907.11692)), ALBERT ([Lan et al., 2019](https://arxiv.org/abs/1909.11942)), and DistilBERT ([Sanh et al., 2019](https://arxiv.org/abs/1910.01108)), have built upon BERT’s architecture, improving efficiency and performance.<br><br>
* The original BERT model was introduced in 2018, following OpenAI’s Generative Pre-trained Transformer (GPT-1) ([Radford et al., 2018](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)). Both models were based on the Transformer architecture (Vaswani et al., 2017), but they took different approaches: while GPT is a unidirectional model designed for Natural Language Generation (NLG), BERT introduced bidirectional self-attention to improve contextual understanding in NLU tasks. <br><br>
* These two architectures played a pivotal role in modern NLP: BERT influenced retrieval-based models, while GPT evolved into more capable generative AI systems such as GPT-3 ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)) and ChatGPT.<br><br>
* BERT has seen wide industry applications. For example, Google [integrated BERT into its search algorithms](https://snorkel.ai/large-language-models/bert-models/) to better understand user queries, leading to more accurate and contextually relevant search results. Other companies, [like Wayfair](https://www.aboutwayfair.com/tech-innovation/bert-does-business-implementing-the-bert-model-for-natural-language-processing-at-wayfair), have implemented BERT to analyze customer messages, enabling more efficient and accurate responses. <br><br>
* While highly effective for Natural Language Understanding (NLU), BERT is computationally expensive, limited to a 512-token context window, lacks generative capabilities, and inherits biases from its pretraining data. This makes it less suitable for real-time applications, long documents, or tasks that depend on continually changing knowledge. It is a strong fit for some tasks, but not all.<br><br>
* In December 2024, [ModernBERT](https://huggingface.co/papers/2412.13663) was introduced as a state-of-the-art encoder-only model, offering significant improvements over previous architectures. It supports sequences up to 8,192 tokens and incorporates modern enhancements like Rotary Positional Embeddings (RoPE) and Flash Attention for improved performance and efficiency.<br><br>

***

<br><br>

**In this notebook, we will explore how to fine-tune ModernBERT, a modern BERT-style model, on the AG News dataset.**

**DO NOT RUN THIS NOTEBOOK LOCALLY. It is intended to be run on Google Colab.**

If you run on Google Colab, remember to change the runtime to GPU or TPU and install the following libraries:

* transformers
* huggingface_hub
* datasets
* torch
* evaluate
* accelerate
* fastparquet


<br><br>

***

In [None]:
# !pip install transformers huggingface_hub datasets torch evaluate accelerate fastparquet

# 1. Fine-tune ModernBERT

[ModernBERT](https://huggingface.co/docs/transformers/model_doc/modernbert) is a state-of-the-art encoder-only model that builds upon the original BERT architecture. Introduced in December 2024, ModernBERT offers several key improvements over its predecessor, including:

- **Increased Sequence Length**: ModernBERT can handle sequences up to 8,192 tokens, compared to BERT’s 512-token limit. This makes it more suitable for long-document tasks and other applications that require processing large amounts of text.<br><br>
- **Rotary Positional Embeddings (RoPE)**: ModernBERT encodes position by rotating the query and key vectors by an angle that depends on each token’s position, so attention scores depend on the relative distance between tokens. This helps the model handle long sequences and improves performance on tasks that depend on token order.<br><br>
- **Flash Attention**: ModernBERT uses Flash Attention, an exact, memory-efficient implementation of standard attention rather than a new attention mechanism. It avoids materializing the full attention matrix in GPU memory, which reduces memory traffic and speeds up training and inference, especially on long sequences.<br><br>
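
The rotary idea can be illustrated with a few lines of NumPy. This is a minimal sketch, not ModernBERT's actual implementation: the function name `rotary_embed`, the vector size, and the base of 10000 are assumptions chosen for illustration. It shows that after rotation, the dot product between a query and a key depends only on how far apart the two positions are.

```python
import numpy as np


def rotary_embed(x, position, base=10000.0):
    """Rotate a 1D vector of even length by position-dependent angles."""
    half = x.shape[-1] // 2
    # One rotation frequency per pair of dimensions (assumed base, for illustration)
    freqs = base ** (-np.arange(half) / half)
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    # Apply a 2D rotation to each (x1[i], x2[i]) pair
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])


rng = np.random.default_rng(0)
q = rng.normal(size=8)  # hypothetical query vector
k = rng.normal(size=8)  # hypothetical key vector

# Same relative offset (3 positions apart), very different absolute positions
score_a = rotary_embed(q, 5) @ rotary_embed(k, 2)
score_b = rotary_embed(q, 105) @ rotary_embed(k, 102)
print(np.isclose(score_a, score_b))
```

Because each pair of dimensions is only rotated, the length of the vector is unchanged; only the angle between query and key carries the positional signal.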

In [1]:
from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer, ModernBertForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding
import evaluate
import numpy as np


### 2.1. Load data

**NOTE**: This will download the dataset from the Hugging Face datasets library. If you have already downloaded the dataset, it should load from the cache.

In [None]:
ag_news_train = load_dataset("fancyzhx/ag_news", split="train[:10%]")
ag_news_test = load_dataset("fancyzhx/ag_news", split="test")

In [None]:
ag_news = DatasetDict({"train": ag_news_train, "test": ag_news_test})
ag_news

### 2.2. Load ModernBERT tokenizer and model from HuggingFace

In [None]:
# Define the mapping from label ids to label names
id2label = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Sci/Tech'
}

# Define the mapping from label names to label ids (the reverse of id2label)
label2id = {v: k for k, v in id2label.items()}
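
As a rough illustration of how these mappings are used at prediction time, the sketch below turns class scores into readable label names. The `logits` values are made up for the example and do not come from the model.

```python
import numpy as np

id2label = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}
label2id = {name: idx for idx, name in id2label.items()}

# Hypothetical logits for two articles (one row per article, one column per class)
logits = np.array([
    [0.1, 2.3, -0.5, 0.0],
    [-1.2, 0.4, 0.3, 3.1],
])

# The predicted class is the column with the highest score in each row
predicted_ids = logits.argmax(axis=1)
predicted_labels = [id2label[int(i)] for i in predicted_ids]
print(predicted_labels)
```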

In [None]:
# load the model
model = ModernBertForSequenceClassification.from_pretrained("answerdotai/ModernBERT-base", num_labels=4, id2label=id2label, label2id=label2id)

# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

### 2.3. Tokenize and encode the data

In [None]:

def preprocess_function(examples):
    """ Tokenize the text column in the examples. """
    return tokenizer(examples["text"], truncation=True)

tokenized_ag_news = ag_news.map(preprocess_function, batched=True, batch_size=4)

### 2.4. Set evaluation metric

In [None]:
f1 = evaluate.load("f1")


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
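    # AG News has four classes, so F1 needs an averaging mode; "macro" is one reasonable choice
    return f1.compute(predictions=predictions, references=labels, average="macro")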
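
For a four-class problem, the F1 score has to be averaged across classes. The sketch below computes a macro-averaged F1 by hand to show what such a number means. The helper `macro_f1` and the toy labels are assumptions made for illustration; this is not the `evaluate` library's implementation.

```python
import numpy as np


def macro_f1(predictions, references, num_classes):
    """Average the per-class F1 scores, weighting every class equally."""
    predictions = np.asarray(predictions)
    references = np.asarray(references)
    scores = []
    for c in range(num_classes):
        tp = np.sum((predictions == c) & (references == c))  # true positives
        fp = np.sum((predictions == c) & (references != c))  # false positives
        fn = np.sum((predictions != c) & (references == c))  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))


# Toy labels, invented for this example
toy_references = [0, 1, 2, 3, 0, 1]
toy_predictions = [0, 1, 2, 2, 0, 3]
toy_score = macro_f1(toy_predictions, toy_references, num_classes=4)
print(round(toy_score, 3))
```

Macro averaging treats all four classes equally, which suits AG News because its classes are balanced.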

### 2.5. Define a data collator

The data collator takes a list of tokenized samples and assembles them into a batch. `DataCollatorWithPadding` pads each batch to the length of its longest sample and converts the result to PyTorch tensors.

In [None]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
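
To make the padding step concrete, here is a minimal NumPy sketch of what padding to the longest sample in a batch looks like. The helper `pad_batch`, the token ids, and `pad_token_id=0` are placeholders chosen for illustration. The real collator uses the tokenizer's own padding token and returns PyTorch tensors.

```python
import numpy as np


def pad_batch(sequences, pad_token_id=0):
    """Pad token-id lists to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_token_id, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for row, seq in enumerate(sequences):
        input_ids[row, :len(seq)] = seq
        attention_mask[row, :len(seq)] = 1  # 1 marks real tokens, 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for three texts of different lengths
toy_batch = pad_batch([[101, 7, 8, 102], [101, 9, 102], [101, 102]])
print(toy_batch["input_ids"].shape)
print(toy_batch["attention_mask"])
```

Padding per batch rather than to a global maximum keeps batches of short texts small, which saves memory and compute.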

### 2.6. Train the model

In [None]:
training_args = TrainingArguments(
    output_dir="my_awesome_model",  # THIS NEEDS TO CHANGE ON GOOGLE COLAB: "/content/drive/MyDrive/Colab Notebooks/my_awesome_model" or similar. Please check the path.
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ag_news["train"],
    eval_dataset=tokenized_ag_news["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
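
As a back-of-the-envelope check on how long training will run, the number of optimization steps follows from the settings above. This sketch assumes the full AG News training split has 120,000 examples (so `train[:10%]` keeps 12,000), a single GPU, and no gradient accumulation.

```python
import math

# Assumptions: 10% of the 120,000 AG News training examples, one device,
# no gradient accumulation
num_train_examples = 12_000
per_device_train_batch_size = 16
num_train_epochs = 2

steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)
```

With `eval_strategy="epoch"` and `save_strategy="epoch"`, evaluation and checkpointing happen once at the end of each of those epochs.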