# Fine-Tuning BERT and Exploring Its Variants

## Introduction

Bidirectional Encoder Representations from Transformers (BERT) has significantly advanced Natural Language Processing (NLP) by providing powerful, pre-trained language representations. Fine-tuning BERT for specific downstream tasks enables leveraging its deep understanding of language to achieve state-of-the-art performance.

In this notebook, we will:

1. **Fine-Tune BERT for Various NLP Tasks:**
   - **Question Answering (QA)**
   - **Named Entity Recognition (NER)**

2. **Explore BERT Variants and Extensions:**
   - **RoBERTa**
   - **DistilBERT**
   - **ALBERT**
   - **Domain-Specific BERT Models**

By the end of this notebook, you'll have hands-on experience fine-tuning BERT for different applications and an understanding of its powerful variants.

**Resources for Further Reading:**

- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Devlin et al.
- [The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)](http://jalammar.github.io/illustrated-bert/)
- [Hugging Face Transformers Documentation](https://huggingface.co/transformers/)
- [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)

**Prerequisites:**

- Basic understanding of Python and PyTorch
- Familiarity with neural network concepts
- Knowledge of NLP tasks and tokenization

---

## 1. Fine-Tuning BERT

Fine-tuning involves taking a pre-trained BERT model and training it on a specific task with task-specific data. We'll explore fine-tuning BERT for three popular NLP tasks: Text Classification, Question Answering, and Named Entity Recognition.

### 1.1 Fine-Tuning BERT for Question Answering (QA)

**Objective:** Given a context paragraph and a question, predict the span of text in the context that answers the question using the SQuAD (Stanford Question Answering Dataset).

#### 1.1.1 Setup

First, ensure that the necessary libraries are installed. We'll primarily use Hugging Face's `transformers` library, which provides easy-to-use interfaces for BERT and its variants.


In [3]:
!pip install transformers torch datasets evaluate seqeval matplotlib scikit-learn datasets seaborn

  pid, fd = os.forkpty()




#### 1.1.1 Loading and Preprocessing the SQuAD Dataset

In [4]:
from transformers import BertForQuestionAnswering, AutoTokenizer
from datasets import load_dataset

# Load the SQuAD dataset
squad = load_dataset('squad')

# Inspect the dataset
print(squad)

README.md:   0%|          | 0.00/7.62k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})


# 1.1.2 Tokenization and Alignment

Tokenize the inputs, ensuring that the model can predict the start and end positions of the answer.

In [6]:
import torch


# Initialize the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
qa_model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

# Move the model to the desired device (GPU or CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
qa_model.to(device)

def tokenize_qa(examples):
    return tokenizer(
        examples['question'],
        examples['context'],
        truncation="only_second",  # Truncate only the context
        padding='max_length',
        max_length=512,
        return_offsets_mapping=True  # Needed for answer alignment
    )

# Apply tokenization WITHOUT removing 'id' and 'answers'
tokenized_squad = squad.map(
    tokenize_qa,
    batched=True,
    remove_columns=['title', 'context', 'question']  # Keep 'id' and 'answers'
)

def align_answers(examples):
    start_positions = []
    end_positions = []
    for i in range(len(examples['input_ids'])):
        # Get the answer's character start and end positions
        answer_start_char = examples['answers'][i]['answer_start'][0]
        answer_text = examples['answers'][i]['text'][0]
        answer_end_char = answer_start_char + len(answer_text)

        # Get the offset mappings for the current example
        offsets = examples['offset_mapping'][i]

        # Initialize start and end token positions
        start_pos = 0
        end_pos = 0

        # Find the start token
        for idx, (start, end) in enumerate(offsets):
            if start <= answer_start_char < end:
                start_pos = idx
                break

        # Find the end token
        for idx, (start, end) in enumerate(offsets):
            if start < answer_end_char <= end:
                end_pos = idx
                break

        # Handle cases where the answer might not be fully contained within the max_length
        if not (start_pos and end_pos):
            start_pos = 0
            end_pos = 0

        start_positions.append(start_pos)
        end_positions.append(end_pos)

    examples['start_positions'] = start_positions
    examples['end_positions'] = end_positions
    return examples

# Apply answer alignment WITHOUT removing 'answers' yet
tokenized_squad = tokenized_squad.map(
    align_answers,
    batched=True,
    remove_columns=['offset_mapping']  # Remove offset_mapping after alignment
)

# Verify columns
print("Columns in the dataset:", tokenized_squad['validation'].column_names)
# Expected Output: ['id', 'input_ids', 'token_type_ids', 'attention_mask', 'answers', 'start_positions', 'end_positions']

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/87599 [00:00<?, ? examples/s]

Map:   0%|          | 0/10570 [00:00<?, ? examples/s]

Map:   0%|          | 0/87599 [00:00<?, ? examples/s]

Map:   0%|          | 0/10570 [00:00<?, ? examples/s]

Columns in the dataset: ['id', 'answers', 'input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions']


*Note:* Aligning the character positions of answers to token positions is non-trivial and requires handling the mapping between tokens and original text. Hugging Face provides tools to facilitate this, but for brevity, we'll assume the data is already aligned.

#### 1.1.3 Creating DataLoaders

In [8]:
from torch.utils.data import Dataset, DataLoader

# Set format for PyTorch, including 'id' and 'answers' without converting them to tensors
tokenized_squad.set_format(
    type='torch',
    columns=['input_ids', 'attention_mask', 'start_positions', 'end_positions', 'id', 'answers']
)

# Use a subset for demonstration to reduce training time
train_qa = tokenized_squad['train'].select(range(1000))
test_qa = tokenized_squad['validation'].select(range(100))

train_loader_qa = DataLoader(train_qa, batch_size=4, shuffle=True)
test_loader_qa = DataLoader(test_qa, batch_size=4)

print("Number of training batches (QA):", len(train_loader_qa))
print("Number of testing batches (QA):", len(test_loader_qa))

Number of training batches (QA): 250
Number of testing batches (QA): 25


#### 1.1.4 Defining the QA Model

Load the pre-trained BERT model for question answering.

In [9]:
from transformers import BertForQuestionAnswering


# Verify that 'id' is present in batches
for batch in test_loader_qa:
    print("Batch keys:", batch.keys())
    print("Batch 'id' example:", batch['id'][0])
    break  # Only inspect the first batch

# Reload the model (if not already loaded)
qa_model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
qa_model.to(device)

Batch keys: dict_keys(['id', 'answers', 'input_ids', 'attention_mask', 'start_positions', 'end_positions'])
Batch 'id' example: 56be4db0acb8001400a502ec


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, 

#### 1.1.5 Setting Up the Optimizer and Scheduler

In [10]:
from transformers import AdamW, get_linear_schedule_with_warmup


# Setup optimizer and scheduler
epochs = 2
total_steps = len(train_loader_qa) * epochs

optimizer_qa = AdamW(qa_model.parameters(), lr=3e-5)

scheduler_qa = get_linear_schedule_with_warmup(optimizer_qa,
                                              num_warmup_steps=0,
                                              num_training_steps=total_steps)



#### 1.1.6 Training the QA Model

In [11]:
import evaluate
from tqdm.auto import tqdm

# Initialize the SQuAD metric
squad_metric = evaluate.load('squad')

# Function to calculate QA accuracy (simplified)
def qa_accuracy(preds, labels):
    # This is a simplified version; proper evaluation requires exact match and F1
    return (preds == labels).sum().item() / len(labels)

# Training Loop
qa_model.train()

for epoch in range(epochs):
    print(f'\nEpoch {epoch + 1}/{epochs}')
    total_loss = 0
    progress_bar = tqdm(train_loader_qa, desc="Training QA")
    
    for batch in progress_bar:
        optimizer_qa.zero_grad()
        
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        
        outputs = qa_model(input_ids=input_ids, attention_mask=attention_mask,
                           start_positions=start_positions, end_positions=end_positions)
        loss = outputs.loss
        loss.backward()
        optimizer_qa.step()
        scheduler_qa.step()
        
        total_loss += loss.item()
        progress_bar.set_postfix({'loss': loss.item()})
    
    avg_loss = total_loss / len(train_loader_qa)
    print(f'Training Loss: {avg_loss:.4f}')

Downloading builder script:   0%|          | 0.00/4.53k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/3.32k [00:00<?, ?B/s]


Epoch 1/2


Training QA:   0%|          | 0/250 [00:00<?, ?it/s]

Training Loss: 3.9648

Epoch 2/2


Training QA:   0%|          | 0/250 [00:00<?, ?it/s]

Training Loss: 2.1461


#### 1.1.7 Evaluating the QA Model

Assess the model's performance on the test set using metrics like Exact Match (EM) and F1 score.

In [12]:
# Evaluation Code Starts Here
qa_model.eval()  # Set model to evaluation mode

# Initialize lists to store predictions and references
predictions = []
references = []

# Create a mapping from 'id' to the original example for quick lookup
original_validation = load_dataset('squad', split='validation').select(range(100))
id_to_example = {example['id']: example for example in original_validation}

for batch in tqdm(test_loader_qa, desc="Evaluating QA"):
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    ids = batch['id']  # This is a list of strings

    with torch.no_grad():
        outputs = qa_model(input_ids=input_ids, attention_mask=attention_mask)
        start_logits = outputs.start_logits
        end_logits = outputs.end_logits

    # Move logits to CPU and convert to numpy
    start_logits = start_logits.cpu().numpy()
    end_logits = end_logits.cpu().numpy()

    # Iterate over each example in the batch
    for i in range(len(ids)):
        id_ = ids[i]  # Already a string, no need for .item()

        # Retrieve the original example using the 'id'
        example = id_to_example[id_]
        context = example['context']
        question = example['question']
        answers = example['answers']['text']

        # Get the most likely start and end positions
        start_idx = start_logits[i].argmax()
        end_idx = end_logits[i].argmax()

        # Decode the tokens to get the answer string
        tokens = tokenizer.convert_ids_to_tokens(input_ids[i])

        # Handle cases where end_idx is before start_idx
        if end_idx < start_idx:
            answer = ""
        else:
            # Join the tokens from start_idx to end_idx
            answer_tokens = tokens[start_idx:end_idx + 1]
            # Clean up tokens
            answer = tokenizer.convert_tokens_to_string(answer_tokens)

        # Append only 'id' and 'prediction_text' to predictions
        predictions.append({
            'id': id_,
            'prediction_text': answer
        })

        # Append to references as per SQuAD metric requirements
        references.append({
            'id': id_,
            'answers': {
                'text': answers,
                'answer_start': example['answers']['answer_start']
            }
        })

# Compute the metrics
results = squad_metric.compute(predictions=predictions, references=references)

print(f"\nExact Match: {results['exact_match']:.2f}")
print(f"F1 Score: {results['f1']:.2f}")

Evaluating QA:   0%|          | 0/25 [00:00<?, ?it/s]


Exact Match: 49.00
F1 Score: 53.09


*Note:* For a comprehensive evaluation, use the full SQuAD dataset and proper answer alignment.

---

### 1.2 Fine-Tuning BERT for Named Entity Recognition (NER)

**Objective:** Identify and classify named entities (e.g., person, organization, location) in text using the CoNLL-2003 dataset.

#### 1.2.1 Loading and Preprocessing the CoNLL-2003 Dataset


In [13]:
from transformers import BertForTokenClassification

# Load the CoNLL-2003 NER dataset
ner = load_dataset('conll2003')

# Inspect the dataset
print(ner)

README.md:   0%|          | 0.00/12.3k [00:00<?, ?B/s]

conll2003.py:   0%|          | 0.00/9.57k [00:00<?, ?B/s]

The repository for conll2003 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/conll2003.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


Downloading data:   0%|          | 0.00/983k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14041 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3250 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3453 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})


#### 1.2.2 Tokenization and Alignment

Tokenize the inputs, ensuring that the labels align with the tokenized words.


In [14]:
from transformers import BertTokenizerFast
from torch.utils.data import DataLoader

# Get the label list from the dataset
label_list = ner['train'].features['ner_tags'].feature.names
num_labels = len(label_list)
print("NER Labels:", label_list)

# Initialize the fast tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples['tokens'],
        truncation=True,
        is_split_into_words=True,
        padding='max_length',
        max_length=128
    )
    
    labels = []
    for i, label in enumerate(examples['ner_tags']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)  # Special tokens
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)  # Subsequent tokens in a word
            previous_word_idx = word_idx
        labels.append(label_ids)
    
    tokenized_inputs['labels'] = labels
    return tokenized_inputs

# Apply tokenization and label alignment
tokenized_ner = ner.map(tokenize_and_align_labels, batched=True)

# Set format to PyTorch tensors
tokenized_ner.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

# Create DataLoaders
train_ner = tokenized_ner['train']
validation_ner = tokenized_ner['validation']
test_ner = tokenized_ner['test']

train_loader_ner = DataLoader(train_ner, batch_size=8, shuffle=True)
validation_loader_ner = DataLoader(validation_ner, batch_size=8)
test_loader_ner = DataLoader(test_ner, batch_size=8)

print("Number of training batches (NER):", len(train_loader_ner))
print("Number of testing batches (NER):", len(test_loader_ner))

NER Labels: ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']


Map:   0%|          | 0/14041 [00:00<?, ? examples/s]

Map:   0%|          | 0/3250 [00:00<?, ? examples/s]

Map:   0%|          | 0/3453 [00:00<?, ? examples/s]

Number of training batches (NER): 1756
Number of testing batches (NER): 432


#### 1.2.3 Defining the NER Model

Load the pre-trained BERT model for token classification.


In [15]:
from transformers import BertForTokenClassification

ner_model = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)

# Move the model to the desired device (GPU or CPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
ner_model.to(device)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

#### 1.2.4 Setting Up the Optimizer and Scheduler

In [17]:
from transformers import AdamW, get_linear_schedule_with_warmup

epochs = 3
total_steps = len(train_loader_ner) * epochs

optimizer_ner = AdamW(ner_model.parameters(), lr=2e-5)

scheduler_ner = get_linear_schedule_with_warmup(optimizer_ner,
                                                num_warmup_steps=0,
                                                num_training_steps=total_steps)

#### 1.2.5 Training the NER Model

In [18]:
from tqdm.auto import tqdm

# Function to calculate NER accuracy (token-level)
def ner_accuracy(preds, labels):
    preds = torch.argmax(preds, dim=2)
    valid = labels != -100
    correct = (preds == labels) & valid
    return correct.sum().item() / valid.sum().item()

# Set the model to training mode
ner_model.train()

for epoch in range(epochs):
    print(f'\nEpoch {epoch + 1}/{epochs}')
    total_loss = 0
    total_acc = 0
    num_batches = 0
    progress_bar = tqdm(train_loader_ner, desc="Training NER")
    
    for batch in progress_bar:
        optimizer_ner.zero_grad()
        
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        outputs = ner_model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        logits = outputs.logits
        
        loss.backward()
        optimizer_ner.step()
        scheduler_ner.step()
        
        total_loss += loss.item()
        total_acc += ner_accuracy(logits, labels)
        num_batches += 1
        
        # Update progress bar
        progress_bar.set_postfix({
            'loss': loss.item(),
            'accuracy': total_acc / num_batches  # Avoid division by zero
        })
    
    avg_loss = total_loss / len(train_loader_ner)
    avg_acc = total_acc / len(train_loader_ner)
    print(f'Training Loss: {avg_loss:.4f}, Training NER Accuracy: {avg_acc:.4f}')




Epoch 1/3


Training NER:   0%|          | 0/1756 [00:00<?, ?it/s]

Training Loss: 0.1157, Training NER Accuracy: 0.9679

Epoch 2/3


Training NER:   0%|          | 0/1756 [00:00<?, ?it/s]

Training Loss: 0.0322, Training NER Accuracy: 0.9912

Epoch 3/3


Training NER:   0%|          | 0/1756 [00:00<?, ?it/s]

Training Loss: 0.0176, Training NER Accuracy: 0.9955


#### 1.2.6 Evaluating the NER Model

Assess the model's performance on the test set using metrics like Precision, Recall, and F1 score.


In [19]:
import evaluate

def safe_label_mapping(label_indices, label_list):
    mapped_labels = []
    for idx in label_indices:
        try:
            mapped_labels.append(label_list[idx])
        except IndexError:
            # Handle unexpected label indices
            mapped_labels.append("O")
    return mapped_labels

metric = evaluate.load("seqeval")

# Set ner_model to evaluation mode
ner_model.eval()

# Initialize lists to store predictions and references
all_predictions = []
all_references = []

# Disable gradient computation for evaluation
with torch.no_grad():
    for batch_idx, batch in enumerate(tqdm(test_loader_ner, desc="Evaluating NER")):
        try:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
        except KeyError as e:
            print(f"Missing key in batch {batch_idx}: {e}")
            continue
        
        try:
            # Forward pass
            outputs = ner_model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            
            # Get the predicted class by taking the argmax
            predictions = torch.argmax(logits, dim=-1)
        except Exception as e:
            print(f"Error during model inference on batch {batch_idx}: {e}")
            continue
        
        # Move tensors to CPU and convert to numpy arrays
        predictions = predictions.cpu().numpy()
        true_labels = labels.cpu().numpy()
        masks = attention_mask.cpu().numpy()
        
        for i in range(len(input_ids)):
            # Apply the attention mask to filter out padding tokens
            active_indices = masks[i] == 1
            pred_labels = predictions[i][active_indices]
            true_label_ids = true_labels[i][active_indices]
            
            # Filter out labels with value -100 (ignored index)
            valid_indices = true_label_ids != -100
            pred_labels = pred_labels[valid_indices]
            true_label_ids = true_label_ids[valid_indices]
            
            # Safely map label IDs to label names
            pred_label_names = safe_label_mapping(pred_labels, label_list)
            true_label_names = safe_label_mapping(true_label_ids, label_list)
            
            all_predictions.append(pred_label_names)
            all_references.append(true_label_names)

Downloading builder script:   0%|          | 0.00/6.34k [00:00<?, ?B/s]

Evaluating NER:   0%|          | 0/432 [00:00<?, ?it/s]

In [20]:
results = metric.compute(predictions=all_predictions, references=all_references)


# Display evaluation results
print("\nNER Evaluation Results:")
print(f"Precision: {results.get('overall_precision', 0):.4f}")
print(f"Recall: {results.get('overall_recall', 0):.4f}")
print(f"F1 Score: {results.get('overall_f1', 0):.4f}")
print(f"Accuracy: {results.get('overall_accuracy', 0):.4f}")


NER Evaluation Results:
Precision: 0.8945
Recall: 0.9111
F1 Score: 0.9027
Accuracy: 0.9804


*Note:* For more detailed metrics, refer to the `seqeval` documentation.

---

## 2. Variants and Extensions of BERT

BERT's architecture has inspired numerous variants aimed at improving efficiency, scalability, and performance. We'll explore some of the most notable ones: RoBERTa, DistilBERT, ALBERT, and Domain-Specific BERT models.

### 2.1 RoBERTa (Robustly Optimized BERT Pretraining Approach)

**Key Enhancements:**

- **More Training Data:** Trains on a larger corpus compared to BERT.
- **Dynamic Masking:** Applies masking dynamically during training rather than using a fixed masking pattern.
- **No Next Sentence Prediction (NSP):** Removes the NSP task, focusing solely on Masked Language Modeling (MLM).
- **Larger Batch Sizes and Learning Rates:** Utilizes larger mini-batches and higher learning rates for more efficient training.

**Benefits:**

- Achieves better performance on various NLP benchmarks.
- More robust representations due to optimized training strategies.

**Implementation Example:**

In [22]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification
from transformers import pipeline

# Load RoBERTa tokenizer and model
roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
roberta_model = RobertaForSequenceClassification.from_pretrained('roberta-base')
roberta_model.to(device)

# Example usage with a classification pipeline
roberta_classifier = pipeline('sentiment-analysis', model=roberta_model, tokenizer=roberta_tokenizer, device=0 if torch.cuda.is_available() else -1)

sentences = [
    "I absolutely loved this movie!",
    "This was the worst film I have ever seen."
]

results = roberta_classifier(sentences)
for sentence, sentiment in zip(sentences, results):
    print(f"Sentence: {sentence}")
    print(f"Sentiment: {sentiment}\n")

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentence: I absolutely loved this movie!
Sentiment: {'label': 'LABEL_1', 'score': 0.522160530090332}

Sentence: This was the worst film I have ever seen.
Sentiment: {'label': 'LABEL_1', 'score': 0.5255693197250366}



### 2.2 DistilBERT

**Key Enhancements:**

- **Model Compression:** Reduces the size of BERT by 40% while retaining 97% of its language understanding capabilities.
- **Knowledge Distillation:** Trains a smaller model (student) to replicate the behavior of a larger model (teacher).

**Benefits:**

- Faster inference times.
- Reduced computational and memory requirements.
- Suitable for deployment in resource-constrained environments.

**Implementation Example:**


In [23]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# Load DistilBERT tokenizer and model
distilbert_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
distilbert_model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
distilbert_model.to(device)

# Example usage with a classification pipeline
distilbert_classifier = pipeline('sentiment-analysis', model=distilbert_model, tokenizer=distilbert_tokenizer, device=0 if torch.cuda.is_available() else -1)

sentences = [
    "I absolutely loved this movie!",
    "This was the worst film I have ever seen."
]

results = distilbert_classifier(sentences)
for sentence, sentiment in zip(sentences, results):
    print(f"Sentence: {sentence}")
    print(f"Sentiment: {sentiment}\n")

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentence: I absolutely loved this movie!
Sentiment: {'label': 'LABEL_0', 'score': 0.5197321176528931}

Sentence: This was the worst film I have ever seen.
Sentiment: {'label': 'LABEL_0', 'score': 0.5332385301589966}



### 2.3 ALBERT (A Lite BERT)

**Key Enhancements:**

- **Parameter Sharing:** Shares parameters across layers to significantly reduce the total number of parameters.
- **Factorized Embedding Parameterization:** Separates the size of hidden layers from the size of embeddings, allowing for smaller embedding sizes without compromising model capacity.

**Benefits:**

- Significantly fewer parameters compared to BERT.
- Comparable or better performance with a reduced memory footprint.
- Faster training and inference.

**Implementation Example:**

In [24]:
from transformers import AlbertTokenizer, AlbertForSequenceClassification

# Load ALBERT tokenizer and model
albert_tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
albert_model = AlbertForSequenceClassification.from_pretrained('albert-base-v2')
albert_model.to(device)

# Example usage with a classification pipeline
albert_classifier = pipeline('sentiment-analysis', model=albert_model, tokenizer=albert_tokenizer, device=0 if torch.cuda.is_available() else -1)

sentences = [
    "I absolutely loved this movie!",
    "This was the worst film I have ever seen."
]

results = albert_classifier(sentences)
for sentence, sentiment in zip(sentences, results):
    print(f"Sentence: {sentence}")
    print(f"Sentiment: {sentiment}\n")

tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/760k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.31M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/47.4M [00:00<?, ?B/s]

Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentence: I absolutely loved this movie!
Sentiment: {'label': 'LABEL_1', 'score': 0.6166632175445557}

Sentence: This was the worst film I have ever seen.
Sentiment: {'label': 'LABEL_1', 'score': 0.6308166980743408}



### 2.4 Domain-Specific BERT Models

Pre-trained BERT models tailored to specific domains can outperform general-purpose BERT models on tasks within those domains.

**Examples:**

- **BioBERT:** Specialized for biomedical text mining tasks.
- **SciBERT:** Designed for scientific publications.
- **FinBERT:** Tailored for financial text analysis.

**Implementation Example with BioBERT:**


In [25]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load BioBERT tokenizer and model
biobert_tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
biobert_model = AutoModelForSequenceClassification.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
biobert_model.to(device)

# Example usage with a classification pipeline (assuming a relevant task)
biobert_classifier = pipeline('sentiment-analysis', model=biobert_model, tokenizer=biobert_tokenizer, device=0 if torch.cuda.is_available() else -1)

# Example biomedical sentences
sentences = [
    "The patient was diagnosed with hypertension.",
    "CRISPR-Cas9 is a revolutionary gene-editing tool."
]

results = biobert_classifier(sentences)
for sentence, sentiment in zip(sentences, results):
    print(f"Sentence: {sentence}")
    print(f"Sentiment: {sentiment}\n")

config.json:   0%|          | 0.00/313 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dmis-lab/biobert-base-cased-v1.1 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentence: The patient was diagnosed with hypertension.
Sentiment: {'label': 'LABEL_1', 'score': 0.505736768245697}

Sentence: CRISPR-Cas9 is a revolutionary gene-editing tool.
Sentiment: {'label': 'LABEL_1', 'score': 0.5348502397537231}



## 3. Exploring BERT Variants and Extensions

BERT's architecture has inspired numerous variants aimed at improving efficiency, scalability, and performance. We'll delve into some prominent variants: RoBERTa, DistilBERT, ALBERT, and Domain-Specific BERT models.

### 3.1 RoBERTa (Robustly Optimized BERT Pretraining Approach)

**Key Enhancements:**

- **More Training Data:** Trains on a significantly larger corpus compared to BERT.
- **Dynamic Masking:** Applies masking dynamically during training rather than using a fixed masking pattern.
- **No Next Sentence Prediction (NSP):** Removes the NSP task, focusing solely on Masked Language Modeling (MLM).
- **Larger Batch Sizes and Learning Rates:** Utilizes larger mini-batches and higher learning rates for more efficient training.

**Benefits:**

- Achieves better performance on various NLP benchmarks.
- More robust representations due to optimized training strategies.

**Implementation Example:**


In [26]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Load RoBERTa tokenizer and model
roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
roberta_model = RobertaForSequenceClassification.from_pretrained('roberta-base')
roberta_model.to(device)

# Example usage with a classification pipeline
roberta_classifier = pipeline('sentiment-analysis', model=roberta_model, tokenizer=roberta_tokenizer, device=0 if torch.cuda.is_available() else -1)

sentences = [
    "I absolutely loved this movie!",
    "This was the worst film I have ever seen."
]

results = roberta_classifier(sentences)
for sentence, sentiment in zip(sentences, results):
    print(f"Sentence: {sentence}")
    print(f"Sentiment: {sentiment}\n")

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentence: I absolutely loved this movie!
Sentiment: {'label': 'LABEL_1', 'score': 0.5146975517272949}

Sentence: This was the worst film I have ever seen.
Sentiment: {'label': 'LABEL_1', 'score': 0.5187569856643677}



### 3.2 DistilBERT

**Key Enhancements:**

- **Model Compression:** Reduces the size of BERT by 40% while retaining 97% of its language understanding capabilities.
- **Knowledge Distillation:** Trains a smaller model (student) to replicate the behavior of a larger model (teacher).

**Benefits:**

- Faster inference times.
- Reduced computational and memory requirements.
- Suitable for deployment in resource-constrained environments.

**Implementation Example:**


In [27]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# Load DistilBERT tokenizer and model
distilbert_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
distilbert_model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
distilbert_model.to(device)

# Example usage with a classification pipeline
distilbert_classifier = pipeline('sentiment-analysis', model=distilbert_model, tokenizer=distilbert_tokenizer, device=0 if torch.cuda.is_available() else -1)

sentences = [
    "I absolutely loved this movie!",
    "This was the worst film I have ever seen."
]

results = distilbert_classifier(sentences)
for sentence, sentiment in zip(sentences, results):
    print(f"Sentence: {sentence}")
    print(f"Sentiment: {sentiment}\n")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentence: I absolutely loved this movie!
Sentiment: {'label': 'LABEL_1', 'score': 0.5239751935005188}

Sentence: This was the worst film I have ever seen.
Sentiment: {'label': 'LABEL_1', 'score': 0.5243285298347473}



### 3.3 ALBERT (A Lite BERT)

**Key Enhancements:**

- **Parameter Sharing:** Shares parameters across layers to significantly reduce the total number of parameters.
- **Factorized Embedding Parameterization:** Separates the size of hidden layers from the size of embeddings, allowing for smaller embedding sizes without compromising model capacity.

**Benefits:**

- Significantly fewer parameters compared to BERT.
- Comparable or better performance with a reduced memory footprint.
- Faster training and inference.

**Implementation Example:**

In [28]:
from transformers import AlbertTokenizer, AlbertForSequenceClassification

# Load ALBERT tokenizer and model
albert_tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
albert_model = AlbertForSequenceClassification.from_pretrained('albert-base-v2')
albert_model.to(device)

# Example usage with a classification pipeline
albert_classifier = pipeline('sentiment-analysis', model=albert_model, tokenizer=albert_tokenizer, device=0 if torch.cuda.is_available() else -1)

sentences = [
    "I absolutely loved this movie!",
    "This was the worst film I have ever seen."
]

results = albert_classifier(sentences)
for sentence, sentiment in zip(sentences, results):
    print(f"Sentence: {sentence}")
    print(f"Sentiment: {sentiment}\n")

Some weights of AlbertForSequenceClassification were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentence: I absolutely loved this movie!
Sentiment: {'label': 'LABEL_0', 'score': 0.5040313005447388}

Sentence: This was the worst film I have ever seen.
Sentiment: {'label': 'LABEL_1', 'score': 0.5025427341461182}



### 3.4 Domain-Specific BERT Models

Pre-trained BERT models tailored to specific domains can outperform general-purpose BERT models on tasks within those domains.

**Examples:**

- **BioBERT:** Specialized for biomedical text mining tasks.
- **SciBERT:** Designed for scientific publications.
- **FinBERT:** Tailored for financial text analysis.

**Implementation Example with BioBERT:**

In [29]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load BioBERT tokenizer and model
biobert_tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
biobert_model = AutoModelForSequenceClassification.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
biobert_model.to(device)

# Example usage with a classification pipeline (assuming a relevant task)
biobert_classifier = pipeline('sentiment-analysis', model=biobert_model, tokenizer=biobert_tokenizer, device=0 if torch.cuda.is_available() else -1)

# Example biomedical sentences
sentences = [
    "The patient was diagnosed with hypertension.",
    "CRISPR-Cas9 is a revolutionary gene-editing tool."
]

results = biobert_classifier(sentences)
for sentence, sentiment in zip(sentences, results):
    print(f"Sentence: {sentence}")
    print(f"Sentiment: {sentiment}\n")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dmis-lab/biobert-base-cased-v1.1 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Sentence: The patient was diagnosed with hypertension.
Sentiment: {'label': 'LABEL_1', 'score': 0.6780614852905273}

Sentence: CRISPR-Cas9 is a revolutionary gene-editing tool.
Sentiment: {'label': 'LABEL_1', 'score': 0.6932359337806702}



#### *Note:* Domain-specific models may require task-specific fine-tuning to achieve optimal performance.

---

## 4. Analysis and Insights

### 4.1 Advantages of BERT Over Traditional RNNs

- **Bidirectional Contextualization:** BERT considers both left and right context simultaneously, enabling a deeper understanding of language nuances.
- **Pre-training with MLM and NSP:** BERT's pre-training tasks allow it to learn rich language representations that can be fine-tuned for various downstream tasks.
- **Transfer Learning:** Fine-tuning pre-trained BERT models on specific tasks often leads to state-of-the-art performance with relatively small task-specific datasets.
- **Handling Long-Range Dependencies:** The self-attention mechanism in BERT effectively captures dependencies between distant tokens in a sequence.

### 4.2 Challenges and Considerations

- **Computational Resources:** BERT models are large and require significant memory and computational power, especially during training.
- **Fine-Tuning Sensitivity:** BERT can be sensitive to hyperparameters during fine-tuning, necessitating careful tuning for optimal performance.
- **Interpretability:** While attention mechanisms provide some interpretability, understanding the full decision-making process of BERT remains complex.

---

## 5. Further Steps and Resources

### 5.1 Experiment with Different Tasks

- **Named Entity Recognition (NER)**
- **Question Answering (QA)**
- **Text Summarization**

### 5.2 Explore BERT Variants

- **RoBERTa:** Explore its improved training methodology.
- **DistilBERT:** Implement a distilled version for efficiency.
- **ALBERT:** Experiment with parameter-efficient BERT variants.

### 5.3 Dive Deeper into Transformers

- **Transformer-XL:** Understand its approach to handling longer sequences.
- **GPT Series:** Explore generative capabilities using decoder-only models.

### 5.4 Utilize Hugging Face Resources

- **Hugging Face Models:** Explore a wide range of pre-trained models.
- **Hugging Face Tutorials:** Engage with comprehensive tutorials for various NLP tasks.

**Remember:** Mastering BERT and its variants is pivotal for advancing in modern NLP. Leveraging pre-trained models and understanding their architecture enables you to tackle complex language understanding tasks with efficiency and effectiveness.

---

## References

- [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805) by Devlin et al.
- [The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)](http://jalammar.github.io/illustrated-bert/)
- [Hugging Face Transformers Documentation](https://huggingface.co/transformers/)
- [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)