## Finetuning DistilBERT using SQuAD 2.0

This notebook contains the following fine-tuning groups:

....

The starting point for the code in this file was the Medium blog post "Question Answering with DistilBERT" (https://medium.com/@sabrinaherbst/question-answering-with-distilbert-ba3e178fdf3d). The main differences are:
 - The DistilBERT model was fine-tuned using SQuAD 2.0, rather than SQuAD 1.0
 - A traditional split was explored that includes unseen test data (i.e., not validation data)
 - Ad hoc settings were added for the dropout rate and the number of attention heads

....

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Load libraries

from transformers import DistilBertModel, DistilBertForMaskedLM, DistilBertConfig, \
            DistilBertTokenizerFast, AutoTokenizer, BertModel, BertForMaskedLM, BertTokenizerFast, BertConfig
from torch import nn
from pathlib import Path
import torch
import pandas as pd
from typing import Optional
from tqdm.auto import tqdm
from torch.optim import AdamW, RMSprop
import numpy as np

import sys
sys.path.append('/content/drive/MyDrive/distilBERT_SQuAD2_w266Project')
from qa_model import QuestionDistilBERT, SimpleQuestionDistilBERT, ReuseQuestionDistilBERT, Dataset, test_model
from util import eval_test_set, count_parameters, print_test_set_incorrect_predictions, \
            analyze_test_set_statistics, analyze_test_set_performance

# Load tokenizer


In [None]:
# Load DistilBERT tokenizer, use uncased (lowercase) vocabulary

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')



# Load data_2 (Traditional split)

Note: Dataset loader is the "Variable-Length Trunc" described in the report

In [None]:
# Get paths for all SQuAD dataset text files in training directory -- data_2, Traditional Split
squad_paths_2 = [str(x) for x in Path('/content/drive/MyDrive/distilBERT_SQuAD2_w266Project/data_2/training_squad').glob('**/*.txt')]

# Create full training dataset (reuses the paths collected above)
dataset_2 = Dataset(squad_paths=squad_paths_2,
                    natural_question_paths=None,
                    hotpotqa_paths=None,
                    tokenizer=tokenizer)

# Create a half size training dataset
full_size = len(dataset_2)
half_size = full_size // 2  # Integer division to get exact half
subset_indices = torch.randperm(full_size)[:half_size]
dataset_2_half = torch.utils.data.Subset(dataset_2, subset_indices)
loader_2_half = torch.utils.data.DataLoader(dataset_2_half, batch_size=8)
loader_2 = torch.utils.data.DataLoader(dataset_2, batch_size=8)

# Print sizes for verification
print(f"Original Training Dataset Size: {full_size}")
print(f"Half Training Dataset Size: {len(dataset_2_half)}")

## load the validation dataset -- used to be labeled as "test", test_dataset changed to val_dataset, test_loader changed to val_loader
val_dataset_2 = Dataset(squad_paths = [str(x) for x in Path('/content/drive/MyDrive/distilBERT_SQuAD2_w266Project/data_2/validation_squad').glob('**/*.txt')],
                       natural_question_paths=None,
                       hotpotqa_paths = None, tokenizer=tokenizer)
val_loader_2 = torch.utils.data.DataLoader(val_dataset_2, batch_size=4)
print(f"Approximate Validation Dataset Size: {len(val_dataset_2)}")

## load the test dataset -- test_dataset and test_loader should not be used during training
test_dataset_2 = Dataset(squad_paths = [str(x) for x in Path('/content/drive/MyDrive/distilBERT_SQuAD2_w266Project/data_2/test_squad').glob('**/*.txt')],
                       natural_question_paths=None,
                       hotpotqa_paths = None, tokenizer=tokenizer)
test_loader_2 = torch.utils.data.DataLoader(test_dataset_2, batch_size=4)
print(f"Approximate Test Dataset Size: {len(test_dataset_2)}")

Original Training Dataset Size: 113000
Half Training Dataset Size: 56500
Approximate Validation Dataset Size: 14000
Approximate Test Dataset Size: 14000
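The dataset sizes and batch sizes above fix how many batches each loader yields per epoch. A minimal sketch of that arithmetic, assuming the loaders keep PyTorch's default `drop_last=False` (the helper name `num_batches` is only illustrative):

```python
import math

# Sizes printed by the cell above.
train_size, half_size, val_size, test_size = 113000, 56500, 14000, 14000


def num_batches(n_examples: int, batch_size: int) -> int:
    """Batches per epoch when the last partial batch is kept (drop_last=False)."""
    return math.ceil(n_examples / batch_size)


batches = {
    "loader_2": num_batches(train_size, 8),
    "loader_2_half": num_batches(half_size, 8),
    "val_loader_2": num_batches(val_size, 4),
    "test_loader_2": num_batches(test_size, 4),
}

for name, count in batches.items():
    print(f"{name}: {count} batches per epoch")
```

The half-size loader ends with one partial batch of 4 examples, since 56500 is not a multiple of 8.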


# Model_6
Train DistilBERT using SQuAD 2.0 --- data_2, Traditional Split but with half the training data  (40% rather than 80%); Alter training parameters to 4 epochs with dropout rate of 0.225
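The "40% rather than 80%" figure can be checked against the printed dataset sizes. This is a rough sketch that assumes the training, validation, and test splits together make up the whole dataset (the validation and test sizes are labeled approximate in the loader output):

```python
# Sizes printed when the datasets were loaded.
train, val, test = 113000, 14000, 14000

# Assumption: the three splits together cover the full dataset.
total = train + val + test

full_fraction = train / total          # share used by the full training set
half_fraction = (train // 2) / total   # share used by the half-size subset

print(f"full training set: {full_fraction:.1%} of all examples")
print(f"half training set: {half_fraction:.1%} of all examples")
```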

In [None]:
# Load pre-trained DistilBERT model for masked language modeling
model_6 = DistilBertForMaskedLM.from_pretrained("distilbert-base-uncased")

# Get model configuration (architecture, hyperparameters)
config_6 = DistilBertConfig.from_pretrained("distilbert-base-uncased")

# Extract base DistilBERT model without MLM head
mod_6 = model_6.distilbert


In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model_6 = SimpleQuestionDistilBERT(mod_6)
model_6.set_dropout_rate(0.225)
model_6.to(device)

# Verify the dropout rates for each layer
for name, module in model_6.named_modules():
    if isinstance(module, nn.Dropout):
        print(f"{name}: {module.p}")

set_dropout_rate
distilbert.embeddings.dropout: 0.225
distilbert.transformer.layer.0.attention.dropout: 0.225
distilbert.transformer.layer.0.ffn.dropout: 0.225
distilbert.transformer.layer.1.attention.dropout: 0.225
distilbert.transformer.layer.1.ffn.dropout: 0.225
distilbert.transformer.layer.2.attention.dropout: 0.225
distilbert.transformer.layer.2.ffn.dropout: 0.225
distilbert.transformer.layer.3.attention.dropout: 0.225
distilbert.transformer.layer.3.ffn.dropout: 0.225
distilbert.transformer.layer.4.attention.dropout: 0.225
distilbert.transformer.layer.4.ffn.dropout: 0.225
distilbert.transformer.layer.5.attention.dropout: 0.225
distilbert.transformer.layer.5.ffn.dropout: 0.225
dropout: 0.225
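The `set_dropout_rate` method lives in the project's `qa_model` module and is not shown in this notebook. One plausible way to implement such a setter is to walk the module tree and overwrite `p` on every `nn.Dropout`; the function below is a hypothetical stand-in demonstrated on a toy network, not the project's actual code:

```python
from torch import nn


def set_dropout_rate(model: nn.Module, p: float) -> int:
    """Set the drop probability on every nn.Dropout submodule.

    Returns the number of dropout layers that were updated.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError(f"dropout rate must be in [0, 1), got {p}")
    updated = 0
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = p
            updated += 1
    return updated


# Toy stand-in for the QA model (assumption: any nn.Module works the same way).
toy = nn.Sequential(nn.Linear(8, 8), nn.Dropout(0.1), nn.Linear(8, 2), nn.Dropout(0.1))
updated = set_dropout_rate(toy, 0.225)
rates = [m.p for m in toy.modules() if isinstance(m, nn.Dropout)]
print(updated, rates)
```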


In [None]:
# Set model to training mode (enables dropout, batch norm, etc.)
model_6.train()

# Initialize RMSprop optimizer with a learning rate of 4e-5
optim = RMSprop(model_6.parameters(), lr=4e-5)

In [None]:
# Train the DistilBERT model with data_2 for 4 epochs
# Validation data used to evaluate performance during training

epochs = 4

for epoch in range(epochs):
   # Training loop
   loop = tqdm(loader_2_half, leave=True)  # Progress bar for training batches
   model_6.train()  # Set model to training mode
   mean_training_loss = []

   for batch in loop:
       # Zero gradients at start of each batch
       optim.zero_grad()

       # Move batch data to GPU/CPU device
       input_ids = batch['input_ids'].to(device)
       attention_mask = batch['attention_mask'].to(device)
       start = batch['start_positions'].to(device)
       end = batch['end_positions'].to(device)

       # Forward pass
       outputs = model_6(input_ids, attention_mask=attention_mask,
                      start_positions=start, end_positions=end)
       loss = outputs['loss']

       # Backward pass and optimization
       loss.backward()
       optim.step()

       # Track and display training progress
       mean_training_loss.append(loss.item())
       loop.set_description(f'Epoch {epoch}')
       loop.set_postfix(loss=loss.item())
   print("Mean Training Loss", np.mean(mean_training_loss))

   # Validation loop
   loop = tqdm(val_loader_2, leave=True)  # Progress bar for validation batches
   model_6.eval()  # Set model to evaluation mode
   mean_val_loss = []

   for batch in loop:
       # Move batch data to GPU/CPU device
       input_ids = batch['input_ids'].to(device)
       attention_mask = batch['attention_mask'].to(device)
       start = batch['start_positions'].to(device)
       end = batch['end_positions'].to(device)

       # Forward pass (gradient tracking disabled for validation)
       with torch.no_grad():
           outputs = model_6(input_ids, attention_mask=attention_mask,
                             start_positions=start, end_positions=end)
       loss = outputs['loss']

       # Track and display validation progress
       mean_val_loss.append(loss.item())
       loop.set_description(f'Epoch {epoch} Validation set')
       loop.set_postfix(loss=loss.item())
   print("Mean Validation Loss", np.mean(mean_val_loss))

Mean Training Loss 1.5739412295776978
Mean Validation Loss 1.5015438418643816
Mean Training Loss 1.1840836575736613
Mean Validation Loss 1.3018232911175915
Mean Training Loss 0.9191326998943194
Mean Validation Loss 1.4525633385841337
Mean Training Loss 0.7897737834755024
Mean Validation Loss 1.3999360280877777
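In the losses logged for Model_6, training loss falls every epoch while validation loss is lowest after the second epoch, which suggests overfitting in the later epochs. A small sketch that picks the best epoch from the logged values (epoch indices are 0-based, as in the training loop):

```python
# Per-epoch losses copied from the Model_6 training output.
train_loss = [1.5739412295776978, 1.1840836575736613, 0.9191326998943194, 0.7897737834755024]
val_loss = [1.5015438418643816, 1.3018232911175915, 1.4525633385841337, 1.3999360280877777]

# Epoch with the lowest validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

# Generalization gap (validation minus training loss) per epoch.
gaps = [v - t for t, v in zip(train_loss, val_loss)]

print(f"best epoch by validation loss: {best_epoch}")
for epoch, gap in enumerate(gaps):
    print(f"epoch {epoch}: gap = {gap:.3f}")
```

Saving a checkpoint at the best epoch, rather than only after the final one, would be one way to act on this.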


In [None]:
# Save trained model parameters (weights & biases)
torch.save(model_6.state_dict(), "simple_distilbert_qa_data_6.model")

In [None]:
# Initialize a new instance of our custom QA model
model_6 = SimpleQuestionDistilBERT(mod_6)

# Load previously saved model parameters from disk
model_6.load_state_dict(torch.load("simple_distilbert_qa_data_6.model"))



<All keys matched successfully>

In [None]:
# Evaluate data_2 performance using test data
eval_test_set(model_6, tokenizer, test_loader_2, device)

100%|██████████| 3500/3500 [01:09<00:00, 50.15it/s]

Mean EM:  0.5939285714285715
Mean F-1:  0.6705066789671396
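For reference, EM (exact match) and F-1 are the standard SQuAD answer metrics. The sketch below follows the usual SQuAD-style definitions (lowercasing, stripping punctuation and articles, then comparing tokens); the project's `eval_test_set` in `util` is not shown here and may differ in detail:

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Token-level F1 between the normalized prediction and ground truth."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    if not pred_tokens or not truth_tokens:
        # Unanswerable questions: both empty counts as a match.
        return float(pred_tokens == truth_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))
print(f1_score("in Paris France", "Paris"))
```

F-1 gives partial credit for overlapping tokens, which is why it is at least as high as EM.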





# Model_7

Train DistilBERT using SQuAD 2.0 --- data_2, Traditional Split but with half the training samples (40% rather than 80%); Alter training parameters to 4 epochs with a dropout rate of 0.18

In [None]:
# Load pre-trained DistilBERT model for masked language modeling
model_7 = DistilBertForMaskedLM.from_pretrained("distilbert-base-uncased")

# Get model configuration (architecture, hyperparameters)
config_7 = DistilBertConfig.from_pretrained("distilbert-base-uncased")

# Extract base DistilBERT model without MLM head
mod_7 = model_7.distilbert

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model_7 = SimpleQuestionDistilBERT(mod_7)
model_7.set_dropout_rate(0.18)
model_7.to(device)

# Verify the dropout rates for each layer
for name, module in model_7.named_modules():
    if isinstance(module, nn.Dropout):
        print(f"{name}: {module.p}")

set_dropout_rate
distilbert.embeddings.dropout: 0.18
distilbert.transformer.layer.0.attention.dropout: 0.18
distilbert.transformer.layer.0.ffn.dropout: 0.18
distilbert.transformer.layer.1.attention.dropout: 0.18
distilbert.transformer.layer.1.ffn.dropout: 0.18
distilbert.transformer.layer.2.attention.dropout: 0.18
distilbert.transformer.layer.2.ffn.dropout: 0.18
distilbert.transformer.layer.3.attention.dropout: 0.18
distilbert.transformer.layer.3.ffn.dropout: 0.18
distilbert.transformer.layer.4.attention.dropout: 0.18
distilbert.transformer.layer.4.ffn.dropout: 0.18
distilbert.transformer.layer.5.attention.dropout: 0.18
distilbert.transformer.layer.5.ffn.dropout: 0.18
dropout: 0.18


In [None]:
# Set model to training mode (enables dropout, batch norm, etc.)
model_7.train()

# Initialize RMSprop optimizer with a learning rate of 4e-5
optim = RMSprop(model_7.parameters(), lr=4e-5)

In [None]:
# Train the DistilBERT model with data_2 for 4 epochs
# Validation data used to evaluate performance during training

epochs = 4

for epoch in range(epochs):
   # Training loop
   loop = tqdm(loader_2_half, leave=True)  # Progress bar for training batches
   model_7.train()  # Set model to training mode
   mean_training_loss = []

   for batch in loop:
       # Zero gradients at start of each batch
       optim.zero_grad()

       # Move batch data to GPU/CPU device
       input_ids = batch['input_ids'].to(device)
       attention_mask = batch['attention_mask'].to(device)
       start = batch['start_positions'].to(device)
       end = batch['end_positions'].to(device)

       # Forward pass
       outputs = model_7(input_ids, attention_mask=attention_mask,
                      start_positions=start, end_positions=end)
       loss = outputs['loss']

       # Backward pass and optimization
       loss.backward()
       optim.step()

       # Track and display training progress
       mean_training_loss.append(loss.item())
       loop.set_description(f'Epoch {epoch}')
       loop.set_postfix(loss=loss.item())
   print("Mean Training Loss", np.mean(mean_training_loss))

   # Validation loop
   loop = tqdm(val_loader_2, leave=True)  # Progress bar for validation batches
   model_7.eval()  # Set model to evaluation mode
   mean_val_loss = []

   for batch in loop:
       # Move batch data to GPU/CPU device
       input_ids = batch['input_ids'].to(device)
       attention_mask = batch['attention_mask'].to(device)
       start = batch['start_positions'].to(device)
       end = batch['end_positions'].to(device)

       # Forward pass (gradient tracking disabled for validation)
       with torch.no_grad():
           outputs = model_7(input_ids, attention_mask=attention_mask,
                             start_positions=start, end_positions=end)
       loss = outputs['loss']

       # Track and display validation progress
       mean_val_loss.append(loss.item())
       loop.set_description(f'Epoch {epoch} Validation set')
       loop.set_postfix(loss=loss.item())
   print("Mean Validation Loss", np.mean(mean_val_loss))

Mean Training Loss 1.452944415713629
Mean Validation Loss 1.4451661678882581
Mean Training Loss 1.1292532323434386
Mean Validation Loss 1.3453501685288336
Mean Training Loss 0.8504287181054226
Mean Validation Loss 1.4131012615383203
Mean Training Loss 0.7393749339420849
Mean Validation Loss 1.3905756583940239


In [None]:
# Save trained model parameters (weights & biases)
torch.save(model_7.state_dict(), "simple_distilbert_qa_data_7.model")

In [None]:
# Initialize a new instance of our custom QA model
model_7 = SimpleQuestionDistilBERT(mod_7)

# Load previously saved model parameters from disk
model_7.load_state_dict(torch.load("simple_distilbert_qa_data_7.model"))



<All keys matched successfully>

In [None]:
# Evaluate data_2 performance using test data
eval_test_set(model_7, tokenizer, test_loader_2, device)

100%|██████████| 3500/3500 [01:03<00:00, 54.74it/s]

Mean EM:  0.597
Mean F-1:  0.6734982508505128





# Model_8

Train DistilBERT using SQuAD 2.0 --- data_2, Traditional Split but with half the training examples (40% rather than 80%). Alter training parameters to 4 epochs with a dropout rate of 0.18; additional attention head added to the model

In [None]:
# Load pre-trained DistilBERT model for masked language modeling
model_8 = DistilBertForMaskedLM.from_pretrained("distilbert-base-uncased")

# Get model configuration (architecture, hyperparameters)
config_8 = DistilBertConfig.from_pretrained("distilbert-base-uncased")

# Extract base DistilBERT model without MLM head
mod_8 = model_8.distilbert

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model_8 = QuestionDistilBERT(mod_8)
model_8.set_dropout_rate(0.18)
model_8.to(device)
print(type(mod_8))
print(mod_8)

# Verify the dropout rates for each layer
for name, module in model_8.named_modules():
    if isinstance(module, nn.Dropout):
        print(f"{name}: {module.p}")


Updated dropout rate to 0.18
<class 'transformers.models.distilbert.modeling_distilbert.DistilBertModel'>
DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.18, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): DistilBertSdpaAttention(
          (dropout): Dropout(p=0.18, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.

In [None]:
# Set model to training mode (enables dropout, batch norm, etc.)
model_8.train()

# Initialize RMSprop optimizer with a learning rate of 4e-5
optim = RMSprop(model_8.parameters(), lr=4e-5)

In [None]:
# Train the DistilBERT model with data_2 for 4 epochs
# Validation data used to evaluate performance during training

epochs = 4

for epoch in range(epochs):
   # Training loop
   loop = tqdm(loader_2_half, leave=True)  # Progress bar for training batches
   model_8.train()  # Set model to training mode
   mean_training_loss = []

   for batch in loop:
       # Zero gradients at start of each batch
       optim.zero_grad()

       # Move batch data to GPU/CPU device
       input_ids = batch['input_ids'].to(device)
       attention_mask = batch['attention_mask'].to(device)
       start = batch['start_positions'].to(device)
       end = batch['end_positions'].to(device)

       # Forward pass
       outputs = model_8(input_ids, attention_mask=attention_mask,
                      start_positions=start, end_positions=end)
       loss = outputs['loss']

       # Backward pass and optimization
       loss.backward()
       optim.step()

       # Track and display training progress
       mean_training_loss.append(loss.item())
       loop.set_description(f'Epoch {epoch}')
       loop.set_postfix(loss=loss.item())
   print("Mean Training Loss", np.mean(mean_training_loss))

   # Validation loop
   loop = tqdm(val_loader_2, leave=True)  # Progress bar for validation batches
   model_8.eval()  # Set model to evaluation mode
   mean_val_loss = []

   for batch in loop:
       # Move batch data to GPU/CPU device
       input_ids = batch['input_ids'].to(device)
       attention_mask = batch['attention_mask'].to(device)
       start = batch['start_positions'].to(device)
       end = batch['end_positions'].to(device)

       # Forward pass (gradient tracking disabled for validation)
       with torch.no_grad():
           outputs = model_8(input_ids, attention_mask=attention_mask,
                             start_positions=start, end_positions=end)
       loss = outputs['loss']

       # Track and display validation progress
       mean_val_loss.append(loss.item())
       loop.set_description(f'Epoch {epoch} Validation set')
       loop.set_postfix(loss=loss.item())
   print("Mean Validation Loss", np.mean(mean_val_loss))

Mean Training Loss 3.3183439154978025
Mean Validation Loss 3.0314475964988983
Mean Training Loss 2.988504566012775
Mean Validation Loss 2.6414309091567993
Mean Training Loss 2.6014051500332855
Mean Validation Loss 2.3561602783032827
Mean Training Loss 2.388093977639143
Mean Validation Loss 2.2573356208460673


In [None]:
# Save trained model parameters (weights & biases)
torch.save(model_8.state_dict(), "simple_distilbert_qa_data_8.model")

NameError: name 'model_8' is not defined

In [None]:
# Initialize a new instance of our custom QA model
model_8 = QuestionDistilBERT(mod_8)

# Load previously saved model parameters from disk
model_8.load_state_dict(torch.load("simple_distilbert_qa_data_8.model"))



FileNotFoundError: [Errno 2] No such file or directory: 'simple_distilbert_qa_data_8.model'

In [None]:
# Evaluate data_2 performance using test data
eval_test_set(model_8, tokenizer, test_loader_2, device)

100%|██████████| 3500/3500 [01:53<00:00, 30.75it/s]

Mean EM:  0.3512142857142857
Mean F-1:  0.38114915461706





# Model_9
Train DistilBERT using SQuAD 2.0 --- data_2, Traditional Split
Alter training parameters to 10 epochs with dropout rate of 0.18, additional attention head added to the model (12 heads)
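The head count passed to `set_num_heads` has to divide the hidden size (768 here) so that each head receives an equal slice of the embedding. A quick sketch of that constraint; the helper `head_dim` is illustrative and not part of `qa_model`:

```python
HIDDEN_DIM = 768  # embedding dimension shown in the model printout


def head_dim(hidden_dim: int, num_heads: int) -> int:
    """Per-head dimension; raises if the heads do not split the embedding evenly."""
    if num_heads <= 0 or hidden_dim % num_heads != 0:
        raise ValueError(f"{num_heads} heads do not divide {hidden_dim} evenly")
    return hidden_dim // num_heads


# Head counts up to 16 that are valid for a 768-dimensional embedding.
valid_heads = [h for h in range(1, 17) if HIDDEN_DIM % h == 0]

print(valid_heads)
print(head_dim(HIDDEN_DIM, 12))
```

With 12 heads, each head works on a 64-dimensional slice; a value such as 10 would be rejected.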

In [None]:
# Load pre-trained DistilBERT model for masked language modeling
model_9 = DistilBertForMaskedLM.from_pretrained("distilbert-base-uncased")

# Get model configuration (architecture, hyperparameters)
config_9 = DistilBertConfig.from_pretrained("distilbert-base-uncased")

# Extract base DistilBERT model without MLM head
mod_9 = model_9.distilbert

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model_9 = QuestionDistilBERT(mod_9)
model_9.set_dropout_rate(0.18)
model_9.set_num_heads(12)  # number of heads must divide evenly into the embedding dimension (768)
model_9.to(device)
print(type(mod_9))
print(mod_9)

# Verify the dropout rates for each layer
for name, module in model_9.named_modules():
    if isinstance(module, nn.Dropout):
        print(f"{name}: {module.p}")


Updated dropout rate to 0.18

Updated number of attention heads to 12
<class 'transformers.models.distilbert.modeling_distilbert.DistilBertModel'>
DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.18, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): DistilBertSdpaAttention(
          (dropout): Dropout(p=0.18, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn

In [None]:
# Set model to training mode (enables dropout, batch norm, etc.)
model_9.train()

# Initialize RMSprop optimizer with a learning rate of 4e-5
optim = RMSprop(model_9.parameters(), lr=4e-5)

In [None]:
# Train the DistilBERT model with data_2 for 10 epochs
# Validation data used to evaluate performance during training

epochs = 10

for epoch in range(epochs):
   # Training loop
   loop = tqdm(loader_2, leave=True)  # Progress bar for training batches
   model_9.train()  # Set model to training mode
   mean_training_loss = []

   for batch in loop:
       # Zero gradients at start of each batch
       optim.zero_grad()

       # Move batch data to GPU/CPU device
       input_ids = batch['input_ids'].to(device)
       attention_mask = batch['attention_mask'].to(device)
       start = batch['start_positions'].to(device)
       end = batch['end_positions'].to(device)

       # Forward pass
       outputs = model_9(input_ids, attention_mask=attention_mask,
                      start_positions=start, end_positions=end)
       loss = outputs['loss']

       # Backward pass and optimization
       loss.backward()
       optim.step()

       # Track and display training progress
       mean_training_loss.append(loss.item())
       loop.set_description(f'Epoch {epoch}')
       loop.set_postfix(loss=loss.item())
   print("Mean Training Loss", np.mean(mean_training_loss))

   # Validation loop
   loop = tqdm(val_loader_2, leave=True)  # Progress bar for validation batches
   model_9.eval()  # Set model to evaluation mode
   mean_val_loss = []

   for batch in loop:
       # Move batch data to GPU/CPU device
       input_ids = batch['input_ids'].to(device)
       attention_mask = batch['attention_mask'].to(device)
       start = batch['start_positions'].to(device)
       end = batch['end_positions'].to(device)

       # Forward pass (gradient tracking disabled for validation)
       with torch.no_grad():
           outputs = model_9(input_ids, attention_mask=attention_mask,
                             start_positions=start, end_positions=end)
       loss = outputs['loss']

       # Track and display validation progress
       mean_val_loss.append(loss.item())
       loop.set_description(f'Epoch {epoch} Validation set')
       loop.set_postfix(loss=loss.item())
   print("Mean Validation Loss", np.mean(mean_val_loss))

Mean Training Loss 3.2135616431025276
Mean Validation Loss 2.7207516030243464
Mean Training Loss 2.6083776424500793
Mean Validation Loss 2.234415472592626
Mean Training Loss 2.343679363183216
Mean Validation Loss 2.0535331246852873
Mean Training Loss 2.1932894348887215
Mean Validation Loss 1.979893664726189
Mean Training Loss 2.082632528115163
Mean Validation Loss 1.8781117133413043
Mean Training Loss 1.9941713207274412
Mean Validation Loss 1.828266444998128
Mean Training Loss 1.9138346192119395
Mean Validation Loss 1.7670543297103474
Mean Training Loss 1.846269393032631
Mean Validation Loss 1.6862334455464567
Mean Training Loss 1.7471434933907162
Mean Validation Loss 1.6046015782994882
Mean Training Loss 1.6885964290483864
Mean Validation Loss 1.5729971965210778


In [None]:
# Save trained model parameters (weights & biases)
torch.save(model_9.state_dict(), "simple_distilbert_qa_data_9.model")

In [None]:
# Initialize a new instance of our custom QA model
model_9 = QuestionDistilBERT(mod_9)

# Load previously saved model parameters from disk
model_9.load_state_dict(torch.load("simple_distilbert_qa_data_9.model"))



<All keys matched successfully>

In [None]:
# Evaluate data_2 performance using test data
eval_test_set(model_9, tokenizer, test_loader_2, device)

100%|██████████| 3500/3500 [01:43<00:00, 33.70it/s]

Mean EM:  0.49228571428571427
Mean F-1:  0.5540675839106092
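To compare the runs reported so far, the test-set scores printed by `eval_test_set` for Model_6 through Model_9 can be ranked side by side. The values below are copied from the outputs above; the ranking code itself is only a convenience sketch:

```python
# Test-set scores copied from the eval_test_set outputs for each model.
results = {
    "Model_6": {"em": 0.5939285714285715, "f1": 0.6705066789671396},
    "Model_7": {"em": 0.597, "f1": 0.6734982508505128},
    "Model_8": {"em": 0.3512142857142857, "f1": 0.38114915461706},
    "Model_9": {"em": 0.49228571428571427, "f1": 0.5540675839106092},
}

# Rank models from highest to lowest F-1.
ranked = sorted(results, key=lambda name: results[name]["f1"], reverse=True)

for name in ranked:
    scores = results[name]
    print(f"{name}: EM = {scores['em']:.4f}, F-1 = {scores['f1']:.4f}")
```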





# Model_10
Train DistilBERT using SQuAD 2.0 --- data_2, Traditional Split
Alter training parameters to 20 epochs with a dropout rate of 0.17, additional attention head added to the model (12 heads)

Note: the notebook session lost connectivity while training, so data from epoch 6 was dropped
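One way to limit the damage from a dropped session on long runs is to write a checkpoint after every epoch and resume from the latest one. This was not done in this notebook; the sketch below uses hypothetical helpers (`save_checkpoint`, `load_checkpoint`) and a toy `nn.Linear` model as a stand-in for the QA model:

```python
import os
import tempfile

import torch
from torch import nn
from torch.optim import RMSprop


def save_checkpoint(path, model, optimizer, epoch):
    """Persist model weights, optimizer state, and the last finished epoch."""
    torch.save(
        {
            "epoch": epoch,
            "model_state": model.state_dict(),
            "optim_state": optimizer.state_dict(),
        },
        path,
    )


def load_checkpoint(path, model, optimizer):
    """Restore model and optimizer in place; return the epoch to resume from."""
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state"])
    optimizer.load_state_dict(checkpoint["optim_state"])
    return checkpoint["epoch"] + 1


# Toy stand-in for the QA model and its optimizer.
model = nn.Linear(4, 2)
optimizer = RMSprop(model.parameters(), lr=4e-5)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.pt")
    save_checkpoint(path, model, optimizer, epoch=5)
    saved_weight = model.weight.detach().clone()

    # Simulate losing the in-memory weights, then restore from disk.
    with torch.no_grad():
        model.weight.zero_()
    next_epoch = load_checkpoint(path, model, optimizer)

restored = torch.equal(model.weight, saved_weight)
print(next_epoch, restored)
```

In Colab, writing the checkpoint to the mounted Drive folder rather than local disk would let it survive a runtime reset.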

In [None]:
# Load pre-trained DistilBERT model for masked language modeling
model_10 = DistilBertForMaskedLM.from_pretrained("distilbert-base-uncased")

# Get model configuration (architecture, hyperparameters)
config_10 = DistilBertConfig.from_pretrained("distilbert-base-uncased")

# Extract base DistilBERT model without MLM head
mod_10 = model_10.distilbert

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model_10 = QuestionDistilBERT(mod_10)
model_10.set_dropout_rate(0.17)
model_10.set_num_heads(12)  # number of heads must divide evenly into the embedding dimension (768)
model_10.to(device)
print(type(mod_10))
print(mod_10)

# Verify the dropout rates for each layer
for name, module in model_10.named_modules():
    if isinstance(module, nn.Dropout):
        print(f"{name}: {module.p}")


Updated dropout rate to 0.17

Updated number of attention heads to 12
<class 'transformers.models.distilbert.modeling_distilbert.DistilBertModel'>
DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.17, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): DistilBertSdpaAttention(
          (dropout): Dropout(p=0.17, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn

In [None]:
# Set model to training mode (enables dropout, batch norm, etc.)
model_10.train()

# Initialize RMSprop optimizer with a learning rate of 4e-5
optim = RMSprop(model_10.parameters(), lr=4e-5)

In [None]:
# Train the DistilBERT model with data_2 for 20 epochs
# Validation data used to evaluate performance during training

epochs = 20

for epoch in range(epochs):
   # Training loop
   loop = tqdm(loader_2, leave=True)  # Progress bar for training batches
   model_10.train()  # Set model to training mode
   mean_training_loss = []

   for batch in loop:
       # Zero gradients at start of each batch
       optim.zero_grad()

       # Move batch data to GPU/CPU device
       input_ids = batch['input_ids'].to(device)
       attention_mask = batch['attention_mask'].to(device)
       start = batch['start_positions'].to(device)
       end = batch['end_positions'].to(device)

       # Forward pass
       outputs = model_10(input_ids, attention_mask=attention_mask,
                      start_positions=start, end_positions=end)
       loss = outputs['loss']

       # Backward pass and optimization
       loss.backward()
       optim.step()

       # Track and display training progress
       mean_training_loss.append(loss.item())
       loop.set_description(f'Epoch {epoch}')
       loop.set_postfix(loss=loss.item())
   print("Mean Training Loss", np.mean(mean_training_loss))

   # Validation loop
   loop = tqdm(val_loader_2, leave=True)  # Progress bar for validation batches
   model_10.eval()  # Set model to evaluation mode
   mean_val_loss = []

   for batch in loop:
       # Move batch data to GPU/CPU device
       input_ids = batch['input_ids'].to(device)
       attention_mask = batch['attention_mask'].to(device)
       start = batch['start_positions'].to(device)
       end = batch['end_positions'].to(device)

       # Forward pass (gradient tracking disabled for validation)
       with torch.no_grad():
           outputs = model_10(input_ids, attention_mask=attention_mask,
                              start_positions=start, end_positions=end)
       loss = outputs['loss']

       # Track and display validation progress
       mean_val_loss.append(loss.item())
       loop.set_description(f'Epoch {epoch} Validation set')
       loop.set_postfix(loss=loss.item())
   print("Mean Validation Loss", np.mean(mean_val_loss))

  0%|          | 0/14125 [00:00<?, ?it/s]

Mean Training Loss 3.188515219882526


  0%|          | 0/3500 [00:00<?, ?it/s]

Mean Validation Loss 2.622334641746112


Mean Training Loss 2.5338046327945407
Mean Validation Loss 2.2559419690711158

Mean Training Loss 2.2897869962886372
Mean Validation Loss 2.1231856541633607

Mean Training Loss 2.1466313898078107
Mean Validation Loss 1.9266460951226099

Mean Training Loss 2.0437622722608855
Mean Validation Loss 1.8856808468699455

Mean Training Loss 1.9562416802296596
Mean Validation Loss 1.8577927553313118

Mean Training Loss 1.8855195416961097
Mean Validation Loss 1.8803430575302669

Mean Training Loss 1.8139113435892933
Mean Validation Loss 1.7172345146621977

Mean Training Loss 1.7434528265569063
Mean Validation Loss 1.6863721031376293

Mean Training Loss 1.6827836533605525
Mean Validation Loss 1.6109656227316176

Mean Training Loss 1.6387102873177655
Mean Validation Loss 1.5574541393603598

Mean Training Loss 1.602262474300587
Mean Validation Loss 1.5956747821484294

Mean Training Loss 1.5674825065579034
Mean Validation Loss 1.5621179049440792

Mean Training Loss 1.5390435121154362
Mean Validation Loss 1.5039278400199754

Mean Training Loss 1.4873150732559441
Mean Validation Loss 1.5276385686142104

Mean Training Loss 1.4784157573132388
Mean Validation Loss 1.5228046117595264

Mean Training Loss 1.4421059167142463
Mean Validation Loss 1.480561726191214

Mean Training Loss 1.4152797961973511
Mean Validation Loss 1.477606695424233

Mean Training Loss 1.3961390320558464
Mean Validation Loss 1.4581585787939173

Mean Training Loss 1.3748717046300922
Mean Validation Loss 1.5410569703302213
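
The validation loss in the log above is not monotonic: it reaches 1.458 in the second-to-last epoch and rises to 1.541 in the last one, while the training loss keeps falling. One common way to handle this is to track the best validation loss and keep the checkpoint from that epoch. The following is a minimal sketch of that idea, not part of the original notebook; `EarlyStopper`, `patience`, and `min_delta` are hypothetical names introduced for illustration.

```python
class EarlyStopper:
    """Tracks the best validation loss and signals when to stop (hypothetical helper)."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def update(self, epoch, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
            # Assumption: in the notebook, the best checkpoint could be saved
            # here with torch.save(model_10.state_dict(), ...).
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Mean validation losses of the last five epochs in the log above
# (zero-based epoch indices, matching the `epoch` loop variable).
val_losses = {
    15: 1.5228046117595264,
    16: 1.480561726191214,
    17: 1.477606695424233,
    18: 1.4581585787939173,
    19: 1.5410569703302213,
}

stopper = EarlyStopper(patience=2)
stopped = False
for epoch, val_loss in val_losses.items():
    stopped = stopper.update(epoch, val_loss)
    if stopped:
        break

print("Best epoch:", stopper.best_epoch, "best validation loss:", stopper.best_loss)
```

In the training loop, `stopper.update(epoch, np.mean(mean_val_loss))` would be called once per epoch after the validation pass. With `patience=2`, a single worse epoch at the end does not trigger a stop, but the helper still records which epoch had the lowest validation loss.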


In [None]:
# Save trained model parameters (weights & biases)
torch.save(model_10.state_dict(), "simple_distilbert_qa_data_10.model")

In [None]:
# Initialize a new instance of our custom QA model
model_10 = QuestionDistilBERT(mod_10)

# Load previously saved model parameters from disk
model_10.load_state_dict(torch.load("simple_distilbert_qa_data_10.model"))


<All keys matched successfully>

In [None]:
# Evaluate data_2 performance using test data
eval_test_set(model_10, tokenizer, test_loader_2, device)

100%|██████████| 3500/3500 [01:43<00:00, 33.67it/s]

Mean EM:  0.5390714285714285
Mean F-1:  0.6004884056420002
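
`eval_test_set` is imported from the project's `util` module and its implementation is not shown here, so the exact scoring rules behind the numbers above are not known from this notebook. As a point of reference, the sketch below shows how SQuAD-style exact match (EM) and token-level F1 are conventionally computed for a single prediction. The function names are illustrative and are an assumption, not the project's API.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    """Token-level F1 between the normalized prediction and ground truth."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    # Unanswerable questions (SQuAD 2.0): an empty prediction only scores
    # if the ground truth is empty as well.
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    common = Counter(pred_tokens) & Counter(truth_tokens)  # multiset overlap
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower!"))
print(f1_score("in Paris France", "Paris"))
```

Under these definitions, F1 gives partial credit for overlapping tokens while EM does not, which is consistent with the mean F-1 above being higher than the mean EM.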



