## Finetuning DistilBERT using SQuAD 2.0

This notebook contains the following fine-tuning groups:

....

The starting point for the code in this notebook was the Medium blog post "Question Answering with DistilBERT" (https://medium.com/@sabrinaherbst/question-answering-with-distilbert-ba3e178fdf3d). The main differences are:

 - The DistilBERT model was fine-tuned on SQuAD 2.0 rather than SQuAD 1.0
 - A traditional split that includes unseen test data (i.e., not validation data) was explored
 - An ad hoc dropout rate setting was added
 - An alternate dataloader with Fixed-Length Truncation was added
 - A performance analysis for each question category was added

....

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load libraries

from transformers import DistilBertModel, DistilBertForMaskedLM, DistilBertConfig, \
            DistilBertTokenizerFast, AutoTokenizer, BertModel, BertForMaskedLM, BertTokenizerFast, BertConfig
from torch import nn
from pathlib import Path
import torch
import pandas as pd
from typing import Optional
from tqdm.auto import tqdm
from torch.optim import AdamW, RMSprop
import numpy as np

import sys
sys.path.append('/content/drive/MyDrive/distilBERT_SQuAD2_w266Project')
from qa_model import QuestionDistilBERT, SimpleQuestionDistilBERT, ReuseQuestionDistilBERT, Dataset, test_model
from util import eval_test_set, count_parameters, print_test_set_incorrect_predictions, \
                 analyze_test_set_performance, eval_test_set_by_category
from my_distilbert import QADataset

# Load tokenizer


In [None]:
# Load DistilBERT tokenizer, use uncased (lowercase) vocabulary

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')


# Load data_2 (Traditional split)

Note: the QADataset loader is the "Fixed-Length Truncation" loader described in the report.
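The `QADataset` class comes from `my_distilbert`, which is not shown in this notebook. As a rough illustration of what fixed-length truncation generally means, the sketch below cuts every token sequence to a fixed length and right-pads shorter ones. The helper name `truncate_and_pad`, the `max_length` of 6, and the `pad_id` of 0 are assumptions chosen for the example, not values taken from the project code.

```python
from typing import Dict, List


def truncate_and_pad(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Cut a token-id sequence to max_length, then right-pad it to exactly max_length."""
    kept = token_ids[:max_length]      # drop everything past the cutoff
    n_pad = max_length - len(kept)     # 0 when the sequence was long enough
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks real tokens, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# Hypothetical token ids, for illustration only
short = truncate_and_pad([101, 2054, 2003, 102], max_length=6)
long = truncate_and_pad(list(range(1, 11)), max_length=6)
print(short)
print(long)
```

One consequence of this approach is that an answer span located past the cutoff is no longer present in the input, so a loader built this way has to decide how to label such samples.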

In [None]:
# Get paths for all SQuAD dataset text files in training directory -- data_2, Traditional Split
squad_paths_2 = [str(x) for x in Path('/content/drive/MyDrive/distilBERT_SQuAD2_w266Project/data_2/training_squad').glob('**/*.txt')]

# Create training dataset using only SQuAD data, DataLoader with batch size of 8
dataset_2 = QADataset(squad_paths=squad_paths_2,
                  natural_question_paths=None,
                  hotpotqa_paths=None, tokenizer=tokenizer)
loader_2 = torch.utils.data.DataLoader(dataset_2, batch_size=8)
print(f"Approximate Training Dataset Size: {len(dataset_2)}")

## load the validation dataset -- used to be labeled as "test", test_dataset changed to val_dataset, test_loader changed to val_loader
val_dataset_2 = QADataset(squad_paths = [str(x) for x in Path('/content/drive/MyDrive/distilBERT_SQuAD2_w266Project/data_2/validation_squad').glob('**/*.txt')],
                       natural_question_paths=None,
                       hotpotqa_paths = None, tokenizer=tokenizer)
val_loader_2 = torch.utils.data.DataLoader(val_dataset_2, batch_size=4)
print(f"Approximate Validation Dataset Size: {len(val_dataset_2)}")

## load the test dataset -- test_dataset and test_loader should not be used during training
test_dataset_2 = QADataset(squad_paths = [str(x) for x in Path('/content/drive/MyDrive/distilBERT_SQuAD2_w266Project/data_2/test_squad').glob('**/*.txt')],
                       natural_question_paths=None,
                       hotpotqa_paths = None, tokenizer=tokenizer)
test_loader_2 = torch.utils.data.DataLoader(test_dataset_2, batch_size=4)
print(f"Approximate Test Dataset Size: {len(test_dataset_2)}")

Loaded 113506 total samples
Approximate Training Dataset Size: 113506
Loaded 14181 total samples
Approximate Validation Dataset Size: 14181
Loaded 14190 total samples
Approximate Test Dataset Size: 14190
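The reported sizes correspond to roughly an 80/10/10 split. A short calculation, using only the sample counts printed above, makes the proportions explicit:

```python
# Sample counts as printed by the loaders above
sizes = {"train": 113506, "validation": 14181, "test": 14190}

total = sum(sizes.values())
fractions = {name: n / total for name, n in sizes.items()}

for name, frac in fractions.items():
    print(f"{name}: {sizes[name]} samples ({frac:.1%})")
```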


# Model_12
Fine-tune DistilBERT on SQuAD 2.0 using data_2 (traditional split). Training parameters: 3 epochs, a dropout rate of 0.18, and the AdamW optimizer rather than RMSprop.

In [None]:
# Load pre-trained DistilBERT model for masked language modeling
model_12 = DistilBertForMaskedLM.from_pretrained("distilbert-base-uncased")

# Get model configuration (architecture, hyperparameters)
config_12 = DistilBertConfig.from_pretrained("distilbert-base-uncased")

# Extract base DistilBERT model without MLM head
mod_12 = model_12.distilbert

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model_12 = SimpleQuestionDistilBERT(mod_12)
model_12.set_dropout_rate(0.18)
model_12.to(device)

# Verify the dropout rates for each layer
for name, module in model_12.named_modules():
    if isinstance(module, nn.Dropout):
        print(f"{name}: {module.p}")

set_dropout_rate
distilbert.embeddings.dropout: 0.18
distilbert.transformer.layer.0.attention.dropout: 0.18
distilbert.transformer.layer.0.ffn.dropout: 0.18
distilbert.transformer.layer.1.attention.dropout: 0.18
distilbert.transformer.layer.1.ffn.dropout: 0.18
distilbert.transformer.layer.2.attention.dropout: 0.18
distilbert.transformer.layer.2.ffn.dropout: 0.18
distilbert.transformer.layer.3.attention.dropout: 0.18
distilbert.transformer.layer.3.ffn.dropout: 0.18
distilbert.transformer.layer.4.attention.dropout: 0.18
distilbert.transformer.layer.4.ffn.dropout: 0.18
distilbert.transformer.layer.5.attention.dropout: 0.18
distilbert.transformer.layer.5.ffn.dropout: 0.18
dropout: 0.18
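The `set_dropout_rate` method is defined in `qa_model`, which is not shown in this notebook. A minimal sketch of one way such a setter could work, assuming it simply overwrites `p` on every `nn.Dropout` module, is given below. The helper name `set_all_dropout` and the toy model are hypothetical and stand in for the project code.

```python
from torch import nn


def set_all_dropout(model: nn.Module, p: float) -> int:
    """Set the drop probability of every nn.Dropout in `model`; return how many were changed."""
    changed = 0
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = p
            changed += 1
    return changed


# Toy model (an assumption for illustration, not the DistilBERT QA model)
toy = nn.Sequential(nn.Linear(8, 8), nn.Dropout(0.1), nn.Linear(8, 2), nn.Dropout(0.1))
n_changed = set_all_dropout(toy, 0.18)
rates = [m.p for m in toy.modules() if isinstance(m, nn.Dropout)]
print(n_changed, rates)
```

Because `nn.Dropout` reads `p` at call time, changing the attribute after construction takes effect on the next forward pass in training mode.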


In [None]:
# Set model to training mode (enables dropout, batch norm, etc.)
model_12.train()

# Initialize AdamW optimizer with a learning rate of 4e-5
optim = AdamW(model_12.parameters(), lr=4e-5)

In [None]:
# Train the DistilBERT model with data_2 for 3 epochs
# Validation data used to evaluate performance during training

epochs = 3

for epoch in range(epochs):
   # Training loop
   loop = tqdm(loader_2, leave=True)  # Progress bar for training batches
   model_12.train()  # Set model to training mode
   mean_training_loss = []

   for batch in loop:
       # Zero gradients at start of each batch
       optim.zero_grad()

       # Move batch data to GPU/CPU device
       input_ids = batch['input_ids'].to(device)
       attention_mask = batch['attention_mask'].to(device)
       start = batch['start_positions'].to(device)
       end = batch['end_positions'].to(device)

       # Forward pass
       outputs = model_12(input_ids, attention_mask=attention_mask,
                      start_positions=start, end_positions=end)
       loss = outputs['loss']

       # Backward pass and optimization
       loss.backward()
       optim.step()

       # Track and display training progress
       mean_training_loss.append(loss.item())
       loop.set_description(f'Epoch {epoch}')
       loop.set_postfix(loss=loss.item())
   print("Mean Training Loss", np.mean(mean_training_loss))

   # Validation loop
   loop = tqdm(val_loader_2, leave=True)  # Progress bar for validation batches
   model_12.eval()  # Set model to evaluation mode
   mean_val_loss = []

   for batch in loop:
       # Move batch data to GPU/CPU device
       input_ids = batch['input_ids'].to(device)
       attention_mask = batch['attention_mask'].to(device)
       start = batch['start_positions'].to(device)
       end = batch['end_positions'].to(device)

       # Forward pass; torch.no_grad() skips gradient tracking during validation
       with torch.no_grad():
           outputs = model_12(input_ids, attention_mask=attention_mask,
                              start_positions=start, end_positions=end)
       loss = outputs['loss']

       # Track and display validation progress
       mean_val_loss.append(loss.item())
       loop.set_description(f'Epoch {epoch} Validation set')
       loop.set_postfix(loss=loss.item())
   print("Mean Validation Loss", np.mean(mean_val_loss))

Mean Training Loss 2.9327797776843982
Mean Validation Loss 2.5671445438601097

Mean Training Loss 2.4308641801008366
Mean Validation Loss 2.413677657788448

Mean Training Loss 2.1578972860301717
Mean Validation Loss 2.356090803636501


In [None]:
# Save trained model parameters (weights & biases)
torch.save(model_12.state_dict(), "simple_distilbert_qa_data_12.model")

In [None]:
# Initialize a new instance of our custom QA model
model_12 = SimpleQuestionDistilBERT(mod_12)

# Load previously saved model parameters from disk
model_12.load_state_dict(torch.load("simple_distilbert_qa_data_12.model"))

<All keys matched successfully>

In [None]:
# Evaluate performance on the data_2 test set
eval_test_set(model_12, tokenizer, test_loader_2, device)

100%|██████████| 3506/3506 [00:49<00:00, 70.21it/s]

Mean EM:  0.7662078311104772
Mean F-1:  0.8172893287686881
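`eval_test_set` comes from `util`, which is not shown in this notebook, so the exact scoring logic is not visible here. As a reference point, the sketch below shows the usual SQuAD-style definitions of exact match (EM) and token-level F1 on answer strings. It assumes lowercasing and whitespace tokenization only; the official SQuAD script additionally strips punctuation and articles, and the project's implementation may differ.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a simplification of SQuAD normalization)."""
    return text.lower().split()


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall between prediction and truth."""
    pred_tokens, true_tokens = normalize(prediction), normalize(truth)
    if not pred_tokens or not true_tokens:
        # Unanswerable questions (SQuAD 2.0): credit only if both answers are empty
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Black Cat", "black cat"))
print(f1_score("the black cat", "black cat"))
```

The reported Mean EM and Mean F-1 would then be the averages of these per-question scores over the test set.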





# Model_14
Fine-tune DistilBERT on SQuAD 2.0 using data_2 (traditional split). Training parameters: 4 epochs, a dropout rate of 0.18, and the AdamW optimizer rather than RMSprop.

In [None]:
# Load pre-trained DistilBERT model for masked language modeling
model_14 = DistilBertForMaskedLM.from_pretrained("distilbert-base-uncased")

# Get model configuration (architecture, hyperparameters)
config_14 = DistilBertConfig.from_pretrained("distilbert-base-uncased")

# Extract base DistilBERT model without MLM head
mod_14 = model_14.distilbert

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model_14 = SimpleQuestionDistilBERT(mod_14)
model_14.set_dropout_rate(0.18)
model_14.to(device)

# Verify the dropout rates for each layer
for name, module in model_14.named_modules():
    if isinstance(module, nn.Dropout):
        print(f"{name}: {module.p}")

set_dropout_rate
distilbert.embeddings.dropout: 0.18
distilbert.transformer.layer.0.attention.dropout: 0.18
distilbert.transformer.layer.0.ffn.dropout: 0.18
distilbert.transformer.layer.1.attention.dropout: 0.18
distilbert.transformer.layer.1.ffn.dropout: 0.18
distilbert.transformer.layer.2.attention.dropout: 0.18
distilbert.transformer.layer.2.ffn.dropout: 0.18
distilbert.transformer.layer.3.attention.dropout: 0.18
distilbert.transformer.layer.3.ffn.dropout: 0.18
distilbert.transformer.layer.4.attention.dropout: 0.18
distilbert.transformer.layer.4.ffn.dropout: 0.18
distilbert.transformer.layer.5.attention.dropout: 0.18
distilbert.transformer.layer.5.ffn.dropout: 0.18
dropout: 0.18


In [None]:
# Set model to training mode (enables dropout, batch norm, etc.)
model_14.train()

# Initialize AdamW optimizer with a learning rate of 4e-5
optim = AdamW(model_14.parameters(), lr=4e-5)

In [None]:
# Train the DistilBERT model with data_2 for 4 epochs
# Validation data used to evaluate performance during training

epochs = 4

for epoch in range(epochs):
   # Training loop
   loop = tqdm(loader_2, leave=True)  # Progress bar for training batches
   model_14.train()  # Set model to training mode
   mean_training_loss = []

   for batch in loop:
       # Zero gradients at start of each batch
       optim.zero_grad()

       # Move batch data to GPU/CPU device
       input_ids = batch['input_ids'].to(device)
       attention_mask = batch['attention_mask'].to(device)
       start = batch['start_positions'].to(device)
       end = batch['end_positions'].to(device)

       # Forward pass
       outputs = model_14(input_ids, attention_mask=attention_mask,
                      start_positions=start, end_positions=end)
       loss = outputs['loss']

       # Backward pass and optimization
       loss.backward()
       optim.step()

       # Track and display training progress
       mean_training_loss.append(loss.item())
       loop.set_description(f'Epoch {epoch}')
       loop.set_postfix(loss=loss.item())
   print("Mean Training Loss", np.mean(mean_training_loss))

   # Validation loop
   loop = tqdm(val_loader_2, leave=True)  # Progress bar for validation batches
   model_14.eval()  # Set model to evaluation mode
   mean_val_loss = []

   for batch in loop:
       # Move batch data to GPU/CPU device
       input_ids = batch['input_ids'].to(device)
       attention_mask = batch['attention_mask'].to(device)
       start = batch['start_positions'].to(device)
       end = batch['end_positions'].to(device)

       # Forward pass; torch.no_grad() skips gradient tracking during validation
       with torch.no_grad():
           outputs = model_14(input_ids, attention_mask=attention_mask,
                              start_positions=start, end_positions=end)
       loss = outputs['loss']

       # Track and display validation progress
       mean_val_loss.append(loss.item())
       loop.set_description(f'Epoch {epoch} Validation set')
       loop.set_postfix(loss=loss.item())
   print("Mean Validation Loss", np.mean(mean_val_loss))

Mean Training Loss 3.009145540081
Mean Validation Loss 2.6677160452727415

Mean Training Loss 2.4837736385913214
Mean Validation Loss 2.4161130606246286

Mean Training Loss 2.186711376887268
Mean Validation Loss 2.387788159200982

Mean Training Loss 1.9837533045809215
Mean Validation Loss 2.4912494182168015
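One way to read the Model_14 losses logged above is to find the epoch with the lowest validation loss. The snippet below does this using only the printed values; epochs are 0-indexed, matching the `Epoch {epoch}` labels in the training loop.

```python
# Losses copied from the Model_14 training output above
train_losses = [3.009145540081, 2.4837736385913214, 2.186711376887268, 1.9837533045809215]
val_losses = [2.6677160452727415, 2.4161130606246286, 2.387788159200982, 2.4912494182168015]

# Index of the epoch with the lowest mean validation loss
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)

for epoch, (tr, va) in enumerate(zip(train_losses, val_losses)):
    marker = " <- lowest validation loss" if epoch == best_epoch else ""
    print(f"epoch {epoch}: train={tr:.4f} val={va:.4f}{marker}")
```

Training loss keeps falling in the final epoch while validation loss rises, which is commonly read as the onset of overfitting. A possible follow-up, not done in this notebook, would be to save a checkpoint after every epoch and evaluate the one with the lowest validation loss.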


In [None]:
# Save trained model parameters (weights & biases)
torch.save(model_14.state_dict(), "simple_distilbert_qa_data_14.model")

In [None]:
# Initialize a new instance of our custom QA model
model_14 = SimpleQuestionDistilBERT(mod_14)

# Load previously saved model parameters from disk
model_14.load_state_dict(torch.load("simple_distilbert_qa_data_14.model"))

<All keys matched successfully>

In [None]:
# Evaluate performance on the data_2 test set
eval_test_set(model_14, tokenizer, test_loader_2, device)

100%|██████████| 3506/3506 [00:49<00:00, 70.67it/s]

Mean EM:  0.7688467299051422
Mean F-1:  0.8218669553325509





# Model_15
Fine-tune DistilBERT on SQuAD 2.0 using data_2 (traditional split). Training parameters: 4 epochs, a dropout rate of 0.16, and the AdamW optimizer rather than RMSprop.


In [None]:
# Load pre-trained DistilBERT model for masked language modeling
model_15 = DistilBertForMaskedLM.from_pretrained("distilbert-base-uncased")

# Get model configuration (architecture, hyperparameters)
config_15 = DistilBertConfig.from_pretrained("distilbert-base-uncased")

# Extract base DistilBERT model without MLM head
mod_15 = model_15.distilbert

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model_15 = SimpleQuestionDistilBERT(mod_15)
model_15.set_dropout_rate(0.16)
model_15.to(device)

# Verify the dropout rates for each layer
for name, module in model_15.named_modules():
    if isinstance(module, nn.Dropout):
        print(f"{name}: {module.p}")

set_dropout_rate
distilbert.embeddings.dropout: 0.16
distilbert.transformer.layer.0.attention.dropout: 0.16
distilbert.transformer.layer.0.ffn.dropout: 0.16
distilbert.transformer.layer.1.attention.dropout: 0.16
distilbert.transformer.layer.1.ffn.dropout: 0.16
distilbert.transformer.layer.2.attention.dropout: 0.16
distilbert.transformer.layer.2.ffn.dropout: 0.16
distilbert.transformer.layer.3.attention.dropout: 0.16
distilbert.transformer.layer.3.ffn.dropout: 0.16
distilbert.transformer.layer.4.attention.dropout: 0.16
distilbert.transformer.layer.4.ffn.dropout: 0.16
distilbert.transformer.layer.5.attention.dropout: 0.16
distilbert.transformer.layer.5.ffn.dropout: 0.16
dropout: 0.16


In [None]:
# Set model to training mode (enables dropout, batch norm, etc.)
model_15.train()

# Initialize AdamW optimizer with a learning rate of 4e-5
optim = AdamW(model_15.parameters(), lr=4e-5)

In [None]:
# Train the DistilBERT model with data_2 for 4 epochs
# Validation data used to evaluate performance during training

epochs = 4

for epoch in range(epochs):
   # Training loop
   loop = tqdm(loader_2, leave=True)  # Progress bar for training batches
   model_15.train()  # Set model to training mode
   mean_training_loss = []

   for batch in loop:
       # Zero gradients at start of each batch
       optim.zero_grad()

       # Move batch data to GPU/CPU device
       input_ids = batch['input_ids'].to(device)
       attention_mask = batch['attention_mask'].to(device)
       start = batch['start_positions'].to(device)
       end = batch['end_positions'].to(device)

       # Forward pass
       outputs = model_15(input_ids, attention_mask=attention_mask,
                      start_positions=start, end_positions=end)
       loss = outputs['loss']

       # Backward pass and optimization
       loss.backward()
       optim.step()

       # Track and display training progress
       mean_training_loss.append(loss.item())
       loop.set_description(f'Epoch {epoch}')
       loop.set_postfix(loss=loss.item())
   print("Mean Training Loss", np.mean(mean_training_loss))

   # Validation loop
   loop = tqdm(val_loader_2, leave=True)  # Progress bar for validation batches
   model_15.eval()  # Set model to evaluation mode
   mean_val_loss = []

   for batch in loop:
       # Move batch data to GPU/CPU device
       input_ids = batch['input_ids'].to(device)
       attention_mask = batch['attention_mask'].to(device)
       start = batch['start_positions'].to(device)
       end = batch['end_positions'].to(device)

       # Forward pass; torch.no_grad() skips gradient tracking during validation
       with torch.no_grad():
           outputs = model_15(input_ids, attention_mask=attention_mask,
                              start_positions=start, end_positions=end)
       loss = outputs['loss']

       # Track and display validation progress
       mean_val_loss.append(loss.item())
       loop.set_description(f'Epoch {epoch} Validation set')
       loop.set_postfix(loss=loss.item())
   print("Mean Validation Loss", np.mean(mean_val_loss))

Mean Training Loss 2.918621488268947
Mean Validation Loss 2.570065584818251

Mean Training Loss 2.4253601323297316
Mean Validation Loss 2.3954262174322865

Mean Training Loss 2.1365608694584566
Mean Validation Loss 2.4227985396873133

Mean Training Loss 1.9464316566053483
Mean Validation Loss 2.4746865601897516


In [None]:
# Save trained model parameters (weights & biases)
torch.save(model_15.state_dict(), "simple_distilbert_qa_data_15.model")

In [None]:
# Initialize a new instance of our custom QA model
model_15 = SimpleQuestionDistilBERT(mod_15)

# Load previously saved model parameters from disk
model_15.load_state_dict(torch.load("simple_distilbert_qa_data_15.model"))

<All keys matched successfully>

In [None]:
# Evaluate performance on the data_2 test set
eval_test_set(model_15, tokenizer, test_loader_2, device)

100%|██████████| 3506/3506 [00:50<00:00, 69.86it/s]

Mean EM:  0.763996861850082
Mean F-1:  0.8176282045263565





# Model_17
Fine-tune DistilBERT on SQuAD 2.0 using data_2 (traditional split). Training parameters: 4 epochs, a dropout rate of 0.13, and the AdamW optimizer rather than RMSprop.

In [None]:
# Load pre-trained DistilBERT model for masked language modeling
model_17 = DistilBertForMaskedLM.from_pretrained("distilbert-base-uncased")

# Get model configuration (architecture, hyperparameters)
config_17 = DistilBertConfig.from_pretrained("distilbert-base-uncased")

# Extract base DistilBERT model without MLM head
mod_17 = model_17.distilbert

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model_17 = SimpleQuestionDistilBERT(mod_17)
model_17.set_dropout_rate(0.13)
model_17.to(device)

# Verify the dropout rates for each layer
for name, module in model_17.named_modules():
    if isinstance(module, nn.Dropout):
        print(f"{name}: {module.p}")

set_dropout_rate
distilbert.embeddings.dropout: 0.13
distilbert.transformer.layer.0.attention.dropout: 0.13
distilbert.transformer.layer.0.ffn.dropout: 0.13
distilbert.transformer.layer.1.attention.dropout: 0.13
distilbert.transformer.layer.1.ffn.dropout: 0.13
distilbert.transformer.layer.2.attention.dropout: 0.13
distilbert.transformer.layer.2.ffn.dropout: 0.13
distilbert.transformer.layer.3.attention.dropout: 0.13
distilbert.transformer.layer.3.ffn.dropout: 0.13
distilbert.transformer.layer.4.attention.dropout: 0.13
distilbert.transformer.layer.4.ffn.dropout: 0.13
distilbert.transformer.layer.5.attention.dropout: 0.13
distilbert.transformer.layer.5.ffn.dropout: 0.13
dropout: 0.13


In [None]:
# Set model to training mode (enables dropout, batch norm, etc.)
model_17.train()

# Initialize AdamW optimizer with a learning rate of 4e-5
optim = AdamW(model_17.parameters(), lr=4e-5)

In [None]:
# Train the DistilBERT model with data_2 for 4 epochs
# Validation data used to evaluate performance during training

epochs = 4

for epoch in range(epochs):
   # Training loop
   loop = tqdm(loader_2, leave=True)  # Progress bar for training batches
   model_17.train()  # Set model to training mode
   mean_training_loss = []

   for batch in loop:
       # Zero gradients at start of each batch
       optim.zero_grad()

       # Move batch data to GPU/CPU device
       input_ids = batch['input_ids'].to(device)
       attention_mask = batch['attention_mask'].to(device)
       start = batch['start_positions'].to(device)
       end = batch['end_positions'].to(device)

       # Forward pass
       outputs = model_17(input_ids, attention_mask=attention_mask,
                      start_positions=start, end_positions=end)
       loss = outputs['loss']

       # Backward pass and optimization
       loss.backward()
       optim.step()

       # Track and display training progress
       mean_training_loss.append(loss.item())
       loop.set_description(f'Epoch {epoch}')
       loop.set_postfix(loss=loss.item())
   print("Mean Training Loss", np.mean(mean_training_loss))

   # Validation loop
   loop = tqdm(val_loader_2, leave=True)  # Progress bar for validation batches
   model_17.eval()  # Set model to evaluation mode
   mean_val_loss = []

   for batch in loop:
       # Move batch data to GPU/CPU device
       input_ids = batch['input_ids'].to(device)
       attention_mask = batch['attention_mask'].to(device)
       start = batch['start_positions'].to(device)
       end = batch['end_positions'].to(device)

       # Forward pass; torch.no_grad() skips gradient tracking during validation
       with torch.no_grad():
           outputs = model_17(input_ids, attention_mask=attention_mask,
                              start_positions=start, end_positions=end)
       loss = outputs['loss']

       # Track and display validation progress
       mean_val_loss.append(loss.item())
       loop.set_description(f'Epoch {epoch} Validation set')
       loop.set_postfix(loss=loss.item())
   print("Mean Validation Loss", np.mean(mean_val_loss))

Mean Training Loss 2.8988373747907326
Mean Validation Loss 2.517550705645929

Mean Training Loss 2.389552794027597
Mean Validation Loss 2.3903511301880283

Mean Training Loss 2.110122181332258
Mean Validation Loss 2.370580654419458

Mean Training Loss 1.919016493759489
Mean Validation Loss 2.482143438713034


In [None]:
# Save trained model parameters (weights & biases)
torch.save(model_17.state_dict(), "simple_distilbert_qa_data_17.model")

In [None]:
# Initialize a new instance of our custom QA model
model_17 = SimpleQuestionDistilBERT(mod_17)

# Load previously saved model parameters from disk
model_17.load_state_dict(torch.load("simple_distilbert_qa_data_17.model"))

<All keys matched successfully>

In [None]:
# Evaluate performance on the data_2 test set
eval_test_set(model_17, tokenizer, test_loader_2, device)

100%|██████████| 3506/3506 [00:50<00:00, 69.53it/s]

Mean EM:  0.7665644390557022
Mean F-1:  0.8216557547155849
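For a side-by-side view of the four runs in this section, the snippet below collects the test-set scores and settings reported above and picks the run with the highest value for each metric. The dictionary layout is just a convenience for this comparison; every number is copied from the notebook output.

```python
# Settings and test-set scores as reported for each model above
runs = {
    "Model_12": {"epochs": 3, "dropout": 0.18, "em": 0.7662078311104772, "f1": 0.8172893287686881},
    "Model_14": {"epochs": 4, "dropout": 0.18, "em": 0.7688467299051422, "f1": 0.8218669553325509},
    "Model_15": {"epochs": 4, "dropout": 0.16, "em": 0.763996861850082, "f1": 0.8176282045263565},
    "Model_17": {"epochs": 4, "dropout": 0.13, "em": 0.7665644390557022, "f1": 0.8216557547155849},
}

best_by_em = max(runs, key=lambda name: runs[name]["em"])
best_by_f1 = max(runs, key=lambda name: runs[name]["f1"])

for name, r in runs.items():
    print(f"{name}: epochs={r['epochs']} dropout={r['dropout']:.2f} EM={r['em']:.4f} F1={r['f1']:.4f}")
print("Best EM:", best_by_em, "| Best F1:", best_by_f1)
```

The spread across runs is under half a point on both metrics, and each configuration was run once, so these differences may fall within run-to-run variation.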





In [None]:
# Then add category evaluation
print("\nAnalyzing performance by question category:")
eval_test_set_by_category(model_17, tokenizer, test_loader_2, device)



Analyzing performance by question category:


100%|██████████| 3506/3506 [00:56<00:00, 62.30it/s]

+---------------+-------+------------+---------+---------+
| Question Type | Count | % of Total | Mean EM | Mean F1 |
+---------------+-------+------------+---------+---------+
|      what     |  8080 |   57.6%    |  0.777  |  0.827  |
|      when     |  957  |    6.8%    |  0.714  |  0.785  |
|      how      |  1401 |   10.0%    |  0.792  |  0.832  |
|      who      |  1460 |   10.4%    |  0.751  |  0.829  |
|      why      |  175  |    1.2%    |  0.800  |  0.832  |
|     other     |  275  |    2.0%    |  0.756  |  0.808  |
|     where     |  617  |    4.4%    |  0.716  |  0.787  |
|     which     |  1056 |    7.5%    |  0.747  |  0.814  |
+---------------+-------+------------+---------+---------+
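`eval_test_set_by_category` is defined in `util`, which is not shown here, so how questions are assigned to a type is not visible in this notebook. One plausible rule, sketched below as an assumption, is to take the first question word that appears in the question and fall back to `other`. The function name `categorize_question` and the first-match rule are hypothetical.

```python
import re

# Question types taken from the table above ("other" is the fallback)
CATEGORIES = ("what", "when", "how", "who", "why", "where", "which")


def categorize_question(question: str) -> str:
    """Return the first question word found in the question, or 'other' if none is present."""
    for token in re.findall(r"[a-z]+", question.lower()):
        if token in CATEGORIES:
            return token
    return "other"


print(categorize_question("When was the treaty signed?"))
print(categorize_question("In what year did the war end?"))
print(categorize_question("Name the largest city in the region."))
```

A first-match rule is sensitive to word order: a question such as "In what year..." is counted as `what` here, whereas a rule based on meaning might count it as a `when` question.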





{'what': {'count': 8080,
  'percentage': 57.627843948363164,
  'mean_em': 0.7772277227722773,
  'mean_f1': 0.8267159653243886},
 'when': {'count': 957,
  'percentage': 6.825476071606876,
  'mean_em': 0.7136886102403344,
  'mean_f1': 0.7852905238826241},
 'how': {'count': 1401,
  'percentage': 9.99215462520505,
  'mean_em': 0.7915774446823698,
  'mean_f1': 0.8319991596307137},
 'who': {'count': 1460,
  'percentage': 10.412952000570574,
  'mean_em': 0.7513698630136987,
  'mean_f1': 0.8291292147449019},
 'why': {'count': 175,
  'percentage': 1.2481278082875686,
  'mean_em': 0.8,
  'mean_f1': 0.8316912254621837},
 'other': {'count': 275,
  'percentage': 1.9613436987376078,
  'mean_em': 0.7563636363636363,
  'mean_f1': 0.8084937091163197},
 'where': {'count': 617,
  'percentage': 4.400542044076742,
  'mean_em': 0.7163695299837926,
  'mean_f1': 0.787357141257092},
 'which': {'count': 1056,
  'percentage': 7.5315598031524145,
  'mean_em': 0.7471590909090909,
  'mean_f1': 0.81364274517216}}
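As a consistency check on the per-category results, the count-weighted average of `mean_em` should reproduce the overall Mean EM reported for Model_17 (0.7665644390557022), assuming every evaluated question falls into exactly one category. The snippet uses only the counts and scores from the dictionary above.

```python
# (count, mean_em) per question type, copied from the output above
per_category = {
    "what": (8080, 0.7772277227722773),
    "when": (957, 0.7136886102403344),
    "how": (1401, 0.7915774446823698),
    "who": (1460, 0.7513698630136987),
    "why": (175, 0.8),
    "other": (275, 0.7563636363636363),
    "where": (617, 0.7163695299837926),
    "which": (1056, 0.7471590909090909),
}
reported_overall_em = 0.7665644390557022  # Mean EM printed for Model_17

total_count = sum(count for count, _ in per_category.values())
weighted_em = sum(count * em for count, em in per_category.values()) / total_count

print("Questions evaluated:", total_count)
print("Count-weighted EM:", weighted_em)
print("Difference from reported EM:", abs(weighted_em - reported_overall_em))
```

Because `what` questions make up more than half of the test set, the overall score is dominated by that category, and the lower scores on `when` and `where` questions have comparatively little effect on it.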