# Dataset Loading and Preprocessing

SQuAD Dataset 1.0 vs 2.0 -- The Stanford Question Answering Dataset (SQuAD 1.0) is a reading comprehension dataset containing more than 100,000 question-answer pairs (Rajpurkar et al., 2016). SQuAD 2.0 combines the question-answer pairs from SQuAD 1.0 with more than 50,000 unanswerable questions (Rajpurkar et al., 2018). This code loads and pre-processes SQuAD 2.0 for fine-tuning DistilBERT models.

This notebook contains code to:
1. Load the SQuAD 2.0 dataset from Hugging Face (https://huggingface.co/datasets/rajpurkar/squad_v2)
2. Split the dataset using a traditional split -- 80% for training, 10% for validation, 10% for testing (referenced as data_2)
3. Pre-process each dataset split by removing rows that fail validation checks

The starting point for the code in this file was the blog post Question Answering with DistilBERT (https://medium.com/@sabrinaherbst/question-answering-with-distilbert-ba3e178fdf3d). The main differences are:
 - This file handles the unanswerable questions introduced in SQuAD 2.0
 - This file uses a traditional 80/10/10 split
 - Validation at the consolidation phase removed 315 rows where the expected answer did not match the actual answer

Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016, October 11). SQuAD: 100,000+ questions for machine comprehension of text. arXiv. https://arxiv.org/abs/1606.05250

Rajpurkar, P., Jia, R., & Liang, P. (2018, June 11). Know what you don’t know: Unanswerable questions for SQuAD. arXiv. https://arxiv.org/abs/1806.03822


In [None]:
# Mount Google Drive

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Install Huggingface datasets

!pip install datasets

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl (1

In [None]:
# Load libraries

from tqdm.auto import tqdm
from datasets import load_dataset, concatenate_datasets
import os
import pandas as pd
import random
from sklearn.model_selection import train_test_split
from google.colab import drive

In [None]:
# Load SQuAD 2.0 from Huggingface

dataset = load_dataset("squad_v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/8.92k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/16.4M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/1.35M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/130319 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11873 [00:00<?, ? examples/s]

In [None]:
# Look at the structure of the huggingface datasplit

print(dataset)
print(dataset['train']['answers'][500:])

Output hidden; open in https://colab.research.google.com to view.
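
Because the cell output is hidden, the following minimal sketch shows the record layout that the rest of this notebook relies on: each row carries an `answers` dictionary with parallel `text` and `answer_start` lists, and both lists are empty for an unanswerable question. The two records and the helper name `first_answer` are hypothetical and exist only for illustration; the `""` / `"-1"` placeholders mirror the convention used in `save_samples`.

```python
# Hypothetical records shaped like SQuAD 2.0 rows; the text is made up for illustration.
answerable = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}
unanswerable = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of Spain?",
    "answers": {"text": [], "answer_start": []},
}


def first_answer(sample):
    """Return (answer_text, answer_start), using "" and "-1" for unanswerable questions."""
    answers = sample["answers"]
    if len(answers["text"]) > 0:
        return answers["text"][0], str(answers["answer_start"][0])
    return "", "-1"


print(first_answer(answerable))    # ('Paris', '0')
print(first_answer(unanswerable))  # ('', '-1')
```

Keeping `answer_start` as a string makes every field joinable with tabs when the rows are written to text files.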

## data_2 File Loading and Initial Data Quality Verification
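
The 80/10/10 split in this section is done in two stages: 10% of all rows are held out for testing, then a fraction of 0.111 (roughly 1/9) of the remaining 90% is held out for validation, which gives 0.9 × 0.111 ≈ 0.0999 of the full dataset. The sketch below checks that arithmetic with the standard library only. The function name `two_stage_split_sizes` is hypothetical, and rounding the held-out count up at each stage is an assumption about how `train_test_split` resolves fractional sizes, although it does reproduce the sizes printed in this notebook.

```python
import math


def two_stage_split_sizes(n_total, test_frac=0.1, val_frac_of_rest=0.111):
    """Return (train, validation, test) row counts for a two-stage split.

    Assumption: the held-out count at each stage is rounded up.
    """
    n_test = math.ceil(test_frac * n_total)
    n_rest = n_total - n_test
    n_val = math.ceil(val_frac_of_rest * n_rest)
    n_train = n_rest - n_val
    return n_train, n_val, n_test


# Train + validation row counts reported when loading squad_v2
n_total = 130319 + 11873
train, val, test = two_stage_split_sizes(n_total)
for name, n in [("Train", train), ("Validation", val), ("Test", test)]:
    print(f"{name}: {n} ({n / n_total * 100:.1f}%)")
```

Using exactly `1/9` instead of `0.111` would move a handful of rows between training and validation without changing the rounded percentages.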

In [None]:
# Traditional Split -- 80% for Training, 10% for Validation, 10% for Testing (ie: data_2)
# with pre-processing to remove QA pairs that are defective

def validate_answer(context, answer_text, answer_start):
    """Return True if answer_text matches the context at answer_start (passed as a string)."""
    # SQuAD 2.0 unanswerable questions are encoded as empty text with start "-1"
    if answer_text == "" and answer_start == "-1":
        return True

    try:
        start_idx = int(answer_start)
        # Text found in the context at the recorded answer position
        extracted_answer = context[start_idx:start_idx+len(answer_text)]

        # Flexible matching: ignore case and surrounding whitespace
        if extracted_answer.lower().strip() == answer_text.lower().strip():
            return True

        # No match: print debug info.
        # "Expected" is the text extracted from the context at answer_start;
        # "Got" is the answer text supplied by the dataset.
        print("\nValidation failed:")
        print(f"Expected: '{extracted_answer}'")
        print(f"Got: '{answer_text}'")
        print(f"Context: '...{context[max(0, start_idx-20):start_idx+len(answer_text)+20]}...'")
        return False
    except Exception as e:
        # A non-numeric answer_start (or similar) counts as a failed validation
        print(f"\nValidation error: {str(e)}")
        return False

def save_samples(data, save_dir, is_pandas=False):
    text = []
    i = 0
    skipped = 0

    save_path = f"/content/drive/MyDrive/distilBERT_SQuAD2_w266Project/data_2/{save_dir}"
    os.makedirs(save_path, exist_ok=True)

    iterator = data.itertuples() if is_pandas else data
    total = len(data)

    for sample in tqdm(iterator, total=total):
        try:
            if is_pandas:
                context = sample.context.replace('\n','')
                question = sample.question.replace('\n','')
                answer_text = sample.answers['text'][0] if len(sample.answers['text']) > 0 else ""
                answer_start = str(sample.answers['answer_start'][0]) if len(sample.answers['answer_start']) > 0 else "-1"
            else:
                context = sample['context'].replace('\n','')
                question = sample['question'].replace('\n','')
                answer_text = sample['answers']['text'][0] if len(sample['answers']['text']) > 0 else ""
                answer_start = str(sample['answers']['answer_start'][0]) if len(sample['answers']['answer_start']) > 0 else "-1"

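            # Validate answer position. Note: stripping newlines above shifts the
            # character offsets of any answer that follows one, so such rows can fail here.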
            if validate_answer(context, answer_text, answer_start):
                text.append([context, question, answer_text, answer_start])
            else:
                skipped += 1
                continue

            if len(text) == 1000:
                filepath = f"{save_path}/text_{i}.txt"
                try:
                    with open(filepath, 'w', encoding='utf-8') as f:
                        f.write("\n".join(["\t".join(t) for t in text]))
                    print(f"Saved chunk {i} to {filepath}")
                except Exception as e:
                    print(f"Error saving chunk {i}: {str(e)}")
                text = []
                i += 1

        except Exception as e:
            print(f"Error processing sample: {str(e)}")
            skipped += 1
            continue

    # Save remaining samples
    if text:
        filepath = f"{save_path}/text_{i}.txt"
        try:
            with open(filepath, 'w', encoding='utf-8') as f:
                f.write("\n".join(["\t".join(t) for t in text]))
            print(f"Saved final chunk {i} to {filepath}")
        except Exception as e:
            print(f"Error saving final chunk: {str(e)}")

    print(f"Skipped {skipped} samples due to invalid answer positions")

try:
    # Combine train and validation sets
    print("\nCombining datasets...")
    full_dataset = concatenate_datasets([dataset["train"], dataset["validation"]])

    # Convert to pandas for easier splitting
    full_df = full_dataset.to_pandas()

    # First split: separate out test set (10%)
    train_val_df, test_df = train_test_split(full_df, test_size=0.1, random_state=42)

    # Second split: separate training (8/9 of remaining) and validation (1/9 of remaining)
    train_df, val_df = train_test_split(train_val_df, test_size=0.111, random_state=42)

    # Print sizes
    print(f"\nDataset sizes:")
    print(f"Train: {len(train_df)} ({len(train_df)/len(full_df)*100:.1f}%)")
    print(f"Validation: {len(val_df)} ({len(val_df)/len(full_df)*100:.1f}%)")
    print(f"Test: {len(test_df)} ({len(test_df)/len(full_df)*100:.1f}%)")

    # Save all three splits
    print("\nSaving training data...")
    save_samples(train_df, "training_squad", is_pandas=True)

    print("\nSaving validation data...")
    save_samples(val_df, "validation_squad", is_pandas=True)

    print("\nSaving test data...")
    save_samples(test_df, "test_squad", is_pandas=True)

    print("\nAll data saved successfully!")

except Exception as e:
    print(f"\nAn error occurred: {str(e)}")


Combining datasets...

Dataset sizes:
Train: 113767 (80.0%)
Validation: 14205 (10.0%)
Test: 14220 (10.0%)

Saving training data...


  0%|          | 0/113767 [00:00<?, ?it/s]


Validation failed:
Expected: 's mantle, of m'
Got: 'Earth's mantle'
Context: '...inerals). The Earth's mantle, of much larger mass than...'

Validation failed:
Expected: 'ute oxygen toxicity ('
Got: 'Acute oxygen toxicity'
Context: '...atal for divers). Acute oxygen toxicity (causing seizures, it...'
Saved chunk 0 to /content/drive/MyDrive/distilBERT_SQuAD2_w266Project/data_2/training_squad/text_0.txt

Validation failed:
Expected: 'roduction of methanol '
Got: 'production of methanol'
Context: '...d directly for the production of methanol and related compound...'

Validation failed:
Expected: 'compression sickness o'
Got: 'Decompression sickness'
Context: '... helps kill them. Decompression sickness occurs in divers who ...'
Saved chunk 1 to /content/drive/MyDrive/distilBERT_SQuAD2_w266Project/data_2/training_squad/text_1.txt

Validation failed:
Expected: 'ye colour or number of limbs,'
Got: 'eye colour or number of limbs'
Context: '...y visible, such as eye colour or number of limbs,

  0%|          | 0/14205 [00:00<?, ?it/s]

Saved chunk 0 to /content/drive/MyDrive/distilBERT_SQuAD2_w266Project/data_2/validation_squad/text_0.txt

Validation failed:
Expected: '964 '
Got: '1964'
Context: '...n Malaysia was the 1964 Tokyo edition....'

Validation failed:
Expected: '0% oxygen '
Got: '50% oxygen'
Context: '...a), equal to about 50% oxygen composition at stand...'
Saved chunk 1 to /content/drive/MyDrive/distilBERT_SQuAD2_w266Project/data_2/validation_squad/text_1.txt

Validation failed:
Expected: 'ustralian Federal Police. '
Got: 'Australian Federal Police.'
Context: '... kept apart by the Australian Federal Police. Preparations for the...'

Validation failed:
Expected: 'ellular respiration '
Got: 'cellular respiration'
Context: '...uch as animals, in cellular respiration (see Biological role...'
Saved chunk 2 to /content/drive/MyDrive/distilBERT_SQuAD2_w266Project/data_2/validation_squad/text_2.txt
Saved chunk 3 to /content/drive/MyDrive/distilBERT_SQuAD2_w266Project/data_2/validation_squad/text_3.txt


  0%|          | 0/14220 [00:00<?, ?it/s]


Validation failed:
Expected: '014,'
Got: '2014'
Context: '...ing enterprises.In 2014, the U.S. Department...'

Validation failed:
Expected: 'utside the stadium.'
Got: 'outside the stadium.'
Context: '...rotests took place outside the stadium....'
Saved chunk 0 to /content/drive/MyDrive/distilBERT_SQuAD2_w266Project/data_2/test_squad/text_0.txt

Validation failed:
Expected: 'cited form a'
Got: 'excited form'
Context: '... only exist in an excited form and is unstable. By c...'

Validation failed:
Expected: 'yghurs living in Turkey '
Got: 'Uyghurs living in Turkey'
Context: '... in Taksim Square. Uyghurs living in Turkey protested at Chinese...'

Validation failed:
Expected: ' '
Got: '3'
Context: '...house and included 3 alumni of the colleg...'

Validation failed:
Expected: 'artin McElroy.'
Got: 'Martin McElroy'
Context: '...g with their coach Martin McElroy. The club has been h...'
Saved chunk 1 to /content/drive/MyDrive/distilBERT_SQuAD2_w266Project/data_2/test_squad/text_1.txt


## data_2 File Inspection

In [None]:
# Open a data_2 text file

with open("/content/drive/MyDrive/24Nov2024_distilbert_squad_2/data_2/training_squad/text_0.txt", 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')

lines_2 = pd.DataFrame([line.split("\t") for line in lines], columns=["context", "question", "answer", "answer_start"])
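
Each chunk file stores one sample per line with four tab-separated fields: context, question, answer text, and answer start. The sketch below writes two hypothetical rows to a temporary file in that layout and parses them back; the rows and the temporary path are made up for illustration and do not touch the Drive folder used above.

```python
import os
import tempfile

# Hypothetical rows: [context, question, answer_text, answer_start]
rows = [
    ["Paris is the capital of France.", "What is the capital of France?", "Paris", "0"],
    ["Paris is the capital of France.", "What is the capital of Spain?", "", "-1"],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "text_0.txt")

    # Write: one sample per line, fields joined by tabs
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join("\t".join(row) for row in rows))

    # Read back: split on newlines, then on tabs
    with open(path, "r", encoding="utf-8") as f:
        parsed = [line.split("\t") for line in f.read().split("\n")]

for context, question, answer, answer_start in parsed:
    is_unanswerable = answer == "" and int(answer_start) == -1
    print(question, "->", "unanswerable" if is_unanswerable else repr(answer))
```

This layout only round-trips when no field contains a tab or a newline. The notebook strips newlines before saving but does not check for tabs.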

In [None]:
# Verify the DataFrame has exactly 1000 rows and 4 columns

assert lines_2.shape==(1000,4)
print("Passed")

Passed


In [None]:
# Verify that the answer_start position correctly locates each answer within its context

for _, sample in lines_2.iterrows():
    answer_start = int(sample['answer_start'])
    # Unanswerable rows ("" at -1) pass trivially: the slice and the answer are both empty
    assert sample['context'][answer_start:answer_start+len(sample['answer'])] == sample['answer']
print("Passed")

Passed
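
The offset check above compares a slice of the context against the answer text. The sketch below runs the same comparison on a made-up context and shows that removing a character ahead of the answer (for example with `replace('\n', '')`, as `save_samples` does) shifts the offset so the slice no longer lines up. The helper name `answer_matches` and the sample text are hypothetical, and whether this effect accounts for the failures logged earlier is an assumption that the notebook itself does not confirm.

```python
def answer_matches(context, answer_text, answer_start):
    """Check that answer_text appears in context at character offset answer_start."""
    return context[answer_start:answer_start + len(answer_text)] == answer_text


# Hypothetical context with a newline before the answer
context = "Oxygen is a gas.\nAcute oxygen toxicity can cause seizures."
answer_text = "Acute oxygen toxicity"
answer_start = context.index(answer_text)

# Against the original context the offset is correct
print(answer_matches(context, answer_text, answer_start))  # True

# After removing the newline, the same offset points one character too far
cleaned = context.replace("\n", "")
print(answer_matches(cleaned, answer_text, answer_start))  # False
print(repr(cleaned[answer_start:answer_start + len(answer_text)]))  # 'cute oxygen toxicity '
```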


## Summary of Dataset Split and Pre-Processing
Data quality verification identified 315 question-answer pairs where the answer text did not match the context at the recorded start position (see Appendix A: Initial Data Quality Verification). These pairs were removed; they make up a negligible portion (~0.2%) of the dataset.