
### ***INTRODUCTION and DATASET OVERVIEW***

The Stanford Question Answering Dataset (SQuAD) v2, available on Hugging Face Datasets, is a benchmark for evaluating machine reading comprehension and question-answering (QA) models. It contains both answerable and unanswerable questions, so a model must either extract the correct answer span from the context or recognize that no valid answer exists.

The goal of this project is to build a QA model that does both reliably. To that end, a transformer-based model such as BERT or RoBERTa is fine-tuned on SQuAD v2 so that it can handle realistic QA scenarios in which an answer may or may not be present in the given context.

# Installing and Importing Required **Libraries**

In [None]:
pip install datasets transformers torch

In [None]:
import re
import torch
from datasets import load_dataset, load_from_disk, DatasetDict
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

### DATA LOADING: Loaded the "rajpurkar/squad_v2" dataset

In [None]:
raw_dataset = load_dataset("rajpurkar/squad_v2")
print(raw_dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})
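
As a rough illustration of the answerable/unanswerable distinction, the sketch below counts both kinds in a few hand-written records that mimic the layout of the `answers` feature (`text` and `answer_start` lists, both empty for an unanswerable question). The records and the `count_answerability` helper are hypothetical and are not taken from the dataset.

```python
def count_answerability(examples):
    """Return (answerable, unanswerable) counts for SQuAD-v2-style records."""
    answerable = sum(1 for ex in examples if len(ex["answers"]["text"]) > 0)
    return answerable, len(examples) - answerable

# Hypothetical toy records in the same layout as the dataset's features
toy_examples = [
    {"question": "Who wrote it?", "answers": {"text": ["Ada"], "answer_start": [0]}},
    {"question": "When was it lost?", "answers": {"text": [], "answer_start": []}},
    {"question": "Where?", "answers": {"text": ["Paris", "in Paris"], "answer_start": [10, 7]}},
]

answerable, unanswerable = count_answerability(toy_examples)
print(answerable, unanswerable)
```

The same check (`len(answers["text"]) == 0`) is what later decides whether a training example gets a real answer span or the "no answer" label.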


## SINCE THE DATASET ONLY PROVIDES TRAINING AND VALIDATION SPLITS, THE TRAINING SPLIT IS DIVIDED INTO A TRAINING SET AND A VALIDATION SET, AND THE ORIGINAL VALIDATION SPLIT IS USED AS THE TEST SET

In [None]:
split_dataset = raw_dataset["train"].train_test_split(
    test_size  = 10000,
    shuffle = True,
    seed = 42,
)
print(split_dataset) #The split dataset now has two subsets: a held-out "test" subset of 10000 examples and a "train" subset of 120319 examples

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 120319
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10000
    })
})
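
The idea behind a seeded, shuffled split with a fixed-size held-out part can be sketched in plain Python. This only illustrates the concept and is not how `train_test_split` is implemented internally; `seeded_split` is a hypothetical helper and the 20-item input is made up.

```python
import random

def seeded_split(items, test_size, seed):
    """Shuffle a copy of `items` with a fixed seed and hold out `test_size` items."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

train_ids, held_out_ids = seeded_split(range(20), test_size=5, seed=42)

# The same seed gives the same split on every run, which keeps experiments comparable
assert seeded_split(range(20), test_size=5, seed=42) == (train_ids, held_out_ids)
print(len(train_ids), len(held_out_ids))
```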


In [None]:
train_dataset = split_dataset["train"]
validation_dataset = split_dataset["test"]
test_dataset = raw_dataset["validation"]
print(f"Train Dataset is {train_dataset}\n")
print(f"Validation Dataset is {validation_dataset}\n")
print(f"Test Dataset is {test_dataset}\n")

Train Dataset is Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 120319
})

Validation Dataset is Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 10000
})

Test Dataset is Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 11873
})



# DATA CLEANING


In [None]:
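#Removed unnecessary whitespace to ensure uniform formatting: whitespace runs are collapsed into one space, then both ends are stripped
def clean_text(text):
  text = re.sub(r'\s+', ' ', text) #raw string avoids an invalid-escape warning for \s
  text = text.strip()
  return text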
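
A quick check of what the cleaning step does to a string. The function is repeated here so the snippet runs on its own, and the sample string is made up for illustration.

```python
import re

def clean_text(text):
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r"\s+", " ", text)
    # Drop leading and trailing whitespace
    return text.strip()

sample = "  The Normans \n settled\tin   Normandy. "
print(repr(clean_text(sample)))
```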

In [None]:
#Applied clean_text to every context and question in a batch so that both are normalized consistently
def apply_clean_text(examples):
  cleaned_contexts = [clean_text(c) for c in examples['context']]
  cleaned_questions = [clean_text(q) for q in examples['question']]
  return {
      "context": cleaned_contexts,
      "question": cleaned_questions
  }
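
One caveat worth keeping in mind: `answer_start` in the `answers` field is a character offset into the original context, so removing whitespace from the context can shift it. The sketch below shows one possible way to recompute the offset after cleaning. `realign_answer_start` is a hypothetical helper that is not part of this notebook, and it assumes the answer does not begin in the middle of a whitespace run.

```python
import re

def clean_text(text):
    return re.sub(r"\s+", " ", text).strip()

def realign_answer_start(context, answer_start):
    """Map a character offset in `context` to the matching offset in clean_text(context)."""
    # Clean the text before the answer the same way as the full context;
    # only the leading strip applies to a prefix, hence lstrip
    prefix = re.sub(r"\s+", " ", context[:answer_start]).lstrip()
    return len(prefix)

context = "  The  Normans\nsettled in   Normandy. "
answer = "Normandy"
old_start = context.index(answer)
new_start = realign_answer_start(context, old_start)
cleaned = clean_text(context)
print(old_start, new_start, cleaned[new_start:new_start + len(answer)])
```

If the cleaned contexts in the actual dataset differ from the originals, a correction along these lines (or leaving the contexts untouched) would keep the answer labels aligned.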


In [None]:
#Applied the cleaning function to the train, validation and test datasets
train_dataset = train_dataset.map(
    apply_clean_text,
    batched = True, #process the dataset in batches
    num_proc = 4 #split the map operation across 4 CPU processes to speed it up
)
validation_dataset = validation_dataset.map(
    apply_clean_text,
    batched = True,
    num_proc = 4
)
test_dataset = test_dataset.map(
    apply_clean_text,
    batched = True,
    num_proc = 4
)

In [None]:
#Loaded the tokenizer compatible with BERT (Bidirectional Encoder Representations from Transformers)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


In [None]:
#Pre-process a batch of training examples for the question-answering model
def preprocess_training_examples(examples):
  # Argument = examples (dict): A batch of examples containing:
  # 1: "question": A list of questions (strings).
  # 2: "context": A list of contexts (strings).
  # 3: "answers": A list of dictionaries with the keys:
  #      "text": The actual answer text(s)
  #      "answer_start": The character position(s) where each answer starts in the context

  #Tokenize the question and context as a pair
  tokenized = tokenizer(
      examples["question"], #tokenize the questions
      examples["context"],  #tokenize the contexts
      truncation = "only_second",   #cut off only the context if the input sequence (question+context) exceeds the max_length
      stride = 128,  #number of overlapping tokens between consecutive chunks when a long context is split
      return_overflowing_tokens = True, #return the overflowing tokens as additional chunks
      return_offsets_mapping = True, #return the (start, end) character span of every token
      padding = "max_length" #pad all sequences to the model's max_length
  )
  offset_mapping = tokenized["offset_mapping"]
  sample_map = tokenized.pop("overflow_to_sample_mapping")
  #Lists that store the start and end token positions of the answers
  start_positions = []
  end_positions = []
  #Iterate over each chunk in the tokenized batch
  for i, offsets in enumerate(offset_mapping):
    sample_idx = sample_map[i] #index of the original example this chunk came from
    answers = examples["answers"][sample_idx] #retrieve the answers of that example
    #If no answer is provided, the start and end positions are both set to zero
    if len(answers["text"]) == 0:
      start_positions.append(0)
      end_positions.append(0)
      continue
    #Extract the first answer text and its character span in the context
    answer_text = answers["text"][0]
    answer_start_char = answers["answer_start"][0]
    answer_end_char = answer_start_char + len(answer_text)
    #sequence_ids is 1 for context tokens; question offsets refer to the question string, so those tokens are skipped
    sequence_ids = tokenized.sequence_ids(i)
    #Start with "not found" for both token indices
    start_token_idx = None
    end_token_idx = None
    #Iterate through the context tokens to find the start and end token indices
    for idx, (start, end) in enumerate(offsets):
      if sequence_ids[idx] != 1:
        continue
      if start <= answer_start_char < end: #the start of the answer falls within this token
        start_token_idx = idx
      if start < answer_end_char <= end: #the end of the answer falls within this token
        end_token_idx = idx
        break #exit the loop once the end has been found
    #If the answer is not fully inside this chunk, label the chunk as having no answer
    if start_token_idx is None or end_token_idx is None:
      start_token_idx, end_token_idx = 0, 0
    #Append the identified token indices to the lists
    start_positions.append(start_token_idx)
    end_positions.append(end_token_idx)
  #Add the start and end positions to the tokenized dictionary
  tokenized["start_positions"] = start_positions
  tokenized["end_positions"] = end_positions
  return tokenized #return the tokenized inputs together with the answer positions
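
The character-to-token lookup used in the function can be tried on a toy example without loading a tokenizer. The whitespace tokenizer below is a stand-in assumed purely for illustration (BERT uses WordPiece, so real offsets differ), and `toy_offsets` and `find_answer_tokens` are hypothetical names. In this sketch a span that is not fully covered falls back to `(0, 0)`.

```python
import re

def toy_offsets(text):
    """Return (start, end) character spans of whitespace-separated tokens."""
    return [m.span() for m in re.finditer(r"\S+", text)]

def find_answer_tokens(offsets, answer_start_char, answer_end_char):
    """Return (start_token_idx, end_token_idx), or (0, 0) if the span is not fully covered."""
    start_idx = end_idx = None
    for idx, (start, end) in enumerate(offsets):
        if start <= answer_start_char < end:  # answer begins inside this token
            start_idx = idx
        if start < answer_end_char <= end:    # answer ends inside this token
            end_idx = idx
            break
    if start_idx is None or end_idx is None:
        return 0, 0
    return start_idx, end_idx

context = "The Normans settled in Normandy in the 10th century"
answer = "10th century"
answer_start = context.index(answer)
offsets = toy_offsets(context)
print(find_answer_tokens(offsets, answer_start, answer_start + len(answer)))
```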

In [None]:
#Tokenized and Preprocessed the training dataset and removed the original column to avoid redundancy
tokenized_train = train_dataset.map(
    preprocess_training_examples,
    batched = True,
    remove_columns = train_dataset.column_names
)
#Tokenized and Preprocessed the testing dataset and removed the original column to avoid redundancy

tokenized_test = test_dataset.map(
    preprocess_training_examples,
    batched = True,
    remove_columns = test_dataset.column_names
)
#Tokenized and Preprocessed the validation dataset and removed the original column to avoid redundancy

tokenized_validation = validation_dataset.map(
    preprocess_training_examples,
    batched = True,
    remove_columns = validation_dataset.column_names
)
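
Because `return_overflowing_tokens` is enabled together with a stride, a long context can produce more than one row per example. The sketch below shows a simplified version of that windowing on a plain list of tokens. It ignores the question and the special tokens, which the real tokenizer also accounts for, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(tokens, max_len, stride):
    """Split `tokens` into windows of at most `max_len`; consecutive windows share `stride` tokens."""
    if not 0 <= stride < max_len:
        raise ValueError("stride must be non-negative and smaller than max_len")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += max_len - stride
    return windows

chunks = sliding_windows(list(range(10)), max_len=4, stride=2)
print(chunks)
```

The overlap means an answer that sits near a chunk boundary has a chance of appearing whole in at least one window.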

## Convert Dataset into Tensor Format

In [None]:
#Set the format of each tokenized dataset so that the selected columns are returned as PyTorch tensors
tokenized_train.set_format("torch", columns = ["input_ids", "attention_mask", "start_positions", "end_positions"])
tokenized_test.set_format("torch", columns = ["input_ids", "attention_mask", "start_positions", "end_positions"])
tokenized_validation.set_format("torch", columns = ["input_ids", "attention_mask", "start_positions", "end_positions"])

In [None]:
#Created a DataLoader for each tokenized dataset; only the training data is reshuffled every epoch to reduce ordering bias, while the evaluation order is kept fixed
train_dataloader = DataLoader(tokenized_train, shuffle = True, batch_size = 8)
test_dataloader = DataLoader(tokenized_test, shuffle = False, batch_size = 8)
validation_dataloader = DataLoader(tokenized_validation, shuffle = False, batch_size = 8)

In [None]:
#fetched the first batch from train dataloader and iterated over the batch items(keys and values)
sample_batch = next(iter(train_dataloader))
for k,v in sample_batch.items():
  print(k,v.shape)

input_ids torch.Size([8, 512])
attention_mask torch.Size([8, 512])
start_positions torch.Size([8])
end_positions torch.Size([8])
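
The fixed second dimension of 512 follows from `padding = "max_length"`. A minimal pure-Python sketch of what padding and the attention mask represent is given below. The pad id of 0 and the toy token ids are assumptions for illustration, and `pad_to_length` is a hypothetical helper rather than a tokenizer function.

```python
def pad_to_length(input_ids, max_length, pad_id=0):
    """Right-pad `input_ids` to `max_length` and build the matching attention mask."""
    if len(input_ids) > max_length:
        raise ValueError("sequence is longer than max_length; truncate it first")
    n_pad = max_length - len(input_ids)
    padded = list(input_ids) + [pad_id] * n_pad
    attention_mask = [1] * len(input_ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return padded, attention_mask

ids, mask = pad_to_length([101, 2054, 102], max_length=6)
print(ids, mask)
```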


### Save Processed Data

In [None]:
#Created a DatasetDict to organize the preprocessed datasets
processed_dataset = DatasetDict(
    {
        "train" : tokenized_train, #Assigned the tokenized training dataset to the "train" split
        "test" : tokenized_test,   #Assigned the tokenized testing dataset to the "test" split
        "validation" : tokenized_validation,  #Assigned the tokenized validation dataset to the "validation" split
    }
)
#Saved the processed dataset to disk so it can be reused without repeating the preprocessing
processed_dataset.save_to_disk("processed_squad_v2")


In [None]:
reload_processed_dataset = load_from_disk("processed_squad_v2") #reload the preprocessed dataset from disk
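
After reloading, a light sanity check on the stored labels can catch preprocessing mistakes early. The sketch below works on plain dictionaries so that it runs without the saved files. `check_span` is a hypothetical helper, the records are made up, and the limit of 512 is taken from the batch shapes printed earlier.

```python
def check_span(record, seq_len=512):
    """Return True if the start/end labels form a valid span inside the sequence."""
    start = int(record["start_positions"])
    end = int(record["end_positions"])
    return 0 <= start <= end < seq_len

# Hypothetical records in the layout of the processed dataset's label columns
toy_records = [
    {"start_positions": 0, "end_positions": 0},    # no answer: both labels at index 0
    {"start_positions": 57, "end_positions": 60},  # ordinary answer span
    {"start_positions": 80, "end_positions": 0},   # end before start: invalid
]
print([check_span(r) for r in toy_records])
```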