# **CODE AND REPORT**
TOPIC: Building Q and A Model with SQuAD Dataset

NAME: Awofisayo Victoria Oladipo

STUDENT: 22029348

GITHUB KINK: [link text](https://github.com/va22abb/Research-Methodology-LLM-Ass-3)

GOGGLE COLAB LINK: [link text](https://colab.research.google.com/drive/1rNLlzo0HkAc884CXN7EmM68F3deeJqpc?authuser=2#scrollTo=JGgf1kDFp9-o)

# **Introduction**
Question answering (QA) plays an essential role in natural language processing (NLP) and a major step forward in artificial intelligence. Like phone interfaces and search engines, QA systems let users ask questns in plain language and get quick, clear answrs (S, Lavanya, 2022). However, users often need to look through several pieces of information for complex questns. This is hard for machines because understanding text requires knowledge of the world and language interpretation skills (Rajpurkar et al., 2016). Large language models (LLMs) can be transformative.This study details the process of developing a QA model with Stanford Question Answering Dataset (SQuAD) and BERT, an advanced transformer model trained for multiple NLP tasks. (Devlin et al., 2019).

# **About the dataset**
The Stanford Question Answering Dataset (SQuAD), is a famous dataset focused on understanding text, with over one hundred thousand questn-answr pairs from 500+ Wikipedia articles. It's available on Kaggle. Each answr, created by crowdworkers, is a specific passage from the article. The train-v1.1.json file is for training models, and the dev-v1.1.json file is for evaluation. SQuAD is a standard for building and testing machine learning models in natural language understanding and questn answring.

# Seting up Environment

In [1]:
!pip install transformers
!pip install datasets
!pip install torch


Collecting datasets
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-2.21.0-py3-none-any.whl (527 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m527.3/527.3 kB[0m [31m15.1 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m6.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)
[2

Here, I'm trying to install some dependences that I will be needing to for my LLM model

In [2]:
# Importing the Necessary Libraries
import pandas as pd
from datasets import load_dataset
from transformers import BertTokenizerFast
from torch.utils.data import Dataset, DataLoader
from transformers import BertForQuestionAnswering, Trainer, TrainingArguments
from datasets import load_metric
import torch
from transformers import BertTokenizerFast
from torch.utils.data import Dataset
from transformers import BertForQuestionAnswering, Trainer, TrainingArguments
from datasets import load_metric

In [3]:
from google.colab import drive
# mounting the drive
drive.mount('/content/drive')

Mounted at /content/drive


# **Load Data**
To load the data, parse the JSON files and use Pandas to extract relevant information. This involves understanding the structure of questns, answrs, and cntxt paragraphs. Preprocess the data by cleaning it, identifying important fields, and formatting it for tokenization.

In [4]:
# Load the SQuAD training and validation dataset
train_data = pd.read_json("/content/drive/MyDrive/train-v1.1.json")
validation_data = pd.read_json("/content/drive/MyDrive/dev-v1.1.json")

# **Preprocess Data**
The SQuAD dataset, in JSON format, contains questns with answrs and cntxt paragraphs, split into training and validation sets. For this study, 200 training entries and 25 validation entries were used to manage resources.



In [5]:
# Selecting randome sample 200 entries from the training data
train_sample = train_data.sample(n=200, random_state=0).reset_index(drop=True)

# Selecting Random sample 25 entries from the validation data
validation_sample = validation_data.sample(n=25, random_state=0).reset_index(drop=True)

In [6]:
# Printing the structure of the data
print(train_sample.head())
print(validation_sample.head())

                                                data  version
0  {'title': 'Cyprus', 'paragraphs': [{'context':...      1.1
1  {'title': 'Nonprofit_organization', 'paragraph...      1.1
2  {'title': 'Alsace', 'paragraphs': [{'context':...      1.1
3  {'title': 'Humanism', 'paragraphs': [{'context...      1.1
4  {'title': 'Iran', 'paragraphs': [{'context': '...      1.1
                                                data  version
0  {'title': 'Construction', 'paragraphs': [{'con...      1.1
1  {'title': 'Computational_complexity_theory', '...      1.1
2  {'title': 'Pharmacy', 'paragraphs': [{'context...      1.1
3  {'title': 'Private_school', 'paragraphs': [{'c...      1.1
4  {'title': 'Jacksonville,_Florida', 'paragraphs...      1.1


The first five rows of each of the train and test dataset are printed out to have an insight of what the dataset contains.

In [7]:
# The necessary information of my train sample dataset
train_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   data     200 non-null    object 
 1   version  200 non-null    float64
dtypes: float64(1), object(1)
memory usage: 3.2+ KB


In [8]:
# Necessary information of my validation sample dataset
validation_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   data     25 non-null     object 
 1   version  25 non-null     float64
dtypes: float64(1), object(1)
memory usage: 528.0+ bytes


# **Preparing Data for Tokenization**
The SQuAD data was prepared by extracting IDs, titles, contexts, questions, and answers from the raw data. Each answr was formatted with its start position and text. This organized data was then stored in a new DataFrame, ready for tokenization with the BERT model.

In [9]:
def prepare_data_for_toknizr(data):
    """
    Processes a DataFrame containing raw SQuAD data,
    organizing it to be compatible with the BERT toknizr.
    """

    # Initializing lists to store processed data
    ids = []
    titles = []
    cntxts = []
    questns = []
    answrs = []

    # Iterating over each row in the DataFrame
    for _, row in data.iterrows():
        document = row['data']
        QuestionAnswer_title = document['title']
        paragraphs = document['paragraphs']

        # Extracting context, questions, and answrs from each paragraph
        # This part of the code is based on a script from Hugging Face (Hugging Face, 2022).
        # Available : https://huggingface.co/datasets/SkelterLabsInc/JaQuAD/commit/8331f752b38576f5188c2e8cbb284e5e36a0debc
        for prgph in paragraphs:
            cntxt = prgph['context']
            QuestionAnswers = prgph['qas']

            for QuestionAnswer in QuestionAnswers:
                QuestionAnswer_id = QuestionAnswer['id']
                questn = QuestionAnswer['question']
                answr = QuestionAnswer['answers'][0]

                formatted_answr = {
                    'answr_start': [answr['answer_start']],
                    'text': [answr['text']]
                }

                # Appending extracted data to corresponding lists
                ids.append(QuestionAnswer_id)
                titles.append(QuestionAnswer_title)
                cntxts.append(cntxt)
                questns.append(questn)
                answrs.append(formatted_answr)

    # Creating a DataFrame from the organized data
    cleaned_data = {
        'id': ids,
        'title': titles,
        'cntxt': cntxts,
        'questn': questns,
        'answrs': answrs
    }

    return pd.DataFrame(cleaned_data)

# Preparing the data
train_cleaned_data = prepare_data_for_toknizr(train_sample)
validation_cleaned_data = prepare_data_for_toknizr(validation_sample)

#(Hugging Face, 2022)


In [10]:
# Displaying the cleaned data to verify the structure
print(train_cleaned_data.head())
print(validation_cleaned_data.head())


                         id   title  \
0  572e7c43cb0c0d14000f11a6  Cyprus   
1  572e7c43cb0c0d14000f11a7  Cyprus   
2  572e7c43cb0c0d14000f11a8  Cyprus   
3  572e7c43cb0c0d14000f11a9  Cyprus   
4  572e7c43cb0c0d14000f11aa  Cyprus   

                                               cntxt  \
0  Cyprus (i/ˈsaɪprəs/; Greek: Κύπρος IPA: [ˈcipr...   
1  Cyprus (i/ˈsaɪprəs/; Greek: Κύπρος IPA: [ˈcipr...   
2  Cyprus (i/ˈsaɪprəs/; Greek: Κύπρος IPA: [ˈcipr...   
3  Cyprus (i/ˈsaɪprəs/; Greek: Κύπρος IPA: [ˈcipr...   
4  Cyprus (i/ˈsaɪprəs/; Greek: Κύπρος IPA: [ˈcipr...   

                                              questn  \
0                What is the official name of Cypus?   
1                           Where is Cyprus located?   
2                  What countries are nearby Cyprus?   
3  What is Cyprus' affiliation with the European ...   
4  Is Cyprus an island country or land-locked cou...   

                                              answrs  
0  {'answr_start': [99], 'text': ['R

Now the data is preprocessed and available for tokenization

# **Tokenizing Data**
In preparing text data for the BERT model is important for us to do TOKENIZATION. Using BertTokenizerFast, text is split into tokens, special tokens are added, and attention masks are created to differentiate between padding and real tokens. Additionally, the start and end positions of answrs within cntxt paragraphs are mapped. This mapping helps the model learn which text segments answr the questns.


In [11]:
# Loading BERT tokenizer
#This part of the code is based on a script from Hugging Face (Hugging Face, n.d).
#Available :  https://huggingface.co/docs/transformers/tasks/question_answering

toknizr = BertTokenizerFast.from_pretrained('bert-base-uncased')

def preprocess_function(data):
    inputs = toknizr(
        data['questn'].tolist(),
        data['cntxt'].tolist(),
        max_length=384,
        truncation="only_second",
        padding="max_length",
        return_offsets_mapping=True,
        return_tensors='pt'
    )

    start_positions = []
    end_positions = []

    for i in range(len(data)):
        strt_character = data['answrs'][i]['answr_start'][0]
        end_chararacter = strt_character + len(data['answrs'][i]['text'][0])
        offset_mapping = inputs['offset_mapping'][i]
        sequence_ids = inputs.sequence_ids(i)

        cntxt_start = sequence_ids.index(1)
        cntxt_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

        if offset_mapping[cntxt_start][0] > strt_character or offset_mapping[cntxt_end][1] < end_chararacter:
            start_positions.append(cntxt_start)
            end_positions.append(cntxt_start)
        else:
            start_positions.append(next(idx for idx, offset in enumerate(offset_mapping) if offset[0] <= strt_character < offset[1]))
            end_positions.append(next(idx for idx, offset in enumerate(offset_mapping) if offset[0] < end_chararacter <= offset[1]))

    inputs['start_positions'] = start_positions
    inputs['end_positions'] = end_positions
    return inputs

training_encodings = preprocess_function(train_cleaned_data)
val_encodings = preprocess_function(validation_cleaned_data)

#(Hugging Face, n.d)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

# **Preparing the Dataset Class**
The tokenized data is organized into a dataset class compatible with PyTorch's DataLoader. This custom class ensures efficient loading and batching of data during training. It handles tokenized inputs and labels, making it easy to integrate with the training process.


In [12]:
class QuestionAnsweringDataset(Dataset):
    def __init__(self, tokenized_data):
        self.tokenized_data = tokenized_data

    def __len__(self):
        return len(self.tokenized_data['input_ids'])

    def __getitem__(self, index):
        sample = {key: torch.tensor(value[index]) for key, value in self.tokenized_data.items()}
        return sample

# Creating datasets for training and validation
training_dataset = QuestionAnsweringDataset(training_encodings)
validation_dataset = QuestionAnsweringDataset(val_encodings)


# **Fine-tuning BERT Model**
BERT, a pre-trained transformer model, is the core of the QA system. Using the SQuAD dataset, BERT is fine-tuned for questn answring. This involves setting training parameters such as batch size,training cycles and learning rate.The transformers library's Trainer class simplifies this process by providing a high-level API for training and evaluation, helping the model learn to identify the correct text segments to answr questns.  

Epoch	    Training Loss     	Validation Loss

1	        1.484000	              1.441920

2	        1.100800	              1.373503

3	        0.854600	              1.435376

In [13]:
# Loading the pre-trained BERT model
#This part of the code is based on a script from Hugging Face (Hugging Face, n.d).
#Available : https://huggingface.co/docs/transformers/tasks/question_answering
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

# Defining the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Initializing the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=training_dataset,
    eval_dataset=validation_dataset
)

# Train the model
trainer.train()

#(Hugging Face, n.d)

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  sample = {key: torch.tensor(value[index]) for key, value in self.tokenized_data.items()}


Epoch,Training Loss,Validation Loss
1,1.484,1.44192
2,1.1008,1.373503
3,0.8546,1.435376


  sample = {key: torch.tensor(value[index]) for key, value in self.tokenized_data.items()}
  sample = {key: torch.tensor(value[index]) for key, value in self.tokenized_data.items()}
  sample = {key: torch.tensor(value[index]) for key, value in self.tokenized_data.items()}
  sample = {key: torch.tensor(value[index]) for key, value in self.tokenized_data.items()}
  sample = {key: torch.tensor(value[index]) for key, value in self.tokenized_data.items()}
  sample = {key: torch.tensor(value[index]) for key, value in self.tokenized_data.items()}
  sample = {key: torch.tensor(value[index]) for key, value in self.tokenized_data.items()}
  sample = {key: torch.tensor(value[index]) for key, value in self.tokenized_data.items()}
  sample = {key: torch.tensor(value[index]) for key, value in self.tokenized_data.items()}
  sample = {key: torch.tensor(value[index]) for key, value in self.tokenized_data.items()}
  sample = {key: torch.tensor(value[index]) for key, value in self.tokenized_data.items()}

TrainOutput(global_step=7788, training_loss=1.2737366373133647, metrics={'train_runtime': 9425.4609, 'train_samples_per_second': 13.22, 'train_steps_per_second': 0.826, 'total_flos': 2.441916177981696e+16, 'train_loss': 1.2737366373133647, 'epoch': 3.0})

# **Evaluating the Model**
Evaluating the model's performance is essential to determining how accurate the answrs it provides. Key metrics were computed with the compute_metrics function using the validation dataset and the compute_metric function on the validation dataset, utilizing the datasets package. Evaluation loss is 1.44, the duration is 105.93 seconds, the sample rate is 44.33 samples per second, and the step rate is 2.775 steps per second. These metrics demonstrate the model's capacity to comprehend cntxt and its processing efficiency, demonstrating its dependability and efficacy for answring questns in the real world.

In [14]:
# Loading the evaluation metric for SQuAD
squad_evaluator = load_metric("squad")

# Applying function to calculate and return evaluation metrics
def compute_evaluation_metrics(eval_predictions):
    return squad_evaluator.compute(predictions=eval_predictions.predictions, references=eval_predictions.label_ids)

# Executing model evaluation
eval_results = trainer.evaluate()
print(eval_results)


  squad_evaluator = load_metric("squad")


Downloading builder script:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/1.11k [00:00<?, ?B/s]

The repository for squad contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/squad.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] Y


  sample = {key: torch.tensor(value[index]) for key, value in self.tokenized_data.items()}


{'eval_loss': 1.4353755712509155, 'eval_runtime': 106.6572, 'eval_samples_per_second': 44.029, 'eval_steps_per_second': 2.756, 'epoch': 3.0}


# **Example Question-Answer**
A sample questn and answr were created to show how my method works in real situations. This involves asking a questn, providing cntxt, and using the model to get the answr. The model's effectiveness is shown by its ability to give accurate answrs based on cntxt.

For example, with the questn "When did the United Kingdom join the European Union?" and the cntxt "The United Kingdom joined the European Union on January 1, 1973," the model correctly identifies "the United Kingdom joined the European Union on January 1, 1973" as the answr. Similar examples with different cntxts and questns can show the model's flexibility.

In [15]:
def answr_questn(questn, cntxt):
    inputs = toknizr.encode_plus(questn, cntxt, return_tensors='pt')
    input_ids = inputs['input_ids'].tolist()[0]

    # Ensuring the model and inputs are on the same device
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    model.to(device)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    text_tokens = toknizr.convert_ids_to_tokens(input_ids)
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)

    answr_start_scores = outputs.start_logits
    answr_end_scores = outputs.end_logits

    answr_start = torch.argmax(answr_start_scores)
    answr_end = torch.argmax(answr_end_scores) + 1

    answr = toknizr.convert_tokens_to_string(toknizr.convert_ids_to_tokens(input_ids[answr_start:answr_end]))
    return answr


In [16]:
# Example 1
questn1 = "What river runs through London?"
cntxt1 = "The River Thames runs through London, providing a significant waterway for the capital of the United Kingdom."
print(f"Q: {questn1}\nA: {answr_questn(questn1, cntxt1)}\n")


Q: What river runs through London?
A: what river runs through london? [SEP] the river thames



In [17]:
# Example 2:
questn2 = "What is the capital of the United Kingdom?"
cntxt2 = "London is the capital and largest city of the United Kingdom. It is one of the leading financial centers in the world and has a significant impact on the global economy."
print(f"Q: {questn2}\nA: {answr_questn(questn2, cntxt2)}\n")


Q: What is the capital of the United Kingdom?
A: what is



In [18]:
# Example 3:
questn3 = "When did the United Kingdom join the European Union?"
cntxt3 = "The United Kingdom joined the European Union on January 1, 1973. It was a significant moment in British history, marking the beginning of the UK's integration into the European political and economic sphere."
print(f"Q: {questn3}\nA: {answr_questn(questn3, cntxt3)}")


Q: When did the United Kingdom join the European Union?
A: january 1, 1973


# **Limitation Of The Work**
1. A lot of data is required to make a perfect prediction.
2. Using the whole data of the SQuAD results in the crashing of the model which requires more computational power.That reason makes me reduce the amount of data I used, which is why my prediction output in Example 1 and 2 end up not giving a perfect answer for the prediction.

# **Conclusion**
Creating a questn-answr model with the SQuAD dataset and BERT involves several steps: setting up the environment, loading and preprocessing data, tokenizing inputs, optimizing the model, and evaluating its performance. BERT excels at providing accurate answrs due to its strong cntxtual understanding.

Fine-tuning a pre-trained model like BERT for specific tasks showcases significant progress in NLP, enabling the development of reliable questn-answr systems. This approach can be applied to various fields, helping create intelligent systems that understand and answr human questns accurately.

# **References**
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” arXiv preprint arXiv:1810.04805. https://arxiv.org/abs/1810.04805

Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). “SQuAD: 100,000+ Questions for Machine Comprehension of Text” Conference on Empirical Methods in Natural Language Processing, arXiv preprint arXiv:1606.05250. https://doi.org/10.48550/arXiv.1606.05250

Rajpurkar, Pranav et al. “SQuAD: 100,000+ Questions for Machine Comprehension of Text.” Conference on Empirical Methods in Natural Language Processing (2016).
S, Lavanya. (2022, August 24). End to End Question-Answering System Using NLP and SQuAD Dataset. Available at: https://www.analyticsvidhya.com/blog/2021/11/end-to-end-question-answering-system-using-nlp-and-squad-dataset/ (Accessed: 01 August 2024).

Stanford Question Answering Dataset. (2019, November 17). Available at: https://www.kaggle.com/datasets/stanfordu/stanford-question-answering-dataset(Accessed: 20 July 2024).

Hugging Face (2022). "JaQuAD.py." Available at: https://huggingface.co/datasets/SkelterLabsInc/JaQuAD/commit/8331f752b38576f5188c2e8cbb284e5e36a0debc (committed: febuary 3 2022, Accessed: 27 july 2024).

Hugging Face (n.d) "Question Answering with Hugging Face Transformers" Available at: https://huggingface.co/docs/transformers/tasks/question_answering (Accessed: 27 July 2024).

