
TOPIC: Building Q and A Model with SQuAD Dataset

NAME: Awofisayo Victoria Oladipo

STUDENT: 22029348

GITHUB LINK: [Research-Methodology-LLM-Ass-3](https://github.com/va22abb/Research-Methodology-LLM-Ass-3)

# **Introduction**
Question answering (QA) is a core task in natural language processing (NLP). As in phone interfaces and search engines, QA systems let users ask questions in plain language and get quick, clear answers (S, Lavanya, 2022). Complex questions, however, often require combining several pieces of information. This is hard for machines because understanding text requires both knowledge of the world and language interpretation skills (Rajpurkar et al., 2016). Large language models (LLMs) that are pre-trained on large text corpora help address this difficulty. This study explains how to create a QA model by fine-tuning BERT, a pre-trained transformer model used for many NLP tasks (Devlin et al., 2019), on the Stanford Question Answering Dataset (SQuAD).

# **About the dataset**
The Stanford Question Answering Dataset (SQuAD) is a widely used reading comprehension dataset with over 100,000 question-answer pairs drawn from more than 500 Wikipedia articles (Rajpurkar et al., 2016). It is available on Kaggle. The questions were written by crowdworkers, and each answer is a span of text taken from the corresponding passage. The train-v1.1.json file is used for training, and the dev-v1.1.json file is used for evaluation. SQuAD is a standard benchmark for building and testing machine learning models for natural language understanding and question answering.

# **Setting up the Environment**

In [None]:
!pip install transformers
!pip install datasets
!pip install torch




In [None]:
# Importing the necessary libraries (duplicate imports removed)
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from datasets import load_dataset, load_metric
from transformers import BertTokenizerFast
from transformers import BertForQuestionAnswering, Trainer, TrainingArguments

In [None]:
from google.colab import drive
# mounting the drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# **Load Data**
To load the data, the JSON files are read with Pandas. Each file has a nested structure: a list of articles, where each article has a title and a list of paragraphs, and each paragraph holds a context passage together with its questions and answers. The data is then preprocessed by identifying the important fields and formatting them for tokenization.
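
To make the nested layout concrete, here is a minimal sketch that walks a hand-written miniature in the SQuAD v1.1 layout. The article title, context, and question are invented for illustration and are not taken from the dataset, and `iter_qa_records` is a hypothetical helper name.

```python
import json

# Hypothetical miniature file in the SQuAD v1.1 layout (not real dataset content)
raw = json.loads("""
{
  "version": "1.1",
  "data": [
    {
      "title": "Example_Article",
      "paragraphs": [
        {
          "context": "BERT was introduced by researchers at Google in 2018.",
          "qas": [
            {
              "id": "q1",
              "question": "Who introduced BERT?",
              "answers": [{"answer_start": 23, "text": "researchers at Google"}]
            }
          ]
        }
      ]
    }
  ]
}
""")


def iter_qa_records(squad):
    """Yield one flat record per question from the article/paragraph/qas nesting."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield {
                    "id": qa["id"],
                    "title": article["title"],
                    "question": qa["question"],
                    "n_answers": len(qa["answers"]),
                }


records = list(iter_qa_records(raw))
print(records)
```

When the real files are read with `pd.read_json`, each row of the resulting DataFrame holds one such article in its `data` column, as the printed output further down shows.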

In [None]:
# Loading the SQuAD training and validation dataset
train_data = pd.read_json("/content/drive/MyDrive/train-v1.1.json")
validation_data = pd.read_json("/content/drive/MyDrive/dev-v1.1.json")


# **Preprocess Data**
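The SQuAD dataset, in JSON format, contains questions with answers and context paragraphs, split into training and validation sets. For this study, 200 rows were sampled from the training file and 25 rows from the validation file to manage resources. Each row is a full Wikipedia article containing many paragraphs and question-answer pairs, so the number of individual examples is much larger than the number of rows.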



In [None]:
# Selecting a random sample of 200 rows (articles) from the training data
train_sample = train_data.sample(n=200, random_state=0).reset_index(drop=True)

# Selecting a random sample of 25 rows (articles) from the validation data
validation_sample = validation_data.sample(n=25, random_state=0).reset_index(drop=True)

In [None]:
# Printing the structure of the data
print(train_sample.head())
print(validation_sample.head())

                                                data  version
0  {'title': 'Cyprus', 'paragraphs': [{'context':...      1.1
1  {'title': 'Nonprofit_organization', 'paragraph...      1.1
2  {'title': 'Alsace', 'paragraphs': [{'context':...      1.1
3  {'title': 'Humanism', 'paragraphs': [{'context...      1.1
4  {'title': 'Iran', 'paragraphs': [{'context': '...      1.1
                                                data  version
0  {'title': 'Construction', 'paragraphs': [{'con...      1.1
1  {'title': 'Computational_complexity_theory', '...      1.1
2  {'title': 'Pharmacy', 'paragraphs': [{'context...      1.1
3  {'title': 'Private_school', 'paragraphs': [{'c...      1.1
4  {'title': 'Jacksonville,_Florida', 'paragraphs...      1.1


The first five rows of the training and validation samples are printed to give an insight into what the dataset contains.

In [None]:
# Summary information for the training sample
train_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   data     200 non-null    object 
 1   version  200 non-null    float64
dtypes: float64(1), object(1)
memory usage: 3.2+ KB


In [None]:
# Summary information for the validation sample
validation_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   data     25 non-null     object 
 1   version  25 non-null     float64
dtypes: float64(1), object(1)
memory usage: 528.0+ bytes


# **Preparing Data for Tokenization**
The SQuAD data was prepared by extracting IDs, titles, contexts, questions, and answers from the raw data. Only the first annotated answer of each question was kept, and it was formatted with its start position (a character offset into the context) and its text. This organized data was then stored in a new DataFrame, ready for tokenization with the BERT tokenizer.
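
Because each answer is stored as a character offset plus the answer text, a quick consistency check is to slice the context at that offset and compare the result with the text. The sketch below assumes the answer dictionary format described above; `answer_is_aligned` and the sample sentence are illustrative and not part of the notebook.

```python
def answer_is_aligned(context, answer):
    """Return True if the answer text sits exactly at its recorded character offset."""
    start = answer["answer_start"][0]
    text = answer["text"][0]
    return context[start:start + len(text)] == text


# Illustrative sentence, not taken from the dataset
context = "The Stanford Question Answering Dataset was released in 2016."
start = context.index("2016")

good = {"answer_start": [start], "text": ["2016"]}
bad = {"answer_start": [start - 1], "text": ["2016"]}  # offset shifted by one character

print(answer_is_aligned(context, good))
print(answer_is_aligned(context, bad))
```

A check of this kind could be run over the cleaned DataFrame before tokenization to catch misaligned offsets early.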

In [None]:
def prepare_data_for_tokenizer(data):
    """
    Processes a DataFrame containing raw SQuAD data,
    organizing it to be compatible with the BERT tokenizer.
    """

    # Initializing lists to store processed data
    ids = []
    titles = []
    contexts = []
    questions = []
    answers = []

    # Iterating over each row in the DataFrame
    for _, row in data.iterrows():
        document = row['data']
        qa_title = document['title']
        paragraphs = document['paragraphs']

        # Extracting context, questions, and answers from each paragraph
        for paragraph in paragraphs:
            context = paragraph['context']
            qas = paragraph['qas']

            for qa in qas:
                qa_id = qa['id']
                question = qa['question']
                answer = qa['answers'][0]

                formatted_answer = {
                    'answer_start': [answer['answer_start']],
                    'text': [answer['text']]
                }

                # Appending extracted data to corresponding lists
                ids.append(qa_id)
                titles.append(qa_title)
                contexts.append(context)
                questions.append(question)
                answers.append(formatted_answer)

    # Creating a DataFrame from the organized data
    cleaned_data = {
        'id': ids,
        'title': titles,
        'context': contexts,
        'question': questions,
        'answers': answers
    }

    return pd.DataFrame(cleaned_data)

# Preparing the data
train_cleaned = prepare_data_for_tokenizer(train_sample)
validation_cleaned = prepare_data_for_tokenizer(validation_sample)


In [None]:
# Displaying the cleaned data to verify the structure
print(train_cleaned.head())
print(validation_cleaned.head())


                         id   title  \
0  572e7c43cb0c0d14000f11a6  Cyprus   
1  572e7c43cb0c0d14000f11a7  Cyprus   
2  572e7c43cb0c0d14000f11a8  Cyprus   
3  572e7c43cb0c0d14000f11a9  Cyprus   
4  572e7c43cb0c0d14000f11aa  Cyprus   

                                             context  \
0  Cyprus (i/ˈsaɪprəs/; Greek: Κύπρος IPA: [ˈcipr...   
1  Cyprus (i/ˈsaɪprəs/; Greek: Κύπρος IPA: [ˈcipr...   
2  Cyprus (i/ˈsaɪprəs/; Greek: Κύπρος IPA: [ˈcipr...   
3  Cyprus (i/ˈsaɪprəs/; Greek: Κύπρος IPA: [ˈcipr...   
4  Cyprus (i/ˈsaɪprəs/; Greek: Κύπρος IPA: [ˈcipr...   

                                            question  \
0                What is the official name of Cypus?   
1                           Where is Cyprus located?   
2                  What countries are nearby Cyprus?   
3  What is Cyprus' affiliation with the European ...   
4  Is Cyprus an island country or land-locked cou...   

                                             answers  
0  {'answer_start': [99], 'text': ['

The data is now preprocessed and ready for tokenization.

# **Tokenizing Data**
Tokenization is crucial for preparing text data for the BERT model. Using BertTokenizerFast, text is split into tokens, special tokens are added, and attention masks are created to differentiate between padding and real tokens. Additionally, the character-level start and end positions of each answer are mapped to token indices within the context paragraph. This mapping gives the model the labels it needs to learn which text segments answer the questions.
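
The character-to-token mapping can be illustrated without BERT. The sketch below uses a toy regular-expression tokenizer, an assumption standing in for WordPiece, that reports character offsets in the same spirit as the offset mapping of a fast tokenizer. `toy_tokenize` and `char_span_to_token_span` are hypothetical names.

```python
import re


def toy_tokenize(text):
    """Split into word and punctuation tokens, returning (token, (start, end)) pairs.

    This is a stand-in for WordPiece, used only to produce character offsets.
    """
    return [(m.group(), m.span()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def char_span_to_token_span(offsets, start_char, end_char):
    """Map the character span [start_char, end_char) to inclusive token indices."""
    start_tok = next(i for i, (s, e) in enumerate(offsets) if s <= start_char < e)
    end_tok = next(i for i, (s, e) in enumerate(offsets) if s < end_char <= e)
    return start_tok, end_tok


context = "The United Kingdom joined the European Union on January 1, 1973."
answer = "January 1, 1973"
start_char = context.index(answer)
end_char = start_char + len(answer)

offsets = [span for _, span in toy_tokenize(context)]
start_tok, end_tok = char_span_to_token_span(offsets, start_char, end_char)

# Slicing the context with the token offsets recovers the answer text
recovered = context[offsets[start_tok][0]:offsets[end_tok][1]]
print(start_tok, end_tok, recovered)
```

With a real question-context pair the search has to be limited to the context tokens, because the offsets of question tokens refer to the question string rather than the context.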


In [None]:
# Loading BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

def preprocess_function(data):
    # Tokenizing question/context pairs; only the context is truncated to fit max_length
    inputs = tokenizer(
        data['question'].tolist(),
        data['context'].tolist(),
        max_length=384,
        truncation="only_second",
        padding="max_length",
        return_offsets_mapping=True,
        return_tensors='pt'
    )

    start_positions = []
    end_positions = []

    for i in range(len(data)):
        # Character span of the answer within the context
        start_char = data['answers'][i]['answer_start'][0]
        end_char = start_char + len(data['answers'][i]['text'][0])
        offset_mapping = inputs['offset_mapping'][i].tolist()
        sequence_ids = inputs.sequence_ids(i)

        # First and last token indices that belong to the context (sequence id 1)
        context_start = sequence_ids.index(1)
        context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

        if offset_mapping[context_start][0] > start_char or offset_mapping[context_end][1] < end_char:
            # The answer was cut off by truncation: fall back to the first context token
            start_positions.append(context_start)
            end_positions.append(context_start)
        else:
            # Searching only the context tokens, because the offsets of question tokens
            # refer to the question string and could otherwise match by accident
            context_range = range(context_start, context_end + 1)
            start_positions.append(next(idx for idx in context_range if offset_mapping[idx][0] <= start_char < offset_mapping[idx][1]))
            end_positions.append(next(idx for idx in context_range if offset_mapping[idx][0] < end_char <= offset_mapping[idx][1]))

    inputs['start_positions'] = start_positions
    inputs['end_positions'] = end_positions
    return inputs

train_encodings = preprocess_function(train_cleaned)
val_encodings = preprocess_function(validation_cleaned)



# **Preparing the Dataset Class**
The tokenized data is organized into a dataset class compatible with PyTorch's DataLoader. This custom class ensures efficient loading and batching of data during training. It handles tokenized inputs and labels, making it easy to integrate with the training process.
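
The contract that a map-style dataset has to satisfy is small: `__len__` reports the number of examples and `__getitem__` returns one example as a dictionary. The sketch below shows that contract in plain Python, with a hand-rolled batching loop standing in for PyTorch's DataLoader. `ToyQADataset`, `iterate_batches`, and the token ids are illustrative assumptions.

```python
class ToyQADataset:
    """Map-style dataset: __len__ and __getitem__ are what a data loader relies on."""

    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        # One example: the idx-th entry of every field
        return {key: values[idx] for key, values in self.encodings.items()}


def iterate_batches(dataset, batch_size):
    """Simplified stand-in for DataLoader: group consecutive items field by field."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        items = [dataset[i] for i in range(start, stop)]
        yield {key: [item[key] for item in items] for key in items[0]}


# Made-up token ids and labels for three examples
encodings = {
    "input_ids": [[101, 7, 102], [101, 8, 102], [101, 9, 102]],
    "start_positions": [1, 1, 1],
    "end_positions": [1, 1, 1],
}

batches = list(iterate_batches(ToyQADataset(encodings), batch_size=2))
print(len(batches), batches[0]["input_ids"])
```

The real DataLoader additionally stacks the fields into tensors and can shuffle the examples, which this sketch leaves out.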


In [None]:
class QADataset(Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings['input_ids'])

    def __getitem__(self, idx):
        # as_tensor avoids the copy-construct warning that torch.tensor raises on existing tensors
        item = {key: torch.as_tensor(val[idx]) for key, val in self.encodings.items()}
        return item

train_dataset = QADataset(train_encodings)
val_dataset = QADataset(val_encodings)


# **Fine-tuning BERT Model**
BERT, a pre-trained transformer model, is the core of the QA system. Using the SQuAD dataset, BERT is fine-tuned for question answering. This involves setting training parameters like learning rate, batch size, and epochs. The transformers library's Trainer class simplifies this process by providing a high-level API for training and evaluation, helping the model learn to identify the correct text segments to answer questions.

In [None]:
# Loading the pre-trained BERT model
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

# Defining the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Initializing the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

# Train the model
trainer.train()


Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 1.4837        | 1.438912        |
| 2     | 1.106         | 1.399738        |
| 3     | 0.8559        | 1.449703        |



TrainOutput(global_step=7788, training_loss=1.2807636231597288, metrics={'train_runtime': 3929.6651, 'train_samples_per_second': 31.709, 'train_steps_per_second': 1.982, 'total_flos': 2.441916177981696e+16, 'train_loss': 1.2807636231597288, 'epoch': 3.0})
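
The step count in the TrainOutput above can be used to estimate how many question-answer examples the 200 sampled articles produced. The following is a rough worked calculation, assuming a single device and no gradient accumulation (the Trainer default), so that each optimizer step consumes one batch of 16 examples.

```python
# Values copied from the TrainOutput and the TrainingArguments above
global_step = 7788
num_train_epochs = 3
per_device_train_batch_size = 16
train_runtime_s = 3929.6651
train_samples_per_second = 31.709

# Steps per epoch, assuming one device and gradient_accumulation_steps=1
steps_per_epoch = global_step // num_train_epochs

# The last batch of an epoch may be partially filled, which gives a range
max_examples = steps_per_epoch * per_device_train_batch_size
min_examples = (steps_per_epoch - 1) * per_device_train_batch_size + 1

# Independent estimate from the reported throughput
approx_examples = train_runtime_s * train_samples_per_second / num_train_epochs

print(steps_per_epoch, min_examples, max_examples, round(approx_examples))
```

Under these assumptions both estimates agree on roughly 41,500 training examples, which shows that the 200 sampled rows expand into tens of thousands of question-answer pairs.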

# **Evaluating the Model**
Evaluating the model's performance is essential to determining how accurate its answers are. The model was evaluated on the validation sample with trainer.evaluate(). The evaluation loss is 1.45, the runtime is 47.65 seconds, the sample rate is 98.55 samples per second, and the step rate is 6.17 steps per second. A compute_metrics function based on the SQuAD metric from the datasets package is also defined, but it is not passed to the Trainer, so exact-match and F1 scores are not part of the reported results. These figures therefore describe the validation loss and processing speed rather than answer accuracy. The validation loss also rises from 1.40 after the second epoch to 1.45 after the third while the training loss keeps falling, which suggests that the model is starting to overfit.
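
For reference, the two scores usually reported on SQuAD are exact match and token-level F1 between the predicted and reference answer strings. The sketch below follows the spirit of the SQuAD v1.1 evaluation (lowercasing, removing punctuation and articles, collapsing whitespace). It is a simplified illustration with hypothetical function names, not the official script.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """True when the normalized strings are identical."""
    return normalize_answer(prediction) == normalize_answer(reference)


def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("january 1, 1973", "January 1, 1973"))
print(f1_score("on January 1, 1973", "January 1, 1973"))
```

Applying such scores to the fine-tuned model would first require decoding the predicted start and end token positions back into answer text.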

In [None]:
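# Loading the evaluation metric
# Note: load_metric is deprecated in recent versions of datasets; the evaluate library provides the replacement
metric = load_metric("squad")

# Note: this function is not passed to the Trainer, so it does not affect the results below.
# The SQuAD metric expects answer strings, so the raw logits would first need to be decoded into text.
def compute_metrics(p):
    return metric.compute(predictions=p.predictions, references=p.label_ids)

# Evaluating the model (reports the loss and runtime statistics)
results = trainer.evaluate()
print(results)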


{'eval_loss': 1.449703335762024, 'eval_runtime': 47.6529, 'eval_samples_per_second': 98.546, 'eval_steps_per_second': 6.17, 'epoch': 3.0}


# **Example Question-Answer**
A sample question and answer were created to show how my method works in practice. This involves asking a question, providing a context, and using the model to extract the answer from that context.

For example, with the question "When did the United Kingdom join the European Union?" and a context beginning "The United Kingdom joined the European Union on January 1, 1973", the model correctly identifies "january 1, 1973" as the answer. The answer is in lowercase because the bert-base-uncased tokenizer lowercases its input. Similar examples with different contexts and questions can show the model's flexibility.

In [None]:
def answer_question(question, context):
    # Encoding the question and context as a single input pair
    inputs = tokenizer.encode_plus(question, context, return_tensors='pt')
    input_ids = inputs['input_ids'].tolist()[0]

    # Ensuring the model and inputs are on the same device
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    model.to(device)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Running the model in inference mode (no gradient tracking)
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)

    # Picking the most likely start and end tokens; the +1 includes the end token in the slice
    answer_start = int(torch.argmax(outputs.start_logits))
    answer_end = int(torch.argmax(outputs.end_logits)) + 1

    # Converting the selected token ids back into a string
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    return answer


In [None]:
# Example:
question = "When did the United Kingdom join the European Union?"
context = "The United Kingdom joined the European Union on January 1, 1973. It was a significant moment in British history, marking the beginning of the UK's integration into the European political and economic sphere."
print(f"Q: {question}\nA: {answer_question(question, context)}")


Q: When did the United Kingdom join the European Union?
A: january 1, 1973
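
The function above picks the start and end tokens independently, so the predicted end can fall before the predicted start, in which case the slice is empty. A common alternative is to search for the best-scoring pair with the end at or after the start. The sketch below shows that idea on plain Python lists. `best_span`, the `max_answer_len` parameter, and the logit values are illustrative assumptions rather than part of the notebook.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest summed score.

    Only pairs with start <= end < start + max_answer_len are considered.
    """
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Made-up logits where the independent argmax gives start=2 and end=0 (an invalid span)
start_logits = [0.1, 0.2, 5.0, 0.3, 0.1]
end_logits = [4.0, 0.1, 0.2, 3.5, 0.1]

independent = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))
constrained = best_span(start_logits, end_logits)
print(independent, constrained)
```

The constrained search trades a little extra computation for the guarantee that the returned span is never empty.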


# **Limitation Of The Work**
1. A large amount of training data is required for accurate predictions.
2. Training on the entire SQuAD dataset crashed the session because it requires more computational power than was available. This is why I reduced the amount of data used, which in turn limits the accuracy of the predictions.

# **Conclusion**
Creating a question-answer model with the SQuAD dataset and BERT involves several steps: setting up the environment, loading and preprocessing data, tokenizing inputs, fine-tuning the model, and evaluating its performance. In the example above, the fine-tuned BERT model extracted the correct answer span from the context, which reflects its strong contextual understanding.

Fine-tuning a pre-trained model like BERT for specific tasks showcases significant progress in NLP, enabling the development of reliable question-answer systems. This approach can be applied to various fields, helping create intelligent systems that understand and answer human questions accurately.

# **References**
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv preprint arXiv:1810.04805. https://arxiv.org/abs/1810.04805

Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). “SQuAD: 100,000+ Questions for Machine Comprehension of Text.” Conference on Empirical Methods in Natural Language Processing, arXiv preprint arXiv:1606.05250. https://doi.org/10.48550/arXiv.1606.05250

S, Lavanya. (2022, August 24). End to End Question-Answering System Using NLP and SQuAD Dataset. Available at: https://www.analyticsvidhya.com/blog/2021/11/end-to-end-question-answering-system-using-nlp-and-squad-dataset/ (Accessed: 01 August 2024).

Stanford Question Answering Dataset. (2019, November 17). Available at: https://www.kaggle.com/datasets/stanfordu/stanford-question-answering-dataset (Accessed: 20 July 2024).