<a href="https://colab.research.google.com/github/ummeamunira/nlp-question_answer/blob/main/Question_Answering_with_T5_and_Safety_Measures.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

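**Goal:** Build a question-answering system by fine-tuning T5 on a dataset of factual question-answer pairs (e.g., a Kaggle dataset such as Natural Questions). The code cells in this notebook use SQuAD, loaded through the Hugging Face `datasets` library.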

**Techniques:**

T5 Fine-tuning: Fine-tune a pre-trained T5 model (`t5-small` in this notebook) on the question-answer dataset so that it can answer open-ended, challenging, or multi-step questions.

Prompt Engineering: Design prompts that frame the user's question and instruct T5 to find the most relevant and factually accurate answer in the supplied data. Example: "Given the question '[user question]', find the answer within the dataset."
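The code cells in this notebook feed T5 inputs of the form `question: <question> context: <context>`. A minimal sketch of a helper that builds such a prompt is shown here; the name `build_prompt` and the whitespace normalization are assumptions made for illustration, not part of the notebook.

```python
def build_prompt(question: str, context: str) -> str:
    """Format a question and its supporting passage as a single T5 input string."""
    # Collapse runs of whitespace (including newlines) so the prompt stays on one line.
    clean_question = " ".join(question.split())
    clean_context = " ".join(context.split())
    return f"question: {clean_question} context: {clean_context}"


# Example usage with a hypothetical question and passage
prompt = build_prompt(
    "What is the capital of France?",
    "France is a country in Europe.\nThe capital city of France is Paris.",
)
print(prompt)
```

Keeping the prompt format in one helper makes it easier to use the same layout at training time and at inference time.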

**Safety Measures:**

Confidence Scoring: Implement a confidence scoring mechanism for T5's answers. This helps identify situations where the model is unsure or the answer might be unreliable. Human review can be prioritized for low-confidence answers.

Answer Justification: Encourage T5 to justify its answer by providing relevant supporting evidence from the dataset. This enhances transparency and allows users to evaluate the answer's credibility.
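One way the confidence-scoring idea above could be turned into a review rule is sketched here. It assumes the per-token probabilities of a generated answer are already available as a list of floats; the function names and the 0.5 threshold are hypothetical and would need tuning on real data.

```python
import math

REVIEW_THRESHOLD = 0.5  # hypothetical cutoff, chosen only for illustration


def sequence_confidence(token_probs):
    """Return the geometric mean of the per-token probabilities (0.0 if unusable)."""
    if not token_probs or any(p <= 0.0 for p in token_probs):
        return 0.0
    # Averaging log-probabilities normalizes for answer length.
    mean_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_log_prob)


def needs_human_review(token_probs, threshold=REVIEW_THRESHOLD):
    """Flag an answer for human review when its confidence falls below the threshold."""
    return sequence_confidence(token_probs) < threshold


# Example usage with made-up probabilities
print(needs_human_review([0.9, 0.95]))
print(needs_human_review([0.9, 0.1]))
```

The geometric mean is used so that a single very uncertain token pulls the score down noticeably, while longer answers are not penalized simply for having more tokens.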


In [None]:
!pip install transformers datasets

In [None]:
pip install transformers[torch]





In [1]:
# Import necessary libraries
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
from datasets import load_dataset, DatasetDict

In [None]:
# #load json data
# import json

# # Load the dataset
# with open('dataset.json', 'r') as f:
#     data = json.load(f)

# # Convert the dataset into a format compatible with HuggingFace's Datasets library
# dataset = {
#     'train': {'question': [], 'context': [], 'answer': []},
#     'validation': {'question': [], 'context': [], 'answer': []}
# }

# # Split data into training and validation sets
# split_ratio = 0.8
# split_index = int(len(data) * split_ratio)
# train_data = data[:split_index]
# validation_data = data[split_index:]

# # Fill the dataset dictionary
# for entry in train_data:
#     dataset['train']['question'].append(entry['question'])
#     dataset['train']['context'].append(entry['context'])
#     dataset['train']['answer'].append(entry['answer'])

# for entry in validation_data:
#     dataset['validation']['question'].append(entry['question'])
#     dataset['validation']['context'].append(entry['context'])
#     dataset['validation']['answer'].append(entry['answer'])

# # Convert to DatasetDict format
# from datasets import Dataset, DatasetDict

# train_dataset = Dataset.from_dict(dataset['train'])
# validation_dataset = Dataset.from_dict(dataset['validation'])
# dataset = DatasetDict({'train': train_dataset, 'validation': validation_dataset})


In [2]:
# Load the SQuAD dataset
dataset = load_dataset('squad')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-small')

# Preprocess function
def preprocess_function(examples):
    # Combine the question and context
    inputs = [f"question: {q} context: {c}" for q, c in zip(examples['question'], examples['context'])]
    # Extract the answer text
    targets = [a['text'][0] if len(a['text']) > 0 else '' for a in examples['answers']]
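    # Tokenize inputs and targets
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding='max_length')
    labels = tokenizer(targets, max_length=128, truncation=True, padding='max_length').input_ids
    # Replace padding token ids in the labels with -100 so the loss ignores padded positions
    labels = [
        [token_id if token_id != tokenizer.pad_token_id else -100 for token_id in label]
        for label in labels
    ]
    model_inputs['labels'] = labels
    return model_inputs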

# Apply the preprocess function to the dataset
tokenized_datasets = dataset.map(preprocess_function, batched=True)

# Hold out 10% of the SQuAD training split, then divide that 10% evenly into validation and test sets
train_testvalid = tokenized_datasets['train'].train_test_split(test_size=0.1)
test_valid = train_testvalid['test'].train_test_split(test_size=0.5)

# Create a DatasetDict with train, validation, and test sets
tokenized_datasets = DatasetDict({
    'train': train_testvalid['train'],
    'validation': test_valid['test'],
    'test': test_valid['train']
})

# Print a sample to check if the data is correctly preprocessed
print(tokenized_datasets['train'][0])




{'id': '56df6e3756340a1900b29b26', 'title': 'Plymouth', 'context': 'In 2004 the old Drake Circus shopping centre and Charles Cross car park were demolished and replaced by the latest Drake Circus Shopping Centre, which opened in October 2006. It received negative feedback before opening when David Mackay said it was already "ten years out of date". In contrast, the Theatre Royal\'s production and education centre, TR2, which was built on wasteland at Cattedown, was a runner-up for the RIBA Stirling Prize for Architecture in 2003.', 'question': 'Who spoke disparagingly about the Drake Circus Shopping Centre?', 'answers': {'text': ['David Mackay'], 'answer_start': [225]}, 'input_ids': [822, 10, 2645, 5468, 8378, 5855, 120, 81, 8, 24823, 30118, 14830, 2969, 58, 2625, 10, 86, 4406, 8, 625, 24823, 30118, 2309, 2050, 11, 5417, 4737, 443, 2447, 130, 20, 4641, 11904, 11, 5821, 57, 8, 1251, 24823, 30118, 14830, 2969, 6, 84, 2946, 16, 1797, 15066, 94, 1204, 2841, 3160, 274, 2101, 116, 1955, 2143

In [None]:
pip install accelerate -U


In [None]:
# Load the T5 model
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',  # evaluate once per epoch (newer transformers releases name this argument `eval_strategy`)
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
)

# Fine-tune the model
trainer.train()







In [None]:
# Function to calculate a confidence score for the generated answer
def confidence_score(scores):
    # `scores` holds one logits tensor per generated token, each of shape (batch_size, vocab_size).
    # Assuming greedy decoding (no sampling or beam search), the highest softmax probability
    # at each step is the probability of the token that was actually generated.
    step_probs = [torch.softmax(step_logits, dim=-1).max(dim=-1).values for step_logits in scores]
    # Average the per-token probabilities over the whole answer (batch size is 1 here)
    return torch.stack(step_probs).mean().item()

# Function to generate an answer together with its confidence score
def generate_answer(question, context):
    input_text = f"question: {question} context: {context}"
    # Move the inputs to the model's device (the Trainer may have placed the model on a GPU)
    input_ids = tokenizer(input_text, return_tensors='pt').input_ids.to(model.device)
    output = model.generate(input_ids, return_dict_in_generate=True, output_scores=True)
    answer = tokenizer.decode(output.sequences[0], skip_special_tokens=True)
    # Score the full answer, not only its first token
    score = confidence_score(output.scores)
    return answer, score

# Example usage
question = "What is the capital of France?"
context = "France is a country in Europe. The capital city of France is Paris."

answer, score = generate_answer(question, context)
print(f"Answer: {answer}, Confidence Score: {score}")

# Function to provide answer with justification
def generate_answer_with_justification(question, context):
    # Reuse generate_answer so the generation and scoring logic lives in one place
    answer, score = generate_answer(question, context)

    # Justification (could be retrieved from context or explanation logic)
    justification = f"The answer is based on the provided context: '{context}'"

    return answer, score, justification

# Example usage
answer, score, justification = generate_answer_with_justification(question, context)
print(f"Answer: {answer}\nConfidence Score: {score}\nJustification: {justification}")
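The justification above repeats the whole context. A narrower justification could quote only the sentence that contains the answer. The sketch below assumes a simple punctuation-based sentence split and a case-insensitive substring match; the helper name `supporting_sentence` is hypothetical and not part of the notebook.

```python
import re


def supporting_sentence(answer, context):
    """Return the first sentence of `context` that contains `answer`, or None."""
    if not answer:
        return None
    # Naive sentence split: break on whitespace that follows '.', '!' or '?'.
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    for sentence in sentences:
        if answer.lower() in sentence.lower():
            return sentence
    return None


# Example usage with the question context from this notebook
example_context = "France is a country in Europe. The capital city of France is Paris."
print(supporting_sentence("Paris", example_context))
```

If no sentence contains the answer, the function returns `None`, which could itself be treated as a signal that the answer is not supported by the context.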
