<a href="https://colab.research.google.com/github/ummeamunira/nlp-question_answer/blob/main/Question_Answering_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This project uses pre-trained models such as BERT and T5 to answer questions based on a given context; other models like DistilBERT or even GPT-3 could be used in the same way. It's a good starting point because it needs minimal data preprocessing and lets you explore the capabilities of LLMs on a concrete task.

**Steps to Create the QA System:**
*   Setup and Import Libraries
*   Load a Pre-trained Model and Tokenizer
*   Provide Context and Questions
*   Tokenize the Inputs
*   Get Answers from the Model
*   Display the Answers
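
As a compact alternative to walking through these steps by hand, the `question-answering` pipeline in `transformers` wraps model loading, tokenization, and answer decoding in one call. The sketch below is only an illustration: the function name `answer_question` and its `qa` parameter are hypothetical, and the checkpoint name is the one used later in this notebook.

```python
def answer_question(question, context, qa=None):
    """Return the answer string for `question` given `context`.

    `qa` is any callable that accepts question=... and context=... and returns
    a dict with an "answer" key. If omitted, a transformers pipeline is built.
    """
    if qa is None:
        # Imported lazily so the function can also be exercised with a stub.
        from transformers import pipeline

        qa = pipeline(
            "question-answering",
            model="bert-large-uncased-whole-word-masking-finetuned-squad",
        )
    result = qa(question=question, context=context)
    return result["answer"]
```

Building the pipeline on every call reloads the model, so in practice you would likely create it once and pass it in through `qa`.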





In [None]:
# Install the transformers library if not already installed
# !pip install transformers

import torch
from transformers import BertForQuestionAnswering, BertTokenizer


**Using BERT-based transformers**

In [None]:
# Load pre-trained model and tokenizer
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
model = BertForQuestionAnswering.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)



Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [None]:
# Provide the context and the question
context = """
Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and animals.
Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.
Colloquially, the term "artificial intelligence" is often used to describe machines (or computers) that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving".
"""

question = "What is artificial intelligence?"


In [None]:
# Tokenize input
inputs = tokenizer(question, context, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]

# Get model outputs (gradients are not needed for inference)
with torch.no_grad():
    outputs = model(**inputs)
start_scores = outputs.start_logits
end_scores = outputs.end_logits
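
When the tokenizer receives a question and a context together, BERT-style tokenizers lay them out as `[CLS] question [SEP] context [SEP]` and mark the two segments with `token_type_ids`. The toy sketch below mimics that layout with whitespace-split words instead of real WordPiece tokens; `build_pair_input` is a hypothetical helper, not part of `transformers`.

```python
def build_pair_input(question_tokens, context_tokens):
    """Mimic the BERT sentence-pair layout for already-tokenized inputs."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question, and the first [SEP];
    # segment 1 covers the context and the final [SEP].
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair_input(
    "what is ai ?".split(), "ai is machine intelligence".split()
)
for token, segment in zip(tokens, segment_ids):
    print(f"{segment}  {token}")
```

The start and end logits returned by the model have one score per position in this combined sequence, which is why the answer is later sliced out of `input_ids`.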


In [None]:
# Get the most likely beginning and end of answer with the argmax of the score
# (.item() converts the 0-dim tensors to plain Python ints for slicing)
answer_start = torch.argmax(start_scores).item()
answer_end = torch.argmax(end_scores).item() + 1

# Convert tokens to answer
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

print(f"Question: {question}")
print(f"Answer: {answer}")


Question: What is artificial intelligence?
Answer: intelligence demonstrated by machines
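
Taking the argmax of the start and end scores independently works here, but nothing prevents the chosen end from falling before the chosen start, in which case the slice is empty. One common remedy is to score every valid (start, end) pair jointly. The sketch below does that over plain Python lists; `best_span` and `max_answer_len` are hypothetical names, and the logits in the example are made up.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logits[start] + end_logits[end].

    Only spans with start <= end and at most `max_answer_len` tokens are
    considered. `end` is inclusive, so slice with [start:end + 1].
    """
    best = (None, None, float("-inf"))
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best[2]:
                best = (start, end, score)
    return best


# Made-up logits where the independent argmax would pick start=2, end=1.
start_logits = [0.1, 0.2, 5.0, 0.3]
end_logits = [0.1, 4.0, 0.2, 3.0]
start, end, score = best_span(start_logits, end_logits)
print(start, end, score)
```

With real model outputs you could pass `start_scores[0].tolist()` and `end_scores[0].tolist()` from the cell above.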


**Using T5**

In [None]:
# Install the transformers library if not already installed
# !pip install transformers

import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer


In [None]:
# Load pre-trained model and tokenizer
model_name = "t5-base"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
context = """
Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and animals.
Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.
Colloquially, the term "artificial intelligence" is often used to describe machines (or computers) that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving".
"""

question = "What is artificial intelligence?"


In [None]:
# Prepare the input text in the format that T5 expects: "question: <question> context: <context>"
input_text = f"question: {question} context: {context}"

# Tokenize the input text
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)

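# Generate the answer text with beam search. T5 was trained on this
# "question: ... context: ..." format, so the output is typically an answer
# taken from the context rather than a free paraphrase.
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=150,
    num_beams=5,
    early_stopping=True
)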

# Decode the generated text
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Question: {question}")
print(f"Paraphrased Answer: {answer}")


Question: What is artificial intelligence?
Paraphrased Answer: intelligence demonstrated by machines
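
The T5 input is just a single string with `question:` and `context:` prefixes. Because the triple-quoted context above carries leading and embedded newlines, a small helper can collapse whitespace before building the prompt. This is a sketch only: `build_t5_qa_input` is a hypothetical name, and the cleanup may make no difference to the model's output.

```python
def build_t5_qa_input(question, context):
    """Format a question/context pair the way the T5 cells above do."""

    def collapse(text):
        # Replace any run of whitespace (including newlines) with one space.
        return " ".join(text.split())

    return f"question: {collapse(question)} context: {collapse(context)}"


print(build_t5_qa_input("What is AI?", "\nAI is intelligence\nshown by machines.\n"))
```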



In [5]:
import torch
from transformers import BertForQuestionAnswering, BertTokenizer, T5ForConditionalGeneration, T5Tokenizer

# Load BERT model and tokenizer for QA
bert_model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
bert_model = BertForQuestionAnswering.from_pretrained(bert_model_name)
bert_tokenizer = BertTokenizer.from_pretrained(bert_model_name)

# Load T5 model and tokenizer for paraphrasing
t5_model_name = "t5-base"
t5_model = T5ForConditionalGeneration.from_pretrained(t5_model_name)
t5_tokenizer = T5Tokenizer.from_pretrained(t5_model_name)

# Provide the context and the question
context = """
Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and animals.
Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals.
Colloquially, the term "artificial intelligence" is often used to describe machines (or computers) that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving".
"""

question = "What is artificial intelligence?"

# Tokenize input
inputs = bert_tokenizer(question, context, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]

# Get model outputs (gradients are not needed for inference)
with torch.no_grad():
    outputs = bert_model(**inputs)
start_scores = outputs.start_logits
end_scores = outputs.end_logits

# Get the most likely beginning and end of answer with the argmax of the score
# (.item() converts the 0-dim tensors to plain Python ints for slicing)
answer_start = torch.argmax(start_scores).item()
answer_end = torch.argmax(end_scores).item() + 1

# Convert tokens to answer
answer_tokens = input_ids[answer_start:answer_end]
answer = bert_tokenizer.convert_tokens_to_string(bert_tokenizer.convert_ids_to_tokens(answer_tokens))

print(f"Extracted Answer: {answer}")

# Prepare the input text for T5
input_text = f"question: {question} context: {context}"

# Tokenize the input text
t5_inputs = t5_tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)

# Generate the paraphrased answer
t5_outputs = t5_model.generate(
    t5_inputs["input_ids"],
    max_length=150,
    num_beams=5,
    early_stopping=True
)

# Decode the generated text
paraphrased_answer = t5_tokenizer.decode(t5_outputs[0], skip_special_tokens=True)

print(f"Paraphrased Answer: {paraphrased_answer}")




Extracted Answer: intelligence demonstrated by machines
Paraphrased Answer: intelligence demonstrated by machines
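
Both models return the same string here. To check agreement programmatically across many questions, answers are usually normalized first so that case, punctuation, and articles do not count as differences. The sketch below follows the normalization commonly used for SQuAD-style exact match; `normalize_answer` and `exact_match` are illustrative names.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(answer_a, answer_b):
    """True when the two answers are identical after normalization."""
    return normalize_answer(answer_a) == normalize_answer(answer_b)


print(exact_match("intelligence demonstrated by machines",
                  "Intelligence demonstrated by machines."))
```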


In [12]:
# Provide the context and the question
context = """
The ketogenic diet is a high-fat, adequate-protein, low-carbohydrate dietary therapy that in conventional medicine is used mainly to treat hard-to-control (refractory) epilepsy in children. The diet forces the body to burn fats rather than carbohydrates.
Normally, carbohydrates in food are converted into glucose, which is then transported around the body and is important in fueling brain function.
However, if only a little carbohydrate remains in the diet, the liver converts fat into fatty acids and ketone bodies, the latter passing into the brain and replacing glucose as an energy source.
An elevated level of ketone bodies in the blood (a state called ketosis) eventually lowers the frequency of epileptic seizures. Around half of children and young people with epilepsy who have tried some form of this diet saw the number of seizures drop by at least half, and the effect persists after discontinuing the diet.[2] Some evidence shows that adults with epilepsy may benefit from the diet and that a less strict regimen, such as a modified Atkins diet, is similarly effective.[1] Side effects may include constipation, high cholesterol, growth slowing, acidosis, and kidney stones.[3]
"""

question = "who should try keto diet?"

# Tokenize input
inputs = bert_tokenizer(question, context, add_special_tokens=True, return_tensors="pt")
input_ids = inputs["input_ids"].tolist()[0]

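# Get model outputs (gradients are not needed for inference)
with torch.no_grad():
    outputs = bert_model(**inputs)
start_scores = outputs.start_logits
end_scores = outputs.end_logits

# Get the most likely beginning and end of answer with the argmax of the score
# (.item() converts the 0-dim tensors to plain Python ints for slicing)
answer_start = torch.argmax(start_scores).item()
answer_end = torch.argmax(end_scores).item() + 1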

# Convert tokens to answer
answer_tokens = input_ids[answer_start:answer_end]
answer = bert_tokenizer.convert_tokens_to_string(bert_tokenizer.convert_ids_to_tokens(answer_tokens))

print(f"Extracted Answer: {answer}")

# Prepare the input text for T5
input_text = f"question: {question} context: {context}"

# Tokenize the input text
t5_inputs = t5_tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)

# Generate the answer with beam search
# (temperature is omitted: it only takes effect when do_sample=True)
t5_outputs = t5_model.generate(
    t5_inputs["input_ids"],
    max_length=150,
    num_beams=10,
    early_stopping=True
)

# Decode the generated text
paraphrased_answer = t5_tokenizer.decode(t5_outputs[0], skip_special_tokens=True)

print(f"Paraphrased Answer: {paraphrased_answer}")


Extracted Answer: children
Paraphrased Answer: children
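
The contexts in this notebook fit comfortably within BERT's 512-token input limit, but the BERT cells do not truncate, so a much longer passage would fail. A common workaround is to split the context into overlapping windows, run the model on each, and keep the best-scoring answer. The sketch below shows only the windowing step on a generic token list; `sliding_windows`, `max_len`, and `step` are hypothetical names, and the tokenizer's own overflow options may be preferable in practice.

```python
def sliding_windows(tokens, max_len, step):
    """Split `tokens` into windows of at most `max_len`, advancing by `step`.

    Consecutive windows overlap by max_len - step tokens, so an answer that
    straddles one boundary is still fully contained in a neighboring window
    (as long as it is shorter than the overlap).
    """
    if not 0 < step <= max_len:
        raise ValueError("step must be between 1 and max_len")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += step
    return windows


for window in sliding_windows(list(range(10)), max_len=4, step=2):
    print(window)
```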
