In this tutorial, we'll use a pretrained transformer model to perform question answering (QA). The model will be given a context (a passage of text) and a question, and it will try to find the most relevant answer within the context.
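To make "find the answer within the context" concrete: extractive QA models of this kind score every token as a possible answer start and as a possible answer end, and the answer is the best-scoring valid span. The sketch below illustrates that selection step only. The tokens, the scores, and the `best_span` helper are made-up illustrations, not model output and not part of the `transformers` API.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return (start, end) token indices maximizing start + end score, with start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for start, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit
        for end in range(start, min(start + max_answer_len, len(end_scores))):
            score = s_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best

# Invented tokens and scores, for illustration only
tokens = ["Transformers", "were", "introduced", "in", "2017", "."]
start_scores = [0.1, 0.0, 0.2, 0.5, 4.0, 0.0]
end_scores = [0.0, 0.1, 0.1, 0.2, 3.5, 0.3]

start, end = best_span(start_scores, end_scores)
print(" ".join(tokens[start:end + 1]))  # prints "2017"
```

The pipeline used in the rest of this tutorial handles this step internally, so you never call anything like `best_span` yourself.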

# Import the necessary libraries:

In [1]:
from transformers import pipeline


# Create a question-answering pipeline:
We'll initialize a question-answering pipeline backed by a pretrained model.

In [2]:
question_answerer = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")





# Provide context and ask a question:
You’ll need a context paragraph (the passage in which the model should search for answers) and a specific question.

In [3]:
context = """
Transformers are a type of machine learning model introduced in 2017. They use self-attention mechanisms to
process input data. Since their introduction, they have achieved state-of-the-art performance in various natural
language processing tasks like machine translation, text summarization, and question answering. The transformer
architecture led to the creation of models like BERT, GPT, and others.
"""

question = ["What tasks are transformers used for?", "What is a transformer?"]


# Get the answer:
Call `question_answerer()` with the question and the context to get the answer. Because `question` is a list here, the pipeline returns a list of result dictionaries, one per question.

**Parameters:**

*   context: The passage where the model will search for the answer.
*   question: The question the model will try to answer based on the context.
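For a single question, the pipeline returns a dictionary with `score`, `start`, `end`, and `answer` keys, where `start` and `end` are character offsets into the context. The sketch below mimics that structure with a hand-built dictionary (the `score` value is invented for illustration) to show how the offsets relate to the answer text.

```python
context = "Transformers are a type of machine learning model introduced in 2017."
answer_text = "2017"

# Hand-built stand-in for a pipeline result; the score is an invented value
start = context.find(answer_text)
result = {
    "score": 0.98,
    "start": start,
    "end": start + len(answer_text),
    "answer": answer_text,
}

# The character offsets slice the answer back out of the context
print(context[result["start"]:result["end"]])  # prints "2017"
```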

In [5]:
answer = question_answerer(question=question, context=context)
# `question` is a list, so `answer` is a list of result dicts (one per question)
print("Answer:", [a['answer'] for a in answer])

Answer: ['machine translation, text summarization, and question answering', 'a type of machine learning model']



# Fine-Tuning a Pretrained Model for Question Answering

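In this section, we'll demonstrate how to fine-tune a pretrained model for question answering on a custom dataset. The code below starts from the base `distilbert-base-cased` checkpoint, so its question-answering head is newly initialized; the pipeline model used earlier, `distilbert-base-cased-distilled-squad`, has already been fine-tuned on SQuAD.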

Fine-tuning the model on your own data helps it adapt to domain-specific language and questions.


In [None]:
! pip install datasets # if necessary


In [None]:
# Step 1: Load a custom question-answering dataset
# We'll use the SQuAD dataset here as an example. Replace this with your own dataset if needed.
from datasets import load_dataset

dataset = load_dataset('squad', split='train[:1%]')

# Step 2: Load the pretrained model and tokenizer
# The fast tokenizer is needed because preprocessing relies on offset mappings.
from transformers import DistilBertTokenizerFast, DistilBertForQuestionAnswering

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased')

# Step 3: Preprocess the dataset for fine-tuning
def preprocess_data(examples):
    # Tokenize (question, context) pairs; only the context may be truncated
    inputs = tokenizer(
        examples['question'],
        examples['context'],
        truncation='only_second',
        padding='max_length',
        max_length=384,
        return_offsets_mapping=True
    )

    # Character (start, end) offsets of every token, one list per example
    offset_mapping = inputs.pop('offset_mapping')
    start_positions = []
    end_positions = []

    for i, offsets in enumerate(offset_mapping):
        answer = examples['answers'][i]
        start_char = answer['answer_start'][0]
        end_char = start_char + len(answer['text'][0])

        # sequence_ids is 0 for question tokens, 1 for context tokens, None otherwise
        sequence_ids = inputs.sequence_ids(i)
        context_start = sequence_ids.index(1)
        context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

        # If truncation cut the answer out of the context, label the example (0, 0)
        if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
            continue

        # Otherwise find the first and last tokens that cover the answer span
        idx = context_start
        while idx <= context_end and offsets[idx][0] <= start_char:
            idx += 1
        start_positions.append(idx - 1)

        idx = context_end
        while idx >= context_start and offsets[idx][1] >= end_char:
            idx -= 1
        end_positions.append(idx + 1)

    inputs['start_positions'] = start_positions
    inputs['end_positions'] = end_positions

    return inputs



train_data = dataset.map(preprocess_data, batched=True, remove_columns=['id', 'title', 'context', 'question', 'answers'])

# Step 4: Define training arguments and initialize Trainer
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    # No eval_dataset is passed to the Trainer, so evaluation stays disabled (the default)
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    num_train_epochs=1
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data
)

# Step 5: Fine-tune the model
trainer.train()
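The preprocessing step has to turn the dataset's character-level `answer_start` into token indices, because the model is trained on token positions. The toy sketch below shows the idea using whitespace tokens instead of the real subword tokenizer; `char_span_to_token_span` is a hypothetical helper written for this illustration.

```python
def char_span_to_token_span(offsets, start_char, end_char):
    """Map a character span [start_char, end_char) to inclusive token indices."""
    start_token = end_token = None
    for idx, (tok_start, tok_end) in enumerate(offsets):
        if start_token is None and tok_end > start_char:
            start_token = idx  # first token that ends after the answer starts
        if tok_start < end_char:
            end_token = idx    # last token that starts before the answer ends
    return start_token, end_token

text = "Ibn Saud reconquered Riyadh in 1902"

# Whitespace tokens with their (start, end) character offsets
offsets = []
pos = 0
for word in text.split():
    start = text.index(word, pos)
    offsets.append((start, start + len(word)))
    pos = start + len(word)

start_char = text.index("Riyadh")
end_char = start_char + len("Riyadh")
print(char_span_to_token_span(offsets, start_char, end_char))  # prints (3, 3)
```

A real tokenizer splits words into subword pieces and adds special tokens, so the indices differ, but the mapping from character offsets to token positions follows the same principle.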


# Now check the model's performance after fine-tuning
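One common way to check a QA model is to compare its predicted answers with reference answers using exact match (EM) and token-level F1, the two metrics used for SQuAD. The following is a simplified sketch of those metrics. The helper names and sample strings are illustrative, and the normalization (lowercasing, stripping punctuation and articles) is modeled on the SQuAD evaluation script rather than copied from it.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Count tokens shared by prediction and reference (with multiplicity)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("1922.", "1922"))   # 1.0: punctuation is ignored
print(f1_score("in 1922", "1922"))    # partial credit for an overlapping answer
```

In practice, `prediction` would come from the `answer` field of a pipeline result and `reference` from the dataset's `answers['text']` field, averaged over a held-out set of examples.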

# Exercise Questions:

*   Change the context to a different passage (such as one from a news article or a technical document). Does the model still provide accurate answers?
*   Modify the script to allow the user to ask multiple questions about the same context without restarting the program. What changes did you make to achieve this?
*   Experiment with different pretrained QA models from Hugging Face (e.g., bert-large-uncased-whole-word-masking-finetuned-squad). How does the performance change? Which model gives the best results in your experiments?





# 1


In [13]:
context = """
Ibn Saud was the son of Abdul Rahman bin Faisal, Emir of Nejd, and Sara bint Ahmed Al Sudairi. The family were exiled from their residence in the city of Riyadh in 1890.
Ibn Saud reconquered Riyadh in 1902, starting three decades of conquests that made him the ruler of nearly all of central and north Arabia.
He consolidated his control over the Nejd in 1922, then conquered the Hejaz in 1925.
He extended his dominions into what later became the Kingdom of Saudi Arabia in 1932.
Ibn Saud's victory and his support for Islamic revivalists would greatly bolster pan-Islamism across the Islamic world.
Concording with Wahhabi beliefs, he ordered the demolition of several shrines, the Al-Baqi Cemetery and the Jannat al-Mu'alla.
As King, he presided over the discovery of petroleum in Saudi Arabia in 1938 and the beginning of large-scale oil production after World War II.
He fathered many children, including 45 sons, and all of the subsequent kings of Saudi Arabia as of 2024.
"""

question = "When did he take control over Nejd?"

In [14]:
answer = question_answerer(question=question, context=context)
print("Answer:", answer['answer'])

Answer: 1922


# 2

In [16]:
context = """
Transformers are a type of machine learning model introduced in 2017. They use self-attention mechanisms to
process input data. Since their introduction, they have achieved state-of-the-art performance in various natural
language processing tasks like machine translation, text summarization, and question answering. The transformer
architecture led to the creation of models like BERT, GPT, and others.
"""

question = ["What tasks are transformers used for?", "What is a transformer?"]

In [17]:
answer = question_answerer(question=question, context=context)
print("Answer:", [a['answer'] for a in answer])

Answer: ['machine translation, text summarization, and question answering', 'a type of machine learning model']
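Passing a list covers a fixed set of questions. For the interactive variant the exercise describes, one option is a loop that keeps the context fixed and answers questions until a blank one arrives. This is a sketch: `ask_questions` is a hypothetical helper that accepts any callable with the pipeline's `question=`/`context=` keyword interface, so a stub stands in for the real pipeline here.

```python
def ask_questions(answerer, context, questions):
    """Answer each question against the same context; stop at the first blank one."""
    answers = []
    for question in questions:
        question = question.strip()
        if not question:
            break
        result = answerer(question=question, context=context)
        answers.append(result["answer"])
    return answers

# Stub standing in for the real pipeline, used only for illustration
def stub_answerer(question, context):
    return {"answer": context.split()[0]}

print(ask_questions(stub_answerer, "Transformers use self-attention.", ["What uses attention?", ""]))
```

With the real pipeline, pass `question_answerer` as `answerer` and `iter(lambda: input("Question: "), "")` as `questions` to keep reading from the keyboard until an empty line is entered.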


# 3

In [15]:
question_answerer = pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad")



In [18]:
answer = question_answerer(question=question, context=context)
print("Answer:", [a['answer'] for a in answer])

Answer: ['machine translation, text summarization, and question answering', 'a type of machine learning model']
