# Workshop Week 10: Question Answering

#### Please follow the instructions in this notebook and those given by the workshop instructor.

Types of QA systems:

    Extractive QA systems: These systems extract the answer directly from the given text by identifying the relevant section of text that contains the answer.

    Abstractive QA systems: These systems generate a new answer by understanding the meaning of the question and synthesizing information from various sources.

Classical (before deep neural learning) QA systems:

    Information Retrieval based QA systems: These systems use information retrieval techniques to search for relevant documents and retrieve the most relevant answers.

    Knowledge Graph based QA systems: These systems represent information in a structured format and use graph-based algorithms to answer questions.

    Watson QA system: This system, developed by IBM, uses a combination of natural language processing, machine learning, and information retrieval techniques to answer questions in a wide range of domains.
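
To make the retrieval step of an information retrieval based system concrete, here is a minimal sketch that ranks candidate passages against a question with a simple TF-IDF style overlap score. The function names (`tokenize`, `rank_passages`) and the smoothed IDF formula are illustrative assumptions, not part of the workshop code; a real system would typically use a search engine or a library scorer such as BM25.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank_passages(question, passages):
    """Return passage indices sorted from most to least relevant (hypothetical helper)."""
    docs = [Counter(tokenize(p)) for p in passages]
    n_docs = len(docs)
    # Document frequency: the number of passages that contain each term
    df = Counter(term for doc in docs for term in doc)
    scores = []
    for doc in docs:
        score = 0.0
        for term in set(tokenize(question)):
            if term in doc:
                # Smoothed IDF, so terms found in every passage still count a little
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += doc[term] * idf
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

passages = [
    "Russia is the largest country in the world by area.",
    "Paris is the capital and most populous city of France.",
    "George Washington was the first president of the United States.",
]
ranking = rank_passages("What is the capital of France?", passages)
print(passages[ranking[0]])
```

In a full pipeline, the top-ranked passage would then be passed to an answer extraction step such as the reader model used later in this notebook.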

Evaluation of QA and Stanford Question Answering Dataset (SQuAD):

SQuAD is a widely used reading-comprehension benchmark for evaluating QA systems. It consists of more than 100,000 crowd-sourced questions on Wikipedia passages, where each answer is a span of text from the corresponding passage; SQuAD 2.0 adds over 50,000 questions that cannot be answered from the passage. Systems are usually scored with exact match (EM) and token-level F1 against the reference answers.
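
As a rough illustration of how a predicted answer can be compared with reference answers, the sketch below computes an exact-match score after normalizing both strings. The normalization (lowercasing, removing punctuation and English articles, collapsing whitespace) is modelled on the approach commonly used for SQuAD evaluation; the function names are assumptions made for this example.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truths):
    """Return 1 if the normalized prediction equals any normalized reference, else 0."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(gt) for gt in ground_truths))

print(exact_match("Harper Lee.", ["Harper Lee"]))
print(exact_match("the novelist Harper Lee", ["Harper Lee"]))
print(exact_match("Washington, D. C", ["Washington, D.C."]))
```

The last call is worth trying: a space introduced when tokens are decoded ("D. C" versus "D.C.") survives normalization, so the two strings no longer match exactly. This is one reason a softer, overlap-based score is usually reported alongside exact match.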

Language models for QA systems:

    BiDAF (Bidirectional Attention Flow): This model applies attention in both directions (passage-to-question and question-to-passage) to build question-aware representations of the passage, from which it predicts the start and end positions of the answer.

    Encoder-decoder transformers: These models use transformer networks to encode the input text and generate the output answer.

    SpanBERT: This model extends BERT (Bidirectional Encoder Representations from Transformers) by masking contiguous spans of text during pre-training and learning to predict each masked span from its boundary tokens. For extractive QA, it predicts start and end positions, and the answer is the highest-scoring span in the input text.
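
The span-based idea can be sketched without any model: given one start score and one end score per token, search for the pair with the highest combined score, subject to the start not coming after the end. The helper `best_span`, the `max_answer_len` limit, and the toy logits below are assumptions made for illustration. Taking the argmax of the start and end scores independently is simpler, but it can return an end position that precedes the start.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) of the best span with start <= end (end inclusive)."""
    best = (0, 0, float("-inf"))
    for start, s_logit in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = s_logit + end_logits[end]
            if score > best[2]:
                best = (start, end, score)
    return best

# Toy logits for a 6-token input (hypothetical values)
start_logits = [0.1, 0.2, 0.3, 2.0, 4.0, 0.5]
end_logits = [0.0, 3.5, 0.1, 0.2, 3.0, 1.0]

start, end, score = best_span(start_logits, end_logits)
print(start, end, score)
```

With these toy values, the independent argmax would pick start 4 and end 1, which is not a valid span; the joint search keeps the start at 4 and moves the end to the best position that does not precede it.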

In [1]:
# !pip install transformers
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Load a BERT QA model fine-tuned on SQuAD 2.0, together with its tokenizer
tokenizer = AutoTokenizer.from_pretrained('deepset/bert-base-cased-squad2')
model = AutoModelForQuestionAnswering.from_pretrained('deepset/bert-base-cased-squad2')

# Define a sample question and passage
question = "What is the capital of France?"
passage = "France, officially the French Republic, is a country primarily located in Western Europe, consisting of metropolitan France and several overseas regions and territories. Paris is the capital and most populous city of France."

# Encode the question and passage; if the pair is too long, truncate only the passage
inputs = tokenizer(question, passage, return_tensors='pt', max_length=512, truncation='only_second')
input_ids = inputs['input_ids']
token_type_ids = inputs['token_type_ids']
attention_mask = inputs['attention_mask']

# Pass the encoded input through the QA model (no gradients are needed for inference)
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, return_dict=True)
start_logits = outputs.start_logits
end_logits = outputs.end_logits

# Take the most likely start and end token positions (end is made exclusive for slicing)
start_index = torch.argmax(start_logits).item()
end_index = torch.argmax(end_logits).item() + 1

# Slice out the answer tokens and decode them; skip_special_tokens drops any
# [CLS] or [SEP] token that falls inside the predicted span
answer_tokens = input_ids[0][start_index:end_index]
answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
print("Answer:", answer)

Answer: Paris


## Task 1: Construct Q/A system

Task Description: In this task, you will be given a set of questions and a corresponding set of passages. Your goal is to use a QA model to find the answer to each question in its corresponding passage.

Please review the code to understand it, run the first part, and complete the rest to make a QA system.

Then follow the instructions for the workshop Instructor.

Instructions for the Instructor (please take this as a suggestion; you can design your own workshop flow):

    Begin by introducing the participants to the task and the QA model that will be used. Provide a brief overview of how the model works and how it can be used to find answers to questions.

    Divide the participants into small groups, with each group consisting of 2-3 people. Provide each group with a set of questions and a corresponding set of passages.

    Instruct the participants to use the QA model to find the answer to each question in its corresponding passage. They should start by encoding the question and passage using the tokenizer, and then pass the encoded input through the QA model to obtain the predicted answer.

    Once the participants have obtained the predicted answer, they should decode the answer from the corresponding tokens using the tokenizer, and then compare the predicted answer to the actual answer.

    After each group has finished answering all the questions, bring the participants together and review the answers to each question. Discuss any common mistakes or misconceptions that arose during the task, and provide feedback and guidance to help the participants improve their performance.

    To wrap up the task, ask the participants to reflect on what they learned and how they can apply this knowledge in Assignment 1, at work, or in their studies.

Example Questions and Passages:

Question 1: What is the capital of the United States?

Passage 1: The capital of the United States is Washington, D.C. It is located on the east coast of the country, and is home to many important government buildings and monuments.

Question 2: Who wrote the novel "To Kill a Mockingbird"?

Passage 2: "To Kill a Mockingbird" is a novel written by Harper Lee. It was published in 1960 and has since become a classic of American literature.

Question 3: What is the largest country in the world by area?

Passage 3: Russia is the largest country in the world by area. It covers more than 17 million square kilometers and spans 11 time zones.

Question 4: What is the capital of France?

Passage 4: Paris is the capital and most populous city of France. It is located in the north-central part of the country, and is known for its rich history, art, and culture.

Question 5: Who was the first president of the United States?

Passage 5: George Washington was the first president of the United States. He served from 1789 to 1797, and is widely regarded as one of the most important figures in American history.

In [2]:
# Solution for reference

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Load the QA model and tokenizer
model_name = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Define a set of questions and passages
questions = [
    "What is the capital of the United States?",
    "Who wrote the novel \"To Kill a Mockingbird\"?",
    "What is the largest country in the world by area?",
    "What is the capital of France?",
    "Who was the first president of the United States?"
]
passages = [
    "The capital of the United States is Washington, D.C. It is located on the east coast of the country, and is home to many important government buildings and monuments.",
    "\"To Kill a Mockingbird\" is a novel written by Harper Lee. It was published in 1960 and has since become a classic of American literature.",
    "Russia is the largest country in the world by area. It covers more than 17 million square kilometers and spans 11 time zones.",
    "Paris is the capital and most populous city of France. It is located in the north-central part of the country, and is known for its rich history, art, and culture.",
    "George Washington was the first president of the United States. He served from 1789 to 1797, and is widely regarded as one of the most important figures in American history."
]

# Loop over each question and passage, and use the QA model to find the answer
for i, (question, passage) in enumerate(zip(questions, passages)):
    # Encode the question and passage; if the pair is too long, truncate only the passage
    inputs = tokenizer(question, passage, return_tensors='pt', max_length=512, truncation='only_second')
    input_ids = inputs['input_ids']
    attention_mask = inputs['attention_mask']

    # Pass the encoded input through the QA model (no gradients are needed for inference)
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask, return_dict=True)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits

    # Take the most likely start and end token positions (end is made exclusive for slicing)
    start_index = torch.argmax(start_logits).item()
    end_index = torch.argmax(end_logits).item() + 1

    # Decode the answer from the tokens inside the predicted span
    answer_tokens = input_ids[0][start_index:end_index]
    answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)

    # Print the question, passage, and answer
    print("Question {}: {}".format(i+1, question))
    print("Passage {}: {}".format(i+1, passage))
    print("Answer {}: {}\n".format(i+1, answer))

Question 1: What is the capital of the United States?
Passage 1: The capital of the United States is Washington, D.C. It is located on the east coast of the country, and is home to many important government buildings and monuments.
Answer 1: Washington, D. C

Question 2: Who wrote the novel "To Kill a Mockingbird"?
Passage 2: "To Kill a Mockingbird" is a novel written by Harper Lee. It was published in 1960 and has since become a classic of American literature.
Answer 2: Harper Lee

Question 3: What is the largest country in the world by area?
Passage 3: Russia is the largest country in the world by area. It covers more than 17 million square kilometers and spans 11 time zones.
Answer 3: Russia

Question 4: What is the capital of France?
Passage 4: Paris is the capital and most populous city of France. It is located in the north-central part of the country, and is known for its rich history, art, and culture.
Answer 4: Paris

Question 5: Who was the first president of the United States

## Task 2: Use the QA code above

Apply the code to one of the Assignment 1 articles. Write a question and its ground-truth answer, then predict an answer using the code. Evaluate the predicted answer against the ground truth using precision and recall.
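
One possible way to carry out the evaluation step is to treat both answers as bags of tokens and measure their overlap. The sketch below is a minimal version under that assumption: the function name `precision_recall_f1`, the whitespace tokenization, and the example strings are all illustrative, and you may prefer to normalize punctuation before splitting.

```python
from collections import Counter

def precision_recall_f1(prediction, ground_truth):
    """Token-overlap precision, recall, and F1 between two answer strings."""
    pred_tokens = prediction.lower().split()
    true_tokens = ground_truth.lower().split()
    # Multiset intersection: each shared token counts at most as often as it occurs in both
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)  # share of predicted tokens that are correct
    recall = overlap / len(true_tokens)     # share of reference tokens that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical prediction and ground truth, for illustration only
p, r, f1 = precision_recall_f1("George Washington", "President George Washington")
print("precision={:.2f} recall={:.2f} f1={:.2f}".format(p, r, f1))
```

Here every predicted token appears in the ground truth, so precision is perfect, while recall is lower because the prediction misses one reference token.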