![NVIDIA Logo](images/nvidia.png)

# Question Answering

In this notebook you will begin work on an extractive question answering task using the [Stanford Question Answering Dataset](https://rajpurkar.github.io/SQuAD-explorer/) (SQuAD).

---

## Learning Objectives

By the time you complete this notebook you will:
- Be familiar with the SQuAD question answering dataset.
- Observe zero-shot performance for extractive question answering using GPT43B and GPT8B.

---

## Imports

In [None]:
import json
import random

from llm_utils.nemo_service_models import NemoServiceBaseModel
from llm_utils.models import Models

---

## List Models

In [None]:
Models.list_models()

---

## SQuAD

For the question answering task, we will be working with the Stanford Question Answering Dataset (SQuAD). From the SQuAD documentation:

> SQuAD is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

The dataset contains over 100,000 questions, each paired with either its answer or an indication that the question is unanswerable from the provided textual context.

In [None]:
with open('data/squad.json', 'r') as f:
    squad_data = json.load(f)

---

## Explore SQuAD

The dataset comes as a dictionary with only 2 keys.

In [None]:
squad_data.keys()

We are interested only in `data`, which contains 442 different topics, each with many textual contexts and a set of questions and answers based on each context.
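
For orientation, the nesting described above can be sketched with a tiny hand-written sample. The field names follow the SQuAD v2.0 JSON layout as we understand it; the title, context, and questions are invented for illustration and are not taken from the dataset.

```python
# A toy, hand-written sample that mimics the SQuAD v2.0 layout.
# All text values here are invented for illustration.
toy_squad = {
    "version": "v2.0",
    "data": [
        {
            "title": "Toy_Topic",
            "paragraphs": [
                {
                    "context": "The sky is blue. Grass is green.",
                    "qas": [
                        {
                            "id": "toy-1",
                            "question": "What color is the sky?",
                            "answers": [{"text": "blue", "answer_start": 11}],
                            "is_impossible": False,
                        },
                        {
                            "id": "toy-2",
                            "question": "What color is the sea?",
                            "answers": [],
                            "is_impossible": True,
                        },
                    ],
                }
            ],
        }
    ],
}

# Walk the hierarchy: topic -> paragraphs -> question/answer pairs.
for topic in toy_squad["data"]:
    for para in topic["paragraphs"]:
        for qa in para["qas"]:
            print(topic["title"], "|", qa["question"], "|", qa["is_impossible"])
```

The cells that follow walk the real dataset through the same three levels.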

In [None]:
data = squad_data['data']

In [None]:
len(data)

In [None]:
for d in data[:10]:
    print(f"Topic: {d['title']}")

---

## Explore Beyoncé Topic

Let's take a look at the first topic in the dataset, which is about the pop singer Beyoncé.

In [None]:
beyonce = data[0]

In [None]:
beyonce.keys()

Each topic contains a collection of context paragraphs that serve as the basis for the question answering task.

In [None]:
paragraphs = beyonce['paragraphs']

In [None]:
type(paragraphs)

In the case of the Beyoncé topic we can see that there are 66 context paragraphs, each with its own set of questions and answers.

In [None]:
len(paragraphs)

### Context, Questions and Answers

Let's look at the first contextual paragraph and its questions and answers.

In [None]:
paragraph = paragraphs[0]

In [None]:
paragraph.keys()

In [None]:
paragraph['context']

This particular contextual paragraph has 15 question/answer pairs associated with it.

In [None]:
qas = paragraph['qas']

In [None]:
len(qas)

Here's the structure of a single question/answer pair:

In [None]:
qas[0]

Let's take a look at a few of the questions and their answers, and confirm that the answers are derived from text in the provided context paragraph.

In [None]:
for qa in qas[:5]:
    question = qa['question']
    answer = qa['answers'][0]['text']
    print(f'Question: {question}')
    print(f'Answer: {answer}')
    print(f"Answer in paragraph: {answer in paragraph['context']}\n")  # See `paragraph['context']` above.
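
A stricter check than substring membership is to confirm that the answer sits at its recorded character offset. The sketch below assumes each answer carries an `answer_start` field, as in the SQuAD JSON layout; the helper name `answer_matches_offset` and the sample text are hypothetical.

```python
def answer_matches_offset(context, answer):
    """Return True if the answer text sits at its recorded character offset."""
    start = answer["answer_start"]
    end = start + len(answer["text"])
    return context[start:end] == answer["text"]

# Invented sample data for illustration.
context = "The sky is blue. Grass is green."
good = {"text": "blue", "answer_start": 11}
bad = {"text": "green", "answer_start": 0}

print(answer_matches_offset(context, good))
print(answer_matches_offset(context, bad))
```

On the real data this could be applied as `answer_matches_offset(paragraph['context'], qa['answers'][0])` for each answerable question.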

---

## Process SQuAD Data Into Context, Question, Answer Data

Ultimately, we are going to use SQuAD data to fine-tune a model on a question answering task. To that end, it will be helpful to simplify the structure of the SQuAD data and create a list where each item contains a context, a question, and an answer.

Given what we now know about the structure of the SQuAD data, we can run the following cell to do just that.

Note that SQuAD contains some questions that are intentionally impossible to answer based on the provided context. We will ignore these questions and use only those that have a clear answer.

Also remember that SQuAD contains over 100,000 questions and answers. We know that for PEFT we can typically do well with roughly 1,000 samples. With that in mind, and to keep our dataset diverse, we will take only the first context paragraph, with its questions and answers, for each topic.

In [None]:
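contexts_questions_answers = []
for topic in data:
    # Use only the first context paragraph of each topic.
    first_paragraph = topic['paragraphs'][0]
    context = first_paragraph['context']
    for qa in first_paragraph['qas']:
        # Skip questions that cannot be answered from the context.
        if qa['is_impossible']:
            continue
        question = qa['question']
        # Keep the first listed answer.
        answer = qa['answers'][0]['text']
        contexts_questions_answers.append({'context': context, 'question': question, 'answer': answer})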

This leaves us with over 2000 context, question, answer items.

In [None]:
len(contexts_questions_answers)

In [None]:
contexts_questions_answers[:2]
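
To see how many questions the `is_impossible` filter drops, a small counter along these lines may help. This is a sketch run against an invented toy sample; the helper name `count_by_answerability` is hypothetical, and only the first paragraph of each topic is inspected, mirroring the processing above.

```python
from collections import Counter

def count_by_answerability(topics):
    """Count answerable vs. impossible questions in each topic's first paragraph."""
    counts = Counter()
    for topic in topics:
        first_paragraph = topic["paragraphs"][0]
        for qa in first_paragraph["qas"]:
            key = "impossible" if qa["is_impossible"] else "answerable"
            counts[key] += 1
    return counts

# Invented toy topics that only carry the fields the counter needs.
toy_topics = [
    {"paragraphs": [{"qas": [{"is_impossible": False}, {"is_impossible": True}]}]},
    {"paragraphs": [{"qas": [{"is_impossible": False}]}]},
]

counts = count_by_answerability(toy_topics)
print(counts["answerable"], counts["impossible"])
```

Calling `count_by_answerability(data)` on the real dataset would report the split for the paragraphs we kept.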

---

## Shuffle Data

Even though we only took the first context paragraph for each topic in the dataset, we still have many questions for each of those context paragraphs. With that in mind, let's shuffle the data.

We set a random seed here for reproducibility.

In [None]:
random.seed(1)

In [None]:
random.shuffle(contexts_questions_answers)
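
To illustrate what the seed buys us, the sketch below shuffles the same list twice with the same seed and compares the results. It uses a private `random.Random` instance from the standard library so that the notebook's global random state is left untouched; the helper name `seeded_shuffle` is hypothetical.

```python
import random

def seeded_shuffle(items, seed):
    """Return a shuffled copy of items using a private, seeded generator."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled

items = list(range(10))
first = seeded_shuffle(items, seed=1)
second = seeded_shuffle(items, seed=1)

print(first == second)         # same seed, same order
print(sorted(first) == items)  # shuffling keeps the same elements
```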

In [None]:
for cqa in contexts_questions_answers[:5]:
    print(cqa['context']+'\n')
    print(cqa['question'])
    print(cqa['answer']+'\n-----\n')

---

## Question Answering Prompt Template

We will continue the practice of denoting our LLM tasks with a prompt template function. In the case of extractive question answering, we will use the following, which constructs a prompt given a provided `text` context and the `question` we would like answered from the provided `text`.

In [None]:
def extract_template(text, question):
    return f'{text}\n{question} answer: '
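
To make the resulting format concrete, here is the template applied to an invented one-sentence context. The function is repeated only so the snippet runs on its own.

```python
def extract_template(text, question):
    return f'{text}\n{question} answer: '

# Invented context and question for illustration.
prompt = extract_template('The sky is blue.', 'What color is the sky?')
print(repr(prompt))
# → 'The sky is blue.\nWhat color is the sky? answer: '
```

Note the trailing space after `answer:`, which is part of the prompt the model receives.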

---

## Create Prompts with Labels

Now we can combine our `contexts_questions_answers` with the `extract_template` to create a list of prompts and their labels, which we will be able to leverage when working with our LLMs.

In [None]:
prompts_and_answers = []
for cqa in contexts_questions_answers:
    context, question, answer = cqa['context'], cqa['question'], cqa['answer']
    prompt = extract_template(context, question)
    prompts_and_answers.append((prompt, answer))

In [None]:
len(prompts_and_answers)

In [None]:
for prompt, answer in prompts_and_answers[0:3]:
    print(prompt+'\n')
    print(answer+'\n---\n')

---

## Try Zero-shot Prompting with GPT43B

Let's see how GPT43B performs on this extractive question answering task with straightforward zero-shot prompting. First we'll create an instance of our model.

In [None]:
gpt43b = NemoServiceBaseModel(Models.gpt43b.value)

Next we'll try it out on the first several prompts in `prompts_and_answers`.

In [None]:
for prompt, answer in prompts_and_answers[:5]:
    response = gpt43b.generate(prompt).strip()
    print(f'Response: {response}')
    print(f'Answer: {answer}\n')

### Analysis

At a glance, it looks like GPT43B is well suited for this task.
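
To move beyond an at-a-glance impression, responses can be scored against the labels. The sketch below is modelled on the normalization commonly used for SQuAD evaluation (lowercasing, then removing punctuation and English articles) and computes exact match and token-level F1. It is not the official evaluation script, and the function names and sample strings are our own.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    return ' '.join(text.split())

def exact_match(response, answer):
    """True if response and answer are identical after normalization."""
    return normalize(response) == normalize(answer)

def token_f1(response, answer):
    """Harmonic mean of token precision and recall after normalization."""
    pred = normalize(response).split()
    gold = normalize(answer).split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Invented strings for illustration.
print(exact_match('The Eiffel Tower.', 'eiffel tower'))
print(token_f1('in Paris, France', 'Paris'))
```

Averaging either score over `(response, answer)` pairs would give a single number for comparing models.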

---

## Try Zero-shot Prompting with GPT8B

Now let's see how the much smaller GPT8B does.

In [None]:
gpt8b = NemoServiceBaseModel(Models.gpt8b.value)

In [None]:
for prompt, answer in prompts_and_answers[:5]:
    response = gpt8b.generate(prompt).strip()
    print(f'Response: {response}')
    print(f'Answer: {answer}\n')

GPT8B seems to be going on and on. Let's try again, this time indicating that we would like the model to stop generating after a newline.

In [None]:
for prompt, answer in prompts_and_answers[:5]:
    response = gpt8b.generate(prompt, stop=['\n']).strip()
    print(f'Response: {response}')
    print(f'Answer: {answer}\n')
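
As a complementary, client-side option, a response can also be trimmed after it comes back. The helper below is a hypothetical sketch, not part of `llm_utils`: it cuts a string at the earliest occurrence of any stop sequence.

```python
def truncate_at_stop(text, stops=('\n',)):
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# Invented raw generation for illustration.
raw = ' Paris\nWhat is the capital of Spain? answer: Madrid'
trimmed = truncate_at_stop(raw.strip())
print(trimmed)
```

Stripping before truncating matters here: a response that begins with a newline would otherwise be cut down to an empty string.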

### Analysis

GPT8B continues to generate much more than we would like, and it often repeats itself. It does not appear to be providing an answer extracted from the provided context, and it is sometimes wrong (see "Boris Yeltsin").
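
Whether a response is actually extracted from the context can be checked mechanically. The sketch below uses a hypothetical helper, `is_extractive`, that does a case-insensitive substring test; it is a rough heuristic and says nothing about whether the extracted span is the correct one.

```python
def is_extractive(response, context):
    """True if the non-empty response appears verbatim (ignoring case) in the context."""
    candidate = response.strip().lower()
    return bool(candidate) and candidate in context.lower()

# Invented context and responses for illustration.
context = 'The sky is blue. Grass is green.'
print(is_extractive('blue', context))
print(is_extractive('The sky is usually a deep blue color', context))
```

Because each prompt begins with its context, passing the prompt as the second argument works as an approximation when the original context is not at hand.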

---

## Write Prompts and Answers to File

In the next section we will turn our attention to fine-tuning GPT8B on this task, and it will be helpful to reuse the `prompts_and_answers` list that we created here. Let's write it to a file so we can easily load it into the next notebook.

In [None]:
with open('data/squad_prompts_and_answers.json', 'w') as f:
    json.dump(prompts_and_answers, f)
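
One detail to keep in mind when reloading: JSON has no tuple type, so each `(prompt, answer)` tuple comes back as a two-element list. The sketch below demonstrates the round trip with an invented pair and a temporary file, so it does not touch the file written above.

```python
import json
import os
import tempfile

# Invented pair for illustration.
pairs = [('Some context\nSome question? answer: ', 'Some answer')]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'pairs.json')
    with open(path, 'w') as f:
        json.dump(pairs, f)
    with open(path, 'r') as f:
        loaded = json.load(f)

print(type(loaded[0]).__name__)  # JSON arrays load as Python lists

# Convert back to tuples if the downstream code expects them.
restored = [tuple(item) for item in loaded]
print(restored == pairs)
```

Unpacking in a loop such as `for prompt, answer in loaded:` works either way, so the conversion is only needed if tuples are specifically required.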