# Question-Answering
Question answering is becoming increasingly popular: many people prefer to get answers interactively rather than by reading long documents, and they can find information more quickly this way.

We will use the SubjQA dataset, which contains over 10,000 customer reviews in English across six domains.

**Note:** If you get a `NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.` when running the cell `print(subjqa["train"]["answers"][1])`: uncomment the following two `!pip` commands, restart the kernel, and try again.

In [None]:
# !pip install -U typing-extensions
# !pip install fsspec==2023.9.2

## Preparations

In [None]:
from datasets import get_dataset_config_names

The `subjqa` dataset has reviews from different domains:

In [None]:
domains = get_dataset_config_names("subjqa")
domains

We will use the reviews on electronics:

In [None]:
from datasets import load_dataset

subjqa = load_dataset("subjqa", name="electronics")

Let's first look at some examples:

In [None]:
print(subjqa["train"]["answers"][1])

We reorganize the data into a structure that is easier to handle:

In [None]:
import pandas as pd

dfs = {split: dset.to_pandas() for split, dset in subjqa.flatten().items()}

Next, we get an overview of how many data points we have:

In [None]:
for split, df in dfs.items():
    print(f"Number of questions in {split}: {df['id'].nunique()}")

Here are some of the most relevant attributes:

In [None]:
qa_cols = ["title", "question", "answers.text",
           "answers.answer_start", "context"]
sample_df = dfs["train"][qa_cols].sample(2, random_state=7)
sample_df

In [None]:
sample_df.loc[791, 'context']

In [None]:
sample_df.loc[1159, 'context']

As mentioned in the slides, we see that some of the answers are not grammatically correct sentences, and some questions cannot be answered (such as "how is the battery?").
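
To get a feeling for how common unanswerable questions are, we can count the rows whose answer list is empty. The following is a minimal sketch on a hypothetical toy frame (`toy_df` and its rows are made up for illustration); it assumes that an unanswerable question is stored with an empty `answers.text` entry, which should be verified on the real data before relying on it.

```python
import pandas as pd

# Toy stand-in for the flattened SubjQA frame (hypothetical rows, not real data).
toy_df = pd.DataFrame({
    "question": ["How is the battery?", "How is the sound?"],
    "answers.text": [[], ["sound is clear"]],
    "answers.answer_start": [[], [12]],
})

# Assumption: a question counts as unanswerable when its answer list is empty.
toy_df["is_unanswerable"] = toy_df["answers.text"].apply(lambda a: len(a) == 0)
unanswerable_share = toy_df["is_unanswerable"].mean()
print(f"Share of unanswerable questions: {unanswerable_share:.0%}")
```

The same two lines could be applied to `dfs["train"]`, provided the assumption about empty answer lists holds there.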

## Span Classification
A popular approach is to frame the task as finding the start and end tokens of the span in the context that contains the answer.

Many models for this task are available, e.g., on the Hugging Face Hub. We choose a small one by deepset, a German AI start-up:

In [None]:
from transformers import AutoTokenizer

model_ckpt = "deepset/minilm-uncased-squad2"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [None]:
question = "How much music can this hold?"
context = """An MP3 is about 1 MB/minute, so about 6000 hours depending on file size."""
inputs = tokenizer(question, context, return_tensors="pt")

As before, we can inspect the tokens that the tokenizer has generated, and with `tokenizer.decode` we see how our question and context are represented:

In [None]:
inputs

In [None]:
print(tokenizer.decode(inputs["input_ids"][0]))
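
The decoded string shows that question and context are packed into one sequence, separated by special tokens. The sketch below mimics this layout in plain Python for a BERT-style tokenizer. It is only an illustration: `build_pair_input` is a hypothetical helper, and the whitespace split stands in for the real subword tokenization.

```python
def build_pair_input(question_tokens, context_tokens):
    """Mimic the BERT-style layout: [CLS] question [SEP] context [SEP]."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the question and the first [SEP];
    # segment 1 covers the context and the final [SEP].
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    return tokens, token_type_ids

# Whitespace splitting is a simplification of the real tokenizer.
toy_tokens, toy_type_ids = build_pair_input(
    "how much music can this hold ?".split(),
    "about 6000 hours".split(),
)
for token, segment in zip(toy_tokens, toy_type_ids):
    print(f"{segment}  {token}")
```

If the real tokenizer returns `token_type_ids`, they can be compared with this layout to see which positions belong to the context.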

After tokenization, we can use the model to predict the start and end token:

In [None]:
import torch
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(model_ckpt)

with torch.no_grad():
    outputs = model(**inputs)
print(outputs)

In [None]:
start_logits = outputs.start_logits
end_logits = outputs.end_logits

In [None]:
import numpy as np
import matplotlib.pyplot as plt

s_scores = start_logits.detach().numpy().flatten()
e_scores = end_logits.detach().numpy().flatten()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
token_ids = range(len(tokens))

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True)
colors = ["C0" if s != np.max(s_scores) else "C1" for s in s_scores]
ax1.bar(x=token_ids, height=s_scores, color=colors)
ax1.set_ylabel("Start Scores")
colors = ["C0" if s != np.max(e_scores) else "C1" for s in e_scores]
ax2.bar(x=token_ids, height=e_scores, color=colors)
ax2.set_ylabel("End Scores")
plt.xticks(token_ids, tokens, rotation="vertical")
plt.show()
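
The bars show raw logits, which are not probabilities. If we want a probability distribution over token positions, we can apply a softmax. Here is a minimal sketch with made-up logits (`toy_start_logits` is hypothetical); with the real outputs, `torch.softmax(start_logits, dim=-1)` should give the same kind of result.

```python
import numpy as np

def softmax(logits):
    """Turn a 1-D array of logits into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Made-up logits for four token positions (not model output).
toy_start_logits = np.array([-1.0, 0.5, 4.0, 0.2])
start_probs = softmax(toy_start_logits)
best_start = int(np.argmax(start_probs))
print(best_start, start_probs.round(3))
```

The softmax does not change which position has the highest score, so the `argmax` is the same for logits and probabilities.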

Putting the components together, we can now get the result:

In [None]:
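# Pick the most likely start and end tokens independently of each other
start_idx = torch.argmax(start_logits)
# Add 1 because the end of a slice is exclusive
end_idx = torch.argmax(end_logits) + 1
# Cut the answer tokens out of the input and turn them back into text
answer_span = inputs["input_ids"][0][start_idx:end_idx]
answer = tokenizer.decode(answer_span)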
print(f"Question: {question}")
print(f"Answer: {answer}")
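
Taking the two `argmax` values independently works for this example, but nothing guarantees that the predicted end comes after the predicted start. A common remedy is to score all pairs with start before or at end and keep the best one. The sketch below does this on made-up scores; `best_span` and `max_answer_len` are hypothetical names, and the function does not restrict the search to context tokens, which a complete implementation would probably do.

```python
import numpy as np

def best_span(start_scores, end_scores, max_answer_len=15):
    """Return the (start, end) pair with the highest summed score, end inclusive.

    Only pairs with start <= end and at most max_answer_len tokens are considered.
    """
    best_score, best_pair = -np.inf, (0, 0)
    for start in range(len(start_scores)):
        last_end = min(start + max_answer_len, len(end_scores))
        for end in range(start, last_end):
            score = start_scores[start] + end_scores[end]
            if score > best_score:
                best_score, best_pair = score, (start, end)
    return best_pair

# Made-up scores where the independent argmax gives end (0) before start (2).
toy_start = np.array([0.1, 0.2, 3.0, 0.3, 0.1])
toy_end = np.array([2.5, 0.1, 0.4, 2.0, 0.2])
start, end = best_span(toy_start, toy_end)
print(start, end)
```

With the real model output, the flattened `s_scores` and `e_scores` arrays could be passed in, and the answer would be decoded from `inputs["input_ids"][0][start:end + 1]`.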