<a href="https://colab.research.google.com/github/saakolch/question-answering_model/blob/main/Q-A_process.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install datasets evaluate transformers sentencepiece

Let's tokenize the question together with the context and run the model on the result.

In [22]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

checkpoints = 'distilbert-base-cased-distilled-squad'
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoints)


content = '''


The edges are decorated with octahedrons, painted white and black. In this case, any two borders are an ordinary edge, decorated with different colors.
Point out that for any point inside the octahedron, the average distances to the planes of the white faces are equal to the sum of the distances to the planes of the black faces.

'''

question = 'What is the color of octahedrons?'

inputs = tokenizer(question, content, return_tensors='pt')
outputs = model(**inputs)

In [23]:
start_logits = outputs.start_logits
end_logits = outputs.end_logits
start_logits.shape, end_logits.shape

(torch.Size([1, 87]), torch.Size([1, 87]))

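Before applying softmax, we have to mask every token that is not part of the context (the question tokens and [SEP]) by setting its logit to a large negative number, so that it gets practically zero probability. The [CLS] token is left unmasked.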

In [24]:
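import torch

# sequence_ids() is None for special tokens, 0 for the question, 1 for the context
sequence_ids = inputs.sequence_ids()
# True marks every position that is not a context token
mask = [i != 1 for i in sequence_ids]
# Keep the [CLS] token unmasked
mask[0] = False
# Add a batch dimension so the mask matches the logits' shape
mask = torch.tensor(mask)[None]

start_logits[mask] = -10000
end_logits[mask] = -10000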

In [25]:
start_prob = torch.nn.functional.softmax(start_logits, dim=-1)[0]
end_prob = torch.nn.functional.softmax(end_logits, dim=-1)[0]
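
To see why a large negative logit is enough, here is a minimal sketch of the masking step on toy data. The logits and the mask are made up for illustration, and `masked_fill` is used instead of in-place assignment only to keep the toy tensor untouched.

```python
import torch

# Toy logits for five tokens (made-up values, not model output)
logits = torch.tensor([[2.0, 1.0, 0.5, 3.0, 1.5]])
# True marks positions to exclude; which ones are masked here is arbitrary
mask = torch.tensor([[False, True, False, False, True]])

# Replace masked logits with a large negative number, then normalize
masked_logits = logits.masked_fill(mask, -10000.0)
probs = torch.nn.functional.softmax(masked_logits, dim=-1)[0]

print(probs)
```

The masked positions end up with zero probability, and the remaining probabilities still sum to one.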

Extractive question answering comes down to predicting where the answer starts and where it ends in the context. If we assume that the events "the answer starts at token i" and "the answer ends at token j" are independent, the probability of the span (i, j) is the product of the start probability of i and the end probability of j. So we multiply every start probability by every end probability and take the argmax over all pairs.

In [26]:
scores = start_prob[:, None] * end_prob[None,:]
scores.shape

torch.Size([87, 87])

A span whose end comes before its start is not a valid answer, so we zero out the entries below the diagonal with torch.triu.

In [27]:
scores = torch.triu(scores)
scores.shape

torch.Size([87, 87])

In [28]:
max_index = scores.argmax().item()
start_index = max_index // scores.shape[1]
end_index = max_index % scores.shape[1]
scores[start_index, end_index]

tensor(0.4925, grad_fn=<SelectBackward0>)
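
The same span selection can be followed by hand on a three-token example. This is a sketch with made-up probabilities, not values from the model. `divmod` does the same job as the `//` and `%` pair used above.

```python
import torch

# Made-up start and end probabilities for three tokens
start_prob = torch.tensor([0.1, 0.6, 0.3])
end_prob = torch.tensor([0.5, 0.2, 0.3])

# scores[i, j] = P(start = i) * P(end = j)
scores = start_prob[:, None] * end_prob[None, :]
# Drop spans that end before they start (entries below the diagonal)
scores = torch.triu(scores)

# argmax() returns an index into the flattened matrix; recover (row, column)
flat_index = scores.argmax().item()
start_index, end_index = divmod(flat_index, scores.shape[1])

print(start_index, end_index)  # selects the span (1, 2)
```

Without `torch.triu` the largest product would be 0.6 * 0.5 at position (1, 0), a span that ends before it starts. After masking the lower triangle, the best valid span is (1, 2).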

In [29]:
inputs_with_offsets = tokenizer(question, content, return_offsets_mapping=True)
offsets = inputs_with_offsets["offset_mapping"]

start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = content[start_char:end_char]
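
The offset lookup can be illustrated without a tokenizer. In the sketch below the text and the `(start_char, end_char)` pairs are written by hand and are hypothetical. A real tokenizer may split the text differently.

```python
text = "painted white and black."

# Hypothetical offsets, one (start_char, end_char) pair per token:
# "painted", "white", "and", "black", "."
offsets = [(0, 7), (8, 13), (14, 17), (18, 23), (23, 24)]

# Suppose the best span runs from token 1 to token 3
start_index, end_index = 1, 3

# Take the first character of the start token and the last of the end token
start_char, _ = offsets[start_index]
_, end_char = offsets[end_index]
answer = text[start_char:end_char]

print(answer)  # white and black
```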

In [30]:
result = {
    "answer": answer,
    "start": start_char,
    "end": end_char,
    "score": scores[start_index, end_index],
}
print(result)

{'answer': 'white and black', 'start': 53, 'end': 68, 'score': tensor(0.4925, grad_fn=<SelectBackward0>)}
