<a href="https://colab.research.google.com/github/summer1278/NLP-student-homework/blob/main/QA_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Let's install the transformers package first.

In [None]:
! pip install transformers

Here's an example of extractive Question Answering (QA), adapted from the transformers documentation (https://huggingface.co/transformers/usage.html#extractive-question-answering).

The steps for this task, using a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model, are:
- Instantiate a tokenizer and a model from the checkpoint name. The checkpoint is identified as a BERT model (https://arxiv.org/pdf/1810.04805.pdf), and the model is loaded with the weights stored in the checkpoint.

- Define a text and a few questions.

- Iterate over the questions and build a sequence from the text and the current question, with the correct model-specific separators, token type ids, and attention masks.

- Pass this sequence through the model. This outputs a score for every token in the sequence (question and text), for both the start and end positions.

- Compute the softmax of the scores to get probabilities over the tokens. (The code in the next cell skips this step and takes the argmax of the raw scores, which selects the same tokens.)

- Fetch the tokens between the identified start and end positions and convert them to a string.

- Print the results
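
The last few steps (softmax, picking the start and end positions, converting tokens to a string) can be sketched without any model. This is only an illustration under stated assumptions: the tokens and scores are invented, not real model output, and `join_wordpieces` is a hypothetical, simplified stand-in for the tokenizer's `convert_tokens_to_string`.

```python
import math

def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def join_wordpieces(pieces):
    """Hypothetical helper: glue '##' continuation pieces onto the previous token."""
    text = ""
    for piece in pieces:
        if piece.startswith("##"):
            text += piece[2:]
        else:
            text += (" " if text else "") + piece
    return text

# Invented tokens and scores, for illustration only
tokens = ["[CLS]", "who", "wrote", "it", "?", "[SEP]",
          "ada", "love", "##lace", "wrote", "it", "[SEP]"]
start_logits = [0.1, -1.0, -1.2, -0.9, -1.5, -2.0, 4.0, 1.0, 0.2, -0.5, -0.8, -2.0]
end_logits = [0.0, -1.1, -1.0, -1.3, -1.4, -2.1, 0.5, 1.2, 4.5, 0.1, -0.3, -1.9]

start_probs = softmax(start_logits)
end_probs = softmax(end_logits)

# Most likely start token, and most likely end token (+1 so the slice includes it)
answer_start = max(range(len(tokens)), key=lambda i: start_probs[i])
answer_end = max(range(len(tokens)), key=lambda i: end_probs[i]) + 1

answer = join_wordpieces(tokens[answer_start:answer_end])
print(answer)
```

Because softmax preserves the ordering of the scores, taking the argmax of the probabilities picks the same positions as taking the argmax of the raw scores.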


In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# load pre-trained BERT model
tokenizer = AutoTokenizer.from_pretrained("deepset/bert-base-cased-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("deepset/bert-base-cased-squad2")

# the text we want to extract answers from
# (swap in other passages to test more examples)
text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

# questions you would like to ask
questions = [
    "How many pretrained models are available in Transformers?",
    "What does Transformers provide?",
    "Transformers provides interoperability between which frameworks?",
]


for question in questions:
    inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    # return_dict=False differs from the original tutorial: it makes the model
    # return a plain tuple, so the two score tensors can be unpacked directly
    answer_start_scores, answer_end_scores = model(**inputs, return_dict=False)

    answer_start = torch.argmax(
        answer_start_scores
    )  # get the most likely beginning of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1  # get the most likely end of answer with the argmax of the score

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

    print(f"Question: {question}")
    print(f"Answer: {answer}\n")

Question: How many pretrained models are available in Transformers?
Answer: over 32 +

Question: What does Transformers provide?
Answer: general - purpose architectures

Question: Transformers provides interoperability between which frameworks?
Answer: TensorFlow 2. 0 and PyTorch
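
One caveat of the loop above is that the start and end positions are chosen independently, so the end can land before the start and produce an empty answer. A possible remedy, sketched here under stated assumptions, is to score (start, end) pairs jointly and keep only valid ones. The function `best_span`, its `max_answer_len` parameter, and the scores are hypothetical and invented for illustration; they are not part of the transformers API.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return the (start, end) pair, end inclusive, with the highest summed
    score, considering only spans where end >= start and the span length
    does not exceed max_answer_len tokens."""
    best_pair = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_scores):
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best_score:
                best_score = score
                best_pair = (start, end)
    return best_pair[0], best_pair[1], best_score

# Invented scores where the independent argmax picks an end before the start
start_scores = [0.2, 0.1, 3.0, 0.4, 2.5]
end_scores = [0.1, 3.5, 0.3, 2.8, 0.2]

naive_start = max(range(len(start_scores)), key=lambda i: start_scores[i])
naive_end = max(range(len(end_scores)), key=lambda i: end_scores[i])

span_start, span_end, span_score = best_span(start_scores, end_scores)
print(f"independent argmax: start={naive_start}, end={naive_end}")
print(f"joint search: start={span_start}, end={span_end}")
```

With these scores the independent argmax gives a start of 2 and an end of 1, which is not a usable span, while the joint search settles on tokens 2 through 3.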



In [None]:
# we can also use the pre-implemented pipeline to simplify the process
from transformers import pipeline

nlp = pipeline("question-answering", model=model, tokenizer=tokenizer)
for question in questions:
    answer = nlp(question=question, context=text)
    print(f"Question: {question}")
    print(f"Answer: {answer['answer']}")
    print(f"Score: {answer['score']}\n")

Question: How many pretrained models are available in Transformers?
Answer: over 32+
Score: 0.32048341631889343

Question: What does Transformers provide?
Answer: general-purpose
architectures
Score: 0.8302127122879028

Question: Transformers provides interoperability between which frameworks?
Answer: TensorFlow 2.0 and PyTorch
Score: 0.8364729285240173

