In [2]:
! pip install transformers datasets evaluate accelerate

Collecting datasets
  Downloading datasets-2.19.1-py3-none-any.whl (542 kB)
Collecting evaluate
  Downloading evaluate-0.4.2-py3-none-any.whl (84 kB)
Collecting accelerate
  Downloading accelerate-0.30.1-py3-none-any.whl (302 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)

# Question answering

## Load SQuAD dataset

In [1]:
from datasets import load_dataset

squad = load_dataset("squad", split="train[:5000]")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading data:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

In [2]:
squad

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 5000
})

In [15]:
# squad['test'] = squad.pop('validation')

Split the dataset's `train` split into a train and test set with the [train_test_split](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.train_test_split) method:

In [4]:
squad = squad.train_test_split(test_size=0.2)

In [5]:
squad

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 4000
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 1000
    })
})

In [6]:
squad['train'][2]

{'id': '56d25ca259d6e41400145ef2',
 'title': 'To_Kill_a_Mockingbird',
 'context': 'Allusions to legal issues in To Kill a Mockingbird, particularly in scenes outside of the courtroom, has drawn the attention from legal scholars. Claudia Durst Johnson writes that "a greater volume of critical readings has been amassed by two legal scholars in law journals than by all the literary scholars in literary journals". The opening quote by the 19th-century essayist Charles Lamb reads: "Lawyers, I suppose, were children once." Johnson notes that even in Scout and Jem\'s childhood world, compromises and treaties are struck with each other by spitting on one\'s palm and laws are discussed by Atticus and his children: is it right that Bob Ewell hunts and traps out of season? Many social codes are broken by people in symbolic courtrooms: Mr. Dolphus Raymond has been exiled by society for taking a black woman as his common-law wife and having interracial children; Mayella Ewell is beaten by her fathe

In [7]:
squad["train"][2]['answers']

{'text': ['frilly clothes'], 'answer_start': [1185]}

There are several important fields here:

- `answers`: the answer text and the character offset in the `context` where the answer starts.
- `context`: background information from which the model needs to extract the answer.
- `question`: the question a model should answer.
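
A quick way to check how `answer_start` is interpreted is to slice the `context` with it. The record below is a made-up toy example that follows the same schema (it is not taken from SQuAD), and `extract_answer` is a hypothetical helper written for this illustration:

```python
# Toy record in the SQuAD schema (hypothetical data, for illustration only).
context = "Scout wears overalls, but her aunt wants her to wear frilly clothes."
answer_text = "frilly clothes"

example = {
    "context": context,
    "question": "What does her aunt want her to wear?",
    # answer_start is a character offset into the context, not a token index.
    "answers": {"text": [answer_text], "answer_start": [context.index(answer_text)]},
}


def extract_answer(record):
    """Slice the context using the character offset stored in `answers`."""
    start = record["answers"]["answer_start"][0]
    text = record["answers"]["text"][0]
    return record["context"][start : start + len(text)]


recovered = extract_answer(example)
print(recovered)
```

If the slice reproduces the answer text, the offset and the text agree, which is the property the preprocessing step relies on.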

## Preprocess

In [8]:
from transformers import AutoTokenizer

# Load the distilbert-base-uncased tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")




tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

There are a few preprocessing steps particular to question answering tasks you should be aware of:

1. Some examples in a dataset may have a very long `context` that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the `context` by setting `truncation="only_second"`.
2. Next, map the start and end positions of the answer to the original `context` by setting
   `return_offsets_mapping=True`.
3. With the mapping in hand, now you can find the start and end tokens of the answer. Use the [sequence_ids](https://huggingface.co/docs/tokenizers/main/en/api/encoding#tokenizers.Encoding.sequence_ids) method to
   find which part of the offset corresponds to the `question` and which corresponds to the `context`.
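
To make these three steps concrete, here is a minimal pure-Python sketch that uses a whitespace splitter in place of the real tokenizer. `toy_encode` and its output layout are hypothetical stand-ins: the DistilBERT tokenizer splits text into subwords and handles special tokens itself, but the offsets and sequence ids play the same roles.

```python
import re


def toy_encode(question, context, max_length):
    """Whitespace 'tokenizer' that mimics truncation="only_second".

    Returns parallel lists: tokens, character offsets, and sequence ids
    (None for special tokens, 0 for the question, 1 for the context).
    """

    def spans(text):
        return [(m.group(), (m.start(), m.end())) for m in re.finditer(r"\S+", text)]

    q, c = spans(question), spans(context)
    # Three slots are reserved for special tokens: [CLS] question [SEP] context [SEP].
    room_for_context = max_length - len(q) - 3
    c = c[: max(room_for_context, 0)]  # only the context is truncated

    tokens = ["[CLS]"] + [t for t, _ in q] + ["[SEP]"] + [t for t, _ in c] + ["[SEP]"]
    offsets = [(0, 0)] + [o for _, o in q] + [(0, 0)] + [o for _, o in c] + [(0, 0)]
    seq_ids = [None] + [0] * len(q) + [None] + [1] * len(c) + [None]
    return tokens, offsets, seq_ids


question = "How was the weather?"
context = "It was cold and rainy all day yesterday"
tokens, offsets, seq_ids = toy_encode(question, context, max_length=12)

for tok, off, sid in zip(tokens, offsets, seq_ids):
    # Offsets of context tokens index into `context`, not into the joined input.
    source = context[off[0] : off[1]] if sid == 1 else ""
    print(f"{tok:10} offset={off} sequence_id={sid} {source}")
```

With `max_length=12`, the question keeps all four of its tokens and the context is cut after five, which is the behavior `truncation="only_second"` is meant to give.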

Here is how you can create a function to truncate and map the start and end tokens of the `answer` to the `context`:

In [9]:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]  # remove leading/trailing whitespace
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",  # truncate only the second sequence (the context) to fit max_length
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer['answer_start'][0]  # Start character of the answer
        end_char = start_char + len(answer['text'][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the (possibly truncated) context, label it (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
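
The two `while` loops at the end of `preprocess_function` convert a character span into token indices. The sketch below isolates that logic in a hypothetical helper, `char_span_to_token_span`, and runs it on hand-written offsets rather than real tokenizer output. Its boundary check returns `(0, 0)` whenever the answer is not fully inside the context window.

```python
def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Return (start_token, end_token) for a character span, or (0, 0)
    when the span is not fully covered by the context tokens."""
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Walk forward to the last token that starts at or before start_char.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward to the first token that ends at or after end_char.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# Hand-written offsets for "the sky is blue today"; indices 0 and 6 stand in
# for special tokens, so the context occupies token indices 1..5.
offsets = [(0, 0), (0, 3), (4, 7), (8, 10), (11, 15), (16, 21), (0, 0)]

print(char_span_to_token_span(offsets, 1, 5, 11, 21))  # "blue today"
print(char_span_to_token_span(offsets, 1, 5, 11, 15))  # "blue"
print(char_span_to_token_span(offsets, 1, 3, 11, 15))  # context truncated after "is"
```

The third call simulates a truncated context: the answer lies beyond the last kept token, so the example is labeled `(0, 0)`.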

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once. Remove any columns you don't need:

In [10]:
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Now create a batch of examples using [DefaultDataCollator](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DefaultDataCollator). Unlike other data collators in 🤗 Transformers, it does not apply any additional preprocessing such as padding; the inputs were already padded to `max_length` during tokenization.

In [11]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()
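
Conceptually, a collator that does no padding only has to regroup a list of per-example dicts into one dict of batched columns. The pure-Python sketch below illustrates that idea; `toy_collate` is a hypothetical function, and the real collator returns tensors rather than nested lists.

```python
def toy_collate(features):
    """Turn a list of per-example dicts into one dict of batched lists.

    No padding is applied, so list-valued fields must already share a length.
    """
    batch = {}
    for key in features[0]:
        column = [f[key] for f in features]
        lengths = {len(v) for v in column if isinstance(v, list)}
        if len(lengths) > 1:
            raise ValueError(f"'{key}' has ragged lengths {sorted(lengths)}; pad first")
        batch[key] = column
    return batch


# Two already-padded toy examples (made-up token ids).
features = [
    {"input_ids": [101, 2129, 102, 0], "start_positions": 1, "end_positions": 1},
    {"input_ids": [101, 2054, 2003, 102], "start_positions": 2, "end_positions": 2},
]
batch = toy_collate(features)
print(batch["input_ids"])
print(batch["start_positions"])
```

This is why `padding="max_length"` in the preprocessing step matters here: without it, examples of different lengths could not be stacked into one batch by a collator that does not pad.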

## Train

<Tip>

If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

</Tip>

You're ready to start training your model now! Load DistilBERT with [AutoModelForQuestionAnswering](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForQuestionAnswering):

In [12]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

# load distilbert-base-uncased model
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")




model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
training_args = TrainingArguments(
    output_dir="qa_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# pass the required arguments
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,2.387792
2,2.783400,1.770445
3,2.783400,1.654444


TrainOutput(global_step=750, training_loss=2.338609659830729, metrics={'train_runtime': 497.1522, 'train_samples_per_second': 24.137, 'train_steps_per_second': 1.509, 'total_flos': 1175877900288000.0, 'train_loss': 2.338609659830729, 'epoch': 3.0})
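
The `global_step=750` value and the `No log` entry for the first epoch can both be reproduced with a little arithmetic. This sketch assumes a single device, no gradient accumulation, and the default `logging_steps` of 500, since the `TrainingArguments` above do not set it:

```python
import math

train_examples = 4000  # rows in tokenized_squad["train"]
batch_size = 16        # per_device_train_batch_size
epochs = 3             # num_train_epochs
logging_steps = 500    # assumed TrainingArguments default, not set explicitly above

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
# Epoch during which the training loss is logged for the first time.
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch, total_steps, first_logged_epoch)  # 250 750 2
```

Under these assumptions the training loss is first logged at step 500, the end of epoch 2. That explains why epoch 1 shows `No log` and why epochs 2 and 3 report the same training loss: no further logging step is reached before training ends at step 750.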

In [14]:
# Save both the model and the tokenizer to the same directory
output_dir = "qa_model"

# Save the model
model.save_pretrained(output_dir)

# Save the tokenizer
tokenizer.save_pretrained(output_dir)

('qa_model/tokenizer_config.json',
 'qa_model/special_tokens_map.json',
 'qa_model/vocab.txt',
 'qa_model/added_tokens.json',
 'qa_model/tokenizer.json')

## Evaluate

Evaluation for question answering requires a significant amount of postprocessing. To avoid taking up too much of your time, this guide skips the evaluation step. The [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) still calculates the evaluation loss during training so you're not completely in the dark about your model's performance.

If you have more time and you're interested in how to evaluate your model for question answering, take a look at the [Question answering](https://huggingface.co/course/chapter7/7?fw=pt#postprocessing) chapter from the 🤗 Hugging Face Course!
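
For a rough idea of what the final scoring looks like once answer strings have been extracted, here is a simplified sketch in the style of SQuAD exact match and token-level F1. The function names are illustrative, and the sketch deliberately leaves out the harder part, which is turning logits into answer strings.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The frilly clothes.", "frilly clothes"))
print(f1_score("frilly pink clothes", "frilly clothes"))
```

Exact match is strict, while F1 gives partial credit when the predicted span overlaps the reference but includes extra or missing words.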

## Inference

In [22]:
question = 'what am I studying?'
context = "My name is shakiba anaraki and I am a computer engineering student at IUST university in Iran , Tehran"

In [20]:
context2 = 'the weather was so cold yesterday and it was a heavy rain'
question2 = 'how was the weather ?'

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for question answering with your model, and pass your text to it:

In [23]:
from transformers import pipeline

# model = AutoModelForQuestionAnswering.from_pretrained(output_dir)
# tokenizer = AutoTokenizer.from_pretrained(output_dir)

question_answerer = pipeline("question-answering", model=model, tokenizer=tokenizer)
question_answerer(question=question, context=context)

{'score': 0.06742642819881439,
 'start': 38,
 'end': 85,
 'answer': 'computer engineering student at IUST university'}

In [24]:
question_answerer(question=question2, context=context2)

{'score': 0.16299065947532654, 'start': 47, 'end': 57, 'answer': 'heavy rain'}

You can also manually replicate the results of the `pipeline` if you'd like:

Tokenize the text and return PyTorch tensors:

In [None]:
# The following cells reproduce the pipeline's steps manually: tokenize the inputs,
# pass them to the model, then decode the predicted answer span.

In [25]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(output_dir)  # load the saved tokenizer
inputs = tokenizer(question, context, return_tensors="pt")

Pass your inputs to the model and return the `logits`:

In [26]:
import torch
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained(output_dir) # load your model
with torch.no_grad():
    outputs = model(**inputs) # pass your inputs to the model

Get the indices of the highest-scoring start and end tokens from the model output:

In [27]:
answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()
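
Taking the two `argmax` values independently can, in principle, yield an end index that precedes the start index. A common safeguard is to score every valid (start, end) pair and keep the best one. The sketch below does this in pure Python on made-up logits; `best_span` and `max_answer_len` are hypothetical names, and the `pipeline`'s own post-processing is more elaborate than this.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) maximizing p_start * p_end with start <= end."""
    p_start, p_end = softmax(start_logits), softmax(end_logits)
    best = (0, 0, 0.0)
    for s, ps in enumerate(p_start):
        for e in range(s, min(s + max_answer_len, len(p_end))):
            score = ps * p_end[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Made-up logits where the independent argmax values are inconsistent:
# the best start is index 2, but the best end is index 1.
start_logits = [0.1, 0.2, 3.0, 0.5, 0.1]
end_logits = [0.1, 2.5, 0.3, 2.0, 0.2]

start, end, score = best_span(start_logits, end_logits)
print(start, end, round(score, 3))
```

Here the product of the two probabilities serves as a confidence score for the chosen span, and restricting the search to `start <= end` guarantees a non-empty answer.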

Decode the predicted tokens to get the answer:

In [28]:
predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)
# The answer matches the pipeline result; it is lowercased because the tokenizer is uncased.

'computer engineering student at iust university'