<a href="https://colab.research.google.com/github/susantaghosh1/nlp-notebooks/blob/develop/Fine_Tuning_Extractive_QA_with_BERT_and_Friends.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning BERT/RoBERTa/DeBERTa/ALBERT/DistilBERT for Extractive QA on the SQuAD Dataset

In this section we will fine-tune an encoder-only model for extractive question answering (QA) on the SQuAD dataset. Encoder-only models like BERT tend to be great at extracting answers to factoid questions like “Who invented the Transformer architecture?” but fare poorly when given open-ended questions like “Why is the sky blue?” In these more challenging cases, encoder-decoder models like T5 and BART are typically used to synthesize the information in a way that’s quite similar to text summarization.

All of the models in the title work in the same way: they add a linear layer on top of the base model, which produces a tensor of shape (batch_size,sequence_length,2) holding the unnormalized scores (**logits**) for the start position and end position of the answer, for every token of every example in the batch.

Let's look briefly at how the model works internally:

1. The tokenized question and context are passed to the model together as a pair. **Say the input to the model has shape (5,30), where 5 is the batch size and 30 is the sequence length (the number of tokens in each input).**
2. Vanilla BERT (or one of its friends) produces a contextualized embedding for every token in the sequence. The output of BERT has shape **(5,30,768), where 5 is the batch size, 30 is the sequence length and 768 is the embedding dimension of each token.**
3. A single linear layer is then applied to every token. It takes the 768-dimensional embedding as input and outputs 2 values per token: a start logit and an end logit. The output now has shape **(5,30,2)**.
4. The output is split along the last dimension into start_logits and end_logits, each of shape **(5,30,1)**.
5. The last dimension, which has size 1, is squeezed out of both tensors, so the start and end logits each have shape **(5,30)**.

**start_logits = tensor of shape (5,30)**
**end_logits = tensor of shape (5,30)**

6. The model takes the start_positions and end_positions of the answer in the tokenized input as labels:

start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for position (index) of the start of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence are not taken into account for computing the loss.

end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for position (index) of the end of the labelled span for computing the token classification loss. Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence are not taken into account for computing the loss.

**start_positions = tensor of shape (5,)**
**end_positions = tensor of shape (5,)**

7. Cross-entropy loss is computed between **start_logits and start_positions** and between **end_logits and end_positions**.

8. The total loss is the average of the **start loss** and the **end loss**. It is backpropagated through the model to compute the gradients and update the weights.
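Steps 7 and 8 can be made concrete with a small sketch in plain Python. This is not the PyTorch implementation: the helper `cross_entropy` and the toy logits are made up for illustration, and the example covers a single sequence of 4 tokens rather than a batch.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target]


# Toy scores for one example with 4 tokens (assumed values, not model output).
start_logits = [0.1, 2.0, 0.3, -1.0]
end_logits = [-0.5, 0.2, 1.5, 0.0]
start_position, end_position = 1, 2  # labels: token indices of the answer span

start_loss = cross_entropy(start_logits, start_position)
end_loss = cross_entropy(end_logits, end_position)
total_loss = (start_loss + end_loss) / 2  # step 8: average of the two losses

print(f"start_loss={start_loss:.4f} end_loss={end_loss:.4f} total_loss={total_loss:.4f}")
```

Each loss treats the sequence positions as the classes: the start logits are softmaxed over the 4 tokens and the loss is low when the true start token gets most of the probability mass.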

Pseudo code for a QA model with BERT (it closely follows `BertForQuestionAnswering` from transformers, in simplified form):

from typing import Optional

import torch
from torch import nn
from torch.nn import CrossEntropyLoss
from transformers import BertModel, BertPreTrainedModel


class PseudoQA(BertPreTrainedModel):

    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels  # 2: one start score and one end score

        self.bert = BertModel(config, add_pooling_layer=False)
        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)

        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        start_positions: Optional[torch.Tensor] = None,
        end_positions: Optional[torch.Tensor] = None,
    ):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )

        # Last hidden state of BERT, shape: (batch_size, sequence_length, 768)
        sequence_output = outputs[0]

        # shape of logits: (batch_size, sequence_length, 2)
        logits = self.qa_outputs(sequence_output)
        # shape of start_logits and end_logits: (batch_size, sequence_length, 1)
        start_logits, end_logits = logits.split(1, dim=-1)
        # After squeezing, shape: (batch_size, sequence_length)
        start_logits = start_logits.squeeze(-1).contiguous()
        end_logits = end_logits.squeeze(-1).contiguous()

        total_loss = None
        if start_positions is not None and end_positions is not None:
            # On multi-GPU the labels can arrive with an extra dimension; squeeze it away
            if len(start_positions.size()) > 1:
                start_positions = start_positions.squeeze(-1)
            if len(end_positions.size()) > 1:
                end_positions = end_positions.squeeze(-1)
            # Sometimes the start/end positions are outside our model inputs;
            # they are clamped to ignored_index and then skipped by the loss
            ignored_index = start_logits.size(1)
            start_positions = start_positions.clamp(0, ignored_index)
            end_positions = end_positions.clamp(0, ignored_index)

            loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2

        return total_loss, start_logits, end_logits



Enough theory! Let's get our hands dirty.

In [4]:
%%capture
!pip install datasets transformers[sentencepiece]
!pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
!pip install scipy scikit-learn

In [2]:
import torch
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cuda')

In [3]:
!nvidia-smi

Sun May 29 06:15:28 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [5]:
# load the dataset

from datasets import load_dataset

raw_datasets = load_dataset("squad")


Downloading and preparing dataset squad/plain_text (download: 33.51 MiB, generated: 85.63 MiB, post-processed: Unknown size, total: 119.14 MiB) to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453...



Dataset squad downloaded and prepared to /root/.cache/huggingface/datasets/squad/plain_text/1.0.0/d6ec3ceb99ca480ce37cdd35555d6cb2511d223b9150cce08a837ef62ffea453. Subsequent calls will reuse this data.



In [6]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [7]:
print("Context: ", raw_datasets["train"][0]["context"])
print("Question: ", raw_datasets["train"][0]["question"])
print("Answer: ", raw_datasets["train"][0]["answers"])

Context:  Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
Question:  To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Answer:  {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}


In [8]:
print(raw_datasets["train"][0]["answers"].keys())
print(type(raw_datasets["train"][0]["answers"]['text']))
print(raw_datasets["train"][0]["answers"]['text'][0])

dict_keys(['text', 'answer_start'])
<class 'list'>
Saint Bernadette Soubirous


In [9]:
answer = raw_datasets["train"][0]["answers"]['text'][0]
answer_start = raw_datasets["train"][0]["answers"]['answer_start'][0]
answer_end = answer_start + len(answer)
answer_from_context = raw_datasets["train"][0]["context"] [answer_start:answer_end]


In [10]:
answer_from_context

'Saint Bernadette Soubirous'

In the training set, each question has exactly one answer. We can double-check this by using the `Dataset.filter()` method:

In [11]:
raw_datasets["train"].filter(lambda x: len(x["answers"]["text"]) != 1)




Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 0
})

For evaluation, however, each sample has several acceptable answers, which may be the same or different (a predicted answer is compared to all of them and the best score is kept):

In [12]:
print(raw_datasets["validation"][0]["answers"])
print(raw_datasets["validation"][2]["answers"])

{'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'answer_start': [177, 177, 177]}
{'text': ['Santa Clara, California', "Levi's Stadium", "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."], 'answer_start': [403, 355, 355]}
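Scoring against several acceptable answers can be sketched in a few lines. This is only an illustration, not the official SQuAD metric: the real script also strips punctuation and articles and reports F1 alongside exact match, and the helpers `normalize` and `best_exact_match` are hypothetical names.

```python
def normalize(text):
    # Simplified normalization (an assumption): lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def best_exact_match(prediction, references):
    """Return 1 if the prediction matches any acceptable answer, else 0."""
    return max(int(normalize(prediction) == normalize(ref)) for ref in references)


# Acceptable answers of the second validation sample printed above.
references = [
    "Santa Clara, California",
    "Levi's Stadium",
    "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California.",
]

print(best_exact_match("levi's stadium", references))
print(best_exact_match("San Francisco", references))
```

Taking the maximum over the references means a prediction is not penalized for matching a shorter or longer annotation than the one another annotator chose.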


# Preprocessing the training data

In [13]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


In [16]:
tokenizer.is_fast,tokenizer.special_tokens_map

(True,
 {'cls_token': '[CLS]',
  'mask_token': '[MASK]',
  'pad_token': '[PAD]',
  'sep_token': '[SEP]',
  'unk_token': '[UNK]'})

We can pass the question and the context to our tokenizer together, and it will insert the special tokens to form a sequence like this:

[CLS] question [SEP] context [SEP]
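The pair layout can be illustrated with a hand-rolled sketch. The function `build_pair` is hypothetical and works on already-split word lists; the real tokenizer additionally splits words into subwords and converts them to ids. The segment ids follow the BERT convention of 0 for the first sequence and 1 for the second.

```python
def build_pair(question_tokens, context_tokens):
    """Lay out a question/context pair the way BERT expects it."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment ids: 0 for [CLS] + question + first [SEP], 1 for context + final [SEP].
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair(["who", "?"], ["it", "was", "bert"])
for token, segment in zip(tokens, token_type_ids):
    print(f"{segment}  {token}")
```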

In [17]:
context = raw_datasets["train"][0]["context"]
question = raw_datasets["train"][0]["question"]

inputs = tokenizer(question, context,return_offsets_mapping=True)


In [18]:
len(inputs['input_ids']),len(inputs['offset_mapping']),inputs

(176,
 176,
 {'input_ids': [101, 2000, 3183, 2106, 1996, 6261, 2984, 9382, 3711, 1999, 8517, 1999, 10223, 26371, 2605, 1029, 102, 6549, 2135, 1010, 1996, 2082, 2038, 1037, 3234, 2839, 1012, 10234, 1996, 2364, 2311, 1005, 1055, 2751, 8514, 2003, 1037, 3585, 6231, 1997, 1996, 6261, 2984, 1012, 3202, 1999, 2392, 1997, 1996, 2364, 2311, 1998, 5307, 2009, 1010, 2003, 1037, 6967, 6231, 1997, 4828, 2007, 2608, 2039, 14995, 6924, 2007, 1996, 5722, 1000, 2310, 3490, 2618, 4748, 2033, 18168, 5267, 1000, 1012, 2279, 2000, 1996, 2364, 2311, 2003, 1996, 13546, 1997, 1996, 6730, 2540, 1012, 3202, 2369, 1996, 13546, 2003, 1996, 24665, 23052, 1010, 1037, 14042, 2173, 1997, 7083, 1998, 9185, 1012, 2009, 2003, 1037, 15059, 1997, 1996, 24665, 23052, 2012, 10223, 26371, 1010, 2605, 2073, 1996, 6261, 2984, 22353, 2135, 2596, 2000, 3002, 16595, 9648, 4674, 2061, 12083, 9711, 2271, 1999, 8517, 1012, 2012, 1996, 2203, 1997, 1996, 2364, 3298, 1006, 1998, 1999, 1037, 3622, 2240, 2008, 8539, 2083, 1017, 11342, 1

In [19]:
tokenizer.decode(inputs["input_ids"])


'[CLS] to whom did the virgin mary allegedly appear in 1858 in lourdes france? [SEP] architecturally, the school has a catholic character. atop the main building\'s gold dome is a golden statue of the virgin mary. immediately in front of the main building and facing it, is a copper statue of christ with arms upraised with the legend " venite ad me omnes ". next to the main building is the basilica of the sacred heart. immediately behind the basilica is the grotto, a marian place of prayer and reflection. it is a replica of the grotto at lourdes, france where the virgin mary reputedly appeared to saint bernadette soubirous in 1858. at the end of the main drive ( and in a direct line that connects through 3 statues and the gold dome ), is a simple, modern stone statue of mary. [SEP]'

In [20]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

['[CLS]',
 'to',
 'whom',
 'did',
 'the',
 'virgin',
 'mary',
 'allegedly',
 'appear',
 'in',
 '1858',
 'in',
 'lou',
 '##rdes',
 'france',
 '?',
 '[SEP]',
 'architectural',
 '##ly',
 ',',
 'the',
 'school',
 'has',
 'a',
 'catholic',
 'character',
 '.',
 'atop',
 'the',
 'main',
 'building',
 "'",
 's',
 'gold',
 'dome',
 'is',
 'a',
 'golden',
 'statue',
 'of',
 'the',
 'virgin',
 'mary',
 '.',
 'immediately',
 'in',
 'front',
 'of',
 'the',
 'main',
 'building',
 'and',
 'facing',
 'it',
 ',',
 'is',
 'a',
 'copper',
 'statue',
 'of',
 'christ',
 'with',
 'arms',
 'up',
 '##rai',
 '##sed',
 'with',
 'the',
 'legend',
 '"',
 've',
 '##ni',
 '##te',
 'ad',
 'me',
 'om',
 '##nes',
 '"',
 '.',
 'next',
 'to',
 'the',
 'main',
 'building',
 'is',
 'the',
 'basilica',
 'of',
 'the',
 'sacred',
 'heart',
 '.',
 'immediately',
 'behind',
 'the',
 'basilica',
 'is',
 'the',
 'gr',
 '##otto',
 ',',
 'a',
 'marian',
 'place',
 'of',
 'prayer',
 'and',
 'reflection',
 '.',
 'it',
 'is',
 'a',
 

In this case the context is not too long, but some of the examples in the dataset have very long contexts that will exceed the maximum length we set (which is 384 in this case). We will deal with long contexts by creating several training features from one sample of our dataset, with a sliding window between them.

To see how this works using the current example, we can limit the length to 100 and use a sliding window of 50 tokens. As a reminder, we use:

- `max_length` to set the maximum length (here 100)
- `truncation="only_second"` to truncate the context (which is in the second position) when the question with its context is too long
- `stride` to set the number of overlapping tokens between two successive chunks (here 50)
- `return_overflowing_tokens=True` to let the tokenizer know we want the overflowing tokens
- `return_offsets_mapping=True` to get the character positions of each token in the text it came from (the question where the sequence id is 0, the context where it is 1)
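To build some intuition for how `max_length` and `stride` interact, here is a simplified model of the windowing in plain Python. It is a sketch under assumptions, not the tokenizer's actual implementation: `sliding_windows` is a hypothetical helper, and the numbers come from the 176-token example above (15 question tokens, 3 special tokens, so 158 context tokens).

```python
def sliding_windows(context_tokens, num_question_tokens, max_length, stride, num_special=3):
    """Split the context into overlapping chunks that fit next to the question."""
    window = max_length - num_question_tokens - num_special  # context tokens per chunk
    step = window - stride  # consecutive chunks share `stride` tokens
    if window <= 0 or step <= 0:
        raise ValueError("max_length is too small for this question and stride")
    chunks, start = [], 0
    while True:
        chunks.append(context_tokens[start:start + window])
        if start + window >= len(context_tokens):
            break  # this chunk already reaches the end of the context
        start += step
    return chunks


context_tokens = list(range(158))  # stand-ins for the 158 context tokens
chunks = sliding_windows(context_tokens, num_question_tokens=15, max_length=100, stride=50)
print(len(chunks))
print([(chunk[0], chunk[-1]) for chunk in chunks])
# → 4
# → [(0, 81), (32, 113), (64, 145), (96, 157)]
```

With these settings each chunk holds 82 context tokens and starts 32 tokens after the previous one, so neighbouring chunks overlap by 50 tokens and an answer that is cut at one chunk boundary is likely to appear whole in the next chunk.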

In [21]:
batch_encoding = tokenizer(question,context,max_length=100,truncation="only_second",stride=50,
                           return_overflowing_tokens=True,return_offsets_mapping=True)

In [22]:
batch_encoding.keys(),len(batch_encoding['input_ids'])

(dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'offset_mapping', 'overflow_to_sample_mapping']),
 4)

In [23]:
batch_encoding

{'input_ids': [[101, 2000, 3183, 2106, 1996, 6261, 2984, 9382, 3711, 1999, 8517, 1999, 10223, 26371, 2605, 1029, 102, 6549, 2135, 1010, 1996, 2082, 2038, 1037, 3234, 2839, 1012, 10234, 1996, 2364, 2311, 1005, 1055, 2751, 8514, 2003, 1037, 3585, 6231, 1997, 1996, 6261, 2984, 1012, 3202, 1999, 2392, 1997, 1996, 2364, 2311, 1998, 5307, 2009, 1010, 2003, 1037, 6967, 6231, 1997, 4828, 2007, 2608, 2039, 14995, 6924, 2007, 1996, 5722, 1000, 2310, 3490, 2618, 4748, 2033, 18168, 5267, 1000, 1012, 2279, 2000, 1996, 2364, 2311, 2003, 1996, 13546, 1997, 1996, 6730, 2540, 1012, 3202, 2369, 1996, 13546, 2003, 1996, 24665, 102], [101, 2000, 3183, 2106, 1996, 6261, 2984, 9382, 3711, 1999, 8517, 1999, 10223, 26371, 2605, 1029, 102, 2364, 2311, 1998, 5307, 2009, 1010, 2003, 1037, 6967, 6231, 1997, 4828, 2007, 2608, 2039, 14995, 6924, 2007, 1996, 5722, 1000, 2310, 3490, 2618, 4748, 2033, 18168, 5267, 1000, 1012, 2279, 2000, 1996, 2364, 2311, 2003, 1996, 13546, 1997, 1996, 6730, 2540, 1012, 3202, 2369, 19

In [24]:
batch_encoding['overflow_to_sample_mapping'] # one long context has been split into 4 features

[0, 0, 0, 0]

In [25]:
sequence_ids = batch_encoding.sequence_ids(0)
input_ids = batch_encoding['input_ids'][0]
offsets = batch_encoding['offset_mapping'][0]
for idx, (tokens, positions) in enumerate(zip(input_ids, offsets)):
  if sequence_ids[idx] == 0:
    sliced_text = question[positions[0]:positions[1]]  # token comes from the question
  elif sequence_ids[idx] == 1:
    sliced_text = context[positions[0]:positions[1]]  # token comes from the context
  else:
    sliced_text = ""  # special tokens have sequence id None and positions (0, 0)
  print(f"tokens :: {tokens} and decoded token :: {tokenizer.convert_ids_to_tokens(tokens)} and positions :: {positions} and sliced  text :: {sliced_text}")

tokens :: 101 and decoded token :: [CLS] and positions :: (0, 0) and sliced  text :: 
tokens :: 2000 and decoded token :: to and positions :: (0, 2) and sliced  text :: To
tokens :: 3183 and decoded token :: whom and positions :: (3, 7) and sliced  text :: whom
tokens :: 2106 and decoded token :: did and positions :: (8, 11) and sliced  text :: did
tokens :: 1996 and decoded token :: the and positions :: (12, 15) and sliced  text :: the
tokens :: 6261 and decoded token :: virgin and positions :: (16, 22) and sliced  text :: Virgin
tokens :: 2984 and decoded token :: mary and positions :: (23, 27) and sliced  text :: Mary
tokens :: 9382 and decoded token :: allegedly and positions :: (28, 37) and sliced  text :: allegedly
tokens :: 3711 and decoded token :: appear and positions :: (38, 44) and sliced  text :: appear
tokens :: 1999 and decoded token :: in and positions :: (45, 47) and sliced  text :: in
tokens :: 8517 and decoded token :: 1858 and positions :: (48, 52) and sliced  text :: 1858
toke
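The idea behind an offset mapping can be reproduced without a tokenizer. The sketch below uses a plain whitespace split (an assumption made for illustration; BERT's WordPiece tokenizer also splits punctuation and subwords) and records, for each token, the `(start, end)` character span it came from, with `end` exclusive.

```python
import re


def whitespace_offsets(text):
    """Return (token, (start, end)) pairs for whitespace-separated tokens."""
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(r"\S+", text)]


question = "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?"
for token, (start, end) in whitespace_offsets(question)[:4]:
    # Slicing the original text with the offsets gives back the token.
    print(token, (start, end), question[start:end])
```

Because the offsets index into the original string, they recover the original casing ("To", "Virgin") even though the uncased model lowercases its tokens.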

In [26]:
# let's try to encode few more samples together

sample_question =  raw_datasets["train"][2:6]["question"] # list of size 4
sample_context =  raw_datasets["train"][2:6]["context"] # list of size 4
sample_answers = raw_datasets["train"][2:6]["answers"]
sample_question,sample_question[0],sample_context[0],sample_answers[0]

(['The Basilica of the Sacred heart at Notre Dame is beside to which structure?',
  'What is the Grotto at Notre Dame?',
  'What sits on top of the Main Building at Notre Dame?',
  'When did the Scholastic Magazine of Notre dame begin publishing?'],
 'The Basilica of the Sacred heart at Notre Dame is beside to which structure?',
 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple,

In [27]:
sample_encoding = tokenizer(sample_question,sample_context,max_length=100,truncation="only_second",stride=50,
                           return_overflowing_tokens=True,return_offsets_mapping=True)
sample_encoding,sample_encoding.keys(),len(sample_encoding['offset_mapping'][0])

({'input_ids': [[101, 1996, 13546, 1997, 1996, 6730, 2540, 2012, 10289, 8214, 2003, 3875, 2000, 2029, 3252, 1029, 102, 6549, 2135, 1010, 1996, 2082, 2038, 1037, 3234, 2839, 1012, 10234, 1996, 2364, 2311, 1005, 1055, 2751, 8514, 2003, 1037, 3585, 6231, 1997, 1996, 6261, 2984, 1012, 3202, 1999, 2392, 1997, 1996, 2364, 2311, 1998, 5307, 2009, 1010, 2003, 1037, 6967, 6231, 1997, 4828, 2007, 2608, 2039, 14995, 6924, 2007, 1996, 5722, 1000, 2310, 3490, 2618, 4748, 2033, 18168, 5267, 1000, 1012, 2279, 2000, 1996, 2364, 2311, 2003, 1996, 13546, 1997, 1996, 6730, 2540, 1012, 3202, 2369, 1996, 13546, 2003, 1996, 24665, 102], [101, 1996, 13546, 1997, 1996, 6730, 2540, 2012, 10289, 8214, 2003, 3875, 2000, 2029, 3252, 1029, 102, 2364, 2311, 1998, 5307, 2009, 1010, 2003, 1037, 6967, 6231, 1997, 4828, 2007, 2608, 2039, 14995, 6924, 2007, 1996, 5722, 1000, 2310, 3490, 2618, 4748, 2033, 18168, 5267, 1000, 1012, 2279, 2000, 1996, 2364, 2311, 2003, 1996, 13546, 1997, 1996, 6730, 2540, 1012, 3202, 2369, 1

In [28]:
for k,v in sample_encoding.items():
  print(f"shape of {k} :: {len(v)}")  # 4 inputs result in 17 features

shape of input_ids :: 17
shape of token_type_ids :: 17
shape of attention_mask :: 17
shape of offset_mapping :: 17
shape of overflow_to_sample_mapping :: 17


`input_ids`, `token_type_ids`, `attention_mask` and `offset_mapping` are each a list of lists, while `overflow_to_sample_mapping` is a flat list.

Let's make the labels. The labels are start_positions and end_positions, each with one entry per feature, i.e. of shape (batch_size,). For each feature, the label is:

- (0, 0) if the answer is not in the corresponding span of the context
- (start_position, end_position) if the answer is in the corresponding span of the context, with start_position being the index of the token (in the input IDs) at the start of the answer and end_position being the index of the token (in the input IDs) where the answer ends

In [29]:
sample_answers = raw_datasets["train"][2:6]["answers"]
sample_answers,sample_answers[0]

([{'answer_start': [279], 'text': ['the Main Building']},
  {'answer_start': [381], 'text': ['a Marian place of prayer and reflection']},
  {'answer_start': [92], 'text': ['a golden statue of the Virgin Mary']},
  {'answer_start': [248], 'text': ['September 1876']}],
 {'answer_start': [279], 'text': ['the Main Building']})

In [30]:
sample_encoding['overflow_to_sample_mapping']


[0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3]
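A quick way to read this mapping is to count how many features each original sample produced. The snippet below is a small sketch that copies the list printed above into a standalone variable.

```python
from collections import Counter

# Copied from the output above: feature i came from sample overflow_to_sample_mapping[i].
overflow_to_sample_mapping = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3]

features_per_sample = Counter(overflow_to_sample_mapping)
for sample_idx, count in sorted(features_per_sample.items()):
    print(f"sample {sample_idx} -> {count} features")
print("total features:", sum(features_per_sample.values()))
```

The four samples yield 4, 3, 4 and 6 features, which adds up to the 17 features reported earlier. Every feature needs its own label, which is why the labeling loop looks up the original sample through this mapping.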

In [38]:
# find the original sample
# find answers start and end char positions of that original sample
# Find the start and end of the context
# If the answer is not fully inside the context, label is (0, 0)
# Otherwise it's the start and end token positions
sample_mappings = sample_encoding['overflow_to_sample_mapping']
start_positions = []
end_positions = []
for i,offset in enumerate(sample_encoding['offset_mapping']):
  original_sample_id = sample_mappings[i] #find the original sample
  answer = sample_answers[original_sample_id]
  answer_start = answer['answer_start'][0]
  answer_end = answer_start+len(answer['text'][0])
  sequence_id = sample_encoding.sequence_ids(i)
  idx = 0
  while sequence_id[idx]!=1:
    idx +=1
  context_start = idx
  while sequence_id[idx]==1:
    idx +=1
  context_end = idx-1
  if offset[context_start][0]>answer_start or offset[context_end][1]<answer_end:
    start_positions.append(0)
    end_positions.append(0)
  else:
    idx = context_start
    while idx <= context_end and offset[idx][0] <= answer_start:
      idx +=1
    start_positions.append(idx-1)
    idx = context_end
    while idx >= context_start and offset[idx][1] >= answer_end:
      idx -= 1
    end_positions.append(idx+1)
start_positions, end_positions




([81, 49, 17, 0, 0, 57, 19, 33, 0, 0, 0, 63, 27, 0, 0, 0, 0],
 [83, 51, 19, 0, 0, 63, 25, 39, 0, 0, 0, 64, 28, 0, 0, 0, 0])

Let’s take a look at a few results to verify that our approach is correct. For the first feature we find (81, 83) as labels, so let’s compare the theoretical answer with the decoded span of tokens from 81 to 83 (inclusive):

In [49]:
idx = 0
sample_idx = sample_encoding["overflow_to_sample_mapping"][idx]
answer = sample_answers[sample_idx]["text"][0]

start = start_positions[idx]
end = end_positions[idx]
labeled_answer = tokenizer.decode(sample_encoding["input_ids"][idx][start : end + 1])

print(f"Theoretical answer: {answer}, labels give: {labeled_answer}")

Theoretical answer: the Main Building, labels give: the main building


In [53]:
idx = 4
sample_idx = sample_encoding["overflow_to_sample_mapping"][idx]
answer = sample_answers[sample_idx]["text"][0]

decoded_example = tokenizer.decode(sample_encoding["input_ids"][idx])
print(f"Theoretical answer: {answer}, decoded example: {decoded_example}") #we don’t see the answer inside the context.

Theoretical answer: a Marian place of prayer and reflection, decoded example: [CLS] what is the grotto at notre dame? [SEP] architecturally, the school has a catholic character. atop the main building's gold dome is a golden statue of the virgin mary. immediately in front of the main building and facing it, is a copper statue of christ with arms upraised with the legend " venite ad me omnes ". next to the main building is the basilica of the sacred heart. immediately behind the basilica is the grotto, a marian place of [SEP]


In [54]:
max_length = 384
stride = 128


def preprocess_training_examples(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=max_length,
        truncation="only_second",
        stride=stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    sample_map = inputs.pop("overflow_to_sample_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        sample_idx = sample_map[i]
        answer = answers[sample_idx]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label is (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

In [61]:
tokenized_dataset = raw_datasets.map(
    preprocess_training_examples,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)


In [62]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 88524
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'],
        num_rows: 10784
    })
})

# Fine-tuning the model

In [57]:
from transformers import AutoModelForQuestionAnswering
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased a

In [58]:
from huggingface_hub import notebook_login

notebook_login()

Login successful
Your token has been saved to /root/.huggingface/token
Authenticated through git-credential store but this isn't the helper defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub. Run the following command in your terminal in case you want to set this credential helper as the default

git config --global credential.helper store


In [60]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git-lfs is already the newest version (2.3.4-1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 42 not upgraded.


In [59]:
from transformers import TrainingArguments

args = TrainingArguments(
    "bert-finetuned-squad",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,
    push_to_hub=True,
)

In [1]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
    tokenizer=tokenizer,
)
trainer.train()

ModuleNotFoundError: ignored

# Preprocessing the validation data

In [None]:
# add one example_id column which will hel to identify from which original example the feature came
# convert all the offset_mappings for question to None

def preprocess_eval_data(eval_data):
  questions = [q.strip() for q in eval_data['question']]
  batch_encoding = tokenizer(
        questions,
        eval_data["context"],
        max_length=384,
        truncation="only_second",
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )
  sample_map = batch_encoding.pop("overflow_to_sample_mapping")
  print(f"len of sample_map {len(sample_map)} and len of input_ids {len(batch_encoding['input_ids'])}")
  example_ids = []
  for i in range(len(batch_encoding['input_ids'])):
    # for each feature, i.e. each (question, context chunk) pair
    sample_id = sample_map[i]
    example_ids.append(eval_data['id'][sample_id])
    sequence_ids =  batch_encoding.sequence_ids(i)
    offset = batch_encoding['offset_mapping'][i]
    batch_encoding['offset_mapping'][i] = [o if sequence_ids[k]==1 else None for k,o in enumerate(offset) ]
  batch_encoding["example_id"] = example_ids
  return batch_encoding
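
To see what the offset masking inside `preprocess_eval_data` does, here is a minimal sketch on hand-written toy values. The `sequence_ids` and `offsets` lists are made up for illustration (they mimic a BERT-style `[CLS] question [SEP] context [SEP]` layout) and are not real tokenizer output.

```python
# Toy stand-ins for one tokenized feature (hypothetical values, not real tokenizer output).
# sequence_ids: None = special token, 0 = question token, 1 = context token.
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
# offsets: (start_char, end_char) of each token inside its own text.
offsets = [(0, 0), (0, 5), (6, 9), (0, 0), (0, 5), (6, 10), (11, 13), (0, 0)]

# Same list comprehension as in preprocess_eval_data:
# keep offsets only for context tokens, everything else becomes None.
masked_offsets = [o if sequence_ids[k] == 1 else None for k, o in enumerate(offsets)]

# Token positions that are allowed to start or end an answer span.
context_positions = [k for k, o in enumerate(masked_offsets) if o is not None]

print(masked_offsets)
print(context_positions)
```

After masking, any token whose offset is `None` can be skipped during post-processing, so a predicted span can never land in the question or on a special token.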


In [None]:
small_eval_set = raw_datasets["validation"].select(range(100))
trained_checkpoint = "distilbert-base-cased-distilled-squad"

tokenizer = AutoTokenizer.from_pretrained(trained_checkpoint)
eval_set = small_eval_set.map(
    preprocess_eval_data,
    batched=True,
    remove_columns=raw_datasets["validation"].column_names,
)

len of sample_map 100 and len of input_ids 100


In [None]:
eval_set

Dataset({
    features: ['input_ids', 'attention_mask', 'offset_mapping', 'example_id'],
    num_rows: 100
})

In [None]:
eval_set['example_id']

['56be4db0acb8001400a502ec',
 '56be4db0acb8001400a502ed',
 '56be4db0acb8001400a502ee',
 '56be4db0acb8001400a502ef',
 '56be4db0acb8001400a502f0',
 '56be8e613aeaaa14008c90d1',
 '56be8e613aeaaa14008c90d2',
 '56be8e613aeaaa14008c90d3',
 '56bea9923aeaaa14008c91b9',
 '56bea9923aeaaa14008c91ba',
 '56bea9923aeaaa14008c91bb',
 '56beace93aeaaa14008c91df',
 '56beace93aeaaa14008c91e0',
 '56beace93aeaaa14008c91e1',
 '56beace93aeaaa14008c91e2',
 '56beace93aeaaa14008c91e3',
 '56bf10f43aeaaa14008c94fd',
 '56bf10f43aeaaa14008c94fe',
 '56bf10f43aeaaa14008c94ff',
 '56bf10f43aeaaa14008c9500',
 '56bf10f43aeaaa14008c9501',
 '56d20362e7d4791d009025e8',
 '56d20362e7d4791d009025e9',
 '56d20362e7d4791d009025ea',
 '56d20362e7d4791d009025eb',
 '56d600e31c85041400946eae',
 '56d600e31c85041400946eb0',
 '56d600e31c85041400946eb1',
 '56d9895ddc89441400fdb50e',
 '56d9895ddc89441400fdb510',
 '56be4e1facb8001400a502f6',
 '56be4e1facb8001400a502f9',
 '56be4e1facb8001400a502fa',
 '56beaa4a3aeaaa14008c91c2',
 '56beaa4a3aea

In [None]:
!nvidia-smi

Wed May 25 04:15:08 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8     9W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
import torch
from transformers import AutoModelForQuestionAnswering
eval_set_for_model = eval_set.remove_columns(['offset_mapping', 'example_id'])
eval_set_for_model.set_format("torch")
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")  # call is_available(); the bare function object is always truthy
batch = {k : eval_set_for_model[k].to(device) for k in eval_set_for_model.column_names}
trained_model = AutoModelForQuestionAnswering.from_pretrained(trained_checkpoint).to(device)
with torch.no_grad():
  outputs = trained_model(**batch)

In [None]:
eval_set_for_model.column_names

['input_ids', 'attention_mask']

In [None]:
outputs.start_logits.shape, outputs.end_logits.shape, type(outputs.start_logits), type(outputs.end_logits), outputs.start_logits.size(1)  # one start logit and one end logit per token, for each of the 100 features

(torch.Size([100, 384]),
 torch.Size([100, 384]),
 torch.Tensor,
 torch.Tensor,
 384)

In [None]:
start_logits = outputs.start_logits.cpu().numpy()
end_logits = outputs.end_logits.cpu().numpy()

In [None]:
start_logits[5]

array([ -6.0025806 ,  -9.441373  , -10.468868  ,  -9.709174  ,
       -10.077002  , -10.543653  , -10.214958  , -11.334011  ,
       -10.562479  ,  -7.8932896 ,  -8.072902  ,  -3.6805031 ,
        -9.079008  ,  -6.2378054 ,  -6.228958  ,  -2.8845828 ,
        -2.0040712 ,  -3.7998216 ,  -5.25239   ,  -1.2919195 ,
        -4.1664577 ,  -5.43419   ,  -3.5442555 ,  -8.407289  ,
        -4.9563055 ,  -3.6859071 ,  -6.0853686 ,  -6.504245  ,
        -7.263981  ,  -4.4759665 ,  -7.087413  ,  -6.5546665 ,
        -6.041664  ,  -3.1254323 ,  -7.1854944 ,  -8.200024  ,
        -7.543401  ,  -7.528896  ,  -9.052555  ,  -9.464549  ,
        -9.769499  ,  -8.251631  , -10.58807   ,  -9.577547  ,
        -7.473591  ,  -9.056961  ,  -9.579401  ,  -9.8195915 ,
        -8.40139   ,  -9.990761  ,  -9.797199  , -10.035499  ,
        -8.159855  , -10.707861  ,  -9.74521   ,  -8.52379   ,
        -9.346121  ,  -9.3583355 , -11.699266  , -10.020927  ,
       -10.282642  , -10.161473  , -10.032623  ,  -8.31

In [None]:
np.argsort(start_logits[5])[-1:-20 - 1:-1]  # [start:stop:step] with step -1: indices of the 20 largest start logits, best first

array([108, 107, 106, 113, 103, 105,  19,  16, 112,  15,  95, 109,  33,
       163,  98,  22,  11,  25,  17,  20])

In [None]:
a = np.array([ 2, 0,  1, 5, 4, 10, 9])
a[-1:-4-1]

array([], dtype=int64)
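
The empty result above comes from the slice step. A slice without an explicit step walks forward, so starting at index -1 and stopping at index -5 selects nothing. The sketch below shows the same rule on a plain Python list (1-D NumPy arrays slice the same way) and a pure-Python equivalent of the `argsort` top-k trick. The names `scores`, `ascending` and `top_3` are only illustrative.

```python
scores = [2, 0, 1, 5, 4, 10, 9]

# Default step is +1: walking forward from index -1 to index -5 yields nothing.
empty = scores[-1:-4 - 1]

# With step -1 the slice walks backwards: the last 4 items, in reverse order.
last_four_reversed = scores[-1:-4 - 1:-1]

# Pure-Python counterpart of np.argsort: indices ordered by ascending score.
ascending = sorted(range(len(scores)), key=scores.__getitem__)

# Indices of the 3 largest scores, best first.
top_3 = ascending[-1:-3 - 1:-1]

print(empty, last_four_reversed, top_3)
```

This is why the `[-1:-20 - 1:-1]` slice on the `argsort` result returns the 20 best token positions in descending order of logit.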

In [None]:
eval_set

Dataset({
    features: ['input_ids', 'attention_mask', 'offset_mapping', 'example_id'],
    num_rows: 100
})

In [None]:
import collections

example_to_features = collections.defaultdict(list)
for idx, feature in enumerate(eval_set):
    example_to_features[feature["example_id"]].append(idx)
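
In this small evaluation set every context fits into a single feature, so each example maps to exactly one index. When a long context overflows into several chunks, the same grouping collects all of them under one example id. A minimal sketch with a made-up `feature_example_ids` column (the ids are hypothetical):

```python
import collections

# Hypothetical example_id column: the second example overflowed into 3 features.
feature_example_ids = ["ex-a", "ex-b", "ex-b", "ex-b", "ex-c"]

example_to_features = collections.defaultdict(list)
for idx, example_id in enumerate(feature_example_ids):
    # Append the feature index to the list of its originating example.
    example_to_features[example_id].append(idx)

print(dict(example_to_features))
```

During post-processing, the candidate answers from all features of one example are then compared against each other.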

In [None]:
example_to_features

defaultdict(list,
            {'56be4db0acb8001400a502ec': [0],
             '56be4db0acb8001400a502ed': [1],
             '56be4db0acb8001400a502ee': [2],
             '56be4db0acb8001400a502ef': [3],
             '56be4db0acb8001400a502f0': [4],
             '56be4e1facb8001400a502f6': [30],
             '56be4e1facb8001400a502f9': [31],
             '56be4e1facb8001400a502fa': [32],
             '56be4eafacb8001400a50302': [55],
             '56be4eafacb8001400a50303': [56],
             '56be4eafacb8001400a50304': [57],
             '56be5333acb8001400a5030a': [80],
             '56be5333acb8001400a5030b': [81],
             '56be5333acb8001400a5030c': [82],
             '56be5333acb8001400a5030d': [83],
             '56be5333acb8001400a5030e': [84],
             '56be8e613aeaaa14008c90d1': [5],
             '56be8e613aeaaa14008c90d2': [6],
             '56be8e613aeaaa14008c90d3': [7],
             '56bea9923aeaaa14008c91b9': [8],
             '56bea9923aeaaa14008c91ba': [9],
     

In [None]:
import numpy as np
n_best = 20
max_answer_length = 30
predicted_answers = []
for example in small_eval_set:
  example_id = example["id"]
  context = example["context"]
  answers = []
  for feature_index in example_to_features[example_id]: 
    start_logit = start_logits[feature_index]
    end_logit = end_logits[feature_index]
    offsets = eval_set["offset_mapping"][feature_index]
    

{'id': '56be4db0acb8001400a502ec', 'title': 'Super_Bowl_50', 'context': 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.', 'question': 'Which NFL team represented the AFC at Super Bowl 50?', 'answers': {'text': ['Denver Broncos', 'Denver Broncos', 'Denver Broncos'], 'ans
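
The post-processing cell above is cut off in this export, so the author's remaining steps are not shown. As a rough sketch of how such an n-best span search commonly proceeds (an assumption, not the original code): take the best `n_best` start and end positions, discard pairs that fall outside the context, end before they start, or exceed `max_answer_length`, and keep the pair with the highest summed logit. The helper `best_answer` and the toy inputs are hypothetical.

```python
def best_answer(start_logit, end_logit, offsets, context, n_best=20, max_answer_length=30):
    """Return the highest-scoring valid span as a dict, or None if no span is valid."""
    # Positions of the n_best largest logits, best first.
    start_indexes = sorted(range(len(start_logit)), key=start_logit.__getitem__, reverse=True)[:n_best]
    end_indexes = sorted(range(len(end_logit)), key=end_logit.__getitem__, reverse=True)[:n_best]
    candidates = []
    for s in start_indexes:
        for e in end_indexes:
            # Skip tokens outside the context (their offsets were set to None).
            if offsets[s] is None or offsets[e] is None:
                continue
            # Skip spans that end before they start or that are too long.
            if e < s or e - s + 1 > max_answer_length:
                continue
            candidates.append({
                "text": context[offsets[s][0]:offsets[e][1]],
                "logit_score": start_logit[s] + end_logit[e],
            })
    return max(candidates, key=lambda c: c["logit_score"]) if candidates else None


# Toy feature: 3 non-context tokens, 3 context tokens, 1 trailing special token.
context = "Denver Broncos won"
offsets = [None, None, None, (0, 6), (7, 14), (15, 18), None]
start_logit = [1.0, -5.0, -5.0, 6.0, 0.5, -1.0, -4.0]
end_logit = [0.5, -5.0, -5.0, 1.0, 7.0, 0.0, -4.0]

answer = best_answer(start_logit, end_logit, offsets, context)
print(answer)
```

With several features per example, the same scoring would be applied to every feature and the best candidate across all of them kept as the prediction.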