# Finetuning Question Answering on RoBERTa


**Fine-tuning RoBERTa for Extractive Question Answering in PyTorch**

In [133]:
import torch
torch.cuda.empty_cache()  # call the function so cached GPU memory is actually released

Install the transformers and datasets libraries

In [134]:
!pip install -q transformers datasets

Import libraries

In [135]:
import numpy as np
import pandas as pd

import torch
from torch.optim import Adam
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline


### Load the dataset

In [108]:
from datasets import load_dataset

### Load small train and validation slices to keep training fast

In [109]:
train_data = load_dataset('squad_v2', split='train[:1%]')
valid_data = load_dataset('squad_v2', split='validation[:3%]')



In [110]:
train_data[0]

{'id': '56be85543aeaaa14008c9063',
 'title': 'Beyoncé',
 'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
 'question': 'When did Beyonce start becoming popular?',
 'answers': {'text': ['in the late 1990s'], 'answer_start': [269]}}
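The `answer_start` field is a character offset into `context`, and in SQuAD v2 an unanswerable question has empty `text` and `answer_start` lists. A minimal sketch of reading both cases, assuming two made-up toy records that only mimic the schema above; the helper name `answer_span` is hypothetical:

```python
# Toy records that mimic the SQuAD v2 schema; the contents are invented for illustration.
answerable = {
    "context": "The library opened in 1998 near the river.",
    "question": "When did the library open?",
    "answers": {"text": ["1998"], "answer_start": [22]},
}
unanswerable = {
    "context": "The library opened in 1998 near the river.",
    "question": "Who designed the library?",
    "answers": {"text": [], "answer_start": []},
}

def answer_span(example):
    """Return (start, end) character offsets of the first answer, or None if there is none."""
    answers = example["answers"]
    if not answers["text"]:
        return None
    start = answers["answer_start"][0]
    return start, start + len(answers["text"][0])

span = answer_span(answerable)
print(answerable["context"][span[0]:span[1]])  # slice the context with the offsets
print(answer_span(unanswerable))
```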

### Getting correct answer text alignment and tokenizing the dataset

In [111]:
# Dataset cleaning and tokenization
# A fast tokenizer is required: the pure-Python tokenizers do not provide char_to_token

def correct_alignment(context, answer):
    """Correct answer offsets in SQuAD that are sometimes off by one or two characters,
    and compute the answer's end position.

    inputs: a context string and its answers dict (with 'text' and 'answer_start' lists);
            assumes at least one answer, so unanswerable SQuAD v2 examples must be skipped
    outputs: (start_idx, end_idx) character offsets with context[start_idx:end_idx] == answer text
    """
    start_text = answer['text'][0]
    start_idx = answer['answer_start'][0]
    end_idx = start_idx + len(start_text)

    # When alignment is okay
    if context[start_idx:end_idx] == start_text:
        return start_idx, end_idx
    # When alignment is off by 1 character
    elif context[start_idx-1:end_idx-1] == start_text:
        return start_idx-1, end_idx-1
    # When alignment is off by 2 characters
    elif context[start_idx-2:end_idx-2] == start_text:
        return start_idx-2, end_idx-2
    else:
        raise ValueError(f"Could not align answer {start_text!r} at offset {start_idx}")
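The off-by-one and off-by-two checks can be exercised on a toy string without downloading anything. The sketch below is a standalone variant, not the notebook's function: `realign_answer` and its `max_shift` parameter are hypothetical names, and the loop simply generalizes the two `elif` branches into leftward shifts of up to `max_shift` characters.

```python
def realign_answer(context, text, start, max_shift=2):
    """Return (start, end) such that context[start:end] == text.

    Tries the annotated start first, then shifts left by 1..max_shift characters.
    Raises ValueError if no shift produces a match.
    """
    for shift in range(max_shift + 1):
        candidate = start - shift
        if candidate < 0:
            break  # cannot shift past the beginning of the context
        end = candidate + len(text)
        if context[candidate:end] == text:
            return candidate, end
    raise ValueError(f"could not align {text!r} near offset {start}")

context = "She rose to fame in the late 1990s."
answer_text = "in the late 1990s"

print(realign_answer(context, answer_text, 17))  # annotation already correct
print(realign_answer(context, answer_text, 18))  # annotation one character too far right
print(realign_answer(context, answer_text, 19))  # annotation two characters too far right
```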

### Load the pretrained model and tokenizer

In [136]:
model_name = "deepset/roberta-base-squad2"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

## Asking a Question

In [137]:
random_num = np.random.randint(0, len(train_data))
example = train_data[random_num]  # fetch one row instead of indexing whole columns
question = example["question"]
text = example["context"]

## Let’s see how many tokens this question and text pair has.

In [138]:
input_ids = tokenizer.encode(question, text)
print("The input has a total of {} tokens.".format(len(input_ids)))

The input has a total of 249 tokens.


## To look at what our tokenizer is doing, let’s just print out the tokens and their IDs.

In [139]:
tokens = tokenizer.convert_ids_to_tokens(input_ids)
for token, token_id in zip(tokens, input_ids):  # avoid shadowing the built-in id()
    print('{:8}{:8,}'.format(token, token_id))

<s>            0
On         4,148
Ġwhat         99
Ġmagazine   4,320
Ġwas          21
Ġshe          79
Ġthe           5
Ġcover     1,719
Ġmodel     1,421
?            116
</s>           2
</s>           2
At         3,750
Ġthe           5
Ġ57        4,981
th           212
ĠAnnual    7,453
ĠGrammy   12,727
ĠAwards    4,229
Ġin           11
ĠFebruary     902
Ġ2015        570
,              6
ĠBeyon    12,674
cÃ©       12,695
Ġwas          21
Ġnominated   7,076
Ġfor          13
Ġsix         411
Ġawards    4,188
,              6
Ġultimately   3,284
Ġwinning   1,298
Ġthree       130
:             35
ĠBest      2,700
ĠR           248
&            947
B            387
ĠPerformance  10,193
Ġand           8
ĠBest      2,700
ĠR           248
&            947
B            387
ĠSong     10,264
Ġfor          13
Ġ"            22
Dr        14,043
unk        6,435
Ġin           11
ĠLove      3,437
",         1,297
Ġand           8
ĠBest      2,700
ĠSur       6,544
round      3,431
ĠSound     8,479
ĠAl
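In the printout, `Ġ` is how RoBERTa's byte-level BPE vocabulary marks a token that begins with a space. As a simplified sketch of that convention, the snippet below rebuilds the question from tokens copied from the output. The helper name `join_bpe_tokens` is hypothetical, and it deliberately ignores the byte-to-character mapping that produces pieces such as `cÃ©`.

```python
def join_bpe_tokens(tokens):
    """Rebuild text from byte-level BPE tokens by turning the 'Ġ' marker back into a space."""
    return "".join(tokens).replace("Ġ", " ")

# Question tokens as printed by the notebook, without the special tokens
question_tokens = ["On", "Ġwhat", "Ġmagazine", "Ġwas", "Ġshe", "Ġthe", "Ġcover", "Ġmodel", "?"]
print(join_bpe_tokens(question_tokens))
```

In practice, `tokenizer.convert_tokens_to_string(tokens)` or `tokenizer.decode(input_ids)` handles the full mapping, including non-ASCII characters.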

In [140]:
# first occurrence of the separator token (</s> for RoBERTa, the counterpart of BERT's [SEP])
sep_idx = input_ids.index(tokenizer.sep_token_id)
print("SEP token index: ", sep_idx)
# number of tokens in segment A (question): one more than sep_idx because Python indices start at 0
num_seg_a = sep_idx+1
print("Number of tokens in segment A: ", num_seg_a)
# number of tokens in segment B (text), including the second </s> that RoBERTa inserts between the segments
num_seg_b = len(input_ids) - num_seg_a
print("Number of tokens in segment B: ", num_seg_b)
# creating the segment ids (for illustration only: RoBERTa has a single token type, so these are not passed to the model)
segment_ids = [0]*num_seg_a + [1]*num_seg_b
# making sure that every input token has a segment id
assert len(segment_ids) == len(input_ids)

SEP token index:  10
Number of tokens in segment A:  11
Number of tokens in segment B:  238
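RoBERTa encodes a pair as `<s> question </s></s> context </s>`, so two separator tokens sit between the segments, as the token printout shows. A minimal sketch of splitting such a sequence into question and context ids follows. The helper `split_pair` is a hypothetical name, and the id list is hand-made from the first few printed ids, with the context shortened to four tokens.

```python
SEP_ID = 2  # id of </s> in the token printout

def split_pair(input_ids, sep_id=SEP_ID):
    """Split `<s> A </s></s> B </s>` into the ids of A and B, dropping special tokens."""
    first_sep = input_ids.index(sep_id)        # position of the </s> that ends the question
    question_ids = input_ids[1:first_sep]      # skip the leading <s>
    context_ids = input_ids[first_sep + 2:-1]  # skip </s></s> and the final </s>
    return question_ids, context_ids

# <s> On what magazine was she the cover model ? </s></s> At the 57 th </s>
ids = [0, 4148, 99, 4320, 21, 79, 5, 1719, 1421, 116, 2, 2, 3750, 5, 4981, 212, 2]
question_ids, context_ids = split_pair(ids)
print(ids.index(SEP_ID), len(question_ids), len(context_ids))
```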


## Let’s now feed this to our model.

In [142]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [143]:
model.to(device)

RobertaForQuestionAnswering(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (Lay

## Creating a function to perform question answering

In [144]:
def sample(sample_question, sample_context):
    """Return the answer span the model predicts for a question/context pair."""
    # Encode the pair as one sequence, truncated to the 512-token limit
    input_ids = tokenizer.encode(sample_question, sample_context, truncation=True, max_length=512, return_tensors='pt').to(device)

    # Inference only, so gradients are not needed
    with torch.no_grad():
        outputs = model(input_ids=input_ids)
        start_logits = outputs.start_logits
        end_logits = outputs.end_logits

    # Most likely start and end token positions
    start_index = torch.argmax(start_logits)
    end_index = torch.argmax(end_logits)

    # Decode the tokens from start to end (inclusive) back into text
    answer = tokenizer.decode(input_ids[0][start_index:end_index+1], skip_special_tokens=True).strip()

    return answer
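Taking `argmax` of the start and end logits independently can produce an end index that comes before the start index, which decodes to an empty answer. A common alternative, shown here as a sketch rather than as the notebook's method, scores every pair with `start <= end` within a length limit and keeps the best one. The helper `best_span`, its `max_answer_len` parameter, and the toy logits are assumptions for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e] with s <= e."""
    best = None
    for s, s_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_score + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

# Toy logits: independent argmax would pick start=2 and end=1, an invalid span
start_logits = [0.1, 0.2, 3.0, 0.5, 0.1]
end_logits = [0.0, 4.0, 0.3, 2.5, 0.2]

start, end, score = best_span(start_logits, end_logits)
print(start, end, score)
```

The function works on plain Python lists; with the tensors from `sample`, something like `outputs.start_logits[0].tolist()` would give a compatible input.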

In [146]:

sample_question = "When did the Kathmandu valley monuments receive WHS status?"
sample_context = "The ancient trade route between India and Tibet that passed through Kathmandu enabled a fusion of artistic and architectural traditions from other cultures to be amalgamated with local art and architecture. The monuments of Kathmandu City have been influenced over the centuries by Hindu and Buddhist religious practices. The architectural treasure of the Kathmandu valley has been categorized under the well-known seven groups of heritage monuments and buildings. In 2006 UNESCO declared these seven groups of monuments as a World Heritage Site (WHS). The seven monuments zones cover an area of 188.95 hectares (466.9 acres), with the buffer zone extending to 239.34 hectares (591.4 acres). The Seven Monument Zones (Mzs) inscribed originally in 1979 and with a minor modification in 2006 are Durbar squares of Hanuman Dhoka, Patan and Bhaktapur, Hindu temples of Pashupatinath and Changunarayan, the Buddhist stupas of Swayambhu and Boudhanath."

answer = sample(sample_question, sample_context)
print("Sample Question:", sample_question)
print("Sample Context:", sample_context)
print("Answer:", answer)






Sample Question: When did the Kathmandu valley monuments receive WHS status?
Sample Context: The ancient trade route between India and Tibet that passed through Kathmandu enabled a fusion of artistic and architectural traditions from other cultures to be amalgamated with local art and architecture. The monuments of Kathmandu City have been influenced over the centuries by Hindu and Buddhist religious practices. The architectural treasure of the Kathmandu valley has been categorized under the well-known seven groups of heritage monuments and buildings. In 2006 UNESCO declared these seven groups of monuments as a World Heritage Site (WHS). The seven monuments zones cover an area of 188.95 hectares (466.9 acres), with the buffer zone extending to 239.34 hectares (591.4 acres). The Seven Monument Zones (Mzs) inscribed originally in 1979 and with a minor modification in 2006 are Durbar squares of Hanuman Dhoka, Patan and Bhaktapur, Hindu temples of Pashupatinath and Changunarayan, the Buddh