# Natural Language Questions

Dataset: https://hotpotqa.github.io/
HotpotQA is a question-answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question-answering systems. It was collected by a team of NLP researchers at Carnegie Mellon University, Stanford University, and Université de Montréal.

In [1]:
import json

In [2]:
with open('train_set.json', 'r') as f:
  data = json.load(f)

In [3]:
# Number of question/context records in the training set
len(data)

90447

## Data Prep

### Data Exploration

In [4]:
type(data[1])  # each record is a dict
type(data)     # the dataset itself is a list (only this last value is displayed)

list

In [5]:
data[:3]

[{'supporting_facts': [["Arthur's Magazine", 0], ['First for Women', 0]],
  'level': 'medium',
  'question': "Which magazine was started first Arthur's Magazine or First for Women?",
  'context': [['Radio City (Indian radio station)',
    ["Radio City is India's first private FM radio station and was started on 3 July 2001.",
     ' It broadcasts on 91.1 (earlier 91.0 in most cities) megahertz from Mumbai (where it was started in 2004), Bengaluru (started first in 2001), Lucknow and New Delhi (since 2003).',
     ' It plays Hindi, English and regional songs.',
     ' It was launched in Hyderabad in March 2006, in Chennai on 7 July 2006 and in Visakhapatnam October 2007.',
     ' Radio City recently forayed into New Media in May 2008 with the launch of a music portal - PlanetRadiocity.com that offers music related news, videos, songs, and other music-related features.',
     ' The Radio station currently plays a mix of Hindi and Regional music.',
     ' Abraham Thomas is the CEO of the 

In [6]:
print(data[0].keys())

dict_keys(['supporting_facts', 'level', 'question', 'context', 'answer', '_id', 'type'])
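
The record printed earlier suggests that each `supporting_facts` entry is a `[title, sentence index]` pair pointing into `context`. A minimal sketch of that lookup, under that assumption, on a toy record with placeholder sentences (`supporting_sentences` is a hypothetical helper, not part of the notebook):

```python
# Toy record mirroring the layout of data[0]; titles and sentences are placeholders.
record = {
    "supporting_facts": [["Title A", 0], ["Title B", 1]],
    "context": [
        ["Title A", ["A first sentence.", " A second sentence."]],
        ["Title B", ["B first sentence.", " B second sentence."]],
    ],
}

def supporting_sentences(rec):
    """Return the sentences referenced by rec['supporting_facts']."""
    # Index the paragraphs by title for direct lookup
    by_title = {title: sentences for title, sentences in rec["context"]}
    found = []
    for title, sent_idx in rec["supporting_facts"]:
        sentences = by_title.get(title, [])
        # Skip references that fall outside the paragraph
        if 0 <= sent_idx < len(sentences):
            found.append(sentences[sent_idx].strip())
    return found

print(supporting_sentences(record))
```

On the real data the same call would be `supporting_sentences(data[0])`, assuming the pair layout holds for every record.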


In [7]:
type(data[0]['context'])

list

In [8]:
print(data[0]['context'])

[['Radio City (Indian radio station)', ["Radio City is India's first private FM radio station and was started on 3 July 2001.", ' It broadcasts on 91.1 (earlier 91.0 in most cities) megahertz from Mumbai (where it was started in 2004), Bengaluru (started first in 2001), Lucknow and New Delhi (since 2003).', ' It plays Hindi, English and regional songs.', ' It was launched in Hyderabad in March 2006, in Chennai on 7 July 2006 and in Visakhapatnam October 2007.', ' Radio City recently forayed into New Media in May 2008 with the launch of a music portal - PlanetRadiocity.com that offers music related news, videos, songs, and other music-related features.', ' The Radio station currently plays a mix of Hindi and Regional music.', ' Abraham Thomas is the CEO of the company.']], ['History of Albanian football', ['Football in Albania existed before the Albanian Football Federation (FSHF) was created.', " This was evidenced by the team's registration at the Balkan Cup tournament during 1929-193

In [2]:
from collections.abc import Iterable

def flatten(xs):
    for x in xs:
        if isinstance(x, Iterable) and not isinstance(x, (str, bytes)):
            yield from flatten(x)
        else:
            yield x
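
A small usage sketch of `flatten` on a toy nested list shaped like one `context` entry. It also shows that the returned generator can be consumed only once, which matters when the same object is joined twice in separate cells:

```python
from collections.abc import Iterable

def flatten(xs):
    for x in xs:
        if isinstance(x, Iterable) and not isinstance(x, (str, bytes)):
            yield from flatten(x)
        else:
            yield x

# Toy input: one [title, [sentences]] pair with placeholder text
nested = [["Title", ["First sentence.", " Second sentence."]]]

gen = flatten(nested)
first = ''.join(gen)    # consumes the generator
second = ''.join(gen)   # nothing left to yield

print(repr(first))
print(repr(second))
```

Wrapping the call in `list(...)` is one way to keep the flattened items around for repeated use.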

In [16]:
flatlist = flatten(data[0]['context'])

In [17]:
type(''.join(flatlist))

str

In [18]:
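# flatlist is a generator that the join in the previous cell already exhausted,
# so this prints an empty string; call flatten() again to get the text back
print(''.join(flatlist))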




In [19]:
print(data[0]['question'])

Which magazine was started first Arthur's Magazine or First for Women?


In [20]:
contexts = []
questions = []
answers = []

for group in data:
    contexts.append(''.join(flatten(group['context'])))
    questions.append(group['question'])
    answers.append(group['answer'])
   
print(len(contexts))
print(len(questions))
print(len(answers))

90447
90447
90447


In [21]:
contexts[0]

'Radio City (Indian radio station)Radio City is India\'s first private FM radio station and was started on 3 July 2001. It broadcasts on 91.1 (earlier 91.0 in most cities) megahertz from Mumbai (where it was started in 2004), Bengaluru (started first in 2001), Lucknow and New Delhi (since 2003). It plays Hindi, English and regional songs. It was launched in Hyderabad in March 2006, in Chennai on 7 July 2006 and in Visakhapatnam October 2007. Radio City recently forayed into New Media in May 2008 with the launch of a music portal - PlanetRadiocity.com that offers music related news, videos, songs, and other music-related features. The Radio station currently plays a mix of Hindi and Regional music. Abraham Thomas is the CEO of the company.History of Albanian footballFootball in Albania existed before the Albanian Football Federation (FSHF) was created. This was evidenced by the team\'s registration at the Balkan Cup tournament during 1929-1931, which started in 1929 (although Albania ev

In [3]:
def read_data(path):
    with open(path, 'r') as f:
        data = json.load(f)

    contexts = []
    questions = []
    answers = []

    for group in data:
        # Remove yes/no questions, whose answers are not found in the context.
        # Note: this is a substring test, so it also drops any answer that
        # merely contains "yes" or "no" (e.g. "Illinois").
        if "yes" in group['answer'] or "no" in group['answer']:
            continue
        # Concatenate every title and sentence into one context string
        contexts.append(''.join(flatten(group['context'])))
        questions.append(group['question'])
        answers.append(group['answer'])

    return contexts, questions, answers
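
The filter in `read_data` uses a substring test. A hedged sketch of an exact-match alternative, assuming yes/no answers are stored as the whole strings `yes` and `no` (`is_yes_no` is a hypothetical helper; swapping it in would change the record counts reported in this notebook):

```python
def is_yes_no(answer):
    """True only when the whole answer is 'yes' or 'no' (case-insensitive)."""
    return answer.strip().lower() in {"yes", "no"}

samples = ["yes", "no", "Illinois", "director"]

# Substring test, as used in read_data
substring_hits = [a for a in samples if "yes" in a or "no" in a]
# Exact-match test
exact_hits = [a for a in samples if is_yes_no(a)]

print(substring_hits)
print(exact_hits)
```

The substring test also catches "Illinois" because it contains "no", while the exact-match test keeps it.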

In [4]:
train_contexts, train_questions, train_answers = read_data('train_set.json')
val_contexts, val_questions, val_answers = read_data('dev_set.json')

In [175]:
print(len(train_answers))
print(len(val_answers))

83159
6768


In [176]:
train_answers

["Arthur's Magazine",
 'Delhi',
 'President Richard Nixon',
 'American',
 'alcohol',
 'Jonathan Stark',
 'Crambidae',
 'Badr Hari',
 '2006',
 '6.213 km long',
 'Jaime Meline',
 'Walter Darwin Coy',
 'United States',
 'Super Bowl XLVIII',
 'US 60',
 '2006',
 'Hetfield and Ulrich, longtime lead guitarist Kirk Hammett, and bassist Robert Trujillo.',
 'Fox',
 '2017',
 'Nevada',
 'Hawaii',
 'Kelli Ward',
 'The Wolfhounds',
 '16-year-old',
 'World War II',
 'Todd Phillips',
 'Carol Lawrence',
 'New York City',
 'Amy Jo Johnson',
 'Aleksander Ford',
 'director',
 'Roseau, Minnesota, USA',
 'The Saimaa Gesture',
 'David Lee Roth',
 'Nassau County',
 'Australia',
 'California',
 'Dessau',
 'Roman Catholic',
 'The Joshua Tree',
 'Ulster County',
 'Tammy Wynette',
 'Sir Francis Nethersole',
 'ingredients in beer',
 'Dennis Howard Marks',
 'Robert Sheehan',
 'Glenn Hughes',
 'March 28, 1941',
 'standard gauge track',
 'Sex and the City',
 'Kato',
 'England',
 'Robert Zemeckis',
 '1932',
 'nine',
 

In [177]:
train_contexts[0].find(train_answers[0])

3015
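
`str.find` reports only the first occurrence, so an answer that appears several times in a concatenated context is always labelled at its earliest position. A minimal sketch for listing every occurrence, on a made-up context string (`find_all` is a hypothetical helper):

```python
def find_all(text, needle):
    """Return the start offsets of every occurrence of needle in text."""
    positions = []
    start = text.find(needle)
    while start != -1:
        positions.append(start)
        # Resume the search one character after the previous hit
        start = text.find(needle, start + 1)
    return positions

context = "Delhi is a city. New Delhi is the capital."
print(find_all(context, "Delhi"))
```

Checking `len(find_all(context, answer)) > 1` is one way to flag examples whose span label may be ambiguous.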

In [5]:
def update_train_answers(answers, contexts):
    temp = []
    for answer, context in zip(answers, contexts):
        # str.find returns the first exact occurrence, or -1 when absent
        start_idx = context.find(answer)
        if start_idx == -1:
            # Answers missing from the context (e.g. yes/no) are reported and
            # skipped, which would leave the result shorter than `contexts`
            print(answer)
            continue
        end_idx = start_idx + len(answer)  # exclusive end offset
        temp.append({'text': answer, 'answer_start': start_idx, 'answer_end': end_idx})
    return temp

train_answers_n = update_train_answers(train_answers,train_contexts)
val_answers_n = update_train_answers(val_answers,val_contexts)

In [179]:
print(train_answers_n[0:10])
print(train_contexts[1][train_answers_n[1]['answer_start']:train_answers_n[1]['answer_end']])

[{'text': "Arthur's Magazine", 'answer_start': 3015, 'answer_end': 3032}, {'text': 'Delhi', 'answer_start': 3177, 'answer_end': 3182}, {'text': 'President Richard Nixon', 'answer_start': 3327, 'answer_end': 3350}, {'text': 'American', 'answer_start': 1891, 'answer_end': 1899}, {'text': 'alcohol', 'answer_start': 195, 'answer_end': 202}, {'text': 'Jonathan Stark', 'answer_start': 4523, 'answer_end': 4537}, {'text': 'Crambidae', 'answer_start': 1796, 'answer_end': 1805}, {'text': 'Badr Hari', 'answer_start': 2492, 'answer_end': 2501}, {'text': '2006', 'answer_start': 5261, 'answer_end': 5265}, {'text': '6.213 km long', 'answer_start': 337, 'answer_end': 350}]
Delhi


In [180]:
print(len(train_answers_n))
print(len(val_answers_n))

83159
6768


## Tokenize/Encode

In [10]:
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')



In [11]:
train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)

In [183]:
train_encodings.keys()

dict_keys(['input_ids', 'attention_mask'])

In [184]:
type(train_answers_n[0])

dict

In [12]:
def add_token_answers(encodings, answers):
    start_positions = []
    end_positions = []
    for i in range(len(answers)):
        start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
        end_positions.append(encodings.char_to_token(i, answers[i]['answer_end']))

        # if start position is None, the answer passage has been truncated
        if start_positions[-1] is None:
            start_positions[-1] = tokenizer.model_max_length
        # answer_end is exclusive, so char_to_token may land on a space or on
        # truncated text and return None; step back one character at a time
        # until a token is found. For a truncated answer this stops at the
        # last surviving token, so a start of 512 can pair with an end below 512.
        go_back = 1
        while end_positions[-1] is None:
            end_positions[-1] = encodings.char_to_token(i, answers[i]['answer_end'] - go_back)
            go_back += 1
    encodings.update({
        'start_positions': start_positions,
        'end_positions': end_positions,
    })
    # sanity check: both lines should print False
    print(None in start_positions)
    print(None in end_positions)
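
A hedged sketch of an alternative mapping that marks both positions with the same sentinel whenever any part of the answer is lost, instead of pairing a sentinel start with a stepped-back end. It differs from the notebook's function: it looks up the last answer character (`end_char - 1`) rather than the exclusive end, and it takes a plain callable so it runs without a tokenizer. `span_to_token_positions` and `fake_map` are hypothetical names:

```python
def span_to_token_positions(char_to_token, start_char, end_char, sentinel):
    """Map the character span [start_char, end_char) to token indices.

    char_to_token: callable returning a token index, or None if unmapped.
    sentinel: value stored for both positions when the span is not fully mapped.
    """
    start_tok = char_to_token(start_char)
    # end_char is exclusive, so the last answer character is end_char - 1
    end_tok = char_to_token(end_char - 1)
    if start_tok is None or end_tok is None:
        return sentinel, sentinel
    return start_tok, end_tok

# Stand-in for a tokenizer mapping: characters 0-9 survive truncation,
# two characters per token; everything else maps to None.
def fake_map(char_idx):
    return char_idx // 2 if 0 <= char_idx < 10 else None

print(span_to_token_positions(fake_map, 2, 6, 512))    # fully inside the window
print(span_to_token_positions(fake_map, 8, 14, 512))   # end is truncated
print(span_to_token_positions(fake_map, 20, 25, 512))  # fully truncated
```

With a real encoding, the callable could be something like `lambda c: encodings.char_to_token(i, c)` for example `i`.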

In [13]:
add_token_answers(train_encodings,train_answers_n)

False
False


In [14]:
add_token_answers(val_encodings,val_answers_n)

False
False


In [188]:
train_encodings.keys()
val_encodings.keys()

dict_keys(['input_ids', 'attention_mask', 'start_positions', 'end_positions'])

In [189]:
print(train_encodings['start_positions'][1])
print(train_encodings['end_positions'][1])

512
490


In [15]:
import torch

class NLDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

train_dataset = NLDataset(train_encodings)
val_dataset = NLDataset(val_encodings)


In [191]:
print(train_dataset)

<__main__.NLDataset object at 0x7fc6f6b61640>


## Fine-tuning

In [192]:
from transformers import DistilBertForQuestionAnswering
model = DistilBertForQuestionAnswering.from_pretrained('distilbert-base-uncased')

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this mode

In [193]:
from torch.utils.data import DataLoader
# Note: transformers.AdamW is deprecated in recent releases;
# torch.optim.AdamW is the suggested replacement
from transformers import AdamW
from tqdm import tqdm

# setup GPU/CPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# move model over to detected device
model.to(device)
# activate training mode of model
model.train()
# initialize AdamW optimizer (Adam with decoupled weight decay, which helps reduce overfitting)
optim = AdamW(model.parameters(), lr=5e-5)

# initialize data loader for training data
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

for epoch in range(3):
    # set model to train mode
    model.train()
    # setup loop (we use tqdm for the progress bar)
    loop = tqdm(train_loader, leave=True)
    
    for batch in loop:
        # reset gradients left over from the previous step
        optim.zero_grad()
        # pull all the tensor batches required for training
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        # train model on batch and return outputs (incl. loss)
        outputs = model(input_ids, attention_mask=attention_mask,
                        start_positions=start_positions,
                        end_positions=end_positions)
        # extract loss
        loss = outputs[0]
        # backpropagate: compute gradients for every parameter that requires them
        loss.backward()
        # update parameters
        optim.step()
        # print relevant info to progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())

Epoch 0: 100%|██████████| 5198/5198 [15:58:07<00:00, 11.06s/it, loss=0.855]  
Epoch 1: 100%|██████████| 5198/5198 [15:35:00<00:00, 10.79s/it, loss=0.66]   
Epoch 2: 100%|██████████| 5198/5198 [15:53:43<00:00, 11.01s/it, loss=0.828]   
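
The step counts in the progress bars follow from the dataset sizes and the batch size of 16. A short arithmetic check (`steps_per_epoch` is a hypothetical helper, assuming the default `DataLoader` behaviour of keeping the final partial batch):

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(num_examples / batch_size)

print(steps_per_epoch(83159, 16))  # training set -> 5198
print(steps_per_epoch(6768, 16))   # validation set -> 423
```

The training set leaves a final batch of 7 examples (83159 = 16 × 5197 + 7), while the validation set divides evenly into 423 batches.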


In [194]:
model_path = 'models/distilbert-custom'
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

('models/distilbert-custom/tokenizer_config.json',
 'models/distilbert-custom/special_tokens_map.json',
 'models/distilbert-custom/vocab.txt',
 'models/distilbert-custom/added_tokens.json',
 'models/distilbert-custom/tokenizer.json')

In [195]:
# switch model out of training mode
model.eval()

#val_sampler = SequentialSampler(val_dataset)
val_loader = DataLoader(val_dataset, batch_size=16)

acc = []

# initialize loop for progress bar
loop = tqdm(val_loader)
# loop through batches
for batch in loop:
    # we don't need to calculate gradients as we're not training
    with torch.no_grad():
        # pull batched items from loader
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_true = batch['start_positions'].to(device)
        end_true = batch['end_positions'].to(device)
        # make predictions
        outputs = model(input_ids, attention_mask=attention_mask)
        # pull preds out
        start_pred = torch.argmax(outputs['start_logits'], dim=1)
        end_pred = torch.argmax(outputs['end_logits'], dim=1)
        # calculate accuracy for both and append to accuracy list
        acc.append(((start_pred == start_true).sum()/len(start_pred)).item())
        acc.append(((end_pred == end_true).sum()/len(end_pred)).item())
# calculate average accuracy in total
acc = sum(acc)/len(acc)

100%|██████████| 423/423 [21:14<00:00,  3.01s/it]


In [196]:
print("T/F\tstart\tend\n")
for i in range(len(start_true)):
    print(f"true\t{start_true[i]}\t{end_true[i]}\n"
          f"pred\t{start_pred[i]}\t{end_pred[i]}\n")

T/F	start	end

true	512	494
pred	23	494

true	37	43
pred	37	126

true	323	327
pred	323	327

true	512	471
pred	316	471

true	512	489
pred	356	489

true	512	496
pred	279	281

true	512	493
pred	283	493

true	47	49
pred	43	49

true	512	494
pred	207	494

true	267	269
pred	364	4

true	512	490
pred	37	490

true	344	347
pred	344	347

true	512	491
pred	465	491

true	401	405
pred	401	402

true	71	74
pred	43	45

true	485	488
pred	152	156
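
Many of the true start positions printed above equal 512, the value stored when the answer was truncated away. A minimal sketch for measuring how common that is, using the 16 true start positions from this batch (`truncated_fraction` is a hypothetical helper):

```python
def truncated_fraction(start_positions, sentinel):
    """Share of examples whose start position equals the truncation sentinel."""
    if not start_positions:
        return 0.0
    truncated = sum(1 for pos in start_positions if pos == sentinel)
    return truncated / len(start_positions)

# True start positions copied from the batch printed above
batch_starts = [512, 37, 323, 512, 512, 512, 512, 47,
                512, 267, 512, 344, 512, 401, 71, 485]

print(truncated_fraction(batch_starts, 512))  # 0.5
```

Running the same function over `val_encodings['start_positions']` would give the share for the whole validation set; that figure is not computed in this notebook.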



In [197]:
print(acc)

0.3964243498817967
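
The figure above averages start and end matches separately, so a prediction with the right end but the wrong start still earns half credit. A hedged sketch that also reports how often both positions match at once, using plain lists and the first four rows of the batch printed earlier (`span_accuracy` is a hypothetical helper, not the notebook's metric):

```python
def span_accuracy(start_true, end_true, start_pred, end_pred):
    """Return per-position accuracy and exact-span accuracy."""
    total = len(start_true)
    if total == 0:
        return {"position": 0.0, "exact_span": 0.0}
    start_hits = sum(t == p for t, p in zip(start_true, start_pred))
    end_hits = sum(t == p for t, p in zip(end_true, end_pred))
    # A span counts only when start and end are both correct
    span_hits = sum(
        st == sp and et == ep
        for st, et, sp, ep in zip(start_true, end_true, start_pred, end_pred)
    )
    return {
        "position": (start_hits + end_hits) / (2 * total),
        "exact_span": span_hits / total,
    }

# First four true/pred rows from the printed batch
scores = span_accuracy(
    start_true=[512, 37, 323, 512],
    end_true=[494, 43, 327, 471],
    start_pred=[23, 37, 323, 316],
    end_pred=[494, 126, 327, 471],
)
print(scores)
```

On these four rows the per-position score is 0.625 while only one span in four is fully correct.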
