In this notebook, I implement an extractive question answerer with Hugging Face, using the bAbI tasks that Meta's AI research division created to advance NLP. Question answering is a branch of NLP that trains models to process a question, understand its meaning, and produce the correct answer. In the extractive setting used here, the model identifies the span of the context that contains the answer rather than generating one. This is my first venture into NLP, done as a learning experience, and the contents of this notebook are inspired by material from DeepLearning.AI's class on Coursera.

In [1]:
import numpy as np
import pandas as pd
import random
import os
from copy import deepcopy
from datasets import load_from_disk, Dataset
#from datasets.dataset_dict import DatasetDic
from transformers import DistilBertTokenizerFast, DistilBertForQuestionAnswering, Trainer, TrainingArguments
from sklearn.metrics import f1_score
import torch



The format of this data is as follows:
- answer: the one-word answer to the question (an empty string for context sentences)
- id: the id of each sentence within the story, starting from 1
- supporting_ids: the ids of the sentences that support the answer (empty for context sentences)
- text: the English text of each sentence
- type: 0 if the sentence gives context and 1 if the sentence is the question
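
To make the format concrete, here is a minimal sketch that uses the type flags to separate the context sentences from the question. The helper name split_story is my own for illustration (it is not part of any library), and the sample story is written in the format described above.

```python
def split_story(story):
    """Split one bAbI story dict into (context, question, answer)."""
    context, question, answer = [], None, None
    for text, kind, ans in zip(story["text"], story["type"], story["answer"]):
        if kind == 0:
            # type 0: a context sentence
            context.append(text)
        else:
            # type 1: the question, which is the only entry with an answer
            question, answer = text, ans
    return " ".join(context), question, answer


# Sample story in the documented format (hypothetical variable name)
sample = {
    "answer": ["", "", "kitchen"],
    "id": ["1", "2", "3"],
    "supporting_ids": [[], [], ["1"]],
    "text": [
        "The kitchen is west of the hallway.",
        "The garden is east of the hallway.",
        "What is the hallway east of?",
    ],
    "type": [0, 0, 1],
}

context, question, answer = split_story(sample)
print(context)
print(question, "->", answer)
```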

In [2]:
babi_dataset = load_from_disk('/kaggle/input/babl-question-answering/data')
print(type(babi_dataset))
for i in range(3):
    # randrange excludes the upper bound; randint includes it and could index out of range
    idx = random.randrange(len(babi_dataset['train']))
    print('index:', idx)
    print(babi_dataset['train'][idx])

<class 'datasets.dataset_dict.DatasetDict'>
index: 420
{'story': {'answer': ['', '', 'kitchen'], 'id': ['1', '2', '3'], 'supporting_ids': [[], [], ['1']], 'text': ['The kitchen is west of the hallway.', 'The garden is east of the hallway.', 'What is the hallway east of?'], 'type': [0, 0, 1]}}
index: 77
{'story': {'answer': ['', '', 'bathroom'], 'id': ['1', '2', '3'], 'supporting_ids': [[], [], ['1']], 'text': ['The bathroom is south of the garden.', 'The kitchen is north of the garden.', 'What is the garden north of?'], 'type': [0, 0, 1]}}
index: 57
{'story': {'answer': ['', '', 'garden'], 'id': ['1', '2', '3'], 'supporting_ids': [[], [], ['1']], 'text': ['The garden is west of the bedroom.', 'The office is west of the garden.', 'What is west of the bedroom?'], 'type': [0, 0, 1]}}


The last field of each story is 'type', which marks each sentence as context (0) or a question (1). We can check which type patterns occur by collecting them in a set. Luckily, every story in the training set follows the same pattern: two context sentences followed by one question.

In [3]:
type_set = set()
for story in babi_dataset['train']:
    # A set ignores duplicates, so no membership check is needed
    type_set.add(str(story['story']['type']))
type_set

{'[0, 0, 1]'}

Flattening the nested 'story' dictionary into top-level columns

In [4]:
flattened_babi = babi_dataset.flatten()
print(flattened_babi)
next(iter(flattened_babi['train']))

DatasetDict({
    train: Dataset({
        features: ['story.answer', 'story.id', 'story.supporting_ids', 'story.text', 'story.type'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['story.answer', 'story.id', 'story.supporting_ids', 'story.text', 'story.type'],
        num_rows: 1000
    })
})


{'story.answer': ['', '', 'office'],
 'story.id': ['1', '2', '3'],
 'story.supporting_ids': [[], [], ['1']],
 'story.text': ['The office is north of the kitchen.',
  'The garden is south of the kitchen.',
  'What is north of the kitchen?'],
 'story.type': [0, 0, 1]}

For some reason I couldn't use map() or add new keys to the flattened_babi dataset, so I built a plain Python dictionary of lists instead of a DatasetDict (which is what Hugging Face uses). Each row gets three new keys, so we can now access the question, the answer, and the joined context sentences directly.

In [5]:
train_data = []
for original_dict in flattened_babi['train']:
    copy_dict = deepcopy(original_dict)
    copy_dict.update({
        'questions': original_dict['story.text'][2],
        'sentences': ' '.join([original_dict['story.text'][0], original_dict['story.text'][1]]),
        'answer': original_dict['story.answer'][2]
    })
    train_data.append(copy_dict)

test_data = []
for original_dict in flattened_babi['test']:
    copy_dict = deepcopy(original_dict)
    copy_dict.update({
        'questions': original_dict['story.text'][2],
        'sentences': ' '.join([original_dict['story.text'][0], original_dict['story.text'][1]]),
        'answer': original_dict['story.answer'][2]
    })
    test_data.append(copy_dict)
    
processed = {'train': train_data, 'test': test_data}
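
The two loops above do the same work, so the per-row logic can be pulled into a single function. The sketch below does that on a plain dictionary; the name add_qa_fields is my own. A function of this shape (takes a row dict, returns a dict of new columns) is what Dataset.map() accepts, though I have not verified that passing it to flattened_babi.map() works on this particular dataset.

```python
def add_qa_fields(row):
    """Return the three derived fields for one flattened bAbI row."""
    sentences = row["story.text"]
    return {
        "questions": sentences[2],             # the third sentence is the question
        "sentences": " ".join(sentences[:2]),  # the first two sentences are context
        "answer": row["story.answer"][2],      # the answer is stored with the question
    }


# A row in the flattened format shown earlier
row = {
    "story.answer": ["", "", "office"],
    "story.text": [
        "The office is north of the kitchen.",
        "The garden is south of the kitchen.",
        "What is north of the kitchen?",
    ],
}

# Merge the new fields into a copy of the row, leaving the original untouched
merged = {**row, **add_qa_fields(row)}
print(merged["questions"], "->", merged["answer"])
```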

Let's take a look at some entries in our dataset

In [6]:
for i in range(3):
    # randrange excludes the upper bound; randint includes it and could index out of range
    idx = random.randrange(len(processed['train']))
    print('index:', idx)
    print('train:', processed['train'][idx])
    print('test:', processed['test'][idx])

index: 789
train: {'story.answer': ['', '', 'office'], 'story.id': ['1', '2', '3'], 'story.supporting_ids': [[], [], ['1']], 'story.text': ['The office is east of the garden.', 'The bathroom is west of the garden.', 'What is east of the garden?'], 'story.type': [0, 0, 1], 'questions': 'What is east of the garden?', 'sentences': 'The office is east of the garden. The bathroom is west of the garden.', 'answer': 'office'}
test: {'story.answer': ['', '', 'office'], 'story.id': ['1', '2', '3'], 'story.supporting_ids': [[], [], ['2']], 'story.text': ['The kitchen is north of the bedroom.', 'The office is south of the bedroom.', 'What is the bedroom north of?'], 'story.type': [0, 0, 1], 'questions': 'What is the bedroom north of?', 'sentences': 'The kitchen is north of the bedroom. The office is south of the bedroom.', 'answer': 'office'}
index: 938
train: {'story.answer': ['', '', 'hallway'], 'story.id': ['1', '2', '3'], 'story.supporting_ids': [[], [], ['2']], 'story.text': ['The bedroom is

Now adding the starting and ending character indices of the answer within the context for each story. This is important because we're doing extractive question answering, which relies on identifying the position of the answer within the contextual sentences.

In [7]:
for story in processed['train']:
    story.update({
        'str_idx': story['sentences'].find(story['answer']),
        'end_idx': story['sentences'].find(story['answer']) + len(story['answer'])
    })
for story in processed['test']:
    story.update({
        'str_idx': story['sentences'].find(story['answer']),
        'end_idx': story['sentences'].find(story['answer']) + len(story['answer'])
    })
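
As a quick sanity check of the index arithmetic above, here is a small worked example on one context string. The helper name answer_span is hypothetical. Note that str.find returns -1 when the answer is missing and only reports the first occurrence, so the sketch returns None in the missing case instead of a misleading span.

```python
def answer_span(sentences, answer):
    """Return (start, end) character indices of the first match, end exclusive."""
    start = sentences.find(answer)
    if start == -1:
        # str.find signals "not found" with -1, which would otherwise
        # produce a bogus span of (-1, len(answer) - 1)
        return None
    return start, start + len(answer)


sentences = "The office is east of the garden. The bathroom is west of the garden."
span = answer_span(sentences, "office")
print(span)

# Slicing with the span recovers the answer text
start, end = span
print(sentences[start:end])

# An answer that does not occur in the context yields None
print(answer_span(sentences, "kitchen"))
```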

Now that we're done processing the dataset, we have to tokenize and align the input for the model. This is necessary because the model does not read raw text: the tokenizer splits the text into tokens and maps each token to an integer ID from a fixed vocabulary, and the model's embedding layer then turns each ID into a vector. The DistilBERT fast tokenizer accepts sequences of up to 512 tokens and, when padding is requested, pads shorter sequences with the [PAD] token, whose ID is 0.
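
To make token IDs and padding concrete, here is a toy sketch in plain Python. It is not the real DistilBERT tokenizer: the whitespace splitting, the tiny vocabulary, and the names toy_vocab and toy_encode are assumptions made purely for illustration.

```python
# Toy vocabulary (hypothetical); a real tokenizer has tens of thousands of entries
toy_vocab = {
    "[PAD]": 0, "[UNK]": 1, "the": 2, "office": 3, "is": 4,
    "north": 5, "of": 6, "kitchen": 7, ".": 8,
}


def toy_encode(text, max_length):
    """Lowercase, split on whitespace, map tokens to IDs, then pad with zeros."""
    tokens = text.lower().replace(".", " .").split()
    # Unknown tokens map to [UNK]; anything past max_length is truncated
    ids = [toy_vocab.get(tok, toy_vocab["[UNK]"]) for tok in tokens][:max_length]
    attention_mask = [1] * len(ids)  # 1 marks a real token
    n_pad = max_length - len(ids)
    ids = ids + [toy_vocab["[PAD]"]] * n_pad
    attention_mask = attention_mask + [0] * n_pad  # 0 marks padding
    return ids, attention_mask


ids, mask = toy_encode("The office is north of the kitchen.", max_length=10)
print(ids)
print(mask)
```

The attention mask plays the same role as the 'attention_mask' field used later: it tells the model which positions hold real tokens and which are padding.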



In [8]:
tokenizer = DistilBertTokenizerFast.from_pretrained('/kaggle/input/babl-question-answering/tokenizer')

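One thing I didn't mention about tokenizers is that they often split words into several subwords, so character positions and token positions do not line up one-to-one. One must be very careful when tokenizing to avoid misaligning the answer's start and end positions with the tokens.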

In [9]:
train_data = []
for original_dict in processed['train']:

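    # Encode context and question as a pair; the context is the first sequence
    encoding = tokenizer(original_dict['sentences'], original_dict['questions'], truncation=True, padding=True, max_length=tokenizer.model_max_length)
    # Map the answer's first and last characters (in the context) to token positions
    start_positions = encoding.char_to_token(original_dict['str_idx'])
    end_positions = encoding.char_to_token(original_dict['end_idx']-1)
    # char_to_token returns None when no token covers the character; fall back to the maximum length
    if start_positions is None:
        start_positions = tokenizer.model_max_length
    if end_positions is None:
        end_positions = tokenizer.model_max_length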
    original_dict.update({
        'input_ids': encoding['input_ids'],
        'attention_mask': encoding['attention_mask'],
        'start_positions': start_positions,
        'end_positions': end_positions 
    })
    train_data.append(original_dict)

test_data = []
for original_dict in processed['test']:

    encoding = tokenizer(original_dict['sentences'], original_dict['questions'], truncation=True, padding=True, max_length=tokenizer.model_max_length)
    start_positions = encoding.char_to_token(original_dict['str_idx'])
    end_positions = encoding.char_to_token(original_dict['end_idx']-1)
    if start_positions is None:
        start_positions = tokenizer.model_max_length
    if end_positions is None:
        end_positions = tokenizer.model_max_length
    original_dict.update({
        'input_ids': encoding['input_ids'],
        'attention_mask': encoding['attention_mask'],
        'start_positions': start_positions,
        'end_positions': end_positions 
    })
    test_data.append(original_dict)
    
qa_dataset = {'train': train_data, 'test': test_data}
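
The char_to_token calls above are what connect the character indices computed earlier to token positions. The sketch below imitates that lookup with hand-written tokens and character offsets. The subword split and the name toy_char_to_token are made up for illustration, and a real tokenizer may split this text differently.

```python
text = "The bathroom is south."

# (token, start_char, end_char) with end exclusive; special tokens cover no characters.
# This subword split is hypothetical.
tokens = [
    ("[CLS]", None, None),
    ("the", 0, 3),
    ("bath", 4, 8),
    ("##room", 8, 12),
    ("is", 13, 15),
    ("south", 16, 21),
    (".", 21, 22),
    ("[SEP]", None, None),
]


def toy_char_to_token(char_index):
    """Return the index of the token covering char_index, or None if there is none."""
    for i, (_, start, end) in enumerate(tokens):
        if start is not None and start <= char_index < end:
            return i
    return None  # e.g. the character is whitespace


answer = "bathroom"
str_idx = text.find(answer)
end_idx = str_idx + len(answer)  # exclusive, so it points at the space after the answer

start_token = toy_char_to_token(str_idx)
end_token = toy_char_to_token(end_idx - 1)  # last character of the answer
print(start_token, end_token)

# Looking up end_idx itself lands on whitespace and finds no token,
# which is why the lookup uses end_idx - 1
print(toy_char_to_token(end_idx))
```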


Getting rid of the original columns in favor of the new fields we created earlier

In [10]:
keys_to_delete = ['story.answer', 'story.id', 'story.supporting_ids', 'story.text', 'story.type']
for original_dict in qa_dataset['train']:
    for key in keys_to_delete:
        del original_dict[key]
for original_dict in qa_dataset['test']:
    for key in keys_to_delete:
        del original_dict[key]

Since map() didn't work, I had to use a dictionary instead of a DatasetDict to store the training/test sets. This step converts them back to Dataset objects, the datatype the Hugging Face Trainer expects. There's already a DistilBERT model with a question-answering head, so I'm using that for fine-tuning.

In [11]:
train_ds = Dataset.from_pandas(pd.DataFrame(data=qa_dataset['train']))
test_ds = Dataset.from_pandas(pd.DataFrame(data=qa_dataset['test']))

model = DistilBertForQuestionAnswering.from_pretrained("/kaggle/input/babl-question-answering/model/pytorch")

Choosing the columns the model needs and setting the output format to PyTorch tensors

In [12]:
columns_to_return = ['input_ids','attention_mask', 'start_positions', 'end_positions']

train_ds.set_format(type='pt', columns=columns_to_return)
test_ds.set_format(type='pt', columns=columns_to_return)

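Using a custom function to compute the F1 score. The start and end token positions are treated as class labels, and a macro-averaged F1 is computed separately for each.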

In [13]:
def compute_metrics(pred):
    start_labels = pred.label_ids[0]
    start_preds = pred.predictions[0].argmax(-1)
    end_labels = pred.label_ids[1]
    end_preds = pred.predictions[1].argmax(-1)
    
    f1_start = f1_score(start_labels, start_preds, average='macro')
    f1_end = f1_score(end_labels, end_preds, average='macro')
    
    return {
        'f1_start': f1_start,
        'f1_end': f1_end,
    }
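
For intuition about what average='macro' does here: each token position is treated as a class label, an F1 score is computed per label, and the per-label scores are averaged with equal weight. The pure-Python sketch below is meant to mirror that calculation on made-up positions. macro_f1 is a hypothetical helper for illustration only; the notebook itself uses sklearn's f1_score.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # correct hits
        fp = sum(t != label and p == label for t, p in pairs)  # predicted but wrong
        fn = sum(t == label and p != label for t, p in pairs)  # missed
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up start positions for four examples
start_labels = [3, 3, 5, 7]
start_preds = [3, 5, 5, 7]
score = macro_f1(start_labels, start_preds)
print(round(score, 4))
```

Because every label counts equally, a rarely occurring position that the model gets wrong pulls the macro score down as much as a common one.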

Training time

In [14]:
training_args = TrainingArguments(
    output_dir='results',         
    overwrite_output_dir=True,
    num_train_epochs=5,             
    per_device_train_batch_size=8,  
    per_device_eval_batch_size=8,   
    warmup_steps=20,                
    weight_decay=0.01,               
    logging_dir=None,           
    logging_steps=50
)

trainer = Trainer(
    model=model,                
    args=training_args,                
    train_dataset=train_ds,        
    eval_dataset=test_ds,
    compute_metrics=compute_metrics            
)

trainer.train()

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········································


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc




Step,Training Loss
50,1.4809
100,0.5597
150,0.4299
200,0.4175
250,0.2776
300,0.1737


TrainOutput(global_step=315, training_loss=0.5409559946211557, metrics={'train_runtime': 79.883, 'train_samples_per_second': 62.592, 'train_steps_per_second': 3.943, 'total_flos': 33173638680000.0, 'train_loss': 0.5409559946211557, 'epoch': 5.0})

A macro F1 score of about 0.93 on the test set, for both the start and end positions, is pretty good!

In [15]:
trainer.evaluate(test_ds)

{'eval_loss': 0.14967407286167145,
 'eval_f1_start': 0.9352821297160085,
 'eval_f1_end': 0.936880101821786,
 'eval_runtime': 2.5049,
 'eval_samples_per_second': 399.214,
 'eval_steps_per_second': 25.15,
 'epoch': 5.0}

Printing out some examples from the test set. It messes up a few but I'm pretty impressed.

In [18]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

for i in range(10):
    # randrange excludes the upper bound; randint includes it and could index out of range
    idx = random.randrange(len(test_ds))
    text = test_ds['sentences'][idx]
    question = test_ds['questions'][idx]
    
    input_dict = tokenizer(text, question, return_tensors='pt')

    input_ids = input_dict['input_ids'].to(device)
    attention_mask = input_dict['attention_mask'].to(device)

    outputs = model(input_ids, attention_mask=attention_mask)

    start_logits = outputs[0]
    end_logits = outputs[1]

    all_tokens = tokenizer.convert_ids_to_tokens(input_dict["input_ids"].numpy()[0])
    answer = ' '.join(all_tokens[torch.argmax(start_logits, 1)[0] : torch.argmax(end_logits, 1)[0]+1])

    print(text)
    print(question, answer.capitalize())
    print('='*50)

The kitchen is south of the garden. The bedroom is south of the kitchen.
What is the kitchen north of? Bedroom
The kitchen is west of the bathroom. The office is east of the bathroom.
What is east of the bathroom? Office
The bathroom is west of the kitchen. The office is east of the kitchen.
What is the kitchen east of? Bathroom
The kitchen is east of the office. The hallway is west of the office.
What is west of the office? Hallway
The office is south of the bathroom. The bedroom is north of the bathroom.
What is south of the bathroom? Office
The kitchen is north of the garden. The hallway is south of the garden.
What is the garden north of? Hallway
The garden is north of the kitchen. The bathroom is south of the kitchen.
What is the kitchen north of? Garden
The bathroom is west of the bedroom. The kitchen is west of the bathroom.
What is west of the bedroom? Bathroom
The hallway is west of the bedroom. The kitchen is east of the bedroom.
What is west of the bedroom? Hallway
The kitch
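
The loop above picks the answer by taking the highest-scoring start token and the highest-scoring end token, then joining the tokens in between. Here is a minimal sketch of that decoding step on plain Python lists. The logits are made-up numbers, and argmax and decode_span are hypothetical helpers standing in for the torch.argmax calls.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def decode_span(tokens, start_logits, end_logits):
    """Join the tokens from the best start position to the best end position."""
    start = argmax(start_logits)
    end = argmax(end_logits)
    # If end < start the slice is empty, so the decoded answer is an empty string
    return " ".join(tokens[start:end + 1])


tokens = ["[CLS]", "the", "office", "is", "east", "of", "the", "garden", ".", "[SEP]"]
# Made-up scores: both peak at index 2 ("office")
start_logits = [0.1, 0.3, 4.2, 0.2, 0.1, 0.0, 0.4, 1.5, 0.1, 0.1]
end_logits = [0.1, 0.1, 3.9, 0.3, 0.2, 0.1, 0.2, 1.8, 0.2, 0.1]

answer = decode_span(tokens, start_logits, end_logits)
print(answer.capitalize())

# Swapping the roles so the best end comes before the best start gives an empty answer
crossed = decode_span(tokens, [0.0] * 7 + [5.0, 0.0, 0.0], end_logits)
print(repr(crossed))
```

Since the start and end are chosen independently, nothing forces the end to come after the start, which is one way the model can produce an empty answer.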

In [19]:
model.save_pretrained('/kaggle/working/')