![DLI Header](images/DLI_Header.png)

# Overview

## Task Description

- Given a context and a natural language query, we want to generate an answer for the query
- Depending on how the answer is generated, the task can be broadly divided into two types:
    1. **Extractive Question Answering**
    2. Generative Question Answering

### Extractive Question Answering with BERT-like models

Given a question and a context, both in natural language, the model predicts the span within the context that answers the question. The span is identified by its start and end positions.
For every token in the input sequence, the model predicts:
- the likelihood that this token is the start of the span
- the likelihood that this token is the end of the span

We use a BERT encoder with two span prediction heads, one for the start position and one for the end position of the answer. The span prediction heads are token classifiers consisting of a single linear layer.
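
To make the role of the two scores concrete, here is a minimal sketch of how per-token start and end scores can be turned into an answer span. It is not NeMo's decoding code: the function name `best_span`, the toy tokens and scores, and the token-count limit (named after the `max_answer_length` config key) are assumptions for illustration only.

```python
from typing import List, Tuple


def best_span(
    start_scores: List[float],
    end_scores: List[float],
    max_answer_length: int = 30,
) -> Tuple[int, int, float]:
    """Return (start, end, score) for the highest-scoring valid span.

    A span is valid when end >= start and it covers at most
    max_answer_length tokens. The span score is the sum of the
    start score and the end score.
    """
    best = (0, 0, float("-inf"))
    for start, s_score in enumerate(start_scores):
        # Only consider end positions at or after the start, within the length limit.
        last_end = min(len(end_scores), start + max_answer_length)
        for end in range(start, last_end):
            score = s_score + end_scores[end]
            if score > best[2]:
                best = (start, end, score)
    return best


# Toy example: scores are made up, not produced by a real model.
tokens = ["[CLS]", "where", "is", "normandy", "?", "[SEP]",
          "a", "region", "in", "france", "[SEP]"]
start_scores = [0.1, 0.0, 0.0, 0.2, 0.0, 0.0, 0.3, 0.5, 0.4, 2.5, 0.0]
end_scores = [0.1, 0.0, 0.0, 0.1, 0.0, 0.0, 0.1, 0.2, 0.3, 3.0, 0.1]

start, end, score = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

A production decoder typically also restricts candidates to context tokens and keeps an n-best list; this sketch leaves both out to keep the core idea visible.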

In [1]:
BRANCH = 'main'

# Imports and constants

In [2]:
import os
import wget
import gc

import pytorch_lightning as pl
from omegaconf import OmegaConf

from nemo.collections.nlp.models.question_answering.qa_bert_model import BERTQAModel

gc.disable()

NOTE! Installing ujson may make loading annotations faster.


In [3]:
# set the following paths
DATA_DIR = "data" # directory for storing datasets
WORK_DIR = "work_dir" # directory for storing trained models, logs, additionally downloaded scripts

os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(WORK_DIR, exist_ok=True)

# Configuration

The model is defined in a config file which declares multiple important sections:
- **model**: All arguments related to the model - language model, span prediction, optimizer and schedulers, datasets and any other related information
- **trainer**: Any argument to be passed to PyTorch Lightning
- **exp_manager**: All arguments used for setting up the experiment manager - target directory, name, logger information

We will download the default config file provided at `NeMo/examples/nlp/question_answering/conf/qa_conf.yaml` and edit the values needed to train different models.

In [4]:
# download the model's default configuration file 
config_dir = WORK_DIR + '/conf/'
os.makedirs(config_dir, exist_ok=True)
if not os.path.exists(config_dir + "qa_conf.yaml"):
    print('Downloading config file...')
    wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/question_answering/conf/qa_conf.yaml', config_dir)
else:
    print('config file already exists')

config file already exists


In [5]:
# this will print the entire default config of the model
config_path = f'{WORK_DIR}/conf/qa_conf.yaml'
print(config_path)
config = OmegaConf.load(config_path)
print("Default Config - \n")
print(OmegaConf.to_yaml(config))

work_dir/conf/qa_conf.yaml
Default Config - 

pretrained_model: null
do_training: true
trainer:
  devices:
  - 0
  num_nodes: 1
  max_epochs: 3
  max_steps: -1
  accumulate_grad_batches: 1
  gradient_clip_val: 1.0
  precision: 16
  accelerator: gpu
  log_every_n_steps: 5
  val_check_interval: 1.0
  num_sanity_val_steps: 0
  enable_checkpointing: false
  logger: false
  strategy: ddp
model:
  tensor_model_parallel_size: 1
  nemo_path: null
  library: huggingface
  save_model: false
  tokens_to_generate: 32
  dataset:
    version_2_with_negative: true
    doc_stride: 128
    max_query_length: 64
    max_seq_length: 512
    max_answer_length: 30
    use_cache: false
    do_lower_case: true
    check_if_answer_in_context: true
    keep_doc_spans: all
    null_score_diff_threshold: 0.0
    n_best_size: 20
    num_workers: 1
    pin_memory: false
    drop_last: false
  train_ds:
    file: null
    batch_size: 24
    shuffle: true
    num_samples: -1
    num_workers: ${model.dataset.num_worke

# Training and testing models on SQuAD v2.0

## Dataset

For this example, we are going to download the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset to showcase how to do training and inference. There are two versions of the dataset, SQuAD1.1 and SQuAD2.0. SQuAD1.1, the previous version, contains 100,000+ question-answer pairs on 500+ articles. SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
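
Because SQuAD2.0 contains unanswerable questions, a model needs a way to answer "no answer". A common convention, used in the original BERT SQuAD 2.0 scripts, compares the score of the "no answer" prediction with the score of the best non-empty span, using a threshold like the `null_score_diff_threshold` key in the config. The sketch below illustrates that rule; whether NeMo applies exactly this logic is an assumption, and `choose_answer` is a hypothetical helper.

```python
def choose_answer(
    null_score: float,
    best_span_score: float,
    best_span_text: str,
    null_score_diff_threshold: float = 0.0,
) -> str:
    """Return the span text, or "" when "no answer" wins by more than the threshold."""
    score_diff = null_score - best_span_score
    if score_diff > null_score_diff_threshold:
        return ""  # predict that the question is unanswerable
    return best_span_text


# Made-up scores for illustration.
print(repr(choose_answer(null_score=1.0, best_span_score=4.0, best_span_text="france")))
print(repr(choose_answer(null_score=6.0, best_span_score=4.0, best_span_text="france")))
```

Raising the threshold makes the rule more reluctant to predict "no answer"; lowering it has the opposite effect.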

We have prepared the data directory "squad" with the following four files for training and evaluation: 

```
squad  
│
└───v1.1
│   │ -  train-v1.1.json
│   │ -  dev-v1.1.json
│
└───v2.0
    │ -  train-v2.0.json
    │ -  dev-v2.0.json
```

In [6]:
!ls -LR {DATA_DIR}/squad

data/squad:
v1.1  v2.0

data/squad/v1.1:
dev-v1.1.json  train-v1.1.json

data/squad/v2.0:
dev-v2.0.json  train-v2.0.json
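
For orientation, the sketch below walks the nested structure of a SQuAD v2.0-style file (articles, paragraphs, question-answer entries) and counts answerable and unanswerable questions. The sample record is made up for illustration, and `count_questions` is a hypothetical helper; to inspect the real data, load one of the files listed above with `json.load` instead.

```python
import json
from typing import Tuple

context = "Normandy is a region in France."

# A tiny, hand-written record in the SQuAD v2.0 layout (not real dataset content).
sample = {
    "version": "v2.0",
    "data": [{
        "title": "Normandy",
        "paragraphs": [{
            "context": context,
            "qas": [
                {
                    "id": "q1",
                    "question": "In what country is Normandy located?",
                    "is_impossible": False,
                    # answer_start is a character offset into the context
                    "answers": [{"text": "France",
                                 "answer_start": context.index("France")}],
                },
                {
                    "id": "q2",
                    "question": "Who founded Normandy?",
                    "is_impossible": True,
                    "answers": [],
                },
            ],
        }],
    }],
}

# Round-trip through JSON to mimic reading the data from disk.
squad = json.loads(json.dumps(sample))


def count_questions(squad: dict) -> Tuple[int, int]:
    """Return (answerable, unanswerable) question counts."""
    answerable = unanswerable = 0
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # v1.1 files have no "is_impossible" key, so default to False.
                if qa.get("is_impossible", False):
                    unanswerable += 1
                else:
                    answerable += 1
    return answerable, unanswerable


print(count_questions(squad))
```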


## Set dataset config values

In [7]:
# if True, model will load features from cache if file is present, or
# create features and dump to cache file if not already present
config.model.dataset.use_cache = False

# indicates whether the dataset has unanswerable questions
config.model.dataset.version_2_with_negative = True

# indicates whether the dataset is of extractive nature or not
# if True, context spans/chunks that do not contain answer are treated as unanswerable 
config.model.dataset.check_if_answer_in_context = True

# set file paths for train, validation, and test datasets
config.model.train_ds.file = f"{DATA_DIR}/squad/v2.0/train-v2.0.json"
config.model.validation_ds.file = f"{DATA_DIR}/squad/v2.0/dev-v2.0.json"
config.model.test_ds.file = f"{DATA_DIR}/squad/v2.0/dev-v2.0.json"

# set batch sizes for train, validation, and test datasets
config.model.train_ds.batch_size = 8
config.model.validation_ds.batch_size = 8
config.model.test_ds.batch_size = 8

# set number of samples to be used from dataset. setting to -1 uses entire dataset
config.model.train_ds.num_samples = 5000
config.model.validation_ds.num_samples = 1000
config.model.test_ds.num_samples = 100
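
The `doc_stride` and `max_seq_length` keys in the default config indicate that long contexts are split into overlapping chunks, and the comment on `check_if_answer_in_context` says that chunks without the answer are treated as unanswerable. The sketch below follows the windowing used in the original BERT SQuAD preprocessing; NeMo's implementation may differ in detail, and the function names and token positions here are hypothetical.

```python
from typing import List, Tuple


def split_into_spans(
    num_doc_tokens: int, max_tokens_for_doc: int, doc_stride: int
) -> List[Tuple[int, int]]:
    """Split a document into overlapping (start, length) windows."""
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reaches the end of the document
        start += min(length, doc_stride)
    return spans


def span_contains_answer(span: Tuple[int, int], ans_start: int, ans_end: int) -> bool:
    """True when the answer tokens [ans_start, ans_end] lie fully inside the window."""
    start, length = span
    return start <= ans_start and ans_end <= start + length - 1


# Toy document of 10 tokens, windows of 4 tokens, stride of 2.
spans = split_into_spans(num_doc_tokens=10, max_tokens_for_doc=4, doc_stride=2)
# Assume the answer occupies tokens 5..6 of the document.
labels = [span_contains_answer(s, ans_start=5, ans_end=6) for s in spans]
print(spans)
print(labels)
```

In this convention the room left for the context is roughly `max_seq_length` minus the question length and the special tokens, so a single question can produce several training features.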

## Set trainer config values

In [8]:
config.trainer.max_epochs = 1
config.trainer.max_steps = -1 # -1 disables the step limit; if positive, training stops at whichever of max_steps or max_epochs is reached first
config.trainer.precision = 16
# list of the GPU indices to use, e.g. [0]; this tutorial does not support multiple GPUs.
# If needed, please use NeMo/examples/nlp/question_answering/question_answering.py
config.trainer.devices = [0]
config.trainer.accelerator = "gpu"
config.trainer.strategy = "auto"
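
As a rough guide to how these limits interact, the arithmetic below estimates the number of optimizer steps for a run. It is a back-of-the-envelope sketch assuming a single device; the helper names are hypothetical, and the feature count of 5000 is only an example (the actual number of features can exceed the number of samples when long contexts are split).

```python
import math


def optimizer_steps_per_epoch(
    num_features: int,
    batch_size: int,
    accumulate_grad_batches: int = 1,
    drop_last: bool = False,
) -> int:
    """Estimate optimizer steps in one epoch on a single device."""
    if drop_last:
        num_batches = num_features // batch_size
    else:
        num_batches = math.ceil(num_features / batch_size)
    # One optimizer step per group of accumulated batches.
    return math.ceil(num_batches / accumulate_grad_batches)


def total_steps(steps_per_epoch: int, max_epochs: int, max_steps: int = -1) -> int:
    """Training stops at whichever limit is reached first; -1 disables max_steps."""
    epoch_limit = steps_per_epoch * max_epochs
    return epoch_limit if max_steps < 0 else min(epoch_limit, max_steps)


steps = optimizer_steps_per_epoch(num_features=5000, batch_size=8)
print(steps, total_steps(steps, max_epochs=1, max_steps=-1))
```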

## BERT model for SQuAD v2.0

### Set model config values

In [10]:
# set language model and tokenizer to be used
# tokenizer is derived from model if a tokenizer name is not provided
config.model.language_model.pretrained_model_name = "bert-base-uncased"
config.model.tokenizer.tokenizer_name = "bert-base-uncased"

# path where model will be saved
config.model.nemo_path = f"{WORK_DIR}/checkpoints/bert_squad_v2_0.nemo"

# config.exp_manager.create_checkpoint_callback = True

config.model.optim.lr = 3e-5

### Create trainer and initialize model

In [11]:
trainer = pl.Trainer(**config.trainer)
model = BERTQAModel(config.model, trainer=trainer)

Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..


[NeMo I 2023-10-31 15:48:40 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: bert-base-uncased, vocab_file: None, merges_files: None, special_tokens_dict: {}, and use_fast: False


Using eos_token, but it is not set yet.
Using bos_token, but it is not set yet.
[NeMo W 2023-10-31 15:48:41 modelPT:244] You tried to register an artifact under config key=tokenizer.vocab_file but an artifact for it has already been registered.


[NeMo I 2023-10-31 15:48:41 qa_processing:106] mean no. of chars in doc: 839.2727272727273
[NeMo I 2023-10-31 15:48:41 qa_processing:107] max no. of chars in doc: 1895
[NeMo I 2023-10-31 15:48:41 qa_processing:106] mean no. of chars in doc: 677.5487804878048
[NeMo I 2023-10-31 15:48:41 qa_processing:107] max no. of chars in doc: 1782
[NeMo I 2023-10-31 15:48:41 qa_processing:106] mean no. of chars in doc: 828.0972222222222
[NeMo I 2023-10-31 15:48:41 qa_processing:107] max no. of chars in doc: 2132
[NeMo I 2023-10-31 15:48:41 qa_processing:106] mean no. of chars in doc: 540.0
[NeMo I 2023-10-31 15:48:41 qa_processing:107] max no. of chars in doc: 1423
[NeMo I 2023-10-31 15:48:41 qa_processing:106] mean no. of chars in doc: 756.71875
[NeMo I 2023-10-31 15:48:41 qa_processing:107] max no. of chars in doc: 1747
[NeMo I 2023-10-31 15:48:41 qa_processing:106] mean no. of chars in doc: 732.4418604651163
[NeMo I 2023-10-31 15:48:41 qa_processing:107] max no. of chars in doc: 3076
[NeMo I 2023

  0%|          | 0/5000 [00:00<?, ?it/s]

[NeMo I 2023-10-31 15:48:42 qa_bert_dataset:264] *** Example ***
[NeMo I 2023-10-31 15:48:42 qa_bert_dataset:265] unique_id: 1000000000
[NeMo I 2023-10-31 15:48:42 qa_bert_dataset:266] example_index: 0
[NeMo I 2023-10-31 15:48:42 qa_bert_dataset:267] doc_span_index: 0
[NeMo I 2023-10-31 15:48:42 qa_bert_dataset:268] tokens: [CLS] when did beyonce start becoming popular ? [SEP] beyonce gi ##selle knowles - carter ( / bi ##ː ##ˈ ##j ##ɒ ##nse ##ɪ / bee - yo ##n - say ) ( born september 4 , 1981 ) is an american singer , songwriter , record producer and actress . born and raised in houston , texas , she performed in various singing and dancing competitions as a child , and rose to fame in the late 1990s as lead singer of r & b girl - group destiny ' s child . managed by her father , mathew knowles , the group became one of the world ' s best - selling girl groups of all time . their hiatus saw the release of beyonce ' s debut album , dangerously in love ( 2003 ) , which established her as

100%|██████████| 5000/5000 [00:09<00:00, 550.94it/s]

[NeMo I 2023-10-31 15:48:51 qa_bert_dataset:90] Converting dict features into object features



100%|██████████| 5021/5021 [00:00<00:00, 589607.49it/s]

[NeMo I 2023-10-31 15:48:51 qa_processing:106] mean no. of chars in doc: 649.4358974358975
[NeMo I 2023-10-31 15:48:51 qa_processing:107] max no. of chars in doc: 1765
[NeMo I 2023-10-31 15:48:51 qa_processing:106] mean no. of chars in doc: 571.625
[NeMo I 2023-10-31 15:48:51 qa_processing:107] max no. of chars in doc: 1404
[NeMo I 2023-10-31 15:48:51 qa_processing:106] mean no. of chars in doc: 491.79487179487177
[NeMo I 2023-10-31 15:48:51 qa_processing:107] max no. of chars in doc: 1145
[NeMo I 2023-10-31 15:48:51 qa_processing:106] mean no. of chars in doc: 694.5454545454545
[NeMo I 2023-10-31 15:48:51 qa_processing:107] max no. of chars in doc: 1127
[NeMo I 2023-10-31 15:48:51 qa_processing:106] mean no. of chars in doc: 668.76
[NeMo I 2023-10-31 15:48:51 qa_processing:107] max no. of chars in doc: 1096
[NeMo I 2023-10-31 15:48:51 qa_processing:106] mean no. of chars in doc: 789.7727272727273
[NeMo I 2023-10-31 15:48:51 qa_processing:107] max no. of chars in doc: 1466
[NeMo I 2023




[NeMo I 2023-10-31 15:48:51 qa_processing:106] mean no. of chars in doc: 1674.2
[NeMo I 2023-10-31 15:48:51 qa_processing:107] max no. of chars in doc: 4063
[NeMo I 2023-10-31 15:48:51 qa_processing:106] mean no. of chars in doc: 700.2857142857143
[NeMo I 2023-10-31 15:48:51 qa_processing:107] max no. of chars in doc: 1057
[NeMo I 2023-10-31 15:48:51 qa_processing:106] mean no. of chars in doc: 926.6451612903226
[NeMo I 2023-10-31 15:48:51 qa_processing:107] max no. of chars in doc: 1916
[NeMo I 2023-10-31 15:48:51 qa_processing:106] mean no. of chars in doc: 781.1785714285714
[NeMo I 2023-10-31 15:48:51 qa_processing:107] max no. of chars in doc: 1778
[NeMo I 2023-10-31 15:48:51 qa_processing:106] mean no. of chars in doc: 840.695652173913
[NeMo I 2023-10-31 15:48:51 qa_processing:107] max no. of chars in doc: 3145
[NeMo I 2023-10-31 15:48:51 qa_processing:106] mean no. of chars in doc: 854.3913043478261
[NeMo I 2023-10-31 15:48:51 qa_processing:107] max no. of chars in doc: 1629
[NeM

  0%|          | 0/1000 [00:00<?, ?it/s]

[NeMo I 2023-10-31 15:48:51 qa_bert_dataset:264] *** Example ***
[NeMo I 2023-10-31 15:48:51 qa_bert_dataset:265] unique_id: 1000000000
[NeMo I 2023-10-31 15:48:51 qa_bert_dataset:266] example_index: 0
[NeMo I 2023-10-31 15:48:51 qa_bert_dataset:267] doc_span_index: 0
[NeMo I 2023-10-31 15:48:52 qa_bert_dataset:268] tokens: [CLS] in what country is normandy located ? [SEP] the norman ##s ( norman : no ##ur ##man ##ds ; french : norman ##ds ; latin : norman ##ni ) were the people who in the 10th and 11th centuries gave their name to normandy , a region in france . they were descended from norse ( " norman " comes from " norse ##man " ) raiders and pirates from denmark , iceland and norway who , under their leader roll ##o , agreed to swear fe ##al ##ty to king charles iii of west fran ##cia . through generations of assimilation and mixing with the native frankish and roman - gaul ##ish populations , their descendants would gradually merge with the carol ##ing ##ian - based cultures of w

100%|██████████| 1000/1000 [00:01<00:00, 584.22it/s]

[NeMo I 2023-10-31 15:48:53 qa_bert_dataset:90] Converting dict features into object features



100%|██████████| 1000/1000 [00:00<00:00, 362578.15it/s]


[NeMo I 2023-10-31 15:48:53 qa_processing:106] mean no. of chars in doc: 649.4358974358975
[NeMo I 2023-10-31 15:48:53 qa_processing:107] max no. of chars in doc: 1765
[NeMo I 2023-10-31 15:48:53 qa_processing:106] mean no. of chars in doc: 571.625
[NeMo I 2023-10-31 15:48:53 qa_processing:107] max no. of chars in doc: 1404
[NeMo I 2023-10-31 15:48:53 qa_processing:106] mean no. of chars in doc: 491.79487179487177
[NeMo I 2023-10-31 15:48:53 qa_processing:107] max no. of chars in doc: 1145
[NeMo I 2023-10-31 15:48:53 qa_processing:106] mean no. of chars in doc: 694.5454545454545
[NeMo I 2023-10-31 15:48:53 qa_processing:107] max no. of chars in doc: 1127
[NeMo I 2023-10-31 15:48:53 qa_processing:106] mean no. of chars in doc: 668.76
[NeMo I 2023-10-31 15:48:53 qa_processing:107] max no. of chars in doc: 1096
[NeMo I 2023-10-31 15:48:53 qa_processing:106] mean no. of chars in doc: 789.7727272727273
[NeMo I 2023-10-31 15:48:53 qa_processing:107] max no. of chars in doc: 1466
[NeMo I 2023

  0%|          | 0/100 [00:00<?, ?it/s]

[NeMo I 2023-10-31 15:48:53 qa_bert_dataset:264] *** Example ***
[NeMo I 2023-10-31 15:48:53 qa_bert_dataset:265] unique_id: 1000000000
[NeMo I 2023-10-31 15:48:53 qa_bert_dataset:266] example_index: 0
[NeMo I 2023-10-31 15:48:53 qa_bert_dataset:267] doc_span_index: 0
[NeMo I 2023-10-31 15:48:53 qa_bert_dataset:268] tokens: [CLS] in what country is normandy located ? [SEP] the norman ##s ( norman : no ##ur ##man ##ds ; french : norman ##ds ; latin : norman ##ni ) were the people who in the 10th and 11th centuries gave their name to normandy , a region in france . they were descended from norse ( " norman " comes from " norse ##man " ) raiders and pirates from denmark , iceland and norway who , under their leader roll ##o , agreed to swear fe ##al ##ty to king charles iii of west fran ##cia . through generations of assimilation and mixing with the native frankish and roman - gaul ##ish populations , their descendants would gradually merge with the carol ##ing ##ian - based cultures of w

100%|██████████| 100/100 [00:00<00:00, 447.22it/s]

[NeMo I 2023-10-31 15:48:54 qa_bert_dataset:90] Converting dict features into object features



100%|██████████| 100/100 [00:00<00:00, 282444.71it/s]


### Train, test, and save the model

In [None]:
trainer.fit(model)
trainer.test(model)

model.save_to(config.model.nemo_path)

### Load the saved model and run inference

In [None]:
model = BERTQAModel.restore_from(config.model.nemo_path)

eval_device = [config.trainer.devices[0]] if isinstance(config.trainer.devices, list) else 1
model.trainer = pl.Trainer(
    devices=eval_device,
    accelerator=config.trainer.accelerator,
    precision=16,
    logger=False,
)

all_preds, all_nbest = model.inference(
    config.model.test_ds.file,
    num_samples=10, # setting to -1 will use all samples for inference
)

for question_id in all_preds:
    print(all_preds[question_id])
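
To judge predictions like the ones printed above, SQuAD-style evaluation conventionally reports exact match and token-level F1 after normalizing the text. The sketch below follows the normalization order of the official SQuAD evaluation script (lowercase, strip punctuation, drop articles, collapse whitespace). It is a standalone illustration, not NeMo's metric code, and the example strings are made up.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, remove punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction: str, truth: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    if not pred_tokens or not truth_tokens:
        # An empty prediction only matches an empty (unanswerable) reference.
        return float(pred_tokens == truth_tokens)
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The France.", "france"))
print(f1_score("region in France", "France"))
```

When a question has several reference answers, the usual practice is to take the maximum score over the references.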
