Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Question Answering on the SQuAD Dataset using BERT


# Before You Start

The running times shown in this notebook were measured on a Standard_NC24s_v3 Azure Deep Learning Virtual Machine with 4 NVIDIA Tesla V100 GPUs. 
> **Tip**: If you want to run through the notebook quickly, you can set the **`QUICK_RUN`** flag in the cell below to **`True`** to run the notebook on a small subset of the data and a smaller number of epochs. 

The table below provides reference running times on different machine configurations.  

|QUICK_RUN|Machine Configurations|Running time|
|:---------|:----------------------|:------------|
|True|4 **CPU**s, 14GB memory| ~ 10 minutes |
|True|1 NVIDIA Tesla K80 GPU, 12GB GPU memory| ~ 3 minutes |
|False|4 NVIDIA Tesla K80 GPUs, 48GB GPU memory| ~ 18 hours |
|False|4 NVIDIA Tesla V100 GPUs, 64GB GPU memory| ~ 7 hours|

If you run into a CUDA out-of-memory error, try reducing `BATCH_SIZE` and `MAX_SEQ_LENGTH`, but note that model performance will be compromised. 

In [35]:
## Set QUICK_RUN = True to run the notebook on a small subset of data and a smaller number of epochs.
QUICK_RUN = False

## Summary
This notebook demonstrates how to fine-tune a [pretrained BERT model](https://github.com/huggingface/pytorch-transformers) for the extractive question answering task. Utility functions and classes in the NLP Best Practices repo are used to facilitate data preprocessing, model training, model scoring, result postprocessing, and model evaluation. 

BERT[\[1\]](#References) is a powerful pre-trained language model that can be used for multiple NLP tasks, including text classification, question answering, and named entity recognition. It can achieve state-of-the-art performance with only a few epochs of fine-tuning on task-specific datasets.  
The figure below illustrates how BERT can be fine-tuned for the extractive question answering task. The question and paragraph tokens are concatenated into a single input token sequence with a special [SEP] token between them. For each paragraph token, BERT predicts the probability of it being the start and the end of the answer span. The pair of tokens with the highest sum of start probability and end probability defines the span of the predicted answer.

<img src="https://nlpbp.blob.core.windows.net/images/bert_qa.PNG">
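The span selection rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used by the utilities in this notebook: the function name `best_span`, the `max_answer_len` cap, and the toy probabilities are all assumptions made for the example.

```python
def best_span(start_probs, end_probs, max_answer_len=30):
    """Return (start, end, score) maximizing start_probs[s] + end_probs[e] with s <= e."""
    best = (0, 0, float("-inf"))
    for s, p_start in enumerate(start_probs):
        # Only consider end positions at or after the start, within the length cap.
        for e in range(s, min(s + max_answer_len, len(end_probs))):
            score = p_start + end_probs[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Toy paragraph tokens and made-up per-token probabilities (hypothetical values).
tokens = ["the", "denver", "broncos", "won", "the", "game"]
start_probs = [0.05, 0.80, 0.05, 0.04, 0.03, 0.03]
end_probs = [0.02, 0.08, 0.85, 0.02, 0.02, 0.01]

start, end, score = best_span(start_probs, end_probs)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

Requiring `s <= e` rules out spans that end before they start, which the raw per-token probabilities alone do not prevent.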

In [36]:
import os
import sys

import torch
import numpy as np

nlp_path = os.path.abspath('../../')
if nlp_path not in sys.path:
    sys.path.insert(0, nlp_path)

from utils_nlp.dataset.squad import load_pandas_df
from utils_nlp.models.transformers.question_answering_distributed import AnswerExtractor
from utils_nlp.models.transformers.qa_utils_distributed import (QADataset, 
                                                    get_qa_dataloader, 
                                                    postprocess_answer, 
                                                    evaluate_qa, 
                                                    TOKENIZER_CLASSES
                                                   )
from utils_nlp.common.timer import Timer

## Configurations

In [37]:
TRAIN_DATA_USED_PERCENT = 1
DEV_DATA_USED_PERCENT = 1
NUM_EPOCHS = 2

if QUICK_RUN:
    TRAIN_DATA_USED_PERCENT = 0.001
    DEV_DATA_USED_PERCENT = 0.01
    NUM_EPOCHS = 1

if torch.cuda.is_available() and torch.cuda.device_count() >= 4:
    MAX_SEQ_LENGTH = 384
    DOC_STRIDE = 128
    BATCH_SIZE = 8
else:
    MAX_SEQ_LENGTH = 128
    DOC_STRIDE = 64
    BATCH_SIZE = 2

print("Max sequence length: {}".format(MAX_SEQ_LENGTH))
print("Document stride: {}".format(DOC_STRIDE))
print("Batch size: {}".format(BATCH_SIZE))
    
SQUAD_VERSION = "v1.1" 
CACHE_DIR = "./temp"

# MODEL_NAME = "bert-large-uncased-whole-word-masking"
# DO_LOWER_CASE = True

MODEL_NAME = "xlnet-large-cased"
DO_LOWER_CASE = False

MAX_QUESTION_LENGTH = 64
LEARNING_RATE = 3e-5

DOC_TEXT_COL = "doc_text"
QUESTION_TEXT_COL = "question_text"
ANSWER_START_COL = "answer_start"
ANSWER_TEXT_COL = "answer_text"
QA_ID_COL = "qa_id"
IS_IMPOSSIBLE_COL = "is_impossible"

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
if torch.cuda.device_count() > 0:
    torch.cuda.manual_seed_all(RANDOM_SEED)

Max sequence length: 384
Document stride: 128
Batch size: 8


## Load Data

### The SQuAD Dataset
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. [\[2, 3\]](#References)

<img src="https://nlpbp.blob.core.windows.net/images/squad.png">

There are two versions of the SQuAD dataset. SQuAD 1.1 contains 100,000+ question-answer pairs on 500+ articles. SQuAD 2.0 adds 50,000 new, unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. These datasets are available at [https://rajpurkar.github.io/SQuAD-explorer/](https://rajpurkar.github.io/SQuAD-explorer/). Each version comes with a training set and a development set. 


The utility function `load_pandas_df` downloads the dataset specified by `squad_version` and `file_split` to `local_cache_path` if it doesn't exist already.
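To make the resulting columns easier to picture, here is a rough sketch of how the nested SQuAD JSON layout (articles, paragraphs, questions, answers) maps onto flat rows. It is not the implementation of `load_pandas_df`; the helper name `flatten_squad` and the toy record are assumptions for illustration, and the sketch always stores answers as lists.

```python
def flatten_squad(squad):
    """Flatten a SQuAD-style dict into one row (dict) per question."""
    rows = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                answers = qa.get("answers", [])
                rows.append({
                    "doc_text": paragraph["context"],
                    "question_text": qa["question"],
                    "answer_start": [a["answer_start"] for a in answers],
                    "answer_text": [a["text"] for a in answers],
                    "qa_id": qa["id"],
                    # Only SQuAD 2.0 carries this flag; default to False otherwise.
                    "is_impossible": qa.get("is_impossible", False),
                })
    return rows


# Toy record in the SQuAD JSON layout (hypothetical id and text).
context = "The game was played at Levi's Stadium in Santa Clara, California."
toy_squad = {
    "data": [{
        "title": "Toy_article",
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": "toy-0001",
                "question": "Where was the game played?",
                "answers": [{"text": "Levi's Stadium",
                             "answer_start": context.index("Levi's Stadium")}],
            }],
        }],
    }]
}

rows = flatten_squad(toy_squad)
print(rows[0]["answer_text"], rows[0]["answer_start"])
```

Rows in this shape can be passed to `pandas.DataFrame(rows)` to obtain a table with the same column names used in this notebook.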

In [4]:
train_df = load_pandas_df(local_cache_path=".", squad_version="v1.1", file_split="train")
dev_df = load_pandas_df(local_cache_path=".", squad_version="v1.1", file_split="dev")

In [5]:
train_df.head()

| |doc_text|question_text|answer_start|answer_text|qa_id|is_impossible|
|:--|:---------|:--------------|:-------------|:------------|:------|:--------------|
|0|Architecturally, the school has a Catholic cha...|To whom did the Virgin Mary allegedly appear i...|515|Saint Bernadette Soubirous|5733be284776f41900661182|False|
|1|Architecturally, the school has a Catholic cha...|What is in front of the Notre Dame Main Building?|188|a copper statue of Christ|5733be284776f4190066117f|False|
|2|Architecturally, the school has a Catholic cha...|The Basilica of the Sacred heart at Notre Dame...|279|the Main Building|5733be284776f41900661180|False|
|3|Architecturally, the school has a Catholic cha...|What is the Grotto at Notre Dame?|381|a Marian place of prayer and reflection|5733be284776f41900661181|False|
|4|Architecturally, the school has a Catholic cha...|What sits on top of the Main Building at Notre...|92|a golden statue of the Virgin Mary|5733be284776f4190066117e|False|


In [6]:
dev_df.head()

| |doc_text|question_text|answer_start|answer_text|qa_id|is_impossible|
|:--|:---------|:--------------|:-------------|:------------|:------|:--------------|
|0|Super Bowl 50 was an American football game to...|Which NFL team represented the AFC at Super Bo...|[177, 177, 177]|[Denver Broncos, Denver Broncos, Denver Broncos]|56be4db0acb8001400a502ec|False|
|1|Super Bowl 50 was an American football game to...|Which NFL team represented the NFC at Super Bo...|[249, 249, 249]|[Carolina Panthers, Carolina Panthers, Carolin...|56be4db0acb8001400a502ed|False|
|2|Super Bowl 50 was an American football game to...|Where did Super Bowl 50 take place?|[403, 355, 355]|[Santa Clara, California, Levi's Stadium, Levi...|56be4db0acb8001400a502ee|False|
|3|Super Bowl 50 was an American football game to...|Which NFL team won Super Bowl 50?|[177, 177, 177]|[Denver Broncos, Denver Broncos, Denver Broncos]|56be4db0acb8001400a502ef|False|
|4|Super Bowl 50 was an American football game to...|What color was used to emphasize the 50th anni...|[488, 488, 521]|[gold, gold, gold]|56be4db0acb8001400a502f0|False|


In [7]:
train_df = train_df.sample(frac=TRAIN_DATA_USED_PERCENT).reset_index(drop=True)
dev_df = dev_df.sample(frac=DEV_DATA_USED_PERCENT).reset_index(drop=True)

In [5]:
# train_dataset = QADataset(df=train_df,
#                           doc_text_col=DOC_TEXT_COL,
#                           question_text_col=QUESTION_TEXT_COL,
#                           qa_id_col=QA_ID_COL,
#                           is_impossible_col=IS_IMPOSSIBLE_COL,
#                           answer_start_col=ANSWER_START_COL,
#                           answer_text_col=ANSWER_TEXT_COL)
dev_dataset = QADataset(df=dev_df,
                        doc_text_col=DOC_TEXT_COL,
                        question_text_col=QUESTION_TEXT_COL,
                        qa_id_col=QA_ID_COL,
                        is_impossible_col=IS_IMPOSSIBLE_COL,
                        answer_start_col=ANSWER_START_COL,
                        answer_text_col=ANSWER_TEXT_COL)

## Tokenize and Preprocess Data

The `tokenize_qa` method of `Tokenizer` tokenizes the input paragraph, question, and answer texts and converts them into the format required by the pre-trained BERT model. This involves the following steps:
* WordPiece tokenization.
* Convert character-based answer span indices to token-based indices.
* Truncate the question token list if it's longer than `max_question_length`.
* Split the paragraph into multiple segments if it's longer than `max_len` - `max_question_length` - 3. (The "-3" is for the special [CLS] token and two [SEP] tokens.)
* Add the special tokens [CLS] and [SEP].
* Pad the concatenated token sequence to `max_len` if it's shorter.
* Convert the tokens into token indices corresponding to the BERT tokenizer's vocabulary.
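The conversion from character-based to token-based answer indices in the list above can be illustrated with a small sketch. It assumes simple whitespace tokenization rather than the subword tokenizer used by the real pipeline, and the function name `char_span_to_token_span` is hypothetical.

```python
def char_span_to_token_span(text, answer_start, answer_text):
    """Map a character-based answer span to inclusive token indices (whitespace tokens)."""
    tokens = []
    char_to_token = []  # char_to_token[i] = index of the token covering character i
    in_token = False
    for ch in text:
        if ch.isspace():
            in_token = False
            # Whitespace is attributed to the preceding token (or token 0 at the start).
            char_to_token.append(max(len(tokens) - 1, 0))
        else:
            if not in_token:
                tokens.append("")
                in_token = True
            tokens[-1] += ch
            char_to_token.append(len(tokens) - 1)
    answer_end = answer_start + len(answer_text) - 1
    return tokens, char_to_token[answer_start], char_to_token[answer_end]


text = "The game was played at Levi's Stadium in Santa Clara, California."
answer_text = "Levi's Stadium"
tokens, start_tok, end_tok = char_span_to_token_span(
    text, text.index(answer_text), answer_text
)
print(start_tok, end_tok, tokens[start_tok:end_tok + 1])
```

With a subword tokenizer, each whitespace token expands into one or more pieces, so a second mapping from word indices to subword indices is typically needed on top of this one.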

In addition to the features required by BERT, `tokenize_qa` outputs a few additional fields needed for postprocessing. See the `QAFeatures` class in [qa_utils.py](../../utils_nlp/models/bert/qa_utils.py) for more details.
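The splitting, special-token, and padding steps listed above can be sketched as a sliding window over the paragraph tokens. This is a simplified illustration under stated assumptions: it uses a BERT-style layout, budgets the window with the actual (truncated) question length, and the names `build_windows` and `doc_stride` as a step size are choices made for the example, not the utility's implementation.

```python
def build_windows(question_tokens, doc_tokens, max_len, max_question_length, doc_stride):
    """Split doc_tokens into overlapping windows, each packed as
    [CLS] question [SEP] chunk [SEP] and padded to max_len."""
    question_tokens = question_tokens[:max_question_length]
    # Reserve room for [CLS] and two [SEP] tokens.
    max_doc_len = max_len - len(question_tokens) - 3
    if max_doc_len <= 0:
        raise ValueError("max_len is too small for the question and special tokens")

    windows = []
    start = 0
    while start < len(doc_tokens):
        chunk = doc_tokens[start:start + max_doc_len]
        tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + chunk + ["[SEP]"]
        tokens += ["[PAD]"] * (max_len - len(tokens))
        windows.append((start, tokens))
        if start + max_doc_len >= len(doc_tokens):
            break  # this window already reaches the end of the paragraph
        start += min(doc_stride, max_doc_len)
    return windows


question = ["where", "was", "it"]
document = ["d{}".format(i) for i in range(9)]
windows = build_windows(question, document, max_len=10,
                        max_question_length=4, doc_stride=2)
for offset, tokens in windows:
    print(offset, tokens)
```

Because consecutive windows overlap, an answer that straddles one window boundary is usually fully contained in a neighboring window.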

In [6]:
# train_dataloader = get_qa_dataloader(train_dataset, 
#                                     model_name=MODEL_NAME, 
#                                     is_training=True,
#                                     to_lower=DO_LOWER_CASE,
#                                     batch_size=BATCH_SIZE
#                                         )

dev_dataloader = get_qa_dataloader(dev_dataset, 
                                   model_name=MODEL_NAME, 
                                   is_training=False,
                                   to_lower=DO_LOWER_CASE,
                                   batch_size=BATCH_SIZE)

## Train BERTQAExtractor

In [10]:
qa_extractor = AnswerExtractor(model_name=MODEL_NAME, cache_dir=CACHE_DIR)

100%|██████████| 467/467 [00:00<00:00, 206945.59B/s]
100%|██████████| 1441285815/1441285815 [00:26<00:00, 53758449.43B/s]


In [11]:
with Timer() as t:
    qa_extractor.fit(train_dataloader=train_dataloader,
                     num_epochs=NUM_EPOCHS,
                     learning_rate=LEARNING_RATE,
                     cache_model=True)
print("Training time : {:.3f} hrs".format(t.interval / 3600))

# qa_extractor = AnswerExtractor(model_name=MODEL_NAME, cache_dir=CACHE_DIR, load_model_from_dir="./temp")
 

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 32/10950 [01:00<5:46:35,  1.90s/it][A
Iteration:   0%|          | 32/10950 [01:20<5:46:35,  1.90s/it][A
Iteration:   1%|          | 68/10950 [02:01<5:32:36,  1.83s/it][A
Iteration:   1%|          | 68/10950 [02:20<5:32:36,  1.83s/it][A
Iteration:   1%|          | 106/10950 [03:01<5:18:50,  1.76s/it][A
Iteration:   1%|          | 106/10950 [03:20<5:18:50,  1.76s/it][A
Iteration:   1%|▏         | 145/10950 [04:03<5:07:40,  1.71s/it][A
Iteration:   1%|▏         | 145/10950 [04:20<5:07:40,  1.71s/it][A
Iteration:   2%|▏         | 183/10950 [05:04<5:00:48,  1.68s/it][A
Iteration:   2%|▏         | 183/10950 [05:20<5:00:48,  1.68s/it][A
Iteration:   2%|▏         | 221/10950 [06:04<4:54:53,  1.65s/it][A
Iteration:   2%|▏         | 221/10950 [06:20<4:54:53,  1.65s/it][A
Iteration:   2%|▏         | 259/10950 [07:05<4:51:34,  1.64s/it][A
Iteration:   2%|▏         | 259/10950 [07:20<4:51:34,  1.64s/it][A
Iterat

Iteration:  40%|████      | 4392/10950 [1:56:12<2:52:49,  1.58s/it][A
Iteration:  40%|████      | 4392/10950 [1:56:30<2:52:49,  1.58s/it][A
Iteration:  40%|████      | 4430/10950 [1:57:12<2:51:46,  1.58s/it][A
Iteration:  40%|████      | 4430/10950 [1:57:30<2:51:46,  1.58s/it][A
Iteration:  41%|████      | 4469/10950 [1:58:13<2:50:19,  1.58s/it][A
Iteration:  41%|████      | 4469/10950 [1:58:30<2:50:19,  1.58s/it][A
Iteration:  41%|████      | 4508/10950 [1:59:14<2:48:58,  1.57s/it][A
Iteration:  41%|████      | 4508/10950 [1:59:30<2:48:58,  1.57s/it][A
Iteration:  42%|████▏     | 4547/10950 [2:00:16<2:47:53,  1.57s/it][A
Iteration:  42%|████▏     | 4547/10950 [2:00:30<2:47:53,  1.57s/it][A
Iteration:  42%|████▏     | 4586/10950 [2:01:17<2:46:58,  1.57s/it][A
Iteration:  42%|████▏     | 4586/10950 [2:01:30<2:46:58,  1.57s/it][A
Iteration:  42%|████▏     | 4624/10950 [2:02:18<2:46:26,  1.58s/it][A
Iteration:  42%|████▏     | 4624/10950 [2:02:30<2:46:26,  1.58s/it][A
Iterat

Iteration:  81%|████████  | 8816/10950 [3:52:43<56:08,  1.58s/it][A
Iteration:  81%|████████  | 8816/10950 [3:53:01<56:08,  1.58s/it][A
Iteration:  81%|████████  | 8854/10950 [3:53:43<55:08,  1.58s/it][A
Iteration:  81%|████████  | 8854/10950 [3:54:01<55:08,  1.58s/it][A
Iteration:  81%|████████  | 8892/10950 [3:54:44<54:17,  1.58s/it][A
Iteration:  81%|████████  | 8892/10950 [3:55:01<54:17,  1.58s/it][A
Iteration:  82%|████████▏ | 8930/10950 [3:55:44<53:18,  1.58s/it][A
Iteration:  82%|████████▏ | 8930/10950 [3:56:01<53:18,  1.58s/it][A
Iteration:  82%|████████▏ | 8968/10950 [3:56:44<52:20,  1.58s/it][A
Iteration:  82%|████████▏ | 8968/10950 [3:57:01<52:20,  1.58s/it][A
Iteration:  82%|████████▏ | 9007/10950 [3:57:46<51:12,  1.58s/it][A
Iteration:  82%|████████▏ | 9007/10950 [3:58:01<51:12,  1.58s/it][A
Iteration:  83%|████████▎ | 9045/10950 [3:58:46<50:12,  1.58s/it][A
Iteration:  83%|████████▎ | 9045/10950 [3:59:01<50:12,  1.58s/it][A
Iteration:  83%|████████▎ | 9084/1

Iteration:  22%|██▏       | 2391/10950 [1:03:12<3:44:56,  1.58s/it][A
Iteration:  22%|██▏       | 2429/10950 [1:03:52<3:44:34,  1.58s/it][A
Iteration:  22%|██▏       | 2429/10950 [1:04:12<3:44:34,  1.58s/it][A
Iteration:  23%|██▎       | 2468/10950 [1:04:54<3:43:00,  1.58s/it][A
Iteration:  23%|██▎       | 2468/10950 [1:05:12<3:43:00,  1.58s/it][A
Iteration:  23%|██▎       | 2506/10950 [1:05:54<3:42:11,  1.58s/it][A
Iteration:  23%|██▎       | 2506/10950 [1:06:12<3:42:11,  1.58s/it][A
Iteration:  23%|██▎       | 2545/10950 [1:06:55<3:40:52,  1.58s/it][A
Iteration:  23%|██▎       | 2545/10950 [1:07:12<3:40:52,  1.58s/it][A
Iteration:  24%|██▎       | 2584/10950 [1:07:56<3:39:42,  1.58s/it][A
Iteration:  24%|██▎       | 2584/10950 [1:08:12<3:39:42,  1.58s/it][A
Iteration:  24%|██▍       | 2623/10950 [1:08:58<3:38:23,  1.57s/it][A
Iteration:  24%|██▍       | 2623/10950 [1:09:12<3:38:23,  1.57s/it][A
Iteration:  24%|██▍       | 2662/10950 [1:09:59<3:37:17,  1.57s/it][A
Iterat

Iteration:  62%|██████▏   | 6813/10950 [2:59:32<1:48:53,  1.58s/it][A
Iteration:  63%|██████▎   | 6852/10950 [3:00:20<1:47:42,  1.58s/it][A
Iteration:  63%|██████▎   | 6852/10950 [3:00:32<1:47:42,  1.58s/it][A
Iteration:  63%|██████▎   | 6891/10950 [3:01:21<1:46:38,  1.58s/it][A
Iteration:  63%|██████▎   | 6891/10950 [3:01:32<1:46:38,  1.58s/it][A
Iteration:  63%|██████▎   | 6930/10950 [3:02:22<1:45:33,  1.58s/it][A
Iteration:  63%|██████▎   | 6930/10950 [3:02:32<1:45:33,  1.58s/it][A
Iteration:  64%|██████▎   | 6968/10950 [3:03:22<1:44:39,  1.58s/it][A
Iteration:  64%|██████▎   | 6968/10950 [3:03:32<1:44:39,  1.58s/it][A
Iteration:  64%|██████▍   | 7006/10950 [3:04:23<1:43:52,  1.58s/it][A
Iteration:  64%|██████▍   | 7006/10950 [3:04:42<1:43:52,  1.58s/it][A
Iteration:  64%|██████▍   | 7045/10950 [3:05:24<1:42:35,  1.58s/it][A
Iteration:  64%|██████▍   | 7045/10950 [3:05:42<1:42:35,  1.58s/it][A
Iteration:  65%|██████▍   | 7083/10950 [3:06:24<1:41:52,  1.58s/it][A
Iterat

Training time : 9.623 hrs


## Predict
Note that `AnswerExtractor.predict` only outputs the probabilities of each token being the start and end of the answer span. The `postprocess_answer` function takes these probabilities and generates the final answers. 
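As a rough idea of what turning per-token scores into ranked answers can look like, the sketch below enumerates candidate spans, keeps the top `n_best_size`, and normalizes their scores with a softmax. This follows a common SQuAD post-processing pattern and is an assumption; it is not taken from `postprocess_answer`, and the name `n_best_spans` and the toy scores are hypothetical.

```python
import math


def n_best_spans(start_scores, end_scores, n_best_size=5, max_answer_len=30):
    """Return the top-scoring (start, end) spans with softmax-normalized probabilities."""
    candidates = []
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            candidates.append((s_score + end_scores[e], s, e))
    if not candidates:
        return []
    candidates.sort(key=lambda c: c[0], reverse=True)
    top = candidates[:n_best_size]

    # Softmax over the retained candidates only (subtract the max for stability).
    max_score = top[0][0]
    exp_scores = [math.exp(score - max_score) for score, _, _ in top]
    total = sum(exp_scores)
    return [
        {"start": s, "end": e, "probability": x / total}
        for (_, s, e), x in zip(top, exp_scores)
    ]


start_scores = [0.1, 2.0, 0.3, 0.2]
end_scores = [0.0, 0.5, 2.5, 0.1]
nbest = n_best_spans(start_scores, end_scores, n_best_size=3)
for candidate in nbest:
    print(candidate)
```

The probabilities here are relative to the retained candidates, so they change with `n_best_size` and should not be read as calibrated confidences.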

In [42]:
qa_extractor = AnswerExtractor(model_name=MODEL_NAME, cache_dir=CACHE_DIR, load_model_from_dir="./temp/distributed_0")
qa_results = qa_extractor.predict(dev_dataloader)

Evaluating: 100%|██████████| 1322/1322 [15:48<00:00,  1.77it/s]


In [12]:
qa_extractor.model.module

XLNetForQuestionAnswering(
  (transformer): XLNetModel(
    (word_embedding): Embedding(32000, 1024)
    (layer): ModuleList(
      (0): XLNetLayer(
        (rel_attn): XLNetRelativeAttention(
          (layer_norm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (ff): XLNetFeedForward(
          (layer_norm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
          (layer_1): Linear(in_features=1024, out_features=4096, bias=True)
          (layer_2): Linear(in_features=4096, out_features=1024, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (1): XLNetLayer(
        (rel_attn): XLNetRelativeAttention(
          (layer_norm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (ff): XLNetFeedForward(
          (layer_norm): LayerNorm((1024,), eps=

In [15]:
from pytorch_transformers import AdamW
model = qa_extractor.model.module
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [
            p
            for n, p in model.named_parameters()
            if not any(nd in n for nd in no_decay)
        ],
        "weight_decay": 0.01,
    },
    {
        "params": [
            p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)
        ],
        "weight_decay": 0.0,
    },
]
optimizer = AdamW(optimizer_grouped_parameters, lr=3e-5, eps=1e-8)

In [22]:
optimizer_grouped_parameters = [
    {
        "params": [
            p
            for n, p in model.named_parameters()
            if not any(nd in n for nd in no_decay)
        ],
        "weight_decay": 0.01,
    },
    {
        "params": [
            p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)
        ],
        "weight_decay": 0.0,
    },
]
optimizer = AdamW(optimizer_grouped_parameters, lr=3e-5, eps=1e-8)

In [31]:
no_lr_layer_decay_group = []
lr_layer_decay_groups = {k:[] for k in range(24)}
for n, p in model.named_parameters():
    name_split = n.split(".")
    if name_split[1] == "layer":
        lr_layer_decay_groups[int(name_split[2])].append(p) 
    else:
        no_lr_layer_decay_group.append(p)

learning_rate = 3e-5
lr_layer_decay = 0.75
n_layers = 24

optimizer_grouped_parameters = [{"params": no_lr_layer_decay_group, "lr": learning_rate}]
print(len(no_lr_layer_decay_group))
for i in range(n_layers):
    parameters_group = {"params": lr_layer_decay_groups[i], "lr": learning_rate * (lr_layer_decay ** (n_layers - i - 1))}
    print(len(lr_layer_decay_groups[i]))
    optimizer_grouped_parameters.append(parameters_group)

13
17
17
17
17
17
17
17
17
17
17
17
17
17
17
17
17
17
17
17
17
17
17
17
17
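The per-layer learning rates produced by the loop above can be inspected in isolation. The sketch below reuses the same formula and constants from that cell; only the list name `layer_lrs` is introduced for illustration.

```python
learning_rate = 3e-5
lr_layer_decay = 0.75
n_layers = 24

# Same formula as the cell above: the last layer keeps the base rate,
# and each earlier layer is scaled down by another factor of lr_layer_decay.
layer_lrs = [
    learning_rate * lr_layer_decay ** (n_layers - i - 1) for i in range(n_layers)
]

for i in (0, n_layers // 2, n_layers - 2, n_layers - 1):
    print("layer {:2d}: lr = {:.3e}".format(i, layer_lrs[i]))
```

With a decay of 0.75 over 24 layers, the lowest layers receive a learning rate several orders of magnitude smaller than the top layer, so they stay close to their pre-trained values.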


In [32]:
13 + 17 *24

421

In [30]:
optimizer_grouped_parameters

[{'params': [Parameter containing:
   tensor([[[ 0.0020,  0.0090,  0.0266,  ..., -0.0021, -0.0030,  0.0049]]],
          device='cuda:0', requires_grad=True), Parameter containing:
   tensor([[ 0.0464,  0.0515, -0.0547,  ..., -0.0034, -0.0633,  0.0161],
           [ 0.0223, -0.0398, -0.0133,  ...,  0.0291,  0.0253, -0.0287],
           [ 0.0279, -0.0446, -0.0223,  ...,  0.0406,  0.0262, -0.0311],
           ...,
           [-0.0036, -0.0610,  0.0338,  ...,  0.0214, -0.0514, -0.0258],
           [ 0.0415,  0.0115, -0.0534,  ..., -0.1437,  0.0533, -0.0117],
           [ 0.0752,  0.0393, -0.0127,  ..., -0.0176, -0.0611, -0.0175]],
          device='cuda:0', requires_grad=True), Parameter containing:
   tensor([[-0.0282, -0.0284, -0.0116,  ...,  0.0063,  0.0305, -0.0177]],
          device='cuda:0', requires_grad=True), Parameter containing:
   tensor([7.2050e-05], device='cuda:0', requires_grad=True), Parameter containing:
   tensor([[ 0.0097,  0.0248, -0.0123,  ...,  0.0072,  0.0047,  0.

In [33]:
count = 0
for n, p in model.named_parameters():
    print(n,p)
    count += 1

transformer.mask_emb Parameter containing:
tensor([[[ 0.0020,  0.0090,  0.0266,  ..., -0.0021, -0.0030,  0.0049]]],
       device='cuda:0', requires_grad=True)
transformer.word_embedding.weight Parameter containing:
tensor([[ 0.0464,  0.0515, -0.0547,  ..., -0.0034, -0.0633,  0.0161],
        [ 0.0223, -0.0398, -0.0133,  ...,  0.0291,  0.0253, -0.0287],
        [ 0.0279, -0.0446, -0.0223,  ...,  0.0406,  0.0262, -0.0311],
        ...,
        [-0.0036, -0.0610,  0.0338,  ...,  0.0214, -0.0514, -0.0258],
        [ 0.0415,  0.0115, -0.0534,  ..., -0.1437,  0.0533, -0.0117],
        [ 0.0752,  0.0393, -0.0127,  ..., -0.0176, -0.0611, -0.0175]],
       device='cuda:0', requires_grad=True)
transformer.layer.0.rel_attn.q Parameter containing:
tensor([[[ 1.5486e-02,  7.9165e-03, -6.8250e-04,  ..., -6.7614e-03,
          -1.2938e-04,  2.5389e-03],
         [ 2.0044e-03,  9.5261e-03,  9.3410e-03,  ..., -1.1020e-02,
           7.0381e-03,  1.2984e-02],
         [ 4.7253e-03, -1.0354e-02,  2.5903

transformer.layer.2.rel_attn.v Parameter containing:
tensor([[[ 1.9607e-03,  4.0693e-02, -6.5724e-02,  ..., -3.0055e-02,
          -3.7586e-02, -2.4628e-02],
         [ 2.3130e-02, -2.9848e-02,  8.0330e-03,  ..., -4.7522e-02,
           3.3761e-02,  1.4913e-02],
         [-1.1373e-04, -1.6255e-02, -3.5629e-02,  ..., -1.1950e-02,
           3.0917e-02,  4.6254e-02],
         ...,
         [ 2.6838e-02,  1.6839e-02, -4.7978e-03,  ...,  1.2174e-03,
           2.7336e-03,  3.2211e-02],
         [ 1.7791e-02, -1.6208e-03,  1.0353e-02,  ..., -8.6503e-03,
          -2.8809e-02,  1.9587e-02],
         [-1.0884e-02,  1.7078e-02, -1.2022e-02,  ..., -4.7808e-04,
           2.4518e-02, -3.9494e-02]],

        [[ 2.8730e-02,  1.3371e-02,  4.8628e-02,  ...,  1.8995e-02,
          -4.1294e-02,  2.1916e-02],
         [-1.8299e-02, -1.9155e-02, -1.7701e-03,  ..., -4.6348e-02,
           2.3057e-02, -5.1124e-02],
         [ 3.7797e-02,  1.1349e-02, -2.6199e-02,  ..., -3.3093e-02,
           4.8814e-03, 

transformer.layer.4.rel_attn.layer_norm.bias Parameter containing:
tensor([-0.0182,  0.0980,  0.0816,  ..., -0.0304,  0.0104,  0.0582],
       device='cuda:0', requires_grad=True)
transformer.layer.4.ff.layer_norm.weight Parameter containing:
tensor([0.9913, 0.9717, 0.9728,  ..., 0.9732, 0.9663, 1.0156], device='cuda:0',
       requires_grad=True)
transformer.layer.4.ff.layer_norm.bias Parameter containing:
tensor([ 0.0428, -0.0602, -0.0208,  ...,  0.0237, -0.0068, -0.0450],
       device='cuda:0', requires_grad=True)
transformer.layer.4.ff.layer_1.weight Parameter containing:
tensor([[ 0.0026, -0.0633, -0.0140,  ...,  0.0289, -0.0429,  0.0168],
        [ 0.0052, -0.0164,  0.0343,  ...,  0.0361, -0.0107, -0.0228],
        [-0.0249,  0.0554, -0.0363,  ..., -0.0416,  0.0110,  0.0183],
        ...,
        [ 0.0235,  0.0336,  0.0366,  ..., -0.0542, -0.0350,  0.0358],
        [-0.0176, -0.0493, -0.0565,  ..., -0.0011, -0.0101, -0.0224],
        [-0.0007, -0.0140, -0.0639,  ..., -0.0080, -0

transformer.layer.7.rel_attn.v Parameter containing:
tensor([[[-0.0035, -0.0184,  0.0296,  ..., -0.0307,  0.0240, -0.0230],
         [ 0.0265, -0.0183,  0.0602,  ...,  0.0093,  0.0377,  0.0042],
         [-0.0044,  0.0121, -0.0176,  ...,  0.0296,  0.0130, -0.0023],
         ...,
         [-0.0047,  0.0080,  0.0028,  ...,  0.0107,  0.0037, -0.0058],
         [-0.0197, -0.0859, -0.0316,  ...,  0.0074,  0.0295,  0.0193],
         [ 0.0004,  0.0072,  0.0123,  ..., -0.0309, -0.0246, -0.0111]],

        [[ 0.0106, -0.0319, -0.0420,  ...,  0.0228,  0.0138,  0.0634],
         [ 0.0199, -0.0440,  0.0222,  ..., -0.0150, -0.0416, -0.0219],
         [ 0.0352,  0.0014,  0.0033,  ...,  0.0034,  0.0073,  0.0031],
         ...,
         [ 0.0145,  0.0506,  0.0132,  ...,  0.0144,  0.0086,  0.0201],
         [ 0.0350, -0.0009,  0.0202,  ...,  0.0298, -0.0299,  0.0097],
         [ 0.0124,  0.0024,  0.0106,  ..., -0.0268, -0.0177, -0.0363]],

        [[ 0.0199,  0.0243, -0.0104,  ..., -0.0100, -0.0159,  0

transformer.layer.9.rel_attn.seg_embed Parameter containing:
tensor([[[ 4.8978e-01, -8.1710e-02,  4.1005e-01,  ..., -1.5047e-01,
           3.3607e-01, -3.7705e-04],
         [ 1.2276e-01, -9.1511e-02, -1.5685e-02,  ..., -2.4817e-01,
          -1.9544e-02, -7.8173e-02],
         [ 4.7773e-01,  1.5898e-01,  1.5284e-01,  ...,  1.6608e-01,
           1.1130e-01, -3.0068e-01],
         ...,
         [-1.0655e-01, -9.7923e-02, -7.6099e-02,  ...,  4.4363e-02,
          -1.2570e-01, -3.5633e-04],
         [-2.2247e-02, -8.4173e-02, -4.2496e-02,  ..., -2.7193e-02,
          -1.1671e-02, -1.1269e-01],
         [ 6.0542e-02, -2.8918e-02,  7.8071e-02,  ...,  1.0579e-01,
          -3.8465e-02, -7.3352e-02]],

        [[-4.9805e-01,  7.8159e-02, -4.1601e-01,  ...,  1.4044e-01,
          -3.4337e-01, -8.4164e-03],
         [-1.2257e-01,  8.1972e-02,  2.3735e-02,  ...,  2.3258e-01,
          -5.6588e-03,  8.3339e-02],
         [-4.6527e-01, -1.5475e-01, -1.5633e-01,  ..., -1.6160e-01,
          -9.71

transformer.layer.12.rel_attn.v Parameter containing:
tensor([[[-0.0049,  0.0176,  0.0465,  ...,  0.0044,  0.0268,  0.0043],
         [-0.0072,  0.0261,  0.0475,  ..., -0.0070,  0.0267,  0.0041],
         [ 0.0058, -0.0036, -0.0231,  ...,  0.0094,  0.0356,  0.0485],
         ...,
         [ 0.0105, -0.0028,  0.0086,  ...,  0.0184, -0.0146,  0.0094],
         [-0.0219,  0.0216,  0.0405,  ...,  0.0492, -0.0320, -0.0107],
         [ 0.0450,  0.0304, -0.0164,  ..., -0.0228,  0.0400,  0.0150]],

        [[-0.0382, -0.0438,  0.0085,  ...,  0.0309, -0.0265, -0.0409],
         [-0.0088, -0.0222, -0.0111,  ..., -0.0494, -0.0284,  0.0013],
         [-0.0480,  0.0452,  0.0291,  ..., -0.0224,  0.0287, -0.0250],
         ...,
         [ 0.0169,  0.0176, -0.0298,  ..., -0.0273, -0.0356,  0.0257],
         [-0.0402, -0.0364, -0.0354,  ..., -0.0058,  0.0621, -0.0394],
         [-0.0587,  0.0076, -0.0186,  ..., -0.0387,  0.0204, -0.0054]],

        [[-0.0126,  0.0204, -0.0233,  ...,  0.0123, -0.0091, -

transformer.layer.14.ff.layer_norm.bias Parameter containing:
tensor([-0.1068, -0.0660, -0.0663,  ..., -0.0364, -0.0509, -0.0257],
       device='cuda:0', requires_grad=True)
transformer.layer.14.ff.layer_1.weight Parameter containing:
tensor([[-0.0037, -0.0219, -0.0100,  ...,  0.0312,  0.0070, -0.0132],
        [-0.0056, -0.0169, -0.0036,  ...,  0.0273, -0.0524, -0.0651],
        [ 0.0129,  0.0312, -0.0551,  ..., -0.0667,  0.0026, -0.0137],
        ...,
        [-0.0392, -0.0290,  0.0247,  ...,  0.0074, -0.0016, -0.0066],
        [ 0.0298,  0.0312,  0.0077,  ..., -0.0187, -0.0319, -0.0251],
        [ 0.0019,  0.0320,  0.0499,  ..., -0.0092,  0.0459,  0.0293]],
       device='cuda:0', requires_grad=True)
transformer.layer.14.ff.layer_1.bias Parameter containing:
tensor([-0.1525, -0.1798, -0.0822,  ..., -0.0093, -0.0663, -0.0775],
       device='cuda:0', requires_grad=True)
transformer.layer.14.ff.layer_2.weight Parameter containing:
tensor([[-0.0319, -0.0238, -0.0041,  ...,  0.0283,  0

transformer.layer.17.rel_attn.v Parameter containing:
tensor([[[ 1.8407e-02, -2.6996e-02,  2.5889e-03,  ...,  4.8635e-02,
          -5.5703e-02,  8.1880e-03],
         [ 1.3630e-02, -4.4564e-02, -2.7376e-02,  ..., -1.1577e-02,
           1.2373e-02,  2.8491e-02],
         [-4.5321e-02,  2.8406e-02,  1.4424e-02,  ..., -1.4849e-02,
           6.0771e-02, -3.7183e-03],
         ...,
         [ 7.7709e-03,  4.3435e-02, -4.6793e-02,  ...,  2.8617e-02,
           3.6761e-02,  9.3007e-03],
         [-1.9772e-02,  2.3890e-02,  3.4793e-03,  ...,  4.0287e-02,
          -2.0604e-02,  5.6163e-03],
         [ 9.7745e-02,  1.7755e-02,  2.5005e-04,  ...,  1.7299e-02,
          -6.4595e-02, -6.4804e-03]],

        [[ 5.8623e-02, -2.3014e-02,  2.7892e-02,  ...,  1.2353e-02,
          -4.3778e-02,  3.1823e-03],
         [ 4.4087e-02,  1.3053e-02, -3.0942e-02,  ...,  4.3225e-02,
           2.4652e-02,  4.0233e-02],
         [-1.4445e-02, -1.5585e-02,  2.3059e-02,  ..., -2.1628e-02,
          -5.4769e-02,

transformer.layer.19.ff.layer_1.weight Parameter containing:
tensor([[-1.9649e-02,  5.0762e-02, -2.9767e-02,  ...,  1.0735e-02,
         -3.0167e-02, -4.3829e-03],
        [ 3.5074e-02, -2.7688e-03, -1.2336e-02,  ...,  1.5710e-02,
          9.4019e-02, -7.3631e-02],
        [ 3.9487e-03, -2.7117e-02, -4.1989e-04,  ..., -1.2867e-03,
          4.5613e-02, -1.3132e-02],
        ...,
        [ 1.9728e-04,  1.8610e-02,  9.4184e-03,  ..., -6.5648e-02,
          2.6209e-02, -1.0324e-02],
        [-1.6423e-02, -9.8365e-03, -5.6336e-02,  ...,  3.1993e-02,
          2.1933e-02,  1.7731e-02],
        [-1.4287e-02, -3.4946e-02, -4.9955e-02,  ...,  4.4783e-02,
         -3.3925e-05,  2.9370e-02]], device='cuda:0', requires_grad=True)
transformer.layer.19.ff.layer_1.bias Parameter containing:
tensor([ 0.0649, -0.0586, -0.0875,  ..., -0.0200, -0.0377, -0.0443],
       device='cuda:0', requires_grad=True)
transformer.layer.19.ff.layer_2.weight Parameter containing:
tensor([[ 0.0199,  0.0038, -0.0277,  

transformer.layer.22.rel_attn.o Parameter containing:
tensor([[[-8.1041e-02, -4.0541e-02,  4.3433e-02,  ...,  9.8196e-03,
           2.2727e-03, -8.1638e-03],
         [-5.9462e-02,  2.3698e-02, -1.8787e-02,  ...,  3.9275e-03,
          -4.6129e-03,  5.3776e-03],
         [-2.1303e-02, -1.9214e-02, -7.1432e-03,  ...,  1.8197e-02,
          -6.7501e-02,  3.6777e-03],
         ...,
         [ 1.3282e-02, -2.4399e-02,  1.5016e-02,  ..., -3.2165e-02,
          -1.0011e-02,  6.4494e-02],
         [ 7.5878e-02, -2.1919e-02, -6.6692e-03,  ...,  3.5277e-02,
           2.3564e-02, -2.5066e-02],
         [-2.6779e-02, -4.2199e-02,  2.4799e-02,  ..., -2.5740e-02,
           3.3450e-02,  2.3703e-03]],

        [[ 1.3621e-03, -9.7445e-03,  1.5050e-02,  ..., -3.0097e-02,
           2.2835e-02,  4.0101e-02],
         [ 7.0512e-03, -1.7517e-02, -3.0041e-02,  ...,  2.4089e-02,
          -5.4393e-02, -8.2083e-02],
         [-9.3504e-03,  4.0965e-03, -2.8758e-02,  ..., -2.2106e-02,
          -8.9099e-02,

In [34]:
print(count)

421


In [19]:
items = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x**2, items))

In [20]:
squared

[1, 4, 9, 16, 25]

## Postprocess and Generate the Final Answers

In [43]:
tokenizer_class = TOKENIZER_CLASSES["xlnet"]
tokenizer = tokenizer_class.from_pretrained(
    MODEL_NAME, do_lower_case=DO_LOWER_CASE, cache_dir=CACHE_DIR
)
final_answers, answer_probs, nbest_answers = postprocess_answer(qa_results,
                                                                "./cached_qa_features/cached_examples_test.jsonl",
                                                                "./cached_qa_features/cached_features_test.jsonl", 
                                                                do_lower_case=DO_LOWER_CASE,
                                                                model_type='xlnet',
                                                                tokenizer=tokenizer,
                                                                n_best_size=5
                                                               )

In [44]:
for i in [0, 10, 100]:
    print('Paragraph:')
    print(dev_df.iloc[i]['doc_text'])
    print()
    print('Question:')
    print(dev_df.iloc[i]['question_text'])
    print()
    print('Ground truth answers:')
    print(dev_df.iloc[i]['answer_text'])
    print()
    print('Predicted answer:')
    print(final_answers[dev_df.iloc[i]['qa_id']])
    print()
    print('Top N best answers')
    print(nbest_answers[dev_df.iloc[i]['qa_id']])
    print('-------------------------------------------------------------------------------------------------------------------')

Paragraph:
Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.

Question:
Which NFL team represented the AFC at Super Bowl 50?

Ground truth answers:
['Denver Broncos', 'Denver Broncos', 'Denver Broncos']

Predicted answer:
Denver Broncos

Top N best answers
[OrderedDict([('t

## Evaluate

The question answering task is usually evaluated on two metrics: exact match (EM) and F1 score.   
The exact match is computed by first applying some simple normalization (e.g. removing punctuation and converting to lower case) to the ground truth and predicted answers, and then checking whether they match exactly.   
The F1 score is computed from token-level precision and recall by comparing the ground truth and predicted answers. 
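The two metrics described above can be sketched as follows. The normalization mirrors the convention of the official SQuAD evaluation script (lower-casing, stripping punctuation and the articles "a", "an", "the", collapsing whitespace); it is an illustration and not necessarily identical to what `evaluate_qa` does internally. The helper names are chosen for this example.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lower-case, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, ground_truth):
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))


def f1_score(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    num_same = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_ground_truths(metric, prediction, ground_truths):
    """Score a prediction against every reference answer and keep the best."""
    return max(metric(prediction, gt) for gt in ground_truths)


truths = ["Levi's Stadium", "Santa Clara, California"]
em = best_over_ground_truths(exact_match, "the Levi's Stadium.", truths)
f1 = best_over_ground_truths(f1_score, "Levi's Stadium in Santa Clara", truths)
print(em, round(f1, 4))
```

Taking the maximum over all reference answers matters for the development set, where each question has several ground truth answers.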

In [45]:
evaluation_result = evaluate_qa(qa_ids=dev_df['qa_id'], 
                                actuals=dev_df['answer_text'], 
                                preds=final_answers)

{
  "exact": 83.99243140964995,
  "f1": 91.66718130779226,
  "total": 10570,
  "HasAns_exact": 83.99243140964995,
  "HasAns_f1": 91.66718130779226,
  "HasAns_total": 10570
}


In [16]:
# from utils_nlp.models.transformers.qa_utils import QADataset
# from torch.utils.data import (
#     Dataset,
#     IterableDataset,
#     DataLoader,
#     RandomSampler,
#     SequentialSampler,
#     TensorDataset,
# )

# qa_dataset = QADataset(train_df,
#                        doc_text_col="doc_text",
#                        question_text_col="question_text",
#                        qa_id_col="qa_id",
#                        is_impossible_col="is_impossible",
#                        answer_start_col="answer_start",
#                        answer_text_col="answer_text")
# sampler = SequentialSampler(qa_dataset)
# data_loader = DataLoader(qa_dataset, sampler=sampler, batch_size=32)
# def test_generator():
#     features = []
#     c = 0
#     f = True
#     for i in range(10):
#         features.append(c)
#         features.append(c+1)
        
#         while len(features) > 0:
#             output = features[0]
#             features = features[1:]
            
#             if f:
#                 yield output
#             else:
#                 yield output * 10
            
#             f = not f
#         c += 2

# g = test_generator()
# for item in g:
#     print(item)

# from torch.utils.data import TensorDataset
# def test_generator():
#     i = 0
#     while i < 2:
#         i+=1
#         t1 = torch.tensor([list(range(1024)), list(range(1024))], dtype=torch.long)
#         t2 = torch.tensor([list(range(512)), list(range(512))], dtype=torch.long)
#         yield (t1, t2)
        
# g = test_generator()

# for t1, t2 in g:
#     print(t2)

## References

1. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, [*BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*](https://arxiv.org/abs/1810.04805), NAACL, 2019.
2. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang, [*SQuAD: 100,000+ Questions for Machine Comprehension of Text*](https://arxiv.org/abs/1606.05250), EMNLP, 2016.
3. Pranav Rajpurkar, Robin Jia, Percy Liang, [*Know What You Don't Know: Unanswerable Questions for SQuAD*](https://arxiv.org/abs/1806.03822), ACL, 2018.