# BERT - Out of the Box

In this notebook, we will test the performance of an out-of-the-box BERT model on CommonsenseQA. We follow the tutorial here: https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb

We referred to the CommonsenseQA repo and code to understand how the authors of that work established their baseline using BERT. This is the link to their repo: https://github.com/jonathanherzig/commonsenseqa

We used Google Colab to run the following code.

In [1]:
!pip install pytorch_pretrained_bert
!pip install urllib3==1.25.10
!pip install transformers


In [14]:
!pip install nlp


In [3]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My\ Drive/NLP/w266-commonsenseqa/BERT

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
[Errno 2] No such file or directory: '/content/drive/My Drive/NLP/w266-commonsenseqa/BERT'
/content/drive/MyDrive/MIDS/NLP/w266-commonsenseqa/BERT_oob


In [15]:
import logging
import numpy as np
import pandas as pd

import nlp

import warnings
warnings.filterwarnings('ignore')

import json
from pandas.io.json import json_normalize

from transformers import BertTokenizer, BertModel, BertConfig
import torch
from torch.utils.tensorboard import SummaryWriter
from sklearn import metrics

from datetime import datetime
import pytz
# configuration = BertConfig() 
from collections import defaultdict 
import pickle 

11/29/2020 23:01:09 - INFO - nlp.utils.file_utils -   PyTorch version 1.7.0+cu101 available.
11/29/2020 23:01:09 - INFO - nlp.utils.file_utils -   TensorFlow version 2.3.0 available.


In [5]:
ts = datetime.now(pytz.timezone('US/Pacific')).strftime("%Y%m%d_%H%M%S")


In [6]:
runtype="full_allgrad_newlr_b16"
NAME = 'BertForMultipleChoice__{runtype}_{ts}'.format(runtype=runtype, ts=ts)
# Logs for tensorboard will be saved in the following directory 
writer = SummaryWriter("runs/"+ NAME)

print("Model NAME:", NAME)

Model NAME: BertForMultipleChoice__full_allgrad_newlr_b16_20201129_145722


In [7]:
# To use tensorboard in Google Colab, run this:
%load_ext tensorboard

# TensorBoard can then be viewed with the following command
# (the SummaryWriter above writes its logs under runs/)
# %tensorboard --logdir runs

In [8]:
# These were supposed to be fed in as .py arguments
# Copied them over from the BertForMultipleChoice example doc.
# Adapted from https://github.com/rodgzilla/pytorch-pretrained-BERT/blob/dcb50eaa4b80d3ab75d373c36780c80fb47cfd97/examples/run_swag.py

logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.INFO)
logger = logging.getLogger(__name__)


class arg_holder():
    def __init__(self):
        self.bert_model = 'bert-base-uncased'
        
        self.max_seq_length = 128
        self.train_batch_size = 16   # 32
        self.eval_batch_size = 8
        self.learning_rate = 2e-5    # 5e-5
        self.num_train_epochs = 3
        self.warmup_proportion = 0
        self.no_cuda = False
        self.gradient_accumulation_steps = 4
        
args = arg_holder()

In [9]:
args.train_batch_size

16

In [10]:
from tqdm import tqdm, trange

from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler

In [11]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

## Import dataset

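We load the CommonsenseQA train and validation splits with the Hugging Face `nlp` library, which downloads and caches them on first use.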

In [17]:
train_dataset = nlp.load_dataset("commonsense_qa", split = nlp.Split.TRAIN)
valid_dataset = nlp.load_dataset("commonsense_qa", split = nlp.Split.VALIDATION)

11/29/2020 23:01:23 - INFO - filelock -   Lock 139879389531664 acquired on /root/.cache/huggingface/datasets/ea29813e78501904688a90c430f89aca3126b45c3f3f072c4a4246096a5ad0ef.29f2def772f6505fbbc27537a807141066e57bf70f50bf1e0eeb544f7eb5cf3a.py.lock
11/29/2020 23:01:23 - INFO - nlp.utils.file_utils -   https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/commonsense_qa/commonsense_qa.py not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/tmpwzny632s


11/29/2020 23:01:23 - INFO - nlp.utils.file_utils -   storing https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/commonsense_qa/commonsense_qa.py in cache at /root/.cache/huggingface/datasets/ea29813e78501904688a90c430f89aca3126b45c3f3f072c4a4246096a5ad0ef.29f2def772f6505fbbc27537a807141066e57bf70f50bf1e0eeb544f7eb5cf3a.py
11/29/2020 23:01:23 - INFO - nlp.utils.file_utils -   creating metadata file for /root/.cache/huggingface/datasets/ea29813e78501904688a90c430f89aca3126b45c3f3f072c4a4246096a5ad0ef.29f2def772f6505fbbc27537a807141066e57bf70f50bf1e0eeb544f7eb5cf3a.py
11/29/2020 23:01:23 - INFO - filelock -   Lock 139879389531664 released on /root/.cache/huggingface/datasets/ea29813e78501904688a90c430f89aca3126b45c3f3f072c4a4246096a5ad0ef.29f2def772f6505fbbc27537a807141066e57bf70f50bf1e0eeb544f7eb5cf3a.py.lock
11/29/2020 23:01:23 - INFO - filelock -   Lock 139879389531664 acquired on /root/.cache/huggingface/datasets/57f2f3ea75915746c8c82276c4fa8c5ad2eefec1b68f4916fd427e350e05




11/29/2020 23:01:23 - INFO - nlp.utils.file_utils -   storing https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/commonsense_qa/dataset_infos.json in cache at /root/.cache/huggingface/datasets/57f2f3ea75915746c8c82276c4fa8c5ad2eefec1b68f4916fd427e350e0548d1.8e7921b0f2e4b84d6fe54300e2304189b22d5d373f2574d7071269bf89d558c8
11/29/2020 23:01:23 - INFO - nlp.utils.file_utils -   creating metadata file for /root/.cache/huggingface/datasets/57f2f3ea75915746c8c82276c4fa8c5ad2eefec1b68f4916fd427e350e0548d1.8e7921b0f2e4b84d6fe54300e2304189b22d5d373f2574d7071269bf89d558c8
11/29/2020 23:01:23 - INFO - filelock -   Lock 139879389531664 released on /root/.cache/huggingface/datasets/57f2f3ea75915746c8c82276c4fa8c5ad2eefec1b68f4916fd427e350e0548d1.8e7921b0f2e4b84d6fe54300e2304189b22d5d373f2574d7071269bf89d558c8.lock
11/29/2020 23:01:23 - INFO - nlp.load -   Checking /root/.cache/huggingface/datasets/ea29813e78501904688a90c430f89aca3126b45c3f3f072c4a4246096a5ad0ef.29f2def772f6505fbbc27537a80




11/29/2020 23:01:23 - INFO - filelock -   Lock 139879389531664 acquired on /root/.cache/huggingface/datasets/ea29813e78501904688a90c430f89aca3126b45c3f3f072c4a4246096a5ad0ef.29f2def772f6505fbbc27537a807141066e57bf70f50bf1e0eeb544f7eb5cf3a.py.lock
11/29/2020 23:01:23 - INFO - nlp.load -   Found main folder for dataset https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/commonsense_qa/commonsense_qa.py at /usr/local/lib/python3.6/dist-packages/nlp/datasets/commonsense_qa
11/29/2020 23:01:23 - INFO - nlp.load -   Creating specific version folder for dataset https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/commonsense_qa/commonsense_qa.py at /usr/local/lib/python3.6/dist-packages/nlp/datasets/commonsense_qa/a3ca6d89a0ea731b712148dc01a2d5679a3bffd6af0896468b02834325cc2bf9
11/29/2020 23:01:23 - INFO - nlp.load -   Copying script file from https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/commonsense_qa/commonsense_qa.py to /usr/local/lib/python3.6/dist-pack

Downloading and preparing dataset commonsense_qa/default (download: 4.46 MiB, generated: 2.08 MiB, post-processed: Unknown sizetotal: 6.54 MiB) to /root/.cache/huggingface/datasets/commonsense_qa/default/0.1.0/a3ca6d89a0ea731b712148dc01a2d5679a3bffd6af0896468b02834325cc2bf9...


11/29/2020 23:01:24 - INFO - nlp.utils.file_utils -   storing https://s3.amazonaws.com/commensenseqa/train_rand_split.jsonl in cache at /root/.cache/huggingface/datasets/downloads/f8ac90b19c90fe8ce59d56fa26d9c92ba19c31cbd83a80b10a875837b14d55cc.94445c194f4fa7081632829299edde11e6e705959b4711f3cf68f320e19fdb3a
11/29/2020 23:01:24 - INFO - nlp.utils.file_utils -   creating metadata file for /root/.cache/huggingface/datasets/downloads/f8ac90b19c90fe8ce59d56fa26d9c92ba19c31cbd83a80b10a875837b14d55cc.94445c194f4fa7081632829299edde11e6e705959b4711f3cf68f320e19fdb3a
11/29/2020 23:01:24 - INFO - filelock -   Lock 139878226541032 released on /root/.cache/huggingface/datasets/downloads/f8ac90b19c90fe8ce59d56fa26d9c92ba19c31cbd83a80b10a875837b14d55cc.94445c194f4fa7081632829299edde11e6e705959b4711f3cf68f320e19fdb3a.lock
11/29/2020 23:01:24 - INFO - filelock -   Lock 139878226542488 acquired on /root/.cache/huggingface/datasets/downloads/fe6a5184d14e087b1cbb56e3cea5ceb0a24dd497ee8f82dcf4963e20aa0c5f




11/29/2020 23:01:24 - INFO - nlp.utils.file_utils -   storing https://s3.amazonaws.com/commensenseqa/test_rand_split_no_answers.jsonl in cache at /root/.cache/huggingface/datasets/downloads/fe6a5184d14e087b1cbb56e3cea5ceb0a24dd497ee8f82dcf4963e20aa0c5fe5.9138d3568a6b3ae08fbf85c1057048c6d23f4daff8df252f1e85a643ef03c83a
11/29/2020 23:01:24 - INFO - nlp.utils.file_utils -   creating metadata file for /root/.cache/huggingface/datasets/downloads/fe6a5184d14e087b1cbb56e3cea5ceb0a24dd497ee8f82dcf4963e20aa0c5fe5.9138d3568a6b3ae08fbf85c1057048c6d23f4daff8df252f1e85a643ef03c83a
11/29/2020 23:01:24 - INFO - filelock -   Lock 139878226542488 released on /root/.cache/huggingface/datasets/downloads/fe6a5184d14e087b1cbb56e3cea5ceb0a24dd497ee8f82dcf4963e20aa0c5fe5.9138d3568a6b3ae08fbf85c1057048c6d23f4daff8df252f1e85a643ef03c83a.lock





11/29/2020 23:01:24 - INFO - filelock -   Lock 139878226540920 acquired on /root/.cache/huggingface/datasets/downloads/712467f293862605fd78116a1c37eea362706de7051d865440c01296b042e1d0.babf491afa13fc9c974882ffa47e4a1707c82f5cb6f95ba74adeda5c380e222e.lock
11/29/2020 23:01:24 - INFO - nlp.utils.file_utils -   https://s3.amazonaws.com/commensenseqa/dev_rand_split.jsonl not found in cache or force_download set to True, downloading to /root/.cache/huggingface/datasets/downloads/tmp3u_qnm5w


11/29/2020 23:01:24 - INFO - nlp.utils.file_utils -   storing https://s3.amazonaws.com/commensenseqa/dev_rand_split.jsonl in cache at /root/.cache/huggingface/datasets/downloads/712467f293862605fd78116a1c37eea362706de7051d865440c01296b042e1d0.babf491afa13fc9c974882ffa47e4a1707c82f5cb6f95ba74adeda5c380e222e
11/29/2020 23:01:24 - INFO - nlp.utils.file_utils -   creating metadata file for /root/.cache/huggingface/datasets/downloads/712467f293862605fd78116a1c37eea362706de7051d865440c01296b042e1d0.babf491afa13fc9c974882ffa47e4a1707c82f5cb6f95ba74adeda5c380e222e
11/29/2020 23:01:24 - INFO - filelock -   Lock 139878226540920 released on /root/.cache/huggingface/datasets/downloads/712467f293862605fd78116a1c37eea362706de7051d865440c01296b042e1d0.babf491afa13fc9c974882ffa47e4a1707c82f5cb6f95ba74adeda5c380e222e.lock





11/29/2020 23:01:24 - INFO - nlp.utils.info_utils -   All the checksums matched successfully for dataset source files
11/29/2020 23:01:24 - INFO - nlp.builder -   Generating split train


11/29/2020 23:01:25 - INFO - nlp.arrow_writer -   Done writing 9741 examples in 1736365 bytes /root/.cache/huggingface/datasets/commonsense_qa/default/0.1.0/a3ca6d89a0ea731b712148dc01a2d5679a3bffd6af0896468b02834325cc2bf9.incomplete/commonsense_qa-train.arrow.
11/29/2020 23:01:25 - INFO - nlp.builder -   Generating split validation




11/29/2020 23:01:25 - INFO - nlp.arrow_writer -   Done writing 1221 examples in 215057 bytes /root/.cache/huggingface/datasets/commonsense_qa/default/0.1.0/a3ca6d89a0ea731b712148dc01a2d5679a3bffd6af0896468b02834325cc2bf9.incomplete/commonsense_qa-validation.arrow.
11/29/2020 23:01:25 - INFO - nlp.builder -   Generating split test




11/29/2020 23:01:25 - INFO - nlp.arrow_writer -   Done writing 1140 examples in 202782 bytes /root/.cache/huggingface/datasets/commonsense_qa/default/0.1.0/a3ca6d89a0ea731b712148dc01a2d5679a3bffd6af0896468b02834325cc2bf9.incomplete/commonsense_qa-test.arrow.
11/29/2020 23:01:25 - INFO - nlp.utils.info_utils -   All the splits matched successfully.
11/29/2020 23:01:25 - INFO - nlp.builder -   Constructing Dataset for split train, from /root/.cache/huggingface/datasets/commonsense_qa/default/0.1.0/a3ca6d89a0ea731b712148dc01a2d5679a3bffd6af0896468b02834325cc2bf9
11/29/2020 23:01:25 - INFO - nlp.utils.info_utils -   Unable to verify checksums.


Dataset commonsense_qa downloaded and prepared to /root/.cache/huggingface/datasets/commonsense_qa/default/0.1.0/a3ca6d89a0ea731b712148dc01a2d5679a3bffd6af0896468b02834325cc2bf9. Subsequent calls will reuse this data.


11/29/2020 23:01:25 - INFO - nlp.load -   Checking /root/.cache/huggingface/datasets/ea29813e78501904688a90c430f89aca3126b45c3f3f072c4a4246096a5ad0ef.29f2def772f6505fbbc27537a807141066e57bf70f50bf1e0eeb544f7eb5cf3a.py for additional imports.
11/29/2020 23:01:25 - INFO - filelock -   Lock 139878226594000 acquired on /root/.cache/huggingface/datasets/ea29813e78501904688a90c430f89aca3126b45c3f3f072c4a4246096a5ad0ef.29f2def772f6505fbbc27537a807141066e57bf70f50bf1e0eeb544f7eb5cf3a.py.lock
11/29/2020 23:01:25 - INFO - nlp.load -   Found main folder for dataset https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/commonsense_qa/commonsense_qa.py at /usr/local/lib/python3.6/dist-packages/nlp/datasets/commonsense_qa
11/29/2020 23:01:25 - INFO - nlp.load -   Found specific version folder for dataset https://s3.amazonaws.com/datasets.huggingface.co/nlp/datasets/commonsense_qa/commonsense_qa.py at /usr/local/lib/python3.6/dist-packages/nlp/datasets/commonsense_qa/a3ca6d89a0ea731b712148dc0

In [39]:
train_dataset[0]["choices"]

{'label': ['A', 'B', 'C', 'D', 'E'],
 'text': ['ignore', 'enforce', 'authoritarian', 'yell at', 'avoid']}

# BERT base model (uncased)

From: https://huggingface.co/bert-base-uncased

> Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is uncased: it does not make a difference between english and English.
> 
> Disclaimer: The team releasing BERT did not write a model card for this model so this model card has been written by the Hugging Face team.

For each question, there are five answer choices. Only one of them is correct.

For BERT, the first thought was to have all five answers attached to each question, and the model would choose one of the five responses. This is how it's originally done in the CommonsenseQA paper.

```
[CLS] Question text here [SEP] Ans choice A [SEP] Ans choice B [SEP] Ans choice C [SEP] Ans choice D [SEP] Ans choice E [SEP]
```

That layout seems complicated, however, and would take significant work to implement. So for now, we instead create five question-answer pairs for each question, one per answer choice:

```
[CLS] Question text here [SEP] A Ans choice [SEP]
[CLS] Question text here [SEP] B Ans choice [SEP]
[CLS] Question text here [SEP] C Ans choice [SEP]
[CLS] Question text here [SEP] D Ans choice [SEP]
[CLS] Question text here [SEP] E Ans choice [SEP]
```

Only one of these five inputs is labeled as the correct answer; the other four are labeled 0. A limitation of this setup is that BERT encodes each choice separately, so no choice can attend to the others. The choices are compared only at the end, when `BertForMultipleChoice` applies a softmax across the five per-choice scores.
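
The five-pair layout above can be sketched in plain Python, without loading the tokenizer. The choices are the ones printed from `train_dataset[0]` earlier; the question text and the `answerKey` value `"A"` are assumptions made for illustration, and `build_pairs` is a hypothetical helper that is not part of this notebook.

```python
# Illustrative record; "question" and "answerKey" are assumed values.
record = {
    "question": "The sanctions against the school were a punishing blow, "
                "and they seemed to what the efforts the school had made to change?",
    "choices": {
        "label": ["A", "B", "C", "D", "E"],
        "text": ["ignore", "enforce", "authoritarian", "yell at", "avoid"],
    },
    "answerKey": "A",
}


def build_pairs(rec):
    """Return one (sequence, is_correct) pair per answer choice."""
    pairs = []
    for label, text in zip(rec["choices"]["label"], rec["choices"]["text"]):
        # Written out as a string only to show the layout; the real pipeline
        # works on token lists rather than strings.
        sequence = f"[CLS] {rec['question']} [SEP] {label} {text} [SEP]"
        pairs.append((sequence, int(label == rec["answerKey"])))
    return pairs


pairs = build_pairs(record)
for sequence, is_correct in pairs:
    print(is_correct, sequence)
```

Exactly one of the five pairs carries the label 1, which is the property the classifier has to recover.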


In [45]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
lab_order = {"A": 0, "B":1, "C":2, "D":3, "E":4}

class InputExample(object):
    """A single multiple choice question and its five multiple choice answer candidates"""
    # This class is adapted from https://github.com/jonathanherzig/commonsenseqa/blob/master/bert/run_commonsense_qa.py
    # and from https://github.com/rodgzilla/pytorch-pretrained-BERT/blob/dcb50eaa4b80d3ab75d373c36780c80fb47cfd97/examples/run_swag.py

    def __init__(
            self,
            question,
            choice_0,
            choice_1,
            choice_2,
            choice_3,
            choice_4,
            label=None):
        """Construct an instance."""
        self.question = question  # e.g., 'The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?'
        self.choices = [          # All five answer choices as a list
            choice_0,
            choice_1,
            choice_2,
            choice_3,
            choice_4
        ]
        self.label = label        # Index (0-4) of the correct choice, or None if unknown
        
    def __str__(self):
        return self.__repr__()

    def __repr__(self):
        l = [
            f"question: {self.question}",
            f"choice_0: {self.choices[0]}",
            f"choice_1: {self.choices[1]}",
            f"choice_2: {self.choices[2]}",
            f"choice_3: {self.choices[3]}",
            f"choice_4: {self.choices[4]}",
        ]

        if self.label is not None:
            l.append(f"label: {self.label}")

        return ", ".join(l)    

class InputFeatures(object):
    """Adapted from: https://github.com/rodgzilla/pytorch-pretrained-BERT/blob/dcb50eaa4b80d3ab75d373c36780c80fb47cfd97/examples/run_swag.py
    Stores Bert model inputs (ids, masks) for each example"""
    
    def __init__(self,
                 choices_features,
                 label

    ):
        self.choices_features = [
            {
                'input_ids': input_ids,
                'input_mask': input_mask,
                'segment_ids': segment_ids
            }
            for _, input_ids, input_mask, segment_ids in choices_features
        ]
        self.label = label

    
def process_examples(data):
    """Convert dataset rows (dicts with 'question', 'choices', and 'answerKey')
    into a list of InputExample objects."""
    examples = []

    for row in data:
        # Each choice is prefixed with its letter, e.g. "A ignore"
        example = InputExample(
                    question=row["question"],
                    choice_0=str(row["choices"]["label"][0] + " " + row["choices"]["text"][0]),
                    choice_1=str(row["choices"]["label"][1] + " " + row["choices"]["text"][1]),
                    choice_2=str(row["choices"]["label"][2] + " " + row["choices"]["text"][2]),
                    choice_3=str(row["choices"]["label"][3] + " " + row["choices"]["text"][3]),
                    choice_4=str(row["choices"]["label"][4] + " " + row["choices"]["text"][4]),
                    label=lab_order[row["answerKey"]]
                )
        examples.append(example)

    return examples

def convert_examples_to_features(examples, tokenizer, max_seq_length, is_training):
    # For each question, we generate five inputs: one for each answer choice.
    # Each choice string already starts with its letter (A-E).
    # `is_training` is not used in this function.

    # - [CLS] question [SEP] A <choice_0 text> [SEP]
    # - [CLS] question [SEP] B <choice_1 text> [SEP]
    # - [CLS] question [SEP] C <choice_2 text> [SEP]
    # - [CLS] question [SEP] D <choice_3 text> [SEP]
    # - [CLS] question [SEP] E <choice_4 text> [SEP]
    
    features = []
    # Loop through questions
    for example_index, example in enumerate(examples):
        question_tokens = tokenizer.tokenize(example.question)

        choices_features = []
        # For each question, loop through all answer choices 
        for choice_index, choice in enumerate(example.choices):
            # We create a copy of the question tokens in order to be
            # able to shrink it according to choice_tokens
            question_tokens_choice = question_tokens[:]
            choice_tokens = tokenizer.tokenize(choice)
            # Modifies `question_tokens_choice` and `choice_tokens` in
            # place so that the total length is less than the
            # specified length.  Account for [CLS], [SEP], [SEP] with
            # "- 3"
            _truncate_seq_pair(question_tokens_choice, choice_tokens, max_seq_length - 3)

            tokens = ["[CLS]"] + question_tokens_choice + ["[SEP]"] + choice_tokens + ["[SEP]"]
            segment_ids = [0] * (len(question_tokens_choice) + 2) + [1] * (len(choice_tokens) + 1)

            input_ids = tokenizer.convert_tokens_to_ids(tokens)
            input_mask = [1] * len(input_ids)

            # Zero-pad up to the sequence length.
            padding = [0] * (max_seq_length - len(input_ids))
            input_ids += padding
            input_mask += padding
            segment_ids += padding

            assert len(input_ids) == max_seq_length
            assert len(input_mask) == max_seq_length
            assert len(segment_ids) == max_seq_length

            choices_features.append((tokens, input_ids, input_mask, segment_ids))

        label = example.label

        features.append(
            InputFeatures(
                choices_features = choices_features,
                label = label
            )
        )

    return features


def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()
            
def select_field(features, field):
    """Returns a nested list indexed as [example][choice][position] that holds
    the requested field ('input_ids', 'input_mask', or 'segment_ids')
    for every answer choice of every example."""
    return [
        [
            choice[field]   # Grab the feature array of that choice.
            for choice in feature.choices_features  # Loop through 5 choices of that example
        ]
        for feature in features   # loop through each example
    ]
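
The truncation and padding logic above can be traced on a toy pair. The sketch below is a simplified, dependency-free restatement rather than the notebook's own function: `encode_pair` is a hypothetical helper, the tokens are made up, and tokens stay as strings (with a literal `[PAD]` marker) instead of being converted to vocabulary ids.

```python
def encode_pair(question_tokens, choice_tokens, max_seq_length):
    """Truncate, join, and pad one question/choice pair (toy version)."""
    question = question_tokens[:]   # copy so the caller's lists stay untouched
    choice = choice_tokens[:]

    # Reserve 3 positions for [CLS], [SEP], [SEP]; pop from the longer list first.
    while len(question) + len(choice) > max_seq_length - 3:
        if len(question) > len(choice):
            question.pop()
        else:
            choice.pop()

    tokens = ["[CLS]"] + question + ["[SEP]"] + choice + ["[SEP]"]
    # Segment 0 covers [CLS] + question + first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(question) + 2) + [1] * (len(choice) + 1)
    input_mask = [1] * len(tokens)

    # Pad all three lists up to max_seq_length.
    pad = max_seq_length - len(tokens)
    tokens += ["[PAD]"] * pad
    segment_ids += [0] * pad
    input_mask += [0] * pad
    return tokens, input_mask, segment_ids


question = ["where", "would", "you", "find", "a", "book", "?"]   # made-up tokens
choice = ["a", "library"]

short = encode_pair(question, choice, max_seq_length=10)   # forces truncation
padded = encode_pair(question, choice, max_seq_length=14)  # forces padding
print(short[0])
print(padded[0], padded[1], padded[2], sep="\n")
```

With a budget of 10, the two trailing question tokens are dropped and no padding is needed; with a budget of 14, nothing is truncated and two padding positions are appended with mask 0.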



In [46]:
# Process inputs 

train_examples= process_examples(train_dataset)
train_features = convert_examples_to_features(
                    examples=train_examples, 
                    tokenizer=tokenizer, 
                    max_seq_length=50, 
                    is_training=True)

dev_examples= process_examples(valid_dataset)
dev_features = convert_examples_to_features(
                    examples=dev_examples, 
                    tokenizer=tokenizer, 
                    max_seq_length=50, 
                    is_training=True)


In [47]:
print("dev_examples:", len(dev_examples))
print("dev_features:", len(dev_features))
print("train_examples:", len(train_examples))
print("train_features:", len(train_features))

dev_examples: 1221
dev_features: 1221
train_examples: 9741
train_features: 9741
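
Each feature object holds five per-choice entries, so `select_field` produces a nested list indexed by example, then choice, then token position. The sketch below reproduces that nesting on toy data; `SimpleNamespace` stands in for `InputFeatures`, and the sizes are made up for illustration.

```python
from types import SimpleNamespace


def select_field(features, field):
    """Same nesting as the notebook's helper: [example][choice][position]."""
    return [
        [choice[field] for choice in feature.choices_features]
        for feature in features
    ]


num_examples, num_choices, seq_len = 3, 5, 4   # toy sizes, not the real ones
toy_features = [
    SimpleNamespace(
        choices_features=[{"input_ids": [0] * seq_len} for _ in range(num_choices)],
        label=0,
    )
    for _ in range(num_examples)
]

ids = select_field(toy_features, "input_ids")
shape = (len(ids), len(ids[0]), len(ids[0][0]))
print(shape)  # (3, 5, 4)
```

By the same reasoning, the 9741 training examples with `max_seq_length=50` give an `input_ids` tensor of shape `(9741, 5, 50)`.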


In [48]:
def create_inputs_from_features(features):
    input_ids = torch.tensor(select_field(features, 'input_ids'), dtype=torch.long)
    input_mask = torch.tensor(select_field(features, 'input_mask'), dtype=torch.long)
    segment_ids = torch.tensor(select_field(features, 'segment_ids'), dtype=torch.long)
    label = torch.tensor([f.label for f in features], dtype=torch.long)
    
    return input_ids, input_mask, segment_ids, label

def accuracy(out, labels):
    outputs = np.argmax(out, axis=1)
    return np.sum(outputs == labels)
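
Note that `accuracy` returns the number of correct predictions, not a fraction, so the caller has to divide by the number of examples. A dependency-free sketch of the same computation, with made-up scores and a hypothetical `count_correct` helper:

```python
def count_correct(logits, labels):
    """Number of rows whose highest-scoring choice matches the label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    return sum(int(p == y) for p, y in zip(predictions, labels))


# Three questions, five choice scores each (illustrative values).
logits = [
    [0.1, 2.0, -1.0, 0.3, 0.0],
    [1.5, 0.2, 0.1, 0.0, -0.4],
    [0.0, 0.1, 0.2, 0.3, 0.9],
]
labels = [1, 0, 2]

correct = count_correct(logits, labels)
fraction = correct / len(labels)
print(correct, round(fraction, 3))  # 2 0.667
```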


In [50]:
train_examples= process_examples(train_dataset)


In [51]:
# Uses the full training set; swap in a small subset here for quick experiments
train_features = convert_examples_to_features(
                    examples=train_examples, 
                    tokenizer=tokenizer, 
                    max_seq_length=50, 
                    is_training=True)

# Import Model


In [52]:
from transformers import BertForMultipleChoice

In [53]:
model = BertForMultipleChoice.from_pretrained('bert-base-uncased')

11/29/2020 23:14:02 - INFO - filelock -   Lock 139880709384120 acquired on /root/.cache/torch/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.637c6035640bacb831febcc2b7f7bee0a96f9b30c2d7e9ef84082d9f252f3170.lock


11/29/2020 23:14:02 - INFO - filelock -   Lock 139880709384120 released on /root/.cache/torch/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.637c6035640bacb831febcc2b7f7bee0a96f9b30c2d7e9ef84082d9f252f3170.lock
11/29/2020 23:14:02 - INFO - filelock -   Lock 139878227049384 acquired on /root/.cache/torch/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f.lock





11/29/2020 23:14:09 - INFO - filelock -   Lock 139878227049384 released on /root/.cache/torch/transformers/a8041bf617d7f94ea26d15e218abd04afc2004805632abc0ed2066aa16d50d04.faf6ea826ae9c5867d12b22257f9877e6b8367890837bd60f7c54a29633f7f2f.lock





Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMultipleChoice: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForMultipleChoice from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMultipleChoice from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMultipleChoice were not initialized from the model checkpoint at bert-base-uncased and are newly

In [54]:
model.to(device)

BertForMultipleChoice(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_aff

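The output is of the class `MultipleChoiceModelOutput`. It contains the following fields:

    loss            # returned only when `labels` is passed
    logits          # one score per answer choice, shape (batch_size, num_choices)
    hidden_states   # optional, returned only when requested
    attentions      # optional, returned only when requested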
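
To make the `logits` and `loss` fields concrete: the model produces one score per answer choice, and the loss is the cross-entropy of the softmax over those scores against the correct index. The sketch below works through one question in plain Python; the scores and the label are made-up values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 3.0, 0.5, -1.0, 0.0]   # one made-up score per choice A-E
label = 1                             # assume choice B is correct

probs = softmax(logits)
loss = -math.log(probs[label])        # cross-entropy for this one question
prediction = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 3) for p in probs])
print(round(loss, 3), prediction)
```

Because softmax is monotonic, taking the argmax of the raw logits gives the same prediction as taking it over the probabilities.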
        


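Let's fine-tune the model. Each training step does the following:

1. Forward pass: make predictions for a batch.
2. Calculate the loss.
3. Backward pass: compute the gradient of the loss with respect to all the learnable parameters of the model.
4. Optimizer step: update the parameters using the accumulated gradients, then reset the gradients to zero.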
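
The steps above can be traced by hand on a toy problem. The sketch below is not BERT: it fits a single weight `w` in `y = w * x` with a squared-error loss, computes the gradient analytically, and applies a plain gradient-descent update. The data and learning rate are made up for illustration.

```python
w = 0.0
learning_rate = 0.1
data = [(1.0, 3.0), (2.0, 6.0)]   # generated from y = 3x

for epoch in range(50):
    for x, y in data:
        prediction = w * x                   # 1. forward pass
        loss = (prediction - y) ** 2         # 2. loss
        grad_w = 2 * (prediction - y) * x    # 3. backward pass: d(loss)/d(w)
        w -= learning_rate * grad_w          # parameter update

print(round(w, 6))  # 3.0
```

In the notebook, `loss.backward()` plays the role of the analytic gradient and `optimizer.step()` plays the role of the update line.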


In [55]:
# Set up optimizer 
from pytorch_pretrained_bert.optimization import BertAdam

no_decay = ['bias', 'gamma', 'beta']
num_train_steps = int(len(train_examples) / args.train_batch_size / args.gradient_accumulation_steps * args.num_train_epochs)

t_total = num_train_steps
param_optimizer = list(model.named_parameters())
optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay_rate': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay_rate': 0.0}
        ]
optimizer = BertAdam(optimizer_grouped_parameters,
                         lr=args.learning_rate,
                         warmup=args.warmup_proportion,
                         t_total=t_total)

11/29/2020 23:14:27 - INFO - pytorch_pretrained_bert.modeling -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
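
The step count passed to the optimizer follows from the dataset size and the settings in `args`. The sketch below repeats that arithmetic with the values used in this notebook (9741 training examples, batch size 16, 4 accumulation steps, 3 epochs); the variable names are local to the example.

```python
import math

num_train_examples = 9741
train_batch_size = 16
gradient_accumulation_steps = 4
num_train_epochs = 3

# Batches the DataLoader yields per epoch (the last batch is smaller).
batches_per_epoch = math.ceil(num_train_examples / train_batch_size)
# Optimizer updates per epoch: one for every 4 batches.
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
# Same formula as num_train_steps in the cell above.
num_train_steps = int(num_train_examples / train_batch_size
                      / gradient_accumulation_steps * num_train_epochs)
# Examples contributing to each optimizer update.
effective_batch_size = train_batch_size * gradient_accumulation_steps

print(batches_per_epoch)     # 609
print(updates_per_epoch)     # 152
print(num_train_steps)       # 456
print(effective_batch_size)  # 64
```

The 609 batches per epoch match the iteration count shown by the training progress bar, and 152 updates over 3 epochs gives the same 456 steps as `t_total`.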


In [56]:
# Training loop: loss and train accuracy are logged to TensorBoard at every optimizer update.

num_epochs=args.num_train_epochs

# Prepare data by processing features into inputs
all_input_ids = torch.tensor(select_field(train_features, 'input_ids'), dtype=torch.long)
all_input_mask = torch.tensor(select_field(train_features, 'input_mask'), dtype=torch.long)
all_segment_ids = torch.tensor(select_field(train_features, 'segment_ids'), dtype=torch.long)
all_label = torch.tensor([f.label for f in train_features], dtype=torch.long)

# Package it all nicely into a TensorDataset 
train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label)

# Then package it into a DataLoader that can be batched and processed 
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size)

global_step = 0
model.train()  # Models are initialized in eval mode by default. We can call model.train() to put it in train mode.
for epoch in range(num_epochs):
  # Reset for epoch: tr loss, number of training examples, training steps 
  tr_loss = 0
  nb_tr_examples, nb_tr_steps = 0, 0

  print("epoch", datetime.now(pytz.timezone('US/Pacific')).strftime("%Y%m%d_%H%M%S"))

  for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration")):
    # Create a batch 
    batch = tuple(t.to(device) for t in batch)
    input_ids, input_mask, segment_ids, label_ids = batch

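    # Forward pass: the model returns the loss and the logits for each answer choice
    output = model(input_ids=input_ids,
                 attention_mask=input_mask,
                 token_type_ids=segment_ids,
                 labels=label_ids,
                 return_dict=True)
    loss = output["loss"]
    # Scale the loss so gradients accumulated over several batches are averaged, not summed
    if args.gradient_accumulation_steps > 1:
      loss = loss / args.gradient_accumulation_steps

    # Backward pass: compute gradients
    loss.backward()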
    # print("\nbackprop!", epoch, step)

    # add to tr loss
    tr_loss += loss.item()
    nb_tr_examples += input_ids.size(0)
    nb_tr_steps += 1
    
    if (step + 1) % args.gradient_accumulation_steps == 0:   # Update the weights only once every gradient_accumulation_steps batches
      # update
      optimizer.step()
      # reset gradient
      model.zero_grad()
      global_step += 1

      # Predicted answer choice for each question in the current batch.
      # argmax over the logits gives the same result as argmax over softmax probabilities.
      train_predictions = torch.argmax(output["logits"], dim=1)
      # Accuracy on the current training batch
      train_accuracy = (train_predictions == label_ids).float().mean().item()
      
      # log loss 
      # print("\nLogging:", global_step)
      writer.add_scalar('training loss', 
                      loss.item(), 
                      global_step)
      # Log accuracy against train data
      writer.add_scalar('train accuracy', train_accuracy, global_step)



Iteration:   0%|          | 0/609 [00:00<?, ?it/s]

epoch 20201129_151507


Iteration: 100%|██████████| 609/609 [06:28<00:00,  1.57it/s]
Iteration:   0%|          | 0/609 [00:00<?, ?it/s]

epoch 20201129_152135


Iteration: 100%|██████████| 609/609 [06:32<00:00,  1.55it/s]
Iteration:   0%|          | 0/609 [00:00<?, ?it/s]

epoch 20201129_152808


Iteration: 100%|██████████| 609/609 [06:32<00:00,  1.55it/s]
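
The training loop only calls `optimizer.step()` once every `args.gradient_accumulation_steps` batches, letting gradients add up in between. The sketch below, on a made-up linear model with random data, illustrates why each small-batch loss is commonly divided by the number of accumulated steps: the accumulated gradient then matches the gradient of one large batch. `make_model` and all sizes are hypothetical.

```python
import torch

torch.manual_seed(0)
inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)

def make_model():
    torch.manual_seed(1)   # identical initial weights on every call
    return torch.nn.Linear(4, 1)

# Reference: a single backward pass over the full batch of 8 examples.
full = make_model()
torch.nn.functional.mse_loss(full(inputs), targets).backward()

# Accumulation: 4 micro-batches of 2 examples, each loss divided by the number of micro-batches.
accumulation_steps = 4
accum = make_model()
for chunk_x, chunk_y in zip(inputs.chunk(accumulation_steps), targets.chunk(accumulation_steps)):
    loss = torch.nn.functional.mse_loss(accum(chunk_x), chunk_y) / accumulation_steps
    loss.backward()   # gradients add up in .grad until zero_grad() is called

grads_match = torch.allclose(full.weight.grad, accum.weight.grad, atol=1e-6)
print(grads_match)
```

Without the division, the accumulated gradient would be `accumulation_steps` times larger, which acts like a larger learning rate.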


In [None]:
all_input_ids = torch.tensor(select_field(dev_features, 'input_ids'), dtype=torch.long)
all_input_mask = torch.tensor(select_field(dev_features, 'input_mask'), dtype=torch.long)
all_segment_ids = torch.tensor(select_field(dev_features, 'segment_ids'), dtype=torch.long)
all_label = torch.tensor([f.label for f in dev_features], dtype=torch.long)
eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label)

# Run prediction for full data
eval_sampler = SequentialSampler(eval_data)
eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size)

model.eval()
eval_loss, eval_accuracy = 0, 0
nb_eval_steps, nb_eval_examples = 0, 0
all_logits = []  # logits of every dev batch, kept so predictions cover the full dev set
for input_ids, input_mask, segment_ids, label_ids in eval_dataloader:
    input_ids = input_ids.to(device)
    input_mask = input_mask.to(device)
    segment_ids = segment_ids.to(device)
    label_ids = label_ids.to(device)

    # The dev set is only used for evaluation, so disable gradient tracking.
    with torch.no_grad():
        eval_output = model(
            input_ids=input_ids,
            attention_mask=input_mask,
            token_type_ids=segment_ids,
            labels=label_ids,
            return_dict=True,
        )
        tmp_eval_loss = eval_output["loss"]
        logits = eval_output["logits"]

    all_logits.append(logits.detach().cpu())
    logits = logits.detach().cpu().numpy()
    label_ids = label_ids.to('cpu').numpy()
    tmp_eval_accuracy = accuracy(logits, label_ids)

    eval_loss += tmp_eval_loss.mean().item()
    eval_accuracy += tmp_eval_accuracy

    nb_eval_examples += input_ids.size(0)
    nb_eval_steps += 1

eval_loss = eval_loss / nb_eval_steps
# Dividing by the number of examples assumes accuracy() returns a count of correct predictions
eval_accuracy = eval_accuracy / nb_eval_examples

# Predicted answer choice (highest logit) for each question in the dev set
dev_predictions = torch.argmax(torch.cat(all_logits, dim=0), dim=1)

result = {'eval_loss': eval_loss,
          'eval_accuracy': eval_accuracy,
          'global_step': global_step#,
          #'loss': tr_loss/nb_tr_steps
          }

output_eval_file = "eval_results_{}.txt".format(NAME)
# Use a separate name for the file handle so the TensorBoard `writer` is not overwritten
with open(output_eval_file, "w") as eval_file:
    logger.info("***** Eval results *****")
    for key in sorted(result.keys()):
        logger.info("  %s = %s", key, str(result[key]))
        eval_file.write("%s = %s\n" % (key, str(result[key])))
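
The evaluation cell sums the value returned by `accuracy` over batches and divides by the number of examples, which gives the right result only if the helper returns a count of correct predictions per batch. The sketch below shows that aggregation in plain Python. `count_correct` is a hypothetical stand-in, not the notebook's `accuracy` function, and the logits are made up.

```python
def count_correct(logits, labels):
    """Hypothetical helper: number of rows whose highest-scoring choice equals the label."""
    predictions = [row.index(max(row)) for row in logits]
    return sum(int(p == y) for p, y in zip(predictions, labels))

# Two made-up batches of (logits, labels) with three answer choices each.
batches = [
    ([[0.1, 2.0, 0.3], [1.5, 0.2, 0.1]], [1, 0]),   # 2 examples, both predicted correctly
    ([[0.9, 0.1, 0.0]], [2]),                       # 1 example, predicted incorrectly
]

total_correct = 0
total_examples = 0
for logits, labels in batches:
    total_correct += count_correct(logits, labels)
    total_examples += len(labels)

accuracy_value = total_correct / total_examples
print(accuracy_value)
```

Summing counts and dividing once stays correct when the last batch is smaller than the others, whereas averaging per-batch accuracies would weight that batch too heavily.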


In [58]:
NAME

'BertForMultipleChoice__full_allgrad_newlr_b16_20201129_145722'


**BertForMultipleChoice__full_allgrad_newlr_b16_20201028_210307**

    10/29/2020 05:54:54 - INFO - __main__ -   ***** Eval results *****
    10/29/2020 05:54:54 - INFO - __main__ -     eval_accuracy = 0.5634725634725635
    10/29/2020 05:54:54 - INFO - __main__ -     eval_loss = 1.3910671822191063
    10/29/2020 05:54:54 - INFO - __main__ -     global_step = 1827


In [62]:
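# Save model weights
torch.save(model.state_dict(), "models/" + NAME)

# Save dev predictions
# Note: eval_output holds the model output for the last dev batch only
fordump = eval_output
out_dir = "models/"  # named out_dir because `dir` would shadow a Python builtin
filename = "{NAME}_dev_predictions".format(NAME=NAME)
with open(out_dir + filename, "wb") as pickle_out:
    pickle.dump(fordump, pickle_out)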
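
A file written with `torch.save(model.state_dict(), ...)` contains only the weights, so reloading requires first building a model with the same architecture. The sketch below shows the round trip on a toy layer. The model, file name, and temporary directory are assumptions for illustration, not the notebook's paths.

```python
import os
import tempfile

import torch

toy_model = torch.nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pt")
    torch.save(toy_model.state_dict(), path)

    # The restored model must be constructed with the same architecture first.
    restored = torch.nn.Linear(4, 2)
    restored.load_state_dict(torch.load(path, map_location="cpu"))

restored.eval()  # switch to eval mode before running inference
same_weights = torch.equal(toy_model.weight, restored.weight)
print(same_weights)
```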

In [None]:
%tensorboard --logdir "runs/"