# BERT - Out of the Box

In this notebook, we will test the performance of an out-of-the-box BERT model on CommonsenseQA. I follow the tutorial here: https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb

I use the Hugging Face `transformers` library.

I referred to the CommonsenseQA repo and code to understand how the authors established their BERT baseline. This is the link to their code: https://github.com/jonathanherzig/commonsenseqa/blob/master/bert/run_commonsense_qa.py

From this repo (README): https://github.com/jonathanherzig/commonsenseqa

Their work is more advanced and complicated than what I want to do at this point, so I refer to it mainly to understand the setup.

In [12]:
!pip install pytorch_pretrained_bert



In [13]:
!pip install urllib3==1.25.10



In [14]:
!pip install transformers



In [15]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My\ Drive/NLP/w266-commonsenseqa/BERT_oob

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/MIDS/NLP/w266-commonsenseqa/BERT_oob


In [16]:
import logging
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

import json
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0

from transformers import BertTokenizer, BertModel, BertConfig
import torch
from torch.utils.tensorboard import SummaryWriter
from sklearn import metrics

from datetime import datetime
import pytz
# configuration = BertConfig() 
from collections import defaultdict 
import pickle 

In [17]:
ts = datetime.now(pytz.timezone('US/Pacific')).strftime("%Y%m%d_%H%M%S")


In [18]:
runtype="full" # versus tiny
NAME = 'BertForMultipleChoice__{runtype}_{ts}'.format(runtype=runtype, ts=ts)
# Logs for tensorboard will be saved in the following directory 
writer = SummaryWriter("runs/"+ NAME)

print("Model NAME:", NAME)

Model NAME: BertForMultipleChoice__full_20201026_014420


In [19]:
# To use tensorboard in Google Colab, run this:
%load_ext tensorboard

# Tensorboard can be viewed with the following command
# %tensorboard --logdir logs

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [20]:
# These settings were meant to be passed in as command-line arguments to a .py script.
# I copied them over from the BertForMultipleChoice example doc.
# Since I'm running in a Jupyter notebook rather than a .py script,
# I created a class to hold all the args.
# Adapted from https://github.com/rodgzilla/pytorch-pretrained-BERT/blob/dcb50eaa4b80d3ab75d373c36780c80fb47cfd97/examples/run_swag.py

logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.INFO)
logger = logging.getLogger(__name__)

class arg_holder():
    def __init__(self):
        self.data_dir = '../dataset/'
        self.output_dir = 'bfmc/'
        self.bert_model = 'bert-base-uncased'
        
        self.max_seq_length = 128
        self.do_train = True
        self.do_eval = False
        self.do_lower_case = True
        self.train_batch_size = 32
        self.eval_batch_size = 8
        self.learning_rate = 5e-5
        self.num_train_epochs = 3
        self.warmup_proportion = 0
        self.no_cuda = False
        self.local_rank = -1
        self.seed = 42
        self.gradient_accumulation_steps = 1
        self.optimize_on_cpu=False
        self.fp16 = False
        self.loss_scale = 128
args = arg_holder()
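
A dataclass is one alternative to the hand-written holder above. The sketch below is only illustrative: `RunArgs`, `demo_args`, and the subset of fields are my own naming, not part of the script this was adapted from.

```python
from dataclasses import dataclass, asdict


@dataclass
class RunArgs:
    """Hypothetical stand-in for command-line arguments (subset of the fields above)."""
    data_dir: str = "../dataset/"
    output_dir: str = "bfmc/"
    bert_model: str = "bert-base-uncased"
    max_seq_length: int = 128
    train_batch_size: int = 32
    learning_rate: float = 5e-5
    num_train_epochs: int = 3
    seed: int = 42


# Override a single default and keep the rest
demo_args = RunArgs(max_seq_length=50)
print(asdict(demo_args))
```

The defaults live in one place, and `asdict` gives a plain dictionary that is easy to log alongside a run.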

In [21]:
from tqdm import tqdm, trange

from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from torch.utils.data.distributed import DistributedSampler

# from pytorch_pretrained_bert.tokenization import BertTokenizer
# from pytorch_pretrained_bert.modeling import BertForMultipleChoice
# from pytorch_pretrained_bert.optimization import BertAdam
# from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE


In [22]:
if args.local_rank == -1 or args.no_cuda:
    device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
    n_gpu = torch.cuda.device_count()
else:
    device = torch.device("cuda", args.local_rank)
    n_gpu = 1
    # Initializes the distributed backend, which takes care of synchronizing nodes/GPUs
    torch.distributed.init_process_group(backend='nccl')
    if args.fp16:
        logger.info("16-bits training currently not supported in distributed training")
        args.fp16 = False # (see https://github.com/pytorch/pytorch/pull/13496)
logger.info("device %s n_gpu %d distributed training %r", device, n_gpu, bool(args.local_rank != -1))


10/26/2020 08:44:20 - INFO - __main__ -   device cuda n_gpu 1 distributed training False


## Import dataset

The data is in the `../dataset/` folder, with one JSON object per line (`.jsonl`).

In [23]:
def load_data(file):
    lines = []
    with open(file, 'rb') as json_file:
        for json_line in json_file:
            lines.append(json.loads(json_line))
        data = json_normalize(lines)
        data.columns = data.columns.map(lambda x: x.split(".")[-1])
    return data
# os.chdir('w266-commonsenseqa/BERT_oob)
train = load_data('../dataset/train_rand_split.jsonl')
dev = load_data('../dataset/dev_rand_split.jsonl')
train.head()

| | answerKey | id | question_concept | choices | stem |
|---|---|---|---|---|---|
| 0 | A | 075e483d21c29a511267ef62bedc0461 | punishing | [{'label': 'A', 'text': 'ignore'}, {'label': '... | The sanctions against the school were a punish... |
| 1 | B | 61fe6e879ff18686d7552425a36344c8 | people | [{'label': 'A', 'text': 'race track'}, {'label... | Sammy wanted to go to where the people were. ... |
| 2 | A | 4c1cb0e95b99f72d55c068ba0255c54d | choker | [{'label': 'A', 'text': 'jewelry store'}, {'la... | To locate a choker not located in a jewelry bo... |
| 3 | D | 02e821a3e53cb320790950aab4489e85 | highway | [{'label': 'A', 'text': 'united states'}, {'la... | Google Maps and other highway and street GPS s... |
| 4 | C | 23505889b94e880c3e89cff4ba119860 | fox | [{'label': 'A', 'text': 'pretty flowers.'}, {'... | The fox walked from the city into the forest, ... |
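
For reference, here is a minimal sketch of what one line of the JSONL file looks like and how it flattens into the columns shown above. The nested layout is inferred from the flattened column names, so treat it as an assumption, and the answer texts other than `ignore` are placeholders I made up.

```python
import json

# One JSONL line. Layout inferred from the column names above (assumption);
# choices B-E are placeholder texts, not the real dataset values.
line = json.dumps({
    "answerKey": "A",
    "id": "075e483d21c29a511267ef62bedc0461",
    "question": {
        "question_concept": "punishing",
        "choices": [
            {"label": "A", "text": "ignore"},
            {"label": "B", "text": "placeholder b"},
            {"label": "C", "text": "placeholder c"},
            {"label": "D", "text": "placeholder d"},
            {"label": "E", "text": "placeholder e"},
        ],
        "stem": "The sanctions against the school were a punishing blow, "
                "and they seemed to what the efforts the school had made to change?",
    },
})


def flatten(record):
    """Lift the nested `question` fields to the top level,
    mimicking json_normalize followed by the column rename."""
    flat = {key: value for key, value in record.items() if key != "question"}
    flat.update(record["question"])
    return flat


row = flatten(json.loads(line))
choice_texts = [choice["text"] for choice in row["choices"]]
print(sorted(row))
print(choice_texts)
```

Note that `choices` stays a list of dicts after flattening, so the answer text has to be pulled out of the `text` key.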


## Steps

1. Import training examples
2. Process them
    - Format the input into something BERT can work with, including `[CLS]` and `[SEP]`
    - Decide how to encode which answer choice is correct (the label)
    - Tokenize
    - Create an output layer using softmax
3. Train it
    - Specify how many layers of BERT to fine-tune
    

# BERT base model (uncased)

From: https://huggingface.co/bert-base-uncased

> Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is uncased: it does not make a difference between english and English.
> 
> Disclaimer: The team releasing BERT did not write a model card for this model so this model card has been written by the Hugging Face team.

For each question, there are five answer choices. Only one of them is correct.

For BERT, the first thought was to have all five answers attached to each question, and the model would choose one of the five responses. This is how it's originally done in the CommonsenseQA paper.

```
[CLS] Question text here [SEP] Ans choice A [SEP] Ans choice B [SEP] Ans choice C [SEP] Ans choice D [SEP] Ans choice E [SEP]
```

That seems complicated, however, and would take significant effort. So for now, I'll create five question-answer pairs for each question, like this:

```
[CLS] Question text here [SEP] Ans choice A [SEP]
[CLS] Question text here [SEP] Ans choice B [SEP]
[CLS] Question text here [SEP] Ans choice C [SEP]
[CLS] Question text here [SEP] Ans choice D [SEP]
[CLS] Question text here [SEP] Ans choice E [SEP]
```
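
To make the format concrete, here is a small sketch that builds the five question-answer inputs with segment ids and padding. It uses whitespace splitting as a stand-in for BERT's WordPiece tokenizer, and `build_pair` is a hypothetical helper, so the token counts will not match the real tokenizer. The question and choices are made up.

```python
def build_pair(question, choice, max_len):
    """Return (tokens, segment_ids, input_mask), each padded to max_len."""
    q_tokens = question.lower().split()   # stand-in for WordPiece tokenization
    c_tokens = choice.lower().split()
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + c_tokens + ["[SEP]"]
    # Segment 0: [CLS] + question + first [SEP]; segment 1: choice + final [SEP]
    segment_ids = [0] * (len(q_tokens) + 2) + [1] * (len(c_tokens) + 1)
    input_mask = [1] * len(tokens)        # 1 marks real tokens, 0 marks padding
    pad = max_len - len(tokens)
    if pad < 0:
        raise ValueError("sequence longer than max_len; truncate first")
    return tokens + ["[PAD]"] * pad, segment_ids + [0] * pad, input_mask + [0] * pad


question = "Where would you store a jar of jam?"
choices = ["pantry", "race track", "desert", "library", "ocean"]  # made-up example

pairs = [build_pair(question, choice, max_len=16) for choice in choices]
for tokens, segment_ids, input_mask in pairs:
    # Print only the real (unpadded) tokens
    print(tokens[:sum(input_mask)])
```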

Only one of the five inputs above gets a positive label (1) for being the correct answer; the rest get 0. My concern with this setup is that each choice is evaluated separately, to see whether it looks like a right answer at all. I think it's also important for the model to know how the answer choices compare to each other.
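
One note on that concern: with `BertForMultipleChoice`, each pair is scored separately, but the five scores for a question are then normalized together with a softmax and trained with cross-entropy, so the choices do compete with each other at the loss level. A framework-free sketch of that last step, with made-up scores:

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, label):
    """Negative log-probability assigned to the correct choice."""
    return -math.log(softmax(scores)[label])


scores = [1.2, 0.3, -0.5, 2.0, 0.1]   # made-up per-choice scores for one question
probs = softmax(scores)
prediction = max(range(len(scores)), key=scores.__getitem__)

print("predicted choice index:", prediction)
print("loss if choice 3 is correct:", round(cross_entropy(scores, label=3), 4))
print("loss if choice 2 is correct:", round(cross_entropy(scores, label=2), 4))
```

Raising the score of one choice lowers the probability of the other four, which is where the comparison between choices comes in.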


In [24]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
lab_order = {"A": 0, "B":1, "C":2, "D":3, "E":4}

class InputExample(object):
    """A single multiple choice question and its five multiple choice answer candidates"""
    # This class is adapted from https://github.com/jonathanherzig/commonsenseqa/blob/master/bert/run_commonsense_qa.py
    # and from https://github.com/rodgzilla/pytorch-pretrained-BERT/blob/dcb50eaa4b80d3ab75d373c36780c80fb47cfd97/examples/run_swag.py

    def __init__(
            self,
            qid,
            question,
            choice_0,
            choice_1,
            choice_2,
            choice_3,
            choice_4,
            label=None):
        """Construct an instance."""
        self.qid = qid
        self.question = question  # e.g., 'The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?'
        self.choices = [          # All five answer choices as a list
            choice_0,
            choice_1,
            choice_2,
            choice_3,
            choice_4
        ]
        self.label = label        # Index (0-4) of the correct choice, or None if unlabeled
        
    def __str__(self):
        return self.__repr__()

    def __repr__(self):
        l = [
            f"qid: {self.qid}",
            f"question: {self.question}",
            f"choice_0: {self.choices[0]}",
            f"choice_1: {self.choices[1]}",
            f"choice_2: {self.choices[2]}",
            f"choice_3: {self.choices[3]}",
            f"choice_4: {self.choices[4]}",
        ]

        if self.label is not None:
            l.append(f"label: {self.label}")

        return ", ".join(l)    

class InputFeatures(object):
    """Adapted from: https://github.com/rodgzilla/pytorch-pretrained-BERT/blob/dcb50eaa4b80d3ab75d373c36780c80fb47cfd97/examples/run_swag.py
    Stores Bert model inputs (ids, masks) for each example"""
    
    def __init__(self,
                 example_id,
                 choices_features,
                 label

    ):
        self.example_id = example_id
        self.choices_features = [
            {
                'input_ids': input_ids,
                'input_mask': input_mask,
                'segment_ids': segment_ids
            }
            for _, input_ids, input_mask, segment_ids in choices_features
        ]
        self.label = label

    
def process_examples(data):
    """Given the examples as a pandas DataFrame, convert each row into an InputExample."""
    examples = []

    for index, row in data.iterrows():
        # Each entry of row.choices is a dict like {'label': 'A', 'text': 'ignore'}.
        # Pass only the answer text; str() on the dict would feed "{label: A, text: ignore}" to BERT.
        example = InputExample(
                    qid=row.id,
                    question=row.stem,
                    choice_0=row.choices[0]['text'],
                    choice_1=row.choices[1]['text'],
                    choice_2=row.choices[2]['text'],
                    choice_3=row.choices[3]['text'],
                    choice_4=row.choices[4]['text'],
                    label=lab_order[row.answerKey]
                )
        examples.append(example)

    return examples

def convert_examples_to_features(examples, tokenizer, max_seq_length, is_training):
    # For each question, we generate five inputs: one for each answer choice.
    
    # - [CLS] question [SEP] choice_1 [SEP]
    # - [CLS] question [SEP] choice_2 [SEP]
    # - [CLS] question [SEP] choice_3 [SEP]
    # - [CLS] question [SEP] choice_4 [SEP]
    # - [CLS] question [SEP] choice_5 [SEP]
    
    features = []
    # Loop through questions
    for example_index, example in enumerate(examples):
        question_tokens = tokenizer.tokenize(example.question)

        choices_features = []
        # For each question, loop through all answer choices 
        for choice_index, choice in enumerate(example.choices):
            # We create a copy of the question tokens in order to be
            # able to shrink it according to choice_tokens
            question_tokens_choice = question_tokens[:]
            choice_tokens = tokenizer.tokenize(choice)
            # Modifies `question_tokens_choice` and `choice_tokens` in
            # place so that the total length is less than the
            # specified length.  Account for [CLS], [SEP], [SEP] with
            # "- 3"
            _truncate_seq_pair(question_tokens_choice, choice_tokens, max_seq_length - 3)

            tokens = ["[CLS]"] + question_tokens_choice + ["[SEP]"] + choice_tokens + ["[SEP]"]
            segment_ids = [0] * (len(question_tokens_choice) + 2) + [1] * (len(choice_tokens) + 1)

            input_ids = tokenizer.convert_tokens_to_ids(tokens)
            input_mask = [1] * len(input_ids)

            # Zero-pad up to the sequence length.
            padding = [0] * (max_seq_length - len(input_ids))
            input_ids += padding
            input_mask += padding
            segment_ids += padding

            assert len(input_ids) == max_seq_length
            assert len(input_mask) == max_seq_length
            assert len(segment_ids) == max_seq_length

            choices_features.append((tokens, input_ids, input_mask, segment_ids))

        label = example.label

        features.append(
            InputFeatures(
                example_id = example.qid,
                choices_features = choices_features,
                label = label
            )
        )

    return features


def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()
            
def select_field(features, field):
    """Yields a list, length equal to the total number of examples,
    where each item is a list of arrays,
    each array representing the feature array"""
    return [
        [
            choice[field]   # Grab the feature array of that choice.
            for choice in feature.choices_features  # Loop through 5 choices of that example
        ]
        for feature in features   # loop through each example
    ]
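
To see how the truncation heuristic behaves, here is a standalone copy of the same logic run on toy token lists. The token values are made up, and the function is renamed `truncate_seq_pair` so it does not clash with the one above.

```python
def truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Pop tokens from the end of the longer list until the pair fits (in place)."""
    while len(tokens_a) + len(tokens_b) > max_length:
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


question_tokens = ["q1", "q2", "q3", "q4", "q5", "q6", "q7"]
choice_tokens = ["c1", "c2"]

# 7 + 2 = 9 tokens, but only 6 are allowed
truncate_seq_pair(question_tokens, choice_tokens, max_length=6)
print(question_tokens, choice_tokens)
```

Because the question is the longer sequence here, all of the trimming comes out of the question and the short answer choice is left intact.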



In [25]:
# Process inputs 

train_examples= process_examples(train)
train_features = convert_examples_to_features(
                    examples=train_examples, 
                    tokenizer=tokenizer, 
                    max_seq_length=50, 
                    is_training=True)

dev_examples= process_examples(dev)
dev_features = convert_examples_to_features(
                    examples=dev_examples, 
                    tokenizer=tokenizer, 
                    max_seq_length=50, 
                    is_training=True)


In [26]:
len(dev_examples)

1221

In [27]:
len(dev_features)

1221

In [28]:
len(train_examples)

9741

In [29]:
len(train_features)

9741

In [30]:
def create_inputs_from_features(features):
    input_ids = torch.tensor(select_field(features, 'input_ids'), dtype=torch.long)
    input_mask = torch.tensor(select_field(features, 'input_mask'), dtype=torch.long)
    segment_ids = torch.tensor(select_field(features, 'segment_ids'), dtype=torch.long)
    label = torch.tensor([f.label for f in features], dtype=torch.long)
    
    return input_ids, input_mask, segment_ids, label



In [31]:
all_input_ids, all_input_mask, all_segment_ids, all_label = create_inputs_from_features(train_features)
train_data = [all_input_ids, all_input_mask, all_segment_ids, all_label]
dev_data = list(create_inputs_from_features(dev_features))


In [32]:
print(all_input_ids.shape)
print(all_input_mask.shape)
print(all_segment_ids.shape)
print(all_label.shape)


torch.Size([9741, 5, 50])
torch.Size([9741, 5, 50])
torch.Size([9741, 5, 50])
torch.Size([9741])
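
The `[9741, 5, 50]` shape reads as (questions, choices, tokens). A toy version of `select_field` on plain dicts shows where the three levels of nesting come from. The feature values here are dummies, and `select_field_demo` is just a renamed copy for illustration.

```python
def select_field_demo(features, field):
    """For each example, collect `field` from each of its answer choices."""
    return [
        [choice[field] for choice in feature["choices_features"]]
        for feature in features
    ]


num_questions, num_choices, seq_len = 3, 5, 4
toy_features = [
    {"choices_features": [{"input_ids": [0] * seq_len} for _ in range(num_choices)]}
    for _ in range(num_questions)
]

nested = select_field_demo(toy_features, "input_ids")
shape = (len(nested), len(nested[0]), len(nested[0][0]))
print(shape)
```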


In [33]:
print(dev_data[0].shape)
print(train_data[0].shape)

torch.Size([1221, 5, 50])
torch.Size([9741, 5, 50])


In [34]:
from transformers import BertForMultipleChoice

In [35]:
model = BertForMultipleChoice.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMultipleChoice: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForMultipleChoice from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForMultipleChoice from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMultipleChoice were not initialized from the model checkpoint at bert-base-uncased and are newly

In [36]:
# Which layers are learnable?

def show_learnable_layers(model):
  counter_learnable = defaultdict(int)
  counter_frozen = defaultdict(int)

  for name, param in model.named_parameters():
      if param.requires_grad==True:
          if "bert" in name:
            counter_learnable["bert"] += 1 
          else:
            counter_learnable[name] += 1 
      else:
          if "bert" in name:
            counter_frozen["bert"] += 1 
          else:
            counter_frozen[name] += 1 
  print("Learnable params")
  print(counter_learnable)
  print("Frozen params")
  print(counter_frozen)

show_learnable_layers(model)

Learnable params
defaultdict(<class 'int'>, {'bert': 199, 'classifier.weight': 1, 'classifier.bias': 1})
Frozen params
defaultdict(<class 'int'>, {})


In [37]:
# Freeze layers. 

for param in model.bert.parameters():
    param.requires_grad = False

show_learnable_layers(model)

Learnable params
defaultdict(<class 'int'>, {'classifier.weight': 1, 'classifier.bias': 1})
Frozen params
defaultdict(<class 'int'>, {'bert': 199})
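
Freezing everything in `model.bert` leaves only the classifier trainable. If I later want to fine-tune just the top few encoder layers, one option is to decide per parameter name. The helper below is a hypothetical sketch that works on name strings only. The example names follow the `bert.encoder.layer.<i>.` pattern used by Hugging Face BERT, and 12 encoder layers is assumed for `bert-base-uncased`.

```python
import re


def is_trainable(param_name, num_layers=12, unfreeze_last=2):
    """Return True if this parameter should keep requires_grad=True."""
    if not param_name.startswith("bert."):
        return True                      # classifier head is always trained
    match = re.match(r"bert\.encoder\.layer\.(\d+)\.", param_name)
    if match is None:
        return False                     # embeddings and pooler stay frozen in this sketch
    return int(match.group(1)) >= num_layers - unfreeze_last


names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.10.output.dense.weight",
    "bert.encoder.layer.11.output.dense.bias",
    "classifier.weight",
]
for name in names:
    print(name, is_trainable(name))
```

With a real model, this would be applied as `param.requires_grad = is_trainable(name)` inside a loop over `model.named_parameters()`.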


The output is of the class `MultipleChoiceModelOutput`. It contains the following elements:

- `loss`
- `logits` (the reshaped logits, one score per answer choice)
- `hidden_states`
- `attentions`
        


Let's see if we can train it some more. I would like it to do a few rounds of the following:

1. Forward pass to make predictions
2. Calculate the loss
3. Backward pass: compute the gradient of the loss with respect to all the learnable parameters of the model
4. Update the weights with the optimizer
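
The cycle above can be sketched without any framework. The toy model below scores each choice with a dot product against a weight vector, which is my own simplification and not how BERT scores choices. The point is only the forward, loss, backward, update cycle. All numbers are made up.

```python
import math


def forward(w, choice_vectors):
    """Score each choice with a dot product, then softmax across the choices."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for x in choice_vectors]
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def loss_and_grad(w, choice_vectors, label):
    """Cross-entropy loss and its gradient with respect to w."""
    probs = forward(w, choice_vectors)
    loss = -math.log(probs[label])
    grad = [0.0] * len(w)
    for c, x in enumerate(choice_vectors):
        coeff = probs[c] - (1.0 if c == label else 0.0)   # d(loss) / d(score_c)
        for i, xi in enumerate(x):
            grad[i] += coeff * xi
    return loss, grad


# One made-up question with five choices, each described by a 2-d feature vector
choice_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5], [0.5, -1.0]]
label, w, lr = 1, [0.0, 0.0], 0.5

losses = []
for step in range(5):
    loss, grad = loss_and_grad(w, choice_vectors, label)   # forward pass, loss, backward pass
    w = [wi - lr * gi for wi, gi in zip(w, grad)]          # weight update
    losses.append(loss)

print([round(value, 3) for value in losses])
```

In the PyTorch loop, `loss.backward()` plays the role of the hand-written gradient, and `optimizer.step()` plays the role of the update line.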


In [38]:
# Set up optimizer 
from pytorch_pretrained_bert.optimization import BertAdam

no_decay = ['bias', 'gamma', 'beta']
num_train_steps = 100

t_total = num_train_steps
param_optimizer = list(model.named_parameters())
optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay_rate': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay_rate': 0.0}
        ]
optimizer = BertAdam(optimizer_grouped_parameters,
                         lr=args.learning_rate,
                         warmup=args.warmup_proportion,
                         t_total=t_total)

10/26/2020 08:44:53 - INFO - pytorch_pretrained_bert.modeling -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
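
One thing to check in the grouping above: it matches parameters by substring. The checkpoint warning printed earlier shows names like `LayerNorm.weight` and `LayerNorm.bias`, so as far as I can tell the `gamma` and `beta` entries (naming from older BERT ports) do not match anything here, and `LayerNorm.weight` lands in the weight-decay group. A name-only sketch with a handful of example parameter names:

```python
def split_by_decay(param_names, no_decay):
    """Split names into (decay, no_decay) groups by substring match."""
    decay_group = [n for n in param_names if not any(nd in n for nd in no_decay)]
    no_decay_group = [n for n in param_names if any(nd in n for nd in no_decay)]
    return decay_group, no_decay_group


example_names = [
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "bert.encoder.layer.0.attention.output.LayerNorm.weight",
    "bert.encoder.layer.0.attention.output.LayerNorm.bias",
    "classifier.weight",
    "classifier.bias",
]

# List used above versus a list that names LayerNorm.weight explicitly
decay_old, _ = split_by_decay(example_names, ["bias", "gamma", "beta"])
decay_new, _ = split_by_decay(example_names, ["bias", "LayerNorm.weight"])

print("LayerNorm.weight decayed (old list):", any("LayerNorm.weight" in n for n in decay_old))
print("LayerNorm.weight decayed (new list):", any("LayerNorm.weight" in n for n in decay_new))
```

With BERT frozen this makes no practical difference, since only the classifier is updated, but it would matter once encoder layers are unfrozen.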


In [None]:
# Full-batch training: the loss is computed once per epoch over the whole training set, rather than on mini-batches.

num_epochs=3

for epoch in range(num_epochs):
  print(datetime.now(pytz.timezone('US/Pacific')).strftime("%Y%m%d_%H%M%S"))
  len_dataset = all_input_ids.shape[0]

  output = model.forward(
        input_ids=all_input_ids,
        attention_mask=all_input_mask,
        token_type_ids=all_segment_ids,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=all_label,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=True,
    )
  loss = output["loss"]

  print(epoch, loss.item())
  

  # Before backward pass, zero all gradients for variables it will update 
  optimizer.zero_grad()   

  # Backward pass
  loss.backward()

  # Update weights 
  optimizer.step()

  # Log the training loss
  writer.add_scalar('training loss', loss.item(), epoch)
  
  # Evaluate against dev data (no gradients are needed for evaluation)
  with torch.no_grad():
    dev_output = model(
        input_ids=dev_data[0],
        attention_mask=dev_data[1],
        token_type_ids=dev_data[2],
        labels=dev_data[3],
        output_attentions=None,
        output_hidden_states=None,
        return_dict=True,
    )

  # Log the dev loss
  writer.add_scalar('dev loss', dev_output["loss"].item(), epoch)

  # Pick the answer choice with the highest score for each question.
  # Softmax is taken over the choice dimension (dim=1); argmax of the raw logits gives the same result.
  train_predictions = torch.argmax(torch.nn.functional.softmax(output["logits"], dim=1), dim=1)
  dev_predictions = torch.argmax(torch.nn.functional.softmax(dev_output["logits"], dim=1), dim=1)

  # Accuracy against train data
  train_accuracy = metrics.accuracy_score(all_label, train_predictions)
  
  # Log accuracy against train data
  writer.add_scalar('train accuracy', train_accuracy, epoch)
  
  # Accuracy against dev data
  dev_accuracy = metrics.accuracy_score(dev_data[3], dev_predictions)
  
  # Log accuracy against dev data
  writer.add_scalar('dev accuracy', dev_accuracy, epoch)
  
  

20201026_014453
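
The loop above pushes all 9,741 questions through BERT in a single forward pass per epoch, which is heavy on memory. A mini-batch version would slice the tensors in chunks of `args.train_batch_size`. The index arithmetic is sketched below in plain Python. `iter_batches` is a hypothetical helper; in practice `DataLoader` with `TensorDataset` (both already imported above) does this job.

```python
def iter_batches(num_examples, batch_size):
    """Yield (start, end) index pairs that together cover range(num_examples)."""
    for start in range(0, num_examples, batch_size):
        yield start, min(start + batch_size, num_examples)


batches = list(iter_batches(num_examples=9741, batch_size=32))
print(len(batches), batches[0], batches[-1])
```

Each `(start, end)` pair would index the first dimension of `all_input_ids`, `all_input_mask`, `all_segment_ids`, and `all_label`, with one optimizer step per batch.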


In [None]:
dev_predictions

In [None]:
# Save model
torch.save(model.state_dict(), "models/"+NAME)

# Save dev predictions
fordump = dev_output
pred_dir = "models/predictions/"   # renamed from `dir` to avoid shadowing the built-in
filename = "{NAME}_{ver}_dev_predictions".format(NAME=NAME, ver=runtype)
with open(pred_dir + filename, "wb") as pickle_out:
    pickle.dump(fordump, pickle_out)

In [None]:
# # Load model code snippet

# loadmodel = BertForMultipleChoice.from_pretrained('bert-base-uncased')
# loadmodel.load_state_dict(torch.load("models/"+NAME))

# loadmodel_output = loadmodel.forward(
#         input_ids=dev_data[0],
#         attention_mask=dev_data[1],
#         token_type_ids=dev_data[2],
#         labels=dev_data[3],
#         output_attentions=None,
#         output_hidden_states=None,
#         return_dict=True,
#     )

# loadmodel_output_predictions = torch.argmax(torch.nn.functional.softmax(loadmodel_output["logits"]), dim=1)
# print(loadmodel_output_predictions)

In [None]:
%tensorboard --logdir "runs/"

In [None]:
# Error analysis

# incorrect_idxs = [i for i, prediction in enumerate(predictions) if prediction != targets[i]]
# for incorrect_idx in incorrect_idxs:
#     print(tokenizer.decode(valid_dataset[incorrect_idx]['input_ids']))
#     print("Target Answer: {}".format(tokenizer.decode(valid_dataset[incorrect_idx]['target_ids'])))
#     print("Predicted Answer: {}".format(predictions[incorrect_idx]))
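
A concrete version of the commented-out idea above, using plain lists. The questions, predictions, and targets are dummy values, and `find_errors` is a hypothetical helper. With the real run, `predictions` would come from `dev_predictions.tolist()` and `targets` from `dev_data[3].tolist()`.

```python
def find_errors(predictions, targets):
    """Return the indices where the predicted choice differs from the target."""
    return [i for i, (p, t) in enumerate(zip(predictions, targets)) if p != t]


labels = "ABCDE"
questions = ["question 0", "question 1", "question 2", "question 3"]  # dummy stems
predictions = [0, 3, 2, 4]   # dummy predicted choice indices
targets = [0, 1, 2, 2]       # dummy correct choice indices

incorrect_idxs = find_errors(predictions, targets)
accuracy = 1 - len(incorrect_idxs) / len(targets)

for i in incorrect_idxs:
    print(f"{questions[i]}: target {labels[targets[i]]}, predicted {labels[predictions[i]]}")
print("accuracy:", accuracy)
```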