# BERT - Out of the Box

In this notebook, we will test the performance of an out-of-the-box BERT model on CommonsenseQA. I follow the tutorial here: https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb

I use the Hugging Face `transformers` library.

I referred to the CommonsenseQA repo and code to understand how the authors of this work established their baseline using BERT. This is the link to their code: https://github.com/jonathanherzig/commonsenseqa/blob/master/bert/run_commonsense_qa.py

It comes from this repo (see the README): https://github.com/jonathanherzig/commonsenseqa

Their work is more advanced and complicated than what I want to do at this point, but I refer to it to understand the setup.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

import warnings

import json
# json_normalize is exposed at the top level in pandas >= 1.0 (the pandas.io.json path is deprecated)
from pandas import json_normalize
warnings.filterwarnings('ignore')


from transformers import BertTokenizer, TFBertModel, BertConfig

import tensorflow as tf
from tensorflow import keras

from datetime import datetime

import time 
configuration = BertConfig() 
from IPython.display import Image 

from tensorflow.keras.callbacks import TensorBoard
import pickle

runtype="tiny"

print("Run type:", runtype)
print(tf.__version__) 
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
print("Num CPUs Available: ",
      len(tf.config.experimental.list_physical_devices('CPU')))


Run type: tiny
2.3.0
Num GPUs Available:  0
Num CPUs Available:  1


## Import dataset

It's in the dataset folder.

In [2]:
def load_data(file):
    lines = []
    with open(file, 'rb') as json_file:
        for json_line in json_file:
            lines.append(json.loads(json_line))
        data = json_normalize(lines)
        data.columns = data.columns.map(lambda x: x.split(".")[-1])
    return data
# os.chdir('w266-commonsenseqa/BERT_oob')
train = load_data('../dataset/train_rand_split.jsonl')
dev = load_data('../dataset/dev_rand_split.jsonl')
train.head()

| | answerKey | id | question_concept | choices | stem |
|---|---|---|---|---|---|
| 0 | A | 075e483d21c29a511267ef62bedc0461 | punishing | [{'label': 'A', 'text': 'ignore'}, {'label': '... | The sanctions against the school were a punish... |
| 1 | B | 61fe6e879ff18686d7552425a36344c8 | people | [{'label': 'A', 'text': 'race track'}, {'label... | Sammy wanted to go to where the people were. ... |
| 2 | A | 4c1cb0e95b99f72d55c068ba0255c54d | choker | [{'label': 'A', 'text': 'jewelry store'}, {'la... | To locate a choker not located in a jewelry bo... |
| 3 | D | 02e821a3e53cb320790950aab4489e85 | highway | [{'label': 'A', 'text': 'united states'}, {'la... | Google Maps and other highway and street GPS s... |
| 4 | C | 23505889b94e880c3e89cff4ba119860 | fox | [{'label': 'A', 'text': 'pretty flowers.'}, {'... | The fox walked from the city into the forest, ... |
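
Each entry in `choices` is a list of `{'label', 'text'}` dicts, as the preview shows. Here is a minimal sketch of pulling out the answer texts and the index of the correct answer for one row. The `row` dict is a hand-built stand-in for a dataframe row, the texts for choices B to E are placeholders (the preview truncates the real ones), and `extract_choices` is a hypothetical helper that is not part of this notebook.

```python
def extract_choices(row):
    """Return (answer texts, index of the correct answer) for one row."""
    texts = [choice["text"] for choice in row["choices"]]
    labels = [choice["label"] for choice in row["choices"]]
    # answerKey is a letter such as "A"; find its position among the labels
    return texts, labels.index(row["answerKey"])


# Hand-built example row; only choice A's text comes from the preview
row = {
    "answerKey": "A",
    "stem": "Question text here",
    "choices": [
        {"label": "A", "text": "ignore"},
        {"label": "B", "text": "placeholder b"},
        {"label": "C", "text": "placeholder c"},
        {"label": "D", "text": "placeholder d"},
        {"label": "E", "text": "placeholder e"},
    ],
}

texts, label_index = extract_choices(row)
print(texts[label_index])  # ignore
```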


## Steps

1. Import training examples
2. Process them
    - Format the input into something BERT can work with, including `[CLS]` and `[SEP]`
    - Decide how to encode which answer choice is correct
    - Tokenize
    - Create an output layer using softmax
3. Train it
    - Specify how many layers of BERT to fine-tune
    

# BERT base model (uncased)

From: https://huggingface.co/bert-base-uncased

> Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is uncased: it does not make a difference between english and English.
> 
> Disclaimer: The team releasing BERT did not write a model card for this model so this model card has been written by the Hugging Face team.

For each question, there are five answer choices. Only one of them is correct.

For BERT, the first thought was to have all five answers attached to each question, and the model would choose one of the five responses. This is how it's originally done in the CommonsenseQA paper.

```
[CLS] Question text here [SEP] Ans choice A [SEP] Ans choice B [SEP] Ans choice C [SEP] Ans choice D [SEP] Ans choice E [SEP]
```

It seems complicated, however, and requires a significant lift. So for now, let me try creating five question-answer pairs for each question. Like this:

```
[CLS] Question text here [SEP] Ans choice A [SEP]
[CLS] Question text here [SEP] Ans choice B [SEP]
[CLS] Question text here [SEP] Ans choice C [SEP]
[CLS] Question text here [SEP] Ans choice D [SEP]
[CLS] Question text here [SEP] Ans choice E [SEP]
```

Only one of the five inputs above gets a positive label (1) for being the correct answer; the other four get 0. The drawback of this setup is that each choice is evaluated separately, to see whether it looks like a right answer at all. I think it's also important for the model to know how the answer choices compare to each other.
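
To make the pairing concrete, here is a minimal sketch that builds the five question-answer strings and their binary labels. `make_pairs` is a hypothetical helper written for illustration; it works on plain strings and skips tokenization entirely.

```python
def make_pairs(question, choices, correct_index):
    """Return one (text, label) tuple per answer choice."""
    pairs = []
    for i, choice in enumerate(choices):
        text = f"[CLS] {question} [SEP] {choice} [SEP]"
        # Only the correct choice gets label 1; every other pair gets 0
        pairs.append((text, 1 if i == correct_index else 0))
    return pairs


choices = ["Ans choice A", "Ans choice B", "Ans choice C", "Ans choice D", "Ans choice E"]
pairs = make_pairs("Question text here", choices, correct_index=2)
for text, label in pairs:
    print(label, text)
```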


In [3]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
lab_order = {"A": 0, "B":1, "C":2, "D":3, "E":4}

class InputExample(object):
    """A single multiple choice question and its five multiple choice answer candidates"""
    # This class is adapted from https://github.com/jonathanherzig/commonsenseqa/blob/master/bert/run_commonsense_qa.py
    # and from https://github.com/rodgzilla/pytorch-pretrained-BERT/blob/dcb50eaa4b80d3ab75d373c36780c80fb47cfd97/examples/run_swag.py

    def __init__(
            self,
            qid,
            question,
            choice_0,
            choice_1,
            choice_2,
            choice_3,
            choice_4,
            label=None):
        """Construct an instance."""
        self.qid = qid
        self.question = question  # e.g., 'The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?'
        self.choices = [          # All five answer choices as a list
            choice_0,
            choice_1,
            choice_2,
            choice_3,
            choice_4
        ]
        self.label = label        # Index (0-4) of the correct choice; None if unknown
        
    def __str__(self):
        return self.__repr__()

    def __repr__(self):
        l = [
            f"qid: {self.qid}",
            f"question: {self.question}",
            f"choice_0: {self.choices[0]}",
            f"choice_1: {self.choices[1]}",
            f"choice_2: {self.choices[2]}",
            f"choice_3: {self.choices[3]}",
            f"choice_4: {self.choices[4]}",
        ]

        if self.label is not None:
            l.append(f"label: {self.label}")

        return ", ".join(l)    

class InputFeatures(object):
    """Adapted from: https://github.com/rodgzilla/pytorch-pretrained-BERT/blob/dcb50eaa4b80d3ab75d373c36780c80fb47cfd97/examples/run_swag.py
    Stores Bert model inputs (ids, masks) for each example"""
    
    def __init__(self,
                 example_id,
                 choices_features,
                 label

    ):
        self.example_id = example_id
        self.choices_features = [
            {
                'input_ids': input_ids,
                'input_mask': input_mask,
                'segment_ids': segment_ids
            }
            for _, input_ids, input_mask, segment_ids in choices_features
        ]
        self.label = label

    
def process_examples(data):
    """Given the examples in a pandas df format, process examples into example class"""
    examples = []
    labels = []
    questions = []
    anscands = []
    
    
    for index, row in data.iterrows(): 
        example = InputExample(
                    qid=row.id,
                    question=row.stem,
                    choice_0=str(row.choices[0]).replace("'",""),
                    choice_1=str(row.choices[1]).replace("'",""),
                    choice_2=str(row.choices[2]).replace("'",""),
                    choice_3=str(row.choices[3]).replace("'",""),
                    choice_4=str(row.choices[4]).replace("'",""),
                    label=lab_order[row.answerKey]
                )
        examples.append(example)
        
    return examples 

def convert_examples_to_features(examples, tokenizer, max_seq_length, is_training):
    # For each question, we generate five inputs: one for each answer choice.

    # - [CLS] question [SEP] choice_0 [SEP]
    # - [CLS] question [SEP] choice_1 [SEP]
    # - [CLS] question [SEP] choice_2 [SEP]
    # - [CLS] question [SEP] choice_3 [SEP]
    # - [CLS] question [SEP] choice_4 [SEP]
    
    features = []
    # Loop through questions
    for example_index, example in enumerate(examples):
        question_tokens = tokenizer.tokenize(example.question)

        choices_features = []
        # For each question, loop through all answer choices 
        for choice_index, choice in enumerate(example.choices):
            # We create a copy of the question tokens in order to be
            # able to shrink it according to choice_tokens
            question_tokens_choice = question_tokens[:]
            choice_tokens = tokenizer.tokenize(choice)
            # Modifies `question_tokens_choice` and `choice_tokens` in
            # place so that the total length is less than the
            # specified length.  Account for [CLS], [SEP], [SEP] with
            # "- 3"
            _truncate_seq_pair(question_tokens_choice, choice_tokens, max_seq_length - 3)

            tokens = ["[CLS]"] + question_tokens_choice + ["[SEP]"] + choice_tokens + ["[SEP]"]
            segment_ids = [0] * (len(question_tokens_choice) + 2) + [1] * (len(choice_tokens) + 1)

            input_ids = tokenizer.convert_tokens_to_ids(tokens)
            input_mask = [1] * len(input_ids)

            # Zero-pad up to the sequence length.
            padding = [0] * (max_seq_length - len(input_ids))
            input_ids += padding
            input_mask += padding
            segment_ids += padding

            assert len(input_ids) == max_seq_length
            assert len(input_mask) == max_seq_length
            assert len(segment_ids) == max_seq_length

            choices_features.append((tokens, input_ids, input_mask, segment_ids))

        label = example.label

        features.append(
            InputFeatures(
                example_id = example.qid,
                choices_features = choices_features,
                label = label
            )
        )

    return features


def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()
            
def select_field(features, field):
    """Yields a list, length equal to the total number of examples,
    where each item is a list of arrays,
    each array representing the feature array"""
    return [
        [
            choice[field]   # Grab the feature array of that choice.
            for choice in feature.choices_features  # Loop through 5 choices of that example
        ]
        for feature in features   # loop through each example
    ]



In [4]:
# Process inputs 
tiny_train = train.iloc[0:10]

train_examples= process_examples(tiny_train)
train_features = convert_examples_to_features(
                    examples=train_examples, 
                    tokenizer=tokenizer, 
                    max_seq_length=50, 
                    is_training=True)
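
To see what one feature row looks like, here is a minimal sketch of the segment-id, mask, and padding logic from `convert_examples_to_features` on a toy pair. It assumes whitespace tokenization instead of BERT's WordPiece tokenizer, keeps tokens as strings rather than converting them to ids, and uses a hypothetical `featurize` helper and a made-up question.

```python
def featurize(question_tokens, choice_tokens, max_seq_length):
    """Return (tokens, input_mask, segment_ids), each padded to max_seq_length."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + choice_tokens + ["[SEP]"]
    # Segment 0 covers [CLS] + question + first [SEP]; segment 1 covers choice + last [SEP]
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(choice_tokens) + 1)
    input_mask = [1] * len(tokens)  # 1 marks real tokens, 0 marks padding

    pad = max_seq_length - len(tokens)
    if pad < 0:
        raise ValueError("pair is longer than max_seq_length; truncate it first")
    return tokens + ["[PAD]"] * pad, input_mask + [0] * pad, segment_ids + [0] * pad


tokens, input_mask, segment_ids = featurize(
    "where do people go ?".split(), ["race", "track"], max_seq_length=12
)
print(tokens)
print(input_mask)
print(segment_ids)
```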


In [12]:
all_input_ids = tf.convert_to_tensor(select_field(train_features, 'input_ids'), dtype=tf.int32)
all_input_mask = tf.convert_to_tensor(select_field(train_features, 'input_mask'), dtype=tf.int32)
all_segment_ids = tf.convert_to_tensor(select_field(train_features, 'segment_ids'), dtype=tf.int32)
all_label = tf.convert_to_tensor([f.label for f in train_features], dtype=tf.float32)

train_data = [all_input_ids, all_input_mask, all_segment_ids, all_label]


In [6]:
print(all_input_ids.shape)
print(all_input_mask.shape)
print(all_segment_ids.shape)
print(all_label.shape)


(10, 5, 50)
(10, 5, 50)
(10, 5, 50)
(10,)
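
The shapes read as (questions, choices, tokens). As far as I understand it, a multiple-choice head flattens the first two axes so that every question-answer pair goes through BERT as its own row, then folds the per-pair scores back into one row of five per question. The sketch below shows only that reshaping in NumPy, with zeros and a counter standing in for real ids and scores.

```python
import numpy as np

num_questions, num_choices, seq_len = 10, 5, 50
input_ids = np.zeros((num_questions, num_choices, seq_len), dtype=np.int32)

# Flatten so that each question-answer pair becomes one row
flat_input_ids = input_ids.reshape(-1, seq_len)
print(flat_input_ids.shape)  # (50, 50)

# Stand-in for one score per pair (a real model would produce these)
flat_scores = np.arange(num_questions * num_choices, dtype=np.float32)

# Fold the scores back into one row of five per question
logits = flat_scores.reshape(-1, num_choices)
print(logits.shape)  # (10, 5)
```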


In [7]:
from transformers import TFBertForMultipleChoice

In [8]:
model = TFBertForMultipleChoice.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing TFBertForMultipleChoice: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertForMultipleChoice from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertForMultipleChoice from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of TFBertForMultipleChoice were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['dropout_37', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
# `call` is the model's forward pass: the code that actually runs when inputs flow through the network
# (comparable to the predict method in sklearn or Keras). Here it is invoked directly.

output = model.call(
        inputs=all_input_ids,
        attention_mask=all_input_mask,
        token_type_ids=all_segment_ids,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=all_label,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=True,
    )

The output is of the class `TFMultipleChoiceModelOutput`. It contains the following elements:

- `loss`
- `logits` (the reshaped logits)
- `hidden_states`
- `attentions`

In [19]:
type(output)

transformers.modeling_tf_outputs.TFMultipleChoiceModelOutput

In [20]:
output.keys()

odict_keys(['loss', 'logits'])

Looks like I didn't ask it to return hidden states or attentions, so I only got the loss and the logits. More info on what these mean [here - transformers.BertModel.forward](https://huggingface.co/transformers/model_doc/bert.html#transformers.BertModel.forward)

In [22]:
output['logits'][0]

<tf.Tensor: shape=(5,), dtype=float32, numpy=
array([-0.75718963, -0.767469  , -0.65268475, -0.735697  , -0.74570036],
      dtype=float32)>

In [35]:
# So here are the guesses!

for i in range(0, 10):
    print("predicted:", np.argmax(tf.nn.softmax(output['logits'][i])),
         "actual:", int(all_label[i]), 
          np.argmax(tf.nn.softmax(output['logits'][i]))== int(all_label[i]))


predicted: 2 actual: 0 False
predicted: 3 actual: 1 False
predicted: 0 actual: 0 True
predicted: 2 actual: 3 False
predicted: 0 actual: 2 False
predicted: 3 actual: 3 True
predicted: 4 actual: 4 True
predicted: 3 actual: 1 False
predicted: 2 actual: 4 False
predicted: 3 actual: 3 True
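
A small sketch of what the loop does for the first question, using the logits printed earlier and plain Python instead of TensorFlow. `softmax` and `argmax` here are hand-written helpers for illustration. Softmax preserves the ordering of the scores, so the argmax of the probabilities equals the argmax of the raw logits.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Logits for the first question, copied from the output above
logits = [-0.75718963, -0.767469, -0.65268475, -0.735697, -0.74570036]
probs = softmax(logits)
print([round(p, 3) for p in probs])
print(argmax(probs), argmax(logits))  # 2 2
```

The probabilities all sit close to 0.2, which fits the warning that the classifier layer is newly initialized and has not been trained yet.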


Let's see if we can train it some more.

### Game plan

I'll be following the tutorial at http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/, but specifically for CommonsenseQA. Here are the steps from the tutorial.

1. Embed all sentences. Let's look at the output from the BERT model for our inputs.
2. The tutorial says to do a train/test split. We don't need this because our data came already split.
3. Train the logistic regression model using the training set. This is training on the output of BERT. I will need to create an FFNN to attach at the end of BERT and see what it comes back with.
4. (Optional for now) For each question, pick the answer with the highest score.
5. Then, evaluate against the true answers. The evaluation metric will be the percentage of questions answered correctly out of all questions.
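
Steps 4 and 5 can be sketched as follows, assuming the classifier returns one score per question-answer pair, ordered so that each question's five pairs are consecutive. `question_accuracy` is a hypothetical helper and the scores are made up for illustration.

```python
def question_accuracy(pair_scores, labels, num_choices=5):
    """Group per-pair scores by question, pick the top choice, and score it."""
    if len(pair_scores) != len(labels) * num_choices:
        raise ValueError("expected num_choices scores per question")
    correct = 0
    for q, label in enumerate(labels):
        scores = pair_scores[q * num_choices:(q + 1) * num_choices]
        # Step 4: the predicted answer is the choice with the highest score
        predicted = max(range(num_choices), key=scores.__getitem__)
        correct += int(predicted == label)
    # Step 5: fraction of questions answered correctly
    return correct / len(labels)


# Made-up scores for two questions
pair_scores = [0.1, 0.7, 0.2, 0.3, 0.1,   # question 0: best choice is index 1
               0.9, 0.2, 0.1, 0.4, 0.3]   # question 1: best choice is index 0
labels = [1, 3]
print(question_accuracy(pair_scores, labels))  # 0.5
```

For reference, four of the ten untrained predictions printed earlier match their labels, which is 40% by this metric.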


#### Step 1: Embed all sentences. 

I've already formatted the CommonsenseQA inputs to be fed into the BERT model. Let's look at the output from the BERT model for our inputs. It should have a 768-long vector for each input token.

### Sunday morning retry 

In [5]:
NAME = 'oob_bert_classify_{runtype}_{ts}'.format(runtype=runtype, ts=int(time.time()))
tensorboard = TensorBoard(log_dir='models/logs/{}'.format(NAME))

def classification_model(max_len):
    """Implementation of classification model.
    Returns: model"""
    ## BERT encoder
    encoder = TFBertModel.from_pretrained("bert-base-uncased")
    ## QA Model
    # One row per question-answer pair: the encoder expects (batch, max_len) inputs,
    # which is also what the model summary reports
    input_ids      = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="InputID")
    attention_mask = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="AttentionMask")
    token_type_ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="TokenTypeID")
    embedding = encoder(
        [input_ids, attention_mask, token_type_ids]
    )[0]
    # Feed inputs through the bert model, 
    # then take just the vector associated with first token [CLS]
    bert_cls_output = embedding[:,0]
    # These are the layers that come after Bert.
    dense = tf.keras.layers.Dense(256, activation='relu', name='dense')(bert_cls_output)
    # Output layer to predict correct answer. 
    # For the future, we may modify it to choose the max candidate answer of each question
    # for now, just predict from 0 to 1 whether this looks like a correct answer. 
    pred = tf.keras.layers.Dense(1, activation='sigmoid', name='correct')(dense)
    model = tf.keras.models.Model(inputs=[input_ids, token_type_ids, attention_mask],
                                  outputs=pred)
    model.compile(loss="binary_crossentropy", 
                  optimizer="adam",
                  metrics=["accuracy"])
    model.summary()
    return model


# Train

On training set

In [6]:
# train_eg, train_encoded_eg, train_labs 

max_length= len(train_encoded_eg[0])   # num max token 
BertClassifierModel = classification_model(max_len=max_length)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
InputID (InputLayer)            [(None, 10)]         0                                            
__________________________________________________________________________________________________
AttentionMask (InputLayer)      [(None, 10)]         0                                            
__________________________________________________________________________________________________
TokenTypeID (InputLayer)        [(None, 10)]         0                                            
__________________________________________________________________________________________________
tf_bert_model (TFBertModel)     ((None, 10, 768), (N 109482240   InputID[0][0]                    
                                                                 AttentionMask[0][0]   

In [None]:
start = time.time()

BertClassifierModel.fit(
    [   train_encoded_eg[0],
        train_encoded_eg[1],
        train_encoded_eg[2]],
    train_labs, 
    epochs=3,
    # Insert validation data
    validation_data=(
        [dev_encoded_eg[0],
         dev_encoded_eg[1],
         dev_encoded_eg[2]
        ], dev_labs
    ),
    # Log the training info on tensorboard
    callbacks=[tensorboard])

end = time.time()
print("Execution duration in minutes:", (end - start)/60)

Epoch 1/3


In [None]:
predictions = BertClassifierModel.predict(
    [dev_encoded_eg[0],
         dev_encoded_eg[1],
         dev_encoded_eg[2]
        ])

location="models/predictions/BertClassifierModel_{}_dev_predictions".format(runtype)
# The context manager closes the file even if pickling fails
with open(location, "wb") as pickle_out:
    pickle.dump(predictions, pickle_out)

In [None]:
location='models/{NAME}.h5'.format(NAME=NAME)

BertClassifierModel.save(location)