# BERT - Out of the Box

In this notebook, we will test the performance of an out-of-the-box BERT model on CommonsenseQA. I follow the tutorial here: https://github.com/jalammar/jalammar.github.io/blob/master/notebooks/bert/A_Visual_Notebook_to_Using_BERT_for_the_First_Time.ipynb

I use the Hugging Face `transformers` library.

I referred to the CommonsenseQA repo and code to understand how the authors of this work established their baseline using BERT. This is the link to their code: https://github.com/jonathanherzig/commonsenseqa/blob/master/bert/run_commonsense_qa.py

From this repo (README): https://github.com/jonathanherzig/commonsenseqa

Their work is more advanced and complicated than what I want to do at this time, but I refer to it to understand the setup.

In [1]:
# !pip install transformers



In [3]:
# !pip install torch

Collecting torch
  Downloading torch-1.6.0-cp38-cp38-manylinux1_x86_64.whl (748.8 MB)

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings

import json
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated as of pandas 1.0
warnings.filterwarnings('ignore')

import tensorflow as tf  # used as tf.keras in the model cell further down
from tensorflow.keras.preprocessing.sequence import pad_sequences

## Import dataset

It's in the dataset folder.

In [22]:
file = '../dataset/train_rand_split.jsonl'
def load_data(file):

    lines = []
    with open(file, 'rb') as json_file:
        for json_line in json_file:
            lines.append(json.loads(json_line))
        data = json_normalize(lines)
        data.columns = data.columns.map(lambda x: x.split(".")[-1])
    return data

train = load_data(file)
train.head()

| | answerKey | id | question_concept | choices | stem |
|---|---|---|---|---|---|
| 0 | A | 075e483d21c29a511267ef62bedc0461 | punishing | [{'label': 'A', 'text': 'ignore'}, {'label': '... | The sanctions against the school were a punish... |
| 1 | B | 61fe6e879ff18686d7552425a36344c8 | people | [{'label': 'A', 'text': 'race track'}, {'label... | Sammy wanted to go to where the people were. ... |
| 2 | A | 4c1cb0e95b99f72d55c068ba0255c54d | choker | [{'label': 'A', 'text': 'jewelry store'}, {'la... | To locate a choker not located in a jewelry bo... |
| 3 | D | 02e821a3e53cb320790950aab4489e85 | highway | [{'label': 'A', 'text': 'united states'}, {'la... | Google Maps and other highway and street GPS s... |
| 4 | C | 23505889b94e880c3e89cff4ba119860 | fox | [{'label': 'A', 'text': 'pretty flowers.'}, {'... | The fox walked from the city into the forest, ... |


In [24]:
type(train)

pandas.core.frame.DataFrame

In [3]:
# `train` already holds the DataFrame returned by load_data(file) above
train.shape

(9741, 5)

In [4]:
sample_data = train.iloc[0]

In [5]:
sample_data.choices

[{'label': 'A', 'text': 'ignore'},
 {'label': 'B', 'text': 'enforce'},
 {'label': 'C', 'text': 'authoritarian'},
 {'label': 'D', 'text': 'yell at'},
 {'label': 'E', 'text': 'avoid'}]

In [6]:
train.iloc[0]["stem"]

'The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?'

In [7]:
train.iloc[0]["choices"]

[{'label': 'A', 'text': 'ignore'},
 {'label': 'B', 'text': 'enforce'},
 {'label': 'C', 'text': 'authoritarian'},
 {'label': 'D', 'text': 'yell at'},
 {'label': 'E', 'text': 'avoid'}]

In [8]:
train.iloc[0]["question_concept"]

'punishing'

## Steps

1. Import training examples
2. Process them
    - Format the input into something BERT can work with, including `[CLS]` and `[SEP]`
    - Decide how to label each input, i.e. which answer choice is correct
    - Tokenize 
    - Create an output layer using softmax. 
3. Train it
    - Specify how many layers of BERT to fine-tune
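
To make the softmax output step in the list above concrete, here is a minimal sketch in plain Python. It is not the model's output layer: the five scores are made-up numbers standing in for whatever the model would produce per answer choice, and `softmax` and `LABELS` are names introduced only for this illustration.

```python
import math

LABELS = ["A", "B", "C", "D", "E"]  # answer choice labels used by CommonsenseQA

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for one question, one per answer choice
scores = [2.0, 0.5, -1.0, 0.1, 0.3]
probs = softmax(scores)
predicted = LABELS[probs.index(max(probs))]
print(predicted)
```

The predicted label is simply the choice with the highest probability, so the softmax matters mostly for training, where the loss compares the whole distribution with the answer key.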
    



# BERT base model (uncased)

From: https://huggingface.co/bert-base-uncased

> Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is uncased: it does not make a difference between english and English.
> 
> Disclaimer: The team releasing BERT did not write a model card for this model so this model card has been written by the Hugging Face team.

In [4]:
from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

model = AutoModelWithLMHead.from_pretrained("bert-base-uncased")  # loaded so that print(model) below works


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [5]:
?tokenizer

In [6]:
print(model)

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tr

In [7]:
# Try encoding something. 

encoded_input = tokenizer("Hello, I'm a single sentence!")
print(encoded_input)

{'input_ids': [101, 7592, 1010, 1045, 1005, 1049, 1037, 2309, 6251, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [8]:
encoded_input = tokenizer("[CLS] Hello, I'm a single sentence! [SEP] And I'm a new sentence!")
print(encoded_input)

{'input_ids': [101, 101, 7592, 1010, 1045, 1005, 1049, 1037, 2309, 6251, 999, 102, 1998, 1045, 1005, 1049, 1037, 2047, 6251, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


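So I can see it's encoding the sentence. The tokenizer adds `[CLS]` (token 101) at the start and `[SEP]` (token 102) at the end on its own, so I don't need to type them in. When I did type `[CLS]` manually, it was added twice: the second output starts with `101, 101`. The manual `[SEP]` in the middle was also mapped to 102, but `token_type_ids` stayed all 0 because the whole string was passed as a single segment.

The attention mask is all 1s here because nothing is padded: it marks real tokens with 1 and only `[PAD]` positions with 0.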
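
The mask only contains 0s once sequences are padded to a common length. Here is a small sketch in plain Python that pads two ID sequences taken from the `tokenizer` outputs in this notebook. `pad_batch` is a helper made up for this illustration, and it assumes `[PAD]` has ID 0, which matches `padding_idx=0` in the printed model.

```python
PAD_ID = 0  # assumed [PAD] id for bert-base-uncased

def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

batch = [
    [101, 7592, 1010, 1045, 1005, 1049, 1037, 2309, 6251, 999, 102],  # "Hello, I'm a single sentence!"
    [101, 7592, 1045, 2572, 1037, 6251, 102],                          # "Hello I am a sentence"
]
input_ids, attention_mask = pad_batch(batch)
for ids, mask in zip(input_ids, attention_mask):
    print(ids)
    print(mask)
```

The longer sentence keeps a mask of all 1s, while the shorter one ends in four 0s that line up with its four padding IDs.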

In [9]:
tokenizer(train.iloc[0]["stem"])

{'input_ids': [101, 1996, 17147, 2114, 1996, 2082, 2020, 1037, 16385, 2075, 6271, 1010, 1998, 2027, 2790, 2000, 2054, 1996, 4073, 1996, 2082, 2018, 2081, 2000, 2689, 1029, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [10]:
# PyTorch version 
tokenizer(train.iloc[0]["stem"], return_tensors='pt')

{'input_ids': tensor([[  101,  1996, 17147,  2114,  1996,  2082,  2020,  1037, 16385,  2075,
          6271,  1010,  1998,  2027,  2790,  2000,  2054,  1996,  4073,  1996,
          2082,  2018,  2081,  2000,  2689,  1029,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1]])}

In [11]:
# TensorFlow version
tokenizer(train.iloc[0]["stem"], return_tensors='tf')

{'input_ids': <tf.Tensor: shape=(1, 27), dtype=int32, numpy=
array([[  101,  1996, 17147,  2114,  1996,  2082,  2020,  1037, 16385,
         2075,  6271,  1010,  1998,  2027,  2790,  2000,  2054,  1996,
         4073,  1996,  2082,  2018,  2081,  2000,  2689,  1029,   102]],
      dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(1, 27), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 27), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1]], dtype=int32)>}

In [31]:
tokenizer.tokenize(train.iloc[0]["stem"])

['the',
 'sanctions',
 'against',
 'the',
 'school',
 'were',
 'a',
 'punish',
 '##ing',
 'blow',
 ',',
 'and',
 'they',
 'seemed',
 'to',
 'what',
 'the',
 'efforts',
 'the',
 'school',
 'had',
 'made',
 'to',
 'change',
 '?']

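What is the difference between `tokenizer(sentence)` and `tokenizer.tokenize(sentence)`? `tokenizer.tokenize(sentence)` only splits the text into WordPiece tokens (strings) and adds no special tokens. `tokenizer(sentence)` does the full encoding: it tokenizes, adds `[CLS]` and `[SEP]`, converts the tokens to IDs, and also returns `token_type_ids` and `attention_mask`.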
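
The two calls can be mimicked with a toy example. The sketch below is not the Hugging Face implementation: `TOY_VOCAB`, `toy_tokenize`, and `toy_encode` are hypothetical names, the vocabulary is a handful of entries whose IDs are read off the outputs above, and the splitting rule is a simplified greedy longest-match WordPiece applied to whitespace-separated lowercase words.

```python
# Tiny vocabulary; IDs copied from the tokenizer outputs in this notebook.
# 100 is [UNK] in bert-base-uncased (it does not appear in the outputs above).
TOY_VOCAB = {"[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
             "the": 1996, "a": 1037, "punish": 16385, "##ing": 2075, "blow": 6271}

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Like tokenizer.tokenize: text -> list of WordPiece strings, no special tokens."""
    tokens = []
    for word in text.lower().split():
        start, pieces = 0, []
        while start < len(word):
            end, piece = len(word), None
            # Greedy longest match: shrink the candidate until it is in the vocab
            while end > start:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate  # continuation pieces carry a ## prefix
                if candidate in vocab:
                    piece = candidate
                    break
                end -= 1
            if piece is None:
                pieces = ["[UNK]"]  # no match: the whole word becomes unknown
                break
            pieces.append(piece)
            start = end
        tokens.extend(pieces)
    return tokens

def toy_encode(text, vocab=TOY_VOCAB):
    """Like tokenizer(text)['input_ids']: tokenize, add [CLS]/[SEP], map to IDs."""
    tokens = ["[CLS]"] + toy_tokenize(text, vocab) + ["[SEP]"]
    return [vocab[t] for t in tokens]

print(toy_tokenize("a punishing blow"))
print(toy_encode("a punishing blow"))
```

The first call returns strings such as `punish` and `##ing`, while the second returns the corresponding IDs wrapped in 101 and 102, which is the same relationship seen between the two real calls above.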

In [12]:
tokenizer.tokenize("Well, here's a sentence for you")

['well', ',', 'here', "'", 's', 'a', 'sentence', 'for', 'you']

In [13]:
tokenizer.encode("Hello I am a sentence")

[101, 7592, 1045, 2572, 1037, 6251, 102]

In [14]:
tokenizer.encode("Hello I am a sentence", "But I am a brand new sentence")

[101,
 7592,
 1045,
 2572,
 1037,
 6251,
 102,
 2021,
 1045,
 2572,
 1037,
 4435,
 2047,
 6251,
 102]

For each question, there are five answer choices. Only one of them is correct.

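For BERT, my first thought was to attach all five answers to each question and have the model choose one of the five responses. I initially assumed this is how the CommonsenseQA baseline does it, although the function adapted from their repo later in this notebook builds one `[CLS] question [SEP] answer [SEP]` sequence per choice.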

```
[CLS] Question text here [SEP] Ans choice A [SEP] Ans choice B [SEP] Ans choice C [SEP] Ans choice D [SEP] Ans choice E [SEP]
```

It seems complicated, however, and requires a significant lift. So for now, let me try creating five question-answer pairs for each question. Like this:

```
[CLS] Question text here [SEP] Ans choice A [SEP]
[CLS] Question text here [SEP] Ans choice B [SEP]
[CLS] Question text here [SEP] Ans choice C [SEP]
[CLS] Question text here [SEP] Ans choice D [SEP]
[CLS] Question text here [SEP] Ans choice E [SEP]
```

Only one of the above 5 inputs will have a positive label for being the correct answer. The rest will have 0. The problem with this model is that we're evaluating each choice separately to see if it looks like a right answer at all. But I think it's important for the model to know how the answer choices compare to each other as well.
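
One possible way to get back to a single answer per question from this pairwise format is to score all five pairs and keep the highest-scoring choice. The sketch below assumes the model outputs one score per pair, in A to E order; the scores are invented and `pick_answers` is a hypothetical helper, not part of this notebook's code.

```python
CHOICES = ["A", "B", "C", "D", "E"]

def pick_answers(pair_scores, n_choices=len(CHOICES)):
    """Group a flat list of per-pair scores into questions and return the best choice for each."""
    if len(pair_scores) % n_choices != 0:
        raise ValueError("expected %d scores per question" % n_choices)
    answers = []
    for start in range(0, len(pair_scores), n_choices):
        group = pair_scores[start:start + n_choices]
        best = max(range(n_choices), key=lambda i: group[i])
        answers.append(CHOICES[best])
    return answers

# Invented scores for two questions (five pairs each)
pair_scores = [0.81, 0.10, 0.05, 0.30, 0.44,
               0.20, 0.65, 0.15, 0.12, 0.08]
print(pick_answers(pair_scores))
```

This only compares the choices at prediction time. During training each pair is still scored on its own, so the concern about the model not seeing the choices side by side remains.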


In [102]:
lab_order.keys()

dict_keys(['A', 'B', 'C', 'D', 'E'])

In [165]:
lab_order = {"A": 0, "B":1, "C":2, "D":3, "E":4}

class InputExample(object):
    """A single multiple choice question."""
    # This class is adapted from https://github.com/jonathanherzig/commonsenseqa/blob/master/bert/run_commonsense_qa.py

    def __init__(
            self,
            qid,
            question,
            answer,
            label):
        """Construct an instance."""
        self.qid = qid
        self.question = question  # e.g., 'The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?'
        self.answer = answer      # e.g., "ignore" if choice label is A 
        self.label = label        # e.g., If correct answer, 1. Otherwise 0. 
        
    def __str__(self):
        return "QUESTION: {}\nANSWER  : {}\nLABEL   : {}".format(self.question, self.answer, self.label)
    
    
def create_example(row, choice_num):
    qid = row.id
    
    # Question: Just take it from stem 
    question = row.stem
    
    # Answer choice 
    label = int(row["answerKey"] == choice_num)  # If the answer key is equal to the answer choice number, mark 1 
    answer = row["choices"][lab_order[choice_num]]["text"]         # actual ans text 
    
    return InputExample(qid, question, answer, label) 
    
    
def process_examples(data):
    examples = []
    input_ids = []
    input_masks = []
    input_segment_ids = []
    
    for index, row in data.iterrows(): 
        for letter in lab_order.keys():
            example = create_example(row, letter)
            examples.append(example)
        
            encoded_example = tokenizer.encode(example.question, example.answer)
            input_ids.append(encoded_example)
            
            # For input mask, create as many 1's as there is data. The rest we will pad with 0
            # OK this is DEFINITELY manual and I know there's a better way to do it. 
            # But I'll figure that out later
            input_mask = [1]*len(encoded_example)
            input_masks.append(input_mask)
            
            # For segment IDs, 
            # question segment (including the [SEP] after it) will have segment ID = 0
            # the candidate answer will have segment ID = 1
            first_sep_index = encoded_example.index(102)
            input_segment_id = [0]*(first_sep_index + 1) + [1]*(len(encoded_example) - first_sep_index - 1)
            input_segment_ids.append(input_segment_id)
            
    # pad the results 
    MAX_LEN = max([len(eg) for eg in input_ids])

    input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", 
                                                  value=0, truncating="post", padding="post")
    input_masks = pad_sequences(input_masks, maxlen=MAX_LEN, dtype="long", 
                                                  value=0, truncating="post", padding="post")
    input_segment_ids = pad_sequences(input_segment_ids, maxlen=MAX_LEN, dtype="long", 
                                                  value=0, truncating="post", padding="post")
    
    return examples, input_ids, input_masks, input_segment_ids


In [166]:
encoded_example = tokenizer.encode(sample.question, sample.answer)

In [131]:
row = train.iloc[0]
sample = create_example(row, "A")
print(sample)

QUESTION: The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?
ANSWER  : ignore
LABEL   : 1


In [167]:
train_examples, train_input_ids, train_input_masks, train_input_segment_ids = process_examples(train)

**Some notes**

* `example` is the raw example, so you can look at the raw data. 
* Input IDs are the concatenated question-answer pair, first tokenized and then converted to token IDs. See below for how the raw text, tokens, and IDs map to each other. 
* I've padded every example to the length of the longest example. 
* Segment IDs should be 0 for the first sentence (the question) and 1 for the second sentence (the answer choice). 
* The input mask should be 1 for valid input and 0 for `[PAD]`.
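
These notes can be checked by hand on a single encoded pair. The sketch below uses plain Python lists instead of `pad_sequences`. `build_features` is a name made up here, the IDs are a shortened version of the first example (`[CLS] the sanctions [SEP] ignore [SEP]`), and `[SEP]` = 102 and `[PAD]` = 0 are taken from the outputs in this notebook.

```python
SEP_ID, PAD_ID = 102, 0

def build_features(ids, max_len):
    """Return padded input IDs, input mask, and segment IDs for one question-answer pair."""
    n_pad = max_len - len(ids)
    if n_pad < 0:
        raise ValueError("sequence longer than max_len")
    first_sep = ids.index(SEP_ID)  # position of the [SEP] that closes the question
    segment_ids = [0] * (first_sep + 1) + [1] * (len(ids) - first_sep - 1)
    input_mask = [1] * len(ids)
    return (ids + [PAD_ID] * n_pad,
            input_mask + [0] * n_pad,
            segment_ids + [0] * n_pad)

ids = [101, 1996, 17147, 102, 8568, 102]
input_ids, input_mask, segment_ids = build_features(ids, max_len=8)
print(input_ids)    # [101, 1996, 17147, 102, 8568, 102, 0, 0]
print(input_mask)   # [1, 1, 1, 1, 1, 1, 0, 0]
print(segment_ids)  # [0, 0, 0, 0, 1, 1, 0, 0]
```

Padding positions get segment ID 0 as well, so the input mask is what tells real question tokens apart from padding.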

In [171]:
print(train_examples[0])
print("\nInput IDs (token IDs)")
print(train_input_ids[0])

print("\nDecoded from input IDs")
print(tokenizer.decode(train_input_ids[0]))
print("\nTokens")
print(tokenizer.convert_ids_to_tokens(train_input_ids[0]))

print("\nInput masks")
print(train_input_masks[0])
print("\nInput segment IDs")
print(train_input_segment_ids[0])

QUESTION: The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?
ANSWER  : ignore
LABEL   : 1

Input IDs (token IDs)
[  101  1996 17147  2114  1996  2082  2020  1037 16385  2075  6271  1010
  1998  2027  2790  2000  2054  1996  4073  1996  2082  2018  2081  2000
  2689  1029   102  8568   102     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0]

Decoded from input IDs
[CLS] the sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change? [SEP] ignore [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

In [151]:
from transformers import TFBertModel 

# The following code is adapted from W266 BERT notebook
# https://github.com/datasci-w266/2020-fall-main/blob/master/materials/Bert/BERT_NER_tf_21_v1.ipynb

def classification_model(max_input_length, train_layers, optimizer):
    """
    Implementation of binary classification model
    
    variables:
        max_input_length: number of tokens (max_length + 1)
        train_layers: number of layers to be retrained
        optimizer: optimizer to be used
    
    returns: model
    """

    # Use the function argument max_input_length (max_length was undefined here)
    in_id = tf.keras.layers.Input(shape=(max_input_length,), dtype='int32', name="input_ids")
    in_mask = tf.keras.layers.Input(shape=(max_input_length,), dtype='int32', name="input_masks")
    in_segment = tf.keras.layers.Input(shape=(max_input_length,), dtype='int32', name="segment_ids")
    
    bert_inputs = [in_id, in_mask, in_segment]
    
    bert_layer = TFBertModel.from_pretrained('bert-base-uncased')
    
    
    # Freeze layers, i.e. only train number of layers specified, starting from the top
    
    retrain_layers = []
    
    for retrain_layer_number in range(train_layers):
        
        layer_code = '_' + str(11 - retrain_layer_number)
        retrain_layers.append(layer_code)
    
    for w in bert_layer.weights:
        if not any([x in w.name for x in retrain_layers]):
            w._trainable = False
            
    # End of freezing section
    
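    bert_sequence = bert_layer(bert_inputs)[0]

    # NOTE: the original cell ends at the line above. The lines below are an
    # assumed minimal completion (a sigmoid head on the [CLS] vector for the
    # binary "is this the correct answer" label), not a final design.
    cls_vector = bert_sequence[:, 0, :]
    out = tf.keras.layers.Dense(1, activation='sigmoid', name="is_correct")(cls_vector)
    model = tf.keras.Model(inputs=bert_inputs, outputs=out)
    model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=['accuracy'])
    return model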
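
The freezing loop in the cell above keeps a weight trainable only if its name contains one of the generated layer codes. The plain-Python sketch below mimics that selection without loading BERT. The weight names are assumed examples of the naming pattern and may differ between library versions, and `layer_codes` and `is_trainable` are names introduced here for illustration.

```python
def layer_codes(train_layers, n_layers=12):
    """Name fragments for the top `train_layers` encoder layers, e.g. ['_11', '_10']."""
    return ["_" + str(n_layers - 1 - i) for i in range(train_layers)]

def is_trainable(weight_name, codes):
    """A weight stays trainable only if its name contains one of the layer codes."""
    return any(code in weight_name for code in codes)

# Assumed example weight names; real names depend on the transformers version
names = [
    "tf_bert_model/bert/embeddings/word_embeddings/weight:0",
    "tf_bert_model/bert/encoder/layer_._9/attention/self/query/kernel:0",
    "tf_bert_model/bert/encoder/layer_._10/attention/self/query/kernel:0",
    "tf_bert_model/bert/encoder/layer_._11/output/dense/kernel:0",
]
codes = layer_codes(train_layers=2)
for name in names:
    print(is_trainable(name, codes), name)
```

Because the match is a plain substring test, it is worth printing the selected names once on the real model to confirm that only the intended layers remain trainable.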

In [152]:
bert_layer = TFBertModel.from_pretrained('bert-base-uncased')


Some weights of the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Below is an attempt at the other input type:


```
[CLS] Question text here [SEP] Ans choice A [SEP] Ans choice B [SEP] Ans choice C [SEP] Ans choice D [SEP] Ans choice E [SEP]
```
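
As a rough sketch of what this single-sequence input could look like at the token level, the code below splits on whitespace for readability (the real input would use WordPiece tokens). It assumes every answer shares segment ID 1, since the printed model only has two token types (`token_type_embeddings: Embedding(2, 768)`). `build_joint_input` is a hypothetical helper.

```python
def build_joint_input(question, answers):
    """Build [CLS] question [SEP] ans A [SEP] ... ans E [SEP] with segment IDs."""
    tokens = ["[CLS]"] + question.split() + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # question side, including its [SEP]
    for answer in answers:
        answer_tokens = answer.split() + ["[SEP]"]
        tokens += answer_tokens
        segment_ids += [1] * len(answer_tokens)  # all answers share segment 1
    return tokens, segment_ids

tokens, segment_ids = build_joint_input(
    "Question text here",
    ["ignore", "enforce", "authoritarian", "yell at", "avoid"],
)
print(" ".join(tokens))
print(segment_ids)
```

With only two segment IDs available, the model has to rely on the `[SEP]` tokens to tell the five answers apart, which is part of why this format is harder to set up than the pairwise one.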


In [71]:
class InputExample(object):
    """A single multiple choice question."""
    # This function is copied from https://github.com/jonathanherzig/commonsenseqa/blob/master/bert/run_commonsense_qa.py

    def __init__(
            self,
            qid,
            question,
            answers,
            label):
        """Construct an instance."""
        self.qid = qid
        self.question = question
        self.answers = answers
        self.label = label
    
    
def tokenize_qa(row):
    qid = row.id
    
    # Question: Just take it from stem 
    question = row.stem
    tokens_q = tokenizer.tokenize(question)
    
    # Answers
    # This snippet is adapted from https://github.com/jonathanherzig/commonsenseqa/blob/master/bert/run_commonsense_qa.py
    # Take the five responses and convert them into a single array. 
    # Sort them by ABCDE choice number. 
    answers = np.array([
        choice["text"]
        for choice in sorted(
            row['choices'],
            key=lambda c: c['label'])
      ])
    
#     tokens_a =answers)
        
    # Here is the answer key 
    label = row["answerKey"]

    # input_ids = tokenizer.encode(tokens_q, tokens_a)
        
    return InputExample(qid, question, answers, label) 
    
def process_examples(data):
    examples = []
    
    for index, row in data.iterrows(): 
        example = tokenize_qa(row)
        examples.append(example)
        
        # TODO: encode and pad each example here. The original call to
        # pad_n_tokenize(example) is disabled because that helper is not defined yet.
        
    return examples

def example_to_token_ids_segment_ids_label_ids(
    ex_index,
    example,
    max_seq_length,
    tokenizer):
    # This function is adapted from https://github.com/jonathanherzig/commonsenseqa/blob/master/bert/run_commonsense_qa.py
    
    """Converts an ``InputExample`` to token ids and segment ids."""
#     if ex_index < 5:
#         tf.logging.info(f"*** Example {ex_index} ***")
#         tf.logging.info("qid: %s" % (example.qid))

    question_tokens = tokenizer.tokenize(example.question)
    answers_tokens = map(tokenizer.tokenize, example.answers)   # Map b/c there are multiple sentences in here. Tokenizer typically only takes in two segments but we want more.

    token_ids = []
    segment_ids = []
    for choice_idx, answer_tokens in enumerate(answers_tokens):
        truncated_question_tokens = question_tokens[
            :max((max_seq_length - 3)//2, max_seq_length - (len(answer_tokens) + 3))]
        truncated_answer_tokens = answer_tokens[
            :max((max_seq_length - 3)//2, max_seq_length - (len(question_tokens) + 3))]

        choice_tokens = []
        choice_segment_ids = []
        choice_tokens.append("[CLS]")
        choice_segment_ids.append(0)
        for question_token in truncated_question_tokens:
            choice_tokens.append(question_token)
            choice_segment_ids.append(0)
        choice_tokens.append("[SEP]")
        choice_segment_ids.append(0)
        for answer_token in truncated_answer_tokens:
            choice_tokens.append(answer_token)
            choice_segment_ids.append(1)
        choice_tokens.append("[SEP]")
        choice_segment_ids.append(1)

        choice_token_ids = tokenizer.convert_tokens_to_ids(choice_tokens)

        token_ids.append(choice_token_ids)
        segment_ids.append(choice_segment_ids)

        if ex_index < 5:
            # tf.logging and tokenization.printable_text come from the original
            # TF1 repo and are not available here, so use plain print instead.
            print("choice %s" % choice_idx)
            print("tokens: %s" % " ".join(choice_tokens))
            print("token ids: %s" % " ".join(str(x) for x in choice_token_ids))
            print("segment ids: %s" % " ".join(str(x) for x in choice_segment_ids))

    # answerKey is a letter (A-E) in this notebook, so map it to a 0-based index
    label_ids = ["ABCDE".index(example.label)]

    if ex_index < 5:
        print("label: %s (id = %d)" % (example.label, label_ids[0]))

    return token_ids, segment_ids, label_ids

In [72]:
row = train.iloc[0]
tokenize_qa(row)

<__main__.InputExample at 0x7f1dc61c7a00>

In [69]:
row = train.iloc[0]
tokenize_qa(row)
sample = tokenize_qa(row)

In [87]:
row = train.iloc[0]
tokenize_qa(row)
sample = tokenize_qa(row)
# train_examples = process_examples(train)

In [74]:
type(train_examples)

list

In [75]:
train_examples[0:5]

[<__main__.InputExample at 0x7f1e94132310>,
 <__main__.InputExample at 0x7f1ddc0a6790>,
 <__main__.InputExample at 0x7f1ddc0a67f0>,
 <__main__.InputExample at 0x7f1ddc0a6910>,
 <__main__.InputExample at 0x7f1ddc0a6b20>]

I have to pad the examples. 

In [77]:
train_examples[0].question

'The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?'

In [83]:
train_examples[0].answers

array(['ignore', 'enforce', 'authoritarian', 'yell at', 'avoid'],
      dtype='<U13')

In [85]:
print(*train_examples[0].answers)

ignore enforce authoritarian yell at avoid


In [88]:
# tokenizer.encode(train_examples[0].question, *train_examples[0].answers)