# **MultiRC** - Multihop multiple-choice question answering dataset

# **Model** - NER-based QA

**APPROACH** -

**Dataset Preparation**
1. Concatenate paragraph + question + answers into a single context.
2. Tag the first token of each segment with a distinguishing tag: paragraph sentence (P), question (Q), correct answer (C) or wrong answer (W). Every other token gets the inside tag (I).
3. The resulting dataset is a CSV file with the following structure:

\<ID, TOKEN, TAG\>

where,

ID - unique for every (paragraph, question, answers) combination

TOKEN - one token of the concatenated and tokenized paragraph + question + options

TAG - predetermined tag for that token, based on the portion of the context it belongs to
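
As a rough illustration of steps 1-3, the sketch below builds `<ID, TOKEN, TAG>` rows for one example. The helper names `tag_segment` and `build_rows`, the whitespace tokenization and the toy sentences are assumptions made for illustration; the preprocessing script that produced the CSV files is not part of this notebook.

```python
def tag_segment(text, begin_tag):
    """Tag the first token of a segment with begin_tag and the rest with 'I'."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    return [(tok, begin_tag if i == 0 else "I") for i, tok in enumerate(tokens)]


def build_rows(example_id, paragraph_sentences, question, answers):
    """Return (ID, TOKEN, TAG) rows for one (paragraph, question, answers) example.

    answers is a list of (answer_text, is_correct) pairs.
    """
    pairs = []
    for sentence in paragraph_sentences:
        pairs.extend(tag_segment(sentence, "P"))
    pairs.extend(tag_segment(question, "Q"))
    for text, is_correct in answers:
        pairs.extend(tag_segment(text, "C" if is_correct else "W"))
    return [(example_id, tok, tag) for tok, tag in pairs]


rows = build_rows(
    1,
    ["Cats purr .", "Dogs bark ."],
    "Who purrs ?",
    [("Cats", True), ("Dogs bark", False)],
)
for row in rows:
    print(row)
```

Each row maps directly onto one line of the CSV file, so writing `rows` with `csv.writer` would give the `ID, TOKEN, TAG` layout described above.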


**Model Preparation**

4. Train the model to learn this variation of BIO tagging

**Evaluation Preparation**

5. Evaluate the model's performance against the expected results: the tokens of a correct answer should be tagged C followed by I, and the tokens of a wrong answer W followed by I.

# NOTE : Search "TODO" to make changes for original/sampled data

In [0]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# Mounting data
1. train.csv - training set
2. dev.csv - testing set

Note: we use the validation set as our test set, since the MultiRC test set is not publicly available, so labels cannot be verified and model performance cannot be analysed on it.

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
PARENT_DIR = "/content/gdrive/My Drive/MultiRC_NER"

In [0]:
!ls "/content/gdrive/My Drive/MultiRC_NER/data"

dev.csv		dev_v2.csv  qa	       train_sample.csv  train_v3.csv
dev_sample.csv	dev_v3.csv  train.csv  train_v2.csv


# Requirements

In [0]:
!pip install seqeval
!pip install transformers

Collecting seqeval
  Downloading https://files.pythonhosted.org/packages/34/91/068aca8d60ce56dd9ba4506850e876aba5e66a6f2f29aa223224b50df0de/seqeval-0.0.12.tar.gz
Building wheels for collected packages: seqeval
  Building wheel for seqeval (setup.py) ... done
  Created wheel for seqeval: filename=seqeval-0.0.12-cp36-none-any.whl size=7424 sha256=1214ce8d3aaccceabfbb6d8b1a91f28e6b9386860e9271381848e134bdc38d95
  Stored in directory: /root/.cache/pip/wheels/4f/32/0a/df3b340a82583566975377d65e724895b3fad101a3fb729f68
Successfully built seqeval
Installing collected packages: seqeval
Successfully installed seqeval-0.0.12
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/a3/78/92cedda05552398352ed9784908b834ee32a0bd071a9b32de287327370b7/transformers-2.8.0-py3-none-any.whl (563kB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed

In [0]:
import pandas as pd
import math
import numpy as np
from seqeval.metrics import classification_report, accuracy_score, f1_score
import torch.nn.functional as F

In [0]:
import torch
import os
from tqdm import tqdm,trange
from torch.optim import Adam
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertConfig
from transformers import BertForTokenClassification, AdamW

Using TensorFlow backend.


In [0]:
# Check library version
!pip list | grep -E 'transformers|torch|Keras'

Keras                    2.3.1          
Keras-Applications       1.0.8          
Keras-Preprocessing      1.1.0          
torch                    1.5.0+cu101    
torchsummary             1.5.1          
torchtext                0.3.1          
torchvision              0.6.0+cu101    
transformers             2.8.0          


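This notebook was originally written for the environment below. The run shown here used the newer versions listed by `pip list` above (torch 1.5.0, transformers 2.8.0):

- Keras                2.3.1
- torch                1.1.0
- transformers         2.2.0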

## Load training data

**Load CSV data**

In [0]:
data_path = PARENT_DIR + "/data" 

In [0]:
# TODO: "train.csv" - original, "train_sample.csv" - sampled file(1/100th data)
train_file_address = PARENT_DIR + "/data/train_v2.csv"

In [0]:
# Forward-fill missing values (e.g. an ID given only on the first row of an example)
# NOTE - encoding changed from latin1 to utf-8
df_data = pd.read_csv(train_file_address, sep=",", encoding="utf-8").ffill()

In [0]:
df_data.columns

Index(['ID', 'TOKEN', 'TAG'], dtype='object')

In [0]:
df_data.head(n=20)

    ID     TOKEN TAG
0    1  Animated   P
1    1   history   I
2    1        of   I
3    1       the   I
4    1        US   I
5    1         .   I
6    1        Of   P
7    1    course   I
8    1       the   I
9    1   cartoon   I


**TAG categories**


In [0]:
# TAG distribution
df_data.TAG.value_counts()

I    1541258
P      68462
W      15218
C      12025
Q       5131
Name: TAG, dtype: int64

### TAG nomenclature
As shown above, there are four begin tags, one each for paragraph sentence, question, correct answer and wrong answer, plus one inside tag:
- P: first word of a paragraph sentence
- Q: first word of the question
- C: first word of a correct answer
- W: first word of a wrong answer
- I: inside, any word that is not at the first position of its sentence
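
To make the tagging scheme concrete, the sketch below reads a tagged token sequence back into segments. The helper `split_segments` and the toy data are hypothetical and only illustrate how begin tags and the inside tag relate; this is not code from the evaluation notebook.

```python
BEGIN_TAGS = {"P", "Q", "C", "W"}


def split_segments(tokens, tags):
    """Group tokens into (begin_tag, [tokens]) segments.

    A new segment starts at every P/Q/C/W tag; 'I' tokens extend the
    current segment. 'I' tokens that precede any begin tag are skipped.
    """
    segments = []
    for token, tag in zip(tokens, tags):
        if tag in BEGIN_TAGS:
            segments.append((tag, [token]))
        elif tag == "I" and segments:
            segments[-1][1].append(token)
    return segments


tokens = ["Cats", "purr", ".", "Who", "purrs", "?", "Cats", "Dogs", "bark"]
tags = ["P", "I", "I", "Q", "I", "I", "C", "W", "I"]
segments = split_segments(tokens, tags)

# Keep only the answer options together with their correct/wrong tag
answers = [(tag, " ".join(words)) for tag, words in segments if tag in {"C", "W"}]
print(answers)
```

Reading the begin tag of each answer segment is one way to turn token-level predictions back into a per-option correct/wrong decision.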

## Parse data

**Parse data into document structure**

In [0]:
class SentenceGetter(object):
    
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, t) for w, t in zip(s["TOKEN"].values.tolist(),
                                                           s["TAG"].values.tolist())]
        self.grouped = self.data.groupby("ID").apply(agg_func)
        self.sentences = [s for s in self.grouped]
    
    def get_next(self):
        # Return the next (token, tag) sequence, or None once all are consumed
        if self.n_sent > len(self.sentences):
            return None
        s = self.sentences[self.n_sent - 1]
        self.n_sent += 1
        return s

In [0]:
# Get full document data structure
getter = SentenceGetter(df_data)

In [0]:
# Get sentence data
sentences = [[s[0] for s in sent] for sent in getter.sentences]
# sentences[0]

In [0]:
# Get TAG labels data
labels = [[s[1] for s in sent] for sent in getter.sentences]
# print(labels[0])
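
For readers less familiar with `groupby(...).apply(...)`, the grouping done by `SentenceGetter` is roughly equivalent to the pure-Python sketch below. The helper `group_by_id` and the toy rows are hypothetical and only illustrate the resulting data structure.

```python
from collections import defaultdict


def group_by_id(rows):
    """Group (ID, TOKEN, TAG) rows into one [(token, tag), ...] list per ID."""
    grouped = defaultdict(list)
    for example_id, token, tag in rows:
        grouped[example_id].append((token, tag))
    # pandas groupby sorts group keys by default, so sort here as well
    return [grouped[key] for key in sorted(grouped)]


toy_rows = [
    (1, "Cats", "P"), (1, "purr", "I"),
    (2, "Dogs", "P"), (2, "bark", "I"), (2, ".", "I"),
]
documents = group_by_id(toy_rows)

# Same split into token lists and tag lists as in the cells above
toy_sentences = [[token for token, _ in doc] for doc in documents]
toy_labels = [[tag for _, tag in doc] for doc in documents]
print(toy_sentences)
print(toy_labels)
```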

**Convert TAG name into index for training**

In [0]:
tags_vals = list(set(df_data["TAG"].values))
# Add the X label for word-piece continuation tokens
# Add [CLS] and [SEP], which BERT needs
tags_vals.append('X')
tags_vals.append('[CLS]')
tags_vals.append('[SEP]')
tags_vals = set(tags_vals)

In [0]:
# Set a dict for mapping tag name to id
# tag2idx = {t: i for i, t in enumerate(tags_vals)}
# TODO: why manual ?
# Manual definition: fixed ids, since iterating over the set above has no guaranteed order
tag2idx={'C': 2,
 'I': 3,
 'P': 0,
 'Q': 1,
 'W': 4,
 'X':5,
 '[CLS]':6,
 '[SEP]':7}

In [0]:
# Mapping index to name (reverse)
tag2name={tag2idx[key] : key for key in tag2idx.keys()}

## Preprocess Training Data

Raw data => trainable data for BERT, including:

- GPU environment
- Loading the tokenizer and tokenizing the text
- Setting up 3 embeddings: token, mask word, segmentation

**Setting-up GPU environment**

In [0]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()

In [0]:
n_gpu

1

### Loading Tokenizer

Download the tokenizer vocabulary file into the GDrive folder first:
- [vocab.txt](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt)

In [0]:
vocabulary = PARENT_DIR + "/models/vocab.txt"

In [0]:
# Length of the sentence = 384 (dataset analysis- paragraph + question + answers = ~ 350, generally.)
# CAUTION - should be less than 512
# TODO : try with increased length
max_len  = 384

In [0]:
# load tokenizer, with manual file address or pretrained address
tokenizer=BertTokenizer(vocab_file=vocabulary,do_lower_case=False)

**Tokenize text**

- The Hugging Face BERT tokenizer splits a word that is not in its vocabulary into word pieces.
- The labels must be adjusted to the tokenization result: the first piece keeps the word's label and every continuation piece (“##abc”) gets the label "X".
- "[CLS]" is added at the front and "[SEP]" at the end, as the paper does; see [BERT indexer should add [CLS] and [SEP] tokens](https://github.com/allenai/allennlp/issues/2141)


In [0]:
tokenized_texts = []
word_piece_labels = []
for i_inc, (word_list, label) in enumerate(zip(sentences, labels)):
    # Start every sequence with [CLS]
    temp_label = ['[CLS]']
    temp_token = ['[CLS]']

    for word, lab in zip(word_list, label):
        token_list = tokenizer.tokenize(word)
        for m, token in enumerate(token_list):
            temp_token.append(token)
            # The first word piece keeps the word's tag; continuation pieces get 'X'
            temp_label.append(lab if m == 0 else 'X')

    # End every sequence with [SEP]
    temp_label.append('[SEP]')
    temp_token.append('[SEP]')

    tokenized_texts.append(temp_token)
    word_piece_labels.append(temp_label)

    # Print the first 5 examples as a sanity check
    if i_inc < 5:
        print("No.%d,len:%d"%(i_inc,len(temp_token)))
        print("texts:%s"%(" ".join(temp_token)))
        print("No.%d,len:%d"%(i_inc,len(temp_label)))
        print("labels:%s"%(" ".join(temp_label)))

No.0,len:378
texts:[CLS] Animated history of the US . Of course the cartoon is highly overs ##im ##plified and most critics consider it one of the weak ##est parts of the film . But it makes a valid claim which you ignore entirely : That the strategy to promote gun rights for white people and to out ##law gun possession by black people was a way to up ##hold racism without letting an openly terrorist organization like the K ##K ##K flourish . Did the 19th century N ##RA in the southern states promote gun rights for black people ? I highly doubt it . But if they didn ' t one of their functions was to continue the racism of the K ##K ##K . This is the key message of this part of the animation which is again being ignored by its critics . B ##uel ##l shooting in Flint . You write : F ##act : The little boy was the class th ##ug already suspended from school for stabbing another kid with a pencil and had fought with Kay ##la the day before . This characterization of a six - year - old as a
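
Going the other way, word-level tags can be recovered from word-piece output by dropping the special tokens and gluing continuation pieces back together. The sketch below is an illustration only: the helper `merge_word_pieces` is hypothetical, and it assumes continuation pieces start with `##`, as in the output above.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_word_pieces(pieces, piece_tags):
    """Rebuild (word, tag) pairs from word pieces and their tags.

    The tag of a word is the tag of its first piece; '##' continuation
    pieces are appended to the previous word and their tag is ignored.
    """
    words, tags = [], []
    for piece, tag in zip(pieces, piece_tags):
        if piece in SPECIAL_TOKENS:
            continue
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
            tags.append(tag)
    return list(zip(words, tags))


pieces = ["[CLS]", "highly", "overs", "##im", "##plified", "[SEP]"]
piece_tags = ["[CLS]", "P", "I", "X", "X", "[SEP]"]
merged = merge_word_pieces(pieces, piece_tags)
print(merged)
```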

### Setting-up token embedding

Pad or truncate the token and label sequences so that each has exactly `max_len` entries

In [0]:
# Make text token into id
input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],
                          maxlen=max_len, dtype="long", truncating="post", padding="post")
# print(input_ids[0])

In [0]:
# Convert labels to ids and pad with "W" (wrong)
# Note - "W" takes the place of the usual "O" (others) padding tag
tags = pad_sequences([[tag2idx.get(l) for l in lab] for lab in word_piece_labels],
                     maxlen=max_len, value=tag2idx["W"], padding="post",
                     dtype="long", truncating="post")
# print(tags[0])
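
What `pad_sequences` does with `padding="post"` and `truncating="post"` can be sketched in plain Python. The helper `pad_or_truncate` and the toy ids are hypothetical and stand in for the Keras call; only the post-padding and post-truncation behaviour is mirrored.

```python
def pad_or_truncate(sequences, max_len, pad_value=0):
    """Right-pad or right-truncate every sequence to exactly max_len items."""
    result = []
    for seq in sequences:
        clipped = list(seq[:max_len])                      # truncating="post"
        clipped += [pad_value] * (max_len - len(clipped))  # padding="post"
        result.append(clipped)
    return result


toy_ids = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102]]
padded = pad_or_truncate(toy_ids, max_len=5)
print(padded)
```

Note that post-truncation cuts from the end, so an over-long sequence loses its trailing tokens, including the final `[SEP]` (id 102 in the toy data).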

### Setting-up mask word embedding

In [0]:
# Attention mask for fine-tuning and prediction: 1 for real tokens, 0 for padding
attention_masks = [[int(i>0) for i in ii] for ii in input_ids]
attention_masks[0];
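
The `i > 0` test works because the padding token has id 0. A slightly more explicit version is sketched below; the helper `build_attention_mask` and the toy batch are hypothetical, and `PAD_ID = 0` is an assumption that holds for the bert-base-cased vocabulary.

```python
PAD_ID = 0  # assumption: [PAD] is id 0, as in the bert-base-cased vocabulary


def build_attention_mask(input_ids, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding positions."""
    return [[int(token_id != pad_id) for token_id in row] for row in input_ids]


toy_batch = [[101, 7, 8, 102, 0], [101, 7, 8, 9, 10]]
masks = build_attention_mask(toy_batch)
print(masks)
```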

### Setting-up segment embedding (Analysis: for a sequence tagging task, this embedding is not necessary)

In [0]:
# Since each input is a single sequence, all segment ids are set to 0
segment_ids = [[0] * len(input_id) for input_id in input_ids]
segment_ids[0];

In [0]:
# print(segment_ids) # ERROR - IOPub data rate exceeded. (TOO MUCH!)

## Load Training DataSet

In [0]:
tr_inputs, tr_tags, tr_masks, tr_segs = input_ids, tags, attention_masks, segment_ids

In [0]:
len(tr_inputs),len(tr_segs)

(5131, 5131)

In [0]:
print(tr_inputs)

[[  101 24238  1607 ...     0     0     0]
 [  101 24238  1607 ...  1111  1602  1234]
 [  101 24238  1607 ... 18848  2513  1104]
 ...
 [  101   158   119 ...  1523  1113  9170]
 [  101   158   119 ...  1523  1113  9170]
 [  101   158   119 ...  1523  1113  9170]]


**Set data into tensor**

NOTE - Calling tensor.to(device) at this step is not recommended, since moving the whole dataset to the GPU can exhaust GPU memory

In [0]:
tr_inputs = torch.tensor(tr_inputs)
tr_tags = torch.tensor(tr_tags)
tr_masks = torch.tensor(tr_masks)
tr_segs = torch.tensor(tr_segs)

**Put data into data loader**

In [0]:
# Set batch size
batch_num = 16

In [0]:
# Only token ids, attention masks and tags are used; segment ids are left out
train_data = TensorDataset(tr_inputs, tr_masks, tr_tags)
train_sampler = RandomSampler(train_data)
# drop_last=True skips the final incomplete batch, so every training batch has the same size
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_num, drop_last=True)

## Train model

- Pre-requisite: Downloading model files in GDrive
- Model used - BERT-base-cased
- pytorch_model.bin: [pytorch_model.bin](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin)
- config.json: [config.json](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json)    

**Loading BERT model**

In [0]:
# This folder contains the model config (json) and model weight (bin) files
# pytorch_model.bin, download from: https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin
# config.json, download from: https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json
model_file_address = PARENT_DIR + "/models"

In [0]:
!ls "/content/gdrive/My Drive/MultiRC_NER/models"

config.json  pytorch_model.bin	vocab.txt


In [0]:
# Will load config and weight with from_pretrained()
model = BertForTokenClassification.from_pretrained(model_file_address,num_labels=len(tag2idx))

In [0]:
model;

In [0]:
# Move the model to the GPU if one is available, otherwise keep it on the CPU
model.to(device);

In [0]:
# OPTIONAL: for multi GPU support
#if n_gpu >1:
#   model = torch.nn.DataParallel(model)

In [0]:
# Set number of epochs and max gradient norm for clipping
epochs = 5
max_grad_norm = 1.0

In [0]:
# Calculate the number of training optimization steps
num_train_optimization_steps = int( math.ceil(len(tr_inputs) / batch_num) / 1) * epochs
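
The formula above counts a final partial batch in every epoch. Since the training `DataLoader` is created with `drop_last=True`, that partial batch is skipped, so the number of optimizer steps actually taken is slightly lower. A small sketch of the arithmetic, using the example count printed earlier in this notebook:

```python
import math

num_examples = 5131   # len(tr_inputs), as printed earlier in the notebook
batch_size = 16
epochs = 5

# Formula used in the notebook: counts a final partial batch
steps_with_partial = math.ceil(num_examples / batch_size) * epochs

# With drop_last=True the DataLoader skips the final partial batch
steps_drop_last = (num_examples // batch_size) * epochs

print(steps_with_partial, steps_drop_last)
```

The difference only matters if the step count is later used for something that must match the real number of updates, such as a learning-rate schedule.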

### Setting-up fine tuning method

**Manual optimizer**

In [0]:
# True: fine tuning all the layers using AdamW
# False: only fine tuning the classifier layers
FULL_FINETUNING = True

In [0]:
if FULL_FINETUNING:
    # Fine-tune all model parameters
    param_optimizer = list(model.named_parameters())
    # No weight decay for biases and LayerNorm weights
    no_decay = ['bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        # AdamW reads the 'weight_decay' key of each parameter group
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
         'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
         'weight_decay': 0.0}
    ]
else:
    # Only fine-tune the classifier parameters
    param_optimizer = list(model.classifier.named_parameters())
    optimizer_grouped_parameters = [{"params": [p for n, p in param_optimizer]}]
optimizer = AdamW(optimizer_grouped_parameters, lr=3e-5)
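
The name-based split into parameter groups can be checked on a few parameter names without loading the model. In the sketch below, the helper `split_by_decay` and the toy names are hypothetical; the key list `("bias", "LayerNorm.weight")` is the convention used in the Hugging Face example scripts and is an assumption here.

```python
NO_DECAY_KEYS = ("bias", "LayerNorm.weight")


def split_by_decay(param_names, no_decay_keys=NO_DECAY_KEYS):
    """Split parameter names into (decay, no_decay) lists by substring match."""
    decay = [n for n in param_names if not any(k in n for k in no_decay_keys)]
    no_decay = [n for n in param_names if any(k in n for k in no_decay_keys)]
    return decay, no_decay


toy_names = [
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "bert.encoder.layer.0.output.LayerNorm.weight",
    "bert.encoder.layer.0.output.LayerNorm.bias",
    "classifier.weight",
    "classifier.bias",
]
decay_names, no_decay_names = split_by_decay(toy_names)
print(decay_names)
print(no_decay_names)
```

Running the same split over `model.named_parameters()` is a quick way to confirm that both groups are non-empty before training.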

### Fine-tuning model

In [0]:
# TRAIN loop
model.train();

In [0]:
# Check logs for crash
#!cat /var/log/colab-jupyter.log

In [0]:
print("***** Running training *****")
print("  Num examples = %d"%(len(tr_inputs)))
print("  Batch size = %d"%(batch_num))
print("  Num steps = %d"%(num_train_optimization_steps))
for _ in trange(epochs,desc="Epoch"):
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    for step, batch in enumerate(train_dataloader):
        # add batch to gpu
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        
        # forward pass
        outputs = model(b_input_ids, token_type_ids=None,
                        attention_mask=b_input_mask, labels=b_labels)
        loss, scores = outputs[:2]
        # if n_gpu > 1:
        #     # When using multiple GPUs, average the loss
        #     loss = loss.mean()
        
        # backward pass
        loss.backward()
        
        # track train loss
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1
        
        # gradient clipping
        torch.nn.utils.clip_grad_norm_(parameters=model.parameters(), max_norm=max_grad_norm)
        
        # update parameters
        optimizer.step()
        optimizer.zero_grad()
        
    # print train loss per epoch
    print("Train loss: {}".format(tr_loss/nb_tr_steps))
        

Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

***** Running training *****
  Num examples = 5131
  Batch size = 16
  Num steps = 1605


	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha)
Epoch:  20%|██        | 1/5 [10:31<42:07, 631.78s/it]

Train loss: 0.0983140283060493


Epoch:  40%|████      | 2/5 [21:03<31:35, 631.80s/it]

Train loss: 0.00955794602195965


Epoch:  60%|██████    | 3/5 [31:41<21:07, 633.50s/it]

Train loss: 0.0076077595091192055


Epoch:  80%|████████  | 4/5 [42:23<10:36, 636.17s/it]

Train loss: 0.0066671763692284


Epoch: 100%|██████████| 5/5 [52:53<00:00, 634.69s/it]

Train loss: 0.005843549657220138
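
The gradient clipping step in the training loop rescales all gradients together whenever their combined L2 norm exceeds `max_grad_norm`. The plain-Python sketch below illustrates that idea on toy numbers; the helper `clip_by_global_norm` is hypothetical and is not the PyTorch implementation.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = max_norm / (total_norm + eps)
    if scale < 1.0:
        grads = [[g * scale for g in vec] for vec in grads]
    return grads, total_norm


toy_grads = [[3.0, 0.0], [0.0, 4.0]]  # combined norm is 5.0
clipped, norm_before = clip_by_global_norm(toy_grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, round(norm_after, 4))
```

Gradients whose combined norm is already below the threshold are returned unchanged, so clipping only affects unusually large updates.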





## Save model 

In [0]:
# TODO: output/ => original data, output/sample/ => sampled data
bert_out_address = PARENT_DIR + "/output/trained_v2_model/"

In [0]:
# Save a trained model, configuration and tokenizer
model_to_save = model.module if hasattr(model, 'module') else model  # Unwrap DataParallel, if used, and save only the model itself

In [None]:
# Make the output directory if it does not exist
os.makedirs(bert_out_address, exist_ok=True)

In [0]:
# If we save using the predefined names, we can load using `from_pretrained`
output_model_file = bert_out_address + "pytorch_model.bin"
output_config_file = bert_out_address + "config.json"

In [0]:
# Save model into file in drive
torch.save(model_to_save.state_dict(), output_model_file)
model_to_save.config.to_json_file(output_config_file)
tokenizer.save_vocabulary(bert_out_address)

('/content/gdrive/My Drive/MultiRC_NER/output/trained_v2_model/vocab.txt',)
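
Because `tag2idx` is defined by hand, the evaluation side needs exactly the same mapping to decode predictions. One option, sketched below as an assumption rather than something this notebook does, is to store the mapping as JSON next to the model files. The helpers `save_tag_map` and `load_tag_map` and the file name `tag2idx.json` are hypothetical, and a temporary directory stands in for `bert_out_address`.

```python
import json
import os
import tempfile

tag2idx = {"P": 0, "Q": 1, "C": 2, "I": 3, "W": 4, "X": 5, "[CLS]": 6, "[SEP]": 7}


def save_tag_map(mapping, out_dir, filename="tag2idx.json"):
    """Write the tag-to-index mapping as JSON and return the file path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(mapping, handle, indent=2)
    return path


def load_tag_map(path):
    """Read the tag-to-index mapping back from JSON."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# A temporary directory stands in for bert_out_address in this sketch
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_tag_map(tag2idx, tmp_dir)
    restored = load_tag_map(saved_path)

# Reverse mapping, used to turn predicted ids back into tag names
idx2tag = {index: tag for tag, index in restored.items()}
print(idx2tag[2])
```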

# ----------- END OF TRAINING -----------

# Refer to the MultiRC-NER_eval notebook for EVALUATIONS & ANALYSIS