# NLP ARC Semantic Parsing

**Objective** : Perform semantic parsing on the [ARC dataset](http://data.allenai.org/arc/) and do the question answering task.  We have chosen to focus on a subset of the ARC dataset mainly questions that ask to `find an example of` . 


*Note :* The below code references  and is based on the .[SWAG example ](https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/examples/run_swag.py) provided in the huggingface repository.

TODO
-------

1. Create dataset, dataloader for 
2. Split into train and eval
3. Write a benchmark model --> regular BERT model
4. Train loop & eval
5. Hookup context sentence 
6. Accuracy and Loss are not good 

The Arc dataset is divided into train and eval dataset using the category {Train, Test, dev}.  For training we use records in Train category and for the test or eval we use Test and dev.



# Setup

## Installation

Seting up the environment, we download the pre-trained bert.

In [67]:
!pip install pytorch-pretrained-bert



## Imports

In [0]:
import csv
import random
import torch
import torch.nn as nn
import numpy as np
import pandas as pd

from torch.utils.data import DataLoader, RandomSampler, SequentialSampler,TensorDataset, random_split
from pytorch_pretrained_bert import BertTokenizer, BertModel
from pytorch_pretrained_bert.modeling import BertForMultipleChoice, BertConfig, WEIGHTS_NAME, CONFIG_NAME
from pytorch_pretrained_bert.optimization import BertAdam

import logging
#from typing import Iterable
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

##Variables

In [69]:
MAX_SEQ_LENGTH = 100 # The max length of any token sequence
TRAIN_BATCH_SIZE = 4
VAL_BATCH_SIZE = 2
LR = 5e-5
NUM_EPOCHS = 5
WARMUP_PROPORTION = 0.1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'Computation will be on {device}')
#Set random seed
random.seed(12)
np.random.seed(12)
torch.manual_seed(12);
#WEIGHT_DECAY

Computation will be on cuda


# Load Data

Getting the actual questionid, question and options that in the selected problem type.

In [70]:
!mkdir -p data
!wget -q -P data 'https://raw.githubusercontent.com/Nydhal/nlp/master/ARC-multihop.csv'
input_file = 'data/ARC-multihop.csv'
!head $input_file

questionID,question,A,B,C,D,AnswerKey
Mercury_412775,A solution with a pH of 2 can be increased to a pH above 7 by adding , an acid. , water. , a base. , hydrogen.,C
Mercury_7221743,Which of the following elements is the least electrically conductive? , sodium , tungsten , zinc , argon,D
NCEOGA_2013_8_59,Which food provides the most energy for the body in the shortest amount of time? , potato , meat , milk , fruit,D
Mercury_7084210,"A researcher examines a marine organism that is the size of an average human hand. Without more information, which statement about the organism is most likely accurate? ", It is mobile. , It has organ systems. , It is made of many cells. , It makes its own food.,C
Mercury_7227833,What structure can be found in both a virus and a cell?  , nucleic acid chain , Golgi apparatus , endoplasmic reticulum , nuclear membrane,A
Mercury_7200165,Mr. Jenkins's class is studying sex chromosomes. He tells his students that the nuclei of human cells have 22 pairs of 

<br>


1. Getting the full ARC dataset with answers and storing them in local environment directory data.
2. Getting  the IR results of the question set selected, questionid and the choices, and storing them in local environment directory data
<br>

In [71]:
!wget -q -P data 'https://raw.githubusercontent.com/tanishkasingh9/Bert-ARC-Challenge/master/ARC-All-answers_columns.csv'
input_fileN = 'data/ARC-All-answers_columns.csv'
!head $input_fileN
!wget -q -P data 'https://raw.githubusercontent.com/tanishkasingh9/Bert-ARC-Challenge/master/retrieval_resMH.csv'
input_fileIN = 'data/retrieval_resMH.csv'
!head $input_fileIN

questionID,category,sets,question,A,B,C,D,AnswerKey
TIMSS_1995_8_I15,Train,easy,"Maria collected the gas given off by a glowing piece of charcoal. The gas was then bubbled through a small amount of colorless limewater. Part of Maria's report stated, ""After the gas was put into the jar, the limewater gradually changed to a milky white color."" This statement is ", an observation , a conclusion , a generalization , an assumption of the investigation ,A
TIMSS_2007_8_pg53,Test,chal,"Sally placed electrodes into a beaker containing a solution and connected the electrodes to a batter. Part of Sally's report stated ""Bubbles were given off at one of the electrodes."" The statement is ", an observation , a prediction , a conclusion , a theory ,A
NYSEDREGENTS_2013_8_12,Test,easy,A cell's chromosomes contain , genes , chlorophyll , sperm , eggs,A
NYSEDREGENTS_2010_8_12,Test,easy,Cancer is most often the result of , abnormal cell division , natural selection , bacterial infection , biologica

In [72]:
df_corpus = pd.read_csv('data/retrieval_resMH.csv', index_col='questionID')
df_corpus.head()

Unnamed: 0_level_0,question,info
questionID,Unnamed: 1_level_1,Unnamed: 2_level_1
Mercury_412775,solution pH 2 increased pH 7 adding,Add an aqueous solution of pH = 2. Add 30% NaO...
Mercury_7221743,following elements least electrically conductive?,and (2) restate its published rates into at le...
NCEOGA_2013_8_59,food provides energy body shortest amount time?,To provide your body with energy and nutrients...
Mercury_7084210,researcher examines marine organism size avera...,"TEM is used mainly for organic research, like ..."
Mercury_7227833,structure found virus cell?,Cell and Virus Structure - Virus and cell memb...


# Data Tokenization and tensor dataset creation 

In [0]:
class ArcInstanceTrain(object):
    """"A single training/test example for the ARC dataset."""
    def __init__(self,
                 question_id: str,
                 category: str,
                 question: str,
                 choice_0: str,
                 choice_1: str,
                 choice_2: str,
                 choice_3: str,
                 answer_key: str = None) -> None:
        self.question_id = question_id
        self.category = category
        self.question = question
        self.choices = [
          choice_0,
          choice_1,
          choice_2,
          choice_3,
        ]
        self.label_map = {'A' : 0, 'B' : 1, 'C' : 2, 'D' : 3}
        self.label = self.label_map[answer_key]

    def __str__(self):
        return self.__repr__()

    def __repr__(self):
        l = [
            "question_id: {}".format(self.question_id),
            "category: {}".format(self.category),
            "question: {}".format(self.question),
            "choice_0: {}".format(self.choices[0]),
            "choice_1: {}".format(self.choices[1]),
            "choice_2: {}".format(self.choices[2]),
            "choice_3: {}".format(self.choices[3]),
        ]

        if self.label is not None:
            l.append("label: {}".format(self.label))

        return ", ".join(l)
      
class ArcInstance(object):
    """"For the challenge Multihop QA with context."""
    def __init__(self,
                 question_id: str,
                 context_sentence: str,
                 question: str,
                 choice_0: str,
                 choice_1: str,
                 choice_2: str,
                 choice_3: str,
                 answer_key: str = None) -> None:
        self.question_id = question_id

        self.context_sentence = context_sentence
        self.question = question
        self.choices = [
          choice_0,
          choice_1,
          choice_2,
          choice_3,
        ]
        self.label_map = {'A' : 0, 'B' : 1, 'C' : 2, 'D' : 3}
        self.label = self.label_map[answer_key]

    def __str__(self):
        return self.__repr__()

    def __repr__(self):
        l = [
            "question_id: {}".format(self.question_id),
            "context_sentence: {}".format(self.context_sentence),
            "question: {}".format(self.question),
            "choice_0: {}".format(self.choices[0]),
            "choice_1: {}".format(self.choices[1]),
            "choice_2: {}".format(self.choices[2]),
            "choice_3: {}".format(self.choices[3]),
        ]

        if self.label is not None:
            l.append("label: {}".format(self.label))

        return ", ".join(l)
      
class InputFeatures(object):
    def __init__(self,
                 question_id,
                 choices_features,
                 label):
        self.question_id = question_id
        self.choices_features = [
            {
                'input_ids': input_ids,
                'input_mask': input_mask,
                'segment_ids': segment_ids
            }
            for _, input_ids, input_mask, segment_ids in choices_features
        ]
        self.label = label
        


In [0]:
def get_context_sentence(question_id):
    """ Retrieve all the IR-sentences relevant to question id """
    return df_corpus.loc[question_id]['info']

def get_arc_instances(input_file):
    arc_instances = []
    with open(input_file) as f:
        reader = csv.DictReader(f)
        for row in reader:
            arc_instance = ArcInstance(question_id=row['questionID'],
                                       context_sentence=get_context_sentence(row['questionID']),
                                        question=row['question'], 
                                        choice_0=row['A'], 
                                        choice_1=row['B'], 
                                        choice_2=row['C'],
                                        choice_3=row['D'],
                                        answer_key=row['AnswerKey']
                                       )
            arc_instances.append(arc_instance)
    return arc_instances

def get_arc_instances_train(input_fileN):
    arc_instances_train = []
    arc_instances_val = []
    with open(input_fileN) as f:
        reader = csv.DictReader(f)
        for row in reader:
            arc_instance_train = ArcInstanceTrain(question_id=row['questionID'],
                                        category = row['category'],
                                        question=row['question'], 
                                        choice_0=row['A'], 
                                        choice_1=row['B'], 
                                        choice_2=row['C'],
                                        choice_3=row['D'],
                                        answer_key=row['AnswerKey']
                                       )
            if arc_instance_train.category == 'Train':
              arc_instances_train.append(arc_instance_train)
            else: 
              arc_instances_val.append(arc_instance_train)
    return arc_instances_train, arc_instances_val

In [0]:
def _truncate_seq_pair(tokens_a, tokens_b, max_seq_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_seq_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()
            
            
def _truncate_seq_pair_train(tokens_b, max_seq_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_b)
        if total_length <= max_seq_length:
            break
        
        tokens_b.pop()


In [0]:
def convert_instances_to_features_train(instances, 
                                 tokenizer, 
                                 max_seq_length):
    features_train = []
    for instance_idx, instance in enumerate(instances):
        #context_sent_tokens = tokenizer.tokenize(instance.context_sentence)
        question_tokens = tokenizer.tokenize(instance.question)
        #category = instance.category
        choices_features = []
        for choice_index, choice in enumerate(instance.choices):

            choice_tokens = question_tokens + tokenizer.tokenize(choice)
            # Modifies `choice_tokens` in
            # place so that the total length is less than the
            # specified length.  Account for [CLS], [SEP], [SEP] with
            # "- 3"
            _truncate_seq_pair_train(choice_tokens, max_seq_length-2)

            tokens = ["[CLS]"] + choice_tokens + ["[SEP]"]
            segment_ids = [1] * (len(choice_tokens) + 2)

            input_ids = tokenizer.convert_tokens_to_ids(tokens)
            input_mask = [1] * len(input_ids)

            # Zero-pad up to the sequence length.
            padding = [0] * (max_seq_length - len(input_ids))
            input_ids += padding
            input_mask += padding
            segment_ids += padding

            assert len(input_ids) == max_seq_length
            assert len(input_mask) == max_seq_length
            assert len(segment_ids) == max_seq_length

            choices_features.append((tokens, input_ids, input_mask, segment_ids))

        label = instance.label
        
        if instance_idx < 5:
            logger.info("*** Example ***")
            logger.info("question_id: {}".format(instance.question_id))
            for choice_idx, (tokens, input_ids, input_mask, segment_ids) in enumerate(choices_features):
                logger.info("choice: {}".format(choice_idx))
                #logger.info("category: {}".format(category))
                logger.info("tokens: {}".format(' '.join(tokens)))
                logger.info("input_ids: {}".format(' '.join(map(str, input_ids))))
                logger.info("input_mask: {}".format(' '.join(map(str, input_mask))))
                logger.info("segment_ids: {}".format(' '.join(map(str, segment_ids))))
            logger.info("label: {}".format(label))
        
        features_train.append(InputFeatures(question_id=instance.question_id,
                                     choices_features=choices_features,
                                     label = label))
    return features_train

In [0]:
def convert_instances_to_features(instances, 
                                 tokenizer, 
                                 max_seq_length):
    features = []
    for instance_idx, instance in enumerate(instances):
        context_sent_tokens = tokenizer.tokenize(instance.context_sentence)
        question_tokens = tokenizer.tokenize(instance.question)

        choices_features = []
        for choice_index, choice in enumerate(instance.choices):
            # We create a copy of the context tokens in order to be
            # able to shrink it according to choice_tokens
            context_tokens_choice = context_sent_tokens.copy()
            choice_tokens = question_tokens + tokenizer.tokenize(choice)
            # Modifies `context_tokens_choice` and `choice_tokens` in
            # place so that the total length is less than the
            # specified length.  Account for [CLS], [SEP], [SEP] with
            # "- 3"
            _truncate_seq_pair(context_tokens_choice, choice_tokens, max_seq_length -3)
            tokens = ["[CLS]"] + context_tokens_choice + ["[SEP]"] + choice_tokens + ["[SEP]"]
            segment_ids = [0] * (len(context_tokens_choice) + 2) + [1] * (len(choice_tokens) + 1)

            input_ids = tokenizer.convert_tokens_to_ids(tokens)
            input_mask = [1] * len(input_ids)

            # Zero-pad up to the sequence length.
            padding = [0] * (max_seq_length - len(input_ids))
            input_ids += padding
            input_mask += padding
            segment_ids += padding

            assert len(input_ids) == max_seq_length
            assert len(input_mask) == max_seq_length
            assert len(segment_ids) == max_seq_length

            choices_features.append((tokens, input_ids, input_mask, segment_ids))

        label = instance.label
        
        if instance_idx < 5:
            logger.info("*** Example ***")
            logger.info("question_id: {}".format(instance.question_id))
            for choice_idx, (tokens, input_ids, input_mask, segment_ids) in enumerate(choices_features):
                logger.info("choice: {}".format(choice_idx))
                logger.info("tokens: {}".format(' '.join(tokens)))
                logger.info("input_ids: {}".format(' '.join(map(str, input_ids))))
                logger.info("input_mask: {}".format(' '.join(map(str, input_mask))))
                logger.info("segment_ids: {}".format(' '.join(map(str, segment_ids))))
            logger.info("label: {}".format(label))
        
        features.append(InputFeatures(question_id=instance.question_id,
                                     choices_features=choices_features,
                                     label = label))
    return features

In [0]:
def create_dataset(feats):
    all_input_ids = torch.tensor(select_field(feats, 'input_ids'), dtype=torch.long)
    all_input_mask = torch.tensor(select_field(feats, 'input_mask'), dtype=torch.long)
    all_segment_ids = torch.tensor(select_field(feats, 'segment_ids'), dtype=torch.long)
    all_label = torch.tensor([f.label for f in feats], dtype=torch.long)
    train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label)
    return train_data

def select_field(features, field):
    return [
        [
            choice[field]
            for choice in feature.choices_features
        ]
        for feature in features
]

# Splitting Dataset into train and eval

In [0]:
def get_train_val_size(dataset_sz, train_ratio=0.7):
    train_size = int(dataset_sz * train_ratio)
    val_size = len(arc_instances) - train_size
    print(f'Train Size : {train_size} ; Validation Size : {val_size}')
    return [train_size, val_size]

In [80]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
arc_instances_train, arc_instances_val= get_arc_instances_train(input_fileN)
feats_train = convert_instances_to_features_train(arc_instances_train, tokenizer=tokenizer,max_seq_length = 100)
feats_val = convert_instances_to_features_train(arc_instances_val, tokenizer=tokenizer,max_seq_length = 100)
dataset_train = create_dataset(feats_train)
dataset_val = create_dataset(feats_val)
arc_instance_eval = get_arc_instances(input_file)

INFO:pytorch_pretrained_bert.tokenization:loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /root/.pytorch_pretrained_bert/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
INFO:__main__:*** Example ***
INFO:__main__:question_id: TIMSS_1995_8_I15
INFO:__main__:choice: 0
INFO:__main__:tokens: [CLS] maria collected the gas given off by a glowing piece of charcoal . the gas was then bubble ##d through a small amount of color ##less lime ##water . part of maria ' s report stated , " after the gas was put into the jar , the lime ##water gradually changed to a milky white color . " this statement is an observation [SEP]
INFO:__main__:input_ids: 101 3814 5067 1996 3806 2445 2125 2011 1037 10156 3538 1997 18872 1012 1996 3806 2001 2059 11957 2094 2083 1037 2235 3815 1997 3609 3238 14123 5880 1012 2112 1997 3814 1005 1055 3189 3090 1010 1000 2044 19

In [81]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
arc_instances = get_arc_instances(input_file=input_file)
feats = convert_instances_to_features(arc_instances, tokenizer=tokenizer, max_seq_length=100)
dataset = create_dataset(feats)

INFO:pytorch_pretrained_bert.tokenization:loading vocabulary file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at /root/.pytorch_pretrained_bert/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
INFO:__main__:*** Example ***
INFO:__main__:question_id: Mercury_412775
INFO:__main__:choice: 0
INFO:__main__:tokens: [CLS] add an a ##que ##ous solution of ph = 2 . add 30 % na ##oh solution and adjust to ph 7 . c ) add an a ##que ##ous solution of ph 2 . label test tubes as follows : a - ph 2 , b - ph 3 , c - ph 5 , d - ph 7 , e - ph 8 , f - ph 9 , g - ph 12 , h [SEP] a solution with a ph of 2 can be increased to a ph above 7 by adding an acid . [SEP]
INFO:__main__:input_ids: 101 5587 2019 1037 4226 3560 5576 1997 6887 1027 1016 1012 5587 2382 1003 6583 11631 5576 1998 14171 2000 6887 1021 1012 1039 1007 5587 2019 1037 4226 3560 5576 1997 6887 1016 1012 3830 3231 10868

In [0]:
train_sampler = RandomSampler(dataset_train)
train_dataloader = DataLoader(dataset_train, batch_size=TRAIN_BATCH_SIZE, sampler=train_sampler)
train_sampler = RandomSampler(dataset_val)
val_dataloader = DataLoader(dataset_val, batch_size=TRAIN_BATCH_SIZE, sampler=train_sampler)

In [83]:
#Uncomment for testing eval
#eval_sampler = SequentialSampler(dataset)
#eval_dataloader = DataLoader(dataset, batch_size=VAL_BATCH_SIZE, sampler=eval_sampler)

mp_train_dataset, mp_val_dataset =  random_split(dataset, get_train_val_size(len(arc_instances)))
train_sampler = RandomSampler(mp_train_dataset)
mp_train_dataloader = DataLoader(mp_train_dataset, batch_size=TRAIN_BATCH_SIZE, sampler=train_sampler)

val_sampler = SequentialSampler(mp_val_dataset)
mp_val_dataloader = DataLoader(mp_val_dataset, batch_size=VAL_BATCH_SIZE, sampler=val_sampler)

Train Size : 58 ; Validation Size : 26


# Model Training and Evaluation

## Common

In [0]:
def get_optimizer(model,train_dataset):
    # Prepare optimizer
    num_train_optimization_steps = int(len(train_dataset) / TRAIN_BATCH_SIZE ) * NUM_EPOCHS
    param_optimizer = list(model.named_parameters())

    # hack to remove pooler, which is not used
    # thus it produce None grad that break apex
    param_optimizer = [n for n in param_optimizer if 'pooler' not in n[0]]

    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    optimizer_grouped_parameters = [
        {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
        {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]
    optimizer = BertAdam(optimizer_grouped_parameters, lr=LR)
    return optimizer

In [0]:
def accuracy(out, labels):
    outputs = np.argmax(out, axis=1)
    return np.sum(outputs == labels)

## Benchmark Model

Here we are using a pretrained BERT model without any training as our Benchmark model. 

In [0]:
def eval_model(model, dataloader):
    model.eval()
    model.to(device)
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    for batch in dataloader:
        batch = tuple(t.to(device) for t in batch)
        input_ids, input_mask, segment_ids, label_ids = batch
        with torch.no_grad():
            tmp_eval_loss = model(input_ids, segment_ids, input_mask, label_ids)
            logits = model(input_ids, segment_ids, input_mask)
        logits = logits.detach().cpu().numpy()
        label_ids = label_ids.to('cpu').numpy()
        tmp_eval_accuracy = accuracy(logits, label_ids)
        
        eval_loss += tmp_eval_loss.mean().item()
        eval_accuracy += tmp_eval_accuracy

        nb_eval_examples += input_ids.size(0)
        nb_eval_steps += 1

    eval_loss = eval_loss / nb_eval_steps
    eval_accuracy = eval_accuracy / nb_eval_examples
    print(f'Eval accuracy {eval_accuracy:.3f} Eval loss : {eval_loss:.3f} val size : {nb_eval_examples}')

<br>
<br>
For evaluation on the base line uncomment the following lines of code and run

In [0]:
#Uncomment for running eval
#bert_model = BertForMultipleChoice.from_pretrained('bert-base-uncased', cache_dir='bert_pretrained_model', num_choices=4)
#eval_model(bert_model, eval_dataloader)

In [0]:
#Let's try to free up some space
#del bert_model
#del eval_dataloader

## Model Training

In [0]:
def train_model(model, train_dataloader, val_dataloader, optimizer, num_epochs=10):
    model.to(device)
    for epoch in range(num_epochs):
        model.train()
        running_loss, running_corrects = 0, 0   
        nb_tr_steps, nb_tr_examples = 0, 0
        for step, batch in enumerate(train_dataloader):
            # zero the parameter gradients
            optimizer.zero_grad()
            batch = tuple(t.to(device) for t in batch)
            input_ids, input_mask, segment_ids, label_ids = batch
            loss = model(input_ids, segment_ids, input_mask, label_ids)
            logits = model(input_ids, segment_ids, input_mask)
            logits = logits.detach().cpu().numpy()
            label_ids = label_ids.to('cpu').numpy()
            tmp_accuracy = accuracy(logits, label_ids)
            
            running_loss += loss.item()
            running_corrects += tmp_accuracy
            loss.backward()
            optimizer.step()
        epoch_loss = running_loss / len(train_dataloader.dataset)
        epoch_accuracy = running_corrects / len(train_dataloader.dataset)

        print(f'Epoch {epoch} acc : {epoch_accuracy:.3f} loss : {epoch_loss:.3f} ') #' accuracy : {epoch_accuracy:.3f}')

In [0]:
def train_val_model(model, train_dataloader, val_dataloader, optimizer, num_epochs=10):
    model.to(device)
    for epoch in range(num_epochs):
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()
                dataloader = train_dataloader
            else :
                model.eval()
                dataloader = val_dataloader
            running_loss, running_corrects = 0, 0  
            num_steps, num_examples = 0, 0
            for step, batch in enumerate(train_dataloader):
                # zero the parameter gradients
                optimizer.zero_grad()
                batch = tuple(t.to(device) for t in batch)
                input_ids, input_mask, segment_ids, label_ids = batch
                with torch.set_grad_enabled(phase == 'train'):
                    loss = model(input_ids, segment_ids, input_mask, label_ids)
                    logits = model(input_ids, segment_ids, input_mask)
                logits = logits.detach().cpu().numpy()
                label_ids = label_ids.to('cpu').numpy()
                corrects = accuracy(logits, label_ids)

                running_loss += loss.mean().item()
                running_corrects += corrects
                
                num_examples += input_ids.size(0)
                num_steps += 1
                if phase == 'train':
                    loss.backward()
                    optimizer.step()
            epoch_loss = running_loss / num_steps
            epoch_accuracy = running_corrects / num_examples

            print(f'Epoch {epoch} {phase} acc : {epoch_accuracy:.3f} loss : {epoch_loss:.3f} ')

<br>
<br>

For training we are not separating the easy and challenge problem sets as we want to stimulate real world results.

# Pre-Training without corpus

In [91]:
bert_model = BertForMultipleChoice.from_pretrained('bert-base-uncased', cache_dir='bert_pretrained_model', num_choices=4)
optimizer = get_optimizer(bert_model,dataset_train)
model = train_val_model(bert_model, train_dataloader=train_dataloader, val_dataloader=val_dataloader, optimizer=optimizer, num_epochs=NUM_EPOCHS)

INFO:pytorch_pretrained_bert.modeling:loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz from cache at bert_pretrained_model/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
INFO:pytorch_pretrained_bert.modeling:extracting archive file bert_pretrained_model/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba to temp dir /tmp/tmph5aah31c
INFO:pytorch_pretrained_bert.modeling:Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

INFO:pytorch_pretrained_bert.modeling:Weights of BertForMultipleChoice not initialized fro

Epoch 0 train acc : 0.254 loss : 1.395 
Epoch 0 val acc : 0.241 loss : 1.386 
Epoch 1 train acc : 0.250 loss : 1.391 
Epoch 1 val acc : 0.256 loss : 1.386 
Epoch 2 train acc : 0.242 loss : 1.392 
Epoch 2 val acc : 0.261 loss : 1.386 
Epoch 3 train acc : 0.247 loss : 1.392 
Epoch 3 val acc : 0.277 loss : 1.386 
Epoch 4 train acc : 0.253 loss : 1.389 
Epoch 4 val acc : 0.237 loss : 1.386 


<br>
<br>
As we can see from the parameters of the bert Base model, the size of the vocabulary is 30522, which is too high for our problem statement. This arises a need to train with corpus. Due to unavailability of corpus knowledge base for the whole train and test data, we only use it for the problem set. 

In [108]:
optimizer = get_optimizer(bert_model,dataset_train)
train_val_model(bert_model, train_dataloader=mp_train_dataloader, val_dataloader=mp_val_dataloader, optimizer=optimizer, num_epochs=5)

Epoch 0 train acc : 0.259 loss : 1.411 
Epoch 0 val acc : 0.172 loss : 1.386 
Epoch 1 train acc : 0.172 loss : 1.399 
Epoch 1 val acc : 0.310 loss : 1.386 
Epoch 2 train acc : 0.259 loss : 1.409 
Epoch 2 val acc : 0.155 loss : 1.386 
Epoch 3 train acc : 0.207 loss : 1.424 
Epoch 3 val acc : 0.259 loss : 1.386 
Epoch 4 train acc : 0.241 loss : 1.378 
Epoch 4 val acc : 0.207 loss : 1.386 


In [109]:
eval_model(bert_model, mp_val_dataloader)

Eval accuracy 0.346 Eval loss : 1.386 val size : 26


# Experiment results


|Experiment details                     | Train Acc | Train loss | Val Acc | Val loss | Num Epochs |
|--------------------------------------------|:---------------:|:--------------:|:------------:|:------------:|:--------------------:|
|Full validation without corpus | 0.248 | 1.389| 0.239 | 1.386|5
|Problem subset with corpus | 0.241 | 1.378| 0.207 | 1.386|5
<br>

|Experiment details                     | Eval Acc | Eval loss | Val Size | Num Epochs |
|--------------------------------------------|:---------------:|:--------------:|:------------:|:------------:|:--------------------:|
|Evalutation  | 0346 | 1.386|26|5

# Conclusion

These are the conclusions that we have reached after our testing :

1. A benchmark model that uses BERT 
2. The BERT needs to be pre-trained on corpus which we havent done, we didnt have the knowledge base for the ARC Dataset
3. A better way of feeding in the data, explore training on just the corpus first and then the queries


What are things that might increase performance of the mode :
1. More data 
2. A more comprehensive corpus