# Commonsense QA: Fairseq Implementation

EECS 595 Final Project, Task 1: Commonsense QA

* Team ID: 2
* Credit: Ziqiao Ma, Qingyi Chen
* Last update: 2020.12.16
* Validation Accuracy: 77.4%


CommonsenseQA was proposed by Talmor et al. (2019). As a question answering benchmark, it presents a natural language question $Q$ of $m$ tokens $\{q_1,q_2,\cdots,q_m\}$ together with 5 choices $\{a_1,a_2,\cdots,a_5\}$ labeled $\{A,B,\cdots,E\}$. Notably, the questions do not come with an inference basis of their own, so the model cannot rely on supporting evidence: it needs broad commonsense knowledge and strong reasoning ability to make the right choice.
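As a minimal sketch of the setup described above, the snippet below turns one record into five (question, choice) pairs and recovers the index of the gold answer. The field names (`question`, `stem`, `choices`, `text`, `answerKey`) and the `Q: ` / `A: ` prefixes match the evaluation cell later in this notebook; the record itself and the helper `build_pairs` are hypothetical and only serve as illustration.

```python
import json

# Hypothetical record, invented for illustration. The field names follow the
# JSON Lines layout read by the evaluation cell later in this notebook.
line = json.dumps({
    "answerKey": "B",
    "question": {
        "stem": "Where would you put a plate after washing it?",
        "choices": [
            {"label": "A", "text": "table"},
            {"label": "B", "text": "cupboard"},
            {"label": "C", "text": "floor"},
            {"label": "D", "text": "garden"},
            {"label": "E", "text": "car"},
        ],
    },
})


def build_pairs(example):
    """Return one (question, answer) string pair per choice (hypothetical helper)."""
    stem = example["question"]["stem"]
    return [("Q: " + stem, "A: " + choice["text"])
            for choice in example["question"]["choices"]]


example = json.loads(line)
pairs = build_pairs(example)

# Map the answer letter A..E to a class index 0..4.
gold_index = ord(example["answerKey"]) - ord("A")

for (question, answer) in pairs:
    print(question, "|", answer)
print("gold index:", gold_index)
```

Each pair is scored independently by the model, and the choice with the highest score is taken as the prediction.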

This notebook fine-tunes the RoBERTa model on the CommonsenseQA task using Fairseq.

## Dependency Installation and File Preparation

This section installs dependencies such as fairseq and loads the CommonsenseQA dataset.

In [None]:
# Check that a GPU is available for training and mount Google Drive
import torch
if torch.cuda.is_available():
    print("Using the GPU!")

from google.colab import drive
drive.mount('/content/drive')


Using the GPU!
Mounted at /content/drive


Unpack the CommonsenseQA dataset and install fairseq.


In [None]:
%cd /content/
!unzip CommonsenseQA.zip

/content
Archive:  CommonsenseQA.zip
   creating: CommonsenseQA/
  inflating: __MACOSX/._CommonsenseQA  
  inflating: CommonsenseQA/.DS_Store  
  inflating: __MACOSX/CommonsenseQA/._.DS_Store  
  inflating: CommonsenseQA/finetune-arc-web-open-atomic.sh  
  inflating: __MACOSX/CommonsenseQA/._finetune-arc-web-open-atomic.sh  
   creating: CommonsenseQA/fairseq/
  inflating: __MACOSX/CommonsenseQA/._fairseq  
  inflating: CommonsenseQA/finetune.sh  
  inflating: __MACOSX/CommonsenseQA/._finetune.sh  
  inflating: CommonsenseQA/wrong_preds.tsv  
  inflating: __MACOSX/CommonsenseQA/._wrong_preds.tsv  
  inflating: CommonsenseQA/finetune-web-arc-cn.sh  
  inflating: __MACOSX/CommonsenseQA/._finetune-web-arc-cn.sh  
  inflating: CommonsenseQA/wrong_preds.jsonl  
  inflating: __MACOSX/CommonsenseQA/._wrong_preds.jsonl  
  inflating: CommonsenseQA/convert_jsonl2tsv.py  
  inflating: __MACOSX/CommonsenseQA/._convert_jsonl2tsv.py  
   creating: CommonsenseQA/data/
  inflating: __MACOSX/Commonsen

In [None]:
!pip install fairseq

Collecting fairseq
  Downloading https://files.pythonhosted.org/packages/2c/da/7c7032988dade3b21ccfd5b226e50b382abfd3459129d67240bb004506ae/fairseq-0.10.1-cp36-cp36m-manylinux1_x86_64.whl (1.7MB)
Collecting hydra-core
  Downloading https://files.pythonhosted.org/packages/f0/1f/7f502b9e37596164111655861370b08626f46f9e4524433c354f472765d4/hydra_core-1.0.4-py3-none-any.whl (122kB)
Collecting sacrebleu>=1.4.12
  Downloading https://files.pythonhosted.org/packages/a3/c4/8e948f601a4f9609e8b2b58f31966cb13cf17b940b82aa3e767f01c42c52/sacrebleu-1.4.14-py3-none-any.whl (64kB)
Collecting antlr4-python3-runtime==4.8
  Downloading https://files.pythonhosted.org/packages/56/02/789a0bddf9c9b31b14c3e79ec22b9656185a803dc31c15f006f9855ece0d/antlr4-python3-runtime-4.8.tar.gz (112kB)

Download RoBERTa.

In [None]:
# Download and unpack the pretrained RoBERTa-large model
!wget -O roberta.large.tar.gz https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz
!tar -xvzf roberta.large.tar.gz

--2020-12-14 03:06:49--  https://dl.fbaipublicfiles.com/fairseq/models/roberta.large.tar.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 104.22.74.142, 172.67.9.4, 104.22.75.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|104.22.74.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 655283069 (625M) [application/gzip]
Saving to: ‘roberta.large.tar.gz’


2020-12-14 03:07:17 (23.1 MB/s) - ‘roberta.large.tar.gz’ saved [655283069/655283069]

roberta.large/
roberta.large/dict.txt
roberta.large/model.pt
roberta.large/NOTE


## Preprocessing

This section prepares the data and arguments for training.

In [None]:
%cd /content/CommonsenseQA/

/content/CommonsenseQA


Compose the arguments for training.

In [None]:
%%writefile finetune.sh
#!/bin/bash

## Fine-tuning script, written to a bash file.
# Changes from the original script so that it runs on Google AI Platform and Colab:
# - Set MAX_SENTENCES=8
# - Added --update-freq 4

MAX_UPDATES=3000      # Number of training steps.
WARMUP_UPDATES=150    # Linearly increase LR over this many steps.
LR=1e-05              # Peak LR for polynomial LR scheduler.
MAX_SENTENCES=8       # Batch size.
SEED=23               # Random seed.

BASEDIR=/content
CQA_PATH=$BASEDIR/CommonsenseQA 
ROBERTA_PATH=${BASEDIR}/roberta.large/model.pt
DATA_DIR=${CQA_PATH}/data/CommonsenseQA

# we use the --user-dir option to load the task from
# the examples/roberta/commonsense_qa directory:
FAIRSEQ_PATH=${CQA_PATH}/fairseq
FAIRSEQ_USER_DIR=${FAIRSEQ_PATH}/examples/roberta/commonsense_qa

cd $FAIRSEQ_PATH
CUDA_VISIBLE_DEVICES=0 fairseq-train --fp16 --ddp-backend=no_c10d \
    $DATA_DIR \
    --update-freq 4 \
    --save-dir ./checkpoints \
    --user-dir $FAIRSEQ_USER_DIR \
    --restore-file $ROBERTA_PATH \
    --reset-optimizer --reset-dataloader --reset-meters \
    --no-epoch-checkpoints --no-last-checkpoints --no-save-optimizer-state \
    --best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
    --task commonsense_qa --init-token 0 --bpe gpt2 \
    --arch roberta_large --max-positions 512 \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --criterion sentence_ranking --num-classes 5 \
    --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-06 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $LR \
    --warmup-updates $WARMUP_UPDATES --total-num-update $MAX_UPDATES \
    --batch-size $MAX_SENTENCES \
    --max-update $MAX_UPDATES \
    --log-format simple --log-interval 25 \
    --seed $SEED

Overwriting finetune.sh
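
The two changes noted in the script interact: with `MAX_SENTENCES=8` and `--update-freq 4`, gradients are accumulated over 4 batches of 8, so each optimizer step sees 32 examples on the single GPU. The sketch below computes that figure and traces the learning rate implied by `WARMUP_UPDATES`, `MAX_UPDATES`, and `LR`. It is a plain-Python approximation, not the fairseq scheduler itself: the decay power of 1.0 and the final learning rate of 0.0 are assumptions, since the script does not set them.

```python
# Values copied from finetune.sh.
MAX_UPDATES = 3000
WARMUP_UPDATES = 150
PEAK_LR = 1e-05
MAX_SENTENCES = 8
UPDATE_FREQ = 4

# Gradient accumulation: examples contributing to one optimizer step.
effective_batch_size = MAX_SENTENCES * UPDATE_FREQ


def learning_rate(step, power=1.0, end_lr=0.0):
    """Approximate LR at a given update (power and end_lr are assumed defaults)."""
    if step <= WARMUP_UPDATES:
        # Linear warmup from 0 to the peak LR.
        return PEAK_LR * step / WARMUP_UPDATES
    if step >= MAX_UPDATES:
        return end_lr
    # Polynomial decay from the peak LR down to end_lr.
    remaining = 1.0 - (step - WARMUP_UPDATES) / (MAX_UPDATES - WARMUP_UPDATES)
    return (PEAK_LR - end_lr) * remaining ** power + end_lr


print("effective batch size:", effective_batch_size)
for step in (0, 75, 150, 1575, 3000):
    print(step, learning_rate(step))
```

Under these assumptions the learning rate peaks at update 150 and falls linearly to zero by update 3000.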


## Training

This section trains the model.

In [None]:
!bash finetune.sh

2020-12-14 03:22:47 | INFO | fairseq_cli.train | Namespace(activation_dropout=0.0, activation_fn='gelu', adam_betas='(0.9, 0.98)', adam_eps=1e-06, all_gather_list_size=16384, arch='roberta_large', attention_dropout=0.1, batch_size=8, batch_size_valid=8, best_checkpoint_metric='accuracy', bf16=False, bpe='gpt2', broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=0.0, cpu=False, criterion='sentence_ranking', curriculum=0, data='/content/CommonsenseQA/data/CommonsenseQA', data_buffer_size=10, dataset_impl=None, ddp_backend='no_c10d', device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=1, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', dropout=0.1, empty_cache_freq=0, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_ffn_embed_dim=4096, encoder_layerdrop=0, encoder_layers=24, encoder_layers_t
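
The log above shows `criterion='sentence_ranking'`, set in `finetune.sh` together with `--num-classes 5`. A plausible reading is that the model produces one score per (question, choice) pair and the loss is a softmax cross-entropy over the five scores, so the gold choice is pushed above the other four. The sketch below illustrates that idea in plain Python; the function `ranking_loss` and the scores are hypothetical, and this is not the fairseq implementation.

```python
import math


def ranking_loss(scores, gold_index):
    """Softmax cross-entropy over per-choice scores (illustrative sketch)."""
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_score = max(scores)
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    # Negative log-probability of the gold choice.
    return log_norm - scores[gold_index]


# Hypothetical scores for the five choices A..E.
scores = [0.2, 2.5, -1.0, 0.3, 0.1]

loss_correct = ranking_loss(scores, 1)   # gold choice has the highest score
loss_wrong = ranking_loss(scores, 2)     # gold choice has the lowest score
uniform_loss = ranking_loss([0.0] * 5, 0)  # no preference between choices

print(loss_correct, loss_wrong, uniform_loss)
```

The loss is small when the gold choice already ranks first and equals $\log 5$ when all five scores are identical.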

## Post-training


Save the trained model to Google Drive.

In [None]:
%cd /content/
%cp -r CommonsenseQA/ drive/My\ Drive/EECS595_final_project/CSQA_Roberta

/content


Check model performance on validation set.

In [None]:
%cd /content/CommonsenseQA/fairseq/examples/roberta/commonsense_qa

import json
import torch
from fairseq.models.roberta import RobertaModel
import commonsense_qa_task  # load the Commonsense QA task

base_dir = '/content/CommonsenseQA'
roberta = RobertaModel.from_pretrained(base_dir + '/fairseq/checkpoints', 'checkpoint_best.pt', base_dir + '/data/CommonsenseQA')
roberta.eval()  # disable dropout

roberta.cuda()  # use the GPU (optional)
nsamples, ncorrect = 0, 0
wrong = []
with open(base_dir + '/data/CommonsenseQA/valid.jsonl') as h:
    print(3)
    for line in h:
        example = json.loads(line)
        scores = []
        for choice in example['question']['choices']:
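            # Encode the (question, choice) pair in the format used for fine-tuning.
            tokens = roberta.encode(
                'Q: ' + example['question']['stem'],
                'A: ' + choice['text'],
                no_separator=True
            )
            score = roberta.predict('sentence_classification_head', tokens, return_logits=True)
            scores.append(score)

        # The highest-scoring choice is the prediction; .item() yields a plain int.
        pred = torch.cat(scores).argmax().item()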
        answer = ord(example['answerKey']) - ord('A')
        nsamples += 1
        if pred == answer:
            ncorrect += 1
        else:
            example['predicted'] = chr(ord('A') + pred)
            example['scores'] = {chr(ord('A') + i): s.data.item() for (i, s) in enumerate(scores)}
            wrong.append(json.dumps(example))


print(f'Accuracy: {ncorrect}/{nsamples} = {ncorrect / float(nsamples)}')

2020-12-14 04:45:33 | INFO | fairseq.file_utils | loading archive file /content/CommonsenseQA/fairseq/checkpoints
2020-12-14 04:45:33 | INFO | fairseq.file_utils | loading archive file /content/CommonsenseQA/data/CommonsenseQA


/content/CommonsenseQA/fairseq/examples/roberta/commonsense_qa
| dictionary: 50265 types


2020-12-14 04:45:47 | INFO | fairseq.models.roberta.model | Namespace(activation_dropout=0.0, activation_fn='gelu', adam_betas='(0.9, 0.98)', adam_eps=1e-06, all_gather_list_size=16384, arch='roberta_large', attention_dropout=0.1, batch_size=8, batch_size_valid=8, best_checkpoint_metric='accuracy', bf16=False, bpe='gpt2', broadcast_buffers=False, bucket_cap_mb=25, checkpoint_shard_count=1, checkpoint_suffix='', clip_norm=0.0, cpu=False, criterion='sentence_ranking', curriculum=0, data='/content/CommonsenseQA/data/CommonsenseQA', data_buffer_size=10, dataset_impl=None, ddp_backend='no_c10d', device_id=0, disable_validation=False, distributed_backend='nccl', distributed_init_method=None, distributed_no_spawn=False, distributed_num_procs=1, distributed_port=-1, distributed_rank=0, distributed_world_size=1, distributed_wrapper='DDP', dropout=0.1, empty_cache_freq=0, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_ffn_embed_dim=4096, encoder_layerdrop=0, encoder_layers=24, encod

3
Accuracy: 945/1221 = 0.773955773955774
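
The evaluation cell collects misclassified examples in the list `wrong` (one JSON string per example, with `predicted` and `scores` added) but does not write them out. A minimal sketch for saving them as JSON Lines for error analysis follows. The helper `save_wrong_predictions`, the temporary output path, and the sample record are hypothetical; the file name `wrong_preds.jsonl` mirrors a file in the unzipped archive, but whether that file was produced this way is an assumption.

```python
import json
import os
import tempfile


def save_wrong_predictions(wrong, path):
    """Write one JSON string per line and return the number of records."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in wrong:
            handle.write(record + "\n")
    return len(wrong)


# Hypothetical stand-in for the `wrong` list built by the evaluation cell.
wrong = [json.dumps({
    "answerKey": "A",
    "predicted": "C",
    "scores": {"A": 0.1, "B": -0.4, "C": 0.9, "D": 0.0, "E": -1.2},
})]

# A temporary directory is used here; in the notebook a path under
# /content/CommonsenseQA would be the natural choice.
out_path = os.path.join(tempfile.mkdtemp(), "wrong_preds.jsonl")
n_written = save_wrong_predictions(wrong, out_path)

# Read the file back to confirm that every line is valid JSON.
with open(out_path, encoding="utf-8") as handle:
    reloaded = [json.loads(line) for line in handle]

print(n_written, reloaded[0]["predicted"])
```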


## Result Report
*   Training Accuracy: 94.4% 
*   Validation Accuracy: 77.4% 


