# Multiple Choice Commonsense Reasoning - SocialIQA

> Fine-tuning Pre-trained Language Models on Multiple Choice Question Answering for Commonsense Reasoning 

> Based on HuggingFace Transformers

> Chaehyeong Kim, CONVEI Lab 

In [1]:
!pip install transformers==4.26.1 evaluate datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.26.1
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.3/6.3 MB[0m [31m57.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.4/81.4 kB[0m [31m6.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m474.6/474.6 kB[0m [31m29.1 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.11.0 (from transformers==4.26.1)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m224.5/224.5 kB[0m [31m4.9 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14

In [2]:
import warnings
warnings.filterwarnings(action='ignore')

In [3]:
import transformers
transformers.logging.set_verbosity_error()

## 1. Load Library

In [4]:
import os
import json
import random
import numpy as np
import pandas as pd

import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModelForMultipleChoice, TrainingArguments, Trainer
from datasets import load_dataset
import evaluate

from tqdm.auto import tqdm

## Set Hyperparameters

In [5]:
DEVICE = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
DEVICE

device(type='cuda')

In [6]:
BASE_DIR = os.getcwd()
OUTPUT_DIR = os.path.join(BASE_DIR, 'checkpoint')
os.makedirs(OUTPUT_DIR, exist_ok=True)

# 과제: [HW5] Multiple choice commonsense reasoning (~ 05/21 23:59)

이번 실습에서는 `social_i_qa` 데이터셋에 대하여 `bert-base` 모델을 학습하고 평가하는 내용을 진행했습니다.  
이에 따라 이번 과제는 `social_i_qa` 및 `commonsense_qa` 데이터셋에 대하여 `roberta-base` 모델을 학습하고 평가하는 것입니다.

```
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model = AutoModelForMultipleChoice.from_pretrained('roberta-base')
datasets = load_dataset('commonsense_qa')
```

- 제출 분량
    - 2 page 이상 4 page 이하
- 제출 형식
    - pdf로 변환하여 learnus에 제출
    - 파일명 = HW5_학번_이름.pdf (ex. HW5_2021321394_김채형.pdf)
- 제출 마감 일시
    - 5월 21일 (일) 23:59

- 제출 내용
  - Q1) 동일한 조건 하에서 SocialIQA와 CommonsenseQA 성능 비교
    - Q1-a) `social_i_qa` 데이터셋에 학습한 `roberta-base` 모델의 성능을 test set에서 평가하세요. 이때 성능 지표로 accuracy 뿐만 아니라 macro precision, recall, f1-score를 모두 포함해야 합니다. (20점)
      - `social_i_qa` 데이터셋의 경우 train/validation 2가지 split만 존재합니다. 따라서 validation split을 test set으로 사용하도록 합니다.
    - Q1-b) `commonsense_qa` 데이터셋에 대해 학습한 `robert-base` 모델의 성능을 test set에서 평가하세요. 이때 성능 지표로 accuracy 뿐만 아니라 macro precision, recall, f1-score를 모두 포함해야 합니다. (20점)
      - `commonsense_qa` 데이터셋의 경우 train/validation/test 3가지 split이 존재합니다. 따라서 test split을 test set으로 사용하시면 됩니다.
    - Q1-c) 만약 SocialIQA와 CommonsenseQA에서의 성능에 차이가 존재한다면, 그 이유에 대해 서술하세요. (20점)
  - Q2) COMET inference를 활용하여 SocialIQA에서의 성능 개선
    - [`socialiqa` 폴더](https://drive.google.com/drive/folders/17WyMnrNvKKeMatPp4U4AypMlrVMZHR9n?usp=sharing)에는 train/dev(=test) set에 대한 COMET inference (context와 question이 주어졌을 때 생성한 inference) 가 annotation 되어있는 데이터가 relation type 별로 존재합니다. 이를 활용하여 SocialIQA에서의 성능 향상을 이끌어내세요.
    - Q2-a) 실험 환경 (e.g. 학습에 사용한 COMET inference의 종류, 각종 hyperparameters, ...) 을 설명하고, 그러한 세팅을 선택한 이유에 대해 서술하세요. (20점)
    - Q2-b) 실험 결과 (accuracy, macro precision, recall, f1-score) 를 보고하고, 성능이 항샹된 이유에 대하여 서술하세요. (20점)


In [7]:
MODEL_NAME = 'roberta-base'

## Load Tokenizer and Model

In [8]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMultipleChoice.from_pretrained(MODEL_NAME).to(DEVICE)

Downloading (…)lve/main/config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/501M [00:00<?, ?B/s]

In [9]:
tokenizer.special_tokens_map

{'bos_token': '<s>',
 'eos_token': '</s>',
 'unk_token': '<unk>',
 'sep_token': '</s>',
 'pad_token': '<pad>',
 'cls_token': '<s>',
 'mask_token': '<mask>'}

## 3. Load Data

### SocialIQA + COMET Inference Dataset Generation

In [11]:
from google.colab import drive
import os

drive.mount('/content/drive', force_remount=True)
os.chdir('drive/MyDrive/연세대학교 2학년 2학기 (2023-1)/CSI4121 Big Data/HW5')

Mounted at /content/drive


In [None]:
from datasets import load_dataset

train = load_dataset('socialiqa', data_files='train.json')
train_effect = load_dataset('socialiqa', data_files='train-comet-atomic-2020-xEffect.json')
train_intent = load_dataset('socialiqa', data_files='train-comet-atomic-2020-xIntent.json')
train_need = load_dataset('socialiqa', data_files='train-comet-atomic-2020-xNeed.json')
train_react = load_dataset('socialiqa', data_files='train-comet-atomic-2020-xReact.json')
train_want = load_dataset('socialiqa', data_files='train-comet-atomic-2020-xWant.json')

dev = load_dataset('socialiqa', data_files='dev.json')
dev_effect = load_dataset('socialiqa', data_files='dev-comet-atomic-2020-xEffect.json')
dev_intent = load_dataset('socialiqa', data_files='dev-comet-atomic-2020-xIntent.json')
dev_need = load_dataset('socialiqa', data_files='dev-comet-atomic-2020-xNeed.json')
dev_react = load_dataset('socialiqa', data_files='dev-comet-atomic-2020-xReact.json')
dev_want = load_dataset('socialiqa', data_files='dev-comet-atomic-2020-xWant.json')

Downloading and preparing dataset json/socialiqa to /root/.cache/huggingface/datasets/json/socialiqa-afa85fdd5bd9aa05/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/socialiqa-afa85fdd5bd9aa05/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset json/socialiqa to /root/.cache/huggingface/datasets/json/socialiqa-10c8565c94505450/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/socialiqa-10c8565c94505450/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset json/socialiqa to /root/.cache/huggingface/datasets/json/socialiqa-d5d572c41c649d12/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/socialiqa-d5d572c41c649d12/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset json/socialiqa to /root/.cache/huggingface/datasets/json/socialiqa-22e9339bf8c0d510/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/socialiqa-22e9339bf8c0d510/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset json/socialiqa to /root/.cache/huggingface/datasets/json/socialiqa-898a24eae212d051/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/socialiqa-898a24eae212d051/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset json/socialiqa to /root/.cache/huggingface/datasets/json/socialiqa-505a4741fa4b77e8/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/socialiqa-505a4741fa4b77e8/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset json/socialiqa to /root/.cache/huggingface/datasets/json/socialiqa-a861bc782257bde5/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/socialiqa-a861bc782257bde5/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset json/socialiqa to /root/.cache/huggingface/datasets/json/socialiqa-ade67a3109a86fae/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/socialiqa-ade67a3109a86fae/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset json/socialiqa to /root/.cache/huggingface/datasets/json/socialiqa-8a50da924ff44a41/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/socialiqa-8a50da924ff44a41/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset json/socialiqa to /root/.cache/huggingface/datasets/json/socialiqa-5c2a8f0f7949bc21/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/socialiqa-5c2a8f0f7949bc21/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset json/socialiqa to /root/.cache/huggingface/datasets/json/socialiqa-e3745c3caefa5ff7/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/socialiqa-e3745c3caefa5ff7/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

Downloading and preparing dataset json/socialiqa to /root/.cache/huggingface/datasets/json/socialiqa-526e14145f4c0c06/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/socialiqa-526e14145f4c0c06/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
print(train)

DatasetDict({
    train: Dataset({
        features: ['context', 'choices', 'question', 'answer', 'comet_effect', 'comet_intent', 'comet_need', 'comet_react', 'comet_want'],
        num_rows: 33410
    })
})


In [None]:
train['train'] = train['train'].add_column('comet_effect', train_effect['train']['comet'])
train['train'] = train['train'].add_column('comet_intent', train_intent['train']['comet'])
train['train'] = train['train'].add_column('comet_need', train_need['train']['comet'])
train['train'] = train['train'].add_column('comet_react', train_react['train']['comet'])
train['train'] = train['train'].add_column('comet_want', train_want['train']['comet'])

dev['train'] = dev['train'].add_column('comet_effect', dev_effect['train']['comet'])
dev['train'] = dev['train'].add_column('comet_intent', dev_intent['train']['comet'])
dev['train'] = dev['train'].add_column('comet_need', dev_need['train']['comet'])
dev['train'] = dev['train'].add_column('comet_react', dev_react['train']['comet'])
dev['train'] = dev['train'].add_column('comet_want', dev_want['train']['comet'])

In [None]:
train['train'].to_json('train_comet.json')
dev['train'].to_json('dev_comet.json')

Creating json from Arrow format:   0%|          | 0/34 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

863171

### Select Dataset(SocialIQA or CommonsenseQA)

In [10]:
dataset_name = 'social_i_qa'
# dataset_name = 'commonsense_qa'
dataset = load_dataset(dataset_name)

Downloading builder script:   0%|          | 0.00/4.67k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.19k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/6.81k [00:00<?, ?B/s]

Downloading and preparing dataset social_i_qa/default to /root/.cache/huggingface/datasets/social_i_qa/default/0.1.0/674d85e42ac7430d3dcd4de7007feaffcb1527c535121e09bab2803fbcc925f8...


Downloading data:   0%|          | 0.00/2.20M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/33410 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1954 [00:00<?, ? examples/s]

Dataset social_i_qa downloaded and prepared to /root/.cache/huggingface/datasets/social_i_qa/default/0.1.0/674d85e42ac7430d3dcd4de7007feaffcb1527c535121e09bab2803fbcc925f8. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [11]:
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['context', 'question', 'answerA', 'answerB', 'answerC', 'label'],
        num_rows: 33410
    })
    validation: Dataset({
        features: ['context', 'question', 'answerA', 'answerB', 'answerC', 'label'],
        num_rows: 1954
    })
})


In [13]:
from datasets import concatenate_datasets

dataset = concatenate_datasets([dataset['train'], dataset['validation']])
dataset = dataset.train_test_split(test_size=1954)
test_dataset = dataset['test']
dataset = dataset['train'].train_test_split(test_size=0.3)
train_dataset = dataset['train']
valid_dataset = dataset['test']

### Select Dataset(SocialIQA with COMET Inference)

In [10]:
from google.colab import drive
import os

drive.mount('/content/drive', force_remount=True)
os.chdir('drive/MyDrive/연세대학교 2학년 2학기 (2023-1)/CSI4121 Big Data/HW5')

Mounted at /content/drive


In [None]:
from datasets import load_dataset

train_dataset = load_dataset('.', data_files='train_comet.json', split='train[:70%]')
valid_dataset = load_dataset('.', data_files='train_comet.json', split='train[70%:]')
test_dataset = load_dataset('.', data_files='dev_comet.json')['train']

In [13]:
from datasets import concatenate_datasets, load_dataset

dataset1 = load_dataset('.', data_files='train_comet.json')['train']
dataset2 = load_dataset('.', data_files='dev_comet.json')['train']
dataset = concatenate_datasets([dataset1, dataset2])

dataset = dataset.train_test_split(test_size=1954)
test_dataset = dataset['test']
dataset = dataset['train'].train_test_split(test_size=0.3)
train_dataset = dataset['train']
valid_dataset = dataset['test']



  0%|          | 0/1 [00:00<?, ?it/s]



  0%|          | 0/1 [00:00<?, ?it/s]

### Check Dataset

In [14]:
print(len(train_dataset))
print(len(valid_dataset))
print(len(test_dataset))

23387
10023
1954


In [15]:
train_dataset[0]

{'context': 'Jan wanted to propose to their partner.  Jan finally asked the question.',
 'question': 'How would you describe Jan?',
 'answerA': 'loved by their partner',
 'answerB': 'okay with casual relationships',
 'answerC': 'committed',
 'label': '3'}

In [16]:
from datasets import ClassLabel
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [17]:
show_random_elements(train_dataset)

Unnamed: 0,context,question,answerA,answerB,answerC,label
0,Sasha settled among Bailey's neighborhood and thus saw Bailey nearly every day.,Why did Sasha do this?,wanted to punish Bailey,wanted to avoid Bailey,wanted to be close to her friend,3
1,Cameron spoke to Bailey well because he liked her.,How would Cameron feel afterwards?,very loving,hateful,spiteful,1
2,Jesse changed the oil on the atv under his garage.,What does Jesse need to do before this?,remove the cap,kick his dog,have fun,1
3,Addison was having her birthday party in the afternoon and nobody came.,How would Addison feel afterwards?,someone with poor party planning skills,rejected,happy,2
4,Sydney always went first because they gave the team the best shot to win.,Why did Sydney do this?,make the team fail,play,help the team win,3
5,Lee gave a homeless many some money to get some food for himself at Subway.,What will happen to Lee?,The others will ask Lee for money,run away,be thanked,3
6,Jan got a letter from a college they applied for. Jan was afraid to read it so Carson read it to Jan. Jan got into college.,What will Jan want to do next?,start to pack their things and buy college supplies,cry over not getting in and sulk being upset,congratulate jan,1
7,Baileys friends didn't want to watch the same movie as Bailey so she used trickery to deceive them into picking out the movie she wanted .,What will Bailey want to do next?,talk about the movie,avoid watching a movie she didn't like,see the movie,3
8,Kendall surprised Ash's wife with a card for her birthday.,How would you describe Kendall?,unfriendly,Like they hope they like it,affectionate,3
9,Addison got all of the ingredients and made cookies for the party.,How would you describe Addison?,avoiding the party,unskilled and unable,an experienced baker,3


## Preprocess Data

### SocialIQA

In [18]:
answer_names = ['answerA', 'answerB', 'answerC']
num_choices = len(answer_names)

In [19]:
def preprocess_function(examples):
  num_examples = len(examples['context'])
  # Repeat each first sentence three times (=num_choices) to go with the three possibilities of second sentences.
  first_sentences = [[examples['context'][i] + tokenizer.sep_token + examples['question'][i]] * num_choices for i in range(num_examples)]
  second_sentences = [[examples[answer][i] for answer in answer_names] for i in range(num_examples)]
  # Flatten everything
  first_sentences = sum(first_sentences, [])
  second_sentences = sum(second_sentences, [])
  # Tokenize
  # 토크나이저 결과값 input id랑 attention mask인데, 이거 어떻게 돌아가는지 확인좀 해보장.
  tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
  # Un-flatten
  return {k: [v[i : i + num_choices] for i in range(0, len(v), num_choices)] for k, v in tokenized_examples.items()}

### CommonsenseQA

In [None]:
train_dataset = train_dataset.rename_column('answerKey', "label")
valid_dataset = valid_dataset.rename_column('answerKey', "label")
test_dataset = test_dataset.rename_column('answerKey', "label")

In [None]:
def convertLabel(examples):
  examples['label'] = [ord(label) - ord('A') + 1 for label in examples['label']]
  return examples

In [None]:
train_dataset = train_dataset.map(convertLabel, batched=True)
valid_dataset = valid_dataset.map(convertLabel, batched=True)
test_dataset = test_dataset.map(convertLabel, batched=True)

Map:   0%|          | 0/6819 [00:00<?, ? examples/s]

Map:   0%|          | 0/2922 [00:00<?, ? examples/s]

Map:   0%|          | 0/1221 [00:00<?, ? examples/s]

In [None]:
answer_names = ['A', 'B', 'C', 'D', 'E']
num_choices = len(answer_names)

In [None]:
def preprocess_function(examples):
  num_examples = len(examples['question_concept'])
  # Repeat each first sentence three times (=num_choices) to go with the three possibilities of second sentences.
  first_sentences = [[examples['question_concept'][i] + tokenizer.sep_token + examples['question'][i]] * num_choices for i in range(num_examples)]
  second_sentences = [examples['choices'][i]['text'] for i in range(num_examples)]
  # Flatten everything
  first_sentences = sum(first_sentences, [])
  second_sentences = sum(second_sentences, [])
  # Tokenize
  # 토크나이저 결과값 input id랑 attention mask인데, 이거 어떻게 돌아가는지 확인좀 해보장.
  tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
  # Un-flatten
  return {k: [v[i : i + num_choices] for i in range(0, len(v), num_choices)] for k, v in tokenized_examples.items()}

### SocialIQA with COMET Inference

In [18]:
train_dataset = train_dataset.rename_column('answer', "label")
valid_dataset = valid_dataset.rename_column('answer', "label")
test_dataset = test_dataset.rename_column('answer', "label")

In [19]:
def convertLabel(examples):
  examples['label'] = [ord(label) - ord('A') + 1 for label in examples['label']]
  return examples

In [20]:
train_dataset = train_dataset.map(convertLabel, batched=True)
valid_dataset = valid_dataset.map(convertLabel, batched=True)
test_dataset = test_dataset.map(convertLabel, batched=True)

Map:   0%|          | 0/23387 [00:00<?, ? examples/s]

Map:   0%|          | 0/10023 [00:00<?, ? examples/s]

Map:   0%|          | 0/1954 [00:00<?, ? examples/s]

In [21]:
num_choices = 3

In [22]:
def preprocess_function(examples):
  num_examples = len(examples['context'])
  # Repeat each first sentence three times (=num_choices) to go with the three possibilities of second sentences.
  first_sentences = [['Effect:' + examples['comet_effect'][i] + tokenizer.sep_token
                      + 'Intent:' + examples['comet_intent'][i] + tokenizer.sep_token
                      + 'Need:' + examples['comet_need'][i] + tokenizer.sep_token
                      + 'Reaction:' + examples['comet_react'][i] + tokenizer.sep_token
                      + 'Want:' + examples['comet_want'][i] + tokenizer.sep_token
                      + examples['context'][i] + tokenizer.sep_token
                      + examples['question'][i]] * num_choices for i in range(num_examples)]
  second_sentences = [[item['text'] for item in examples['choices'][i]] for i in range(num_examples)]
  # Flatten everything
  first_sentences = sum(first_sentences, [])
  second_sentences = sum(second_sentences, [])
  # Tokenize
  # 토크나이저 결과값 input id랑 attention mask인데, 이거 어떻게 돌아가는지 확인좀 해보장.
  tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
  # Un-flatten
  return {k: [v[i : i + num_choices] for i in range(0, len(v), num_choices)] for k, v in tokenized_examples.items()}

### Check Preprocessing

In [20]:
examples = train_dataset[:5]
features = preprocess_function(examples)
# features의 구조
# 일단 딕셔너리임. input_ids랑 token_type_ids랑 attention_mask가 key고 value는 각각 5개의 샘플에 대한 list 인거임.
# ex) 하나의 원본 데이터에서 3가지 샘플 나옴(아래 셀 아웃풋 참고). 그러면 그걸 하나의 어레이로 묶을 수 있음
# 그리고 지금 examples 안에 샘플이 5개 있으니까 원소 3개짜리 어레이가 5개 나옴
# 그걸 묶은 것이 input_ids에 들어가고 나머지 key에 대한 value들도 동일한 구조
print(len(features['input_ids']), len(features['input_ids'][0]), [len(x) for x in features['input_ids'][0]])

5 3 [31, 31, 28]


In [21]:
idx = 0
[tokenizer.decode(features['input_ids'][idx][i]) for i in range(num_choices)]
# 기존 데이터셋에서 하나의 샘플을 이 셀의 실행과 같은 3개의 데이터로 바꾸고
# 각각에 대한 probability를 학습해서 젤 높은걸 답으로 뽑는다
# 우리가 객관식 문제를 풀때 선지간 비교를 활용하는데 그렇게 하지 못한다는게 살짝 한계

['<s>Jan wanted to propose to their partner.  Jan finally asked the question.</s>How would you describe Jan?</s></s>loved by their partner</s>',
 '<s>Jan wanted to propose to their partner.  Jan finally asked the question.</s>How would you describe Jan?</s></s>okay with casual relationships</s>',
 '<s>Jan wanted to propose to their partner.  Jan finally asked the question.</s>How would you describe Jan?</s></s>committed</s>']

In [22]:
tokenized_train_data = train_dataset.map(preprocess_function, batched=True)
tokenized_valid_data = valid_dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/23387 [00:00<?, ? examples/s]

Map:   0%|          | 0/10023 [00:00<?, ? examples/s]

In [23]:
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union

@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = 'label' if 'label' in features[0].keys() else 'labels'
        labels = [int(feature.pop(label_name))-1 for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])

        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        return batch

When called on a list of examples, it will flatten all the input_ids/attention_mask in big lists that it will pass to the `tokenizer.pad` method. This will return a dictionary with big tensors (of shape `(batch_size * num_choices) x seq_length`) that we then unflatten.

In [24]:
accepted_keys = ['input_ids', 'attention_mask', 'label']
features = [{k: v for k, v in tokenized_train_data[i].items() if k in accepted_keys} for i in range(10)]
batch = DataCollatorForMultipleChoice(tokenizer)(features)

In [25]:
[tokenizer.decode(batch['input_ids'][8][i].tolist()) for i in range(num_choices)]

['<s>Bailey came within range and took a shot at the target.</s>How would you describe Bailey?</s></s>uninterested</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>',
 '<s>Bailey came within range and took a shot at the target.</s>How would you describe Bailey?</s></s>uninvolved</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>',
 '<s>Bailey came within range and took a shot at the target.</s>How would you describe Bailey?</s></s>engaged</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>']

## Train Model

In [26]:
# 과제에서 바꿀 필요 있는 부분
accuracy = evaluate.load('accuracy')
precision = evaluate.load('precision')
recall = evaluate.load('recall')
f1 = evaluate.load('f1')

Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.55k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/7.36k [00:00<?, ?B/s]

Downloading builder script:   0%|          | 0.00/6.77k [00:00<?, ?B/s]

In [27]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)  # 확률값 아웃풋 3개 나올텐데 그중 최대를 뽑아옴

    return {'accuracy': accuracy.compute(predictions=predictions, references=labels),
            'precision': precision.compute(predictions=predictions, references=labels, average='macro'),
            'recall': recall.compute(predictions=predictions, references=labels, average='macro'),
            'f1': f1.compute(predictions=predictions, references=labels, average='macro')}

In [28]:
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir = True,
    do_train=True,
    do_eval=True,
    evaluation_strategy='steps',
    save_strategy='steps',
    logging_strategy='steps',
    eval_steps=200,
    save_steps=200,
    logging_steps=20,
    save_total_limit=2,  # 체크포인트 최대 몇개까지 가지고 있을 건지
    per_device_train_batch_size=16,  # 배치 사이즈 train eval 다르게 가능
    per_device_eval_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    max_grad_norm=0.1,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    seed=516,
    logging_first_step=True,
    load_best_model_at_end=True,  # 학습 이후 따로 베스트 체크포인트 로드 안해도 바로 테스팅 해볼 수 있음
    metric_for_best_model='loss',  # 베스트 기준
    greater_is_better=False,  # 그 기준이 높을수록 좋은거임?
    push_to_hub=False,  # huggingface에 올릴건지
)

In [29]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_data,
    eval_dataset=tokenized_valid_data,
    tokenizer=tokenizer,
    data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

In [30]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `RobertaForMultipleChoice.forward` and have been ignored: question, answerB, answerC, answerA, context. If question, answerB, answerC, answerA, context are not expected by `RobertaForMultipleChoice.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 23387
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 2
  Total optimization steps = 2193
  Number of trainable parameters = 124646401


{'loss': 1.0985, 'learning_rate': 4.997720018239854e-05, 'epoch': 0.0}
{'loss': 1.1032, 'learning_rate': 4.954400364797081e-05, 'epoch': 0.03}
{'loss': 1.0966, 'learning_rate': 4.908800729594164e-05, 'epoch': 0.05}
{'loss': 1.0478, 'learning_rate': 4.863201094391245e-05, 'epoch': 0.08}
{'loss': 0.9369, 'learning_rate': 4.8176014591883265e-05, 'epoch': 0.11}
{'loss': 0.9177, 'learning_rate': 4.772001823985408e-05, 'epoch': 0.14}
{'loss': 0.8774, 'learning_rate': 4.72640218878249e-05, 'epoch': 0.16}
{'loss': 0.853, 'learning_rate': 4.680802553579571e-05, 'epoch': 0.19}
{'loss': 0.8506, 'learning_rate': 4.6352029183766534e-05, 'epoch': 0.22}
{'loss': 0.7847, 'learning_rate': 4.5896032831737345e-05, 'epoch': 0.25}


The following columns in the evaluation set don't have a corresponding argument in `RobertaForMultipleChoice.forward` and have been ignored: question, answerB, answerC, answerA, context. If question, answerB, answerC, answerA, context are not expected by `RobertaForMultipleChoice.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 10023
  Batch size = 16


{'loss': 0.8346, 'learning_rate': 4.544003647970816e-05, 'epoch': 0.27}


Trainer is attempting to log a value of "{'accuracy': 0.6739499151950514}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 0.6739643519043002}" of type <class 'dict'> for key "eval/precision" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'recall': 0.6739415132580451}" of type <class 'dict'> for key "eval/recall" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': 0.6739335276469358}" of type <class 'dict'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Saving model checkpoint to /content/checkpoint/checkpoint-200
Configuration saved in /cont

{'eval_loss': 0.7537047266960144, 'eval_accuracy': {'accuracy': 0.6739499151950514}, 'eval_precision': {'precision': 0.6739643519043002}, 'eval_recall': {'recall': 0.6739415132580451}, 'eval_f1': {'f1': 0.6739335276469358}, 'eval_runtime': 25.8983, 'eval_samples_per_second': 387.013, 'eval_steps_per_second': 24.21, 'epoch': 0.27}


Model weights saved in /content/checkpoint/checkpoint-200/pytorch_model.bin
tokenizer config file saved in /content/checkpoint/checkpoint-200/tokenizer_config.json
Special tokens file saved in /content/checkpoint/checkpoint-200/special_tokens_map.json


{'loss': 0.7992, 'learning_rate': 4.498404012767898e-05, 'epoch': 0.3}
{'loss': 0.7546, 'learning_rate': 4.45280437756498e-05, 'epoch': 0.33}
{'loss': 0.7934, 'learning_rate': 4.407204742362061e-05, 'epoch': 0.36}
{'loss': 0.7755, 'learning_rate': 4.361605107159143e-05, 'epoch': 0.38}
{'loss': 0.785, 'learning_rate': 4.316005471956224e-05, 'epoch': 0.41}
{'loss': 0.7973, 'learning_rate': 4.270405836753306e-05, 'epoch': 0.44}
{'loss': 0.7133, 'learning_rate': 4.2248062015503877e-05, 'epoch': 0.47}
{'loss': 0.739, 'learning_rate': 4.1792065663474694e-05, 'epoch': 0.49}
{'loss': 0.764, 'learning_rate': 4.1336069311445504e-05, 'epoch': 0.52}


The following columns in the evaluation set don't have a corresponding argument in `RobertaForMultipleChoice.forward` and have been ignored: question, answerB, answerC, answerA, context. If question, answerB, answerC, answerA, context are not expected by `RobertaForMultipleChoice.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 10023
  Batch size = 16


{'loss': 0.7296, 'learning_rate': 4.088007295941633e-05, 'epoch': 0.55}


Trainer is attempting to log a value of "{'accuracy': 0.7195450463932954}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 0.7195591781418048}" of type <class 'dict'> for key "eval/precision" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'recall': 0.7194764964220591}" of type <class 'dict'> for key "eval/recall" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': 0.7194927024294545}" of type <class 'dict'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Saving model checkpoint to /content/checkpoint/checkpoint-400
Configuration saved in /cont

{'eval_loss': 0.6663312911987305, 'eval_accuracy': {'accuracy': 0.7195450463932954}, 'eval_precision': {'precision': 0.7195591781418048}, 'eval_recall': {'recall': 0.7194764964220591}, 'eval_f1': {'f1': 0.7194927024294545}, 'eval_runtime': 25.6163, 'eval_samples_per_second': 391.274, 'eval_steps_per_second': 24.477, 'epoch': 0.55}


Model weights saved in /content/checkpoint/checkpoint-400/pytorch_model.bin
tokenizer config file saved in /content/checkpoint/checkpoint-400/tokenizer_config.json
Special tokens file saved in /content/checkpoint/checkpoint-400/special_tokens_map.json


{'loss': 0.7551, 'learning_rate': 4.042407660738714e-05, 'epoch': 0.57}
{'loss': 0.7238, 'learning_rate': 3.9968080255357956e-05, 'epoch': 0.6}
{'loss': 0.7011, 'learning_rate': 3.9512083903328774e-05, 'epoch': 0.63}
{'loss': 0.6909, 'learning_rate': 3.905608755129959e-05, 'epoch': 0.66}
{'loss': 0.7135, 'learning_rate': 3.860009119927041e-05, 'epoch': 0.68}
{'loss': 0.6664, 'learning_rate': 3.8144094847241226e-05, 'epoch': 0.71}
{'loss': 0.7085, 'learning_rate': 3.7688098495212036e-05, 'epoch': 0.74}
{'loss': 0.6412, 'learning_rate': 3.7232102143182854e-05, 'epoch': 0.77}
{'loss': 0.6366, 'learning_rate': 3.677610579115367e-05, 'epoch': 0.79}


The following columns in the evaluation set don't have a corresponding argument in `RobertaForMultipleChoice.forward` and have been ignored: question, answerB, answerC, answerA, context. If question, answerB, answerC, answerA, context are not expected by `RobertaForMultipleChoice.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 10023
  Batch size = 16


{'loss': 0.7235, 'learning_rate': 3.632010943912449e-05, 'epoch': 0.82}


Trainer is attempting to log a value of "{'accuracy': 0.7363064950613589}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 0.7363007226498257}" of type <class 'dict'> for key "eval/precision" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'recall': 0.7363028087270055}" of type <class 'dict'> for key "eval/recall" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': 0.7363009033919385}" of type <class 'dict'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Saving model checkpoint to /content/checkpoint/checkpoint-600
Configuration saved in /cont

{'eval_loss': 0.6481437087059021, 'eval_accuracy': {'accuracy': 0.7363064950613589}, 'eval_precision': {'precision': 0.7363007226498257}, 'eval_recall': {'recall': 0.7363028087270055}, 'eval_f1': {'f1': 0.7363009033919385}, 'eval_runtime': 25.5896, 'eval_samples_per_second': 391.682, 'eval_steps_per_second': 24.502, 'epoch': 0.82}


Model weights saved in /content/checkpoint/checkpoint-600/pytorch_model.bin
tokenizer config file saved in /content/checkpoint/checkpoint-600/tokenizer_config.json
Special tokens file saved in /content/checkpoint/checkpoint-600/special_tokens_map.json
Deleting older checkpoint [/content/checkpoint/checkpoint-200] due to args.save_total_limit


{'loss': 0.6269, 'learning_rate': 3.5864113087095306e-05, 'epoch': 0.85}
{'loss': 0.7428, 'learning_rate': 3.540811673506612e-05, 'epoch': 0.88}
{'loss': 0.6659, 'learning_rate': 3.4952120383036933e-05, 'epoch': 0.9}
{'loss': 0.6752, 'learning_rate': 3.449612403100775e-05, 'epoch': 0.93}
{'loss': 0.6424, 'learning_rate': 3.404012767897857e-05, 'epoch': 0.96}
{'loss': 0.6245, 'learning_rate': 3.3584131326949385e-05, 'epoch': 0.98}
{'loss': 0.5859, 'learning_rate': 3.31281349749202e-05, 'epoch': 1.01}
{'loss': 0.5149, 'learning_rate': 3.267213862289102e-05, 'epoch': 1.04}
{'loss': 0.4906, 'learning_rate': 3.221614227086183e-05, 'epoch': 1.07}


The following columns in the evaluation set don't have a corresponding argument in `RobertaForMultipleChoice.forward` and have been ignored: question, answerB, answerC, answerA, context. If question, answerB, answerC, answerA, context are not expected by `RobertaForMultipleChoice.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 10023
  Batch size = 16


{'loss': 0.5091, 'learning_rate': 3.176014591883265e-05, 'epoch': 1.09}


Trainer is attempting to log a value of "{'accuracy': 0.7429911204230271}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 0.7429398294426793}" of type <class 'dict'> for key "eval/precision" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'recall': 0.7429256194289012}" of type <class 'dict'> for key "eval/recall" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': 0.7429261493124922}" of type <class 'dict'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Saving model checkpoint to /content/checkpoint/checkpoint-800
Configuration saved in /cont

{'eval_loss': 0.6889124512672424, 'eval_accuracy': {'accuracy': 0.7429911204230271}, 'eval_precision': {'precision': 0.7429398294426793}, 'eval_recall': {'recall': 0.7429256194289012}, 'eval_f1': {'f1': 0.7429261493124922}, 'eval_runtime': 25.506, 'eval_samples_per_second': 392.966, 'eval_steps_per_second': 24.582, 'epoch': 1.09}


Model weights saved in /content/checkpoint/checkpoint-800/pytorch_model.bin
tokenizer config file saved in /content/checkpoint/checkpoint-800/tokenizer_config.json
Special tokens file saved in /content/checkpoint/checkpoint-800/special_tokens_map.json
Deleting older checkpoint [/content/checkpoint/checkpoint-400] due to args.save_total_limit


{'loss': 0.5091, 'learning_rate': 3.1304149566803465e-05, 'epoch': 1.12}
{'loss': 0.4434, 'learning_rate': 3.084815321477428e-05, 'epoch': 1.15}
{'loss': 0.5042, 'learning_rate': 3.0392156862745097e-05, 'epoch': 1.18}
{'loss': 0.5145, 'learning_rate': 2.9936160510715917e-05, 'epoch': 1.2}
{'loss': 0.5196, 'learning_rate': 2.9480164158686728e-05, 'epoch': 1.23}
{'loss': 0.5056, 'learning_rate': 2.902416780665755e-05, 'epoch': 1.26}
{'loss': 0.5028, 'learning_rate': 2.8568171454628362e-05, 'epoch': 1.29}
{'loss': 0.529, 'learning_rate': 2.811217510259918e-05, 'epoch': 1.31}
{'loss': 0.5151, 'learning_rate': 2.7656178750569994e-05, 'epoch': 1.34}


The following columns in the evaluation set don't have a corresponding argument in `RobertaForMultipleChoice.forward` and have been ignored: question, answerB, answerC, answerA, context. If question, answerB, answerC, answerA, context are not expected by `RobertaForMultipleChoice.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 10023
  Batch size = 16


{'loss': 0.4632, 'learning_rate': 2.7200182398540814e-05, 'epoch': 1.37}


Trainer is attempting to log a value of "{'accuracy': 0.755961289035219}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 0.7559461209829851}" of type <class 'dict'> for key "eval/precision" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'recall': 0.7559559730086836}" of type <class 'dict'> for key "eval/recall" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': 0.7559453123727841}" of type <class 'dict'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Saving model checkpoint to /content/checkpoint/checkpoint-1000
Configuration saved in /cont

{'eval_loss': 0.6576991081237793, 'eval_accuracy': {'accuracy': 0.755961289035219}, 'eval_precision': {'precision': 0.7559461209829851}, 'eval_recall': {'recall': 0.7559559730086836}, 'eval_f1': {'f1': 0.7559453123727841}, 'eval_runtime': 25.4944, 'eval_samples_per_second': 393.145, 'eval_steps_per_second': 24.594, 'epoch': 1.37}


Model weights saved in /content/checkpoint/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in /content/checkpoint/checkpoint-1000/tokenizer_config.json
Special tokens file saved in /content/checkpoint/checkpoint-1000/special_tokens_map.json
Deleting older checkpoint [/content/checkpoint/checkpoint-800] due to args.save_total_limit


{'loss': 0.5316, 'learning_rate': 2.674418604651163e-05, 'epoch': 1.4}
{'loss': 0.4825, 'learning_rate': 2.6288189694482446e-05, 'epoch': 1.42}
{'loss': 0.5035, 'learning_rate': 2.583219334245326e-05, 'epoch': 1.45}
{'loss': 0.5199, 'learning_rate': 2.5376196990424077e-05, 'epoch': 1.48}
{'loss': 0.5208, 'learning_rate': 2.4920200638394894e-05, 'epoch': 1.5}
{'loss': 0.5582, 'learning_rate': 2.446420428636571e-05, 'epoch': 1.53}
{'loss': 0.4981, 'learning_rate': 2.4008207934336525e-05, 'epoch': 1.56}
{'loss': 0.526, 'learning_rate': 2.3552211582307343e-05, 'epoch': 1.59}
{'loss': 0.4769, 'learning_rate': 2.309621523027816e-05, 'epoch': 1.61}


The following columns in the evaluation set don't have a corresponding argument in `RobertaForMultipleChoice.forward` and have been ignored: question, answerB, answerC, answerA, context. If question, answerB, answerC, answerA, context are not expected by `RobertaForMultipleChoice.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 10023
  Batch size = 16


{'loss': 0.4604, 'learning_rate': 2.2640218878248974e-05, 'epoch': 1.64}


Trainer is attempting to log a value of "{'accuracy': 0.7654394891748977}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 0.7654170871906989}" of type <class 'dict'> for key "eval/precision" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'recall': 0.7654252640478229}" of type <class 'dict'> for key "eval/recall" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': 0.7654189233394212}" of type <class 'dict'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Saving model checkpoint to /content/checkpoint/checkpoint-1200
Configuration saved in /con

{'eval_loss': 0.641149640083313, 'eval_accuracy': {'accuracy': 0.7654394891748977}, 'eval_precision': {'precision': 0.7654170871906989}, 'eval_recall': {'recall': 0.7654252640478229}, 'eval_f1': {'f1': 0.7654189233394212}, 'eval_runtime': 25.186, 'eval_samples_per_second': 397.96, 'eval_steps_per_second': 24.895, 'epoch': 1.64}


Model weights saved in /content/checkpoint/checkpoint-1200/pytorch_model.bin
tokenizer config file saved in /content/checkpoint/checkpoint-1200/tokenizer_config.json
Special tokens file saved in /content/checkpoint/checkpoint-1200/special_tokens_map.json
Deleting older checkpoint [/content/checkpoint/checkpoint-600] due to args.save_total_limit


{'loss': 0.555, 'learning_rate': 2.218422252621979e-05, 'epoch': 1.67}
{'loss': 0.4757, 'learning_rate': 2.172822617419061e-05, 'epoch': 1.7}
{'loss': 0.5653, 'learning_rate': 2.1272229822161423e-05, 'epoch': 1.72}
{'loss': 0.5105, 'learning_rate': 2.081623347013224e-05, 'epoch': 1.75}
{'loss': 0.5274, 'learning_rate': 2.0360237118103057e-05, 'epoch': 1.78}
{'loss': 0.4848, 'learning_rate': 1.990424076607387e-05, 'epoch': 1.81}
{'loss': 0.4792, 'learning_rate': 1.944824441404469e-05, 'epoch': 1.83}
{'loss': 0.4822, 'learning_rate': 1.8992248062015506e-05, 'epoch': 1.86}
{'loss': 0.4668, 'learning_rate': 1.853625170998632e-05, 'epoch': 1.89}


The following columns in the evaluation set don't have a corresponding argument in `RobertaForMultipleChoice.forward` and have been ignored: question, answerB, answerC, answerA, context. If question, answerB, answerC, answerA, context are not expected by `RobertaForMultipleChoice.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 10023
  Batch size = 16


{'loss': 0.5036, 'learning_rate': 1.8080255357957137e-05, 'epoch': 1.92}


Trainer is attempting to log a value of "{'accuracy': 0.7601516512022348}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 0.7601601480971141}" of type <class 'dict'> for key "eval/precision" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'recall': 0.7601731938450526}" of type <class 'dict'> for key "eval/recall" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': 0.7601474910853737}" of type <class 'dict'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Saving model checkpoint to /content/checkpoint/checkpoint-1400
Configuration saved in /con

{'eval_loss': 0.631726861000061, 'eval_accuracy': {'accuracy': 0.7601516512022348}, 'eval_precision': {'precision': 0.7601601480971141}, 'eval_recall': {'recall': 0.7601731938450526}, 'eval_f1': {'f1': 0.7601474910853737}, 'eval_runtime': 25.4126, 'eval_samples_per_second': 394.411, 'eval_steps_per_second': 24.673, 'epoch': 1.92}


Model weights saved in /content/checkpoint/checkpoint-1400/pytorch_model.bin
tokenizer config file saved in /content/checkpoint/checkpoint-1400/tokenizer_config.json
Special tokens file saved in /content/checkpoint/checkpoint-1400/special_tokens_map.json
Deleting older checkpoint [/content/checkpoint/checkpoint-1000] due to args.save_total_limit


{'loss': 0.5617, 'learning_rate': 1.7624259005927954e-05, 'epoch': 1.94}
{'loss': 0.4647, 'learning_rate': 1.716826265389877e-05, 'epoch': 1.97}
{'loss': 0.4814, 'learning_rate': 1.6712266301869586e-05, 'epoch': 2.0}
{'loss': 0.3053, 'learning_rate': 1.6256269949840403e-05, 'epoch': 2.02}
{'loss': 0.3379, 'learning_rate': 1.580027359781122e-05, 'epoch': 2.05}
{'loss': 0.2936, 'learning_rate': 1.5344277245782034e-05, 'epoch': 2.08}
{'loss': 0.3464, 'learning_rate': 1.4888280893752852e-05, 'epoch': 2.11}
{'loss': 0.3171, 'learning_rate': 1.4432284541723667e-05, 'epoch': 2.13}
{'loss': 0.3068, 'learning_rate': 1.3976288189694483e-05, 'epoch': 2.16}


The following columns in the evaluation set don't have a corresponding argument in `RobertaForMultipleChoice.forward` and have been ignored: question, answerB, answerC, answerA, context. If question, answerB, answerC, answerA, context are not expected by `RobertaForMultipleChoice.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 10023
  Batch size = 16


{'loss': 0.3224, 'learning_rate': 1.35202918376653e-05, 'epoch': 2.19}


Trainer is attempting to log a value of "{'accuracy': 0.7645415544248229}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 0.7645265745735402}" of type <class 'dict'> for key "eval/precision" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'recall': 0.7645415266509344}" of type <class 'dict'> for key "eval/recall" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': 0.7645281871355077}" of type <class 'dict'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Saving model checkpoint to /content/checkpoint/checkpoint-1600
Configuration saved in /con

{'eval_loss': 0.7431764602661133, 'eval_accuracy': {'accuracy': 0.7645415544248229}, 'eval_precision': {'precision': 0.7645265745735402}, 'eval_recall': {'recall': 0.7645415266509344}, 'eval_f1': {'f1': 0.7645281871355077}, 'eval_runtime': 25.5961, 'eval_samples_per_second': 391.583, 'eval_steps_per_second': 24.496, 'epoch': 2.19}


Model weights saved in /content/checkpoint/checkpoint-1600/pytorch_model.bin
tokenizer config file saved in /content/checkpoint/checkpoint-1600/tokenizer_config.json
Special tokens file saved in /content/checkpoint/checkpoint-1600/special_tokens_map.json
Deleting older checkpoint [/content/checkpoint/checkpoint-1200] due to args.save_total_limit


{'loss': 0.2817, 'learning_rate': 1.3064295485636116e-05, 'epoch': 2.22}
{'loss': 0.2818, 'learning_rate': 1.2608299133606931e-05, 'epoch': 2.24}
{'loss': 0.2822, 'learning_rate': 1.2152302781577749e-05, 'epoch': 2.27}
{'loss': 0.3002, 'learning_rate': 1.1696306429548564e-05, 'epoch': 2.3}
{'loss': 0.3062, 'learning_rate': 1.1240310077519382e-05, 'epoch': 2.33}
{'loss': 0.3304, 'learning_rate': 1.0784313725490197e-05, 'epoch': 2.35}
{'loss': 0.2918, 'learning_rate': 1.0328317373461013e-05, 'epoch': 2.38}
{'loss': 0.2857, 'learning_rate': 9.87232102143183e-06, 'epoch': 2.41}
{'loss': 0.2887, 'learning_rate': 9.416324669402646e-06, 'epoch': 2.44}


The following columns in the evaluation set don't have a corresponding argument in `RobertaForMultipleChoice.forward` and have been ignored: question, answerB, answerC, answerA, context. If question, answerB, answerC, answerA, context are not expected by `RobertaForMultipleChoice.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 10023
  Batch size = 16


{'loss': 0.2877, 'learning_rate': 8.960328317373462e-06, 'epoch': 2.46}


Trainer is attempting to log a value of "{'accuracy': 0.7691309987029832}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 0.769112937423008}" of type <class 'dict'> for key "eval/precision" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'recall': 0.7691034444752353}" of type <class 'dict'> for key "eval/recall" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': 0.7691074537306185}" of type <class 'dict'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Saving model checkpoint to /content/checkpoint/checkpoint-1800
Configuration saved in /cont

{'eval_loss': 0.7274134755134583, 'eval_accuracy': {'accuracy': 0.7691309987029832}, 'eval_precision': {'precision': 0.769112937423008}, 'eval_recall': {'recall': 0.7691034444752353}, 'eval_f1': {'f1': 0.7691074537306185}, 'eval_runtime': 25.3637, 'eval_samples_per_second': 395.17, 'eval_steps_per_second': 24.72, 'epoch': 2.46}


Model weights saved in /content/checkpoint/checkpoint-1800/pytorch_model.bin
tokenizer config file saved in /content/checkpoint/checkpoint-1800/tokenizer_config.json
Special tokens file saved in /content/checkpoint/checkpoint-1800/special_tokens_map.json
Deleting older checkpoint [/content/checkpoint/checkpoint-1600] due to args.save_total_limit


{'loss': 0.2893, 'learning_rate': 8.504331965344279e-06, 'epoch': 2.49}
{'loss': 0.2864, 'learning_rate': 8.048335613315095e-06, 'epoch': 2.52}
{'loss': 0.3015, 'learning_rate': 7.592339261285911e-06, 'epoch': 2.54}
{'loss': 0.2962, 'learning_rate': 7.136342909256727e-06, 'epoch': 2.57}
{'loss': 0.2679, 'learning_rate': 6.680346557227543e-06, 'epoch': 2.6}
{'loss': 0.2756, 'learning_rate': 6.224350205198359e-06, 'epoch': 2.63}
{'loss': 0.3746, 'learning_rate': 5.768353853169175e-06, 'epoch': 2.65}
{'loss': 0.2746, 'learning_rate': 5.312357501139991e-06, 'epoch': 2.68}
{'loss': 0.2739, 'learning_rate': 4.856361149110807e-06, 'epoch': 2.71}


The following columns in the evaluation set don't have a corresponding argument in `RobertaForMultipleChoice.forward` and have been ignored: question, answerB, answerC, answerA, context. If question, answerB, answerC, answerA, context are not expected by `RobertaForMultipleChoice.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 10023
  Batch size = 16


{'loss': 0.3717, 'learning_rate': 4.400364797081624e-06, 'epoch': 2.74}


Trainer is attempting to log a value of "{'accuracy': 0.7708270976753467}" of type <class 'dict'> for key "eval/accuracy" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'precision': 0.7708075477166202}" of type <class 'dict'> for key "eval/precision" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'recall': 0.7708012562351967}" of type <class 'dict'> for key "eval/recall" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Trainer is attempting to log a value of "{'f1': 0.7708035564663996}" of type <class 'dict'> for key "eval/f1" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect so we dropped this attribute.
Saving model checkpoint to /content/checkpoint/checkpoint-2000
Configuration saved in /con

{'eval_loss': 0.7382590770721436, 'eval_accuracy': {'accuracy': 0.7708270976753467}, 'eval_precision': {'precision': 0.7708075477166202}, 'eval_recall': {'recall': 0.7708012562351967}, 'eval_f1': {'f1': 0.7708035564663996}, 'eval_runtime': 25.528, 'eval_samples_per_second': 392.628, 'eval_steps_per_second': 24.561, 'epoch': 2.74}


Model weights saved in /content/checkpoint/checkpoint-2000/pytorch_model.bin
tokenizer config file saved in /content/checkpoint/checkpoint-2000/tokenizer_config.json
Special tokens file saved in /content/checkpoint/checkpoint-2000/special_tokens_map.json
Deleting older checkpoint [/content/checkpoint/checkpoint-1800] due to args.save_total_limit


{'loss': 0.2903, 'learning_rate': 3.9443684450524395e-06, 'epoch': 2.76}
{'loss': 0.3001, 'learning_rate': 3.488372093023256e-06, 'epoch': 2.79}
{'loss': 0.3122, 'learning_rate': 3.032375740994072e-06, 'epoch': 2.82}
{'loss': 0.3321, 'learning_rate': 2.5763793889648885e-06, 'epoch': 2.85}
{'loss': 0.2628, 'learning_rate': 2.1203830369357045e-06, 'epoch': 2.87}
{'loss': 0.2982, 'learning_rate': 1.6643866849065208e-06, 'epoch': 2.9}
{'loss': 0.2272, 'learning_rate': 1.208390332877337e-06, 'epoch': 2.93}
{'loss': 0.2384, 'learning_rate': 7.523939808481532e-07, 'epoch': 2.95}
{'loss': 0.2777, 'learning_rate': 2.963976288189695e-07, 'epoch': 2.98}




Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from /content/checkpoint/checkpoint-1400 (score: 0.631726861000061).


{'train_runtime': 1027.6263, 'train_samples_per_second': 68.275, 'train_steps_per_second': 2.134, 'train_loss': 0.524611472539906, 'epoch': 3.0}


TrainOutput(global_step=2193, training_loss=0.524611472539906, metrics={'train_runtime': 1027.6263, 'train_samples_per_second': 68.275, 'train_steps_per_second': 2.134, 'train_loss': 0.524611472539906, 'epoch': 3.0})

## Evaluate Model

### SocialIQA

In [31]:
model.eval()
test_gths = []; test_preds = []
with torch.no_grad(): 
    # 지금 batch 안해놔서 그냥 패딩없이 올림
    for idx, example in tqdm(enumerate(test_dataset)):
        context_question = example['context'] + tokenizer.sep_token + example['question']
        inputs = tokenizer([[context_question, example['answerA']], [context_question, example['answerB']], [context_question, example['answerC']]], padding=True, return_tensors='pt').to(DEVICE)
        labels = int(example['label']) - 1
        outputs = model(**{k: v.unsqueeze(0) for k, v in inputs.items()})
        logits = outputs.logits
        preds = logits.argmax().item()
        test_gths.append(labels)
        test_preds.append(preds)
        torch.cuda.empty_cache()

0it [00:00, ?it/s]

### CommonsenseQA

In [None]:
model.eval()
test_gths = []; test_preds = []
with torch.no_grad(): 
    # 지금 batch 안해놔서 그냥 패딩없이 올림
    for idx, example in tqdm(enumerate(test_dataset)):
        context_question = example['question_concept'] + tokenizer.sep_token + example['question']
        inputs = tokenizer([[context_question, example['choices']['text'][i]] for i in range(num_choices)], padding=True, return_tensors='pt').to(DEVICE)
        labels = int(example['label']) - 1
        outputs = model(**{k: v.unsqueeze(0) for k, v in inputs.items()})
        logits = outputs.logits
        preds = logits.argmax().item()
        test_gths.append(labels)
        test_preds.append(preds)
        torch.cuda.empty_cache()

0it [00:00, ?it/s]

### SocialIQA with COMET Inference

In [34]:
model.eval()
test_gths = []; test_preds = []
with torch.no_grad(): 
    # 지금 batch 안해놔서 그냥 패딩없이 올림
    for idx, example in tqdm(enumerate(test_dataset)):
        context_question = 'Effect:' + example['comet_effect'] + tokenizer.sep_token \
                           + 'Intent:' + example['comet_intent'] + tokenizer.sep_token \
                           + 'Need:' + example['comet_need'] + tokenizer.sep_token \
                           + 'Reaction:' + example['comet_react'] + tokenizer.sep_token \
                           + 'Want:' + example['comet_want'] + tokenizer.sep_token \
                           + example['context'] + tokenizer.sep_token \
                           + example['question']
        inputs = tokenizer([[context_question, item['text']] for item in example['choices']], padding=True, return_tensors='pt').to(DEVICE)
        labels = int(example['label']) - 1
        outputs = model(**{k: v.unsqueeze(0) for k, v in inputs.items()})
        logits = outputs.logits
        preds = logits.argmax().item()
        test_gths.append(labels)
        test_preds.append(preds)
        torch.cuda.empty_cache()

0it [00:00, ?it/s]

### Check Metrics

In [32]:
print({'accuracy': accuracy.compute(predictions=test_preds, references=test_gths),
       'precision': precision.compute(predictions=test_preds, references=test_gths, average='macro'),
       'recall': recall.compute(predictions=test_preds, references=test_gths, average='macro'),
       'f1': f1.compute(predictions=test_preds, references=test_gths, average='macro')})

{'accuracy': {'accuracy': 0.7574206755373593}, 'precision': {'precision': 0.7575787077257603}, 'recall': {'recall': 0.7574292778261013}, 'f1': {'f1': 0.7574565725928807}}
