# Final Project: 2021년 국립국어원 인공지능 언어능력 평가

- [2021년 국립국어원 인공지능 언어능력 평가](https://corpus.korean.go.kr/task/taskList.do?taskId=1&clCd=END_TASK&subMenuId=sub01) 는 9월 1일부터 시작하여 11월 1일까지 마감된 [네 가지 과제에](https://corpus.korean.go.kr/task/taskDownload.do?taskId=1&clCd=END_TASK&subMenuId=sub02) 대한 언어능력 평가 대회
- 여기서 제시된 과제를 그대로 수행하여 그 결과를 [최종 선정된 결과들](https://corpus.korean.go.kr/task/taskLeaderBoard.do?taskId=4&clCd=END_TASK&subMenuId=sub04)과 비교할 수 있도록 수행
- 아직 테스트 셋의 정답이 공식적으로 공개되고 있지 않아, 네 가지 과제의 자료에서 evaluation dataset으로 가지고 성능을 비교할 계획
- 기말 발표전까지 정답셋이 공개될 경우 이 정답셋을 가지고 성능 검증
- Transformers 기반 방법론, 신경망 등 각자 생각한 방법대로 구현 가능
- 현재 대회기간이 종료되어 자료가 다운로드 가능하지 않으니 첨부된 자료 참조
- 개인적으로 하거나 최대 두명까지 그룹 허용. 
- 이 노트북 화일에 이름을 변경하여 작업하고 제출. 제출시 화일명을 FinalProject_[DS또는 CL]_학과_이름.ipynb
- 마감 12월 6일(월) 23:59분까지.
- 12월 7일, 9일 기말 발표 presentation 예정

## 리더보드

- 최종발표전까지 각조는 각 태스크별 실행성능을 **시도된 여러 방법의 결과들을 지속적으로**  [리더보드](https://docs.google.com/spreadsheets/d/1-uenfp5GolpY2Gf0TsFbODvj585IIiFKp9fvYxcfgkY/edit#gid=0)에 해당 팀명(구성원 이름 포함)을 입력하여 공개하여야 함. 
- 최종 마감일에 이 순위와 실제 제출한 프로그램의 수행 결과를 비교하여 성능을 확인

# BoolQ (판정 의문문, 정현진)

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
cur_path = "/content/drive/MyDrive/New Colab Notebooks/NLP/BoolQ/"

### Requirement

In [3]:
!pip install transformers
!pip install wandb
!pip install pytorch-lightning
!pip install tqdm
!pip install sentencepiece

Collecting transformers
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
[K     |████████████████████████████████| 3.1 MB 4.1 MB/s 
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
[K     |████████████████████████████████| 3.3 MB 60.5 MB/s 
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
[K     |████████████████████████████████| 895 kB 63.5 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 71.5 MB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
[K     |████████████████████████████████| 61 kB 677 kB/s 
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting 

### Import packages

In [4]:
import os
import sys
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl

import numpy as np

import wandb
import re

from tqdm import tqdm

### Configuration

In [5]:
class config():
  """ Here type your configurations! """
  # paths
  train_path = "/content/drive/MyDrive/New Colab Notebooks/NLP/BoolQ/SKT_BoolQ_Train.tsv"
  dev_path = "/content/drive/MyDrive/New Colab Notebooks/NLP/BoolQ/SKT_BoolQ_Dev.tsv"
  test_path = "/content/drive/MyDrive/New Colab Notebooks/NLP/BoolQ/SKT_BoolQ_Test.tsv"
  cur_path = "/content/drive/MyDrive/New Colab Notebooks/NLP/BoolQ/"
  train_dev_crop = False

  # model
  model_list = {
      'roberta': "klue/roberta-large",
      'bigbird': "monologg/kobigbird-bert-base",
      'electra': 'monologg/koelectra-base-v3-discriminator',
      'albert': "kykim/albert-kor-base"
  }

  num_classes = 2

  # dataset
  k_fold = 5
  batch_size = 8
  inf_batch_size = 2

  # optimizer, schedular
  learning_rate = 8e-6
  weight_decay = 0.01
  warmup_steps = 500

  # Save
  log_interval = 20
  mode_wandb = True
  save_dir = "/content/drive/MyDrive/New Colab Notebooks/NLP/BoolQ/result/"


### Dataset

In [6]:
from torch.utils.data import Dataset, DataLoader

class BoolQ_Dataset(Dataset):
  def __init__(self, config, training=True):
    """ Configuration """ 
    self.config = config

    if training: # for K folding
      self.dataset = self.load_data(config.train_path)
    else: # test data
      self.dataset = self.load_data(config.dev_path)


  def __len__(self):
    return len(self.dataset)

  def __getitem__(self, idx):
    ## Return text and label
    return {
        "text": self.pre_process(self.dataset["text"].values[idx]), 
        "question": self.pre_process(self.dataset["question"].values[idx]), 
        "label": self.dataset["label"].values[idx]
    }


  def load_data(self, dataset_dir):
    dataset = pd.read_csv(dataset_dir, delimiter='\t', names=['ID', 'text', 'question', 'answer'], header=0)
    dataset["label"] = dataset["answer"].astype(int)
    dataset['text'] = dataset['text'].apply(self.pre_process)
    return dataset

  def pre_process(self, st):
    st = re.sub('\(.*\)|\s-\s.*', '', st)
    st = re.sub('\[.*\]|\s-\s.*', '', st)
    st = st.lower()

    st = re.sub('[”“]', '\"', st)
    st = re.sub('[’‘]', '\'', st)
    st = re.sub('[≫〉》＞』」]', '>', st)
    st = re.sub('[《「『〈≪＜]','<',st)
    st = re.sub('[−–—]', '−', st)
    st = re.sub('[･•・‧]','·', st)
    st = re.sub('<', '', st)
    st = re.sub('>', '', st)
    st = re.sub('·', ', ', st)
    st = st.replace('／', '/')
    st = st.replace('℃', '도')
    st = st.replace('→', '에서')
    st = st.replace('!', '')
    st = st.replace('，', ',')
    st = st.replace('㎢', 'km')
    st = st.replace('∼', '~')
    st = st.replace('㎜', 'mm')
    st = st.replace('×', '곱하기')
    st = st.replace('=', '는')
    st = st.replace('®', '')
    st = st.replace('㎖', 'ml')
    st = st.replace('ℓ', 'l')
    st = st.replace('˚C', '도')
    st = st.replace('˚', '도')
    st = st.replace('°C', '도')
    st = st.replace('°', '도')
    st = st.replace('＋', '+')
    st = st.replace('*', '')
    st = st.replace(';', '.')
    return st
    

### Define Model

In [10]:
from transformers import (
    BigBirdModel,
    BigBirdPreTrainedModel, 
    ElectraModel, 
    ElectraPreTrainedModel, 
    XLMRobertaModel, 
    BartModel, 
    BartPretrainedModel, 
    T5Model, 
    RobertaModel,
    AlbertModel
)

""" KoBigBird Pre-trained Model """

class BigBird_BoolQ(BigBirdPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.bigbird = BigBirdModel.from_pretrained(
            "monologg/kobigbird-bert-base",
            config=config
        )  # Load pretrained bigbird
        
        self.num_labels = config.num_labels

        self.pooling = PoolingHead(input_dim=config.hidden_size,
            inner_dim=config.hidden_size,
            pooler_dropout=0.1)
        self.qa_classifier = nn.Linear(config.hidden_size, self.num_labels)

    def forward(self, input_ids=None, token_type_ids=None, attention_mask=None):
        outputs = self.bigbird(
            input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids
        )  # sequence_output, pooled_output, (hidden_states), (attentions)
        
        pooled_output = outputs[0][:,0,:] #cls
        

        # Dropout -> tanh -> fc_layer (Share FC layer for e1 and e2)
        pooled_output = self.pooling(pooled_output)

        # Concat -> fc_layer
        logits = self.qa_classifier(pooled_output)

        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here

        return outputs  # logits, (hidden_states), (attentions)



""" KoElectra Pre-trained Model """

class Electra_BoolQ(ElectraPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)

        #self.num_labels = config.num_labels
        self.num_labels = config.num_labels
        self.model = ElectraModel.from_pretrained(
            'monologg/koelectra-base-v3-discriminator', config=config)
        self.pooling = PoolingHead(input_dim=config.hidden_size,
            inner_dim=config.hidden_size,
            pooler_dropout=0.1)
        self.qa_classifier = nn.Linear(config.hidden_size, self.num_labels)

    def forward(self, input_ids=None, token_type_ids=None, attention_mask=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        sequence_output = outputs[0][:,0,:] #cls
        sequence_output = self.pooling(sequence_output)
        logits = self.qa_classifier(sequence_output)
        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here

        return outputs  # logits, (hidden_states), (attentions)



""" Roberta Pre-trained Model """

class Roberta_BoolQ(RobertaModel):
    def __init__(self, config):
        super().__init__(config)
        self.roberta = RobertaModel.from_pretrained("klue/roberta-large", config=config)  # Load pretrained Electra

        self.num_labels = config.num_labels

        self.pooling = PoolingHead(input_dim=config.hidden_size,
            inner_dim=config.hidden_size,
            pooler_dropout=0.1)
        self.qa_classifier = nn.Linear(config.hidden_size, self.num_labels)

    def forward(self, input_ids=None, token_type_ids=None, attention_mask=None):
        outputs = self.roberta(
            input_ids, attention_mask=attention_mask
        )  # sequence_output, pooled_output, (hidden_states), (attentions)
        pooled_output = outputs[0][:, 0, :]  # [CLS]

        pooled_output = self.pooling(pooled_output)
        # pooled_output_cat = torch.cat([pooled_output, pooled_output2], dim=1)
        
        logits = self.qa_classifier(pooled_output)

        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here

        return outputs  # logits, (hidden_states), (attentions)        return outputs  # logits, (hidden_states), (attentions)



""" Roberta Ensemble Model """

class Roberta_ensemble_BoolQ(nn.Module):
  def __init__(self, config):
    super().__init__()
    self.roberta_f = RobertaModel.from_pretrained("klue/roberta-large", config=config)  # Input: [Question - Answering]
    self.roberta_b = RobertaModel.from_pretrained("klue/roberta-large", config=config)  # Input: [Answering - Question]

    self.num_labels = config.num_labels

    self.pooling_f = PoolingHead(input_dim=config.hidden_size,
        inner_dim=config.hidden_size,
        pooler_dropout=0.1)
    self.pooling_b = PoolingHead(input_dim=config.hidden_size,
        inner_dim=config.hidden_size,
        pooler_dropout=0.1)
    
    self.qa_classifier = nn.Linear(config.hidden_size * 2, self.num_labels)


  def forward(self, input_forward_dict, input_reverse_dict):
    """
    ? input
    - input_forward_dict : {input_ids, token_type_ids, attention_mask} 
      Input token has sequence of (question - answering)
    - input_reverse_dict : {input_ids, token_type_ids, attention_mask}
      Input token has sequence of (question - answering)
    """
    outputs = self.roberta(
        input_forward_dict['input_ids'], attention_mask=input_forward_dict['attention_mask']
    )  # sequence_output, pooled_output, (hidden_states), (attentions)
    pooled_output_1 = outputs[0][:, 0, :]  # [CLS]

    outputs = self.roberta(
        input_forward_dict['input_ids'], attention_mask=input_forward_dict['attention_mask']
    )  # sequence_output, pooled_output, (hidden_states), (attentions)
    pooled_output_2 = outputs[0][:, 0, :]  # [CLS]



    pooled_output_1 = self.pooling_f(pooled_output_1)
    pooled_output_2 = self.pooling_f(pooled_output_2)
    pooled_output_cat = torch.cat([pooled_output_1, pooled_output_2], dim=1)
    
    logits = self.qa_classifier(pooled_output_cat)

    outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here

    return outputs  # logits, (hidden_states), (attentions)        return outputs  # logits, (hidden_states), (attentions)



""" Albert Model"""

class Albert_BoolQ(AlbertModel):
    def __init__(self, config):
        super().__init__(config)
        self.albert = AlbertModel.from_pretrained("kykim/albert-kor-base", config=config)  # Load pretrained Electra

        self.num_labels = config.num_labels

        self.pooling = PoolingHead(input_dim=config.hidden_size,
            inner_dim=config.hidden_size,
            pooler_dropout=0.1)
        self.qa_classifier = nn.Linear(config.hidden_size, self.num_labels)

    def forward(self, input_ids=None, token_type_ids=None, attention_mask=None):
        outputs = self.albert(
            input_ids, attention_mask=attention_mask
        )  # sequence_output, pooled_output, (hidden_states), (attentions)
        pooled_output = outputs[0][:, 0, :]  # [CLS]

        pooled_output = self.pooling(pooled_output)
        # pooled_output_cat = torch.cat([pooled_output, pooled_output2], dim=1)
        
        logits = self.qa_classifier(pooled_output)

        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here

        return outputs  # logits, (hidden_states), (attentions)        return outputs  # logits, (hidden_states), (attentions)




""" Additional Layers """


class FCLayer(nn.Module):
    def __init__(self, input_dim, output_dim, dropout_rate=0.0, use_activation=True):
        super(FCLayer, self).__init__()
        self.use_activation = use_activation
        self.dropout = nn.Dropout(dropout_rate)
        self.linear = nn.Linear(input_dim, output_dim)
        self.tanh = nn.Tanh()

    def forward(self, x):
        x = self.dropout(x)
        if self.use_activation:
            x = self.tanh(x)
        return self.linear(x)


class PoolingHead(nn.Module):
    def __init__(
        self,
        input_dim: int,
        inner_dim: int,
        pooler_dropout: float,
    ):
        super().__init__()
        self.dense = nn.Linear(input_dim, inner_dim)
        self.dropout = nn.Dropout(p=pooler_dropout)

    def forward(self, hidden_states: torch.Tensor):
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.dense(hidden_states)
        hidden_states = torch.tanh(hidden_states)
        return hidden_states



### Training Center

In [11]:
import transformers
from transformers import AutoConfig, AutoTokenizer, BertTokenizerFast

from sklearn.model_selection import StratifiedKFold

from torch.utils.data import Subset

# https://visionhong.tistory.com/30
# Here is the code for pl.

class BoolQ_Model_Train():
  def __init__(self, config, model_name):
    super().__init__()
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    self.device = device
    self.config = config

    #####################
    ### Configuration ###
    #####################

    """ Model """

    assert model_name in config.model_list.keys(), "[Training] Please Give Correct Model Name which have been listed."
    self.model_name = model_name

    # load configuration of pretrained model
    MODEL_CONFIG = AutoConfig.from_pretrained(config.model_list[model_name])
    MODEL_CONFIG.num_labels = 2

    if model_name == "roberta":
      self.model = Roberta_BoolQ(MODEL_CONFIG)
    elif model_name == "bigbird":
      self.model = BigBird_BoolQ(MODEL_CONFIG)
    elif model_name == "electra":
      self.model = Electra_BoolQ(MODEL_CONFIG)
    elif model_name == "albert":
      self.model = Albert_BoolQ(MODEL_CONFIG)

    self.model.to(device)


    """ Tokenizer """
    if model_name == 'albert':
      self.tokenizer = BertTokenizerFast.from_pretrained(config.model_list[model_name])
    else:
      self.tokenizer = AutoTokenizer.from_pretrained(config.model_list[model_name])



    """ Dataset """

    # train_dataset
    self.train_dataset = BoolQ_Dataset(config)

    # k_fold index
    skf_iris = StratifiedKFold(n_splits=config.k_fold)
    self.kfold = config.k_fold
    self.KFold_index = list(skf_iris.split(
        self.train_dataset.dataset['text'], self.train_dataset.dataset['label']))
    
    # batch_size
    self.batch_size = config.batch_size


    """ optimizer, scheduler (in fit() function), criterion """

    self.optimizer = torch.optim.AdamW(self.model.parameters(), lr=config.learning_rate)
    self.criterion = nn.CrossEntropyLoss()


    """ Training Saving """

    self.log_interval = config.log_interval
    self.load_step = 0
    self.best_acc = 0
    self.wandb = config.mode_wandb
    self.save_dir = config.save_dir



  def fit(self, epoch):
    # schedular
    self.scheduler = transformers.get_linear_schedule_with_warmup(
      self.optimizer, 
      num_warmup_steps=config.warmup_steps, 
      num_training_steps=len(self.train_dataset) * epoch, 
      last_epoch= -1
    )

    
    """ GO TRAINING. """
    self.epoch = epoch

    for epo in tqdm(range(epoch), position=0):
      ### Stratified KFold
      train_idx, val_idx = self.KFold_index[epo % self.kfold]

      training_set = Subset(self.train_dataset, train_idx)
      validation_set = Subset(self.train_dataset, val_idx)

      ### make dataloader
      train_loader = DataLoader(training_set, batch_size=self.batch_size, shuffle=True, collate_fn=self.collate_fn)
      val_loader = DataLoader(validation_set, batch_size=self.batch_size, shuffle=True, collate_fn=self.collate_fn)

      ### train
      self.training_step(train_loader, epo)

      ### val
      self.validation_step(val_loader, epo)

      ### Best model save
      if self.best_acc < self.val_acc:
        self.best_acc = self.val_acc

        print("Best Model Saving!")
        print(f"Current Best Accuracy: {self.best_acc}")

        model_to_save = self.model.module if hasattr(self.model, "module") else self.model
        model_to_save.save_pretrained(f"{self.save_dir}/best/{self.model_name}")
        torch.save(self.config, os.path.join(f"{self.save_dir}/best/{self.model_name}", "training_config.bin"))


      
      

  def training_step(self, train_loader, epo):
    # allocate model to train mode
    self.model.train()
    tot_acc, tot_loss = 0., 0.

    pbar = tqdm(total = len(train_loader), desc="[Training] Epoch {}".format(epo+1), position=1)

    for texts, labels in train_loader:
      ### allocate to cuda or not.
      # texts -> cpu tensor, labels -> array.
      # texts: {input_ids, token_type_ids, attention_mask}
      texts = {key: torch.tensor(value).to(self.device) for key, value in texts.items()}
      labels = torch.tensor(labels).to(self.device)

      ###########################################
      # 1) zero_grad
      self.optimizer.zero_grad()

      # 2) forward
      y_pred = self.model(**texts)[0]

      # 3) calculate loss
      loss = self.criterion(y_pred, labels)

      # 4) backward
      loss.backward()

      # 5) optimier step
      self.optimizer.step()

      # 6) scheduler step
      self.scheduler.step()

      ###########################################


      ### update, and cumulate match and loss
      pbar.update()
      self.load_step += 1

      preds = torch.argmax(y_pred, dim=-1)
      tot_loss += loss.item()
      tot_acc += (preds == labels).sum().item() / self.batch_size

      ### saving to log
      if self.load_step % self.log_interval == 0:
        train_loss = tot_loss / self.log_interval
        train_acc = tot_acc / self.log_interval
        current_lr = self.get_lr(self.optimizer)

        pbar.set_description(f"Epoch: [{epo}/{self.epoch}] || loss: {train_loss:4.4} || acc: {train_acc:4.2%} || lr {current_lr:4.4}")

        self.train_loss = train_loss
        self.train_acc = train_acc
        self.current_lr = current_lr

        tot_acc, tot_loss = 0., 0.

    pbar.close()



  def validation_step(self, val_loader, epo):
    # allocate model to eval mode
    self.model.eval()
    tot_acc, tot_loss = 0., 0.

    pbar = tqdm(total = len(val_loader), desc="[Validation] Epoch {}".format(epo+1), position=1)
    with torch.no_grad():
      for texts, labels in val_loader:
        ### allocate to cuda or not.
        # texts -> cpu tensor, labels -> array.
        # texts: {input_ids, token_type_ids, attention_mask}
        texts = {key: torch.tensor(value).to(self.device) for key, value in texts.items()}
        labels = torch.tensor(labels).to(self.device)

        ###########################################
        # 1) forward
        y_pred = self.model(**texts)[0]

        # 2) calculate loss
        loss = self.criterion(y_pred, labels)

        ###########################################
        """ Update and save loss """

        pbar.update()
    
        preds = torch.argmax(y_pred, dim=-1)
        tot_loss += loss.item()
        tot_acc += (preds == labels).sum().item() / self.batch_size

        ############################################
        

    val_loss = tot_loss / len(val_loader)
    val_acc = tot_acc / len(val_loader)

    pbar.set_description(f"Validation: [{epo}/{self.epoch}] || loss: {val_loss:4.4} || acc: {val_acc:4.2%}")
    pbar.close()

    if self.wandb:
        wandb.log({"train_loss": self.train_loss, "train_acc": self.train_acc,
            "lr":self.current_lr, "valid_loss":val_loss, "valid_acc":val_acc
        })

    self.val_acc = val_acc



  def collate_fn(self, batch):
    """
      Collate a batch of dataset to same length of text.

    ? INPUT
    dataset: {text: string, question: string, label: int}

    ? OUTPUT
    padded token ids.
    """

    batch_size = len(batch)

    # integrate from dataset (dict) into list
    text_list = [b['text'] for b in batch]
    query_list = [b['question'] for b in batch]
    label_list = [b['label'] for b in batch]
    
    # tokenize
    text_query_list = list(zip(text_list, query_list))

    if self.model_name == 'bigbird':
      max_length = 1024
    else:
      max_length = 512

    tokenized_sentence = self.tokenizer(
        text_query_list,
        return_tensors="np",
        padding=True,
        truncation=True,
        max_length=max_length,
        add_special_tokens=True,
        return_token_type_ids = False
    )

    # output of tokenized_sentence: {input_ids, token_type_ids, attention_mask}
    return tokenized_sentence, label_list

  def get_lr(self, optimizer):
    for param_group in optimizer.param_groups:
      return param_group['lr']




In [12]:
class Inference():
  def __init__(self, config, model_name, ensemble=False):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    self.device = device
    self.config = config

    #####################
    ### Configuration ###
    #####################

    """ Model """

    assert model_name in config.model_list.keys(), "[Inference] Please Give Correct Model Name which have been listed."
    self.model_name = model_name

    # load configuration of pretrained model
    MODEL_CONFIG = AutoConfig.from_pretrained(config.model_list[model_name])
    MODEL_CONFIG.num_labels = 2

    save_model_path = os.path.join(config.cur_path, 'result/best', model_name)
    if model_name == "roberta":
      self.model = Roberta_BoolQ.from_pretrained(save_model_path, config=MODEL_CONFIG)
    elif model_name == "bigbird":
      self.model = BigBird_BoolQ.from_pretrained(save_model_path, config=MODEL_CONFIG)
    elif model_name == "electra":
      self.model = Electra_BoolQ.from_pretrained(save_model_path, config=MODEL_CONFIG)
    elif model_name == "albert":
      self.model = Albert_BoolQ.from_pretrained(save_model_path, config=MODEL_CONFIG)

    self.model.to(device)
    self.model.eval()

    """ Tokenizer """

    if model_name == 'albert':
      self.tokenizer = BertTokenizerFast.from_pretrained(config.model_list[model_name])
    else:
      self.tokenizer = AutoTokenizer.from_pretrained(config.model_list[model_name])


    """ Dataset """

    # train_dataset
    self.test_dataset = BoolQ_Dataset(config, False)
    
    # batch_size
    self.batch_size = config.inf_batch_size



  def inference(self):
    ### test dataloader
    test_loader = DataLoader(
        self.test_dataset,
        batch_size = self.batch_size,
        collate_fn = self.collate_fn
    )

    # get accuracy.
    tot_acc = 0.
    with torch.no_grad():
      pbar = tqdm(total = len(test_loader), desc = "Inference")
      for texts, labels in test_loader:
        texts = {key: torch.tensor(value).to(self.device) for key, value in texts.items()}
        labels = torch.tensor(labels).to(self.device)

        y_pred = self.model(**texts)[0]

        preds = torch.argmax(y_pred, dim=-1)
        tot_acc += (preds == labels).sum().item() / self.batch_size

        pbar.update()

    tot_acc /= len(test_loader)
    print(f"Test Accuracy is {tot_acc:4.2%}. Congrats!")
      

  def collate_fn(self, batch):
    """
      Collate a batch of dataset to same length of text.

    ? INPUT
    dataset: {text: string, question: string, label: int}

    ? OUTPUT
    padded token ids.
    """

    batch_size = len(batch)

    # integrate from dataset (dict) into list
    text_list = [b['text'] for b in batch]
    query_list = [b['question'] for b in batch]
    label_list = [b['label'] for b in batch]
    
    # tokenize
    text_query_list = list(zip(text_list, query_list))

    if self.model_name == 'bigbird':
      max_length = 1024
    else:
      max_length = 512

    tokenized_sentence = self.tokenizer(
        text_query_list,
        return_tensors="np",
        padding=True,
        truncation=True,
        max_length=max_length,
        add_special_tokens=True,
        return_token_type_ids = False
    )

    # output of tokenized_sentence: {input_ids, token_type_ids, attention_mask}
    return tokenized_sentence, label_list


# Result!!

### 1) Bigbird!

In [None]:
torch.cuda.empty_cache()

model_name = 'bigbird'

if config.mode_wandb:
    wandb.login()
    wandb.init(project='HyunJin-BoolQ', name=f"hello_{model_name}")

Trainer = BoolQ_Model_Train(config, model_name)
Trainer.fit(epoch = 30)



VBox(children=(Label(value=' 0.00MB of 0.00MB uploaded (0.00MB deduped)\r'), FloatProgress(value=1.0, max=1.0)…

0,1
lr,▁██████▇▇▇
train_acc,▁▂▄▆▆▇▇███
train_loss,██▇▄▅▂▂▁▁▁
valid_acc,▁▂▄▆▇▇████
valid_loss,██▆▄▃▂▂▂▁▁

0,1
lr,1e-05
train_acc,0.96875
train_loss,0.11205
valid_acc,0.99049
valid_loss,0.02583


Some weights of the model checkpoint at monologg/kobigbird-bert-base were not used when initializing BigBirdModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BigBirdModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BigBirdModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
  0%|          | 0/30 [00:00<?, ?it/s]
[Training] Epoch 1:   0%|          | 0/367 [00:00<?, ?it/s][AAt

Best Model Saving!
Current Best Accuracy: 0.5122282608695652


  3%|▎         | 1/30 [01:11<34:43, 71.85s/it]
[Training] Epoch 2:   0%|          | 0/367 [00:00<?, ?it/s][A
[Training] Epoch 2:   0%|          | 1/367 [00:00<01:03,  5.72it/s][A
[Training] Epoch 2:   1%|          | 2/367 [00:00<01:07,  5.38it/s][A
[Training] Epoch 2:   1%|          | 3/367 [00:00<01:08,  5.31it/s][A
[Training] Epoch 2:   1%|          | 4/367 [00:00<01:10,  5.14it/s][A
[Training] Epoch 2:   1%|▏         | 5/367 [00:00<01:08,  5.31it/s][A
[Training] Epoch 2:   2%|▏         | 6/367 [00:01<01:06,  5.47it/s][A
[Training] Epoch 2:   2%|▏         | 7/367 [00:01<01:03,  5.64it/s][A
[Training] Epoch 2:   2%|▏         | 8/367 [00:01<01:02,  5.72it/s][A
[Training] Epoch 2:   2%|▏         | 9/367 [00:01<01:03,  5.67it/s][A
[Training] Epoch 2:   3%|▎         | 10/367 [00:01<01:05,  5.43it/s][A
[Training] Epoch 2:   3%|▎         | 11/367 [00:02<01:07,  5.29it/s][A
[Training] Epoch 2:   3%|▎         | 12/367 [00:02<01:02,  5.67it/s][A
[Training] Epoch 2:   4%|▎         

Best Model Saving!
Current Best Accuracy: 0.5380434782608695


  7%|▋         | 2/30 [02:25<33:55, 72.69s/it]
[Training] Epoch 3:   0%|          | 0/367 [00:00<?, ?it/s][A
[Training] Epoch 3:   0%|          | 1/367 [00:00<00:54,  6.76it/s][A
[Training] Epoch 3:   1%|          | 2/367 [00:00<00:58,  6.29it/s][A
[Training] Epoch 3:   1%|          | 3/367 [00:00<01:04,  5.66it/s][A
[Training] Epoch 3:   1%|          | 4/367 [00:00<01:04,  5.67it/s][A
[Training] Epoch 3:   1%|▏         | 5/367 [00:00<01:08,  5.25it/s][A
[Training] Epoch 3:   2%|▏         | 6/367 [00:01<01:03,  5.66it/s][A
Epoch: [2/30] || loss: 0.2036 || acc: 15.62% || lr 7.982e-06:   2%|▏         | 6/367 [00:01<01:03,  5.66it/s][A
Epoch: [2/30] || loss: 0.2036 || acc: 15.62% || lr 7.982e-06:   2%|▏         | 7/367 [00:01<01:03,  5.65it/s][A
Epoch: [2/30] || loss: 0.2036 || acc: 15.62% || lr 7.982e-06:   2%|▏         | 8/367 [00:01<01:06,  5.39it/s][A
Epoch: [2/30] || loss: 0.2036 || acc: 15.62% || lr 7.982e-06:   2%|▏         | 9/367 [00:01<01:10,  5.10it/s][A
Epoch: [2/30

Best Model Saving!
Current Best Accuracy: 0.6942934782608695


 10%|█         | 3/30 [03:37<32:41, 72.63s/it]
[Training] Epoch 4:   0%|          | 0/367 [00:00<?, ?it/s][A
[Training] Epoch 4:   0%|          | 1/367 [00:00<01:08,  5.34it/s][A
[Training] Epoch 4:   1%|          | 2/367 [00:00<01:04,  5.64it/s][A
[Training] Epoch 4:   1%|          | 3/367 [00:00<00:58,  6.20it/s][A
[Training] Epoch 4:   1%|          | 4/367 [00:00<01:00,  6.03it/s][A
[Training] Epoch 4:   1%|▏         | 5/367 [00:00<01:02,  5.80it/s][A
[Training] Epoch 4:   2%|▏         | 6/367 [00:01<01:02,  5.74it/s][A
[Training] Epoch 4:   2%|▏         | 7/367 [00:01<01:04,  5.55it/s][A
[Training] Epoch 4:   2%|▏         | 8/367 [00:01<01:11,  5.00it/s][A
[Training] Epoch 4:   2%|▏         | 9/367 [00:01<01:07,  5.30it/s][A
[Training] Epoch 4:   3%|▎         | 10/367 [00:01<01:10,  5.09it/s][A
[Training] Epoch 4:   3%|▎         | 11/367 [00:02<01:06,  5.34it/s][A
[Training] Epoch 4:   3%|▎         | 12/367 [00:02<01:04,  5.54it/s][A
[Training] Epoch 4:   4%|▎         

Best Model Saving!
Current Best Accuracy: 0.842391304347826


 13%|█▎        | 4/30 [04:50<31:26, 72.54s/it]
[Training] Epoch 5:   0%|          | 0/367 [00:00<?, ?it/s][A
[Training] Epoch 5:   0%|          | 1/367 [00:00<01:35,  3.85it/s][A
[Training] Epoch 5:   1%|          | 2/367 [00:00<01:22,  4.45it/s][A
[Training] Epoch 5:   1%|          | 3/367 [00:00<01:17,  4.72it/s][A
[Training] Epoch 5:   1%|          | 4/367 [00:00<01:08,  5.32it/s][A
[Training] Epoch 5:   1%|▏         | 5/367 [00:00<01:03,  5.74it/s][A
[Training] Epoch 5:   2%|▏         | 6/367 [00:01<01:01,  5.84it/s][A
[Training] Epoch 5:   2%|▏         | 7/367 [00:01<01:00,  5.92it/s][A
[Training] Epoch 5:   2%|▏         | 8/367 [00:01<00:59,  6.05it/s][A
[Training] Epoch 5:   2%|▏         | 9/367 [00:01<01:01,  5.79it/s][A
[Training] Epoch 5:   3%|▎         | 10/367 [00:01<01:01,  5.81it/s][A
[Training] Epoch 5:   3%|▎         | 11/367 [00:02<01:05,  5.40it/s][A
[Training] Epoch 5:   3%|▎         | 12/367 [00:02<01:09,  5.10it/s][A
Epoch: [4/30] || loss: 0.1929 || ac

Best Model Saving!
Current Best Accuracy: 0.8967391304347826


 17%|█▋        | 5/30 [06:02<30:14, 72.58s/it]
[Training] Epoch 6:   0%|          | 0/367 [00:00<?, ?it/s][A
[Training] Epoch 6:   0%|          | 1/367 [00:00<01:29,  4.10it/s][A
[Training] Epoch 6:   1%|          | 2/367 [00:00<01:11,  5.09it/s][A
[Training] Epoch 6:   1%|          | 3/367 [00:00<01:14,  4.90it/s][A
[Training] Epoch 6:   1%|          | 4/367 [00:00<01:14,  4.88it/s][A
[Training] Epoch 6:   1%|▏         | 5/367 [00:01<01:12,  5.00it/s][A
Epoch: [5/30] || loss: 0.06717 || acc: 23.12% || lr 7.902e-06:   1%|▏         | 5/367 [00:01<01:12,  5.00it/s][A
Epoch: [5/30] || loss: 0.06717 || acc: 23.12% || lr 7.902e-06:   2%|▏         | 6/367 [00:01<01:12,  4.98it/s][A
Epoch: [5/30] || loss: 0.06717 || acc: 23.12% || lr 7.902e-06:   2%|▏         | 7/367 [00:01<01:11,  5.05it/s][A
Epoch: [5/30] || loss: 0.06717 || acc: 23.12% || lr 7.902e-06:   2%|▏         | 8/367 [00:01<01:07,  5.30it/s][A
Epoch: [5/30] || loss: 0.06717 || acc: 23.12% || lr 7.902e-06:   2%|▏         |

Best Model Saving!
Current Best Accuracy: 0.9578804347826086


 20%|██        | 6/30 [07:15<28:59, 72.48s/it]
[Training] Epoch 7:   0%|          | 0/367 [00:00<?, ?it/s][A
[Training] Epoch 7:   0%|          | 1/367 [00:00<01:03,  5.75it/s][A
[Training] Epoch 7:   1%|          | 2/367 [00:00<01:05,  5.58it/s][A
[Training] Epoch 7:   1%|          | 3/367 [00:00<01:00,  6.02it/s][A
[Training] Epoch 7:   1%|          | 4/367 [00:00<01:05,  5.57it/s][A
[Training] Epoch 7:   1%|▏         | 5/367 [00:00<01:00,  5.94it/s][A
[Training] Epoch 7:   2%|▏         | 6/367 [00:01<00:59,  6.02it/s][A
[Training] Epoch 7:   2%|▏         | 7/367 [00:01<01:04,  5.55it/s][A
[Training] Epoch 7:   2%|▏         | 8/367 [00:01<00:59,  5.99it/s][A
[Training] Epoch 7:   2%|▏         | 9/367 [00:01<00:56,  6.34it/s][A
[Training] Epoch 7:   3%|▎         | 10/367 [00:01<00:58,  6.10it/s][A
[Training] Epoch 7:   3%|▎         | 11/367 [00:01<00:59,  6.03it/s][A
[Training] Epoch 7:   3%|▎         | 12/367 [00:01<00:55,  6.34it/s][A
[Training] Epoch 7:   4%|▎         

Best Model Saving!
Current Best Accuracy: 0.96875


 23%|██▎       | 7/30 [08:27<27:45, 72.40s/it]
[Training] Epoch 8:   0%|          | 0/367 [00:00<?, ?it/s][A
[Training] Epoch 8:   0%|          | 1/367 [00:00<01:06,  5.50it/s][A
[Training] Epoch 8:   1%|          | 2/367 [00:00<01:02,  5.84it/s][A
[Training] Epoch 8:   1%|          | 3/367 [00:00<01:02,  5.78it/s][A
[Training] Epoch 8:   1%|          | 4/367 [00:00<00:58,  6.25it/s][A
[Training] Epoch 8:   1%|▏         | 5/367 [00:00<00:57,  6.33it/s][A
[Training] Epoch 8:   2%|▏         | 6/367 [00:01<01:03,  5.67it/s][A
[Training] Epoch 8:   2%|▏         | 7/367 [00:01<01:03,  5.64it/s][A
[Training] Epoch 8:   2%|▏         | 8/367 [00:01<01:08,  5.27it/s][A
[Training] Epoch 8:   2%|▏         | 9/367 [00:01<01:06,  5.42it/s][A
[Training] Epoch 8:   3%|▎         | 10/367 [00:01<01:09,  5.13it/s][A
[Training] Epoch 8:   3%|▎         | 11/367 [00:01<01:07,  5.30it/s][A
Epoch: [7/30] || loss: 0.06394 || acc: 53.75% || lr 7.848e-06:   3%|▎         | 11/367 [00:01<01:07,  5.30i

Best Model Saving!
Current Best Accuracy: 0.9782608695652174


 27%|██▋       | 8/30 [09:40<26:40, 72.75s/it]
[Training] Epoch 9:   0%|          | 0/367 [00:00<?, ?it/s][A
[Training] Epoch 9:   0%|          | 1/367 [00:00<00:56,  6.51it/s][A
[Training] Epoch 9:   1%|          | 2/367 [00:00<01:04,  5.68it/s][A
[Training] Epoch 9:   1%|          | 3/367 [00:00<01:00,  6.07it/s][A
[Training] Epoch 9:   1%|          | 4/367 [00:00<01:01,  5.95it/s][A
Epoch: [8/30] || loss: 0.008422 || acc: 20.00% || lr 7.822e-06:   1%|          | 4/367 [00:00<01:01,  5.95it/s][A
Epoch: [8/30] || loss: 0.008422 || acc: 20.00% || lr 7.822e-06:   1%|▏         | 5/367 [00:00<00:59,  6.11it/s][A
Epoch: [8/30] || loss: 0.008422 || acc: 20.00% || lr 7.822e-06:   2%|▏         | 6/367 [00:00<00:57,  6.25it/s][A
Epoch: [8/30] || loss: 0.008422 || acc: 20.00% || lr 7.822e-06:   2%|▏         | 7/367 [00:01<01:01,  5.83it/s][A
Epoch: [8/30] || loss: 0.008422 || acc: 20.00% || lr 7.822e-06:   2%|▏         | 8/367 [00:01<01:04,  5.57it/s][A
Epoch: [8/30] || loss: 0.008422

Best Model Saving!
Current Best Accuracy: 0.9904891304347826


 30%|███       | 9/30 [10:52<25:22, 72.50s/it]
[Training] Epoch 10:   0%|          | 0/367 [00:00<?, ?it/s][A
[Training] Epoch 10:   0%|          | 1/367 [00:00<00:50,  7.24it/s][A
[Training] Epoch 10:   1%|          | 2/367 [00:00<00:52,  6.90it/s][A
[Training] Epoch 10:   1%|          | 3/367 [00:00<01:04,  5.61it/s][A
[Training] Epoch 10:   1%|          | 4/367 [00:00<01:00,  6.02it/s][A
[Training] Epoch 10:   1%|▏         | 5/367 [00:00<01:07,  5.39it/s][A
[Training] Epoch 10:   2%|▏         | 6/367 [00:01<01:04,  5.57it/s][A
[Training] Epoch 10:   2%|▏         | 7/367 [00:01<01:05,  5.48it/s][A
[Training] Epoch 10:   2%|▏         | 8/367 [00:01<01:03,  5.61it/s][A
[Training] Epoch 10:   2%|▏         | 9/367 [00:01<01:01,  5.86it/s][A
[Training] Epoch 10:   3%|▎         | 10/367 [00:01<01:01,  5.85it/s][A
[Training] Epoch 10:   3%|▎         | 11/367 [00:01<01:00,  5.84it/s][A
[Training] Epoch 10:   3%|▎         | 12/367 [00:02<00:58,  6.03it/s][A
[Training] Epoch 10:  

Best Model Saving!
Current Best Accuracy: 0.9945652173913043


 37%|███▋      | 11/30 [13:15<22:46, 71.91s/it]
[Training] Epoch 12:   0%|          | 0/367 [00:00<?, ?it/s][A
[Training] Epoch 12:   0%|          | 1/367 [00:00<01:08,  5.33it/s][A
[Training] Epoch 12:   1%|          | 2/367 [00:00<01:10,  5.21it/s][A
[Training] Epoch 12:   1%|          | 3/367 [00:00<01:05,  5.56it/s][A
Epoch: [11/30] || loss: 0.001721 || acc: 15.00% || lr 7.741e-06:   1%|          | 3/367 [00:00<01:05,  5.56it/s][A
Epoch: [11/30] || loss: 0.001721 || acc: 15.00% || lr 7.741e-06:   1%|          | 4/367 [00:00<01:00,  5.97it/s][A
Epoch: [11/30] || loss: 0.001721 || acc: 15.00% || lr 7.741e-06:   1%|▏         | 5/367 [00:00<00:58,  6.21it/s][A
Epoch: [11/30] || loss: 0.001721 || acc: 15.00% || lr 7.741e-06:   2%|▏         | 6/367 [00:01<01:06,  5.40it/s][A
Epoch: [11/30] || loss: 0.001721 || acc: 15.00% || lr 7.741e-06:   2%|▏         | 7/367 [00:01<01:08,  5.26it/s][A
Epoch: [11/30] || loss: 0.001721 || acc: 15.00% || lr 7.741e-06:   2%|▏         | 8/367 [00:

### Bigbird Inference

In [13]:
Eval = Inference(config, 'bigbird')
Eval.inference()

Downloading:   0%|          | 0.00/870 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of the model checkpoint at monologg/kobigbird-bert-base were not used when initializing BigBirdModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BigBirdModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BigBirdModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Downloading:   0%|          | 0.00/373 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/236k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/480k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/169 [00:00<?, ?B/s]

Inference:   0%|          | 0/350 [00:00<?, ?it/s]Attention type 'block_sparse' is not possible if sequence_length: 144 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3. Changing attention type to 'original_full'...
Inference: 100%|██████████| 350/350 [00:04<00:00, 70.66it/s]

Test Accuracy is 75.00%. Congrats!





### 2) Roborta

In [None]:
torch.cuda.empty_cache()

model_name = 'roberta'

if config.mode_wandb:
    wandb.login()
    wandb.init(project='HyunJin-BoolQ', name=f"hello_{model_name}")

Trainer = BoolQ_Model_Train(config, model_name)
Trainer.fit(epoch = 30)

### Roberta Inference

In [14]:
Eval = Inference(config, 'roberta')
Eval.inference()

Downloading:   0%|          | 0.00/547 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.25G [00:00<?, ?B/s]

Some weights of the model checkpoint at klue/roberta-large were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at klue/roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it f

Downloading:   0%|          | 0.00/375 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/243k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/734k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/173 [00:00<?, ?B/s]

Inference: 100%|██████████| 350/350 [00:12<00:00, 27.49it/s]

Test Accuracy is 80.71%. Congrats!





### 3) Electra

In [None]:
torch.cuda.empty_cache()

model_name = 'electra'

if config.mode_wandb:
    wandb.login()
    wandb.init(project='HyunJin-BoolQ', name=f"hello_{model_name}")

Trainer = BoolQ_Model_Train(config, model_name)
Trainer.fit(epoch = 30)

### Electra Inference

In [15]:
Eval = Inference(config, 'electra')
Eval.inference()

Downloading:   0%|          | 0.00/467 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/431M [00:00<?, ?B/s]

Some weights of the model checkpoint at monologg/koelectra-base-v3-discriminator were not used when initializing ElectraModel: ['discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.weight']
- This IS expected if you are initializing ElectraModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Downloading:   0%|          | 0.00/61.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/257k [00:02<?, ?B/s]

Inference: 100%|██████████| 350/350 [00:04<00:00, 79.88it/s]

Test Accuracy is 80.57%. Congrats!





### 4) Albert (koreAlbert)

In [None]:
torch.cuda.empty_cache()

model_name = 'albert'

if config.mode_wandb:
    wandb.login()
    wandb.init(project='HyunJin-BoolQ', name=f"hello_{model_name}")

Trainer = BoolQ_Model_Train(config, model_name)
Trainer.fit(epoch = 30)

### Albert Inference

In [None]:
Eval = Inference(config, 'albert')
Eval.inference()

NameError: ignored

# Logit ensamble of 3 model

In [16]:
class InferenceLogitEnsemble():
  """
    Logit Ensemble of 3 well-trained model
  """
  def __init__(self, config):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    self.device = device
    self.config = config

    #####################
    ### Configuration ###
    #####################

    """ Model """

    # make a list of dictionary the well-trained model saved.
    save_model_path = [os.path.join(config.cur_path, 'result/best', name) for name in config.model_list.keys()]

    # make a model_config of the well-trained models saved.
    config_list = [AutoConfig.from_pretrained(config.model_list[name]) for name in config.model_list.keys()]
    config_list[0].num_labels = 2
    config_list[1].num_labels = 2
    config_list[2].num_labels = 2

    # load trained models (no Albert)
    self.model_r = Roberta_BoolQ.from_pretrained(save_model_path[0], config=config_list[0]).to(device).eval()
    self.model_b = BigBird_BoolQ.from_pretrained(save_model_path[1], config=config_list[1]).to(device).eval()
    self.model_e = Electra_BoolQ.from_pretrained(save_model_path[2], config=config_list[2]).to(device).eval()
    

    """ Tokenizer """
    self.tokenizer_list = [AutoTokenizer.from_pretrained(config.model_list[name]) for name in list(config.model_list.keys())[:3]]


    """ Dataset """

    # train_dataset
    self.test_dataset = BoolQ_Dataset(config, False)
    
    # batch_size
    self.batch_size = config.inf_batch_size



  def inference(self):
    ### test dataloader
    loader_list = [DataLoader(
        self.test_dataset,
        batch_size = self.batch_size,
        collate_fn = self.ensemble_fn(self.tokenizer_list[i])) for i in range(3)
    ]

    # get accuracy.
    tot_acc = 0.
    with torch.no_grad():
      pbar = tqdm(total = len(loader_list[0]), desc = "Inference")
      for (t_r, labels), (t_b, _), (t_e, _) in zip(loader_list[0], loader_list[1], loader_list[2]):
        ### simple function to allocate to cuda (or cpu)
        convert_text = lambda t: {key: torch.tensor(value).to(self.device) for key, value in t.items()}
        convert_label = lambda l: torch.tensor(l).to(self.device)

        t_r, t_b, t_e = convert_text(t_r), convert_text(t_b), convert_text(t_e)
        
        y_pred_r = self.model_r(**t_r)[0]
        y_pred_b = self.model_b(**t_b)[0]
        y_pred_e = self.model_e(**t_e)[0]

        y_pred = (y_pred_r + y_pred_b + y_pred_e)/3

        preds = torch.argmax(y_pred, dim=-1).to('cpu')
        tot_acc += (preds == torch.tensor(labels)).sum().item() / self.batch_size

        pbar.update()

    tot_acc /= len(loader_list[0])
    print(f"Test Accuracy is {tot_acc:4.2%}. Congrats!")
      
  def ensemble_fn(self, tokenizer):
    fn = lambda batch: self.collate_fn(batch, tokenizer = tokenizer)
    return fn

  def collate_fn(self, batch, tokenizer):
    """
      Collate a batch of dataset to same length of text.

    ? INPUT
    dataset: {text: string, question: string, label: int}

    ? OUTPUT
    padded token ids.
    """

    batch_size = len(batch)

    # integrate from dataset (dict) into list
    text_list = [b['text'] for b in batch]
    query_list = [b['question'] for b in batch]
    label_list = [b['label'] for b in batch]
    
    # tokenize
    text_query_list = list(zip(text_list, query_list))

    tokenized_sentence = tokenizer(
        text_query_list,
        return_tensors="np",
        padding=True,
        truncation=True,
        max_length=512,
        add_special_tokens=True,
        return_token_type_ids = False
    )

    # output of tokenized_sentence: {input_ids, token_type_ids, attention_mask}
    return tokenized_sentence, label_list


Ensemble Inference!

In [17]:
#torch.cuda.empty_cache()
Eval = InferenceLogitEnsemble(config)

Downloading:   0%|          | 0.00/684 [00:00<?, ?B/s]

Some weights of the model checkpoint at klue/roberta-large were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at klue/roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it f

In [18]:
Eval.inference()

Inference:   0%|          | 0/350 [00:00<?, ?it/s]Attention type 'block_sparse' is not possible if sequence_length: 144 <= num global tokens: 2 * config.block_size + min. num sliding tokens: 3 * config.block_size + config.num_random_blocks * config.block_size + additional buffer: config.num_random_blocks * config.block_size = 704 with config.block_size = 64, config.num_random_blocks = 3. Changing attention type to 'original_full'...
Inference: 100%|██████████| 350/350 [00:21<00:00, 16.20it/s]

Test Accuracy is 81.86%. Congrats!





# Roberta Ensemble

Train and inference with two models which be taken by input of sequence of (question, answer) and (answer, question) respectively.

In [None]:
import transformers
from transformers import AutoConfig, AutoTokenizer, BertTokenizerFast

from sklearn.model_selection import StratifiedKFold

from torch.utils.data import Subset

# https://visionhong.tistory.com/30
# Here is the code for pl.

class BoolQ_Model_Ensemble_Train():
  def __init__(self, config, model_name):
    super().__init__()
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    self.device = device
    self.config = config

    #####################
    ### Configuration ###
    #####################

    """ Model """

    assert model_name in config.model_list.keys(), "[Training] Please Give Correct Model Name which have been listed."
    self.model_name = model_name

    # load configuration of pretrained model
    MODEL_CONFIG = AutoConfig.from_pretrained(config.model_list[model_name])
    MODEL_CONFIG.num_labels = 2

    if model_name == "roberta":
      self.model = Roberta_BoolQ(MODEL_CONFIG)
    elif model_name == "bigbird":
      self.model = BigBird_BoolQ(MODEL_CONFIG)
    elif model_name == "electra":
      self.model = Electra_BoolQ(MODEL_CONFIG)
    elif model_name == "albert":
      self.model = Albert_BoolQ(MODEL_CONFIG)

    self.model.to(device)


    """ Tokenizer """
    if model_name == 'albert':
      self.tokenizer = BertTokenizerFast.from_pretrained(config.model_list[model_name])
    else:
      self.tokenizer = AutoTokenizer.from_pretrained(config.model_list[model_name])



    """ Dataset """

    # train_dataset
    self.train_dataset = BoolQ_Dataset(config)

    # k_fold index
    skf_iris = StratifiedKFold(n_splits=config.k_fold)
    self.kfold = config.k_fold
    self.KFold_index = list(skf_iris.split(
        self.train_dataset.dataset['text'], self.train_dataset.dataset['label']))
    
    # batch_size
    self.batch_size = config.batch_size


    """ optimizer, scheduler (in fit() function), criterion """

    self.optimizer = torch.optim.AdamW(self.model.parameters(), lr=config.learning_rate)
    self.criterion = nn.CrossEntropyLoss()


    """ Training Saving """

    self.log_interval = config.log_interval
    self.load_step = 0
    self.best_acc = 0
    self.wandb = config.mode_wandb
    self.save_dir = config.save_dir



  def fit(self, epoch):
    # schedular
    self.scheduler = transformers.get_linear_schedule_with_warmup(
      self.optimizer, 
      num_warmup_steps=config.warmup_steps, 
      num_training_steps=len(self.train_dataset) * epoch, 
      last_epoch= -1
    )

    
    """ GO TRAINING. """
    self.epoch = epoch

    for epo in tqdm(range(epoch), position=0):
      ### Stratified KFold
      train_idx, val_idx = self.KFold_index[epo % self.kfold]

      training_set = Subset(self.train_dataset, train_idx)
      validation_set = Subset(self.train_dataset, val_idx)

      ### make dataloader
      train_loader = DataLoader(training_set, batch_size=self.batch_size, shuffle=True, collate_fn=self.collate_fn)
      val_loader = DataLoader(validation_set, batch_size=self.batch_size, shuffle=True, collate_fn=self.collate_fn)

      ### train
      self.training_step(train_loader, epo)

      ### val
      self.validation_step(val_loader, epo)

      ### Best model save
      if self.best_acc < self.val_acc:
        self.best_acc = self.val_acc

        print("Best Model Saving!")
        print(f"Current Best Accuracy: {self.best_acc}")

        model_to_save = self.model.module if hasattr(self.model, "module") else self.model
        model_to_save.save_pretrained(f"{self.save_dir}/best/{self.model_name}")
        torch.save(self.config, os.path.join(f"{self.save_dir}/best/{self.model_name}", "training_config.bin"))


      

  def training_step(self, train_loader, epo):
    # allocate model to train mode
    self.model.train()
    tot_acc, tot_loss = 0., 0.

    pbar = tqdm(total = len(train_loader), desc="[Training] Epoch {}".format(epo+1), position=1)

    for texts, labels in train_loader:
      ### allocate to cuda or not.
      # texts -> cpu tensor, labels -> array.
      # texts: {input_ids, token_type_ids, attention_mask}
      texts = {key: torch.tensor(value).to(self.device) for key, value in texts.items()}
      labels = torch.tensor(labels).to(self.device)

      ###########################################
      # 1) zero_grad
      self.optimizer.zero_grad()

      # 2) forward
      y_pred = self.model(**texts)[0]

      # 3) calculate loss
      loss = self.criterion(y_pred, labels)

      # 4) backward
      loss.backward()

      # 5) optimier step
      self.optimizer.step()

      # 6) scheduler step
      self.scheduler.step()

      ###########################################


      ### update, and cumulate match and loss
      pbar.update()
      self.load_step += 1

      preds = torch.argmax(y_pred, dim=-1)
      tot_loss += loss.item()
      tot_acc += (preds == labels).sum().item() / self.batch_size

      ### saving to log
      if self.load_step % self.log_interval == 0:
        train_loss = tot_loss / self.log_interval
        train_acc = tot_acc / self.log_interval
        current_lr = self.get_lr(self.optimizer)

        pbar.set_description(f"Epoch: [{epo}/{self.epoch}] || loss: {train_loss:4.4} || acc: {train_acc:4.2%} || lr {current_lr:4.4}")

        self.train_loss = train_loss
        self.train_acc = train_acc
        self.current_lr = current_lr

        tot_acc, tot_loss = 0., 0.

    pbar.close()



  def validation_step(self, val_loader, epo):
    # allocate model to eval mode
    self.model.eval()
    tot_acc, tot_loss = 0., 0.

    pbar = tqdm(total = len(val_loader), desc="[Validation] Epoch {}".format(epo+1), position=1)
    with torch.no_grad():
      for texts, labels in val_loader:
        ### allocate to cuda or not.
        # texts -> cpu tensor, labels -> array.
        # texts: {input_ids, token_type_ids, attention_mask}
        texts = {key: torch.tensor(value).to(self.device) for key, value in texts.items()}
        labels = torch.tensor(labels).to(self.device)

        ###########################################
        # 1) forward
        y_pred = self.model(**texts)[0]

        # 2) calculate loss
        loss = self.criterion(y_pred, labels)

        ###########################################
        """ Update and save loss """

        pbar.update()
    
        preds = torch.argmax(y_pred, dim=-1)
        tot_loss += loss.item()
        tot_acc += (preds == labels).sum().item() / self.batch_size

        ############################################
        

    val_loss = tot_loss / len(val_loader)
    val_acc = tot_acc / len(val_loader)

    pbar.set_description(f"Validation: [{epo}/{self.epoch}] || loss: {val_loss:4.4} || acc: {val_acc:4.2%}")
    pbar.close()

    if self.wandb:
        wandb.log({"train_loss": self.train_loss, "train_acc": self.train_acc,
            "lr":self.current_lr, "valid_loss":val_loss, "valid_acc":val_acc
        })

    self.val_acc = val_acc



  def collate_fn(self, batch):
    """
      Collate a batch of dataset to same length of text.

    ? INPUT
    dataset: {text: string, question: string, label: int}

    ? OUTPUT
    padded token ids.
    """

    batch_size = len(batch)

    # integrate from dataset (dict) into list
    text_list = [b['text'] for b in batch]
    query_list = [b['question'] for b in batch]
    label_list = [b['label'] for b in batch]
    
    # tokenize
    text_query_list = list(zip(text_list, query_list))

    if self.model_name == 'bigbird':
      max_length = 1024
    else:
      max_length = 512

    tokenized_sentence = self.tokenizer(
        text_query_list,
        return_tensors="np",
        padding=True,
        truncation=True,
        max_length=max_length,
        add_special_tokens=True,
        return_token_type_ids = False
    )

    # output of tokenized_sentence: {input_ids, token_type_ids, attention_mask}
    return tokenized_sentence, label_list

  def get_lr(self, optimizer):
    for param_group in optimizer.param_groups:
      return param_group['lr']




##### Test Code

In [None]:
list(config.model_list.keys())

In [None]:
for (i, x), (j, y) in zip(enumerate(range(3)), enumerate(range(3))):
  print(x, y)

In [None]:

def _collate_fn(batch):
    """
    Collate a batch of dataset to same length of text.

    ? INPUT
    dataset: {text: string, question: string, label: int}

    ? OUTPUT
    padded token ids.
    """

    batch_size = len(batch)

    # integrate from dataset (dict) into list
    text_list = [b['text'] for b in batch]
    query_list = [b['question'] for b in batch]
    label_list = [b['label'] for b in batch]

    # tokenize
    text_query_list = list(zip(text_list, query_list))


    tokenized_sentence = tokenizer(
        text_query_list,
        return_tensors="np",
        padding=True,
        truncation=True,
        max_length=512,
        add_special_tokens=True,
        return_token_type_ids = True
    )

# output of tokenized_sentence: {input_ids, token_type_ids, attention_mask}
    return tokenized_sentence, label_list


In [None]:
dataset = BoolQ_Dataset(config)
for data in Subset(dataset, idx):
    print(data)

In [None]:
from transformers import AutoTokenizer, BigBirdTokenizer, BertTokenizerFast
#tokenizer = AutoTokenizer.from_pretrained("monologg/kobigbird-bert-base")
tokenizer = BertTokenizerFast.from_pretrained("kykim/albert-kor-base")
#tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
#tokenizer = AutoTokenizer.from_pretrained("klue/roberta-large")

from torch.utils.data import Subset
dataset = BoolQ_Dataset(config)
print(dataset)
idx = np.asarray([1, 3, 5, 6])
print(Subset(dataset, idx))
loader = DataLoader(
    dataset,
    batch_size = 8,
    shuffle = True,
    collate_fn = _collate_fn
)

for batch, label_list in loader:
  print(batch)
  print(batch['input_ids'].shape)
  print(batch['token_type_ids'].shape)
  print(batch['attention_mask'].shape)

  print(tokenizer.batch_decode(batch['input_ids'].tolist()))
  print(label_list)
  break

In [None]:
from transformers import AutoConfig

config = AutoConfig.from_pretrained("kykim/albert-kor-base")

In [None]:
import functools

def foo(a):
  fn = lambda b: bar(a, b)
  return fn

def bar(a, b):
  return a+b

c = foo(5)
c(3)

# 여기서부터 COPA (ㅇ
---
- Pretrained Model : klue/Roberta-large 

- 아래 실행하여 라이브러리 설치
-``` pip install BackTranslation```

## 모듈 임포트

In [2]:
import copy
import glob
import os
import random
import json
import time
import re
from time import sleep
from importlib import import_module
from pathlib import Path

import pandas as pd
import numpy as np
from tqdm.auto import tqdm
from sklearn.metrics import accuracy_score
from easydict import EasyDict

import torch
import torch.nn as nn
import transformers
from torch.utils.tensorboard import SummaryWriter
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
from transformers import (
    BertModel,
    BertPreTrainedModel,
    ElectraModel,
    ElectraPreTrainedModel,
    XLMRobertaModel,
    BartModel,
    BartPretrainedModel,
    T5Model,
    RobertaModel,
)
from transformers import MBartModel, MBartConfig
from transformers import BertTokenizer, BertModel
from BackTranslation import BackTranslation

## Data Augmentation by Backtranslation
---
- Google Translation 사용
- 매우 오래걸리므로 전처리된 파일 사용

In [3]:
original_train_data = "dataset/copa/SKT_COPA_Train.tsv"
augmented_train_data = "dataset/copa/SKT_COPA_Train_aug.tsv"
valid_data = "dataset/copa/SKT_COPA_Dev.tsv"

In [4]:
dataset = pd.read_csv(
    original_train_data,
    delimiter="\t",
    names=["ID", "sentence", "question", "1", "2", "answer"],
    header=0,
)

- 아래와 같은 코드를 사용하여 backtranslate하였음

In [5]:
saved_backtranslated = "dataset/copa/en_new_sentences.pth"

In [6]:
# trans = BackTranslation(url=['translate.google.co.kr',])
# def augment_sentence(trans, s, tmp='en'):
#     return trans.translate(s, src='ko', tmp=tmp).result_text
# tmps = [en']
# new_datasets = dict()
# new_sentences = dict()

# for tmp in tmps:
#     new_dataset = copy.deepcopy(dataset)
#     sentences = new_dataset['sentence'].tolist()
#     new_sentences[tmp] = list()
#     for sent in tqdm(sentences):
#         new_sentences[tmp].append(augment_sentence(trans, sent, tmp=tmp))

# torch.save(new_sentences, saved_backtranslated)

In [7]:
sent_en = torch.load(saved_backtranslated)
sent = dict()
sent['en'] = sent_en['en']

## 원본 데이터셋

In [8]:
dataset.head()

Unnamed: 0,ID,sentence,question,1,2,answer
0,1,이퀄라이저로 저음 음역대 소리 크기를 키웠다.,결과,베이스 소리가 잘 들리게 되었다.,베이스 소리가 들리지 않게 되었다.,1
1,2,음료에 초콜렛 시럽을 넣었다.,결과,음료수가 더 달아졌다.,음료수가 차가워졌다.,1
2,3,남자는 휴대폰을 호수에 빠뜨렸다.,결과,휴대폰이 업그레이드 되었다.,휴대폰이 고장났다.,2
3,4,옆 집 사람이 이사를 나갔다.,원인,옆 집 사람은 계약이 완료되었다.,옆 집 사람은 계약을 연장했다.,1
4,5,문을 밀었다.,결과,문이 잠겼다.,문이 열렸다.,2


In [9]:
new_dataset = copy.deepcopy(dataset)
new_dataset['ID'] += len(dataset)
new_dataset['sentence'] = sent['en']

## BackTranslate로 augment한 데이터셋

In [10]:
new_dataset.head()

Unnamed: 0,ID,sentence,question,1,2,answer
0,3081,이퀄라이저는베이스 스캔의 사운드를 올렸습니다.,결과,베이스 소리가 잘 들리게 되었다.,베이스 소리가 들리지 않게 되었다.,1
1,3082,나는 음료에 초콜릿 시럽을 넣었다.,결과,음료수가 더 달아졌다.,음료수가 차가워졌다.,1
2,3083,그 남자는 호수에 휴대 전화를 넣었습니다.,결과,휴대폰이 업그레이드 되었다.,휴대폰이 고장났다.,2
3,3084,집 옆에있는 사람이 나갔다.,원인,옆 집 사람은 계약이 완료되었다.,옆 집 사람은 계약을 연장했다.,1
4,3085,나는 문을 밀었다.,결과,문이 잠겼다.,문이 열렸다.,2


## 데이터 합병

In [11]:
new_dataset = dataset.append(new_dataset)
new_dataset

Unnamed: 0,ID,sentence,question,1,2,answer
0,1,이퀄라이저로 저음 음역대 소리 크기를 키웠다.,결과,베이스 소리가 잘 들리게 되었다.,베이스 소리가 들리지 않게 되었다.,1
1,2,음료에 초콜렛 시럽을 넣었다.,결과,음료수가 더 달아졌다.,음료수가 차가워졌다.,1
2,3,남자는 휴대폰을 호수에 빠뜨렸다.,결과,휴대폰이 업그레이드 되었다.,휴대폰이 고장났다.,2
3,4,옆 집 사람이 이사를 나갔다.,원인,옆 집 사람은 계약이 완료되었다.,옆 집 사람은 계약을 연장했다.,1
4,5,문을 밀었다.,결과,문이 잠겼다.,문이 열렸다.,2
...,...,...,...,...,...,...
3075,6156,계약자로 일한 남자들은 떠났다.,원인,계약을 연장했다.,계약이 종료되었다.,2
3076,6157,목 마른.,원인,물을 마시지 못했다.,텀블러를 샀다.,1
3077,6158,나는 그 노래를 오랫동안 전화 했어.,결과,목이 아프다.,노래방이 폐업했다.,1
3078,6159,사람들은 한 번 함께 일하고 있습니다.,원인,우리나라 축구팀이 골을 넣었다.,우리나라 축구팀이 경기에서 패배했다.,2


In [12]:
new_dataset.to_csv(augmented_train_data, sep='\t')

# COPA 학습 & Inference to json 코드
---

## Transformers의 Wrapper Class와 일부 테스트 모델 선언 및 구현부

In [13]:
class FCLayer(nn.Module):
    def __init__(self, input_dim, output_dim, dropout_rate=0.0, use_activation=True):
        super(FCLayer, self).__init__()
        self.use_activation = use_activation
        self.dropout = nn.Dropout(dropout_rate)
        self.linear = nn.Linear(input_dim, output_dim)
        self.tanh = nn.Tanh()

    def forward(self, x):
        x = self.dropout(x)
        if self.use_activation:
            x = self.tanh(x)
        return self.linear(x)


class PoolingHead(nn.Module):
    def __init__(
        self, input_dim: int, inner_dim: int, pooler_dropout: float,
    ):
        super().__init__()
        self.dense = nn.Linear(input_dim, inner_dim)
        self.dropout = nn.Dropout(p=pooler_dropout)

    def forward(self, hidden_states: torch.Tensor):
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.dense(hidden_states)
        hidden_states = torch.tanh(hidden_states)
        return hidden_states


class Bert(BertPreTrainedModel):
    def __init__(self, config, args):
        super(Bert, self).__init__(config)
        self.bert = BertModel(config=config)  # Load pretrained bert

        self.num_labels = config.num_labels

        self.pooling = PoolingHead(
            input_dim=config.hidden_size,
            inner_dim=config.hidden_size,
            pooler_dropout=0.1,
        )
        self.qa_classifier = nn.Linear(config.hidden_size, self.num_labels - 1)

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        input_ids2=None,
        attention_mask2=None,
        token_type_ids2=None,
        labels=None,
    ):
        outputs = self.bert(
            input_ids, attention_mask=attention_mask
        )  # sequence_output, pooled_output, (hidden_states), (attentions)
        outputs2 = self.bert(input_ids2, attention_mask=attention_mask2)
        sequence_output = outputs[0]
        sequence_output2 = outputs2[0]
        pooled_output = outputs[0][:, 0, :]  # [CLS]
        pooled_output2 = outputs2[0][:, 0, :]

        sentence_representation = torch.cat([pooled_output, pooled_output2], dim=1)

        pooled_output = self.pooling(pooled_output)
        pooled_output2 = self.pooling(pooled_output2)

        logits1 = self.qa_classifier(pooled_output)
        logits2 = self.qa_classifier(pooled_output2)

        logits = torch.cat([logits1, logits2], dim=1)

        outputs = (logits,) + outputs[
            2:
        ]  # add hidden states and attention if they are here

        return outputs  # logits, (hidden_states), (attentions)


class XLMRoberta(XLMRobertaModel):
    def __init__(self, config, args):
        super(XLMRoberta, self).__init__(config)
        self.xlmroberta = XLMRobertaModel.from_pretrained(
            "xlm-roberta-large", config=config
        )  # Load pretrained Electra

        self.num_labels = config.num_labels

        self.pooling = PoolingHead(
            input_dim=config.hidden_size,
            inner_dim=config.hidden_size,
            pooler_dropout=0.1,
        )
        self.qa_classifier = nn.Linear(config.hidden_size, self.num_labels - 1)

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        input_ids2=None,
        attention_mask2=None,
        token_type_ids2=None,
        labels=None,
    ):
        outputs = self.xlmroberta(
            input_ids, attention_mask=attention_mask
        )  # sequence_output, pooled_output, (hidden_states), (attentions)
        outputs2 = self.xlmroberta(input_ids2, attention_mask=attention_mask2)
        sequence_output = outputs[0]
        sequence_output2 = outputs2[0]
        pooled_output = outputs[0][:, 0, :]  # [CLS]
        pooled_output2 = outputs2[0][:, 0, :]

        sentence_representation = torch.cat([pooled_output, pooled_output2], dim=1)

        pooled_output = self.pooling(pooled_output)
        pooled_output2 = self.pooling(pooled_output2)

        logits1 = self.qa_classifier(pooled_output)
        logits2 = self.qa_classifier(pooled_output2)

        logits = torch.cat([logits1, logits2], dim=1)

        outputs = (logits,) + outputs[
            2:
        ]  # add hidden states and attention if they are here

        return outputs  # logits, (hidden_states), (attentions)


class Electra_BoolQ(ElectraPreTrainedModel):
    def __init__(self, config, args):
        super(Electra_BoolQ, self).__init__(config)

        # self.num_labels = config.num_labels
        self.num_labels = config.num_labels
        self.model = ElectraModel.from_pretrained(
            "monologg/koelectra-base-v3-discriminator", config=config
        )
        self.pooling = PoolingHead(
            input_dim=config.hidden_size,
            inner_dim=config.hidden_size,
            pooler_dropout=0.1,
        )
        self.qa_classifier = nn.Linear(config.hidden_size, self.num_labels - 1)
        # self.sparse = Sparsemax()

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        input_ids2=None,
        attention_mask2=None,
        token_type_ids2=None,
        labels=None,
    ):
        outputs = self.model(
            input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids
        )  # sequence_output, pooled_output, (hidden_states), (attentions)
        outputs2 = self.model(
            input_ids2, attention_mask=attention_mask2, token_type_ids=token_type_ids2
        )
        sequence_output = outputs[0]
        sequence_output2 = outputs2[0]
        pooled_output = outputs[0][:, 0, :]  # [CLS]
        pooled_output2 = outputs2[0][:, 0, :]

        sentence_representation = torch.cat([pooled_output, pooled_output2], dim=1)

        pooled_output = self.pooling(pooled_output)
        pooled_output2 = self.pooling(pooled_output2)

        logits1 = self.qa_classifier(pooled_output)
        logits2 = self.qa_classifier(pooled_output2)

        logits = torch.cat([logits1, logits2], dim=1)

        outputs = (logits,) + outputs[
            2:
        ]  # add hidden states and attention if they are here

        return outputs  # logits, (hidden_states), (attentions)


class Roberta(RobertaModel):
    def __init__(self, config, args):
        super(Roberta, self).__init__(config)
        self.roberta = RobertaModel.from_pretrained(
            "klue/roberta-large", config=config
        )  # Load pretrained Electra

        self.num_labels = config.num_labels

        self.pooling = PoolingHead(
            input_dim=config.hidden_size,
            inner_dim=config.hidden_size,
            pooler_dropout=0.1,
        )
        self.qa_classifier = nn.Linear(config.hidden_size, self.num_labels - 1)

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        input_ids2=None,
        attention_mask2=None,
        token_type_ids2=None,
        labels=None,
    ):
        outputs = self.roberta(
            input_ids, attention_mask=attention_mask
        )  # sequence_output, pooled_output, (hidden_states), (attentions)
        outputs2 = self.roberta(input_ids2, attention_mask=attention_mask2)
        sequence_output = outputs[0]
        sequence_output2 = outputs2[0]
        pooled_output = outputs[0][:, 0, :]  # [CLS]
        pooled_output2 = outputs2[0][:, 0, :]

        sentence_representation = torch.cat([pooled_output, pooled_output2], dim=1)

        pooled_output = self.pooling(pooled_output)
        pooled_output2 = self.pooling(pooled_output2)

        logits1 = self.qa_classifier(pooled_output)
        logits2 = self.qa_classifier(pooled_output2)

        logits = torch.cat([logits1, logits2], dim=1)

        outputs = (logits,) + outputs[
            2:
        ]  # add hidden states and attention if they are here

        return outputs  # logits, (hidden_states), (attentions)


## 데이터 전처리부
---

In [14]:
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, tokenized_dataset, labels):
        self.tokenized_dataset = tokenized_dataset
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.tokenized_dataset.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
    
def load_data(dataset_dir):
    dataset = pd.read_csv(
        dataset_dir,
        delimiter="\t",
        names=["ID", "sentence", "question", "1", "2", "answer"],
        header=0,
    )
    dataset["label"] = dataset["answer"].astype(int) - 1

    new_sentence1_1 = []
    new_sentence1_2 = []
    new_sentence2_1 = []
    new_sentence2_2 = []
    for i in range(len(dataset)):
        s = dataset.iloc[i]["sentence"]
        q = dataset.iloc[i]["question"]
        s1 = dataset.iloc[i]["1"]
        s2 = dataset.iloc[i]["2"]
        lb = dataset.iloc[i]["label"]
        if q == "결과":
            new_sentence1_1.append("[결과]" + s)
            # new_sentence1_1.append(s)
            new_sentence1_2.append(s1)
            new_sentence2_1.append("[결과]" + s)
            # new_sentence2_1.append(s)
            new_sentence2_2.append(s2)

        else:
            new_sentence1_1.append("[원인]" + s1)
            # new_sentence1_1.append(s1)
            new_sentence1_2.append(s)
            new_sentence2_1.append("[원인]" + s2)
            # new_sentence2_1.append(s2)
            new_sentence2_2.append(s)

    dataset["new_sentence1_1"] = new_sentence1_1
    dataset["new_sentence1_2"] = new_sentence1_2
    dataset["new_sentence2_1"] = new_sentence2_1
    dataset["new_sentence2_2"] = new_sentence2_2

    return dataset


def tokenized_dataset(dataset, tokenizer, arch="encoder"):
    sentence1_1 = dataset["new_sentence1_1"].tolist()
    sentence1_2 = dataset["new_sentence1_2"].tolist()
    sentence2_1 = dataset["new_sentence2_1"].tolist()
    sentence2_2 = dataset["new_sentence2_2"].tolist()

    tokenized_sentences = tokenizer(
        sentence1_1,
        sentence1_2,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=150,
        add_special_tokens=True,
        return_token_type_ids=True,
    )
    tokenized_sentences2 = tokenizer(
        sentence2_1,
        sentence2_2,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=150,
        add_special_tokens=True,
        return_token_type_ids=True,
    )
    for key, value in tokenized_sentences2.items():
        tokenized_sentences[key + "2"] = value

    return tokenized_sentences


## 트레이닝

In [15]:
def check_arch(model_type):
    archs = {
        "encoder": ["Bert", "Electra", "XLMRoberta", "Electra_BoolQ", "Roberta"],
        "encoder-decoder": ["T5", "Bart", "Bart_BoolQ"],
    }
    for arch in archs:
        if model_type in archs[arch]:
            return arch
    raise ValueError(f"Model [{model_type}] no defined archtecture")


def seed_everything(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # if use multi-GPU
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    random.seed(seed)

def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group['lr']

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # calculate accuracy using sklearn's function
    acc = accuracy_score(labels, preds)
    return {
      'accuracy': acc,
    }

def increment_output_dir(output_path, exist_ok=False):
    path = Path(output_path)
    if (path.exists() and exist_ok) or (not path.exists()):
        return str(path)
    else:
        dirs = glob.glob(f"{path}*")
        matches = [re.search(rf"%s(\d+)" % path.stem, d) for d in dirs]
        i = [int(m.groups()[0]) for m in matches if m]
        n = max(i) + 1 if i else 2
        return f"{path}{n}"

def train(model_dir, args):

    seed_everything(args.seed)
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    print(f"device(GPU) : {torch.cuda.is_available()}")
    num_classes = 2

    # load model and tokenizerƒ
    MODEL_NAME = args.pretrained_model
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    # load dataset
    train_dataset = load_data(augmented_train_data)
    val_dataset = load_data(valid_data)

    train_label = train_dataset["label"].values
    val_label = val_dataset["label"].values

    # tokenizing dataset
    tokenized_train = tokenized_dataset(
        train_dataset, tokenizer, check_arch(args.model_type)
    )
    tokenized_val = tokenized_dataset(
        val_dataset, tokenizer, check_arch(args.model_type)
    )

    # make dataset for pytorch.
    train_dataset = CustomDataset(tokenized_train, train_label)
    val_dataset = CustomDataset(tokenized_val, val_label)
    # -- data_loader
    train_loader = DataLoader(
        train_dataset, batch_size=args.batch_size, shuffle=True, drop_last=True,
    )

    val_loader = DataLoader(
        val_dataset, batch_size=args.valid_batch_size, shuffle=False, drop_last=False,
    )

    # setting model hyperparameter
    if args.model_type == "Electra_BoolQ":
        config_module = ElectraConfig
    else:
        config_module = getattr(
            import_module("transformers"), args.model_type + "Config"
        )

    model_config = config_module.from_pretrained(MODEL_NAME)
    model_config.num_labels = 2

    model_module = eval(args.model_type)

    if args.model_type in ["BERT", "Electra"]:
        model = model_module.from_pretrained(
            MODEL_NAME, config=model_config, args=args
        )
    else:
        model = model_module(config=model_config, args=args)

    model.parameters
    model.to(device)
    save_dir = increment_output_dir(os.path.join(model_dir, args.name, str(args.kfold)))

    # Freeze Parameter
    for name, param in model.named_parameters():
        if ("cls_fc_layer" not in name) and (
            "label_classifier" not in name
        ):  # classifier layer
            param.requires_grad = False

    # -- loss & metric
    criterion = nn.CrossEntropyLoss()
    
    opt_module = getattr(import_module("transformers"), args.optimizer)
    optimizer = opt_module(
        model.parameters(), lr=args.lr, weight_decay=args.weight_decay, eps=1e-8
    )
    scheduler = transformers.get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=args.warmup_steps,
        num_training_steps=len(train_loader) * args.epochs,
        last_epoch=-1,
    )

    # -- logging
    start_time = time.time()
    logger = SummaryWriter(log_dir=save_dir)
    with open(os.path.join(save_dir, "config.json"), "w", encoding="utf-8") as f:
        json.dump(vars(args), f, ensure_ascii=False, indent=4)

    best_val_acc = 0
    best_val_loss = np.inf
    for epoch in range(args.epochs):
        # train loop
        # unFreeze parameters
        if epoch == args.freeze_epoch:
            for name, param in model.named_parameters():
                param.requires_grad = True
        model.train()
        loss_value = 0
        matches = 0
        for idx, items in enumerate(train_loader):
            item = {key: val.to(device) for key, val in items.items()}

            optimizer.zero_grad()
            outs = model(**item)
            loss = criterion(outs[0], item["labels"])

            preds = torch.argmax(outs[0], dim=-1)

            loss.backward()
            optimizer.step()
            scheduler.step()

            loss_value += loss.item()
            matches += (preds == item["labels"]).sum().item()
            if (idx + 1) % args.log_interval == 0:
                train_loss = loss_value / args.log_interval
                train_acc = matches / args.batch_size / args.log_interval
                current_lr = get_lr(optimizer)
                print(
                    f"Epoch[{epoch}/{args.epochs}]({idx + 1}/{len(train_loader)}) || "
                    f"training loss {train_loss:4.4} || training accuracy {train_acc:4.2%} || lr {current_lr}"
                )

                logger.add_scalar(
                    "Train/loss", train_loss, epoch * len(train_loader) + idx
                )
                logger.add_scalar(
                    "Train/accuracy", train_acc, epoch * len(train_loader) + idx
                )
                logger.add_scalar("LR", current_lr, epoch * len(train_loader) + idx)

                loss_value = 0
                matches = 0

        # val loop
        with torch.no_grad():
            print("Calculating validation results...")
            model.eval()
            val_loss_items = []
            val_acc_items = []
            acc_okay = 0
            count_all = 0
            for idx, items in enumerate(tqdm(val_loader)):
                sleep(0.01)
                item = {key: val.to(device) for key, val in items.items()}

                outs = model(**item)

                preds = torch.argmax(outs[0], dim=-1)
                loss = criterion(outs[0], item["labels"]).item()

                acc_item = (item["labels"] == preds).sum().item()

                val_loss_items.append(loss)
                val_acc_items.append(acc_item)
                acc_okay += acc_item
                count_all += len(preds)

            val_loss = np.sum(val_loss_items) / len(val_loss_items)
            val_acc = acc_okay / count_all

            if val_acc > best_val_acc:
                print(
                    f"New best model for val acc : {val_acc:4.2%}! saving the best model.."
                )
                model_to_save = model.module if hasattr(model, "module") else model
                model_to_save.save_pretrained(f"{save_dir}/best")
                torch.save(args, os.path.join(f"{save_dir}/best", "training_args.bin"))
                best_val_acc = val_acc

            if val_loss < best_val_loss:
                best_val_loss = val_loss
            print(
                f"[Val] acc : {val_acc:4.2%}, loss: {val_loss:4.4}|| "
                f"best acc : {best_val_acc:4.2%}, best loss: {best_val_loss:4.4}"
            )

            logger.add_scalar("Val/loss", val_loss, epoch)
            logger.add_scalar("Val/accuracy", val_acc, epoch)
            s = f"Time elapsed: {(time.time() - start_time)/60: .2f} min"
            print(s)
            print()
            if epoch > 24:
                model_to_save = model.module if hasattr(model, "module") else model
                model_to_save.save_pretrained(f"{save_dir}/best")
                torch.save(args, os.path.join(f"{save_dir}/best", "training_args.bin"))
                break
    return model

## Training Configuration
---
1. Roberta-large pretrained model 사용하여 fine-tune
2. 10epoch 내외로 수렴하는 것을 확인해서 15epoch만 돌림

In [16]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

args  = EasyDict(dict(
    epochs = 15,
    model_type = "Roberta",
    pretrained_model = "klue/roberta-large",
    lr = 8e-6,
    batch_size = 32,
    freeze_epoch = 0,
    valid_batch_size = 128,
    val_ratio = 0.2,
    dropout_rate = 0.1,
    criterion = 'cross_entropy',
    optimizer = 'AdamW',
    weight_decay = 0.01,
    warmup_steps = 500,
    seed = 42,
    log_interval = 20,
    kfold = 1,
    model_dir = "./copa_data_results/results",
))
    
    
    
args.name = f'TrainAll_{args.model_type}_{args.lr}'

## Training

In [17]:
model = train(args.model_dir, args)

device(GPU) : True


Some weights of the model checkpoint at klue/roberta-large were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.decoder.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at klue/roberta-large and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it f

Epoch[0/15](20/192) || training loss 0.6938 || training accuracy 50.62% || lr 3.2e-07
Epoch[0/15](40/192) || training loss 0.6936 || training accuracy 48.44% || lr 6.4e-07
Epoch[0/15](60/192) || training loss 0.6956 || training accuracy 47.34% || lr 9.6e-07
Epoch[0/15](80/192) || training loss 0.6906 || training accuracy 52.81% || lr 1.28e-06
Epoch[0/15](100/192) || training loss 0.6913 || training accuracy 54.22% || lr 1.6e-06
Epoch[0/15](120/192) || training loss 0.6918 || training accuracy 53.59% || lr 1.92e-06
Epoch[0/15](140/192) || training loss 0.6883 || training accuracy 56.41% || lr 2.24e-06
Epoch[0/15](160/192) || training loss 0.6875 || training accuracy 54.53% || lr 2.56e-06
Epoch[0/15](180/192) || training loss 0.6704 || training accuracy 61.09% || lr 2.88e-06
Calculating validation results...


  0%|          | 0/4 [00:00<?, ?it/s]

New best model for val acc : 78.80%! saving the best model..
[Val] acc : 78.80%, loss: 0.4907|| best acc : 78.80%, best loss: 0.4907
Time elapsed:  1.08 min

Epoch[1/15](20/192) || training loss 0.4782 || training accuracy 77.50% || lr 3.392e-06
Epoch[1/15](40/192) || training loss 0.381 || training accuracy 84.06% || lr 3.712e-06
Epoch[1/15](60/192) || training loss 0.4154 || training accuracy 82.19% || lr 4.032e-06
Epoch[1/15](80/192) || training loss 0.3368 || training accuracy 85.62% || lr 4.352e-06
Epoch[1/15](100/192) || training loss 0.3439 || training accuracy 85.47% || lr 4.6719999999999995e-06
Epoch[1/15](120/192) || training loss 0.3751 || training accuracy 82.97% || lr 4.992e-06
Epoch[1/15](140/192) || training loss 0.2885 || training accuracy 88.28% || lr 5.312e-06
Epoch[1/15](160/192) || training loss 0.2783 || training accuracy 86.72% || lr 5.632e-06
Epoch[1/15](180/192) || training loss 0.3064 || training accuracy 87.81% || lr 5.952e-06
Calculating validation results...

  0%|          | 0/4 [00:00<?, ?it/s]

New best model for val acc : 91.20%! saving the best model..
[Val] acc : 91.20%, loss: 0.2792|| best acc : 91.20%, best loss: 0.2792
Time elapsed:  2.26 min

Epoch[2/15](20/192) || training loss 0.1712 || training accuracy 93.28% || lr 6.464e-06
Epoch[2/15](40/192) || training loss 0.1538 || training accuracy 92.97% || lr 6.784e-06
Epoch[2/15](60/192) || training loss 0.1435 || training accuracy 94.38% || lr 7.104e-06
Epoch[2/15](80/192) || training loss 0.1839 || training accuracy 92.50% || lr 7.424e-06
Epoch[2/15](100/192) || training loss 0.1547 || training accuracy 94.38% || lr 7.743999999999999e-06
Epoch[2/15](120/192) || training loss 0.1975 || training accuracy 92.81% || lr 7.986554621848738e-06
Epoch[2/15](140/192) || training loss 0.1413 || training accuracy 95.00% || lr 7.919327731092437e-06
Epoch[2/15](160/192) || training loss 0.141 || training accuracy 95.16% || lr 7.852100840336134e-06
Epoch[2/15](180/192) || training loss 0.1112 || training accuracy 95.47% || lr 7.784873

  0%|          | 0/4 [00:00<?, ?it/s]

New best model for val acc : 91.60%! saving the best model..
[Val] acc : 91.60%, loss: 0.3272|| best acc : 91.60%, best loss: 0.2792
Time elapsed:  3.43 min

Epoch[3/15](20/192) || training loss 0.06428 || training accuracy 97.34% || lr 7.677310924369748e-06
Epoch[3/15](40/192) || training loss 0.066 || training accuracy 97.97% || lr 7.610084033613444e-06
Epoch[3/15](60/192) || training loss 0.06924 || training accuracy 97.97% || lr 7.542857142857142e-06
Epoch[3/15](80/192) || training loss 0.05448 || training accuracy 98.28% || lr 7.47563025210084e-06
Epoch[3/15](100/192) || training loss 0.05794 || training accuracy 97.97% || lr 7.408403361344538e-06
Epoch[3/15](120/192) || training loss 0.0687 || training accuracy 97.66% || lr 7.341176470588234e-06
Epoch[3/15](140/192) || training loss 0.04584 || training accuracy 99.06% || lr 7.273949579831932e-06
Epoch[3/15](160/192) || training loss 0.03607 || training accuracy 99.06% || lr 7.20672268907563e-06
Epoch[3/15](180/192) || training lo

  0%|          | 0/4 [00:00<?, ?it/s]

[Val] acc : 90.60%, loss: 0.4094|| best acc : 91.60%, best loss: 0.2792
Time elapsed:  4.46 min

Epoch[4/15](20/192) || training loss 0.04117 || training accuracy 98.59% || lr 7.031932773109243e-06
Epoch[4/15](40/192) || training loss 0.03569 || training accuracy 98.75% || lr 6.964705882352941e-06
Epoch[4/15](60/192) || training loss 0.02685 || training accuracy 99.06% || lr 6.897478991596638e-06
Epoch[4/15](80/192) || training loss 0.03934 || training accuracy 98.12% || lr 6.830252100840335e-06
Epoch[4/15](100/192) || training loss 0.02512 || training accuracy 99.22% || lr 6.763025210084033e-06
Epoch[4/15](120/192) || training loss 0.02224 || training accuracy 99.38% || lr 6.695798319327731e-06
Epoch[4/15](140/192) || training loss 0.02148 || training accuracy 99.69% || lr 6.628571428571428e-06
Epoch[4/15](160/192) || training loss 0.02032 || training accuracy 99.06% || lr 6.5613445378151255e-06
Epoch[4/15](180/192) || training loss 0.02397 || training accuracy 99.38% || lr 6.49411764

  0%|          | 0/4 [00:00<?, ?it/s]

New best model for val acc : 92.00%! saving the best model..
[Val] acc : 92.00%, loss: 0.3837|| best acc : 92.00%, best loss: 0.2792
Time elapsed:  5.64 min

Epoch[5/15](20/192) || training loss 0.02074 || training accuracy 99.06% || lr 6.386554621848739e-06
Epoch[5/15](40/192) || training loss 0.01713 || training accuracy 99.53% || lr 6.319327731092436e-06
Epoch[5/15](60/192) || training loss 0.02199 || training accuracy 99.22% || lr 6.252100840336134e-06
Epoch[5/15](80/192) || training loss 0.02969 || training accuracy 98.75% || lr 6.184873949579832e-06
Epoch[5/15](100/192) || training loss 0.01875 || training accuracy 99.69% || lr 6.1176470588235285e-06
Epoch[5/15](120/192) || training loss 0.0256 || training accuracy 98.91% || lr 6.0504201680672265e-06
Epoch[5/15](140/192) || training loss 0.02173 || training accuracy 99.69% || lr 5.9831932773109244e-06
Epoch[5/15](160/192) || training loss 0.01657 || training accuracy 99.53% || lr 5.9159663865546215e-06
Epoch[5/15](180/192) || tra

  0%|          | 0/4 [00:00<?, ?it/s]

[Val] acc : 91.20%, loss: 0.4203|| best acc : 92.00%, best loss: 0.2792
Time elapsed:  6.66 min

Epoch[6/15](20/192) || training loss 0.01329 || training accuracy 99.69% || lr 5.741176470588235e-06
Epoch[6/15](40/192) || training loss 0.02177 || training accuracy 99.69% || lr 5.6739495798319324e-06
Epoch[6/15](60/192) || training loss 0.01304 || training accuracy 99.69% || lr 5.6067226890756295e-06
Epoch[6/15](80/192) || training loss 0.009519 || training accuracy 99.84% || lr 5.5394957983193275e-06
Epoch[6/15](100/192) || training loss 0.01774 || training accuracy 99.53% || lr 5.472268907563025e-06
Epoch[6/15](120/192) || training loss 0.01284 || training accuracy 99.69% || lr 5.4050420168067225e-06
Epoch[6/15](140/192) || training loss 0.008861 || training accuracy 99.69% || lr 5.33781512605042e-06
Epoch[6/15](160/192) || training loss 0.009139 || training accuracy 99.84% || lr 5.2705882352941176e-06
Epoch[6/15](180/192) || training loss 0.01358 || training accuracy 99.53% || lr 5.20

  0%|          | 0/4 [00:00<?, ?it/s]

[Val] acc : 91.80%, loss: 0.4458|| best acc : 92.00%, best loss: 0.2792
Time elapsed:  7.69 min

Epoch[7/15](20/192) || training loss 0.01839 || training accuracy 99.22% || lr 5.0957983193277305e-06
Epoch[7/15](40/192) || training loss 0.01006 || training accuracy 99.69% || lr 5.0285714285714285e-06
Epoch[7/15](60/192) || training loss 0.009921 || training accuracy 99.84% || lr 4.9613445378151256e-06
Epoch[7/15](80/192) || training loss 0.006946 || training accuracy 99.84% || lr 4.8941176470588235e-06
Epoch[7/15](100/192) || training loss 0.01625 || training accuracy 99.53% || lr 4.826890756302521e-06
Epoch[7/15](120/192) || training loss 0.009226 || training accuracy 99.69% || lr 4.7596638655462185e-06
Epoch[7/15](140/192) || training loss 0.01174 || training accuracy 99.69% || lr 4.692436974789916e-06
Epoch[7/15](160/192) || training loss 0.005835 || training accuracy 100.00% || lr 4.625210084033614e-06
Epoch[7/15](180/192) || training loss 0.007438 || training accuracy 99.84% || lr 

  0%|          | 0/4 [00:00<?, ?it/s]

[Val] acc : 91.80%, loss: 0.4472|| best acc : 92.00%, best loss: 0.2792
Time elapsed:  8.72 min

Epoch[8/15](20/192) || training loss 0.006374 || training accuracy 100.00% || lr 4.4504201680672266e-06
Epoch[8/15](40/192) || training loss 0.006707 || training accuracy 99.84% || lr 4.3831932773109245e-06
Epoch[8/15](60/192) || training loss 0.006601 || training accuracy 99.84% || lr 4.315966386554622e-06
Epoch[8/15](80/192) || training loss 0.007672 || training accuracy 99.84% || lr 4.2487394957983195e-06
Epoch[8/15](100/192) || training loss 0.005372 || training accuracy 99.84% || lr 4.181512605042017e-06
Epoch[8/15](120/192) || training loss 0.008952 || training accuracy 99.69% || lr 4.114285714285714e-06
Epoch[8/15](140/192) || training loss 0.006693 || training accuracy 99.69% || lr 4.047058823529412e-06
Epoch[8/15](160/192) || training loss 0.004281 || training accuracy 100.00% || lr 3.979831932773109e-06
Epoch[8/15](180/192) || training loss 0.003682 || training accuracy 100.00% ||

  0%|          | 0/4 [00:00<?, ?it/s]

[Val] acc : 92.00%, loss: 0.483|| best acc : 92.00%, best loss: 0.2792
Time elapsed:  9.74 min

Epoch[9/15](20/192) || training loss 0.002338 || training accuracy 100.00% || lr 3.805042016806722e-06
Epoch[9/15](40/192) || training loss 0.003144 || training accuracy 100.00% || lr 3.73781512605042e-06
Epoch[9/15](60/192) || training loss 0.001843 || training accuracy 100.00% || lr 3.670588235294117e-06
Epoch[9/15](80/192) || training loss 0.009822 || training accuracy 99.53% || lr 3.603361344537815e-06
Epoch[9/15](100/192) || training loss 0.007631 || training accuracy 99.69% || lr 3.5361344537815122e-06
Epoch[9/15](120/192) || training loss 0.00453 || training accuracy 100.00% || lr 3.46890756302521e-06
Epoch[9/15](140/192) || training loss 0.002018 || training accuracy 100.00% || lr 3.4016806722689073e-06
Epoch[9/15](160/192) || training loss 0.005038 || training accuracy 100.00% || lr 3.3344537815126052e-06
Epoch[9/15](180/192) || training loss 0.01228 || training accuracy 99.69% || l

  0%|          | 0/4 [00:00<?, ?it/s]

[Val] acc : 90.40%, loss: 0.4537|| best acc : 92.00%, best loss: 0.2792
Time elapsed:  10.77 min

Epoch[10/15](20/192) || training loss 0.001722 || training accuracy 100.00% || lr 3.159663865546218e-06
Epoch[10/15](40/192) || training loss 0.004434 || training accuracy 100.00% || lr 3.092436974789916e-06
Epoch[10/15](60/192) || training loss 0.004272 || training accuracy 99.84% || lr 3.0252100840336132e-06
Epoch[10/15](80/192) || training loss 0.002358 || training accuracy 100.00% || lr 2.9579831932773108e-06
Epoch[10/15](100/192) || training loss 0.001988 || training accuracy 100.00% || lr 2.8907563025210083e-06
Epoch[10/15](120/192) || training loss 0.005524 || training accuracy 99.84% || lr 2.823529411764706e-06
Epoch[10/15](140/192) || training loss 0.001401 || training accuracy 100.00% || lr 2.7563025210084033e-06
Epoch[10/15](160/192) || training loss 0.005311 || training accuracy 99.84% || lr 2.689075630252101e-06
Epoch[10/15](180/192) || training loss 0.002814 || training accur

  0%|          | 0/4 [00:00<?, ?it/s]

[Val] acc : 90.80%, loss: 0.4928|| best acc : 92.00%, best loss: 0.2792
Time elapsed:  11.79 min

Epoch[11/15](20/192) || training loss 0.006685 || training accuracy 99.53% || lr 2.5142857142857142e-06
Epoch[11/15](40/192) || training loss 0.001531 || training accuracy 100.00% || lr 2.4470588235294118e-06
Epoch[11/15](60/192) || training loss 0.002856 || training accuracy 99.84% || lr 2.3798319327731093e-06
Epoch[11/15](80/192) || training loss 0.004037 || training accuracy 99.84% || lr 2.312605042016807e-06
Epoch[11/15](100/192) || training loss 0.003251 || training accuracy 100.00% || lr 2.2453781512605043e-06
Epoch[11/15](120/192) || training loss 0.007113 || training accuracy 99.84% || lr 2.1781512605042014e-06
Epoch[11/15](140/192) || training loss 0.001861 || training accuracy 100.00% || lr 2.110924369747899e-06
Epoch[11/15](160/192) || training loss 0.001811 || training accuracy 100.00% || lr 2.0436974789915965e-06
Epoch[11/15](180/192) || training loss 0.001926 || training accu

  0%|          | 0/4 [00:00<?, ?it/s]

[Val] acc : 90.80%, loss: 0.4789|| best acc : 92.00%, best loss: 0.2792
Time elapsed:  12.82 min

Epoch[12/15](20/192) || training loss 0.002779 || training accuracy 99.84% || lr 1.86890756302521e-06
Epoch[12/15](40/192) || training loss 0.00307 || training accuracy 100.00% || lr 1.8016806722689076e-06
Epoch[12/15](60/192) || training loss 0.003593 || training accuracy 99.84% || lr 1.734453781512605e-06
Epoch[12/15](80/192) || training loss 0.003506 || training accuracy 100.00% || lr 1.6672268907563026e-06
Epoch[12/15](100/192) || training loss 0.00425 || training accuracy 100.00% || lr 1.6e-06
Epoch[12/15](120/192) || training loss 0.001967 || training accuracy 100.00% || lr 1.5327731092436974e-06
Epoch[12/15](140/192) || training loss 0.004414 || training accuracy 99.84% || lr 1.4655462184873948e-06
Epoch[12/15](160/192) || training loss 0.002093 || training accuracy 100.00% || lr 1.3983193277310923e-06
Epoch[12/15](180/192) || training loss 0.005751 || training accuracy 99.84% || lr

  0%|          | 0/4 [00:00<?, ?it/s]

[Val] acc : 91.40%, loss: 0.479|| best acc : 92.00%, best loss: 0.2792
Time elapsed:  13.85 min

Epoch[13/15](20/192) || training loss 0.001776 || training accuracy 100.00% || lr 1.2235294117647059e-06
Epoch[13/15](40/192) || training loss 0.001028 || training accuracy 100.00% || lr 1.1563025210084034e-06
Epoch[13/15](60/192) || training loss 0.002657 || training accuracy 100.00% || lr 1.0890756302521007e-06
Epoch[13/15](80/192) || training loss 0.0009199 || training accuracy 100.00% || lr 1.0218487394957982e-06
Epoch[13/15](100/192) || training loss 0.004904 || training accuracy 99.84% || lr 9.546218487394957e-07
Epoch[13/15](120/192) || training loss 0.001452 || training accuracy 100.00% || lr 8.873949579831932e-07
Epoch[13/15](140/192) || training loss 0.002073 || training accuracy 100.00% || lr 8.201680672268907e-07
Epoch[13/15](160/192) || training loss 0.002645 || training accuracy 100.00% || lr 7.529411764705882e-07
Epoch[13/15](180/192) || training loss 0.00177 || training accu

  0%|          | 0/4 [00:00<?, ?it/s]

[Val] acc : 91.00%, loss: 0.4957|| best acc : 92.00%, best loss: 0.2792
Time elapsed:  14.88 min

Epoch[14/15](20/192) || training loss 0.0006829 || training accuracy 100.00% || lr 5.781512605042017e-07
Epoch[14/15](40/192) || training loss 0.005888 || training accuracy 99.84% || lr 5.109243697478991e-07
Epoch[14/15](60/192) || training loss 0.004824 || training accuracy 99.69% || lr 4.436974789915966e-07
Epoch[14/15](80/192) || training loss 0.001067 || training accuracy 100.00% || lr 3.764705882352941e-07
Epoch[14/15](100/192) || training loss 0.003102 || training accuracy 100.00% || lr 3.0924369747899157e-07
Epoch[14/15](120/192) || training loss 0.001379 || training accuracy 100.00% || lr 2.4201680672268904e-07
Epoch[14/15](140/192) || training loss 0.001097 || training accuracy 100.00% || lr 1.7478991596638653e-07
Epoch[14/15](160/192) || training loss 0.002399 || training accuracy 100.00% || lr 1.0756302521008403e-07
Epoch[14/15](180/192) || training loss 0.0005987 || training ac

  0%|          | 0/4 [00:00<?, ?it/s]

[Val] acc : 91.00%, loss: 0.4968|| best acc : 92.00%, best loss: 0.2792
Time elapsed:  15.91 min



## Inference : 
---
- target_dir(Best Val Accuracy model) 를 상황에 맞게 수정해야 함.

In [18]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
target_dir = "copa_data_results/results/TrainAll_Roberta_8e-06/1/best"
model_module = eval(args.model_type)
model = model_module.from_pretrained(target_dir, args=args)
model.parameters
model.to(device)
model.eval()
""

Some weights of the model checkpoint at klue/roberta-large were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.decoder.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at klue/roberta-large and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it f

''

In [19]:
tokenizer = AutoTokenizer.from_pretrained(args.pretrained_model)

dataset = load_data(valid_data)
test_label = dataset["label"].values

tokenized_test = tokenized_dataset(dataset, tokenizer, check_arch(args.model_type))
test_dataset = CustomDataset(tokenized_test, test_label)

In [20]:
def inference(model, tokenized_sent, device):
    dataloader = DataLoader(tokenized_sent, batch_size=8, shuffle=False)
    model.eval()
    results = []
    preds = []

    for i, items in enumerate(tqdm(dataloader)):
        item = {key: val.to(device) for key, val in items.items()}
        with torch.no_grad():
            outputs = model(**item)
        logits = outputs[0]
        m = nn.Softmax(dim=1)
        logits = m(logits)
        logits = logits.detach().cpu().numpy()  # (Batch_size, 5)  5개의 클래스 확률형태
        pred = logits[:, 1]
        result = np.argmax(logits, axis=-1)
        results += result.tolist()
        preds += pred.tolist()

    return np.array(results).flatten(), np.array(preds).flatten()

In [21]:
pred_answer, preds = inference(model, tokenized_sent=test_dataset, device=device)

  0%|          | 0/63 [00:00<?, ?it/s]

In [22]:
# make json
submission_json = {"copa": []}
for i, pred in enumerate(pred_answer.tolist()):
    submission_json["copa"].append({"idx": i, "label": int(pred + 1)})
with open("submission.json", "w") as fp:
    json.dump(submission_json, fp)

In [23]:
dataset["model_answer"] = pred_answer
dataset["model_pred"] = preds
dataset.to_csv("copa_result.csv", index=False, encoding="utf-8-sig")

# 여기까지 COPA (인과추론)

# CoLA (문법성 판단)

# Data

## imports

In [1]:
import pickle
import pandas as pd
import torch
from pathlib import Path
import multiprocessing
import time
import numpy as np
from tqdm import tqdm
from typing import List
import random
from functools import partial
from itertools import repeat
import re

from Korpora import Korpora

## fetch additional data
추가 학습을 위해 KorSTS와 KorNLI 데이터셋에서 텍스트만 가져온다.

In [2]:
# 추가적으로 사용할 추가 데이터 가져오기
def create_additional_text(name):
    path = Path('.')
    root_dir = path / 'dataset'
    Korpora.fetch(name, root_dir = root_dir)
    corpus = Korpora.load(name)
    additional_text = corpus.get_all_texts()

    temp = []
    for i, text in enumerate(tqdm(additional_text)):
        temp.append(text+'\n')

    additional_text = set(temp)
    del temp
    additional_text = list(additional_text)
    
    new_file = root_dir / 'additional.txt'
    with open(new_file, 'a+', encoding='utf-8') as writer:
        writer.writelines(additional_text)

In [3]:
#create_additional_text("korsts")
#create_additional_text("kornli")

## Preprocessing
1. 추가 데이터셋에서 일부러 조사가 틀린 텍스트를 만들어서 문법이 틀린 데이터도 증강

In [4]:
from pyjosa import josa, jonsung

In [5]:
class SabotageSentence(object):
    def __init__(self, sentence: str):
        self.sentence = sentence
        self.josa_dict = {
            'for_jongsung':['을','은','이','과'], 
            'no_jongsung':['를','는','가','와','나','로','야','랑','며']
        }


    @property
    def get_all_josa(self):
        return self.josa_dict


    def jongsung_wrong_josa(self) -> str:
        new_sent = ''
        for _, word in enumerate(self.sentence.split()):
            tmp_word = word[:-1]
            if word[-1] in self.josa_dict['for_jongsung']:
                tmp_word+=random.choice(self.josa_dict['no_jongsung'])
                new_sent+=tmp_word
                new_sent+=' '
            elif word[-1] in self.josa_dict['no_jongsung']:
                tmp_word+=random.choice(self.josa_dict['for_jongsung'])
                new_sent+=tmp_word
                new_sent+=' '
            else:
                new_sent+=word
                new_sent+=' '

        return new_sent
    


2. 기존 데이터셋을 pandas DataFrame으로 변환

In [6]:
def load_data(filename):
    data_dir = Path("./dataset/cola/") / filename
    dataset = pd.read_csv(
        data_dir, 
        sep="\t", 
        header=0, 
        encoding='utf-8', 
        names=['source', 'acceptability_label', 'source_annotation', 'sentence']
    )
    dataset['label'] = dataset['acceptability_label'].astype(int)

    return dataset


def augment_data_orig(new_data: List[str]):
    tmp_data_holder = {'source':[], 'label':[], 'source_annotation':[], 'sentence':[]}
    for i, row in enumerate(new_data):
        if (re.match('[a-zA-Z]', row) is not None) or (len(row) >= 70) or (len(row) == 0) or (row[-2:]!='.\n'):
            continue
        else:
            tmp_data_holder['source'].append('T'+str(10001+i))
            tmp_data_holder['label'].append(1)
            tmp_data_holder['source_annotation'].append('*')
            assert type(row) == str
            tmp_data_holder['sentence'].append(row.replace('\n',''))

    dataset = pd.DataFrame(tmp_data_holder)
    return dataset


def augment_data(data):
    tmp_data_holder = {'source':[], 'label':[], 'source_annotation':[], 'sentence':[]}
    for _, row in data.iterrows():
        tmp_data_holder['source'].append(row['source'])
        tmp_data_holder['label'].append(0)
        tmp_data_holder['source_annotation'].append(np.NaN)
        
        text = SabotageSentence(row['sentence']).jongsung_wrong_josa()
        tmp_data_holder['sentence'].append(text)

    dataset = pd.DataFrame(tmp_data_holder)
    return dataset

    
def read_txt(path='./dataset/additional.txt') -> List[str]:
    with open(path, 'r+', encoding='utf-8') as reader:
        new_data = reader.readlines()
    
    tmp_list = []
    for text in new_data:
        text.rstrip('\n')
        text.replace('\n','')
        tmp_list.append(text)
    new_data = tmp_list

    return new_data

def multiprocess_aug(orig_dataset, func_name):
    num_process = multiprocessing.cpu_count()
    
    chunk_size = int(orig_dataset.shape[0] / num_process)
    chunks = [orig_dataset.iloc[orig_dataset.index[i:i+chunk_size]] for i in range(0, orig_dataset.shape[0], chunk_size)]
    assert len(chunks) != 0

    with multiprocessing.Pool(processes=num_process) as pool:
        results = pool.map(func_name, chunks)
        
        new_dataset = pd.concat(results)
        dataset = pd.concat([orig_dataset, new_dataset])

    return dataset


def tokenize_datasets(dataset, tokenizer, arch="encoder"):
    sentence = dataset['sentence'].tolist()
    tokenize_sent = tokenizer(
        sentence,
        return_tensors="pt",
        padding = True,
        truncation = True,
        max_length = 200,
        add_special_tokens=True,
        return_token_type_ids = True
    )

    return tokenize_sent

## Custom Dataset Class

In [7]:
class ColaDataset(torch.utils.data.Dataset):
    def __init__(self, tokenized_dataset, labels= None, test=False):
        self.tokenized_dataset = tokenized_dataset
        self.labels = labels
    
    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.tokenized_dataset.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    
    def __len__(self):
        return len(self.labels)

# Model

In [8]:
import torch.nn as nn
from transformers import ElectraModel, ElectraPreTrainedModel

In [9]:
class Electra(ElectraPreTrainedModel):
    def __init__(self, config):
        super(Electra, self).__init__(config)
        self.electra = ElectraModel(config)
        self.num_labels = config.num_labels
        self.linear = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(p=0.1)
        self.classifier = nn.Linear(config.hidden_size, self.num_labels)
        
    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, labels=None):
        outputs = self.electra(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        output = outputs[0][:, 0, :]
        output = self.linear(self.dropout(output))
        output = torch.tanh(output)
        logits = self.classifier(output)
        outputs = (logits,) + outputs[2:]

        return outputs


# Loss

In [10]:
class CrossEntropy(nn.Module):
    def __init__(self):
        super(CrossEntropy, self).__init__()
        self.CE = nn.CrossEntropyLoss()
        

    def forward(self, inputs, target):
        """
        :param inputs: predictions
        :param target: target labels
        :return: loss
        """
        loss = self.CE(inputs, target)
        return loss

_criterion_entrypoints = {
    'cross_entropy': CrossEntropy,
}

def criterion_entrypoint(criterion_name):
    return _criterion_entrypoints[criterion_name]

def is_criterion(criterion_name):
    return criterion_name in _criterion_entrypoints

def create_criterion(criterion_name, **kwargs):
    if is_criterion(criterion_name):
        create_fn = criterion_entrypoint(criterion_name)
        criterion = create_fn(**kwargs)
    else:
        raise RuntimeError('Unknown loss (%s)' % criterion_name)
    return criterion

# Utility function(s)

In [11]:
def check_arch(model_type):
  archs = {
    "encoder" : ["Bert", "Electra", "XLMRoberta", "Electra_BoolQ", "Roberta"],
    "encoder-decoder" : ["T5", "Bart", "Bart_BoolQ"]
  }
  for arch in archs:
    if model_type in archs[arch]:
      return arch
  raise ValueError(f"Model [{model_type}] no defined archtecture")

# Training setup

In [12]:
import os
import argparse
from importlib import import_module
import glob
import re
from collections import defaultdict
import time
from time import sleep

from sklearn.metrics import accuracy_score, classification_report

import transformers
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

from torch.utils.tensorboard import SummaryWriter
from torch.utils.data import DataLoader


## Set seed

In [13]:
def set_seed(seed):
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    random.seed(seed)
    

## Training utilities

In [14]:
def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group['lr']

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc}

def output_dir(output_path, exist_ok = False):
    path = Path(output_path)
    if (path.exists() and exist_ok) or (not path.exists()):
        return str(path)
    else:
        dirs = glob.glob(f"{path}*")
        matches = [re.search(rf"%s(\d+)" %path.stem, d) for d in dirs]
        i = [int(m.groups()[0]) for m in matches if m]
        n = max(i) + 1 if i else 2
        return f"{path}{n}"

# Training

In [15]:
def train(args):
    model_dir = args.model_dir
    set_seed(args.seed)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # tokenizer
    MODEL_NAME = args.pretrained_model
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    # load dataset
    datasets_ = load_data("./NIKL_CoLA_train.tsv")

    # 아래 코드는 데이터를 증강하는 코드지만 MCC에 도움이 되지 않는 관계로 주석처리함
    # new_data = read_txt()
    # new_data = augment_data_orig(new_data)
    # new_data_corrupt = multiprocess_aug(new_data, augment_data)
    # datasets_ = pd.concat([datasets_, new_data, new_data_corrupt], ignore_index=True)

    # make validation sets from training set
    labels_ = datasets_["label"]
    length = len(labels_)
    kf = args.kfold
    class_indexs = defaultdict(list)
    for i, label_ in enumerate(labels_):
        class_indexs[np.argmax(label_)].append(i)
    val_indices = set()
    for index in class_indexs: 
        val_indices = (val_indices | set(class_indexs[index][int(
            len(class_indexs[index])*(kf-1)/9): int(len(class_indexs[index])*kf/9)]))
    train_indices = set(range(length)) - val_indices

    train_dataset = datasets_.loc[np.array(list(train_indices))]
    val_dataset = datasets_.loc[np.array(list(val_indices))]

    train_label = train_dataset['label'].values
    val_label = val_dataset['label'].values

    tokenized_train = tokenize_datasets(
        train_dataset, tokenizer, check_arch(args.model_type))
    tokenized_val = tokenize_datasets(
        val_dataset, tokenizer, check_arch(args.model_type))

    train_dataset = ColaDataset(tokenized_train, train_label)
    val_dataset = ColaDataset(tokenized_val, val_label)

    train_loader = DataLoader(
        train_dataset,
        batch_size=args.batch_size,
        shuffle=True,
        drop_last=True,
    )

    val_loader = DataLoader(
        val_dataset,
        batch_size=args.valid_batch_size,
        shuffle=False,
        drop_last=False,
    )

    config_module = getattr(import_module(
        "transformers"), args.model_type + "Config")

    model_config = config_module.from_pretrained(MODEL_NAME)
    model_config.num_labels = 2

    model = Electra.from_pretrained(MODEL_NAME, config=model_config)
    model = nn.DataParallel(model)

    model.parameters
    model.to(device)

    save_dir = output_dir(os.path.join(model_dir, args.name, str(args.kfold)))

    for name, param in model.named_parameters():
        if ('cls_fc_layer' not in name) and ('label_classifier' not in name):  # classifier layer
            param.requires_grad = False

    criterion = create_criterion(args.criterion)  # default: cross_entropy
    opt_module = getattr(import_module("transformers"), args.optimizer)
    optimizer = opt_module(
        model.parameters(),
        lr=args.lr,
        weight_decay=args.weight_decay,
        eps=1e-8
    )
    scheduler = transformers.get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=args.warmup_steps,
        num_training_steps=len(train_loader) * args.epochs,
        last_epoch=- 1
    )

    # logging
    best_val_mcc = -1
    best_val_loss = np.inf
    for epoch in range(args.epochs):
        pbar = tqdm(train_loader, dynamic_ncols=True)
        if epoch == args.freeze_epoch:
            for name, param in model.named_parameters():
                param.requires_grad = True

        model.train()

        loss_value = 0
        matches = 0
        for idx, items in enumerate(pbar):
            item = {key: val.to(device) for key, val in items.items()}

            optimizer.zero_grad()
            outs = model(**item)
            loss = criterion(outs[0], item['labels'])

            preds = torch.argmax(outs[0], dim=-1)

            loss.backward()
            optimizer.step()
            scheduler.step()

            loss_value += loss.item()
            matches += (preds == item['labels']).sum().item()
            if (idx + 1) % args.log_interval == 0:
                train_loss = loss_value / args.log_interval
                train_acc = matches / args.batch_size / args.log_interval
                current_lr = get_lr(optimizer)
                pbar.set_description(
                    f"Epoch: [{epoch}/{args.epochs}]({idx + 1}/{len(train_loader)}) || loss: {train_loss:4.4} || acc: {train_acc:4.2%} || lr {current_lr:4.4}")

                loss_value = 0
                matches = 0

    # validation
    with torch.no_grad():
        pbar = tqdm(val_loader, dynamic_ncols=True)
        print("Calculating validation results...")
        model.eval()
        val_loss_items = []
        val_acc_items = []
        acc_okay = 0
        count_all = 0
        TP = 0
        FP = 0
        TN = 0
        FN = 0
        eps = 1e-9
        for idx, items in enumerate(pbar):
            sleep(0.01)
            item = {key: val.to(device) for key, val in items.items()}

            outs = model(**item)

            preds = torch.argmax(outs[0], dim=-1)
            loss = criterion(outs[0], item['labels']).item()

            acc_item = (item['labels'] == preds).sum().item()

            TRUE = (item['labels'] == preds)
            FALSE = (item['labels'] != preds)

            TP += (TRUE * preds).sum().item()
            TN += (TRUE * (preds == 0)).sum().item()
            FP += (FALSE * preds).sum().item()
            FN += (FALSE * (preds == 0)).sum().item()

            val_loss_items.append(loss)
            val_acc_items.append(acc_item)
            acc_okay += acc_item
            count_all += len(preds)

            # Calculate MCC
            MCC = ((TP*TN) - (FP*FN)) / \
                (((TP+FP+eps)*(TP+FN+eps)*(TN+FP+eps)*(TN+FN+eps))**0.5)

            pbar.set_description(
                f"Epoch: [{epoch}/{args.epochs}]({idx + 1}/{len(val_loader)}) || val_loss: {loss:4.4} || acc: {acc_okay/count_all:4.2%} || MCC: {MCC:4.2%}")

        val_loss = np.sum(val_loss_items) / len(val_loss_items)
        val_acc = acc_okay / count_all

        if MCC > best_val_mcc:
            print(
                f"New best model for val mcc : {MCC:4.2%}! saving the best model..")
            model_to_save = model.module if hasattr(model, "module") else model
            model_to_save.save_pretrained(f"{save_dir}/best")
            torch.save(args, os.path.join(
                f"{save_dir}/best", "training_args.bin"))
            best_val_mcc = MCC

        if val_loss < best_val_loss:
            best_val_loss = val_loss
        print(
            f"[Val] acc : {val_acc:4.2%}, loss: {val_loss:4.4}|| "
            f"best mcc : {best_val_mcc:4.2%}, best loss: {best_val_loss:4.4}|| "
            f"MCC : {MCC:4.2%}|| "
            f"TP:{TP} / TN:{TN} / FP:{FP} / FN:{FN}"
        )

    time.sleep(5)
    torch.cuda.empty_cache()


## Training arguments

In [16]:
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]= "0,1,2,3"
os.environ["TOKENIZERS_PARALLELISM"] = "false"
args = argparse.Namespace(
    seed = 42,
    epochs = 30,
    freeze_epoch=0,
    optimizer = 'AdamW',
    weight_decay = 0.01,
    warmup_steps = 500,
    log_interval = 20,
    kfold = 9,

    criterion = 'cross_entropy',
    dropout_rate = 0.1,
    model_type = "Electra",
    pretrained_model = "tunib/electra-ko-base",
    lr = 4e-6,
    batch_size = 32,
    valid_batch_size = 128,

    val_ratio=0.2,
    name = 'exp',
    model_dir = os.environ.get('SM_MODEL_DIR', './results'),
    custompretrain = ""
)

args.name = f'{args.model_type}_{args.lr}_{args.kfold}'

## Training Results

In [17]:
print('='*40)
print(f"k-fold num : {args.kfold}")
print('='*40)

train(args)

k-fold num : 9


Some weights of the model checkpoint at tunib/electra-ko-base were not used when initializing Electra: ['discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight']
- This IS expected if you are initializing Electra from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Electra from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Electra were not initialized from the model checkpoint at tunib/electra-ko-base and are newly initialized: ['classifier.bias', 'linear.bias', 'classifier.weight', 'linear.weight']
You should probably TRAIN this model on a down-stream task to be 

Calculating validation results...


Epoch: [29/30](14/14) || val_loss: 0.7995 || acc: 80.67% || MCC: 61.83%: 100%|██████████| 14/14 [00:02<00:00,  6.45it/s]


New best model for val mcc : 61.83%! saving the best model..
[Val] acc : 80.67%, loss: 0.8838|| best mcc : 61.83%, best loss: 0.8838|| MCC : 61.83%|| TP:779 / TN:644 / FP:229 / FN:112


# Evaluation

In [31]:
def evaluate(args):
    tokenizer = AutoTokenizer.from_pretrained(args.pretrained_model)

    file = 'NIKL_CoLA_dev.tsv'
    dataset = load_data(file)
    tokenized_test = tokenize_datasets(dataset, tokenizer)
    test_label = dataset['label'].values
    test_dataset = ColaDataset(tokenized_test, test_label)
    
    test_loader = DataLoader(
        test_dataset,
        batch_size=args.test_batch_size,
        shuffle=False
    )

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
    model = Electra.from_pretrained(args.model_dir) 
    model.parameters
    model.to(device)
    model.eval()

    pbar = tqdm(test_loader)
    print("Calculating validation results...")
    test_acc_items = []
    acc_okay = 0
    count_all = 0
    TP = 0
    FP = 0
    TN = 0
    FN = 0
    eps = 1e-9

    for idx, items in enumerate(pbar):
        sleep(0.01)

        item = {key: val.to(device) for key, val in items.items()}
        with torch.no_grad():
            outs = model(**item)

        preds = torch.argmax(outs[0], dim=-1)
        labels = item['labels']

        acc_item = (labels == preds).sum().item()

        TRUE = (labels == preds)
        FALSE = (labels != preds)

        TP += (TRUE * preds).sum().item()
        TN += (TRUE * (preds==0)).sum().item()
        FP += (FALSE * preds).sum().item()
        FN += (FALSE * (preds==0)).sum().item()

        MCC = ((TP*TN) - (FP*FN)) / (((TP+FP+eps)*(TP+FN+eps)*(TN+FP+eps)*(TN+FN+eps))**0.5)

        test_acc_items.append(acc_item)
        acc_okay += acc_item
        count_all += len(preds)

        pbar.set_description(f"({idx + 1}/{len(test_loader)}) || acc: {acc_okay/count_all:4.2%} || MCC: {MCC:4.2%}")

    test_acc = acc_okay / count_all

    print(
        f"[Test] acc : {test_acc:4.2%}|| "
        f"MCC : {MCC:4.2%}|| "
        f"TP:{TP} / TN:{TN} / FP:{FP} / FN:{FN}\n"
        f"======================================\n"
        f"Test MCC: {MCC:4.2%}"
    )
    time.sleep(5)
    torch.cuda.empty_cache()

## Evaluation arguments

In [32]:
#eval args
args = argparse.Namespace(
    model_type = "Electra",
    pretrained_model = "tunib/electra-ko-base",

    model_dir = './results/Electra_4e-06_9/97/best',
    criterion = 'cross_entropy',
    num_labels=2,

    test_batch_size=8
)

# Inference

In [33]:
evaluate(args)

(9/254) || acc: 76.39% || MCC: 53.29%:   2%|▏         | 5/254 [00:00<00:05, 49.35it/s]

Calculating validation results...


(254/254) || acc: 75.84% || MCC: 51.64%: 100%|██████████| 254/254 [00:05<00:00, 48.49it/s]


[Test] acc : 75.84%|| MCC : 51.64%|| TP:892 / TN:649 / FP:312 / FN:179
Test MCC: 51.64%


# 동형이의어 task 시작

## model

- pretrained model : klue/Roberta-large
- R-BERT : pretrained model의 [CLS], entity1, entity2 부분의 embedding 값을 concat하여 최종 분류

In [1]:
import torch
import torch.nn as nn
from transformers import AutoModel, RobertaPreTrainedModel

class FCLayer(nn.Module):
    def __init__(self, input_dim, output_dim, dropout_rate=0.0, use_activation=True):
        super(FCLayer, self).__init__()
        self.use_activation = use_activation
        self.dropout = nn.Dropout(dropout_rate)
        self.linear = nn.Linear(input_dim, output_dim)
        self.tanh = nn.Tanh()

    def forward(self, x):
        x = self.dropout(x)
        if self.use_activation:
            x = self.tanh(x) 
        return self.linear(x)


class R_RoBERTa_WiC(RobertaPreTrainedModel):
    def __init__(self,  model_name, config, dropout_rate):
        super(R_RoBERTa_WiC, self).__init__(config)
        self.model = AutoModel.from_pretrained(model_name, config=config)

        self.num_labels = config.num_labels

        self.cls_fc_layer = FCLayer(config.hidden_size, config.hidden_size, dropout_rate)
        self.entity_fc_layer1 = FCLayer(config.hidden_size, config.hidden_size, dropout_rate)
        self.entity_fc_layer2 = FCLayer(config.hidden_size, config.hidden_size, dropout_rate)

        self.label_classifier = FCLayer(
            config.hidden_size * 3,
            config.num_labels,
            dropout_rate,
            use_activation=False,
        )

    @staticmethod
    def entity_average(hidden_output, e_mask):
        """
        Average the entity hidden state vectors (H_i ~ H_j)
        :param hidden_output: [batch_size, j-i+1, dim]
        :param e_mask: [batch_size, max_seq_len]
                e.g. e_mask[0] == [0, 0, 0, 1, 1, 1, 0, 0, ... 0]
        :return: [batch_size, dim]
        """
        e_mask_unsqueeze = e_mask.unsqueeze(1)  # [b, 1, j-i+1]
        length_tensor = (e_mask != 0).sum(dim=1).unsqueeze(1)  # [batch_size, 1]

        # [b, 1, j-i+1] * [b, j-i+1, dim] = [b, 1, dim] -> [b, dim]
        sum_vector = torch.bmm(e_mask_unsqueeze.float(), hidden_output).squeeze(1)
        avg_vector = sum_vector.float() / length_tensor.float()  # broadcasting
        return avg_vector

    def forward(self, input_ids, attention_mask, labels, e1_mask, e2_mask):
        outputs = self.model(
            input_ids, attention_mask=attention_mask
        )  # sequence_output, pooled_output, (hidden_states), (attentions)
        sequence_output = outputs[0] #batch, max_len, hidden_size  

        e1_h = self.entity_average(sequence_output, e1_mask)
        e2_h = self.entity_average(sequence_output, e2_mask)
        # Dropout -> tanh -> fc_layer (Share FC layer for e1 and e2)
        sentence_representation = self.cls_fc_layer(outputs.pooler_output)

        e1_h = self.entity_fc_layer1(e1_h)
        e2_h = self.entity_fc_layer2(e2_h)
        # Concat -> fc_layer
        concat_h = torch.cat([sentence_representation, e1_h, e2_h], dim=-1)
        logits = self.label_classifier(concat_h)
        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here
        # Softmax
        loss_fct = nn.CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        outputs = (loss,) + outputs

        return outputs  # (loss), logits, (hidden_states), (attentions)

## Data Lodaer

In [2]:
import pickle as pickle
import os
import pandas as pd
import torch
from tqdm import tqdm

class WICDataset(torch.utils.data.Dataset):
    def __init__(self, tokenized_dataset, labels):
        self.tokenized_dataset = tokenized_dataset
        self.labels = labels
    
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.tokenized_dataset.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

    def __len__(self):
        return len(self.labels)

def load_data(dataset_dir, mode = 'train'):
    dataset = pd.read_csv(dataset_dir, delimiter='\t')
    li = []
    for s1, s2 in zip(list(dataset['SENTENCE1']), list(dataset['SENTENCE2'])):
        li.append(s1+' '+s2)
    dataset["ANSWER"] = dataset["ANSWER"].astype(int)
    if mode == 'test':
        dataset["ANSWER"] = [0] * len(dataset)
    return dataset

def convert_sentence_to_features(train_dataset, tokenizer, max_len):
    
    max_seq_len=max_len
    pad_token=tokenizer.pad_token_id
    add_sep_token=False
    mask_padding_with_zero=True
    
    all_input_ids = []
    all_attention_mask = []
    all_e1_mask=[]
    all_e2_mask=[]
    all_label=[]
    m_len=0
    for idx in tqdm(range(len(train_dataset))):
        sentence = '<s>' + train_dataset['SENTENCE1'][idx][:train_dataset['start_s1'][idx]] \
            + ' <e1> ' + train_dataset['SENTENCE1'][idx][train_dataset['start_s1'][idx]:train_dataset['end_s1'][idx]] \
            + ' </e1> ' + train_dataset['SENTENCE1'][idx][train_dataset['end_s1'][idx]:] + '</s>' \
            + ' ' \
            + '<s>' + train_dataset['SENTENCE2'][idx][:train_dataset['start_s2'][idx]] \
            + ' <e2> ' + train_dataset['SENTENCE2'][idx][train_dataset['start_s2'][idx]:train_dataset['end_s2'][idx]] \
            + ' </e2> ' + train_dataset['SENTENCE2'][idx][train_dataset['end_s2'][idx]:] + '</s>'
        
        token = tokenizer.tokenize(sentence)
        m_len = max(m_len, len(token))
        e11_p = token.index("<e1>")  # the start position of entity1
        e12_p = token.index("</e1>")  # the end position of entity1
        e21_p = token.index("<e2>")  # the start position of entity2
        e22_p = token.index("</e2>")  # the end position of entity2

        token[e11_p] = "$"
        token[e12_p] = "$"
        token[e21_p] = "#"
        token[e22_p] = "#"

        e11_p += 1
        e12_p += 1
        e21_p += 1
        e22_p += 1

        special_tokens_count = 1

        if len(token) < max_seq_len - special_tokens_count:
            input_ids = tokenizer.convert_tokens_to_ids(token)
            attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)

            padding_length = max_seq_len - len(input_ids)
            input_ids = input_ids + ([pad_token] * padding_length)
            attention_mask = attention_mask + ([0 if mask_padding_with_zero else 1] * padding_length)

            e1_mask = [0] * len(attention_mask)
            e2_mask = [0] * len(attention_mask)

            for i in range(e11_p, e12_p + 1):
                e1_mask[i] = 1
            for i in range(e21_p, e22_p + 1):
                e2_mask[i] = 1

            assert len(input_ids) == max_seq_len, "Error with input length {} vs {}".format(len(input_ids), max_seq_len)
            assert len(attention_mask) == max_seq_len, "Error with attention mask length {} vs {}".format(
                len(attention_mask), max_seq_len
            )

            all_input_ids.append(input_ids)
            all_attention_mask.append(attention_mask)
            all_e1_mask.append(e1_mask)
            all_e2_mask.append(e2_mask)
            all_label.append(train_dataset['ANSWER'][idx])

    all_features = {
        'input_ids' : torch.tensor(all_input_ids),
        'attention_mask' : torch.tensor(all_attention_mask),
        'e1_mask' : torch.tensor(all_e1_mask),
        'e2_mask' : torch.tensor(all_e2_mask)
    }  
    return WICDataset(all_features, all_label)

In [3]:
import os
import pandas as pd

BASE_DIR = "./"
import torch
import numpy as np
import random
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
import json
import logging
import os
import torch.nn as nn
from tqdm import tqdm
from transformers import AdamW, get_linear_schedule_with_warmup, AutoTokenizer
import torch.nn.functional as F

from transformers import AutoModel, AutoConfig
import argparse

In [4]:
# seed 고정 
def seed_everything(seed):
  torch.manual_seed(seed)
  torch.cuda.manual_seed(seed)
  torch.cuda.manual_seed_all(seed)  # if use multi-GPU
  torch.backends.cudnn.deterministic = True
  torch.backends.cudnn.benchmark = False
  np.random.seed(seed)
  random.seed(seed)


def compute_metrics(preds, labels):
    assert len(preds) == len(labels)
    return acc_and_f1(preds, labels)

def simple_accuracy(preds, labels):
    return (preds == labels).mean()

def acc_and_f1(preds, labels, average="macro"):
    acc = simple_accuracy(preds, labels)
    return {
        "acc": acc,
    }

def init_logger():
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,
    )

class Trainer(object):
    def __init__(self, args, model_dir = None,train_dataset=None, dev_dataset=None, test_dataset=None,tokenizer=None):
        self.train_dataset = train_dataset
        self.dev_dataset = dev_dataset
        self.test_dataset = test_dataset
        self.tokenizer = tokenizer
        self.model_dir = model_dir 
        self.best_score = 0
        self.hold_epoch = 0

        self.eval_batch_size = args.eval_batch_size
        self.train_batch_size = args.train_batch_size
        self.max_steps = args.max_steps
        self.weight_decay = args.weight_decay
        self.learning_rate = args.lr
        self.adam_epsilon= args.adam_epsilon
        self.warmup_steps = args.warmup_steps
        self.num_train_epochs = args.num_train_epochs
        self.logging_steps = args.logging_steps
        self.max_grad_norm = args.max_grad_norm
        self.dropout_rate = args.dropout_rate
        self.gradient_accumulation_steps = args.gradient_accumulation_steps
        
        self.config = AutoConfig.from_pretrained(
            "klue/roberta-large",
            num_labels = 2
        )
        self.model = R_RoBERTa_WiC(
           "klue/roberta-large", 
            config=self.config, 
            dropout_rate = self.dropout_rate,
        )

        # GPU or CPU
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)
        
        
    def train(self):
        init_logger()
        seed_everything(args.seed)
        train_sampler = RandomSampler(self.train_dataset)
        train_dataloader = DataLoader(
            self.train_dataset,
            sampler=train_sampler,
            batch_size=self.train_batch_size,
        )

        if self.max_steps > 0:
            t_total = self.max_steps
            self.num_train_epochs = (
                self.max_steps // (len(train_dataloader) // self.gradient_accumulation_steps) + 1
            )
        else:
            t_total = len(train_dataloader) // self.gradient_accumulation_steps * self.num_train_epochs

        # Prepare optimizer and schedule (linear warmup and decay)
        no_decay = ["bias", "LayerNorm.weight"]
        optimizer_grouped_parameters = [
            {
                "params": [p for n, p in self.model.named_parameters() if not any(nd in n for nd in no_decay)],
                "weight_decay": self.weight_decay,
            },
            {
                "params": [p for n, p in self.model.named_parameters() if any(nd in n for nd in no_decay)],
                "weight_decay": 0.0,
            },
        ]
        optimizer = AdamW(
            optimizer_grouped_parameters,
            lr=self.learning_rate,
            eps=self.adam_epsilon,
        )
        scheduler = get_linear_schedule_with_warmup(
            optimizer,
            num_warmup_steps=self.warmup_steps,
            num_training_steps=t_total,
        )
        
        # Train!
        logger.info("***** Running training *****")
        logger.info("  Num examples = %d", len(self.train_dataset))
        logger.info("  Num Epochs = %d", self.num_train_epochs)
        logger.info("  Total train batch size = %d", self.train_batch_size)
        logger.info("  Gradient Accumulation steps = %d", self.gradient_accumulation_steps)
        logger.info("  Total optimization steps = %d", t_total)
        logger.info("  Logging steps = %d", self.logging_steps)

        global_step = 0
        tr_loss = 0.0
        self.model.zero_grad()

        train_iterator = tqdm(range(int(self.num_train_epochs)), desc="Epoch")

        for epo_step in train_iterator:
            self.global_epo = epo_step
            epoch_iterator = tqdm(train_dataloader, desc="Iteration")
            for step, batch in enumerate(epoch_iterator):
                self.model.train()
                batch = tuple(batch[t].to(self.device) for t in batch)  # GPU or CPU
                inputs = {
                    "input_ids": batch[0],
                    "attention_mask": batch[1],
                    "labels": batch[4],
                    "e1_mask": batch[2],
                    "e2_mask": batch[3]
                }
                
                outputs = self.model(**inputs)
                loss = outputs[0]

                if self.gradient_accumulation_steps > 1:
                    loss = loss / self.gradient_accumulation_steps

                loss.backward()

                tr_loss += loss.item()
                if (step + 1) % self.gradient_accumulation_steps == 0:
                    torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.max_grad_norm)

                    optimizer.step()
                    scheduler.step()  # Update learning rate schedule
                    self.model.zero_grad()
                    global_step += 1

                if self.logging_steps > 0 and global_step % self.logging_steps == 0:
                    logger.info("  global steps = %d", global_step)

                if 0 < self.max_steps < global_step:
                    epoch_iterator.close()
                    break
            
            self.evaluate("dev")
            if self.hold_epoch > 4:
                train_iterator.close()
                break
                
            if 0 < self.max_steps < global_step:
                train_iterator.close()
                break
          

        return global_step, tr_loss / global_step
    
   
    def evaluate(self, mode):
        # We use test dataset because semeval doesn't have dev dataset
        if mode == "test":
            dataset = self.test_dataset
        elif mode == "dev":
            dataset = self.dev_dataset
        elif mode == "train":
            dataset = self.train_dataset
        else:
            raise Exception("Only dev and test dataset available")

        eval_sampler = SequentialSampler(dataset)
        eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=self.eval_batch_size)

        # Eval!
        logger.info('---------------------------------------------------')
        logger.info("***** Running evaluation on %s dataset *****", mode)
        logger.info("  Num examples = %d", len(dataset))
        logger.info("  Batch size = %d", self.eval_batch_size)
        eval_loss = 0.0
        nb_eval_steps = 0
        preds = None
        out_label_ids = None

        self.model.eval()

        for batch in tqdm(eval_dataloader, desc="Evaluating"):
            batch = tuple(batch[t].to(self.device) for t in batch)
            with torch.no_grad():
                inputs = {
                    "input_ids": batch[0],
                    "attention_mask": batch[1],
                    "labels": batch[4],
                    "e1_mask": batch[2],
                    "e2_mask": batch[3],
                }
                #with torch.cuda.amp.autocast():
                outputs = self.model(**inputs)
                tmp_eval_loss, logits = outputs[:2]
                eval_loss += tmp_eval_loss.mean().item()
            nb_eval_steps += 1

            if preds is None:
                preds = logits.detach().cpu().numpy()
                out_label_ids = inputs["labels"].detach().cpu().numpy()
            else:
                preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
                out_label_ids = np.append(out_label_ids, inputs["labels"].detach().cpu().numpy(), axis=0)

        eval_loss = eval_loss / nb_eval_steps
        results = {"loss": eval_loss}
        preds = np.argmax(preds, axis=1)
        result = compute_metrics(preds, out_label_ids)
        
        if mode == "dev":
            if result['acc']>self.best_score:
                self.save_model()
                self.best_score = result['acc']
                print('save new best model acc : ',str(self.best_score))
                self.hold_epoch = 0
            else:
                self.hold_epoch += 1
        
        
        results.update(result)

        logger.info("***** Eval results *****")
        for key in sorted(results.keys()):
            logger.info("  {} = {:.4f}".format(key, results[key]))
        logger.info("---------------------------------------------------")
        return results
        

    def save_model(self,new_dir=None):
        # Save model checkpoint (Overwrite)
        if not os.path.exists(self.model_dir):
            os.makedirs(self.model_dir)
        if new_dir == None:
            pass
        else:
            if not os.path.exists(new_dir):
                os.makedirs(new_dir)
            self.model_dir = new_dir
        model_to_save = self.model.module if hasattr(self.model, "module") else self.model
        model_to_save.save_pretrained(self.model_dir)

        # Save training arguments together with the trained model
        logger.info("Saving model checkpoint to %s", self.model_dir)

  


- train 과정은 서버에서 진행하였고, ipynb에서는 중간에 중단하였습니다.
- data augmentation은 두 문장의 순서를 바꾸어 train dataset을 2배로 증강하였고, 이는 외부에서 실행하여 파일로 저장했습니다.  

In [5]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
import easydict

args = easydict.EasyDict({
 
        "num_train_epochs": 10,
        "train_batch_size": 4,
        "eval_batch_size": 4,
        "max_steps": -1,
        "dropout_rate": 0.1,
        "lr" : 1e-5,
        "adam_epsilon" : 1e-8,
        "weight_decay" : 0.01,
        "warmup_steps" : 64,
        "seed" : 42,
        "logging_steps" : 500,
        "max_grad_norm" : 1.0,
        "gradient_accumulation_steps" : 1,
        "train_data_dir" : f"{BASE_DIR}dataset/wic/NIKL_SKT_WiC_Train.tsv",
        "dev_data_dir" : f"{BASE_DIR}dataset/wic/NIKL_SKT_WiC_Dev.tsv" 
})

train_dataset = load_data(args.train_data_dir)
dev_dataset = load_data(args.dev_data_dir)
ADDITIONAL_SPECIAL_TOKENS = ["<e1>", "</e1>", "<e2>", "</e2>"]
tokenizer = AutoTokenizer.from_pretrained("klue/roberta-large", return_token_type_ids=False)
tokenizer.add_special_tokens({"additional_special_tokens": ADDITIONAL_SPECIAL_TOKENS})

concat_dataset = train_dataset

def make_fold(x):
  if x <= concat_dataset.shape[0]*0.2:
      return 0
  elif x > concat_dataset.shape[0]*0.2 and x <= concat_dataset.shape[0]*0.4:
      return 1
  elif x > concat_dataset.shape[0]*0.4 and x <= concat_dataset.shape[0]*0.6 :
      return 2
  elif x > concat_dataset.shape[0]*0.6 and x <= concat_dataset.shape[0]*0.8 :
      return 3
  else:
      return 4

concat_dataset['fold']= concat_dataset['ID'].apply(make_fold)
concat_dataset = concat_dataset.drop(['ID', 'Target'],axis=1)

logger = logging.getLogger(__name__)
for fold in tqdm(range(5)): 
  trn_idx = concat_dataset[concat_dataset['fold'] != fold].index
  val_idx = concat_dataset[concat_dataset['fold'] == fold].index

  half_val_len = len(val_idx)//2
  add_trn_idx = val_idx[:half_val_len]

  trn_idx.append(add_trn_idx)
  val_idx = val_idx[half_val_len:]

  train_folds = concat_dataset.loc[trn_idx].reset_index(drop=True).drop(['fold'],axis=1)
  valid_folds = concat_dataset.loc[val_idx].reset_index(drop=True).drop(['fold'],axis=1)

  train_Dataset = convert_sentence_to_features(train_dataset, tokenizer, max_len = 280)
  valid_Dataset = convert_sentence_to_features(dev_dataset, tokenizer, max_len= 280)

  trainer = Trainer(args,
                  train_dataset=train_Dataset,
                  dev_dataset=valid_Dataset,
                  tokenizer =tokenizer,
                  model_dir = f'{BASE_DIR}roberta_model_fold_{str(fold)}')

  trainer.train()
  trainer.save_model(new_dir=f'{BASE_DIR}roberta_model_final_fold_{str(fold)}')

100%|██████████| 15496/15496 [00:15<00:00, 984.14it/s]
100%|██████████| 1166/1166 [00:01<00:00, 752.67it/s]
Some weights of the model checkpoint at klue/roberta-large were not used when initializing RobertaModel: ['lm_head.decoder.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at klue/roberta-large and are newly initialized: ['roberta.pooler.dense.weight', 'r

KeyboardInterrupt: 

## Inference


In [6]:
import pickle as pickle
import os
import pandas as pd
import torch
from sklearn.metrics import accuracy_score
import numpy as np
import matplotlib.pyplot as plt
import random
from itertools import chain
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
import copy
import csv
import json
import logging
import os
import torch.nn as nn
from tqdm.auto import tqdm
from transformers import AdamW, get_linear_schedule_with_warmup
import torch.nn.functional as F
from transformers import AutoTokenizer,AutoModel, RobertaPreTrainedModel, AutoConfig, RobertaModel
import numpy as np
import os 

class FCLayer(nn.Module):
    def __init__(self, input_dim, output_dim, dropout_rate=0.0, use_activation=True):
        super(FCLayer, self).__init__()
        self.use_activation = use_activation
        self.dropout = nn.Dropout(dropout_rate)
        self.linear = nn.Linear(input_dim, output_dim)
        self.tanh = nn.Tanh()

    def forward(self, x):
        x = self.dropout(x)
        if self.use_activation:
            x = self.tanh(x)
        return self.linear(x)

class Roberta_WiC(RobertaPreTrainedModel):
    def __init__(self,  model_name, config, dropout_rate):
        super(Roberta_WiC, self).__init__(config)
        self.model = AutoModel.from_pretrained(model_name, config=config)  # Load pretrained XLMRoberta

        self.num_labels = config.num_labels

        self.cls_fc_layer = FCLayer(config.hidden_size, config.hidden_size, dropout_rate)
        self.entity_fc_layer1 = FCLayer(config.hidden_size, config.hidden_size, dropout_rate)
        self.entity_fc_layer2 = FCLayer(config.hidden_size, config.hidden_size, dropout_rate)

        self.label_classifier = FCLayer(
            config.hidden_size * 3,
            config.num_labels,
            dropout_rate,
            use_activation=False,
        )

    @staticmethod
    def entity_average(hidden_output, e_mask):
        """
        Average the entity hidden state vectors (H_i ~ H_j)
        :param hidden_output: [batch_size, j-i+1, dim]
        :param e_mask: [batch_size, max_seq_len]
                e.g. e_mask[0] == [0, 0, 0, 1, 1, 1, 0, 0, ... 0]
        :return: [batch_size, dim]
        """
        e_mask_unsqueeze = e_mask.unsqueeze(1)  # [b, 1, j-i+1]
        length_tensor = (e_mask != 0).sum(dim=1).unsqueeze(1)  # [batch_size, 1]

        # [b, 1, j-i+1] * [b, j-i+1, dim] = [b, 1, dim] -> [b, dim]
        sum_vector = torch.bmm(e_mask_unsqueeze.float(), hidden_output).squeeze(1)
        avg_vector = sum_vector.float() / length_tensor.float()  # broadcasting
        return avg_vector

    def forward(self, input_ids, attention_mask, labels, e1_mask, e2_mask):
        outputs = self.model(
            input_ids, attention_mask=attention_mask
        )  
        sequence_output = outputs[0] 
        e1_h = self.entity_average(sequence_output, e1_mask)
        e2_h = self.entity_average(sequence_output, e2_mask)

        sentence_representation = self.cls_fc_layer(outputs.pooler_output)
        
        e1_h = self.entity_fc_layer1(e1_h)
        e2_h = self.entity_fc_layer2(e2_h)

        concat_h = torch.cat([sentence_representation, e1_h, e2_h], dim=-1)
        logits = self.label_classifier(concat_h)
        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here
        # Softmax
        if labels is not None:
            if self.num_labels == 1:
                loss_fct = nn.MSELoss()
                loss = loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = nn.CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

            outputs = (loss,) + outputs

        return outputs  # (loss), logits, (hidden_states), (attentions)

class RE_Dataset(torch.utils.data.Dataset):
    def __init__(self, tokenized_dataset, labels):
        self.tokenized_dataset = tokenized_dataset
        self.labels = labels
    
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.tokenized_dataset.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

    def __len__(self):
        return len(self.labels)

def init_logger():
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,
    )
    

def test_pred(test_dataset, eval_batch_size, model):
    test_dataset = test_dataset
    test_sampler = SequentialSampler(test_dataset)
    test_dataloader = DataLoader(test_dataset, sampler=test_sampler,batch_size=eval_batch_size)

    logger = logging.getLogger(__name__)
    init_logger()

    # Eval!
    logger.info("***** Running evaluation on %s dataset *****", "test")
    logger.info("  Batch size = %d", eval_batch_size)

    nb_eval_steps = 0
    preds = None
    
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    model.eval()

    for batch in tqdm(test_dataloader, desc="Predicting"):
        batch = tuple(batch[t].to(device) for t in batch)
        with torch.no_grad():
            inputs = {
                "input_ids": batch[0],
                "attention_mask": batch[1],
                "labels": None,
                "e1_mask": batch[2],
                "e2_mask": batch[3],
            }
            outputs = model(**inputs)
            pred = outputs[0]

        nb_eval_steps += 1

        if preds is None:
            preds = pred.detach().cpu().numpy()
        else:
            preds = np.append(preds, pred.detach().cpu().numpy(), axis=0)

    preds_label = np.argmax(preds, axis=1)
    df = pd.DataFrame(preds, columns=['pred_0','pred_1'])
    df['label'] = preds_label
    preds = preds.astype(int)
    return df 


def load_test_data(dataset_dir):
    dataset = pd.read_csv(dataset_dir, delimiter='\t')
    li = []
    for s1, s2 in zip(list(dataset['SENTENCE1']), list(dataset['SENTENCE2'])):
        li.append(s1+' '+s2)
    dataset["ANSWER"] = dataset["ANSWER"].astype(int)
    return dataset

def convert_sentence_to_features(train_dataset, tokenizer, max_len, mode='train'):
    max_seq_len=max_len
    pad_token=tokenizer.pad_token_id
    add_sep_token=False
    mask_padding_with_zero=True
    
    all_input_ids = []
    all_attention_mask = []
    all_e1_mask=[]
    all_e2_mask=[]
    all_label=[]
    m_len=0
    for idx in tqdm(range(len(train_dataset))):
        sentence = '<s>' + train_dataset['SENTENCE1'][idx][:train_dataset['start_s1'][idx]] \
            + ' <e1> ' + train_dataset['SENTENCE1'][idx][train_dataset['start_s1'][idx]:train_dataset['end_s1'][idx]] \
            + ' </e1> ' + train_dataset['SENTENCE1'][idx][train_dataset['end_s1'][idx]:] + '</s>' \
            + ' ' \
            + '<s>' + train_dataset['SENTENCE2'][idx][:train_dataset['start_s2'][idx]] \
            + ' <e2> ' + train_dataset['SENTENCE2'][idx][train_dataset['start_s2'][idx]:train_dataset['end_s2'][idx]] \
            + ' </e2> ' + train_dataset['SENTENCE2'][idx][train_dataset['end_s2'][idx]:] + '</s>'

            
        
        token = tokenizer.tokenize(sentence)
        m_len = max(m_len, len(token))
        e11_p = token.index("<e1>")  # the start position of entity1
        e12_p = token.index("</e1>")  # the end position of entity1
        e21_p = token.index("<e2>")  # the start position of entity2
        e22_p = token.index("</e2>")  # the end position of entity2

        token[e11_p] = "$"
        token[e12_p] = "$"
        token[e21_p] = "#"
        token[e22_p] = "#"

        e11_p += 1
        e12_p += 1
        e21_p += 1
        e22_p += 1

        special_tokens_count = 1

        if len(token) < max_seq_len - special_tokens_count:
            input_ids = tokenizer.convert_tokens_to_ids(token)
            attention_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)

            padding_length = max_seq_len - len(input_ids)
            input_ids = input_ids + ([pad_token] * padding_length)
            attention_mask = attention_mask + ([0 if mask_padding_with_zero else 1] * padding_length)

            e1_mask = [0] * len(attention_mask)
            e2_mask = [0] * len(attention_mask)

            for i in range(e11_p, e12_p + 1):
                e1_mask[i] = 1
            for i in range(e21_p, e22_p + 1):
                e2_mask[i] = 1

            assert len(input_ids) == max_seq_len, "Error with input length {} vs {}".format(len(input_ids), max_seq_len)
            assert len(attention_mask) == max_seq_len, "Error with attention mask length {} vs {}".format(
                len(attention_mask), max_seq_len
            )

            all_input_ids.append(input_ids)
            all_attention_mask.append(attention_mask)
            all_e1_mask.append(e1_mask)
            all_e2_mask.append(e2_mask)
            all_label.append(train_dataset['ANSWER'][idx])

    all_features = {
        'input_ids' : torch.tensor(all_input_ids),
        'attention_mask' : torch.tensor(all_attention_mask),
        'e1_mask' : torch.tensor(all_e1_mask),
        'e2_mask' : torch.tensor(all_e2_mask)
    }  
    return RE_Dataset(all_features, all_label)

def softmax(sr):
    
    max_val = np.max(sr)
    exp_a = np.exp(sr-max_val)
    sum_exp_a = np.sum(exp_a)
    y = exp_a / sum_exp_a
    return y

def compute_metrics(preds, labels):
    assert len(preds) == len(labels)
    return acc_and_f1(preds, labels)

def simple_accuracy(preds, labels):
    return (preds == labels).mean()

def acc_and_f1(preds, labels, average="macro"):
    acc = simple_accuracy(preds, labels)
    return {
        "acc": acc,
    }


In [7]:
eval_batch_size = 4
ADDITIONAL_SPECIAL_TOKENS = ["<e1>", "</e1>", "<e2>", "</e2>"]
tokenizer = AutoTokenizer.from_pretrained("klue/roberta-large", return_token_type_ids=False)
tokenizer.add_special_tokens({"additional_special_tokens": ADDITIONAL_SPECIAL_TOKENS})

test_dataset = load_test_data(f"{BASE_DIR}dataset/wic/NIKL_SKT_WiC_Dev.tsv")
test_Dataset = convert_sentence_to_features(test_dataset, tokenizer, max_len= 280, mode='eval')

n_fold = 5
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

for fold in tqdm(range(n_fold)):
    config = AutoConfig.from_pretrained(
            "klue/roberta-large",
            num_labels= 2
        )
    model = Roberta_WiC(
            'klue/roberta-large',
            config= config, 
            dropout_rate = 0.1
        )
    model.load_state_dict(torch.load(f'{BASE_DIR}roberta_model_final_fold_'+str(fold)+'/pytorch_model.bin', map_location=device))
    model.eval()
    result = test_pred(test_Dataset, eval_batch_size, model)
    result.to_csv(f'{BASE_DIR}{str(fold)}_rbt_result.csv', index=False)

ensemble= pd.DataFrame()
for fold in range(n_fold):
    df = pd.read_csv(f'{BASE_DIR}{str(fold)}_rbt_result.csv')
    ensemble['label'+str(fold)]= df['label']


soft_ensemble= pd.DataFrame()
soft_ensemble['pred_0'] = ensemble['label0']
soft_ensemble['pred_1'] = ensemble['label0']
soft_ensemble['pred_0'] = 0
soft_ensemble['pred_1'] = 0

for fold in range(n_fold):
    df = pd.read_csv(f'{BASE_DIR}{str(fold)}_rbt_result.csv')
    df= df.drop('label',axis=1)
    df = df.apply(softmax,axis=1)
    soft_ensemble['pred_0'] += df['pred_0']
    soft_ensemble['pred_1'] += df['pred_1']

soft_ensemble['predicted'] = [1 if p_0 < p_1 else 0 for p_0, p_1 in zip(soft_ensemble['pred_0'], soft_ensemble['pred_1'])]
result = compute_metrics(soft_ensemble['predicted'], test_dataset['ANSWER'])
print('================= devset acc =================')
print(f"accuracy : {result['acc']}")


HBox(children=(FloatProgress(value=0.0, max=1166.0), HTML(value='')))




HBox(children=(FloatProgress(value=0.0, max=5.0), HTML(value='')))

Some weights of the model checkpoint at klue/roberta-large were not used when initializing RobertaModel: ['lm_head.decoder.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at klue/roberta-large and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it f

HBox(children=(FloatProgress(value=0.0, description='Predicting', max=292.0, style=ProgressStyle(description_w…






Some weights of the model checkpoint at klue/roberta-large were not used when initializing RobertaModel: ['lm_head.decoder.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at klue/roberta-large and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it f

HBox(children=(FloatProgress(value=0.0, description='Predicting', max=292.0, style=ProgressStyle(description_w…




Some weights of the model checkpoint at klue/roberta-large were not used when initializing RobertaModel: ['lm_head.decoder.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at klue/roberta-large and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it f

HBox(children=(FloatProgress(value=0.0, description='Predicting', max=292.0, style=ProgressStyle(description_w…




Some weights of the model checkpoint at klue/roberta-large were not used when initializing RobertaModel: ['lm_head.decoder.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at klue/roberta-large and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it f

HBox(children=(FloatProgress(value=0.0, description='Predicting', max=292.0, style=ProgressStyle(description_w…




Some weights of the model checkpoint at klue/roberta-large were not used when initializing RobertaModel: ['lm_head.decoder.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at klue/roberta-large and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it f

HBox(children=(FloatProgress(value=0.0, description='Predicting', max=292.0, style=ProgressStyle(description_w…



accuracy : 0.934819897084048


## 동형이의어 task 끝