# Project: Solving the 'cola' task of the GLUE benchmark

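- CoLA: judge whether a sentence is grammatically acceptable
- MNLI: classify the relationship between two sentences (entailment, contradiction, neutral)
- MNLI-MM: MNLI evaluated on the mismatched set, whose genres differ from the training data
- MRPC: judge whether two sentences are semantically equivalent (paraphrases)
- SST-2: sentiment analysis
- STS-B: score the similarity of two sentences
- QQP: judge whether two questions are semantically equivalent
- QNLI: judge whether a question and a sentence from the paragraph are in an entailment relationship
- RTE: classify the relationship between two sentences (entailment, not_entailment)
- WNLI: judge entailment between the original sentence and the sentence with its pronoun replaced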
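As a rough orientation, the tasks above differ in whether they take one sentence or a sentence pair as input. The mapping below is a sketch only: the column names are assumed to follow the `task_to_keys` convention of the `run_glue.py` example script, and `is_sentence_pair` is a hypothetical helper introduced for illustration.

```python
# Text columns of each GLUE task as exposed by load_dataset('glue', <task>).
# Assumption: names follow the task_to_keys table in the run_glue.py example.
TASK_TO_KEYS = {
    "cola": ("sentence", None),
    "mnli": ("premise", "hypothesis"),
    "mrpc": ("sentence1", "sentence2"),
    "qnli": ("question", "sentence"),
    "qqp": ("question1", "question2"),
    "rte": ("sentence1", "sentence2"),
    "sst2": ("sentence", None),
    "stsb": ("sentence1", "sentence2"),
    "wnli": ("sentence1", "sentence2"),
}


def is_sentence_pair(task):
    """Return True when the task takes two text inputs (hypothetical helper)."""
    return TASK_TO_KEYS[task][1] is not None


# Tasks that take a single sentence as input.
single = sorted(t for t in TASK_TO_KEYS if not is_sentence_pair(t))
print(single)
```

CoLA is one of the two single-sentence tasks, which is why the tokenizer in this project is called with only the `sentence` column.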

For the `mnli` task, the BERT model used in the previous step does not train properly.

Refer to https://huggingface.co/models and choose a model other than BERT.  
Filtering by TensorFlow and by the task shows the models you can use.  
Then look up the `tokenizer`, the task head, and the model details for your chosen model at https://huggingface.co/transformers/index.html and complete your project.

The goal is not to simply run run_glue.py as-is.

Please follow the steps below in order.

### My choice

For the CoLA task, I chose RoBERTa as the model.  
The tokenizer uses byte-level BPE.

RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining scheme.
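To make the idea of byte-level BPE a little more concrete, here is a toy sketch of a single merge step. It is not the actual RoBERTa tokenizer (which also applies pre-tokenization and uses a learned merge table); the function names `most_frequent_pair` and `merge_pair` and the sample string are assumptions made for illustration.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged


text = "abababc"                          # illustrative input
ids = list(text.encode("utf-8"))          # start from raw bytes: ids in 0..255
pair = most_frequent_pair(ids)            # the pair (ord('a'), ord('b'))
ids_merged = merge_pair(ids, pair, 256)   # 256 = first id beyond the byte range
print(pair, ids_merged)
```

Because the base vocabulary is the 256 possible byte values, any input string can be encoded without an unknown token; repeated merges then build longer subword units on top of the bytes.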

## Check the library versions

In [1]:
import os
import tensorflow as tf
import numpy as np
import transformers
import argparse
import datasets

print(tf.__version__)
print(np.__version__)
print(transformers.__version__)
print(argparse.__version__)
print(datasets.__version__)

2.10.0
1.23.4
4.23.1
1.1
2.7.1


## STEP 1. Analyze the CoLA dataset using Hugging Face

In [2]:
import datasets
from datasets import load_dataset, load_metric
import collections

cola_dataset = load_dataset('glue', 'cola')
print(cola_dataset)

# Use collections.Counter to count how many examples each label has.

label_count = collections.Counter(cola_dataset['train']['label'])
print(label_count)

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 8551
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1043
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1063
    })
})
Counter({1: 6023, 0: 2528})


The dataset dictionary contains a train, a validation, and a test dataset,  
and each dataset has the columns 'sentence', 'label', and 'idx' (index).
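The label counts printed above show that the training split is imbalanced. The short calculation below is a sketch that hard-codes those counts (rather than reloading the dataset) to work out the share of the majority class.

```python
from collections import Counter

# Label counts printed above for the CoLA training split.
label_count = Counter({1: 6023, 0: 2528})

total = sum(label_count.values())
majority_label, majority_count = label_count.most_common(1)[0]
majority_ratio = majority_count / total

print(total)                     # 8551
print(round(majority_ratio, 4))  # 0.7044
```

A model that always predicts label 1 would already reach about 70% accuracy on data with this distribution, which is one reason plain accuracy is a weak measure here.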

## STEP 2. Build the dataset using a tokenizer provided by Hugging Face

In [3]:
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

In [4]:
from transformers import AutoTokenizer
# Assignment hint: load the tokenizer of distilbert-base-uncased (the base DistilBERT model, which is case-insensitive).
# Here roberta-base is used instead.
bert_tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def transform(data):
  return bert_tokenizer(
      data['sentence'],
      truncation = True,
      return_token_type_ids = False,
      )
  
examples = cola_dataset['train'][:5]
examples_transformed = transform(examples)

print(examples)
print(examples_transformed)

{'sentence': ["Our friends won't buy this analysis, let alone the next one we propose.", "One more pseudo generalization and I'm giving up.", "One more pseudo generalization or I'm giving up.", 'The more we study verbs, the crazier they get.', 'Day by day the facts are getting murkier.'], 'label': [1, 1, 1, 1, 1], 'idx': [0, 1, 2, 3, 4]}
{'input_ids': [[0, 2522, 964, 351, 75, 907, 42, 1966, 6, 905, 1937, 5, 220, 65, 52, 15393, 4, 2], [0, 3762, 55, 38283, 937, 1938, 8, 38, 437, 1311, 62, 4, 2], [0, 3762, 55, 38283, 937, 1938, 50, 38, 437, 1311, 62, 4, 2], [0, 133, 55, 52, 892, 47041, 6, 5, 26002, 906, 51, 120, 4, 2], [0, 10781, 30, 183, 5, 4905, 32, 562, 22802, 330, 906, 4, 2]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


In [5]:
# Tokenize the whole dataset with map.
encoded_dataset = cola_dataset.map(transform, batched=True)

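Note that `transform` truncates but does not pad, so the encoded sentences keep different lengths and must be padded when they are grouped into a batch. The sketch below shows that per-batch padding in plain Python. `pad_batch` is a hypothetical helper, and `pad_token_id=1` is an assumption about the `roberta-base` vocabulary; in practice a data collator handles this step.

```python
def pad_batch(batch_input_ids, pad_token_id=1):
    """Pad each sequence on the right to the longest length in the batch."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# First two encoded training sentences from the tokenizer output above.
batch = [
    [0, 2522, 964, 351, 75, 907, 42, 1966, 6, 905, 1937, 5, 220, 65, 52, 15393, 4, 2],
    [0, 3762, 55, 38283, 937, 1938, 8, 38, 437, 1311, 62, 4, 2],
]
padded = pad_batch(batch)
print([len(ids) for ids in padded["input_ids"]])
```

Padding to the longest sequence in each batch, rather than to a fixed maximum length, keeps batches of short sentences small.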

## STEP 3. Create a model, then train and test it

In [6]:
from transformers import AutoModelForSequenceClassification
# Assignment hint: load distilbert-base-uncased as the pretrained model and pass the number of labels (see the collections.Counter output above).
# Here roberta-base is used instead.
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

print(model.__class__)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'roberta.pooler.dense.weight', 'lm_head.bias', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifie

<class 'transformers.models.roberta.modeling_roberta.RobertaForSequenceClassification'>


In [7]:
# Load the 'glue/cola' metric.
metric = load_metric('glue', 'cola')

# Define compute_metrics (refer to the earlier lesson node if this is difficult).
def compute_metrics(eval_pred):    
    predictions,labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references = labels)


In [8]:
from transformers import Trainer, TrainingArguments
metric_name = 'loss'
batch_size = 16
output_dir = './data/transformers'

# Configure the TrainingArguments under the following conditions.
"""
Conditions
1. Set the output directory to output_dir.
2. learning_rate: 2e-5
3. Use the batch_size declared above for both the train and evaluation batch sizes.
4. Set the number of training epochs to 10.
5. Set weight_decay to 0.01.
6. Set evaluation_strategy to 'steps'.
7. Load the best model at the end of training.
8. Specify the metric used to select the best model.
"""

training_arguments = TrainingArguments(
    output_dir, # directory where outputs are saved
    evaluation_strategy="steps", # how often to evaluate
    learning_rate = 2e-5, # learning rate
    per_device_train_batch_size = batch_size, # batch size per device for training
    per_device_eval_batch_size = batch_size, # batch size per device for evaluation
    num_train_epochs = 1, # total training epochs (this run uses 1, not the 10 listed above)
    weight_decay = 0.01, # weight decay
    load_best_model_at_end=True, # reload the best checkpoint when training ends
    metric_for_best_model = metric_name, # metric used to pick the best checkpoint
)
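The number of optimization steps and evaluations implied by these settings can be worked out by hand. The sketch below assumes a single device, no gradient accumulation, and an evaluation interval of 500 steps (taken here as the library default when no interval is passed, which is an assumption).

```python
import math

num_train_examples = 8551  # size of the CoLA training split
batch_size = 16
num_train_epochs = 1
eval_steps = 500           # assumption: default interval when none is given

# The last, partially filled batch still counts as one step.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs
num_evaluations = total_steps // eval_steps

print(total_steps)      # 535
print(num_evaluations)  # 1
```

With one epoch, the model is therefore evaluated only once, at step 500. Training for the 10 epochs named in the conditions would give 5350 steps and 10 evaluations under the same assumptions.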

In [9]:
# Configure the Trainer.
"""
Conditions
1. Pass the training arguments.
2. Set the model loaded with AutoModel.
3. Set the train dataset.
4. Set the evaluation dataset to the validation split.
5. Set the tokenizer.
6. Set the metrics to compute.
"""

trainer = Trainer(
   model=model,                           # model to train
   args=training_arguments,                  # arguments configured with TrainingArguments
   train_dataset=encoded_dataset['train'],    # training dataset
   eval_dataset=encoded_dataset['validation'],       # evaluation dataset
   tokenizer = bert_tokenizer,               # roberta-base tokenizer (despite the variable name)
   compute_metrics=compute_metrics,          # computes the Matthews correlation
)

trainer.train()

The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 8551
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 535
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss | Matthews Correlation |
|------|---------------|-----------------|----------------------|
| 500 | 0.4882 | 0.48823 | 0.504672 |


The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: sentence, idx. If sentence, idx are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1043
  Batch size = 16
Saving model checkpoint to ./data/transformers\checkpoint-500
Configuration saved in ./data/transformers\checkpoint-500\config.json
Model weights saved in ./data/transformers\checkpoint-500\pytorch_model.bin
tokenizer config file saved in ./data/transformers\checkpoint-500\tokenizer_config.json
Special tokens file saved in ./data/transformers\checkpoint-500\special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from ./data/transformers\checkpoint-500 (score: 0.48822999000549316).


TrainOutput(global_step=535, training_loss=0.47935959468378087, metrics={'train_runtime': 716.2238, 'train_samples_per_second': 11.939, 'train_steps_per_second': 0.747, 'total_flos': 90067230915480.0, 'train_loss': 0.47935959468378087, 'epoch': 1.0})

The metric used to evaluate CoLA is the Matthews correlation coefficient (MCC) rather than plain accuracy.  
Reference: https://choice-life.tistory.com/82
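For binary labels, the Matthews correlation coefficient can be computed from the four confusion-matrix counts as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below is a plain-Python version for illustration, not the implementation behind the `glue` metric; the function name `mcc`, the toy labels, and the choice to return 0.0 when the denominator is zero are assumptions.

```python
import math


def mcc(labels, predictions):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report 0.0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


labels = [1, 1, 1, 1, 0, 0]        # toy data, imbalanced like CoLA
always_one = [1, 1, 1, 1, 1, 1]    # majority-class predictor
mixed = [1, 1, 1, 0, 0, 1]         # one miss in each direction

print(mcc(labels, always_one))  # 0.0
print(mcc(labels, mixed))       # 0.25
```

Both toy predictors get 4 of 6 labels right, yet the majority-class predictor scores 0 while the second scores 0.25, which shows why MCC is more informative than accuracy on imbalanced labels.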

### (Bonus) Build a CoLA processor

In [10]:
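# Look it up in https://github.com/huggingface/transformers/blob/main/src/transformers/data/processors/glue.py (remove the raise NotImplementedError() when you write your own)

import os
import warnings

from transformers.data.processors.glue import DEPRECATION_WARNING
from transformers.data.processors.utils import DataProcessor, InputExample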

class ColaProcessor(DataProcessor):
    """Processor for the CoLA data set (GLUE version)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        warnings.warn(DEPRECATION_WARNING.format("processor"), FutureWarning)

    def get_example_from_tensor_dict(self, tensor_dict):
        """See base class."""
        return InputExample(
            tensor_dict["idx"].numpy(),
            tensor_dict["sentence"].numpy().decode("utf-8"),
            None,
            str(tensor_dict["label"].numpy()),
        )

    def get_train_examples(self, data_dir):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_test_examples(self, data_dir):
        """See base class."""
        return self._create_examples(self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

    def get_labels(self):
        """See base class."""
        return ["0", "1"]

    def _create_examples(self, lines, set_type):
        """Creates examples for the training, dev and test sets."""
        test_mode = set_type == "test"
        if test_mode:
            lines = lines[1:]
        text_index = 1 if test_mode else 3
        examples = []
        for i, line in enumerate(lines):
            guid = f"{set_type}-{i}"
            text_a = line[text_index]
            label = None if test_mode else line[1]
            examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples
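The column indices in `_create_examples` reflect the layout of the CoLA TSV files. The sketch below illustrates this under the assumption that train/dev rows have four tab-separated columns (source, label, original annotation, sentence) and that test rows have two (index, sentence) after a header row. `parse_cola_line` is a hypothetical helper, and the source code `gj04` and the test sentence are illustrative values.

```python
def parse_cola_line(line, test_mode=False):
    """Split one TSV row and return (sentence, label)."""
    fields = line.rstrip("\n").split("\t")
    if test_mode:
        # Test rows: index, sentence (no label available).
        return fields[1], None
    # Train/dev rows: source, label, original annotation, sentence.
    return fields[3], fields[1]


# Illustrative rows; the third train column is empty for acceptable sentences.
train_row = "gj04\t1\t\tOur friends won't buy this analysis, let alone the next one we propose.\n"
test_row = "0\tA hypothetical unlabeled sentence.\n"

print(parse_cola_line(train_row))
print(parse_cola_line(test_row, test_mode=True))
```

This mirrors why the processor reads the text from index 3 and the label from index 1 for train/dev data, but reads the text from index 1 and skips the first row for test data.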