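# 24-1. Project: Building Your Own Custom Project

> We use KLUE, the Korean counterpart of the GLUE benchmark covered earlier. Instead of a KLUE dataset, however, we use the model (klue/bert-base) to tackle the NSMC (Naver Sentiment Movie Corpus) task.

Like GLUE, KLUE is a dataset benchmark created to advance Korean natural language understanding. It comprises eight datasets in total. In this project, however, we do not use the KLUE datasets themselves; we use the model (klue/bert-base) on the NSMC (Naver Sentiment Movie Corpus) task.

- Let's add early stopping.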

STEP 1. Analyze the NSMC data and build a Huggingface dataset    
- The dataset can be downloaded from GitHub or loaded from Huggingface datasets. Try the methods covered earlier!    

STEP 2. Load the klue/bert-base model and tokenizer    
STEP 3. Preprocess the dataset with the tokenizer loaded above and train the model   
STEP 4. Improve model performance (accuracy) through fine-tuning   
Adjust the data preprocessing, TrainingArguments, and so on to raise the model's accuracy to 90% or higher.     

STEP 5. Train with bucketing applied and compare with the STEP 4 results    
Using the references below, find out what bucketing and dynamic padding are, then apply them to train the model.   

Data Collator   

group_by_length in Trainer.TrainingArguments   

Compare the results from STEP 4 with the results of training with bucketing, and work out what benefit bucketing brings in each of two respects: model performance and training time.   
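To make the idea behind STEP 5 concrete, here is a minimal pure-Python sketch that counts how many padding tokens three strategies would need for the same set of sequence lengths. The function name `count_pad_tokens` and the toy lengths are made up for illustration. Sorting the whole dataset by length is a simplification: in the Trainer, dynamic padding comes from `DataCollatorWithPadding`, and `group_by_length=True` only groups samples of roughly similar length.

```python
def count_pad_tokens(lengths, batch_size, strategy, max_length=512):
    """Count the padding tokens needed for one pass over the data.

    strategy:
      'max_length' - pad every sequence to max_length
      'dynamic'    - pad each batch to its own longest sequence
      'bucketed'   - sort by length first, then pad dynamically
    """
    if strategy == "bucketed":
        lengths = sorted(lengths)  # simplification: a full sort by length
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = max_length if strategy == "max_length" else max(batch)
        total += sum(target - n for n in batch)
    return total

# Hypothetical token lengths, not taken from NSMC
toy_lengths = [2, 10, 3, 9]
for strategy in ("max_length", "dynamic", "bucketed"):
    print(strategy, count_pad_tokens(toy_lengths, 2, strategy, max_length=10))
# max_length 16
# dynamic 14
# bucketed 2
```

Fewer padding tokens means less wasted computation per batch, which is why the comparison in STEP 5 looks at training time as well as accuracy.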

In [1]:
import tensorflow
import numpy
import transformers
import datasets
from datasets import load_dataset, DatasetDict, Dataset
import os
print(tensorflow.__version__)
print(numpy.__version__)
print(transformers.__version__)
print(datasets.__version__)

2.6.0
1.21.4
4.11.3
1.14.0


In [2]:
import numpy as np

In [3]:
from transformers import AutoTokenizer, DataCollatorWithPadding, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score

# STEP 1. Analyze the NSMC data and build a Huggingface dataset    
- The dataset can be downloaded from GitHub or loaded from Huggingface datasets. Try the methods covered earlier!    


In [4]:
#import tensorflow_datasets as tfds
#tf_dataset, tf_dataset_info = tfds.load('glue/mrpc', with_info=True)

In [5]:
# tf_dataset_info

In [6]:
# del tf_dataset, tf_dataset_info

In [7]:
nsmc_dataset = datasets.load_dataset('nsmc')

Using custom data configuration default

Downloading and preparing dataset nsmc/default (download: 18.62 MiB, generated: 20.90 MiB, post-processed: Unknown size, total: 39.52 MiB) to /aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3...

Dataset nsmc downloaded and prepared to /aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3. Subsequent calls will reuse this data.

In [8]:
print(nsmc_dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 150000
    })
    test: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 50000
    })
})


In [9]:
train = nsmc_dataset['train']

In [10]:
cols = train.column_names

In [11]:
cols

['id', 'document', 'label']

In [12]:
for i in range(5):
    row = train[i]  # fetch one row instead of loading a whole column per access
    for col in cols:
        print(col, ":", row[col])
    print('\n')

id : 9976970
document : 아 더빙.. 진짜 짜증나네요 목소리
label : 0


id : 3819312
document : 흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나
label : 1


id : 10265843
document : 너무재밓었다그래서보는것을추천한다
label : 0


id : 9045019
document : 교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정
label : 0


id : 6483659
document : 사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다
label : 1
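Beyond printing a few rows, the "data analysis" part of STEP 1 usually means checking the label balance and the text lengths. The sketch below does this on the five rows printed above, copied by hand; the helper name `summarize` is made up for illustration. Since iterating a Huggingface `Dataset` yields one dict per row, the same function should also work on `nsmc_dataset['train']`, though that is not run here.

```python
from collections import Counter

# The five rows printed above, copied by hand for illustration
sample_rows = [
    {"document": "아 더빙.. 진짜 짜증나네요 목소리", "label": 0},
    {"document": "흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나", "label": 1},
    {"document": "너무재밓었다그래서보는것을추천한다", "label": 0},
    {"document": "교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정", "label": 0},
    {"document": "사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다", "label": 1},
]

def summarize(rows):
    """Return label counts and character-length statistics for an iterable of rows."""
    label_counts = Counter()
    lengths = []
    empty_docs = 0
    for row in rows:
        label_counts[row["label"]] += 1
        lengths.append(len(row["document"]))
        if not row["document"].strip():
            empty_docs += 1  # blank reviews carry no signal for the classifier
    return {
        "label_counts": dict(label_counts),
        "min_chars": min(lengths),
        "max_chars": max(lengths),
        "mean_chars": sum(lengths) / len(lengths),
        "empty_docs": empty_docs,
    }

summary = summarize(sample_rows)
print(summary["label_counts"])  # {0: 3, 1: 2}
print(summary["min_chars"], summary["max_chars"], summary["empty_docs"])
```

A heavily skewed label count or many empty documents would be a reason to revisit preprocessing before fine-tuning.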




# STEP 2. Load the klue/bert-base model and tokenizer   

In [13]:
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [14]:
huggingface_tokenizer = 'klue/bert-base'

In [15]:
class DataSet():
    """
    Load a Huggingface dataset and tokenize it with a pre-trained tokenizer.

    __init__:
    - loads the pre-trained tokenizer named by `huggingface_tokenizer`
    - `padding` is passed on to the tokenizer; the default 'max_length' pads
      every example to the maximum length, while padding=False leaves
      padding to the data collator
    - the incoming dataset should already have the three columns
      'id', 'document', 'label'

    transform:
    - tokenizes the 'document' column with truncation and the chosen padding

    _set:
    - splits the original train split 80-20 into train/valid,
      then tokenizes every split
    """
    def __init__(self, dataset_name, huggingface_tokenizer, padding='max_length'):
        self.tokenizer = AutoTokenizer.from_pretrained(huggingface_tokenizer)
        self.padding = padding

        dataset = self._set(dataset_name)
        self.train = dataset['train']
        self.test = dataset['test']
        self.valid = dataset['valid']

    def transform(self, data):
        return self.tokenizer(
            data['document'],
            truncation=True,
            padding=self.padding,
            return_token_type_ids=False,
        )

    def _set(self, dataset_name):
        data = datasets.load_dataset(dataset_name)
        # Fixed seed so that repeated calls return the same train/valid split
        train_valid = data['train'].train_test_split(test_size=0.2, seed=42)

        return DatasetDict({
            'train': train_valid['train'],
            'valid': train_valid['test'],
            'test': data['test']
        }).map(self.transform, batched=True)

In [16]:
dataset_name = 'nsmc'
output_dir = os.getenv('HOME')+'/aiffel/transformers'

In [17]:
dataset = DataSet(dataset_name, huggingface_tokenizer, padding=False)

Using custom data configuration default
Reusing dataset nsmc (/aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3)

In [27]:
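# Builds the DatasetDict passed to the Trainer below
# (this re-runs loading, splitting, and tokenization)
final_d = dataset._set(dataset_name)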

Using custom data configuration default
Reusing dataset nsmc (/aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3)



In [19]:
np.unique(dataset.train['label'])

array([0, 1])

In [20]:
model_name = 'klue/bert-base'

In [21]:
huggingface_model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                                      num_labels=2)


Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized

In [33]:
training_arguments = TrainingArguments(
    output_dir,                          # directory where outputs are saved
    evaluation_strategy="epoch",         # how often to run evaluation
    learning_rate=2e-5,                  # learning rate
    per_device_train_batch_size=16,      # training batch size per device
    per_device_eval_batch_size=16,       # evaluation batch size per device
    num_train_epochs=3,                  # total number of training epochs
    weight_decay=0.01,                   # weight decay
)
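The introduction asks for early stopping, which the arguments above do not configure yet. In transformers this is usually done by passing `EarlyStoppingCallback(early_stopping_patience=...)` through the Trainer's `callbacks` argument, together with `load_best_model_at_end=True` and a `metric_for_best_model` in `TrainingArguments`. That setup is not run in this notebook. The pure-Python sketch below only illustrates the patience rule; the function name and the accuracy values are hypothetical.

```python
def early_stopping_epoch(eval_scores, patience=2, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    eval_scores: one validation score per epoch, where higher is better.
    Training stops once the score has failed to improve on the best value
    by more than min_delta for `patience` consecutive evaluations.
    """
    best = float("-inf")
    bad_epochs = 0
    for epoch, score in enumerate(eval_scores, start=1):
        if score > best + min_delta:
            best = score
            bad_epochs = 0  # improvement: reset the counter
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None

# Hypothetical validation accuracies, not results from this notebook
scores = [0.88, 0.90, 0.899, 0.898, 0.91]
print(early_stopping_epoch(scores, patience=2))  # 4
print(early_stopping_epoch(scores, patience=3))  # None
```

With `patience=2` the run stops at epoch 4 and never reaches the better score at epoch 5, so the patience value trades training time against the chance of missing a late improvement.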

- Metrics

> Since this is sentiment classification, it corresponds to "SST2", so the metric is "accuracy".     
https://choice-life.tistory.com/77

In [25]:
from datasets import load_metric
metric = load_metric('accuracy')

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair; pick the highest-scoring class per example
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

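To see what `compute_metrics` does with the model output, here is a small NumPy-only sketch of the same argmax-then-compare computation. The logits and labels are made-up values, and `accuracy_from_logits` is a hypothetical helper, not part of the `datasets` metric API.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    predictions = np.argmax(logits, axis=1)  # shape (n_examples,)
    return float((predictions == labels).mean())

# Made-up logits for four examples and two classes (negative, positive)
logits = np.array([
    [2.0, -1.0],   # predicts 0
    [0.1, 0.3],    # predicts 1
    [-0.5, 1.5],   # predicts 1
    [1.2, 0.9],    # predicts 0
])
labels = np.array([0, 1, 0, 0])

print(accuracy_from_logits(logits, labels))  # 0.75
```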

In [None]:
import time

start = time.time()  # wall-clock start time in seconds
trainer = Trainer(
    model=huggingface_model,            # model to train
    args=training_arguments,            # arguments configured with TrainingArguments
    train_dataset=final_d['train'],     # training dataset
    eval_dataset=final_d['valid'],      # evaluation dataset
    compute_metrics=compute_metrics,
    # pads each batch to its longest sequence (dynamic padding)
    data_collator=DataCollatorWithPadding(tokenizer=dataset.tokenizer),
)
trainer.train()
elapsed = time.time() - start  # elapsed training time in seconds
print(elapsed)

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Running training *****
  Num examples = 120000
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 22500


Epoch,Training Loss,Validation Loss
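The "Total optimization steps = 22500" line in the log above follows directly from the dataset size, batch size, and epoch count. The sketch below reproduces that arithmetic; it is a simplified formula that assumes the last partial batch is kept, and the function name is made up for illustration.

```python
import math

def total_optimization_steps(num_examples, per_device_batch_size, num_epochs,
                             num_devices=1, grad_accum_steps=1):
    """Estimate the number of optimizer updates over a whole training run."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * num_epochs

# Values taken from the training log: 120000 examples, batch size 16, 3 epochs
print(total_optimization_steps(120000, 16, 3))  # 22500
```

Doubling the batch size to 32 would halve the step count to 11250 under the same assumptions, which is worth keeping in mind when comparing training times across runs.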
