## KLUE NSMC Model
KLUE: the Korean counterpart of GLUE   
NSMC: Naver Sentiment Movie Corpus, sentiment analysis of Naver movie reviews

#### Rubric
1. The model and data load correctly and were confirmed to run.

    ```text
    ***** Running training *****
    Num examples = 150000
    Num Epochs = 3
    Instantaneous batch size per device = 64
    Total train batch size (w. parallel, distributed & accumulation) = 64
    Gradient Accumulation steps = 1
    Total optimization steps = 7032
    ```
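2. Improved preprocessing and raised model performance through fine-tuning.  

    After tokenization the average length is about 22 tokens and the maximum is 142 (train) / 122 (test), so the padding length was reduced to 32.  
    The batch size was increased to 64.  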
    ```python
    tokenizer.model_max_length = 32
    per_device_train_batch_size = 64
    per_device_eval_batch_size = 64
    ```  
    ```text
    [7032/7032 51:11, Epoch 3/3]
    Epoch	Training Loss	Validation Loss	Accuracy	F1
    1	0.272700	0.259106	0.893620	0.893516
    2	0.196600	0.263661	0.894220	0.898333
    3	0.141600	0.284205	0.900300	0.901678
    ```
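3. Applied bucketing to model training and compared the results.

    The pre-bucketing model already trained much faster thanks to the shorter padding length,  
    but bucketing reduced the time slightly further (51:11 → 50:33) with no loss in performance (accuracy 0.9003 → 0.9059). 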
    ```python
    datacollator = DataCollatorWithPadding(tokenizer, padding=True)
    TrainingArguments(group_by_length = True)
    ```
    ```text
    [7032/7032 50:33, Epoch 3/3]
    Epoch	Training Loss	Validation Loss	Accuracy	F1
    1	0.256600	0.238329	0.902100	0.902453
    2	0.178400	0.241524	0.905640	0.906046
    3	0.125600	0.276496	0.905860	0.907165
    ```

In [1]:
import tensorflow as tf
import numpy as np
import transformers
import datasets

In [2]:
### Dataset
from datasets import load_dataset

dataset = load_dataset("nsmc")


Using custom data configuration default


Downloading and preparing dataset nsmc/default (download: 18.62 MiB, generated: 20.90 MiB, post-processed: Unknown size, total: 39.52 MiB) to /aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3...


Dataset nsmc downloaded and prepared to /aiffel/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3. Subsequent calls will reuse this data.


In [63]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 150000
    })
    test: Dataset({
        features: ['id', 'document', 'label'],
        num_rows: 50000
    })
})

In [64]:
dataset["train"][0]

{'document': '아 더빙.. 진짜 짜증나네요 목소리', 'id': '9976970', 'label': 0}

In [3]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("klue/bert-base")
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")

Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized

In [66]:
# BERT pads to 512 tokens by default, but the review data is nowhere near that long
len(tokenizer("sample text", padding="max_length")["input_ids"])

512

In [67]:
# Token lengths of the train split
train_documents = dataset["train"]["document"]
train_input_lengths = [len(tokenizer(doc)["input_ids"]) for doc in train_documents]

# Token lengths of the test split
test_documents = dataset["test"]["document"]
test_input_lengths = [len(tokenizer(doc)["input_ids"]) for doc in test_documents]

# Averages
avg_train_input_length = sum(train_input_lengths) / len(train_input_lengths)
avg_test_input_length = sum(test_input_lengths) / len(test_input_lengths)

In [68]:
print("Train, Test max length: ", max(train_input_lengths), max(test_input_lengths))
print(f"Train mean length: {avg_train_input_length}")
print(f"Test mean length: {avg_test_input_length}")

Train, Test max length:  142 122
Train mean length: 22.275513333333333
Test mean length: 22.35976
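
The length statistics above are what motivate shrinking the padding length. As a rough sketch of how such a choice could be checked, the helpers below summarize a list of token lengths and report what fraction of sequences a given `max_length` would truncate. The function names and the example lengths are hypothetical and are not the NSMC measurements.

```python
import math

def summarize_lengths(lengths, max_length):
    """Return (mean length, max length, fraction longer than max_length)."""
    mean_len = sum(lengths) / len(lengths)
    truncated = sum(1 for n in lengths if n > max_length) / len(lengths)
    return mean_len, max(lengths), truncated

def length_for_coverage(lengths, coverage=0.95):
    """Smallest observed length that fits at least `coverage` of the sequences."""
    ordered = sorted(lengths)
    index = math.ceil(coverage * len(ordered)) - 1
    return ordered[max(index, 0)]

# Hypothetical token lengths, for illustration only.
example_lengths = [8, 12, 15, 20, 22, 25, 30, 31, 40, 120]
mean_len, max_len, frac_truncated = summarize_lengths(example_lengths, max_length=32)
print(mean_len, max_len, frac_truncated)
print(length_for_coverage(example_lengths, coverage=0.95))
```

Running the same check on `train_input_lengths` would show how many reviews are cut off when `model_max_length` is set to 32 together with `truncation=True`.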


In [69]:
tokenizer.model_max_length = 32
len(tokenizer("sample text", padding="max_length")["input_ids"])

32

In [70]:
def transform(data):
    return tokenizer(
        text=data["document"],
        truncation=True,
        padding="max_length",
        return_token_type_ids=False,
    )

In [71]:
dataset_tokenized = dataset.map(
    transform,
    batched=True,
)

In [72]:
train = dataset_tokenized["train"]
test = dataset_tokenized["test"]

In [73]:
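# Shrink the data because training takes a long time
# (note: the training runs below use the full `train` / `test` splits, not these subsets)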
subset_size = 30000
train_subset = train.shuffle(seed=42).select(range(subset_size))
test_subset = test.shuffle(seed=42).select(range(int(subset_size/3)))



### Training

In [12]:
import numpy as np
from transformers import Trainer, TrainingArguments

output_dir = "model/nsmc/"

training_arguments = TrainingArguments(
    output_dir,  # directory where outputs are saved
    evaluation_strategy="epoch",  # how often to evaluate
    learning_rate=2e-5,  # learning rate
    per_device_train_batch_size=64,  # batch size per device
    per_device_eval_batch_size=64,  # batch size for evaluation
    num_train_epochs=3,  # total number of training epochs
    weight_decay=0.01,  # weight decay
)

In [15]:
# Accuracy and F1
from datasets import load_metric
metric = load_metric('glue', 'mrpc')   # GLUE MRPC metric: reports accuracy and F1 for binary classification

def compute_metrics(eval_pred):    
    predictions,labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references = labels)
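
To make explicit what `compute_metrics` reports, here is a minimal NumPy sketch of binary accuracy and F1 (for the positive class, label 1), which is my understanding of what the GLUE MRPC metric returns. The helper name `accuracy_and_f1` and the example logits are hypothetical. In newer versions of `datasets`, metric loading has, as far as I know, moved to the separate `evaluate` package, so a self-contained function like this can also serve as a fallback.

```python
import numpy as np

def accuracy_and_f1(predictions, references):
    """Binary accuracy and F1 for the positive class (label 1)."""
    preds = np.asarray(predictions)
    refs = np.asarray(references)
    accuracy = float((preds == refs).mean())
    tp = int(((preds == 1) & (refs == 1)).sum())  # true positives
    fp = int(((preds == 1) & (refs == 0)).sum())  # false positives
    fn = int(((preds == 0) & (refs == 1)).sum())  # false negatives
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}

# Hypothetical logits for four reviews (columns: negative, positive).
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
labels = np.array([0, 1, 0, 0])
scores = accuracy_and_f1(np.argmax(logits, axis=1), labels)
print(scores)
```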

In [76]:
trainer = Trainer(
    model=model,  # model to train
    args=training_arguments,  # arguments configured via TrainingArguments
    train_dataset=train,  # training dataset
    eval_dataset=test,
    compute_metrics=compute_metrics
)
trainer.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Running training *****
  Num examples = 150000
  Num Epochs = 3
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 7032


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.2727,0.259106,0.89362,0.893516
2,0.1966,0.263661,0.89422,0.898333
3,0.1416,0.284205,0.9003,0.901678


Saving model checkpoint to model/nsmc/checkpoint-500
Configuration saved in model/nsmc/checkpoint-500/config.json
Model weights saved in model/nsmc/checkpoint-500/pytorch_model.bin
Saving model checkpoint to model/nsmc/checkpoint-1000
Configuration saved in model/nsmc/checkpoint-1000/config.json
Model weights saved in model/nsmc/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to model/nsmc/checkpoint-1500
Configuration saved in model/nsmc/checkpoint-1500/config.json
Model weights saved in model/nsmc/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to model/nsmc/checkpoint-2000
Configuration saved in model/nsmc/checkpoint-2000/config.json
Model weights saved in model/nsmc/checkpoint-2000/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Running Evaluation *****
  Num examples = 50000
  Batch size = 64
Saving model checkpoint to model/n

TrainOutput(global_step=7032, training_loss=0.21353977382928982, metrics={'train_runtime': 3072.1841, 'train_samples_per_second': 146.476, 'train_steps_per_second': 2.289, 'total_flos': 7399998432000000.0, 'train_loss': 0.21353977382928982, 'epoch': 3.0})
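
The step count and throughput in the log can be reproduced from the run settings. The sketch below assumes a single device and that the last, partially filled batch of each epoch is kept rather than dropped.

```python
import math

num_examples = 150_000
batch_size = 64
num_epochs = 3
train_runtime_s = 3072.1841  # 'train_runtime' from the TrainOutput above

# The last batch of each epoch is only partially filled, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime_s

print(steps_per_epoch, total_steps, round(samples_per_second, 3))
```

These values agree with `Total optimization steps = 7032` and `train_samples_per_second` in the output.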

### Bucketing

Implement bucketing and dynamic padding with a data collator, then compare against the previous run.
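
To illustrate why these two techniques save computation, the sketch below counts how many token positions are processed under three schemes: padding everything to one fixed length, padding each batch to its own longest sequence (dynamic padding), and doing the same after sorting by length (bucketing). The lengths are randomly generated, hypothetical values. The full sort is a simplification: as I understand it, the sampler behind `group_by_length` keeps some randomness rather than sorting the whole dataset.

```python
import random

def padded_tokens(lengths, batch_size):
    """Total token positions when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total

rng = random.Random(0)
lengths = [rng.randint(3, 142) for _ in range(1024)]  # hypothetical token lengths
batch_size = 64

fixed = max(lengths) * len(lengths)                    # pad everything to one length
dynamic = padded_tokens(lengths, batch_size)           # pad per batch, original order
bucketed = padded_tokens(sorted(lengths), batch_size)  # pad per batch, sorted by length

print(f"real tokens: {sum(lengths)}")
print(f"fixed: {fixed}, dynamic: {dynamic}, bucketed: {bucketed}")
```

With randomly ordered data, dynamic padding alone helps little because almost every batch contains at least one long sequence. Grouping similar lengths is what brings the padded total close to the number of real tokens.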

In [4]:
# Tokenize without padding
def transform_bucket(data):
    return tokenizer(
        text=data["document"],
        return_token_type_ids=False,
    )

In [5]:
dataset_bucket = dataset.map(
    transform_bucket,
    batched=True,
)

In [9]:
train_bucket = dataset_bucket['train']
test_bucket = dataset_bucket['test']

In [10]:
from transformers import DataCollatorWithPadding

datacollator = DataCollatorWithPadding(tokenizer, padding=True)
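
Conceptually, the collator pads each batch only up to the longest sequence in that batch. The following is a simplified, hypothetical re-implementation of that idea in plain Python. It is not the library's code: the real `DataCollatorWithPadding` also converts the batch to tensors and handles other fields. The token IDs and `pad_token_id=0` are assumptions made for the example.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every example in a batch to the longest sequence in that batch."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = longest - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padding positions get attention mask 0 so the model ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch

# Hypothetical tokenized examples of different lengths.
features = [
    {"input_ids": [2, 10, 11, 3], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [2, 10, 3], "attention_mask": [1, 1, 1]},
]
batch = pad_batch(features)
print(batch)
```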

In [21]:
training_arguments_bucket = TrainingArguments(
    output_dir,  # directory where outputs are saved
    evaluation_strategy="epoch",  # how often to evaluate
    learning_rate=2e-5,  # learning rate
    per_device_train_batch_size=64,  # batch size per device
    per_device_eval_batch_size=64,  # batch size for evaluation
    num_train_epochs=3,  # total number of training epochs
    weight_decay=0.01,  # weight decay
    group_by_length=True,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [22]:
trainer_bucket = Trainer(
    model=model,  # model to train (load a fresh klue/bert-base model first for a fair comparison)
    args=training_arguments_bucket,  # arguments configured via TrainingArguments
    train_dataset=train_bucket,  # training dataset
    eval_dataset=test_bucket,
    compute_metrics=compute_metrics,
    data_collator=datacollator,
)
trainer_bucket.train()

The following columns in the training set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Running training *****
  Num examples = 150000
  Num Epochs = 3
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 7032


Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.2566,0.238329,0.9021,0.902453
2,0.1784,0.241524,0.90564,0.906046
3,0.1256,0.276496,0.90586,0.907165


Saving model checkpoint to model/nsmc/checkpoint-500
Configuration saved in model/nsmc/checkpoint-500/config.json
Model weights saved in model/nsmc/checkpoint-500/pytorch_model.bin
Saving model checkpoint to model/nsmc/checkpoint-1000
Configuration saved in model/nsmc/checkpoint-1000/config.json
Model weights saved in model/nsmc/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to model/nsmc/checkpoint-1500
Configuration saved in model/nsmc/checkpoint-1500/config.json
Model weights saved in model/nsmc/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to model/nsmc/checkpoint-2000
Configuration saved in model/nsmc/checkpoint-2000/config.json
Model weights saved in model/nsmc/checkpoint-2000/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: document, id.
***** Running Evaluation *****
  Num examples = 50000
  Batch size = 64
Saving model checkpoint to model/n

TrainOutput(global_step=7032, training_loss=0.18966671400645216, metrics={'train_runtime': 3035.1956, 'train_samples_per_second': 148.261, 'train_steps_per_second': 2.317, 'total_flos': 5453371288919040.0, 'train_loss': 0.18966671400645216, 'epoch': 3.0})
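
The two `TrainOutput` records can be compared directly. The sketch below uses only the `train_runtime` and `total_flos` values reported in the two runs. The helper `relative_reduction` is a name introduced here for illustration.

```python
# Values copied from the two TrainOutput records in this notebook.
baseline = {"train_runtime": 3072.1841, "total_flos": 7399998432000000.0}
bucketed = {"train_runtime": 3035.1956, "total_flos": 5453371288919040.0}

def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before

runtime_saving = relative_reduction(baseline["train_runtime"], bucketed["train_runtime"])
flos_saving = relative_reduction(baseline["total_flos"], bucketed["total_flos"])
print(f"runtime: {runtime_saving:.1%} shorter, FLOs: {flos_saving:.1%} fewer")
```

The estimated FLOs drop by roughly a quarter while wall-clock time drops by only about 1%. This suggests that, with the padding length already reduced to 32, padding was no longer the main cost of a training step.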

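### Conclusion and Retrospective

Bucketing was a particularly effective technique in this project. BERT's default padding length is 512, but NSMC reviews are nowhere near that long: after tokenization the average length is about 22 and the maximum is 142, so cutting the unnecessary padding saved a great deal. You can inspect the data lengths and shrink the padding manually, but with bucketing this happens automatically without having to consider the data lengths, which was especially nice.  

I also came to understand why the Hugging Face transformers package is so popular. Loading a model is easy, and the whole fine-tuning process is already implemented and ready to use, which was very convenient. 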