## Custom project
The GLUE MRPC task: deciding whether two sentences are paraphrases of each other

### Loading the data

In [1]:
from datasets import load_dataset

huggingface_mrpc_dataset = load_dataset("glue", "mrpc")
print(huggingface_mrpc_dataset)

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})


### Model and tokenizer

In [2]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

huggingface_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
huggingface_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
def transform(data):
    return huggingface_tokenizer(
        data["sentence1"],
        data["sentence2"],
        truncation=True,
        padding="max_length",
        return_token_type_ids=False,
    )

In [4]:
# Apply the tokenizer to every split
hf_dataset = huggingface_mrpc_dataset.map(transform, batched=True)

In [5]:
# Separate the train / validation / test splits
hf_train_dataset = hf_dataset["train"]
hf_val_dataset = hf_dataset["validation"]
hf_test_dataset = hf_dataset["test"]

In [6]:
hf_train_dataset

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'attention_mask'],
    num_rows: 3668
})
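
`padding="max_length"` in `transform` pads every sentence pair to the tokenizer's maximum length (512 tokens for `distilbert-base-uncased`), however short the sentences are. The toy sketch below uses plain Python, made-up token-id lists, and a hypothetical `pad_batch` helper. It only illustrates how fixed-length padding compares with padding to the longest sequence in a batch, and is not the tokenizer's actual implementation.

```python
def pad_batch(batch, target_len=None, pad_id=0):
    """Pad token-id lists to target_len, or to the longest list if None.

    Hypothetical helper for illustration; returns (input_ids, attention_mask).
    """
    if target_len is None:
        target_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        ids = ids[:target_len]  # truncate overly long inputs, like truncation=True
        n_pad = target_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token ids standing in for three tokenized sentence pairs.
toy_batch = [
    [101, 7592, 102, 2088, 102],
    [101, 2023, 102],
    [101, 1037, 2460, 102, 2028, 2062, 102],
]

fixed_ids, fixed_mask = pad_batch(toy_batch, target_len=512)
dynamic_ids, dynamic_mask = pad_batch(toy_batch)

# Row length under each strategy: 512 versus 7
print(len(fixed_ids[0]), len(dynamic_ids[0]))
# Number of real (unmasked) tokens is the same either way: 15
print(sum(map(sum, fixed_mask)), sum(map(sum, dynamic_mask)))
```

In `transformers`, per-batch padding is usually obtained by dropping `padding` from the tokenizer call and letting a `DataCollatorWithPadding` pad each batch, which tends to shorten training on short inputs such as MRPC.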

### Training

In [10]:
import numpy as np
from transformers import Trainer, TrainingArguments

output_dir = "model/"

training_arguments = TrainingArguments(
    output_dir,  # directory where checkpoints and outputs are saved
    evaluation_strategy="epoch",  # evaluate at the end of every epoch (named `eval_strategy` in newer transformers releases)
    learning_rate=2e-5,  # initial learning rate
    per_device_train_batch_size=8,  # training batch size per device
    per_device_eval_batch_size=8,  # evaluation batch size per device
    num_train_epochs=3,  # total number of training epochs
    weight_decay=0.01,  # weight decay coefficient
)

In [13]:
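# Note: `load_metric` is deprecated in `datasets`; newer releases use the
# separate `evaluate` library instead (`evaluate.load("glue", "mrpc")`).
from datasets import load_metric

metric = load_metric("glue", "mrpc", trust_remote_code=True)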

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
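
For MRPC, the GLUE metric reports accuracy and F1 on the positive (paraphrase) class. As a rough sketch of what `compute_metrics` returns, the plain-Python version below reproduces those two numbers for a handful of made-up logits. The function name `toy_compute_metrics` and the data are illustrative assumptions, not the library's implementation.

```python
def toy_compute_metrics(logits, labels):
    """Accuracy and binary F1 (positive class = 1) from raw two-class logits."""
    # argmax over the class logits of each example
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    pairs = list(zip(preds, labels))
    correct = sum(p == y for p, y in pairs)
    tp = sum(p == 1 and y == 1 for p, y in pairs)  # true positives
    fp = sum(p == 1 and y == 0 for p, y in pairs)  # false positives
    fn = sum(p == 0 and y == 1 for p, y in pairs)  # false negatives
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}


# Made-up logits and labels for five sentence pairs.
toy_logits = [[0.2, 1.5], [1.1, -0.3], [-0.4, 0.9], [0.1, 0.8], [2.0, 0.5]]
toy_labels = [1, 0, 0, 1, 1]

toy_scores = toy_compute_metrics(toy_logits, toy_labels)
print(toy_scores)
```

F1 is reported alongside accuracy because MRPC is imbalanced toward the paraphrase class, so accuracy alone can look better than the classifier really is.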

In [14]:
trainer = Trainer(
    model=huggingface_model,  # model to fine-tune
    args=training_arguments,  # arguments configured via TrainingArguments
    train_dataset=hf_train_dataset,  # training dataset
    eval_dataset=hf_val_dataset,  # evaluation dataset
    compute_metrics=compute_metrics,
)
trainer.train()

  0%|          | 0/1377 [00:00<?, ?it/s]

  0%|          | 0/51 [00:00<?, ?it/s]

{'eval_loss': 0.37157121300697327, 'eval_accuracy': 0.8333333333333334, 'eval_f1': 0.8831615120274914, 'eval_runtime': 2.6628, 'eval_samples_per_second': 153.223, 'eval_steps_per_second': 19.153, 'epoch': 1.0}
{'loss': 0.5037, 'grad_norm': 14.555951118469238, 'learning_rate': 1.2737835875090777e-05, 'epoch': 1.09}


  0%|          | 0/51 [00:00<?, ?it/s]

{'eval_loss': 0.3975215256214142, 'eval_accuracy': 0.8504901960784313, 'eval_f1': 0.893542757417103, 'eval_runtime': 2.6078, 'eval_samples_per_second': 156.454, 'eval_steps_per_second': 19.557, 'epoch': 2.0}
{'loss': 0.3194, 'grad_norm': 4.390669345855713, 'learning_rate': 5.475671750181555e-06, 'epoch': 2.18}


  0%|          | 0/51 [00:00<?, ?it/s]

{'eval_loss': 0.5352193713188171, 'eval_accuracy': 0.8553921568627451, 'eval_f1': 0.8984509466437177, 'eval_runtime': 2.6045, 'eval_samples_per_second': 156.651, 'eval_steps_per_second': 19.581, 'epoch': 3.0}
{'train_runtime': 222.9447, 'train_samples_per_second': 49.358, 'train_steps_per_second': 6.176, 'train_loss': 0.35915515486884136, 'epoch': 3.0}


TrainOutput(global_step=1377, training_loss=0.35915515486884136, metrics={'train_runtime': 222.9447, 'train_samples_per_second': 49.358, 'train_steps_per_second': 6.176, 'total_flos': 1457671254810624.0, 'train_loss': 0.35915515486884136, 'epoch': 3.0})
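
The numbers in the log can be checked by hand. The sketch below assumes a single device, no gradient accumulation, and the `Trainer` defaults of a linear learning-rate schedule with no warmup and logging every 500 steps. None of these are set explicitly above, so treat them as assumptions.

```python
import math

train_rows, val_rows, test_rows = 3668, 408, 1725
batch_size, epochs, base_lr = 8, 3, 2e-5

steps_per_epoch = math.ceil(train_rows / batch_size)  # last batch is partial
total_steps = steps_per_epoch * epochs
eval_batches = math.ceil(val_rows / batch_size)
test_batches = math.ceil(test_rows / batch_size)


def linear_lr(step):
    # Linear decay from base_lr to 0 over total_steps, no warmup (assumed defaults)
    return base_lr * (total_steps - step) / total_steps


print(steps_per_epoch, total_steps, eval_batches, test_batches)
for step in (500, 1000):  # assumed default logging interval of 500 steps
    print(step, round(step / steps_per_epoch, 2), linear_lr(step))
```

Under these assumptions the results agree with the 1377, 51, and 216 totals shown in the progress bars, and with the learning rates logged at epochs 1.09 and 2.18. The logs also show validation loss rising from about 0.37 to 0.54 over the three epochs while accuracy and F1 still improve slightly.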

In [15]:
trainer.evaluate(hf_test_dataset)

  0%|          | 0/216 [00:00<?, ?it/s]

{'eval_loss': 0.5688545107841492,
 'eval_accuracy': 0.841159420289855,
 'eval_f1': 0.8836023789294817,
 'eval_runtime': 11.0416,
 'eval_samples_per_second': 156.227,
 'eval_steps_per_second': 19.562,
 'epoch': 3.0}

In [33]:
# Try a prediction on the first test example
np.argmax(trainer.predict([hf_test_dataset[0]]).predictions)

  0%|          | 0/1 [00:00<?, ?it/s]

1
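
`trainer.predict(...).predictions` holds raw logits, so `np.argmax` returns only the class index. A small sketch of turning logits into a probability and a readable label follows. The logit values are made up, `softmax` and `describe` are hypothetical helpers, and the label names assume the GLUE MRPC convention of 0 = `not_equivalent`, 1 = `equivalent`.

```python
import math

LABEL_NAMES = ["not_equivalent", "equivalent"]  # assumed MRPC label order


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe(logits):
    """Return (label name, probability) for the most likely class."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_NAMES[idx], probs[idx]


example_logits = [-1.2, 1.8]  # made-up values, not taken from the run above
label, confidence = describe(example_logits)
print(label, round(confidence, 3))
```

With an output of `1` as above, the model is predicting that the two sentences of the first test example are paraphrases.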