<a href="https://colab.research.google.com/github/seopbo/nlp_tutorials/blob/main/single_text_classification_(klue_ynat)_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Single text classification - BERT
- The pre-trained language model is `klue/bert-base`.
  - https://huggingface.co/klue/bert-base
- The example dataset for the single text classification task is ynat from klue.
  - https://huggingface.co/datasets/klue

## Setup
Run the code cell below to check which GPU has been allocated.

In [1]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)

if gpu_info.find('failed') >= 0:
    print('Not connected to a GPU')
else:
    print(gpu_info)

Tue Dec 28 02:45:20 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Run the code cell below to install and load the libraries needed to run this notebook.

In [2]:
!pip install torch
!pip install transformers
!pip install datasets
!pip install -U scikit-learn

import torch
import transformers
import datasets



## Preprocess data
1. Load the subword tokenizer used by `klue/bert-base`.
2. Load klue ynat with the `datasets` library.
3. Use the subword tokenizer from step 1 to transform the klue ynat data into train examples, the form required for single text classification.
  - Each title becomes `[CLS] tok 1 ... tok N [SEP]`, which is then transformed into a list of integers.
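
The transformation in step 3 can be sketched with a toy vocabulary. This is only an illustration: `TOY_VOCAB` and `encode` are hypothetical and are not the real `klue/bert-base` vocabulary or tokenizer. The ids for `[PAD]` (0), `[CLS]` (2), and `[SEP]` (3) follow the outputs printed later in this notebook; the `[UNK]` id and the remaining entries are assumptions.

```python
from typing import Dict, List

# Toy vocabulary for illustration only. [PAD], [CLS], and [SEP] use the ids
# visible in this notebook's outputs; every other entry is made up.
TOY_VOCAB: Dict[str, int] = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "tok1": 4, "tok2": 5,
}


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Wrap subword tokens as [CLS] ... [SEP] and map each one to its integer id."""
    wrapped = ["[CLS]"] + tokens + ["[SEP]"]
    # Tokens missing from the vocabulary fall back to the [UNK] id.
    return [vocab.get(tok, vocab["[UNK]"]) for tok in wrapped]


ids = encode(["tok1", "tok2", "unseen"], TOY_VOCAB)
print(ids)  # [2, 4, 5, 1, 3]
```

The real tokenizer also splits raw text into subwords before this step, which the sketch leaves out.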


Load klue ynat and split it into `train_cs`, `valid_cs`, and `test_cs`; these are later transformed into `train_ds`, `valid_ds`, and `test_ds`.

In [3]:
from datasets import load_dataset

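# Load the official train split and hold out 10% of it for validation.
cs = load_dataset("klue", "ynat", split="train")
cs = cs.train_test_split(0.1)
# Use the official validation split as the test set.
test_cs = load_dataset("klue", "ynat", split="validation")
train_cs = cs["train"]
valid_cs = cs["test"]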

Reusing dataset klue (/root/.cache/huggingface/datasets/klue/ynat/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e)
Loading cached split indices for dataset at /root/.cache/huggingface/datasets/klue/ynat/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e/cache-2ccccbdb100b5393.arrow and /root/.cache/huggingface/datasets/klue/ynat/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e/cache-8a6e134918c0527f.arrow
Reusing dataset klue (/root/.cache/huggingface/datasets/klue/ynat/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e)
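
`train_test_split(0.1)` holds out a random 10% of the train split for validation. A minimal sketch of the same idea on plain Python indices follows. `holdout_split` and the seed value are hypothetical, and the `datasets` method may shuffle differently, so this shows the idea rather than reproducing its exact indices. The example count 45678 is the sum of the train (41110) and validation (4568) sizes reported in the training log.

```python
import math
import random
from typing import List, Tuple


def holdout_split(n_examples: int, test_size: float, seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle indices 0..n-1 and hold out a `test_size` fraction of them."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Round the held-out part up (an assumption that matches the logged sizes).
    n_test = math.ceil(n_examples * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, valid_idx = holdout_split(n_examples=45678, test_size=0.1, seed=42)
print(len(train_idx), len(valid_idx))  # 41110 4568
```

The notebook's call passes no seed, so a fresh run may produce a different split; `train_test_split` accepts a `seed` argument when reproducibility matters.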


Load the tokenizer, then define and apply the function used for the transform.

In [4]:
from transformers import AutoTokenizer, AutoConfig

tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
config = AutoConfig.from_pretrained("klue/bert-base")

print(tokenizer.__class__)
print(config.__class__)

<class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>
<class 'transformers.models.bert.configuration_bert.BertConfig'>


In [5]:
from typing import Union, List, Dict


def transform(sentences: Union[str, List[str]], tokenizer) -> Dict[str, List[List[int]]]:
    if isinstance(sentences, str):
        sentences = [sentences]
    return tokenizer(text=sentences, add_special_tokens=True, padding=False, truncation=False)

samples = train_cs[:2]
transformed_samples = transform(samples["title"], tokenizer)

print(samples)
print(transformed_samples)

{'guid': ['ynat-v1_train_19037', 'ynat-v1_train_34937'], 'title': ['미국 휴스턴 포장회사서 큰불…유해물질 대기 중 확산', 'SK텔레콤 SK와이번스 한국시리즈 우승 기념행사'], 'label': [4, 5], 'url': ['https://news.naver.com/main/read.nhn?mode=LS2D&mid=shm&sid1=104&sid2=232&oid=001&aid=0008382661', 'https://news.naver.com/main/read.nhn?mode=LS2D&mid=shm&sid1=105&sid2=230&oid=001&aid=0010470470'], 'date': ['2016.05.06. 오전 4:32', '2018.11.15. 오후 1:49']}
{'input_ids': [[2, 3666, 23097, 6211, 6166, 2112, 1751, 2588, 121, 7609, 2266, 2431, 5889, 1570, 5149, 3], [2, 4387, 2659, 2189, 2987, 4387, 12213, 26584, 3629, 2067, 2059, 2228, 4564, 19583, 3]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}


In [6]:
train_cs.features

{'date': Value(dtype='string', id=None),
 'guid': Value(dtype='string', id=None),
 'label': ClassLabel(num_classes=7, names=['IT과학', '경제', '사회', '생활문화', '세계', '스포츠', '정치'], names_file=None, id=None),
 'title': Value(dtype='string', id=None),
 'url': Value(dtype='string', id=None)}
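
The integer labels index into the `names` list of the `ClassLabel` feature printed above. A small sketch of that mapping, with the names copied from the output; `LABEL_NAMES`, `id2label`, and `label2id` are names chosen here for illustration.

```python
# Class names copied from the ClassLabel feature printed above; the position
# of each name in the list is its integer label.
LABEL_NAMES = ["IT과학", "경제", "사회", "생활문화", "세계", "스포츠", "정치"]

id2label = dict(enumerate(LABEL_NAMES))
label2id = {name: idx for idx, name in id2label.items()}

# The two sample rows shown earlier carry labels 4 and 5.
print([id2label[i] for i in (4, 5)])  # ['세계', '스포츠']
```

So the first sample title (a fire in Houston) is labelled 세계 (world) and the second (a Korean Series celebration) is labelled 스포츠 (sports). On the dataset itself, `train_cs.features["label"].int2str(4)` should perform the same lookup.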

In [7]:
train_ds = train_cs.map(lambda data: transform(data["title"], tokenizer), remove_columns=["guid", "date", "title", "url"], batched=True).rename_column("label", "labels")
valid_ds = valid_cs.map(lambda data: transform(data["title"], tokenizer), remove_columns=["guid", "date", "title", "url"], batched=True).rename_column("label", "labels")
test_ds = test_cs.map(lambda data: transform(data["title"], tokenizer), remove_columns=["guid", "date", "title", "url"], batched=True).rename_column("label", "labels")

Loading cached processed dataset at /root/.cache/huggingface/datasets/klue/ynat/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e/cache-0761833971d825cd.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/klue/ynat/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e/cache-61bd13e736d62846.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/klue/ynat/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e/cache-4dff6fb5117d903a.arrow


## Prepare model
Load `klue/bert-base` with a sequence classification head to perform single text classification; `num_labels=7` matches the seven ynat classes.

In [8]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("klue/bert-base", num_labels=7)

print(model.__class__)

Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized

<class 'transformers.models.bert.modeling_bert.BertForSequenceClassification'>


## Train model
Train the model with the `Trainer` class.

- https://huggingface.co/transformers/custom_datasets.html?highlight=trainer#fine-tuning-with-trainer

In [9]:
import numpy as np
from transformers.data.data_collator import DataCollatorWithPadding
from sklearn.metrics import accuracy_score

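def compute_metrics(p):
    # p unpacks into the model's logits and the gold label ids.
    pred, labels = p
    # The predicted class is the index of the largest logit in each row.
    pred = np.argmax(pred, axis=1)
    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    return {"accuracy": accuracy}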


batchify = DataCollatorWithPadding(
    tokenizer=tokenizer,
    padding="longest",
)
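
What `compute_metrics` calculates can be sketched without NumPy or scikit-learn. The helper `argmax_accuracy` and the toy logits below are hypothetical; they only illustrate the argmax-then-compare logic, not the libraries' implementations.

```python
from typing import Sequence


def argmax_accuracy(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Pick the highest-scoring class per row and return the fraction that match."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# Four toy examples with three classes each; the last one is misclassified.
toy_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 3.0], [0.9, 0.8, 0.7]]
toy_labels = [1, 0, 2, 1]
print(argmax_accuracy(toy_logits, toy_labels))  # 0.75
```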

In [10]:
# Check how a mini-batch is composed
batchify(train_ds[:2])

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]]), 'input_ids': tensor([[    2,  3666, 23097,  6211,  6166,  2112,  1751,  2588,   121,  7609,
          2266,  2431,  5889,  1570,  5149,     3],
        [    2,  4387,  2659,  2189,  2987,  4387, 12213, 26584,  3629,  2067,
          2059,  2228,  4564, 19583,     3,     0]]), 'labels': tensor([4, 5]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
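
In the output above, the shorter example is padded with id 0 and its `attention_mask` ends in 0, so the model ignores the padded position. A minimal sketch of this per-batch padding follows; `pad_to_longest` is a hypothetical helper, not the `DataCollatorWithPadding` implementation, and the pad id 0 is taken from the output above.

```python
from typing import Dict, List


def pad_to_longest(batch: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Shortened stand-ins for two tokenized titles of different length.
padded = pad_to_longest([[2, 3666, 23097, 3], [2, 4387, 3]])
print(padded["input_ids"])       # [[2, 3666, 23097, 3], [2, 4387, 3, 0]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding at batch time, rather than during `transform`, means each batch is only as long as its longest title.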

In [11]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          
    evaluation_strategy="steps",
    eval_steps=1000,
    per_device_train_batch_size=32, 
    per_device_eval_batch_size=32,
    learning_rate=1e-4,
    weight_decay=0.01,
    adam_beta1=.9,
    adam_beta2=.95,
    adam_epsilon=1e-8,
    max_grad_norm=1.,
    num_train_epochs=2,    
    lr_scheduler_type="linear",
    warmup_steps=100,
    logging_dir='./logs',
    logging_strategy="steps",
    logging_first_step=True,
    logging_steps=100,
    save_strategy="epoch",
    seed=42,
    dataloader_drop_last=False,
    dataloader_num_workers=2
)

trainer = Trainer(
    args=training_args,
    data_collator=batchify,
    model=model,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    compute_metrics=compute_metrics
)

trainer.train()

***** Running training *****
  Num examples = 41110
  Num Epochs = 2
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 2570


| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 1000 | 0.3686 | 0.361339 | 0.880692 |
| 2000 | 0.2655 | 0.362917 | 0.889229 |


***** Running Evaluation *****
  Num examples = 4568
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-1285
Configuration saved in ./results/checkpoint-1285/config.json
Model weights saved in ./results/checkpoint-1285/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 4568
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-2570
Configuration saved in ./results/checkpoint-2570/config.json
Model weights saved in ./results/checkpoint-2570/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=2570, training_loss=0.36267023745213967, metrics={'train_runtime': 296.4238, 'train_samples_per_second': 277.373, 'train_steps_per_second': 8.67, 'total_flos': 920138610951000.0, 'train_loss': 0.36267023745213967, 'epoch': 2.0})
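
The step counts in the log follow from the training arguments: 41110 examples at batch size 32 give 1285 steps per epoch (hence `checkpoint-1285`) and 2570 steps over 2 epochs. The sketch below reproduces that arithmetic and illustrates a linear schedule with warmup as commonly defined. `total_steps` and `linear_lr` are hypothetical helpers written for this illustration; they are not functions from `transformers`, and the library's scheduler may differ in detail.

```python
import math


def total_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps on one device, without gradient accumulation or a dropped last batch."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_lr(step: int, base_lr: float, warmup_steps: int, num_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (num_steps - step) / max(1, num_steps - warmup_steps))


steps = total_steps(num_examples=41110, batch_size=32, epochs=2)
print(steps)  # 2570

# Learning rate at the start, mid-warmup, end of warmup, and the final step.
for step in (0, 50, 100, 2570):
    print(step, linear_lr(step, base_lr=1e-4, warmup_steps=100, num_steps=steps))
```

Under these assumptions the learning rate peaks at `1e-4` at step 100 and reaches 0 at step 2570.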

In [12]:
trainer.evaluate(test_ds)

***** Running Evaluation *****
  Num examples = 9107
  Batch size = 32


{'epoch': 2.0,
 'eval_accuracy': 0.8691116723399582,
 'eval_loss': 0.3852609694004059,
 'eval_runtime': 8.5023,
 'eval_samples_per_second': 1071.118,
 'eval_steps_per_second': 33.52}