<a href="https://colab.research.google.com/github/seopbo/nlp_tutorials/blob/main/pairwise_text_classification_(klue_nli)_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pairwise text classification - BERT
- The pre-trained language model is `klue/bert-base`.
  - https://huggingface.co/klue/bert-base
- The example dataset for the pairwise text classification task is KLUE NLI.
  - https://huggingface.co/datasets/klue

## Setup
Run the code cell below to check which GPU has been allocated.

In [1]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)

if gpu_info.find('failed') >= 0:
    print('Not connected to a GPU')
else:
    print(gpu_info)

Fri Dec 24 06:09:20 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
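
The `!nvidia-smi` syntax only works inside IPython or Colab. As a rough sketch for running the same check from a plain Python script, the snippet below uses only the standard library; the helper name `gpu_report` is hypothetical and not part of the original notebook.

```python
import shutil
import subprocess


def gpu_report() -> str:
    """Return the nvidia-smi report, or a fallback message when no GPU tool is reachable."""
    # nvidia-smi is absent on machines without an NVIDIA driver.
    if shutil.which("nvidia-smi") is None:
        return "Not connected to a GPU"
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    # A non-zero exit code means the driver could not be queried.
    if result.returncode != 0:
        return "Not connected to a GPU"
    return result.stdout


print(gpu_report())
```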

Run the code cell below to install and load the libraries this notebook needs.

In [2]:
!pip install torch
!pip install transformers
!pip install datasets
!pip install -U scikit-learn

import torch
import transformers
import datasets

Collecting transformers
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transfor

## Preprocess data
1. Load the subword tokenizer used by `klue/bert-base`.
2. Load KLUE NLI with the `datasets` library.
3. Use the tokenizer from step 1 to transform the KLUE NLI data into training examples in the form required for pairwise text classification:
  - `[CLS] premise_tokens [SEP] hypothesis_tokens [SEP]`
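
To make the target layout concrete, here is a minimal sketch that assembles `[CLS] premise_tokens [SEP] hypothesis_tokens [SEP]` by hand from already-tokenized ids. The helper `encode_pair` is hypothetical, and the special-token ids 2 (`[CLS]`) and 3 (`[SEP]`) are inferred from the tokenizer output printed in this notebook; the toy ids are not real vocabulary entries. The actual tokenizer does all of this internally.

```python
from typing import Dict, List


def encode_pair(premise_ids: List[int], hypothesis_ids: List[int],
                cls_id: int = 2, sep_id: int = 3) -> Dict[str, List[int]]:
    """Build BERT-style pair inputs from two lists of token ids."""
    first = [cls_id] + premise_ids + [sep_id]   # segment 0: [CLS] premise [SEP]
    second = hypothesis_ids + [sep_id]          # segment 1: hypothesis [SEP]
    return {
        "input_ids": first + second,
        "token_type_ids": [0] * len(first) + [1] * len(second),
        "attention_mask": [1] * (len(first) + len(second)),
    }


# Toy ids for illustration only.
encoded = encode_pair([11, 12, 13], [21, 22])
print(encoded["input_ids"])       # [2, 11, 12, 13, 3, 21, 22, 3]
print(encoded["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1]
```

The `token_type_ids` are what let the model tell the premise segment from the hypothesis segment.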


In [3]:
from datasets import load_dataset

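# Hold out 10% of the train split for validation;
# the official validation split serves as the test set.
cs = load_dataset("klue", "nli", split="train")
cs = cs.train_test_split(0.1)
test_cs = load_dataset("klue", "nli", split="validation")
train_cs = cs["train"]
valid_cs = cs["test"]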

Downloading and preparing dataset klue/nli (download: 1.20 MiB, generated: 6.10 MiB, post-processed: Unknown size, total: 7.30 MiB) to /root/.cache/huggingface/datasets/klue/nli/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e...


Dataset klue downloaded and prepared to /root/.cache/huggingface/datasets/klue/nli/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e. Subsequent calls will reuse this data.


Reusing dataset klue (/root/.cache/huggingface/datasets/klue/nli/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e)


In [None]:
cs["train"][:2]

{'guid': ['klue-nli-v1_train_14265', 'klue-nli-v1_train_13962'],
 'hypothesis': ['시민들은 오전 10시쯤 일어난 사고로 인해 두려움을 느꼈다.',
  '영화가 시작한지 10분만에 시계를 쳐다보았다.'],
 'label': [0, 0],
 'premise': ['오전 10시경 일어난 사고였기에 출근길만큼 큰 불편을 겪지는 않았지만, 회사를 불문한 연이은 탈선 사고에 시민들이 두려움을 느끼는 것은 사실이다.',
  '영화시작 10분만에 시계보기 시작한 첫 영화'],
 'source': ['wikinews', 'NSMC']}

Define the transform function and apply it.

In [4]:
from transformers import AutoTokenizer, AutoConfig

tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
config = AutoConfig.from_pretrained("klue/bert-base")

print(tokenizer.__class__)
print(config.__class__)


<class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>
<class 'transformers.models.bert.configuration_bert.BertConfig'>


In [7]:
from typing import Union, List, Dict


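def transform(hypotheses: Union[str, List[str]], premises: Union[str, List[str]], tokenizer) -> Dict[str, List[List[int]]]:
    # Accept either a single string or a batch of strings.
    if isinstance(hypotheses, str):
        hypotheses = [hypotheses]
    if isinstance(premises, str):
        premises = [premises]
    # The premise is the first segment and the hypothesis the second:
    # [CLS] premise [SEP] hypothesis [SEP]. Padding is left to the data collator.
    return tokenizer(text=premises, text_pair=hypotheses, add_special_tokens=True, padding=False, truncation=False)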

samples = train_cs[:2]
transformed_samples = transform(samples["hypothesis"], samples["premise"], tokenizer)

print(samples)
print(transformed_samples)

{'guid': ['klue-nli-v1_train_14265', 'klue-nli-v1_train_13962'], 'source': ['wikinews', 'NSMC'], 'premise': ['오전 10시경 일어난 사고였기에 출근길만큼 큰 불편을 겪지는 않았지만, 회사를 불문한 연이은 탈선 사고에 시민들이 두려움을 느끼는 것은 사실이다.', '영화시작 10분만에 시계보기 시작한 첫 영화'], 'hypothesis': ['시민들은 오전 10시쯤 일어난 사고로 인해 두려움을 느꼈다.', '영화가 시작한지 10분만에 시계를 쳐다보았다.'], 'label': [0, 0]}
{'input_ids': [[2, 4400, 3633, 2067, 2382, 6657, 4022, 2507, 12551, 21383, 22883, 1751, 5153, 2069, 585, 2118, 2259, 1380, 2886, 3683, 16, 3769, 2138, 14294, 2470, 9913, 2073, 1764, 2020, 4022, 2170, 3857, 7285, 7391, 2069, 4491, 2259, 575, 2073, 3669, 28674, 18, 3, 3857, 2031, 2073, 4400, 3633, 2067, 3353, 6657, 4022, 2200, 4534, 7391, 2069, 6227, 2062, 18, 3], [2, 3771, 2067, 2333, 3633, 2377, 2154, 2170, 6974, 24374, 3670, 2470, 1656, 3771, 3, 3771, 2116, 3670, 2470, 2118, 3633, 2377, 2154, 2170, 6974, 2138, 6969, 2886, 2062, 18, 3]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [8]:
train_ds = train_cs.map(lambda data: transform(data["hypothesis"], data["premise"], tokenizer), remove_columns=["guid", "source", "hypothesis", "premise"], batched=True).rename_column("label", "labels")
valid_ds = valid_cs.map(lambda data: transform(data["hypothesis"], data["premise"], tokenizer), remove_columns=["guid", "source", "hypothesis", "premise"], batched=True).rename_column("label", "labels")
test_ds = test_cs.map(lambda data: transform(data["hypothesis"], data["premise"], tokenizer), remove_columns=["guid", "source", "hypothesis", "premise"], batched=True).rename_column("label", "labels")


## Prepare model
Load `klue/bert-base` with a sequence classification head for pairwise text classification.

In [9]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("klue/bert-base", num_labels=3)
print(model.__class__)

Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized

<class 'transformers.models.bert.modeling_bert.BertForSequenceClassification'>


## Train model
Train the model with the `Trainer` class.

- https://huggingface.co/transformers/custom_datasets.html?highlight=trainer#fine-tuning-with-trainer

In [10]:
import numpy as np
from transformers.data.data_collator import DataCollatorWithPadding
from sklearn.metrics import accuracy_score


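def compute_metrics(p):
    # p unpacks into (predictions, label_ids); predictions are per-class logits.
    pred, labels = p
    # The predicted class is the index of the largest logit.
    pred = np.argmax(pred, axis=1)
    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    return {"accuracy": accuracy}


# Pad each mini-batch to the length of its longest sequence.
batchify = DataCollatorWithPadding(
    tokenizer=tokenizer,
    padding="longest",
)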
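
As a sanity check on what `compute_metrics` computes, the same argmax-then-accuracy logic can be sketched in plain Python. The function names and the toy logits below are illustrative assumptions, not values produced by the model.

```python
from typing import List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits: List[Sequence[float]], labels: List[int]) -> float:
    """Fraction of rows whose argmax matches the gold label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


toy_logits = [
    [2.0, 0.1, -1.0],   # predicted class 0
    [0.3, 0.2, 1.5],    # predicted class 2
    [-0.5, 0.9, 0.4],   # predicted class 1
    [1.2, 1.1, 0.0],    # predicted class 0
]
toy_labels = [0, 2, 2, 0]
print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```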

In [11]:
# Check how a mini-batch is assembled
batchify(train_ds[:2])

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'input_ids': tensor([[    2,  4400,  3633,  2067,  2382,  6657,  4022,  2507, 12551, 21383,
         22883,  1751,  5153,  2069,   585,  2118,  2259,  1380,  2886,  3683,
            16,  3769,  2138, 14294,  2470,  9913,  2073,  1764,  2020,  4022,
          2170,  3857,  7285,  7391,  2069,  4491,  2259,   575,  2073,  3669,
         28674,    18,     3,  3857,  2031,  2073,  4400,  3633,  2067,  3353,
          6657,  4022,  2200,  4534,  7391,  2069,  6227,  2062,    18,     3],
        [    2,  3771,  2067,  2333,  3633,  2377,  2154,  2170,  6974, 243
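
The output above shows the shorter sequence padded on the right, with zeros in `attention_mask` marking the padded positions. A minimal sketch of this "pad to the longest sequence in the batch" behavior follows. The helper `pad_to_longest` is hypothetical, and `pad_id=0` is an assumption about the tokenizer's `[PAD]` id (check `tokenizer.pad_token_id`).

```python
from typing import Dict, List


def pad_to_longest(batch: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy ids for illustration only.
padded = pad_to_longest([[2, 11, 12, 3, 21, 3], [2, 13, 3, 22, 3]])
print(padded["input_ids"][1])       # [2, 13, 3, 22, 3, 0]
print(padded["attention_mask"][1])  # [1, 1, 1, 1, 1, 0]
```

Padding per batch rather than to a fixed global length keeps batches of short sentence pairs small.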

In [12]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',     
    evaluation_strategy="epoch",
    per_device_train_batch_size=32, 
    per_device_eval_batch_size=32,
    learning_rate=1e-4,
    weight_decay=0.01,
    adam_beta1=.9,
    adam_beta2=.95,
    adam_epsilon=1e-8,
    max_grad_norm=1.,
    num_train_epochs=2,    
    lr_scheduler_type="linear",
    warmup_steps=100,
    logging_dir='./logs',
    logging_strategy="steps",
    logging_first_step=True,
    logging_steps=100,
    save_strategy="epoch",
    seed=42,
    dataloader_drop_last=False,
    dataloader_num_workers=2
)

trainer = Trainer(
    args=training_args,
    data_collator=batchify,
    model=model,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    compute_metrics=compute_metrics
)

trainer.train()

***** Running training *****
  Num examples = 22498
  Num Epochs = 2
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 1408


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.5067        | 0.449375        | 0.8236   |
| 2     | 0.2603        | 0.427146        | 0.8512   |


***** Running Evaluation *****
  Num examples = 2500
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-704
Configuration saved in ./results/checkpoint-704/config.json
Model weights saved in ./results/checkpoint-704/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 2500
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-1408
Configuration saved in ./results/checkpoint-1408/config.json
Model weights saved in ./results/checkpoint-1408/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1408, training_loss=0.45023471320217306, metrics={'train_runtime': 354.8486, 'train_samples_per_second': 126.803, 'train_steps_per_second': 3.968, 'total_flos': 1592668684552680.0, 'train_loss': 0.45023471320217306, 'epoch': 2.0})
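
The log reports 1408 optimization steps, which follows from 22498 training examples, a batch size of 32, no dropped last batch, and 2 epochs. The sketch below reproduces that arithmetic and approximates the `linear` schedule with `warmup_steps=100` as a linear ramp up to the peak learning rate followed by a linear decay to zero. The function names are hypothetical and the formula is a simplified stand-in for the library's scheduler, not a copy of it.

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of mini-batches per epoch when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


def linear_warmup_lr(step: int, peak_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


total = 2 * steps_per_epoch(22498, 32)
print(total)  # 1408

# Learning rate at the start, mid-warmup, the peak, mid-decay, and the end.
for step in (0, 50, 100, 754, total):
    print(step, linear_warmup_lr(step, 1e-4, 100, total))
```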

In [13]:
trainer.evaluate(test_ds)

***** Running Evaluation *****
  Num examples = 3000
  Batch size = 32


{'epoch': 2.0,
 'eval_accuracy': 0.8046666666666666,
 'eval_loss': 0.5251408219337463,
 'eval_runtime': 7.3011,
 'eval_samples_per_second': 410.897,
 'eval_steps_per_second': 12.875}
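
To use the fine-tuned model on a single premise/hypothesis pair, its three logits must be turned into a label. The sketch below shows that last step in plain Python. The label order `entailment, neutral, contradiction` is an assumption that should be verified with `train_cs.features["label"].names`, and the logits are made-up values, not model output.

```python
import math
from typing import List, Tuple

# Assumed id-to-name mapping for KLUE NLI; verify against the dataset features.
LABEL_NAMES = ["entailment", "neutral", "contradiction"]


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities; subtracting the max keeps exp() stable."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits: List[float]) -> Tuple[str, float]:
    """Return the most probable label name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LABEL_NAMES[best], probs[best]


# Made-up logits for illustration only.
name, prob = predict_label([3.1, 0.2, -1.4])
print(name)  # entailment
```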