<a href="https://colab.research.google.com/github/seopbo/nlp_tutorials/blob/main/Single_text_classification_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Single text classification - BERT
- The pre-trained language model is `klue/bert-base`.
  - https://huggingface.co/klue/bert-base
- The example dataset for the single text classification task is `nsmc`.
  - https://huggingface.co/datasets/nsmc

## Setup
You can check which GPU has been allocated by running the code cell below.

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Thu Nov 18 05:23:11 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Run the code cell below to install and load the libraries needed to run this notebook.

In [None]:
!pip install torch
!pip install transformers
!pip install datasets
!pip install -U scikit-learn

import torch
import transformers
import datasets

Collecting transformers
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.1.2-py3-none-any.whl (59 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Atte

## Data preprocessing
1. Load the subword tokenizer used by `klue/bert-base`.
2. Load `nsmc` with the `datasets` library.
3. Use the subword tokenizer from step 1 to transform the `nsmc` data into train examples, the form required for single text classification.
  - Each sentence becomes `[CLS] tok 1 ... tok N [SEP]`, which is then transformed into list_of_integers.
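
The shape of step 3 can be sketched without the real tokenizer. The snippet below is only a toy: `toy_vocab` and `toy_encode` are hypothetical names, the vocabulary is made up, and whitespace splitting stands in for subword tokenization. The ids 2 and 3 for `[CLS]` and `[SEP]` match the first and last ids in the notebook's tokenizer output; the id 1 for unknown tokens is an assumption.

```python
# Toy illustration only: this vocabulary is made up, it is not the klue/bert-base vocab.
CLS_ID, SEP_ID = 2, 3  # match the first/last ids seen in the notebook's tokenizer output
UNK_ID = 1             # assumed id for out-of-vocabulary tokens

toy_vocab = {"this": 10, "movie": 11, "is": 12, "fun": 13}


def toy_encode(sentence):
  """Whitespace-split a sentence and wrap its token ids as [CLS] ... [SEP]."""
  tokens = sentence.lower().split()
  ids = [toy_vocab.get(token, UNK_ID) for token in tokens]
  return [CLS_ID] + ids + [SEP_ID]


encoded = toy_encode("This movie is fun")
print(encoded)  # [2, 10, 11, 12, 13, 3]
```

The real tokenizer differs mainly in how it splits text into subwords; the wrapping with special tokens and the mapping to integers follow the same pattern.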


Load `nsmc` and create the raw splits `train_cs`, `valid_cs`, and `test_cs`, from which `train_ds`, `valid_ds`, and `test_ds` are built.

In [None]:
from datasets import load_dataset

cs = load_dataset("nsmc", split="train")
cs = cs.train_test_split(0.1)
test_cs = load_dataset("nsmc", split="test")
train_cs = cs["train"]
valid_cs = cs["test"]

Using custom data configuration default


Downloading and preparing dataset nsmc/default (download: 18.62 MiB, generated: 20.90 MiB, post-processed: Unknown size, total: 39.52 MiB) to /root/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3...


Dataset nsmc downloaded and prepared to /root/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3. Subsequent calls will reuse this data.


Using custom data configuration default
Reusing dataset nsmc (/root/.cache/huggingface/datasets/nsmc/default/1.1.0/bfd4729bf1a67114e5267e6916b9e4807010aeb238e4a3c2b95fbfa3a014b5f3)
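
To make the split sizes concrete, here is a rough pure-Python sketch of a 90/10 hold-out split. `toy_train_test_split` is a hypothetical helper, not the `datasets` implementation, and the 150,000 figure is inferred from the 135,000 training and 15,000 validation examples reported in the training log.

```python
import random


def toy_train_test_split(examples, test_size, seed=0):
  """Shuffle indices and hold out a `test_size` fraction of the examples."""
  indices = list(range(len(examples)))
  random.Random(seed).shuffle(indices)
  n_test = round(len(examples) * test_size)
  test = [examples[i] for i in indices[:n_test]]
  train = [examples[i] for i in indices[n_test:]]
  return train, test


toy_examples = list(range(150_000))  # stand-in for the nsmc training reviews
toy_train, toy_valid = toy_train_test_split(toy_examples, 0.1)
print(len(toy_train), len(toy_valid))  # 135000 15000
```

The notebook calls `train_test_split(0.1)` without a seed, so the validation set changes from run to run; passing the `seed` argument of `Dataset.train_test_split` should make the split reproducible.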


Load the tokenizer, then define and apply the function used for the transform.

In [None]:
from transformers import AutoTokenizer, AutoConfig

tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
config = AutoConfig.from_pretrained("klue/bert-base")

print(tokenizer.__class__)
print(config.__class__)

<class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>
<class 'transformers.models.bert.configuration_bert.BertConfig'>


In [None]:
from typing import Union, List


def transform(sentences: Union[str, List[str]], tokenizer):
  """Encode one sentence, or a list of sentences, as `[CLS] tok 1 ... tok N [SEP]` ids."""

  def encode(sentence: str):
    # subword-tokenize, map the tokens to ids, then add the special tokens
    list_of_tokens = tokenizer.tokenize(sentence)
    list_of_ids = tokenizer.convert_tokens_to_ids(list_of_tokens)
    return tokenizer.prepare_for_model(list_of_ids, add_special_tokens=True, padding=False, truncation=False)

  if isinstance(sentences, list):
    # gather the per-example fields into a dict of lists, the layout `map(batched=True)` expects
    training_examples = {}
    for sentence in sentences:
      for key, value in encode(sentence).items():
        training_examples.setdefault(key, []).append(value)
    return training_examples
  return encode(sentences)

samples = train_cs[:2]
transformed_samples = transform(samples["document"], tokenizer)

print(samples)
print(transformed_samples)

{'id': ['8852339', '8694184'], 'document': ['평점이왜이렇게낮지?난 이거 또다운받고있다.3번째...정말잼있던데...끝에 견자단하고 대빵하고싸울땐 정말 가슴이뜨거워졌다.난정말로 잼있게봤다...최근댓글이니 내가 알바아닌줄은 알거다.', '재미없습니다 그리고 신음 소리땜에 당황했네요 아 물런 전 21살 영화가 참 재미없어보였는데 평점들 보고 재밌구나 하고 봤는데 젠장 재수없게 속았네요 이런게 도대체 뭐가 재밌다는건지 망할만한 영화였습니다 전 공짜로봐서다행입니다 이걸돈내고보면미친년'], 'label': [1, 0]}
{'input_ids': [[2, 20609, 2052, 3132, 2052, 7633, 3264, 2118, 35, 720, 4647, 918, 20721, 2757, 2088, 2689, 2062, 18, 23, 2517, 3135, 18, 18, 18, 3944, 3468, 2689, 2414, 2147, 18, 18, 18, 711, 2170, 586, 2155, 2286, 19521, 823, 2625, 19521, 2935, 2177, 2355, 3944, 4494, 2052, 2751, 2180, 14578, 2062, 18, 720, 2287, 5466, 1530, 2689, 2318, 3072, 2062, 18, 18, 18, 3744, 3315, 2701, 2052, 2209, 732, 2116, 14321, 30428, 2776, 2073, 1381, 2180, 2062, 18, 3], [2, 19113, 2219, 3606, 3673, 12433, 3856, 3515, 2170, 7389, 2371, 2203, 2182, 1376, 1093, 2957, 1537, 4041, 2593, 3771, 2116, 1637, 19113, 15882, 2507, 13964, 20609, 2031, 4530, 7478, 6074, 6159, 1170, 13964, 21195, 10970, 2899, 2318, 1
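
The list branch of `transform` returns one dict of lists rather than a list of dicts, because a batched `map` works on columns. A minimal sketch of that regrouping, using a hypothetical `collate_to_columns` helper and made-up ids:

```python
def collate_to_columns(list_of_examples):
  """Turn a list of per-example dicts into one dict of lists (column layout)."""
  columns = {}
  for example in list_of_examples:
    for key, value in example.items():
      columns.setdefault(key, []).append(value)
  return columns


# Hypothetical encoded examples; the ids are illustrative, not real tokenizer output.
toy_encoded = [
  {"input_ids": [2, 10, 11, 3], "attention_mask": [1, 1, 1, 1]},
  {"input_ids": [2, 12, 3], "attention_mask": [1, 1, 1]},
]
columns = collate_to_columns(toy_encoded)
print(columns["input_ids"])  # [[2, 10, 11, 3], [2, 12, 3]]
```

Note that the sequences keep their different lengths at this stage; padding is deferred to batching time.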

In [None]:
train_ds = train_cs.map(lambda data: transform(data["document"], tokenizer), remove_columns=["id", "document"], batched=True).rename_column("label", "labels")
valid_ds = valid_cs.map(lambda data: transform(data["document"], tokenizer), remove_columns=["id", "document"], batched=True).rename_column("label", "labels")
test_ds = test_cs.map(lambda data: transform(data["document"], tokenizer), remove_columns=["id", "document"], batched=True).rename_column("label", "labels")


## Prepare model
Load `klue/bert-base` to perform single text classification.

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("klue/bert-base", num_labels=2)

print(model.__class__)

Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized

<class 'transformers.models.bert.modeling_bert.BertForSequenceClassification'>
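
The warning in the output says the pre-training heads are dropped and some weights are not initialized from the checkpoint: the new classification head has to be learned during fine-tuning. As a rough sketch of what such a head computes, assuming a single linear layer over a pooled sentence feature (dropout omitted), with hypothetical toy weights and a 4-dimensional feature in place of the real hidden size:

```python
import math


def linear(features, weights, bias):
  """Compute one logit per label: dot(weights[label], features) + bias[label]."""
  return [sum(w * x for w, x in zip(row, features)) + b for row, b in zip(weights, bias)]


def softmax(logits):
  """Numerically stable softmax over a list of logits."""
  peak = max(logits)
  exps = [math.exp(v - peak) for v in logits]
  total = sum(exps)
  return [e / total for e in exps]


# Hypothetical pooled feature and head parameters, chosen only for illustration.
pooled = [0.5, -1.0, 0.25, 2.0]
head_weights = [[0.1, 0.2, -0.3, 0.0], [-0.2, 0.1, 0.4, 0.3]]  # one row per label (num_labels=2)
head_bias = [0.0, 0.0]

logits = linear(pooled, head_weights, head_bias)
probs = softmax(logits)
loss = -math.log(probs[1])  # cross-entropy when the true label is 1
print(logits, probs, loss)
```

Because the head starts from random values, predictions are uninformative until the model has been trained on `nsmc`.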


## Training model
Train with the `Trainer` class.

- https://huggingface.co/transformers/custom_datasets.html?highlight=trainer#fine-tuning-with-trainer

In [None]:
import numpy as np
from transformers.data.data_collator import DataCollatorWithPadding
from sklearn.metrics import accuracy_score


batchify = DataCollatorWithPadding(
    tokenizer=tokenizer,
    padding="longest",
)


def compute_metrics(p):
  # p unpacks into (logits, labels); the predicted class is the argmax over the label axis
  pred, labels = p
  pred = np.argmax(pred, axis=1)
  accuracy = accuracy_score(y_true=labels, y_pred=pred)
  return {"accuracy": accuracy}
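
Since `transform` leaves every example unpadded, padding happens per batch. The sketch below shows the idea behind `padding="longest"` in plain Python. `pad_to_longest` is a hypothetical helper, not the collator's implementation, and the pad id of 0 is an assumption about the tokenizer.

```python
def pad_to_longest(batch, pad_id=0):
  """Right-pad every `input_ids` list to the longest length in the batch."""
  longest = max(len(example["input_ids"]) for example in batch)
  padded = {"input_ids": [], "attention_mask": []}
  for example in batch:
    n_real = len(example["input_ids"])
    n_pad = longest - n_real
    padded["input_ids"].append(example["input_ids"] + [pad_id] * n_pad)
    # 1 marks a real token, 0 marks padding that the model should ignore
    padded["attention_mask"].append([1] * n_real + [0] * n_pad)
  return padded


toy_batch = [{"input_ids": [2, 10, 11, 12, 3]}, {"input_ids": [2, 13, 3]}]
padded_batch = pad_to_longest(toy_batch)
print(padded_batch["input_ids"])       # [[2, 10, 11, 12, 3], [2, 13, 3, 0, 0]]
print(padded_batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Padding to the longest sequence in each batch, rather than to a fixed maximum length, keeps batches of short reviews small.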

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          
    evaluation_strategy="steps",
    eval_steps=1000,
    per_device_train_batch_size=64, 
    per_device_eval_batch_size=128,
    learning_rate=1e-4,
    weight_decay=0.01,
    adam_beta1=.9,
    adam_beta2=.95,
    adam_epsilon=1e-8,
    max_grad_norm=1.,
    num_train_epochs=2,    
    lr_scheduler_type="linear",
    warmup_steps=100,
    logging_dir='./logs',
    logging_strategy="steps",
    logging_first_step=True,
    logging_steps=100,
    save_strategy="epoch",
    seed=42,
    dataloader_drop_last=False,
    dataloader_num_workers=2
)

trainer = Trainer(
    args=training_args,
    data_collator=batchify,
    model=model,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    compute_metrics=compute_metrics
)

trainer.train()

***** Running training *****
  Num examples = 135000
  Num Epochs = 2
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 4220


| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1000 | 0.279 | 0.286065 | 0.8788 |
| 2000 | 0.2712 | 0.24527 | 0.902067 |
| 3000 | 0.1738 | 0.280479 | 0.9016 |
| 4000 | 0.1562 | 0.261864 | 0.9052 |


***** Running Evaluation *****
  Num examples = 15000
  Batch size = 128
***** Running Evaluation *****
  Num examples = 15000
  Batch size = 128
Saving model checkpoint to ./results/checkpoint-2110
Configuration saved in ./results/checkpoint-2110/config.json
Model weights saved in ./results/checkpoint-2110/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 15000
  Batch size = 128
***** Running Evaluation *****
  Num examples = 15000
  Batch size = 128
Saving model checkpoint to ./results/checkpoint-4220
Configuration saved in ./results/checkpoint-4220/config.json
Model weights saved in ./results/checkpoint-4220/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=4220, training_loss=0.23283525819744544, metrics={'train_runtime': 2217.7774, 'train_samples_per_second': 121.744, 'train_steps_per_second': 1.903, 'total_flos': 1.119332807266848e+16, 'train_loss': 0.23283525819744544, 'epoch': 2.0})
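
The numbers in the log can be checked by hand. The sketch below recomputes the step counts from the settings in `TrainingArguments` and evaluates a linear warmup and decay schedule at a few steps. `linear_schedule_lr` is a hypothetical helper written for illustration; it approximates what `lr_scheduler_type="linear"` with `warmup_steps=100` is expected to do, and is not the library's code.

```python
import math

num_examples = 135_000  # "Num examples" in the training log
batch_size = 64         # per_device_train_batch_size on a single device
num_epochs = 2
warmup_steps = 100
peak_lr = 1e-4

# The last partial batch is kept because dataloader_drop_last=False, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs


def linear_schedule_lr(step):
  """Linear warmup from 0 to peak_lr, then linear decay to 0 at total_steps."""
  if step < warmup_steps:
    return peak_lr * step / max(1, warmup_steps)
  return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


print(steps_per_epoch, total_steps)  # 2110 4220
for step in (0, 50, 100, 2160, 4220):
  print(step, linear_schedule_lr(step))
```

The 2110 steps per epoch also explain the checkpoint names `checkpoint-2110` and `checkpoint-4220`, since `save_strategy="epoch"` saves at the end of each epoch.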

In [None]:
trainer.evaluate(test_ds)

***** Running Evaluation *****
  Num examples = 50000
  Batch size = 128


{'epoch': 2.0,
 'eval_accuracy': 0.90632,
 'eval_loss': 0.2563643455505371,
 'eval_runtime': 128.2681,
 'eval_samples_per_second': 389.809,
 'eval_steps_per_second': 3.048}
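
To see what `eval_accuracy` measures, the sketch below repeats the argmax-then-compare logic of `compute_metrics` in plain Python on made-up logits. `argmax` and `toy_accuracy` are hypothetical helpers, and the logits are illustrative rather than model output.

```python
def argmax(values):
  """Index of the largest value (the first one on ties)."""
  return max(range(len(values)), key=lambda i: values[i])


def toy_accuracy(logits, labels):
  """Fraction of examples whose highest-scoring class equals the label."""
  predictions = [argmax(row) for row in logits]
  correct = sum(int(p == y) for p, y in zip(predictions, labels))
  return correct / len(labels)


# Hypothetical logits for four reviews, columns = (label 0, label 1).
toy_logits = [[2.1, -1.0], [-0.3, 0.8], [0.2, 0.1], [-1.5, 1.7]]
toy_labels = [0, 1, 1, 1]
print(toy_accuracy(toy_logits, toy_labels))  # 0.75

# Reading the reported test accuracy back as a count of correct predictions.
print(round(0.90632 * 50_000))  # 45316
```

In other words, an accuracy of 0.90632 on the 50,000 test reviews corresponds to 45,316 correctly classified reviews.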