<a href="https://colab.research.google.com/github/seopbo/nlp_tutorials/blob/main/token_classification_(klue_ner)_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Token text classification - BERT
- The pre-trained language model is `klue/bert-base`.
  - https://huggingface.co/klue/bert-base
- The example dataset for the token classification task is the ner subset of klue.
  - https://huggingface.co/datasets/klue

## Setup
Run the code cell below to check which GPU has been allocated.

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)

if gpu_info.find('failed') >= 0:
    print('Not connected to a GPU')
else:
    print(gpu_info)

Mon Dec 27 02:44:35 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Run the code cell below to install and load the libraries needed to run this notebook.

In [None]:
!pip install torch
!pip install transformers
!pip install datasets
!pip install seqeval
!pip install numpy
!pip install scikit-learn

from pprint import pprint
import sklearn.metrics  # load the metrics submodule explicitly; it is used for the char-level F1 later
import torch
import transformers
import datasets

Collecting transformers
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attem

## Preprocess data
1. Load the subword tokenizer used by `klue/bert-base`.
2. Load the klue ner dataset with the `datasets` library.
3. Use the subword tokenizer from step 1 to transform the klue ner data into train examples, a form on which token classification can be performed.

- Build `[CLS] tok 1 ... tok N [SEP]` and transform it into a list of integers.
- The original klue ner labels are attached at the character level, so we write a function, `relabel_to_token`, that adapts them to the token level.



In [None]:
from transformers import AutoTokenizer, AutoConfig

tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")

print(tokenizer.__class__)


<class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>


In [None]:
from datasets import load_dataset

cs = load_dataset("klue", "ner", split="train")
cs = cs.train_test_split(0.1)
test_cs = load_dataset("klue", "ner", split="validation")
train_cs = cs["train"]
valid_cs = cs["test"]

Downloading and preparing dataset klue/ner (download: 4.11 MiB, generated: 23.68 MiB, post-processed: Unknown size, total: 27.79 MiB) to /root/.cache/huggingface/datasets/klue/ner/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e...


Dataset klue downloaded and prepared to /root/.cache/huggingface/datasets/klue/ner/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e. Subsequent calls will reuse this data.


Reusing dataset klue (/root/.cache/huggingface/datasets/klue/ner/1.0.0/e0fc3bc3de3eb03be2c92d72fd04a60ecc71903f821619cb28ca0e1e29e4233e)


Inspecting the klue ner dataset shows that BIO tags are labelled per character.

In [None]:
# List of labels
list_of_labels = train_cs.features["ner_tags"].feature.names
print(list_of_labels)

['B-DT', 'I-DT', 'B-LC', 'I-LC', 'B-OG', 'I-OG', 'B-PS', 'I-PS', 'B-QT', 'I-QT', 'B-TI', 'I-TI', 'O']


In [None]:
example = train_cs[4]
original_sentence = example["sentence"]
original_clean_sentence = "".join(example["tokens"]).replace("\xa0"," ")
original_clean_tokens = example["tokens"]
original_clean_labels = example["ner_tags"]

pprint(list(zip(original_clean_tokens, original_clean_labels)))

[('다', 12),
 ('음', 12),
 ('은', 12),
 (' ', 12),
 ('공', 2),
 ('주', 3),
 ('사', 3),
 ('대', 3),
 ('부', 3),
 ('고', 3),
 (' ', 12),
 ('대', 12),
 ('강', 12),
 ('당', 12),
 ('에', 12),
 (' ', 12),
 ('차', 12),
 ('려', 12),
 ('진', 12),
 (' ', 12),
 ('합', 12),
 ('동', 12),
 (' ', 12),
 ('분', 12),
 ('향', 12),
 ('소', 12),
 (' ', 12),
 ('모', 12),
 ('습', 12),
 ('입', 12),
 ('니', 12),
 ('다', 12),
 ('.', 12)]
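
To make the character-level BIO scheme concrete, the sketch below groups tag ids like the ones printed above into entity spans. The helper name `chars_to_spans` is hypothetical (it is not part of this notebook), the label list is copied from the earlier output, and only the first part of the sentence is used.

```python
# Label list copied from the notebook output above.
LIST_OF_LABELS = ['B-DT', 'I-DT', 'B-LC', 'I-LC', 'B-OG', 'I-OG',
                  'B-PS', 'I-PS', 'B-QT', 'I-QT', 'B-TI', 'I-TI', 'O']


def chars_to_spans(chars, tag_ids, labels=LIST_OF_LABELS):
    """Group character-level BIO tag ids into (entity_type, text) spans."""
    spans = []
    cur_type, cur_chars = None, []
    for ch, tag_id in zip(chars, tag_ids):
        tag = labels[tag_id]
        if tag.startswith("B-"):
            # A B- tag always opens a new span (closing any open one).
            if cur_type is not None:
                spans.append((cur_type, "".join(cur_chars)))
            cur_type, cur_chars = tag[2:], [ch]
        elif tag.startswith("I-") and cur_type == tag[2:]:
            # An I- tag of the same type extends the open span.
            cur_chars.append(ch)
        else:
            # "O" (or a stray I- tag) closes the open span.
            if cur_type is not None:
                spans.append((cur_type, "".join(cur_chars)))
            cur_type, cur_chars = None, []
    if cur_type is not None:
        spans.append((cur_type, "".join(cur_chars)))
    return spans


chars = list("다음은 공주사대부고 대강당에")
tag_ids = [12, 12, 12, 12, 2, 3, 3, 3, 3, 3, 12, 12, 12, 12, 12]
spans = chars_to_spans(chars, tag_ids)
print(spans)  # [('LC', '공주사대부고')]
```

Here `2` and `3` are the ids of `B-LC` and `I-LC`, so the six characters of 공주사대부고 form a single location entity.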


We write `relabel_to_token`, a function that adapts data labelled per character, as shown above, to the token level.

In [None]:
def relabel_to_token(original_clean_labels, offset_mappings):
    """Assign each token the label of its first character."""
    labels_of_tokens = []

    for offset_mapping in offset_mappings:
        cur_start_offset, cur_end_offset = offset_mapping
        if cur_start_offset == cur_end_offset:
            # Empty span (e.g. a special token): -100 is ignored by the loss.
            labels_of_tokens.append(-100)
            continue
        labels_of_tokens.append(original_clean_labels[cur_start_offset])
    return labels_of_tokens

offset_mappings = tokenizer(
    original_clean_sentence,
    return_offsets_mapping=True,
    return_attention_mask=False,
    return_token_type_ids=False,
    add_special_tokens=False,
    padding=False,
    truncation=False,
)["offset_mapping"]
labels_of_tokens = relabel_to_token(original_clean_labels, offset_mappings)
pprint(list(zip(tokenizer.tokenize(original_clean_sentence), labels_of_tokens)))

[('다음', 12),
 ('##은', 12),
 ('공주', 2),
 ('##사', 3),
 ('##대', 3),
 ('##부', 3),
 ('##고', 3),
 ('대강', 12),
 ('##당', 12),
 ('##에', 12),
 ('차려', 12),
 ('##진', 12),
 ('합동', 12),
 ('분향소', 12),
 ('모습', 12),
 ('##입니다', 12),
 ('.', 12)]
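
The rule inside `relabel_to_token` is that each token takes the label of its first character, and tokens with an empty offset span receive -100. The sketch below replays that rule on hand-written offsets. The offsets are assumptions that mimic what a fast tokenizer returns when special tokens are added (they are not actual tokenizer output), and `first_char_label` is a hypothetical name.

```python
def first_char_label(char_labels, offsets, ignore_index=-100):
    """Give each token the label of its first character; empty spans get ignore_index."""
    token_labels = []
    for start, end in offsets:
        if start == end:
            # Special tokens such as [CLS] / [SEP] have an empty (0, 0) span.
            token_labels.append(ignore_index)
        else:
            token_labels.append(char_labels[start])
    return token_labels


# Character-level tag ids for "다음은 공주사대부고" (12 = O, 2 = B-LC, 3 = I-LC).
char_labels = [12, 12, 12, 12, 2, 3, 3, 3, 3, 3]
# Hand-written (start, end) offsets:
# [CLS], 다음, ##은, 공주, ##사, ##대, ##부, ##고, [SEP]
offsets = [(0, 0), (0, 2), (2, 3), (4, 6), (6, 7), (7, 8), (8, 9), (9, 10), (0, 0)]

token_labels = first_char_label(char_labels, offsets)
print(token_labels)  # [-100, 12, 12, 2, 3, 3, 3, 3, -100]
```

-100 is the default `ignore_index` of PyTorch's cross-entropy loss, so those positions do not contribute to the training loss.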


The `relabel_to_token` function defined above is used inside the `transform` function.

In [None]:
def transform(example, tokenizer):
    original_clean_sentence = "".join(example["tokens"]).replace("\xa0"," ")
    original_clean_labels = example["ner_tags"]

    encoded = tokenizer(original_clean_sentence, return_offsets_mapping=True, return_attention_mask=True, return_token_type_ids=True, add_special_tokens=True, padding=False, truncation=False)
    labels = relabel_to_token(original_clean_labels, encoded["offset_mapping"])
    encoded.update({"labels": labels})
    return encoded

In [None]:
train_ds = train_cs.map(lambda example: transform(example, tokenizer), remove_columns=["sentence"], batched=False)
valid_ds = valid_cs.map(lambda example: transform(example, tokenizer), remove_columns=["sentence"], batched=False)
test_ds = test_cs.map(lambda example: transform(example, tokenizer), remove_columns=["sentence"], batched=False)


## Prepare model
Load `klue/bert-base` to perform token classification.

In [None]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("klue/bert-base", num_labels=13)

print(model.__class__)


Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForTokenClassification: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the mo

<class 'transformers.models.bert.modeling_bert.BertForTokenClassification'>


## Train model
Train the model with the `Trainer` class.

- https://huggingface.co/transformers/custom_datasets.html?highlight=trainer#fine-tuning-with-trainer

In [None]:
import numpy as np
from transformers.data.data_collator import DataCollatorForTokenClassification
from datasets import load_metric

metric = load_metric("seqeval")


def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [list_of_labels[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [list_of_labels[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }


batchify = DataCollatorForTokenClassification(
    tokenizer=tokenizer,
    padding=True
)
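
The filtering step in `compute_metrics` can be checked on a tiny hand-made batch. The logits and gold labels below are made-up values for illustration, and the three-item `labels_list` is a shortened stand-in for `list_of_labels`.

```python
import numpy as np

labels_list = ["B-PS", "I-PS", "O"]  # shortened stand-in for list_of_labels

# Fake logits with shape (batch=1, seq_len=5, num_labels=3).
logits = np.array([[
    [0.1, 0.2, 0.9],   # [CLS] position (ignored)
    [2.0, 0.1, 0.3],   # argmax -> B-PS
    [0.2, 1.5, 0.1],   # argmax -> I-PS
    [0.3, 0.2, 1.1],   # argmax -> O
    [0.5, 0.4, 0.3],   # [SEP] position (ignored)
]])
gold = np.array([[-100, 0, 1, 1, -100]])

pred_ids = np.argmax(logits, axis=2)

# Keep only positions whose gold label is not the ignore index.
true_predictions = [
    [labels_list[p] for p, l in zip(pred_row, gold_row) if l != -100]
    for pred_row, gold_row in zip(pred_ids, gold)
]
true_labels = [
    [labels_list[l] for l in gold_row if l != -100]
    for gold_row in gold
]

print(true_predictions)  # [['B-PS', 'I-PS', 'O']]
print(true_labels)       # [['B-PS', 'I-PS', 'I-PS']]
```

The two nested lists of tag strings are the form that the seqeval metric receives in `compute_metrics`.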

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',     
    evaluation_strategy="epoch",
    per_device_train_batch_size=32, 
    per_device_eval_batch_size=32,
    learning_rate=1e-4,
    weight_decay=0.01,
    adam_beta1=.9,
    adam_beta2=.95,
    adam_epsilon=1e-8,
    max_grad_norm=1.,
    num_train_epochs=2,
    lr_scheduler_type="linear",
    warmup_steps=100,
    logging_dir='./logs',
    logging_strategy="steps",
    logging_first_step=True,
    logging_steps=100,
    save_strategy="epoch",
    seed=42,
    dataloader_drop_last=False,
    dataloader_num_workers=2
)

trainer = Trainer(
    args=training_args,
    data_collator=batchify,
    model=model,
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    compute_metrics=compute_metrics
)

trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set  don't have a corresponding argument in `BertForTokenClassification.forward` and have been ignored: offset_mapping, ner_tags, tokens.
***** Running training *****
  Num examples = 18907
  Num Epochs = 2
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 1182


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.0695 | 0.066514 | 0.870283 | 0.899321 | 0.884564 | 0.979727 |
| 2 | 0.0303 | 0.067305 | 0.884119 | 0.902813 | 0.893368 | 0.980515 |


The following columns in the evaluation set  don't have a corresponding argument in `BertForTokenClassification.forward` and have been ignored: offset_mapping, ner_tags, tokens.
***** Running Evaluation *****
  Num examples = 2101
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-591
Configuration saved in ./results/checkpoint-591/config.json
Model weights saved in ./results/checkpoint-591/pytorch_model.bin
The following columns in the evaluation set  don't have a corresponding argument in `BertForTokenClassification.forward` and have been ignored: offset_mapping, ner_tags, tokens.
***** Running Evaluation *****
  Num examples = 2101
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-1182
Configuration saved in ./results/checkpoint-1182/config.json
Model weights saved in ./results/checkpoint-1182/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=1182, training_loss=0.04516238271055488, metrics={'train_runtime': 288.2896, 'train_samples_per_second': 131.167, 'train_steps_per_second': 4.1, 'total_flos': 1254452265351564.0, 'train_loss': 0.04516238271055488, 'epoch': 2.0})
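
The step count reported in the log follows from the dataset size and the batch size, and the `linear` scheduler with `warmup_steps=100` ramps the learning rate up and then decays it to zero. The sketch below reproduces that arithmetic in plain Python. `linear_lr` is a hypothetical helper written for illustration, and it assumes a single device with no gradient accumulation, as in the log.

```python
import math

num_examples = 18907
batch_size = 32
num_epochs = 2
base_lr = 1e-4
warmup_steps = 100

# dataloader_drop_last=False, so the last partial batch counts as a step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 591 1182


def linear_lr(step):
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


for step in (0, 50, 100, 641, 1182):
    print(step, linear_lr(step))
```

The values 591 and 1182 match the checkpoint names `checkpoint-591` and `checkpoint-1182` saved at the end of each epoch.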

## Evaluate model
The trained model is evaluated at the entity level and at the char level. For entity-level evaluation the `compute_metrics` function defined above is sufficient, but char-level evaluation requires a separate function.
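
The difference between the two levels shows up when a prediction gets an entity boundary wrong. The sketch below uses made-up tags and two hypothetical helpers (`extract_entities`, `entity_f1`). It is a simplified illustration of exact-span matching, not seqeval's implementation, and it uses plain per-character accuracy for the char-level side to keep the example short.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in a BIO tag sequence."""
    entities, start, cur = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and cur == tag[2:]
        if cur is not None and not continues:
            entities.add((cur, start, i))
            cur = None
        if tag.startswith("B-"):
            cur, start = tag[2:], i
    return entities


def entity_f1(true_tags, pred_tags):
    """F1 over exact span matches."""
    true_ents = extract_entities(true_tags)
    pred_ents = extract_entities(pred_tags)
    tp = len(true_ents & pred_ents)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred_ents), tp / len(true_ents)
    return 2 * precision * recall / (precision + recall)


true_tags = ["B-PS", "I-PS", "I-PS", "O", "B-LC", "I-LC"]
pred_tags = ["B-PS", "I-PS", "O", "O", "B-LC", "I-LC"]  # PS span is one char short

char_accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
print(entity_f1(true_tags, pred_tags))  # 0.5
print(round(char_accuracy, 3))          # 0.833
```

A single wrong character costs the whole PS entity at the entity level, while at the char level only one of six positions is wrong.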

### Compute entity-level performance on test_ds


In [None]:
trainer.evaluate(test_ds)

The following columns in the evaluation set  don't have a corresponding argument in `BertForTokenClassification.forward` and have been ignored: offset_mapping, ner_tags, tokens.
***** Running Evaluation *****
  Num examples = 5000
  Batch size = 32


{'epoch': 2.0,
 'eval_accuracy': 0.977719109546625,
 'eval_f1': 0.8845402539416771,
 'eval_loss': 0.07692207396030426,
 'eval_precision': 0.8778647095478779,
 'eval_recall': 0.8913181019332161,
 'eval_runtime': 15.6802,
 'eval_samples_per_second': 318.873,
 'eval_steps_per_second': 10.013}

### Compute char-level performance on test_ds

In [None]:
# Run prediction once and keep the result
test_result = trainer.predict(test_ds)

The following columns in the test set  don't have a corresponding argument in `BertForTokenClassification.forward` and have been ignored: offset_mapping, ner_tags, tokens.
***** Running Prediction *****
  Num examples = 5000
  Batch size = 32


We define the `relabel_to_char` function, which converts the labels the model predicted at the token level back to the char level.

In [None]:
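def relabel_to_char(predicted_label, offset_mappings, list_of_labels):
    """Expand token-level label ids back to one label id per character."""
    labels_of_chars = []
    prev_end_offset = 0
    o_label = list_of_labels.index("O")

    for label, offset_mapping in zip(predicted_label, offset_mappings):
        cur_start_offset, cur_end_offset = offset_mapping

        # A gap between consecutive tokens (a skipped character) is labelled "O".
        if prev_end_offset != cur_start_offset:
            labels_of_chars.append(o_label)

        for idx in range(cur_end_offset - cur_start_offset):
            if idx != 0:
                # Inside a token that starts with B-X, later characters are I-X
                # (each I- id directly follows its B- id in list_of_labels).
                if list_of_labels[prev_label].startswith("B"):
                    labels_of_chars.append(label + 1)
                else:
                    labels_of_chars.append(label)
            else:
                labels_of_chars.append(label)
                prev_label = label

        prev_end_offset = cur_end_offset
    return labels_of_chars

# Token-level predictions are the argmax over the logits in `predictions`;
# `label_ids` holds the gold labels and is used only to drop -100 positions.
pred_ids = np.argmax(test_result.predictions[4], axis=-1)
predicted_label = [int(p) for p, l in zip(pred_ids, test_result.label_ids[4]) if l != -100]
example = "".join(test_ds[4]["tokens"]).replace("\xa0", " ")
trues = test_ds[4]["ner_tags"]
preds = relabel_to_char(predicted_label, tokenizer(example, return_offsets_mapping=True, add_special_tokens=False)["offset_mapping"], list_of_labels)
print(example)
print(trues, "\n", len(trues))
print(preds, "\n", len(preds))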

유시진 대위(송중기)와 의사 강모연 팀장(송혜교)의 흔적이라도 엿볼 수 있을까'하는 마음에 촬영지를 찾은 관광객들은 아쉬운 발걸음을 되돌리고 있다.
[6, 7, 7, 12, 12, 12, 12, 6, 7, 7, 12, 12, 12, 12, 12, 12, 6, 7, 7, 12, 12, 12, 12, 6, 7, 7, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12] 
 82
[6, 7, 7, 12, 12, 12, 12, 6, 7, 7, 12, 12, 12, 12, 12, 12, 6, 7, 7, 12, 12, 12, 12, 6, 7, 7, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12] 
 82
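
One way to sanity-check the two conversions is a round trip: character labels converted to token labels and back should reproduce the original sequence. The sketch below does this on hand-written offsets (assumed, not real tokenizer output) with compact re-implementations of both directions. `to_tokens` and `to_chars` are hypothetical names, and `to_chars` follows the same idea as `relabel_to_char` but is not a copy of it.

```python
LABELS = ['B-DT', 'I-DT', 'B-LC', 'I-LC', 'B-OG', 'I-OG',
          'B-PS', 'I-PS', 'B-QT', 'I-QT', 'B-TI', 'I-TI', 'O']
O_ID = LABELS.index("O")


def to_tokens(char_labels, offsets):
    """Each token takes the label of its first character."""
    return [char_labels[start] for start, _ in offsets]


def to_chars(token_labels, offsets):
    """Expand token labels to characters; gaps between tokens become O."""
    char_labels, prev_end = [], 0
    for label, (start, end) in zip(token_labels, offsets):
        # Characters skipped between tokens (whitespace) are labelled O.
        char_labels.extend([O_ID] * (start - prev_end))
        char_labels.append(label)
        # Inside a token, a B- label continues as the matching I- label.
        inside = label + 1 if LABELS[label].startswith("B-") else label
        char_labels.extend([inside] * (end - start - 1))
        prev_end = end
    return char_labels


# "다음은 공주사대부고": 12 = O, 2 = B-LC, 3 = I-LC.
char_labels = [12, 12, 12, 12, 2, 3, 3, 3, 3, 3]
offsets = [(0, 2), (2, 3), (4, 6), (6, 7), (7, 8), (8, 9), (9, 10)]

token_labels = to_tokens(char_labels, offsets)
restored = to_chars(token_labels, offsets)
print(token_labels)             # [12, 12, 2, 3, 3, 3, 3]
print(restored == char_labels)  # True
```

A trailing space after the last token is not covered by any offset, so the restored sequence would come out one label short in that case.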


Use the `relabel_to_char` function to convert the token-level predictions to the char level.

In [None]:
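trues = []
preds = []

for i in range(len(test_result.label_ids)):
    row = test_ds[i]  # fetch one example instead of materializing whole columns
    if row["tokens"][-1] == " ":
        # A trailing space is not covered by any token offset, so skip these.
        continue

    true = row["ner_tags"]
    example = "".join(row["tokens"]).replace("\xa0", " ")
    offset_mapping = tokenizer(example, return_offsets_mapping=True, add_special_tokens=False)["offset_mapping"]
    # Use the model's predictions (argmax over logits); `label_ids` holds the
    # gold labels and is used only to drop -100 positions.
    pred_ids = np.argmax(test_result.predictions[i], axis=-1)
    predicted_label = [int(p) for p, l in zip(pred_ids, test_result.label_ids[i]) if l != -100]
    pred = relabel_to_char(predicted_label, offset_mapping, list_of_labels)

    if len(true) != len(pred):
        print(i)  # report examples whose char-level lengths disagree
    trues.extend(true)
    preds.extend(pred)

    if (i + 1) % 100 == 0:
        print(f"{i+1} / {len(test_result.label_ids)}")
else:
    print(f"{i+1} / {len(test_result.label_ids)}")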

100 / 5000
200 / 5000
300 / 5000
400 / 5000
500 / 5000
600 / 5000
700 / 5000
800 / 5000
900 / 5000
1000 / 5000
1100 / 5000
1200 / 5000
1300 / 5000
1400 / 5000
1500 / 5000
1600 / 5000
1700 / 5000
1800 / 5000
1900 / 5000
2000 / 5000
2100 / 5000
2200 / 5000
2300 / 5000
2400 / 5000
2500 / 5000
2600 / 5000
2700 / 5000
2800 / 5000
2900 / 5000
3000 / 5000
3100 / 5000
3200 / 5000
3300 / 5000
3400 / 5000
3500 / 5000
3600 / 5000
3700 / 5000
3800 / 5000
3900 / 5000
4000 / 5000
4100 / 5000
4200 / 5000
4300 / 5000
4400 / 5000
4500 / 5000
4600 / 5000
4700 / 5000
4800 / 5000
4900 / 5000
5000 / 5000
5000 / 5000


In [None]:
# Macro-averaged char-level F1 over all labels, as a percentage.
sklearn.metrics.f1_score(trues, preds, labels=list(range(len(list_of_labels))), average="macro", zero_division=1) * 100.0

97.65262555798931
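
For reference, `average="macro"` takes the unweighted mean of the per-label F1 scores, so rare labels count as much as `O`. The sketch below computes that by hand on made-up labels. `macro_f1` is a hypothetical helper, and its handling of labels that never occur is a simplification that may not match the `zero_division` behaviour of every sklearn version.

```python
def macro_f1(trues, preds, labels, zero_division=1.0):
    """Unweighted mean of per-label F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(trues, preds))
        fp = sum(t != label and p == label for t, p in zip(trues, preds))
        fn = sum(t == label and p != label for t, p in zip(trues, preds))
        if tp + fp + fn == 0:
            # Label absent from both sequences: fall back to zero_division.
            scores.append(zero_division)
        else:
            scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


trues = [0, 0, 1, 2, 2, 2]
preds = [0, 1, 1, 2, 2, 2]
score = macro_f1(trues, preds, labels=[0, 1, 2])
print(round(score, 4))  # 0.7778
```

Labels 0 and 1 each score 2/3 and label 2 scores 1.0, so the macro average is (2/3 + 2/3 + 1) / 3.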