## Initializing the tokenizer

Declare the tokenizer used by the BERT (`kcbert-base`) model.

In [1]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained(
    "beomi/kcbert-base",
    do_lower_case=False,
)



## Initializing the model

Load the BERT (`kcbert-base`) model.

In [2]:
from transformers import BertConfig, BertModel
pretrained_model_config = BertConfig.from_pretrained(
    "beomi/kcbert-base"
)
model = BertModel.from_pretrained(
    "beomi/kcbert-base",
    config=pretrained_model_config,
)

Some weights of the model checkpoint at beomi/kcbert-base were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Inspect the contents of `pretrained_model_config`.

In [3]:
pretrained_model_config

BertConfig {
  "_name_or_path": "beomi/kcbert-base",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 300,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "transformers_version": "4.10.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30000
}
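A few quantities follow directly from this configuration. The sketch below copies a handful of values from the output above into a plain dictionary (so it runs without downloading anything) and derives the per-head dimension, the feed-forward expansion factor, and the size of the three embedding tables. The dictionary and variable names are illustrative only, and the parameter count covers the embedding tables alone, not the embedding LayerNorm.

```python
# Values copied from the BertConfig output above.
config = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "max_position_embeddings": 300,
    "vocab_size": 30000,
    "type_vocab_size": 2,
}

# Each attention head works on an equal slice of the hidden vector.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# The feed-forward layer widens the hidden vector by this factor.
ffn_ratio = config["intermediate_size"] // config["hidden_size"]

# Token, position, and segment (token type) embedding tables.
embedding_rows = (
    config["vocab_size"]
    + config["max_position_embeddings"]
    + config["type_vocab_size"]
)
embedding_params = embedding_rows * config["hidden_size"]

print(head_dim)          # 64
print(ffn_ratio)         # 4
print(embedding_params)  # 23271936
```

Because `position_embedding_type` is `absolute` and `max_position_embeddings` is 300, inputs longer than 300 tokens have no position embedding to look up, so they need to be truncated before reaching the model.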

## Building model inputs

Let's turn three sentences into model inputs.

In [14]:
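# Three Korean sentences: "Hello", "Hi!", "Our garden is beautiful."
sentences = ["안녕하세요", "하이!", "우리의 정원은 아름답다."]
features = tokenizer(
    sentences,
    max_length=10,         # every sequence ends up exactly 10 tokens long
    padding="max_length",  # pad shorter sequences up to max_length
    truncation=True,       # cut longer sequences down to max_length
)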

Inspect the contents of `features`.

In [15]:
features.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [16]:
features['input_ids']

[[2, 19017, 8482, 3, 0, 0, 0, 0, 0, 0],
 [2, 15830, 5, 3, 0, 0, 0, 0, 0, 0],
 [2, 10293, 2539, 10480, 13213, 12108, 17, 3, 0, 0]]
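Every row above has the same layout: a leading 2, the token ids, a 3, and then zeros up to `max_length=10`. The sketch below reproduces that layout in plain Python. It assumes that ids 2 and 3 are the `[CLS]` and `[SEP]` tokens and that 0 is the padding id; the last matches `pad_token_id` in the config, while the first two are inferred from the output rather than stated in the source. `build_input_ids` is a hypothetical helper, not part of `transformers`.

```python
def build_input_ids(token_ids, max_length, cls_id=2, sep_id=3, pad_id=0):
    """Wrap token ids in [CLS]/[SEP], then truncate and pad to max_length."""
    # Reserve two positions for the special tokens before truncating.
    kept = token_ids[: max_length - 2]
    ids = [cls_id] + kept + [sep_id]
    # Right-pad with pad_id until the sequence reaches max_length.
    return ids + [pad_id] * (max_length - len(ids))


short = build_input_ids([19017, 8482], max_length=10)
print(short)  # [2, 19017, 8482, 3, 0, 0, 0, 0, 0, 0]

# A sequence that is too long is cut so that [SEP] still fits at the end.
long = build_input_ids(list(range(100, 120)), max_length=10)
print(len(long), long[0], long[-1])  # 10 2 3
```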

In [17]:
features['attention_mask']

[[1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
 [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
 [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]
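The mask marks real tokens with 1 and padding with 0, so the model can ignore padded positions. As a minimal sketch, the same mask can be rebuilt from the `input_ids` printed earlier by comparing each id with the padding id. This assumes id 0 appears only as padding, which holds for this output; it is not a description of how the tokenizer builds the mask internally.

```python
# input_ids copied from the output above.
input_ids = [
    [2, 19017, 8482, 3, 0, 0, 0, 0, 0, 0],
    [2, 15830, 5, 3, 0, 0, 0, 0, 0, 0],
    [2, 10293, 2539, 10480, 13213, 12108, 17, 3, 0, 0],
]
pad_id = 0  # pad_token_id from the config

# 1 for real tokens, 0 for padding.
attention_mask = [[int(token != pad_id) for token in row] for row in input_ids]

# Summing a mask row gives the unpadded sequence length.
lengths = [sum(row) for row in attention_mask]
print(lengths)  # [4, 4, 8]
```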

In [18]:
features['token_type_ids']

[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
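All values are 0 here because each input is a single sentence. Segment ids only differ when a sentence pair is encoded, in which case BERT marks the first segment (including `[CLS]` and its `[SEP]`) with 0 and the second segment (including its closing `[SEP]`) with 1. The sketch below illustrates that layout; `build_token_type_ids` is a hypothetical helper, and padding is left out for brevity.

```python
def build_token_type_ids(len_a, len_b):
    """Segment ids for the layout [CLS] A [SEP] B [SEP]."""
    first = [0] * (len_a + 2)   # [CLS] + tokens of A + [SEP]
    second = [1] * (len_b + 1)  # tokens of B + [SEP]
    return first + second


pair_ids = build_token_type_ids(len_a=2, len_b=1)
print(pair_ids)  # [0, 0, 0, 0, 1, 1]
```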

## Extracting BERT embeddings

Convert the `features` built above into PyTorch tensors.

In [19]:
import torch
features = {k: torch.tensor(v) for k, v in features.items()}

In [20]:
features

{'input_ids': tensor([[    2, 19017,  8482,     3,     0,     0,     0,     0,     0,     0],
         [    2, 15830,     5,     3,     0,     0,     0,     0,     0,     0],
         [    2, 10293,  2539, 10480, 13213, 12108,    17,     3,     0,     0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
         [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}

Feed `features` into the BERT model to run the forward pass.

In [21]:
outputs = model(**features)
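The call above tracks gradients, which is why the pooler output further down carries `grad_fn=<TanhBackward0>`. When only embeddings are needed, wrapping the forward pass in `torch.no_grad()` avoids building the autograd graph. The sketch below shows the difference on a small `torch.nn.Linear` layer, which is a stand-in chosen so the example runs without downloading the model; the same context manager can wrap `model(**features)`.

```python
import torch

# Stand-in for the BERT model (an assumption made to keep the sketch small).
layer = torch.nn.Linear(4, 2)
layer.eval()  # inference mode for layers such as dropout

x = torch.randn(3, 4)

tracked = layer(x)        # autograd records the computation graph
with torch.no_grad():
    untracked = layer(x)  # no graph is recorded, so less memory is used

print(tracked.grad_fn is not None)  # True
print(untracked.grad_fn is None)    # True
```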

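Check the shape of the token-level vectors from BERT's last layer. The dimensions are (number of sentences, sequence length, hidden size).

In [31]:
outputs.last_hidden_state.shape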

torch.Size([3, 10, 768])
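One common way to reduce these token-level vectors to a single vector per sentence is masked mean pooling: average the token vectors while skipping padded positions. This is an alternative to the pooler output and not something the notebook itself does. The sketch below uses a random tensor of the same shape in place of `outputs.last_hidden_state`, so the numbers are meaningless; only the mechanics are shown.

```python
import torch

torch.manual_seed(0)

# Random stand-in with the same shape as outputs.last_hidden_state.
last_hidden_state = torch.randn(3, 10, 768)
attention_mask = torch.tensor([
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
])

# Broadcast the mask over the hidden dimension: (3, 10) -> (3, 10, 1).
mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)

summed = (last_hidden_state * mask).sum(dim=1)  # (3, 768), padding zeroed out
counts = mask.sum(dim=1).clamp(min=1)           # (3, 1), real tokens per row
sentence_vectors = summed / counts

print(sentence_vectors.shape)  # torch.Size([3, 768])
```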

Inspect the document-level vectors from BERT's last layer, one per input sentence.

In [28]:
outputs.pooler_output

tensor([[-0.1594,  0.0547,  0.1101,  ...,  0.2684,  0.1596, -0.9828],
        [-0.9221,  0.2969, -0.0110,  ...,  0.4291,  0.0311, -0.9955],
        [-0.3893, -0.1339, -0.2305,  ...,  0.1188,  0.1195,  0.3018]],
       grad_fn=<TanhBackward0>)
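The `grad_fn=<TanhBackward0>` above reflects how the pooler works: it takes the hidden state of the first token (`[CLS]`), applies a dense layer of size `hidden_size`, and passes the result through `tanh`, so every value lies between -1 and 1. The sketch below mirrors that computation with a random input and a randomly initialised `torch.nn.Linear` layer; both are stand-ins, so the values will not match the output above.

```python
import torch

torch.manual_seed(0)
hidden_size = 768

# Random stand-ins for outputs.last_hidden_state and the pooler's dense layer.
last_hidden_state = torch.randn(3, 10, hidden_size)
dense = torch.nn.Linear(hidden_size, hidden_size)

cls_vectors = last_hidden_state[:, 0]    # first token of each sentence: (3, 768)
pooled = torch.tanh(dense(cls_vectors))  # same shape, squashed into [-1, 1]

print(pooled.shape)  # torch.Size([3, 768])
```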