## **Using the HuggingFace Libraries**
Last time we used Huggingface's Tokenizers library. This time we will use the Transformers library, which bundles pre-trained models, and the datasets library, which lets you download natural language processing datasets.

### **Installing and Importing the Required Packages**

In [1]:
!pip install transformers[sentencepiece]
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers[sentencepiece]
  Downloading transformers-4.21.1-py3-none-any.whl (4.7 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting sentencepiece!=0.1.92,>=0.1.91
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [30]:
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import time

# Libraries for crawling data
import requests
from bs4 import BeautifulSoup

# PyTorch
import torch
import torch.nn as nn

# Huggingface
import transformers
import datasets

### **Huggingface's Transformers**

- Before we start, let's look at what Huggingface's Transformers library provides.
- There are many NLP libraries, but transformers is the most widely used one for Transformer-based NLP tasks.
- Huggingface first gained attention for its early PyTorch implementation of BERT, and it now implements and releases a wide range of transformer-based models. ([Pre-trained Transformers](https://huggingface.co/models))
- Models are published in both TensorFlow and PyTorch versions, so they fit a variety of setups.
- Besides the registered models, you can also upload and use your own custom model.
- We will explore the transformers library using the Transformers documentation and hands-on material.
- [Transformers Library](https://huggingface.co/transformers/)

#### Main Classes
- Configuration: https://huggingface.co/transformers/main_classes/configuration.html
- With AutoConfig, you can easily load the configuration of many different models from a string tag.
- Each config holds the information needed for the model architecture and task (architecture type, number of layers, hidden unit size, hyperparameters).
- You can look up the name tags of these models at [Pre-trained Transformers](https://huggingface.co/models).
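
A config can also be built without downloading anything, which is handy for checking which fields exist. The following is a minimal sketch; the layer sizes are hypothetical values picked for illustration and do not correspond to any released checkpoint.

```python
from transformers import BertConfig

# Hypothetical small sizes chosen for illustration only.
small_config = BertConfig(
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=512,
)

# Round-trip through a plain dict, e.g. to log or tweak hyperparameters.
as_dict = small_config.to_dict()
as_dict["hidden_dropout_prob"] = 0.2
restored = BertConfig.from_dict(as_dict)

print(restored.hidden_size, restored.num_hidden_layers, restored.hidden_dropout_prob)
```

Fields that are not passed explicitly keep the `BertConfig` defaults, such as the vocabulary size.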

In [31]:
from transformers import AutoConfig

# Load the configuration of the "BERT" model with AutoConfig.
config = AutoConfig.from_pretrained('bert-large-uncased')
config

BertConfig {
  "_name_or_path": "bert-large-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.21.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [32]:
# Load the configuration of the "GPT-2" model with AutoConfig.
gpt_config = AutoConfig.from_pretrained('gpt2')

In [33]:
gpt_config

GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.21.1",
  "use_cache": true,
  "vocab_size": 50257
}

In [34]:
print(config.vocab_size)

30522


In [35]:
# For convenience, a config can be converted to a dict with .to_dict().
config_dict = config.to_dict()
config_dict

{'return_dict': True,
 'output_hidden_states': False,
 'output_attentions': False,
 'torchscript': False,
 'torch_dtype': None,
 'use_bfloat16': False,
 'tf_legacy_loss': False,
 'pruned_heads': {},
 'tie_word_embeddings': True,
 'is_encoder_decoder': False,
 'is_decoder': False,
 'cross_attention_hidden_size': None,
 'add_cross_attention': False,
 'tie_encoder_decoder': False,
 'max_length': 20,
 'min_length': 0,
 'do_sample': False,
 'early_stopping': False,
 'num_beams': 1,
 'num_beam_groups': 1,
 'diversity_penalty': 0.0,
 'temperature': 1.0,
 'top_k': 50,
 'top_p': 1.0,
 'typical_p': 1.0,
 'repetition_penalty': 1.0,
 'length_penalty': 1.0,
 'no_repeat_ngram_size': 0,
 'encoder_no_repeat_ngram_size': 0,
 'bad_words_ids': None,
 'num_return_sequences': 1,
 'chunk_size_feed_forward': 0,
 'output_scores': False,
 'return_dict_in_generate': False,
 'forced_bos_token_id': None,
 'forced_eos_token_id': None,
 'remove_invalid_values': False,
 'exponential_decay_length_penalty': None,
 'ar

In [36]:
# Instead of AutoConfig, you can also name the model-specific config class explicitly.

from transformers import BertConfig
bertconfig = BertConfig.from_pretrained('bert-base-uncased')

In [37]:
#bertconfig

In [38]:
# When the class is named explicitly, values from another model's config can also be loaded into the chosen config format.
bert_in_gpt2_config = BertConfig.from_pretrained('gpt2')

You are using a model of type gpt2 to instantiate a model of type bert. This is not supported for all configurations of models and can yield errors.


In [39]:
#bert_in_gpt2_config

- Model: https://github.com/huggingface/transformers/tree/master/src/transformers/models
    - Transformers provides implementations of transformer-based model architectures.
    - More recently it has been extended with transformer models for vision tasks, such as [ViT](https://arxiv.org/abs/2010.11929).
    - Beyond the bare architectures, it also ships implementations adapted to specific tasks.
    - Let's look at the classes provided by the BERT implementation and load models trained with those structures.


In [40]:
from transformers import BertForMaskedLM, BertForQuestionAnswering, BertForSequenceClassification, BertForTokenClassification, BertForMultipleChoice, BertModel

In [41]:
## The model differs slightly depending on the task; for example, BertForSequenceClassification is used for tasks such as sentiment analysis.

In [42]:
from transformers import AutoModel, AutoTokenizer, AutoConfig

# BERT model
![image](https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fb3a7WV%2FbtqVZTeLXwY%2FmDrKNb2oGLzUJPW5N7Azlk%2Fimg.png)

- The BERT model uses only the "Encoder" part of the transformer architecture.
![](https://pytorch.org/tutorials/_images/transformer_architecture.jpg)

Let's load a BERT model ourselves with the Huggingface library.

In [43]:
# AutoModel reportedly returns the same thing as loading BertModel directly

In [44]:
bertmodel = AutoModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Inspecting the layers of the BERT model shows that it is made up of the following:

- word_embeddings, position_embeddings (the Transformer's word embedding and positional encoding)
- token_type_embeddings (a sentence-index embedding for the input, newly added in BERT)
- BertLayer
  - attention (multi-head attention)
  - intermediate + output (feed-forward)

In other words, it is a Transformer encoder with only a token type embedding added.
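
To make the embedding part concrete, here is a minimal NumPy sketch of how the three embedding tables are combined: the three lookups are summed and then layer-normalized. The tables are random, the hidden size is shrunk to 8 for readability, and the learnable LayerNorm scale/shift and the dropout of the real `BertEmbeddings` module are left out, so treat it as an illustration rather than the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Table sizes follow bert-base-uncased; hidden_size is reduced for the demo.
vocab_size, max_positions, type_vocab_size, hidden_size = 30522, 512, 2, 8

word_emb = rng.normal(size=(vocab_size, hidden_size))
pos_emb = rng.normal(size=(max_positions, hidden_size))
type_emb = rng.normal(size=(type_vocab_size, hidden_size))

def layer_norm(x, eps=1e-12):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def bert_input_embeddings(input_ids, token_type_ids):
    positions = np.arange(len(input_ids))
    summed = word_emb[input_ids] + pos_emb[positions] + type_emb[token_type_ids]
    return layer_norm(summed)

input_ids = np.array([101, 7592, 1010, 2088, 102])  # [CLS] hello , world [SEP]
token_type_ids = np.zeros_like(input_ids)           # single sentence -> all zeros
embeddings = bert_input_embeddings(input_ids, token_type_ids)
print(embeddings.shape)  # (5, 8)
```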


In [45]:
bertmodel

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

To use the model's word embeddings, the input sentence first has to be split into words (or sub-words) and converted to indices. The component that does this is called a tokenizer.

Let's also load the tokenizer that the BERT model uses.

Given a sentence, the BERT tokenizer
- splits it into tokens,
- converts those tokens to indices,
- and automatically adds the attention mask, the token types (sentence indices), and the required special tokens (CLS, SEP).
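
The three steps can be sketched with a toy tokenizer. Everything here is hypothetical: the vocabulary has eight entries and splitting is done on whitespace, whereas the real BERT tokenizer uses a WordPiece vocabulary of roughly 30k entries. The sketch only shows the shape of the output.

```python
# Hypothetical toy vocabulary, for illustration only.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "i": 4, "love": 5, "you": 6, ".": 7}

def toy_encode(text):
    # 1) Split into tokens (lowercase + whitespace; real BERT uses WordPiece).
    tokens = text.lower().replace(".", " .").split()
    # 2) Add the special tokens around the sentence.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # 3) Map each token to its index, falling back to [UNK].
    input_ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # single sentence
        "attention_mask": [1] * len(input_ids),  # no padding yet
    }

encoded = toy_encode("I love you.")
print(encoded["input_ids"])  # [2, 4, 5, 6, 7, 3]
```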

In [46]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

In [47]:
input = tokenizer('Hello, world. The dog is so cute. I love you.')
input

{'input_ids': [101, 7592, 1010, 2088, 1012, 1996, 3899, 2003, 2061, 10140, 1012, 1045, 2293, 2017, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

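You can also load a model specialized for a particular task.

- For BertForQuestionAnswering, you can see that the output dimension of the last layer is 2 (one logit each for the answer start and end positions).
- Loading from "bert-base-uncased" gives a model that has only completed BERT pre-training, whereas loading from "deepset/bert-base-cased-squad2" gives a model that has also been fine-tuned on the SQuAD 2.0 dataset.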

In [48]:
bert_qa = BertForQuestionAnswering.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased a

In [49]:
bert_qa

BertForQuestionAnswering(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_

In [50]:
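# A model already fine-tuned on the SQuAD 2.0 dataset.
# Note: this checkpoint is cased, while the tokenizer loaded above is the uncased one, so the logits below are only illustrative.
bert_qa = BertForQuestionAnswering.from_pretrained('deepset/bert-base-cased-squad2')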

In [51]:
input.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [52]:
input = tokenizer('Hello, world. The dog is so cute. I love you.', return_tensors='pt')

input['input_ids'], input['input_ids'].shape

(tensor([[  101,  7592,  1010,  2088,  1012,  1996,  3899,  2003,  2061, 10140,
           1012,  1045,  2293,  2017,  1012,   102]]), torch.Size([1, 16]))

In [53]:
bert_qa(**input)['start_logits']

tensor([[-1.3124, -1.2964, -0.6426, -2.2511, -3.0333, -2.1410, -2.3993, -1.9935,
         -1.6857, -3.4457, -2.1735, -2.8759, -4.1210, -4.0565, -4.4373, -5.7442]],
       grad_fn=<CloneBackward0>)

In [54]:
bert_qa(**input)['start_logits'].shape

torch.Size([1, 16])
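
A question answering head returns one start logit and one end logit per token, and the answer is the span whose start and end scores add up to the highest value. The sketch below shows that decoding step with NumPy. The tokens and logits are made up for illustration, and `best_span` is a hypothetical helper, not a transformers function.

```python
import numpy as np

# Made-up example: question tokens, then context tokens.
tokens = ["[CLS]", "who", "is", "cute", "?", "[SEP]", "the", "dog", "is", "cute", "[SEP]"]
start_logits = np.array([0.1, -1.0, -2.0, -1.5, -2.0, -3.0, 5.0, 2.0, -1.0, -0.5, -3.0])
end_logits = np.array([0.0, -2.0, -1.0, -1.0, -2.5, -3.0, 0.5, 4.5, -1.0, 1.0, -2.0])

def best_span(start_logits, end_logits, max_answer_len=5):
    """Return (start, end) maximizing start_logits[start] + end_logits[end] with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for start in range(len(start_logits)):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logits[start] + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best

start, end = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(start, end, answer)  # 6 7 the dog
```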

A token classification model can be used when each input token needs its own classification (e.g. Named Entity Recognition).

In [55]:
bert_token_cls = BertForTokenClassification.from_pretrained('ckiplab/bert-base-chinese-ner')

Downloading config.json:   0%|          | 0.00/3.62k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/388M [00:00<?, ?B/s]

In [56]:
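# Note: `tokenizer` is still the 'bert-base-uncased' tokenizer, which does not match this Chinese NER checkpoint; the call only shows the output format.
input = tokenizer('Hello, world. The dog is so cute. I love you.', return_tensors='pt')
bert_token_cls(**input)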

TokenClassifierOutput(loss=None, logits=tensor([[[ 11.8494,  -0.8205,  -3.0278,  ..., -10.4908,  -7.2487,  -9.2110],
         [ 11.7722,  -0.2565,  -1.6897,  ..., -11.1585,  -7.2616,  -7.1234],
         [ 12.1725,  -1.9621,  -3.3095,  ..., -11.1084,  -7.8367, -10.3945],
         ...,
         [ 12.8367,  -2.0028,  -3.1620,  ..., -11.8043,  -7.9622, -10.4055],
         [ 12.2464,  -2.2219,  -3.2702,  ..., -10.8565,  -8.0020, -11.1123],
         [ 11.8494,  -0.8205,  -3.0278,  ..., -10.4908,  -7.2487,  -9.2110]]],
       grad_fn=<ViewBackward0>), hidden_states=None, attentions=None)

# Optimization
https://huggingface.co/transformers/main_classes/optimizer_schedules.html
- The optimization module provides a variety of widely used optimizers.
- It also provides schedulers for adjusting the learning rate.
- Of course, you can also use the ones provided by PyTorch.

In [57]:
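# Note: transformers.AdamW is deprecated; its deprecation warning points to torch.optim.AdamW as the replacement.
from transformers import AdamW, get_linear_schedule_with_warmup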

In [58]:
bert_maskedlm = BertForMaskedLM.from_pretrained('bert-base-uncased')

parameters = bert_maskedlm.parameters()
# parameters = bert_maskedlm.named_parameters()
optimizer = AdamW(parameters, lr=5e-5)
total_training_step = 100
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=int(total_training_step/10), num_training_steps=total_training_step)

# loss.backward()
optimizer.step()
scheduler.step()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
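
The schedule created by `get_linear_schedule_with_warmup` can be written out by hand: the learning rate grows linearly from 0 to the base value during warmup, then decays linearly back to 0. The function below is a plain-Python sketch of that multiplier; `linear_warmup_multiplier` is a name introduced here for illustration.

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: ramp up linearly from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Decay: go down linearly from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 5e-5
# Same settings as the cell above: 100 total steps, 10 warmup steps.
for step in (0, 5, 10, 55, 100):
    print(step, base_lr * linear_warmup_multiplier(step, 10, 100))
```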


# Pipelines
https://huggingface.co/transformers/main_classes/pipelines.html
- With pipeline, you can load a model in a single line.
- The tasks currently available through pipeline include:
    - fill-mask
    - text-classification
    - text-generation
    - sentiment-analysis
    - text2text-generation
    - ner
    - translation_xx_to_yy
    - zero-shot-classification
- Let's see how to use the pipeline class through some examples.

In [59]:
from transformers import pipeline, AutoTokenizer, AutoModel

In [60]:
fill_masker = pipeline("fill-mask")
fill_masker("I <mask> you.")

No model was supplied, defaulted to distilroberta-base and revision ec58a5b (https://huggingface.co/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Downloading config.json:   0%|          | 0.00/480 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/316M [00:00<?, ?B/s]

Downloading vocab.json:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading merges.txt:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/1.29M [00:00<?, ?B/s]

[{'score': 0.24132834374904633,
  'token': 23824,
  'token_str': ' salute',
  'sequence': 'I salute you.'},
 {'score': 0.17694053053855896,
  'token': 2649,
  'token_str': ' miss',
  'sequence': 'I miss you.'},
 {'score': 0.1473003327846527,
  'token': 657,
  'token_str': ' love',
  'sequence': 'I love you.'},
 {'score': 0.059955764561891556,
  'token': 3392,
  'token_str': ' thank',
  'sequence': 'I thank you.'},
 {'score': 0.046513769775629044,
  'token': 19477,
  'token_str': ' applaud',
  'sequence': 'I applaud you.'}]

In [61]:
classifier = pipeline("text-classification")
classifier("This restaurant is not bad.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Downloading config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/255M [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

Downloading vocab.txt:   0%|          | 0.00/226k [00:00<?, ?B/s]

[{'label': 'POSITIVE', 'score': 0.9994020462036133}]

In [67]:
en_fr_translator = pipeline("translation_en_to_fr")
en_fr_translator("How old are you?")
# Quel âge avez-vous?

No model was supplied, defaulted to t5-base and revision 686f1db (https://huggingface.co/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.


[{'translation_text': ' quel âge êtes-vous?'}]

In [63]:
en_fr_translator("What is your name?")

[{'translation_text': 'Quel est votre nom?'}]

# Tokenizer
https://huggingface.co/transformers/main_classes/tokenizer.html
- The tokenizer classes provide a range of tokenization-related functionality.
- Converting strings to tokens and tokens back to strings is used to build the input embeddings and to decode the model's output.
- Based on the given tokenization config, the tokenizer produces the information a transformer needs as input.

Q. Why do we use a pre-trained tokenizer, and not just a pre-trained model?

In [68]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

In [69]:
tokenizer.get_vocab()

{'hamburger': 24575,
 'approaches': 8107,
 '##tem': 18532,
 'sean': 5977,
 '##eers': 22862,
 'chaired': 12282,
 'huts': 23326,
 'note': 3602,
 'helmets': 22674,
 '##roi': 26692,
 'frequently': 4703,
 'buyers': 17394,
 'concentrate': 10152,
 'wien': 22782,
 '##tite': 23096,
 '##inski': 19880,
 '136': 15407,
 'k': 1047,
 'dmitri': 28316,
 'lighted': 26390,
 '##iation': 18963,
 'inning': 12994,
 'aurora': 13158,
 '[unused765]': 770,
 'ᅳ': 1481,
 'balance': 5703,
 'sheffield': 8533,
 'delicate': 10059,
 'won': 2180,
 'laundering': 28289,
 'eastman': 24252,
 'things': 2477,
 'hank': 9180,
 'cardiovascular': 22935,
 'employing': 15440,
 '##onale': 22823,
 'deutschland': 28668,
 'jefferson': 7625,
 'avant': 14815,
 'scrutiny': 17423,
 'pea': 26034,
 'へ': 1675,
 'otis': 18899,
 'dime': 27211,
 'napoleon': 8891,
 '[unused815]': 820,
 'worse': 4788,
 '1893': 6489,
 'fires': 8769,
 '##sse': 11393,
 '##pton': 15857,
 'scary': 12459,
 'signaling': 14828,
 'trail': 4446,
 'inspiration': 7780,
 '##di

In [70]:
print(tokenizer.tokenize("I love natural language processing"))
print(tokenizer.convert_tokens_to_ids(tokenizer.tokenize("I love natural language processing")))
print(tokenizer.convert_tokens_to_string(tokenizer.tokenize("I love natural language processing")))

['i', 'love', 'natural', 'language', 'processing']
[1045, 2293, 3019, 2653, 6364]
i love natural language processing


In [71]:
tokenizer("I love natural language processing")

{'input_ids': [101, 1045, 2293, 3019, 2653, 6364, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

## Sentiment classification of Naver movie reviews with BERT

We will fine-tune a BERT model to classify whether a Naver movie review is positive or not.

To mirror a real-world setting, we will build the dataset by crawling Naver movie reviews.

In [72]:
# Read the data for page 1.
page = 1
url = f'https://movie.naver.com/movie/point/af/list.naver?&page={page}'
url


'https://movie.naver.com/movie/point/af/list.naver?&page=1'

In [73]:
# Fetch the HTML from the url.
html = requests.get(url)
# Pass the response body (.content) to BeautifulSoup to get a soup object
soup = BeautifulSoup(html.content,'html.parser')
soup


<!DOCTYPE html>

<html lang="ko">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="http://imgmovie.naver.com/today/naverme/naverme_profile.jpg" property="me2:image">
<meta content="네이버영화 " property="me2:post_tag">
<meta content="네이버영화" property="me2:category1"/>
<meta content="" property="me2:category2"/>
<meta content="평점 : 네이버 영화" property="og:title"/>
<meta content="네티즌 평점과 리뷰 정보 제공" property="og:description"/>
<meta content="article" property="og:type"/>
<meta content="https://movie.naver.com/movie/point/af/list.naver?&amp;page=1" property="og:url"/>
<meta content="http://static.naver.net/m/movie/icons/OG_270_270.png" property="og:image"/><!-- http://static.naver.net/m/movie/im/navermovie.jpg -->
<meta content="http://imgmovie.naver.com/today/naverme/naverme_profile.jpg" property="og:article:thumbnailUrl"/>
<meta content="네이버 영화" property="og:article:author"/>
<meta content="https://mo

In [75]:
# find_all: finds every occurrence of the given tag and returns them as a list
reviews = soup.find_all("td",{"class":"title"})
reviews

[<td class="title">
 <a class="movie color_b" href="?st=mcode&amp;sword=184519&amp;target=after">비상선언</a>
 <div class="list_netizen_score">
 <span class="st_off"><span class="st_on" style="width:100%">별점 - 총 10점 중</span></span><em>10</em>
 </div>
 <br/>저는 재밌게 봤습니다. 현실성이 떨어진다고들 많이 하는데그러니 영화 아니겠습니까? 
 			
 			
 			
 				
 				
 				
 				<a class="report" href="#" onclick="report('p312****', 'h4YnDfzdBQiwWtfYkh4HuEwAzPIOErNLulSuhXucjZU=', '저는 재밌게 봤습니다. 현실성이 떨어진다고들 많이 하는데그러니 영화 아니겠습니까?', '18376945', 'point_after');" style="color:#8F8F8F" title="새 창">신고</a>
 </td>, <td class="title">
 <a class="movie color_b" href="?st=mcode&amp;sword=81888&amp;target=after">탑건: 매버릭</a>
 <div class="list_netizen_score">
 <span class="st_off"><span class="st_on" style="width:100%">별점 - 총 10점 중</span></span><em>10</em>
 </div>
 <br/>굿굿 
 			
 			
 			
 				
 				
 				
 				<a class="report" href="#" onclick="report('ni12****', 'HnGu3jQlxxygZeh4t7wL/HlO+S6nlTVB02orW0O52Dw=', '굿굿', '18376944', 'point_after');

In [76]:
review_data = []
# Go through the reviews on one page and extract the data from each
for review in reviews:
    # Example: <a class="report" href="#" onclick="report('nimo****', 'VYvWCsRvbPwqbJZjcJx6mHaPm5IRnI1v1MYlRfqlIlA=', '재밌는데 겟아웃 보단…', '18375693', 'point_after');" style="color:#8F8F8F" title="새 창">신고</a>
    sentence = review.find("a",{"class":"report"}).get("onclick").split("', '")[2]
    # Skip the entry if the review text is empty
    if sentence != "":
        # Example: <a class="movie color_b" href="?st=mcode&amp;sword=194196&amp;target=after">한산: 용의 출현</a>
        movie = review.find("a",{"class":"movie color_b"}).get_text()
        # Example:  <span class="st_off"><span class="st_on" style="width:100%">별점 - 총 10점 중</span></span><em>10</em>
        score = review.find("em").get_text()
        review_data.append([movie,sentence,int(score)])

review_data

[['비상선언', '저는 재밌게 봤습니다. 현실성이 떨어진다고들 많이 하는데그러니 영화 아니겠습니까?', 10],
 ['탑건: 매버릭', '굿굿', 10],
 ['탑건: 매버릭', '역시 톰형이 톰형했네 최고다!!!', 10],
 ['프레데터', '수십년이 지났지만, 몇번을 봐도 그야말로 미친 영화. 액션SF공포스릴러 한획을 그음.', 10],
 ['놉', '이 감독의 작품중에서 스케일은 가장 컸던 영화. 뭔가 알듯말듯 미묘한 메세지의 전달과 함께 섬뜻한 긴장감이 좋았어요', 9],
 ['탑건: 매버릭', '역시.. 탑건.. 반드시 영화관에서 봐야하는 영화.', 10],
 ['비상선언', '재밌고슬픈영화 긴장감 넘친다', 10],
 ['놉', '뭘 말하려는지도 모르겠고...무슨 스토리인지도 모르겠다....그냥 그런느낌...???', 3],
 ['탑건: 매버릭',
  '어떻게 보면 소재 자체는 단순한데, 기막힌 전투기 조종씬과 꼼꼼하게 구성한 스토리 라인으로 끝까지 긴장감 있게 극이 진행된다. 관람비가 하나도 아깝지 않은 영화!',
  10],
 ['놉', '돈룩업!!돈룩업!!', 10]]
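
The `split("', '")[2]` step above depends on the layout of the `onclick` attribute: the review text is the third quoted argument of `report(...)`. A regular expression over the quoted arguments expresses the same idea a little more explicitly. This is a sketch that assumes any quote inside the review text is backslash-escaped by the site, which has not been verified; `parse_report_args` is a hypothetical helper.

```python
import re

# onclick value copied from the page output shown earlier.
onclick = (
    "report('p312****', 'h4YnDfzdBQiwWtfYkh4HuEwAzPIOErNLulSuhXucjZU=', "
    "'저는 재밌게 봤습니다. 현실성이 떨어진다고들 많이 하는데그러니 영화 아니겠습니까?', "
    "'18376945', 'point_after');"
)

def parse_report_args(onclick_value):
    """Return the single-quoted arguments of report(...) as a list of strings."""
    return re.findall(r"'((?:[^'\\]|\\.)*)'", onclick_value)

args = parse_report_args(onclick)
user_id, _, sentence, review_id, _ = args
print(len(args), user_id, review_id)
print(sentence)
```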

In [77]:
def get_page_reviews(page):
    review_data = []
    url = f'https://movie.naver.com/movie/point/af/list.naver?&page={page}'
    # get: request the HTML document at the url
    html = requests.get(url)
    # Pass the response body (.content) to BeautifulSoup to get a soup object
    soup = BeautifulSoup(html.content,'html.parser')
    # find_all: finds every occurrence of the given tag and returns them as a list
    reviews = soup.find_all("td",{"class":"title"})

    # Go through the reviews on one page and extract the data from each
    for review in reviews:
        sentence = review.find("a",{"class":"report"}).get("onclick").split("', '")[2]
        # Skip the entry if the review text is empty
        if sentence != "":
            movie = review.find("a",{"class":"movie color_b"}).get_text()
            score = review.find("em").get_text()
            review_data.append([movie,sentence,int(score)])
    return review_data


from tqdm.contrib.concurrent import process_map # Library for parallel processing with a progress bar.
review_data = process_map(get_page_reviews, range(1, 1000), max_workers=8, chunksize=8) # Crawl the data with multiprocessing.

  0%|          | 0/999 [00:00<?, ?it/s]

In [78]:
review_data[:10]

[[['비상선언', '너무 억지스러운 반미 반일 감정 유도, 재미 존나 없음 ', 1],
  ['한산: 용의 출현',
   '초2 아들과 함께 관람했어요(엄마와동반). 장면마다 음악이 너무 잘 어우러져서 영화에 몰입하기 좋았어요~',
   10],
  ['비상선언', '저는 재밌게 봤습니다. 현실성이 떨어진다고들 많이 하는데그러니 영화 아니겠습니까?', 10],
  ['탑건: 매버릭', '굿굿', 10],
  ['탑건: 매버릭', '역시 톰형이 톰형했네 최고다!!!', 10],
  ['프레데터', '수십년이 지났지만, 몇번을 봐도 그야말로 미친 영화. 액션SF공포스릴러 한획을 그음.', 10],
  ['놉', '이 감독의 작품중에서 가장 스케일이 큰 영화. 뭔가 알듯말듯 미묘한 메세지의 전달과 함께 섬뜩한 긴장감이 좋았어요', 9],
  ['탑건: 매버릭', '역시.. 탑건.. 반드시 영화관에서 봐야하는 영화.', 10],
  ['비상선언', '재밌고슬픈영화 긴장감 넘친다', 10],
  ['놉', '뭘 말하려는지도 모르겠고...무슨 스토리인지도 모르겠다....그냥 그런느낌...???', 3]],
 [['탑건: 매버릭',
   '어떻게 보면 소재 자체는 단순한데, 기막힌 전투기 조종씬과 꼼꼼하게 구성한 스토리 라인으로 끝까지 긴장감 있게 극이 진행된다. 관람비가 하나도 아깝지 않은 영화!',
   10],
  ['놉', '돈룩업!!돈룩업!!', 10],
  ['헌트', '이정재 정우성 조합에 우정출연인지 어마어마한 배우들 대거 나옴 ', 10],
  ['헌트', '보는 내내 소름 돋았고 지루한 부분 없이 깔끔했음 계속 감탄하면서 봄', 10],
  ['용서받지 못한 자',
   '군대를 적나라하게 표현한 영화라는 것을 이미 알고봐서 더 충격적이고 진하게 다가왔다. 대사와 분위기, 긴장감에 압도당해 보는데 힘이 들었을 정도. 배우들이 생활 연기를 너무 잘했다, 특히 하정우. 툭툭 말걸고 짜증내는 연기에서 현실감이 느껴져서 하정우 나오는 씬마다 기대가 됨',
  

In [79]:
df = pd.DataFrame(sum(review_data, []))
df.columns = ['movie','review','score']
df = df.dropna()
df

Unnamed: 0,movie,review,score
0,비상선언,"너무 억지스러운 반미 반일 감정 유도, 재미 존나 없음",1
1,한산: 용의 출현,초2 아들과 함께 관람했어요(엄마와동반). 장면마다 음악이 너무 잘 어우러져서 영화...,10
2,비상선언,저는 재밌게 봤습니다. 현실성이 떨어진다고들 많이 하는데그러니 영화 아니겠습니까?,10
3,탑건: 매버릭,굿굿,10
4,탑건: 매버릭,역시 톰형이 톰형했네 최고다!!!,10
...,...,...,...
9365,불꽃 슛 통키,인생 띵작……..ㅋㅋㅋㅋ,10
9366,헌트,재밌습니다.마음이 무거워요,10
9367,한산: 용의 출현,재밌었고 이순신 장군이 얼마나 대단한 위인인지 다시 느낌.박해일 변요한 연기도 좋았음,10
9368,한산: 용의 출현,한산최고꼭보세요!!,10


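### **Training Movie Review Classifier with BertForSequenceClassification Class**

We will load the config, tokenizer, and model of a pre-trained BERT.

Here we will use kcBERT, one of the models trained specifically for Korean.
- kcBERT is pre-trained on comments and replies from Naver News. Compared with earlier koBERT models, which were trained on formal, well-edited text, it has seen far more colloquial language and newly coined words. Naver movie reviews are also full of colloquialisms and neologisms, so kcBERT is a good fit.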

- https://github.com/Beomi/KcBERT

- https://huggingface.co/beomi/kcbert-base
<!-- https://huggingface.co/monologg/kobert -->

In [87]:
from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification

In [88]:
config = AutoConfig.from_pretrained('beomi/kcbert-base')
tokenizer = AutoTokenizer.from_pretrained('beomi/kcbert-base')
model = AutoModelForSequenceClassification.from_pretrained("beomi/kcbert-base")

Some weights of the model checkpoint at beomi/kcbert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initiali

In [89]:
config

BertConfig {
  "_name_or_path": "beomi/kcbert-base",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 300,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "transformers_version": "4.21.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30000
}

In [90]:
tokenizer

PreTrainedTokenizerFast(name_or_path='beomi/kcbert-base', vocab_size=30000, model_max_len=300, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [91]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30000, 768, padding_idx=0)
      (position_embeddings): Embedding(300, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [85]:
from datasets import load_metric

# Evaluation metrics: load them once, outside the function, so they are not re-loaded on every evaluation.
accuracy_metric = load_metric("accuracy")
f1_metric = load_metric("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # predicted class = index of the largest logit
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}
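To make the metric concrete, here is a minimal, dependency-free sketch of what `compute_metrics` works out: the arg-max over each row of logits gives the predicted class, and accuracy and binary F1 follow from comparing predictions with the references. The toy logits and labels are made up for illustration, and the helper names (`argmax`, `accuracy`, `binary_f1`) are hypothetical, not part of `datasets`.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(preds, refs):
    """Fraction of predictions that equal the reference label."""
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)

def binary_f1(preds, refs, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(p == positive and r == positive for p, r in zip(preds, refs))
    fp = sum(p == positive and r != positive for p, r in zip(preds, refs))
    fn = sum(p != positive and r == positive for p, r in zip(preds, refs))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy logits for 4 reviews (columns: negative, positive) and their true labels.
toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
toy_labels = [1, 0, 0, 1]

toy_preds = [argmax(row) for row in toy_logits]
print(toy_preds)
print(accuracy(toy_preds, toy_labels), binary_f1(toy_preds, toy_labels))
```

F1 is reported next to accuracy because the two can diverge when the positive and negative classes are imbalanced.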

In [86]:
class NaverReviewDataset(torch.utils.data.Dataset):
    def __init__(self, data, tokenizer):
        self.tokenizer = tokenizer
        self.x = data['review'].values
        self.y = (data['score']>=7).astype(int).values

    def __getitem__(self, idx):
        item = self.tokenizer(self.x[idx], truncation=True)  # truncate reviews longer than the model's max length
        item['labels'] = [self.y[idx]]
        return item

    def __len__(self):
        return len(self.x)


train_dataset = NaverReviewDataset(df.iloc[:len(df)//5*4], tokenizer)
test_dataset  = NaverReviewDataset(df.iloc[len(df)//5*4:], tokenizer)

# train_dataset = NaverReviewDataset(df.iloc[:9000], tokenizer)
# test_dataset  = NaverReviewDataset(df.iloc[9000:], tokenizer)
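The split and labeling rule used above can be checked with a little arithmetic. The sketch below assumes the DataFrame has 9,370 rows, a figure inferred from the 7,496 training and 1,874 evaluation examples that the Trainer logs report in this notebook; `split_index` and `binarize` are hypothetical helper names that mirror `len(df)//5*4` and `score >= 7`.

```python
def split_index(n_rows):
    """Row index separating the first 4/5 of the data (train) from the rest (test)."""
    return n_rows // 5 * 4

def binarize(scores, threshold=7):
    """1 for a positive review (score >= threshold), otherwise 0."""
    return [int(s >= threshold) for s in scores]

n_rows = 7496 + 1874  # assumed total: train + eval sizes taken from the logs
cut = split_index(n_rows)
print(cut, n_rows - cut)

print(binarize([10, 7, 6, 1]))
```

Because the split is positional rather than shuffled, it only gives a fair test set if the rows of `df` are not ordered by score or by movie.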

In [92]:
from transformers import TrainingArguments, Trainer,  DataCollatorWithPadding
 
training_args = TrainingArguments(
   output_dir="finetuning-sentiment",
   learning_rate=2e-5,
   per_device_train_batch_size=5,
   per_device_eval_batch_size=5,
   num_train_epochs=2,
   weight_decay=0.01,
   save_strategy="epoch",
)


data_collator = DataCollatorWithPadding(tokenizer=tokenizer) 

trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=train_dataset,
   eval_dataset=test_dataset,
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)

In [93]:
trainer.train()

***** Running training *****
  Num examples = 7496
  Num Epochs = 2
  Instantaneous batch size per device = 5
  Total train batch size (w. parallel, distributed & accumulation) = 5
  Gradient Accumulation steps = 1
  Total optimization steps = 3000


Step,Training Loss
500,0.4736
1000,0.445
1500,0.4289
2000,0.2531
2500,0.2857
3000,0.2789


Saving model checkpoint to finetuning-sentiment/checkpoint-1500
Configuration saved in finetuning-sentiment/checkpoint-1500/config.json
Model weights saved in finetuning-sentiment/checkpoint-1500/pytorch_model.bin
tokenizer config file saved in finetuning-sentiment/checkpoint-1500/tokenizer_config.json
Special tokens file saved in finetuning-sentiment/checkpoint-1500/special_tokens_map.json
Saving model checkpoint to finetuning-sentiment/checkpoint-3000
Configuration saved in finetuning-sentiment/checkpoint-3000/config.json
Model weights saved in finetuning-sentiment/checkpoint-3000/pytorch_model.bin
tokenizer config file saved in finetuning-sentiment/checkpoint-3000/tokenizer_config.json
Special tokens file saved in finetuning-sentiment/checkpoint-3000/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=3000, training_loss=0.36087737782796225, metrics={'train_runtime': 334.9425, 'train_samples_per_second': 44.76, 'train_steps_per_second': 8.957, 'total_flos': 480176648254440.0, 'train_loss': 0.36087737782796225, 'epoch': 2.0})
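The `Total optimization steps = 3000` line in the log follows from the dataset size, batch size, and epoch count. The sketch below reproduces that arithmetic under the assumption of a single device; `total_optimization_steps` is a hypothetical helper, not a Trainer API.

```python
import math

def total_optimization_steps(num_examples, batch_size, num_epochs, grad_accum=1):
    """Number of optimizer updates for a single-device run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)   # the last batch may be smaller
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * num_epochs

steps = total_optimization_steps(num_examples=7496, batch_size=5, num_epochs=2)
print(steps)
```

With `save_strategy="epoch"`, one checkpoint is written at the end of each epoch, which is why the log shows `checkpoint-1500` and `checkpoint-3000`.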

In [94]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 1874
  Batch size = 5



{'eval_loss': 0.3883701264858246,
 'eval_accuracy': 0.9050160085378869,
 'eval_f1': 0.9385359116022098,
 'eval_runtime': 10.7615,
 'eval_samples_per_second': 174.139,
 'eval_steps_per_second': 34.846,
 'epoch': 2.0}

# NER with BERT

https://huggingface.co/dslim/bert-base-NER

In [95]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline


In [96]:


tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")


Model config BertConfig {
  "_name_or_path": "dslim/bert-base-NER",
  "_num_labels": 9,
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  

All model checkpoint weights were used when initializing BertForTokenClassification.

All the weights of BertForTokenClassification were initialized from the model checkpoint at dslim/bert-base-NER.
If your task is similar to the task the model of the checkpoint was

In [97]:
# A simple inference example
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)

[{'entity': 'B-PER', 'score': 0.9990139, 'index': 4, 'word': 'Wolfgang', 'start': 11, 'end': 19}, {'entity': 'B-LOC', 'score': 0.999645, 'index': 9, 'word': 'Berlin', 'start': 34, 'end': 40}]
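Each result carries `start` and `end` character offsets into the input string, so the entity text can be recovered by slicing. The sketch below copies the two results printed above; `extract_spans` and its `min_score` threshold are hypothetical additions for illustration.

```python
example = "My name is Wolfgang and I live in Berlin"

# Copied from the pipeline output above.
ner_results = [
    {"entity": "B-PER", "score": 0.9990139, "index": 4, "word": "Wolfgang", "start": 11, "end": 19},
    {"entity": "B-LOC", "score": 0.999645, "index": 9, "word": "Berlin", "start": 34, "end": 40},
]

def extract_spans(text, results, min_score=0.5):
    """Return (surface text, entity type) pairs, dropping the B-/I- prefix."""
    spans = []
    for r in results:
        if r["score"] < min_score:
            continue  # skip low-confidence predictions
        entity_type = r["entity"].split("-", 1)[-1]
        spans.append((text[r["start"]:r["end"]], entity_type))
    return spans

print(extract_spans(example, ner_results))
```

Slicing the original text by offsets is usually more reliable than reading `word`, which can contain subword pieces when an entity is split into several tokens.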


In [98]:
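# Note: `dataset` is only created by load_dataset("kor_ner") in a later cell, so running this cell first raises the NameError below.
pd.DataFrame(dataset['train']['ner_tags']).min().min()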

NameError: ignored

In [99]:

# Training example: let's try Korean NER.

tokenizer = AutoTokenizer.from_pretrained('beomi/kcbert-base')
model = AutoModelForTokenClassification.from_pretrained('beomi/kcbert-base')


In [104]:
model

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30000, 768, padding_idx=0)
      (position_embeddings): Embedding(300, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwis

In [107]:
# Fine-tuning: replace the classification head with one that outputs 7 labels (ids 0 through 6)
model.classifier = nn.Linear(model.classifier.in_features, 7)
model.num_labels = 7

from datasets import load_dataset

dataset = load_dataset("kor_ner")

train_dataset = dataset['train']
val_dataset = dataset['validation']
test_dataset = dataset['test']


Downloading and preparing dataset kor_ner/default (download: 3.33 MiB, generated: 4.68 MiB, post-processed: Unknown size, total: 8.02 MiB) to /root/.cache/huggingface/datasets/kor_ner/default/1.1.0/1f019c13620cd0f9d3a4b79c684b0b3e2ece528a306b9f5a1be5c4154d405c02...



Generating train split:   0%|          | 0/2928 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/366 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/366 [00:00<?, ? examples/s]

Dataset kor_ner downloaded and prepared to /root/.cache/huggingface/datasets/kor_ner/default/1.1.0/1f019c13620cd0f9d3a4b79c684b0b3e2ece528a306b9f5a1be5c4154d405c02. Subsequent calls will reuse this data.



In [None]:
# Dataset example
train_dataset[0]

In [None]:
def preprocess(data):
    max_len = tokenizer.model_max_length
    # The dataset is already tokenized, so map each token straight to its vocabulary id
    # and truncate anything longer than the model's maximum length.
    input_ids = tokenizer.convert_tokens_to_ids(data['tokens'])[:max_len]
    labels = data['ner_tags'][:max_len]
    n_pad = max_len - len(input_ids)
    # Build the mask from the truncated length so it always has exactly max_len entries.
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    input_ids = input_ids + [tokenizer.pad_token_id] * n_pad
    # Pad labels with -100 (ignored by the cross-entropy loss) instead of 0, which is a real tag id;
    # metrics should likewise skip positions labeled -100.
    labels = labels + [-100] * n_pad
    token_type_ids = [0] * max_len
    return {'input_ids':input_ids, 'token_type_ids':token_type_ids, 'attention_mask':attention_mask, 'labels':labels}

train_dataset = train_dataset.map(preprocess)
val_dataset = val_dataset.map(preprocess)
test_dataset = test_dataset.map(preprocess)
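The padding and truncation logic is easier to check on toy inputs. The following is a minimal sketch, not the notebook's exact function: `pad_or_truncate` is a hypothetical helper, `max_len=5` is chosen only to keep the output readable, and labels are padded with -100, the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so that padded positions do not contribute to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def pad_or_truncate(input_ids, labels, max_len, pad_id=0):
    """Bring one example to exactly max_len positions."""
    input_ids = input_ids[:max_len]
    labels = labels[:max_len]
    n_pad = max_len - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,  # 1 = real token, 0 = padding
        "labels": labels + [IGNORE_INDEX] * n_pad,
    }

short = pad_or_truncate([11, 12, 13], [2, 0, 1], max_len=5)
too_long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], [0, 1, 0, 1, 0, 1, 0], max_len=5)
print(short)
print(too_long)
```

All three lists end up with the same length in both cases, which is what the model needs in order to stack examples into a batch.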

In [None]:
from transformers import TrainingArguments, Trainer,  DataCollatorForTokenClassification
from datasets import load_metric


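metric = load_metric("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Score only real tokens: positions labeled -100 (the value the data collator uses for padded labels) are skipped.
    mask = labels != -100
    return metric.compute(predictions=predictions[mask], references=labels[mask])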


 
training_args = TrainingArguments(
   output_dir="finetuning-ner",
   learning_rate=2e-5,
   per_device_train_batch_size=16,
   per_device_eval_batch_size=16,
   num_train_epochs=2,
   weight_decay=0.01,
   save_strategy="epoch",
)


data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

trainer = Trainer(
   model=model,
   args=training_args,
   train_dataset=train_dataset,
   eval_dataset=val_dataset,
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

In [None]:
trainer.evaluate(eval_dataset=test_dataset)

# Building a QA model with the KorQuAD dataset

In [None]:
from datasets import load_dataset
raw_datasets = load_dataset("KETI-AIR/korquad", 'v1.0')

In [None]:
# Split names can differ from dataset to dataset.
raw_datasets.keys()

In [None]:
train_dataset = raw_datasets['train']
val_dataset = raw_datasets['dev']

In [None]:
train_dataset[0]

In [None]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
tokenizer = AutoTokenizer.from_pretrained('sangrimlee/bert-base-multilingual-cased-korquad')
model = AutoModelForQuestionAnswering.from_pretrained('sangrimlee/bert-base-multilingual-cased-korquad')

In [None]:
question = train_dataset[0]['question']
context = train_dataset[0]['context']
tokenizer(question, context)

In [None]:
def preprocess_function(example):
    # Processes ONE example at a time (Dataset.map is called without batched=True).
    inputs = tokenizer(
        example["question"],
        example["context"],
        truncation="only_second",   # truncate only the context, never the question
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset = inputs.pop("offset_mapping")
    answer = example["answers"]
    start_char = answer["answer_start"][0]
    end_char = start_char + len(answer["text"][0])
    sequence_ids = inputs.sequence_ids(0)

    # Find the start and end of the context
    idx = 0
    while sequence_ids[idx] != 1:
        idx += 1
    context_start = idx
    while sequence_ids[idx] == 1:
        idx += 1
    context_end = idx - 1

    # If the answer is not fully inside the (possibly truncated) context, label it (0, 0)
    if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
        start_position, end_position = 0, 0
    else:
        # Otherwise it's the start and end token positions
        idx = context_start
        while idx <= context_end and offset[idx][0] <= start_char:
            idx += 1
        start_position = idx - 1

        idx = context_end
        while idx >= context_start and offset[idx][1] >= end_char:
            idx -= 1
        end_position = idx + 1

    # The QA model is trained on start/end token positions only; no separate 'labels' field is needed.
    inputs["start_positions"] = start_position
    inputs["end_positions"] = end_position
    return inputs
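The core of this preprocessing is converting a character span in the context into a token span through the offset mapping. The sketch below isolates that step with hand-made offsets instead of a real tokenizer: the sentence, the offsets, and the helper name `char_span_to_token_span` are all assumptions for illustration, with one token per context word.

```python
def char_span_to_token_span(offsets, context_start, context_end, start_char, end_char):
    """Map a character span to (start_token, end_token); (0, 0) if it is not fully inside the context."""
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Walk forward to the last token that starts at or before the answer start.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward to the first token that ends at or after the answer end.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token

context = "Seoul is the capital of Korea"
# Hand-made offsets: [CLS], two question tokens, [SEP], one token per context word, [SEP].
offsets = [(0, 0), (0, 5), (6, 9), (0, 0),
           (0, 5), (6, 8), (9, 12), (13, 20), (21, 23), (24, 29), (0, 0)]
context_start, context_end = 4, 9

answer_text = "the capital"
start_char = context.index(answer_text)
end_char = start_char + len(answer_text)

print(char_span_to_token_span(offsets, context_start, context_end, start_char, end_char))
# If truncation cut the context after "the", the answer no longer fits and is labeled (0, 0).
print(char_span_to_token_span(offsets, context_start, 6, start_char, end_char))
```

Labeling unanswerable or truncated cases as `(0, 0)` points both positions at the first token, which is `[CLS]` for BERT-style inputs.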

In [None]:
train_dataset[0]

In [None]:
preprocess_function(train_dataset[0])

In [None]:
train_dataset = train_dataset.map(preprocess_function)
val_dataset = val_dataset.map(preprocess_function)

In [None]:
from transformers import DefaultDataCollator
from datasets import load_metric
accuracy = load_metric('accuracy')
# Evaluation metric
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    labels = np.concatenate(labels, -1).T
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions.reshape(-1), references=labels.reshape(-1))


data_collator = DefaultDataCollator()




from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

In [None]:
# The model loaded in this exercise has already been fine-tuned on the KorQuAD dataset. Let's evaluate it without any training.
model = AutoModelForQuestionAnswering.from_pretrained('sangrimlee/bert-base-multilingual-cased-korquad')

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset.select(range(10)),  # slicing with [:10] returns a dict of columns, not a Dataset
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)


trainer.evaluate()

## Sentence generation with GPT

In [None]:
# Using pipeline
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')

In [None]:
generator("Hello, I'm a language model,", max_length=50, num_return_sequences=10)


In [None]:
# Without pipeline
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

input_ids = tokenizer.encode("Some text to encode", return_tensors='pt')

generated_text_samples = model.generate(
    input_ids,
    max_length=150,
    num_return_sequences=5,
    no_repeat_ngram_size=2,  # never repeat the same 2-gram
    repetition_penalty=1.5,
    top_p=0.92,
    temperature=0.85,
    do_sample=True,
    top_k=125,
    early_stopping=True
)
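The `temperature`, `top_k`, and `top_p` arguments all reshape the next-token distribution before one token is sampled. The following is a simplified, dependency-free sketch of that idea on a toy vocabulary of five tokens; `sample_next_token` is a hypothetical helper and is not how `generate` is implemented internally.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample one token id; also return the ids that survived filtering."""
    # 1) Temperature: divide logits before the softmax (<1 sharpens, >1 flattens).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 2) Top-k: keep only the k most probable tokens (0 disables the filter).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    mass = sum(probs[i] for i in order)

    # 3) Top-p (nucleus): keep the smallest prefix whose renormalized
    #    cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # 4) Sample among the survivors, weighted by their probabilities.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0], kept

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
token, kept = sample_next_token(toy_logits, top_k=3, top_p=0.8, rng=random.Random(0))
print(token, kept)
```

Lower `top_p` or `top_k` values shrink the candidate set and make output more conservative; higher `temperature` values spread probability toward less likely tokens.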

In [None]:
for i, beam in enumerate(generated_text_samples):
    print("{}: {}".format(i, tokenizer.decode(beam, skip_special_tokens=True)))
    print()

In [None]:
# Korean GPT

from transformers import PreTrainedTokenizerFast
tokenizer = PreTrainedTokenizerFast.from_pretrained("skt/kogpt2-base-v2", bos_token='</s>', eos_token='</s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>')
tokenizer.tokenize("안녕하세요. 한국어 GPT-2 입니다.😤:)l^o")

In [None]:
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained('skt/kogpt2-base-v2')
text = '근육이 커지기 위해서는'
input_ids = tokenizer.encode(text, return_tensors='pt')
gen_ids = model.generate(input_ids,
                           max_length=128,
                           repetition_penalty=2.0,
                           pad_token_id=tokenizer.pad_token_id,
                           eos_token_id=tokenizer.eos_token_id,
                           bos_token_id=tokenizer.bos_token_id,
                           use_cache=True)

In [None]:
generated = tokenizer.decode(gen_ids[0])
print(generated)
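`repetition_penalty=2.0` discourages the model from emitting tokens it has already produced. A minimal sketch of the rule introduced in the CTRL paper is shown below; my understanding is that `generate` follows the same rule, but treat that as an assumption. `apply_repetition_penalty` is a hypothetical helper operating on a plain list of logits.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the score of every token id that already appears in generated_ids."""
    out = list(logits)
    for token_id in set(generated_ids):
        score = out[token_id]
        # Positive logits shrink toward zero; negative logits become more negative.
        out[token_id] = score / penalty if score > 0 else score * penalty
    return out

toy_logits = [4.0, -1.0, 2.0, 0.5]
penalized = apply_repetition_penalty(toy_logits, generated_ids=[0, 1, 0], penalty=2.0)
print(penalized)
```

A penalty of 1.0 leaves the logits unchanged, and the penalty is applied once per token id no matter how many times that token has already appeared.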

## Paragraph summarization with KoBART

In [108]:
from transformers import AutoTokenizer
from transformers import BartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained('gogamza/kobart-summarization')
model = BartForConditionalGeneration.from_pretrained('gogamza/kobart-summarization')

You passed along `num_labels=3` with an incompatible id to label map: {'0': 'NEGATIVE', '1': 'POSITIVE'}. The number of labels wil be overwritten to 2.
Model config BartConfig {
  "_name_or_path": "gogamza/kobart-summarization",
  "activation_dropout": 0

All model checkpoint weights were used when initializing BartForConditionalGeneration.

All the weights of BartForConditionalGeneration were initialized from the model checkpoint at gogamza/kobart-summarization.
If your task is similar to the task 

In [115]:
text = "미국 연방준비제도(Fed·연준) 위원들이 인플레이션이 '상당히 낮아질 때까지'(substantially low) 금리 인상을 추진하겠다고 밝혔다. 인플레이션을 2%로 되돌린다는 목표 아래 경기가 둔화하더라도 금리 인상을 지속하겠다는 의지를 재확인한 것이다."

raw_input_ids = tokenizer.encode(text)
input_ids = [tokenizer.bos_token_id] + raw_input_ids + [tokenizer.eos_token_id]

summary_ids = model.generate(torch.tensor([input_ids]))
tokenizer.decode(summary_ids.squeeze().tolist(), skip_special_tokens=True)

'미국 연방준비제도(Fed·연준) 위원들이 인플레이션이 2%로 되돌려질'