News Topic Classifier 
===
-------

# A model that classifies the topic of a news headline
```
* The bert-base-multilingual-cased model was fine-tuned on the klue-tc dataset.
* The code follows the huggingface "Fine-tuning a pretrained model" documentation.
* Training ran in an Ainize workspace.
* A demo version is available on Ainize.
```
-------
##### Pretrained model : [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased)
##### Dataset : [klue-tc (a.k.a. YNAT)](https://klue-benchmark.com/tasks/66/overview/description)
##### Code reference : [huggingface](https://huggingface.co/transformers/training.html)
##### workspace : [Ainize](https://ainize.ai/workspace)
##### endpoint : [Ainize](https://main-klue-tc-bert-base-multilingual-cased-rjdm1324.endpoint.ainize.ai)

------
### Install and import the required libraries.

* Dataset : base class used to wrap the tokenized data in the format the Trainer expects.
* TrainingArguments, Trainer : used to fine-tune the model.
* BertTokenizer, BertForSequenceClassification : load the pretrained tokenizer and model from huggingface.
* LabelEncoder : converts the string labels into integer ids.
* accuracy_score : measures model accuracy.
* cuda : checks whether a GPU is available.

In [1]:
!pip install -U transformers datasets scipy scikit-learn



In [3]:
import torch
from torch.utils.data import Dataset
import numpy as np
import pandas as pd
from transformers import  TrainingArguments, Trainer
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.preprocessing import LabelEncoder
from torch import cuda
from sklearn.metrics import accuracy_score

### The GPU used here was provided by the AI NETWORK workspace.

* Set device to cuda when a GPU is available; otherwise fall back to the CPU.

In [4]:
device = 'cuda:0' if cuda.is_available() else 'cpu'

### Load the JSON data and convert it to CSV.

In [6]:
import json
import csv

with open('ynat-v1_train.json','r', encoding='utf-8') as f :
    data = json.loads(f.read())
df = pd.json_normalize(data)
df.to_csv('ynat-v1_train.csv', index=False, encoding='utf-8')

with open('ynat-v1_dev.json','r', encoding='utf-8') as f :
    data = json.loads(f.read())
df = pd.json_normalize(data)
df.to_csv('ynat-v1_dev.csv', index=False, encoding='utf-8')
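
To make the conversion concrete, here is a minimal stdlib-only sketch of what the cell above does. `pd.json_normalize` flattens nested objects into dot-separated column names, and the result is then written out as CSV. The `flatten` helper and the two records, including their field names (`guid`, `title`, `label`, `annotations`), are made up for illustration and may not match the real YNAT files.

```python
import csv
import io
import json

# Hypothetical records standing in for the contents of ynat-v1_train.json.
raw = json.dumps([
    {"guid": "demo-0", "title": "headline A", "label": "경제",
     "annotations": {"first": "경제", "second": "사회"}},
    {"guid": "demo-1", "title": "headline B", "label": "스포츠",
     "annotations": {"first": "스포츠", "second": "스포츠"}},
], ensure_ascii=False)


def flatten(record, prefix=""):
    """Flatten nested dicts into one level with dot-separated keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


rows = [flatten(r) for r in json.loads(raw)]

# Write the flattened rows as CSV (to an in-memory buffer instead of a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Only the `title` and `label` columns are used later, so any extra flattened columns are simply ignored.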


### Load the CSV files and extract the title and label columns.

* data holds the title and label columns of the train dataset.
* data_dev holds the title and label columns of the dev dataset.

In [7]:
data = pd.read_csv("ynat-v1_train.csv")
data_dev = pd.read_csv("ynat-v1_dev.csv")
data = data[['title','label']]
data_dev = data_dev[['title','label']]

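### Declare the tokenizer.

* The tokenizer is the pretrained bert-base-multilingual-cased tokenizer.

In [27]:
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=False)
# save_pretrained() returns a tuple of file paths, so call it separately
# instead of chaining it onto the assignment.
tokenizer.save_pretrained(".")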

loading file https://huggingface.co/bert-base-multilingual-cased/resolve/main/vocab.txt from cache at /workspace/.cache/huggingface/transformers/eff018e45de5364a8368df1f2df3461d506e2a111e9dd50af1fae061cd460ead.6c5b6600e968f4b5e08c86d8891ea99e51537fc2bf251435fb46922e8f7a7b29
loading file https://huggingface.co/bert-base-multilingual-cased/resolve/main/added_tokens.json from cache at None
loading file https://huggingface.co/bert-base-multilingual-cased/resolve/main/special_tokens_map.json from cache at None
loading file https://huggingface.co/bert-base-multilingual-cased/resolve/main/tokenizer_config.json from cache at /workspace/.cache/huggingface/transformers/f55e7a2ad4f8d0fff2733b3f79777e1e99247f2e4583703e92ce74453af8c235.ec5c189f89475aac7d8cbd243960a0655cfadc3d0474da8ff2ed0bf1699c2a5f
loading file https://huggingface.co/bert-base-multilingual-cased/resolve/main/tokenizer.json from cache at /workspace/.cache/huggingface/transformers/46880f3b0081fda494a4e15b05787692aa4c1e21e0ff2428ba8b

### Encode the label column.

* Encode the label columns of data and data_dev as integer ids so the Dataset class can use them.
* The encoder is fitted on the train labels only and reused for the dev labels, so both splits share one mapping.

In [9]:

label_encoder = LabelEncoder()
data["label"] = label_encoder.fit_transform(data["label"])
data_dev["label"] = label_encoder.transform(data_dev["label"])

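### Map the label ids to class names.

* mapping stores the name of each label id so predictions can be printed. The seven class names are Korean: IT과학 (IT/Science), 경제 (Economy), 사회 (Society), 생활문화 (Life & Culture), 세계 (World), 스포츠 (Sports), 정치 (Politics).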

In [10]:
mapping = dict(zip(range(len(label_encoder.classes_)), label_encoder.classes_))
mapping

{0: 'IT과학', 1: '경제', 2: '사회', 3: '생활문화', 4: '세계', 5: '스포츠', 6: '정치'}
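
The mapping above follows from how `LabelEncoder` works: it sorts the distinct class names and numbers them in that order. The pure-Python sketch below mimics that idea (it is not scikit-learn's implementation) and shows why a dev split is safer to encode with the mapping fitted on the training labels than with a fresh fit. The sample label lists are invented.

```python
# The seven YNAT topic names, as printed in the mapping above.
train_labels = ["정치", "경제", "사회", "생활문화", "세계", "IT과학", "스포츠", "경제"]

# Fit: sort the distinct class names and number them.
classes = sorted(set(train_labels))
to_id = {name: i for i, name in enumerate(classes)}
mapping = dict(enumerate(classes))

# Hypothetical dev split in which some classes happen to be missing.
dev_labels = ["정치", "스포츠", "정치"]

# Transform with the training mapping: ids stay consistent with training.
dev_ids = [to_id[name] for name in dev_labels]

# Refitting on the dev split alone would renumber the classes.
refit = {name: i for i, name in enumerate(sorted(set(dev_labels)))}
dev_ids_refit = [refit[name] for name in dev_labels]

print(mapping)
print(dev_ids)        # ids under the training mapping
print(dev_ids_refit)  # ids after an independent refit
```

When both splits contain all seven classes the two approaches give the same ids, but reusing the fitted mapping does not depend on that.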

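### Set the parameters needed for training.

* There are 7 labels, so num_labels is set to 7.
* The remaining hyperparameters (maximum sequence length, batch size, number of epochs, logging interval, learning rate) are set as well.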

In [11]:
num_labels=7
max_len = 128
batch_size = 32
num_epochs = 5
log_interval = 200
learning_rate =  5e-5

### Prepare the training and evaluation data and tokenize it.

* Each dataset is tokenized and stored.

In [12]:
X_train= list(data['title'])
y_train= list(data['label'])
X_val = list(data_dev['title'])
y_val = list(data_dev['label'])
X_train_tokenized = tokenizer(X_train, padding=True, truncation = True, max_length =max_len)
X_val_tokenized = tokenizer(X_val, padding=True, truncation = True, max_length =max_len)
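
The effect of `padding=True` and `truncation=True` can be illustrated with a toy function that works on lists of token ids. This is only a sketch of the behaviour (cut anything longer than `max_length`, then pad to the longest sequence in the batch), not the tokenizer's implementation. `pad_and_truncate`, the pad id of 0, and the sample ids are assumptions for the example. The real tokenizer also keeps the closing special token when it truncates, which this toy version does not.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Toy version of padding=True, truncation=True for lists of token ids."""
    # Truncation: drop everything past max_length.
    truncated = [ids[:max_length] for ids in batch]
    # Padding: extend every sequence to the longest one left in the batch.
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for three headlines of different lengths.
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]]
encoded = pad_and_truncate(batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```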

### Define a class that turns the tokenized titles and labels into a dataset.

In [13]:
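class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # labels is optional, so only attach it when it was provided.
        if self.labels is not None:
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        # Count the encoded examples so the length also works without labels.
        return len(self.encodings["input_ids"])
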

### Build the train and evaluation datasets.

* train_dataset is used for training.
* val_dataset is used for evaluation.

In [14]:
train_dataset = Dataset(X_train_tokenized, y_train)
val_dataset = Dataset(X_val_tokenized, y_val)

### Declare the model.

* Use the pretrained bert-base-multilingual-cased model from huggingface.

In [15]:
model = BertForSequenceClassification.from_pretrained('bert-base-multilingual-cased',num_labels=num_labels).to(device)

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model ch

### Define a function that measures accuracy.

In [16]:


def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
  }
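
What `compute_metrics` computes can be checked by hand on a few rows. The sketch below uses plain Python lists in place of the NumPy arrays the `Trainer` passes in, with small helpers standing in for `argmax(-1)` and `accuracy_score`. The logits and labels are invented.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(labels, logits):
    """Fraction of rows whose highest-scoring class equals the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# Hypothetical logits for four headlines over three classes.
logits = [
    [2.0, 0.1, -1.0],   # highest score at index 0
    [0.3, 1.5, 0.2],    # highest score at index 1
    [-0.5, 0.0, 0.9],   # highest score at index 2
    [1.2, 1.1, 0.4],    # highest score at index 0
]
labels = [0, 1, 2, 1]

print({"accuracy": accuracy(labels, logits)})
```

Three of the four invented rows match their label, so the sketch reports an accuracy of 0.75.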

### Set the training arguments.

* Define the arguments with TrainingArguments and store them in args.

In [17]:
args = TrainingArguments(
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    learning_rate =  learning_rate ,
    num_train_epochs=num_epochs,
    logging_steps= log_interval ,
    output_dir="output",
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='log',
    load_best_model_at_end=True,
    evaluation_strategy="steps"
)
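
The interaction between `learning_rate` and `warmup_steps=500` is easier to see with numbers. The sketch below assumes the `Trainer` default scheduler (`lr_scheduler_type` left at `linear`), which warms up linearly to the peak rate and then decays linearly to zero. `lr_at` is a hypothetical helper, and the total of 3570 steps is taken from the training log of this run.

```python
def lr_at(step, peak_lr=5e-5, warmup_steps=500, total_steps=3570):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 2000, 3570):
    print(step, lr_at(step))
```

Under these assumptions the rate is half the peak at step 250, reaches 5e-5 at step 500, and falls to zero at the final step.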

In [18]:
model = model.to(device)

### Set up the trainer and start training.

In [19]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

In [20]:
trainer.train()

***** Running training *****
  Num examples = 45678
  Num Epochs = 5
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 3570


| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 200 | 1.2986 | 0.881617 | 0.693972 |
| 400 | 0.636 | 0.902439 | 0.673877 |
| 600 | 0.553 | 0.836179 | 0.700999 |
| 800 | 0.4818 | 0.803289 | 0.721423 |
| 1000 | 0.4365 | 0.568277 | 0.799165 |
| 1200 | 0.4112 | 0.574548 | 0.80202 |
| 1400 | 0.3997 | 0.584228 | 0.801581 |
| 1600 | 0.3184 | 0.601114 | 0.79664 |
| 1800 | 0.3093 | 0.5878 | 0.800044 |
| 2000 | 0.3019 | 0.493249 | 0.827825 |


***** Running Evaluation *****
  Num examples = 9107
  Batch size = 64
Saving model checkpoint to output/checkpoint-200
Configuration saved in output/checkpoint-200/config.json
Model weights saved in output/checkpoint-200/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 9107
  Batch size = 64
Saving model checkpoint to output/checkpoint-400
Configuration saved in output/checkpoint-400/config.json
Model weights saved in output/checkpoint-400/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 9107
  Batch size = 64
Saving model checkpoint to output/checkpoint-600
Configuration saved in output/checkpoint-600/config.json
Model weights saved in output/checkpoint-600/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 9107
  Batch size = 64
Saving model checkpoint to output/checkpoint-800
Configuration saved in output/checkpoint-800/config.json
Model weights saved in output/checkpoint-800/pytorch_model.bin
***** Running Evaluation *****
  Num exa

TrainOutput(global_step=3570, training_loss=0.3722160959110207, metrics={'train_runtime': 1477.3876, 'train_samples_per_second': 154.59, 'train_steps_per_second': 2.416, 'total_flos': 9017901201863340.0, 'train_loss': 0.3722160959110207, 'epoch': 5.0})
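
The numbers in the training log can be cross-checked with a little arithmetic. A total batch size of 64 with 32 per device suggests two devices. That is an inference from the log, not something the notebook states, so `num_devices` below is an assumption.

```python
import math

num_examples = 45678        # "Num examples" in the training log
per_device_batch = 32       # batch_size
num_devices = 2             # assumption: inferred from the total batch size of 64
num_epochs = 5
train_runtime = 1477.3876   # seconds, from TrainOutput

total_batch = per_device_batch * num_devices
steps_per_epoch = math.ceil(num_examples / total_batch)
total_steps = steps_per_epoch * num_epochs

print(total_batch, steps_per_epoch, total_steps)
print(round(total_steps / train_runtime, 3))                # steps per second
print(round(num_examples * num_epochs / train_runtime, 2))  # samples per second
```

The computed total of 3570 steps matches `global_step` in the output above, and the two throughput figures agree with `train_steps_per_second` and `train_samples_per_second` after rounding.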

### Evaluate the model.

In [21]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 9107
  Batch size = 64


{'eval_loss': 0.4932490885257721,
 'eval_accuracy': 0.8278247501921598,
 'eval_runtime': 16.777,
 'eval_samples_per_second': 542.825,
 'eval_steps_per_second': 8.524,
 'epoch': 5.0}

### Save the model.

In [22]:
model_path = "klue-ynat-bert-base-multilingual-cased"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

Configuration saved in klue-ynat-bert-base-multilingual-cased/config.json
Model weights saved in klue-ynat-bert-base-multilingual-cased/pytorch_model.bin
tokenizer config file saved in klue-ynat-bert-base-multilingual-cased/tokenizer_config.json
Special tokens file saved in klue-ynat-bert-base-multilingual-cased/special_tokens_map.json


('klue-ynat-bert-base-multilingual-cased/tokenizer_config.json',
 'klue-ynat-bert-base-multilingual-cased/special_tokens_map.json',
 'klue-ynat-bert-base-multilingual-cased/vocab.txt',
 'klue-ynat-bert-base-multilingual-cased/added_tokens.json')

Test Prediction
===

* Define a function that returns the predicted topic.

In [23]:
def get_prediction(text):
    # Move the inputs to the same device as the model instead of hard-coding "cuda".
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_len, return_tensors="pt").to(device)
    outputs = model(**inputs)
    probs = outputs[0].softmax(1)
    return mapping[probs.argmax().item()]

-----------

### Enter a news headline.

In [24]:
text="""
    내년 최저임금 9160원…"자영업자 한계 상황…실업난 우려" 中企 소상공인 반발
"""

### Classify the topic of the entered headline.

In [25]:
print(get_prediction(text))

경제
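
The prediction step in `get_prediction` comes down to a softmax over the seven logits followed by an argmax. The pure-Python sketch below uses invented logits (the real values come from the fine-tuned model) and a hypothetical `softmax` helper. It also shows that softmax does not change which class wins, so it only matters when the probabilities themselves are needed.

```python
import math

mapping = {0: 'IT과학', 1: '경제', 2: '사회', 3: '생활문화',
           4: '세계', 5: '스포츠', 6: '정치'}


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one headline, one score per class.
logits = [-1.2, 3.4, 0.8, -0.5, 0.1, -2.0, 1.1]
probs = softmax(logits)

# Argmax over the probabilities and over the raw logits pick the same class.
best = max(range(len(probs)), key=probs.__getitem__)
best_from_logits = max(range(len(logits)), key=logits.__getitem__)
print(mapping[best], round(probs[best], 3))
```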
