Intent Classifier
===

---

## A model that takes a sentence as input and classifies its intent

* The bert-base-multilingual-cased model was fine-tuned on the 3i4k dataset.
* The code follows the Hugging Face "Fine-tuning a pretrained model" documentation.
* Training was run in an Ainize workspace.
* A demo version is available on Ainize.

##### Pretrained model : [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased)
##### Dataset : [3i4k](https://huggingface.co/datasets/kor_3i4k)
##### Code reference : [huggingface](https://huggingface.co/transformers/training.html)
##### Workspace : [Ainize](https://ainize.ai/workspace)
##### Endpoint : [Ainize](https://main-kor-3i4k-bert-base-cased-rjdm1324.endpoint.ainize.ai)

## Import the required libraries

* Dataset : used to put the data into the format the trainer expects.
* TrainingArguments, Trainer : used to fine-tune the model.
* BertTokenizer, BertForSequenceClassification : used to load the pretrained tokenizer and model from Hugging Face.
* LabelEncoder : used to convert labels into the format the dataset expects.
* accuracy_score : used to measure model accuracy.
* cuda : used to run on a GPU.

In [40]:
!pip install -U transformers datasets



In [2]:
import torch
from torch.utils.data import Dataset
import numpy as np
import pandas as pd
from transformers import  TrainingArguments, Trainer
from transformers import BertTokenizer, BertForSequenceClassification
from torch import cuda
from sklearn.metrics import accuracy_score

### The GPU used here was provided by the [AI NETWORK workspace]().

* `device` is set to cuda when a GPU is available and falls back to cpu otherwise.

In [3]:
device = 'cuda:0' if cuda.is_available() else 'cpu'

### Set the training hyperparameters and load the model, tokenizer, and dataset

* There are 7 intent classes, so num_labels is set to 7.

In [4]:
num_labels=7
max_length = 256
batch_size = 16
num_epochs = 3
log_interval = 200
learning_rate =  5e-5

In [6]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-multilingual-cased", num_labels=num_labels)

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model ch

In [7]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-multilingual-cased',do_lower_case=False)
tokenizer.save_pretrained(".")

('./tokenizer_config.json',
 './special_tokens_map.json',
 './vocab.txt',
 './added_tokens.json',
 './tokenizer.json')

In [8]:
from datasets import load_dataset

dataset = load_dataset("kor_3i4k")

Using custom data configuration default
Reusing dataset kor_3i4k (/workspace/.cache/huggingface/datasets/kor_3i4k/default/1.1.0/5cd76dab10e6a5f36fd0ae9e1d01a725b6312307a5fd991a10b423e49e690dfe)


### Take the train and test splits of the dataset and tokenize them

In [11]:
X_train = list(dataset['train']['text'])
y_train = list(dataset['train']['label'])
X_val = list(dataset['test']['text'])
y_val = list(dataset['test']['label'])
X_train_tokenized = tokenizer(X_train, padding=True, truncation = True, max_length =max_length)
X_val_tokenized = tokenizer(X_val, padding=True, truncation = True, max_length =max_length)
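To make the effect of `padding=True`, `truncation=True`, and `max_length` concrete, here is a minimal pure-Python sketch of the batching logic on made-up token ID lists. The helper name `pad_and_truncate` and the token IDs are assumptions for illustration only. The real tokenizer also inserts special tokens and accounts for them when truncating, which this sketch ignores.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Hypothetical helper: cut each sequence to max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        # Padding positions get pad_id and are masked out with 0
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs; three sequences of different lengths
example = pad_and_truncate(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 102]],
    max_length=5,
)
print(example["input_ids"])
print(example["attention_mask"])
```

Every row ends up with the same length, which is what allows the batch to be stacked into a single tensor later.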

### Declare a class that wraps the tokenized data as a dataset, then build the datasets used for training

In [12]:
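class Dataset(torch.utils.data.Dataset):
    """Wraps tokenizer output (and optional labels) for use with Trainer."""

    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Pick example idx from every tokenizer field (input_ids, attention_mask, ...)
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # Labels are optional, so the class also works for unlabeled data
        if self.labels is not None:
            item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])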

In [13]:
train_dataset = Dataset(X_train_tokenized, y_train)
val_dataset = Dataset(X_val_tokenized, y_val)
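The tokenizer returns its output column-wise (one list per field), while training consumes one example at a time. The sketch below mirrors the indexing done in `__getitem__` using plain lists instead of tensors. The values and the `get_item` name are made up for illustration.

```python
# Column-wise data, shaped like tokenizer output (made-up values)
encodings = {
    "input_ids": [[101, 7, 102], [101, 9, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
labels = [3, 1]


def get_item(idx):
    """Hypothetical stand-in for Dataset.__getitem__, without torch."""
    # Take row idx from every column
    item = {key: values[idx] for key, values in encodings.items()}
    # The label is wrapped in a list, as in the class above
    item["labels"] = [labels[idx]]
    return item


print(get_item(1))
```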

### Define the function that computes accuracy

In [14]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc}
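As a rough illustration of what `compute_metrics` calculates, the sketch below reproduces the argmax-then-compare logic in plain Python on made-up logits, without NumPy or scikit-learn. The `argmax` and `accuracy` helpers are hypothetical names, not part of any library.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)


# Made-up logits for 4 examples and 3 classes
logits = [
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 3.0],
    [2.2, 2.5, 0.0],
]
labels = [1, 0, 2, 0]
print(accuracy(logits, labels))  # 3 of 4 predictions match -> 0.75
```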

### Configure TrainingArguments, set up the Trainer, and start training

In [15]:
args = TrainingArguments(
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    learning_rate =  learning_rate ,
    num_train_epochs=num_epochs,
    logging_steps= log_interval ,
    output_dir="output",
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='log',
    load_best_model_at_end=True,
    evaluation_strategy="steps"
)
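With `warmup_steps=500`, the learning rate first ramps up from 0 to `learning_rate` and then decays. The sketch below assumes the default linear scheduler and the 5169 total optimization steps reported in the training log. `linear_schedule_lr` is a hypothetical helper written for this illustration, not a library function.

```python
def linear_schedule_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=5169):
    """Hypothetical helper: linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Warmup: scale from 0 up to base_lr
        return base_lr * step / warmup_steps
    # Decay: scale from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 2000, 5169):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the peak rate of 5e-5 is reached only at step 500, so the first logged evaluations (steps 200 and 400) happen while the rate is still rising.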

In [16]:
model = model.to(device)

In [17]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

In [18]:
trainer.train()

***** Running training *****
  Num examples = 55134
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 5169


| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 200 | 1.2282 | 0.657881 | 0.805424 |
| 400 | 0.6562 | 0.571649 | 0.834994 |
| 600 | 0.5514 | 0.507469 | 0.849371 |
| 800 | 0.4947 | 0.468215 | 0.853619 |
| 1000 | 0.4886 | 0.404701 | 0.866852 |
| 1200 | 0.4469 | 0.417196 | 0.870446 |
| 1400 | 0.4351 | 0.414374 | 0.870283 |
| 1600 | 0.3789 | 0.377815 | 0.879431 |
| 1800 | 0.3674 | 0.393021 | 0.876164 |
| 2000 | 0.3297 | 0.36614 | 0.876981 |


***** Running Evaluation *****
  Num examples = 6121
  Batch size = 32
Saving model checkpoint to output/checkpoint-200
Configuration saved in output/checkpoint-200/config.json
Model weights saved in output/checkpoint-200/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 6121
  Batch size = 32
Saving model checkpoint to output/checkpoint-400
Configuration saved in output/checkpoint-400/config.json
Model weights saved in output/checkpoint-400/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 6121
  Batch size = 32
Saving model checkpoint to output/checkpoint-600
Configuration saved in output/checkpoint-600/config.json
Model weights saved in output/checkpoint-600/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 6121
  Batch size = 32
Saving model checkpoint to output/checkpoint-800
Configuration saved in output/checkpoint-800/config.json
Model weights saved in output/checkpoint-800/pytorch_model.bin
***** Running Evaluation *****
  Num exa

***** Running Evaluation *****
  Num examples = 6121
  Batch size = 32
Saving model checkpoint to output/checkpoint-3400
Configuration saved in output/checkpoint-3400/config.json
Model weights saved in output/checkpoint-3400/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 6121
  Batch size = 32
Saving model checkpoint to output/checkpoint-3600
Configuration saved in output/checkpoint-3600/config.json
Model weights saved in output/checkpoint-3600/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 6121
  Batch size = 32
Saving model checkpoint to output/checkpoint-3800
Configuration saved in output/checkpoint-3800/config.json
Model weights saved in output/checkpoint-3800/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 6121
  Batch size = 32
***** Running Evaluation *****
  Num examples = 6121
  Batch size = 32
Saving model checkpoint to output/checkpoint-4200
Configuration saved in output/checkpoint-4200/config.json
Model weights saved i

TrainOutput(global_step=5169, training_loss=0.356829648599977, metrics={'train_runtime': 2376.638, 'train_samples_per_second': 69.595, 'train_steps_per_second': 2.175, 'total_flos': 1.4650266110839308e+16, 'train_loss': 0.356829648599977, 'epoch': 3.0})
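The step count and throughput figures in the log and in `TrainOutput` can be reproduced from the dataset size and the effective batch size. The short calculation below uses only numbers printed above. That the total batch size of 32 comes from two devices at 16 each is an inference from the log, not something the notebook states.

```python
import math

num_examples = 55134      # "Num examples" in the training log
total_batch_size = 32     # "Total train batch size" in the training log
num_epochs = 3
train_runtime = 2376.638  # seconds, from TrainOutput

# The last batch of each epoch is partial, hence the ceiling
steps_per_epoch = math.ceil(num_examples / total_batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)                          # 1723 5169
print(round(num_examples * num_epochs / train_runtime, 3))   # 69.595
print(round(total_steps / train_runtime, 3))                 # 2.175
```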

The seven intent labels are: fragment (0), statement (1), question (2), command (3), rhetorical question (4), rhetorical command (5), and intonation-dependent utterance (6).

### Evaluate the model

In [20]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 6121
  Batch size = 32


{'eval_loss': 0.3275899887084961,
 'eval_accuracy': 0.892991341284104,
 'eval_runtime': 23.8916,
 'eval_samples_per_second': 256.198,
 'eval_steps_per_second': 8.036,
 'epoch': 3.0}
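To put the accuracy figure in absolute terms, the reported `eval_accuracy` can be converted back into a count of correctly classified test sentences. This is simple arithmetic on the numbers printed above, shown as a sketch.

```python
import math

num_eval_examples = 6121            # "Num examples" in the evaluation log
eval_accuracy = 0.892991341284104   # from trainer.evaluate()
eval_batch_size = 32                # "Batch size" in the evaluation log

num_correct = round(eval_accuracy * num_eval_examples)
num_wrong = num_eval_examples - num_correct
eval_steps = math.ceil(num_eval_examples / eval_batch_size)

print(num_correct, num_wrong)  # 5466 655
print(eval_steps)              # 192
```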

### Save the model

In [21]:
model_path = "kor_3i4k_bert_base_cased"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

Configuration saved in kor_3i4k_bert_base_cased/config.json
Model weights saved in kor_3i4k_bert_base_cased/pytorch_model.bin
tokenizer config file saved in kor_3i4k_bert_base_cased/tokenizer_config.json
Special tokens file saved in kor_3i4k_bert_base_cased/special_tokens_map.json


('kor_3i4k_bert_base_cased/tokenizer_config.json',
 'kor_3i4k_bert_base_cased/special_tokens_map.json',
 'kor_3i4k_bert_base_cased/vocab.txt',
 'kor_3i4k_bert_base_cased/added_tokens.json',
 'kor_3i4k_bert_base_cased/tokenizer.json')

# Test Prediction
* Define a function that returns the predicted intent for a text.

In [25]:
# Label names indexed by class id
INTENT_LABELS = [
    "fragment",
    "statement",
    "question",
    "command",
    "rhetorical question",
    "rhetorical command",
    "intonation-dependent utterance",
]

def get_prediction(text):
    # Tokenize and move the inputs to the same device as the model
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to(device)
    model.eval()  # disable dropout for inference
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs[0].softmax(1)
    prediction = probs.argmax().item()
    return INTENT_LABELS[prediction]
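The last steps of the function (softmax over the logits, argmax, and the lookup from class id to label name) can be sketched without the model. The logits below are made up, and `softmax` and `labels_by_id` are names introduced only for this illustration.

```python
import math

# Class id -> label name, in the order used by the dataset
labels_by_id = [
    "fragment", "statement", "question", "command",
    "rhetorical question", "rhetorical command",
    "intonation-dependent utterance",
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.2, 0.3, 0.1, 4.0, -0.5, 1.1, -2.0]  # made-up model output
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(labels_by_id[best])  # command
```

Because softmax preserves the ordering of the logits, the argmax of the probabilities is the same as the argmax of the raw logits. The softmax step matters only if the probability itself is needed.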

### Enter the text

In [None]:
# Korean input sentence, roughly "Hey, do this."
text = """
    너 이것좀 해라.
"""

In [79]:
prediction = get_prediction(text)

### Print the intent of the text

In [80]:
print(prediction)

command
