<h1>Korean Text classification with KoBERT</h1>

## **Summary of the project**

In Korea, although a great deal of research is being conducted on voice phishing, it remains a real-world problem that technologies such as artificial intelligence can help tackle. In a previous project, we created a dataset containing phone call conversation transcripts and general conversation text data. This voice phishing dataset has two classes: **voice phishing** (represented as "1") and **non-voice phishing** (represented as "0").

Using this dataset with the state-of-the-art (SOTA) pre-trained language model [KoBERT](https://github.com/SKTBrain/KoBERT), we will perform an NLP text classification task to build binary classification models.

## **Aim of the project**
In this project, we aim to build binary classification models capable of determining whether an input Korean conversation text is voice phishing ("1") or non-voice phishing ("0").

The APIs used are TensorFlow for the BERT model and PyTorch for the KoBERT model.

## **Desired outputs of the project**
From the trained models, we expect to achieve strong classification performance on this voice phishing dataset, so that the model can tell us whether a conversation is harmful or not.
At the end of this project, we will look at the accuracy of the model on the test set.

# Training the binary classification model with KoBERT

In [None]:
!nvidia-smi

## Installing the common needed packages

In [60]:
# download and install KoBERT as a Python package
# this command installs the required packages at the same time
  # gluonnlp >= 0.6.0
  # mxnet >= 1.4.0
  # onnxruntime >= 0.3.0
  # sentencepiece >= 0.1.6
  # torch >= 1.7.0
  # transformers >= 4.8.1

!pip install git+https://git@github.com/SKTBrain/KoBERT.git@master

Collecting git+https://****@github.com/SKTBrain/KoBERT.git@master
  Cloning https://****@github.com/SKTBrain/KoBERT.git (to revision master) to /tmp/pip-req-build-ddiszrb2
  Running command git clone --filter=blob:none --quiet 'https://****@github.com/SKTBrain/KoBERT.git' /tmp/pip-req-build-ddiszrb2
  Resolved https://****@github.com/SKTBrain/KoBERT.git to commit a82d428c26988ff40c03309038dba813fb83b92e
  Preparing metadata (setup.py) ... done
You should consider upgrading via the '/home/phenomx/anaconda3/bin/python -m pip install --upgrade pip' command.

## Import all the needed libraries

In [61]:
## importing the required packages
import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import gluonnlp as nlp
import numpy as np
import pandas as pd
from tqdm import tqdm, tqdm_notebook

from sklearn.model_selection import train_test_split

In [62]:
from sklearn.datasets import make_circles
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
# from keras.models import Sequential

In [63]:
## importing KoBERT functions
from kobert.utils import get_tokenizer
from kobert.pytorch_kobert import get_pytorch_kobert_model

In [64]:
## import transformers functions
from transformers import AdamW
from transformers.optimization import get_cosine_schedule_with_warmup

In [65]:
## Configure the GPU  device
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

## Importing the dataset

In [66]:
"""
Since we are using Colab, we check whether the environment is Colab so that
the data can also be imported when this notebook is run on a local machine
rather than on Colab.
"""

if 'google.colab' in str(get_ipython()):
  print('Running on CoLab')
  ## mount the google drive
  from google.colab import drive
  drive.mount('drive')
  # move to the directory where dataset is saved
  %cd drive/My\ Drive/Colab\ Notebooks/
else:
  print('Not running on CoLab')

Not running on CoLab


In [67]:
# import the dataset
dataset = pd.read_csv('KorCCViD_v1.3_fullcleansed.csv').sample(frac=1.0)
dataset.sample(n=15)

Unnamed: 0,Transcript,Label
660,그거 그쪽 카드 통합 채권 관리 부서 연람 게끔 어제 사장 전화 동의 부분 전산 확...,1
1150,그것 나쁘 방법 사실 연봉 얼마나 근무 환경 얘기 잖아 되게 다더라 아니 그런 얘기...,0
1013,본점 으로 서류 보내 드릴 에요 네네 그러면 고객 대출금 승인 오늘 저희 으로 본점...,1
893,수고 십니다 서울 지검 수사관 에요 지금 통화 괜찮 으세요 지금 다름 아니 으로 명...,1
480,혹시나 해서 그냥 금융 감독원 공부 요청 해서 회원 확실히 연락 드렸 오늘 입금 처...,1
953,많이 드라 명훈 자기 에서 160 불렀 쩌쪽 인제 진영 에서 140 불렀 거든 근데...,0
443,약간 헬스 직원 에서 만약 누가 면은 전달 드리 아니 어쩔 최선 잖아 어떻게 무슨 ...,0
1081,왜냐면 고대 해도 고대 부터 중세 까지 해도 철학 되게 그니까 인문학 되게 받들 여...,0
398,그리고 와이파이 터진다고 노트북 때릴 기세 와이파이 터진다고 약간 그런 무슨 자기 ...,0
203,아까 전화 드렸 정말 주사 과정 대해서 녹취 상태 다시 총괄 책임 드릴 예요 아니요...,1


## Data transformation and splitting

In [68]:
## convert the dataset into a list of [text, label] pairs to be used with KoBERT
# train_tsv = nlp.data.TSVDataset('KorCCViD_v1.3_fullcleansed.csv')

dataset_tsv = []
for text, label in zip(dataset['Transcript'], dataset['Label']):
    data = []
    data.append(text)
    data.append(str(label))

    dataset_tsv.append(data)
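
The loop above can also be written as a single comprehension. The following is a minimal sketch on made-up rows; the `rows` dictionary and its two toy transcripts are illustrative stand-ins for the DataFrame columns, not data from the real CSV.

```python
# Toy stand-in for the DataFrame columns; the notebook reads the real ones from the CSV.
rows = {
    "Transcript": ["여보세요 입금 계좌", "오늘 점심 뭐 먹을까"],
    "Label": [1, 0],
}

# Each example becomes [text, label-as-string], the layout that is later
# indexed with sent_idx=0 and label_idx=1.
dataset_tsv = [[text, str(label)]
               for text, label in zip(rows["Transcript"], rows["Label"])]

print(dataset_tsv)
```

Storing the label as a string mirrors the original loop; it is converted back to an integer when the dataset class is built.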

In [69]:
dataset_tsv[:5]

[['여보세요 입금 아까 계좌 오류 나가 오류 잠시 잠시 만요 오류 나온다 고요 금융 감독원 에서 발급 요청 해서 고객 께서 에스크로 계좌 발급 신청 아요 그렇 때문 문제 저희 에서 확인 잠시 그러 일단 고객 납부 납부 시간 너무 지연 어요 에스크로 계좌 경우 납부 다고 지연 시간 너무 오래 걸리 면은 다시 회수 처리 거든요 아마 지금 현재 시간 너무 오래 때문 입금 예요 그죠 입금 오류 나오 아요 그래서 다시 발급 요청 해야 돼요 그래서 일단 직장 근무 셔야 때문 일단 복귀 세요 복귀 산와 대부 산와 머니 통화 해서 다시 해서 에스크로 계좌 발급 요청 할께요 그런데 발급 요청 저희 발급 더라고 해서 바로 나오 아니 거든요 에스크로 계좌 경우 금융 감독원 에서 안전 진행 드리 위해서 고객 께서 만일 저희 에서 상대 산와 머니 납부 중도 상환 없이 했는데도 불구 저희 에서 만일 대출 부결 부결 다시 환급 계좌 발급 에요 그렇 때문 지금 시간 오래 지연 잖아요 그래서 산와 머니 에서 통화 해서 에스크로 계좌 발급 새로 발급 드리 전화 드리 습니다 근데 어서 통화 어려울 거든요 그러 세요 그러면 그러면요 어떻게 오늘 지금 시간 아요 그래서 시간 여섯 저희 금융 업무 마감 시간 때문 오늘 납부 난다 더라도 내일 내일 해서 송금 처리 으실 거든요 시간 너무 촉박 니까 그래서 내일 오전 으로 처리 어떠실까 편하 시간 그게 괜찮 아요 그렇 습니다 그러면 내일 오전 정도 해서 진행 드릴까요 그럼 12 아니 오후 정도 그때 오후 오전 오후 에서 사이 해서 진행 드릴게요 습니다 그럼 오늘 근무 구요 괜찮 내일 저희 대출금 수령 게끔 처리 건데 괜찮 내일 해도 상관 어요 습니다 고객',
  '1'],
 ['그래 어떤 정말로 맛있 예요 모두 보여 드릴게요 지금 그래서 저희 과정 에서 통장 일단 당하 에게 걸로 확인 고요 확인 연락 드립니다 당하 셔서 계신 건지 직접 건지 확인 세요 아니 면서 200 아니 중요 저희 수업 어요 니까 사람 30 에서 50 까지 대창 시대 부분 에요 저

In [71]:
# Split the data into train set and test set

# train_set, test_set = train_test_split(dataset_tsv, 
#                                test_size=0.3, 
#                                random_state=42, 
#                                shuffle=True)
# print(f"Numbers of train instances by class: {len(train_set)}")
# print(f"Numbers of test instances by class: {len(test_set)}")

train_set, val_set = train_test_split(dataset_tsv, 
                               test_size=0.2, 
                               random_state=42, 
                               shuffle=True)

# train_set, val_set = train_test_split(train_set, 
#                                test_size=0.2, 
#                                random_state=42, 
#                                shuffle=True)
print(f"Number of train instances: {len(train_set)}")
print(f"Number of val instances: {len(val_set)}")
# print(f"Number of test instances: {len(test_set)}")



Number of train instances: 974
Number of val instances: 244
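
The counts printed above are totals per split, not per class. To check how the two classes are distributed within each split, something like the following sketch could be used. The `label_counts` helper and the toy splits are hypothetical; they only reuse the `[text, label]` layout built earlier.

```python
from collections import Counter

def label_counts(examples):
    """Count how many examples carry each label (the label sits at index 1)."""
    return Counter(label for _, label in examples)

# Hypothetical miniature splits in the same [text, label] layout.
toy_train = [["a", "1"], ["b", "0"], ["c", "0"], ["d", "1"]]
toy_val = [["e", "0"], ["f", "1"]]

print("train:", dict(label_counts(toy_train)))
print("val:", dict(label_counts(toy_val)))
```

If the classes turn out to be unevenly spread, `train_test_split` also accepts a `stratify` argument that keeps the class ratio similar across splits.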


## Prepare the data as input for the KoBERT model
According to the documentation, the `BERTDataset` class is used to perform the following tasks in the background:
- Tokenization
- Numericalization (encoding string to integer)
- Padding
- etc
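
To make these steps concrete, here is a minimal pure-Python sketch of what happens to one sentence. It is not the gluonnlp implementation: the real tokenizer is SentencePiece-based, whereas this sketch splits on whitespace, and `toy_vocab` is made up. The special-token ids ([UNK]=0, [PAD]=1, [CLS]=2, [SEP]=3) are an assumption, partly read off the sample output shown later in this notebook.

```python
# Special-token ids are assumptions (see the text above); the vocabulary is a toy.
UNK_ID, PAD_ID, CLS_ID, SEP_ID = 0, 1, 2, 3
toy_vocab = {"여보세요": 10, "입금": 11, "계좌": 12}

def encode(text, max_len):
    """Tokenize, numericalize, truncate and pad a single sentence."""
    tokens = text.split()                               # tokenization (toy: whitespace)
    ids = [toy_vocab.get(t, UNK_ID) for t in tokens]    # numericalization
    ids = [CLS_ID] + ids[: max_len - 2] + [SEP_ID]      # truncation + special tokens
    valid_length = len(ids)                             # number of non-padding tokens
    ids = ids + [PAD_ID] * (max_len - valid_length)     # padding up to max_len
    segment_ids = [0] * max_len                         # single sentence: all zeros
    return ids, valid_length, segment_ids

ids, valid_length, segment_ids = encode("여보세요 입금 계좌 오류", max_len=8)
print(ids, valid_length, segment_ids)
```

The three returned values correspond to the token ids, valid length, and segment ids that each dataset item carries alongside its label.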



In [72]:
# Definition of BERTDataset class (mandatory)
class BERTDataset(Dataset):
    def __init__(self, dataset, sent_idx, label_idx, bert_tokenizer, max_len,
                 pad, pair):
        transform = nlp.data.BERTSentenceTransform(
            bert_tokenizer, max_seq_length=max_len, pad=pad, pair=pair)

        self.sentences = [transform([i[sent_idx]]) for i in dataset]
        self.labels = [np.int32(i[label_idx]) for i in dataset]

    def __getitem__(self, i):
        return (self.sentences[i] + (self.labels[i], ))

    def __len__(self):
        return (len(self.labels))

In [73]:
# Setting the hyperparameters
max_len = 64 # Maximum sequence length in tokens; longer transcripts are truncated.
             # KoBERT has 512 position embeddings, so max_len cannot exceed 512.
batch_size = 32
warmup_ratio = 0.1
num_epochs = 10   # only parameter changed from 5 to 10 compared to the documentation
max_grad_norm = 1
log_interval = 200
learning_rate = 5e-5  # 4e-5

In [74]:
# Perform the preparation of the data using the class defined above
bertmodel, vocab = get_pytorch_kobert_model() # calling the bert model and the vocabulary

tokenizer = get_tokenizer()
tok = nlp.data.BERTSPTokenizer(tokenizer, vocab, lower=False)

train_set = BERTDataset(train_set, 0, 1, tok, max_len, True, False)
val_set = BERTDataset(val_set, 0, 1, tok, max_len, True, False)
#test_set = BERTDataset(test_set, 0, 1, tok, max_len, True, False)

using cached model. /home/user_01/.cache/kobert_v1.zip
using cached model. /home/user_01/.cache/kobert_news_wiki_ko_cased-1087f8699e.spiece
using cached model. /home/user_01/.cache/kobert_news_wiki_ko_cased-1087f8699e.spiece


In [75]:
tokenizer

'/home/user_01/.cache/kobert_news_wiki_ko_cased-1087f8699e.spiece'

In [76]:
tok

<gluonnlp.data.transforms.BERTSPTokenizer at 0x7f11e04144f0>

In [77]:
# verifying the transformation
train_set[1]

(array([   2, 2726, 4269, 4316,  900, 7431, 4480, 2320, 2882, 1334, 7344,
        2882, 5474,  517, 7139,  517, 5771, 4542, 1761, 6999, 5130, 3343,
         517, 5925, 3946, 4758, 2123,  533,  958, 1098, 4584, 3224, 1076,
         517, 6700,  517, 7330, 4468,  517, 6896, 6999, 1316, 6559, 7227,
        2574, 4164,    3,    1,    1,    1,    1,    1,    1,    1,    1,
           1,    1,    1,    1,    1,    1,    1,    1,    1], dtype=int32),
 array(47, dtype=int32),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       dtype=int32),
 1)

In [78]:
# creating torch-type datasets
train_dataloader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, num_workers=5)
val_dataloader = torch.utils.data.DataLoader(val_set, batch_size=batch_size, num_workers=5)
#test_dataloader = torch.utils.data.DataLoader(test_set, batch_size=batch_size, num_workers=5)
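
As a quick sanity check on the loaders, the number of batches per epoch follows directly from the split sizes and `batch_size`. This sketch assumes the default `drop_last=False` behaviour of `DataLoader`; the `num_batches` helper is hypothetical.

```python
import math

def num_batches(num_examples, batch_size, drop_last=False):
    """Number of batches a loader yields for a dataset of the given size."""
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)

batch_size = 32
print(num_batches(974, batch_size))  # train split size printed earlier
print(num_batches(244, batch_size))  # validation split size printed earlier
print(244 % batch_size)              # size of the final, partial validation batch
```

With 974 and 244 examples this gives 31 and 8 batches, and the last validation batch holds only 20 examples.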

## Creation of the KoBERT learning model

In [79]:
# This class is from the GitHub repository and the documentation
class BERTClassifier(nn.Module):
    def __init__(self,
                 bert,
                 hidden_size = 768,
                 num_classes=2,   # since we are in binary classification we set the value 2
                 dr_rate=None,
                 params=None):
        super(BERTClassifier, self).__init__()
        self.bert = bert
        self.dr_rate = dr_rate
                 
        self.classifier = nn.Linear(hidden_size , num_classes)
        if dr_rate:
            self.dropout = nn.Dropout(p=dr_rate)
    
    def gen_attention_mask(self, token_ids, valid_length):
        attention_mask = torch.zeros_like(token_ids)
        for i, v in enumerate(valid_length):
            attention_mask[i][:v] = 1
        return attention_mask.float()

    def forward(self, token_ids, valid_length, segment_ids):
        attention_mask = self.gen_attention_mask(token_ids, valid_length)
        
        _, pooler = self.bert(input_ids = token_ids, token_type_ids = segment_ids.long(), attention_mask = attention_mask.float().to(token_ids.device))
        # fall back to the raw pooled output when no dropout rate is given
        out = self.dropout(pooler) if self.dr_rate else pooler
        return self.classifier(out)
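
The `gen_attention_mask` method marks the real tokens of each row with 1 and the padding with 0, based on the valid length. A minimal list-based sketch of the same idea, without PyTorch, looks like this; the toy ids and the pad id 1 are assumptions for illustration.

```python
def gen_attention_mask(token_ids, valid_length):
    """Return one 0/1 mask per row: 1.0 for the first valid_length[i] positions."""
    mask = []
    for row, v in zip(token_ids, valid_length):
        mask.append([1.0 if pos < v else 0.0 for pos in range(len(row))])
    return mask

# Two toy rows padded to length 6 (pad id 1 is an assumption).
token_ids = [[2, 10, 11, 3, 1, 1],
             [2, 10, 3, 1, 1, 1]]
valid_length = [4, 3]

print(gen_attention_mask(token_ids, valid_length))
```

The mask tells the self-attention layers to ignore padded positions, so padding does not influence the pooled representation.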

In [80]:
import torch
# torch.cuda.empty_cache()

In [81]:
# creation of the model
model = BERTClassifier(bertmodel,  dr_rate=0.4).to(device)

In [82]:
%%time
print(model)

BERTClassifier(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(8002, 768, padding_idx=1)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True

In [83]:
# Group the parameters so that biases and LayerNorm weights are excluded from weight decay
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
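
The substring test on parameter names decides which group a parameter lands in. The sketch below applies the same rule to a handful of hypothetical names written in the style of `model.named_parameters()`; it is only meant to show which names match.

```python
no_decay = ["bias", "LayerNorm.weight"]

# Hypothetical parameter names, in the style returned by model.named_parameters().
names = [
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "bert.embeddings.LayerNorm.weight",
    "classifier.weight",
    "classifier.bias",
]

# Same substring rule as in the optimizer grouping above.
decay = [n for n in names if not any(nd in n for nd in no_decay)]
skip = [n for n in names if any(nd in n for nd in no_decay)]

print("weight_decay=0.01:", decay)
print("weight_decay=0.0: ", skip)
```

Only the dense weight matrices receive weight decay; biases and LayerNorm weights are left unregularized.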

In [84]:
# configuration of the optimizer, loss function, and cosine learning-rate schedule
optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate)
loss_fn = nn.CrossEntropyLoss()

t_total = len(train_dataloader) * num_epochs
warmup_step = int(t_total * warmup_ratio)

scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=warmup_step, num_training_steps=t_total)
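
To see the shape of this schedule, the learning-rate multiplier can be sketched in plain Python: a linear ramp during warmup, followed by a half-cosine decay to zero. This reflects my understanding of the schedule's shape rather than the library's source code, and `cosine_with_warmup` is a hypothetical name. The 31 steps per epoch are taken from the train loader size with the hyperparameters above.

```python
import math

def cosine_with_warmup(step, warmup_steps, total_steps, num_cycles=0.5):
    """Learning-rate multiplier: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * 2.0 * num_cycles * progress)))

steps_per_epoch, num_epochs, warmup_ratio, base_lr = 31, 10, 0.1, 5e-5
t_total = steps_per_epoch * num_epochs
warmup_step = int(t_total * warmup_ratio)

# Learning rate at the start, end of warmup, midpoint, and final step.
for step in (0, warmup_step, t_total // 2, t_total):
    print(step, base_lr * cosine_with_warmup(step, warmup_step, t_total))
```

With these settings, warmup covers roughly the first epoch, after which the learning rate decays smoothly over the remaining nine.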



In [85]:
# define the function to compute the accuracy of the model
def calc_accuracy(X,Y):
    max_vals, max_indices = torch.max(X, 1)
    train_acc = (max_indices == Y).sum().data.cpu().numpy()/max_indices.size()[0]
    return train_acc
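
Because `calc_accuracy` returns a per-batch accuracy that is later averaged over batches, the reported figure can differ slightly from the fraction of correctly classified examples whenever the last batch is smaller. The following sketch illustrates the difference on a hypothetical outcome; both helper functions and the per-batch counts are made up.

```python
def batch_averaged_accuracy(correct_per_batch, batch_sizes):
    """Mean of per-batch accuracies, i.e. the quantity accumulated per batch."""
    accs = [c / n for c, n in zip(correct_per_batch, batch_sizes)]
    return sum(accs) / len(accs)

def sample_accuracy(correct_per_batch, batch_sizes):
    """Fraction of all examples classified correctly."""
    return sum(correct_per_batch) / sum(batch_sizes)

# Hypothetical outcome: 244 examples in batches of 32 (the last batch has 20),
# with a single mistake in the first batch.
batch_sizes = [32] * 7 + [20]
correct = [31] + [32] * 6 + [20]

print(batch_averaged_accuracy(correct, batch_sizes))
print(sample_accuracy(correct, batch_sizes))
```

The gap is small here, but weighting by batch size (or using `accuracy_score` on the collected predictions) gives the exact per-example accuracy.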

In [86]:
# model.summary()

In [87]:
def get_metrics(pred, label, threshold=0.5):
    pred = (pred > threshold).astype('float32')
    tp = ((pred == 1) & (label == 1)).sum()
    fp = ((pred == 1) & (label == 0)).sum()
    fn = ((pred == 0) & (label == 1)).sum()
    
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * recall * precision / (precision + recall)
    
    return {
        'recall': recall,
        'precision': precision,
        'f1': f1
    }
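
One caveat with `get_metrics` is that it divides by zero when a batch of predictions contains no positive predictions or no positive labels. A defensive variant could look like the sketch below. Returning 0.0 for an empty denominator is a design choice of this sketch, not something the original code does, and `safe_metrics` is a hypothetical name.

```python
def safe_metrics(probs, labels, threshold=0.5):
    """Recall, precision and F1 for the positive class, 0.0 on empty denominators."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

print(safe_metrics([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
print(safe_metrics([0.1, 0.2], [0, 0]))  # no positives predicted or present
```

The scikit-learn functions imported earlier (`precision_score`, `recall_score`, `f1_score`) offer similar behaviour through their `zero_division` parameter.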

## Training the KoBERT model


In [88]:
%%time
from time import time
from timeit import default_timer as timer

# Training code from the github library
start_time = time()

for e in range(num_epochs):
    train_acc = 0.0
    test_acc = 0.0

    # Training of the model with the train set
    model.train()
    for batch_id, (token_ids, valid_length, segment_ids, label) in tqdm(enumerate(train_dataloader), total=len(train_dataloader)):
        optimizer.zero_grad()
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        valid_length= valid_length
        label = label.long().to(device)
        out = model(token_ids, valid_length, segment_ids)
        loss = loss_fn(out, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        scheduler.step()  # Update learning rate schedule
        train_acc += calc_accuracy(out, label)
        if batch_id % log_interval == 0:
            print("epoch {} batch id {} loss {} train acc {}".format(e+1, batch_id+1, loss.data.cpu().numpy(), train_acc / (batch_id+1)))
    print("epoch {} train acc {}".format(e+1, train_acc / (batch_id+1)))
    
    preds = []
    labels = []
    # evaluation of the trained model on the held-out validation split
    # (reported as "test acc" in the log)
    model.eval()
    with torch.no_grad():  # gradients are not needed for evaluation
        for batch_id, (token_ids, valid_length, segment_ids, label) in tqdm(enumerate(val_dataloader), total=len(val_dataloader)):
            token_ids = token_ids.long().to(device)
            segment_ids = segment_ids.long().to(device)
            label = label.long().to(device)
            out = model(token_ids, valid_length, segment_ids)
            test_acc += calc_accuracy(out, label)

            # probability of the positive class ("1", voice phishing)
            pred = F.softmax(out, dim=1)
            pred = pred[:, 1].cpu().numpy().tolist()
            preds += pred
            labels += label.cpu().numpy().tolist()
        
    preds = np.array(preds)
    labels = np.array(labels)
    metrics = get_metrics(preds, labels)
    print("epoch {} test acc {}".format(e+1, test_acc / (batch_id+1)))
    # print('ACCURACY 2 = ', accuracy_score(out, label))
    print('Metrics: ', metrics)

run_time = time() - start_time

 10%|████████████████▌                                                                                                                                                          | 3/31 [00:00<00:02, 10.90it/s]

epoch 1 batch id 1 loss 0.6538559794425964 train acc 0.625


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 31/31 [00:02<00:00, 12.43it/s]

epoch 1 train acc 0.7610887096774194



100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 31.03it/s]

epoch 1 test acc 0.98828125
Metrics:  {'recall': 0.9914529914529915, 'precision': 0.9830508474576272, 'f1': 0.9872340425531915}



  6%|███████████                                                                                                                                                                | 2/31 [00:00<00:02, 12.20it/s]

epoch 2 batch id 1 loss 0.09335373342037201 train acc 1.0


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 31/31 [00:02<00:00, 12.83it/s]

epoch 2 train acc 0.9765264976958525



100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 29.45it/s]

epoch 2 test acc 0.984375
Metrics:  {'recall': 0.9658119658119658, 'precision': 1.0, 'f1': 0.9826086956521739}



  6%|███████████                                                                                                                                                                | 2/31 [00:00<00:02, 12.00it/s]

epoch 3 batch id 1 loss 0.19126488268375397 train acc 0.9375


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 31/31 [00:02<00:00, 12.60it/s]

epoch 3 train acc 0.9828629032258065



100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 29.43it/s]

epoch 3 test acc 0.9921875
Metrics:  {'recall': 0.9829059829059829, 'precision': 1.0, 'f1': 0.9913793103448275}



  6%|███████████                                                                                                                                                                | 2/31 [00:00<00:02, 11.86it/s]

epoch 4 batch id 1 loss 0.20007063448429108 train acc 0.9375


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 31/31 [00:02<00:00, 12.52it/s]

epoch 4 train acc 0.9959677419354839



100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 29.85it/s]

epoch 4 test acc 0.99375
Metrics:  {'recall': 1.0, 'precision': 0.9915254237288136, 'f1': 0.9957446808510638}



  6%|███████████                                                                                                                                                                | 2/31 [00:00<00:02, 12.40it/s]

epoch 5 batch id 1 loss 0.0014406294794753194 train acc 1.0


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 31/31 [00:02<00:00, 12.54it/s]

epoch 5 train acc 0.998991935483871



100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 31.64it/s]

epoch 5 test acc 0.99609375
Metrics:  {'recall': 0.9914529914529915, 'precision': 1.0, 'f1': 0.9957081545064378}



  6%|███████████                                                                                                                                                                | 2/31 [00:00<00:02, 11.56it/s]

epoch 6 batch id 1 loss 0.0010592221515253186 train acc 1.0


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 31/31 [00:02<00:00, 12.69it/s]

epoch 6 train acc 1.0



100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 30.14it/s]

epoch 6 test acc 0.99609375
Metrics:  {'recall': 0.9914529914529915, 'precision': 1.0, 'f1': 0.9957081545064378}



  6%|███████████                                                                                                                                                                | 2/31 [00:00<00:02, 11.76it/s]

epoch 7 batch id 1 loss 0.0009047709172591567 train acc 1.0


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 31/31 [00:02<00:00, 12.73it/s]

epoch 7 train acc 1.0



100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 30.23it/s]

epoch 7 test acc 0.99609375
Metrics:  {'recall': 0.9914529914529915, 'precision': 1.0, 'f1': 0.9957081545064378}



  6%|███████████                                                                                                                                                                | 2/31 [00:00<00:02, 12.43it/s]

epoch 8 batch id 1 loss 0.0006728395819664001 train acc 1.0


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 31/31 [00:02<00:00, 12.83it/s]

epoch 8 train acc 1.0



100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 32.16it/s]

epoch 8 test acc 0.99609375
Metrics:  {'recall': 0.9914529914529915, 'precision': 1.0, 'f1': 0.9957081545064378}



  6%|███████████                                                                                                                                                                | 2/31 [00:00<00:02, 12.40it/s]

epoch 9 batch id 1 loss 0.0006391439819708467 train acc 1.0


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 31/31 [00:02<00:00, 12.47it/s]

epoch 9 train acc 1.0



100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 30.08it/s]

epoch 9 test acc 0.99609375
Metrics:  {'recall': 0.9914529914529915, 'precision': 1.0, 'f1': 0.9957081545064378}



  6%|███████████                                                                                                                                                                | 2/31 [00:00<00:02, 11.04it/s]

epoch 10 batch id 1 loss 0.0005727168754674494 train acc 1.0


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 31/31 [00:02<00:00, 12.65it/s]

epoch 10 train acc 1.0



100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 30.10it/s]

epoch 10 test acc 0.99609375
Metrics:  {'recall': 0.9914529914529915, 'precision': 1.0, 'f1': 0.9957081545064378}
CPU times: user 23.7 s, sys: 6.32 s, total: 30 s
Wall time: 31.8 s
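
The training loop calls `torch.nn.utils.clip_grad_norm_` with `max_grad_norm = 1` before every optimizer step. The idea behind global-norm clipping can be sketched in plain Python as follows. This is an illustration of the technique on flat lists, not PyTorch's implementation, and `clip_by_global_norm` is a hypothetical helper.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm

# Two hypothetical parameter tensors, flattened to lists of gradient values.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)

print(norm_before, clipped)
```

Because every gradient is scaled by the same factor, the update direction is preserved while its length is capped, which helps keep fine-tuning stable at the start of training.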





In [89]:
run_time
#224.96386766433716
#0.9947

31.77228093147278

In [57]:
# NOTE: this cell requires `test_dataloader`, which is only defined when the
# commented-out test split in the cells above is enabled.
preds = []
labels = []
test_acc = 0.0
# evaluation of the trained model on the test set
model.eval()
with torch.no_grad():  # gradients are not needed for evaluation
    for batch_id, (token_ids, valid_length, segment_ids, label) in tqdm(enumerate(test_dataloader), total=len(test_dataloader)):
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        label = label.long().to(device)
        out = model(token_ids, valid_length, segment_ids)
        test_acc += calc_accuracy(out, label)

        # probability of the positive class ("1", voice phishing)
        pred = F.softmax(out, dim=1)
        pred = pred[:, 1].cpu().numpy().tolist()
        preds += pred
        labels += label.cpu().numpy().tolist()

preds = np.array(preds)
labels = np.array(labels)
metrics = get_metrics(preds, labels)
print("epoch {} test acc {}".format(e+1, test_acc / (batch_id+1)))
# print('ACCURACY 2 = ', accuracy_score(out, label))
print('Metrics: ', metrics)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 29.56it/s]

epoch 3 test acc 0.9921875
Metrics:  {'recall': 0.9826086956521739, 'precision': 1.0, 'f1': 0.9912280701754386}





<h2>Model Training result</h2>

From the training log above, we can see that our KoBERT binary classification model reached **99.61%** accuracy on the held-out validation split (printed as "test acc") and **100%** accuracy on the train set.