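# Cross-Validation Training

When training a model, we split the data into train data and validation data so that the model can be validated.
In this setup, the validation data has no influence on training.
As a result, the model learns from fewer examples and becomes more likely to overfit the train data.

To address this, the split into train data and validation data is repeated several times and the model is trained on each of the resulting datasets. This method is called cross validation.

With cross validation, all of the data can be used for both training and evaluation, but training takes longer.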
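
As a rough illustration of the idea above, the following sketch splits ten sample indices into five folds by hand, without any library. The helper name `k_fold_indices` and the toy sizes are assumptions made for this example; the notebook itself relies on scikit-learn for the actual split.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for a plain, unshuffled k-fold split."""
    # The first (n_samples % k) folds get one extra sample so every index is used.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for fold, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {fold}: train={train_idx} val={val_idx}")

# Every index lands in a validation set exactly once across the folds.
all_val = sorted(i for _, val_idx in folds for i in val_idx)
```

Each sample is used for validation once and for training in the remaining folds, and the model is trained once per fold, which is where the extra training time comes from.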


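There are many ways to do cross validation; this notebook uses Stratified k-fold cross validation.
When the label distribution is imbalanced, stratified k-fold cross validation takes the number of samples per label into account when splitting the train and validation data, so each fold keeps roughly the same label proportions.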
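
To make the stratification concrete, here is a small sketch on made-up labels. The 80/20 class ratio and the dummy feature array are assumptions for illustration only; `StratifiedKFold` and its `split()` method are the same scikit-learn API used later in this notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels (an assumption): 80 samples of class 0, 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # stratification only looks at the labels, not the features

skf = StratifiedKFold(n_splits=5)
val_counts = []
for train_idx, val_idx in skf.split(X, y):
    # Count how many samples of each class ended up in this validation fold.
    counts = np.bincount(y[val_idx], minlength=2)
    val_counts.append(counts.tolist())
    print(counts)
```

Every validation fold here keeps the 4:1 class ratio of the full label set, which a plain k-fold split over sorted labels would not guarantee.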

The model used is klue/bert-base.

In [1]:
import random
from tqdm.notebook import tqdm, tnrange
import os

import numpy as np
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import AdamW
from transformers import get_linear_schedule_with_warmup

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

import torch
from torch import nn
from torch.utils.data import Dataset,TensorDataset, DataLoader, RandomSampler


if torch.cuda.is_available():
    print("Number of available GPUs : ",torch.cuda.device_count())
    device = torch.device("cuda")
else:
    print("Using CPU")
    device = torch.device("cpu")

Number of available GPUs :  1


Fix the random seed for reproducibility.  
Source: https://dacon.io/codeshare/2363?dtype=vote&s_id=0

In [2]:
RANDOM_SEED = 42

def seed_everything(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)  # type: ignore
    torch.backends.cudnn.deterministic = True  # type: ignore
    # benchmark=True lets cuDNN auto-tune kernels, which can vary between runs
    torch.backends.cudnn.benchmark = False  # type: ignore
    
seed_everything(RANDOM_SEED)

In [3]:
model_checkpoint = "klue/bert-base"
batch_size = 32

In [4]:
dataset = pd.read_csv("data/train_data.csv")
test = pd.read_csv("data/test_data.csv")

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

In [6]:
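def bert_tokenize(dataset,sent_key,label_key,tokenizer):
    """Tokenize dataset[sent_key] and return [input_ids, token_type_ids, attention_mask, labels]."""
    if label_key is None :
        # Unlabeled data (e.g. the test set): use dummy labels of 0 so the output shape stays the same
        labels = [np.int64(0) for i in dataset[sent_key]]
    else :
        labels = [np.int64(i) for i in dataset[label_key]]
    
    # padding=True pads every sentence to the longest one in this call
    sentences = tokenizer(dataset[sent_key].tolist(),truncation=True,padding=True)

    input_ids = sentences.input_ids
    token_type_ids = sentences.token_type_ids
    attention_mask = sentences.attention_mask
    
    return [input_ids, token_type_ids, attention_mask, labels]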

Load scikit-learn's `StratifiedKFold` and create a variable to store the predictions.  
`n_splits=5` in `StratifiedKFold()` means that five different train/validation splits will be created.  

In [7]:
NUM_TEST_DATA = len(test)
skf = StratifiedKFold(n_splits=5)
final_test_pred = np.zeros([NUM_TEST_DATA,7])

Define the hyperparameters.

In [8]:
lr = 2e-5
adam_epsilon = 1e-8
epochs = 3
num_warmup_steps = 0
num_labels = 7
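
The values `lr`, `num_warmup_steps`, and `epochs` feed the linear schedule created in the training loop further down (`get_linear_schedule_with_warmup`). As a rough sketch of what that schedule does, the function below reimplements linear warmup followed by linear decay in plain Python. The name `linear_schedule_lr` is made up for this example, and `steps_per_epoch = 1142` is read off the progress output of the training cell; it is not a value defined in the code.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


steps_per_epoch = 1142  # taken from the progress output of the training cell
epochs = 3
total_steps = steps_per_epoch * epochs

# Learning rate at the end of each epoch, with lr=2e-5 and no warmup.
lrs = [linear_schedule_lr(e * steps_per_epoch, 2e-5, 0, total_steps) for e in range(1, epochs + 1)]
print(lrs)
```

Under these assumptions the rate falls to two thirds of `lr` after the first epoch, one third after the second, and 0 after the third, which is consistent with the `Current Learning rate` lines in the training output.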

Define `train()`, `evaluate()`, and `predict()`.  

In [9]:
def train(model,train_dataloader):
    train_loss_set = []
    learning_rate = []
    batch_loss = 0

    for step, batch in enumerate(tqdm(train_dataloader)):
        model.train()

        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_token_type_ids, b_input_mask, b_labels = batch

        outputs = model(b_input_ids, token_type_ids=b_token_type_ids, attention_mask=b_input_mask, labels=b_labels)
        loss = outputs[0]

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()

        scheduler.step()
        optimizer.zero_grad()

        batch_loss += loss.item()

    avg_train_loss = batch_loss / len(train_dataloader)


    for param_group in optimizer.param_groups:
        print("\n\tCurrent Learning rate: ",param_group['lr'])
        learning_rate.append(param_group['lr'])

    train_loss_set.append(avg_train_loss)
    print(F'\n\tAverage Training loss: {avg_train_loss}')
    
def evaluate(model, validation_dataloader):
    # validation
    model.eval()
    eval_accuracy,nb_eval_steps = 0, 0

    for batch in tqdm(validation_dataloader):
    
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_token_type_ids, b_input_mask, b_labels = batch
        
        with torch.no_grad():
            logits = model(b_input_ids, token_type_ids=b_token_type_ids, attention_mask=b_input_mask)
            
        logits = logits[0].to('cpu').numpy()
        label_ids = b_labels.to('cpu').numpy()

        pred_flat = np.argmax(logits, axis=1).flatten()
        labels_flat = label_ids.flatten()

        tmp_eval_accuracy = accuracy_score(labels_flat,pred_flat)

        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1

    print(F'\n\tValidation Accuracy: {eval_accuracy/nb_eval_steps}')
    
def predict(model, test_dataloader):
    pred = []
    model.eval()

    for batch in tqdm(test_dataloader):

        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_token_type_ids, b_input_mask, b_labels = batch

        with torch.no_grad():
            logits = model(b_input_ids, token_type_ids=b_token_type_ids, attention_mask=b_input_mask)
        logits = logits[0].to('cpu').numpy()

        for p in logits:
            pred.append(p)

    return pred
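
One detail of `evaluate()` worth noting: it averages the per-batch accuracies, so a smaller final batch counts as much as a full one. The sketch below contrasts that with accuracy computed over all samples. Both helper names and the toy predictions are assumptions made for this example, not part of the notebook.

```python
def batch_mean_accuracy(batches):
    """Average of per-batch accuracies, the way `evaluate()` aggregates them."""
    per_batch = [
        sum(p == t for p, t in zip(preds, labels)) / len(labels)
        for preds, labels in batches
    ]
    return sum(per_batch) / len(per_batch)


def sample_accuracy(batches):
    """Accuracy over all samples, independent of how they are batched."""
    correct = sum(p == t for preds, labels in batches for p, t in zip(preds, labels))
    total = sum(len(labels) for _, labels in batches)
    return correct / total


# Hypothetical (predictions, labels) pairs: one full batch of 4 and a final batch of 1.
batches = [
    ([0, 1, 2, 2], [0, 1, 2, 0]),  # 3 of 4 correct
    ([1], [1]),                    # 1 of 1 correct
]
print(batch_mean_accuracy(batches))  # (0.75 + 1.0) / 2
print(sample_accuracy(batches))      # 4 / 5
```

With 286 validation batches of size 32 the gap is likely to be small, so the reported validation accuracy is best read as a close approximation of the per-sample accuracy.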

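The `split()` method of `StratifiedKFold()` returns the indices that divide the given data into train data and validation data. These indices can be applied to the DataFrame to separate the train data from the validation data.  
Training and evaluation are run on the split data, and then the test data is predicted.  
Each fold's predictions are added to the final prediction array (`final_test_pred`).  
The total training time is roughly the time of a single training run multiplied by the number passed as `n_splits` (5 here).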
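
The indices returned by `split()` are positional. The loop in the next cell selects rows with `dataset["title"][train_idx]`, which works because the DataFrame has the default 0..n-1 index. A sketch of the same selection with `.iloc`, which does not depend on the index labels, could look like this. The toy DataFrame is invented for illustration; only the column names are copied from the notebook.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy stand-in for the training data; the rows are made up.
toy = pd.DataFrame({
    "title": [f"headline {i}" for i in range(10)],
    "topic_idx": [0, 1] * 5,
})

skf = StratifiedKFold(n_splits=5)
fold_sizes = []
for train_idx, validation_idx in skf.split(toy["title"], toy["topic_idx"]):
    # .iloc selects by position, so it works regardless of the index labels.
    fold_train = toy.iloc[train_idx]
    fold_val = toy.iloc[validation_idx]
    fold_sizes.append((len(fold_train), len(fold_val)))

print(fold_sizes)
```

With `n_splits=5`, each fold trains on four fifths of the rows and validates on the remaining fifth.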


In [10]:
for train_idx, validation_idx in skf.split(dataset["title"],dataset["topic_idx"]):
    
    dataset_train = pd.DataFrame()
    dataset_val = pd.DataFrame()
    
    dataset_train["title"] = dataset["title"][train_idx]
    dataset_train["topic_idx"] = dataset["topic_idx"][train_idx]
    
    dataset_val["title"] = dataset["title"][validation_idx]
    dataset_val["topic_idx"] = dataset["topic_idx"][validation_idx]
    
    train_inputs = bert_tokenize(dataset_train,"title","topic_idx",tokenizer)
    validation_inputs = bert_tokenize(dataset_val,"title","topic_idx",tokenizer)
    test_inputs = bert_tokenize(test,"title",None,tokenizer)
    
    for i in range(len(train_inputs)):
        train_inputs[i] = torch.tensor(train_inputs[i])

    for i in range(len(validation_inputs)):
        validation_inputs[i] = torch.tensor(validation_inputs[i])

    for i in range(len(test_inputs)):
        test_inputs[i] = torch.tensor(test_inputs[i])
    
    train_data = TensorDataset(*train_inputs)
    train_sampler = RandomSampler(train_data)
    train_dataloader = DataLoader(train_data,sampler=train_sampler,batch_size=batch_size)

    validation_data = TensorDataset(*validation_inputs)
    validation_sampler = RandomSampler(validation_data)
    validation_dataloader = DataLoader(validation_data,sampler=validation_sampler,batch_size=batch_size)

    test_data = TensorDataset(*test_inputs)
    test_dataloader = DataLoader(test_data,batch_size=batch_size)
    
    model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint,num_labels=num_labels)
    model.zero_grad()
    
    model.to(device)

    optimizer = AdamW(model.parameters(), lr=lr,eps=adam_epsilon,correct_bias=False)
    scheduler = get_linear_schedule_with_warmup(optimizer,
                                                num_warmup_steps=num_warmup_steps,
                                                num_training_steps=len(train_dataloader)*epochs) 
    
    for epoch in tnrange(1,epochs+1,desc='Epoch'):
        print("<" + "="*22 + F" Epoch {epoch} "+ "="*22 + ">")
        # train
        train(model, train_dataloader)
        
        # validation
        evaluate(model, validation_dataloader)
        
    # predict
    pred = predict(model, test_dataloader)
    final_test_pred += pred
        
    

Some weights of the model checkpoint at klue/bert-base were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]



  0%|          | 0/1142 [00:00<?, ?it/s]


	Current Learning rate:  1.3333333333333333e-05

	Average Training loss: 0.3570210219845306


  0%|          | 0/286 [00:00<?, ?it/s]


	Validation Accuracy: 0.8528289891926255


  0%|          | 0/1142 [00:00<?, ?it/s]


	Current Learning rate:  6.666666666666667e-06

	Average Training loss: 0.20184547494079155


  0%|          | 0/286 [00:00<?, ?it/s]


	Validation Accuracy: 0.8533852511125238


  0%|          | 0/1142 [00:00<?, ?it/s]


	Current Learning rate:  0.0

	Average Training loss: 0.1371668318915303


  0%|          | 0/286 [00:00<?, ?it/s]


	Validation Accuracy: 0.8527097902097902


  0%|          | 0/286 [00:00<?, ?it/s]

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]



  0%|          | 0/1142 [00:00<?, ?it/s]


	Current Learning rate:  1.3333333333333333e-05

	Average Training loss: 0.3597159437988534


  0%|          | 0/286 [00:00<?, ?it/s]


	Validation Accuracy: 0.8473756357279084


  0%|          | 0/1142 [00:00<?, ?it/s]


	Current Learning rate:  6.666666666666667e-06

	Average Training loss: 0.2004227205620012


  0%|          | 0/286 [00:00<?, ?it/s]


	Validation Accuracy: 0.8455181182453909


  0%|          | 0/1142 [00:00<?, ?it/s]


	Current Learning rate:  0.0

	Average Training loss: 0.13471043498773982


  0%|          | 0/286 [00:00<?, ?it/s]


	Validation Accuracy: 0.8490146217418945


  0%|          | 0/286 [00:00<?, ?it/s]

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]



  0%|          | 0/1142 [00:00<?, ?it/s]


	Current Learning rate:  1.3333333333333333e-05

	Average Training loss: 0.38425392524099494


  0%|          | 0/286 [00:00<?, ?it/s]


	Validation Accuracy: 0.8886880165289256


  0%|          | 0/1142 [00:00<?, ?it/s]


	Current Learning rate:  6.666666666666667e-06

	Average Training loss: 0.22122074047106785


  0%|          | 0/286 [00:00<?, ?it/s]


	Validation Accuracy: 0.8908733312142403


  0%|          | 0/1142 [00:00<?, ?it/s]


	Current Learning rate:  0.0

	Average Training loss: 0.15173925946535127


  0%|          | 0/286 [00:00<?, ?it/s]


	Validation Accuracy: 0.8887774157660521


  0%|          | 0/286 [00:00<?, ?it/s]

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]



  0%|          | 0/1142 [00:00<?, ?it/s]


	Current Learning rate:  1.3333333333333333e-05

	Average Training loss: 0.41829780071660644


  0%|          | 0/286 [00:00<?, ?it/s]


	Validation Accuracy: 0.9458041958041958


  0%|          | 0/1142 [00:00<?, ?it/s]


	Current Learning rate:  6.666666666666667e-06

	Average Training loss: 0.25026649765787734


  0%|          | 0/286 [00:00<?, ?it/s]


	Validation Accuracy: 0.9434003496503497


  0%|          | 0/1142 [00:00<?, ?it/s]


	Current Learning rate:  0.0

	Average Training loss: 0.17853587064092846


  0%|          | 0/286 [00:00<?, ?it/s]


	Validation Accuracy: 0.9446022727272727


  0%|          | 0/286 [00:00<?, ?it/s]

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]



  0%|          | 0/1142 [00:00<?, ?it/s]


	Current Learning rate:  1.3333333333333333e-05

	Average Training loss: 0.3941842956262952


  0%|          | 0/286 [00:00<?, ?it/s]


	Validation Accuracy: 0.9101617132867132


  0%|          | 0/1142 [00:00<?, ?it/s]


	Current Learning rate:  6.666666666666667e-06

	Average Training loss: 0.231594713246564


  0%|          | 0/286 [00:00<?, ?it/s]


	Validation Accuracy: 0.8965909090909091


  0%|          | 0/1142 [00:00<?, ?it/s]


	Current Learning rate:  0.0

	Average Training loss: 0.16084701035354423


  0%|          | 0/286 [00:00<?, ?it/s]


	Validation Accuracy: 0.8984702797202796


  0%|          | 0/286 [00:00<?, ?it/s]

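Over the five cross-validation folds, the predictions of the models trained on different train and validation data have been summed into `final_test_pred`.  
Taking the `argmax` of these summed values produces the final prediction.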
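
The accumulation and `argmax` step can be sketched on a tiny example. The logits below are hypothetical values for three folds, two test samples, and three classes; they only mirror the shape of what `predict()` returns.

```python
import numpy as np

# Hypothetical logits from 3 folds for 2 test samples and 3 classes.
fold_logits = [
    np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]),
    np.array([[1.5, 1.0, -0.5], [0.0, 0.9, 0.1]]),
    np.array([[1.0, 2.0, -1.5], [-0.2, 0.4, 0.5]]),
]

summed = np.zeros_like(fold_logits[0])
for logits in fold_logits:
    summed += logits  # same accumulation pattern as `final_test_pred += pred`

pred_from_sum = np.argmax(summed, axis=1)
pred_from_mean = np.argmax(summed / len(fold_logits), axis=1)
print(pred_from_sum)
```

Dividing by the number of folds does not change which class has the largest value, so summing the logits is enough to pick the final label.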

In [11]:
final_test_pred[:10]

array([[ 1.64885781e+01,  8.76850499e-01,  7.11387616e+00,
         1.47765276e+01, -1.25753672e+01, -1.30626745e+01,
        -1.64131484e+01],
       [-7.75764048e+00, -1.09310429e+01,  1.44481805e+00,
         3.52703748e+01, -6.21469492e+00, -8.14098918e+00,
        -6.04971761e+00],
       [ 6.70001668e+00, -1.93561076e+00,  2.65456839e+01,
        -2.21080732e+00, -1.30748584e+01, -1.39026191e+01,
        -4.75335102e+00],
       [ 1.87000976e+01, -6.19321951e+00,  1.93904028e+01,
        -6.24161780e+00, -1.12162007e+01, -1.08564138e+01,
        -4.77193654e+00],
       [-6.87773919e+00, -8.77894616e+00, -3.71424071e-01,
         3.61361575e+01, -6.00767535e+00, -7.74340332e+00,
        -7.06648469e+00],
       [ 1.80406456e+01,  1.09737453e+01,  7.30523503e+00,
         6.01799917e+00, -1.61820772e+01, -1.36202390e+01,
        -1.70074041e+01],
       [-7.52531660e+00, -7.46243680e+00, -4.82346600e+00,
        -8.39387345e+00,  1.94720095e+00,  3.75804706e+01,
        -5.6562108

In [12]:
len(final_test_pred)

9131

In [13]:
total_pred = np.argmax(final_test_pred,axis = 1)
total_pred[:10]

array([0, 3, 2, 2, 3, 0, 5, 3, 4, 4], dtype=int64)

In [14]:
submission = pd.read_csv('data/sample_submission.csv')
submission['topic_idx'] = total_pred
submission.to_csv("results/klue-bert-base-kfold5.csv",index=False)