https://dacon.io/competitions/official/236214/overview/description

This is the MLP model I submitted to the DACON '고객 대출등급 분류 해커톤' (Customer Loan Grade Classification Hackathon).

In [22]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

Import the libraries used in this notebook.

In [23]:
train = pd.read_csv('open/train.csv')
test = pd.read_csv('open/test.csv')

In [24]:
col_id = ['ID']

col_num = ['대출금액', '연간소득', '부채_대비_소득_비율', '총계좌수', '최근_2년간_연체_횟수', '총상환원금', '총상환이자', '총연체금액', '연체계좌수']
col_cat = ['대출기간', '근로기간', '주택소유상태', '대출목적']

col_x = col_num + col_cat
col_y = ['대출등급']

The columns are grouped into ID, numeric, categorical, feature, and target lists.

In [25]:
x_train = train[col_x]
y_train = train[col_y]

x_test = test[col_x]

In [26]:
# Fit the encoder on train and test together so every category is known
x = pd.concat([x_train, x_test])

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(x[col_cat])

x_train_res = ohe.transform(x_train[col_cat])
x_test_res = ohe.transform(x_test[col_cat])

# Convert the sparse results to dense DataFrames with readable column names
x_train_ohe = pd.DataFrame(x_train_res.toarray(), columns=ohe.get_feature_names_out())
x_test_ohe = pd.DataFrame(x_test_res.toarray(), columns=ohe.get_feature_names_out())

# Join the numeric columns with the one-hot columns
x_train_fin = pd.concat([x_train[col_num], x_train_ohe], axis=1)
x_test_fin = pd.concat([x_test[col_num], x_test_ohe], axis=1)

The categorical features are one-hot encoded.

In [27]:
# One scaler per numeric column, fitted on the training data only
scalers = [MinMaxScaler() for _ in col_num]

for scaler, col in zip(scalers, col_num):
    x_train_fin[col] = scaler.fit_transform(x_train_fin[[col]]).ravel()
    x_test_fin[col] = scaler.transform(x_test_fin[[col]]).ravel()

The numeric features are min-max scaled, with the scalers fitted on the training data only.
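The arithmetic behind min-max scaling can be sketched in plain NumPy. The helper names `min_max_fit` and `min_max_apply` and the sample values are hypothetical; the sketch only mirrors the formula `(x - min) / (max - min)` and does not handle constant columns the way `MinMaxScaler` does.

```python
import numpy as np

def min_max_fit(train_col):
    """Return the (min, max) learned from the training column."""
    return float(np.min(train_col)), float(np.max(train_col))

def min_max_apply(col, lo, hi):
    """Scale a column with previously learned bounds: (x - lo) / (hi - lo)."""
    return (np.asarray(col, dtype=float) - lo) / (hi - lo)

train_col = np.array([10.0, 20.0, 30.0])  # hypothetical values
test_col = np.array([15.0, 40.0])         # 40 exceeds the training maximum

lo, hi = min_max_fit(train_col)
train_scaled = min_max_apply(train_col, lo, hi)
test_scaled = min_max_apply(test_col, lo, hi)

print(train_scaled)
print(test_scaled)
```

Because the bounds come from the training data, test values outside the training range map outside [0, 1], as the second test value does here.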

In [28]:
label_encoder = LabelEncoder()

# Work on an explicit copy so the assignment does not raise SettingWithCopyWarning
y_train = y_train.copy()
y_train['대출등급'] = label_encoder.fit_transform(y_train['대출등급'])


The target is label-encoded so that it can be stored in a tensor.
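A short round trip shows how `LabelEncoder` maps labels to integers and back, which is what later makes `inverse_transform` on the predictions possible. The grade letters and the name `le_demo` are placeholders chosen for this sketch, not values taken from the competition data.

```python
from sklearn.preprocessing import LabelEncoder

grades = ["C", "A", "B", "A", "G"]  # hypothetical grade labels

le_demo = LabelEncoder()
codes = le_demo.fit_transform(grades)

# classes_ holds the sorted unique labels; a label's code is its index there.
classes = list(le_demo.classes_)
restored = list(le_demo.inverse_transform(codes))

print(classes)
print(codes.tolist())
print(restored)
```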

In [29]:
x_train_tensor = torch.tensor(x_train_fin.values, dtype=torch.float32)
x_test_tensor = torch.tensor(x_test_fin.values, dtype=torch.float32)

y_train_tensor = torch.tensor(y_train.values)
y_train_tensor = y_train_tensor.squeeze()

In [30]:
dataset = TensorDataset(x_train_tensor, y_train_tensor)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

Training uses mini-batches with a batch size of 32.
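The number of batches per epoch follows directly from the dataset size and the batch size. The helper `num_batches` below is a hypothetical stand-in for what `len(dataloader)` reports, assuming the `DataLoader` default of `drop_last=False`.

```python
import math

def num_batches(n_samples, batch_size, drop_last=False):
    """Number of mini-batches per epoch for a given dataset size."""
    if drop_last:
        return n_samples // batch_size        # incomplete final batch is dropped
    return math.ceil(n_samples / batch_size)  # final batch may be smaller

n_demo = 1000  # hypothetical dataset size
print(num_batches(n_demo, 32))
print(num_batches(n_demo, 32, drop_last=True))
```

The progress bar in the training output shows 3010 batches per epoch, which under this formula corresponds to a training set of between 96,289 and 96,320 rows.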

In [31]:
class MLP(nn.Module):
  def __init__(self):
    super(MLP, self).__init__()
    self.fc1 = nn.Linear(44, 128)
    self.fc2 = nn.Linear(128, 256)
    self.fc3 = nn.Linear(256, 112)
    self.fc4 = nn.Linear(112, 56)
    self.fc5 = nn.Linear(56, 7)
    self.relu = nn.ReLU()
    self.dropout = nn.Dropout(0.25)

  def forward(self, x):
    x = self.fc1(x)
    x = self.relu(x)
    x = self.dropout(x)
    x = self.fc2(x)
    x = self.relu(x)
    x = self.dropout(x)
    x = self.fc3(x)
    x = self.relu(x)
    x = self.dropout(x)
    x = self.fc4(x)
    x = self.relu(x)
    x = self.dropout(x)
    x = self.fc5(x)
    return x

The MLP is built from Linear, ReLU, and Dropout layers.
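To get a feel for the size of this network, the trainable parameters can be counted from the layer widths alone. The sketch below assumes only the widths shown in the class definition; `count_linear_params` is a hypothetical helper. Note that the input width of 44 has to equal the number of columns in `x_train_fin` (the 9 numeric columns plus the one-hot columns).

```python
# Layer widths taken from the MLP definition: input, four hidden layers, output.
layer_sizes = [44, 128, 256, 112, 56, 7]

def count_linear_params(sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each Linear layer."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

total_params = count_linear_params(layer_sizes)
print(total_params)  # 74295
```

ReLU and Dropout add no trainable parameters, so this total covers the whole model.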

In [32]:
model = MLP()

criterion = nn.CrossEntropyLoss()

optimizer = optim.Adam(model.parameters(), lr=0.0001)

This cell defines the model, the loss function, and the optimizer.
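`nn.CrossEntropyLoss` takes raw logits and integer class indices, which is why the model's `forward` ends with a plain Linear layer and no softmax. The NumPy sketch below reproduces the computation (log-softmax followed by the mean negative log-likelihood) under that assumption; `cross_entropy_from_logits` is an illustrative helper, not part of the notebook.

```python
import math
import numpy as np

def cross_entropy_from_logits(logits, targets):
    """Mean cross-entropy computed from raw logits and integer class indices."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

# With all-zero logits over 7 classes, every class has probability 1/7.
uniform_logits = np.zeros((2, 7))
loss_uniform = cross_entropy_from_logits(uniform_logits, [0, 3])
print(loss_uniform, math.log(7))
```

A loss near `log(7)` (about 1.95) therefore means the model is doing no better than guessing uniformly among the seven grades.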

In [33]:
def calculate_accuracy(outputs, targets):
    _, predicted = torch.max(outputs, 1)
    correct = (predicted == targets).sum().item()
    accuracy = correct / targets.size(0)
    return accuracy

A helper function for computing accuracy is defined as well.

In [38]:
from tqdm import tqdm

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.train()  # make sure dropout is active during training

num_epochs = 100
for epoch in range(1, num_epochs+1):
    total_loss = 0.0
    total_accuracy = 0.0

    # Use tqdm to show progress through the dataloader
    loop = tqdm(enumerate(dataloader), total=len(dataloader), leave=True)
    for batch_idx, (batch_x, batch_y) in loop:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device).long()

        outputs = model(batch_x)

        loss = criterion(outputs, batch_y)
        total_loss += loss.item()

        accuracy = calculate_accuracy(outputs, batch_y)
        total_accuracy += accuracy

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Update the tqdm progress bar
        loop.set_description(f"Epoch [{epoch}/{num_epochs}]")
        loop.set_postfix(loss=loss.item(), accuracy=accuracy)

    avg_loss = total_loss / len(dataloader)
    avg_accuracy = total_accuracy / len(dataloader)

    if epoch % 10 == 0:
        print(f'Epoch [{epoch}/{num_epochs}], Loss: {avg_loss:.4f}, Accuracy: {avg_accuracy:.4f}')


Epoch [1/100]:   0%|          | 0/3010 [00:00<?, ?it/s, accuracy=0.469, loss=1.2] 

Epoch [1/100]:  12%|█▏        | 348/3010 [00:04<00:37, 70.13it/s, accuracy=0.531, loss=1.33]


KeyboardInterrupt: 

Training ran on the GPU (the code falls back to the CPU when CUDA is not available).

Training ran for 100 epochs.

Training accuracy comes out at about 78%.
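One detail worth keeping in mind when reading this figure: the training loop averages per-batch accuracies, so a smaller final batch carries the same weight as a full one. The sketch below, with made-up batch counts and hypothetical helper names, shows how that can differ from accuracy computed over all samples. The figure is also measured on training batches with dropout active, so it is not a held-out estimate.

```python
def mean_of_batch_accuracies(batches):
    """Average of per-batch accuracies, as the training loop computes it."""
    return sum(correct / size for correct, size in batches) / len(batches)

def sample_weighted_accuracy(batches):
    """Accuracy over all samples, regardless of how they were batched."""
    total_correct = sum(correct for correct, _ in batches)
    total_size = sum(size for _, size in batches)
    return total_correct / total_size

# Hypothetical (correct, batch_size) pairs; the last batch is incomplete.
batches = [(24, 32), (24, 32), (1, 2)]

batch_mean = mean_of_batch_accuracies(batches)
weighted = sample_weighted_accuracy(batches)
print(batch_mean, weighted)
```

With thousands of full batches and a single short one, as in this notebook, the gap between the two numbers is very small.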

In [None]:
model.eval()

with torch.no_grad():
  x_test_tensor = x_test_tensor.to(device)

  predictions = model(x_test_tensor)

_, predicted_labels = torch.max(predictions, 1)

predicted_labels = predicted_labels.cpu().numpy()
predicted_labels = label_encoder.inverse_transform(predicted_labels)

The trained model is used to predict loan grades for the test set.

In [None]:
test_id = test[col_id].values.flatten()

result_df = pd.DataFrame({'ID': test_id, '대출등급': predicted_labels})
result_df.to_csv('/content/drive/MyDrive/Colab Notebooks/dacon_bank/pred.csv', index=False)

This code saves the predictions to a CSV file in the submission format.