## Structuring Code for a Machine Learning Project

- This notebook builds the template code I use for ML projects.

1. **Load the required libraries and data.**


2. **Perform EDA.** The EDA should be driven by the problem you need to solve.


3. **Preprocess the data.** The key decision here is how to do **feature engineering**.


4. **Split the data.** Check that the train data and test data do not differ in distribution.


5. **Train a model.** Decide which model to use. A GBM is a good default because it usually performs well.


6. **Tune the hyper-parameters.** Continue until the model reaches the target performance. Use the validation step throughout to **guard against overfitting**.


7. **Run the final test.** Create a submission file in the format the competition requires and check the score.

## 1. Loading libraries and data

In [None]:
# The four standard data-analysis libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Models and evaluation
# (When analyzing tabular data with machine learning, I usually just run these two models. RF in particular is convenient for quick tests.)
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from xgboost import XGBClassifier

# StratifiedKFold (CV) and partial: both are needed for the Optuna objective
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from functools import partial

# Optuna, a library for hyper-parameter tuning
!pip install optuna
import optuna


In [10]:
# Load the data.
base_path = '../data/'
train = pd.read_csv(base_path +'train.csv')
test = pd.read_csv(base_path + 'test.csv')
submission = pd.read_csv(base_path + 'sample_submission.csv')
print(train.shape, test.shape, submission.shape)

(101763, 23) (67842, 22) (67842, 2)


In [9]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 2. EDA

- Check the basic facts you need to know about the data.


- Check class imbalance, the target distribution, outliers, and correlations.

In [None]:
train.columns

Index(['id', 'loc', 'v(g)', 'ev(g)', 'iv(g)', 'n', 'v', 'l', 'd', 'i', 'e',
       'b', 't', 'lOCode', 'lOComment', 'lOBlank', 'locCodeAndComment',
       'uniq_Op', 'uniq_Opnd', 'total_Op', 'total_Opnd', 'branchCount',
       'defects'],
      dtype='object')
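The four checks listed above could be sketched as follows. This is only an illustration on a small synthetic DataFrame: the column names `loc` and `branchCount` are borrowed from the columns shown above, the values are made up, and the 1.5 x IQR outlier rule is one common convention rather than something this notebook prescribes.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `train`; the values are made up for illustration.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "loc": rng.lognormal(mean=3.0, sigma=1.0, size=1000),
    "branchCount": rng.poisson(lam=5, size=1000).astype(float),
})
demo["defects"] = rng.random(1000) < 0.2

# 1) Class imbalance / target distribution: share of each class.
class_share = demo["defects"].value_counts(normalize=True)

# 2) Outliers: count rows outside 1.5 * IQR for each numeric feature.
features = demo.drop(columns=["defects"])
q1, q3 = features.quantile(0.25), features.quantile(0.75)
iqr = q3 - q1
outlier_counts = ((features < q1 - 1.5 * iqr) | (features > q3 + 1.5 * iqr)).sum()

# 3) Correlation of each feature with the target.
target_corr = features.corrwith(demo["defects"].astype(int))

print(class_share, outlier_counts, target_corr, sep="\n\n")
```

On the real data, the same three summaries can be computed by replacing `demo` with `train`.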

## 3. Preprocessing

### Handling missing values
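The notebook leaves this step empty, so the following is only a sketch of one possible approach, median imputation. The frames `train_demo` and `test_demo` are hypothetical and contain made-up gaps. The fill values are learned from the training data and reused for the test data.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with gaps; the competition data may not need this step.
train_demo = pd.DataFrame({"loc": [10.0, np.nan, 30.0, 50.0],
                           "v(g)": [1.0, 2.0, np.nan, 3.0]})
test_demo = pd.DataFrame({"loc": [np.nan, 20.0], "v(g)": [4.0, np.nan]})

# Count missing values per column before deciding how to handle them.
print(train_demo.isna().sum())

# Learn the fill values from the training data only ...
medians = train_demo.median()

# ... and apply the same values to both sets.
train_filled = train_demo.fillna(medians)
test_filled = test_demo.fillna(medians)
```

Using training statistics for both sets keeps information from the test set out of the preprocessing.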

## 4. Splitting the training data

In [13]:
# This hold-out split is for a quick first test; actual training uses K-fold CV.
from sklearn.model_selection import train_test_split
X = train.drop(columns=['defects'])
y = train.defects
# Hold out 20% of the data as a validation set (stratified on the target).
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=61, stratify=y)

In [None]:
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)
print(y_train.mean(), y_val.mean())

(81410, 22) (81410,) (20353, 22) (20353,)
0.2266429185603734 0.22664963396059548
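The stratified split keeps the positive rate nearly identical in both parts, as the output above shows. For the train/test comparison mentioned in the overview, one possible check is a two-sample Kolmogorov-Smirnov test per feature. The sketch below uses `scipy.stats.ks_2samp` on synthetic data; the frames and the deliberately shifted column are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(61)
train_demo = pd.DataFrame({
    "same": rng.normal(0.0, 1.0, 2000),
    "shifted": rng.normal(0.0, 1.0, 2000),
})
test_demo = pd.DataFrame({
    "same": rng.normal(0.0, 1.0, 2000),
    "shifted": rng.normal(1.0, 1.0, 2000),  # deliberately shifted by one std
})

# Compare each feature's train and test samples.
rows = []
for col in train_demo.columns:
    stat, p_value = ks_2samp(train_demo[col], test_demo[col])
    rows.append({"feature": col, "ks_stat": stat, "p_value": p_value})
report = pd.DataFrame(rows).set_index("feature")
print(report)
```

With samples as large as the ones in this notebook, even tiny differences can produce small p-values, so the KS statistic itself is usually the more informative number.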


## 5. Training and evaluation

In [14]:
# Quick XGBoost test
# A reasonable hyper-parameter combination (an example, not necessarily the best).
model = XGBClassifier(
    booster='gbtree',
    objective='binary:logistic',
    max_depth=6,
    learning_rate=0.1,
    n_estimators=100,
    n_jobs=-1,
    random_state=61
)

In [15]:
print("\nFitting XGBoost...")
model.fit(X_train, y_train)


Fitting XGBoost...


In [16]:
# Change the metric to suit each task.
evaluation_metric = roc_auc_score

In [17]:
print("Prediction")
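# NOTE: predict() returns hard 0/1 labels, so the scores below are label-based.
# ROC AUC is normally computed from probabilities, e.g. model.predict_proba(X_val)[:, 1].
pred_train = model.predict(X_train)
pred_val = model.predict(X_val)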


train_score = evaluation_metric(y_train, pred_train)
val_score = evaluation_metric(y_val, pred_val)

print("Train Score : %.4f" % train_score)
print("Validation Score : %.4f" % val_score)

Prediction
Train Score : 0.6805
Validation Score : 0.6661
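The scores above come from hard class labels. ROC AUC measures how well the positives are ranked above the negatives, so passing labels instead of probabilities can change the result. The toy example below uses made-up values and assumes a 0.5 threshold as a stand-in for `predict()`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 1])
proba = np.array([0.1, 0.3, 0.35, 0.4, 0.8])  # made-up predicted probabilities
labels = (proba >= 0.5).astype(int)            # hard labels at a 0.5 threshold

auc_from_proba = roc_auc_score(y_true, proba)    # every positive outranks every negative
auc_from_labels = roc_auc_score(y_true, labels)  # ranking collapsed to two levels
print(f"{auc_from_proba:.4f} {auc_from_labels:.4f}")  # 1.0000 0.6667
```

In this toy case the label-based score is lower. The direction and size of the gap depend on the data, which is why the two kinds of score in this notebook are not directly comparable.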


## 6. Hyper-parameter tuning

In [29]:
# Weight for the positive class: number of negatives divided by number of positives.
cls_weight = (y_train.shape[0] - np.sum(y_train)) / np.sum(y_train)
cls_weight

3.412226979567503
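The value above is the ratio of negative to positive samples, which is passed to XGBoost later as `scale_pos_weight`. A minimal worked example on a made-up label vector, plus a cross-check against the positive rate printed after the split:

```python
import pandas as pd

y_demo = pd.Series([0, 0, 0, 1, 0, 0, 1, 0])  # 6 negatives, 2 positives
n_pos = int(y_demo.sum())
n_neg = len(y_demo) - n_pos
weight = n_neg / n_pos
print(weight)  # 3.0

# The same ratio expressed through the positive rate p: (1 - p) / p.
p = 0.2266429185603734  # y_train.mean() as printed in the split section
weight_from_rate = (1 - p) / p
print(round(weight_from_rate, 4))  # 3.4122
```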

In [31]:
def optimizer(trial, X, y, K):
    # Define the hyper-parameter search space.
    max_depth = trial.suggest_int('max_depth', 5, 15)
    colsample_bynode = trial.suggest_float('colsample_bynode', 0.5, 0.8)
    reg_lambda = trial.suggest_float('reg_lambda', 0.5, 5.0)
    n_estimators = trial.suggest_int('n_estimators', 50, 1000)
    learning_rate = trial.suggest_float('learning_rate', 0.01, 0.3)

    # Choose the model. Optuna runs take a long time, so I usually test with RF first and then switch to a GBM (XGBoost here).
    model = XGBClassifier(
        max_depth=max_depth,
        colsample_bynode=colsample_bynode,
        reg_lambda=reg_lambda,
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        random_state=61,
        scale_pos_weight=cls_weight,
        eval_metric=evaluation_metric
    )

    # K-fold cross-validation.
    folds = StratifiedKFold(n_splits=K, random_state=61, shuffle=True)
    scores = []

    for train_idx, val_idx in folds.split(X, y):
        X_train = X.iloc[train_idx, :]
        y_train = y.iloc[train_idx]

        X_val = X.iloc[val_idx, :]
        y_val = y.iloc[val_idx]

        model.fit(X_train, y_train)
        # NOTE: hard labels are scored here; predict_proba(X_val)[:, 1] is the usual input for ROC AUC.
        preds = model.predict(X_val)
        score = evaluation_metric(y_val, preds)
        scores.append(score)


    # Return the mean score over the K folds (the study maximizes this value).
    return np.mean(scores)

In [32]:
K = 5   # number of folds
opt_func = partial(optimizer, X=X, y=y, K=K)
study = optuna.create_study(direction="maximize") # whether to minimize or maximize the objective
study.optimize(opt_func, n_trials=50)

[I 2023-10-13 04:16:30,014] A new study created in memory with name: no-name-af6e7b37-b25f-4514-a1d4-8082f2ce5df0
[I 2023-10-13 04:16:49,029] Trial 0 finished with value: 0.7237099002284478 and parameters: {'max_depth': 5, 'colsample_bynode': 0.5264611055532672, 'reg_lambda': 4.629002595311057, 'n_estimators': 690, 'learning_rate': 0.03510055302925545}. Best is trial 0 with value: 0.7237099002284478.
[I 2023-10-13 04:17:07,734] Trial 1 finished with value: 0.713795700362349 and parameters: {'max_depth': 7, 'colsample_bynode': 0.5501470846761399, 'reg_lambda': 3.7989797736428437, 'n_estimators': 466, 'learning_rate': 0.10588182452686619}. Best is trial 0 with value: 0.7237099002284478.
[I 2023-10-13 04:17:48,219] Trial 2 finished with value: 0.6743365037581354 and parameters: {'max_depth': 10, 'colsample_bynode': 0.5782417654403937, 'reg_lambda': 0.9512169815662834, 'n_estimators': 674, 'learning_rate': 0.21017281919281555}. Best is trial 0 with value: 0.7237099002284478.
[I 2023-10-13 
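`study.optimize` calls the objective with a single `trial` argument, which is why the extra arguments `X`, `y` and `K` are bound in advance with `functools.partial`. The toy sketch below shows only that binding step. The `objective` function and the plain dict standing in for a trial are hypothetical, not part of Optuna.

```python
from functools import partial

def objective(trial, X, y, K):
    """Toy objective; `trial` is a plain dict standing in for an Optuna trial."""
    return trial["x"] * K + len(X) + len(y)

# Bind everything except the first positional argument.
bound = partial(objective, X=[1, 2, 3], y=[0, 1, 0], K=5)

# The caller now supplies only the trial.
result = bound({"x": 2})
print(result)  # 2 * 5 + 3 + 3 = 16
```

A lambda such as `lambda trial: optimizer(trial, X, y, K)` would work as well; `partial` just keeps the bound values inspectable through `bound.keywords`.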

In [33]:
# All trial data recorded by Optuna
study.trials_dataframe()

| | number | value | datetime_start | datetime_complete | duration | params_colsample_bynode | params_learning_rate | params_max_depth | params_n_estimators | params_reg_lambda | state |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.72371 | 2023-10-13 04:16:30.017135 | 2023-10-13 04:16:49.028805 | 0 days 00:00:19.011670 | 0.526461 | 0.035101 | 5 | 690 | 4.629003 | COMPLETE |
| 1 | 1 | 0.713796 | 2023-10-13 04:16:49.030344 | 2023-10-13 04:17:07.734023 | 0 days 00:00:18.703679 | 0.550147 | 0.105882 | 7 | 466 | 3.79898 | COMPLETE |
| 2 | 2 | 0.674337 | 2023-10-13 04:17:07.735967 | 2023-10-13 04:17:48.218930 | 0 days 00:00:40.482963 | 0.578242 | 0.210173 | 10 | 674 | 0.951217 | COMPLETE |
| 3 | 3 | 0.707495 | 2023-10-13 04:17:48.220523 | 2023-10-13 04:18:30.085644 | 0 days 00:00:41.865121 | 0.681661 | 0.048177 | 14 | 173 | 4.024347 | COMPLETE |
| 4 | 4 | 0.697522 | 2023-10-13 04:18:30.087613 | 2023-10-13 04:18:52.880483 | 0 days 00:00:22.792870 | 0.780326 | 0.138293 | 10 | 286 | 1.920385 | COMPLETE |
| 5 | 5 | 0.670519 | 2023-10-13 04:18:52.881949 | 2023-10-13 04:19:55.793659 | 0 days 00:01:02.911710 | 0.652544 | 0.264276 | 11 | 843 | 3.750053 | COMPLETE |
| 6 | 6 | 0.679611 | 2023-10-13 04:19:55.795444 | 2023-10-13 04:20:21.077637 | 0 days 00:00:25.282193 | 0.549556 | 0.201532 | 13 | 220 | 0.654532 | COMPLETE |
| 7 | 7 | 0.722561 | 2023-10-13 04:20:21.079391 | 2023-10-13 04:20:27.898552 | 0 days 00:00:06.819161 | 0.663896 | 0.147661 | 5 | 180 | 4.01448 | COMPLETE |
| 8 | 8 | 0.712556 | 2023-10-13 04:20:27.900518 | 2023-10-13 04:20:44.828910 | 0 days 00:00:16.928392 | 0.614021 | 0.110778 | 7 | 453 | 1.303692 | COMPLETE |
| 9 | 9 | 0.672652 | 2023-10-13 04:20:44.830635 | 2023-10-13 04:22:47.806935 | 0 days 00:02:02.976300 | 0.745658 | 0.137026 | 15 | 917 | 4.080633 | COMPLETE |


In [34]:
print("Best Score: %.4f" % study.best_value) # print the best score
print("Best params: ", study.best_trial.params) # hyper-parameters of the best trial

Best Score: 0.7253
Best params:  {'max_depth': 6, 'colsample_bynode': 0.5692021789449475, 'reg_lambda': 4.25151501107612, 'n_estimators': 250, 'learning_rate': 0.010726959499003449}


In [35]:
# Visualize the optimization history
optuna.visualization.plot_optimization_history(study)

In [36]:
# Importance of each hyper-parameter
optuna.visualization.plot_param_importances(study)

## 7. Testing and creating the submission file

In [37]:
# Make KFold OOF prediction
def oof_preds(best_model):

    # make KFold
    folds = StratifiedKFold(n_splits=K, random_state=42, shuffle=True)
    final_preds = []
    losses = []
    # fitting with best_model
    for i, (train_idx, val_idx) in enumerate(folds.split(X, y)):
        X_train = X.iloc[train_idx, :]
        y_train = y.iloc[train_idx]
        X_val = X.iloc[val_idx, :]
        y_val = y.iloc[val_idx]

        print(f"========== Fold {i+1} ==========")
        best_model.fit(X_train, y_train)
        preds = best_model.predict_proba(X_val)[:, 1]
        test_preds = best_model.predict_proba(test)[:, 1]
        final_preds.append(test_preds)
        loss = evaluation_metric(y_val, preds)

        losses.append(loss)

    avg_loss = np.mean(losses)
    print(f"Loss : {avg_loss:.4f}")
    return final_preds
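`oof_preds` above returns the per-fold predictions on the test set. Strictly speaking, out-of-fold predictions are the ones made on the held-out rows of the training data. A minimal sketch that collects both is shown below. It assumes synthetic data from `make_classification` and uses `LogisticRegression` as a lightweight stand-in for the tuned XGBoost model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-ins for X, y and test.
X_all, y_all = make_classification(n_samples=600, n_features=8, random_state=61)
X_demo, y_demo, X_test_demo = X_all[:500], y_all[:500], X_all[500:]

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=61)
oof = np.zeros(len(X_demo))             # one held-out prediction per training row
test_pred = np.zeros(len(X_test_demo))  # running average over the fold models

for train_idx, val_idx in folds.split(X_demo, y_demo):
    fold_model = LogisticRegression(max_iter=1000)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # Each training row is predicted by the one model that never saw it.
    oof[val_idx] = fold_model.predict_proba(X_demo[val_idx])[:, 1]
    # Every fold model contributes equally to the test prediction.
    test_pred += fold_model.predict_proba(X_test_demo)[:, 1] / folds.get_n_splits()

print(f"OOF ROC AUC: {roc_auc_score(y_demo, oof):.4f}")
```

Scoring the full `oof` vector once gives a single validation estimate over all training rows, as an alternative to averaging per-fold scores.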

In [38]:
test.info() # no missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67842 entries, 0 to 67841
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 67842 non-null  int64  
 1   loc                67842 non-null  float64
 2   v(g)               67842 non-null  float64
 3   ev(g)              67842 non-null  float64
 4   iv(g)              67842 non-null  float64
 5   n                  67842 non-null  float64
 6   v                  67842 non-null  float64
 7   l                  67842 non-null  float64
 8   d                  67842 non-null  float64
 9   i                  67842 non-null  float64
 10  e                  67842 non-null  float64
 11  b                  67842 non-null  float64
 12  t                  67842 non-null  float64
 13  lOCode             67842 non-null  int64  
 14  lOComment          67842 non-null  int64  
 15  lOBlank            67842 non-null  int64  
 16  locCodeAndComment  678

In [None]:
## Build X_test: apply the same preprocessing that was applied to the training data.
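One way to keep the two pipelines identical is to put the preprocessing in a single function and call it for both sets. The sketch below is hypothetical: `preprocess`, the `log1p` transform and the demo frames are assumptions, not steps taken in this notebook.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """Hypothetical preprocessing step shared by train and test."""
    out = df[feature_cols].copy()  # also fixes the column order
    # Example transform: compress a long-tailed count feature.
    out["loc"] = np.log1p(out["loc"])
    return out

train_demo = pd.DataFrame({"id": [0, 1], "loc": [9.0, 99.0], "defects": [False, True]})
test_demo = pd.DataFrame({"loc": [0.0, 999.0], "id": [2, 3]})  # different column order

feature_cols = [c for c in train_demo.columns if c != "defects"]
X_demo = preprocess(train_demo, feature_cols)
X_test_demo = preprocess(test_demo, feature_cols)

# The model expects identical columns in identical order.
print(list(X_demo.columns) == list(X_test_demo.columns))  # True
```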


In [40]:
best_params = study.best_trial.params

# define best model
best_model = XGBClassifier(**best_params, scale_pos_weight=cls_weight, eval_metric=evaluation_metric, random_state=61)

# Model finalization: refit the best configuration on each of the K folds and average the fold models' test predictions.

preds = oof_preds(best_model)
preds = np.mean(preds, axis=0)
preds

Loss : 0.7919


array([0.5172478, 0.4611327, 0.8364431, ..., 0.4142035, 0.2939667,
       0.8889718], dtype=float32)

In [41]:
submission['defects'] = preds

In [42]:
submission.to_csv("submission_xgboost_kfold.csv", index=False)
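Before uploading, it can help to check the file against the sample submission. The helper below is a hypothetical sketch; `check_submission` and the demo frames are not part of the notebook. The range check assumes the target is a probability, as it is here.

```python
import numpy as np
import pandas as pd

def check_submission(sub: pd.DataFrame, sample: pd.DataFrame, target: str) -> None:
    """Hypothetical helper: raise AssertionError if `sub` does not match `sample`'s layout."""
    assert list(sub.columns) == list(sample.columns), "column mismatch"
    assert len(sub) == len(sample), "row count mismatch"
    assert sub["id"].equals(sample["id"]), "id mismatch"
    assert sub[target].notna().all(), "missing predictions"
    assert sub[target].between(0.0, 1.0).all(), "predictions outside [0, 1]"

# Made-up stand-ins for the sample submission and the filled-in submission.
sample_demo = pd.DataFrame({"id": [101763, 101764, 101765], "defects": [0.5, 0.5, 0.5]})
sub_demo = sample_demo.copy()
sub_demo["defects"] = np.array([0.52, 0.46, 0.84], dtype=np.float32)

check_submission(sub_demo, sample_demo, target="defects")
print("submission looks consistent")
```

With the notebook's own objects, the call would be `check_submission(submission, pd.read_csv(base_path + 'sample_submission.csv'), target='defects')`.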

In [48]:
pd.read_csv('submission_xgboost_kfold.csv')

| | id | defects |
|---|---|---|
| 0 | 101763 | 0.517248 |
| 1 | 101764 | 0.461133 |
| 2 | 101765 | 0.836443 |
| 3 | 101766 | 0.743730 |
| 4 | 101767 | 0.369976 |
| ... | ... | ... |
| 67837 | 169600 | 0.566138 |
| 67838 | 169601 | 0.313743 |
| 67839 | 169602 | 0.414204 |
| 67840 | 169603 | 0.293967 |
