## Machine Learning 프로젝트 수행을 위한 코드 구조화

- ML project를 위해서 사용하는 템플릿 코드를 만듭니다.

1. **필요한 라이브러리와 데이터를 불러옵니다.**


2. **EDA를 수행합니다.** 이 때 EDA의 목적은 풀어야하는 문제를 위해서 수행됩니다.


3. **전처리를 수행합니다.** 이 때 중요한건 **feature engineering**을 어떻게 하느냐 입니다.


4. **데이터 분할을 합니다.** 이 때 train data와 test data 간의 분포 차이가 없는지 확인합니다.


5. **학습을 진행합니다.** 어떤 모델을 사용하여 학습할지 정합니다. 성능이 잘 나오는 GBM을 추천합니다.


6. **hyper-parameter tuning을 수행합니다.** 원하는 목표 성능이 나올 때 까지 진행합니다. 검증 단계를 통해 지속적으로 **overfitting이 되지 않게 주의**하세요.


7. **최종 테스트를 진행합니다.** 데이터 분석 대회 포맷에 맞는 submission 파일을 만들어서 성능을 확인해보세요.

## 1. 라이브러리, 데이터 불러오기

In [103]:
# 데이터분석 4종 세트
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 모델들, 성능 평가
# (저는 일반적으로 정형데이터로 머신러닝 분석할 때는 이 2개 모델은 그냥 돌려봅니다. 특히 RF가 테스트하기 좋습니다.)
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from lightgbm import LGBMClassifier

# KFold(CV), partial : optuna를 사용하기 위함
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from functools import partial

# hyper-parameter tuning을 위한 라이브러리, optuna
import optuna

In [104]:
# 데이터를 불러옵니다.
base_path = '../../data/'
# train = pd.read_csv(base_path + 'train.csv', index_col='id')
# test = pd.read_csv(base_path + 'test.csv', index_col='id')
train = pd.read_csv(base_path + 'train_f5.csv', index_col='id')
test = pd.read_csv(base_path + 'test_f5.csv', index_col='id')
submission = pd.read_csv(base_path + 'sample_submission.csv', index_col='id')
print(train.shape, test.shape, submission.shape)

(101763, 23) (67842, 22) (67842, 1)


In [105]:
train['lOBlank_n'] = train['lOBlank_n'].astype('category')
test['lOBlank_n'] = test['lOBlank_n'].astype('category')
train['l_n'] = train['l_n'].astype('category')
test['l_n'] = test['l_n'].astype('category')
train['n_n'] = train['n_n'].astype('category')
test['n_n'] = test['n_n'].astype('category')
train['branchCount_n'] = train['branchCount_n'].astype('category')
test['branchCount_n'] = test['branchCount_n'].astype('category')
train['outlier_count'] = train['outlier_count'].astype('category')
test['outlier_count'] = test['outlier_count'].astype('category')
train['total_Op_n'] = train['total_Op_n'].astype('category')
test['total_Op_n'] = test['total_Op_n'].astype('category')
train['total_Opnd_n'] = train['total_Opnd_n'].astype('category')
test['total_Opnd_n'] = test['total_Opnd_n'].astype('category')

In [106]:
test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 67842 entries, 101763 to 169604
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   loc                67842 non-null  float64 
 1   v(g)               67842 non-null  float64 
 2   ev(g)              67842 non-null  float64 
 3   iv(g)              67842 non-null  float64 
 4   v                  67842 non-null  float64 
 5   d                  67842 non-null  float64 
 6   i                  67842 non-null  float64 
 7   e                  67842 non-null  float64 
 8   b                  67842 non-null  float64 
 9   t                  67842 non-null  float64 
 10  lOCode             67842 non-null  int64   
 11  lOComment          67842 non-null  int64   
 12  locCodeAndComment  67842 non-null  int64   
 13  uniq_Op            67842 non-null  float64 
 14  uniq_Opnd          67842 non-null  float64 
 15  lOBlank_n          67842 non-null  category
 16  n_n

## 2. EDA

- 데이터에서 찾아야 하는 기초적인 내용들을 확인합니다.


- class imbalance, target distribution, outlier, correlation을 확인합니다.

In [107]:
train.columns

Index(['loc', 'v(g)', 'ev(g)', 'iv(g)', 'v', 'd', 'i', 'e', 'b', 't', 'lOCode',
       'lOComment', 'locCodeAndComment', 'uniq_Op', 'uniq_Opnd', 'lOBlank_n',
       'n_n', 'branchCount_n', 'l_n', 'total_Op_n', 'total_Opnd_n',
       'outlier_count', 'defects'],
      dtype='object')

In [108]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 101763 entries, 0 to 101762
Data columns (total 23 columns):
 #   Column             Non-Null Count   Dtype   
---  ------             --------------   -----   
 0   loc                101763 non-null  float64 
 1   v(g)               101763 non-null  float64 
 2   ev(g)              101763 non-null  float64 
 3   iv(g)              101763 non-null  float64 
 4   v                  101763 non-null  float64 
 5   d                  101763 non-null  float64 
 6   i                  101763 non-null  float64 
 7   e                  101763 non-null  float64 
 8   b                  101763 non-null  float64 
 9   t                  101763 non-null  float64 
 10  lOCode             101763 non-null  int64   
 11  lOComment          101763 non-null  int64   
 12  locCodeAndComment  101763 non-null  int64   
 13  uniq_Op            101763 non-null  float64 
 14  uniq_Opnd          101763 non-null  float64 
 15  lOBlank_n          101763 non-null  cat

### 3. 전처리

#### 결측치 처리

### 4. 학습 데이터 분할

In [109]:
# 첫번째 테스트용으로 사용하고, 실제 학습시에는 K-Fold CV를 사용합니다.
from sklearn.model_selection import train_test_split

X = train.drop(columns=['defects'])
y = train.defects

# for OOF-prediction split 5% of data as validation dataset.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=61, stratify=y)

In [110]:
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)
print(y_train.mean(), y_val.mean())

(81410, 22) (81410,) (20353, 22) (20353,)
0.2266429185603734 0.22664963396059548


### 5. 학습 및 평가

In [111]:
# 간단하게 LightGBM 테스트
# 적당한 hyper-parameter 조합을 두었습니다. (항상 best는 아닙니다. 예시입니다.)
model = LGBMClassifier(
    n_jobs=-1,
    random_state=61
)

In [112]:
print("\nFitting LightGBM...")
model.fit(X_train, y_train)


Fitting LightGBM...


In [113]:
# metric은 그때마다 맞게 바꿔줘야 합니다.
evaluation_metric = roc_auc_score

In [114]:
print("Prediction")
pred_train = model.predict(X_train)
pred_val = model.predict(X_val)


train_score = evaluation_metric(y_train, pred_train)
val_score = evaluation_metric(y_val, pred_val)

print("Train Score : %.4f" % train_score)
print("Validation Score : %.4f" % val_score)

Prediction
Train Score : 0.6766
Validation Score : 0.6677


### 6. Hyper-parameter Tuning

In [115]:
def optimizer(trial, X, y, K):
    # 조절할 hyper-parameter 조합을 적어줍니다.
    max_depth = trial.suggest_int('max_depth', 15, 25)
    num_leaves = trial.suggest_categorical('num_leaves', [32,64,128,256,512,])
    min_child_samples = trial.suggest_int('min_child_samples', 20, 80)
    colsample_bytree = trial.suggest_float('colsample_bytree', 0.5, 0.7)
    n_estimators = trial.suggest_int('n_estimators', 50, 1023)
    learning_rate = trial.suggest_float('learning_rate', 0.005, 0.2)

    # 원하는 모델을 지정합니다, optuna는 시간이 오래걸리기 때문에 저는 보통 RF로 일단 테스트를 해본 뒤에 LGBM을 사용합니다.
    model = LGBMClassifier(
        max_depth=max_depth,
        num_leaves=num_leaves,
        min_child_samples=min_child_samples,
        colsample_bytree=colsample_bytree,
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        random_state=61,
    )

    # K-Fold Cross validation을 구현합니다.
    folds = StratifiedKFold(n_splits=K, random_state=61, shuffle=True)
    losses = []

    for train_idx, val_idx in folds.split(X, y):
        X_train = X.iloc[train_idx, :]
        y_train = y.iloc[train_idx]

        X_val = X.iloc[val_idx, :]
        y_val = y.iloc[val_idx]

        model.fit(X_train, y_train)
        preds = model.predict(X_val)
        loss = evaluation_metric(y_val, preds)
        losses.append(loss)


    # K-Fold의 평균 loss값을 돌려줍니다.
    return np.mean(losses)

In [116]:
K = 5   # Kfold 수
opt_func = partial(optimizer, X=X, y=y, K=K)

study = optuna.create_study(direction="maximize") # 최소/최대 어느 방향의 최적값을 구할 건지.
study.optimize(opt_func, n_trials=50)

[I 2023-10-19 02:28:23,454] A new study created in memory with name: no-name-a75215cc-ad2a-434c-830c-fa0f9ec66744
[I 2023-10-19 02:29:28,630] Trial 0 finished with value: 0.6526392318019023 and parameters: {'max_depth': 20, 'num_leaves': 64, 'min_child_samples': 26, 'colsample_bytree': 0.5100788961201105, 'n_estimators': 925, 'learning_rate': 0.18401347928183212}. Best is trial 0 with value: 0.6526392318019023.
[I 2023-10-19 02:29:58,756] Trial 1 finished with value: 0.6552852886351263 and parameters: {'max_depth': 15, 'num_leaves': 32, 'min_child_samples': 30, 'colsample_bytree': 0.6417886331778208, 'n_estimators': 704, 'learning_rate': 0.16557138205766944}. Best is trial 1 with value: 0.6552852886351263.
[I 2023-10-19 02:32:31,248] Trial 2 finished with value: 0.6548954179173035 and parameters: {'max_depth': 21, 'num_leaves': 256, 'min_child_samples': 26, 'colsample_bytree': 0.6948174622687029, 'n_estimators': 1002, 'learning_rate': 0.05936937086736877}. Best is trial 1 with value: 0

In [117]:
# optuna가 시도했던 모든 실험 관련 데이터
study.trials_dataframe()

Unnamed: 0,number,value,datetime_start,datetime_complete,duration,params_colsample_bytree,params_learning_rate,params_max_depth,params_min_child_samples,params_n_estimators,params_num_leaves,state
0,0,0.652639,2023-10-19 02:28:23.460734,2023-10-19 02:29:28.628557,0 days 00:01:05.167823,0.510079,0.184013,20,26,925,64,COMPLETE
1,1,0.655285,2023-10-19 02:29:28.631021,2023-10-19 02:29:58.755732,0 days 00:00:30.124711,0.641789,0.165571,15,30,704,32,COMPLETE
2,2,0.654895,2023-10-19 02:29:58.758471,2023-10-19 02:32:31.247383,0 days 00:02:32.488912,0.694817,0.059369,21,26,1002,256,COMPLETE
3,3,0.657425,2023-10-19 02:32:31.249381,2023-10-19 02:33:15.133273,0 days 00:00:43.883892,0.626728,0.075714,24,67,930,32,COMPLETE
4,4,0.654287,2023-10-19 02:33:15.135275,2023-10-19 02:34:05.927393,0 days 00:00:50.792118,0.500687,0.13941,19,57,342,512,COMPLETE
5,5,0.653163,2023-10-19 02:34:05.929078,2023-10-19 02:35:31.522810,0 days 00:01:25.593732,0.668691,0.128863,16,42,702,512,COMPLETE
6,6,0.658978,2023-10-19 02:35:31.525340,2023-10-19 02:36:02.090703,0 days 00:00:30.565363,0.665966,0.022475,23,80,128,256,COMPLETE
7,7,0.653363,2023-10-19 02:36:02.094140,2023-10-19 02:37:05.054266,0 days 00:01:02.960126,0.586436,0.143064,19,75,625,256,COMPLETE
8,8,0.663187,2023-10-19 02:37:05.056265,2023-10-19 02:38:05.839970,0 days 00:01:00.783705,0.696452,0.014823,22,67,905,64,COMPLETE
9,9,0.656301,2023-10-19 02:38:05.841646,2023-10-19 02:38:46.375486,0 days 00:00:40.533840,0.593794,0.126855,22,78,315,256,COMPLETE


In [118]:
print("Best Score: %.4f" % study.best_value) # best score 출력
print("Best params: ", study.best_trial.params) # best score일 때의 하이퍼파라미터들

Best Score: 0.6655
Best params:  {'max_depth': 17, 'num_leaves': 32, 'min_child_samples': 31, 'colsample_bytree': 0.6338246570775382, 'n_estimators': 115, 'learning_rate': 0.04902005456004471}


In [119]:
# 실험 기록 시각화
optuna.visualization.plot_optimization_history(study)

In [120]:
# hyper-parameter들의 중요도
optuna.visualization.plot_param_importances(study)

### 7. 테스트 및 제출 파일 생성

In [121]:
# Make KFold OOF prediction
def oof_preds(best_model):

    # make KFold
    folds = StratifiedKFold(n_splits=K, random_state=61, shuffle=True)
    final_preds = []
    losses = []
    oof = np.full(len(X), np.nan)
    # fitting with best_model
    for i, (train_idx, val_idx) in enumerate(folds.split(X, y)):
        X_train = X.iloc[train_idx, :]
        y_train = y.iloc[train_idx]
        X_val = X.iloc[val_idx, :]
        y_val = y.iloc[val_idx]

        print(f"========== Fold {i+1} ==========")
        best_model.fit(X_train, y_train)
        preds = best_model.predict_proba(X_val)[:, 1]
        oof[val_idx] = preds
        test_preds = best_model.predict_proba(test)[:, 1]
        final_preds.append(test_preds)
        loss = evaluation_metric(y_val, preds)

        losses.append(loss)

    avg_loss = np.mean(losses)
    print(f"Loss : {avg_loss:.4f}")
    return final_preds, oof

In [122]:
test.info() # 결측치 없음.

<class 'pandas.core.frame.DataFrame'>
Index: 67842 entries, 101763 to 169604
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   loc                67842 non-null  float64 
 1   v(g)               67842 non-null  float64 
 2   ev(g)              67842 non-null  float64 
 3   iv(g)              67842 non-null  float64 
 4   v                  67842 non-null  float64 
 5   d                  67842 non-null  float64 
 6   i                  67842 non-null  float64 
 7   e                  67842 non-null  float64 
 8   b                  67842 non-null  float64 
 9   t                  67842 non-null  float64 
 10  lOCode             67842 non-null  int64   
 11  lOComment          67842 non-null  int64   
 12  locCodeAndComment  67842 non-null  int64   
 13  uniq_Op            67842 non-null  float64 
 14  uniq_Opnd          67842 non-null  float64 
 15  lOBlank_n          67842 non-null  category
 16  n_n

In [123]:
## X_test 만들기 : 앞서했던 전처리를 동일하게 적용해주면 됨.


In [124]:
best_params = study.best_trial.params

# define best model
best_model = LGBMClassifier(**best_params, random_state=61)

# model finalization : 가장 일반적으로 좋은 예측 성능을 냈던 모델로, 전체 데이터 트레이닝.

preds, oof = oof_preds(best_model)
preds = np.mean(preds, axis=0)
preds # 0.7913

Loss : 0.7914


array([0.26603909, 0.21272493, 0.67021534, ..., 0.16518349, 0.10858143,
       0.80623079])

In [125]:
submission['defects'] = preds

In [126]:
submission.to_csv(base_path+'preds/lightgbm.csv')

In [127]:
oof_df = pd.DataFrame({
    'lightgbm_oof': oof
})
oof_df.head()

Unnamed: 0,lightgbm_oof
0,0.084921
1,0.057131
2,0.048797
3,0.096845
4,0.10823


In [128]:
oof_df.to_csv(base_path+'oof/lightgbm.csv',index_label='id', header=['lightgbm_oof'])

In [129]:
pd.DataFrame({'importance':best_model.feature_importances_,'column':test.columns}).sort_values(by='importance', ascending=False)

Unnamed: 0,importance,column
0,544,loc
6,324,i
5,303,d
13,267,uniq_Op
14,248,uniq_Opnd
4,236,v
10,227,lOCode
1,203,v(g)
9,191,t
7,181,e
