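loss: log_loss (usable for both binary and multiclass classification)

learning_rate: shrinkage applied to each tree's contribution, in the range 0 ~ 1

max_iter: number of boosting iterations

max_leaf_nodes: maximum number of leaf nodes per tree

max_depth: limits the maximum depth of each tree

min_samples_leaf: minimum number of samples per leaf node

l2_regularization: adds an L2 regularization term

max_bins: maximum number of histogram bins

early_stopping: enables early stopping; training stops once improvement stalls

validation_fraction: fraction of the training data set aside as the validation set

n_iter_no_change: used with early stopping; training stops if the score does not improve for this many iterations

random_state: 61 (the seed used throughout this notebook)

verbose: controls how much information is printed during training

warm_start: when True, reuses the previous fit and continues training from it

categorical_features: indices of the features to treat as categorical

monotonic_cst: monotonic increase or decrease constraint between a feature and the predicted class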

HGBM official documentation
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html

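LightGBM vs. HistGradientBoosting (HGBM)

1. How they work: tree-splitting approach (leaf-centered vs. histogram-centered). Both libraries bin features into histograms, and the scikit-learn documentation describes HGBM as inspired by LightGBM, so in practice the two are closer than this label suggests.
2. Speed and memory efficiency: LGBM < HGBM (the ordering noted here; it can vary with the dataset)
3. Tree growth: vertical (uses n_estimators) / horizontal (does not use n_estimators; HGBM sets the number of boosting rounds with max_iter)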

Common parameters:

num_leaves (LightGBM name; max_leaf_nodes in HGBM)
max_depth
learning_rate
feature_fraction: fraction of features considered at each tree split

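HGBM-specific parameters:

max_iter: maximum number of iterations; each iteration adds a tree to reduce the prediction error
l2_regularization: strength of the L2 penalty; higher values mean stronger regularization
max_bins: number of bins used to discretize each feature into a histogram
categorical_features: specifies which features are categorical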

In [28]:
pip install optuna

In [16]:
import optuna
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
# HGBM (stable since scikit-learn 1.0, so the experimental enable import is not needed)
from sklearn.ensemble import HistGradientBoostingClassifier
# Metrics (roc_auc_score is used for AUC)
import sklearn.metrics as metrics

In [17]:
# Load the data
train = pd.read_csv('./play/train.csv', index_col = 'id')
test = pd.read_csv('./play/test.csv', index_col = 'id')
submission_df = pd.read_csv('./play/sample_submission.csv', index_col='id')

In [18]:
X = train.drop(columns = ['defects'])
y = train.defects.astype(int)

In [19]:
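# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=61)

# HistGradientBoostingClassifier with default settings
model = HistGradientBoostingClassifier()

# Train the model
model.fit(X_train, y_train)

# Predict on the held-out split
y_pred = model.predict(X_test)

# Evaluate the model. Note: y_pred holds hard 0/1 labels, so this AUC reflects a
# single threshold; model.predict_proba(X_test)[:, 1] would score the full ranking.
auc = metrics.roc_auc_score(y_test, y_pred)
auc
#cm = metrics.confusion_matrix(y_test, y_pred)
#f1 = metrics.f1_score(y_test, y_pred, average='weighted')  # compute the F1 score

#print("confusion Matrix", cm)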

0.661660214285162

In [28]:
# HistGradientBoostingClassifier + Optuna
def optimizer(trial):
    loss = "log_loss"  # fixed to log loss; not part of the search
    max_iter = trial.suggest_int("max_iter", 100, 1000)
    learning_rate = trial.suggest_float('learning_rate', 0.01, 0.1)
    max_depth = trial.suggest_int('max_depth', 10, 100)
    l2_regularization = trial.suggest_float('l2_regularization', 1e-5, 1e-3)
    model = HistGradientBoostingClassifier(loss=loss, learning_rate=learning_rate,
                                           max_depth=max_depth, l2_regularization=l2_regularization,
                                           max_iter=max_iter, random_state=61)
    model.fit(X_train, y_train)
    # Hard 0/1 labels are scored here, the same way as in the baseline cell
    y_pred = model.predict(X_test)

    auc = metrics.roc_auc_score(y_test, y_pred)
    return auc

study = optuna.create_study(direction='maximize')
study.optimize(optimizer, n_trials=50)

print("Best AUC: %.4f" % study.best_value)
print("Best params: ", study.best_params)

[I 2023-10-11 20:10:07,396] A new study created in memory with name: no-name-30edcfb4-4a51-48d6-b4eb-7194a9d8178e
[I 2023-10-11 20:10:09,252] Trial 0 finished with value: 0.660572053636394 and parameters: {'max_iter': 610, 'learning_rate': 0.015922423533422003, 'max_depth': 47, 'l2_regularization': 0.0003431420010760205}. Best is trial 0 with value: 0.660572053636394.
[I 2023-10-11 20:10:11,069] Trial 1 finished with value: 0.6615159034513873 and parameters: {'max_iter': 856, 'learning_rate': 0.015436470962669344, 'max_depth': 24, 'l2_regularization': 0.00022875991167100473}. Best is trial 1 with value: 0.6615159034513873.
[I 2023-10-11 20:10:11,475] Trial 2 finished with value: 0.6617239654008066 and parameters: {'max_iter': 357, 'learning_rate': 0.09381141886043964, 'max_depth': 10, 'l2_regularization': 0.0003705978754837451}. Best is trial 2 with value: 0.6617239654008066.
[I 2023-10-11 20:10:12,002] Trial 3 finished with value: 0.6617390323561432 and parameters: {'max_iter': 169, '

[I 2023-10-11 20:10:28,095] Trial 32 finished with value: 0.6622020987681404 and parameters: {'max_iter': 505, 'learning_rate': 0.05967340165388839, 'max_depth': 71, 'l2_regularization': 0.0006645203160189844}. Best is trial 13 with value: 0.6632212082299398.
[I 2023-10-11 20:10:28,679] Trial 33 finished with value: 0.6612635490244714 and parameters: {'max_iter': 317, 'learning_rate': 0.07268875251873894, 'max_depth': 58, 'l2_regularization': 0.0007346982422816084}. Best is trial 13 with value: 0.6632212082299398.
[I 2023-10-11 20:10:29,242] Trial 34 finished with value: 0.6616956482671594 and parameters: {'max_iter': 383, 'learning_rate': 0.06535149053906152, 'max_depth': 77, 'l2_regularization': 0.0008182973903726179}. Best is trial 13 with value: 0.6632212082299398.
[I 2023-10-11 20:10:29,787] Trial 35 finished with value: 0.6621179806258352 and parameters: {'max_iter': 448, 'learning_rate': 0.07922217961344595, 'max_depth': 18, 'l2_regularization': 0.000565639349036172}. Best is tr

Best AUC: 0.6633
Best params:  {'max_iter': 110, 'learning_rate': 0.09494605702447576, 'max_depth': 83, 'l2_regularization': 0.00045512891761208057}


In [29]:
study.trials_dataframe()

,number,value,datetime_start,datetime_complete,duration,params_l2_regularization,params_learning_rate,params_max_depth,params_max_iter,state
0,0,0.660572,2023-10-11 20:10:07.397564,2023-10-11 20:10:09.252817,0 days 00:00:01.855253,0.000343,0.015922,47,610,COMPLETE
1,1,0.661516,2023-10-11 20:10:09.253817,2023-10-11 20:10:11.069224,0 days 00:00:01.815407,0.000229,0.015436,24,856,COMPLETE
2,2,0.661724,2023-10-11 20:10:11.070225,2023-10-11 20:10:11.475315,0 days 00:00:00.405090,0.000371,0.093811,10,357,COMPLETE
3,3,0.661739,2023-10-11 20:10:11.476315,2023-10-11 20:10:12.001433,0 days 00:00:00.525118,0.000303,0.070437,82,169,COMPLETE
4,4,0.661067,2023-10-11 20:10:12.002433,2023-10-11 20:10:12.441532,0 days 00:00:00.439099,0.00061,0.097102,89,951,COMPLETE
5,5,0.661316,2023-10-11 20:10:12.442532,2023-10-11 20:10:12.948646,0 days 00:00:00.506114,0.000162,0.076706,90,374,COMPLETE
6,6,0.661733,2023-10-11 20:10:12.949646,2023-10-11 20:10:13.463761,0 days 00:00:00.514115,0.000991,0.081548,73,151,COMPLETE
7,7,0.662052,2023-10-11 20:10:13.464761,2023-10-11 20:10:14.209928,0 days 00:00:00.745167,9.1e-05,0.045976,46,666,COMPLETE
8,8,0.662063,2023-10-11 20:10:14.210929,2023-10-11 20:10:14.837069,0 days 00:00:00.626140,0.000216,0.06505,51,908,COMPLETE
9,9,0.662182,2023-10-11 20:10:14.838069,2023-10-11 20:10:15.330180,0 days 00:00:00.492111,0.000791,0.083498,34,694,COMPLETE
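
The trials table is an ordinary pandas DataFrame, so it can be ranked and summarized directly. The sketch below rebuilds the first four rows shown above by hand (a subset of the columns, copied from the output) and ranks them; in the notebook itself one would presumably start from `study.trials_dataframe()` instead.

```python
import pandas as pd

# First four trials copied from the table above (subset of columns).
trials = pd.DataFrame({
    "number": [0, 1, 2, 3],
    "value": [0.660572, 0.661516, 0.661724, 0.661739],
    "params_learning_rate": [0.015922, 0.015436, 0.093811, 0.070437],
    "params_max_depth": [47, 24, 10, 82],
    "params_max_iter": [610, 856, 357, 169],
})

# Rank trials by objective value, best first.
ranked = trials.sort_values("value", ascending=False).reset_index(drop=True)
best_row = ranked.iloc[0]

# How much the objective actually moved across these trials.
spread = trials["value"].max() - trials["value"].min()

print(ranked[["number", "value"]])
print("best trial among these rows:", int(best_row["number"]))
print("spread in AUC:", round(spread, 6))
```

The small spread is worth noting: across these rows the searched parameters shift the score only in the third decimal place.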


In [30]:
study.best_trial.params

{'max_iter': 110,
 'learning_rate': 0.09494605702447576,
 'max_depth': 83,
 'l2_regularization': 0.00045512891761208057}

In [33]:
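# Call .show() on each figure so both render (a cell only displays its last expression)
optuna.visualization.plot_optimization_history(study).show()
optuna.visualization.plot_param_importances(study).show()  # observed importances: max_iter 0.31, l2_regularization 0.3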

In [36]:
Hgbm_best = HistGradientBoostingClassifier(**study.best_trial.params,
                                           loss="log_loss",
                                           random_state=61)

In [37]:
Hgbm_best.fit(X_train, y_train)

In [38]:
print(y.unique())
y_proba = Hgbm_best.predict_proba(test)
print(y_proba)
submission_df['defects'] = y_proba[:, 1]
submission_df.to_csv('submission.csv')
submission_df

[0 1]
[[0.77541812 0.22458188]
 [0.75834061 0.24165939]
 [0.28970196 0.71029804]
 ...
 [0.74912132 0.25087868]
 [0.90831426 0.09168574]
 [0.20386966 0.79613034]]


id,defects
101763,0.224582
101764,0.241659
101765,0.710298
101766,0.476019
101767,0.142550
...,...
169600,0.265374
169601,0.101148
169602,0.250879
169603,0.091686


In [90]:
pd.read_csv('submission.csv', index_col='id')

id,defects
101763,0.251435
101764,0.215032
101765,0.690652
101766,0.483614
101767,0.141881
...,...
169600,0.274047
169601,0.104643
169602,0.206772
169603,0.098215
