## XGBoost: A Complete Guide to Fine-Tune and Optimize your Model
- https://towardsdatascience.com/xgboost-fine-tune-and-optimize-your-model-23d996fab663

<div style="text-align: right"> <b>Author : Kwang Myung Yu</b></div>
<div style="text-align: right"> Initial upload: 2023.7.17</div>
<div style="text-align: right"> Last update: 2023.7.17</div>

In [1]:
import os
import sys
import time
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns
from scipy import stats
import warnings; warnings.filterwarnings('ignore')
#plt.style.use('ggplot')
plt.style.use('seaborn-whitegrid')
%matplotlib inline

XGBoost can be used through two APIs.

- Learning API: the native, low-level interface

```python
import xgboost as xgb
 
# X, y = ...  # load your features and target here
dmatrix = xgb.DMatrix(data=X, label=y)  # the Learning API works on a DMatrix
params = {'objective':'reg:squarederror'}
cv_results = xgb.cv(dtrain=dmatrix, 
                    params=params, 
                    nfold=10, 
                    metrics={'rmse'})
print('RMSE: %.2f' % cv_results['test-rmse-mean'].min())
```

- scikit-learn API: a scikit-learn wrapper, compatible with the rest of scikit-learn

```python
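import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# X, y = ...  # load your features and target here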
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)
xgbr = xgb.XGBRegressor(objective='reg:squarederror')
xgbr.fit(xtrain, ytrain)
 
ypred = xgbr.predict(xtest)
mse = mean_squared_error(ytest, ypred)
print("RMSE: %.2f" % (mse**(1/2.0)))
```

Common objective functions:  
- reg:squarederror: regression with squared-error loss
- reg:logistic: logistic regression
- binary:logistic: logistic regression for binary classification; outputs probabilities
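To make the difference between these objectives concrete, here is a small pure-Python sketch of the math behind squared-error and logistic losses. The gradient and Hessian expressions are the standard ones for these losses; the function names are hypothetical, and this is an illustration rather than XGBoost's actual source code.

```python
import math


def sigmoid(margin: float) -> float:
    """Map a raw score (margin) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


def squared_error_grad_hess(pred: float, label: float) -> tuple[float, float]:
    """Gradient and Hessian of 0.5 * (pred - label)**2 with respect to pred."""
    return pred - label, 1.0


def logistic_grad_hess(margin: float, label: float) -> tuple[float, float]:
    """Gradient and Hessian of the log loss with respect to the raw margin."""
    p = sigmoid(margin)
    return p - label, p * (1.0 - p)


# A raw score of 0 corresponds to a probability of 0.5.
print(sigmoid(0.0))
# Squared error: the gradient is simply the residual.
print(squared_error_grad_hess(3.0, 1.0))
# Logistic loss: the gradient is (probability - label).
print(logistic_grad_hess(0.0, 1.0))
```

With a logistic objective the trees are fitted on the raw margin, and the sigmoid is what turns that margin into the probability returned at prediction time.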

### Hyperparameter tuning  
The Boston housing dataset is used for illustration.

In [4]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
 
from sklearn import datasets
X, y = datasets.fetch_openml('boston', return_X_y=True)

In [6]:
X.head()

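| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 |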


In [9]:
X.dtypes

CRIM        float64
ZN          float64
INDUS       float64
CHAS       category
NOX         float64
RM          float64
AGE         float64
DIS         float64
RAD        category
TAX         float64
PTRATIO     float64
B           float64
LSTAT       float64
dtype: object

In [10]:
X = X.select_dtypes(include = np.number)
X.head()

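| | CRIM | ZN | INDUS | NOX | RM | AGE | DIS | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.538 | 6.575 | 65.2 | 4.09 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.469 | 6.421 | 78.9 | 4.9671 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.469 | 7.185 | 61.1 | 4.9671 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.458 | 6.998 | 45.8 | 6.0622 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.458 | 7.147 | 54.2 | 6.0622 | 222.0 | 18.7 | 396.9 | 5.33 |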


In [11]:
dmatrix = xgb.DMatrix(data=X, label=y)
params={'objective':'reg:squarederror'}
cv_results = xgb.cv(
    dtrain=dmatrix, 
    params=params, 
    nfold=10, 
    metrics={'rmse'}, 
    as_pandas=True, 
    seed=20
    )

print('RMSE: %.2f' % cv_results['test-rmse-mean'].min())

RMSE: 3.43


Tune the parameters and run the cross-validation again.

In [12]:
params={ 'objective':'reg:squarederror',
         'max_depth': 6, 
         'colsample_bylevel':0.5,
         'learning_rate':0.01,
         'random_state':20}

cv_results = xgb.cv(
    dtrain=dmatrix, 
    params=params, 
    nfold=10, 
    metrics={'rmse'}, 
    as_pandas=True, 
    seed=20, 
    num_boost_round=1000
    )

print('RMSE: %.2f' % cv_results['test-rmse-mean'].min())

RMSE: 2.73


The main parameters are described below. Full references:  
- Learning API: https://xgboost.readthedocs.io/en/latest/parameter.html
- scikit-learn API: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn

Tree-related parameters
- max_depth: The maximum depth of each tree. Deeper trees can improve performance, but they also increase model complexity and the risk of overfitting.
The value must be an integer greater than 0. Default is 6.
- learning_rate: The step size applied at each boosting iteration as the model optimizes toward its objective. A low learning rate makes training slower and requires more rounds to achieve the same reduction in residual error as a high learning rate, but it improves the chances of reaching a better optimum.
The value must be between 0 and 1. Default is 0.3.
- n_estimators: The number of trees in the ensemble, equivalent to the number of boosting rounds.
The value must be an integer greater than 0. Default is 100.
NB: In the native Learning API, this is called num_boost_round.
- colsample_bytree: The fraction of columns randomly sampled for each tree. It can reduce overfitting.
The value must be greater than 0 and at most 1. Default is 1.
- subsample: The fraction of observations sampled for each tree. Lower values help prevent overfitting but may lead to underfitting.
The value must be greater than 0 and at most 1. Default is 1.
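The trade-off between learning_rate and the number of boosting rounds can be illustrated with a deliberately idealized model. The sketch below assumes that every round removes exactly a fraction `learning_rate` of the remaining residual, so after n rounds a fraction (1 - learning_rate)^n is left. Real boosting is messier; `rounds_to_shrink` is a hypothetical helper used only for this back-of-the-envelope calculation.

```python
import math


def rounds_to_shrink(learning_rate: float, target_fraction: float) -> int:
    """Rounds needed until the residual drops to `target_fraction` of its
    starting size, assuming each round removes `learning_rate` of what is left."""
    return math.ceil(math.log(target_fraction) / math.log(1.0 - learning_rate))


# Rounds needed to shrink the residual to 1% of its initial size.
rounds = {lr: rounds_to_shrink(lr, 0.01) for lr in (0.3, 0.1, 0.01)}
for lr, n in rounds.items():
    print(f"learning_rate={lr}: {n} rounds")
```

Under this assumption a learning rate of 0.3 needs 13 rounds, 0.1 needs 44, and 0.01 needs 459, which is why small learning rates are usually paired with a large n_estimators.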

Regularization parameters
- alpha (reg_alpha): L1 regularization on the weights (as in Lasso regression). With a large number of features, it may improve speed. It can be any non-negative number. Default is 0.
- lambda (reg_lambda): L2 regularization on the weights (as in Ridge regression). It may help reduce overfitting. It can be any non-negative number. Default is 1.
- gamma: The minimum loss reduction required to split a node further. It acts as a pseudo-regularization parameter (Lagrangian multiplier), and its effect depends on the other parameters. The higher gamma is, the stronger the regularization. It can be any non-negative number. Default is 0.
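How the three parameters enter tree construction can be sketched with the leaf-weight and split-gain formulas from the XGBoost boosted-trees tutorial, where G and H are the sums of gradients and Hessians in a node. The handling of alpha through soft-thresholding is an assumption about how the L1 term is applied, and all function names are hypothetical; constants in the actual implementation may differ.

```python
def soft_threshold(g: float, alpha: float) -> float:
    """Shrink g toward zero by alpha (assumed form of the L1 penalty)."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0


def leaf_weight(G: float, H: float, reg_lambda: float = 1.0,
                reg_alpha: float = 0.0) -> float:
    """Optimal leaf weight: a larger lambda or alpha pulls it toward zero."""
    return -soft_threshold(G, reg_alpha) / (H + reg_lambda)


def leaf_score(G: float, H: float, reg_lambda: float, reg_alpha: float) -> float:
    """Quality score of a leaf holding gradient sum G and Hessian sum H."""
    return soft_threshold(G, reg_alpha) ** 2 / (H + reg_lambda)


def split_gain(GL: float, HL: float, GR: float, HR: float,
               reg_lambda: float = 1.0, reg_alpha: float = 0.0,
               gamma: float = 0.0) -> float:
    """Gain of splitting a node into left/right children, minus gamma."""
    children = (leaf_score(GL, HL, reg_lambda, reg_alpha)
                + leaf_score(GR, HR, reg_lambda, reg_alpha))
    parent = leaf_score(GL + GR, HL + HR, reg_lambda, reg_alpha)
    return 0.5 * (children - parent) - gamma


# The same candidate split evaluated under increasing gamma.
for gamma in (0.0, 5.0, 10.0):
    print(gamma, round(split_gain(-4.0, 2.0, 6.0, 3.0, gamma=gamma), 4))
```

A split is only worth making when its gain is positive, so in this toy example the split survives gamma = 5 but is rejected at gamma = 10.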

#### Approach 1: Intuition and reasonable value ranges


- max_depth : 3 ~ 10
- n_estimators : 100 ~ 1000
- learning_rate: 0.01 ~ 0.3
- colsample_bytree: 0.5 ~ 1
- subsample: 0.6 ~ 1   


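You can then focus on optimizing max_depth and n_estimators.

Next, experiment with learning_rate: increase it to speed up training as long as performance does not degrade. If training becomes faster without losing performance, you can raise the number of estimators to try to improve performance further.

Finally, work on the regularization parameters, usually starting with alpha and lambda. For gamma, 0 means no regularization, values of 1 to 5 are commonly used, and 10 or more is considered very high.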

#### Approach 2: Optimization algorithms

#### grid search

The WHO life expectancy dataset is used as an example.
- https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who?resource=download&select=Life+Expectancy+Data.csv

In [13]:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV

data = pd.read_csv("../data/life-expec/Life Expectancy Data.csv")
data.head()

| | Country | Year | Status | Life expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | ... | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 1-19 years | thinness 5-9 years | Income composition of resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 2015 | Developing | 65.0 | 263.0 | 62 | 0.01 | 71.279624 | 65.0 | 1154 | ... | 6.0 | 8.16 | 65.0 | 0.1 | 584.25921 | 33736494.0 | 17.2 | 17.3 | 0.479 | 10.1 |
| 1 | Afghanistan | 2014 | Developing | 59.9 | 271.0 | 64 | 0.01 | 73.523582 | 62.0 | 492 | ... | 58.0 | 8.18 | 62.0 | 0.1 | 612.696514 | 327582.0 | 17.5 | 17.5 | 0.476 | 10.0 |
| 2 | Afghanistan | 2013 | Developing | 59.9 | 268.0 | 66 | 0.01 | 73.219243 | 64.0 | 430 | ... | 62.0 | 8.13 | 64.0 | 0.1 | 631.744976 | 31731688.0 | 17.7 | 17.7 | 0.47 | 9.9 |
| 3 | Afghanistan | 2012 | Developing | 59.5 | 272.0 | 69 | 0.01 | 78.184215 | 67.0 | 2787 | ... | 67.0 | 8.52 | 67.0 | 0.1 | 669.959 | 3696958.0 | 17.9 | 18.0 | 0.463 | 9.8 |
| 4 | Afghanistan | 2011 | Developing | 59.2 | 275.0 | 71 | 0.01 | 7.097109 | 68.0 | 3013 | ... | 68.0 | 7.87 | 68.0 | 0.1 | 63.537231 | 2978599.0 | 18.2 | 18.2 | 0.454 | 9.5 |


In [14]:
X, y = data[data.columns.tolist()[:-1]], data[data.columns.tolist()[-1]]

In [16]:
X.dtypes

Country                             object
Year                                 int64
Status                              object
Life expectancy                    float64
Adult Mortality                    float64
infant deaths                        int64
Alcohol                            float64
percentage expenditure             float64
Hepatitis B                        float64
Measles                              int64
 BMI                               float64
under-five deaths                    int64
Polio                              float64
Total expenditure                  float64
Diphtheria                         float64
 HIV/AIDS                          float64
GDP                                float64
Population                         float64
 thinness  1-19 years              float64
 thinness 5-9 years                float64
Income composition of resources    float64
dtype: object

In [20]:
X = X.select_dtypes(include='float')
X.dtypes

Life expectancy                    float64
Adult Mortality                    float64
Alcohol                            float64
percentage expenditure             float64
Hepatitis B                        float64
 BMI                               float64
Polio                              float64
Total expenditure                  float64
Diphtheria                         float64
 HIV/AIDS                          float64
GDP                                float64
Population                         float64
 thinness  1-19 years              float64
 thinness 5-9 years                float64
Income composition of resources    float64
dtype: object

In [31]:
X = X.fillna(X.mean())
y = y.fillna(y.mean())

In [32]:
params = { 'max_depth': [3,6,10],
           'learning_rate': [0.01, 0.05, 0.1],
           'n_estimators': [100, 500, 1000],
           'colsample_bytree': [0.3, 0.7]}
xgbr = xgb.XGBRegressor(seed = 20)
clf = GridSearchCV(estimator=xgbr, 
                   param_grid=params,
                   scoring='neg_mean_squared_error', 
                   verbose=1)
clf.fit(X, y)
print("Best parameters:", clf.best_params_)
print("Lowest RMSE: ", (-clf.best_score_)**(1/2.0))

Fitting 5 folds for each of 54 candidates, totalling 270 fits
Best parameters: {'colsample_bytree': 0.7, 'learning_rate': 0.01, 'max_depth': 6, 'n_estimators': 1000}
Lowest RMSE:  1.6143158345631234
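The "54 candidates, totalling 270 fits" line in the log follows directly from the grid: every combination of values is tried, and each one is fitted once per fold. A quick check with the standard library, assuming GridSearchCV's default of 5 folds (which matches the log):

```python
from itertools import product

# Same grid as in the search above.
params = {
    "max_depth": [3, 6, 10],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 500, 1000],
    "colsample_bytree": [0.3, 0.7],
}

# Every combination of one value per parameter is a candidate.
candidates = list(product(*params.values()))
n_folds = 5  # GridSearchCV default when `cv` is not given
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)  # 54 270
```

Adding one more value to any parameter multiplies the cost, which is the main reason grid search scales poorly as the number of tuned parameters grows.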


- Needs verification.

#### random search

In [33]:
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

params = { 'max_depth': [3, 5, 6, 10, 15, 20],
           'learning_rate': [0.01, 0.1, 0.2, 0.3],
           'subsample': np.arange(0.5, 1.0, 0.1),
           'colsample_bytree': np.arange(0.4, 1.0, 0.1),
           'colsample_bylevel': np.arange(0.4, 1.0, 0.1),
           'n_estimators': [100, 500, 1000]}
xgbr = xgb.XGBRegressor(seed = 20)
clf = RandomizedSearchCV(estimator=xgbr,
                         param_distributions=params,
                         scoring='neg_mean_squared_error',
                         n_iter=25,
                         verbose=1)
clf.fit(X, y)
print("Best parameters:", clf.best_params_)
print("Lowest RMSE: ", (-clf.best_score_)**(1/2.0))

Fitting 5 folds for each of 25 candidates, totalling 125 fits
Best parameters: {'subsample': 0.8999999999999999, 'n_estimators': 1000, 'max_depth': 6, 'learning_rate': 0.01, 'colsample_bytree': 0.7, 'colsample_bylevel': 0.8999999999999999}
Lowest RMSE:  1.6113749618054218
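Two details of this run are worth a closer look. The value `0.8999999999999999` in the best parameters is floating-point drift from `np.arange`, and the 25 sampled candidates cover only a small part of the search space. The sketch below rounds the generated values and counts the full space; rounding to one decimal is a suggested clean-up, not something the original run did.

```python
import numpy as np

# Round the arange output so candidates read as 0.5, 0.6, ... instead of 0.8999...
subsample = np.round(np.arange(0.5, 1.0, 0.1), 1).tolist()
colsample = np.round(np.arange(0.4, 1.0, 0.1), 1).tolist()

params = {
    "max_depth": [3, 5, 6, 10, 15, 20],
    "learning_rate": [0.01, 0.1, 0.2, 0.3],
    "subsample": subsample,
    "colsample_bytree": colsample,
    "colsample_bylevel": colsample,
    "n_estimators": [100, 500, 1000],
}

# Size of the full grid that a grid search would have to evaluate.
space_size = int(np.prod([len(values) for values in params.values()]))
n_iter = 25

print(subsample)
print(colsample)
print(f"{n_iter} of {space_size} combinations sampled")
```

Because the candidates are drawn at random, passing `random_state` to `RandomizedSearchCV` makes the selection repeatable from run to run.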
