# XGBoost
https://xgboost.readthedocs.io/en/stable/index.html

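**Key parameters**
1. **learning_rate**: Shrinks the contribution of each tree. Smaller values make every boosting step more conservative and lower the risk of overfitting, but more trees (a larger `n_estimators`) are needed to reach the same fit.
2. **n_estimators**: The number of trees (boosting rounds). More trees yield a more complex model.
3. **max_depth**: The maximum depth of each tree. Trees that are too deep can overfit.
4. **objective**: The loss function to optimize, for example 'reg:squarederror' for regression and 'binary:logistic' for binary classification.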
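
The trade-off between `learning_rate` and `n_estimators` can be sketched with a toy booster written in plain Python. This is only an illustration under simplifying assumptions: it is ordinary gradient boosting for squared error with depth-1 trees, not XGBoost's regularized second-order method, and the helper names (`fit_stump`, `boost`, `rounds_to_reach`) and the toy data are made up for this example.

```python
# Toy gradient boosting for squared error using depth-1 trees (stumps).
# Illustrative only: XGBoost also uses second-order gradients and regularization.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) with the lowest squared error."""
    best = None
    unique_x = sorted(set(xs))
    for lo, hi in zip(unique_x, unique_x[1:]):
        thr = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x < thr]
        right = [r for x, r in zip(xs, residuals) if x >= thr]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lv, rv)
    return best[1:]


def boost(xs, ys, n_estimators, learning_rate):
    """Return the training MSE after each boosting round."""
    base = sum(ys) / len(ys)          # start from the mean of the targets
    preds = [base] * len(ys)
    history = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, lv, rv = fit_stump(xs, residuals)
        # Each tree's output is scaled by the learning rate before being added.
        preds = [p + learning_rate * (lv if x < thr else rv)
                 for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return history


def rounds_to_reach(history, target_mse):
    """Return the first 1-based round whose MSE is at or below target_mse."""
    for i, mse in enumerate(history, start=1):
        if mse <= target_mse:
            return i
    return None


xs = list(range(10))
ys = [0.0] * 5 + [10.0] * 5           # a single step at x = 4.5

fast = boost(xs, ys, n_estimators=100, learning_rate=0.5)
slow = boost(xs, ys, n_estimators=100, learning_rate=0.1)

print(rounds_to_reach(fast, 0.01), rounds_to_reach(slow, 0.01))
```

On this toy data the run with `learning_rate=0.5` reaches the target error in 6 rounds, while `learning_rate=0.1` needs 38, which is why a small learning rate is usually paired with a larger `n_estimators`.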

In [None]:
# !pip install xgboost



In [2]:
# !pip install --upgrade scikit-learn



In [3]:
# !pip install --upgrade ipython numpy scipy


In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from xgboost import XGBClassifier    # xgboost provides a scikit-learn-style API

iris_data = load_iris()
X_train, X_test, y_train, y_test = \
    train_test_split(iris_data.data, iris_data.target, random_state=0)

xgb_clf = XGBClassifier(
    n_estimators=100,
    max_depth=3,
    learning_rate=0.1,
    random_state=0
)
xgb_clf.fit(X_train, y_train)

y_pred_train = xgb_clf.predict(X_train)
y_pred_test = xgb_clf.predict(X_test)

print(accuracy_score(y_train, y_pred_train))
print(accuracy_score(y_test, y_pred_test))

print(classification_report(y_test, y_pred_test))
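
The per-class numbers printed by `classification_report` can be reproduced by hand. The following is a minimal pure-Python sketch of how precision, recall, F1, and support are defined for each class; `per_class_report` and the toy labels are hypothetical and exist only for this illustration, and the sketch does not reproduce the averaged rows of the real report.

```python
def per_class_report(y_true, y_pred):
    """Compute precision, recall, F1, and support for every class label."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,   # number of true samples of this class
        }
    return report


# Toy labels for three classes (not taken from the iris run above).
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

toy_report = per_class_report(y_true_toy, y_pred_toy)
for label, row in toy_report.items():
    print(label, row)
```

Precision answers "of the samples predicted as this class, how many were right", while recall answers "of the samples that truly belong to this class, how many were found".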

In [None]:
# Binary classification on the breast cancer dataset
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

xgb_clf = XGBClassifier(random_state=0)
xgb_clf.fit(X_train, y_train)

y_pred_train = xgb_clf.predict(X_train)
y_pred_test = xgb_clf.predict(X_test)

print(classification_report(y_test, y_pred_test))

In [None]:
# Apply early stopping to XGBClassifier -> helps prevent overfitting and shortens training time
xgb_clf = XGBClassifier(
    n_estimators=500,
    learning_rate=0.1,
    max_depth=3,
    random_state=0,
    early_stopping_rounds=10,    # stop if the validation metric has not improved for this many consecutive rounds
    eval_metric='logloss'
)

X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, random_state=0)
eval_set = [(X_tr, y_tr), (X_val, y_val)]
print(type(X_tr), type(y_tr), type(X_val), type(y_val))
xgb_clf.fit(X_tr, y_tr, eval_set=eval_set, verbose=True)
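# eval_set: datasets evaluated after every round (the last entry is the one used for early stopping)
# verbose: whether to print the evaluation results during training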
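
The stopping rule itself is simple enough to sketch in plain Python. The function below is a simplified stand-in, not XGBoost's implementation: `early_stopping_round`, `patience`, and the toy loss values are assumptions made for this illustration. It tracks the best validation loss seen so far and stops once `patience` consecutive rounds have passed without improvement.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_iteration, stopped_at) as 0-based round indices.

    Training stops at the first round where the loss has failed to improve
    on the best value for `patience` consecutive rounds.
    """
    best_loss = float("inf")
    best_iter = -1
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_iter = loss, i
        elif i - best_iter >= patience:
            return best_iter, i
    # Never triggered: training ran through every round.
    return best_iter, len(val_losses) - 1


# Toy validation losses (made up for this example).
toy_losses = [0.60, 0.50, 0.45, 0.44, 0.46, 0.47, 0.45, 0.48]
best, stopped = early_stopping_round(toy_losses, patience=3)
print(best, stopped)
```

Here the best loss occurs at round 3, and rounds 4, 5, and 6 fail to beat it, so training stops at round 6 and the last round of the list is never run.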

In [None]:
# Visualize the training process
import matplotlib.pyplot as plt

result = xgb_clf.evals_result()
train_loss = result['validation_0']['logloss']
val_loss = result['validation_1']['logloss']

plt.plot(train_loss, label='train')
plt.plot(val_loss, label='validation')
plt.legend()
plt.xlabel('boosting round')
plt.ylabel('logloss')
plt.show()
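
The `logloss` values on the y-axis are binary cross-entropy averaged over the samples. As a rough sketch of what the metric measures, here is a pure-Python version; `binary_logloss` and its `eps` clipping value are assumptions for this illustration and are not taken from XGBoost's source.

```python
import math


def binary_logloss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; p_pred holds predicted P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Confident correct predictions give a small loss,
# uninformative 0.5 predictions give ln(2), about 0.693.
confident = binary_logloss([1, 0, 1], [0.9, 0.1, 0.8])
uncertain = binary_logloss([1, 0, 1], [0.5, 0.5, 0.5])
print(confident, uncertain)
```

A curve that keeps falling on the training set while the validation curve flattens or rises is the usual sign that extra rounds are only fitting noise.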

In [None]:
xgb_clf.score(X_train, y_train), xgb_clf.score(X_test, y_test)

In [None]:
# Visualize feature importance
from xgboost import plot_importance

fig, ax = plt.subplots(figsize=(10, 12))
plot_importance(xgb_clf, ax=ax)
plt.show()
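
By default `plot_importance` uses `importance_type='weight'`, which counts how many times each feature is used to split across all trees. Because the model above was trained on NumPy arrays, the features appear as `f0`, `f1`, and so on. The counting idea can be sketched in plain Python; the nested-dict tree format and `count_splits` below are hypothetical and are not XGBoost's internal representation.

```python
from collections import Counter


def count_splits(node, counts):
    """Recursively count how often each feature is used as a split."""
    if "feature" not in node:        # leaf node: nothing to count
        return
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)


# Two toy trees (made up for this example).
toy_trees = [
    {
        "feature": "f0",
        "left": {"leaf": 0.1},
        "right": {
            "feature": "f2",
            "left": {"leaf": -0.2},
            "right": {"leaf": 0.3},
        },
    },
    {"feature": "f2", "left": {"leaf": 0.05}, "right": {"leaf": -0.1}},
]

weights = Counter()
for tree in toy_trees:
    count_splits(tree, weights)

print(weights.most_common())
```

A split count says how often a feature is used, not how much each split improves the loss, so rankings can differ under `importance_type='gain'`.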