# Applied Machine Learning: In-class Exercise 03-2

## Goal

Our goal for this exercise sheet is to understand how to apply and work with XGBoost. The XGBoost algorithm has a large range of hyperparameters. We learn specifically how to tune these hyperparameters to optimize our XGBoost model for the task at hand.

## German Credit Dataset

As in previous exercises, we use the German credit dataset donated by Prof. Dr. Hans Hofmann of the University of Hamburg in 1994. Using XGBoost, we want to classify people as a good or bad credit risk based on 20 personal, demographic and financial features. The dataset is available in the UCI repository as the Statlog (German Credit Data) Data Set.

## Preprocessing

To apply the XGBoost algorithm to the credit dataset, categorical features need to be converted into numeric features, e.g. using one-hot encoding. In the Python solution, we load the dataset using `fetch_openml` from scikit-learn. We then separate the features (`X`) and the target (`y`), and encode the target variable using `LabelEncoder` by mapping 'bad' to 0 and 'good' to 1. Next, we identify the categorical columns by selecting those with the object or category data type and set up a `ColumnTransformer` that applies `OneHotEncoder` to these columns while passing through the numerical features unchanged.


In [1]:
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder


rng = np.random.default_rng(42)

# Load the German credit dataset from OpenML
X, y = fetch_openml("credit-g", version=1, as_frame=True, return_X_y=True)

# Encode the target variable: 'good' as 1 and 'bad' as 0
label_encoder = LabelEncoder()
label_encoder.fit(['bad', 'good'])  # Explicitly define the mapping
y = label_encoder.transform(y)

# Identify categorical columns (assumes object or category dtype)
cat_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()

# Set up a ColumnTransformer to one-hot encode categorical features while passing through numerical features
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
    ],
    remainder='passthrough',
    force_int_remainder_cols=False
)

## 1 XGBoost Learner

### 1.1 Initialize an XGBoost Learner

Initialize an XGBoost classifier (using the scikit-learn API of the `xgboost` package) with 100 boosting rounds. Make sure that you have installed the Python package `xgboost`.

In XGBoost for Python (scikit-learn API), the number of iterations (also known as boosting rounds or trees) is specified via the parameter `n_estimators`. This hyperparameter determines the total number of boosting iterations performed by the model.

There is a trade-off between underfitting (not enough iterations) and overfitting (too many iterations). Therefore, it is always better to tune such a hyperparameter. In this exercise, we choose 100 iterations as we believe it provides a good upper bound. Later, we will introduce early stopping to prevent overfitting.
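
The trade-off can be made visible by tracking the loss after every boosting round. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` and scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost, because its `staged_predict_proba` method yields predictions after each round. The exact numbers depend on the data and the seed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic, noisy binary classification data (not the credit dataset).
X_toy, y_toy = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

# Stand-in gradient boosting model with 100 boosting rounds.
gbm = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.3, max_depth=3, random_state=0
)
gbm.fit(X_tr, y_tr)

# Log loss on the training and validation data after each boosting round.
train_loss = [log_loss(y_tr, proba) for proba in gbm.staged_predict_proba(X_tr)]
val_loss = [log_loss(y_va, proba) for proba in gbm.staged_predict_proba(X_va)]

best_round = int(np.argmin(val_loss)) + 1
print(f"Training loss after 1 / 100 rounds: {train_loss[0]:.3f} / {train_loss[-1]:.3f}")
print(f"Validation loss is lowest after {best_round} rounds: {min(val_loss):.3f}")
print(f"Validation loss after 100 rounds: {val_loss[-1]:.3f}")
```

The training loss keeps falling with more rounds, whereas the validation loss typically reaches its minimum earlier. This is the behavior that early stopping exploits later in this sheet.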


In [2]:
#===SOLUTION===

from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline


xgb_clf = XGBClassifier(n_estimators=100, 
                        random_state=42, 
                        eval_metric='logloss')

pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', xgb_clf)
])

### 1.2 Performance Assessment using Cross-validation

Use 5-fold cross-validation to estimate the generalization error of the XGBoost classifier with 100 boosting iterations on the one-hot-encoded German credit dataset. Measure the learner's performance using classification accuracy. 

Specifically, you need to conduct three steps:

1. Use `cross_val_score()` from scikit-learn with the pipeline defined above.
2. Set the parameter `cv=5` to specify 5-fold cross-validation.
3. Aggregate the performance scores by computing the mean accuracy.


In [3]:
#===SOLUTION===

from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')

# Print the cross-validation results
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV accuracy: {cv_scores.mean():.4f}")


Cross-validation scores: [0.735 0.76  0.735 0.74  0.765]
Mean CV accuracy: 0.7470
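
To see what `cross_val_score()` does internally, the three steps can be written out as an explicit loop. This is a minimal sketch on synthetic data with a decision tree as a placeholder model (both are assumptions, chosen so that the example runs quickly without XGBoost). For a classifier and an integer `cv`, scikit-learn uses stratified folds without shuffling, which the loop reproduces.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a placeholder model.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# Manual 5-fold cross-validation: fit on four folds, score on the held-out fold.
folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in folds.split(X_toy, y_toy):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_model.score(X_toy[test_idx], y_toy[test_idx]))
manual_scores = np.array(manual_scores)

# The same procedure in a single call.
reference_scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")

print(f"Manual loop:     {manual_scores}")
print(f"cross_val_score: {reference_scores}")
print(f"Mean accuracy:   {manual_scores.mean():.4f}")
```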


## 2 Hyperparameters

### 2.1 Overview of Hyperparameters

Apart from the number of iterations (`n_estimators`), the XGBoost classifier has several other hyperparameters, which were left at their default values in the previous exercise. Extract an overview of all hyperparameters and their values from the XGBoost classifier initialized in the previous exercise. Note that in the scikit-learn API a value of `None` means that the parameter was not set explicitly, so XGBoost falls back to its internal default.

In [4]:
#===SOLUTION===

params = xgb_clf.get_params()

print("XGBoost hyperparameters with the configured values:")
for param, value in params.items():
    print(f"{param}: {value}")

XGBoost hyperparameters with the configured values:
objective: binary:logistic
base_score: None
booster: None
callbacks: None
colsample_bylevel: None
colsample_bynode: None
colsample_bytree: None
device: None
early_stopping_rounds: None
enable_categorical: False
eval_metric: logloss
feature_types: None
feature_weights: None
gamma: None
grow_policy: None
importance_type: None
interaction_constraints: None
learning_rate: None
max_bin: None
max_cat_threshold: None
max_cat_to_onehot: None
max_delta_step: None
max_depth: None
max_leaves: None
min_child_weight: None
missing: nan
monotone_constraints: None
multi_strategy: None
n_estimators: 100
n_jobs: None
num_parallel_tree: None
random_state: 42
reg_alpha: None
reg_lambda: None
sampling_method: None
scale_pos_weight: None
subsample: None
tree_method: None
validate_parameters: None
verbosity: None


### Questions and Answers
1. Does the learner rely on a tree or a linear booster by default?

===SOLUTION===

The default booster in XGBoost is "gbtree", meaning it uses a tree-based booster.

2. Do more hyperparameters exist for the tree or the linear booster?

===SOLUTION===

The tree booster (`gbtree`) has a richer set of hyperparameters (such as `max_depth`, `min_child_weight`, `subsample`, `colsample_bytree`, etc.) than the linear booster, which has only a few (mainly `lambda` and `alpha` for regularization).


3. What do `max_depth`, `eta`, `n_estimators` mean and what are their default values?

===SOLUTION===

- `max_depth`: Controls the maximum depth of the trees. Its default value is 6.
- `eta`: Represents the learning rate, which determines the contribution of each tree to the overall model. The default value is 0.3.
- `n_estimators`: The number of boosting rounds, i.e. the number of trees that are fitted sequentially (analogous to `nrounds` in R). In Python’s scikit-learn API for XGBoost the default is 100, which is also the value we set explicitly.


4. Does a larger value for `eta` imply a larger value for `n_estimators`?

===SOLUTION===

No, a larger `eta` means that each tree has a larger impact on the overall model. Consequently, if `eta` is increased, fewer boosting rounds (lower `n_estimators`) are typically needed to avoid overfitting, not more.
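
The inverse relationship between `eta` and the number of rounds can be illustrated with a deliberately idealized calculation. Assume, purely for illustration, that every boosting round fits the current residual perfectly and adds the fit scaled by `eta`; the remaining residual then shrinks by the factor `1 - eta` per round. Real trees are far from perfect fits, so the absolute numbers below are not predictions for XGBoost; the function name and the tolerance are hypothetical.

```python
def rounds_until_converged(eta: float, tolerance: float = 0.01) -> int:
    """Count rounds until the idealized residual drops to `tolerance` or below."""
    residual = 1.0  # start with the full (normalized) residual
    rounds = 0
    while residual > tolerance:
        residual *= 1.0 - eta  # each round removes a fraction eta of the residual
        rounds += 1
    return rounds


for eta in (0.1, 0.3, 0.5):
    print(f"eta={eta}: {rounds_until_converged(eta)} rounds")
```

In this idealized setting, tripling `eta` from 0.1 to 0.3 cuts the number of required rounds to less than a third, which matches the direction of the answer above.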

### 2.2 Tune Hyperparameters

Tune the tree depth (`max_depth`) and the learning rate (`learning_rate`, alias for `eta`) of the XGBoost classifier on the German credit dataset using random search:

- Search space:
    - `max_depth`: integer values between 1 and 8.
    - `learning_rate`: continuous values between 0.2 and 0.4. Note: in practice the learning rate is usually much smaller. The large learning rates here serve a purely educational purpose, so that you can train the models faster on your laptop.
- Termination criterion: Perform 20 evaluations.
- Performance measure: Classification error (equivalent to 1 - accuracy). Note: This dataset is imbalanced, so accuracy is not an ideal evaluation metric. For now, we'll use (1 - accuracy), but you'll later learn how to apply more advanced metrics that are better suited for imbalanced classification tasks.
- Resampling strategy: 3-fold cross-validation.

<details><summary>Hint 1:</summary>
Use `scipy.stats.uniform` to instantiate a uniform distribution, and use it for the search space of `learning_rate`.
</details>
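
A common stumbling block is that `scipy.stats.uniform(loc, scale)` is parameterized by a start point and a width, not by a lower and an upper bound, so the interval is `[loc, loc + scale]`. The sketch below checks this for the learning rate range and uses scikit-learn's `ParameterSampler` to draw 20 configurations from the search space. The dictionary keys are unprefixed here because no pipeline is involved.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# loc=0.2 and scale=0.2 give the interval [0.2, 0.2 + 0.2] = [0.2, 0.4].
lr_dist = uniform(loc=0.2, scale=0.2)

toy_space = {
    "max_depth": list(range(1, 9)),  # integers 1, ..., 8
    "learning_rate": lr_dist,
}

# Draw 20 random configurations from the search space.
sampled = list(ParameterSampler(toy_space, n_iter=20, random_state=42))

learning_rates = [cfg["learning_rate"] for cfg in sampled]
depths = [cfg["max_depth"] for cfg in sampled]
print(f"Number of configurations: {len(sampled)}")
print(f"Learning rates range from {min(learning_rates):.3f} to {max(learning_rates):.3f}")
print(f"Sampled depths: {sorted(set(depths))}")
```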

In [5]:
#===SOLUTION===

from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import make_scorer, accuracy_score

# Define a custom scoring function: classification error = 1 - accuracy.
# Setting greater_is_better=False so that lower error is better.
def classification_error(y_true, y_pred):
    return 1 - accuracy_score(y_true, y_pred)

error_scorer = make_scorer(classification_error, greater_is_better=False)

# Define the search space.
# Note: Since our pipeline names the XGBoost classifier step "classifier",
# we specify the hyperparameters with the prefix "classifier__".
param_distributions = {
    "classifier__max_depth": list(range(1, 9)),
    "classifier__learning_rate": uniform(0.2, 0.2) 
}

# Set up RandomizedSearchCV with 20 random evaluations and 3-fold cross-validation.
random_search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=20,
    scoring=error_scorer,
    cv=3,
    random_state=42,
    verbose=1,
    n_jobs=-1
)

# Execute the hyperparameter tuning on the full dataset (X, y).
random_search.fit(X, y)


Fitting 3 folds for each of 20 candidates, totalling 60 fits


### 2.3 Inspect the Best Performing Setup
Which tree depth was the best performing one?

In [6]:
#===SOLUTION===

# Display the best hyperparameters and corresponding classification error.
print(f"Best hyperparameters: {random_search.best_params_}")
# Note: The best_score_ is negative classification error (because greater_is_better=False)
print(f"Best classification error: {-random_search.best_score_:.4f}")

Best hyperparameters: {'classifier__learning_rate': 0.32367720186661747, 'classifier__max_depth': 4}
Best classification error: 0.2350
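
The sign flip caused by `greater_is_better=False` can be checked on a small example. The sketch below uses made-up labels with a 70/30 class ratio, mirroring the 700 good and 300 bad cases of the German credit data, and a `DummyClassifier` that always predicts the majority class. The name `toy_error_scorer` is hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, make_scorer


def classification_error(y_true, y_pred):
    return 1 - accuracy_score(y_true, y_pred)


toy_error_scorer = make_scorer(classification_error, greater_is_better=False)

# Made-up data: 7 positive and 3 negative labels, features carry no information.
X_toy = np.zeros((10, 1))
y_toy = np.array([1] * 7 + [0] * 3)

# Baseline that always predicts the most frequent class.
majority = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)

raw_error = classification_error(y_toy, majority.predict(X_toy))
scorer_value = toy_error_scorer(majority, X_toy, y_toy)

print(f"Classification error: {raw_error:.2f}")
print(f"Value returned by the scorer: {scorer_value:.2f}")
```

The scorer returns the negated error so that scikit-learn can always maximize the score. The example also shows why accuracy is of limited use on imbalanced data: a model that ignores the features already reaches an error of about 0.30, so the tuned error of 0.2350 should be judged against that baseline.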


## 3 Early Stopping

### 3.1 Set up an XGBoost Learner with Early Stopping

Now that we've derived the best hyperparameters for `max_depth` and `learning_rate`, we can train our final model. To avoid overfitting, we perform early stopping. Early stopping halts training when the performance on a validation dataset stops improving for a specified number of iterations.

Set up an XGBoost classifier with the following hyperparameters:

- `max_depth` and `learning_rate` set according to the best values identified from the tuning step.
- `n_estimators` set to 100.
- Early stopping set to 5 rounds (this parameter could also be tuned, but we simplify the exercise here).

Split `X` and `y` into a training set and a test set, with `test_size=0.1`.

<details><summary>Hint 1:</summary>
Since our original pipeline contains a preprocessor, we need to ensure that both the training and validation sets are preprocessed in the same way. Here we fit the preprocessor on the training subset and transform both sets.
</details>


In [7]:
#===SOLUTION===

from sklearn.model_selection import train_test_split


best_max_depth = random_search.best_params_['classifier__max_depth']
best_learning_rate = random_search.best_params_['classifier__learning_rate']

# Partition the data (X, y) into training (90%) and early stopping (validation) sets (10%)
X_train_es, X_val_es, y_train_es, y_val_es = train_test_split(
    X, y, test_size=0.1, random_state=42
)

# Initialize a new XGBoost classifier with early stopping.
# Note: n_estimators corresponds to nrounds in R.
xgb_es = XGBClassifier(
    n_estimators=100,
    max_depth=best_max_depth,
    learning_rate=best_learning_rate,
    early_stopping_rounds=5,
    random_state=2001,
    eval_metric='logloss'
)

# Since our original pipeline contains a preprocessor, we need to ensure that both
# the training and validation sets are preprocessed in the same way.
# Here we fit the preprocessor on the training subset and transform both sets.
preprocessor.fit(X_train_es)
X_train_es_trans = preprocessor.transform(X_train_es)
X_val_es_trans = preprocessor.transform(X_val_es)


### 3.2 Training on Credit Data

Train the XGBoost classifier with early stopping on the training set obtained in the previous step. How many iterations were conducted before the boosting algorithm stopped?

In [8]:
#===SOLUTION===

xgb_es.fit(
    X_train_es_trans,
    y_train_es,
    eval_set=[(X_val_es_trans, y_val_es)],
    verbose=True
)

# Evaluate the performance on the validation set.
y_val_pred = xgb_es.predict(X_val_es_trans)
val_accuracy = accuracy_score(y_val_es, y_val_pred)
print(f"Classification error with early stopping: {(1 - val_accuracy):.2f}")

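# Store the best iteration found by early stopping.
# Note: best_iteration is the 0-based index of the best boosting round,
# so the model uses best_iteration + 1 trees for prediction.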
best_iteration = xgb_es.best_iteration
print(f"Number of trees used with early stopping: {best_iteration}")

[0]	validation_0-logloss:0.55338
[1]	validation_0-logloss:0.51626
[2]	validation_0-logloss:0.50632
[3]	validation_0-logloss:0.48498
[4]	validation_0-logloss:0.46744
[5]	validation_0-logloss:0.47045
[6]	validation_0-logloss:0.46659
[7]	validation_0-logloss:0.46170
[8]	validation_0-logloss:0.44685
[9]	validation_0-logloss:0.44670
[10]	validation_0-logloss:0.45028
[11]	validation_0-logloss:0.44845
[12]	validation_0-logloss:0.44415
[13]	validation_0-logloss:0.44018
[14]	validation_0-logloss:0.43668
[15]	validation_0-logloss:0.43682
[16]	validation_0-logloss:0.43732
[17]	validation_0-logloss:0.43450
[18]	validation_0-logloss:0.43009
[19]	validation_0-logloss:0.43432
[20]	validation_0-logloss:0.43173
[21]	validation_0-logloss:0.43376
[22]	validation_0-logloss:0.42858
[23]	validation_0-logloss:0.42894
[24]	validation_0-logloss:0.43310
[25]	validation_0-logloss:0.43090
[26]	validation_0-logloss:0.42902
[27]	validation_0-logloss:0.43151
Classification error with early stopping: 0.21
Number of t
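
The stopping rule can be replayed by hand on the validation log loss values printed above. The following sketch implements a simplified patience rule (stop once the loss has not improved for `patience` consecutive rounds); the function name is hypothetical and the rule is a plain-Python approximation, not XGBoost's internal implementation.

```python
# Validation log loss per boosting round, copied from the training log above.
val_logloss = [
    0.55338, 0.51626, 0.50632, 0.48498, 0.46744, 0.47045, 0.46659,
    0.46170, 0.44685, 0.44670, 0.45028, 0.44845, 0.44415, 0.44018,
    0.43668, 0.43682, 0.43732, 0.43450, 0.43009, 0.43432, 0.43173,
    0.43376, 0.42858, 0.42894, 0.43310, 0.43090, 0.42902, 0.43151,
]


def replay_early_stopping(losses, patience):
    """Return (index of best round, index of the round at which training stops)."""
    best_idx = 0
    for idx, loss in enumerate(losses):
        if loss < losses[best_idx]:
            best_idx = idx  # new best validation loss
        elif idx - best_idx >= patience:
            return best_idx, idx  # no improvement for `patience` rounds
    return best_idx, len(losses) - 1  # never triggered: all rounds were used


best_idx, stop_idx = replay_early_stopping(val_logloss, patience=5)
print(f"Best round (0-based): {best_idx}, loss {val_logloss[best_idx]:.5f}")
print(f"Training stopped after round (0-based): {stop_idx}")
```

Under this rule, the lowest loss occurs at round 22 and training stops at round 27, five rounds later, which is consistent with the log ending at `[27]`.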

## 4 Extra: Nested Resampling

To obtain an unbiased performance estimate while tuning hyperparameters and applying early stopping, conduct nested resampling with:

- **3-fold cross-validation** for both the outer and inner resampling loops.
- **Search space**:
    - `max_depth` between 1 and 8.
    - `learning_rate` (eta) between 0.2 and 0.4.
- **Random search** with 20 evaluations.
- **Performance measure**: classification error (`1 - accuracy`).

Extract the performance estimate on the outer resampling folds.

<details><summary>Hint 1:</summary>
    Note: Because early stopping in XGBoost requires a separate validation set, 
    we wrap XGBClassifier in a custom estimator that automatically partitions the training data for early stopping. This estimator should inherit from `sklearn.base.BaseEstimator` and `ClassifierMixin` and implement the required interface. Furthermore, in `def fit(self, X, y)`, this class should partition `X` and `y` (the training data) into a "local" training subset and a "local" validation subset. These "local" training and validation sets are then passed to `XGBClassifier.fit(X_train_local, y_train_local, eval_set=[(X_val_local, y_val_local)])` to perform the training with early stopping.
</details>


In [9]:
#===SOLUTION===

from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import StratifiedKFold, train_test_split


# Custom wrapper to enable early stopping within a pipeline.
class XGBClassifierWithEarlyStopping(BaseEstimator, ClassifierMixin):
    """
    Note: Because early stopping in XGBoost requires a separate validation set,
    we wrap XGBClassifier in a custom estimator that automatically partitions the training data for early stopping.

    The drawback is that the training data is split into two parts,
    which means that the model is trained on less data. If you know a better way to reuse the inner CV validation fold
    as the early stopping validation set, please submit a PR.
    """
    def __init__(self, n_estimators: int =100, max_depth: int = 6, learning_rate: float = 0.3,
                 early_stopping_rounds: int = 5, test_size: float = 0.1, random_state: int = None, **kwargs):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.learning_rate = learning_rate
        self.early_stopping_rounds = early_stopping_rounds
        self.test_size = test_size
        self.random_state = random_state
        self.kwargs = kwargs
        self.model_: XGBClassifier = None

    def fit(self, X, y):
        # Partition data for early stopping: inner training and validation splits.
        X_train_local, X_val_local, y_train_local, y_val_local = train_test_split(
            X, y, test_size=self.test_size, random_state=self.random_state)
        self.model_ = XGBClassifier(
            n_estimators=self.n_estimators,
            max_depth=self.max_depth,
            learning_rate=self.learning_rate,
            early_stopping_rounds=self.early_stopping_rounds,
            random_state=self.random_state,
            eval_metric='logloss',
            **self.kwargs
        )
        self.model_.fit(
            X_train_local, y_train_local,
            eval_set=[(X_val_local, y_val_local)],
            verbose=False
        )
        return self

    def predict(self, X):
        return self.model_.predict(X)

    def predict_proba(self, X):
        return self.model_.predict_proba(X)

# Build a pipeline that applies preprocessing then the custom XGBoost classifier.
# We assume the preprocessor (e.g., a ColumnTransformer for one-hot encoding) is defined earlier.
pipeline_es = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifierWithEarlyStopping(
        n_estimators=100,
        max_depth=6, 
        learning_rate=0.3,   
        # Note: the patience is set to best_iteration from Section 3.2,
        # not to the 5 rounds used there.
        early_stopping_rounds=best_iteration,
        test_size=0.1,
        random_state=42
    ))
])

# Define the custom scoring: classification error = 1 - accuracy.
def classification_error(y_true, y_pred):
    return 1 - accuracy_score(y_true, y_pred)

error_scorer = make_scorer(classification_error, greater_is_better=False)

# Define the search space for hyperparameter tuning.
param_distributions_nested = {
    "classifier__max_depth": list(range(1, 9)),
    "classifier__learning_rate": uniform(0.2, 0.2)      
}

# Inner resampling: 3-fold CV using RandomizedSearchCV with 20 evaluations.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
random_search_nested = RandomizedSearchCV(
    pipeline_es,
    param_distributions=param_distributions_nested,
    n_iter=20,   
    scoring=error_scorer,
    cv=inner_cv,
    verbose=1,
    n_jobs=-1,
    random_state=42,
)

# Outer resampling: perform 3-fold CV over the entire dataset (X, y).
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
nested_scores = cross_val_score(
    random_search_nested,
    X,
    y,
    cv=outer_cv,
    scoring=error_scorer,
    n_jobs=-1,
)

# Since error_scorer was defined with greater_is_better=False, scores are negative.
mean_classification_error = -np.mean(nested_scores)
print(f"Nested CV Classification Error: {mean_classification_error:.4f}")


Fitting 3 folds for each of 20 candidates, totalling 60 fits
Fitting 3 folds for each of 20 candidates, totalling 60 fits
Fitting 3 folds for each of 20 candidates, totalling 60 fits
Nested CV Classification Error: 0.2580


Question: How does the classification error compare to the previous experiment without nested resampling?

===SOLUTION===

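We obtain a higher classification error (0.2580) than without nested resampling (0.2350). The earlier estimate is optimistically biased, because the same cross-validation folds were used both to select the hyperparameters and to report the error; nested resampling evaluates the tuned model on outer folds that played no role in the tuning.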
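
The structure of nested resampling, stripped of the XGBoost-specific parts, can be sketched in a few lines. The example below is an illustration under stated assumptions: it uses synthetic data, a decision tree as a placeholder learner, and `GridSearchCV` over `max_depth` instead of random search, so that it runs quickly. Which of the two estimates is larger depends on the data and the seed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (not the credit dataset).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, flip_y=0.2, random_state=0
)

toy_inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
toy_outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Inner loop: tune max_depth with 3-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 9))},
    cv=toy_inner_cv,
    scoring="accuracy",
)

# Non-nested estimate: the score of the winning configuration on the tuning folds.
search.fit(X_toy, y_toy)
non_nested_error = 1 - search.best_score_

# Nested estimate: the whole tuning procedure is repeated inside each outer fold
# and evaluated on data that was not used for tuning.
outer_scores = cross_val_score(search, X_toy, y_toy, cv=toy_outer_cv, scoring="accuracy")
nested_error = 1 - outer_scores.mean()

print(f"Non-nested classification error: {non_nested_error:.4f}")
print(f"Nested classification error:     {nested_error:.4f}")
```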

## Summary

In this exercise sheet, we learned how to apply an XGBoost learner to the credit dataset. We estimated its performance using resampling. XGBoost has many hyperparameters, and we only took a closer look at two of them. We also saw how to use early stopping, which should help to avoid overfitting of the XGBoost model.

Interestingly, among the unbiased estimates, we obtained the best result with 100 iterations and no tuning or early stopping (classification error 0.2530 in 5-fold cross-validation, versus 0.2580 under nested resampling). However, the performance differences were quite small; with a different seed, we might see a different ranking. Furthermore, we could extend the tuning search space to include more hyperparameters and thereby improve the learner's performance on the task at hand. Of course, this also requires a larger tuning budget (e.g., more random search evaluations).