# Finetuning XGBoost

In [10]:
# untuned model

import numpy as np
import pandas as pd
import xgboost as xgb

housing_data = pd.read_csv('ames_housing_trimmed_processed.csv')
X, y = housing_data[housing_data.columns.tolist()[:-1]], housing_data[housing_data.columns.tolist()[-1]]
housing_dmatrix = xgb.DMatrix(data=X, label=y)
untuned_params = {"objective":"reg:squarederror"}
untuned_cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=untuned_params, nfold=4,
                                 num_boost_round=200,
                                 metrics="rmse", as_pandas=True, seed=123)
# Take the last row as a scalar; formatting a one-element Series with %f is deprecated in recent pandas
print("Untuned rmse: %f" % untuned_cv_results_rmse["test-rmse-mean"].iloc[-1])

Untuned rmse: 33288.916054


In [9]:
# tuned model
tuned_params = {"objective":"reg:squarederror", "colsample_bytree":0.3, "learning_rate":0.1, "max_depth":5}
tuned_cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=tuned_params, nfold=4,
                               num_boost_round=200,
                               metrics="rmse", as_pandas=True, seed=123)
# Take the last row as a scalar; formatting a one-element Series with %f is deprecated in recent pandas
print("Tuned rmse: %f" % tuned_cv_results_rmse["test-rmse-mean"].iloc[-1])

Tuned rmse: 29965.411196


## tuning number of boosting rounds

In [3]:
# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree: params 
params = {"objective":"reg:squarederror", "max_depth":3}

# Create list of number of boosting rounds
num_rounds = [5, 10, 15]

# Empty list to store final round rmse per XGBoost model
final_rmse_per_round = []

# Iterate over num_rounds and build one model per num_boost_round parameter
for curr_num_rounds in num_rounds:

    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3, num_boost_round=curr_num_rounds, metrics="rmse", as_pandas=True, seed=123)
    
    # Append final round RMSE
    final_rmse_per_round.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
num_rounds_rmses = list(zip(num_rounds, final_rmse_per_round))
print(pd.DataFrame(num_rounds_rmses,columns=["num_boosting_rounds","rmse"]))

   num_boosting_rounds          rmse
0                    5  50903.299752
1                   10  34774.194090
2                   15  32895.099185


## early stopping


Early stopping in `xgb.cv` is a technique used to prevent overfitting by halting the training process when the model's performance on a validation set stops improving. It monitors a specified evaluation metric and stops training if the metric does not improve for a given number of consecutive boosting rounds.

### How Early Stopping Works in `xgb.cv`
1. **Parameters**:
   - `early_stopping_rounds`: The number of consecutive rounds without improvement after which training will be stopped.
   - `metrics`: The evaluation metric to monitor (e.g., 'rmse', 'logloss').
   - `as_pandas`: Whether to return the results as a pandas DataFrame.

2. **Process**:
   - During cross-validation, the metric is computed on the held-out fold of each split at every boosting round and averaged across folds.
   - If the averaged metric does not improve for `early_stopping_rounds` consecutive rounds, training is stopped.
   - `xgb.cv` returns the evaluation history rather than a fitted model. With early stopping, the last entry in that history represents the best iteration, so its length gives the chosen number of boosting rounds.
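
The stopping rule described above can be sketched in plain Python. This is a simplified illustration of the logic for a lower-is-better metric, not XGBoost's implementation; the function name `best_round_with_early_stopping` and the sample RMSE values are made up for the example.

```python
def best_round_with_early_stopping(scores, patience):
    """Return (best_round, stopped_round) for a metric where lower is better.

    `scores` holds one validation score per boosting round, and `patience`
    plays the role of early_stopping_rounds.
    """
    best_round = 0
    best_score = float("inf")
    for round_idx, score in enumerate(scores):
        if score < best_score:
            # The metric improved: remember this round as the best so far.
            best_score = score
            best_round = round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here.
            return best_round, round_idx
    # Ran through every round without triggering early stopping.
    return best_round, len(scores) - 1


# Hypothetical per-round validation RMSE values.
rmse_per_round = [1.47, 1.15, 0.95, 0.82, 0.83, 0.84, 0.85, 0.80]
best_round, stopped_round = best_round_with_early_stopping(rmse_per_round, patience=3)
print(best_round, stopped_round)
```

In this made-up history the metric last improves at round 3, so with a patience of 3 the loop stops at round 6 and never sees the later improvement at round 7. A larger patience trades extra training time for a lower chance of stopping too soon.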

### Example Code


In [11]:
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Load the dataset
california = fetch_california_housing()
X, y = california.data, california.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix for training
dtrain = xgb.DMatrix(X_train, label=y_train)

# Set parameters for the model
params = {
    'objective': 'reg:squarederror',
    'max_depth': 4,
    'eval_metric': 'rmse'
}

# Perform cross-validation with early stopping
cv_results = xgb.cv(
    params=params,
    dtrain=dtrain,
    num_boost_round=1000,
    nfold=4,
    metrics='rmse',
    early_stopping_rounds=10,
    as_pandas=True,
    seed=42
)

# Display the results
print(cv_results)

     train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
0           1.464993        0.004004        1.467093       0.014044
1           1.145017        0.003081        1.150520       0.013757
2           0.938464        0.002520        0.948589       0.013294
3           0.805468        0.001678        0.819693       0.012398
4           0.718616        0.003191        0.734411       0.008979
..               ...             ...             ...            ...
212         0.316960        0.001958        0.475336       0.007646
213         0.316600        0.002033        0.475354       0.007728
214         0.316141        0.001964        0.475322       0.007779
215         0.315654        0.001907        0.475273       0.007831
216         0.315245        0.002167        0.475257       0.007612

[217 rows x 4 columns]




### Explanation
- **Parameters**: Define the model parameters, including `objective`, `max_depth`, and `eval_metric`.
- **Cross-Validation**: Perform cross-validation using `xgb.cv` with `early_stopping_rounds` set to 10, so training stops once the mean test RMSE across the 4 folds has not improved for 10 consecutive rounds.
- **Results**: The results are returned as a pandas DataFrame with one row of evaluation metrics per boosting round. Although `num_boost_round` was 1000, the history ends after 217 rounds because early stopping was triggered.

This example demonstrates how to use early stopping in `xgb.cv` to prevent overfitting and find the optimal number of boosting rounds.

## tunable parameters

**tree-related parameters**:

* learning_rate
* gamma 
* lambda 
* alpha 
* max_depth
* subsample
* colsample_bytree

### Tunable Parameters for XGBoost Tree Base Learner

1. **learning_rate (eta)**
   - **Description**: Controls the step size at each boosting iteration. Lower values make the model more robust to overfitting but require more boosting rounds.
   - **Typical Range**: 0.01 to 0.3
   - **Default**: 0.3

2. **gamma (min_split_loss)**
   - **Description**: Minimum loss reduction required to make a further partition on a leaf node of the tree. Higher values make the algorithm more conservative.
   - **Typical Range**: 0 to 5
   - **Default**: 0

3. **lambda (reg_lambda)**
   - **Description**: L2 regularization term on weights. It helps prevent overfitting by shrinking the weights.
   - **Typical Range**: 0 to 10
   - **Default**: 1

4. **alpha (reg_alpha)**
   - **Description**: L1 regularization term on weights. It can help with feature selection by shrinking some feature weights to zero.
   - **Typical Range**: 0 to 10
   - **Default**: 0

5. **max_depth**
   - **Description**: Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit.
   - **Typical Range**: 3 to 10
   - **Default**: 6

6. **subsample**
   - **Description**: Fraction of training samples used to grow each tree. Lower values can reduce overfitting but may increase bias.
   - **Typical Range**: 0.5 to 1
   - **Default**: 1

7. **colsample_bytree**
   - **Description**: Fraction of features used to grow each tree. Lower values can reduce overfitting but may increase bias.
   - **Typical Range**: 0.5 to 1
   - **Default**: 1
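
To make the roles of `gamma` and `lambda` more concrete, the sketch below evaluates the split gain formula given in the XGBoost documentation's introduction to boosted trees: `Gain = 0.5 * [G_L^2/(H_L + lambda) + G_R^2/(H_R + lambda) - (G_L + G_R)^2/(H_L + H_R + lambda)] - gamma`, where `G` and `H` are the sums of gradients and Hessians in the left and right children. The function name `split_gain` and the numbers passed to it are illustrative assumptions, not values taken from the datasets in this notebook.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of a candidate split given gradient/Hessian sums of both children."""

    def leaf_score(g, h):
        # Structure score of a single leaf; lambda dampens it.
        return g * g / (h + reg_lambda)

    gain = 0.5 * (
        leaf_score(g_left, h_left)
        + leaf_score(g_right, h_right)
        - leaf_score(g_left + g_right, h_left + h_right)
    )
    # gamma is the minimum loss reduction a split must deliver to be worthwhile.
    return gain - gamma


# Hypothetical gradient and Hessian sums for one candidate split.
print(split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=0.0, gamma=0.0))
print(split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=2.0, gamma=0.0))
print(split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=2.0, gamma=5.0))
```

With these inputs the gain drops from 8.0 to 4.0 when `lambda` is raised from 0 to 2, and becomes negative (-1.0) once `gamma` is set to 5, which is the situation in which the split would not be kept.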

### Example Code for Setting Parameters
Below is an example of how to set these parameters in an XGBoost model:



In [12]:
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Load the dataset
california = fetch_california_housing()
X, y = california.data, california.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix for training
dtrain = xgb.DMatrix(X_train, label=y_train)

# Set parameters for the model
params = {
    'objective': 'reg:squarederror',
    'learning_rate': 0.1,
    'gamma': 0.1,
    'lambda': 1,
    'alpha': 0,
    'max_depth': 6,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'eval_metric': 'rmse'
}

# Train the model
bst = xgb.train(params, dtrain, num_boost_round=100)

# Make predictions
dtest = xgb.DMatrix(X_test)
y_pred = bst.predict(dtest)

# Evaluate the model
import numpy as np  # needed for np.sqrt if this cell is run on its own
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")

Mean Squared Error: 0.22010881503525193
Root Mean Squared Error: 0.46915755885976296




### Explanation
- **learning_rate**: Set to 0.1 to control the step size.
- **gamma**: Set to 0.1 to make the algorithm more conservative.
- **lambda**: Set to 1 for L2 regularization.
- **alpha**: Set to 0 for no L1 regularization.
- **max_depth**: Set to 6 to control the complexity of the trees.
- **subsample**: Set to 0.8 to use 80% of the training samples for each tree.
- **colsample_bytree**: Set to 0.8 to use 80% of the features for each tree.

This example demonstrates how to set and tune the key parameters for an XGBoost tree-based model.
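
The trade-off between `learning_rate` and the number of boosting rounds can be illustrated with a deliberately crude toy model: assume each round removes a fraction `eta` of whatever error remains. Real boosting does not behave this cleanly, and `rounds_to_reach` is a hypothetical helper, but the arithmetic shows why smaller steps need more rounds.

```python
def rounds_to_reach(eta, tolerance=0.01):
    """Rounds needed until the remaining error falls below `tolerance`.

    Toy assumption: every round shrinks the remaining error by a factor (1 - eta).
    """
    remaining = 1.0
    rounds = 0
    while remaining >= tolerance:
        remaining *= (1.0 - eta)
        rounds += 1
    return rounds


for eta in (0.3, 0.1, 0.01):
    print(eta, rounds_to_reach(eta))
```

Under this assumption, reaching 1% of the initial error takes 13 rounds at `eta=0.3`, 44 rounds at `eta=0.1`, and 459 rounds at `eta=0.01`. This is one reason a low learning rate is usually paired with a large `num_boost_round` and early stopping.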

## linear tunable parameters



### Linear Base Learner Parameters in XGBoost

1. **lambda (reg_lambda)**
   - **Description**: L2 regularization term on weights. It helps prevent overfitting by shrinking the weights.
   - **Typical Range**: 0 to 10
   - **Default**: 0

2. **alpha (reg_alpha)**
   - **Description**: L1 regularization term on weights. It can help with feature selection by shrinking some feature weights to zero.
   - **Typical Range**: 0 to 10
   - **Default**: 0

3. **lambda_bias**
   - **Description**: L2 regularization term on the bias. It helps prevent overfitting by shrinking the bias term.
   - **Typical Range**: 0 to 10
   - **Default**: 0

### Summary
- **lambda (reg_lambda)**: Controls L2 regularization on weights, reducing overfitting by shrinking weights.
- **alpha (reg_alpha)**: Controls L1 regularization on weights, aiding feature selection by shrinking some weights to zero.
- **lambda_bias**: Controls L2 regularization on the bias term, reducing overfitting by shrinking the bias.

These parameters are crucial for tuning linear models in XGBoost to achieve better generalization and performance.
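
The different effects of L1 and L2 penalties on a single weight can be seen in a one-dimensional toy problem: minimize `0.5 * (w - w0)**2` plus the penalty, where `w0` is the unpenalized weight. The closed forms below are the standard ridge and soft-thresholding solutions for that toy problem. They are not the coordinate updates that XGBoost's linear booster performs, and the function names are illustrative.

```python
def l2_shrink(w0, reg_lambda):
    """Minimizer of 0.5*(w - w0)**2 + 0.5*reg_lambda*w**2 (ridge shrinkage)."""
    return w0 / (1.0 + reg_lambda)


def l1_shrink(w0, reg_alpha):
    """Minimizer of 0.5*(w - w0)**2 + reg_alpha*abs(w) (soft thresholding)."""
    magnitude = max(abs(w0) - reg_alpha, 0.0)
    if magnitude == 0.0:
        # Small weights are set exactly to zero.
        return 0.0
    return magnitude if w0 > 0 else -magnitude


# Hypothetical unpenalized weights.
weights = [2.0, -0.3, 0.4]
print([l2_shrink(w, reg_lambda=1.0) for w in weights])
print([l1_shrink(w, reg_alpha=0.5) for w in weights])
```

The L2 penalty scales every weight toward zero without eliminating any of them (`[1.0, -0.15, 0.2]`), while the L1 penalty removes the two small weights entirely (`[1.5, 0.0, 0.0]`), which is the feature-selection effect mentioned above.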

### tuning

#### eta (learning rate)

In [14]:
# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree (boosting round)
params = {"objective":"reg:squarederror", "max_depth":3}

# Create list of eta values and empty list to store final round rmse per xgboost model
eta_vals = [0.001, 0.01, 0.1]
best_rmse = []

# Systematically vary the eta 
for curr_val in eta_vals:

    params["eta"] = curr_val
    
    # Perform cross-validation: cv_results
    cv_results = xgb.cv(params=params, dtrain=housing_dmatrix, nfold=3,
                        early_stopping_rounds=5, num_boost_round=10,
                        metrics='rmse', as_pandas=True, seed=123)

    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(eta_vals, best_rmse)), columns=["eta","best_rmse"]))

     eta  best_rmse
0  0.001   1.931078
1  0.010   1.793167
2  0.100   0.974117


#### max_depth

In [15]:
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary
params = {"objective":"reg:squarederror"}

# Create list of max_depth values
max_depths = [2,5,10,20]
best_rmse = []

# Systematically vary the max_depth
for curr_val in max_depths:

    params["max_depth"] = curr_val
    
    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                        early_stopping_rounds=5, num_boost_round=10,
                        metrics='rmse', seed=123, as_pandas=True)

    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(max_depths, best_rmse)),columns=["max_depth","best_rmse"]))

   max_depth  best_rmse
0          2   0.696380
1          5   0.562417
2         10   0.541528
3         20   0.565039


#### colsample_bytree

In [16]:
# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary
params={"objective":"reg:squarederror","max_depth":3}

# Create list of hyperparameter values: colsample_bytree_vals
colsample_bytree_vals = [0.1,0.5,0.8,1]
best_rmse = []

# Systematically vary the hyperparameter value 
for curr_val in colsample_bytree_vals:

    params['colsample_bytree'] = curr_val
    
    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                 num_boost_round=10, early_stopping_rounds=5,
                 metrics="rmse", as_pandas=True, seed=123)
    
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(colsample_bytree_vals, best_rmse)), columns=["colsample_bytree","best_rmse"]))

   colsample_bytree  best_rmse
0               0.1   0.830366
1               0.5   0.659547
2               0.8   0.627300
3               1.0   0.623250


## grid search and random search

### GridSearchCV vs RandomizedSearchCV

#### GridSearchCV
**What It Is**:
- An exhaustive search method that evaluates all possible combinations of hyperparameters specified in a grid.
- It systematically works through multiple combinations of parameter values, cross-validating as it goes to determine which combination gives the best performance.

**Pros**:
- **Comprehensive**: Evaluates all possible combinations, ensuring the best combination is found within the specified grid.
- **Deterministic**: Produces the same results given the same data, parameter grid, and CV splits, provided the estimator itself is deterministic.

**Cons**:
- **Computationally Expensive**: Can be very slow and resource-intensive, especially with a large number of parameters and values.
- **Scalability**: Not practical for high-dimensional parameter spaces due to the exponential growth in the number of combinations.

#### RandomizedSearchCV
**What It Is**:
- A search method that evaluates a fixed number (`n_iter`) of hyperparameter settings, each sampled at random from the specified distributions or lists of values.

**Pros**:
- **Efficiency**: Faster and less computationally expensive than GridSearchCV, especially with large parameter spaces.
- **Scalability**: More practical for high-dimensional parameter spaces as it does not evaluate all combinations.
- **Flexibility**: Allows specifying distributions for parameters, enabling more flexible and targeted searches.

**Cons**:
- **Non-Exhaustive**: Does not guarantee finding the best combination as it only evaluates a subset of possible combinations.
- **Stochastic**: Results can vary between runs due to the random sampling of parameter combinations.

### Summary
- **GridSearchCV**: Exhaustive and deterministic but computationally expensive and less scalable.
- **RandomizedSearchCV**: Efficient and scalable but non-exhaustive and stochastic.
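
The difference between the two strategies can be sketched with the standard library alone. The snippet enumerates every combination of a small illustrative grid, as an exhaustive search would, and then draws a fixed-size random subset, as a randomized search would. It only mimics the idea: scikit-learn uses its own sampler, so the specific candidates it draws will differ, and the seed value 42 is an arbitrary choice.

```python
import itertools
import random

# A small illustrative grid: 3 x 3 x 3 values.
param_grid = {
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 4, 5],
    "n_estimators": [100, 200, 300],
}

# Exhaustive search: every combination of the listed values.
names = list(param_grid)
all_candidates = [
    dict(zip(names, values))
    for values in itertools.product(*param_grid.values())
]

# Randomized search: a fixed number of candidates drawn without replacement.
rng = random.Random(42)
sampled_candidates = rng.sample(all_candidates, k=10)

print(len(all_candidates))      # candidates an exhaustive search would fit
print(len(sampled_candidates))  # candidates a randomized search would fit
```

The exhaustive list has 27 entries while the random subset has 10, and re-running with the same seed reproduces the same subset. Fixing the seed (the `random_state` argument in scikit-learn) is how the stochastic behaviour is made repeatable.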

### Example Code
Here is an example of how to use both methods with `XGBRegressor`:



In [19]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Load the dataset
california = fetch_california_housing()
X, y = california.data, california.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the model
model = XGBRegressor()

# Define parameter grids
param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'n_estimators': [100, 200, 300]
}

param_dist = {
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'n_estimators': [100, 200, 300]
}

# GridSearchCV
grid_search = GridSearchCV(model, param_grid, cv=3, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
print("Best parameters (GridSearchCV):", grid_search.best_params_)

# RandomizedSearchCV
random_search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=10, cv=3, scoring='neg_mean_squared_error', random_state=42)
random_search.fit(X_train, y_train)
print("Best parameters (RandomizedSearchCV):", random_search.best_params_)

Best parameters (GridSearchCV): {'learning_rate': 0.2, 'max_depth': 5, 'n_estimators': 300}
Best parameters (RandomizedSearchCV): {'n_estimators': 300, 'max_depth': 5, 'learning_rate': 0.1}




This example demonstrates how to set up and run both `GridSearchCV` and `RandomizedSearchCV` for hyperparameter tuning of an `XGBRegressor` model.

In [17]:
# grid search example = ames
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV

housing_data = pd.read_csv('ames_housing_trimmed_processed.csv')
X, y = housing_data[housing_data.columns.tolist()[:-1]], housing_data[housing_data.columns.tolist()[-1]]
housing_dmatrix = xgb.DMatrix(data=X, label=y)
gbm_param_grid = {'learning_rate': [0.01, 0.1, 0.5, 0.9], 'n_estimators': [200], 'subsample': [0.3, 0.5, 0.9]}
gbm = xgb.XGBRegressor()
grid_mse = GridSearchCV(estimator=gbm, param_grid=gbm_param_grid, scoring='neg_mean_squared_error', cv=4, verbose=1)
grid_mse.fit(X, y)
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))

Fitting 4 folds for each of 12 candidates, totalling 48 fits
Best parameters found:  {'learning_rate': 0.1, 'n_estimators': 200, 'subsample': 0.5}
Lowest RMSE found:  29105.179169382693


In [18]:
# random search example = ames
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

housing_data = pd.read_csv('ames_housing_trimmed_processed.csv')
X, y = housing_data[housing_data.columns.tolist()[:-1]], housing_data[housing_data.columns.tolist()[-1]]
housing_dmatrix = xgb.DMatrix(data=X, label=y)
gbm_param_grid = {'learning_rate': np.arange(0.05, 1.05, 0.05), 'n_estimators': [200], 'subsample': np.arange(0.05, 1.05, 0.05)}
gbm = xgb.XGBRegressor()
randomized_mse = RandomizedSearchCV(estimator=gbm, param_distributions=gbm_param_grid, n_iter=25, scoring='neg_mean_squared_error', cv=4, verbose=1)
randomized_mse.fit(X, y)
print("Best parameters found: ", randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))

Fitting 4 folds for each of 25 candidates, totalling 100 fits
Best parameters found:  {'subsample': 0.8, 'n_estimators': 200, 'learning_rate': 0.05}
Lowest RMSE found:  29420.3817978446


In [20]:
# grid search
# Create the parameter grid: gbm_param_grid
gbm_param_grid = {
    'colsample_bytree': [0.3, 0.7],
    'n_estimators': [50],
    'max_depth': [2, 5]
}

# Instantiate the regressor: gbm
gbm = xgb.XGBRegressor()

# Perform grid search: grid_mse
grid_mse = GridSearchCV(estimator=gbm, param_grid=gbm_param_grid, scoring='neg_mean_squared_error', cv=4, verbose=1)


# Fit grid_mse to the data
grid_mse.fit(X,y)

# Print the best parameters and lowest RMSE
print("Best parameters found: ", grid_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid_mse.best_score_)))

Fitting 4 folds for each of 4 candidates, totalling 16 fits
Best parameters found:  {'colsample_bytree': 0.7, 'max_depth': 2, 'n_estimators': 50}
Lowest RMSE found:  0.706204362480588


In [21]:
# Create the parameter grid: gbm_param_grid 
gbm_param_grid = {
    'n_estimators': [25],
    'max_depth': range(2, 12)
}

# Instantiate the regressor: gbm
# (n_estimators=10 is overridden by the value in gbm_param_grid during the search)
gbm = xgb.XGBRegressor(n_estimators=10)

# Perform random search: randomized_mse
randomized_mse = RandomizedSearchCV(estimator=gbm, param_distributions=gbm_param_grid,
                                    scoring='neg_mean_squared_error',
                                    n_iter=5, cv=4, verbose=1)


# Fit randomized_mse to the data
randomized_mse.fit(X,y)

# Print the best parameters and lowest RMSE
print("Best parameters found: ", randomized_mse.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(randomized_mse.best_score_)))

Fitting 4 folds for each of 5 candidates, totalling 20 fits
Best parameters found:  {'n_estimators': 25, 'max_depth': 5}
Lowest RMSE found:  0.6526003321681331


### Limitations of GridSearchCV

1. **Computationally Expensive**:
   - **Description**: GridSearchCV evaluates all possible combinations of hyperparameters specified in the grid, which can be very slow and resource-intensive.
   - **Impact**: This can be impractical for large datasets or models with many hyperparameters, leading to long training times and high computational costs.

2. **Scalability**:
   - **Description**: The number of combinations grows exponentially with the number of hyperparameters and their possible values.
   - **Impact**: This makes GridSearchCV less suitable for high-dimensional parameter spaces, as the search space can become prohibitively large.

3. **Fixed Grid**:
   - **Description**: GridSearchCV requires predefined parameter values, which may not cover the optimal values if the grid is not well-chosen.
   - **Impact**: This can lead to suboptimal performance if the true optimal parameters lie between the specified grid points.

4. **Overfitting Risk**:
   - **Description**: Evaluating many combinations increases the risk of overfitting to the validation set used during cross-validation.
   - **Impact**: This can result in a model that performs well on the validation set but poorly on unseen data.
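
The cost of an exhaustive search is simple arithmetic: the number of candidates is the product of the number of values per parameter, and each candidate is fitted once per CV fold. The sketch below applies that to the Ames grid used earlier, reproducing its "12 candidates, totalling 48 fits" message, and then to a hypothetical extension with two more parameters. The helper `count_fits` and the extra parameter values are assumptions for illustration.

```python
import math


def count_fits(param_grid, cv):
    """Return (number of candidates, number of model fits) for a grid search."""
    n_candidates = math.prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * cv


# The grid from the Ames grid search example above.
ames_grid = {
    "learning_rate": [0.01, 0.1, 0.5, 0.9],
    "n_estimators": [200],
    "subsample": [0.3, 0.5, 0.9],
}
print(count_fits(ames_grid, cv=4))

# Hypothetical extension: two more parameters with five values each.
bigger_grid = dict(
    ames_grid,
    max_depth=[2, 3, 4, 5, 6],
    colsample_bytree=[0.2, 0.4, 0.6, 0.8, 1.0],
)
print(count_fits(bigger_grid, cv=4))
```

Adding just two parameters with five values each raises the work from 48 fits to 1200, which is the growth that makes large grids impractical.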

### Limitations of RandomizedSearchCV

1. **Non-Exhaustive Search**:
   - **Description**: RandomizedSearchCV evaluates a fixed number of random combinations of hyperparameters, which means it does not cover the entire search space.
   - **Impact**: There is no guarantee that the best combination of hyperparameters will be found, especially if the number of iterations is low.

2. **Stochastic Nature**:
   - **Description**: The results can vary between runs due to the random sampling of parameter combinations.
   - **Impact**: This can lead to variability in the selected hyperparameters and model performance, requiring multiple runs to ensure stability.

3. **Requires Distribution Knowledge**:
   - **Description**: RandomizedSearchCV samples from whatever distributions or lists of values are supplied, and choosing a sensible range and scale is not straightforward for every parameter.
   - **Impact**: Poorly chosen distributions can lead to inefficient searches and suboptimal performance.

4. **Computational Cost**:
   - **Description**: While generally more efficient than GridSearchCV, RandomizedSearchCV can still be computationally expensive if the number of iterations is high or if the model is complex.
   - **Impact**: This can limit its practicality for very large datasets or highly complex models.
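
One common way to choose a distribution for a parameter that spans several orders of magnitude, such as `learning_rate`, is to sample it uniformly on a log scale. The sketch below does this with the standard library. The helper `sample_log_uniform`, the bounds, and the seed are illustrative assumptions. In practice a distribution object such as `scipy.stats.loguniform` could be passed in `param_distributions` instead.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw one value between low and high, uniform on a log10 scale."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** exponent


rng = random.Random(123)
learning_rates = [sample_log_uniform(rng, 0.001, 1.0) for _ in range(1000)]

# On a log scale each decade (0.001-0.01, 0.01-0.1, 0.1-1) gets about a third of the draws.
share_below_0_01 = sum(lr < 0.01 for lr in learning_rates) / len(learning_rates)
print(min(learning_rates), max(learning_rates), share_below_0_01)
```

A plain uniform draw over the same interval would place fewer than 1% of the samples below 0.01, so the small learning rates that often work well would almost never be tried.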

### Summary
- **GridSearchCV**: Comprehensive but computationally expensive and less scalable. Fixed grid may miss optimal values and increases overfitting risk.
- **RandomizedSearchCV**: More efficient and scalable but non-exhaustive and stochastic. Requires careful distribution specification and can still be computationally costly.

Both methods have their strengths and weaknesses, and the choice between them depends on the specific requirements and constraints of the problem at hand.