<h1 style="text-align:center;">XGBoost Hyperparameters</h1>

# Introduction
XGBoost is renowned for its performance and speed, yet it comes with a rich set of hyperparameters that can be intimidating to explore, especially all at once. 

### Base Learner Hyperparameters
At its core, XGBoost employs decision trees as base learners. Consequently, it incorporates all the hyperparameters related to decision trees, such as:
- `max_depth`: Controls the maximum depth of the tree.
- `min_child_weight`: Helps prevent overfitting by defining the minimum sum of instance weights (Hessians) needed in a child.
- `gamma`: Specifies the minimum loss reduction required to make a split.

These hyperparameters influence how large and complex the individual trees become within the ensemble model.

### Gradient Boosting Hyperparameters
Since XGBoost is an extension of gradient boosting, it also inherits hyperparameters pertinent to gradient boosting mechanisms, such as:
- `learning_rate` (or `eta`): Dictates the step size at each iteration while moving towards a minimum of the loss function.
- `n_estimators`: Determines the number of boosting rounds or trees to build.
- `subsample`: Specifies the fraction of samples used to train each tree, introducing randomness and preventing overfitting.

These hyperparameters are pivotal in managing the ensemble learning process to enhance predictive capability.

### Unique XGBoost Hyperparameters
XGBoost brings along additional hyperparameters that are specifically designed to refine the model’s accuracy and speed up the training process, such as:
- `colsample_bytree`, `colsample_bylevel`, and `colsample_bynode`: These parameters control the fraction of features used at different levels and nodes, thereby introducing additional randomness and mitigating overfitting.
- `scale_pos_weight`: Helps in handling class imbalance by giving more weight to the minority class.
- `alpha` and `lambda`: L1 and L2 regularization parameters, respectively (named `reg_alpha` and `reg_lambda` in the scikit-learn wrapper), that add a penalty term to the objective function, thus preventing overfitting by constraining the model.
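
To keep the three groups above straight, here is a small sketch that collects them into one dictionary, using the parameter names accepted by the scikit-learn wrapper. The values are XGBoost's documented defaults as we understand them for version 1.7; treat them as an assumption to check against the documentation, not as recommendations.

```python
# Hyperparameters grouped as in the sections above.
# Values are assumed to be the XGBoost 1.7 defaults; verify before relying on them.
base_learner_params = {"max_depth": 6, "min_child_weight": 1, "gamma": 0}
boosting_params = {"learning_rate": 0.3, "n_estimators": 100, "subsample": 1.0}
xgboost_params = {
    "colsample_bytree": 1.0,
    "colsample_bylevel": 1.0,
    "colsample_bynode": 1.0,
    "scale_pos_weight": 1,
    "reg_alpha": 0,   # L1 penalty (`alpha`)
    "reg_lambda": 1,  # L2 penalty (`lambda`)
}

# Merge the groups into one keyword dictionary.
all_params = {**base_learner_params, **boosting_params, **xgboost_params}
print(sorted(all_params))
```

A dictionary like this can be unpacked into the estimator, for example `XGBClassifier(**all_params)`, which makes it easy to override one group at a time while tuning.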

### Navigating the Hyperparameter Space
Approaching XGBoost's extensive hyperparameter space might feel like navigating through a dense forest. Practical strategies like grid search, random search, or Bayesian optimization can be employed to systematically explore and tune these hyperparameters, ensuring that the model is well-tuned and robust, without feeling disoriented by the multitude of tuning options. 
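
To make the difference between the search strategies concrete, the sketch below enumerates a small grid by hand and then draws a random subset from it. The grid values are hypothetical and chosen only for illustration, and the sampling mimics the idea behind random search, not scikit-learn's implementation.

```python
import itertools
import random

# A hypothetical grid, chosen only for illustration.
param_grid = {
    "max_depth": [2, 3, 5, 6, 8],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "subsample": [0.5, 0.8, 1.0],
}

# Grid search: every combination is evaluated.
names = list(param_grid)
all_combos = [dict(zip(names, values))
              for values in itertools.product(*param_grid.values())]

# Random search: only a fixed budget of combinations is evaluated.
rng = random.Random(43)
sampled_combos = rng.sample(all_combos, k=10)

print(f"grid search evaluates {len(all_combos)} combinations")
print(f"random search evaluates {len(sampled_combos)} of them")
```

Each combination is fitted once per cross-validation fold, so with five folds the full grid costs five times as many model fits as it has combinations. That multiplier is why the sampling budget matters as grids grow.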

In essence, while XGBoost provides a plethora of tuning options through its extensive hyperparameters, an informed, structured approach to navigating this space can yield models that are both accurate and computationally efficient.

### Preparing the Groundwork: Data and Base Models

Embarking on a journey with XGBoost? Let's lay the foundation first, ensuring our data and base models are ready to roll before diving into the sea of hyperparameters!

#### 1. **Securing the Dataset: A Look at Heart Disease**
To kick things off, we’ll grab the heart disease dataset. Our aim? To predict the onset of heart disease using various medical attributes.

#### 2. **Constructing the XGBClassifier**
Next, let's piece together our first XGBClassifier model. Consider this as assembling our first little robot soldier who's all set to learn from the data!

#### 3. **Implementing StratifiedKFold: An Equal Opportunity for All Classes**
We’ll employ StratifiedKFold, ensuring that our validation folds are representative of the overall class distribution, granting equal opportunities for all classes to be learned during training.

#### 4. **Scoring a Baseline: Where Do We Stand?**
Before tuning and tweaking, let's gauge our base XGBoost model. This baseline score will be our benchmark, guiding us on how much our fine-tuning efforts are paying off!

#### 5. **Merging Powers: GridSearchCV and RandomizedSearchCV United**
Let’s harness the meticulousness of GridSearchCV and the randomness of RandomizedSearchCV, blending them into one robust function that explores and exploits the hyperparameter space efficiently.

### The Essence: Preparedness Breeds Precision
Solid preparation is more than half the battle won in achieving accuracy, consistency, and speed in the model tuning process. By setting up our data and baseline model meticulously, we create a stable launchpad from which we can dive into the intricate world of XGBoost hyperparameter tuning with confidence.

In [19]:
import os
import warnings

os.environ['PYTHONWARNINGS'] = 'ignore::FutureWarning' 

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import (train_test_split, cross_val_score, 
                        StratifiedKFold, GridSearchCV, RandomizedSearchCV)

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

from helper_file import *

warnings.filterwarnings("ignore", category=FutureWarning) 
# export PYTHONWARNINGS="ignore::FutureWarning"

In [20]:
import xgboost
print(xgboost.__version__)

1.7.6


In [21]:
data_path = "data/heart_disease.csv"

In [22]:
df = pd.read_csv(data_path)
df.sample(n=5, random_state=43)

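|     | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|----:|----:|----:|---:|---------:|-----:|----:|--------:|--------:|------:|--------:|------:|---:|-----:|-------:|
| 242 | 64 | 1 | 0 | 145 | 212 | 0 | 0 | 132 | 0 | 2.0 | 1 | 2 | 1 | 0 |
| 130 | 54 | 0 | 2 | 160 | 201 | 0 | 1 | 163 | 0 | 0.0 | 2 | 1 | 2 | 1 |
| 208 | 49 | 1 | 2 | 120 | 188 | 0 | 1 | 139 | 0 | 2.0 | 1 | 3 | 3 | 0 |
| 160 | 56 | 1 | 1 | 120 | 240 | 0 | 1 | 169 | 0 | 0.0 | 0 | 0 | 2 | 1 |
| 124 | 39 | 0 | 2 | 94 | 199 | 0 | 1 | 179 | 0 | 0.0 | 2 | 0 | 2 | 1 |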


The `target` column! This one's our treasure map, indicating whether a patient has heart disease (`1` - presence) or not (`0` - absence). It's our guiding star, showing us what we’re trying to predict and navigate towards with our XGBoost model.

To delve into the intricate details of the other columns, hop over to the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Heart+Disease). It's a goldmine of information, providing a deeper dive into the various attributes and their meanings, enabling us to understand the nuances of the data we are working with.

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


We can see that all columns are numerical and have no null values. We will proceed with building our model, for which we will require the services of `XGBClassifier`.

In [24]:
X, y = splitX_y(df, 'target')

print(f"shape of target vector: {y.shape}")
print(f"shape of feature matrix: {X.shape}")

shape of target vector: (303,)
shape of feature matrix: (303, 13)
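
The `splitX_y` helper comes from `helper_file`, whose source is not shown here. Judging only from the shapes printed above, it presumably separates the target column from the remaining features. The function below is a hypothetical stand-in written under that assumption, not the author's actual helper.

```python
import pandas as pd

def split_features_target(frame, target_col):
    """Hypothetical stand-in for `splitX_y`: return (features, target)."""
    features = frame.drop(columns=[target_col])  # every column except the target
    target = frame[target_col]                   # the target as a Series
    return features, target

# Three rows and two feature columns borrowed from the sample above.
toy = pd.DataFrame({"age": [64, 54, 49],
                    "chol": [212, 201, 188],
                    "target": [0, 1, 0]})
X_toy, y_toy = split_features_target(toy, "target")
print(X_toy.shape, y_toy.shape)
```

Returning the features as a DataFrame and the target as a Series keeps the column names intact, which is convenient when inspecting feature importances later.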


In [25]:
model = XGBClassifier(booster='gbtree', 
                      objective='binary:logistic', 
                      random_state=43
                     )

scores = cross_val_score(model, X, y, cv=5)

print(f'Accuracy: {np.round(scores, 2)}')

print(f'Accuracy mean: {(scores.mean()):.2f}')

Accuracy: [0.84 0.85 0.82 0.8  0.77]
Accuracy mean: 0.81


We tried this in [notebook 3](https://github.com/theAfricanQuant/XGBoost4machinelearning/blob/main/006_CaseStudy.ipynb), and the results with `DecisionTreeClassifier` were not as good.

We used `cross_val_score` here, and we will use `GridSearchCV` to tune hyperparameters. Next, let's find a way to ensure that the test folds are the same using `StratifiedKFold`.

### StratifiedKFold: Ensuring Fair Play in Hyperparameter Tuning

As we delve into the adventurous journey of hyperparameter tuning with GridSearchCV and RandomizedSearchCV, a subtle issue peeks through from Chapter 2’s exploration of Decision Trees. Not all cross-validation strategies are created equal, and this becomes particularly evident with how `cross_val_score` and `GridSearchCV`/`RandomizedSearchCV` split our data.

#### The Dilemma: Consistent Splits in Cross-Validation
Here's the catch: when these tools carve up our data for cross-validation, the folds are only as good as the splitter behind them. Without stratification, they might not maintain the same distribution of target values across all folds, potentially skewing our validation results. Imagine one fold unintentionally hoarding all the positive cases, while another is left with the negatives: not quite the fair and balanced validation we're after.

Enter `StratifiedKFold`, our knight in shining armor, ensuring that each fold is a true miniature of our entire dataset. 

**How Does it Work?**
- **Uniform Distribution:** It ensures that each fold contains the same percentage of target values as the whole dataset. So, if our target column is adorned with 60% 1s (presence of heart disease) and 40% 0s (absence), StratifiedKFold guarantees that each fold mirrors this distribution.
- **Avoiding Skewed Splits:** Unlike random folds, which may accidentally create one test set with, say, a 70-30 split and another with a 50-50 split of target values, StratifiedKFold safeguards against such inconsistencies.
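
A quick way to see the two points above is to compare the share of positive labels in each test fold. The sketch uses synthetic labels (60% ones, deliberately sorted) and not the heart disease data, so the numbers only illustrate the mechanism.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic labels: 60 ones followed by 40 zeros. Sorting them makes
# unstratified, unshuffled folds as skewed as possible.
y_demo = np.array([1] * 60 + [0] * 40)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for splitting

def positive_share(splitter):
    """Fraction of 1s in each test fold produced by `splitter`."""
    return [float(y_demo[test_idx].mean())
            for _, test_idx in splitter.split(X_demo, y_demo)]

plain = positive_share(KFold(n_splits=5))
stratified = positive_share(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=43))

print("KFold          :", plain)
print("StratifiedKFold:", stratified)
```

With plain `KFold` each test fold here contains only one class, while every stratified fold keeps the 60/40 ratio of the full label vector.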

#### Ensuring Robustness in Model Tuning
Employing `StratifiedKFold` in our cross-validation strategy, especially during hyperparameter tuning, ensures that our validation results are robust and reliable. It shields against the risk of our model being validated against unrepresentative folds, thereby ensuring that our tuning is grounded on stable and consistent folds.

We will now define the folds as `kfold` by selecting `n_splits=5`, `shuffle=True`, and `random_state=43` as the StratifiedKFold parameters. Note that `random_state` provides a consistent ordering of indices, while `shuffle=True` allows rows to be initially shuffled:

In [26]:
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=43)

The `kfold` variable we just defined above can then be used inside `cross_val_score`, `GridSearchCV`, and `RandomizedSearchCV` to ensure consistent results.

Now that we have a method for obtaining consistent folds, it's time to score an official baseline model using `cv=kfold` inside `cross_val_score`.  Let's whip out our magnifying glass and scrutinize what happens when we unleash our model onto the data:

In [27]:
scores = cross_val_score(model, X, y, cv=kfold)

print(f'Accuracy: {np.round(scores, 2)}')

print(f'Accuracy mean: {(scores.mean()):.2f}')

Accuracy: [0.82 0.74 0.77 0.82 0.87]
Accuracy mean: 0.80


And voilà! Our scores are laid bare before us. The above score now forms our baseline.

But wait! Our accuracy has dwindled a tad. What's the story here?

### A Gentle Reminder: The True Meaning of Scores

Ah, the allure of high scores! But let’s pause for a moment and ponder: Is the pursuit of the pinnacle score always the righteous path? 

Here's the conundrum: while we zealously trained our `XGBClassifier` model on different folds, we encountered a spectrum of scores. It’s a subtle reminder of the paramount importance of consistency in our test folds during model training and validation. Indeed, scores are vital, but they're not the sole beacon of a model's worth.

#### Consistency vs. Absolute Scores
While the chase for the best possible score may be an intuitive strategy when picking between models, the variances in scores in this instance pull back the curtains on a crucial insight: a higher score does not always herald a superior model. Our two models, twins in terms of hyperparameters, unveiled different scores purely due to the different folds they were trained and validated on.
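
One way to put the two runs in perspective is to compare the gap between their means with the spread across folds. The sketch below uses the rounded fold accuracies printed earlier, so its means differ slightly from the unrounded ones reported above.

```python
from statistics import mean, pstdev

# Rounded fold accuracies copied from the two runs above.
default_cv_scores = [0.84, 0.85, 0.82, 0.80, 0.77]  # cv=5
kfold_scores = [0.82, 0.74, 0.77, 0.82, 0.87]       # cv=kfold

gap = abs(mean(default_cv_scores) - mean(kfold_scores))
spread = min(pstdev(default_cv_scores), pstdev(kfold_scores))

print(f"gap between means : {gap:.3f}")
print(f"smallest fold std : {spread:.3f}")
print("gap is within fold-to-fold noise:", gap < spread)
```

When the gap between two mean scores is smaller than the variation from one fold to the next, the difference says more about how the data was split than about the models themselves.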

### In Essence: Balance and Perspective
Navigating through the labyrinth of model tuning and validation, it's pivotal to maintain a balanced perspective. Scores provide invaluable insights, but it’s the consistency, reliability, and understanding of these scores that truly guides us towards crafting models that are not just statistically sound, but also practically insightful and reliable.


### Merging Forces: GridSearchCV Meets RandomizedSearchCV 

In our quest for the most enchanting model, we encounter two potent allies: `GridSearchCV` and `RandomizedSearchCV`. Both have their unique charms and spells to navigate the mystical world of hyperparameters.

#### **GridSearchCV: The Thorough Explorer**
`GridSearchCV` is akin to a meticulous explorer, scrutinizing every possible combination in the hyperparameter grid, ensuring nothing is overlooked in the pursuit of optimal tuning. It’s exhaustive, leaving no stone unturned, but alas, with great power comes great computational demand!

#### **RandomizedSearchCV: The Swift Adventurer**
On the flip side, `RandomizedSearchCV`, the swift and nimble adventurer, randomly selects a subset of hyperparameter combinations (10 by default), offering a quicker yet slightly less thorough exploration. It comes into its own when `GridSearchCV` demands more computational prowess than we can afford, due to a vast hyperparameter landscape.

#### **A Unified Force: One Function to Rule Them All**
Now, what if we could harness the thoroughness of GridSearchCV and the agility of RandomizedSearchCV, all in one neat package? A unified function that embodies the best of both worlds! Let’s weave this magic with the following steps:

1. **Define the Hyperparameter Universe:** Craft a grid that encompasses all possible hyperparameter values we wish to explore.
   
2. **Choose the Search Strategy:** Decide whether to employ the exhaustive search of GridSearchCV or the agile exploration of RandomizedSearchCV, perhaps based on the size and complexity of our hyperparameter universe.

3. **Execute the Search:** Unleash the chosen strategy upon our hyperparameter space, tuning, testing, and validating, ensuring our model learns, adapts, and evolves.

4. **Evaluate and Iterate:** Analyze the results, learn from the insights, and perhaps iterate further, refining our hyperparameter space and honing our model to perfection.

### Crafting Models: A Balance of Precision and Expediency
Embarking on this journey, it’s pivotal to balance the precision of GridSearchCV with the expediency of RandomizedSearchCV. By combining them into a single, streamlined function, we ensure that our model tuning is not only thorough and precise but also efficient and computationally feasible, guiding us towards crafting models that are both robust and reliable.

In [28]:
def grid_search(params, random=False): 
    
    xgb = XGBClassifier(booster='gbtree', objective='binary:logistic', random_state=43)
    
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=43)
    
    grid = (
        RandomizedSearchCV(xgb, params, cv=kfold, n_iter=20, n_jobs=-1, random_state=43) 
        if random 
        else GridSearchCV(xgb, params, cv=kfold, n_jobs=-1)
    )
    
    # Run the search on the full dataset (X and y are the notebook-level globals)
    grid.fit(X, y)

    # Report the best hyperparameters and their mean cross-validated accuracy
    print(f"Best params: {grid.best_params_}")
    print(f"Best score: {grid.best_score_:.5f}")
    print("search completed!")


### Unveiling XGBoost Hyperparameters: The Art of Fine-Tuning 

Embarking into the realm of XGBoost, we’re greeted by an ensemble of hyperparameters, each with their own unique character and influence upon our model's learning journey. Here, we shall delve into some of these frequently tinkered-with hyperparameters, understanding their essence, and experimenting with their variations using our crafted `grid_search` function.

#### **1. n_estimators: The Ensemble of Trees**
Let’s first converse with `n_estimators`, a crucial player in the ensemble method. 

- **Role in the Ensemble:** The `n_estimators` hyperparameter dictates the number of trees in the ensemble, each one constructed by learning from the residuals (errors) of its predecessors.
- **In the XGBoost Context:** Specifically for XGBoost, `n_estimators` signifies the quantity of trees trained upon the residuals, each tree attempting to correct the mistakes of the collective ensemble thus far.

##### **Diving into Practical Tuning:**
With a basic understanding, let's get our hands dirty and experiment with this hyperparameter:

- **Starting Point:** We initialize our exploration with the standard `n_estimators` value of 100.
- **Scaling Up:** Gradually, we shall amplify our ensemble, doubling the trees, exploring through to 800.

In [29]:
param_grid = {'n_estimators': [100, 200, 400, 800]}
grid_search(param_grid)

Best params: {'n_estimators': 100}
Best score: 0.80224
search completed!


#### **2. learning_rate: Calibrating Ensemble Contribution**
Let's delve into `learning_rate`, a pivotal hyperparameter steering the ensemble's direction and magnitude of correction.

- **Role in the Ensemble:**
  - **Correction Weightage**: The `learning_rate` hyperparameter, sometimes referred to as 'shrinkage', moderates the influence of each tree on the final prediction, by scaling the contribution of each tree's prediction toward the final ensemble prediction.
  - **Preventing Overfitting**: By systematically tuning down the contribution of each tree, `learning_rate` acts as a regularization technique, often preventing overfitting by keeping the model from fitting the training data too eagerly.
  - **Balance with Tree Quantity**: It indirectly influences the ensemble's size and complexity, as a smaller `learning_rate` generally necessitates a larger ensemble (more trees, or higher `n_estimators`) to maintain predictive performance.

- **In the XGBoost Context:**
  - **Scaled Corrections**: Within XGBoost, each successive tree is built on the residuals (errors) of the preceding ensemble, and `learning_rate` determines how much of the new tree's prediction should be incorporated to correct previous errors.
  - **Stabilizing Predictions**: The careful calibration of `learning_rate` ensures that the model adapts to the patterns in the data, but not to the noise, ensuring a balance between bias and variance.

- **Diving into Practical Tuning:**
  - **Starting Point**: A sensible starting point for tuning might be 0.1, a commonly used value that balances learning speed and convergence stability. XGBoost's own default is 0.3.
  - **Adjustment Strategy**: To find the optimal `learning_rate`, a gradual adjustment is key. One might start by trying a range of values, for example, `[0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]`, to observe how the model's performance varies.
  - **Synergy with n_estimators**: Remember that there is a trade-off between `learning_rate` and `n_estimators`: a smaller `learning_rate` often requires a larger `n_estimators` to be effective, and vice versa. Therefore, tuning them in tandem is crucial for efficient model tuning.
  - **Grid Search Integration**: Incorporate `learning_rate` tuning into grid search or randomized search strategies to systematically explore the hyperparameter space, potentially in combination with other hyperparameters, to find a configuration that performs well on your specific dataset.

By mindfully tuning `learning_rate`, we modulate the ensemble's learning pace and stability, ensuring it learns the underlying patterns in the data, while preserving its ability to generalize to unseen data. This subtle, yet potent, hyperparameter demands careful tuning and consideration in the context of other parameters to unlock the full potential of the XGBoost model.
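
The trade-off between `learning_rate` and `n_estimators` can be seen in a deliberately tiny toy model. In the sketch below there is a single training target and each "tree" predicts the current residual exactly, so every round closes a `learning_rate` share of the remaining gap. This is an idealized illustration of shrinkage, not how XGBoost fits real trees, and `rounds_to_converge` is a hypothetical helper.

```python
def rounds_to_converge(learning_rate, target=10.0, tolerance=0.01):
    """Count boosting rounds until the residual falls below `tolerance`."""
    prediction, rounds = 0.0, 0
    while abs(target - prediction) >= tolerance:
        residual = target - prediction
        prediction += learning_rate * residual  # shrunken correction
        rounds += 1
    return rounds

for lr in (0.5, 0.3, 0.1, 0.01):
    print(f"learning_rate={lr:<4} -> {rounds_to_converge(lr)} rounds")
```

The smaller the step, the more rounds are needed to reach the same residual, which is why these two hyperparameters are best tuned together.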

In [30]:
grid_search(params={'learning_rate':[0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5]})

Best params: {'learning_rate': 0.01}
Best score: 0.81514
search completed!


#### **3. max_depth: Controlling Tree Complexity**
Let’s explore `max_depth`, an essential hyperparameter that governs the structural depth and complexity of the trees within the ensemble.

- **Role in the Ensemble:**
  - **Tree Complexity**: `max_depth` sets the maximum depth of each decision tree, directly influencing the tree's complexity and capability to model intricate patterns in the data.
  - **Regulating Overfitting**: By capping the depth of the trees, `max_depth` acts as a regularization parameter, preventing the model from fitting too closely to the training data and potentially capturing noise as if it were a real pattern.
  - **Interaction with Other Parameters**: The impact of `max_depth` may intertwine with other tree-specific parameters like `min_child_weight`, forming a collaborative influence on the tree's structure and final ensemble's predictive power.

- **In the XGBoost Context:**
  - **Tree Construction**: In XGBoost, each tree is constructed sequentially, with each one trying to correct the errors of its predecessors. The `max_depth` parameter ensures that these corrections are kept reasonably simple and broad, preventing the model from creating overly complex corrections that do not generalize well to unseen data.
  - **Default and Common Values**: Although XGBoost defaults to a `max_depth` of 6, it’s essential to consider the nature of the data and task when choosing an appropriate value. Smaller values may be apt for simpler tasks, while larger values might be considered for more complex datasets, always being cautious of overfitting.

- **Diving into Practical Tuning:**
  - **Starting Point**: The default value of 6 serves as a common starting point, offering a balance between model complexity and generalization in many scenarios.
  - **Strategic Increment**: Begin with a smaller `max_depth`, and gradually increase, observing how each increment affects model performance. Pay attention to the signs of overfitting (i.e., improving training performance while validation performance deteriorates).
  - **Problem Specific**: Remember that the optimal `max_depth` is highly dependent on the specific dataset and problem at hand, and should be chosen based on performance metrics on validation data.
  - **Cross-Validation**: Employ cross-validation strategies to evaluate the impact of `max_depth` in a robust manner, ensuring the chosen value offers stable and reliable performance across different data subsets.

Adjusting `max_depth` provides us with a powerful lever to control the model’s complexity, safeguarding against overfitting while enabling the model to learn nuanced patterns within the data. Striking the right balance through meticulous tuning ensures that our XGBoost model is both accurate and generalizable.
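
To get a feel for how quickly depth adds capacity, recall that a binary tree of depth `d` can have at most `2**d` leaves. The sketch below relates that bound to the 303 rows in our dataset. `leaf_budget` is a hypothetical helper, and real trees are usually far from full because `gamma` and `min_child_weight` stop many splits.

```python
def leaf_budget(max_depth, n_rows):
    """Upper bound on leaves, and average rows per leaf if all were used."""
    max_leaves = 2 ** max_depth
    return max_leaves, n_rows / max_leaves

n_rows = 303  # rows in the heart disease data used here
for max_depth in [2, 3, 5, 6, 8]:
    leaves, rows_per_leaf = leaf_budget(max_depth, n_rows)
    print(f"max_depth={max_depth}: up to {leaves:>3} leaves, "
          f"about {rows_per_leaf:.1f} rows per leaf")
```

On a dataset this small, the deeper settings allow nearly one leaf per row, which is a useful reminder to watch validation scores closely when a search favors large depths.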

In [31]:
grid_search(params={'max_depth':[2, 3, 5, 6, 8]})

Best params: {'max_depth': 8}
Best score: 0.80552
search completed!


#### **4. gamma: Steering Tree Growth Prudence**
Let's delve into `gamma`, a pivotal hyperparameter in XGBoost that influences the decision-making regarding further partitioning of the tree nodes.

- **Role in the Ensemble:**
  - **Splitting Criterion**: `gamma` establishes a threshold for the reduction in the loss function, mandating that any further split at a node must provide at least this much reduction in the loss.
  - **Tree Pruning**: By enforcing a stricter criterion for splits, `gamma` effectively prunes the tree, avoiding unnecessary depth and complexity that could lead to overfitting.
  - **Balancing Bias and Variance**: Through its influence on tree depth, `gamma` plays a key role in managing the trade-off between a model’s bias and variance.

- **In the XGBoost Context:**
  - **Regulating Tree Growth**: Within the realm of XGBoost, `gamma` is employed to curb the enthusiastic growth of the trees, ensuring that splits contributing marginally to the predictive power are discouraged, thereby fostering a more robust and generalizable model.
  - **Impact on Complexity**: As a node is split only when the associated reduction in loss exceeds the `gamma` value, higher `gamma` values naturally lead to simpler, more conservative models.

- **Diving into Practical Tuning:**
  - **Starting Point**: Although the default value is 0, initiating with a small, non-zero `gamma` might be a prudent choice to introduce a mild regularizing effect from the get-go.
  - **Progressive Adjustments**: It’s worthwhile to increment `gamma` gradually, observing how increasing conservatism affects the model's performance, and being mindful to avoid introducing too much bias (underfitting).
  - **Upper Limit Considerations**: While there's technically no upper limit to `gamma`, practical applications and empirical observations often deem values above 10 to be quite high, and such values might overly constrain the model.
  - **Validation Importance**: Given its impact on model complexity, utilizing a validation set or employing cross-validation to assess the impact of `gamma` is crucial to ensure that the chosen value does not overly simplify the model.

Navigating through the tuning of `gamma`, we modulate the austerity of tree growth, ensuring that the ensemble remains judicious in its complexity, providing accurate yet generalizable predictions. It demands a mindful tuning approach, ever conscious of the subtle balance between learning the data's inherent patterns and not being swayed by its random fluctuations.
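
The role of `gamma` is easiest to see in the regularized split gain described in the XGBoost paper, where a candidate split is kept only if its gain, after subtracting `gamma`, is positive. The sketch below evaluates that formula on made-up gradient and Hessian sums. `split_gain` is a hypothetical helper for illustration, not a function from the library.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Regularized gain of a split, from gradient (g) and Hessian (h) sums."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    gain = 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right))
    return gain - gamma  # gamma is the price paid for adding a leaf

# Made-up gradient statistics for one candidate split.
stats = dict(g_left=-4.0, h_left=3.0, g_right=5.0, h_right=4.0)
for gamma in [0, 1, 5]:
    gain = split_gain(**stats, gamma=gamma)
    print(f"gamma={gamma}: gain={gain:.4f}, split kept: {gain > 0}")
```

The same candidate split survives at small `gamma` values and is rejected once `gamma` exceeds the raw gain, which is how larger values lead to shallower, more conservative trees.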

In [32]:
grid_search(params={'gamma':[0, 0.1, 0.5, 1, 2, 5]})

Best params: {'gamma': 5}
Best score: 0.83836
search completed!


#### **5. min_child_weight: Governing Tree Split Precision**
Let's traverse through `min_child_weight`, a crucial hyperparameter that orchestrates the conditions under which the tree nodes in the ensemble decide to bifurcate further.

- **Role in the Ensemble:**
  - **Regulating Splits**: `min_child_weight` establishes a minimum bound on the sum of instance weights needed in a child node post-split, influencing the decision tree's willingness to create additional branches.
  - **Combatting Overfitting**: By dictating a minimum requirement for node splits, it naturally prevents the formation of nodes that only fit a small subset of the data, thereby serving as a mechanism to deter overfitting.
  - **Navigating Complexity**: With its direct impact on tree structure, it can subtly influence the model's complexity, steering the balance between underfitting and overfitting.

- **In the XGBoost Context:**
  - **Weighted Instances**: In XGBoost, where instance weights can play a vital role, `min_child_weight` ensures that splits are made only when they are justified by a substantial sum of instance weights, promoting model stability and robustness.
  - **Mitigating Noise Influence**: It ensures that the model is not swayed by potentially noisy or outlier instances, contributing to the model’s generalization capability across varied data.

- **Diving into Practical Tuning:**
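  - **Starting Point**: A starting value might be set to 1, the XGBoost default, which acts as a relatively unrestrictive initial condition. Note that the sum runs over instance Hessians, not raw row counts, so it equals the number of instances only for losses such as squared error, where each Hessian is 1.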
  - **Gradual Enhancement**: Tuning may involve gradually increasing `min_child_weight`, observing how additional restrictions on tree growth influence the model's performance and complexity.
  - **Precision vs. Generalization**: The tuning of `min_child_weight` demands a keen eye on the model’s performance metrics, ensuring that the pursuit of precision does not jeopardize the model’s ability to generalize to unseen data.
  - **Cross-Validation Safeguard**: Employing cross-validation during tuning enables a more reliable and robust selection of `min_child_weight`, safeguarding against potential overfitting and underfitting.

With judicious tuning of `min_child_weight`, we wield the capability to guide our ensemble's learning, ensuring that the model is sufficiently expressive to capture underlying patterns while maintaining a steadfast resilience against the perils of overfitting. The hyperparameter, while seemingly simple, plays a pivotal role in harmonizing the ensemble's predictive prowess across varied datasets and contexts.
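
For binary classification with logistic loss, each row contributes a Hessian of `p * (1 - p)` to the sum, where `p` is the current predicted probability. The sketch below assumes unit instance weights and uses made-up probabilities. `hessian_sum` and `child_allowed` are hypothetical helpers that mirror the rule, not library functions.

```python
def hessian_sum(probabilities):
    """Sum of logistic-loss Hessians p * (1 - p) over the rows in a child."""
    return sum(p * (1.0 - p) for p in probabilities)

def child_allowed(probabilities, min_child_weight):
    """A child is acceptable only if its Hessian sum reaches the threshold."""
    return hessian_sum(probabilities) >= min_child_weight

uncertain_child = [0.5] * 16   # 16 rows the model is unsure about
confident_child = [0.95] * 16  # 16 rows already predicted confidently

for name, child in [("uncertain", uncertain_child),
                    ("confident", confident_child)]:
    print(f"{name}: hessian sum={hessian_sum(child):.2f}, "
          f"allowed at min_child_weight=4: {child_allowed(child, 4)}")
```

Two children with the same number of rows can therefore be treated differently. Rows the model already predicts confidently carry little weight, so the threshold bites hardest in regions that are nearly pure.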

In [33]:
grid_search(params={'min_child_weight':[1, 2, 3, 4, 5]})

Best params: {'min_child_weight': 4}
Best score: 0.81213
search completed!


#### **6. subsample: Modulating Training Instance Utilization**
Let’s explore `subsample`, a key hyperparameter that delineates the fraction of training instances employed in each boosting round, impacting the ensemble's learning dynamics and generalization capacity.

- **Role in the Ensemble:**
  - **Sampling Fraction**: `subsample` determines the proportion of training instances that are randomly sampled during each boosting round, affecting the diversity of instances that each tree in the ensemble is exposed to.
  - **Reducing Overfitting**: By intentionally using a subset of instances during each boosting round, `subsample` facilitates the creation of models that are less likely to overfit to the training data by introducing variability into the training process.
  - **Inducing Robustness**: Through randomized sampling of instances, it encourages the ensemble to construct trees that are not overly sensitive to specific instances or noisy patterns, thereby enhancing model robustness.

- **In the XGBoost Context:**
  - **Stochastic Boosting**: Within XGBoost, `subsample` introduces a stochastic element into the ensemble learning process. Each tree experiences different slices of the data, ensuring that the collective ensemble is informed by a broad spectrum of data scenarios.
  - **Balancing Bias and Variance**: Through its controlled sampling, `subsample` navigates the trade-off between bias and variance, ensuring the ensemble is neither too simplistic nor too tailored to the training data.

- **Diving into Practical Tuning:**
  - **Starting Point**: A common starting point might be 1.0, indicating that all instances are used in each boosting round, providing a baseline against which reductions can be compared.
  - **Exploration Strategy**: Tuning typically involves decreasing `subsample` incrementally, for example, experimenting with values like 0.8 or 0.6, and observing how the reduced instance availability influences model performance and overfitting.
  - **Monitoring Performance**: Close observation of performance metrics on both training and validation datasets is paramount to ensure that reductions in `subsample` enhance generalization without overly compromising predictive accuracy.
  - **Validation Stability**: Given its stochastic nature, ensuring that performance improvements are consistent and stable, potentially through cross-validation, is vital to ascertain the genuine impact of `subsample` adjustments.

Through the nuanced tuning of `subsample`, we subtly orchestrate the ensemble's learning experience, ensuring that it benefits from a rich, varied exposure to the training data, thereby constructing a model that is simultaneously accurate and generalizable across different data scenarios. Careful tuning, while respecting the balance between learning and overfitting, empowers the ensemble to deliver reliable predictions in diverse contexts.
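
The idea behind `subsample` can be sketched with the standard library alone: before each boosting round, draw a fresh random subset of row indices. `sample_rows` is a hypothetical helper, and truncating to a whole number of rows is an assumption of this sketch, not a statement about XGBoost's internals.

```python
import random

def sample_rows(n_rows, subsample, rng):
    """Pick a random subset of row indices without replacement."""
    n_keep = max(1, int(subsample * n_rows))  # truncation is an assumption
    return sorted(rng.sample(range(n_rows), n_keep))

rng = random.Random(43)
n_rows = 303  # rows in the heart disease data used here
for boosting_round in range(3):
    rows = sample_rows(n_rows, subsample=0.8, rng=rng)
    print(f"round {boosting_round}: {len(rows)} rows, first five {rows[:5]}")
```

Because each round draws its own subset, no single row is guaranteed to influence every tree, which is the source of the regularizing effect described above.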

In [34]:
grid_search(params={'subsample':[0.5, 0.7, 0.8, 0.9, 1]})

Best params: {'subsample': 0.9}
Best score: 0.81863
search completed!


#### **7. colsample_bytree: Orchestrating Feature Sampling**
Let's delve into `colsample_bytree`, a vital hyperparameter that curates the fraction of features (columns) utilized in constructing each tree within the ensemble, thereby manipulating the model’s exposure to the diverse range of predictors during training.

- **Role in the Ensemble:**
  - **Feature Sampling**: `colsample_bytree` governs the proportion of features that are randomly sampled to construct each tree, influencing the diversity and breadth of features each tree is privy to during its construction.
  - **Mitigating Overfitting**: By selectively using a subset of features for each tree, `colsample_bytree` acts as a regularization mechanism, deterring the model from becoming overly reliant on specific features and thereby mitigating overfitting.
  - **Enhancing Generalization**: Introducing variability in the feature set for each tree encourages the ensemble to construct models that are robust and capable of generalizing well to unseen data by diversifying the learning experience across the trees.

- **In the XGBoost Context:**
  - **Stochastic Element**: Within XGBoost, `colsample_bytree` injects a stochastic flavor into the ensemble learning process, ensuring that each tree is trained on a slightly different subset of features, which can enhance the ensemble's robustness and generalization.
  - **Balancing Feature Influence**: By selectively sampling features, it ensures that no single feature or set of features disproportionately influences the model, fostering an ensemble that is collectively informed by a balanced set of predictors.

- **Diving into Practical Tuning:**
  - **Starting Point**: Commencing tuning with `colsample_bytree` set to 1.0 (utilizing all features) provides a neutral starting point, from which the influence of feature sampling can be methodically explored.
  - **Sequential Reduction**: One might consider gradually reducing `colsample_bytree`, experimenting with values like 0.8 or 0.6, to observe how varying feature exposures impact the model’s predictive performance and robustness.
  - **Performance Metrics**: Vigilant monitoring of performance on both training and validation datasets is crucial to discern the impact of `colsample_bytree` on overfitting and predictive accuracy.
  - **Cross-Validation Consistency**: Employing cross-validation to ensure that the chosen `colsample_bytree` value yields stable and consistent performance across varied data subsets is pivotal to confirm its efficacy.

Tuning `colsample_bytree` adroitly shapes the ensemble’s interaction with the feature space, ensuring that the learning process is sufficiently diverse and balanced, culminating in a model that is adept at navigating through various predictive challenges with stability and accuracy. This hyperparameter, while simple, plays a profound role in aligning the ensemble’s learning towards robust and generalized predictive performance.
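
Feature sampling works along the other axis of the data: each tree sees only a random subset of columns. The sketch below applies that idea to the 13 feature names in our dataset. `sample_features` is a hypothetical helper, and how the fraction is rounded to a whole number of columns is an assumption here.

```python
import random

FEATURES = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
            "thalach", "exang", "oldpeak", "slope", "ca", "thal"]

def sample_features(features, colsample_bytree, rng):
    """Choose the columns one tree is allowed to split on."""
    n_keep = max(1, int(colsample_bytree * len(features)))  # rounding assumed
    return rng.sample(features, n_keep)

rng = random.Random(43)
for tree in range(3):
    chosen = sample_features(FEATURES, colsample_bytree=0.7, rng=rng)
    print(f"tree {tree}: {len(chosen)} of {len(FEATURES)} features -> {chosen}")
```

Since every tree draws a different subset, a dominant predictor is occasionally left out, giving weaker features a chance to contribute to the ensemble.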

In [35]:
grid_search(params={'colsample_bytree':[0.5, 0.7, 0.8, 0.9, 1]})

Best params: {'colsample_bytree': 0.7}
Best score: 0.80552
search completed!
