<h1 style="text-align:center;">XGBoost Unveiled</h1>

Alright, let’s dive into a charming journey with Extreme Gradient Boosting, or XGBoost. Picture this: we’ve been exploring the wonderful world of machine learning, journeying from the simplicity of decision trees all the way up to the robustness of gradient boosting. Now, let’s sprinkle some magic, shall we?

In the first half of this notebook, we're going to peel back the layers of XGBoost, revealing the secret sauce that gives it its powerful punch in tree ensemble algorithms. We'll wade through the theory and find out what really makes XGBoost tick and stand out in the crowd.

In the second act, we’ll roll up our sleeves and get our hands dirty by building XGBoost models within the frame of the Higgs Boson Kaggle Competition - the very stage where XGBoost took its first bow and dazzled the world.

What are we going to dig into, you ask?
- We’ll unveil the tricks under XGBoost's hood that make it a speed demon in the machine learning drag race.
- We'll explore the clever ways XGBoost deals with the pesky issue of missing values - it’s got some neat tricks up its sleeve!
- And we’ll dive deep (but keep it breezy) into the mathematical wizardry that powers XGBoost's regularized parameter selection.

We’ll get friendly with model templates, crafting our own XGBoost classifiers and regressors, becoming wizards in our own right.

And for our grand finale, we’ll teleport ourselves to the Large Hadron Collider, the stage where the Higgs boson made its debut. Here, we’ll play with data, make some predictions, and get cozy with the original XGBoost Python API.

Ready to embark on this adventure together? Let’s dive in, explore, and demystify XGBoost in our own relaxed, yet straight-to-the-point way.

# Designing XGBoost
XGBoost takes gradient boosting to the next level. In this part, we'll spotlight those unique characteristics of XGBoost that set it apart from traditional gradient boosting and other tree ensemble techniques.


## A Historical Glimpse

In the era of big data, the race to find powerful machine learning algorithms for optimal predictions took off. Decision trees were accurate but didn’t generalize well, while ensemble methods, like bagging and boosting, showed more promise. One standout was gradient boosting, which inspired Tianqi Chen to create **XGBoost**, bringing built-in regularization and remarkable speed gains to the table. After gaining recognition in Kaggle competitions, Chen and Carlos Guestrin introduced XGBoost to the wider machine learning community in 2016. [Read the original paper for more](https://arxiv.org/pdf/1603.02754.pdf).

## Diving into Design Features

XGBoost, aptly named for pushing computational limits to the extreme, addresses the need for faster algorithms in big data contexts. While our main focus here is building XGBoost models, we’ll sneak a peek under its hood to pinpoint key enhancements like handling missing values and improving speed and accuracy, which make it an attractive choice in the ML toolkit.

### Handling Missing Values

No need to stress over null values; XGBoost handles them natively. The `missing` hyperparameter tells it which value marks a missing entry (`np.nan` by default), and at each split it scores sending the missing rows left versus right and keeps the direction that works best.
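The idea of scoring split options for rows with missing values can be sketched in plain Python. This is an illustration only, not XGBoost's actual code: the helper names, the toy data, and the threshold are all hypothetical, and the gain uses the structure score from the linked paper with the constant factor and the `gamma` penalty left out.

```python
import math

def structure_score(grad_sum, hess_sum, reg_lambda):
    """Score of one leaf: G^2 / (H + lambda)."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)

def default_direction(values, grads, hessians, threshold, reg_lambda=1.0):
    """Return the better side ('left' or 'right') for missing rows, plus its gain."""
    sums = {"left": [0.0, 0.0], "right": [0.0, 0.0], "missing": [0.0, 0.0]}
    for value, g, h in zip(values, grads, hessians):
        if math.isnan(value):
            side = "missing"
        elif value < threshold:
            side = "left"
        else:
            side = "right"
        sums[side][0] += g
        sums[side][1] += h

    (gl, hl), (gr, hr), (gm, hm) = sums["left"], sums["right"], sums["missing"]
    parent = structure_score(gl + gr + gm, hl + hr + hm, reg_lambda)
    # Option 1: missing rows join the left child
    gain_if_left = (structure_score(gl + gm, hl + hm, reg_lambda)
                    + structure_score(gr, hr, reg_lambda) - parent)
    # Option 2: missing rows join the right child
    gain_if_right = (structure_score(gl, hl, reg_lambda)
                     + structure_score(gr + gm, hr + hm, reg_lambda) - parent)
    if gain_if_left >= gain_if_right:
        return "left", gain_if_left
    return "right", gain_if_right

# Hypothetical toy data: the two rows with a missing feature have targets
# that resemble the rows with large feature values.
nan = float("nan")
values = [1.0, 2.0, nan, 8.0, 9.0, nan]
targets = [1.0, 1.2, 5.1, 5.0, 5.2, 4.9]
grads = [0.0 - y for y in targets]   # squared-error gradient at prediction 0
hessians = [1.0] * len(targets)      # squared-error hessian is constant

direction, gain = default_direction(values, grads, hessians, threshold=5.0)
print(direction)  # right
```

Because the missing rows look like the high-value rows, routing them to the right child gives the larger gain, so "right" becomes the default direction for this split.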

### Speeding Things Up

Designed with speed in mind, XGBoost quickly builds models even when grappling with massive datasets. Its design features that give it a speed advantage include:

- Approximate split-finding algorithm
- Sparsity-aware split-finding
- Parallel computing
- Cache-aware access
- Block compression and sharding
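To give a feel for the first item in the list above, here is a minimal sketch of approximate split-finding with NumPy. It assumes plain, unweighted quantiles and a made-up feature column; XGBoost itself uses a weighted quantile sketch, so treat this as the idea rather than the implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=10_000)   # hypothetical feature column

# Exact greedy search considers every distinct value as a threshold
exact_candidates = np.unique(feature)

# Approximate search proposes only a handful of quantile-based thresholds
n_bins = 32                                        # illustrative choice
quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]    # interior quantiles only
approx_candidates = np.quantile(feature, quantiles)

print(len(exact_candidates), len(approx_candidates))
```

Evaluating a few dozen candidate thresholds instead of thousands is where much of the time saving comes from, at the cost of possibly missing the single best split point.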

### Accuracy Gains with Regularization

XGBoost doesn’t stop at plain gradient boosting; it builds regularization into its objective to curb overfitting and improve accuracy, which sets it apart from traditional gradient boosting and random forests.
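One way to see the regularization at work is through the leaf weight formula from the linked paper, where the L2 penalty appears in the denominator. The sketch below is a toy calculation with made-up gradient sums, not XGBoost code; `reg_lambda` is the name the scikit-learn style API uses for this penalty.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda):
    """Optimal leaf value w = -G / (H + lambda) for a fixed tree structure."""
    return -grad_sum / (hess_sum + reg_lambda)

# Hypothetical gradient and hessian sums for the rows in one leaf
G, H = -12.0, 4.0

weights = [leaf_weight(G, H, lam) for lam in (0.0, 1.0, 10.0)]
print(weights)
```

As `reg_lambda` grows, the leaf value shrinks toward zero, so no single leaf can push the prediction too far.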

# Crafting XGBoost Templates Together

Let’s roll up our sleeves and create some handy templates for building XGBoost models! These will be your trusty guides, helping you craft XGBoost classifiers and regressors in your future adventures.

## Classic Datasets as Our Playground

We'll tinker with two classic datasets: the Iris for classification and the Diabetes for regression. Both are petite, nestled within scikit-learn, and well-explored by our fellow data explorers, providing us a common ground in the machine learning realm. 

And hey, we’ll get acquainted with some default hyperparameters along the way - they usually give XGBoost a good starting point, and knowing them will gear you up for any tuning adventures ahead!

### The Iris Dataset: A Friendly Classic

The Iris dataset, introduced by statistician Ronald Fisher back in 1936, has been a darling of the machine learning community, thanks to its easy access, clean data, and evenly balanced classes. It's like the friendly neighborhood park where we all test our classification algorithms.

Here’s how we invite the Iris dataset into our sandbox, straight from scikit-learn:

```python
from sklearn.datasets import load_iris
iris = load_iris()
```

In [16]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score

from xgboost import XGBClassifier, XGBRegressor
from sklearn.metrics import accuracy_score

In [8]:
iris = datasets.load_iris()

Scikit-learn tucks datasets into **NumPy arrays**, the standard input format for machine learning models, whereas **pandas DataFrames** tend to be the champions for data analysis and visualization. To peek at the NumPy arrays through the lens of a DataFrame, we pass them to the `pd.DataFrame` constructor. Scikit-learn datasets come pre-split into predictor and target arrays, so to weave them back together we join the arrays with `np.c_` before making the conversion.

`np.c_` is a NumPy indexing helper that concatenates arrays along the second axis; in other words, it stacks them column-wise. It is used with square brackets rather than called like a function:

```python
import numpy as np

# Example arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Using np.c_ for concatenation along the second axis
c = np.c_[a, b]

# Output
# array([[1, 4],
#        [2, 5],
#        [3, 6]])

```


In [4]:
iris["target"]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [5]:
df = pd.DataFrame(
    data=np.c_[iris['data'], iris['target']],
    columns=iris['feature_names'] + ['target']
)

df.sample(n=5, random_state=43)

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 30 | 4.8 | 3.1 | 1.6 | 0.2 | 0.0 |
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0.0 |
| 138 | 6.0 | 3.0 | 4.8 | 1.8 | 2.0 |
| 67 | 5.8 | 2.7 | 4.1 | 1.0 | 1.0 |
| 105 | 7.6 | 3.0 | 6.6 | 2.1 | 2.0 |


The predictor columns, capturing sepal and petal dimensions, are straightforward. The target column encompasses three iris flower types: setosa, versicolor, and virginica, as outlined in the scikit-learn documentation, with a total of 150 entries. 

For machine learning prep, we partition the data with `train_test_split`, which we imported earlier. The original NumPy arrays, `iris['data']` and `iris['target']`, serve as inputs for the split.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(iris['data'], 
                                                    iris['target'], 
                                                    random_state=43
                                                   )
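Since no `test_size` is passed above, scikit-learn falls back to holding out 25% of the rows. The sketch below mimics the bookkeeping with NumPy to show the resulting sizes; it uses a different random generator than scikit-learn, so the actual rows chosen will not match.

```python
import math
import numpy as np

n_samples = 150        # rows in the Iris dataset
test_fraction = 0.25   # scikit-learn's default when test_size is not given

# The test size is rounded up; the remainder goes to training
n_test = math.ceil(n_samples * test_fraction)
n_train = n_samples - n_test

# Shuffle row indices, then slice them into test and train parts
rng = np.random.default_rng(43)   # illustrative seed, not scikit-learn's generator
shuffled = rng.permutation(n_samples)
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

print(n_train, n_test)  # 112 38
```

So the classifier below trains on 112 flowers and is scored on the remaining 38.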

In [13]:
xgb = XGBClassifier(booster='gbtree', 
                    objective='multi:softprob', 
                    max_depth=6, learning_rate=0.1, 
                    n_estimators=100, 
                    random_state=43, n_jobs=-1
                   )
xgb.fit(X_train, y_train)

Let's delve into a brief exploration of the selected hyperparameters:

##### a) Booster Type: `booster='gbtree'`
   - **What is it?** The `booster` is the model (or "learner") that gets adjusted during the boosting rounds.
   - **What does 'gbtree' mean?** It stands for gradient boosted tree, which is the default base learner in XGBoost.
   - **Note**: While 'gbtree' is commonly used, we’ll explore other base learners in Chapter 8.

##### b) Objective Function: `objective='multi:softprob'`
   - **What is it?** The `objective` determines the loss function to be used in the model.
   - **Why 'multi:softprob'?** It’s suitable for multiclass problems and outputs the predicted probability of each class. The class with the highest probability becomes the final prediction.
   - **Extra Info**: XGBoost can often pick an appropriate objective if it's not explicitly defined. Dive into other options in the [XGBoost documentation](https://xgboost.readthedocs.io/en/latest/parameter.html).
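To make the "highest probability wins" step concrete, here is a small NumPy sketch. The probability rows are invented for illustration; in practice they would come from the model's `predict_proba` output.

```python
import numpy as np

# Hypothetical per-class probabilities for three flowers (each row sums to 1)
proba = np.array([
    [0.90, 0.07, 0.03],
    [0.10, 0.30, 0.60],
    [0.20, 0.55, 0.25],
])

# The predicted label is the index of the largest probability in each row
predicted_class = proba.argmax(axis=1)
print(predicted_class)  # [0 2 1]
```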

##### c) Tree Depth: `max_depth=6`
   - **What is it?** `max_depth` specifies the maximum depth of a tree.
   - **Why is it important?** It controls the complexity of the model by limiting how many successive splits each tree can make. It is a key parameter to tweak to avoid overfitting or underfitting.
   - **Note**: XGBoost defaults to 6; in contrast, scikit-learn's random forests default to `max_depth=None`, so their trees keep growing until the leaves are pure unless you set a limit.

##### d) Learning Rate: `learning_rate=0.1`
   - **What is it?** Also known as `eta` in XGBoost, the `learning_rate` scales the contribution of each tree. XGBoost's own default is 0.3; this template uses a more conservative 0.1.
   - **Why does it matter?** It's a tuning knob: a smaller value reduces the step size taken in each boosting round, which helps control overfitting.
   - **In Depth**: We explored this concept thoroughly in Chapter 4.

##### e) Number of Trees: `n_estimators=100`
   - **What is it?** `n_estimators` dictates the number of boosting rounds, or in simpler terms, the number of trees added to the model.
   - **What’s the impact?** More trees can model more complexity, but also might lead to overfitting. Balancing this with `learning_rate` can often yield more robust models.
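The interplay between `learning_rate` and `n_estimators` can be sketched with a toy loop. This is a simplification under a strong assumption: each "tree" here is a stand-in that predicts the current residual perfectly, which real trees never do. The function name and numbers are hypothetical.

```python
def boosted_prediction(target, learning_rate, n_rounds):
    """Add learning_rate * residual to the prediction once per boosting round."""
    prediction = 0.0
    for _ in range(n_rounds):
        residual = target - prediction         # what is still unexplained
        prediction += learning_rate * residual  # shrunken correction
    return prediction

# Same target, different (learning_rate, n_estimators) combinations
for lr, rounds in [(1.0, 1), (0.1, 10), (0.1, 100)]:
    print(lr, rounds, round(boosted_prediction(10.0, lr, rounds), 3))
```

With a small learning rate, ten rounds cover only about 65% of the gap to the target, while a hundred rounds close it almost entirely. That is why a lower `learning_rate` usually calls for a higher `n_estimators`.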

Understanding each hyperparameter and its impact on the model is crucial for fine-tuning and achieving better predictive performance with XGBoost!


In [14]:
y_pred = xgb.predict(X_test)

score = accuracy_score(y_test, y_pred)

print(f'Score: {score}')

Score: 0.9473684210526315
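Accuracy is simply the share of test rows predicted correctly, and with 38 test rows a score of 0.947 corresponds to 36 hits. The sketch below reproduces that arithmetic with invented labels; the positions of the two mistakes are hypothetical, not the model's real errors.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Hypothetical labels: 38 test rows, two of them misclassified
y_true = [0] * 12 + [1] * 13 + [2] * 13
y_pred = list(y_true)
y_pred[15] = 2   # a class-1 flower predicted as class 2
y_pred[30] = 1   # a class-2 flower predicted as class 1

print(accuracy(y_true, y_pred))
```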


### The Diabetes Dataset
In this section, we build an XGBoost regressor template that uses `cross_val_score` with scikit-learn's Diabetes dataset.

Before building the template, load the predictor columns as `X` and the target column as `y`.

In [15]:
X, y = datasets.load_diabetes(return_X_y=True)

In [17]:
xgb = XGBRegressor(booster='gbtree', 
                   objective='reg:squarederror', 
                   max_depth=6, learning_rate=0.1, 
                   n_estimators=100, random_state=43, n_jobs=-1
                  )

scores = cross_val_score(xgb, X, y, 
                         scoring='neg_mean_squared_error', cv=5
                        )

rmse = np.sqrt(-scores)

print(f'RMSE: {np.round(rmse, 3)}')

print(f'RMSE mean: {rmse.mean():.3f}')

RMSE: [63.011 59.705 64.538 63.706 64.588]
RMSE mean: 63.109
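The sign flip in `np.sqrt(-scores)` is there because scikit-learn's `neg_mean_squared_error` reports the MSE as a negative number, so that higher scores always mean better models. A tiny sketch with made-up fold scores shows the conversion:

```python
import numpy as np

# Hypothetical fold scores as returned with scoring='neg_mean_squared_error'
neg_mse_scores = np.array([-4.0, -9.0, -16.0])

# Negate to recover the MSE, then take the square root to get RMSE
rmse_per_fold = np.sqrt(-neg_mse_scores)

print(rmse_per_fold)         # [2. 3. 4.]
print(rmse_per_fold.mean())  # 3.0
```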


Converting the target column, `y`, into a pandas DataFrame and calling the `.describe()` method gives the quartiles and general statistics of the target column.

In [18]:
pd.DataFrame(y).describe()

| | 0 |
|---|---|
| count | 442.0 |
| mean | 152.133484 |
| std | 77.093005 |
| min | 25.0 |
| 25% | 87.0 |
| 50% | 140.5 |
| 75% | 211.5 |
| max | 346.0 |


An RMSE of 63.109 is below the target's standard deviation of 77.093, so the typical error is less than one standard deviation, a respectable result.
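The standard deviation is a natural yardstick because a model that always predicts the mean has an RMSE equal to the (population) standard deviation of the target. The sketch below checks this on a small made-up array and then compares the two numbers reported above; note that pandas reports the sample standard deviation, which is marginally larger.

```python
import numpy as np

# Hypothetical target values, used only to illustrate the baseline
y_toy = np.array([25.0, 87.0, 140.0, 212.0, 346.0])

# Baseline "model": always predict the mean of the target
baseline_pred = np.full_like(y_toy, y_toy.mean())
baseline_rmse = np.sqrt(np.mean((y_toy - baseline_pred) ** 2))

# The baseline RMSE matches NumPy's population standard deviation
print(np.isclose(baseline_rmse, y_toy.std()))  # True

# Ratio of the cross-validated RMSE to the target's standard deviation
ratio = 63.109 / 77.093
print(round(ratio, 3))  # 0.819
```

A ratio below 1 means the model does better than the predict-the-mean baseline.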