Here's a demonstration. This example uses a decision tree regressor. 

Imagine that the target is a continuous variable. 

Let's start with a set of features X and a target variable Y. 

We'll train the first base learner decision tree on this data and call it learner1. 

learner1 makes its predictions, which we'll call Y^_1. 

The residual errors of learner1's prediction are found by subtracting the predicted values from the actual values. 

Call the set of residual errors, error_1. 

Now train a new base learner using the same X data but instead of the original Y data, use error_1 as the target. 

That's because this learner is predicting the error made by learner1. 

Call this new base learner, learner2. 

Learner2's predictions are assigned to error^_1. 

Then compare learner2's predictions to the actual values and assign the difference to error_2. 

In this case, the actual values are the errors made by learner1. 

This process will continue for as many base learners as we specify. 

For now, repeat it just once more: train a third base learner, learner3, on the same X data with error_2 as the target. 

Stopping here results in an ensemble that contains three base learners. 

To get the final prediction for any new X, add together the predictions of all three learners.  
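The three-learner walkthrough above can be sketched in a few lines. This is a minimal illustration, not a full gradient boosting implementation: the synthetic data, the tree depth, and the helper name ensemble_predict are assumptions made for the example, and no learning rate is applied, to match the description above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, invented purely for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
Y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

# learner1 is trained on the original target
learner1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, Y)
Y_hat_1 = learner1.predict(X)
error_1 = Y - Y_hat_1  # actual values minus predicted values

# learner2 is trained on the same X, with learner1's errors as the target
learner2 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, error_1)
error_hat_1 = learner2.predict(X)
error_2 = error_1 - error_hat_1

# learner3 repeats the step once more, using error_2 as the target
learner3 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, error_2)
error_hat_2 = learner3.predict(X)
error_3 = error_2 - error_hat_2


def ensemble_predict(X_new):
    """Hypothetical helper: final prediction is the sum of all three learners."""
    return (
        learner1.predict(X_new)
        + learner2.predict(X_new)
        + learner3.predict(X_new)
    )


# The mean squared residual should not grow from one step to the next
for name, err in [("error_1", error_1), ("error_2", error_2), ("error_3", error_3)]:
    print(name, np.mean(err ** 2))
```

Each learner only has to model what the previous learners left unexplained, which is why the residuals shrink as learners are added.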

Ensembles that use gradient boosting are called gradient boosting machines or GBMs. 

GBMs are among the most widely used modeling techniques today because of their many advantages. 

One of these is high accuracy. As we mentioned earlier, many machine-learning competition winners succeeded largely because of the accuracy of their boosting models. 

Another advantage is that GBMs are scalable. 

GBMs can't be trained in parallel the way random forests can, because their base learners are developed sequentially. Even so, they still scale well to large datasets. 

GBMs also work well with missing data. 

The fact that a value is missing is viewed as valuable information. 

So GBMs treat missing values just like any other value when determining how to split a feature. 

This makes gradient boosting relatively easy to use with messy data. 

Also, because they are tree-based, GBMs don't require the data to be scaled, and they can handle outliers easily. 

Gradient boosting also has its drawbacks. One is that GBMs have a lot of hyperparameters, and tuning them can be a time-consuming process. 
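To give a sense of why tuning takes time, here is a small sketch of a grid search over just three hyperparameters. The grid values and the synthetic data are assumptions chosen to keep the example fast; a realistic search would usually cover more hyperparameters and more values.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data, invented purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.1, size=200)

# Two candidate values for each of three hyperparameters: 2 * 2 * 2 = 8 combinations
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=3,  # each combination is fit once per fold
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

n_fits = len(search.cv_results_["params"]) * 3
print("Model fits during the search:", n_fits)
print("Best parameters:", search.best_params_)
```

Even this tiny grid needs 24 model fits before the final refit. Adding one more hyperparameter with three candidate values would triple that.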

Another drawback is that they can be difficult to interpret. 

GBMs can provide feature importance, but unlike linear models, they do not have coefficients or directionality. 

They only show how important each feature is relative to the other features. Because of this, they're often called black-box models. 

A black-box model is one whose predictions cannot be precisely explained. In some industries, such as medicine and banking, it's essential that your model's predictions be explainable. 

Therefore, GBMs are not well suited for some applications. 
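The feature importance point can be seen in a short sketch. The synthetic data below is an assumption made for the example: the target falls as the first feature rises, depends weakly on the second, and ignores the third.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data, invented purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = -3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=300)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Importances are non-negative and sum to 1 across the features
importances = model.feature_importances_
for name, importance in zip(["feature_0", "feature_1", "feature_2"], importances):
    print(f"{name}: {importance:.3f}")
```

The scores rank the first feature highest, but nothing in them says that its relationship with the target is negative. A linear model's coefficient would carry that sign.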

GBMs can also have difficulty with extrapolation. 

Extrapolation is a model's ability to predict new values that fall outside of the range of values in the training data. 

For instance, suppose one loaf of bread costs one dollar, two loaves cost two dollars, and three loaves cost three dollars. A linear regression model would have no trouble predicting that 10 loaves cost $10, but a GBM wouldn't be able to unless it saw the cost of 10 loaves in the training data.
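The bread example can be reproduced directly. This is a minimal sketch using scikit-learn's LinearRegression and GradientBoostingRegressor with default settings; the variable names are made up for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Training data: 1, 2, and 3 loaves cost 1, 2, and 3 dollars
loaves = np.array([[1], [2], [3]])
cost = np.array([1.0, 2.0, 3.0])

linear = LinearRegression().fit(loaves, cost)
gbm = GradientBoostingRegressor(random_state=0).fit(loaves, cost)

ten = np.array([[10]])
linear_pred = linear.predict(ten)[0]
gbm_pred = gbm.predict(ten)[0]

# For comparison: the GBM's prediction at the largest value it saw in training
gbm_at_three = gbm.predict(np.array([[3]]))[0]

print("Linear regression, 10 loaves:", linear_pred)
print("GBM, 10 loaves:", gbm_pred)
print("GBM, 3 loaves:", gbm_at_three)
```

The linear model extends the line to about 10 dollars. The GBM gives 10 loaves the same prediction it gives 3 loaves, because every tree sends both inputs to the same leaf.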

Finally, GBMs are prone to overfitting if not trained carefully. 

This often happens when the hyperparameters are tuned too closely to the training data, which can result in trees that grow to fit the training data but don't generalize well to unseen data. 
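One way to see overfitting is to compare training error with test error as the trees are allowed to grow deeper. The sketch below is an illustration under assumed settings: the noisy synthetic data, the number of trees, and the two depths were all chosen for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Noisy synthetic data, invented purely for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit one shallow and one deep ensemble and record (train MSE, test MSE) for each
results = {}
for depth in (1, 8):
    model = GradientBoostingRegressor(
        n_estimators=300, max_depth=depth, random_state=0
    )
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[depth] = (train_mse, test_mse)
    print(f"max_depth={depth}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

A training error far below the test error, as with the deeper trees here, is the usual sign that the model has fit noise in the training data.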

In [1]:
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [2]:
df = pd.read_csv("heart_tidy.csv")
df.head(10)

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,CholesterolLevel,FastingBloodSugar,RestingElectrocardiographicResult,MaxHeartRate,ExerciseAngina,STDepression,STSegmentSlope,NumMajorVessels,ThalliumStressRest,HeartDiseasePresent,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18
0,63,1,1,145,233,1,2,150,0,2.3,3,0,6,0,,,,,
1,67,1,4,160,286,0,2,108,1,1.5,2,3,3,1,,,,,
2,67,1,4,120,229,0,2,129,1,2.6,2,2,7,1,,,,,
3,37,1,3,130,250,0,0,187,0,3.5,3,0,3,0,,,,,
4,41,0,2,130,204,0,2,172,0,1.4,1,0,3,0,,,,,
5,56,1,2,120,236,0,0,178,0,0.8,1,0,3,0,,,,,
6,62,0,4,140,268,0,2,160,0,3.6,3,2,3,1,,,,,
7,57,0,4,120,354,0,0,163,1,0.6,1,0,3,0,,,,,
8,63,1,4,130,254,0,2,147,0,1.4,2,1,7,1,,,,,
9,53,1,4,140,203,1,2,155,1,3.1,3,0,7,1,,,,,


In [5]:
df = df.drop(['Unnamed: 14','Unnamed: 15','Unnamed: 16','Unnamed: 17','Unnamed: 18'], axis=1)

In [6]:
# df is the dataframe loaded and cleaned in the cells above

# Splitting the data into predictors (X) and target variable (y)
X = df.drop('CholesterolLevel', axis=1)
y = df['CholesterolLevel']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating a Gradient Boosting Regressor
gb_regressor = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Training the Gradient Boosting Regressor
gb_regressor.fit(X_train, y_train)

# Making predictions on the test set
y_pred = gb_regressor.predict(X_test)

# Evaluating the model using mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)


Mean Squared Error: 3321.00466126722


Mean Squared Error (MSE): MSE is a commonly used metric for evaluating regression models. It measures the average squared difference between the actual values (y_true) and the predicted values (y_pred). For n samples with actual values y_i and predicted values ŷ_i, it is calculated as:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

Interpretation: A lower MSE indicates that the model's predictions are closer to the actual values, while a higher MSE indicates larger prediction errors. In this case, an MSE of 3321 means that, on average, the squared difference between the actual and predicted cholesterol levels is 3321 (mg/dL)². Since MSE is in squared units, it's not directly interpretable in terms of the original units (mg/dL). However, it gives a measure of the overall accuracy of the model's predictions.
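One common way to get back to the original units is to take the square root of the MSE, giving the root mean squared error (RMSE). The sketch below applies this to the value printed above; the mg/dL unit is the one assumed in the interpretation, not something recorded in the dataset itself.

```python
import math

# MSE value copied from the output printed above
mse = 3321.00466126722

# RMSE is the square root of MSE, so it is in the same units as the target
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:.1f}")
```

An RMSE of roughly 57.6 suggests that typical prediction errors are on the order of 58 mg/dL, which is easier to compare against the cholesterol values in the table than the squared figure.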