## 03 Gradient Boosting Regressor
- Gradient Boosting Regressor is an ensemble model that builds many weak decision trees one after another, each correcting the errors of the trees before it, and combines them into a strong model
1. Make an initial prediction for all rows (typically a constant, such as the mean of the target)
2. Calculate the errors (residuals) -> residuals are the part of the target the current model has not yet explained
3. Train a weak learner (a small tree) to predict the residuals
4. Add the tree's predictions to the overall model, scaled by a learning rate
5. Repeat steps 2-4 for a fixed number of trees
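
The steps above can be sketched by hand in a few lines. This is a minimal illustration, not how scikit-learn implements it internally: it assumes squared-error loss (so the residuals are simply `y - prediction`) and uses synthetic data rather than the car price dataset. The names `X_demo`, `y_demo`, and `n_rounds` are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration; not the car price dataset)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) * 5 + rng.normal(0, 0.5, size=200)

learning_rate = 0.1
n_rounds = 50

# Step 1: start with the same constant prediction for every row
prediction = np.full_like(y_demo, y_demo.mean())
initial_mae = np.mean(np.abs(y_demo - prediction))

trees = []
for _ in range(n_rounds):
    # Step 2: residuals are the errors left over by the current model
    residuals = y_demo - prediction
    # Step 3: fit a small (weak) tree to those residuals
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X_demo, residuals)
    # Step 4: add the tree's output, scaled by the learning rate
    prediction += learning_rate * tree.predict(X_demo)
    # Step 5: keep the tree and repeat
    trees.append(tree)

final_mae = np.mean(np.abs(y_demo - prediction))
print(f"MAE before boosting: {initial_mae:.3f}")
print(f"MAE after {n_rounds} rounds: {final_mae:.3f}")
```

Each round only has to fix what the previous rounds left behind, which is why even very shallow trees can add up to an accurate model.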

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import mean_absolute_error

df = pd.read_csv('../data/cleaned/car_prices_cleaned.csv')

In [11]:
# Average car price for reference
print(f"Average car price: ${df['sellingprice'].mean():.2f}")

Average car price: $13725.53


In [8]:
X = df.drop('sellingprice', axis=1)
y = df['sellingprice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

n_estimators -> the number of decision trees (boosting stages) to build
learning_rate -> scales how much each additional tree contributes to the overall prediction; lower values usually need more trees
max_depth -> the maximum depth (number of levels) of each decision tree
min_samples_split -> the minimum number of samples a node must contain before it can be split into two branches
min_samples_leaf -> the minimum number of samples that must remain in each child node (leaf) after a split; a split that would leave fewer is not made
max_features -> the number of features considered when searching for the best split; a float such as 0.6 is read as a fraction of all features
loss -> the loss function the model optimizes; 'huber' behaves like squared error for small errors and like absolute error for large ones, which makes it less sensitive to outliers
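
To make the `loss` setting more concrete, here is a rough sketch of the Huber loss formula written with NumPy. It is an illustration only: `huber_loss` and its fixed `delta` threshold are names chosen for this example, whereas scikit-learn derives its threshold from the residuals via the `alpha` parameter rather than taking a fixed value.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Per-sample Huber loss with a fixed threshold `delta` (illustrative)."""
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    abs_error = np.abs(error)
    # Small errors: quadratic, like squared error
    quadratic = 0.5 * error ** 2
    # Large errors: linear, like absolute error
    linear = delta * (abs_error - 0.5 * delta)
    return np.where(abs_error <= delta, quadratic, linear)

# Compare how squared error and Huber loss grow with the size of the error
errors = np.array([0.5, 1.0, 3.0, 10.0])
print("half squared error:", 0.5 * errors ** 2)
print("huber (delta=1.0): ", huber_loss(errors, np.zeros_like(errors)))
```

For the largest error the squared term grows much faster than the Huber term, which is the reason a few badly mispriced cars have less pull on a model trained with `loss='huber'`.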


In [10]:
model = ensemble.GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=20,
    min_samples_split=5,
    min_samples_leaf=3,
    max_features=0.6,
    loss='huber',
    random_state=42
)

model.fit(X_train, y_train)

mae_train = mean_absolute_error(y_train, model.predict(X_train))
print(f'Training Set MAE: {mae_train}')

mae_test = mean_absolute_error(y_test, model.predict(X_test))
print(f'Test Set MAE: {mae_test}')

Training Set MAE: 417.906826082576
Test Set MAE: 898.3399958825249


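- First iteration of the model is off by an average of $898.34 on the test set
- The average car price in our dataset is $13,725.53, so the typical error is about 6.5% of the average price: reasonably on target, but it can be better
- The training MAE ($417.91) is less than half the test MAE, which suggests the model is overfitting; the deep trees (max_depth=20) are a likely contributor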
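
As a quick sanity check on these numbers, the MAE can be put in relative terms. The values below are copied from the outputs above; the variable names `mean_price`, `relative_error`, and `gap_ratio` are just for this sketch.

```python
# Values copied from the printed outputs above
mae_train = 417.906826082576
mae_test = 898.3399958825249
mean_price = 13725.53

# Test error as a share of the average selling price
relative_error = mae_test / mean_price
# How much worse the model does on unseen data than on training data
gap_ratio = mae_test / mae_train

print(f"Test MAE as share of average price: {relative_error:.1%}")
print(f"Test MAE / training MAE: {gap_ratio:.2f}")
```

A ratio well above 1 between test and training MAE is a common sign that the model has fit the training rows more closely than it generalizes.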