
## Steps of Gradient Boosting with Learning Rate

**Initial Prediction:** The model starts with an initial prediction, often the mean of the target values.

**First Iteration:** The model fits a weak learner to the residuals of the initial prediction. The predictions of this weak learner are multiplied by the learning rate and added to the initial prediction.

**Subsequent Iterations:** Each new weak learner is fitted to the residuals of the current model. The predictions of these weak learners are multiplied by the learning rate and added to the current model's predictions to improve accuracy.

**Final Prediction:** After a specified number of iterations, the final prediction is the initial prediction plus the sum of all the weak learners' predictions, each scaled by the learning rate.
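
The four steps above can be sketched by hand for squared-error loss. This is a minimal illustration, not scikit-learn's internal implementation: the function name `boost_stumps` is hypothetical, and the sketch assumes depth-1 `DecisionTreeRegressor` stumps with a learning rate of 1 on the four-row salary data used in this notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def boost_stumps(X, y, n_estimators=7, learning_rate=1.0):
    """Hand-rolled gradient boosting for squared-error loss (illustrative only)."""
    # Initial prediction: the mean of the target values
    current = np.full(len(y), y.mean())
    staged = []
    for _ in range(n_estimators):
        # Fit a weak learner (a stump) to the residuals of the current model
        residuals = y - current
        stump = DecisionTreeRegressor(max_depth=1, random_state=42)
        stump.fit(X, residuals)
        # Scale the stump's prediction by the learning rate and add it
        current = current + learning_rate * stump.predict(X)
        staged.append(current.copy())
    return staged


# Columns: Education Level, Age; target: Salary
X_demo = np.array([[1, 20], [3, 64], [4, 55], [3, 35]], dtype=float)
y_demo = np.array([45, 60, 70, 90], dtype=float)

staged = boost_stumps(X_demo, y_demo)
mse_per_stage = [float(np.mean((y_demo - pred) ** 2)) for pred in staged]

print(np.round(staged[0], 2))       # predictions after the first stage
print(np.round(mse_per_stage, 3))   # training MSE after each stage
```

With a learning rate of 1, the first stage predicts 45 for the first row and about 73.33 for the other three, which agrees with the Prediction Stage 1 column in the output further down. The training MSE never increases from one stage to the next, because each stump is a least-squares fit to the current residuals.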

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

education_level = [1, 3, 4, 3]
Age = [20, 64, 55, 35]
Salary = [45, 60, 70, 90]

df = pd.DataFrame({
    'Education Level': education_level,
    'Age': Age,
    'Salary': Salary
})

X = df[['Education Level', 'Age']]
y = df['Salary']





In [None]:
# Initialize Gradient Boosting Regressor
gb = GradientBoostingRegressor(n_estimators=7, learning_rate=1, max_depth=1, random_state=42)
gb.fit(X, y)

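# Get predictions and residuals after each boosting stage
n_samples, n_stages = len(y), gb.n_estimators
predictions = np.zeros((n_samples, n_stages))
residuals = np.zeros((n_samples, n_stages))

# staged_predict yields the ensemble's prediction after 1, 2, ..., n_estimators trees
for i, y_pred in enumerate(gb.staged_predict(X)):
    predictions[:, i] = y_pred
    residuals[:, i] = y - y_pred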


# Create a DataFrame to store the data
data = pd.DataFrame({
    'Education Level': education_level,
    'Age': Age,
    'Salary': Salary,
    'Prediction Stage 1': predictions[:, 0],
    'Residual Stage 1': residuals[:, 0],
    'Prediction Stage 2': predictions[:, 1],
    'Residual Stage 2': residuals[:, 1],
    'Prediction Stage 3': predictions[:, 2],
    'Residual Stage 3': residuals[:, 2],
    'Prediction Stage 4': predictions[:, 3],
    'Residual Stage 4': residuals[:, 3],
    'Prediction Stage 5': predictions[:, 4],
    'Residual Stage 5': residuals[:, 4],
    'Prediction Stage 6': predictions[:, 5],
    'Residual Stage 6': residuals[:, 5],
    'Prediction Stage 7': predictions[:, 6],
    'Residual Stage 7': residuals[:, 6],
    'Final Prediction': gb.predict(X)
})

data

# print(data)
# # Save the DataFrame to an Excel file
# file_path = '/content/drive/MyDrive/Colab Notebooks/Gradient_Boosting_Example.xlsx'
# data.to_excel(file_path, index=False)

# print(f"Excel file generated and saved to {file_path}")

Unnamed: 0,Education Level,Age,Salary,Prediction Stage 1,Residual Stage 1,Prediction Stage 2,Residual Stage 2,Prediction Stage 3,Residual Stage 3,Prediction Stage 4,Residual Stage 4,Prediction Stage 5,Residual Stage 5,Prediction Stage 6,Residual Stage 6,Prediction Stage 7,Residual Stage 7,Final Prediction
0,1,20,45,45.0,0.0,53.333333,-8.333333,45.0,0.0,47.592593,-2.592593,45.0,0.0,46.049383,-1.049383,45.0,0.0,45.0
1,3,64,60,73.333333,-13.333333,65.0,-5.0,67.777778,-7.777778,60.0,0.0,60.864198,-0.864198,59.814815,0.185185,60.164609,-0.164609,60.164609
2,4,55,70,73.333333,-3.333333,65.0,5.0,67.777778,2.222222,70.37037,-0.37037,71.234568,-1.234568,70.185185,-0.185185,70.534979,-0.534979,70.534979
3,3,35,90,73.333333,16.666667,81.666667,8.333333,84.444444,5.555556,87.037037,2.962963,87.901235,2.098765,88.950617,1.049383,89.300412,0.699588,89.300412



In Gradient Boosting, the learning_rate is a hyperparameter that you set before training the model, rather than being calculated during the training process. It is a scaling factor applied to each step of the boosting process to control the contribution of each tree to the final model. The learning_rate essentially controls the weight of each weak learner in the ensemble.

**Key Points About learning_rate**

  * **Fixed Value:** Typically, the learning_rate is a fixed value chosen by the user before training the model. It does not change during training.

  * **Impact on Model:** A smaller learning_rate shrinks each tree's contribution, so the model needs more boosting stages (and therefore more training time) to fit the data. The smaller steps often generalize better and reduce the risk of overfitting.

  * **Trade-Off with n_estimators:** There is usually a trade-off between the learning_rate and the number of boosting stages (n_estimators). A lower learning_rate often requires a larger number of boosting stages to maintain model performance.
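
The role of learning_rate can be written down compactly. The notation below is an assumption introduced for illustration (it is not taken from the notebook), and it covers squared-error loss only: $\nu$ is learning_rate, $M$ is n_estimators, and $h_m$ is the tree fitted to the residuals at stage $m$.

```latex
\begin{aligned}
F_0(x) &= \bar{y} \\
r_i^{(m)} &= y_i - F_{m-1}(x_i) \\
F_m(x) &= F_{m-1}(x) + \nu \, h_m(x), \qquad m = 1, \dots, M \\
F_M(x) &= \bar{y} + \nu \sum_{m=1}^{M} h_m(x)
\end{aligned}
```

Because every tree enters the sum multiplied by the same $\nu$, a smaller $\nu$ means each stage corrects only a small part of the remaining residual, which is why a lower learning_rate usually needs a larger $M$.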

## Explain the parameter learning_rate

**Explanation of the Example below**

**Dataset:** We generate a simple dataset with a linear relationship and some noise.

**Gradient Boosting Regressor:** Two models are created with different learning_rate values (0.01 and 0.1).

**Fitting Models:** Both models are trained on the training data.

**Predictions and Evaluation:** Predictions are made on the test set, and the Mean Squared Error (MSE) is calculated to compare the performance.

**Trade-Off and Choosing learning_rate**

**Lower learning_rate:** Typically requires more trees (n_estimators) to achieve the same performance, leading to longer training times but potentially better generalization.

**Higher learning_rate:** Each tree has a larger impact, potentially leading to faster convergence but higher risk of overfitting.

**Choosing the learning_rate:** Choosing the right learning_rate often involves experimenting with different values and performing cross-validation to find the best balance between training time and model performance. Common values to try range from 0.01 to 0.3.

By understanding and properly tuning the learning_rate, you can significantly improve the performance and robustness of a Gradient Boosting model.

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Example dataset
np.random.seed(42)
X = np.random.rand(100, 1) * 10
y = 2 * X.squeeze() + 1 + np.random.randn(100) * 2  # Linear relation with noise

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Gradient Boosting Regressor with different learning rates
gb_low_lr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.01, max_depth=1, random_state=42)
gb_high_lr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=1, random_state=42)

# Fit the models
gb_low_lr.fit(X_train, y_train)
gb_high_lr.fit(X_train, y_train)

# Predict
y_pred_low_lr = gb_low_lr.predict(X_test)
y_pred_high_lr = gb_high_lr.predict(X_test)

# Compare the performance
from sklearn.metrics import mean_squared_error
print(f"Low learning rate MSE: {mean_squared_error(y_test, y_pred_low_lr)}")
print(f"High learning rate MSE: {mean_squared_error(y_test, y_pred_high_lr)}")


Low learning rate MSE: 10.745529378608582
High learning rate MSE: 2.707721931894042
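
One plausible reading of the result above is that 100 depth-1 trees are too few for learning_rate=0.01, so that model is still underfitting. The sketch below checks this under the same synthetic data and seeds: it trains a single low-rate model with 1000 trees and uses `staged_predict` to read off the test MSE at several ensemble sizes. The variable names are chosen for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same synthetic data and split as the example above
np.random.seed(42)
X_lr = np.random.rand(100, 1) * 10
y_lr = 2 * X_lr.squeeze() + 1 + np.random.randn(100) * 2
X_tr, X_te, y_tr, y_te = train_test_split(X_lr, y_lr, test_size=0.2, random_state=42)

# One low-learning-rate model with many more boosting stages
slow = GradientBoostingRegressor(
    n_estimators=1000, learning_rate=0.01, max_depth=1, random_state=42
)
slow.fit(X_tr, y_tr)

# staged_predict yields test predictions after 1, 2, ..., n_estimators trees
staged_mse = [mean_squared_error(y_te, y_pred) for y_pred in slow.staged_predict(X_te)]

for n_trees in (100, 500, 1000):
    print(f"{n_trees} trees at learning_rate=0.01: MSE = {staged_mse[n_trees - 1]:.3f}")
```

Reading the MSE at several stages from one fitted model avoids retraining for every candidate value of n_estimators.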


## Comparison Between Gradient Boosting and XGBoost


### Gradient Boosting (GBM):

**Implementation:** Traditional Gradient Boosting methods are implemented in libraries like scikit-learn.

**Speed:** Generally slower. scikit-learn's GradientBoostingRegressor has no option for multi-threaded training and lacks the system-level optimizations listed for XGBoost below.

**Regularization:** Basic regularization techniques are available.

**Handling Missing Values:** Generally, the data needs to be preprocessed to handle missing values.

**Flexibility:** Supports basic hyperparameter tuning.

### XGBoost (Extreme Gradient Boosting):

**Implementation:** An optimized implementation of gradient boosting designed to be highly efficient, flexible, and portable.

**Speed:** Faster due to optimizations like parallel processing, tree pruning, and efficient memory usage.

**Regularization:** More advanced regularization (L1 & L2) to prevent overfitting.

**Handling Missing Values:** Automatically handles missing values in the dataset.

**Flexibility:** Offers more parameters and better control over the model training process.

**Scalability:** Scales well to large datasets.

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb

# Load dataset
california_housing = fetch_california_housing()
X, y = california_housing.data, california_housing.target

In [None]:


# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize models
gbm = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
xgb_model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Fit models
gbm.fit(X_train, y_train)
xgb_model.fit(X_train, y_train)

# Predict
y_pred_gbm = gbm.predict(X_test)
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate performance
mse_gbm = mean_squared_error(y_test, y_pred_gbm)
mse_xgb = mean_squared_error(y_test, y_pred_xgb)

print(f"Gradient Boosting MSE: {mse_gbm}")
print(f"XGBoost MSE: {mse_xgb}")


Gradient Boosting MSE: 0.2939973248643864
XGBoost MSE: 0.29522676196268116


### Pros and Cons

### Gradient Boosting (GBM):

**Pros:**
* Simple to use with basic regularization.
* Integrated into many libraries (e.g., scikit-learn).

**Cons:**
* Slower and less efficient for large datasets.
* Requires more manual handling of missing values.

### XGBoost:

**Pros:**
* Faster and more efficient due to advanced optimizations.
* Automatically handles missing values.
* Better regularization techniques.
* Scales well with large datasets.

**Cons:**
* More complex and requires understanding of additional parameters.
* Can be overkill for small datasets or simple tasks.

## Key Parameters in Gradient Boosting (GBM):

**n_estimators:**

* Description: The number of boosting stages to be run.
* Effect: Increasing the number of estimators typically improves performance, but also increases training time and can lead to overfitting.

**learning_rate:**

* Description: The contribution of each tree to the final model.
* Effect: A lower learning rate requires more trees to reach the same level of performance. It controls the step size at each iteration while moving toward a minimum of the loss function.

**max_depth:**

* Description: The maximum depth of each individual tree.
* Effect: Limits the number of nodes in the tree, which helps control overfitting. Higher values lead to more complex models.

**min_samples_split:**

* Description: The minimum number of samples required to split an internal node.
* Effect: Helps control overfitting. Lower values allow the model to learn more, but can lead to overfitting.

**min_samples_leaf:**

* Description: The minimum number of samples required to be at a leaf node.
* Effect: Controls overfitting by ensuring a minimum number of observations at each leaf.

**subsample:**

* Description: The fraction of samples to be used for fitting each tree.
* Effect: Helps control overfitting by introducing randomness. Values less than 1.0 lead to Stochastic Gradient Boosting.

**max_features:**

* Description: The number of features to consider when looking for the best split.
* Effect: Reducing the number of features can help prevent overfitting and improve the speed of the algorithm.

**loss:**

* Description: The loss function to be optimized. In current scikit-learn the defaults are 'log_loss' for classification and 'squared_error' for regression (least squares); older releases called these 'deviance' and 'ls'.
* Effect: Defines how the error is calculated and therefore how the model is optimized.

**criterion:**

* Description: The function to measure the quality of a split. The choices are 'friedman_mse' (the default) and 'squared_error'; the older names 'mse' and 'mae' have been removed from recent scikit-learn releases.
* Effect: Influences how the quality of a split is measured.

## Key Parameters of XGBoost


**General Parameters**

***booster:***

* Description: Specifies which booster to use: gbtree, gblinear, or dart.
* Default: gbtree
* Effect: Determines the type of model to be used. 'gbtree' and 'dart' use tree-based models, while 'gblinear' uses linear models.

***nthread (n_jobs):***

* Description: Number of parallel threads used to run XGBoost.
* Default: Maximum number of threads available if not set.
* Effect: Controls the parallelism of the algorithm, potentially speeding up training.

***random_state:***

* Description: Random number seed.
* Effect: Ensures reproducibility of the results.

**Booster Parameters**

**Parameters for Tree Booster (gbtree and dart)**

***eta (learning_rate):***

* Description: Step size shrinkage used to prevent overfitting.
* Default: 0.3
* Range: [0, 1]
* Effect: Smaller values make the model more robust by shrinking the weights of new trees.

***max_depth:***

* Description: Maximum depth of a tree.
* Default: 6
* Range: [0, ∞]
* Effect: Controls overfitting; deeper trees capture more patterns but can lead to overfitting.

***min_child_weight:***

* Description: Minimum sum of instance weight (hessian) needed in a child.
* Default: 1
* Range: [0, ∞]
* Effect: Higher values prevent overfitting by requiring a minimum weight for leaf nodes.

***gamma:***

* Description: Minimum loss reduction required to make a further partition on a leaf node.
* Default: 0
* Range: [0, ∞]
* Effect: Larger values make the algorithm more conservative.

***subsample:***

* Description: Subsample ratio of the training instances.
* Default: 1
* Range: (0, 1]
* Effect: Reduces overfitting by sampling a fraction of the training data.

***colsample_bytree:***

* Description: Subsample ratio of columns when constructing each tree.
* Default: 1
* Range: (0, 1]
* Effect: Reduces overfitting by sampling a fraction of the features.

***colsample_bylevel:***

* Description: Subsample ratio of columns for each level.
* Default: 1
* Range: (0, 1]
* Effect: Reduces overfitting by sampling a fraction of features at each level.

***lambda (reg_lambda):***

* Description: L2 regularization term on weights.
* Default: 1
* Range: [0, ∞]
* Effect: Prevents overfitting by penalizing large weights.

***alpha (reg_alpha):***

* Description: L1 regularization term on weights.
* Default: 0
* Range: [0, ∞]
* Effect: Prevents overfitting by inducing sparsity in the model.

**Parameters for Linear Booster (gblinear)**

***lambda (reg_lambda):***

* Description: L2 regularization term on weights.
* Default: 0
* Range: [0, ∞]
* Effect: Similar to ridge regression.

***alpha (reg_alpha):***

* Description: L1 regularization term on weights.
* Default: 0
* Range: [0, ∞]
* Effect: Similar to Lasso regression.

***updater:***

* Description: Choice of algorithm to fit linear model.
* Options: ‘shotgun’, ‘coord_descent’
* Effect: Controls the optimization algorithm.

**Learning Task Parameters**

***objective:***

* Description: Specifies the learning task and the corresponding objective function.
* Default: 'reg:squarederror' for regression.
* Options: 'reg:squarederror', 'binary:logistic', 'multi:softmax', etc.
* Effect: Defines the type of problem (regression, binary classification, multi-class classification, etc.)

***eval_metric:***

* Description: Evaluation metric(s) for validation data.
* Default: Determined by objective.
* Options: 'rmse', 'mae', 'logloss', 'error', 'merror', etc.
* Effect: Measures the performance of the model during training and evaluation.

***base_score:***

* Description: Initial prediction score of all instances, global bias.
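* Default: 0.5 in older releases; recent XGBoost versions estimate it from the training data for selected objectives when it is not set.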
* Effect: The starting value for the model's predictions.