# 17 APRIL ASSIGNMENT

Q1. What is Gradient Boosting Regression?

Gradient Boosting Regression is an ensemble machine learning technique. Gradient boosting itself can be applied to both regression and classification problems; in the regression setting, it builds a strong predictive model by combining many weak predictive models, usually decision trees.

The approach works by adding new decision trees to the ensemble one at a time, each focusing on the residuals (the differences between the actual and predicted values) of the current ensemble. In each iteration, a new decision tree is trained to predict the current ensemble's residuals. The predictions of all the trees in the ensemble are then summed to give the final prediction.

Gradient Boosting Regression minimises a loss function by gradient descent in function space: rather than adjusting a fixed set of weights, it adds a new weak model at each step. In each iteration, the gradient of the loss function with respect to the current predictions is computed, and the new weak model is fitted to the negative gradient, so that adding it moves the ensemble in the direction that reduces the error. For the squared-error loss, the negative gradient is simply the residual.
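To make the link between gradients and residuals concrete, the sketch below checks it numerically for the half squared-error loss L = 0.5 * (y - F)^2. The helper names `half_squared_error` and `numerical_gradient` are made up for this illustration, and the five-point dataset is an assumed toy example.

```python
import numpy as np

def half_squared_error(y_true, y_pred):
    """Loss L = 0.5 * sum((y - F)^2), summed over all samples."""
    return 0.5 * np.sum((y_true - y_pred) ** 2)

def numerical_gradient(loss, y_true, y_pred, eps=1e-6):
    """Central finite-difference gradient of the loss w.r.t. each prediction."""
    grad = np.zeros_like(y_pred)
    for i in range(len(y_pred)):
        step = np.zeros_like(y_pred)
        step[i] = eps
        grad[i] = (loss(y_true, y_pred + step) - loss(y_true, y_pred - step)) / (2 * eps)
    return grad

# Assumed toy data: a constant initial prediction equal to the target mean
y_true = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y_pred = np.full_like(y_true, y_true.mean())

negative_gradient = -numerical_gradient(half_squared_error, y_true, y_pred)
residuals = y_true - y_pred

# The negative gradient matches the residuals up to finite-difference error
print(np.allclose(negative_gradient, residuals, atol=1e-4))
```

For other loss functions the same idea applies, but the negative gradient is no longer the plain residual, which is why it is often called a pseudo-residual.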

Gradient Boosting Regression has a number of advantages. It can work with both numerical and categorical features, although many implementations require categorical features to be encoded numerically first. It can capture complex non-linear relationships and interactions between variables. Tree-based implementations also provide built-in feature importance measures, which help identify the most influential features in the prediction process.
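As a small illustration of the built-in feature importance measures, the sketch below fits scikit-learn's `GradientBoostingRegressor` and reads its `feature_importances_` attribute. The synthetic dataset (5 features, 2 of them informative) and all parameter values are assumptions chosen for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Assumed synthetic data: only 2 of the 5 features carry signal
X, y = make_regression(n_samples=200, n_features=5, n_informative=2,
                       noise=0.5, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Impurity-based importances, normalised to sum to 1
importances = model.feature_importances_
for idx in importances.argsort()[::-1]:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

These importances are impurity-based, so they describe how the fitted trees used each feature rather than any causal effect.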

It should be noted, however, that Gradient Boosting Regression is computationally expensive, because the trees are built sequentially, and it usually needs careful hyperparameter tuning to perform well. To prevent overfitting and improve generalisation, regularisation techniques such as restricting tree depth or applying learning rate shrinkage are frequently used.
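The regularisation options mentioned above map onto constructor parameters of scikit-learn's `GradientBoostingRegressor`. The sketch below combines shallow trees, a small learning rate, row subsampling, and early stopping; the dataset and every parameter value are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Assumed noisy synthetic data, split into training and held-out parts
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

regularised = GradientBoostingRegressor(
    n_estimators=500,         # upper bound on the number of trees
    learning_rate=0.05,       # shrinkage: each tree contributes a small step
    max_depth=2,              # shallow trees keep each learner weak
    subsample=0.8,            # fit each tree on a random 80% of the rows
    n_iter_no_change=10,      # stop early if the validation score stalls
    validation_fraction=0.2,  # share of training data held out for that check
    random_state=0,
)
regularised.fit(X_train, y_train)

print("Trees actually fitted:", regularised.n_estimators_)
print("Held-out R-squared:", regularised.score(X_test, y_test))
```

With early stopping enabled, `n_estimators_` reports how many trees were kept, which may be fewer than the `n_estimators` upper bound.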

Overall, Gradient Boosting Regression is a highly effective regression approach that produces accurate predictions by combining the strengths of many weak models.


Q2. Implement a simple gradient boosting algorithm from scratch using Python and NumPy. Use a
simple regression problem as an example and train the model on a small dataset. Evaluate the model's
performance using metrics such as mean squared error and R-squared.

In [2]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

class GradientBoostingRegressor:
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.init_prediction = None
        self.models = []

    def fit(self, X, y):
        # Work in float64 so the in-place updates below are not truncated
        y = np.asarray(y, dtype=np.float64)

        # Start from a constant prediction: the mean of the target variable
        self.init_prediction = np.mean(y)
        y_hat = np.full_like(y, self.init_prediction)

        # Discard trees left over from any previous call to fit
        self.models = []

        for _ in range(self.n_estimators):
            # Compute the residual (the negative gradient of the squared error)
            residual = y - y_hat

            # Fit a decision tree regressor to the residual
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residual)

            # Update y_hat by adding the prediction of the tree scaled by the learning rate
            y_hat += self.learning_rate * tree.predict(X)

            # Store the model for later prediction
            self.models.append(tree)

        return self

    def predict(self, X):
        # Start from the same constant used in fit; starting from zero
        # would shift every prediction by the mean of the training target
        y_hat = np.full(X.shape[0], self.init_prediction)

        for tree in self.models:
            y_hat += self.learning_rate * tree.predict(X)

        return y_hat

# Create a small dataset for demonstration
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

# Initialize and fit the gradient boosting model
gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
gb.fit(X, y)

# Make predictions on the training data
y_pred = gb.predict(X)

# Evaluate the model's performance
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)

print(f"Mean Squared Error: {mse:.3e}")
print(f"R-squared: {r2:.6f}")


Mean Squared Error: 5.644e-09
R-squared: 1.000000


Q3. Experiment with different hyperparameters such as learning rate, number of trees, and tree depth to
optimise the performance of the model. Use grid search or random search to find the best
hyperparameters

In [3]:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import GradientBoostingRegressor

# Generate a regression dataset
X, y = make_regression(n_samples=100, n_features=1, noise=0.2, random_state=42)

# Define the parameter grid for the grid search
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.5],
    'max_depth': [3, 4, 5]
}

# Initialize the gradient boosting regressor
gb = GradientBoostingRegressor()

# Perform grid search
grid_search = GridSearchCV(gb, param_grid, cv=3, scoring='neg_mean_squared_error')
grid_search.fit(X, y)

# Get the best hyperparameters and the corresponding model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate the best model on the training data
# (an optimistic estimate, since no held-out set is used here)
y_pred = best_model.predict(X)
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)

print("Best Hyperparameters:", best_params)
print("Mean Squared Error:", mse)
print("R-squared:", r2)


Best Hyperparameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}
Mean Squared Error: 0.004608510740488687
R-squared: 0.9999967690220585
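The scores above are computed on the same data that was used for the search, so they tend to overstate how well the model generalises. A hedged variant of the same experiment is sketched below: the grid search runs on a training split only and the best model is scored on a held-out test split. The split ratio and the smaller parameter grid are assumptions made to keep the example quick.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=100, n_features=1, noise=0.2, random_state=42)

# Hold out 25% of the data; the search never sees it
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# A reduced grid (assumed values) to keep the search fast
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

# GridSearchCV refits the best model on the full training split by default
y_test_pred = search.predict(X_test)
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

print("Best Hyperparameters:", search.best_params_)
print("Held-out Mean Squared Error:", test_mse)
print("Held-out R-squared:", test_r2)
```

For random search, `RandomizedSearchCV` from the same module accepts parameter distributions in place of a fixed grid.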


Q4. What is a weak learner in Gradient Boosting?

In the context of Gradient Boosting, a weak learner is a simple, low-complexity model that serves as the ensemble's base model. Weak learners are most often decision trees, particularly shallow ones with few levels or nodes. These trees are referred to as "weak" because, on their own, they have limited predictive power and high bias.

The goal of using weak learners in Gradient Boosting is to combine them iteratively into a strong prediction model. Each weak learner in the ensemble is trained on the residuals, or errors, of the ensemble built so far. By focusing on the residuals, the weak learners try to capture patterns that the previous models missed.

Gradient Boosting adds a new weak learner to the ensemble with each iteration, and its predictions are added to those of the previous weak learners. The accumulation of these weak learners gradually improves the ensemble's prediction accuracy, mainly by reducing bias; variance is kept in check through regularisation such as shrinkage and shallow trees. Learning consists of fitting each new weak learner so that adding it, scaled by the learning rate, reduces the loss function.

There are various advantages to using weak learners in Gradient Boosting. First, weak learners are computationally cheap and can be trained quickly, which keeps each boosting iteration inexpensive. Second, the combination of many weak learners can model complicated relationships and interactions in the data, enhancing the model's overall predictive capacity. Finally, keeping each learner weak acts as a form of regularisation, which helps limit overfitting and improve generalisation.

It's worth mentioning that the choice of weak learner can differ depending on how Gradient Boosting is used. While decision trees are the most common weak learners, alternative algorithms, such as linear models or shallow neural networks, can also be used depending on the task and the implementation.
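The gap between a single weak learner and a boosted ensemble of them can be seen in a small experiment. The sketch below compares one depth-1 tree (a decision stump) with 200 boosted stumps on an assumed noisy sine curve; the data and parameter values are illustrative only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

# Assumed toy data: a noisy sine curve that a single split cannot follow
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# One weak learner on its own: a decision stump
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)

# The same kind of weak learner, boosted 200 times
boosted = GradientBoostingRegressor(
    n_estimators=200, max_depth=1, learning_rate=0.1, random_state=0
).fit(X, y)

stump_r2 = stump.score(X, y)
boosted_r2 = boosted.score(X, y)

print(f"Single stump R-squared:   {stump_r2:.3f}")
print(f"Boosted stumps R-squared: {boosted_r2:.3f}")
```

Both scores are measured on the training data, so they show how much structure each model can represent rather than how well it generalises.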


Q5. What is the intuition behind the Gradient Boosting algorithm?

The intuition behind Gradient Boosting is to build a strong predictive model iteratively by combining many weak models, such that each new model concentrates on the errors, or residuals, of the preceding models. Here's a step-by-step breakdown of how Gradient Boosting works:

1. Initialization: The method begins by initialising the predictions using a basic model or a constant value. This first prediction is frequently set to the target variable's average or median.

2. Calculation of Residuals: The method computes residuals, or errors, by subtracting the current predictions from the actual target values. These residuals show the amount of information that the current model has not captured.

3. Building Weak Models: To predict the residuals, a weak model, usually a decision tree with a small depth, is trained. The weak model is fitted to the data features and learns to capture the patterns in the residuals.

4. Updating Predictions: Predictions are updated by adding the predictions of the weak model, usually scaled by a learning rate, to the existing predictions. The goal of this update is to correct the predictions using the information supplied by the weak model.

5. Iterative Process: Steps 2 to 4 are repeated, with each new weak model focused on the residuals of the ensemble built so far. In each iteration, the method emphasises the cases that are still predicted poorly.

6. Combining Weak Models: The final prediction is derived by combining the predictions of all the weak models in the ensemble: the initial prediction plus the sum of the weak models' predictions, each scaled by the learning rate.
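The steps above can be traced by hand on a tiny dataset. The sketch below runs three boosting rounds with decision stumps and records the mean squared error after each one; the dataset, the learning rate of 0.5, and the number of rounds are assumptions chosen so the effect is easy to see.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Assumed toy data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
learning_rate = 0.5  # illustrative value

# Initialization: constant prediction equal to the target mean
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]

for round_idx in range(3):
    # Residuals of the current ensemble
    residuals = y - prediction

    # A weak model (decision stump) fitted to the residuals
    stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, residuals)

    # Update the predictions with the scaled output of the weak model
    prediction = prediction + learning_rate * stump.predict(X)

    mse_history.append(np.mean((y - prediction) ** 2))
    print(f"round {round_idx + 1}: residuals = {np.round(y - prediction, 3)}")

print("MSE after each round:", np.round(mse_history, 3))
```

The recorded errors shrink from round to round, which is the behaviour the iterative process is meant to produce.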



Q6. How does Gradient Boosting algorithm build an ensemble of weak learners?

1. Initialization: The algorithm starts by initializing the ensemble with a base model, which can be a simple model or a constant value. This initial prediction is often set as the average or the median of the target variable.

2. Residual (Negative Gradient) Calculation: The algorithm calculates the residuals by subtracting the current predictions from the actual target values. For the squared-error loss, these residuals are exactly the negative gradient of the loss function with respect to the current predictions; for other losses, the negative gradient (the pseudo-residuals) is used instead. It indicates how the predictions should be adjusted to minimize the loss function.

3. Building a Weak Learner: A weak learner, typically a decision tree with shallow depth, is trained to predict the residuals. The weak learner is fit on the features of the data, with the residuals as the target variable, and learns to capture the patterns in them.

4. Updating the Ensemble: The predictions of the weak learner, scaled by a learning rate, are added to the current predictions of the ensemble. The learning rate controls the contribution of each weak learner, preventing rapid changes and encouraging smooth convergence of the algorithm.

5. Iterative Process: Steps 2 to 4 are repeated iteratively. In each iteration, a new weak learner is trained on the current residuals, and its scaled predictions are added to the current predictions of the ensemble. Each new weak learner therefore targets the errors that the ensemble has not yet captured.

6. Final Prediction: The final prediction is obtained by combining the predictions of all the weak learners in the ensemble: the initial prediction plus the sum of every weak learner's prediction, each scaled by the learning rate.
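The way the final prediction is assembled can be checked against scikit-learn's `GradientBoostingRegressor`. The sketch below rebuilds the prediction by hand from the fitted initial estimator (`init_`) and the individual trees (`estimators_`), and uses `staged_predict` to follow the training error as trees are added. It assumes the default squared-error loss; the dataset and parameter values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=100, n_features=1, noise=0.2, random_state=42)

model = GradientBoostingRegressor(
    n_estimators=50, learning_rate=0.1, max_depth=2, random_state=0
).fit(X, y)

# Start from the initial constant prediction (the target mean by default)
manual = model.init_.predict(X).ravel().astype(float)

# Add each tree's prediction, scaled by the learning rate
for stage in model.estimators_:  # each row holds one regression tree
    manual += model.learning_rate * stage[0].predict(X)

print("Matches model.predict:", np.allclose(manual, model.predict(X)))

# Training error of the partial ensembles after 1, 2, ..., 50 trees
staged_mse = [mean_squared_error(y, pred) for pred in model.staged_predict(X)]
print("MSE after first tree:", staged_mse[0])
print("MSE after last tree: ", staged_mse[-1])
```

In this setting every tree is scaled by the same learning rate, so no tree receives an individual weight beyond that common factor.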



Q7. What are the steps involved in constructing the mathematical intuition of Gradient Boosting
algorithm?

Building the mathematical intuition for Gradient Boosting means understanding the technique's essential components and the order in which they are applied. The major steps are:

1. Defining the Loss Function: The procedure begins by choosing an appropriate loss function that measures the difference between the predicted and true values. The mean squared error (MSE) is a commonly used loss function for regression problems, whereas the log loss or exponential loss can be used for classification problems.

2. Initialization: The model is initialised with a constant value or a basic model, which is frequently the average of the target variable for regression tasks or the log-odds for binary classification tasks. This basic model serves as the foundation for subsequent iterations.

3. Gradient Calculation: The gradient of the loss function with respect to the current predictions is computed. The gradient gives the direction and magnitude of the steepest increase in the loss, so its negative gives the direction in which the predictions should move. For regression with squared error, the negative gradient equals the ordinary residual (actual minus predicted); for binary classification with log loss, it equals the difference between the true label and the predicted probability. These values are often called pseudo-residuals.

4. Creating Weak Learners: A weak learner, often a shallow decision tree, is trained to predict the negative gradient of the loss function. The weak learner is fitted to the data features, with the negative gradient serving as the target variable in both the regression and the classification case. In this way the weak learner learns the patterns that the prior models did not capture.

5. Updating Predictions: Predictions are updated by adding the weak learner's predictions, generally scaled by a learning rate (shrinkage factor), to the existing predictions. This update step shifts the predictions in the direction indicated by the negative gradient, with the goal of lowering the loss function. The learning rate governs how much each weak learner contributes to the aggregate prediction, limiting abrupt shifts and promoting steady convergence.

6. Iterative Process: Steps 3–5 are repeated. In each cycle, a new weak learner is trained on the loss function's negative gradient, and its scaled predictions are added to the existing predictions. Iteration by iteration, the algorithm updates the predictions and reduces the loss function.

7. Combining Weak Learners: The final prediction is derived by combining all of the weak learners' predictions in the ensemble: the initial prediction plus the sum of the weak learners' predictions, each scaled by the learning rate. Compared with any individual weak learner, the combined ensemble gives a more accurate prediction.
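The steps above can be summarised in a few equations. The notation is an assumption introduced here for illustration: L is the loss function, F_m the ensemble after m iterations, h_m the m-th weak learner, r_{im} the negative gradient (pseudo-residual) for sample i, ν the learning rate, n the number of samples, and M the number of iterations.

```latex
\begin{aligned}
F_0(x) &= \arg\min_{c} \sum_{i=1}^{n} L(y_i, c)
  && \text{(initialization)} \\
r_{im} &= -\left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F = F_{m-1}},
  \quad i = 1, \dots, n
  && \text{(negative gradient)} \\
h_m &= \text{weak learner fitted to } \{(x_i, r_{im})\}_{i=1}^{n}
  && \text{(weak learner)} \\
F_m(x) &= F_{m-1}(x) + \nu \, h_m(x), \quad m = 1, \dots, M
  && \text{(update)} \\
F_M(x) &= F_0(x) + \nu \sum_{m=1}^{M} h_m(x)
  && \text{(final prediction)}
\end{aligned}
```

For the squared-error loss L(y, F) = ½ (y − F)², the negative gradient reduces to r_{im} = y_i − F_{m−1}(x_i), the ordinary residual, and F_0 is the mean of the targets.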



