## Q1. What is Gradient Boosting Regression?

Gradient Boosting Regression is a machine learning technique for regression tasks. It is an ensemble method that combines the predictions of many weak learners (usually shallow decision trees), trained one after another, into a strong regression model. Gradient Boosting is known for its high predictive accuracy and its ability to model complex, non-linear relationships between the features and the target variable.

## Q2. Implement a simple gradient boosting algorithm from scratch using Python and NumPy. Use a simple regression problem as an example and train the model on a small dataset. Evaluate the model's performance using metrics such as mean squared error and R-squared.

In [1]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error

In [2]:
X,y = make_regression(n_samples=1000,n_features=5,n_informative=3,random_state=42,shuffle=False)

X.shape , y.shape

((1000, 5), (1000,))

In [3]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=42)

X_train.shape , X_test.shape , y_train.shape , y_test.shape

((750, 5), (250, 5), (750,), (250,))

In [4]:
# Fit a gradient boosting model with scikit-learn's default hyperparameters
gb = GradientBoostingRegressor()
gb.fit(X_train,y_train)
# Evaluate on the held-out test set
y_pred = gb.predict(X_test)
print(f"R-squared : {r2_score(y_test,y_pred)}")
print(f"Mean Square Error : {mean_squared_error(y_test,y_pred)}")
print(f"Mean Absolute Error : {mean_absolute_error(y_test,y_pred)}")

R-squared : 0.9897602727854303
Mean Square Error : 28.72800275208031
Mean Absolute Error : 4.09290914102256
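
The cells above use scikit-learn's `GradientBoostingRegressor`. Since the question asks for a from-scratch version, here is a minimal sketch using only NumPy. It assumes squared-error loss and depth-1 trees (stumps) as weak learners. The helper names (`fit_stump`, `fit_gradient_boosting`, and so on) and the small synthetic dataset are illustrative choices, not part of the original notebook.

```python
import numpy as np

def fit_stump(X, residuals):
    """Return (feature, threshold, left_value, right_value) for the best single split."""
    n_samples, n_features = X.shape
    mean = residuals.mean()
    best, best_gain = (0, np.inf, mean, mean), -np.inf  # fallback: no split at all
    n_left = np.arange(1, n_samples)  # number of samples on the left for each split position
    for j in range(n_features):
        order = np.argsort(X[:, j])
        x_sorted, r_sorted = X[order, j], residuals[order]
        left_sum = np.cumsum(r_sorted)[:-1]
        right_sum = r_sorted.sum() - left_sum
        # Minimising the squared error of a split is equivalent to maximising this term.
        gain = left_sum ** 2 / n_left + right_sum ** 2 / (n_samples - n_left)
        gain[x_sorted[1:] == x_sorted[:-1]] = -np.inf  # cannot split between equal values
        i = int(np.argmax(gain))
        if gain[i] > best_gain:
            best_gain = gain[i]
            threshold = 0.5 * (x_sorted[i] + x_sorted[i + 1])
            best = (j, threshold,
                    left_sum[i] / n_left[i],
                    right_sum[i] / (n_samples - n_left[i]))
    return best

def predict_stump(stump, X):
    feature, threshold, left_value, right_value = stump
    return np.where(X[:, feature] <= threshold, left_value, right_value)

def fit_gradient_boosting(X, y, n_estimators=200, learning_rate=0.1):
    initial = y.mean()                         # initial prediction: mean of the target
    prediction = np.full(y.shape, initial)
    stumps = []
    for _ in range(n_estimators):
        residuals = y - prediction             # negative gradient of squared error
        stump = fit_stump(X, residuals)        # weak learner fit to the residuals
        prediction += learning_rate * predict_stump(stump, X)
        stumps.append(stump)
    return initial, learning_rate, stumps

def predict_gradient_boosting(model, X):
    initial, learning_rate, stumps = model
    prediction = np.full(X.shape[0], initial)
    for stump in stumps:
        prediction += learning_rate * predict_stump(stump, X)
    return prediction

def compute_mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def compute_r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Small synthetic regression problem (an assumption made for illustration).
rng = np.random.default_rng(42)
X_small = rng.uniform(-3, 3, size=(300, 2))
y_small = np.sin(X_small[:, 0]) + 0.5 * X_small[:, 1] ** 2 + rng.normal(scale=0.1, size=300)
Xs_train, Xs_test = X_small[:225], X_small[225:]
ys_train, ys_test = y_small[:225], y_small[225:]

scratch_model = fit_gradient_boosting(Xs_train, ys_train)
ys_pred = predict_gradient_boosting(scratch_model, Xs_test)
test_mse = compute_mse(ys_test, ys_pred)
test_r2 = compute_r2(ys_test, ys_pred)
print(f"R-squared : {test_r2}")
print(f"Mean Square Error : {test_mse}")
```

Stumps keep the sketch short. A practical implementation would normally use deeper trees and support other loss functions.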


## Q3. Experiment with different hyperparameters such as learning rate, number of trees, and tree depth to optimise the performance of the model. Use grid search or random search to find the best hyperparameters

In [5]:
params = {'learning_rate':[0.01,0.1,0.5,1] ,
          'n_estimators': [100,200,300],
          'max_depth': [2,3,5]
         }

In [6]:
clf = GridSearchCV(GradientBoostingRegressor(),param_grid = params,cv=5,verbose=3,scoring="r2")
clf.fit(X_train,y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV 1/5] END learning_rate=0.01, max_depth=2, n_estimators=100;, score=0.623 total time=   0.0s
[CV 2/5] END learning_rate=0.01, max_depth=2, n_estimators=100;, score=0.671 total time=   0.0s
[CV 3/5] END learning_rate=0.01, max_depth=2, n_estimators=100;, score=0.631 total time=   0.0s
[CV 4/5] END learning_rate=0.01, max_depth=2, n_estimators=100;, score=0.647 total time=   0.0s
[CV 5/5] END learning_rate=0.01, max_depth=2, n_estimators=100;, score=0.608 total time=   0.0s
[CV 1/5] END learning_rate=0.01, max_depth=2, n_estimators=200;, score=0.801 total time=   0.1s
[CV 2/5] END learning_rate=0.01, max_depth=2, n_estimators=200;, score=0.843 total time=   0.1s
[CV 3/5] END learning_rate=0.01, max_depth=2, n_estimators=200;, score=0.812 total time=   0.1s
[CV 4/5] END learning_rate=0.01, max_depth=2, n_estimators=200;, score=0.816 total time=   0.1s
[CV 5/5] END learning_rate=0.01, max_depth=2, n_estimators=200;, score=0.7

[CV 2/5] END learning_rate=0.1, max_depth=5, n_estimators=300;, score=0.983 total time=   0.3s
[CV 3/5] END learning_rate=0.1, max_depth=5, n_estimators=300;, score=0.985 total time=   0.3s
[CV 4/5] END learning_rate=0.1, max_depth=5, n_estimators=300;, score=0.981 total time=   0.3s
[CV 5/5] END learning_rate=0.1, max_depth=5, n_estimators=300;, score=0.983 total time=   0.3s
[CV 1/5] END learning_rate=0.5, max_depth=2, n_estimators=100;, score=0.985 total time=   0.0s
[CV 2/5] END learning_rate=0.5, max_depth=2, n_estimators=100;, score=0.984 total time=   0.0s
[CV 3/5] END learning_rate=0.5, max_depth=2, n_estimators=100;, score=0.980 total time=   0.0s
[CV 4/5] END learning_rate=0.5, max_depth=2, n_estimators=100;, score=0.986 total time=   0.0s
[CV 5/5] END learning_rate=0.5, max_depth=2, n_estimators=100;, score=0.984 total time=   0.0s
[CV 1/5] END learning_rate=0.5, max_depth=2, n_estimators=200;, score=0.988 total time=   0.1s
[CV 2/5] END learning_rate=0.5, max_depth=2, n_est

[CV 5/5] END learning_rate=1, max_depth=5, n_estimators=200;, score=0.875 total time=   0.2s
[CV 1/5] END learning_rate=1, max_depth=5, n_estimators=300;, score=0.904 total time=   0.3s
[CV 2/5] END learning_rate=1, max_depth=5, n_estimators=300;, score=0.879 total time=   0.3s
[CV 3/5] END learning_rate=1, max_depth=5, n_estimators=300;, score=0.924 total time=   0.3s
[CV 4/5] END learning_rate=1, max_depth=5, n_estimators=300;, score=0.881 total time=   0.3s
[CV 5/5] END learning_rate=1, max_depth=5, n_estimators=300;, score=0.873 total time=   0.3s


In [7]:
clf.best_params_

{'learning_rate': 0.1, 'max_depth': 2, 'n_estimators': 300}

In [8]:
clf.best_score_

0.990703793295617

In [9]:
# Refit a model with the best hyperparameters found by the grid search
model = GradientBoostingRegressor(**clf.best_params_)
model.fit(X_train,y_train)
# Evaluate the tuned model on the held-out test set
y_pred = model.predict(X_test)
print(f"R-squared : {r2_score(y_test,y_pred)}")
print(f"Mean Square Error : {mean_squared_error(y_test,y_pred)}")
print(f"Mean Absolute Error : {mean_absolute_error(y_test,y_pred)}")

R-squared : 0.99338315831917
Mean Square Error : 18.56383886345091
Mean Absolute Error : 3.375041972273857


### Observation
- After hyperparameter tuning on the same dataset, the test R-squared increased from about 0.990 to 0.993.
- The best hyperparameters are 'learning_rate': 0.1, 'max_depth': 2, 'n_estimators': 300.
- MSE fell from about 28.7 to 18.6 and MAE from about 4.09 to 3.38 after hyperparameter tuning.
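
The question also mentions random search. As a hedged sketch, the same tuning can be done with `RandomizedSearchCV`, which samples a fixed number of candidates from distributions instead of trying every grid point. The ranges below mirror the grid above. The choice of `n_iter=10`, the `loguniform` prior on the learning rate, and the fixed `random_state` values are assumptions made for illustration.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Same data and split as in the cells above.
X, y = make_regression(n_samples=1000, n_features=5, n_informative=3,
                       random_state=42, shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

param_distributions = {
    "learning_rate": loguniform(0.01, 1.0),  # sampled on a log scale
    "n_estimators": randint(100, 301),       # integers 100..300 (upper bound exclusive)
    "max_depth": randint(2, 6),              # integers 2..5
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,          # number of sampled candidates
    cv=5,
    scoring="r2",
    random_state=42,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)
print(search.score(X_test, y_test))  # R-squared of the refit best model on the test set
```

Random search covers continuous ranges that a grid cannot. This is mainly useful for the learning rate, where good values often fall between grid points.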

## Q4. What is a weak learner in Gradient Boosting?

In Gradient Boosting, a weak learner, also known as a base learner or base model, is a simple model that performs only modestly on its own and is used as a building block of the ensemble. Weak learners are typically decision trees of limited depth ("shallow trees"); a tree with a single split is called a "stump."

The key characteristics of a weak learner are:

1. ***Weak learners are intentionally kept simple and have limited complexity. For decision trees, this means they are shallow and have only a few levels or nodes.***

2. ***Weak learners individually may not perform well on the training data. They have limited ability to capture complex relationships in the data.***

3. ***Each weak learner is fit to the errors (residuals) of the current ensemble rather than to the original target, so it concentrates on what the ensemble still predicts poorly.***

The role of weak learners in Gradient Boosting is crucial. The algorithm combines the predictions of multiple weak learners in a sequential manner, with each learner aiming to correct the errors or residuals made by the previous ensemble. The cumulative effect of adding multiple weak learners is that the ensemble becomes a strong learner that can capture complex patterns and achieve high predictive accuracy.

Gradient Boosting iteratively fits a weak learner to the residuals of the previous ensemble. The weak learner's job is to find patterns in the data that the ensemble has not yet captured. These patterns correspond to the remaining errors in the predictions. Because each new learner targets the errors that remain, the ensemble gradually reduces them and improves its overall performance.
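
To make the "weak alone, strong together" point concrete, the following sketch compares a single depth-1 tree with a boosted ensemble of depth-1 trees on the same synthetic data used earlier. The specific settings (300 stumps, learning rate 0.1, fixed `random_state`) are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=5, n_informative=3,
                       random_state=42, shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# A single stump: one split, two possible predicted values.
stump = DecisionTreeRegressor(max_depth=1, random_state=42).fit(X_train, y_train)

# An ensemble of 300 stumps, each fit to the residuals of the ensemble so far.
ensemble = GradientBoostingRegressor(max_depth=1, n_estimators=300,
                                     learning_rate=0.1, random_state=42).fit(X_train, y_train)

stump_r2 = stump.score(X_test, y_test)        # R-squared of the weak learner alone
ensemble_r2 = ensemble.score(X_test, y_test)  # R-squared of the boosted ensemble
print(f"Single stump R-squared     : {stump_r2}")
print(f"Boosted stumps R-squared   : {ensemble_r2}")
```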

## Q5. What is the intuition behind the Gradient Boosting algorithm?

- Gradient Boosting is an ensemble learning technique that combines the predictions of multiple weak learners (usually shallow decision trees) to create a strong and accurate model. The idea is that by combining the outputs of these weak learners, the ensemble can overcome the limitations of individual models and capture complex patterns in the data.

- Gradient Boosting builds the ensemble sequentially, with each weak learner focusing on correcting the errors made by the previous ensemble. In other words, it learns from the mistakes of the ensemble and gradually reduces those mistakes.

- The "Gradient" in Gradient Boosting refers to gradient descent on a loss function, carried out in the space of predictions. In each iteration, a weak learner is fit to the negative gradient of the loss with respect to the current ensemble's predictions (the "pseudo-residuals"). For squared error, these are simply the residuals (true value minus current prediction). The weak learner is thus "guided" by the gradient of the loss function to improve the predictions.

-  After training, the weak learner's predictions are added to those of the previous ensemble, scaled by a weight. In the general formulation, the weight is a step size chosen to reduce the loss, further shrunk by the learning rate; in scikit-learn's `GradientBoostingRegressor`, each tree's contribution is scaled by `learning_rate`. Because every new learner is fit to the current errors, the updates are largest where the ensemble's predictions are still poor.

- Gradient Boosting includes regularization techniques, such as the learning rate, to control the step size and prevent overfitting. The learning rate determines how much each weak learner's prediction is added to the ensemble's predictions. Smaller learning rates lead to more conservative updates and reduce the risk of overfitting.

- The iterative nature of Gradient Boosting allows it to adapt and improve its predictions over time. It continues to add weak learners until a stopping criterion is met, such as reaching a maximum number of iterations or achieving a satisfactory level of performance. The final ensemble is a strong learner capable of achieving high predictive accuracy.

In essence, Gradient Boosting is a "learning from mistakes" approach that builds a strong model by iteratively correcting the errors of the previous ensemble. It leverages the power of many weak learners to create a robust and accurate predictive model, making it one of the most popular and effective machine learning algorithms for a wide range of tasks, including regression and classification.
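
The link between "gradient" and "residuals" can be checked numerically. The sketch below assumes the loss is half the sum of squared errors, then compares a finite-difference gradient with respect to the predictions against the residuals. The variable names and random data are illustrative.

```python
import numpy as np

def half_squared_error(y, f):
    """Loss L(y, f) = 0.5 * sum((y - f)^2), as a function of the predictions f."""
    return 0.5 * np.sum((y - f) ** 2)

rng = np.random.default_rng(0)
y_true = rng.normal(size=5)    # targets
f_current = rng.normal(size=5) # current ensemble predictions

# Central finite differences: dL/df_i for each prediction f_i.
eps = 1e-6
numeric_grad = np.empty_like(f_current)
for i in range(f_current.size):
    step = np.zeros_like(f_current)
    step[i] = eps
    numeric_grad[i] = (half_squared_error(y_true, f_current + step)
                       - half_squared_error(y_true, f_current - step)) / (2 * eps)

residuals = y_true - f_current

# The negative gradient should match the residuals up to numerical error.
print(np.allclose(-numeric_grad, residuals, atol=1e-6))
```

For other differentiable losses the negative gradient is no longer the plain residual, which is why the more general term "pseudo-residual" is used.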

## Q6. How does Gradient Boosting algorithm build an ensemble of weak learners?

The Gradient Boosting algorithm builds an ensemble of weak learners in a sequential manner. The steps include :

1. **Initialization:**
   - The process starts with an initial prediction, often a simple constant such as the mean of the target variable for regression problems, or a constant based on the class frequencies (such as the log-odds) for classification problems. This initial prediction serves as the starting point for building the ensemble.

<br>

2. **Iteration (Sequential Construction):**
   - In each iteration, a new weak learner (often a shallow decision tree; a tree with a single split is called a "stump") is trained on the training data.
   - The weak learner's primary task is to fit the residuals (the differences between the true target values and the current ensemble's predictions) from the previous iteration. These residuals represent the errors made by the current ensemble.
   - By fitting the residuals, the weak learner learns a correction that reduces the loss of the ensemble.

<br>

3. **Weighted Contribution:**
   - Once the new weak learner is trained, its predictions are combined with the predictions from the previous ensemble. However, these contributions are weighted.
   - In the general formulation, the weight is a step size chosen to minimize the loss function of the updated ensemble, and it is then shrunk by the learning rate. For squared error with regression trees, the leaf values are already the means of the residuals in each leaf, so in practice the learning rate is the only scaling applied.

<br>

4. **Update Ensemble Predictions:**
   - The predictions from the new weak learner, scaled by its weight, are added to the ensemble's predictions from the previous iterations.
   - This update improves the ensemble's predictions, gradually reducing the errors made on the training data.

<br>

5. **Iterative Process:**
   - Steps 2 to 4 are repeated for a fixed number of iterations (controlled by the `n_estimators` hyperparameter) or until a stopping criterion is met. The stopping criterion can be based on the magnitude of residuals or other performance metrics.

<br>

6. **Final Ensemble:**
   - The final ensemble is the cumulative effect of all the weak learners' contributions. It represents the ensemble's predictions after all iterations.

<br>

7. **Regularization (Learning Rate):**
   - Gradient Boosting often includes a regularization parameter called the "learning rate" (commonly written ν or η; `learning_rate` in scikit-learn). The learning rate controls the step size of each update, ensuring that the ensemble learning process is not too aggressive. Smaller learning rates lead to more conservative updates and help prevent overfitting, usually at the cost of needing more trees.

<br>

The key idea behind Gradient Boosting is to iteratively improve the ensemble's predictions by adding weak learners that focus on correcting the errors made by the previous ensemble. By emphasizing the examples that are challenging to predict, Gradient Boosting creates a strong learner capable of capturing complex relationships in the data and achieving high predictive accuracy.
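
The sequential construction described above can be sketched as an explicit loop. This version assumes squared-error loss and uses scikit-learn's `DecisionTreeRegressor` as the weak learner. The toy data, `max_depth=2`, 50 iterations, and the `predict` helper are illustrative assumptions, and the loop is not meant to reproduce `GradientBoostingRegressor` exactly.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy one-dimensional regression problem (illustrative).
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_estimators = 50

# Step 1: initialise with the mean of the target.
initial = y_toy.mean()
prediction = np.full(y_toy.shape, initial)

trees, train_mse = [], []
for _ in range(n_estimators):
    # Step 2: fit a shallow tree to the residuals of the current ensemble.
    residuals = y_toy - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residuals)
    # Steps 3-4: add the tree's predictions, scaled by the learning rate.
    prediction += learning_rate * tree.predict(X_toy)
    trees.append(tree)
    train_mse.append(np.mean((y_toy - prediction) ** 2))

def predict(X_new):
    """Final ensemble: initial constant plus the scaled sum of all tree predictions."""
    return initial + learning_rate * sum(tree.predict(X_new) for tree in trees)

print(f"Training MSE after 1 tree   : {train_mse[0]}")
print(f"Training MSE after {n_estimators} trees : {train_mse[-1]}")
```

With squared error and a learning rate between 0 and 1, each update cannot increase the training error, so `train_mse` is non-increasing. Error on held-out data can still rise if too many trees are added.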

## Q7. What are the steps involved in constructing the mathematical intuition of Gradient Boosting algorithm?

Gradient Boosting is a popular machine learning algorithm used for both regression and classification tasks. 

The steps involved in constructing the mathematical intuition of Gradient Boosting:

1. ***Define the problem: Define the problem you want to solve using Gradient Boosting, whether it's a regression or classification task.***

2. ***Define the loss function: The loss function is a measure of how well the algorithm is doing in fitting the training data. In Gradient Boosting, we typically use a differentiable loss function such as mean squared error for regression or log loss for classification.***

3. ***Create an initial model: Create an initial model to make predictions. This model can be as simple as the mean of the target variable or a linear regression model.***

4. ***Calculate the residual errors: Calculate the residual errors by subtracting the predictions of the initial model from the actual values of the target variable.***

5. ***Train a new model on the residual errors: Train a new model on the residual errors from the previous step. This model is usually a decision tree with a fixed depth.***

6. ***Add the predictions of the new model to the previous predictions: Add the predictions of the new model, usually scaled by a learning rate, to the previous predictions to update the model. This process is called boosting because we are boosting the performance of the model by adding new models to it.***

7. ***Repeat steps 4 to 6 until convergence: Repeat steps 4 to 6 until the model converges or until a stopping criterion is met. The stopping criterion can be a maximum number of models or a threshold for the improvement of the loss function. The maximum depth of the decision trees is a separate hyperparameter that limits the complexity of each model, not a stopping criterion.***

8. ***Make predictions: Use the final model to make predictions on new data.***

Overall, Gradient Boosting works by iteratively improving the model by adding new models to correct the errors of the previous models. This process continues until the model converges or a stopping criterion is met.
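
The steps above can be written compactly. The notation here is an assumption introduced for illustration: $L$ is the loss, $F_m$ the ensemble after $m$ models, $h_m$ the $m$-th weak learner, $\nu$ the learning rate, and $M$ the number of models.

$$
\begin{aligned}
F_0(x) &= \arg\min_{c} \sum_{i=1}^{n} L(y_i, c) \\
r_{im} &= -\left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F = F_{m-1}}, \quad i = 1, \dots, n \\
h_m &= \text{weak learner fit to } \{(x_i, r_{im})\}_{i=1}^{n} \\
F_m(x) &= F_{m-1}(x) + \nu \, h_m(x), \quad m = 1, \dots, M
\end{aligned}
$$

For squared error, $L(y, F) = \tfrac{1}{2}(y - F)^2$, the initial model $F_0$ is the mean of the targets and the pseudo-residuals reduce to the ordinary residuals, $r_{im} = y_i - F_{m-1}(x_i)$, which matches steps 3 and 4. The final model is $F_M(x) = F_0(x) + \nu \sum_{m=1}^{M} h_m(x)$.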