# 1] What is Gradient Boosting Regression?


### => Gradient Boosting Regression is a popular machine learning technique used for regression tasks. It is an ensemble learning method that combines the predictions of multiple weak learners (typically decision trees) to create a strong predictive model. The term "gradient" in "Gradient Boosting" refers to the optimization approach used during the model training process.

### Here's a high-level overview of how Gradient Boosting Regression works:

## 1) Decision Trees as Weak Learners:
### => In Gradient Boosting Regression, decision trees are commonly used as weak learners. A decision tree is a simple tree-like structure that makes a series of decisions based on input features and eventually predicts a target value.

## 2) Boosting: 
### => Boosting trains a sequence of weak learners iteratively. Each weak learner is trained to correct the errors made by its predecessors, so later trees are driven mainly by the examples that the current ensemble still predicts poorly.

## 3) Residuals:
### => During each iteration, the algorithm calculates the difference between the actual target values and the current predictions on the training data. These differences are known as "residuals"; for loss functions other than squared error, the analogous quantity is the negative gradient of the loss, called the "pseudo-residuals." The next weak learner is then trained to predict these residuals rather than the original target values.

## 4) Learning Rate:
### => To control the contribution of each weak learner, a learning rate (or shrinkage) parameter is used. It scales the contribution of each tree in the ensemble. Lower learning rates usually require more iterations but can result in better generalization.

## 5) Aggregation:
### => After training multiple weak learners, their outputs (each a correction fitted to residuals) are aggregated to make the final prediction. The final prediction of the Gradient Boosting Regression model is the initial prediction plus the sum of the learning-rate-scaled outputs of all the trees; unlike bagging methods such as random forests, the trees are summed rather than averaged.
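
### => The aggregation step can be observed directly. The following is a minimal sketch, assuming scikit-learn's `GradientBoostingRegressor` and a synthetic sine-shaped dataset (both chosen here for illustration, not taken from this notebook). `staged_predict` yields the ensemble's prediction after each added tree.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data (an assumption for illustration): a noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

model = GradientBoostingRegressor(
    n_estimators=50, learning_rate=0.1, max_depth=2, random_state=0
)
model.fit(X_demo, y_demo)

# Training MSE of the running sum after 1, 2, ..., 50 trees
stage_mse = [
    mean_squared_error(y_demo, pred) for pred in model.staged_predict(X_demo)
]
print(f"MSE after 1 tree:   {stage_mse[0]:.4f}")
print(f"MSE after 50 trees: {stage_mse[-1]:.4f}")
```

### => The training error shrinks as trees are added because each tree contributes a small correction to the running sum. Training error alone says nothing about generalization, so a held-out set is still needed to judge the model.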

## Gradient Boosting Regression has several advantages:

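### => Tree-based learners can handle a mix of feature types (e.g., numerical and categorical) with little preprocessing such as scaling, although some implementations (scikit-learn's GradientBoostingRegressor, for example) still require categorical features to be encoded as numbers.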
### => It performs well on a wide range of regression problems.
### => It can automatically handle feature interactions, making it more expressive than simple linear models.
### => With a small learning rate and shallow trees, it is usually less prone to overfitting than a single deep decision tree.
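
### => As a hedged illustration of the point about mixed feature types, the sketch below encodes one categorical column and leaves the numerical column unscaled before fitting scikit-learn's `GradientBoostingRegressor`. The `houses` table and its column names are hypothetical toy data invented for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: one numerical and one categorical feature
houses = pd.DataFrame({
    "area": [50, 80, 120, 65, 95, 150],
    "district": ["north", "south", "north", "east", "south", "east"],
    "price": [100, 180, 240, 120, 210, 290],
})

# Encode only the categorical column; the numerical column passes through unscaled
preprocess = ColumnTransformer(
    [("cat", OrdinalEncoder(), ["district"])],
    remainder="passthrough",
)
pipeline = make_pipeline(
    preprocess, GradientBoostingRegressor(n_estimators=20, random_state=0)
)
pipeline.fit(houses[["area", "district"]], houses["price"])
preds = pipeline.predict(houses[["area", "district"]])
print(preds.round(1))
```

### => `OrdinalEncoder` assigns an arbitrary integer order to the categories. Trees can still split on such codes, but one-hot encoding is a common alternative when that artificial order is a concern.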

# 2] Implement a simple gradient boosting algorithm from scratch using Python and NumPy. Use a simple regression problem as an example and train the model on a small dataset. Evaluate the model's performance using metrics such as mean squared error and R-squared.


In [1]:
import seaborn as sns
import pandas as pd
import numpy as np


In [2]:
df=sns.load_dataset("iris")
df.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [4]:
df.shape

(150, 5)

In [5]:
X=df.drop(columns=["species"])
y=df["species"]

In [6]:
from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder()

In [7]:
encoder.fit_transform(y)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report,accuracy_score

In [9]:
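class GradientBoostingClf:
    # NOTE: despite its name, this class fits one decision tree per class
    # (one-vs-rest). No tree is fitted to residuals, so it does not boost.
    def __init__(self, num_trees, max_depth):
        self.num_trees = num_trees  # stored, but never used below
        self.max_depth = max_depth
        self.binary_classifiers = []

    def fit(self, X, y):
        # Create a binary classifier for each class using the OvR strategy
        self.classes_ = np.unique(y)
        self.binary_classifiers = []  # reset so refitting does not accumulate trees
        for target_class in self.classes_:
            binary_classifier = DecisionTreeClassifier(max_depth=self.max_depth)
            binary_y = np.where(y == target_class, 1, 0)
            binary_classifier.fit(X, binary_y)
            self.binary_classifiers.append(binary_classifier)
        return self

    def predict(self, X):
        # Score every sample with each binary classifier, then pick the
        # class whose classifier gives the highest positive-class probability
        predictions = [clf.predict_proba(X)[:, 1] for clf in self.binary_classifiers]
        return self.classes_[np.argmax(predictions, axis=0)]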

In [10]:
# Convert the categorical target variable to numerical labels
class_mapping = {species: idx for idx, species in enumerate(df['species'].unique())}
y = np.array([class_mapping[label] for label in y])

In [11]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.175,random_state=129)

In [12]:
gb_clf=GradientBoostingClf(num_trees=100,max_depth=5)
gb_clf.fit(X_train,y_train)

In [13]:
y_pred=gb_clf.predict(X_test)

In [14]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.75      1.00      0.86         9
           2       1.00      0.62      0.77         8

    accuracy                           0.89        27
   macro avg       0.92      0.88      0.88        27
weighted avg       0.92      0.89      0.88        27



In [15]:
accuracy_score(y_test,y_pred)

0.8888888888888888
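
### => For the regression setting named in the question, a minimal from-scratch sketch could look like the following. It assumes squared-error loss, uses scikit-learn's `DecisionTreeRegressor` as the weak learner instead of a hand-written tree, and runs on a synthetic dataset. The class name `SimpleGBRegressor` and the helpers `mse` and `r_squared` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class SimpleGBRegressor:
    """Gradient boosting for squared-error loss (illustrative sketch)."""

    def __init__(self, num_trees=100, learning_rate=0.1, max_depth=2):
        self.num_trees = num_trees
        self.learning_rate = learning_rate
        self.max_depth = max_depth

    def fit(self, X, y):
        self.init_ = float(np.mean(y))  # constant initial prediction
        self.trees_ = []
        pred = np.full(len(y), self.init_)
        for _ in range(self.num_trees):
            residuals = y - pred  # negative gradient of 0.5 * (y - pred)^2
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            pred += self.learning_rate * tree.predict(X)
            self.trees_.append(tree)
        return self

    def predict(self, X):
        pred = np.full(X.shape[0], self.init_)
        for tree in self.trees_:
            pred += self.learning_rate * tree.predict(X)
        return pred


def mse(y_true, y_hat):
    return float(np.mean((y_true - y_hat) ** 2))


def r_squared(y_true, y_hat):
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Synthetic regression problem (an assumption): y = x^2 plus Gaussian noise
rng = np.random.default_rng(42)
X_reg = rng.uniform(-3, 3, size=(300, 1))
y_reg = X_reg[:, 0] ** 2 + rng.normal(scale=0.5, size=300)
X_tr, X_te = X_reg[:240], X_reg[240:]
y_tr, y_te = y_reg[:240], y_reg[240:]

gbr = SimpleGBRegressor().fit(X_tr, y_tr)
y_hat = gbr.predict(X_te)
print(f"Test MSE: {mse(y_te, y_hat):.3f}, R^2: {r_squared(y_te, y_hat):.3f}")
```

### => Each round fits a tree to the current residuals and adds a learning-rate-scaled copy of its output to the running prediction, which is the step that distinguishes boosting from fitting independent trees.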

# 3] Experiment with different hyperparameters such as learning rate, number of trees, and tree depth to optimise the performance of the model. Use grid search or random search to find the best hyperparameters


In [16]:
from sklearn.ensemble import GradientBoostingClassifier

In [17]:
clf=GradientBoostingClassifier()

In [18]:
parameter={
    "learning_rate":[0.001,0.01,0.1,],
    "n_estimators":[100,150,200],
    "max_depth":[2,3,4,5]
}

In [21]:
from sklearn.model_selection import GridSearchCV
cv=GridSearchCV(estimator=clf,param_grid=parameter,cv=3,verbose=3)

In [22]:
cv.fit(X_train,y_train)

Fitting 3 folds for each of 36 candidates, totalling 108 fits
[CV 1/3] END learning_rate=0.001, max_depth=2, n_estimators=100;, score=1.000 total time=   0.3s
[CV 2/3] END learning_rate=0.001, max_depth=2, n_estimators=100;, score=0.927 total time=   0.3s
[CV 3/3] END learning_rate=0.001, max_depth=2, n_estimators=100;, score=0.976 total time=   0.3s
[CV 1/3] END learning_rate=0.001, max_depth=2, n_estimators=150;, score=1.000 total time=   0.4s
[CV 2/3] END learning_rate=0.001, max_depth=2, n_estimators=150;, score=0.927 total time=   0.5s
[CV 3/3] END learning_rate=0.001, max_depth=2, n_estimators=150;, score=0.976 total time=   0.4s
[CV 1/3] END learning_rate=0.001, max_depth=2, n_estimators=200;, score=1.000 total time=   0.5s
[CV 2/3] END learning_rate=0.001, max_depth=2, n_estimators=200;, score=0.927 total time=   0.5s
[CV 3/3] END learning_rate=0.001, max_depth=2, n_estimators=200;, score=0.976 total time=   0.5s
[CV 1/3] END learning_rate=0.001, max_depth=3, n_estimators=100;,

In [23]:
cv.best_params_

{'learning_rate': 0.001, 'max_depth': 2, 'n_estimators': 100}
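
### => The question also mentions random search. A hedged sketch with scikit-learn's `RandomizedSearchCV` follows; it loads iris through `sklearn.datasets.load_iris` so that it is self-contained, and the candidate values are illustrative assumptions rather than tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X_iris, y_iris = load_iris(return_X_y=True)

# Candidate values (assumptions for illustration)
param_distributions = {
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "n_estimators": [50, 100],
    "max_depth": [2, 3, 4],
}

search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,        # evaluate 5 randomly sampled combinations out of 24
    cv=3,
    random_state=0,
)
search.fit(X_iris, y_iris)
print(search.best_params_, round(search.best_score_, 3))
```

### => Random search evaluates a fixed budget of sampled combinations instead of the full grid, which keeps the cost under control when the grid is large.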

# 4] What is a weak learner in Gradient Boosting?


### => In the context of Gradient Boosting, a weak learner refers to a simple, relatively low-complexity model that performs only slightly better than random guessing on a given learning task. Weak learners are also often called "base learners" or "base models."

### => In Gradient Boosting, the weak learners are typically decision trees with limited depth (also known as "shallow trees") or decision stumps (trees with a single split). These weak learners are called "weak" because their individual predictive power is modest, and they are prone to making errors on the training data.

### => The idea behind using weak learners in Gradient Boosting is to combine their predictions in a sequential manner, such that each new learner corrects the errors made by the previous ones. By iteratively adding weak learners to the ensemble, the overall model can become a strong learner with excellent predictive capabilities.
 
### => The key concept of Gradient Boosting is to fit each weak learner to the "residuals" of the previous ensemble. Residuals are the differences between the true target values and the predictions made by the current ensemble. By focusing on these residuals, the weak learners can concentrate on the patterns that the previous learners could not capture effectively. This process is repeated for several iterations until the model converges or a predefined number of weak learners is reached.

### => The combination of multiple weak learners through boosting allows Gradient Boosting to build powerful and robust predictive models, often outperforming individual strong models or deep decision trees. The success of Gradient Boosting lies in its ability to learn complex patterns and feature interactions by effectively aggregating the knowledge of the weak learners.
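
### => To make the contrast between one weak learner and a boosted ensemble concrete, the sketch below compares a single decision stump with 200 boosted stumps on a synthetic noisy sine curve. The dataset and the hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption): a noisy sine curve
rng = np.random.default_rng(1)
X_w = rng.uniform(0, 6, size=(300, 1))
y_w = np.sin(X_w[:, 0]) + rng.normal(scale=0.1, size=300)

# One weak learner: a stump can only split the input range once
stump = DecisionTreeRegressor(max_depth=1).fit(X_w, y_w)

# Many weak learners combined by boosting
boosted = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=1, random_state=0
).fit(X_w, y_w)

stump_r2 = stump.score(X_w, y_w)      # R^2 on the training data
boosted_r2 = boosted.score(X_w, y_w)  # R^2 on the training data
print(f"single stump R^2:   {stump_r2:.3f}")
print(f"boosted stumps R^2: {boosted_r2:.3f}")
```

### => Both scores are measured on the training data, so they show only how much more of the pattern the ensemble can represent, not how well it generalizes.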

# 5] What is the intuition behind the Gradient Boosting algorithm?



### => The intuition behind the Gradient Boosting algorithm can be understood through the analogy of a team of "experts" collaborating to solve a problem. Each expert (weak learner) specializes in a specific aspect of the problem but may not be very accurate individually. However, when they work together in a coordinated manner, their collective knowledge leads to a highly accurate and robust solution.

### => Here's a step-by-step intuition of how the Gradient Boosting algorithm works:

## 1) Initialization:
### => The model starts from a very simple initial prediction. In the standard formulation this is a constant, such as the mean of the training targets for squared-error loss, rather than a fitted tree. This initial prediction is likely to have significant errors.

## 2) Residuals and Learning from Mistakes: 
### => The algorithm calculates the difference between the true target values and the predictions made by the current ensemble. These differences are the "residuals" or "pseudo-residuals." The next weak learner is then trained to focus on and predict these residuals rather than the original target values. The new weak learner is designed to correct the mistakes made by the previous one.

## 3) Iterative Learning:
### => The algorithm iteratively adds new weak learners to the ensemble, and each new learner continues the process of learning from the residuals of the previous ensemble. Because the residuals are largest where the ensemble is most wrong, each new learner is driven mostly by the examples that were predicted poorly in previous rounds; unlike AdaBoost, the training examples are not explicitly reweighted.

## 4) Additive Combination:
### => The predictions of all weak learners are combined by summation rather than by voting. Each tree's output is scaled by the learning rate and added to the running prediction, so every tree contributes a small correction. Weighting learners by their individual accuracy is characteristic of AdaBoost, not of gradient boosting.

## 5) Gradient Descent Optimization:
### => The term "gradient" in Gradient Boosting comes from the optimization technique used to minimize the loss of the model. The algorithm performs gradient descent in function space: at each step, the new weak learner is fitted to the negative gradient of the loss with respect to the current predictions (for squared error, this is exactly the residual). Adding the learner moves the ensemble in the direction of steepest descent on the loss, reducing the training loss step by step.

### => By repeating this process for multiple iterations, the ensemble of weak learners gradually improves, and the model converges to a powerful predictive model that captures complex patterns and interactions in the data.
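
### => The link between residuals and gradients can be checked numerically. The sketch below computes the negative gradient of two common losses with respect to the current predictions; the function name `pseudo_residuals` and the example values are hypothetical, introduced only for illustration.

```python
import numpy as np


def pseudo_residuals(y_true, y_pred, loss="squared_error"):
    """Negative gradient of the loss with respect to the current predictions."""
    if loss == "squared_error":    # L = 0.5 * (y - F)^2  ->  -dL/dF = y - F
        return y_true - y_pred
    if loss == "absolute_error":   # L = |y - F|          ->  -dL/dF = sign(y - F)
        return np.sign(y_true - y_pred)
    raise ValueError(f"unsupported loss: {loss}")


y_true = np.array([3.0, 5.0, 8.0])
y_pred = np.array([4.0, 5.0, 6.0])

print(pseudo_residuals(y_true, y_pred))                    # [-1.  0.  2.]
print(pseudo_residuals(y_true, y_pred, "absolute_error"))  # [-1.  0.  1.]
```

### => With squared error the pseudo-residuals are the ordinary residuals, while with absolute error only their sign survives, which is one reason that loss is less sensitive to outliers.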

# 6] How does Gradient Boosting algorithm build an ensemble of weak learners?


### => The Gradient Boosting algorithm builds an ensemble of weak learners in a sequential manner. It follows a step-by-step process to iteratively improve the ensemble's predictions. Here's a high-level overview of how the ensemble is built:

## 1) Initialization: 
### => The ensemble starts with a simple initial prediction, typically a constant such as the mean of the training targets (for squared-error loss). This is the starting point, and its predictions are likely to have significant errors.

## 2) Compute Residuals:
### => The algorithm calculates the difference between the true target values and the predictions made by the current ensemble (or the previous weak learner). These differences are known as "residuals" or "pseudo-residuals."
 
## 3) Train a Weak Learner on Residuals:
### => The next weak learner is trained to predict the residuals rather than the original target values. In other words, it focuses on learning the patterns in the data that were not captured well by the previous weak learner. The new weak learner is chosen based on the optimization of a loss function (typically the mean squared error for regression tasks), which measures how well it can fit the residuals.

## 4) Add Weak Learner to Ensemble:
### => Once the new weak learner is trained, it is added to the ensemble. At this point, the ensemble consists of the previously trained weak learners plus the newly added one.

## 5) Update Ensemble Predictions:
### => The predictions of the ensemble are updated by adding the prediction of the newly added weak learner, multiplied by a "learning rate" (or shrinkage) to control the contribution of each weak learner. The learning rate scales the impact of each weak learner, and lower values often lead to better generalization.

## 6) Repeat:
### => Steps 2 to 5 are repeated for a predefined number of iterations or until a stopping criterion is met (e.g., the model reaches a satisfactory level of performance).
 
## 7) Final Prediction:
### => After the last iteration, the final prediction of the Gradient Boosting model is the initial prediction plus the sum of the learning-rate-scaled predictions of all weak learners in the ensemble.

### => The key idea that drives the success of Gradient Boosting is the sequential training of weak learners to correct the mistakes of the previous ones. Each new weak learner focuses on learning the residuals of the current ensemble, which helps the model gradually improve its predictions. The final ensemble is a combination of multiple weak learners that, together, create a strong predictive model capable of capturing complex patterns and interactions in the data.
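
### => The stopping criterion mentioned in the "Repeat" step can be a validation-based early stop. The following is a sketch, assuming scikit-learn's `GradientBoostingRegressor` with its `validation_fraction` and `n_iter_no_change` parameters and a synthetic dataset chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data (an assumption): a noisy sine curve
rng = np.random.default_rng(3)
X_es = rng.uniform(0, 6, size=(400, 1))
y_es = np.sin(X_es[:, 0]) + rng.normal(scale=0.2, size=400)

model_es = GradientBoostingRegressor(
    n_estimators=500,         # upper bound on the number of trees
    learning_rate=0.1,
    max_depth=2,
    validation_fraction=0.2,  # share of the training data held out internally
    n_iter_no_change=10,      # stop after 10 rounds without validation improvement
    random_state=0,
)
model_es.fit(X_es, y_es)
print("trees actually fitted:", model_es.n_estimators_)
```

### => When `n_iter_no_change` is set, `n_estimators_` reports how many trees were kept, which may be well below the upper bound.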

# 7] What are the steps involved in constructing the mathematical intuition of Gradient Boosting algorithm?

### => Constructing the mathematical intuition of the Gradient Boosting algorithm involves understanding the underlying principles and mathematical concepts driving its working. Below are the key steps involved in building the mathematical intuition of Gradient Boosting:

## 1) Loss Function:
### => The process starts with defining a loss function, which quantifies the error between the model's predictions and the actual target values. For regression tasks, the mean squared error (MSE) is a common choice for the loss function. For classification tasks, functions like log-loss (binary cross-entropy) or softmax cross-entropy are used.

## 2) Initial Model:
### => The algorithm begins with an initial model. In the standard formulation this is the constant that minimizes the loss, for example the mean of the training targets for squared error. Its accuracy is limited, and it likely has significant errors.

## 3) Residuals:
### => The algorithm calculates the difference between the true target values and the predictions made by the current ensemble (or the previous weak learner). These differences are the "residuals" or "pseudo-residuals."

## 4) Training Weak Learners:
### => The next step is to train a new weak learner (e.g., another decision tree) on the residuals. This new learner focuses on learning the patterns in the data that were not captured well by the previous weak learner.

## 5) Weighted Update:
### => The predictions of the new weak learner are multiplied by a "learning rate" (often denoted by the symbol eta) before being added to the ensemble's predictions. The learning rate controls the contribution of each weak learner and helps reduce overfitting. Smaller learning rates generally lead to better generalization, at the cost of needing more trees.

## 6) Update Ensemble Predictions:
### => The ensemble's predictions are updated by adding the prediction of the new weak learner (weighted by the learning rate) to the predictions of the previous ensemble. The updated predictions become the new predictions of the ensemble.

## 7) Repeat:
### => Steps 3 to 6 are repeated for a predefined number of iterations, with each new weak learner focusing on learning the residuals of the current ensemble. The algorithm continues to minimize the loss function and improve the model's predictive performance.

## 8) Final Prediction:
### => After completing all iterations, the final prediction of the Gradient Boosting model is the initial prediction plus the sum of the learning-rate-scaled predictions of all weak learners in the ensemble.

## 9) Regularization:
### => To prevent overfitting and improve generalization, regularization techniques may be used. Common regularization methods include limiting the depth of individual trees, using a maximum number of trees (n_estimators), and employing subsampling (stochastic gradient boosting).

## 10) Prediction for New Data:
### => Once the model is trained, it can make predictions on new data by passing the input features through every tree in the ensemble and adding their learning-rate-scaled outputs to the initial prediction.
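
### => The steps above can be summarized in equations. This is a sketch of the standard formulation for a generic differentiable loss $L$, using symbols introduced here as assumptions: $F_m$ for the ensemble after $m$ rounds, $h_m$ for the $m$-th weak learner, $r_{im}$ for the pseudo-residual of example $i$ in round $m$, $\eta$ for the learning rate, $n$ for the number of training examples, and $M$ for the number of rounds.

$$
\begin{aligned}
F_0(x) &= \arg\min_{c} \sum_{i=1}^{n} L(y_i, c) \\
r_{im} &= -\left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F = F_{m-1}}, \quad i = 1, \dots, n \\
h_m &= \arg\min_{h} \sum_{i=1}^{n} \big( r_{im} - h(x_i) \big)^2 \\
F_m(x) &= F_{m-1}(x) + \eta \, h_m(x), \quad m = 1, \dots, M
\end{aligned}
$$

### => For squared-error loss $L(y, F) = \tfrac{1}{2}(y - F)^2$, the initial model $F_0$ is the mean of the targets and the pseudo-residuals reduce to the ordinary residuals $r_{im} = y_i - F_{m-1}(x_i)$.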