# Q1

# Gradient Boosting Regression is a machine learning technique that builds an ensemble of weak learners (typically shallow decision trees) to perform regression tasks. It applies the general Gradient Boosting framework, which also covers classification, with a regression loss such as squared error. Gradient Boosting Regression, also known as Gradient Boosting Machines (GBM) for regression, is a popular approach because it can capture complex, non-linear data patterns and often provides accurate predictions.

# The key idea behind Gradient Boosting Regression is that each subsequent weak learner is trained to correct the errors (residuals) of the learners before it. This iterative process progressively improves the predictions, mainly by reducing bias. Variance is kept in check through regularization such as a small learning rate, shallow trees, and a limited number of trees.

# Q2

In [41]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [42]:
df=sns.load_dataset('tips')

In [43]:
df.head()

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [44]:
df.isnull().sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

In [45]:
df.describe()

,total_bill,tip,size
count,244.0,244.0,244.0
mean,19.785943,2.998279,2.569672
std,8.902412,1.383638,0.9511
min,3.07,1.0,1.0
25%,13.3475,2.0,2.0
50%,17.795,2.9,2.0
75%,24.1275,3.5625,3.0
max,50.81,10.0,6.0


In [52]:
from sklearn.preprocessing import LabelEncoder,OneHotEncoder

# Label-encode 'day' (classes are sorted alphabetically: Fri=0, Sat=1, Sun=2, Thur=3),
# which matches the integer 'day' column shown in the output below
encoder=LabelEncoder()
df['day']=encoder.fit_transform(df['day'])

# One-hot encode the remaining categorical columns
ohe=OneHotEncoder()
columns_to_encode = ['sex', 'smoker', 'time']
all_encoded = ohe.fit_transform(df[columns_to_encode])

# Get feature names after OneHotEncoding
feature_names = ohe.get_feature_names_out(columns_to_encode)

# Create DataFrame from encoded data and feature names
encoded_data = pd.DataFrame(all_encoded.toarray(), columns=feature_names)

# Concatenate the encoded columns with the remaining columns of df
encoded_df = pd.concat([df.drop(columns=columns_to_encode), encoded_data], axis=1)
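
As a possible shortcut, the encode-then-concatenate steps for the one-hot columns can be collapsed into a single `pd.get_dummies` call. The sketch below uses a small hypothetical frame (`toy`) standing in for the tips columns, so the values are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame standing in for the tips columns used above
toy = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "sex": ["Female", "Male", "Male"],
    "smoker": ["No", "No", "Yes"],
    "time": ["Dinner", "Lunch", "Dinner"],
})

# One call encodes the listed columns and drops the originals
toy_encoded = pd.get_dummies(toy, columns=["sex", "smoker", "time"], dtype=float)
print(toy_encoded.columns.tolist())
```

`OneHotEncoder` remains the safer choice when the same encoding must later be applied to new data, because the fitted encoder remembers the categories it saw.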

In [53]:
encoded_df.head()

,total_bill,tip,day,size,sex_Female,sex_Male,smoker_No,smoker_Yes,time_Dinner,time_Lunch
0,16.99,1.01,2,2,1.0,0.0,1.0,0.0,1.0,0.0
1,10.34,1.66,2,3,0.0,1.0,1.0,0.0,1.0,0.0
2,21.01,3.5,2,3,0.0,1.0,1.0,0.0,1.0,0.0
3,23.68,3.31,2,2,0.0,1.0,1.0,0.0,1.0,0.0
4,24.59,3.61,2,4,1.0,0.0,1.0,0.0,1.0,0.0


In [54]:
X=encoded_df.drop('tip',axis=1)
y=encoded_df['tip']

In [55]:
X.head()

,total_bill,day,size,sex_Female,sex_Male,smoker_No,smoker_Yes,time_Dinner,time_Lunch
0,16.99,2,2,1.0,0.0,1.0,0.0,1.0,0.0
1,10.34,2,3,0.0,1.0,1.0,0.0,1.0,0.0
2,21.01,2,3,0.0,1.0,1.0,0.0,1.0,0.0
3,23.68,2,2,0.0,1.0,1.0,0.0,1.0,0.0
4,24.59,2,4,1.0,0.0,1.0,0.0,1.0,0.0


In [56]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=42)

In [57]:
from sklearn.ensemble import GradientBoostingRegressor

gbr=GradientBoostingRegressor()
gbr.fit(X_train,y_train)

In [58]:
y_pred=gbr.predict(X_test)

In [60]:
from sklearn.metrics import r2_score,mean_squared_error

print('R2 score is:',r2_score(y_test,y_pred))
print('mean squared error is:',mean_squared_error(y_test,y_pred))

R2 score is: 0.24270820985136987
mean squared error is: 1.007519802709521


# Q3

In [61]:
params={
    'learning_rate':[0.1,0.2,0.3,0.4,0.5],
    'n_estimators':[100,200,300,400,500],
    'max_depth':[2,3,4]
}

In [65]:
from sklearn.model_selection import GridSearchCV

gscv=GridSearchCV(gbr,param_grid=params,scoring='r2',cv=5,refit=True,verbose=3)

In [66]:
gscv.fit(X_train,y_train)

Fitting 5 folds for each of 75 candidates, totalling 375 fits
[CV 1/5] END learning_rate=0.1, max_depth=2, n_estimators=100;, score=-0.024 total time=   0.1s
[CV 2/5] END learning_rate=0.1, max_depth=2, n_estimators=100;, score=0.612 total time=   0.1s
[CV 3/5] END learning_rate=0.1, max_depth=2, n_estimators=100;, score=0.448 total time=   0.1s
[CV 4/5] END learning_rate=0.1, max_depth=2, n_estimators=100;, score=0.180 total time=   0.1s
[CV 5/5] END learning_rate=0.1, max_depth=2, n_estimators=100;, score=0.252 total time=   0.1s
[CV 1/5] END learning_rate=0.1, max_depth=2, n_estimators=200;, score=-0.280 total time=   0.1s
[CV 2/5] END learning_rate=0.1, max_depth=2, n_estimators=200;, score=0.593 total time=   0.1s
[CV 3/5] END learning_rate=0.1, max_depth=2, n_estimators=200;, score=0.433 total time=   0.1s
[CV 4/5] END learning_rate=0.1, max_depth=2, n_estimators=200;, score=-0.016 total time=   0.1s
[CV 5/5] END learning_rate=0.1, max_depth=2, n_estimators=200;, score=0.220 tota

In [67]:
gscv.best_params_

{'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}

In [68]:
from sklearn.ensemble import GradientBoostingRegressor

gbr=GradientBoostingRegressor(n_estimators=100,learning_rate=0.1,max_depth=3)
gbr.fit(X_train,y_train)

In [69]:
y_pred_cv=gbr.predict(X_test)

In [70]:
from sklearn.metrics import r2_score,mean_squared_error

print('R2 score is:',r2_score(y_test,y_pred_cv))
print('mean squared error is:',mean_squared_error(y_test,y_pred_cv))

R2 score is: 0.25410028391955986
mean squared error is: 0.9923635044808259
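
Because the search is created with `refit=True`, the fitted `GridSearchCV` object already holds a model retrained on the full training set with the best parameters, so refitting by hand is optional. A minimal sketch on synthetic data follows; the names `X_syn`, `y_syn`, `small_grid` and `search` are assumptions for illustration and are not part of the notebook above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (illustrative only, not the tips dataset)
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=42)

# A deliberately small grid so the search runs quickly
small_grid = {"learning_rate": [0.05, 0.1], "n_estimators": [50, 100], "max_depth": [2, 3]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid=small_grid,
    scoring="r2",
    cv=3,
    refit=True,
)
search.fit(X_tr, y_tr)

# best_estimator_ is the model refit on all of X_tr with best_params_
best_model = search.best_estimator_
print("best params:", search.best_params_)
print("mean CV R2 of best params:", search.best_score_)
print("test R2:", best_model.score(X_te, y_te))
```

Fixing `random_state` on the estimator keeps the comparison between parameter settings reproducible from run to run.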


# Q4

# In Gradient Boosting, a weak learner refers to a simple model with limited predictive power. It is typically a shallow decision tree with a small number of nodes; the extreme case, a tree with a single split (depth 1), is called a decision stump. The term "weak" is used because such a model performs only modestly on its own: it has limited capacity to capture complex patterns in the data.
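
To make the idea concrete, the sketch below compares a single decision stump with an ensemble of boosted stumps on synthetic data. The data and the variable names (`X_demo`, `y_demo`, `stump`, `boosted_stumps`) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear target: a noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=300)

# A single weak learner: one split, two constant predictions
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_demo, y_demo)

# Many weak learners of the same kind, combined by gradient boosting
boosted_stumps = GradientBoostingRegressor(
    n_estimators=200, max_depth=1, learning_rate=0.1, random_state=0
).fit(X_demo, y_demo)

stump_r2 = stump.score(X_demo, y_demo)
boosted_r2 = boosted_stumps.score(X_demo, y_demo)
print("single stump training R2:", stump_r2)
print("boosted stumps training R2:", boosted_r2)
```

These are training scores, so they illustrate capacity rather than generalization.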

# Q5

# The intuition behind the Gradient Boosting algorithm is to build a strong predictive model by combining many weak learners sequentially. It starts with a simple baseline prediction. In each subsequent iteration, a new weak learner is fitted to the errors (residuals) of the current ensemble, so it learns from the mistakes made so far. Unlike AdaBoost, Gradient Boosting does not reweight misclassified examples; poorly predicted instances simply have larger residuals and therefore more influence on the next learner. Each addition acts like a step of gradient descent on the loss function, so the ensemble gradually reduces its overall prediction error and captures more complex patterns in the data. The final prediction combines the contributions of all weak learners. With sensible regularization (learning rate, tree depth, number of trees), this often gives strong performance on unseen test data as well as on the training data, which makes Gradient Boosting a powerful technique for many machine learning tasks.

# Q6

# The Gradient Boosting algorithm builds an ensemble of weak learners through an iterative and sequential process. The key idea is to improve the ensemble's performance by gradually reducing prediction errors in each iteration. Here's a step-by-step explanation of how Gradient Boosting builds the ensemble:

# 1) Initialization:
The algorithm starts by initializing the ensemble with a very simple model, usually a constant prediction such as the mean of the target values (for squared-error regression). This initial model serves as the baseline prediction.

# 2) Residual Calculation:
In each iteration, the algorithm calculates the residuals (errors) between the actual target values and the current ensemble's predictions. These residuals represent the differences between the observed data and the predictions made by the current ensemble.

# 3) Training New Weak Learner:
The next step is to train a new weak learner (e.g., another decision tree with limited depth) on the residuals calculated in the previous step. The goal of the new weak learner is to find patterns or relationships that the current ensemble is struggling to capture. It is trained to predict the residuals as closely as possible, so it focuses on the areas where the current ensemble is making mistakes.

# 4) Weight Update:
After training the new weak learner, the algorithm scales its predictions by the learning rate (shrinkage). In standard Gradient Boosting this is a fixed hyperparameter shared by all learners, not a weight derived from each learner's performance (per-learner weights of that kind belong to AdaBoost). Smaller values make each learner contribute less, so more iterations are needed.

# 5) Ensemble Update:
The new weak learner's scaled predictions are then added to the predictions of the previous ensemble to update the overall ensemble. This update acts like a gradient descent step, moving the predictions in the direction that reduces the overall prediction error. The learning rate controls the contribution of the new learner, so the ensemble makes more gradual adjustments, which helps reduce overfitting.

# 6) Iterative Process:
Steps 2 to 5 are repeated for a predefined number of iterations or until a stopping criterion is met. In each iteration, a new weak learner is added to the ensemble, focusing on the residuals and improving the ensemble's accuracy.
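
The steps above can be sketched in a few lines for the squared-error case. This is a minimal illustration, not scikit-learn's implementation: the function names `fit_gradient_boosting` and `predict_gradient_boosting` are hypothetical, the data is synthetic, and the initial model is assumed to be the mean of the targets.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_gradient_boosting(X, y, n_estimators=100, learning_rate=0.1, max_depth=2):
    """Hypothetical helper: gradient boosting for squared-error loss."""
    init = float(np.mean(y))           # 1) initialization: constant prediction
    pred = np.full(len(y), init)
    trees = []
    for _ in range(n_estimators):      # 6) repeat for a fixed number of rounds
        residuals = y - pred           # 2) residuals of the current ensemble
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)         # 3) weak learner fitted to the residuals
        pred += learning_rate * tree.predict(X)  # 4) + 5) shrink and add
        trees.append(tree)
    return init, trees


def predict_gradient_boosting(X, init, trees, learning_rate=0.1):
    """Hypothetical helper: baseline plus the shrunken sum of all trees."""
    pred = np.full(X.shape[0], init)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred


# Synthetic data for illustration
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(200, 2))
y_demo = X_demo[:, 0] ** 2 + np.sin(X_demo[:, 1]) + rng.normal(scale=0.1, size=200)

init, trees = fit_gradient_boosting(X_demo, y_demo)
y_hat = predict_gradient_boosting(X_demo, init, trees)

mse_baseline = np.mean((y_demo - init) ** 2)
mse_boosted = np.mean((y_demo - y_hat) ** 2)
print("baseline MSE:", mse_baseline)
print("boosted MSE:", mse_boosted)
```

The same `learning_rate` must be used for fitting and prediction, since it is part of the model rather than a per-tree weight.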

# Q7

# Constructing the mathematical intuition of the Gradient Boosting algorithm involves understanding the key mathematical concepts and techniques used in the algorithm. The steps involved in building this intuition are as follows:

# Loss Function:
The foundation of Gradient Boosting is the choice of a differentiable loss function that measures the difference between the actual target values and the model's predictions. The most commonly used loss function for regression is the mean squared error (MSE). For classification, the log loss (cross-entropy) is often used.

# Objective Function:
The objective of Gradient Boosting is to minimize the loss function with respect to the model's predictions. The initial prediction is often set to the mean of the target values (for regression) or the log odds of the target classes' probabilities (for classification).

# Residuals and Gradient:
In each iteration of Gradient Boosting, weak learners are trained to fit the negative gradient of the loss function. The negative gradient represents the direction and magnitude of the steepest descent towards the optimal prediction that minimizes the loss. It is equivalent to the negative partial derivative of the loss function with respect to the current prediction.
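
For squared-error loss the negative gradient reduces to the ordinary residual, which is why the two terms are often used interchangeably. A short derivation follows, using notation that is assumed here rather than taken from the text: $F_{m-1}$ is the ensemble after $m-1$ iterations, $F_0$ is the initial constant prediction, and $r_{im}$ is the pseudo-residual of sample $i$ at iteration $m$.

```latex
\begin{align}
L\bigl(y_i, F(x_i)\bigr) &= \tfrac{1}{2}\,\bigl(y_i - F(x_i)\bigr)^2 \\
F_0(x) &= \arg\min_{c} \sum_{i=1}^{n} L(y_i, c) = \frac{1}{n}\sum_{i=1}^{n} y_i \\
r_{im} &= -\left[\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F = F_{m-1}}
        = y_i - F_{m-1}(x_i)
\end{align}
```

The factor $\tfrac{1}{2}$ is a convenience; without it the negative gradient is $2\,(y_i - F_{m-1}(x_i))$, which points in the same direction.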

# Weak Learner Training:
In each iteration, a new weak learner (e.g., decision tree with limited depth) is trained on the negative gradient of the loss evaluated at the previous prediction. These values are called pseudo-residuals; for squared-error loss they equal the ordinary residuals. The goal of the weak learner is to approximate the negative gradient and capture the patterns in the residuals.

# Weight Update:
After training the weak learner, its predictions are multiplied by a learning rate (also known as the step size) to control the contribution of this weak learner to the ensemble. The learning rate is a hyperparameter that determines the weight of the new weak learner's prediction in the final ensemble.

# Ensemble Update:
The new weak learner's predictions are combined with the predictions of the previous ensemble to update the overall prediction. The ensemble update is done using a gradient descent-like optimization, where the predictions are updated to minimize the overall loss function.

# Iterative Process:
The gradient computation, weak learner training, weight update, and ensemble update steps are repeated for a predefined number of iterations or until a stopping criterion is met. In each iteration, a new weak learner is added to the ensemble, focusing on the residuals and improving the model's accuracy.

# Final Prediction:
Once all iterations are completed, the ensemble of weak learners is combined to make the final prediction. The final prediction is the initial prediction plus the sum of all the weak learners' predictions, each scaled by the learning rate, and it represents the strong learner built by the Gradient Boosting algorithm.
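
The additive form of the final prediction can be checked against a fitted scikit-learn model. The sketch below assumes the default squared-error loss and synthetic data; `X_syn`, `y_syn`, `model` and `manual` are illustrative names. It rebuilds the prediction from the initial estimator (`init_`) and the individual trees (`estimators_`).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data (illustrative only)
X_syn, y_syn = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=1)

model = GradientBoostingRegressor(
    n_estimators=50, learning_rate=0.1, max_depth=2, random_state=1
).fit(X_syn, y_syn)

# Start from the initial (constant) prediction ...
manual = model.init_.predict(X_syn)
# ... and add each tree's output, scaled by the learning rate.
# estimators_ has shape (n_estimators, 1) for regression, hence stage[0].
for stage in model.estimators_:
    manual = manual + model.learning_rate * stage[0].predict(X_syn)

print("matches model.predict:", np.allclose(manual, model.predict(X_syn)))
```

For other losses the leaf values are adjusted after each tree is fitted, so the same additive structure holds but the trees no longer predict plain residual means.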