# Introduction

In this video, Arihant mentions the different boosting methods which will be discussed in the course:

- Adaptive Boosting
- Gradient Boosting
- XGBoost

**In this session**
This session will introduce you to the following topics: 

- Ensemble models
- Introduction to boosting
- Building blocks of boosting
- AdaBoost procedure
 

**Prerequisites**
There are no prerequisites for this session other than the concepts covered in Tree Models. Also, it may help to revisit the session on 'Gradient Descent' in the "Additional References" since that will be used while studying Gradient Boosting.

## Ensemble Models 

### Ensemble: Collection of Models

For a machine learning task (classification or regression), you need a model that identifies the necessary patterns in the data and does not overfit. In other words, ‘the models should not be so simple as to not be able to identify even the important patterns present in the data; on the other hand, they should not be so complex as to even learn the noise present in the data set’.


This solution can be arrived at either through a single model or an ensemble, i.e., a collection of models. **By combining several models, ensemble learning methods create a strong learner, thus reducing the bias and/or variance of the individual models.**
### Bias: (high == underfitting)

- Bias describes systematic errors that occur when an algorithm makes strong assumptions about the form of the target function and oversimplifies the data.

- High bias means the model is too simple, missing important patterns and relationships (underfitting), resulting in predictions consistently far from actual values.

- Example: Using a linear model to predict a nonlinear relationship is likely to produce high-bias errors, because the model cannot capture the complexity of the data.

### Variance: (high == overfitting)

- Variance quantifies how much a model’s predictions fluctuate for different training sets.

- High variance means the model is too complex, fitting the training data (including its noise) very closely but failing to generalize to new, unseen data (overfitting).

- Example: A deep neural network fitting training data perfectly but showing widely different predictions for new inputs has high variance.

**Building effective models requires balancing bias and variance: too much bias leads to underfitting; too much variance leads to overfitting.**
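
The trade-off can be illustrated with a small, fully deterministic sketch. Everything in it is an illustrative assumption rather than part of the session: the target function `true_fn`, the two hand-picked noise patterns, and the two stand-in models (a constant mean predictor for high bias, a 1-nearest-neighbour predictor for high variance).

```python
# Toy illustration of bias vs variance (all data and models are hypothetical).

def true_fn(x):
    return x * x

xs = [-2, -1, 0, 1, 2]
# Two training sets drawn from the same function with different noise.
noise_a = [0.5, -0.5, 0.5, -0.5, 0.5]
noise_b = [-0.5, 0.5, -0.5, 0.5, -0.5]
train_a = [(x, true_fn(x) + n) for x, n in zip(xs, noise_a)]
train_b = [(x, true_fn(x) + n) for x, n in zip(xs, noise_b)]

def fit_mean(train):
    """High-bias model: always predicts the mean of the training targets."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_1nn(train):
    """High-variance model: copies the target of the nearest training point."""
    def predict(x):
        return min(train, key=lambda point: abs(point[0] - x))[1]
    return predict

def mse(model, data):
    """Mean squared error of a model on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

def spread(fit, x):
    """How much the prediction at x changes when the training set changes."""
    return abs(fit(train_a)(x) - fit(train_b)(x))

print("train MSE, mean model:", mse(fit_mean(train_a), train_a))
print("train MSE, 1-NN model:", mse(fit_1nn(train_a), train_a))
print("prediction spread at x=1, mean model:", spread(fit_mean, 1))
print("prediction spread at x=1, 1-NN model:", spread(fit_1nn, 1))
```

In this toy setting the mean model has a large training error but barely changes between training sets, while the 1-nearest-neighbour model fits the training data exactly but its predictions shift with the noise.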

### Bagging vs Boosting

### Bagging - splitting the training data set into multiple subsets


**Bagging** is one such ensemble method. It creates different training subsets (**bootstrap samples**) by sampling from the training data with replacement. Then, an algorithm with the same set of hyperparameters is built on each of these subsets.

In this way, the **same algorithm with the same set of hyperparameters is exposed to different subsets of the training data**, resulting in slight differences between the individual models. The **predictions of these individual models are combined** by taking the average of all the values for a regression problem or a majority vote for a classification problem. **Random forest is an example** of the bagging method.

![1.png](attachment:7238e3cc-c657-434e-98e4-1145b0a170c6.png)

Bagging **works well** when the algorithm used to build our **model has high variance**. This means the model built changes a lot even with slight changes in the data. **As a result, these algorithms overfit easily if not controlled**. Recall that **decision trees are prone to overfitting** if the hyperparameters are not tuned well. Bagging works very well for high-variance models like decision trees.
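
The bagging procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the toy data, the helper names and the base learner `fit_threshold` (a simple midpoint rule standing in for a decision tree) are all hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) rows from data with replacement."""
    return rng.choices(data, k=len(data))

def fit_threshold(sample):
    """Hypothetical base learner for 1-D data with labels 0/1.

    Splits at the midpoint between the two class means and assumes
    class 1 lies at larger x values than class 0.
    """
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:
        # The bootstrap sample contained a single class: predict it always.
        constant = 1 if xs1 else 0
        return lambda x: constant
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x > threshold else 0

def majority_vote(predictions):
    """Combine classification predictions."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Combine regression predictions."""
    return sum(predictions) / len(predictions)

def bagging_fit(data, n_models, seed=0):
    """Train the same learner on n_models different bootstrap samples."""
    rng = random.Random(seed)
    return [fit_threshold(bootstrap_sample(data, rng)) for _ in range(n_models)]

def bagging_predict(models, x):
    return majority_vote([model(x) for model in models])

# Toy data: class 0 at x = 0..4, class 1 at x = 6..10.
data = [(x, 0) for x in range(0, 5)] + [(x, 1) for x in range(6, 11)]
models = bagging_fit(data, n_models=11)
print(bagging_predict(models, 0), bagging_predict(models, 10))
```

Each model sees a slightly different sample, so the learned thresholds differ; the vote smooths out those differences.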


### Boosting - combines individual models into a strong learner

![diagram-export-9-29-2025-10_10_28-PM.png](attachment:566bf483-f655-492f-8224-0b0807c3e190.png)

Boosting is another popular approach to ensembling. This technique combines individual models into a strong learner by creating sequential models such that the final model has a higher accuracy than the individual models. Let’s understand this.

 ![2.png](attachment:897a2377-5919-435c-a4d8-9153f5f3928d.png)

These individual models are connected in such a way that the subsequent models are dependent on errors of the previous model and **each subsequent model tries to correct the errors of the previous models**. 


 


### Weak Learners

Before proceeding further on boosting, let’s take a look at the foundation of boosting algorithms — **Weak Learners**.

 

As discussed earlier, boosting is an approach where the individual models are connected in such a way that they correct the mistakes made by the previous models. Here, these individual models are called weak learners. 

 

Until now, **all the models you have learnt are strong learners**, where each model is expected to perform well on its assigned task, whether classification or regression. 

 

**A weak learner, on the other hand, refers to a simple model which performs at least better than a random guesser** (for binary classification, the error rate should be less than 0.5). It primarily identifies only the prominent pattern(s) present in the data and is therefore unlikely to overfit. In boosting, such weak learners can be used to build your ensemble.

 

In boosting, any model can be a weak learner – linear regression, decision tree or any other model, but more often than not, tree methods are used here.


Note: A **decision stump**, i.e., a shallow decision tree with a depth of only 1, is one such weak learner.

 

![4.png](attachment:52a4a9d5-167b-45d6-8869-65bab575e2bd.png)
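
A decision stump on a single feature can be sketched as follows. The toy data and the function names are illustrative assumptions; the point is only that one threshold captures the most prominent pattern and beats random guessing without fitting the data perfectly.

```python
def stump_predict(x, threshold, polarity):
    """polarity = +1: predict +1 below the threshold and -1 above it.
    polarity = -1 flips the two sides."""
    return polarity if x < threshold else -polarity

def fit_stump(xs, ys):
    """Try every midpoint between neighbouring feature values and both
    polarities; keep the split with the lowest error rate."""
    values = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for threshold in thresholds:
        for polarity in (1, -1):
            mistakes = sum(
                1 for x, y in zip(xs, ys)
                if stump_predict(x, threshold, polarity) != y
            )
            error_rate = mistakes / len(xs)
            if best is None or error_rate < best[2]:
                best = (threshold, polarity, error_rate)
    return best

# Hypothetical toy data: no single threshold separates the labels perfectly.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

threshold, polarity, error_rate = fit_stump(xs, ys)
print(threshold, polarity, error_rate)
```

On this data the best stump misclassifies 3 of the 10 points: better than a random guesser, but far from a strong learner on its own.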

**To summarise:** Weak learners are combined sequentially such that each subsequent model corrects the mistakes of the previous model, resulting in a strong overall model that gives good predictions.

 

Through weak learners, you can do the following:

- Reduce the variance of the final model, making it more robust (generalisable) 
- Train the ensemble quickly resulting in faster computation time
 


## AdaBoost - stands for adaptive boosting

![d277a1e8-09d8-4850-83b4-c04d23e99bdf-adaboost.gif](attachment:ad2516ac-2b49-42c8-8487-3ea16a4dd6f5.gif)

Before starting with a numerical example to understand AdaBoost, let’s see an overview of the steps that need to be taken in this boosting algorithm:

- AdaBoost starts with a uniform distribution of weights over training examples, i.e., it gives equal weights to all its observations. These weights tell the importance of each datapoint being considered.
- We start with a single weak learner to make the initial predictions.
- Once the initial predictions are made, the next weak learner picks up the patterns that the previous weak learner missed. This is achieved by giving more weight to the misclassified datapoints.
- Apart from assigning a weight to each observation, the model also assigns a weight to each weak learner. The higher the error of a weak learner, the lower the weight given to it. These weights are used when the ensemble makes its final predictions.
- Once the observation weights and the weak learner's weight have been computed, the next weak learner in the sequence trains on the resampled data (data sampled according to the observation weights) to make the next prediction.
- The model will iteratively continue the steps mentioned above for a pre-specified number of weak learners. 
- In the end, you need to take a weighted sum of the predictions from all these weak learners to get an overall strong learner.
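
For reference, the steps above are commonly written as follows for binary labels $y_i \in \{-1, +1\}$. This is the standard discrete AdaBoost formulation, given here as an assumption; the numerical example in the session may use a different but equivalent convention. Here $N$ is the number of observations, $h_t$ the $t$-th weak learner, $w_i^{(t)}$ the weight of observation $i$ in round $t$, and $\alpha_t$ the weight of the weak learner.

$$
\begin{aligned}
w_i^{(1)} &= \frac{1}{N} \\
\varepsilon_t &= \sum_{i=1}^{N} w_i^{(t)} \, \mathbb{1}\left[h_t(x_i) \neq y_i\right] \\
\alpha_t &= \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t} \\
w_i^{(t+1)} &= \frac{w_i^{(t)} \exp\left(-\alpha_t \, y_i \, h_t(x_i)\right)}{Z_t}, \qquad Z_t = \sum_{j=1}^{N} w_j^{(t)} \exp\left(-\alpha_t \, y_j \, h_t(x_j)\right) \\
H(x) &= \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t \, h_t(x)\right)
\end{aligned}
$$

A weak learner with a smaller weighted error $\varepsilon_t$ gets a larger $\alpha_t$, and misclassified observations (where $y_i h_t(x_i) = -1$) have their weights increased before the next round.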

**A strong learner is formed by combining multiple weak learners which are trained on the mistakes of the previous model.**
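
The whole loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's implementation: it uses decision stumps on a single feature, hypothetical toy data, and labels in {-1, +1}. It also passes the observation weights directly into the stump's error instead of resampling the data, a common deterministic variant of the resampling step described above.

```python
import math

def stump_predict(x, threshold, polarity):
    """polarity = +1: predict +1 below the threshold and -1 above it."""
    return polarity if x < threshold else -polarity

def fit_weighted_stump(xs, ys, weights):
    """Pick the threshold and polarity with the lowest weighted error."""
    values = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for threshold in thresholds:
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[2]:
                best = (threshold, polarity, error)
    return best

def adaboost_fit(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n  # start with uniform observation weights
    ensemble = []
    for _ in range(n_rounds):
        threshold, polarity, error = fit_weighted_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # weight of this learner
        ensemble.append((alpha, threshold, polarity))
        # Increase weights of misclassified points, decrease the others.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]  # renormalise to sum to 1
    return ensemble

def adaboost_predict(ensemble, x):
    """Sign of the weighted sum of the weak learners' votes."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: no single stump classifies it perfectly.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost_fit(xs, ys, n_rounds=3)
predictions = [adaboost_predict(ensemble, x) for x in xs]
print(predictions == ys)
```

On this toy data each individual stump makes mistakes, but the weighted combination of three stumps classifies every training point correctly.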

![5 (2).png](attachment:dc24c0db-a385-4d38-a694-0e6e3289d108.png)

