We have already looked at models such as Logistic Regression, Decision Trees and Linear Regression. Used on their own, these models often suffer from either underfitting or overfitting.

Linear models such as Linear Regression, Logistic Regression and linear SVM tend to underfit when the underlying relationship is not linear.
Tree-based models such as Decision Trees are very prone to overfitting.

To overcome these issues we use ensemble learning methods, which we will now look at in more detail.

### Ensemble Learning Methods

Ensemble models in machine learning are techniques that combine multiple individual models to improve predictive performance. The main idea behind ensemble methods is to leverage the diversity of individual models to produce a stronger, more robust predictor than any single model alone.

Ensemble methods solve several problems in machine learning:

- **Bias-Variance Tradeoff:** Ensemble methods can help mitigate the bias-variance tradeoff by combining models that have different sources of error. Typically, individual models may have high bias (underfitting) or high variance (overfitting), but by combining them, the ensemble can achieve a better balance between bias and variance, leading to improved generalization performance.


- **Improving Predictive Accuracy:** Ensemble methods often result in higher predictive accuracy compared to individual models. By aggregating predictions from multiple models, ensemble methods can capture different aspects of the underlying data distribution, leading to more accurate predictions.


- **Robustness to Noise and Outliers:** Ensemble methods are generally more robust to noise and outliers in the data. Since the predictions from individual models may vary due to noise or outliers, combining them can help reduce the impact of these errors and produce more reliable predictions.


- **Handling Complex Relationships:** Ensemble methods can capture complex relationships in the data by combining different modeling approaches or by incorporating diverse feature representations. This allows them to effectively model non-linear or intricate patterns in the data.


- **Enhancing Stability and Reliability:** Ensemble methods tend to be more stable and reliable than individual models. By aggregating predictions from multiple models, ensemble methods can reduce the variability in predictions and provide more consistent results across different subsets of the data.
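
The claim that a combination of models can beat any single one can be checked with a little arithmetic. The sketch below is only an illustration: it assumes a binary classification task, an odd number of models, and models whose errors are fully independent (real models are usually correlated, so the gain is smaller in practice). The function name `majority_vote_accuracy` is introduced here for the example.

```python
from math import comb

def majority_vote_accuracy(p: float, n_models: int) -> float:
    """Probability that a majority vote is correct when each of n_models
    independent classifiers is correct with probability p.
    Assumes a binary task and an odd n_models (so there are no ties)."""
    needed = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )

# Accuracy of the ensemble as the number of 70%-accurate models grows.
for n in (1, 5, 11, 25):
    print(n, round(majority_vote_accuracy(0.7, n), 3))
```

With models that are individually better than chance, the ensemble accuracy rises as more models vote. With models at exactly 50%, voting does not help at all.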


**Common ensemble methods include:**

- **Bagging (Bootstrap Aggregating):** It involves training multiple models independently on different bootstrap samples of the training data (random samples drawn with replacement) and then averaging their predictions or taking a majority vote. The most popular bagging approach is `Random Forest`, which builds multiple decision trees during training and combines their predictions through averaging or voting.

- **Boosting:** It trains models sequentially, with each subsequent model focusing on the instances that earlier models got wrong. Examples include `AdaBoost`, `Gradient Boosting Machines (GBM)` and `XGBoost`.

- **Stacking:** It trains a meta-model to combine the predictions of several base models.
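
To make the boosting idea of "focusing on misclassified instances" concrete, here is a minimal sketch of a single reweighting round in the style of discrete AdaBoost for binary labels. It assumes the weak learner's weighted error lies strictly between 0 and 0.5. The function name `adaboost_reweight` and the toy data are made up for this example.

```python
import math

def adaboost_reweight(weights, is_wrong):
    """One AdaBoost-style reweighting round.

    weights  : current sample weights, assumed to sum to 1
    is_wrong : booleans, True where the current weak learner misclassified
    Returns (alpha, new_weights), where alpha is the learner's vote weight.
    """
    # Weighted error of the current weak learner (assumed 0 < err < 0.5).
    err = sum(w for w, wrong in zip(weights, is_wrong) if wrong)
    alpha = 0.5 * math.log((1 - err) / err)
    # Increase weights of misclassified samples, decrease the others.
    raw = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(weights, is_wrong)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

weights = [0.1] * 10                    # start with uniform weights
is_wrong = [False] * 8 + [True] * 2     # the learner got the last two wrong
alpha, new_weights = adaboost_reweight(weights, is_wrong)
print(round(alpha, 3), [round(w, 4) for w in new_weights])
```

After the update, the misclassified samples together carry half of the total weight, so the next model is pushed to get them right.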

## Bias-Variance Tradeoff

### Bias Error

During training, the model learns patterns in the dataset and then applies them to test data to make predictions. **The systematic difference between the model's predictions and the actual (expected) values is known as bias error, or error due to bias.**

- A high bias model tends to oversimplify the underlying patterns in the data and leads to underfitting. This means the model fails to capture important relationships between features and the target variable. Hence a high bias model cannot perform well on new unseen data.


- **Characteristics of a high bias model include:**
    - Consistently poor performance on both training and test datasets.
    - High error on the training dataset, indicating the model is unable to learn the underlying patterns.
    - Limited capacity to capture complex relationships in the data.


- Common causes of bias include using a model that is too simple for the data and using too few features.

### Variance Error

Variance refers to the variability of model predictions for a given input data point. It measures how much the model's predictions would differ if trained on different subsets of the training data.

- A high variance model tends to capture noise in the training data and leads to overfitting. This means the model performs well on the training dataset but generalizes poorly to unseen data.


- **Characteristics of a high variance model include:**
    - Excellent performance on the training dataset but poor performance on the test dataset.
    - Low error on the training dataset, indicating the model has memorized the training data rather than learned the underlying patterns.
    - Sensitivity to small changes in the training data, leading to different predictions for similar data points.


- Common causes of variance include using an overly complex model, using too many features, or insufficient regularization.
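
Bias and variance can be estimated empirically by retraining a model on many freshly drawn training sets and looking at its predictions at one fixed point. The sketch below is a toy simulation under stated assumptions: the true function `true_f`, the noise level `NOISE_SD`, the query point `X0` and the two "models" are all invented for illustration. A constant model (the mean of the targets) stands in for a high-bias model, and a nearest-neighbour lookup stands in for a high-variance model.

```python
import random
import statistics

def true_f(x):
    return x * x                      # assumed ground-truth relationship

XS = [i / 10 for i in range(11)]      # fixed training inputs 0.0, 0.1, ..., 1.0
NOISE_SD = 0.3                        # assumed noise level
X0 = 1.0                              # query point where we evaluate the models

def sample_targets(rng):
    """Draw one noisy training set of targets for the fixed inputs."""
    return [true_f(x) + rng.gauss(0, NOISE_SD) for x in XS]

def predict_constant(ys):
    """Very simple model: ignores x and predicts the mean target."""
    return statistics.mean(ys)

def predict_nearest(ys):
    """Very flexible model: copies the target of the training input nearest to X0."""
    nearest = min(range(len(XS)), key=lambda i: abs(XS[i] - X0))
    return ys[nearest]

def bias_sq_and_variance(predict, n_trials=500, seed=0):
    """Estimate squared bias and variance of predict at X0 over many training sets."""
    rng = random.Random(seed)
    preds = [predict(sample_targets(rng)) for _ in range(n_trials)]
    bias_sq = (statistics.mean(preds) - true_f(X0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance

for name, model in [("constant", predict_constant), ("nearest", predict_nearest)]:
    b, v = bias_sq_and_variance(model)
    print(f"{name:9s} bias^2={b:.4f} variance={v:.4f}")
```

In this setup the constant model shows a large squared bias and a small variance, while the nearest-neighbour model shows the opposite pattern.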

**Example:**

1. `Low Bias and High Variance` - Training Error = 1% & Test Error = 20% (Overfitting)

2. `High Bias and Low Variance` - Training Error = 25% & Test Error = 23% (Underfitting)

3. `Low Bias and Low Variance` - Training Error < 10% & Test Error < 10% (Right Model)
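
The three cases above can be turned into a rough rule of thumb. The function name `diagnose` and the thresholds `max_error` and `max_gap` are hypothetical choices for illustration, not standard values. What counts as an acceptable error depends entirely on the problem.

```python
def diagnose(train_error, test_error, max_error=0.10, max_gap=0.05):
    """Rough diagnosis from training and test error rates (fractions in [0, 1]).

    max_error and max_gap are illustrative thresholds, not standard values.
    """
    if train_error > max_error:
        # The model cannot even fit the data it was trained on.
        return "underfitting (high bias)"
    if test_error - train_error > max_gap:
        # Fits the training data well but does not generalize.
        return "overfitting (high variance)"
    return "reasonable fit"

print(diagnose(0.01, 0.20))
print(diagnose(0.25, 0.23))
print(diagnose(0.05, 0.07))
```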

### Bias-Variance Tradeoff

- The bias-variance tradeoff refers to the inherent tradeoff between bias and variance when selecting a model. Increasing model complexity tends to decrease bias but increase variance, while decreasing model complexity tends to increase bias but decrease variance.

- The goal is to find the right balance between bias and variance to achieve optimal predictive performance on unseen data.

In simple terms, bias refers to errors introduced by oversimplified models, which lead to underfitting, while variance refers to errors introduced by overly complex models, which lead to overfitting. Finding the right balance between the two is essential for building models that generalize well to unseen data.
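
For squared-error loss, the tradeoff can be written as a decomposition. The notation is introduced here for illustration and is not used elsewhere in these notes: assume the data follow $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and let $\hat{f}$ be the model fitted on a randomly drawn training set. The expected error at a point $x$ then splits into three terms:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

Only the first two terms depend on the model, which is why reducing one at the cost of the other can still lower the total error.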

## Bagging

Bagging is an ensemble technique used to reduce the variance of our predictions (and hence overfitting) by combining the results of multiple classifiers trained on different sub-samples of the same data set.

![image-7.webp](attachment:image-7.webp)

Credits: Analytics Vidhya

The steps followed in bagging are:

1. **Create Multiple DataSets:** Sampling is done with replacement on the original data and new datasets are formed.

2. **Build Multiple Classifiers:** Classifiers are built on each data set. Generally the same classifier is modeled on each data set and predictions are made.

3. **Combine Classifiers:** The predictions of all the classifiers are combined using a mean, median or mode value depending on the problem at hand. The combined prediction is generally more robust than that of a single model.
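
The three steps can be sketched in plain Python. This is a minimal illustration under simplifying assumptions, not a production implementation: the data are one-dimensional with binary labels, the base classifier is a hand-written decision stump (`fit_stump`), and the function names are made up for the example.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Fit a 1-D decision stump that predicts 1 when x >= threshold, else 0.
    Returns the threshold with the fewest training errors."""
    # The extra candidate max(xs) + 1 allows the stump to predict 0 everywhere.
    candidates = sorted(set(xs)) + [max(xs) + 1]
    def errors(t):
        return sum((x >= t) != bool(y) for x, y in zip(xs, ys))
    return min(candidates, key=errors)

def bagging_fit(xs, ys, n_models=25, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        # Step 1: create a new data set by sampling with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        # Step 2: build a classifier on that data set.
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds

def bagging_predict(thresholds, x):
    # Step 3: combine the classifiers by majority vote (the mode).
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]

xs = list(range(10))          # toy inputs 0..9
ys = [0] * 5 + [1] * 5        # label is 1 when x >= 5
model = bagging_fit(xs, ys)
print([bagging_predict(model, x) for x in (0, 4, 5, 9)])
```

For a regression problem, the vote in step 3 would be replaced by the mean (or median) of the individual predictions.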

There are various implementations of bagging, but Random Forest is the most popular one, and we will study it now.

### Random Forest Algorithm (Classifier and Regressor)

Random forests are essentially ensembles of a number of decision trees. Bagging selects random samples of observations from a data set, and you can create a large number of models (say, 100 decision trees), each one on a different bootstrap sample from the training set.

Random forests are a parallel ensemble learning technique with decision trees as the building blocks. They use bagging as the ensemble learning method. Bagging creates different training subsets from the same training data through sampling with replacement. 

Each of these samples is then used to train one tree of the forest. In this manner, the same algorithm with a similar set of hyperparameters is exposed to different parts of the given data, resulting in slight differences between the individual models.
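
Because each tree depends only on its own bootstrap sample, the trees can be built independently of one another. The sketch below illustrates that structure with Python's standard thread pool. `build_tree` is a hypothetical stand-in that only draws the bootstrap sample a tree would be trained on, and the defaults of 100 trees and 4 workers are illustrative.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def build_tree(seed, n_rows=1000):
    """Hypothetical stand-in for training one tree: it only draws the
    bootstrap sample (row indices) that the tree would be trained on."""
    rng = random.Random(seed)  # one independent random stream per tree
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def build_forest(n_trees=100, n_workers=4):
    # Each task needs only its own seed, so the tasks do not share any state.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(build_tree, range(n_trees)))

forest = build_forest()
print(len(forest), len(forest[0]))
```

Threads are used here only to keep the sketch short. In CPython they do not speed up CPU-bound pure-Python work, so real libraries rely on processes or native code instead. For example, scikit-learn's random forest estimators expose an `n_jobs` parameter for this purpose.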

Recall that in a decision tree, every data point passes from the root node to the bottom until it is classified in a leaf node. A similar process occurs in random forests as well while making predictions. 

Each data point passes through different trees in the ensemble, which are built on different training and feature subsets. The final outcome of these trees is then combined either by taking the most frequent class prediction in the case of a classification problem or by taking the average of the predictions in the case of a regression problem.
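
The combination step described above is small enough to write out directly. In this sketch the function names are made up for the example, and the input lists stand for the outputs of the individual trees for a single data point.

```python
from collections import Counter
from statistics import mean

def combine_classification(tree_predictions):
    """Return the most frequent class among the trees' predictions.
    Ties are broken in favour of the class seen first (a simplification)."""
    return Counter(tree_predictions).most_common(1)[0][0]

def combine_regression(tree_predictions):
    """Return the average of the trees' numeric predictions."""
    return mean(tree_predictions)

print(combine_classification(["cat", "dog", "cat", "cat", "dog"]))  # cat
print(combine_regression([2.0, 3.0, 4.0]))                          # 3.0
```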

![1_jE1Cb1Dc_p9WEOPMkC95WQ.png](attachment:1_jE1Cb1Dc_p9WEOPMkC95WQ.png)

Image Credit: medium.com

![image_dab23a4c33.png](attachment:image_dab23a4c33.png)

Image Credit : Datacamp.com

**For the regression data the tree would look something like this**

![04ea4b62-0f40-496c-8299-bd5a1e4a5907-C3M6_images.pptx%20%288%29.jpg](attachment:04ea4b62-0f40-496c-8299-bd5a1e4a5907-C3M6_images.pptx%20%288%29.jpg)

Diversity, or randomness, ensures that the models complement one another: individual models make their predictions independently of each other.

Because of this randomness, even if some trees overfit, the other trees in the ensemble tend to neutralize that effect.

A random forest selects a random sample of data points (a bootstrap sample) to build each tree and a random sample of features each time it splits a node. Randomly selecting features ensures that the trees differ from one another.

![daaf8043-e412-4868-b118-936c753774b7-table_B.png](attachment:daaf8043-e412-4868-b118-936c753774b7-table_B.png)

Image Credit: Datacamp

Decision trees are built on each of these bootstrap samples, as you learned earlier, and a random subset of features is selected at each split of a tree. 

For example, in DT1, three features, F2, F3, and F4, are selected randomly from a total of six features for consideration as the splitting feature at the red node. 

Similarly, at the green node, you have another set of features (F1, F3, F4) that are selected randomly for splitting. This process is followed for building each decision tree in a random forest model, thereby introducing randomness at each split across the trees.
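
The per-split feature sampling can be sketched as follows. The feature names mirror the six features in the example above, `candidate_features` is a name made up for this sketch, and the subsets printed depend on the random seed, so they will not necessarily match the ones in the figure.

```python
import random

def candidate_features(all_features, k, rng):
    """Pick k distinct features to consider as splitting candidates at one node."""
    return rng.sample(all_features, k)

features = ["F1", "F2", "F3", "F4", "F5", "F6"]
rng = random.Random(42)  # fixed seed only so the sketch is repeatable

# A fresh random subset is drawn at every node that is split.
for node in ("red node", "green node"):
    print(node, sorted(candidate_features(features, 3, rng)))
```

The tree then chooses the best split among these candidates only. A common heuristic for classification is to consider roughly the square root of the total number of features at each split.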

#### Advantages of using Random Forest

- **Capable of handling large quantities of data:** It works well with large volumes of data and can provide more accurate predictions than a single decision tree.


- **Parallelization:** You need multiple trees to make a forest. Since each tree is built independently on different data and attributes, the trees can be built in parallel. This means you can make full use of a multicore CPU or a cluster of computers to build random forests. For example, if there are 4 cores and 100 trees to be constructed, each core can build 25 trees.


- **Feature selection:** Recall that decision trees choose the feature to split on at a node based on the reduction in the Gini index, that is, the increase in the node's homogeneity. This lets a decision tree quantify the importance of each feature: a feature whose splits produce a large reduction in the Gini index is an important variable, while a feature whose splits reduce impurity only slightly is less important. The variable on which the root node is split can therefore usually be considered one of the most important variables. Since decision trees are the building blocks of random forests, random forests inherit this property. For each feature, the reduction in impurity (Gini index) is collected over all the nodes that split on it, in all the trees of the forest, and the average of these reductions across the trees indicates the importance of the feature. In this way, you can determine how much each feature contributes to the predictions of the random forest model.
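
The impurity-based importance described in the last point can be sketched in a few lines. This is a simplified illustration: the function names and the toy numbers are made up for the example. Library implementations typically also weight each node by the share of samples reaching it and normalize the final scores.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_decrease(parent, left, right):
    """Reduction in Gini impurity achieved by splitting parent into left and right."""
    n = len(parent)
    children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - children

def forest_importance(per_tree_decreases):
    """Average each feature's total impurity decrease over all trees.
    per_tree_decreases: one dict per tree, mapping feature -> total decrease."""
    feature_names = {f for tree in per_tree_decreases for f in tree}
    n_trees = len(per_tree_decreases)
    return {f: sum(tree.get(f, 0.0) for tree in per_tree_decreases) / n_trees
            for f in feature_names}

parent = [0, 0, 0, 0, 1, 1, 1, 1]
print(gini_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1]))  # informative split
print(gini_decrease(parent, [0, 0, 1, 1], [0, 0, 1, 1]))  # uninformative split

# Toy per-tree totals (made-up numbers) for a forest of two trees.
print(forest_importance([{"F1": 0.5, "F2": 0.1}, {"F1": 0.3}]))
```

A split that separates the classes perfectly removes all of the parent's impurity, while a split that leaves both children as mixed as the parent contributes nothing to the feature's importance.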