# Ensemble Methods

Ensemble methods in machine learning refer to techniques that combine multiple individual models to create a stronger, more accurate predictive model. These methods are based on the principle that aggregating the predictions of multiple models can often result in better performance than using a single model.

## Plan

Here is what we will cover this morning:

- Decision Trees: a popular machine learning algorithm that is widely used for both classification and regression tasks.
- Bagging: short for Bootstrap Aggregating, involves training multiple models on different subsets of the training data.
- Boosting: a technique where multiple models are trained sequentially, with each subsequent model focusing on correcting the mistakes made by the previous models.
- Stacking: involves training multiple models, often of different types or with different hyperparameters, and combining their predictions using another model called a meta-learner.

Ensemble methods can help improve the performance, robustness, and generalization ability of machine learning models. They are particularly useful when individual models have different strengths and weaknesses or when dealing with complex and noisy datasets.

## 1. Decision Trees - read

### DecisionTreeClassifier

Here we load the Iris dataset. You probably remember it from previous recaps, right?

We instantiate the decision tree classifier:
- max_depth -> The maximum depth of the tree
- random_state -> Controls the randomness of the estimator. 

So here we have a DecisionTreeClassifier, which builds a decision tree model based on the training data, where each internal node represents a decision rule based on a specific feature, and each leaf node represents a class label. During training, the algorithm recursively partitions the data based on feature values to create branches and nodes that optimize the classification performance.

We have three classes in the Iris dataset.<br>
So, looking at this graph, where would you draw a line to split the dataset? Just one line! If we split too many times, the tree will overfit.

![ahFoMRfxBh.png](attachment:701dbe01-490e-48fd-a2c1-09d25e96eff1.png)

In [None]:
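# Assumes `data` is the Iris DataFrame loaded earlier, with a `target` column
from sklearn.tree import DecisionTreeClassifier

# Features (every column except the label) and labels as NumPy arrays
X = data.drop(columns=['target']).values
y = data.target.values

# Instantiate and train model
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=2)
tree_clf.fit(X, y)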

I would explain `np.c_` as follows: it concatenates the arrays you pass it along their last dimension (axis).
For example:

In [None]:
import numpy as np

# Both are 2-dimensional arrays
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[7, 8, 9], [10, 11, 12]])

Now, let's take a look at `np.c_[a, b]` (note the square brackets):

First, let's look at the shapes:

The shape of both `a` and `b` is (2, 3).

In [None]:
np.c_[a, b]

In [None]:
print(a.shape)
print(b.shape)

Concatenating `a` (2, 3) onto the last axis of `b` (length 3), while keeping the other axis unchanged (length 2), gives a shape of (2, 3 + 3) = (2, 6):

In [None]:
np.c_[a, b].shape

# Jargon

- These are called nodes.
- This is the root node.
- These are the internal nodes.
- And these are our leaf nodes. Here we have a green leaf node and a purple leaf node.

As you can see, each of these nodes has a condition applied: a binary condition that tries to split the dataset. On the root node, for example, the condition is `petal_length <= 2.45`. If it's true, we go to the left node, the orange one, and if it's false we go to the right node, the white one.<br>
So in general, a decision tree makes a statement and then makes a decision based on whether that statement is true or false.<br>
That simple, no big deal!
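To see these conditions as text rather than as a picture, here is a minimal sketch that prints the rules of a small tree. It assumes the Iris data can be loaded straight from scikit-learn with `load_iris`; the names `X_iris`, `y_iris`, `demo_tree` and `rules` are made up for this illustration and are not part of the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load Iris directly from scikit-learn (assumption: same data as the notebook's `data`)
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

# Same hyperparameters as the classifier trained above
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=2)
demo_tree.fit(X_iris, y_iris)

# Print the tree as nested true/false rules, one line per node
rules = export_text(demo_tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one node: lines with `<=` or `>` are the conditions, and lines starting with `class:` are the leaf nodes.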

# So how does it know how to split the dataset? Gini Index

The Gini index of a node is one minus the sum, over every class, of the squared ratio of observations in that class, that is, the squared probability of an observation in that node belonging to that class: $G = 1 - \sum_i p_i^2$.

Do you remember the Iris dataset? There are 3 types of flowers, and this is the number of occurrences of each class among the total samples.
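As a rough sketch of that calculation, the Gini index of a node can be computed from its class counts as $G = 1 - \sum_i p_i^2$, where $p_i$ is the share of samples in class $i$. The helper `gini` below is a hypothetical function written for this illustration, and the `[0, 49, 5]` node is a made-up example of a mostly pure node.

```python
import numpy as np

def gini(class_counts):
    """Gini impurity of a node, given the number of samples in each class."""
    counts = np.asarray(class_counts, dtype=float)
    proportions = counts / counts.sum()      # p_i for every class
    return 1.0 - np.sum(proportions ** 2)    # 1 - sum(p_i^2)

# Root node of Iris: 50 samples of each of the 3 classes
root_gini = gini([50, 50, 50])    # 1 - 3 * (1/3)^2, about 0.667
pure_gini = gini([50, 0, 0])      # only one class left -> 0.0
mixed_gini = gini([0, 49, 5])     # hypothetical node, mostly one class
print(round(root_gini, 3), round(pure_gini, 3), round(mixed_gini, 3))
```

A Gini of 0 means the node is pure, and the closer it gets to its maximum (about 0.667 for three balanced classes), the more mixed the node is.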

Okay, so that's basically how you calculate the Gini. And the Gini, as we said, tells us how well a given node is able to split the data. So basically, how well can this condition here split the data? If I have my points on a scatter plot and I set a line at a threshold, is my data going to be perfectly separated? Do I have only one class on one side of the line and the rest on the other, or not? In most cases not, and that's why you have the whole tree, right? So what we try to do, every time, is find the threshold that splits the data as cleanly as possible at that point.<br>
And we'll see in a bit how that looks in a graph. There's a graph that shows the sections formed by these different thresholds, ok?

# How Do We "Grow" a Tree? - read

# Predicting

# Think about decision trees as "orthogonal" classifiers

# DecisionTreeRegressor

# The goal of regression trees is to predict a continuous value, so they are "grown" differently than classification trees

So here we would have to separate the data: from 0 to 14.5, from 15 to 23.5, from 25 to 29 and from 30 to 40. Then we would take the average within each section. What this graph is doing is checking the effectiveness of a drug based on the dosage. If someone takes the right amount, it is 100% effective, but if someone takes too little or too much, it isn't effective.

So if the dosage someone takes is less than 14.5, the drug will be 4.2% effective...

# Growing the regression tree

So here we have our data points without a threshold.

- So what we do here is set a threshold after the first point. Then we calculate the sum of squared residuals (SSR) on both sides.
- Then we move on to the next point, set the threshold again, and calculate again... and so on...

So we keep doing this until we find the threshold that minimizes the SSR. We can treat the SSR as a loss function that we are trying to minimize. When we find the point where it is minimized, that's our threshold. We set it, and then we continue with the rest of the points until we find thresholds that will, in the end, minimize the overall SSR.
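The search described above can be sketched as a small loop. This is only an illustration under a few assumptions: candidate thresholds are placed at the midpoint between neighbouring points, each side predicts its own mean, and the dosage and effectiveness numbers are made up. `best_threshold` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def best_threshold(x, y):
    """Try a threshold between every pair of neighbouring points and
    return the one with the smallest total sum of squared residuals."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_t, best_ssr = None, np.inf
    for i in range(1, len(x)):
        t = (x[i - 1] + x[i]) / 2            # candidate threshold (midpoint)
        left, right = y[:i], y[i:]
        # Each side predicts its own mean; SSR measures how far off that is
        ssr = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if ssr < best_ssr:
            best_t, best_ssr = t, ssr
    return best_t, best_ssr

# Made-up dosage / effectiveness values, for illustration only
dosage = np.array([2.0, 6.0, 10.0, 14.0, 18.0, 22.0])
effect = np.array([0.0, 0.0, 5.0, 5.0, 100.0, 100.0])
threshold, ssr = best_threshold(dosage, effect)
print(threshold, ssr)
```

Growing the full tree would mean repeating the same search on the points to the left and to the right of the chosen threshold.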

# 💻 Variance Illustrated

## Regression

# Classification

# Pros and Cons of Decision Trees

- There is a way to use Gini to define feature importance. If splitting on a feature gives a very bad Gini index, then most likely that is not a good feature to describe the dataset, depending on the situation.

- Because they split the data orthogonally, they might not be able to find a good split at all.
- Principal Component Analysis -> a dimensionality reduction technique. The main goal of PCA is to identify the directions (principal components) in the data along which there is the most variation. It achieves this by finding a new set of orthogonal variables, called principal components, that are linear combinations of the original features.

# Bagging

All right. So the first technique I want to show you is bagging. The idea here is that we want to aggregate multiple versions of the same model, and we'll see this in a moment. But first, a few things about bagging.

First of all, it's a parallel ensemble method. There are parallel and sequential ones. If you look here at the bottom right, the parallel ones compute the different versions of the model in parallel and, in the end, aggregate all of the results into one result. There are a few strategies for how to aggregate. The sequential ones wait for the previous model to run and try to learn from its mistakes, then they run, then the next one waits, and so on.

The aim here is to reduce the variance, so we're trying to get away from the problem that we just saw. Each of these tiny little models that get generated is called a weak learner, simply because that model by itself is not able to describe your data well enough. But the point is, if you have enough weak learners, you can average their results and find a good description of your data. That's essentially the idea. And these weak learners are trained on something called bootstrapped samples of the dataset.

# Bootstrapping

And this is what bootstrapping is. If you look at this graph, it's actually quite intuitive. You have your training data here. Then you create random samples from that dataset. So essentially you're taking your whole dataset and creating mini datasets out of it, and these are called bootstrapped samples. Then you train a weak learner on this sample, another one on this sample, and another one on this sample, right? That way you're essentially looking at your dataset from hundreds of different angles and training weak learners on them. That's also inherently why weak learners are not good at describing your complete data: each one is trained on only a sample, not on your whole dataset. So they don't have the full picture, but they have a bunch of smaller pictures.
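Here is a minimal sketch of drawing bootstrapped samples with NumPy. It assumes the usual definition of a bootstrap sample (same size as the original data, drawn with replacement), and the ten-row `X_train` is just a stand-in for real training rows.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed so the sketch is repeatable
X_train = np.arange(10)               # stand-in for 10 training rows

# Draw 3 bootstrapped samples: same size as the original, sampled WITH replacement
bootstrapped = [rng.choice(X_train, size=len(X_train), replace=True) for _ in range(3)]

for sample in bootstrapped:
    left_out = np.setdiff1d(X_train, sample)   # rows this sample never saw
    print(np.sort(sample), "-> left out:", left_out)
```

Because rows are drawn with replacement, some rows appear more than once in a sample and others are left out entirely, which is why every weak learner ends up with a slightly different picture of the data.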

# Random Forests == Bagged Trees

So if you imagine that for each one of these samples you're training a tree, a weak learner model, and you aggregate all of them together, what do you have if you have many trees? This whole thing is called a random forest. Random simply because we're grabbing random samples and training the trees on them. So random forests are a bagged ensemble of decision trees. That's basically what they are. And that's also why the first image you saw says random forest: basically a bunch of trees. Who would have thought?

# Prediction

And then when you actually run a prediction, you now have all of these weak learners. In the case of regression, the predictions from these weak learners are averaged: you take all of the different predictions, average them, and that's your final prediction. And if it's a classification, they're voted. There are a few voting strategies as well. I think the default one is just majority vote: the class that was predicted the most times across all of the weak learners is the one you assume is correct.
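The two aggregation rules can be sketched with plain NumPy. The predictions below are hypothetical outputs of five weak learners for a single observation, made up for this illustration.

```python
import numpy as np

# Hypothetical outputs of 5 weak learners for a single observation
regression_preds = np.array([3.1, 2.8, 3.4, 3.0, 2.7])
class_preds = np.array([1, 0, 1, 1, 2])

# Regression: the ensemble prediction is the plain average
final_value = regression_preds.mean()

# Classification: majority vote, i.e. the most frequently predicted class
labels, votes = np.unique(class_preds, return_counts=True)
final_class = labels[np.argmax(votes)]

print(final_value, final_class)
```

This shows hard majority voting. As far as I know, scikit-learn's random forest classifier averages the trees' predicted class probabilities instead of counting hard votes, which often gives the same answer but not always.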

# sklearn Random

This is how we do it in code. We import `RandomForestRegressor` and instantiate it. We have a number of estimators here that we can set. Then we can cross-validate like we normally would, passing in the forest, the X, the y, and some scoring method. And here we can see that there's a lot less variance: the results are much closer together, right?
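A minimal sketch of those steps, assuming synthetic data from `make_regression` as a stand-in for the lecture's dataset, 5 folds, and R² as the scoring method (all three are choices made for this illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: the lecture's own X and y are not shown here)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0)

# 5-fold cross-validation with R^2 as the scoring method
cv_results = cross_validate(forest, X_demo, y_demo, cv=5, scoring="r2")
print(cv_results["test_score"])
```

The spread of the values in `test_score` is what tells you about the variance: the closer together the fold scores are, the more stable the model.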

This is the graph version. Before, as we saw, the model was trying to fit all of these points precisely, and here it isn't: it's leaving out a few of them. It spends a lot less resources and doesn't overfit as much. So it seems to work for our purpose of avoiding overfitting.

The interesting thing to know about bagging is that you can bag any algorithm, because bagging, as we described before, just consists of training different versions of the same model. So you can bag anything, really. Here, for example, we're bagging a KNN. We use it as the weak learner, then we bag it here, and that is the result. Just to show you that you can bag basically any algorithm.
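As a rough sketch of bagging something other than a tree, here is a bagged KNN regressor. The data is synthetic and the numbers (5 neighbours, 20 estimators) are arbitrary choices for this illustration, not the lecture's settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data, for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

# The model to bag is passed as the first argument; any regressor would work here
bagged_knn = BaggingRegressor(KNeighborsRegressor(n_neighbors=5), n_estimators=20, random_state=0)
bagged_knn.fit(X_demo, y_demo)

print(len(bagged_knn.estimators_))   # one fitted KNN per bootstrapped sample
```

Swapping `KNeighborsRegressor` for any other scikit-learn regressor should work the same way, since the bagging wrapper only needs the model to support `fit` and `predict`.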

# Out of bag samples

And these are the out-of-bag samples. You can set a score, so `oob_score` to true. Essentially, all of the points that were not sampled during the bagging can be used as a test set when scoring your model. It's a bit more precise than not doing it. So unless you have a super tiny dataset, there is usually no reason not to use it, right?
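A minimal sketch of out-of-bag scoring, assuming a random forest classifier on the Iris data loaded from scikit-learn (the estimator count and seed are arbitrary choices for this illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# oob_score=True evaluates every row using only the trees that never saw it in training
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X_iris, y_iris)

print(forest.oob_score_)   # accuracy estimated from out-of-bag rows only
```

The score comes for free from the fit itself, so no separate test split is needed to get a first estimate of how the forest generalizes.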

So a random forest classifier with 100 estimators and a bagging classifier with 100 estimators are similar, but they're not the same. This one is actually slightly less optimized, as it says in the comment. And this one is not a good idea at all: if you're instantiating a random forest classifier, which has a default of 100 estimators, and then bagging that with 100 estimators, you're training 100 times 100 trees. That's a lot, right? It's 10k trees. Probably avoid that. The point of this slide is just to point that out: look precisely at where you're putting your estimators, to make sure that you're getting what you expect and you're not training too many trees and spending all day on it.
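To make the multiplication concrete, here is a small sketch that nests a forest inside a bagging classifier and counts the trees. The numbers are deliberately tiny (4 bagged forests of 3 trees each) so it runs instantly; they are assumptions for this illustration, not the slide's values.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

X_iris, y_iris = load_iris(return_X_y=True)

# Small numbers on purpose: 4 bagged forests, each with 3 trees
inner_forest = RandomForestClassifier(n_estimators=3, random_state=0)
nested = BaggingClassifier(inner_forest, n_estimators=4, random_state=0)
nested.fit(X_iris, y_iris)

# Every bagged estimator is itself a whole forest, so the tree counts multiply
total_trees = sum(len(f.estimators_) for f in nested.estimators_)
print(total_trees)   # 4 * 3 = 12
```

With the defaults of 100 on both levels, the same count would come out at 100 * 100 = 10,000 trees.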

# Pros and Cons of Bagging

- Complex structure -> we are training many different versions of the same model, so even though they are weak learners, the overall model is complex.
- Disregards the performance of the individual submodels. If by any chance you drew a bootstrapped sample and trained an amazing model on it that was super good at describing your data, bagging itself will simply not care. That result will still get averaged; it receives no special attention for being good. This can be, and usually is, a downside. But in the next slides we will see how we can apply a few algorithms and techniques that do exactly that: they recognize these individual performances and give them more attention.

# Boosting

This type of technique is called boosting.
