# Random Forest

# The Bagging Method as one of the Ensemble Methods

- ensemble learning algorithm
- multiple decision trees to improve the model performance
- better than a single decision tree (less overfitting, lower correlation between trees, some tolerance to missing values, resistance to outliers)
- uncorrelatedness advantage through bootstrapping and feature randomness
- used both for classification (voting) and regression (mean of the predictions) (CART = Classification and Regression Trees)
- ensemble methods
- bagging (bootstrap aggregation)
- with replacement

Random Forest is a popular ensemble learning algorithm in machine learning that **combines multiple decision trees to improve the model's performance and generalization**. **Ensemble methods work on the principle that a combination of weak models can create a powerful model**: "there is wisdom in crowds."

The random forest algorithm works by creating **a large number of decision trees, each trained on a different subset of the data and with a random subset of the features**. When making a prediction, each tree in the forest makes a prediction, and the **final prediction is the one that receives the most votes (in classification problems) or the average of the predictions (in regression problems)**.
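The aggregation step described above can be sketched in a few lines. This is a minimal illustration using only the standard library; the function names `majority_vote` and `average_prediction` and the sample predictions are made up for the example and do not come from any library.

```python
from collections import Counter
from statistics import mean


def majority_vote(labels):
    """Return the most common label; on a tie, the label seen first wins."""
    return Counter(labels).most_common(1)[0][0]


def average_prediction(values):
    """Return the mean of the individual regression predictions."""
    return mean(values)


# Hypothetical predictions from five classification trees
tree_votes = ["spam", "ham", "spam", "spam", "ham"]
# Hypothetical predictions from four regression trees
tree_estimates = [3.0, 2.5, 3.5, 3.0]

final_class = majority_vote(tree_votes)
final_value = average_prediction(tree_estimates)
print(final_class, final_value)
```

Note that some implementations aggregate differently. scikit-learn's random forest classifier, for instance, averages the trees' predicted class probabilities instead of counting hard votes.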

**The random forest algorithm has several advantages over a single decision tree:**
- Firstly, it can handle a large number of features and is less prone to overfitting, because each tree sees a different bootstrap sample and considers a different random subset of features at each split, which reduces the correlation between the trees (**uncorrelatedness**).
- Secondly, it can be made tolerant of missing data, although how missing values are handled depends on the implementation and does not follow from feature subsampling alone.
- Thirdly, it is resistant to outliers, because a single outlier appears in only some of the bootstrap samples and its influence is diluted when the trees' predictions are aggregated.
- Fully grown decision trees have low bias but high variance, which makes them unstable. Averaging the predictions of many such trees reduces the variance while leaving the bias roughly unchanged, so the ensemble generalizes better than any single tree.
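The variance argument in the last point can be made concrete with a standard result. As a sketch, assume the $B$ tree predictions $T_1, \dots, T_B$ are identically distributed, each with variance $\sigma^2$, and that every pair has the same positive correlation $\rho$ (these symbols are introduced here for illustration only). The variance of their average is then:

$$
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

The second term shrinks as more trees are added, but the first term remains. This is why lowering the correlation $\rho$ between trees, through bootstrapping and feature randomness, matters as much as increasing the number of trees.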

Random Forest is a powerful and versatile algorithm that can be used for both classification and regression problems. It is widely used in practice due to its high accuracy and robustness.

The random forest algorithm is a type of ensemble learning method that is built on the concept of decision trees. To understand the random forest algorithm, it is important to first understand the concept of ensemble methods and the specific ensemble method of bagging.

[Ensemble methods](https://scikit-learn.org/stable/modules/ensemble.html#adaboost) **are a technique that involves training multiple models on the same dataset**. **Each model makes its own prediction, and a final prediction is made by combining the predictions of the individual models**. Ensemble methods aim to create a strong model by combining the predictions of multiple models. **The key principle of ensemble methods** is that the diversity of the models can result in better performance than a single model.

![image.png](attachment:277ed522-0d0c-4797-8526-3794c6097f82.png)

**Bagging is a specific ensemble method that stands for Bootstrap Aggregating.** **It is a technique that trains each model in the ensemble using a randomly drawn subset of the training set with replacement**. Because each model sees a different subset of the training set, the individual models differ from one another, which produces diversity within the ensemble.

In the random forest algorithm, a collection of decision trees is created, each trained on a different bootstrap sample of the data, and the final prediction is made by taking the majority vote (classification) or the average (regression) of the individual trees' predictions. **Randomness enters the algorithm in two ways: bagging creates a random sample of the data for each tree, and a random subset of features is considered at each split of each tree. This leads to high diversity among the decision trees and reduces overfitting compared with a single decision tree**.
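The feature randomness at each split can be sketched as follows. This is an illustrative, standard-library-only example rather than how any particular library implements it; the helper `candidate_features` is hypothetical, and the choice of the square root of the feature count is an assumption that mirrors a common default.

```python
import math
import random


def candidate_features(n_features, rng):
    """Pick a random subset of feature indices to consider at one split."""
    k = max(1, int(math.sqrt(n_features)))  # e.g. 16 features -> 4 candidates
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)
split_a = candidate_features(16, rng)  # candidates for one tree node
split_b = candidate_features(16, rng)  # candidates for another node
print(split_a, split_b)
```

Because each node only searches for the best split among its own candidates, two trees grown on similar data can still end up with different structures.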

![image.png](attachment:a7736e4f-4766-4eec-a684-37a04b547b4c.png)

**Ensemble methods**

Ensemble methods in machine learning involve combining multiple models (called "base models") to improve the overall performance of the system. The idea is that by combining the predictions of multiple models, we can obtain a better result than by using any single model alone.

There are different ways to create an ensemble, but one common approach is to train several base models on the same dataset using different algorithms or settings, and then combine their predictions using a weighted average or a voting mechanism. By doing this, the ensemble can reduce the risk of overfitting and capture a wider range of patterns in the data.

Ensemble methods have been shown to be effective in many applications, including classification, regression, and clustering. They are widely used in competitions such as Kaggle, where teams often combine multiple models to achieve the best performance.

![image.png](attachment:ed77036b-64fc-463d-9bf4-51d1d3ba84a9.png)

**Bagging (Bootstrap Aggregating)**

**Bagging (Bootstrap Aggregating) is a popular ensemble method in machine learning that involves creating multiple base models using subsets of the training data and then aggregating their predictions to make a final prediction.**

The bagging process involves randomly selecting subsets of the training data with replacement, training a base model on each subset, and then combining their predictions using a voting mechanism for classification problems or averaging for regression problems. By using different subsets of the data, the base models have slightly different training sets and can capture different patterns in the data.

The advantage of bagging is that it can reduce the variance of the model and decrease the risk of overfitting. Bagging is often used with decision trees, but it can also be applied to other types of models. In addition, bagging can be parallelized easily because each base model can be trained independently on a different subset of the data.

Overall, bagging is a powerful technique that can improve the performance of machine learning models and make them more robust.
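The bagging process described above (draw bootstrap samples, train one base model per sample, aggregate by voting) can be sketched by hand. This is a minimal illustration that assumes NumPy and scikit-learn are installed; the helpers `fit_bagged_trees` and `predict_majority`, the synthetic dataset, and the ensemble size of 25 are choices made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_bagged_trees(X, y, n_models=25, seed=0):
    """Train one decision tree per bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # draw n row indices with replacement
        tree = DecisionTreeClassifier(random_state=int(rng.integers(1_000_000)))
        tree.fit(X[idx], y[idx])
        models.append(tree)
    return models


def predict_majority(models, X):
    """Majority vote across models; assumes non-negative integer labels."""
    votes = np.stack([m.predict(X) for m in models])  # (n_models, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])


# Synthetic data, used only to make the sketch runnable
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
models = fit_bagged_trees(X[:200], y[:200])
pred = predict_majority(models, X[200:])
accuracy = float((pred == y[200:]).mean())
print(f"held-out accuracy: {accuracy:.3f}")
```

Each tree is fitted independently of the others, so the loop in `fit_bagged_trees` is the part that could be run in parallel.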

**Selection with replacement**

In the context of bagging, "with replacement" means that when creating a subset of the training data, we are allowed to select the same data point multiple times.

For example, if we have a training set of 100 samples and we want to create a subset of 50 samples with replacement, we randomly select a sample from the training set, add it to the subset, and then put the sample back in the original training set. We repeat this process 49 more times, allowing the same sample to be selected multiple times.

The idea behind selecting subsets with replacement is to create different training sets for each base model. By allowing the same sample to be selected multiple times, some samples may be selected more often than others, while some may not be selected at all. **This introduces some randomness into the training process, which can help the base models capture different patterns in the data and reduce their correlation with each other**.
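The example of drawing 50 samples from a training set of 100 can be reproduced with the standard library. This is a small sketch; the function name `bootstrap_sample` and the seed are arbitrary choices for illustration. In practice a bootstrap sample is often drawn with the same size as the training set.

```python
import random


def bootstrap_sample(n_samples, subset_size, seed=None):
    """Draw row indices with replacement and report which rows were never drawn."""
    rng = random.Random(seed)
    # Each draw picks from the full range again, so repeats are possible
    in_bag = [rng.randrange(n_samples) for _ in range(subset_size)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


in_bag, out_of_bag = bootstrap_sample(100, 50, seed=42)
n_distinct = len(set(in_bag))
print(f"{len(in_bag)} draws, {n_distinct} distinct rows, {len(out_of_bag)} rows never drawn")
```

Any gap between the number of draws and the number of distinct rows comes from rows that were selected more than once.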

Three differences between random forest and bagging:

- It uses only decision trees as base models.
- After the training data is split into bootstrap datasets, roughly 2/3 of the data ends up in-bag and the remaining 1/3 out-of-bag (OOB); the OOB samples are set aside for evaluation, for example to estimate feature importance.
- While training on the in-bag 2/3 of the data, it also selects the features at random. A plain decision tree always splits on the best feature among all features. Here, each split considers only a random subset of the features, so different trees use different features, which tends to increase accuracy and reduces the correlation between trees.
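The rough 2/3 versus 1/3 split between in-bag and out-of-bag data follows from sampling with replacement. As a sketch, assume each bootstrap sample draws n rows from a training set of n rows: the chance that a given row is never drawn is (1 - 1/n)^n, which approaches 1/e (about 0.37) as n grows. The function name below is made up for the illustration.

```python
import math


def expected_oob_fraction(n_samples):
    """Probability that one row is missed by n_samples draws with replacement."""
    return (1.0 - 1.0 / n_samples) ** n_samples


for n in (10, 100, 1000):
    print(n, round(expected_oob_fraction(n), 4))

limit = math.exp(-1)  # value the fraction approaches for large n
print("limit:", round(limit, 4))
```

So a little over a third of the rows are expected to be out-of-bag for each tree, and a little under two thirds in-bag.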

**Random Forest Hyperparameters**

Random forest is an ensemble of many decision trees, so the two models share many hyperparameters. For example, all the decision trees in the random forest follow the same maximum depth rule, minimum samples split rule, minimum impurity decrease rule, and so on.

The most important hyperparameters, including those specific to random forest, are:

- `n_estimators`: default=100. The total number of decision trees in the forest. More trees generally give more stable and often more accurate predictions, with diminishing returns, at the cost of more CPU time and memory.
- `max_depth`: default=None. If None, nodes are expanded until all leaves are pure or contain fewer than `min_samples_split` samples.
- `max_features`: default="sqrt" for classification (with 16 features, 4 are considered). The number of features in the random subset considered at each split. Increasing `max_features` can make each individual tree stronger, but it also makes the trees more correlated with each other.
- `bootstrap`: default=True. Whether each tree is trained on a bootstrap sample of the training rows. If set to False, every tree is built on the whole dataset. It must be True for `oob_score` to be used.
- `oob_score`: default=False. Whether to compute the out-of-bag (OOB) score during training. This score can serve as a validation score, which is useful when the dataset is small.
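The hyperparameters listed above map directly onto the constructor of scikit-learn's `RandomForestClassifier`. The following is a minimal sketch, assuming scikit-learn is installed; the synthetic dataset and its size are arbitrary choices for illustration, and the values passed are simply the ones discussed above, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 16 features, so "sqrt" considers 4 features per split
X, y = make_classification(
    n_samples=500, n_features=16, n_informative=6, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees
    max_depth=None,       # grow each tree until its leaves are pure
    max_features="sqrt",  # size of the random feature subset per split
    bootstrap=True,       # train each tree on a bootstrap sample
    oob_score=True,       # score each row using trees that did not see it
    random_state=0,
)
forest.fit(X, y)

oob_accuracy = forest.oob_score_
importances = forest.feature_importances_
print(f"OOB accuracy: {oob_accuracy:.3f}")
print("largest feature importance:", round(float(importances.max()), 3))
```

With `oob_score=True`, the fitted model exposes `oob_score_`, which gives a validation-style estimate without holding out a separate split.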

![image.png](attachment:677643a7-cdce-499f-858b-35ee712b3879.png)

Because its interpretability is weak (interpreting 100 decision trees is hard), random forest is counted among the black-box models.