<b><center>Random Forest Algorithm</center></b>

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

Ensemble learning is the process by which multiple models, such as classifiers or experts, are strategically generated and combined to solve a particular computational intelligence problem. <br>
Ensemble learning is primarily used to improve the performance of a model, or reduce the likelihood of an unfortunate selection of a poor one. Other applications of ensemble learning include assigning a confidence to the decision made by the model, selecting optimal features, error-correcting etc.

<b>Why use ensemble techniques?</b><br>
There are two main, related reasons to use an ensemble over a single model: <br>

- Performance: An ensemble can make better predictions and achieve better performance than any single contributing model.
- Robustness: An ensemble reduces the spread or dispersion of the predictions and model performance.

Ensembles are used to achieve better predictive performance on a predictive modeling problem than a single predictive model. This can be understood as the ensemble reducing the variance component of the prediction error, at the cost of adding a small amount of bias.
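
The variance reduction can be checked with a toy simulation. This is only a sketch under a strong assumption: each "model" is the true value plus independent Gaussian noise, which real models trained on overlapping data do not satisfy exactly. The names `true_value`, `n_models` and `n_trials` are illustrative and not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 10.0   # hypothetical quantity every model tries to predict
n_models = 25       # size of the ensemble
n_trials = 10_000   # number of repeated experiments

# Each row is one experiment; each column is one noisy model prediction.
predictions = true_value + rng.normal(loc=0.0, scale=2.0, size=(n_trials, n_models))

# Spread of a single model versus spread of the averaged ensemble.
single_model_var = predictions[:, 0].var()
ensemble_var = predictions.mean(axis=1).var()

print(f"Variance of a single model: {single_model_var:.3f}")
print(f"Variance of the ensemble:   {ensemble_var:.3f}")
```

With independent errors, the variance of the average shrinks roughly in proportion to the number of models. In practice the models are correlated, so the gain is smaller.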

Common types of ensembles:
- Bayes optimal classifier
- Bootstrap aggregating (bagging)
- Boosting
- Bayesian model averaging
- Bayesian model combination
- Bucket of models
- Stacking

Random Forest is a supervised learning algorithm that belongs to the bootstrap aggregating (bagging) family of ensembles.<br>
Bootstrap aggregating, or bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method.

Bagging is composed of two parts: bootstrapping and aggregation. Bootstrapping is a sampling method in which a sample is drawn from a data set with replacement. The learning algorithm is then run on each of the samples drawn. <br>

Sampling with replacement means that a record is returned to the pool after it is drawn, so every draw is made from the full data set and is independent of the previous draws; the same record can appear more than once in a sample. When a sample is selected without replacement, each draw depends on the previous ones, because records that were already selected can no longer be chosen.
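
A bootstrap sample can be drawn in a few lines with NumPy. The sketch below uses a hypothetical data set of `k = 1000` record indices rather than the notebook's data.

```python
import numpy as np

rng = np.random.default_rng(42)

k = 1000                 # number of records in the hypothetical data set
indices = np.arange(k)

# Sampling WITH replacement: every draw is independent, duplicates are allowed.
bootstrap_idx = rng.choice(indices, size=k, replace=True)

# Fraction of the original records that appear at least once in the sample.
unique_fraction = np.unique(bootstrap_idx).size / k

# Records that were never drawn for this sample.
not_drawn = np.setdiff1d(indices, bootstrap_idx)

print(f"Distinct records in the sample: {unique_fraction:.3f}")
print(f"Records never drawn: {not_drawn.size}")
```

When the sample is as large as the data set, about 63% of the records are expected to appear at least once (1 - 1/e), and the rest of the sample consists of repeats.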

The predictions of the individual models are then aggregated into a single final prediction. For classification, the aggregation can be based on the number of votes each class receives or on the class probabilities predicted by the models trained on the bootstrap samples; for regression, the predictions are averaged.
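
The two ways of aggregating classifier outputs can give different answers. The probabilities below are made up purely for illustration (3 models, 4 samples, 2 classes); they do not come from a trained model.

```python
import numpy as np

# Hypothetical class probabilities, shape (n_models, n_samples, n_classes).
probas = np.array([
    [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55], [0.2, 0.8]],
    [[0.8, 0.2], [0.3, 0.7], [0.45, 0.55], [0.6, 0.4]],
    [[0.7, 0.3], [0.6, 0.4], [0.90, 0.10], [0.1, 0.9]],
])
n_classes = probas.shape[2]

# Vote counting: each model casts one vote for its most likely class.
votes = probas.argmax(axis=2)  # shape (n_models, n_samples)
vote_counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)], axis=1)
hard_pred = vote_counts.argmax(axis=1)

# Probability averaging: average the probabilities, then pick the best class.
soft_pred = probas.mean(axis=0).argmax(axis=1)

print(hard_pred)  # → [0 1 1 1]
print(soft_pred)  # → [0 1 0 1]
```

The third sample shows the difference: two models lean slightly towards class 1, but the third is very confident in class 0, so averaging the probabilities overturns the majority vote.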

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Ensemble_Bagging.svg/440px-Ensemble_Bagging.svg.png"></img></center>

Let's try to understand Random Forest Algorithm more clearly.

- To implement the Random Forest Algorithm, we first draw n random records from a data set that has k records. <br>

- Because the Random Forest Algorithm is based on decision trees, an individual decision tree is constructed for each sample. <br>

- Each of these decision trees then generates an output. <br>

- Finally, the outputs are combined by majority voting for classification or by averaging for regression.

Here, bootstrapping is done by randomly selecting n records from the given k records, and the outputs of the individual decision trees are combined to obtain one output.
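
The steps above can be sketched by hand with scikit-learn's `DecisionTreeRegressor`. This is a simplified illustration, not the library's actual implementation: the data is synthetic (an assumption, not the notebook's CSV file), each bootstrap sample has n = k records, and the per-split feature subsampling that a full random forest can use is left out.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption; not the notebook's data set).
k = 200
X = rng.uniform(0, 10, size=(k, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=k)

n_trees = 50
trees = []
for i in range(n_trees):
    # Step 1: draw a bootstrap sample of record indices (with replacement).
    idx = rng.integers(0, k, size=k)
    # Step 2: fit one decision tree on that sample.
    tree = DecisionTreeRegressor(max_depth=3, random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 3: every tree produces its own output for the new points.
X_new = np.linspace(0, 10, 5).reshape(-1, 1)
per_tree = np.stack([t.predict(X_new) for t in trees])  # shape (n_trees, 5)

# Step 4: average the outputs (regression); a classifier would vote instead.
bagged_prediction = per_tree.mean(axis=0)
print(bagged_prediction)
```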

<center><img src="https://editor.analyticsvidhya.com/uploads/33019random-forest-algorithm2.png"></img></center>

Because each tree is built independently of the others, from different records and attributes, the trees can be trained in parallel (parallelization) and their outputs combined into a single result.
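
In scikit-learn, the number of parallel workers is controlled by the `n_jobs` parameter, and the number of attributes considered at each split by `max_features`. The sketch below uses synthetic data (an assumption) and checks that, with a fixed `random_state`, training with one or two workers should give the same forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic features and target (an assumption, for illustration only).
X = rng.normal(size=(300, 4))
y = X[:, 0] * 2.0 + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

# max_features=2: each split considers a random subset of 2 of the 4 attributes.
common = dict(n_estimators=50, max_depth=3, max_features=2, random_state=42)

sequential = RandomForestRegressor(n_jobs=1, **common).fit(X, y)
parallel = RandomForestRegressor(n_jobs=2, **common).fit(X, y)

same = np.allclose(sequential.predict(X), parallel.predict(X))
print("Same predictions with 1 and 2 workers:", same)
print("Number of independent trees:", len(parallel.estimators_))
```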

Now let's see how we can implement Random Forest Algorithm with the help of code.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv('Random-Forest-Data.csv')
df.head(n=5)

| Unnamed: 0 | x | y |
|---|---|---|
| 0 | 59.0 | 152.553428 |
| 1 | 88.69697 | 158.420441 |
| 2 | 87.443939 | 154.189316 |
| 3 | 110.090909 | 161.136969 |
| 4 | 126.787879 | 158.819572 |


In [3]:
X = df.drop('y', axis=1)  # features: 'x' plus the 'Unnamed: 0' index column from the CSV
y1 = df['y']              # target

In [4]:
# train_size=0.3 keeps 30% of the rows for training and the remaining 70% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y1, train_size=0.3, random_state=42)

In [5]:
from sklearn.ensemble import RandomForestRegressor
# 50 trees, each at most 3 levels deep; random_state makes the run reproducible
rf = RandomForestRegressor(random_state=42, n_jobs=1, max_depth=3, n_estimators=50)
rf.fit(X_train, y_train)

In [6]:
y_pred = rf.predict(X_test) 

In [7]:
r2 = r2_score(y_test, y_pred)
print("The R2 score is ",r2)

The R2 score is  0.8894130486785599


The coefficient of determination, also called the R2 score, is used to evaluate the performance of a regression model. It is the proportion of the variation in the dependent (output) variable that is predictable from the independent (input) variable(s). It indicates how well the observed results are reproduced by the model, based on the proportion of the total variation of the results that the model explains. <br>
Its formula is given by: <br>
<b>R2 = 1 - SS(res) / SS(tot)</b>
- where SS(res) is the sum of squares of the residual errors, i.e. of the differences between the observed and the predicted values.
- SS(tot) is the total sum of squares, i.e. of the differences between the observed values and their mean.
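
The formula can be written out directly with NumPy. The helper `r2_manual` and the small arrays below are illustrative only; `sklearn.metrics.r2_score`, used in the notebook, computes the same quantity.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R2 = 1 - SS(res) / SS(tot), following the definitions above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up observed and predicted values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

# SS(res) = 0.75 and SS(tot) = 20, so R2 = 1 - 0.75 / 20.
print(f"{r2_manual(y_true, y_pred):.4f}")  # → 0.9625

# A model that always predicts the mean explains none of the variation.
print(r2_manual(y_true, [6.0, 6.0, 6.0, 6.0]))  # → 0.0
```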

We have completed our first ensembling algorithm. <br>
Download the dataset from Kaggle and get started!