# Introduction to Random Forest Algorithm with Python


## What is Random Forest?
As Leo Breiman defined it in his [research paper](https://medium.com/r/?url=https%3A%2F%2Fwww.stat.berkeley.edu%2F~breiman%2Frandomforest2001.pdf), “Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest.”

The same paper gives a more formal definition: “A random forest is a classifier consisting of a collection of tree-structured classifiers {h(x, Θk), k = 1, ...} where the {Θk} are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x.”

Briefly, a random forest builds multiple decision trees and merges their predictions to get a more accurate and stable result.
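
To make the "unit vote" idea concrete, here is a minimal sketch that builds a small forest by hand: each tree is trained on a bootstrap sample and the majority class wins. The synthetic data from `make_classification`, the number of trees (25) and the names `X_demo`, `y_demo`, `trees` and `majority_vote` are assumptions chosen for illustration; they are not part of the Titanic example used later.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (an assumption for illustration, not the Titanic set)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

rng = np.random.RandomState(0)
trees = []
for _ in range(25):  # an odd number of trees avoids ties in a binary vote
    # Bootstrap sample: draw row indices with replacement
    idx = rng.randint(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt" adds the per-split feature randomness
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=rng.randint(1_000_000))
    trees.append(tree.fit(X_demo[idx], y_demo[idx]))

# Each tree casts one vote per sample; the majority class wins
votes = np.stack([t.predict(X_demo) for t in trees])  # shape (n_trees, n_samples)
majority_vote = (votes.mean(axis=0) > 0.5).astype(int)

print("Training accuracy of the hand-made forest:",
      (majority_vote == y_demo).mean())
```

Note that Sklearn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so this sketch follows Breiman's definition more literally than the library does.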


## Advantages of Random Forests
* It can be used for both classification and regression problems.
* Reduced overfitting: averaging many trees lowers the variance of the prediction, so the risk of overfitting is significantly lower than with a single tree.
* With majority voting on a binary problem, the forest makes a wrong prediction only when more than half of the base classifiers are wrong.
* It is easy to measure the relative importance of each feature for the prediction. Sklearn exposes this through the `feature_importances_` attribute of a fitted model.

Because of that, it is often more accurate than a single decision tree and competitive with many other algorithms.
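
The majority-vote argument can be checked with a little arithmetic. The sketch below assumes a binary problem and, unrealistically, that the trees err independently with the same probability `p_wrong`; real trees in a forest are correlated, so the numbers are an optimistic bound rather than a prediction. The helper `majority_error` is a hypothetical name introduced here.

```python
from math import comb

def majority_error(n_trees: int, p_wrong: float) -> float:
    """Probability that more than half of n_trees independent voters are wrong."""
    k_min = n_trees // 2 + 1  # smallest number of wrong votes that forms a majority
    return sum(
        comb(n_trees, k) * p_wrong**k * (1 - p_wrong) ** (n_trees - k)
        for k in range(k_min, n_trees + 1)
    )

# Each voter is wrong 30% of the time; the ensemble error shrinks as trees are added
for n in (1, 5, 25, 101):
    print(n, round(majority_error(n, 0.3), 4))
```

With a single tree the error is just `p_wrong`; with five trees it already drops to about 0.163 under these assumptions.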

## Disadvantages of Random Forests 
* Random forests have been observed to overfit on some datasets with noisy classification/regression tasks.
* They are more complex and computationally expensive than a single decision tree.


### Use Case
Predict whether a Titanic passenger survived (the `Survived` column) from the preprocessed passenger features.

## Import the needed dependencies:

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score


## Load the preprocessed dataset:

Download the preprocessed dataset [here](https://drive.google.com/file/d/1rzbDYv3tYLQ7J-P3cgG7mHVwxWzgBwdr/view?usp=sharing).

In [None]:
dataset = pd.read_csv('TitanicPreprocessed.csv')

In [None]:
dataset.head()

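| Unnamed: 0 | Sex | Age | SibSp | Parch | Fare | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Title_Officer | ... | Ticket_STONOQ | Ticket_SWPP | Ticket_WC | Ticket_WEP | Ticket_XXX | FamilySize | Singleton | SmallFamily | LargeFamily | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 22.0 | 1 | 0 | 7.25 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 |
| 1 | 0 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 1 |
| 2 | 0 | 26.0 | 0 | 0 | 7.925 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 0 | 35.0 | 1 | 0 | 53.1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 1 | 0 | 1 |
| 4 | 1 | 35.0 | 0 | 0 | 8.05 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |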


In [None]:
y = dataset['Survived']
X = dataset.drop(['Survived'], axis=1)

# Split the dataset into train and test data (25% held out for testing)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.25, random_state=0)

## Set the parameters for the random forest model:

In [None]:
parameters = {'bootstrap': True,
              'min_samples_leaf': 3,
              'n_estimators': 50, 
              'min_samples_split': 10,
              'max_features': 'sqrt',
              'max_depth': 6,
              'max_leaf_nodes': None}

## Hyperparameters of the Sklearn random forest classifier [[2]](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html):

* **bootstrap** : boolean, optional (default=True)

> Whether bootstrap samples are used when building trees.

* **min_samples_leaf** : int, float, optional (default=1)

> The minimum number of samples required to be at a leaf node:

> - If int, then consider min_samples_leaf as the minimum number.

> - If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each leaf.

* **n_estimators** : integer, optional (default=10; changed to 100 in scikit-learn 0.22)
> The number of trees in the forest.

* **min_samples_split** : int, float, optional (default=2)
> The minimum number of samples required to split an internal node:

> - If int, then consider min_samples_split as the minimum number.
> - If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.

* **max_features** : int, float, string or None, optional (default=“auto”; newer scikit-learn releases default to “sqrt” and no longer accept “auto”)
> The number of features to consider when looking for the best split:

> - If int, then consider max_features features at each split.
> - If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
> - If “auto”, then max_features=sqrt(n_features).
> - If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
> - If “log2”, then max_features=log2(n_features).
> - If None, then max_features=n_features.


* **max_depth** : integer or None, optional (default=None)
> The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.


* **max_leaf_nodes** : int or None, optional (default=None)
> Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined by their relative reduction in impurity. If None, then the number of leaf nodes is unlimited.


To learn more about the remaining hyperparameters, check out [sklearn.ensemble.RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
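
The float and string options above resolve to plain integers. The sketch below only reproduces the documented formulas with Python's `math` module; it does not call Sklearn, and the sizes `n_samples` and `n_features` as well as the fractions 0.01, 0.05 and 0.5 are hypothetical values picked for illustration.

```python
import math

# Hypothetical training-set dimensions, for illustration only
n_samples, n_features = 668, 64

# Float values are fractions of the training set, rounded up
min_samples_leaf = math.ceil(0.01 * n_samples)    # fraction 0.01 of the rows
min_samples_split = math.ceil(0.05 * n_samples)   # fraction 0.05 of the rows

# String and float values for max_features become a feature count per split
max_features_sqrt = max(1, int(math.sqrt(n_features)))
max_features_log2 = max(1, int(math.log2(n_features)))
max_features_half = int(0.5 * n_features)         # float 0.5 means half the features

print(min_samples_leaf, min_samples_split)
print(max_features_sqrt, max_features_log2, max_features_half)
```

Smaller `max_features` values make the trees more different from each other, which is the point of the random feature selection, at the cost of each individual tree being weaker.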

## Define the model:

In [None]:
RF_model = RandomForestClassifier(**parameters)

## Train the model:

In [None]:
RF_model.fit(train_X, train_y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=6, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=3, min_samples_split=10,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
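
Once a forest is fitted, the feature importances mentioned in the advantages can be read from its `feature_importances_` attribute. The sketch below is self-contained and therefore uses synthetic data with made-up column names (`feature_0` ... `feature_5`) and a separate `model` object; these are assumptions for illustration, not the Titanic features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the column names are made up for illustration
X_arr, y_arr = make_classification(n_samples=400, n_features=6,
                                   n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

model = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=0)
model.fit(X_demo, y_arr)

# feature_importances_ holds impurity-based importances that sum to 1
importances = pd.Series(model.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

In this notebook the equivalent call would be `pd.Series(RF_model.feature_importances_, index=train_X.columns)`.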

## Test the trained model on test data:

In [None]:
RF_predictions = RF_model.predict(test_X)

In [None]:
score = accuracy_score(test_y, RF_predictions)
print(score)

0.8251121076233184




We see that the model's accuracy on the test data is about 82.5%, not bad at all.
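
Accuracy alone can hide how the errors are distributed between the two classes. As a hedged sketch, the snippet below computes accuracy, the F1 score (imported at the top of the notebook) and the confusion matrix on a tiny set of made-up labels; `y_true` and `y_pred` are toy values for illustration, not the model's real predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy labels (made up for illustration): 1 = survived, 0 = did not survive
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)   # share of correct predictions
f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall for class 1
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(acc)
print(f1)
print(cm)
```

On these toy labels the accuracy is 0.7 while the F1 score for the positive class is only about 0.57, because half of the survivors are missed. For the trained model, the same metrics would come from `f1_score(test_y, RF_predictions)` and `confusion_matrix(test_y, RF_predictions)`.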