# <font color="orange">Random Forests Classification (ensemble method)</font>
<p id="9631" class="pw-post-body-paragraph kl km ig kn b ko kp jh kq kr ks jk kt ku kv kw kx ky kz la lb lc ld le lf lg hz gh" data-selectable-paragraph="">Before discussing random forests, we first need to understand bagging (bootstrap aggregating). <mark class="wg wh pi">Bagging is a simple and very powerful ensemble method. It is a general procedure that can be used to reduce a model’s variance.</mark> High variance is a sign of overfitting. Certain algorithms, such as decision trees, usually suffer from high variance. In other words, decision trees are extremely sensitive to the data on which they are trained: if the training data changes even slightly, the resulting tree can be very different, and as a result the model’s predictions can change drastically. Bagging addresses this problem. It trains several decision trees, each on a bootstrap sample of the training data, and aggregates their predictions, by averaging for regression or by majority vote for classification, to get the final prediction. Combining many trees in this way systematically reduces overfitting. <strong class="kn ih">Bootstrap sampling simply means sampling rows at random from the training dataset with replacement.</strong>

Random forest is one of the most widely used ensemble learning algorithms. It builds on bagging: each tree is trained on a bootstrap sample, and each split additionally considers only a random subset of the features, which makes the trees less correlated with one another. Why is it so effective? Because averaging many trees trained on different samples of the original dataset reduces the variance of the final model, and lower variance means less overfitting. Overfitting happens when a model tries to explain small, sample-specific variations in the dataset, which arise because the dataset is only a small sample of the population of all possible examples of the phenomenon we are trying to model.

</p>
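The bagging procedure described above can be sketched by hand. This is a minimal illustration under stated assumptions, not how `RandomForestClassifier` is implemented internally: the helper names `bootstrap_indices`, `fit_bagged_trees` and `predict_majority` are hypothetical, and the choice of 25 trees and seed 0 is arbitrary. It draws bootstrap samples with NumPy, fits one scikit-learn decision tree per sample, and aggregates the predictions by majority vote.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices uniformly at random, with replacement."""
    return rng.integers(0, n_rows, size=n_rows)


def fit_bagged_trees(X, y, n_trees, rng):
    """Fit one decision tree per bootstrap sample of (X, y)."""
    trees = []
    for _ in range(n_trees):
        idx = bootstrap_indices(len(X), rng)
        # Seed each tree from the shared generator so the run is reproducible
        tree = DecisionTreeClassifier(random_state=int(rng.integers(0, 2**31 - 1)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_majority(trees, X):
    """Aggregate the trees' predictions by majority vote."""
    votes = np.stack([tree.predict(X) for tree in trees])  # (n_trees, n_samples)
    # For each sample (one column of votes), pick the most frequent label
    return np.array([np.bincount(column).argmax() for column in votes.T])


rng = np.random.default_rng(0)  # arbitrary seed, chosen for this sketch
digits = load_digits()
X, y = digits.data, digits.target
X_train, y_train = X[:1000], y[:1000]
X_test, y_test = X[1000:], y[1000:]

bagged_trees = fit_bagged_trees(X_train, y_train, n_trees=25, rng=rng)
bagged_acc = (predict_majority(bagged_trees, X_test) == y_test).mean()
print(f"accuracy of 25 bagged trees: {bagged_acc:.3f}")
```

Because rows are drawn with replacement, each bootstrap sample repeats some rows and leaves others out, so every tree sees a slightly different dataset. That difference is what the vote averages over.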


<img src="../../../img/1_5Spqp6X1fDbWlNyCgTMzdQ.png">
<img src="../../../img/1_l16JAxJR5MJea12jut-FLQ.png">
<img src="../../../img/1_5vlUF8FRR6flPPWK4wt-Kw.png">


In [7]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as PLT, cm as CMAP
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

In [8]:
# Load scikit-learn's built-in handwritten digits dataset (8x8 grayscale images, labels 0-9)
DT = datasets.load_digits()

In [9]:
# Flatten each 8x8 image into a 64-element feature vector; Y holds the digit labels
X = DT.images.reshape(len(DT.images), -1)
Y = DT.target
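As a quick check on the reshape step, the following small sketch prints the array shapes before and after flattening. The variable name `digits` is used here only for illustration.

```python
from sklearn.datasets import load_digits

digits = load_digits()
# images: one 8x8 array per sample
print(digits.images.shape)  # (1797, 8, 8)

# -1 lets NumPy infer the second dimension: 8 * 8 = 64 features per sample
X = digits.images.reshape(len(digits.images), -1)
print(X.shape)  # (1797, 64)
```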

In [11]:
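# Train a forest of 1000 trees on the first 1000 samples; verbose=True prints progress logs
random_forest_classifier = RandomForestClassifier(n_estimators=1000, verbose=True)
random_forest_classifier.fit(X[:1000], Y[:1000])

# Evaluate on the remaining 797 samples, which the model has not seen
predicted = random_forest_classifier.predict(X[1000:])
expected = Y[1000:]

# Per-class precision, recall and F1-score
report = metrics.classification_report(expected, predicted)
print(report)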

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    4.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


              precision    recall  f1-score   support

           0       0.99      0.99      0.99        79
           1       0.93      0.89      0.91        80
           2       1.00      0.91      0.95        77
           3       0.90      0.84      0.87        79
           4       0.98      0.95      0.96        83
           5       0.89      0.98      0.93        82
           6       0.98      0.99      0.98        80
           7       0.93      0.99      0.96        80
           8       0.88      0.91      0.90        76
           9       0.88      0.91      0.90        81

    accuracy                           0.93       797
   macro avg       0.94      0.93      0.93       797
weighted avg       0.94      0.93      0.93       797



[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    0.2s finished
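To put the forest's score in context, it can help to fit a single decision tree on the same split and compare accuracies. The sketch below does that. `random_state=0` and `n_estimators=100` are assumptions made here for reproducibility and speed and are not the settings used above, so the exact numbers will differ from the report.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X = digits.images.reshape(len(digits.images), -1)
y = digits.target

# Same split as above: first 1000 samples for training, the rest for testing
X_train, y_train = X[:1000], y[:1000]
X_test, y_test = X[1000:], y[1000:]

# One high-variance model versus an ensemble of 100 such models
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

tree_acc = accuracy_score(y_test, tree.predict(X_test))
forest_acc = accuracy_score(y_test, forest.predict(X_test))
print(f"single tree: {tree_acc:.3f}")
print(f"random forest: {forest_acc:.3f}")
```

On this split the forest should score noticeably higher than the single tree, which is consistent with the variance-reduction argument made earlier.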
