# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Random Forest

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*

Random forests:
- Describe the random forest algorithm
- Identify its motivation
- Evaluate the effectiveness of the model 

### Introduction (5 mins)

## Let's review Decision Trees first

### Titanic example:


<img src="https://upload.wikimedia.org/wikipedia/commons/f/f3/CART_tree_titanic_survivors.png" width=500>

*For the purpose of this lesson, we will focus on the classifier part of CART (Classification and Regression Trees). As the Penn State Applied Data Mining and Statistical Learning course notes point out, a decision tree classifier is highly interpretable visually.*


### Goodness of Split 

<img src='http://www.aunalytics.com/wp-content/uploads/2015/01/DecisionTree_BlogImages_1-15-03.png'>
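One common way to score the goodness of a split is the drop in Gini impurity from the parent node to its children. The sketch below is a minimal illustration, assuming Gini impurity as the criterion (entropy is another common choice); the helper names `gini` and `split_gain` and the toy labels are made up for this example.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def split_gain(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Hypothetical labels: 1 = survived, 0 = did not survive
parent = [1, 1, 1, 0, 0, 0]
perfect_gain = split_gain(parent, [1, 1, 1], [0, 0, 0])  # children are pure
poor_gain = split_gain(parent, [1, 0, 1], [0, 1, 0])     # children stay mixed

print("gain of the clean split:", perfect_gain)
print("gain of the mixed split:", poor_gain)
```

A tree-growing algorithm would evaluate many candidate splits this way and keep the one with the largest gain.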

# What Is Bagging (Bootstrap Aggregating)?

## Bagging
Bagging averages a given procedure over many bootstrap samples to reduce its variance: a "poor man's Bayes."

<img src='http://danielboatman.com/wp-content/uploads/2015/07/poor-man-empty-pockets-595x335.jpg'>

## Bagging
- Averages many trees
- Produces smoother decision boundaries
- Dramatically reduces the variance of unstable procedures (like trees), which improves prediction
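The bullets above can be sketched by hand: draw bootstrap samples, fit one tree per sample, and combine the trees by majority vote. This is a rough illustration on the iris data rather than a production implementation; the number of trees and the seed are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n_trees = 25                    # arbitrary choice for illustration

trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# One row of votes per tree, one column per observation
votes = np.stack([tree.predict(X) for tree in trees])

# Majority vote across trees for each observation
bagged_pred = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])

print("training accuracy of the bagged vote:", (bagged_pred == y).mean())
```

scikit-learn's `BaggingClassifier` wraps the same idea; the loop is spelled out here only to make the steps visible.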

## Random Forest

<img src="https://s-media-cache-ak0.pinimg.com/originals/90/7b/2f/907b2f4ff7dbd48e235bd34f7670c87c.jpg" width=950>

## Random Forest
- Refinement of bagged trees
- At each tree split, a random sample of m features is drawn, and only those m features are considered for splitting.
- Each tree is trained on a bootstrap sample, so the same observation can be drawn multiple times
- Random forests try to improve on bagging by "de-correlating" the trees. Each tree has the same expectation.
- Trees are combined together to make predictions
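In scikit-learn these ideas map onto constructor arguments: `max_features` plays the role of m and `bootstrap=True` turns on sampling with replacement. The sketch below, on the iris data with arbitrary settings, also checks how the trees are combined: the forest's class probabilities are the average of the individual trees' probabilities.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=50,      # number of trees (arbitrary for this sketch)
    max_features="sqrt",  # m = sqrt(number of features) candidates per split
    bootstrap=True,       # each tree sees a bootstrap sample of the rows
    random_state=0,
).fit(X, y)

# The resolved value of m stored on each fitted tree
m = forest.estimators_[0].max_features_

# Average the per-tree class probabilities by hand
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
averaged = tree_probas.mean(axis=0)

print("features considered per split:", m)
print("matches forest.predict_proba:", np.allclose(averaged, forest.predict_proba(X)))
```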

## Random Forest Limits
The algorithm limits the variables that can be used to make a split in the decision tree to a random subset of the explanatory variables. Limiting the splits in this way helps avoid the pitfall of always splitting on the same variables, and it helps random forests create a wider variety of trees, which reduces overfitting (that is, it reduces variance; see the bias-variance tradeoff).
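One rough way to see the wider variety of trees is to compare which feature each tree uses for its very first split. The sketch below contrasts plain bagging (`max_features=None`, every feature available at every split) with a random forest (`max_features="sqrt"`); the helper `root_features` is a made-up name, and reading the root split from `tree_.feature[0]` is just one simple diagnostic among many.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

def root_features(max_features):
    """Fit a forest and return the set of feature indices used at the root of its trees."""
    forest = RandomForestClassifier(
        n_estimators=100, max_features=max_features, random_state=0
    ).fit(X, y)
    # Node 0 of each fitted tree is its root
    return {int(tree.tree_.feature[0]) for tree in forest.estimators_}

bagged_roots = root_features(None)    # plain bagging: all 4 features at every split
forest_roots = root_features("sqrt")  # random forest: a random 2 of 4 at every split

print("root features, bagged trees: ", sorted(bagged_roots))
print("root features, random forest:", sorted(forest_roots))
```

When the strongest features are sometimes withheld, other features get a chance to lead a tree, which is what makes the trees less alike.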

## Disadvantages of Random Forest

Random forests do a good job at classification but are weaker on regression problems, because their predictions are averages of training targets rather than a precise continuous function. In regression, a random forest cannot predict beyond the range of the target values in the training data, and it may overfit data sets that are particularly noisy.

A random forest can also feel like a black box to statistical modelers: you have very little control over what the model does. At best, you can try different parameters and random seeds.
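The range limit in regression is easy to demonstrate. The sketch below uses synthetic data invented for this example (a noiseless line, y = 2x, on x between 0 and 9.5) and asks the forest about points far outside that range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic training data: a perfect line on a limited range
X_train = np.arange(0, 10, 0.5).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Points well beyond the training range; the true line would give 40 and 200
X_new = np.array([[20.0], [100.0]])
pred = reg.predict(X_new)

print("largest training target:", y_train.max())
print("predictions at x=20 and x=100:", pred)
```

Each leaf predicts an average of training targets, so no prediction can exceed the largest target seen in training, and every point to the right of the data lands in the same leaves.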

### Documentation 
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

## Template for Random Forest

In [None]:
# Import the library
from sklearn.ensemble import RandomForestClassifier  # use RandomForestRegressor for regression problems
# Assumes you already have X_train (predictors) and y_train (target) for the training set, and X_test (predictors) for the test set
# Create the random forest object
model = RandomForestClassifier(n_estimators=1000)
# Train the model on the training set
model.fit(X_train, y_train)
# Predict the output for the test set
predicted = model.predict(X_test)

### Guided practice

In [3]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

In [22]:
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | is_train | species |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | False | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | True | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | True | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | True | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | False | setosa |


In [23]:
train, test = df[df['is_train']], df[~df['is_train']]

In [24]:
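features = df.columns[:4]  # the four measurement columns
clf = RandomForestClassifier(n_jobs=2)  # fit the trees using two parallel jobs
# Integer class codes; they line up with iris.target_names because the column was built from them
y = train['species'].cat.codes
clf.fit(train[features], y)

# Map predicted codes back to species names and cross-tabulate against the actual species
preds = iris.target_names[clf.predict(test[features])]
pd.crosstab(test['species'], preds, rownames=['actual'], colnames=['preds'])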

| actual \ preds | setosa | versicolor | virginica |
|---|---|---|---|
| setosa | 13 | 0 | 0 |
| versicolor | 0 | 18 | 1 |
| virginica | 0 | 2 | 15 |


A confusion matrix is a great way to see how effective a model is: each row is an actual class, each column is a predicted class, and the diagonal counts the correct predictions.
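scikit-learn can also build the same kind of table directly with `confusion_matrix`. The sketch below is one possible variant, assuming a stratified `train_test_split` in place of the random `is_train` column; the split size and seeds are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)

# Correct predictions sit on the diagonal, so accuracy is trace / total
accuracy = cm.trace() / cm.sum()

print(cm)
print("accuracy from the matrix:", accuracy)
```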

### Independent practice (5 mins)

Load a data set of your choice and fit a random forest classifier.
Then produce a classification report on the output.


#### Hint

In [2]:
from sklearn.metrics import classification_report
# y_test holds the true labels of your test set and predicted holds model.predict(X_test); define both first
print(classification_report(y_test, predicted))
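For the hint to run, the true test labels and the predictions both have to exist first. Here is a minimal end-to-end sketch on the iris data; the variable names, split size, and seeds are assumptions for illustration, and your own data set will need its own loading step.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0, stratify=iris.target
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
predicted = model.predict(X_test)

# Per-class precision, recall, and F1, labelled with the species names
report = classification_report(y_test, predicted, target_names=list(iris.target_names))
print(report)
```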

## Additional Resources
http://blog.yhat.com/posts/comparing-random-forests-in-python-and-r.html