# Random Forests
In this workbook, we will examine **random forests**.   As you might expect, random forests are a collection of **decision trees**.   More precisely, random forests are an ensemble of decision trees, designed to overcome two drawbacks of decision trees:
1.  Decision trees are prone to overfitting the data.  Now this is a limitation that can be overcome, as we saw in the previous example, but may be more difficult depending on sample size and the nature of the problem you are working on.
2.  Decision trees are sensitive to the details of the training data: small perturbations in the training sample can yield a very different tree.

The basic idea is very simple: instead of training a single tree, we will train a large number of trees and average the results.   But of course if we use the same data, we will get the same trees.   To overcome this, **random** forests introduce the following ideas:
1.  Randomize the training data.   The technique used is called **bootstrap aggregating** or **bagging** for short.   If we have **n** samples in our training data, we select **n** samples **with replacement** from that same dataset.   On average, each tree's sample contains about 63% of the distinct rows of the training data; the remaining slots are filled by duplicates.   The key idea is that each tree in the forest sees a different (but overlapping) subset of the data.
2.  Randomize the features used by each tree in the forest.  The random forests in scikit-learn consider only a random subset of the available features (by default sqrt(number of input features), truncated to an integer) when deciding which feature to use to split the data at each node in a given tree.
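
The 63% figure can be checked numerically. The sketch below draws one bootstrap sample of row indices with NumPy and counts how many distinct rows it contains. The sample size `n_samples`, the seed, and the variable names are illustrative assumptions, not taken from the pulsar data. It also evaluates the square-root rule for nine input features.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed, chosen only for reproducibility
n_samples = 10_000                # hypothetical training-set size

# Draw n_samples row indices with replacement: this is one bootstrap sample.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Fraction of distinct original rows that appear in the bootstrap sample.
unique_fraction = np.unique(bootstrap_idx).size / n_samples
print("Unique fraction in bootstrap sample:", round(unique_fraction, 3))

# Expected value: 1 - (1 - 1/n)^n, which approaches 1 - 1/e for large n.
expected_fraction = 1.0 - (1.0 - 1.0 / n_samples) ** n_samples
print("Expected fraction:                  ", round(expected_fraction, 3))

# Square-root rule for the number of features considered at each split.
n_features = 9
features_per_split = max(1, int(np.sqrt(n_features)))
print("Features considered per split:      ", features_per_split)
```

The rows left out of a given tree's bootstrap sample (roughly 37%) are that tree's "out-of-bag" rows.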

How are the results of all of these trees combined?   As we saw with our decision tree workbook, a decision tree can predict a class (0,1,2, etc), or a set of probabilities.   When combining an ensemble of trees we have two options to predict the result for a single sample:
1.  Hard voting: use the class predicted by most of the trees.
2.  Soft voting: average the probabilities, and use the class with the highest probability.   This often gives better performance by giving greater weight to more confident predictions.   Sklearn uses this method for random forests.
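
The two voting schemes can disagree. The sketch below uses made-up probabilities from three hypothetical trees for a single sample (the array `tree_probs` is an assumption for illustration only): two trees lean weakly toward class 0, while one tree is very confident in class 1.

```python
import numpy as np

# Hypothetical per-tree outputs for ONE sample.
# Each row is one tree; columns are [P(class 0), P(class 1)].
tree_probs = np.array([
    [0.55, 0.45],
    [0.55, 0.45],
    [0.05, 0.95],
])

# Hard voting: each tree votes for its most probable class, majority wins.
tree_votes = tree_probs.argmax(axis=1)
hard_vote = np.bincount(tree_votes, minlength=2).argmax()

# Soft voting: average the probabilities, then pick the most probable class.
mean_probs = tree_probs.mean(axis=0)
soft_vote = mean_probs.argmax()

print("Votes per tree:      ", tree_votes)
print("Hard-voting class:   ", hard_vote)
print("Mean probabilities:  ", mean_probs.round(3))
print("Soft-voting class:   ", soft_vote)
```

Here the hard vote picks class 0 (two votes to one), while the soft vote picks class 1 because the single confident tree outweighs the two weak ones.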

# Feature Importance
A very useful property of random forests is that they can be used to assess **feature importance**.   The idea is that since different trees use different subsets of available features at each split point within each tree, we have a mechanism for assessing how important a specific feature is for our model.   A feature's importance is related to how much that feature reduces impurity on average across all of the trees in the forest.
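
To make "reduces impurity" concrete, here is a minimal sketch of the impurity decrease for a single split, using the Gini impurity. The class counts are invented for illustration, and the calculation is simplified: a real forest also weights each split by the fraction of samples reaching that node before averaging over all trees.

```python
def gini(counts):
    """Gini impurity of a node, given the number of samples in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Hypothetical node with 50 background and 50 signal samples,
# split by some feature into two child nodes.
parent = [50, 50]
left = [40, 10]
right = [10, 40]

n_parent = sum(parent)
weighted_child_impurity = (
    sum(left) / n_parent * gini(left)
    + sum(right) / n_parent * gini(right)
)

# The impurity decrease credited to the feature used for this split.
impurity_decrease = gini(parent) - weighted_child_impurity

print("Parent impurity:         ", round(gini(parent), 3))
print("Weighted child impurity: ", round(weighted_child_impurity, 3))
print("Impurity decrease:       ", round(impurity_decrease, 3))
```

A feature that never produces a useful split, such as pure noise, accumulates little impurity decrease and so ends up with low importance.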

To help us test this idea, we will add a **random** feature to our dataset.   This feature should neither help nor hurt our classification, and when we check our feature importances later, we would expect that the random feature will be ranked very low in feature importance.


# Getting the data:
We will again use the pulsar data as we did for the decision tree workbook.   The data come from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/HTRU2

As usual the data is on github: 'https://raw.githubusercontent.com/big-data-analytics-physics/data/master/HTRU2/HTRU_2a.csv'

As noted above, we will add a separate column labeled "random" which is essentially a *fake* feature.

In [0]:
import pandas as pd
import numpy as np


#
# Read in the pulsar data
fname = 'https://raw.githubusercontent.com/big-data-analytics-physics/data/master/HTRU2/HTRU_2a.csv'
dfAll = pd.read_csv(fname)
#
# Add a feature called "random" with random numbers from 0-1
dfAll['random'] = np.random.random(size=len(dfAll))
print(dfAll.head(5))




## Reorder data columns
Our code below expects the **last** column to have the **class** label.  Since we added the **random** column this is no longer true, so in the code below we reorder the columns.
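
As an alternative to typing out every column name, the label column can be moved to the end programmatically. This is a minimal sketch on a toy dataframe; the names `dfDemo` and `label`, and the column `feature_a`, are assumptions for illustration and do not come from the pulsar file.

```python
import pandas as pd

# Toy dataframe where the label is no longer the last column.
dfDemo = pd.DataFrame({
    "feature_a": [1.0, 2.0],
    "class": [0, 1],
    "random": [0.3, 0.7],
})

label = "class"
# Keep every non-label column in its current order, then put the label last.
ordered = [c for c in dfDemo.columns if c != label] + [label]
dfDemo = dfDemo[ordered]

# Everything except the final column is a feature.
num_features_demo = len(ordered) - 1
print(list(dfDemo.columns), num_features_demo)
```

Deriving the feature count from the column list avoids having to update a hard-coded number whenever a feature is added or removed.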

In [0]:
dfAll = dfAll[['Profile_mean','Profile_stdev','Profile_skewness','Profile_kurtosis','DM_mean',
        'DM_stdev','DM_skewness','DM_kurtosis','random','class']]
num_features = 9
print(dfAll.head(5))

# Defining Signal
The **class** variable distinguishes signal from background.   As usual,  **1** is signal (pulsars) and **0** is background.   We make sure the sample is balanced so we have equal numbers of pulsars and non-pulsars.

In [0]:
#
# The data already has a 0/1 class variable that defines signal (1) and background (0)
#
# The data is already combined but it will be useful to split it so we can look at 
# signal and background separately.
dfA = dfAll[dfAll['class']==1]
dfB = dfAll[dfAll['class']==0]

print("Length of signal sample:     ",len(dfA))
print("Length of background sample: ",len(dfB))

#
# Shuffle the data here
from sklearn.utils import shuffle
dfBShuffle = shuffle(dfB)
#
# To use the full (unbalanced) background sample, uncomment the next line instead
#dfB_use = dfBShuffle
# Default: limit dfB to the same length as dfA so the sample is balanced
dfB_use = dfBShuffle.head(len(dfA))


dfCombined = dfB_use
dfCombined = pd.concat([dfCombined, dfA])
dfCombined = shuffle(dfCombined)

print("Size of signal sample ",len(dfA))
print("Size of background sample ",len(dfB_use))
print("Size of combined sample ",len(dfCombined))



# Some useful methods
We have our usual helpful methods of autovivify and enable_plotly_in_cell, as well as getDecisionTreeGraphic, which we can use to visualize an individual tree within a forest.
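
As a quick reminder of what autovivify gives us, the sketch below uses a two-level counter as a confusion matrix without initializing any keys first. The function is repeated here so the example is self-contained; the label lists are made up for illustration.

```python
from collections import defaultdict

def autovivify(levels=1, final=dict):
    # Nested defaultdicts: missing keys are created on first access.
    return (defaultdict(final) if levels < 2 else
            defaultdict(lambda: autovivify(levels - 1, final)))

# Two-level counter indexed as counts[trueClass][predClass].
counts = autovivify(2, int)

y_true_demo = [1, 0, 1, 1, 0]   # made-up true labels
y_pred_demo = [1, 0, 0, 1, 1]   # made-up predictions

for t, p in zip(y_true_demo, y_pred_demo):
    counts[t][p] += 1

print("TP:", counts[1][1], "FN:", counts[1][0],
      "FP:", counts[0][1], "TN:", counts[0][0])
```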

In [0]:
#
# This allows multidimensional counters (and other more complicated structures!)
from collections import defaultdict
def autovivify(levels=1, final=dict):
    return (defaultdict(final) if levels < 2 else
            defaultdict(lambda: autovivify(levels-1, final)))
  
def enable_plotly_in_cell():
  # Load require.js so that plotly figures render inside the notebook cell
  from IPython.display import display, HTML
  from plotly.offline import init_notebook_mode
  display(HTML('''
        <script src="/static/components/requirejs/require.js"></script>
  '''))
  init_notebook_mode(connected=False)
  
def getDecisionTreeGraphic(estimator,feature_names,class_names):
  # Render a single fitted tree as a graphviz object
  from sklearn import tree
  import graphviz
  import pydotplus

  # Export the tree in DOT format (out_file=None returns the DOT source as a string)
  out = tree.export_graphviz(estimator,out_file=None,
                           feature_names=feature_names,
                           class_names=class_names,
                           filled=True, rounded=True,
                           special_characters=True,
                           node_ids=True)
  # Constrain the rendered size, then hand the DOT source to graphviz
  pydot_graph = pydotplus.graph_from_dot_data(out)
  pydot_graph.set_size('"7,7!"')
  graph = graphviz.Source(pydot_graph.to_string())
  return graph


# Performance Method

In [0]:
#
# Determine the performance
def binaryPerformance(y,y_pred,y_score,debug=False):
#
# Assuming a binary classifier with 1=signal, 0=background
  confusionMatrix = autovivify(2,int)
  for i in range(len(y_pred)):
    trueClass = y[i]
    predClass = y_pred[i]
    confusionMatrix[trueClass][predClass] += 1

  if debug:
    for trueClass in range(2):
      print("True: ",trueClass,end="")
      for predClass in range(2):
        print("\t",confusionMatrix[trueClass][predClass],end="")
      print()
    print()
  TP = confusionMatrix[1][1]
  FP = confusionMatrix[0][1]
  FN = confusionMatrix[1][0]
  TN = confusionMatrix[0][0]
  
  if debug:
    print("TP predicted true, actually true   ",TP)
    print("FP predicted true, actually false  ",FP)
    print("TN predicted false, actually false ",TN)
    print("FN predicted false, actually true  ",FN)


  precision = TP / (TP + FP)
  recall = TP / (TP + FN)
  f1_score = 2.0 / ( (1.0/precision) + (1.0/recall) )
  
  if debug:
    print("Precision = TP/(TP+FP) = fraction of predicted true actually true ",precision)
    print("Recall = TP/(TP+FN) = fraction of true class predicted to be true ",recall)
    print("F1 score = ",f1_score)

  #
  # Get the ROC curve.  We will use the sklearn function to do this
  from sklearn import metrics
  #fpr_train, tpr_train, thresholds_train = metrics.roc_curve(y_train, y_train_score, pos_label=1)
  fpr, tpr, thresholds = metrics.roc_curve(y, y_score, pos_label=1)
  #
  # Get the auc
  auc = metrics.roc_auc_score(y, y_score)
  if debug:
    print("AUC this sample: ",auc)
  
  return precision,recall,auc,fpr, tpr, thresholds

# The runFitter Method
We will use the same form of the runFitter method we used in our decision tree workbook.

In [0]:

def runFitter(estimator,X_train,y_train,X_test,y_test,debug=False):
#
# Now fit to our training set
  estimator.fit(X_train,y_train)
#
# Now predict the classes and get the score for our training set
  y_train_pred = estimator.predict(X_train)
  y_train_score = estimator.predict_proba(X_train)[:,1]   # NOTE: some estimators provide predict_proba, others decision_function
#
# Now predict the classes and get the score for our test set
  y_test_pred = estimator.predict(X_test)
  y_test_score = estimator.predict_proba(X_test)[:,1]

#
# Now get the performance
  precision_test,recall_test,auc_test,fpr_test, tpr_test, thresholds_test\
    = binaryPerformance(y_test,y_test_pred,y_test_score,debug)
  precision_train,recall_train,auc_train,fpr_train, tpr_train, thresholds_train\
    = binaryPerformance(y_train,y_train_pred,y_train_score,debug)
#
# Decide what you want to return: for now, just precision, recall, and auc for both test and train
  results = {
      'precision_train':precision_train,
      'recall_train':recall_train,
      'auc_train':auc_train,
      'fpr_train':fpr_train, 
      'tpr_train':tpr_train, 
      'thresholds_train':thresholds_train,
      'precision_test':precision_test,
      'recall_test':recall_test,
      'auc_test':auc_test,
      'fpr_test':fpr_test, 
      'tpr_test':tpr_test, 
      'thresholds_test':thresholds_test}

  return results
  

# Prepare the data
As usual we shuffle it first and then dump the dataframe into an **X** features numpy array and a **y** labels numpy array.

In [0]:
from sklearn.utils import shuffle
dfCombinedShuffle = shuffle(dfCombined,random_state=42)    # by setting the random state we will get reproducible results

# as_matrix() was removed from pandas; select the feature columns and convert with to_numpy()
X = dfCombinedShuffle.iloc[:, :num_features].to_numpy()
y = dfCombinedShuffle['class'].values

# Training a single random forest, with k-fold validation.
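Let's train a single random forest estimator, including cross validation.   But first we have to define the settings for the random forest estimator.   Here is the class signature from sklearn as of the version this workbook was written for (some defaults, such as **n_estimators**, have changed in later releases):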

    class sklearn.ensemble.RandomForestClassifier(n_estimators='warn', criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None)
    
You will notice many parameters which are there to control the individual trees in the forest, such as **max_depth**,  **max_leaf_nodes** etc.   These serve exactly the same purpose as described in the decision tree workbook.  We will only modify one of these - the **max_depth**  of the trees.   

The other primary parameter is **n_estimators** which is simply the number of trees in the forest.   We will typically set this to 100, but forests with 500, 1000, or even 5000 trees are not uncommon.   The bigger the forest the longer it will take to train.   This can be parallelized, which is what **n_jobs** is for.   For now, we will leave this and all of the other parameters at their default values, unless specified otherwise.
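
To see these parameters in action without touching the pulsar data, the sketch below fits a forest on a small synthetic dataset from `make_classification`. The dataset, its size, and the names `X_demo`, `y_demo`, and `forest` are assumptions for illustration. Setting `n_jobs=-1` asks scikit-learn to use all available cores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 500 samples, 9 features (not the pulsar data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=9, n_informative=4, random_state=0
)

# 100 trees, each limited to depth 6, trained in parallel on all cores.
forest = RandomForestClassifier(
    n_estimators=100, max_depth=6, n_jobs=-1, random_state=42
)
forest.fit(X_demo, y_demo)

# The fitted trees are available as a list of DecisionTreeClassifier objects.
n_trees = len(forest.estimators_)
deepest = max(tree.get_depth() for tree in forest.estimators_)
print("Number of trees:", n_trees)
print("Deepest tree:   ", deepest)
```

Any individual tree in `forest.estimators_` can be inspected in the same way as a standalone decision tree.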

In [0]:
from sklearn.model_selection import StratifiedKFold
kfolds = 5
skf = StratifiedKFold(n_splits=kfolds)

In [0]:
from sklearn.ensemble import RandomForestClassifier
estimator = RandomForestClassifier(n_estimators=100, max_depth=6,random_state=42)

avg_precision_train = 0.0
avg_recall_train = 0.0
avg_auc_train = 0.0
avg_precision_test = 0.0
avg_recall_test = 0.0
avg_auc_test = 0.0
numSplits = 0.0
#
# Now loop
for train_index, test_index in skf.split(X, y):
  print("Training")
  numSplits += 1
  X_train = X[train_index]
  y_train = y[train_index]
  X_test = X[test_index]
  y_test = y[test_index]
  
#
# Now fit to our training set
  results = runFitter(estimator,X_train,y_train,X_test,y_test,debug=False)

  avg_precision_train += results['precision_train']
  avg_recall_train += results['recall_train']
  avg_auc_train += results['auc_train']
#
  avg_precision_test += results['precision_test']
  avg_recall_test += results['recall_test']
  avg_auc_test += results['auc_test']
#
avg_precision_train /= numSplits
avg_recall_train /= numSplits
avg_auc_train /= numSplits
avg_precision_test /= numSplits
avg_recall_test /= numSplits
avg_auc_test /= numSplits
# 
# Now print
print("Precision train/test ",round(avg_precision_train,3),round(avg_precision_test,3))
print("Recall train/test    ",round(avg_recall_train,3),round(avg_recall_test,3))
print("AUC train/test       ",round(avg_auc_train,3),round(avg_auc_test,3))

# Overfitting vs Underfitting
As with decision trees, we would like to make our parameter choices such that we are neither significantly over- nor under-fitting our training data.   

To keep things simple, we will again vary only one parameter - the max_depth of the trees in the forest.   To choose the optimal point, we will plot both training and testing error vs max_depth.   

We will again use 1-precision, 1-recall, and 1-AUC, as a measure of the **error** in our models.
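
Besides reading the optimum off a plot, it can be picked out of an error table directly. The sketch below does this for a small table with the same column names we use for the AUC error; the numbers are invented for illustration and are not results from the pulsar data.

```python
import pandas as pd

# Made-up error values, for illustration only.
dfDemoError = pd.DataFrame({
    "max_depth":      [1, 2, 4, 8, 16],
    "trainError_auc": [0.060, 0.040, 0.025, 0.010, 0.001],
    "testError_auc":  [0.062, 0.043, 0.030, 0.033, 0.038],
})

# The depth with the smallest test error is a natural candidate.
best_row = dfDemoError.loc[dfDemoError["testError_auc"].idxmin()]
best_depth = int(best_row["max_depth"])

# A growing gap between test and train error is the signature of overfitting.
dfDemoError["gap"] = dfDemoError["testError_auc"] - dfDemoError["trainError_auc"]

print("Depth with lowest test error:", best_depth)
print(dfDemoError)
```

In this toy table the training error keeps falling with depth, while the test error bottoms out at depth 4 and then rises again.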

In [0]:
#
# Get our estimator and predict
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
#
# Create a dataframe to store our results
dfError = pd.DataFrame(columns=['max_depth','trainError_pre','testError_pre',
                                    'trainError_rec','testError_rec',
                                    'trainError_auc','testError_auc'])

for max_depth in range(1,20):
  print("training with max depth =",max_depth)
  estimator = RandomForestClassifier(n_estimators=100,random_state=42,max_depth=max_depth)
  avg_precision_train = 0.0
  avg_recall_train = 0.0
  avg_auc_train = 0.0
  avg_precision_test = 0.0
  avg_recall_test = 0.0
  avg_auc_test = 0.0
  numSplits = 0.0
#
# Now loop
  for train_index, test_index in skf.split(X, y):
    numSplits += 1
    X_train = X[train_index]
    y_train = y[train_index]
    X_test = X[test_index]
    y_test = y[test_index]

#
# Now fit to our training set
    results = runFitter(estimator,X_train,y_train,X_test,y_test,debug=False)
    #print("results",results)

    avg_precision_train += results['precision_train']
    avg_recall_train += results['recall_train']
    avg_auc_train += results['auc_train']
#
    avg_precision_test += results['precision_test']
    avg_recall_test += results['recall_test']
    avg_auc_test += results['auc_test']
#
  avg_precision_train /= numSplits
  avg_recall_train /= numSplits
  avg_auc_train /= numSplits
  avg_precision_test /= numSplits
  avg_recall_test /= numSplits
  avg_auc_test /= numSplits
#
# Fill dataframe
# (DataFrame.append was removed in pandas 2.0, so we add the new row by label instead)
  dfError.loc[len(dfError)] = pd.Series({
     'max_depth':max_depth,
     'trainError_pre':1.0-avg_precision_train,'testError_pre':1.0-avg_precision_test,
     'trainError_rec':1.0-avg_recall_train,'testError_rec':1.0-avg_recall_test,
     'trainError_auc':1.0-avg_auc_train,'testError_auc':1.0-avg_auc_test
      })
# 
# Now print
print(dfError)

# Plotting results

Let's look at the error measures:
1.  train/test Error using precision: remember that **precision** = TP/(TP+FP) = "Total actual positive found as positive" / "Total our model identified as positive".   It is the fraction of identified positives that are truly positive.   **(1-precision)** is then the **error** - the fraction of our identified positives that are incorrect.
2.  train/test Error using recall: remember that **recall** = TP/(TP+FN) = "Total actual positive found as positive" / "Total actual positive".   It is the fraction of actual positives that our model managed to identify.   **(1-recall)** is then the **error** - the fraction of actual positives that we failed to identify.
3.  train/test Error using AUC: remember that AUC measures the probability that a randomly chosen positive example is properly ranked above a randomly chosen negative example.   **(1-AUC)** is then the probability that we will fail to do this.
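
The three definitions above can be worked through by hand on a tiny example. The labels, scores, and the 0.5 threshold below are made up for illustration. The AUC is computed directly from its ranking interpretation, by counting positive/negative pairs in which the positive example has the higher score.

```python
# Made-up labels and classifier scores for four samples.
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

# Turn scores into class predictions with a threshold of 0.5.
threshold = 0.5
y_pred_demo = [1 if s >= threshold else 0 for s in y_score_demo]

pairs = list(zip(y_true_demo, y_pred_demo))
TP = sum(1 for t, p in pairs if t == 1 and p == 1)
FP = sum(1 for t, p in pairs if t == 0 and p == 1)
FN = sum(1 for t, p in pairs if t == 1 and p == 0)

precision = TP / (TP + FP)
recall = TP / (TP + FN)

# AUC as a ranking probability: fraction of (positive, negative) pairs
# where the positive sample scores higher (ties count as half).
pos = [s for t, s in zip(y_true_demo, y_score_demo) if t == 1]
neg = [s for t, s in zip(y_true_demo, y_score_demo) if t == 0]
ranked = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
auc = ranked / (len(pos) * len(neg))

print("Error (precision):", 1.0 - precision)
print("Error (recall):   ", 1.0 - recall)
print("Error (AUC):      ", 1.0 - auc)
```

Note that precision and recall depend on the chosen threshold, while the AUC depends only on how the scores are ordered.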

In [0]:
# (the online-only plotly.plotly module is not needed for offline plots, so it is not imported)
import numpy as np
from plotly.offline import iplot
import plotly.graph_objs as go

enable_plotly_in_cell()
# Create a trace
trace1 = go.Scatter(
    x = dfError['max_depth'],
    y = dfError['trainError_pre'],
    mode = 'lines',
    name = "Training error"
)
# Create a trace
trace2 = go.Scatter(
    x = dfError['max_depth'],
    y = dfError['testError_pre'],
    mode = 'lines',
    name = "Testing Error"
)

layout = dict(
    title='Error (Precision) vs Model Complexity',
    xaxis=dict(title='max_depth'),
    yaxis=dict(title='Error (fraction)')
)

data = [trace1, trace2]
iplot(dict(data=data,layout=layout),validate=False)

# Create a trace
trace1 = go.Scatter(
    x = dfError['max_depth'],
    y = dfError['trainError_rec'],
    mode = 'lines',
    name = "Training error"
)
# Create a trace
trace2 = go.Scatter(
    x = dfError['max_depth'],
    y = dfError['testError_rec'],
    mode = 'lines',
    name = "Testing Error"
)

layout = dict(
    title='Error (Recall) vs Model Complexity',
    xaxis=dict(title='max_depth'),
    yaxis=dict(title='Error (fraction)')
)

data = [trace1, trace2]
iplot(dict(data=data,layout=layout),validate=False)
# Create a trace
trace1 = go.Scatter(
    x = dfError['max_depth'],
    y = dfError['trainError_auc'],
    mode = 'line',
    name = "Training error"
)
# Create a trace
trace2 = go.Scatter(
    x = dfError['max_depth'],
    y = dfError['testError_auc'],
    mode = 'line',
    name = "Testing Error"
)

layout = dict(
    title='Error (AUC) vs Model Complexity',
    xaxis=dict(title='max_depth'),
    yaxis=dict(title='Error (fraction)')
)

data = [trace1, trace2]
iplot(dict(data=data,layout=layout),validate=False)


# Feature Importance (1)
A very useful property of random forests is that they can be used to assess **feature importance**.   To get this from a trained random forest, we need to first train a single random forest, using our optimal settings for our hyperparameters.

Note that in the code below:
1.  We call runFitter once.
2.  We feed runFitter the same X,y arrays for both training and testing (since we don't really care about the testing set anymore).
3.  Remember that the expected performance is **not** the result printed out below, but the averaged result from k-fold validation from above (at our optimal hyperparameter settings).
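
The estimator in the cell below is created with `oob_score=True`. With that setting, each sample is scored using only the trees whose bootstrap sample did not contain it, which gives a rough performance estimate even when the forest is trained on all of the data. The sketch below illustrates this on synthetic data; the dataset and the names `X_demo`, `y_demo`, and `forest` are assumptions, not the pulsar sample.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the pulsar data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=9, n_informative=4, random_state=1
)

forest = RandomForestClassifier(
    n_estimators=100, max_depth=6, oob_score=True, random_state=42
)
forest.fit(X_demo, y_demo)

# Accuracy on the data the forest was trained on (optimistic).
train_accuracy = forest.score(X_demo, y_demo)
# Out-of-bag accuracy: each row is scored only by trees that never saw it.
oob_accuracy = forest.oob_score_

print("Training accuracy:   ", round(train_accuracy, 3))
print("Out-of-bag accuracy: ", round(oob_accuracy, 3))
```

The out-of-bag number is an accuracy, so it is not directly comparable to the precision, recall, and AUC values from k-fold validation, but it serves as a sanity check.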


In [0]:
from sklearn.utils import shuffle
dfCombinedShuffle = shuffle(dfCombined,random_state=42)    # by setting the random state we will get reproducible results

# as_matrix() was removed from pandas; select the feature columns and convert with to_numpy()
X = dfCombinedShuffle.iloc[:, :num_features].to_numpy()
y = dfCombinedShuffle['class'].values

from sklearn.ensemble import RandomForestClassifier
estimator = RandomForestClassifier(n_estimators=100, max_depth=6,random_state=42,oob_score=True)

results = runFitter(estimator,X,y,X,y,debug=False)

print("Precision:",results['precision_train'])
print("Recall:   ",results['recall_train'])
print("AUC:      ",results['auc_train'])


# Feature Importance (2)
Now that we have trained our random forest estimator, we can get the list of feature importances by accessing the **attribute** of our estimator like this:
    * estimator.feature_importances_

This is a numpy **array**, ordered exactly as the features that we fed to the estimator - so it has the same order as **dfCombinedShuffle.columns** (up to but not including the **class** column).

The code below shows how to access the feature importance, as well as how to order it from maximum to minimum.

There is a very interesting discussion of issues around the calculation of feature importances here:  https://explained.ai/rf-importance/index.html
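
The article linked above discusses permutation importance as an alternative to the impurity-based values. As a hedged sketch of that idea, the code below compares the two on synthetic data with an added noise column, using scikit-learn's `permutation_importance` on a held-out set. The dataset, the feature names `f0`, `f1`, `f2`, and the train/test split are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Three informative synthetic features plus one pure-noise column.
X_info, y_demo = make_classification(
    n_samples=1000, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
rng = np.random.default_rng(0)
X_demo = np.column_stack([X_info, rng.random(len(X_info))])
names = ["f0", "f1", "f2", "random"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)
forest = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42)
forest.fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the score drop.
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=0)

for name, imp_gini, imp_perm in zip(
    names, forest.feature_importances_, result.importances_mean
):
    print(name, "impurity:", round(imp_gini, 3), "permutation:", round(imp_perm, 3))
```

The impurity-based importances always sum to one, whereas a permutation importance near zero (or slightly negative) indicates that shuffling the feature did not hurt the held-out score.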

In [0]:
importanceByName = {}
print("unsorted importance")
for name,importance in zip(dfCombinedShuffle.columns[:num_features],estimator.feature_importances_):
  importanceByName[name] = importance
  print("Name,importance",name,round(importance,3))
#
# Now sort and print
print()
print("Sorted importance")
for name in sorted(importanceByName, key=importanceByName.get, reverse=True):
  print("Name,importance",name,round(importanceByName[name],3))


# Additional things to try:
Here are some things to explore to make sure you understand how things work:
1.  You can add a second random variable to the features.   Do both random features end up at the bottom of the importances?
2.  You can remove the random variable and make sure your results are unaffected.
3.  You can try different numbers of trees: n_estimators=10, or n_estimators=200.   How much do your results change?