## If a tree falls in the forest...

Now that you've learned about random forests and decision trees let's do an exercise in accuracy. You know that random forests are basically a collection of decision trees. But how do the accuracies of the two models compare?

So here's what you should do. Pick a dataset. It could be one you've worked with before or it could be a new one. Then build the best decision tree you can.

Now try to match that with the simplest random forest you can. For our purposes measure simplicity with runtime. Compare that to the runtime of the decision tree. This is imperfect but just go with it.

Hopefully out of this you'll see the power of random forests, but also their potential costs. Remember, in the real world you won't necessarily be dealing with thousands of rows. It could be millions, billions, or even more.

In [1]:
import math
import warnings

from IPython.display import display
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import linear_model
from scipy import stats
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format

# Suppress annoying harmless error.
warnings.filterwarnings(
    action="ignore",
    module="scipy",
    message="^internal gelsd"
)

In [2]:
full_data = pd.read_csv('../data/chr/County_Health_Rankings.csv',index_col=0)

In [3]:
full_data.head()

Unnamed: 0_level_0,State,County,Life Expectancy,Age-Adjusted Mortality,Child Mortality Rate,Infant Mortality Rate,% Frequent Physical Distress,% Frequent Mental Distress,% Diabetic,HIV Prevalence Rate,% Food Insecure,% Limited Food Access,Drug Overdose Mortality Rate,Motor Vehicle Mortality Rate,% Insufficient Sleep,% Uninsured Adults,% Uninsured Children,% Disconnected Youth,Household Income,% Free or Reduced Lunch,Homicide Rate,Firearm Fatalities Rate,% Homeowners,% Severe Housing Cost Burden,Population,% < 18,% 65 and over,% African American,% American Indian/Alaskan Native,% Asian,% Native Hawaiian/Other Pacific Islander,% Hispanic,% Non-Hispanic White,% Not Proficient in English,% Female,% Rural,Premature Death Years,% Fair/Poor Health,Physically Unhealthy Days,Mentally Unhealthy Days,% Low Birth Wt,% Smokers,% Obese,Food Environment Index,% Physically Inactive,% With Access to Exercise,% Excessive Drinking,% Alcohol-Impaired Deaths,Chlamydia Rate,Teen Birth Rate,% Uninsured,PCP Ratio,Dentist Ratio,MHP Ratio,Preventable Hosp. Rate,% Mammograph Screened,% Flu Vaccinated,HS Graduation Rate,% Some College,% Unemployed,% Children in Poverty,80/20 Income Ratio,% Single-Parent Households,Social Association Rate,Violent Crime Rate,Injury Death Rate,Average Daily Particulate Matter 2.5,Presence of drinking water violation,% Severe Housing Problems,% Drive Alone,% Long Commute - Drives Alone
FIPS,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1
1001,Alabama,Autauga,76.3,439.0,53.0,8.0,13,13,14,226.0,13,12.0,10.0,20.0,36,11.0,2.0,8.0,58343.0,48.0,5.0,18.0,73,13.0,55504,23.9,15.1,19.3,0.5,1.3,0.1,2.9,74.5,1,51.3,42.0,0.159,18,4.2,4.3,8.0,19,38,7.2,31,69.0,17,29.0,341.2,27.0,9.0,2409,3084,6167,6599.0,44.0,41.0,90.0,61,3.9,19.0,4.6,25.0,12.6,272.0,74.0,11.7,No,15,86,38
1003,Alabama,Baldwin,78.6,348.0,47.0,6.0,13,13,11,164.0,12,5.0,16.0,15.0,33,14.0,3.0,8.0,56607.0,45.0,3.0,14.0,73,13.0,212628,21.8,19.9,9.0,0.8,1.2,0.1,4.6,83.0,0,51.5,42.3,0.034,18,4.1,4.2,8.0,17,31,8.0,24,72.0,17,32.0,338.8,30.0,11.0,1372,2006,1096,3833.0,45.0,45.0,86.0,66,4.0,15.0,4.5,25.0,10.7,204.0,69.0,10.3,Yes,14,85,41
1005,Alabama,Barbour,75.8,470.0,77.0,,16,15,18,436.0,23,11.0,,21.0,39,17.0,3.0,12.0,32490.0,74.0,7.0,15.0,63,14.0,25270,20.8,18.8,47.9,0.7,0.5,0.2,4.2,46.0,1,47.2,67.8,0.379,26,5.1,4.6,11.0,22,44,5.6,28,54.0,13,30.0,557.9,45.0,13.0,2597,2808,12635,4736.0,46.0,37.0,81.0,37,5.9,50.0,5.8,57.0,8.5,414.0,73.0,11.5,No,15,83,34
1007,Alabama,Bibb,73.9,564.0,112.0,15.0,13,13,15,192.0,16,3.0,22.0,25.0,38,12.0,3.0,,45795.0,65.0,8.0,21.0,75,9.0,22668,20.6,16.0,21.5,0.4,0.2,0.1,2.6,74.3,0,46.5,68.4,0.52,20,4.4,4.3,11.0,20,38,7.6,35,16.0,16,27.0,302.1,45.0,10.0,1742,3778,11334,5998.0,44.0,39.0,84.0,48,4.4,27.0,4.3,30.0,10.2,89.0,100.0,11.2,No,11,86,49
1009,Alabama,Blount,74.6,502.0,76.0,6.0,14,14,14,95.0,11,3.0,25.0,26.0,36,16.0,3.0,15.0,48253.0,53.0,7.0,20.0,79,8.0,58013,23.3,17.8,1.5,0.6,0.3,0.1,9.6,86.9,2,50.7,90.0,0.188,21,4.5,4.7,8.0,20,34,8.5,29,23.0,15,22.0,114.3,36.0,12.0,4439,4834,9669,4162.0,36.0,38.0,93.0,54,4.0,19.0,4.1,30.0,9.0,483.0,105.0,11.7,No,10,87,60


In [4]:
# Select medically-related fields
data = pd.DataFrame(index=full_data.index)
data['Life Expectancy'] = full_data['Life Expectancy']
data['Physical Distress Pct'] = full_data['% Frequent Physical Distress']
data['Mental Distress Pct'] = full_data['% Frequent Mental Distress']
data['Diabetic Pct'] = full_data['% Diabetic']
data['HIV Rate'] = full_data['HIV Prevalence Rate']
data['Food Insecure Pct'] = full_data['% Food Insecure']
data['Insufficient Sleep Pct'] = full_data['% Insufficient Sleep']
data['Houshold Income'] = full_data['Household Income']
data['Youth Pct'] = full_data['% < 18']
data['Elderly Pct'] = full_data['% 65 and over']
data['Female Pct'] = full_data['% Female']
data['Poor Health Pct'] = full_data['% Fair/Poor Health']
data['Physically Unhealthy Days'] = full_data['Physically Unhealthy Days']
data['Mentally Unhealthy Days'] = full_data['Mentally Unhealthy Days']
data['Low Birth Weight Pct'] = full_data['% Low Birth Wt']
data['Smoker Pct'] = full_data['% Smokers']
data['Obesity Pct'] = full_data['% Obese']
data['Inactive Pct'] = full_data['% Physically Inactive']
data['Exercise Availability Pct'] = full_data['% With Access to Exercise']
data['Excess Drinker Pct'] = full_data['% Excessive Drinking']
data['Chlamydia Rate'] = full_data['Chlamydia Rate']
data['Teen Birth Rate'] = full_data['Teen Birth Rate']
data['PCP Ratio'] = full_data['PCP Ratio']
data['Dentist Ratio'] = full_data['Dentist Ratio']
data['MHP Ratio'] = full_data['MHP Ratio']
data['Mammograph Pct'] = full_data.iloc[:,[-16]] #full_data['% Mammograph Screened']
data['Flu Vaccinated Pct'] = full_data['% Flu Vaccinated']

In [5]:
# Check whether data cleaning is needed
data.dtypes

Life Expectancy              float64
Physical Distress Pct          int64
Mental Distress Pct            int64
Diabetic Pct                   int64
HIV Rate                     float64
Food Insecure Pct              int64
Insufficient Sleep Pct         int64
Houshold Income              float64
Youth Pct                    float64
Elderly Pct                  float64
Female Pct                   float64
Poor Health Pct                int64
Physically Unhealthy Days    float64
Mentally Unhealthy Days      float64
Low Birth Weight Pct         float64
Smoker Pct                     int64
Obesity Pct                    int64
Inactive Pct                   int64
Exercise Availability Pct    float64
Excess Drinker Pct             int64
Chlamydia Rate               float64
Teen Birth Rate              float64
PCP Ratio                     object
Dentist Ratio                 object
MHP Ratio                     object
Mammograph Pct               float64
Flu Vaccinated Pct           float64
d

In [6]:
# Handle non-numeric columns
data['PCP Ratio'] = pd.to_numeric(data['PCP Ratio'], errors='coerce')
data['Dentist Ratio'] = pd.to_numeric(data['Dentist Ratio'], errors='coerce')
data['MHP Ratio'] = pd.to_numeric(data['MHP Ratio'], errors='coerce')

In [7]:
# Use above/below average life expectancy as classifier
X = data.dropna()
Y = (X['Life Expectancy'] > X['Life Expectancy'].mean())*1
X = X.drop('Life Expectancy', 1)

In [8]:
# Decision Tree model
from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

decision_tree = tree.DecisionTreeClassifier(
    criterion='entropy',
    random_state = 1337
)
decision_tree.fit(X, Y)

# Use grid search to find best max_features and max_depth based on accuracy
grid_values = {'max_features': range(1,15),'max_depth':range(1,8)}
grid_dt = GridSearchCV(decision_tree, param_grid = grid_values,scoring = 'accuracy',cv=5,iid=True)
grid_dt.fit(X, Y)

print('Best Decision Tree parameters: ' + str(grid_dt.best_params_))

Best Decision Tree parameters: {'max_depth': 5, 'max_features': 13}


In [9]:
# Re-run Decision Tree model using best grid search parameters
import time
start_time = time.time()

decision_tree = tree.DecisionTreeClassifier(
    criterion='entropy',
    max_features=5,
    max_depth=13
)
decision_tree.fit(X, Y)

cv_score = cross_val_score(decision_tree, X, Y, cv=10)

print("Decision Tree Accuracy: %0.2f (+/- %0.2f)" % (cv_score.mean(), cv_score.std() * 2))
print("Decision Tree Runtime: %s seconds" % (time.time() - start_time))

Decision Tree Accuracy: 0.77 (+/- 0.14)
Decision Tree Runtime: 0.12189459800720215 seconds


In [10]:
# Random Forest model
from sklearn import ensemble

rfc = ensemble.RandomForestClassifier()
X = pd.get_dummies(X)
X = X.dropna(axis=1)

# Use grid search to find best max_features and max_depth based on accuracy
#grid_values = {'n_estimators': range(5,25)}
grid_values = {'max_features': range(1,15),'max_depth':range(1,8)}
grid_rfc = GridSearchCV(rfc, param_grid = grid_values,scoring = 'accuracy',cv=5,iid=True)
grid_rfc.fit(X, Y)

print('Best Random Forest parameters: ' + str(grid_rfc.best_params_))

Best Random Forest parameters: {'max_depth': 6, 'max_features': 4}


In [11]:
# Re-run Random Forest model using best grid search parameters
start_time = time.time()

rfc = ensemble.RandomForestClassifier(max_features=7,max_depth=3)
X = pd.get_dummies(X)
X = X.dropna(axis=1)

cv_score = cross_val_score(rfc, X, Y, cv=10)
print("Random Forest Accuracy: %0.2f (+/- %0.2f)" % (cv_score.mean(), cv_score.std() * 2))
print("Random Forest Runtime: %s seconds" % (time.time() - start_time))

Random Forest Accuracy: 0.83 (+/- 0.09)
Random Forest Runtime: 0.2343435287475586 seconds


The optimized random forest model has a slightly higher accuracy than the decision tree model but took almost three times as long to run. This data set is fairly small so this difference is not significant but on a larger data set, this could potentially cause computing resource constraints. 