## Challenge: If a tree falls in the forest...

Comparing Decision Tree and Random Forest algorithms by runtime.



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn import tree
from sklearn import ensemble
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
%matplotlib inline

In [2]:
child_care = pd.read_csv(r'C:\Users\Omistaja\nys-children-in-foster-care-annually\children-in-foster-care-annually-beginning-1994.csv').dropna()

The purpose of this data set is to provide information on the total number of admissions and discharges of children in foster care, the type of care they receive, and other related metrics. The data set can be downloaded here: https://www.kaggle.com/new-york-state/nys-children-in-foster-care-annually

For this exercise we will try to predict Discharges using a regressor.

In [3]:
child_care.head()

Unnamed: 0,County,Year,Adoptive Home,Agency Operated Boarding Home,Approved Relative Home,Foster Boarding Home,Group Home,Group Residence,Institution,Supervised Independent Living,Other,Total Days In Care,Admissions,Discharges,Children In Care,Number of Children Served,Indicated CPS Reports
0,ALBANY,2017,0,965,2598,48637,5207,1488,16017,692,232,75836,158,130.0,199,402,602.0
1,ALLEGANY,2017,0,585,5596,11320,0,0,2615,285,0,20401,22,46.0,58,77,89.0
2,BROOME,2017,0,393,6171,53256,8419,451,13870,3828,468,86856,136,108.0,241,385,1016.0
3,CATTARAUGUS,2017,0,1006,2747,14925,132,0,3949,106,256,23121,45,80.0,66,113,320.0
4,CAYUGA,2017,0,337,0,15543,92,155,5067,455,0,21649,40,42.0,58,120,208.0


In [4]:
# Drop the only non-numeric column
child_care = child_care.drop(columns='County')

In [5]:
child_care.shape

(1403, 16)

In [6]:
# Check data type
child_care.dtypes

Year                                int64
Adoptive Home                       int64
Agency Operated Boarding Home       int64
 Approved Relative Home             int64
 Foster Boarding Home               int64
 Group Home                         int64
 Group Residence                    int64
Institution                         int64
 Supervised Independent Living      int64
Other                               int64
Total Days In Care                  int64
Admissions                          int64
Discharges                        float64
Children In Care                    int64
Number of Children Served           int64
Indicated CPS Reports             float64
dtype: object

### Comparing Random Forest and Decision Tree algorithms 

### 1. Using the raw data


In [7]:
# Use every remaining column as a feature; Discharges is the target
X = child_care.drop(columns='Discharges')
Y = child_care['Discharges']

In [8]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

#### Decision tree

In [9]:
import time
# time.clock() was removed in Python 3.8; perf_counter() is the replacement
start_time = time.perf_counter()
regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(x_train, y_train)
y_ = regressor.predict(x_test)

# Default scoring for a regressor is R^2
score_tree = cross_val_score(regressor, X, Y, cv=10)
print(score_tree)
print("--- %s seconds ---" % (time.perf_counter() - start_time))

[0.99204767 0.97316862 0.98373446 0.99065008 0.99389431 0.97821466
 0.97143231 0.99305388 0.9960625  0.99786922]
--- 0.21862741333333333 seconds ---


#### Random forest

In [11]:
start_time = time.perf_counter()
# Fit on the foster-care training split (not on synthetic make_regression data)
regr = RandomForestRegressor(max_depth=5, random_state=0)
regr.fit(x_train, y_train)
Y_ = regr.predict(x_test)

score_forest = cross_val_score(regr, X, Y, cv=10)
print(score_forest)
print("--- %s seconds ---" % (time.perf_counter() - start_time))

[0.97384792 0.99067861 0.98255792 0.98671177 0.99496693 0.95606318
 0.99355206 0.98872688 0.99741922 0.99559668]
--- 0.5375270399999863 seconds ---


The random forest algorithm takes about 2.5 times as long to run as the decision tree algorithm (0.54 s versus 0.22 s). Each timing covers one fit, one prediction, and the 10-fold cross-validation.
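
The timings above lump fitting, prediction, and cross-validation together. As a minimal sketch of how the pieces could be separated, scikit-learn's `cross_validate` reports per-fold fit and score times alongside the test scores. The data here is a synthetic stand-in from `make_regression` (an assumption made so the example is self-contained), and the names `X_demo`, `y_demo`, and `timings` are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption); substitute X and Y from the notebook.
X_demo, y_demo = make_regression(n_samples=300, n_features=15, n_informative=5,
                                 noise=1.0, random_state=0)

models = {
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(max_depth=5, random_state=0),
}

timings = {}
for name, model in models.items():
    # cross_validate returns arrays with one entry per fold.
    cv = cross_validate(model, X_demo, y_demo, cv=5)
    timings[name] = {
        "fit_time": cv["fit_time"].sum(),      # total seconds spent fitting
        "score_time": cv["score_time"].sum(),  # total seconds spent scoring
        "mean_r2": cv["test_score"].mean(),    # default regressor score is R^2
    }
    print(name, timings[name])
```

Reporting fit time and score time separately makes it easier to see which stage accounts for the gap between the two models.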

### 2. Using PCA

In [12]:
# PCA is sensitive to feature scale, so standardize the features first
X_nor = StandardScaler().fit_transform(X)

In [13]:
# Five components explain almost all of the variance (about 98%)
pca = PCA(n_components=5)

In [14]:
X_pca = pca.fit_transform(X_nor)
print(pca.explained_variance_ratio_.cumsum())

[0.77383099 0.84847266 0.91104799 0.95564556 0.98262387]
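
The cumulative ratios printed above can also be used to choose the number of components for a given variance target. The sketch below reuses those five printed values; the 0.95 threshold is a hypothetical cutoff chosen only for illustration.

```python
import numpy as np

# Cumulative explained-variance ratios as printed by the notebook.
cumulative = np.array([0.77383099, 0.84847266, 0.91104799, 0.95564556, 0.98262387])

threshold = 0.95  # hypothetical cutoff, for illustration only

# searchsorted gives the index of the first entry >= threshold;
# adding 1 converts that index into a component count.
n_needed = int(np.searchsorted(cumulative, threshold)) + 1
print(n_needed)
```

With these values, four components are enough to pass 95%. scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`, which makes it select the count this way internally.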


In [15]:
X_train1, X_test1, Y_train1, Y_test1 = train_test_split(X_pca, Y, test_size = 0.2, random_state = 0)
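
Because the scaler and PCA above are fitted on all rows before cross-validation, each validation fold has already influenced the transformation. One possible way to avoid that, sketched here under the assumption that a `Pipeline` is acceptable for this exercise, is to wrap the steps so they are refitted inside every fold. The synthetic `X_demo` and `y_demo` stand in for the notebook's `X` and `Y`.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption); substitute X and Y from the notebook.
X_demo, y_demo = make_regression(n_samples=300, n_features=15, n_informative=5,
                                 noise=1.0, random_state=0)

# Scaling and PCA are refitted on the training part of each fold.
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    DecisionTreeRegressor(random_state=0),
)

pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(pipe_scores)
```

Timing this pipeline would also include the cost of scaling and PCA in each fold, so the numbers would not be directly comparable with the timings recorded in this notebook.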

#### Decision Tree

In [16]:
start_time = time.perf_counter()
regr2 = DecisionTreeRegressor(random_state=0)
regr2.fit(X_train1, Y_train1)
y_ = regr2.predict(X_test1)

score_tree = cross_val_score(regr2, X_pca, Y, cv=10)
print(score_tree)
print("--- %s seconds ---" % (time.perf_counter() - start_time))

[0.98415518 0.99257293 0.98497597 0.99485666 0.97473199 0.98455885
 0.97402008 0.99235562 0.99558184 0.99329543]
--- 0.10849493333333271 seconds ---


#### Random Forest

In [17]:
start_time = time.perf_counter()
# Fit on the PCA-transformed training split (not on synthetic make_regression data)
regr3 = RandomForestRegressor(max_depth=5, random_state=0)
regr3.fit(X_train1, Y_train1)
Y_ = regr3.predict(X_test1)

score_forest = cross_val_score(regr3, X_pca, Y, cv=10)
print(score_forest)
print("--- %s seconds ---" % (time.perf_counter() - start_time))

[0.96694004 0.99648392 0.9682422  0.9890833  0.99339856 0.9898025
 0.97701399 0.98006366 0.99864959 0.99498602]
--- 0.30844159999999476 seconds ---


The random forest algorithm again takes longer to run: about 0.31 s versus 0.11 s for the decision tree, roughly 2.8 times as long.

### 3. Using SelectKBest

In [19]:
# Use the chi-squared score to select the five best features
# (chi2 requires non-negative feature values)
selection = SelectKBest(score_func=chi2, k=5)
X_features = selection.fit(X, Y).transform(X)
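
In scikit-learn, `chi2` is documented as a score for classification targets, while `f_regression` is the univariate score intended for continuous targets such as Discharges. The following is only a sketch of that alternative, not the method used in this notebook, and it runs on synthetic stand-in data (`X_demo`, `y_demo` are illustrative names).

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in data (an assumption); substitute X and Y from the notebook.
X_demo, y_demo = make_regression(n_samples=300, n_features=15, n_informative=5,
                                 noise=1.0, random_state=0)

# Score each feature with a univariate F-test against the continuous target.
selector = SelectKBest(score_func=f_regression, k=5)
X_selected = selector.fit_transform(X_demo, y_demo)

# Column indices of the features that were kept.
kept = selector.get_support(indices=True)
print(X_selected.shape, kept)
```

Swapping the score function may select different columns, so the scores and timings recorded below would not carry over unchanged.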

In [20]:
x_train2, x_test2, y_train2, y_test2 = train_test_split(X_features, Y, test_size = 0.2, random_state = 0)

#### Decision Tree

In [21]:
start_time = time.perf_counter()
regr5 = DecisionTreeRegressor(random_state=0)
regr5.fit(x_train2, y_train2)
y_ = regr5.predict(x_test2)

score_tree = cross_val_score(regr5, X_features, Y, cv=10)
print(score_tree)
print("--- %s seconds ---" % (time.perf_counter() - start_time))

[0.99056243 0.9852936  0.98309361 0.98995698 0.99314615 0.97287147
 0.96304211 0.99264767 0.99570308 0.99722881]
--- 0.10223701333333679 seconds ---


#### Random Forest 

In [22]:
start_time = time.perf_counter()
# Fit on the selected-feature training split (not on synthetic make_regression data)
regr4 = RandomForestRegressor(max_depth=5, random_state=0)
regr4.fit(x_train2, y_train2)
Y_ = regr4.predict(x_test2)

score_forest = cross_val_score(regr4, X_features, Y, cv=10)
print(score_forest)
print("--- %s seconds ---" % (time.perf_counter() - start_time))

[0.98669888 0.99572783 0.99588006 0.99061081 0.99727062 0.98528803
 0.99519749 0.99665789 0.99783617 0.9933284 ]
--- 0.3049881599999935 seconds ---


Both models achieve very high cross-validated R² scores, and the Random Forest algorithm takes more than twice as long to execute as the Decision Tree algorithm. This data set has only 1403 rows, and I expect the difference in execution time between the two algorithms to be much bigger with larger data sets.
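
The expectation about larger data sets could be checked with a small experiment like the one below. This is a rough sketch on synthetic data from `make_regression`, not on the foster-care data; the sample sizes and the helper `fit_seconds` are hypothetical choices made for illustration.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor


def fit_seconds(model, X, y):
    """Return the wall-clock seconds taken by a single model.fit call."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


results = []
for n_samples in (500, 1000, 2000):  # hypothetical sizes, for illustration
    X_demo, y_demo = make_regression(n_samples=n_samples, n_features=15,
                                     n_informative=5, noise=1.0, random_state=0)
    tree_s = fit_seconds(DecisionTreeRegressor(random_state=0), X_demo, y_demo)
    forest_s = fit_seconds(RandomForestRegressor(max_depth=5, random_state=0),
                           X_demo, y_demo)
    results.append((n_samples, tree_s, forest_s))
    print(f"n={n_samples}: tree {tree_s:.4f}s, forest {forest_s:.4f}s")
```

Single-run timings are noisy, so repeating each measurement several times and comparing medians would give a more reliable picture.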