# Challenge

Now that you've learned about random forests and decision trees, let's do an exercise in accuracy. You know that a random forest is essentially an ensemble of decision trees. But how do the accuracies of the two models compare?

So here's what you should do. Pick a dataset. It could be one you've worked with before or it could be a new one. Then build the best decision tree you can.

Now try to match that accuracy with the simplest random forest you can. For our purposes, measure simplicity by runtime, and compare the forest's runtime to that of the decision tree. Runtime is an imperfect proxy for simplicity, but just go with it.
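
One way to measure runtime, sketched here with only the standard library: start the clock immediately before the work and stop it immediately after, so that imports and printing are excluded. The helper name `timed` and the stand-in `workload` function are illustrative assumptions; in the challenge the timed call would be something like a model's `fit`.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()            # start the clock right before the work
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start  # stop it right after
    return result, elapsed

# Stand-in workload (hypothetical); replace with e.g. model.fit(X_train, y_train)
def workload(n):
    return sum(i * i for i in range(n))

result, seconds = timed(workload, 100_000)
print("--- %s seconds ---" % seconds)
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic, high-resolution clock intended for measuring short durations.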

Hopefully out of this you'll see the power of random forests, but also their potential costs. Remember, in the real world you won't necessarily be dealing with thousands of rows. It could be millions, billions, or even more.

Submit a link to your models below.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='dark', color_codes=True)

%matplotlib inline

In [2]:
loans = pd.read_csv('loan_data.csv')
loans.head()

,credit.policy,purpose,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec,not.fully.paid
0,1,debt_consolidation,0.1189,829.1,11.350407,19.48,737,5639.958333,28854,52.1,0,0,0,0
1,1,credit_card,0.1071,228.22,11.082143,14.29,707,2760.0,33623,76.7,0,0,0,0
2,1,debt_consolidation,0.1357,366.86,10.373491,11.63,682,4710.0,3511,25.6,1,0,0,0
3,1,debt_consolidation,0.1008,162.34,11.350407,8.1,712,2699.958333,33667,73.2,1,0,0,0
4,1,credit_card,0.1426,102.92,11.299732,14.97,667,4066.0,4740,39.5,0,1,0,0


In [3]:
loans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
credit.policy        9578 non-null int64
purpose              9578 non-null object
int.rate             9578 non-null float64
installment          9578 non-null float64
log.annual.inc       9578 non-null float64
dti                  9578 non-null float64
fico                 9578 non-null int64
days.with.cr.line    9578 non-null float64
revol.bal            9578 non-null int64
revol.util           9578 non-null float64
inq.last.6mths       9578 non-null int64
delinq.2yrs          9578 non-null int64
pub.rec              9578 non-null int64
not.fully.paid       9578 non-null int64
dtypes: float64(6), int64(7), object(1)
memory usage: 1.0+ MB


In [4]:
loans.describe()

,credit.policy,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec,not.fully.paid
count,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0,9578.0
mean,0.80497,0.12264,319.089413,10.932117,12.606679,710.846314,4560.767197,16913.96,46.799236,1.577469,0.163708,0.062122,0.160054
std,0.396245,0.026847,207.071301,0.614813,6.88397,37.970537,2496.930377,33756.19,29.014417,2.200245,0.546215,0.262126,0.366676
min,0.0,0.06,15.67,7.547502,0.0,612.0,178.958333,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,0.1039,163.77,10.558414,7.2125,682.0,2820.0,3187.0,22.6,0.0,0.0,0.0,0.0
50%,1.0,0.1221,268.95,10.928884,12.665,707.0,4139.958333,8596.0,46.3,1.0,0.0,0.0,0.0
75%,1.0,0.1407,432.7625,11.291293,17.95,737.0,5730.0,18249.5,70.9,2.0,0.0,0.0,0.0
max,1.0,0.2164,940.14,14.528354,29.96,827.0,17639.95833,1207359.0,119.0,33.0,13.0,5.0,1.0


## EDA

In [None]:
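#plot FICO distributions by credit policy outcome
plt.figure(figsize=(10,6))

#one histogram per credit policy value, overlaid on the same axes
loans[loans['credit.policy']==1]['fico'].hist(bins=35, color='blue', label='Credit Policy = 1', alpha=0.6)
loans[loans['credit.policy']==0]['fico'].hist(bins=35, color='red', label='Credit Policy = 0', alpha=0.6)
plt.legend()
plt.xlabel('FICO')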


In [7]:
cat_feats = ['purpose']

final_data = pd.get_dummies(loans, columns=cat_feats,
                            drop_first=True)
final_data.head()

,credit.policy,int.rate,installment,log.annual.inc,dti,fico,days.with.cr.line,revol.bal,revol.util,inq.last.6mths,delinq.2yrs,pub.rec,not.fully.paid,purpose_credit_card,purpose_debt_consolidation,purpose_educational,purpose_home_improvement,purpose_major_purchase,purpose_small_business
0,1,0.1189,829.1,11.350407,19.48,737,5639.958333,28854,52.1,0,0,0,0,0,1,0,0,0,0
1,1,0.1071,228.22,11.082143,14.29,707,2760.0,33623,76.7,0,0,0,0,1,0,0,0,0,0
2,1,0.1357,366.86,10.373491,11.63,682,4710.0,3511,25.6,1,0,0,0,0,1,0,0,0,0
3,1,0.1008,162.34,11.350407,8.1,712,2699.958333,33667,73.2,1,0,0,0,0,1,0,0,0,0
4,1,0.1426,102.92,11.299732,14.97,667,4066.0,4740,39.5,0,1,0,0,1,0,0,0,0,0
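
With `drop_first=True`, `pd.get_dummies` omits the first level of each encoded column, so k categories become k-1 indicator columns and the omitted level is implied when all indicators are 0. The six `purpose_` columns above would therefore correspond to seven distinct purposes. A small sketch on a toy frame (the `amount` column and the category names are made up for illustration, not taken from the loan data):

```python
import pandas as pd

# Toy frame with hypothetical categories, not the loan data
toy = pd.DataFrame({'amount': [10, 20, 30, 40],
                    'purpose': ['car', 'home', 'car', 'school']})

# Three categories (car, home, school) become two indicator columns;
# 'car', the first level in sorted order, is the one dropped
encoded = pd.get_dummies(toy, columns=['purpose'], drop_first=True)
print(list(encoded.columns))
```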


In [31]:
#Train test split 

from sklearn.model_selection import train_test_split

X = final_data.drop('not.fully.paid', axis=1)
y = final_data['not.fully.paid']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

print("The number of observations in the training set is {}".format(X_train.shape[0]))
print("The number of observations in the test set is {}".format(X_test.shape[0]))

The number of observations in the training set is 6704
The number of observations in the test set is 2874
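
The split sizes can be checked by hand. This sketch assumes that, for a fractional `test_size`, scikit-learn rounds the test share up to a whole row and gives the remainder to the training set, which matches the counts printed above:

```python
import math

n_rows = 9578      # from loans.info()
test_size = 0.3

n_test = math.ceil(test_size * n_rows)  # 2873.4 rounds up to a whole row
n_train = n_rows - n_test               # the remainder goes to training
print(n_train, n_test)
```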


In [45]:
#Train a decision tree model 

from sklearn.tree import DecisionTreeClassifier

#decision tree with gini as criterion 
dtree= DecisionTreeClassifier()
dtree.fit(X_train, y_train)

pred = dtree.predict(X_test)


from sklearn.metrics import classification_report, confusion_matrix

print('Decision Tree using Gini as Criterion')

print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))
print('\n')

from sklearn.model_selection import cross_val_score

print(cross_val_score(dtree, X, y, cv=10))
print('\n')
print(np.mean(cross_val_score(dtree, X, y, cv=10)))
print('\n')

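import time

#caveat: the timer starts only after the model has been fit and scored above,
#so this measures just the gap between the next two statements, not training time
start_time = time.time()

print("--- %s seconds ---" % (time.time() - start_time))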

Decision Tree using Gini as Criterion
              precision    recall  f1-score   support

           0       0.85      1.00      0.92      2431
           1       0.50      0.02      0.04       443

    accuracy                           0.85      2874
   macro avg       0.67      0.51      0.48      2874
weighted avg       0.79      0.85      0.78      2874

[[2422    9]
 [ 434    9]]


[0.73514077 0.75703858 0.75912409 0.76200418 0.7651357  0.75862069
 0.75757576 0.74294671 0.57366771 0.54336468]


0.7086750598533158


--- 0.00011229515075683594 seconds ---
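
The headline figures in the report can be recomputed from the confusion matrix printed above, which helps when judging how much an accuracy of 0.85 is worth. A short sketch using those four counts (rows are true classes, columns are predicted classes); `baseline` is the accuracy of always predicting class 0:

```python
# Counts copied from the confusion matrix printed above
tn, fp = 2422, 9    # true class 0: predicted 0, predicted 1
fn, tp = 434, 9     # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)   # of the rows predicted 1, how many are truly 1
recall_1 = tp / (tp + fn)      # of the rows that are truly 1, how many were found
baseline = (tn + fp) / total   # accuracy of always predicting class 0

print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2), round(baseline, 2))
```

With these counts the accuracy and the always-predict-0 baseline coincide, because the 9 true positives exactly offset the 9 false positives. On an imbalanced target like this one, accuracy alone says little about how well class 1 is detected.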


In [46]:
#decision tree with entropy as criterion 
dtree = DecisionTreeClassifier(criterion='entropy')
dtree.fit(X_train, y_train)

pred = dtree.predict(X_test)

print('Decision Tree using Entropy as Criterion')

print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))
print('\n')

print(cross_val_score(dtree, X, y, cv=10))
print('\n')
print(np.mean(cross_val_score(dtree, X, y, cv=10)))
print('\n')

#caveat: the timer starts after fitting and scoring, so it does not measure training time
start_time = time.time()

print("--- %s seconds ---" % (time.time() - start_time))

Decision Tree using Entropy as Criterion
              precision    recall  f1-score   support

           0       0.85      0.83      0.84      2431
           1       0.18      0.21      0.19       443

    accuracy                           0.73      2874
   macro avg       0.52      0.52      0.52      2874
weighted avg       0.75      0.73      0.74      2874

[[2018  413]
 [ 352   91]]


[0.73826903 0.74973931 0.73305527 0.77244259 0.79123173 0.76593521
 0.75339603 0.76280042 0.58098224 0.52560084]


0.7160880814861994


--- 3.0994415283203125e-05 seconds ---
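
The two criteria differ only in how they score the impurity of a node. As a rough illustration, the sketch below evaluates both for a binary node; the helper names `gini` and `entropy` are my own, and the share `p` is the mean of `not.fully.paid` from `loans.describe()`:

```python
import math

def gini(p):
    """Gini impurity of a binary node whose positive-class share is p."""
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)

def entropy(p):
    """Shannon entropy, in bits, of a binary node whose positive-class share is p."""
    if p in (0.0, 1.0):
        return 0.0   # a pure node has no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

p = 0.160054  # share of not.fully.paid == 1, from loans.describe()
print("gini=%.4f entropy=%.4f" % (gini(p), entropy(p)))
```

Both measures are 0 for a pure node and peak at an even split (0.5 for Gini, 1 bit for entropy), so they usually rank candidate splits similarly.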


In [47]:
#random forest 

from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=300)

rfc.fit(X_train, y_train)

pred = rfc.predict(X_test)


print('Random Forest')
print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))
print('\n')

print(cross_val_score(rfc, X, y, cv=10))
print('\n')
print(np.mean(cross_val_score(rfc, X, y, cv=10)))
print('\n')

#caveat: the timer starts after fitting and scoring, so it does not measure training time
start_time = time.time()

print("--- %s seconds ---" % (time.time() - start_time))

Random Forest
              precision    recall  f1-score   support

           0       0.85      1.00      0.92      2431
           1       0.40      0.02      0.03       443

    accuracy                           0.84      2874
   macro avg       0.62      0.51      0.47      2874
weighted avg       0.78      0.84      0.78      2874

[[2419   12]
 [ 435    8]]


[0.83941606 0.83941606 0.83941606 0.84029228 0.83820459 0.84012539
 0.83908046 0.84221526 0.6384535  0.57575758]


0.7943870389696784


--- 3.0994415283203125e-05 seconds ---
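
A minimal sketch of the comparison the challenge asks for, under stated assumptions: synthetic data from `make_classification` stands in for the loan data so the block runs on its own, the `max_depth=4` tree and the list of forest sizes are arbitrary illustrative choices, and `score_and_time` is a hypothetical helper. It times the cross-validation itself and stops at the first forest whose mean accuracy reaches the tree's.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption); swap in X and y from the notebook
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.84, 0.16], random_state=101)

def score_and_time(model, X, y, cv=5):
    """Return (mean cross-validated accuracy, seconds spent cross-validating)."""
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=cv)
    return scores.mean(), time.perf_counter() - start

tree_acc, tree_secs = score_and_time(
    DecisionTreeClassifier(max_depth=4, random_state=101), X_demo, y_demo)
print("tree: accuracy=%.3f, seconds=%.3f" % (tree_acc, tree_secs))

# Try forests from smallest to largest and stop once one matches the tree
results = []
for n in (1, 5, 10, 25, 50):
    acc, secs = score_and_time(
        RandomForestClassifier(n_estimators=n, random_state=101), X_demo, y_demo)
    results.append((n, acc, secs))
    print("forest n_estimators=%d: accuracy=%.3f, seconds=%.3f" % (n, acc, secs))
    if acc >= tree_acc:
        break
```

In the notebook, passing `X` and `y` in place of `X_demo` and `y_demo` would run the same comparison on the loan data.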
