# Comparison of Decision Tree Classifier and Random Forest Classifier Models

The dataset comes from a lending club and contains information about credit history, interest rate, accounts, installments, etc. We will use this information to predict whether a loan is a good loan or a bad loan.
We will first build a Decision Tree Classifier model and check how well it does before considering Random Forests.

The data is stored as lending_club_data.csv and was obtained from https://github.com/sshumiye/Notes, where it is stored as lending_club_data01.csv.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import sklearn
from pandas import Series, DataFrame

import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('whitegrid')

In [3]:
loans = pd.read_csv("C:\\Users\\Ritesh Mohan Monga\\Documents\\Python-ML\\lending_club_data.csv")
loans.head()

Unnamed: 0,int_rate,installment,open_acc,revol_bal,revol_util,total_acc,bad_loans,grade_num
0,10.65,162.87,3,13648,83.7,9,0,5
1,15.27,59.83,3,1687,9.4,4,1,4
2,15.96,84.33,2,2956,98.5,10,0,4
3,13.49,339.31,10,5598,21.0,37,0,4
4,7.9,156.46,9,7963,28.3,12,0,6


In [4]:
loans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1468 entries, 0 to 1467
Data columns (total 8 columns):
int_rate       1468 non-null float64
installment    1468 non-null float64
open_acc       1468 non-null int64
revol_bal      1468 non-null int64
revol_util     1468 non-null float64
total_acc      1468 non-null int64
bad_loans      1468 non-null int64
grade_num      1468 non-null int64
dtypes: float64(3), int64(5)
memory usage: 91.8 KB


In [29]:
# add a good_loans column: 'yes' when bad_loans is 0, 'no' when bad_loans is 1
# we will use good_loans as the target
loans['good_loans'] = loans['bad_loans'].apply(lambda y: 'yes' if y == 0 else 'no')
loans.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,int_rate,installment,annual_inc,dti,delinq_2yrs,inq_last_6mths,open_acc,...,recoveries,collection_recovery_fee,last_pymnt_amnt,bad_loans,emp_length_num,grade_num,sub_grade_num,delinq_2yrs_zero,payment_inc_ratio,good_loans
0,5000,5000,4975,10.65,162.87,24000,27.65,0,1,3,...,0.0,0.0,171.62,0,11,5,0.4,1,8.1435,yes
1,2500,2500,2500,15.27,59.83,30000,1.0,0,5,3,...,117.08,1.11,119.66,1,1,4,0.8,1,2.3932,no
2,2400,2400,2400,15.96,84.33,12252,8.72,0,2,2,...,0.0,0.0,649.91,0,11,4,1.0,1,8.25955,yes
3,10000,10000,10000,13.49,339.31,49200,20.0,0,1,10,...,0.0,0.0,357.48,0,11,4,0.2,1,8.27585,yes
4,5000,5000,5000,7.9,156.46,36000,11.2,0,3,9,...,0.0,0.0,161.03,0,4,6,0.8,1,5.21533,yes


In [27]:
# features = all columns except bad_loans and good_loans
# target = good_loans 

X = loans.drop(['bad_loans', 'good_loans'], axis = 1)
y = loans['good_loans']

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 124)

In [8]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [9]:
prediction = model.predict(X_test)

In [10]:
from sklearn.metrics import classification_report, confusion_matrix

In [12]:
print (confusion_matrix(y_test, prediction))

[[ 19  48]
 [ 56 318]]


In [13]:
# 48 + 56 = 104 misclassified loans
print (classification_report(y_test, prediction))

              precision    recall  f1-score   support

          no       0.25      0.28      0.27        67
         yes       0.87      0.85      0.86       374

    accuracy                           0.76       441
   macro avg       0.56      0.57      0.56       441
weighted avg       0.78      0.76      0.77       441



In [21]:
# weighted-average precision = 78%
# precision is high (87%) when good_loans = 'yes'
# precision is poor (25%) when good_loans = 'no'
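
The per-class figures in the report can be recomputed by hand from the confusion matrix printed above. The following is a minimal sketch in plain Python; the helper name `per_class_metrics` is hypothetical, and it assumes the scikit-learn convention that rows are true labels and columns are predicted labels, both in sorted order ('no', 'yes').

```python
# Confusion matrix printed above for the Decision Tree on the small dataset.
# Rows are true labels, columns are predicted labels, in the order ('no', 'yes').
cm = [[19, 48],
      [56, 318]]
labels = ['no', 'yes']


def per_class_metrics(cm, labels):
    """Return {label: (precision, recall)} for a square confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_pos = cm[i][i]
        predicted_as_label = sum(row[i] for row in cm)  # column sum
        actually_label = sum(cm[i])                     # row sum
        metrics[label] = (true_pos / predicted_as_label,
                          true_pos / actually_label)
    return metrics


metrics = per_class_metrics(cm, labels)
for label, (precision, recall) in metrics.items():
    print(f"{label}: precision={precision:.2f}, recall={recall:.2f}")
```

Precision divides the diagonal entry by its column sum (everything predicted as that class), while recall divides it by its row sum (everything that truly belongs to that class).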

# Random Forests

In [15]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators = 150)

In [16]:
rf_model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=150,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [17]:
rf_prediction = rf_model.predict(X_test)

In [19]:
print (classification_report(y_test, rf_prediction))

              precision    recall  f1-score   support

          no       0.36      0.12      0.18        67
         yes       0.86      0.96      0.91       374

    accuracy                           0.83       441
   macro avg       0.61      0.54      0.54       441
weighted avg       0.78      0.83      0.80       441



In [20]:
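# Weighted-average precision = 78%
# Weighted-average recall = 83%
# Weighted-average F1-score = 80%
# Random Forest has higher overall accuracy (83% vs 76%) than the Decision Tree Classifier,
# but its recall when good_loans = 'no' is lower (12% vs 28%).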
print(confusion_matrix(y_test, rf_prediction))

[[  8  59]
 [ 14 360]]


In [22]:
# 59 + 14 = 73 loans misclassified (false positives and false negatives)
# precision is good (86%) when good_loans = 'yes'
# precision when good_loans = 'no' (36%) is better than the Decision Tree Classifier's (25%)
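
Because the test set is imbalanced (67 'no' loans against 374 'yes' loans), accuracy is easier to judge next to a trivial baseline. The sketch below uses only the counts printed above; the "always predict the majority class" baseline is an illustrative assumption and not a model that was trained in this notebook.

```python
# Test-set class counts, taken from the 'support' column of the reports above.
support = {'no': 67, 'yes': 374}
total = sum(support.values())

# Hypothetical baseline: predict the majority class for every loan.
majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / total

# Accuracy of each model = correct predictions (matrix diagonal) / all predictions.
tree_accuracy = (19 + 318) / total
forest_accuracy = (8 + 360) / total

print(f"always '{majority_label}': {baseline_accuracy:.3f}")
print(f"decision tree: {tree_accuracy:.3f}")
print(f"random forest: {forest_accuracy:.3f}")
```

On these counts neither model beats the majority-class baseline on accuracy, which is why the per-class precision and recall are worth reading alongside the overall score.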

In [23]:
# Random Forests tend to give better results with larger datasets

# Let's consider a larger dataset

The data is stored in lending_club_data_big.csv and was obtained from https://github.com/sshumiye/Notes, where it is stored as lending_club_new_data.csv.


In [24]:
loans = pd.read_csv("C:\\Users\\Ritesh Mohan Monga\\Documents\\Python-ML\\lending_club_data_big.csv")
loans.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,int_rate,installment,annual_inc,dti,delinq_2yrs,inq_last_6mths,open_acc,...,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_amnt,bad_loans,emp_length_num,grade_num,sub_grade_num,delinq_2yrs_zero,payment_inc_ratio
0,5000,5000,4975,10.65,162.87,24000,27.65,0,1,3,...,0.0,0.0,0.0,171.62,0,11,5,0.4,1,8.1435
1,2500,2500,2500,15.27,59.83,30000,1.0,0,5,3,...,0.0,117.08,1.11,119.66,1,1,4,0.8,1,2.3932
2,2400,2400,2400,15.96,84.33,12252,8.72,0,2,2,...,0.0,0.0,0.0,649.91,0,11,4,1.0,1,8.25955
3,10000,10000,10000,13.49,339.31,49200,20.0,0,1,10,...,16.97,0.0,0.0,357.48,0,11,4,0.2,1,8.27585
4,5000,5000,5000,7.9,156.46,36000,11.2,0,3,9,...,0.0,0.0,0.0,161.03,0,4,6,0.8,1,5.21533


In [25]:
loans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9516 entries, 0 to 9515
Data columns (total 24 columns):
loan_amnt                  9516 non-null int64
funded_amnt                9516 non-null int64
funded_amnt_inv            9516 non-null int64
int_rate                   9516 non-null float64
installment                9516 non-null float64
annual_inc                 9516 non-null int64
dti                        9516 non-null float64
delinq_2yrs                9516 non-null int64
inq_last_6mths             9516 non-null int64
open_acc                   9516 non-null int64
total_pymnt                9516 non-null float64
total_pymnt_inv            9516 non-null float64
total_rec_prncp            9516 non-null float64
total_rec_int              9516 non-null float64
total_rec_late_fee         9516 non-null float64
recoveries                 9516 non-null float64
collection_recovery_fee    9516 non-null float64
last_pymnt_amnt            9516 non-null float64
bad_loans                

In [28]:
# add a good_loans column: 'yes' when bad_loans is 0, 'no' when bad_loans is 1
# we will use good_loans as the target
loans['good_loans'] = loans['bad_loans'].apply(lambda y: 'yes' if y == 0 else 'no')
loans.head()

Unnamed: 0,loan_amnt,funded_amnt,funded_amnt_inv,int_rate,installment,annual_inc,dti,delinq_2yrs,inq_last_6mths,open_acc,...,recoveries,collection_recovery_fee,last_pymnt_amnt,bad_loans,emp_length_num,grade_num,sub_grade_num,delinq_2yrs_zero,payment_inc_ratio,good_loans
0,5000,5000,4975,10.65,162.87,24000,27.65,0,1,3,...,0.0,0.0,171.62,0,11,5,0.4,1,8.1435,yes
1,2500,2500,2500,15.27,59.83,30000,1.0,0,5,3,...,117.08,1.11,119.66,1,1,4,0.8,1,2.3932,no
2,2400,2400,2400,15.96,84.33,12252,8.72,0,2,2,...,0.0,0.0,649.91,0,11,4,1.0,1,8.25955,yes
3,10000,10000,10000,13.49,339.31,49200,20.0,0,1,10,...,0.0,0.0,357.48,0,11,4,0.2,1,8.27585,yes
4,5000,5000,5000,7.9,156.46,36000,11.2,0,3,9,...,0.0,0.0,161.03,0,4,6,0.8,1,5.21533,yes


In [90]:
X = loans.drop(['bad_loans', 'good_loans'], axis = 1) # features
y = loans['good_loans'] # target

In [91]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 124)

In [92]:
model2 = DecisionTreeClassifier()
model2.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [93]:
prediction2 = model2.predict(X_test)

In [94]:
print (confusion_matrix(y_test, prediction2))

[[ 443   12]
 [   9 2391]]


In [95]:
# only 21 observations are misclassified
print (classification_report(y_test, prediction2))

              precision    recall  f1-score   support

          no       0.98      0.97      0.98       455
         yes       1.00      1.00      1.00      2400

    accuracy                           0.99      2855
   macro avg       0.99      0.98      0.99      2855
weighted avg       0.99      0.99      0.99      2855



In [38]:
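# precision and recall both round to 100% when good_loans = 'yes'
# the model did very well: the dataset is larger and has more features, but several of them
# (e.g. recoveries, collection_recovery_fee, total_pymnt) are recorded after the loan outcome
# is known, so the near-perfect scores likely reflect target leakage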
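
Near-perfect scores are often a sign that one feature encodes the outcome. One way to check is to rank features by importance. The sketch below is an illustration on synthetic data, not on the lending club file: the column names and the `leaky_amount` feature (nonzero only for bad loans, loosely imitating a post-outcome amount such as `recoveries`) are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

y_demo = rng.integers(0, 2, size=n)                 # synthetic target, 1 = bad loan
noise = rng.normal(size=(n, 3))                     # three uninformative features
leaky = y_demo * rng.uniform(50, 150, size=n)       # nonzero only for bad loans
X_demo = np.column_stack([noise, leaky])
feature_names = ['noise_a', 'noise_b', 'noise_c', 'leaky_amount']  # hypothetical names

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Rank features from most to least important.
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

If a single column dominates the ranking like this on the real data, it is worth asking whether that value would be known at the time the loan decision is made.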

In [96]:
# random forest classifier
rf_model2 = RandomForestClassifier(n_estimators = 150)

In [97]:
rf_model2.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=150,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [98]:
rf_prediction2 = rf_model2.predict(X_test)

In [99]:
print (confusion_matrix(y_test, rf_prediction2))

[[ 442   13]
 [   0 2400]]


In [100]:
# 13 misclassified data points
print (classification_report(y_test, rf_prediction2))

              precision    recall  f1-score   support

          no       1.00      0.97      0.99       455
         yes       0.99      1.00      1.00      2400

    accuracy                           1.00      2855
   macro avg       1.00      0.99      0.99      2855
weighted avg       1.00      1.00      1.00      2855



In [49]:
# weighted-average precision, recall and F1-score all round to 100%
# not much difference between the Decision Tree model and the Random Forest model,
# but the Random Forest did slightly better (13 vs 21 misclassified).
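
When two models differ by only a handful of test points, a single train/test split may not be enough to separate them. A hedged alternative is k-fold cross-validation, sketched here on synthetic data from `make_classification` (an assumption for illustration; the notebook itself uses one 70/30 split). The dataset size, class weights and `f1_macro` scoring are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the loan data (illustrative only).
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     weights=[0.85, 0.15], random_state=0)

models = {
    'decision tree': DecisionTreeClassifier(random_state=0),
    'random forest': RandomForestClassifier(n_estimators=50, random_state=0),
}

cv_scores = {}
for name, estimator in models.items():
    # Five stratified folds; macro F1 weights both classes equally.
    cv_scores[name] = cross_val_score(estimator, X_demo, y_demo,
                                      cv=5, scoring='f1_macro')
    print(f"{name}: mean={cv_scores[name].mean():.3f}, "
          f"std={cv_scores[name].std():.3f}")
```

Comparing the mean and spread across folds gives a rough sense of whether a small gap between the two models is larger than the variation from one split to the next.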

# Dropping some features

In [104]:
X = loans.drop(['bad_loans', 'good_loans', 'annual_inc', 'dti', 'delinq_2yrs',
               'inq_last_6mths', 'open_acc', 'total_pymnt', 'total_pymnt_inv',
               'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee',
               'recoveries', 'collection_recovery_fee'], axis = 1)
y = loans['good_loans']

In [105]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 124)

In [53]:
model3 = DecisionTreeClassifier()
model3.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [54]:
prediction3 = model3.predict(X_test)

In [57]:
print (confusion_matrix(y_test, prediction3))

[[ 302  153]
 [ 165 2235]]


In [59]:
# 318 data points misclassified as fp and fn
print (classification_report(y_test, prediction3))

              precision    recall  f1-score   support

          no       0.65      0.66      0.66       455
         yes       0.94      0.93      0.93      2400

    accuracy                           0.89      2855
   macro avg       0.79      0.80      0.79      2855
weighted avg       0.89      0.89      0.89      2855



In [60]:
# weighted-average precision = 89%
# good_loans = 'yes': 94% precision, the model does well
# good_loans = 'no': 65% precision, the model does poorly

In [61]:
# random forest classifier
rf_model3 = RandomForestClassifier(n_estimators = 150)

In [62]:
rf_model3.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=150,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [106]:
rf_prediction3 = rf_model3.predict(X_test)

In [109]:
print (confusion_matrix(y_test, rf_prediction3))

[[ 281  174]
 [  70 2330]]


In [112]:
# 174 + 70 = 244 misclassified data points
print (classification_report(y_test, rf_prediction3))

              precision    recall  f1-score   support

          no       0.80      0.62      0.70       455
         yes       0.93      0.97      0.95      2400

    accuracy                           0.91      2855
   macro avg       0.87      0.79      0.82      2855
weighted avg       0.91      0.91      0.91      2855



In [113]:
# Weighted-average precision = 91%
# good_loans = 'no' has good precision (80%) but poorer recall (62%) and F1-score (70%).
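
The F1-score is the harmonic mean of precision and recall, so a weak recall pulls it down even when precision is high. As a small worked example, the 'no' row of the Random Forest report above can be recomputed from its confusion matrix (rows assumed to be true labels, columns predicted labels, in the order 'no', 'yes'):

```python
# Random Forest confusion matrix for the reduced feature set, as printed above.
cm = [[281, 174],
      [70, 2330]]

true_pos = cm[0][0]                              # bad loans correctly predicted as 'no'
precision = true_pos / (cm[0][0] + cm[1][0])     # share of 'no' predictions that were right
recall = true_pos / (cm[0][0] + cm[0][1])        # share of actual 'no' loans that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```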

# Conclusion
We dealt with the following cases. For each case, two classifier models, one based on the Decision Tree algorithm and one on the Random Forest algorithm, were built and used for prediction. The results from both were then compared. The observations from each case are given below:

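# Case 1 : Small Dataset with 7 features
- Random Forest Classifier gives better overall results (83% vs 76% accuracy) than the Decision Tree Classifier model, although it identifies fewer of the bad loans (recall of 12% vs 28%).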

# Case 2 : Large Dataset with 23 features
- Random Forest gave slightly better results than the Decision Tree Classifier model, even though the latter already reached 99% weighted-average precision, recall and F1-score.

# Case 3 : Large Dataset with 11 features
- Random Forest Classifier gives better results (91% vs 89% accuracy) than the Decision Tree Classifier model.

# Inference
- Both the Random Forest and Decision Tree Classifier models give better results when there is a large amount of data.
- The Random Forest Classifier model gives higher overall accuracy than the Decision Tree Classifier model in all three cases.
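
The comparison can be summarised by recomputing accuracy from the six confusion matrices printed in the notebook. This is a minimal sketch in plain Python; the `results` dictionary layout and the `accuracy` helper are hypothetical names introduced for the example.

```python
# Confusion matrices copied from the notebook output (rows: true, columns: predicted).
results = {
    'case 1': {'decision tree': [[19, 48], [56, 318]],
               'random forest': [[8, 59], [14, 360]]},
    'case 2': {'decision tree': [[443, 12], [9, 2391]],
               'random forest': [[442, 13], [0, 2400]]},
    'case 3': {'decision tree': [[302, 153], [165, 2235]],
               'random forest': [[281, 174], [70, 2330]]},
}


def accuracy(cm):
    """Fraction of correct predictions: diagonal sum over total count."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


summary = {case: {model: accuracy(cm) for model, cm in models.items()}
           for case, models in results.items()}

for case, scores in summary.items():
    print(case, {model: round(score, 3) for model, score in scores.items()})
```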