# Identifying safe loans with decision trees

LendingClub is a peer-to-peer lending company that directly connects borrowers and potential lenders/investors. In this notebook, you will build a classification model to predict whether or not a loan provided by LendingClub is likely to default.

You will use LendingClub data to predict whether a loan will be paid off in full or charged off (and possibly go into default). In this assignment you will:

Use SFrames to do some feature engineering.
Train a decision tree on the LendingClub dataset.
Visualize the tree.
Predict whether a loan will default, along with prediction probabilities (on a validation set).
Train a complex tree model and compare it to a simple tree model.

In [2]:
from __future__ import division  # keep first: makes / true division in Python 2
import json

import graphlab
import numpy as np
import pandas as pd

In [3]:
loans = graphlab.SFrame('lending-club-data.gl/')

In [4]:
loans

id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade
1077501,1296599,5000,5000,4975,36 months,10.65,162.87,B,B2
1077430,1314167,2500,2500,2500,60 months,15.27,59.83,C,C4
1077175,1313524,2400,2400,2400,36 months,15.96,84.33,C,C5
1076863,1277178,10000,10000,10000,36 months,13.49,339.31,C,C1
1075269,1311441,5000,5000,5000,36 months,7.9,156.46,A,A4
1072053,1288686,3000,3000,3000,36 months,18.64,109.43,E,E1
1071795,1306957,5600,5600,5600,60 months,21.28,152.39,F,F2
1071570,1306721,5375,5375,5350,60 months,12.69,121.45,B,B5
1070078,1305201,6500,6500,6500,60 months,14.65,153.45,C,C3
1069908,1305008,12000,12000,12000,36 months,12.69,402.54,B,B5

emp_title,emp_length,home_ownership,annual_inc,is_inc_v,issue_d,loan_status,pymnt_plan
,10+ years,RENT,24000,Verified,20111201T000000,Fully Paid,n
Ryder,< 1 year,RENT,30000,Source Verified,20111201T000000,Charged Off,n
,10+ years,RENT,12252,Not Verified,20111201T000000,Fully Paid,n
AIR RESOURCES BOARD,10+ years,RENT,49200,Source Verified,20111201T000000,Fully Paid,n
Veolia Transportaton,3 years,RENT,36000,Source Verified,20111201T000000,Fully Paid,n
MKC Accounting,9 years,RENT,48000,Source Verified,20111201T000000,Fully Paid,n
,4 years,OWN,40000,Source Verified,20111201T000000,Charged Off,n
Starbucks,< 1 year,RENT,15000,Verified,20111201T000000,Charged Off,n
Southwest Rural metro,5 years,OWN,72000,Not Verified,20111201T000000,Fully Paid,n
UCLA,10+ years,OWN,75000,Source Verified,20111201T000000,Fully Paid,n

url,desc,purpose,title,zip_code
https://www.lendingclub.com/browse/loanDetail. ...,Borrower added on 12/22/11 > I need to ...,credit_card,Computer,860xx
https://www.lendingclub.com/browse/loanDetail. ...,Borrower added on 12/22/11 > I plan to use ...,car,bike,309xx
https://www.lendingclub.com/browse/loanDetail. ...,,small_business,real estate business,606xx
https://www.lendingclub.com/browse/loanDetail. ...,Borrower added on 12/21/11 > to pay for ...,other,personel,917xx
https://www.lendingclub.com/browse/loanDetail. ...,,wedding,My wedding loan I promise to pay back ...,852xx
https://www.lendingclub.com/browse/loanDetail. ...,Borrower added on 12/16/11 > Downpayment ...,car,Car Downpayment,900xx
https://www.lendingclub.com/browse/loanDetail. ...,Borrower added on 12/21/11 > I own a small ...,small_business,Expand Business & Buy Debt Portfolio ...,958xx
https://www.lendingclub.com/browse/loanDetail. ...,Borrower added on 12/16/11 > I'm trying to ...,other,Building my credit history. ...,774xx
https://www.lendingclub.com/browse/loanDetail. ...,Borrower added on 12/15/11 > I had recived ...,debt_consolidation,High intrest Consolidation ...,853xx
https://www.lendingclub.com/browse/loanDetail. ...,,debt_consolidation,Consolidation,913xx

addr_state,dti,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record
AZ,27.65,0,19850101T000000,1,,
GA,1.0,0,19990401T000000,5,,
IL,8.72,0,20011101T000000,2,,
CA,20.0,0,19960201T000000,1,35.0,
AZ,11.2,0,20041101T000000,3,,
CA,5.35,0,20070101T000000,2,,
CA,5.55,0,20040401T000000,2,,
TX,18.08,0,20040901T000000,0,,
AZ,16.12,0,19980101T000000,2,,
CA,10.78,0,19891001T000000,0,,

open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt
3,0,13648,83.7,9,f,0.0,0.0,5861.07
3,0,1687,9.4,4,f,0.0,0.0,1008.71
2,0,2956,98.5,10,f,0.0,0.0,3003.65
10,0,5598,21.0,37,f,0.0,0.0,12226.3
9,0,7963,28.3,12,f,0.0,0.0,5631.38
4,0,8221,87.5,4,f,0.0,0.0,3938.14
11,0,5210,32.6,13,f,0.0,0.0,646.02
2,0,9279,36.5,3,f,0.0,0.0,1476.19
14,0,4032,20.6,23,f,0.0,0.0,7677.52
12,0,23336,67.1,34,f,0.0,0.0,13943.1

total_pymnt_inv,...
5831.78,...
1008.71,...
3003.65,...
12226.3,...
5631.38,...
3938.14,...
646.02,...
1469.34,...
7677.52,...
13943.1,...


Exploring the target column

The target column (label column) of the dataset that we are interested in is called bad_loans. In this column, 1 means a risky (bad) loan and 0 means a safe loan.

In order to make this more intuitive and consistent with the lectures, we reassign the target to be:

+1 as a safe loan
-1 as a risky (bad) loan

In [5]:
loans['safe_loans'] = loans['bad_loans'].apply(lambda x: +1 if (x==0) else -1)

In [6]:
num_safe_loans = len(loans[loans['safe_loans'] == +1])
num_risky_loans = len(loans[loans['safe_loans'] == -1])
print '% safe loans:', (num_safe_loans/len(loans))
print '% risky loans:', (num_risky_loans/len(loans))

% safe loans: 0.811185331996
% risky loans: 0.188814668004


Features for the classification algorithm

In [7]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]

target = 'safe_loans'                    # prediction target (y) (+1 means safe, -1 is risky)

In [8]:
loans2 = loans[features + [target]]

Sample data to balance classes

In [9]:
safe_loans_raw = loans2[loans2['safe_loans'] == +1]
risky_loans_raw = loans2[loans2['safe_loans'] == -1]

In [10]:
percentage = len(risky_loans_raw)/(len(safe_loans_raw))

risky_loans = risky_loans_raw
safe_loans = safe_loans_raw.sample(percentage, seed=1)

loans_data = risky_loans.append(safe_loans)
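
The cell above balances the classes by keeping every risky loan and only a random fraction of the safe loans. With the class shares printed earlier (about 0.811 safe and 0.189 risky), that fraction works out to roughly 0.189 / 0.811, or about 0.23. The following is a minimal sketch of the same idea in plain Python 3 without GraphLab; the helper name `undersample_majority` and the toy rows are hypothetical, and it draws an exact-size sample, whereas `SFrame.sample` is understood here to return only approximately the requested fraction.

```python
import random


def undersample_majority(majority_rows, minority_rows, seed=1):
    """Return the minority rows plus an equally sized random subset of the majority rows."""
    rng = random.Random(seed)
    # Exact-size sample (an assumption of this sketch, not GraphLab's behavior).
    kept = rng.sample(majority_rows, len(minority_rows))
    return minority_rows + kept


# Hypothetical toy data: 8 safe loans (+1) and 2 risky loans (-1).
safe = [{"id": i, "safe_loans": +1} for i in range(8)]
risky = [{"id": 100 + i, "safe_loans": -1} for i in range(2)]

balanced = undersample_majority(safe, risky)
num_safe = sum(1 for row in balanced if row["safe_loans"] == +1)
num_risky = sum(1 for row in balanced if row["safe_loans"] == -1)
print(num_safe, num_risky)
```

Undersampling discards data from the majority class, which is a deliberate trade here: the balanced set keeps a classifier from scoring well simply by predicting "safe" for every loan.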

One-hot encoding

In [11]:
categorical_variables = []
for feat_name, feat_type in zip(loans_data.column_names(), loans_data.column_types()):
    if feat_type == str:
        categorical_variables.append(feat_name)

for feature in categorical_variables:
    loans_data_one_hot_encoded = loans_data[feature].apply(lambda x: {x: 1})
    loans_data_unpacked = loans_data_one_hot_encoded.unpack(column_name_prefix=feature)

    # Change None's to 0's
    for column in loans_data_unpacked.column_names():
        loans_data_unpacked[column] = loans_data_unpacked[column].fillna(0)

    loans_data.remove_column(feature)
    loans_data.add_columns(loans_data_unpacked)
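
The loop above turns each string-valued column into a set of 0/1 indicator columns, one per distinct value, because the scikit-learn tree used later needs numeric inputs. As an illustration only, here is a small plain Python 3 sketch of the same transformation on toy rows; the function name `one_hot_encode` and the `column.value` naming of the new columns are assumptions of this sketch.

```python
def one_hot_encode(rows, column):
    """Replace `column` with one 0/1 column per distinct value, named 'column.value'."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every other column unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        # Missing categories become 0, mirroring the fillna(0) step above.
        for category in categories:
            new_row[column + "." + category] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Toy rows for illustration.
rows = [
    {"grade": "B", "dti": 27.65},
    {"grade": "C", "dti": 1.0},
    {"grade": "C", "dti": 8.72},
]
encoded = one_hot_encode(rows, "grade")
print(encoded[0])
```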

Split data into training and validation

In [12]:
train_data, validation_data = loans_data.random_split(.8, seed=1)
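
For intuition, a random 80/20 split can be sketched in plain Python 3 by sending each row to the first part with probability 0.8. This is an assumption about how such a split might work, not a description of GraphLab's internals, and the helper name `random_split` below is a hypothetical stand-in for the SFrame method.

```python
import random


def random_split(rows, fraction, seed=1):
    """Send each row to the first list with probability `fraction`, else to the second."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(1000))
train, valid = random_split(rows, 0.8)
print(len(train), len(valid))
```

Fixing the seed makes the split reproducible, so accuracy figures computed later can be compared across runs.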

In [13]:
column_names = train_data.column_names()
my_features = []
for i in xrange(len(column_names)):
    if (column_names[i] != target):
        my_features.append(column_names[i])

In [14]:
train_data_X = train_data[my_features].to_numpy()
train_data_y = train_data[target].to_numpy()

Build a decision tree classifier

In [15]:
import sklearn
from sklearn.tree import DecisionTreeClassifier

Make sure to set max_depth=6.

In [16]:
decision_tree_model = DecisionTreeClassifier(max_depth=6)
decision_tree_model.fit(train_data_X, train_data_y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

Also train a tree with max_depth=2. Call this model small_model.

In [17]:
small_model = DecisionTreeClassifier(max_depth=2)
small_model.fit(train_data_X, train_data_y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

Making predictions

First, let's grab 2 positive examples and 2 negative examples.

In [18]:
validation_safe_loans = validation_data[validation_data[target] == +1]
validation_risky_loans = validation_data[validation_data[target] == -1]

In [19]:
sample_validation_data_safe = validation_safe_loans[0:2]
sample_validation_data_risky = validation_risky_loans[0:2]

In [20]:
sample_validation_data = sample_validation_data_safe.append(sample_validation_data_risky)

sample_validation_data_X = sample_validation_data[my_features].to_numpy()
sample_validation_data_y = sample_validation_data[target].to_numpy()

In [21]:
sample_validation_data_pred = np.zeros(4)

In [22]:
sample_validation_data_pred = decision_tree_model.predict(sample_validation_data_X)

# What percentage of the predictions on sample_validation_data did decision_tree_model get correct?

In [23]:
print sample_validation_data_pred
print sample_validation_data_y

[ 1 -1 -1  1]
[ 1  1 -1 -1]
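
The first printed array holds the predictions and the second holds the true labels, so the fraction correct can be read off by comparing them position by position. A short Python 3 sketch of that arithmetic, with the two arrays copied from the output above:

```python
predictions = [1, -1, -1, 1]  # decision_tree_model predictions printed above
labels = [1, 1, -1, -1]       # true safe_loans values printed above

# Count positions where the prediction matches the label.
num_correct = sum(1 for p, y in zip(predictions, labels) if p == y)
accuracy = num_correct / float(len(labels))
print(accuracy)  # 0.5
```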


Explore probability predictions

For each row in the sample_validation_data, what is the probability (according to decision_tree_model) of a loan being classified as safe?

# Which loan has the highest probability of being classified as a safe loan?

In [24]:
sample_validation_data_prob = np.zeros(4)
sample_validation_data_prob = decision_tree_model.predict_proba(sample_validation_data_X)
print sample_validation_data_prob

[[ 0.34156543  0.65843457]
 [ 0.53630646  0.46369354]
 [ 0.64750958  0.35249042]
 [ 0.20789474  0.79210526]]
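
Each row of the predict_proba output has one column per class, ordered as in the model's `classes_` attribute, which scikit-learn keeps sorted. With labels -1 and +1, the second column is therefore the probability of a safe loan. The Python 3 sketch below copies the printed rows and picks the loan with the highest safe probability; the `classes` list is an assumed stand-in for `decision_tree_model.classes_`.

```python
# Rows copied from the predict_proba output above.
probabilities = [
    [0.34156543, 0.65843457],
    [0.53630646, 0.46369354],
    [0.64750958, 0.35249042],
    [0.20789474, 0.79210526],
]
classes = [-1, 1]  # assumed value of decision_tree_model.classes_

safe_column = classes.index(1)
p_safe = [row[safe_column] for row in probabilities]

# Index of the loan with the highest probability of being safe.
most_likely_safe = max(range(len(p_safe)), key=lambda i: p_safe[i])
print(most_likely_safe + 1)  # 4, i.e. the fourth loan
```

Thresholding these probabilities at 0.5 gives +1, -1, -1, +1, which agrees with the hard predictions printed earlier.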


Now, we will explore something pretty interesting. For each row in the sample_validation_data, what is the probability (according to small_model) of a loan being classified as safe?

# Notice that the probability predictions are exactly the same for the 2nd and 3rd loans. Why would this happen?

In [25]:
sample_validation_data_small_prob = np.zeros(4)
sample_validation_data_small_prob = small_model.predict_proba(sample_validation_data_X)
print sample_validation_data_small_prob

[[ 0.41896585  0.58103415]
 [ 0.59255339  0.40744661]
 [ 0.59255339  0.40744661]
 [ 0.23120112  0.76879888]]
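
One way to see why two loans can share exactly the same probabilities: a tree of depth 2 has at most four leaves, and every loan that lands in the same leaf receives that leaf's class proportions. The Python 3 sketch below uses a hypothetical depth-2 tree; the split features, thresholds, and per-leaf counts are all invented for illustration and are not read from small_model.

```python
def leaf_of(row):
    """Route a row through a hypothetical depth-2 tree (invented splits)."""
    if row["grade_A"] == 1:
        return "A" if row["dti"] < 10.0 else "B"
    return "C" if row["term_60_months"] == 1 else "D"


# Invented training counts per leaf: (safe loans, all loans).
leaf_counts = {"A": (80, 100), "B": (60, 100), "C": (30, 100), "D": (45, 100)}


def p_safe(row):
    """Probability of 'safe' is the share of safe training loans in the row's leaf."""
    safe, total = leaf_counts[leaf_of(row)]
    return safe / float(total)


# Two different loans that happen to follow the same path.
loan_2 = {"grade_A": 0, "dti": 5.0, "term_60_months": 1}
loan_3 = {"grade_A": 0, "dti": 25.0, "term_60_months": 1}
print(leaf_of(loan_2), leaf_of(loan_3))  # C C
print(p_safe(loan_2) == p_safe(loan_3))  # True
```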


# Based on the visualized tree, what prediction would you make for the second loan in sample_validation_data (according to small_model)? (If you don't have Graphviz, you can answer this quiz question by executing the next part.)

In [26]:
# Slice with [1:2] so predict receives a 2-D array holding a single row.
print small_model.predict(sample_validation_data_X[1:2])

[-1]




Evaluate the accuracy of small_model and decision_tree_model on the training data

In [27]:
print "decision tree model accuracy (train):", decision_tree_model.score(train_data_X, train_data_y)
print "small model accuracy (train):", small_model.score(train_data_X, train_data_y)

decision tree model accuracy (train): 0.640527616591
small model accuracy (train): 0.613502041694


Now, evaluate the accuracy of the small_model and decision_tree_model on the entire validation_data, not just the subsample considered above.

# What is the accuracy of decision_tree_model on the validation set, rounded to the nearest .01?

In [28]:
validation_data_X = validation_data[my_features].to_numpy()
validation_data_y = validation_data[target].to_numpy()

In [29]:
print "decision tree model accuracy (valid):", decision_tree_model.score(validation_data_X, validation_data_y)
print "small model accuracy (valid):", small_model.score(validation_data_X, validation_data_y)

decision tree model accuracy (valid): 0.636148211978
small model accuracy (valid): 0.619345109866
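
A quick way to compare the two models is to look at the gap between training and validation accuracy; a large positive gap would suggest overfitting. The Python 3 sketch below does that arithmetic using only the four accuracy values printed above.

```python
# (train accuracy, validation accuracy) copied from the outputs above.
scores = {
    "decision_tree_model": (0.640527616591, 0.636148211978),
    "small_model": (0.613502041694, 0.619345109866),
}

gaps = {}
for name in sorted(scores):
    train_acc, valid_acc = scores[name]
    gaps[name] = train_acc - valid_acc
    print("%s: valid=%.2f, train-valid gap=%+.4f" % (name, valid_acc, gaps[name]))
```

Both gaps are smaller than one percentage point, so neither of these two trees shows a marked difference between training and validation performance.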


Evaluating accuracy of a complex decision tree model

Here, we will train a large decision tree with max_depth=10. This allows the learned tree to grow much deeper and therefore become much more complex. Recall from lecture that, between models with similar predictive power, we prefer the simpler one. This tree is an example of the opposite: a more complicated model with similar predictive power, which is something we don't want.

Using sklearn.tree.DecisionTreeClassifier, train a decision tree with maximum depth = 10. Call this model big_model.

In [30]:
big_model = DecisionTreeClassifier(max_depth=10)
big_model.fit(train_data_X, train_data_y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

Evaluate the accuracy of big_model on the training set and validation set.

In [31]:
print "big model accuracy (train):", big_model.score(train_data_X, train_data_y)
# Score big_model itself on the validation set. The recorded run scored small_model
# here and so repeated its 0.619345109866; no validation output is shown below.
print "big model accuracy (valid):", big_model.score(validation_data_X, validation_data_y)

big model accuracy (train): 0.663738448313


# How does the performance of big_model on the validation set compare to decision_tree_model on the validation set? Is this a sign of overfitting?

Quantifying the cost of mistakes

False negatives: Loans that were actually safe but were predicted to be risky. This results in an opportunity cost of losing a loan that would have otherwise been accepted.

In [32]:
validation_safe_loans = validation_data[validation_data[target] == +1]
validation_safe_X = validation_safe_loans[my_features].to_numpy()
validation_safe_loans['pred'] = decision_tree_model.predict(validation_safe_X)
false_negatives = len(validation_safe_loans[validation_safe_loans['safe_loans'] != validation_safe_loans['pred']])
print false_negatives
print len(validation_safe_loans)

1717
4610


False positives: Loans that were actually risky but were predicted to be safe. These are much more expensive because they result in a risky loan being given.

In [33]:
validation_risky_loans = validation_data[validation_data[target] == -1]
validation_risky_X = validation_risky_loans[my_features].to_numpy()
validation_risky_loans['pred'] = decision_tree_model.predict(validation_risky_X)
false_positives = len(validation_risky_loans[validation_risky_loans['safe_loans'] != validation_risky_loans['pred']])
print false_positives
print len(validation_risky_loans)

1661
4674
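
The two mistake counts can be cross-checked against the validation accuracy reported earlier for decision_tree_model: every validation loan that is neither a false negative nor a false positive was classified correctly. A short Python 3 sketch of that check, using only the numbers printed above:

```python
false_negatives, num_safe = 1717, 4610   # from the safe-loan cell above
false_positives, num_risky = 1661, 4674  # from the risky-loan cell above

# Correct predictions = safe loans predicted safe + risky loans predicted risky.
correct = (num_safe - false_negatives) + (num_risky - false_positives)
accuracy = correct / float(num_safe + num_risky)
print(round(accuracy, 6))  # 0.636148
```

The result matches the validation accuracy of about 0.6361 printed for decision_tree_model, so the counts are consistent with the earlier score.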


# Let's assume that each mistake costs us money: a false negative costs $10,000, while a false positive costs $20,000. What is the total cost of mistakes made by decision_tree_model on validation_data?

In [34]:
false_negatives*10000 + false_positives*20000

50390000
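
The same calculation can be wrapped in a small helper that counts both kinds of mistake directly from labels and predictions. This is a Python 3 sketch under the stated cost assumptions; the function name `mistake_cost` is hypothetical, and the four-loan example reuses the sample labels and predictions printed earlier in the notebook.

```python
def mistake_cost(labels, predictions, fn_cost=10000, fp_cost=20000):
    """Return (false negatives, false positives, total cost) for +1/-1 labels."""
    # False negative: actually safe (+1) but predicted risky (-1).
    fn = sum(1 for y, p in zip(labels, predictions) if y == +1 and p == -1)
    # False positive: actually risky (-1) but predicted safe (+1).
    fp = sum(1 for y, p in zip(labels, predictions) if y == -1 and p == +1)
    return fn, fp, fn * fn_cost + fp * fp_cost


labels = [1, 1, -1, -1]
predictions = [1, -1, -1, 1]
result = mistake_cost(labels, predictions)
print(result)  # (1, 1, 30000)

# The validation-set totals from the cells above give the same figure as the notebook.
total = 1717 * 10000 + 1661 * 20000
print(total)  # 50390000
```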