### Load the Lending Club dataset

We will be using a dataset from LendingClub.

In [1]:
import pandas as pd
import numpy as np

loans = pd.read_csv('lending-club-data.csv')
loans.head()


| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | sub_grade_num | delinq_2yrs_zero | pub_rec_zero | collections_12_mths_zero | short_emp | payment_inc_ratio | final_d | last_delinq_none | last_record_none | last_major_derog_none |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077501 | 1296599 | 5000 | 5000 | 4975 | 36 months | 10.65 | 162.87 | B | B2 | ... | 0.4 | 1.0 | 1.0 | 1.0 | 0 | 8.1435 | 20141201T000000 | 1 | 1 | 1 |
| 1 | 1077430 | 1314167 | 2500 | 2500 | 2500 | 60 months | 15.27 | 59.83 | C | C4 | ... | 0.8 | 1.0 | 1.0 | 1.0 | 1 | 2.3932 | 20161201T000000 | 1 | 1 | 1 |
| 2 | 1077175 | 1313524 | 2400 | 2400 | 2400 | 36 months | 15.96 | 84.33 | C | C5 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 0 | 8.25955 | 20141201T000000 | 1 | 1 | 1 |
| 3 | 1076863 | 1277178 | 10000 | 10000 | 10000 | 36 months | 13.49 | 339.31 | C | C1 | ... | 0.2 | 1.0 | 1.0 | 1.0 | 0 | 8.27585 | 20141201T000000 | 0 | 1 | 1 |
| 4 | 1075269 | 1311441 | 5000 | 5000 | 5000 | 36 months | 7.9 | 156.46 | A | A4 | ... | 0.8 | 1.0 | 1.0 | 1.0 | 0 | 5.21533 | 20141201T000000 | 1 | 1 | 1 |


### Exploring some features

Let's quickly explore what the dataset looks like. First, print out the column names to see what features we have in this dataset.

In [2]:
loans.columns.values

array(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade',
       'emp_title', 'emp_length', 'home_ownership', 'annual_inc',
       'is_inc_v', 'issue_d', 'loan_status', 'pymnt_plan', 'url', 'desc',
       'purpose', 'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs',
       'earliest_cr_line', 'inq_last_6mths', 'mths_since_last_delinq',
       'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc', 'initial_list_status', 'out_prncp',
       'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv',
       'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee',
       'recoveries', 'collection_recovery_fee', 'last_pymnt_d',
       'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d',
       'collections_12_mths_ex_med', 'mths_since_last_major_derog',
       'policy_code', 'not_compliant', 'status', 'inactive_loans',
       'bad_loans', 'emp_length_num', 'grade_num', '

### Exploring the target column

The target column (label column) of the dataset that we are interested in is called bad_loans. In this column, 1 means a risky (bad) loan and 0 means a safe loan.

In order to make this more intuitive and consistent with the lectures, we reassign the target to be:

 *   +1 as a safe loan
 *   -1 as a risky (bad) loan

We put this in a new column called safe_loans.

In [3]:
# safe_loans =  1 => safe
# safe_loans = -1 => risky
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
loans = loans.drop('bad_loans', axis = 1)
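The same relabeling can also be written without a Python-level lambda. The sketch below is one possible alternative, not the assignment's required approach. It uses a small made-up frame (`toy` and its values are illustrative, not taken from the dataset) and `np.where` to map 0 to +1 and everything else to -1.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bad_loans column (hypothetical values).
toy = pd.DataFrame({'bad_loans': [0, 1, 0, 0, 1]})

# Vectorized relabeling: 0 -> +1 (safe), anything else -> -1 (risky).
toy['safe_loans'] = np.where(toy['bad_loans'] == 0, 1, -1)
toy = toy.drop(columns='bad_loans')

print(toy['safe_loans'].tolist())
```

On a frame of this size the difference is cosmetic; on the full dataset the vectorized form avoids calling a Python function once per row.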

In [4]:
loans.columns.values

array(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
       'term', 'int_rate', 'installment', 'grade', 'sub_grade',
       'emp_title', 'emp_length', 'home_ownership', 'annual_inc',
       'is_inc_v', 'issue_d', 'loan_status', 'pymnt_plan', 'url', 'desc',
       'purpose', 'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs',
       'earliest_cr_line', 'inq_last_6mths', 'mths_since_last_delinq',
       'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc', 'initial_list_status', 'out_prncp',
       'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv',
       'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee',
       'recoveries', 'collection_recovery_fee', 'last_pymnt_d',
       'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d',
       'collections_12_mths_ex_med', 'mths_since_last_major_derog',
       'policy_code', 'not_compliant', 'status', 'inactive_loans',
       'emp_length_num', 'grade_num', 'sub_grade_num

Now, let us explore the distribution of the column safe_loans. This gives us a sense of how many safe and risky loans are present in the dataset. Print out the percentage of safe loans and risky loans in the data frame.

You should have:

 *   Around 81% safe loans
 *   Around 19% risky loans

It looks like most of these loans are safe loans (thankfully). But this does make our problem of identifying risky loans challenging.

In [5]:
num_safe_loans = loans[loans['safe_loans'] == +1].shape[0]
num_risky_loans = loans[loans['safe_loans'] == -1].shape[0]
num_total_loans = loans.shape[0]

In [6]:
print("the percentage of safe loans: ", num_safe_loans / num_total_loans * 1.0)
print("the percentage of risky loans: ", num_risky_loans / num_total_loans * 1.0)

the percentage of safe loans:  0.8111853319957262
the percentage of risky loans:  0.18881466800427382
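The class proportions can also be read off in a single call. The following is a minimal sketch on a made-up column (the frame `toy` is an assumption for illustration), using `value_counts(normalize=True)`, which returns fractions rather than counts.

```python
import pandas as pd

# Toy stand-in for the safe_loans column: four safe loans, one risky loan.
toy = pd.DataFrame({'safe_loans': [1, 1, 1, 1, -1]})

# normalize=True returns the fraction of rows per label instead of raw counts.
fractions = toy['safe_loans'].value_counts(normalize=True)
print(fractions)
```

Multiplying the result by 100 gives percentages, if that is the form the question asks for.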


### Features for the classification algorithm

In this assignment, we will be using a subset of features (categorical and numeric). The features we will be using are described in the code comments below. If you are a finance geek, the LendingClub website has a lot more details about these features. Extract these feature columns and target column from the dataset. We will only use these features.

In [7]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]

target = 'safe_loans'                    # prediction target (y) (+1 means safe, -1 is risky)

# Extract the feature columns and target column
loans = loans[features + [target]]

In [8]:
loans.head()

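| | grade | sub_grade | short_emp | emp_length_num | home_ownership | dti | purpose | term | last_delinq_none | last_major_derog_none | revol_util | total_rec_late_fee | safe_loans |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B | B2 | 0 | 11 | RENT | 27.65 | credit_card | 36 months | 1 | 1 | 83.7 | 0.0 | 1 |
| 1 | C | C4 | 1 | 1 | RENT | 1.0 | car | 60 months | 1 | 1 | 9.4 | 0.0 | -1 |
| 2 | C | C5 | 0 | 11 | RENT | 8.72 | small_business | 36 months | 1 | 1 | 98.5 | 0.0 | 1 |
| 3 | C | C1 | 0 | 11 | RENT | 20.0 | other | 36 months | 0 | 1 | 21.0 | 16.97 | 1 |
| 4 | A | A4 | 0 | 4 | RENT | 11.2 | wedding | 36 months | 1 | 1 | 28.3 | 0.0 | 1 |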


Then complete the following steps:

  *  Apply one-hot encoding to loans. Your tool may have a function for one-hot encoding. Alternatively, see #7 for implementation hints.
  *  Load the JSON files into the lists train_idx and validation_idx.
  *  Perform train/validation split using train_idx and validation_idx. In Pandas, for instance:

In [9]:
import json

# Load the row positions that define the train/validation split
with open('module-5-assignment-1-train-idx.json') as f:
    train_idx = json.load(f)
with open('module-5-assignment-1-validation-idx.json') as f:
    validation_idx = json.load(f)

train_data = loans.iloc[train_idx]
validation_data = loans.iloc[validation_idx]
train_data.head()

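| | grade | sub_grade | short_emp | emp_length_num | home_ownership | dti | purpose | term | last_delinq_none | last_major_derog_none | revol_util | total_rec_late_fee | safe_loans |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | C | C4 | 1 | 1 | RENT | 1.0 | car | 60 months | 1 | 1 | 9.4 | 0.0 | -1 |
| 6 | F | F2 | 0 | 5 | OWN | 5.55 | small_business | 60 months | 1 | 1 | 32.6 | 0.0 | -1 |
| 7 | B | B5 | 1 | 1 | RENT | 18.08 | other | 60 months | 1 | 1 | 36.5 | 0.0 | -1 |
| 10 | C | C1 | 1 | 1 | RENT | 10.08 | debt_consolidation | 36 months | 1 | 1 | 91.7 | 0.0 | -1 |
| 12 | B | B2 | 0 | 4 | RENT | 7.06 | other | 36 months | 1 | 1 | 55.5 | 0.0 | -1 |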


### Sample data to balance classes

As we explored above, our data is disproportionately full of safe loans. Let's create two datasets: one with just the safe loans (safe_loans_raw) and one with just the risky loans (risky_loans_raw).

In [10]:
safe_loans_raw = loans[loans[target] == +1]
risky_loans_raw = loans[loans[target] == -1]
print("Number of safe loans  : %s" % len(safe_loans_raw))
print("Number of risky loans : %s" % len(risky_loans_raw))

Number of safe loans  : 99457
Number of risky loans : 23150


One way to combat class imbalance is to undersample the larger class until the class distribution is approximately half and half. Here, we will undersample the larger class (safe loans) in order to balance out our dataset. This means we are throwing away many data points. We used seed=1 so everyone gets the same results.

In [11]:
# Since there are fewer risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))

risky_loans = risky_loans_raw
safe_loans = safe_loans_raw.sample(frac = percentage, random_state = 1)

# Combine the risky_loans with the downsampled version of safe_loans
# (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
loans_data = pd.concat([risky_loans, safe_loans])

In [12]:
print("number of risky loans: ", len(risky_loans))
print("number of safe loans:  ", len(safe_loans))

number of risky loans:  23150
number of safe loans:   23150


### One-hot encoding

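scikit-learn's decision tree implementation requires numerical values for its data matrix. This means you will have to turn categorical variables into binary features via one-hot encoding. The next assignment has more details about this. Note that the cells in this section take a simpler route and map each category to an integer code rather than to binary indicator columns.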
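For reference, one-hot encoding in pandas can be sketched with `pd.get_dummies`. The frame `toy` and its values are made up for illustration; only the column names are borrowed from the feature list. This is one possible way to do it, not necessarily what the assignment expects.

```python
import pandas as pd

# Toy frame with one categorical and one numeric feature (hypothetical values).
toy = pd.DataFrame({
    'grade': ['A', 'B', 'A', 'C'],
    'dti': [11.2, 27.65, 5.55, 20.0],
})

# Each distinct grade becomes its own indicator column; dti is left untouched.
encoded = pd.get_dummies(toy, columns=['grade'])
print(encoded.columns.tolist())

# Exactly one indicator is set per row.
indicator_sum = encoded[['grade_A', 'grade_B', 'grade_C']].astype(int).sum(axis=1)
print(indicator_sum.tolist())
```

Unlike an integer code, this representation does not impose an arbitrary ordering on unordered categories such as purpose or home_ownership.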

In [13]:
categorical_variables = []
for feat_name, feat_type in zip(loans_data.columns.values, loans_data.dtypes):
    if feat_type == object:
        categorical_variables.append(feat_name)
categorical_variables

['grade', 'sub_grade', 'home_ownership', 'purpose', 'term']

In [14]:
def grade_map(grade):
    if grade == "G":
        return 1
    elif grade == "F":
        return 2
    elif grade == "E":
        return 3
    elif grade == "D":
        return 4
    elif grade == "C":
        return 5
    elif grade == "B":
        return 6
    else:
        return 7
    
loans_data['grade'] = loans_data['grade'].apply(grade_map)

In [15]:
subgrade_val = list(loans_data['sub_grade'].unique())
subgrade_dict = {subgrade:index for subgrade, index in zip(subgrade_val, range(len(subgrade_val)))}

def subgrade_map(subgrade):
    return subgrade_dict[subgrade]

loans_data['sub_grade'] = loans_data['sub_grade'].apply(subgrade_map)

In [16]:
ownership_val = list(loans_data['home_ownership'].unique())
ownership_dict = {ownership:index for ownership, index in zip(ownership_val, range(len(ownership_val)))}

def ownership_map(ownership):
    return ownership_dict[ownership]

loans_data['home_ownership'] = loans_data['home_ownership'].apply(ownership_map)

In [17]:
purpose_val = list(loans_data['purpose'].unique())
purpose_dict = {purpose:index for purpose, index in zip(purpose_val, range(len(purpose_val)))}
def purpose_map(purpose):
    return purpose_dict[purpose]

loans_data['purpose'] = loans_data['purpose'].apply(purpose_map)

In [18]:
loans_data['term'] = loans_data['term'].apply(lambda term: 60 if term == " 60 months" else 36)

We split the data into training and validation sets using an 80/20 split and specifying seed=1 so everyone gets the same results. Call the training and validation sets train_data and validation_data, respectively.

Note: In previous assignments, we have called this a train-test split. However, the portion of data that we don't train on will be used to help select model parameters (this is known as model selection). Thus, this portion of data should be called a validation set. Recall that examining the performance of various potential models (i.e. models with different parameters) should be done on a validation set, while evaluation of the final selected model should always be done on test data. Typically, we would also save a portion of the data (a real test set) to test our final model on, or use cross-validation on the training set to select our final model. But for the learning purposes of this assignment, we won't do that.
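To make the distinction concrete, here is a hedged sketch of a three-way split that this assignment deliberately skips. It calls `train_test_split` twice on synthetic arrays; the 60/20/20 proportions and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features, alternating labels.
X = np.arange(200).reshape(100, 2)
y = np.array([1, -1] * 50)

# First hold out 20% of the rows as a final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Then split the remaining 80% into train and validation.
# 0.25 of the remainder equals 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1)

print(len(X_train), len(X_val), len(X_test))
```

Model parameters would be chosen on the validation part, and the test part would be touched only once, for the final model.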

In [56]:
from sklearn.model_selection import train_test_split
train_x, validation_x, train_y, validation_y = train_test_split(loans_data[features], loans_data[target], test_size = 0.2, random_state = 1)

### Build a decision tree classifier

Now, let's use the built-in scikit-learn decision tree learner (sklearn.tree.DecisionTreeClassifier) to create a loan prediction model on the training data. To do this, you will need to import sklearn, sklearn.tree, and numpy.

Note: If your data is in an SFrame, you will first have to convert it into a numpy data matrix and extract the target labels as a numpy array (Hint: you can call the .to_numpy() method on an SFrame to turn it into a numpy array). With pandas, as in this notebook, the DataFrame can be passed to scikit-learn directly. See the API for more information. Make sure to set max_depth=6.

Call this model decision_tree_model.

In [57]:
from sklearn.tree import DecisionTreeClassifier

decision_tree_model = DecisionTreeClassifier(max_depth = 6).fit(train_x, train_y)

In [58]:
small_model = DecisionTreeClassifier(max_depth = 2).fit(train_x, train_y)

### Making predictions

Let's consider two positive and two negative examples from the validation set and see what the model predicts. We will do the following:

 *   Predict whether or not a loan is safe.
 *   Predict the probability that a loan is safe.

First, let's grab 2 positive examples and 2 negative examples. 

In [59]:
# Copy before adding the target so train_x and validation_x stay feature-only
train_data = train_x.copy()
train_data[target] = train_y

validation_data = validation_x.copy()
validation_data[target] = validation_y

In [60]:
validation_safe_loans = validation_data[validation_data[target] == 1]
validation_risky_loans = validation_data[validation_data[target] == -1]

sample_validation_data_risky = validation_risky_loans[0:2]
sample_validation_data_safe = validation_safe_loans[0:2]

sample_validation_data = pd.concat([sample_validation_data_safe, sample_validation_data_risky])
sample_validation_data

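| | grade | sub_grade | short_emp | emp_length_num | home_ownership | dti | purpose | term | last_delinq_none | last_major_derog_none | revol_util | total_rec_late_fee | safe_loans |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 83265 | 6 | 4 | 0 | 11 | 2 | 5.92 | 3 | 60 | 1 | 1 | 19.2 | 0.0 | 1 |
| 89463 | 5 | 7 | 0 | 4 | 2 | 11.49 | 3 | 36 | 1 | 1 | 61.0 | 0.0 | 1 |
| 121355 | 3 | 26 | 0 | 11 | 0 | 15.9 | 3 | 36 | 1 | 1 | 59.7 | 0.0 | -1 |
| 39588 | 5 | 16 | 0 | 11 | 2 | 1.51 | 3 | 36 | 0 | 1 | 20.4 | 0.0 | -1 |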


Now, we will use our model to predict whether or not a loan is likely to default. For each row in the sample_validation_data, use the decision_tree_model to predict whether or not the loan is classified as a safe loan. (Hint: if you are using scikit-learn, you can use the .predict() method)

In [61]:
sample_validation_prediction = decision_tree_model.predict(sample_validation_data[features])

# Question 1
What percentage of the predictions on sample_validation_data did decision_tree_model get correct?

In [62]:
sample_validation_prediction

array([-1, -1, -1,  1])

In [63]:
sample_validation_data[target]

83265     1
89463     1
121355   -1
39588    -1
Name: safe_loans, dtype: int64

### Explore probability predictions

For each row in the sample_validation_data, what is the probability (according to decision_tree_model) of a loan being classified as safe? (Hint: if you are using scikit-learn, you can use the .predict_proba() method)

# Question 2
Which loan has the highest probability of being classified as a safe loan?

In [64]:
sample_validation_predict_proba = decision_tree_model.predict_proba(sample_validation_data[features])
sample_validation_predict_proba

array([[ 0.55806452,  0.44193548],
       [ 0.50016281,  0.49983719],
       [ 0.56614462,  0.43385538],
       [ 0.38582028,  0.61417972]])
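When reading a `predict_proba` array, it helps to know which column belongs to which label. scikit-learn orders the columns by the fitted model's `classes_` attribute, which is sorted, so with labels -1 and +1 the second column is the probability of +1 (safe). The sketch below illustrates this on synthetic data; the data and the names `model`, `safe_col` and `p_safe` are assumptions, not part of the assignment.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class problem with labels -1 and +1.
rng = np.random.RandomState(1)
X = rng.rand(200, 3)
y = np.where(X[:, 0] + 0.1 * rng.randn(200) > 0.5, 1, -1)

model = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)
proba = model.predict_proba(X[:4])

# Columns of proba follow model.classes_; look up the column for +1.
safe_col = list(model.classes_).index(1)
p_safe = proba[:, safe_col]

print(model.classes_)            # label order of the columns
print(p_safe)                    # P(safe) for each of the four rows
print(int(np.argmax(p_safe)))    # position of the row most likely to be safe
```

Looking up the column through `classes_` avoids hard-coding an index that would silently be wrong if the labels were encoded differently.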

### Tricky predictions!

Now, we will explore something pretty interesting. For each row in the sample_validation_data, what is the probability (according to small_model) of a loan being classified as safe?

In [65]:
sample_validation_predict_proba_small_model = small_model.predict_proba(sample_validation_data[features])
sample_validation_predict_proba_small_model

array([[ 0.42964501,  0.57035499],
       [ 0.5801218 ,  0.4198782 ],
       [ 0.5801218 ,  0.4198782 ],
       [ 0.5801218 ,  0.4198782 ]])

# Question 3
Notice that the probability predictions are exactly the same for the 2nd and 3rd loans. Why would this happen?

* During tree traversal both examples fall into the same leaf node.
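The leaf explanation can be checked directly: a fitted scikit-learn tree exposes `.apply()`, which returns the index of the leaf each row lands in. The following is a minimal sketch on synthetic data (the data and the names `small`, `leaf_ids` and `proba` are illustrative assumptions) showing that rows sharing a leaf share a probability vector.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy two-class data.
rng = np.random.RandomState(1)
X = rng.rand(300, 2)
y = np.where(X[:, 0] > 0.5, 1, -1)
y[rng.rand(300) < 0.2] *= -1   # flip some labels so the leaves are impure

small = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X, y)

leaf_ids = small.apply(X[:10])         # leaf index reached by each row
proba = small.predict_proba(X[:10])    # class probabilities for each row

# Every row that reaches the same leaf gets the same probabilities,
# because the leaf stores one class distribution for all of them.
for leaf in np.unique(leaf_ids):
    rows = proba[leaf_ids == leaf]
    print(leaf, rows[0], len(rows))
```

A depth-2 tree has at most four leaves, so it can output at most four distinct probability vectors, no matter how many rows are scored.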

### Visualize the prediction on a tree

Note that you should be able to look at the small tree (of depth 2), traverse it yourself, and visualize the prediction being made. Consider the following point in the sample_validation_data:



In [66]:
small_model.predict(sample_validation_data[features])

array([ 1, -1, -1, -1])

# Question 4
Based on the visualized tree, what prediction would you make for this data point?

* -1
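If no rendered picture of the tree is at hand, scikit-learn can print a text version that is easy to traverse by eye. The sketch below uses `sklearn.tree.export_text` on a synthetic depth-2 tree. The data is made up, and the feature names 'grade' and 'dti' are borrowed only as labels for the two synthetic columns.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data where the label depends on both features.
rng = np.random.RandomState(1)
X = rng.rand(200, 2)
y = np.where((X[:, 0] > 0.5) & (X[:, 1] > 0.3), 1, -1)

tree = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X, y)

# One line per node: split conditions on the way down, "class: ..." at leaves.
report = export_text(tree, feature_names=['grade', 'dti'])
print(report)
```

To predict by hand, start at the top line and follow the branch whose condition the data point satisfies until a `class:` line is reached.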

### Evaluating accuracy of the decision tree model

Recall that the accuracy is defined as follows:

$$\text{accuracy} = \frac{\text{number of correctly classified data points}}{\text{number of total data points}}$$

Evaluate the accuracy of small_model and decision_tree_model on the training data. (Hint: if you are using scikit-learn, you can use the .score() method)

Checkpoint: You should see that the small_model performs worse than the decision_tree_model on the training data.

Now, evaluate the accuracy of the small_model and decision_tree_model on the entire validation_data, not just the subsample considered above.
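As a rough illustration of the `.score()` hint, the sketch below fits a shallow and a deeper tree on synthetic data and checks that `.score()` agrees with the accuracy definition given above. All data and variable names here are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy two-class data.
rng = np.random.RandomState(1)
X = rng.rand(400, 4)
y = np.where(X[:, 0] + X[:, 1] > 1.0, 1, -1)
y[rng.rand(400) < 0.2] *= -1   # label noise

small = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X, y)
deep = DecisionTreeClassifier(max_depth=6, random_state=1).fit(X, y)

# .score(X, y) returns the fraction of correctly classified rows.
acc_small = small.score(X, y)
acc_deep = deep.score(X, y)

# The same number computed from the definition of accuracy.
manual = np.mean(small.predict(X) == y)

print(acc_small, manual, acc_deep)
```

On the training data the deeper tree scores at least as well as the shallow one, which is the behavior the checkpoint describes.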



In [72]:
validation_predict_small_model = small_model.predict(validation_data[features])
validation_predict_decision_tree_model = decision_tree_model.predict(validation_data[features])

accuracy_small_model = list(validation_predict_small_model == validation_data[target]).count(True) / len(validation_data[target]) * 1.0
accuracy_decision_tree_model = list(validation_predict_decision_tree_model == validation_data[target]).count(True) / len(validation_data[target]) * 1.0

print("the accuracy of the small_model on the entire validation_data: ", accuracy_small_model)
print("the accuracy of the decision_tree_model on the entire validation_data: ", accuracy_decision_tree_model)

the accuracy of the small_model on the entire validation_data:  0.614902807775378
the accuracy of the decision_tree_model on the entire validation_data:  0.6250539956803456


# Question 5
What is the accuracy of decision_tree_model on the validation set, rounded to the nearest .01 (e.g. 0.76)?

In [73]:
accuracy_decision_tree_model

0.6250539956803456

### Evaluating accuracy of a complex decision tree model

Here, we will train a large decision tree with max_depth=10. This will allow the learned tree to become very deep, and result in a very complex model. Recall that in lecture, we prefer simpler models with similar predictive power. This will be an example of a more complicated model which has similar predictive power, i.e. something we don't want.

Using sklearn.tree.DecisionTreeClassifier, train a decision tree with maximum depth = 10. Call this model big_model.

Evaluate the accuracy of big_model on the training set and validation set.

Checkpoint: We should see that big_model has even better performance on the training set than decision_tree_model did on the training set.

In [79]:
big_model = DecisionTreeClassifier(max_depth = 10).fit(train_x[features], train_y)

# Question 6
How does the performance of big_model on the validation set compare to decision_tree_model on the validation set? Is this a sign of overfitting?

In [80]:
validation_predict_big_model = big_model.predict(validation_data[features])

accuracy_big_model = list(validation_predict_big_model == validation_data[target]).count(True) / len(validation_data[target]) * 1.0


print("the accuracy of the big_model on the entire validation_data: ", accuracy_big_model)


the accuracy of the big_model on the entire validation_data:  0.6206263498920086
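One way to look for overfitting is to compare training and validation accuracy side by side as the depth grows. The following sketch does this on synthetic noisy data; the data, the chosen depths and the `results` dictionary are assumptions for illustration and do not reproduce the numbers above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy two-class data.
rng = np.random.RandomState(1)
X = rng.rand(1000, 5)
y = np.where(X[:, 0] + X[:, 1] > 1.0, 1, -1)
y[rng.rand(1000) < 0.25] *= -1   # label noise the tree cannot truly learn

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Record (training accuracy, validation accuracy) for each depth.
results = {}
for depth in (2, 6, 10):
    model = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    results[depth] = (model.score(X_tr, y_tr), model.score(X_val, y_val))
    print(depth, results[depth])
```

A training accuracy that keeps rising while the validation accuracy stalls or drops is the usual symptom of overfitting.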


### Quantifying the cost of mistakes

Every mistake the model makes costs money. In this section, we will try to quantify the cost of each mistake made by the model. Assume the following:

 *   False negatives: Loans that were actually safe but were predicted to be risky. This results in an opportunity cost of losing a loan that would have otherwise been accepted.
 *   False positives: Loans that were actually risky but were predicted to be safe. These are much more expensive because they result in a risky loan being given.
 *   Correct predictions: All correct predictions don't typically incur any cost.

Let's write code that can compute the cost of mistakes made by the model. Complete the following 4 steps:

 *   First, let us compute the predictions made by the model.
 *   Second, compute the number of false positives.
 *   Third, compute the number of false negatives.
 *   Finally, compute the cost of mistakes made by the model by adding up the costs of false negatives and false positives.

# Question 7

Let us assume that each mistake costs money:

 *   Assume a cost of \$ 10,000 per false negative.
 *   Assume a cost of \$ 20,000 per false positive.

What is the total cost of mistakes made by decision_tree_model on validation_data? Please enter your answer as a plain integer, without the dollar sign or the comma separator, e.g. 3002000.

In [86]:
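# Keep only the predictions that disagree with the true labels
false_validation_predict = validation_predict_decision_tree_model[validation_predict_decision_tree_model != validation_data[target]]

# A wrong prediction of -1 is a false negative; a wrong prediction of +1 is a false positive
false_negative_validation_predict = false_validation_predict[false_validation_predict == -1]
false_positive_validation_predict = false_validation_predict[false_validation_predict == 1]

# Total cost: 10,000 per false negative plus 20,000 per false positive
10000 * len(false_negative_validation_predict) + 20000 * len(false_positive_validation_predict)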

50130000
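The same counts can also be obtained from `sklearn.metrics.confusion_matrix`. The sketch below is a small worked example on made-up labels (the arrays `y_true` and `y_pred` are assumptions), using the per-mistake costs stated in Question 7.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (+1 = safe, -1 = risky).
y_true = np.array([1, 1, -1, -1, 1, -1])
y_pred = np.array([1, -1, 1, -1, -1, -1])

# With labels=[-1, 1], rows are true labels and columns are predictions,
# so the flattened matrix reads: TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()

# 10,000 per false negative and 20,000 per false positive.
cost = 10000 * int(fn) + 20000 * int(fp)
print(tn, fp, fn, tp, cost)
```

Here two safe loans are wrongly rejected and one risky loan is wrongly accepted, giving 2 × 10,000 + 1 × 20,000 = 40,000.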