### In this notebook you will use data from LendingClub to predict whether a loan will be paid off in full or will be charged off and possibly go into default. In this assignment you will:

- Use pandas to do some feature engineering.
- Train a decision tree on the LendingClub dataset.
- Visualize the tree.
- Predict whether a loan will default, along with prediction probabilities (on a validation set).
- Train a complex tree model and compare it to a simple tree model.

In [76]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree
%matplotlib inline
import os
os.environ["PATH"] += os.pathsep + 'D:/Program Files (x86)/Graphviz2.38/bin/'

In [3]:
loans = pd.read_csv('lending-club-data.csv')



### The target column (label column) of the dataset that we are interested in is called bad_loans. In this column, 1 means a risky (bad) loan and 0 means a safe loan.

In [9]:
# safe_loans =  1 => safe
# safe_loans = -1 => risky
loans['safe_loans'] = loans['bad_loans'].apply(lambda x: -1 if x == 1 else 1)

In [10]:
loans.drop('bad_loans', axis=1, inplace=True)

### 4. Now, let us explore the distribution of the column safe_loans. This gives us a sense of how many safe and risky loans are present in the dataset. Print out the percentage of safe loans and risky loans in the data frame.
You should have:

- Around 81% safe loans
- Around 19% risky loans

It looks like most of these loans are safe loans (thankfully). But this does make our problem of identifying risky loans challenging.

In [12]:
len(loans[loans['safe_loans'] == 1]) / len(loans)

0.8111853319957262
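
Both class percentages can be read off in one step. The sketch below uses a small made-up label column (`toy_labels` is hypothetical, standing in for `loans['safe_loans']`) and relies on `value_counts(normalize=True)`:

```python
import pandas as pd

# Hypothetical toy labels standing in for loans['safe_loans'] (+1 safe, -1 risky).
toy_labels = pd.Series([1, 1, 1, 1, -1], name="safe_loans")

# normalize=True returns the fraction of rows per label instead of raw counts.
class_share = toy_labels.value_counts(normalize=True)

safe_share = class_share.loc[1]
risky_share = class_share.loc[-1]
print(f"safe: {safe_share:.0%}, risky: {risky_share:.0%}")
```

On the real data the same call should reproduce the roughly 81% / 19% split quoted above.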

### Features for the classification algorithm

In this assignment, we will be using a subset of features (categorical and numeric). The features we will be using are described in the code comments below. If you are a finance geek, the LendingClub website has a lot more details about these features. Extract these feature columns and target column from the dataset. We will only use these features.

In [14]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]
target = 'safe_loans'                    # prediction target (y) (+1 means safe, -1 is risky)

In [41]:
# Extract the feature columns and target column
loans = loans[features + [target]]

In [48]:
loans.head()

Unnamed: 0,grade,sub_grade,short_emp,emp_length_num,home_ownership,dti,purpose,term,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans
0,B,B2,0,11,RENT,27.65,credit_card,36 months,1,1,83.7,0.0,1
1,C,C4,1,1,RENT,1.0,car,60 months,1,1,9.4,0.0,-1
2,C,C5,0,11,RENT,8.72,small_business,36 months,1,1,98.5,0.0,1
3,C,C1,0,11,RENT,20.0,other,36 months,0,1,21.0,16.97,1
4,A,A4,0,4,RENT,11.2,wedding,36 months,1,1,28.3,0.0,1


### One-hot encoding
scikit-learn's decision tree implementation requires numerical values in its data matrix. This means you will have to turn categorical variables into binary features via one-hot encoding. The next assignment has more details about this.
https://stackoverflow.com/questions/48170405/is-pd-get-dummies-one-hot-encoding
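
To see what one-hot encoding does before applying it to the full table, here is a minimal sketch on a made-up two-column frame (`toy` is hypothetical). Numeric columns pass through unchanged, and each category becomes its own indicator column:

```python
import pandas as pd

# Hypothetical miniature of the loans table: one categorical and one numeric column.
toy = pd.DataFrame({
    "grade": ["A", "B", "A"],
    "dti": [11.2, 27.65, 8.72],
})

# get_dummies leaves numeric columns alone and expands each categorical column
# into one indicator column per category, named <column>_<category>.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
```

The indicator dtype differs between pandas versions (integer in older releases, boolean in newer ones); scikit-learn accepts either.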

In [49]:
loans = pd.get_dummies(loans)

In [50]:
loans.shape

(122607, 68)

### Then follow these steps:

- Apply one-hot encoding to loans. Your tool may have a function for one-hot encoding. Alternatively, see #7 for implementation hints.
- Load the JSON files into the lists train_idx and validation_idx.
- Perform the train/validation split using train_idx and validation_idx. In pandas, for instance, positional indexing with .iloc does this.

Note. Some elements in loans are included neither in train_data nor validation_data. This is to perform sampling to achieve class balance.
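
The index-based split, and the fact that some rows end up in neither set, can be illustrated with a toy frame. The index lists here are made up for the example; in the assignment they come from the provided JSON files:

```python
import json
import pandas as pd

# Hypothetical five-row table standing in for the one-hot encoded loans frame.
toy = pd.DataFrame({
    "dti": [1.0, 2.0, 3.0, 4.0, 5.0],
    "safe_loans": [1, -1, 1, -1, 1],
})

# Made-up index lists; a JSON array of integers parses to a plain Python list.
train_idx = json.loads("[0, 3]")
validation_idx = json.loads("[1, 4]")

# .iloc selects rows by position, so the lists pick out exactly those rows.
train_data = toy.iloc[train_idx]
validation_data = toy.iloc[validation_idx]

# Row 2 appears in neither list, so it is left out of both sets.
unused_rows = len(toy) - len(train_data) - len(validation_data)
print(unused_rows)
```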

One way to combat class imbalance is to undersample the larger class until the class distribution is approximately half and half. Here, we will undersample the larger class (safe loans) in order to balance out our dataset. This means we are throwing away many data points. We used seed=1 so everyone gets the same results.

Note: There are many approaches for dealing with imbalanced data, including some where we modify the learning algorithm. These approaches are beyond the scope of this course, but some of them are reviewed in this paper. For this assignment, we use the simplest possible approach, where we subsample the overly represented class to get a more balanced dataset. In general, and especially when the data is highly imbalanced, we recommend using more advanced methods.
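
As a rough illustration of undersampling (not the exact procedure used to produce the index files for this assignment), the sketch below balances a made-up frame by sampling the majority class down to the size of the minority class. The frame `toy` and the choice of `random_state=1` are assumptions for the example:

```python
import pandas as pd

# Hypothetical imbalanced data: 8 safe loans (+1) and 2 risky loans (-1).
toy = pd.DataFrame({
    "safe_loans": [1] * 8 + [-1] * 2,
    "dti": list(range(10)),
})

safe = toy[toy["safe_loans"] == 1]
risky = toy[toy["safe_loans"] == -1]

# Draw as many safe loans as there are risky loans; a fixed seed keeps it repeatable.
safe_sample = safe.sample(n=len(risky), random_state=1)

# Stack the two halves to get a roughly 50/50 dataset.
balanced = pd.concat([risky, safe_sample])
print(balanced["safe_loans"].value_counts())
```

The discarded safe loans are simply lost to training, which is the trade-off the note above describes.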

In [51]:
train_idx = pd.read_json('module-5-assignment-1-train-idx.json')

In [52]:
val_idx = pd.read_json('module-5-assignment-1-validation-idx.json')

In [53]:
train_data = loans.iloc[train_idx[0]]
validation_data = loans.iloc[val_idx[0]]

In [54]:
len(loans[loans[target] == 1])

99457

In [55]:
len(loans[loans[target] == -1])

23150

In [56]:
safe_loans_prob = round(float(sum(train_data['safe_loans'] == 1))/len(train_data),2)
safe_loans_prob

0.5

In [61]:
loans.head()

Unnamed: 0,short_emp,emp_length_num,dti,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans,grade_A,grade_B,...,purpose_house,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_small_business,purpose_vacation,purpose_wedding,term_ 36 months,term_ 60 months
0,0,11,27.65,1,1,83.7,0.0,1,0,1,...,0,0,0,0,0,0,0,0,1,0
1,1,1,1.0,1,1,9.4,0.0,-1,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,11,8.72,1,1,98.5,0.0,1,0,0,...,0,0,0,0,0,1,0,0,1,0
3,0,11,20.0,0,1,21.0,16.97,1,0,0,...,0,0,0,0,1,0,0,0,1,0
4,0,4,11.2,1,1,28.3,0.0,1,1,0,...,0,0,0,0,0,0,0,1,1,0


### Split data into training and validation
Note: In previous assignments, we have called this a train-test split. However, the portion of data that we don't train on will be used to help select model parameters (this is known as model selection). Thus, this portion of data should be called a validation set. Recall that comparing the performance of candidate models (i.e. models with different parameters) should be done on a validation set, while evaluation of the final selected model should always be done on test data. Typically, we would also save a portion of the data (a real test set) to test our final model on, or use cross-validation on the training set to select our final model. But for the learning purposes of this assignment, we won't do that.

scikit-learn DecisionTreeClassifier documentation: http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
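
For reference, the three-way split described in the note (train, validation, and a held-out test set) could look like the sketch below. This is not what the assignment does; the data is synthetic and the 60/20/20 proportions are an assumption:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 50 rows, 2 features, balanced +1 / -1 labels.
X = np.arange(100).reshape(50, 2)
y = np.array([1, -1] * 25)

# First carve off 20% as the test set, touched only once at the very end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# Then split the remainder into training and validation (25% of 40 rows = 10 rows).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))
```

Model parameters such as max_depth would be chosen using the validation rows, and only the final chosen model would be scored on the test rows.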


In [57]:
decision_tree_model = tree.DecisionTreeClassifier(max_depth=6)

In [62]:
decision_tree_model.fit(train_data.drop(target, axis=1), train_data[target])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [64]:
small_model = tree.DecisionTreeClassifier(max_depth=2)

In [65]:
small_model.fit(train_data.drop(target, axis=1), train_data[target])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

### Visualizing a learned model (Optional)
https://stackoverflow.com/questions/35064304/runtimeerror-make-sure-the-graphviz-executables-are-on-your-systems-path-aft
https://stackoverflow.com/questions/27817994/visualizing-decision-tree-in-scikit-learn

In [78]:
from sklearn.tree import export_graphviz
from graphviz import Source
from IPython.display import SVG
#Source(tree.export_graphviz(small_model, out_file=None))
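
If the Graphviz executables are not available, a plain-text rendering is one alternative. The sketch below uses `sklearn.tree.export_text` on a tiny made-up dataset; the feature names are borrowed from the encoded columns purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny made-up dataset: the first feature alone separates the two classes.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, -1, 1, 1]

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text returns the learned splits as an indented, human-readable string.
rules = export_text(toy_tree, feature_names=["grade_A", "short_emp"])
print(rules)
```

Applied to small_model, the same call would need the column names of the training frame (without the target column) as feature_names.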

### Making predictions

Let's consider two positive and two negative examples from the validation set and see what the model predicts. We will do the following:

Predict whether or not a loan is safe.
Predict the probability that a loan is safe.

In [81]:
validation_safe_loans = validation_data[validation_data[target] == 1]
validation_risky_loans = validation_data[validation_data[target] == -1]
sample_validation_data_risky = validation_risky_loans[0:2]
sample_validation_data_safe = validation_safe_loans[0:2]
sample_validation_data = pd.concat([sample_validation_data_safe, sample_validation_data_risky])  # DataFrame.append was removed in pandas 2.0
sample_validation_data

Unnamed: 0,short_emp,emp_length_num,dti,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans,grade_A,grade_B,...,purpose_house,purpose_major_purchase,purpose_medical,purpose_moving,purpose_other,purpose_small_business,purpose_vacation,purpose_wedding,term_ 36 months,term_ 60 months
19,0,11,11.18,1,1,82.4,0.0,1,0,1,...,0,0,0,0,0,0,0,0,1,0
79,0,10,16.85,1,1,96.4,0.0,1,0,0,...,0,0,0,0,0,0,0,0,1,0
24,0,3,13.97,0,1,59.5,0.0,-1,0,0,...,0,0,0,0,1,0,0,0,0,1
41,0,11,16.33,1,1,62.1,0.0,-1,1,0,...,0,0,0,0,0,0,0,0,1,0


### 12. Now, we will use our model to predict whether or not a loan is likely to default. For each row in the sample_validation_data, use the decision_tree_model to predict whether or not the loan is classified as a safe loan. (Hint: if you are using scikit-learn, you can use the .predict() method)

In [82]:
decision_tree_model.predict(sample_validation_data.drop(target, axis=1))

array([ 1, -1, -1,  1], dtype=int64)

In [83]:
sample_validation_data[[target]]

Unnamed: 0,safe_loans
19,1
79,1
24,-1
41,-1


### Explore probability predictions
For each row in the sample_validation_data, what is the probability (according to decision_tree_model) of a loan being classified as safe? (Hint: if you are using scikit-learn, you can use the .predict_proba() method)

In [84]:
decision_tree_model.predict_proba(sample_validation_data.drop(target, axis=1))

array([[0.34156543, 0.65843457],
       [0.53630646, 0.46369354],
       [0.64750958, 0.35249042],
       [0.20789474, 0.79210526]])
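
When reading the array above, it helps to know which column is which: the columns of predict_proba follow the order of the fitted model's classes_ attribute, so with labels -1 and +1 the second column is the probability of a safe loan. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up data: rows with x=0 are mostly risky (-1), rows with x=1 are all safe (+1).
X = np.array([[0], [0], [0], [1], [1], [1]])
y = np.array([-1, -1, 1, 1, 1, 1])

toy_tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
proba = toy_tree.predict_proba(X)

# classes_ is sorted, so look up the column index of the +1 (safe) class explicitly.
safe_col = list(toy_tree.classes_).index(1)
p_safe = proba[:, safe_col]

# For a tree, the probability is the fraction of training rows of that class in the leaf.
print(toy_tree.classes_, p_safe)
```

Looking the column up through classes_ avoids hard-coding an index that would silently change if the labels were encoded differently.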

### Tricky predictions!

14. Now, we will explore something pretty interesting. For each row in the sample_validation_data, what is the probability (according to small_model) of a loan being classified as safe?

In [85]:
small_model.predict_proba(sample_validation_data.drop(target, axis=1))

array([[0.41896585, 0.58103415],
       [0.59255339, 0.40744661],
       [0.59255339, 0.40744661],
       [0.23120112, 0.76879888]])

### Visualize the prediction on a tree
Quiz Question: Based on the visualized tree, what prediction would you make for this data point (according to small_model)? (If you don't have Graphviz, you can answer this quiz question by executing the next part.)

In [86]:
small_model.predict(sample_validation_data.drop(target, axis=1))

array([ 1, -1, -1,  1], dtype=int64)

### Evaluating accuracy of the decision tree model
Evaluate the accuracy of small_model and decision_tree_model on the training data. (Hint: if you are using scikit-learn, you can use the .score() method)

In [87]:
decision_tree_model.score(train_data.drop(target, axis=1), train_data[target])

0.6405276165914464

In [88]:
small_model.score(train_data.drop(target, axis=1), train_data[target])

0.6135020416935311

In [89]:
decision_tree_model.score(validation_data.drop(target, axis=1), validation_data[target])

0.6361482119775959

In [90]:
small_model.score(validation_data.drop(target, axis=1), validation_data[target])

0.6193451098664369
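
The accuracy reported by .score() is just the fraction of rows where the prediction equals the true label. A small sketch on made-up data shows the two agree:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up data where a depth-1 tree gets one of six rows wrong.
X = np.array([[0], [0], [0], [1], [1], [1]])
y = np.array([-1, -1, 1, 1, 1, 1])

toy_tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# Accuracy by hand: mean of the boolean "prediction matches label" vector.
manual_accuracy = np.mean(toy_tree.predict(X) == y)

# .score() on a classifier computes the same quantity.
builtin_accuracy = toy_tree.score(X, y)
print(manual_accuracy, builtin_accuracy)
```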

### Evaluating accuracy of a complex decision tree model
Here, we will train a large decision tree with max_depth=10. This will allow the learned tree to become very deep, and result in a very complex model. Recall that in lecture, we prefer simpler models with similar predictive power. This will be an example of a more complicated model which has similar predictive power, i.e. something we don't want.

In [96]:
big_model = tree.DecisionTreeClassifier(max_depth=10)

In [97]:
big_model.fit(train_data.drop(target, axis=1), train_data[target])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

#### Checkpoint: We should see that big_model has even better performance on the training set than decision_tree_model did on the training set

In [101]:
big_model.score(validation_data.drop(target, axis=1), validation_data[target])

0.6264541146057734
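
To compare training and validation accuracy side by side for several depths, a loop like the following can be used. This is a sketch on synthetic data from `make_classification`, not the LendingClub data, so the numbers will not match the ones in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy classification problem (an assumption for illustration only).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, flip_y=0.2, random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1
)

scores = {}
for depth in (2, 6, 10):
    model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    model.fit(X_train, y_train)
    # Store (training accuracy, validation accuracy) for each depth.
    scores[depth] = (model.score(X_train, y_train), model.score(X_val, y_val))
    print(depth, scores[depth])
```

A growing gap between the two numbers as depth increases is the usual sign that the extra complexity is fitting noise rather than improving predictions.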

In [105]:
decision_tree_model_predict = decision_tree_model.predict(validation_data.drop(target, axis=1))

In [106]:
decision_tree_model_predict

array([-1,  1, -1, ..., -1, -1,  1], dtype=int64)

In [107]:
result = validation_data[[target]].copy()  # copy, so adding columns does not write into a slice of validation_data

In [110]:
result.loc[:, 'Predict Value'] = decision_tree_model_predict


In [122]:
result.head()

Unnamed: 0,safe_loans,Predict Value
24,-1,-1
41,-1,1
60,-1,-1
93,-1,-1
132,-1,1


In [130]:
# Cost of each mistake: 1e4 for a safe loan predicted risky (false negative),
# 2e4 for a risky loan predicted safe (false positive), 0 for a correct prediction.
result['Lost'] = result.apply(lambda x: 1e4 if (x[target] == 1 and x['Predict Value'] == -1)
                              else (2e4 if (x[target] == -1 and x['Predict Value'] == 1) else 0), axis=1)


In [133]:
result['Lost'].sum()

50390000.0
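
The same total can be computed without a row-wise apply by counting the two kinds of mistakes directly. The sketch below uses made-up labels and predictions with the costs from the cell above (1e4 per false negative, 2e4 per false positive):

```python
import numpy as np

# Made-up true labels and predictions (+1 safe, -1 risky).
y_true = np.array([1, 1, -1, -1, -1])
y_pred = np.array([1, -1, 1, -1, 1])

# False negative: a safe loan predicted as risky.
false_negatives = int(np.sum((y_true == 1) & (y_pred == -1)))
# False positive: a risky loan predicted as safe.
false_positives = int(np.sum((y_true == -1) & (y_pred == 1)))

total_cost = 10_000 * false_negatives + 20_000 * false_positives
print(false_negatives, false_positives, total_cost)
```

Keeping the two counts separate also shows which type of error contributes most to the total.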