In [1]:
import sframe
loans = sframe.SFrame('lending-club-data.gl/')

[INFO] sframe.cython.cy_server: SFrame v2.1 started. Logging /tmp/sframe_server_1470672199.log


In [2]:
loans.column_names()

['id',
 'member_id',
 'loan_amnt',
 'funded_amnt',
 'funded_amnt_inv',
 'term',
 'int_rate',
 'installment',
 'grade',
 'sub_grade',
 'emp_title',
 'emp_length',
 'home_ownership',
 'annual_inc',
 'is_inc_v',
 'issue_d',
 'loan_status',
 'pymnt_plan',
 'url',
 'desc',
 'purpose',
 'title',
 'zip_code',
 'addr_state',
 'dti',
 'delinq_2yrs',
 'earliest_cr_line',
 'inq_last_6mths',
 'mths_since_last_delinq',
 'mths_since_last_record',
 'open_acc',
 'pub_rec',
 'revol_bal',
 'revol_util',
 'total_acc',
 'initial_list_status',
 'out_prncp',
 'out_prncp_inv',
 'total_pymnt',
 'total_pymnt_inv',
 'total_rec_prncp',
 'total_rec_int',
 'total_rec_late_fee',
 'recoveries',
 'collection_recovery_fee',
 'last_pymnt_d',
 'last_pymnt_amnt',
 'next_pymnt_d',
 'last_credit_pull_d',
 'collections_12_mths_ex_med',
 'mths_since_last_major_derog',
 'policy_code',
 'not_compliant',
 'status',
 'inactive_loans',
 'bad_loans',
 'emp_length_num',
 'grade_num',
 'sub_grade_num',
 'delinq_2yrs_zero',
 'pub_rec

We put this in a new column called safe_loans.

In [3]:
# safe_loans =  1 => safe
# safe_loans = -1 => risky
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)
loans = loans.remove_column('bad_loans')

Now, let us explore the distribution of the column safe_loans. This gives us a sense of how many safe and risky loans are present in the dataset. Print out the percentage of safe loans and risky loans in the data frame.

You should have:

    Around 81% safe loans
    Around 19% risky loans

In [4]:
safe = (loans['safe_loans'] == +1 ).sum()
risky = (loans['safe_loans'] == -1).sum()

In [5]:
print "safe loans: ", 100*float(safe)/(safe+ risky)
print "risky loans: ", 100*float(risky)/(safe+ risky)

safe loans:  81.1185331996
risky loans:  18.8814668004


In this assignment, we will be using a subset of features (categorical and numeric). The features we will be using are described in the code comments below. If you are a finance geek, the LendingClub website has a lot more details about these features. Extract these feature columns and target column from the dataset. We will only use these features.

In [6]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquincy
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to day
           ]

target = 'safe_loans'                    # prediction target (y) (+1 means safe, -1 is risky)

# Extract the feature columns and target column
loans = loans[features + [target]]

What remains now is a subset of features and the target that we will use for the rest of this notebook.

### Sample data to balance classes

As we explored above, our data is disproportionally full of safe loans. Let's create two datasets: one with just the safe loans (safe_loans_raw) and one with just the risky loans (risky_loans_raw).

In [7]:
safe_loans_raw = loans[loans[target] == +1]
risky_loans_raw = loans[loans[target] == -1]
print "Number of safe loans  : %s" % len(safe_loans_raw)
print "Number of risky loans : %s" % len(risky_loans_raw)

Number of safe loans  : 99457
Number of risky loans : 23150


One way to combat class imbalance is to undersample the larger class until the class distribution is approximately half and half. Here, we will undersample the larger class (safe loans) in order to balance out our dataset. This means we are throwing away many data points. We used seed=1 so everyone gets the same results.

In [8]:
# Since there are fewer risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))

risky_loans = risky_loans_raw
safe_loans = safe_loans_raw.sample(percentage, seed=1)

# Append the risky_loans with the downsampled version of safe_loans
loans_data = risky_loans.append(safe_loans)

In [9]:
safe_loans_raw = loans_data[loans_data[target] == +1]
risky_loans_raw = loans_data[loans_data[target] == -1]
print "Number of safe loans  : %s" % len(safe_loans_raw)
print "Number of risky loans : %s" % len(risky_loans_raw)

Number of safe loans  : 23358
Number of risky loans : 23150


In [10]:
print len(safe_loans_raw)/float(len(risky_loans_raw))

1.00898488121


In [11]:
categorical_variables = []
for feat_name, feat_type in zip(loans_data.column_names(), loans_data.column_types()):
    if feat_type == str:
        categorical_variables.append(feat_name)

for feature in categorical_variables:
    loans_data_one_hot_encoded = loans_data[feature].apply(lambda x: {x: 1})
    loans_data_unpacked = loans_data_one_hot_encoded.unpack(column_name_prefix=feature)

    # Change None's to 0's
    for column in loans_data_unpacked.column_names():
        loans_data_unpacked[column] = loans_data_unpacked[column].fillna(0)

    loans_data.remove_column(feature)
    loans_data.add_columns(loans_data_unpacked)

### Split data into training and validation

We split the data into training and validation sets using an 80/20 split and specifying seed=1 so everyone gets the same results. Call the training and validation sets train_data and validation_data, respectively.

Note: In previous assignments, we have called this a train-test split. However, the portion of data that we don't train on will be used to help select model parameters (this is known as model selection). Thus, this portion of data should be called a validation set. Recall that examining performance of various potential models (i.e. models with different parameters) should be on validation set, while evaluation of the final selected model should always be on test data. Typically, we would also save a portion of the data (a real test set) to test our final model on or use cross-validation on the training set to select our final model. But for the learning purposes of this assignment, we won't do that.

In [12]:
train_data, validation_data = loans_data.random_split(.8, seed=1)

In [13]:
import sklearn
from sklearn.tree import DecisionTreeClassifier
import numpy as np

**Note:** You will have to first convert the **SFrame** into a numpy data matrix, and extract the target labels as a numpy array (Hint: you can use the **.to_numpy()** method call on SFrame to turn SFrames into numpy arrays). See the API for more information. Make sure to **set max_depth=6**.

Call this model **decision_tree_model.**

In [14]:
def get_numpy_data(data_sframe, features, label):
    features_sframe = data_sframe[features]
    feature_matrix = features_sframe.to_numpy()
    label_sarray = data_sframe[label]
    label_array = label_sarray.to_numpy()
    return(feature_matrix, label_array)

In [15]:
features_new = ['short_emp',
 'emp_length_num',
 'dti',
 'last_delinq_none',
 'last_major_derog_none',
 'revol_util',
 'total_rec_late_fee',
 'safe_loans',
 'grade.A',
 'grade.B',
 'grade.C',
 'grade.D',
 'grade.E',
 'grade.F',
 'grade.G',
 'sub_grade.A1',
 'sub_grade.A2',
 'sub_grade.A3',
 'sub_grade.A4',
 'sub_grade.A5',
 'sub_grade.B1',
 'sub_grade.B2',
 'sub_grade.B3',
 'sub_grade.B4',
 'sub_grade.B5',
 'sub_grade.C1',
 'sub_grade.C2',
 'sub_grade.C3',
 'sub_grade.C4',
 'sub_grade.C5',
 'sub_grade.D1',
 'sub_grade.D2',
 'sub_grade.D3',
 'sub_grade.D4',
 'sub_grade.D5',
 'sub_grade.E1',
 'sub_grade.E2',
 'sub_grade.E3',
 'sub_grade.E4',
 'sub_grade.E5',
 'sub_grade.F1',
 'sub_grade.F2',
 'sub_grade.F3',
 'sub_grade.F4',
 'sub_grade.F5',
 'sub_grade.G1',
 'sub_grade.G2',
 'sub_grade.G3',
 'sub_grade.G4',
 'sub_grade.G5',
 'home_ownership.MORTGAGE',
 'home_ownership.OTHER',
 'home_ownership.OWN',
 'home_ownership.RENT',
 'purpose.car',
 'purpose.credit_card',
 'purpose.debt_consolidation',
 'purpose.home_improvement',
 'purpose.house',
 'purpose.major_purchase',
 'purpose.medical',
 'purpose.moving',
 'purpose.other',
 'purpose.small_business',
 'purpose.vacation',
 'purpose.wedding',
 'term. 36 months',
 'term. 60 months']

In [16]:
train_feature_matrix, train_target_array = get_numpy_data(train_data, features_new, target)

In [17]:
decision_tree_model = DecisionTreeClassifier(max_depth = 6)

In [18]:
decision_tree_model.fit(train_feature_matrix,train_target_array)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [19]:
small_model = DecisionTreeClassifier(max_depth = 2)
small_model.fit(train_feature_matrix,train_target_array)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [20]:
validation_safe_loans = validation_data[validation_data[target] == 1]
validation_risky_loans = validation_data[validation_data[target] == -1]

sample_validation_data_risky = validation_risky_loans[0:2]
sample_validation_data_safe = validation_safe_loans[0:2]

sample_validation_data = sample_validation_data_safe.append(sample_validation_data_risky)
sample_validation_data

short_emp,emp_length_num,dti,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans
0,11,11.18,1,1,82.4,0.0,1
0,10,16.85,1,1,96.4,0.0,1
0,3,13.97,0,1,59.5,0.0,-1
0,11,16.33,1,1,62.1,0.0,-1

grade.A,grade.B,grade.C,grade.D,grade.E,grade.F,grade.G,sub_grade.A1,sub_grade.A2,sub_grade.A3,sub_grade.A4
0,1,0,0,0,0,0,0,0,0,0
0,0,0,1,0,0,0,0,0,0,0
0,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0

sub_grade.A5,sub_grade.B1,sub_grade.B2,sub_grade.B3,sub_grade.B4,sub_grade.B5,sub_grade.C1,sub_grade.C2
0,0,0,1,0,0,0,0
0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0

sub_grade.C3,sub_grade.C4,sub_grade.C5,sub_grade.D1,sub_grade.D2,sub_grade.D3,sub_grade.D4,sub_grade.D5
0,0,0,0,0,0,0,0
0,0,0,1,0,0,0,0
0,0,0,0,1,0,0,0
0,0,0,0,0,0,0,0

sub_grade.E1,sub_grade.E2,sub_grade.E3,sub_grade.E4,sub_grade.E5,...
0,0,0,0,0,...
0,0,0,0,0,...
0,0,0,0,0,...
0,0,0,0,0,...


In [21]:
sample_matrix , sample_array = get_numpy_data(sample_validation_data,features_new, target)

Now, we will use our model to predict whether or not a loan is likely to default. For each row in the sample_validation_data, use the decision_tree_model to predict whether or not the loan is classified as a safe loan. (Hint: if you are using scikit-learn, you can use the .predict() method)

In [22]:
decision_tree_model.predict(sample_matrix)

array([ 1,  1, -1, -1])

In [23]:
print sample_array

[ 1  1 -1 -1]


**Quiz Question:** What percentage of the predictions on **sample_validation_data** did **decision_tree_model** get correct?

In [24]:
from sklearn.tree import export_graphviz
export_graphviz(small_model,out_file='smalltree.dot')

In [27]:
from sklearn.externals.six import StringIO  
import pydot 
dot_data = StringIO()  
export_graphviz(small_model, out_file=dot_data,  
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = pydot.graph_from_dot_data(dot_data.getvalue())  


In [28]:
from IPython.display import Image


In [30]:
graph

[<pydot.Dot at 0x7fd7084b29d0>]