In [7]:
from __future__ import division, print_function

In [1]:
import sframe
loans = sframe.SFrame('lending-club-data.gl/')

[INFO] sframe.cython.cy_server: SFrame v2.1 started. Logging /tmp/sframe_server_1519465980.log


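### Exploring some features
2. Let's quickly explore what the dataset looks like. First, print out the column names with loans.column_names() to see what features we have in this dataset.

3. The cell below recodes the bad_loans column into a safe_loans target (+1 for a safe loan, -1 for a risky loan) and removes the original column: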

In [2]:
loans['safe_loans'] = loans['bad_loans'].apply(lambda x: +1 if x==0 else -1)
loans = loans.remove_column('bad_loans')
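
The recoding rule in the cell above can be checked on a plain Python list. This is only a sketch: the flag values are made up for illustration and are not taken from the LendingClub data.

```python
# Hypothetical bad_loans flags: 1 = bad (risky) loan, 0 = good loan.
bad_loans = [0, 1, 0, 0, 1]

# Same rule as the SFrame lambda: 0 -> +1 (safe), anything else -> -1 (risky).
safe_loans = [+1 if flag == 0 else -1 for flag in bad_loans]

print(safe_loans)
```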

### Selecting features
In this assignment, we will be using a subset of features (categorical and numeric). The features we will be using are described in the code comments below. If you are a finance geek, the LendingClub website has a lot more details about these features.

4. Extract these feature columns and the target column from the dataset. We will only use these features.

In [3]:
target = 'safe_loans'
features = ['grade',                     # grade of the loan (categorical)
            'sub_grade_num',             # sub-grade of the loan as a number from 0 to 1
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'payment_inc_ratio',         # ratio of the monthly payment to income
            'delinq_2yrs',               # number of delinquencies
            'delinq_2yrs_zero',          # no delinquencies in last 2 years
            'inq_last_6mths',            # number of creditor inquiries in last 6 months
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'open_acc',                  # number of open credit accounts
            'pub_rec',                   # number of derogatory public records
            'pub_rec_zero',              # no derogatory public records
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
            'int_rate',                  # interest rate of the loan
            'total_rec_int',             # interest received to date
            'annual_inc',                # annual income of borrower
            'funded_amnt',               # amount committed to the loan
            'funded_amnt_inv',           # amount committed by investors for the loan
            'installment',               # monthly payment owed by the borrower
           ]

### Skipping observations with missing values
Recall from the lectures that one common approach to coping with missing values is to skip observations that contain missing values.

5. Using SFrame, we run the following code to do so:

In [6]:
loans, loans_with_na = loans[[target] + features].dropna_split()

In [8]:
# Count the number of rows with missing data
num_rows_with_na = loans_with_na.num_rows()
num_rows = loans.num_rows()
print("Dropping %s observations: keeping %s " % (num_rows_with_na, num_rows))

Dropping 29 observations: keeping 122578 
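
For readers without SFrame, the same "skip incomplete rows" idea can be sketched in pandas. This is an assumed equivalent of dropna_split, not the notebook's own code; the toy table and the names toy, complete_rows and rows_with_na are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up toy table; NaN / None mark missing values.
toy = pd.DataFrame({
    "safe_loans": [1, -1, 1, -1],
    "dti": [10.5, np.nan, 7.2, 3.3],
    "grade": ["A", "B", None, "C"],
})

# True for every row that has at least one missing value.
has_na = toy.isnull().any(axis=1)

complete_rows = toy[~has_na]   # rows we keep
rows_with_na = toy[has_na]     # rows we skip

print("Dropping %s observations; keeping %s" % (len(rows_with_na), len(complete_rows)))
```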


### Make sure the classes are balanced
6. We saw in an earlier assignment that this dataset is also imbalanced. We will undersample the larger class (safe loans) in order to balance out our dataset. We used seed=1 to make sure everyone gets the same results.

In [10]:
safe_loans_raw = loans[loans[target] == 1]
risky_loans_raw = loans[loans[target] == -1]

# Undersample the safe loans
percentage = len(risky_loans_raw)/len(safe_loans_raw)
safe_loans = safe_loans_raw.sample(percentage, seed = 1)
risky_loans = risky_loans_raw
loans_data = risky_loans.append(safe_loans)

print("Percentage of safe loans         :", len(safe_loans) / len(loans_data))
print("Percentage of risky loans         :", len(risky_loans) / len(loans_data))
print("Total number of loans in our new dataset :", len(loans_data))

Percentage of safe loans         : 0.502247166849
Percentage of risky loans         : 0.497752833151
Total number of loans in our new dataset : 46503
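
The split above is close to, but not exactly, 50/50. One plausible reason is that a fraction-based sample keeps each row with a given probability rather than keeping an exact count. The NumPy sketch below illustrates that behaviour under this assumption; the label counts are made up and the code is not the SFrame implementation.

```python
import numpy as np

rng = np.random.RandomState(1)

# Made-up labels: 900 safe (+1) and 100 risky (-1) loans.
labels = np.array([1] * 900 + [-1] * 100)
safe_idx = np.flatnonzero(labels == 1)
risky_idx = np.flatnonzero(labels == -1)

# Keep each safe row independently with probability |risky| / |safe|.
fraction = len(risky_idx) / float(len(safe_idx))
keep = rng.uniform(size=len(safe_idx)) < fraction

# All risky rows are kept; only the safe class is undersampled.
balanced_idx = np.concatenate([risky_idx, safe_idx[keep]])
safe_share = np.mean(labels[balanced_idx] == 1)

print("Share of safe loans after undersampling:", safe_share)
```

Because each safe row is kept at random, the share of safe loans lands near 0.5 without being guaranteed to equal it.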


### One-hot encoding
Scikit-learn's decision tree implementation requires numerical values for its data matrix. This means you will have to turn categorical variables into binary features via one-hot encoding.

7. We've seen this same piece of code in earlier assignments. Again, feel free to use this piece of code as is. Refer to the API documentation for a deeper understanding.

In [12]:
loans_data = risky_loans.append(safe_loans)

categorical_variables = []
for feat_name, feat_type in zip(loans_data.column_names(), loans_data.column_types()):
    if feat_type == str:
        categorical_variables.append(feat_name)
        
for feature in categorical_variables:
    loans_data_one_hot_encoded = loans_data[feature].apply(lambda x: {x: 1})
    loans_data_unpacked = loans_data_one_hot_encoded.unpack(column_name_prefix=feature)
    
    # Change Nones to 0s
    for column in loans_data_unpacked.column_names():
        loans_data_unpacked[column] = loans_data_unpacked[column].fillna(0)
    
    loans_data.remove_column(feature)
    loans_data.add_columns(loans_data_unpacked)
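
As a point of comparison, the same one-hot encoding can be sketched with pandas.get_dummies. This is an alternative technique, not the SFrame loop above; the toy table is made up, and prefix_sep="." is chosen here only as an assumption to produce names of the form grade.A.

```python
import pandas as pd

# Made-up toy table with one categorical column.
toy = pd.DataFrame({
    "safe_loans": [1, -1, 1],
    "grade": ["A", "B", "A"],
    "dti": [10.5, 22.1, 7.2],
})

# Replace the categorical column with one 0/1 column per category.
encoded = pd.get_dummies(toy, columns=["grade"], prefix_sep=".", dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one 1 among the grade.* columns, which plays the role of the "change Nones to 0s" step in the loop above.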

### Split data into training and validation
8. We split the data into training data and validation data. We used seed=1 to make sure everyone gets the same results. We will use the validation data to help us select model parameters.

In [13]:
train_data, validation_data = loans_data.random_split(.8, seed=1)
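
A seeded 80/20 split can also be sketched with a shuffled index in NumPy. This is a different technique from random_split (it produces an exact 80/20 row count), and the dataset size n_rows is made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(1)   # fixed seed so the split is reproducible

n_rows = 10                      # made-up dataset size
shuffled = rng.permutation(n_rows)

# First 80% of the shuffled row indices go to training, the rest to validation.
n_train = int(0.8 * n_rows)
train_idx, validation_idx = shuffled[:n_train], shuffled[n_train:]

print(len(train_idx), len(validation_idx))
```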

### Gradient boosted tree classifier
Gradient boosted trees are a powerful variant of boosting methods; they have been used to win many Kaggle competitions, and have been widely used in industry. We will explore the predictive power of multiple decision trees as opposed to a single decision tree.

We will now train models to predict safe_loans using the features above. In this section, we will experiment with training an ensemble of 5 trees.

9. Now, let's use the built-in scikit-learn gradient boosting classifier (sklearn.ensemble.GradientBoostingClassifier) to create a gradient boosted classifier on the training data. You will need to import sklearn, sklearn.ensemble, and numpy.

You will have to first convert the SFrame into a numpy data matrix. See the API for more information. You will also have to extract the label column. Make sure to set max_depth=6 and n_estimators=5.

In [17]:
import numpy as np
import sklearn
from sklearn.ensemble import GradientBoostingClassifier

In [23]:
train_data_matrix = train_data.to_numpy()
X_train, y_train = train_data_matrix[:, 1:], train_data_matrix[:, 0]

In [48]:
model_5 = GradientBoostingClassifier(n_estimators=5, max_depth=6).fit(X_train, y_train)
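
The slicing used above assumes that the label sits in column 0 of the matrix and the features fill the remaining columns. The sketch below reproduces that layout on synthetic data (not the loan data) so the training step can be run without SFrame; random_state=1 is added here only to make the toy run repeatable.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(1)

# Synthetic stand-in for the loan matrix: column 0 is the +1/-1 label,
# the remaining columns are numeric features.
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)
data_matrix = np.column_stack([y, X])

# Same slicing as in the notebook: features are columns 1.., label is column 0.
X_train, y_train = data_matrix[:, 1:], data_matrix[:, 0]

model = GradientBoostingClassifier(n_estimators=5, max_depth=6, random_state=1)
model.fit(X_train, y_train)

print(model.classes_)                   # label order used by predict_proba
print(model.score(X_train, y_train))    # accuracy on the training data
```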

#### Making predictions
Just like we did in previous sections, let us consider a few positive and negative examples from the validation set. We will do the following:

* Predict whether or not a loan is likely to default.
* Predict the probability with which the loan is likely to default.

In [49]:
validation_safe_loans = validation_data[validation_data[target] == 1]
validation_risky_loans = validation_data[validation_data[target] == -1]

sample_validation_data_risky = validation_risky_loans[0:2]
sample_validation_data_safe = validation_safe_loans[0:2]

sample_validation_data = sample_validation_data_safe.append(sample_validation_data_risky)

In [50]:
validation_data_matrix = sample_validation_data.to_numpy()
X_sample_validation, y_sample_validation = validation_data_matrix[:, 1:], validation_data_matrix[:, 0]

In [52]:
model_5.predict(X_sample_validation)

array([ 1.,  1., -1.,  1.])

In [54]:
print(y_sample_validation)

[ 1.  1. -1. -1.]


In [56]:
proba = model_5.predict_proba(X_sample_validation)
proba[:, 1] - proba[:, 0]

array([ 0.15253933,  0.07448605, -0.07747521,  0.18206706])
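
scikit-learn orders the columns of predict_proba by the sorted class labels, so with labels -1 and +1 the difference above is P(+1) - P(-1). The short sketch below reuses the printed differences to recover the probability of the safe class and the implied prediction.

```python
import numpy as np

# Differences P(+1) - P(-1) printed above.
margin = np.array([0.15253933, 0.07448605, -0.07747521, 0.18206706])

# The two class probabilities sum to 1, so P(+1) = (1 + margin) / 2.
p_safe = (1 + margin) / 2

# A positive difference means the model favours the safe class.
predicted = np.where(margin > 0, 1, -1)

print(p_safe)
print(predicted)
```

The signs agree with the predictions printed earlier. The fourth loan, which is actually risky, receives the largest difference of the four, so it is the model's most confident mistake in this sample.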

In [38]:
validation_matrix = validation_data.to_numpy()
X_validation, y_validation = validation_matrix[:, 1:], validation_matrix[:, 0]

In [40]:
validation_accuracy = model_5.score(X_validation, y_validation)

In [41]:
validation_accuracy

0.66146057733735464

In [42]:
import pandas as pd

In [43]:
predicted_df = pd.DataFrame()
predicted_df['predicted'] = model_5.predict(X_validation)
predicted_df['actual'] = y_validation

In [46]:
false_positives = predicted_df[(predicted_df['predicted']==1) & (predicted_df['actual']==-1)]
len(false_positives)

1652

In [47]:
false_negatives = predicted_df[(predicted_df['predicted']==-1) & (predicted_df['actual']==1)]
len(false_negatives)

1491
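
The two counts above and the validation accuracy are linked: every misclassified loan is either a false positive or a false negative. A minimal sketch of that relationship, using made-up labels rather than the validation set:

```python
import numpy as np

# Made-up labels for illustration only.
actual = np.array([1, 1, -1, -1, 1, -1])
predicted = np.array([1, -1, 1, -1, 1, 1])

# False positive: predicted safe (+1) but actually risky (-1).
false_positives = int(np.sum((predicted == 1) & (actual == -1)))
# False negative: predicted risky (-1) but actually safe (+1).
false_negatives = int(np.sum((predicted == -1) & (actual == 1)))

accuracy = np.mean(predicted == actual)
mistakes = false_positives + false_negatives

# Accuracy is one minus the fraction of mistakes.
print(false_positives, false_negatives, accuracy, 1 - mistakes / float(len(actual)))
```

Applied to the counts above, the model makes 1652 + 1491 = 3143 mistakes on the validation set.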