# Machine learning - decision trees

In this small project we will work with a dataset from Coursera's Machine Learning Specialization. LendingClub is a peer-to-peer lending company that directly connects borrowers with potential lenders/investors. We will try to build a model that predicts whether a loan is likely to be paid back.

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

## Data

We load the loans dataset with pandas. Specifying column data types up front spares pandas from inferring the dtype of each of those columns, which can help avoid memory problems with very large datasets.

In [2]:
loans = pd.read_csv('lending-club-data.csv', dtype={'desc': object, 'next_pymnt_d': object})
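
As a small illustration of the `dtype` argument, here is a sketch that reads a tiny in-memory CSV instead of the real file. The toy rows are made up for this example; only the column names `desc` and `next_pymnt_d` come from the call above.

```python
import io
import pandas as pd

# Toy CSV standing in for the real file (hypothetical rows, not actual LendingClub data).
csv_text = "id,desc,next_pymnt_d\n1,,Jan-2014\n2,some text,\n"

# Columns listed in `dtype` are read with the declared type instead of an inferred one.
toy = pd.read_csv(io.StringIO(csv_text),
                  dtype={'desc': object, 'next_pymnt_d': object})
print(toy.dtypes)
```

Columns that are not listed, such as `id` here, are still inferred by pandas.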

We change the values to be consistent with the previous lectures.
<br>We reassign the target to be:
<br>  **+1 as a safe loan,**
<br>  **-1 as a risky (bad) loan.**

In [3]:
loans['safe_loans'] = loans['bad_loans'].apply(lambda x: +1 if x == 0 else -1)
loans = loans.drop('bad_loans', axis=1)
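
The same remapping can also be written without a per-row lambda. The sketch below uses `np.where` on a toy frame; the five values are invented for illustration, and this is only one possible alternative to the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data: 0 = good loan, 1 = bad loan (hypothetical values).
toy = pd.DataFrame({'bad_loans': [0, 1, 0, 0, 1]})

# Vectorized remapping: 0 -> +1 (safe), anything else -> -1 (risky).
toy['safe_loans'] = np.where(toy['bad_loans'] == 0, 1, -1)
toy = toy.drop('bad_loans', axis=1)
print(toy['safe_loans'].tolist())
```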

In [4]:
loans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122607 entries, 0 to 122606
Data columns (total 68 columns):
id                             122607 non-null int64
member_id                      122607 non-null int64
loan_amnt                      122607 non-null int64
funded_amnt                    122607 non-null int64
funded_amnt_inv                122607 non-null int64
term                           122607 non-null object
int_rate                       122607 non-null float64
installment                    122607 non-null float64
grade                          122607 non-null object
sub_grade                      122607 non-null object
emp_title                      115770 non-null object
emp_length                     122607 non-null object
home_ownership                 122607 non-null object
annual_inc                     122603 non-null float64
is_inc_v                       122607 non-null object
issue_d                        122607 non-null object
loan_status                

## Features for the classification

We will be using a subset of categorical and numeric features.

In [5]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
            ]
target = 'safe_loans'                    # prediction target (y) (+1 means safe, -1 is risky)
loans = loans[features + [target]]

In [6]:
loans.head()

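| | grade | sub_grade | short_emp | emp_length_num | home_ownership | dti | purpose | term | last_delinq_none | last_major_derog_none | revol_util | total_rec_late_fee | safe_loans |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B | B2 | 0 | 11 | RENT | 27.65 | credit_card | 36 months | 1 | 1 | 83.7 | 0.0 | 1 |
| 1 | C | C4 | 1 | 1 | RENT | 1.0 | car | 60 months | 1 | 1 | 9.4 | 0.0 | -1 |
| 2 | C | C5 | 0 | 11 | RENT | 8.72 | small_business | 36 months | 1 | 1 | 98.5 | 0.0 | 1 |
| 3 | C | C1 | 0 | 11 | RENT | 20.0 | other | 36 months | 0 | 1 | 21.0 | 16.97 | 1 |
| 4 | A | A4 | 0 | 4 | RENT | 11.2 | wedding | 36 months | 1 | 1 | 28.3 | 0.0 | 1 |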


In [7]:
loans.describe()

| | short_emp | emp_length_num | dti | last_delinq_none | last_major_derog_none | revol_util | total_rec_late_fee | safe_loans |
|---|---|---|---|---|---|---|---|---|
| count | 122607.0 | 122607.0 | 122607.0 | 122607.0 | 122607.0 | 122607.0 | 122607.0 | 122607.0 |
| mean | 0.123672 | 6.370256 | 15.496888 | 0.588115 | 0.873906 | 53.716307 | 0.742344 | 0.622371 |
| std | 0.329208 | 3.736014 | 7.497442 | 0.492177 | 0.331957 | 25.723881 | 5.363268 | 0.782726 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 |
| 25% | 0.0 | 3.0 | 9.88 | 0.0 | 1.0 | 34.8 | 0.0 | 1.0 |
| 50% | 0.0 | 6.0 | 15.26 | 1.0 | 1.0 | 55.7 | 0.0 | 1.0 |
| 75% | 0.0 | 11.0 | 20.85 | 1.0 | 1.0 | 74.3 | 0.0 | 1.0 |
| max | 1.0 | 11.0 | 39.88 | 1.0 | 1.0 | 150.7 | 208.82 | 1.0 |


## Data preparation

**Note:** The data are imbalanced, and we will use them as they are in this approach. More advanced methods exist for dealing with class imbalance, but we will not use them to build this classifier.
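
The degree of imbalance can be read off the `loans.describe()` output above. Because the target takes only the values +1 and -1, its mean equals `2p - 1`, where `p` is the share of safe loans. The sketch below just does that arithmetic; the variable names are chosen for this example.

```python
# Mean of safe_loans as reported by loans.describe() above.
mean_safe_loans = 0.622371

# For labels in {+1, -1}: mean = p * (+1) + (1 - p) * (-1) = 2p - 1, so p = (mean + 1) / 2.
share_safe = (mean_safe_loans + 1) / 2
share_risky = 1 - share_safe
print(round(share_safe, 3), round(share_risky, 3))
```

So roughly four out of five loans in the dataset are labeled safe.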

We check every feature for missing data.

In [8]:
loans.isnull().any()

grade                    False
sub_grade                False
short_emp                False
emp_length_num           False
home_ownership           False
dti                      False
purpose                  False
term                     False
last_delinq_none         False
last_major_derog_none    False
revol_util               False
total_rec_late_fee       False
safe_loans               False
dtype: bool
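
A per-column count of missing values can be more informative than a True/False flag. The following is a minimal sketch on a toy frame with deliberately inserted gaps; the values are invented, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (hypothetical data).
toy = pd.DataFrame({'dti': [27.65, np.nan, 8.72],
                    'grade': ['B', 'C', None]})

# Number of missing entries in each column.
missing_per_column = toy.isnull().sum()

# Rows that contain at least one missing entry.
rows_with_missing = toy[toy.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```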

Let's create a dataframe showing basic information about the data. As we can see, there are categorical features that we need to take care of. The basic strategy is to convert each category value into a new column and assign a 1 or 0 value to that column. This approach is called **one hot encoding**. It avoids imposing a misleading numeric weighting on the category values. However, it has the downside of adding more columns to the data set. An example is the sub_grade feature, which has 35 unique values.

In [9]:
df_feat = pd.DataFrame.from_records([(j, loans[j].nunique(), loans[j].dtype) for j in loans.columns],
                                    columns=['feature', 'unique_values', 'dtype_'])
df_feat

| | feature | unique_values | dtype_ |
|---|---|---|---|
| 0 | grade | 7 | object |
| 1 | sub_grade | 35 | object |
| 2 | short_emp | 2 | int64 |
| 3 | emp_length_num | 12 | int64 |
| 4 | home_ownership | 4 | object |
| 5 | dti | 3543 | float64 |
| 6 | purpose | 12 | object |
| 7 | term | 2 | object |
| 8 | last_delinq_none | 2 | int64 |
| 9 | last_major_derog_none | 2 | int64 |


We need to define a list of categorical features.

In [10]:
list_feat = list(df_feat.feature[df_feat.dtype_ == object])
max_val_feat = max(df_feat.unique_values[df_feat.dtype_ == object])
list_feat, max_val_feat

(['grade', 'sub_grade', 'home_ownership', 'purpose', 'term'], 35)

A few words about features with many unique values. A simple approach called **custom binary encoding** treats several values of a feature as equivalent for a specific analysis. For example, if a bathroom has a shower or a bath, we can create a new column that indicates whether or not the bathroom has at least one of them. For the sub_grade values this does not seem to be a good solution.

In [11]:
max_feat = df_feat.feature[(df_feat.dtype_ == object) & (df_feat.unique_values == max_val_feat)].reset_index(drop=True)[0]
max_feat

'sub_grade'

In [12]:
loans[max_feat].value_counts().head(10)

B3    9036
B4    8279
B2    7096
C1    7068
B5    6924
C2    6726
A5    6027
A4    5993
B1    5837
C3    5690
Name: sub_grade, dtype: int64
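
To make the custom binary encoding idea described earlier concrete, here is a minimal sketch of the bathroom example. The column name `bathroom`, its values, and the grouping set are all hypothetical and are not part of the loans data.

```python
import pandas as pd

# Hypothetical housing data, used only to illustrate the idea.
toy = pd.DataFrame({'bathroom': ['shower', 'bath', 'none', 'shower and bath']})

# Values that we choose to treat as equivalent for this analysis (an assumption).
has_wash_facility = {'shower', 'bath', 'shower and bath'}

# One binary column replaces the multi-valued feature.
toy['has_shower_or_bath'] = toy['bathroom'].isin(has_wash_facility).astype(int)
print(toy)
```

The price of this compactness is that the distinction between the grouped values is lost, which is why it looks unsuitable for sub_grade, where each value carries a different risk level.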

We need to transform the categorical features into dummy variables. Using pandas for this is an alternative to the sklearn.preprocessing methods.

In [13]:
loans = pd.get_dummies(loans, columns=list_feat, prefix=list_feat, drop_first=True)
loans.columns

Index(['short_emp', 'emp_length_num', 'dti', 'last_delinq_none',
       'last_major_derog_none', 'revol_util', 'total_rec_late_fee',
       'safe_loans', 'grade_B', 'grade_C', 'grade_D', 'grade_E', 'grade_F',
       'grade_G', 'sub_grade_A2', 'sub_grade_A3', 'sub_grade_A4',
       'sub_grade_A5', 'sub_grade_B1', 'sub_grade_B2', 'sub_grade_B3',
       'sub_grade_B4', 'sub_grade_B5', 'sub_grade_C1', 'sub_grade_C2',
       'sub_grade_C3', 'sub_grade_C4', 'sub_grade_C5', 'sub_grade_D1',
       'sub_grade_D2', 'sub_grade_D3', 'sub_grade_D4', 'sub_grade_D5',
       'sub_grade_E1', 'sub_grade_E2', 'sub_grade_E3', 'sub_grade_E4',
       'sub_grade_E5', 'sub_grade_F1', 'sub_grade_F2', 'sub_grade_F3',
       'sub_grade_F4', 'sub_grade_F5', 'sub_grade_G1', 'sub_grade_G2',
       'sub_grade_G3', 'sub_grade_G4', 'sub_grade_G5', 'home_ownership_OTHER',
       'home_ownership_OWN', 'home_ownership_RENT', 'purpose_credit_card',
       'purpose_debt_consolidation', 'purpose_home_improvement',
       'p
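
To see what `pd.get_dummies` with `drop_first=True` does on a small scale, consider the sketch below. The four rows are made up; only the column names `grade` and `dti` mirror the dataset.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (hypothetical values).
toy = pd.DataFrame({'grade': ['A', 'B', 'C', 'A'],
                    'dti': [10.0, 20.0, 15.0, 5.0]})

# drop_first=True drops the first category ('A'), which becomes the implicit baseline:
# a row with grade_B == 0 and grade_C == 0 has grade 'A'.
encoded = pd.get_dummies(toy, columns=['grade'], prefix=['grade'], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

This is why the column list above contains `grade_B` through `grade_G` but no `grade_A`.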

## Splitting into training and testing datasets

In [14]:
X = loans.drop('safe_loans', axis=1)
y = loans['safe_loans']

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
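
The split above is random, so the numbers reported later will vary slightly from run to run. A possible variation, sketched here on toy data, fixes the seed with `random_state` and keeps the class proportions equal in both parts with `stratify`. Neither argument is used in the original cell; the seed value 42 and the toy data are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 80 safe (+1) and 20 risky (-1) loans (hypothetical values).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({'dti': rng.uniform(0, 40, size=100)})
y_toy = pd.Series([1] * 80 + [-1] * 20)

# random_state makes the split repeatable; stratify preserves the 80/20 class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

print(len(X_tr), len(X_te))
print(y_te.value_counts())
```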

## Training a Decision Tree classifier

In [16]:
decision_tree_model = DecisionTreeClassifier()

In [17]:
decision_tree_model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

## Predictions and Evaluation

In [18]:
predictions = decision_tree_model.predict(X_test)

In [19]:
print(classification_report(y_test, predictions))

             precision    recall  f1-score   support

         -1       0.27      0.29      0.28      4671
          1       0.83      0.81      0.82     19851

avg / total       0.72      0.71      0.72     24522



In [20]:
print(confusion_matrix(y_test, predictions))

[[ 1361  3310]
 [ 3767 16084]]
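
The figures in the classification report can be recomputed from this confusion matrix. The sketch below assumes the scikit-learn convention that rows are true labels and columns are predicted labels, both in sorted order (-1, then 1); the variable names are chosen for this example.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true, columns = predicted.
cm = np.array([[1361, 3310],
               [3767, 16084]])

# Risky class (-1): first row / first column.
precision_risky = cm[0, 0] / cm[:, 0].sum()
recall_risky = cm[0, 0] / cm[0, :].sum()

# Safe class (+1): second row / second column.
precision_safe = cm[1, 1] / cm[:, 1].sum()
recall_safe = cm[1, 1] / cm[1, :].sum()

# Overall accuracy, and the accuracy of always predicting "safe" on this test set.
accuracy = np.trace(cm) / cm.sum()
majority_baseline = cm[1, :].sum() / cm.sum()

print(round(precision_risky, 2), round(recall_risky, 2))
print(round(precision_safe, 2), round(recall_safe, 2))
print(round(accuracy, 3), round(majority_baseline, 3))
```

The rounded precision and recall values match the report (0.27 and 0.29 for risky loans, 0.83 and 0.81 for safe loans). The overall accuracy is about 0.711, which is lower than the roughly 0.81 obtained by labeling every test loan as safe, so the untuned tree does not yet beat that trivial baseline.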
