## Decision Tree - Example
### Problem: Predicting risky bank loans using C5.0 decision trees

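The `default` column indicates whether the loan applicant failed to meet the agreed payment terms and went into default. In this dataset, 30 percent of the loans went into default. The goal is to train a model that predicts such defaulters. Note that the code below uses scikit-learn's `DecisionTreeClassifier`, which implements an optimized version of CART rather than C5.0; setting `criterion='entropy'` brings it closer in spirit to C5.0's entropy-based splitting.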

### Data: 
1. checking_balance        - object
2. months_loan_duration     - int64
3. credit_history          - object
4. purpose                 - object
5. amount                   - int64
6. savings_balance         - object
7. employment_length       - object
8. installment_rate         - int64
9. personal_status         - object
10. other_debtors           - object
11. residence_history        - int64
12. property                - object
13. age                      - int64
14. installment_plan        - object
15. housing                 - object
16. existing_credits         - int64
17. job                     - object
18. dependents               - int64
19. telephone               - object
20. foreign_worker          - object
21. default                  - int64 (Target variable/Label)

a) default = 1 --> Normal Customer

b) default = 2 --> Risky Customer/Probable Defaulter

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, accuracy_score
pd.set_option('display.max_columns',30)

In [2]:
credit = pd.read_csv('Data/credit.csv')

In [3]:
credit.head()

Unnamed: 0,checking_balance,months_loan_duration,credit_history,purpose,amount,savings_balance,employment_length,installment_rate,personal_status,other_debtors,residence_history,property,age,installment_plan,housing,existing_credits,job,dependents,telephone,foreign_worker,default
0,< 0 DM,6,critical,radio/tv,1169,unknown,> 7 yrs,4,single male,none,4,real estate,67,none,own,2,skilled employee,1,yes,yes,1
1,1 - 200 DM,48,repaid,radio/tv,5951,< 100 DM,1 - 4 yrs,2,female,none,2,real estate,22,none,own,1,skilled employee,1,none,yes,2
2,unknown,12,critical,education,2096,< 100 DM,4 - 7 yrs,2,single male,none,3,real estate,49,none,own,1,unskilled resident,2,none,yes,1
3,< 0 DM,42,repaid,furniture,7882,< 100 DM,4 - 7 yrs,2,single male,guarantor,4,building society savings,45,none,for free,1,skilled employee,2,none,yes,1
4,< 0 DM,24,delayed,car (new),4870,< 100 DM,1 - 4 yrs,3,single male,none,4,unknown/none,53,none,for free,2,skilled employee,2,none,yes,2


In [4]:
credit.dtypes 

checking_balance        object
months_loan_duration     int64
credit_history          object
purpose                 object
amount                   int64
savings_balance         object
employment_length       object
installment_rate         int64
personal_status         object
other_debtors           object
residence_history        int64
property                object
age                      int64
installment_plan        object
housing                 object
existing_credits         int64
job                     object
dependents               int64
telephone               object
foreign_worker          object
default                  int64
dtype: object

In [5]:
credit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   checking_balance      1000 non-null   object
 1   months_loan_duration  1000 non-null   int64 
 2   credit_history        1000 non-null   object
 3   purpose               1000 non-null   object
 4   amount                1000 non-null   int64 
 5   savings_balance       1000 non-null   object
 6   employment_length     1000 non-null   object
 7   installment_rate      1000 non-null   int64 
 8   personal_status       1000 non-null   object
 9   other_debtors         1000 non-null   object
 10  residence_history     1000 non-null   int64 
 11  property              1000 non-null   object
 12  age                   1000 non-null   int64 
 13  installment_plan      1000 non-null   object
 14  housing               1000 non-null   object
 15  existing_credits      1000 non-null   i

In [6]:
credit.describe()

Unnamed: 0,months_loan_duration,amount,installment_rate,residence_history,age,existing_credits,dependents,default
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,20.903,3271.258,2.973,2.845,35.546,1.407,1.155,1.3
std,12.058814,2822.736876,1.118715,1.103718,11.375469,0.577654,0.362086,0.458487
min,4.0,250.0,1.0,1.0,19.0,1.0,1.0,1.0
25%,12.0,1365.5,2.0,2.0,27.0,1.0,1.0,1.0
50%,18.0,2319.5,3.0,3.0,33.0,1.0,1.0,1.0
75%,24.0,3972.25,4.0,4.0,42.0,2.0,1.0,2.0
max,72.0,18424.0,4.0,4.0,75.0,4.0,2.0,2.0


In [7]:
# Check for missing (NA) values in each column
credit.isnull().sum()

checking_balance        0
months_loan_duration    0
credit_history          0
purpose                 0
amount                  0
savings_balance         0
employment_length       0
installment_rate        0
personal_status         0
other_debtors           0
residence_history       0
property                0
age                     0
installment_plan        0
housing                 0
existing_credits        0
job                     0
dependents              0
telephone               0
foreign_worker          0
default                 0
dtype: int64

In [9]:
# Unique values in checking_balance and how often each occurs
credit.checking_balance.value_counts()

unknown       394
< 0 DM        274
1 - 200 DM    269
> 200 DM       63
Name: checking_balance, dtype: int64

In [10]:
# Unique values in savings_balance and how often each occurs
credit['savings_balance'].value_counts()

< 100 DM         603
unknown          183
101 - 500 DM     103
501 - 1000 DM     63
> 1000 DM         48
Name: savings_balance, dtype: int64

In [11]:
# Count the unique values in each column to help identify the categorical columns.
# Usually the data description states which columns are categorical and which are continuous.
for i in credit.columns:
    print(i,credit[i].nunique())

checking_balance 4
months_loan_duration 33
credit_history 5
purpose 10
amount 921
savings_balance 5
employment_length 5
installment_rate 4
personal_status 4
other_debtors 3
residence_history 4
property 4
age 53
installment_plan 3
housing 3
existing_credits 4
job 4
dependents 2
telephone 2
foreign_worker 2
default 2


In [12]:
# The following categorical (string) columns are to be converted into numeric codes
categorical_cols = ['checking_balance','credit_history','purpose','savings_balance','employment_length','personal_status','other_debtors','property','installment_plan','housing', 'job', 'telephone', 'foreign_worker']
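As an alternative to typing the list by hand, the string columns can be picked out by dtype before any encoding is applied. The sketch below uses a small toy frame built from a few values shown in `credit.head()`; the names `toy` and `detected_cols` are ours and only for illustration.

```python
import pandas as pd

# Toy frame with a few values taken from the rows shown earlier
toy = pd.DataFrame({
    "checking_balance": ["< 0 DM", "1 - 200 DM", "< 0 DM"],
    "months_loan_duration": [6, 48, 42],
    "housing": ["own", "own", "for free"],
    "default": [1, 2, 1],
})

# Every non-numeric column is treated as categorical
detected_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(detected_cols)
```

This only works while the columns are still strings, so it has to run before the encoding step.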

In [13]:
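# LabelEncoder converts categorical string columns to integer codes.
# Codes are assigned in sorted order of the values, so they carry no real ranking.

le = LabelEncoder()
for col in categorical_cols:
    # Refit the encoder on each column and overwrite that column in the dataframe.
    # The same `le` object is reused, so only the last column's mapping is kept.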
    credit[col] = le.fit_transform(credit[col])
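If the code-to-category mapping needs to be recovered later, one option is scikit-learn's `OrdinalEncoder`, which is designed for feature columns and stores the categories of every column it was fitted on. This is a minimal sketch on toy data, not the notebook's method; `toy`, `encoded`, `mapping`, and `decoded` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame using category values that appear in the dataset
toy = pd.DataFrame({
    "checking_balance": ["< 0 DM", "1 - 200 DM", "unknown", "> 200 DM"],
    "housing": ["own", "own", "for free", "own"],
})

encoder = OrdinalEncoder()
encoded = pd.DataFrame(
    encoder.fit_transform(toy),      # one integer code per category, per column
    columns=toy.columns,
    index=toy.index,
).astype(int)

# categories_ holds the sorted categories of each column; position = code
mapping = {col: cats.tolist() for col, cats in zip(toy.columns, encoder.categories_)}
print(mapping["checking_balance"])
print(encoded)

# The fitted encoder can map the codes back to the original strings
decoded = encoder.inverse_transform(encoded)
```

The codes follow the sorted order of the strings, which matches the codes `LabelEncoder` produced above (for example `1 - 200 DM` becomes 0 and `< 0 DM` becomes 1).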

In [14]:
credit.head(10)

Unnamed: 0,checking_balance,months_loan_duration,credit_history,purpose,amount,savings_balance,employment_length,installment_rate,personal_status,other_debtors,residence_history,property,age,installment_plan,housing,existing_credits,job,dependents,telephone,foreign_worker,default
0,1,6,0,7,1169,4,3,4,3,2,4,2,67,1,1,2,1,1,1,1,1
1,0,48,4,7,5951,2,1,2,1,2,2,2,22,1,1,1,1,1,0,1,2
2,3,12,0,4,2096,2,2,2,3,2,3,2,49,1,1,1,3,2,0,1,1
3,1,42,4,5,7882,2,2,2,3,1,4,0,45,1,0,1,1,2,0,1,1
4,1,24,1,1,4870,2,1,3,3,2,4,3,53,1,0,2,1,2,0,1,2
5,3,36,4,4,9055,4,1,2,3,2,4,3,35,1,0,1,3,2,1,1,1
6,3,24,4,5,2835,1,3,3,3,2,4,0,53,1,1,1,1,1,0,1,1
7,0,36,4,2,6948,2,1,2,3,2,2,1,35,1,2,1,0,1,1,1,1
8,3,12,4,7,3059,3,2,2,0,2,4,2,61,1,1,1,3,1,0,1,1
9,0,30,0,1,5234,2,4,4,2,2,2,1,28,1,1,2,0,1,0,1,2


## Splitting the data into train and test sets

In [15]:
credit.shape

(1000, 21)

In [16]:
# Train Data - Selecting 900 rows at random from the dataframe for training
credit_train = credit.sample(900, random_state = 123)

In [17]:
# Test Data - Taking remaining 100 rows for testing by dropping the rows present in train dataframe from original dataframe.
credit_test = credit.drop(credit_train.index)

In [18]:
# Check whether this appears to be a fairly even split or not,
# train should have about 30 percent of defaulted loans 
# and test data also should have similar % of default loans
(credit.default.value_counts()/credit.default.count())*100

1    70.0
2    30.0
Name: default, dtype: float64

In [19]:
# Train data - Ratio of normal and risky customers
(credit_train.default.value_counts()/credit_train.default.count())*100

1    69.777778
2    30.222222
Name: default, dtype: float64

In [20]:
# Test data - Ratio of normal and risky customers
(credit_test.default.value_counts()/credit_test.default.count())*100

1    72.0
2    28.0
Name: default, dtype: float64
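The random sample above happens to land close to the 70/30 mix, but it is not guaranteed to. A stratified split enforces the class proportions in both parts. The sketch below shows this with `train_test_split` on a synthetic frame that has the same size and class mix as the dataset; the frame `toy` and its `amount` values are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 70% class 1 (normal) and 30% class 2 (risky)
toy = pd.DataFrame({
    "amount": range(1000),
    "default": [1] * 700 + [2] * 300,
})

# Hold out 100 rows, keeping the class proportions of `default` in both parts
train, test = train_test_split(
    toy, test_size=100, stratify=toy["default"], random_state=123
)

train_share = train["default"].value_counts(normalize=True) * 100
test_share = test["default"].value_counts(normalize=True) * 100
print(train_share)
print(test_share)
```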

In [21]:
# Keep the labels in separate objects
train_labels = credit_train.default
test_labels = credit_test.default

## Training the Model

In [22]:
# Creating object of the DT with required options 
clf = DecisionTreeClassifier(criterion='entropy')
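To give a feel for what `criterion='entropy'` measures, the sketch below computes the Shannon entropy of the class mix at the root node, using the 70/30 proportions reported earlier. The helper `entropy` is our own illustration and not part of scikit-learn; the tree prefers splits that reduce this quantity the most.

```python
from math import log2

def entropy(proportions):
    """Shannon entropy (in bits) of a class distribution given as proportions."""
    return sum(-p * log2(p) for p in proportions if p > 0)

# Class mix of the full dataset: 70% normal, 30% risky
root_entropy = entropy([0.7, 0.3])
print(round(root_entropy, 4))

# Reference points: a pure node has zero entropy, a 50/50 node has 1 bit
pure_entropy = entropy([1.0])
even_entropy = entropy([0.5, 0.5])
print(pure_entropy, even_entropy)
```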

In [23]:
# Train the model on the training features (every column except the last one, `default`)
clf.fit(credit_train.iloc[:,:-1],train_labels)

In [24]:
# Make predictions on test data
predictions = clf.predict(credit_test.iloc[:,:-1])

In [25]:
predictions

array([1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1,
       2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 2,
       2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2,
       1, 1, 2, 1, 1, 2, 2, 1, 2, 1, 2, 1])

## Evaluating the Model

In [27]:
# Confusion matrix: rows are actual classes, columns are predicted classes (label order 1, 2)
confusion_matrix(test_labels,predictions)

array([[57, 15],
       [19,  9]])

In [28]:
accuracy_score(test_labels,predictions)*100

66.0
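The accuracy can be read straight off the confusion matrix, and it is worth comparing with a trivial baseline. The sketch below redoes the arithmetic in plain Python using the matrix printed above; the "always predict normal" baseline is our addition for comparison.

```python
# Confusion matrix from the cell above: rows = actual, columns = predicted, labels [1, 2]
cm = [[57, 15],
      [19, 9]]

total = sum(sum(row) for row in cm)

# Correct predictions sit on the diagonal
accuracy = (cm[0][0] + cm[1][1]) / total

# Baseline: always predict the majority class (1 = normal customer)
baseline_accuracy = sum(cm[0]) / total

print(f"model accuracy:    {accuracy:.2%}")
print(f"majority baseline: {baseline_accuracy:.2%}")
```

On this test set the unpruned tree scores below the majority-class baseline, which is one reason accuracy alone is a weak measure for imbalanced classes.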

### Precision Recall and F1-Score

In [29]:

from sklearn.metrics import precision_score, recall_score, f1_score

In [30]:
# Precision (P); with the default pos_label=1 this scores class 1 (normal customers)
precision_score(test_labels,predictions)

0.75

In [31]:
# Recall (R); with the default pos_label=1 this scores class 1 (normal customers)
recall_score(test_labels,predictions)

0.7916666666666666

In [32]:
# F1-Score; with the default pos_label=1 this scores class 1 (normal customers)
f1_score(test_labels,predictions)

0.7702702702702704
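Because the labels are 1 and 2, scikit-learn's default `pos_label=1` makes the scores above describe the normal customers, not the defaulters the problem statement cares about. The sketch below recomputes precision, recall, and F1 for both classes from the confusion matrix in plain Python; `class_metrics` is an illustrative helper of ours. Passing `pos_label=2` to the scikit-learn functions should yield the class 2 figures.

```python
# Confusion matrix from the evaluation above: rows = actual, columns = predicted
cm = [[57, 15],
      [19, 9]]
labels = [1, 2]

def class_metrics(cm, k):
    """Precision, recall and F1 when class index k is treated as the positive class."""
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm)) if i != k)  # predicted k, actually another class
    fn = sum(cm[k][j] for j in range(len(cm)) if j != k)  # actually k, predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

metrics = {label: class_metrics(cm, k) for k, label in enumerate(labels)}
for label, (p, r, f1) in metrics.items():
    print(f"class {label}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

For class 2 (risky customers) the model finds only 9 of the 28 actual defaulters in the test set.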

In [33]:
# AUC, computed here from the hard class predictions rather than predicted probabilities
from sklearn.metrics import roc_auc_score
roc_auc_score(test_labels,predictions)

0.5565476190476191
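When `roc_auc_score` receives hard class predictions, the ROC curve has a single interior point, and the area under it reduces to the mean of the true positive rate and the true negative rate (balanced accuracy). The sketch below checks this against the confusion matrix above, treating class 2 as positive, as scikit-learn does for the larger label.

```python
# Confusion matrix from the evaluation above: rows = actual, columns = predicted
cm = [[57, 15],
      [19, 9]]

tnr = cm[0][0] / sum(cm[0])   # class 1 (negative) correctly predicted: 57 / 72
tpr = cm[1][1] / sum(cm[1])   # class 2 (positive) correctly predicted: 9 / 28

# Area under a ROC curve passing through (0, 0), (1 - tnr, tpr), (1, 1)
auc_from_labels = (tpr + tnr) / 2
print(auc_from_labels)
```

A ranking-based AUC would normally be computed from scores such as `clf.predict_proba(...)[:, 1]`, although a fully grown tree tends to output probabilities of 0 or 1, so the difference may be small here.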

### Cross Validation


In [35]:
# simplify names
y_tr = train_labels

In [36]:
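# k-fold cross validation on the feature columns only.
# Passing the full credit_train frame would include `default` among the features (target leakage);
# the perfect scores recorded in the cells below came from such a run.
from sklearn.model_selection import cross_val_predict
y_pr = cross_val_predict(clf, credit_train.iloc[:,:-1], train_labels, cv=5)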

In [37]:
confusion_matrix(y_tr, y_pr)

array([[628,   0],
       [  0, 272]])

In [38]:
# Precision (P)
precision_score(y_tr, y_pr)

1.0

In [39]:
# Recall (R)
recall_score(y_tr, y_pr)

1.0

In [40]:
# F1-Score
f1_score(y_tr, y_pr)

1.0
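Scores of exactly 1.0 under cross validation are a warning sign that the label has leaked into the features. The sketch below reproduces the effect on synthetic data where the features are pure noise: adding the label as an extra column makes the tree look perfect, while the honest feature matrix does not. All names and data here are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Labels with roughly the dataset's 70/30 mix; features carry no signal at all
y = rng.choice([1, 2], size=300, p=[0.7, 0.3])
X = rng.normal(size=(300, 5))

# Leaky variant: the label itself is appended as a sixth "feature"
X_leaky = np.column_stack([X, y])

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
acc_clean = accuracy_score(y, cross_val_predict(clf, X, y, cv=5))
acc_leaky = accuracy_score(y, cross_val_predict(clf, X_leaky, y, cv=5))

print(f"features only:       {acc_clean:.3f}")
print(f"label among features: {acc_leaky:.3f}")
```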

### Hyper Parameter Tuning

In [42]:
from sklearn.model_selection import GridSearchCV

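# Hyperparameters to tune using grid search - 8 x 3 = 24 combinations in total,
# i.e. 24 candidate models, each fitted on 3 folds
params = {'max_leaf_nodes': list(range(2, 10)), # 8 values
          'min_samples_split': [2, 3, 4]} # 3 values

grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42), 
                              params, 
                              verbose=1, 
                              cv=3)

# Fit on the feature columns only. Fitting on the full credit_train frame leaks `default`
# into the features; the outputs recorded below (best tree with a single split,
# test accuracy of 1.0) came from such a run.
grid_search_cv.fit(credit_train.iloc[:,:-1], train_labels)

Fitting 3 folds for each of 24 candidates, totalling 72 fits


In [43]:
grid_search_cv.best_params_

{'max_leaf_nodes': 2, 'min_samples_split': 2}

In [44]:
grid_search_cv.best_estimator_

In [45]:
# accuracy_score was already imported in the first cell;
# predict on the test features only, matching the columns used for fitting
y_pred = grid_search_cv.predict(credit_test.iloc[:,:-1])
accuracy_score(test_labels, y_pred)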

1.0
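For comparison, here is a hedged sketch of the same grid search run on a feature matrix that does not contain the label, followed by scoring on a held-out set. The data comes from `make_classification` and only stands in for the credit data, so the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with a roughly 70/30 class mix
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Same grid as above: 8 x 3 = 24 candidates
params = {"max_leaf_nodes": list(range(2, 10)),
          "min_samples_split": [2, 3, 4]}

search = GridSearchCV(DecisionTreeClassifier(random_state=42), params, cv=3)
search.fit(X_train, y_train)            # features only, labels passed separately

test_accuracy = search.score(X_test, y_test)
print(search.best_params_)
print(round(search.best_score_, 3))     # mean cross-validated accuracy of the best candidate
print(round(test_accuracy, 3))          # accuracy on the held-out rows
```

By default `GridSearchCV` refits the best candidate on the whole training set, which is why `search.score` and `search.predict` can be called directly.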