<a href="https://colab.research.google.com/github/rahiakela/machine-learning-research-and-practice/blob/main/machine-learning-bookcamp/6-ensemble-learning/04_credit_risk_scoring_final_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Credit risk scoring project: Final model

Imagine that we work at a bank. When we receive a loan application, we need to make
sure that if we give the money, the customer will be able to pay it back. Every application
carries a risk of default — the failure to return the money.

Credit risk scoring is a binary classification problem: the target is positive (“1”) if the
customer defaults and negative (“0”) otherwise.

We will use machine learning to calculate the risk of
default. The plan for the project is the following:

* We will train a decision tree model for predicting the probability of default.
* Then we combine multiple decision trees into one model: a random forest.
* Finally, we explore a different way of combining decision trees: gradient boosting (XGBoost).

##Setup

In [None]:
import pandas as pd
import numpy as np
import pickle 
import requests

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import export_graphviz, export_text
from graphviz import Source

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

from sklearn.model_selection import train_test_split

from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

In [None]:
!wget https://github.com/rahiakela/machine-learning-research-and-practice/raw/main/machine-learning-bookcamp/6-ensemble-learning/credit_scoring.csv

##Dataset

In [None]:
# let’s read our dataset
data_df = pd.read_csv("credit_scoring.csv")
print(len(data_df))
data_df.head()

4455


Unnamed: 0,Status,Seniority,Home,Time,Age,Marital,Records,Job,Expenses,Income,Assets,Debt,Amount,Price
0,1,9,1,60,30,2,1,3,73,129,0,0,800,846
1,1,17,1,60,58,3,1,1,48,131,0,0,1000,1658
2,2,10,2,36,46,2,2,3,90,200,3000,0,2000,2985
3,1,0,1,60,24,1,1,1,63,182,2500,0,900,1325
4,1,0,1,36,26,1,1,1,46,107,0,0,310,910


##Data cleaning

In [None]:
# let’s lowercase all the column names
data_df.columns = data_df.columns.str.lower()
data_df.head()

Unnamed: 0,status,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,1,9,1,60,30,2,1,3,73,129,0,0,800,846
1,1,17,1,60,58,3,1,1,48,131,0,0,1000,1658
2,2,10,2,36,46,2,2,3,90,200,3000,0,2000,2985
3,1,0,1,60,24,1,1,1,63,182,2500,0,900,1325
4,1,0,1,36,26,1,1,1,46,107,0,0,310,910


In [None]:
# Let’s handle the categorical column
data_df.status.value_counts()

1    3200
2    1254
0       1
Name: status, dtype: int64

In [None]:
status_values = {
  1: "ok", 
  2: "default", 
  0: "unk"
}
data_df.status = data_df.status.map(status_values)
data_df.head()

Unnamed: 0,status,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,ok,9,1,60,30,2,1,3,73,129,0,0,800,846
1,ok,17,1,60,58,3,1,1,48,131,0,0,1000,1658
2,default,10,2,36,46,2,2,3,90,200,3000,0,2000,2985
3,ok,0,1,60,24,1,1,1,63,182,2500,0,900,1325
4,ok,0,1,36,26,1,1,1,46,107,0,0,310,910


In [None]:
data_df.home.value_counts()

2    2107
1     973
5     783
6     319
3     247
4      20
0       6
Name: home, dtype: int64

In [None]:
home_values = {
    1: 'rent',
    2: 'owner',
    3: 'private',
    4: 'ignore',
    5: 'parents',
    6: 'other',
    0: 'unk'
}
data_df.home = data_df.home.map(home_values)

In [None]:
data_df.marital.value_counts()

2    3241
1     978
4     130
3      67
5      38
0       1
Name: marital, dtype: int64

In [None]:
marital_values = {
    1: 'single',
    2: 'married',
    3: 'widow',
    4: 'separated',
    5: 'divorced',
    0: 'unk'
}
data_df.marital = data_df.marital.map(marital_values)

In [None]:
data_df.records.value_counts()

1    3682
2     773
Name: records, dtype: int64

In [None]:
records_values = {
    1: 'no',
    2: 'yes',
    0: 'unk'
}
data_df.records = data_df.records.map(records_values)

In [None]:
data_df.job.value_counts()

1    2806
3    1024
2     452
4     171
0       2
Name: job, dtype: int64

In [None]:
job_values = {
    1: 'fixed',
    2: 'partime',
    3: 'freelance',
    4: 'others',
    0: 'unk'
}
data_df.job = data_df.job.map(job_values)

In [None]:
data_df.head()

Unnamed: 0,status,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,ok,9,rent,60,30,married,no,freelance,73,129,0,0,800,846
1,ok,17,rent,60,58,widow,no,fixed,48,131,0,0,1000,1658
2,default,10,owner,36,46,married,yes,freelance,90,200,3000,0,2000,2985
3,ok,0,rent,60,24,single,no,fixed,63,182,2500,0,900,1325
4,ok,0,rent,36,26,single,no,fixed,46,107,0,0,310,910


In [None]:
# let’s check the summary statistics for each of the columns
data_df.describe().round()

Unnamed: 0,seniority,time,age,expenses,income,assets,debt,amount,price
count,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0,4455.0
mean,8.0,46.0,37.0,56.0,763317.0,1060341.0,404382.0,1039.0,1463.0
std,8.0,15.0,11.0,20.0,8703625.0,10217569.0,6344253.0,475.0,628.0
min,0.0,6.0,18.0,35.0,0.0,0.0,0.0,100.0,105.0
25%,2.0,36.0,28.0,35.0,80.0,0.0,0.0,700.0,1118.0
50%,5.0,48.0,36.0,51.0,120.0,3500.0,0.0,1000.0,1400.0
75%,12.0,60.0,45.0,72.0,166.0,6000.0,0.0,1300.0,1692.0
max,48.0,72.0,68.0,180.0,99999999.0,99999999.0,99999999.0,5000.0,11140.0


In [None]:
# Let’s replace this big number with NaN for these columns
for c in ["income", "assets", "debt"]:
  data_df[c] = data_df[c].replace(to_replace=99999999, value=np.nan) 

In [None]:
data_df.isnull().sum()

status        0
seniority     0
home          0
time          0
age           0
marital       0
records       0
job           0
expenses      0
income       34
assets       47
debt         18
amount        0
price         0
dtype: int64

In [None]:
data_df.describe().round()

Unnamed: 0,seniority,time,age,expenses,income,assets,debt,amount,price
count,4455.0,4455.0,4455.0,4455.0,4421.0,4408.0,4437.0,4455.0,4455.0
mean,8.0,46.0,37.0,56.0,131.0,5403.0,343.0,1039.0,1463.0
std,8.0,15.0,11.0,20.0,86.0,11573.0,1246.0,475.0,628.0
min,0.0,6.0,18.0,35.0,0.0,0.0,0.0,100.0,105.0
25%,2.0,36.0,28.0,35.0,80.0,0.0,0.0,700.0,1118.0
50%,5.0,48.0,36.0,51.0,120.0,3000.0,0.0,1000.0,1400.0
75%,12.0,60.0,45.0,72.0,165.0,6000.0,0.0,1300.0,1692.0
max,48.0,72.0,68.0,180.0,959.0,300000.0,30000.0,5000.0,11140.0
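The drop in the mean of `income`, `assets` and `debt` between the two summaries comes from removing the `99999999` placeholder. A minimal sketch of the effect, using hypothetical toy values (not rows from `credit_scoring.csv`) and only the standard library:

```python
from statistics import mean

SENTINEL = 99999999  # the placeholder value visible in the "max" row of the first summary

# Hypothetical toy incomes, chosen only to illustrate the effect
incomes = [80, 120, 166, SENTINEL]

# A single placeholder dominates the average...
mean_with_sentinel = mean(incomes)

# ...while skipping it (which is what NaN does in pandas statistics) gives a sensible value
mean_without_sentinel = mean(v for v in incomes if v != SENTINEL)

print(mean_with_sentinel)     # 25000091.25
print(mean_without_sentinel)  # 122
```

pandas skips NaN when computing `describe()`, which is why the `count` row also drops for these three columns.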


In [None]:
# let’s look at our target variable status
data_df.status.value_counts()

ok         3200
default    1254
unk           1
Name: status, dtype: int64

In [None]:
# this row is not useful, so let’s remove it
data_df = data_df[data_df.status != "unk"]

In [None]:
data_df.status.value_counts()

ok         3200
default    1254
Name: status, dtype: int64

##Dataset preparation

In [None]:
# Let’s start by splitting the data
df_train_full, df_test = train_test_split(data_df, test_size=0.2, random_state=11)
df_train, df_val = train_test_split(df_train_full, test_size=0.25, random_state=11)

In [None]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [None]:
len(df_train), len(df_val), len(df_test)

(2672, 891, 891)
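The two calls give a 60%/20%/20% split: 20% is held out for testing, and 25% of the remaining 80% is another 20% of the whole. The sketch below reproduces the sizes with plain arithmetic, assuming the held-out part is rounded up (an assumption that matches the numbers printed above); `split_sizes` is a hypothetical helper, not a Scikit-Learn function.

```python
import math

def split_sizes(n_rows, held_out_fraction):
    """Return (n_remaining, n_held_out), rounding the held-out part up."""
    n_held_out = math.ceil(n_rows * held_out_fraction)
    return n_rows - n_held_out, n_held_out

n_rows = 4454  # 4455 rows minus the single "unk" row removed earlier

# First split: 20% of everything goes to the test set
n_train_full, n_test = split_sizes(n_rows, 0.2)

# Second split: 25% of the remaining rows go to the validation set
n_train, n_val = split_sizes(n_train_full, 0.25)

print(n_train, n_val, n_test)  # 2672 891 891
```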

In [None]:
# let's convert label to 0 and 1
y_train = (df_train.status == "default").astype(int).values
y_val = (df_val.status == "default").astype(int).values
y_test = (df_test.status == "default").astype(int).values

In [None]:
# Now we need to remove status from the DataFrames.
del df_train['status']
del df_val['status']
del df_test['status']

In [None]:
df_train

Unnamed: 0,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,10,owner,36,36,married,no,freelance,75,0.0,10000.0,0.0,1000,1400
1,6,parents,48,32,single,yes,fixed,35,85.0,0.0,0.0,1100,1330
2,1,parents,48,40,married,no,fixed,75,121.0,0.0,0.0,1320,1600
3,1,parents,48,23,single,no,partime,35,72.0,0.0,0.0,1078,1079
4,5,owner,36,46,married,no,freelance,60,100.0,4000.0,0.0,1100,1897
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2667,18,private,36,45,married,no,fixed,45,220.0,20000.0,0.0,800,1600
2668,7,private,60,29,married,no,fixed,60,51.0,3500.0,500.0,1000,1290
2669,1,parents,24,19,single,no,fixed,35,28.0,0.0,0.0,400,600
2670,15,owner,48,43,married,no,freelance,60,100.0,18000.0,0.0,2500,2976


##Selecting the final model

* Choosing between xgboost, random forest and decision tree
* Training the final model
* Saving the model

In [None]:
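# Let's train a decision tree. x_train, x_val and dv are not defined above; assumed preprocessing: fill NaNs with 0, then one-hot encode
dv = DictVectorizer(sparse=False)
x_train = dv.fit_transform(df_train.fillna(0).to_dict(orient="records"))
x_val = dv.transform(df_val.fillna(0).to_dict(orient="records"))
dt = DecisionTreeClassifier(max_depth=6, min_samples_leaf=15)
dt.fit(x_train, y_train)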

In [None]:
y_pred = dt.predict_proba(x_val)[:, 1]
roc_auc_score(y_val, y_pred)

0.7850954203095104
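One way to read this number: AUC is the probability that a randomly chosen defaulting customer gets a higher score than a randomly chosen non-defaulting one. The sketch below computes it by brute force over all positive/negative pairs. `pairwise_auc` is a hypothetical helper written for illustration, not how `roc_auc_score` is implemented, and the toy labels and scores are made up.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: 1 = default, 0 = ok
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, y_score)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

The double loop is quadratic in the number of rows, so it is only meant for building intuition on small inputs.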

In [None]:
# Let's train random forest
rf = RandomForestClassifier(n_estimators=200,
                            max_depth=10,
                            min_samples_leaf=3,
                            random_state=1)
rf.fit(x_train, y_train)

In [None]:
y_pred = rf.predict_proba(x_val)[:, 1]
roc_auc_score(y_val, y_pred)

0.8246258264512848

In [None]:
# Let's train XGBoost
xgb_params = {
  "eta": 0.1,
  "max_depth": 3,
  "min_child_weight": 1, 

  "objective": "binary:logistic",
  "eval_metric": "auc",

  "nthread": 8,

  "seed": 1,
  "verbosity": 1
}

# Let's wrap data into DMatrix
features = dv.get_feature_names_out()

d_train = xgb.DMatrix(x_train, label=y_train, feature_names=features)
d_val = xgb.DMatrix(x_val, label=y_val, feature_names=features)

model = xgb.train(xgb_params, d_train, num_boost_round=175)

In [None]:
y_pred = model.predict(d_val)
roc_auc_score(y_val, y_pred)

0.8360387251459157
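With `num_boost_round=175`, trees are added one at a time, and `eta` scales how much each new tree contributes. The sketch below shows that sequential idea in its simplest form: one-split "stumps" fitted to the residuals of a squared-error regression. It is a toy illustration with made-up data and hypothetical helper names, not XGBoost's actual algorithm (which, for `binary:logistic`, works with gradients of the logistic loss and adds regularization).

```python
def fit_stump(xs, residuals):
    """Find the single threshold on xs that best fits the residuals (least squares)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return t, left_mean, right_mean

def predict_stump(stump, x):
    t, left_mean, right_mean = stump
    return left_mean if x <= t else right_mean

def boost(xs, ys, n_rounds, eta):
    """Start from the mean, then repeatedly fit a stump to the current errors."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # what is still unexplained
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # eta shrinks each correction, like the "eta" parameter above
        preds = [p + eta * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Made-up one-feature data with a step between x=3 and x=4
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]

_, _, preds_1 = boost(xs, ys, n_rounds=1, eta=0.1)
_, stumps_50, preds_50 = boost(xs, ys, n_rounds=50, eta=0.1)

print(mse(ys, preds_1) > mse(ys, preds_50))  # True: more rounds fit the training data better
```

The training error keeps shrinking as rounds are added, which is also why the number of rounds has to be chosen on validation data rather than on the training set.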

##Testing final model

In [None]:
# Let's train the model on the full training set (train + validation)
df_train_full = df_train_full.reset_index(drop=True)
y_train_full = (df_train_full.status == "default").astype(int).values

In [None]:
y_train_full

array([0, 1, 0, ..., 0, 0, 1])

In [None]:
del df_train_full["status"]

In [None]:
dv = DictVectorizer(sparse=False)

dicts_train_full = df_train_full.to_dict(orient="records")
x_train_full = dv.fit_transform(dicts_train_full)

dicts_test_full = df_test.to_dict(orient="records")
x_test_full = dv.transform(dicts_test_full)
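The vectorizer turns each row dictionary into a numeric row: numeric values are kept as they are, and each string value becomes its own `name=value` indicator column. A pure-Python sketch of that idea is shown below. `one_hot` is a hypothetical helper that only mimics the fit-and-transform step on toy records; it is not the Scikit-Learn implementation.

```python
def one_hot(records):
    """Return (feature_names, rows) for a list of flat dicts."""
    # Collect column names: "key=value" for strings, plain "key" for numbers
    names = set()
    for rec in records:
        for key, value in rec.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    names = sorted(names)
    index = {name: i for i, name in enumerate(names)}

    # Fill one numeric row per record
    rows = []
    for rec in records:
        row = [0.0] * len(names)
        for key, value in rec.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0
            else:
                row[index[key]] = float(value)
        rows.append(row)
    return names, rows

# Toy records using column names from this dataset
records = [
    {"home": "owner", "job": "fixed", "income": 100.0},
    {"home": "rent", "job": "fixed", "income": 85.0},
]

names, rows = one_hot(records)
print(names)  # ['home=owner', 'home=rent', 'income', 'job=fixed']
print(rows)   # [[1.0, 0.0, 100.0, 1.0], [0.0, 1.0, 85.0, 1.0]]
```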

In [None]:
features = list(dv.get_feature_names_out())

d_train_full = xgb.DMatrix(x_train_full, label=y_train_full, feature_names=features)
d_test_full = xgb.DMatrix(x_test_full, feature_names=features)

In [None]:
# Let's train the final model
xgb_params = {
  "eta": 0.1,
  "max_depth": 3,
  "min_child_weight": 1, 

  "objective": "binary:logistic",
  "eval_metric": "auc",

  "nthread": 8,

  "seed": 1,
  "verbosity": 1
}

model = xgb.train(xgb_params, d_train_full, num_boost_round=175)

In [None]:
y_pred = model.predict(d_test_full)
roc_auc_score(y_test, y_pred)

0.8322662626460096
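The plan for this section lists saving the model, and `pickle` is imported in Setup, but no cell does it. A minimal sketch of that step follows. To keep it runnable on its own it pickles two stand-in dictionaries; in the notebook the fitted `dv` and `model` would take their place. The file name `model.bin` and the temporary directory are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-ins so the sketch runs on its own; in the notebook these would be
# the fitted DictVectorizer (dv) and the trained XGBoost model (model).
dv_stand_in = {"feature_names": ["age", "home=owner"]}
model_stand_in = {"num_boost_round": 175}

# Hypothetical output location
path = os.path.join(tempfile.mkdtemp(), "model.bin")

# Save the vectorizer and the model together: both are needed to score new customers
with open(path, "wb") as f_out:
    pickle.dump((dv_stand_in, model_stand_in), f_out)

# Load them back, as a scoring script would
with open(path, "rb") as f_in:
    dv_loaded, model_loaded = pickle.load(f_in)

print(dv_loaded == dv_stand_in and model_loaded == model_stand_in)  # True
```

Keeping the vectorizer next to the model matters because new applications must be encoded with exactly the same columns the model was trained on.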

##Summary

* Decision trees learn if-then-else rules from data.
* To find the best split, a tree selects the least impure one. This algorithm can overfit, which is why we control it by limiting the max depth and the minimum size of a leaf (`max_depth` and `min_samples_leaf`).
* Random forest is a way of combining multiple decision trees. It should have a diverse set of models to make good predictions.
* Gradient boosting trains models sequentially: each model tries to fix the errors of the previous model. XGBoost is an implementation of gradient boosting.
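
The "least impure split" idea from the list above can be sketched for a single numeric feature. The example uses Gini impurity as the impurity measure (an assumption; other measures exist), made-up toy data, and hypothetical helper names. It illustrates the principle rather than Scikit-Learn's implementation.

```python
def gini(labels):
    """Gini impurity of a group of 0/1 labels: 0 when pure, 0.5 when evenly mixed."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(xs, ys, min_samples_leaf=1):
    """Try every threshold and return (threshold, weighted_impurity) of the best one."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        # Limiting the group size is one of the controls against overfitting
        if len(left) < min_samples_leaf or len(right) < min_samples_leaf:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[1]:
            best = (t, score)
    return best

# Made-up data: assets of seven customers and whether they defaulted (1) or not (0)
assets = [0, 1000, 2000, 3000, 4000, 5000, 8000]
default = [1, 1, 0, 1, 0, 0, 0]

threshold, impurity = best_split(assets, default)
print(threshold, round(impurity, 3))  # 3000 0.214
```

A full tree repeats this search over every feature and then recurses into the two resulting groups until a stopping rule such as the maximum depth is hit.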

##Explore more

* For this dataset we didn't do EDA or feature engineering. You can do it to get more insights into the problem.
* For random forest, there are more parameters that we can tune. Check `max_features` and `bootstrap`.
* There's a variation of random forest called "extremely randomized trees", or "extra trees". Instead of selecting the best split among all possible thresholds, it selects a few thresholds randomly and picks the best one among them. The extra randomness tends to make extra trees less prone to overfitting, although they can still overfit. In Scikit-Learn, they are implemented in `ExtraTreesClassifier`. Try it for this project.
* XGBoost can deal with NAs, so we don't have to do `fillna` for it. Check whether leaving the NAs unfilled improves performance.
* Experiment with other XGBoost parameters: `subsample` and `colsample_bytree`.
* When selecting the best split, decision trees find the most useful features. This information can be used for understanding which features are more important than others. See the examples for [random forest](https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html) (it's the same for plain decision trees) and for [xgboost](https://stackoverflow.com/questions/37627923/how-to-get-feature-importance-in-xgboost).
* Trees can also be used for solving regression problems: check `DecisionTreeRegressor`, `RandomForestRegressor` and the `objective=reg:squarederror` parameter for XGBoost.