# Predicting Loan Defaults

Predicting loan defaults is an extremely common use case for machine
learning in banking, one of DataRobot's main target industries. As a
loan officer, you are responsible for determining which loans are
going to be the most profitable and worthy of lending money to. Based
on a loan application from a potential client, you would like to
predict whether the loan will be paid back in time.

# Data

You will be working with a loan dataset
from LendingClub.com,
a US peer-to-peer lending company. Your classification target is
`is_bad`.

# Task

Partition your data into a holdout set and 5 stratified CV folds. Pick
any two machine learning algorithms from the list below, and build a
binary classification model with each of them:
- Regularized Logistic Regression (scikit-learn)  
- Gradient Boosting Machine (scikit-learn, XGBoost or LightGBM)
- Neural Network (Keras), with the architecture of your choice

Both of your models must make use of numeric, categorical, text, and
date features. Compute out-of-sample LogLoss and F1 scores on
cross-validation and holdout. Which one of your two models would you
recommend to deploy? Explain your decision.

*(Advanced, optional)*: Which 3 features are the most impactful for your model?

Explain your methodology.

# Data Dictionary
|Column Name|Type|Description|Category|
|---|---|---|---|
|`addr_state`|Categorical|Customer State|Customer
|`annual_inc`|Numeric|Annual Income|Customer
|`collections_12_mths_ex_med`|Numeric|(Credit based)|Customer
|`debt-to-income`|Numeric|Ratio of debt to income|Loan
|`delinq_2yrs`|Numeric|Any delinquency in last 2 years|Customer
|`earliest_cr_line`|Date|First credit date|Customer
|`emp_length`|Numeric|Length in current job|Customer
|`emp_title`|Text|Employee Title|Customer
|`home_ownership`|Categorical|Housing Status|Customer
|`Id`|Numeric|Sequential number|Identifier
|`initial_list_status`|Categorical|Loan status|Loan
|`inq_last_6mths`|Numeric|Number of inquiries|Customer
|`is_bad`|Numeric|1 or 0|Target
|`mths_since_last_delinq`|Numeric|Months since last delinquency|Customer
|`mths_since_last_major_derog`|Numeric|(Credit based)|Customer
|`mths_since_last_record`|Numeric|Months since last record|Customer
|`Notes`|Text|Notes taken by the administrator|Loan
|`open_acc`|Numeric|(Credit based)|Customer
|`pymnt_plan`|Categorical|Current Payment Plans|Customer
|`policy_code`|Categorical|Loan type|Loan
|`pub_rec`|Numeric|(Credit based)|Customer
|`purpose`|Text|Purpose for the loan|Loan
|`purpose_cat`|Categorical|Purpose category for the loan|Loan
|`revol_bal`|Numeric|(Credit based)|Customer
|`revol_util`|Numeric|(Credit based)|Customer
|`total_acc`|Numeric|(Credit based)|Customer
|`verification_status`|Categorical|Income Verified|Loan
|`zip_code`|Categorical|Customer zip code|Customer

In this notebook, I build binary classification models of loan defaults using three ML methods:
- Logistic regression
- Gradient boosting machine
- Neural network

Details of the task are available at https://www.interviewquery.com/takehomes/datarobot-1

# Download Data

In [None]:
!git clone --branch datarobot_1 https://github.com/interviewquery/takehomes.git
%cd takehomes/datarobot_1
!ls

# Load Packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time 

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import GradientBoostingClassifier

import tensorflow as tf
from tensorflow import keras

# Data Processing

In [2]:
# Load csv
df_raw = pd.read_csv('dataset.csv')
df_raw.head()

Unnamed: 0,Id,is_bad,emp_title,emp_length,home_ownership,annual_inc,verification_status,pymnt_plan,Notes,purpose_cat,...,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code
0,1,0,Time Warner Cable,10,MORTGAGE,50000.0,not verified,n,,medical,...,,15.0,0.0,12087,12.1,44.0,f,0.0,1,PC4
1,2,0,Ottawa University,1,RENT,39216.0,not verified,n,Borrower added on 04/14/11 > I will be using...,debt consolidation,...,,4.0,0.0,10114,64.0,5.0,f,0.0,2,PC1
2,3,0,Kennedy Wilson,4,RENT,65000.0,not verified,n,,credit card,...,,4.0,0.0,81,0.6,8.0,f,0.0,3,PC4
3,4,0,TOWN OF PLATTEKILL,10,MORTGAGE,57500.0,not verified,n,,debt consolidation,...,,6.0,0.0,10030,37.1,23.0,f,0.0,2,PC2
4,5,0,Belmont Correctional,10,MORTGAGE,50004.0,VERIFIED - income,n,"I want to consolidate my debt, pay for a vacat...",debt consolidation,...,,8.0,0.0,10740,40.4,21.0,f,0.0,3,PC3


### Drop Unnecessary Columns

After quick observation, we can drop these columns from ```df_raw```:
- ```Id```: contains no information other than index
- ```Notes```: contains text data that is redundant with ```purpose_cat``` and other columns. Also has more than 3000 null entries
- ```purpose```: redundant with ```purpose_cat```, also text data
- ```collections_12_mths_ex_med```: contains mostly 0 or nan

In [3]:
# Delete the columns as described above
cols_to_delete = ['Id','Notes','purpose','collections_12_mths_ex_med']
df_raw.drop(cols_to_delete,axis=1,inplace=True)

df_raw.head()

Unnamed: 0,is_bad,emp_title,emp_length,home_ownership,annual_inc,verification_status,pymnt_plan,purpose_cat,zip_code,addr_state,...,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,mths_since_last_major_derog,policy_code
0,0,Time Warner Cable,10,MORTGAGE,50000.0,not verified,n,medical,766xx,TX,...,,,15.0,0.0,12087,12.1,44.0,f,1,PC4
1,0,Ottawa University,1,RENT,39216.0,not verified,n,debt consolidation,660xx,KS,...,,,4.0,0.0,10114,64.0,5.0,f,2,PC1
2,0,Kennedy Wilson,4,RENT,65000.0,not verified,n,credit card,916xx,CA,...,,,4.0,0.0,81,0.6,8.0,f,3,PC4
3,0,TOWN OF PLATTEKILL,10,MORTGAGE,57500.0,not verified,n,debt consolidation,124xx,NY,...,16.0,,6.0,0.0,10030,37.1,23.0,f,2,PC2
4,0,Belmont Correctional,10,MORTGAGE,50004.0,VERIFIED - income,n,debt consolidation,439xx,OH,...,,,8.0,0.0,10740,40.4,21.0,f,3,PC3


### Deal with Missing Values

Check for missing values for each column

In [4]:
df_raw.isnull().sum()

is_bad                            0
emp_title                       592
emp_length                        0
home_ownership                    0
annual_inc                        1
verification_status               0
pymnt_plan                        0
purpose_cat                       0
zip_code                          0
addr_state                        0
debt_to_income                    0
delinq_2yrs                       5
earliest_cr_line                  5
inq_last_6mths                    5
mths_since_last_delinq         6316
mths_since_last_record         9160
open_acc                          5
pub_rec                           5
revol_bal                         0
revol_util                       26
total_acc                         5
initial_list_status               0
mths_since_last_major_derog       0
policy_code                       0
dtype: int64

We see that there are more than 5000 null entries for columns ```mths_since_last_delinq``` (Months since last delinquency) and ```mths_since_last_record``` (Months since last record)

Assuming that a null value in these columns means that the person has no delinquency or no record, it is reasonable to replace the nulls with the maximum value of that column

In [5]:
# Fill the null values for 'mths_since_last_delinq' and 'mths_since_last_record' columns
# (assign back rather than calling fillna(inplace=True) on a column selection)
for col in ['mths_since_last_delinq', 'mths_since_last_record']:
    df_raw[col] = df_raw[col].fillna(df_raw[col].max())

For the rest, drop the rows with null entries as they're only about 6% of the dataset

In [6]:
df_raw.dropna(inplace=True)

Check for null values again

In [7]:
df_raw.isnull().sum()

is_bad                         0
emp_title                      0
emp_length                     0
home_ownership                 0
annual_inc                     0
verification_status            0
pymnt_plan                     0
purpose_cat                    0
zip_code                       0
addr_state                     0
debt_to_income                 0
delinq_2yrs                    0
earliest_cr_line               0
inq_last_6mths                 0
mths_since_last_delinq         0
mths_since_last_record         0
open_acc                       0
pub_rec                        0
revol_bal                      0
revol_util                     0
total_acc                      0
initial_list_status            0
mths_since_last_major_derog    0
policy_code                    0
dtype: int64

### Miscellaneous

I want to check for unique values for each column

In [8]:
for col in df_raw.columns:
    print(f'{col} unique values: {df_raw[col].unique()}')
    print()

is_bad unique values: [0 1]

emp_title unique values: ['Time Warner Cable' 'Ottawa University' 'Kennedy Wilson' ...
 'Weichert, Realtors' 'meadwestvaco' 'Rehab Alliance']

emp_length unique values: ['10' '1' '4' '6' '2' '5' '3' '8' '7' '9' '22' '11' '33' 'na']

home_ownership unique values: ['MORTGAGE' 'RENT' 'OWN' 'OTHER']

annual_inc unique values: [50000. 39216. 65000. ... 66250. 47831. 70560.]

verification_status unique values: ['not verified' 'VERIFIED - income' 'VERIFIED - income source']

pymnt_plan unique values: ['n' 'y']

purpose_cat unique values: ['medical' 'debt consolidation' 'credit card' 'other' 'wedding' 'house'
 'small business' 'educational' 'major purchase' 'car' 'home improvement'
 'vacation' 'other small business' 'debt consolidation small business'
 'moving' 'credit card small business' 'wedding small business'
 'small business small business' 'home improvement small business'
 'educational small business' 'house small business' 'renewable energy'
 'major purcha

We note a few things:
- ```emp_length``` values are strings when they should be ints (or floats). There are also 'na' entries.
- ```earliest_cr_line``` values are strings when we want them to be some type of numerical values.

Convert ```emp_length``` by turning it into int values. We assume that 'na' entries correspond to 0

In [9]:
df_raw['emp_length'] = df_raw['emp_length'].apply(lambda x: 0 if x=='na' else int(x))

Convert ```earliest_cr_line``` into integer timestamp

In [10]:
df_raw['earliest_cr_line'] = pd.to_datetime(df_raw['earliest_cr_line']).astype('int64')
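
The integer timestamp above is an offset from the Unix epoch, which is valid but hard to read. A minimal sketch of an alternative encoding, years of credit history relative to a reference date, is shown here. The sample dates and the reference date `2012-01-01` are assumptions for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical sample values; the real column's date format may differ.
dates = pd.Series(["1996-03-01", "2004-11-01", "1989-07-01"])
parsed = pd.to_datetime(dates)

# What the notebook does: an int64 offset from 1970-01-01.
as_int = parsed.astype("int64")

# Alternative: years of credit history relative to an assumed reference date.
reference = pd.Timestamp("2012-01-01")  # assumption, not from the dataset
credit_years = (reference - parsed).dt.days / 365.25

print(credit_years.round(1).tolist())
```

Either encoding is monotonic in the date, so tree-based models treat them the same; the difference is mostly in how readable the scaled feature and its coefficient are.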

Reset the index

In [11]:
df_raw = df_raw.reset_index(drop=True)

### ```StandardScaler``` for numerical columns and ```OneHotEncoder``` for categorical (nominal) columns

Only these two since we don't have ordinal categorical variables

First let's divide the columns into 3 categories:
- ```target```
- ```numerical```
- ```categorical```

In [12]:
target = ['is_bad']
numerical = ['emp_length', 'annual_inc', 'debt_to_income', 'delinq_2yrs',
             'earliest_cr_line', 'inq_last_6mths', 'mths_since_last_delinq',
             'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal',
             'revol_util', 'total_acc', 'mths_since_last_major_derog']
categorical = ['emp_title', 'home_ownership', 'verification_status', 'pymnt_plan',
               'purpose_cat', 'zip_code', 'addr_state', 'initial_list_status',
               'policy_code']

Make intermediate ```df_target```

In [13]:
df_target = pd.DataFrame(df_raw[target].values,columns=target)

Make intermediate ```df_numerical```

In [14]:
# Define standardscaler
std = StandardScaler()

df_numerical = pd.DataFrame(std.fit_transform(df_raw[numerical]),
                            columns=numerical)

df_numerical.head()

Unnamed: 0,emp_length,annual_inc,debt_to_income,delinq_2yrs,earliest_cr_line,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,mths_since_last_major_derog
0,1.447377,-0.384031,-0.381719,-0.295756,-0.665415,-0.719764,0.728143,0.230337,1.238614,-0.235839,-0.089629,-1.302304,1.876884,-1.240396
1,-1.166366,-0.607497,-0.637186,-0.295756,1.285905,0.631124,0.728143,0.230337,-1.194044,-0.235839,-0.168193,0.541755,-1.466361,-0.005263
2,-0.295118,-0.073202,-0.326763,-0.295756,-4.064714,-0.719764,0.728143,0.230337,-1.194044,-0.235839,-0.567702,-1.710911,-1.209188,1.229871
3,1.447377,-0.228617,-1.078314,1.663704,-2.213897,-0.719764,-1.699751,0.230337,-0.751743,-0.235839,-0.171537,-0.414029,0.076675,-0.005263
4,1.447377,-0.383948,0.830267,-0.295756,0.366493,1.982012,0.728143,0.230337,-0.309441,-0.235839,-0.143266,-0.296777,-0.094773,1.229871


Same for ```df_categorical```

In [15]:
# Define onehotencoder
enc = OneHotEncoder(handle_unknown='ignore',
                    drop='first')

df_categorical = pd.DataFrame(enc.fit_transform(df_raw[categorical]).toarray(),
                              columns=enc.get_feature_names_out(categorical))

df_categorical.head()

Unnamed: 0,emp_title_ U.S. Dept. Of Homeland Security,emp_title_(Collaborative) Abbott Nutrition Intl,emp_title_(self) Castleforte Group,emp_title_1)-Yavapai Regional Medical Center 2)- Dr. cantors office,emp_title_128 Air Refueling Wing (USAF),emp_title_162 fighter wing,emp_title_19th Circuit State Attorney's Office,emp_title_1Life Healthcare,emp_title_1ST FRANKLIN FINANCIAL CORP,emp_title_1st Financial Bank,...,addr_state_VT,addr_state_WA,addr_state_WI,addr_state_WV,addr_state_WY,initial_list_status_m,policy_code_PC2,policy_code_PC3,policy_code_PC4,policy_code_PC5
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
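
The scaler and encoder above are fit on the full dataset, including rows that later end up in the holdout set and in the CV validation folds. A minimal sketch of one way to avoid that, assuming scikit-learn's `ColumnTransformer` and `Pipeline`, is to bundle preprocessing with the model so that it is refit on each training split. The toy frame and its values are hypothetical stand-ins for `df_raw`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for df_raw (hypothetical values, two columns of each kind).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "annual_inc": rng.normal(60000, 15000, n),
    "revol_util": rng.uniform(0, 100, n),
    "home_ownership": rng.choice(["RENT", "OWN", "MORTGAGE"], n),
    "pymnt_plan": rng.choice(["n", "y"], n),
})
toy_y = rng.integers(0, 2, n)

# Scale numeric columns and one-hot encode categorical columns.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["annual_inc", "revol_util"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["home_ownership", "pymnt_plan"]),
])

# LogisticRegression applies an L2 penalty by default.
pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Fitting the pipeline fits the scaler and encoder on these 150 rows only.
pipe.fit(toy.iloc[:150], toy_y[:150])
proba = pipe.predict_proba(toy.iloc[150:])[:, 1]
print(proba[:5])
```

Passing such a pipeline to a CV loop means the scaling statistics and category lists are learned from the training fold alone.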


### Define ```X```, ```y```, and Put Aside Holdout

In [16]:
df_combined = pd.concat([df_target,df_numerical,df_categorical],axis=1)

X = df_combined.drop(['is_bad'],axis=1)
y = df_combined['is_bad']

In [17]:
X, X_holdout, y, y_holdout = train_test_split(X, y, test_size=0.2, random_state=869255)
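
The split above is purely random, whereas the task asks for stratified partitions. A minimal sketch of a stratified holdout split follows, using the `stratify` argument of `train_test_split`. The labels are synthetic, with roughly the class balance seen in this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with about 13% positives, similar to the loan data.
labels = np.array([1] * 130 + [0] * 870)
features = np.arange(labels.size).reshape(-1, 1)  # placeholder feature matrix

X_tr, X_ho, y_tr, y_ho = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=0
)

# Both partitions keep the same share of positives.
print(f"train default rate:   {y_tr.mean():.3f}")
print(f"holdout default rate: {y_ho.mean():.3f}")
```

With a minority class this small, stratifying keeps the holdout default rate from drifting by chance.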

### Define CV Folds

In [18]:
n_splits = 5
folds = StratifiedKFold(n_splits=n_splits)

# Prepare the Models

For completeness, I'll use all three methods suggested in the task (logistic regression, GBM, and NN). However, due to computational cost, I will only run cross-validation on the first two. The NN will only be evaluated on the holdout dataset.

In [19]:
# Logistic regression 
LR = LogisticRegression(penalty='l2', max_iter=1000)

# GBM
# Parameter from documentation 
GBM = GradientBoostingClassifier(n_estimators=100, 
                                 learning_rate=1.0,
                                 max_depth=1, 
                                 random_state=0)

Define CV fold function to train and evaluate the model

In [23]:
def evaluate_model(model, X, y, CVfold):
    # Define metrics of the model
    LL = np.zeros(CVfold.n_splits)
    F1 = np.zeros(CVfold.n_splits)
    accuracy = np.zeros(CVfold.n_splits)
    
    for i, (train, test) in enumerate(CVfold.split(X,y)):
        print(f'CV {i+1} out of {CVfold.n_splits}')
        
        # Split X and y
        X_train = X.iloc[train]
        X_test = X.iloc[test]
        y_train = y.iloc[train]
        y_test = y.iloc[test]
    
        # Train the model
        model.fit(X_train, y_train)
        
        # Get y_pred (predicted binary) y_pred_prob (predicted binary probability to be 1, hence [:,1])
        y_pred = model.predict(X_test)
        y_pred_prob = model.predict_proba(X_test)[:,1]
        
        # Get the stats
        LL[i] = log_loss(y_test, y_pred_prob)
        F1[i] = f1_score(y_test, y_pred)
        accuracy[i] = model.score(X_test, y_test)
        
    return LL, F1, accuracy
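
The same per-fold metrics can also be obtained from scikit-learn's `cross_validate`. The sketch below is an alternative to the hand-written loop, not the function used in this notebook, and it runs on synthetic data from `make_classification` rather than the loan features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for X and y (not the loan data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.87, 0.13], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    cv=cv,
    scoring=["neg_log_loss", "f1", "accuracy"],
)

# scikit-learn reports log loss negated so that higher is always better.
cv_log_loss = -scores["test_neg_log_loss"]
cv_f1 = scores["test_f1"]
cv_accuracy = scores["test_accuracy"]
print(f"Log loss mean: {cv_log_loss.mean():0.2f}")
print(f"F1 score mean: {cv_f1.mean():0.2f}")
print(f"Accuracy mean: {cv_accuracy.mean():0.2f}")
```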

Same for using the holdout dataset

In [24]:
def evaluate_model_holdout(model, X, y, X_holdout, y_holdout):
    # Train the model
    model.fit(X, y)

    # Get y_pred (predicted binary) y_pred_prob (predicted binary probability to be 1, hence [:,1])
    y_pred = model.predict(X_holdout)
    y_pred_prob = model.predict_proba(X_holdout)[:,1]

    # Get the stats
    LL = log_loss(y_holdout, y_pred_prob)
    F1 = f1_score(y_holdout, y_pred)
    accuracy = model.score(X_holdout, y_holdout)
        
    return LL, F1, accuracy, y_pred

# Train and Evaluate Logistic Regression and GBM

Evaluate two sklearn models

In [25]:
log_loss_LR, F1_LR, accuracy_LR = evaluate_model(model=LR,
                                                 X=X,
                                                 y=y,
                                                 CVfold=folds)

log_loss_GBM, F1_GBM, accuracy_GBM = evaluate_model(model=GBM,
                                                   X=X,
                                                   y=y,
                                                   CVfold=folds)

CV 1 out of 5
CV 2 out of 5
CV 3 out of 5
CV 4 out of 5
CV 5 out of 5
CV 1 out of 5
CV 2 out of 5
CV 3 out of 5
CV 4 out of 5
CV 5 out of 5


In [26]:
print('Logistic regression:')
print(f'Log loss mean: {np.mean(log_loss_LR):0.2f}')
print(f'F1 score mean: {np.mean(F1_LR):0.2f}')
print(f'Accuracy mean: {np.mean(accuracy_LR):0.2f}')
print()
print('Gradient boost machine:')
print(f'Log loss mean: {np.mean(log_loss_GBM):0.2f}')
print(f'F1 score mean: {np.mean(F1_GBM):0.2f}')
print(f'Accuracy mean: {np.mean(accuracy_GBM):0.2f}')

Logistic regression:
Log loss mean: 0.34
F1 score mean: 0.21
Accuracy mean: 0.89

Gradient boost machine:
Log loss mean: 0.36
F1 score mean: 0.24
Accuracy mean: 0.89


Logistic regression has the lower log loss (0.34 vs. 0.36), while GBM has the higher F1 score (0.24 vs. 0.21). Both reach the same accuracy (0.89).

Let's test the models on the holdout dataset

In [27]:
log_loss_LR_holdout, F1_LR_holdout, accuracy_LR_holdout, y_pred_LR = \
evaluate_model_holdout(model=LR,
                       X=X,
                       y=y,
                       X_holdout=X_holdout,
                       y_holdout=y_holdout)

log_loss_GBM_holdout, F1_GBM_holdout, accuracy_GBM_holdout, y_pred_GBM = \
evaluate_model_holdout(model=GBM,
                       X=X,
                       y=y,
                       X_holdout=X_holdout,
                       y_holdout=y_holdout)

In [28]:
print('Logistic regression:')
print(f'Log loss: {log_loss_LR_holdout:0.2f}')
print(f'F1 score: {F1_LR_holdout:0.2f}')
print(f'Accuracy: {accuracy_LR_holdout:0.2f}')
print()
print('Gradient boost machine:')
print(f'Log loss: {log_loss_GBM_holdout:0.2f}')
print(f'F1 score: {F1_GBM_holdout:0.2f}')
print(f'Accuracy: {accuracy_GBM_holdout:0.2f}')

Logistic regression:
Log loss: 0.36
F1 score: 0.20
Accuracy: 0.88

Gradient boost machine:
Log loss: 0.38
F1 score: 0.26
Accuracy: 0.88


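The holdout results show the same trend as cross-validation: logistic regression has the lower log loss and GBM has the higher F1 score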

Let's look at classification report

In [29]:
print('Classification report for: Logistic Regression')
print(classification_report(y_holdout, y_pred_LR))

Classification report for: Logistic Regression
              precision    recall  f1-score   support

           0       0.88      1.00      0.94      1631
           1       0.88      0.11      0.20       247

    accuracy                           0.88      1878
   macro avg       0.88      0.56      0.57      1878
weighted avg       0.88      0.88      0.84      1878



In [30]:
print('Classification report for: GBM')
print(classification_report(y_holdout, y_pred_GBM))

Classification report for: GBM
              precision    recall  f1-score   support

           0       0.89      0.99      0.94      1631
           1       0.80      0.16      0.26       247

    accuracy                           0.88      1878
   macro avg       0.84      0.58      0.60      1878
weighted avg       0.87      0.88      0.85      1878
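
Both reports show high precision but low recall for class 1 at the default 0.5 cutoff used by `predict`. One option, which is not part of the analysis above, is to pick the cutoff that maximizes F1 on validation data. The sketch below assumes scikit-learn's `precision_recall_curve` and uses synthetic data, so the numbers it prints say nothing about the loan models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loan data.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.87, 0.13], random_state=0
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
val_prob = model.predict_proba(X_val)[:, 1]

# precision and recall have one more entry than thresholds; drop the last point.
precision, recall, thresholds = precision_recall_curve(y_val, val_prob)
f1_curve = 2 * precision[:-1] * recall[:-1] / np.maximum(
    precision[:-1] + recall[:-1], 1e-12
)
best_threshold = thresholds[np.argmax(f1_curve)]

# Compare F1 at the default cutoff and at the selected cutoff.
f1_default = f1_score(y_val, (val_prob >= 0.5).astype(int))
f1_tuned = f1_score(y_val, (val_prob >= best_threshold).astype(int))
print(f"threshold: {best_threshold:0.2f}")
print(f"F1 at 0.5: {f1_default:0.2f}, F1 at selected threshold: {f1_tuned:0.2f}")
```

The cutoff should be chosen on validation folds and then applied unchanged to the holdout set. Log loss is unaffected because it depends only on the predicted probabilities.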



# Neural Network

We'll use a simple NN with a single hidden layer of 5000 ReLU units and a sigmoid output

In [31]:
# Define the model
NN = keras.Sequential([
     keras.layers.Dense(5000, input_shape=(np.shape(X)[1],), activation='relu'), 
     keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
NN.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Fit the model
NN.fit(X, y, epochs=5)



Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fcc69ebaeb0>

In [32]:
# Get predicted probabilities and class labels for the holdout dataset
# (predict_classes is not available in recent TensorFlow releases, so threshold the probabilities)
y_pred_prob_NN = NN.predict(X_holdout).ravel()
y_pred_NN = (y_pred_prob_NN > 0.5).astype(int)

# Get stats
log_loss_NN_holdout = log_loss(y_holdout, y_pred_prob_NN)
F1_NN_holdout = f1_score(y_holdout, y_pred_NN)
accuracy_NN_holdout = NN.evaluate(X_holdout, y_holdout)[1]





In [33]:
print('Neural network:')
print(f'Log loss: {log_loss_NN_holdout}')
print(f'F1 score: {F1_NN_holdout}')
print(f'Accuracy: {accuracy_NN_holdout}')

Neural network:
Log loss: 0.5768663558501774
F1 score: 0.26405867970660146
Accuracy: 0.8397231101989746


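Compared to logistic regression and GBM, the NN has a higher log loss (0.58 vs. 0.36 and 0.38) and lower accuracy (0.84 vs. 0.88). Its F1 score (0.26) is on par with GBM's and higher than logistic regression's (0.20)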

In [34]:
print('Classification report for: NN')
print(classification_report(y_holdout, y_pred_NN))

Classification report for: NN
              precision    recall  f1-score   support

           0       0.89      0.93      0.91      1631
           1       0.33      0.22      0.26       247

    accuracy                           0.84      1878
   macro avg       0.61      0.58      0.59      1878
weighted avg       0.81      0.84      0.83      1878



# Three Most Important Features

Since all three models give comparable performance in terms of log loss, F1 score, and accuracy, let's see what the three most important features are for the ML models (not straightforward in NN)

### Logistic Regression

In [35]:
# Rank features by the absolute value of their logistic regression coefficient
coef_abs = np.abs(LR.coef_[0])
feature_indices = np.argsort(coef_abs)
print('Logistic regression:')
for rank, label in enumerate(['1st', '2nd', '3rd'], start=1):
    idx = feature_indices[-rank]
    print(f'{label} important feature:')
    print(f'{X.columns[idx]}, abs(coef): {coef_abs[idx]:0.2f}')

Logistic regression:
1st important feature:
purpose_cat_debt consolidation small business, abs(coef): 4.42
2nd important feature:
purpose_cat_credit card small business, abs(coef): 2.70
3rd important feature:
purpose_cat_home improvement small business, abs(coef): 2.65


### GBM

In [36]:
# Rank features by the GBM's impurity-based feature importance
feature_indices = np.argsort(GBM.feature_importances_)
print('GBM:')
for rank, label in enumerate(['1st', '2nd', '3rd'], start=1):
    idx = feature_indices[-rank]
    print(f'{label} important feature:')
    print(f'{X.columns[idx]}, importance: {GBM.feature_importances_[idx]:0.2f}')

GBM:
1st important feature:
purpose_cat_debt consolidation small business, importance: 0.27
2nd important feature:
purpose_cat_credit card small business, importance: 0.06
3rd important feature:
total_acc, importance: 0.05
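
Coefficients and impurity-based importances are specific to each model type and are not directly comparable. A model-agnostic alternative is permutation importance, sketched below with scikit-learn's `permutation_importance` on synthetic data. It works with any fitted scikit-learn estimator; the Keras model would need an adapter exposing a scikit-learn style interface, which is not shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: with shuffle=False the first 3 columns are the informative ones.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

gbm = GradientBoostingClassifier(random_state=0).fit(X_fit, y_fit)

# Shuffle one column at a time on held-out data and record the drop in the score.
result = permutation_importance(
    gbm, X_val, y_val, scoring="neg_log_loss", n_repeats=5, random_state=0
)
top3 = np.argsort(result.importances_mean)[::-1][:3]
for rank, idx in enumerate(top3, start=1):
    print(f"{rank}. feature {idx}: {result.importances_mean[idx]:0.3f}")
```

Because the score is measured on held-out rows, a feature only ranks highly if the model relies on it for out-of-sample predictions.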


# Recommended Model to Use

Logistic regression has the lowest holdout log loss (0.36), and the models are otherwise close in F1 score and accuracy, so I would recommend deploying the **logistic regression model** for the following reasons:
- Easy interpretation of important features
- Relatively fast training (in case more training data becomes available) 
- Computationally efficient prediction

# One Last Thing to Try: Oversampling

I suspect that the relatively low F1 score in all of the models is due to the imbalance in the dataset

In [37]:
print(f"There are {df_raw['is_bad'][df_raw['is_bad']==0].count()} fulfillments")
print(f"There are {df_raw['is_bad'][df_raw['is_bad']==1].count()} defaults")

There are 8204 fulfillments
There are 1184 defaults


In [38]:
# Partition df_combined into 'is_bad'=0 and 1
df_isbad_0 = df_combined[df_combined['is_bad']==0]
df_isbad_1 = df_combined[df_combined['is_bad']==1]

Oversample ```df_isbad_1``` using the ```sample``` function

In [39]:
n_isbad_0 = np.shape(df_isbad_0)[0]
df_isbad_1_oversample = df_isbad_1.sample(n_isbad_0, replace=True)

Check the shape

In [40]:
print(f'Fulfillment: {np.shape(df_isbad_0)[0]}')
print(f'Default (oversampled): {np.shape(df_isbad_1_oversample)[0]}')

Fulfillment: 8204
Default (oversampled): 8204


Nice, now combine the two and test the models (just do the holdout this time) 

In [41]:
df_oversample = pd.concat([df_isbad_0,df_isbad_1_oversample],axis=0)

In [42]:
X_oversample = df_oversample.drop('is_bad',axis=1)
y_oversample = df_oversample['is_bad']

X_train, X_test, y_train, y_test = train_test_split(X_oversample, y_oversample, 
                                                    test_size=0.2, stratify=y_oversample)

In [43]:
log_loss_LR_oversample, F1_LR_oversample, accuracy_LR_oversample, y_pred_LR_oversample = \
evaluate_model_holdout(model=LR,
                       X=X_train,
                       y=y_train,
                       X_holdout=X_test,
                       y_holdout=y_test)

log_loss_GBM_oversample, F1_GBM_oversample, accuracy_GBM_oversample, y_pred_GBM_oversample = \
evaluate_model_holdout(model=GBM,
                       X=X_train,
                       y=y_train,
                       X_holdout=X_test,
                       y_holdout=y_test)

In [44]:
print('Logistic regression:')
print(f'Log loss: {log_loss_LR_oversample:0.2f}')
print(f'F1 score: {F1_LR_oversample:0.2f}')
print(f'Accuracy: {accuracy_LR_oversample:0.2f}')
print()
print('Gradient boost machine:')
print(f'Log loss: {log_loss_GBM_oversample:0.2f}')
print(f'F1 score: {F1_GBM_oversample:0.2f}')
print(f'Accuracy: {accuracy_GBM_oversample:0.2f}')

Logistic regression:
Log loss: 0.34
F1 score: 0.92
Accuracy: 0.92

Gradient boost machine:
Log loss: 0.57
F1 score: 0.66
Accuracy: 0.69


In [45]:
print('Classification report for: Logistic Regression')
print(classification_report(y_test, y_pred_LR_oversample))

Classification report for: Logistic Regression
              precision    recall  f1-score   support

           0       0.95      0.88      0.91      1641
           1       0.89      0.95      0.92      1641

    accuracy                           0.92      3282
   macro avg       0.92      0.92      0.92      3282
weighted avg       0.92      0.92      0.92      3282



In [46]:
print('Classification report for: GBM')
print(classification_report(y_test, y_pred_GBM_oversample))

Classification report for: GBM
              precision    recall  f1-score   support

           0       0.66      0.77      0.71      1641
           1       0.73      0.61      0.66      1641

    accuracy                           0.69      3282
   macro avg       0.70      0.69      0.69      3282
weighted avg       0.70      0.69      0.69      3282



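With oversampling, logistic regression outperforms GBM on all three metrics. Note, however, that the minority class was oversampled before the train/test split, so copies of the same default rows can appear in both ```X_train``` and ```X_test```. These scores are therefore likely optimistic and are not directly comparable to the earlier holdout results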
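
In the cells above, the oversampled frame is built before `train_test_split`, so duplicates of one default row can land on both sides of the split. A minimal sketch of the alternative order, splitting first and oversampling only the training rows, is shown here. The frame is synthetic and stands in for `df_combined`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame standing in for df_combined (hypothetical values).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "is_bad": np.array([1] * 130 + [0] * 870),
})

# 1) Split first, so every test row is unseen by construction.
train_df, test_df = train_test_split(
    demo, test_size=0.2, stratify=demo["is_bad"], random_state=0
)

# 2) Oversample the minority class within the training rows only.
train_0 = train_df[train_df["is_bad"] == 0]
train_1 = train_df[train_df["is_bad"] == 1]
train_1_over = train_1.sample(len(train_0), replace=True, random_state=0)
train_balanced = pd.concat([train_0, train_1_over], axis=0)

# No original row appears in both sets, and the test set keeps
# the original class balance.
overlap = set(train_balanced.index) & set(test_df.index)
print(f"shared rows: {len(overlap)}")
print(f"test default rate: {test_df['is_bad'].mean():.3f}")
```

Metrics computed on `test_df` then reflect the real class balance and can be compared with the earlier holdout scores.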