This notebook provides a baseline classifier for loan applications based on Scikit-Learn's random forest. It does not include complex feature engineering, but it achieves high F1 and ROC AUC scores in 5-fold cross-validation.

In [None]:
import pandas as pd
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, f1_score, make_scorer

import hashlib

In [None]:
# Set the path to the project files
fpath = '../input/should-this-loan-be-approved-or-denied/'

In [None]:
# Let's read the CSV and take a look at our data.
# We drop the unique loan ID and the borrower's organization name right away: they carry no signal and may add noise.
# We also drop ChgOffDate and ChgOffPrinGr because they leak the target (they directly reveal that a loan was charged off).
# ApprovalDate, ApprovalFY and DisbursementDate are dropped to make the model time-independent.
data = pd.read_csv(fpath + 'SBAnational.csv').drop(columns=['LoanNr_ChkDgt', 'Name', 'ChgOffDate', 'ChgOffPrinGr',
                                                            'ApprovalDate', 'ApprovalFY', 'DisbursementDate'])
len_data = len(data)
data

In [None]:
# Let's convert the strings styled as '$XXXX.XX' to float values
money_cols = ['DisbursementGross', 'BalanceGross', 'GrAppv', 'SBA_Appv']

for col in money_cols:
  data[col] = [float(val[1:].replace(',', '')) for val in data[col].values]
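The same conversion can also be written with pandas string methods, which tolerate stray whitespace around the value. This is only a sketch on a made-up toy frame (the values below are not taken from the dataset), not a replacement for the cell above:

```python
import pandas as pd

# Toy frame standing in for the real data; the values are invented for illustration
toy = pd.DataFrame({'GrAppv': ['$60,000.00 ', '$1,250.50 ', '$0.00 ']})

# Remove '$' and ',' characters, trim whitespace, then cast to float
toy['GrAppv'] = (toy['GrAppv']
                 .str.replace(r'[$,]', '', regex=True)
                 .str.strip()
                 .astype(float))

parsed = toy['GrAppv'].tolist()
print(parsed)
```

Unlike slicing with `val[1:]`, this variant does not assume that the dollar sign is the first character.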

In [None]:
# Let's check our data for missing values and fill NAs with mode
for col in data.drop(columns=['MIS_Status']).columns:
  if data[col].isna().any():
    data[col] = data[col].fillna(data[col].mode().iloc[0])
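To see what mode imputation does, here is a small sketch on an invented toy frame (the column names mirror the dataset, the values do not). When several values tie for the mode, `mode()` returns all of them and `.iloc[0]` picks the first one:

```python
import pandas as pd

# Toy frame with one missing value per column; the values are invented
toy = pd.DataFrame({'State': ['CA', 'CA', None, 'NY'],
                    'NoEmp': [4.0, None, 4.0, 10.0]})

# Number of missing values per column before imputation
missing_before = toy.isna().sum().to_dict()

# Fill each column that has gaps with its most frequent value
for col in toy.columns:
  if toy[col].isna().any():
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(missing_before)
print(toy)
```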

In [None]:
# We have many columns with Object dtype; let's apply one hot encoding
# (if the number of unique values is relatively small)
# or hashing if there are many uniques
# The only exception is the MIS_Status (target) variable: it is 'PIF' if the loan was paid in full
# and 'CHGOFF' if the loan was charged off, i.e. the borrower defaulted
cols_to_drop = []

for col in data.drop(columns=['MIS_Status']).columns:
  if data[col].dtype == 'object':
    print(f'Column {col} has {data[col].nunique()} values among {len_data}')

    if data[col].nunique() < 25:
      print(f'One-hot encoding of {col}')
      one_hot_cols = pd.get_dummies(data[col])
      for ohc in one_hot_cols.columns:
        data[col + '_' + ohc] = one_hot_cols[ohc]
    else:
      print(f'Hashing of {col}')
      data[col + '_hash'] = data[col].apply(lambda row: int(hashlib.sha1((col + "_" + str(row)).encode('utf-8')).hexdigest(), 16) % len_data)

    cols_to_drop.append(col)
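The hashing branch above can be isolated in a small helper to make its properties easier to check. The function name `hash_category` and the bucket count of 1000 are assumptions made for this sketch; the notebook itself uses `len_data` as the modulus:

```python
import hashlib

def hash_category(col, value, n_buckets):
  """Map a (column name, value) pair to a stable integer in [0, n_buckets)."""
  key = (col + '_' + str(value)).encode('utf-8')
  return int(hashlib.sha1(key).hexdigest(), 16) % n_buckets

# The same input always yields the same bucket, across runs and machines
a = hash_category('City', 'SPRINGFIELD', 1000)
b = hash_category('City', 'SPRINGFIELD', 1000)
print(a == b)
```

Because SHA-1 is deterministic, the encoding is reproducible without storing a vocabulary. The resulting integers have no meaningful order, though, so the forest treats an arbitrary numeric code as an ordinal feature, and different categories may collide in the same bucket.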

In [None]:
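# Drop the original object columns now that they have been encoded
data = data.drop(columns=cols_to_drop)

# Convert the target variable from string to binary: 1 for 'CHGOFF', 0 otherwise
# (note that rows with a missing MIS_Status also end up as 0)
data['Defaulted'] = (data['MIS_Status'] == 'CHGOFF').astype(int)
data = data.drop(columns=['MIS_Status'])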

In [None]:
# Finally, our data looks like this:
data

In [None]:
# The dataset is quite imbalanced: there are roughly five times as many non-defaulted loans as defaulted ones
print(data.Defaulted.value_counts())
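This imbalance is why the forest in this notebook is created with `class_weight='balanced'`. Scikit-Learn documents that mode as weighting each class by `n_samples / (n_classes * np.bincount(y))`. A minimal sketch of that arithmetic on made-up labels with a 5:1 ratio:

```python
import numpy as np

# Made-up labels: five non-defaulted loans for each defaulted one
y = np.array([0] * 5 + [1])

counts = np.bincount(y)                     # samples per class
weights = len(y) / (len(counts) * counts)   # 'balanced' class weights

print(dict(enumerate(weights)))
```

The minority class gets a weight five times larger than the majority class, so both classes contribute equally in total.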

In [None]:
# Let's fit and cross-validate a balanced random forest; first, split the data into features X and target Y...
X = data.drop(columns=['Defaulted'])
Y = data.Defaulted

In [None]:
# ...and apply stratified 5-fold cross-validation.
# 'f1_weighted' averages the per-class F1 scores weighted by class support;
# 'roc_auc' is computed from the predicted scores rather than from hard labels
rfc = RandomForestClassifier(class_weight='balanced', random_state=42)
cross_validate(rfc, X, Y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
               scoring=['f1_weighted', 'roc_auc'], n_jobs=-1, verbose=10)
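`cross_validate` returns a dictionary with one array per metric, keyed as `test_<scorer name>`. The sketch below shows one way to summarize it. It runs on synthetic data from `make_classification` with a small forest so that it finishes quickly, so its numbers say nothing about the loan dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the real features and target
X_toy, y_toy = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=42)

model = RandomForestClassifier(n_estimators=20, class_weight='balanced', random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(model, X_toy, y_toy, cv=cv, scoring=['f1_weighted', 'roc_auc'])

# Report mean and standard deviation across the five folds
for name in ['test_f1_weighted', 'test_roc_auc']:
  print(f'{name}: {scores[name].mean():.3f} +/- {scores[name].std():.3f}')
```

Reporting the spread across folds alongside the mean helps to judge how stable the estimate is.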

As we can see, this model achieves an average weighted F1 of 0.94 and an average ROC AUC of 0.97. Both are close to 1, which suggests that the model is effective at flagging potentially risky loan applications. To improve the baseline solution, we can:
- dive deeper into the problem and create new informative features;
- apply more sophisticated methods, such as boosting or deep learning;
- try to use oversampling techniques.
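As an illustration of the last point, here is a minimal sketch of random oversampling with pandas. The helper name `oversample_minority` and the toy frame are assumptions made for this example. On real data it should be applied to the training folds only, so that duplicated rows never leak into the validation folds:

```python
import pandas as pd

def oversample_minority(train, target='Defaulted', random_state=42):
  """Resample every smaller class with replacement up to the majority class size."""
  counts = train[target].value_counts()
  majority_n = counts.max()
  parts = []
  for label, n in counts.items():
    part = train[train[target] == label]
    if n < majority_n:
      part = part.sample(n=majority_n, replace=True, random_state=random_state)
    parts.append(part)
  # Shuffle so that the classes are not grouped together
  return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)

# Invented toy training set: 10 non-defaulted and 2 defaulted loans
toy = pd.DataFrame({'Term': range(12), 'Defaulted': [0] * 10 + [1] * 2})
balanced = oversample_minority(toy)
print(balanced['Defaulted'].value_counts())
```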

Thanks for your attention :)