# Deloitte Presents Machine Learning Challenge: Predict Loan Defaulters

This competition is:
 - a supervised learning problem,
 - a highly imbalanced dataset problem,
 - a binary classification problem.
 
As this kind of problem is new to me, I'll start with the tutorial on the TensorFlow website.
Here is the link to the tutorial: https://www.tensorflow.org/tutorials/structured_data/imbalanced_data

## Strategy

Step 1 - Start with the numerical columns only, and try to remove columns that have very few unique values.

Step 2 - Drop columns that are irrelevant because their mean and std are both 0.

## Concepts

From the training data, we can see that positive samples make up around 9 to 10
percent of the total data (9.25%, as computed in the exploration section below).

Try common techniques for dealing with imbalanced data, such as:

1. Class weights
2. Oversampling
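As a rough illustration of the oversampling idea listed above, here is a minimal sketch using only NumPy. The helper `oversample_minority` and the toy arrays `x_demo` and `y_demo` are hypothetical names introduced for this example; they are not part of the notebook.

```python
import numpy as np


def oversample_minority(x, y, seed=0):
    """Resample the minority class with replacement until both classes have equal size."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    if len(pos_idx) < len(neg_idx):
        minority, majority = pos_idx, neg_idx
    else:
        minority, majority = neg_idx, pos_idx
    # Draw minority rows with replacement to match the majority count.
    resampled = rng.choice(minority, size=len(majority), replace=True)
    idx = np.concatenate([majority, resampled])
    rng.shuffle(idx)
    return x[idx], y[idx]


# Toy data with roughly the same imbalance as the training set (9% positives).
x_demo = np.arange(200, dtype=float).reshape(100, 2)
y_demo = np.array([1] * 9 + [0] * 91)
x_bal, y_bal = oversample_minority(x_demo, y_demo)
print(y_bal.mean())
```

If this approach is used, it should probably be applied to the training split only, so that the validation set keeps the real class distribution.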

## To Do List

- Make everything a function.
- Handle missing values.
- Decide when to take the log of a column, i.e. how large or how skewed the values need to be.
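One possible way to approach the log question in the last item is a simple heuristic: apply `log1p` to a non-negative column when its maximum is far larger than its median. The sketch below is an assumption, not something the notebook settles on; the function `log1p_if_skewed` and the cutoff `ratio_threshold=10.0` are illustrative choices.

```python
import numpy as np


def log1p_if_skewed(values, ratio_threshold=10.0):
    """Return (values, flag); values are log1p-transformed when the column is heavy-tailed.

    ratio_threshold is an illustrative cutoff, not a tuned value.
    """
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    is_nonnegative = values.min() >= 0
    # Guard against tiny medians by comparing against at least 1.0.
    is_heavy_tailed = values.max() > ratio_threshold * max(median, 1.0)
    if is_nonnegative and is_heavy_tailed:
        return np.log1p(values), True
    return values, False


# First five values of 'Recoveries' and 'Interest Rate' as shown by df_train.head().
recoveries = [2.498291, 2.377215, 4.316277, 0.10702, 1294.818751]
interest_rate = [11.135007, 12.237563, 12.545884, 16.731201, 15.0083]

_, recoveries_transformed = log1p_if_skewed(recoveries)
_, interest_transformed = log1p_if_skewed(interest_rate)
print(recoveries_transformed, interest_transformed)
```

On these few rows the heuristic flags 'Recoveries' but not 'Interest Rate'. A skewness statistic computed on the full column would be a reasonable alternative criterion.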

## Imports

In [1]:
import tensorflow as tf
import sklearn

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import os

print('TensorFlow version: {}'.format(tf.__version__))

TensorFlow version: 2.3.0


## Download and extract data

In [4]:
url = 'https://raw.githubusercontent.com/rameshgangwar/DeepLearning/main/Participants_Data_PLD.zip'
current_dir = os.getcwd()

ds_path = tf.keras.utils.get_file(
    fname='dataset.zip',
    origin=url,
    extract=True,
    cache_dir=current_dir,
    cache_subdir='')

print(ds_path)
os.remove(ds_path)

Downloading data from https://raw.githubusercontent.com/rameshgangwar/DeepLearning/main/Participants_Data_PLD.zip
c:\Users\rames\Desktop\GitHub\deep_learning\Deloitte-PredictLoanDefaulters\dataset.zip


## Primary Exploration of Data

In [22]:
# Load data into DataFrames.
ds_dir = os.getcwd() # Use this line if you have already downloaded the data from URL.
# ds_dir = os.path.dirname(ds_path)

ds_train = os.path.join(ds_dir, 'train.csv')
ds_test = os.path.join(ds_dir, 'test.csv')

df_train = pd.read_csv(ds_train)
df_test = pd.read_csv(ds_test)

In [23]:
df_train.head()

| | ID | Loan Amount | Funded Amount | Funded Amount Investor | Term | Batch Enrolled | Interest Rate | Grade | Sub Grade | Employment Duration | ... | Recoveries | Collection Recovery Fee | Collection 12 months Medical | Application Type | Last week Pay | Accounts Delinquent | Total Collection Amount | Total Current Balance | Total Revolving Credit Limit | Loan Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65087372 | 10000 | 32236 | 12329.36286 | 59 | BAT2522922 | 11.135007 | B | C4 | MORTGAGE | ... | 2.498291 | 0.793724 | 0 | INDIVIDUAL | 49 | 0 | 31 | 311301 | 6619 | 0 |
| 1 | 1450153 | 3609 | 11940 | 12191.99692 | 59 | BAT1586599 | 12.237563 | C | D3 | RENT | ... | 2.377215 | 0.974821 | 0 | INDIVIDUAL | 109 | 0 | 53 | 182610 | 20885 | 0 |
| 2 | 1969101 | 28276 | 9311 | 21603.22455 | 59 | BAT2136391 | 12.545884 | F | D4 | MORTGAGE | ... | 4.316277 | 1.020075 | 0 | INDIVIDUAL | 66 | 0 | 34 | 89801 | 26155 | 0 |
| 3 | 6651430 | 11170 | 6954 | 17877.15585 | 59 | BAT2428731 | 16.731201 | C | C3 | MORTGAGE | ... | 0.10702 | 0.749971 | 0 | INDIVIDUAL | 39 | 0 | 40 | 9189 | 60214 | 0 |
| 4 | 14354669 | 16890 | 13226 | 13539.92667 | 59 | BAT5341619 | 15.0083 | C | D4 | MORTGAGE | ... | 1294.818751 | 0.368953 | 0 | INDIVIDUAL | 18 | 0 | 430 | 126029 | 22579 | 0 |


In [24]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67463 entries, 0 to 67462
Data columns (total 35 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   ID                            67463 non-null  int64  
 1   Loan Amount                   67463 non-null  int64  
 2   Funded Amount                 67463 non-null  int64  
 3   Funded Amount Investor        67463 non-null  float64
 4   Term                          67463 non-null  int64  
 5   Batch Enrolled                67463 non-null  object 
 6   Interest Rate                 67463 non-null  float64
 7   Grade                         67463 non-null  object 
 8   Sub Grade                     67463 non-null  object 
 9   Employment Duration           67463 non-null  object 
 10  Home Ownership                67463 non-null  float64
 11  Verification Status           67463 non-null  object 
 12  Payment Plan                  67463 non-null  object 
 13  L

In [25]:
df_train.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 67463.0 | 25627610.0 | 21091550.0 | 1297933.0 | 6570288.0 | 17915650.0 | 42715210.0 | 72245780.0 |
| Loan Amount | 67463.0 | 16848.9 | 8367.866 | 1014.0 | 10012.0 | 16073.0 | 22106.0 | 35000.0 |
| Funded Amount | 67463.0 | 15770.6 | 8150.993 | 1014.0 | 9266.5 | 13042.0 | 21793.0 | 34999.0 |
| Funded Amount Investor | 67463.0 | 14621.8 | 6785.345 | 1114.59 | 9831.685 | 12793.68 | 17807.59 | 34999.75 |
| Term | 67463.0 | 58.17381 | 3.327441 | 36.0 | 58.0 | 59.0 | 59.0 | 59.0 |
| Interest Rate | 67463.0 | 11.84626 | 3.718629 | 5.320006 | 9.297147 | 11.3777 | 14.19353 | 27.18235 |
| Home Ownership | 67463.0 | 80541.5 | 45029.12 | 14573.54 | 51689.84 | 69335.83 | 94623.32 | 406561.5 |
| Debit to Income | 67463.0 | 23.29924 | 8.451824 | 0.6752991 | 16.75642 | 22.65666 | 30.0484 | 39.62986 |
| Delinquency - two years | 67463.0 | 0.3271275 | 0.8008884 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 |
| Inquires - six months | 67463.0 | 0.145754 | 0.4732913 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 |


In [26]:
# Check the number of positive and negative samples.
total_samples = len(df_train)
positive_samples = df_train['Loan Status'].value_counts()[1]
negative_samples = df_train['Loan Status'].value_counts()[0]

print('Total Samples: {}'.format(total_samples))
print('Positive Samples: {} which are {:.2f} % of total'.format(positive_samples, (positive_samples/total_samples)*100))

Total Samples: 67463
Positive Samples: 6241 which are 9.25 % of total
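Given the counts printed above, class weights can be derived so that both classes contribute about equally to the loss. The sketch below follows the weighting scheme used in the TensorFlow imbalanced-data tutorial linked at the top; the variable names are chosen here for illustration.

```python
total = 67463  # Total samples, from the output above.
pos = 6241     # Positive samples, from the output above.
neg = total - pos

# Scale each class by the inverse of its frequency; dividing the total by 2
# keeps the overall magnitude of the loss comparable to the unweighted case.
weight_for_0 = (1 / neg) * (total / 2.0)
weight_for_1 = (1 / pos) * (total / 2.0)
class_weight = {0: weight_for_0, 1: weight_for_1}

print('Weight for class 0: {:.2f}'.format(weight_for_0))
print('Weight for class 1: {:.2f}'.format(weight_for_1))
```

This gives roughly 0.55 for class 0 and 5.40 for class 1. The dictionary could then be passed to Keras through the `class_weight` argument of `model.fit`.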


## Split the Dataset into Train and Validation Sets

A separate test dataset is provided along with the data, so only a validation split is needed here.

In [28]:
df_train = df_train.drop(['ID'], axis=1)
df_train.head()

| | Loan Amount | Funded Amount | Funded Amount Investor | Term | Batch Enrolled | Interest Rate | Grade | Sub Grade | Employment Duration | Home Ownership | ... | Recoveries | Collection Recovery Fee | Collection 12 months Medical | Application Type | Last week Pay | Accounts Delinquent | Total Collection Amount | Total Current Balance | Total Revolving Credit Limit | Loan Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000 | 32236 | 12329.36286 | 59 | BAT2522922 | 11.135007 | B | C4 | MORTGAGE | 176346.6267 | ... | 2.498291 | 0.793724 | 0 | INDIVIDUAL | 49 | 0 | 31 | 311301 | 6619 | 0 |
| 1 | 3609 | 11940 | 12191.99692 | 59 | BAT1586599 | 12.237563 | C | D3 | RENT | 39833.921 | ... | 2.377215 | 0.974821 | 0 | INDIVIDUAL | 109 | 0 | 53 | 182610 | 20885 | 0 |
| 2 | 28276 | 9311 | 21603.22455 | 59 | BAT2136391 | 12.545884 | F | D4 | MORTGAGE | 91506.69105 | ... | 4.316277 | 1.020075 | 0 | INDIVIDUAL | 66 | 0 | 34 | 89801 | 26155 | 0 |
| 3 | 11170 | 6954 | 17877.15585 | 59 | BAT2428731 | 16.731201 | C | C3 | MORTGAGE | 108286.5759 | ... | 0.10702 | 0.749971 | 0 | INDIVIDUAL | 39 | 0 | 40 | 9189 | 60214 | 0 |
| 4 | 16890 | 13226 | 13539.92667 | 59 | BAT5341619 | 15.0083 | C | D4 | MORTGAGE | 44234.82545 | ... | 1294.818751 | 0.368953 | 0 | INDIVIDUAL | 18 | 0 | 430 | 126029 | 22579 | 0 |


In [29]:
# Separate features and labels.

y = df_train.pop('Loan Status')
x = df_train

In [30]:
from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=42)
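Because the positive class is rare, a plain random split can leave the two splits with slightly different positive rates. One option, shown here as a sketch on synthetic data rather than as a change to the cell above, is the `stratify` argument of `train_test_split`. The arrays `x_demo` and `y_demo` are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and labels, with 9% positives.
rng = np.random.default_rng(42)
x_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 90 + [0] * 910)

# stratify=y_demo keeps the class proportions the same in both splits.
x_tr, x_va, y_tr, y_va = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print('Train positive rate: {:.3f}'.format(y_tr.mean()))
print('Val positive rate: {:.3f}'.format(y_va.mean()))
```

In the notebook this would correspond to adding `stratify=y` to the existing call, which would change which rows land in each split.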

## Preprocess Data

In [31]:
# Select numeric columns only.
x_train = x_train.select_dtypes(include=['float64', 'int64']).copy()
x_val = x_val.select_dtypes(include=['float64', 'int64']).copy()

In [32]:
print('Number of Samples in x_train: {}'.format(len(x_train)))
print('Number of unique values in each column.')
x_train.nunique()

Number of Samples in x_train: 53970
Number of unique values in each column.


Loan Amount                     25343
Funded Amount                   22613
Funded Amount Investor          53959
Term                                3
Interest Rate                   53958
Home Ownership                  53962
Debit to Income                 53968
Delinquency - two years             9
Inquires - six months               6
Open Account                       36
Public Record                       5
Revolving Balance               19029
Revolving Utilities             53966
Total Accounts                     69
Total Received Interest         53964
Total Received Late Fee         53923
Recoveries                      53924
Collection Recovery Fee         53873
Collection 12 months Medical        2
Last week Pay                     162
Accounts Delinquent                 1
Total Collection Amount          1961
Total Current Balance           49754
Total Revolving Credit Limit    33117
dtype: int64

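From the above counts, we can see that several columns have almost the same value in every row. For example, 'Accounts Delinquent' has a single unique value, 'Collection 12 months Medical' has 2, and 'Term' has 3.
Columns noted so far: ['Term', ]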
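Columns with a single unique value, such as 'Accounts Delinquent' in the counts above, carry no information and could be dropped before scaling. The following is a minimal sketch on a small hypothetical frame `demo`; the threshold of one unique value is an assumption, and a looser cutoff might also be worth trying.

```python
import pandas as pd

# Small frame mimicking the pattern above: one constant column, two varying columns.
demo = pd.DataFrame({
    'Accounts Delinquent': [0, 0, 0, 0],
    'Term': [59, 59, 36, 58],
    'Loan Amount': [10000, 3609, 28276, 11170],
})

# Find columns that never change and drop them.
unique_counts = demo.nunique()
constant_cols = unique_counts[unique_counts <= 1].index.tolist()
demo_reduced = demo.drop(columns=constant_cols)

print('Dropped: {}'.format(constant_cols))
```

In the notebook, the same column list would need to be dropped from both `x_train` and `x_val` while they are still DataFrames, before the conversion to NumPy arrays.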

In [34]:
x_train = np.array(x_train)
x_val = np.array(x_val)

In [35]:
# Normalize columns.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(x_train)

StandardScaler()

In [41]:
x_train = scaler.transform(x_train)
x_val = scaler.transform(x_val)

In [43]:
print('Train features shape {}'.format(x_train.shape))
print('Val features shape: {}'.format(x_val.shape))

print('train label shape: {}'.format(y_train.shape))
print('Val label shape: {}'.format(y_val.shape))

Train features shape (53970, 24)
Val features shape: (13493, 24)
train label shape: (53970,)
Val label shape: (13493,)


## Define Model and Metrics

In [None]:
# To do: understand what each of these metrics means.
# To do: understand optimizers, losses, and metrics.
# To do: understand how a layer and its parameters are defined.

METRICS = [
    tf.keras.metrics.TruePositives(name='tp'),
    tf.keras.metrics.FalsePositives(name='fp'),
    tf.keras.metrics.TrueNegatives(name='tn'),
    tf.keras.metrics.FalseNegatives(name='fn'),
    tf.keras.metrics.BinaryAccuracy(name='accuracy'),
    tf.keras.metrics.Precision(name='precision'),
    tf.keras.metrics.Recall(name='recall'),
    tf.keras.metrics.AUC(name='auc'),
    tf.keras.metrics.AUC(name='prc', curve='PR'),
]


def make_model(metrics=METRICS, output_bias=None):
    if output_bias is not None:
        output_bias = tf.keras.initializers.Constant(output_bias)
    
    model = tf.keras.Sequential([
        # input_shape must be a tuple, hence the trailing comma.
        tf.keras.layers.Dense(16, activation='relu', input_shape=(x_train.shape[-1],)),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation='sigmoid', bias_initializer=output_bias),
    ])

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        loss=tf.keras.losses.BinaryCrossentropy(),
        metrics=metrics,
    )

    return model
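The `output_bias` argument of `make_model` lets the final layer start out predicting the base rate of the positive class instead of 0.5. A common choice, used in the TensorFlow tutorial linked at the top, is the log-odds of the positive class. The sketch below only computes that value from the sample counts reported earlier; the variable names are illustrative.

```python
import math

pos = 6241          # Positive samples in the training data.
neg = 67463 - pos   # Negative samples.

# Bias b such that sigmoid(b) equals the positive rate pos / (pos + neg).
initial_bias = math.log(pos / neg)

# Sanity check: the sigmoid of the bias should reproduce the positive rate.
predicted_rate = 1.0 / (1.0 + math.exp(-initial_bias))

print('Initial bias: {:.4f}'.format(initial_bias))
print('Implied positive rate: {:.4f}'.format(predicted_rate))
```

The result could then be passed as `make_model(output_bias=initial_bias)`. Note that these counts come from the full training file, so the value would differ slightly if it were computed on `y_train` alone.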