## Lending Club Loan Data Analysis

DESCRIPTION

Create a model that predicts whether or not a loan will default, using historical data.

 

Problem Statement:  

For companies like Lending Club, correctly predicting whether or not a loan will default is very important. In this project, you will use historical data from 2007 to 2015 to build a deep learning model that predicts the chance of default for future loans. As you will see later, this dataset is highly imbalanced and includes many features, which makes the problem more challenging.

Domain: Finance

Analysis to be done: Perform data preprocessing and build a deep learning prediction model. 

Content: 

Dataset columns and definition:

 

credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.

purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").

int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.

installment: The monthly installments owed by the borrower if the loan is funded.

log.annual.inc: The natural log of the self-reported annual income of the borrower.

dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).

fico: The FICO credit score of the borrower.

days.with.cr.line: The number of days the borrower has had a credit line.

revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).

revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).

inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.

delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.

pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).

 

Steps to perform:

Perform exploratory data analysis and feature engineering. Follow up with a deep learning model that predicts whether or not a loan will default, using the historical data.

 

Tasks:

1. Feature Transformation

   Transform categorical values into numerical values (discrete).

2. Exploratory data analysis of different factors of the dataset.

3. Additional Feature Engineering

   Check the correlation between features and drop those features that are strongly correlated with each other.

   This reduces the number of features and leaves you with the most relevant ones.

4. Modeling

   After applying EDA and feature engineering, you are ready to build the predictive models.

   In this part, you will create a deep learning model using Keras with a TensorFlow backend.

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
#IMPORT REQUIRED LIBRARIES 
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import keras
import warnings 

import tensorflow as tf
from tensorflow.keras import layers

In [None]:
#LOAD DATASET 
df = pd.read_csv('/kaggle/input/loan-data-csv/loan_data.csv')

In [None]:
df.info()

In [None]:
df.head()

In [None]:
#EXPLORE VALUES OF THE PURPOSE COLUMN
df['purpose'].unique()

In [None]:
#ONE HOT ENCODING
# dtype=int keeps the dummy columns numeric (recent pandas versions default to bool)
purpose = pd.get_dummies(df.purpose, drop_first=True, dtype=int)

In [None]:
purpose

In [None]:
# CONCATENATE THE DUMMY COLUMNS TO THE MAIN DATAFRAME
df = pd.concat([df,purpose],axis=1)

In [None]:
# data is an alias: data and df refer to the same DataFrame
data = df
data.shape

In [None]:
#DROP THE PURPOSE COLUMN, AS ITS VALUES ARE NOW COVERED BY THE ONE-HOT ENCODED COLUMNS
data.drop('purpose',axis=1,inplace=True)

In [None]:
data.describe().T

In [None]:
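# SPLIT THE DATA INTO FEATURES AND TARGET
# NOTE: the target used here is credit.policy, so not.fully.paid stays in X as a feature.
# The problem statement asks for default prediction, for which not.fully.paid would presumably be the target.
X = data.drop('credit.policy',axis=1)
y = data['credit.policy']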

In [None]:
X.info()

## FIRST METHOD

In [None]:
from sklearn.ensemble import GradientBoostingClassifier,RandomForestClassifier
gb = GradientBoostingClassifier()
rf = RandomForestClassifier()
gb.fit(X,y)
rf.fit(X,y)
print(gb.feature_importances_)
print(rf.feature_importances_)
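
The raw importance arrays printed above are hard to read without column names. A minimal sketch of pairing importances with feature names and sorting them follows. It runs on synthetic data, and the names `X_demo`, `y_demo`, and `rf_demo` are hypothetical stand-ins, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the loan features (hypothetical data, not the real dataset)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise_a": rng.normal(size=500),
    "noise_b": rng.normal(size=500),
})
# The label depends only on the "signal" column
y_demo = (X_demo["signal"] > 0).astype(int)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Attach column names to the raw importance array and sort, largest first
importances = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The same pattern should apply to the notebook's models, for example `pd.Series(rf.feature_importances_, index=X.columns)`.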

In [None]:
# CORRELATION BETWEEN THE AVAILABLE COLUMNS
plt.figure(figsize=(12,9))
sns.heatmap(data.corr(),annot=True)
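
The task list asks for strongly correlated features to be dropped, which the heatmap alone does not do. One possible way to automate that step is sketched below on synthetic data. The helper `drop_correlated` and the 0.9 threshold are assumptions chosen for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

def drop_correlated(frame, threshold=0.9):
    """Drop one column from each pair whose absolute correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop

# Synthetic data: "a_copy" is almost a linear function of "a", "b" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

reduced, dropped = drop_correlated(demo, threshold=0.9)
print(dropped)
print(reduced.columns.tolist())
```

Of each correlated pair, this keeps the column that appears first and drops the later one. Which one to keep in practice is a modeling decision.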


In [None]:
#PAIRPLOT
#sns.pairplot(data,diag_kind='kde',hue='not.fully.paid')

In [None]:
# SPLIT THE DATA TO TRAIN AND TEST
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)
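
Since the problem statement describes the data as highly imbalanced, a stratified split may be worth considering so that both splits keep the same class proportions. The sketch below uses synthetic labels with 10% positives. The `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 10% positives (hypothetical data)
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```

In the notebook this would amount to adding `stratify=y` to the existing `train_test_split` call.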

In [None]:
# DEFINE MODEL
model = tf.keras.models.Sequential()

# MODEL INPUT 
model.add(tf.keras.layers.Reshape((18,),input_shape=(18,)))
model.add(tf.keras.layers.BatchNormalization())

#FIRST LAYER
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())

#SECOND LAYER
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())

#THIRD LAYER
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())

#FOURTH LAYER
#model.add(tf.keras.layers.Dense(16, activation='relu'))
#model.add(tf.keras.layers.BatchNormalization())

# OUTPUT LAYER
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

In [None]:
# Adam optimizer; `learning_rate` replaces the deprecated `lr` argument
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0005)

model.compile(optimizer=optimizer,loss='binary_crossentropy',metrics=['accuracy'])
#model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['accuracy'])

In [None]:
model.fit(X_train,y_train, validation_data=(X_test,y_test), epochs=100, batch_size=32)
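
One common way to handle class imbalance during training is to weight the rarer class more heavily. The sketch below computes "balanced" class weights with scikit-learn on synthetic labels. The label array is hypothetical, and whether weighting helps on this dataset is untested here.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced labels: 900 negatives, 100 positives (hypothetical data)
y_demo = np.array([0] * 900 + [1] * 100)

classes = np.unique(y_demo)
# "balanced" weights are n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# Keras expects a dict mapping class index to weight
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)
```

The resulting dictionary could then be passed to training as `model.fit(..., class_weight=class_weight)`.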

With the above, we could reach an accuracy of 96%.
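
Accuracy alone can be misleading on imbalanced data, so a confusion matrix with precision and recall may give a fuller picture. The sketch below uses made-up labels and probabilities as stand-ins for `y_test` and the output of `model.predict(X_test)`. The 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predicted probabilities (not results from the notebook)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.6, 0.2, 0.7, 0.4])

# Threshold the sigmoid outputs at 0.5 to get class labels
y_pred = (y_prob >= 0.5).astype(int)

accuracy = (y_pred == y_true).mean()
cm = confusion_matrix(y_true, y_pred)       # rows: true class, columns: predicted class
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(accuracy)
print(cm)
print(precision, recall)
```

In this toy case the accuracy is 0.8, yet only half of the positive cases are caught. That gap is exactly what accuracy hides when one class dominates.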