# _Lending Club Loan Data Analysis_
***

<b>DESCRIPTION</b>

Create a model that predicts whether or not a loan will default, using historical data.

<b>Problem Statement:  </b>

For companies like Lending Club, correctly predicting whether or not a loan will default is very important. In this project, using historical data from 2007 to 2015, you have to build a deep learning model to predict the chance of default for future loans. As you will see later, this dataset is highly imbalanced and includes a lot of features, which makes the problem more challenging.
***

<b>Domain: Finance</b>

Analysis to be done: Perform data preprocessing and build a deep learning prediction model. 

Content: 

Dataset columns and definition:

- credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.

- purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").

- int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.

- installment: The monthly installments owed by the borrower if the loan is funded.

- log.annual.inc: The natural log of the self-reported annual income of the borrower.

- dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).

- fico: The FICO credit score of the borrower.

- days.with.cr.line: The number of days the borrower has had a credit line.

- revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).

- revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).

- inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.

- delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.

- pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).


## Approach 
-  Exploratory Data Analysis
-  Statistical Hypothesis Tests on selected features (Shapiro-Wilk test, D'Agostino K^2 test, ..)
-  Prepare and preprocess data (Power Transformation, Scaling, PCA)
-  Select classifiers based on cross validation 
-  Fine tune the selected classifier and baseline the performance 
-  Define Dense Layer Model as a function 
-  Grid search on deep learning model parameters using Keras wrapper for Scikit learn
-  Add regularization 
-  Compare performance of deep learning model with the fine tuned ML classifier 

## _Import Libraries and Load Data_

In [None]:
#usual imports 
import numpy as np
import pandas as pd
import os
import sys
import statistics
assert sys.version_info >= (3,5)
#visualization imports
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid')
%matplotlib inline
#consistent sized plots
from pylab import rcParams
rcParams['figure.figsize']= 12,5
rcParams['axes.labelsize']=12
rcParams['xtick.labelsize']=12
rcParams['ytick.labelsize']=12
#handle the unwanted warnings
import warnings
warnings.filterwarnings(action='ignore',category=DeprecationWarning)
warnings.filterwarnings(action='ignore',category=FutureWarning)
#view all the columns in the dataframe
pd.options.display.max_columns = None
#import zip file
from zipfile import ZipFile

In [None]:
#check the version of the libraries used
print('Pandas version: {}'.format(pd.__version__))
print('Numpy version: {}'.format(np.__version__))
print('Seaborn version: {}'.format(sns.__version__))

In [None]:
#load the csv file into the dataframe
loan = pd.read_csv('/kaggle/input/loan-risk/loan_data.csv',delimiter=',',engine='python')
#check the top rows of the data frame
loan.head()

In [None]:
#automated basic exploration of the data using pandas profiling
import pandas_profiling as pp
pp.ProfileReport(loan)

In [None]:
#check info 
loan.info()

<font color=blue> <b>
- *There are no null or missing values in the dataset*
- *There exists multicollinearity between the factors* </b> </font>

There is one categorical variable, purpose; the remaining variables are numerical.

## _Exploratory Data Analysis_

In [None]:
#check the balance of the data
sns.countplot(x='credit.policy',data=loan)
plt.title('Countplot of the credit policy (Target Variable)')
plt.show()

<b>_The dataset is clearly imbalanced: far more customers meet the credit underwriting criteria than do not. Accuracy is therefore not a good measure of model performance, and tuning the decision function to favor either precision or recall is key. There are two ways to look at it. On one hand, the lender does not want to provide a loan to a customer who would default. On the other hand, if the lender wants to maximise lending, the model should identify all the legitimate customers who would pay back. In this problem, I assume that the lending club, which makes money by lending to customers, is more keen to identify all the people who meet the credit policy criteria. Moreover, the lending club sets a higher interest rate for risky customers._</b>
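
To make the point about accuracy concrete, here is a minimal sketch on made-up labels. The 80/20 class split is an assumption chosen for illustration, not the actual ratio in this dataset. A "model" that always predicts the majority class scores well on accuracy while missing every minority case.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 80 positives (1) and 20 negatives (0)
y_true = np.array([1] * 80 + [0] * 20)
# A trivial "model" that always predicts the majority class
y_majority = np.ones_like(y_true)

acc = accuracy_score(y_true, y_majority)
recall_pos = recall_score(y_true, y_majority, pos_label=1)
recall_neg = recall_score(y_true, y_majority, pos_label=0)

print(f"accuracy: {acc:.2f}")
print(f"recall for class 1: {recall_pos:.2f}")
print(f"recall for class 0: {recall_neg:.2f}")
```

The accuracy looks respectable only because the majority class dominates; the per-class recall shows that the minority class is never detected.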

In [None]:
#different purpose of the loan
loan['purpose'].value_counts().sort_values(ascending=False)

In [None]:
#credit policy w.r.t the purpose for which the loan was taken
sns.countplot(x='credit.policy',hue='purpose',data=loan)
plt.title('Plot of credit policy with respect to the loan purpose')
plt.show()

<b> _debt_consolidation has the most customers in both groups: those who meet the credit policy criteria and those who do not. all_other and home improvement are the next most common purposes for both credit policy 0 and 1_</b> 

In [None]:
#visualize interest rate vs the credit policy
sns.violinplot(x='credit.policy', y='int.rate', data=loan,palette='Set2')
plt.show()

In [None]:
#visualize interest rate vs the purpose of the loan
plt.figure(figsize=(12,7))
sns.violinplot(x='purpose', y='int.rate', data=loan,palette='Set1')
plt.title('Plot of the interest rate set against the purpose of the loan')
plt.show()

<b> _The interest rate for the purpose of small_business is highest, followed by debt_consolidation_ </b>

In [None]:
#visualize interest rate vs the purpose of the loan, split by credit policy
plt.figure(figsize=(12,7))
sns.violinplot(x='purpose', y='int.rate', data=loan,hue = 'credit.policy',palette='Set3')
plt.title('Plot of the interest rate set against the purpose of the loan separated by credit policy')
plt.ylabel('Loan Interest rate')
plt.xlabel('Purpose of the loan')
plt.show()

<b> _In this graph we clearly see that no matter what the purpose of the loan is, the interest rate is set higher for the customers who are more risky or do not meet the criteria of the lending club loan policy_</b>

In [None]:
sns.stripplot(x="credit.policy", y="dti", data=loan,jitter=True,hue='purpose',palette='Set2',alpha=0.3)
plt.title('Plot debt to income ratio classified by credit policy')
plt.ylabel('Debt to Income Ratio')
plt.xlabel('Credit Policy')
plt.show()

<b> _The debt to income ratio of the customers who do not meet the criteria is higher compared to those who meet_ </b>

In [None]:
#visualize debt to income ratio vs the purpose of the loan
plt.figure(figsize=(12,7))
sns.violinplot(x='purpose', y='dti', data=loan,hue = 'credit.policy',palette='Set3')
plt.title('Plot of the debt to income ratio set against the purpose of the loan separated by credit policy')
plt.ylabel('Debt to Income Ratio')
plt.xlabel('Purpose of the loan')
plt.show()

<b> _The debt to income ratio of the eligible customers is much higher compared to those who do not meet all the criteria of the lending club loan policy_ </b>

### _Distributions and Statistical Hypothesis Test_

In [None]:
#annual income of the borrowers
plt.hist(loan['log.annual.inc'],bins=30,orientation='vertical')
plt.title('Plot of the annual income of the borrower')
plt.grid()
plt.ylabel('Frequency')
plt.xlabel('Self declared income (log)')
plt.show()

The log of the borrowers' annual income appears to be approximately normally distributed.

In [None]:
#test of normality using the Shapiro-Wilk test
from scipy.stats import shapiro
stats,p = shapiro(loan['log.annual.inc'])
print('p-value of the Shapiro Normality Test {}'.format(p))
if p>0.05:
    print('Probably data is Gaussian')
else:
    print('Probably data is not Gaussian')

In [None]:
#test of normality using D'Agostino's K-squared test
from scipy.stats import normaltest
stats,p = normaltest(loan['log.annual.inc'])
print("p-value of the D'Agostino K-squared Normality Test {}".format(p))
if p>0.05:
    print('Probably data is Gaussian')
else:
    print('Probably data is not Gaussian')

So although the distribution appears to be normal, it fails two very strong statistical tests of normality.
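
One likely reason a bell-shaped histogram can still fail these tests is that, with many observations, they pick up even small departures from a Gaussian. The sketch below illustrates this on synthetic data. The Student-t sample is an assumed stand-in, not the loan data. It also reports skewness and excess kurtosis, which describe how the shape deviates.

```python
import numpy as np
from scipy.stats import normaltest, skew, kurtosis

rng = np.random.default_rng(0)
# Assumed stand-in data: a Student-t sample with 10 degrees of freedom looks
# bell-shaped in a histogram but has slightly heavier tails than a Gaussian.
sample = rng.standard_t(df=10, size=100_000)

stat, p_value = normaltest(sample)
sample_skew = skew(sample)
sample_excess_kurtosis = kurtosis(sample)  # 0 for a perfect Gaussian

print(f"p-value: {p_value:.3g}")
print(f"skewness: {sample_skew:.3f}, excess kurtosis: {sample_excess_kurtosis:.3f}")
```

A rejected normality test on a large sample therefore says little about how far from normal the data is; the shape statistics and a visual check help judge whether the deviation matters in practice.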

In [None]:
loan.head(3)

The feature fico represents the FICO credit score of the borrower. This could be an important criterion for the lending club to decide the credibility of the borrower.

In [None]:
sns.distplot(loan['fico'])
plt.title('Histogram Plot of the fico credit score borrower')
plt.grid()
plt.ylabel('Frequency')
plt.xlabel('fico')
plt.show()

In [None]:
#compare the fico score of the two borrower types 
sns.boxplot(x='credit.policy',y='fico',data=loan)
plt.title('Plot of fico score versus the credit policy of the borrowers')
plt.ylabel('FICO Credit Score')
plt.xlabel('Credit Policy Lending Club')
plt.grid()
plt.show()

The median credit score of the borrowers with credit policy 1 is much higher than that of the borrowers with credit policy 0. At the same time, quite a few FICO scores in the credit policy 0 group lie beyond the IQR. The model might get confused trying to predict the outcome from the FICO score.

In [None]:
#compare the revolving balance of the two borrower types
warnings.filterwarnings(action='ignore',message='')
#log1p stays finite for a zero balance, unlike log
sns.boxplot(x=loan['credit.policy'],y=np.log1p(loan['revol.bal']))
plt.title('Plot of revolving balance')
plt.ylabel('Unpaid credit card balance (log scale)')
plt.xlabel('Credit Policy Lending Club')
plt.grid()
plt.show()

The distributions look similar, but we can check this with a statistical test. Note that the t-test assumes the data to be normally and independently distributed.
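
Since a balance-like variable is usually right-skewed, it can be worth cross-checking the t-test with variants that relax its assumptions. The sketch below uses synthetic log-normal samples (an assumption, not the loan data). It compares Welch's t-test, which drops the equal-variance assumption, with the rank-based Mann-Whitney U test, which does not assume normality.

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(1)
# Assumed stand-in data: two right-skewed (log-normal) samples of unequal size
group_a = rng.lognormal(mean=9.0, sigma=1.0, size=400)
group_b = rng.lognormal(mean=9.0, sigma=1.0, size=1600)

# Welch's t-test: compares means without assuming equal variances
_, p_welch = ttest_ind(group_a, group_b, equal_var=False)
# Mann-Whitney U: compares ranks, so it makes no normality assumption
_, p_mwu = mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"Welch t-test p-value: {p_welch:.3f}")
print(f"Mann-Whitney U p-value: {p_mwu:.3f}")
```

If the two tests disagree, the rank-based result is usually the safer one to trust for heavily skewed data.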

In [None]:
#student t-test of means of the revol.bal for the two borrower class
data_0 = loan[loan['credit.policy']==0]['revol.bal']
data_1 = loan[loan['credit.policy']==1]['revol.bal']

from scipy.stats import ttest_ind
stat,p = ttest_ind(data_0,data_1)
if p > 0.05:
    print('Fail to reject the Null Hypothesis ')
    print('The samples have same mean and probably they are from same distribution')
    
else:
    print('Reject the Null Hypothesis')
    print('The samples have unequal means and probably they are from different distributions')

In [None]:
#check whether the credit policy is related to the number of days the borrower has had a credit line
from scipy.stats import chi2_contingency
#the chi-square test needs a table of counts, so bin the continuous variable into quartiles first
days_binned = pd.qcut(loan['days.with.cr.line'],q=4)
table = pd.crosstab(loan['credit.policy'],days_binned)

stat,p,dof,expected = chi2_contingency(table)

if p > 0.05:
    print('Fail to reject the Null Hypothesis ')
    print('The two variables are independent')
    
else:
    print('Reject the Null Hypothesis')
    print('The two variables are dependent')

In [None]:
loan['delinq.2yrs'].value_counts().sort_values(ascending=False)

In [None]:
#check the correlation of the numeric features with credit policy (purpose is still a string column here)
loan.select_dtypes('number').corr()['credit.policy'].sort_values()

In [None]:
#label encode the purpose
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
loan['purpose'] = encoder.fit_transform(loan['purpose'])

## _Split the data into train and test set_
To keep the class proportions similar in the train and test sets, the split is stratified on the target label, which is credit.policy in this case. If the model performs poorly, we can also try synthetic data augmentation to balance the data.

In [None]:
#set a random state seed and the test size
seed = 51
test_size = 0.2

In [None]:
#import the required libraries
from sklearn.model_selection import train_test_split
train_set,test_set = train_test_split(loan,test_size=test_size,random_state=seed,stratify=loan['credit.policy'])

In [None]:
#check the shape
train_set.shape, test_set.shape

In [None]:
#split into X_train and X_test and y_train and y_test ..data is already shuffled in the previous split
X_train_orig = train_set.drop('credit.policy',axis=1)
y_train_orig = train_set['credit.policy']

X_test_orig = test_set.drop('credit.policy',axis=1)
y_test_orig = test_set['credit.policy']


In [None]:
#check the proportion of the credit policy in the train and test split
train_set['credit.policy'].value_counts()/len(train_set)

In [None]:
test_set['credit.policy'].value_counts()/len(test_set)

Great, the train set is now a good representation of the test set.

In [None]:
#check the shape of the labels / target 
y_train_orig.shape, y_test_orig.shape

In [None]:
#store as array values
X_train = X_train_orig.values
y_train = y_train_orig.values

X_test = X_test_orig.values
y_test = y_test_orig.values

In [None]:
y_test.shape

In [None]:
y_test

This is a rank-1 array, which can cause subtle shape and broadcasting problems during neural network modeling. It is better to reshape it into a column vector.
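
The kind of problem a rank-1 array can cause is easiest to see with broadcasting. The toy arrays below are made up for illustration. Mixing a rank-1 array with a column vector silently produces a matrix instead of an element-wise result.

```python
import numpy as np

y_rank1 = np.array([1, 0, 1])        # rank-1 array, shape (3,)
y_column = y_rank1.reshape(-1, 1)    # column vector, shape (3, 1)

# Broadcasting a rank-1 array against a column vector yields a full matrix
diff_mixed = y_rank1 - y_column
# Matching shapes give the expected element-wise result
diff_same = y_column - y_column

print(y_rank1.shape, y_column.shape)
print(diff_mixed.shape, diff_same.shape)

# scikit-learn estimators expect a 1-D target, so flatten again when needed
y_flat = y_column.ravel()
print(y_flat.shape)
```

The column shape suits the Keras model, while `ravel()` converts back to the 1-D target that scikit-learn estimators expect.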

In [None]:
y_train = y_train.reshape(y_train.shape[0],1)
y_test = y_test.reshape(y_test.shape[0],1)

In [None]:
#check the shape now
y_train.shape, y_test.shape

## _Power Transformation and Scaling_
Apply a power transform featurewise to make data more Gaussian-like.
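
As a quick illustration of what the transform does, the sketch below applies it to a synthetic right-skewed feature and compares the skewness before and after. The log-normal data is an assumption, standing in for something like a balance or income column.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
# Assumed stand-in for a right-skewed feature
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(5000, 1))

# Default settings: Yeo-Johnson method, output standardized to zero mean and unit variance
pt_demo = PowerTransformer()
transformed = pt_demo.fit_transform(skewed)

skew_before = skew(skewed.ravel())
skew_after = skew(transformed.ravel())

print(f"skewness before: {skew_before:.3f}")
print(f"skewness after:  {skew_after:.3f}")
print(f"mean after: {transformed.mean():.3f}, std after: {transformed.std():.3f}")
```

Because the default transformer already standardizes its output, a separate scaling step afterwards mostly acts as a safeguard.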

In [None]:
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
#transform the train and the test set
X_train = pt.fit_transform(X_train)
X_test = pt.transform(X_test)

In [None]:
#Scale the inputs
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## _Dimensionality Reduction_
Reduce the dimensions using PCA
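
When `n_components` is a float between 0 and 1, scikit-learn's PCA keeps the smallest number of components whose cumulative explained variance reaches that fraction. The sketch below shows this on toy data. The 3-factor structure is an assumption made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Assumed toy data: 10 observed features driven by 3 latent factors plus small noise
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

pca_demo = PCA(n_components=0.95)  # keep enough components for 95% of the variance
X_reduced = pca_demo.fit_transform(X_demo)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

print(f"components kept: {pca_demo.n_components_} of {X_demo.shape[1]}")
print(f"cumulative explained variance: {cumulative}")
```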

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95) #retain 95% variability in the data
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

In [None]:
#check number of components 
pca.n_components_

## _Modeling using ML Algos_

In [None]:
#import the model libraries
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
#evaluation metrics
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix

In [None]:
#constructor for the classifiers to be tested
classifiers = {'Logistic Regression':LogisticRegression(),
               'Random Forest':RandomForestClassifier(random_state=seed)}
               

for key,model in classifiers.items():
    model.fit(X_train,y_train.ravel()) #sklearn expects a 1-D target
    train_predict = model.predict(X_train)
    test_predict = model.predict(X_test)
    print('\n')
    print('Model {}'.format(key))
    print('----------------------')
    print('Train Data Recall Score',recall_score(y_train,train_predict))
    print('Test Data Recall Score',recall_score(y_test,test_predict))   
    #print the classification report based on predictions of the test data .. 
    print('\n')
    print(f'{key} Classification Report(Test Data)')
    print('...............................')
    print(classification_report(y_test,test_predict))
    print(confusion_matrix(y_test,test_predict))

The plain vanilla random forest classifier performs better than logistic regression. However, it has also overfitted, as is clear from the recall score on the train data. We could try out a lot of other models, but we will fine tune the random forest classifier, use the performance of the tuned RF model as a baseline, and then try to improve the overall score using a neural network model.

## _Cross validation_
To get a reliable estimate of the average performance of the model, perform cross validation over the entire training set.
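
As a small-scale sketch of how repeated stratified cross validation produces its scores, the example below runs it on a synthetic imbalanced dataset. The data and the 5x2 setup are assumptions chosen to keep it fast. Each repeat reshuffles the data, so the number of scores is `n_splits * n_repeats`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Assumed toy data with roughly 20% negatives and 80% positives
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.2, 0.8], random_state=0
)

# 5 stratified folds repeated 2 times gives 10 recall scores
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring="recall"
)

print(f"number of scores: {len(scores)}")
print(f"mean recall: {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the standard deviation alongside the mean helps judge whether a difference between two models is larger than the fold-to-fold noise.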

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_predict

In [None]:
#define a function to run cross validation across models 
def cross_validate(X = X_train,y = y_train):
    '''Run repeated stratified cross validation on multiple models and print the mean recall score'''
    
    seed = 42
    warnings.filterwarnings(action='ignore',message='')

    models = []
    models.append(('Logistic Regression',LogisticRegression(C=100.0)))
    models.append(('Random Forest',RandomForestClassifier()))
    # * Add more models to compare * #
        
    scoring ='recall'
    #the splitter does not depend on the model, so build it once
    kfold = RepeatedStratifiedKFold(n_splits=10,random_state=seed,n_repeats=10)

    for name,model in models:        
        #sklearn expects a 1-D target, hence ravel()
        cv_results = cross_val_score(model,X,y.ravel(),cv=kfold,scoring=scoring)
        print (f' Model: {name} ,Recall Score: {(np.mean(cv_results))}') 

In [None]:
#check the evaluation metric across the different models using cross validation
cross_validate(X_train,y_train)

## _Grid Search_

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
#initialize the model , set the criterion to be information gain rather than gini impurity
rf_clf = RandomForestClassifier(random_state=seed,criterion='entropy')


In [None]:
param_grid = [{'n_estimators': [300,500,550,600]}]
grid_search = GridSearchCV(rf_clf, param_grid, cv=5,scoring='recall',return_train_score=True)
grid_search.fit(X_train,y_train.ravel())

In [None]:
grid_search.best_estimator_

In [None]:
#instantiate a new model based on the best params from grid search
from sklearn.base import clone
model = clone(grid_search.best_estimator_)
model.fit(X_train,y_train.ravel())
test_pred =  model.predict(X_test)
#print the model evaluation metrics
print('Classification Report on the Test Data')
print(classification_report(y_test,test_pred))
print(confusion_matrix(y_test,test_pred))

The recall score reflects the true positive rate, i.e. how many of the actual positive cases (here credit.policy = 1) the model correctly identifies. The lending club's profit comes from lending money at an interest rate, so the model should identify as many of the positive cases as possible. For borrowers predicted as credit policy 0, the interest rate could be set higher.

The fine tuned RF model provides a recall of 98% for the positive class. Still, 179 instances of class 0 are misclassified as 1. This can be tackled by tuning the decision threshold using precision-recall and ROC curves. However, we will stick with this score, and it will be the baseline for the neural net model to outperform. To re-stress: for a small dataset, as in this case, it is very hard to tell the best performing models apart. Deep learning algorithms tend to show their benefits over traditional ML models on larger datasets.
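
The threshold approach mentioned above can be sketched as follows. This is an illustration on synthetic data with a logistic regression, not the tuned random forest from this notebook, and for brevity it scores on the same data it was fitted on. Raising the threshold for predicting class 1 trades recall for precision.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Assumed toy data with roughly 20% negatives and 80% positives
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.2, 0.8], random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo)[:, 1]  # estimated probability of class 1


def scores_at(threshold):
    """Return (precision, recall) for class 1 at the given probability threshold."""
    pred = (proba >= threshold).astype(int)
    return (
        precision_score(y_demo, pred, zero_division=0),
        recall_score(y_demo, pred),
    )


results = {t: scores_at(t) for t in (0.3, 0.5, 0.7)}
for t, (precision, recall) in results.items():
    print(f"threshold {t}: precision {precision:.3f}, recall {recall:.3f}")
```

In practice the threshold would be chosen on a held-out validation split, so that the test set stays untouched for the final evaluation.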

One thing to note from the confusion matrices below: the model has clearly overfitted. This can be addressed using regularization techniques. One option that is not exercised here is data augmentation, in which we increase the number of minority class instances using ADASYN or SMOTE.
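
For reference, the simplest form of rebalancing is plain random oversampling, sketched below with NumPy on made-up data. Note that this is not SMOTE or ADASYN: it only duplicates existing minority rows, whereas those methods synthesize new points between neighbours.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed toy training data: 80 positives and 20 negatives
X_demo = rng.normal(size=(100, 4))
y_demo = np.array([1] * 80 + [0] * 20)

minority_idx = np.flatnonzero(y_demo == 0)
majority_idx = np.flatnonzero(y_demo == 1)

# Draw minority rows with replacement until both classes have the same count
extra_idx = rng.choice(
    minority_idx, size=len(majority_idx) - len(minority_idx), replace=True
)
balanced_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
X_balanced, y_balanced = X_demo[balanced_idx], y_demo[balanced_idx]

print(np.bincount(y_demo), "->", np.bincount(y_balanced))
```

Any resampling of this kind should be applied to the training split only, otherwise duplicated rows leak into the evaluation data.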

In [None]:
#plot the confusion matrix
plot_confusion_matrix(model,X_test,y_test)
plt.title('Confusion Matrix - Test Dataset')
plt.show()

In [None]:
#plot the confusion matrix
plot_confusion_matrix(model,X_train,y_train)
plt.title('Confusion Matrix - Train Dataset')
plt.show()

## _Model Training using Neural Nets_
Objective: Try to beat the performance of the Random Forest model. Please note that neural networks perform better on much larger datasets; on a relatively small dataset, it is hard to separate the various ML models from each other and from the neural network model.

In [None]:
#import the required tensorflow keras libraries
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.callbacks import EarlyStopping
#keras wrapper for scikit learn to perform cross validation or grid search
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

In [None]:
#define the model --> required for the Keras Classifier class 
def create_model(optimizer='adam',init='glorot_uniform',dropout=0.0):
    model = Sequential()
    #add the layers
    model.add(Dense(units=500,input_dim=X_train.shape[1],activation='relu',kernel_initializer=init))
    model.add(Dense(units=300,activation='relu',kernel_initializer=init))
    model.add(Dropout(dropout))
    model.add(Dense(units=100,activation='relu',kernel_initializer=init))
    model.add(Dropout(dropout))
    model.add(Dense(units=50,activation='relu',kernel_initializer=init))
    model.add(Dropout(dropout))
    model.add(Dense(units=1,activation='sigmoid',kernel_initializer=init))
    #compile the model
    model.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])
    #return the model
    return model   

### _Grid search deep learning model parameters_

In [None]:
#define the early stop criteria
early_stop = EarlyStopping(monitor='val_loss',patience=50,restore_best_weights=True)
#create the model
model = KerasClassifier(build_fn=create_model,epochs=100,batch_size=16,verbose=0)
#parameters to search 
optimizers = ['rmsprop','adam','nadam']
#define the param grid for grid search
param_grid = dict(optimizer=optimizers)
grid = GridSearchCV(estimator=model,param_grid=param_grid,cv=3)

In [None]:
#perform grid search 
grid_result = grid.fit(X_train,y_train)

In [None]:
#print which optimizer performed best during the designated epochs 
print('Best %f using %s' %(grid_result.best_score_,grid_result.best_params_))

In [None]:
#summarize the results
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']

In [None]:
#check the mean score with all the tested optimizers
print('Mean Accuracy:= RMSProp:%.5f, Adam:%.5f, Nadam:%.5f'%(means[0],means[1],means[2]))
print('Standard Dev Accuracy:= RMSProp:%.5f, Adam:%.5f, Nadam:%.5f'%(stds[0],stds[1],stds[2]))

<b> Let's try the nadam optimizer and see whether it alone gives a better solution. In practice, we should trust and go with the best optimizer returned by grid search.</b>

### _Train the model for a longer epoch cycle with early stopping_
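
The idea behind patience-based early stopping can be sketched in plain Python. This is a simplified illustration of the logic, not the Keras implementation, and the loss values are made up. Training stops once the validation loss has failed to improve for `patience` consecutive epochs, and the weights from the best epoch are the ones to keep.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Hypothetical helper for illustration: stops after `patience` consecutive
    epochs without a new best (lowest) validation loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: improvement stalls after epoch 2
losses = [0.60, 0.50, 0.45, 0.47, 0.46, 0.48, 0.49]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(f"stopped at epoch {stop_epoch}, best epoch was {best_epoch}")
```

With a large patience relative to the epoch budget, training runs long past the best epoch, which is why restoring the best weights matters.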


In [None]:
model = create_model(optimizer='nadam')
model.fit(X_train,y_train,epochs=500,callbacks=[early_stop],validation_data=(X_test,y_test),
          verbose=0)

In [None]:
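#predict_classes was removed from recent TensorFlow releases, so threshold the sigmoid output instead
y_pred = (model.predict(X_test) > 0.5).astype('int32')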
y_pred[:10]

In [None]:
#evaluate the model
model_score = model.evaluate(X_test,y_test,verbose=1)
print('%s: %.2f%% ' %(model.metrics_names[1],model_score[1]*100 ))

In [None]:
print(classification_report(y_test,y_pred))

<b> The recall score is the same for the positive class 1, while it has improved by 23% for class 0. Secondly, the accuracy of the model has improved to 93% from the earlier 89% obtained with the fine tuned random forest model. </b>

<b> This is encouraging. Let's try adding some regularization to the model using dropout layers. In the previous run the dropout argument was set to 0, which is as good as no dropout. </b>
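
To show what a dropout layer does during training, here is a minimal NumPy sketch of the common "inverted dropout" formulation. It is an illustration under that assumption, not the Keras source. A random fraction `rate` of activations is zeroed, and the survivors are scaled up so the expected activation stays the same.

```python
import numpy as np


def dropout_forward(activations, rate, rng):
    """Zero out roughly `rate` of the activations and rescale the rest (training mode)."""
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
activations = np.ones((1000, 50))  # made-up layer output
dropped = dropout_forward(activations, rate=0.25, rng=rng)

zero_fraction = float((dropped == 0).mean())
print(f"fraction zeroed: {zero_fraction:.3f}")
print(f"mean activation after dropout: {dropped.mean():.3f}")
```

At prediction time no units are dropped, so the layer passes its input through unchanged.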

In [None]:
model = create_model(optimizer='nadam',dropout=0.25)
history = model.fit(X_train,y_train,epochs=500,callbacks=[early_stop],validation_data=(X_test,y_test),
          verbose=0)

In [None]:
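#predict_classes was removed from recent TensorFlow releases, so threshold the sigmoid output instead
y_pred = (model.predict(X_test) > 0.5).astype('int32')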
y_pred[:10]

In [None]:
#evaluate the model
model_score = model.evaluate(X_test,y_test,verbose=1)
print('%s: %.2f%% ' %(model.metrics_names[1],model_score[1]*100 ))

In [None]:
print(classification_report(y_test,y_pred))

In [None]:
#confusion matrix
print('confusion matrix of the dense neural n/w with dropout enabled')
print(confusion_matrix(y_test,y_pred))

<b> With the addition of dropout layers, the accuracy did not improve, but the recall score has improved for both classes. Further optimization can be done using learning rate decay, momentum optimizers, and wide and deep networks. However, the current deep learning model with dropout regularization seems to be a good overall solution. </b>

In [None]:
#save the model architecture and weights
model.save('loan_risk_dense_model.h5')

## _Please leave your feedback or remark. It will help to improve the notebook. Thank you !_ 