# Loan Pay Back Prediction - Lending Club

 **Notebook Contents**

* Introduction
* Data Overview
* Loading the data and importing relevant libraries
* Exploratory Data Analysis
* Data preprocessing
* Creating the Model
* Evaluating Model Performance


# **Introduction**

LendingClub is the world's largest peer-to-peer lending platform, headquartered in San Francisco, California, United States.

Our goal is to use the provided dataset to build a model that predicts whether or not a borrower will pay back a loan. Such a model can help assess whether new borrowers are likely to repay, based on their historical information.



# Data Overview

----
-----
Here are the details of this dataset:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>LoanStatNew</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>loan_amnt</td>
      <td>The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.</td>
    </tr>
    <tr>
      <th>1</th>
      <td>term</td>
      <td>The number of payments on the loan. Values are in months and can be either 36 or 60.</td>
    </tr>
    <tr>
      <th>2</th>
      <td>int_rate</td>
      <td>Interest Rate on the loan</td>
    </tr>
    <tr>
      <th>3</th>
      <td>installment</td>
      <td>The monthly payment owed by the borrower if the loan originates.</td>
    </tr>
    <tr>
      <th>4</th>
      <td>grade</td>
      <td>LC assigned loan grade</td>
    </tr>
    <tr>
      <th>5</th>
      <td>sub_grade</td>
      <td>LC assigned loan subgrade</td>
    </tr>
    <tr>
      <th>6</th>
      <td>emp_title</td>
      <td>The job title supplied by the Borrower when applying for the loan.*</td>
    </tr>
    <tr>
      <th>7</th>
      <td>emp_length</td>
      <td>Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.</td>
    </tr>
    <tr>
      <th>8</th>
      <td>home_ownership</td>
      <td>The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER</td>
    </tr>
    <tr>
      <th>9</th>
      <td>annual_inc</td>
      <td>The self-reported annual income provided by the borrower during registration.</td>
    </tr>
    <tr>
      <th>10</th>
      <td>verification_status</td>
      <td>Indicates if income was verified by LC, not verified, or if the income source was verified</td>
    </tr>
    <tr>
      <th>11</th>
      <td>issue_d</td>
      <td>The month which the loan was funded</td>
    </tr>
    <tr>
      <th>12</th>
      <td>loan_status</td>
      <td>Current status of the loan</td>
    </tr>
    <tr>
      <th>13</th>
      <td>purpose</td>
      <td>A category provided by the borrower for the loan request.</td>
    </tr>
    <tr>
      <th>14</th>
      <td>title</td>
      <td>The loan title provided by the borrower</td>
    </tr>
    <tr>
      <th>15</th>
      <td>zip_code</td>
      <td>The first 3 numbers of the zip code provided by the borrower in the loan application.</td>
    </tr>
    <tr>
      <th>16</th>
      <td>addr_state</td>
      <td>The state provided by the borrower in the loan application</td>
    </tr>
    <tr>
      <th>17</th>
      <td>dti</td>
      <td>A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.</td>
    </tr>
    <tr>
      <th>18</th>
      <td>earliest_cr_line</td>
      <td>The month the borrower's earliest reported credit line was opened</td>
    </tr>
    <tr>
      <th>19</th>
      <td>open_acc</td>
      <td>The number of open credit lines in the borrower's credit file.</td>
    </tr>
    <tr>
      <th>20</th>
      <td>pub_rec</td>
      <td>Number of derogatory public records</td>
    </tr>
    <tr>
      <th>21</th>
      <td>revol_bal</td>
      <td>Total credit revolving balance</td>
    </tr>
    <tr>
      <th>22</th>
      <td>revol_util</td>
      <td>Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.</td>
    </tr>
    <tr>
      <th>23</th>
      <td>total_acc</td>
      <td>The total number of credit lines currently in the borrower's credit file</td>
    </tr>
    <tr>
      <th>24</th>
      <td>initial_list_status</td>
      <td>The initial listing status of the loan. Possible values are – W, F</td>
    </tr>
    <tr>
      <th>25</th>
      <td>application_type</td>
      <td>Indicates whether the loan is an individual application or a joint application with two co-borrowers</td>
    </tr>
    <tr>
      <th>26</th>
      <td>mort_acc</td>
      <td>Number of mortgage accounts.</td>
    </tr>
    <tr>
      <th>27</th>
      <td>pub_rec_bankruptcies</td>
      <td>Number of public record bankruptcies</td>
    </tr>
  </tbody>
</table>

---
----

# Loading the Data and relevant libraries

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

        
        
loan_data = pd.read_csv('/kaggle/input/lending-club-dataset/lending_club_loan_two.csv')


In [None]:
loan_data.info()

# Exploratory Data Analysis

In this section, we will visualize the data and get a sense of which variables are important.

Since we are predicting loan payback, let's start by plotting the current loan status.

In [None]:
sns.countplot(x='loan_status',data=loan_data)

In [None]:
loan_data['loan_status'].value_counts()

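We see that 318,357 borrowers paid their loans back, which is about 80% of the loans in the dataset. The classes are therefore imbalanced: a model that always predicts "Fully Paid" would already be right about 80% of the time, so any model we build should be judged against that baseline. We want a model that predicts which borrowers will repay.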
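
To make the class balance explicit, the sketch below computes class shares and the accuracy of a trivial majority-class predictor. It runs on a small toy Series (the counts are illustrative, not the real data); in the notebook the same calls would be applied to `loan_data['loan_status']`.

```python
import pandas as pd

# Toy stand-in for loan_data['loan_status']; the counts are illustrative only.
status = pd.Series(["Fully Paid"] * 8 + ["Charged Off"] * 2, name="loan_status")

# Share of each class (fractions that sum to 1).
class_share = status.value_counts(normalize=True)
print(class_share)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = class_share.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Any accuracy figure reported later is only meaningful when compared with this baseline.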

In [None]:
plt.figure(figsize=(12,4))
sns.histplot(loan_data['loan_amnt'],bins=40)  # histplot replaces the deprecated distplot
plt.xlim(0,45000)

The histogram above shows that almost all loans fall under 35,000, with a pronounced peak around 10,000.

In [None]:
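# How the numeric features correlate with one another
# (numeric_only=True skips the text columns, which recent pandas versions no longer ignore silently)
loan_data.corr(numeric_only=True)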

In [None]:
plt.figure(figsize=(12,7))
sns.heatmap(loan_data.corr(numeric_only=True),annot=True,cmap='viridis')
plt.ylim(10, 0)

Installment correlates with the loan amount, so let's explore their relationship. Recall from the data overview that installment is the monthly payment owed by the borrower if the loan originates, while the loan amount is the listed amount applied for by the borrower (lowered if the credit department reduces it). The monthly payment is derived from the amount borrowed, so a close relationship is expected.
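
One way to see why the two columns move together is the standard fixed-payment (amortization) formula. The sketch below assumes the installment follows that textbook formula; the notebook's data does not confirm how LendingClub actually computes it, and `monthly_installment` is a hypothetical helper written for illustration.

```python
def monthly_installment(principal, annual_rate_pct, n_months):
    """Fixed monthly payment for a fully amortizing loan (textbook formula)."""
    r = annual_rate_pct / 100 / 12  # monthly interest rate as a fraction
    if r == 0:
        return principal / n_months  # no interest: spread the principal evenly
    return principal * r / (1 - (1 + r) ** -n_months)

# Example: 10,000 borrowed at 12% per year over 36 months.
payment = monthly_installment(10_000, 12.0, 36)
print(round(payment, 2))

# For a fixed rate and term, the payment scales linearly with the amount borrowed.
print(round(monthly_installment(20_000, 12.0, 36) / payment, 6))
```

Under this assumption the payment is proportional to the principal for a given rate and term, which is consistent with the high correlation in the heatmap.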


In [None]:
sns.scatterplot(x='installment',y='loan_amnt',data=loan_data)

Next, let's look at the relationship between loan amount and loan status.

In [None]:
sns.boxplot(x='loan_status',y='loan_amnt',data=loan_data)

In [None]:
loan_data.groupby('loan_status')['loan_amnt'].describe()

Let's explore the Grade and SubGrade columns that LendingClub attributes to the loans. We will find the unique possible grades and subgrades.

In [None]:
sorted(loan_data['grade'].unique())

In [None]:
sorted(loan_data['sub_grade'].unique())

In [None]:
sns.countplot(x='grade',data=loan_data,hue='loan_status')

Let's display a count plot per subgrade.

In [None]:
plt.figure(figsize=(12,4))
subgrade_order = sorted(loan_data['sub_grade'].unique())
sns.countplot(x='sub_grade',data=loan_data,order = subgrade_order,palette='coolwarm' )

In [None]:
plt.figure(figsize=(12,4))
subgrade_order = sorted(loan_data['sub_grade'].unique())
sns.countplot(x='sub_grade',data=loan_data,order = subgrade_order,palette='coolwarm' ,hue='loan_status')

It seems that loans in grades F and G are paid back less often than the others. Let's isolate those subgrades.

In [None]:
f_and_g = loan_data[(loan_data['grade']=='G') | (loan_data['grade']=='F')]

plt.figure(figsize=(12,4))
subgrade_order = sorted(f_and_g['sub_grade'].unique())
sns.countplot(x='sub_grade',data=f_and_g,order = subgrade_order,hue='loan_status')

Let's create a new column called 'loan_repaid' which will contain a 1 if the loan status was "Fully Paid" and a 0 if it was "Charged Off".

In [None]:
loan_data['loan_status'].unique()

In [None]:
loan_data['loan_repaid'] = loan_data['loan_status'].map({'Fully Paid':1,'Charged Off':0})

In [None]:
loan_data[['loan_repaid','loan_status']]

# Data PreProcessing

In this section, we will handle missing data, convert categorical features, and create dummy variables.

In [None]:
#Let's start by showing sum of missing data
loan_data.isnull().sum()

In [None]:
#By percentage of dataframe, missing data are
100* loan_data.isnull().sum()/len(loan_data)

mort_acc has by far the most missing values of any feature. The emp_title, emp_length, title, revol_util, and pub_rec_bankruptcies columns also have missing data.
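
The percentage of missing values per column can also be computed and ranked in one step. This is a minimal sketch on a toy frame with made-up values; the column names are borrowed from the dataset only for readability.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries (values are invented for illustration).
toy = pd.DataFrame({
    "mort_acc": [1.0, np.nan, np.nan, 0.0],
    "revol_util": [41.8, 53.3, np.nan, 21.5],
    "loan_amnt": [10000, 8000, 15600, 7200],
})

# isnull().mean() gives the fraction of missing rows per column;
# multiplying by 100 and sorting puts the worst columns first.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```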

Let's examine emp_title and emp_length to see whether it is reasonable to drop them. 'emp_title' is the job title supplied by the borrower when applying for the loan.


'emp_length' is employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years. 


In [None]:
loan_data['emp_title'].nunique()

In [None]:
loan_data['emp_title'].value_counts()

As there are too many unique job titles to convert into dummy variables, let's remove the column from our dataset.

In [None]:
loan_data=loan_data.drop('emp_title', axis=1)

In [None]:
# Let's create a count plot of the emp_length column, first sorting the values into order
sorted(loan_data['emp_length'].dropna().unique())

In [None]:
emp_length_order = [ '< 1 year',
                      '1 year',
                     '2 years',
                     '3 years',
                     '4 years',
                     '5 years',
                     '6 years',
                     '7 years',
                     '8 years',
                     '9 years',
                     '10+ years']

In [None]:
plt.figure(figsize=(12,4))

sns.countplot(x='emp_length',data=loan_data,order=emp_length_order)

In [None]:
plt.figure(figsize=(12,4))
sns.countplot(x='emp_length',data=loan_data,order=emp_length_order,hue='loan_status')

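Most borrowers, and therefore most fully paid loans, fall in the 10+ years bucket. Raw counts alone do not tell us whether employment length affects repayment, though, so let's compare charged-off loans with fully paid loans for each employment length.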

In [None]:
emp_co = loan_data[loan_data['loan_status']=="Charged Off"].groupby("emp_length").count()['loan_status']

In [None]:
emp_fp = loan_data[loan_data['loan_status']=="Fully Paid"].groupby("emp_length").count()['loan_status']

In [None]:
emp_len = emp_co/emp_fp
emp_len

In [None]:
emp_len.plot(kind='bar')

This tells us that the ratio of charged-off to fully paid loans is very similar across all employment lengths, so the feature carries little signal. Let's drop it.
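
Note that `emp_co/emp_fp` is the ratio of charged-off to fully paid loans, not the share of all loans that were charged off. If the actual charge-off rate per group is wanted, a sketch along these lines could be used; it runs on toy data, and `charge_off_rate` is a name introduced here for illustration.

```python
import pandas as pd

# Toy data: 2 loans with '< 1 year' and 4 loans with '10+ years' (invented values).
toy = pd.DataFrame({
    "emp_length": ["< 1 year", "< 1 year",
                   "10+ years", "10+ years", "10+ years", "10+ years"],
    "loan_status": ["Charged Off", "Fully Paid",
                    "Charged Off", "Fully Paid", "Fully Paid", "Fully Paid"],
})

# The mean of a boolean flag within each group is the share of charged-off loans.
is_charged_off = toy["loan_status"].eq("Charged Off")
charge_off_rate = is_charged_off.groupby(toy["emp_length"]).mean()
print(charge_off_rate)
```

Both measures rank the groups the same way; the rate is simply bounded between 0 and 1 and easier to read as a probability.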

In [None]:
loan_data=loan_data.drop('emp_length',axis=1)

In [None]:
loan_data.isnull().sum()

Let's review the title and the purpose column. Is there any repeated information?

In [None]:
loan_data['purpose'].head(10)

In [None]:
loan_data['title'].head(10)

It seems that the title column largely repeats the information in the purpose column. Let's drop title.

In [None]:
loan_data=loan_data.drop('title',axis=1)

The mort_acc feature represents the number of mortgage accounts.


In [None]:
loan_data['mort_acc'].value_counts()

Let us deal with its missing data. First, we can check which other features correlate with it.

In [None]:
print("Correlation with the mort_acc column")
loan_data.corr(numeric_only=True)['mort_acc'].sort_values()

It looks like the total_acc feature correlates with mort_acc, which makes sense. Let's use it to fill the gaps: we will group the dataframe by total_acc, calculate the mean mort_acc for each total_acc value, and use that mean wherever mort_acc is missing.

In [None]:
print("Mean of mort_acc column per total_acc")
# Select the column before aggregating so the text columns are not averaged
loan_data.groupby('total_acc')['mort_acc'].mean()

In [None]:
total_acc_avg = loan_data.groupby('total_acc')['mort_acc'].mean()

In [None]:
def fill_mort_acc(total_acc,mort_acc):
    '''
    Accepts the total_acc and mort_acc values for the row.
    Checks if the mort_acc is NaN , if so, it returns the avg mort_acc value
    for the corresponding total_acc value for that row.
    
    total_acc_avg here should be a Series or dictionary containing the mapping of the
    groupby averages of mort_acc per total_acc values.
    '''
    if np.isnan(mort_acc):
        return total_acc_avg[total_acc]
    else:
        return mort_acc

In [None]:
loan_data['mort_acc'] = loan_data.apply(lambda x: fill_mort_acc(x['total_acc'], x['mort_acc']), axis=1)
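
The row-wise `apply` above works but can be slow on a few hundred thousand rows. A vectorized alternative, shown here as a sketch on toy data, fills each missing `mort_acc` with the mean of its `total_acc` group in a single pass.

```python
import numpy as np
import pandas as pd

# Toy frame: two total_acc groups, each with one missing mort_acc (invented values).
toy = pd.DataFrame({
    "total_acc": [10, 10, 10, 20, 20],
    "mort_acc": [1.0, 3.0, np.nan, 4.0, np.nan],
})

# transform('mean') returns one value per row: the NaN-skipping mean of that row's group.
group_mean = toy.groupby("total_acc")["mort_acc"].transform("mean")

# fillna with an aligned Series only touches the missing entries.
toy["mort_acc"] = toy["mort_acc"].fillna(group_mean)
print(toy)
```

On the real data the same two lines would be applied to `loan_data`; any group whose `mort_acc` values are all missing would stay NaN and be removed by the later `dropna()`.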

In [None]:
loan_data.isnull().sum()

revol_util and pub_rec_bankruptcies still have missing data points, but they account for less than 0.5% of the total data. Let's simply drop the rows that contain them.

In [None]:
loan_data=loan_data.dropna()

In [None]:
loan_data.isnull().sum()

Now is the time to deal with categorical values and dummy variables

In [None]:
print('The categorical variables are:')
loan_data.select_dtypes(['object']).columns

Let's convert the term feature into an integer data type with the value 36 or 60.

In [None]:
loan_data['term'].value_counts()

In [None]:
loan_data['term'] = loan_data['term'].apply(lambda term: int(term[:3]))

Since the grade is already contained in sub_grade, let's drop the grade feature.

In [None]:
loan_data=loan_data.drop('grade',axis=1)

Let us now convert the subgrade into dummy variables. Then concatenate these new columns to the original dataframe.

In [None]:
subgrade_dummies = pd.get_dummies(loan_data['sub_grade'],drop_first=True)

In [None]:
loan_data = pd.concat([loan_data.drop('sub_grade',axis=1),subgrade_dummies],axis=1)
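
To illustrate what `drop_first=True` does, here is a small sketch on a toy Series. The first category (alphabetically) is dropped and becomes the baseline, represented by a row of zeros. The `dtype=int` argument is an addition for readability and is not used in the notebook.

```python
import pandas as pd

# Toy subgrades; 'A1' sorts first, so it becomes the dropped baseline category.
sub_grade = pd.Series(["A1", "A2", "B1", "A1"], name="sub_grade")

dummies = pd.get_dummies(sub_grade, drop_first=True, dtype=int)
print(dummies)
```

Dropping one level avoids a redundant column: once every other indicator is known, the baseline is implied.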

In [None]:
loan_data.columns

In [None]:
loan_data.select_dtypes(['object']).columns

In [None]:
dummies = pd.get_dummies(loan_data[['verification_status', 'application_type','initial_list_status','purpose' ]],drop_first=True)
loan_data = loan_data.drop(['verification_status', 'application_type','initial_list_status','purpose'],axis=1)


In [None]:
loan_data = pd.concat([loan_data,dummies],axis=1)

Let us review the value_counts for the home_ownership column.

In [None]:
loan_data['home_ownership'].value_counts()

Let us convert these to dummy variables, but replace NONE and ANY with OTHER, so that we end up with just 4 categories, MORTGAGE, RENT, OWN, OTHER. Then concatenate them with the original dataframe.

In [None]:
loan_data['home_ownership']=loan_data['home_ownership'].replace(['NONE', 'ANY'], 'OTHER')

dummies = pd.get_dummies(loan_data['home_ownership'],drop_first=True)
loan_data = loan_data.drop('home_ownership',axis=1)
loan_data = pd.concat([loan_data,dummies],axis=1)

Let's engineer a zip code feature from the address column in the data set: create a column called 'zip_code' that holds the last five characters of each address.

In [None]:
loan_data['zip_code'] = loan_data['address'].apply(lambda address:address[-5:])

In [None]:
dummies = pd.get_dummies(loan_data['zip_code'],drop_first=True)
loan_data = loan_data.drop(['zip_code','address'],axis=1)
loan_data = pd.concat([loan_data,dummies],axis=1)

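Let us drop the issue date. When the model is used on a new applicant, the loan has not been issued yet, so this feature would not be available and keeping it would leak information.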

In [None]:
loan_data=loan_data.drop('issue_d',axis=1)

earliest_cr_line appears to be a historical time stamp feature. Extract the year from this feature

In [None]:
loan_data['earliest_cr_year'] = loan_data['earliest_cr_line'].apply(lambda date:int(date[-4:]))
loan_data = loan_data.drop('earliest_cr_line',axis=1)
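
An alternative to string slicing is to parse the column as dates, which fails loudly on malformed entries instead of silently producing a wrong year. The sketch below assumes values of the form `Mon-YYYY` (for example `Jun-1990`); the notebook only confirms that the last four characters are the year, so the format string is an assumption.

```python
import pandas as pd

# Toy values in the assumed 'Mon-YYYY' layout.
earliest_cr_line = pd.Series(["Jun-1990", "Jul-2004", "Aug-2007"])

# Parse with an explicit format, then pull out the year component.
parsed = pd.to_datetime(earliest_cr_line, format="%b-%Y")
earliest_cr_year = parsed.dt.year
print(earliest_cr_year.tolist())
```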

In [None]:
loan_data.select_dtypes(['object']).columns

In [None]:
loan_data = loan_data.drop('loan_status',axis=1)

**It's time to split test and train data**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X= loan_data.drop('loan_repaid',axis=1).values
y = loan_data['loan_repaid'].values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=101)

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
X_train = scaler.fit_transform(X_train)

In [None]:
X_test = scaler.transform(X_test)

# Creating the Model

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation,Dropout
from tensorflow.keras.constraints import max_norm

In [None]:
model = Sequential()

# input layer
model.add(Dense(78,  activation='relu'))
model.add(Dropout(0.2))

# hidden layer
model.add(Dense(39, activation='relu'))
model.add(Dropout(0.2))

# hidden layer
model.add(Dense(19, activation='relu'))
model.add(Dropout(0.2))

# output layer
model.add(Dense(units=1,activation='sigmoid'))

# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam')

In [None]:
model.fit(x=X_train, 
          y=y_train, 
          epochs=25,
          batch_size=256,
          validation_data=(X_test, y_test), 
          )

# Evaluating Model Performance

In [None]:
losses = pd.DataFrame(model.history.history)

In [None]:
losses[['loss','val_loss']].plot()

Let us create predictions from the X_test set and display a classification report and confusion matrix for the X_test set.

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
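# predict_classes was removed in TensorFlow 2.6; threshold the sigmoid output at 0.5 instead
predictions = (model.predict(X_test) > 0.5).astype("int32")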

In [None]:
print(classification_report(y_test,predictions))

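The model did well on the test set and achieved good accuracy. Because about 80% of the loans are fully paid, accuracy should be read against that baseline, and the recall for charged-off loans (class 0) deserves particular attention.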

In [None]:
confusion_matrix(y_test,predictions)
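
The sketch below shows, on made-up numbers, how sigmoid outputs are thresholded into class labels and how the confusion matrix is read. The arrays `y_true` and `probs` are hypothetical stand-ins for `y_test` and the output of `model.predict(X_test)`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical true labels (1 = repaid, 0 = charged off) and predicted probabilities.
y_true = np.array([1, 1, 1, 1, 0, 0])
probs = np.array([[0.9], [0.8], [0.6], [0.4], [0.7], [0.2]])  # shape (n, 1), like Keras output

# Probabilities above 0.5 become class 1; ravel() flattens (n, 1) to (n,).
preds = (probs > 0.5).astype("int32").ravel()

# Rows are true classes and columns are predicted classes, in label order [0, 1].
cm = confusion_matrix(y_true, preds)
tn, fp, fn, tp = cm.ravel()
print(cm)

# Share of truly charged-off loans that the model caught.
recall_charged_off = recall_score(y_true, preds, pos_label=0)
print(recall_charged_off)
```

The top-right cell (`fp`) counts charged-off loans that were predicted as repaid, which is typically the costliest kind of error for a lender.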

Let us test the model on a new customer.

In [None]:
import random
random.seed(101)  # fix the seed so the same row is picked on every run
random_ind = random.randint(0,len(loan_data)-1)  # randint includes the upper bound, so subtract 1

new_customer = loan_data.drop('loan_repaid',axis=1).iloc[random_ind]
new_customer

In [None]:
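# Scale the row with the fitted scaler first (the model was trained on scaled data), then threshold
(model.predict(scaler.transform(new_customer.values.reshape(1,78))) > 0.5).astype("int32")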

For this customer, the model predicts that the loan will be repaid.


# Thank you!!

# Acknowledgements

I used some ideas from the Pierian Data course on [Data Science and Machine Learning](https://www.pieriandata.com/p/python-for-data-science-and-machine-learning-bootcamp).