## The Data

**Please note: This dataset was part of my Python for Data Science and Machine Learning Bootcamp. I created this notebook to solve the problem posed below and to build a model.**

I will be using a subset of the LendingClub DataSet obtained from Kaggle: https://www.kaggle.com/wordsforthewise/lending-club

LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. LendingClub is the world's largest peer-to-peer lending platform.

Given historical data on loans, including whether or not the borrower defaulted (charged off), can we build a model that predicts whether or not a borrower will pay back their loan? That way, when we get a new potential customer in the future, we can assess whether they are likely to repay. Keep classification metrics in mind when evaluating the performance of your model!

The "loan_status" column contains our label.

### Data Overview

----
-----
There are many LendingClub data sets on Kaggle. Here is the information on this particular data set:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>LoanStatNew</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>loan_amnt</td>
      <td>The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.</td>
    </tr>
    <tr>
      <th>1</th>
      <td>term</td>
      <td>The number of payments on the loan. Values are in months and can be either 36 or 60.</td>
    </tr>
    <tr>
      <th>2</th>
      <td>int_rate</td>
      <td>Interest Rate on the loan</td>
    </tr>
    <tr>
      <th>3</th>
      <td>installment</td>
      <td>The monthly payment owed by the borrower if the loan originates.</td>
    </tr>
    <tr>
      <th>4</th>
      <td>grade</td>
      <td>LC assigned loan grade</td>
    </tr>
    <tr>
      <th>5</th>
      <td>sub_grade</td>
      <td>LC assigned loan subgrade</td>
    </tr>
    <tr>
      <th>6</th>
      <td>emp_title</td>
      <td>The job title supplied by the Borrower when applying for the loan.*</td>
    </tr>
    <tr>
      <th>7</th>
      <td>emp_length</td>
      <td>Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.</td>
    </tr>
    <tr>
      <th>8</th>
      <td>home_ownership</td>
      <td>The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER</td>
    </tr>
    <tr>
      <th>9</th>
      <td>annual_inc</td>
      <td>The self-reported annual income provided by the borrower during registration.</td>
    </tr>
    <tr>
      <th>10</th>
      <td>verification_status</td>
      <td>Indicates if income was verified by LC, not verified, or if the income source was verified</td>
    </tr>
    <tr>
      <th>11</th>
      <td>issue_d</td>
      <td>The month which the loan was funded</td>
    </tr>
    <tr>
      <th>12</th>
      <td>loan_status</td>
      <td>Current status of the loan</td>
    </tr>
    <tr>
      <th>13</th>
      <td>purpose</td>
      <td>A category provided by the borrower for the loan request.</td>
    </tr>
    <tr>
      <th>14</th>
      <td>title</td>
      <td>The loan title provided by the borrower</td>
    </tr>
    <tr>
      <th>15</th>
      <td>zip_code</td>
      <td>The first 3 numbers of the zip code provided by the borrower in the loan application.</td>
    </tr>
    <tr>
      <th>16</th>
      <td>addr_state</td>
      <td>The state provided by the borrower in the loan application</td>
    </tr>
    <tr>
      <th>17</th>
      <td>dti</td>
      <td>A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income.</td>
    </tr>
    <tr>
      <th>18</th>
      <td>earliest_cr_line</td>
      <td>The month the borrower's earliest reported credit line was opened</td>
    </tr>
    <tr>
      <th>19</th>
      <td>open_acc</td>
      <td>The number of open credit lines in the borrower's credit file.</td>
    </tr>
    <tr>
      <th>20</th>
      <td>pub_rec</td>
      <td>Number of derogatory public records</td>
    </tr>
    <tr>
      <th>21</th>
      <td>revol_bal</td>
      <td>Total credit revolving balance</td>
    </tr>
    <tr>
      <th>22</th>
      <td>revol_util</td>
      <td>Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.</td>
    </tr>
    <tr>
      <th>23</th>
      <td>total_acc</td>
      <td>The total number of credit lines currently in the borrower's credit file</td>
    </tr>
    <tr>
      <th>24</th>
      <td>initial_list_status</td>
      <td>The initial listing status of the loan. Possible values are – W, F</td>
    </tr>
    <tr>
      <th>25</th>
      <td>application_type</td>
      <td>Indicates whether the loan is an individual application or a joint application with two co-borrowers</td>
    </tr>
    <tr>
      <th>26</th>
      <td>mort_acc</td>
      <td>Number of mortgage accounts.</td>
    </tr>
    <tr>
      <th>27</th>
      <td>pub_rec_bankruptcies</td>
      <td>Number of public record bankruptcies</td>
    </tr>
  </tbody>
</table>

---
----

## Loading data and imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import random

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout

from sklearn.metrics import classification_report, confusion_matrix

In [None]:
df = pd.read_csv('../input/lendingclub-data-sets/lending_club_loan_two.csv')

In [None]:
df.info()

In [None]:
data_info = pd.read_csv('../input/lendingclub-data-sets/lending_club_info.csv',index_col='LoanStatNew')

In [None]:
#Create a function to read the description from the info .csv
def feat_info(col_name):
    print(data_info.loc[col_name]['Description'])

In [None]:
feat_info('mort_acc')

# Section 1: Exploratory Data Analysis

**OVERALL GOAL: Get an understanding of which variables are important, view summary statistics, and visualize the data**


----

In [None]:
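# Pass the column by keyword; newer seaborn versions no longer treat a positional Series as x
sns.countplot(x='loan_status', data=df)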
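
If the count plot shows that one class dominates, it helps to put a number on it. The sketch below uses a small made-up Series (`toy_status` is hypothetical, not the real data) to compute the class shares and the accuracy of always predicting the majority class, which is the baseline any model should beat.

```python
import pandas as pd

# Hypothetical toy labels standing in for df['loan_status'] (not the real data).
toy_status = pd.Series(['Fully Paid'] * 8 + ['Charged Off'] * 2)

# Share of each class; normalize=True returns fractions instead of counts.
class_share = toy_status.value_counts(normalize=True)

# Accuracy of a "model" that always predicts the most common class.
baseline_accuracy = class_share.max()

print(class_share)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

On the real data, the same two lines applied to `df['loan_status']` would give the baseline to compare the final accuracy against.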

**TASK: Create a histogram of the loan_amnt column.**

In [None]:
sns.histplot(df['loan_amnt'], bins =30)

**Calculate the correlation between continuous features to spot potential relationships between variables**

In [None]:
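# Restrict to numeric columns; string columns cannot be correlated
df.select_dtypes('number').corr()

In [None]:
plt.figure(figsize=(12,7))
sns.heatmap(df.select_dtypes('number').corr(), annot=True)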

**There is a high correlation between installment and loan_amnt. Let's check whether that makes sense.**

In [None]:
feat_info('installment')

In [None]:
feat_info('loan_amnt')

In [None]:
sns.scatterplot(data=df, x='installment',y='loan_amnt')
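
A strong link between the two is what one would expect for a standard fixed-rate amortizing loan, where the monthly payment is proportional to the principal for a given rate and term. The sketch below uses the textbook annuity formula; the function name `monthly_installment` is hypothetical, and whether LendingClub computes `installment` exactly this way is an assumption.

```python
def monthly_installment(principal, annual_rate_pct, n_months):
    """Monthly payment for a fixed-rate amortizing loan (textbook annuity formula)."""
    r = annual_rate_pct / 100 / 12  # monthly interest rate as a fraction
    if r == 0:
        return principal / n_months  # no interest: spread the principal evenly
    return principal * r / (1 - (1 + r) ** -n_months)

# With rate and term fixed, doubling the principal doubles the payment.
small = monthly_installment(10_000, 12.0, 36)
large = monthly_installment(20_000, 12.0, 36)
print(round(small, 2), round(large, 2))
```

Under this formula the scatter would spread only because of differences in interest rate and term, which is consistent with a high but not perfect correlation.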

**The relationship between the loan_status and the Loan Amount.**

In [None]:
sns.boxplot(data =df,x='loan_status', y='loan_amnt')

**Summary statistics for the loan amount, grouped by the loan_status.**

In [None]:
df.groupby('loan_status')['loan_amnt'].describe()

**Grade and SubGrade**

In [None]:
df['grade'].unique()

In [None]:
sort = df['sub_grade'].sort_values().unique()
sort

In [None]:
sns.countplot(data=df,x='grade',hue='loan_status')

**Count plot per subgrade**

In [None]:
plt.figure(figsize = (12,4))
sns.countplot(data=df, x='sub_grade', order=sort)

**Count plot per subgrade, with loan status as hue**

In [None]:
plt.figure(figsize = (12,4))
sns.countplot(data=df, x='sub_grade', order=sort, hue = 'loan_status', alpha = .5)

**F and G subgrades don't get paid back that often. Let's look at them more closely.**

In [None]:
FandG = df[(df['grade']=='G') | (df['grade']=='F')]

In [None]:
plt.figure(figsize=(12,4))
subgrade_order = sorted(FandG['sub_grade'].unique())
sns.countplot(x='sub_grade',data=FandG,order = subgrade_order,hue='loan_status')

**Let's turn our label (loan_status) into a binary target**

In [None]:
# Vectorized comparison: 1 for 'Fully Paid', 0 for anything else
df['loan_repaid'] = (df['loan_status'] == 'Fully Paid').astype(int)

In [None]:
df[['loan_repaid','loan_status']]

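**Let's look at how our loan_repaid column correlates with the other numeric features**

In [None]:
# Numeric columns only; drop the trivial correlation of loan_repaid with itself
df.select_dtypes('number').corrwith(df['loan_repaid']).drop('loan_repaid').sort_values()

In [None]:
df.select_dtypes('number').corrwith(df['loan_repaid']).drop('loan_repaid').sort_values().plot.bar()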

---
---
# Section 2: Data PreProcessing

**Section Goals: Remove or fill any missing data. Remove unnecessary or repetitive features. Convert categorical string features to dummy variables.**



In [None]:
df.head()

# Missing Data

**Let's explore the columns with missing data. We will use a variety of factors to decide whether they are useful, and whether we should keep, discard, or fill in the missing data.**

In [None]:
len(df)

**Let's look at the total count of missing values per column.**

In [None]:
df.isnull().sum()

**The same counts as a percentage of the total number of rows**

In [None]:
df.isnull().mean()*100

**The columns with missing values are emp_title, emp_length, title, revol_util, mort_acc, and pub_rec_bankruptcies. Let's take a look at them.**

**Let's examine emp_title and emp_length**

In [None]:
# feat_info prints the description itself and returns None, so call it directly
feat_info('emp_title')
print("\n")
feat_info('emp_length')

**How many unique employment job titles are there?**

In [None]:
df['emp_title'].nunique()

In [None]:
df['emp_title'].value_counts()

**Realistically there are too many unique job titles to try to convert this to a dummy variable feature. Let's remove that emp_title column.**

In [None]:
df = df.drop('emp_title',axis = 1)

**Now emp_length**

In [None]:
sorted(df['emp_length'].dropna().unique())

In [None]:
emp_length_order = ['< 1 year',
                    '1 year',
                    '2 years',
                    '3 years',
                    '4 years',
                    '5 years',
                    '6 years',
                    '7 years',
                    '8 years',
                    '9 years',
                    '10+ years']

In [None]:
plt.figure(figsize=(12,4))
sns.countplot(x='emp_length',data=df,order=emp_length_order)

**Let's see the countplot with respect to our classification**

In [None]:
plt.figure(figsize=(12,4))
sns.countplot(x='emp_length',data=df,order=emp_length_order,hue='loan_status')

**This doesn't tell us much yet. Let's look at the ratio of charged-off to fully paid loans per category to see if that makes a difference.**

In [None]:
a = df[df['loan_status'] == 'Charged Off'].groupby('emp_length').count()['loan_status']

In [None]:
b = df[df['loan_status'] == 'Fully Paid'].groupby('emp_length').count()['loan_status']

In [None]:
c = a/b

In [None]:
c

In [None]:
c.plot.bar()

**Charge-off rates are extremely similar across all employment lengths, so this feature would have little if any effect on our model. There is no need to keep it and manage its null values. Let's drop the column.**
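
Note that dividing the charged-off counts by the fully paid counts, as `c = a/b` does, gives a ratio (odds) rather than a share of all loans. A minimal sketch of the share-based rate follows, on hypothetical toy data (`toy` is made up for illustration); either measure supports the same comparison across groups.

```python
import pandas as pd

# Hypothetical toy data standing in for the emp_length and loan_status columns.
toy = pd.DataFrame({
    'emp_length': ['1 year', '1 year', '1 year', '1 year',
                   '10+ years', '10+ years', '10+ years', '10+ years'],
    'loan_status': ['Charged Off', 'Fully Paid', 'Fully Paid', 'Fully Paid',
                    'Charged Off', 'Fully Paid', 'Fully Paid', 'Fully Paid'],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the share of loans charged off within each group.
charged_off = toy['loan_status'] == 'Charged Off'
charge_off_rate = charged_off.groupby(toy['emp_length']).mean()

print(charge_off_rate)
```

In this toy example each group has a rate of 0.25, whereas the charged-off to fully paid ratio would be 1/3.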

In [None]:
df = df.drop('emp_length',axis =1)

**By transposing, we can see all columns a little more clearly**

In [None]:
df.transpose()

**Title column vs the purpose column**

In [None]:
df['purpose'].head(10)

In [None]:
df['title'].head(10)

**The title column seems to be a free-text subcategory/description of the purpose column, so let's drop it as well.**

In [None]:
df = df.drop('title', axis = 1)

**mort_acc**

In [None]:
feat_info('mort_acc')

In [None]:
df['mort_acc'].nunique()

**TASK: Create a value_counts of the mort_acc column.**

In [None]:
df['mort_acc'].value_counts()

**Let's review the other columns to see which most highly correlates to mort_acc**

In [None]:
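# Numeric columns only; string columns cannot be correlated
df.select_dtypes('number').corrwith(df['mort_acc']).sort_values()

**The total_acc feature correlates with mort_acc. Let's group the DataFrame by total_acc and calculate the mean mort_acc for each total_acc value.**

In [None]:
# Select the column before aggregating so non-numeric columns are never averaged
df.groupby('total_acc')['mort_acc'].mean()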

**Let's fill in the missing mort_acc values based on their total_acc value. If the mort_acc is missing, then we will fill in that missing value with the mean value corresponding to its total_acc value.**


In [None]:
# Select the column before aggregating so non-numeric columns are never averaged
total_acc_avg = df.groupby('total_acc')['mort_acc'].mean()

In [None]:
def fill_mort_acc(total_acc,mort_acc):
    '''
    Accepts the total_acc and mort_acc values for the row.
    
    Checks if the mort_acc is NaN , if so, it returns the avg mort_acc value
    for the corresponding total_acc value for that row.
    '''
    if np.isnan(mort_acc):
        return total_acc_avg[total_acc]
    else:
        return mort_acc

In [None]:
df['mort_acc'] = df.apply(lambda x: fill_mort_acc(x['total_acc'], x['mort_acc']), axis=1)
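
The same group-mean fill can also be written without a row-wise `apply`. The sketch below is one possible vectorized alternative, shown on hypothetical toy data (`toy` and `group_mean` are made-up names); it relies on `groupby(...).transform('mean')`, which returns the group mean aligned to every original row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real DataFrame has many more rows and columns.
toy = pd.DataFrame({
    'total_acc': [10, 10, 10, 20, 20],
    'mort_acc':  [1.0, 3.0, np.nan, 4.0, np.nan],
})

# Mean mort_acc within each total_acc group, aligned to the original rows.
# NaN values are skipped when the mean is computed.
group_mean = toy.groupby('total_acc')['mort_acc'].transform('mean')

# Only the missing entries are replaced; observed values are left untouched.
toy['mort_acc'] = toy['mort_acc'].fillna(group_mean)

print(toy)
```

A group whose `mort_acc` values are all missing would stay NaN under either approach, so a later `dropna()` is still needed.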

In [None]:
df.isnull().sum()

**Not many missing entries are left, but let's see what percentage of the rows still contain missing values**

In [None]:
df.isnull().sum()/len(df) * 100

**revol_util and pub_rec_bankruptcies have missing data points, but they account for less than 0.2% of the total data, so I believe it is best to remove those rows.**

In [None]:
df = df.dropna()

In [None]:
df.isnull().sum()

## Categorical Variables and Dummy Variables

**Now it's time to deal with the string values in the categorical columns.**

**Let's take a look at the non-numeric columns of the df**

In [None]:
df.select_dtypes(['object']).columns

In [None]:
df[df.select_dtypes(['object']).columns]

In [None]:
# Print the value counts of every string column
for col in df.select_dtypes(['object']).columns:
    print(col)
    print(df[col].value_counts())
    print('\n')

**Let's now go through all the string features to see what we should do with them.**  



**verification_status, application_type, initial_list_status, purpose**   
**These can all instantly become dummy variables**

In [None]:
dummies = pd.get_dummies(df[['verification_status', 'application_type','initial_list_status','purpose' ]],drop_first=True)
df = df.drop(['verification_status', 'application_type','initial_list_status','purpose'],axis=1)
df = pd.concat([df,dummies],axis=1)
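
To make the effect of `drop_first=True` concrete, here is a small sketch on a hypothetical one-column frame (`toy` is made up; the category names are taken from the column description above). With three categories, two dummy columns are enough, because the dropped category is implied when both are zero.

```python
import pandas as pd

# Hypothetical toy column with three categories.
toy = pd.DataFrame({
    'verification_status': ['Verified', 'Not Verified', 'Source Verified', 'Verified']
})

# One dummy column per category.
full = pd.get_dummies(toy, drop_first=False)

# Drop the first category (in sorted order) to avoid a redundant column.
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level per feature keeps the dummy columns from being perfectly collinear, at the cost of making the dropped level the implicit reference category.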

**term**

In [None]:
df['term'].value_counts()

**We can convert this into numeric variables**

In [None]:
df['term'] = df['term'].apply(lambda term: int(term[:3]))

In [None]:
df['term'].head()

**grade**

**We already know grade is part of sub_grade so we can drop this grade feature and use the sub_grade feature instead**

In [None]:
df = df.drop('grade', axis =1 )

In [None]:
subgrade_dummies = pd.get_dummies(df['sub_grade'],drop_first=True)


In [None]:
df = pd.concat([df.drop('sub_grade',axis=1),subgrade_dummies],axis=1)

**home_ownership**

In [None]:
df['home_ownership'].value_counts()

**Let's convert this to dummy variables, but first let's merge the three rarest values (OTHER, NONE, and ANY) into OTHER so that we have only 4 categories**

In [None]:
df['home_ownership']=df['home_ownership'].replace(['NONE', 'ANY'], 'OTHER')
dummies = pd.get_dummies(df['home_ownership'],drop_first=True)
df = df.drop('home_ownership',axis=1)
df = pd.concat([df,dummies],axis=1)

**address**  
**To help our model, we can use the zip code (the last five characters) instead of the full address**

In [None]:
# Avoid naming the lambda argument `zip`, which shadows the built-in function
df['zipcode'] = df['address'].apply(lambda address: address[-5:])

In [None]:
df['zipcode'].value_counts()

**All of them seem significant to me, so let's create dummy variables and remove the address column**

In [None]:
dummies = pd.get_dummies(df['zipcode'],drop_first=True)
df = df.drop(['zipcode','address'],axis=1)
df = pd.concat([df,dummies],axis=1)

**issue_d** 



In [None]:
feat_info('issue_d')

**This would be data leakage: when using our model, we wouldn't know beforehand whether or not a loan will be issued, so in theory we wouldn't have an issue date. Let's drop this feature.**

In [None]:
df = df.drop('issue_d', axis = 1)

**earliest_cr_line**


In [None]:
df['earliest_cr_line']

**This looks like a date (time series) feature, in which case let's just use the year**

In [None]:
df['earliest_cr_year'] = df['earliest_cr_line'].apply(lambda date:int(date[-4:]))
df = df.drop('earliest_cr_line',axis=1)

**loan_status**  
**Because we already created our loan_repaid column, we could now drop the loan_status column as we have our 0 and 1 binary classification already**

In [None]:
df = df.drop('loan_status', axis = 1 )

In [None]:
df.select_dtypes(['object']).columns

Now that we have taken care of all non-numeric columns, we can continue.

---
---
# Section 3: Creating the model

**Section Goals: Split, Normalize, and Create a model for our data**

## Train Test Split

In [None]:
X = df.drop('loan_repaid',axis=1).values
y = df['loan_repaid'].values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=101)
#Using random_state 101 in order to compare answers

## Normalizing the Data

**Let's use a MinMaxScaler to normalize the feature data X_train and X_test. To prevent data leakage, we fit only on the X_train data.**

In [None]:
scaler = MinMaxScaler()

In [None]:
X_train = scaler.fit_transform(X_train)

In [None]:
X_test = scaler.transform(X_test)
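
The following sketch illustrates what fitting only on the training data means in practice. The arrays and the `scaler_toy` name are hypothetical; the point is that the minimum and maximum are learned from the training set alone, so test values can land outside the [0, 1] range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-feature data.
X_train_toy = np.array([[1.0], [3.0], [5.0]])
X_test_toy = np.array([[2.0], [7.0]])

scaler_toy = MinMaxScaler()

# fit_transform learns min=1 and max=5 from the training data only.
X_train_scaled = scaler_toy.fit_transform(X_train_toy)

# transform reuses those training statistics; nothing is learned from the test set.
X_test_scaled = scaler_toy.transform(X_test_toy)

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```

The test value 7 scales to 1.5 because it exceeds the training maximum, which is expected and harmless for this kind of model.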

## Creating the Model

**Let's use a Sequential model**

In [None]:
model = Sequential()

model.add(Dense(78, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(39, activation = 'relu'))
model.add(Dropout(0.5))
model.add(Dense(19, activation = 'relu'))
model.add(Dropout(0.5))
# https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
model.add(Dense(1, activation = 'sigmoid'))

model.compile(optimizer = 'adam', loss = 'binary_crossentropy')

In [None]:
model.fit(X_train,y_train, epochs = 25, validation_data = (X_test,y_test),
         batch_size = 256)
# Note: the test set doubles as validation data here; a separate validation split is planned for later revisions
# Early stopping could also be used
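
One way to get a separate validation set, as the comments above suggest, is to split twice. The sketch below uses made-up arrays and hypothetical `*_toy` names; the 60/20/20 proportions are an assumption, not something the notebook prescribes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 2 features, balanced binary labels.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0, 1] * 50)

# First split: hold out 20% as the final test set.
X_rest, X_test_toy, y_rest, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=101)

# Second split: 25% of the remaining 80% becomes the validation set (20% overall).
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=101)

print(len(X_train_toy), len(X_val_toy), len(X_test_toy))
```

The validation arrays would then be passed as `validation_data` to `model.fit`, leaving the test set untouched until the final evaluation.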

# Section 4: Evaluating Model Performance

**Section Goals: Plot out the validation loss versus the training loss.**

In [None]:
model_loss = pd.DataFrame(model.history.history)
model_loss.plot()

**Create predictions from the X_test set and display a classification report and confusion matrix for the X_test set.**

In [None]:
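# predict_classes was removed in newer TensorFlow releases; threshold the sigmoid output instead
predictions = (model.predict(X_test) > 0.5).astype('int32')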

In [None]:
print(classification_report(y_test, predictions))
print(confusion_matrix(y_test,predictions))
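
A sigmoid output layer produces probabilities, so class labels come from thresholding them. The sketch below shows that step on made-up numbers (all `*_toy` names and the 0.5 threshold are assumptions for illustration) and how precision and recall for the positive class follow from the confusion matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predicted probabilities of class 1.
y_true_toy = np.array([1, 1, 1, 1, 0, 0])
probs_toy = np.array([0.9, 0.8, 0.7, 0.4, 0.6, 0.2])

# Probabilities above the threshold become class 1, the rest class 0.
preds_toy = (probs_toy > 0.5).astype(int)

# Rows are true classes, columns are predicted classes (order: 0, 1).
cm = confusion_matrix(y_true_toy, preds_toy)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(cm)
print(precision, recall)
```

With imbalanced classes, the recall of the minority class is usually the number to watch, since overall accuracy can look good even when most defaults are missed.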

**Let's check with a random person**

In [None]:
random.seed(101)
# randint includes both endpoints, so the upper bound must be len(df) - 1
random_ind = random.randint(0, len(df) - 1)

new_customer = df.drop('loan_repaid',axis=1).iloc[random_ind]
new_customer

In [None]:
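# Scale the row with the fitted scaler (the model was trained on scaled data),
# then threshold the predicted probability; predict_classes no longer exists in newer TensorFlow
new_customer_scaled = scaler.transform(new_customer.values.astype(float).reshape(1, 78))
(model.predict(new_customer_scaled) > 0.5).astype('int32')

**Did this person actually end up paying back their loan?**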

In [None]:
df.iloc[random_ind]['loan_repaid']