## Assessing the likelihood of loan repayment for a peer-to-peer lending company

## The Data
We will be using a subset of the LendingClub DataSet obtained from Kaggle: https://www.kaggle.com/hadiyad/lendingclub-data-sets

LendingClub is a US peer-to-peer lending company.

### Goal

Given historical data on loans, including whether or not the borrower defaulted (charge-off), can we build a model that predicts whether a borrower will pay back their loan? That way, when the company gets a new potential customer in the future, it can assess whether they are likely to pay back the loan.

The "loan_status" column contains the desired label.



#### Importing the necessary libraries and loading the info file, which has the description for each column in the original dataset

In [None]:
#Importing the basic libraries needed for EDA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Render plots inline in the notebook output.
%matplotlib inline

# Column descriptions, indexed by column name.
data_info = pd.read_csv('../input/lendingclub-data-sets/lending_club_info.csv',index_col='LoanStatNew')

# **Data Overview**
#### There are many LendingClub data sets on Kaggle. Let's have a look at the information on this particular data set:

In [None]:
# aligning the data towards left to get full view of the description column 
data_info.style.set_properties(**{'text-align': 'left'})

In [None]:
print(data_info.loc['revol_util']['Description'])

In [None]:
def feat_info(col_name):
    """Print the description of a column from the info file."""
    print(data_info.loc[col_name]['Description'])

feat_info('mort_acc')

### **Loading the dataset**

In [None]:
df = pd.read_csv('../input/lendingclub-data-sets/lending_club_loan_two.csv')

In [None]:
df.info()

## Exploratory Data Analysis

We want an overall understanding of the features, and of which ones are likely to matter, by viewing summary statistics and visualizing the data.

----

**Since we will be attempting to predict loan_status, let's start with a simple countplot of it.**

In [None]:
sns.countplot(x = "loan_status", data = df)
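The countplot shows the class balance visually; the same information can be read as numbers. A minimal sketch on a made-up frame (the `toy` DataFrame and its values are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for df; the values are invented for illustration only.
toy = pd.DataFrame({"loan_status": ["Fully Paid"] * 8 + ["Charged Off"] * 2})

# Share of each class; normalize=True returns fractions instead of counts.
class_share = toy["loan_status"].value_counts(normalize=True)
print(class_share)
```

When one class dominates like this, accuracy alone can look good even for a model that rarely predicts the minority class, which is worth keeping in mind for the evaluation later on.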

**Let's create a histogram of the loan_amnt column.**

In [None]:
fig,ax = plt.subplots(figsize = (20,10))
g = sns.histplot(ax = ax,x = 'loan_amnt', data = df,bins = 45)

**Let's explore correlation between the continuous feature variables. This can be done by calling the method corr()**

In [None]:
# numeric_only=True skips the string columns (needed in pandas 2.0 and later)
df.corr(numeric_only=True)

**TASK: Let's visualize this using a heatmap.**

* [Heatmap info](https://seaborn.pydata.org/generated/seaborn.heatmap.html#seaborn.heatmap)
* [Help with resizing](https://stackoverflow.com/questions/56942670/matplotlib-seaborn-first-and-last-row-cut-in-half-of-heatmap-plot)

In [None]:
fig,ax = plt.subplots(figsize = (12,10))
sns.heatmap(ax = ax, data = df.corr(numeric_only=True), annot = True, cmap = 'coolwarm')

**We can see an almost perfect correlation between the "loan_amnt" and "installment" features. Let's explore this further. Print out their descriptions and create a scatterplot between them. Does this relationship make sense to you? Do you think there is duplicate information here?**

In [None]:
sns.scatterplot(x = 'installment', y = 'loan_amnt', data = df)
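One way to see why the two columns move together: for a fixed-rate, fully amortizing loan, the monthly payment is proportional to the amount borrowed. The sketch below uses the standard annuity formula; that LendingClub computes installments exactly this way is an assumption, and the function name `monthly_installment` and the example numbers are hypothetical.

```python
def monthly_installment(principal, annual_rate, n_months):
    """Fixed monthly payment of a fully amortizing loan (standard annuity formula)."""
    r = annual_rate / 12  # monthly interest rate
    if r == 0:
        return principal / n_months
    return principal * r / (1 - (1 + r) ** -n_months)

# Example: 10,000 borrowed at 12% per year over 36 months.
payment = monthly_installment(10_000, 0.12, 36)
print(round(payment, 2))

# Doubling the principal doubles the payment for the same rate and term.
double_payment = monthly_installment(20_000, 0.12, 36)
```

Under this formula the installment is a deterministic function of loan amount, interest rate and term, so the near-perfect correlation is expected rather than surprising.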

**Displaying the boxplot showing the relationship between the loan_status and the Loan Amount.**

In [None]:
sns.boxplot(x = 'loan_status', y = 'loan_amnt', data = df)

**The summary statistics for the loan amount, grouped by the loan_status.**

In [None]:
df.groupby('loan_status')['loan_amnt'].describe()

**Let's explore the Grade and SubGrade columns that LendingClub attributes to the loans. What are the unique possible grades and subgrades?**

In [None]:
print(np.sort(df.grade.unique()))

In [None]:
sub_grade_keys = np.sort(df.sub_grade.unique())
print(sub_grade_keys)

**Let's create a countplot per grade. Set the hue to the loan_status label.**

In [None]:
sns.countplot(x = 'grade', data = df, hue = 'loan_status')

**Display a count plot per subgrade. Let's first look at all loans made per subgrade, and then at the same plot with hue set to "loan_status" to separate the loans by status.**

In [None]:
fig, ax = plt.subplots(figsize = (15,7))
sorted_df = df.sort_values(by = ['sub_grade'])
sns.countplot(x = 'sub_grade',data = sorted_df, ax = ax)

In [None]:
fig, ax = plt.subplots(figsize = (15,7))
sns.countplot(x = 'sub_grade',data = sorted_df, ax = ax, hue = 'loan_status')

**It looks like F and G subgrades don't get paid back that often. Let's isolate those and recreate the countplot just for those subgrades.**

**Grade values before F and G isolation**

In [None]:
print(df['grade'].value_counts())

**Grade values after F and G isolation**

In [None]:
sorted_df['grade'] = sorted_df['grade'].apply(lambda x : "F and G" if x in ['F','G'] else x)
df['grade'] = df['grade'].apply(lambda x : "F and G" if x in ['F','G'] else x)
print(sorted_df['grade'].value_counts())

**Countplot of Grade after F and G isolation**


In [None]:
fig, ax = plt.subplots(figsize = (15,7))
sns.countplot(x = 'grade',data = sorted_df, ax = ax, hue = 'loan_status')

In [None]:
fig, ax = plt.subplots(figsize = (15,7))
sns.countplot(data = sorted_df.loc[sorted_df.grade == 'F and G'][['sub_grade','loan_status']],x = 'sub_grade', hue = 'loan_status', ax = ax)

**As loan_status is a binary label that simply conveys whether the loan was fully paid or not, we can create a new column called 'loan_repaid' which will contain a 1 if the loan status was "Fully Paid" and a 0 if it was "Charged Off".**

In [None]:
df['loan_repaid'] = df['loan_status'].apply(lambda x : 1 if x == "Fully Paid" else 0)
df[['loan_repaid','loan_status']]
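A lambda with an else branch turns every status other than "Fully Paid" into 0. An explicit mapping is a possible alternative that makes unexpected values visible. The sketch below uses an invented toy Series; the "Current" entry is a hypothetical unexpected status added only to show the behaviour.

```python
import pandas as pd

# Toy stand-in for the loan_status column; values are invented.
status = pd.Series(["Fully Paid", "Charged Off", "Fully Paid", "Current"])

# Explicit mapping: any status missing from the dict becomes NaN, not 0.
label_map = {"Fully Paid": 1, "Charged Off": 0}
loan_repaid = status.map(label_map)
print(loan_repaid)
```

With only two statuses in the data both approaches give the same labels; the mapping simply fails louder if a third status ever appears.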

**Let's create a bar plot showing the correlation of the numeric features to the new loan_repaid column.**

In [None]:
df.corrwith(df.loan_repaid, numeric_only=True).drop('loan_repaid').sort_values().plot(kind = 'bar')

---
---
# Data PreProcessing

**Goals: Missing data handling. Removing unnecessary or repetitive features. Convert categorical string features to dummy variables.**

Let's have a look at the dataframe using the method head()


In [None]:
df.head()

# Missing Data

**Let's explore the columns with missing data. We use a variety of factors to decide whether or not they would be useful, to see if we should keep them, discard them, or fill in the missing data.**

**Total length of the dataframe?**

In [None]:
len(df.index)

**TASK: Total count of missing values per column.**

In [None]:
df.isnull().sum()

**Let's display this as a percentage of the total DataFrame**

In [None]:
(df.isnull().sum() * 100 / len(df)).sort_values(ascending = False)
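The same percentage can be computed by taking the mean of the boolean mask, since the mean of True/False values is the fraction of True. A small sketch on an invented frame (the `toy` columns are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame with a known share of missing values per column.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": [1, 2, 3, 4],
})

# isnull() gives booleans; their mean is the fraction of missing values.
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```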

**Let's examine emp_title and emp_length columns to see whether it will be okay to drop them.**

In [None]:
feat_info('emp_title')
feat_info('emp_length')

**It looks like there are many different employment job titles. Counting the unique titles will help us estimate the importance of that column.**

In [None]:
df.emp_title.nunique()

In [None]:
df.emp_title.value_counts()

**Realistically there are too many unique job titles to try to convert this to a dummy variable feature. Let's remove that emp_title column.**

In [None]:
df.drop(columns = ['emp_title'], inplace = True)

* Let's create a count plot of the emp_length feature column.
* Sorting the values in a sensible order is the challenge here.
* Hence we use **CategoricalDtype** from pandas to give the emp_length column a data type with the specified order.

In [None]:
cat_emp_length = pd.CategoricalDtype(
    ['< 1 year','1 year', '2 years', '3 years', '4 years', '5 years', '6 years', '7 years', '8 years', '9 years', '10+ years'], 
    ordered=True,
)
df['emp_length'] = df['emp_length'].astype(cat_emp_length)
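To illustrate why the ordered dtype matters, the sketch below sorts the same strings with and without it. The category list is shortened and the Series is invented for the example.

```python
import pandas as pd

order = ["< 1 year", "1 year", "2 years", "10+ years"]  # shortened for the example
dtype = pd.CategoricalDtype(order, ordered=True)

s = pd.Series(["10+ years", "2 years", "< 1 year", "1 year"])

plain_sorted = s.sort_values().tolist()                 # plain string order
typed_sorted = s.astype(dtype).sort_values().tolist()   # category order
print(plain_sorted)
print(typed_sorted)
```

Plain string sorting puts "10+ years" right after "1 year" and "< 1 year" last, while the ordered categorical follows the list that was passed in.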

In [None]:
fig,ax = plt.subplots(figsize = (16,5))
sns.countplot(data = df.sort_values(by = 'emp_length'), x = 'emp_length')

**Let's plot out the countplot with a hue separating Fully Paid vs Charged Off**

In [None]:
fig,ax = plt.subplots(figsize = (16,5))
sns.countplot(data = df.sort_values(by = 'emp_length'), x = 'emp_length', hue = 'loan_status')

**This still doesn't really tell us whether there is a strong relationship between employment length and being charged off. What we want is the percentage of charge-offs per category, that is, what percent of people in each employment category didn't pay back their loan.**

In [None]:
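# Count loans per employment length and loan status.
grp_ln_stat = pd.pivot_table(df, index = ['emp_length'], columns = ['loan_status'], values = ["loan_amnt"], aggfunc = len).loan_amnt
# Charge-offs as a percentage of all loans in each category.
grp_ln_stat['percent'] = grp_ln_stat['Charged Off'] / (grp_ln_stat['Charged Off'] + grp_ln_stat['Fully Paid']) * 100
grp_ln_stat.percent.plot(kind = 'bar')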
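A charge-off percentage per category can also be obtained with a row-normalized cross-tabulation. This is a sketch on invented data, not the notebook's own method; the `toy` frame is hypothetical.

```python
import pandas as pd

# Toy data: 4 loans in one category (1 charged off), 5 in another (1 charged off).
toy = pd.DataFrame({
    "emp_length": ["1 year"] * 4 + ["10+ years"] * 5,
    "loan_status": ["Charged Off", "Fully Paid", "Fully Paid", "Fully Paid",
                    "Charged Off", "Fully Paid", "Fully Paid", "Fully Paid", "Fully Paid"],
})

# normalize="index" divides each row by its total, so every row sums to 1.
rates = pd.crosstab(toy["emp_length"], toy["loan_status"], normalize="index")
charge_off_pct = rates["Charged Off"] * 100
print(charge_off_pct)
```

Note that dividing charge-offs by the number of fully paid loans instead gives a ratio of the two classes (about 33.3 and 25 here), not the share of all loans that were charged off (25 and 20).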

**Let's drop the emp_length column, as the charge-off rates are extremely similar across all employment lengths.**

In [None]:
df.drop(columns = ['emp_length'], inplace = True)

**Revisiting the DataFrame to see what feature columns still have missing data.**

In [None]:
df.isnull().sum()

**Review the title column vs the purpose column. Is this repeated information?**

In [None]:
df[['purpose','title']].head()

In [None]:
df['title'].head(10)

**The title column is simply a string subcategory/description of the purpose column. Let's go ahead and drop the title column.**

In [None]:
df.drop(columns = ['title'], inplace = True)

**Let's find out what the mort_acc feature represents**

In [None]:
feat_info('mort_acc')

**Displaying the unique values of the mort_acc column with their counts.**

In [None]:
df.mort_acc.value_counts()

**There are many ways we could deal with this missing data. We could attempt to build a simple model to fill it in, such as a linear model; we could fill it in based on the mean of another column; or we could even bin the column into categories and then set NaN as its own category. As there is no 100% correct approach, let's review the other columns to see which correlates most highly with mort_acc.**

In [None]:
df.corr(numeric_only=True).mort_acc.drop('mort_acc').sort_values()

**Looks like the total_acc feature correlates with mort_acc, which makes sense! Let's try a fill-in approach based on it. We will group the dataframe by total_acc and calculate the mean value of mort_acc per total_acc entry:**

In [None]:
# Select mort_acc before taking the mean so non-numeric columns are never averaged
total_acc_grp = df.groupby('total_acc')['mort_acc'].mean()
total_acc_grp

**Let's fill in the missing mort_acc values based on their total_acc value. If the mort_acc is missing, then we will fill in that missing value with the mean value corresponding to its total_acc value from the Series we created above. We are going to use .apply() method on axis=1. Check out the link below for more info.**

[Reference](https://stackoverflow.com/questions/13331698/how-to-apply-a-function-to-two-columns-of-pandas-dataframe) 

In [None]:
df.loc[10].isnull().mort_acc

In [None]:
# Row by row: use the group mean for total_acc only where mort_acc is missing
df['mort_acc'] = df.apply(lambda x : total_acc_grp[x.total_acc] if pd.isnull(x.mort_acc) else x.mort_acc, axis = 1)

In [None]:
#Total null values in mort_acc
df['mort_acc'].isnull().sum()
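Applying a function row by row works but can be slow on a large frame. The same fill can be sketched without apply by mapping each row's total_acc to its group mean and passing the result to fillna. The frame below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: two total_acc groups, each with one missing mort_acc.
toy = pd.DataFrame({
    "total_acc": [10, 10, 10, 20, 20],
    "mort_acc": [0.0, 2.0, np.nan, 4.0, np.nan],
})

# Mean mort_acc for each total_acc value (NaN entries are skipped by mean()).
group_mean = toy.groupby("total_acc")["mort_acc"].mean()

# Look up each row's group mean, then use it only where mort_acc is missing.
filled = toy["mort_acc"].fillna(toy["total_acc"].map(group_mean))
print(filled.tolist())
```

If a total_acc group has no observed mort_acc at all, its mean is NaN and the value stays missing under either approach.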

In [None]:
#Percentage of missing values in each column
df.isnull().sum().sort_values(ascending = False)/len(df)*100

**revol_util and the pub_rec_bankruptcies have missing data points, but they account for less than 0.5% of the total data. Go ahead and remove the rows that are missing those values in those columns with dropna().**

In [None]:
print("length of dataframe before and after removing the missing values")
print(len(df))
df.dropna(inplace = True)
print(len(df))

## Categorical Variables and Dummy Variables

**We're done working with the missing data! Now we just need to deal with the string values due to the categorical columns.**

**Let's list all the columns that are currently non-numeric.**

In [None]:
df.select_dtypes(include = ['object']).columns

In [None]:
df.term.value_counts()

---
**Let's now go through all the string features to see what we should do with them.**

---


### Term feature

**Let's convert the term feature, which is either 36 or 60 months, into a numeric flag: 0 for a 36-month term and 1 for a 60-month term.**

In [None]:
df['term_36_or_60'] = df.term.apply(lambda x: 0 if int(x[:3]) == 36 else 1)
df['term_36_or_60'].value_counts()
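If the month count itself is wanted instead of a 0/1 flag, the digits can be pulled out of the string. A sketch, assuming the values look like " 36 months" and " 60 months" (the Series below is invented):

```python
import pandas as pd

# Toy stand-in for the term column; the leading space is part of the assumed format.
term = pd.Series([" 36 months", " 60 months", " 36 months"])

# Pull out the digits and cast them to integers.
term_months = term.str.extract(r"(\d+)", expand=False).astype(int)
print(term_months.tolist())
```

With only two distinct terms, the flag and the month count carry the same information for the model; after scaling they differ only in column name.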

### grade feature

**TASK: We already know grade is part of sub_grade, so just drop the grade feature. The original term column is dropped as well, since term_36_or_60 now replaces it.**

In [None]:
df.drop(columns = ['grade','term'], inplace = True)

**Converting the subgrade into dummy variables. With the columns argument, get_dummies returns the new columns already joined to the rest of the dataframe and drops the original column.**

In [None]:
new_df = df.copy()
new_df = pd.get_dummies(data = new_df,columns = ['sub_grade'], prefix = 'grade',drop_first = True)
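To make the effect of drop_first concrete, here is a sketch on an invented three-category column (the `toy` frame is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "sub_grade": ["A1", "A2", "B1", "A1"],
    "loan_amnt": [1000, 2000, 1500, 3000],
})

# One indicator column per category, minus the first one (A1).
dummies = pd.get_dummies(toy, columns=["sub_grade"], prefix="grade", drop_first=True)
print(dummies.columns.tolist())
print(dummies)
```

The dropped category is still represented: a row with no indicator set (row 0 here) belongs to A1. Dropping it avoids carrying a column that is fully determined by the others.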

In [None]:
new_df.columns

In [None]:
new_df.select_dtypes(include = ['object']).columns

### verification_status, application_type,initial_list_status,purpose 
**TASK: Convert these columns: ['verification_status', 'application_type','initial_list_status','purpose'] into dummy variables and concatenate them with the original dataframe. Remember to set drop_first=True and to drop the original columns.**

In [None]:
new_df = pd.get_dummies(data = new_df, columns =  ['verification_status', 'application_type','initial_list_status','purpose'] , prefix =  ['ver_status', 'app_type','init_list_status','purpose'],drop_first = True)

In [None]:
new_df.select_dtypes(include = ['object']).columns

### home_ownership
**Let's review the value_counts for the home_ownership column.**

In [None]:
new_df.home_ownership.value_counts()

**NONE and ANY classes can be merged with OTHER, so that we end up with just 4 categories, MORTGAGE, RENT, OWN, OTHER.**

In [None]:
new_df['home_ownership'] = new_df.home_ownership.replace(['NONE','ANY'], 'OTHER')

**TASK: Extract the zip code from the last five characters of the address column and drop the address column. Then make the zip_code column into dummy variables using pandas, together with home_ownership.**

In [None]:
new_df['zip_code'] = new_df.address.apply(lambda x: x[-5:])
new_df.drop(columns = ['address'], inplace = True)

In [None]:
new_df = pd.get_dummies(data = new_df, columns = ['zip_code','home_ownership'], prefix = ['zip', 'home_own'], drop_first = True)

In [None]:
new_df.select_dtypes(include = ['object']).columns

### issue_d 

**This would be data leakage: we wouldn't know beforehand whether or not a loan would be issued when using our model, so in theory we wouldn't have an issue date. Let's drop this feature.**

In [None]:
new_df.drop(columns = ['issue_d'], inplace = True)

### earliest_cr_line
**This appears to be a historical time stamp feature. Let's extract the year from this feature using a .apply function, then convert it to a numeric feature.**

In [None]:
new_df['earliest_cr_year'] = new_df.earliest_cr_line.apply(lambda x: int(x[-4:]))
new_df.drop(columns = ['earliest_cr_line'], inplace = True)
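Slicing the last four characters works as long as every value ends in a year. A possible alternative is to parse the strings as dates, which raises an error on malformed entries instead of silently producing a wrong number. The sketch assumes values shaped like "Jun-1990"; the Series is invented.

```python
import pandas as pd

# Toy stand-in for earliest_cr_line, assuming a "Mon-YYYY" format.
earliest = pd.Series(["Jun-1990", "Jul-2004", "Aug-2007"])

# Parse with an explicit format, then take the year component.
year = pd.to_datetime(earliest, format="%b-%Y").dt.year
print(year.tolist())
```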

In [None]:
new_df.select_dtypes(include = ['object']).columns

## Train Test Split

**Import train_test_split from sklearn.**

In [None]:
from sklearn.model_selection import train_test_split

**Let's drop the loan_status column, since it duplicates the loan_repaid column we created earlier. We'll use the loan_repaid column since it's already in 0s and 1s.**

In [None]:
new_df.drop(columns = ['loan_status'], inplace = True)
len(new_df.columns)

**TASK: Set X and y variables to the .values of the features and label.**

In [None]:
X = new_df.drop(columns = ['loan_repaid']).values
y = new_df.loan_repaid.values

----
----

## Grabbing a Sample for Training Time

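### Using .sample() to grab a sample of the full dataset to save time on training. Highly recommended for lower RAM computers or if you are not using GPU.

**Note that X and y above were built from the full new_df, so the split below still uses every row. To actually train on the sample, rebuild X and y from s_df before splitting.**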

----
----

In [None]:
s_df = new_df.sample(frac=0.1,random_state=101)
print(len(s_df))

**TASK: Perform a train/test split with test_size=0.2 and a random_state of 101.**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2, random_state=101)
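With an imbalanced label, a plain random split can leave the two sets with slightly different class ratios. Passing stratify is one optional way to keep them equal; it is not what the split above does. A sketch on invented arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20% of them in class 0 (values are invented).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 20 + [1] * 80)

# stratify=y_toy keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=101, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```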

## Normalizing the Data

**A MinMaxScaler can be used to normalize the feature data X_train and X_test.**

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()
X_train_new = scaler.fit_transform(X_train)
X_test_new = scaler.transform(X_test)
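The scaler is fitted on the training data only and then reused on the test data. The arithmetic behind that, with the default (0, 1) range, can be sketched by hand in NumPy; the arrays are invented for illustration.

```python
import numpy as np

train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 40.0]])

# Statistics come from the training data only.
col_min = train.min(axis=0)
col_max = train.max(axis=0)

train_scaled = (train - col_min) / (col_max - col_min)
# The test set reuses the training min and max, so it may fall outside [0, 1].
test_scaled = (test - col_min) / (col_max - col_min)
print(test_scaled)
```

Fitting on the training set alone keeps information about the test set out of the preprocessing step; a test value larger than anything seen in training simply scales to more than 1.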

# The Model Creation

**Importing the necessary Keras functions.**

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout

**Build a sequential model that will be trained on the data. You have unlimited options here, but here is what the solution uses: a model that goes 78 --> 39 --> 19 --> 1 output neuron.**

In [None]:
# Three hidden layers with dropout, followed by a single sigmoid output
model = Sequential()

model.add(Dense(78, activation = 'relu'))
model.add(Dropout(0.2))

model.add(Dense(39, activation = 'relu'))
model.add(Dropout(0.2))

model.add(Dense(19, activation = 'relu'))
model.add(Dropout(0.2))

model.add(Dense(1, activation = 'sigmoid'))

model.compile(loss = 'binary_crossentropy', optimizer = 'adam')
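As a rough sense of the model's size, the number of trainable parameters in a stack of dense layers can be counted by hand. The sketch assumes 78 input features (the width used when reshaping a single customer later in this notebook); the helper `dense_params` is hypothetical and not part of Keras.

```python
def dense_params(n_inputs, layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    total = 0
    for units in layer_sizes:
        total += n_inputs * units + units  # weight matrix + bias vector
        n_inputs = units                   # this layer's output feeds the next
    return total

# Layer widths of the model above: 78 -> 39 -> 19 -> 1.
n_params = dense_params(78, [78, 39, 19, 1])
print(n_params)
```

Dropout layers add no parameters, so they do not appear in the count.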

In [None]:
X_train_new.shape

**TASK: Fit the model to the training data for at least 25 epochs. Also add in the validation data for later plotting. Optional: add in a batch_size of 256.**

In [None]:
model.fit(x = X_train_new, y = y_train, epochs = 25, batch_size = 256, validation_data = (X_test_new, y_test))

**Save your model.**

In [None]:
from tensorflow.keras.models import load_model

In [None]:
model.save('lend_club_model.h5')

# Evaluating Model Performance.

**Plot out the validation loss versus the training loss.**

In [None]:
losses = pd.DataFrame(model.history.history)

In [None]:
losses.plot()

**Create predictions from the X_test set and display a classification report and confusion matrix for the X_test set.**

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
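# predict_classes() is no longer available in recent TensorFlow releases; threshold the sigmoid output instead
y_pred = (model.predict(X_test_new) > 0.5).astype("int32")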
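The sigmoid output is a probability of the positive class, so turning it into a 0/1 label is a thresholding step. A NumPy-only sketch with invented probabilities:

```python
import numpy as np

# Toy sigmoid outputs with shape (n_samples, 1), as a Dense(1) layer would return.
probs = np.array([[0.91], [0.12], [0.50], [0.67]])

# Class 1 when the probability is strictly above 0.5, otherwise class 0.
labels = (probs > 0.5).astype("int32")
print(labels.ravel())
```

The 0.5 cut-off is a convention, not a requirement; a lender that cares more about catching charge-offs could choose a higher threshold for predicting "repaid".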

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
confusion_matrix(y_test, y_pred)
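The numbers in the classification report can be derived directly from the confusion matrix. The sketch below uses an invented matrix in scikit-learn's layout (rows are true classes, columns are predicted classes), with class 0 standing for charged-off loans.

```python
import numpy as np

# Toy confusion matrix: rows = true class, columns = predicted class.
cm = np.array([[30, 20],
               [5, 145]])
tn, fp, fn, tp = cm.ravel()

# Metrics for class 0 (charged off).
recall_charged_off = tn / (tn + fp)      # share of actual charge-offs that were caught
precision_charged_off = tn / (tn + fn)   # share of predicted charge-offs that were right
accuracy = (tn + tp) / cm.sum()
print(recall_charged_off, precision_charged_off, accuracy)
```

In this made-up example accuracy is 0.875 while only 60% of the charged-off loans are identified, which is why the per-class recall deserves as much attention as the overall accuracy.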

## Quick check

**Given the customer below, would you offer this person a loan?**

In [None]:
import random
random.seed(101)
# randint includes both endpoints, so stop at the last valid row position
random_ind = random.randint(0, len(new_df) - 1)

new_customer = new_df.drop('loan_repaid',axis=1).iloc[random_ind]
new_customer

In [None]:
# reshape(1, -1) makes a single-row 2D array without hard-coding the number of features
new_customer = scaler.transform(new_customer.values.reshape(1, -1))

In [None]:
(model.predict(new_customer) > 0.5).astype("int32")

**Now check, did this person actually end up paying back their loan?**

In [None]:
new_df.iloc[random_ind].loan_repaid

**We got it right!!**