# Finance Project: Deep Neural Network

## Supervised machine learning algorithm using Lending Club dataset.

# Description:
### While staring into the future through a crystal ball is a myth, technology can help investors see their investment prospects more clearly. It is every financial institution's dream to know in advance what the return on investment of a given project will be. This project sets out to separate the risky borrowers from the highly profitable ones in an ocean of data.

# Project Objective: 
### To develop a supervised machine learning model that identifies which borrowers will pay off their loans. The results can help a financial institution with risk assessment by indicating whether a prospective borrower is likely to default or pay off the loan. They can also inform the loan approval strategy and help identify profitable target markets. Ultimately, this model serves as a blueprint to decrease business risk and increase the profitability of the organization.

# Predictive Model: Deep Neural Network
### The data is a real-life dataset from one of the financial powerhouses, Lending Club. A supervised deep neural network will be used to perform binary classification. In this project, the target feature or y-variable is "Loan Status".

# Process:
### This project starts with exploratory data analysis and data visualization, followed by feature engineering and preparing the dataset for machine learning. The end result is the accuracy of the model in predicting loan payoff or default.

### Importing libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
data = pd.read_csv("../input/lending-club-loan/lending_club_loan_two.csv")

In [None]:
data.head()

### This is a large dataset with ~396000 observations, as shown below. Note that the columns are of float and object data types.

In [None]:
data.info()

### Visualizing loan payoff vs chargeoff. 

In [None]:
plt.figure(figsize=(12,7))
sns.set_context("paper")
sns.countplot(x="loan_status", data=data, palette="Spectral")

### Visualizing the distribution of loan amount borrowed.

In [None]:
plt.figure(figsize=(15,7))
sns.set_context("paper", font_scale=2)
# distplot is deprecated in recent seaborn releases; histplot draws the same histogram.
sns.histplot(data["loan_amnt"], bins=15, kde=False, color="seagreen")

### Descriptive analysis and correlation of features. 

In [None]:
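# Restrict to numeric columns; recent pandas versions raise an error on object columns.
data.select_dtypes("number").corr()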

### Note that loan amount and interest rate are highly correlated, which is expected. Total accounts and open accounts are also highly correlated, as are public records and bankruptcies, which also makes sense.

In [None]:
plt.figure(figsize=(15,7))
sns.set_context("poster", font_scale=0.5)
sns.heatmap(data.select_dtypes("number").corr(), annot=True, cmap="coolwarm", alpha=0.6)

### Visualizing the relationship between installment and loan amount.

In [None]:
plt.figure(figsize=(15,7))
sns.scatterplot(x="installment", y="loan_amnt", data=data, alpha=0.8, hue="loan_status", palette="RdYlBu" )

### Note that there is only a slight difference in loan amount between fully paid and charged-off loans.

In [None]:
plt.figure(figsize=(15,7))
sns.boxplot(x="loan_status", y="loan_amnt", data=data, palette="plasma")

In [None]:
data.groupby("loan_status")["loan_amnt"].describe()

### Diving into the feature "grade", presumably the level of creditworthiness of the borrowers.

In [None]:
data["grade"].value_counts()

### Visualizing the relationship between grade and loan status. As expected, the lower grade categories show a higher share of charged-off loans. This suggests that grade can be a good indicator of whether a borrower will pay off or default.

In [None]:
plt.figure(figsize=(15,7))
sns.set_context("paper", font_scale=2)
sns.countplot(x="grade", data=data, hue="loan_status", color="seagreen")

### Sorting the grades gives a clearer picture of the impact of grade on loan status. The lower the grade, the higher the ratio of charged-off to fully paid loans.

In [None]:
plt.figure(figsize=(15,7))
sns.set_context("paper", font_scale=2)
sorted_grade = sorted(data["grade"].unique())
sns.countplot(x="grade", data=data, hue="loan_status", color="seagreen", order=sorted_grade)

In [None]:
data["loan_status"].value_counts()

In [None]:
data["loan_repaid"] = data["loan_status"].map({"Fully Paid":1, "Charged Off":0})

In [None]:
data[["loan_repaid", "loan_status"]]

### Visualizing the correlation between "loan repaid" and the other features. Note that interest rate has a relatively strong (negative) correlation compared to the rest. This is expected: the higher the interest rate, the harder it is to pay off a loan.

In [None]:
plt.figure(figsize=(15,7))
data.select_dtypes("number").corr()["loan_repaid"].sort_values().drop("loan_repaid").plot(kind="bar", color="seagreen", alpha=0.6)

### Dealing with missing values. The number of missing values per feature is shown below. That number determines how each feature is treated: either it is dropped or its missing values are replaced.

In [None]:
data.isnull().sum()

In [None]:
data.isnull().sum() / len(data) * 100
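
The missing-value percentage above can also be computed in one step. A minimal sketch on a made-up toy frame (the values and the `toy` name are illustrative assumptions, not the Lending Club data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (values are made up).
toy = pd.DataFrame({
    "loan_amnt": [1000.0, 2500.0, 4000.0, 5500.0],
    "mort_acc": [0.0, np.nan, 2.0, np.nan],
    "emp_title": ["nurse", None, "teacher", "driver"],
})

# isnull() yields booleans, so the column mean is the fraction of missing values.
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

Taking the mean of the boolean mask gives the same result as dividing the sum by `len(data)`.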

In [None]:
data["emp_title"].nunique()

In [None]:
data["emp_title"].value_counts()

### The employment title feature has >170000 unique values. It is not feasible to keep it as a feature in our machine learning model, so it is dropped.

In [None]:
data = data.drop("emp_title", axis=1)

In [None]:
sorted(data["emp_length"].dropna().unique())

In [None]:
sorted_emp_length = ['< 1 year',
 '1 year',
 '2 years',
 '3 years',
 '4 years',
 '5 years',
 '6 years',
 '7 years',
 '8 years',
 '9 years',
 '10+ years']

### Visualizing the employment length feature. Borrowers with 10+ years of employment form the largest group, which suggests that many borrowers are middle-aged or older rather than young adults.

In [None]:
plt.figure(figsize=(16,7))
sns.set_context("paper", font_scale=1.5)
sns.countplot(x="emp_length", data=data, hue="loan_status", order=sorted_emp_length, palette="coolwarm")

In [None]:
emp_co = data[data["loan_status"] == "Charged Off"].groupby("emp_length").count()["loan_status"]
emp_fp = data[data["loan_status"] == "Fully Paid"].groupby("emp_length").count()["loan_status"]

In [None]:
emp_length_graph = emp_co/(emp_co+emp_fp)

### Visualizing charged-off borrowers as a share of all borrowers in each employment length interval. The shares are roughly the same across the board, as shown in the graph below.

In [None]:
emp_length_graph.plot(kind="bar")
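
The charged-off share per group can also be obtained without counting the two classes separately. A small sketch on made-up data (the `toy` frame is an illustrative assumption):

```python
import pandas as pd

# Toy data: two borrowers with 1 year of employment, four with 10+ years.
toy = pd.DataFrame({
    "emp_length": ["1 year", "1 year", "10+ years", "10+ years", "10+ years", "10+ years"],
    "loan_status": ["Charged Off", "Fully Paid", "Charged Off",
                    "Fully Paid", "Fully Paid", "Fully Paid"],
})

# A boolean flag averaged within each group gives the charged-off share directly.
charged_off = toy["loan_status"].eq("Charged Off")
charge_off_rate = charged_off.groupby(toy["emp_length"]).mean()
print(charge_off_rate)
```

If the shares are nearly identical across groups, the feature carries little signal for the target.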

### The employment length feature does not really help to distinguish borrowers who pay off from those who default, so it is dropped.

In [None]:
data = data.drop("emp_length", axis=1)

In [None]:
data.isnull().sum()

In [None]:
data["purpose"].value_counts()

### Visualizing the loan purpose feature. The top reason borrowers give for taking a loan from Lending Club is clearly debt consolidation. This matches the common perception that people often try to pay off high-interest credit card accounts with an unsecured personal loan.

In [None]:
plt.figure(figsize=(18,7))
sns.set_context("paper", font_scale=1)
sns.countplot(x="purpose", data=data, hue="loan_status", palette="seismic")

In [None]:
data["title"].value_counts()

### It appears that the feature "title" provides the same information as loan purpose. Will drop this feature. 

In [None]:
data = data.drop("title", axis=1)

In [None]:
data["mort_acc"].value_counts()

### Recall that the mortgage account feature has ~38000 missing values. This is significant, as dropping those rows would noticeably reduce the size of the dataset. It is probably better to replace the missing values with some other values. After checking the correlation of mortgage accounts with the other features, total accounts turns out to have the highest correlation. It is not surprising that people with more accounts tend to have more mortgages.

In [None]:
data.select_dtypes("number").corr()["mort_acc"].sort_values()

### Decided to replace missing values in mortgage account with the mean value based on the total account. 

In [None]:
# Select the column before aggregating so non-numeric columns are never averaged.
total_acc_avg = data.groupby("total_acc")["mort_acc"].mean()

In [None]:
def fill_in_mort_acc(total_acc, mort_acc):
    # When mort_acc is missing, use the mean mort_acc of borrowers with the same total_acc.
    if np.isnan(mort_acc):
        return total_acc_avg[total_acc]
    else:
        return mort_acc

### Replacing the missing values by applying the function to each row with a lambda.

In [None]:
data["mort_acc"] = data.apply(lambda x: fill_in_mort_acc(x["total_acc"], x["mort_acc"]), axis=1)
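
The row-wise `apply` above can be slow on a dataset of this size. As a possible alternative, here is a vectorized sketch of the same group-mean fill on made-up values (the `toy` frame and `group_mean` name are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy data: mort_acc is missing for one borrower in each total_acc group.
toy = pd.DataFrame({
    "total_acc": [10, 10, 10, 20, 20],
    "mort_acc": [1.0, 3.0, np.nan, 4.0, np.nan],
})

# transform("mean") returns, for every row, the mean of its own total_acc group
# (missing values are skipped when the mean is computed).
group_mean = toy.groupby("total_acc")["mort_acc"].transform("mean")

# Only the missing entries are replaced; existing values are left untouched.
toy["mort_acc"] = toy["mort_acc"].fillna(group_mean)
print(toy)
```

As with the function-based version, a group in which every `mort_acc` is missing stays missing, so a later `dropna()` is still needed.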

### Checking the current standing of our missing values. Since the remaining 2 features have very few missing values, the rows containing them are simply dropped to save time.

In [None]:
data.isnull().sum()

In [None]:
data = data.dropna()

In [None]:
data.isnull().sum()

In [None]:
data.dtypes

### Dealing with non-numeric type of data

In [None]:
data.select_dtypes(["object"]).columns

In [None]:
data["term"].value_counts()

### Grabbing the numeric values "36" and "60".

In [None]:
data["term"] = data["term"].apply(lambda term: int(term[:3]))
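
Slicing the first three characters relies on the exact layout of the string. A slightly more defensive sketch pulls out the digits with a regular expression; the sample strings below are assumed to resemble the `term` values and are not taken from the dataset:

```python
import pandas as pd

# Assumed sample values in the style of the "term" column.
toy_term = pd.Series([" 36 months", " 60 months", "36 months"])

# Extract the first run of digits and convert it to an integer.
term_months = toy_term.str.extract(r"(\d+)", expand=False).astype(int)
print(term_months.tolist())
```

This keeps working if the leading whitespace differs between rows.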

In [None]:
data["term"].value_counts()

### Since sub-grade provides more information than grade, the grade feature will be dropped.

In [None]:
data = data.drop("grade", axis=1)

### Converting categorical features into dummy (one-hot encoded) variables.

In [None]:
dummy = pd.get_dummies(data["sub_grade"], drop_first=True)

In [None]:
data = pd.concat([data.drop("sub_grade", axis=1), dummy], axis=1)
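
To make the effect of `drop_first=True` concrete, here is a small sketch on made-up sub-grades (the `toy` frame is an illustrative assumption):

```python
import pandas as pd

# Toy column with three distinct categories.
toy = pd.DataFrame({"sub_grade": ["A1", "A2", "B1", "A1"]})

# drop_first=True omits the first category (A1); a row with all zeros then means A1.
dummies = pd.get_dummies(toy["sub_grade"], drop_first=True)
print(dummies)
```

Dropping one category avoids a redundant column, since its value is implied by the others.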

In [None]:
data.head()

In [None]:
data["verification_status"].value_counts()

In [None]:
data["application_type"].value_counts()

In [None]:
data["initial_list_status"].value_counts()

In [None]:
dummy = pd.get_dummies(data[["verification_status", "application_type", "initial_list_status", "purpose"]], drop_first=True)
data = pd.concat([data.drop(["verification_status", "application_type", "initial_list_status", "purpose"], axis=1), dummy], axis=1)

In [None]:
data.head()

In [None]:
data["home_ownership"].value_counts()

In [None]:
data["home_ownership"] = data["home_ownership"].replace(["NONE", "ANY"], "OTHER")

In [None]:
data["home_ownership"].value_counts()

In [None]:
dummy = pd.get_dummies(data["home_ownership"], drop_first=True)
data = pd.concat([data.drop("home_ownership", axis=1), dummy], axis=1)

In [None]:
data.head()

##### Four more features with object as data type to deal with.

In [None]:
data.select_dtypes("object").columns

In [None]:
data["address"].value_counts()

#### The full address has no value in our machine learning model, but the ZIP code may have some influence on the outcome. Grabbing the ZIP code from the end of the address.

In [None]:
data["zip_code"] = data["address"].apply(lambda address: address[-5:])
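
The same extraction can be written with the vectorized string accessor. A sketch on invented addresses (the strings below are made up and only assume that the ZIP code is the last five characters):

```python
import pandas as pd

# Invented addresses; only the trailing five characters matter here.
toy_address = pd.Series([
    "123 Example Road\nSampletown, XX 12345",
    "9 Placeholder Ave\nDemo City, YY 54321",
])

# .str[-5:] slices the last five characters of every entry at once.
zip_code = toy_address.str[-5:]
print(zip_code.tolist())
```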

In [None]:
data["zip_code"].value_counts()

#### The newly engineered ZIP code feature has only a few unique values, so it is feasible to keep it. Creating dummy variables for this feature.

In [None]:
dummy = pd.get_dummies(data["zip_code"], drop_first=True)
data = pd.concat([data.drop("zip_code", axis=1), dummy], axis=1)

In [None]:
data.head()

In [None]:
data = data.drop("address", axis=1)

In [None]:
data = data.drop("issue_d", axis=1)

In [None]:
data["earliest_cr_line"].value_counts()

#### The feature "earliest credit line" may be a key factor, as it carries time information. Grabbing the year as our time-based feature.

In [None]:
data["earliest_cr_line"] = data["earliest_cr_line"].apply(lambda year: int(year[-4:]))
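
An alternative to slicing the last four characters is to parse the value as a date. This sketch assumes the strings look like `"Jun-1990"` (abbreviated month, dash, year); that format is an assumption and should be checked against the `value_counts()` output first:

```python
import pandas as pd

# Assumed sample values in a "Mon-YYYY" layout.
toy_cr_line = pd.Series(["Jun-1990", "Sep-2004"])

# Parse with an explicit format, then keep only the year component.
cr_year = pd.to_datetime(toy_cr_line, format="%b-%Y").dt.year
print(cr_year.tolist())
```

Parsing as a date fails loudly on malformed entries instead of silently producing a wrong number.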

In [None]:
data["earliest_cr_line"].value_counts()

In [None]:
data.select_dtypes("object").columns

#### Recalling that we have converted "fully paid" and "charged off" with binary digits, it is safe to just drop the original feature. 

In [None]:
data = data.drop("loan_status", axis=1)

#### The data cleansing and feature engineering process is complete. The dataset now has 79 columns, including the target. Now preparing the training data and test data.

In [None]:
data.head(3)

### Training data will be set at 80% of the dataset and the test size at 20%. Random state will be set at 42, which is arbitrary - I heard 42 is THE answer to life, the universe, and everything :)

#### The target feature or y-variable is "loan repaid" (yes = 1; no = 0). 

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = data.drop("loan_repaid", axis=1).values
y = data["loan_repaid"].values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

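#### A scaler is used when preparing the dataset for deep learning so that all features share a comparable range (0 to 1 with MinMaxScaler). This helps the network learn from the data. The scaler is fit on the training data only and then applied to the test data, so no information from the test set leaks into training.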

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
X_train = scaler.fit_transform(X_train)

In [None]:
X_test = scaler.transform(X_test)
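
To show what min-max scaling computes, here is a NumPy sketch of the formula `(x - min) / (max - min)` on made-up numbers. It does not call scikit-learn; the array and function names are illustrative assumptions:

```python
import numpy as np

# Made-up training and test matrices with two features on different scales.
X_train_toy = np.array([[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]])
X_test_toy = np.array([[2.0, 600.0]])

# The minimum and maximum are learned from the training data only.
col_min = X_train_toy.min(axis=0)
col_max = X_train_toy.max(axis=0)

def min_max_scale(X):
    # Map each column so the training minimum becomes 0 and the training maximum 1.
    return (X - col_min) / (col_max - col_min)

train_scaled = min_max_scale(X_train_toy)
test_scaled = min_max_scale(X_test_toy)
print(train_scaled)
print(test_scaled)
```

Test values can fall outside the 0 to 1 range (here 600 maps to 1.25) because the range comes from the training data alone.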

### Preparation is complete. Importing deep neural network libraries. 

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential

#### Recalling that the training set has ~310000 observations and 78 features.

In [None]:
X_train.shape

#### The rectified linear unit (ReLU) is used as the activation function in the hidden layers, as it is a commonly used choice. The first Dense layer has 78 units, matching the number of features (an arbitrary choice), and each following hidden layer has roughly half as many. Each hidden layer is followed by 20% dropout. The final layer uses the sigmoid activation, which outputs a probability much like logistic regression does. Binary cross-entropy is the loss function since this is a binary classification model. The optimizer is Adam, as it is one of the most commonly used.

In [None]:
model = Sequential()

model.add(Dense(78, activation="relu"))
model.add(Dropout(0.2))

model.add(Dense(39, activation="relu"))
model.add(Dropout(0.2))

model.add(Dense(19, activation="relu"))
model.add(Dropout(0.2))

model.add(Dense(1, activation="sigmoid"))

model.compile(loss="binary_crossentropy", optimizer="adam")
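
For intuition about the output layer and the loss, here is a NumPy sketch of the sigmoid function and binary cross-entropy on made-up numbers. It is a hand-written illustration, not the Keras implementation; the function names and the `eps` clipping value are assumptions:

```python
import numpy as np

def sigmoid(z):
    # Squash any real number into the (0, 1) range so it can be read as a probability.
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    # Clip probabilities away from 0 and 1 to avoid taking log(0).
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

# Made-up raw outputs of the last layer and their true labels.
logits = np.array([2.0, -1.0, 0.0])
y_true = np.array([1.0, 0.0, 1.0])

probs = sigmoid(logits)
loss = binary_cross_entropy(y_true, probs)
print(probs, loss)
```

The loss shrinks as the predicted probability for the true class approaches 1, which is what training pushes the network toward.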

#### Epochs are set to 25, which is also an arbitrary number; the batch size is set to 256 (a personal preference).

In [None]:
model.fit(X_train, y_train, epochs=25, batch_size=256, validation_data=(X_test, y_test))

In [None]:
losses = pd.DataFrame(model.history.history)

#### Loss function graph to see the performance of the deep neural network model. Note that the loss decreases sharply at the beginning, which is desirable, then trends down slowly and drops below the validation loss.

In [None]:
losses.plot()

### Evaluation of the supervised machine learning deep neural network performance.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
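# predict_classes was removed in recent TensorFlow releases; threshold the sigmoid output instead.
prediction = (model.predict(X_test) > 0.5).astype("int32")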

In [None]:
print(classification_report(y_test, prediction))
print(confusion_matrix(y_test, prediction))
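
To clarify how the per-class precision and recall in the report are derived, here is a NumPy sketch on made-up labels and probabilities. The numbers are invented and the helper `precision_recall` is a hypothetical name, not a scikit-learn function:

```python
import numpy as np

# Made-up true labels (1 = repaid, 0 = charged off) and predicted probabilities.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1])
y_prob = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.7, 0.2, 0.3])

# Probabilities above 0.5 are labeled 1, the rest 0.
y_pred = (y_prob > 0.5).astype(int)

def precision_recall(y_true, y_pred, label):
    # Precision: of everything predicted as `label`, how much really is `label`.
    # Recall: of everything that really is `label`, how much was found.
    tp = np.sum((y_pred == label) & (y_true == label))
    predicted = np.sum(y_pred == label)
    actual = np.sum(y_true == label)
    return tp / predicted, tp / actual

precision_0, recall_0 = precision_recall(y_true, y_pred, label=0)
precision_1, recall_1 = precision_recall(y_true, y_pred, label=1)
print(precision_0, recall_0, precision_1, recall_1)
```

Low recall on class 0 means many charged-off loans are predicted as repaid, which is the costly mistake for a lender.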

## Note that recall is 100% on 1s and precision is 98% on 0s. The model did not do well on recall for 0s, at only 44%, which is a significant weakness: more than half of the charged-off loans are predicted as fully paid. The f1-score on 1s is 94%, which is pretty good. Overall, the prediction against the true positive is 90%, which is pretty good IMO. The overall accuracy is 89%, which is much better than a random guess. The deep neural network can be further optimized using EarlyStopping and by tuning the dropout rate.
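
As a rough illustration of the early stopping idea mentioned above, here is a plain-Python sketch of a patience rule: training stops once the validation loss has failed to improve for a given number of epochs. It does not call Keras; the function name, the loss values, and the patience of 3 are all illustrative assumptions:

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # Improvement: remember the new best loss and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement: stop once this has happened `patience` times in a row.
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses that bottom out at epoch 3 and then creep up.
val_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.39]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch)
```

In practice this rule is available as a callback in Keras, so it would not need to be written by hand.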