# Loan Approval Model

**Purpose:** This notebook applies machine learning techniques to build the best model we can for approving or denying loan applications.

**Result:** Able to produce a model with approximately 80% accuracy.

## Table of Contents

* **[Analyzing the Data](#Analyzing-the-Data)**
* **[Cleaning the Data](#Cleaning-the-Data)**
* **[Modeling the Data](#Modeling-the-Data)**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

test = pd.read_csv('/kaggle/input/loan-prediction-problem-dataset/test_Y3wMUE5_7gLdaTN.csv')
train = pd.read_csv('/kaggle/input/loan-prediction-problem-dataset/train_u6lujuX_CVtuZ9i.csv')
train_original = train.copy()
test_original = test.copy()

## Analyzing the Data

In [None]:
train.info()

In [None]:
#plt.hist(train['ApplicantIncome']) #Will need to do a log transformation on this column
#plt.hist(train['CoapplicantIncome']) #Will need to do a log transformation on this column
#plt.hist(train['LoanAmount']) #Will need to do a log transformation on this column
#plt.hist(train['Loan_Amount_Term'])
#plt.hist(train['Credit_History'])

We will need to apply log transformations to the ApplicantIncome, CoapplicantIncome and LoanAmount columns.
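To put a number on how skewed a column is before and after a log transformation, the sample skewness can be compared. This is only a sketch on synthetic, right-skewed values (not the loan data); `income`, `skew_raw` and `skew_log` are names made up for the illustration.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "income" values (not the loan data).
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=1000))

skew_raw = income.skew()          # strongly positive for a long right tail
skew_log = np.log(income).skew()  # near 0 when the logged values are roughly symmetric

print(f"skew before log: {skew_raw:.2f}")
print(f"skew after log:  {skew_log:.2f}")
```

A skewness near zero after the transformation is consistent with a roughly symmetric distribution, which is what the histograms are meant to show visually.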

In [None]:
train.describe()

We have some missing data in LoanAmount, Loan_Amount_Term and Credit_History that will have to be cleaned up.

In [None]:
plt.title('Genders')
train['Gender'].value_counts().plot.bar()

It looks like there are about three times as many males as females in the train dataset.

In [None]:
plt.title('Marital Status')
train['Married'].value_counts().plot.bar()

Married is essentially a boolean column split up into yes and no answers. It looks like most applicants in the train dataset are married.

In [None]:
plt.title('Dependents')
train['Dependents'].value_counts().plot.bar()

The Dependents column looks to show the number of dependents that each applicant claims. Most of the applicants look to have 0 dependents.

In [None]:
plt.title('Education')
train['Education'].value_counts().plot.bar()

For the education column what we see is a little vague. No data description is given of this column. We see that there are more graduates than not, but we are not sure if this means people that graduate high school or university. The assumption would be that a graduate would mean from university, but we cannot be certain.

In [None]:
plt.title('Self-Employed')
train['Self_Employed'].value_counts().plot.bar()

This column is straightforward, and it looks like most loan applicants are not self-employed.

In [None]:
plt.title('Property Area')
train['Property_Area'].value_counts().plot.bar()

We seem to have a pretty equal distribution of property area, semiurban being the most popular area.

In [None]:
train['Loan_Status'].value_counts()

Our last and most important column for our machine learning is the Loan_Status column. This is the column that we are looking to predict. From here it looks like a good majority of applicants in the train dataset got their loan application approved.

From what is seen above it might be helpful to convert some of these columns to boolean. However, we will do some categorical encoding later, so it might be better to just leave them as they are.

## Cleaning the Data

For cleaning the data there are a couple of tasks that will need to be done in order to make sure that the model is running the best it can.
* Clean up the missing values
* Check for any duplicates
* Make a separate train dataset where the columns flagged earlier for a log transformation are modified.

### Cleaning Missing Values

So that the process does not have to be done twice, once for train and once for test, the two datasets will be combined while they are cleaned and then split back apart afterwards.

In [None]:
ntrain = train.shape[0]
ntest = test.shape[0]
y = train.Loan_Status.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['Loan_Status'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))


#Credit to: https://stackoverflow.com/questions/18172851/deleting-dataframe-row-in-pandas-based-on-column-value for helping me understand the correct way to drop my columns

Now we have a dataset with the combination of rows for train and test and the loan_status column dropped.

In [None]:
#Checking percentage of data that is na
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head(7)

Because each column has such a low percentage of missing values, none of these columns needs to be dropped; the missing values just need to be filled in.

In [None]:
all_data.head()

In [None]:
all_data.LoanAmount.describe()

Above I just wanted to quickly confirm that zeros and No's were not being counted as NaN values. As the describe output for the LoanAmount column shows, the lowest value is nine, which means that zeros are not a part of its range of values.
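A more direct way to make the same check is to count zeros and NaNs separately for each column. The sketch below uses a small made-up frame (`toy`) rather than `all_data`, so the values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for two numeric columns.
toy = pd.DataFrame({
    "LoanAmount": [9.0, np.nan, 120.0, 0.0],
    "CoapplicantIncome": [0.0, 0.0, 1500.0, np.nan],
})

zero_counts = (toy == 0).sum()   # NaN == 0 is False, so NaNs are not counted here
nan_counts = toy.isnull().sum()  # counts only the truly missing entries

print(pd.DataFrame({"zeros": zero_counts, "missing": nan_counts}))
```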

#### Cleaning Numeric Variables

Now that we are setup to start cleaning the missing data we will split up the data into numeric and categorical variables.

In [None]:
Numeric_Columns = all_data.select_dtypes(include=np.number) #Creating dataset for just the numeric columns
Numeric_Columns.head()

For this part of the process I am going to use SimpleImputer, which by default fills each missing value with its column mean, to quickly and effectively deal with my numeric missing data.

In [None]:
from sklearn.impute import SimpleImputer

my_imputer = SimpleImputer()
imputed_num = pd.DataFrame(my_imputer.fit_transform(Numeric_Columns)) #Using simple imputer to fill in missing values

imputed_num.columns = Numeric_Columns.columns #Making column headings the same for imputed_num as Numeric_Columns
identification = all_data['Loan_ID'] #Making a column for Loan_ID to merge numerical and categorical data
imputed_num = imputed_num.join(identification)
imputed_num.head()
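Because the imputer above is fit on the combined train and test rows, the fill values are influenced by the test data. One alternative, sketched here on made-up frames (`toy_train` and `toy_test` are hypothetical stand-ins), is to fit on the train rows only and reuse those statistics for the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frames standing in for the numeric columns of train and test.
toy_train = pd.DataFrame({"LoanAmount": [100.0, np.nan, 140.0],
                          "Credit_History": [1.0, 1.0, np.nan]})
toy_test = pd.DataFrame({"LoanAmount": [np.nan, 500.0],
                         "Credit_History": [0.0, np.nan]})

imputer = SimpleImputer(strategy="mean")  # "mean" is also the default strategy

# Learn the column means from train only, then apply them to both sets.
train_filled = pd.DataFrame(imputer.fit_transform(toy_train), columns=toy_train.columns)
test_filled = pd.DataFrame(imputer.transform(toy_test), columns=toy_test.columns)

print(test_filled)
```

For a 0/1 column such as Credit_History, `strategy="most_frequent"` may be a more natural choice than the mean, since the mean can produce fractional values.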

#### Cleaning Categorical Variables

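For the categorical variables I am going to use get_dummies, which one-hot encodes each column. By default a missing value simply gets a 0 in every dummy column for that feature, so no NaNs remain afterwards. This encoding will also prove effective for my model.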

In [None]:
Categorical_Columns = all_data.select_dtypes(exclude=np.number)
Categorical_Columns.drop(['Loan_ID'], axis=1, inplace=True)
Categorical_Columns.head()

In [None]:
imputed_categorical = pd.get_dummies(data=Categorical_Columns) #Using get_dummies to clean up categorical data
imputed_categorical.head()
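To see how get_dummies treats a missing category, here is a small sketch on a made-up column. The `dummy_na=True` option is shown only as an alternative that keeps missingness as its own feature; it is not what the cell above uses.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value.
toy = pd.DataFrame({"Gender": ["Male", "Female", np.nan]})

default_dummies = pd.get_dummies(toy)                 # NaN row gets 0 in every Gender_* column
flagged_dummies = pd.get_dummies(toy, dummy_na=True)  # adds an explicit Gender_nan column

print(default_dummies)
print(flagged_dummies)
```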

In [None]:
imputed_categorical = imputed_categorical.join(identification)
imputed_categorical.head()

#### Merging Numerical Dataset and Categorical Dataset Together

In [None]:
merged_data = imputed_num.merge(imputed_categorical, on='Loan_ID') #Using similar id columns to merge data
merged_data.drop(['Loan_ID'], axis=1, inplace=True) #Need to drop Loan_ID now that we have successfully merged our data together
merged_data.head()
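The merge above relies on every Loan_ID appearing exactly once in each frame. As a hedged sketch, pandas can enforce that assumption with the `validate` argument of `merge`; the frames `num_part` and `cat_part` below are made up for the illustration.

```python
import pandas as pd

# Made-up numeric and categorical halves that share a Loan_ID key.
num_part = pd.DataFrame({"Loan_ID": ["LP001", "LP002", "LP003"],
                         "LoanAmount": [120.0, 66.0, 141.0]})
cat_part = pd.DataFrame({"Loan_ID": ["LP001", "LP002", "LP003"],
                         "Gender_Male": [1, 0, 1]})

# validate="one_to_one" raises pandas.errors.MergeError if a Loan_ID repeats on either side.
merged = num_part.merge(cat_part, on="Loan_ID", validate="one_to_one")

print(merged.shape)
```

If the row count after the merge differs from the row count before it, that is usually a sign of duplicated or missing keys.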

In [None]:
all_data_nan = (merged_data.isnull().sum() / len(merged_data)) * 100
all_data_nan = all_data_nan.drop(all_data_nan[all_data_nan == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_nan})
missing_data.head(10)

As can be seen above, we no longer have any missing values. We are going to split our data back up into our train and test datasets, and then repeat the same process to make a separate dataset that contains our logarithmic columns.

In [None]:
X_tra = merged_data[:ntrain]
X_test = merged_data[ntrain:]
X_tra.info()

### Logarithmic Cleaning

In [None]:
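log_train = train_original.copy() #Copy so the in-place drops below do not alter train_original
log_test = test_original.copy() #Copy so test_original keeps every Loan_ID for the submission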
log_train.head()
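A note on why copying matters here: assigning a DataFrame to a new name does not copy it, so in-place drops through the new name also change the original. A minimal sketch with a made-up frame (`original`, `alias` and `independent` are illustrative names):

```python
import pandas as pd

original = pd.DataFrame({"ApplicantIncome": [0, 2500, 4000]})

alias = original               # same object, no data copied
independent = original.copy()  # separate object with its own data

# Dropping rows in place through the alias also changes `original`.
alias.drop(alias[alias["ApplicantIncome"] == 0].index, inplace=True)

print(len(original), len(independent))
```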

The columns that we are replacing with logarithm values are:
* ApplicantIncome
* CoapplicantIncome
* LoanAmount

First we drop the rows that contain a zero in any of these columns, then merge log_train and log_test as before and perform the transformations.

In [None]:
log_train.drop(log_train.loc[log_train['ApplicantIncome']==0].index, inplace=True)
log_train.drop(log_train.loc[log_train['CoapplicantIncome']==0].index, inplace=True)
log_train.drop(log_train.loc[log_train['LoanAmount']==0].index, inplace=True)

log_test.drop(log_test.loc[log_test['ApplicantIncome']==0].index, inplace=True)
log_test.drop(log_test.loc[log_test['CoapplicantIncome']==0].index, inplace=True)
log_test.drop(log_test.loc[log_test['LoanAmount']==0].index, inplace=True)

In [None]:
nlogtrain = log_train.shape[0]
nlogtest = log_test.shape[0]
y_log = log_train.Loan_Status.values
all_data_log = pd.concat((log_train, log_test)).reset_index(drop=True)
all_data_log.drop(['Loan_Status'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data_log.shape))

The log of zero is undefined (np.log returns -inf), so the log transformation is not insightful for values equal to zero; that is why the rows containing zeros were dropped above.
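Dropping every row with a zero can remove a large share of the data. One commonly used alternative, which is not what this notebook does, is `np.log1p`, which computes log(1 + x) and therefore maps zero to zero. A small sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Made-up incomes, including a zero that np.log could not handle.
toy = pd.DataFrame({"CoapplicantIncome": [0.0, 1500.0, 4200.0]})

# log1p(x) = log(1 + x); log1p(0) == 0, so zero-income rows stay in the data.
toy["CoapplicantIncome_log1p"] = np.log1p(toy["CoapplicantIncome"])

print(toy)
```

Whether this helps the model would need to be checked on the real data; it only avoids losing the rows.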

In [None]:
#Perform log transformations
all_data_log['ApplicantIncome_log'] = np.log(all_data_log['ApplicantIncome'])
all_data_log['CoapplicantIncome_log'] = np.log(all_data_log['CoapplicantIncome'])
all_data_log['LoanAmount_log'] = np.log(all_data_log['LoanAmount'])
all_data_log.head()

In [None]:
all_data_log.info()

Now we have far fewer rows to work with, but we are able to get log transformations that will be useful. Let's verify that we now have an approximately normal distribution by taking a look at the plots.

In [None]:
plt.hist(all_data_log['ApplicantIncome_log']) 

In [None]:
plt.hist(all_data_log['CoapplicantIncome_log']) 

In [None]:
plt.hist(all_data_log['LoanAmount_log'])

We can verify that these are now approximately normally distributed. From here we will go ahead and drop the original columns associated with the log transformations and then follow the same steps as with the all_data dataset with how we will clean.

In [None]:
all_data_log.drop(['ApplicantIncome'], axis=1, inplace=True)
all_data_log.drop(['CoapplicantIncome'], axis=1, inplace=True)
all_data_log.drop(['LoanAmount'], axis=1, inplace=True)

In [None]:
all_data_log.head()

In [None]:
Numeric_Columns_log = all_data_log.select_dtypes(include=np.number) #Creating dataset for just the numeric columns
Numeric_Columns_log.head()

In [None]:
imputed_num_log = pd.DataFrame(my_imputer.fit_transform(Numeric_Columns_log)) #Using simple imputer to fill in missing values

imputed_num_log.columns = Numeric_Columns_log.columns #Making column headings the same for imputed_num as Numeric_Columns
identification_log = all_data_log['Loan_ID'] #Making a column for Loan_ID to merge numerical and categorical data
imputed_num_log = imputed_num_log.join(identification_log)
imputed_num_log.head()

In [None]:
Categorical_Columns_log = all_data_log.select_dtypes(exclude=np.number)
Categorical_Columns_log.drop(['Loan_ID'], axis=1, inplace=True)
Categorical_Columns_log.head()

In [None]:
imputed_categorical_log = pd.get_dummies(data=Categorical_Columns_log) #Using get_dummies to clean up categorical data
imputed_categorical_log.head()

In [None]:
imputed_categorical_log = imputed_categorical_log.join(identification_log)
imputed_categorical_log.head()

In [None]:
merged_data_log = imputed_num_log.merge(imputed_categorical_log, on='Loan_ID') #Using similar id columns to merge data
merged_data_log.drop(['Loan_ID'], axis=1, inplace=True) #Need to drop Loan_ID now that we have successfully merged our data together
merged_data_log.head()

In [None]:
all_data_log_nan = (merged_data_log.isnull().sum() / len(merged_data_log)) * 100
all_data_log_nan = all_data_log_nan.drop(all_data_log_nan[all_data_log_nan == 0].index).sort_values(ascending=False)[:30]
missing_data_log = pd.DataFrame({'Missing Ratio' :all_data_log_nan})
missing_data_log.head(10)

In [None]:
X_tra_log = merged_data_log[:nlogtrain] #Split the log-transformed data, not merged_data
X_test_log = merged_data_log[nlogtrain:]
X_tra_log.info()

## Modeling the Data

Now we are into the fun part: building the best model we can for this data. We are going to use train_test_split and try a few different models: LogisticRegression, DecisionTreeClassifier, RandomForestClassifier and XGBoost. We will start with our original dataset and then, as with the cleaning, apply the same methods to our log transformation dataset to see if that improves our model at all.
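As a complement to a single train/validation split, k-fold cross-validation can give a less noisy comparison between models. The sketch below runs on synthetic data from `make_classification` (a stand-in, not the loan data), and the names `X_demo`, `y_demo`, `candidates` and `cv_scores` are made up for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned features and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 folds for each candidate model.
cv_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```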

### Base Modeling with train_test_split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_cv, y_train, y_cv = train_test_split(X_tra, y, test_size =.3)

In [None]:
from sklearn.linear_model import LogisticRegression

log_model = LogisticRegression()
log_model.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score
log_pred_cv = log_model.predict(X_cv)
accuracy_score(y_cv, log_pred_cv)

It looks like our first model is about 80-85% accurate, which is not bad at all, but we will see if a better model can be achieved, starting with a DecisionTreeClassifier.
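Part of the reason the accuracy shows up as a range is that train_test_split shuffles differently on every run. A hedged sketch of a reproducible, class-balanced split, using made-up labels with roughly the same approval share as the train dataset (`X_demo` and `y_demo` are illustrative names):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels with a 68/32 approval split.
y_demo = np.array(["Y"] * 68 + ["N"] * 32)
X_demo = np.arange(100).reshape(-1, 1)

# random_state fixes the shuffle; stratify keeps the Y/N ratio similar in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(len(y_tr), len(y_val), (y_val == "Y").mean())
```

With a fixed `random_state`, scores from different models are computed on the same validation rows, which makes them easier to compare.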

In [None]:
from sklearn import tree

tree_model = tree.DecisionTreeClassifier()
tree_model.fit(X_train, y_train)

In [None]:
tree_pred_cv = tree_model.predict(X_cv)
accuracy_score(y_cv, tree_pred_cv)

Not as accurate as the logistic regression model, but we will see how a random forest classifier handles the dataset.

In [None]:
from sklearn.ensemble import RandomForestClassifier

forest_model = RandomForestClassifier()
forest_model.fit(X_train, y_train)

In [None]:
forest_pred_cv = forest_model.predict(X_cv)
accuracy_score(y_cv, forest_pred_cv)

This model did better than the decision tree, but still not as good as the logistic regression. We will try one more method which is XGBoost.
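One caveat before the XGBoost cell: to my knowledge, recent XGBoost releases expect integer class labels (0, 1, ...) rather than strings such as 'Y' and 'N'. If the fit complains about the classes, encoding the target first is one option. A minimal sketch with made-up labels, using scikit-learn's LabelEncoder (`y_demo`, `y_encoded` and `y_restored` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up string labels like those in Loan_Status.
y_demo = np.array(["Y", "N", "Y", "Y", "N"])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_demo)          # classes are sorted, so N -> 0 and Y -> 1
y_restored = encoder.inverse_transform(y_encoded)  # map predictions back to 'Y'/'N'

print(encoder.classes_, y_encoded)
```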

In [None]:
from xgboost import XGBClassifier

xgb_model = XGBClassifier(n_estimators=50, max_depth=4)
xgb_model.fit(X_train, y_train)

In [None]:
xgb_pred_cv = xgb_model.predict(X_cv)
accuracy_score(y_cv, xgb_pred_cv)

The XGBoost model did not do as well as the logistic regression model, so we will go ahead and try to make the logistic regression model the best it can be. This starts with seeing if our log transformations helped the model.

### Log Dataset Modeling

In [None]:
X_train_log, X_cv_log, y_train_log, y_cv_log = train_test_split(X_tra_log, y_log, test_size =.3)

In [None]:
log_model.fit(X_train_log, y_train_log)

In [None]:
log_pred_cv_log = log_model.predict(X_cv_log)
accuracy_score(y_cv_log, log_pred_cv_log)

As it appears, creating new columns for the log transformations was not very influential in finding a better model.

### Final Model

Our basic logistic regression model appears to do the best for this dataset. Before calling it quits, a few things will be checked to see if any slight improvements can be made to our logistic regression model.
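One way such improvements are often explored is to scale the features and search over the regularization strength `C`. This is a sketch on synthetic data, not the notebook's actual tuning; the grid values and the names `pipe`, `param_grid` and `search` are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned features and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaling inside the pipeline means it is refit on each training fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Scaling also tends to help the default solver converge when columns such as incomes and 0/1 dummies sit on very different scales.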

In [None]:
#Credit for ideas on improvement: https://stackoverflow.com/questions/38077190/how-to-increase-the-model-accuracy-of-logistic-regression-in-scikit-python
#Credit for ideas on improvement: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

#log_model = LogisticRegression()
#log_model.fit(X_train, y_train)
#LogisticRegression()

#log_pred_cv = log_model.predict(X_cv)
#accuracy_score(y_cv, log_pred_cv)

In [None]:
log_model1 = LogisticRegression(random_state=42)
log_model1.fit(X_train, y_train)

log_pred_cv1 = log_model1.predict(X_cv)
accuracy_score(y_cv, log_pred_cv1)
#We get about the same results

In [None]:
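#'lbfgs' is already the default solver in current scikit-learn, so this should behave like log_model1
log_model2 = LogisticRegression(random_state=42, solver='lbfgs')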
log_model2.fit(X_train, y_train)

log_pred_cv2 = log_model2.predict(X_cv)
accuracy_score(y_cv, log_pred_cv2)

In [None]:
#Our final model will be our log_model2 and we want to now use it to predict based on the information in our test dataset
final_model = log_model2
pred_test = final_model.predict(X_test)

In [None]:
#Make new dataframe for our identification and loan_status predictions
submission = pd.DataFrame()

#Putting data into submission dataframe
submission['Loan_Status'] = pred_test
submission['Loan_ID'] = test_original['Loan_ID']

#Makes more sense to put in terms of 'Y' and 'N' instead of 1 and 0 when talking about a loan being approved
#(if the model was trained on 'Y'/'N' labels, the predictions are already strings and this mapping changes nothing)
submission['Loan_Status'] = submission['Loan_Status'].replace({0: 'N', 1: 'Y'})

submission.Loan_Status.value_counts()

From this, we predict that the test dataset will have 305 approved loans and 62 denied loans. In the train dataset about 68% of applications were approved, whereas our model approves about 83% of the test applications. This gap could be due in part to the model being only about 80% accurate, or to the test dataset simply containing more applicants who qualify for a loan than the train dataset.
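Two quick calculations help put these numbers in context: the predicted approval share from the counts above, and the accuracy of a baseline that always approves. The baseline is sketched on made-up labels with a 68/32 split (`y_demo` is a stand-in, not the real train labels).

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Share of predicted approvals, from the counts reported above.
approved, denied = 305, 62
predicted_rate = approved / (approved + denied)

# Made-up labels with a 68/32 split, standing in for the train labels.
y_demo = np.array(["Y"] * 68 + ["N"] * 32)
X_demo = np.zeros((len(y_demo), 1))  # DummyClassifier ignores the features

# A baseline that always predicts the most frequent class ('Y').
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)

print(f"predicted approval rate: {predicted_rate:.3f}")
print(f"always-approve accuracy: {baseline_acc:.2f}")
```

Seen this way, an accuracy of about 80% is an improvement of roughly twelve points over always approving, which is a useful yardstick when comparing the models.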