# Loan Approval Prediction Analysis

This project was completed as part of an online coding challenge hosted by Analytics Vidhya. Submissions are scored by accuracy: the percentage of loan approval outcomes that are predicted correctly.
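As a rough illustration of this scoring metric, accuracy is the fraction of predictions that match the true labels. The helper name `accuracy` and the sample labels below are made up for illustration; they are not part of the challenge data.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels, for illustration only.
actual = ["Y", "N", "Y", "Y"]
predicted = ["Y", "N", "N", "Y"]
score = accuracy(actual, predicted)
print(score)  # 3 of 4 predictions match -> 0.75
```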

In this project, my goal is to predict whether a loan will be approved, based on the provided training and test data sets. The outcome is either "Y", meaning the loan is approved, or "N", meaning it is not. To do this, I will develop a model that analyses the features of each loan and predicts whether it will be approved.

# Data Set

The training set has 614 rows and 13 columns; the test set has 367 rows and 12 columns. The column missing from the test set is Loan_Status, the target variable, which records whether each loan in the training set was approved. The data set variables can be seen below:

<img src = "Dataset-v.png" style = "width:500px;height:300px"/>

# Data Analysis

To perform the data analysis and determine which loans to approve, I will use Python 3 with the pandas, NumPy, matplotlib, and scikit-learn libraries.

In [197]:
# Importing libraries.
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
%pylab inline
%matplotlib inline

Populating the interactive namespace from numpy and matplotlib


In [198]:
# Loading in both the training and testing data sets.
train_set = pd.read_csv("train_loan.csv")
test_set = pd.read_csv("test_loan.csv")

In [199]:
#Seeing the relative sizes of both datasets
train_set.shape, test_set.shape

((614, 13), (367, 12))

The dataset may contain missing values, so we first have to clean the data to make it suitable for analysis.

In [200]:
#Checking for missing values.
train_set.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

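As we can see, there is missing data in Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term, and Credit_History. To rectify this, I will use imputation.

For numerical columns such as "LoanAmount", the median is a common choice because it is robust to outliers. Note, however, that the training-set cell below fills both LoanAmount and Loan_Amount_Term with the mode; only the test set's LoanAmount is filled with the median.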

For string or categorical values like "Gender", I will use mode to fill in the missing values.
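A minimal sketch of the two strategies on a made-up frame may help here. The values below are invented for illustration and are not taken from the loan data.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "LoanAmount": [100.0, 120.0, np.nan, 500.0],
    "Gender": ["Male", "Female", "Male", np.nan],
})

# Numerical column: the median is not pulled upwards by the large value 500.
toy["LoanAmount"] = toy["LoanAmount"].fillna(toy["LoanAmount"].median())

# Categorical column: mode() returns a Series, so take its first entry.
toy["Gender"] = toy["Gender"].fillna(toy["Gender"].mode()[0])
```

Here the missing LoanAmount becomes 120.0 (the median of 100, 120 and 500) and the missing Gender becomes "Male" (the most frequent value).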

In [201]:
#Replacing missing values with the mode of each column.
#Assigning the result back avoids relying on inplace=True on a column selection.
mode_cols = ['LoanAmount', 'Loan_Amount_Term', 'Gender', 'Married',
             'Dependents', 'Self_Employed', 'Credit_History']
for col in mode_cols:
    train_set[col] = train_set[col].fillna(train_set[col].mode()[0])

In [202]:
#Checking for missing values again.
train_set.isnull().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

As we can see from the results above, all the missing values have been replaced.

We now have to replicate the same process for the test dataset.

In [203]:
#Checking for missing values in test dataset.
test_set.isnull().sum()

Loan_ID               0
Gender               11
Married               0
Dependents           10
Education             0
Self_Employed        23
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            5
Loan_Amount_Term      6
Credit_History       29
Property_Area         0
dtype: int64

In [204]:
#Replacing missing values with the mode and median found in the training dataset.
for col in ['Gender', 'Dependents', 'Self_Employed', 'Credit_History', 'Loan_Amount_Term']:
    test_set[col] = test_set[col].fillna(train_set[col].mode()[0])
test_set['LoanAmount'] = test_set['LoanAmount'].fillna(train_set['LoanAmount'].median())

In [205]:
#Checking for missing values again in test dataset.
test_set.isnull().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
dtype: int64

After filling in the missing values, we also have to consider possible outliers. These are typically found through univariate analysis, looking at one numerical column at a time; that analysis is not included here.
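One common way to do such a univariate check is the 1.5 × IQR rule applied to a single column. The sketch below uses invented income values and a hypothetical helper, `iqr_outliers`; it is not the analysis performed on the loan data.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]


# Invented incomes with one obviously extreme value.
incomes = pd.Series([2500, 3000, 3200, 3500, 4000, 4200, 81000])
flagged = iqr_outliers(incomes)
```

On this toy series only the value 81000 is flagged. Whether a flagged value should be dropped, capped, or transformed is a separate modelling decision.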

Now that both the training and test data sets have had their missing values filled, we can start building our model.

# Building the Model

When building our model, we want to understand which features help determine whether a loan will be approved.

We want to consider:

1) Applicants who have a credit history.
2) Applicants with higher applicant and coapplicant incomes.
3) Applicants with a higher education level.
4) The property area, and whether it is urban or not.

I will use a logistic regression model for this task.

I will first drop the Loan_ID variable, as it is only an identifier and has no bearing on whether a loan is approved.

In [206]:
train_set = train_set.drop('Loan_ID', axis=1)
test_set = test_set.drop('Loan_ID', axis=1)

scikit-learn expects the target variable to be passed separately from the features. So, we will drop the target variable from the training dataset and store it in its own variable.

In [207]:
# Drop "Loan_Status" from the features and keep it as the target variable.
x = train_set.drop('Loan_Status', axis=1)
y = train_set.Loan_Status

In [208]:
# One-hot encode the categorical columns so that every feature is numeric.
x = pd.get_dummies(x)
train_set = pd.get_dummies(train_set)
test_set = pd.get_dummies(test_set)
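One thing to watch with `pd.get_dummies` is that the training and test frames are encoded independently, so a category that is missing from one of them produces a different set of columns. A possible safeguard, sketched here on invented data (the `train_toy` and `test_toy` frames are hypothetical), is to reindex the test columns to match the training columns:

```python
import pandas as pd

# Invented frames: the test data happens to lack the "Semiurban" category.
train_toy = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban"]})
test_toy = pd.DataFrame({"Property_Area": ["Urban", "Rural"]})

train_enc = pd.get_dummies(train_toy)
test_enc = pd.get_dummies(test_toy)

# Add any missing indicator columns (filled with 0) and drop extras, so that
# the test frame has exactly the training columns, in the same order.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```

After the reindex, a model fitted on `train_enc` can be applied to `test_enc` without a column mismatch.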

The test dataset has no Loan_Status labels, so we cannot use it to check how well the model performs. Instead, we split the training dataset into a training part and a validation part. We train the model on the training part and use it to make predictions for the validation part. Because we know the true labels for the validation part, we can measure how accurate those predictions are.

We will use the train_test_split function from sklearn to divide our training dataset.

In [209]:
from sklearn.model_selection import train_test_split
x_train, x_cv, y_train, y_cv = train_test_split(x, y, test_size=0.3, random_state=0)
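With `test_size=0.3`, 30% of the 614 training rows are held out for validation. Assuming scikit-learn rounds the validation share up to a whole row (which matches my understanding of `train_test_split`), the arithmetic works out as follows:

```python
import math

n_rows = 614      # rows in the training set
test_size = 0.3   # fraction held out for validation

# Assumption: the validation size is rounded up to a whole number of rows.
n_validation = math.ceil(n_rows * test_size)
n_training = n_rows - n_validation
print(n_training, n_validation)
```

Under that assumption, the model is fitted on 429 rows and validated on 185.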

We will now fit a logistic regression model on the training part and calculate the accuracy of its predictions on the validation part.

In [212]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression()
model.fit(x_train, y_train)
pred_cv = model.predict(x_cv)
accuracy_score(y_cv, pred_cv)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.8324324324324325

The model reaches a validation accuracy of about 83%.
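The convergence warning printed above suggests either raising `max_iter` or scaling the features. A minimal sketch of both, on synthetic data rather than the loan data, could look like this. The generated features, the labelling rule, and the `max_iter=1000` value are assumptions chosen for illustration, not settings taken from this project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan features: one large-scale column
# (like an income) and one small-scale column (like a 0/1 flag).
rng = np.random.default_rng(0)
income = rng.normal(5000, 2000, size=200)
flag = rng.integers(0, 2, size=200)
X = np.column_stack([income, flag])
y = np.where(flag == 1, "Y", "N")  # label driven by the flag only

# Standardise the features, then fit with a higher iteration limit.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)
train_accuracy = model.score(X, y)
```

Putting the scaler in a pipeline means the scaling learned during fitting is reused automatically when the model predicts on new rows.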