<a href="https://colab.research.google.com/github/tejatanush/Loan-Approval-Prediction/blob/main/Loan_Approva_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Description
This model predicts whether a person can get a loan, based on features such as education, gender, marital status, income, property area, and credit history. This helps customers find out whether they are likely to be approved, and it benefits banks by making it easier to tell customers their loan status.

# Steps to build a model:
1. Import required libraries
2. Import dataset
3. Data Preprocessing
* Find and fill missing values
* Encoding data
* Splitting into training and testing set
* Feature Scaling
4. Selection of model
5. Build a model
6. Evaluate model
7. Predict Results
8. Make a confusion matrix

# 1. Import libraries

In [267]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

# 2. Import dataset
This dataset has several independent features that are used to predict whether a person will be able to get a loan.
# Reference:
https://github.com/prasertcbs/basic-dataset/blob/master/Loan-Approval-Prediction.csv  
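Let's split the data into two parts: x (the independent variables, i.e. the features) and y (the dependent variable, `Loan_Status`). The `Loan_ID` column is dropped because it is only an identifier.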

In [268]:
dataset=pd.read_csv("Loan-Approval-Prediction.csv")
x=dataset.iloc[:,1:-1].values
y=dataset.iloc[:,-1].values
dataset.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
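The positional slicing above relies on `Loan_ID` being the first column and `Loan_Status` the last. As an alternative sketch (not what this notebook does), the same split can be written by column name. The toy frame below is a hypothetical three-row excerpt with only a few of the columns.

```python
import pandas as pd

# Hypothetical excerpt of the dataset, reduced to a few columns
toy = pd.DataFrame({
    "Loan_ID": ["LP001002", "LP001003", "LP001005"],
    "Gender": ["Male", "Male", "Male"],
    "ApplicantIncome": [5849, 4583, 3000],
    "Loan_Status": ["Y", "N", "Y"],
})

# Drop the identifier and the target to get the features
features = toy.drop(columns=["Loan_ID", "Loan_Status"])
target = toy["Loan_Status"]

x_toy = features.values  # same kind of array as dataset.iloc[:, 1:-1].values
y_toy = target.values    # same kind of array as dataset.iloc[:, -1].values
print(list(features.columns))
```

Selecting by name keeps working if the column order in the CSV changes.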


# 3. Data Preprocessing


# Find and fill missing values

In [269]:
missing_values = dataset.isnull().sum()
print(missing_values)

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64


The dataset has some missing values. Let's fill them with a statistic such as the mean or the most frequent value; which one to use depends on the feature.

In [270]:
from sklearn.impute import SimpleImputer

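# Categorical and discrete columns get the most frequent value; LoanAmount gets the mean
imputer_1 = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer_2 = SimpleImputer(missing_values=np.nan, strategy='mean')
x[:, 0:3] = imputer_1.fit_transform(x[:, 0:3])    # Gender, Married, Dependents
x[:, 4:5] = imputer_1.fit_transform(x[:, 4:5])    # Self_Employed
x[:, 8:10] = imputer_1.fit_transform(x[:, 8:10])  # Loan_Amount_Term, Credit_History
x[:, 7:8] = imputer_2.fit_transform(x[:, 7:8])    # LoanAmount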
print(x[23])

['Male' 'Yes' '2' 'Not Graduate' 'No' 3365 1917.0 112.0 360.0 0.0 'Rural']


The mean and median are not defined for categorical features, so those are filled with the most frequent value (the mode) of the respective feature. The numeric `LoanAmount` column is filled with its mean.
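To make the two strategies concrete, here is a minimal sketch on hypothetical toy columns (the values are made up for illustration and are not taken from the dataset).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy columns: one categorical, one numeric, each with a missing value
gender = np.array([["Male"], ["Female"], [np.nan], ["Male"]], dtype=object)
loan_amount = np.array([[100.0], [np.nan], [200.0], [150.0]])

mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

gender_filled = mode_imputer.fit_transform(gender)     # NaN becomes 'Male'
loan_filled = mean_imputer.fit_transform(loan_amount)  # NaN becomes mean of 100, 200, 150

print(gender_filled.ravel())
print(loan_filled.ravel())
```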

# Encoding data
The data contains categorical features. Two-valued features are label-encoded to 0/1, while `Property_Area` and `Loan_Amount_Term` are one-hot encoded, so the encoding is done in two steps.

In [271]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
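le=LabelEncoder()
# Label-encode the two-valued columns: Gender, Married, Education, Self_Employed, Credit_History
columns_to_encode = [0,1,3,4,9]
for column in columns_to_encode:
    x[:, column] = le.fit_transform(x[:, column])
# One-hot encode Property_Area (column 10) and Loan_Amount_Term (column 8); the one-hot columns come first in the output
ct=ColumnTransformer(transformers=[('encoder1',OneHotEncoder(),[10]),('encoder2',OneHotEncoder(),[8])],remainder='passthrough')
x=np.array(ct.fit_transform(x))
# Encode the target Loan_Status: N -> 0, Y -> 1
y=le.fit_transform(y)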
print(x[23])
print(x[24])
print(x[25])
print(x[28])
print(x[61])

[1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1 1 '2' 1 0 3365
 1917.0 112.0 0]
[0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1 1 '1' 0 0 3717
 2925.0 151.0 1]
[0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1 1 '0' 0 1 9560 0.0
 191.0 1]
[0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1 0 '0' 1 0 1442 0.0
 35.0 1]
[0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1 1 '3+' 0 0 3029 0.0
 99.0 1]
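As a small illustration of the two encoders used in the cell above, the following sketch applies them to hypothetical toy columns. `LabelEncoder` assigns integer codes in sorted class order, and `OneHotEncoder` creates one 0/1 column per category.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy columns
married = np.array(["No", "Yes", "Yes", "No"])
area = np.array([["Urban"], ["Rural"], ["Semiurban"], ["Urban"]])

# Classes are sorted, so No -> 0 and Yes -> 1
married_codes = LabelEncoder().fit_transform(married)

# One column per category, in sorted order: Rural, Semiurban, Urban
area_onehot = OneHotEncoder().fit_transform(area).toarray()

print(married_codes)
print(area_onehot)
```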


In the printed rows, the first 13 values are the one-hot columns and the remaining values are the other features. The value at index 15, the number of dependents, is still a string, so let's convert it to numeric form. The only non-numeric value in the dataset is `3+`, so let's map `3+` to 3, which does not change the dataset much.

In [272]:
for i in range(0,len(y)):
  if x[i,15]=='3+':
    x[i,15]=3
  elif x[i,15]=='0':
    x[i,15]=0
  elif x[i,15]=='1':
    x[i,15]=1
  elif x[i,15]=='2':
    x[i,15]=2
print(len(x[61]))
x=x.astype('float32') # cast the object array to float32 so TensorFlow can consume it
y=y.astype('float32')

22


The strings are now converted to numbers, and each row has 22 values.
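The same conversion can be sketched more compactly with a lookup table. This is an illustrative alternative on a hypothetical toy column, not the code this notebook ran.

```python
import numpy as np

# Hypothetical toy column holding the Dependents values as strings
dependents = np.array(["0", "1", "2", "3+", "0"], dtype=object)

# Map each string to a number; '3+' is treated as 3
mapping = {"0": 0, "1": 1, "2": 2, "3+": 3}
dependents_numeric = np.array([mapping[v] for v in dependents], dtype="float32")

print(dependents_numeric)
```

An unexpected value raises a `KeyError` here instead of silently staying a string, which makes dirty data easier to spot.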

# Splitting into training set and test set
Let's split data so that 80% of data will be training set and remaining 20% will be testing set.

In [273]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)
# Check the size of the training set
#print(x_train)
#print(y_train)
print(len(x_train[:,18]))

491


The training set has 491 rows, i.e. about 80% of the data; the remaining 123 rows form the test set. `train_test_split` shuffles the rows before splitting, so the data is split at random.
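The split sizes can be checked with a minimal sketch that splits row indices instead of the real data. The total of 614 rows is inferred from the 491 training rows and 123 test rows reported in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 614  # 491 training rows + 123 test rows
indices = np.arange(n_rows)

# Same settings as in the notebook: 20% test set, fixed seed
train_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=0)

print(len(train_idx), len(test_idx))
print(train_idx[:5])  # the indices are shuffled, not in their original order
```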

# Feature Scaling
Let's normalize ApplicantIncome, CoapplicantIncome, and LoanAmount (columns 18, 19, and 20). These features have much larger numeric ranges than the 0/1 encoded columns, and normalizing them may help the model learn patterns between them.

In [274]:
from sklearn.preprocessing import MinMaxScaler
sc=MinMaxScaler()
x_train[:,19]=sc.fit_transform((x_train[:,19]).reshape(-1,1)).flatten()
x_train[:,20]=sc.fit_transform((x_train[:,20]).reshape(-1,1)).flatten()
x_train[:,18]=sc.fit_transform((x_train[:,18]).reshape(-1,1)).flatten()
x_test[:,19]=sc.transform((x_test[:,19]).reshape(-1,1)).flatten()
x_test[:,20]=sc.transform((x_test[:,20]).reshape(-1,1)).flatten()
x_test[:,18]=sc.transform((x_test[:,18]).reshape(-1,1)).flatten()
print(x_test[2])

[0.         0.         1.         0.         0.         0.
 0.         0.         0.         0.         0.         1.
 0.         1.         1.         0.         0.         0.
 0.07400125 0.0464564  0.00215213 1.        ]


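So the ApplicantIncome, CoapplicantIncome, and LoanAmount values in the training set are normalized to the range 0 to 1. Note that `sc` is refit for each column, so the three `transform` calls on the test set all use the minimum and maximum from the last fit (column 18, ApplicantIncome); keeping a separate scaler per column, or fitting one scaler on all three columns at once, avoids this.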
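`MinMaxScaler` scales each column independently, so one scaler can be fit on several columns at once and then reused on the test set. The sketch below shows this on hypothetical toy values; it is an illustration of the technique, not the cell that produced the output above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy values: ApplicantIncome, CoapplicantIncome, LoanAmount
train = np.array([[1000.0, 0.0, 50.0],
                  [5000.0, 2000.0, 150.0],
                  [3000.0, 1000.0, 100.0]])
test = np.array([[2000.0, 500.0, 75.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min and max per column
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled)
print(test_scaled)
```

Test values outside the training range would fall outside 0 to 1, which is expected because the scaler only knows the training minimum and maximum.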

# 4. Selection of model
The loan status has only two possible outcomes (Y/N), so this is a binary classification problem and we can use a classification model.

# 5. Build a model
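**Create model:** For binary classification, the output layer uses a sigmoid activation so that the output can be read as a probability. The hidden layers here use no activation function, i.e. they are linear, so the stacked layers are mathematically equivalent to a single linear layer followed by a sigmoid. Adding a nonlinearity such as ReLU to the hidden layers is the usual way to let the network learn nonlinear patterns.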

In [275]:
tf.random.set_seed(42)
model=tf.keras.Sequential([tf.keras.layers.Dense(100,input_shape=[22]),
                           tf.keras.layers.Dense(10),
                           tf.keras.layers.Dense(1,activation='sigmoid')])
model.summary()

Model: "sequential_9"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_23 (Dense)            (None, 100)               2300      
                                                                 
 dense_24 (Dense)            (None, 10)                1010      
                                                                 
 dense_25 (Dense)            (None, 1)                 11        
                                                                 
Total params: 3321 (12.97 KB)
Trainable params: 3321 (12.97 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


The summary shows how the model is constructed: the layers, their output shapes, and the number of parameters in each.
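The parameter counts in the summary can be reproduced by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A short sketch, where `dense_params` is a helper name introduced only for this illustration:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 22 input features, then Dense(100), Dense(10), Dense(1)
layer_sizes = [22, 100, 10, 1]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)

print(counts)  # [2300, 1010, 11]
print(total)   # 3321
```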

**Compile Model:** Here we use the Adam optimizer with a learning rate of 0.01. The loss is binary cross-entropy because the result is one of two classes (Y/N), and we evaluate the model using the accuracy metric.

In [276]:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              metrics=['accuracy'])
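For intuition, binary cross-entropy averages $-[y \log p + (1-y)\log(1-p)]$ over the samples. The NumPy sketch below implements that formula directly; the function name and the clipping constant `eps` are assumptions for this illustration, and the Keras implementation may differ in such details.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(losses))

y_true = np.array([1.0, 0.0])

# Undecided predictions cost log(2) per sample
undecided = binary_crossentropy(y_true, np.array([0.5, 0.5]))
# Confident, correct predictions cost much less
confident = binary_crossentropy(y_true, np.array([0.9, 0.1]))

print(undecided, confident)
```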

**Fit Model:** Train the model for 20 epochs or more, and pass validation data so that the model is evaluated on the test set after every epoch, which helps with visualizing the training progress.

In [277]:
model_history=model.fit(x_train,y_train,epochs=20,validation_data=(x_test,y_test))

Epoch 1/20
...
Epoch 20/20


# 6. Evaluate Model
`model.evaluate` returns the loss followed by the compiled metrics, here `[loss, accuracy]`.

In [278]:
model.evaluate(x_test,y_test)



[0.4535917043685913, 0.8373983502388]

In [280]:
print(len(y_test))

123


# 7. Predict Results
Let's predict the results for x_test and compare them with the actual values in y_test.

In [282]:
y_pred=model.predict(x_test)
# Threshold the sigmoid probabilities at 0.5 to get class labels (1 = Y, 0 = N)
y_pred=(y_pred>0.5).astype('float32').reshape(-1,1)
y_test=y_test.reshape(-1,1)
# Print predicted (left column) and actual (right column) values
print(np.concatenate((y_pred, y_test), axis=1))

[[1. 1.]
 [1. 0.]
 [1. 1.]
 [1. 0.]
 [1. 1.]
 [0. 0.]
 [1. 1.]
 [1. 1.]
 [0. 0.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 0.]
 [1. 0.]
 [1. 1.]
 [1. 1.]
 [0. 0.]
 [0. 0.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [0. 0.]
 [0. 0.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [0. 0.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [0. 0.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 0.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 0.]
 [1. 1.]
 [1. 1.]
 [0. 1.]
 [1. 0.]
 [1. 1.]
 [0. 0.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 0.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 0.]
 [1. 0.]
 [1. 1.]
 [0. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [1. 1.]
 [0. 0.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 0.]
 [1. 0.]
 [0. 0.]
 [1. 1.]
 [0. 0.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 0.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 1.]
 [1. 0.]
 [1. 1.]
 

# 8. Making confusion matrix
Let's make a confusion matrix to compare the actual loan status with the predicted loan status. Rows are the actual classes and columns are the predicted classes.

In [284]:
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_test,y_pred)
print(cm)

[[16 17]
 [ 3 87]]


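The confusion matrix shows 103 correct predictions (16 + 87) and 20 wrong ones (17 + 3), an accuracy of about 83.7%, which matches the accuracy reported by `model.evaluate`. The model performs reasonably well overall, but most of the errors are the 17 rejected loans that it predicted as approved.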
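Other common metrics follow directly from the confusion matrix. The sketch below recomputes them from the matrix printed above, assuming the scikit-learn layout (rows are actual classes, columns are predicted classes, class 1 is an approved loan).

```python
import numpy as np

# Confusion matrix from the notebook output
cm = np.array([[16, 17],
               [3, 87]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # share of all predictions that are correct
precision = tp / (tp + fp)       # share of predicted approvals that were approved
recall = tp / (tp + fn)          # share of actual approvals that were found
specificity = tn / (tn + fp)     # share of actual rejections that were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f}")
```

The low specificity (16 of 33 rejections identified) makes the weakness visible that a single accuracy number hides.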