
# Project - Loan Approval
Build a logistic regression model to predict loan approval status. The dataset provides gender, marital status, credit history, income, and loan amount; gender is dropped in Step 2, so the model uses the remaining four features.

## Steps 


1. Import libraries and read the data.
2. Identify and deal with missing values.
3. Create dummy variables.
4. Normalize the data.
5. Create the feature (X) and target (Y) dataframes, then split them into training and test sets.
6. Build the logistic regression model.
7. Evaluate the model.



### Step 1
Upload the data file and import the libraries.

In [1]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving 01Exercise1.csv to 01Exercise1.csv
User uploaded file "01Exercise1.csv" with length 13907 bytes


In [2]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [3]:
# Read data and create a copy
LoanData = pd.read_csv('01Exercise1.csv')
LoanPrep = LoanData.copy()
LoanPrep.head()

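|   | gender | married | ch  | income | loanamt | status |
|---|--------|---------|-----|--------|---------|--------|
| 0 | Male   | No      | 1.0 | 5849   | NaN     | Y      |
| 1 | Male   | Yes     | 1.0 | 4583   | 128.0   | N      |
| 2 | Male   | Yes     | 1.0 | 3000   | 66.0    | Y      |
| 3 | Male   | Yes     | 1.0 | 2583   | 120.0   | Y      |
| 4 | Male   | No      | 1.0 | 6000   | 141.0   | Y      |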


### Step 2
Identify and deal with missing values

In [4]:
# Identify missing values
LoanPrep.isna().sum()

gender     13
married     3
ch         50
income      0
loanamt    22
status      0
dtype: int64

In [5]:
# Drop the rows with missing values
LoanPrep = LoanPrep.dropna()
LoanPrep.isna().sum()

gender     0
married    0
ch         0
income     0
loanamt    0
status     0
dtype: int64
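
Dropping rows is the simplest option, but it discards every row that has at least one gap. Purely as a point of comparison (this is not what the notebook does), the sketch below imputes instead of dropping on a small made-up frame: the mode for the categorical column and the median for the numeric one. The toy values and the choice of mode and median are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kinds of gaps as the loan data (values are made up)
toy = pd.DataFrame({
    "married": ["Yes", None, "No", "Yes"],
    "loanamt": [128.0, 66.0, np.nan, 120.0],
})

# Option used in this notebook: drop any row that has a missing value
dropped = toy.dropna()

# Alternative: fill the gaps instead of dropping the rows
imputed = toy.copy()
imputed["married"] = imputed["married"].fillna(imputed["married"].mode()[0])
imputed["loanamt"] = imputed["loanamt"].fillna(imputed["loanamt"].median())

print(len(toy), len(dropped), len(imputed))
```

Whether imputation is appropriate depends on why the values are missing, so treat this as an option to evaluate rather than a drop-in replacement.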

In [6]:
# Drop the gender column, assuming the bank does not discriminate on the basis of gender.
LoanPrep = LoanPrep.drop(columns=['gender'])
LoanPrep.head()

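|   | married | ch  | income | loanamt | status |
|---|---------|-----|--------|---------|--------|
| 1 | Yes     | 1.0 | 4583   | 128.0   | N      |
| 2 | Yes     | 1.0 | 3000   | 66.0    | Y      |
| 3 | Yes     | 1.0 | 2583   | 120.0   | Y      |
| 4 | No      | 1.0 | 6000   | 141.0   | Y      |
| 5 | Yes     | 1.0 | 5417   | 267.0   | Y      |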


### Step 3
Create dummy variables from categorical variables.

In [7]:
LoanPrep.dtypes

married     object
ch         float64
income       int64
loanamt    float64
status      object
dtype: object

In [8]:
# Categorical variables to be converted to dummy variables: married, status

In [9]:
LoanPrep = pd.get_dummies(LoanPrep, drop_first=True)

In [10]:
LoanPrep.head()

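|   | ch  | income | loanamt | married_Yes | status_Y |
|---|-----|--------|---------|-------------|----------|
| 1 | 1.0 | 4583   | 128.0   | 1           | 0        |
| 2 | 1.0 | 3000   | 66.0    | 1           | 1        |
| 3 | 1.0 | 2583   | 120.0   | 1           | 1        |
| 4 | 1.0 | 6000   | 141.0   | 0           | 1        |
| 5 | 1.0 | 5417   | 267.0   | 1           | 1        |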
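
To see what `drop_first=True` does, the sketch below encodes a tiny made-up frame both ways. Without the flag each category gets its own column; with it, the first category of each variable (in sorted order) is dropped, which is why only `married_Yes` and `status_Y` remain. The toy frame is an assumption for illustration.

```python
import pandas as pd

# Made-up rows with the same two categorical columns as the loan data
toy = pd.DataFrame({"married": ["Yes", "No", "Yes"],
                    "status": ["N", "Y", "Y"]})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category of each variable dropped

print(list(full.columns))
print(list(reduced.columns))
```

For a binary variable the dropped column carries no extra information, since `married_No` is always the complement of `married_Yes`. Newer pandas releases return boolean dummy columns rather than `uint8`.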


In [11]:
LoanPrep.dtypes

ch             float64
income           int64
loanamt        float64
married_Yes      uint8
status_Y         uint8
dtype: object

### Step 4
Normalize the data: standardize `income` and `loanamt` to zero mean and unit variance.

In [12]:
# Standardize income and loanamt with StandardScaler
from sklearn.preprocessing import StandardScaler
scaler_ = StandardScaler()

# Each column is scaled independently, so one fit covers both
num_cols = ['income', 'loanamt']
LoanPrep[num_cols] = scaler_.fit_transform(LoanPrep[num_cols])


In [13]:
LoanPrep.head()

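|   | ch  | income    | loanamt   | married_Yes | status_Y |
|---|-----|-----------|-----------|-------------|----------|
| 1 | 1.0 | -0.128073 | -0.19425  | 1           | 0        |
| 2 | 1.0 | -0.392077 | -0.971015 | 1           | 1        |
| 3 | 1.0 | -0.461621 | -0.294478 | 1           | 1        |
| 4 | 1.0 | 0.108246  | -0.03138  | 0           | 1        |
| 5 | 1.0 | 0.011017  | 1.547205  | 1           | 1        |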
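
The scaler in this step is fitted on the full dataset, so rows that later land in the test set contribute to the mean and standard deviation. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the frame, column values, and variable names are assumptions and not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numeric columns standing in for income and loanamt
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "income": rng.integers(1000, 10000, size=100),
    "loanamt": rng.uniform(50, 300, size=100),
})

train, test = train_test_split(toy, test_size=0.3, random_state=1234)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std from training rows only
test_scaled = scaler.transform(test)        # reuse those statistics on the test rows

print(train_scaled.mean(axis=0).round(6))
print(test_scaled.mean(axis=0).round(6))
```

The training columns come out with mean 0 and standard deviation 1, while the test columns are only approximately centred, because they are scaled with statistics they did not contribute to.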


### Step 5
Create the independent (X) and dependent (Y) dataframes, then split them into training and test sets.

In [14]:
Y = LoanPrep[['status_Y']]
X = LoanPrep.drop(['status_Y'], axis=1)

In [15]:
# Split X and Y into training (70%) and test (30%) sets
from sklearn.model_selection import train_test_split

# stratify=Y keeps the proportion of approved and rejected loans
# the same in the training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=1234, stratify=Y)
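
As a small illustration of stratification, the sketch below splits a synthetic, imbalanced target and compares the share of positives before and after the split. The 70/30 class balance and the variable names are assumptions chosen for the example, not figures from the loan data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 70 approvals (1) and 30 rejections (0)
y = np.array([1] * 70 + [0] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y, test_size=0.3, random_state=1234, stratify=y)

# Share of approvals overall, in the training set, and in the test set
print(y.mean(), y_tr.mean(), y_te.mean())
```

With `stratify=y`, all three shares stay close to 0.7; without it, a random split can leave the smaller class under-represented in one of the sets.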

### Step 6
Build the Logistic Regression Model

In [16]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

# Pass the target as a 1-D array; fitting on the single-column
# DataFrame raises a DataConversionWarning
lr.fit(X_train, Y_train.values.ravel())


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [17]:
Y_predict = lr.predict(X_test)

In [18]:
Y_predict

array([0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1], dtype=uint8)
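
`predict` returns hard 0/1 labels. A fitted `LogisticRegression` also exposes `predict_proba`, which returns the estimated class probabilities behind those labels. The sketch below uses synthetic data from `make_classification` (an assumption, standing in for the loan features) to show how the two relate for a binary model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem with four features
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)

proba = model.predict_proba(X_demo)  # columns follow model.classes_, here [0, 1]
labels = model.predict(X_demo)

# For a binary model, predict() chooses class 1 when its probability exceeds 0.5
manual = (proba[:, 1] > 0.5).astype(int)
print((manual == labels).mean())
```

Working with probabilities makes it possible to try a cut-off other than 0.5, for example a stricter one if approving a bad loan is more costly than rejecting a good one.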

### Step 7
Evaluate the model

In [19]:
# Build the confusion matrix (rows: actual class, columns: predicted class)
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(Y_test, Y_predict)
cm

array([[ 29,  20],
       [  2, 108]])
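
The four cells of this matrix can be turned into the usual summary metrics. The sketch below hard-codes the matrix reported above and derives them by hand; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows = actual, columns = predicted
cm = np.array([[29, 20],
               [2, 108]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)    # of predicted approvals, the share that were correct
recall = tp / (tp + fn)       # of actual approvals, the share that were found
specificity = tn / (tn + fp)  # of actual rejections, the share that were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} specificity={specificity:.4f}")
```

Recall is about 0.98 while specificity is about 0.59: the model finds almost every approved loan but misclassifies 20 of the 49 rejected ones as approved.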

In [20]:
# Mean accuracy on the test set
score = lr.score(X_test, Y_test)
score

0.8616352201257862
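
To put this score in context, it helps to compare it with a baseline that always predicts the majority class. The sketch below derives both numbers from the confusion matrix reported in this step; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from this step: rows = actual, columns = predicted
cm = np.array([[29, 20],
               [2, 108]])

actual_counts = cm.sum(axis=1)              # [rejected, approved] in the test set
baseline = actual_counts.max() / cm.sum()   # always predict the majority class
model_accuracy = np.trace(cm) / cm.sum()    # correct predictions on the diagonal

print(f"baseline={baseline:.4f} model={model_accuracy:.4f}")
```

Since 110 of the 159 test rows are approvals, always answering "approved" would already be right about 69% of the time, so the model's 86% is a gain of roughly 17 percentage points over that baseline.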