## Problem Statement:

#### Predict Loan Eligibility for Dream Housing Finance Company
#### Dream Housing Finance deals in all kinds of home loans and operates across urban, semi-urban, and rural areas. A customer first applies for a home loan, and the company then validates the customer's eligibility.

#### The company wants to automate the loan eligibility process in real time, based on the details customers provide in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To support this, the company has provided a dataset for identifying the customer segments that are eligible for a loan, so that it can target those customers specifically.

### Library:

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score


### Data Gathering:

In [2]:
train_data = pd.read_csv(r'D:\Datascience\inceptez\Preprocessing\Loan Predictions\train_ctrUa4K.csv')
test_data = pd.read_csv(r'D:\Datascience\inceptez\Preprocessing\Loan Predictions\test_lAUu6dG.csv')
sample_data = pd.read_csv(r'D:\Datascience\inceptez\Preprocessing\Loan Predictions\sample_submission_49d68Cx.csv')
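
The absolute `D:\...` paths above only resolve on the original machine. Below is a minimal sketch of a more portable loader, assuming the three CSV files sit together in one directory. `load_loan_data` and `data_dir` are hypothetical names introduced here and are not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_loan_data(data_dir):
    """Read the three competition CSVs from one directory (hypothetical helper)."""
    data_dir = Path(data_dir)
    train = pd.read_csv(data_dir / "train_ctrUa4K.csv")
    test = pd.read_csv(data_dir / "test_lAUu6dG.csv")
    sample = pd.read_csv(data_dir / "sample_submission_49d68Cx.csv")
    return train, test, sample
```

It could be called as `train_data, test_data, sample_data = load_loan_data("data")`, where `"data"` is an assumed folder name.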

In [3]:
train_data.shape

(614, 13)

In [4]:
test_data.shape

(367, 12)

### Feature Engineering:


In [6]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB
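
The `info()` output shows gaps only indirectly, through the non-null counts (for example, `Credit_History` has 564 of 614 values, so 50 are missing). A minimal sketch of a direct missing-value summary follows. It runs on a tiny synthetic frame with illustrative values, not on the real data.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for train_data (values are illustrative only).
df = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [np.nan, 128.0, 66.0, np.nan],
    "Credit_History": [1.0, 1.0, np.nan, 0.0],
    "Education": ["Graduate"] * 4,
})

missing = df.isna().sum()                      # count of NaN per column
summary = pd.DataFrame({
    "missing": missing,
    "percent": (100 * missing / len(df)).round(1),
})
# Keep only columns that have gaps, worst first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```

On the real frame, the same three statements would be applied to `train_data` in place of `df`.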


In [7]:
train_data.head(10)

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
5,LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Urban,Y
6,LP001013,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Urban,Y
7,LP001014,Male,Yes,3+,Graduate,No,3036,2504.0,158.0,360.0,0.0,Semiurban,N
8,LP001018,Male,Yes,2,Graduate,No,4006,1526.0,168.0,360.0,1.0,Urban,Y
9,LP001020,Male,Yes,1,Graduate,No,12841,10968.0,349.0,360.0,1.0,Semiurban,N


In [8]:
train_data.duplicated().sum()

0

In [9]:
train_data.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Loan_ID,614.0,614.0,LP001002,1.0,,,,,,,
Gender,601.0,2.0,Male,489.0,,,,,,,
Married,611.0,2.0,Yes,398.0,,,,,,,
Dependents,599.0,4.0,0,345.0,,,,,,,
Education,614.0,2.0,Graduate,480.0,,,,,,,
Self_Employed,582.0,2.0,No,500.0,,,,,,,
ApplicantIncome,614.0,,,,5403.459283,6109.041673,150.0,2877.5,3812.5,5795.0,81000.0
CoapplicantIncome,614.0,,,,1621.245798,2926.248369,0.0,0.0,1188.5,2297.25,41667.0
LoanAmount,592.0,,,,146.412162,85.587325,9.0,100.0,128.0,168.0,700.0
Loan_Amount_Term,600.0,,,,342.0,65.12041,12.0,360.0,360.0,360.0,480.0


In [10]:
train_data[train_data['Loan_Status']=='N']

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
7,LP001014,Male,Yes,3+,Graduate,No,3036,2504.0,158.0,360.0,0.0,Semiurban,N
9,LP001020,Male,Yes,1,Graduate,No,12841,10968.0,349.0,360.0,1.0,Semiurban,N
13,LP001029,Male,No,0,Graduate,No,1853,2840.0,114.0,360.0,1.0,Rural,N
17,LP001036,Female,No,0,Graduate,No,3510,0.0,76.0,360.0,0.0,Urban,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...
596,LP002941,Male,Yes,2,Not Graduate,Yes,6383,1000.0,187.0,360.0,1.0,Rural,N
597,LP002943,Male,No,,Graduate,No,2987,0.0,88.0,360.0,0.0,Semiurban,N
600,LP002949,Female,No,3+,Graduate,,416,41667.0,350.0,180.0,,Urban,N
605,LP002960,Male,Yes,0,Not Graduate,No,2400,3800.0,,180.0,1.0,Urban,N


In [11]:
pd.set_option('display.max_rows', 500)
train_data[train_data['Loan_Status']=='N']

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
7,LP001014,Male,Yes,3+,Graduate,No,3036,2504.0,158.0,360.0,0.0,Semiurban,N
9,LP001020,Male,Yes,1,Graduate,No,12841,10968.0,349.0,360.0,1.0,Semiurban,N
13,LP001029,Male,No,0,Graduate,No,1853,2840.0,114.0,360.0,1.0,Rural,N
17,LP001036,Female,No,0,Graduate,No,3510,0.0,76.0,360.0,0.0,Urban,N
18,LP001038,Male,Yes,0,Not Graduate,No,4887,0.0,133.0,360.0,1.0,Rural,N
20,LP001043,Male,Yes,0,Not Graduate,No,7660,0.0,104.0,360.0,0.0,Urban,N
22,LP001047,Male,Yes,0,Not Graduate,No,2600,1911.0,116.0,360.0,0.0,Semiurban,N
23,LP001050,,Yes,2,Not Graduate,No,3365,1917.0,112.0,360.0,0.0,Rural,N
24,LP001052,Male,Yes,1,Graduate,,3717,2925.0,151.0,360.0,,Semiurban,N


In [12]:
train_data[train_data['Loan_Status']=='N'].describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Loan_ID,192.0,192.0,LP001003,1.0,,,,,,,
Gender,187.0,2.0,Male,150.0,,,,,,,
Married,192.0,2.0,Yes,113.0,,,,,,,
Dependents,186.0,4.0,0,107.0,,,,,,,
Education,192.0,2.0,Graduate,140.0,,,,,,,
Self_Employed,183.0,2.0,No,157.0,,,,,,,
ApplicantIncome,192.0,,,,5446.078125,6819.558528,150.0,2885.0,3833.5,5861.25,81000.0
CoapplicantIncome,192.0,,,,1877.807292,4384.060103,0.0,0.0,268.0,2273.75,41667.0
LoanAmount,181.0,,,,151.220994,85.862783,9.0,100.0,129.0,176.0,570.0
Loan_Amount_Term,186.0,,,,344.064516,69.238921,36.0,360.0,360.0,360.0,480.0


In [13]:
train_data[train_data['Loan_Status']=='Y']

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
5,LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Urban,Y
6,LP001013,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Urban,Y
8,LP001018,Male,Yes,2,Graduate,No,4006,1526.0,168.0,360.0,1.0,Urban,Y
10,LP001024,Male,Yes,2,Graduate,No,3200,700.0,70.0,360.0,1.0,Urban,Y
11,LP001027,Male,Yes,2,Graduate,,2500,1840.0,109.0,360.0,1.0,Urban,Y
12,LP001028,Male,Yes,2,Graduate,No,3073,8106.0,200.0,360.0,1.0,Urban,Y


In [14]:
train_data[train_data['Loan_Status']=='Y'].describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Loan_ID,422.0,422.0,LP001002,1.0,,,,,,,
Gender,414.0,2.0,Male,339.0,,,,,,,
Married,419.0,2.0,Yes,285.0,,,,,,,
Dependents,413.0,4.0,0,238.0,,,,,,,
Education,422.0,2.0,Graduate,340.0,,,,,,,
Self_Employed,399.0,2.0,No,343.0,,,,,,,
ApplicantIncome,422.0,,,,5384.06872,5765.441615,210.0,2877.5,3812.5,5771.5,63337.0
CoapplicantIncome,422.0,,,,1504.516398,1924.754855,0.0,0.0,1239.5,2297.25,20000.0
LoanAmount,411.0,,,,144.294404,85.484607,17.0,100.0,126.0,161.0,700.0
Loan_Amount_Term,414.0,,,,341.072464,63.24777,12.0,360.0,360.0,360.0,480.0


In [15]:
train_data.groupby(['Loan_Status','Credit_History'])['Loan_Status'].count()

Loan_Status  Credit_History
N            0.0                82
             1.0                97
Y            0.0                 7
             1.0               378
Name: Loan_Status, dtype: int64
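
The raw counts above are easier to compare as approval rates per credit-history group. The sketch below rebuilds the four cells from the printed counts and computes the share of `Y` in each group. On the real data, `pd.crosstab(train_data['Credit_History'], train_data['Loan_Status'], normalize='index')` should give the same proportions.

```python
import pandas as pd

# Rebuild the four (Loan_Status, Credit_History) cells from the counts printed above.
counts = pd.DataFrame({
    "Loan_Status": ["N", "N", "Y", "Y"],
    "Credit_History": [0.0, 1.0, 0.0, 1.0],
    "n": [82, 97, 7, 378],
})
table = counts.pivot(index="Credit_History", columns="Loan_Status", values="n")

# Share of approved applications within each credit-history group.
approval_rate = table["Y"] / table.sum(axis=1)
print(approval_rate.round(3))
```

Applicants with `Credit_History` 0.0 are approved in roughly 8% of cases (7 of 89), against roughly 80% (378 of 475) for those with 1.0. This fits the large `Credit_History` coefficients that appear later in the fitted model.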

In [16]:
train_data.groupby(by=['Loan_Status','Credit_History']).count()

Loan_Status,Credit_History,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Property_Area
N,0.0,82,79,82,77,82,81,82,82,78,76,82
N,1.0,97,95,97,96,97,94,97,97,91,97,97
Y,0.0,7,7,7,7,7,7,7,7,7,7,7
Y,1.0,378,371,375,369,378,356,378,378,367,370,378


In [17]:
#train_data[train_data['Loan_Status']=='N'].describe(include='all').T
train_data.groupby(by=['Loan_Status','Gender']).count()

Loan_Status,Gender,Loan_ID,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
N,Female,37,37,36,37,35,37,37,36,36,36,37
N,Male,150,150,145,150,143,150,150,140,145,138,150
Y,Female,75,74,73,75,69,75,75,73,73,65,75
Y,Male,339,337,332,339,322,339,339,330,333,313,339


In [131]:
all_data = pd.concat([train_data,test_data],ignore_index=True)

In [132]:
all_data

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
976,LP002971,Male,Yes,3+,Not Graduate,Yes,4009,1777.0,113.0,360.0,1.0,Urban,
977,LP002975,Male,Yes,0,Graduate,No,4158,709.0,115.0,360.0,1.0,Urban,
978,LP002980,Male,No,0,Graduate,No,3250,1993.0,126.0,360.0,,Semiurban,
979,LP002986,Male,Yes,0,Graduate,No,5000,2393.0,158.0,360.0,1.0,Rural,


In [133]:
# Treat a missing credit history as its own category (2) instead of guessing 0 or 1
all_data['Credit_History']=all_data['Credit_History'].fillna(2)
#all_data['Credit_History']=all_data['Credit_History'].fillna(1)

In [134]:
all_data['Credit_History'].value_counts(dropna=False)

1.0    754
0.0    148
2.0     79
Name: Credit_History, dtype: int64

In [135]:
all_data['LoanAmount']=all_data['LoanAmount'].fillna(0)
#all_data['LoanAmount']=all_data['LoanAmount'].fillna(all_data['LoanAmount'].mean())

In [136]:
all_data['Loan_Amount_Term']=all_data['Loan_Amount_Term'].fillna(0)
#all_data['Loan_Amount_Term']=all_data['Loan_Amount_Term'].fillna(all_data['Loan_Amount_Term'].mean())
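
Filling `LoanAmount` and `Loan_Amount_Term` with 0 introduces a value that lies outside the observed range (the smallest recorded loan amount is 9). One possible alternative, sketched here on synthetic data, is to fill with the median of the labeled rows and keep a flag column that records where the value was missing. The column name `LoanAmount_missing` is an assumption made for this illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for all_data; the last two rows play the unlabeled test rows.
df = pd.DataFrame({
    "LoanAmount": [128.0, np.nan, 66.0, 120.0, np.nan, 141.0],
    "Loan_Status": ["N", "Y", "Y", "Y", None, None],
})

is_train = df["Loan_Status"].notna()
# Median taken from labeled rows only, so test values do not influence it.
train_median = df.loc[is_train, "LoanAmount"].median()

# Record where the value was missing before overwriting it.
df["LoanAmount_missing"] = df["LoanAmount"].isna().astype(int)
df["LoanAmount"] = df["LoanAmount"].fillna(train_median)
print(train_median)
print(df)
```

Whether this helps the model compared with the zero fill used in the notebook would have to be checked on the hold-out split.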

In [139]:
all_data['LoanAmount'].value_counts(dropna=False)

120.0    29
0.0      27
110.0    27
100.0    24
187.0    21
150.0    19
130.0    18
125.0    18
160.0    17
90.0     15
135.0    14
128.0    14
113.0    14
108.0    13
104.0    12
80.0     12
95.0     12
96.0     12
70.0     12
185.0    10
116.0    10
180.0    10
132.0    10
200.0    10
115.0    10
138.0    10
112.0     9
140.0     9
122.0     9
152.0     9
131.0     9
158.0     9
105.0     8
126.0     8
81.0      8
124.0     8
144.0     8
123.0     8
176.0     7
84.0      7
136.0     7
162.0     7
99.0      7
102.0     7
155.0     7
133.0     7
143.0     6
88.0      6
50.0      6
94.0      6
134.0     6
71.0      6
165.0     6
175.0     6
148.0     5
65.0      5
188.0     5
151.0     5
117.0     5
118.0     5
139.0     5
98.0      5
137.0     5
107.0     5
60.0      5
55.0      5
75.0      5
260.0     5
170.0     5
40.0      5
111.0     5
66.0      5
103.0     4
275.0     4
173.0     4
172.0     4
182.0     4
149.0     4
93.0      4
74.0      4
106.0     4
67.0      4
300.0     4
...

In [140]:
all_data['Credit_History'].value_counts(dropna=False)

1.0    754
0.0    148
2.0     79
Name: Credit_History, dtype: int64

In [141]:
all_data['Gender'].value_counts(dropna=False)

Male      775
Female    182
NaN        24
Name: Gender, dtype: int64

In [142]:
all_data['Gender'] = all_data['Gender'].fillna('Unknown')

In [143]:
all_data['Gender'].value_counts(dropna=False)

Male       775
Female     182
Unknown     24
Name: Gender, dtype: int64

In [144]:
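# The fill label is spelled 'Unkown' here; later column names such as Married_Unkown derive from it
all_data['Married'] = all_data['Married'].fillna('Unkown')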

In [145]:
all_data['Married'].value_counts(dropna=False)

Yes       631
No        347
Unkown      3
Name: Married, dtype: int64

In [146]:
all_data['Self_Employed'] = all_data['Self_Employed'].fillna('Unkown')

In [147]:
all_data['Self_Employed'].value_counts(dropna=False)

No        807
Yes       119
Unkown     55
Name: Self_Employed, dtype: int64

In [148]:
#all_data['Dependents'] = all_data['Dependents'].astype(int)
all_data['Dependents'].value_counts(dropna=False)

0      545
1      160
2      160
3+      91
NaN     25
Name: Dependents, dtype: int64

In [149]:
all_data['Dependents'] = all_data['Dependents'].fillna('Unkown')

In [150]:
all_data

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,0.0,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
976,LP002971,Male,Yes,3+,Not Graduate,Yes,4009,1777.0,113.0,360.0,1.0,Urban,
977,LP002975,Male,Yes,0,Graduate,No,4158,709.0,115.0,360.0,1.0,Urban,
978,LP002980,Male,No,0,Graduate,No,3250,1993.0,126.0,360.0,2.0,Semiurban,
979,LP002986,Male,Yes,0,Graduate,No,5000,2393.0,158.0,360.0,1.0,Rural,


In [151]:
all_data.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Loan_ID,981.0,981.0,LP001002,1.0,,,,,,,
Gender,981.0,3.0,Male,775.0,,,,,,,
Married,981.0,3.0,Yes,631.0,,,,,,,
Dependents,981.0,5.0,0,545.0,,,,,,,
Education,981.0,2.0,Graduate,763.0,,,,,,,
Self_Employed,981.0,3.0,No,807.0,,,,,,,
ApplicantIncome,981.0,,,,5179.795107,5695.104533,0.0,2875.0,3800.0,5516.0,81000.0
CoapplicantIncome,981.0,,,,1601.91633,2718.772806,0.0,0.0,1110.0,2365.0,41667.0
LoanAmount,981.0,,,,138.589195,79.831886,0.0,99.0,125.0,160.0,700.0
Loan_Amount_Term,981.0,,,,335.22528,80.577376,0.0,360.0,360.0,360.0,480.0


In [152]:
all_data = pd.get_dummies(all_data,columns=['Gender','Married','Dependents','Education','Property_Area','Self_Employed','Credit_History'])
#all_data = pd.get_dummies(all_data,columns=['Gender','Married','Dependents','Education','Property_Area','Self_Employed'])
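
Concatenating train and test before calling `get_dummies`, as done above, guarantees that both parts end up with identical dummy columns. When new rows have to be encoded separately, one common approach is to encode them on their own and then align them to the training columns with `reindex`. The sketch below uses made-up rows to illustrate the idea.

```python
import pandas as pd

# Made-up rows; the test part lacks the 'Rural' and 'Semiurban' categories.
train = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban"],
                      "ApplicantIncome": [5849, 4583, 3000]})
test = pd.DataFrame({"Property_Area": ["Urban", "Urban"],
                     "ApplicantIncome": [4009, 4158]})

train_enc = pd.get_dummies(train, columns=["Property_Area"])
test_enc = pd.get_dummies(test, columns=["Property_Area"])

# Categories absent from the test rows become all-zero columns; a category
# seen only in the test rows would be dropped by this step.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```

After the `reindex`, `test_enc` has the same columns in the same order as `train_enc`, which is what a fitted model expects at prediction time.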

In [153]:
all_data

Unnamed: 0,Loan_ID,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Loan_Status,Gender_Female,Gender_Male,Gender_Unknown,Married_No,...,Education_Not Graduate,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Self_Employed_No,Self_Employed_Unkown,Self_Employed_Yes,Credit_History_0.0,Credit_History_1.0,Credit_History_2.0
0,LP001002,5849,0.0,0.0,360.0,Y,0,1,0,1,...,0,0,0,1,1,0,0,0,1,0
1,LP001003,4583,1508.0,128.0,360.0,N,0,1,0,0,...,0,1,0,0,1,0,0,0,1,0
2,LP001005,3000,0.0,66.0,360.0,Y,0,1,0,0,...,0,0,0,1,0,0,1,0,1,0
3,LP001006,2583,2358.0,120.0,360.0,Y,0,1,0,0,...,1,0,0,1,1,0,0,0,1,0
4,LP001008,6000,0.0,141.0,360.0,Y,0,1,0,1,...,0,0,0,1,1,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
976,LP002971,4009,1777.0,113.0,360.0,,0,1,0,0,...,1,0,0,1,0,0,1,0,1,0
977,LP002975,4158,709.0,115.0,360.0,,0,1,0,0,...,0,0,0,1,1,0,0,0,1,0
978,LP002980,3250,1993.0,126.0,360.0,,0,1,0,1,...,0,0,1,0,1,0,0,0,0,1
979,LP002986,5000,2393.0,158.0,360.0,,0,1,0,0,...,0,1,0,0,1,0,0,0,1,0


In [154]:
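# Drop both non-numeric columns so that corr() also runs on pandas versions that no longer skip them silently
all_data.drop(columns=['Loan_ID','Loan_Status']).corr()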

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Gender_Female,Gender_Male,Gender_Unknown,Married_No,Married_Unkown,Married_Yes,...,Education_Not Graduate,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Self_Employed_No,Self_Employed_Unkown,Self_Employed_Yes,Credit_History_0.0,Credit_History_1.0,Credit_History_2.0
ApplicantIncome,1.0,-0.114247,0.518314,-0.006297,-0.060444,0.026282,0.082797,-0.052126,0.009995,0.050873,...,-0.138909,-9.9e-05,-0.009034,0.009171,-0.108873,0.020259,0.113106,-0.020201,0.029453,-0.019076
CoapplicantIncome,-0.114247,1.0,0.178831,-0.049129,-0.082428,0.081674,-0.007926,-0.061606,-0.027527,0.06466,...,-0.06038,0.035925,-0.026793,-0.007484,-0.01508,0.051807,-0.018861,0.011531,-0.068562,0.09109
LoanAmount,0.518314,0.178831,1.0,0.058462,-0.087561,0.061332,0.05862,-0.140185,-0.022621,0.142522,...,-0.163894,0.037799,0.002358,-0.038565,-0.104833,0.026745,0.103809,-0.001292,-0.023243,0.037722
Loan_Amount_Term,-0.006297,-0.049129,0.058462,1.0,0.063592,-0.066117,0.014305,0.048255,0.017038,-0.050126,...,-0.056236,0.014941,0.054579,-0.06914,0.026946,-0.017463,-0.019222,-0.033501,0.019032,0.014569
Gender_Female,-0.060444,-0.082428,-0.087561,0.063592,1.0,-0.92572,-0.075581,0.327012,0.02106,-0.328808,...,-0.040649,-0.067824,0.094498,-0.029989,0.015662,0.009075,-0.024719,0.018627,-0.030381,0.022585
Gender_Male,0.026282,0.081674,0.061332,-0.066117,-0.92572,1.0,-0.307162,-0.325238,-0.016772,0.326543,...,0.040802,0.065251,-0.103057,0.041052,-0.003526,0.005977,-8.6e-05,-0.027421,0.031645,-0.012976
Gender_Unknown,0.082797,-0.007926,0.05862,0.014305,-0.075581,-0.307162,1.0,0.034649,-0.008771,-0.033571,...,-0.00529,-0.001371,0.03393,-0.032775,-0.030111,-0.038595,0.062424,0.025426,-0.006986,-0.022617
Married_No,-0.052126,-0.061606,-0.140185,0.048255,0.327012,-0.325238,0.034649,1.0,-0.040974,-0.993346,...,-0.026211,0.001967,0.006909,-0.008825,0.003055,0.014322,-0.013666,0.021738,-0.018733,0.000439
Married_Unkown,0.009995,-0.027527,-0.022621,0.017038,0.02106,-0.016772,-0.008771,-0.040974,1.0,-0.074366,...,-0.029604,-0.03588,0.035968,-0.001777,0.025718,-0.013498,-0.020578,-0.023345,0.030389,-0.016391
Married_Yes,0.050873,0.06466,0.142522,-0.050126,-0.328808,0.326543,-0.033571,-0.993346,-0.074366,1.0,...,0.029573,0.002172,-0.011042,0.009013,-0.006014,-0.012739,0.016012,-0.019005,0.015194,0.001451


In [155]:
all_data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ApplicantIncome,981.0,5179.795107,5695.104533,0.0,2875.0,3800.0,5516.0,81000.0
CoapplicantIncome,981.0,1601.91633,2718.772806,0.0,0.0,1110.0,2365.0,41667.0
LoanAmount,981.0,138.589195,79.831886,0.0,99.0,125.0,160.0,700.0
Loan_Amount_Term,981.0,335.22528,80.577376,0.0,360.0,360.0,360.0,480.0
Gender_Female,981.0,0.185525,0.388921,0.0,0.0,0.0,0.0,1.0
Gender_Male,981.0,0.79001,0.407509,0.0,1.0,1.0,1.0,1.0
Gender_Unknown,981.0,0.024465,0.154566,0.0,0.0,0.0,0.0,1.0
Married_No,981.0,0.353721,0.478368,0.0,0.0,0.0,1.0,1.0
Married_Unkown,981.0,0.003058,0.055244,0.0,0.0,0.0,0.0,1.0
Married_Yes,981.0,0.643221,0.479293,0.0,0.0,1.0,1.0,1.0


### Feature Selection:

In [156]:
pp_train = all_data[all_data['Loan_Status'].notna()]
pp_test = all_data[all_data['Loan_Status'].isna()]

#pp_train = preprocessed_data[preprocessed_data['Loan_Status'].notna()]
#pp_test = preprocessed_data[preprocessed_data['Loan_Status'].isna()]

In [157]:
pp_train.shape

(614, 28)

In [158]:
pp_test.shape

(367, 28)

In [159]:
pp_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 614 entries, 0 to 613
Data columns (total 28 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Loan_ID                  614 non-null    object 
 1   ApplicantIncome          614 non-null    int64  
 2   CoapplicantIncome        614 non-null    float64
 3   LoanAmount               614 non-null    float64
 4   Loan_Amount_Term         614 non-null    float64
 5   Loan_Status              614 non-null    object 
 6   Gender_Female            614 non-null    uint8  
 7   Gender_Male              614 non-null    uint8  
 8   Gender_Unknown           614 non-null    uint8  
 9   Married_No               614 non-null    uint8  
 10  Married_Unkown           614 non-null    uint8  
 11  Married_Yes              614 non-null    uint8  
 12  Dependents_0             614 non-null    uint8  
 13  Dependents_1             614 non-null    uint8  
 14  Dependents_2             ...

In [160]:
X = pp_train.drop(columns=['Loan_ID','Loan_Status'])
y = pp_train['Loan_Status']

In [195]:
train_X, test_X, train_y, test_y = train_test_split(X,y,test_size=0.2,random_state=42)
train_X.shape, test_X.shape, train_y.shape, test_y.shape

((491, 26), (123, 26), (491,), (123,))
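
The split above is random but not stratified. The labeled data holds 422 `Y` and 192 `N` rows (about 69% `Y`), while the hold-out part that results here contains 80 `Y` out of 123 (about 65%), as the confusion matrix further down shows. Below is a minimal sketch of a stratified split, using synthetic labels with the same class counts and a placeholder feature column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same 422 'Y' / 192 'N' counts as the labeled data.
y = np.array(["Y"] * 422 + ["N"] * 192)
X = np.arange(len(y)).reshape(-1, 1)   # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# With stratify=y both parts keep roughly the 69% 'Y' share of the full data.
print((y_tr == "Y").mean(), (y_te == "Y").mean())
```

In the notebook this would amount to adding `stratify=y` to the existing `train_test_split` call. The reported scores would then change, because the rows in each part differ.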

In [196]:
train_X.shape[0], test_X.shape[0], train_X.shape[0]+ test_X.shape[0], pp_train.shape[0]

(491, 123, 614, 614)

In [197]:
train_y.isna().sum()

0

### Model Creation:

In [198]:
model = LogisticRegression(max_iter=500)
model.fit(train_X,train_y)

LogisticRegression(max_iter=500)

In [199]:
model.coef_, model.intercept_

(array([[-1.69712961e-05, -6.30971065e-05, -6.22729188e-04,
         -2.11995798e-03, -1.28470241e-02,  2.49606105e-01,
         -4.35543211e-02, -3.38066399e-01,  1.25995312e-02,
          5.18671628e-01,  1.50192762e-01, -1.17440448e-01,
          1.99867545e-01,  3.87906778e-02, -7.82057775e-02,
          3.08723850e-01, -1.15519090e-01, -2.88376529e-01,
          5.86523901e-01, -1.04942612e-01,  9.19894218e-02,
          5.72864345e-02,  4.39289035e-02, -1.34090327e+00,
          1.41471733e+00,  1.19390698e-01]]),
 array([0.19320539]))
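
The coefficient array is hard to read without feature names, and the raw values are not directly comparable: the first entry belongs to `ApplicantIncome`, which is measured in thousands, so its coefficient is tiny by construction. A minimal sketch of one way to inspect coefficients follows. It standardizes the features in a pipeline and labels the coefficients with column names. The data is synthetic and the resulting numbers say nothing about the real model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; column names mirror the notebook, values are made up.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "ApplicantIncome": rng.integers(1500, 20000, size=200),
    "Credit_History_1.0": rng.integers(0, 2, size=200),
})
# Label depends mostly on credit history, with some noise.
y = np.where(X["Credit_History_1.0"] + rng.normal(0, 0.3, 200) > 0.5, "Y", "N")

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
pipe.fit(X, y)

# Pair each coefficient with its column name; sort by absolute size.
clf = pipe.named_steps["logisticregression"]
coefs = pd.Series(clf.coef_[0], index=X.columns)
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index))
```

For the notebook's own model, `pd.Series(model.coef_[0], index=train_X.columns)` would attach names to the array shown above without refitting.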

In [200]:
y_pred = model.predict(train_X)
print('Train accuracy')
print('accuracy score',accuracy_score(train_y,y_pred))
print('f1 score',f1_score(train_y,y_pred,pos_label='Y'))
print('confusion matrix\n',confusion_matrix(train_y,y_pred))

Train accuracy
accuracy score 0.7983706720977597
f1 score 0.8671140939597315
confusion matrix
 [[ 69  80]
 [ 19 323]]


In [201]:
y_test_pred = model.predict(test_X)
print('Test accuracy')
print('accuracy score',accuracy_score(test_y,y_test_pred))
print('f1 score',f1_score(test_y,y_test_pred,pos_label='Y'))
print('confusion matrix\n',confusion_matrix(test_y,y_test_pred))

Test accuracy
accuracy score 0.7804878048780488
f1 score 0.8508287292817679
confusion matrix
 [[19 24]
 [ 3 77]]
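
The hold-out confusion matrix carries more information than the two printed scores. The sketch below recomputes accuracy and F1 from it and adds two figures the notebook does not print: recall for the `N` class and the accuracy of always predicting `Y`.

```python
import numpy as np

# Hold-out confusion matrix printed above; rows are true labels, columns are
# predictions, both in sorted label order ('N', 'Y').
cm = np.array([[19, 24],
               [3, 77]])
tn, fp, fn, tp = cm.ravel()          # treating 'Y' as the positive class

accuracy = (tp + tn) / cm.sum()
precision_y = tp / (tp + fp)
recall_y = tp / (tp + fn)
f1_y = 2 * precision_y * recall_y / (precision_y + recall_y)
recall_n = tn / (tn + fp)            # share of true rejections the model catches
baseline = cm[1].sum() / cm.sum()    # accuracy of always predicting 'Y'

print(round(accuracy, 4), round(f1_y, 4), round(recall_n, 4), round(baseline, 4))
```

Accuracy and F1 reproduce the values printed above (about 0.78 and 0.85). Recall for `N` is about 0.44, so more than half of the true rejections are predicted as approvals, and the always-`Y` baseline already reaches about 0.65 accuracy.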


### Model Deployment:

In [202]:
pptest_X = pp_test.drop(columns=['Loan_ID','Loan_Status'])
y_unpred = model.predict(pptest_X)

In [203]:
np.unique(y_unpred, return_counts=True)

(array(['N', 'Y'], dtype=object), array([ 69, 298], dtype=int64))

In [204]:
# Write the predictions into the sample submission frame (rows are assumed to be in test-file order)
sample_data['Loan_Status']=y_unpred
sample_data.to_csv('submission_loan_prd_lr1.csv',index=False)
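
Assigning the predictions to `sample_data` relies on the sample submission file listing the loans in the same order as the test file. A minimal sketch of a variant that does not depend on that assumption follows: it takes the IDs from the same rows that were passed to the model. The IDs and predictions below are placeholders, not real values.

```python
import numpy as np
import pandas as pd

# Placeholder stand-ins for pp_test and for the predictions made on its rows.
pp_test = pd.DataFrame({"Loan_ID": ["ID_A", "ID_B", "ID_C"]}, index=[614, 615, 616])
y_unpred = np.array(["Y", "N", "Y"])

# Pair each prediction with the ID of the row it was computed from.
submission = pd.DataFrame({
    "Loan_ID": pp_test["Loan_ID"].to_numpy(),
    "Loan_Status": y_unpred,
})
assert len(submission) == len(pp_test)
print(submission)
```

The resulting frame could then be saved with `submission.to_csv(..., index=False)` in the same way as above.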