### **Loan Status Prediction using Support Vector Machines**

The dataset contains customer details such as Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount and Credit History, along with a column giving the loan approval status of each applicant. Using this information, a machine learning model is built to predict whether a person's loan will be approved.

Algorithm applied - Support Vector Classifier 

The dataset is obtained from Kaggle: https://www.kaggle.com/datasets/ninzaami/loan-predication/data

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings('ignore')

#### **The Dataset**

In [164]:
raw_data = pd.read_csv('train_u6lujuX_CVtuZ9i (1).csv')
data = raw_data.copy()
data.head()

,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [165]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [166]:
data.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

#### **Handling Missing Values**

In [167]:
round(data.isnull().sum()/data.shape[0],2)

Loan_ID              0.00
Gender               0.02
Married              0.00
Dependents           0.02
Education            0.00
Self_Employed        0.05
ApplicantIncome      0.00
CoapplicantIncome    0.00
LoanAmount           0.04
Loan_Amount_Term     0.02
Credit_History       0.08
Property_Area        0.00
Loan_Status          0.00
dtype: float64

Rows with missing values in columns where less than 5% of the values are missing will be dropped. `Self_Employed` and `Credit_History` are missing in 5% or more of the rows, so their missing values will be imputed instead.

For `Self_Employed` the missing values will be replaced by the mode. For `Credit_History` the missing values will be replaced by the median.
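The 5% rule described above can be sketched on a small toy frame. This is only an illustration: the `toy` data, the `THRESHOLD` constant and the `to_impute` list are hypothetical names that do not appear in the notebook, and the cells below apply the same idea to the real columns by hand.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data (values are illustrative only).
toy = pd.DataFrame({
    "Gender": ["Male"] * 30 + ["Female"] * 9 + [None],           # 2.5% missing
    "Self_Employed": [None] * 4 + ["No"] * 30 + ["Yes"] * 6,     # 10% missing
    "Credit_History": [np.nan] * 4 + [1.0] * 30 + [0.0] * 6,     # 10% missing
})

THRESHOLD = 0.05  # hypothetical cut-off mirroring the 5% rule

# Fraction of missing values per column
missing_frac = toy.isnull().mean()
to_impute = missing_frac[missing_frac >= THRESHOLD].index.tolist()

# Impute heavily affected columns: median for numeric, mode for categorical
for col in to_impute:
    if pd.api.types.is_numeric_dtype(toy[col]):
        toy[col] = toy[col].fillna(toy[col].median())
    else:
        toy[col] = toy[col].fillna(toy[col].mode()[0])

# Drop the few rows that still contain missing values
cleaned = toy.dropna()
print(to_impute, cleaned.shape)
```

Imputing first and dropping afterwards keeps the rows whose only gaps were in the heavily affected columns.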

In [168]:
data['Credit_History'] = data['Credit_History'].fillna(data['Credit_History'].median())

In [169]:
data['Self_Employed'] = data['Self_Employed'].fillna(data['Self_Employed'].mode()[0])

In [170]:
data.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed         0
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History        0
Property_Area         0
Loan_Status           0
dtype: int64

In [171]:
data = data.dropna()

In [172]:
data.isnull().sum()

Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

In [173]:
data.shape

(553, 13)

#### **Dropping Loan_ID**

The `Loan_ID` column is dropped because it is only an identifier: it carries no predictive information and would interfere with the analysis.

In [174]:
data = data.drop('Loan_ID', axis =1)
data.head()

,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
5,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Urban,Y


#### **Inspecting the Features**

In [175]:
data['Gender'].unique()

array(['Male', 'Female'], dtype=object)

In [176]:
data['Married'].unique()

array(['Yes', 'No'], dtype=object)

In [177]:
data['Dependents'].unique()

array(['1', '0', '2', '3+'], dtype=object)

In [178]:
data['Education'].unique()

array(['Graduate', 'Not Graduate'], dtype=object)

In [179]:
data['Self_Employed'].unique()

array(['No', 'Yes'], dtype=object)

In [180]:
data['Property_Area'].unique()

array(['Rural', 'Urban', 'Semiurban'], dtype=object)

`Dependents` has a category `3+`, which cannot be converted to a number directly, so it is replaced with the value 4 and the column is then converted to int.

In [181]:
data['Dependents'] = data['Dependents'].replace(to_replace='3+', value=4)
data['Dependents'].unique()

array(['1', '0', '2', 4], dtype=object)

In [182]:
data['Dependents'] = data['Dependents'].astype('int')

In [183]:
data['Dependents'].value_counts()

Dependents
0    316
1     96
2     96
4     45
Name: count, dtype: int64

In [184]:
data['Loan_Status'].unique()

array(['N', 'Y'], dtype=object)

In [185]:
data['Loan_Status'] = data['Loan_Status'].replace({'Y':1,'N':0})

In [186]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 553 entries, 1 to 613
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             553 non-null    object 
 1   Married            553 non-null    object 
 2   Dependents         553 non-null    int64  
 3   Education          553 non-null    object 
 4   Self_Employed      553 non-null    object 
 5   ApplicantIncome    553 non-null    int64  
 6   CoapplicantIncome  553 non-null    float64
 7   LoanAmount         553 non-null    float64
 8   Loan_Amount_Term   553 non-null    float64
 9   Credit_History     553 non-null    float64
 10  Property_Area      553 non-null    object 
 11  Loan_Status        553 non-null    int64  
dtypes: float64(4), int64(3), object(5)
memory usage: 56.2+ KB


In [187]:
data_new = data.copy()

In [193]:
data_new = data_new.reset_index(drop=True)

#### **Feature Engineering**

All categorical features are encoded. `OneHotEncoder` is used because these categories are nominal, not ordinal.

In [195]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
encoded_feature = ohe.fit_transform(data_new[['Gender','Married','Education','Self_Employed','Property_Area']])

In [196]:
encoded_feature

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 2765 stored elements and shape (553, 11)>

In [197]:
data_encoded = pd.DataFrame(encoded_feature.toarray(), columns=ohe.get_feature_names_out(['Gender','Married','Education','Self_Employed','Property_Area']))
data_encoded.head()

,Gender_Female,Gender_Male,Married_No,Married_Yes,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban
0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
1,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
2,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0
3,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0


In [198]:
df_new = pd.concat([data_new,data_encoded], axis =1)
df_new.head()

,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,...,Gender_Male,Married_No,Married_Yes,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban
0,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,...,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
1,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,...,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
2,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,...,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0
3,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,...,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,...,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
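The earlier `reset_index(drop=True)` call matters for this concatenation: `pd.concat(..., axis=1)` aligns rows by index label, and after `dropna()` the index of `data` has gaps while the encoded frame is numbered 0 to n-1. A minimal sketch of the effect, using hypothetical toy frames `left` and `right`:

```python
import pandas as pd

# Frame whose index has gaps, as happens after dropna()
left = pd.DataFrame({"LoanAmount": [128.0, 66.0, 120.0]}, index=[1, 2, 5])
# Freshly built frame with a default 0..n-1 index, like the encoded columns
right = pd.DataFrame({"Gender_Male": [1.0, 1.0, 0.0]})

# Without resetting, rows are matched by label and NaNs appear
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first lines the rows up by position
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned)
print(aligned)
```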


In [199]:
df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 553 entries, 0 to 552
Data columns (total 23 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Gender                   553 non-null    object 
 1   Married                  553 non-null    object 
 2   Dependents               553 non-null    int64  
 3   Education                553 non-null    object 
 4   Self_Employed            553 non-null    object 
 5   ApplicantIncome          553 non-null    int64  
 6   CoapplicantIncome        553 non-null    float64
 7   LoanAmount               553 non-null    float64
 8   Loan_Amount_Term         553 non-null    float64
 9   Credit_History           553 non-null    float64
 10  Property_Area            553 non-null    object 
 11  Loan_Status              553 non-null    int64  
 12  Gender_Female            553 non-null    float64
 13  Gender_Male              553 non-null    float64
 14  Married_No               5

In [200]:
df_new = df_new.drop(labels=['Gender','Married','Education','Self_Employed','Property_Area'], axis=1)
df_new.head()

,Dependents,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Loan_Status,Gender_Female,Gender_Male,Married_No,Married_Yes,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban
0,1,4583,1508.0,128.0,360.0,1.0,0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
1,0,3000,0.0,66.0,360.0,1.0,1,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
2,0,2583,2358.0,120.0,360.0,1.0,1,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0
3,0,6000,0.0,141.0,360.0,1.0,1,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
4,2,5417,4196.0,267.0,360.0,1.0,1,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0


In [201]:
df_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 553 entries, 0 to 552
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Dependents               553 non-null    int64  
 1   ApplicantIncome          553 non-null    int64  
 2   CoapplicantIncome        553 non-null    float64
 3   LoanAmount               553 non-null    float64
 4   Loan_Amount_Term         553 non-null    float64
 5   Credit_History           553 non-null    float64
 6   Loan_Status              553 non-null    int64  
 7   Gender_Female            553 non-null    float64
 8   Gender_Male              553 non-null    float64
 9   Married_No               553 non-null    float64
 10  Married_Yes              553 non-null    float64
 11  Education_Graduate       553 non-null    float64
 12  Education_Not Graduate   553 non-null    float64
 13  Self_Employed_No         553 non-null    float64
 14  Self_Employed_Yes        5

#### **Selecting Features and Target**

In [202]:
df_new['Loan_Status'].unique()

array([0, 1])

In [203]:
y = df_new['Loan_Status']
X = df_new.drop('Loan_Status',axis=1)

In [204]:
X.shape

(553, 17)

In [205]:
y.shape

(553,)

#### **Train Test Split**

In [206]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

In [207]:
X_train.shape

(442, 17)

In [208]:
X_test.shape

(111, 17)

In [209]:
y_train.shape

(442,)

In [210]:
y_test.shape

(111,)

#### **Feature Scaling**

Since SVM is a distance-based algorithm, features with large values would otherwise dominate the model. The data will therefore be scaled using `StandardScaler`.
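To make the scaling step concrete, the sketch below standardises a toy array and checks the result against the formula z = (x - mean) / std. The arrays `X_train_toy` and `X_test_toy` are made-up values, not taken from the dataset. The scaler is fitted on the training rows only and then reused for the test rows, so no information from the test set leaks into the statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features on very different scales (illustrative income and loan amount)
X_train_toy = np.array([[2500.0, 100.0], [4000.0, 128.0], [5500.0, 170.0]])
X_test_toy = np.array([[3000.0, 120.0]])

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_train_toy)
train_scaled = scaler.transform(X_train_toy)
test_scaled = scaler.transform(X_test_toy)  # reuses the training statistics

# Same computation by hand (StandardScaler uses the population std, ddof=0)
manual = (X_train_toy - X_train_toy.mean(axis=0)) / X_train_toy.std(axis=0)
print(np.allclose(train_scaled, manual))
```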

In [249]:
X.describe()

,Dependents,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_Female,Gender_Male,Married_No,Married_Yes,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban
count,553.0,553.0,553.0,553.0,553.0,553.0,553.0,553.0,553.0,553.0,553.0,553.0,553.0,553.0,553.0,553.0,553.0
mean,0.846293,5350.018083,1659.119204,146.001808,341.663653,0.871609,0.188065,0.811935,0.350814,0.649186,0.790235,0.209765,0.869801,0.130199,0.294756,0.388788,0.316456
std,1.206816,5965.429068,3043.448229,84.052035,65.555451,0.334827,0.391118,0.391118,0.477657,0.477657,0.407509,0.407509,0.336827,0.336827,0.456346,0.487916,0.465514
min,0.0,150.0,0.0,9.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,2889.0,0.0,100.0,360.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
50%,0.0,3812.0,1213.0,128.0,360.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
75%,2.0,5815.0,2306.0,170.0,360.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0
max,4.0,81000.0,41667.0,650.0,480.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


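`ApplicantIncome` and `CoapplicantIncome` seem to have outliers: their maximum values (81000 and 41667) are far above their 75th percentiles (5815 and 2306).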

In [250]:
X['ApplicantIncome'].quantile(0.99)

np.float64(27079.240000000238)

In [213]:
X['CoapplicantIncome'].quantile(0.99)

np.float64(9934.240000000036)

In [68]:
X[X['ApplicantIncome'] > X['ApplicantIncome'].quantile(0.99)]

,Dependents,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_Female,Gender_Male,Married_No,Married_Yes,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban
155,4.0,39999.0,0.0,600.0,180.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
183,1.0,33846.0,0.0,260.0,360.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0
185,0.0,39147.0,4750.0,120.0,360.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
333,0.0,63337.0,0.0,490.0,180.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
409,4.0,81000.0,0.0,360.0,360.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
443,1.0,37719.0,0.0,152.0,360.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0


In [251]:
X[X['CoapplicantIncome'] > X['CoapplicantIncome'].quantile(0.99)]

,Dependents,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_Female,Gender_Male,Married_No,Married_Yes,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban
8,1,12841,10968.0,349.0,360.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
156,4,5516,11300.0,495.0,360.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
360,0,2500,20000.0,103.0,360.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
375,2,1600,20000.0,239.0,360.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
524,0,1836,33837.0,90.0,360.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
540,4,416,41667.0,350.0,180.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0


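We will not remove the outliers for now. Because the data contains outliers, standardisation is used rather than min-max scaling, although the mean and standard deviation it relies on are still affected by extreme values.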


In [219]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

In [220]:
X_test_scaled = scaler.transform(X_test)

In [221]:
X_train_scaled

array([[-0.72069456, -0.36156171,  0.25827047, ..., -0.65606456,
         1.26491106, -0.67730781],
       [ 0.93409503, -0.5407572 ,  0.00815659, ...,  1.52424023,
        -0.79056942, -0.67730781],
       [ 0.10670023, -0.58276114,  0.23905992, ...,  1.52424023,
        -0.79056942, -0.67730781],
       ...,
       [-0.72069456, -0.33105695, -0.60168435, ..., -0.65606456,
         1.26491106, -0.67730781],
       [ 0.10670023, -0.42225181,  0.12040649, ...,  1.52424023,
        -0.79056942, -0.67730781],
       [-0.72069456, -0.0871786 , -0.60168435, ...,  1.52424023,
        -0.79056942, -0.67730781]])

In [222]:
y_train

100    1
155    1
521    1
500    1
15     1
      ..
299    1
534    1
493    0
527    0
168    1
Name: Loan_Status, Length: 442, dtype: int64

#### **Training the Model**

The model will be trained on `X_train_scaled` (features) and `y_train` (target). `X_test_scaled` will be used later for predictions.

In [245]:
from sklearn.svm import SVC

model = SVC(C=5, kernel='linear')  # gamma only affects non-linear kernels, so it is omitted
model.fit(X_train_scaled, y_train)

#### **Model Evaluation**

In [246]:
y_pred = model.predict(X_test_scaled)
y_pred

array([1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
       1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1,
       1])

In [247]:
from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(y_test, y_pred)
cr = classification_report(y_test, y_pred)
print(cm)
print(cr)

[[18 22]
 [ 2 69]]
              precision    recall  f1-score   support

           0       0.90      0.45      0.60        40
           1       0.76      0.97      0.85        71

    accuracy                           0.78       111
   macro avg       0.83      0.71      0.73       111
weighted avg       0.81      0.78      0.76       111



In [248]:
model.score(X_test_scaled, y_test)

0.7837837837837838
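The figures in the classification report can be recomputed directly from the confusion matrix printed above, which helps when reading the report. The sketch below uses only the four counts from that matrix; the variable names are chosen for this example.

```python
import numpy as np

# Confusion matrix from the output above
# rows: actual class 0 / 1, columns: predicted class 0 / 1
cm = np.array([[18, 22],
               [2, 69]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # 87 / 111
precision_1 = tp / (tp + fp)      # 69 / 91
recall_1 = tp / (tp + fn)         # 69 / 71
precision_0 = tn / (tn + fn)      # 18 / 20
recall_0 = tn / (tn + fp)         # 18 / 40

print(round(accuracy, 2), round(precision_0, 2), round(recall_0, 2),
      round(precision_1, 2), round(recall_1, 2))
```

The low recall for class 0 comes from the 22 rejected applications that the model predicted as approved.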

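**The model achieves an accuracy of 78% on the test set. Recall for rejected loans (class 0) is only 0.45, so many rejections are predicted as approvals. This can likely be improved by addressing outliers and tuning hyperparameters.**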
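A minimal sketch of what hyperparameter tuning could look like is given below. It runs on synthetic data from `make_classification` rather than the loan dataset, and the parameter grid, `X_demo`, `y_demo` and the pipeline step names are assumptions made for illustration. No tuned results for the loan data are implied.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the prepared features and target
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2, stratify=y_demo
)

# Scaling inside the pipeline is refitted on each cross-validation fold
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Hypothetical grid; the values are illustrative, not recommendations
param_grid = {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 5]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))
```

Putting the scaler in the pipeline keeps the validation folds out of the scaling statistics, in the same spirit as fitting `StandardScaler` on `X_train` only.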