**Logistic Regression**

Import libraries

In [1]:
import pandas as pd
import numpy as np

Mounting Google Drive

In [2]:
from google.colab import drive
drive.mount('/content/drive/', force_remount= True)

Mounted at /content/drive/


Read the CSV file into a DataFrame

In [3]:
dataset = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/training_loan_data.csv')
print(dataset)

      Loan_ID  Gender Married Dependents     Education Self_Employed  \
0    LP001002    Male      No          0      Graduate            No   
1    LP001003    Male     Yes          1      Graduate            No   
2    LP001005    Male     Yes          0      Graduate           Yes   
3    LP001006    Male     Yes          0  Not Graduate            No   
4    LP001008    Male      No          0      Graduate            No   
..        ...     ...     ...        ...           ...           ...   
609  LP002978  Female      No          0      Graduate            No   
610  LP002979    Male     Yes         3+      Graduate            No   
611  LP002983    Male     Yes          1      Graduate            No   
612  LP002984    Male     Yes          2      Graduate            No   
613  LP002990  Female      No          0      Graduate           Yes   

     ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0               5849                0.0         NaN           

Check the column dtypes

In [4]:
dataset.dtypes

Loan_ID               object
Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

In [5]:
print(dataset['Dependents'])

0       0
1       1
2       0
3       0
4       0
       ..
609     0
610    3+
611     1
612     2
613     0
Name: Dependents, Length: 614, dtype: object


In [6]:
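# Treat '3+' as 4 so the column can later be converted to a number.
# Assigning the result back avoids the chained inplace call, which newer pandas versions warn about.
dataset['Dependents'] = dataset['Dependents'].replace('3+', '4')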
print(dataset)

      Loan_ID  Gender Married Dependents     Education Self_Employed  \
0    LP001002    Male      No          0      Graduate            No   
1    LP001003    Male     Yes          1      Graduate            No   
2    LP001005    Male     Yes          0      Graduate           Yes   
3    LP001006    Male     Yes          0  Not Graduate            No   
4    LP001008    Male      No          0      Graduate            No   
..        ...     ...     ...        ...           ...           ...   
609  LP002978  Female      No          0      Graduate            No   
610  LP002979    Male     Yes          4      Graduate            No   
611  LP002983    Male     Yes          1      Graduate            No   
612  LP002984    Male     Yes          2      Graduate            No   
613  LP002990  Female      No          0      Graduate           Yes   

     ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  \
0               5849                0.0         NaN           

ColumnTransformer creates extra columns for categorical data, so the Dependents column needs to be converted from object to a numeric dtype. Because the column contains missing values, the result is float64 rather than int64.

In [7]:
dataset['Dependents'] = pd.to_numeric(dataset['Dependents'], errors='coerce', downcast='integer')
dataset.dtypes

Loan_ID               object
Gender                object
Married               object
Dependents           float64
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object
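
Why the conversion yields float64 can be seen on a small example. The sketch below uses a made-up column (not the loan data) and also shows pandas' nullable `Int64` dtype as one possible way to keep whole numbers alongside missing values; whether that dtype suits the later steps of this notebook is an assumption that has not been checked here.

```python
import pandas as pd

# Toy column mimicking Dependents: strings plus one missing value (made-up data).
dependents = pd.Series(['0', '1', '3+', None, '2'], dtype=object, name='Dependents')

# Map '3+' to '4' as in the notebook, then convert the strings to numbers.
numeric = pd.to_numeric(dependents.replace('3+', '4'), errors='coerce')
print(numeric.dtype)    # float64, because NaN cannot be stored in int64

# The nullable 'Int64' extension dtype keeps integers and marks the gap as <NA>.
nullable = numeric.astype('Int64')
print(nullable.dtype)   # Int64
```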

Find out which columns have missing values

In [8]:
dataset.isna().any()

Loan_ID              False
Gender                True
Married               True
Dependents            True
Education            False
Self_Employed         True
ApplicantIncome      False
CoapplicantIncome    False
LoanAmount            True
Loan_Amount_Term      True
Credit_History        True
Property_Area        False
Loan_Status          False
dtype: bool

Count the missing values in each column

In [9]:
dataset.isna().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

Split the data into independent variables (X) and the dependent variable (y)

In [10]:
#X = dataset.iloc[:, 1:12].values
X = dataset.iloc[:, [5,6,7,8,10,11]]
print(X[:10])
y = dataset.iloc[:, -1].values
print(y)

  Self_Employed  ApplicantIncome  CoapplicantIncome  LoanAmount  \
0            No             5849                0.0         NaN   
1            No             4583             1508.0       128.0   
2           Yes             3000                0.0        66.0   
3            No             2583             2358.0       120.0   
4            No             6000                0.0       141.0   
5           Yes             5417             4196.0       267.0   
6            No             2333             1516.0        95.0   
7            No             3036             2504.0       158.0   
8            No             4006             1526.0       168.0   
9            No            12841            10968.0       349.0   

   Credit_History Property_Area  
0             1.0         Urban  
1             1.0         Rural  
2             1.0         Urban  
3             1.0         Urban  
4             1.0         Urban  
5             1.0         Urban  
6             1.0       

Dynamically select columns based on the specified indices


In [11]:
selected_indices = range(1,13)
selected_columns = [dataset.columns[i] for i in selected_indices]
print(selected_columns)

['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status']


Reshape y from a 1D array to a 2D array, because ColumnTransformer expects a 2D array or a DataFrame.

In [12]:
y = y.reshape(-1, 1)
print(y[:10])

[['Y']
 ['N']
 ['Y']
 ['Y']
 ['Y']
 ['Y']
 ['Y']
 ['N']
 ['Y']
 ['N']]
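
As a small illustration of the reshape, the sketch below uses a three-element stand-in for y (made-up values): `reshape(-1, 1)` turns it into a single column, and `ravel()` flattens it again.

```python
import numpy as np

y_demo = np.array(['Y', 'N', 'Y'])   # 1D array, shape (3,)
y_2d = y_demo.reshape(-1, 1)         # -1 lets NumPy infer the row count: shape (3, 1)
y_back = y_2d.ravel()                # back to 1D, shape (3,)

print(y_demo.shape, y_2d.shape, y_back.shape)
```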


Use the KNN imputer to replace missing values

The KNN imputer works only on numeric data, so the text columns have to be encoded first. I am using OneHotEncoder, which does not impose an artificial ordering on the categories that could mislead the model.

In [14]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
#ct_X = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0,1,3,4,10])], remainder = 'passthrough')
ct_X = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0,5])], remainder = 'passthrough')
ct_y = ColumnTransformer(transformers=[('encoder', OneHotEncoder(drop='first'), [0])], remainder = 'passthrough')
transformed_X = ct_X.fit_transform(X)
transformed_y = ct_y.fit_transform(y)

In [15]:
print(X.shape)
print(transformed_X.shape)
print(y.shape)
print(transformed_y.shape)

(614, 6)
(614, 10)
(614, 1)
(614, 1)


Apply the KNN imputer

In [16]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
imputer_X = pd.DataFrame(imputer.fit_transform(transformed_X))
# Loan_Status has no missing values, so this call only converts y to a DataFrame.
imputer_y = pd.DataFrame(imputer.fit_transform(transformed_y))
print(imputer_X)
print(imputer_y)

       0    1    2    3    4    5       6       7      8    9
0    1.0  0.0  0.0  0.0  0.0  1.0  5849.0     0.0  147.8  1.0
1    1.0  0.0  0.0  1.0  0.0  0.0  4583.0  1508.0  128.0  1.0
2    0.0  1.0  0.0  0.0  0.0  1.0  3000.0     0.0   66.0  1.0
3    1.0  0.0  0.0  0.0  0.0  1.0  2583.0  2358.0  120.0  1.0
4    1.0  0.0  0.0  0.0  0.0  1.0  6000.0     0.0  141.0  1.0
..   ...  ...  ...  ...  ...  ...     ...     ...    ...  ...
609  1.0  0.0  0.0  1.0  0.0  0.0  2900.0     0.0   71.0  1.0
610  1.0  0.0  0.0  1.0  0.0  0.0  4106.0     0.0   40.0  1.0
611  1.0  0.0  0.0  0.0  0.0  1.0  8072.0   240.0  253.0  1.0
612  1.0  0.0  0.0  0.0  0.0  1.0  7583.0     0.0  187.0  1.0
613  0.0  1.0  0.0  0.0  1.0  0.0  4583.0     0.0  133.0  0.0

[614 rows x 10 columns]
       0
0    1.0
1    0.0
2    1.0
3    1.0
4    1.0
..   ...
609  1.0
610  1.0
611  1.0
612  1.0
613  0.0

[614 rows x 1 columns]
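
How the imputed values arise can be shown on a tiny made-up array. In this sketch, the missing entry is replaced by the mean of that feature over the two nearest rows, where distance is measured on the features that are present.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up data: the second row is missing its second feature.
X_demo = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [50.0, 500.0],
])

imputer_demo = KNNImputer(n_neighbors=2)
filled = imputer_demo.fit_transform(X_demo)

# The two nearest rows (by the first feature) are rows 0 and 2,
# so the gap is filled with the mean of 10 and 30.
print(filled[1, 1])   # 20.0
```

Since the neighbours are chosen by distance, features on a large scale (such as the income columns) can dominate the choice when the data is not scaled beforehand.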


Splitting the dataset into the Training set and Test set

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(imputer_X, imputer_y, test_size = 0.25, random_state = 0)

Scaling the data

In [18]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Flatten y_train to 1D, the shape the classifier expects for its target
y_train = np.ravel(y_train)
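
In the cells above, the imputer is fitted on the full dataset before the split, so the test rows influence the imputed values. One possible alternative, sketched here on synthetic data, is to chain scaling, imputation and the classifier in a single Pipeline that is fitted on the training rows only. The data, the variable names and the choice to scale before imputing are all assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 rows, 3 numeric features, binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan   # knock out about 10% of the entries

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# StandardScaler ignores NaN when fitting and leaves it in place when transforming,
# so the imputer then measures distances on scaled features.
model = make_pipeline(
    StandardScaler(),
    KNNImputer(n_neighbors=5),
    LogisticRegression(random_state=0),
)
model.fit(X_tr, y_tr)               # every step is fitted on the training rows only
test_accuracy = model.score(X_te, y_te)
print(test_accuracy)
```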

Training the Logistic Regression model on the Training set

In [19]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
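
What the fitted model computes for a binary target can be sketched on a made-up one-feature dataset: a linear score from the learned coefficients and intercept, passed through the sigmoid function, reproduces the probability returned by `predict_proba`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up, linearly separable toy data.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([0, 0, 1, 1])

clf = LogisticRegression(random_state=0).fit(X_demo, y_demo)

# Linear score z = X w + b, then sigmoid(z) = 1 / (1 + exp(-z)).
z = X_demo @ clf.coef_.T + clf.intercept_
p_manual = (1.0 / (1.0 + np.exp(-z))).ravel()

p_sklearn = clf.predict_proba(X_demo)[:, 1]   # probability of class 1
print(np.allclose(p_manual, p_sklearn))
```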

Confusion matrix and accuracy

In [20]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
accuracy_score(y_test, y_pred)

[[ 19  24]
 [  4 107]]


0.8181818181818182

In [21]:
precision_score(y_test, y_pred)

0.816793893129771

In [22]:
recall_score(y_test, y_pred)

0.963963963963964

In [23]:
f1_score(y_test, y_pred)

0.8842975206611571
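
The four scores can be cross-checked by hand from the confusion matrix printed above. scikit-learn lays the matrix out with true labels as rows and predicted labels as columns, so the entries read as [[TN, FP], [FN, TP]], with class 1 being the approved loans ('Y').

```python
# Entries copied from the confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 19, 24, 4, 107

accuracy = (tp + tn) / (tn + fp + fn + tp)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted 'Y' that are truly 'Y'
recall = tp / (tp + fn)                      # share of true 'Y' that were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

The low count of true negatives relative to false positives (19 against 24) shows that the model rarely rejects a loan, which is why recall is much higher than precision.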