Data:
1) Structured Data: 2-dimensional data (rows and columns)
	a) Homogeneous: Statistical Models (Linear Regression and Logistic Regression)
	b) Heterogeneous: Machine Learning (Random Forest, SVM, Boosting...)
2) Unstructured Data: n-dimensional (Images, Text, Voice, Videos): Deep Learning


Variables
1) Discrete
	- Flag
	- Categorical/nominal
	- Ordinal
2) Numerical
	- Continuous
	- Integer
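
As a rough illustration of the variable types listed above, the sketch below sorts the columns of a small, made-up DataFrame into those groups using the dtype and the number of distinct values. The helper name `classify_columns` and the rule "two distinct values means flag" are assumptions made for this example. Ordinal columns cannot be detected automatically and would have to be marked by hand.

```python
import pandas as pd


def classify_columns(df: pd.DataFrame) -> dict:
    """Hypothetical helper: label each column using simple dtype/cardinality rules."""
    kinds = {}
    for col in df.columns:
        values = df[col].dropna()
        if values.nunique() == 2:
            kinds[col] = "flag"            # exactly two distinct values
        elif pd.api.types.is_float_dtype(values):
            kinds[col] = "continuous"
        elif pd.api.types.is_integer_dtype(values):
            kinds[col] = "integer"
        else:
            kinds[col] = "categorical"     # strings / objects with 3+ levels
    return kinds


# Toy frame that borrows column names from the loan data used later on.
demo = pd.DataFrame({
    "Married": ["Yes", "No", "Yes", None],
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
    "LoanAmount": [128.0, 66.0, 120.0, 141.0],
})
kinds = classify_columns(demo)
print(kinds)
```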
	
Approach:
1) Load and Audit the data (Pandas, sklearn)
	a) Shape of the data
	b) Distribution of the data (Homogeneous/Linear or Heterogeneous/Non-linear)
	c) integrity of the data (missing values, inconsistencies etc)
2) Data Preparation and Data Transformation
	a) Missing value imputation
	b) Handle inconsistencies
	c) Transformation
		- zscore/standard scaler
		- log transform (compresses large, right-skewed values)
		- min/max scaler
3) Data Visualization
	a) Boxplots
	b) scatters
	c) Pairplot
	
4) Analysis
	a) Uni-variate analysis (measures of central tendency / measures of dispersion)
	b) Bi-variate analysis (correlation / chi-square tests)
	c) Multi-Variate
		- Supervised
			- Classification
		- Unsupervised

Import the standard libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

Load and Audit the data

In [2]:
train = pd.read_csv('train_ctrUa4K.csv')
test = pd.read_csv('test_lAUu6dG.csv')

In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [16]:
train.isna().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
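
Raw counts are easier to judge as a share of the rows. For instance, the 50 missing Credit_History values above are about 8.1% of the 614 rows. The sketch below builds such a report on a toy frame. `missing_report` is a hypothetical helper name, not something used elsewhere in this notebook.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: count and percentage of missing values per column."""
    report = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
    })
    # Keep only columns that have gaps, worst first.
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)


demo = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, None, None, 141.0],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
})
report = missing_report(demo)
print(report)
```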

Imputation of missing values


1.   Categorical/Object dtype: Mode (Frequency)
2.   Continuous/Int and Float dtype: Median 
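
The two rules above can be sketched as one loop instead of one cell per column. This is an illustration on toy data, not the code this notebook runs: `impute_simple` and its `mode_cols` parameter are hypothetical names. `mode_cols` exists because a 0/1 flag stored as float, such as Credit_History, would otherwise be treated as continuous.

```python
import pandas as pd


def impute_simple(df: pd.DataFrame, mode_cols=()) -> pd.DataFrame:
    """Hypothetical helper: mode for object columns, median for numeric ones."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill
        numeric = pd.api.types.is_numeric_dtype(out[col])
        if numeric and col not in mode_cols:
            fill = out[col].median()
        else:
            fill = out[col].mode()[0]  # most frequent value
        out[col] = out[col].fillna(fill)
    return out


demo = pd.DataFrame({
    "Gender": ["Male", "Male", None, "Female"],
    "LoanAmount": [100.0, None, 120.0, 200.0],
    "Credit_History": [1.0, 1.0, None, 0.0],
})
filled = impute_simple(demo, mode_cols=("Credit_History",))
print(filled)
```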



In [5]:
train_1 = train.copy()

In [6]:
train_1['Gender'].value_counts()

Male      489
Female    112
Name: Gender, dtype: int64

In [7]:
# Fill missing Gender with the mode, 'Male' (489 of the 601 non-null values)
train_1['Gender'] = np.where(train_1['Gender'].isna(),'Male', train_1['Gender'])

In [31]:
# Note: this output was captured after the imputation in In [30], so it already sums to 614 rows
train_1['Self_Employed'].value_counts()

No     532
Yes     82
Name: Self_Employed, dtype: int64

In [13]:
# Fill missing Married with 'Yes' (taken here to be the mode; its value counts are not shown)
train_1['Married'] = np.where(train_1['Married'].isna(),'Yes', train_1['Married'])

In [29]:
train_1['Self_Employed'].mode()[0]

'No'

In [25]:
train_1['Dependents'] = np.where(train_1['Dependents'].isna(),train_1['Dependents'].mode()[0], train_1['Dependents'])

In [30]:
train_1['Self_Employed'] = np.where(train_1['Self_Employed'].isna(),train_1['Self_Employed'].mode()[0], train_1['Self_Employed'])


In [35]:
# Note: captured after the '3+' replacement in In [34], so the last category prints as 3
train_1['Dependents'].value_counts()

0    360
1    102
2    101
3     51
Name: Dependents, dtype: int64

In [34]:
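# Assign the result back instead of calling replace(..., inplace=True) on a column selection
train_1['Dependents'] = train_1['Dependents'].replace('3+', 3)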

In [39]:
train_1['LoanAmount']= np.where(train_1['LoanAmount'].isna(),train_1['LoanAmount'].median(), train_1['LoanAmount'])


In [41]:
train_1['Loan_Amount_Term']= np.where(train_1['Loan_Amount_Term'].isna(),train_1['Loan_Amount_Term'].median(), train_1['Loan_Amount_Term'])


In [44]:
train_1['Credit_History'].value_counts()

1.0    475
0.0     89
Name: Credit_History, dtype: int64

In [45]:
train_1['Credit_History'] = np.where(train_1['Credit_History'].isna(),train_1['Credit_History'].mode()[0], train_1['Credit_History'])


In [68]:
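# Note: run after the encoding cells further down (In [54] to In [67]),
# which is why the categorical columns already show int64
train_1.info()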

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             614 non-null    int64  
 2   Married            614 non-null    int64  
 3   Dependents         614 non-null    int64  
 4   Education          614 non-null    int64  
 5   Self_Employed      614 non-null    int64  
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         614 non-null    float64
 9   Loan_Amount_Term   614 non-null    float64
 10  Credit_History     614 non-null    float64
 11  Property_Area      614 non-null    int64  
 12  Loan_Status        614 non-null    int64  
dtypes: float64(4), int64(8), object(1)
memory usage: 62.5+ KB


In [65]:
# Note: captured after the label-encoding cells (In [54] to In [64]), hence the integer codes
train_1.head()

    Loan_ID  Gender  Married  Dependents  Education  Self_Employed  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History  Property_Area  Loan_Status
0  LP001002       1        0           0          0              0             5849                0.0       128.0             360.0             1.0              2            1
1  LP001003       1        1           1          0              0             4583             1508.0       128.0             360.0             1.0              0            0
2  LP001005       1        1           0          0              1             3000                0.0        66.0             360.0             1.0              2            1
3  LP001006       1        1           0          1              0             2583             2358.0       120.0             360.0             1.0              2            1
4  LP001008       1        0           0          0              0             6000                0.0       141.0             360.0             1.0              2            1


In [54]:
from sklearn.preprocessing import LabelEncoder

In [55]:
le = LabelEncoder()

In [57]:
train_1['Gender'] = le.fit_transform(train_1['Gender'])

In [59]:
train_1['Married'] = le.fit_transform(train_1['Married'])

In [60]:
train_1['Education'] = le.fit_transform(train_1['Education'])

In [62]:
train_1['Self_Employed'] = le.fit_transform(train_1['Self_Employed'])

In [63]:
train_1['Property_Area'] = le.fit_transform(train_1['Property_Area'])

In [64]:
train_1['Loan_Status'] = le.fit_transform(train_1['Loan_Status'])

In [67]:
train_1['Dependents'] = train_1['Dependents'].astype(int)
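
One thing to keep in mind with the cells above: a single `le` object is refit for every column, so after the last call it only remembers the classes of Loan_Status. The sketch below, on toy data, keeps one encoder per column so each integer code can be traced back to its label. The `encoders` and `mapping` names are assumptions for this example. `LabelEncoder` assigns codes in sorted order of the labels. The scikit-learn documentation describes it as meant for target labels, with `OrdinalEncoder` or `OneHotEncoder` as the usual choice for input features.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
})

encoders = {}  # hypothetical: one fitted encoder per column
for col in ["Gender", "Property_Area"]:
    enc = LabelEncoder()
    demo[col] = enc.fit_transform(demo[col])
    encoders[col] = enc

# classes_ is sorted, and the position in it is the integer code.
mapping = {
    col: {str(label): code for code, label in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
print(mapping)

# Codes can be turned back into labels with the matching encoder.
original_gender = encoders["Gender"].inverse_transform(demo["Gender"])
print(list(original_gender))
```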

In [71]:
# Columns 1 to 11: every feature except Loan_ID (column 0) and the target Loan_Status (column 12)
X = train_1.iloc[:, 1:12]

In [73]:
y = train_1['Loan_Status']

In [74]:
X.head()

   Gender  Married  Dependents  Education  Self_Employed  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History  Property_Area
0       1        0           0          0              0             5849                0.0       128.0             360.0             1.0              2
1       1        1           1          0              0             4583             1508.0       128.0             360.0             1.0              0
2       1        1           0          0              1             3000                0.0        66.0             360.0             1.0              2
3       1        1           0          1              0             2583             2358.0       120.0             360.0             1.0              2
4       1        0           0          0              0             6000                0.0       141.0             360.0             1.0              2


In [75]:
y.head()

0    1
1    0
2    1
3    1
4    1
Name: Loan_Status, dtype: int64

In [76]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size= 0.2, random_state= 123)

In [78]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(491, 11)
(123, 11)
(491,)
(123,)
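
The split above is a plain random split. When the classes are imbalanced, a common variant passes `stratify=y` so the train and test parts keep the same class proportions. The sketch below shows that option on made-up labels (70 ones, 30 zeros). It is an alternative to consider, not what this notebook runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 70% positive class.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 70 + [0] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

# Both parts keep the 70/30 class ratio of the full data.
print("train share of class 1:", y_tr.mean())
print("test share of class 1: ", y_te.mean())
```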


In [79]:
lr = LogisticRegression()

In [80]:
lr.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
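
The warning above means the solver hit its iteration limit before converging, which often happens when features sit on very different scales (incomes in the thousands next to 0/1 flags). Two remedies it suggests are scaling the data and raising `max_iter`. The sketch below combines both in a pipeline on synthetic data. The dataset, the `max_iter=1000` value and the variable names are assumptions for illustration, and refitting the notebook's model this way would change the coefficients and scores shown below.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with one column blown up to an income-like scale.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_demo[:, 0] *= 5000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

# The scaler is fit on the training part only, then applied before the model.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

acc = model.score(X_te, y_te)
n_iter = model.named_steps["logisticregression"].n_iter_[0]
print("test accuracy:", round(acc, 3))
print("solver iterations used:", n_iter)
```

Putting the scaler inside the pipeline means the test rows never influence the mean and standard deviation used for scaling.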

In [81]:
lr.coef_

array([[ 6.21691937e-02,  4.15831344e-01, -9.52819674e-02,
        -4.14422940e-01, -1.46406973e-01, -1.38294206e-05,
        -3.78146197e-05, -5.48936669e-04, -5.03460465e-03,
         2.92409605e+00,  5.94359903e-02]])

In [82]:
lr.intercept_

array([0.15631743])
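
Given the column order of X, the tenth coefficient above (about 2.92) belongs to Credit_History and is far larger than the rest, although coefficients on unscaled features are not directly comparable. To show how `coef_` and `intercept_` turn into a prediction, the sketch below recomputes the class-1 probability by hand as sigmoid(x · w + b) and compares it with `predict_proba`. It uses synthetic data, and all variable names are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

# Linear score: one weight per feature plus the intercept.
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]

# The sigmoid maps the score to a probability for class 1.
p_manual = 1.0 / (1.0 + np.exp(-z))
p_sklearn = clf.predict_proba(X_demo)[:, 1]

print("max difference:", np.abs(p_manual - p_sklearn).max())

# predict() is the same as thresholding that probability at 0.5.
manual_labels = (p_manual > 0.5).astype(int)
```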

In [83]:
preds_lr = lr.predict(X_test)

In [86]:
cm_lr = confusion_matrix(y_test, preds_lr)
classrep_lr = classification_report(y_test, preds_lr)
acc_lr = accuracy_score(y_test, preds_lr)

In [87]:
print(cm_lr)

[[17 25]
 [ 2 79]]


In [90]:
print(classrep_lr)

              precision    recall  f1-score   support

           0       0.89      0.40      0.56        42
           1       0.76      0.98      0.85        81

    accuracy                           0.78       123
   macro avg       0.83      0.69      0.71       123
weighted avg       0.81      0.78      0.75       123



In [89]:
print(acc_lr)

0.7804878048780488
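
Every number in the report above can be derived from the confusion matrix, where rows are the actual class and columns the predicted class. The sketch below redoes that arithmetic with NumPy on the matrix printed earlier. The `baseline` line, the accuracy of always predicting the majority class, is an added comparison rather than something the notebook computes.

```python
import numpy as np

cm = np.array([[17, 25],
               [2, 79]])  # rows = actual class, columns = predicted class

accuracy = np.trace(cm) / cm.sum()           # correct predictions / all rows
precision = np.diag(cm) / cm.sum(axis=0)     # per class: correct / predicted as class
recall = np.diag(cm) / cm.sum(axis=1)        # per class: correct / actually in class
f1 = 2 * precision * recall / (precision + recall)
baseline = cm.sum(axis=1).max() / cm.sum()   # always predict the majority class

print("accuracy :", round(accuracy, 4))
print("precision:", precision.round(2))
print("recall   :", recall.round(2))
print("f1       :", f1.round(2))
print("baseline :", round(baseline, 4))
```

The low recall for class 0 (17 of 42) shows that most of the errors are class-0 cases predicted as class 1, which accuracy alone does not reveal.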
