<h1> Goal </h1>

<p> The goal of this notebook is to build an end-to-end machine learning workflow and push the model's accuracy score as high as possible. The steps are: </p>
    <ol> 
    <li> Summary: check missing values, distributions, types, and dimensions </li>
    <li> Preprocessing: treat missing values, normalise the distribution, encode the data </li>
    <li> Model selection: decide which algorithm to use </li>
    <li> Model evaluation: summarise the accuracy scores </li>
    <li> Hyperparameter tuning: tune the model parameters to improve accuracy </li>
    <li> Finally, create the data output from the machine learning model </li>
    </ol> 

<p> The data used is the Analytics Vidhya loan classification dataset </p>

______

In [1]:
import pandas as pd
import os
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


path = os.getcwd()
df_test = pd.read_csv(path + "/Data/test - loan prediction.csv")
df_train = pd.read_csv(path + "/Data/train - loan prediction.csv")

df_train.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


<h2> Data Summary </h2>

In [2]:
#summarise the number of missing values per column
df_missing_value = pd.DataFrame(df_train.isna().sum())
df_missing_value.reset_index(level = 0, inplace = True)
df_missing_value.columns = ['Column Name', 'Total Missing Value']

df_missing_value.sort_values(by = 'Total Missing Value', ascending = False)

Unnamed: 0,Column Name,Total Missing Value
10,Credit_History,50
5,Self_Employed,32
8,LoanAmount,22
3,Dependents,15
9,Loan_Amount_Term,14
1,Gender,13
2,Married,3
0,Loan_ID,0
4,Education,0
6,ApplicantIncome,0
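
<p> The same summary can also be expressed as a share of rows, which makes it easier to judge how much data a drop would cost. The sketch below is only an illustration: the frame named demo is a made-up stand-in for df_train, and the 'Missing (%)' column is an assumption, not part of the original notebook. </p>

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df_train; column names mirror the loan data, values do not.
demo = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, np.nan, 141.0],
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
})

# Count NaN per column and turn the result into a two-column frame.
missing = (
    demo.isna().sum()
    .rename("Total Missing Value")
    .rename_axis("Column Name")
    .reset_index()
)

# Hypothetical extra column: missing values as a percentage of all rows.
missing["Missing (%)"] = missing["Total Missing Value"] / len(demo) * 100
missing = missing.sort_values("Total Missing Value", ascending=False, ignore_index=True)
print(missing)
```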


In [3]:
#check all classes per column
print(df_train['Property_Area'].unique())
print(df_train['Dependents'].unique())
print(df_train['Self_Employed'].unique())
print(df_train['Education'].unique())

['Urban' 'Rural' 'Semiurban']
['0' '1' '2' '3+' nan]
['No' 'Yes' nan]
['Graduate' 'Not Graduate']


<h3> Data with Drop NA </h3>

In [4]:
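#treat missing values by dropping every row that contains a NaN
#.copy() gives an independent frame, so later column assignments do not raise SettingWithCopyWarning
df_drop = df_train.dropna().copy()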
df_drop.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
5,LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Urban,Y


In [5]:
#test dataframes
test_drop = df_test.dropna()
test_drop = test_drop.drop(['Loan_ID'], axis = 1)
test_drop.head()

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,Male,Yes,0,Graduate,No,5720,0,110.0,360.0,1.0,Urban
1,Male,Yes,1,Graduate,No,3076,1500,126.0,360.0,1.0,Urban
2,Male,Yes,2,Graduate,No,5000,1800,208.0,360.0,1.0,Urban
4,Male,No,0,Not Graduate,No,3276,0,78.0,360.0,1.0,Urban
5,Male,Yes,0,Not Graduate,Yes,2165,3422,152.0,360.0,1.0,Urban


In [6]:
#encode the data
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

#gender
le.fit(df_drop['Gender'])
df_drop['Gender'] = le.transform(df_drop['Gender'])
test_drop['Gender'] = le.transform(test_drop['Gender'])

#married
le.fit(df_drop['Married'])
df_drop['Married'] = le.transform(df_drop['Married'])
test_drop['Married'] = le.transform(test_drop['Married'])

#education
le.fit(df_drop['Education'])
df_drop['Education'] = le.transform(df_drop['Education'])
test_drop['Education'] = le.transform(test_drop['Education'])

#self employed
le.fit(df_drop['Self_Employed'])
df_drop['Self_Employed'] = le.transform(df_drop['Self_Employed'])
test_drop['Self_Employed'] = le.transform(test_drop['Self_Employed'])


#property area
le.fit(df_drop['Property_Area'])
df_drop['Property_Area'] = le.transform(df_drop['Property_Area'])
test_drop['Property_Area'] = le.transform(test_drop['Property_Area'])

#loan status
le.fit(df_drop['Loan_Status'])
df_drop['Loan_Status'] = le.transform(df_drop['Loan_Status'])

#Dependents
le.fit(df_drop['Dependents'])
df_drop['Dependents'] = le.transform(df_drop['Dependents'])
test_drop['Dependents'] = le.transform(test_drop['Dependents'])

df_drop.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,LP001003,1,1,1,0,0,4583,1508.0,128.0,360.0,1.0,0,0
2,LP001005,1,1,0,0,1,3000,0.0,66.0,360.0,1.0,2,1
3,LP001006,1,1,0,1,0,2583,2358.0,120.0,360.0,1.0,2,1
4,LP001008,1,0,0,0,0,6000,0.0,141.0,360.0,1.0,2,1
5,LP001011,1,1,2,0,1,5417,4196.0,267.0,360.0,1.0,2,1
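
<p> The cell above refits a single LabelEncoder for every column, so only the last fitted mapping survives. A minimal sketch of an alternative, assuming the mappings are wanted later (for example to decode predictions), keeps one encoder per column in a dictionary. The frames train_demo and test_demo and the dictionary named encoders are hypothetical stand-ins, not objects from the notebook. </p>

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for df_drop and test_drop; values are illustrative only.
train_demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "ApplicantIncome": [4583, 3000, 2583],
})
test_demo = pd.DataFrame({
    "Gender": ["Female", "Male"],
    "Property_Area": ["Semiurban", "Urban"],
    "ApplicantIncome": [5720, 3076],
})

# Fit each encoder on the training column only, then reuse it on the test column.
encoders = {}
for col in ["Gender", "Property_Area"]:
    enc = LabelEncoder().fit(train_demo[col])
    train_demo[col] = enc.transform(train_demo[col])
    test_demo[col] = enc.transform(test_demo[col])
    encoders[col] = enc

# classes_ lists the original labels in the order of their integer codes.
print(list(encoders["Property_Area"].classes_))
```

<p> LabelEncoder assigns codes in sorted label order, which is why 'Rural' maps to 0 and 'Urban' to 2 in the table above. </p>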


In [7]:
#split data
from sklearn.model_selection import train_test_split

x = df_drop.drop(['Loan_ID', 'Loan_Status'], axis = 1)
y = df_drop['Loan_Status']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)
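
<p> Because approved loans outnumber rejected ones, a plain random split can shift the class ratio between the two parts. A hedged sketch of a stratified split follows, on synthetic data; the stratify argument is a real train_test_split parameter, but its use here is a suggestion and not what the notebook does. </p>

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows with a 70/30 class ratio.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 70 + [0] * 30)

# stratify=y_demo keeps the class ratio the same in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test:", y_te.mean())
```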

In [8]:
#model creation
from sklearn import svm
from sklearn.svm import SVC, LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [9]:
#svm
clf = svm.SVC()
clf.fit(x_train, y_train)
pred_clf = clf.predict(x_test)



In [10]:
#random forest
rfc = RandomForestClassifier(n_estimators = 200)
rfc.fit(x_train, y_train)
pred_rfc = rfc.predict(x_test)

In [11]:
#neural networks
mlpc = MLPClassifier(hidden_layer_sizes = (6, 6, 6), max_iter = 500)
mlpc.fit(x_train, y_train)
pred_mlpc = mlpc.predict(x_test)

In [12]:
#Logistic Regression
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
pred_logreg = logreg.predict(x_test)



In [13]:
#Perceptron
prec = Perceptron()
prec.fit(x_train, y_train)
pred_prec = prec.predict(x_test)

In [14]:
#SGDClassifier
sgdc = SGDClassifier()
sgdc.fit(x_train, y_train)
pred_sgdc = sgdc.predict(x_test)

In [15]:
#KNN
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
pred_knn = knn.predict(x_test)

In [16]:
#Naive Bayes
gauss = GaussianNB()
gauss.fit(x_train, y_train)
pred_gauss = gauss.predict(x_test)

In [17]:
#Decision tree
dstree = DecisionTreeClassifier()
dstree.fit(x_train, y_train)
pred_dstree = dstree.predict(x_test)

In [18]:
#model evaluation

#accuracy score
rfc_score = accuracy_score(y_test, pred_rfc)
mlpc_score = accuracy_score(y_test, pred_mlpc)
logreg_score = accuracy_score(y_test, pred_logreg)
prec_score = accuracy_score(y_test, pred_prec)
sgdc_score = accuracy_score(y_test, pred_sgdc)
knn_score = accuracy_score(y_test, pred_knn)
gauss_score = accuracy_score(y_test, pred_gauss)
dstree_score = accuracy_score(y_test, pred_dstree)

modelResult = pd.DataFrame({
    'Model': ['Random Forest', 'Neural Networks', 'Logistic Regression', 
             'Perceptron', 'SGDC', 'KNN', 'Naive Bayes', 'Decision Tree'],
    'Score': [rfc_score, mlpc_score, logreg_score, prec_score, sgdc_score, knn_score, gauss_score, dstree_score]
    
})

modelResult.sort_values(by = 'Score', ascending = False)

Unnamed: 0,Model,Score
2,Logistic Regression,0.822917
6,Naive Bayes,0.822917
0,Random Forest,0.802083
7,Decision Tree,0.75
4,SGDC,0.6875
5,KNN,0.65625
3,Perceptron,0.645833
1,Neural Networks,0.479167
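
<p> The fit, predict, and score steps above repeat the same three lines per model. A compact sketch of the same idea, assuming a dictionary of estimators and synthetic data from make_classification (neither appears in the notebook), could look like this: </p>

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded loan data.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Hypothetical registry: display name -> unfitted estimator.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Fit each model on the training part and score it on the held-out part.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

result = pd.DataFrame({"Model": list(scores), "Score": list(scores.values())})
result = result.sort_values(by="Score", ascending=False)
print(result)
```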


In [19]:
#confusion matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, pred_logreg, labels = [1, 0])

array([[68,  0],
       [17, 11]])

<p> The confusion matrix above shows that the model correctly predicted all 68 applicants with loan status 'Y', while 17 of the 28 applicants with loan status 'N' were wrongly predicted as 'Y' </p>
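
<p> The counts in that matrix can be turned into per-class rates. The sketch below only redoes the arithmetic on the reported numbers; the variable names are illustrative, and the layout follows scikit-learn's convention of true classes in rows and predicted classes in columns, ordered by labels = [1, 0]. </p>

```python
import numpy as np

# Logistic regression matrix reported above, labels=[1, 0].
cm = np.array([[68, 0],
               [17, 11]])

tp, fn = cm[0]  # true 'Y': predicted 'Y', predicted 'N'
fp, tn = cm[1]  # true 'N': predicted 'Y', predicted 'N'

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
recall_y = tp / (tp + fn)         # share of true 'Y' that were found
precision_y = tp / (tp + fp)      # share of predicted 'Y' that are truly 'Y'
recall_n = tn / (tn + fp)         # share of true 'N' that were found

print(f"accuracy={accuracy:.4f} recall_Y={recall_y:.4f} "
      f"precision_Y={precision_y:.4f} recall_N={recall_n:.4f}")
```

<p> The accuracy works out to 79 / 96, which matches the 0.822917 in the score table, while only 11 of the 28 'N' applicants are recognised. </p>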

<h2> Normalised Data Distribution </h2>
<p> This section tests the models again with a normalised data distribution </p>

In [20]:
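#normaliser
from sklearn.preprocessing import MinMaxScaler

#fit the scaler on the training split only, then apply it to both splits
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(x_train)
x_train_scale = scaler.transform(x_train)
x_test_scale = scaler.transform(x_test)
x_test_scale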

array([[1.        , 1.        , 0.66666667, ..., 0.72972973, 1.        ,
        1.        ],
       [1.        , 0.        , 0.        , ..., 0.72972973, 1.        ,
        0.5       ],
       [1.        , 1.        , 0.66666667, ..., 0.72972973, 1.        ,
        0.        ],
       ...,
       [1.        , 1.        , 1.        , ..., 0.72972973, 1.        ,
        0.        ],
       [1.        , 1.        , 0.        , ..., 0.72972973, 1.        ,
        0.5       ],
       [1.        , 0.        , 0.66666667, ..., 0.72972973, 1.        ,
        0.5       ]])
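
<p> Scaling only influences a score when the model is fitted and evaluated on the scaled arrays. One way to make that hard to forget is to chain the scaler and the classifier. The following is a minimal sketch on synthetic data, assuming scikit-learn's make_pipeline; the names scaled_logreg and scaled_score are hypothetical. </p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the encoded loan data.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training data and reuses it at predict time.
scaled_logreg = make_pipeline(
    MinMaxScaler(feature_range=(0, 1)),
    LogisticRegression(max_iter=1000),
)
scaled_logreg.fit(X_tr, y_tr)
scaled_score = scaled_logreg.score(X_te, y_te)

# The fitted scaler is reachable by its auto-generated step name.
X_tr_scaled = scaled_logreg.named_steps["minmaxscaler"].transform(X_tr)
print(scaled_score, X_tr_scaled.min(), X_tr_scaled.max())
```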

In [21]:
#svm
clf = svm.SVC(gamma = 'auto')
clf.fit(x_train, y_train)
pred_clf = clf.predict(x_test)

In [22]:
#random forest
rfc = RandomForestClassifier(n_estimators = 200)
rfc.fit(x_train, y_train)
pred_rfc = rfc.predict(x_test)

In [23]:
#neural networks
mlpc = MLPClassifier(hidden_layer_sizes = (6, 6, 6), max_iter = 500)
mlpc.fit(x_train, y_train)
pred_mlpc = mlpc.predict(x_test)

In [24]:
#Logistic Regression
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
pred_logreg = logreg.predict(x_test)



In [25]:
#Perceptron
prec = Perceptron()
prec.fit(x_train, y_train)
pred_prec = prec.predict(x_test)

In [26]:
#SGDClassifier
sgdc = SGDClassifier()
sgdc.fit(x_train, y_train)
pred_sgdc = sgdc.predict(x_test)

In [27]:
#KNN
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
pred_knn = knn.predict(x_test)

In [28]:
#Naive Bayes
gauss = GaussianNB()
gauss.fit(x_train, y_train)
pred_gauss = gauss.predict(x_test)

In [29]:
#Decision tree
dstree = DecisionTreeClassifier()
dstree.fit(x_train, y_train)
pred_dstree = dstree.predict(x_test)

In [30]:
#model evaluation

#accuracy score
rfc_score_normalised = accuracy_score(y_test, pred_rfc)
mlpc_score_normalised = accuracy_score(y_test, pred_mlpc)
logreg_score_normalised = accuracy_score(y_test, pred_logreg)
prec_score_normalised = accuracy_score(y_test, pred_prec)
sgdc_score_normalised = accuracy_score(y_test, pred_sgdc)
knn_score_normalised = accuracy_score(y_test, pred_knn)
gauss_score_normalised = accuracy_score(y_test, pred_gauss)
dstree_score_normalised = accuracy_score(y_test, pred_dstree)

modelResult = pd.DataFrame({
    'Model': ['Random Forest', 'Neural Networks', 'Logistic Regression', 
             'Perceptron', 'SGDC', 'KNN', 'Naive Bayes', 'Decision Tree'],
    'Base (%)': [rfc_score, mlpc_score, logreg_score, prec_score, sgdc_score, knn_score, gauss_score, dstree_score],
    'Normalised (%)': [rfc_score_normalised, mlpc_score_normalised, logreg_score_normalised, prec_score_normalised, sgdc_score_normalised, knn_score_normalised, gauss_score_normalised, dstree_score_normalised]
})


#convert the scores to percentages; the difference is in percentage points
modelResult['Base (%)'] = modelResult['Base (%)'] * 100
modelResult['Normalised (%)'] = modelResult['Normalised (%)'] * 100
modelResult['Optimisation (%)'] = modelResult['Normalised (%)'] - modelResult['Base (%)']

modelResult.sort_values(by = 'Normalised (%)', ascending = False)

Unnamed: 0,Model,Base (%),Normalised (%),Optimisation (%)
0,Random Forest,80.208333,82.291667,2.083333
2,Logistic Regression,82.291667,82.291667,0.0
6,Naive Bayes,82.291667,82.291667,0.0
7,Decision Tree,75.0,72.916667,-2.083333
5,KNN,65.625,65.625,0.0
3,Perceptron,64.583333,64.583333,0.0
1,Neural Networks,47.916667,59.375,11.458333
4,SGDC,68.75,30.208333,-38.541667


In [31]:
print("Random Forest\n\n{}\n\nLogistic Regression\n\n{}\n\nNaive Bayes\n\n{}"
      .format(
          confusion_matrix(y_test, pred_rfc, labels = [1, 0]), 
          confusion_matrix(y_test, pred_logreg, labels = [1, 0]),
          confusion_matrix(y_test, pred_gauss, labels = [1, 0])))

Random Forest

[[65  3]
 [14 14]]

Logistic Regression

[[68  0]
 [17 11]]

Naive Bayes

[[67  1]
 [16 12]]
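
<p> The three matrices give the same accuracy but different error patterns. As a small worked comparison on the reported counts only (the dictionary and the helper names are illustrative), the share of 'N' applicants each model recognises can be computed like this: </p>

```python
# Confusion matrices reported above, labels=[1, 0]: rows true, columns predicted.
matrices = {
    "Random Forest": [[65, 3], [14, 14]],
    "Logistic Regression": [[68, 0], [17, 11]],
    "Naive Bayes": [[67, 1], [16, 12]],
}

summary = {}
for name, ((tp, fn), (fp, tn)) in matrices.items():
    total = tp + fn + fp + tn
    summary[name] = {
        "accuracy": (tp + tn) / total,  # correct predictions over all rows
        "recall_N": tn / (tn + fp),     # rejected loans correctly identified
    }

for name, stats in summary.items():
    print(f"{name}: accuracy={stats['accuracy']:.4f}, recall_N={stats['recall_N']:.4f}")
```

<p> All three models classify 79 of 96 rows correctly, but Random Forest recognises 14 of the 28 'N' applicants, against 11 for Logistic Regression and 12 for Naive Bayes. </p>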


<h2> Hyperparameter Tuning </h2>
<p> This step searches for optimal parameters for the top three algorithms: Random Forest, Logistic Regression, and Naive Bayes </p>

<p> The data also needs to be resampled to counter overfitting </p>
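
<p> One possible shape for this step is a grid search scored by k-fold cross-validation, reading "resampling" as repeated train/validation splits. This is a sketch under that assumption, not the notebook's own method: the parameter grid, the five folds, and the synthetic data are all illustrative choices. </p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the encoded loan data.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

# Hypothetical, deliberately small grid to keep the search fast.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# Stratified folds keep the class ratio stable across the resampled splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="accuracy",
    cv=cv,
)
search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated accuracy of the best parameter set.
print(search.best_params_, search.best_score_)
```

<p> The same pattern applies to Logistic Regression and Naive Bayes by swapping the estimator and the grid. </p>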

In [None]:
#