## Benchmark Top ML Algorithms


### Dataset
Predict Loan Eligibility for Dream Housing Finance company



### ML algorithms
- Decision Tree
- KNN
- Logistic Regression
- SVM
- Random Forest

### Using GridSearchCV for finding the best model with the best hyperparameters

- Build models
- Create a parameter grid
- Run GridSearchCV
- Choose the best model with the best hyperparameters
- Report the best accuracy
- Also benchmark the best accuracy achievable for each classification algorithm listed above
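
The steps listed above can be sketched end to end as follows. This is only a minimal illustration on synthetic data from `make_classification`, not the loan dataset; the names `X_demo`, `y_demo`, `search` and the small grid are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: illustrative only, not the loan dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Steps 1 and 2: build the model and create the parameter grid.
model = DecisionTreeClassifier(random_state=0)
param_grid = {"max_depth": [1, 2, 3], "min_samples_split": [2, 4]}

# Step 3: run the search with 5-fold cross-validation on the training split only.
search = GridSearchCV(model, param_grid, scoring="accuracy", cv=5)
search.fit(X_tr, y_tr)

# Step 4: best hyperparameters and their mean cross-validated accuracy.
print(search.best_params_, search.best_score_)

# Step 5: accuracy of the refitted best model on the held-out split.
test_accuracy = search.score(X_te, y_te)
print(test_accuracy)
```

`best_score_` is a cross-validation average over the training data, while `search.score` evaluates the refitted model on data the search never saw, so the two numbers generally differ.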

## Importing Data

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as ss
from collections import Counter
import math
from scipy import stats
url= "https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv"
df = pd.read_csv(url)
df_transform = pd.DataFrame(data=df)
df_transform

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y


## Finding The Completeness of The Dataset

In [4]:
for i in range(len(df.columns)):
    missing_data = df[df.columns[i]].isna().sum()
    perc = missing_data / len(df) * 100
    print(f'Feature {i+1} >> Missing entries: {missing_data}  |  Percentage: {round(perc, 2)}')


Feature 1 >> Missing entries: 0  |  Percentage: 0.0
Feature 2 >> Missing entries: 13  |  Percentage: 2.12
Feature 3 >> Missing entries: 3  |  Percentage: 0.49
Feature 4 >> Missing entries: 15  |  Percentage: 2.44
Feature 5 >> Missing entries: 0  |  Percentage: 0.0
Feature 6 >> Missing entries: 32  |  Percentage: 5.21
Feature 7 >> Missing entries: 0  |  Percentage: 0.0
Feature 8 >> Missing entries: 0  |  Percentage: 0.0
Feature 9 >> Missing entries: 22  |  Percentage: 3.58
Feature 10 >> Missing entries: 14  |  Percentage: 2.28
Feature 11 >> Missing entries: 50  |  Percentage: 8.14
Feature 12 >> Missing entries: 0  |  Percentage: 0.0
Feature 13 >> Missing entries: 0  |  Percentage: 0.0
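
The same completeness check can be written without an explicit loop, which also labels each row with the column name instead of a feature number. A minimal sketch on a toy frame (the frame `toy` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries (assumption: illustrative values only).
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, np.nan, 66.0],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

# isna().sum() counts missing entries per column;
# isna().mean() is the missing fraction, scaled here to a percentage.
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": (toy.isna().mean() * 100).round(2),
})
print(summary)
```

Applied to `df`, the same two expressions should reproduce the counts and percentages printed above, indexed by column name.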


## Cleaning Dataset

In [5]:
cleaned_data = df_transform.dropna(axis = 0, how ='any') 
cleaned_data

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
5,LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y


In [6]:
cleaned_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 480 entries, 1 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            480 non-null    object 
 1   Gender             480 non-null    object 
 2   Married            480 non-null    object 
 3   Dependents         480 non-null    object 
 4   Education          480 non-null    object 
 5   Self_Employed      480 non-null    object 
 6   ApplicantIncome    480 non-null    int64  
 7   CoapplicantIncome  480 non-null    float64
 8   LoanAmount         480 non-null    float64
 9   Loan_Amount_Term   480 non-null    float64
 10  Credit_History     480 non-null    float64
 11  Property_Area      480 non-null    object 
 12  Loan_Status        480 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 52.5+ KB


In [7]:
for i in range(len(cleaned_data.columns)):
    missing_data = cleaned_data[cleaned_data.columns[i]].isna().sum()
    # Divide by the length of the cleaned frame, not the original df
    perc = missing_data / len(cleaned_data) * 100
    print(f'Feature {i+1} >> Missing entries: {missing_data}  |  Percentage: {round(perc, 2)}')

Feature 1 >> Missing entries: 0  |  Percentage: 0.0
Feature 2 >> Missing entries: 0  |  Percentage: 0.0
Feature 3 >> Missing entries: 0  |  Percentage: 0.0
Feature 4 >> Missing entries: 0  |  Percentage: 0.0
Feature 5 >> Missing entries: 0  |  Percentage: 0.0
Feature 6 >> Missing entries: 0  |  Percentage: 0.0
Feature 7 >> Missing entries: 0  |  Percentage: 0.0
Feature 8 >> Missing entries: 0  |  Percentage: 0.0
Feature 9 >> Missing entries: 0  |  Percentage: 0.0
Feature 10 >> Missing entries: 0  |  Percentage: 0.0
Feature 11 >> Missing entries: 0  |  Percentage: 0.0
Feature 12 >> Missing entries: 0  |  Percentage: 0.0
Feature 13 >> Missing entries: 0  |  Percentage: 0.0


## Using Label Encoder to Convert Dataset to an Acceptable Form

In [8]:
from sklearn import preprocessing

# Work on an explicit copy so that adding columns does not trigger
# pandas' SettingWithCopyWarning on the result of dropna().
cleaned_data = cleaned_data.copy()
label_encoder = preprocessing.LabelEncoder()

# Encode each categorical column as integer codes in a new "<name>_Clean" column.
cleaned_data['Gender_Clean'] = label_encoder.fit_transform(cleaned_data['Gender'])
cleaned_data['Married_Clean'] = label_encoder.fit_transform(cleaned_data['Married'])
cleaned_data['Dependents_Clean'] = label_encoder.fit_transform(cleaned_data['Dependents'])
cleaned_data['Education_Clean'] = label_encoder.fit_transform(cleaned_data['Education'])
cleaned_data['Self_Employed_Clean'] = label_encoder.fit_transform(cleaned_data['Self_Employed'])
cleaned_data['Property_Area_Clean'] = label_encoder.fit_transform(cleaned_data['Property_Area'])
cleaned_data['Loan_Status_Clean'] = label_encoder.fit_transform(cleaned_data['Loan_Status'])

In [9]:
cleaned_data

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status,Gender_Clean,Married_Clean,Dependents_Clean,Education_Clean,Self_Employed_Clean,Property_Area_Clean,Loan_Status_Clean
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N,1,1,1,0,0,0,0
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y,1,1,0,0,1,2,1
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y,1,1,0,1,0,2,1
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y,1,0,0,0,0,2,1
5,LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Urban,Y,1,1,2,0,1,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y,0,0,0,0,0,0,1
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y,1,1,3,0,0,0,1
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y,1,1,1,0,0,2,1
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y,1,1,2,0,0,2,1
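
To see where codes such as `Urban -> 2` in the table above come from: `LabelEncoder` sorts the distinct labels and numbers them from 0. A small sketch, assuming a third category `Semiurban` exists (it does not appear in the rows displayed here, so treat it as an assumption):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of Property_Area values; "Semiurban" is an assumed category.
areas = ["Rural", "Urban", "Urban", "Semiurban"]

encoder = LabelEncoder()
codes = encoder.fit_transform(areas)

# classes_ holds the sorted labels; a label's position is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Because a single encoder object is refitted for every column in the cell above, only the mapping of the last column survives. Keeping one encoder per column would be needed if the codes must be decoded later.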


In [10]:
data=cleaned_data.drop(['Gender','Married', 'Loan_ID','Dependents','Education','Self_Employed','Property_Area','Loan_Status'], axis=1)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 480 entries, 1 to 613
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ApplicantIncome      480 non-null    int64  
 1   CoapplicantIncome    480 non-null    float64
 2   LoanAmount           480 non-null    float64
 3   Loan_Amount_Term     480 non-null    float64
 4   Credit_History       480 non-null    float64
 5   Gender_Clean         480 non-null    int32  
 6   Married_Clean        480 non-null    int32  
 7   Dependents_Clean     480 non-null    int32  
 8   Education_Clean      480 non-null    int32  
 9   Self_Employed_Clean  480 non-null    int32  
 10  Property_Area_Clean  480 non-null    int32  
 11  Loan_Status_Clean    480 non-null    int32  
dtypes: float64(4), int32(7), int64(1)
memory usage: 35.6 KB


## Train Test Split

In [11]:
from sklearn.model_selection import train_test_split
labels = np.array(data.pop('Loan_Status_Clean'))
X, X_test, y, y_test = train_test_split(data, labels, stratify = labels, test_size = 0.1, random_state = 43)

In [12]:
X.shape,y.shape,X_test.shape, y_test.shape

((432, 11), (432,), (48, 11), (48,))
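
The `stratify = labels` argument keeps the class proportions the same in both parts of the split. A minimal sketch with made-up labels (70 positives, 30 negatives; the names `labels_demo` and `features_demo` are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with a 70/30 class imbalance (assumption: illustrative only).
labels_demo = np.array([1] * 70 + [0] * 30)
features_demo = np.arange(100).reshape(-1, 1)

X_a, X_b, y_a, y_b = train_test_split(
    features_demo, labels_demo,
    stratify=labels_demo, test_size=0.1, random_state=43,
)

# Both parts keep the 70% share of the positive class.
print(y_a.mean(), y_b.mean())
```

With only 48 test rows in the notebook, stratification at least ensures that the test set is not dominated by one class by chance.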

## DecisionTreeClassifier 

In [13]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

# min_samples_split must be at least 2; a value of 1 makes every such fit fail and score nan.
tree_par = {'criterion': ['gini', 'entropy'],
            'max_depth': range(1, 10),
            'min_samples_leaf': range(1, 5),
            'min_samples_split': range(2, 10)
           }
tree = DecisionTreeClassifier()
grid_tree = GridSearchCV(tree, param_grid=tree_par, scoring='accuracy', cv=10, n_jobs=-1)
grid_tree.fit(X, y)

DT_acc = grid_tree.best_score_   # mean 10-fold cross-validated accuracy of the best candidate
DT_par = grid_tree.best_params_
DT_par

{'criterion': 'gini',
 'max_depth': 1,
 'min_samples_leaf': 1,
 'min_samples_split': 2}

In [14]:

tree = DecisionTreeClassifier(criterion= 'gini',
 max_depth= 1,
 min_samples_leaf= 1,
 min_samples_split= 2)
tree.fit(X, y)
y_pred=tree.predict(X_test)
DT_accuracy= accuracy_score(y_test, y_pred)
DT_accuracy

0.75
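
Retyping the best parameters by hand, as in the cell above, works but is easy to get wrong. With the default `refit=True`, `GridSearchCV` already keeps a model refitted on the full training data in `best_estimator_`. A hedged sketch on synthetic data (all names here are illustrative, not from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the loan dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=1
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"criterion": ["gini", "entropy"], "max_depth": range(1, 4)},
    scoring="accuracy",
    cv=5,
)
search.fit(X_tr, y_tr)

# best_estimator_ is already refitted on all of X_tr with best_params_.
best_tree = search.best_estimator_
test_acc = accuracy_score(y_te, best_tree.predict(X_te))
print(search.best_params_, test_acc)
```

In the notebook this would amount to `accuracy_score(y_test, grid_tree.best_estimator_.predict(X_test))`.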

## KNeighborsClassifier

In [15]:
from sklearn.neighbors import KNeighborsClassifier

In [16]:
neigh = KNeighborsClassifier()

In [17]:
knn_par={'n_neighbors': range(1,20), 'weights': ['uniform','distance'],'metric': ['euclidean','manhattan']}
grid_knn=GridSearchCV(neigh, knn_par,scoring='accuracy', cv=10,n_jobs=-1)
grid_knn.fit(X,y)

GridSearchCV(cv=10, estimator=KNeighborsClassifier(), n_jobs=-1,
             param_grid={'metric': ['euclidean', 'manhattan'],
                         'n_neighbors': range(1, 20),
                         'weights': ['uniform', 'distance']},
             scoring='accuracy')

In [18]:
knn_acc=grid_knn.best_score_
knn_par=grid_knn.best_params_
knn_par

{'metric': 'manhattan', 'n_neighbors': 14, 'weights': 'uniform'}

In [19]:
neigh = KNeighborsClassifier(metric= 'manhattan', n_neighbors= 14, weights= 'uniform')
neigh.fit(X,y)
y_pred=neigh.predict(X_test)
knn_accuracy= accuracy_score(y_test, y_pred)
knn_accuracy

0.6666666666666666
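
KNN relies on distances, so features on large scales (such as incomes) can dominate binary columns. One possible variation, not what the notebook does, is to standardize inside a `Pipeline` so that scaling is learned on each training fold only. A sketch on synthetic data, with one feature deliberately inflated:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; the first feature is blown up to mimic an income-like column
# (assumption: illustrative only).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=2)
X_demo[:, 0] *= 1000.0

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Parameters of a pipeline step are addressed as "<step>__<parameter>".
param_grid = {
    "knn__n_neighbors": [3, 7, 11],
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}
search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

Whether scaling would improve the KNN result on the loan data is not something this sketch establishes; it only shows how the search could be set up.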

## LogisticRegression

In [20]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
# The default 'lbfgs' solver supports only the 'l2' penalty; with 'l1' in the grid,
# every such fit fails and scores nan. Searching 'l1' needs solver='liblinear' or 'saga'.
lr_par = {"C": np.logspace(-3, 3, 7), "penalty": ["l2"]}
grid_LR = GridSearchCV(lr, lr_par, scoring='accuracy', cv=10, n_jobs=-1)
grid_LR.fit(X, y)

GridSearchCV(cv=10, estimator=LogisticRegression(), n_jobs=-1,
             param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]),
                         'penalty': ['l2']},
             scoring='accuracy')

In [21]:
LR_acc=grid_LR.best_score_
LR_par=grid_LR.best_params_
LR_par

{'C': 10.0, 'penalty': 'l2'}

In [22]:
lr =LogisticRegression(C= 10.0, penalty= 'l2')
lr.fit(X,y)
y_pred=lr.predict(X_test)
lr_accuracy= accuracy_score(y_test, y_pred)
lr_accuracy

0.75
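
If both `l1` and `l2` penalties are to be compared, the solver has to support both. A minimal sketch using `solver="liblinear"` on synthetic data (the data and names are assumptions for illustration; results on the loan data may differ from the ones reported above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the loan dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=3)

# liblinear handles both the l1 and the l2 penalty.
lr_demo = LogisticRegression(solver="liblinear")
param_grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l1", "l2"]}

search = GridSearchCV(lr_demo, param_grid, scoring="accuracy", cv=5)
search.fit(X_demo, y_demo)

# Every one of the 14 candidates gets a finite score.
scores = search.cv_results_["mean_test_score"]
print(len(scores), np.isnan(scores).any())
```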

## Support Vector Machine

In [23]:
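from sklearn import svm
svc = svm.SVC()
svc_par = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
# No scoring or cv is given here, so GridSearchCV uses the estimator's default score (accuracy)
# and its default CV (5-fold in scikit-learn 0.22 and later), unlike the 10-fold searches elsewhere.
grid_svc = GridSearchCV(svc, svc_par)

grid_svc.fit(X, y)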

GridSearchCV(estimator=SVC(),
             param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf')})

In [24]:
svm_acc=grid_svc.best_score_
svm_par=grid_svc.best_params_
svm_par

{'C': 10, 'kernel': 'linear'}

In [25]:
svc = svm.SVC(C= 10, kernel= 'linear')
svc.fit(X,y)
y_pred=svc.predict(X_test)
svn_accuracy= accuracy_score(y_test, y_pred)
svn_accuracy

0.7708333333333334

## Random Forest Classification

In [26]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
# Notes on the grid:
# - np.linspace(..., num=1) returns only the start value, so max_depth is effectively fixed at 3.
# - max_features='auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3.
rf_dis = {'n_estimators': [int(x) for x in np.linspace(start=10, stop=200, num=10)],
          'max_depth': [int(x) for x in np.linspace(start=3, stop=20, num=1)],
          'max_features': ['auto', 'sqrt', None] + list(np.arange(0.5, 1, 0.1)),
          'max_leaf_nodes': [int(x) for x in np.linspace(start=10, stop=50, num=5)],
          'min_samples_split': [2, 5, 10],
          'bootstrap': [True, False]
         }
grid_rf = GridSearchCV(rf, rf_dis, scoring='accuracy', cv=10, n_jobs=-1)

grid_rf.fit(X, y)


GridSearchCV(cv=10, estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'bootstrap': [True, False], 'max_depth': [3],
                         'max_features': ['auto', 'sqrt', None, 0.5, 0.6, 0.7,
                                          0.7999999999999999,
                                          0.8999999999999999],
                         'max_leaf_nodes': [10, 20, 30, 40, 50],
                         'min_samples_split': [2, 5, 10],
                         'n_estimators': [10, 31, 52, 73, 94, 115, 136, 157,
                                          178, 200]},
             scoring='accuracy')
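
The cost of this search can be estimated before running it by counting the grid combinations with `ParameterGrid`. The sketch below copies the grid values from the output above, with the two long floats rounded to 0.8 and 0.9; the name `rf_grid_demo` is illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Grid values as shown in the GridSearchCV output above (floats rounded).
rf_grid_demo = {
    "n_estimators": [10, 31, 52, 73, 94, 115, 136, 157, 178, 200],
    "max_depth": [3],
    "max_features": ["auto", "sqrt", None, 0.5, 0.6, 0.7, 0.8, 0.9],
    "max_leaf_nodes": [10, 20, 30, 40, 50],
    "min_samples_split": [2, 5, 10],
    "bootstrap": [True, False],
}

# Number of candidates = product of the list lengths: 10 * 1 * 8 * 5 * 3 * 2.
n_candidates = len(ParameterGrid(rf_grid_demo))

# Each candidate is fitted once per fold with cv=10.
n_fits = n_candidates * 10
print(n_candidates, n_fits)  # 2400 24000
```

For grids of this size, `RandomizedSearchCV` is a common alternative that samples a fixed number of candidates instead of trying all of them.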

In [27]:
rf_acc=grid_rf.best_score_
rf_par=grid_rf.best_params_
rf_par

{'bootstrap': True,
 'max_depth': 3,
 'max_features': 'auto',
 'max_leaf_nodes': 20,
 'min_samples_split': 10,
 'n_estimators': 10}

In [28]:
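# Note: max_leaf_nodes, min_samples_split and n_estimators below differ from the
# best_params_ printed above (20, 10 and 10); RandomForestClassifier(**rf_par) would reuse them.
rf = RandomForestClassifier(bootstrap=True,
 max_depth=3,
 max_features='auto',
 max_leaf_nodes=30,
 min_samples_split=2,
 n_estimators=31)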
rf.fit(X,y)
y_pred=rf.predict(X_test)
rf_accuracy= accuracy_score(y_test, y_pred)
rf_accuracy

0.75

## Naive Bayes 

In [29]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
params_NB = {'var_smoothing': np.logspace(0,-9, num=100)}
grid_nb=GridSearchCV(gnb,params_NB ,scoring='accuracy', cv=10,n_jobs=-1)
grid_nb.fit(X,y)

GridSearchCV(cv=10, estimator=GaussianNB(), n_jobs=-1,
             param_grid={'var_smoothing': array([1.00000000e+00, 8.11130831e-01, 6.57933225e-01, 5.33669923e-01,
       4.32876128e-01, 3.51119173e-01, 2.84803587e-01, 2.31012970e-01,
       1.87381742e-01, 1.51991108e-01, 1.23284674e-01, 1.00000000e-01,
       8.11130831e-02, 6.57933225e-02, 5.33669923e-02, 4.32876128e-02,
       3.51119173e-02, 2.848035...
       1.23284674e-07, 1.00000000e-07, 8.11130831e-08, 6.57933225e-08,
       5.33669923e-08, 4.32876128e-08, 3.51119173e-08, 2.84803587e-08,
       2.31012970e-08, 1.87381742e-08, 1.51991108e-08, 1.23284674e-08,
       1.00000000e-08, 8.11130831e-09, 6.57933225e-09, 5.33669923e-09,
       4.32876128e-09, 3.51119173e-09, 2.84803587e-09, 2.31012970e-09,
       1.87381742e-09, 1.51991108e-09, 1.23284674e-09, 1.00000000e-09])},
             scoring='accuracy')

In [30]:
nb_acc=grid_nb.best_score_
nb_par=grid_nb.best_params_
nb_par

{'var_smoothing': 1.873817422860387e-09}

In [31]:
gnb = GaussianNB(var_smoothing= 1.873817422860387e-09)
gnb.fit(X,y)
y_pred=gnb.predict(X_test)
gnb_accuracy= accuracy_score(y_test, y_pred)
gnb_accuracy

0.7708333333333334

## The Accuracy of All Classifiers

In [32]:
print('Decision Tree:  ','Accuracy:  ',DT_accuracy)
print('KNN:  ','Accuracy:',knn_accuracy)
print('SVM:  ','Accuracy:  ',svn_accuracy)
print('Random Forest:  ','Accuracy:  ',rf_accuracy)
print('Naive Bayes:  ','Accuracy:  ',gnb_accuracy)
print('Logistic regression:  ','Accuracy:  ',lr_accuracy)




Decision Tree:   Accuracy:   0.75
KNN:   Accuracy: 0.6666666666666666
SVM:   Accuracy:   0.7708333333333334
Random Forest:   Accuracy:   0.75
Naive Bayes:   Accuracy:   0.7708333333333334
Logistic regression:   Accuracy:   0.75
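
Since the test set has only 48 rows, it helps to convert these accuracies back into counts of correct predictions. The sketch below uses the accuracies printed above and the test size from the train/test split; the variable names are illustrative.

```python
import pandas as pd

# Test accuracies exactly as printed above.
test_accuracy = pd.Series({
    "Decision Tree": 0.75,
    "KNN": 0.6666666666666666,
    "SVM": 0.7708333333333334,
    "Random Forest": 0.75,
    "Naive Bayes": 0.7708333333333334,
    "Logistic Regression": 0.75,
})
n_test = 48  # rows in X_test

# accuracy * n_test recovers the number of correctly classified test rows.
correct = (test_accuracy * n_test).round().astype(int)

ranking = pd.DataFrame(
    {"accuracy": test_accuracy, "correct_of_48": correct}
).sort_values("accuracy", ascending=False, kind="stable")
print(ranking)
```

SVM and Naive Bayes each classify 37 of 48 rows correctly, and the three models at 0.75 classify 36, so the gap between first and second place is a single test row.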


## The Best Classifier

In [33]:
print("Naive Bayes and SVM tie for the best test accuracy:", gnb_accuracy)

Naive Bayes and SVM tie for the best test accuracy: 0.7708333333333334
