# Task 10 : Benchmark Top ML Algorithms

This task tests your ability to use different ML algorithms when solving a specific problem.


### Dataset
Predict Loan Eligibility for Dream Housing Finance company

Dream Housing Finance company deals in all kinds of home loans. It has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan, and the company then validates the customer's eligibility.

The company wants to automate the loan eligibility process in real time, based on the details customers provide in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate this process, the company has provided a dataset for identifying the customer segments that are eligible for a loan, so that it can specifically target these customers.

Train: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv

Test: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_test.csv

## Task Requirements
### Build classification models using the following ML algorithms
- Decision Tree
- KNN
- Logistic Regression
- SVM
- Random Forest
- Any other algorithm of your choice

### Use GridSearchCV for finding the best model with the best hyperparameters

- ### Build models
- ### Create Parameter Grid
- ### Run GridSearchCV
- ### Choose the best model with the best hyperparameter
- ### Give the best accuracy
- ### Also, benchmark the best accuracy that you could get for every classification algorithm asked above
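
The steps listed above can be sketched on a small scale as follows. This is a minimal illustration, not the required solution: the data is synthetic (generated with `make_classification`), and the grid values are arbitrary assumptions chosen only to keep the search short.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the loan dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Steps 1 and 2: build the model and create the parameter grid.
model = DecisionTreeClassifier(random_state=0)
param_grid = {"max_depth": [1, 2, 3, 5], "criterion": ["gini", "entropy"]}

# Step 3: run the grid search with 3-fold cross-validation on the training split.
search = GridSearchCV(model, param_grid=param_grid, cv=3, scoring="accuracy")
search.fit(X_tr, y_tr)

# Steps 4 and 5: read off the best hyperparameters and report accuracy.
best_params = search.best_params_
cv_accuracy = search.best_score_          # mean cross-validated accuracy
test_accuracy = search.score(X_te, y_te)  # accuracy of the refitted best model
print(best_params, round(cv_accuracy, 3), round(test_accuracy, 3))
```

Repeating the same pattern for each algorithm, with its own grid, gives one row per algorithm for Table 1.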

#### Your final output will be something like this:
- Best algorithm accuracy
- Best hyperparameter accuracy for every algorithm

**Table 1 (Algorithm wise best model with best hyperparameter)**

Algorithm | Accuracy | Hyperparameters
--- | --- | ---
DT | |
KNN | |
LR | |
SVM | |
RF | |
Any other | |

**Table 2 (Best overall)**

Algorithm    |   Accuracy    |   Hyperparameters



### Submission
- Submit the notebook with all code cells run and their outputs saved
- Submit a document containing the two tables above

In [81]:
import numpy as np 
import pandas as pd
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
from sklearn import metrics



In [82]:
df_train = pd.read_csv("https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv")
df_test = pd.read_csv("https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_test.csv")

In [83]:
df_train["Loan_Status"].value_counts()

Y    422
N    192
Name: Loan_Status, dtype: int64
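
Because the classes are imbalanced, it may help to keep a majority-class baseline in mind when reading the accuracies later on. The sketch below uses only the counts printed above. The baseline is computed on the full training file, so the class mix of any later hold-out split may differ slightly.

```python
# Class counts copied from the value_counts() output above.
counts = {"Y": 422, "N": 192}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = counts[majority_label] / total

print(majority_label, round(baseline_accuracy, 4))  # prints: Y 0.6873
```

Any tuned model should beat roughly 0.69 accuracy to be doing better than always answering "Y".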

In [84]:
df_train.describe(include ='all')


| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 614 | 601 | 611 | 599.0 | 614 | 582 | 614.0 | 614.0 | 592.0 | 600.0 | 564.0 | 614 | 614 |
| unique | 614 | 2 | 2 | 4.0 | 2 | 2 | | | | | | 3 | 2 |
| top | LP001002 | Male | Yes | 0.0 | Graduate | No | | | | | | Semiurban | Y |
| freq | 1 | 489 | 398 | 345.0 | 480 | 500 | | | | | | 233 | 422 |
| mean | | | | | | | 5403.459283 | 1621.245798 | 146.412162 | 342.0 | 0.842199 | | |
| std | | | | | | | 6109.041673 | 2926.248369 | 85.587325 | 65.12041 | 0.364878 | | |
| min | | | | | | | 150.0 | 0.0 | 9.0 | 12.0 | 0.0 | | |
| 25% | | | | | | | 2877.5 | 0.0 | 100.0 | 360.0 | 1.0 | | |
| 50% | | | | | | | 3812.5 | 1188.5 | 128.0 | 360.0 | 1.0 | | |
| 75% | | | | | | | 5795.0 | 2297.25 | 168.0 | 360.0 | 1.0 | | |


In [85]:
#  Dimensions of train data
print("Train Data: ",df_train.shape)

# Dimensions of test data
print("Test Data: ",df_test.shape)

#removing duplicates from train data
df_train.drop_duplicates(keep='first', inplace=True)
print("Train Data after removing duplicates: ",df_train.shape)


Train Data:  (614, 13)
Test Data:  (367, 12)
Train Data after removing duplicates:  (614, 13)


In [86]:
df_train.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [87]:
df_test.isnull().sum()

Loan_ID               0
Gender               11
Married               0
Dependents           10
Education             0
Self_Employed        23
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount            5
Loan_Amount_Term      6
Credit_History       29
Property_Area         0
dtype: int64

In [88]:
df_train["Loan_Status"] = df_train["Loan_Status"].map({"Y" : 1, "N" : 0})

In [89]:
df_train["Loan_Status"].value_counts()

1    422
0    192
Name: Loan_Status, dtype: int64

In [90]:
xtrain = df_train.drop(["Loan_ID","Loan_Status","Gender"],axis=1)
xtest = df_test.drop(["Loan_ID","Gender"],axis=1)

In [91]:
# Select the target as a 1-D Series; a one-column DataFrame triggers column_or_1d warnings in fit()
ytrain = df_train["Loan_Status"]

In [92]:
cat_cols = [col for col in xtrain.columns if xtrain.dtypes[col]=="object"]
cat_cols

num_cols = [col for col in xtrain.columns if xtrain.dtypes[col] !="object"]
num_cols

['ApplicantIncome',
 'CoapplicantIncome',
 'LoanAmount',
 'Loan_Amount_Term',
 'Credit_History']

In [93]:
xtrain_cat = xtrain[cat_cols]
xtrain_num = xtrain[num_cols]
xtest_cat = xtest[cat_cols]
xtest_num = xtest[num_cols]

In [94]:
from sklearn.impute import SimpleImputer
cat_impu=SimpleImputer(strategy="most_frequent")
num_impu=SimpleImputer(strategy="median" )
# Fit each imputer on the training data only, then reuse the learned values on the test data
xtrain_cat=pd.DataFrame(cat_impu.fit_transform(xtrain_cat),columns=cat_cols)
xtest_cat=pd.DataFrame(cat_impu.transform(xtest_cat),columns=cat_cols)
xtrain_num=pd.DataFrame(num_impu.fit_transform(xtrain_num),columns=num_cols)
xtest_num=pd.DataFrame(num_impu.transform(xtest_num),columns=num_cols)

In [95]:
# Convert all object columns to the categorical dtype
xtrain_cat[cat_cols]=xtrain_cat[cat_cols].astype("category")
xtest_cat[cat_cols]=xtest_cat[cat_cols].astype("category")

In [96]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

for col in cat_cols:
    # Learn the label-to-integer mapping from the training column and reuse it for the test column
    xtrain_cat[col]=encoder.fit_transform(xtrain_cat[col])
    xtest_cat[col]=encoder.transform(xtest_cat[col])
    
xtrain_cat.head()

| | Married | Dependents | Education | Self_Employed | Property_Area |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 2 |
| 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 1 | 2 |
| 3 | 1 | 0 | 1 | 0 | 2 |
| 4 | 0 | 0 | 0 | 0 | 2 |


In [97]:
xtrain=pd.concat([xtrain_num,xtrain_cat],axis=1)
xtest=pd.concat([xtest_num,xtest_cat],axis=1)
columnsxtr = xtrain.columns
columnsxtr

Index(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Married', 'Dependents',
       'Education', 'Self_Employed', 'Property_Area'],
      dtype='object')

In [98]:
columnsxte = xtest.columns
columnsxte

Index(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Married', 'Dependents',
       'Education', 'Self_Employed', 'Property_Area'],
      dtype='object')

In [99]:
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
scale.fit(xtrain)

In [100]:
xtrain = pd.DataFrame(scale.transform(xtrain),columns=columnsxtr)

In [101]:
xtrain.isnull().sum()

ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Married              0
Dependents           0
Education            0
Self_Employed        0
Property_Area        0
dtype: int64

In [102]:
xtest.isnull().sum()

ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Married              0
Dependents           0
Education            0
Self_Employed        0
Property_Area        0
dtype: int64

In [103]:
xtest = pd.DataFrame(scale.transform(xtest),columns=columnsxte)
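
The imputation, encoding, and scaling steps above could also be bundled into a single scikit-learn `ColumnTransformer`, so that every statistic is learned from the training data and reused on the test data. The sketch below is one possible arrangement, not the approach taken in this notebook: the toy frames and their values are made up, and `OrdinalEncoder` is swapped in for `LabelEncoder` because it is designed for feature columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy frames standing in for xtrain / xtest (made-up values).
toy_train = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 150.0, 200.0],
    "Married": ["Yes", "No", np.nan, "Yes"],
})
toy_test = pd.DataFrame({
    "LoanAmount": [np.nan, 120.0],
    "Married": ["No", np.nan],
})

# Numeric columns: median imputation followed by standardization.
numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
# Categorical columns: most-frequent imputation followed by integer encoding.
categorical = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                        ("encode", OrdinalEncoder())])

preprocess = ColumnTransformer([
    ("num", numeric, ["LoanAmount"]),
    ("cat", categorical, ["Married"]),
])

# Medians, modes, category mappings, and scaling factors come from the training frame only ...
train_arr = preprocess.fit_transform(toy_train)
# ... and are applied unchanged to the test frame.
test_arr = preprocess.transform(toy_test)
print(train_arr.shape, test_arr.shape)
```

A transformer like this can also be placed in front of a classifier inside a `Pipeline`, which lets `GridSearchCV` refit the preprocessing within each cross-validation fold.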

In [104]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(xtrain,ytrain,test_size=0.25,random_state=0)
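
With imbalanced classes, a stratified split is one optional way to keep the approval rate similar in both parts. The sketch below is a variation on the split above, not what this notebook ran: it uses placeholder features and labels that reproduce only the class counts of `Loan_Status`, and adds the `stratify` argument.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as Loan_Status (422 ones, 192 zeros).
y = np.array([1] * 422 + [0] * 192)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features (an assumption)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# With stratify=y, both splits keep roughly the original share of approvals.
print(len(y_te), round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

With 614 rows and `test_size=0.25`, the hold-out split contains 154 rows, so one prediction corresponds to about 0.65 percentage points of accuracy.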

In [105]:
def evaluate_model(model, x_test, y_test):
    # Predict on the held-out data
    y_pred = model.predict(x_test)

    # Calculate accuracy (metrics is imported at the top of the notebook)
    acc = metrics.accuracy_score(y_test, y_pred)
    
    return {'acc': acc}


In [106]:
modelDict = {"Decision Tree":DecisionTreeClassifier(),"KNN":KNeighborsClassifier(),"Logistic Regression":LogisticRegression(),
             "SVM":SVC(),"Random Forest":RandomForestClassifier(),"NaiveBayes":GaussianNB()}

In [107]:
# Collect one single-row frame per model; DataFrame.append was removed in pandas 2.0
resultFrames = []
for modelName,model in modelDict.items():
    model.fit(x_train, y_train)
    model_eval = evaluate_model(model, x_test, y_test)
    
    resultFrames.append(pd.DataFrame({"ModelName":modelName,"Accuracy": model_eval['acc']},index = [0]))

modelsResults = pd.concat(resultFrames)
    



In [108]:
modelsResults

| | ModelName | Accuracy |
|---|---|---|
| 0 | Decision Tree | 0.746753 |
| 0 | KNN | 0.779221 |
| 0 | Logistic Regression | 0.837662 |
| 0 | SVM | 0.837662 |
| 0 | Random Forest | 0.811688 |
| 0 | NaiveBayes | 0.824675 |


In [109]:
param_grid_DST = {
    "criterion": ['gini', 'entropy'],
    "max_depth": list(range(1, 10)),
    # Kept as run; newer scikit-learn releases may reject the integer 1 here (integers must be >= 2)
    "min_samples_split": list(range(1, 10))
}

param_grid_KNN = {
    "n_neighbors": list(range(5, 10)),
    "leaf_size": list(range(10, 50, 10))
}

param_grid_Logistic_Regression = {
    'C': [0.1, 1, 10, 100],
    'solver': ['lbfgs', 'liblinear']
}

param_grid_SVC = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf', 'poly', 'sigmoid']
}

param_grid_RF = {
    'n_estimators': [10, 100, 1000],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10, 20]
}

param_NaiveBayes = {'var_smoothing': np.logspace(0,-9, num=100)}
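
To get a feel for how much work each grid implies, the number of candidate settings can be counted with `ParameterGrid` before running the search. The sketch below repeats two of the grids above so that it runs on its own. The value `cv = 3` is an assumption here, chosen to match the fold count passed to `GridSearchCV` in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Two of the grids defined above, repeated so this sketch is self-contained.
param_grid_DST = {
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(1, 10)),
    "min_samples_split": list(range(1, 10)),
}
param_grid_RF = {
    "n_estimators": [10, 100, 1000],
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 5, 10, 20],
}

cv = 3  # number of cross-validation folds (assumed)
fit_counts = {}
for name, grid in {"Decision Tree": param_grid_DST, "Random Forest": param_grid_RF}.items():
    n_candidates = len(ParameterGrid(grid))
    # Each candidate is fitted once per fold (plus one final refit of the winner).
    fit_counts[name] = (n_candidates, n_candidates * cv)
    print(name, n_candidates, n_candidates * cv)
# prints:
# Decision Tree 162 486
# Random Forest 24 72
```

The decision tree grid has many more candidates, but each random forest candidate with 1000 trees is far more expensive to fit.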

In [110]:
modelDict = {"Decision Tree":[DecisionTreeClassifier(),param_grid_DST],
             "KNN":[KNeighborsClassifier(),param_grid_KNN],
             "Logistic Regression":[LogisticRegression(),param_grid_Logistic_Regression],
             "SVM":[SVC(),param_grid_SVC],
             "Random Forest":[RandomForestClassifier(),param_grid_RF],
             "NaiveBayes":[GaussianNB(),param_NaiveBayes]
            }

In [139]:

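# Collect one single-row frame per model; DataFrame.append was removed in pandas 2.0
result_frames = []
for modelName, model in modelDict.items():
    # Tune each estimator with 3-fold cross-validation on the training split
    GSV = GridSearchCV(model[0],param_grid=model[1],cv=3,scoring="accuracy",n_jobs=8)
    GSV.fit(x_train,y_train)
    
    # Score the refitted best estimator on the held-out split
    y_pred = GSV.predict(x_test)
    acc = metrics.accuracy_score(y_test, y_pred)
    
    results_dict = {"ModelName":modelName,
                   "Accuracy":acc,
                   "HyperParameter":str(GSV.best_params_)}
    result_frames.append(pd.DataFrame(results_dict,index = [0]))

results = pd.concat(result_frames)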
    
    



# Best overall model (Table 2)

In [140]:
results[results['Accuracy']==max(results['Accuracy'])]

| | ModelName | Accuracy | HyperParameter |
|---|---|---|---|
| 0 | Logistic Regression | 0.837662 | {'C': 0.1, 'solver': 'lbfgs'} |
| 0 | NaiveBayes | 0.837662 | {'var_smoothing': 1.0} |
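
The best-overall rows here are chosen by accuracy on the hold-out split, which was also used to compare the models. One alternative is to record the cross-validated score (`best_score_`) next to the hold-out accuracy and select on the former. The sketch below shows that bookkeeping on synthetic data with two example models. The column names `CVAccuracy` and `TestAccuracy` are hypothetical, and the grids are arbitrary.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; this is not the loan dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

candidates = {
    "Logistic Regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
}

rows = []
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, param_grid=grid, cv=3, scoring="accuracy")
    search.fit(X_tr, y_tr)
    rows.append({
        "ModelName": name,
        "CVAccuracy": search.best_score_,         # mean accuracy across the CV folds
        "TestAccuracy": search.score(X_te, y_te), # accuracy on the held-out split
        "HyperParameter": str(search.best_params_),
    })

# Rank on the cross-validated score; the hold-out score stays as an independent check.
summary = pd.DataFrame(rows).sort_values("CVAccuracy", ascending=False)
print(summary.to_string(index=False))
```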


# All model results (Table 1)

In [144]:
pd.set_option('display.max_colwidth', None)
results['HyperParameter'] = results['HyperParameter'].astype(str).apply(lambda x: x.ljust(150))


In [145]:
results

| | ModelName | Accuracy | HyperParameter |
|---|---|---|---|
| 0 | Decision Tree | 0.831169 | {'criterion': 'gini', 'max_depth': 1, 'min_samples_split': 1} |
| 0 | KNN | 0.831169 | {'leaf_size': 10, 'n_neighbors': 9} |
| 0 | Logistic Regression | 0.837662 | {'C': 0.1, 'solver': 'lbfgs'} |
| 0 | SVM | 0.831169 | {'C': 0.1, 'kernel': 'linear'} |
| 0 | Random Forest | 0.831169 | {'criterion': 'entropy', 'max_depth': 5, 'n_estimators': 1000} |
| 0 | NaiveBayes | 0.837662 | {'var_smoothing': 1.0} |
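
The accuracies in this table are very close, and it can help to translate them back into counts of correct predictions. The sketch below copies the accuracies from the table and assumes the hold-out size follows from `test_size=0.25` applied to 614 rows, rounded up.

```python
import math
import pandas as pd

# Accuracies copied from the results table in this notebook.
results = pd.DataFrame({
    "ModelName": ["Decision Tree", "KNN", "Logistic Regression",
                  "SVM", "Random Forest", "NaiveBayes"],
    "Accuracy": [0.831169, 0.831169, 0.837662, 0.831169, 0.831169, 0.837662],
})

# Keep every row that ties for the top accuracy.
best = results[results["Accuracy"] == results["Accuracy"].max()]
winners = best["ModelName"].tolist()
print(winners)  # prints: ['Logistic Regression', 'NaiveBayes']

# Hold-out size implied by test_size=0.25 on 614 rows.
n_holdout = math.ceil(0.25 * 614)

# Convert the two distinct accuracies into numbers of correct predictions.
correct_best = round(0.837662 * n_holdout)
correct_rest = round(0.831169 * n_holdout)
print(n_holdout, correct_best, correct_rest)  # prints: 154 129 128
```

The two top models are ahead of the other four by a single hold-out example, so the ranking should be read as a near tie rather than a clear win.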
