# Task 10: Benchmark Top ML Algorithms

This task tests your ability to use different ML algorithms when solving a specific problem.


### Dataset
Predict Loan Eligibility for Dream Housing Finance company

Dream Housing Finance company deals in all kinds of home loans and has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan, and the company then validates the customer's eligibility.

The company wants to automate the loan eligibility process in real time, based on the details customers provide in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate the process, the company has provided a dataset for identifying the customer segments that are eligible for a loan, so that it can target those customers specifically.

Train: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv

Test: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_test.csv

## Task Requirements
### You can build the following classification models using different ML algorithms
- Decision Tree
- KNN
- Logistic Regression
- SVM
- Random Forest
- Any other algorithm of your choice

### Use GridSearchCV for finding the best model with the best hyperparameters

- Build models
- Create a parameter grid
- Run GridSearchCV
- Choose the best model with the best hyperparameters
- Report the best accuracy
- Also benchmark the best accuracy you can get for every classification algorithm listed above
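
The steps above can be sketched as a single loop over estimators and their grids. This is a minimal illustration, not the notebook's actual code: the data comes from `make_classification` as a stand-in for the preprocessed loan features, and the names `search_space`, `table_1`, and `table_2` as well as the small grids are assumptions chosen for brevity.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): replace with the preprocessed loan features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Build models and create one parameter grid per algorithm.
search_space = {
    "DT": (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4, 8]}),
    "LR": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "SVM": (SVC(), {"C": [1, 10], "kernel": ["linear", "rbf"]}),
}

# Run GridSearchCV for each algorithm, scored on accuracy as the task asks.
rows = []
for name, (estimator, grid) in search_space.items():
    search = GridSearchCV(estimator, param_grid=grid, scoring="accuracy", cv=5)
    search.fit(X_demo, y_demo)
    rows.append({
        "Algorithm": name,
        "Accuracy": search.best_score_,
        "Hyperparameters": search.best_params_,
    })

# Table 1 lists the best result per algorithm; Table 2 keeps only the top row.
table_1 = pd.DataFrame(rows).sort_values("Accuracy", ascending=False).reset_index(drop=True)
table_2 = table_1.head(1)
print(table_1)
```

`best_score_` is the mean cross-validated score of the best candidate, so the values in both tables are cross-validation estimates rather than scores on a separate test set.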

#### Your final output will be something like this:
- Best algorithm accuracy
- Best hyperparameter accuracy for every algorithm

**Table 1 (Algorithm wise best model with best hyperparameter)**

| Algorithm | Accuracy | Hyperparameters |
|---|---|---|
| DT | | |
| KNN | | |
| LR | | |
| SVM | | |
| RF | | |
| Any other | | |

**Table 2 (Best overall)**

| Algorithm | Accuracy | Hyperparameters |
|---|---|---|



### Submission
- Submit the notebook containing all executed code with its outputs
- Submit a document with the above two tables

In [1]:
import pandas as pd
import math

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from sklearn.svm import SVC
from tqdm.contrib import itertools
from tqdm import tqdm
import xgboost as xgb

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
import numpy as np
from sklearn.metrics import f1_score, confusion_matrix, accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.impute import IterativeImputer
from sklearn.linear_model import SGDClassifier

import pprint as pp
import plotly.express as px
import plotly.graph_objects as go

In [2]:
train_data = pd.read_csv("https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv")
X_test = pd.read_csv("https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_test.csv")

In [3]:
print(train_data.shape)

(614, 13)


In [4]:
print(X_test.shape)

(367, 12)


In [5]:
train_data.head()

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
| 1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
| 2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
| 3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
| 4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |


In [6]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [7]:
train_data.drop(["Loan_ID"], axis=1, inplace=True)
X_test.drop(["Loan_ID"], axis=1, inplace=True)

In [8]:
train_data.Dependents.value_counts()

0     345
1     102
2     101
3+     51
Name: Dependents, dtype: int64

In [9]:
train_data.loc[ train_data["Dependents"] == "3+", "Dependents"] = 3
X_test.loc[X_test["Dependents"] == "3+", "Dependents"] = 3

train_data.loc[train_data["Loan_Status"] == "Y", "Loan_Status"] = 1
train_data.loc[train_data["Loan_Status"] == "N", "Loan_Status"] = 0
train_data["Loan_Status"] = train_data["Loan_Status"].astype('uint')
# X_test.loc[X_test["Loan_Status"] == "Y", "Loan_Status"] = 1
# X_test.loc[X_test["Loan_Status"] == "N", "Loan_Status"] = 0

In [10]:
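def print_stratified_percentages(data):
    # Print each class's share of the rows. math.ceil rounds every share up,
    # so the printed percentages can add up to more than 100.
    classes = data.value_counts()
    for class_ in classes.keys():
        print(f"Class percentage: {class_} - ", f"{math.ceil((classes[class_] / data.shape[0])*100)}%")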

In [11]:
train_data

| | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | 1 |
| 1 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | 0 |
| 2 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | 1 |
| 3 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | 1 |
| 4 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 609 | Female | No | 0 | Graduate | No | 2900 | 0.0 | 71.0 | 360.0 | 1.0 | Rural | 1 |
| 610 | Male | Yes | 3 | Graduate | No | 4106 | 0.0 | 40.0 | 180.0 | 1.0 | Rural | 1 |
| 611 | Male | Yes | 1 | Graduate | No | 8072 | 240.0 | 253.0 | 360.0 | 1.0 | Urban | 1 |
| 612 | Male | Yes | 2 | Graduate | No | 7583 | 0.0 | 187.0 | 360.0 | 1.0 | Urban | 1 |


In [152]:
print("For Original :")
print_stratified_percentages(train_data.Loan_Status)

For Original :
Class percentage: 1 -  69%
Class percentage: 0 -  32%


In [12]:

X = train_data.drop(["Loan_Status"], axis=1)
y = train_data["Loan_Status"]


X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Train shape: ", X_train.shape)
print("Val shape: ", X_val.shape)

Train shape:  (491, 11)
Val shape:  (123, 11)


## Encoding

In [14]:
def encode_categorical_variable(df):
    categorical_df = df.select_dtypes(["object"])
    categorical_df_encoded = pd.get_dummies(categorical_df)
    return pd.concat([df.drop(categorical_df.columns, axis=1), categorical_df_encoded], axis=1)

In [15]:
X_train_encoded = encode_categorical_variable(X_train)
X_val_encoded = encode_categorical_variable(X_val)
X_test_encoded = encode_categorical_variable(X_test)

In [16]:
print(X_train_encoded.shape, X_val_encoded.shape, X_test_encoded.shape)

(491, 20) (123, 20) (367, 20)
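
All three splits happen to produce 20 columns here, but `pd.get_dummies` only creates columns for the categories it sees, so encoding each split separately can yield mismatched columns when a category is absent from one split. A minimal sketch of one way to guard against that follows; the toy frames and the names `train_demo`, `test_demo`, and `test_aligned` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy frames (assumption): the test split lacks the "Semiurban" category.
train_demo = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Semiurban"],
    "ApplicantIncome": [5849, 4583, 3000],
})
test_demo = pd.DataFrame({
    "Property_Area": ["Urban", "Rural"],
    "ApplicantIncome": [2583, 6000],
})

train_encoded = pd.get_dummies(train_demo)
test_encoded = pd.get_dummies(test_demo)

# Reindex the test frame to the training columns. Categories missing from the
# test split become all-zero columns; columns unseen in training are dropped.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(train_encoded.columns))
print(test_aligned.shape)
```

Aligning to the training columns keeps the feature order identical to what the imputer, scaler, and models were fitted on.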


## Missing values 

In [17]:
# Fit the imputer on the training split only, then reuse the fitted imputer on the test set.
imp_mean = IterativeImputer(random_state=0)

X_train_imputed = pd.DataFrame(imp_mean.fit_transform(X_train_encoded), columns=X_train_encoded.columns)
X_test_imputed = pd.DataFrame(imp_mean.transform(X_test_encoded), columns=X_test_encoded.columns)

## Scaling

In [18]:
scaler = MinMaxScaler()

X_train_scaled = scaler.fit_transform(X_train_imputed)
X_test_scaled = scaler.transform(X_test_imputed)

In [19]:
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train_imputed.columns)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_test_imputed.columns)
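
The imputer and scaler above are fitted once on the whole training split, so during cross-validation each validation fold has already influenced the preprocessing. One common alternative is to wrap the preprocessing and the model in a `Pipeline` so that every fold refits them. The sketch below assumes synthetic data from `make_classification` and swaps in `SimpleImputer` for `IterativeImputer` to keep the example short; `pipeline` and `pipe_search` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with roughly 5% missing values (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# The imputer and scaler are refit on the training part of every CV fold.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", MinMaxScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"clf__C": [0.5, 1, 5, 10]}

pipe_search = GridSearchCV(pipeline, param_grid=param_grid, scoring="f1", cv=5)
pipe_search.fit(X_demo, y_demo)
print(pipe_search.best_params_, round(pipe_search.best_score_, 3))
```

The fitted `pipe_search.best_estimator_` can then be called on raw, unscaled features, because it applies the same imputation and scaling before predicting.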

In [20]:
# Reset the target's index so it lines up with the re-indexed feature frames.
y_train = y_train.reset_index(drop=True)
y_train

0      1
1      1
2      0
3      0
4      1
      ..
486    1
487    1
488    1
489    1
490    1
Name: Loan_Status, Length: 491, dtype: uint64

## Hyperparameter tuning

In [136]:
metric_data = []
target_names=y_train.value_counts().keys().to_list()

We keep the top features that received four or more votes.

In [137]:
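def get_metrics(model_name, params, score):
    # `score` is the best mean cross-validated F1 (GridSearchCV.best_score_).
    # It is not named f1_score, to avoid shadowing sklearn.metrics.f1_score.
    return {"Model": model_name,
            "Best params": params,
            "F1_score": score}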

In [138]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import pprint as pp
from sklearn.svm import SVC

### LogisticRegression

In [120]:
param_grid = {
    "C": [0.5, 1, 5, 10], 
    "max_iter": [500, 1000]
}

lr_grid_search = GridSearchCV(LogisticRegression(), param_grid=param_grid, scoring="f1")
lr_grid_search.fit(X_train_scaled_df, y_train)

In [121]:
lr_grid_search.best_params_

{'C': 0.5, 'max_iter': 500}

In [139]:
metrics = get_metrics("Parameter Tuned Logistic Regression", lr_grid_search.best_params_, lr_grid_search.best_score_)
pp.pprint(metrics)
metric_data.append(metrics)

{'Best params': {'C': 0.5, 'max_iter': 500},
 'F1_score': 0.8695946835109882,
 'Model': 'Parameter Tuned Logistic Regression'}
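
The searches in this notebook are scored on F1, while the task asks for accuracy. `GridSearchCV` can track both in one run by passing a dictionary to `scoring` and naming the metric used for model selection in `refit`. The following is a minimal sketch on synthetic data; `multi_search`, `best_f1`, and `best_accuracy` are illustrative names, not variables from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): replace with the scaled loan features.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

multi_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.5, 1, 5, 10]},
    scoring={"f1": "f1", "accuracy": "accuracy"},
    refit="f1",  # the best candidate is still chosen by F1
)
multi_search.fit(X_demo, y_demo)

# cv_results_ holds one mean_test_<name> array per metric; index it with the
# position of the selected candidate to read that candidate's accuracy.
best = multi_search.best_index_
best_f1 = multi_search.cv_results_["mean_test_f1"][best]
best_accuracy = multi_search.cv_results_["mean_test_accuracy"][best]
print(multi_search.best_params_, best_f1, best_accuracy)
```

Setting `refit="accuracy"` instead would select the candidate by accuracy, which may pick different hyperparameters than the F1-based search.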


### DecisionTreeClassifier

In [123]:
param_grid = {'max_depth': [2, 4, 8, 16, 32],
              'min_samples_leaf': [2, 10, 100, 1000],
              'criterion': ['gini','entropy', 'log_loss'],
              'max_leaf_nodes': [10, 100, 1000],
              'min_impurity_decrease': [0.000001, 0.0001, 0.001, 0.010],
              'splitter': ['best', 'random']}

dt_grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid=param_grid, scoring="f1")
dt_grid_search.fit(X_train_scaled_df, y_train)

In [140]:
metrics = get_metrics("Parameter Tuned DecisionTree Regression", dt_grid_search.best_params_, dt_grid_search.best_score_)

pp.pprint(metrics)
metric_data.append(metrics)

{'Best params': {'criterion': 'gini',
                 'max_depth': 2,
                 'max_leaf_nodes': 10,
                 'min_impurity_decrease': 0.001,
                 'min_samples_leaf': 2,
                 'splitter': 'random'},
 'F1_score': 0.8710980392156863,
 'Model': 'Parameter Tuned DecisionTree Regression'}


### RandomForestClassifier

In [125]:

param_grid = {
    "n_estimators": [200, 300, 400],
    "max_depth": [2, 8],
    "max_features": ['log2', 'sqrt', None],
    "max_leaf_nodes": [4, 8, 16, 32],
    "min_samples_split": [2, 4],
    "bootstrap": [True, False]
}
rf_grid_search = GridSearchCV(RandomForestClassifier(class_weight="balanced", n_jobs=-1), param_grid=param_grid, n_jobs=-1, scoring="f1", verbose=True)
rf_grid_search.fit(X_train_scaled_df, y_train)

Fitting 5 folds for each of 288 candidates, totalling 1440 fits


In [141]:
metrics = get_metrics("Parameter Tuned RandomForest Classifier", rf_grid_search.best_params_, rf_grid_search.best_score_)

pp.pprint(metrics)
metric_data.append(metrics)

{'Best params': {'bootstrap': False,
                 'max_depth': 2,
                 'max_features': None,
                 'max_leaf_nodes': 4,
                 'min_samples_split': 2,
                 'n_estimators': 200},
 'F1_score': 0.8669186214077854,
 'Model': 'Parameter Tuned RandomForest Classifier'}


### SGDClassifier

In [127]:
param_grid = {
    "loss": ["hinge", "log_loss"], 
    "penalty":["l2", "l1", "elasticnet"],
    "alpha": [0.0001, 0.001, 0.1,0.5 ]
}
sgd_grid_search = GridSearchCV(SGDClassifier(), param_grid=param_grid,scoring="f1")
sgd_grid_search.fit(X_train_scaled_df, y_train)

In [142]:
metrics = get_metrics("Parameter Tuned Stochastic Gradient Descent", sgd_grid_search.best_params_, sgd_grid_search.best_score_)

pp.pprint(metrics)
metric_data.append(metrics)

{'Best params': {'alpha': 0.001, 'loss': 'hinge', 'penalty': 'l1'},
 'F1_score': 0.8707299621603027,
 'Model': 'Parameter Tuned Stochastic Gradient Descent'}


### SVC

In [129]:
param_grid = {
    "C": [1, 5, 10],
    "kernel": ["linear", "rbf"],
}

svm_grid_search = GridSearchCV(SVC(), param_grid=param_grid, n_jobs=-1,scoring="f1")
svm_grid_search.fit(X_train_scaled_df, y_train)

In [143]:
metrics = get_metrics("Parameter Tuned  Support Vector Machine", svm_grid_search.best_params_, svm_grid_search.best_score_)

pp.pprint(metrics)
metric_data.append(metrics)

{'Best params': {'C': 1, 'kernel': 'linear'},
 'F1_score': 0.8699456484348126,
 'Model': 'Parameter Tuned  Support Vector Machine'}
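
The task list also names KNN, which is not tuned in this notebook. A search in the same style as the cells above might look like the sketch below. It runs on synthetic data rather than `X_train_scaled_df`, and the grid values and the names `knn_param_grid` and `knn_grid_search` are assumptions, so it says nothing about how KNN performs on the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption), scaled because KNN is distance-based.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_demo_scaled = MinMaxScaler().fit_transform(X_demo)

knn_param_grid = {
    "n_neighbors": [3, 5, 11, 21],
    "weights": ["uniform", "distance"],
    "p": [1, 2],  # 1 = Manhattan distance, 2 = Euclidean distance
}

knn_grid_search = GridSearchCV(KNeighborsClassifier(), param_grid=knn_param_grid, scoring="f1")
knn_grid_search.fit(X_demo_scaled, y_demo)
print(knn_grid_search.best_params_, knn_grid_search.best_score_)
```

On the real data, the result could be recorded with `get_metrics` and appended to `metric_data` like the other models.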


### XGBClassifier

In [135]:

param_grid  = {
    "max_depth": [2, 4, 8, 16, 32], 
    "n_estimators": [100, 200, 400, 500], 
    "lambda": [0, 0.5, 1, 1.5, 3, 6], 
    "alpha": [1, 1.5, 3, 6, 8, 10],
    "tree_method": ["auto", "hist", "exact"],
    "eta": [0.1, 0.3, 0.5, 1]
}

xgb_search = GridSearchCV(xgb.XGBClassifier(objective='binary:logistic', tree_method='hist', eta=0.1), param_grid=param_grid, n_jobs=-1,scoring="f1")
xgb_search.fit(X_train_scaled_df, y_train)

In [144]:

metrics = get_metrics("Parameter Tuned  XGB", xgb_search.best_params_, xgb_search.best_score_)

pp.pprint(metrics)
metric_data.append(metrics)

{'Best params': {'alpha': 6,
                 'eta': 0.1,
                 'lambda': 0.5,
                 'max_depth': 8,
                 'n_estimators': 100,
                 'tree_method': 'auto'},
 'F1_score': 0.8704161762767946,
 'Model': 'Parameter Tuned  XGB'}
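
The XGBoost grid is much larger than the others, and its cost can be estimated before running it. The sketch below counts the candidates with scikit-learn's `ParameterGrid`, using the same values as the grid above and the 5 folds that `GridSearchCV` uses by default; `xgb_param_grid`, `n_candidates`, and `n_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the XGBoost grid used in this notebook.
xgb_param_grid = {
    "max_depth": [2, 4, 8, 16, 32],
    "n_estimators": [100, 200, 400, 500],
    "lambda": [0, 0.5, 1, 1.5, 3, 6],
    "alpha": [1, 1.5, 3, 6, 8, 10],
    "tree_method": ["auto", "hist", "exact"],
    "eta": [0.1, 0.3, 0.5, 1],
}

# Every combination is one candidate; each candidate is fitted once per fold.
n_candidates = len(ParameterGrid(xgb_param_grid))
n_fits = n_candidates * 5
print(n_candidates, n_fits)  # 8640 43200
```

When a grid grows this large, `RandomizedSearchCV` from `sklearn.model_selection` is one option for sampling a fixed number of candidates instead of fitting all of them.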


## Table 1

In [145]:
parameter_tuned_df = pd.DataFrame(metric_data)
parameter_tuned_df.sort_values("F1_score", inplace=True, ascending=False)
parameter_tuned_df

| | Model | Best params | F1_score |
|---|---|---|---|
| 1 | Parameter Tuned DecisionTree Regression | {'criterion': 'gini', 'max_depth': 2, 'max_lea... | 0.871098 |
| 3 | Parameter Tuned Stochastic Gradient Descent | {'alpha': 0.001, 'loss': 'hinge', 'penalty': '... | 0.87073 |
| 5 | Parameter Tuned XGB | {'alpha': 6, 'eta': 0.1, 'lambda': 0.5, 'max_d... | 0.870416 |
| 4 | Parameter Tuned Support Vector Machine | {'C': 1, 'kernel': 'linear'} | 0.869946 |
| 0 | Parameter Tuned Logistic Regression | {'C': 0.5, 'max_iter': 500} | 0.869595 |
| 2 | Parameter Tuned RandomForest Classifier | {'bootstrap': False, 'max_depth': 2, 'max_feat... | 0.866919 |


## Table 2

In [176]:
parameter_tuned_df.head(1)

| | Model | Best params | F1_score |
|---|---|---|---|
| 1 | Parameter Tuned DecisionTree Regression | {'criterion': 'gini', 'max_depth': 2, 'max_lea... | 0.871098 |
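
The scores in both tables are cross-validation means computed on the training split; the 20% validation split created earlier is not scored anywhere in this notebook. A held-out check of the selected model could look like the following sketch. It uses synthetic data, and the names `X_hold`, `y_hold`, `hold_accuracy`, and `hold_f1` are hypothetical; on the real data the validation split would first need the same encoding, imputation, and scaling as the training split.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), split the same way as the notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Tune on the training part only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},
    scoring="f1",
)
search.fit(X_tr, y_tr)

# Score the refitted best estimator on rows it never saw during tuning.
hold_pred = search.best_estimator_.predict(X_hold)
hold_accuracy = accuracy_score(y_hold, hold_pred)
hold_f1 = f1_score(y_hold, hold_pred)
print(hold_accuracy, hold_f1)
```

A held-out score noticeably below the cross-validation score would suggest that the search has overfit the training folds.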


## Predict with test data

In [172]:
def predict_test_data(estimator, test_data):
    # Predict, map the 0/1 labels back to "N"/"Y", and save one CSV per estimator.
    pred = pd.Series(estimator.predict(test_data), name="Loan_Status")
    pred = pred.map({1: "Y", 0: "N"})
    pred.to_csv(f"{type(estimator).__name__}-result.csv", index=False)

In [175]:
predict_test_data(dt_grid_search.best_estimator_, X_test_scaled_df)
predict_test_data(lr_grid_search.best_estimator_, X_test_scaled_df)
predict_test_data(rf_grid_search.best_estimator_, X_test_scaled_df)
predict_test_data(svm_grid_search.best_estimator_, X_test_scaled_df)
predict_test_data(sgd_grid_search.best_estimator_, X_test_scaled_df)
predict_test_data(xgb_search.best_estimator_, X_test_scaled_df)