# Task 10: Benchmark Top ML Algorithms

This task tests your ability to use different ML algorithms when solving a specific problem.


### Dataset
Predict Loan Eligibility for Dream Housing Finance company

Dream Housing Finance deals in all kinds of home loans and has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan, and the company then validates the customer's eligibility for it.

The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling out the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate this process, the company has provided a dataset for identifying the customer segments that are eligible for a loan, so that it can specifically target these customers.

Train: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv

Test: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_test.csv

## Task Requirements
### Build classification models using the following ML algorithms
- Decision Tree
- KNN
- Logistic Regression
- SVM
- Random Forest
- Any other algorithm of your choice

### Use GridSearchCV for finding the best model with the best hyperparameters

- Build the models
- Create a parameter grid
- Run GridSearchCV
- Choose the best model with the best hyperparameters
- Report the best accuracy
- Benchmark the best accuracy you can get for every classification algorithm listed above
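
The steps above can be sketched for a single algorithm as follows. This is a minimal illustration, not the required solution: it uses synthetic data from `make_classification` as a stand-in for the loan dataset, and the grid values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the loan dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# 1. Build the model
model = DecisionTreeClassifier(random_state=0)

# 2. Create the parameter grid (illustrative values only)
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]}

# 3. Run GridSearchCV with 5-fold cross-validation and accuracy scoring
search = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# 4. Read off the best hyperparameters and their mean cross-validated accuracy
best_params = search.best_params_
best_accuracy = search.best_score_
print(best_params, round(best_accuracy, 3))
```

Repeating this for each algorithm gives the per-algorithm rows, and the highest `best_score_` among them gives the overall winner.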

#### Your final output will be something like this:
- Best algorithm accuracy
- Best accuracy and hyperparameters for every algorithm

**Table 1 (Algorithm wise best model with best hyperparameter)**

| Algorithm | Accuracy | Hyperparameters |
|-----------|----------|-----------------|
| DT        |          |                 |
| KNN       |          |                 |
| LR        |          |                 |
| SVM       |          |                 |
| RF        |          |                 |
| Any other |          |                 |

**Table 2 (Best overall)**

| Algorithm | Accuracy | Hyperparameters |
|-----------|----------|-----------------|
|           |          |                 |



### Submission
- Submit the notebook with all cells run and their outputs saved
- Submit a document containing the two tables above

In [44]:
import pandas as pd
import numpy as np

train = pd.read_csv('https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_test.csv')

In [45]:
train.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [46]:
test.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,LP001015,Male,Yes,0,Graduate,No,5720,0,110.0,360.0,1.0,Urban
1,LP001022,Male,Yes,1,Graduate,No,3076,1500,126.0,360.0,1.0,Urban
2,LP001031,Male,Yes,2,Graduate,No,5000,1800,208.0,360.0,1.0,Urban
3,LP001035,Male,Yes,2,Graduate,No,2340,2546,100.0,360.0,,Urban
4,LP001051,Male,No,0,Not Graduate,No,3276,0,78.0,360.0,1.0,Urban


In [47]:
train.shape

(614, 13)

In [48]:
test.shape

(367, 12)

In [49]:
train = train.dropna()
test = test.dropna()
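
Dropping every row that has a missing value shrinks the training set and leaves the dropped test rows without a prediction. One alternative is to impute the gaps instead. The sketch below shows the idea on a toy frame; the values are made up, and filling with the mode or median is only one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as the loan data (values are made up)
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, 66.0, 120.0],
})

# Dropping: any row containing a missing value is lost
dropped = toy.dropna()

# Imputing: categorical gaps get the most frequent value, numeric gaps the median
imputed = toy.copy()
imputed["Gender"] = imputed["Gender"].fillna(imputed["Gender"].mode()[0])
imputed["LoanAmount"] = imputed["LoanAmount"].fillna(imputed["LoanAmount"].median())

print(len(dropped), len(imputed))
```

On real data, the fill values should be computed from the training set and reused for the test set, so that no information from the test rows leaks into preprocessing.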

In [50]:
y_train = train['Loan_Status']
X_train = train.drop('Loan_Status', axis=1)
X_test = test

In [51]:
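from sklearn.preprocessing import OneHotEncoder

# Split columns by dtype. Note that Loan_ID is an object column, so it is
# one-hot encoded along with the genuine categorical features.
categorical_cols = [col for col in X_train.columns if X_train[col].dtype == 'object']
numerical_cols = [col for col in X_train.columns if X_train[col].dtype in ['int64', 'float64']]

# Fit the encoder on the training data only; categories unseen in training are encoded as all zeros.
# `sparse_output` replaces the deprecated `sparse` argument (scikit-learn >= 1.2).
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
ohe.fit(X_train[categorical_cols])
X_train_ohe = ohe.transform(X_train[categorical_cols])
X_test_ohe = ohe.transform(X_test[categorical_cols])

# Append the numeric columns to the encoded categorical columns
X_train_final = np.concatenate([X_train_ohe, X_train[numerical_cols]], axis=1)
X_test_final = np.concatenate([X_test_ohe, X_test[numerical_cols]], axis=1)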
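
The same encode-and-concatenate step can also be written with a `ColumnTransformer`, which makes it easy to leave an identifier such as `Loan_ID` out of the features. This is a sketch on toy rows with made-up values, not a drop-in replacement for the cell above; the column lists are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows shaped like the loan data (values are illustrative only)
toy_train = pd.DataFrame({
    "Loan_ID": ["LP1", "LP2", "LP3"],
    "Gender": ["Male", "Female", "Male"],
    "Property_Area": ["Urban", "Rural", "Urban"],
    "ApplicantIncome": [5849, 4583, 3000],
})
toy_test = pd.DataFrame({
    "Loan_ID": ["LP9"],
    "Gender": ["Male"],
    "Property_Area": ["Semiurban"],  # category not seen during fit
    "ApplicantIncome": [5720],
})

categorical = ["Gender", "Property_Area"]  # Loan_ID deliberately left out
numerical = ["ApplicantIncome"]

pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
     ("num", "passthrough", numerical)],
    remainder="drop",      # columns not listed above (here Loan_ID) are dropped
    sparse_threshold=0,    # always return a dense array
)
train_arr = pre.fit_transform(toy_train)
test_arr = pre.transform(toy_test)
print(train_arr.shape, test_arr.shape)
```

The unseen `Semiurban` category is encoded as all zeros in its one-hot block, which is the behavior `handle_unknown="ignore"` provides.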



In [52]:
# Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
DT_model = DecisionTreeClassifier()
DT_model.fit(X_train_final, y_train)
y_pred = DT_model.predict(X_test_final)

In [53]:
# KNN Classifier
from sklearn.neighbors import KNeighborsClassifier
KNN_model = KNeighborsClassifier()
KNN_model.fit(X_train_final, y_train)
y_pred = KNN_model.predict(X_test_final)

In [54]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
LR_model = LogisticRegression()
LR_model.fit(X_train_final, y_train)
y_pred = LR_model.predict(X_test_final)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
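
The warning above suggests two remedies: scale the features or raise `max_iter`. A minimal sketch of both, assuming synthetic data with deliberately mismatched feature scales as a stand-in for the loan features:

```python
import warnings

import numpy as np
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, rescaled so the columns span very different ranges
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X = X * np.array([1, 10, 100, 1_000, 10_000, 100_000])

# Standardize the features, then fit with a more generous iteration budget
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

with warnings.catch_warnings():
    # Turn a convergence warning into an error so it cannot pass unnoticed
    warnings.simplefilter("error", ConvergenceWarning)
    pipe.fit(X, y)

# Number of solver iterations actually used
n_iter = pipe.named_steps["logisticregression"].n_iter_[0]
print(n_iter)
```

Putting the scaler inside the pipeline also means it is refit on each training fold when the pipeline is passed to cross-validation or `GridSearchCV`.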


In [55]:
# SVM Classifier
from sklearn.svm import SVC
SVM_model = SVC()
SVM_model.fit(X_train_final, y_train)
y_pred = SVM_model.predict(X_test_final)

In [56]:
# Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
RF_model = RandomForestClassifier()
RF_model.fit(X_train_final, y_train)
y_pred = RF_model.predict(X_test_final)

In [57]:
# Stochastic Gradient Descent Classifier
from sklearn.linear_model import SGDClassifier
SGD_model = SGDClassifier()
SGD_model.fit(X_train_final, y_train)
y_pred = SGD_model.predict(X_test_final)
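
The test file has no `Loan_Status` column, so the predictions stored in `y_pred` cannot be scored directly. One way to get a baseline accuracy for the untuned models is to cross-validate on the training data. A minimal sketch, using synthetic stand-in data and only two of the models for brevity:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the loan dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# The same kind of splitter the grid search uses: stratified, shuffled, seeded
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

baselines = {
    "DT": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
}

# Mean accuracy over the five folds for each default-parameter model
baseline_scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in baselines.items()
}
print(baseline_scores)
```

These numbers give a reference point for judging how much the tuned hyperparameters actually improve each algorithm.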

In [58]:
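# Define a model and parameter grid per algorithm, then run GridSearchCV on each
from sklearn.model_selection import GridSearchCV, StratifiedKFold
def find_best_model_using_gridsearchcv(X, y):
    """Return a DataFrame with the best CV score and parameters for each algorithm."""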
    algos = {
        'KNN': {
            'model': KNeighborsClassifier(),
            'params': {
                'n_neighbors': [3,5,7,9,11,13,15,17,19,21,23,25],
                'weights': ['uniform', 'distance'],
                'algorithm': ['auto', 'ball_tree','kd_tree','brute'],
                'leaf_size': [10,20,30,40,50],
                'p': [1,2]
            }
        },
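        'Logistic': {
            'model': LogisticRegression(),
            'params': {
                'max_iter' : [50, 100,200,500],
                # 'l1' and 'elasticnet' are not supported by the default lbfgs solver,
                # so those fits fail and are scored as NaN.
                # None replaces the deprecated 'none' string (scikit-learn >= 1.2).
                'penalty' : ['l1', 'l2', 'elasticnet', None],
                'C' : [1,2,3,4,5]
            }
        },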
        'decision_tree': {
            'model': DecisionTreeClassifier(),
            'params': {
                'criterion' : ['gini','entropy', 'log_loss'],
                'splitter': ['best','random'],
                'max_depth': [2,4,6,8,10,12,14,16,18,20],
                'min_samples_leaf': [1,2,3,4,5],
                'min_samples_split': [2,3,4,5]
            }
        },
        'random_forest' : {
            'model': RandomForestClassifier(),
            'params': {
                'criterion' : ['gini', 'entropy'],
                'n_estimators' : [50, 100, 200],
                'max_depth' : [4,6,8,10,12,14,16,18,20],
                'min_samples_leaf' : [1,2,3,4,5],
                'min_samples_split': [2,3,4,5]
            }
        },
        'SGD' : {
            'model': SGDClassifier(),
            'params': {
                'loss': ['hinge', 'log_loss', 'modified_huber'],
                'penalty': [None, 'l1', 'l2', 'elasticnet'],
                'alpha': [0.0001, 0.001, 0.01, 0.1],
                'learning_rate': ['constant', 'optimal', 'invscaling', 'adaptive'],
                'eta0': [0.01, 0.1, 1, 10],
                'power_t': [0.5, 1],
                'class_weight': [None, 'balanced']
            }
        },
        'SVM' : {
            'model': SVC(),
            'params': {
                'C': [1, 2, 3],
                'kernel': ['linear', 'poly', 'rbf', 'sigmoid']
            }
        }
    }
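    scores = []
    # Shuffled, stratified 5-fold CV with a fixed seed, shared by every algorithm
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for algo_name, config in algos.items():
        # With no `scoring` argument, classifiers are scored by accuracy
        gs = GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)
        gs.fit(X, y)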
        scores.append({
            'model': algo_name,
            'best_score': gs.best_score_,
            'best_params': gs.best_params_
        })

    return pd.DataFrame(scores,columns=['model','best_score','best_params'])

df1 = find_best_model_using_gridsearchcv(X_train_final, y_train)
             

Traceback (most recent call last):
  File "c:\Users\yatch\poetry-demo\.venv\Lib\site-packages\sklearn\model_selection\_validation.py", line 813, in _score
    scores = scorer(estimator, X_test, y_test)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\yatch\poetry-demo\.venv\Lib\site-packages\sklearn\metrics\_scorer.py", line 527, in __call__
    return estimator.score(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\yatch\poetry-demo\.venv\Lib\site-packages\sklearn\base.py", line 705, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
                             ^^^^^^^^^^^^^^^
  File "c:\Users\yatch\poetry-demo\.venv\Lib\site-packages\sklearn\neighbors\_classification.py", line 249, in predict
    probabilities = self.predict_proba(X)
                    ^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\yatch\poetry-demo\.venv\Lib\site-packages\sklearn\neighbors\_classification.py", line 327, in predict_proba
    proba
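
In the `Logistic` grid above, every penalty is paired with the default `lbfgs` solver, which does not support `'l1'` or `'elasticnet'`; by default GridSearchCV records those failed fits as NaN and carries on. One way to search only valid combinations is to pass a list of grids, each tied to a solver. A minimal sketch on synthetic stand-in data, with assumed `C` values:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; this is not the loan dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Each dict is searched separately, so only supported pairs are ever fitted
param_grid = [
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [0.1, 1, 10]},
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.1, 1, 10]},
]

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    error_score="raise",  # fail loudly instead of silently scoring NaN
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print(n_candidates, search.best_params_)
```

Setting `error_score="raise"` is optional, but it surfaces an invalid combination immediately rather than leaving it hidden among the results.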

In [60]:
df1

Unnamed: 0,model,best_score,best_params
0,KNN,0.69375,"{'algorithm': 'auto', 'leaf_size': 10, 'n_neig..."
1,Logistic,0.80625,"{'C': 1, 'max_iter': 200, 'penalty': 'l2'}"
2,decision_tree,0.8125,"{'criterion': 'gini', 'max_depth': 4, 'min_sam..."
3,random_forest,0.804167,"{'criterion': 'gini', 'max_depth': 20, 'min_sa..."
4,SGD,0.695833,"{'alpha': 0.0001, 'class_weight': None, 'eta0'..."
5,SVM,0.785417,"{'C': 1, 'kernel': 'linear'}"
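
The default display cuts long `best_params` values off with `...`, which makes them unusable for the hyperparameter column of the results tables. A minimal sketch of rendering them in full; the frame below is a placeholder with made-up scores and parameters, not the actual results:

```python
import pandas as pd

# Placeholder results; scores and hyperparameter values are made up for illustration
results = pd.DataFrame({
    "model": ["KNN", "SVM"],
    "best_score": [0.70, 0.79],
    "best_params": [
        {"algorithm": "auto", "leaf_size": 10, "n_neighbors": 5, "p": 2, "weights": "uniform"},
        {"C": 1, "kernel": "linear"},
    ],
})

# Lift the column-width limit while rendering so no cell is truncated
with pd.option_context("display.max_colwidth", None):
    full_text = results.to_string(index=False)

print(full_text)
```

In the notebook, the same `option_context` block can wrap `display(df1)` to show every parameter dictionary untruncated.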


In [71]:
# Pick the row with the highest cross-validated accuracy (the first one on ties)
best_row = df1.loc[df1['best_score'].idxmax()]
best_algorithm = best_row['model']
best_accuracy = best_row['best_score']
best_hyperparameters = best_row['best_params']
df2 = pd.DataFrame([{'Algorithm': best_algorithm, 'Accuracy': best_accuracy, 'Hyperparameters': best_hyperparameters}])

In [72]:
df2

Unnamed: 0,Algorithm,Accuracy,Hyperparameters
0,decision_tree,0.8125,"{'criterion': 'gini', 'max_depth': 4, 'min_sam..."
