# Task 10: Benchmark Top ML Algorithms

This task tests your ability to use different ML algorithms when solving a specific problem.


### Dataset
Predict loan eligibility for the Dream Housing Finance company.

Dream Housing Finance deals in all kinds of home loans and has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan, and the company then validates the customer's eligibility.

The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate this process, the company has provided a dataset for identifying the customer segments that are eligible for a loan, so that it can specifically target these customers.

Train: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv

Test: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_test.csv

## Task Requirements
### Build classification models using the following ML algorithms
- Decision Tree
- KNN
- Logistic Regression
- SVM
- Random Forest
- Any other algorithm of your choice

### Use GridSearchCV to find the best model with the best hyperparameters

- Build the models
- Create a parameter grid
- Run GridSearchCV
- Choose the best model with the best hyperparameters
- Report the best accuracy
- Also benchmark the best accuracy you could get for every classification algorithm listed above

#### Your final output will be something like this:
- Best hyperparameter accuracy for every algorithm
- Best overall algorithm accuracy


**Table 1 (Algorithm-wise best model with best hyperparameters)**

| Algorithm | Accuracy | Hyperparameters |
|-----------|----------|-----------------|
| DT        |          |                 |
| KNN       |          |                 |
| LR        |          |                 |
| SVM       |          |                 |
| RF        |          |                 |
| Any other |          |                 |

**Table 2 (Best overall)**

| Algorithm | Accuracy | Hyperparameters |
|-----------|----------|-----------------|
|           |          |                 |



### Submission
- Submit the notebook containing all executed code with its outputs
- Submit a document with the two tables above

In [1]:
import pandas as pd
import numpy as np

In [2]:
url = "https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_test.csv"

In [3]:
data = pd.read_csv(url)
data

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area
0,LP001015,Male,Yes,0,Graduate,No,5720,0,110.0,360.0,1.0,Urban
1,LP001022,Male,Yes,1,Graduate,No,3076,1500,126.0,360.0,1.0,Urban
2,LP001031,Male,Yes,2,Graduate,No,5000,1800,208.0,360.0,1.0,Urban
3,LP001035,Male,Yes,2,Graduate,No,2340,2546,100.0,360.0,,Urban
4,LP001051,Male,No,0,Not Graduate,No,3276,0,78.0,360.0,1.0,Urban
...,...,...,...,...,...,...,...,...,...,...,...,...
362,LP002971,Male,Yes,3+,Not Graduate,Yes,4009,1777,113.0,360.0,1.0,Urban
363,LP002975,Male,Yes,0,Graduate,No,4158,709,115.0,360.0,1.0,Urban
364,LP002980,Male,No,0,Graduate,No,3250,1993,126.0,360.0,,Semiurban
365,LP002986,Male,Yes,0,Graduate,No,5000,2393,158.0,360.0,1.0,Rural


In [4]:
data.columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area'],
      dtype='object')

In [5]:
train_data = "https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv"

In [6]:
train_df = pd.read_csv(train_data)

In [7]:
train_df.columns

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
       'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

In [8]:
train_df

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
609,LP002978,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,LP002979,Male,Yes,3+,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,LP002983,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,LP002984,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y


In [9]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [10]:
len(train_df['Loan_ID'].unique())

614

In [11]:
len(train_df['ApplicantIncome'].unique())

505

## Dropping columns with all unique values

In [33]:
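# Loan_ID is unique for every row (614 distinct values), so it carries no predictive signal.
# This cell also defines test_df, which the encoding cells below rely on.
train_df = train_df.drop('Loan_ID', axis=1)
test_df = data.drop('Loan_ID', axis=1)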

In [34]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()

In [30]:
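# LabelEncoder raises "TypeError: Encoders require their input to be uniformly strings or numbers"
# on columns that mix NaN (float) with strings, so use pandas category codes instead
# and map the missing-value code (-1) back to NaN for the imputer.
train_df['Gender'] = train_df['Gender'].astype('category').cat.codes.replace(-1, np.nan)
test_df['Gender'] = test_df['Gender'].astype('category').cat.codes.replace(-1, np.nan)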

In [None]:
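# Married contains NaN, so encode with category codes and keep missing values as NaN.
train_df['Married'] = train_df['Married'].astype('category').cat.codes.replace(-1, np.nan)
test_df['Married'] = test_df['Married'].astype('category').cat.codes.replace(-1, np.nan)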

In [None]:
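# Dependents contains NaN, so encode with category codes and keep missing values as NaN.
train_df['Dependents'] = train_df['Dependents'].astype('category').cat.codes.replace(-1, np.nan)
test_df['Dependents'] = test_df['Dependents'].astype('category').cat.codes.replace(-1, np.nan)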

In [None]:
train_df['Education']=label_encoder.fit_transform(train_df['Education'])
test_df['Education']=label_encoder.fit_transform(test_df['Education'])

In [None]:
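# Self_Employed contains NaN, so encode with category codes and keep missing values as NaN.
train_df['Self_Employed'] = train_df['Self_Employed'].astype('category').cat.codes.replace(-1, np.nan)
test_df['Self_Employed'] = test_df['Self_Employed'].astype('category').cat.codes.replace(-1, np.nan)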

In [None]:
train_df['Property_Area']=label_encoder.fit_transform(train_df['Property_Area'])
test_df['Property_Area']=label_encoder.fit_transform(test_df['Property_Area'])

## Checking for missing values

In [None]:
train_df.isna().sum()

## Handling missing values using IterativeImputer

In [None]:
train_df['Loan_Status'].value_counts()

In [None]:
X=train_df.drop('Loan_Status', axis=1)
y=train_df['Loan_Status']

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [None]:
imputer = IterativeImputer(max_iter=10, random_state=42)

In [None]:
imputer.fit(X)

In [None]:
X_transform = imputer.transform(X)

## Splitting the training dataset into train and test sets

The provided test dataset is not used here: it has no target column (Loan_Status), so it cannot be used to measure the accuracy of the models.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
xtrain, xtest, ytrain, ytest = train_test_split(X_transform, y, test_size=0.3, random_state=1)
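
The imputer above is fitted on every row of X before the split, so the held-out rows influence the imputed training values. A leakage-free variant splits first and fits the imputer on the training portion only. The sketch below shows that ordering on synthetic data; `X_demo`, `y_demo`, and the 10% missing rate are assumptions made for the illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split

# Synthetic numeric data with roughly 10% of the entries knocked out.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan
y_demo = rng.integers(0, 2, size=100)

# Split first, so the held-out rows never influence the imputation model.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

imputer = IterativeImputer(max_iter=10, random_state=42)
X_tr_filled = imputer.fit_transform(X_tr)  # learn the imputation model on train only
X_te_filled = imputer.transform(X_te)      # apply the same fitted model to the held-out rows
```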

## Model building


In [None]:
from sklearn.ensemble import RandomForestClassifier as rfc
from sklearn.neighbors import KNeighborsClassifier as kn
from sklearn.linear_model import LogisticRegression as lrc
from sklearn.svm import SVC as svc
from sklearn.tree import DecisionTreeClassifier as dtc

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
def dt(xtrain,xtest,ytrain,ytest):
    param_grid = {'criterion': ['entropy','gini'], 'splitter': ['random','best'], 'min_samples_split': [2,5]}
    grid=GridSearchCV(dtc(random_state=42),param_grid,refit=True,verbose=3)
    grid.fit(xtrain,ytrain)
    print('best parameters for dt: ',grid.best_params_)
    ypredict=grid.predict(xtest)
    accuracy_Score = accuracy_score(ytest,ypredict)
    print('accuracy score for dt', accuracy_Score*100)
    return accuracy_Score

In [None]:
def knn(xtrain,xtest,ytrain,ytest):
    param_grid = {'leaf_size': list(range(1,50)), 'n_neighbors': list(range(1,30)), 'p': [1,2]}
    grid=GridSearchCV(kn(),param_grid,refit=True,verbose=3)
    grid.fit(xtrain,ytrain)
    print('best parameters for knn: ',grid.best_params_)
    ypredict=grid.predict(xtest)
    accuracy_Score = accuracy_score(ytest,ypredict)
    print('accuracy score for knn', accuracy_Score*100)
    return accuracy_Score
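
KNN is distance based, so features on very different scales (for example incomes in the thousands next to 0/1 flags) can dominate the neighbour search. In addition, leaf_size mainly affects the speed and memory of the tree lookup rather than the predictions, so searching 49 values of it multiplies the run time for little gain. A possible variant, sketched on synthetic data, scales inside a Pipeline so that the scaler is refitted on each cross-validation fold. The step names `scale` and `knn` and the grid values are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan features.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler.
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {"knn__n_neighbors": [3, 5, 7, 9], "knn__p": [1, 2]}

grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_tr, y_tr)

knn_test_accuracy = grid.score(X_te, y_te)
print("best parameters:", grid.best_params_)
print("held-out accuracy:", knn_test_accuracy)
```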

In [None]:
def lr(xtrain,xtest,ytrain,ytest):
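    param_grid={"C":np.logspace(-3,3,7), "penalty":["l1","l2"]}
    # The default lbfgs solver does not support the l1 penalty; liblinear handles both l1 and l2.
    grid=GridSearchCV(lrc(random_state=42, solver='liblinear'),param_grid,refit=True,verbose=3)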
    grid.fit(xtrain,ytrain)
    print('best parameters for lr: ',grid.best_params_)
    ypredict=grid.predict(xtest)
    accuracy_Score = accuracy_score(ytest,ypredict)
    print('accuracy score for lr', accuracy_Score*100)
    return accuracy_Score

In [None]:
def svm(xtrain,xtest,ytrain,ytest):
    param_grid = {'C': [100], 'gamma': [0.001], 'kernel': ['linear']}
    grid=GridSearchCV(svc(random_state=42),param_grid,refit=True,verbose=3)
    grid.fit(xtrain,ytrain)
    print('best parameters for svm: ',grid.best_params_)
    ypredict=grid.predict(xtest)
    accuracy_Score = accuracy_score(ytest,ypredict)
    print('accuracy score for svm', accuracy_Score*100)
    return accuracy_Score

In [None]:
def rf(xtrain,xtest,ytrain,ytest):
    # 'auto' was an alias for 'sqrt' in classifiers and has been removed from recent scikit-learn releases.
    param_grid = {
        'n_estimators': [200, 500],
        'max_features': ['sqrt', 'log2'],
        'max_depth': [4, 5, 6, 7, 8],
        'criterion': ['gini', 'entropy']}
    grid=GridSearchCV(rfc(random_state=42),param_grid,refit=True,verbose=3)
    grid.fit(xtrain,ytrain)
    print('best parameters for rf: ',grid.best_params_)
    ypredict=grid.predict(xtest)
    accuracy_Score = accuracy_score(ytest,ypredict)
    print('accuracy score for rf', accuracy_Score*100)
    return accuracy_Score
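
The five functions above share the same body and differ only in the estimator and the parameter grid. One way to consolidate them, shown here as a sketch on synthetic data, is a single helper that returns the best hyperparameters together with the accuracy, which is what the task's tables ask for. The helper name `run_grid_search`, the record keys, and the small grids are assumptions for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier


def run_grid_search(name, estimator, param_grid, xtrain, xtest, ytrain, ytest):
    """Tune one estimator with GridSearchCV and summarize the result as a dict."""
    grid = GridSearchCV(estimator, param_grid, refit=True, cv=5)
    grid.fit(xtrain, ytrain)
    return {
        "Algorithm": name,
        "Accuracy": accuracy_score(ytest, grid.predict(xtest)),  # held-out accuracy
        "CV accuracy": grid.best_score_,                         # mean cross-validated accuracy
        "Hyperparameters": grid.best_params_,
    }


# Synthetic stand-in for xtrain, xtest, ytrain, ytest.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# (name, estimator, grid) triples; the grids are deliberately small for the demo.
models = [
    ("DT", DecisionTreeClassifier(random_state=42),
     {"criterion": ["entropy", "gini"], "min_samples_split": [2, 5]}),
    ("LR", LogisticRegression(random_state=42, max_iter=1000),
     {"C": [0.1, 1.0, 10.0]}),
]

records = [run_grid_search(name, est, grid, X_tr, X_te, y_tr, y_te)
           for name, est, grid in models]
for record in records:
    print(record)
```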

In [None]:
param=[]

In [None]:
param.append(dt(xtrain,xtest,ytrain,ytest))

In [None]:
param.append(knn(xtrain,xtest,ytrain,ytest))

In [None]:
param.append(lr(xtrain,xtest,ytrain,ytest))

In [None]:
param.append(rf(xtrain,xtest,ytrain,ytest))

In [None]:
param.append(svm(xtrain,xtest,ytrain,ytest))

In [None]:
print('Optimised accuracies of decision tree, KNN, logistic regression, Random forest and SVM are: \n', param)
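
The list param holds only accuracy values, while Table 1 and Table 2 also need the algorithm name and its best hyperparameters. Assuming each run is summarized as a dict with the keys Algorithm, Accuracy, and Hyperparameters (a hypothetical record layout, not something the notebook produces yet), the two tables could be assembled roughly as follows. The records in the demo are placeholders, not results from this notebook.

```python
import pandas as pd


def build_tables(records):
    """Return (table1, table2): all algorithms sorted by accuracy, and the best one."""
    table1 = pd.DataFrame(records, columns=["Algorithm", "Accuracy", "Hyperparameters"])
    table1 = table1.sort_values("Accuracy", ascending=False).reset_index(drop=True)
    table2 = table1.head(1)  # best overall algorithm
    return table1, table2


# Placeholder records to show the expected shape; these are NOT measured results.
placeholder_records = [
    {"Algorithm": "AlgoA", "Accuracy": 0.50, "Hyperparameters": {"depth": 1}},
    {"Algorithm": "AlgoB", "Accuracy": 0.75, "Hyperparameters": {"k": 3}},
]

table1, table2 = build_tables(placeholder_records)
print(table1.to_string(index=False))
print(table2.to_string(index=False))
```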