# Task 10: Benchmark Top ML Algorithms

This task tests your ability to use different ML algorithms when solving a specific problem.


### Dataset
Predict Loan Eligibility for Dream Housing Finance company

Dream Housing Finance company deals in all kinds of home loans and has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan, after which the company validates the customer's eligibility.

The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling in the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate this process, the company has provided a dataset for identifying the customer segments that are eligible for a loan, so that it can specifically target these customers.

Train: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv

Test: https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_test.csv

## Task Requirements
### Build classification models using the following ML algorithms
- Decision Tree
- KNN
- Logistic Regression
- SVM
- Random Forest
- Any other algorithm of your choice

### Use GridSearchCV for finding the best model with the best hyperparameters

- ### Build models
- ### Create Parameter Grid
- ### Run GridSearchCV
- ### Choose the best model with the best hyperparameter
- ### Give the best accuracy
- ### Also, benchmark the best accuracy that you could get for every classification algorithm asked above
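
The steps listed above can be sketched as one loop over (estimator, parameter grid) pairs. This is a minimal illustration under stated assumptions, not the required solution: it uses synthetic data from `make_classification` instead of the loan dataset, only three of the algorithms, and small illustrative grids. The names `search_spaces` and `benchmark_df` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data (assumption: not the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# One (estimator, parameter grid) pair per algorithm; the grids are illustrative only
search_spaces = {
    "DT": (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4, 8]}),
    "KNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 11]}),
    "LR": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
}

rows = []
for name, (estimator, grid) in search_spaces.items():
    # Run the grid search with 5-fold cross-validation on the training split
    search = GridSearchCV(estimator, param_grid=grid, scoring="accuracy", cv=5)
    search.fit(X_tr, y_tr)
    rows.append({
        "Algorithm": name,
        "CV accuracy": search.best_score_,                # mean CV score of the best setting
        "Validation accuracy": search.score(X_va, y_va),  # refit best model on held-out data
        "Hyperparameters": search.best_params_,
    })

# One row per algorithm, best cross-validated accuracy first
benchmark_df = pd.DataFrame(rows).sort_values("CV accuracy", ascending=False)
print(benchmark_df)
```

The first row of `benchmark_df` would correspond to the "best overall" table, and the full frame to the per-algorithm table.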

#### Your final output will be something like this:
- Best algorithm accuracy
- Best hyperparameter accuracy for every algorithm

**Table 1 (Algorithm wise best model with best hyperparameter)**

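| Algorithm | Accuracy | Hyperparameters |
|-----------|----------|-----------------|
| DT        |          |                 |
| KNN       |          |                 |
| LR        |          |                 |
| SVM       |          |                 |
| RF        |          |                 |
| Any other |          |                 |

**Table 2 (Best overall)**

| Algorithm | Accuracy | Hyperparameters |
|-----------|----------|-----------------|
|           |          |                 |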



### Submission
- Submit the notebook with all code executed and its outputs saved
- Submit a document containing the two tables above

In [452]:
import pandas as pd
import math

from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

from sklearn.impute import IterativeImputer

import plotly.express as px
import plotly.graph_objects as go

In [453]:
train_data = pd.read_csv("https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_train.csv")
X_test = pd.read_csv("https://raw.githubusercontent.com/subashgandyer/datasets/main/loan_test.csv")

In [454]:
print(train_data.shape)

(614, 13)


In [455]:
print(X_test.shape)

(367, 11)


In [456]:
train_data.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [457]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


In [458]:
train_data.drop(["Loan_ID"], axis=1, inplace=True)
X_test.drop(["Loan_ID"], axis=1, inplace=True)

In [459]:
train_data.Dependents.value_counts()

0     345
1     102
2     101
3+     51
Name: Dependents, dtype: int64

In [460]:
train_data.loc[ train_data["Dependents"] == "3+", "Dependents"] = 3
X_test.loc[X_test["Dependents"] == "3+", "Dependents"] = 3

# train_data.loc[train_data["Loan_Status"] == "Y", "Loan_Status"] = 1
# train_data.loc[train_data["Loan_Status"] == "N", "Loan_Status"] = 0

In [461]:
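def print_stratified_percentages(data):
    # Print each class's share of the rows; math.ceil rounds up, so the shares can total slightly over 100%
    classes = data.value_counts()
    for class_ in classes.keys():
        print(f"Class percentage: {class_} - ", f"{math.ceil((classes[class_] / data.shape[0])*100)}%")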
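
As a possible alternative to the helper above, `value_counts(normalize=True)` returns the class fractions directly, and rounding to one decimal avoids the upward bias of `math.ceil`. The sketch below is self-contained; the label counts are hypothetical, chosen only so that `math.ceil` would give 69% and 32%.

```python
import pandas as pd

# Hypothetical label counts (assumption), chosen so that ceil would print 69% / 32%
labels = pd.Series(["Y"] * 422 + ["N"] * 192, name="Loan_Status")

# normalize=True yields fractions; multiply by 100 and round to one decimal place
percentages = (labels.value_counts(normalize=True) * 100).round(1)

for class_, pct in percentages.items():
    print(f"Class percentage: {class_} - {pct}%")
```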

In [462]:
train_data

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
...,...,...,...,...,...,...,...,...,...,...,...,...
609,Female,No,0,Graduate,No,2900,0.0,71.0,360.0,1.0,Rural,Y
610,Male,Yes,3,Graduate,No,4106,0.0,40.0,180.0,1.0,Rural,Y
611,Male,Yes,1,Graduate,No,8072,240.0,253.0,360.0,1.0,Urban,Y
612,Male,Yes,2,Graduate,No,7583,0.0,187.0,360.0,1.0,Urban,Y


In [463]:

X = train_data.drop(["Loan_Status"], axis=1)
y = train_data["Loan_Status"]


X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Train shape: ", X_train.shape)
print("Val shape: ", X_val.shape)

Train shape:  (491, 11)
Val shape:  (123, 11)


In [464]:
print("For Original :")
print_stratified_percentages(train_data.Loan_Status)

print("\nAfter split - For Training :")
print_stratified_percentages(y_train)

print("\nAfter split - For validation :")
print_stratified_percentages(y_val)

For Original :
Class percentage: Y -  69%
Class percentage: N -  32%

After split - For Training :
Class percentage: Y -  69%
Class percentage: N -  32%

After split - For validation :
Class percentage: Y -  70%
Class percentage: N -  31%


## Encoding

In [465]:
def encode_categorical_variable(df):
    categorical_df = df.select_dtypes(["object"])
    categorical_df_encoded = pd.get_dummies(categorical_df)
    return pd.concat([df.drop(categorical_df.columns, axis=1), categorical_df_encoded], axis=1)

In [466]:
X_train_encoded = encode_categorical_variable(X_train)
X_val_encoded = encode_categorical_variable(X_val)
X_test_encoded = encode_categorical_variable(X_test)

In [467]:
print(X_train_encoded.shape, X_val_encoded.shape, X_test_encoded.shape)

(491, 20) (123, 20) (367, 20)
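
The three splits happen to produce the same 20 columns here, but encoding each split separately with `pd.get_dummies` gives different columns whenever a category is missing from one split. A minimal sketch of one way to guard against that, using small hypothetical frames (`train`, `val`) rather than the notebook's data:

```python
import pandas as pd

# Hypothetical toy splits: the validation split lacks "Rural" and "Semiurban"
train = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban"],
                      "ApplicantIncome": [5849, 4583, 3000]})
val = pd.DataFrame({"Property_Area": ["Urban", "Urban"],
                    "ApplicantIncome": [2583, 6000]})

train_encoded = pd.get_dummies(train)
val_encoded = pd.get_dummies(val)  # only 2 columns: income and Property_Area_Urban

# Reindex against the training columns; categories absent from val become all-zero columns
val_aligned = val_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(train_encoded.shape, val_encoded.shape, val_aligned.shape)
```

Reindexing also puts the columns in the same order as the training frame, which scikit-learn transformers fitted on a DataFrame expect.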


## Missing values 

In [468]:
imp_mean = IterativeImputer(random_state=0)

# Fit the imputer on the training split only, then reuse it for validation and test
X_train_imputed = pd.DataFrame(imp_mean.fit_transform(X_train_encoded), columns=X_train_encoded.columns)
X_val_imputed = pd.DataFrame(imp_mean.transform(X_val_encoded), columns=X_val_encoded.columns)
X_test_imputed = pd.DataFrame(imp_mean.transform(X_test_encoded), columns=X_test_encoded.columns)

## Scaling

In [469]:
scaler = MinMaxScaler()

X_train_scaled = scaler.fit_transform(X_train_imputed)

X_val_scaled = scaler.transform(X_val_imputed)
X_test_scaled = scaler.transform(X_test_imputed)
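
Imputation and scaling that are fitted on the training split and only applied to the other splits can also be bundled into a single `Pipeline`. The sketch below is one way to do that, assuming random stand-in arrays (`X_fit`, `X_new`) instead of the loan features; the name `preprocess` is hypothetical.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must precede IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Random stand-in data with a few missing values (assumption: not the loan data)
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(50, 3))
X_new = rng.normal(size=(10, 3))
X_fit[::7, 1] = np.nan
X_new[::4, 2] = np.nan

preprocess = Pipeline([
    ("imputer", IterativeImputer(random_state=0)),
    ("scaler", MinMaxScaler()),
])

# Statistics are learned from X_fit only; X_new is transformed with those statistics
X_fit_ready = preprocess.fit_transform(X_fit)
X_new_ready = preprocess.transform(X_new)
```

Because the scaler's range comes from the fitted split, values in `X_new_ready` may fall slightly outside [0, 1].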

In [470]:
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train_imputed.columns)
X_val_scaled_df = pd.DataFrame(X_val_scaled, columns=X_val_imputed.columns)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_test_imputed.columns)

In [471]:
y_train = y_train.reset_index().Loan_Status
y_train

0      Y
1      Y
2      N
3      N
4      Y
      ..
486    Y
487    Y
488    Y
489    Y
490    Y
Name: Loan_Status, Length: 491, dtype: object

In [472]:
X_train_imputed

Unnamed: 0,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Gender_Female,Gender_Male,Married_No,Married_Yes,Dependents_3,Dependents_0,Dependents_1,Dependents_2,Education_Graduate,Education_Not Graduate,Self_Employed_No,Self_Employed_Yes,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban
0,3254.0,0.0,50.0,360.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0
1,3315.0,0.0,96.0,360.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
2,3340.0,1710.0,150.0,360.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
3,2653.0,1500.0,113.0,180.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0
4,2620.0,2223.0,150.0,360.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
486,2971.0,2791.0,144.0,360.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
487,2625.0,6250.0,187.0,360.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
488,2799.0,2253.0,122.0,360.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
489,2484.0,2302.0,137.0,360.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0


## Feature Selection

In [337]:


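import numpy as np
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression


def get_corr_support(selected_features):
    # Boolean mask over the global `feature_name`: True where the feature was selected
    selected = set(selected_features)
    return [feature in selected for feature in feature_name]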


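def cor_selector(X, y, num_feats):
    # Pearson correlation needs a numeric target, so map Y/N labels to 1/0 when necessary
    y_numeric = y if pd.api.types.is_numeric_dtype(y) else y.map({"Y": 1, "N": 0})
    pearson_cor = pd.concat([X, y_numeric.rename("Loan_Status")], axis=1).corr()["Loan_Status"].to_dict()
    del pearson_cor["Loan_Status"]

    # Rank by absolute correlation so that strong negative relationships count too
    sorted_features_with_values = sorted(pearson_cor.items(), key=lambda x: abs(x[1]), reverse=True)[:num_feats]
    selected_features = [sor[0] for sor in sorted_features_with_values]
    return get_corr_support(selected_features), selected_features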


def chi_squared_selector(X, y, num_feats):
    # chi2 requires non-negative feature values (true for MinMax-scaled inputs)
    chi2_features = SelectKBest(chi2, k=num_feats)
    chi2_features.fit(X, y)
    selected_features = list(chi2_features.get_feature_names_out())
    return get_corr_support(selected_features), selected_features


def rfe_selector(X, y, num_feats):
    # Recursive feature elimination wrapped around logistic regression
    estimator = LogisticRegression()
    selector = RFE(estimator, n_features_to_select=num_feats, step=10, verbose=True)
    selector = selector.fit(X, y)
    return selector.support_, list(selector.get_feature_names_out())


def embedded_log_reg_selector(X, y, num_feats):
    # Keep at most num_feats features whose coefficients pass SelectFromModel's default threshold
    estimator = LogisticRegression()

    model = SelectFromModel(estimator, max_features=num_feats)
    model.fit(X, y)

    return get_corr_support(model.get_feature_names_out()), list(model.get_feature_names_out())


def embedded_rf_selector(X, y, num_feats):
    # Same idea, using random-forest feature importances
    estimator = RandomForestClassifier()

    model = SelectFromModel(estimator, max_features=num_feats)
    model.fit(X, y)

    return get_corr_support(model.get_feature_names_out()), list(model.get_feature_names_out())


def embedded_lgbm_selector(X, y, num_feats):
    # Same idea, using LightGBM feature importances
    estimator = LGBMClassifier()

    model = SelectFromModel(estimator, max_features=num_feats)
    model.fit(X, y)

    return get_corr_support(model.get_feature_names_out()), list(model.get_feature_names_out())


def autoFeatureSelector(X, y, num_feats, methods=None):
    # Parameters
    # X, y - feature matrix and target
    # num_feats - number of features each method is asked to select
    # methods - names of the feature selection methods to run (list)
    # Note: relies on the global `feature_name`, which must hold X's column names
    methods = methods or []

    cor_support, chi_support, rfe_support, embedded_lr_support, embedded_rf_support, embedded_lgbm_support = None, None, None, None, None, None
    # Run every requested method and collect the support mask it returns
    if 'pearson' in methods:
        cor_support, cor_feature = cor_selector(X, y, num_feats)
    if 'chi-square' in methods:
        chi_support, chi_feature = chi_squared_selector(X, y, num_feats)
    if 'rfe' in methods:
        rfe_support, rfe_feature = rfe_selector(X, y, num_feats)
    if 'log-reg' in methods:
        embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)
    if 'rf' in methods:
        embedded_rf_support, embedded_rf_feature = embedded_rf_selector(X, y, num_feats)
    if 'lgbm' in methods:
        embedded_lgbm_support, embedded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)

    # Combine the masks of the methods that actually ran
    supports = {'Pearson': cor_support, 'Chi-2': chi_support, 'RFE': rfe_support,
                'Logistics': embedded_lr_support,
                'Random Forest': embedded_rf_support, 'LightGBM': embedded_lgbm_support}
    supports = {name: support for name, support in supports.items() if support is not None}
    feature_selection_df = pd.DataFrame({'Feature': feature_name, **supports})
    # Count how many methods selected each feature (boolean columns only)
    feature_selection_df['Total'] = feature_selection_df[list(supports)].sum(axis=1)
    # Most-voted features first
    feature_selection_df = feature_selection_df.sort_values(['Total', 'Feature'], ascending=False)
    feature_selection_df.index = range(1, len(feature_selection_df) + 1)

    return feature_selection_df

In [340]:
import pprint

In [342]:
feature_name = X_train_scaled_df.columns
best_features = autoFeatureSelector(
    X_train_scaled_df, y_train, num_feats=20,
    methods=['pearson', 'chi-square', 'rfe', 'log-reg', 'rf', 'lgbm'])

print("Selected best features by multiple methods - ")
best_features

Selected best features by multiple methods - 






Unnamed: 0,Feature,Pearson,Chi-2,RFE,Logistics,Random Forest,LightGBM,Total
1,CoapplicantIncome,True,True,True,True,True,True,6
2,LoanAmount,True,True,True,False,True,True,5
3,Credit_History,True,True,True,True,True,False,5
4,ApplicantIncome,True,True,True,False,True,True,5
5,Property_Area_Semiurban,True,True,True,True,False,False,4
6,Married_No,True,True,True,True,False,False,4
7,Dependents_1,True,True,True,True,False,False,4
8,Self_Employed_Yes,True,True,True,False,False,False,3
9,Self_Employed_No,True,True,True,False,False,False,3
10,Property_Area_Urban,True,True,True,False,False,False,3


We keep the features that received at least 4 of the 6 possible votes.

In [349]:
selected_columns = list(best_features[best_features["Total"] >= 4]["Feature"])

In [350]:
selected_columns

['CoapplicantIncome',
 'LoanAmount',
 'Credit_History',
 'ApplicantIncome',
 'Property_Area_Semiurban',
 'Married_No',
 'Dependents_1']

In [353]:
X_train_selected = X_train_scaled_df[selected_columns]
X_val_selected = X_val_scaled_df[selected_columns]
X_test_selected = X_test_scaled_df[selected_columns]

## Baseline Models

In [351]:
metric_data = []
target_names=y_train.value_counts().keys().to_list()

In [362]:
def get_metrics(model_name, y_val, y_pred):
    # confusion_matrix sorts the labels, so index 0 is "N" and index 1 is "Y" (the positive class)
    cm = confusion_matrix(y_val, y_pred)
    accuracy = accuracy_score(y_val, y_pred)

    precision = cm[1][1] / (cm[1][1] + cm[0][1])  # TP / (TP + FP)
    recall = cm[1][1] / (cm[1][1] + cm[1][0])     # TP / (TP + FN)
    f1 = 2 * (precision * recall) / (precision + recall)

    return {"Model": model_name,
            "Accuracy": accuracy,
            "Precision": precision,
            "Recall": recall,
            "F1_score": f1}

In [217]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
import pprint as pp
In [218]:
estimator = LogisticRegression(max_iter=1000)
estimator.fit(X_train_scaled_df, y_train)

In [219]:
y_pred = estimator.predict(X_val_scaled_df)

In [220]:
metrics = get_metrics("Baseline Logistic Regression", y_val, y_pred)
pp.pprint(metrics)
metric_data.append(metrics)

{'Accuracy': 0.8536585365853658,
 'F1_score': 0.903225806451613,
 'Model': 'Baseline Logistic Regression',
 'Precision': 0.8316831683168316,
 'Recall': 0.9882352941176471}
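
The `get_metrics` helper derives precision, recall, and F1 from the confusion matrix by hand. As a sanity check, the same quantities can be compared against scikit-learn's metric functions with `pos_label="Y"`. The sketch below uses a small hypothetical label vector (`y_true`, `y_hat`), not the validation predictions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels and predictions for illustration only
y_true = ["Y", "Y", "N", "Y", "N", "Y", "N", "Y"]
y_hat = ["Y", "N", "N", "Y", "Y", "Y", "N", "Y"]

# Rows are true labels, columns are predictions; labels are sorted, so index 1 is "Y"
cm = confusion_matrix(y_true, y_hat)
manual_precision = cm[1][1] / (cm[1][1] + cm[0][1])  # TP / (TP + FP)
manual_recall = cm[1][1] / (cm[1][1] + cm[1][0])     # TP / (TP + FN)
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)

# Library equivalents, treating "Y" as the positive class
lib_precision = precision_score(y_true, y_hat, pos_label="Y")
lib_recall = recall_score(y_true, y_hat, pos_label="Y")
lib_f1 = f1_score(y_true, y_hat, pos_label="Y")

print(manual_precision, manual_recall, manual_f1)
```

The library functions also handle the zero-division case, which the manual formulas do not.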


In [221]:
bdt_estimator = DecisionTreeClassifier()
bdt_estimator.fit(X_train_scaled_df, y_train)

In [222]:
y_pred = bdt_estimator.predict(X_val_scaled_df)

In [223]:
metrics = get_metrics("Baseline DecisionTree Classifier", y_val, y_pred)
pp.pprint(metrics)
metric_data.append(metrics)

{'Accuracy': 0.7479674796747967,
 'F1_score': 0.8121212121212121,
 'Model': 'Baseline DecisionTree Classifier',
 'Precision': 0.8375,
 'Recall': 0.788235294117647}


In [224]:
brf_estimator = RandomForestClassifier()
brf_estimator.fit(X_train_scaled_df, y_train)

In [225]:
y_pred = brf_estimator.predict(X_val_scaled_df)

In [226]:
metrics = get_metrics("Baseline RandomForest Classifier", y_val, y_pred)
pp.pprint(metrics)
metric_data.append(metrics)

{'Accuracy': 0.8130081300813008,
 'F1_score': 0.8700564971751413,
 'Model': 'Baseline RandomForest Classifier',
 'Precision': 0.8369565217391305,
 'Recall': 0.9058823529411765}


In [227]:
sgd_estimator = SGDClassifier()
sgd_estimator.fit(X_train_scaled_df, y_train)

In [228]:
y_pred = sgd_estimator.predict(X_val_scaled_df)

In [229]:
metrics = get_metrics("Baseline Stochastic Gradient Descent", y_val, y_pred)
pp.pprint(metrics)
metric_data.append(metrics)

{'Accuracy': 0.7723577235772358,
 'F1_score': 0.8181818181818181,
 'Model': 'Baseline Stochastic Gradient Descent',
 'Precision': 0.9130434782608695,
 'Recall': 0.7411764705882353}


In [230]:
svm_estimator = SVC()
svm_estimator.fit(X_train_scaled_df, y_train)

In [231]:
y_pred = svm_estimator.predict(X_val_scaled_df)

In [232]:
metrics = get_metrics("Baseline Support Vector Machine", y_val, y_pred)
pp.pprint(metrics)
metric_data.append(metrics)

{'Accuracy': 0.8536585365853658,
 'F1_score': 0.903225806451613,
 'Model': 'Baseline Support Vector Machine',
 'Precision': 0.8316831683168316,
 'Recall': 0.9882352941176471}


In [233]:
baseline_df = pd.DataFrame(metric_data)
baseline_df.sort_values("F1_score")

Unnamed: 0,Model,Accuracy,Precision,Recall,F1_score
1,Baseline DecisionTree Classifier,0.747967,0.8375,0.788235,0.812121
3,Baseline Stochastic Gradient Descent,0.772358,0.913043,0.741176,0.818182
2,Baseline RandomForest Classifier,0.813008,0.836957,0.905882,0.870056
0,Baseline Logistic Regression,0.853659,0.831683,0.988235,0.903226
4,Baseline Support Vector Machine,0.853659,0.831683,0.988235,0.903226


## Hyperparameter Tuning

In [473]:
from sklearn.model_selection import GridSearchCV

In [477]:
param_grid = {
    "C": [0.5, 1, 5, 10], 
    "max_iter": [500, 1000]
}

lr_grid_search = GridSearchCV(LogisticRegression(), param_grid=param_grid)
lr_grid_search.fit(X_train_scaled_df, y_train)

In [478]:
lr_grid_search.best_params_

{'C': 0.5, 'max_iter': 500}

In [480]:
y_pred = lr_grid_search.best_estimator_.predict(X_val_scaled_df)

In [481]:
metrics = get_metrics("Parameter Tuned Logistic Regression", y_val, y_pred)
pp.pprint(metrics)
metric_data.append(metrics)

{'Accuracy': 0.8536585365853658,
 'F1_score': 0.903225806451613,
 'Model': 'Parameter Tuned Logistic Regression',
 'Precision': 0.8316831683168316,
 'Recall': 0.9882352941176471}


In [368]:
param_grid = {'max_depth': [2, 20],
              'min_samples_leaf': [2, 10, 100, 1000],
              'criterion': ['gini','entropy', 'log_loss'],
              'max_leaf_nodes': [10, 100, 1000],
              'min_impurity_decrease': [0.000001, 0.0001, 0.001, 0.010],
              'splitter': ['best', 'random']}

dt_grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid=param_grid)
dt_grid_search.fit(X_train_selected, y_train)

In [369]:
dt_grid_search.best_params_

{'criterion': 'entropy',
 'max_depth': 2,
 'max_leaf_nodes': 100,
 'min_impurity_decrease': 0.0001,
 'min_samples_leaf': 10,
 'splitter': 'random'}

In [370]:
y_pred = dt_grid_search.best_estimator_.predict(X_val_selected)

In [371]:
metrics = get_metrics("Parameter Tuned DecisionTree Classifier", y_val, y_pred)
pp.pprint(metrics)
metric_data.append(metrics)

{'Accuracy': 0.8536585365853658,
 'F1_score': 0.903225806451613,
 'Model': 'Parameter Tuned DecisionTree Classifier',
 'Precision': 0.8316831683168316,
 'Recall': 0.9882352941176471}


In [380]:

param_grid = {
    "n_estimators": [200, 300, 400],
    "max_depth": [2, 8],
    "max_features": ['log2', 'sqrt', None],
    "max_leaf_nodes": [4, 8, 16, 32],
    "min_samples_split": [2, 4],
    "bootstrap": [True, False]
}
rf_grid_search = GridSearchCV(RandomForestClassifier(class_weight="balanced", n_jobs=-1), param_grid=param_grid, n_jobs=-1, verbose=True)
rf_grid_search.fit(X_train_selected, y_train)

Fitting 5 folds for each of 288 candidates, totalling 1440 fits


In [381]:
rf_grid_search.best_params_

{'bootstrap': True,
 'max_depth': 2,
 'max_features': 'sqrt',
 'max_leaf_nodes': 4,
 'min_samples_split': 2,
 'n_estimators': 400}

In [383]:
y_pred = rf_grid_search.best_estimator_.predict(X_val_selected)

In [384]:
metrics = get_metrics("Parameter Tuned RandomForest Classifier", y_val, y_pred)
pp.pprint(metrics)
metric_data.append(metrics)

{'Accuracy': 0.8130081300813008,
 'F1_score': 0.87292817679558,
 'Model': 'Parameter Tuned RandomForest Classifier',
 'Precision': 0.8229166666666666,
 'Recall': 0.9294117647058824}


In [375]:
metrics = get_metrics("Parameter Tuned RandomForest Classifier", y_val, y_pred)
pp.pprint(metrics)
metric_data.append(metrics)

{'Accuracy': 0.8455284552845529,
 'F1_score': 0.8983957219251337,
 'Model': 'Parameter Tuned RandomForest Classifier',
 'Precision': 0.8235294117647058,
 'Recall': 0.9882352941176471}


In [376]:
param_grid = {
    "loss": ["hinge", "log_loss"], 
    "penalty":["l2", "l1", "elasticnet"],
    "alpha": [0.0001, 0.001, 0.1,0.5 ]
}
sgd_grid_search = GridSearchCV(SGDClassifier(), param_grid=param_grid)
sgd_grid_search.fit(X_train_scaled_df, y_train)

In [377]:
sgd_grid_search.best_params_

{'alpha': 0.001, 'loss': 'hinge', 'penalty': 'l2'}

In [378]:
y_pred = sgd_grid_search.best_estimator_.predict(X_val_scaled_df)

In [379]:
metrics = get_metrics("Parameter Tuned Stochastic Gradient Descent", y_val, y_pred)
pp.pprint(metrics)
metric_data.append(metrics)

{'Accuracy': 0.8536585365853658,
 'F1_score': 0.903225806451613,
 'Model': 'Parameter Tuned Stochastic Gradient Descent',
 'Precision': 0.8316831683168316,
 'Recall': 0.9882352941176471}


In [252]:
param_grid = {
    "C": [1, 5, 10],
    "kernel": ["linear", "rbf"],
}

svm_grid_search = GridSearchCV(SVC(), param_grid=param_grid, n_jobs=-1)
svm_grid_search.fit(X_train_scaled_df, y_train)

In [253]:
svm_grid_search.best_params_


{'C': 1, 'kernel': 'linear'}

In [254]:
y_pred = svm_grid_search.best_estimator_.predict(X_val_scaled_df)

In [255]:
metrics = get_metrics("Parameter Tuned Support Vector Machine", y_val, y_pred)
pp.pprint(metrics)
metric_data.append(metrics)

{'Accuracy': 0.8536585365853658,
 'F1_score': 0.903225806451613,
 'Model': 'Parameter Tuned Support Vector Machine',
 'Precision': 0.8316831683168316,
 'Recall': 0.9882352941176471}


In [268]:
parameter_tuned_df = pd.DataFrame(metric_data)
parameter_tuned_df.sort_values("F1_score", ascending=False)

Unnamed: 0,Model,Accuracy,Precision,Recall,F1_score
0,Baseline Logistic Regression,0.853659,0.831683,0.988235,0.903226
4,Baseline Support Vector Machine,0.853659,0.831683,0.988235,0.903226
5,Parameter Tuned Logistic Regression,0.853659,0.831683,0.988235,0.903226
6,Parameter Tuned DecisionTree Classifier,0.853659,0.831683,0.988235,0.903226
8,Parameter Tuned Stochastic Gradient Descent,0.853659,0.831683,0.988235,0.903226
9,Parameter Tuned Support Vector Machine,0.853659,0.831683,0.988235,0.903226
10,Parameter Tuned RandomForest Classifier,0.845528,0.823529,0.988235,0.898396
7,Parameter Tuned RandomForest Classifier,0.837398,0.842105,0.941176,0.888889
2,Baseline RandomForest Classifier,0.813008,0.836957,0.905882,0.870056
3,Baseline Stochastic Gradient Descent,0.772358,0.913043,0.741176,0.818182


In [116]:
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from tqdm import tqdm
from tqdm.contrib import itertools  # tqdm's wrapper: product() shows a progress bar

In [132]:
impute_strategies = [SimpleImputer(strategy="mean"),
                     SimpleImputer(strategy="median"),
                     SimpleImputer(strategy="most_frequent"),
                     SimpleImputer(strategy="constant"),
                     IterativeImputer(max_iter=10, random_state=0)]

scaler_strategies = [MinMaxScaler(), StandardScaler()]

algorithms = [LogisticRegression(), KNeighborsClassifier(n_jobs=-1),
              RandomForestClassifier(n_jobs=-1, class_weight="balanced"), SVC(),
              LGBMClassifier(), DecisionTreeClassifier()]


In [133]:
results = []

# Build the full labelled set once; cross-validation creates its own folds
X = pd.concat([X_train_encoded, X_val_encoded])
y = pd.concat([y_train, y_val]).map({"Y": 1, "N": 0}).astype("uint")

cv2 = RepeatedStratifiedKFold(n_repeats=3, n_splits=10, random_state=1)

for imputer, scaler, model in itertools.product(impute_strategies, scaler_strategies, algorithms):
    pipeline = Pipeline([("imputer", imputer), ("scaler", scaler), ("model", model)])

    # Note: scoring="f1", so the values labelled "Accuracy" below are actually F1 scores
    accuracy_scores = cross_val_score(pipeline, X, y, scoring="f1", cv=cv2, n_jobs=-1)

    model_name = type(model).__name__
    strategy = imputer.get_params().get("strategy")  # None for IterativeImputer
    imputer_name = f"{type(imputer).__name__}_{strategy}"
    scaler_name = type(scaler).__name__

    mean_accuracy = round(np.mean(accuracy_scores), 3)
    std_accuracy = round(np.std(accuracy_scores), 3)

    result = {
        "Model": model_name,
        "Imputer": imputer_name,
        "Scaler": scaler_name,
        "Mean Accuracy": mean_accuracy,
        "Std Accuracy": std_accuracy
    }

    results.append(result)

    print(f"Model: {model_name}, Mean Accuracy: {mean_accuracy}  | Std: {std_accuracy}")


  0%|          | 0/60 [00:00<?, ?it/s]

Model: LogisticRegression, Mean Accuracy: 0.875  | Std: 0.019
Model: KNeighborsClassifier, Mean Accuracy: 0.818  | Std: 0.026
Model: RandomForestClassifier, Mean Accuracy: 0.858  | Std: 0.022
Model: SVC, Mean Accuracy: 0.877  | Std: 0.019
Model: LGBMClassifier, Mean Accuracy: 0.845  | Std: 0.03
Model: DecisionTreeClassifier, Mean Accuracy: 0.795  | Std: 0.039
Model: LogisticRegression, Mean Accuracy: 0.876  | Std: 0.022
Model: KNeighborsClassifier, Mean Accuracy: 0.83  | Std: 0.027
Model: RandomForestClassifier, Mean Accuracy: 0.859  | Std: 0.022
Model: SVC, Mean Accuracy: 0.875  | Std: 0.019
Model: LGBMClassifier, Mean Accuracy: 0.842  | Std: 0.031
Model: DecisionTreeClassifier, Mean Accuracy: 0.795  | Std: 0.044
Model: LogisticRegression, Mean Accuracy: 0.876  | Std: 0.019
Model: KNeighborsClassifier, Mean Accuracy: 0.815  | Std: 0.023
Model: RandomForestClassifier, Mean Accuracy: 0.855  | Std: 0.021
Model: SVC, Mean Accuracy: 0.877  | Std: 0.019
Model: LGBMClassifier, Mean Accuracy:

In [134]:
results_df = pd.DataFrame(results)
results_df.sort_values("Mean Accuracy", ascending=False, inplace=True)
results_df

Unnamed: 0,Model,Imputer,Scaler,Mean Accuracy,Std Accuracy
15,SVC,SimpleImputer_median,MinMaxScaler,0.877,0.019
3,SVC,SimpleImputer_mean,MinMaxScaler,0.877,0.019
51,SVC,IterativeImputer_None,MinMaxScaler,0.877,0.019
27,SVC,SimpleImputer_most_frequent,MinMaxScaler,0.877,0.019
30,LogisticRegression,SimpleImputer_most_frequent,StandardScaler,0.876,0.021
54,LogisticRegression,IterativeImputer_None,StandardScaler,0.876,0.02
6,LogisticRegression,SimpleImputer_mean,StandardScaler,0.876,0.022
48,LogisticRegression,IterativeImputer_None,MinMaxScaler,0.876,0.019
33,SVC,SimpleImputer_most_frequent,StandardScaler,0.876,0.018
24,LogisticRegression,SimpleImputer_most_frequent,MinMaxScaler,0.876,0.019
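
To turn a results frame like the one above into the two requested tables, one option is to pick the best row per model with `groupby(...).idxmax()` and then take the top row overall. The sketch below uses four rows copied from the output above as sample data; `sample_results`, `per_model_best`, and `overall_best` are hypothetical names. The `Mean Accuracy` column name follows the notebook, where the score was computed with `scoring="f1"`.

```python
import pandas as pd

# Four rows copied from the results output above; all other rows are omitted
sample_results = pd.DataFrame([
    {"Model": "SVC", "Imputer": "SimpleImputer_mean",
     "Scaler": "MinMaxScaler", "Mean Accuracy": 0.877},
    {"Model": "SVC", "Imputer": "SimpleImputer_most_frequent",
     "Scaler": "StandardScaler", "Mean Accuracy": 0.876},
    {"Model": "LogisticRegression", "Imputer": "SimpleImputer_mean",
     "Scaler": "StandardScaler", "Mean Accuracy": 0.876},
    {"Model": "LogisticRegression", "Imputer": "IterativeImputer_None",
     "Scaler": "MinMaxScaler", "Mean Accuracy": 0.876},
])

# Index label of the best-scoring row for each model (ties keep the first occurrence)
best_idx = sample_results.groupby("Model")["Mean Accuracy"].idxmax()

# Per-algorithm table, best score first
per_model_best = sample_results.loc[best_idx].sort_values("Mean Accuracy", ascending=False)

# Best-overall table: the top row of the per-algorithm table
overall_best = per_model_best.iloc[0]

print(per_model_best)
print(overall_best["Model"], overall_best["Mean Accuracy"])
```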
