## Objective

To compare the performance of the classifiers `KNeighborsClassifier`, `LogisticRegression`, `DecisionTreeClassifier`, and `SVC`.

### Data

The data is a bank marketing dataset from the UCI Machine Learning Repository.

Note: I ran `python3.10 -m pip install ucimlrepo` to install the `ucimlrepo` package, then imported it to retrieve the dataset.

In [None]:
import pandas as pd

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV

from ucimlrepo import fetch_ucirepo

In [None]:
bank_marketing = fetch_ucirepo(id=222)

### Overview of data and data cleanup

In [None]:
bank_tmp = pd.DataFrame(bank_marketing.data.features)
bank_tmp['y'] = bank_marketing.data.targets

In [None]:
bank_tmp.info()

In [None]:
bank_tmp.isna().sum()

I decided to remove the columns `poutcome` and `contact` because they had many missing values. I also believe they would not have had much impact on the target as features. For the remaining `NaN` values, I removed the affected rows.
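
The same two-step cleanup can be sketched on a toy frame. This is only an illustration: the `toy` data and the 0.5 cutoff are assumptions made up for the example, not values taken from the bank dataset.

```python
import pandas as pd

# Toy stand-in for the bank data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    'age': [30, 41, 52, 28],
    'job': ['admin.', None, 'services', 'technician'],
    'poutcome': [None, None, None, 'success'],
    'y': ['no', 'yes', 'no', 'no'],
})

# Share of missing values per column
missing_share = toy.isna().mean()

# Assumed cutoff: drop columns where more than half of the values are missing
sparse_cols = missing_share[missing_share > 0.5].index.tolist()

# Drop the sparse columns first, then drop rows that still contain a NaN
cleaned = toy.drop(columns=sparse_cols).dropna()
print(sparse_cols)
print(cleaned.shape)
```

Dropping the sparse columns before calling `dropna()` matters: doing it the other way around would discard every row that is missing a value in a column that is about to be removed anyway.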

In [None]:
bank_tmp_drop = bank_tmp.drop(['poutcome', 'contact'], axis=1)

In [None]:
bank_data = bank_tmp_drop.dropna()

### Functions

In [None]:
def grid_search_cv(model, feature, target, parameters):
    # Fit a grid search over `parameters`; GridSearchCV uses 5-fold cross-validation by default
    grid = GridSearchCV(estimator=model, param_grid=parameters).fit(feature, target)
    return grid

In [None]:
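def print_scores(model_name, model, feature, target):
    # model is a fitted GridSearchCV; feature/target are the held-out test split
    print(f'{model_name} results')
    # best_score_ is the mean cross-validated score of the best parameters on the training data
    print(f'Training data score (mean CV): {model.best_score_}')
    print(f'Test data score: {model.score(feature, target)}')
    print(f'Best parameters: {model.best_params_}')
    # refit_time_ is the time in seconds taken to refit the best model on the full training set
    print(f'Time for fitting best model: {model.refit_time_}')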

### Arrange data

Since some features are of type `object`, I used `OneHotEncoder` on those features and `StandardScaler` on the features of type `float` and `int`.

In [None]:
X = bank_data.drop(['y'], axis=1)
y = bank_data['y']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 42)

In [None]:
transformer = make_column_transformer(
    (OneHotEncoder(drop='if_binary'),
     ['job', 'marital', 'education', 'default', 'housing', 'loan', 'month']),
    remainder=StandardScaler())
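
As an alternative to listing the categorical columns by hand, the split by dtype could be sketched with `make_column_selector`. This is a minimal sketch on a hypothetical `toy` frame, and it assumes that every non-numeric column should be treated as categorical, which may not hold for every dataset.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in with two numeric and two categorical columns (hypothetical values)
toy = pd.DataFrame({
    'age': [30, 41, 52, 28],
    'balance': [100.0, -20.0, 350.0, 0.0],
    'marital': ['single', 'married', 'married', 'divorced'],
    'housing': ['yes', 'no', 'yes', 'yes'],
})

by_dtype = make_column_transformer(
    # Assumption: every non-numeric column is categorical and gets one-hot encoded
    (OneHotEncoder(drop='if_binary'), make_column_selector(dtype_exclude='number')),
    # Numeric columns (int and float) are standardized
    (StandardScaler(), make_column_selector(dtype_include='number')),
)

encoded = by_dtype.fit_transform(toy)
print(encoded.shape)
```

Here `marital` expands to three indicator columns, `housing` collapses to one because of `drop='if_binary'`, and the two numeric columns are kept as scaled values.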

### KNeighborsClassifier

In [None]:
knn_pipe = Pipeline([('transform', transformer),
                     ('knn', KNeighborsClassifier())]) 

In [None]:
kn_params = {'knn__n_neighbors': list(range(1,11))}

In [None]:
knn_grid = grid_search_cv(knn_pipe, X_train, y_train, kn_params)

In [None]:
print_scores('KNeighborsClassifier', knn_grid, X_test, y_test)

### LogisticRegression

In [None]:
lr_pipe = Pipeline([('transform', transformer),
                     ('lr', LogisticRegression())])

In [None]:
lr_params = {'lr__max_iter': list(range(1,11))}

In [None]:
lr_grid = grid_search_cv(lr_pipe, X_train, y_train, lr_params)

In [None]:
print_scores('LogisticRegression', lr_grid, X_test, y_test)

### DecisionTreeClassifier

In [None]:
dtr_pipe = Pipeline([('transform', transformer),
                     ('dtr', DecisionTreeClassifier())])

In [None]:
dtr_params = {'dtr__criterion': ['gini', 'entropy', 'log_loss']}

In [None]:
dtr_grid = grid_search_cv(dtr_pipe, X_train, y_train, dtr_params)

In [None]:
print_scores('DecisionTreeClassifier', dtr_grid, X_test, y_test)

### Support Vector Machines

I was unable to use `GridSearchCV` with a `param_grid` here due to the limitations of my home computer. I tried passing two hyperparameters, `kernel` and `gamma`, which caused the Python kernel to run for more than an hour.
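
One way the search might be made cheaper is to tune on a stratified subsample with fewer folds and then refit the chosen settings on all of the training data. The sketch below is not what I ran: it uses synthetic data from `make_classification`, and the subsample size, `cv=3`, and the small grid are all assumptions picked for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data (assumption: not the bank dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=42)

# Tune on a stratified subsample so each SVC fit sees fewer rows
X_small, _, y_small, _ = train_test_split(
    X_demo, y_demo, train_size=200, stratify=y_demo, random_state=42)

demo_pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC())])
demo_params = {'svc__kernel': ['linear', 'rbf'], 'svc__gamma': ['scale', 0.1]}

# cv=3 runs 3 fits per parameter combination instead of the default 5
demo_grid = GridSearchCV(demo_pipe, demo_params, cv=3).fit(X_small, y_small)

# Refit the selected settings once on the full training data
final_svc = demo_pipe.set_params(**demo_grid.best_params_).fit(X_demo, y_demo)
print(demo_grid.best_params_)
```

Passing `n_jobs=-1` to `GridSearchCV` would additionally spread the fits across the available CPU cores. The trade-off of subsampling is that the best parameters on the subsample are not guaranteed to be the best on the full data.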

In [None]:
svc_pipe = Pipeline([('transform', transformer),
                     ('svc', SVC())])

In [None]:
svc_pipe.fit(X_train, y_train)

In [None]:
print('SupportVectorMachines results')
print(f'Training data score: {svc_pipe.score(X_train, y_train)}')
print(f'Test data score: {svc_pipe.score(X_test, y_test)}')
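
Because the `SVC` pipeline above is fitted directly rather than through `GridSearchCV`, there is no `refit_time_` attribute to read. A fit time could still be measured by hand, roughly as sketched here with `time.perf_counter` on synthetic data (an assumption, standing in for `X_train` and `y_train`).

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the training split (assumption: not the bank dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Wall-clock time around a single fit, in seconds
start = time.perf_counter()
SVC().fit(X_demo, y_demo)
svc_fit_seconds = time.perf_counter() - start

print(f'Time for fitting SVC: {svc_fit_seconds}')
```

A single timed fit like this is noisy, so it is only roughly comparable to the `refit_time_` values reported for the other models.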

### Results

In [None]:
res_dict = {'model': ['KNN', 'Logistic Regression', 'DecisionTreeClassifier','SVC'],
           'train score': [0.8916, 0.8951, 0.8719, 0.9129],
           'test score': [0.8905, 0.8939, 0.8783, 0.8996],
           # SVC was fitted without GridSearchCV, so it has no refit time
           'Best param fit time': [0.09899, 0.1378, 0.2204, float('nan')]}

results_df = pd.DataFrame(res_dict).set_index('model')

In [None]:
results_df

### Conclusion

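Despite not being able to get the best-parameter fit time for the `SVC` classifier, I would say this classifier performed best overall based on the scores alone. I have a strong feeling that its fit time would be the lowest as well, although I did not measure it. Note that the `SVC` train score is plain accuracy on the training set, whereas the train scores of the other three models are mean cross-validated scores, so that column is not strictly like-for-like.

Note: I did not include any plots, as the objective of this practical application assignment was to assess the best classifier.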

### Other Analysis

If I did not have the CPU limitations on my computer, I could extend the analysis by searching over additional hyperparameters and cross-validation settings with `GridSearchCV`. `GridSearchCV` also has a `cv_results_` attribute that can be used to dig deeper into how each hyperparameter setting performed.
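
As a rough illustration of what `cv_results_` offers, it can be loaded into a DataFrame and sorted by rank. The sketch uses synthetic data and a small `n_neighbors` grid, both assumptions chosen only to keep the example quick.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data (assumption: not the bank dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

demo_grid = GridSearchCV(KNeighborsClassifier(),
                         {'n_neighbors': [1, 5, 9]}).fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
cv_table = (pd.DataFrame(demo_grid.cv_results_)
            [['param_n_neighbors', 'mean_test_score', 'std_test_score',
              'mean_fit_time', 'rank_test_score']]
            .sort_values('rank_test_score'))
print(cv_table)
```

The `std_test_score` column shows how much the score varies across folds, which helps judge whether the gap between two settings is meaningful.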