## Objective

To compare the performance of the classifiers `KNeighborsClassifier`, `LogisticRegression`, `DecisionTreeClassifier`, and `SVC`.

### Data

The data is a bank marketing dataset from the UCI Machine Learning Repository.

Note: I ran `python3.10 -m pip install ucimlrepo` to install the package used to retrieve the dataset, then imported it.

In [1]:
import pandas as pd

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV

from ucimlrepo import fetch_ucirepo

In [2]:
bank_marketing = fetch_ucirepo(id=222)

### Overview of data and data cleanup

In [3]:
bank_tmp = pd.DataFrame(bank_marketing.data.features)
bank_tmp['y'] = bank_marketing.data.targets

In [4]:
bank_tmp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   age          45211 non-null  int64 
 1   job          44923 non-null  object
 2   marital      45211 non-null  object
 3   education    43354 non-null  object
 4   default      45211 non-null  object
 5   balance      45211 non-null  int64 
 6   housing      45211 non-null  object
 7   loan         45211 non-null  object
 8   contact      32191 non-null  object
 9   day_of_week  45211 non-null  int64 
 10  month        45211 non-null  object
 11  duration     45211 non-null  int64 
 12  campaign     45211 non-null  int64 
 13  pdays        45211 non-null  int64 
 14  previous     45211 non-null  int64 
 15  poutcome     8252 non-null   object
 16  y            45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


In [5]:
bank_tmp.isna().sum()

age                0
job              288
marital            0
education       1857
default            0
balance            0
housing            0
loan               0
contact        13020
day_of_week        0
month              0
duration           0
campaign           0
pdays              0
previous           0
poutcome       36959
y                  0
dtype: int64

I decided to remove the columns `poutcome` and `contact` because they have many missing values (36,959 and 13,020 of 45,211 rows, respectively). I also believe that they would not have had much impact on the target as features. For the remaining `NaN` values, I removed the affected rows.
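The decision above can be made more explicit by looking at the fraction of missing values per column rather than raw counts. The sketch below uses a small made-up frame (`toy` and the 50% `threshold` are illustrative assumptions, not values from this notebook) to show one way of dropping sparse columns and then the remaining incomplete rows.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for bank_tmp; the values are made up for illustration.
toy = pd.DataFrame({
    'age': [30, 41, 52, 38],
    'job': ['admin', np.nan, 'services', 'technician'],
    'poutcome': [np.nan, np.nan, np.nan, 'success'],
})

# Fraction of missing values per column.
missing_frac = toy.isna().mean()

# Hypothetical cutoff: drop columns with more than half of their values missing.
threshold = 0.5
sparse_cols = missing_frac[missing_frac > threshold].index.tolist()

# Drop the sparse columns first, then any rows that still contain NaN.
cleaned = toy.drop(columns=sparse_cols).dropna()

print(missing_frac)
print(sparse_cols, cleaned.shape)
```

In this notebook `contact` is roughly 29% missing and `poutcome` roughly 82%, so the cutoff would need to be lower than 50% to reproduce the choice made above.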

In [6]:
bank_tmp_drop = bank_tmp.drop(['poutcome', 'contact'], axis=1)

In [7]:
bank_data = bank_tmp_drop.dropna()

### Functions

In [8]:
def grid_search_cv(model, feature, target, parameters):
    grid = GridSearchCV(estimator=model, param_grid=parameters).fit(feature, target)
    return grid

In [9]:
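def print_scores(model_name, model, feature, target):
    # model is a fitted GridSearchCV; feature/target are the held-out test data.
    print(f'{model_name} results')
    # best_score_ is the mean cross-validated score of the best parameters.
    print(f'Training data score: {model.best_score_}')
    print(f'Test data score: {model.score(feature, target)}')
    print(f'Best parameters: {model.best_params_}')
    # refit_time_ is the time in seconds to refit the best model on the full training data.
    print(f'Time for fitting best model: {model.refit_time_}')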
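
One detail about `print_scores` is worth spelling out: `best_score_` is the mean cross-validated score of the best parameter setting, not the score on the full training set. A minimal sketch on synthetic data (the `make_classification` data and the small grid are assumptions for illustration) shows how it relates to `cv_results_` and how it differs from scoring the refit model on its own training data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the bank data.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 3, 5]}, cv=5)
grid.fit(X_demo, y_demo)

# best_score_ is the mean validation-fold score for the winning parameters.
cv_mean = grid.cv_results_['mean_test_score'][grid.best_index_]

# Score of the refit best estimator on the data it was trained on.
train_score = grid.score(X_demo, y_demo)

print(f'Mean CV score: {grid.best_score_:.4f}')
print(f'Score on full training data: {train_score:.4f}')
```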

### Arrange data

Since some features are of type `object`, I used `OneHotEncoder` on those features and `StandardScaler` on the numeric (int and float) features.

In [10]:
X = bank_data.drop(['y'], axis=1)
y = bank_data['y']

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 42)

In [12]:
transformer = make_column_transformer((OneHotEncoder(drop = 'if_binary'), ['job', 'marital', 'education', 'default', 'housing', 'loan', 'month']), 
                                     remainder = StandardScaler())
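
As an alternative to listing the categorical column names by hand, the columns could be selected by dtype with `make_column_selector`. The sketch below shows this on a small made-up frame; `toy` and `toy_transformer` are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with two categorical and two numeric columns (made-up values).
toy = pd.DataFrame({
    'job': ['admin', 'services', 'technician', 'admin'],
    'default': ['no', 'yes', 'no', 'no'],
    'age': [30, 41, 52, 38],
    'balance': [1200, -50, 300, 0],
})

# Select columns by dtype instead of listing the names by hand.
toy_transformer = make_column_transformer(
    (OneHotEncoder(drop='if_binary'), make_column_selector(dtype_exclude='number')),
    (StandardScaler(), make_column_selector(dtype_include='number')),
)

encoded = toy_transformer.fit_transform(toy)

# 3 job columns + 1 binary default column + 2 scaled numeric columns.
print(encoded.shape)
```

With `drop='if_binary'`, a two-category column such as `default` becomes a single 0/1 column, while columns with more categories keep one column per category.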

### KNeighborsClassifier

In [13]:
knn_pipe = Pipeline([('transform', transformer),
                     ('knn', KNeighborsClassifier())]) 

In [14]:
kn_params = {'knn__n_neighbors': list(range(1,11))}

In [15]:
knn_grid = grid_search_cv(knn_pipe, X_train, y_train, kn_params)

In [16]:
print_scores('KNeighborsClassifier', knn_grid, X_test, y_test)

KNeighborsClassifier results
Training data score: 0.8914503059368281
Test data score: 0.8904923599320883
Best parameters: {'knn__n_neighbors': 5}
Time for fitting best model: 0.08972883224487305


### LogisticRegression

In [17]:
lr_pipe = Pipeline([('transform', transformer),
                     ('lr', LogisticRegression())])

In [18]:
lr_params = {'lr__max_iter': list(range(1,101))}

In [19]:
lr_grid = grid_search_cv(lr_pipe, X_train, y_train, lr_params)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [20]:
print_scores('LogisticRegression', lr_grid, X_test, y_test)

LogisticRegression results
Training data score: 0.8951215478749793
Test data score: 0.8938879456706282
Best parameters: {'lr__max_iter': 20}
Time for fitting best model: 0.13123202323913574
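
The convergence warnings above come from grid points where `max_iter` is too small for the solver to converge. `max_iter` is an optimization budget rather than a model hyperparameter, so an alternative worth considering is to fix it at a comfortably large value and search over the regularization strength `C` instead. The sketch below does this on synthetic data; the data, the step names, and the `C` values are illustrative assumptions, not what was run in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bank data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

lr_demo_pipe = Pipeline([('scale', StandardScaler()),
                         ('lr', LogisticRegression(max_iter=1000))])

# Search over regularization strength; smaller C means stronger regularization.
lr_demo_params = {'lr__C': [0.01, 0.1, 1, 10]}

lr_demo_grid = GridSearchCV(lr_demo_pipe, lr_demo_params, cv=5).fit(X_demo, y_demo)

print(lr_demo_grid.best_params_)
print(f'{lr_demo_grid.best_score_:.4f}')
```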


### DecisionTreeClassifier

In [21]:
dtr_pipe = Pipeline([('transform', transformer),
                     ('dtr', DecisionTreeClassifier())])

In [22]:
dtr_params = {'dtr__criterion': ['gini', 'entropy', 'log_loss']}

In [23]:
dtr_grid = grid_search_cv(dtr_pipe, X_train, y_train, dtr_params)

In [24]:
print_scores('DecisionTreeClassifier', dtr_grid, X_test, y_test)

DecisionTreeClassifier results
Training data score: 0.8722672399536959
Test data score: 0.879148016669239
Best parameters: {'dtr__criterion': 'entropy'}
Time for fitting best model: 0.22285103797912598


### Support Vector Machines

I was unable to use `GridSearchCV` with a `param_grid` due to the limitations of my home computer. I tried passing in two hyperparameters, `kernel` and `gamma`, which caused the Python kernel to run for more than an hour.
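
One way to make a search over `kernel` and `gamma` more tractable on a modest machine is to tune on a subsample of the training data and to sample only a few parameter combinations with `RandomizedSearchCV`. The following is a sketch under those assumptions, run on synthetic data; the subsample size, candidate values, and `n_iter` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=1000, n_features=8, random_state=0)

# Tune on a stratified 30% subsample to keep SVC training time manageable.
X_sub, _, y_sub, _ = train_test_split(X_demo, y_demo, train_size=0.3,
                                      stratify=y_demo, random_state=42)

svc_demo_pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC())])

svc_demo_params = {'svc__kernel': ['rbf', 'linear'],
                   'svc__gamma': ['scale', 0.01, 0.1]}

# Evaluate 4 of the 6 combinations with 3-fold cross-validation.
svc_search = RandomizedSearchCV(svc_demo_pipe, svc_demo_params, n_iter=4,
                                cv=3, random_state=42)
svc_search.fit(X_sub, y_sub)

print(svc_search.best_params_)
```

Passing `n_jobs=-1` to the search would additionally spread the fits across the available CPU cores.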

In [25]:
svc_pipe = Pipeline([('transform', transformer),
                     ('svc', SVC())])

In [26]:
svc_pipe.fit(X_train, y_train)

In [27]:
print('SupportVectorMachines results')
print(f'Training data score: {svc_pipe.score(X_train, y_train)}')
print(f'Test data score: {svc_pipe.score(X_test, y_test)}')

SupportVectorMachines results
Training data score: 0.9129154952869192
Test data score: 0.8995987035036271
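
The `SVC` training score above is measured on the same data the model was fit on, whereas `best_score_` for the other models is a cross-validated mean. To obtain a comparable number and a fit time without a grid search, `cross_validate` could be applied to the pipeline. A minimal sketch on synthetic data (the data and fold count are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

svc_demo_pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC())])

# 5-fold cross-validation; the result holds per-fold scores and timings in seconds.
cv_out = cross_validate(svc_demo_pipe, X_demo, y_demo, cv=5)

mean_cv_score = cv_out['test_score'].mean()
mean_fit_time = cv_out['fit_time'].mean()

print(f'Mean CV score: {mean_cv_score:.4f}')
print(f'Mean fit time per fold: {mean_fit_time:.4f} s')
```

Each fold is fit on four fifths of the data, so the mean fit time is only an approximation of what `refit_time_` would report for a fit on the full training set.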


### Results

In [28]:
res_dict = {'model': ['KNN', 'Logistic Regression', 'DecisionTreeClassifier','SVC'],
           'train score': [0.8916, 0.8951, 0.8719, 0.9129],
           'test score': [0.8905, 0.8939, 0.8783, 0.8996],
           'Best param fit time': [0.09899, 0.1378, 0.2204, 'NaN']}

results_df = pd.DataFrame(res_dict).set_index('model')

In [29]:
results_df

| model | train score | test score | Best param fit time |
|---|---|---|---|
| KNN | 0.8916 | 0.8905 | 0.09899 |
| Logistic Regression | 0.8951 | 0.8939 | 0.1378 |
| DecisionTreeClassifier | 0.8719 | 0.8783 | 0.2204 |
| SVC | 0.9129 | 0.8996 | NaN |
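
Typing the scores in by hand makes it easy for the table to drift from the cell outputs. A possible alternative is to build the table from the fitted search objects. The sketch below uses a hypothetical helper `summarize` and two small searches on synthetic data; none of these names appear elsewhere in the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data, split the same way as above.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42)

def summarize(name, grid, X_eval, y_eval):
    """Collect the headline numbers from a fitted GridSearchCV (hypothetical helper)."""
    return {'model': name,
            'cv score': round(grid.best_score_, 4),
            'test score': round(grid.score(X_eval, y_eval), 4),
            'best param fit time': round(grid.refit_time_, 4)}

searches = {
    'KNN': GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5]}),
    'DecisionTreeClassifier': GridSearchCV(DecisionTreeClassifier(random_state=0),
                                           {'criterion': ['gini', 'entropy']}),
}

# Fit each search on the training split and summarize it on the test split.
rows = [summarize(name, gs.fit(Xd_train, yd_train), Xd_test, yd_test)
        for name, gs in searches.items()]
demo_results = pd.DataFrame(rows).set_index('model')
print(demo_results)
```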


### Conclusion

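Despite not being able to get the best-parameter fit time for the `SVC` classifier, I would say this classifier performed best overall, judging by the scores alone. Note that the `SVC` train score was computed on the full training set, while the train scores of the other three models are mean cross-validation scores, so the test scores are the fairer comparison. I have a strong feeling that the `SVC` fit time would be the lowest number as well, although I did not measure it.

Note: I did not include any plots, as the objective of this practical application assignment was to assess the best classifier.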

### Other Analysis

If I did not have the CPU limitations on my computer, I could extend the analysis by tuning additional hyperparameters with `GridSearchCV`, including a cross-validated search for `SVC`. `GridSearchCV` also has an attribute, `cv_results_`, that can be used to dig deeper into how each hyperparameter setting performed.
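
As a rough illustration of that last point, `cv_results_` can be loaded into a DataFrame and sorted by rank. The sketch below runs a small search on synthetic data; the data and the parameter grid are assumptions for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

tree_grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                         {'max_depth': [2, 4, 8], 'criterion': ['gini', 'entropy']},
                         cv=5).fit(X_demo, y_demo)

# One row per parameter combination, ordered from best to worst mean CV score.
cv_table = (pd.DataFrame(tree_grid.cv_results_)
            [['params', 'mean_test_score', 'std_test_score',
              'mean_fit_time', 'rank_test_score']]
            .sort_values('rank_test_score')
            .reset_index(drop=True))

print(cv_table)
```

Looking at `std_test_score` alongside `mean_test_score` gives a sense of whether the gap between two settings is larger than the fold-to-fold variation.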