# Practical Application III: Comparing Classifiers

**Overview**: In this practical application, your goal is to compare the performance of the classifiers we encountered in this section, namely K Nearest Neighbor, Logistic Regression, Decision Trees, and Support Vector Machines.  We will utilize a dataset related to marketing bank products over the telephone.  



### Getting Started

Our dataset comes from the UCI Machine Learning repository [link](https://archive.ics.uci.edu/ml/datasets/bank+marketing).  The data is from a Portuguese banking institution and collects the results of multiple marketing campaigns.  We will use the article accompanying the dataset [here](CRISP-DM-BANK.pdf) for more information on the data and features.



### Problem 1: Understanding the Data

To gain a better understanding of the data, please read the information provided in the UCI link above, and examine the **Materials and Methods** section of the paper.  How many marketing campaigns does this data represent?

The data was accumulated through 17 marketing campaigns.

### Problem 2: Read in the Data

Use pandas to read in the dataset `bank-additional-full.csv` and assign to a meaningful variable name.

In [1]:
import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import set_config
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay
from sklearn.metrics import confusion_matrix, roc_curve, auc, mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Render estimators as diagrams in notebook output.
set_config(display="diagram")

In [2]:
df = pd.read_csv('data/bank-additional-full.csv', sep = ';')

In [3]:
df.head()

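|   | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
| - | --- | --- | ------- | --------- | ------- | ------- | ---- | ------- | ----- | ----------- | --- | -------- | ----- | -------- | -------- | ------------ | -------------- | ------------- | --------- | ----------- | - |
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |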


### Problem 3: Understanding the Features


Examine the data description below, and determine if any of the features are missing values or need to be coerced to a different data type.


```
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
```
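The description above says that `pdays` uses 999 as a sentinel for "not previously contacted", so the column is not a plain day count. One possible way to handle this is sketched below on a small made-up frame. The column names `previously_contacted` and `pdays_clean` are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({"pdays": [999, 3, 999, 6]})

# Flag clients who were contacted in a previous campaign (hypothetical column name).
toy["previously_contacted"] = toy["pdays"] != 999

# Replace the sentinel with NaN so it is not read as a real number of days.
toy["pdays_clean"] = toy["pdays"].replace(999, np.nan)

print(toy)
```

Whether a flag, a NaN, or the raw sentinel works best depends on the model, so this is a starting point rather than a recommendation.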



In [4]:
df.isnull().sum()

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64

In [5]:
df['age'].value_counts()

age
31    1947
32    1846
33    1833
36    1780
35    1759
      ... 
89       2
91       2
94       1
87       1
95       1
Name: count, Length: 78, dtype: int64

In [6]:
df['job'].value_counts()

job
admin.           10422
blue-collar       9254
technician        6743
services          3969
management        2924
retired           1720
entrepreneur      1456
self-employed     1421
housemaid         1060
unemployed        1014
student            875
unknown            330
Name: count, dtype: int64

In [7]:
df['marital'].value_counts()

marital
married     24928
single      11568
divorced     4612
unknown        80
Name: count, dtype: int64

In [8]:
df['education'].value_counts()

education
university.degree      12168
high.school             9515
basic.9y                6045
professional.course     5243
basic.4y                4176
basic.6y                2292
unknown                 1731
illiterate                18
Name: count, dtype: int64

In [9]:
df['default'].value_counts()

default
no         32588
unknown     8597
yes            3
Name: count, dtype: int64

In [10]:
df['housing'].value_counts()

housing
yes        21576
no         18622
unknown      990
Name: count, dtype: int64

In [11]:
df['loan'].value_counts()

loan
no         33950
yes         6248
unknown      990
Name: count, dtype: int64

In [12]:
df['contact'].value_counts()

contact
cellular     26144
telephone    15044
Name: count, dtype: int64

In [13]:
df['month'].value_counts()

month
may    13769
jul     7174
aug     6178
jun     5318
nov     4101
apr     2632
oct      718
sep      570
mar      546
dec      182
Name: count, dtype: int64

In [14]:
df['day_of_week'].value_counts()

day_of_week
thu    8623
mon    8514
wed    8134
tue    8090
fri    7827
Name: count, dtype: int64

In [15]:
df['duration'].value_counts()

duration
90      170
85      170
136     168
73      167
124     164
       ... 
1569      1
1053      1
1263      1
1169      1
1868      1
Name: count, Length: 1544, dtype: int64

In [16]:
df['campaign'].value_counts()

campaign
1     17642
2     10570
3      5341
4      2651
5      1599
6       979
7       629
8       400
9       283
10      225
11      177
12      125
13       92
14       69
17       58
16       51
15       51
18       33
20       30
19       26
21       24
22       17
23       16
24       15
27       11
29       10
28        8
26        8
25        8
31        7
30        7
35        5
32        4
33        4
34        3
42        2
40        2
43        2
56        1
39        1
41        1
37        1
Name: count, dtype: int64

In [17]:
df['pdays'].value_counts()

pdays
999    39673
3        439
6        412
4        118
9         64
2         61
7         60
12        58
10        52
5         46
13        36
11        28
1         26
15        24
14        20
8         18
0         15
16        11
17         8
18         7
22         3
19         3
21         2
25         1
26         1
27         1
20         1
Name: count, dtype: int64

In [18]:
df['previous'].value_counts()

previous
0    35563
1     4561
2      754
3      216
4       70
5       18
6        5
7        1
Name: count, dtype: int64

In [19]:
df['poutcome'].value_counts()

poutcome
nonexistent    35563
failure         4252
success         1373
Name: count, dtype: int64

In [20]:
df['emp.var.rate'].value_counts()

emp.var.rate
 1.4    16234
-1.8     9184
 1.1     7763
-0.1     3683
-2.9     1663
-3.4     1071
-1.7      773
-1.1      635
-3.0      172
-0.2       10
Name: count, dtype: int64

In [21]:
df['cons.price.idx'].value_counts()

cons.price.idx
93.994    7763
93.918    6685
92.893    5794
93.444    5175
94.465    4374
93.200    3616
93.075    2458
92.201     770
92.963     715
92.431     447
92.649     357
94.215     311
94.199     303
92.843     282
92.379     267
93.369     264
94.027     233
94.055     229
93.876     212
94.601     204
92.469     178
93.749     174
92.713     172
94.767     128
93.798      67
92.756      10
Name: count, dtype: int64

In [22]:
df['cons.conf.idx'].value_counts()

cons.conf.idx
-36.4    7763
-42.7    6685
-46.2    5794
-36.1    5175
-41.8    4374
-42.0    3616
-47.1    2458
-31.4     770
-40.8     715
-26.9     447
-30.1     357
-40.3     311
-37.5     303
-50.0     282
-29.8     267
-34.8     264
-38.3     233
-39.8     229
-40.0     212
-49.5     204
-33.6     178
-34.6     174
-33.0     172
-50.8     128
-40.4      67
-45.9      10
Name: count, dtype: int64

In [23]:
df['euribor3m'].value_counts()

euribor3m
4.857    2868
4.962    2613
4.963    2487
4.961    1902
4.856    1210
         ... 
3.853       1
3.901       1
0.969       1
0.956       1
3.669       1
Name: count, Length: 316, dtype: int64

In [24]:
df['nr.employed'].value_counts()

nr.employed
5228.1    16234
5099.1     8534
5191.0     7763
5195.8     3683
5076.2     1663
5017.5     1071
4991.6      773
5008.7      650
4963.6      635
5023.5      172
5176.3       10
Name: count, dtype: int64

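There are no NaN or null values, but several categorical columns (`job`, `marital`, `education`, `default`, `housing`, `loan`) use the string `unknown` as a missing-value label. The next cells drop the rows that contain it.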

In [25]:
df = df.drop(df[df['job'] == 'unknown'].index)

In [26]:
df = df.drop(df[df['marital'] == 'unknown'].index)

In [27]:
df = df.drop(df[df['education'] == 'unknown'].index)

In [28]:
df = df.drop(df[df['default'] == 'unknown'].index)

In [29]:
df = df.drop(df[df['housing'] == 'unknown'].index)

In [30]:
df = df.drop(df[df['loan'] == 'unknown'].index)
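The six drops above can also be written as a single boolean mask. The sketch below shows the idea on a small made-up frame; the names `toy`, `cols`, and `toy_clean` are assumptions for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up.
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "technician"],
    "marital": ["married", "single", "unknown", "single"],
    "age": [30, 41, 52, 28],
})

# In the notebook this list would be: job, marital, education, default, housing, loan.
cols = ["job", "marital"]

# Count 'unknown' entries per column before dropping anything.
unknown_counts = (toy[cols] == "unknown").sum()

# Keep only the rows where none of the listed columns is 'unknown'.
mask = (toy[cols] != "unknown").all(axis=1)
toy_clean = toy[mask]

print(unknown_counts)
print(toy_clean)
```

Counting first is worthwhile here: the value counts above show 8,597 `unknown` entries in `default` alone, and after all six drops the 41,188 original rows shrink to the 30,488 reported by `df.info()` below.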

### Problem 4: Understanding the Task

After examining the description and data, your goal now is to clearly state the *Business Objective* of the task.  State the objective below.

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 30488 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             30488 non-null  int64  
 1   job             30488 non-null  object 
 2   marital         30488 non-null  object 
 3   education       30488 non-null  object 
 4   default         30488 non-null  object 
 5   housing         30488 non-null  object 
 6   loan            30488 non-null  object 
 7   contact         30488 non-null  object 
 8   month           30488 non-null  object 
 9   day_of_week     30488 non-null  object 
 10  duration        30488 non-null  int64  
 11  campaign        30488 non-null  int64  
 12  pdays           30488 non-null  int64  
 13  previous        30488 non-null  int64  
 14  poutcome        30488 non-null  object 
 15  emp.var.rate    30488 non-null  float64
 16  cons.price.idx  30488 non-null  float64
 17  cons.conf.idx   30488 non-null  floa

The business objective is to identify which features of a targeted telemarketing campaign predict whether a client will subscribe to a term deposit. Knowing which features drive that outcome lets marketers concentrate on the contacts most likely to convert and reduce the cost of the campaign.

### Problem 5: Engineering Features

Now that you understand your business objective, we will build a basic model to get started.  Before we can do this, we must work to encode the data.  Using just the bank information features (columns 1 - 7), prepare the features and target column for modeling with appropriate encoding and transformations.

In [32]:
# default, housing, loan, and contact can be binary encoded; the other non-numeric columns can be one-hot encoded.

In [33]:
# Keep only the bank client features (columns 1 - 7 of the data description).
X = df[['age', 'job', 'marital', 'education', 'default', 'housing', 'loan']]
y = df['y']

In [34]:
selector = make_column_selector(dtype_include=object)
transformer = make_column_transformer((OneHotEncoder(drop = 'first'), selector),
                                     remainder = StandardScaler())

### Problem 6: Train/Test Split

With your data prepared, split it into a train and test set.

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                   random_state = 42)
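The split above is not stratified. With an imbalanced target, passing `stratify` to `train_test_split` keeps the class proportions roughly equal in both sets. The sketch below illustrates this on made-up data; it is an optional variation, not what the notebook ran.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 90 'no' and 10 'yes' (made-up values).
toy_X = pd.DataFrame({"age": range(100)})
toy_y = pd.Series(["no"] * 90 + ["yes"] * 10)

# stratify=toy_y preserves the 90/10 ratio in both splits as closely as possible.
Xtr, Xte, ytr, yte = train_test_split(toy_X, toy_y, random_state=42, stratify=toy_y)

print(ytr.value_counts(normalize=True))
print(yte.value_counts(normalize=True))
```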

### Problem 7: A Baseline Model

Before we build our first model, we want to establish a baseline.  What is the baseline performance that our classifier should aim to beat?

In [36]:
# Baseline: DummyClassifier ignores the features and, by default,
# always predicts the most frequent class seen during fit.
dummy_clf = DummyClassifier().fit(X_train, y_train)
baseline_score = dummy_clf.score(X_test, y_test)

print(baseline_score)

0.8728680136447127
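With its default strategy, `DummyClassifier` predicts the most frequent class, so its accuracy equals that class's share of the rows it is scored on. The sketch below checks this on made-up data; the names `toy_X`, `toy_y`, and `majority_share` are assumptions for illustration.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Toy data: 8 'no' and 2 'yes' (made-up values).
toy_X = pd.DataFrame({"age": range(10)})
toy_y = pd.Series(["no"] * 8 + ["yes"] * 2)

# Share of the most frequent class.
majority_share = toy_y.value_counts(normalize=True).max()

# The default strategy ("prior") always predicts the most frequent class.
dummy = DummyClassifier().fit(toy_X, toy_y)
dummy_acc = dummy.score(toy_X, toy_y)

print(majority_share, dummy_acc)
```

In the notebook, the equivalent check would be `y_test.value_counts(normalize=True)`, whose largest value should match the baseline score printed above.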


### Problem 8: A Simple Model

Use Logistic Regression to build a basic model on your data.  

In [37]:
# Encode the features, select the most informative ones with a
# logistic-regression-based selector, then fit the final model.
#extractor = SelectFromModel(LogisticRegression(penalty='l1', solver = 'liblinear', random_state = 42))
extractor = SelectFromModel(LogisticRegression())
lgr_pipe = Pipeline([('transformer', transformer),
                    ('selector', extractor),
                    ('lgr', LogisticRegression(random_state=42, max_iter = 1000))])

# Time the fit only.
start = time.time()
lgr_pipe.fit(X_train, y_train)
end = time.time()
print(end - start)


0.19906187057495117


### Problem 9: Score the Model

What is the accuracy of your model?

In [38]:
lgr_test_acc = lgr_pipe.score(X_test, y_test)
lgr_train_acc = lgr_pipe.score(X_train, y_train)

print(lgr_test_acc)
print(lgr_train_acc)

0.8728680136447127
0.8736114755532232
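The test accuracy printed above is identical to the baseline score from Problem 7. Accuracy alone cannot say why, but one plausible explanation is that the model predicts the majority class for every test row. A confusion matrix makes this visible. The sketch below uses synthetic data from `make_classification` as a stand-in for the bank data, so the numbers it prints say nothing about the real model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 90% class 0); an assumption for illustration.
X_syn, y_syn = make_classification(n_samples=500, n_features=5,
                                   weights=[0.9, 0.1], random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_syn, y_syn, random_state=42, stratify=y_syn)

model = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
pred = model.predict(Xte)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(yte, pred, labels=[0, 1])
print(cm)

# How often each label is predicted; a single entry would mean a constant predictor.
labels, counts = np.unique(pred, return_counts=True)
print(dict(zip(labels, counts)))
```

In the notebook, the equivalent call would be `confusion_matrix(y_test, lgr_pipe.predict(X_test))`. A second column of zeros would confirm that the model never predicts `yes`.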


### Problem 10: Model Comparisons

Now, we aim to compare the performance of the Logistic Regression model to our KNN algorithm, Decision Tree, and SVM models.  Using the default settings for each of the models, fit and score each.  Also, be sure to compare the fit time of each of the models.  Present your findings in a `DataFrame` similar to that below:

| Model | Train Time | Train Accuracy | Test Accuracy |
| ----- | ---------- | -------------  | -----------   |
|       |            |                |               |

In [39]:
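#knn
knc = Pipeline([('transform', transformer), ('knn', KNeighborsClassifier())])
start = time.time()
knc.fit(X_train, y_train)
knc_test_acc = knc.score(X_test, y_test)
knc_train_acc = knc.score(X_train, y_train)

print(knc_test_acc)
print(knc_train_acc)
# The timer stops here, so the printed time covers the fit and both score calls.
# For KNN, most of the cost typically falls in prediction rather than fitting.
end = time.time()
print(end - start)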

0.8653896615061664
0.8764103909734978
3.2747063636779785


In [40]:
#decision tree
tree = Pipeline([('transform', transformer), ('dtc', DecisionTreeClassifier())])
start = time.time()
tree.fit(X_train, y_train)
tree_test_acc = tree.score(X_test, y_test)
tree_train_acc = tree.score(X_train, y_train)

print(tree_test_acc)
print(tree_train_acc)
# The timer stops here, so the printed time covers the fit and both score calls.
end = time.time()
print(end - start)

0.8546313303594857
0.9018630280766203
0.3645350933074951


In [41]:
#svc
svc_pipe = Pipeline([('transformer', transformer), ('svc', SVC())])
start = time.time()
svc = svc_pipe.fit(X_train, y_train)
svc_test_acc = svc.score(X_test, y_test)
svc_train_acc = svc.score(X_train, y_train)

print(svc_test_acc)
print(svc_train_acc)
# The timer stops here, so the printed time covers the fit and both score calls.
end = time.time()
print(end - start)

0.8728680136447127
0.8736114755532232
19.52266764640808


In [42]:
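# Scores and times were copied by hand from a run of the cells above.
# The times are single-run wall-clock values, not averages, and for KNN,
# Decision Tree, and SVC they include the two score calls.
# The Decision Tree test score differs slightly from the value printed above;
# the tree is not seeded, so results can change between runs.
res_dict_ = {'model': ['Logistic Regression', 'KNN', 'Decision Tree', 'SVC'],
           'train score': [0.8736114755532232, 0.8764103909734978, 0.9018630280766203,0.8736114755532232],
           'test score': [0.8728680136447127, 0.8653896615061664, 0.8539753345578588,0.8728680136447127],
           'average fit time': [0.1940627098083496, 3.265071392059326, 0.3626134395599365,19.46342372894287]}
results_df_ = pd.DataFrame(res_dict_).set_index('model')
results_df_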

| model | train score | test score | average fit time |
| ----- | ----------- | ---------- | ---------------- |
| Logistic Regression | 0.873611 | 0.872868 | 0.194063 |
| KNN | 0.87641 | 0.86539 | 3.265071 |
| Decision Tree | 0.901863 | 0.853975 | 0.362613 |
| SVC | 0.873611 | 0.872868 | 19.463424 |
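The table above was assembled from hand-copied numbers. A loop that fits, times, and scores each model builds the same kind of table directly from the results. The sketch below runs on synthetic data from `make_classification` (an assumption for illustration) and times the fit only; the names `models`, `rows`, and `results` are hypothetical.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data.
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_syn, y_syn, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),  # seeded for repeatability
    "SVC": SVC(),
}

rows = []
for name, model in models.items():
    start = time.perf_counter()
    model.fit(Xtr, ytr)
    fit_time = time.perf_counter() - start  # fit only; scoring is excluded
    rows.append({
        "model": name,
        "train time": fit_time,
        "train score": model.score(Xtr, ytr),
        "test score": model.score(Xte, yte),
    })

results = pd.DataFrame(rows).set_index("model")
print(results)
```

In the notebook, each dictionary value would be the corresponding pipeline (`lgr_pipe`, `knc`, `tree`, `svc_pipe`) fitted on `X_train` and `y_train`.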


### Problem 11: Improving the Model

Now that we have some basic models on the board, we want to try to improve these.  Below, we list a few things to explore in this pursuit.

- More feature engineering and exploration.  For example, should we keep the gender feature?  Why or why not?
- Hyperparameter tuning and grid search.  All of our models have additional hyperparameters to tune and explore.  For example the number of neighbors in KNN or the maximum depth of a Decision Tree.  
- Adjust your performance metric
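For the last point, `GridSearchCV` accepts a `scoring` argument, so the search can optimize something other than accuracy. On an imbalanced target, balanced accuracy is one option among several. The sketch below uses synthetic data and a plain KNN pipeline as assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 85% class 0), standing in for the bank data.
X_syn, y_syn = make_classification(n_samples=400, n_features=6,
                                   weights=[0.85, 0.15], random_state=42)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Balanced accuracy averages recall over the classes, so always predicting
# the majority class scores 0.5 instead of the majority share.
search = GridSearchCV(pipe,
                      param_grid={"knn__n_neighbors": [1, 5, 25]},
                      scoring="balanced_accuracy",
                      cv=5)
search.fit(X_syn, y_syn)

print(search.best_params_, search.best_score_)
```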

In [48]:
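#lgr
# Note: 'sigmoid' is not a valid LogisticRegression penalty, and 'l1' is only
# supported by the 'liblinear' and 'saga' solvers. Those combinations fail to
# fit and are scored as nan, which is what the warning below reports.
params = {'lgr__penalty': ['l1', 'l2', 'sigmoid'],
         'lgr__solver': ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga']}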
 
grid = GridSearchCV(lgr_pipe, param_grid=params).fit(X_train, y_train)
grid_test_score = grid.score(X_test, y_test)
grid_train_score = grid.score(X_train, y_train)

best_penalty = grid.best_params_['lgr__penalty']
best_solver = grid.best_params_['lgr__solver']

print(grid_test_score)
print(grid_train_score)
print(grid.best_score_)
print(best_penalty)
print(best_solver)

50 fits failed out of a total of 90.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\Bob\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Bob\anaconda3\Lib\site-packages\sklearn\base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Bob\anaconda3\Lib\site-packages\sklearn\pipeline.py", line 420, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "C:\Users\Bob\anaconda3\Lib\site-packages\sklearn\base.py", line 1151, in 

0.8728680136447127
0.8736114755532232
l1
liblinear
0.8735677396203319
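The warning above reports 50 failed fits out of 90, which is consistent with the grid: 3 penalties x 6 solvers x 5 folds gives 90 fits, and the 6 `sigmoid` combinations plus the 4 `l1` combinations with solvers that do not support it give 10 x 5 = 50 failures. One way to avoid them is to pass a list of dictionaries so that only compatible pairs are tried. The sketch below uses synthetic data and a simplified pipeline as assumptions for illustration, and it leaves out `newton-cholesky` so that it also runs on older scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bank data.
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=42)

pipe = Pipeline([("scale", StandardScaler()),
                 ("lgr", LogisticRegression(max_iter=1000, random_state=42))])

# Each dictionary is searched separately, so only valid pairs are fitted.
valid_grid = [
    {"lgr__penalty": ["l1"], "lgr__solver": ["liblinear", "saga"]},
    {"lgr__penalty": ["l2"],
     "lgr__solver": ["lbfgs", "liblinear", "newton-cg", "sag", "saga"]},
]

# error_score="raise" surfaces any remaining invalid combination immediately.
search = GridSearchCV(pipe, param_grid=valid_grid, error_score="raise")
search.fit(X_syn, y_syn)

print(search.best_params_)
```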


In [44]:
#knn
params = {'knn__n_neighbors': [1,5,25,50,100]}
 
grid = GridSearchCV(knc, param_grid=params).fit(X_train, y_train)
grid_test_score = grid.score(X_test, y_test)
grid_train_score = grid.score(X_train, y_train)

best_n_neighbors = grid.best_params_['knn__n_neighbors']
print(grid_test_score)
print(grid_train_score)
print(grid.best_score_)
print(best_n_neighbors)

0.8723432170034112
0.8734802763928977
50


In [45]:
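#DecisionTree
# Note: min_samples_split must be an integer of at least 2 (or a float fraction),
# so the value 1 fails to fit; those fits are scored as nan, as reported below.
params = {'dtc__criterion': ['gini', 'entropy', 'log_loss'],
         'dtc__max_depth':[5,15,30,len(X_train)],
         'dtc__min_samples_split':[1,15,30,len(X_train)]}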
 
grid = GridSearchCV(tree, param_grid=params).fit(X_train, y_train)
grid_test_score = grid.score(X_test, y_test)
grid_train_score = grid.score(X_train, y_train)

best_criterion = grid.best_params_['dtc__criterion']

print(grid_test_score)
print(grid_train_score)
print(grid.best_score_)
print(best_criterion)

0.8728680136447127
0.8736114755532232
gini


60 fits failed out of a total of 240.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
60 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\Bob\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Bob\anaconda3\Lib\site-packages\sklearn\base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Bob\anaconda3\Lib\site-packages\sklearn\pipeline.py", line 420, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "C:\Users\Bob\anaconda3\Lib\site-packages\sklearn\base.py", line 1144, i
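The 60 failed fits out of 240 are consistent with the `min_samples_split` value of 1: 3 criteria x 4 depths x 1 value x 5 folds gives 60. A grid restricted to valid values avoids the failures. The sketch below uses synthetic data and a bare `DecisionTreeClassifier` as assumptions for illustration; the specific grid values are examples only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data.
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=42)

valid_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 15, 30, None],     # None places no limit on depth
    "min_samples_split": [2, 15, 30],   # integer values must be at least 2
}

# error_score="raise" surfaces any invalid combination immediately.
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid=valid_grid,
                      error_score="raise")
search.fit(X_syn, y_syn)

print(search.best_params_)
```

With the notebook's pipeline, the same keys would carry the `dtc__` prefix.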

In [46]:
#svc
params = {'svc__kernel': ['rbf', 'poly', 'linear', 'sigmoid'],
         'svc__gamma': ['scale','auto']}
 
grid = GridSearchCV(svc_pipe, param_grid=params).fit(X_train, y_train)
grid_test_score = grid.score(X_test, y_test)
grid_train_score = grid.score(X_train, y_train)

print(grid_test_score)
print(grid_train_score)
print(grid.best_score_)

0.8728680136447127
0.8736114755532232


In [47]:
best_kernel = grid.best_params_['svc__kernel']
best_gamma = grid.best_params_['svc__gamma']

print(best_kernel)
print(best_gamma)

rbf
scale


##### Questions