# Practical Application III: Comparing Classifiers

**Overview**: In this practical application, your goal is to compare the performance of the classifiers we encountered in this section, namely K Nearest Neighbor, Logistic Regression, Decision Trees, and Support Vector Machines.  We will utilize a dataset related to marketing bank products over the telephone.  



### Getting Started

Our dataset comes from the UCI Machine Learning repository [link](https://archive.ics.uci.edu/ml/datasets/bank+marketing).  The data is from a Portuguese banking institution and is a collection of the results of multiple marketing campaigns.  We will make use of the article accompanying the dataset [here](CRISP-DM-BANK.pdf) for more information on the data and features.



### Problem 1: Understanding the Data

To gain a better understanding of the data, please read the information provided in the UCI link above, and examine the **Materials and Methods** section of the paper.  How many marketing campaigns does this data represent?

The data represents 17 campaigns between May 2008 and November 2010.

### Problem 2: Read in the Data

Use pandas to read in the dataset `bank-additional-full.csv` and assign to a meaningful variable name.

In [1]:
import pandas as pd

In [3]:
df = pd.read_csv('data/bank-additional-full.csv', sep = ';')

In [4]:
df.head()

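|  | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |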


### Problem 3: Understanding the Features


Examine the data description below, and determine if any of the features are missing values or need to be coerced to a different data type.


```
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
```



'job' can be represented with dummy (one-hot) variables instead of strings, since there is no unambiguous way to order its categories. The same applies to 'marital', 'contact', 'day_of_week', and 'poutcome'.

'education' can be represented on an ordinal scale, in the order 'illiterate', 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'university.degree', 'professional.course', with each category mapped to an integer instead of a string.

'default', 'housing', and 'loan' can each be represented with an integer, 1 for 'yes' and 0 for 'no'.

'duration' can most likely be left as is, although rows with a duration of 0 may have to be omitted. Note that the data description recommends discarding 'duration' entirely if the goal is a realistic predictive model, because it is not known before a call is made.

The output variable 'y' should be converted to a binary integer or boolean as well.

The 'unknown' values are effectively missing values, so rows containing them can be dropped from all categorical features, provided enough data remains afterwards. The rest of the features can be left as is.
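
Before deciding whether to drop the 'unknown' rows, it helps to count them. The following is a minimal sketch on a small made-up frame (the `toy` data is an assumption for illustration, not the bank dataset); the same two expressions could be applied to `df`.

```python
import pandas as pd

# Toy frame standing in for the bank data; values are illustrative only.
toy = pd.DataFrame({
    "age": [56, 57, 37, 40],
    "job": ["housemaid", "services", "unknown", "admin."],
    "default": ["no", "unknown", "no", "unknown"],
    "housing": ["no", "no", "yes", "no"],
})

# Count 'unknown' placeholders per column (elementwise comparison, then column sums).
unknown_counts = (toy == "unknown").sum()

# Share of rows that contain at least one 'unknown' value.
rows_with_unknown = (toy == "unknown").any(axis=1).mean()

print(unknown_counts)
print(f"Rows with any unknown: {rows_with_unknown:.0%}")
```

If one column accounts for most of the 'unknown' values, it may be worth keeping 'unknown' as its own category for that column rather than dropping the rows.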

### Problem 4: Understanding the Task

After examining the description and data, your goal now is to clearly state the *Business Objective* of the task.  State the objective below.

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

#### Business Objective:

The business objective of this task is to determine which types of clients are more likely to subscribe to a term deposit, and which features of a campaign are most effective at convincing clients to subscribe. To do so, this task will compare different classification models to determine which one best models the data and predicts client subscriptions, then evaluate and adjust the best model based on its accuracy to understand which types of clients and campaigns are most likely to yield subscriptions.

### Problem 5: Engineering Features

Now that you understand your business objective, we will build a basic model to get started.  Before we can do this, we must work to encode the data.  Using just the bank information features (columns 1 - 7), prepare the features and target column for modeling with appropriate encoding and transformations.

#### Import libraries

In [24]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.inspection import permutation_importance
from matplotlib import rcParams


#### Use just bank information features

Take a slice of the dataframe containing just the bank information features, as requested by "Problem 5", and add the output column.

In [41]:
#Take slice of dataframe with just bank information features
#(.copy() avoids a SettingWithCopyWarning when adding a column below)
bankdf = df.loc[:,:'loan'].copy()
#Add column corresponding to output column in original dataframe
bankdf['y'] = df['y']
bankdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        41188 non-null  int64 
 1   job        41188 non-null  object
 2   marital    41188 non-null  object
 3   education  41188 non-null  object
 4   default    41188 non-null  object
 5   housing    41188 non-null  object
 6   loan       41188 non-null  object
 7   y          41188 non-null  object
dtypes: int64(1), object(7)
memory usage: 2.5+ MB


#### Clean data by removing 'unknown' values

In [50]:
for i in bankdf:
    bankdf = bankdf[bankdf[i]!='unknown']
bankdf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30488 entries, 0 to 41187
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        30488 non-null  int64 
 1   job        30488 non-null  object
 2   marital    30488 non-null  object
 3   education  30488 non-null  object
 4   default    30488 non-null  object
 5   housing    30488 non-null  object
 6   loan       30488 non-null  object
 7   y          30488 non-null  object
dtypes: int64(1), object(7)
memory usage: 2.1+ MB
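
The column-by-column loop above can also be written as a single boolean mask. This is a sketch on a made-up frame (`toy` and `cleaned` are illustrative names); it should produce the same rows as the loop, and resetting the index keeps later positional operations aligned.

```python
import pandas as pd

# Toy frame standing in for bankdf; values are illustrative only.
toy = pd.DataFrame({
    "job": ["housemaid", "unknown", "admin.", "services"],
    "loan": ["no", "no", "unknown", "yes"],
    "y": ["no", "yes", "no", "no"],
})

# Keep only rows in which no column equals the 'unknown' placeholder.
mask = (toy != "unknown").all(axis=1)
cleaned = toy[mask].reset_index(drop=True)

print(f"Kept {len(cleaned)} of {len(toy)} rows")
```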


#### Use encoders to convert object columns to integers.

In [93]:
bankdf

|  | age | job | marital | education | default | housing | loan | y |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 56 | housemaid | married | basic.4y | no | no | no | no |
| 2 | 37 | services | married | high.school | no | yes | no | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | no |
| 4 | 56 | services | married | high.school | no | no | yes | no |
| 6 | 59 | admin. | married | professional.course | no | no | no | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 41183 | 73 | retired | married | professional.course | no | yes | no | yes |
| 41184 | 46 | blue-collar | married | professional.course | no | no | no | no |
| 41185 | 56 | retired | married | university.degree | no | yes | no | no |
| 41186 | 44 | technician | married | professional.course | no | no | no | yes |


In [54]:
#List of columns to use OneHotEncoder on
onehotlist = ['job','marital','default','housing','loan','y']
#List of categories in order for OrdinalEncoder for the column 'education'
edulist = ['illiterate','basic.4y','basic.6y','basic.9y','high.school','university.degree','professional.course']

OneHotEncoder for columns that don't have a meaningful order in the categories

In [111]:
encoder = OneHotEncoder(drop = 'if_binary')
#Fit and Transform columns and store as variable 'array'
array = encoder.fit_transform(bankdf[onehotlist]).toarray()
#Get feature names of each new column after OneHotEncoder
names = encoder.get_feature_names_out(onehotlist)
#Make new dataframe using onehotencoder outputs
onehotdone = pd.DataFrame(array)
onehotdone.columns = names

OrdinalEncoder for the education column.

In [157]:
#Reshape education column to use OrdinalEncoder
eduarray = np.array(bankdf['education']).reshape(-1,1)
#Fit and get array of values after OrdinalEncoder
ordinaldone = OrdinalEncoder(categories = [edulist]).fit_transform(eduarray)
#Add education column to onehotencoder transformed data

onehotdone['education'] = ordinaldone

### Problem 6: Train/Test Split

With your data prepared, split it into a train and test set.

In [166]:
#Train test split of processed dataframe
train, test = train_test_split(onehotdone)
#Split train and test into inputs and outputs as X and y respectively.
X_train = train.drop('y_yes', axis = 1)
y_train = train['y_yes']
X_test = test.drop('y_yes', axis = 1)
y_test = test['y_yes']
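
Because the target is imbalanced, a stratified split with a fixed seed may give more comparable and reproducible scores. The sketch below uses synthetic arrays (an assumption for illustration) and the `stratify` and `random_state` parameters of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 20 positives out of 200 rows (10%).
X = np.arange(200).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 180)

# stratify=y preserves the class proportions in both splits;
# random_state makes the split repeatable between runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(f"Train positive rate: {y_tr.mean():.2f}")
print(f"Test positive rate:  {y_te.mean():.2f}")
```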

### Problem 7: A Baseline Model

Before we build our first model, we want to establish a baseline.  What is the baseline performance that our classifier should aim to beat?

In [176]:
#Check to see which output value is the most probable
(bankdf['y'] == 'yes').sum() / bankdf['y'].count()

#Since 'yes' only shows up roughly 12.66% of the time, the baseline model will always predict 'no'.

0.1265743899239045

In [194]:
#Create baseline predictions
def basepredict(x):
    #Predict 'no' (0.0) for every row of inputs, returned as an array
    return np.zeros(len(x))
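
scikit-learn also provides `DummyClassifier`, which implements the same majority-class baseline behind the usual `fit`/`predict` interface. A minimal sketch on synthetic labels (the data here is an assumption for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic data: 87 'no' (0) and 13 'yes' (1); the features are ignored by the dummy model.
X = np.zeros((100, 1))
y = np.array([0] * 87 + [1] * 13)

# 'most_frequent' always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = accuracy_score(y, baseline.predict(X))

print(f"Baseline accuracy: {baseline_acc:.2f}")
```

The baseline accuracy equals the share of the majority class, which is why a classifier on imbalanced data can score well without learning anything.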

### Problem 8: A Simple Model

Use Logistic Regression to build a basic model on your data.  

In [189]:
#Import libraries for models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

In [234]:
#Build very basic model, fit to training data.
import time
start = time.time()
lr = LogisticRegression(max_iter = 1000).fit(X_train,y_train)
lrtraintime = time.time() - start

### Problem 9: Score the Model

What is the accuracy of your model?

In [235]:
from sklearn.metrics import accuracy_score

In [236]:
#Find scores of logistic regression predictions

lrtrainscore = accuracy_score(y_train,lr.predict(X_train))
lrtestscore = accuracy_score(y_test,lr.predict(X_test))

print('Logistic Train Accuracy: ' + str(lrtrainscore) + '\nLogistic Test Accuracy: ' + str(lrtestscore))

#Format data for dataframe later
lrdata = ['Logistic Basic', lrtrainscore, lrtestscore, lrtraintime]

Logistic Train Accuracy: 0.8731741450188052
Logistic Test Accuracy: 0.8741800052479665


In [237]:
#Find scores of baseline prediction
basetrainscore = accuracy_score(y_train,basepredict(X_train))
basetestscore = accuracy_score(y_test,basepredict(X_test))
print('Baseline Train Accuracy: ' + str(basetrainscore) + '\nBaseline Test Accuracy: ' + str(basetestscore))

Baseline Train Accuracy: 0.8731741450188052
Baseline Test Accuracy: 0.8741800052479665


The basic Logistic Regression train and test scores are identical to the baseline scores, which strongly suggests that the model predicts 'no' for every client, just as the baseline does. On accuracy alone, the basic logistic regression model is therefore no better than the baseline.
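
One way to check whether a model has collapsed to the majority class is to look at its confusion matrix and recall rather than accuracy alone. The sketch below uses hand-made label arrays (an assumption for illustration) in place of the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Synthetic labels: 87 'no' (0) and 13 'yes' (1).
y_true = np.array([0] * 87 + [1] * 13)
# A model that never predicts 'yes'.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)     # share of actual 'yes' cases found
cm = confusion_matrix(y_true, y_pred)  # rows: actual class, columns: predicted class

print(f"Accuracy: {acc:.2f}, Recall: {rec:.2f}")
print(cm)
```

An all-zero second column in the confusion matrix means the model never predicted the positive class, even though its accuracy looks high.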

### Problem 10: Model Comparisons

Now, we aim to compare the performance of the Logistic Regression model to our KNN algorithm, Decision Tree, and SVM models.  Using the default settings for each of the models, fit and score each.  Also, be sure to compare the fit time of each of the models.  Present your findings in a `DataFrame` similar to that below:

| Model | Train Time | Train Accuracy | Test Accuracy |
| ----- | ---------- | -------------  | -----------   |
|     |    |.     |.     |

In [238]:
import time

KNN Model

In [239]:
#Fit and score basic KNN algorithm
#Find current time
start = time.time()
#Train model by fitting data
kbasic = KNeighborsClassifier().fit(X_train,y_train)
#Calculate time taken to fit data
kbtraintime = time.time() - start

#Get accuracy score from training and testing
kbtrainscore = accuracy_score(y_train,kbasic.predict(X_train))
kbtestscore = accuracy_score(y_test,kbasic.predict(X_test))
#Format data for dataframe later
kbdata = ['KNN Basic', kbtrainscore, kbtestscore, kbtraintime]
print('KNN Basic Train Accuracy: ' + str(kbtrainscore) + '\nKNN Basic Test Accuracy: ' + str(kbtestscore) + '\nTrain Time: ' + str(kbtraintime))

KNN Basic Train Accuracy: 0.8635091402081694
KNN Basic Test Accuracy: 0.8527945421149304
Train Time: 0.0020012855529785156


Decision Tree Model

In [240]:
#Fit and score basic Decision Tree model
#Find current time
start = time.time()
#Train model by fitting data
dbasic = DecisionTreeClassifier().fit(X_train, y_train)
#Calculate time taken to fit data
dbtime = time.time() - start

#Get accuracy score from training and testing
dbtrainscore = accuracy_score(y_train, dbasic.predict(X_train))
dbtestscore = accuracy_score(y_test, dbasic.predict(X_test))
#Format data for dataframe later
dbdata = ['Decision Tree Basic',dbtrainscore, dbtestscore, dbtime]
print('Decision Tree Basic Train Accuracy: ' + str(dbtrainscore) + '\nDecision Tree Basic Test Accuracy: ' + str(dbtestscore) + '\nTrain Time ' + str(dbtime))

Decision Tree Basic Train Accuracy: 0.8743986705151754
Decision Tree Basic Test Accuracy: 0.8719496195224351
Train Time 0.018053054809570312


SVM Model

In [241]:
#Fit and score basic SVM model
#Find current time
start = time.time()
#Train model by fitting data
sbasic = SVC().fit(X_train,y_train)
#Calculate time taken to fit data
sbtime = time.time() - start

#Get accuracy score from training and testing
sbtrainscore = accuracy_score(y_train, sbasic.predict(X_train))
sbtestscore = accuracy_score(y_test, sbasic.predict(X_test))
#Format data for dataframe later
sbdata = ['SVM Basic',sbtrainscore,sbtestscore,sbtime]
print('SVM Basic Train Accuracy: ' + str(sbtrainscore) + '\nSVM Basic Test Accuracy ' + str(sbtestscore) + '\nTrain Time ' + str(sbtime))

SVM Basic Train Accuracy: 0.8731741450188052
SVM Basic Test Accuracy 0.8741800052479665
Train Time 3.896533727645874


In [247]:
Comparisondf = (pd.DataFrame([lrdata,kbdata, dbdata, sbdata]))
Comparisondf.columns = ['Model','Train Accuracy','Test Accuracy','Train Time']
Comparisondf.set_index('Model')

| Model | Train Accuracy | Test Accuracy | Train Time |
| ----- | -------------- | ------------- | ---------- |
| Logistic Basic | 0.873174 | 0.87418 | 0.073473 |
| KNN Basic | 0.863509 | 0.852795 | 0.002001 |
| Decision Tree Basic | 0.874399 | 0.87195 | 0.018053 |
| SVM Basic | 0.873174 | 0.87418 | 3.896534 |
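
The four fit-and-score cells above follow the same pattern, so they could be folded into one loop. This is a sketch on synthetic data from `make_classification` (an assumption for illustration, so the numbers will not match the table above); `time.perf_counter` is used here because it is better suited to measuring short intervals than `time.time`.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the encoded bank features.
X, y = make_classification(n_samples=500, n_features=8, weights=[0.87], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "Logistic Basic": LogisticRegression(max_iter=1000),
    "KNN Basic": KNeighborsClassifier(),
    "Decision Tree Basic": DecisionTreeClassifier(random_state=0),
    "SVM Basic": SVC(),
}

rows = []
for name, model in models.items():
    # Time only the fit step, as in the cells above.
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    fit_time = time.perf_counter() - start
    rows.append({
        "Model": name,
        "Train Time": fit_time,
        "Train Accuracy": accuracy_score(y_tr, model.predict(X_tr)),
        "Test Accuracy": accuracy_score(y_te, model.predict(X_te)),
    })

results = pd.DataFrame(rows).set_index("Model")
print(results)
```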


### Problem 11: Improving the Model

Now that we have some basic models on the board, we want to try to improve these.  Below, we list a few things to explore in this pursuit.

- More feature engineering and exploration.  For example, should we keep the gender feature?  Why or why not?
- Hyperparameter tuning and grid search.  All of our models have additional hyperparameters to tune and explore.  For example the number of neighbors in KNN or the maximum depth of a Decision Tree.  
- Adjust your performance metric
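
As one possible starting point for the last two bullets, the sketch below runs a small grid search over a Decision Tree and scores it with F1 instead of accuracy. The synthetic data, the parameter grid, and the choice of F1 are all assumptions for illustration; other metrics such as recall or ROC AUC may suit the business objective better.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the encoded bank features.
X, y = make_classification(n_samples=400, n_features=6, weights=[0.85], random_state=0)

# Hypothetical grid; the values are illustrative, not tuned recommendations.
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1",  # rewards finding the minority 'yes' class, unlike plain accuracy
    cv=5,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```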

##### Questions