# Practical Application III: Comparing Classifiers

**Overview**: In this practical application, your goal is to compare the performance of the classifiers we encountered in this section, namely K Nearest Neighbor, Logistic Regression, Decision Trees, and Support Vector Machines.  We will utilize a dataset related to marketing bank products over the telephone.  



### Getting Started

Our dataset comes from the UCI Machine Learning repository [link](https://archive.ics.uci.edu/ml/datasets/bank+marketing).  The data is from a Portugese banking institution and is a collection of the results of multiple marketing campaigns.  We will make use of the article accompanying the dataset [here](CRISP-DM-BANK.pdf) for more information on the data and features.



### Problem 1: Understanding the Data

To gain a better understanding of the data, please read the information provided in the UCI link above, and examine the **Materials and Methods** section of the paper.  How many marketing campaigns does this data represent?

#### Problem 1: Answer
**The dataset collected is related to 17 campaigns** that occurred between May 2008 and November 2010, corresponding to a total of 79354 contacts based on the paper. The actual data used is slightly different actually, and it has a total of 41188 contacts. The dataset includes various client characteristic features such as age, job, marital, education, default, balance, housing, loan, contact, day of week as well as a term deposit subscription status.

### Problem 2: Read in the Data

Use pandas to read in the dataset `bank-additional-full.csv` and assign to a meaningful variable name.

In [109]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import time
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, precision_recall_curve, roc_curve
pd.reset_option('display.max_rows')

In [110]:
df = pd.read_csv('data/bank-additional-full.csv', sep = ';')

In [111]:
df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


### Problem 3: Understanding the Features


Examine the data description below, and determine if any of the features are missing values or need to be coerced to a different data type.


```
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
```



In [112]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

There are no null data. There are features that are object, which will be modified to integer later using one hot encoder. There is no gender feature.

### Problem 4: Understanding the Task

After examining the description and data, your goal now is to clearly state the *Business Objective* of the task.  State the objective below.

**Business Objective**

**The business objective of the task is to execute the direct marketing campaigns efficiently, considering ROI based on the result of the analysis of the previous campaign data.** The analysis here is essentially to predict if a new client will subscribe to a term deposit (Yes or No) through a campaign using phone calls or e-mail. For prediction, we will make classification models, which include a logistic regression model with multiple inputs (1/1+exp(-β0-β1*x1-β2*x2 ....) using OneHotEncoder as needed for categorical features and other classification techniques. Since there are relatively many features, we may want to use the L1 or L2 penalty to optimize the model calculation cost. It is ideal for a model to output a probability, so the number of target customers can be adjusted by changing the threshold depending on the budget of this project. This time, given that the campaign media cost is relatively high, like phone calls, our priority is to maximize "accuracy" rather than minimize false negatives at the cost of accepting false positives, which is suitable for low-cost campaigns like e-mail.

### Problem 5: Engineering Features

Now that you understand your business objective, we will build a basic model to get started.  Before we can do this, we must work to encode the data.  Using just the bank information features (columns 1 - 7), prepare the features and target column for modeling with appropriate encoding and transformations.

In [113]:
dfs = df.iloc[:, [0,1,2,3,4,5,6,-1]]
dfs

Unnamed: 0,age,job,marital,education,default,housing,loan,y
0,56,housemaid,married,basic.4y,no,no,no,no
1,57,services,married,high.school,unknown,no,no,no
2,37,services,married,high.school,no,yes,no,no
3,40,admin.,married,basic.6y,no,no,no,no
4,56,services,married,high.school,no,no,yes,no
...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,yes
41184,46,blue-collar,married,professional.course,no,no,no,no
41185,56,retired,married,university.degree,no,yes,no,no
41186,44,technician,married,professional.course,no,no,no,yes


In [114]:
columns_categorical = ["job", "marital", "education", "default", "housing", "loan", "y"]
ct = make_column_transformer((OneHotEncoder(drop = 'if_binary', sparse_output = False).set_output(transform = "pandas"), [columns_categorical]), 
                            remainder='passthrough')
ohe = OneHotEncoder(drop = 'if_binary', sparse_output = False).set_output(transform = "pandas") #sparse_output = False => Return Pandas dataframe

No need to use scaler here at this point due to no coefficient cost function involved.

In [115]:
dfohe = ohe.fit_transform(dfs[columns_categorical])
dfohe.info()
dfohe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 34 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   job_admin.                     41188 non-null  float64
 1   job_blue-collar                41188 non-null  float64
 2   job_entrepreneur               41188 non-null  float64
 3   job_housemaid                  41188 non-null  float64
 4   job_management                 41188 non-null  float64
 5   job_retired                    41188 non-null  float64
 6   job_self-employed              41188 non-null  float64
 7   job_services                   41188 non-null  float64
 8   job_student                    41188 non-null  float64
 9   job_technician                 41188 non-null  float64
 10  job_unemployed                 41188 non-null  float64
 11  job_unknown                    41188 non-null  float64
 12  marital_divorced               41188 non-null 

Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,...,default_no,default_unknown,default_yes,housing_no,housing_unknown,housing_yes,loan_no,loan_unknown,loan_yes,y_yes
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0
41184,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
41185,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
41186,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0


The specified 6 features (except for age) and y are changed to float through OHE. 

In [116]:
dfageohe = pd.concat([dfs["age"], dfohe], axis = 1)
dfageohe.info()
dfageohe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 35 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   age                            41188 non-null  int64  
 1   job_admin.                     41188 non-null  float64
 2   job_blue-collar                41188 non-null  float64
 3   job_entrepreneur               41188 non-null  float64
 4   job_housemaid                  41188 non-null  float64
 5   job_management                 41188 non-null  float64
 6   job_retired                    41188 non-null  float64
 7   job_self-employed              41188 non-null  float64
 8   job_services                   41188 non-null  float64
 9   job_student                    41188 non-null  float64
 10  job_technician                 41188 non-null  float64
 11  job_unemployed                 41188 non-null  float64
 12  job_unknown                    41188 non-null 

Unnamed: 0,age,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,...,default_no,default_unknown,default_yes,housing_no,housing_unknown,housing_yes,loan_no,loan_unknown,loan_yes,y_yes
0,56,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
1,57,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,37,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
3,40,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
4,56,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0
41184,46,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
41185,56,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
41186,44,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0


### Problem 6: Train/Test Split

With your data prepared, split it into a train and test set.

In [117]:
X = dfageohe.iloc[:, :-1]
y = dfageohe['y_yes']

In [118]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 34 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   age                            41188 non-null  int64  
 1   job_admin.                     41188 non-null  float64
 2   job_blue-collar                41188 non-null  float64
 3   job_entrepreneur               41188 non-null  float64
 4   job_housemaid                  41188 non-null  float64
 5   job_management                 41188 non-null  float64
 6   job_retired                    41188 non-null  float64
 7   job_self-employed              41188 non-null  float64
 8   job_services                   41188 non-null  float64
 9   job_student                    41188 non-null  float64
 10  job_technician                 41188 non-null  float64
 11  job_unemployed                 41188 non-null  float64
 12  job_unknown                    41188 non-null 

In [119]:
y

0        0.0
1        0.0
2        0.0
3        0.0
4        0.0
        ... 
41183    1.0
41184    0.0
41185    0.0
41186    1.0
41187    0.0
Name: y_yes, Length: 41188, dtype: float64

In [120]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, stratify = y, random_state = 42)


There is actually not so much y = yes raito distortion even if "stratify = y" is not applied.

### Problem 7: A Baseline Model

Before we build our first model, we want to establish a baseline.  What is the baseline performance that our classifier should aim to beat?

In [121]:
baseline = df['y'].value_counts(normalize = True)["yes"]
print(baseline)

0.11265417111780131


In [122]:
baseline = y_train.value_counts(normalize = True)[1]
print(baseline)
baseline = y_test.value_counts(normalize = True)[1]
print(baseline)

0.11265417111780131
0.11265417111780131


**The basedline performance is the ratio of column "y" = "yes" of the dataset, which is 11% accuracy.**

### Problem 8: A Simple Model

Use Logistic Regression to build a basic model on your data.  

In [123]:
#This is the function to calculate time used for processing sessions and create a combined result table
def model_resut(time0, time1, time2, time3, modelname, train_score, test_score, bp_all):
    time1_0 = time1 - time0
    time2_1 = time2 - time1
    time3_1 = time3 - time1
    time3_2 = time3 - time2
    time3_0 = time3 - time0

    print(f"Train time : {time1_0} seconds")
    print(f"Train score: {time2_1} seconds")
    print(f"Test  score: {time3_2} seconds")
    print(f"Total time : {time3_0} seconds")

    bp_temp = pd.DataFrame()
    bp_temp["Model"] = [modelname]
    bp_temp["Train Time"] = time1_0
    bp_temp["Score Time"] = time3_1
    bp_temp["Train Accuracy"] = train_score
    bp_temp["Test Accuracy"] = test_score
    column_order = ["Model", "Train Time", "Score Time", "Train Accuracy", "Test Accuracy"]
    bp_temp = bp_temp[column_order]
    # Concatenate bp_temp with existing DataFrame (if any)
    if 'bp_all' not in locals():
        bp_all = bp_temp
    else:
        bp_all = pd.concat([bp_all, bp_temp], axis=0, ignore_index=True)
    return bp_all

In [124]:
bp_all = pd.DataFrame()

modelname = "LogisticRegression"
time0 = time.time()
lgr = LogisticRegression(max_iter=1000).fit(X_train, y_train)  #max_iter=100 is default
time1 = time.time()
train_score = lgr.score(X_train, y_train)
time2 = time.time()
test_score = lgr.score(X_test, y_test)
time3 = time.time()
print(train_score)
print(test_score)

bp_all = model_resut(time0, time1, time2, time3, modelname, train_score, test_score, bp_all)

0.8873458288821987
0.8873458288821987
Train time : 0.3981766700744629 seconds
Train score: 0.008728265762329102 seconds
Test  score: 0.004002809524536133 seconds
Total time : 0.4109077453613281 seconds


In [125]:
bp_all

Unnamed: 0,Model,Train Time,Score Time,Train Accuracy,Test Accuracy
0,LogisticRegression,0.398177,0.012731,0.887346,0.887346


### Problem 9: Score the Model

What is the accuracy of your model?

In [126]:
print(train_score)
print(test_score)

0.8873458288821987
0.8873458288821987


### Problem 10: Model Comparisons

Now, we aim to compare the performance of the Logistic Regression model to our KNN algorithm, Decision Tree, and SVM models.  Using the default settings for each of the models, fit and score each.  Also, be sure to compare the fit time of each of the models.  Present your findings in a `DataFrame` similar to that below:

| Model | Train Time | Train Accuracy | Test Accuracy |
| ----- | ---------- | -------------  | -----------   |
|     |    |.     |.     |

In [127]:
modelname = "KNeighborsClassifier"
time0 = time.time()
knn = KNeighborsClassifier()   # default 5
knn.fit(X_train, y_train)
time1 = time.time()
train_score = knn.score(X_train, y_train)
time2 = time.time()
test_score = knn.score(X_test, y_test)
time3 = time.time()
print(train_score)
print(test_score)

bp_all = model_resut(time0, time1, time2, time3, modelname, train_score, test_score, bp_all)

0.8922339840082872
0.8782169563950665
Train time : 0.016115188598632812 seconds
Train score: 2.9885315895080566 seconds
Test  score: 1.017366886138916 seconds
Total time : 4.0220136642456055 seconds


In [128]:
bp_all

Unnamed: 0,Model,Train Time,Score Time,Train Accuracy,Test Accuracy
0,LogisticRegression,0.398177,0.012731,0.887346,0.887346
1,KNeighborsClassifier,0.016115,4.005898,0.892234,0.878217


In [129]:
modelname = "DecisionTreeClassifier"
time0 = time.time()
dt = DecisionTreeClassifier().fit(X_train, y_train)
time1 = time.time()
train_score = dt.score(X_train, y_train)
time2 = time.time()
test_score = dt.score(X_test, y_test)
time3 = time.time()
print(train_score)
print(test_score)

bp_all = model_resut(time0, time1, time2, time3, modelname, train_score, test_score, bp_all)

0.9177754038393059
0.8668544236185297
Train time : 0.11903858184814453 seconds
Train score: 0.015798568725585938 seconds
Test  score: 0.009311199188232422 seconds
Total time : 0.1441483497619629 seconds


In [130]:
bp_all

Unnamed: 0,Model,Train Time,Score Time,Train Accuracy,Test Accuracy
0,LogisticRegression,0.398177,0.012731,0.887346,0.887346
1,KNeighborsClassifier,0.016115,4.005898,0.892234,0.878217
2,DecisionTreeClassifier,0.119039,0.02511,0.917775,0.866854


In [131]:
modelname = "SVC RBF"
time0 = time.time()
svc_poly = SVC().fit(X_train, y_train)    # Default is rbf
time1 = time.time()
train_score = svc_poly.score(X_train, y_train)
time2 = time.time()
test_score = svc_poly.score(X_test, y_test)
time3 = time.time()
print(train_score)
print(test_score)

bp_all = model_resut(time0, time1, time2, time3, modelname, train_score, test_score, bp_all)

0.8873458288821987
0.8873458288821987
Train time : 12.719707727432251 seconds
Train score: 27.87177038192749 seconds
Test  score: 8.437405109405518 seconds
Total time : 49.02888321876526 seconds


In [132]:
bp_all

Unnamed: 0,Model,Train Time,Score Time,Train Accuracy,Test Accuracy
0,LogisticRegression,0.398177,0.012731,0.887346,0.887346
1,KNeighborsClassifier,0.016115,4.005898,0.892234,0.878217
2,DecisionTreeClassifier,0.119039,0.02511,0.917775,0.866854
3,SVC RBF,12.719708,36.309175,0.887346,0.887346


In [133]:
modelname = "SVC poly"
time0 = time.time()
svc = SVC(kernel = 'poly').fit(X_train, y_train)
time1 = time.time()
train_score = svc.score(X_train, y_train)
time2 = time.time()
test_score = svc.score(X_test, y_test)
time3 = time.time()
print(train_score)
print(test_score)

bp_all = model_resut(time0, time1, time2, time3, modelname, train_score, test_score, bp_all)

0.8873458288821987
0.8873458288821987
Train time : 37.77131700515747 seconds
Train score: 6.039712190628052 seconds
Test  score: 2.175581455230713 seconds
Total time : 45.986610651016235 seconds


In [134]:
bp_all

Unnamed: 0,Model,Train Time,Score Time,Train Accuracy,Test Accuracy
0,LogisticRegression,0.398177,0.012731,0.887346,0.887346
1,KNeighborsClassifier,0.016115,4.005898,0.892234,0.878217
2,DecisionTreeClassifier,0.119039,0.02511,0.917775,0.866854
3,SVC RBF,12.719708,36.309175,0.887346,0.887346
4,SVC poly,37.771317,8.215294,0.887346,0.887346


In [135]:
modelname = "SVC linear"
time0 = time.time()
svc_linear = SVC(kernel = 'linear').fit(X_train, y_train)
time1 = time.time()
train_score = svc_linear.score(X_train, y_train)
time2 = time.time()
test_score = svc_linear.score(X_test, y_test)
time3 = time.time()
print(train_score)
print(test_score)

bp_all = model_resut(time0, time1, time2, time3, modelname, train_score, test_score, bp_all)

0.8873458288821987
0.8873458288821987
Train time : 38.791096687316895 seconds
Train score: 5.377674579620361 seconds
Test  score: 1.4460947513580322 seconds
Total time : 45.61486601829529 seconds


In [136]:
bp_all

Unnamed: 0,Model,Train Time,Score Time,Train Accuracy,Test Accuracy
0,LogisticRegression,0.398177,0.012731,0.887346,0.887346
1,KNeighborsClassifier,0.016115,4.005898,0.892234,0.878217
2,DecisionTreeClassifier,0.119039,0.02511,0.917775,0.866854
3,SVC RBF,12.719708,36.309175,0.887346,0.887346
4,SVC poly,37.771317,8.215294,0.887346,0.887346
5,SVC linear,38.791097,6.823769,0.887346,0.887346


In [137]:
modelname = "SVC sigmoid"
time0 = time.time()
svc_sigmoid = SVC(kernel = 'sigmoid').fit(X_train, y_train)
time1 = time.time()
train_score = svc_sigmoid.score(X_train, y_train)
time2 = time.time()
test_score = svc_sigmoid.score(X_test, y_test)
time3 = time.time()
print(train_score)
print(test_score)

bp_all = model_resut(time0, time1, time2, time3, modelname, train_score, test_score, bp_all)

0.8822958143148489
0.8823929299796057
Train time : 16.4435818195343 seconds
Train score: 7.415475845336914 seconds
Test  score: 3.484466314315796 seconds
Total time : 27.34352397918701 seconds


In [138]:
bp_all

Unnamed: 0,Model,Train Time,Score Time,Train Accuracy,Test Accuracy
0,LogisticRegression,0.398177,0.012731,0.887346,0.887346
1,KNeighborsClassifier,0.016115,4.005898,0.892234,0.878217
2,DecisionTreeClassifier,0.119039,0.02511,0.917775,0.866854
3,SVC RBF,12.719708,36.309175,0.887346,0.887346
4,SVC poly,37.771317,8.215294,0.887346,0.887346
5,SVC linear,38.791097,6.823769,0.887346,0.887346
6,SVC sigmoid,16.443582,10.899942,0.882296,0.882393


### Problem 11: Improving the Model

Now that we have some basic models on the board, we want to try to improve these.  Below, we list a few things to explore in this pursuit.

- More feature engineering and exploration.  For example, should we keep the gender feature?  Why or why not?
- Hyperparameter tuning and grid search.  All of our models have additional hyperparameters to tune and explore.  For example the number of neighbors in KNN or the maximum depth of a Decision Tree.  
- Adjust your performance metric

In [139]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

In [140]:
columns_categorical2 = ["job", "marital", "education", "default", "housing", "loan", "contact", "month", "day_of_week", "poutcome", "y"]
ct = make_column_transformer((OneHotEncoder(drop = 'if_binary', sparse_output = False).set_output(transform = "pandas"), columns_categorical2), 
                            remainder='passthrough')  # Actually not used.
ohe = OneHotEncoder(drop = 'if_binary', sparse_output = False).set_output(transform = "pandas")

In [141]:
dfohe2 = ohe.fit_transform(df[columns_categorical2])
dfohe2.info()
dfohe2

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 53 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   job_admin.                     41188 non-null  float64
 1   job_blue-collar                41188 non-null  float64
 2   job_entrepreneur               41188 non-null  float64
 3   job_housemaid                  41188 non-null  float64
 4   job_management                 41188 non-null  float64
 5   job_retired                    41188 non-null  float64
 6   job_self-employed              41188 non-null  float64
 7   job_services                   41188 non-null  float64
 8   job_student                    41188 non-null  float64
 9   job_technician                 41188 non-null  float64
 10  job_unemployed                 41188 non-null  float64
 11  job_unknown                    41188 non-null  float64
 12  marital_divorced               41188 non-null 

Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,...,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success,y_yes
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
41184,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
41185,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
41186,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0


In [142]:
y2 = dfohe2['y_yes']
dfohe2 = dfohe2.iloc[:, :-1]

In [143]:
dfageohe2 = pd.concat([df["age"], dfohe2, df[["duration", "campaign", "pdays", "previous", "emp.var.rate", "cons.price.idx", "cons.conf.idx", "euribor3m", "nr.employed"]]], axis = 1)
dfageohe2.info()
dfageohe2

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 62 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   age                            41188 non-null  int64  
 1   job_admin.                     41188 non-null  float64
 2   job_blue-collar                41188 non-null  float64
 3   job_entrepreneur               41188 non-null  float64
 4   job_housemaid                  41188 non-null  float64
 5   job_management                 41188 non-null  float64
 6   job_retired                    41188 non-null  float64
 7   job_self-employed              41188 non-null  float64
 8   job_services                   41188 non-null  float64
 9   job_student                    41188 non-null  float64
 10  job_technician                 41188 non-null  float64
 11  job_unemployed                 41188 non-null  float64
 12  job_unknown                    41188 non-null 

Unnamed: 0,age,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,...,poutcome_success,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
0,56,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,261,1,999,0,1.1,93.994,-36.4,4.857,5191.0
1,57,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,149,1,999,0,1.1,93.994,-36.4,4.857,5191.0
2,37,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,226,1,999,0,1.1,93.994,-36.4,4.857,5191.0
3,40,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,151,1,999,0,1.1,93.994,-36.4,4.857,5191.0
4,56,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,307,1,999,0,1.1,93.994,-36.4,4.857,5191.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,334,1,999,0,-1.1,94.767,-50.8,1.028,4963.6
41184,46,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,383,1,999,0,-1.1,94.767,-50.8,1.028,4963.6
41185,56,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,189,2,999,0,-1.1,94.767,-50.8,1.028,4963.6
41186,44,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,442,1,999,0,-1.1,94.767,-50.8,1.028,4963.6


In [144]:
X2 = dfageohe2
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size = 0.25, stratify = y2, random_state = 42)

In [145]:
modelname = "KNeighborsClassifier all features"

X_train = X2_train
y_train = y2_train
X_test = X2_test
y_test = y2_test

time0 = time.time()
knn = KNeighborsClassifier()   # default 5
knn.fit(X_train, y_train)
time1 = time.time()
train_score = knn.score(X_train, y_train)
time2 = time.time()
test_score = knn.score(X_test, y_test)
time3 = time.time()
print(train_score)
print(test_score)

bp_all = model_resut(time0, time1, time2, time3, modelname, train_score, test_score, bp_all)

0.9308212748049594
0.9052151111974361
Train time : 0.018018722534179688 seconds
Train score: 3.502765417098999 seconds
Test  score: 1.116819143295288 seconds
Total time : 4.637603282928467 seconds


In [146]:
bp_all

Unnamed: 0,Model,Train Time,Score Time,Train Accuracy,Test Accuracy
0,LogisticRegression,0.398177,0.012731,0.887346,0.887346
1,KNeighborsClassifier,0.016115,4.005898,0.892234,0.878217
2,DecisionTreeClassifier,0.119039,0.02511,0.917775,0.866854
3,SVC RBF,12.719708,36.309175,0.887346,0.887346
4,SVC poly,37.771317,8.215294,0.887346,0.887346
5,SVC linear,38.791097,6.823769,0.887346,0.887346
6,SVC sigmoid,16.443582,10.899942,0.882296,0.882393
7,KNeighborsClassifier all features,0.018019,4.619585,0.930821,0.905215


In [147]:
modelname = "KNeighborsClassifier all features grid (1, 22, 2)"

X_train = X2_train
y_train = y2_train
X_test = X2_test
y_test = y2_test

params = {'n_neighbors': list(range(1, 22, 2))}
knn = KNeighborsClassifier()
knn_grid = GridSearchCV(knn, param_grid=params)

time0 = time.time()
knn_grid.fit(X_train, y_train)
time1 = time.time()
test_score = knn_grid.score(X_train, y_train)   # best accuray model for train
time2 = time.time()
test_score = knn_grid.score(X_test, y_test)   # best accuray model for test
time3 = time.time()

best_k = list(knn_grid.best_params_.values())[0]

print(train_score)
print(test_score)

bp_all = model_resut(time0, time1, time2, time3, modelname, train_score, test_score, bp_all)
print(best_k)

0.9308212748049594
0.9133728270370011
Train time : 37.00741219520569 seconds
Train score: 3.000553846359253 seconds
Test  score: 1.0277698040008545 seconds
Total time : 41.035735845565796 seconds
19


In [148]:
bp_all

Unnamed: 0,Model,Train Time,Score Time,Train Accuracy,Test Accuracy
0,LogisticRegression,0.398177,0.012731,0.887346,0.887346
1,KNeighborsClassifier,0.016115,4.005898,0.892234,0.878217
2,DecisionTreeClassifier,0.119039,0.02511,0.917775,0.866854
3,SVC RBF,12.719708,36.309175,0.887346,0.887346
4,SVC poly,37.771317,8.215294,0.887346,0.887346
5,SVC linear,38.791097,6.823769,0.887346,0.887346
6,SVC sigmoid,16.443582,10.899942,0.882296,0.882393
7,KNeighborsClassifier all features,0.018019,4.619585,0.930821,0.905215
8,"KNeighborsClassifier all features grid (1, 22, 2)",37.007412,4.028324,0.930821,0.913373


In [149]:
modelname = "DecisionTreeClassifier all features"

X_train = X2_train
y_train = y2_train
X_test = X2_test
y_test = y2_test

time0 = time.time()
dt = DecisionTreeClassifier().fit(X_train, y_train)
time1 = time.time()
train_score = dt.score(X_train, y_train)
time2 = time.time()
test_score = dt.score(X_test, y_test)
time3 = time.time()
print(train_score)
print(test_score)

bp_all = model_resut(time0, time1, time2, time3, modelname, train_score, test_score, bp_all)

1.0
0.8866660192289016
Train time : 0.23891139030456543 seconds
Train score: 0.017362356185913086 seconds
Test  score: 0.006999015808105469 seconds
Total time : 0.263272762298584 seconds


In [150]:
bp_all

Unnamed: 0,Model,Train Time,Score Time,Train Accuracy,Test Accuracy
0,LogisticRegression,0.398177,0.012731,0.887346,0.887346
1,KNeighborsClassifier,0.016115,4.005898,0.892234,0.878217
2,DecisionTreeClassifier,0.119039,0.02511,0.917775,0.866854
3,SVC RBF,12.719708,36.309175,0.887346,0.887346
4,SVC poly,37.771317,8.215294,0.887346,0.887346
5,SVC linear,38.791097,6.823769,0.887346,0.887346
6,SVC sigmoid,16.443582,10.899942,0.882296,0.882393
7,KNeighborsClassifier all features,0.018019,4.619585,0.930821,0.905215
8,"KNeighborsClassifier all features grid (1, 22, 2)",37.007412,4.028324,0.930821,0.913373
9,DecisionTreeClassifier all features,0.238911,0.024361,1.0,0.886666


In [151]:
modelname = "LogisticRegression all features"

X_train = X2_train
y_train = y2_train
X_test = X2_test
y_test = y2_test

time0 = time.time()
lgr = LogisticRegression(max_iter=10000).fit(X_train, y_train)
time1 = time.time()
train_score = lgr.score(X_train, y_train)
time2 = time.time()
test_score = lgr.score(X_test, y_test)
time3 = time.time()
print(train_score)
print(test_score)

bp_all = model_resut(time0, time1, time2, time3, modelname, train_score, test_score, bp_all)

0.9100061506587679
0.9138584053607847
Train time : 38.4043447971344 seconds
Train score: 0.012998580932617188 seconds
Test  score: 0.006132841110229492 seconds
Total time : 38.423476219177246 seconds


max_iter=10000 or max_iter=1000 does not influece to the Train Accuracy and Test Accuracy.

In [152]:
bp_all

Unnamed: 0,Model,Train Time,Score Time,Train Accuracy,Test Accuracy
0,LogisticRegression,0.398177,0.012731,0.887346,0.887346
1,KNeighborsClassifier,0.016115,4.005898,0.892234,0.878217
2,DecisionTreeClassifier,0.119039,0.02511,0.917775,0.866854
3,SVC RBF,12.719708,36.309175,0.887346,0.887346
4,SVC poly,37.771317,8.215294,0.887346,0.887346
5,SVC linear,38.791097,6.823769,0.887346,0.887346
6,SVC sigmoid,16.443582,10.899942,0.882296,0.882393
7,KNeighborsClassifier all features,0.018019,4.619585,0.930821,0.905215
8,"KNeighborsClassifier all features grid (1, 22, 2)",37.007412,4.028324,0.930821,0.913373
9,DecisionTreeClassifier all features,0.238911,0.024361,1.0,0.886666


In [153]:
column_names = dfageohe2.columns
scaler = StandardScaler()
dfageohe2_scaled = scaler.fit_transform(dfageohe2)
scaled_df = pd.DataFrame(dfageohe2_scaled, columns=column_names)
X2s = scaled_df
X2s

Unnamed: 0,age,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,...,poutcome_success,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
0,1.533034,-0.582023,-0.538317,-0.19143,6.152772,-0.276435,-0.208757,-0.189032,-0.326556,-0.147327,...,-0.1857,0.010471,-0.565922,0.195414,-0.349494,0.648092,0.722722,0.886447,0.712460,0.331680
1,1.628993,-0.582023,-0.538317,-0.19143,-0.162528,-0.276435,-0.208757,-0.189032,3.062258,-0.147327,...,-0.1857,-0.421501,-0.565922,0.195414,-0.349494,0.648092,0.722722,0.886447,0.712460,0.331680
2,-0.290186,-0.582023,-0.538317,-0.19143,-0.162528,-0.276435,-0.208757,-0.189032,3.062258,-0.147327,...,-0.1857,-0.124520,-0.565922,0.195414,-0.349494,0.648092,0.722722,0.886447,0.712460,0.331680
3,-0.002309,1.718146,-0.538317,-0.19143,-0.162528,-0.276435,-0.208757,-0.189032,-0.326556,-0.147327,...,-0.1857,-0.413787,-0.565922,0.195414,-0.349494,0.648092,0.722722,0.886447,0.712460,0.331680
4,1.533034,-0.582023,-0.538317,-0.19143,-0.162528,-0.276435,-0.208757,-0.189032,3.062258,-0.147327,...,-0.1857,0.187888,-0.565922,0.195414,-0.349494,0.648092,0.722722,0.886447,0.712460,0.331680
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,3.164336,-0.582023,-0.538317,-0.19143,-0.162528,-0.276435,4.790252,-0.189032,-0.326556,-0.147327,...,-0.1857,0.292025,-0.565922,0.195414,-0.349494,-0.752343,2.058168,-2.224953,-1.495186,-2.815697
41184,0.573445,-0.582023,1.857642,-0.19143,-0.162528,-0.276435,-0.208757,-0.189032,-0.326556,-0.147327,...,-0.1857,0.481012,-0.565922,0.195414,-0.349494,-0.752343,2.058168,-2.224953,-1.495186,-2.815697
41185,1.533034,-0.582023,-0.538317,-0.19143,-0.162528,-0.276435,4.790252,-0.189032,-0.326556,-0.147327,...,-0.1857,-0.267225,-0.204909,0.195414,-0.349494,-0.752343,2.058168,-2.224953,-1.495186,-2.815697
41186,0.381527,-0.582023,-0.538317,-0.19143,-0.162528,-0.276435,-0.208757,-0.189032,-0.326556,-0.147327,...,-0.1857,0.708569,-0.565922,0.195414,-0.349494,-0.752343,2.058168,-2.224953,-1.495186,-2.815697


Now scaler is applied due to L1 cost will be applied later.

In [154]:
X2s_train, X2s_test, y2_train, y2_test = train_test_split(X2s, y2, test_size = 0.25, stratify = y2, random_state = 42)

In [155]:
modelname = "LogisticRegression all features L1 cost"

X_train = X2s_train
y_train = y2_train
X_test = X2s_test
y_test = y2_test

time0 = time.time()
lgr = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X_train, y_train)
time1 = time.time()
train_score = lgr.score(X_train, y_train)
time2 = time.time()
test_score = lgr.score(X_test, y_test)
time3 = time.time()
print(train_score)
print(test_score)

bp_all = model_resut(time0, time1, time2, time3, modelname, train_score, test_score, bp_all)

0.9103298695412904
0.9147324463435952
Train time : 3.5284829139709473 seconds
Train score: 0.0070002079010009766 seconds
Test  score: 0.0041964054107666016 seconds
Total time : 3.539679527282715 seconds


The first 7 models below use just the bank information features (columns 1 - 7: age, job, marital, education, default, housing, and loan). The remaining 6 models use all the features.

In [156]:
bp_all

Unnamed: 0,Model,Train Time,Score Time,Train Accuracy,Test Accuracy
0,LogisticRegression,0.398177,0.012731,0.887346,0.887346
1,KNeighborsClassifier,0.016115,4.005898,0.892234,0.878217
2,DecisionTreeClassifier,0.119039,0.02511,0.917775,0.866854
3,SVC RBF,12.719708,36.309175,0.887346,0.887346
4,SVC poly,37.771317,8.215294,0.887346,0.887346
5,SVC linear,38.791097,6.823769,0.887346,0.887346
6,SVC sigmoid,16.443582,10.899942,0.882296,0.882393
7,KNeighborsClassifier all features,0.018019,4.619585,0.930821,0.905215
8,"KNeighborsClassifier all features grid (1, 22, 2)",37.007412,4.028324,0.930821,0.913373
9,DecisionTreeClassifier all features,0.238911,0.024361,1.0,0.886666


In [157]:
coefficients = lgr.coef_[0]
coefficients_df = pd.DataFrame({'Feature': X2_train.columns, 'Coefficient': coefficients})
pd.set_option('display.max_rows', 100)
coefficients_df
ordered_df = coefficients_df.reindex(coefficients_df["Coefficient"].abs().sort_values(ascending= False).index)
ordered_df

Unnamed: 0,Feature,Coefficient
57,emp.var.rate,-1.609225
53,duration,1.198937
58,cons.price.idx,0.75459
41,month_may,-0.252675
34,contact_telephone,-0.228353
36,month_aug,0.217412
40,month_mar,0.196806
55,pdays,-0.188946
60,euribor3m,0.149506
25,default_no,0.115233


* Overall test accuracy results are in the 87% - 91% range. 

* Using all features vs. 7 features gives better results with acceptable increase of Train time.

* SVS takes a very long time at both the Train and Score stages.

* DecisionTreeClassifier performs well at Train but not at Test, meaning overfitting.

* KNeighborsClassifier at k=19 shows the best accuracy with ~1% better than the default (k=5).

* "LogisticRegression all features L1 cost" performs the best (0.914732) among all the models I tested, but it is not that dramatically better than others.

* The reason behind this would be the data itself are not clearly separated between y = yes and y = no in the features space, so even if we tweak the model, there is a saturation of test accuracy.

In [158]:
pd.reset_option('display.max_rows')

In [159]:
modelname = "LogisticRegression all features L2 cost"

X_train = X2s_train
y_train = y2_train
X_test = X2s_test
y_test = y2_test

time0 = time.time()
lgr = LogisticRegression(penalty='l2', solver='liblinear', C=0.1).fit(X_train, y_train)
time1 = time.time()
train_score = lgr.score(X_train, y_train)
time2 = time.time()
test_score = lgr.score(X_test, y_test)
time3 = time.time()
print(train_score)
print(test_score)

bp_all = model_resut(time0, time1, time2, time3, modelname, train_score, test_score, bp_all)

0.9099090349940112
0.9146353306788385
Train time : 0.4944181442260742 seconds
Train score: 0.009125709533691406 seconds
Test  score: 0.003968954086303711 seconds
Total time : 0.5075128078460693 seconds


Based on the fitting result of LogisticRegression using all features at L1 cost, there are some features highlighted as key to successful campaigns.

In [160]:
bp_all

Unnamed: 0,Model,Train Time,Score Time,Train Accuracy,Test Accuracy
0,LogisticRegression,0.398177,0.012731,0.887346,0.887346
1,KNeighborsClassifier,0.016115,4.005898,0.892234,0.878217
2,DecisionTreeClassifier,0.119039,0.02511,0.917775,0.866854
3,SVC RBF,12.719708,36.309175,0.887346,0.887346
4,SVC poly,37.771317,8.215294,0.887346,0.887346
5,SVC linear,38.791097,6.823769,0.887346,0.887346
6,SVC sigmoid,16.443582,10.899942,0.882296,0.882393
7,KNeighborsClassifier all features,0.018019,4.619585,0.930821,0.905215
8,"KNeighborsClassifier all features grid (1, 22, 2)",37.007412,4.028324,0.930821,0.913373
9,DecisionTreeClassifier all features,0.238911,0.024361,1.0,0.886666


* There is a high chance in **March and August** and a low chance in **May**.

* **Longer duration of contact** does matter for a successful result. The company can control this, so I recommend that the client talk as long as possible over the phone.

* **"emp.var.rate"** has a negative impact to the successful result.

* **"cons.price.idx"** has a positive impact to the successful result.

* **"pdays"** has a negative impact on the successful result. "pday" is the number of days that passed by after the client was last contacted from a previous campaign. So, I will recommend that the client have successive campaigns.

##### Questions