READ ME

Group Members: Ryan McDonald, Justin Drouin
  
This project loads the UCI bank marketing dataset and trains a series of classifiers on both the original training set and a balanced (randomly oversampled) version of it. Each classifier is then scored against the full dataset.

**Experiments**


**1) Download bank-additional.zip and extract its contents. Since the dataset is large and some of the algorithms we will use can be time-consuming, we will train with bank-additional.csv, which is a subset of the original dataset.
Once our models are trained, we will test against the full dataset, which is in bank-additional-full.csv. The archive also contains a text file, bank-additional-names.txt, which describes the dataset and what each column represents.**



**2) Use read_csv() to load and examine the training and test sets. Unlike most CSV files, the separator is actually ';' rather than ','.**

In [1]:
import pandas as pd
import numpy as np
import sklearn.neighbors
import sklearn.linear_model

from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import RandomOverSampler

bankdata = pd.read_csv('bank-additional.csv',sep = ';')
bankdatafull = pd.read_csv('bank-additional-full.csv', sep = ';')
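
As a small illustration of the separator issue from step 2, the sketch below parses a semicolon-delimited sample held in memory. The two sample rows and the reduced column set are made up for the example and are not taken from the dataset files.

```python
import io
import pandas as pd

# Hypothetical two-row sample in a ';'-separated layout (not real data).
sample = "age;job;y\n30;blue-collar;no\n39;services;yes\n"

# With the default ',' separator each line is read as one wide column.
wrong = pd.read_csv(io.StringIO(sample))
# Passing sep=';' splits the fields as intended.
right = pd.read_csv(io.StringIO(sample), sep=";")

print(wrong.shape)   # a single column
print(right.shape)   # three columns
print(list(right.columns))
```

Checking `shape` right after loading is a cheap way to notice a wrong separator before any preprocessing runs.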

**3) The training and test DataFrames will need some significant preprocessing before they can be used:**

**a) Several of the features are categorical variables and will need to be turned into numbers before they can be used by ML algorithms. The simplest way to accomplish this is to use dummy coding using the Pandas get_dummies() method.**

**Note: some algorithms (e.g. logistic regression) have problems with collinear features. If you use one-hot encoding, one dummy variable will be a linear combination of the other dummy variables, so be sure to pass drop_first=True.**

In [2]:
bankdata_dummies = pd.get_dummies(bankdata, drop_first=True)
bankdatafull_dummies = pd.get_dummies(bankdatafull, drop_first=True)
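
Encoding the two files separately, as above, yields matching columns only if every categorical level appears in both files. The sketch below uses made-up toy frames (only the column names mirror the dataset) to show how the layouts can diverge, and one possible way to pin the levels to those seen in the training data. It is an illustration under those assumptions, not a change to the experiment.

```python
import pandas as pd

# Hypothetical toy frames, not taken from the dataset.
train = pd.DataFrame({"age": [30, 39, 25],
                      "marital": ["married", "single", "divorced"]})
test = pd.DataFrame({"age": [41, 52],
                     "marital": ["married", "single"]})  # no "divorced" rows

# Encoded independently, each frame drops a different baseline level.
print(list(pd.get_dummies(train, drop_first=True).columns))
print(list(pd.get_dummies(test, drop_first=True).columns))

# Fixing the category list from the training data keeps the layouts identical.
levels = sorted(train["marital"].unique())
for frame in (train, test):
    frame["marital"] = pd.Categorical(frame["marital"], categories=levels)

train_d = pd.get_dummies(train, drop_first=True)
test_d = pd.get_dummies(test, drop_first=True)
print(list(test_d.columns))
```

With the levels fixed, `drop_first=True` removes the same baseline level in both frames, so a given dummy column means the same thing in training and test.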

**b) Two features should be removed from the dataset:**

**Per bank-additional-names.txt, the duration “should be discarded if the intention is to have a realistic predictive model.”**

**The feature y (or y_yes after dummy coding) is the target.**

In [3]:
bankdata_dummies_duration_dropped = bankdata_dummies.drop(labels='duration', axis=1)
bankdatafull_dummies_duration_dropped = bankdatafull_dummies.drop(labels='duration', axis=1)

**c) Some algorithms (e.g. KNN and SVM) require non-categorical features to be standardized.**

According to bank-additional-names.txt, the non-categorical features are age, campaign, pdays, previous, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, and nr.employed.

Each one is standardized as standardized_value = (value - column mean) / column standard deviation, with the training set and the test set each using their own column statistics.

In [4]:
#Unstandardized features and target, taken before standardization (used for Gaussian Naive Bayes)
training_set_features = bankdata_dummies_duration_dropped.loc[:, bankdata_dummies_duration_dropped.columns != 'y_yes']
training_set_target = bankdata_dummies_duration_dropped['y_yes']

test_set_features = bankdatafull_dummies_duration_dropped.loc[:, bankdatafull_dummies_duration_dropped.columns != 'y_yes']
test_set_target = bankdatafull_dummies_duration_dropped['y_yes']

#Columns to standardize (the non-categorical features)
to_standardize = ('age', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed')

#Standardize these columns in the training and test sets, each with its own mean and standard deviation
for df in (bankdata_dummies_duration_dropped, bankdatafull_dummies_duration_dropped):
    for col in to_standardize:
        df[col] = (df[col] - df[col].mean()) / df[col].std()

#Separate dataframes into training/testing features & targets for ease of use later
training_set_features_standardized = bankdata_dummies_duration_dropped.loc[:, bankdata_dummies_duration_dropped.columns != 'y_yes']
training_set_target_standardized = bankdata_dummies_duration_dropped['y_yes']

test_set_features_standardized = bankdatafull_dummies_duration_dropped.loc[:, bankdatafull_dummies_duration_dropped.columns != 'y_yes']
test_set_target_standardized = bankdatafull_dummies_duration_dropped['y_yes']
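
The cell above standardizes each file with its own mean and standard deviation. A common alternative is to reuse the training statistics for the test data, so that both sets are on exactly the same scale. The following minimal sketch shows that variant on made-up numbers; it is not what the experiment ran.

```python
import pandas as pd

# Hypothetical values, not taken from the dataset.
train = pd.DataFrame({"age": [20.0, 30.0, 40.0, 50.0]})
test = pd.DataFrame({"age": [35.0, 60.0]})

mean = train["age"].mean()
std = train["age"].std()  # pandas defaults to the sample standard deviation (ddof=1)

train_z = (train["age"] - mean) / std
# Reuse the training mean and standard deviation for the test column.
test_z = (test["age"] - mean) / std

print(train_z.round(3).tolist())
print(test_z.round(3).tolist())
```

Note that `Series.std()` divides by n - 1, so results differ slightly from tools that use the population standard deviation.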

**4) Fit Naive Bayes, KNN, and SVM classifiers to the training set, then score each classifier on the test set. Which classifier has the highest accuracy?**

In [5]:
#Gaussian Naive Bayes
gnb_training = GaussianNB().fit(training_set_features, training_set_target)
gnb_training.score(test_set_features, test_set_target)

0.8545450131106147

In [6]:
#K-Nearest Neighbors
knn_training = sklearn.neighbors.KNeighborsClassifier().fit(training_set_features_standardized, training_set_target_standardized)
knn_training.score(test_set_features_standardized, test_set_target_standardized)

0.8929057006895212

In [7]:
#Support Vector Machine
svm_training = SVC().fit(training_set_features_standardized, training_set_target_standardized)
svm_training.score(test_set_features_standardized, test_set_target_standardized)

0.899436729144411
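
The three fit-and-score cells above follow the same pattern, which can be sketched as a loop. The example below runs on synthetic data from `make_classification` (an assumption made so that it is self-contained, not the bank dataset) and wraps KNN and SVM in a pipeline with `StandardScaler`, which scales every column rather than the hand-picked subset used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (about 90% class 0).
X, y = make_classification(n_samples=500, n_features=8, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "GaussianNB": GaussianNB(),
    # The scaler is fitted on the training split and reused when scoring.
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.4f}")
```

The accuracies printed here describe the synthetic data only and are not comparable to the scores reported above.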

**5) These numbers look pretty good, but let’s take another look at the data. How many values in the training set have y_yes = 0, and how many have y_yes = 1? What would be the accuracy if we simply assumed that no customer ever subscribed to the product?**

In [8]:
print("Number of y_yes in training set = 0:", training_set_target.isin([0]).sum())
print("Number of y_yes in training set = 1:", training_set_target.isin([1]).sum())

Number of y_yes in training set = 0: 3668
Number of y_yes in training set = 1: 451


89.05% of customers in the training set (3668 of 4119) did not subscribe to the product. If we assumed that no customer ever subscribed, we would achieve roughly 89% accuracy.
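
The baseline figure follows directly from the two class counts printed above:

```python
negatives = 3668  # rows with y_yes = 0 in the training set
positives = 451   # rows with y_yes = 1 in the training set

# Always predicting "no subscription" is correct for every negative row.
baseline_accuracy = negatives / (negatives + positives)
print(f"{baseline_accuracy:.2%}")  # 89.05%
```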

**6) Use np.zeros_like() to create a target vector representing the output of the “dumb” classifier of experiment (5), then create a confusion matrix and find its AUC.**

In [9]:
zeroes = np.zeros_like(training_set_target)
confusion_matrix(zeroes, training_set_target)

array([[3668,  451],
       [   0,    0]], dtype=int64)

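(Correct) True Positives: 3668

(Incorrect) False Positives: 451

(Incorrect) False Negative: 0

(Correct) True Negative: 0

Note: the predictions are passed as the first argument to confusion_matrix(), so rows are predicted classes and columns are actual classes. The labels here and in the later experiments treat y_yes = 0 (no subscription) as the "positive" class. Under the usual scikit-learn convention, where y_yes = 1 is positive, the 3668 correct predictions are true negatives and the 451 errors are false negatives.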


Accuracy = (TP + TN) / (P + N) = 3668 / 4119 = 89.05%
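
scikit-learn documents the signature as `confusion_matrix(y_true, y_pred)`, with rows for actual classes and columns for predicted classes. The short sketch below, on made-up labels, shows that swapping the two arguments transposes the matrix, which is worth keeping in mind when reading the matrices in this document.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: three negatives and two positives.
y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.zeros_like(y_true)  # always predict class 0

cm = confusion_matrix(y_true, y_pred)          # rows = actual, columns = predicted
cm_swapped = confusion_matrix(y_pred, y_true)  # rows = predicted, columns = actual

# With class 1 as the positive class, ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()
print(cm)
print(cm_swapped)
print(tn, fp, fn, tp)
```

Accuracy is unaffected by the argument order because the diagonal is the same in both matrices. Only the off-diagonal cells trade places.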

In [10]:
roc_auc_score(training_set_target, zeroes)

0.5

**7) Create a confusion matrix and find the AUC for each of the classifiers of experiment (4). Is the best classifier the one with the highest accuracy?**

**Gaussian Naive Bayes:**

In [11]:
confusion_matrix(gnb_training.predict(test_set_features), test_set_target)

array([[33020,  2463],
       [ 3528,  2177]], dtype=int64)

(Correct) True Positives: 33020

(Incorrect) False Positives: 2463

(Incorrect) False Negative: 3528

(Correct) True Negative: 2177


Accuracy = (TP + TN) / (P + N) = 35197 / 41188 = **85.45%**

In [12]:
print("Gaussian Naive Bayes AUC Score: ", roc_auc_score(test_set_target, gnb_training.predict(test_set_features)))

Gaussian Naive Bayes AUC Score:  0.6863252222867992
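
Because roc_auc_score() receives hard 0/1 predictions here, the ROC curve has a single operating point and the AUC reduces to the average of the two per-class recalls. The arithmetic below reproduces the score from the Gaussian Naive Bayes matrix above, reading rows as predicted classes and columns as actual classes (the predictions were passed first) and treating y_yes = 1 as the positive class.

```python
# Counts read from the Gaussian Naive Bayes confusion matrix.
pred0_actual0, pred0_actual1 = 33020, 2463
pred1_actual0, pred1_actual1 = 3528, 2177

# Recall on the positive class (true positive rate).
recall_pos = pred1_actual1 / (pred1_actual1 + pred0_actual1)
# Recall on the negative class (true negative rate).
recall_neg = pred0_actual0 / (pred0_actual0 + pred1_actual0)

auc_from_labels = (recall_pos + recall_neg) / 2
print(round(auc_from_labels, 4))  # 0.6863
```

This also explains why the "dumb" classifier scores exactly 0.5: its recall is 1 on the negative class and 0 on the positive class.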


**K-Nearest Neighbors:**

In [13]:
confusion_matrix(knn_training.predict(test_set_features_standardized), test_set_target)

array([[35577,  3440],
       [  971,  1200]], dtype=int64)

(Correct) True Positives: 35577

(Incorrect) False Positives: 3440

(Incorrect) False Negative: 971

(Correct) True Negative: 1200


Accuracy = (TP + TN) / (P + N) = 36777 / 41188 = **89.29%**

In [14]:
print("K-Nearest Neighbors AUC Score: ", roc_auc_score(test_set_target, knn_training.predict(test_set_features_standardized)))

K-Nearest Neighbors AUC Score:  0.616026444203749


**Support Vector Machine:**

In [15]:
confusion_matrix(svm_training.predict(test_set_features_standardized), test_set_target)

array([[36217,  3811],
       [  331,   829]], dtype=int64)

(Correct) True Positives: 36217

(Incorrect) False Positives: 3811

(Incorrect) False Negative: 331

(Correct) True Negative: 829


Accuracy = (TP + TN) / (P + N) = 37046 / 41188 = **89.94%**

In [16]:
print("Support Vector Machine AUC Score: ",roc_auc_score(test_set_target, svm_training.predict(test_set_features_standardized)))

Support Vector Machine AUC Score:  0.5848036049899423


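The Gaussian Naive Bayes classifier had the lowest accuracy of the three classifiers (85.45%) but the highest AUC (0.686). The SVM had the highest accuracy (89.94%) but the lowest AUC (0.585), not far above the 0.5 of the "dumb" classifier, so the most accurate classifier is not the best one here.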
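
The AUC values above are computed from hard class predictions. roc_auc_score() also accepts continuous scores, such as the positive-class column of predict_proba(), in which case it summarizes every threshold instead of one. The sketch below contrasts the two on synthetic data from `make_classification` (an assumption made for self-containment). It is an alternative way of computing AUC, not the one used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in data (about 90% class 0).
X, y = make_classification(n_samples=500, n_features=8, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)

# AUC from hard labels: a single operating point.
auc_labels = roc_auc_score(y_test, clf.predict(X_test))
# AUC from predicted probabilities of class 1: all thresholds.
auc_scores = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

print(f"labels: {auc_labels:.4f}  scores: {auc_scores:.4f}")
```

For SVC, the output of decision_function() can play the same role as the probability column.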

**8) One of the easiest ways to deal with an unbalanced dataset is random oversampling. This can be done with an imblearn.over_sampling.RandomOverSampler object. Use fit_resample() to generate a balanced training set. To make sure that your results are reproducible, pass random_state=(2021-4-22).**

In [17]:
#2021-4-22 is evaluated as integer subtraction, so the seed passed is 1995
ros = RandomOverSampler(random_state=2021-4-22)

training_set_features_res, training_set_target_res = ros.fit_resample(training_set_features, training_set_target)
training_set_features_standardized_res, training_set_target_standardized_res = ros.fit_resample(training_set_features_standardized, training_set_target_standardized)
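
Conceptually, random oversampling draws minority-class rows with replacement until both classes have the same count. The sketch below illustrates the idea with NumPy alone on made-up data. It is not imblearn's implementation, and its random draws will not match those of RandomOverSampler.

```python
import numpy as np

rng = np.random.default_rng(2021 - 4 - 22)  # the expression evaluates to 1995

# Hypothetical toy data: eight majority rows and two minority rows.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

minority_idx = np.flatnonzero(y == 1)
n_extra = int((y == 0).sum() - (y == 1).sum())

# Sample minority rows with replacement to close the gap.
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
X_res = np.vstack([X, X[extra_idx]])
y_res = np.concatenate([y, y[extra_idx]])

print(np.bincount(y), np.bincount(y_res))
```

Only the training data is resampled. The test set keeps its original class balance so that the scores still reflect the real distribution.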

**9) Repeat experiments (4) and (7) on the balanced training set of experiment (8). Which classifier performs the best, and how much better is its performance?**

**9-4)**

In [18]:
#Gaussian Naive Bayes
gnb_training_res = GaussianNB().fit(training_set_features_res, training_set_target_res)
print("GNB resampled score: ",gnb_training_res.score(test_set_features, test_set_target))

#K-Nearest Neighbors
knn_training_res = sklearn.neighbors.KNeighborsClassifier().fit(training_set_features_standardized_res, training_set_target_standardized_res)
print("KNN resampled score: ",knn_training_res.score(test_set_features_standardized, test_set_target_standardized))

#Support Vector Machine
svm_training_res = SVC().fit(training_set_features_standardized_res, training_set_target_standardized_res)
print("SVM resampled score: ",svm_training_res.score(test_set_features_standardized, test_set_target_standardized))

GNB resampled score:  0.8446392153054287
KNN resampled score:  0.7636690298145091
SVM resampled score:  0.8547635233563173


**9-7)**

**Gaussian Naive Bayes Resampled:**

In [19]:
confusion_matrix(gnb_training_res.predict(test_set_features), test_set_target)

array([[32481,  2332],
       [ 4067,  2308]], dtype=int64)

(Correct) True Positives: 32481

(Incorrect) False Positives: 2332

(Incorrect) False Negative: 4067

(Correct) True Negative: 2308


Accuracy = (TP + TN) / (P + N) = 34789 / 41188 = **84.46%**

In [20]:
print("Gaussian Naive Bayes Resampled AUC Score: ", roc_auc_score(test_set_target, gnb_training_res.predict(test_set_features)))

Gaussian Naive Bayes Resampled AUC Score:  0.6930677370901941


**K-Nearest Neighbors Resampled:**

In [21]:
confusion_matrix(knn_training_res.predict(test_set_features_standardized), test_set_target)

array([[28687,  1873],
       [ 7861,  2767]], dtype=int64)

(Correct) True Positives: 28687

(Incorrect) False Positives: 1873

(Incorrect) False Negative: 7861

(Correct) True Negative: 2767


Accuracy = (TP + TN) / (P + N) = 31454 / 41188 = **76.37%**

In [22]:
print("K-Nearest Neighbors Resampled AUC Score: ", roc_auc_score(test_set_target, knn_training_res.predict(test_set_features_standardized)))

K-Nearest Neighbors Resampled AUC Score:  0.6906245990157488


**Support Vector Machine Resampled:**

In [23]:
confusion_matrix(svm_training_res.predict(test_set_features_standardized), test_set_target)

array([[32453,  1887],
       [ 4095,  2753]], dtype=int64)

(Correct) True Positives: 32453

(Incorrect) False Positives: 1887

(Incorrect) False Negative: 4095

(Correct) True Negative: 2753


Accuracy = (TP + TN) / (P + N) = 35206 / 41188 = **85.48%**

In [24]:
print("Support Vector Machine Resampled AUC Score: ",roc_auc_score(test_set_target, svm_training_res.predict(test_set_features_standardized)))

Support Vector Machine Resampled AUC Score:  0.7406372654006257


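After random oversampling with imbalanced-learn, the model with the highest accuracy also had the highest AUC. Unlike in the earlier experiments, where the most accurate classifier (the SVM) had the lowest AUC, the support vector machine trained on the balanced set was both the most accurate classifier (85.48%) and the one with the highest AUC (0.741). Compared with the SVM trained on the unbalanced data, its AUC rose from 0.585 to 0.741, while its accuracy fell from 89.94% to 85.48%.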
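
To put a number on "how much better", the snippet below tabulates the change in accuracy and AUC for each classifier, using the scores reported in experiments (4), (7), and (9) rounded to four decimal places.

```python
# (accuracy, AUC) reported above, before and after oversampling.
original = {"GNB": (0.8545, 0.6863), "KNN": (0.8929, 0.6160), "SVM": (0.8994, 0.5848)}
balanced = {"GNB": (0.8446, 0.6931), "KNN": (0.7637, 0.6906), "SVM": (0.8548, 0.7406)}

for name in original:
    d_acc = balanced[name][0] - original[name][0]
    d_auc = balanced[name][1] - original[name][1]
    print(f"{name}: accuracy {d_acc:+.4f}, AUC {d_auc:+.4f}")

# Classifier with the highest AUC on the balanced training set.
best = max(balanced, key=lambda name: balanced[name][1])
print(best)  # SVM
```

Every classifier gives up some accuracy after oversampling and gains AUC in return. The SVM gains the most AUC and KNN loses the most accuracy.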