# CPSC 483 - Project 6
## Winston Do
This project focused on training classification models using Gaussian Naive Bayes (GNB), k Nearest Neighbors (KNN), and a support vector machine (SVM). The unique attribute of this dataset is that it contains many categorical features. This entailed a degree of data engineering to clean the data before it could be used to train the models.

### Data cleaning
The data contained numerous features that were categorical in nature, so a function was written to clean the data and to remove the 'duration' feature, as outlined by the project prompt.
```
#cleans dataframes as per the project requirements, maybe could be generalized for other applications.
def CleanDataFrame(dataframe, target_name, removed_features=None, use_dummy_coding=False, use_zScore_Normalize=False):
  featureNamesList = list(dataframe.columns)
  featureNamesList.remove(target_name)
  if removed_features is not None: #if feature names are provided, removes them from the input dataframe, otherwise skips
    for feature in removed_features:
      featureNamesList.remove(feature)
  X = dataframe[featureNamesList]
  t = dataframe[[target_name]]
  if use_zScore_Normalize: #standardize numeric columns; must happen before dummy coding
    X = zScoreNormalize(X)
  if use_dummy_coding: #one-hot encode categorical columns
    X = pd.get_dummies(X, drop_first=False)
    t = pd.get_dummies(t, drop_first=True)
  return X, t
```
The above function separates the dataframe into the feature matrix and target vector and, when the corresponding flags are set, z-score normalizes the numeric columns and applies one-hot encoding to the categorical variables.
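
To make the one-hot encoding step concrete, here is a minimal sketch of what `pd.get_dummies` does to a feature matrix and a target column. The toy rows are hypothetical; only the column names are modeled on the bank marketing data.

```python
import pandas as pd

# Hypothetical toy rows; column names mimic the bank marketing data.
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["single", "married", "married"],
    "y": ["no", "yes", "no"],
})

# Features: numeric columns pass through, each category becomes its own indicator column.
X = pd.get_dummies(toy[["age", "marital"]], drop_first=False)

# Target: drop_first=True keeps a single indicator column ("y_yes").
t = pd.get_dummies(toy[["y"]], drop_first=True)

print(list(X.columns))
print(list(t.columns))
```

Using `drop_first=True` on the target is what produces the single `y_yes` column that the classifiers are trained on.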

Initially, it was found that the SVM model had the highest fit accuracy, with KNN coming close. However, some data resampling and more metrics were needed to declare a specific classifier model as "best".

Confusion matrices (CMs) and an area under the curve (AUC) score were calculated for each model. These metrics showed the relatively poor performance of the classifiers without resampling.

The data was then resampled with the fit_resample() function. 
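
The notebook uses `RandomOverSampler` from imbalanced-learn for this step. As an illustration of the idea only, and not the library's implementation, the sketch below oversamples with plain NumPy. The function name `random_oversample` and the toy arrays are hypothetical.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until every class matches the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # start with every original row
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            # sample with replacement to fill the gap for this class
            keep.append(rng.choice(idx, size=target - count, replace=True))
    order = np.concatenate(keep)
    return X[order], y[order]

# Hypothetical imbalanced toy data: 8 negatives, 2 positives.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.bincount(y_bal))
```

Because minority rows are duplicated rather than synthesized, the resampled training set contains exact copies of some rows.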

Resampling lowered the test accuracy of all three models but raised the AUC of the KNN and SVM models. The AUC and CMs were calculated using the helper function:
```
def GetAUCandConfusionMatrix(model, feature, target):
  prediction = model.predict(feature)
  AUC_score = roc_auc_score(target, prediction)
  confuse_mat = confusion_matrix(target, prediction)
  return AUC_score, confuse_mat
```
The resampled data was then used to train new models, which were held in a dictionary:
```
ModelDict_resam = {'GNB': GNBModel_resampled, 'KNN': KNNModel_resampled, 'SVM': SVMModel_resampled}
```

With the resampled data, KNN has the best AUC score (0.917), and its confusion matrix shows the fewest false negatives (2). That comes at the cost of 593 false positives, however. The SVM strikes a better balance between the two error types, with 358 false positives and 137 false negatives.

The GNB classifier yields an alarming number of false positives (1764 of 3668 negatives).

In [97]:
#CPSC 483
#Project 6
#Winston Do

############################
#Modules and helper function block
########################

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn as sk
from google.colab import files

from sklearn.linear_model import LogisticRegression

def printShape(data):
  try:
    print(data.shape)
  except AttributeError:
    print('Error: printShape requires an object with a .shape attribute, such as a pandas DataFrame')
  
def printType(obj):
  print(type(obj))

def uploadManyFiles(NumberOfFiles):
  for _ in range(NumberOfFiles):
    files.upload()

#uses pd Dataframe: 
#input: takes in a list of column names , string with the target column name, dataframe
#outputs two pddataframes
def sliceDataFrame(feature_list, target, pdDataFrame):
  return pdDataFrame[feature_list], pdDataFrame[[target]]

#performs z-score normalization on each numeric column of the dataframe; ignores categorical columns
def zScoreNormalize(dataframe):
    # copy the dataframe
    dataframe_std = dataframe.copy()
    # apply the z-score method
    for column in dataframe.columns:
      if (dataframe[column].dtypes != 'object'): #ignore categorical columns of object type
        dataframe_std[column] = (dataframe_std[column] - dataframe_std[column].mean()) / dataframe_std[column].std()        
    return dataframe_std



#cleans dataframes as per the project requirements, maybe could be generalized for other applications.
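#dependencies: pd DataFrame, zScoreNormalize(df)
#input: pd DataFrame, string of target column, [optional]: list of features to be removed from the feature matrix
#input(params): use_dummy_coding (one-hot encode categorical columns), use_zScore_Normalize (standardize numeric columns)
#outputs two pd DataFrames, the feature matrix and the target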
def CleanDataFrame(dataframe, target_name, removed_features=None, use_dummy_coding=False, use_zScore_Normalize=False):
  featureNamesList = list(dataframe.columns)
  featureNamesList.remove(target_name)
  if removed_features is not None: #if feature names are provided, removes them from the input dataframe, otherwise skips
    for feature in removed_features:
      featureNamesList.remove(feature)
  X = dataframe[featureNamesList]
  t = dataframe[[target_name]]

  #X, t = sliceDataFrame(featureNamesList, target_name, dataframe)
  if use_zScore_Normalize:  #standardization if argument is set to True. Must be before dummy coding
    X = zScoreNormalize(X)
  if use_dummy_coding:
    X = pd.get_dummies(X, drop_first=False)
    t = pd.get_dummies(t, drop_first=True)
  return X, t



In [98]:
###########################
#import and clean data block
###########################

#features to remove
removedFeatures = ['duration']
try:
  pdDF_trainRaw = pd.read_csv('bank-additional.csv', sep=';')
  pdDF_testRaw = pd.read_csv('bank-additional-full.csv', sep=';')
except FileNotFoundError:
  x = input("File not found. Enter how many files to input and set directory:")
  uploadManyFiles(int(x))
  pdDF_trainRaw = pd.read_csv('bank-additional.csv', sep=';')
  pdDF_testRaw = pd.read_csv('bank-additional-full.csv', sep=';')


featureMatrix_train, targetVector_train = CleanDataFrame(pdDF_trainRaw, 'y', removed_features=removedFeatures, use_dummy_coding=True, use_zScore_Normalize=True)
featureMatrix_test, targetVector_test = CleanDataFrame(pdDF_testRaw, 'y', removed_features=removedFeatures, use_dummy_coding=True, use_zScore_Normalize=True)

In [99]:
#train a gaussian naive bayes classifier 
from sklearn.naive_bayes import GaussianNB

GNBModel = GaussianNB().fit(featureMatrix_train, np.ravel(targetVector_train))
GNBModel_score = GNBModel.score(featureMatrix_test, targetVector_test)
print("Score of Gaussian Naive Bayes is", GNBModel_score)

Score of Gaussian Naive Bayes is 0.5958288821986987


In [100]:
#train a k nearest neighbors(k=5) model
from sklearn.neighbors import KNeighborsClassifier
K = 5
KNNModel = KNeighborsClassifier(n_neighbors=K).fit(featureMatrix_train, np.ravel(targetVector_train))
KNNModel_score = KNNModel.score(featureMatrix_test, targetVector_test)
print(f"Score of k({K}) Nearest Neighbors is {KNNModel_score}")

Score of k(5) Nearest Neighbors is 0.893682626007575


In [101]:
#train a SVM model
from sklearn import svm
SVMModel = svm.SVC()
SVMModel.fit(featureMatrix_train, np.ravel(targetVector_train))
SVMModel_score = SVMModel.score(featureMatrix_test, targetVector_test)
print("Score of Support Vector Machine is", SVMModel_score)

Score of Support Vector Machine is 0.8993153345634651


The classifier with the highest accuracy is the support vector machine (SVM) at 0.899, only slightly ahead of k nearest neighbors (KNN) at 0.894.

In [127]:
ModelDict = {'GNB': GNBModel, 'KNN': KNNModel, 'SVM': SVMModel}
targetVector_train.value_counts(normalize=True)



y_yes
0        0.890507
1        0.109493
dtype: float64

Only ~11% of the customers have subscribed to the product according to the training data.

A model that always predicts 0 (no subscription) would therefore score about 89% accuracy while failing to identify any of the individuals who would subscribe to the product.

In [117]:
#create confusion matrix
targetVector_train_zeros = np.zeros_like(targetVector_train)

from sklearn.metrics import confusion_matrix

dumb_confusion_matrix = confusion_matrix(targetVector_train, targetVector_train_zeros)
dumb_confusion_matrix


array([[3668,    0],
       [ 451,    0]])
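
As a worked check of why accuracy alone is misleading here, the arithmetic below uses the counts from the all-zeros confusion matrix above. The variable names are illustrative.

```python
# Counts taken from the all-zeros confusion matrix: [[3668, 0], [451, 0]].
tn, fp, fn, tp = 3668, 0, 451, 0

accuracy = (tn + tp) / (tn + fp + fn + tp)
recall = tp / (tp + fn)  # share of actual subscribers that were found

print(f"accuracy = {accuracy:.4f}")
print(f"recall   = {recall:.4f}")
```

The accuracy matches the 0.8905 share of the negative class in the training data, while recall for subscribers is zero.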

In [129]:
from sklearn.metrics import roc_auc_score



def GetAUCandConfusionMatrix(model, feature, target):
  prediction = model.predict(feature)
  AUC_score = roc_auc_score(target, prediction)
  confuse_mat = confusion_matrix(target, prediction)
  return AUC_score, confuse_mat



for model in ModelDict:
  AUC, c_mat = GetAUCandConfusionMatrix(ModelDict[model], featureMatrix_train, targetVector_train)
  print(f"AUCScore of {model} is: {AUC}")
  print(f"Confusion matrix of {model} is:\n", c_mat)




AUCScore of GNB is: 0.6744768683187972
Confusion matrix of GNB is:
 [[2077 1591]
 [  98  353]]
AUCScore of KNN is: 0.6491127797914243
Confusion matrix of KNN is:
 [[3607   61]
 [ 309  142]]
AUCScore of SVM is: 0.6123185602332875
Confusion matrix of SVM is:
 [[3638   30]
 [ 346  105]]


Based on the AUC scores for all models, the best classifier before resampling appears to be the Gaussian naive Bayes model (0.674), despite its low accuracy.
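
Because `GetAUCandConfusionMatrix` passes hard 0/1 predictions to `roc_auc_score`, the ROC curve has a single operating point, and the AUC reduces to the mean of the true positive rate and the true negative rate. The sketch below checks that identity against the printed confusion matrices. The helper name `auc_from_hard_labels` is hypothetical.

```python
# Confusion matrices copied from the output above, laid out as [[TN, FP], [FN, TP]].
matrices = {
    "GNB": [[2077, 1591], [98, 353]],
    "KNN": [[3607, 61], [309, 142]],
    "SVM": [[3638, 30], [346, 105]],
}

def auc_from_hard_labels(cm):
    """ROC AUC for a single operating point: the mean of TPR and TNR."""
    (tn, fp), (fn, tp) = cm
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2

for name, cm in matrices.items():
    print(f"{name}: {auc_from_hard_labels(cm):.4f}")
```

This explains why GNB scores highest here: it finds the most subscribers (353 of 451), even though it misclassifies many non-subscribers.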

In [133]:
from imblearn.over_sampling import RandomOverSampler 

ros = RandomOverSampler(random_state=(2021-4-22))  # note: (2021-4-22) is integer subtraction, so the seed is 1995
X_res_train, t_res_train = ros.fit_resample(featureMatrix_train, np.ravel(targetVector_train))


GNBModel_resampled = GaussianNB().fit(X_res_train, np.ravel(t_res_train))
GNBModel_resampled_score = GNBModel_resampled.score(featureMatrix_test, targetVector_test)
print("Score of Gaussian Naive Bayes is", GNBModel_resampled_score)


#train a k nearest neighbors(k=5) model
from sklearn.neighbors import KNeighborsClassifier
K = 5
KNNModel_resampled = KNeighborsClassifier(n_neighbors=K).fit(X_res_train, np.ravel(t_res_train))
KNNModel_resampled_score = KNNModel_resampled.score(featureMatrix_test, targetVector_test)
print(f"Score of k({K}) Nearest Neighbors is {KNNModel_resampled_score}")
#train a SVM model
from sklearn import svm
SVMModel_resampled = svm.SVC()
SVMModel_resampled.fit(X_res_train, np.ravel(t_res_train))
SVMModel_resampled_score = SVMModel_resampled.score(featureMatrix_test, targetVector_test)
print("Score of Support Vector Machine is", SVMModel_resampled_score)



Score of Gaussian Naive Bayes is 0.5628338350976012
Score of k(5) Nearest Neighbors is 0.7643488394678062
Score of Support Vector Machine is 0.8514130329222104


In [135]:
ModelDict_resam = {'GNB': GNBModel_resampled, 'KNN': KNNModel_resampled, 'SVM': SVMModel_resampled}

for model in ModelDict_resam:
  AUC, c_mat = GetAUCandConfusionMatrix(ModelDict_resam[model], featureMatrix_train, targetVector_train)
  print(f"AUCScore of {model} is: {AUC}")
  print(f"Confusion matrix of {model} is:\n", c_mat)

AUCScore of GNB is: 0.6575464193226248
Confusion matrix of GNB is:
 [[1904 1764]
 [  92  359]]
AUCScore of KNN is: 0.9169484630059942
Confusion matrix of KNN is:
 [[3075  593]
 [   2  449]]
AUCScore of SVM is: 0.7993148631297953
Confusion matrix of SVM is:
 [[3310  358]
 [ 137  314]]


KNN has the best AUC score (0.917) when trained on the resampled data, and its confusion matrix shows the fewest false negatives (2). That comes at the cost of 593 false positives, however. The SVM strikes a better balance between the two error types, with 358 false positives and 137 false negatives.

The GNB classifier yields an alarming number of false positives (1764 of 3668 negatives).
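
To put the comparison on a common scale, the sketch below converts the resampled-model confusion matrices into error rates. The helper name `error_rates` is hypothetical. Note that these matrices were computed on the training features, so KNN's near-zero false negative count may partly reflect that the model has already seen those rows.

```python
# Confusion matrices copied from the resampled-model output, as [[TN, FP], [FN, TP]].
resampled = {
    "GNB": [[1904, 1764], [92, 359]],
    "KNN": [[3075, 593], [2, 449]],
    "SVM": [[3310, 358], [137, 314]],
}

def error_rates(cm):
    """Return (false positive rate, false negative rate, precision) for one matrix."""
    (tn, fp), (fn, tp) = cm
    fpr = fp / (fp + tn)
    fnr = fn / (fn + tp)
    precision = tp / (tp + fp)
    return fpr, fnr, precision

rates = {name: error_rates(cm) for name, cm in resampled.items()}
for name, (fpr, fnr, precision) in rates.items():
    print(f"{name}: FPR={fpr:.3f}  FNR={fnr:.3f}  precision={precision:.3f}")
```

On these numbers KNN has the lowest false negative rate, while the SVM has the lowest false positive rate and the highest precision of the three.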