Resources Consulted:
* https://sklearn-template.readthedocs.io/en/latest/user_guide.html
* https://saturncloud.io/blog/how-to-keep-column-names-when-converting-from-pandas-to-numpy
* https://github.com/ApoorvRusia/Naive-Bayes-classification-on-Iris-dataset/blob/master/Naiye%20Bayes%20classification%20application.ipynb
* https://datascience.stackexchange.com/questions/18904/how-do-i-convert-a-pandas-dataframe-to-a-1d-array
* https://stackoverflow.com/questions/35996970/typeerror-fit-missing-1-required-positional-argument-y
* https://machinelearningmastery.com/bayes-theorem-for-machine-learning/
* https://news.ycombinator.com/item?id=21151032
* https://www.countbayesie.com/blog/2016/5/1/a-guide-to-bayesian-statistics
* **https://www.kdnuggets.com/2020/07/spam-filter-python-naive-bayes-scratch.html**
* https://www.kaggle.com/code/marloz/sklearn-pipelines-missing-values/notebook

Notes for study from machinelearningmastery link above:

The result P(A|B) is referred to as the posterior probability and P(A) is referred to as the prior probability.

P(A|B): Posterior probability.
P(A): Prior probability.
Sometimes P(B|A) is referred to as the likelihood and P(B) is referred to as the evidence.

P(B|A): Likelihood.
P(B): Evidence.
This allows Bayes Theorem to be restated as:

Posterior = Likelihood * Prior / Evidence

For this assignment you will need to use the sklearn framework to implement a custom Naive Bayes classifier.  The classifier only needs to handle binary data (both the attributes and the classes).  The attributes will always have a value of 0 or 1.  The class labels will always have a value of 1 or -1.  You can use libraries to help with the data processing, calculations, etc, but you must implement your own Naïve Bayes algorithm.  Do not use an existing implementation.  One important implementation detail is that you should convert the probabilities to log probabilities to avoid the number becoming to small to represent as a floating point number.  For example instead of computing P(x|c)P(c) compute log(P(x|c)+log(P(c)). Provide your implementation below.

In [1]:
#initial imports that you may find useful
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
import numpy as np

In [2]:
#additional imports
import pandas as pd


In [3]:
# #Your Naive Bayes Implementation goes here.
# #Adjust this as you see fit

class BinaryNBClassifier(BaseEstimator, ClassifierMixin):        
    def __init__(self, k=0.5):
        self.k = k
        self.word_probs = []
             
    def fit(self, X, y):
        
        #  STEP 1: Determine how often the labeled class appears (e.g. spam). This variable is called p_of_b
        global total_b, total_not_b
        total_b = 0
        
        self.row_counter = 0
        for row in y:
            if y[self.row_counter][0] == 1: # Note: Because the not spam values are recorded as -1 rather than 0 we can't simply add up y using np.sum(y)
                total_b += 1
            self.row_counter += 1
        total_not_b = len(y) - total_b
        
        #   We will need the following two values later in the predict() function so we need to make sure they are in scope
        global p_of_b
        global p_of_not_b

        p_of_b = total_b / len(y) # This is how often the B event happens in the universe of labeled classifications
        p_of_not_b = 1 - p_of_b

        # print(f"p_of_b: {p_of_b} p_of_not_b: {p_of_not_b}")
        # print(f"log10(p_of_b): {np.log10(p_of_b)} log10(p_of_not_b): {np.log10(p_of_not_b)}")

        # STEP 2: With the constants in place we can look at the probability of the features when B is true/not true
        
        #   Prepopulate a single dimension array of length "feature_count" with 0s
        global X_row_count, X_column_count
        X_row_count, X_column_count = np.shape(X)

        global y_row_count, y_column_count
        y_row_count, y_column_count = np.shape(y)

        self.count_of_feature_A_when_b = np.full(X_column_count, 0) 
        self.count_of_feature_A_when_not_b = np.full(X_column_count, 0)
        self.count_of_feature_not_A_when_b = np.full(X_column_count, 0) 
        self.count_of_feature_not_A_when_not_b = np.full(X_column_count, 0)

        #   For each event, go through each feature. Increment the appropriate counter based on whether the word appears when B is true / not true
        self.row_counter = 0
        self.column_counter = 0
        self.evaluate_A_feature = int
        self.evaluate_B_feature = int

        for row in X:
            for value in row:
                # The following rows setup the condition to search for
                self.evaluate_A_feature = X[self.row_counter][self.column_counter]
                self.evaluate_B_feature = y[self.row_counter][0]

                if (self.evaluate_A_feature == 1 and self.evaluate_B_feature == 1):
                    self.count_of_feature_A_when_b[self.column_counter] += 1
                elif (self.evaluate_A_feature == 1 and self.evaluate_B_feature != 1):
                    self.count_of_feature_A_when_not_b[self.column_counter] += 1
                elif (self.evaluate_A_feature != 1 and self.evaluate_B_feature == 1):
                    self.count_of_feature_not_A_when_b[self.column_counter] += 1
                elif (self.evaluate_A_feature != 1 and self.evaluate_B_feature != 1):
                    self.count_of_feature_not_A_when_not_b[self.column_counter] += 1
                self.column_counter += 1                
            self.column_counter = 0
            self.row_counter += 1

        # print(f"count_of_feature_A_when_b: {self.count_of_feature_A_when_b}")
        # print(f"count_of_feature_A_when_not_b: {self.count_of_feature_A_when_not_b}")
        # print(f"count_of_feature_not_A_when_b: {self.count_of_feature_not_A_when_b}")
        # print(f"count_of_feature_not_A_when_not_b: {self.count_of_feature_not_A_when_not_b}")

        #   Having calculated the count of each identified feature, we can now calculate the probabilities of each feature as it appears in B, not B, and total
        global prob_of_feature_A_when_b        
        global prob_of_feature_A_when_not_b
        global prob_of_feature_not_A_when_b
        global prob_of_feature_not_A_when_not_b
        prob_of_feature_A_when_b = []
        prob_of_feature_A_when_not_b = []
        prob_of_feature_not_A_when_b = []
        prob_of_feature_not_A_when_not_b = []

        prob_of_feature_A_when_b = np.divide(self.count_of_feature_A_when_b, total_b) # The number of times the A feature occurs given B / Number of events marked B
        prob_of_feature_A_when_not_b = np.divide(self.count_of_feature_A_when_not_b, total_not_b) # The number of times the A feature occurs given B / Number of events marked B
        prob_of_feature_not_A_when_b = np.divide(self.count_of_feature_not_A_when_b, total_b) # The number of times the A feature occurs given B / Number of events marked B
        prob_of_feature_not_A_when_not_b = np.divide(self.count_of_feature_not_A_when_not_b, total_not_b) # The number of times the A feature occurs given B / Number of events marked B

        # print(f"prob_of_feature_A_when_b: {prob_of_feature_A_when_b}")
        # print(f"prob_of_feature_A_when_not_b: {prob_of_feature_A_when_not_b}")
        # print(f"prob_of_feature_not_A_when_b: {prob_of_feature_not_A_when_b}")
        # print(f"prob_of_feature_not_A_when_not_b: {prob_of_feature_not_A_when_not_b}")


        # # END: At this point, we have all the calculations required for what is necessary in the predict method

        return self
    
    def predict(self, X):

        import math

        # The predict function accepts N events with M features to make a classification using the Naive Bayes implementation
        # We want to multiply the existence of a feature (or lack thereof) by the probability that feature appears in the B (or not B) labeled training set

        num_rows, num_cols = X.shape
        y_predicted = np.full(num_rows, 0) 

        self.row_counter = 0
        for row in X:
            # print(f"X: {row}")

            # Using the numpy multiply operator we can multiple each 'cell' by the corresponding 'cell'
            prob_of_feature_A_when_b_weighted = np.multiply(row, prob_of_feature_A_when_b)
            prob_of_feature_A_when_not_b_weighted = np.multiply(row, prob_of_feature_A_when_not_b)

            # We need to swap out the 1s and 0s in the training set so we can multiply the 'not As' to get probabilities
            # print(f"row: {row} row_mirrored: {abs(row-1)}") # Checking to make sure this value does what I expect: YES
            prob_of_feature_not_A_when_b_weighted = np.multiply(abs(row-1), prob_of_feature_not_A_when_b) # BIG THOUGHT REQUIRED HERE. NEED TO ASSERT A TRUE WHEN THE FEATURE IS FALSE
            prob_of_feature_not_A_when_not_b_weighted = np.multiply(abs(row-1), prob_of_feature_not_A_when_not_b)

            # With the probabilities of each feature we can now add the log() scores together
            #   Note:   It's common for the probabilities for some of the features to come back zero
            #           However the np.log function breaks when trying to take the log(0) since there's no exponent that will get the base value to zero
            #           Since we no longer care about the order of the values, we can np.sort(), then np.trim_zeros to get rid of leading or trailing zeros
            #           before we take the log()        
   
        
            prob_of_feature_A_when_b_weighted_log = np.sum(np.log10(np.trim_zeros(np.sort(prob_of_feature_A_when_b_weighted))))/num_cols
            prob_of_feature_A_when_not_b_weighted_log = np.sum(np.log10(np.trim_zeros(np.sort(prob_of_feature_A_when_not_b_weighted))))/num_cols
            prob_of_feature_not_A_when_b_weighted_log = np.sum(np.log10(np.trim_zeros(np.sort(prob_of_feature_not_A_when_b_weighted))))/num_cols
            prob_of_feature_not_A_when_not_b_weighted_log = np.sum(np.log10(np.trim_zeros(np.sort(prob_of_feature_not_A_when_not_b_weighted))))/num_cols

            # Look into the math operations:
            # print(f"prob_of_B_given_A = ({prob_of_feature_A_when_b_weighted_log} + {prob_of_feature_not_A_when_b_weighted_log} + {math.log10(p_of_b)}")
            # print(f"prob_of_not_B_given_A =  ({prob_of_feature_A_when_not_b_weighted_log} + {prob_of_feature_not_A_when_not_b_weighted_log} + {math.log10(p_of_not_b)}") # ** is the operator for raising a value to that power

            prob_of_B_given_A = (prob_of_feature_A_when_b_weighted_log + prob_of_feature_not_A_when_b_weighted_log + np.log10(p_of_b)) # Intentionally NOT raising this back to 10** so I can see the values else it prints 0.0
            prob_of_not_B_given_A =  (prob_of_feature_A_when_not_b_weighted_log + prob_of_feature_not_A_when_not_b_weighted_log + np.log10(p_of_not_b)) # ** is the operator for raising a value to that power

            # print(f"Row {self.row_counter}: prob_of_B_given_A: {prob_of_B_given_A} vs prob_of_not_B_given_A: {prob_of_not_B_given_A}")

            if prob_of_B_given_A >= prob_of_not_B_given_A: 
                y_predicted[self.row_counter] = 1
            else:
                y_predicted[self.row_counter] = -1

            # print(f"y_predicted: {y_predicted[self.row_counter]}")
            
            self.row_counter += 1

        # print(f"y_predicted: {y_predicted}")
        return np.array(y_predicted) # For a long time I was returning self


In [4]:
# Define pipelines

from sklearn.pipeline import Pipeline # For setting up pipeline
from sklearn.naive_bayes import CategoricalNB
from sklearn.impute import SimpleImputer

NBClassifier_pipe = Pipeline([
# ('scaler', StandardScaler()), # Not necessary for this exercise
# ('selector', VarianceThreshold()), # Not necessary for this exercise
('imputer', SimpleImputer(strategy='most_frequent')), # Impute the values when missing values
('classifier', BinaryNBClassifier())
])

CategoricalNB_pipe = Pipeline([
# ('scaler', StandardScaler()), # Not necessary for this exercise
# ('selector', VarianceThreshold()), # Not necessary for this exercise
('imputer', SimpleImputer(strategy='most_frequent')), # Impute the values when missing values
('classifier', CategoricalNB())
])

Now you will train and test your Binary Naive Bayes classifier on a few different datasets.  The datasets can be downloaded from canvas.  They are linked in the assignment description.  For this part of the assignment we will not be splitting the data into training, validation and test data sets.  Instead you should use the entire dataset for training and the entire dataset for testing.  You will need to complete the following table (you can just output the results in this format you don't need to copy them into the text field).

|dataset|# of instances|# of features | Your NB Training Time | Your NB Test Time | Your NB Accuracy | sklearn CategoricalNB Training Time | sklearn Categorical NB Test Time | sklearn CategoricalNB Accuracy|
|-----------|------------|-------------|------------------|-------------------|-------------------------|---------------------------------|------------------------|----------------------------------|
test1_1 |
test1_2 |
test1_4 |
test1_5 |
test2_1 |
test2_2 |
test2_4 |
test2_5 |
test4_1 |
test4_2 |
test4_4 |
test4_5 |
test5_1 |
test5_2 |
test5_4 |


In [5]:
#Train and test your BinaryNBClassifier and the sklearn CategoricalNBClassifier on the datasets from canvas

# Loop through files in directory:

# import required modules
import os
import time
from sklearn.metrics import accuracy_score, confusion_matrix


# def evaluate_results(y_train,y_pred):

# assign directory
directory = 'Datasets for Assignment 2'

# print header
print("|dataset\t|# of instances\t|# of features\t|Your NB Training Time\t|Your NB Test Time\t|Your NB Accuracy\t|sklearn CategoricalNB Training Time\t|sklearn Categorical NB Test Time\t|sklearn CategoricalNB Accuracy\t|")
 
# iterate over files in directory
for filename in sorted(os.listdir(directory)):    
    if not filename.startswith('.') and filename != "vote.csv": # This command excludes the .DS_Store common on Mac OS  : and filename == "test1_4.csv"
        f = os.path.join(directory, filename)
        # checking if it is a file
        if os.path.isfile(f):
            # print(f)
            df = pd.DataFrame()
            df = pd.read_csv(f,)
            df.info
            
            # Prepare the data

            #   The values for the Events are up to the last column
            #   X = np.zeros(1) # Reset the array
            X = df.iloc[:,:-1].values # The values are everything but the last column
            
            #   The values for the Classification are in the last column
            y = df.iloc[:,-1:].values

            # Splitting the dataset into the Training set and Test set
            # from sklearn.model_selection import train_test_split
            # X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 42)

            # However for this particular exercise, the instruction is to use all the samples for training and testing
            X_train = X
            y_train = y

            X_test = X
            y_test = y

            num_instances, num_features = np.shape(X_train)

            # Call pipelines
            CategoricalNB_fit_start =  time.time() # Returns Unix epoch time
            CategoricalNB_pipe.fit(X_train, y_train.ravel())
            CategoricalNB_fit_end =  time.time() # Returns Unix epoch time
            CategoricalNB_fit_seconds_elapsed = round(CategoricalNB_fit_end - CategoricalNB_fit_start, 6)

            CategoricalNB_predict = CategoricalNB_pipe.predict(X_train)
            CategoricalNB_predict_end =  time.time() # Returns Unix epoch time
            CategoricalNB_predict_seconds_elapsed = round(CategoricalNB_predict_end - CategoricalNB_fit_end, 6)
            CategoricalNB_accuracy =  accuracy_score(y_train, CategoricalNB_predict)


            NBClassifier_fit_start =  time.time() # Returns Unix epoch time
            NBClassifier_pipe.fit(X_train, y_train)
            NBClassifier_fit_end =  time.time() # Returns Unix epoch time
            NBClassifier_fit_seconds_elapsed = round(NBClassifier_fit_end - NBClassifier_fit_start, 6)

            NBClassifier_predict = NBClassifier_pipe.predict(X_train)
            NBClassifier_predict_end =  time.time() # Returns Unix epoch time
            NBClassifier_predict_seconds_elapsed = round(NBClassifier_predict_end - NBClassifier_fit_end, 6)
            NBClassifier_accuracy =  accuracy_score(y_train, NBClassifier_predict)
            # print(f"NBClassifier accuracy: {NBClassifier_accuracy}")
            
            print("|---------------|---------------|---------------|-----------------------|-----------------------|-----------------------|---------------------------------------|---------------------------------------|-------------------------------|")
            print(f"|{filename[:-4]}\t|{num_instances}\t\t|{num_features}\t\t|{NBClassifier_fit_seconds_elapsed}\t\t|{NBClassifier_predict_seconds_elapsed}\t\t|{NBClassifier_accuracy}\t\t\t|{CategoricalNB_fit_seconds_elapsed}\t\t\t\t|{CategoricalNB_predict_seconds_elapsed}\t\t\t\t|{CategoricalNB_accuracy}\t\t\t\t|")
print("|---------------|---------------|---------------|-----------------------|-----------------------|-----------------------|---------------------------------------|---------------------------------------|-------------------------------|")


|dataset	|# of instances	|# of features	|Your NB Training Time	|Your NB Test Time	|Your NB Accuracy	|sklearn CategoricalNB Training Time	|sklearn Categorical NB Test Time	|sklearn CategoricalNB Accuracy	|
|---------------|---------------|---------------|-----------------------|-----------------------|-----------------------|---------------------------------------|---------------------------------------|-------------------------------|
|test1_1	|10		|10		|0.001348		|0.000485		|0.8			|0.002883				|0.000124				|0.9				|
|---------------|---------------|---------------|-----------------------|-----------------------|-----------------------|---------------------------------------|---------------------------------------|-------------------------------|
|test1_2	|10		|100		|0.010266		|0.00042		|1.0			|0.012633				|0.000214				|1.0				|
|---------------|---------------|---------------|-----------------------|-----------------------|-----------------------|--------------------------------------

The next step for this assignment is split the vote dataset (also found on canvas) into a *train/test split* (use 20% of the data for testing).  Train both algorithms on the training data using *cross-fold validation* and then report the accuracy, f1-score, mcc and informedness results.

In [6]:
#Split the vote dataset
#Use cross-validatation to compare BinaryNBClassifier against CategoricalNBClassifier

# Loop through files in directory:

# import required modules
import os
import time
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef, balanced_accuracy_score # balanced_accuracy_score with adjusted=True is Informedness
from sklearn.model_selection import cross_val_score
from sklearn import preprocessing


# def evaluate_results(y_train,y_pred):

# assign directory
directory = 'Datasets for Assignment 2'

# print header
print("|dataset\t|# of instances\t|# of features\t|Your NB Training Time\t|Your NB Test Time\t|Your NB Accuracy\t|sklearn CategoricalNB Training Time\t|sklearn Categorical NB Test Time\t|sklearn CategoricalNB Accuracy\t|")
 
# iterate over files in directory
for filename in sorted(os.listdir(directory)):    
    if not filename.startswith('.') and filename == "vote.csv": # This command excludes the .DS_Store common on Mac OS  : and filename == "test1_4.csv"
        f = os.path.join(directory, filename)
        # checking if it is a file
        if os.path.isfile(f):
            # print(f)
            df = pd.DataFrame()
            df = pd.read_csv(f,)
            df.info
            
            # Prepare the data

            #   The values for the Events are up to the last column
            #   X = np.zeros(1) # Reset the array
            X = df.iloc[:,:-1].values # The values are everything but the last column
            
            #   The values for the Classification are in the last column
            y = df.iloc[:,-1:].values

            # Splitting the dataset into the Training set and Test set
            from sklearn.model_selection import train_test_split
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 42)

            num_instances, num_features = np.shape(X_train)

            # Call pipelines
            CategoricalNB_fit_start =  time.time() # Returns Unix epoch time
            CategoricalNB_pipe.fit(X_train, y_train.ravel())
            CategoricalNB_fit_end =  time.time() # Returns Unix epoch time
            CategoricalNB_fit_seconds_elapsed = round(CategoricalNB_fit_end - CategoricalNB_fit_start, 6)

            CategoricalNB_predict = CategoricalNB_pipe.predict(X_train)
            CategoricalNB_predict_end =  time.time() # Returns Unix epoch time
            CategoricalNB_predict_seconds_elapsed = round(CategoricalNB_predict_end - CategoricalNB_fit_end, 6)
            CategoricalNB_accuracy =  accuracy_score(y_train, CategoricalNB_predict)


            NBClassifier_fit_start =  time.time() # Returns Unix epoch time
            NBClassifier_pipe.fit(X_train, y_train)
            NBClassifier_fit_end =  time.time() # Returns Unix epoch time
            NBClassifier_fit_seconds_elapsed = round(NBClassifier_fit_end - NBClassifier_fit_start, 6)

            NBClassifier_predict = NBClassifier_pipe.predict(X_train)
            NBClassifier_predict_end =  time.time() # Returns Unix epoch time
            NBClassifier_predict_seconds_elapsed = round(NBClassifier_predict_end - NBClassifier_fit_end, 6)
            NBClassifier_accuracy =  accuracy_score(y_train, NBClassifier_predict)
            # print(f"NBClassifier accuracy: {NBClassifier_accuracy}")
            
            print("|---------------|---------------|---------------|-----------------------|-----------------------|-----------------------|---------------------------------------|---------------------------------------|-------------------------------|")
            print(f"|{filename[:-4]}\t\t|{num_instances}\t\t|{num_features}\t\t|{NBClassifier_fit_seconds_elapsed}\t\t|{NBClassifier_predict_seconds_elapsed}\t\t|{NBClassifier_accuracy}\t|{CategoricalNB_fit_seconds_elapsed}\t\t\t\t|{CategoricalNB_predict_seconds_elapsed}\t\t\t\t|{CategoricalNB_accuracy}\t\t|")
            
print("|---------------|---------------|---------------|-----------------------|-----------------------|-----------------------|---------------------------------------|---------------------------------------|-------------------------------|")


|dataset	|# of instances	|# of features	|Your NB Training Time	|Your NB Test Time	|Your NB Accuracy	|sklearn CategoricalNB Training Time	|sklearn Categorical NB Test Time	|sklearn CategoricalNB Accuracy	|
|---------------|---------------|---------------|-----------------------|-----------------------|-----------------------|---------------------------------------|---------------------------------------|-------------------------------|
|vote		|391		|17		|0.005095		|0.006629		|0.9104859335038363	|0.007196				|0.000219				|0.9156010230179028		|
|---------------|---------------|---------------|-----------------------|-----------------------|-----------------------|---------------------------------------|---------------------------------------|-------------------------------|


Finally, choose the algorithm that performed the best on the cross-validation, train it on all the training data and test on the test data.  Report the accuracy, f1-score, mcc and informedness results

In [7]:
def warn(*args, **kwargs): #Was getting a warning so I suppressed warnings: "IndexError: index 434 is out of bounds for axis 1 with size 434"
    pass
import warnings
warnings.warn = warn

# Perform testing with cross validation
myNBClassifier = NBClassifier_pipe
cv_myNBClassifier = cross_val_score(myNBClassifier, X_train, y_train, cv=5)
print(f"cv_myNBClassifier: {np.mean(cv_myNBClassifier[~np.isnan(cv_myNBClassifier)])}") # Have to handle NaN values

myCategoricalNB = CategoricalNB_pipe
cv_CategoricalNB = cross_val_score(myCategoricalNB, X_train, y_train.ravel(), cv=5)
print(f"cv_CategoricalNB: {np.mean(cv_CategoricalNB[~np.isnan(cv_CategoricalNB)])}") # Have to handle NaN values
print("Conclusion: The CategoricalNB is slightly better")

print()
my_f1_score = f1_score(y_train, CategoricalNB_predict)
my_mcc_score = matthews_corrcoef(y_train, CategoricalNB_predict)
my_balanced_accuracy_score = balanced_accuracy_score(y_train, CategoricalNB_predict, adjusted=True)
print(f"CategoricalNB F1 Score: {my_f1_score} MCC: {matthews_corrcoef} Informedness: {my_balanced_accuracy_score}")
print("\n\n")


cv_myNBClassifier: 0.9054203180785461
cv_CategoricalNB: 0.9073758519961052
Conclusion: The CategoricalNB is slightly better

CategoricalNB F1 Score: 0.8945686900958466 MCC: <function matthews_corrcoef at 0x1776b36a0> Informedness: 0.8378976486860306





In [8]:
#Final Generalization test
CategoricalNB_predict = CategoricalNB_pipe.predict(X_test)
my_f1_score = f1_score(y_test, CategoricalNB_predict)
my_mcc_score = matthews_corrcoef(y_test, CategoricalNB_predict)
my_balanced_accuracy_score = balanced_accuracy_score(y_test, CategoricalNB_predict, adjusted=True)
print(f"CategoricalNB F1 Score: {my_f1_score} MCC: {matthews_corrcoef} Informedness: {my_balanced_accuracy_score}")

print("Conclusion: Some generalization error is present, but it is not extreme. I would feel comfortable moving ahead with this model for classifying the vote data")


CategoricalNB F1 Score: 0.8717948717948718 MCC: <function matthews_corrcoef at 0x1776b36a0> Informedness: 0.7905982905982905
Conclusion: Some generalization error is present, but it is not extreme. I would feel comfortable moving ahead with this model for classifying the vote data


*What follows are code snippets I used in working on this assignment*

In [9]:
# Check np operations to validate they are doing what I think they are doing:
np.log10([.1, 0, .01, .001])
np.log10(np.sort([.1, 0, .01, .001]))
np.log10(np.trim_zeros(np.sort([.1, 0, .01, .001])))
np.sum(np.log10(np.trim_zeros(np.sort([.1, 0, .01, .001]))))
10**(np.sum(np.log10(np.trim_zeros(np.sort([.1, 0, .01, .001])))))
.1 * .01 * .001

  np.log10([.1, 0, .01, .001])
  np.log10(np.sort([.1, 0, .01, .001]))


1e-06

In [10]:
#lets see the actual and predicted value side by side
# y_compare = np.vstack((y_train.ravel(),NBClassifier_predict)).T
#actual value on the left side and predicted value on the right hand side
#printing the top 5 values
# y_compare[:]

In [11]:
# This cell remains as a reminder of a LOT of work I did to try to maintain the words with their 'labels'. 

#Spliting the dataset in independent and dependent variables
# X = df.iloc[:,:-1].to_dict('list') # The idea here is to capture all the columns except the last one as X
# y = df['Class'].to_dict('records')