Resources Consulted:
* https://sklearn-template.readthedocs.io/en/latest/user_guide.html
* https://saturncloud.io/blog/how-to-keep-column-names-when-converting-from-pandas-to-numpy
* https://github.com/ApoorvRusia/Naive-Bayes-classification-on-Iris-dataset/blob/master/Naiye%20Bayes%20classification%20application.ipynb
* https://datascience.stackexchange.com/questions/18904/how-do-i-convert-a-pandas-dataframe-to-a-1d-array
* https://stackoverflow.com/questions/35996970/typeerror-fit-missing-1-required-positional-argument-y
* https://machinelearningmastery.com/bayes-theorem-for-machine-learning/
* https://news.ycombinator.com/item?id=21151032
* https://www.countbayesie.com/blog/2016/5/1/a-guide-to-bayesian-statistics
* **https://www.kdnuggets.com/2020/07/spam-filter-python-naive-bayes-scratch.html**

Notes for study from machinelearningmastery link above:

The result P(A|B) is referred to as the posterior probability and P(A) is referred to as the prior probability.

P(A|B): Posterior probability.
P(A): Prior probability.
Sometimes P(B|A) is referred to as the likelihood and P(B) is referred to as the evidence.

P(B|A): Likelihood.
P(B): Evidence.
This allows Bayes Theorem to be restated as:

Posterior = Likelihood * Prior / Evidence

For this assignment you will need to use the sklearn framework to implement a custom Naive Bayes classifier.  The classifier only needs to handle binary data (both the attributes and the classes).  The attributes will always have a value of 0 or 1.  The class labels will always have a value of 1 or -1.  You can use libraries to help with the data processing, calculations, etc, but you must implement your own Naïve Bayes algorithm.  Do not use an existing implementation.  One important implementation detail is that you should convert the probabilities to log probabilities to avoid the number becoming to small to represent as a floating point number.  For example instead of computing P(x|c)P(c) compute log(P(x|c)+log(P(c)). Provide your implementation below.

In [1]:
#initial imports that you may find useful
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
import numpy as np

In [2]:
#additional imports
import pandas as pd


In [8]:
# #Your Naive Bayes Implementation goes here.
# #Adjust this as you see fit

class BinaryNBClassifier(BaseEstimator, ClassifierMixin):        
    def __init__(self, k=0.5):
        self.k = k
        self.word_probs = []
             
    def fit(self, X, y):
        
        # Initiate variables
        word_in_spam = []
        word_in_ham = []
        word_in_event = []

        #  STEP 1: Determine how often the labeled class appears (e.g. spam). This variable is called p_of_b
        total_b = 0
        
        self.row_counter = 0
        for row in y:
            if y[self.row_counter][0] == 1: # Note: Because the not spam values are recorded as -1 rather than 0 we can't simply add up y using np.sum(y)
                total_b += 1
            self.row_counter += 1
        
        #   We will need the following two values later in the predict() function so we need to make sure they are in scope
        global p_of_b
        global p_of_not_b

        p_of_b = total_b / len(y) # This is how often the B event happens in the universe of labeled classifications
        p_of_not_b = 1 - p_of_b

        # STEP 2: Calculate the total number of A events when B is true as well when B is not true. 
        # For example, in spam num_of_A_events is equal to the number of A is true words in all the spam labeled messages
        # To accomplish this we basically loop through all the 'B is true' rows and add all the '1' values at each row. 
        self.num_of_A_events_when_B = 0
        self.num_of_A_events_when_not_B = 0


        self.row_counter = 0
        for row in X:
            if y[self.row_counter] == 1:
                self.num_of_A_events_when_B = self.num_of_A_events_when_B + int(np.sum(X[self.row_counter])) # The logic here is the numpy.sum operator will add all the 1s without having to loop
                # print(f"Row {self.row_counter} num_of_A_events_when_B running total is {self.num_of_A_events_when_B}")
            elif y[self.row_counter] != 1:
                # print(f"Count for B is not true for row {self.row_counter}: {np.sum(X[self.row_counter])}")                
                self.num_of_A_events_when_not_B = self.num_of_A_events_when_not_B + np.sum(X[self.row_counter])
                # print(f"Row {self.row_counter} num_of_A_events_when_not_B running total is {self.num_of_A_events_when_not_B}")
            self.row_counter += 1

        p_of_a_given_b = self.num_of_A_events_when_B / total_b
        p_of_a_given_not_b = self.num_of_A_events_when_not_B / total_b

        # STEP 3: Calculate the total number of features in A x number of events (this is like the count of 'cells')
        self.num_of_A_features = np.size(X)

        # STEP 4: With the constants in place we can look at the probability of the features when B is true/not true
        
        #   Prepopulate a single dimension array of length "feature_count" with 0s
        self.count_of_feature_A_when_b = np.full(len(y), 0) 
        self.count_of_feature_A_when_not_b = np.full(len(y), 0)
        self.word_in_event_counter = np.full(len(y), 0)

        global prob_of_feature_A_when_b
        global prob_of_feature_A_when_not_b

        #   For each event, go through each feature. Increment the appropriate counter based on whether the word appears when B is true / not true
        self.row_counter = 0
        self.column_counter = 0
        self.evaluate_A_feature = int
        self.evaluate_B_feature = int

        for row in X:
            for value in row:
                # The following rows setup the condition to search for
                self.evaluate_A_feature = X[self.row_counter][self.column_counter]
                self.evaluate_B_feature = y[self.row_counter][0]

                if (self.evaluate_A_feature == 1 and self.evaluate_B_feature == 1):
                    self.count_of_feature_A_when_b[self.column_counter] += 1
                elif (self.evaluate_A_feature == 1 and self.evaluate_B_feature != 1):                    
                    self.count_of_feature_A_when_not_b[self.column_counter] += 1
                if (self.evaluate_A_feature == 1):
                    self.word_in_event_counter[self.column_counter] += 1
                self.column_counter += 1                
            self.column_counter = 0
            self.row_counter += 1

        #   Having calculated the count of each identified feature, we can now calculate the probabilities of each feature as it appears in B, not B, and total
        prob_of_feature_A_when_b = np.divide(self.count_of_feature_A_when_b, self.word_in_event_counter) # The number of times the A feature occurs given B / Number of events marked B
        print(f"prob_of_feature_A_when_b: {prob_of_feature_A_when_b}")
        prob_of_feature_A_when_not_b = np.divide(self.count_of_feature_A_when_not_b, self.word_in_event_counter) # The number of times the A feature occurs given B / Number of events marked B
        print(f"prob_of_feature_A_when_not_b: {prob_of_feature_A_when_not_b}")

        # # END: At this point, we have all the calculations required for what is necessary in the predict method

        return self
    
    def predict(self, X):

        import math

        # The predict function accepts N events with M features to make a classification using the Naive Bayes implementation
        # We want to multiply the existence of a feature (or lack thereof) by the probability that feature appears in the B (or not B) labeled training set

        num_rows, num_cols = X.shape
        y_predicted = np.full(num_rows, 0) 

        self.row_counter = 0
        for row in X:
            print(f"X: {row}")

            # Using the numpy multiply operator we can multiple each 'cell' by the corresponding 'cell'
            prob_of_feature_X_as_B_fit_weighted = np.multiply(row, prob_of_feature_A_when_b)
            # print(f"Predict: prob_of_feature_X_as_B_fit_weighted: {prob_of_feature_X_as_B_fit_weighted}")
            prob_of_feature_X_as_not_B_fit_weighted = np.multiply(row, prob_of_feature_A_when_not_b)
            # print(f"Predict: prob_of_feature_X_as_not_B_fit_weighted: {prob_of_feature_X_as_not_B_fit_weighted}\n")

            # With the probabilities of each feature we can now add the log() scores together
            #   Note:   It's common for the probabilities for some of the features to come back zero
            #           However the np.log function breaks when trying to take the log(0) since there's no exponent that will get the base value to zero
            #           Since we no longer care about the order of the values, we can np.sort(), then np.trim_zeros to get rid of leading or trailing zeros
            #           before we take the log()        
   
            prob_of_feature_X_as_B_fit_weighted_log = np.sum(np.log(np.trim_zeros(np.sort(prob_of_feature_X_as_B_fit_weighted))))
            prob_of_feature_X_as_not_B_fit_weighted_log = np.sum(np.log(np.trim_zeros(np.sort(prob_of_feature_X_as_not_B_fit_weighted))))

            # print(f"prob_of_feature_X_as_B_fit_weighted_log: {prob_of_feature_X_as_B_fit_weighted_log}")
            # print(f"prob_of_feature_X_as_not_B_fit_weighted_log: {prob_of_feature_X_as_not_B_fit_weighted_log}")

            prob_of_B_given_A = 10**(prob_of_feature_X_as_B_fit_weighted_log + math.log10(p_of_b))
            prob_of_not_B_given_A =  10**(prob_of_feature_X_as_not_B_fit_weighted_log + math.log10(p_of_not_b)) # ** is the operator for raising a value to that power

            # print(f"prob_of_B_given_A: {prob_of_B_given_A} vs prob_of_not_B_given_A: {prob_of_not_B_given_A}")

            if prob_of_B_given_A >= prob_of_not_B_given_A: 
                y_predicted[self.row_counter] = 1
            else:
                y_predicted[self.row_counter] = -1

            self.row_counter += 1

        print(f"y_predicted: {y_predicted}")
        return self


In [5]:
# Loop through files in directory:

# import required module
import os
# assign directory
directory = 'Datasets for Assignment 2'
 
# iterate over files in directory
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    # checking if it is a file
    if os.path.isfile(f):
        # print(f)
        df = pd.read_csv(f,)
        df.info
        
        # Prepare the data

        # The values for the Events are up to the last column
        X = df.iloc[:,:-1].values # The values are everything but the last column
        
        # The values for the Classification are in the last column
        y = df.iloc[:,-1:].values

        # Splitting the dataset into the Training set and Test set
        # from sklearn.model_selection import train_test_split
        # X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 42)

        # However for this particular exercise, the instruction is to use all the samples for training and testing
        X_train = X
        y_train = y

        X_test = X
        y_test = y


In [11]:
# Define pipelines

from sklearn.pipeline import Pipeline # For setting up pipeline
from sklearn.naive_bayes import CategoricalNB
import time

NBClassifier_pipe = Pipeline([
# ('scaler', StandardScaler()), # Not necessary for this exercise
# ('selector', VarianceThreshold()), # Not necessary for this exercise
('classifier', BinaryNBClassifier())
])

CategoricalNB_pipe = Pipeline([
# ('scaler', StandardScaler()), # Not necessary for this exercise
# ('selector', VarianceThreshold()), # Not necessary for this exercise
('classifier', CategoricalNB())
])

# Call pipelines
NBClassifier_pipe_start =  time.time() # Returns Unix epoch time
NBClassifier_pipe.fit(X_train, y_train)
NBClassifier_pipe.predict(y_train)
NBClassifier_pipe_end =  time.time() # Returns Unix epoch time
NBClassifier_pipe_seconds_elapsed = NBClassifier_pipe_end - NBClassifier_pipe_start

CategoricalNB_pipe_start =  time.time() # Returns Unix epoch time
CategoricalNB_pipe.fit(X_train, y_train)
CategoricalNB_pipe.predict(y_train)
CategoricalNB_pipe_end =  time.time() # Returns Unix epoch time
CategoricalNB_pipe_seconds_elapsed = CategoricalNB_pipe_end - CategoricalNB_pipe_start



IndexError: index 9 is out of bounds for axis 0 with size 9

In [113]:
# Predicting the set results
y_pred = cclassifier.predict(X_train)
print(y_train)
print(y_pred)


[ 1  1  1  1  1 -1 -1 -1 -1 -1]
[ 1  1  1  1  1 -1  1 -1 -1 -1]


In [115]:
#lets see the actual and predicted value side by side
y_compare = np.vstack((y_train,y_pred)).T
#actual value on the left side and predicted value on the right hand side
#printing the top 5 values
y_compare[:]

array([[ 1,  1],
       [ 1,  1],
       [ 1,  1],
       [ 1,  1],
       [ 1,  1],
       [-1, -1],
       [-1,  1],
       [-1, -1],
       [-1, -1],
       [-1, -1]])

In [116]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_train, y_pred)
print(cm)

[[4 1]
 [0 5]]


In [117]:
#finding accuracy from the confusion matrix.
a = cm.shape
corrPred = 0
falsePred = 0

for row in range(a[0]):
    for c in range(a[1]):
        if row == c:
            corrPred +=cm[row,c]
        else:
            falsePred += cm[row,c]
print('Correct predictions: ', corrPred)
print('False predictions', falsePred)
print ('\n\nAccuracy of the Categorical Clasification is: ', corrPred/(cm.sum()))   

Correct predictions:  9
False predictions 1


Accuracy of the Categorical Clasification is:  0.9


Now you will train and test your Binary Naive Bayes classifier on a few different datasets.  The datasets can be downloaded from canvas.  They are linked in the assignment description.  For this part of the assignment we will not be splitting the data into training, validation and test data sets.  Instead you should use the entire dataset for training and the entire dataset for testing.  You will need to complete the following table (you can just output the results in this format you don't need to copy them into the text field).

|dataset|# of instances|# of features | Your NB Training Time | Your NB Test Time | Your NB Accuracy | sklearn CategoricalNB Training Time | sklearn Categorical NB Test Time | sklearn CategoricalNB Accuracy|
|-----------|------------|-------------|------------------|-------------------|-------------------------|---------------------------------|------------------------|----------------------------------|
test1_1 |
test1_2 |
test1_4 |
test1_5 |
test2_1 |
test2_2 |
test2_4 |
test2_5 |
test4_1 |
test4_2 |
test4_4 |
test4_5 |
test5_1 |
test5_2 |
test5_4 |


In [15]:
#Train and test your BinaryNBClassifier and the sklearn CategoricalNBClassifier on the datasets from canvas


The next step for this assignment is split the vote dataset (also found on canvas) into a *train/test split* (use 20% of the data for testing).  Train both algorithms on the training data using *cross-fold validation* and then report the accuracy, f1-score, mcc and informedness results.

In [16]:
#Split the vote dataset
#Use cross-validatation to compare BinaryNBClassifier against CategoricalNBClassifier

Finally, choose the algorithm that performed the best on the cross-validation, train it on all the training data and test on the test data.  Report the accuracy, f1-score, mcc and informedness results

In [None]:
#Final Generalization test

In [None]:
# This cell remains as a reminder of a LOT of work I did to try to maintain the words with their 'labels'. 

#Spliting the dataset in independent and dependent variables
# X = df.iloc[:,:-1].to_dict('list') # The idea here is to capture all the columns except the last one as X
# y = df['Class'].to_dict('records')