__Naive Bayes Classifier Theory__

The Naive Bayes classifier is a generative model that expresses classification through the conditional probability $p(C_{(k)}|\boldsymbol{x})$, where $C_{(k)}$ is the $k$-th class and $\boldsymbol{x}$ is the data observation. From Bayes' rule, and because $p(\boldsymbol{x})$ does not depend on the class,

$$ p(C_{(k)}|\boldsymbol{x}) = \frac{p(C_{(k)}) p(\boldsymbol{x}|C_{(k)})}{p(\boldsymbol{x})} \propto p(C_{(k)}) p(\boldsymbol{x}|C_{(k)}) $$

The naive Bayes premise assumes that, given the class, each feature $x_i \in \boldsymbol{x}$ is conditionally independent of every other feature $x_j \in \boldsymbol{x}$, for $i,j = 1,\dots,n$ and $i \neq j$. That is,

$$ p(\boldsymbol{x}|C_{(k)}) = p(x_1,\dots,x_n|C_{(k)}) = \prod_{i=1}^n p(x_i|C_{(k)}) $$

Thus,

$$ p(C_{(k)}|\boldsymbol{x}) \propto p(C_{(k)}) \prod_{i=1}^n p(x_i|C_{(k)}) $$

The Naive Bayes classifier assigns the class that maximizes this posterior probability (the so-called _maximum a posteriori_, or _MAP_, decision rule).

$$ \hat{y} = \underset{k \in \{1, \dots, K\}}{\operatorname{argmax}} \enspace p(C_{(k)}) \prod_{i=1}^n p(x_i|C_{(k)}) $$
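
A product of many small probabilities can underflow in floating point, so the MAP rule is commonly evaluated as a sum of logarithms, which leaves the argmax unchanged. The snippet below is a minimal sketch of that idea, not the code of the `gnb` class; the function name `map_predict`, its arguments, and the toy probabilities are all assumptions made for illustration.

```python
import numpy as np

def map_predict(log_prior, log_lik):
    """Return the class index k that maximizes
    log p(C_k) + sum_i log p(x_i | C_k) for a single observation.

    log_prior : shape (K,),   log p(C_k)            (hypothetical input)
    log_lik   : shape (K, n), log p(x_i | C_k)      (hypothetical input)
    """
    log_prior = np.asarray(log_prior, dtype=float)
    log_lik = np.asarray(log_lik, dtype=float)
    # Summing logs is equivalent to taking the log of the product
    scores = log_prior + log_lik.sum(axis=1)
    return int(np.argmax(scores))

# Toy numbers (illustrative only): K = 2 classes, n = 3 features
log_prior = np.log([0.6, 0.4])
log_lik = np.log([[0.2, 0.5, 0.1],
                  [0.3, 0.4, 0.6]])
k_hat = map_predict(log_prior, log_lik)
print(k_hat)  # unnormalized posteriors are 0.006 vs 0.0288, so index 1
```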

The parametric form of $p(x_i|C_{(k)})$ is chosen based on knowledge of the data, and its parameters are estimated from the training data. For real-valued data, Gaussian Naive Bayes is often used, where $p(x_i|C_{(k)})$ is taken to be Gaussian with a class-specific mean $\mu_{i(k)}$ and variance $\sigma_{i(k)}^2$ for each feature. That is,

$$ p(x_i = \hat{x}_i | C_{(k)}) =  \frac{1}{\sqrt{2 \pi \sigma_{i(k)}^2}} e^{-\frac{\left( \hat{x}_i-\mu_{i(k)} \right)^2}{2 \sigma_{i(k)}^2}} $$
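
To make the estimation step concrete, here is a rough sketch of Gaussian Naive Bayes in NumPy. It assumes maximum-likelihood estimates (per-class sample mean and variance for $\mu_{i(k)}$ and $\sigma_{i(k)}^2$, class frequency for $p(C_{(k)})$) and adds a small variance floor for numerical safety. The function names, the `var_floor` parameter, and the toy data are hypothetical, and the `gnb` class may be implemented differently.

```python
import numpy as np

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate class priors and per-class, per-feature Gaussian parameters."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])          # p(C_k)
    means = np.array([X[y == c].mean(axis=0) for c in classes])    # mu_{i(k)}
    # var_floor (an assumption) avoids division by zero for constant features
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + var_floor
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Apply the MAP rule in log space using the Gaussian density."""
    X = np.asarray(X, dtype=float)
    # diff has shape (samples, classes, features)
    diff = X[:, None, :] - means[None, :, :]
    log_pdf = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None])
    scores = np.log(priors)[None, :] + log_pdf.sum(axis=2)
    return classes[np.argmax(scores, axis=1)]

# Toy data (illustrative only): two well-separated classes, two features
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [4.0, 5.0], [4.2, 5.1], [3.8, 4.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
params = fit_gaussian_nb(X_toy, y_toy)
pred = predict_gaussian_nb(np.array([[1.1, 2.1], [4.1, 5.0]]), *params)
print(pred)
```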

__Naive Bayes Classifier Code__

I implemented the Gaussian Naive Bayes classifier in the `gnb` class. The code below tests the class on a synthetic version of a breast cancer malignancy data set.

In [2]:
# Test code
       
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.metrics import confusion_matrix

from gnb import gnb # Custom Gaussian Naive Bayes class

# Import cancer image features data   
data = pd.read_csv('Data_to_applicants.csv')
y = data.loc[:,'diagnosis']
X_raw = data.loc[:,'radius_mean':'fractal_dimension_worst']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_raw, y, 
                                                    test_size = 0.1, 
                                                    stratify = y, 
                                                    random_state = 1001)
X_train = pd.DataFrame(X_train, columns = X_raw.columns)
y_train = pd.DataFrame(y_train, columns = ['diagnosis'])
X_test = pd.DataFrame(X_test, columns = X_raw.columns)
y_test = pd.DataFrame(y_test, columns = ['diagnosis'])

# Impute missing values (the imputer is fit on the training set only)
imp = IterativeImputer(max_iter=100, random_state=0)
imp.fit(X_train)
# Reuse the feature names of the split itself so the labels stay aligned
X_train_impute = pd.DataFrame(data = imp.transform(X_train),
                              columns = X_train.columns)
X_test_impute = pd.DataFrame(data = imp.transform(X_test),
                             columns = X_test.columns)

# Drop columns with little differential variance or high covariance with 
# other features
drop_these = ['perimeter_mean', 'area_mean','perimeter_se', 'area_se',
              'perimeter_worst', 'area_worst', 'smoothness_mean', 
              'fractal_dimension_mean', 'smoothness_se', 
              'compactness_se', 'concavity_se', 'concave points_se',
              'symmetry_se', 'fractal_dimension_se',
              'fractal_dimension_worst']
X_train_reduced = X_train_impute.drop(columns = drop_these)
X_test_reduced = X_test_impute.drop(columns = drop_these)                                           

# Fit Gaussian Naive Bayes classifier
classifier = gnb()
classifier.fit(X_train_reduced, y_train)

# Predict on test set
y_pred = classifier.predict(X_test_reduced)

# Confusion matrix (rows: true labels, columns: predicted labels)
def print_cm(cm, labels, hide_zeroes=False, hide_diagonal=False, hide_threshold=None):
    """
    Pretty print for confusion matrices

    Reference link: https://gist.github.com/zachguo/10296432
    """
    columnwidth = max([len(x) for x in labels] + [5])  # 5 is value length
    empty_cell = " " * columnwidth
    
    # Begin CHANGES
    fst_empty_cell = (columnwidth-3)//2 * " " + "t/p" + (columnwidth-3)//2 * " "
    
    if len(fst_empty_cell) < len(empty_cell):
        fst_empty_cell = " " * (len(empty_cell) - len(fst_empty_cell)) + fst_empty_cell
    # Print header
    print("    " + fst_empty_cell, end=" ")
    # End CHANGES
    
    for label in labels:
        print("%{0}s".format(columnwidth) % label, end=" ")
        
    print()
    # Print rows
    for i, label1 in enumerate(labels):
        print("    %{0}s".format(columnwidth) % label1, end=" ")
        for j in range(len(labels)):
            cell = "%{0}.1d".format(columnwidth) % cm[i, j]
            if hide_zeroes:
                cell = cell if float(cm[i, j]) != 0 else empty_cell
            if hide_diagonal:
                cell = cell if i != j else empty_cell
            if hide_threshold:
                cell = cell if cm[i, j] > hide_threshold else empty_cell
            print(cell, end=" ")
        print()
        
cm = confusion_matrix(y_test,y_pred)
print_cm(cm, ['Benign', 'Malignant'])

       t/p       Benign Malignant 
       Benign        34         2 
    Malignant         5        16 
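
The counts above can be condensed into a few summary metrics. The sketch below simply re-enters the four printed counts by hand and assumes rows are true labels and columns are predicted labels (scikit-learn's `confusion_matrix` convention, matching the "t/p" header), with Malignant treated as the positive class. The variable names are illustrative.

```python
import numpy as np

# Counts copied from the printed confusion matrix (rows: true, columns: predicted;
# order: Benign, Malignant). Malignant is assumed to be the positive class.
cm_reported = np.array([[34, 2],
                        [5, 16]])
tn, fp, fn, tp = cm_reported.ravel()

accuracy = (tp + tn) / cm_reported.sum()   # 50 / 57
sensitivity = tp / (tp + fn)               # 16 / 21, recall on malignant cases
specificity = tn / (tn + fp)               # 34 / 36, recall on benign cases
precision = tp / (tp + fp)                 # 16 / 18

print(f"accuracy={accuracy:.3f}  sensitivity={sensitivity:.3f}  "
      f"specificity={specificity:.3f}  precision={precision:.3f}")
```

Under these assumptions, sensitivity on the malignant class is lower than specificity, since 5 of the 21 malignant test cases were predicted benign.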
