<center> <h1>📩 Naive Bayes 📩</h1> </center>

<p> <center> This notebook is in <span style="color: green"> <b> Active </b> </span> state of development! </center> </p>  
<p> <center> Be sure to check out my other notebooks for <span style="color: blue"> <b> knowledge, insight and laughter </b> </span>! 🧠💡😂</center> </p> 

<center> <img src="https://www2.isye.gatech.edu/isyebayes/bank/bayesfun.jpg" width="625" height="625" /> </center>

<hr style="height:2px;border-width:0;color:gray;background-color:gray">

# Aim

The aim is to provide, from scratch, a code implementation of a Naive Bayes classifier. This will involve both the main functions needed to train the classifier and make predictions, and some additional utility functions as well.

**Note**: We will not be diving into in-depth exploratory data analysis, feature engineering, etc. in these notebooks, and so will not be commenting extensively on things such as skewness, kurtosis, homoscedasticity, etc.

<hr style="height:2px;border-width:0;color:gray;background-color:gray">

# Background

Naive Bayes is a classification algorithm based on Bayes' theorem. Bayes' theorem provides a way to calculate the probability of a data point belonging to a given class, given our prior knowledge. It is defined as

$$
\mathbb P (class|data) = \frac{\mathbb P (data|class) \ \mathbb P (class)}{\mathbb P (data)} ,
$$

where $\mathbb P (class | data)$ is the probability of each potential class given the provided data. The terms in the equation above are commonly called the prior, likelihood, evidence, and posterior, as follows.

$$
\overbrace{\mathbb P (class|data)}^{\text{posterior}} = \frac{\overbrace{\mathbb P (data|class)}^{\text{likelihood}} \ \overbrace{\mathbb P (class)}^{\text{prior}}}{\underbrace{\mathbb P (data)}_{\text{evidence}}}
$$

The algorithm is 'naive' because it assumes that the features of the data are independent given the class label. This assumption lets us simplify the likelihood term. Let us call the data features $x_1, \dots, x_i, \dots, x_n$ and the class label $y$, and rewrite Bayes' theorem in these terms:

$$
\mathbb P (y|x_1, \dots, x_n) = \frac{\mathbb P (x_1, \dots, x_n|y) \ \mathbb P (y)}{\mathbb P (x_1, \dots, x_n)} \, . 
$$

Then, the naive assumption of conditional independence between any two features given the class label can be expressed as

$$
\mathbb P(x_i | y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = \mathbb P (x_i | y) \, .
$$

Applying this assumption for all $i$, Bayes' theorem simplifies to:

$$
\mathbb P (y | x_1, \dots, x_n) = \frac{\mathbb P (y) \prod_{i=1}^n \mathbb P(x_i | y)}{\mathbb P (x_1, \dots, x_n)} \, .
$$

Since $\mathbb P (x_1, \dots, x_n)$ is constant for a given input (it does not depend on $y$), we can define the following proportional relationship

$$
\mathbb P (y|x_1, \dots, x_n) \propto \mathbb P (y) \prod_{i=1}^n \mathbb P(x_i | y) \, ,
$$

and can use it to classify any data point as

$$
\hat y = \underset{y}{\text{arg max}} \ \mathbb P (y) \prod_{i=1}^n \mathbb P(x_i | y) \, .
$$

**Note:** Naive Bayes can indeed be used for multiclass classification; however, we use it here as a binary classifier.
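
To make the decision rule concrete, here is a minimal sketch with two classes and two words. All probabilities are made-up values chosen for illustration, and the `toy_*` names are hypothetical rather than part of this notebook's code.

```python
import numpy as np

# Hypothetical toy numbers, chosen only for illustration
toy_classes = ["ham", "spam"]
toy_prior = np.array([0.8, 0.2])              # P(y)
# P(x_i | y) for two words in a message; rows are classes
toy_likelihood = np.array([[0.01, 0.05],      # P("free" | ham),  P("call" | ham)
                           [0.10, 0.05]])     # P("free" | spam), P("call" | spam)

# Unnormalised posterior: P(y) * prod_i P(x_i | y)
toy_numerators = toy_prior * toy_likelihood.prod(axis=1)
# Dividing by the evidence (the sum of the numerators) gives the posterior
toy_posterior = toy_numerators / toy_numerators.sum()
# The arg max is unaffected by the normalisation
toy_y_hat = toy_classes[int(np.argmax(toy_numerators))]
print(toy_posterior, toy_y_hat)
```

Even though the prior favours "ham", the much higher likelihood of the first word under "spam" tips the arg max towards "spam".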

<hr style="height:2px;border-width:0;color:gray;background-color:gray">

## Import Modules

In [None]:
# Importing standard packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import copy
from typing import Callable, List, Dict, Tuple
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.utils.validation import check_X_y, check_array
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

# Data Collection

In [None]:
# Import dataset
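# The file is not UTF-8 encoded, so an explicit encoding is passed (assumed latin-1)
df = pd.read_csv('../input/sms-spam-collection-dataset/spam.csv', encoding='latin-1')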
# Display dataframe
df

Unlike our other datasets in the other notebooks, the target variable is in the first column (and not in the last column) and so our data processing will be slightly different. If you want, you can just switch the columns and then use code from the other notebooks to proceed with data processing. 

# Data Processing

In [None]:
# Remove unnecessary columns
df = df.iloc[:,:2]

In [None]:
df

In [None]:
# To make dataset binary, change ham to 0 and spam to 1
df = df.replace("ham", 0)
df = df.replace("spam", 1)

In [None]:
# Check for Nulls
df.info()

In [None]:
# Check for NaNs 
df.isna().sum()

In [None]:
# Check for duplicates
df.duplicated().sum()

In [None]:
# Drop duplicates
df = df.drop_duplicates().reset_index(drop=True)

In [None]:
df

In [None]:
# Bar chart of class ratio 
target = df[df.columns[0]]
target_pd = pd.DataFrame(index=["Not Spam", "Spam"], columns=["Quantity", "Percentage"])
# Not spam (single .loc call avoids chained assignment)
target_pd.loc["Not Spam", "Quantity"] = (target == 0).sum()
target_pd.loc["Not Spam", "Percentage"] = (target == 0).mean() * 100
# Spam
target_pd.loc["Spam", "Quantity"] = (target == 1).sum()
target_pd.loc["Spam", "Percentage"] = (target == 1).mean() * 100
# Plot barchart
fig = plt.figure(figsize=(10, 5))
plt.bar(list(target_pd.index), target_pd["Quantity"].astype(int), color=["maroon", "blue"], width=0.4)
plt.ylabel("Number of messages")
plt.title("Distribution of spam and non-spam messages");
# Print the dataframe
target_pd

Again, like most of the other datasets we have been dealing with in other notebooks, we have a highly imbalanced dataset, with more "not spam" messages than "spam". We will continue with our models without using imbalanced classification techniques such as SMOTE or ADASYN.

## Splitting dataset

For most machine learning models, we would like low bias and low variance: the model should perform well on the training set (low bias) and also on the test set and other unseen data (low variance). Therefore, to assess the bias and variance of our model, we split the dataset into a training set and a test set. We will not be tuning any hyperparameters, and thus do not need a validation set.

Unlike the regression notebooks, there is no bias term/intercept coefficient here, so the $X$ dataset (of features) does not need a column of 1's as its first column. Feature scaling is also unnecessary: the features will be word counts derived from the raw messages, and Naive Bayes only uses their relative frequencies within each class.

In [None]:
# Create X (emails) and y (binary target) dataset
X = df[df.columns[-1]]
# Select the target as a 1-dimensional Series rather than a one-column DataFrame
y = df[df.columns[0]]

In [None]:
# Split the dataset into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/4, random_state=42, stratify=y)
# Re-index
X_train = X_train.reset_index(drop=True) 
y_train = y_train.reset_index(drop=True) 
X_test = X_test.reset_index(drop=True) 
y_test = y_test.reset_index(drop=True) 

# Vocabulary List

The first job is to create a vocabulary list of all the possible words that are contained in the dataset.

In [None]:
def vocab_list(X: pd.Series) -> List[str]:
    
    """ Returns every unique word in the dataset as a list of strings. """
    
    # Split every message on whitespace and flatten into a single list of tokens
    tokens = [token for message in X for token in message.split()]
    # Remove duplicates while preserving first-seen order
    vocabs = list(dict.fromkeys(tokens))
    return vocabs

# Word Count

Now that we have the vocabulary list, we can calculate the number of occurrences of each word in each message. We will use **one** built-in scikit-learn function for now (until we work out how to create our own).

In [None]:
def word_count(X: pd.Series, y: pd.Series, vocab_list: List, show_X: bool) -> Tuple[np.ndarray, np.ndarray, Callable]:
    
    """ Return word count of email dataset. """
    
    # Convert a collection of text documents to a matrix of token counts
    vectorizer = CountVectorizer(vocabulary=vocab_list)
    word_counts = vectorizer.fit_transform(X.tolist()).toarray()
    df = pd.DataFrame(word_counts, columns=vocab_list)
    # Function to transform new test data into word count matrix
    msg_tx_func = lambda x: vectorizer.transform(x).toarray()
    if show_X:
        display(df)
    return df.to_numpy(), np.array(y), msg_tx_func
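
As a possible starting point for replacing the scikit-learn helper, the sketch below builds a count matrix by hand. The name `count_matrix` is hypothetical, and it tokenises by plain whitespace splitting. Its output will generally differ from `CountVectorizer`, which by default lowercases text and keeps only tokens of two or more alphanumeric characters.

```python
import numpy as np
from typing import List

def count_matrix(messages: List[str], vocabs: List[str]) -> np.ndarray:
    """ Hypothetical helper: whitespace-token counts, one row per message. """
    # Map each vocabulary word to its column index
    col = {word: j for j, word in enumerate(vocabs)}
    counts = np.zeros((len(messages), len(vocabs)), dtype=int)
    for i, message in enumerate(messages):
        for token in message.split():
            # Tokens outside the vocabulary are ignored
            if token in col:
                counts[i, col[token]] += 1
    return counts

# Toy usage with made-up messages
toy_messages = ["win cash now", "call me now now"]
toy_vocab = ["win", "cash", "now", "call", "me"]
toy_counts = count_matrix(toy_messages, toy_vocab)
print(toy_counts)
```

Each row corresponds to one message and each column to one vocabulary word, which is the same layout that the functions in the following sections expect.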

# Classification Prior & Likelihood

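Next, we train the Naive Bayes classifier, which means estimating the prior and the likelihood from the training set. The prior $\mathbb P(y)$ captures what we know about the classes before seeing a message, and is estimated as the fraction of training messages in each class. The likelihood $\mathbb P(x_i | y)$ of each word is estimated as its share of all word occurrences within that class.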

In [None]:
def prior_lh(X: np.ndarray, y: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    
    """ Use training data for Naive Bayes classifier. """

    n = X.shape[0]
    # Flatten the labels so they can be used as a boolean row mask
    labels = np.ravel(y)
    # Group the rows of X by class i.e. X_by_class[0] = non-spam and X_by_class[1] = spam
    # (kept as a list because the classes have different numbers of rows)
    X_by_class = [X[labels == c] for c in np.unique(labels)]
    # Define prior
    prior = np.array([len(X_class) / n for X_class in X_by_class])
    # Count words in each class
    word_counts = np.array([X_class.sum(axis=0) for X_class in X_by_class])
    # Define likelihood
    lh_word = word_counts / word_counts.sum(axis=1).reshape(-1, 1)
    return prior, lh_word
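
To make the two estimates concrete, here is a small sketch on a made-up count matrix with four messages and three vocabulary words. The `*_toy` names and values are illustrative assumptions, not taken from the dataset.

```python
import numpy as np

# Toy word-count matrix: 4 messages x 3 vocabulary words (made-up values)
X_toy = np.array([[2, 0, 1],
                  [1, 1, 0],
                  [0, 3, 1],
                  [0, 1, 1]])
y_toy = np.array([0, 0, 1, 0])

classes_toy = np.unique(y_toy)
# Prior: share of messages that belong to each class
prior_toy = np.array([(y_toy == c).mean() for c in classes_toy])
# Total occurrences of each word within each class
word_totals_toy = np.array([X_toy[y_toy == c].sum(axis=0) for c in classes_toy])
# Likelihood: each word's share of all word occurrences in its class
lh_toy = word_totals_toy / word_totals_toy.sum(axis=1, keepdims=True)
print(prior_toy)
print(lh_toy)
```

Here the prior is `[0.75, 0.25]`, and each row of the likelihood matrix sums to 1. The first word never occurs in class 1, so its estimated likelihood there is exactly 0.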

# Classification Posterior

Finally, we compute the remaining component of Bayes' theorem, the posterior, by multiplying each message's likelihood by the prior and normalising across classes.

In [None]:
def posterior(X: np.ndarray, prior: np.ndarray, lh_word: np.ndarray) -> np.ndarray:
    
    """ Predict probability of class. """
    
    # Loop over each observation to calculate conditional probabilities
    class_numerators = np.zeros(shape=(X.shape[0], prior.shape[0]))
    for i, x in enumerate(X):
        # Boolean mask of the words that appear in this email
        word_exists = x.astype(bool)
        # Likelihood of each present word in each class, raised to its count in this email
        lh_words_present = lh_word[:, word_exists] ** x[word_exists]
        # Compute likelihood of entire message with likelihoods of words
        lh_message = (lh_words_present).prod(axis=1)
        # Combine likelihood and prior to numerator
        class_numerators[i] = lh_message * prior
    normalize_term = class_numerators.sum(axis=1).reshape(-1, 1)
    posteriors = class_numerators / normalize_term
    # Use the absolute deviation so that row sums below 1 are also caught
    if not (np.abs(posteriors.sum(axis=1) - 1) < 1e-5).all():
        raise ValueError('Rows should sum to 1')
    return posteriors
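
Multiplying many small word likelihoods can underflow to zero for long messages. A common workaround is to sum logarithms instead. The sketch below is a hypothetical log-space variant (`log_posterior` is not part of this notebook), and it assumes every entry of `lh_word` is strictly positive, for example after smoothing.

```python
import numpy as np

def log_posterior(X: np.ndarray, prior: np.ndarray, lh_word: np.ndarray) -> np.ndarray:
    """ Hypothetical log-space posterior; assumes all entries of lh_word are > 0. """
    # log P(y) + sum_i count_i * log P(x_i | y), for all messages at once
    log_num = X @ np.log(lh_word).T + np.log(prior)
    # Subtract the row maximum before exponentiating (log-sum-exp trick)
    log_num -= log_num.max(axis=1, keepdims=True)
    num = np.exp(log_num)
    # Normalise so that each row sums to 1
    return num / num.sum(axis=1, keepdims=True)

# Toy usage: counts this large would underflow in a direct product
X_long = np.array([[2000, 1000]])
prior_demo = np.array([0.5, 0.5])
lh_demo = np.array([[0.6, 0.4],
                    [0.4, 0.6]])
post_demo = log_posterior(X_long, prior_demo, lh_demo)
print(post_demo)
```

Because only the differences between the per-class log scores matter after normalisation, subtracting the row maximum leaves the posterior unchanged while keeping the exponentials in a representable range.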

# Classification Prediction

Putting all of this together, we can now classify each data point by assigning it to the class with the highest posterior probability. Here, we evaluate the Naive Bayes classifier on the same emails it was trained on, although evaluation should normally happen on unseen emails.

In [None]:
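def predict(X: np.ndarray, prior: np.ndarray, lh_word: np.ndarray) -> np.ndarray:
    
    """ Predict class with highest probability. """
    
    # posterior needs the prior and the word likelihoods, not the labels
    y_pred = posterior(X, prior, lh_word).argmax(axis=1)
    return y_pred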

# Full Naive Bayes Classifier Model

In [None]:
class NaiveBayesClassifier():
    
    def __init__(self):
    
        """ Initialise parameters. """
       
        # Attribute names match those assigned in fit
        self._vocabs = None
        self._word_matrix = None
        self._prior = None
        self._lh = None
        self._posteriors = None
        
    def fit(self, X: pd.Series, y: pd.Series) -> None:
    
        """ Fit Naive Bayes model. """

        # Allocate initialised parameters
        self._vocabs = self.vocab_list(X)
        self._word_matrix = self.word_count(X, y, self._vocabs, show_X=False)[0]
        # Compute prior and likelihood in a single call
        self._prior, self._lh = self.prior_lh(self._word_matrix, y)
        self._posteriors = self.posterior(self._word_matrix, y, self._prior, self._lh)
    
    def predict(self) -> np.array:
    
        """ Predict class with highest probability. """

        y_pred = self._posteriors.argmax(axis=1)
        return y_pred
    
        
    def vocab_list(self, X: pd.Series) -> List[str]:
    
        """ Returns every unique word in the dataset as a list of strings. """

        # Split every message on whitespace and flatten into a single list of tokens
        tokens = [token for message in X for token in message.split()]
        # Remove duplicates while preserving first-seen order
        vocabs = list(dict.fromkeys(tokens))
        return vocabs

    def word_count(self, X: pd.Series, y: pd.Series, vocab_list: List, show_X: bool) -> Tuple[np.ndarray, np.ndarray, Callable]:
    
        """ Return word count of email dataset. """

        # Convert a collection of text documents to a matrix of token counts
        vectorizer = CountVectorizer(vocabulary=vocab_list)
        word_counts = vectorizer.fit_transform(X.tolist()).toarray()
        df = pd.DataFrame(word_counts, columns=vocab_list)
        # Function to transform new test data into word count matrix
        msg_tx_func = lambda x: vectorizer.transform(x).toarray()
        if show_X:
            display(df)
        return df.to_numpy(), np.array(y), msg_tx_func
    
    def prior_lh(self, word_matrix: np.ndarray, y: pd.Series) -> Tuple[np.ndarray, np.ndarray]:

        """ Use training data for Naive Bayes classifier. """
        
        # Use the matrix passed in, not the global X, so the priors sum to 1
        n = word_matrix.shape[0]
        # Flatten the labels so they can be used as a boolean row mask
        labels = np.ravel(y)
        # Group the rows by class i.e. X_by_class[0] = non-spam and X_by_class[1] = spam
        X_by_class = [word_matrix[labels == 0], word_matrix[labels == 1]]
        # Define prior
        prior = np.array([len(X_class) / n for X_class in X_by_class])
        # Count words in each class
        word_counts = np.array([X_class.sum(axis=0) for X_class in X_by_class])
        # Define likelihood
        lh_word = word_counts / word_counts.sum(axis=1).reshape(-1, 1)
        return prior, lh_word
    
    def posterior(self, word_matrix: np.ndarray, y: pd.Series, prior: np.ndarray, lh_word: np.ndarray) -> np.ndarray:
    
        """ Predict probability of class (y is unused here). """

        # Loop over each observation to calculate conditional probabilities
        class_numerators = np.zeros(shape=(word_matrix.shape[0], prior.shape[0]))
        for i, x in enumerate(word_matrix):
            # Boolean mask of the words that appear in this email
            word_exists = x.astype(bool)
            # Likelihood of each present word in each class, raised to its count in this email
            lh_words_present = lh_word[:, word_exists] ** x[word_exists]
            # Compute likelihood of entire message with likelihoods of words
            lh_message = (lh_words_present).prod(axis=1)
            # Combine likelihood and prior to numerator
            class_numerators[i] = lh_message * prior
        normalize_term = class_numerators.sum(axis=1).reshape(-1, 1)
        posteriors = class_numerators / normalize_term
        # Use the absolute deviation so that row sums below 1 are also caught
        if not (np.abs(posteriors.sum(axis=1) - 1) < 1e-3).all():
            raise ValueError('Rows should sum to 1')
        return posteriors

# Model Testing and Results

In [None]:
# Instantiate training model
spam_model_train = NaiveBayesClassifier()
# Fit model to training dataset to obtain Bayes Theorem components
spam_model_train.fit(X_train, y_train)
# Return training predictions
y_pred_train = spam_model_train.predict()

In [None]:
# Print train confusion matrix
cm_train = confusion_matrix(y_train, y_pred_train)
ax = sns.heatmap(cm_train, annot=True, fmt=".1f")
ax.set(xlabel="Predicted Labels", ylabel="True Labels");

In [None]:
# Print train metric report
pd.DataFrame(classification_report(y_train, y_pred_train, output_dict=True))

# Summary

- It is easy and fast to predict the class of a test data set. 
- When the independence assumption holds, a Naive Bayes classifier can perform well compared to other models such as logistic regression, and it needs less training data.
- A limitation of Naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are completely independent.
- Another limitation is that if a word in the test data set never appeared in the training set for a given class, its likelihood is estimated as 0, which forces the whole product for that class to 0 (which is not accurate).
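
One standard remedy for the zero-probability limitation is additive (Laplace) smoothing, which adds a pseudo-count to every word in every class. This notebook does not use it; the sketch below is a hypothetical helper (`smoothed_likelihood`, with an assumed default of `alpha = 1.0`) applied to made-up counts.

```python
import numpy as np

def smoothed_likelihood(word_counts: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """ Hypothetical helper: per-class word likelihoods with additive smoothing. """
    vocab_size = word_counts.shape[1]
    # Add alpha pseudo-counts to every word, and alpha * vocab_size to each class total
    return (word_counts + alpha) / (word_counts.sum(axis=1, keepdims=True) + alpha * vocab_size)

# Toy per-class word totals (rows = classes, columns = words); made-up values
toy_class_counts = np.array([[3, 2, 2],
                             [0, 3, 1]])
lh_smooth = smoothed_likelihood(toy_class_counts)
print(lh_smooth)
```

The word that never occurred in the second class now gets a small non-zero likelihood (1/7 here), and every row still sums to 1.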

<hr style="height:2px;border-width:0;color:gray;background-color:gray">

## Extra

Some comments about the code implementations:

1. If dealing with arrays rather than dataframes, some of the functions may need altering to account for dimension/shape issues e.g. the _prior lh_ and _posterior_ functions.  
2. To debug this, it is important to print out the _word count_ and _X by class_ so you can check if you are obtaining the correct outputs. 

<hr style="height:2px;border-width:0;color:gray;background-color:gray">

Thanks for reading this notebook. If there are any mistakes or things that need more clarity, feel free to respond in the comment section and I will be happy to reply.

As always, please leave an upvote - it would also be helpful if you cite this documentation if you are going to use any of the code. 😊

#CodeWithSid