# ML Deep Dive: Naive Bayes

In this tutorial I'll do a deep dive of Naive Bayes, a classical classification algorithm. Naive Bayes is very simple to understand (once you understand Bayes Rule), and has historically worked well on many classification problems, especially text classification. Naive Bayes is perhaps the most important supervised learning algorithm in NLP, or at least was before deep learning became competitive.

## Introduction to Probability

### Motivation

Perhaps more than other ML algorithms, understanding Naive Bayes requires some understanding of basic probability theory. Since probability theory isn't a prerequisite for these tutorials, I'll now give a brief intro to probability theory with the goal of understanding Bayes Rule, which is the core formula used in Naive Bayes.

Probability theory is essentially a calculus for modeling random quantities, a way of taking into account the fact that real world data is usually noisy. For practical purposes, you can think of randomness as something you can't predict because you don't have enough information. If you could understand and model your problem perfectly, taking all variables into account with perfect precision, then you wouldn't need to model randomness because there wouldn't be any. Thus, randomness is an expression of ignorance (aka *entropy*). The more you know and can account for, the less random your data is. Since it's usually impossible to know everything about your problem, randomness is something you have to deal with in real life, which is what probability (and statistics) is for.

### Basics
In probability lingo, a random quantity is called a *random variable*. A random variable can be either a number or a collection of numbers (you can have random vectors, random matrices, random tensors, random functions, etc). Similar to how you can think of a data point as some variable taking on a value, you can think of a (random) data point as some random variable taking on a value. The difference with random variables is you can't guarantee it'll take on the value you want it to. You can imagine some process you can't see adding some noise to the value you want, and giving you that noisy value instead. Instead of getting `x = 12`, you might get `x = 12 + error`, where `error` is something you can't predict exactly.

Rather than random variables taking on a *value*, they take on a *distribution* of values. In probability theory, it's these distributions that get all the attention. If you can understand what distribution your data came from, you in some sense understand the most about your data you can possibly know (given the variables you have).

Probabilities are what determine which values in a distribution a random variable is likely to take on. Values with a higher probability are more likely to occur than values with lower probability. In statistics lingo we refer to "taking on a value" as *sampling* from the distribution. Thus, a *random sample* is just a set of values taken from some distribution.

In its most basic form, you can think of a probability distribution as a set of values, each with a probability attached, where the probabilities all sum to one. Any assignment of non-negative numbers that sum to one defines a probability distribution over that set of values. Here the "values" are all the possible values the random variable can take on, so this doesn't mean that the probabilities of the values in the data you happen to have must sum to one; that is only true if your data contains *every* possible value, each counted once.

In math lingo, a probability defined over a collection of values $x$ is a function $p(x)$ whose values are never negative and whose sum over all values in the collection is one. That is, suppose $x$ is a random variable that can take on $n$ values $\{x_1, x_2, \cdots, x_n\}$. Then

$$p(x) \geq 0 \quad \text{for all } x \text{ in } \{x_1, x_2, \cdots, x_n\},$$
$$\sum_{i=1}^n p(x_i) = p(x_1) + p(x_2) + \cdots + p(x_n) = 1.$$

We say $p(x)$ is the "probability that $x$ will occur". One can *define* then the distribution of values $\{x_1, x_2, \cdots, x_n\}$ by the function $p(x)$ itself, or equivalently by the set of values $\{p(x_1), p(x_2), \cdots, p(x_n)\}$. 

We can also define the probability that $x$ takes on either of two distinct values $x_1, x_2$ by $p(\{x_1, x_2\}) = p(x_1) + p(x_2)$. This definition extends readily to larger collections of distinct values as well. You just sum their probabilities.
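
To make this concrete, here's a minimal sketch of a discrete distribution stored as a Python dictionary. The fair six-sided die and the helper name `prob_of_set` are my own illustrative assumptions and aren't used elsewhere in the tutorial.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die, p(x) = 1/6 for each face.
p = {face: Fraction(1, 6) for face in range(1, 7)}

# Both defining properties hold: no negative values, and they sum to one.
assert all(prob >= 0 for prob in p.values())
assert sum(p.values()) == 1

def prob_of_set(p, values):
    """Probability that x lands in `values`: sum the individual probabilities."""
    return sum(p[v] for v in set(values))

print(prob_of_set(p, {1, 2}))      # 1/3
print(prob_of_set(p, {2, 4, 6}))   # 1/2
```

I used exact fractions here only so the sum comes out to exactly one; with floats you'd compare against one up to a small tolerance.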

### Joint Distributions and Conditional Distributions

Just like we can talk about functions of multiple variables, like $z=f(x,y)$, we can talk about probabilities of multiple variables as well, like $p(x,y)$ (often called the *joint distribution* of $x$ and $y$). The two most important concepts to take away from joint probabilities are the concepts of *independence* and *conditional probability*. 

Informally, two random variables are independent when knowing the value of one tells you nothing about the distribution of the other; you can essentially think of them as two completely separate (i.e. independent) processes. In math terms, this just means the joint distribution factors, i.e. $p(x,y)=p(x)p(y)$. (Subtle point: Each $p$ has a different meaning in this formula. Writing $p(x,y)$ means "the distribution $x,y$ are jointly sampled from in pairs". Writing $p(x)$ means "the distribution that $x$ alone is sampled from if we do so ignoring $y$". And writing $p(y)$ means "the distribution that $y$ alone is sampled from if we do so ignoring $x$". This abuse of notation is done all the time in probability. You can usually tell which $p$ is which by looking at what random variable it's a function of.)
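
Here's a small sketch of what "the joint distribution factors" means numerically. The two joint tables and the function name `is_independent` are made up for illustration; the only extra fact used is that the distribution of $x$ alone is obtained from the joint table by summing over $y$ (and vice versa).

```python
import numpy as np

# Hypothetical distributions for two binary variables x and y on their own.
p_x = np.array([0.3, 0.7])
p_y = np.array([0.6, 0.4])

# If x and y are independent, the joint table is the outer product p(x) p(y).
p_xy_indep = np.outer(p_x, p_y)

# A different joint table with the same p(x) and p(y), where x and y tend to agree.
p_xy_dep = np.array([[0.25, 0.05],
                     [0.35, 0.35]])

def is_independent(p_xy):
    """Check whether p(x, y) == p(x) p(y) for every pair (x, y)."""
    p_x = p_xy.sum(axis=1)  # p(x): sum the joint over y (rows index x)
    p_y = p_xy.sum(axis=0)  # p(y): sum the joint over x (columns index y)
    return bool(np.allclose(p_xy, np.outer(p_x, p_y)))

print(is_independent(p_xy_indep))  # True
print(is_independent(p_xy_dep))    # False
```

Both tables give the same $p(x)$ and $p(y)$, so you can't tell whether two variables are independent from their individual distributions alone; you need the joint.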

Informally, conditional probability is a measure of how non-independent (i.e. dependent) two random variables are. If $x$ and $y$ are dependent, knowing what $x$ is before sampling $y$ can completely change what the distribution of $y$ ends up being. (For ordinary, non-random variables, this just says that $y$ would be a function of $x$, $y=y(x)$. Dependence is just an extension of that concept to random variables.) We measure the dependence of $y$ on $x$ by defining the conditional probability distribution $p(y|x)$, which reads as "the probability of sampling a value $y$ given we know the value of $x$ already", or in short "the probability of $y$ given $x$". It's defined mathematically (whenever $p(x) > 0$) by

$$p(y|x) = \frac{p(x,y)}{p(x)}.$$

Notice $p(y|x)=p(y)$ exactly when $x$ and $y$ are independent (substitute $p(x,y)=p(x)p(y)$ into the definition). Otherwise, knowing something about $x$ changes what distribution $y$ is sampled from, since $p(y|x)$ is its own distribution, in general different from $p(y)$. We often express the sampling of $y$ given $x$ as a "new" random variable $y|x$.

It should hopefully be clear that we can just as easily define a distribution $p(x|y)$ as well, the probability of $x$ given $y$. Just swap $x$ and $y$ in the above. Note that, just as knowing something about $x$ can affect what $y$ is, knowing something about $y$ can affect what $x$ is. However, and this is very important, note that in general $p(x|y) \neq p(y|x)$! You wouldn't believe how many people make this mistake when applying it to real life problems, even smart people.
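
A quick numerical sketch of that last point, using a made-up joint table for two binary variables. The numbers are assumptions chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical joint table p(x, y): rows index x in {0, 1}, columns index y in {0, 1}.
p_xy = np.array([[0.1, 0.2],
                 [0.3, 0.4]])

p_x = p_xy.sum(axis=1)  # p(x), roughly [0.3, 0.7]
p_y = p_xy.sum(axis=0)  # p(y), roughly [0.4, 0.6]

# p(y|x) = p(x,y) / p(x): divide each row by its row total.
p_y_given_x = p_xy / p_x[:, None]
# p(x|y) = p(x,y) / p(y): divide each column by its column total.
p_x_given_y = p_xy / p_y[None, :]

# Same cell (x=0, y=1), two different conditional probabilities.
print(p_y_given_x[0, 1])  # p(y=1 | x=0) = 0.2 / 0.3
print(p_x_given_y[0, 1])  # p(x=0 | y=1) = 0.2 / 0.6
```

Each row of `p_y_given_x` sums to one (it's a distribution over $y$ for a fixed $x$), while each column of `p_x_given_y` sums to one. They're different tables answering different questions.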

### Bayes Rule

Conditional probability is what allows us to define the important formula for Naive Bayes: Bayes Rule. Bayes Rule is a way of relating the two conditional distributions $p(y|x)$ and $p(x|y)$. I just said these two are in general not equal to each other. How do we relate them then? The trick is to use the fact that 

$$p(x,y)=p(y|x)p(x)=p(x|y)p(y)$$

from the definition of conditional probability above. Solving the second equality for $p(y|x)$ we arrive at

$$p(y|x) = \frac{p(x|y)p(y)}{p(x)}. \quad \quad \text{(Bayes Rule)}$$

Note in Bayesian lingo $p(y)$ in this formula is often called the *prior* distribution and $p(y|x)$ the *posterior* distribution. This comes from thinking of Bayes Rule as an update rule. Imagine you want to know something about $y$. You start with a "prior belief" about what $y$ might be, a guess. You then collect some measurements $x$, and would like to figure out how knowing something about $x$ changes your belief about what $y$ is, i.e. what $y|x$ is. You can thus think of the posterior $p(y|x)$ as your updated belief about $y$ given that you've seen $x$. The distribution $p(x|y)$ is often called the *likelihood*, which is the distribution you assume $x$ is being sampled from (i.e. it's your model of $x$ that you're trying to fit to your input data).
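
Here's a minimal worked example of that update. Every number below is invented for illustration (they're not estimated from the SMS dataset): $y$ is whether a message is spam, and $x$ is the event that the message contains the word "free". The denominator $p(x)$ is obtained by summing $p(x|y)p(y)$ over the possible values of $y$, which follows from $p(x,y)=p(x|y)p(y)$.

```python
# Hypothetical numbers, chosen only for illustration.
prior = {"spam": 0.2, "ham": 0.8}          # p(y)
likelihood = {"spam": 0.30, "ham": 0.02}   # p(x = "contains 'free'" | y)

# p(x): sum p(x|y) p(y) over every value of y.
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Bayes Rule: p(y|x) = p(x|y) p(y) / p(x).
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}

print(posterior)
```

With these made-up numbers, seeing the word moves the belief that the message is spam from the prior of 0.2 to a posterior of roughly 0.79.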

### Probabilities as Frequencies

There's another way to interpret probabilities that will be useful to us as well in the derivation of Naive Bayes, namely that of a frequency. Suppose you have some data $D=\{x_1,x_2,\cdots,x_n\}$, containing $n$ data points $x_i$. Suppose each of your data points can take on any of the values $v_1,v_2,\cdots,v_k$ in some set. Let $n_j$ be the number of times the value $v_j$ occurs in your dataset $D$. Then we can (approximately) define the probability $p(x=v_j)$ by

$$p(x=v_j) \approx \frac{n_j}{n} = \frac{\text{# times } v_j \text{ occurs in the dataset}}{\text{# of samples in the dataset}}.$$

This says that, practically speaking, you can think of the probability of some value occurring as the ratio of the number of times it did happen to the number of times it could have happened in your data. Note in practice, you rarely know the *true* distribution your data was sampled from, and so you often have to approximate it using this formula. Also, note it must be true that $p(v_1)+\cdots+p(v_k)=1$. Check this!
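
The frequency formula is a one-liner with `collections.Counter`. The toy dataset `D` below is made up; the names `counts` and `p_hat` are just my labels for $n_j$ and the estimated probabilities.

```python
from collections import Counter

# Hypothetical dataset D with n = 10 data points and k = 3 distinct values.
D = ["a", "b", "a", "c", "a", "b", "a", "c", "b", "a"]

counts = Counter(D)   # n_j: how many times each value v_j occurs
n = len(D)

# Empirical estimate p(x = v_j) ~ n_j / n.
p_hat = {v: n_j / n for v, n_j in counts.items()}

print(p_hat)
# The estimates sum to one because the counts n_j sum to n.
assert abs(sum(p_hat.values()) - 1.0) < 1e-12
```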

### Summary

To aid your memory about what these are, here's a rough conversion dictionary between "regular" functions you might be used to and probabilities.

|   | Functions | Probabilities |
|--------|--------|--------|
| Values  | $x$ exact  | $x \pm \text{error}$  |
| Single Variable  | $f(x)$  | $p(x)$  |
| Multiple Variable  | $f(x,y)$  | $p(x,y)$  |
| Independence  | $f(x,y)=f(x)f(y)$  | $p(x,y)=p(x)p(y)$  |
| Dependence of $y$ on $x$  |  $y=f(x)$ | $p(y \mid x)= \frac{p(x,y)}{p(x)}$  |
| Dependence of $x$ on $y$ | $x=f(y)$ | $p(x \mid y) = \frac{p(x,y)}{p(y)}$ |

## Naive Bayes From Scratch

### Imports and Data

Our goal will be to describe the Naive Bayes algorithm, implement it from scratch, and test it on an example dataset. Since Naive Bayes is perhaps most popular in the NLP community, the dataset we will use is the same SMS Spam Dataset from the Text Classification tutorial. Since this tutorial isn't about text, we will just copy and use the same functions from that tutorial to load and process the data into a numerical format. Our goal here is to focus on the classification piece, not the preprocessing. See the other tutorial if you're curious about the preprocessing details.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path
from collections import Counter, defaultdict

# Text Processing Packages
import re
import string
import spacy
import nltk
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
nltk.download('stopwords')

# Sklearn Packages (mainly for utility functions)
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score
from imblearn.under_sampling import RandomUnderSampler

np.random.seed(42)



In [2]:
def get_text(text_file):
    df = pd.read_csv(text_file, encoding='ISO-8859-1')
    df = df[['v1', 'v2']]
    df.columns = ['labels', 'text_raw']
    df.labels = df.labels.replace('ham', 0)
    df.labels = df.labels.replace('spam', 1)
    df = df.sample(frac=1).reset_index(drop=True)
    return df

repo_path = Path.home() / 'Repos'  # insert path to the repo here
text_file = repo_path / 'ml_tutorials/resources/spam.csv'
df = get_text(text_file)
df.head(10)

|   | labels | text_raw |
|--------|--------|--------|
| 0 | 0 | Funny fact Nobody teaches volcanoes 2 erupt, t... |
| 1 | 0 | I sent my scores to sophas and i had to do sec... |
| 2 | 1 | We know someone who you know that fancies you.... |
| 3 | 0 | Only if you promise your getting out as SOON a... |
| 4 | 1 | Congratulations ur awarded either å£500 of CD ... |
| 5 | 0 | I'll text carlos and let you know, hang on |
| 6 | 0 | K.i did't see you.:)k:)where are you now? |
| 7 | 0 | No message..no responce..what happend? |
| 8 | 0 | Get down in gandhipuram and walk to cross cut ... |
| 9 | 0 | You flippin your shit yet? |


In [3]:
df.labels.value_counts()

0    4825
1     747
Name: labels, dtype: int64
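
Reading these counts through the frequency interpretation from earlier gives a rough estimate of how likely each label is before looking at any text. A minimal sketch, with the two counts copied by hand from the output above (the variable names are my own):

```python
# Class counts copied from the value_counts() output above.
counts = {0: 4825, 1: 747}   # 0 = ham, 1 = spam
n = sum(counts.values())     # 5572 messages in total

# Frequency estimate of p(label) = count / n.
class_freq = {label: c / n for label, c in counts.items()}
print(class_freq)
```

Only about 13% of the messages are spam, so the classes are quite imbalanced: a model that always predicts ham would already be right about 87% of the time, which is worth keeping in mind when reading accuracy numbers.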

In [4]:
def sub_special_tokens(text):
    # note: many of these regexes are borrowed from Stack Overflow
    # convert simple URLs to xxurl token (e.g. www.google.com, http://google.com -> xxurl)
    text = re.sub(r' www\.', ' http://www.', text)
    text = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', ' xxurl ', text)
    # convert (British) phone numbers to xxphone token (e.g. 09058097218 -> xxphone)
    pat = r'\d{3}[-\.\s]??\d{4}[-\.\s]??\d{4}|\d{5}[-\.\s]??\d{3}[-\.\s]??\d{3}|(?:\d{4}\)?[\s-]?\d{3}[\s-]?\d{4})'
    text = re.sub(pat, ' xxphone ', text)
    # replace monetary values with xxmon token
    text = text.replace('£','$ ')
    text = re.sub(r'(\d+)[ ]{0,1}p', r'$ 0.\1', text)  # raw string so \1 is a backreference (e.g. 50p -> $ 0.50)
    text = re.sub(r'\$[ ]*(\d+[,\.])*\d+', ' xxmon ', text)
    # put xxup token before words in all caps (easy way to recognize info from capitalizing a word)
    text = re.sub(r'(\b[A-Z][A-Z0-9]*\b)', r' xxup \1 ', text)
    # put xxcap token before words with capitalized first letter (easy way to recognize first word in a sentence)
    text = re.sub(r'(\b[A-Z][a-z0-9]+\b)', r' xxcap \1 ', text)
    # convert some common text "emojis" to xxemoji: ;), :), :(, :-(, etc
    text = re.sub(r'[:;][ ]*[-]*[ ]*[()]', ' xxemoji ', text)
    return text

def normalize_text(text):
    # converts common patterns into special tokens
    text = sub_special_tokens(text)
    # convert text to lowercase
    text = text.lower()
    # strip out any lingering html tags
    text = re.sub(r'<[^>]*>', '', text)
    # convert some common abbreviations to regular words
    text = text.replace('&',' and ')
    text = re.sub(r'\bu\b', ' you ', text)
    text = re.sub(r'\bur\b', ' your ', text)
    text = re.sub(r'\b2\b', ' to ', text)
    text = re.sub(r'\b4\b', ' for ', text)
    # put spaces between punctuation (eg: 9.Blah -> 9 . Blah)
    puncts = r'[' + re.escape(string.punctuation) + r']'
    text = re.sub('(?<! )(?=' + puncts + ')|(?<=' + puncts + ')(?! )', r' ', text)
    # strip non-ascii characters (easy way to denoise text a bit)
    text = text.encode("ascii", errors="ignore").decode()
    # remove all punctuation except ?
    text = re.sub(r"[^\w\s?]",' xxpunct ',text)
    # convert all other numbers to xxnum token (e.g. 123, 1.2.3, 1-2-3 -> xxnum)
    text = re.sub(r'\b([.-]*[0-9]+[.-]*)+\b', ' xxnum ', text)
    # remove nltk's common set of stop words (common for classical NLP analysis)
    stop_words = set(stopwords.words('english'))  # a set makes the membership test fast
    text = ' '.join(word for word in text.split() if word not in stop_words)
    # stem words using nltk snowball stemmer, e.g. converts {run, running, runs} all to "run"
    stemmer = SnowballStemmer('english')
    text = ' '.join(stemmer.stem(word) for word in text.split())
    # sub the occurrence of 2 or more spaces with a single space
    text = re.sub(r'[ ]{2,}',' ',text)
    return text

df['text'] = df['text_raw'].apply(normalize_text)
df.head(10)

|   | labels | text_raw | text |
|--------|--------|--------|--------|
| 0 | 0 | Funny fact Nobody teaches volcanoes 2 erupt, t... | xxcap funni fact xxcap nobodi teach volcano er... |
| 1 | 0 | I sent my scores to sophas and i had to do sec... | xxup sent score sopha secondari applic school ... |
| 2 | 1 | We know someone who you know that fancies you.... | xxcap know someon know fanci xxpunct xxcap cal... |
| 3 | 0 | Only if you promise your getting out as SOON a... | xxcap promis get xxup soon xxpunct xxcap xxpun... |
| 4 | 1 | Congratulations ur awarded either å£500 of CD ... | xxcap congratul award either xxmon xxup cd gif... |
| 5 | 0 | I'll text carlos and let you know, hang on | xxup xxpunct text carlo let know xxpunct hang |
| 6 | 0 | K.i did't see you.:)k:)where are you now? | xxup k xxpunct xxpunct see xxpunct xxemoji k x... |
| 7 | 0 | No message..no responce..what happend? | xxcap messag xxpunct xxpunct responc xxpunct x... |
| 8 | 0 | Get down in gandhipuram and walk to cross cut ... | xxcap get gandhipuram walk cross cut road xxpu... |
| 9 | 0 | You flippin your shit yet? | xxcap flippin shit yet ? |
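
If the `xxup` and `xxcap` markers in the output look mysterious, here's a tiny standalone sketch that applies just those two substitutions from `sub_special_tokens` to a made-up message. It skips all the other steps (lowercasing, stop words, stemming), so it only shows where the markers come from.

```python
import re

text = "Call NOW to win"  # made-up example message

# Same two patterns as in sub_special_tokens above.
text = re.sub(r'(\b[A-Z][A-Z0-9]*\b)', r' xxup \1 ', text)   # words in all caps
text = re.sub(r'(\b[A-Z][a-z0-9]+\b)', r' xxcap \1 ', text)  # words with a capitalized first letter

print(text.split())
# ['xxcap', 'Call', 'xxup', 'NOW', 'to', 'win']
```

The markers are inserted before lowercasing, so the information carried by capitalization survives as its own token even after `text.lower()` is applied.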


In [5]:
text = df['text'].tolist()
tokens = [[word for word in doc.split(' ') if len(word) > 0] for doc in text]
labels = df['labels'].tolist()

### Simple Naive Bayes Algorithm