# ML Deep Dive: Naive Bayes

In this tutorial I'll do a deep dive into Naive Bayes, a classical classification algorithm. Naive Bayes is very simple to understand (once you understand Bayes Rule), and has historically worked well on many classification problems, especially text classification. Naive Bayes is perhaps the most important supervised learning algorithm in NLP, or at least it was before deep learning became competitive.

Perhaps more than other ML algorithms, understanding Naive Bayes requires some understanding of basic probability theory. Since probability theory isn't a prerequisite for these tutorials, I'll now give a brief intro to probability theory with the goal of understanding Bayes Rule, which is the core formula used in Naive Bayes. If you already have a basic understanding of probability feel free to skip the next section.

## Introduction to Probability

### Motivation

Probability theory is essentially a calculus for modeling random quantities, a way of taking into account the fact that real world data is usually noisy. For practical purposes, you can think of randomness as something you can't predict because you don't have enough information. If you could understand and model your problem perfectly, taking all variables into account with perfect precision, then you wouldn't need to model randomness because there wouldn't be any. Thus, randomness is an expression of ignorance (aka *entropy*). The more you know and can account for, the less random your data is. Since it's usually impossible to know everything about your problem, randomness is something you have to deal with in real life, which is what probability (and statistics) is for.

### Basics
In probability lingo, a random quantity is called a *random variable*. A random variable can be either a number or a collection of numbers (you can have random vectors, random matrices, random tensors, random functions, etc). Similar to how you can think of a data point as some variable taking on a value, you can think of a (random) data point as some random variable taking on a value. The difference with random variables is you can't guarantee it'll take on the value you want it to. You can imagine some process you can't see adding some noise to the value you want, and giving you that noisy value instead. Instead of getting `x = 12`, you might get `x = 12 + error`, where `error` is something you can't predict exactly.

Rather than random variables taking on a *value*, they take on a *distribution* of values. In probability theory, it's these distributions that get all the attention. If you can understand what distribution your data came from, you in some sense understand the most about your data you can possibly know (given the variables you have).

Probabilities are what determine which values in a distribution a random variable is likely to take on. Values with a higher probability are more likely to occur than values with lower probability. In statistics lingo we refer to "taking on a value" as *sampling* from the distribution. Thus, a *random sample* is just a set of values taken from some distribution.
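
To make the ideas of noise and sampling concrete, here is a minimal sketch using NumPy. The specific distributions, the seed, and the names `noisy_x`, `probs`, and `freqs` are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "x = 12 + error": here the error is assumed to be normal with standard deviation 0.5.
noisy_x = 12 + rng.normal(loc=0.0, scale=0.5, size=5)

# A random sample of 10,000 values from a made-up distribution over {0, 1, 2}.
values = np.array([0, 1, 2])
probs = np.array([0.7, 0.2, 0.1])
sample = rng.choice(values, size=10_000, p=probs)

# Values with higher probability show up more often in the sample.
freqs = np.bincount(sample, minlength=3) / sample.size
print(noisy_x)
print(freqs)
```

With a sample this large, the observed frequencies should land close to `probs`, though never exactly on them.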

In its most basic form, you can think of a probability distribution as a set of possible values, each with a probability attached, where the probabilities all sum to one. Any set of non-negative numbers that sum to one can define a probability distribution over a set of values. Here the "values" are all the possible values the random variable can take on, so this doesn't mean that the probabilities of the values in the data you happen to have must sum to one; that is only true if your data contains *every* possible value exactly once.

In math lingo, a probability defined over a collection of values $x$ is a function $p(x)$ whose values are never negative and whose sum over all values in the collection is one. That is, suppose $x$ is a random variable that can take on $n$ values $\{x_1, x_2, \cdots, x_n\}$. Then

$$p(x) \geq 0 \quad \text{for all } x \text{ in } \{x_1, x_2, \cdots, x_n\},$$
$$\sum_{i=1}^n p(x_i) = p(x_1) + p(x_2) + \cdots + p(x_n) = 1.$$

We say $p(x)$ is the "probability that $x$ will occur". One can *define* then the distribution of values $\{x_1, x_2, \cdots, x_n\}$ by the function $p(x)$ itself, or equivalently by the set of values $\{p(x_1), p(x_2), \cdots, p(x_n)\}$. 

We can also define the probability that either of two distinct values $x_1, x_2$ occurs by $p(\{x_1, x_2\}) = p(x_1) + p(x_2)$. This definition extends readily to larger collections of distinct values: you just sum their probabilities.
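
As a small illustration of these definitions, here is a sketch in plain Python. The fair-die example and the helper name `prob_of_set` are my own assumptions, not something taken from the tutorial's dataset.

```python
import math

# Hypothetical example: a fair six-sided die, so p(x) = 1/6 for each face.
p = {face: 1 / 6 for face in range(1, 7)}

# Every probability is non-negative, and together they sum to one.
assert all(prob >= 0 for prob in p.values())
assert math.isclose(sum(p.values()), 1.0)

def prob_of_set(values, p):
    """Probability that x is any one of a set of distinct values: sum their probabilities."""
    return sum(p[v] for v in set(values))

p_1_or_2 = prob_of_set({1, 2}, p)
print(p_1_or_2)  # roughly 1/3
```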

### Joint Distributions and Conditional Distributions

Just like we can talk about functions of multiple variables, like $z=f(x,y)$, we can talk about probabilities of multiple variables as well, like $p(x,y)$ (often called the *joint distribution* of $x$ and $y$). The two most important concepts to take away from joint probabilities are the concepts of *independence* and *conditional probability*. 

Informally, two random variables are independent when knowing the distribution of one tells you nothing about the distribution of the other; you can essentially think of them as two completely separate (i.e. independent) processes. In math terms, this just means the joint distribution factors, i.e. $p(x,y)=p(x)p(y)$. (Subtle point: Each $p$ has a different meaning in this formula. Writing $p(x,y)$ means "the distribution from which $x,y$ are jointly sampled in pairs". Writing $p(x)$ means "the distribution from which $x$ alone is sampled if we do so ignoring $y$". And writing $p(y)$ means "the distribution from which $y$ alone is sampled if we do so ignoring $x$". This abuse of notation is done all the time in probability. You can usually tell which $p$ is which by looking at what random variable it's a function of.)

Informally, conditional probability is a measure of how non-independent (i.e. dependent) two random variables are. If $x$ and $y$ are dependent, knowing what $x$ is before sampling $y$ can change what the distribution of $y$ ends up being. (For ordinary, non-random variables, this just says that $y$ would be a function of $x$, $y=y(x)$. Dependence is just an extension of that concept to random variables.) We measure the dependence of $y$ on $x$ by defining the conditional probability distribution $p(y|x)$, which reads as "the probability of sampling a value $y$ given we know the value of $x$ already", or in short "the probability of $y$ given $x$". It's defined mathematically by

$$p(y|x) = \frac{p(x,y)}{p(x)}.$$

Notice $p(y|x)=p(y)$ only when $x$ and $y$ are independent. Otherwise, knowing something about $x$ changes what distribution $y$ could be sampled from, as $p(y|x)$ is its own distribution completely different from $p(y)$. We often express the sampling of $y$ given $x$ as a "new" random variable $y|x$. 

It should hopefully be clear that we can just as easily define a distribution $p(x|y)$ as well, the probability of $x$ given $y$. Just swap $x$ and $y$ in the above. Note that, just as knowing something about $x$ can affect what $y$ is, knowing something about $y$ can affect what $x$ is. However, and this is very important, note in general $p(x|y) \neq p(y|x)$! You wouldn't believe how many people make this mistake when applying it to real life problems, even smart people.
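
Here is a minimal NumPy sketch of these definitions on a tiny joint distribution. The numbers in `p_xy` are invented for illustration; the point is only to show the marginals, both conditionals, and a case where $p(x|y) \neq p(y|x)$.

```python
import numpy as np

# Hypothetical joint distribution p(x, y): rows index x in {0, 1}, columns index y in {0, 1}.
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

p_x = p_xy.sum(axis=1)   # marginal p(x): sum over y
p_y = p_xy.sum(axis=0)   # marginal p(y): sum over x

# Conditionals straight from the definitions p(y|x) = p(x,y)/p(x) and p(x|y) = p(x,y)/p(y).
p_y_given_x = p_xy / p_x[:, None]   # entry [x, y] holds p(y|x)
p_x_given_y = p_xy / p_y[None, :]   # entry [x, y] holds p(x|y)

# Independence would require p(x,y) = p(x) p(y) in every entry.
is_independent = np.allclose(p_xy, np.outer(p_x, p_y))

print(p_y_given_x[0, 1])   # p(y=1 | x=0)
print(p_x_given_y[0, 1])   # p(x=0 | y=1)
print(is_independent)
```

In this made-up table $p(y=1|x=0)$ and $p(x=0|y=1)$ come out different, and the joint does not factor, so $x$ and $y$ are dependent.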

### Bayes Rule

Conditional probability is what allows us to define the important formula for Naive Bayes: Bayes Rule. Bayes Rule is a way of relating the two conditional distributions $p(y|x)$ and $p(x|y)$. I just said these two are in general not equal to each other. How do we relate them then? The trick is to use the fact that 

$$p(x,y)=p(y|x)p(x)=p(x|y)p(y)$$

from the definition of conditional probability above. Setting the last two expressions equal and solving for $p(y|x)$ we arrive at

$$p(y|x) = \frac{p(x|y)p(y)}{p(x)}. \quad \quad \text{(Bayes Rule)}$$

Note in Bayesian lingo $p(y)$ in this formula is often called the *prior* distribution and $p(y|x)$ the *posterior* distribution. This comes from thinking of Bayes Rule as an update rule. Imagine you want to know something about $y$. You start with a "prior belief" about what $y$ might be, a guess. You then collect some measurements $x$, and would like to figure out how knowing something about $x$ changes your belief about what $y$ is, i.e. what $y|x$ is. You can thus think of the posterior $p(y|x)$ as your updated belief about $y$ given that you've seen $x$. The distribution $p(x|y)$ is often called the *likelihood*, which is the distribution you assume $x$ is being sampled from (i.e. it's your model of $x$ that you're trying to fit to your input data).
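
A quick numeric sketch of Bayes Rule as an update may help. The prior and likelihood values below are invented, and the code uses the fact that $p(x)$ can be obtained by summing $p(x|y)p(y)$ over the possible values of $y$, which follows from $p(x,y)=p(x|y)p(y)$.

```python
# Hypothetical numbers: y is the class (1 = spam, 0 = ham),
# and x is the event "the message contains the word 'free'".
prior = {0: 0.8, 1: 0.2}           # p(y), our belief before seeing x
likelihood = {0: 0.05, 1: 0.50}    # p(x | y)

# p(x) = sum over y of p(x|y) p(y)
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Bayes Rule: p(y|x) = p(x|y) p(y) / p(x)
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}
print(posterior)
```

Under these assumed numbers the belief in spam moves from a prior of 0.2 to a posterior of about 0.71 after seeing the word.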

### Probabilities as Frequencies

There's another way to interpret probabilities that will be useful to us as well in the derivation of Naive Bayes, namely that of a frequency. Suppose you have some data $D=\{x_1,x_2,\cdots,x_n\}$, containing $n$ data points $x_i$. Suppose each of your data points can take on any of the values $v_1,v_2,\cdots,v_k$ in some set. Let $n_j$ be the number of times the value $v_j$ occurs in your dataset $D$. Then we can (approximately) define the probability $p(x=v_j)$ by

$$p(x=v_j) \approx \frac{n_j}{n} = \frac{\text{# times } v_j \text{ occurs in the dataset}}{\text{# of samples in the dataset}}.$$

This says that, practically speaking, you can think of the probability of some value occurring as the ratio of the number of times it did happen to the number of times it could have happened in your data. Note in practice, you rarely know the *true* distribution your data was sampled from, and so you often have to approximate it using this formula.

Also, note it must be true that $p(v_1)+\cdots+p(v_k)=1$. You can see from the above frequency formula that this follows automatically, as the number of times all possible values can occur is $n$, so $\frac{n_1+\cdots+n_k}{n}=\frac{n}{n}=1$.

For real data, as long as you have a set of positive-valued weights, you can always normalize them to become a probability by dividing by their sum. That is, given weights $w_1,\cdots,w_k$, you can define a probability on the values $v_1,\cdots,v_k$ that $x$ can take on by

$$p(x=v_j) \equiv \frac{w_j}{\sum_i w_i}.$$
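
Both formulas are easy to try out in a few lines of Python. The toy dataset and the helper name `normalize` below are assumptions made up for this sketch.

```python
from collections import Counter

# Hypothetical dataset of n = 8 observations taking values in {"a", "b", "c"}.
data = ["a", "b", "a", "c", "a", "b", "a", "c"]

counts = Counter(data)    # n_j for each value v_j
n = len(data)
p_hat = {v: n_j / n for v, n_j in counts.items()}   # p(x = v_j) is approximately n_j / n

def normalize(weights):
    """Turn positive weights into probabilities by dividing by their sum."""
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}

p_from_weights = normalize({"a": 2.0, "b": 1.0, "c": 1.0})
print(p_hat)
print(p_from_weights)
```

The frequency estimate is itself just a normalization, with the counts $n_j$ playing the role of the weights.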

### Summary

To help you remember these concepts, here's a rough conversion dictionary between "regular" functions you might be used to and probabilities.

|   | Functions | Probabilities |
|--------|--------|--------|
| Values  | $x$ exact  | $x \pm \text{error}$  |
| Single Variable  | $f(x)$  | $p(x)$  |
| Multiple Variables  | $f(x,y)$  | $p(x,y)$  |
| Independence  | $f(x,y)=g(x)h(y)$  | $p(x,y)=p(x)p(y)$  |
| Dependence of $y$ on $x$  |  $y=f(x)$ | $p(y \mid x)= \frac{p(x,y)}{p(x)}$  |
| Dependence of $x$ on $y$ | $x=f(y)$ | $p(x \mid y) = \frac{p(x,y)}{p(y)}$ |

## Naive Bayes From Scratch

### Imports and Data

Our goal will be to describe the Naive Bayes algorithm, implement it from scratch, and test it on an example dataset. Since Naive Bayes is perhaps most popular in the NLP community, the dataset we will use is the same SMS Spam Dataset from the Text Classification tutorial. Since this tutorial isn't about text, we will just copy and use the same functions from that tutorial to load and process the data into a numerical format. Our goal here is to focus on the classification piece, not the preprocessing. See the other tutorial if you're curious about the preprocessing details.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path
from collections import Counter, defaultdict

# Text Processing Packages
import re
import string
import spacy
import nltk
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords
nltk.download('stopwords')

# Sklearn Packages (mainly for utility functions)
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score
from imblearn.under_sampling import RandomUnderSampler

np.random.seed(42)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ryankingery/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
def get_text(text_file):
    df = pd.read_csv(text_file, encoding='ISO-8859-1')
    df = df[['v1', 'v2']]
    df.columns = ['labels', 'text_raw']
    # map string labels to integers (ham -> 0, spam -> 1)
    df['labels'] = df['labels'].map({'ham': 0, 'spam': 1})
    df = df.sample(frac=1).reset_index(drop=True)
    return df

repo_path = Path.home() / 'Repos'  # insert path to the repo here
text_file = repo_path / 'ml_tutorials/resources/spam.csv'
df = get_text(text_file)
df.head(10)

|   | labels | text_raw |
|---|---|---|
| 0 | 0 | Funny fact Nobody teaches volcanoes 2 erupt, t... |
| 1 | 0 | I sent my scores to sophas and i had to do sec... |
| 2 | 1 | We know someone who you know that fancies you.... |
| 3 | 0 | Only if you promise your getting out as SOON a... |
| 4 | 1 | Congratulations ur awarded either å£500 of CD ... |
| 5 | 0 | I'll text carlos and let you know, hang on |
| 6 | 0 | K.i did't see you.:)k:)where are you now? |
| 7 | 0 | No message..no responce..what happend? |
| 8 | 0 | Get down in gandhipuram and walk to cross cut ... |
| 9 | 0 | You flippin your shit yet? |


In [3]:
df.labels.value_counts()

0    4825
1     747
Name: labels, dtype: int64

In [4]:
def sub_special_tokens(text):
    # note: I borrowed many of these regexes from S.O.
    # convert simple URLs to xxurl token (e.g. www.google.com, http://google.com -> xxurl)
    text = re.sub(r' www\.', ' http://www.', text)
    text = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', ' xxurl ', text)
    # convert (British) phone numbers to xxphone token (e.g. 09058097218 -> xxphone)
    pat = r'\d{3}[-\.\s]??\d{4}[-\.\s]??\d{4}|\d{5}[-\.\s]??\d{3}[-\.\s]??\d{3}|(?:\d{4}\)?[\s-]?\d{3}[\s-]?\d{4})'
    text = re.sub(pat, ' xxphone ', text)
    # replace monetary values with xxmon token
    text = text.replace('£','$ ')
    # rewrite pence amounts (e.g. 50p -> $ 0.50); the replacement must be a raw string so \1 is a backreference
    text = re.sub(r'(\d+)[ ]?p\b', r'$ 0.\1', text)
    text = re.sub(r'\$[ ]*(\d+[,\.])*\d+', ' xxmon ', text)
    # put xxup token before words in all caps (easy way to recognize info from capitalizing a word)
    text = re.sub(r'(\b[A-Z][A-Z0-9]*\b)', r' xxup \1 ', text)
    # put xxcap token before words with capitalized first letter (easy way to recognize first word in a sentence)
    text = re.sub(r'(\b[A-Z][a-z0-9]+\b)', r' xxcap \1 ', text)
    # convert some common text "emojis" to xxemoji: ;), :), :(, :-(, etc
    text = re.sub(r'[:;][ ]*[-]*[ ]*[()]', ' xxemoji ', text)
    return text

def normalize_text(text):
    # converts common patterns into special tokens
    text = sub_special_tokens(text)
    # convert text to lowercase
    text = text.lower()
    # strip out any lingering html tags
    text = re.sub(r'<[^>]*>', '', text)
    # convert common abbreviations to regular words
    text = text.replace('&',' and ')
    text = re.sub(r'\bu\b', ' you ', text)
    text = re.sub(r'\bur\b', ' your ', text)
    text = re.sub(r'\b2\b', ' to ', text)
    text = re.sub(r'\b4\b', ' for ', text)
    # put spaces between punctuation (eg: 9.Blah -> 9 . Blah)
    puncts = r'[' + re.escape(string.punctuation) + r']'
    text = re.sub('(?<! )(?=' + puncts + ')|(?<=' + puncts + ')(?! )', r' ', text)
    # strip non-ascii characters (easy way to denoise text a bit)
    text = text.encode("ascii", errors="ignore").decode()
    # remove all punctuation except ?
    text = re.sub(r"[^\w\s?]",' xxpunct ',text)
    # convert all other numbers to xxnum token (e.g. 123, 1.2.3, 1-2-3 -> xxnum)
    text = re.sub(r'\b([.-]*[0-9]+[.-]*)+\b', ' xxnum ', text)
    # remove nltk's common set of stop words (common for classical NLP analysis)
    stop_words = stopwords.words('english')
    text = ' '.join(word for word in text.split() if word not in stop_words)
    # stem words using nltk snowball stemmer, e.g. converts {run, running, runs} all to "run"
    stemmer = SnowballStemmer('english')
    stemmed_text = ''
    for word in text.split():
        stemmed_text = stemmed_text + stemmer.stem(word) + ' '
    text = stemmed_text
    # replace any run of 2 or more spaces with a single space
    text = re.sub(r'[ ]{2,}',' ',text)
    return text

df['text'] = df['text_raw'].apply(normalize_text)
df.head(10)

|   | labels | text_raw | text |
|---|---|---|---|
| 0 | 0 | Funny fact Nobody teaches volcanoes 2 erupt, t... | xxcap funni fact xxcap nobodi teach volcano er... |
| 1 | 0 | I sent my scores to sophas and i had to do sec... | xxup sent score sopha secondari applic school ... |
| 2 | 1 | We know someone who you know that fancies you.... | xxcap know someon know fanci xxpunct xxcap cal... |
| 3 | 0 | Only if you promise your getting out as SOON a... | xxcap promis get xxup soon xxpunct xxcap xxpun... |
| 4 | 1 | Congratulations ur awarded either å£500 of CD ... | xxcap congratul award either xxmon xxup cd gif... |
| 5 | 0 | I'll text carlos and let you know, hang on | xxup xxpunct text carlo let know xxpunct hang |
| 6 | 0 | K.i did't see you.:)k:)where are you now? | xxup k xxpunct xxpunct see xxpunct xxemoji k x... |
| 7 | 0 | No message..no responce..what happend? | xxcap messag xxpunct xxpunct responc xxpunct x... |
| 8 | 0 | Get down in gandhipuram and walk to cross cut ... | xxcap get gandhipuram walk cross cut road xxpu... |
| 9 | 0 | You flippin your shit yet? | xxcap flippin shit yet ? |


In [5]:
text = df['text'].tolist()
tokens = [[word for word in doc.split(' ') if len(word) > 0] for doc in text]
labels = df['labels'].tolist()

### Naive Bayes Derivation

Naive Bayes is a supervised learning algorithm, which assumes we have labeled training data. We will assume we have a set of labeled data $D=\{(x_1,y_1),\cdots,(x_n,y_n)\}$. As usual, the goal is to learn a prediction function $f(x)$ on the training inputs $x$ such that the training outputs $y$ match the predicted outputs given by $f(x)$. That is, we seek to find a function $f(x)$ such that $y_i \approx f(x_i)$ for each $(x_i,y_i)$ in $D$.

Now, each training input $x$ is composed of features. As we are using a text example in this problem, we assume the features are the words (really tokens) appearing in each sample of text. For a given sample $x$, we write it as a sequence of words, $x = (w_1,w_2,\cdots,w_m)$. As we are doing spam classification, this is a binary classification problem (with `0=nonspam`, `1=spam`), so each label $y=0,1$.

Recall from the previous section that a function $y \approx f(x)$ has a probabilistic analogue $p(y|x)$. Thus, if we want to learn an ML model that allows for noise, it would be reasonable to learn the posterior probability $p(y|x)$ instead of a "point estimator" function like $y \approx f(x)$. For the given example with $x = (w_1,w_2,\cdots,w_m)$ and $y=0,1$, this means learning the probabilities

$$p(y=0|x) = p(y=0|w_1,w_2,\cdots,w_m)$$
$$p(y=1|x) = p(y=1|w_1,w_2,\cdots,w_m)$$

for each $x,y$ pair in the dataset. But how do we do this? Recall from Bayes Rule we can express $p(y|x)$ in terms of a likelihood $p(x|y)$ and a prior $p(y)$. So

$$p(y=0|x) = \frac{p(x|y=0)p(y=0)}{p(x)} = \frac{p(w_1,w_2,\cdots,w_m|y=0)p(y=0)}{p(x)}$$
$$p(y=1|x) = \frac{p(x|y=1)p(y=1)}{p(x)} = \frac{p(w_1,w_2,\cdots,w_m|y=1)p(y=1)}{p(x)}.$$

What we just wrote down is true for any binary classification problem. There's nothing "naive" about it. Now we make the important assumption of Naive Bayes. 

**Naive Bayes Assumption:** The features of $x$ are (mutually) independent of each other, given $y$. That is, once $y$ is known, knowing any of the words in $x$ tells you nothing further about the others.

For a pair of features $w_i$ and $w_j$ with $i \neq j$ this says $p(w_i,w_j|y)=p(w_i|y)p(w_j|y)$, and the same factorization holds for all the features taken together, so the likelihood $p(x|y)$ factors:

$$p(x|y) = p(w_1,w_2,\cdots,w_m|y) = p(w_1|y)p(w_2|y)\cdots p(w_m|y) = \prod_{j=1}^m p(w_j|y).$$

Under these assumptions, we can re-express Bayes Rule for our problem by

$$p(y=0|x) = \frac{p(y=0)}{p(x)}\prod_{j=1}^m p(w_j|y=0)$$
$$p(y=1|x) = \frac{p(y=1)}{p(x)}\prod_{j=1}^m p(w_j|y=1).$$

Now, to get our prediction rule, we define the probability ratio $r$, in which the hard-to-compute $p(x)$ cancels out:

$$r \equiv \frac{p(y=1|x)}{p(y=0|x)} = \frac{p(y=1)}{p(y=0)}\frac{\prod_{j=1}^m p(w_j|y=1)}{\prod_{j=1}^m p(w_j|y=0)}.$$

Now, observe the following statements are equivalent:
- $y=1$ is more likely than $y=0$, given $x$,
- $p(y=1|x) > p(y=0|x)$,
- $r > 1$.

Similarly, the following are equivalent statements as well:
- $y=0$ is more likely than $y=1$, given $x$,
- $p(y=0|x) > p(y=1|x)$,
- $r < 1$. 

Thus, we've derived a classification rule. If for a given $x$ we have $r > 1$, predict $\hat y=1$. If for that given $x$ we instead have $r < 1$, predict $\hat y=0$. (What if $r=1$? It's unlikely to happen, but if it does you can either round up or down, or flip a coin if you're paranoid. To keep things simple we'll round down.)

To summarize the rule: Given $x=(w_1,\cdots,w_m)$, calculate $r$. Predict $\hat y = 1 \text{ if } r > 1 \text{ else } 0$.
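
To see the rule in action, here is a minimal sketch with hand-picked probabilities. The values in `prior` and `word_prob`, and the function names `likelihood` and `predict`, are assumptions for illustration and are not estimated from the SMS data.

```python
import math

# Hypothetical, hand-picked probabilities (not estimated from any dataset).
prior = {0: 0.8, 1: 0.2}    # p(y)
word_prob = {               # p(w | y)
    0: {"free": 0.05, "call": 0.20, "now": 0.10},
    1: {"free": 0.50, "call": 0.30, "now": 0.40},
}

def likelihood(words, y):
    """p(x|y) under the Naive Bayes assumption: the product of the p(w_j|y)."""
    return math.prod(word_prob[y][w] for w in words)

def predict(words):
    """Return (y_hat, r), rounding down to 0 when r == 1."""
    r = (prior[1] * likelihood(words, 1)) / (prior[0] * likelihood(words, 0))
    return (1 if r > 1 else 0), r

y_hat, r = predict(["free", "now"])
print(y_hat, r)
```

Note that $p(x)$ never appears in the code, since it cancels in the ratio.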

Before implementing this, there are a few practical considerations to address first.

### Calculating the Probabilities

Until now we've dodged the question of how one would actually calculate each of these probabilities. If we can't calculate each $p(y)$ and $p(w_j|y)$ we can't calculate $r$, so we're dead in the water. Naive Bayes keeps things naive though. We're not actually going to "learn" what these probabilities are, we're just going to estimate them using frequencies calculated from the training set.

$$p(w_j|y=0) \approx \frac{n_{j0}}{n_0} = \frac{\text{# times word } w_j \text{ occurs in negative samples}}{\text{# of negative samples }},$$

$$p(w_j|y=1) \approx \frac{n_{j1}}{n_1} = \frac{\text{# times word } w_j \text{ occurs in positive samples}}{\text{# of positive samples }},$$

$$p(y=0) \approx \frac{n_0}{n} = \frac{\text{# of negative samples}}{\text{# total samples}},$$

$$p(y=1) \approx \frac{n_1}{n} = \frac{\text{# of positive samples}}{\text{# total samples}}.$$

Notice something important: We have all these quantities in our training set. They're just counts of things. We don't need to guess or "learn" what any of them are. Calculating these probabilities, and hence the posteriors $p(y|x)$ and $r$, is therefore fairly mechanical. For simplicity, define the probability ratio for $w_j$ by

$$r_j \equiv \frac{p(w_j|y=1)}{p(w_j|y=0)} \approx \frac{n_{j1}/n_1}{n_{j0}/n_0}.$$

Then we have the following simple formula for the probability ratio $r$:

$$ r = \frac{n_1}{n_0} \prod_{j=1}^m r_{j} = \frac{n_1}{n_0} r_1 \cdots r_m.$$
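
The counting can be sketched in a few lines on a made-up toy corpus. The documents, labels, and the function name `ratio` are all hypothetical; the estimates follow the frequency formulas above, without any smoothing.

```python
from collections import Counter

# Hypothetical toy training set: tokenized messages with labels (1 = spam, 0 = ham).
train_docs = [["win", "cash", "now"], ["win", "prize"],
              ["call", "now"], ["call", "me", "now"], ["win", "me"]]
train_labels = [1, 1, 0, 0, 0]

n = {c: train_labels.count(c) for c in (0, 1)}   # n_0 and n_1
word_counts = {c: Counter() for c in (0, 1)}     # n_{j0} and n_{j1}
for doc, label in zip(train_docs, train_labels):
    word_counts[label].update(doc)

def ratio(doc):
    """r = (n_1 / n_0) * product of r_j, with r_j = (n_j1 / n_1) / (n_j0 / n_0)."""
    r = n[1] / n[0]
    for w in doc:
        p_w_1 = word_counts[1][w] / n[1]
        p_w_0 = word_counts[0][w] / n[0]
        r *= p_w_1 / p_w_0   # raises ZeroDivisionError if w never occurs in a ham message
    return r

r = ratio(["win", "now"])
y_hat = 1 if r > 1 else 0
print(r, y_hat)
```

This version breaks as soon as a word is missing from one of the classes, since one of the estimated probabilities is then exactly zero.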

### Laplace Smoothing

There's one subtle scenario I've glossed over until now. What happens if a word doesn't appear in one of the sets used to calculate the above probabilities? Since probabilities correspond to counts in Naive Bayes, if a word $w_j$ didn't occur in one of your counts ($n_{j1}$ or $n_{j0}$) then you'd end up getting $p(w_j|y)=0$ in that situation, which would cause the calculation of $r$ to either become zero or blow up.

But that's not reasonable. Just because you haven't seen a word occur in your positive samples doesn't mean it gives no predictive power whatsoever (and likewise for words in negative samples). You'd like to allow for the possibility that the occurrence of that word does have some kind of predictive power, regardless of whether it occurred that way in your training set.

The trick to deal with this is to add *pseudocounts*, i.e. fake observations, to your counts. We build into the formulas the pretense that we saw one extra example of each word (even though of course we didn't). If $w_j$ never actually occurred in the negative samples, but we pretend we observed it once, then $p(w_j|y=0) = \frac{1}{n_0 + 1}$, where the denominator increases by 1 to account for the single pseudocount. Similarly for $p(w_j|y=1)$. Since we do this for every word, it modifies the formula for each $p(w_j|y)$, not just for the $w_j$ we haven't seen:

$$p(w_j|y=0) = \frac{n_{j0} + 1}{n_0 + 1},$$

$$p(w_j|y=1) = \frac{n_{j1} + 1}{n_1 + 1}.$$

Other than this adjustment, everything else in calculating $r$ remains the same. For completeness, the modified form for $r$ is

$$r = \frac{n_1}{n_0} \prod_{j=1}^m \frac{(n_{j1} + 1)/(n_1 + 1)}{(n_{j0} + 1)/(n_0 + 1)}.$$
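
Here is a minimal sketch of the smoothed calculation on a made-up toy corpus. The documents, labels, and the names `smoothed_prob` and `smoothed_ratio` are hypothetical; the pseudocounts follow the smoothed formulas for $p(w_j|y)$ given above.

```python
from collections import Counter

# Hypothetical toy training set: tokenized messages with labels (1 = spam, 0 = ham).
train_docs = [["win", "cash", "now"], ["win", "prize"],
              ["call", "now"], ["call", "me", "now"], ["win", "me"]]
train_labels = [1, 1, 0, 0, 0]

n = {c: train_labels.count(c) for c in (0, 1)}   # n_0 and n_1
word_counts = {c: Counter() for c in (0, 1)}     # n_{j0} and n_{j1}
for doc, label in zip(train_docs, train_labels):
    word_counts[label].update(doc)

def smoothed_prob(word, c):
    """(n_jc + 1) / (n_c + 1): never zero, even for a word unseen in class c."""
    return (word_counts[c][word] + 1) / (n[c] + 1)

def smoothed_ratio(doc):
    r = n[1] / n[0]
    for w in doc:
        r *= smoothed_prob(w, 1) / smoothed_prob(w, 0)
    return r

# "cash" never occurs in a ham message, so the unsmoothed ratio would divide by zero.
r = smoothed_ratio(["cash", "now"])
print(r)
```

A `Counter` returns 0 for missing keys, so words that appear in only one class (or in neither) need no special handling here.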



### Numerical Underflow

On real computers, floating point numbers don't have infinite precision. Among other things, this means if a number gets too small it might appear to be exactly 0 even if it isn't. When you're multiplying a lot of numbers between 0 and 1 together, which you are with all the $p(w_j|y)$ terms, this *numerical underflow* runs a high risk of happening. This is bad because once a product collapses to 0 you've thrown away the information in your probabilities, and the ratio $r$ degenerates to 0 or to the undefined $0/0$.

When multiplications risk causing numerical underflow (or its counterpart, numerical overflow, which causes terms to blow up to infinity), the easiest way to deal with it is to take the logarithm. Recall from math that the log of a product is the sum of the logs: $\log(xy) = \log(x) + \log(y)$. Thus, logs convert products to sums. Since sums are far less likely to cause numerical underflow (or overflow), working with the log of the values will often fix these issues.

For Naive Bayes, the problem is the $\prod_j p(w_j|y)$ terms. We thus take the logarithm of $r$, called the *log-ratio*, and use it to do classification instead of $r$. Observe

\begin{equation}
\begin{aligned}
\log(r) &= \log \bigg(\frac{p(y=1|x)}{p(y=0|x)} \bigg) \\
        &= \log \bigg (\frac{p(y=1)}{p(y=0)}\frac{\prod_{j=1}^m p(w_j|y=1)}{\prod_{j=1}^m p(w_j|y=0)} \bigg ) \\
        &= \big ( \log p(y=1) - \log p(y=0) \big ) + \sum_{j=1}^m \big (\log p(w_j|y=1) - \log p(w_j|y=0) \big ).
\end{aligned}
\end{equation}

Now, if $r=1$ (our classification threshold), then $\log r = 0$. Also, $\log$ is an increasing function, so if $r < 1$ then $\log r < \log 1 = 0$, and if $r > 1$ then $\log r > \log 1 = 0$. This means we can apply the same logic for classification as we did for $r$, except for $\log r$ we use 0 for the threshold instead of 1.

This gives us an updated classification rule: Given $x=(w_1,\cdots,w_m)$, calculate $\log r$. Predict $\hat y = 1 \text{ if } \log r > 0 \text{ else } 0$.
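
The underflow problem is easy to reproduce. In this sketch the message length of 400 words, the per-word probabilities, and the equal priors are all assumptions picked to make the effect obvious.

```python
import math

# Hypothetical long message: 400 words, each with an assumed probability under each class.
p1 = [0.02] * 400   # p(w_j | y=1)
p0 = [0.01] * 400   # p(w_j | y=0)

# Both products are far smaller than the smallest positive float64, so they underflow to 0.0
# and the naive ratio would be 0.0 / 0.0.
prod1 = math.prod(p1)
prod0 = math.prod(p0)

# The log-ratio uses sums instead and stays well within range. Equal priors are assumed here.
log_prior_ratio = math.log(0.5) - math.log(0.5)
log_r = log_prior_ratio + sum(math.log(a) - math.log(b) for a, b in zip(p1, p0))
y_hat = 1 if log_r > 0 else 0
print(prod1, prod0, log_r, y_hat)
```

Every word here is twice as likely under $y=1$, so the log-ratio is $400 \log 2 > 0$ and the prediction is $\hat y = 1$, even though $r$ itself cannot be computed directly.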

### Naive Bayes Algorithm

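Putting the pieces together, the algorithm has a training step and a prediction step. To train, count $n_0$ and $n_1$ (the number of negative and positive samples) and, for every word $w_j$, the counts $n_{j0}$ and $n_{j1}$; from these compute the log-priors $\log p(y)$ and the smoothed log-probabilities $\log p(w_j|y)$. To predict on a new input $x=(w_1,\cdots,w_m)$, add up the terms of $\log r$ and output $\hat y = 1$ if $\log r > 0$, else $\hat y = 0$.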
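
The following is a rough sketch of how the pieces derived above (frequency counts, add-one pseudocounts, and the log-ratio with threshold 0) might fit together. The class name `NaiveBayesSketch`, its method names, and the toy corpus are my own assumptions, and it should not be read as the tutorial's final implementation.

```python
import math
from collections import Counter

class NaiveBayesSketch:
    """Binary Naive Bayes on tokenized documents (hypothetical helper class)."""

    def fit(self, docs, labels):
        self.n = {0: 0, 1: 0}                              # n_0, n_1: samples per class
        self.word_counts = {0: Counter(), 1: Counter()}    # n_{j0}, n_{j1}: word counts per class
        for doc, label in zip(docs, labels):
            self.n[label] += 1
            self.word_counts[label].update(doc)
        return self

    def log_prob(self, word, c):
        # Smoothed estimate log((n_jc + 1) / (n_c + 1)).
        return math.log(self.word_counts[c][word] + 1) - math.log(self.n[c] + 1)

    def log_ratio(self, doc):
        # log p(y=1) - log p(y=0); assumes both classes appear in the training data.
        log_r = math.log(self.n[1]) - math.log(self.n[0])
        for word in doc:
            log_r += self.log_prob(word, 1) - self.log_prob(word, 0)
        return log_r

    def predict(self, docs):
        # Predict 1 when log r > 0, otherwise 0 (ties round down).
        return [1 if self.log_ratio(doc) > 0 else 0 for doc in docs]

# Hypothetical toy corpus (1 = spam, 0 = ham), just to exercise the class.
train_docs = [["win", "cash", "now"], ["win", "prize"],
              ["call", "now"], ["call", "me", "now"], ["win", "me"]]
train_labels = [1, 1, 0, 0, 0]

model = NaiveBayesSketch().fit(train_docs, train_labels)
preds = model.predict([["win", "cash"], ["call", "me"]])
print(preds)
```

The same interface should accept the `tokens` and `labels` lists built earlier, for instance after splitting them with `train_test_split`, though this sketch has only been reasoned through on the toy corpus.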