In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np
import pandas as pd

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

import random

In [None]:
# Import statements!

import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import plot_confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.dummy import DummyClassifier

# Hands on With Bayesian Methods
## Author: Henry Sue
### Email: henry@henrysue.dev

Please credit me if you clone or reproduce this notebook.

## Introduction: Thomas Bayes and Bayes Theorem

Some history:

In the 1700s, there lived a minister and statistician named Thomas Bayes, who believed strongly that probabilities are better treated as beliefs and hypotheses than as frequencies of occurrence. This distinction separates the two popular 'schools' of probability theory - *Bayesian* and *Frequentist*, named for Bayes and for frequencies, respectively. His most famous contribution is "Bayes Theorem", or Bayes Rule. This theorem is used to estimate the probability of an event when one has prior information about conditions related to that event.

![Thomas Bayes](https://upload.wikimedia.org/wikipedia/commons/d/d4/Thomas_Bayes.gif)

*Image source: Terence O'Donnell, History of Life Insurance in Its Formative Years (Chicago: American Conservation Co:, 1936), p. 335 (caption "Rev. T. Bayes: Improver of the Columnar Method developed by Barrett.")*


## So what even is Bayes Theorem and how is it useful?

First, let's take a look at the structure of Bayes Theorem:

\begin{equation}
 P(A|B)=\frac{P(B|A)P(A)}{P(B)}
\end{equation}

$P(A)$, $P(B)$ : The probability of event **A** occurring, The probability of event **B** occurring.  
$P(A|B)$ : The probability of event **A** occurring *given that* event **B** occurred. (Conditional Probability)  
$P(B|A)$ : The probability of event **B** occurring *given that* event **A** occurred. (Conditional Probability)  

Sometimes, $P(B)$ will be expanded into $P(B|A)P(A) + P(B|\neg A)P(\neg A)$.  
(Read: The probability of B *given* A $\times$ the probability of A $+$ the probability of B *given* the negation of A $\times$ the probability of the negation of A)  

This is due to the **law of total probability.** 

Don't worry if this doesn't click immediately! It will make more sense in the concrete example below. If you would like to learn more about the probability side of things, check out [this video by 3Blue1Brown on Bayes Theorem.](https://www.youtube.com/watch?v=HZGCoVF3YvM)
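
As a minimal sketch of the formula above, the function below computes $P(A|B)$ from a prior and the two conditional probabilities, expanding $P(B)$ with the law of total probability. The function name `bayes_posterior`, its parameter names, and the example numbers are illustrative assumptions, not something defined elsewhere in this notebook.

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) using Bayes Theorem.

    P(B) is expanded with the law of total probability:
    P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    """
    p_not_a = 1.0 - prior_a
    evidence = p_b_given_a * prior_a + p_b_given_not_a * p_not_a  # P(B)
    return p_b_given_a * prior_a / evidence


# Hypothetical numbers: P(A) = 0.3, P(B|A) = 0.8, P(B|not A) = 0.1
posterior = bayes_posterior(0.3, 0.8, 0.1)
print(round(posterior, 4))
```

Observing B raises the probability of A above its prior here because B is much more likely when A is true than when it is not.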


## **Section 1: Common Application - Disease Tests**

A very common application for Bayes theorem is medical tests. 

Let's say we are at risk of a rare disease named "statistitis" and want to take a test to see if we have this disease. Let us consider a test that is 99% *accurate*. In the context of statistics, accuracy is a measure of the rate of correctness - that is to say, the number of correct predictions divided by the total number of cases. For this example, we take that to mean that if we have the disease, the test comes back positive 99% of the time, and if we do not have the disease, the test comes back negative 99% of the time. Let us also assume that the disease is extremely rare: about 1 in 10,000 people get this disease.

What if we test positive for this disease? With this information, how can we best estimate the chances that we have this disease, given that we tested positive?

It is normal for our first instinct to jump to 99%, as this is how accurate the test is! But when we only look at the number of true positives (the times the test predicts correctly that you have the disease, given that you actually have the disease), we are missing a key part of the greater picture.  

#### Applying Bayes Theorem to our problem
We want to get a good estimate of the chances that we have the disease, but what are the steps that we need to take to use Bayes Theorem for our problem?

**Step 1:** Let us start by getting each component of Bayes Theorem, and along the way, we can intuitively think about what each component means.

Recall the structure of Bayes Theorem: 
\begin{equation}
 P(A|B)=\frac{P(B|A)P(A)}{P(B)}
\end{equation} 

Let's define a few things: *Event A*, or just $A$, is having the disease; *Event B*, or $B$, is testing positive for the disease.

$P(A)$ is known to us - it is the probability that we have the disease, or 1 in 10,000 (0.0001). This is known as the **Prior Probability.**  

$P(B|A)$, or the probability of B *given* A (the probability that we test positive, given that we have the disease) is 99% or 0.99, since the test is 99% accurate.

The tricky part is getting $P(B)$, or the probability of testing positive. As we mentioned above, we can use something called the Law of Total Probability to get this from the information we are given. The two possibilities for a positive test are: 1. We *do* have the disease and the test is *correctly* positive, or 2. We *do not* have the disease and the test *incorrectly* returns positive. If we add the probabilities of both possible cases, we get the total probability $P(B)$.

Written out, it would be: $P(B)$ = $P(B|A)P(A) + P(B|\neg A)P(\neg A)$  
Plugging in our probabilities, we get $P(B)$ = $(0.99 * 0.0001) + (0.01 * 0.9999) = 0.000099 + 0.009999 = 0.010098$

**Step 2:** Now let us plug all of our numbers in to get our answer!

\begin{equation}
 P(A|B)=\frac{P(B|A)P(A)}{P(B)} = \frac{0.99 * 0.0001}{0.010098} \approx 0.0098 \approx 1\%
\end{equation} 

Despite testing positive on a 99% accurate test, we still **only have about a 1% chance of actually having the disease.**   
On an intuitive level, this doesn't seem to make sense, until we consider how rare the disease actually is: out of 1 million people, 100 would have the disease, and 99 of them would test positive, whereas 999,900 would *not* have the disease, and out of those 999,900, 9,999 would get a false positive test. The True Positives are vastly outweighed by the False Positives.
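
The head count in the paragraph above can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes a population of 1 million, a prevalence of 1 in 10,000, and a test that is right 99% of the time for both sick and healthy people; the variable names are illustrative.

```python
population = 1_000_000
prevalence = 1 / 10_000   # 1 in 10,000 people have the disease
sensitivity = 0.99        # P(test positive | disease)
specificity = 0.99        # P(test negative | no disease)

sick = population * prevalence
healthy = population - sick

true_positives = sick * sensitivity
false_positives = healthy * (1 - specificity)

# Of everyone who tests positive, what share is actually sick?
p_disease_given_positive = true_positives / (true_positives + false_positives)

print(f"Sick: {sick:.0f}, true positives: {true_positives:.0f}")
print(f"Healthy: {healthy:.0f}, false positives: {false_positives:.0f}")
print(f"P(disease | positive test) = {p_disease_given_positive:.4f}")
```

Counting people this way gives the same answer as Bayes Theorem, because dividing true positives by all positives is exactly $P(B|A)P(A) / P(B)$ scaled up by the population size.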

## Section 2: Coin Flipping

#### So what does 'beliefs and hypotheses' mean in terms of probability distributions? 

Let us take a simple probabilistic example/model: Flipping a coin. *What is the probability of flipping heads on a two-sided coin?*

At first, the question is extremely easy to answer intuitively: the coin has two sides, so you have a 50% / 50% chance of getting heads or tails. But what happens if you have an unfairly weighted coin? What happens if you flip the coin in a way that affects the outcome? For these scenarios, it is hard to come up with a simple intuitive rule. 

We can make an educated guess at the chance that the coin lands heads (or, conversely, tails). Normally, our intuition is that the odds of flipping heads or tails are about 50%/50%, or 0.5 / 0.5; we can say that this is our **"Prior Belief"**. However, for the sake of our example, we will assume that all possible coin weights are equally probable.


#### Now let's try flipping our coin 100 times and see what happens!

I have defined a coin flipping function "flip_coin" that takes 1 input (number of times you want to flip the coin), prints how many times each option appears, and outputs an array of the results. 

In [None]:
def flip_coin(num_times):
    """Flip a weighted coin num_times and return a list of results (1 = heads, 0 = tails)."""
    
    weight = 0.2  # probability of tails, so heads comes up about 80% of the time
    num_heads = 0
    num_tails = 0
    results = []
    
    for _ in range(num_times):
        draw = random.uniform(0, 1)
        if draw >= weight:
            num_heads += 1
            results.append(1)
        else:
            num_tails += 1
            results.append(0)
    
    print('Number of Heads: ' + str(num_heads))
    print('Number of Tails: ' + str(num_tails))
    
    return results

In [None]:
# Let's try it out!

test_flips = flip_coin(100)
print("First 10 flips: " + str(test_flips[:10]))

So what can we tell from these results? 

We start with a **prior belief** that the probability of getting heads can be any one of our 100 discrete weights. Now that we have some trials, we need to *update* our belief to match what we have observed. Every time we flip our coin, we *update* our prior belief using Bayes Theorem. Let's go ahead and plot out a uniform prior distribution to represent the probability that each of the possible coin biases is the actual coin's bias.  

In [None]:
# Set a range of possible biases (in our case, 100 discrete biases.)

possible_biases = np.linspace(0, 1, 100) # 100 discrete possible weights
prior_bias = np.ones(len(possible_biases)) / len(possible_biases) # P(A)

# Let's graph it to see our prior distribution
plt.plot(possible_biases, prior_bias)
plt.xlabel('Bias (weight)')
plt.ylabel('Probability of bias (weight)')
plt.show()

We can see that we start with a uniform prior probability distribution. What does this mean? Essentially, we are starting with the assumption that all possible coin weights (biases) have an equal probability of being the coin's actual bias.

In [None]:
# Now let us take each result from our coin flips and update our prior probability to get our posterior probability distribution.

for result in test_flips:
    likelihood = (possible_biases ** result) * ((1 - possible_biases) ** (1 - result)) # P(B|A)
    evidence = np.sum(likelihood * prior_bias) # P(B) = sum over all biases of P(B|A) P(A)
    prior_bias = (likelihood * prior_bias) / evidence # Bayes Theorem P(A|B) = P(A) P(B|A) / P(B); this posterior is the prior for the next flip
    
plt.plot(possible_biases, prior_bias)
plt.xlabel('Bias (weight)')
plt.ylabel('Probability of bias (weight)')
plt.title('Probability of each possible coin bias.')
plt.show()

We can observe that there is already a high probability that our coin is weighted approximately 80:20 (heads - tails). Although this does not rule any possible bias out, we can say with high confidence that the coin's weighting is somewhere around 0.8 heads. As we continue to flip the weighted coin more and more, the observed bias becomes more and more likely! Using this estimate of our coin's bias, we can *infer* that the coin will show heads approximately 80% of the time!
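
Because every flip multiplies our belief by the same kind of likelihood term, the flip-by-flip update can also be done in one batch from the head and tail counts. The sketch below uses only the standard library, assumes a uniform prior over 100 grid points, and uses a hypothetical result of 80 heads and 20 tails; `grid_posterior` is an illustrative name, not a function used elsewhere in this notebook.

```python
def grid_posterior(num_heads, num_tails, num_points=100):
    """Posterior over a grid of coin biases, starting from a uniform prior."""
    biases = [i / (num_points - 1) for i in range(num_points)]
    prior = 1.0 / num_points
    # Unnormalized posterior: P(data | bias) * P(bias)
    unnormalized = [
        (b ** num_heads) * ((1 - b) ** num_tails) * prior for b in biases
    ]
    evidence = sum(unnormalized)  # P(data), summed over every candidate bias
    return biases, [u / evidence for u in unnormalized]


# Hypothetical outcome: 80 heads and 20 tails out of 100 flips
biases, posterior = grid_posterior(80, 20)

best_index = max(range(len(posterior)), key=posterior.__getitem__)
posterior_mean = sum(b * p for b, p in zip(biases, posterior))

print(f"Most probable bias: {biases[best_index]:.3f}")
print(f"Posterior mean:     {posterior_mean:.3f}")
```

The most probable bias and the posterior mean give two single-number summaries of the curve plotted above. For much longer runs of flips, the products of small probabilities can underflow, so working with logarithms would be the safer choice.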

## Section 3: The fun part! Spam Classification

Now that we have some sort of grasp on Bayes Theorem, how can we apply this to a use case? What if we want to train a model to predict whether a message is spam or not? Other possible tasks include fraud detection, weather prediction, user behavior prediction, and a lot more. We use a model called the Naive Bayes Classifier, which is based on Bayes Theorem. The "Naive" comes from the assumption that all features (the items the classifier considers when making a prediction) are independent of each other. Often, in the real world, this is not the case: for example, many words tend to occur together, such as "Happy" and "Birthday". Despite this, the Naive Bayes classifier is simple, quick, and a fairly robust baseline model to compare other machine learning models against. 

You may have noticed that this Kaggle notebook is sitting on top of a dataset: spam! A bit of context for this dataset: it is a collection of SMS text messages that are labeled spam or not spam (ham). 

Let's go ahead and load our dataset to get a feel for what our data looks like.

In [None]:
# Loading dataset and Exploratory Data Analysis

spam_data = pd.read_csv("../input/spam-text-message-classification/SPAM text message 20170820 - Data.csv")

display(spam_data.head())
display(spam_data.describe())

print("How many of each label do we have?")
print(spam_data['Category'].value_counts())

We have two columns: *Category*, which is our column of target labels, and *Message*, which holds the SMS text messages that are either real messages or spam messages.

Notice that we have many more "ham" texts than we do "spam" texts. There is a ratio of 86.6% ham: 13.4% spam. This is important to keep in mind: if we created a classifier that *only* guesses "ham", it would be correct about 86.6% of the time, so a high accuracy on its own does not mean much. One way to guard against this is to evaluate on a holdout set that is balanced 50/50 ham/spam; in this notebook, we instead compare against a baseline classifier and look at metrics beyond accuracy.

Now let's clean up our data to get it ready for ingestion into our model.

In [None]:
# We need to encode our categories into numerical data so that our model is able to use it

spam_data['labels'] = LabelEncoder().fit_transform(spam_data['Category'])

print(spam_data.head())

Next, we need to split our data into a training set and a test set in order to evaluate the performance of our model. We leave `stratify` set to `None` (the default), which gives a plain random split: the ratio of spam to ham in the test set will be close to, but not exactly, the ratio in the full dataset.

In [None]:
# Split data into training and test sets

X = spam_data['Message']
y = spam_data['labels']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify = None)

Next, we initialize our vectorizer. In order for us to feed our data into our machine learning model, we need to represent our text data in numerical form. One of the ways we can do this is to represent our text data as a count of the number of times each word appears in the message text. By fixing the set of all words seen in our messages (the vocabulary), we can encode each message or text passage as a vector of the counts of each word in the message.

For example, if we had a message 

"dog, bird, dog, cat, dog"

This can be represented by a vector 

\begin{vmatrix}
dog & cat & bird \\
3 & 1 & 1
\end{vmatrix}
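
The counting idea can be sketched with the standard library alone. This is a simplified illustration, not how `CountVectorizer` is implemented: the `bag_of_words` helper and its lowercase, letters-only tokenizer are assumptions made for this example.

```python
import re
from collections import Counter


def bag_of_words(message, vocabulary):
    """Count how often each vocabulary word appears in the message."""
    tokens = re.findall(r"[a-z]+", message.lower())  # crude tokenizer (assumption)
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vocabulary = ["dog", "cat", "bird"]
vector = bag_of_words("dog, bird, dog, cat, dog", vocabulary)
print(dict(zip(vocabulary, vector)))
```

Every message is mapped onto the same fixed list of vocabulary words, so all messages end up as vectors of the same length, which is what the model needs.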

In [None]:
# Initialize our vectorizer

vectorizer = CountVectorizer()

Next, we need to choose the appropriate type of Naive Bayes. What?? There is more than one!?!?  

Three common variants are:
* Gaussian Naive Bayes
* Bernoulli Naive Bayes
* Multinomial Naive Bayes

These names are not very descriptive. At a high level, the Gaussian Naive Bayes assumes continuous, normally distributed features (think heights of people, for example), whereas the Bernoulli and Multinomial assume discrete features. The Bernoulli Naive Bayes only uses the appearance of a feature (whether a word is used or not), whereas the Multinomial Naive Bayes takes into account the number of times a feature occurs (the count of 3 for "dog" in the above vector example).

For our use case, the Multinomial Naive Bayes is the best choice, as words may appear more than once within a passage, and our vectorized text consists of discrete word counts.  
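
To give a feel for what a Multinomial Naive Bayes does with word counts, here is a small from-scratch sketch. It is not scikit-learn's implementation: the function names, the whitespace tokenizer, the Laplace smoothing constant `alpha=1.0`, and the four training messages are all assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(messages, labels, alpha=1.0):
    """Collect per-class word counts; alpha is the Laplace smoothing constant."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for message, label in zip(messages, labels):
        word_counts[label].update(message.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocabulary, alpha


def predict_multinomial_nb(model, message):
    """Return the class with the highest log posterior (up to a constant)."""
    class_counts, word_counts, vocabulary, alpha = model
    total_messages = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_messages)  # log P(class)
        total_words = sum(word_counts[label].values())
        for word in message.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # log P(word | class), smoothed so unseen pairs are never zero
            score += math.log(
                (word_counts[label][word] + alpha)
                / (total_words + alpha * len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Tiny made-up training set, for illustration only
messages = ["win cash prize now", "free prize claim now",
            "see you at lunch", "are you free for lunch"]
labels = ["spam", "spam", "ham", "ham"]

model = train_multinomial_nb(messages, labels)
print(predict_multinomial_nb(model, "claim your free cash prize"))
print(predict_multinomial_nb(model, "lunch at noon"))
```

The "naive" independence assumption shows up in the inner loop: each word contributes its own log probability, and the contributions are simply added together. Working in log space avoids multiplying many small probabilities.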

Note: I also use a "Dummy Classifier" in order for us to have a model that guesses purely based on a strategy. I chose "Stratified", as this strategy takes the relative proportion of the labels (ham/spam) and guesses randomly based on that distribution (86% of the time it guesses ham, 14% of the time it guesses spam).
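
For a rough sense of what such baselines should score, their expected accuracy can be worked out from the class ratio alone. The sketch assumes the 86.6% / 13.4% ham-to-spam ratio quoted earlier and treats the test set as having the same ratio; the variable names are illustrative.

```python
p_ham = 0.866   # share of ham messages (from the label counts above)
p_spam = 1 - p_ham

# Always guessing "ham" is right whenever the message is ham
always_ham_accuracy = p_ham

# A "stratified" guesser picks ham with probability p_ham and spam otherwise,
# so it is right only when its guess and the true label happen to agree
stratified_accuracy = p_ham * p_ham + p_spam * p_spam

print(f"Always-ham baseline: {always_ham_accuracy:.1%}")
print(f"Stratified baseline: {stratified_accuracy:.1%}")
```

Any real classifier should comfortably beat both numbers before we read much into its accuracy.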

In [None]:
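# Let's initialize our Naive Bayes and our Dummy Classifier

# Fit the vectorizer once on the training text and reuse the resulting matrix
X_train_vector = vectorizer.fit_transform(X_train)

NB = MultinomialNB()
NB.fit(X_train_vector, y_train)

Dummy = DummyClassifier(strategy = 'stratified')
Dummy.fit(X_train_vector, y_train);

# Transform (do not refit) the test text so it uses the same vocabulary
X_test_vector = vectorizer.transform(X_test)
y_pred = NB.predict(X_test_vector)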

I've defined two functions below: a helper function that feeds our model the text and makes predictions, and a function that aggregates predictions and tallies the number correct over the total.

In [None]:
# Now lets define a function to evaluate our classifiers

def analyze_text(clf, vectorizer, test_text):
    """
    This helper function takes in a classifier, vectorizer and text and returns a prediction.
    Input: clf: Classifier Model, vectorizer: Vectorizer, test_text: a text passage 
    Output: Original test_text passage, the model's prediction
    """
    prediction = clf.predict(vectorizer.transform([test_text]))
    
    return(test_text, prediction)

# Define a function to evaluate the accuracy of a model
def evaluate_model_accuracy(clf, test_text, test_result):
    """
    This function runs the "analyze_text" function over each row in a test_text column and compares the output to
        the test answers ("test_result" column)
    Input: clf:classifier, test_text: column of test texts, test_result: column of correct class answers
    Output: None; prints the model info and its accuracy (rounded to two decimal places)
    Note: relies on the global "vectorizer" fitted earlier in the notebook
    """
    total = len(test_text)
    num_correct = 0
    
    for index, item in enumerate(test_text):
        
        text, result = analyze_text(clf, vectorizer, item)
        
        # "result" is a one-element array, so compare its single entry to the true label
        if result[0] == test_result.iloc[index]:
            num_correct += 1
    
    return_string = 'Model Info: ' + str(clf) + '\nModel Accuracy: ' + str(round(num_correct * 100 / total, 2)) + '%\n'
    
    print(return_string)

Now let's run our functions and take a look at our model accuracy!

In [None]:
evaluate_model_accuracy(NB, X_test, y_test)
evaluate_model_accuracy(Dummy, X_test, y_test)

We can see that the accuracy of our model is much better than that of the dummy classifier. However, accuracy is not the whole picture. Let's take a look at our model more holistically.

In [None]:
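# Using Confusion Matrix
# Note: plot_confusion_matrix was removed in scikit-learn 1.2; on newer versions,
# use ConfusionMatrixDisplay.from_estimator(NB, X_test_vector, y_test) instead.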

print("            Confusion Matrix")
plot_confusion_matrix(NB, X_test_vector, y_test);

From the confusion matrix, we can read our classifier metrics:

**Accuracy:** *(The fraction of all predictions that are correct)* = $\frac{TP + TN}{TP + TN + FP + FN}$ = $\frac{488 + 62}{558}$ = 98.6%   
**Precision:** *(The fraction of positive predictions that are correct)* = $\frac{TP}{TP + FP}$ = $\frac{488}{488 + 6}$ = 98.8%   
**Recall:** *(True Positive Rate)*  = $\frac{TP}{TP + FN}$ =  $\frac{488}{488 + 2}$ = 99.6%  
**Specificity:** *(True Negative Rate)* = $\frac{TN}{TN + FP}$ = $\frac{62}{62 + 6}$ = 91.2%  

Note: In our case, "positive" predictions are '0' for our label, or Not Spam. Random seeds are not set, so your instance of the notebook may have slightly different predictions.
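
The same metrics can be computed directly from the four confusion-matrix counts. This is a small sketch: `classification_metrics` is an illustrative helper, and the counts (TP = 488, TN = 62, FP = 6, FN = 2) are the ones read off the formulas above for one particular run, so your numbers may differ.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute common classifier metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Counts from one run of the notebook; "positive" here means label 0 (Not Spam)
metrics = classification_metrics(tp=488, tn=62, fp=6, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Keeping the counts in one place makes it easy to recompute all four metrics whenever a new train/test split produces a slightly different confusion matrix.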

### Overall, our model performed very well!

## Closing Thoughts and Extra Resources

In conclusion, Bayes Theorem is a fundamental piece of statistical inference and has many real-world applications, both on its own and through machine learning algorithms built on it. The Naive Bayes classifier is a fairly simple and robust machine learning classifier that can be used as a baseline model to compare other models against.

I encourage everyone to [watch 3Blue1Brown's videos](https://www.youtube.com/c/3blue1brown/videos), as Grant covers a wide variety of fundamental Data Science concepts, as well as useful mathematics.

If you would like to learn more about the math behind the Naive Bayes Classifier, I encourage you to check out
[Chapter 13 of "Introduction to Information Retrieval"](https://nlp.stanford.edu/IR-book/pdf/13bayes.pdf), which covers text classification and the Naive Bayes.

I have also written a short paper describing the Naive Bayes Classifier and its application to a Twitter dataset at [my website.](https://www.henrysue.dev/projects/Naive_Bayes_from_scratch.pdf)

Also, check out [this StatQuest video](https://www.youtube.com/watch?v=O2L2Uv9pdDA) about the Naive Bayes Classifier. 