# Tutorial: Classification Using Naive Bayes

## Introduction

**Classification** is the problem of identifying which category a new observation belongs to, based on a training set of observations whose category membership we already know. As one of the most widely used areas of machine learning, classification plays a large role behind the scenes in our everyday lives, with applications including but not limited to ad targeting, spam detection, medical diagnosis, and image classification.

Any algorithm that implements classification is known as a **classifier**. Essentially, a classification model draws some conclusion from a prior history of values, and based on a new input, it will predict the value of the outcome. 

In this tutorial, we will introduce some of the concepts behind Naive Bayes without going over too much math or probability theory, and then look at a simple example. Then, we will classify some emails as *'spam' or 'not spam'*.

## What is Naive Bayes?

A **Naive Bayes classifier** is a simple probabilistic classifier. This means that it can predict a probability distribution over a set of possible classes, rather than only returning the single most likely class for a new observation. As the name suggests, this classifier applies **Bayes' Theorem**, which states the following:

$$ P(A\,|\,B) = \frac{P(B\,|\,A) * P(A)}{P(B)} $$

This conditional probability model can essentially be read as:

$$ \text{posterior} = \frac{\text{prior} * \text{likelihood}}{\text{evidence}} $$

As mentioned earlier, we are essentially combining prior knowledge with some observed data to make a new prediction.
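
To make the formula concrete, here is a minimal sketch in Python. The numbers are made up for illustration (they do not come from any dataset in this tutorial), and `posterior` is a hypothetical helper name.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Hypothetical numbers: 20% of emails are spam, and the word "free"
# appears in 50% of spam emails and in 5% of non-spam emails.
p_spam = 0.2
p_free_given_spam = 0.5
p_free_given_ham = 0.05

# Evidence P("free"), computed with the law of total probability
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

p_spam_given_free = posterior(p_spam, p_free_given_spam, p_free)
print(round(p_spam_given_free, 3))  # 0.714
```

Seeing the word raises the probability of spam from the prior of 0.2 to a posterior of about 0.71, which is the "prior knowledge plus observed data" idea in numbers.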

## Part I: Naive Bayes - A Simple Example

Now that you have some familiarity with the background behind Naive Bayes, let's dive into a simple example. All we have to do is use the simple formula above for each outcome. We look at the evidence, consider how likely it is to belong to each class, and assign a label. The class with the highest probability is the one we assign to that specific combination of features. For example, let us consider the following data:

| | Round | Not Round | Sweet | Not Sweet | Red | Not Red | $\textbf{Total}$ |
|------|------|------|------|------|------|------|------|
| $\textbf{Apple}$ | 4000 | 1000 | 3500 | 1500 | 4500 | 500 | 5000 |
| $\textbf{Banana}$ | 0 | 3000 | 1500 | 1500 | 3000 | 0 | 3000 |
| $\textbf{Other}$ | 1000 | 1000 | 1500 | 500 | 500 | 1500 | 2000 |
| $\textbf{Total}$ | 5000 | 5000 | 6500 | 3500 | 8000 | 2000 | 10000 |

This is our **training set**, which contains 10,000 pieces of fruit. We have 3 **features** or **predictors**: whether or not the fruit is round, whether or not the fruit is sweet, and whether or not the fruit is red. We have 3 possible **classes/labels**: *Apple, Banana, or Other.*

Let's say that we receive a new piece of fruit that is round, sweet, and red. We want to be able to classify what type of fruit the new piece is. If we follow the formulas mentioned in the previous section, we have the following:

   $$ P(Apple \,|\, Round, Sweet, Red) = \frac{P(Round\,|\,Apple) * P(Sweet\,|\,Apple) * P(Red\,|\,Apple) * P(Apple)}{P(Round) * P(Sweet) * P(Red)}$$
   
 The formula would be the same for Banana and Other, just replace Apple with the proper fruit. If you notice, the denominator, or the "evidence", doesn't depend on the class and is constant. It only scales the result to be a probability between 0 and 1, so we can drop it, as long as we drop it for every class, because we only need to compare the classes against each other.
 
 Now, all you have to do is count and do the math. Getting rid of the denominator, for the 3 classes, we would end up with the following:
 
   $$ P(Apple \,|\, Round, Sweet, Red) \propto 0.8 * 0.7 * 0.9 * 0.5 = 0.252 $$
   $$ P(Banana \,|\, Round, Sweet, Red) \propto 0 $$
   $$ P(Other \,|\, Round, Sweet, Red) \propto 0.5 * 0.75 * 0.25 * 0.2 =  0.01875 $$
  
Since 0.252 is greater than 0.01875 and 0, we would classify a new fruit that is Round, Sweet, and Red as an Apple! 
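
The counting above can be reproduced with a short sketch. The counts are copied from the training table; the names `counts` and `score` are only illustrative, and the evidence term is dropped just as in the calculation above.

```python
# Counts copied from the training table
counts = {
    "Apple":  {"Round": 4000, "Sweet": 3500, "Red": 4500, "Total": 5000},
    "Banana": {"Round": 0,    "Sweet": 1500, "Red": 3000, "Total": 3000},
    "Other":  {"Round": 1000, "Sweet": 1500, "Red": 500,  "Total": 2000},
}
total_fruit = sum(row["Total"] for row in counts.values())

def score(fruit, features):
    """Unnormalized posterior: P(fruit) times the product of P(feature | fruit)."""
    row = counts[fruit]
    result = row["Total"] / total_fruit        # prior P(fruit)
    for feature in features:
        result *= row[feature] / row["Total"]  # likelihood P(feature | fruit)
    return result

observed = ["Round", "Sweet", "Red"]
scores = {fruit: score(fruit, observed) for fruit in counts}
prediction = max(scores, key=scores.get)
print(scores)
print(prediction)  # Apple
```

Note that Banana scores exactly 0 because no banana in the training set is round, so a single zero count wipes out the whole product.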

## Part II: Naive Bayes - Classifying Emails and Filtering Spam

Now that we've gone over the basics of Naive Bayes, let's start actually coding. We'll be using the [Enron Email dataset](http://www2.aueb.gr/users/ion/data/enron-spam/), already divided into spam or not spam [here](http://www2.aueb.gr/users/ion/data/enron-spam/). Download the dataset and put the enron directories on the same level as this notebook file. We'll use Enron1 in its pre-processed form for training, and then we will use Enron5 and Enron6 for testing to see how accurate our classifier is. Essentially, words are our features and our labels are 'spam' or 'not spam'.

Before doing anything, you should manually inspect the data. You'll notice that the emails look quite messy, and that we should definitely process our text. From previous lectures, you should know that we have many options here, but we'll just do a few simple things for the sake of brevity in this tutorial. Also note that the term for **'not spam'** is **ham**, and spam is just spam.

In [111]:
import numpy as np
import math
import os
import string
import re

In [112]:
# Text Processing
def removePunctuation(st):
    mapping = str.maketrans("", "", string.punctuation)
    removedPunc = st.translate(mapping)
    return removedPunc

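def tokenize(st):
    # Splitting on runs of non-word characters already discards punctuation,
    # so removePunctuation does not need to be applied first. Empty strings
    # can appear at the ends of the list (see the 'ham^' key in the output).
    regex = r"\W+"
    splitWords = re.split(regex, st)
    return splitWords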

# Returns a dictionary with the counts for every word
def createDictionary(hashTable, words, label):
    for word in words:
        totalLabelTokens = label + "^" + "anyWord"
        if totalLabelTokens in hashTable:
            hashTable[totalLabelTokens] += 1
        else:
            hashTable[totalLabelTokens] = 1
        labelPlusWord = label + "^" + word
        if labelPlusWord in hashTable:
            hashTable[labelPlusWord] += 1
        else:
            hashTable[labelPlusWord] = 1
    return hashTable

def trainDataset(dataset):
    hashTable = {"anyLabel": 0, "spam": 0, "ham": 0}
    for data in dataset:
        for root, dirs, files in os.walk(data):
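            # If we're in a spam or ham folder (a directory with no subdirectories)
            if len(dirs) == 0:
                # The folder name is the label; basename works with any path separator
                subFolder = os.path.basename(root)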
                for file in files:
                    # Latin-1 encoding to read special characters
                    with open(os.path.join(root, file), encoding="latin-1") as email:
                        hashTable["anyLabel"] += 1
                        hashTable[subFolder] += 1
                        unProcessedEmail = email.read()
                        words = tokenize(unProcessedEmail)
                        hashTable = createDictionary(hashTable, words, subFolder) 
    return hashTable

trainingData = ['enron1']
hashTable = trainDataset(trainingData)

count = 0
for key, value in hashTable.items():
    if count == 10:
        break
    print([key, value])
    count += 1

['anyLabel', 5172]
['spam', 1500]
['ham', 3672]
['ham^anyWord', 575678]
['ham^Subject', 3672]
['ham^christmas', 19]
['ham^tree', 4]
['ham^farm', 6]
['ham^pictures', 42]
['ham^', 983]


As you can see from some of the key-value pairs, we've read the data and created a dictionary that maps every label and word pair (stored as `label^word`) to the number of times that word occurred in emails with that label. All our code is doing here is accumulating the counts of every word in the training data so that we can do the probability calculations in the prediction step.
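
The same bookkeeping can be sketched on a toy corpus. This is an alternative representation, not the code above: it assumes `collections.Counter` keyed by `(label, word)` tuples instead of `label^word` strings, and the two emails are made up.

```python
from collections import Counter

# Two made-up, already tokenized emails with their labels
emails = [
    ("ham", ["meeting", "at", "noon"]),
    ("spam", ["free", "money", "free"]),
]

word_counts = Counter()   # (label, word) -> number of occurrences
token_totals = Counter()  # label -> total number of tokens
email_totals = Counter()  # label -> number of emails

for label, words in emails:
    email_totals[label] += 1
    token_totals[label] += len(words)
    word_counts.update((label, word) for word in words)

print(word_counts[("spam", "free")])  # 2
print(token_totals["spam"])           # 3
print(word_counts[("ham", "free")])   # 0, a Counter returns 0 for missing keys
```

These three tables hold the same information as the `label^word`, `label^anyWord`, and `label` keys in `hashTable`.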

### Smoothing

Before we get into testing, there's something else important to mention that we touched upon in class.

If you notice in our equations above, we take the product of the probabilities. When we multiply many small probabilities, we may run into numerical underflow. To prevent this, we'll add **log probabilities** instead. I won't go into this too much since we went over it in class, but since the log function is monotonically increasing, we can still simply pick the greatest value.
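
A quick sketch of the underflow problem, using 2000 hypothetical word probabilities that are all equal to 0.001 (the values are made up purely to trigger the effect):

```python
import math

# 2000 hypothetical word probabilities, each equal to 0.001
probs = [0.001] * 2000

# Multiplying directly underflows: the true value is far too small for a float
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# Adding logs keeps the value in a comfortable range
log_sum = sum(math.log(p) for p in probs)
print(log_sum)  # roughly -13815.5
```

Once the product has collapsed to 0.0 for every class, the classes can no longer be compared, while the log sums still can.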

**But what happens when one of the counts is 0?**

log(0) is undefined, which would cause an error in our calculations. A common way to combat this is [Laplace smoothing](https://en.wikipedia.org/wiki/Additive_smoothing), where we add 1 to every word count and balance this by adding the vocabulary size (the number of unique words) to the total count in the denominator. We need this to make our probabilities fail-safe. Without it, a single word in a test email that never appeared with a given label in the training data would lead our model to conclude that the email is impossible for that label, which is obviously not the case.
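
As a minimal sketch, add-one smoothing of a single word probability adds 1 to the word's count and the number of unique words to the denominator, which is what the testing code in this tutorial does. The function name `smoothed_log_prob` and the counts below are hypothetical.

```python
import math

def smoothed_log_prob(word_count, total_words_in_class, vocab_size):
    """log P(word | class) with add-one (Laplace) smoothing."""
    return math.log((word_count + 1) / (total_words_in_class + vocab_size))

# Hypothetical class with 1000 tokens and a vocabulary of 500 unique words
seen = smoothed_log_prob(19, 1000, 500)   # word seen 19 times in this class
unseen = smoothed_log_prob(0, 1000, 500)  # word never seen in this class
print(seen, unseen)
```

The unseen word gets a small but nonzero probability of 1/1500, so its log is finite and the email can still be scored.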

With that out of the way, let's get to testing.

In [113]:
def findNumUniqueWords(hashTable):
    uniqueWords = set()
    for key in hashTable:
        if "^" in key and "anyWord" not in key:
            uniqueWords.add(key.split("^")[1])
    return len(uniqueWords)

# Function to collect the testing data. Walks every
# test folder and returns a list of file paths for
# ham emails and a list for spam emails
def getSubsetData(testData):
    hamEmails = []
    spamEmails = []
    for data in testData:
        for root, dirs, files in os.walk(data):
            if (len(dirs) == 0):
                subFolder = os.path.basename(root)
                if subFolder == 'ham':
                    for file in files:
                        hamEmails.append(os.path.join(root, file))
                else:
                    for file in files:
                        spamEmails.append(os.path.join(root, file))
    return hamEmails, spamEmails

def naiveBayesEstimator(hashTable, testData):
    numCorrect = 0
    numEmails = 0
    uniqueWords = findNumUniqueWords(hashTable)
    logProbDict = {}
    hamEmails, spamEmails = getSubsetData(testData)
    emails = hamEmails + spamEmails
    for email in emails:
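        # The name of the folder containing the email ('ham' or 'spam') is its true label
        trueLabel = os.path.basename(os.path.dirname(email))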
        with open(email, encoding="latin-1") as e:
            numEmails += 1
            unProcessedEmail = e.read()
            words = tokenize(unProcessedEmail)
            # Score each class. Only 'spam' and 'ham' are labels; every
            # other key in hashTable is a word count, so we skip those.
            for label in ("spam", "ham"):
                # Smoothed log prior: add 1 to the class count and
                # 2 (the number of classes) to the number of emails
                numerator = float(hashTable[label] + 1)
                denominator = hashTable['anyLabel'] + 2
                rhs = math.log(numerator/denominator)
                # The sum of the smoothed log probabilities of each word
                sigmaValue = 0
                lhsDenominator = hashTable[label + "^" + "anyWord"] + uniqueWords
                for word in words:
                    # Words never seen with this label have a count of 0
                    count = hashTable.get(label + "^" + word, 0)
                    # Smoothing
                    lhsNumerator = float(count + 1)
                    sigmaValue += math.log(lhsNumerator/lhsDenominator)
                # Add the terms to get the final log probability for that label
                logProbDict[label] = sigmaValue + rhs
            # Get the label with the highest probability
            maxProbLabel = max(logProbDict, key = logProbDict.get)
            # Checking to see if our prediction is correct
            if maxProbLabel == trueLabel:
                numCorrect += 1
    return numCorrect, numEmails
    
testData = ['enron5', 'enron6']
numCorrect, numEmails = naiveBayesEstimator(hashTable, testData)
print("Percent correct: " + str(numCorrect) + "/" + str(numEmails) + "=" + str(float(numCorrect)/numEmails))

Percent correct: 10812/11175=0.9675167785234899


## Important Things to Note

The Naive Bayes model also has an important subtlety that one should take note of:

### The features might not be independent!

The whole backbone of Naive Bayes classifiers is the assumption that the predictors are conditionally independent given the class, or that the presence of a feature in a class is unrelated to the presence of any other feature. In reality, this conditional independence assumption often does not hold, especially for text data. The words I write in a sentence or email definitely depend on other words in the text. This is one of the reasons why it is called *Naive* Bayes.

One might ask: how can Naive Bayes be a good classifier when it oversimplifies everything and one of its biggest assumptions is violated? You can read a formal study [here](http://www.aaai.org/Papers/FLAIRS/2004/Flairs04-097.pdf), but the basic gist is that Naive Bayes works well not only when features are independent, but also when the dependencies among features are distributed evenly across classes or cancel each other out. In a practical sense, despite the fact that the main assumption is most likely being violated, Naive Bayes still performs quite well.
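
One way to get a feel for this is a toy sketch (my own illustration with made-up numbers, not taken from the study linked above). It counts the same feature twice, which is the most extreme form of dependence, and compares the resulting Naive Bayes posterior with the one obtained when the feature is counted once. The helper `nb_posterior_a` is hypothetical.

```python
def nb_posterior_a(lik_a, lik_b, copies, prior_a=0.5):
    """Naive Bayes posterior of class A when one feature is counted `copies` times."""
    score_a = prior_a * lik_a ** copies
    score_b = (1 - prior_a) * lik_b ** copies
    return score_a / (score_a + score_b)

# Hypothetical likelihoods: P(feature | A) = 0.8 and P(feature | B) = 0.4
once = nb_posterior_a(0.8, 0.4, copies=1)
twice = nb_posterior_a(0.8, 0.4, copies=2)
print(round(once, 3), round(twice, 3))  # 0.667 0.8
```

In this toy case, double counting makes the probability estimate overconfident (0.8 instead of about 0.667), but class A wins either way, so the predicted label is unchanged.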

## Summary, Further Exploration, and Additional Resources

Hopefully this tutorial has shown you how big of a role classifiers play in our lives, whether we consciously realize this or not. You can blame a classifier the next time you get spam in your inbox!

To summarize, a Naive Bayes classifier assumes that the features/predictors are independent given the class, and that they follow some distribution. We went over many different types of distributions in class, and there are specific variants of the Naive Bayes classifier, such as the Multinomial Naive Bayes classifier or the Bernoulli Naive Bayes classifier, where the features follow different types of distributions. We can use a Naive Bayes classifier for a broad array of problems, and despite the strong independence assumption, it still performs quite well, even against other classifiers!

**Some parting thoughts:** 
1. How would our code in Part II change depending on the distribution of the features?
2. Note that in our spam classification task, we also didn't really process the text much before training. We've learned some text processing methods in class. How might we improve the accuracy of our classifier?

If you're interested in learning more about classification or the application of Naive Bayes, here are some resources I would recommend:

1. [Udacity - Classification Models](https://www.udacity.com/course/classification-models--ud978): Free video lectures and assignments teaching students how to use classification models to create business insights.
2. [Coursera - Classification in Machine Learning](https://www.coursera.org/learn/ml-classification): Course on classification and its broad array of applications.
3. [Data Mining - Naive Bayes](https://gerardnico.com/data_mining/naive_bayes): Good overview of Naive Bayes
4. [Comparing Classification Techniques](https://pdfs.semanticscholar.org/51c0/68c263ee197a292df5b74b58c8c55df9f9ca.pdf): Comparative study of various classification algorithms/techniques, specifically kNN, Naive Bayes, and Decision Trees
5. [scikit-learn: Naive Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html): Documentation for Naive Bayes in scikit-learn, a free machine learning library for Python