# Naive Bayes in Classifying Text

**Introduction** One of the most powerful tools for classification in machine learning is also one of the simplest: the Naive Bayes classifier. The classifier is built from a few simple rules of probability, plus one strong assumption that allows classification to be done relatively quickly. Common applications include marking email as spam or not spam, and reading text for positive or negative sentiment. In this tutorial, we will walk through the theory behind the algorithm, see how a simplifying assumption makes the classification problem easier, and write the code for a classifier that determines which person most likely wrote a tweet.

This tutorial introduces the math behind Naive Bayes, but if you would like to read about it even further, Wikipedia goes through the derivation of the algorithm in depth __[here](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)__. 

**The Math** The Bayes classifier originates from Bayes' Theorem, which is widely used in probability when dealing with conditional probabilities. We say the *conditional probability* P(X=x|Y=y) is the probability that a random variable X takes on the value x, *given* that we know the random variable Y takes on the value y.

Bayes' Theorem says that we can break down conditional probabilities in the following way:

$P(X=x | Y=y) = \frac{P(X=x) * P(Y=y | X=x)}{P(Y=y)}$
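To make the theorem concrete, here is a small numeric sketch. The spam scenario and all of the numbers are made up for illustration; only the formula itself comes from the text above.

```python
def bayes_posterior(prior_x, likelihood_y_given_x, evidence_y):
    """Return P(X=x | Y=y) from P(X=x), P(Y=y | X=x), and P(Y=y)."""
    return prior_x * likelihood_y_given_x / evidence_y

# Hypothetical numbers, chosen only for illustration
p_spam = 0.2               # P(X = spam)
p_free_given_spam = 0.5    # P(Y = "free" appears | X = spam)
p_free_given_ham = 0.05    # P(Y = "free" appears | X = not spam)

# P(Y = "free" appears), by the law of total probability
p_free = p_spam * p_free_given_spam + (1 - p_spam) * p_free_given_ham

p_spam_given_free = bayes_posterior(p_spam, p_free_given_spam, p_free)
print(round(p_spam_given_free, 4))
```

Even though only 20% of the hypothetical emails are spam, seeing the word "free" raises the probability of spam to about 71%, because the word is ten times more likely in spam than in regular email.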

In text classification, we are interested in the question: given a piece of text with the attributes $(x_1, x_2, ..., x_n)$, what is the probability that this text came from author A? If we can answer this question for every possible author, then we predict that the text was written by the author with the highest probability. Bayes' rule allows us to calculate these probabilities. Formally, we wish to compute

$P(Author = A | attributes (x_1, x_2, ..., x_n))$

Then, by Bayes' Rule,

$ = \frac{P(Author = A) * P(attributes (x_1, x_2, ..., x_n) | Author = A)}{P(attributes (x_1, x_2, ..., x_n))}$

**More Math** We can further simplify this expression by ignoring the denominator. $P(attributes (x_1, x_2, ..., x_n))$ does not depend on the author, so it is the same constant for every author we compare, and dropping it does not change which author gets the highest score.
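A quick sketch of why the denominator can be dropped. The three scores below are hypothetical, and we assume these three authors are the only candidates, so that the denominator is simply the sum of the numerators.

```python
# Hypothetical values of P(Author) * P(attributes | Author) for three authors
numerators = {"author_a": 0.012, "author_b": 0.003, "author_c": 0.009}

# P(attributes) is shared by every author; with these three authors as the
# only candidates (an assumption), it equals the sum of the numerators
evidence = sum(numerators.values())
posteriors = {author: score / evidence for author, score in numerators.items()}

# Dividing every score by the same positive constant cannot change the winner
best_by_numerator = max(numerators, key=numerators.get)
best_by_posterior = max(posteriors, key=posteriors.get)
print(best_by_numerator, best_by_posterior)
```

Both lookups pick the same author, which is why we only need to compare numerators.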

Now we have $P(Author = A) * P(attributes (x_1, x_2, ..., x_n) | Author = A)$. For those of you who are well-versed in probability, you may recognize this expression as the joint probability, 

$P(Author = A, x_1, x_2, ..., x_n)$

which can be thought of as the probability that the real author is author A *and* the text has attribute $x_1$ *and* attribute $x_2$ and so on. 

By some more math, we know that we can "unroll" joint probabilities into conditional probabilities using the chain rule as follows:

$P(Author = A, x_1, x_2, ..., x_n) = P(x_1 | x_2, ..., x_n, A) * P(x_2, ..., x_n, A)$

Unrolling completely with the chain rule, we get

$P(x_1 | x_2, ..., x_n, A) * P(x_2, ..., x_n, A) = P(x_1 | x_2, ..., x_n, A) * P(x_2 | x_3, ..., x_n, A) * P(x_3, ..., x_n, A)$
$\vdots$
$= P(x_1 | x_2, ..., x_n, A) * P(x_2 | x_3, ..., x_n, A) * ... * P(x_{n-1} | x_n, A) * P(x_n | A) * P(A) $

**Strong Assumptions** Now, after doing all of this math, if we actually tried to implement a classifier using this formula, we would run into trouble: a term like $P(x_1 | x_2, ..., x_n, A)$ has to be estimated separately for every combination of the other attributes, and the number of combinations grows exponentially with the number of attributes! So now we introduce the secret ingredient of the Naive Bayes algorithm. We assume that, given the author, each attribute of a text is independent of all the other attributes. Take the sentence "I want to eat food." Its attributes are "I", "want", "to", "eat", and "food". Even though verbs like "want" usually follow pronouns like "I", we make the assumption that these two words are entirely independent of each other. This might seem like it will hurt our classification, since words are not really independent, but in practice this assumption does not doom the Naive Bayes algorithm. This assumption is what makes Naive Bayes so beautiful: it brings the computation down to linear in the number of attributes, rather than exponential!
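To get a feel for the difference, here is a rough count of how many probabilities we would have to estimate per author with and without the independence assumption. As a simplification that is not part of the tutorial's own model, the sketch treats each attribute as binary (a word is either present or absent).

```python
def full_joint_parameters(n_attributes):
    """Free parameters per author for P(x_1, ..., x_n | A) with binary
    attributes and no independence assumption: one per combination, minus
    one because the probabilities must sum to 1."""
    return 2 ** n_attributes - 1

def naive_parameters(n_attributes):
    """Free parameters per author under the naive assumption: one
    P(x_i | A) for each binary attribute."""
    return n_attributes

for n in (5, 10, 20):
    print(n, full_joint_parameters(n), naive_parameters(n))
```

With only 20 binary attributes, the unrestricted model already needs over a million probabilities per author, while the naive model needs 20.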

**The Final Math** Now, armed with our very strong assumption that attributes are independent, we can rewrite our long expression in the following way:

$P(x_1 | x_2, ..., x_n, A) * P(x_2 | x_3, ..., x_n, A) * ... * P(x_{n-1} | x_n, A) * P(x_n | A) * P(A) $

$= P(x_1 | A) * P(x_2 | A) * ... * P(x_n | A) * P(A)$

$ = P(A) \prod_{i=1}^{n} P(x_i | A) $



Beautiful! Now how do we actually use this nice formula? Say we have 20 pieces of text, and 9 of them are written by author A. We can estimate that P(A) = $\frac{9}{20}$. We can also use our training data to estimate each $P(x_i | A)$ by counting the number of times author A uses the word $x_i$ in their writing and dividing by the total number of words author A wrote. Then we plug everything into this formula and voila, we have a score proportional to the probability that author A wrote a new piece of text! Computing this score for every author and picking the largest gives us our prediction. The rest of this tutorial will be spent walking through a text classification example to illustrate how to use this powerful tool.
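Before moving to real data, here is the whole recipe on a tiny made-up dataset. The three training texts and the helper names (`prior`, `word_given_author`, `score`) are hypothetical, and exact fractions are used so the arithmetic is easy to check by hand.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy training data: (author, text) pairs
training = [
    ("A", "i want food"),
    ("A", "i want to eat"),
    ("B", "we eat food"),
]

n_texts = len(training)
texts_by_author = Counter()
words_by_author = {}
for author, text in training:
    texts_by_author[author] += 1
    words_by_author.setdefault(author, []).extend(text.split())

def prior(author):
    # P(A): fraction of training texts written by this author
    return Fraction(texts_by_author[author], n_texts)

def word_given_author(word, author):
    # P(x_i | A): how often the author uses the word, over all their words
    words = words_by_author[author]
    return Fraction(words.count(word), len(words))

def score(text, author):
    # P(A) times the product of P(x_i | A) over the words in the text
    result = prior(author)
    for word in text.split():
        result *= word_given_author(word, author)
    return result

print(score("eat food", "A"), score("eat food", "B"))
```

For the new text "eat food", author A scores 2/3 * 1/7 * 1/7 = 2/147 and author B scores 1/3 * 1/3 * 1/3 = 1/27, so this sketch would guess author B.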

## Example: Classifying Tweets

We will be classifying the tweets of two Twitter users, Tomi Lahren and Colin Kaepernick. If you are unfamiliar with their feud or their beliefs, feel free to read more about them __[here](https://www.huffingtonpost.com/entry/tomi-lahren-colin-kaepernick-d-day-photo_us_5a17ca2fe4b0d4906cadfa6a)__. First we need our training data (scraped through the Twitter API). We can start with 100 tweets. Kaepernick's tweets are in the file called Kaepernick_Train.csv, and Lahren's are in Lahren_Train.csv.

In [1]:
import pandas as pd
lahren_tweets = pd.read_csv("Lahren_Train.csv", names=["Lahren"])
kaep_tweets = pd.read_csv("Kaepernick_Train.csv", names=["Kaepernick"])
kaep_tweets.head()

| | Kaepernick |
|---|---|
| 0 | Let em know Queen! https://t.co/16DKlFbThQ |
| 1 | Turning up the heat!🔥🔥🔥https://t.co/VI4CxdHPX3 |
| 2 | They did to @MeekMill what they've done to you... |
| 3 | “Love is an expression of power. We can use it... |
| 4 | . @Ava showed out when she made A Wrinkle in T... |


Next we need to go through each of our tweets and extract our attributes. As you might guess, the attributes are just the words. We will build a list containing all the words that Tomi Lahren uses in her training tweets, and the same for Kaepernick. We could represent our attributes with many data structures, but we will just use lists, where each entry represents one word from a tweet. NLTK has a tokenizer specifically for tokenizing tweets (which you can read about __[here](http://www.nltk.org/api/nltk.tokenize.html)__), so we will use that tool in defining our attributes.

In [2]:
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()

# takes in a list of tweets, tokenizes each tweet, and returns a list of all the words used
def tokenizeTweets(tweets):
    # tokenize each tweet individually
    # .iloc[0] selects the tweet text by position (the row's only column)
    tokenized = [tknzr.tokenize(tweet.iloc[0]) for i, tweet in tweets.iterrows()]

    # put all the words into one list
    words = []
    for tweet in tokenized:
        words.extend(tweet)
    return words

lahren_words = tokenizeTweets(lahren_tweets)
kaep_words = tokenizeTweets(kaep_tweets)

# make all the words lower case since we might not want to differentiate between "the" and "The"
lahren_words = list(map(lambda x: x.lower(), lahren_words))
kaep_words = list(map(lambda x: x.lower(), kaep_words))
kaep_words[:10]

['let',
 'em',
 'know',
 'queen',
 '!',
 'https://t.co/16dklfbthq',
 'turning',
 'up',
 'the',
 'heat']

Next we need an overall vocabulary that represents every possible word in our tweets. That way, we can calculate the probability of seeing a word in this overall vocabulary given it's in a Lahren tweet or a Kaepernick tweet. We can do this easily by combining lahren_words with kaep_words into a set, so we have no duplicates.

In [3]:
overall_vocab = set(lahren_words + kaep_words)
print(len(overall_vocab))

2118


Now we have all the words that our Twitter users use. Next we can start finding the probabilities we need for the Naive Bayes algorithm. To estimate $P(x_i | Lahren)$ and $P(x_i | Kaepernick)$, we can simply go through our vocabulary and keep track of how many times the word $x_i$ appears in Lahren's tweets and in Kaepernick's tweets. We can easily do this with the Counter object (read more about it __[here](https://docs.python.org/2/library/collections.html#collections.Counter)__).

Additionally, when calculating the probabilities of each word, we need to take something into account in our formula for the algorithm. In $P(A) \prod_{i=1}^{n} P(x_i | A)$, we have the product of many terms. If a single term in the product is 0, then the entire score for author A is 0. For instance, when we go through the words in our vocabulary, if a word is not in any of Lahren's tweets, its probability given Lahren will be 0. We don't want this! The tweet could still very much be from Lahren. To avoid this problem, we make sure that none of the probabilities we compute are 0 by adding a fake count of 1 to the count of each word in Lahren's or Kaepernick's used words (this is known as add-one or Laplace smoothing). To keep the probabilities summing to 1, we also add the size of the vocabulary to the denominator. That way, none of the probabilities of a word occurring in someone's tweets will be 0.
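The effect of the fake count can be seen on a toy example. The word list, the vocabulary, and the helper `word_probability` below are hypothetical; setting `pseudo_count=0` reproduces the problem, and `pseudo_count=1` is the fix described above.

```python
from collections import Counter

def word_probability(word, words, vocab, pseudo_count):
    """P(word | author) with `pseudo_count` added to every vocabulary word's count."""
    counts = Counter(words)
    return (counts[word] + pseudo_count) / (len(words) + pseudo_count * len(vocab))

# Hypothetical toy data: this author never used the word "night"
author_words = ["good", "morning", "good", "day"]
vocab = {"good", "morning", "day", "night"}

unsmoothed = word_probability("night", author_words, vocab, pseudo_count=0)
smoothed = word_probability("night", author_words, vocab, pseudo_count=1)
print(unsmoothed, smoothed)

# The smoothed probabilities still sum to 1 over the vocabulary
total = sum(word_probability(w, author_words, vocab, pseudo_count=1) for w in vocab)
print(total)
```

Without smoothing, the unseen word gets probability 0 and would wipe out the whole product; with the fake count it gets a small but nonzero probability of 1/8.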

In [4]:
from collections import Counter

# a function that takes in a user's words and returns a dictionary 
# mapping words to their probability given their user
def getWordProbabilities(words, vocab):
    # Get the count of each word
    word_counts = Counter(words)

    # Get the probabilities of each word in the overall vocabulary 
    # by taking their count in a user's text over the total number of words
    word_probabilities = dict()
    for word in vocab:
        # add one to avoid getting probability of zero
        probability = (word_counts[word]+1)/(len(words)+len(vocab))     
        word_probabilities[word] = probability
    
    return word_probabilities 

lahren_word_probabilities = getWordProbabilities(lahren_words, overall_vocab)
kaep_word_probabilities = getWordProbabilities(kaep_words, overall_vocab)

print(lahren_word_probabilities["the"])
print(kaep_word_probabilities["the"])

0.016312407315867524
0.01914968376669009


Now that we have the word probabilities, all we need is the probability of a tweet being from Lahren or Kaepernick in general, which we estimate by counting the number of training tweets we have from each user.

In [5]:
lahren_probability = len(lahren_tweets)/(len(lahren_tweets)+len(kaep_tweets))
kaep_probability = len(kaep_tweets)/(len(lahren_tweets)+len(kaep_tweets))

print(lahren_probability)
print(kaep_probability)

0.564935064935065
0.43506493506493504


We are finally ready to start classifying! The following function takes a new tweet (not from the training data), and uses the Naive Bayes algorithm formula to determine the most likely user who made the tweet. 

In [6]:
import math

# takes a tokenized tweet, and returns the most likely user
def classify(tweet, vocab, word_probs1, word_probs2):

    # calculate the log-score that the tweet came from Lahren:
    # log P(Lahren) + the sum of log P(word | Lahren).
    # we add logs instead of multiplying probabilities because each probability is so small
    # that multiplying many of them together could cause underflow in Python. log is
    # monotonically increasing, so comparing log-scores picks the same winner.
    # words that are not in the vocabulary are skipped.
    log_score_lahren = math.log(lahren_probability)
    for word in tweet:
        if word in vocab:
            log_score_lahren += math.log(word_probs1[word])

    # calculate the log-score that the tweet came from Kaepernick in the same way
    log_score_kaep = math.log(kaep_probability)
    for word in tweet:
        if word in vocab:
            log_score_kaep += math.log(word_probs2[word])

    # now we just return the user with the higher score
    if log_score_lahren > log_score_kaep:
        return "Lahren"
    else:
        return "Kaepernick"

# Example:
classify(["Hi", "my", "name", "is", "paul"], overall_vocab, lahren_word_probabilities, kaep_word_probabilities)

'Kaepernick'
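The comments in `classify` mention underflow. Here is a small sketch of that problem, using made-up per-word probabilities for a hypothetical 200-word text rather than values from our tweets.

```python
import math

# Hypothetical per-word probabilities for a 200-word text under two authors
probs_author_1 = [1e-3] * 200
probs_author_2 = [2e-3] * 200

# Multiplying directly underflows to 0.0 for both authors,
# so the two products can no longer be compared
product_1 = math.prod(probs_author_1)
product_2 = math.prod(probs_author_2)
print(product_1, product_2)

# Summing logs keeps both scores finite and comparable
log_score_1 = sum(math.log(p) for p in probs_author_1)
log_score_2 = sum(math.log(p) for p in probs_author_2)
print(log_score_1 < log_score_2)
```

Tweets are short, so underflow is unlikely here, but working with sums of logs is the usual habit because it stays safe for longer texts too.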

Cool! Now let's see how well we can predict a batch of tweets from both of our users. We can get new tweets in the Test_Tweets.csv file. The first column in this file is the user that tweeted it, and the second column is the tweet itself.

In [7]:
test_tweets = pd.read_csv("Test_Tweets.csv", names=["user", "tweet"])
test_tweets.head()

| | user | tweet |
|---|---|---|
| 0 | Kaepernick | Just posted a video https://t.co/HEebUsNUaI |
| 1 | Kaepernick | @yourrightscamp #KnowYourRightsOakland #KnowYo... |
| 2 | Kaepernick | I can't thank all of these people enough! They... |
| 3 | Kaepernick | Young Kings! @yourrightscamp #KnowYourRightsOa... |
| 4 | Kaepernick | The Queens are rising! @yourrightscamp #KnowYo... |


In [8]:
# let's check our accuracy in classification
def accuracy(vocab, word_probs1, word_probs2):
    correct = 0
    for i, row in test_tweets.iterrows():
        user = row.user
        tweet = row.tweet
        tweet = list(map(lambda x: x.lower(), tknzr.tokenize(tweet)))
        guess = classify(tweet, vocab, word_probs1, word_probs2)
        if guess == user: correct += 1
    return correct/len(test_tweets)*100

"accuracy: " + str(accuracy(overall_vocab, lahren_word_probabilities, kaep_word_probabilities)) + "%"

'accuracy: 100.0%'

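100% accuracy? You might be asking how this is possible. How can such a simple algorithm produce perfect results? That's the beauty of Naive Bayes. With just a little work, a little data, and some data processing, we can get very strong results. Keep in mind, though, that this is a small dataset and our two users tweet about very different things, which makes them easier to tell apart.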
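A single accuracy number can also hide which user is being misclassified when the results are less than perfect. One possible way to break accuracy down per user is sketched below; the helper `per_author_accuracy` and the labels are hypothetical and are not results from our test set.

```python
from collections import Counter

def per_author_accuracy(true_users, guesses):
    """Return {user: fraction of that user's tweets that were guessed correctly}."""
    totals = Counter(true_users)
    correct = Counter(t for t, g in zip(true_users, guesses) if t == g)
    return {user: correct[user] / totals[user] for user in totals}

# Hypothetical labels and guesses, only to show the shape of the output
true_users = ["Lahren", "Lahren", "Kaepernick", "Kaepernick", "Kaepernick"]
guesses = ["Lahren", "Kaepernick", "Kaepernick", "Kaepernick", "Kaepernick"]

breakdown = per_author_accuracy(true_users, guesses)
print(breakdown)
```

In this made-up example the overall accuracy is 80%, but the breakdown shows that every mistake is a Lahren tweet guessed as Kaepernick.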

From here, there are other ways to make our implementation better (we got 100% accuracy on this example, but we won't always get perfect guessing right away). For instance, there are so many words that don't really give us any indication as to who made a tweet, such as "a" or "the". We can go through our tweets and remove these "stopwords" from our calculations to try to do better. We can also remove a large number of words that occur only once ever, since they are probably links or typos that don't help us in our classification.

In [9]:
# let's find the N most common words in our vocabulary and remove them;
# optionally, we can also remove the least common words
def findN(N, remove_rare=False):
    new_vocab = set(overall_vocab)  # a copy, so overall_vocab is left untouched
    count_of_all_words = Counter(lahren_words+kaep_words)
    # here we can use a nifty built in method for the counter object
    for word, count in count_of_all_words.most_common(N):
        new_vocab.discard(word)

    # for the least common, we can remove the words that only occur once
    # (off by default: the outputs below were produced without this step)
    if remove_rare:
        for word, count in count_of_all_words.items():
            if count <= 1:
                new_vocab.discard(word)
    return new_vocab

# We can print the words we took out:
list(overall_vocab-findN(10))[:10]

['…', 'the', '!', 'to', 'and', 'you', ',', '’', 'for', '.']

Now how are we doing?

In [10]:
best_accuracy = 0
new_lahren_words = None
new_kaep_words = None
for i in range(1, 15):
    new_vocab = findN(i)
    
    lahren_words_without_stopwords = []
    for word in lahren_words:
        if word in new_vocab:
            lahren_words_without_stopwords.append(word)

    kaep_words_without_stopwords = []
    for word in kaep_words:
        if word in new_vocab:
            kaep_words_without_stopwords.append(word)
            
        
    new_lahren_word_probs = getWordProbabilities(lahren_words_without_stopwords, new_vocab)
    new_kaep_word_probs = getWordProbabilities(kaep_words_without_stopwords, new_vocab)
    this_accuracy = accuracy(new_vocab, new_lahren_word_probs, new_kaep_word_probs)
    # report the accuracy for this N (the best one is printed after the loop)
    print("N = " + str(i) + ": accuracy " + str(this_accuracy) + "%")
    if this_accuracy > best_accuracy:
        best_vocab = new_vocab
        best_accuracy = this_accuracy
        new_lahren_words = lahren_words_without_stopwords
        new_kaep_words = kaep_words_without_stopwords
        
print("Best Accuracy: " + str(best_accuracy) + "%")

N = 1: accuracy 100.0%
N = 2: accuracy 100.0%
N = 3: accuracy 100.0%
N = 4: accuracy 100.0%
N = 5: accuracy 100.0%
N = 6: accuracy 100.0%
N = 7: accuracy 100.0%
N = 8: accuracy 100.0%
N = 9: accuracy 100.0%
N = 10: accuracy 100.0%
N = 11: accuracy 98.0%
N = 12: accuracy 98.0%
N = 13: accuracy 100.0%
N = 14: accuracy 98.0%
Best Accuracy: 100.0%


As we can see, we maintain perfect accuracy when removing up to 10 of the most common words. After that, removing more words sometimes makes our accuracy worse, which indicates that some of the words we remove are useful for guessing correctly who the tweeter is. Now what can we do? We can try introducing some human knowledge into our process. From following Kaepernick and Lahren on Twitter, I know that Kaepernick often tweets about #BlackLivesMatter. If I see this hashtag in a tweet, I would be inclined to think that Kaepernick wrote it. We can formalize this intuition by looking at the most common words each user tweets.

In [11]:
# Let's look at the most common words that Lahren and Kaepernick use
lahren_word_counts = Counter(new_lahren_words)
kaep_word_counts = Counter(new_kaep_words)
print(kaep_word_counts.most_common(50))

[('to', 123), ('…', 123), ('the', 108), (',', 98), ('.', 70), ('my', 68), ('a', 60), ('for', 54), ('of', 50), ('in', 45), ('i', 45), ('you', 38), ('$', 37), ('and', 35), ('&', 34), ('your', 29), ('#milliondollarpledge', 28), ('#10for10', 28), ('history', 27), ('is', 25), ('know', 23), ('brother', 23), ('we', 20), ('’', 20), ('on', 20), ('thank', 19), ('just', 19), ('black', 18), ('our', 18), ('are', 18), (':', 18), ('s', 17), ('month', 17), ('@yourrightscamp', 16), ('posted', 16), ('#encore', 15), ('10k', 15), ('it', 14), ('have', 14), ('“', 13), ('4', 13), ('donate', 12), ('people', 11), ('all', 11), ('he', 11), ('with', 11), ('go', 11), ('was', 11), ('video', 11), ('t', 10)]


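In Kaepernick's counts above, hashtags and mentions like '#milliondollarpledge', '#10for10', and '@yourrightscamp' stand out. Doing the same for Lahren (with `lahren_word_counts.most_common(50)`), it looks like Lahren talks about Republican and conservative views, and the Fox News channel, fairly often. We can use this intuition in guessing on a new tweet. For instance, if we see '@foxandfriends' in a tweet, we can probably skip the Naive Bayes calculation (which could guess Kaepernick), and just say Lahren tweeted this. Let's try implementing this: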

In [12]:
# all lowercase, to match the lowercased tokens we compare against
common_lahren_words = ['@foxandfriends', '@seanhannity', 'americans']
common_kaep_words = ['#milliondollarpledge', '#10for10', '@yourrightscamp']

Now before we classify any new tweets, we check to see if a tweet contains any of these words. If not, then we classify as usual.

In [13]:
# takes a tokenized (and lowercased) tweet, the overall_vocab, and the word probabilities,
# and predicts which user made the tweet
def classifyWithHelp(tweet, vocab, word_probs1, word_probs2):
    # first look for any of the giveaway words
    for word in tweet:
        if word in common_lahren_words:
            return "Lahren"
        elif word in common_kaep_words:
            return "Kaepernick"
    # no giveaway word found, so classify as usual
    return classify(tweet, vocab, word_probs1, word_probs2)
            
# Now let's look at a tweet from Kaepernick
raw_tweet = "For the last round of my #MillionDollarPledge I'm going #10for10! My brother @KDTrey5 is joining &amp; chose DeBug https://t.co/wmM9qEgPKb"
tokenized_tweet = list(map(lambda x: x.lower(), tknzr.tokenize(raw_tweet)))
classifyWithHelp(tokenized_tweet, overall_vocab, lahren_word_probabilities, kaep_word_probabilities)   

'Kaepernick'
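The giveaway words above were picked by hand. One possible way to let the data suggest them is to rank words by how much more likely they are under one user than the other. The sketch below does this with add-one smoothing on made-up word lists; the helper `most_distinctive` is hypothetical and is not part of the classifier above.

```python
import math
from collections import Counter

def most_distinctive(words_1, words_2, top_n):
    """Rank words by log(P(word | user 1) / P(word | user 2)), with add-one
    smoothing, and return the top_n words for user 1 and for user 2."""
    vocab = set(words_1) | set(words_2)
    counts_1, counts_2 = Counter(words_1), Counter(words_2)

    def log_ratio(word):
        p1 = (counts_1[word] + 1) / (len(words_1) + len(vocab))
        p2 = (counts_2[word] + 1) / (len(words_2) + len(vocab))
        return math.log(p1 / p2)

    ranked = sorted(vocab, key=log_ratio, reverse=True)
    return ranked[:top_n], ranked[-top_n:]

# Hypothetical toy word lists
user_1_words = ["news", "news", "news", "the", "the"]
user_2_words = ["camp", "camp", "camp", "the", "the"]

for_user_1, for_user_2 = most_distinctive(user_1_words, user_2_words, top_n=1)
print(for_user_1, for_user_2)
```

A word like "the" that both users use equally often gets a log ratio of 0 and never shows up at either end of the ranking, while words used by only one user rise to the top or sink to the bottom.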

With that, we have explored the Naive Bayes algorithm and where it comes from in statistics and probability, implemented the algorithm from scratch, tested it on a real life dataset, and seen some other ways of improving the algorithm. Hopefully you have learned the beauty of the simplicity of the Naive Bayes algorithm, and are equipped to go use it to classify things out in the world!