<h1>Create a simple spam filter using Naive Bayes classification</h1>

<p>In classification tasks, as in many other things, complicated does not necessarily equal better. While it is possible and sometimes desirable to use things like neural networks or support vector machines to classify observations, Naive Bayes, one of the simplest classification methods, often outperforms its more complicated counterparts. This seems to be particularly true when it comes to text classification (see [here](http://cogprints.org/6708/1/4-1-16-23.pdf) and [here](https://thesai.org/Downloads/Volume4No11/Paper_5-Performance_Comparison_between_Na%C3%AFve_Bayes.pdf), for example).</p>

<p>In this post, we’ll first look at the concept of conditional probability, which is the basis of the Naive Bayes classifier. We’ll then explore three different ways conditional probability can be characterized depending on what assumptions we want to make about our data. Then we’ll use the scikit-learn library to train a classifier to detect spam emails.</p>

<h2>Conditional Probability</h2>

<p>You’ve probably heard of the duck test:</p>

<blockquote>“If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.”</blockquote>

<p>That’s conditional probability. For any given set of animals, the probability that any one animal is a duck can be estimated by:</p>

<blockquote>P(is a duck) = [total # of ducks] / [total # of animals]</blockquote>

<p>And the probability that any one animal, say, swims like a duck, can be estimated by:</p>

<blockquote>P(swims like a duck) = [total # of animals who swim like a duck] / [total # of animals]</blockquote>

<p>In this example, conditional probability is the probability that an animal is a duck if we already know that the animal swims like a duck. That’s calculated by dividing the joint probability (the percentage of animals that both are ducks and swim like a duck) by the percentage of animals that swim like a duck. So:</p>

<blockquote>P(is a duck, given that it swims like a duck) = P(is a duck and swims like a duck) / P(swims like a duck)</blockquote>

<p>In other words, we reduce our universe from all of the animals to just the animals that swim like a duck. By using percentages instead of counts, we’re continuing to use the original probabilities (based on the total number of animals), but we’re forcing our calculation to only consider those animals that meet the condition.</p>
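
<p>The calculation above can be sketched in a few lines of Python. The counts are made up for illustration, and the function name `conditional_probability` is just a hypothetical helper, not part of any library.</p>

```python
def conditional_probability(joint_count, condition_count, total):
    """Return P(A given B) from raw counts, by way of the two probabilities."""
    p_joint = joint_count / total          # P(A and B)
    p_condition = condition_count / total  # P(B)
    return p_joint / p_condition


# Made-up counts for a hypothetical pond of 200 animals.
total_animals = 200
swims_like_duck = 80           # animals that swim like a duck
duck_and_swims_like_duck = 60  # animals that are ducks AND swim like a duck

p_duck_given_swims = conditional_probability(
    duck_and_swims_like_duck, swims_like_duck, total_animals
)
print(round(p_duck_given_swims, 3))
```

<p>Dividing the two probabilities gives the same answer as dividing the raw counts (60 / 80), because the total number of animals cancels out.</p>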

<p>What does this have to do with classification? Well, if we have a lot of animals, and we want to figure out which animals are ducks, we can combine conditional probabilities. For example, if one animal looks like a duck, swims like a duck, and quacks like a duck, then:</p>

<blockquote>P(animal is a duck) = P(is a duck, given that it looks like a duck) * P(is a duck, given that it swims like a duck) * P(is a duck, given that it quacks like a duck)</blockquote>

<p>If, instead of a duck, we thought it was possible that the animal was a goose, we could calculate another probability:</p>

<blockquote>P(animal is a goose) = P(is a goose, given that it looks like a duck) * P(is a goose, given that it swims like a duck) * P(is a goose, given that it quacks like a duck)</blockquote>

<p>A goose could conceivably look a little like a duck, and could very likely swim like a duck, but most likely wouldn’t quack like a duck. Therefore, if an animal looked, swam, and quacked like a duck, the probability of it being a duck would be higher than the probability of it being a goose, and therefore we would classify it as a duck.</p>
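
<p>Here is a minimal sketch of that comparison with made-up numbers. It uses the usual Naive Bayes formulation, which multiplies a prior for each class by the probability of each observed feature given that class, so it is a slightly more formal version of the simplified products above. All probabilities and names (`priors`, `likelihoods`, `classify`) are assumptions for illustration.</p>

```python
# Made-up prior probabilities: how common each animal is before we look at it.
priors = {"duck": 0.5, "goose": 0.5}

# Made-up likelihoods: P(feature is observed, given the animal's class).
likelihoods = {
    "duck": {"looks": 0.9, "swims": 0.9, "quacks": 0.9},
    "goose": {"looks": 0.4, "swims": 0.8, "quacks": 0.05},
}


def score(label, observed):
    """Prior times the product of the likelihoods of each observed feature."""
    result = priors[label]
    for feature in observed:
        result *= likelihoods[label][feature]
    return result


def classify(observed):
    """Return the most probable label and the normalized scores."""
    scores = {label: score(label, observed) for label in priors}
    total = sum(scores.values())
    posteriors = {label: s / total for label, s in scores.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


label, posteriors = classify(["looks", "swims", "quacks"])
print(label, posteriors)
```

<p>The "naive" part is the assumption that the features are independent of each other given the class, which is what lets us simply multiply the individual probabilities.</p>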

<h2>Spam filtering</h2>

<p>Now we will look at how we could automate the above classification procedure in Python. I don’t have an animals dataset, so instead of ducks and geese we’ll look at spam (unsolicited bulk emails) and ham (emails that aren’t spam). There are a number of corpora of spam/ham emails. We’ll grab a few from the Enron dataset:</p>

In [None]:
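from io import BytesIO  # replaces the Python 2-only cStringIO
import tarfile
import requests

url = 'http://www.aueb.gr/users/ion/data/enron-spam/preprocessed/enron1.tar.gz'
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed

# Open the gzipped tar archive directly from the downloaded bytes
tar = tarfile.open(mode="r:gz", fileobj=BytesIO(response.content))

# Each email is its own text file; the filename tells us whether it is spam or ham
spam = [tar.extractfile(m).read() for m in tar.getmembers() if 'spam.txt' in m.name]
ham = [tar.extractfile(m).read() for m in tar.getmembers() if 'ham.txt' in m.name]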

<p>The above code downloads a gzipped tar file containing text files, each file containing a single email. We use the filename for each text file to create separate lists of spam and ham.</p>
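
<p>To try the filename-based split without downloading anything, here is a small sketch that builds a tiny archive in memory and then separates it the same way. The file names and email contents are made up, and `build_demo_archive` and `split_by_filename` are hypothetical helpers.</p>

```python
import io
import tarfile


def build_demo_archive():
    """Build a tiny gzipped tar archive in memory from made-up emails."""
    files = {
        "demo/ham/0001.ham.txt": b"Subject: meeting notes attached",
        "demo/spam/0002.spam.txt": b"Subject: free money now",
        "demo/ham/0003.ham.txt": b"Subject: lunch tomorrow?",
    }
    buffer = io.BytesIO()
    with tarfile.open(mode="w:gz", fileobj=buffer) as archive:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    buffer.seek(0)  # rewind so the archive can be read from the start
    return buffer


def split_by_filename(fileobj):
    """Read every regular file and sort it into spam or ham by its name."""
    spam, ham = [], []
    with tarfile.open(mode="r:gz", fileobj=fileobj) as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories and other non-file entries
            text = archive.extractfile(member).read()
            if "spam.txt" in member.name:
                spam.append(text)
            elif "ham.txt" in member.name:
                ham.append(text)
    return spam, ham


demo_spam, demo_ham = split_by_filename(build_demo_archive())
print(len(demo_spam), len(demo_ham))
```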

<p>Now that we have the texts, we need to pull out word counts. The number of times a word appears in each email will be our equivalent of “looks like a duck”, “quacks like a duck”, etc.</p>

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

tf_vectorizer = CountVectorizer(
   max_df=0.95,
   min_df=2,
   max_features=1000,
   stop_words='english',
   lowercase=True,
   encoding='utf-8',
   decode_error='replace'
)

tf = tf_vectorizer.fit_transform(spam + ham)

<p>The `CountVectorizer` class from scikit-learn turns a list of texts into a sparse document-by-word matrix. The class has a lot of parameters, so it makes sense to explain the ones we used above:</p>
<ul>
<li>`max_df=0.95`: only take words that appear in 95% or fewer of the total documents (if a word appears in every document, then it probably won’t help us differentiate spam documents from ham documents).</li>
<li>`min_df=2`: only take words that appear in at least two documents (if a word appears in only one document, it probably won’t help us differentiate anything).</li>
<li>`max_features=1000`: extract the 1000 most frequent words.</li>
<li>`stop_words='english'`: remove words that occur so commonly in English that they probably won’t help us differentiate documents; these include words like “and” and “the”.</li>
<li>`lowercase=True`: make all words lowercase before counting; this prevents us from treating, say, “Hello” and “hello” as two separate words just because one happened to be placed at the start of a sentence and the other one in the middle of the sentence.</li>
<li>`encoding='utf-8'`: this tells CountVectorizer which character encoding to use when decoding the raw bytes of each document into text.</li>
<li>`decode_error='replace'`: this tells CountVectorizer to substitute a replacement character for any bytes that cannot be decoded, instead of raising an error. If we planned to use our spam filter in real life, this would probably be a bad idea, but for the purposes of our example here it allows us to not spend too much time figuring out how to clean our data.</li>
</ul>
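
<p>To make the filtering parameters more concrete, here is a rough pure-Python sketch of the same ideas: lowercasing, stop-word removal, and the `min_df`, `max_df` and `max_features` cut-offs. It is an approximation for illustration, not scikit-learn’s actual implementation. The `STOP_WORDS` set is a tiny made-up list (scikit-learn’s built-in English list is much longer) and `build_vocabulary` is a hypothetical helper.</p>

```python
import re
from collections import Counter

# Tiny made-up stop-word list, for illustration only.
STOP_WORDS = {"the", "and", "a", "to", "is"}


def build_vocabulary(docs, min_df=2, max_df=0.95, max_features=1000):
    """Return the sorted list of words that survive the filters."""
    term_counts = Counter()  # total occurrences of each word
    doc_counts = Counter()   # number of documents containing each word
    for doc in docs:
        # Lowercase, keep tokens of two or more characters, drop stop words.
        tokens = [
            t for t in re.findall(r"\b\w\w+\b", doc.lower())
            if t not in STOP_WORDS
        ]
        term_counts.update(tokens)
        doc_counts.update(set(tokens))

    n_docs = len(docs)
    kept = [
        word for word, df in doc_counts.items()
        if df >= min_df and df <= max_df * n_docs
    ]
    # Keep the most frequent words overall, breaking ties alphabetically.
    kept.sort(key=lambda word: (-term_counts[word], word))
    return sorted(kept[:max_features])


docs = [
    "Free money! Claim the free prize, friend",
    "The meeting is moved to Friday, friend",
    "Claim a prize and money, friend",
    "Meeting notes attached, friend",
]
print(build_vocabulary(docs))
```

<p>In this toy corpus, “friend” is dropped because it appears in every document, and “free” is dropped because it appears in only one document, even though it occurs twice there.</p>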

<p>Now, let’s put our data into a Pandas DataFrame so we can explore it a little bit:</p>


In [None]:
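from pandas import DataFrame, Series

# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
df = DataFrame(tf.toarray(), columns=tf_vectorizer.get_feature_names_out())
# Label each document: 1.0 for spam, 0.0 for ham (same order as spam + ham above)
is_spam = Series(([1.0] * len(spam)) + ([0.0] * len(ham)))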

<p>The above code creates a DataFrame from the document-word matrix, and labels each column with the appropriate word from the vectorizer. It also creates a Series indicating 1.0 if a document was identified as spam, and 0.0 if identified as ham. We can then do things like look at the 10 most-used words.</p>

In [None]:
print(df.sum().sort_values(ascending=False).head(10))

<p>However, many words appear multiple times within a single document, which inflates their totals. If we want to see which words appear in the largest share of documents:</p>

In [None]:
print(df.gt(0).mean().sort_values(ascending=False).head(10))
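
<p>The difference between the two rankings is easier to see on a toy example. The sketch below uses a small made-up document-by-word matrix (the words and counts are assumptions, not taken from the Enron data) and assumes pandas is installed.</p>

```python
from pandas import DataFrame

# Made-up counts: one row per document, one column per word.
toy = DataFrame(
    [
        [5, 0, 1],
        [0, 2, 1],
        [0, 0, 1],
        [0, 0, 1],
    ],
    columns=["free", "meeting", "thanks"],
)

# Total number of times each word appears across all documents.
total_counts = toy.sum().sort_values(ascending=False)

# Share of documents in which each word appears at least once.
doc_share = toy.gt(0).mean().sort_values(ascending=False)

print(total_counts)
print(doc_share)
```

<p>Here “free” has the highest total count because one document repeats it five times, but “thanks” appears in the largest share of documents.</p>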