# Building a Spam Filter - Exercise

Spam filtering is a classic example of a binary classification task. In this notebook you'll learn how to implement your own spam filter.

The task is to distinguish between two types of emails: "spam" and "non-spam", the latter often called "ham". The machine learning classifier flags an email as spam when it shows certain features, mainly the presence of words like "Viagra" or "lottery", or phrases like "You've won $100,000,000! Click here!" and "Join now!".

Once again, we will use the same approach as for the sentiment analyzer.

## 1. Import

As before, we're going to use the Naive Bayes classifier, a popular choice for text classification tasks such as sentiment analysis and spam filtering. Import the necessary NLTK packages for NLP.

In [None]:
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy as nltk_accuracy
from nltk.tokenize import word_tokenize

import nltk

## 2. Explore the data

In the resources folder you will find two subfolders, `spam` and `ham`, which together contain 1,500 legitimate (ham) emails and 4,500 spam emails.

To start with, create two variables `spam_list` and `ham_list` and try to write a program that fills these variables with the names of the spam and ham emails. Print the names of the spam documents and the total number of spam and ham emails.
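One possible way to collect the file names is sketched here. The helper name `list_email_files` and the `resources/spam` and `resources/ham` paths are assumptions based on the description above, so adjust them to wherever your copy of the data lives.

```python
import os


def list_email_files(folder):
    """Return the sorted names of the regular files directly inside `folder`."""
    return sorted(
        name
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )


# Assumed layout: resources/spam and resources/ham.
spam_dir = os.path.join("resources", "spam")
ham_dir = os.path.join("resources", "ham")

# Only run the listing when both folders actually exist.
if os.path.isdir(spam_dir) and os.path.isdir(ham_dir):
    spam_list = list_email_files(spam_dir)
    ham_list = list_email_files(ham_dir)
    print(spam_list)
    print("spam:", len(spam_list), "ham:", len(ham_list))
```

Sorting the names keeps the "first spam file" the same from run to run, since `os.listdir` makes no promise about ordering.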

Now print the tokenized content of the first spam file (tip: use the word tokenizer). You should get:

```
['subject', ':', 'adv', ':', 'space', 'saving', 'computer', 'to', 'replace', 'that', 'big', 'box', 'on', 'or', 'under', 'your', 'desk', '!', '!', 'revolutionary', '!', '!', '!', 'full', 'featured', '!', '!', '!', 'space', 'saving', 'computer', 'in', 'a', 'keyboard', 'eliminate', 'that', 'big', 'box', 'computer', 'forever', '!', 'great', 'forhome', '.', '.', '.', '.', 'office', '.', '.', '.', 'or', 'students', '.', '.', '.', 'any', 'place', 'where', 'desk', 'space', 'is', 'at', 'a', 'premium', '!', 'the', 'computer', 'in', ... ]
```
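A minimal sketch of reading one email into tokens follows. The helper `read_tokens` is hypothetical, and the tokenizer is passed in as a parameter so the example runs with plain `str.split`. In the notebook you would pass `word_tokenize` instead, which requires NLTK's Punkt tokenizer data to be downloaded. Lowercasing and ignoring undecodable bytes are assumptions, not requirements of the exercise.

```python
def read_tokens(path, tokenize):
    """Read one email file and return its lowercased tokens."""
    # errors="ignore" is an assumption: it skips bytes that are not valid UTF-8.
    with open(path, encoding="utf-8", errors="ignore") as handle:
        return tokenize(handle.read().lower())


# In the notebook, something along these lines (path layout assumed):
# tokens = read_tokens(os.path.join("resources", "spam", spam_list[0]), word_tokenize)
# print(tokens)
```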

## 3. Create a list of documents

Create a list of documents in the following format:

```
[(['subject', 'space', 'saving', 'computer', 'replace', ...], 'spam'), 
 (['subject', 'miningnews', 'net', 'newsletter', 'thursday', ...], 'spam'), 
 (['subject', 'say', 'goodbye', 'doctor', 'visits', ...], 'spam'), 
 (['subject', 'annoy', 'you', 'eternal', 'benson', ...], 'spam'),
 (['subject', 'scott', 'leaving', 'intel', 'today', ...], 'ham'), ...]
```

Use `random` to shuffle your documents. We are going to split the data into a training set and a test set, and if we left the documents in order, we'd train on all of the spam emails and then test only against ham emails. We don't want that, so we shuffle the data.
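The labeling and shuffling step could look roughly like this. `build_documents` and its `seed` parameter are illustrative names, and the token lists are made up to show the shape of the result.

```python
import random


def build_documents(spam_tokens, ham_tokens, seed=None):
    """Label each token list and return the labeled documents in random order."""
    documents = [(tokens, "spam") for tokens in spam_tokens]
    documents += [(tokens, "ham") for tokens in ham_tokens]
    # A dedicated Random instance makes the shuffle repeatable when a seed is given.
    random.Random(seed).shuffle(documents)
    return documents


# Tiny made-up token lists, only to show the shape of the result.
example = build_documents(
    spam_tokens=[["subject", "join", "now"], ["subject", "lottery"]],
    ham_tokens=[["subject", "meeting", "today"]],
    seed=0,
)
print(example)
```

Passing a fixed seed is optional. It only helps when you want the same train/test split on every run.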

In [None]:
print(documents[:5])

## 4. Collect the top 3,000 words

Make a new variable `top_words` which contains the 3,000 most common words. Don't forget to remove the stopwords and the punctuation. Also remove one-letter words and numbers. Printing the 20 most frequent of these words with their counts should give:

```
[('subject', 6873), ('com', 3258), ('``', 2617), ('http', 2523), ('data', 2363), ('please', 2293), ('database', 2163), ('company', 2120), ('dbcaps', 2010), ('date', 1827), ('time', 1737), ('email', 1733), ('hourahead', 1729), ('get', 1724), ('information', 1715), ('hour', 1658), ('start', 1627), ('us', 1586), ('www', 1520), ('may', 1497)]

```
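As a rough sketch of the counting step, the example below uses `collections.Counter` from the standard library. The function name `most_common_words` and the tiny stopword set are placeholders. In the notebook the stopwords would more likely come from NLTK's stopword corpus, which has to be downloaded first. The punctuation check here works on single characters, so a multi-character token such as ``` `` ``` passes through it, which appears consistent with the expected output above.

```python
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def most_common_words(documents, stopwords, n=3000):
    """Return the n most frequent (word, count) pairs after filtering."""
    counts = Counter(
        word
        for tokens, _label in documents
        for word in tokens
        if word not in stopwords        # drop stopwords
        and word not in PUNCTUATION     # drop single punctuation characters
        and len(word) > 1               # drop one-letter words
        and not word.isdigit()          # drop pure numbers
    )
    return counts.most_common(n)


# Made-up documents and stopwords, for illustration only.
docs = [
    (["subject", ":", "the", "lottery", "2003", "lottery", "a"], "spam"),
    (["subject", "meeting", "!", "x"], "ham"),
]
print(most_common_words(docs, stopwords={"the", "a"}, n=2))
```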

In [None]:
print(top_words)

## 5. Create the featureset

Use each of the top 3,000 words as an input feature for your classifier. The value of a feature is whether the word occurs in the mail (`True` or `False`). Use `spam` or `ham` as the output label. The featureset used to train the classifier should therefore look something like this:

```
[({'subject': True, 'annuity': False, 'deal': False, ...}, 'spam'), ({'subject': False, 'annuity': False, 'deal': False, ...}, 'ham'), ... ]
```
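One way to build such a featureset is sketched here. The names `document_features`, `build_featuresets` and `vocabulary` are illustrative. `vocabulary` is assumed to be a plain list of words, so if your `top_words` holds `(word, count)` pairs you would first extract the words, for example with `[word for word, _count in top_words]`.

```python
def document_features(tokens, vocabulary):
    """Map every vocabulary word to True/False depending on whether it occurs in tokens."""
    present = set(tokens)  # set lookup keeps the check fast for 3,000 words
    return {word: (word in present) for word in vocabulary}


def build_featuresets(documents, vocabulary):
    """Turn (tokens, label) documents into (features, label) pairs."""
    return [
        (document_features(tokens, vocabulary), label)
        for tokens, label in documents
    ]


# Made-up vocabulary and documents, for illustration only.
vocabulary = ["subject", "annuity", "deal"]
documents = [(["subject", "deal", "now"], "spam"), (["meeting", "today"], "ham")]
print(build_featuresets(documents, vocabulary))
```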

In [None]:
print(featuresets[0])

## 6. Train the classifier

Now it is time to separate our data into training and testing sets, and press go! The algorithm that we're going to use is the Naive Bayes classifier. Train the classifier with 90% of the data.

After training, print the accuracy of the classifier on the remaining 10% of the data.

Print the 15 most informative words for telling spam from ham.
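The three steps above might be sketched as follows. `split_featuresets` and `train_and_report` are hypothetical helper names, while `NaiveBayesClassifier.train`, `accuracy` and `show_most_informative_features` are the NLTK calls this notebook already relies on. The split assumes the featuresets were built from shuffled documents.

```python
def split_featuresets(featuresets, train_fraction=0.9):
    """Split already-shuffled featuresets into (train_set, test_set)."""
    cut = int(len(featuresets) * train_fraction)
    return featuresets[:cut], featuresets[cut:]


def train_and_report(featuresets, top_n=15):
    """Train a Naive Bayes classifier, then print accuracy and the top features."""
    # Imported here so the split helper above can be used without NLTK.
    from nltk.classify import NaiveBayesClassifier
    from nltk.classify.util import accuracy as nltk_accuracy

    train_set, test_set = split_featuresets(featuresets)
    classifier = NaiveBayesClassifier.train(train_set)
    print("Accuracy:", nltk_accuracy(classifier, test_set))
    classifier.show_most_informative_features(top_n)
    return classifier


# In the notebook, with the featuresets from step 5:
# classifier = train_and_report(featuresets)
```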

## 7. Test the classifier

In the test folder you can find 10 mails (spam and ham). Use these mails as input for your classifier and check if you can classify them correctly as spam or ham. You should get:

```
File: 0010.2003-12-18.GP.spam.txt => spam Probability: 0.75
Content:
Subject: re : hot topics : growing young

---------------------------------------

File: 0166.2001-04-18.williams.ham.txt => ham Probability: 1.0
Content:
 Subject: new hire dinner rsvps
don ' t forget to let me know if you are interested in attending the new hire dinner at 6 pm on thursday , april 26 th ! it will be held at oritalia at the westin hotel . please indicate your meal selection ( meat - - beef , or veggie - - pasta ) . i will need to submit final entree numbers thursday morning ...
```
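A possible way to get both the label and its probability for a new mail is sketched below. The helper `classify_tokens` is hypothetical. It assumes `classifier` is the trained NLTK classifier from step 6, whose `prob_classify` method returns a probability distribution with `max()` and `prob()`, and that `vocabulary` is the same word list used to build the training features.

```python
def classify_tokens(classifier, tokens, vocabulary):
    """Return (label, probability) for one tokenized mail."""
    present = set(tokens)
    # The features must be built exactly as they were for training.
    features = {word: (word in present) for word in vocabulary}
    distribution = classifier.prob_classify(features)
    label = distribution.max()
    return label, round(distribution.prob(label), 2)


# In the notebook, for each file in the test folder (names assumed):
# label, probability = classify_tokens(classifier, tokens, vocabulary)
# print("File:", name, "=>", label, "Probability:", probability)
```

Rounding to two decimals only mirrors the expected output. The classifier itself works with the unrounded probability.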