<a href="https://colab.research.google.com/github/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/02-your-first-nlp-example/1_implementing_own_spam_filter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Implementing own spam filter

In this notebook, you use the spam filtering as your practical NLP application as it is an example of a very widely spread family of tasks – text classification. Text classification comprises a number of applications, for example user profiling, sentiment analysis and topic labeling, so this
will give you a good start for the rest of the book. 

First, let’s see what exactly classification addresses.

We, humans apply classification in our everyday lives pretty regularly: classifying things simply implies that we try to put them into clearly defined groups, classes or categories. 

In fact, we tend to classify all sorts of things all the time. Here are some examples:

- based on our level of engagement and interest in a movie, we may classify it as interesting or boring;
- based on temperature, we classify water as cold or hot;
- based on the amount of sunshine, humidity, wind strength and air temperature, we classify the weather as good or bad;
- based on the number of wheels, we classify vehicles into unicycles, bicycles, tricycles, quadricycles, cars and so on;
- based on the availability of the engine, we may classify two-wheeled vehicles into bicycles and motorcycles.

Classification is useful because it makes it easier for us to reason about things and adjust our behavior accordingly.

When classifying things, we often go for simple contrasts – good vs. bad, interesting vs. boring, hot vs. cold. When we are dealing with two labels only, this is called binary classification.

Classification that implies more than two classes is called multi-class classification.



##Setup

In [1]:
import os
import codecs
import random

import nltk
from nltk import word_tokenize
from nltk import NaiveBayesClassifier, classify
from nltk.text import Text

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [25]:
%%shell

wget -qq https://github.com/ekochmar/Essential-NLP/raw/master/enron1.zip
wget -qq https://github.com/ekochmar/Essential-NLP/raw/master/enron2.zip
unzip -qq enron1.zip
unzip -qq enron2.zip



##Step 1: Define the data and classes

Enron email dataset is a large dataset of emails (the original dataset contains about 0.5M messages), including both ham and spam emails, for about 150 users, mostly senior management of Enron.

We are going to use enron1/ folder for training. All folders in Enron
dataset contain spam and ham emails in separate subfolders, so you don’t need to worry about pre-defining them. Each email is stored as a text file in these subfolders. 

Let’s read in the contents of these text files in each subfolder, store the spam emails contents and the ham emails contents as two separate data structures and point our algorithm at each, clearly
defining which one is spam and which one is ham.

In [4]:
def read_files(folder):
  files = os.listdir(folder)
  a_list = []
  for a_file in files:
    # Skip hidden files, that are sometimes automatically created by the operating systems. They can be easily identified because their names start with “.”
    if not a_file.startswith("."):
      file = codecs.open(folder + a_file, "r", encoding="ISO-8859-1", errors="ignore")
      a_list.append(file.read())
      file.close()
  return a_list

Now you can define two such lists – spam_list and ham_list, letting the machine know what data to use as examples of spam emails and what data represents ham emails.

In [5]:
spam_list = read_files("enron1/spam/")
ham_list = read_files("enron1/ham/")

# Check the lengths of the lists: for spam it should be 1500 and for ham – 3672
print(len(spam_list))
print(len(ham_list))

1500
3672


Let's check out the contents of the first entry. In both cases, it should coincide with the contents of the first file in each correspondent subfolder.

In [6]:
print(spam_list[0])

Subject: hI there man - feel the vitality
What about their health
V * i. 0' x' x 25 m. G 3 o piils 72. 5 o
V' 1. A. G. R' a loo m, g 32 pills 149. Oo
C. 1. A, l. 1' s 2 o m. G lo pills 79. Oo
Create an 0. R. D. E. R: http:// earthy. Yesmeds - now. Net/? Wid = 209015! Same day shlpp 1 ng!
We also have in stock:
X, a * n, a, x 1 m, g 3 o pI | | s 79. 0 o
P * r. 0, z. A * c 20 m. G 3 o pilis 110. 00
P * a, x * 1, l 2 o m. G 2 o pil | s 155. 00
M, e. R' i' d' i. A lo m. G 3 o p! | | s 147. Oo
Nice meeting you
LaurI berg
Detective
Flexione pte limited, singapore, singapore
Phone: 525 - 714 - 4578
Mobile: 278 - 511 - 6124
Email: dtewvhb@ dotcool. Com
This message is for confirmation
This freeware is a 46 day usage package
Notes:
The contents of this info is for attention and should not be cretaceous laidlaw
Fire freemen needful
Time: sat, 22 jan 2005 12: 19: 58 + 0200



In [7]:
print(ham_list[0])

Subject: meter 1517
Daren - meter 1517 has a nom of 0/day for jan. It flowed about 5. 400 on day
1. This is a valid flow. Could you please extend the deal from dec. (deal #
506192) or create a new one? Thanks.
Al


Next, you’ll need to preprocess the data (e.g., by splitting text strings into words) and extract the features.

Finally, remember that you will need to split the data randomly into the
training and test sets. 

Let’s shuffle the resulting list of emails with their labels, and make
sure that the shuffle is reproducible by fixing the way in which the data is shuffled:

In [8]:
# for each member of the ham_list and spam_list it stores a tuple with the content and associated label
all_emails = [(email_content, "spam") for email_content in spam_list]
all_emails += [(email_content, "ham") for email_content in ham_list]

# By defining the seed for the random operator you can make sure that all future runs will shuffle the data in the same way
random.seed(2020)
random.shuffle(all_emails)

# it should be equal to 1500 + 3672 = 5172
print(f"Dataset size = {str(len(all_emails))} emails")

Dataset size = 5172 emails


##Step 2: Split the text into words

Remember, that the email contents that you’ve read in so far each come as a single string of symbols. The first step of text preprocessing involves splitting the running text into words.

You are going to use NLTK’s tokenizer. It takes running text as input and returns a list of words based on a number of customized regular expressions, which help to delimit the text by whitespaces and punctuation marks, keeping common words like “U.S.A.” unsplit.

In [9]:
def tokenize(sent):
  word_list = []
  for word in word_tokenize(sent):
    word_list.append(word)
  return word_list

In [10]:
input = "What's the best way to split a sentence into words?"
print(tokenize(input))

['What', "'s", 'the', 'best', 'way', 'to', 'split', 'a', 'sentence', 'into', 'words', '?']


In [11]:
input = "I live in U.S.A country."
print(tokenize(input))

['I', 'live', 'in', 'U.S.A', 'country', '.']


##Step 3: Extract and normalize the features

Once the words are extracted from running text, you need to convert them into features. In particular, you need to put all words into lower case to make your algorithm establish the connection between different formats like “Lottery” and “lottery”.

Putting all strings to lower case can be achieved with Python’s string functionality. To extract the features (words) from the text, you need to iterate through the recognized words and put all words to lower case.

In [12]:
def get_features(text):
  features = {}
  word_list = [word for word in word_tokenize(text.lower())]
  # For each word in the email let’s switch on the ‘flag’ that this word is contained in the email
  for word in word_list:
    features[word] = True
  
  return features

In [13]:
# it will keep tuples containing the list of features matched with the “spam” or “ham” label for each email
all_features = [(get_features(email), label) for (email, label) in all_emails]

print(get_features("Participate In Our New Lottery NOW!"))

print(len(all_features))
print(len(all_features[0][0]))
print(len(all_features[99][0]))

{'participate': True, 'in': True, 'our': True, 'new': True, 'lottery': True, 'now': True, '!': True}
5172
29
18


With this bit of code, you iterate over the emails in your collection (all_emails) and store the list of features extracted from each email matched with the label.

For example, if a spam email consists of a single sentence “Participate In Our New Lottery NOW!” your algorithm will first extract the list of features present in this email and assign a ‘True’ value to each of them.

Then, the algorithm will add this list of features to
all_features together with the “spam” label.

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/02-your-first-nlp-example/images/1.png?raw=1' width='800'/>

Imagine your whole dataset contained only one spam text “Participate In Our New Lottery NOW!” and one ham text “Participate in the Staff Survey”. What features will be extracted from this dataset?

You will end up with the following feature set:

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/02-your-first-nlp-example/images/2.png?raw=1' width='800'/>

Let’s now clarify what each tuple structure representing an email contains. Tuples pair up two information fields: in this case a list of features extracted from the email and its label, i.e. each tuple in `all_features` contains a pair (`list_of_features`, `label`).

So if you’d like to access first email in the list, you call on `all_features[0]`, to access its list of features you use `all_features[0][0]`, and to access its label you use `all_features[0][1]`.

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/02-your-first-nlp-example/images/3.png?raw=1' width='800'/>





In [14]:
# access first email in the list with feature and label
all_features[0]

({'!': True,
  ',': True,
  '.': True,
  ':': True,
  'am': True,
  'and': True,
  'fl': True,
  'for': True,
  'friends': True,
  'from': True,
  'hey': True,
  'hi': True,
  'homepage': True,
  'i': True,
  'is': True,
  'jane': True,
  'last': True,
  'looking': True,
  'miami': True,
  'my': True,
  'name': True,
  'new': True,
  'photos': True,
  'see': True,
  'subject': True,
  'webcam': True,
  'weblog': True,
  'with': True,
  'you': True},
 'spam')

In [15]:
# access its list of features only
all_features[0][0]

{'!': True,
 ',': True,
 '.': True,
 ':': True,
 'am': True,
 'and': True,
 'fl': True,
 'for': True,
 'friends': True,
 'from': True,
 'hey': True,
 'hi': True,
 'homepage': True,
 'i': True,
 'is': True,
 'jane': True,
 'last': True,
 'looking': True,
 'miami': True,
 'my': True,
 'name': True,
 'new': True,
 'photos': True,
 'see': True,
 'subject': True,
 'webcam': True,
 'weblog': True,
 'with': True,
 'you': True}

In [16]:
# access its label only
all_features[0][1]

'spam'

##Step 4: Train the classifier

Next, let’s apply machine learning and teach the machine to distinguish between the features that describe each of the two classes. There are a number of classification algorithms that you can use, let’s start with one of the most interpretable ones – an algorithm called **Naïve Bayes**. Don’t be misled by the word “Naïve” in its title, though: despite relative simplicity of the approach compared to other ones, this algorithm often works well in practice
and sets a competitive performance baseline that is hard to beat with more sophisticated approaches.

Naïve Bayes is a probabilistic classifier, which means that it makes the class prediction based on the estimate of which outcome is most likely: i.e., it assesses the probability of an
email being spam and compares it with the probability of it being ham, and then selects the outcome that is most probable between the two.

In the previous step, you extracted the content of the email and converted it into a list of individual words (features). In this step, the machine will try to predict whether the email content represents spam or ham. 

In other words, it will try to predict whether the email is spam or ham given or conditioned on its content. This type of probability, when the outcome (class of “spam” or “ham”) depends on the condition (words used as features), is called conditional probability. 

For spam detection, you estimate `P(spam | email content)` and `P(ham | email content)`, or generally `P(outcome | (given) condition)`.Then you compare one estimate to another and return the most probable class.

```python
If P(spam | content) = 0.58 and P(ham | content) = 0.42, predict spam
If P(spam | content) = 0.37 and P(ham | content) = 0.63, predict ham
```

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/02-your-first-nlp-example/images/4.png?raw=1' width='800'/>

A machine can estimate the probability that an email is spam or ham conditioned on its content taking the number of times it has seen this content leading to a particular outcome.

```
P(spam | "Participate in our lottery now!") = (number of emails "Participate in our lottery now!" that are spam) / (total number of emails "Participate in our lottery now!", either spam or ham)

P(ham | "Participate in our lottery now!") = (number of emails "Participate in our lottery now!" that are ham) / (total number of emails "Participate in our lottery now!", either spam or ham)
```

In the general form, this can be expressed as:

```
P(outcome | condition) = number_of_times(condition led to outcome) number_of_times(condition applied)
```

Remember that you used tokenization to split long texts into separate words to let the algorithm access the smaller bits of information – words rather than whole sequences. The idea of estimating probabilities based on separate features rather than based on the whole sequence of features (whole text) is somewhat similar.

In the previous step, you converted this single text into a set of features as:

```['participate': True, 'in': True, …, 'now': True, '!': True]``` 

Note that the conditional probabilities like:

``` 
P(spam| "Participate in our lottery now!") and P(spam| ['participate': True,
‘in’: True, …, ‘now’: True, ‘!’: True])
```

are the same because this set of features encodes the text.

Is there a way to split this set to get at more fine-grained, individual probabilities, for example to establish a link between `[‘lottery’: True]` and the class of “spam”?

Unfortunately, there is no way to split the conditional probability estimation like `P(outcome | conditions)` when there are multiple conditions specified, however it is possible to split the probability estimation like `P(outcomes | condition)` when there is a single condition and multiple outcomes.

In spam detection, the class is a single value (it is “spam” or “ham”), while features are a set `([‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’:
True])`. If you can flip around the single value of class and the set of features in such a way that the class becomes the new condition and the features become the new outcomes, you can split the probability into smaller components and establish the link between individual features like `[‘lottery’: True]` and class values like “spam”.


<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/02-your-first-nlp-example/images/5.png?raw=1' width='800'/>

Luckily, there is a way to flip the outcomes (class) and conditions (features extracted from the content) around!

Let’s look into the estimation of conditional probabilities again: you
estimate the probability that the email is spam given that its content is “Participate in our new lottery now!” based on how often in the past an email with such content was spam. For that, you take the proportion of the times you have seen “Participate in our new lottery now!” in a spam email among the emails with this content.

```
P(spam | "Participate in our new lottery now!") = P("Participate in our new lottery now!" is used in a spam email) / P("Participate in our new lottery now!" is used in an email)
```

Similarly to how you estimated the probabilities above, you need the proportion of times you have seen “Participate in our new lottery now!” in a spam email among all spam emails.

```
P("Participate in our new lottery now!" | spam) = P("Participate in our new lottery now!" is used in a spam email) / P(an email is spam)
```

That is, every time you use conditional probabilities, you need to
divide how likely it is that you see the condition and outcome together by how likely it is that you see the condition on its own – this is the bit after |.

Now you can see that both Formulas 1 and 2 rely on how often you see particular content in an email of particular class. They share this bit, so you can use it to connect the two formulas. For instance, from Formula 2 you know that:

```
P("Participate in our new lottery now!" is used in a spam email) = P("Participate in our new lottery now!" | spam) * P(an email is spam)
```

Now you can fit this into Formula 1:

```
P(spam | "Participate in our new lottery now!") = P("Participate in our new lottery now!" is used in a spam email) / P("Participate in our new lottery now!" is used in an email) = [P("Participate in our new lottery now!" | spam) * P(an email is spam)] / P("Participate in our new lottery now!" is used in an email)
```

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/02-your-first-nlp-example/images/6.png?raw=1' width='800'/>

In the general form:

```
P(class | content) = P(content represents class) / P(content) = [P(content | class) * P(class)] / P(content)
```

In other words, you can express the probability of a class given email content via the probability of the content given the class.

Now you can replace the conditional probability of `P(class | content)` with `P(content | class)`, e.g. whereas before you had to calculate `P(“spam” | “Participate in our new lottery now!”)` or equally `P(“spam” | [‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True])`, which is hard to do because you will often end up with too few examples of exactly the same email content or exactly the same combination of features, now you can estimate `P([‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True] | “spam”)` instead.

But how does this solve the problem? Aren’t you still dealing with a long sequence of features?

Here is where the “naïve” assumption in Naïve Bayes helps: it assumes that the features are independent of each other, or that your chances of seeing a word “lottery” in an email are independent of seeing a word “new” or any other word in this email before. So you can estimate the probability of the whole sequence of features given a class as a product of probabilities of each feature given this class.

```
P([‘participate’: True, ‘in’: True, …, ‘now’: True, ‘!’: True] | “spam”) = P(‘participate’: True | “spam”) * P(‘in’: True | “spam”) … * P(‘!’: True | “spam”)
```

If you express `[‘participate’: True]` as the first feature in the feature list, or `f1`, `[‘in’: True]` as `f2`, and so on, until `fn = [‘!’: True]`, you can use the general formula:

```
P([f1, f2, …, fn] | class) = P(f1 | class) * P(f2| class) … * P(fn| class)
```

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/02-your-first-nlp-example/images/7.png?raw=1' width='800'/>

Now that you have broken down the probability of the whole feature list given class into the probabilities for each word given that class, how do you actually estimate them?

Since for each email you note which words occur in it, the total number of times you can switch on the flag `[‘feature’: True]` equals the total number of emails in that class, while the actual number of times you switch on this flag is the number of emails where this feature is actually
present. The conditional probability `P(feature | class)` is simply the proportion of the two:

```
P(feature | class) = number(emails in class with feature present) / total_number(emails in class)
```

These numbers are easy to estimate from the training data – let’s try to do that with an example.

>Suppose you have 5 spam emails and 10 ham emails. What are the conditional probabilities for P('prescription':True | spam), P('meeting':True | ham), P('stock':True | spam) and P('stock':True | ham), if:
- 2 spam emails contain word prescription
- 1 spam email contains word stock
- 3 ham emails contain word stock
- 5 ham emails contain word meeting

**Solution:**

The probabilities are simply:
- P('prescription':True | spam) = number(spam emails with 'prescription')/number(spam emails) = 2/5 = 0.40
- P('meeting':True | ham) = 5/10 = 0.50
- P('stock':True | spam) = 1/5 = 0.20
- P('stock':True | ham) = 3/10 = 0.30

Let’s iterate through the classification steps again: during the training phase, the algorithm learns prior class probabilities (this is simply
class distribution, e.g. `P(ham)=0.71 and P(spam)=0.29)` and probabilities for each feature given each of the classes (this is simply the proportion of emails with each feature in each class, e.g. `P(‘meeting’:True | ham) = 0.50)`. 

During test phase, or when the algorithm is applied to a new email and is asked to predict its class, the following comparison from the beginning of this section is applied:

```
Predict “spam” if P(spam | content) > P(ham | content) 
Predict “ham” otherwise
```

This is what we started with originally, but we said that the conditions are flipped, so it becomes:

```
Predict “spam” if P(content | spam) * P(spam) / P(content) > P(content | ham) * P(ham) / P(content)

Predict “ham” otherwise
```

Note that we end up with `P(content)` in denominator on both sides of the expression, so the absolute value of this probability doesn’t matter and it can be removed from the expression altogether. So we can simplify the expression as:

```
Predict “spam” if P(content | spam) * P(spam) > P(content | ham) * P(ham)
Predict “ham” otherwise
```

`P(spam)` and `P(ham)` are class probabilities estimated during training, and `P(content | class)`, using naïve independence assumption, are products of probabilities, so:

```
Predict “spam” if P([f1, f2, …, fn]| spam) * P(spam) > P([f1, f2, …, fn]| ham) * P(ham)
Predict “ham” otherwise
```

is split into the individual feature probabilities as:

```
Predict “spam” if P(f1 | spam) * P(f2| spam) … * P(fn| spam) * P(spam) > P(f1 | ham) * P(f2| ham) … * P(fn| ham) * P(ham)
Predict “ham” otherwise
```

This is the final expression the classifier relies on. The following code implements this idea.
Since Naïve Bayes is frequently used for NLP tasks, NLTK comes with its own
implementation, too, and here you are going to use it.





In [17]:
def train(features, proportion):
  train_size = int(len(features) * proportion)
  # Use the first n% (according to the specified proportion) of emails with their features for training, and the rest for testing
  train_set, test_set = features[: train_size], features[train_size:]
  print(f"Training set size = {str(len(train_set))} emails")
  print(f"Test set size = {str(len(test_set))} emails")

  classifier = NaiveBayesClassifier.train(train_set)

  return train_set, test_set, classifier

In [18]:
# Apply the train function using 80% (or a similar proportion) of emails for training. 
train_set, test_set, classifier = train(all_features, 0.8)

Training set size = 4137 emails
Test set size = 1035 emails


##Step 5: Evaluate your classifier

Finally, let’s evaluate how well the classifier performs in detecting whether an email is spam or ham. For that, let’s use the accuracy score returned by the NLTK’s classifier:

In [19]:
def evaluate(train_set, test_set, classifier):
  print(f"Accuracy on the training set = {str(classify.accuracy(classifier, train_set))}")
  print(f"Accuracy of the test set = {str(classify.accuracy(classifier, test_set))}")

  # inspect the most informative features (words). You need to specify the number of the top most informative features to look into, e.g. 50 here
  classifier.show_most_informative_features(50)

In [20]:
evaluate(train_set, test_set, classifier)

Accuracy on the training set = 0.95987430505197
Accuracy of the test set = 0.9497584541062802
Most Informative Features
               forwarded = True              ham : spam   =    204.3 : 1.0
                    2004 = True             spam : ham    =    141.9 : 1.0
                    2001 = True              ham : spam   =    130.8 : 1.0
            prescription = True             spam : ham    =    127.8 : 1.0
                     nom = True              ham : spam   =    125.5 : 1.0
                    pain = True             spam : ham    =    107.4 : 1.0
                     ect = True              ham : spam   =    106.9 : 1.0
                    spam = True             spam : ham    =     90.1 : 1.0
                  health = True             spam : ham    =     87.0 : 1.0
                featured = True             spam : ham    =     74.5 : 1.0
              nomination = True              ham : spam   =     73.9 : 1.0
                  differ = True             spam : ham 

As you can see, many spam emails in this dataset are related to medications,
which shows a particular bias – the most typical spam that you personally get might be on a different topic altogether! What effect might this mismatch between the training data from the publicly available dataset like Enron and your personal data have?

One other piece of information presented in this output is accuracy. Test accuracy shows the proportion of test emails that are correctly classified by Naïve Bayes among all test emails.

Note, that since the classifier is trained on the training data, it actually gets to “see” all the correct labels for the training examples. 

Shouldn’t it then know the correct answers and perform at 100% accuracy on the training data?

Well, the point here is that the classifier doesn’t just
retrieve the correct answers: during training it has built some probabilistic model (i.e., learned about the distribution of classes and the probability of different features), and then it applies this model to the data. So, it is actually very likely that the probabilistic model doesn’t capture all the things in the data 100% correctly.

Therefore, when you run the code above, you will get accuracy on the training data of `96.13%`. This is not perfect (i.e., not `100%`) but very close to it! When you apply the same classifier to new data – the test set that the classifier hasn’t seen during training – the accuracy reflects its generalizing ability. That is, it shows whether the probabilistic assumptions it made based on the training data can be successfully applied to any other data. The accuracy on the test set is `94.20%`, which is slightly lower than that on the training set, but is also very high.

Finally, if you’d like to gain any further insight into how the words are used in the emails from different classes, you can also check the occurrences of any particular word in all available contexts.

For example, word “stocks” features as a very strong predictor of spam
messages. Why is that? You might be thinking, “OK, some emails containing “stocks” will be spam, but surely there must be contexts where “stocks” is used in a completely harmless way?”

In [21]:
def concordance(data_list, search_word):
  for email in data_list:
    word_list = [word for word in word_tokenize(email.lower())]
    text_list = Text(word_list)

    """
    “Concordancer” is a tool that checks for the occurrences of the specified word and prints out the word in
    its context. By default, NLTK’s concordancer prints out the search_word surrounded by the previous
    36 and the following 36 characters – so note, that it doesn’t always result in full words
    """
    if search_word in word_list:
      text_list.concordance(search_word)

In [22]:
# Apply this function to two lists – ham_list and spam_list – to find out about the different contexts of use for the word “stocks”
print("STOCKS in HAM:")
concordance(ham_list, "stocks")
print("\n\nSTOCKS in SPAM:")
concordance(spam_list, "stocks")

STOCKS in HAM:
Displaying 1 of 1 matches:
ur member directory . * follow your stocks and news headlines , exchange files
Displaying 1 of 1 matches:
ur member directory . * follow your stocks and news headlines , exchange files
Displaying 1 of 1 matches:
ad my portfolio is diversified into stocks that have lost even more money than
Displaying 1 of 1 matches:
ur member directory . * follow your stocks and news headlines , exchange files


STOCKS in SPAM:
Displaying 5 of 5 matches:
5 where were you when the following stocks exploded : scos : exploded from . 3
d . 80 on friday . face it . little stocks can mean big gains for you . this r
might occur . as with many microcap stocks , today ' s company has additional 
his email pertaining to investing , stocks , securities must be understood as 
ntative before deciding to trade in stocks featured within this report . none 
Displaying 4 of 4 matches:
hree days . play of the week tracks stocks on downward trends , foresees botto
mark is our unc

If you run this code and print out the contexts for “stocks”, you will find out that “stocks” feature in only 4 ham contexts (e.g., an email reminder “Follow your stocks and news headlines”) as compared to hundreds of spam contexts including “Stocks to play”, “Big money was made in these stocks”, “Select gold mining stocks”, “Little stocks can mean big gains for you”, and so on.

##Deploying your spam filter in practice

For instance, the classifier that you’ve built performs at 94% accuracy, so
you can expect it to classify real emails into spam and ham quite accurately. It’s time to deploy it in practice then. When you run it on some new emails (perhaps, some from your own inbox) you need to perform the same steps on these emails as before, that is:

- you need to read them in, then
- you need to extract the features from these emails, and finally
- you need to apply the classifier that you trained before on these emails.

In [23]:
# Feel free to provide your own examples
test_spam_list = [
  "Participate in our new lottery!",
  "Try out this new medicine"
]
test_ham_list = [
  "See the minutes from the last meeting attached", 
  "Investors are coming to our office on Monday"
]

# Read the emails extracting their textual content and keeping the labels for further evaluation
test_emails = [(email_content, "spam") for email_content in test_spam_list]
test_emails += [(email_content, "ham") for email_content in test_ham_list]

# Extract the features
new_test_set = [(get_features(email), label) for (email, label) in test_emails]

# Apply the trained classifier and evaluate its performance
evaluate(train_set, new_test_set, classifier)

Accuracy on the training set = 0.95987430505197
Accuracy of the test set = 1.0
Most Informative Features
               forwarded = True              ham : spam   =    204.3 : 1.0
                    2004 = True             spam : ham    =    141.9 : 1.0
                    2001 = True              ham : spam   =    130.8 : 1.0
            prescription = True             spam : ham    =    127.8 : 1.0
                     nom = True              ham : spam   =    125.5 : 1.0
                    pain = True             spam : ham    =    107.4 : 1.0
                     ect = True              ham : spam   =    106.9 : 1.0
                    spam = True             spam : ham    =     90.1 : 1.0
                  health = True             spam : ham    =     87.0 : 1.0
                featured = True             spam : ham    =     74.5 : 1.0
              nomination = True              ham : spam   =     73.9 : 1.0
                  differ = True             spam : ham    =     71.3 :

The classifier that you’ve trained performs with 100% accuracy on these examples. Good! How can you print out the predicted label for each particular email though?

For that, you simply extract the features from the email content and print out the label, i.e. you don’t need to run the full evaluation with the accuracy calculation.


In [24]:
for email in test_spam_list:
  print(email)
  print(classifier.classify(get_features(email)))

for email in test_ham_list:
  print(email)
  print(classifier.classify(get_features(email)))

Participate in our new lottery!
spam
Try out this new medicine
spam
See the minutes from the last meeting attached
ham
Investors are coming to our office on Monday
ham


Let’s summarize what you have covered.

You have learned how build a classifier in five steps:

1. the emails should be read, and the two classes should be clearly defined for the machine to learn from.
2. the text content should be extracted.
3. then the content should be converted into features.
4. the classifier should be trained on the training set of the data.
5. finally, the classifier should be evaluated on the test set

There are a number of machine learning classifiers, and you’ve applied one of the most interpretable of them – Naïve Bayes. Naïve Bayes is a probabilistic
classifier: it assumes that the data in two classes is generated by different probability distributions, which are learned from the training data. Despite its simplicity and “naïve” feature independence assumption, Naïve Bayes often performs well in practice, and sets competitive baseline for other more sophisticated algorithms.



##Assignment

Apply the trained classifier to a different dataset, for example to `enron2/spam` and ham emails that originate with a different owner (check `Summary.txt` for more information). For that you need to:

- read the data from the `spam/` and `ham/` subfolders in `enron2/`
- extract the textual content and convert it into features
- evaluate the classifier

What do the results suggest? 

Hint: one man’s spam may be another man’s ham. If you are not satisfied with the results, try combining the data from the two owners in one dataset.

In [26]:
test_spam_list = read_files("enron2/spam/")
print(len(test_spam_list))
print(test_spam_list[0])

test_ham_list = read_files("enron2/ham/")
print(len(test_ham_list))
print(test_ham_list[0])

test_emails = [(email_content, "spam") for email_content in test_spam_list]
test_emails += [(email_content, "ham") for email_content in test_ham_list]

random.shuffle(test_emails)

new_test_set = [(get_features(email_content), label) for email_content, label in test_emails]

evaluate(train_set, new_test_set, classifier)

1496
Subject: help
Television in 1919 by seat to my knoweledge. Chrono cross in 1969
4361
Subject: re: eol
Clayton,
Great news. I would like to sit down with you, tom and stinson and review
Where
We are with this project. Also, I would like to talk to you about your
Status (finalizing
The transfer to another group).
Vince
Clayton vernon@ enron
01/18/2001 03: 21 pm
To: vasant shanbhogue/hou/ect@ ect
Cc: stinson gibner/hou/ect@ ect, vince j kaminski/hou/ect@ ect
Subject: eol
Vasant -
Dave delaney called an hour ago. He needed a statistic from eol that the eol
Folks couldn' t give him (it seems they had a database problem in 1999), and
The grapevine had it we had the data. Tom barkley was able to give him the
Data he needed for his presentation, within a matter of 10 minutes or so.
Clayton
Accuracy on the training set = 0.95987430505197
Accuracy of the test set = 0.7611405156223322
Most Informative Features
               forwarded = True              ham : spam   =    

Now, we will combine the two datasets.

In [27]:
spam_list = read_files("enron1/spam/") + read_files("enron2/spam/")
print(len(spam_list))

ham_list = read_files("enron1/ham/") + read_files("enron2/ham/")
print(len(ham_list))

all_emails = [(email_content, "spam") for email_content in spam_list]
all_emails += [(email_content, "ham") for email_content in ham_list]

random.shuffle(test_emails)

all_features = [(get_features(email_content), label) for email_content, label in all_emails]
print(len(all_features))

train_set, test_set, classifier = train(all_features, 0.8)
evaluate(train_set, new_test_set, classifier)

2996
8033
11029
Training set size = 8823 emails
Test set size = 2206 emails
Accuracy on the training set = 0.9818655786013828
Accuracy of the test set = 0.9820727334813044
Most Informative Features
                   meter = True              ham : spam   =    264.5 : 1.0
                   vince = True              ham : spam   =    200.0 : 1.0
                     nom = True              ham : spam   =    195.6 : 1.0
                     sex = True             spam : ham    =    195.1 : 1.0
            prescription = True             spam : ham    =    169.2 : 1.0
                     ect = True              ham : spam   =    167.7 : 1.0
                    spam = True             spam : ham    =    145.8 : 1.0
               forwarded = True              ham : spam   =    137.7 : 1.0
                     fyi = True              ham : spam   =    137.0 : 1.0
                    2005 = True             spam : ham    =    128.1 : 1.0
                   logos = True             spam : h