
##Spam Filtering

To build a machine-learning classifier for spam detection, you need to give the algorithm a sufficient number of labeled examples of both spam and ham (legitimate, non-spam) emails.

The best way to build such a classifier would be to collect your own ham and spam emails and train the algorithm to detect what you personally would consider spam. This notebook instead uses the `enron1` dataset, which is downloaded in the setup below.

##Setup

In [1]:
import os
import codecs
import random

import nltk
from nltk import word_tokenize
from nltk import NaiveBayesClassifier, classify
from nltk.text import Text

In [None]:
%%shell

wget https://github.com/rahiakela/nlp-research-and-practice/raw/main/getting-started-with-nlp/datasets/enron1.zip

unzip enron1.zip
rm -rf enron1.zip

##Step 1: Define the data and classes

In [3]:
def read_in(folder):
  """Return the text of every non-hidden file in `folder` as a list of strings."""
  a_list = []
  for a_file in os.listdir(folder):
    # skip hidden files
    if not a_file.startswith("."):
      # read the contents of each file; the with-block closes it even if reading fails
      path = os.path.join(folder, a_file)
      with codecs.open(path, "r", encoding="ISO-8859-1", errors="ignore") as f:
        a_list.append(f.read())
  return a_list
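
One caveat worth knowing: `os.listdir` returns entries in arbitrary order, so the order of the resulting list (and therefore the result of a seeded shuffle later on) can differ between machines. The sketch below is a hypothetical variant, `read_in_sorted`, which is not part of the original notebook; it assumes `pathlib` is acceptable and sorts the file names before reading. It is exercised on a throwaway folder rather than on the real dataset.

```python
import tempfile
from pathlib import Path

def read_in_sorted(folder):
  """Hypothetical variant of read_in: same kind of output, but in sorted file-name order."""
  texts = []
  for path in sorted(Path(folder).iterdir()):
    # skip hidden files and anything that is not a regular file
    if path.name.startswith(".") or not path.is_file():
      continue
    texts.append(path.read_text(encoding="ISO-8859-1", errors="ignore"))
  return texts

# Try it on a temporary folder with two emails and one hidden file
with tempfile.TemporaryDirectory() as tmp:
  Path(tmp, "b.txt").write_text("Subject: second", encoding="ISO-8859-1")
  Path(tmp, "a.txt").write_text("Subject: first", encoding="ISO-8859-1")
  Path(tmp, ".hidden").write_text("ignored", encoding="ISO-8859-1")
  demo = read_in_sorted(tmp)

print(demo)
```

The hidden file is skipped and `a.txt` is read before `b.txt`, regardless of the order in which the files were created.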

In [4]:
# verify that the data is uploaded and read in correctly
spam_list = read_in("enron1/spam/")
print(len(spam_list))
print(spam_list[0])

1500
Subject: check it out
Hello,
If you want that roooock harrrrd john son
Check out the first and original one on the market... Don' t be fooled by imitations and copy - cats.
Later,
Jordan



In [5]:
ham_list = read_in("enron1/ham/")
print(len(ham_list))
print(ham_list[0])

3672
Subject: natural gas nomination for december 2000 - - r e v I s I o n #2
Please revise the natural gas nomination for the mtbe plant for december 2000
As follows:
10, 500 mmbtu for the entire month of december.
- - - - - - - - - - - - - - - - - - - - - - forwarded by michael mitcham/gpgfin/enron on
11/29/2000 08: 43 am - - - - - - - - - - - - - - - - - - - - - - - - - - -
Maritta mullet
11/27/2000 05: 08 pm
To: david bush/ecf/enron@ enron, mark diedrich/gpgfin/enron@ enron, steven m
Elliott/hou/ect@ ect, daren j farmer/hou/ect@ ect, paul fox/ecf/enron@ enron,
David m johnson/ecf/enron@ enron, robert e lee/hou/ect@ ect, anita
Luong/hou/ect@ ect, gregg lenart/hou/ect@ ect, thomas meers/gpgfin/enron@ enron,
Michael mitcham/gpgfin/enron@ enron, john l nowlan/hou/ect@ ect, lee l
Papayoti/hou/ect@ ect, james prentice/gpgfin/enron@ enron, kerry
Roper/gpgfin/enron@ enron, sally shuler/gpgfin/enron@ enron
Cc:
Subject: natural gas nomination for december 2000 - - r e v I s I

In [6]:
random.seed(42)

# combine the data into a single structure
all_emails = [(email_content, "spam") for email_content in spam_list]
all_emails += [(email_content, "ham") for email_content in ham_list]

random.shuffle(all_emails)
print(f"Dataset size = {len(all_emails)} emails")

Dataset size = 5172 emails
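
Before training anything, it can help to look at how the two classes are balanced, since that sets the accuracy a trivial "always predict the majority class" rule would reach. The snippet below is a minimal sketch, not part of the original notebook: `all_emails_demo` is a stand-in list built with the class counts printed above (1500 spam, 3672 ham) and placeholder email bodies, so that it runs without the dataset. In the notebook itself the same lines could be applied to `all_emails`.

```python
from collections import Counter

# Stand-in for all_emails: same class counts as the outputs above,
# with placeholder strings instead of real email text (an assumption for this sketch).
all_emails_demo = [("...", "spam")] * 1500 + [("...", "ham")] * 3672

# Count how many emails carry each label
label_counts = Counter(label for _, label in all_emails_demo)

# Accuracy of a rule that always predicts the most frequent label
majority_label, majority_count = label_counts.most_common(1)[0]
baseline_accuracy = majority_count / len(all_emails_demo)

print(label_counts)
print(f"Always predicting '{majority_label}' scores {baseline_accuracy:.1%}")
```

With these counts the majority class is ham, at roughly 71% of the data, so a classifier is only useful if it clearly beats that figure.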


##Step 2: Split the text into words