<a href="https://colab.research.google.com/github/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/02-your-first-nlp-example/1_implementing_own_spam_filter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Implementing your own spam filter

In this notebook, you use spam filtering as your practical NLP application because it is an example of a very widespread family of tasks: text classification. Text classification covers a number of applications, for example user profiling, sentiment analysis and topic labeling, so this will give you a good start for the rest of the book.

First, let’s see what exactly classification addresses.

We humans apply classification in our everyday lives pretty regularly: classifying things simply means that we try to put them into clearly defined groups, classes or categories.

In fact, we tend to classify all sorts of things all the time. Here are some examples:

- based on our level of engagement and interest in a movie, we may classify it as interesting or boring;
- based on temperature, we classify water as cold or hot;
- based on the amount of sunshine, humidity, wind strength and air temperature, we classify the weather as good or bad;
- based on the number of wheels, we classify vehicles into unicycles, bicycles, tricycles, quadricycles, cars and so on;
- based on the presence of an engine, we may classify two-wheeled vehicles into bicycles and motorcycles.

Classification is useful because it makes it easier for us to reason about things and adjust our behavior accordingly.

When classifying things, we often go for simple contrasts – good vs. bad, interesting vs. boring, hot vs. cold. When we are dealing with two labels only, this is called binary classification.

Classification that involves more than two classes is called multi-class classification.
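
The difference between the two settings can be illustrated with a minimal sketch based on the everyday examples above. The function names, the 40-degree threshold and the fallback label "other" are illustrative assumptions, not part of the original notebook:

```python
def classify_water(temperature_celsius, threshold=40.0):
    """Binary classification: exactly two possible labels.

    The threshold is an arbitrary value chosen for illustration.
    """
    return "hot" if temperature_celsius >= threshold else "cold"


# Multi-class classification: more than two possible labels.
WHEELS_TO_VEHICLE = {
    1: "unicycle",
    2: "bicycle",
    3: "tricycle",
    4: "quadricycle",
}


def classify_vehicle(num_wheels):
    """Map a wheel count to a vehicle class, with a catch-all label."""
    return WHEELS_TO_VEHICLE.get(num_wheels, "other")


print(classify_water(80))   # one of two labels
print(classify_vehicle(3))  # one of several labels
```

Spam filtering falls into the first group: every email receives one of two labels, spam or ham.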



##Setup

In [1]:
import os
import codecs
import random

import nltk
from nltk import word_tokenize
from nltk import NaiveBayesClassifier, classify
from nltk.text import Text

In [None]:
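# word_tokenize relies on the Punkt tokenizer models
# (recent NLTK releases may additionally require nltk.download('punkt_tab'))
nltk.download('punkt')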

In [2]:
%%shell

wget -qq https://github.com/ekochmar/Essential-NLP/raw/master/enron1.zip
unzip -qq enron1.zip



##Step 1: Define the data and classes

The Enron email dataset is a large collection of emails (the original dataset contains about 0.5M messages), including both ham and spam, from about 150 users, mostly senior management of Enron.

We are going to use the enron1/ folder for training. All folders in the Enron dataset contain spam and ham emails in separate subfolders, so you don’t need to worry about pre-defining them. Each email is stored as a text file in these subfolders.

Let’s read in the contents of the text files in each subfolder, store the spam emails and the ham emails as two separate data structures, and point our algorithm at each, clearly defining which one is spam and which one is ham.

In [3]:
def read_files(folder):
  files = os.listdir(folder)
  a_list = []
  for a_file in files:
    # Skip hidden files that are sometimes automatically created by the operating system; their names start with "."
    if not a_file.startswith("."):
      # os.path.join builds a valid path whether or not folder ends with a slash
      path = os.path.join(folder, a_file)
      # The with statement closes the file even if reading fails
      with codecs.open(path, "r", encoding="ISO-8859-1", errors="ignore") as f:
        a_list.append(f.read())
  return a_list
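
To see the hidden-file check in action without downloading the dataset, here is a small sketch that builds a throwaway folder and reads it back. It uses a hypothetical `pathlib`-based variant, `read_files_pathlib`, which is not part of the original notebook; unlike `read_files`, it sorts the file names so that the order of the result is deterministic:

```python
import tempfile
from pathlib import Path


def read_files_pathlib(folder):
    """Hypothetical pathlib-based variant of read_files (illustration only)."""
    texts = []
    # sorted() fixes the order; os.listdir gives no ordering guarantee
    for path in sorted(Path(folder).iterdir()):
        # Skip subdirectories and hidden files such as ".DS_Store"
        if path.is_file() and not path.name.startswith("."):
            texts.append(path.read_text(encoding="ISO-8859-1", errors="ignore"))
    return texts


# Build a temporary folder that mimics one subfolder of the dataset
with tempfile.TemporaryDirectory() as tmp:
    spam_dir = Path(tmp) / "spam"
    spam_dir.mkdir()
    (spam_dir / "0001.txt").write_text("Subject: win a prize", encoding="ISO-8859-1")
    (spam_dir / ".DS_Store").write_text("hidden", encoding="ISO-8859-1")
    demo = read_files_pathlib(spam_dir)

print(demo)  # only the visible file is read
```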

Now you can define two such lists – spam_list and ham_list, letting the machine know what data to use as examples of spam emails and what data represents ham emails.

In [4]:
spam_list = read_files("enron1/spam/")
ham_list = read_files("enron1/ham/")

# Check the lengths of the lists: it should be 1500 for spam and 3672 for ham
print(len(spam_list))
print(len(ham_list))

1500
3672
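
These two counts already say something about the class balance. As a quick sketch (the variable names are illustrative and the counts are copied from the output above), the share of each class can be computed as follows:

```python
# Counts taken from the output above
spam_count, ham_count = 1500, 3672
total = spam_count + ham_count

spam_share = spam_count / total
ham_share = ham_count / total

print(f"spam: {spam_share:.1%}, ham: {ham_share:.1%}")
```

The classes are not balanced: roughly 29% of the emails are spam and 71% are ham, which is worth keeping in mind when judging a classifier's accuracy later on.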


Let's check out the contents of the first entry in each list. It should match the contents of the first file read from the corresponding subfolder.

In [5]:
print(spam_list[0])

Subject: re: clement
Hassan



In [6]:
print(ham_list[0])

Subject: enron/hpl actuals for sept. 28, 2000
Teco tap 60. 000/enron; 90. 000. Hpl gas daily


Next, you’ll need to preprocess the data (e.g., by splitting text strings into words) and extract the features.

Finally, remember that you will need to split the data randomly into the
training and test sets. 

Let’s shuffle the resulting list of emails with their labels, and make sure that the shuffle is reproducible by fixing the seed of the random number generator:

In [7]:
# For each member of spam_list and ham_list, store a tuple with the content and the associated label
all_emails = [(email_content, "spam") for email_content in spam_list]
all_emails += [(email_content, "ham") for email_content in ham_list]

# By defining the seed for the random operator you can make sure that all future runs will shuffle the data in the same way
random.seed(2020)
random.shuffle(all_emails)

# It should be equal to 1500 + 3672 = 5172
print(f"Dataset size = {len(all_emails)} emails")

Dataset size = 5172 emails
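
The effect of fixing the seed, and the train/test split mentioned above, can be sketched on toy data. Everything here is an illustrative assumption: the toy emails, the helper `shuffled_copy`, and the 80/20 ratio are not taken from the original notebook.

```python
import random

# Toy stand-in for all_emails: (content, label) tuples
toy_emails = [(f"email {i}", "spam" if i % 3 == 0 else "ham") for i in range(10)]


def shuffled_copy(data, seed):
    """Return a shuffled copy of data using a generator seeded with seed."""
    rng = random.Random(seed)
    copy = list(data)
    rng.shuffle(copy)
    return copy


# The same seed produces the same order on every run
first_run = shuffled_copy(toy_emails, seed=2020)
second_run = shuffled_copy(toy_emails, seed=2020)
print(first_run == second_run)

# Assumed 80/20 split: the first 80% for training, the rest for testing
split_point = int(len(first_run) * 0.8)
train_set = first_run[:split_point]
test_set = first_run[split_point:]
print(len(train_set), len(test_set))
```

Because the data is shuffled before slicing, both sets contain a random mix of spam and ham rather than one class followed by the other.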


##Step 2: Split the text into words

Remember that the email contents you’ve read in so far each come as a single string of characters. The first step of text preprocessing involves splitting the running text into words.

You are going to use NLTK’s tokenizer. It takes running text as input and returns a list of words based on a number of customized regular expressions, which help to delimit the text by whitespace and punctuation marks while keeping abbreviations like “U.S.A.” unsplit.

In [8]:
def tokenize(sent):
  # word_tokenize already returns a list of tokens, so no extra loop is needed
  return word_tokenize(sent)

In [11]:
# Use a name other than input to avoid shadowing Python's built-in input()
text = "What's the best way to split a sentence into words?"
print(tokenize(text))

['What', "'s", 'the', 'best', 'way', 'to', 'split', 'a', 'sentence', 'into', 'words', '?']


In [12]:
# Use a name other than input to avoid shadowing Python's built-in input()
text = "I live in U.S.A country."
print(tokenize(text))

['I', 'live', 'in', 'U.S.A', 'country', '.']
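
To get a feel for why a regular-expression tokenizer does better than plain whitespace splitting, here is a deliberately simplified sketch that uses only the standard library. The pattern and the function `simple_tokenize` are hypothetical stand-ins for illustration; NLTK's actual rules are considerably more elaborate.

```python
import re

# Simplified pattern (NOT NLTK's real rules). Alternatives are tried in order:
#   1. dotted abbreviations such as "U.S.A."
#   2. runs of word characters
#   3. any single punctuation character
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}|\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word, abbreviation and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


sentence = "Is this spam?"

# Whitespace splitting leaves the question mark glued to the last word
print(sentence.split())

# The regular expression separates the punctuation mark
print(simple_tokenize(sentence))

# The dotted abbreviation stays in one piece
print(simple_tokenize("I live in the U.S.A. now"))
```

Keeping punctuation as separate tokens matters for feature extraction: with whitespace splitting, "spam" and "spam?" would count as two different words.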


##Step 3: Extract and normalize the features