<a href="https://colab.research.google.com/github/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/02-your-first-nlp-example/1_implementing_own_spam_filter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Implementing your own spam filter

In this notebook, you use spam filtering as your practical NLP application because it is an example of a very widespread family of tasks: text classification. Text classification covers a number of applications, such as user profiling, sentiment analysis, and topic labeling, so this will give you a good start for the rest of the book.

First, let’s see what exactly classification addresses.

We humans apply classification in our everyday lives regularly: classifying things simply means that we try to put them into clearly defined groups, classes, or categories.

In fact, we tend to classify all sorts of things all the time. Here are some examples:

- based on our level of engagement and interest in a movie, we may classify it as interesting or boring;
- based on temperature, we classify water as cold or hot;
- based on the amount of sunshine, humidity, wind strength and air temperature, we classify the weather as good or bad;
- based on the number of wheels, we classify vehicles into unicycles, bicycles, tricycles, quadricycles, cars and so on;
- based on the presence of an engine, we may classify two-wheeled vehicles into bicycles and motorcycles.

Classification is useful because it makes it easier for us to reason about things and adjust our behavior accordingly.

When classifying things, we often go for simple contrasts – good vs. bad, interesting vs. boring, hot vs. cold. When we are dealing with two labels only, this is called binary classification.

Classification that involves more than two classes is called multi-class classification.
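To make the distinction concrete, here is a minimal sketch of two hand-written rules, one binary and one multi-class. The function names, the 30-degree threshold, and the "other" fallback label are illustrative assumptions, not part of the notebook.

```python
def classify_water(temperature_celsius, threshold=30.0):
    """Binary classification: exactly two possible labels."""
    # The threshold is an arbitrary, assumed cut-off between "cold" and "hot"
    return "hot" if temperature_celsius >= threshold else "cold"


def classify_vehicle(number_of_wheels):
    """Multi-class classification: more than two possible labels."""
    labels = {1: "unicycle", 2: "bicycle", 3: "tricycle", 4: "quadricycle"}
    # Anything outside the known wheel counts falls into an assumed "other" class
    return labels.get(number_of_wheels, "other")


print(classify_water(12.5))   # cold
print(classify_water(65.0))   # hot
print(classify_vehicle(3))    # tricycle
```

A spam filter is a binary classifier in this sense: every email ends up with one of two labels, spam or ham.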



##Setup

In [1]:
import os
import codecs
import random

import nltk
from nltk import word_tokenize
from nltk import NaiveBayesClassifier, classify
from nltk.text import Text

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
%%shell

wget -qq https://github.com/ekochmar/Essential-NLP/raw/master/enron1.zip
unzip -qq enron1.zip



##Step 1: Define the data and classes

The Enron email dataset is a large collection of emails (the original dataset contains about 0.5M messages), including both ham and spam emails, from about 150 users, mostly senior management at Enron.

We are going to use the enron1/ folder for training. All folders in the Enron dataset contain spam and ham emails in separate subfolders, so you don’t need to worry about pre-defining them. Each email is stored as a text file in these subfolders.

Let’s read in the contents of the text files in each subfolder, store the spam emails and the ham emails in two separate data structures, and point our algorithm at each, clearly defining which one is spam and which one is ham.

In [4]:
def read_files(folder):
  a_list = []
  for a_file in os.listdir(folder):
    # Skip hidden files (names starting with "."), which some operating systems create automatically
    if not a_file.startswith("."):
      # os.path.join works whether or not folder ends with a path separator
      path = os.path.join(folder, a_file)
      # The with-block closes the file even if reading fails
      with codecs.open(path, "r", encoding="ISO-8859-1", errors="ignore") as file:
        a_list.append(file.read())
  return a_list

Now you can define two such lists, spam_list and ham_list, letting the machine know what data to use as examples of spam emails and what data represents ham emails.

In [5]:
spam_list = read_files("enron1/spam/")
ham_list = read_files("enron1/ham/")

# Check the lengths of the lists: for spam it should be 1500 and for ham – 3672
print(len(spam_list))
print(len(ham_list))

1500
3672
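As a variation on the reading step, the sketch below uses `pathlib` and sorts the file names, because `os.listdir` returns entries in arbitrary order. It runs on a throwaway directory that only mimics a spam subfolder; the function name `read_folder` and the file names are made up for illustration.

```python
import tempfile
from pathlib import Path


def read_folder(folder):
    """Return the text of every non-hidden file in folder, in sorted name order."""
    contents = []
    for path in sorted(Path(folder).iterdir()):
        # Skip subdirectories and hidden files such as .DS_Store
        if path.is_file() and not path.name.startswith("."):
            contents.append(path.read_text(encoding="ISO-8859-1", errors="ignore"))
    return contents


# Build a tiny, made-up folder that mimics the spam subfolder layout
with tempfile.TemporaryDirectory() as root:
    spam_dir = Path(root) / "spam"
    spam_dir.mkdir()
    (spam_dir / "0001.txt").write_text("Subject: win a prize", encoding="ISO-8859-1")
    (spam_dir / ".DS_Store").write_text("hidden", encoding="ISO-8859-1")
    toy_spam = read_folder(spam_dir)

print(toy_spam)  # ['Subject: win a prize']
```

Sorting makes the order of the resulting list independent of the file system, which matters once you rely on a seeded shuffle for reproducibility.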


Let's check out the contents of the first entry of each list. In both cases, it should match the contents of the first file that `os.listdir` returned for the corresponding subfolder.

In [6]:
print(spam_list[0])

Subject: alert: spam prevention
R 3 mov 3
Sll 08
8721 santa monica blvd #1106
Santa monica, ca 90069
Diat corroteras imizattrys valing
 shinteraysion paral furill bakize ficknessive
Rac flicaldeplansgrant ass



In [7]:
print(ham_list[0])

Subject: re: license
To all:
Well, let me tell it to you like this! I will never ever ever have
To study for another fricken exam! Hell yes I passed that s. O. B. I about
Died when I opened the envelope b/c I couldn' t believe it! I am now rusty
R. Glover, p. E.! I probably won' t know you next time I see you! My head
Was so big this morning I could barely make it through the door!
Later,
Rusty r. Glover, p. E.
- - - - - original message - - - - -
From: camille davis [mailto: kcdavis@ pdq. Net]
Sent: monday, january 31, 2000 1: 59 pm
To: glover, rusty
Subject: license
Did you past your test???
Camille
- attl. Htm


Next, you’ll need to preprocess the data (e.g., by splitting text strings into words) and extract the features.

Finally, remember that you will need to split the data randomly into training and test sets.

Let’s shuffle the list of emails together with their labels, and make sure that the shuffle is reproducible by fixing the seed of the random number generator:

In [8]:
# For each member of spam_list and ham_list, store a tuple with the content and the associated label
all_emails = [(email_content, "spam") for email_content in spam_list]
all_emails += [(email_content, "ham") for email_content in ham_list]

# Fixing the seed of the random number generator makes all future runs shuffle the data in the same way
random.seed(2020)
random.shuffle(all_emails)

# The size should be equal to 1500 + 3672 = 5172
print(f"Dataset size = {len(all_emails)} emails")

Dataset size = 5172 emails
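The effect of fixing the seed, and the train/test split mentioned above, can be sketched on toy data. This is only an illustration: `toy_emails` is made up, the helper uses a dedicated `random.Random` instance rather than the global `random.seed` call from the cell above, and the 80/20 ratio is an assumption.

```python
import random

# Hypothetical toy data standing in for all_emails
toy_emails = [(f"email {i}", "spam" if i % 3 == 0 else "ham") for i in range(10)]


def shuffled_copy(data, seed):
    """Return a shuffled copy of data using a dedicated, seeded generator."""
    rng = random.Random(seed)
    copy = list(data)
    rng.shuffle(copy)
    return copy


first_run = shuffled_copy(toy_emails, seed=2020)
second_run = shuffled_copy(toy_emails, seed=2020)
print(first_run == second_run)  # True: the same seed gives the same order

# Assumed 80/20 split: the first 80% of the shuffled data for training, the rest for testing
split_point = int(len(first_run) * 0.8)
train_set, test_set = first_run[:split_point], first_run[split_point:]
print(len(train_set), len(test_set))  # 8 2
```

Because the data is shuffled before it is cut, both parts contain a random mix of spam and ham rather than all spam followed by all ham.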


##Step 2: Split the text into words

Remember that the email contents you have read in so far each come as a single string of characters. The first step of text preprocessing is to split this running text into words.

You are going to use NLTK’s tokenizer. It takes running text as input and returns a list of words. It relies on a number of customized regular expressions that delimit the text by whitespace and punctuation marks while keeping abbreviations like “U.S.A.” unsplit.

In [9]:
def tokenize(sent):
  # word_tokenize already returns a list of tokens, so no explicit loop is needed
  return word_tokenize(sent)

In [10]:
# Use the name "text" rather than "input" to avoid shadowing Python's built-in input()
text = "What's the best way to split a sentence into words?"
print(tokenize(text))

['What', "'s", 'the', 'best', 'way', 'to', 'split', 'a', 'sentence', 'into', 'words', '?']


In [11]:
# Use the name "text" rather than "input" to avoid shadowing Python's built-in input()
text = "I live in U.S.A country."
print(tokenize(text))

['I', 'live', 'in', 'U.S.A', 'country', '.']
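To see why a tokenizer is more than splitting on whitespace, compare `str.split` with a small regular-expression tokenizer. The pattern below is an assumption made for illustration and is not how NLTK's `word_tokenize` is implemented; for example, it would keep "What's" as one token, whereas NLTK splits off "'s".

```python
import re

# Assumed pattern: a run of word characters that may contain internal periods
# or apostrophes (so "U.S.A" stays whole), or a single punctuation character
TOKEN_PATTERN = re.compile(r"\w+(?:[.']\w+)*|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens with a regular expression."""
    return TOKEN_PATTERN.findall(text)


sentence = "I live in U.S.A country."
print(sentence.split())
# ['I', 'live', 'in', 'U.S.A', 'country.']
print(simple_tokenize(sentence))
# ['I', 'live', 'in', 'U.S.A', 'country', '.']
```

Whitespace splitting leaves the final period glued to "country.", so "country" and "country." would count as different words. Separating punctuation avoids that.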


##Step 3: Extract and normalize the features

Once the words are extracted from the running text, you need to convert them into features. In particular, you need to convert all words to lowercase so that your algorithm treats different spellings like “Lottery” and “lottery” as the same word.

Lowercasing can be done with Python’s built-in string method `lower()`. To extract the features (words) from the text, you tokenize the lowercased text and iterate through the resulting words.

In [13]:
def get_features(text):
  features = {}
  word_list = word_tokenize(text.lower())
  # For each word in the email let’s switch on the ‘flag’ that this word is contained in the email
  for word in word_list:
    features[word] = True

  return features

In [14]:
# Pair the features extracted from each email with its “spam” or “ham” label
all_features = [(get_features(email), label) for (email, label) in all_emails]

print(get_features("Participate In Our New Lottery NOW!"))

print(len(all_features))
print(len(all_features[0][0]))
print(len(all_features[99][0]))

{'participate': True, 'in': True, 'our': True, 'new': True, 'lottery': True, 'now': True, '!': True}
5172
33
18


With this bit of code, you iterate over the emails in your collection (all_emails) and store the features extracted from each email, as a dictionary, together with the email’s label.

For example, if a spam email consists of the single sentence “Participate In Our New Lottery NOW!”, your algorithm will first extract the features present in this email and assign a ‘True’ value to each of them.

Then, the algorithm will add these features to all_features together with the “spam” label.

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/02-your-first-nlp-example/images/1.png?raw=1' width='800'/>

Imagine your whole dataset contained only one spam text “Participate In Our New Lottery NOW!” and one ham text “Participate in the Staff Survey”. What features will be extracted from this dataset?

You will end up with the following feature set:

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/02-your-first-nlp-example/images/2.png?raw=1' width='800'/>
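The two-email dataset described above can be reproduced in a few lines. This is a sketch under assumptions: `toy_features` uses a simple regular expression as a stand-in for `word_tokenize`, so it runs without NLTK, and the variable names are made up.

```python
import re


def toy_features(text):
    """Map each lowercased token to True (the regex is a stand-in for word_tokenize)."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return {token: True for token in tokens}


toy_dataset = [
    (toy_features("Participate In Our New Lottery NOW!"), "spam"),
    (toy_features("Participate in the Staff Survey"), "ham"),
]

# All distinct features seen across both emails
vocabulary = sorted(set().union(*(features for features, _ in toy_dataset)))
print(vocabulary)
# ['!', 'in', 'lottery', 'new', 'now', 'our', 'participate', 'staff', 'survey', 'the']

# Features that occur in both the spam and the ham email
shared = set(toy_dataset[0][0]) & set(toy_dataset[1][0])
print(sorted(shared))
# ['in', 'participate']
```

Thanks to lowercasing, “In” and “in” count as the same feature. The shared words cannot help tell the two classes apart in this tiny dataset, while words like “lottery” and “survey” can.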

Let’s now clarify what each tuple representing an email contains. A tuple pairs up two pieces of information: in this case, the features extracted from the email (stored as a dictionary that maps each word to `True`) and the email’s label. In other words, each tuple in `all_features` is a pair (`list_of_features`, `label`).

So if you’d like to access the first email in the list, you use `all_features[0]`; to access its features, you use `all_features[0][0]`; and to access its label, you use `all_features[0][1]`.

<img src='https://github.com/rahiakela/machine-learning-algorithms/blob/main/Essential-NLP/02-your-first-nlp-example/images/3.png?raw=1' width='800'/>





In [15]:
# Access the first email in the list: its features together with its label
all_features[0]

({'.': True,
  '60': True,
  '80': True,
  ':': True,
  'all': True,
  'are': True,
  'as': True,
  'assists': True,
  'at': True,
  'direct': True,
  'economise': True,
  'equivalents': True,
  'expensive': True,
  'lower': True,
  'medications': True,
  'more': True,
  'much': True,
  'on': True,
  'or': True,
  'our': True,
  'people': True,
  'percents': True,
  'pharmacy': True,
  'prescriptions': True,
  'prices': True,
  'recipes': True,
  'retail': True,
  'subject': True,
  'test': True,
  'than': True,
  'which': True,
  'with': True,
  'worths': True},
 'spam')

In [17]:
# access its list of features only
all_features[0][0]

{'.': True,
 '60': True,
 '80': True,
 ':': True,
 'all': True,
 'are': True,
 'as': True,
 'assists': True,
 'at': True,
 'direct': True,
 'economise': True,
 'equivalents': True,
 'expensive': True,
 'lower': True,
 'medications': True,
 'more': True,
 'much': True,
 'on': True,
 'or': True,
 'our': True,
 'people': True,
 'percents': True,
 'pharmacy': True,
 'prescriptions': True,
 'prices': True,
 'recipes': True,
 'retail': True,
 'subject': True,
 'test': True,
 'than': True,
 'which': True,
 'with': True,
 'worths': True}

In [18]:
# access its label only
all_features[0][1]

'spam'
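Besides indexing, a (features, label) pair can be taken apart with tuple unpacking, which is often easier to read in loops. The sketch below uses a small made-up list, `toy_all_features`, in place of the real `all_features`.

```python
from collections import Counter

# Hypothetical stand-in for all_features with three tiny emails
toy_all_features = [
    ({"participate": True, "lottery": True}, "spam"),
    ({"participate": True, "survey": True}, "ham"),
    ({"staff": True, "survey": True}, "ham"),
]

# Unpack the first pair instead of indexing with [0][0] and [0][1]
first_features, first_label = toy_all_features[0]
print(first_label, sorted(first_features))  # spam ['lottery', 'participate']

# Unpacking inside a loop: count how many emails carry each label
label_counts = Counter(label for _, label in toy_all_features)
print(label_counts["spam"], label_counts["ham"])  # 1 2
```

Counting labels this way is a quick check of how balanced the two classes are before training.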

##Step 4: Train the classifier