## Load Data

In this step, we load the data from `data/SpamDetectionData.txt`. Each line in the file is a labeled sample in this format:

*{spam_or_ham},{email_text}*

The first part is the label that identifies whether the email is spam or ham (not spam), followed by the email text. For example:

`Spam,<p>But few feere in nor revellers in pride the a. Ear fathers yes begun revellers blazon one but not of take high. In had his her satiety alone fulness he sins perchance in thence climes nine scorching weary drugged...`
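
As a minimal sketch of how one such line splits into a label and a text, assuming the label itself never contains a comma (the helper name `parse_line` is hypothetical and is not part of the notebook):

```python
def parse_line(line):
    """Split one 'Label,text' line into (label, text); return None if it is not a sample."""
    # partition splits on the first comma only, so commas inside the email text are preserved
    label, sep, text = line.partition(',')
    if sep and label in ('Spam', 'Ham'):
        return (1 if label == 'Spam' else 0), text
    return None

print(parse_line('Spam,<p>His honeyed and land, and more'))
print(parse_line(''))
```

Returning `None` for anything that is not a valid sample mirrors the idea of skipping markers and empty lines.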

We will load all the samples into two lists, one for the email text (features) and one for the labels:

In [221]:
import random

def read_file(path):
    """
    read and return all data in a file
    """
    with open(path, 'r') as f:
        return f.read()

def load_data():
    """
    load and return the data in features and labels lists
    each item in features contains the raw email text
    each item in labels identifies whether the corresponding item in features is spam (1) or ham (0)
    """
    # load all data from file
    data_path = "data/SpamDetectionData.txt"
    all_data = read_file(data_path)
    
    # split the data into lines, each line is a single sample
    all_lines = all_data.split('\n')

    # each line in the file is a sample with the following format:
    # it begins with either "Spam," or "Ham,", followed by the actual text of the email
    # e.g. Spam,<p>His honeyed and land....
    
    # extract the feature (email text) and label (spam or ham) from each line
    features = []
    labels = []
    for line in all_lines:
        if line.startswith('Spam,'):
            labels.append(1)
            features.append(line[len('Spam,'):])
        elif line.startswith('Ham,'):
            labels.append(0)
            features.append(line[len('Ham,'):])
        # any other line (markers, empty lines) is not a valid sample and is skipped
    
    return features, labels
    
features, labels = load_data()

print("total no. of samples: {}".format(len(labels)))
print("total no. of spam samples: {}".format(labels.count(1)))
print("total no. of ham samples: {}".format(labels.count(0)))

print("\nPrint a random sample for inspection:")
random_sample = random.randrange(len(labels))  # randint(0, n) is inclusive and could return n, an invalid index
print("example feature: {}".format(features[random_sample]))
print("example label: {} ({})".format(labels[random_sample], 'spam' if labels[random_sample] else 'ham'))

total no. of samples: 2100
total no. of spam samples: 1043
total no. of ham samples: 1057

Print a random sample for inspection:
example feature: <p>Mood for thence sorrow the before and me are love suffice festal as suits. Harolds not care rill tis ways ah to. They bliss and to he below bade left. Mine none none from worse worse but these but was ye neer whateer. Loved and nor. Along him shamed and nor aisle strength he adversity his monks loved his power was companie. All pollution hall was was riot in true been will ever his so. That at of perchance to and would was degree clay pleasure hope few. Oh all there was would eremites. Dwelt for plain whence earthly a feere by den for go seraphs youth day. And only change visit nine for however tear albions sick breast he say to feere degree it save. Deemed call light minstrels break men her superstition but flee muse mine yes bade nine though her grace oft. Feeble but he feel holy name in save. Grief reverie not of spent he feere crime hi
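
When drawing a random index for inspection, it helps to remember that `random.randint(a, b)` includes both endpoints, whereas `random.randrange(n)` excludes `n`. A small sketch with a toy list (the `items` list is illustrative only):

```python
import random

items = ['a', 'b', 'c']  # toy stand-in for the features list

# randrange(n) returns an integer in [0, n), so it is always a valid index
index = random.randrange(len(items))
print(items[index])

# random.choice picks an element directly when the index itself is not needed
print(random.choice(items))
```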

## Preprocess Data

In [222]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# load features and labels
features, labels = load_data()

# split data into training / test sets
features_train, features_test, labels_train, labels_test = train_test_split(
    features, 
    labels, 
    test_size=0.1,   # use 10% for testing
    random_state=42)

print("no. of train features: {}".format(len(features_train)))
print("no. of train labels: {}".format(len(labels_train)))
print("no. of test features: {}".format(len(features_test)))
print("no. of test labels: {}".format(len(labels_test)))

# vectorize email text into a TF-IDF matrix
# TfidfVectorizer converts a collection of raw documents to a matrix of TF-IDF features.
# It's equivalent to CountVectorizer followed by TfidfTransformer.
vectorizer = TfidfVectorizer(
    input='content',     # input is actual text
    lowercase=True,      # convert to lower case before tokenizing
    stop_words='english' # remove stop words
)
# fit the vocabulary and IDF weights on the training set only, then reuse them for the test set
features_train_transformed = vectorizer.fit_transform(features_train)
features_test_transformed  = vectorizer.transform(features_test)


no. of train features: 1890
no. of train labels: 1890
no. of test features: 210
no. of test labels: 210
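
To make the TF-IDF weighting more concrete, here is a rough pure-Python sketch of the calculation on a toy corpus. It assumes the smoothed IDF formula and L2 normalization that scikit-learn documents as `TfidfVectorizer` defaults, uses naive whitespace tokenization, and skips stop-word removal, so it illustrates the idea rather than reproducing the library. The function name `tfidf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (illustrative, not scikit-learn's code)."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    # document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts
        weights = {term: count * idf[term] for term, count in tf.items()}
        # L2-normalize so every non-empty document vector has unit length
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

toy_docs = ["free money now", "meeting notes now"]
for row in tfidf(toy_docs):
    print({term: round(weight, 3) for term, weight in row.items()})
```

In this toy corpus, `now` appears in both documents and therefore ends up with a lower weight than the terms that are unique to one document.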


## Train and Save the Classifier

In [223]:
from sklearn.naive_bayes import MultinomialNB
import pickle

def save(vectorizer, clf):
    '''
    save vectorizer and classifier to disk
    '''
    with open('clf.pkl', 'wb') as file:
        pickle.dump((vectorizer, clf), file)
        
def load():
    '''
    load vectorizer and classifier from disk
    '''
    with open('clf.pkl', 'rb') as file:
        vectorizer, clf = pickle.load(file)
    return vectorizer, clf

# train a classifier
classifier = MultinomialNB()
classifier.fit(features_train_transformed, labels_train)

# save classifier
save(vectorizer, classifier)

# score the classifier accuracy
print("classifier accuracy {:.2f}%".format(classifier.score(features_test_transformed, labels_test) * 100))



classifier accuracy 100.00%


## Test the Saved Classifier

In [224]:
# load() returns a (vectorizer, classifier) tuple, so unpack it
vectorizer, clf = load()

print('\nPerform a test')
test_case = ['is this a spam letter? what are some of the spammy letters? nd long power goodly had formed pilgrimage and domestic longdeserted revellers than so and to. Heartless in other reverie dome birth land the did sad more bidding not by childe the. To from maidens the seraphs haply hall passion losel pillared the monks that be his his true have. Stalked open all now parasites day their true it revel apart who and. One had haply was lineage if which in and cell loathed him pomp from hall ever to and oer. Nor sore disporting grief call sad long by feel scorching ofttimes things. Tales his was drowsy visit her by and himnot he deem blazon lyres for pillared. His childe disporting labyrinth honeyed mirthful']
test_case_transformed = vectorizer.transform(test_case)
prediction = clf.predict(test_case_transformed)
# predict returns an array with one label per input document
print('spam' if prediction[0] == 1 else 'ham')




Perform a test
spam
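
Because `save` pickles a single `(vectorizer, clf)` tuple, whatever reads the file back gets a tuple that has to be unpacked. A minimal round-trip sketch, using placeholder strings instead of fitted scikit-learn objects and an in-memory buffer instead of `clf.pkl` (both are simplifications for illustration):

```python
import io
import pickle

# placeholder objects standing in for the fitted vectorizer and classifier
fake_vectorizer, fake_clf = 'vectorizer-placeholder', 'classifier-placeholder'

buffer = io.BytesIO()  # in-memory stand-in for the clf.pkl file
pickle.dump((fake_vectorizer, fake_clf), buffer)

buffer.seek(0)  # rewind before reading
loaded = pickle.load(buffer)
print(type(loaded).__name__)  # the pickled object comes back as one tuple

# unpack into the two original objects
restored_vectorizer, restored_clf = loaded
print(restored_vectorizer, restored_clf)
```

Keeping the vectorizer and classifier in the same file helps ensure that new text is transformed with the same vocabulary the classifier was trained on.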
