# Filter spam emails

The data comes from the Ling-Spam dataset: https://goo.gl/CSMxHU. <br>
Note: the data was preprocessed as follows:
- Remove stop words
- Lemmatize
- Remove non-words

File `train-features-50.txt` contains a reduced training set of only 50 emails. Each `*-labels.txt` file contains one line per email, and each line is 0 (not spam) or 1 (spam).

Each `*-features.txt` file contains many lines, and each line has 3 numbers:

1 564 1 <br>
1 19 2

The first number is the index of the email, starting from 1; the second is the index of the word in the dictionary; the third is the frequency of that word in the current email. <br>
The first line above says that the 564th word in the dictionary appears exactly once in email 1.
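
As a small illustration of this format, the sketch below parses the two sample lines above into a sparse count matrix. It is not the loader used in the rest of this notebook: `sample`, `n_emails` and `n_words` are names chosen for this example, and the dictionary size of 2500 is assumed to match the `nwords` value used further down.

```python
import io

import numpy as np
from scipy.sparse import coo_matrix

# Two lines in the features format: email index, word index, count.
sample = "1 564 1\n1 19 2\n"

# ndmin=2 keeps the result 2-D even if the input had a single line.
triplets = np.loadtxt(io.StringIO(sample), dtype=int, ndmin=2)

n_emails = 1     # assumption: the sample only covers email 1
n_words = 2500   # assumption: dictionary size, as used later in the notebook

# The file is 1-based, Python is 0-based, so shift both indices by 1.
rows = triplets[:, 0] - 1
cols = triplets[:, 1] - 1
counts = triplets[:, 2]
matrix = coo_matrix((counts, (rows, cols)), shape=(n_emails, n_words))

dense = matrix.toarray()
print(dense[0, 563], dense[0, 18], dense.sum())  # → 1 2 3
```

Row 0 is email 1, column 563 is the 564th dictionary word, and column 18 is the 19th word.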

## Import libraries and dataset

In [8]:
import numpy as np
from scipy.sparse import coo_matrix # for sparse matrix
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score

# data path and file name
path = './data'
train_data_fn = '/train-features.txt'
test_data_fn = '/test-features.txt'
train_label_fn = '/train-labels.txt'
test_label_fn = '/test-labels.txt'

In [9]:
nwords = 2500

def read_data(data_fn, label_fn):
    # read label_fn
    with open(path + label_fn) as f:
        content = f.readlines()
    label = [int(x.strip()) for x in content]

    # read data_fn
    with open(path + data_fn) as f:
        content = f.readlines()

    # remove '\n' at the end of each line
    content = [x.strip() for x in content]
    dat = np.zeros((len(content), 3), dtype=int)
    for i, line in enumerate(content):
        a = line.split(' ')
        dat[i, :] = np.array([int(a[0]), int(a[1]), int(a[2])])

    # subtract 1 from both indices: the files are 1-based, Python is 0-based
    data = coo_matrix((dat[:, 2], (dat[:, 0] - 1, dat[:, 1] - 1)),
                      shape=(len(label), nwords))
    return (data, label)

In [10]:
(train_data, train_label) = read_data(train_data_fn, train_label_fn)
(test_data, test_label) = read_data(test_data_fn, test_label_fn)

clf = MultinomialNB()
clf.fit(train_data, train_label)
y_pred = clf.predict(test_data)

print('Training size = %d, accuracy = %.2f%%' %
      (train_data.shape[0], accuracy_score(test_label, y_pred) * 100))


Training size = 700, accuracy = 98.08%


## Train the model with smaller training sets

In [12]:
# the test set is the same for every run, so read it once
(test_data, test_label) = read_data(test_data_fn, test_label_fn)

# repeat the experiment with the 100-email and 50-email training sets
for size in (100, 50):
    train_data_fn = '/train-features-%d.txt' % size
    train_label_fn = '/train-labels-%d.txt' % size
    (train_data, train_label) = read_data(train_data_fn, train_label_fn)
    clf = MultinomialNB()
    clf.fit(train_data, train_label)
    y_pred = clf.predict(test_data)
    print('Training size = %d, accuracy = %.2f%%' %
          (train_data.shape[0], accuracy_score(test_label, y_pred) * 100))

Training size = 100, accuracy = 97.69%
Training size = 50, accuracy = 97.31%


## Using BernoulliNB

In [14]:
clf = BernoulliNB(binarize = .5)
clf.fit(train_data, train_label)
y_pred = clf.predict(test_data)
print('Training size = %d, accuracy = %.2f%%' %
      (train_data.shape[0], accuracy_score(test_label, y_pred) * 100))

Training size = 50, accuracy = 69.62%


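With the 50-email training set, MultinomialNB reaches 97.31% accuracy while BernoulliNB reaches only 69.62%, so MultinomialNB works better than BernoulliNB on this problem.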
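
One difference between the two models is what they see as input: with `binarize = .5`, BernoulliNB maps every feature above the threshold to 1 and everything else to 0, so it only knows whether a word occurs, not how often. The sketch below imitates that thresholding step with plain NumPy on a made-up count matrix. It only illustrates the transformation and is not a claim about why the accuracy differs here.

```python
import numpy as np

# Toy word-count matrix: 2 emails x 4 words (made-up values).
counts = np.array([[0, 3, 1, 0],
                   [2, 0, 0, 7]])

threshold = 0.5  # same value as the binarize argument above

# Counts above the threshold become 1 (word present), the rest 0 (absent).
binary = (counts > threshold).astype(int)
print(binary)
```

After thresholding, a word that appears 7 times and a word that appears once are indistinguishable, whereas MultinomialNB keeps the raw counts.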