The goal of this project is to train a model that detects spam from a set of email features. The features are predetermined, and we take them to be sensible and correctly computed.

In [1]:
import numpy as np
# read in dataset
data = np.genfromtxt('data/spambase/spambase.data', delimiter=',')
data[:,-1]

array([ 1.,  1.,  1., ...,  0.,  0.,  0.])

In [2]:
# get column names from names file to understand the features
with open('data/spambase/spambase.names', 'r') as f:
    lines = [line.strip() for line in f]
# skip empty lines and comment lines (these start with | or, in this file, 1); column name and type are separated by :
colnames = [line.partition(':')[0] for line in lines if line and not line.startswith(('|', '1'))]
# need to add the name for the final column
colnames.append('spam')
len(colnames)

58
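
To make the filter in the cell above concrete, here is a minimal sketch that applies the same rule to a few lines written in the style of the names file. The sample lines are illustrative assumptions, not the real contents of spambase.names.

```python
# Illustrative sample in the style of a .names file (assumed, not the real file)
sample = """\
| this is a comment line

1, 0.    | class labels
word_freq_make:         continuous.
char_freq_!:            continuous.
"""

lines = [line.strip() for line in sample.splitlines()]
# keep only non-empty lines that do not start with '|' or '1'
colnames = [line.partition(':')[0] for line in lines
            if line and not line.startswith(('|', '1'))]
colnames.append('spam')  # the label column is not listed as a feature
print(colnames)
```

Only the two feature lines survive the filter, and the label name is appended by hand, which is why the real file yields 57 feature names plus 'spam'.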

In [3]:
# store as DataFrame
import pandas as pd
df = pd.DataFrame(data, columns=colnames)

In [4]:
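# Now there is a nicer view of the data, which is easier to explore
# (.ix has been removed from pandas; .iloc selects the same rows 0-3 and columns 50-57 by position)
df.iloc[:4, 50:58]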

   char_freq_[  char_freq_!  char_freq_$  char_freq_#  capital_run_length_average  capital_run_length_longest  capital_run_length_total  spam
0            0        0.778          0.0          0.0                       3.756                          61                       278     1
1            0        0.372         0.18        0.048                       5.114                         101                      1028     1
2            0        0.276        0.184         0.01                       9.821                         485                      2259     1
3            0        0.137          0.0          0.0                       3.537                          40                       191     1


Now that the data is read in and ready for use, we need to prepare it for training and testing. We will use 80% for training and 20% for testing.
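
As a rough illustration of what an 80/20 split does, the following sketch shuffles the row indices and cuts them at 80%. The function name split_80_20 and the toy arrays are hypothetical; the notebook itself relies on scikit-learn's train_test_split further down.

```python
import numpy as np

def split_80_20(X, y, seed=None):
    # hypothetical helper: shuffle the row indices, then cut at 80%
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(0.8 * len(X))
    train, test = idx[:cut], idx[cut:]
    return X[train], X[test], y[train], y[test]

X = np.arange(20).reshape(10, 2)   # 10 toy samples, 2 features
y = np.arange(10) % 2              # toy labels
X_train, X_test, y_train, y_test = split_80_20(X, y, seed=0)
print(X_train.shape, X_test.shape)
```

Drawing a fresh random split on every call is what makes it possible to repeat the evaluation several times and average the results.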

We will move the file handling functionality into a dedicated method. This will make running several iterations of the individual algorithms much easier (cross-validation).

There are many types of algorithms that can learn from these features. We pick a Support Vector Machine (SVM) and a Naive Bayes classifier (NB), as these tend to perform decently on this type of dataset and typically need little setup.
Before training and validating the models, we also want to perform some form of dimensionality reduction, as we do not know a priori whether all features are informative or whether some are just nuisance parameters. For this reason we compare results using Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), as well as using all features. For PCA and LDA we look at two settings: the extreme case of keeping only the dominant component, and keeping 10 components.
In total we will have (2 algorithms) x [(2 dim. reductions) x (2 component counts) + 1] = 10 result sets.
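
The count of 10 can be checked by enumerating the combinations. This is only a sketch of the bookkeeping; the string labels are placeholders, not the helper objects used later.

```python
from itertools import product

algorithms = ['svm', 'bayes']
reductions = ['pca', 'lda']
component_counts = [1, 10]

# each reduction with each component count, plus the "all features" baseline
feature_sets = [('none', 'all')] + list(product(reductions, component_counts))
runs = [(a, r, c) for a in algorithms for (r, c) in feature_sets]
print(len(runs))
```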

As this would get messy in the notebook, the relevant classes and methods live in a helper file (helper.py).

In [5]:
from helper import lda, pca, svm, bayes
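
helper.py is not shown here. Judging from how the classifiers are used below, each one is constructed from the training data and offers a classify method that returns a confusion matrix of counts. The following is a hypothetical stand-in with that interface, using a simple nearest-centroid rule instead of an SVM or Naive Bayes; the class name, the rule, and the row/column orientation of the matrix are all assumptions.

```python
import numpy as np

class NearestCentroid:
    """Hypothetical stand-in for a helper classifier; not the actual helper.py code."""

    def __init__(self, X_train, y_train):
        # "training" here just stores one mean vector per class
        self.labels = np.unique(y_train)
        self.centroids = np.array(
            [X_train[y_train == lab].mean(axis=0) for lab in self.labels])

    def classify(self, X_test, y_test):
        # distance of every test sample to every centroid
        dist = np.linalg.norm(
            X_test[:, None, :] - self.centroids[None, :, :], axis=2)
        pred = self.labels[dist.argmin(axis=1)]
        # confusion matrix of counts (assumed: rows = true class, columns = predicted)
        n = len(self.labels)
        confusion = np.zeros((n, n), dtype=int)
        for i, true_lab in enumerate(self.labels):
            for j, pred_lab in enumerate(self.labels):
                confusion[i, j] = np.sum((y_test == true_lab) & (pred == pred_lab))
        return confusion

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0.0, 0.0, 1.0, 1.0])
clf = NearestCentroid(X, y)
counts = clf.classify(X, y)
print(counts)
```

Dividing such a count matrix by the number of test samples gives the fractions that appear in the printed results.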

Now define a couple of methods to make this process less manual.

In [6]:
from numpy import genfromtxt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from numpy import mean, var, sum, diag, shape

def load_data():
    data = genfromtxt('data/spambase/spambase.data', delimiter=',')
    target = data[:,-1]
    data = data[:,:-1]
    return data, target

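def evaluate(algo, dim_rec, components, iterations=15):
    """Return (accuracy, error estimate, mean confusion matrix) over random 80/20 splits."""
    X, y = load_data()
    # components == 0 means: use all features, skip dimensionality reduction
    # note: the reduction is applied to the full dataset, before splitting
    if components > 0:
        X, y = dim_rec(X, y, components)
    res = []
    for i in range(iterations):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

        classifier = algo(X_train, y_train)
        # confusion matrix as fractions of the test set
        confusion = 1.0 * classifier.classify(X_test, y_test) / len(X_test)
        res.append(confusion)
    # element-wise mean and variance of the confusion matrices over all iterations
    mean_confusion = mean(res, axis=0)
    var_confusion = var(res, axis=0)

    # accuracy is the trace of the mean confusion matrix
    return sum(diag(mean_confusion)), iterations * sum(diag(var_confusion)), mean_confusion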
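
The accuracy returned by evaluate is the sum of the diagonal of the normalized confusion matrix, i.e. the fraction of test samples that were classified correctly. A small check, using the mean confusion matrix printed in the results for the SVM on all features:

```python
# mean confusion matrix from the results (SVM, all features)
mean_confusion = [[0.51646761, 0.08975751],
                  [0.07289178, 0.3208831]]

# accuracy = sum of the diagonal (correctly classified fractions)
accuracy = sum(mean_confusion[i][i] for i in range(len(mean_confusion)))
print(round(accuracy, 6))
```

This reproduces the reported accuracy of about 0.8374; the off-diagonal entries are the two kinds of misclassification.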

In [7]:
components = [0,1,10] # 0 will be handled like 'all'
dim_rec = [pca, lda]
algo = [svm, bayes]
best = [0.0, 0.0, 'none', 'none', 0]
acc, err, mat = 0.0, 0.0, 0.0
for c in components:
    for d in dim_rec:
        tmp = 'all'
        if(c>0):
            tmp = str(c)
            print ("# Using %s" % (d.__name__))

        print ("# Using %s components." % (tmp))
        for a in algo:
            print ("## %s:" % a.__name__)
            acc, err, mat = evaluate(a, d, c)
            
            print ("### Accuracy of {n}".format(n=acc))
            print ("### Error {n}".format(n=err))
            print ("### Confusion Matrix")
            print (mat)
            print ()
            if acc > best[0]:
                best = [acc, err, d.__name__, a.__name__, c]
            print ()
        if(c==0):
            break

print ("##############################")            
print ("  Best performing combination")
print ("  Algorithm: %s" % (best[3]))
print ("  Components: %s" % (best[4])) 
print ("  Decomposition: %s" % (best[2])) 
print ("  Accuracy: %f" % (best[0]))   
print ("  Error: %f" % (best[1]))      

# Using all components.
## svm:
### Accuracy of 0.8373507057546146
### Error 0.003962395907138027
### Confusion Matrix
[[ 0.51646761  0.08975751]
 [ 0.07289178  0.3208831 ]]


## bayes:
### Accuracy of 0.8243937748823743
### Error 0.007406228497168452
### Confusion Matrix
[[ 0.44618169  0.15888527]
 [ 0.01672096  0.37821209]]


# Using pca
# Using 1 components.
## svm:
### Accuracy of 0.6944625407166125
### Error 0.005924731296883787
### Confusion Matrix
[[ 0.4723127   0.13637351]
 [ 0.16916395  0.22214984]]


## bayes:
### Accuracy of 0.6542164314151285
### Error 0.002000689269519709
### Confusion Matrix
[[ 0.58487152  0.02272892]
 [ 0.32305465  0.06934491]]


# Using lda
# Using 1 components.
## svm:
### Accuracy of 0.9109663409337676
### Error 0.004394348618690526
### Confusion Matrix
[[ 0.56728194  0.03923272]
 [ 0.04980094  0.3436844 ]]


## bayes:
### Accuracy of 0.8984437205935577
### Error 0.005735476906523811
### Confusion Matrix
[[ 0.57690916  0.02909881]
 [ 0.07245747  0.321