# Classification task: detecting spam emails

In this task we need to classify emails as spam or not spam, using a dataset of 4601 emails, of which 1813 are spam. This is a typical binary classification problem: the result is either 0 (not spam) or 1 (spam).

Among classical machine learning methods there are several options, such as an artificial neural network, a decision tree, or a support vector machine (SVM). Since the task does not prescribe one, I am going to use an SVM as the classifier.

The Python ML library scikit-learn will be used for this task.

First we need to import svm from scikit-learn and numpy as np, and load the data from the file. The first 57 columns are the features and the last column is the label.

In [1]:
import numpy as np
from sklearn import svm

input_data = np.genfromtxt('spambase.data', delimiter=",")
# number of samples
l = input_data.shape[0]
# number of folds for k-fold cross-validation
k = 10

data = input_data[:, 0:57]
target = input_data[:, 57]
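
Before splitting, it can help to confirm that the loaded array matches the figures quoted above. The following is a minimal sketch: `summarize_dataset` is a hypothetical helper that is not part of the original notebook, and it assumes the label is stored in the last column as 0 or 1. A tiny synthetic array stands in for the real file.

```python
import numpy as np

def summarize_dataset(input_data):
    """Return (n_samples, n_features, n_spam) for an array whose last column is the 0/1 label."""
    n_samples, n_columns = input_data.shape
    labels = input_data[:, -1]
    n_spam = int(np.sum(labels == 1))
    return n_samples, n_columns - 1, n_spam

# Synthetic stand-in for the real file: 3 rows, 2 features, last column is the label.
example = np.array([[0.1, 0.2, 1.0],
                    [0.0, 0.5, 0.0],
                    [0.3, 0.1, 1.0]])
print(summarize_dataset(example))  # (3, 2, 2)
```

If the file matches the description above, calling the helper on `input_data` should give `(4601, 57, 1813)`.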

As required, we perform 10-fold cross-validation, so we divide the data into 10 contiguous groups. To do that we need the index at which each group starts (the breakpoints).

In [3]:
# start index of each fold; every fold has l // k samples (460 here)
breakpoints = [(l // k) * n for n in range(k)]
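
The breakpoint scheme can also be written as a small helper that returns explicit (start, end) pairs for each fold. This is only a sketch: `fold_bounds` is a hypothetical name, and it assumes contiguous folds where the last fold takes any leftover samples.

```python
def fold_bounds(n_samples, k):
    """Return a list of (start, end) index pairs for k contiguous folds.

    The first k-1 folds each hold n_samples // k samples; the last fold
    absorbs the remainder.
    """
    size = n_samples // k
    bounds = []
    for i in range(k):
        start = i * size
        end = n_samples if i == k - 1 else start + size
        bounds.append((start, end))
    return bounds

bounds = fold_bounds(4601, 10)
print(bounds[0])   # (0, 460)
print(bounds[-1])  # (4140, 4601)
```

With 4601 samples and 10 folds, the first nine folds hold 460 samples each and the last one holds 461.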

For an SVM, within each of the 10 folds we could try different hyperparameters (gamma and C) and pick the best ones with an inner k-fold cross-validation on the training part. We could also compare different kernels. Since the instructions indicate that this is not the point of the exercise, I will use the RBF kernel with gamma set to 'auto' and leave the other parameters at their default values.
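
For reference, the inner search over gamma and C mentioned above could look roughly like the sketch below, using scikit-learn's `GridSearchCV`. The helper name `tune_rbf_svc`, the grid values, and the synthetic data are assumptions made for illustration; they are not tuned for this dataset and are not used in the rest of this notebook.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV

def tune_rbf_svc(train_data, train_target, inner_folds=3):
    """Search a small, illustrative grid of C and gamma values with inner cross-validation."""
    param_grid = {
        "C": [0.1, 1, 10],            # illustrative values only
        "gamma": [0.001, 0.01, 0.1],  # illustrative values only
    }
    search = GridSearchCV(svm.SVC(kernel="rbf"), param_grid, cv=inner_folds)
    search.fit(train_data, train_target)
    return search.best_estimator_, search.best_params_

# Synthetic two-class data just to show the call; in the notebook this
# would be the training split of one outer fold.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
model, params = tune_rbf_svc(X, y)
print(sorted(params))  # ['C', 'gamma']
```

The returned estimator is refit on the full training split with the best parameters, so it can be used directly to predict the held-out fold.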

Before training, we define the layout of the result table, with one row per fold:

| fold | false-positive | false-negative | error-rate |
|------|----------------|----------------|------------|
| 1   |                |                |            |
| ... |                |                |            |
| 10  |                |                |            |
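
Once the per-fold numbers are available, a helper along these lines could render them in the table layout above. This is a sketch only: `format_results_table` is a hypothetical name, it assumes each entry is a `(false_positives, false_negatives, error_rate)` tuple, and the numbers in the example call are made up to show the output shape.

```python
def format_results_table(results):
    """Render (false_positives, false_negatives, error_rate) tuples as a Markdown table, one row per fold."""
    lines = [
        "| fold | false-positive | false-negative | error-rate |",
        "|------|----------------|----------------|------------|",
    ]
    for fold, (fp, fn, err) in enumerate(results, start=1):
        lines.append(f"| {fold} | {fp} | {fn} | {err:.4f} |")
    return "\n".join(lines)

# Made-up numbers, purely to show the output shape.
table = format_results_table([(3, 5, 0.08), (1, 2, 0.03)])
print(table)
```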

In [4]:
results = []
for i in range(0, k):
    start = breakpoints[i]
    # the last fold runs to the end of the data so that leftover samples are included
    end = l if i == k - 1 else start + l // k
    
    # training set: every row outside [start, end)
    train_data = np.concatenate((data[:start], data[end:]))
    train_target = np.concatenate((target[:start], target[end:]))
    
    # test set: the rows in [start, end)
    test_data = data[start:end]
    test_target = target[start:end]
    
    rbf_svc = svm.SVC(gamma='auto', kernel='rbf')
    
    rbf_svc.fit(train_data, train_target)
    
    predict_result = rbf_svc.predict(test_data)
    
    # false positive: non-spam predicted as spam; false negative: spam predicted as non-spam
    fp = 0
    fn = 0
    num = predict_result.shape[0]
    for t in range(0, num):
        if test_target[t] == 0 and predict_result[t] == 1:
            fp += 1
        if test_target[t] == 1 and predict_result[t] == 0:
            fn += 1
    
    err = (fp + fn) / num
    results.append((fp, fn, err))
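
The counting loop in the cell above can also be expressed with NumPy boolean masks. The sketch below is an equivalent formulation under the same 0/1 label convention; `count_errors` is a hypothetical helper name, not something the notebook defines.

```python
import numpy as np

def count_errors(y_true, y_pred):
    """Return (false_positives, false_negatives, error_rate) for 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # non-spam flagged as spam
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # spam that was missed
    return fp, fn, (fp + fn) / y_true.shape[0]

print(count_errors([0, 0, 1, 1], [1, 0, 0, 1]))  # (1, 1, 0.5)
```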

Finally, we compute the average number of false positives and false negatives per fold, together with the overall error rate across all samples.

In [5]:
total_fp = 0
total_fn = 0
for r in results:
    total_fp += r[0]
    total_fn += r[1]

# summary row: mean false positives per fold, mean false negatives per fold, overall error rate
results.append((total_fp/k, total_fn/k, (total_fp+total_fn)/l))

print(results)

[(0, 148, 0.3217391304347826), (0, 144, 0.3130434782608696), (0, 108, 0.23478260869565218), (5, 126, 0.2847826086956522), (105, 0, 0.22826086956521738), (88, 0, 0.19130434782608696), (125, 0, 0.2717391304347826), (64, 0, 0.1391304347826087), (72, 0, 0.1565217391304348), (160, 0, 0.3470715835140998), (61.9, 52.6, 0.24885894370788958)]
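
The per-fold results above are lopsided: the first folds produce almost only false negatives, and the later folds produce only false positives. One plausible explanation is that the rows in the file are grouped by class, so each contiguous fold is tested on a class mix that differs sharply from its training set. A common remedy is to shuffle and stratify the folds. The sketch below uses scikit-learn's `StratifiedKFold`. The helper name `stratified_cv_errors`, the seed, and the synthetic data are assumptions for illustration, and no results for the spam dataset are claimed here.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import StratifiedKFold

def stratified_cv_errors(data, target, k=10, seed=0):
    """Run k-fold CV with shuffled, class-balanced folds; return per-fold (fp, fn, error_rate)."""
    splitter = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    results = []
    for train_idx, test_idx in splitter.split(data, target):
        clf = svm.SVC(gamma="auto", kernel="rbf")
        clf.fit(data[train_idx], target[train_idx])
        pred = clf.predict(data[test_idx])
        truth = target[test_idx]
        fp = int(np.sum((truth == 0) & (pred == 1)))
        fn = int(np.sum((truth == 1) & (pred == 0)))
        results.append((fp, fn, (fp + fn) / len(test_idx)))
    return results

# Synthetic data sorted by class, mimicking a file that lists one class first.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(3, 1, (40, 2)), rng.normal(0, 1, (60, 2))])
y = np.array([1.0] * 40 + [0.0] * 60)
folds = stratified_cv_errors(X, y, k=5)
print(len(folds))  # 5
```

Because each test fold keeps roughly the same spam ratio as the full dataset, the false-positive and false-negative counts become comparable across folds.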
