# Lesson 2: Support Vector Machines (SVMs)

## Import libraries

In [4]:
import sys
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess
from sklearn import svm
from sklearn.metrics import accuracy_score

## Loading test and training data

In [9]:
features_train, features_test, labels_train, labels_test = preprocess()

no. of Chris training emails: 7936
no. of Sara training emails: 7884


## Setting up the classifier (with a linear kernel)

In [5]:
clf = svm.SVC(kernel="linear")

### What is the accuracy?

In [6]:
clf.fit(features_train, labels_train)
clf.score(features_test, labels_test)

# Alternatively, instead of clf.score:
# pred_test = clf.predict(features_test)
# accuracy_score(labels_test, pred_test)

0.98407281001137659
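The accuracy reported above is simply the fraction of test emails whose predicted label matches the true label. A minimal pure-Python sketch of that calculation follows; the helper name `accuracy` and the toy labels are illustrative assumptions, not part of the lesson code.

```python
def accuracy(labels_true, labels_pred):
    """Return the fraction of predictions that equal the true label."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    if len(labels_true) == 0:
        raise ValueError("cannot compute the accuracy of an empty sequence")
    matches = sum(1 for t, p in zip(labels_true, labels_pred) if t == p)
    # float() keeps the division exact under Python 2 as well as Python 3
    return float(matches) / len(labels_true)


# Toy data (hypothetical): 4 of the 5 predictions agree with the truth
toy_true = [0, 1, 1, 0, 1]
toy_pred = [0, 1, 0, 0, 1]
print("accuracy = %.3f" % accuracy(toy_true, toy_pred))
```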

### How does the timing compare to Naive Bayes?

In [None]:
t0 = time()
clf.fit(features_train, labels_train)
t1 = time()

t2 = time()
pred_test = clf.predict(features_test)
t3 = time()

print("Time for training = %.3f s" % (t1 - t0))
print("Time for testing = %.3f s" % (t3 - t2))
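The start/stop pattern in this cell can be factored into a small reusable helper. The sketch below is one possible way to do it; the name `timed` is an assumption for illustration, and a built-in function stands in for the classifier calls.

```python
from time import time


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time()
    result = func(*args, **kwargs)
    return result, time() - start


# Stand-in workload (hypothetical): summing a range instead of fitting an SVM
total, seconds = timed(sum, range(1000000))
print("sum = %d, took %.3f s" % (total, seconds))
```

With the classifier from this lesson, the same helper could be called as `timed(clf.fit, features_train, labels_train)` and `timed(clf.predict, features_test)`.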

## Tossing out training data

We throw out 99% of the training data, keeping only the first 1%, to speed up training.

In [10]:
# Floor division keeps the slice index an integer in both Python 2 and 3
features_train_trunc = features_train[:len(features_train) // 100]
labels_train_trunc = labels_train[:len(labels_train) // 100]
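The truncation can also be wrapped in a small helper. This sketch uses floor division (`//`) so the slice index stays an integer under both Python 2 and Python 3. The function name `keep_first_fraction` and the toy lists are assumptions for illustration.

```python
def keep_first_fraction(items, denominator=100):
    """Return the first len(items) // denominator elements of items."""
    return items[:len(items) // denominator]


# Toy stand-ins (hypothetical) for the feature and label arrays
toy_features = list(range(1000))
toy_labels = [i % 2 for i in toy_features]

small_features = keep_first_fraction(toy_features)
small_labels = keep_first_fraction(toy_labels)
print("kept %d features and %d labels" % (len(small_features), len(small_labels)))
```

With the 7936 + 7884 = 15820 training emails reported above, `15820 // 100` leaves 158 emails for training.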

### What is the accuracy now?

In [11]:
clf.fit(features_train_trunc, labels_train_trunc)
clf.score(features_test, labels_test)

0.88452787258248011

## Changing the kernel

In [12]:
clf = svm.SVC(kernel="rbf")

### What is the accuracy now?

In [13]:
clf.fit(features_train_trunc, labels_train_trunc)
clf.score(features_test, labels_test)

0.61604095563139927
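For reference, the RBF (Gaussian) kernel scores the similarity of two points as exp(-gamma * ||x - y||^2): identical points score 1, and the score decays toward 0 as the points move apart. The sketch below computes that value in plain Python. The `gamma` used here is an arbitrary illustrative value, not the one scikit-learn chose for the run above.

```python
import math


def rbf_kernel(x, y, gamma):
    """Return exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical points have similarity 1; distant points have similarity near 0
print("same point:    %.6f" % rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5))
print("distant point: %.6f" % rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.5))
```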

## Optimizing C

We try C = 10, 100, 1000, and 10000. Which value gives the best accuracy?

In [14]:
cs = [10.0, 100., 1000., 10000.]

for c_value in cs:
    print("")
    clf = svm.SVC(kernel="rbf", C=c_value)
    clf.fit(features_train_trunc, labels_train_trunc)
    pred_test = clf.predict(features_test)
    # accuracy_score expects the true labels first, then the predictions
    print("The accuracy for C = %d is %.3f" % (c_value, accuracy_score(labels_test, pred_test)))


The accuracy for C = 10 is 0.616

The accuracy for C = 100 is 0.616

The accuracy for C = 1000 is 0.821

The accuracy for C = 10000 is 0.892


C = 10000 gives the highest accuracy. A larger C penalizes misclassified training points more heavily, so it also yields the most complex decision boundary.
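Picking the winner can be done programmatically rather than by eye. The sketch below copies the accuracies from the output above into a dictionary (the name `accuracy_by_c` is an assumption) and selects the C with the highest score.

```python
# Accuracies copied from the printed results of the loop over C
accuracy_by_c = {10.0: 0.616, 100.0: 0.616, 1000.0: 0.821, 10000.0: 0.892}

# max() with key=dict.get returns the key whose value is largest
best_c = max(accuracy_by_c, key=accuracy_by_c.get)
print("Best C = %d with accuracy %.3f" % (best_c, accuracy_by_c[best_c]))
```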

### What is the accuracy with the optimized C on the full training set?

In [16]:
clf = svm.SVC(kernel="rbf", C = 10000.)
clf.fit(features_train, labels_train)
clf.score(features_test, labels_test)

0.99089874857792948

What class does your SVM (0 or 1, corresponding to Sara and Chris respectively) predict for element 10 of the test set? Element 26? Element 50? (Use the RBF kernel, C=10000, and 1% of the training set. Normally you'd get the best results using the full training set, but we found that using 1% sped up the computation considerably and did not change our results, so feel free to use that shortcut here.)

In [15]:
writer = {0: "Sara", 1: "Chris"}

clf = svm.SVC(kernel="rbf", C = 10000.)
clf.fit(features_train_trunc, labels_train_trunc)
pred_test = clf.predict(features_test)
for index in [10, 26, 50]:
    print("The predicted writer of mail %d is %s (class %d)." % (index, writer[pred_test[index]], pred_test[index]))

The predicted writer of mail 10 is Chris (class 1).
The predicted writer of mail 26 is Sara (class 0).
The predicted writer of mail 50 is Chris (class 1).


There are over 1700 test events. How many are predicted to be in the “Chris” (1) class? (Use the RBF kernel, C=10000., and the full training set.)

In [17]:
clf = svm.SVC(kernel="rbf", C = 10000.)
clf.fit(features_train, labels_train)
pred_test = clf.predict(features_test)
# Labels are 0 (Sara) or 1 (Chris), so the sum counts the Chris predictions
print("Out of the %d mails predicted, %d were predicted to be written by Chris." % (len(pred_test), pred_test.sum()))

Out of the 1758 mails predicted, 877 were predicted to be written by Chris.
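Summing the predictions works because the labels are 0 and 1. A more general way to count predictions per class is `collections.Counter`, sketched below on a short made-up prediction list; `toy_predictions` is hypothetical and stands in for `pred_test`.

```python
from collections import Counter

writer = {0: "Sara", 1: "Chris"}

# Hypothetical predictions standing in for the real pred_test array
toy_predictions = [1, 0, 1, 1, 0]

# Map each predicted label to a name, then tally the names
counts = Counter(writer[label] for label in toy_predictions)
for name in sorted(counts):
    print("%s: %d" % (name, counts[name]))
```

Applied to the result above, the remaining 1758 - 877 = 881 emails would be counted for Sara.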
