In [1]:
"""
In this mini-project, we'll tackle the exact same email author ID problem as the Naive Bayes mini-project, but now with an SVM. 
What we find will help clarify some of the practical differences between the two algorithms. This project also gives us a chance 
to play around with parameters a lot more than Naive Bayes did, so we will do that too.
"""

#!/usr/bin/python

import pickle
import cPickle
import numpy

from sklearn import cross_validation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif


def preprocess(words_file = "word_data.pkl", authors_file="email_authors.pkl"):
    """
        This function takes a pre-made list of email texts (by default word_data.pkl)
        and the corresponding authors (by default email_authors.pkl) and performs
        a number of preprocessing steps:
            -- splits into training/testing sets (10% testing)
            -- vectorizes into a tf-idf matrix
            -- selects/keeps the most helpful features (top 10%)

        After this, the features are dense numpy arrays, which play nicely with sklearn functions.

        4 objects are returned:
            -- training/testing features
            -- training/testing labels

    """

    ### the words (features) and authors (labels), already largely preprocessed
    ### this preprocessing will be repeated in the text learning mini-project
    with open(authors_file, "r") as authors_file_handler:
        authors = pickle.load(authors_file_handler)

    with open(words_file, "r") as words_file_handler:
        word_data = cPickle.load(words_file_handler)

    ### test_size is the fraction of events assigned to the test set
    ### (the remainder go into training)
    features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(
        word_data, authors, test_size=0.1, random_state=42)



    ### text vectorization--go from strings to lists of numbers
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    features_train_transformed = vectorizer.fit_transform(features_train)
    features_test_transformed  = vectorizer.transform(features_test)



    ### feature selection, because text is super high dimensional and 
    ### can be really computationally chewy as a result
    selector = SelectPercentile(f_classif, percentile=10)
    selector.fit(features_train_transformed, labels_train)
    features_train_transformed = selector.transform(features_train_transformed).toarray()
    features_test_transformed  = selector.transform(features_test_transformed).toarray()

    ### info on the data
    print "no. of Chris training emails:", sum(labels_train)
    print "no. of Sara training emails:", len(labels_train)-sum(labels_train)
    
    return features_train_transformed, features_test_transformed, labels_train, labels_test
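
The vectorize-then-select steps inside `preprocess` can be tried on a toy corpus without the Enron pickles. The sketch below is only an illustration under assumptions: the six documents and their labels are made up, `max_df` is left out because the corpus is tiny, `percentile=50` replaces the 10 used above so that a few features survive, and the code targets Python 3 with a current scikit-learn rather than the Python 2 setup of this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

# Made-up stand-ins for word_data and authors (1 = Chris, 0 = Sara).
toy_docs = [
    "meeting schedule budget review",
    "budget forecast quarterly review",
    "schedule contract legal draft",
    "legal contract agreement draft",
    "budget meeting forecast numbers",
    "agreement legal review contract",
]
toy_labels = [1, 1, 0, 0, 1, 0]

# Strings -> sparse tf-idf matrix (one row per document, one column per word).
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
tfidf = vectorizer.fit_transform(toy_docs)

# Keep the half of the columns with the highest ANOVA F-score against the labels.
selector = SelectPercentile(f_classif, percentile=50)
selected = selector.fit_transform(tfidf, toy_labels).toarray()

print("features before selection: %d" % tfidf.shape[1])
print("features after selection: %d" % selected.shape[1])
```

As in `preprocess`, the selector is fitted on training data only; test data would go through `vectorizer.transform` and `selector.transform`.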

In [2]:
#!/usr/bin/python

"""
    This is the code to accompany the Lesson 2 (SVM) mini-project.

    Use an SVM to identify emails from the Enron corpus by their authors:
    Sara has label 0
    Chris has label 1
"""
    
import sys
from time import time
#sys.path.append("../tools/")
#from email_preprocess import preprocess


### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()


no. of Chris training emails: 7936
no. of Sara training emails: 7884


In [4]:
"""
Go to the svm directory to find the starter code (svm/svm_author_id.py).

Import, create, train and make predictions with the sklearn SVC classifier. When creating the classifier, use a linear kernel 
(if you forget this step, you will be unpleasantly surprised by how long the classifier takes to train). What is the accuracy 
of the classifier?
"""

from sklearn import svm
from sklearn.metrics import accuracy_score

clf = svm.SVC(kernel='linear', gamma='auto', C=1.0)

### train step
clf.fit(features_train, labels_train)

### use the trained classifier to predict labels for the test features
pred = clf.predict(features_test)

### calculate the accuracy on the test data (accuracy_score expects y_true first)
accuracy = accuracy_score(labels_test, pred)

print 'SVM Accuracy:', accuracy

SVM Accuracy: 0.984072810011
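
The accuracy printed above is the fraction of test emails whose predicted label equals the true label. A minimal pure-Python sketch of that calculation follows; the `accuracy` helper and the toy label lists are made up for illustration and are not part of the starter code.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that equal the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return matches / float(len(true_labels))


# Toy labels: 1 = Chris, 0 = Sara.
toy_true = [0, 1, 1, 0, 1]
toy_pred = [0, 1, 0, 0, 1]
print("toy accuracy: %.2f" % accuracy(toy_true, toy_pred))  # 4 of 5 match -> 0.80
```

Because the calculation only counts matches, swapping the two arguments gives the same number; that symmetry does not hold for metrics such as precision or recall.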


In [5]:
"""
Place timing code around the fit and predict functions, like you did in the Naive Bayes mini-project. How do the training and 
prediction times compare to Naive Bayes?
"""

clf = svm.SVC(kernel='linear', gamma='auto', C=1.0)

t0 = time()
clf.fit(features_train, labels_train)
print "training time:", round(time()-t0, 3), "s"

t0 = time()
pred = clf.predict(features_test)
print "predicting time:", round(time()-t0, 3), "s"

training time: 181.191 s
predicting time: 18.813 s
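
The `t0 = time()` pattern repeats for every call that gets timed. One way to factor it out is a small helper, sketched here under assumptions: `timed` is a hypothetical name, the code targets Python 3, and it uses `time.perf_counter` (a monotonic clock) instead of the `time()` call used in the cell above.

```python
from time import perf_counter


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = perf_counter()
    result = func(*args)
    return result, perf_counter() - start


# Stand-in workload; in the notebook this would be clf.fit or clf.predict.
total, elapsed = timed(sum, range(1000000))
print("sum took %.3f s" % elapsed)
```

With the classifier from this cell, a call such as `timed(clf.predict, features_test)` would return the predictions together with the prediction time.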


In [6]:
"""
One way to speed up an algorithm is to train it on a smaller training dataset. The tradeoff is that the accuracy almost always 
goes down when you do this. Let's explore this more concretely: add in the following two lines immediately before training your 
classifier. 

features_train = features_train[:len(features_train)/100] 
labels_train = labels_train[:len(labels_train)/100] 

These lines effectively slice the training dataset down to 1% of its original size, tossing out 99% of the training data. You 
can leave all other code unchanged. What's the accuracy now?
"""
features_train, features_test, labels_train, labels_test = preprocess()

clf = svm.SVC(kernel='linear', gamma='auto', C=1.0)

### train step
### keep only the first 1% of the training data (// keeps the slice index an integer)
features_train = features_train[:len(features_train) // 100]
labels_train = labels_train[:len(labels_train) // 100]
clf.fit(features_train, labels_train)

### use the trained classifier to predict labels for the test features
pred = clf.predict(features_test)

### calculate the accuracy on the test data (accuracy_score expects y_true first)
accuracy = accuracy_score(labels_test, pred)

print 'SVM Accuracy:', accuracy


SVM Accuracy: 0.884527872582
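
The 1% slice relies on integer division. In Python 2, `len(x)/100` is already an integer, but in Python 3 the same expression is a float and cannot be used as a slice index; `//` gives an integer in both. The sketch below uses a plain list as a stand-in for the feature matrix and takes the training-set size from the counts printed by `preprocess()`.

```python
# Chris + Sara training emails, as printed by preprocess().
n_train = 7936 + 7884

# Stand-in for features_train: one placeholder item per training email.
toy_features = list(range(n_train))

# Floor division keeps the slice bound an integer on Python 2 and Python 3.
one_percent = len(toy_features) // 100
subset = toy_features[:one_percent]

print("kept %d of %d training emails" % (len(subset), n_train))
```

So the 1% run above trains on 158 emails instead of 15820.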


In [4]:
"""
Keep the training set slice code from the last quiz, so that you are still training on only 1% of the full training set. Change 
the kernel of your SVM to "rbf". What's the accuracy now, with this more complex kernel?
"""

from sklearn import svm
from sklearn.metrics import accuracy_score

features_train, features_test, labels_train, labels_test = preprocess()
features_train = features_train[:len(features_train) // 100]
labels_train = labels_train[:len(labels_train) // 100]

clf = svm.SVC(kernel='rbf', gamma='auto', C=1.0)
clf.fit(features_train, labels_train)

### use the trained classifier to predict labels for the test features
pred = clf.predict(features_test)

### calculate the accuracy on the test data (accuracy_score expects y_true first)
accuracy = accuracy_score(labels_test, pred)

print 'SVM Accuracy:', accuracy


no. of Chris training emails: 7936
no. of Sara training emails: 7884
SVM Accuracy: 0.616040955631
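
For reference, the RBF kernel scores a pair of points as `exp(-gamma * squared_distance)`, and `gamma='auto'` in scikit-learn means `1 / n_features`. The pure-Python sketch below illustrates the formula on two made-up four-dimensional points; `rbf_kernel` here is a hypothetical helper written for this note, not the scikit-learn function of the same name.

```python
import math


def rbf_kernel(x, y, gamma):
    """Return exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two made-up points with 4 features each.
x = [1.0, 0.0, 0.0, 0.0]
y = [0.0, 1.0, 0.0, 0.0]

# gamma='auto' corresponds to 1 / n_features.
auto_gamma = 1.0 / len(x)

print("k(x, x) = %.4f" % rbf_kernel(x, x, auto_gamma))  # identical points -> 1.0000
print("k(x, y) = %.4f" % rbf_kernel(x, y, auto_gamma))  # exp(-0.5)
```

A smaller `gamma` makes the kernel value fall off more slowly with distance, so each training point influences a wider region.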


In [5]:
"""
Keep the training set size and rbf kernel from the last quiz, but try several values of C (say, 10.0, 100., 1000., and 10000.). 
Which one gives the best accuracy?
"""
# try each value of C in turn; only C changes between runs
for C in (10.0, 100.0, 1000.0, 10000.0):
    clf = svm.SVC(kernel='rbf', gamma='auto', C=C)
    clf.fit(features_train, labels_train)
    pred = clf.predict(features_test)
    accuracy = accuracy_score(labels_test, pred)
    print 'SVM Accuracy using C=%s:' % C, accuracy

SVM Accuracy using C=10.0: 0.616040955631
SVM Accuracy using C=100.0: 0.616040955631
SVM Accuracy using C=1000.0: 0.821387940842
SVM Accuracy using C=10000.0: 0.892491467577
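
The same sweep can be reproduced without the Enron data. The sketch below is an assumption-laden illustration: it uses a synthetic dataset from `make_classification` in place of the email features, targets Python 3 with a current scikit-learn (where `train_test_split` lives in `sklearn.model_selection`), and its accuracies say nothing about the numbers reported above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data standing in for the tf-idf features.
X, y = make_classification(n_samples=300, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

# Fit one RBF-kernel SVM per value of C and record its test accuracy.
results = {}
for C in (10.0, 100.0, 1000.0, 10000.0):
    clf = SVC(kernel="rbf", gamma="auto", C=C)
    clf.fit(X_train, y_train)
    results[C] = accuracy_score(y_test, clf.predict(X_test))

for C in sorted(results):
    print("C=%s: accuracy %.3f" % (C, results[C]))
```

Choosing C by test-set accuracy, as the exercise does, is convenient for exploration; holding out a separate validation set is the cleaner way to pick a hyperparameter.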


In [6]:
"""
Now that you've optimized C for the RBF kernel, go back to using the full training set. In general, having a larger training set 
will improve the performance of your algorithm, so (by tuning C and training on a large dataset) we should get a fairly optimized 
result. What is the accuracy of the optimized SVM?
"""
features_train, features_test, labels_train, labels_test = preprocess()

clf = svm.SVC(kernel='rbf', gamma='auto', C=10000.0)
clf.fit(features_train, labels_train)

### use the trained classifier to predict labels for the test features
pred = clf.predict(features_test)

### calculate the accuracy on the test data (accuracy_score expects y_true first)
accuracy = accuracy_score(labels_test, pred)

print 'SVM Accuracy using full dataset and C=10000.0:', accuracy

no. of Chris training emails: 7936
no. of Sara training emails: 7884
SVM Accuracy using full dataset and C=10000.0: 0.990898748578


In [7]:
"""
What class does your SVM (0 or 1, corresponding to Sara and Chris respectively) predict for element 10 of the test set? The 26th? 
The 50th? (Use the RBF kernel, C=10000, and 1% of the training set. Normally you'd get the best results using the full training 
set, but we found that using 1% sped up the computation considerably and did not change our results--so feel free to use that 
shortcut here.)

And just to be clear, the data point numbers that we give here (10, 26, 50) assume a zero-indexed list. So the correct answer 
for element #100 would be found using something like answer=predictions[100]
"""
features_train = features_train[:len(features_train) // 100]
labels_train = labels_train[:len(labels_train) // 100]

clf = svm.SVC(kernel='rbf', gamma='auto', C=10000.0)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
answer = (pred[10], pred[26], pred[50])
print 'SVM answers for (10, 26, 50) elements are: ', answer



SVM answers for (10, 26, 50) elements are:  (1, 0, 1)


In [10]:
"""
There are over 1700 test events--how many are predicted to be in the "Chris" (1) class? (Use the RBF kernel, C=10000., 
and the full training set.)
"""
features_train, features_test, labels_train, labels_test = preprocess()

clf = svm.SVC(kernel='rbf', gamma='auto', C=10000.0)
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)

### pred is a numpy array, so the comparison below is element-wise
chris = int((pred == 1).sum())
print 'There are %d elements predicted as Chris' % chris
print 'Total of %d elements' % len(pred)

no. of Chris training emails: 7936
no. of Sara training emails: 7884
There are 877 elements predicted as Chris
Total of 1758 elements
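
Counting how many predictions fall in each class can also be done with `collections.Counter`, which tallies both labels in one pass. This is a minimal sketch on a made-up prediction list; with the real `pred` array, `pred.tolist()` would be one way to pass plain Python integers.

```python
from collections import Counter

# Made-up predictions: 1 = Chris, 0 = Sara.
toy_pred = [1, 0, 1, 1, 0, 0, 1]

# One pass over the predictions yields the count for every label.
counts = Counter(toy_pred)

print("Chris (1): %d, Sara (0): %d, total: %d"
      % (counts[1], counts[0], len(toy_pred)))
```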
