# This is the code to accompany the Lesson 2 (SVM) mini-project.

```
Use an SVM to identify emails from the Enron corpus by their authors:
Sara has label 0
Chris has label 1
```

**Import, create, train and make predictions with the sklearn SVC classifier**. When creating the classifier, use a linear kernel (if you forget this step, you will be unpleasantly surprised by how long the classifier takes to train). **What is the accuracy of the classifier?**

In [20]:
import sys
from time import time
sys.path.append("../tools/")
from email_preprocess import preprocess

In [2]:
### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()

#########################################################
### your code goes here ###
from sklearn import svm
from sklearn.metrics import accuracy_score
clf = svm.SVC(kernel='linear', gamma='auto')
t0 = time()
clf.fit(features_train, labels_train)
print("training time: {}s".format(round(time()-t0, 3)))
pred = clf.predict(features_test)
accuracy = accuracy_score(labels_test, pred)

print("accuracy: {}".format(accuracy))

no. of Chris training emails: 7936
no. of Sara training emails: 7884
training time: 233.227s
accuracy: 0.9840728100113766
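
As a rough illustration of what `accuracy_score` reports, the sketch below computes the same quantity by hand: the fraction of predictions that match the true labels. The function name `fraction_correct` and the toy labels are made up for illustration and are not part of the project code.

```python
def fraction_correct(true_labels, predicted_labels):
    """Return the share of positions where the prediction equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if len(true_labels) == 0:
        raise ValueError("cannot compute accuracy of an empty sequence")
    matches = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return matches / len(true_labels)


# Toy data (hypothetical): 0 = Sara, 1 = Chris
toy_truth = [0, 1, 1, 0, 1]
toy_pred = [0, 1, 0, 0, 1]
print(fraction_correct(toy_truth, toy_pred))  # 4 of the 5 labels match
```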


Place timing code around the fit and predict functions, like you did in the Naive Bayes mini-project. How do the training and prediction times compare to Naive Bayes?

#### Are the SVM training and prediction times faster than those of Naive Bayes?
A: Slower.
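
One possible way to time the two steps separately is a small wrapper around `time.perf_counter`. This is a sketch, not the course's own timing code: the helper name `timed` and the stand-in workload `slow_sum` are assumptions for illustration, so that the example runs without the dataset.

```python
from time import perf_counter


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = perf_counter()
    result = func(*args, **kwargs)
    return result, perf_counter() - start


def slow_sum(n):
    """Stand-in workload (hypothetical) used in place of fit/predict."""
    return sum(i * i for i in range(n))


total, seconds = timed(slow_sum, 100_000)
print("result: {}, time: {}s".format(total, round(seconds, 3)))

# In the notebook the same helper would wrap the classifier calls (not executed here):
#   _, fit_seconds = timed(clf.fit, features_train, labels_train)
#   pred, predict_seconds = timed(clf.predict, features_test)
```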

#### Notes: 
One way to speed up an algorithm is to train it on a smaller training dataset. The tradeoff is that the accuracy almost always goes down when you do this. Let’s explore this more concretely: add in the following two lines immediately before training your classifier.

```
features_train = features_train[:len(features_train)//100]
labels_train = labels_train[:len(labels_train)//100]
```

These lines effectively slice the training dataset down to 1% of its original size, tossing out 99% of the training data. You can leave all other code unchanged. **What’s the accuracy now?**
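
The slicing step can also be wrapped in a helper so that the fraction is explicit. This is only a sketch: `take_fraction` is a hypothetical name, and it keeps the first rows, as the two lines above do, rather than sampling at random.

```python
def take_fraction(features, labels, divisor=100):
    """Keep the first len(features) // divisor rows of features and labels."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    keep = len(features) // divisor  # integer division: slice indices must be ints
    return features[:keep], labels[:keep]


# Toy data (hypothetical): 1000 rows shrink to 10
toy_features = [[i, i + 1] for i in range(1000)]
toy_labels = [i % 2 for i in range(1000)]
small_features, small_labels = take_fraction(toy_features, toy_labels)
print(len(small_features), len(small_labels))
```

With the 15,820 training emails reported earlier (7936 + 7884), the same integer division keeps 158 of them.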

In [18]:
features_train = features_train[:len(features_train)//100] 
labels_train = labels_train[:len(labels_train)//100] 

In [21]:
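# Display the number of Chris (label 1) and Sara (label 0) emails in the sliced labels
n_chris = sum(labels_train)
print("no. of Chris training emails: {}".format(n_chris))
print("no. of Sara training emails: {}".format(len(labels_train) - n_chris))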

In [9]:
t0 = time()
clf.fit(features_train, labels_train)
print("training time: {}s".format(round(time()-t0, 3)))
pred = clf.predict(features_test)
accuracy = accuracy_score(labels_test, pred)
print("accuracy: {}".format(accuracy))

training time: 0.144s
accuracy: 0.8845278725824801


#### Notes:
If speed is a major consideration (and for many real-time machine learning applications, it certainly is) then you may want to sacrifice a bit of accuracy if it means you can train/predict faster. **Which of these are applications where you can imagine a very quick-running algorithm is especially important?**

```
A. predicting the author of an email
B. flagging credit card fraud, and blocking a transaction before it goes through
C. voice recognition, like Siri
```

**Answer:** B and C

Keep the training set slice code from the last quiz, so that you are still training on only 1% of the full training set. Change the kernel of your SVM to “rbf”. **What’s the accuracy now, with this more complex kernel?**

In [11]:
clf = svm.SVC(kernel='rbf', gamma='auto')
t0 = time()
clf.fit(features_train, labels_train)
print("training time: {}s".format(round(time()-t0, 3)))
pred = clf.predict(features_test)
accuracy = accuracy_score(labels_test, pred)
print("accuracy: {}".format(accuracy))

training time: 0.16s
accuracy: 0.6160409556313993
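
For reference, the RBF kernel scores the similarity of two points as exp(-gamma * ||x - y||^2). The sketch below computes that value by hand for toy vectors. `rbf_kernel_value` is a hypothetical helper, not the code scikit-learn runs internally, and its default assumes the `gamma='auto'` setting, which scikit-learn documents as 1 / n_features.

```python
import math


def rbf_kernel_value(x, y, gamma=None):
    """Return exp(-gamma * squared Euclidean distance between x and y)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if gamma is None:
        gamma = 1.0 / len(x)  # mirrors gamma='auto' (1 / n_features)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


print(rbf_kernel_value([1.0, 0.0], [1.0, 0.0]))  # identical points give 1.0
print(rbf_kernel_value([1.0, 0.0], [0.0, 1.0]))  # similarity shrinks with distance
```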


Keep the training set size and rbf kernel from the last quiz, but try several values of C (say, 10.0, 100., 1000., and 10000.). **Which one gives the best accuracy?**

**Answer:** 10000
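
An answer like this comes from refitting the classifier once per candidate value and comparing the test accuracy. A minimal sketch of such a loop follows. The names `sweep_parameter` and `ThresholdStub` are hypothetical: the stub only mimics the `fit`/`predict` interface so that the example runs without the Enron data, and it is not an SVM.

```python
def sweep_parameter(make_classifier, values, x_train, y_train, x_test, y_test):
    """Fit one classifier per value; return (best_value, {value: accuracy})."""
    scores = {}
    for value in values:
        clf = make_classifier(value)
        clf.fit(x_train, y_train)
        predictions = clf.predict(x_test)
        correct = sum(1 for p, t in zip(predictions, y_test) if p == t)
        scores[value] = correct / len(y_test)
    best_value = max(scores, key=scores.get)  # the first value wins on ties
    return best_value, scores


class ThresholdStub:
    """Hypothetical stand-in with a fit/predict interface; not an SVM."""

    def __init__(self, threshold):
        self.threshold = threshold

    def fit(self, features, labels):
        return self  # nothing to learn in this toy stub

    def predict(self, features):
        return [1 if row[0] > self.threshold else 0 for row in features]


toy_x = [[0.1], [0.4], [0.6], [0.9]]
toy_y = [0, 0, 1, 1]
best, all_scores = sweep_parameter(
    ThresholdStub, [0.2, 0.5, 0.8], toy_x, toy_y, toy_x, toy_y
)
print(best, all_scores)
```

With the notebook's variables, the call would look something like `sweep_parameter(lambda c: svm.SVC(C=c, kernel='rbf', gamma='auto'), [10.0, 100.0, 1000.0, 10000.0], features_train, labels_train, features_test, labels_test)`.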

Once you've optimized the C value for your RBF kernel, **what accuracy does it give? Does this C value correspond to a simpler or more complex decision boundary?**

**What's the accuracy of your SVM now? Is the decision boundary more or less complex than when C had its default value of 1.0?**

<input type="checkbox">More Complex</input>

<input type="checkbox">Less Complex</input>

**Answer**: More Complex.

Now that you’ve **optimized C for the RBF kernel**, go back to using the **full training set**. In general, having a larger training set will improve the performance of your algorithm, so (by tuning C and training on a large dataset) we should get a fairly optimized result. 

**What is the accuracy of the optimized SVM?**

#### Using the full training set.

In [16]:
features_train, features_test, labels_train, labels_test = preprocess()

no. of Chris training emails: 7936
no. of Sara training emails: 7884


In [17]:
clf = svm.SVC(C=10000, kernel='rbf', gamma='auto')
t0 = time()
clf.fit(features_train, labels_train)
print("training time: {}s".format(round(time()-t0, 3)))
pred = clf.predict(features_test)
accuracy = accuracy_score(labels_test, pred)
print("accuracy: {}".format(accuracy))

training time: 154.844s
accuracy: 0.9908987485779295


### What class does your SVM (0 or 1, corresponding to Sara and Chris respectively) predict for element 10 of the test set? The 26th? The 50th? 
(Use the RBF kernel, C=10000, and 1% of the training set. Normally you'd get the best results using the full training set, but we found that using 1% sped up the computation considerably and did not change our results--so feel free to use that shortcut here.)

And just to be clear, the data point numbers that we give here (10, 26, 50) assume a zero-indexed list, so the correct answer for element #100 would be found using something like `answer = predictions[100]`.

In [None]:
print("Predictions of element {}, {}, and {} are {}, {}, {} respectively.".format(10, 26, 50, pred[10], pred[26], pred[50]))

### There are over 1700 test events--how many are predicted to be in the “Chris” (1) class? 
(Use the RBF kernel, C=10000., and the full training set.)

In [None]:
features_train, features_test, labels_train, labels_test = preprocess()
t0 = time()
clf.fit(features_train, labels_train)
print("training time: {}s".format(round(time()-t0, 3)))
pred = clf.predict(features_test)
accuracy = accuracy_score(labels_test, pred)
print("accuracy: {}".format(accuracy))
print("Number of events predicted in Chris class is {}".format(sum(pred)))
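
Because the labels are 0 and 1, `sum(pred)` counts the predictions for Chris. An alternative sketch that reports both classes uses `collections.Counter`; the toy predictions below are made up for illustration.

```python
from collections import Counter

# Toy predictions (hypothetical): 0 = Sara, 1 = Chris
toy_pred = [1, 0, 1, 1, 0, 1]

counts = Counter(toy_pred)
print("Chris: {}, Sara: {}".format(counts[1], counts[0]))

# For 0/1 labels the Chris count equals the plain sum
assert counts[1] == sum(toy_pred)
```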