# Introduction (35pts)
In this notebook, you will classify emails as either spam or not spam using support vector machines. The full dataset consists of 80k labeled emails. The labels are 1 if they are ham (not spam), and -1 if they are spam. The emails have already been lightly processed so that words are space-delimited, but little other processing has been done. 

## Preliminary notes
1. You cannot use scikit-learn. 
2. For this notebook, each subsequent part depends on the previous one, since we are building up a moderately sized data science pipeline. Verify your previous parts before proceeding to the next. 
3. Similar to the linear regression notebook of the previous assignment, you will need to use the tfidf function from the natural language processing notebook. You can download your notebook as a module and import it. If you're in 388 or you find that your implementation is too slow, copy the reference solution (it's only 10 lines). 
4. As we move into more advanced algorithms and techniques, there will be more introductions of randomness. This means that some of the example outputs in the notebook contain some randomness, and will probably not match your results exactly. Verify your code by checking your properties/invariants or feeding in static inputs for which you can calculate the output. 
5. When writing pickle files to be read into Autolab, **write files with the binary flag**.
6. There is another contest at the end of this notebook. 
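
To make the note about the binary flag concrete, here is a minimal sketch of a pickle round trip. The payload and the file name `example_params.pkl` are made up for illustration, and the file is written to a temporary directory rather than to your submission folder.

```python
import os
import pickle
import tempfile

# Hypothetical payload and file name, used only for illustration.
params = {"lr": 1.0, "reg": 1e-4, "niters": 100}
path = os.path.join(tempfile.mkdtemp(), "example_params.pkl")

# 'wb' opens the file for writing in binary mode, which pickle.dump requires.
with open(path, "wb") as f:
    pickle.dump(params, f)

# Read it back with 'rb' to confirm the round trip.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded)
```

Opening the file with `"w"` instead of `"wb"` raises a `TypeError` in Python 3, because pickle writes bytes rather than text.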

In [1]:
import numpy as np
import scipy.sparse as sp
from collections import Counter
import scipy.optimize
import pickle

In [None]:
# AUTOLAB_IGNORE_START
with open("X1.txt") as f:
    emails = f.readlines()
labels = np.loadtxt("y1.txt")
# AUTOLAB_IGNORE_STOP

In [None]:
# AUTOLAB_IGNORE_START
from natural_language_processing import tfidf
features, all_words = tfidf(emails)
# AUTOLAB_IGNORE_STOP

## SVM classification (15pts)
Recall the support vector machine (SVM) from slide 17 of linear classification. Since it is such a straightforward algorithm, you will implement it below. 
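
For reference, one common way to write the linear SVM objective (hinge loss plus L2 regularization) is sketched below. The notation is an assumption made here, not taken from the slides: $\theta$ is the weight vector, $\lambda$ is `reg`, $\alpha$ is the learning rate, and $m$ is the number of training examples. If the slides use a different scaling, follow the slides.

$$
\min_{\theta} \; \sum_{i=1}^{m} \max\left\{1 - y^{(i)}\theta^T x^{(i)},\; 0\right\} + \frac{\lambda}{2}\|\theta\|_2^2
$$

Under the same assumptions, a subgradient and the corresponding descent step are:

$$
g(\theta) = -\sum_{i=1}^{m} y^{(i)} x^{(i)} \, \mathbf{1}\left\{y^{(i)}\theta^T x^{(i)} \le 1\right\} + \lambda\theta,
\qquad
\theta := \theta - \alpha \, g(\theta)
$$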

### Grading
* 2pts - correct objective
* 5pts - correct gradient
* 8pts - correct prediction after training

### Specifications
1. If you do not use matrix operations, your code will be **very slow**. Every function in here can be implemented in 1 or 2 lines using matrix equations, and the only for loop you need is the training loop for gradient descent. **If your code is slow here, it will be extremely slow in the next section when doing parameter search**.
2. You should train your SVM using gradient descent as described in the slides. Your objective value should also mimic that of the slides. 
3. Since this is a convex function, your gradient steps should always decrease your objective. A simple check when writing these optimization procedures is to print your objectives and verify that this is the case (or plot them with matplotlib).
4. You can also use scipy.optimize.check_grad to numerically verify the correctness of your gradients. 
5. For the unlikely boundary case where your hypothesis outputs 0, we will treat that as a positive prediction. 
6. Be careful of numpy.matrix objects which are constrained to always have dimension 2 (scipy operations will sometimes return this instead of an ndarray). 
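
The last two points can be illustrated on a toy sparse matrix. This is only a sketch with made-up values: it shows how to flatten a 2D result into a 1D ndarray and how to apply the "0 counts as positive" convention.

```python
import numpy as np
import scipy.sparse as sp

# Toy sparse matrix standing in for the tf-idf features (illustrative values only).
X = sp.csr_matrix(np.array([[1.0, 0.0, 2.0],
                            [0.0, 3.0, 0.0]]))
theta = np.array([1.0, -1.0, -0.5])

# A sparse matrix times a 1D ndarray gives a 1D ndarray of scores.
scores = X.dot(theta)

# Reductions on sparse matrices may come back 2D (often as numpy.matrix),
# so convert explicitly before mixing them with 1D vectors.
col_sums = np.asarray(X.sum(axis=0)).ravel()

# Boundary convention from the specification: a score of exactly 0 is positive.
preds = np.where(scores >= 0, 1, -1)

print(scores.shape, col_sums.shape, preds)
```

Checking `.shape` after each operation is a cheap way to catch an unexpected 2D result early.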

In [None]:
class SVM:
    def __init__(self, X, y, reg):
        """ Initialize the SVM attributes and initialize the weights vector to the zero vector. 
            Attributes: 
                X (array_like) : training data inputs
                y (vector) : 1D numpy array of training data outputs
                reg (float) : regularizer parameter
                theta : 1D numpy array of weights
        """
        self.X = X
        self.y = y
        self.reg = reg
        self.theta = np.zeros(X.shape[1])
    
    def objective(self, X, y):
        """ Calculate the objective value of the SVM. When given the training data (self.X, self.y), this is the 
            actual objective being optimized. 
            Args:
                X (array_like) : array of examples, where each row is an example
                y (array_like) : array of outputs for the training examples
            Output:
                (float) : objective value of the SVM when calculated on X,y
        """
        pass
    
    def gradient(self):
        """ Calculate the gradient of the objective value on the training examples. 
            Output:
                (vector) : 1D numpy array containing the gradient
        """
        pass
    
    def train(self, niters=100, learning_rate=1, verbose=False):
        """ Train the support vector machine with the given parameters. 
            Args: 
                niters (int) : the number of iterations of gradient descent to run
                learning_rate (float) : the learning rate (or step size) to use when training
                verbose (bool) : an optional parameter that you can use to print useful information (like objective value)
        """
        pass
            
    
    def predict(self, X):
        """ Predict the class of each label in X. 
            Args: 
                X (array_like) : array of examples, where each row is an example
            Output:
                (vector) : 1D numpy array containing predicted labels
        """
        pass
    


Some useful tricks for debugging: 
1. Use very simple inputs (i.e. small vectors of ones) and compare the output of each function with a hand calculation. 
2. One way to guarantee your gradient is correct is to verify it numerically using a derivative approximation. You can read more about numerical differentiation methods here (https://en.wikipedia.org/wiki/Finite_difference) but for your purposes, you can use scipy.optimize.check_grad to do the numerical checking for you. 
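
As an illustration of the second trick, the sketch below runs `scipy.optimize.check_grad` on a toy least-squares objective (not the SVM), once with the correct gradient and once with a deliberately wrong one. The function names and data are made up for this example.

```python
import numpy as np
import scipy.optimize

rng = np.random.default_rng(0)
A = rng.random((5, 3))
b = rng.random(5)

def f(theta):
    # Toy objective: 0.5 * ||A theta - b||^2
    r = A.dot(theta) - b
    return 0.5 * r.dot(r)

def f_grad(theta):
    # Analytic gradient of f: A^T (A theta - b)
    return A.T.dot(A.dot(theta) - b)

def f_grad_wrong(theta):
    # Deliberately wrong gradient (sign flipped) to show what a failure looks like.
    return -f_grad(theta)

theta0 = np.full(3, 2.0)
err_ok = scipy.optimize.check_grad(f, f_grad, theta0)
err_bad = scipy.optimize.check_grad(f, f_grad_wrong, theta0)
print(err_ok, err_bad)
```

A correct gradient gives a very small error, while a wrong one gives an error on the scale of the gradient itself. For the SVM, the two wrapper functions would set the weights to the given `theta` before calling your own objective and gradient methods.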

In [None]:
# AUTOLAB_IGNORE_START
# Verify the correctness of your code on small examples
y0 = np.random.randint(0,2,5)*2-1
X0 = np.random.random((5,10))
t0 = np.random.random(10)
svm0 = SVM(X0,y0, 1e-4)
svm0.theta = t0


def obj(theta):
    pass

def grad(theta):
    pass

scipy.optimize.check_grad(obj, grad, t0)

svm0.train(niters=100, learning_rate=1, verbose=True)
# AUTOLAB_IGNORE_STOP

On the above small example, our solution gets a gradient error on the order of 1e-08 from scipy.optimize.check_grad. Your objective values should be monotonically decreasing. 

Once that works, try training your SVM on the tfidf features.

In [None]:
# AUTOLAB_IGNORE_START
# svm = SVM(...)
# svm.train()
# yp = svm.predict(...)

# AUTOLAB_IGNORE_STOP

Our implementation gets the following results:
* For 100 iterations, regularization 1e-4, and learning rate 1.0, our solution is able to achieve perfect training classification accuracy (100% accuracy on the training data)
* Training for 100 iterations takes about 2.13 seconds (measured using %timeit). 

## Model Selection: Cross validation and Parameter Grid Search (15pts)
As you may have noticed, there are parameters in the SVM learning algorithm that we chose somewhat arbitrarily: the regularization parameter and the learning rate (also technically the number of iterations for the learning algorithm, but you'll only consider the first two for simplicity). 

We were also able to achieve perfect training accuracy with these arbitrarily chosen parameters. This should make you suspicious: we have an enormous number of features, so it would be extremely easy to overfit to the training data, and our model may not generalize well. 

You will now evaluate and select parameters using cross validation and grid search.

### Grading
* 2pts correct blocks and test_block attributes
* 8pts correct cross validation 
* 3pts correct grid search
* 2pts correct test

In [None]:
import math

class ModelSelector:
    """ A class that performs model selection. 
        Attributes:
            blocks (list) : list of lists of indices of each block used for k-fold cross validation, e.g. blocks[i] 
            gives the indices of the examples in the ith block 
            test_block (list) : list of indices of the test block that is used only for reporting results
            
    """
    def __init__(self, X, y, P, k, niters):
        """ Initialize the model selection with data and split into train/valid/test sets. Split the permutation into blocks 
            and save the block indices as an attribute to the model. 
            Args:
                X (array_like) : array of features for the datapoints
                y (vector) : 1D numpy array containing the output labels for the datapoints
                P (vector) : 1D numpy array containing a random permutation of the datapoints
                k (int) : number of folds
                niters (int) : number of iterations to train for
        """
        pass

    def cross_validation(self, lr, reg):
        """ Given the permutation P in the class, evaluate the SVM using k-fold cross validation for the given parameters 
            over the permutation
            Args: 
                lr (float) : learning rate
                reg (float) : regularizer parameter
            Output: 
                (float) : the cross validated error rate
        """
        pass
    
    def grid_search(self, lrs, regs):
        """ Given two lists of parameters for learning rate and regularization parameter, perform a grid search using
            k-fold cross validation to select the best parameters. 
            Args:  
                lrs (list) : list of potential learning rates
                regs (list) : list of potential regularizers
            Output: 
                (lr, reg) : 2-tuple of the best found parameters
        """
        pass
    
    def test(self, lr, reg):
        """ Given parameters, calculate the error rate of the test data given the rest of the data. 
            Args: 
                lr (float) : learning rate
                reg (float) : regularizer parameter
            Output: 
                (err, svm) : tuple of the error rate of the SVM on the test data and the learned model
        """
        pass
    

## K-fold cross validation
How can we evaluate our choice of parameters? One way is to perform k-fold cross validation, which operates as follows:

1. We split the data into k+1 randomly selected but uniformly sized pieces, and set aside one block for testing
2. For each of the remaining k parts, we train the model on k-1 parts and validate our model on the heldout part. 
3. This gives k results, and the average of these runs gives the final result

The idea is that by holding out part of the dataset as validation data, we can train and measure our generalization ability. Note the key fact here: the training does not see the validation data at all, which is why it measures generalization! Randomizing the groups removes bias from ordering (i.e. if these results occurred in chronological order, we don't want to train on only Monday's results to predict on Wednesday's results), and averaging over the groups reduces the variance. 

In this problem, we will use classification error rate as our result metric (so the fraction of times in which our model returns the wrong answer). Calculating this value via k-fold cross validation gives us a measure of how well our model generalizes to new data (lower error rate is better). 
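
The procedure above can be sketched independently of the SVM. In the sketch below, `fit_predict` is a hypothetical callable introduced only for illustration, and the stand-in model simply predicts the most common training label.

```python
import numpy as np

def error_rate(y_true, y_pred):
    # Fraction of examples where the prediction disagrees with the label.
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

def kfold_error(X, y, blocks, fit_predict):
    """Average validation error over the given blocks.

    fit_predict is a hypothetical callable (X_train, y_train, X_valid) -> predictions.
    """
    errors = []
    for i, valid_idx in enumerate(blocks):
        # Train on every block except the i-th one, validate on the i-th one.
        train_idx = np.concatenate([b for j, b in enumerate(blocks) if j != i])
        y_pred = fit_predict(X[train_idx], y[train_idx], X[valid_idx])
        errors.append(error_rate(y[valid_idx], y_pred))
    return float(np.mean(errors))

def majority_class(X_train, y_train, X_valid):
    # Stand-in model: always predict the most common training label (ties go to +1).
    label = 1 if np.sum(y_train) >= 0 else -1
    return np.full(X_valid.shape[0], label)

X_toy = np.arange(12, dtype=float).reshape(6, 2)
y_toy = np.array([1, 1, 1, -1, 1, -1])
blocks_toy = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
cv_err = kfold_error(X_toy, y_toy, blocks_toy, majority_class)
print(cv_err)
```

The validation rows are never passed to the training step, which is what makes the averaged error a measure of generalization.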

### Specification
1. Break the examples in k+1 groups as follows: 
    * break the permutation into blocks of size $\text{ceil}\left(\frac{n}{k+1}\right)$ (the last block may be shorter than the rest)
    * set aside the k+1th group as the testing block, and use the remaining k blocks for cross validation
    * use the permutation as indices to select the rows that correspond to that block
    * Example: k=2, P=[1,3,2,4,5,6] sets aside [5,6] as the test set, and breaks the remaining permutation into [[1,3],[2,4]] so the blocks of data for validation are X[[1,3],:] and X[[2,4],:]
    * the order of the indices in the blocks should match the order in the original permutation
2. For each group k, train the model on all other datapoints, and compute the error rate on the held-out group. 
3. Return the average error rate over all k folds. 

You can try it on the random dataset just to make sure it works, but you won't get anything meaningful. 

In [None]:
# AUTOLAB_IGNORE_START
MS0 = ModelSelector(X0, y0, np.arange(X0.shape[0]), 3, 100)
MS0.cross_validation(0.1, 1e-4)
# AUTOLAB_IGNORE_STOP

Try running this on the tfidf features. Can you achieve the same performance on the validation dataset as you did on the training data set? Remember to use a random permutation (you'll get noticeably different results). 

In [None]:
# AUTOLAB_IGNORE_START
# MS0 = ...
# MS0.cross_validation(...)

# AUTOLAB_IGNORE_STOP

Our implementation returns results with mean classification error 0.01169 and standard deviation 0.0092 (over 10 different permutations). The parameters we used are k=5 folds for learning rate 1 and regularization 1e-4, when run for 100 iterations. Pretty good!

## Grid search
Now we have a means of evaluating our choice of parameters. We can combine this with a grid search over parameters to determine the best combination. Given two lists of parameters, we compute the classification error using k-fold cross validation for each pair of parameters, and output the pair that produces the best validation result. 

### Specification
1. Select the pair of hyperparameters that produces the smallest k-fold validation error. 
2. Train a new model using all the training and validation data
3. Report the classification error rate on the test data
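
The selection in step 1 amounts to taking a minimum over all pairs. Below is a minimal sketch, assuming a hypothetical callable `cv_error(lr, reg)` that returns the cross-validated error; here it is replaced by a made-up error surface so the example runs on its own.

```python
import itertools

def grid_search(lrs, regs, cv_error):
    """Return the (lr, reg) pair with the smallest cross-validation error.

    cv_error is a hypothetical callable (lr, reg) -> float.
    """
    return min(itertools.product(lrs, regs), key=lambda pair: cv_error(*pair))

def toy_cv_error(lr, reg):
    # Made-up error surface with its minimum at lr=1, reg=0.1.
    return (lr - 1.0) ** 2 + (reg - 0.1) ** 2

best_lr, best_reg = grid_search([0.1, 1.0, 10.0], [0.01, 0.1, 1.0, 10.0], toy_cv_error)
print(best_lr, best_reg)
```

Note that `min` returns the first pair encountered when several pairs tie for the smallest error.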

In [None]:
# MS = ModelSelector(...)
# lr, reg = MS.grid_search(...)
# print(lr, reg)
# print(MS.test(lr,reg))

# AUTOLAB_IGNORE_START
MS = ModelSelector(features, labels, np.arange(features.shape[0]), 4, 100)
lr, reg = MS.grid_search(np.logspace(-1,1,3), np.logspace(-2,1,4))
print(lr, reg)
print(MS.test(lr,reg))
# AUTOLAB_IGNORE_STOP

Again, you can try it on the randomized small example just to make sure your code runs, but it won't produce any meaningful result. On our implementation, performing a grid search on learning rates [0.1, 1, 10] and regularization [0.01, 0.1, 1, 10] with 100 iterations for training results in a final test error rate of 0.0232 and selects a learning rate of 1 and a regularization parameter of 0.1. Our implementation takes about 1 minute and 7 seconds to perform the grid search. 

## Feature Compression (0pts)
While you are able to get decent results using an SVM and basic tf-idf features, there are 2 main problems here:
1. The actual dataset is 8x larger than the one that you load at the start
2. The number of features is extremely bloated and consumes a lot of space and computing power for a binary classification problem

So the above methodology would actually take a lot of time and memory to run on the full dataset. Following the example you did in the text classification notebook, we would need to save the tf-idf matrix for the entire training dataset (which is enormous), and then use that to generate features on new examples. 

One way to tackle this is to generate fewer, but effective, features. For example, instead of generating full tf-idf features for every single word in every email, we can focus on keywords that occur frequently, and only, in spam email. This was hinted at in the previous contest, but was not emphasized enough. 

This problem is not graded, so you are free to create different features if you wish. 
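
The exact meaning of `threshold` is not specified above, so the sketch below is only one possible interpretation: a word is a spam indicator if it appears in at least `threshold` spam emails and in no ham email, and symmetrically for ham. The function name and this rule are assumptions, not the reference solution.

```python
from collections import Counter

def indicator_words_sketch(docs, y, threshold):
    """Assumed rule: words seen in at least `threshold` emails of one class
    and in no email of the other class."""
    spam_counts, ham_counts = Counter(), Counter()
    for doc, label in zip(docs, y):
        # Count each word once per email (document frequency).
        words = set(doc.split())
        if label == 1:   # 1 = ham in this dataset
            ham_counts.update(words)
        else:            # -1 = spam
            spam_counts.update(words)
    spam_words = {w for w, c in spam_counts.items()
                  if c >= threshold and w not in ham_counts}
    ham_words = {w for w, c in ham_counts.items()
                 if c >= threshold and w not in spam_counts}
    return spam_words, ham_words

docs_toy = ["win cash now", "cash prize now", "meeting at noon", "lunch meeting today"]
labels_toy = [-1, -1, 1, 1]
spam_toy, ham_toy = indicator_words_sketch(docs_toy, labels_toy, threshold=2)
print(spam_toy, ham_toy)
```

Other choices (counting total occurrences, or allowing a few appearances in the opposite class) are equally plausible and may give different word counts.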

In [None]:
def find_frequent_indicator_words(docs, y, threshold):
    pass


In [None]:
# AUTOLAB_IGNORE_START
s,h = find_frequent_indicator_words(emails, labels, 50)
with open('student_data.pkl', 'wb') as f:
    pickle.dump((s,h), f)
# AUTOLAB_IGNORE_STOP

Our implementation gets 2422 spam words and 290 ham words. 

## Efficient Spam Detection (5pts)

Your goal here is to get at least 80% accuracy on spam detection in an efficient manner. If you are unsure of what to do, one way is to use the frequent indicator words implemented above and generate two features per email: the number of spam indicator words and the number of ham indicator words. This is a huge dimensionality reduction!

Of course, you don't have to do this. As long as you achieve at least 80% accuracy with your features you will receive the base credit for this problem. You are allowed to submit supplemental files. See the Contest section for more details. Make sure these supplemental files make it into your tar file (update the Makefile if you use it). 
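
The two-count feature idea suggested above can be sketched as follows. The helper name and the toy word sets are made up for illustration; in `email2features` the word sets would presumably be loaded from `student_data.pkl`.

```python
import numpy as np

def count_features_sketch(emails, spam_words, ham_words):
    """Map each email to [number of spam indicator words, number of ham indicator words]."""
    features = np.zeros((len(emails), 2))
    for i, email in enumerate(emails):
        words = email.split()
        # Count every occurrence of an indicator word in this email.
        features[i, 0] = sum(w in spam_words for w in words)
        features[i, 1] = sum(w in ham_words for w in words)
    return features

toy_features = count_features_sketch(
    ["win cash now", "meeting at noon cash"],
    spam_words={"cash", "now"},
    ham_words={"meeting"},
)
print(toy_features)
```

Storing the indicator words as sets keeps each membership test cheap, which matters when processing many emails.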

### Grading
* 5pts 80% or higher accuracy within the constraints of Autolab

In [None]:
def email2features(emails):
    """ Given a list of emails, create a matrix containing features for each email. """
    # with open('student_data.pkl', 'rb') as f:
    #     data = pickle.load(f)
    pass


In [None]:
# AUTOLAB_IGNORE_START
small_features = email2features(emails)
# MS = ModelSelector(...)
# lr, reg = MS.grid_search(...)
# print(lr, reg)
# err, svm = MS.test(lr,reg)
# print(err)
# AUTOLAB_IGNORE_STOP

## Contest
The contest here is straightforward: get the best accuracy level you can on the held out test dataset. You are allowed many things:
1. You can upload a file named 'student_data.pkl' which can be of arbitrary format so long as it meets the Autolab file size submission constraint. You can use this, e.g., to store dictionaries of useful words (but are not limited to just this). 
2. You can upload a file named 'student_params.pkl' which contains a dictionary of parameters you'd like us to run when training your SVM, i.e. regularization parameter, learning rate, and niters. It should follow the format as shown in the next cell. 
3. When writing pkl files, make sure you pass the binary flag 'b' when writing the file or Autolab will be unable to read it. 
4. Add pkl files to your submitted tar file if you want them to be present on Autolab. 
5. In addition to X1.txt and y1.txt, there are 70k more emails (10k per file) that you can peruse for useful data. These are available on the website as a separate download. You are free to use any or all of them locally to learn your parameters and to determine what data to save. 

Reiterating: there are **70k more emails** that you can use to build your features on the course website. If you are serious about hitting the top of the leaderboard, go get them! We've only included the first 10k in the handout. 

In [None]:
# Example. Remember to add these files to your tar archive
# AUTOLAB_IGNORE_START
with open('student_data.pkl', 'wb') as f:
    pickle.dump((s,h), f)
    
with open('student_params.pkl', 'wb') as f:
    pickle.dump({
        "lr" : 1.0,
        "reg" : 1e-4,
        "niters" : 100
    }, f)
# AUTOLAB_IGNORE_STOP