# In which I help you decide if cash4u@freestuff.net is legit or not.

Week 7 of Andrew Ng's ML course on Coursera introduces the Support Vector Machine algorithm and challenges us to use it for classifying email as spam or ham. Here I use the [SpamAssassin public corpus](https://spamassassin.apache.org/publiccorpus/) to build an SVM spam email classifier in order to learn about the relevant Python tools. [Part I](http://sdsawtelle.github.io/blog/output/spam-classification-text-processing.html) focused on the preprocessing of individual emails, but now I'm going to actually do some machine learning.

>## Tools Covered:
- `CountVectorizer` for mapping text data to numeric word occurrence vectors
- `TfidfTransformer` for normalizing word occurrence vectors
- `Pipeline` for chaining together transformer (preprocessing, feature extraction) and estimator steps
- `GridSearchCV` for optimizing over the metaparameters of an estimator or pipeline
- `FunctionTransformer` for creating custom transformer objects from a function
- `FeatureUnion` for combining features from different feature-creation pipelines

In [1]:
# Set up environment
import scipy.io
import matplotlib.pyplot as plt
import matplotlib 
import pandas as pd
import numpy as np
import pickle
import os
import re

from nltk.stem.snowball import SnowballStemmer
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")

import snips as snp  # my snippets
snp.prettyplot(matplotlib)  # my aesthetic preferences for plotting
%matplotlib inline

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sonya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
cd hw-wk7-spam-svm

C:\Users\Sonya\Box Sync\Projects\course-machine-learning\hw-wk7-spam-svm



# Incorporating Other Features in the Pipeline with `FeatureUnion`
We might do even better by incorporating the other numeric features that we created during the email preprocessing in [Part I: NLP](http://sdsawtelle.github.io/blog/output/spam-classification-text-processing.html). Remember these were things like the number of HTML tags and the number of blank lines. We still would like to utilize a grid search, but it's not clear how these other features can be included in our pipeline. This is where `FeatureUnion` comes to the rescue! This is a tool that can combine multiple transformer objects into one object whose output is the union of the individual outputs.  
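To make the idea concrete before we build the real thing, here is a minimal sketch of a `FeatureUnion` on toy data. The two helper functions (`text_lengths`, `exclaim_counts`) and the toy strings are made up for illustration and are not part of the spam pipeline; the point is only that the union's output is the two individual outputs stacked side by side.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Two hypothetical one-column feature extractors (illustration only)
def text_lengths(texts):
    return np.array([[len(t)] for t in texts])

def exclaim_counts(texts):
    return np.array([[t.count("!")] for t in texts])

# Each transformer sees the same input; outputs are concatenated column-wise
toy_union = FeatureUnion([
    ("length", FunctionTransformer(text_lengths, validate=False)),
    ("exclaim", FunctionTransformer(exclaim_counts, validate=False)),
])

toy_features = toy_union.fit_transform(["Free cash!!!", "See you at lunch"])
print(toy_features.shape)  # one row per string, one column per extractor
```

Each row holds the length feature followed by the exclamation-mark feature, in the order the transformers were listed.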

We would like to combine the features generated by a `CountVectorizer` + `TfidfTransformer` pipeline with the features generated by our function `special_features()`. To combine these two approaches we need them to take the same input, so we'll use the corpus of unprocessed email strings. Let's start by making a pipeline of `CountVectorizer` + `TfidfTransformer`.

In [165]:
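# Create a pipeline for vectorizing bag of words
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# word_salad is a preprocessing function defined earlier (not shown in this section)
objs = [("vect", CountVectorizer(preprocessor=word_salad)), 
        ("tfidf", TfidfTransformer())]
pipe_vectorizer = Pipeline(objs)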
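If you want to see what this kind of pipeline spits out, here is a small sketch on two toy documents. It uses the default `CountVectorizer` preprocessor rather than `word_salad`, and the toy strings are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

toy_docs = ["cash cash now", "lunch now"]  # made-up documents

toy_vectorizer = Pipeline([("vect", CountVectorizer()),
                           ("tfidf", TfidfTransformer())])

# The result is a sparse matrix: one row per document, one column per vocab word
toy_matrix = toy_vectorizer.fit_transform(toy_docs).toarray()
toy_vocab = sorted(toy_vectorizer.named_steps["vect"].vocabulary_)

# TfidfTransformer L2-normalizes each row by default
row_norms = np.linalg.norm(toy_matrix, axis=1)
print(toy_vocab, toy_matrix.shape, row_norms)
```

The column count equals the vocabulary size, which is why `min_df` and `ngram_range` matter so much for the width of the real feature matrix.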

## Making a Custom Transformer 
Now we need to turn our custom function `special_features` into a prototypical sklearn `transformer` before we can union it with the vectorization pipeline. Luckily there is a helper for this called `FunctionTransformer`! First let's make a wrapper for `special_features` so that it can take the same input as the vectorization pipeline: a list of raw email body strings.

In [192]:
def special_features(body):
    '''Produce some special features from email body.'''
    # Parse HTML and extract the text content only (but count tags and links first)
    soup = BeautifulSoup(body, 'html.parser')
    nhtml = len(soup.find_all())
    nlinks = len(soup.find_all("a"))
    body = soup.get_text()
    
    # Count occurrences of "http" as a proxy for the number of URLs
    nhttps = body.count("http")

    # Count "@" characters as a proxy for the number of email addresses
#     regx = re.compile(r"\b[^\s]+@[^\s]+[.][^\s]+\b")
#     nemails = len(regx.findall(string=body))
    nemails = body.count("@")
    
    # Count uppercase characters (computed but not included in the feature vector below)
    nupper = len([charup for charup, char in zip(body, body.lower()) if charup != char])
    
    # Count all numbers (integer and float)
    regx = re.compile(r"\b[\d.]+\b")
    nnum = len(regx.findall(string=body))

    # Count special punctuation
    ndollar, nexclaim, nquest = body.count("$"), body.count("!"), body.count("?")

    # Count blank lines and newlines
    nblanks, nnewlines = body.count("\n\n"), body.count("\n")

    # Normalize each count by the number of characters in the body
    nchars = max(len(body), 1)  # guard against dividing by zero for empty bodies
    freqns = np.array([nemails, nhttps, nexclaim, nquest, ndollar,
              nblanks, nnewlines, nhtml, nlinks, nnum])/nchars
#     freqns = [count/nchars for count in counts]

    return freqns

def special_features_wrapper(bodies):
    xs = [special_features(body) for body in bodies]
    X = np.array(xs)
    return X

Since most transformers in a preprocessing pipeline operate on a matrix of numbers, `FunctionTransformer` tries to be helpful by creating a transformer that checks its input type. Our input is instead a list of raw email strings, so we need to call `FunctionTransformer` with kwarg `validate=False` to turn off this checking (it was on by default when this was written; sklearn 0.22 and later default to `validate=False`).

In [193]:
# Turn our custom function into a sklearn transformer!
from sklearn.preprocessing import FunctionTransformer
special_features_transformer = FunctionTransformer(special_features_wrapper, validate=False)

In [194]:
# Test out the transform method of our new transformer
special_features_transformer.transform(emails_raw[0:3])

array([[ 0.00127959,  0.0006398 ,  0.        ,  0.        ,  0.00191939,
         0.01023672,  0.03262956,  0.0006398 ,  0.        ],
       [ 0.00110988,  0.00221976,  0.00221976,  0.00110988,  0.        ,
         0.00665927,  0.0299667 ,  0.        ,  0.        ],
       [ 0.00057045,  0.0011409 ,  0.0011409 ,  0.        ,  0.        ,
         0.00399315,  0.02224758,  0.        ,  0.        ]])
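To see what the input check would have done to us, here is a small sketch comparing `validate=True` and `validate=False` on a list of strings. The function `count_exclaims` and the toy strings are hypothetical, and I'm assuming a reasonably recent sklearn where the failed check surfaces as a `ValueError`.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Hypothetical feature function that expects raw strings, not numbers
def count_exclaims(texts):
    return np.array([[t.count("!")] for t in texts])

toy_texts = ["Free cash!!!", "See you at lunch"]

# With validation on, sklearn tries to coerce the strings to a 2D numeric array
strict = FunctionTransformer(count_exclaims, validate=True)
try:
    strict.fit_transform(toy_texts)
    validation_error = None
except ValueError as err:
    validation_error = err

# With validation off, the strings are handed to our function untouched
lenient = FunctionTransformer(count_exclaims, validate=False)
lenient_out = lenient.fit_transform(toy_texts)

print(type(validation_error).__name__, lenient_out.tolist())
```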

And here's a little bonus for you: instead of using sklearn's helper function, you can make a custom transformer by defining your own class that inherits from sklearn's base classes. Your class should define a `transform` method that executes your custom preprocessing / feature extraction functionality. Refer to [this helpful (and pretty) blog post](https://michelleful.github.io/code-blog/2015/06/20/pipelines/) for more info.

In [150]:
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):

    def __init__(self):
        pass

    def transform(self, X, y=None):
        return special_features_wrapper(X)  # your custom feature extraction / preprocessing

    def fit(self, X, y=None):
        return self  # generally does nothing

In [151]:
# Instantiate our custom transformer class and try it out
mytransformer = CustomTransformer()
mytransformer.transform(emails_raw[0:3])

array([[ 2,  1,  0,  0,  3, 16, 51,  1,  0],
       [ 1,  2,  2,  1,  0,  6, 27,  0,  0],
       [ 1,  2,  2,  0,  0,  7, 39,  0,  0]])
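One reason to go the class route is that it can carry its own parameters. Here is a minimal sketch, with a made-up class `CharCounter` that is not part of the spam pipeline: storing the constructor argument under the same name lets `BaseEstimator` provide `get_params` / `set_params`, and `TransformerMixin` provides `fit_transform` for free.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CharCounter(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: count one character in each text."""

    def __init__(self, char="!"):
        self.char = char  # must be stored under the same name as the argument

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X, y=None):
        return np.array([[text.count(self.char)] for text in X])

counter = CharCounter(char="$")
counts = counter.fit_transform(["$$$ now", "hello"])  # from TransformerMixin
params = counter.get_params()                         # from BaseEstimator
print(counts.tolist(), params)
```

Because `char` shows up in `get_params()`, a grid search could in principle tune it with the same `__` syntax used for built-in steps.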

## Pipelining Special Feature Extraction With Scaling

We need to make sure that our special features are all properly scaled, to facilitate the numerical optimization of the SVM in fitting. That means we need to create a pipeline which is our special feature transformer chained with sklearn's `StandardScaler`.

In [195]:
from sklearn.preprocessing import StandardScaler

# Create a pipeline for standardizing our special features
objs = [("specials", special_features_transformer),
       ("scaler", StandardScaler())]
pipe_specialfeats = Pipeline(objs)
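As a quick illustration of what the scaler step does, here is a sketch on a made-up matrix whose two columns live on very different scales. After `StandardScaler`, each column has zero mean and unit standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (invented numbers): column 0 is tiny, column 1 is large
toy_counts = np.array([[0.001, 30.0],
                       [0.002, 10.0],
                       [0.003, 20.0]])

scaled = StandardScaler().fit_transform(toy_counts)

col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print(col_means, col_stds)
```

Inside a pipeline the scaler learns its means and standard deviations on the training folds only, which is one more reason to keep it in the pipeline rather than scaling by hand beforehand.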

## Feature Union of Two Pipelines
Now we are ready for our feature union! We will create an object whose output is our full feature vector, which will be the union of a tfidf-transformed word occurrence vector with a scaled special feature vector.

In [196]:
# Create a feature union of vectorizer pipeline and custom transformer
from sklearn.pipeline import FeatureUnion
objs = [("vectscaled", pipe_vectorizer), 
        ("specialfeats", pipe_specialfeats)]
union = FeatureUnion(objs)

Are you confused yet? We made a pipeline from `CountVectorizer` and `TfidfTransformer` to get an object that takes a raw email body and spits out an occurrence vector. Then we used `FunctionTransformer` to turn our custom feature extraction function into a transformer that takes in a raw email body and spits out a vector of other features. Then we created a union of these two things, which is itself a transformer object that will take in a raw email body and spit out the union of the occurrence vector with the special features vector.
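If it helps, here is the same structure rebuilt on toy data so you can check the shapes. Everything here is a stand-in: `toy_specials` is a made-up two-feature function in place of `special_features_wrapper`, and the default `CountVectorizer` preprocessor replaces `word_salad`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical stand-in for the special features: counts of "!" and "$"
def toy_specials(texts):
    return np.array([[t.count("!"), t.count("$")] for t in texts], dtype=float)

toy_words = Pipeline([("vect", CountVectorizer()),
                      ("tfidf", TfidfTransformer())])
toy_extras = Pipeline([("specials", FunctionTransformer(toy_specials, validate=False)),
                       ("scaler", StandardScaler())])
toy_union = FeatureUnion([("words", toy_words), ("extras", toy_extras)])

toy_docs = ["win cash now!!", "lunch at noon", "cash $$$ cash!"]
combined = toy_union.fit_transform(toy_docs)  # sparse, since the tf-idf part is sparse

# Width of the union = vocabulary size + number of special features
fitted_words = dict(toy_union.transformer_list)["words"]
n_vocab = len(fitted_words.named_steps["vect"].vocabulary_)
print(combined.shape, n_vocab)
```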

## Final Complete Pipeline with SVM Estimator
Now we need a *final* pipeline that combines our feature union object with the actual SVM estimator:

In [197]:
# Create a pipeline of feature union with svm estimator
from sklearn.svm import SVC
objs = [("union", union),
       ("svm", SVC(kernel="rbf"))]
pipe_full = Pipeline(objs)

# Grid Search With the Full Pipeline
Now we can do a grid search with the full pipeline, using the raw emails as our input. Note that to refer to parameters of objects that are nested within pipelines you use the `__` syntax, chaining step names to go down as many layers as needed (for example `union__vectscaled__vect__min_df`).

In [198]:
# Specify parameters of the pipeline and their ranges for grid search
# params = {
#     'union__vectscaled__vect__min_df': (0.01, 0.03, 0.06),
#     'union__vectscaled__vect__ngram_range': ((1, 1),(1, 2)),  # unigrams or bigrams
#     'union__vectscaled__tfidf__use_idf': (True,),
#     'svm__C': np.logspace(-2, 2, 10),
#     'svm__gamma': np.logspace(-3, 1, 10),
# }

params = {
    'union__vectscaled__vect__min_df': (0.01,),
    'union__vectscaled__vect__ngram_range': ((1, 1),),  # unigrams only
    'union__vectscaled__tfidf__use_idf': (False,),
    'svm__C': (1,),
    'svm__gamma': (0.1,)
}


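# Construct and fit our grid search object
# (tempX and tempy are the small subsample built in cells In [184] and In [185] below)
from sklearn.model_selection import GridSearchCV
search = GridSearchCV(pipe_full, param_grid=params)
search.fit(tempX, tempy)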

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(steps=[('union', FeatureUnion(n_jobs=1,
       transformer_list=[('vectscaled', Pipeline(steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None...,
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'union__vectscaled__vect__min_df': (0.01,), 'svm__gamma': (0.1,), 'svm__C': (1,), 'union__vectscaled__vect__ngram_range': ((1, 1),), 'union__vectscaled__tfidf__use_idf': (False,)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [189]:
search.param_grid

{'svm__C': (1,),
 'svm__gamma': (0.1,),
 'union__vectscaled__tfidf__use_idf': (False,),
 'union__vectscaled__vect__min_df': (0.01,),
 'union__vectscaled__vect__ngram_range': ((1, 1),)}
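If you're ever unsure what the nested names are, `get_params()` will list them for you. Here is a sketch on a stripped-down copy of the pipeline structure (default vectorizer, no special features), so the object names below are stand-ins rather than the real `pipe_full`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

# Simplified stand-in with the same step names as the full pipeline
toy_full = Pipeline([
    ("union", FeatureUnion([
        ("vectscaled", Pipeline([("vect", CountVectorizer()),
                                 ("tfidf", TfidfTransformer())])),
    ])),
    ("svm", SVC(kernel="rbf")),
])

# Every tunable parameter, addressed by chaining step names with "__"
names = sorted(toy_full.get_params().keys())
print([n for n in names if n.startswith("union__vectscaled__vect__")][:5])

# The same names work with set_params, which is what GridSearchCV uses internally
toy_full.set_params(union__vectscaled__vect__min_df=0.01, svm__C=10)
```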

The grid above contains a single parameter combination (unigrams only), so `cv_results_` holds just one row of cross-validation scores and timings:

In [199]:
df = pd.DataFrame(search.cv_results_)
df.head()

|  | mean_fit_time | mean_score_time | mean_test_score | mean_train_score | param_svm__C | param_svm__gamma | param_union__vectscaled__tfidf__use_idf | param_union__vectscaled__vect__min_df | param_union__vectscaled__vect__ngram_range | params | ... | split0_test_score | split0_train_score | split1_test_score | split1_train_score | split2_test_score | split2_train_score | std_fit_time | std_score_time | std_test_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.840653 | 0.932307 | 0.724138 | 0.965789 | 1 | 0.1 | False | 0.01 | (1, 1) | {'union__vectscaled__vect__min_df': 0.01, 'uni... | ... | 0.7 | 1.0 | 0.8 | 0.947368 | 0.666667 | 0.95 | 0.102011 | 0.24252 | 0.056661 | 0.024214 |
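With more than one combination in the grid, the interesting part is comparing rows. Here is a sketch of that on a tiny made-up corpus with a plain text pipeline (no feature union); the documents, labels and parameter ranges are all invented for illustration, and the scores on eight toy sentences mean nothing in themselves.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy corpus: 1 = spam, 0 = ham (made up)
toy_docs = ["win cash now", "free cash prize", "cash prize now", "win free prize",
            "lunch at noon", "meeting at noon", "see you at lunch", "notes from meeting"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline([("vect", CountVectorizer()),
                     ("tfidf", TfidfTransformer()),
                     ("svm", SVC(kernel="rbf"))])

toy_params = {"svm__C": (0.1, 1, 10), "svm__gamma": (0.1, 1)}  # 3 x 2 = 6 combinations
toy_search = GridSearchCV(toy_pipe, param_grid=toy_params, cv=2)
toy_search.fit(toy_docs, toy_labels)

# One entry per combination in cv_results_; the winner is summarized separately
n_combos = len(toy_search.cv_results_["params"])
print(n_combos, toy_search.best_params_, toy_search.best_score_)
```

`best_params_` and `best_score_` are shortcuts for the top-ranked row of `cv_results_`, and with `refit=True` (the default) `toy_search.predict` uses a model refit on all the data with those parameters.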


In [184]:
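# Small subsample for a quick test: the first 20 emails plus 9 from the end
# (the slice [-10:-1] excludes the very last item)
tempX = emails_raw[0:20] + emails_raw[-10:-1]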

In [185]:
# Matching labels for the subsample (y is a plain list, so + concatenates)
tempy = y[0:20] + y[-10:-1]

In [186]:
tempy

[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1]