Suppose the spam generator becomes more sophisticated and begins producing prose that looks more legitimate than before.

There are many ways the prose could come to resemble legitimate text. For the purposes of this notebook we will simply force the spam data to 'drift' by prepending the opening lines of Pride and Prejudice to each spam document in our test set. We will then see how the trained model responds.

In [1]:
import pandas as pd
df = pd.read_parquet("data/training.parquet")



In [2]:
from sklearn import model_selection


df_train, df_test = model_selection.train_test_split(df, random_state=43)
df_test_spam = df_test[df_test.label == 'spam'].copy()

In [3]:
def add_text(doc, adds):
    """
    takes in a string _doc_ and
    prepends the text _adds_ to it
    """
    
    return adds + doc

In [4]:
pride_pred = '''It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.'''

In [5]:
df_test_spam["text"] = df_test_spam.text.apply(add_text, adds=pride_pred)

In [6]:
pd.set_option('display.max_colwidth', None) # ensures that all the text is visible (pandas older than 1.0 used -1 here)

df_test_spam.sample(3)

Unnamed: 0,index,label,text
39419,19419,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.I snack on them in short order.I have been unable to get a halfway decent replacement for Oreo's. They gave her shots and they slowly went away. There is none of the guilt!"
21004,1004,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.BUT, I really hate the new change and won't buy again. I prefer tendons or pizzles, since they last awhile. Definetly my favorite coffee shop. Compared to it's competetor, I'll never go back to Amazon and the price is very reasonable, it cost the same or even comparable to Dave's. Great organic product. No guilt, no high glycemic index to make you lose weight.Health benefits aside, we purchased these for my husband. This is a great thing, since it is much improved!! This is great!!! first time my husband and I have wanted nothing to do with the ritual to stand up to a lovely day while your supply lasts. Same thing. I finished every drop."
37910,17910,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.She was chewing on hulls. Great alternative to other kinds of coffee has skyrocketed. There's very good directions for making them available to order in bulk until a couple of times until a friend recommended this to my vegan friends either. 3 stars. We have found it on Amazon and it was hot. I think they taste really fresh for a long time and I can report back if her tartar build-up has been reduced in half within a week. It is pretty much like chewing styrofoam! doesn't have a strong cheddar taste. you looking at the news on your computer. Mixed with chopped cooked hotdogs."


We now need to compute the summary features for the drifted spam documents, using the function defined in the next cell.

In [7]:
import re
import numpy as np
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def strip_punct(doc):
    """
    takes in a document _doc_ and
    returns a tuple of the punctuation-free
    _doc_ and the count of punctuation in _doc_
    """
    
    return re.subn(r"""[!.><:;',@#~{}\[\]\-_+=£$%^&()?]""", "", doc, count=0, flags=0)

def caps(word):
    return not word.islower()

def isstopword(word):
    return word in ENGLISH_STOP_WORDS

def standard_summary(row):
    """
    takes in an entry _row_ from the data,
    computes each of the summaries from its
    "text" field and returns the summaries
    in a list
    """
    
    doc = row["text"]

    no_punct = strip_punct(doc)
    
    words = no_punct[0].split()
    
    number_words = len(words)
    
    word_length = [len(x) for x in words]
    
    mean_wl = sum(word_length)/number_words
    
    max_wl = max(word_length)
    min_wl = min(word_length)

    pc_90_wl = np.percentile(word_length, 90)
    pc_10_wl = np.percentile(word_length, 10)
    
    upper = sum([caps(x) for x in words])
    stop_words = sum([isstopword(x) for x in words])

    return [ no_punct[1], number_words, mean_wl, max_wl, min_wl, pc_10_wl, pc_90_wl, upper, stop_words]
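
As a quick illustration of why `standard_summary` indexes `no_punct[0]` and `no_punct[1]`: `re.subn` returns a pair of the substituted string and the number of substitutions made. The sketch below reuses the same character class on a made-up sentence (the sentence and the names `PUNCT`, `stripped` and `n_removed` are assumptions for illustration only, not part of the notebook).

```python
import re

# Same punctuation character class as in strip_punct above.
PUNCT = r"""[!.><:;',@#~{}\[\]\-_+=£$%^&()?]"""

# Hypothetical example document, not taken from the data set.
doc = "Great product!!! Would buy again, honestly."

# re.subn returns (new_string, number_of_substitutions).
stripped, n_removed = re.subn(PUNCT, "", doc)

print(stripped)    # Great product Would buy again honestly
print(n_removed)   # 5
```

The count becomes the `num_punct` feature, and the stripped string is what gets split into words for the remaining summaries.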

In [8]:
features = df_test_spam.apply(standard_summary, axis=1).apply(pd.Series)
features.columns = ["num_punct", "num_words", "av_wl", "max_wl", "min_wl", "10_quantile", "90_quantile", "upper_case", "stop_words"]

In [9]:
features.sample(4)

Unnamed: 0,num_punct,num_words,av_wl,max_wl,min_wl,10_quantile,90_quantile,upper_case,stop_words
36401,27.0,195.0,4.174359,15.0,1.0,2.0,7.6,20.0,101.0
37887,32.0,222.0,4.234234,16.0,1.0,2.0,7.9,29.0,108.0
21287,34.0,258.0,4.248062,15.0,1.0,2.0,8.0,33.0,133.0
21154,51.0,261.0,4.279693,18.0,1.0,2.0,8.0,31.0,125.0


Now we can load in the models we generated earlier and see how well they classify these feature vectors. 

In [10]:
filename = 'models/lr_model_simplesummaries.sav'

In [11]:
import pickle
# load the model from disk, closing the file once it has been read
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
#result = loaded_model.score(X_test, Y_test)
#print(result)

In [12]:
loaded_model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=4000, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [13]:
predictions = loaded_model.predict(features.iloc[:,0:features.shape[1]])

In [14]:
predictions

array(['legitimate', 'legitimate', 'legitimate', ..., 'legitimate',
       'legitimate', 'spam'], dtype=object)

In [15]:
type(predictions)

numpy.ndarray

In [16]:
np.array(np.unique(predictions, return_counts = True))

array([['legitimate', 'spam'],
       [3804, 1192]], dtype=object)

In [None]:
# double-check that the test set is the same when we use the same seed in the split function,
# and that the loaded model behaves as expected on the spam without drift

In [19]:
features_no_drift = df_test[df_test.label == 'spam'].apply(standard_summary, axis=1).apply(pd.Series)
features_no_drift.columns = ["num_punct", "num_words", "av_wl", "max_wl", "min_wl", "10_quantile", "90_quantile", "upper_case", "stop_words"]

no_drift_preds = loaded_model.predict(features_no_drift.iloc[:,0:features_no_drift.shape[1]])

In [21]:
np.array(np.unique(no_drift_preds, return_counts = True))

array([['legitimate', 'spam'],
       [854, 4142]], dtype=object)
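
To put the two sets of counts side by side, the following minimal sketch computes the fraction of spam documents that the model still labels as spam, with and without the drift. The counts are copied by hand from the two `np.unique` outputs above; the helper `spam_recall` and the dictionary names are hypothetical and only used here for illustration.

```python
# Counts copied from the np.unique outputs above (spam-only subset of the test set).
no_drift = {"legitimate": 854, "spam": 4142}
drifted = {"legitimate": 3804, "spam": 1192}

def spam_recall(counts):
    """Fraction of true spam documents that were predicted as spam."""
    total = counts["legitimate"] + counts["spam"]
    return counts["spam"] / total

recall_no_drift = spam_recall(no_drift)
recall_drifted = spam_recall(drifted)

print(f"spam recall without drift: {recall_no_drift:.3f}")   # 0.829
print(f"spam recall with drift:    {recall_drifted:.3f}")    # 0.239
print(f"drop in recall:            {recall_no_drift - recall_drifted:.3f}")  # 0.590
```

On these counts, prepending the passage moves roughly 59% of the spam documents in the test set from a correct "spam" prediction to "legitimate".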