Suppose the spam generator becomes more intelligent and begins producing prose which looks "more legitimate" than before. 

There are many ways spam could come to resemble legitimate text. In this notebook we simply force the spam data to 'drift' by prepending the opening lines of Pride and Prejudice to every spam document in our testing set. We will then see how the trained models respond.

In [1]:
import pandas as pd
df = pd.read_parquet("data/training.parquet")



We split the data into training and testing sets, as in the modelling notebooks. We use the 'random_state' parameter to ensure that the data is split in the same way as it was when we fit the model. 

In [2]:
from sklearn import model_selection

df_train, df_test = model_selection.train_test_split(df, random_state=43)
df_test_spam = df_test[df_test.label == 'spam'].copy() #filter the spam documents
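
Why fixing the seed matters can be illustrated with a small standard-library sketch. The helper `seeded_split` below is hypothetical and is not how scikit-learn implements `train_test_split`; it only shows that a seeded shuffle gives the same partition on every run. The 0.25 test fraction mirrors `train_test_split`'s default.

```python
import random

def seeded_split(items, test_fraction=0.25, seed=43):
    """Shuffle a copy of `items` with a fixed seed, then cut off a test set."""
    rng = random.Random(seed)          # private generator; global state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

docs = list(range(20))                 # stand-in for document ids
train_a, test_a = seeded_split(docs, seed=43)
train_b, test_b = seeded_split(docs, seed=43)

print(test_a == test_b)                # True: same seed, same split
print(sorted(train_a + test_a) == docs)  # True: nothing lost or duplicated
```

If the seed used here differed from the one used when the model was fit, some of the "test" spam would have been seen during training, and the comparison below would be optimistic.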

In [3]:
def add_text(doc, adds):
    """
    takes in a string _doc_ and
    prepends the text _adds_ to it
    """
    
    return adds + doc

In [4]:
pride_pred = '''It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.'''

In [5]:
# prepend the Pride and Prejudice excerpt to every spam document
df_test_spam["text"] = df_test_spam.text.apply(add_text, adds=pride_pred)
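
One side effect of plain concatenation is worth noting: because no separator is inserted, the last word of the excerpt fuses with the first word of the spam (for example "impatiently.I've"). The sketch below shows the effect on a shortened prefix. `add_text_spaced` and `tokens` are hypothetical helpers introduced only for illustration, and `tokens` is a simplified stand-in for the punctuation stripping used later in the notebook.

```python
import re

def add_text(doc, adds):
    return adds + doc

def add_text_spaced(doc, adds, sep=" "):
    """Hypothetical variant: prepend `adds` with a separator in between."""
    return adds + sep + doc

def tokens(text):
    # simplified: drop full stops and apostrophes, then split on whitespace
    return re.sub(r"[.']", "", text).split()

prefix = "cried his wife impatiently."
spam = "I've fed her premium brands."

fused = tokens(add_text(spam, prefix))
spaced = tokens(add_text_spaced(spam, prefix))

print(fused)   # contains the fused token 'impatientlyIve'
print(spaced)  # 'impatiently' and 'Ive' are separate tokens
```

Whether to add a separator is a design choice: fusing affects one token per document, so its influence on the summary statistics is small, but it is an artefact of the experiment rather than of the simulated drift.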

In [6]:
pd.set_option('display.max_colwidth', None) #ensures that all the text is visible (older pandas versions used -1 here)
df_test_spam.sample(3)

Unnamed: 0,index,label,text
33863,13863,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.I've fed her premium brands and a few glowing ones. I am definitely getting this for my Girlfriend's Dad as a birthday gift of these chocolates are good enough to eat! Doesn't taste like fertilizer like another brand of juice. The batch I received wasn't old, and as for taste, it's equivalent to one fruit in the cranberry flavor so much I made a loaf with free shipping using their coupon. Believe me, you will never go back. I was excited about receiving decaf tea, but it is worth the effort. I have eaten some pretty strange things to kill carb cravings, and most of the time, I decant them for draining and make an excellent main course coffee too.Enjoy. The added advantage of dental hygiene is great. At first I was concerned about the crash but there is an easy review: put it in a shake and it tastes bold, like a Cabernet should. If you are looking for an oil to soak toothpicks in like we use to wrap and make Kimbab, Onigiri wraps or Sushi.. I realize that taste is a hit with my kitty, and it is now a once-a-week dinner in our house. There are usually three qualities of matcha: ceremonial, universal, and culinary. Very fine alfalfa seed, very helpful and it gives a sour taste."
21835,1835,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.Looking forward to trying out some of the dark/bold flavors and even he says they contain anthocyanins which in his words,help you to lose weight or just eat it by the case at all. Full body and not harsh when left in the gym and my workouts are great, I am so addicted to these!!! You know the end of the class--in addition to small treats after each run. Her stool is also firm again.I am now going to have to do is follow the size and so spicy, you can't eat just one? The type in which organic apple juice concentrate. Phat Beets are a different, gourmet style.The Phat Beets recipe has a complex layering of spices--rosemary, ginger, lemon, onion, allspice, cinnamon sticks, cloves and brown sugar oatmeal. However, unlike the Irish, I can only surmise that something has happened to this brand? We recently adopted a 15 lb. If you want to look for it elsewhere..."
38633,18633,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.This company is gouging you people!! The product itself - however, I only had one! It is the only low fat cheddar flavored cracker that I am not sure what flavor it was. It's also a little more than others but, as a previous reviewer about adding a can of this into a smoothie that is different than heavy packed snow."


We now need to compute the feature vectors for the 'drifted' spam data. We will do this twice, once for the simple summary feature vectors and once for the tf_idf feature vectors, and compare the results.

In [7]:
### functions to compute 'simple' summary statistics

import re
import numpy as np
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # the stop_words submodule was removed in newer scikit-learn

def strip_punct(doc):
    """
    takes in a document _doc_ and
    returns a tuple of the punctuation-free
    _doc_ and the count of punctuation in _doc_
    """
    
    return re.subn(r"""[!.><:;',@#~{}\[\]\-_+=£$%^&()?]""", "", doc, count=0, flags=0)

def caps(word):
    return not word.islower()

def isstopword(word):
    return word in ENGLISH_STOP_WORDS

def standard_summary(row):
    """
    takes in an entry _row_ from the data,
    computes each of the summary statistics for
    its text and returns them as a list
    """
    
    doc = row["text"]

    no_punct = strip_punct(doc)
    
    words = no_punct[0].split()
    
    number_words = len(words)
    
    word_length = [len(x) for x in words]
    
    mean_wl = sum(word_length)/number_words
    
    max_wl = max(word_length)
    min_wl = min(word_length)

    pc_90_wl = np.percentile(word_length, 90)
    pc_10_wl = np.percentile(word_length, 10)
    
    upper = sum([caps(x) for x in words])
    stop_words = sum([isstopword(x) for x in words])

    return [ no_punct[1], number_words, mean_wl, max_wl, min_wl, pc_10_wl, pc_90_wl, upper, stop_words]

In [8]:
features = df_test_spam.apply(standard_summary, axis=1).apply(pd.Series)
features.columns = ["num_punct", "num_words", "av_wl", "max_wl", "min_wl", "10_quantile", "90_quantile", "upper_case", "stop_words"]

In [9]:
features.sample(4)

Unnamed: 0,num_punct,num_words,av_wl,max_wl,min_wl,10_quantile,90_quantile,upper_case,stop_words
20027,40.0,240.0,4.175,15.0,1.0,2.0,8.0,34.0,120.0
35121,37.0,230.0,4.108696,13.0,1.0,2.0,7.0,24.0,115.0
26715,40.0,221.0,4.239819,13.0,1.0,2.0,8.0,28.0,112.0
21754,42.0,261.0,4.180077,18.0,1.0,2.0,7.0,34.0,131.0
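
The `num_punct` column comes from `re.subn`, which returns both the substituted string and the number of substitutions made. A short sketch on made-up sentences, reusing the same character class as `strip_punct`:

```python
import re

PUNCT = r"""[!.><:;',@#~{}\[\]\-_+=£$%^&()?]"""

stripped, n_removed = re.subn(PUNCT, "", "Hello, world! It's 9:30.")
print(stripped)    # Hello world Its 930
print(n_removed)   # 5

# Double quotes and curly quotes are not in the character class,
# so the quoted dialogue in the Austen excerpt keeps its quote marks.
quoted, n_quotes = re.subn(PUNCT, "", "“My dear”")
print(quoted, n_quotes)   # “My dear” 0
```

Because every drifted document starts with the same excerpt, each row's punctuation count, word count and stop-word count include the same contribution from that excerpt.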


Now we can load in the models we generated earlier and see how well they classify these feature vectors. 

In [10]:
filename = 'models/lr_model_simplesummaries.sav' 

In [11]:
import pickle

# open the file in a context manager so it is closed after loading
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

In [12]:
predictions = loaded_model.predict(features)  # every column of `features` is a model input

In [13]:
predictions

array(['legitimate', 'legitimate', 'legitimate', ..., 'legitimate',
       'legitimate', 'spam'], dtype=object)

In [14]:
np.array(np.unique(predictions, return_counts = True))

array([['legitimate', 'spam'],
       [3804, 1192]], dtype=object)

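Before data drift, this model classified 4142 of the spam documents correctly and 864 incorrectly. After drift, it labels only 1192 of the 4996 drifted spam documents as spam and the remaining 3804 as legitimate.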
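
To put those counts on a common scale, we can express them as the fraction of spam documents that the model still identifies as spam (its recall on the spam class). The numbers below are copied from the text and output above; nothing new is computed from the data.

```python
# counts quoted for the undrifted test spam
before_correct, before_missed = 4142, 864
# counts from np.unique on the drifted predictions
after_spam, after_legit = 1192, 3804

recall_before = before_correct / (before_correct + before_missed)
recall_after = after_spam / (after_spam + after_legit)

print(f"spam recall before drift: {recall_before:.3f}")   # about 0.827
print(f"spam recall after drift:  {recall_after:.3f}")    # about 0.239
```

So the simple-summaries model goes from catching roughly 83% of the spam to roughly 24% once the excerpt is prepended.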

We will now compute the tf_idf feature vectors for this drifted data and see whether the model trained on tf_idf features performs any better.

In [15]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

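# hash each token into one of 8192 buckets; the vectorizer is stateless, so fitting learns nothing from the data
hv = HashingVectorizer(norm=None, token_pattern='(?u)\\b[A-Za-z]\\w+\\b', n_features=8192, alternate_sign = False)
hvcounts = hv.fit_transform(df_test_spam["text"])
# note: the IDF weights are fit on the drifted test spam itself
tfidf_transformer = TfidfTransformer()
hvdf_tfidf = tfidf_transformer.fit_transform(hvcounts)
dense_tf_idf = hvdf_tfidf.toarray()
# keep the 'index' and 'label' columns alongside the 8192 tf_idf features
tf_features = pd.concat([df_test_spam.reset_index()[["index", "label"]],pd.DataFrame(dense_tf_idf)], axis=1)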
tf_features.columns = tf_features.columns.astype(str)
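
One consequence of fitting `TfidfTransformer` on the drifted documents is that every hash bucket touched by the excerpt appears in all of them, so it receives the smallest possible IDF weight. The sketch below evaluates scikit-learn's default smoothed IDF formula by hand. The helper `smooth_idf` is hypothetical, and the document frequency of 50 is a made-up value for contrast. How the IDF weights were obtained when the saved model was trained is not shown in this notebook, so treat the comparison as an assumption to check.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """idf = ln((1 + n) / (1 + df)) + 1, the default smoothed form in scikit-learn."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 4996                        # number of drifted spam documents

idf_everywhere = smooth_idf(n_docs, n_docs)   # bucket present in every document
idf_rare = smooth_idf(n_docs, 50)             # hypothetical bucket present in 50 documents

print(idf_everywhere)   # 1.0, the minimum possible weight
print(idf_rare)         # noticeably larger
```

If the model was instead trained with IDF weights learned from the training set, reusing that fitted transformer here would be the closer match to production behaviour.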

In [16]:
tf_features.sample(10)

Unnamed: 0,index,label,0,1,2,3,4,5,6,7,...,8182,8183,8184,8185,8186,8187,8188,8189,8190,8191
1736,4355,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.191259,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3593,2186,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1396,6348,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3405,7826,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1261,12290,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
782,15235,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
637,766,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2378,17409,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4349,19539,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
730,14698,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
filename = 'models/lr_model_tfidfsummaries.sav'
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

In [18]:
predictions = loaded_model.predict(tf_features.iloc[:, 2:])  # skip the 'index' and 'label' columns

In [19]:
predictions

array(['spam', 'legitimate', 'legitimate', ..., 'legitimate',
       'legitimate', 'spam'], dtype=object)

In [20]:
import numpy as np
np.array(np.unique(predictions, return_counts = True))

array([['legitimate', 'spam'],
       [3807, 1189]], dtype=object)
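
Near-identical counts do not by themselves show that the two models flag the same documents. Since `predictions` is overwritten above, comparing them would require keeping each model's output under its own name. A minimal sketch with hypothetical toy predictions (not the notebook's real outputs):

```python
from collections import Counter

# Hypothetical toy predictions for five documents
simple_preds = ["legitimate", "spam", "legitimate", "spam", "legitimate"]
tfidf_preds = ["spam", "legitimate", "legitimate", "spam", "legitimate"]

same_counts = Counter(simple_preds) == Counter(tfidf_preds)
agreement = sum(a == b for a, b in zip(simple_preds, tfidf_preds)) / len(simple_preds)

print(same_counts)   # True: both predict 3 legitimate and 2 spam
print(agreement)     # 0.6: yet they agree on only 3 of the 5 documents
```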

## Exercises
The two models perform very similarly on the 'drifted' data in this notebook. Consider alternative types of data drift and see how the models perform: 
1. What happens when fewer words from Pride and Prejudice are prepended to the spam? 
2. How about using a completely different excerpt of Austen? 
3. How do the models perform when generic text (neither Austen nor food reviews) is prepended to the spam? 
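
As a starting point for the first exercise, a small helper can truncate the excerpt to a chosen number of words before it is passed to `add_text`. `first_n_words` is a hypothetical helper, and only the first sentence of the excerpt is used here to keep the sketch short.

```python
def first_n_words(text, n):
    """Return the first `n` whitespace-separated words of `text`."""
    return " ".join(text.split()[:n])

excerpt = ("It is a truth universally acknowledged, that a single man in "
           "possession of a good fortune, must be in want of a wife.")

for n in (5, 10, 23):
    print(n, "->", first_n_words(excerpt, n))
```

Re-running the feature computation and prediction cells for a range of `n` would show how much prepended text is needed before each model's predictions start to flip.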