Suppose the spam generator becomes more intelligent and begins producing prose which looks "more legitimate" than before. 

There are numerous ways the prose could become more like legitimate text. For the purpose of this notebook we will simply force the spam data to 'drift' by adding the first few lines of Pride and Prejudice to the start of the spam documents in our testing set. We will then see how the trained model responds.  

In [1]:
import pandas as pd
df = pd.read_parquet("data/training.parquet")



We split the data into training and testing sets, as in the modelling notebooks. We use the 'random_state' parameter to ensure that the data is split in the same way as it was when we fit the model. 

In [2]:
from sklearn import model_selection

df_train, df_test = model_selection.train_test_split(df, random_state=43)
df_test_spam = df_test[df_test.label == 'spam'].copy() #filter the spam documents
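
As a minimal sketch of why fixing `random_state` matters, the example below splits a toy list of integers (a stand-in for the document rows, not the real data) twice with the same seed and checks that the test sets are identical. The seed value 43 is taken from the cell above; everything else is illustrative.

```python
from sklearn import model_selection

# Toy stand-in for the rows of the real DataFrame.
docs = list(range(20))

# Two splits with the same seed select exactly the same rows.
train_a, test_a = model_selection.train_test_split(docs, random_state=43)
train_b, test_b = model_selection.train_test_split(docs, random_state=43)

same_split = test_a == test_b
print("identical test sets:", same_split)
print("test set size:", len(test_a))  # default test_size is 0.25
```

Because the split is reproduced exactly, none of the documents in `df_test` were seen by the model during training.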

In [3]:
def add_text(doc, adds):
    """
    Takes in a string _doc_ and
    prepends the text _adds_ to it.
    """
    
    return adds + doc

In [4]:
pride_pred = '''It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.'''

In [5]:
# prepend the Pride and Prejudice excerpt to every spam document
df_test_spam["text"] = df_test_spam.text.apply(add_text, adds=pride_pred)
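
The same transformation can be tried on a toy DataFrame to see exactly what it does. This is only a sketch: the two rows and the shortened prefix are invented for illustration, and the vectorised form at the end is an equivalent alternative rather than what the notebook uses.

```python
import pandas as pd

def add_text(doc, adds):
    """Return doc with adds prepended."""
    return adds + doc

# Invented rows standing in for the spam test set.
toy = pd.DataFrame({
    "label": ["spam", "spam"],
    "text": ["Great crackers!", "My dog loves them."],
})
prefix = "It is a truth universally acknowledged. "

# Work on a copy so the original documents are left untouched.
drifted = toy.copy()
drifted["text"] = drifted["text"].apply(add_text, adds=prefix)

# Equivalent vectorised form: string + Series concatenates element-wise.
vectorised = prefix + toy["text"]
print(drifted["text"].tolist())
```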

In [6]:
pd.set_option('display.max_colwidth', -1)  # show the full text; in pandas >= 1.0 pass None instead of -1
df_test_spam.sample(3)

Unnamed: 0,index,label,text
37924,17924,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.So far we have tried many! This Stash organic decaf green tea is simply not authentic chai. I absolutely LOVED EB food...as did my baby. These crackers are DELICIOUS!!!!I bought some at Costco. My dog absolutely loves them."
35608,15608,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.I have to agree that it's a great training treat, and they all look dark brown. He still gets to inhale his food and he'll graze in the house and up stairs. I will have to think twice about it until 2 hours later i threw everything up and my stomach still feels horrible and i still havent recived them. I've tried Zico, O.N.E., various brands from the Indian market I usually purchase this at Publix, Walmart, or Target in the smaller boxes. This ginger/lemon beverage was such a nice round coffee taste. We buy this gum in stores, and I am TOTALLY satisfied with the price on Amazon is really making some money off this one. I grew up with, but hated that it cost 50-70% less on Amazon, so when I noticed for the first cup and it keeps her from snacking on junk food. If you order this item. I love to indulge upon occasion. This one tastes great on everything it touches!"
20166,166,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.Wouldn't be without it! Then, the question is, what do you expect. The taste is great when traveling--camping, hotels, etc. Especially camping. I never seen coconut oil so i buy it on occasion and it tasted so.. I usually put some kind of fine so it stays fresh until consumed.Whole bean or ground flavored coffee products."


We now need to compute the feature vectors for the 'drifted' spam data. We will do this twice, once for the simple summary feature vectors and once for the tf_idf feature vectors, and compare the results. We use the helper functions [`features_tfidf`](mlworkflows/featurestfidf.py) and [`features_simple`](mlworkflows/featuressimple.py) to compute the feature vectors. 

In [9]:
from mlworkflows import featuressimple
features = featuressimple.features_simple(df_test_spam)

In [10]:
features.sample(4)

Unnamed: 0,index,label,num_punct,num_words,av_wl,max_wl,min_wl,10_quantile,90_quantile,upper_case,stop_words
36321,16321,spam,34.0,237.0,4.168776,14.0,1.0,2.0,7.0,31.0,121.0
21267,1267,spam,41.0,276.0,4.155797,14.0,1.0,2.0,8.0,30.0,144.0
34470,14470,spam,29.0,171.0,4.140351,14.0,1.0,2.0,8.0,21.0,87.0
29675,9675,spam,40.0,251.0,4.159363,13.0,1.0,2.0,7.0,29.0,122.0
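
To get a feel for why the prepended text moves these summaries, here is a small sketch that computes three of them by hand. The function `simple_summary` is hypothetical: it imitates the column names `num_words`, `num_punct` and `av_wl`, but it is not the implementation in `mlworkflows/featuressimple.py`, which may count words and punctuation differently.

```python
import string

def simple_summary(text):
    """Hypothetical summaries: word count, punctuation count, mean word length."""
    words = text.split()
    num_punct = sum(ch in string.punctuation for ch in text)
    av_wl = sum(len(w) for w in words) / len(words) if words else 0.0
    return {"num_words": len(words), "num_punct": num_punct, "av_wl": av_wl}

# First sentence of the excerpt, followed by a short invented spam snippet.
prefix = ("It is a truth universally acknowledged, that a single man in "
          "possession of a good fortune, must be in want of a wife. ")
spam = "These crackers are DELICIOUS!!!!"

before = simple_summary(spam)
after = simple_summary(prefix + spam)
print("before:", before)
print("after: ", after)
```

For a short document, the fixed prefix dominates every summary, so the drifted documents end up with very similar feature vectors regardless of their original content.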


Now we can load in the models we generated earlier and see how well they classify these feature vectors. 

In [11]:
filename = 'models/lr_model_simplesummaries.sav' 

In [12]:
import pickle

# open the file in a context manager so it is closed after loading
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

In [14]:
predictions = loaded_model.predict(features.iloc[:, 2:])  # skip the index and label columns

In [15]:
predictions

array(['legitimate', 'legitimate', 'legitimate', ..., 'legitimate',
       'legitimate', 'spam'], dtype=object)

In [17]:
import numpy as np
np.array(np.unique(predictions, return_counts = True))

array([['legitimate', 'spam'],
       [3804, 1192]], dtype=object)

Before data drift, this model classified 4142 of the spam documents correctly and 864 incorrectly. On the drifted data it labels only 1192 documents as spam and 3804 as legitimate. 
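
The counts reported above can be turned into the fraction of spam that the simple-summaries model catches before and after the drift. This is plain arithmetic on the numbers printed in this notebook; the variable names are only for illustration.

```python
# Counts reported in this notebook for the simple-summaries model.
before_correct, before_wrong = 4142, 864   # spam documents before drift
after_spam, after_legit = 1192, 3804       # predictions on the drifted spam

# Fraction of spam documents that are still labelled as spam.
recall_before = before_correct / (before_correct + before_wrong)
recall_after = after_spam / (after_spam + after_legit)

print(f"spam recall before drift: {recall_before:.3f}")
print(f"spam recall after drift:  {recall_after:.3f}")
```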

We will now compute the tf_idf feature vectors for this drifted data and see if the model performs any better. 

In [18]:
from mlworkflows import featurestfidf

In [24]:
tf_features = featurestfidf.features_tfidf(df_test_spam)

In [25]:
tf_features.sample(10)

Unnamed: 0,index,label,0,1,2,3,4,5,6,7,...,8182,8183,8184,8185,8186,8187,8188,8189,8190,8191
242,5679,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2839,5389,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
493,2367,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4456,19986,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2314,16874,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4553,13404,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
417,17054,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3450,17318,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3334,13454,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1557,14318,spam,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
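
The table has 8192 (that is, 2**13) feature columns, almost all of them zero. One common way to get a fixed-width, sparse representation like this is feature hashing followed by idf weighting. The sketch below shows that approach with scikit-learn on three invented documents. It is an assumption made for illustration: the source does not show how `features_tfidf` is implemented, and it may work differently.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

# Invented documents; the real pipeline operates on df_test_spam.text.
docs = [
    "These crackers are delicious",
    "My dog absolutely loves them",
    "It is a truth universally acknowledged",
]

# Hash each token into one of 2**13 buckets (assumed width, matching the table).
hasher = HashingVectorizer(n_features=2**13, alternate_sign=False, norm=None)
counts = hasher.transform(docs)

# Re-weight the hashed counts by inverse document frequency.
tfidf = TfidfTransformer().fit_transform(counts)

print(tfidf.shape)
print("non-zero entries:", tfidf.nnz)
```

With only a handful of distinct tokens per document, nearly every one of the 8192 columns is zero in any given row, which is consistent with the sample shown above.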


In [26]:
filename = 'models/lr_model_tfidfsummaries.sav'
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

In [27]:
predictions = loaded_model.predict(tf_features.iloc[:, 2:])  # skip the index and label columns

In [19]:
predictions

array(['spam', 'legitimate', 'legitimate', ..., 'legitimate',
       'legitimate', 'spam'], dtype=object)

In [20]:
import numpy as np
np.array(np.unique(predictions, return_counts = True))

array([['legitimate', 'spam'],
       [3807, 1189]], dtype=object)
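
When comparing models it can be easier to read fractions than raw counts. The helper below, `label_fractions`, is a hypothetical convenience function (not part of `mlworkflows`) that builds on the same `np.unique(..., return_counts=True)` call used above, shown here on a toy array of predictions.

```python
import numpy as np

def label_fractions(predictions):
    """Map each predicted label to the fraction of predictions it accounts for."""
    labels, counts = np.unique(predictions, return_counts=True)
    total = int(counts.sum())
    return {str(label): int(count) / total for label, count in zip(labels, counts)}

# Toy predictions standing in for the model output.
toy_predictions = np.array(
    ["legitimate", "spam", "legitimate", "legitimate"], dtype=object
)
fractions = label_fractions(toy_predictions)
print(fractions)
```

Calling `label_fractions(predictions)` after each model's `predict` step would give directly comparable numbers for the two models.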

## Exercises
The two models perform very similarly on the 'drifted' data in this notebook. Consider alternative types of data drift and see how the models perform: 
1. What happens when fewer words from Pride and Prejudice are prepended to the spam? 
2. How about using a completely different excerpt of Austen? 
3. How do the models perform when generic text (neither Austen nor food reviews) is added to the spam? 
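
As a possible starting point for the first exercise, the sketch below truncates a text to its first `n` words. The helper `first_words` is hypothetical and not part of the notebook; splitting on whitespace is an assumption about what counts as a word.

```python
def first_words(text, n):
    """Return the first n whitespace-separated words of text, with a trailing space."""
    words = text.split()[:n]
    return " ".join(words) + " " if words else ""

excerpt = "It is a truth universally acknowledged, that a single man"
short_prefix = first_words(excerpt, 3)
print(repr(short_prefix))
```

The result can be passed to the existing helper in place of the full excerpt, for example `df_test_spam.text.apply(add_text, adds=first_words(pride_pred, 20))`, starting from a fresh copy of the spam test set so the prefix is not added twice.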