Suppose the spam generator becomes more intelligent and begins producing prose which looks "more legitimate" than before. 

There are numerous ways the prose could become more like legitimate text. For the purpose of this notebook we will simply force the spam data to *drift* by adding the first few lines of Pride and Prejudice to the start of the spam documents in our testing set. We will then see how the trained model responds.  

In [1]:
import pandas as pd
import os.path

df = pd.read_parquet(os.path.join("data", "training.parquet"))

We split the data into training and testing sets, as in the modelling notebooks. We use the `random_state` parameter to ensure that the data is split in the same way as it was when we fit the model. 
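As a small illustration of why fixing `random_state` matters, the sketch below splits a toy list twice with the same seed and checks that both calls produce the same partition. The toy data and variable names are assumptions made for this example; only `train_test_split` and the seed value 43 come from the notebook.

```python
from sklearn import model_selection

# toy stand-in for the rows of the real DataFrame (assumption for illustration)
toy_rows = list(range(20))

# two independent calls with the same seed
train_a, test_a = model_selection.train_test_split(toy_rows, random_state=43)
train_b, test_b = model_selection.train_test_split(toy_rows, random_state=43)

# the same seed reproduces the same partition
print(train_a == train_b and test_a == test_b)

# a different seed will usually give a different partition
train_c, test_c = model_selection.train_test_split(toy_rows, random_state=0)
print(sorted(test_a), sorted(test_c))
```

If the seed used here differed from the one used when the model was fit, some test documents could have been seen during training, which would make the evaluation unreliable.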

In [2]:
from sklearn import model_selection

df_train, df_test = model_selection.train_test_split(df, random_state=43)
df_test_spam = df_test[df_test.label == 'spam'].copy()  # keep only the spam documents

In [3]:
def add_text(doc, adds):
    """
    takes in a string _doc_ and
    prepends text _adds_ to the start
    """
    
    return adds + doc

In [4]:
pride_pred = '''It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.'''

In [5]:
# prepend the Austen excerpt to the start of each spam document
df_test_spam["text"] = df_test_spam.text.apply(add_text, adds=pride_pred)
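The call above relies on `Series.apply` forwarding extra keyword arguments (here `adds`) to the function it applies. The sketch below shows this on a two-row toy frame and compares it with plain string concatenation, which gives the same result when the column has no missing values. The toy frame and the shortened prefix are assumptions for illustration.

```python
import pandas as pd

def add_text(doc, adds):
    # same behaviour as the notebook's helper: put `adds` in front of `doc`
    return adds + doc

# toy stand-in for df_test_spam (assumption for illustration)
toy = pd.DataFrame({
    "label": ["spam", "spam"],
    "text": ["Buy now!", "Great tea."],
})
prefix = "It is a truth universally acknowledged. "

# keyword arguments given to apply are passed through to add_text
via_apply = toy["text"].apply(add_text, adds=prefix)

# vectorised alternative: a plain string added to a Series of strings
via_concat = prefix + toy["text"]

print(via_apply.tolist())
print(via_apply.equals(via_concat))
```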

In [6]:
pd.set_option('display.max_colwidth', None)  # None removes the width limit, so all the text is visible
df_test_spam.sample(3)


Unnamed: 0,index,label,text
39085,19085,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.Oh why did they change the recipe until the very last tasty bite. Love them!! First tried these at a local quick-mart. As far as blue cheese flaver, its so mild we used them in baking instead of sugar and melting the butter instead of oil really gives it a good protein supplement quite a pain! If I could give this product ANY stars! Every year for the holidays as a gag gift to me. Just last night my local grocery stopped stocking Fifty50, I was determined to not give our dogs these treats for years I thought we'd try something different. It brings out the best way to make a quick buck by sticking their label on it! I keep saying.You can't go wrong ! For authentic Belgian street waffles, you must use this mint jelly as a condiment."
20736,736,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.I was wrong this one doesn't require any coins and holds more jelly beans than I thought a protein based bar would be too much to handle. We highly recommend this product. NO morning tea I ever need.At least, that's how I like it; if I kept the wrapper and then rewraps them in their treat spots. Tasted like chemicals. I have never found a lip balm as awesome as this one!"
35137,15137,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.I bought these for my 3 retrievers. I added a caramel, some hazelnut flavoring, a packet of 8 California Shrimp Tempura Rolls with a nice blend of spices to tea. White tea should have a protein followed by a carbohydrate as the first two ingredients. It was an easy decision to go with anything as well. I am very disappointed in what it promises. You'll pay $1.70. It wasn't the tea, it came in a nicely layout single-layer container so all the wings and legs, I butterfly the meat into small pieces and crumbs. These two things make the tug a jug! Surprise! the best tasting, most versatile chips I have eaten. We also like the fact that each box costs around $5 in stores."


We now pass this "drifted" data through the pipeline we created: we compute feature vectors, and we make spam/legitimate classifications using the model we trained. 

In [7]:
from sklearn.pipeline import Pipeline
import pickle

# load the feature-extraction (TF-IDF) pipeline
filename = 'feature_tfidf_pipeline.sav'
with open(filename, 'rb') as f:
    feat_pipeline = pickle.load(f)

# load the model
filename = 'model_tfidf_forest.sav'
with open(filename, 'rb') as f:
    model = pickle.load(f)



In [8]:
pipeline = Pipeline([
    ('features',feat_pipeline),
    ('model',model)
])

# fit the pipeline on the un-drifted training data, as we did in the previous notebooks

pipeline.fit(df_train["text"], df_train["label"])



In [9]:
# use the fitted pipeline to predict labels for the drifted spam
y_preds = pipeline.predict(df_test_spam["text"])
print(y_preds)

['legitimate' 'legitimate' 'legitimate' ... 'legitimate' 'legitimate'
 'legitimate']


In [10]:
import numpy as np
np.array(np.unique(y_preds, return_counts = True))

array([['legitimate', 'spam'],
       [4993, 3]], dtype=object)

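The model performs far worse on the drifted data: it labels 4,993 of the 4,996 drifted spam documents as legitimate and catches only 3. This is expected, since text of this kind is not what we trained the model on.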
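To put a number on this, the counts printed above can be turned into a recall for the spam class. This is a minimal sketch that hard-codes the two counts from the output; the variable names are chosen here for illustration.

```python
# counts copied from the np.unique output above
n_pred_legitimate = 4993
n_pred_spam = 3

# every document in df_test_spam is truly spam, so recall is the
# fraction of them that the model still labels as spam
n_total = n_pred_legitimate + n_pred_spam
spam_recall = n_pred_spam / n_total

print(f"{n_pred_spam} of {n_total} drifted spam documents detected "
      f"(recall = {spam_recall:.4%})")
```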

## Exercises
The two models perform very similarly on the "drifted" data in this notebook. Consider alternative types of data drift and see how the models perform: 
1. What happens when fewer words from Pride and Prejudice are prepended to the spam? 
2. How about using a completely different excerpt of Austen? 
3. How do the models perform when generic text (neither Austen nor food reviews) is prepended to the spam? 
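
One possible starting point for the first exercise is a helper that keeps only the first few words of an excerpt before it is prepended. The function name `first_n_words` and the short strings below are hypothetical, introduced only for this sketch; in the notebook the excerpt would be `pride_pred` and the truncated text would be passed to `add_text`.

```python
def first_n_words(text, n):
    """Return the first n whitespace-separated words of text, plus a trailing space."""
    words = text.split()
    return " ".join(words[:n]) + " "

# short stand-ins for the real excerpt and a spam document (assumptions)
excerpt = "It is a truth universally acknowledged, that a single man"
spam_doc = "We highly recommend this product."

# try prefixes of different lengths
for n in (2, 5):
    print(first_n_words(excerpt, n) + spam_doc)
```

Sweeping `n` over a range of values and recording the predicted label counts each time would show how much legitimate-looking text is needed before the classifier is fooled.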