## Data Drift


Suppose the spam generator becomes more intelligent and begins producing prose that looks "more legitimate" than before. Words that were once strongly associated with spam may start to appear more often in legitimate emails, or vice versa. Models that run continuously on streamed data need to be monitored so that this kind of drift in the data is noticed.
 0. Data 

For the purpose of this notebook we will simply force the spam data to *drift* by adding the first few lines of Pride and Prejudice to the start of the spam documents in our testing set. We will then see how the trained model responds.  

In [15]:
import pandas as pd
import os.path

# load the data once; the same file was previously read twice under two names
df = pd.read_parquet(os.path.join("data", "training.parquet"))

We split the data into training and testing sets, as in the modelling notebooks. We use the `random_state` parameter to ensure that the data is split in the same way as it was when we fit the model. 
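
To illustrate why fixing the seed reproduces the split, here is a minimal standard-library sketch. The helper `seeded_split` is hypothetical and is not how scikit-learn implements `train_test_split`; it only shows the idea that a seeded shuffle followed by a cut gives the same partition every time. The 25% test fraction mirrors the scikit-learn default.

```python
import random


def seeded_split(items, test_fraction=0.25, seed=43):
    """Shuffle a copy of `items` with a seeded RNG, then cut off the test part.

    Hypothetical helper for illustration only (not scikit-learn's algorithm).
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


docs = [f"doc-{i}" for i in range(20)]

# Two calls with the same seed give the same train/test partition.
train_a, test_a = seeded_split(docs, seed=43)
train_b, test_b = seeded_split(docs, seed=43)

print(test_a == test_b)
print(sorted(train_a + test_a) == sorted(docs))  # nothing lost, nothing duplicated
```

If the seed used here differed from the one used when the model was fit, some documents the model was trained on could end up in the test set.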

In [16]:
from sklearn import model_selection

df_train, df_test = model_selection.train_test_split(df, random_state=43)
df_test_spam = df_test[df_test.label == 'spam'].copy() #filter the spam documents

In [17]:
def add_text(doc, adds):
    """
    takes in a string _doc_ and
    prepends the text _adds_ to it
    """

    return adds + doc

In [18]:
pride_pred = '''It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.'''

In [19]:
# prepend the excerpt to every spam document
df_test_spam["text"] = df_test_spam.text.apply(add_text, adds=pride_pred)

In [20]:
pd.set_option('display.max_colwidth', None) #ensures that all the text is visible
df_test_spam.sample(3)

Unnamed: 0,index,label,text
21224,1224,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.To be sure I don't run out! The flavor is subtle, the texture is terrible. All the oats I had came out of the ground."
39979,19979,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.Cheap.The worst part though was the scent. But what is more wonderful, no more doggie breath. I am assuming it is because I LOVE food. Haven't tried with anything else in the future otherwise it would fall out of it great for settling an upset stomach with dogs but this is by far the worst I've ever had. The first time we have had for at least a case, becauseit will be used up in the morning. I just thought you might want to have an alternative to potato chips."
22990,2990,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.There was about a year now. Amazing! Great product and price! But, I found these and they are willing to cut corners on quality. They have saved me again. Still useful to add into my own container I pour off about a third of a teaspoon in hot chocolate that I've ever had."


We now pass this "drifted" data through the pipeline we created: we compute feature vectors, and we make spam/legitimate classifications using the model we trained. 

In [21]:
from sklearn.pipeline import Pipeline
import pickle

# load the feature-extraction pipeline; `with` closes the file afterwards
with open('feature_pipeline.sav', 'rb') as f:
    feat_pipeline = pickle.load(f)

# load the model
with open('model.sav', 'rb') as f:
    model = pickle.load(f)



In [22]:
pipeline = Pipeline([
    ('features',feat_pipeline),
    ('model',model)
])

## we need to fit the model, using the un-drifted data, as we did in the previous notebooks. 

pipeline.fit(df_train["text"], df_train["label"])

Pipeline(steps=[('features',
                 Pipeline(steps=[('vect',
                                  HashingVectorizer(alternate_sign=False,
                                                    n_features=1024, norm=None,
                                                    token_pattern='(?u)\\b[A-Za-z]\\w+\\b')),
                                 ('tfidf', TfidfTransformer())])),
                ('model', LogisticRegression(max_iter=4000))])
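
The printed pipeline shows that the first feature step is a `HashingVectorizer` with `n_features=1024`. As a rough illustration of why prepending the same excerpt to every spam document shifts its features, here is a standard-library sketch of the hashing trick. It is an assumption-laden stand-in: it uses MD5 from `hashlib` rather than the hash function scikit-learn uses, so the bucket indices will not match the real vectorizer, and `hashed_counts` is a hypothetical helper. The token pattern is the one shown in the pipeline output.

```python
import hashlib
import re

# Token pattern copied from the pipeline output: words of 2+ characters
# that start with a letter.
TOKEN_PATTERN = re.compile(r"(?u)\b[A-Za-z]\w+\b")


def hashed_counts(text, n_features=1024):
    """Count tokens into a fixed number of hash buckets (illustrative only)."""
    vec = [0] * n_features
    for token in TOKEN_PATTERN.findall(text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "little") % n_features
        vec[bucket] += 1
    return vec


prefix = "It is a truth universally acknowledged."
doc = "Great product and price!"

prefix_vec = hashed_counts(prefix)
doc_vec = hashed_counts(doc)
drifted_vec = hashed_counts(prefix + doc)

# The drifted vector is the original vector plus a fixed offset from the prefix.
print(drifted_vec == [p + d for p, d in zip(prefix_vec, doc_vec)])
```

Because the same offset is added to every spam document, the excerpt's words can come to dominate the features that the classifier sees.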

In [23]:
## we can then go on and make predictions for the drifted spam, using the fitted pipeline above. 
# predict test instances
y_preds = pipeline.predict(df_test_spam["text"])
print(y_preds)

['legitimate' 'legitimate' 'legitimate' ... 'legitimate' 'legitimate'
 'legitimate']


It looks as though the drifted data is always classified as legitimate (remember that it is actually 100% spam), but let's look at a confusion matrix to visualize the predictions.

In [25]:
from sklearn.metrics import confusion_matrix
from mlworkflows import plot

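# use a new name so the `df` DataFrame loaded earlier is not overwritten
cm_df, chart = plot.binary_confusion_matrix(df_test_spam["label"], y_preds)
# print the raw counts; a bare call in the middle of a cell displays nothing
print(confusion_matrix(df_test_spam["label"], y_preds))
chart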

Not surprisingly, the model performs very poorly on the drifted data, since it looks quite different from the data the model was trained on.
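
To make the reading of a confusion matrix concrete, here is a small standard-library sketch that counts (true label, predicted label) pairs by hand. The labels in `y_pred` are made up for illustration and are not the notebook's predictions; `confusion_counts` is a hypothetical helper that follows the usual convention of rows for true labels and columns for predicted labels.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, labels=("legitimate", "spam")):
    """Return a matrix with rows = true label, columns = predicted label."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


# Toy data: every document is really spam, as in the drifted test set.
y_true = ["spam"] * 4
y_pred = ["legitimate", "legitimate", "legitimate", "spam"]  # made-up predictions

matrix = confusion_counts(y_true, y_pred)
spam_recall = matrix[1][1] / sum(matrix[1])  # share of true spam that was caught

print(matrix)
print(spam_recall)
```

When every true label is spam, the first row is all zeros and the second row shows how much spam slipped through as "legitimate".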

In [29]:
import numpy as np

# workaround: KSDrift has to convert its input to a tensor, so pass a NumPy array rather than a DataFrame
df_test = np.asarray(df_test)
print(df_test[:,1])

['legitimate' 'spam' 'legitimate' ... 'spam' 'legitimate' 'legitimate']


This method also uses PCA which ...

In [30]:
import sklearn.decomposition

DIMENSIONS = 2

pca = sklearn.decomposition.PCA(DIMENSIONS)

We'll use the [Alibi Detect](https://github.com/SeldonIO/alibi-detect) library to detect drift. While there are many ways to detect drift, in this notebook we use [Kolmogorov-Smirnov](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test), or K-S, tests. A K-S test compares the distribution of a single feature in the original data with its distribution in the (possibly) drifted data, so one test is run per feature. Checking each feature for drift is all well and good, but to decide whether the model needs retraining you need to know whether the data as a whole has shifted in a statistically significant way. The per-feature K-S results are therefore aggregated using a [Bonferroni](https://mathworld.wolfram.com/BonferroniCorrection.html) correction and tested as a whole.
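
The following standard-library sketch shows the idea of per-feature K-S tests combined with a Bonferroni correction. It is not Alibi Detect's implementation: the function names are hypothetical, the p-value uses the asymptotic Kolmogorov series (a rough approximation, especially for small samples), and the data is synthetic.

```python
import bisect
import math


def ks_statistic(ref, new):
    """Largest gap between the two empirical CDFs."""
    ref_sorted, new_sorted = sorted(ref), sorted(new)
    d = 0.0
    for x in ref_sorted + new_sorted:
        cdf_ref = bisect.bisect_right(ref_sorted, x) / len(ref_sorted)
        cdf_new = bisect.bisect_right(new_sorted, x) / len(new_sorted)
        d = max(d, abs(cdf_ref - cdf_new))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Approximate two-sided p-value from the asymptotic Kolmogorov series."""
    if d == 0.0:
        return 1.0
    lam = math.sqrt(n * m / (n + m)) * d
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                 for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * series))


def detect_drift(ref_rows, new_rows, p_val=0.05):
    """Run one K-S test per feature; flag drift if any p-value falls below
    the Bonferroni-corrected threshold p_val / n_features."""
    n_features = len(ref_rows[0])
    threshold = p_val / n_features
    p_values = []
    for j in range(n_features):
        ref_col = [row[j] for row in ref_rows]
        new_col = [row[j] for row in new_rows]
        d = ks_statistic(ref_col, new_col)
        p_values.append(ks_pvalue(d, len(ref_col), len(new_col)))
    return min(p_values) < threshold, p_values


# Synthetic two-feature data: the "shifted" copy moves only the first feature.
reference = [[i / 100, (i * 7 % 100) / 100] for i in range(100)]
shifted = [[a + 1.0, b] for a, b in reference]

same_drift, _ = detect_drift(reference, reference)
shift_drift, shift_p = detect_drift(reference, shifted)
print("Drift on identical data?", same_drift)
print("Drift on shifted data?", shift_drift)
```

Dividing the threshold by the number of features keeps the chance of a false alarm across all tests at or below `p_val`, at the cost of being conservative when there are many features.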

In [35]:
#KSDrift
import alibi_detect
from alibi_detect.cd import KSDrift
from alibi_detect.cd.preprocess import uae  # Untrained AutoEncoder
from sklearn import preprocessing

#initialize label encoder
label_encoder = preprocessing.LabelEncoder() 


p_val = 0.05
drift_detect = KSDrift(
    p_val = p_val, # significance threshold for the K-S tests
    X_ref = df_test, # test against original test set
    preprocess_fn = pca, 
    preprocess_kwargs = {'model': label_encoder.fit(df_test[:,1]), 'batch_size':32},
    alternative = 'two-sided',  # other options: 'less', 'greater'
    correction = 'bonferroni' # could also use 
)

We'll start with a sanity check and test the original data. Since we're feeding in the same data set twice, we should not get any drift.

In [37]:
preds_test = drift_detect.predict(df_test)
labels = ['No!', 'Yes!']
print('Drift? {}'.format(labels[preds_test['data']['is_drift']]))

Drift? No!


This was the desired output! Let's try again, but with the drifted data. 

## Exercises
The two models perform very similarly on the "drifted" data in this notebook. Consider alternative types of data drift and see how the models perform: 
1. What happens when fewer words from Pride and Prejudice are appended to the spam? 
2. How about using a completely different excerpt of Austen? 
3. How do the models perform when generic text (neither Austen nor food reviews) is appended to the spam? 