# Data Drift


Suppose the spam generator becomes more intelligent and begins producing prose which looks "more legitimate" than before. Models with changing data need to be monitored to ensure that the model is still performing as expected. Data drift can be seen in different forms. To illustrate a few: 
 + The structure of data may change. Maybe spam emails start using photo attachments rather than text. Since our model is based on the text within the email, it would likely start performing very poorly.
 + Data can change meaning, even if structure does not. (example)
 + Features may change. Features that were previously infrequent may become more frequent, or vice versa. One (unlikely) drift could be that all modern spam emails start containing the word "coffee" and never the word "tree." This could be an important insight to include in our model. 
 
Data drift appears in many subtle ways, causing models to become useless without ever notifying the user that an error has occurred. 
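
To make the third kind of drift concrete, here is a minimal sketch that compares how often a word appears in a reference batch and in a newer batch. The two tiny corpora and the helper `document_frequency` are made up for illustration; they are not part of this notebook's data or pipeline.

```python
import re

def document_frequency(docs, word):
    """Fraction of documents in `docs` that contain `word` (case-insensitive)."""
    word = word.lower()
    hits = sum(1 for doc in docs if word in re.findall(r"[a-z]+", doc.lower()))
    return hits / len(docs)

# Hypothetical batches, built around the "coffee"/"tree" example above.
reference_spam = [
    "Great snack, found it near the tree",
    "Fast delivery and fair prices",
    "The tree outside was full of bugs",
    "Good company to do business with",
]
recent_spam = [
    "Best coffee on the market",
    "Coffee arrived quickly",
    "I would buy this coffee again",
    "Smooth coffee with a hint of sweetness",
]

for word in ("coffee", "tree"):
    before = document_frequency(reference_spam, word)
    after = document_frequency(recent_spam, word)
    print(f"{word}: {before:.2f} -> {after:.2f}")
```

A large swing in a word's frequency between batches is only a hint that something changed; whether it matters depends on how much the model relies on that feature.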

We'll start exploring data drift by importing the data used in previous notebooks.

In [1]:
import pandas as pd
import os.path

# load the same training data used in the previous notebooks
df = pd.read_parquet(os.path.join("data", "training.parquet"))

We split the data into training and testing sets, as in the modelling notebooks. We use the `random_state` parameter to ensure that the data is split in the same way as it was when we fit the model. 

In [2]:
from sklearn import model_selection

df_train, df_test = model_selection.train_test_split(df, random_state=43)
df_test_spam = df_test[df_test.label == 'spam'].copy() #filter the spam documents

Next, we force the spam documents we just selected to drift by prepending the first few lines of Pride and Prejudice to each spam document in our testing set. 

In [3]:
def add_text(doc, adds):
    """
    takes in a string _doc_ and
    prepends text _adds_ to the start
    """
    
    return adds + doc

In [4]:
pride_pred = '''It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.'''

In [5]:
# prepend the excerpt to every spam document
df_test_spam["text"] = df_test_spam.text.apply(add_text, adds=pride_pred)

In [6]:
pd.set_option('display.max_colwidth', None) #ensures that all the text is visible
df_test_spam.sample(3)

Unnamed: 0,index,label,text
22853,2853,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.My favorite gum --- it's been hard to find in Delaware, especially the Linstead Ackee. My daughter isn't always the best pricing and incredible customer service! I find deep river snacks everywhere but hardly ever find the cracked pepper in it. After the window near the tree started crawling with bugs, I threw this out. Subscribe for the delivery time. Good company to do business with this seller. A 1 ounce serving of nuts really easily, and these totally fit the bill."
36594,16594,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.Tore open a bag when traveling to dog shows. My dog and I would definitely buy it again. You'll be happy with just a hint of sweetness. He made like 5 snow cones in the grocery stores around my place..i think they should lower the canister prices! A good product.I purchased this to eat in minutes.The butterscotch and pistachio are surprisingly satisfying, too, while being sugar-free/fat-free. Nugo bars are the real deal, with no husk to worry about."
22771,2771,spam,"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.“My dear Mr. Bennet,” said his lady to him one day, “have you heard that Netherfield Park is let at last?” Mr. Bennet replied that he had not. “But it is,” returned she; “for Mrs. Long has just been here, and she told me all about it.” Mr. Bennet made no answer. “Do you not want to know who has taken it?” cried his wife impatiently.EVO is the greatest soda drink on the beverage market. As soon as you open your first bag of grain free and made with table-grade meat. This popcorn has great taste and healthy lifestyle."


We now pass this "drifted" data through the pipeline we created: we compute feature vectors, and we make spam/legitimate classifications using the model we trained. 

In [7]:
from sklearn.pipeline import Pipeline
import pickle

## loading in feature vectors pipeline
with open('feature_pipeline.sav', 'rb') as f:
    feat_pipeline = pickle.load(f)

## loading model
with open('model.sav', 'rb') as f:
    model = pickle.load(f)

In [8]:
pipeline = Pipeline([
    ('features',feat_pipeline),
    ('model',model)
])

## we need to fit the model, using the un-drifted data, as we did in the previous notebooks. 

pipeline.fit(df_train["text"], df_train["label"])

Pipeline(steps=[('features',
                 Pipeline(steps=[('vect',
                                  HashingVectorizer(alternate_sign=False,
                                                    n_features=1024, norm=None,
                                                    token_pattern='(?u)\\b[A-Za-z]\\w+\\b')),
                                 ('tfidf', TfidfTransformer())])),
                ('model', LogisticRegression(max_iter=4000))])

In [9]:
## we can then go on and make predictions for the drifted spam, using the fitted pipeline above. 
# predict test instances
y_preds = pipeline.predict(df_test_spam["text"])
print(y_preds)

['legitimate' 'legitimate' 'legitimate' ... 'legitimate' 'legitimate'
 'legitimate']


It looks as though the drifted data is mostly classified as legitimate (even though the entire test set was spam), but let's look at a confusion matrix to visualize the predictions.

In [10]:
from sklearn.metrics import confusion_matrix
from mlworkflows import plot

# keep the returned table under its own name so it does not overwrite the data frame `df`
cm_df, chart = plot.binary_confusion_matrix(df_test_spam["label"], y_preds)
confusion_matrix(df_test_spam["label"], y_preds)
chart



Not surprisingly, the model is quite terrible at classifying the drifted data, since these spam emails look very different from the spam emails we originally trained the model on. 
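
One way to put a number on this is recall on the spam class: the share of true spam that the model still flags. The sketch below is plain Python with made-up labels, and `spam_recall` is a hypothetical helper rather than something defined in these notebooks. With the notebook's own data you would pass `df_test_spam["label"]` and `y_preds` instead.

```python
def spam_recall(y_true, y_pred, positive="spam"):
    """Share of truly positive documents that were also predicted positive."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t == positive]
    if not pairs:
        raise ValueError("y_true contains no positive examples")
    return sum(1 for _, p in pairs if p == positive) / len(pairs)

# Made-up labels for illustration: every document is spam, as in the drifted test set.
y_true = ["spam"] * 5
y_before = ["spam", "spam", "spam", "legitimate", "spam"]  # hypothetical pre-drift predictions
y_after = ["legitimate"] * 4 + ["spam"]                     # hypothetical post-drift predictions

print(spam_recall(y_true, y_before))  # 0.8
print(spam_recall(y_true, y_after))   # 0.2
```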

To visualize the drift, we can also use [principal component analysis](https://en.wikipedia.org/wiki/Principal_component_analysis), or PCA, to project the data into two dimensions.

In [11]:
import sklearn.decomposition

DIMENSIONS = 2
pca = sklearn.decomposition.PCA(DIMENSIONS)

In [12]:
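# PCA needs numeric input, so project the fitted pipeline's feature vectors, not the raw data frames
# (assumes the feature matrices are small enough to convert to dense arrays)
features_test = feat_pipeline.transform(df_test["text"]).toarray()
features_spam = feat_pipeline.transform(df_test_spam["text"]).toarray()
pca_a = pca.fit_transform(features_test)
pca_b = pca.transform(features_spam)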

From this information, we've been able to show that some change in the underlying data caused our model to no longer be useful. Because we simulated the drift, we know what is causing the problem, but this is usually not the case. Further exploration may be needed: is the drift gradual or abrupt? Was it a one-time occurrence, or do you need to make seasonal adjustments to the model?

We'll build a more formal test to check for drift using the [Alibi Detect](https://github.com/SeldonIO/alibi-detect) library. 


In [13]:
import numpy as np

# workaround: KSDrift needs to convert its input to a tensor, so pass NumPy arrays instead of data frames
df_test = np.asarray(df_test)
df_test_spam = np.asarray(df_test_spam)

While there are many ways to detect drift, we will use [Kolmogorov-Smirnov](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test), or K-S, tests in this notebook. These tests compare, feature by feature, the probability distributions of the original and (possibly) drifted data. Checking each feature's drift is all well and good, but to decide whether to retrain the model you need to know whether the data as a whole has shifted in a statistically significant way. Using a [Bonferroni](https://mathworld.wolfram.com/BonferroniCorrection.html) correction, the K-S test results are aggregated and tested as a whole. 
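
As a rough illustration of what a single per-feature comparison measures, the sketch below computes the two-sample K-S statistic, the largest gap between two empirical cumulative distribution functions, in plain Python. The helper `ks_statistic` and the sample values are hypothetical. A real test also converts this statistic into a p-value, which is omitted here.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:  # the largest gap is always attained at an observed value
        cdf_a = bisect_right(a, x) / len(a)  # share of sample_a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of sample_b that is <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

# Made-up values of one feature in three batches.
reference = [0.1, 0.2, 0.3, 0.4, 0.5]
same = [0.1, 0.2, 0.3, 0.4, 0.5]
shifted = [0.6, 0.7, 0.8, 0.9, 1.0]

print(ks_statistic(reference, same))     # 0.0
print(ks_statistic(reference, shifted))  # 1.0
```

A statistic of 0 means the two empirical distributions coincide, and 1 means they do not overlap at all.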

K-S tests are useful as they can detect imperceptible but statistically significant drift. However, this method will not answer questions about the frequency or severity of drift. 
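
The Bonferroni aggregation mentioned above can be sketched in a few lines: with one p-value per feature, drift is flagged only if some p-value falls below the chosen threshold divided by the number of features. The function `bonferroni_drift` and the p-values below are hypothetical and only illustrate the rule. They are not output from this notebook or from any particular library.

```python
def bonferroni_drift(p_values, p_val=0.05):
    """Flag drift if any per-feature p-value falls below p_val / number_of_features."""
    threshold = p_val / len(p_values)
    return any(p < threshold for p in p_values), threshold

# Hypothetical per-feature p-values from four univariate tests.
is_drift, threshold = bonferroni_drift([0.30, 0.04, 0.20, 0.60])
print(is_drift, threshold)  # False 0.0125

is_drift, threshold = bonferroni_drift([0.30, 0.001, 0.20, 0.60])
print(is_drift, threshold)  # True 0.0125
```

In the first call, 0.04 would pass an uncorrected 0.05 threshold, but it does not pass the corrected one. This is how the correction limits false alarms when many features are tested at once.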

In [14]:
#KSDrift
import alibi_detect
from alibi_detect.cd import KSDrift
from alibi_detect.cd.preprocess import uae  # Untrained AutoEncoder
from sklearn import preprocessing

#initialize label encoder
label_encoder = preprocessing.LabelEncoder() 


p_val = 0.05
drift_detect = KSDrift(
    p_val = p_val, # p-value threshold for the K-S test
    X_ref = df_test, # test against original test set
    preprocess_fn = pca, 
    preprocess_kwargs = {'model': label_encoder.fit(df_test[:,1]), 'batch_size':32},
    alternative = 'two-sided',  # other options: 'less', 'greater'
    correction = 'bonferroni' # could also use 'fdr' (false discovery rate)
)



We'll start with a sanity check and test the original data. Since we're feeding in the same data set twice, we should not get any drift.

In [15]:
preds_test = drift_detect.predict(df_test)
labels = ['No!', 'Yes!']
print('Has the data drifted? {}'.format(labels[preds_test['data']['is_drift']]))

Has the data drifted? No!


This was the desired output! Let's try again, but with the drifted data. 

In [16]:
p_val = 0.05
drift_detect_spam = KSDrift(
    p_val = p_val, # p-value threshold for the K-S test
    X_ref = df_test, # test against original test set
    preprocess_fn = pca, 
    preprocess_kwargs = {'model': label_encoder.fit(df_test[:,1]), 'batch_size':32},
    alternative = 'two-sided',  # other options: 'less', 'greater'
    correction = 'bonferroni' # could also use 'fdr' (false discovery rate)
)

In [17]:
preds_test = drift_detect_spam.predict(df_test_spam)
labels = ['No!', 'Yes!']
print('Has the data drifted? {}'.format(labels[preds_test['data']['is_drift']]))

Has the data drifted? Yes!


Great! Our drift detector is able to pick up on the shift in data.

Now we can both visualize and show that our data has drifted. This is important information, but what does it mean for a model in production? There's no one-size-fits-all answer to this question. If your model is still performing well on the drifted data, you may choose to continue monitoring the data without taking any action. If your model suddenly cannot recognize a single spam email, it may be time to update your model. You may choose to retrain your model to include the new data, update parameters, or build a new model that suits the drift better.
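
As one possible way to turn these options into a rule, the sketch below maps a drift flag and a recent accuracy estimate to a follow-up action. The function `next_action`, its action names, and the `min_accuracy` threshold of 0.9 are assumptions chosen for illustration. A real policy would depend on your application and on how costly mistakes are.

```python
def next_action(is_drift, accuracy, min_accuracy=0.9):
    """Map a drift flag and a recent accuracy estimate to a follow-up action."""
    if accuracy < min_accuracy:
        # the model is no longer useful, whether or not drift was detected
        return "retrain or rebuild"
    if is_drift:
        # the data has shifted but the model still performs acceptably
        return "keep monitoring"
    return "no action"

print(next_action(is_drift=True, accuracy=0.95))   # keep monitoring
print(next_action(is_drift=True, accuracy=0.10))   # retrain or rebuild
print(next_action(is_drift=False, accuracy=0.97))  # no action
```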

We look at services in [another notebook](07-services.ipynb) to visualize streamed data.

## Exercises
The two models perform very similarly on the "drifted" data in this notebook. Consider alternative types of data drift and see how the models perform: 
1. What happens when fewer words from Pride and Prejudice are appended to the spam? 
2. How about using a completely different excerpt of Austen? 
3. How do the models perform when generic text (neither Austen nor food reviews) is appended to the spam? 