In [None]:
import warnings
warnings.filterwarnings('ignore')
%load_ext autoreload
%autoreload 2
from absl import logging as absl_logging
absl_logging.set_verbosity(-10000)

In [None]:
!zenml init
!zenml stack set local_stack

## Integrating Evidently
Evidently is an open-source tool for computing drift on your data. A blog post of ours explains the Evidently integration in more detail.

At its core, Evidently’s drift detection functions take a reference dataset and compare it with a separate comparison dataset. Both are passed in as Pandas dataframes, though CSV inputs are also possible. ZenML implements this functionality as several standardized steps, along with an easy way to use the visualization tools that Evidently provides as ‘Dashboards’.
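To make the reference-versus-comparison idea concrete, here is a minimal sketch of one common way to score drift on a single numeric feature: the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical distributions. This is only an illustration and is not taken from Evidently's code; the function name `ks_statistic` and the synthetic data are assumptions made for the example, and plain Python lists are used instead of dataframes so that it runs without extra dependencies.

```python
import random
from bisect import bisect_right


def ks_statistic(reference, comparison):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs.

    Illustrative helper only (hypothetical name), not part of Evidently or ZenML.
    """
    ref_sorted = sorted(reference)
    cmp_sorted = sorted(comparison)
    max_gap = 0.0
    # The gap between two step functions can only change at observed values,
    # so it is enough to evaluate both CDFs at every sample point.
    for value in ref_sorted + cmp_sorted:
        ref_cdf = bisect_right(ref_sorted, value) / len(ref_sorted)
        cmp_cdf = bisect_right(cmp_sorted, value) / len(cmp_sorted)
        max_gap = max(max_gap, abs(ref_cdf - cmp_cdf))
    return max_gap


# Synthetic feature values: one comparison set drawn from the same
# distribution as the reference, and one with a shifted mean.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(1000)]
same_distribution = [rng.gauss(0.0, 1.0) for _ in range(1000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(1000)]

stat_same = ks_statistic(reference, same_distribution)
stat_shifted = ks_statistic(reference, shifted)
print(f"no drift: {stat_same:.3f}, shifted: {stat_shifted:.3f}")
```

A statistic near 0 means the two samples look alike, while a value closer to 1 means they barely overlap. In practice you would let the Evidently step compute its own drift report per column rather than hand-rolling a statistic like this.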

If you’re working on any kind of machine learning problem that has an ongoing training loop that takes in new data, you’ll want to guard against drift. Machine learning pipelines are built on top of data inputs, so a model trained on one distribution of data should be checked against the data it later receives. You have less control over incoming data, and since things often change out in the real world, you should have a plan for knowing when they have shifted. Evidently offers a growing set of features that help you monitor not only data drift but also other key aspects such as target drift.

In [None]:
# First we need to install the Evidently integration into our Python environment
!zenml integration install evidently -f

In [None]:
# ZenML provides some standard steps for the Evidently integration
from zenml.integrations.evidently.steps import (
    EvidentlyProfileConfig,
    EvidentlyProfileStep,
)

# We create a config object for our Evidently step.
# Here we choose the "datadrift" profile section.
evidently_drift_detector_config = EvidentlyProfileConfig(
    column_mapping=None,
    profile_sections=["datadrift"],
)

## Splitting the data for drift detection

We are using the MNIST dataset from Keras for training in this example. Since this is not a time series, one way of splitting the data is to select an arbitrary row and then take all samples before it as the reference dataset and the rows after it as the new data.

We have the option to add noise to the dataset to control the occurrence of drift.


In [None]:
from steps.splitter import reference_data_splitter, TrainingSplitConfig

drift_data_split_config = TrainingSplitConfig(
    row=30000,
    add_noise=True,
)
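
The code of `reference_data_splitter` lives in `steps/splitter.py` and is not shown in this notebook. As a rough sketch of what a row-based split with optional noise could look like, the standalone function below mirrors the `row` and `add_noise` settings from the config. Everything else is an assumption made for illustration: the function name `split_at_row`, the `noise_std` and `seed` parameters, the choice of Gaussian noise, and the decision to add noise only to the new data.

```python
import random


def split_at_row(rows, row, add_noise=False, noise_std=0.1, seed=0):
    """Hypothetical splitter, not the actual ZenML step.

    Rows before index `row` become the reference set; the rest is the new data.
    """
    reference = [list(r) for r in rows[:row]]
    new_data = [list(r) for r in rows[row:]]
    if add_noise:
        # Assumption: perturb only the new data so that it drifts away
        # from the reference set.
        rng = random.Random(seed)
        new_data = [
            [value + rng.gauss(0.0, noise_std) for value in r] for r in new_data
        ]
    return reference, new_data


# Tiny stand-in dataset: 10 rows with 2 numeric features each.
rows = [[float(i), float(i) * 2] for i in range(10)]
reference, new_data = split_at_row(rows, row=6, add_noise=True)
print(len(reference), len(new_data))
```

Raising `noise_std` in a sketch like this makes the new data diverge further from the reference set, which is a simple way to provoke a drift alert on demand.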