# ZenML Data Drift Detection With Evidently

## Purpose

Data drift is something you often want to guard against in your pipelines.
Machine learning models are trained on a particular distribution of input
data, so it is worth checking whether the data they receive later still
follows that distribution.

This example uses [`evidently`](https://github.com/evidentlyai/evidently), a
useful open-source library to painlessly check for data drift (among other
features). At its core, Evidently's drift detection takes in a reference
dataset and compares it against a comparison dataset. Both are passed in
as `pandas` DataFrames, though CSV inputs are also possible.
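To make the idea of comparing a reference dataset against a comparison dataset concrete, here is a minimal, standard-library-only sketch of a two-sample Kolmogorov-Smirnov statistic for a single numeric feature. It is an illustration of one common drift test, not Evidently's implementation; the function name `ks_statistic` and the sample values are made up for this example.

```python
from bisect import bisect_right


def ks_statistic(reference, comparison):
    """Return the largest gap between the two empirical CDFs (0.0 to 1.0)."""
    ref = sorted(reference)
    cmp_sorted = sorted(comparison)
    if not ref or not cmp_sorted:
        raise ValueError("both samples must be non-empty")

    largest_gap = 0.0
    # The gap between two step functions can only change at observed values.
    for value in ref + cmp_sorted:
        ref_cdf = bisect_right(ref, value) / len(ref)
        cmp_cdf = bisect_right(cmp_sorted, value) / len(cmp_sorted)
        largest_gap = max(largest_gap, abs(ref_cdf - cmp_cdf))
    return largest_gap


# Hypothetical feature values, invented for illustration only.
reference = [1.0, 1.2, 0.9, 1.1, 1.0, 0.95]
similar = [1.05, 0.9, 1.1, 1.0, 1.15, 0.98]
shifted = [value + 5.0 for value in reference]

print(ks_statistic(reference, similar))
print(ks_statistic(reference, shifted))  # 1.0: the two samples do not overlap
```

A value close to 0 suggests the two samples follow a similar distribution, while a value close to 1 suggests the feature has drifted. A real drift check would also turn the statistic into a p-value and apply a threshold.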

ZenML implements this functionality in the form of a standard `evidently_profile_step` step.
You select which profile sections you want to use in your step by passing
a list of strings as `profile_sections` to `EvidentlyProfileParameters`. Possible options supported by
Evidently are:

- "datadrift"
- "categoricaltargetdrift"
- "numericaltargetdrift"
- "dataquality"
- "classificationmodelperformance"
- "regressionmodelperformance"
- "probabilisticmodelperformance"
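Because the sections are selected by plain strings, a typo only shows up once the step runs. The sketch below checks a list of section names against the options listed above before it is handed to the step configuration. The helper `validate_profile_sections` is hypothetical and not part of ZenML or Evidently.

```python
# Section names copied from the list above.
SUPPORTED_PROFILE_SECTIONS = (
    "datadrift",
    "categoricaltargetdrift",
    "numericaltargetdrift",
    "dataquality",
    "classificationmodelperformance",
    "regressionmodelperformance",
    "probabilisticmodelperformance",
)


def validate_profile_sections(sections):
    """Return the sections as a list, or raise if any name is not supported."""
    unknown = [name for name in sections if name not in SUPPORTED_PROFILE_SECTIONS]
    if unknown:
        raise ValueError(f"Unsupported profile sections: {unknown}")
    return list(sections)


print(validate_profile_sections(["dataquality", "datadrift"]))
```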

If you want to run this notebook in an interactive environment, feel free to run
it in a [Google Colab](https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/evidently_drift_detection/evidently.ipynb)
or view it on [GitHub](https://github.com/zenml-io/zenml/tree/main/examples/evidently_drift_detection) directly.

## Install libraries

In [None]:
# Install the ZenML CLI tool, Evidently and scikit-learn

!pip install zenml 
!zenml integration install evidently sklearn -y

Once the installation is completed, you can go ahead and create a ZenML repository for this project by running:

In [None]:
# Initialize a ZenML repository
!zenml init

Now, the setup is completed. For the next steps, just make sure that you are executing the code within your ZenML repository.

## Set up the Stack

You need an Evidently Data Validator component in your stack to be able to use Evidently data profiling in your ZenML pipelines. Creating such a stack is easily accomplished:

In [None]:
!zenml data-validator register evidently -f evidently
!zenml stack register evidently_stack -o default -a default -dv evidently --set

## Import relevant packages

We will use pipelines and steps to load the data, split it, and check it for drift.

In [None]:
import pandas as pd
from evidently.model_profile import Profile
from rich import print
from sklearn import datasets

from zenml.integrations.constants import EVIDENTLY, SKLEARN
from zenml.pipelines import pipeline
from zenml.steps import Output, step

## Define ZenML Steps

In the code that follows, we define the various steps of our pipeline. Each custom step is decorated with `@step`, the main abstraction that is currently available for creating pipeline steps. The one exception is the built-in Evidently data drift step that ships with ZenML.

The first step is a `data_loader` step that loads the breast cancer Wisconsin dataset and returns it as a pandas DataFrame. We'll use this as the source of both datasets in our data drift detection example.

In [None]:
@step
def data_loader() -> pd.DataFrame:
    """Load the breast cancer dataset."""
    breast_cancer = datasets.load_breast_cancer()
    df = pd.DataFrame(
        data=breast_cancer.data, columns=breast_cancer.feature_names
    )
    df["class"] = breast_cancer.target
    return df

We then add a `data_splitter` step that takes the input dataset and splits it into two subsets. Later on, in the pipeline, we'll compare these datasets against each other using Evidently and generate a data drift profile and associated dashboard.

In [None]:
@step
def data_splitter(
    input_df: pd.DataFrame,
) -> Output(reference_dataset=pd.DataFrame, comparison_dataset=pd.DataFrame):
    """Splits the dataset into two subsets, the reference dataset and the 
    comparison dataset"""
    return input_df[100:], input_df[:100]
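The split above is purely positional: the first 100 rows become the comparison dataset and all remaining rows become the reference dataset, with no shuffling. The following sketch reproduces the same slicing on plain row indices, assuming the 569 rows of scikit-learn's breast cancer dataset, to show the resulting sizes.

```python
# Assumption: scikit-learn's breast cancer dataset has 569 rows.
TOTAL_ROWS = 569
row_indices = list(range(TOTAL_ROWS))

# Same slices as in data_splitter, applied to row positions instead of a DataFrame.
reference_rows = row_indices[100:]
comparison_rows = row_indices[:100]

print(len(reference_rows), len(comparison_rows))  # 469 100
```

Since the two subsets come from the same source and do not overlap, any drift reported later reflects differences between these two slices rather than a change in the data over time.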

Next, we add an Evidently step that takes in the reference dataset and the comparison dataset and generates a data drift profile and an HTML report. This step is already defined as part of the ZenML library, so we only need to add it to our pipeline with a custom configuration. Under the hood, ZenML uses Evidently in the implementation of this step to generate data drift reports, and Materializers to automatically persist them as Artifacts in the Artifact Store.

In [None]:
from zenml.integrations.evidently.steps import (
    EvidentlyColumnMapping,
    EvidentlyProfileParameters,
    evidently_profile_step,
)

drift_detector = evidently_profile_step(
    step_name="drift_detector",
    params=EvidentlyProfileParameters(
        column_mapping=EvidentlyColumnMapping(
            target="class", prediction="class"
        ),
        profile_sections=[
            "dataquality",
            "categoricaltargetdrift",
            "numericaltargetdrift",
            "datadrift",
        ],
        verbose_level=1,
    )
)

This next step serves as an example showing how the Evidently profile returned as output from the previous step can be used in other steps in the pipeline to analyze the data drift report in detail and take different actions depending on the results. 

In [None]:
@step
def analyze_drift(
    profile: Profile,
) -> bool:
    """Analyze the Evidently drift report and return a true/false value
    indicating whether data drift was detected."""
    return profile.object()["data_drift"]["data"]["metrics"]["dataset_drift"]
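The lookup in `analyze_drift` walks a nested dictionary and fails with a bare `KeyError` if the `"datadrift"` section was not enabled. The sketch below performs the same lookup on a plain dictionary with a more descriptive error. The sample dictionary is hypothetical and contains only the keys used above, and the helper `dataset_drift_detected` is not part of ZenML or Evidently.

```python
DRIFT_FLAG_PATH = ("data_drift", "data", "metrics", "dataset_drift")


def dataset_drift_detected(profile_dict):
    """Follow DRIFT_FLAG_PATH through nested dicts and return the flag as a bool."""
    node = profile_dict
    for key in DRIFT_FLAG_PATH:
        if not isinstance(node, dict) or key not in node:
            raise KeyError(
                f"missing key {key!r}; was the 'datadrift' section enabled?"
            )
        node = node[key]
    return bool(node)


# Hypothetical stand-in for the dictionary returned by profile.object().
sample_profile = {"data_drift": {"data": {"metrics": {"dataset_drift": True}}}}

if dataset_drift_detected(sample_profile):
    print("Drift detected: consider retraining or alerting.")
else:
    print("No drift detected.")
```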


## Define ZenML Pipeline

A pipeline is defined with the `@pipeline` decorator. This defines the various steps of the pipeline and specifies the dependencies between the steps, thereby determining the order in which they will be run.

Note how the ZenML Evidently step returns two artifacts: the drift profile report and the drift HTML report. We only use the profile report in the pipeline, while the HTML report will be extracted and rendered separately in the post execution workflow, via the ZenML Evidently visualizer.

In [None]:
from zenml.config import DockerSettings
docker_settings = DockerSettings(required_integrations=[EVIDENTLY, SKLEARN])

@pipeline(settings={"docker": docker_settings})
def drift_detection_pipeline(
    data_loader,
    data_splitter,
    drift_detector,
    drift_analyzer,
):
    """Links all the steps together in a pipeline"""
    data = data_loader()
    reference_dataset, comparison_dataset = data_splitter(data)
    drift_report, _ = drift_detector(reference_dataset=reference_dataset, comparison_dataset=comparison_dataset)
    drift_analyzer(drift_report)
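To illustrate how data dependencies between steps determine the execution order, here is a small sketch using Python's standard `graphlib` module (Python 3.9+). The dependency map is written by hand from the pipeline above. It is an illustration of the idea only, not how ZenML resolves the order internally.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it consumes.
step_dependencies = {
    "data_loader": set(),
    "data_splitter": {"data_loader"},
    "drift_detector": {"data_splitter"},
    "drift_analyzer": {"drift_detector"},
}

# A topological sort places every step after the steps it depends on.
execution_order = list(TopologicalSorter(step_dependencies).static_order())
print(execution_order)
# ['data_loader', 'data_splitter', 'drift_detector', 'drift_analyzer']
```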

## Run the pipeline

Running the pipeline is as simple as calling the `run()` method on an instance of the defined pipeline.

In [None]:
pipeline_instance = drift_detection_pipeline(
    data_loader=data_loader(),
    data_splitter=data_splitter(),
    drift_detector=drift_detector,
    drift_analyzer=analyze_drift(),
)
pipeline_instance.run()

## Post-execution workflow

As mentioned above, Materializers take care of persisting the Evidently profile and HTML reports in the Artifact Store. These artifacts, along with the outputs of the other steps, can be extracted and visualized after the pipeline run is complete.

In [None]:
from zenml.integrations.evidently.visualizers import EvidentlyVisualizer

last_run = pipeline_instance.get_runs()[-1]
drift_analysis_step = last_run.get_step(
    name="drift_analyzer"
)
print(f'Data drift detected: {drift_analysis_step.output.read()}')

Extracting and displaying the Evidently profile generated in the `drift_detector` step is possible, but the ZenML Evidently visualizer, shown further below, is the better alternative.

In [None]:
# Fetch the drift detection step from the last run and read its profile output
drift_detection_step = last_run.get_step(
    name="drift_detector"
)
profile = drift_detection_step.outputs['profile'].read()
print(profile.json())


The ZenML Evidently visualizer takes in a ZenML pipeline step run and renders all the Evidently dashboards that were generated during its execution.

In [None]:
EvidentlyVisualizer().visualize(drift_detection_step)

## Congratulations!

You have successfully used ZenML and Evidently to detect data drift and visualize data drift reports.

For more ZenML features and use cases, check out some of the other ZenML examples. You should also take a look at our [docs](https://docs.zenml.io/) or our [GitHub](https://github.com/zenml-io/zenml) repo, or even better, join us on our [Slack channel](https://zenml.io/slack-invite).

Cheers!