<a href="https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/deepchecks_data_validation/deepchecks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ZenML Data Validation With Deepchecks

In data-centric machine learning development, data quality is critical not only
to achieve good initial results but also to keep data drift and concept drift
at bay as your models are deployed to production and interact with live data.

Data validation tools can be employed early on in your machine learning
pipelines to generate data quality profiles and to run data validation checks
that can be used to continuously validate the data being ingested at various
points in the pipeline. For example, data quality reports and checks can be
run on the training and validation datasets used during model training, or on
the inference data used for batch predictions. This is one good way of detecting
training-serving skew.

## Purpose

This example uses [Deepchecks](https://github.com/deepchecks/deepchecks), a
feature-rich open-source data validation library.
Deepchecks can perform a variety of data validation tasks, from data integrity
checks that work with a single dataset, to combined data and model evaluation,
to data drift analyses. All of this can be done with minimal configuration input
from the user, or customized with specialized conditions that the validation
checks should evaluate.

At its core, the Deepchecks data validation library takes in a target dataset and
an optional model and reference dataset and generates a data validation check
result in the form of a `SuiteResult` object that can be analyzed programmatically
or visualized in a notebook or in the browser as an HTML webpage.
Datasets come in the form of `pandas` dataframes and models can be anything
that implements a `predict` method for regression tasks and also a `predict_proba`
method for classification tasks.
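
The model requirement described above amounts to duck typing. As a minimal sketch, the following shows what such an interface check could look like. The `TinyClassifier` class and the `supports_task` helper are hypothetical names invented for this illustration; they are not part of Deepchecks or ZenML.

```python
from typing import Any, List, Sequence


class TinyClassifier:
    """Hypothetical model that always predicts class 0 with full confidence."""

    def predict(self, rows: Sequence[Sequence[float]]) -> List[int]:
        return [0 for _ in rows]

    def predict_proba(self, rows: Sequence[Sequence[float]]) -> List[List[float]]:
        return [[1.0, 0.0] for _ in rows]


def supports_task(model: Any, task: str) -> bool:
    """Check for `predict`, plus `predict_proba` for classification tasks."""
    required = ["predict"]
    if task == "classification":
        required.append("predict_proba")
    # A method counts only if the attribute exists and is callable
    return all(callable(getattr(model, name, None)) for name in required)


print(supports_task(TinyClassifier(), "classification"))  # True
print(supports_task(object(), "regression"))  # False
```

Any scikit-learn classifier, such as the `RandomForestClassifier` trained later in this notebook, exposes both methods.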

If you want to run this notebook in an interactive environment, feel free to run
it in a [Google Colab](https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/deepchecks_drift_detection/deepchecks.ipynb)
or view it on [GitHub](https://github.com/zenml-io/zenml/tree/main/examples/deepchecks_drift_detection) directly.

## Install libraries

In [None]:
# Install the ZenML CLI tool, Deepchecks and scikit-learn

!pip install zenml 
!zenml integration install deepchecks sklearn -f

Once the installation is completed, you can go ahead and create a ZenML repository for this project by running:

In [None]:
# Initialize a ZenML repository
!zenml init

Now, the setup is completed. For the next steps, just make sure that you are executing the code within your ZenML repository.

## Set up the Stack

You need to have a Deepchecks Data Validator component in your stack to be able to use Deepchecks data validation in your ZenML pipelines. Creating such a stack is easily accomplished:

In [None]:
!zenml data-validator register deepchecks -f deepchecks
!zenml stack register deepchecks_stack -o default -a default -dv deepchecks --set

## Import relevant packages

We will use pipelines and steps to train our model.

In [None]:
import pandas as pd
from rich import print
from sklearn.model_selection import train_test_split
from sklearn.base import ClassifierMixin
from sklearn.ensemble import RandomForestClassifier

from zenml.integrations.constants import DEEPCHECKS, SKLEARN
from zenml.integrations.deepchecks.visualizers import DeepchecksVisualizer
from zenml.logger import get_logger
from zenml.pipelines import pipeline
from zenml.steps import Output, step

## Define ZenML Steps

The first step is a `data_loader` step that loads the iris dataset bundled with Deepchecks and splits it into two pandas DataFrames. We'll use the first as the reference (training) dataset and the second as the comparison (validation) dataset for our data drift detection example.

In [None]:
from deepchecks.tabular.datasets.classification import iris

LABEL_COL = "target"

@step
def data_loader() -> Output(
    reference_dataset=pd.DataFrame, comparison_dataset=pd.DataFrame
):
    """Load the iris dataset."""
    iris_df = iris.load_data(data_format="Dataframe", as_train_test=False)
    df_train, df_test = train_test_split(
        iris_df, stratify=iris_df[LABEL_COL], random_state=0
    )
    return df_train, df_test

We also add a model training step:

In [None]:
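@step
def trainer(df_train: pd.DataFrame) -> ClassifierMixin:
    """Train a random forest classifier on the training dataset."""
    rf_clf = RandomForestClassifier(random_state=0)
    # Fit on the feature columns only; the label column is the target
    rf_clf.fit(df_train.drop(LABEL_COL, axis=1), df_train[LABEL_COL])
    return rf_clf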

Next, we add our Deepchecks validation steps. First, a data integrity check that we'll run against the training dataset.

In [None]:
from zenml.integrations.deepchecks.steps import (
    DeepchecksDataIntegrityCheckStepParameters,
    deepchecks_data_integrity_check_step,
)

data_validator = deepchecks_data_integrity_check_step(
    step_name="data_validator",
    params=DeepchecksDataIntegrityCheckStepParameters(
        dataset_kwargs=dict(label=LABEL_COL, cat_features=[]),
    ),
)
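
To give a feel for what a data integrity check looks for, here is a minimal pure-Python sketch that counts duplicate rows and rows with missing values. This is an illustration only and not how Deepchecks implements its checks; the `integrity_report` function and the sample rows are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Row = Tuple[Optional[float], ...]


def integrity_report(rows: List[Row]) -> Dict[str, int]:
    """Count two common integrity problems in a list of rows."""
    # Every occurrence of a row beyond the first counts as a duplicate
    duplicates = sum(count - 1 for count in Counter(rows).values())
    # A row is incomplete if any of its values is None
    rows_with_missing = sum(1 for row in rows if any(v is None for v in row))
    return {"duplicate_rows": duplicates, "rows_with_missing": rows_with_missing}


rows = [(5.1, 3.5), (4.9, None), (5.1, 3.5), (6.2, 2.9)]
print(integrity_report(rows))
# {'duplicate_rows': 1, 'rows_with_missing': 1}
```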


Add a Deepchecks data drift check step that we'll use to compare the validation dataset against the training dataset.

In [None]:
from zenml.integrations.deepchecks.steps import (
    DeepchecksDataDriftCheckStepParameters,
    deepchecks_data_drift_check_step,
)

data_drift_detector = deepchecks_data_drift_check_step(
    step_name="data_drift_detector",
    params=DeepchecksDataDriftCheckStepParameters(
        dataset_kwargs=dict(label=LABEL_COL, cat_features=[]),
    ),
)
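
Conceptually, a data drift check compares the distribution of each feature in the target dataset against the reference dataset. The sketch below computes the Population Stability Index (PSI) for a single numeric feature in pure Python. PSI is used here only as an illustrative drift measure, and it is an assumption of this sketch rather than a statement about which statistics Deepchecks computes. The `psi` function and the sample values are hypothetical.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], target: Sequence[float], bins: int = 5) -> float:
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant features

    def bucket_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small epsilon keeps the logarithm defined for empty buckets
        return [(c + 1e-6) / (len(values) + 1e-6 * bins) for c in counts]

    ref, tgt = bucket_shares(reference), bucket_shares(target)
    return sum((t - r) * math.log(t / r) for r, t in zip(ref, tgt))


reference = [5.1, 4.9, 6.3, 5.8, 7.0, 6.4, 5.5, 6.1]
shifted = [v + 2.0 for v in reference]

print(psi(reference, reference))  # 0.0
print(psi(reference, shifted) > 1.0)  # True
```

Identical samples score zero, while the shifted sample lands entirely in the last bucket and scores far higher.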

Add a Deepchecks model evaluation check step to run it against our model.

In [None]:
from zenml.integrations.deepchecks.steps import (
    DeepchecksModelValidationCheckStepParameters,
    deepchecks_model_validation_check_step,
)

model_validator = deepchecks_model_validation_check_step(
    step_name="model_validator",
    params=DeepchecksModelValidationCheckStepParameters(
        dataset_kwargs=dict(label=LABEL_COL, cat_features=[]),
    ),
)

Finally, add a Deepchecks model drift check step to compare the performance of the model on two datasets: our original training dataset and the validation dataset.

In [None]:
from zenml.integrations.deepchecks.steps import (
    DeepchecksModelDriftCheckStepParameters,
    deepchecks_model_drift_check_step,
)

model_drift_detector = deepchecks_model_drift_check_step(
    step_name="model_drift_detector",
    params=DeepchecksModelDriftCheckStepParameters(
        dataset_kwargs=dict(label=LABEL_COL, cat_features=[]),
    ),
)


## Define ZenML Pipeline

A pipeline is defined with the `@pipeline` decorator. This defines the various steps of the pipeline and specifies the dependencies between the steps, thereby determining the order in which they will be run.

In [None]:
from zenml.config import DockerSettings
docker_settings = DockerSettings(required_integrations=[DEEPCHECKS, SKLEARN])

@pipeline(enable_cache=False, settings={"docker": docker_settings})
def data_validation_pipeline(
    data_loader,
    trainer,
    data_validator,
    model_validator,
    data_drift_detector,
    model_drift_detector,
):
    """Links all the steps together in a pipeline"""
    df_train, df_test = data_loader()
    data_validator(dataset=df_train)
    data_drift_detector(
        reference_dataset=df_train,
        target_dataset=df_test,
    )
    model = trainer(df_train)
    model_validator(dataset=df_train, model=model)
    model_drift_detector(
        reference_dataset=df_train, target_dataset=df_test, model=model
    )
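
The way data dependencies determine execution order can be sketched with a topological sort. The example below uses Python's standard `graphlib` module (Python 3.9+) on a hand-written dependency map that mirrors the calls in `data_validation_pipeline`. It is an illustration of the idea only and makes no claim about how ZenML resolves step order internally.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it consumes,
# mirroring the calls inside `data_validation_pipeline`.
dependencies = {
    "data_loader": set(),
    "data_validator": {"data_loader"},
    "data_drift_detector": {"data_loader"},
    "trainer": {"data_loader"},
    "model_validator": {"data_loader", "trainer"},
    "model_drift_detector": {"data_loader", "trainer"},
}

# static_order() yields every step after all of its dependencies
order = list(TopologicalSorter(dependencies).static_order())

print(order[0])  # data_loader
print(order.index("trainer") < order.index("model_validator"))  # True
```

Steps that do not depend on each other, such as `data_validator` and `trainer`, may appear in either relative order.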

## Run the pipeline

Running the pipeline is as simple as calling the `run()` method on an instance of the defined pipeline.

In [None]:
pipeline_instance = data_validation_pipeline(
    data_loader=data_loader(),
    trainer=trainer(),
    data_validator=data_validator,
    model_validator=model_validator,
    data_drift_detector=data_drift_detector,
    model_drift_detector=model_drift_detector,
)
pipeline_instance.run()

## Post-execution workflow

We can visualize all the validation check results from the pipeline.

In [None]:
# Fetch the most recent run of the pipeline (it was already executed above)
last_run = pipeline_instance.get_runs()[-1]

# Look up each validation step by the name it was registered under
data_val_step = last_run.get_step(step="data_validator")
model_val_step = last_run.get_step(step="model_validator")
data_drift_step = last_run.get_step(step="data_drift_detector")
model_drift_step = last_run.get_step(step="model_drift_detector")

In [None]:
DeepchecksVisualizer().visualize(data_val_step)

In [None]:
DeepchecksVisualizer().visualize(model_val_step)

In [None]:
DeepchecksVisualizer().visualize(data_drift_step)

In [None]:
DeepchecksVisualizer().visualize(model_drift_step)

## Congratulations!

You have successfully used ZenML and Deepchecks to validate data and generate a validation report.

For more ZenML features and use cases, check out some of the other ZenML examples. You should also take a look at our [docs](https://docs.zenml.io/) or our [GitHub](https://github.com/zenml-io/zenml) repo, or even better, join us on our [Slack channel](https://zenml.io/slack-invite).

Cheers!