<a href="https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/deepchecks_data_validation/deepchecks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ZenML Data Validation With Deepchecks

## Purpose

Data quality issues are something you often want to guard against in your pipelines.
Machine learning pipelines are built on top of data inputs, so it is worth
checking the data to ensure it looks the way you expect it to look.

This example uses [`deepchecks`](https://github.com/deepchecks/deepchecks), an
open-source library for data validation. At its core, `deepchecks` takes a
reference dataset and compares it against a second, comparison dataset.
Both are passed in as `pandas` DataFrames. The results come back as a
`SuiteResult` object, which can be visualized in a notebook or in the browser as an HTML page.
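
To make the reference-versus-comparison idea concrete, here is a minimal sketch in plain Python. It is not how Deepchecks works internally: the `compare_datasets` function, the `tolerance` parameter and the sample values are all assumptions made up for illustration. It only flags columns that are missing from the comparison data or whose mean has moved noticeably.

```python
from statistics import mean


def compare_datasets(reference, comparison, tolerance=0.5):
    """Flag columns that are missing or whose mean shifted by more than `tolerance`.

    Hypothetical helper for illustration only; not part of Deepchecks.
    """
    issues = []
    for column, ref_values in reference.items():
        if column not in comparison:
            issues.append(f"{column}: missing from comparison dataset")
            continue
        shift = abs(mean(comparison[column]) - mean(ref_values))
        if shift > tolerance:
            issues.append(f"{column}: mean shifted by {shift:.2f}")
    return issues


# Made-up sample values, keyed by column name.
reference = {
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "petal_width": [0.2, 0.2, 1.8, 1.2],
}
comparison = {"sepal_length": [7.9, 7.7, 7.4, 7.6]}

issues = compare_datasets(reference, comparison)
for issue in issues:
    print(issue)
```

A real validation suite runs many more checks than this, but the shape is the same: two datasets go in, a structured list of findings comes out.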



If you want to run this notebook in an interactive environment, feel free to run
it in a [Google Colab](https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/deepchecks_drift_detection/deepchecks.ipynb)
or view it on [GitHub](https://github.com/zenml-io/zenml/tree/main/examples/deepchecks_drift_detection) directly.

## Install libraries

In [None]:
# Install ZenML along with the Deepchecks and scikit-learn integrations

!pip install zenml 
!zenml integration install deepchecks -f
!zenml integration install sklearn -f

Once the installation is completed, you can go ahead and create a ZenML repository for this project by running:

In [None]:
# Initialize a ZenML repository
!zenml init

Now, the setup is completed. For the next steps, just make sure that you are executing the code within your ZenML repository.

## Import relevant packages

We will use ZenML pipelines and steps to load and validate our data.

In [None]:
import pandas as pd
from deepchecks.core import SuiteResult
from deepchecks.tabular import Dataset
from deepchecks.tabular.datasets.classification import iris
from deepchecks.tabular.suites import full_suite
from rich import print
from sklearn.model_selection import train_test_split

from zenml.integrations.constants import DEEPCHECKS, SKLEARN
from zenml.integrations.deepchecks.visualizers import DeepchecksVisualizer
from zenml.logger import get_logger
from zenml.pipelines import pipeline
from zenml.repository import Repository
from zenml.steps import Output, step

## Define ZenML Steps

The first step is a `data_loader` step that loads the iris dataset bundled with Deepchecks and splits it into two `pandas` DataFrames. We'll use the first as the reference dataset and the second as the comparison dataset for our data validation example.

In [None]:
@step
def data_loader() -> Output(
    reference_dataset=pd.DataFrame, comparison_dataset=pd.DataFrame
):
    """Load the iris dataset."""
    iris_df = iris.load_data(data_format="Dataframe", as_train_test=False)
    label_col = "target"
    df_train, df_test = train_test_split(
        iris_df, stratify=iris_df[label_col], random_state=0
    )
    return df_train, df_test
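
The `stratify` argument in the step above keeps the class proportions the same in both halves of the split. The sketch below shows the idea in plain Python under simplifying assumptions: `stratified_split` is a hypothetical helper, the labels are made up, and scikit-learn's actual implementation differs. The 0.25 test fraction mirrors the default that `train_test_split` uses when no size is given.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so each class keeps the same share in both.

    Hypothetical helper for illustration; not scikit-learn's implementation.
    """
    # Group row indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut off the test share.
        rng.shuffle(indices)
        cut = round(len(indices) * test_fraction)
        test_idx.extend(indices[:cut])
        train_idx.extend(indices[cut:])
    return train_idx, test_idx


# Made-up labels: three balanced classes of eight rows each.
labels = ["setosa"] * 8 + ["versicolor"] * 8 + ["virginica"] * 8
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Because both halves see every class in the same proportion, differences that a validation suite later reports are less likely to be an artifact of an unlucky split.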

Next, we add a data validator step that wraps both DataFrames in Deepchecks `Dataset` objects and runs the full suite on them.

In [None]:
@step
def data_validator(
    reference_dataset: pd.DataFrame, comparison_dataset: pd.DataFrame
) -> SuiteResult:
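    """Validate data using deepchecks."""
    # Wrap each DataFrame in a Deepchecks Dataset, naming the label column
    # ("target", the same column used for stratification in data_loader).
    ds_train = Dataset(reference_dataset, label="target")
    ds_test = Dataset(comparison_dataset, label="target")
    # Run the full suite, comparing the reference data against the comparison data.
    suite = full_suite()
    return suite.run(train_dataset=ds_train, test_dataset=ds_test)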

This next step serves as an example showing how the Deepchecks `SuiteResult` returned as output from the previous step can be used in other steps in the pipeline to analyze the validation report in detail and take different actions depending on the results.

In [None]:
@step
def post_validation(result: SuiteResult) -> None:
    """Consumes the SuiteResult."""
    print(result)
    result.save_as_html()
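
As a rough sketch of what "taking different actions depending on the results" could look like, the snippet below decides whether a pipeline should proceed or halt. Everything in it is an assumption for illustration: `decide_action`, `max_failures` and the check names are hypothetical, and how you extract pass/fail outcomes from a `SuiteResult` depends on the Deepchecks version you have installed.

```python
def decide_action(check_outcomes, max_failures=0):
    """Return ("proceed" | "halt", failed_check_names).

    `check_outcomes` maps a check name to True (passed) or False (failed).
    Hypothetical helper for illustration; not part of ZenML or Deepchecks.
    """
    failed = sorted(name for name, passed in check_outcomes.items() if not passed)
    action = "proceed" if len(failed) <= max_failures else "halt"
    return action, failed


# Placeholder outcomes; in a real step these would be derived from the SuiteResult.
outcomes = {"check_a": True, "check_b": False, "check_c": True}

action, failed = decide_action(outcomes)
print(action, failed)
```

A step built around logic like this could raise an exception on `"halt"` to stop the run, or return the decision so that later steps can branch on it.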

## Define ZenML Pipeline

A pipeline is defined with the `@pipeline` decorator. This defines the various steps of the pipeline and specifies the dependencies between the steps, thereby determining the order in which they will be run.

In [None]:
@pipeline(required_integrations=[DEEPCHECKS, SKLEARN])
def data_validation_pipeline(
    data_loader,
    data_validator,
    post_validation,
):
    """Links all the steps together in a pipeline"""
    reference_dataset, comparison_dataset = data_loader()
    validation_result = data_validator(
        reference_dataset=reference_dataset,
        comparison_dataset=comparison_dataset,
    )
    post_validation(validation_result)
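
To see why the data flow alone is enough to fix the execution order, the sketch below models the three steps as a dependency graph and sorts it topologically with the standard library (`graphlib` requires Python 3.9 or newer). This is only an illustration of the concept; it makes no claim about how ZenML's orchestrators resolve the order internally.

```python
from graphlib import TopologicalSorter

# Map each step to the set of steps whose outputs it consumes.
dependencies = {
    "data_loader": set(),
    "data_validator": {"data_loader"},
    "post_validation": {"data_validator"},
}

# A topological sort yields an order in which every step runs after its inputs exist.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Since each step here depends on exactly one predecessor, only one valid order exists: load, validate, then post-process.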

## Run the pipeline

Running the pipeline is as simple as calling the `run()` method on an instance of the defined pipeline.

In [None]:
# Use a distinct name so the imported `pipeline` decorator is not shadowed.
pipeline_instance = data_validation_pipeline(
    data_loader=data_loader(),
    data_validator=data_validator(),
    post_validation=post_validation(),
)
pipeline_instance.run()

## Post-execution workflow

Once the run has finished, we can fetch the `data_validator` step from the latest pipeline run and visualize its results.

In [None]:
repo = Repository()
pipeline = repo.get_pipeline(pipeline_name="data_validation_pipeline")
last_run = pipeline.runs[-1]
data_val_step = last_run.get_step(name="data_validator")

In [None]:
DeepchecksVisualizer().visualize(data_val_step)

## Congratulations!

You have successfully used ZenML and Deepchecks to validate data and generate a validation report.

For more ZenML features and use cases, check out some of the other ZenML examples. You should also take a look at our [docs](https://docs.zenml.io/) or our [GitHub](https://github.com/zenml-io/zenml) repo, or even better, join us on our [Slack channel](https://zenml.io/slack-invite).

Cheers!