<a href="https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/deepchecks_data_validation/deepchecks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ZenML Data Validation With Deepchecks

## Purpose

Bad data is something you often want to guard against in your pipelines.
Machine learning pipelines are built on top of data inputs, so it is worth
checking the data to ensure it looks the way you expect.

This example uses [`deepchecks`](https://github.com/deepchecks/deepchecks), an
open-source library for data validation. At its core, the `deepchecks`
data validation library takes a reference dataset and compares it against a second, comparison dataset.
Both are passed in as `pandas` DataFrames. The results come back as a
`SuiteResult` object, which can be visualized in a notebook or in the browser as an HTML page.
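
To make the idea of comparing a reference dataset against a comparison dataset concrete, here is a minimal plain-Python sketch of two such comparisons. It is not how `deepchecks` implements its checks; the function names `new_category_values` and `relative_mean_shift` and the toy columns are hypothetical and exist only for illustration.

```python
from statistics import mean


def new_category_values(reference, comparison):
    """Return values that appear in comparison but never in reference."""
    return sorted(set(comparison) - set(reference))


def relative_mean_shift(reference, comparison):
    """Return the change in mean, relative to the reference mean.

    Assumes the reference mean is non-zero.
    """
    ref_mean = mean(reference)
    return (mean(comparison) - ref_mean) / abs(ref_mean)


# Hypothetical toy columns, not taken from the dataset used in this example.
reference_species = ["setosa", "versicolor", "setosa"]
comparison_species = ["setosa", "virginica"]
reference_lengths = [5.0, 5.0, 6.0, 4.0]
comparison_lengths = [6.0, 6.0, 7.0, 5.0]

unseen = new_category_values(reference_species, comparison_species)
shift = relative_mean_shift(reference_lengths, comparison_lengths)
print(unseen, shift)
```

A real validation suite bundles many comparisons of this kind and reports each one with its own pass or fail condition.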



If you want to run this notebook in an interactive environment, feel free to run
it in a [Google Colab](https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/deepchecks_data_validation/deepchecks.ipynb)
or view it on [GitHub](https://github.com/zenml-io/zenml/tree/main/examples/deepchecks_data_validation) directly.

## Install libraries

In [1]:
# Install the ZenML CLI tool, Deepchecks and scikit-learn

!pip install zenml 
!zenml integration install deepchecks -f
!zenml integration install sklearn -f


Once the installation is completed, you can go ahead and create a ZenML repository for this project by running:

In [2]:
# Initialize a ZenML repository
!zenml init


Now, the setup is completed. For the next steps, just make sure that you are executing the code within your ZenML repository.

## Import relevant packages

We will use pipelines and steps to load and validate our data.

In [3]:
import pandas as pd
from deepchecks.core import SuiteResult
from deepchecks.tabular import Dataset
from deepchecks.tabular.datasets.classification import iris
from deepchecks.tabular.suites import full_suite
from rich import print
from sklearn.model_selection import train_test_split

from zenml.integrations.constants import DEEPCHECKS, SKLEARN
from zenml.integrations.deepchecks.visualizers import DeepchecksVisualizer
from zenml.logger import get_logger
from zenml.pipelines import pipeline
from zenml.repository import Repository
from zenml.steps import Output, step

## Define ZenML Steps

The first step is a `data_loader` step that loads the iris dataset bundled with `deepchecks` as a pandas DataFrame and splits it into two parts. We'll use the first part as the reference dataset and the second as the comparison dataset for our data validation example.

In [4]:
@step
def data_loader() -> Output(
    reference_dataset=pd.DataFrame, comparison_dataset=pd.DataFrame
):
    """Load the iris dataset."""
    iris_df = iris.load_data(data_format="Dataframe", as_train_test=False)
    label_col = "target"
    df_train, df_test = train_test_split(
        iris_df, stratify=iris_df[label_col], random_state=0
    )
    return df_train, df_test
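
The `stratify` argument in the step above asks for a split in which each class keeps roughly the same share in both datasets. The following plain-Python sketch shows that idea in isolation. It is an illustration only and does not reproduce scikit-learn's implementation; the helper `stratified_split` and the toy rows are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, test_fraction=0.25, seed=0):
    """Split rows so each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)

    # Group the rows by their label.
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # Take the same fraction of every label group for the test part.
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical toy data: 8 rows of class "a" and 4 rows of class "b".
rows = [("a", i) for i in range(8)] + [("b", i) for i in range(4)]
train, test = stratified_split(rows, label_of=lambda row: row[0])
print(len(train), len(test))
```

Because the fraction is applied per class, the 2:1 ratio between the two classes is preserved in both the train and test parts.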

Next, we add a data validator step that runs the Deepchecks full suite on the two datasets.

In [5]:
@step
def data_validator(
    reference_dataset: pd.DataFrame, comparison_dataset: pd.DataFrame
) -> SuiteResult:
    """Validate data using deepchecks"""
    # Wrap each DataFrame in a deepchecks Dataset object.
    ds_train = Dataset(reference_dataset)
    ds_test = Dataset(comparison_dataset)
    # Run the full suite of built-in checks on both datasets.
    suite = full_suite()
    return suite.run(train_dataset=ds_train, test_dataset=ds_test)

This next step serves as an example showing how the Deepchecks `SuiteResult` returned as output from the previous step can be used in other steps in the pipeline to analyze the validation report in detail and take different actions depending on the results.

In [6]:
@step
def post_validation(result: SuiteResult) -> None:
    """Consumes the SuiteResult."""
    print(result)
    result.save_as_html()
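
One way to "take different actions depending on the results" is to reduce the individual check outcomes to a single decision. The sketch below shows that pattern with plain Python objects. `CheckOutcome`, `decide_action`, and the check names are hypothetical stand-ins, not part of the Deepchecks or ZenML APIs; in a real step you would first extract the equivalent information from the `SuiteResult`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CheckOutcome:
    """Hypothetical, simplified stand-in for the outcome of one check."""

    name: str
    passed: bool


def decide_action(outcomes: List[CheckOutcome], max_failures: int = 0) -> str:
    """Return "proceed" unless more than max_failures checks failed."""
    failed = [outcome.name for outcome in outcomes if not outcome.passed]
    if len(failed) > max_failures:
        return "block: " + ", ".join(failed)
    return "proceed"


# Hypothetical outcomes with made-up check names.
outcomes = [
    CheckOutcome("label_drift", True),
    CheckOutcome("new_category", False),
]
action = decide_action(outcomes)
print(action)
```

A downstream step could use such a decision to stop the pipeline, send an alert, or continue to model training.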

## Define ZenML Pipeline

A pipeline is defined with the `@pipeline` decorator. This defines the various steps of the pipeline and specifies the dependencies between the steps, thereby determining the order in which they will be run.

In [7]:
@pipeline(required_integrations=[DEEPCHECKS, SKLEARN])
def data_validation_pipeline(
    data_loader,
    data_validator,
    post_validation,
):
    """Links all the steps together in a pipeline"""
    reference_dataset, comparison_dataset = data_loader()
    validation_result = data_validator(
        reference_dataset=reference_dataset,
        comparison_dataset=comparison_dataset,
    )
    post_validation(validation_result)
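
The way step dependencies determine execution order can be sketched without any pipeline framework. The example below is an illustration only and is not ZenML's scheduler: it records, for each step of this pipeline, which steps it consumes outputs from, and derives a valid run order. The helper name `execution_order` is hypothetical.

```python
def execution_order(dependencies):
    """Order steps so that every step runs after the steps it depends on."""
    ordered, done = [], set()
    pending = dict(dependencies)
    while pending:
        # A step is ready once all of its upstream steps have finished.
        ready = sorted(step for step, deps in pending.items() if deps <= done)
        if not ready:
            raise ValueError("circular dependency between steps")
        for step in ready:
            ordered.append(step)
            done.add(step)
            del pending[step]
    return ordered


# Each step maps to the set of steps whose outputs it consumes.
dependencies = {
    "post_validation": {"data_validator"},
    "data_validator": {"data_loader"},
    "data_loader": set(),
}
order = execution_order(dependencies)
print(order)
```

The order in which the dictionary lists the steps does not matter; only the declared inputs and outputs do, which mirrors how the pipeline function above wires the steps together.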


## Run the pipeline

Running the pipeline is as simple as calling the `run()` method on an instance of the defined pipeline.

In [None]:
pipeline = data_validation_pipeline(
    data_loader=data_loader(),
    data_validator=data_validator(),
    post_validation=post_validation(),
)
pipeline.run()

## Post execution workflow

After the run completes, we can fetch the `data_validator` step from the latest pipeline run and visualize its results.

In [None]:
repo = Repository()
pipeline = repo.get_pipeline(pipeline_name="data_validation_pipeline")
last_run = pipeline.runs[-1]
data_val_step = last_run.get_step(name="data_validator")

In [None]:
DeepchecksVisualizer().visualize(data_val_step)

## Congratulations!

You have successfully used ZenML and Deepchecks to validate data and generate a validation report.

For more ZenML features and use cases, check out some of the other ZenML examples. You should also take a look at our [docs](https://docs.zenml.io/) or our [GitHub](https://github.com/zenml-io/zenml) repo, or even better, join us on our [Slack channel](https://zenml.io/slack-invite).

Cheers!