# ZenML Data Validation With Great Expectations

## Purpose

In data-centric machine learning development, data quality is critical not only
to achieve good initial results but also to keep data drift and concept drift
at bay as your models are deployed to production and interact with live data.

Data validation tools can be employed early on in your machine learning
pipelines to generate statistical data profiles and infer validation rules
that can be used to continuously validate the data being ingested at various
points in the pipeline. For example, data validation rules can be inferred from
the training dataset and then used to validate the datasets used to perform
batch predictions. This is one good way of detecting training-serving skew.
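
To make the idea concrete, here is a minimal, library-free sketch of inferring rules from reference data and applying them to new data. It is not how Great Expectations works internally. The helper names (`infer_rules`, `validate`), the column names and the sample values are all hypothetical, and the only "rule" inferred is a min/max range per column.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]
Rules = Dict[str, Tuple[float, float]]


def infer_rules(reference: List[Row]) -> Rules:
    """Infer a (min, max) range per column from the reference rows."""
    columns = reference[0].keys()
    return {
        col: (min(r[col] for r in reference), max(r[col] for r in reference))
        for col in columns
    }


def validate(rows: List[Row], rules: Rules) -> List[str]:
    """Return one message per value that falls outside its inferred range."""
    failures = []
    for i, row in enumerate(rows):
        for col, (low, high) in rules.items():
            if not low <= row[col] <= high:
                failures.append(
                    f"row {i}: {col}={row[col]} outside [{low}, {high}]"
                )
    return failures


# Hypothetical "training" rows used to infer the rules
training = [{"radius": 10.0, "area": 300.0}, {"radius": 20.0, "area": 1200.0}]
# Hypothetical "serving" rows; the second one has an out-of-range radius
serving = [{"radius": 15.0, "area": 800.0}, {"radius": 35.0, "area": 900.0}]

rules = infer_rules(training)
failures = validate(serving, rules)
print(failures)
```

A non-empty `failures` list is the kind of signal that points to training-serving skew. Real validation tools infer much richer rules (types, null ratios, distributions) than this range check.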

This example uses the very popular [`Great Expectations`](https://greatexpectations.io/)
open-source library to run data quality tasks on [the University of Wisconsin breast cancer diagnosis
dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic))
to illustrate how it works.

If you want to run this notebook in an interactive environment, you can open
it in [Google Colab](https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/great_expectations_data_validation/great_expectations.ipynb)
or view it on [GitHub](https://github.com/zenml-io/zenml/tree/main/examples/great_expectations_data_validation) directly.

## Install libraries

In [None]:
# Install ZenML and the Great Expectations, scikit-learn, Dash and S3 integrations
import IPython

!pip install zenml 
!zenml integration install -y great_expectations sklearn dash s3

# automatically restart kernel
IPython.Application.instance().kernel.do_shutdown(restart=True)

Once the installation is complete, you can create a ZenML repository for this project by running:

In [None]:
# Initialize a ZenML repository
!zenml init

## Set up the Stack

In this section we configure a ZenML Stack featuring Great Expectations as a Data Validator and a cloud Artifact Store that uses a managed object storage service (AWS S3) as a backend.

### ZenML and Great Expectations: Store Great Expectations artifacts with the ZenML cloud artifact store

This is a ZenML stack that includes an Artifact Store connected to a cloud
object storage service. This example uses AWS as a backend, but [the ZenML documentation](https://docs.zenml.io/component-gallery/artifact-stores/artifact-stores)
has similar instructions on how to configure an Artifact Store backed by GCP or
Azure Blob Storage.

For this stack, you will need an S3 bucket where your ML artifacts can later be
stored. You can configure one by following [this AWS tutorial](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html).

The path for your bucket should be in this format: `s3://your-bucket`.
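
If you want to sanity-check the path format before registering the Artifact Store, a small standard-library check along these lines may help. The helper `is_valid_bucket_path` is hypothetical and not part of ZenML. It only checks the shape of the string and does not verify that the bucket exists or that you can access it.

```python
from urllib.parse import urlparse


def is_valid_bucket_path(path: str) -> bool:
    """Check that a path has the `s3://<bucket-name>` shape."""
    parsed = urlparse(path)
    # The scheme must be "s3" and a bucket name must follow it
    return parsed.scheme == "s3" and bool(parsed.netloc)


print(is_valid_bucket_path("s3://your-bucket"))
print(is_valid_bucket_path("your-bucket"))
```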

![Great Expectations Stack on S3](great_expectations_stack.png "Great Expectations Stack on S3")

In [None]:
# Replace s3://zenfiles with the path to your own S3 bucket
!zenml artifact-store register s3_store --flavor=s3 --path=s3://zenfiles
!zenml data-validator register ge_s3 --flavor=great_expectations
!zenml stack register s3_stack -o default -a s3_store -dv ge_s3
!zenml stack set s3_stack

The setup is now complete. For the next steps, just make sure that you are executing the code within your ZenML repository.

## Import relevant packages

We will use pipelines and steps to validate our data.

In [None]:
import pandas as pd
from sklearn import datasets

from great_expectations.checkpoint.types.checkpoint_result import (  # type: ignore[import]
    CheckpointResult,
)

from zenml.integrations.constants import GREAT_EXPECTATIONS, SKLEARN
from zenml.integrations.great_expectations.steps import (
    GreatExpectationsProfilerParameters,
    GreatExpectationsProfilerStep,
    GreatExpectationsValidatorParameters,
    GreatExpectationsValidatorStep,
)
from zenml.integrations.great_expectations.visualizers import (
    GreatExpectationsVisualizer,
)
from zenml.pipelines import pipeline
from zenml.steps import BaseParameters, Output, step

## Define ZenML Steps

In the code that follows, we are defining the various steps of our pipeline. Each step is decorated with `@step`, the main abstraction that is currently available for creating pipeline steps, with the exception of the Great Expectations data profiling and data validation built-in steps that are shipped with ZenML.

The first step is an `importer` step that downloads the breast cancer Wisconsin dataset and returns it as a pandas DataFrame. It is used to simulate loading data from two different sources:

* reference data used to train a model
* "live" data that is used in a pipeline to run batch predictions with a model, e.g. in production

If `reference_data` is set in the step configuration, a slice of the data is returned as a reference dataset. Otherwise, a different slice is returned representing the "live" data.

In [None]:
class DataLoaderParameters(BaseParameters):
    reference_data: bool = True
    
@step
def importer(
        params: DataLoaderParameters,
) -> Output(dataset=pd.DataFrame, condition=bool):
    """Load the breast cancer dataset.
    
    This step is used to simulate loading data from two different sources.
    If `reference_data` is set in the step configuration, a slice of the
    data is returned as a reference dataset. Otherwise, a different slice
    is returned as a test dataset to be validated.
    """
    breast_cancer = datasets.load_breast_cancer()
    df = pd.DataFrame(
        data=breast_cancer.data, columns=breast_cancer.feature_names
    )
    df["class"] = breast_cancer.target
    if params.reference_data:
        dataset = df[100:] 
    else:
        dataset = df[:100]
    return dataset, params.reference_data
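
The effect of the two slices can be illustrated without loading the data. The sketch below stands in for the DataFrame with a plain list of row indices. The scikit-learn breast cancer dataset has 569 samples, and the variable names are illustrative only.

```python
TOTAL_ROWS = 569  # number of samples in scikit-learn's breast cancer dataset

rows = list(range(TOTAL_ROWS))  # stand-in for the DataFrame's row positions

reference = rows[100:]  # same slice as `df[100:]`, used as reference data
live = rows[:100]  # same slice as `df[:100]`, used as "live" data

print(len(reference), len(live))
# The two slices share no rows, so validation runs on data the profiler never saw
print(set(reference).isdisjoint(live))
```

Because the slices are disjoint, the validation pipeline checks rows that played no part in generating the Expectation Suite, which is what makes validation failures possible in this example.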

Next, we add the Great Expectations steps that we'll use to perform data
profiling and data validation. These steps are already defined as part of the
ZenML library, so we only need to add them to our pipeline with a custom
configuration.

Under the hood, ZenML uses Great Expectations in the implementation of these
steps to generate an Expectation Suite from an input dataset and to validate
an input dataset using an existing Expectation Suite.

In [None]:
# instantiate a builtin Great Expectations data profiling step
ge_profiler_params = GreatExpectationsProfilerParameters(
    expectation_suite_name="breast_cancer_suite",
    data_asset_name="breast_cancer_ref_df",
)
ge_profiler_step = GreatExpectationsProfilerStep(params=ge_profiler_params)


# instantiate a builtin Great Expectations data validation step
ge_validator_params = GreatExpectationsValidatorParameters(
    expectation_suite_name="breast_cancer_suite",
    data_asset_name="breast_cancer_test_df",
)
ge_validator_step = GreatExpectationsValidatorStep(params=ge_validator_params)

This next step shows how the Great Expectations validation result returned by the validator step can be used in other pipeline steps to analyze the results in detail and take different actions depending on the outcome.

In [None]:
from zenml.steps import (
    STEP_ENVIRONMENT_NAME,
    StepEnvironment,
)
from zenml.environment import Environment
from typing import cast

@step
def analyze_result(
    result: CheckpointResult,
) -> str:
    """Analyze the Great Expectations validation result and return a message
    indicating whether it passed or failed."""
    step_env = cast(StepEnvironment, Environment()[STEP_ENVIRONMENT_NAME])
    pipeline_name = step_env.pipeline_name
    pipeline_run_id = step_env.pipeline_run_id
    step_name = step_env.step_name
    pipeline_context = f"Pipeline {pipeline_name}, with run {pipeline_run_id}, in step {step_name} produced the following output:\n\n"
    if result.success:
        message = pipeline_context + "Great Expectations data validation was successful!"
    else:
        message = pipeline_context + "Great Expectations data validation failed!"
    print(message)
    return message
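
The step above only builds a message, but the same `success` flag could drive a decision. The following is a minimal sketch of that idea. It uses `SimpleNamespace` as a stand-in for a validation result so that it runs without Great Expectations installed, and both the `decide_action` helper and the action names are hypothetical.

```python
from types import SimpleNamespace


def decide_action(result) -> str:
    """Pick a follow-up action from a result object exposing `success`."""
    if result.success:
        # Data looks like the reference data: safe to continue
        return "run_batch_predictions"
    # Data violates the expectations: stop and notify someone
    return "alert_and_skip_predictions"


# Stand-ins for a validation result; only the `success` attribute is modeled
passing = SimpleNamespace(success=True)
failing = SimpleNamespace(success=False)

print(decide_action(passing))
print(decide_action(failing))
```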

## Define ZenML Pipelines

A pipeline is defined with the `@pipeline` decorator. This defines the various steps of the pipeline and specifies the dependencies between the steps, thereby determining the order in which they will be run.

We'll define two ZenML pipelines:

* a data profiling pipeline. The pipeline imports a reference dataset from a source then uses the builtin Great Expectations profiler step to generate an expectation suite (i.e. validation rules) inferred from the schema and statistical properties of the reference dataset. In more complete use-cases, this would be the model training pipeline and the profiled dataset would be the training dataset.

* a data validation pipeline. The pipeline imports "live" data from a source, then uses the builtin Great Expectations data validation step to validate the dataset against the expectation suite generated in the profiling pipeline. In more complete use-cases, this would be the batch inference pipeline and the validated dataset would be the "live" inference dataset.

In [None]:
from zenml.config import DockerSettings
docker_settings = DockerSettings(required_integrations=[SKLEARN, GREAT_EXPECTATIONS])

@pipeline(enable_cache=False, settings={"docker": docker_settings})
def profiling_pipeline(
    importer, profiler
):
    """Data profiling pipeline for Great Expectations.

    The pipeline imports a reference dataset from a source then uses the builtin
    Great Expectations profiler step to generate an expectation suite (i.e.
    validation rules) inferred from the schema and statistical properties of the
    reference dataset.

    Args:
        importer: reference data importer step
        profiler: data profiler step
    """
    dataset, _ = importer()
    profiler(dataset)

In [None]:
from zenml.config import DockerSettings
docker_settings = DockerSettings(required_integrations=[SKLEARN, GREAT_EXPECTATIONS])

@pipeline(enable_cache=False, settings={"docker": docker_settings})
def validation_pipeline(
    importer, validator, checker
):
    """Data validation pipeline for Great Expectations.

    The pipeline imports test data from a source, then uses the builtin
    Great Expectations data validation step to validate the dataset against
    the expectation suite generated in the profiling pipeline.

    Args:
        importer: test data importer step
        validator: dataset validation step
        checker: checks the validation results
    """
    dataset, condition = importer()
    results = validator(dataset, condition)
    message = checker(results)
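
To illustrate how data dependencies between steps determine the execution order, the sketch below models the validation pipeline as a dependency graph and sorts it with the standard library. This is only an illustration of the principle and not ZenML's actual scheduling code. It assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Map each step to the steps whose outputs it consumes,
# mirroring the calls inside `validation_pipeline`
validation_dependencies = {
    "importer": set(),  # takes no inputs from other steps
    "validator": {"importer"},  # consumes `dataset` and `condition`
    "checker": {"validator"},  # consumes `results`
}

# A topological sort yields an order in which every step runs
# after the steps it depends on
order = list(TopologicalSorter(validation_dependencies).static_order())
print(order)
```

For this linear chain there is only one valid order: `importer`, then `validator`, then `checker`.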

## Run the pipelines

Running the pipelines is as simple as calling the `run()` method on an instance of the defined pipeline. The pipelines run on the active ZenML stack, which is the `s3_stack` we configured at the beginning of the exercise.

In [None]:
profiling_pipeline(
    importer=importer(params=DataLoaderParameters(reference_data=True)),
    profiler=ge_profiler_step,
).run()

In [None]:
validation_pipeline(
    importer=importer(params=DataLoaderParameters(reference_data=False)),
    validator=ge_validator_step,
    checker=analyze_result(),
).run()

# Post execution workflow

Here we set up some helper functions that we'll use to visualize the pipelines and the artifacts.

In [None]:
from zenml.post_execution import get_pipeline

def start_pipeline_visualizer(name: str):

    from zenml.integrations.dash.visualizers.pipeline_run_lineage_visualizer import (
        PipelineRunLineageVisualizer,
    )

    latest_run = get_pipeline(name).runs[-1]
    PipelineRunLineageVisualizer().visualize(latest_run)

In [None]:
def visualize_results(pipeline_name: str, step_name: str) -> None:
    pipeline = get_pipeline(pipeline_name)
    last_run = pipeline.runs[-1]
    step = last_run.get_step(step=step_name)
    GreatExpectationsVisualizer().visualize(step)

Both ZenML and Great Expectations take care of persisting the Expectation Suites and data validation results in the Artifact Store. These artifacts can be extracted and visualized after the pipeline runs are complete.

In [None]:
start_pipeline_visualizer("profiling_pipeline")

In [None]:
start_pipeline_visualizer("validation_pipeline")

In [None]:
visualize_results("profiling_pipeline", "profiler")

In [None]:
visualize_results("validation_pipeline", "validator")

# Congratulations!

You have successfully used ZenML and Great Expectations to validate data and visualize data validation reports.

For more ZenML features and use-cases, check out some of the other ZenML examples. You should also take a look at our [docs](https://docs.zenml.io/) or our [GitHub](https://github.com/zenml-io/zenml) repo, or even better, join us on our [Slack channel](https://zenml.io/slack-invite).

Cheers!