<a href="https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/whylogs_data_profiling/whylogs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ZenML Data Logging, Profiling and Visualization With Whylogs

Data logging and profiling are an important part of any production ML
pipeline. [whylogs](https://whylabs.ai/whylogs) is an open source library
that analyzes your data and creates statistical summaries called whylogs
profiles. whylogs profiles can be visualized locally or uploaded to the
[WhyLabs](https://whylabs.ai/) platform where more comprehensive analyses can be carried out.

## Purpose

ZenML integrates seamlessly with whylogs and WhyLabs. This example shows
how easy it is to enhance steps in an existing ML pipeline with whylogs
profiling features. Changes to the user code are minimal while ZenML takes
care of all aspects related to whylogs serialization, versioning and persistence
and even uploading generated profiles to WhyLabs.

The ZenML whylogs integration includes the following features showcased in this
example:

* a predefined `WhylogsProfilerStep` ZenML step class that can be
instantiated and inserted into any pipeline to generate a whylogs profile
out of a Pandas DataFrame and return the profile as a step output artifact.
Instantiating this type of step is simplified even further through the
use of the `whylogs_profiler_step` utility function.
* a `WhylogsVisualizer` ZenML visualizer that can be used to display whylogs
profile artifacts produced during the execution of pipelines.

If you want to run this notebook in an interactive environment, feel free to run
it in [Google Colab](https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/whylogs_data_profiling/whylogs.ipynb)
or view it on [GitHub](https://github.com/zenml-io/zenml/tree/main/examples/whylogs_data_profiling) directly.

## Install libraries

In [None]:
# Install the ZenML CLI tool, whylogs and scikit-learn

!pip install zenml 
!zenml integration install -y whylogs sklearn

Once the installation is completed, you can go ahead and create a ZenML repository for this project by running:

In [None]:
# Initialize a ZenML repository
!zenml init

Now, the setup is completed. For the next steps, just make sure that you are executing the code within your ZenML repository.

## Set up the Stack

You need to add a whylogs Data Validator component to your stack to be able to use whylogs data profiling in your ZenML pipelines. You can register the component and a stack that uses it as follows:

In [None]:
!zenml data-validator register whylogs -f whylogs
!zenml stack register whylogs_stack -o default -a default -dv whylogs --set

## Import relevant packages

We will use pipelines and steps to load, split and profile our data.

In [None]:
import os
import pandas as pd
import whylogs as why

from sklearn import datasets

from zenml.integrations.constants import SKLEARN, WHYLOGS
from zenml.pipelines import pipeline
from zenml.steps import step, Output

from whylogs.core import DatasetProfileView


## Define ZenML Steps

In the code that follows, we define the various steps of our pipeline. Each step is decorated with `@step`, the main abstraction currently available for creating pipeline steps. The one exception is the built-in whylogs data profiling step that ships with ZenML.

The first step is a `data_loader` step that downloads the diabetes tabular dataset and returns it as a Pandas DataFrame. The step also generates and returns a whylogs profile of the entire dataset, before the dataset is split in a subsequent step.

In [None]:
os.environ["ZENML_ANALYTICS_OPT_IN"] = "false"

@step
def data_loader() -> Output(
    data=pd.DataFrame,
    profile=DatasetProfileView,
):
    """Load the diabetes dataset."""
    X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)

    # merge X and y together
    df = pd.merge(X, y, left_index=True, right_index=True)

    profile = why.log(pandas=df).profile().view()
    return df, profile
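
To make the idea of a profile more concrete, here is a minimal pure-Python sketch of per-column summary statistics. It is only an illustration: `toy_profile` and the sample rows are hypothetical and not part of whylogs or ZenML, and whylogs itself builds its profiles from approximate sketches and tracks many more metrics than the handful shown here.

```python
import math
from typing import Dict, List, Optional


def toy_profile(
    rows: List[Dict[str, Optional[float]]]
) -> Dict[str, Dict[str, float]]:
    """Summarize each column: count, missing values, min, max and mean."""
    columns = {name for row in rows for name in row}
    summary = {}
    for name in sorted(columns):
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        summary[name] = {
            "count": len(values),
            "missing": len(values) - len(present),
            "min": min(present) if present else math.nan,
            "max": max(present) if present else math.nan,
            "mean": sum(present) / len(present) if present else math.nan,
        }
    return summary


# Hypothetical sample rows, not taken from the diabetes dataset.
rows = [
    {"age": 0.5, "bmi": 1.0},
    {"age": 1.5, "bmi": None},
    {"age": 2.5, "bmi": 3.0},
]
profile_summary = toy_profile(rows)
print(profile_summary["age"])
print(profile_summary["bmi"])
```

The point of the sketch is that a profile stores a compact summary per column rather than the rows themselves, which is why it can be saved as a small artifact next to the data.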


We then add a `data_splitter` step that takes the input dataset and splits it into a training and a test subset. Later on, in the pipeline, we'll use the built-in whylogs profiler step to generate profiles for both of them.

In [None]:
from sklearn.model_selection import train_test_split

@step
def data_splitter(
    input: pd.DataFrame,
) -> Output(train=pd.DataFrame, test=pd.DataFrame,):
    """Splits the input dataset into train and test slices."""
    train, test = train_test_split(input, test_size=0.1, random_state=13)
    return train, test
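
As a quick sanity check on what `test_size=0.1` means for this dataset, the arithmetic can be sketched with the standard library alone. This assumes the 442 rows of the scikit-learn diabetes dataset and that a float `test_size` is rounded up to a whole number of rows. The shuffle below only illustrates the idea of a seeded split and does not reproduce the exact rows scikit-learn would select.

```python
import math
import random

n_samples = 442  # rows in the scikit-learn diabetes dataset
test_size = 0.1  # same fraction as in the data_splitter step

# A fractional test size is rounded up; the remainder goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 397 45

# Illustrative seeded shuffle: the same seed always yields the same split.
rng = random.Random(13)
indices = list(range(n_samples))
rng.shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]
```

Fixing the seed matters here because the two profiles are compared later: a reproducible split means a rerun of the pipeline profiles the same rows.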


We create two instances of the built-in whylogs profiler step to generate profiles for the training and test datasets:

In [None]:
from zenml.integrations.whylogs.steps import WhylogsProfilerParameters, whylogs_profiler_step

train_data_profiler = whylogs_profiler_step(
    step_name="train_data_profiler",
    params=WhylogsProfilerParameters(),
    dataset_id="model-2",
)
test_data_profiler = whylogs_profiler_step(
    step_name="test_data_profiler",
    params=WhylogsProfilerParameters(),
    dataset_id="model-3",
)

## Define ZenML Pipeline

A pipeline is defined with the `@pipeline` decorator. This defines the various steps of the pipeline and specifies the dependencies between the steps, thereby determining the order in which they will be run.

In [None]:
from zenml.config import DockerSettings

docker_settings = DockerSettings(required_integrations=[SKLEARN, WHYLOGS])


@pipeline(settings={"docker": docker_settings})
def data_profiling_pipeline(
    data_loader,
    data_splitter,
    train_data_profiler,
    test_data_profiler,
):
    """Links all the steps together in a pipeline"""
    data, _ = data_loader()
    train, test = data_splitter(data)
    train_data_profiler(train)
    test_data_profiler(test)
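
The way data dependencies determine execution order can be sketched with a topological sort. This uses Python's standard `graphlib` module (Python 3.9+) purely as an illustration; it is not how ZenML schedules steps internally, and the `dependencies` mapping is written by hand to mirror the calls in the pipeline function.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it consumes,
# mirroring the calls inside data_profiling_pipeline.
dependencies = {
    "data_loader": set(),
    "data_splitter": {"data_loader"},
    "train_data_profiler": {"data_splitter"},
    "test_data_profiler": {"data_splitter"},
}

# Any valid order runs the loader first and the splitter before both profilers.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

The two profiler steps depend only on the splitter and not on each other, so their relative order is not constrained by the data flow.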


## Run the pipeline

Running the pipeline is as simple as calling the `run()` method on an instance of the defined pipeline. Note how we use the built-in whylogs profiler steps to generate whylogs profiles of the training and test datasets.

In [None]:
pipeline_instance = data_profiling_pipeline(
    data_loader=data_loader(),
    data_splitter=data_splitter(),
    train_data_profiler=train_data_profiler,
    test_data_profiler=test_data_profiler,
)
pipeline_instance.run()

## Post-execution workflow

All whylogs profiles generated by the pipeline run have been versioned, serialized and stored in the ZenML Artifact Store, alongside all other artifacts. The built-in whylogs Materializer included in the whylogs integration took care of that. These artifacts can be extracted and visualized after the pipeline run is complete. The ZenML whylogs visualizer takes a ZenML pipeline step run and renders all the plots associated with the dataset profile that was generated during its execution. It can also take two dataset profiles and generate a data drift report visualization.

The following helper function wraps those calls:

In [None]:
from typing import Optional

from zenml.integrations.whylogs.visualizers import WhylogsVisualizer
from zenml.post_execution import get_pipeline

def visualize_statistics(
    step_name: str, reference_step_name: Optional[str] = None
) -> None:
    """Helper function to visualize whylogs statistics from step artifacts.

    Args:
        step_name: step that generated and returned a whylogs profile
        reference_step_name: an optional second step that generated a whylogs
            profile to use for data drift visualization where two whylogs
            profiles are required.
    """
    pipe = get_pipeline(pipeline="data_profiling_pipeline")
    whylogs_step = pipe.runs[-1].get_step(step=step_name)
    whylogs_reference_step = None
    if reference_step_name:
        whylogs_reference_step = pipe.runs[-1].get_step(
            step=reference_step_name
        )

    WhylogsVisualizer().visualize(
        whylogs_step,
        reference_step_view=whylogs_reference_step,
    )


We use the helper function to render two dashboards:

* a visualization of the profile generated for the entire dataset in the loader step
* a data drift visualization rendered from the two profiles we created from the training and test slices

In [None]:
visualize_statistics("data_loader")

In [None]:
visualize_statistics("train_data_profiler", "test_data_profiler")
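
For some intuition about what a drift comparison measures, one common way to quantify the difference between two samples of a numeric column is the two-sample Kolmogorov-Smirnov statistic: the largest gap between their empirical cumulative distribution functions. The sketch below is a plain-Python illustration on raw values. The `ks_statistic` helper and the sample values are hypothetical, and nothing here is meant to describe how the whylogs report is computed, since whylogs works on profiles rather than raw data.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest absolute gap between the empirical CDFs of two samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    largest_gap = 0.0
    # The gap can only change at observed values, so check each of them.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical column values, not taken from the diabetes dataset.
train_values = [0.1, 0.2, 0.3, 0.4, 0.5]
shifted_values = [0.6, 0.7, 0.8, 0.9, 1.0]

print(ks_statistic(train_values, train_values))    # 0.0
print(ks_statistic(train_values, shifted_values))  # 1.0
```

A value near 0 means the two samples look alike, while a value near 1 means they barely overlap. For a random train/test split of the same dataset, little drift is expected.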

## Congratulations!

You have successfully used ZenML and whylogs to generate data profiles and visualize data drift reports.

For more ZenML features and use cases, check out some of the other ZenML examples. You should also take a look at our [docs](https://docs.zenml.io/) or our [GitHub](https://github.com/zenml-io/zenml) repo, or even better, join us on our [Slack channel](https://zenml.io/slack-invite).

Cheers!