# ZenML: Open-source MLOps Framework for reproducible ML pipelines

![Test](../_assets/Logo/zenml.svg)

![Sam](../_assets/sam.png)

In [None]:
from absl import logging as absl_logging
import warnings
warnings.filterwarnings('ignore')
%load_ext autoreload
%autoreload 2
absl_logging.set_verbosity(-10000)

Let's begin by initializing ZenML in our directory. For simplicity, we start with a local stack and transition to other stacks later. Executing the following block does both.

# Initialize ZenML

In [None]:
!zenml init
!zenml stack set local_stack

We will start by looking at the definition of a pipeline that we want to build. This will give an overview of what we want to achieve and how we plan on getting there. We'll dive into the details on some of the interesting steps after that.

# Create a simple training pipeline

Create a training pipeline for the digits dataset

In [None]:
import numpy as np
import pandas as pd
from sklearn.base import ClassifierMixin
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

from zenml.integrations.sklearn.helpers.digits import (
    get_digits,
)
from zenml.pipelines import pipeline
from zenml.steps import Output, step

## Define Steps
In the code that follows, you can see that we are defining the various steps of our pipeline. Each step is decorated with @step, the main abstraction that is currently available for creating pipeline steps.

The first step is an import step that loads the digits dataset (small 8x8 images of handwritten digits, an MNIST-like dataset) and returns four numpy arrays as its output.

In [None]:
@step
def importer() -> Output(
    X_train=np.ndarray, X_test=np.ndarray, y_train=np.ndarray, y_test=np.ndarray
):
    """Loads the digits array as normal numpy arrays."""
    X_train, X_test, y_train, y_test = get_digits()
    return X_train, X_test, y_train, y_test

We then add a Trainer step that takes the imported data and trains a sklearn classifier on it. Note that the model is not explicitly saved within the step. Under the hood, ZenML uses Materializers to automatically persist the Artifacts that result from each step into the Artifact Store.

In [None]:
@step
def svc_trainer(
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> ClassifierMixin:
    """Train a simple sklearn SVC classifier for the digits dataset."""
    model = SVC(gamma=0.001)
    model.fit(X_train, y_train)
    return model
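
To make the idea of automatic persistence more concrete, here is a minimal sketch in plain Python. It is not ZenML's implementation: the `PickleArtifactStore` class, its methods, and the file layout are hypothetical names chosen for illustration, and the stored dictionary merely stands in for a trained model.

```python
import pickle
import tempfile
from pathlib import Path


class PickleArtifactStore:
    """Toy artifact store: one pickle file per (step, output) pair."""

    def __init__(self, root):
        self.root = Path(root)

    def _path(self, step_name, output_name):
        return self.root / step_name / f"{output_name}.pkl"

    def save(self, step_name, output_name, value):
        """Persist a step output and return the path it was written to."""
        path = self._path(step_name, output_name)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("wb") as f:
            pickle.dump(value, f)
        return path

    def load(self, step_name, output_name):
        """Read a previously persisted step output back into memory."""
        with self._path(step_name, output_name).open("rb") as f:
            return pickle.load(f)


# The "trainer" returns an object and the store persists it, so a later
# "evaluator" can load it without the two steps sharing memory.
store = PickleArtifactStore(tempfile.mkdtemp())
saved_path = store.save("trainer", "model", {"kernel": "rbf", "gamma": 0.001})
restored = store.load("trainer", "model")
```

Because every output is written to disk under the name of the step that produced it, a later step (or a later notebook cell) can read it back by name, which is roughly the behaviour the trainer step relies on.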

Finally, we add an Evaluator step that takes as input the test set and the trained model and evaluates some final metrics.

In [None]:
@step
def evaluator(
    X_test: np.ndarray,
    y_test: np.ndarray,
    model: ClassifierMixin,
) -> float:
    """Calculate the accuracy on the test set"""
    test_acc = model.score(X_test, y_test)
    print(f"Test accuracy: {test_acc}")
    return test_acc

## Define and Run Pipeline
A pipeline is defined with the @pipeline decorator. This defines the various steps of the pipeline and specifies the dependencies between the steps, thereby determining the order in which they will be run.


In [None]:
@pipeline
def digits_pipeline(
    importer,
    trainer,
    evaluator,
):
    """Links all the steps together in a pipeline"""
    X_train, X_test, y_train, y_test = importer()
    model = trainer(X_train=X_train, y_train=y_train)
    evaluator(X_test=X_test, y_test=y_test, model=model)

In [None]:
# Initialize the pipeline
first_pipeline = digits_pipeline(
    importer=importer(),
    trainer=svc_trainer(),
    evaluator=evaluator(),
)
first_pipeline.run()
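
The way data dependencies determine execution order can be illustrated with a small sketch using only the standard library (Python 3.9+). This is an illustration of the general principle, not of how ZenML schedules steps internally; the `dependencies` mapping is written by hand to mirror the connections made in `digits_pipeline`.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it consumes,
# mirroring the connections made inside digits_pipeline.
dependencies = {
    "importer": set(),
    "trainer": {"importer"},
    "evaluator": {"importer", "trainer"},
}

# A topological sort yields an order in which every step runs
# only after all the steps it depends on.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)  # ['importer', 'trainer', 'evaluator']
```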

## Add Drift Detection with Evidently

Evidently is an open-source tool that allows you to easily compute drift on your data. [Here](https://blog.zenml.io/zenml-loves-evidently/) is a short blog post of ours that explains the Evidently integration in more detail.

At its core, Evidently’s drift detection functions take a reference dataset and compare it with a separate comparison dataset. Both are passed in as Pandas dataframes, though CSV inputs are also possible. ZenML wraps this functionality in several standardized steps and provides an easy way to use the visualization tools that Evidently ships as ‘Dashboards’.


If you’re working on any kind of machine learning problem that has an ongoing training loop that takes in new data, you’ll want to guard against drift. A model is trained on a particular distribution of data, but you have less control over the incoming data, and things often change out in the real world, so you should have a plan for knowing when the distribution has shifted. Evidently offers a [growing set of features](https://github.com/evidentlyai/evidently) that help you monitor not only data drift but also other key aspects such as target drift.
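
To give a feel for what comparing a reference dataset with a comparison dataset can mean, here is a minimal sketch of one common approach: the two-sample Kolmogorov-Smirnov statistic for a single numeric feature. This is an assumption made for illustration only. Which statistical tests Evidently actually runs depends on its version and configuration, and the function name and sample values below are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(reference, comparison):
    """Return the largest gap between the two empirical CDFs."""
    ref = sorted(reference)
    cmp_sorted = sorted(comparison)
    # The gap can only change at observed values, so checking those suffices.
    points = sorted(set(ref) | set(cmp_sorted))
    return max(
        abs(
            bisect_right(ref, x) / len(ref)
            - bisect_right(cmp_sorted, x) / len(cmp_sorted)
        )
        for x in points
    )


# Hypothetical values for one feature column.
reference_column = [0.0, 1.0, 2.0, 3.0]
shifted_column = [10.0, 11.0, 12.0, 13.0]

same = ks_statistic(reference_column, reference_column)  # identical samples
shifted = ks_statistic(reference_column, shifted_column)  # disjoint samples
```

The statistic is 0 when both samples have the same empirical distribution and 1 when they do not overlap at all. A drift detector would typically compute such a score per feature and compare it against a threshold or a p-value.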

![Evidently](../_assets/zenml+evidently.png "Evidently")

In [None]:
!zenml integration install evidently -f

### Add a Drift Detection Step to our Pipeline

In [None]:
from zenml.integrations.evidently.steps import (
    EvidentlyProfileConfig,
    EvidentlyProfileStep,
)

In [None]:
@step
def get_reference_data(
    X_train: np.ndarray,
    X_test: np.ndarray,
) -> Output(reference=pd.DataFrame, comparison=pd.DataFrame):
    """Splits data for drift detection."""
    # X_train = _add_awgn(X_train)
    columns = [str(x) for x in list(range(X_train.shape[1]))]
    return pd.DataFrame(X_test, columns=columns), pd.DataFrame(X_train, columns=columns)

In [None]:
@pipeline(enable_cache=False)
def digits_pipeline_with_drift(
    importer,
    trainer,
    evaluator,
    
    get_reference_data,
    drift_detector,
):
    """Links all the steps together in a pipeline"""
    X_train, X_test, y_train, y_test = importer()
    model = trainer(X_train=X_train, y_train=y_train)
    evaluator(X_test=X_test, y_test=y_test, model=model)
    
    reference, comparison = get_reference_data(X_train, X_test)
    drift_detector(reference, comparison)

### Run the pipeline with Evidently

In [None]:
evidently_profile_config = EvidentlyProfileConfig(
    column_mapping=None,
    profile_sections=["datadrift"])

second_pipeline = digits_pipeline_with_drift(
    importer=importer(),
    trainer=svc_trainer(),
    evaluator=evaluator(),
    
    # EvidentlyProfileStep takes reference_dataset and comparison dataset
    get_reference_data=get_reference_data(),
    drift_detector=EvidentlyProfileStep(config=evidently_profile_config)
)
second_pipeline.run()

In [None]:
from zenml.integrations.evidently.visualizers import EvidentlyVisualizer
from zenml.repository import Repository

# Fetch the most recent run of the drift pipeline
repo = Repository()
p = repo.get_pipeline('digits_pipeline_with_drift')
last_run = p.runs[-1]

# The visualizer takes the drift detection step of that run
drift_detection_step = last_run.get_step(
    name="drift_detector"
)

EvidentlyVisualizer().visualize(drift_detection_step)

## Add alerts with Discord

![Discord](../_assets/evidently+discord.png "Discord")

In [None]:
import requests
from zenml.steps import step

# This is a private ZenML Discord channel. We will get notified if you use 
# this, but you won't be able to see it. Feel free to create a new Discord 
# [webhook](https://support.discord.com/hc/en-us/articles/228383668-Intro-to-Webhooks) 
# and replace this one!
DISCORD_URL = (
    "https://discord.com/api/webhooks/935835443826659339/Q32jTwmqc"
    "GJAUr-r_J3ouO-zkNQPchJHqTuwJ7dK4wiFzawT2Gu97f6ACt58UKFCxEO9"
)


@step(enable_cache=False)
def discord_alert(
    drift_report: dict
) -> None:
    """Send a message to the Discord channel to report drift.

    Args:
        drift_report: Evidently drift report as a dict; its
            `dataset_drift` entry is True if drift was detected.
    """
    drift = drift_report["data_drift"]["data"]["metrics"]["dataset_drift"]
    url = DISCORD_URL
    data = {
        "content": "Drift Detected!" if drift else "No Drift Detected!",
        "username": "Drift Bot",
    }
    result = requests.post(url, json=data)

    try:
        result.raise_for_status()
    except requests.exceptions.HTTPError as err:
        print(err)
    else:
        print(
            "Posted to discord successfully, code {}.".format(
                result.status_code
            )
        )
    print("Drift detected" if drift else "No Drift detected")
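
One way to make the alert logic testable without posting to a real webhook is to separate building the message from sending it. The sketch below assumes the same report layout used in `discord_alert`; the function name `build_drift_message` and the sample report are hypothetical and not part of ZenML or Discord.

```python
def build_drift_message(drift_report, username="Drift Bot"):
    """Build the webhook payload for a drift report without sending it."""
    # Same key path as in discord_alert above.
    drift = bool(
        drift_report["data_drift"]["data"]["metrics"]["dataset_drift"]
    )
    return {
        "content": "Drift Detected!" if drift else "No Drift Detected!",
        "username": username,
    }


# A hand-written report with only the keys the function reads.
sample_report = {"data_drift": {"data": {"metrics": {"dataset_drift": True}}}}
message = build_drift_message(sample_report)
```

The resulting dictionary could then be passed to `requests.post(url, json=message)`, keeping the network call as the only part that needs a live webhook.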

### Add the pipeline step

In [None]:
@pipeline
def digits_pipeline_with_drift_alert(
    importer,
    trainer,
    evaluator,
    
    get_reference_data,
    drift_detector,
    
    alerter,
):
    """Links all the steps together in a pipeline"""
    X_train, X_test, y_train, y_test = importer()
    model = trainer(X_train=X_train, y_train=y_train)
    evaluator(X_test=X_test, y_test=y_test, model=model)
    
    reference, comparison = get_reference_data(X_train, X_test)
    drift_report, _ = drift_detector(reference, comparison)
    
    alerter(drift_report)

In [None]:
evidently_profile_config = EvidentlyProfileConfig(
    column_mapping=None,
    profile_sections=["datadrift"])

third_pipeline = digits_pipeline_with_drift_alert(
    importer=importer(),
    trainer=svc_trainer(),
    evaluator=evaluator(),
    
    # EvidentlyProfileStep takes reference_dataset and comparison dataset
    get_reference_data=get_reference_data(),
    drift_detector=EvidentlyProfileStep(config=evidently_profile_config),
    
    # Add discord
    alerter=discord_alert()
)
third_pipeline.run()

## Track experiments and parameters with MLflow

For this pipeline we want to take you a step further by showing you some more integrations. We will be using MLflow Tracking for visualizing and comparing multiple pipeline runs.

![MLflow](../_assets/evidently+discord+mlflow.png "MLflow")

In [None]:
!zenml integration install mlflow -f

### Create a trainer with MLflow logging

Now that we have MLflow enabled, we need to choose what we want to log to it. For now, we use the [MLflow autolog](https://www.mlflow.org/docs/latest/tracking.html#scikit-learn) functionality to automatically log the model and training parameters within the training step.


<div class="alert alert-block alert-info">
    <b>Note:</b> The @enable_mlflow decorator above the step is all we need to get started with MLflow. This decorator sets up an MLflow experiment and an MLflow backend for all runs within this pipeline.
</div>

In [None]:
from zenml.integrations.mlflow.mlflow_step_decorator import enable_mlflow
import mlflow


@enable_mlflow
@step(enable_cache=False)
def svc_trainer_mlflow(
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> ClassifierMixin:
    """Train another simple sklearn classifier for the digits dataset."""
    mlflow.sklearn.autolog()
    model = SVC(gamma=0.001)
    model.fit(X_train, y_train)
    return model

In [None]:
fourth_pipeline = digits_pipeline_with_drift_alert(
    importer=importer(),
    trainer=svc_trainer_mlflow(),
    evaluator=evaluator(),
    
    # EvidentlyProfileStep takes reference_dataset and comparison dataset
    get_reference_data=get_reference_data(),
    drift_detector=EvidentlyProfileStep(config=evidently_profile_config),
    
    # Add discord
    alerter=discord_alert()
)
fourth_pipeline.run()

### Let's have a look at MLflow

Training is done. Let's have a look at the MLflow UI and check that our training run, including the model, has made it in there.

In [None]:
# This will start a serving process for mlflow 
#  - if you want to continue in the notebook you need to manually
#  interrupt the kernel 
from zenml.environment import Environment
from zenml.integrations.mlflow.mlflow_environment import MLFLOW_ENVIRONMENT_NAME

!mlflow ui --backend-store-uri {Environment()[MLFLOW_ENVIRONMENT_NAME].tracking_uri} --port 4998

Environment()[MLFLOW_ENVIRONMENT_NAME].tracking_uri

### Create another trainer with a different model

In [None]:
@enable_mlflow
@step(enable_cache=False)
def tree_trainer_with_mlflow(
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> ClassifierMixin:
    """Train another simple sklearn classifier for the digits dataset."""
    mlflow.sklearn.autolog()
    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)
    return model

In [None]:
fifth_pipeline = digits_pipeline_with_drift_alert(
    importer=importer(),
    trainer=tree_trainer_with_mlflow(),
    evaluator=evaluator(),
    
    # EvidentlyProfileStep takes reference_dataset and comparison dataset
    get_reference_data=get_reference_data(),
    drift_detector=EvidentlyProfileStep(config=evidently_profile_config),
    
    # Add discord
    alerter=discord_alert()
)
fifth_pipeline.run()

In [None]:
# This will start a serving process for mlflow 
#  - if you want to continue in the notebook you need to manually
#  interrupt the kernel 
from zenml.environment import Environment
from zenml.integrations.mlflow.mlflow_environment import MLFLOW_ENVIRONMENT_NAME

!mlflow ui --backend-store-uri {Environment()[MLFLOW_ENVIRONMENT_NAME].tracking_uri} --port 4998

# Continuous Deployment using MLflow

In [None]:
@pipeline(enable_cache=False)
def continuous_deployment_pipeline(
    importer,
    trainer,
    evaluator,
    get_reference_data,
    drift_detector,
    alerter,
    
    deployment_trigger,
    model_deployer,
):
    """Links all the steps together in a pipeline"""
    X_train, X_test, y_train, y_test = importer()
    model = trainer(X_train=X_train, y_train=y_train)
    evaluator(X_test=X_test, y_test=y_test, model=model)
    
    reference, comparison = get_reference_data(X_train, X_test)
    drift_report, _ = drift_detector(reference, comparison)
    
    alerter(drift_report)
    
    # new 
    deployment_decision = deployment_trigger(drift_report)
    model_deployer(deployment_decision)

In [None]:
@step
def deployment_trigger(
    drift_report: dict,
) -> bool:
    """Implements a simple model deployment trigger that looks at the
    drift report and deploys if there's none"""

    drift = drift_report["data_drift"]["data"]["metrics"]["dataset_drift"]

    # Deploy only when no dataset-level drift was detected
    return not drift
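
The trigger above assumes the drift report always contains the expected keys. As a sketch of a more defensive variant, the plain-Python functions below walk the nested report and refuse to deploy when it cannot be read. The names `dataset_drift_flag` and `should_deploy` are hypothetical, and treating an unreadable report as drift is a design choice made here, not ZenML behaviour.

```python
def dataset_drift_flag(drift_report, default=True):
    """Return the dataset_drift flag, or `default` if the report is malformed."""
    node = drift_report
    for key in ("data_drift", "data", "metrics", "dataset_drift"):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return bool(node)


def should_deploy(drift_report):
    """Deploy only when the report is readable and shows no drift."""
    # An unreadable report counts as drift, so nothing is deployed by accident.
    return not dataset_drift_flag(drift_report, default=True)


clean_report = {"data_drift": {"data": {"metrics": {"dataset_drift": False}}}}
decision_clean = should_deploy(clean_report)
decision_broken = should_deploy({})
```

Here `decision_clean` is `True` and `decision_broken` is `False`, so a missing or truncated report blocks deployment rather than raising a `KeyError` in the middle of a pipeline run.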

In [None]:
from zenml.integrations.mlflow.steps import mlflow_deployer_step
from zenml.services import load_last_service_from_step
from zenml.integrations.mlflow.steps import MLFlowDeployerConfig

model_deployer = mlflow_deployer_step(name="model_deployer")


sixth_pipeline = continuous_deployment_pipeline(
    importer=importer(),
    trainer=tree_trainer_with_mlflow(),
    evaluator=evaluator(),
    
    # EvidentlyProfileStep takes reference_dataset and comparison dataset
    get_reference_data=get_reference_data(),
    drift_detector=EvidentlyProfileStep(config=evidently_profile_config),
    
    # Add discord
    alerter=discord_alert(),
    
    deployment_trigger=deployment_trigger(),
    model_deployer=model_deployer(config=MLFlowDeployerConfig(workers=1)),
)
sixth_pipeline.run()

In [None]:
repo = Repository()
p = repo.get_pipeline('continuous_deployment_pipeline')
last_run = p.runs[-1]
X_test = last_run.steps[0].outputs['X_test'].read()
y_test = last_run.steps[0].outputs['y_test'].read()

In [None]:
service = load_last_service_from_step(
    pipeline_name="continuous_deployment_pipeline",
    step_name="model_deployer",
    running=True,
)

In [None]:
X_test[0], y_test[0]

In [None]:
service.predict(X_test[0:1])

In [None]:
y_test[0]

In [None]:
# Standard scientific Python imports
import matplotlib.pyplot as plt


# ax.set_axis_off()
plt.imshow(X_test[0].reshape(8, 8), cmap=plt.cm.gray_r, interpolation="nearest")

In [None]:
service.stop()