# ZenML: Modular orchestration

***Key Concepts:*** *Artifacts, Artifact Stores, Metadata, Metadata Stores, Versioning, Caching, Orchestrator*

In this lesson we will learn about one of the coolest features of ML pipelines: automated artifact versioning and tracking. This will give us tremendous insights into how exactly each of our models was created. Furthermore, it enables artifact caching, allowing us to switch out parts of our ML pipelines without having to rerun any previous steps.

This notebook requires the ZenML sklearn and [Dash](https://dash.plotly.com/introduction) integrations. If you have not installed them yet, run the following cell, which installs both and then restarts the kernel of your notebook.

In [None]:
!zenml integration install sklearn dash -y

import IPython

# automatically restart kernel
IPython.Application.instance().kernel.do_shutdown(restart=True)

Before we dive into any versioning and caching, let's clarify what exactly **[Artifacts](https://docs.zenml.io/mlops-stacks/artifact-stores)** are. To illustrate, let us first rebuild our digits pipeline from the previous chapter:

In [None]:
!zenml stack list

In [None]:
!zenml stack set default

In [None]:
import numpy as np
from zenml.integrations.sklearn.helpers.digits import get_digits
from zenml.steps import Output, step


@step
def importer() -> Output(
    X_train=np.ndarray, X_test=np.ndarray, y_train=np.ndarray, y_test=np.ndarray
):
    """Loads the digits array as normal numpy arrays."""
    X_train, X_test, y_train, y_test = get_digits()
    return X_train, X_test, y_train, y_test

In [None]:
import numpy as np
from sklearn.base import ClassifierMixin
from sklearn.svm import SVC

@step
def svc_trainer(
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> ClassifierMixin:
    """Train a sklearn SVC classifier."""
    model = SVC(gamma=0.001)
    model.fit(X_train, y_train)
    return model

In [None]:
import numpy as np  # type: ignore [import]
from sklearn.base import ClassifierMixin
from zenml.steps import step

@step
def evaluator(
    X_test: np.ndarray,
    y_test: np.ndarray,
    model: ClassifierMixin,
) -> float:
    """Calculate the accuracy on the test set"""
    test_acc = model.score(X_test, y_test)
    print(f"Test accuracy: {test_acc}")
    return test_acc

In [None]:
from zenml.pipelines import pipeline

@pipeline
def digits_pipeline(importer, trainer, evaluator):
    """Links all the steps together in a pipeline"""
    X_train, X_test, y_train, y_test = importer()
    model = trainer(X_train=X_train, y_train=y_train)
    evaluator(X_test=X_test, y_test=y_test, model=model)

The artifacts of this pipeline are simply the local variables we defined: `X_train`, `X_test`, `y_train`, `y_test`, and `model`. These make up the data that flows in and out of our steps. Artifacts are at the core of our pipelines, and the pipeline definition above just defines which artifact is the input or output of which step.
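
To make the idea of artifacts as step inputs and outputs concrete, here is a minimal sketch in plain Python that does not use ZenML at all. The `WIRING` table and the `producer_of` and `consumers_of` helpers are hypothetical names chosen for illustration; they only restate which step of `digits_pipeline` produces and consumes each artifact. The output name `test_acc` is borrowed from the local variable inside `evaluator`.

```python
# Illustrative only: a plain-Python description of the digits pipeline wiring.
# WIRING and both helpers are hypothetical and not part of ZenML.
WIRING = {
    "importer": {
        "inputs": [],
        "outputs": ["X_train", "X_test", "y_train", "y_test"],
    },
    "trainer": {"inputs": ["X_train", "y_train"], "outputs": ["model"]},
    "evaluator": {
        "inputs": ["X_test", "y_test", "model"],
        "outputs": ["test_acc"],
    },
}


def producer_of(artifact):
    """Return the single step that outputs the given artifact."""
    (step_name,) = [s for s, io in WIRING.items() if artifact in io["outputs"]]
    return step_name


def consumers_of(artifact):
    """Return all steps that take the given artifact as an input."""
    return [s for s, io in WIRING.items() if artifact in io["inputs"]]


print(producer_of("model"), "->", consumers_of("model"))
```

Every artifact has exactly one producing step but may have several consumers, which is why swapping out a single step only affects the artifacts downstream of it.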

## Pipeline Visualization with Dash

To visualize how the steps connect the different artifacts, we can view our pipeline with ZenML's [Dash](https://dash.plotly.com/introduction) integration. 

Run the following code, then open http://127.0.0.1:8050 in your browser.

In [None]:
digits_svc_pipeline = digits_pipeline(
    importer=importer(), trainer=svc_trainer(), evaluator=evaluator()
)
digits_svc_pipeline.run()

In [None]:
def start_pipeline_visualizer():
    from zenml.integrations.dash.visualizers.pipeline_run_lineage_visualizer import (
        PipelineRunLineageVisualizer,
    )
    from zenml.repository import Repository

    repo = Repository()
    latest_run = repo.get_pipeline("digits_pipeline").runs[-1]
    PipelineRunLineageVisualizer().visualize(latest_run)

start_pipeline_visualizer()

You should now see an interactive visualization in your browser, as shown below. The squares represent your artifacts and the circles your pipeline steps. Also, note that the different nodes are color-coded, so if your pipeline ever fails or runs for too long, you can find the responsible step at a glance!

## Artifact Caching
As mentioned in the beginning, tracking exactly which artifact went into which step allows us to cache and reuse artifacts. Let's see this in action: First, stop the execution of the last notebook cell if it is still running. Then, execute the next cell to rerun our pipeline and visualize it with Dash again.

In [None]:
digits_svc_pipeline.run()

In [None]:
start_pipeline_visualizer()

You should now see a visualization as shown below. Note that the color of all nodes in the graph has changed to green now. This means they were still cached from our previous run.

Let's now replace the SVC model in our ML pipeline with a decision tree and see what happens.

In [None]:
import numpy as np
from sklearn.base import ClassifierMixin
from sklearn.tree import DecisionTreeClassifier
from zenml.steps import step


@step
def tree_trainer(
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> ClassifierMixin:
    """Train a sklearn decision tree classifier."""
    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)
    return model


# redefine and rerun our pipeline, this time with tree_trainer()
digits_tree_pipeline = digits_pipeline(
    importer=importer(), trainer=tree_trainer(), evaluator=evaluator()
)
digits_tree_pipeline.run()

In [None]:
start_pipeline_visualizer()

The visualization should now look as shown below. Since we changed the trainer, the corresponding node and all subsequent nodes are now blue again, meaning they were rerun and the artifacts were freshly created. However, note how the input data artifacts are still green. They did not have to be recreated. In an actual production setting, this might save us a tremendous amount of time and resources as those data artifacts might have resulted from some complex, expensive preprocessing job.
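
The behavior above can be sketched in a few lines of plain Python. This is not ZenML's actual caching implementation; it is a toy model that assumes a step's cache key is a hash of the step's code plus the cache keys of its upstream steps. The names `cache_key` and `run`, and the strings standing in for step source code, are hypothetical.

```python
import hashlib


def cache_key(step_source, input_keys):
    """Hash a step's source code together with the keys of its inputs."""
    digest = hashlib.sha256(step_source.encode())
    for key in input_keys:
        digest.update(key.encode())
    return digest.hexdigest()


def run(pipeline, cache):
    """Run steps in order and report which ones were cached or executed.

    `pipeline` is a list of (name, source, upstream_step_names) tuples that
    is already in execution order. `cache` is a set of known cache keys.
    """
    keys, report = {}, {}
    for name, source, upstream in pipeline:
        keys[name] = cache_key(source, [keys[u] for u in upstream])
        if keys[name] in cache:
            report[name] = "cached"
        else:
            cache.add(keys[name])
            report[name] = "executed"
    return report


svc = [
    ("importer", "get_digits()", []),
    ("trainer", "SVC(gamma=0.001)", ["importer"]),
    ("evaluator", "model.score(X_test, y_test)", ["importer", "trainer"]),
]
# Same pipeline, but with the trainer swapped for a decision tree.
tree = [svc[0], ("trainer", "DecisionTreeClassifier()", ["importer"]), svc[2]]

cache = set()
first = run(svc, cache)    # nothing is cached yet
second = run(svc, cache)   # identical rerun
third = run(tree, cache)   # trainer changed
print(first, second, third, sep="\n")
```

Because the evaluator's key includes the trainer's key, changing the trainer invalidates the evaluator as well, while the importer's key is untouched and its outputs are reused.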

## Artifact Stores, Metadata Stores, and Orchestrators

You might now wonder how our ML pipelines can keep track of which artifacts changed and which did not. This requires several additional MLOps components that you would typically have to set up and configure yourself. Luckily, ZenML has set these up for us automatically.

### Artifact Stores

Under the hood, all the artifacts in our ML pipeline are automatically stored in an [Artifact Store](https://docs.zenml.io/mlops-stacks/artifact-stores). By default, this is simply a place in your local file system, but we could also configure ZenML to store this data in a cloud bucket like [Amazon S3](https://aws.amazon.com/s3/) or any other place instead. We will see this in more detail when we migrate our MLOps stack to the cloud in a later chapter.
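
As a rough mental model, a local artifact store can be thought of as a directory tree with one file per step output. The sketch below illustrates that idea with the standard library only; `LocalArtifactStore`, its directory layout, and the use of `pickle` are assumptions made for this example and do not describe how ZenML lays out or serializes its artifacts.

```python
import pickle
import tempfile
from pathlib import Path


class LocalArtifactStore:
    """Hypothetical toy store: one pickle file per (run, step, output)."""

    def __init__(self, root):
        self.root = Path(root)

    def uri(self, run, step, output):
        """Build the file path for a given step output."""
        return self.root / run / step / f"{output}.pkl"

    def save(self, run, step, output, value):
        """Serialize a value to disk and return where it was written."""
        path = self.uri(run, step, output)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(pickle.dumps(value))
        return path

    def load(self, run, step, output):
        """Read a previously saved value back from disk."""
        return pickle.loads(self.uri(run, step, output).read_bytes())


# Use a temporary directory as the store root for this example.
store = LocalArtifactStore(tempfile.mkdtemp())
saved_path = store.save("run_1", "importer", "y_train", [0, 1, 2])
restored = store.load("run_1", "importer", "y_train")
print(saved_path, restored)
```

Swapping the local directory for a cloud bucket would change only where `uri` points and how bytes are written, which is why the store can be replaced without touching the pipeline code.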

You can run the following command to find out where exactly your artifacts are currently stored:

In [None]:
!zenml artifact-store describe

### Metadata Stores

In addition to the artifact itself, ZenML automatically stores [Metadata](https://docs.zenml.io/mlops-stacks/metadata-stores), e.g., where the artifact is stored, in a [Metadata Store](https://docs.zenml.io/mlops-stacks/metadata-stores). This is an SQLite database on your local machine by default, but we could again easily switch it out for a cloud service.
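
To illustrate what kind of information such a store holds, here is a minimal sketch using Python's built-in `sqlite3` module. The table schema, the example rows, and the `find_uri` helper are invented for this example; ZenML's real metadata schema is different and considerably richer.

```python
import sqlite3

# In-memory database for the example; a real store would use a file or server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE artifacts (run TEXT, step TEXT, name TEXT, type TEXT, uri TEXT)"
)

# Made-up records describing where two artifacts of one run are stored.
rows = [
    ("run_1", "importer", "X_train", "ndarray", "/tmp/artifacts/run_1/importer/X_train"),
    ("run_1", "trainer", "model", "ClassifierMixin", "/tmp/artifacts/run_1/trainer/model"),
]
conn.executemany("INSERT INTO artifacts VALUES (?, ?, ?, ?, ?)", rows)


def find_uri(run, name):
    """Look up where an artifact of a given run is stored, or None."""
    row = conn.execute(
        "SELECT uri FROM artifacts WHERE run = ? AND name = ?", (run, name)
    ).fetchone()
    return row[0] if row else None


model_uri = find_uri("run_1", "model")
print(model_uri)
```

The metadata store holds only small records like these, while the potentially large artifact data stays in the artifact store.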

Run the following command to see where the metadata is currently stored:

In [None]:
!zenml metadata-store describe

## ZenML MLOps Stacks

Artifact stores, metadata stores, and orchestrators are the backbone of any MLOps stack, as they enable us to store, share, and reproduce our work. Without them, we can easily lose track of how exactly our current ML pipelines were created.

### The Default Stack

You can see a list of all components in your current MLOps stack using the following command:

In [None]:
!zenml stack describe

As we see, our stack currently consists of only three components:
- Artifact Store
- Metadata Store
- Orchestrator

The [Orchestrator](https://docs.zenml.io/mlops-stacks/orchestrators) is the component that defines how and where each pipeline step is executed when calling `pipeline.run()`. This component is not of much interest to us right now, but we will learn more about it in later chapters, e.g., to run our pipelines on a [Kubernetes](https://kubernetes.io/) cluster with a [Kubeflow](https://www.kubeflow.org/) orchestrator.
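
Conceptually, the simplest possible orchestrator just runs the steps one after another on the local machine, making sure every step runs after the steps it depends on. The sketch below shows that idea with the standard library's `graphlib` (Python 3.9+). `DEPENDENCIES` and `run_locally` are hypothetical names, and this is not ZenML's orchestrator code.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps whose outputs it consumes.
DEPENDENCIES = {
    "importer": set(),
    "trainer": {"importer"},
    "evaluator": {"importer", "trainer"},
}


def run_locally(dependencies, execute):
    """Call `execute(step)` for every step, upstream steps first."""
    order = list(TopologicalSorter(dependencies).static_order())
    for step_name in order:
        execute(step_name)
    return order


executed = []
order = run_locally(DEPENDENCIES, executed.append)
print(order)
```

A remote orchestrator would follow the same ordering but replace the `execute` callback with something that launches each step on other infrastructure, for example in a container on a cluster.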

![Local MLOps Stack](assets/axa_local_stack.png)

We will add several more components to our MLOps stack throughout the subsequent chapters, including model deployment tools, experiment trackers, data and model monitoring tools, and more. Let's start with experiment tracking in the [next lesson](2-1_Experiment_Tracking.ipynb).

In [1]:
from steps.evaluator import evaluator
from steps.importer import importer
from steps.sklearn_trainer import svc_trainer

from pipelines.digits_pipeline import digits_pipeline

In [None]:
# rerun our pipeline with svc_trainer() on the default stack
digits_svc_pipeline = digits_pipeline(
    importer=importer(), trainer=svc_trainer(), evaluator=evaluator()
)
digits_svc_pipeline.run()

### The AWS VM Stack

![AWS VM MLOps Stack](assets/axa_local_stack-1.png)

In [None]:
ECR_REGISTRY_NAME = "715803424590.dkr.ecr.us-east-1.amazonaws.com"
AWS_REGION = "us-east-1"
REMOTE_ARTIFACT_STORE_PATH = "s3://zenml-demos"

In [None]:
!aws ecr get-login-password --region {AWS_REGION} | docker login --username AWS --password-stdin {ECR_REGISTRY_NAME}

In [None]:
!zenml container-registry register ecr_registry \
    --flavor=aws \
    --uri={ECR_REGISTRY_NAME}

In [None]:
!zenml artifact-store register s3_store \
    --flavor=s3 \
    --path={REMOTE_ARTIFACT_STORE_PATH}

In [None]:
!zenml metadata-store register rds_metadata_store \
    --flavor=mysql \
    --database=zenml \
    --host="zenml.c4db8gwgpugq.us-east-1.rds.amazonaws.com" \
    --port=3306 \
    --secret=aws_rds_secret

In [None]:
!zenml secrets-manager register aws_secrets_manager --flavor aws \
    region_name={AWS_REGION}

In [None]:
!zenml secret register aws_rds_secret \
    --schema=mysql \
    --user=root \
    --password=zenmlmysql

In [None]:
!zenml orchestrator register aws_vm_orchestrator --flavor=aws_vm

In [None]:
!zenml stack register aws_minimal_stack \
    -m rds_metadata_store \
    -a s3_store \
    -o aws_vm_orchestrator \
    -c ecr_registry \
    -x aws_secrets_manager

In [4]:
!zenml stack set aws_minimal_stack
!zenml stack up

Running with active profile: 'great_expectations' (local)
Active stack set to: 'aws_minimal_stack'
Running with active profile: 'great_expectations' (local)
Provisioning resources for active stack 'aws_minimal_stack'.
Provisioning resources for stack 'aws_minimal_stack'.
Resuming provisioned resources for stack aws_minimal_stack.


In [5]:
# rerun our pipeline with svc_trainer(), this time on the AWS VM stack
digits_svc_pipeline = digits_pipeline(
    importer=importer(), trainer=svc_trainer(), evaluator=evaluator()
)
digits_svc_pipeline.run()

Creating run for pipeline: digits_pipeline
Cache enabled for pipeline digits_pipeline
Building Docker image 715803424590.dkr.ecr.us-east-1.amazonaws.com/aws-vm-orchestrator:digits_pipeline:
The following external requirements are built into the image:
	- sagemaker==2.82.2
	- boto3==1.21.0
	- s3fs==2022.3.0
No .dockerignore found, including all files inside /home/htahir1/workspace/zenml.
Build context size for docker image: 206.52 MiB. If you believe this is unreasonably large, make sure to include a .dockerignore fileat the root of your build context /home/htahir1/workspace/zenml/.dockerignore or specify a custom file for argument dockerignore_file when defining your pipeline.
Building the image might take a while...
Finished building Docker image 715803424590.dkr.ecr.us-east-1.amazonaws.com/aws-vm-orchestrator:digits_pipeline.
Pushing Docker image 715803424590.dkr.ecr.us-east-1.amazonaws.com/aws-vm-orchestrator:digits_pipeline.
Finished pushing Docker image 715803424590.dkr.ecr.us-east-1.amazonaws.com/aws-vm-orchestrator:digits_pipeline.
Using stack aws_minimal_stack to run pipeline digits_pipeline...

### The AWS Kubernetes Stack

![AWS K8s MLOps Stack](assets/axa_local_stack-2.png)

In [None]:
AWS_EKS_CLUSTER = ""  # fill in: name of your EKS cluster
KUBE_CONTEXT = ""  # fill in: alias for the kubeconfig context

In [None]:
!aws eks --region {AWS_REGION} update-kubeconfig \
    --name {AWS_EKS_CLUSTER} \
    --alias {KUBE_CONTEXT}

In [None]:
!zenml orchestrator register k8s_orchestrator \
    --flavor=kubernetes \
    --kubernetes_context={KUBE_CONTEXT} \
    --kubernetes_namespace=zenml \
    --synchronous=True

In [None]:
!zenml stack register k8s_stack \
    -m rds_metadata_store \
    -a s3_store \
    -o k8s_orchestrator \
    -c ecr_registry \
    -x aws_secrets_manager

In [None]:
!zenml stack set k8s_stack
!zenml stack up

In [None]:
# rerun our pipeline with svc_trainer(), this time on the Kubernetes stack
digits_svc_pipeline = digits_pipeline(
    importer=importer(), trainer=svc_trainer(), evaluator=evaluator()
)
digits_svc_pipeline.run()