# Lesson 1.2: Artifact Lineage

[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zenml-io/zenbytes/blob/main/1-2_Artifact_Lineage.ipynb)

***Key Concepts:*** *Artifacts, Artifact Stores, Metadata, Metadata Stores, Versioning, Caching*

In this lesson we will learn about one of the coolest features of ML pipelines: automated artifact versioning and tracking. This will give us tremendous insights into how exactly each of our models was created. Furthermore, it enables artifact caching, such that we can switch out parts of our ML pipelines without needing to rerun any of the prior steps.

This notebook requires you to have the ZenML [Dash](https://dash.plotly.com/introduction) integration installed. If you have not done so before, install it with the following command, then restart the kernel of your notebook.

In [None]:
!zenml integration install dash -f

Before we dive into any versioning and caching, let's get clear on what exactly **[Artifacts](https://docs.zenml.io/core-concepts#artifact)** are. To illustrate, let us first rebuild our digits pipeline from the previous chapter:

In [None]:
from zenml.pipelines import pipeline

from steps.importer import importer
from steps.sklearn_trainer import svc_trainer
from steps.evaluator import evaluator


@pipeline
def digits_pipeline(importer, trainer, evaluator):
    """Links all the steps together in a pipeline"""
    X_train, X_test, y_train, y_test = importer()
    model = trainer(X_train=X_train, y_train=y_train)
    evaluator(X_test=X_test, y_test=y_test, model=model)

The artifacts of this pipeline are simply the local variables we defined: `X_train`, `X_test`, `y_train`, `y_test`, and `model`. These make up the data that flows in and out of our steps. In fact, this data is at the core of our pipelines, and the pipeline definition above just declares which artifact is the input or output of which step.
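
To make this "artifacts connect steps" view concrete, here is a minimal sketch in plain Python that does not use ZenML at all. The names `PIPELINE_GRAPH`, `producers`, and `consumers` are made up for illustration, and the graph only lists the artifacts that are named in the pipeline definition above.

```python
# Hypothetical, ZenML-free description of the data flow in digits_pipeline.
# Each step maps to a pair: (input artifacts, output artifacts).
PIPELINE_GRAPH = {
    "importer": ((), ("X_train", "X_test", "y_train", "y_test")),
    "trainer": (("X_train", "y_train"), ("model",)),
    "evaluator": (("X_test", "y_test", "model"), ()),
}


def producers(graph):
    """Map each artifact name to the step that outputs it."""
    return {
        artifact: step
        for step, (_, outputs) in graph.items()
        for artifact in outputs
    }


def consumers(graph):
    """Map each artifact name to the list of steps that take it as input."""
    result = {}
    for step, (inputs, _) in graph.items():
        for artifact in inputs:
            result.setdefault(artifact, []).append(step)
    return result


produced_by = producers(PIPELINE_GRAPH)
consumed_by = consumers(PIPELINE_GRAPH)
for artifact, step in produced_by.items():
    users = ", ".join(consumed_by.get(artifact, []))
    print(f"{artifact}: produced by {step}, consumed by {users}")
```

Reading the printed lines gives the same information as the pipeline definition, just organized by artifact instead of by step.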

## Pipeline Visualization with Dash

To visualize how the steps connect the different artifacts, we can view our pipeline with ZenML's [Dash](https://dash.plotly.com/introduction) integration. 

Run the following code, then open http://127.0.0.1:8050 in your browser.

In [None]:
digits_svc_pipeline = digits_pipeline(
    importer=importer(), trainer=svc_trainer(), evaluator=evaluator()
)
digits_svc_pipeline.run()

In [None]:
from zenml.integrations.dash.visualizers.pipeline_run_lineage_visualizer import (
    PipelineRunLineageVisualizer,
)
from zenml.repository import Repository

repo = Repository()
latest_run = repo.get_pipeline("digits_pipeline").runs[-1]
PipelineRunLineageVisualizer().visualize(latest_run)

You should now see an interactive visualization in your browser as shown below. Squares represent your artifacts, circles your pipeline steps. Also note that the different nodes are color coded, so if your pipeline ever fails or runs for too long, you can find the responsible step at a glance!

![Dash Visualization](_assets/1-2/dash_initial.png)

## Artifact Caching

As mentioned in the beginning, tracking exactly which artifacts went into which steps also allows us to cache and reuse artifacts. Let's see this in action.

First, stop the execution of the previous notebook cell in case it is still running. Then, execute the next cell to rerun our pipeline and visualize it with Dash again.

In [None]:
digits_svc_pipeline.run()
latest_run = repo.get_pipeline("digits_pipeline").runs[-1]
PipelineRunLineageVisualizer().visualize(latest_run)

You should now see a visualization as shown below. Note that all nodes in the graph are now green. This means nothing was recomputed: every step and artifact was reused from the cache of our previous run.

![Dash Visualization Cached](_assets/1-2/dash_cached.png)

Let's now replace the SVC model in our ML pipeline with a decision tree and see what happens.

In [None]:
import numpy as np
from sklearn.base import ClassifierMixin
from sklearn.tree import DecisionTreeClassifier
from zenml.steps import step


@step()
def tree_trainer(
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> ClassifierMixin:
    """Train a sklearn decision tree classifier."""
    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)
    return model


# redefine and rerun our pipeline, this time with tree_trainer()
digits_tree_pipeline = digits_pipeline(
    importer=importer(), trainer=tree_trainer(), evaluator=evaluator()
)
digits_tree_pipeline.run()

latest_run = repo.get_pipeline("digits_pipeline").runs[-1]
PipelineRunLineageVisualizer().visualize(latest_run)

The visualization should now look as shown below. Since we changed the trainer, the corresponding node and all subsequent nodes are now blue again, meaning they were rerun and the artifacts were freshly created. However, note how the input data artifacts are still green. They did not have to be recreated. In a real production setting this might save us a tremendous amount of time and resources as those data artifacts might have been the result of some complex, expensive preprocessing job. 

![Dash Visualization Partly Cached](_assets/1-2/dash_partly_cached.png)
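
One way to picture why only the trainer and the steps after it were rerun is the toy model below. It is a simplified sketch under our own assumptions, not ZenML's actual caching logic: each step gets a cache key derived from its code and from the keys of the steps it depends on, so a change in one step changes the keys of everything downstream. The names `cache_key` and `run` and the strings standing in for step code are hypothetical.

```python
import hashlib


def cache_key(step_code, upstream_keys):
    """Hash a step's code together with the cache keys of its upstream steps."""
    digest = hashlib.sha256(step_code.encode("utf-8"))
    for key in upstream_keys:
        digest.update(b"|" + key.encode("utf-8"))
    return digest.hexdigest()


def run(steps, cache):
    """Walk the steps in order and report which ones can be served from cache.

    `steps` is an ordered list of (name, code, upstream step names).
    `cache` is a set of cache keys seen in earlier runs; it is updated in place.
    """
    keys, status = {}, {}
    for name, code, upstream in steps:
        key = cache_key(code, [keys[u] for u in upstream])
        keys[name] = key
        status[name] = "cached" if key in cache else "executed"
        cache.add(key)
    return status


# Plain strings stand in for the real step source code.
svc_steps = [
    ("importer", "load digits", []),
    ("trainer", "fit SVC", ["importer"]),
    ("evaluator", "score model", ["importer", "trainer"]),
]
tree_steps = [
    ("importer", "load digits", []),
    ("trainer", "fit DecisionTreeClassifier", ["importer"]),
    ("evaluator", "score model", ["importer", "trainer"]),
]

cache = set()
first = run(svc_steps, cache)    # nothing cached yet
second = run(svc_steps, cache)   # identical rerun
third = run(tree_steps, cache)   # trainer swapped out
print(first, second, third, sep="\n")
```

In this toy model the evaluator's own code never changed, yet it is executed again in the third run because one of its inputs comes from the new trainer.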


## Artifact Stores, Metadata Stores, and Orchestrators

You might be wondering now how exactly our ML pipelines can keep track of which artifacts changed and which did not. This requires several additional MLOps components that you would normally have to set up and configure yourself. Luckily, ZenML has already set them up for us automatically.

### Artifact Stores

Under the hood, all the artifacts in our ML pipeline are automatically stored in an [Artifact Store](https://docs.zenml.io/core-concepts#artifact-store). By default, this is simply a place in your local file system, but we could also configure ZenML to store this data in a cloud bucket like [Amazon S3](https://aws.amazon.com/s3/) or any other place instead. We will see this in more detail when we migrate our MLOps stack to the cloud in a later chapter.

You can run the following command to find out where exactly your artifacts are currently stored:

In [None]:
!zenml artifact-store describe
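
If you are curious what "a place in your local file system" could look like in principle, here is a deliberately tiny sketch. The `LocalArtifactStore` class is hypothetical and is not how ZenML implements its artifact store; it only shows the idea that every saved artifact gets its own file, so older versions are never overwritten.

```python
import pickle
import tempfile
import uuid
from pathlib import Path


class LocalArtifactStore:
    """Hypothetical file-based artifact store, for illustration only."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, name, obj):
        """Pickle `obj` into a new, uniquely named file and return its path."""
        path = self.root / name / f"{uuid.uuid4().hex}.pkl"
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("wb") as f:
            pickle.dump(obj, f)
        return str(path)

    def load(self, path):
        """Read back an artifact that was written by `save`."""
        with open(path, "rb") as f:
            return pickle.load(f)


# A temporary directory stands in for the real store location.
store = LocalArtifactStore(tempfile.mkdtemp())
path_a = store.save("model", {"kernel": "rbf"})
path_b = store.save("model", {"max_depth": 3})
print(path_a)
print(path_b)
print(store.load(path_a))
```

Both versions of `model` live side by side under different file names, which is what makes it possible to go back to an earlier one later.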

### Metadata Stores

In addition to the artifact itself, ZenML automatically stores [Metadata](https://docs.zenml.io/core-concepts#metadata-store), e.g. where the artifact is stored, in a [Metadata Store](https://docs.zenml.io/core-concepts#metadata-store). By default, this is a SQLite database on your local machine, but we could again easily switch this out for a cloud service.

Run the following command to see where the metadata is currently stored:

In [None]:
!zenml metadata-store describe
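
To get a feeling for what kind of information such a database might hold, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table layout, the `lineage` helper, and the file paths are all invented for illustration and do not reflect ZenML's real metadata schema.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE artifacts ("
    "run_id INTEGER, name TEXT, producer_step TEXT, uri TEXT)"
)

# Hypothetical records for two pipeline runs; the paths are placeholders.
rows = [
    (1, "X_train", "importer", "/tmp/store/importer/X_train/1"),
    (1, "model", "svc_trainer", "/tmp/store/svc_trainer/model/1"),
    (2, "X_train", "importer", "/tmp/store/importer/X_train/1"),  # reused
    (2, "model", "tree_trainer", "/tmp/store/tree_trainer/model/2"),
]
conn.executemany("INSERT INTO artifacts VALUES (?, ?, ?, ?)", rows)


def lineage(connection, artifact_name):
    """Return (run_id, producer_step, uri) for every version of an artifact."""
    cursor = connection.execute(
        "SELECT run_id, producer_step, uri FROM artifacts "
        "WHERE name = ? ORDER BY run_id",
        (artifact_name,),
    )
    return cursor.fetchall()


for run_id, step, uri in lineage(conn, "model"):
    print(f"run {run_id}: model created by {step}, stored at {uri}")
```

Note how the second run points to the same `X_train` location as the first one, while `model` has a new producer and a new location. Records like these are what make it possible to answer "how was this model created?" after the fact.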

### ZenML MLOps Stacks

Artifact stores, metadata stores, and orchestrators are the backbone of any MLOps stack, as they enable us to store, share, and reproduce our work. Without them, we can easily lose track of how exactly our current ML pipelines were created. You can see a list of all components in your current MLOps stack using the following command:

In [None]:
!zenml stack describe

As we see, our stack currently consists of only three components:
- Artifact Store
- Metadata Store
- Orchestrator

The [Orchestrator](https://docs.zenml.io/core-concepts#orchestrator) is the component that defines how and where each pipeline step is executed when calling `pipeline.run()`. For now, this component is not of much interest to us, but we will learn more about it in later chapters, e.g., to run our pipelines on a [Kubernetes](https://kubernetes.io/) cluster with a [Kubeflow](https://www.kubeflow.org/) orchestrator.

![Local MLOps Stack](_assets/1-2/localstack.png)
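
As a rough mental model of what a local orchestrator does, the sketch below runs the steps of our digits pipeline one after another in dependency order. It is plain Python under our own assumptions (the `DEPENDENCIES` dictionary and `run_locally` function are hypothetical, and `graphlib` requires Python 3.9 or newer), not ZenML's orchestrator code.

```python
from graphlib import TopologicalSorter

# step name -> set of steps whose outputs it needs (mirrors digits_pipeline)
DEPENDENCIES = {
    "importer": set(),
    "trainer": {"importer"},
    "evaluator": {"importer", "trainer"},
}


def run_locally(dependencies, step_functions):
    """Execute every step in the current process, upstream steps first."""
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        step_functions[name]()
        executed.append(name)
    return executed


# Placeholder step bodies that only print their own name.
step_functions = {
    name: (lambda n=name: print(f"running {n}")) for name in DEPENDENCIES
}
order = run_locally(DEPENDENCIES, step_functions)
```

A different orchestrator could keep the same ordering logic but send each step to a separate machine or container instead of calling it directly, which is the "where" part of the definition above.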

Throughout the subsequent chapters, we will add several more components to our MLOps stack, including model deployment tools, experiment trackers, data and model monitoring tools, and more. Let's start with experiment tracking in the [next lesson](2-1_Experiment_Tracking.ipynb).