# Lesson 2: Artifact Lineage
***Key Concepts:*** *Artifacts, Artifact Stores, Metadata Stores, Versioning, Caching*

In this lesson, we will learn about one of the coolest features of ML pipelines: automatic artifact versioning and tracking. This gives us detailed insight into exactly how each of our models was created. It also enables automatic artifact caching, so we can swap out parts of our ML pipelines without rerunning any of the prior steps.

Before we dive into any of this, let's get clear on what exactly **[Artifacts](https://docs.zenml.io/core-concepts#artifact)** are. To illustrate, let us first rebuild our digits pipeline from the previous chapter:

In [1]:
from zenml.pipelines import pipeline

from src.steps.importer import importer
from src.steps.sklearn_trainer import svc_trainer
from src.steps.evaluator import evaluator


@pipeline
def digits_pipeline(importer, trainer, evaluator):
    """Links all the steps together in a pipeline"""
    X_train, X_test, y_train, y_test = importer()
    model = trainer(X_train=X_train, y_train=y_train)
    evaluator(X_test=X_test, y_test=y_test, model=model)

The artifacts of this pipeline are simply the local variables we defined: `X_train`, `X_test`, `y_train`, `y_test`, and `model`. These make up the data that flows in and out of our steps. In fact, this data is at the core of our pipelines: the pipeline definition above does nothing more than declare which artifact is the input or output of which step.
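To make the idea of "which artifact flows into which step" concrete, here is a minimal, framework-free sketch. It does not use ZenML at all: the `STEPS` table and the helper functions `producer_of`, `consumers_of`, and `lineage` are hypothetical names made up for illustration, and the table only mirrors the local variables in the pipeline definition above.

```python
# Hypothetical description of the digits pipeline (not a ZenML API):
# step name -> (input artifact names, output artifact names).
# Only the artifacts named in the pipeline definition are listed.
STEPS = {
    "importer": ((), ("X_train", "X_test", "y_train", "y_test")),
    "trainer": (("X_train", "y_train"), ("model",)),
    "evaluator": (("X_test", "y_test", "model"), ()),
}


def producer_of(artifact):
    """Return the name of the step that outputs the given artifact."""
    for step_name, (_, outputs) in STEPS.items():
        if artifact in outputs:
            return step_name
    raise KeyError(artifact)


def consumers_of(artifact):
    """Return the names of all steps that take the artifact as input."""
    return [name for name, (inputs, _) in STEPS.items() if artifact in inputs]


def lineage(artifact):
    """Return every upstream artifact the given artifact depends on."""
    inputs, _ = STEPS[producer_of(artifact)]
    upstream = set(inputs)
    for name in inputs:
        upstream |= lineage(name)
    return upstream


print(producer_of("model"))      # trainer
print(consumers_of("model"))     # ['evaluator']
print(sorted(lineage("model")))  # ['X_train', 'y_train']
```

Asking for the lineage of `model` tells us it was built from `X_train` and `y_train`, which is the kind of question artifact tracking lets us answer for real runs.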

## Pipeline Visualization with Dash

To visualize how the steps connect the different artifacts, we can view our pipeline with ZenML's [Dash](https://dash.plotly.com/introduction) integration. Run the following code, then open `http://127.0.0.1:8050/` in your browser.

In [2]:
!zenml integration install dash -f

In [5]:
digits_svc_pipeline = digits_pipeline(
    importer=importer(), trainer=svc_trainer(), evaluator=evaluator()
)
digits_svc_pipeline.run()

Creating run for pipeline: `digits_pipeline`
Cache enabled for pipeline `digits_pipeline`


In [4]:
from zenml.integrations.dash.visualizers.pipeline_run_lineage_visualizer import (
    PipelineRunLineageVisualizer,
)
from zenml.repository import Repository

repo = Repository()
latest_run = repo.get_pipeline("digits_pipeline").runs[-1]
PipelineRunLineageVisualizer().visualize(latest_run)

Dash is running on http://127.0.0.1:8050/

<dash.dash.Dash at 0x157f0c970>

You should now see an interactive visualization in your browser as shown below. Squares represent your artifacts, circles your pipeline steps. Also note that the different nodes are color coded, so if your pipeline ever fails or runs for too long, you can find the responsible step at a glance!

![Dash Visualization](_assets/02_Artifact_Lineage/dash_initial.png)

## Artifact Caching
As mentioned in the beginning, tracking exactly which artifacts went into which steps also allows us to cache and reuse artifacts. Let's see this in action.
First, stop the execution of the previous notebook cell in case it is still running. Then, execute the next cell to rerun our pipeline and visualize it with Dash again.

In [6]:
digits_svc_pipeline.run()
latest_run = repo.get_pipeline("digits_pipeline").runs[-1]
PipelineRunLineageVisualizer().visualize(latest_run)

Using stack `default` to run pipeline `digits_pipeline`...
Step `importer` has started.
Using cached version of `importer` [`importer`].
Step `importer` has finished in 0.023s.
Step `svc_trainer` has started.
Using cached version of `svc_trainer` [`trainer`].
Step `svc_trainer` has finished in 0.025s.
Step `evaluator` has started.
Using cached version of `evaluator` [`evaluator`].
Step `evaluator` has finished in 0.028s.
Pipeline run `digits_pipeline-05_May_22-18_11_34_624043` has finished in 0.091s.
Dash is running on http://127.0.0.1:8050/

<dash.dash.Dash at 0x157efd6a0>

You should now see a visualization as shown below. Note that all nodes in the graph have turned green. This means every step reused the cached results of our previous run, so no artifact had to be recomputed.

![Dash Visualization Cached](_assets/02_Artifact_Lineage/dash_cached.png)
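Conceptually, caching like this boils down to fingerprinting a step invocation and looking that fingerprint up before doing any work. The sketch below is a simplified illustration of that idea in plain Python, not ZenML's actual implementation: `cache_key`, `run_step`, the `code_version` argument, and the in-memory `_artifact_cache` dictionary are all assumptions made up for this example.

```python
import hashlib
import json

# Stand-in for an artifact store: cache key -> stored step output.
_artifact_cache = {}


def cache_key(step_name, code_version, inputs):
    """Fingerprint a step invocation from its name, code version, and inputs."""
    payload = json.dumps(
        {"step": step_name, "code": code_version, "inputs": inputs},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_step(step_name, code_version, func, inputs):
    """Run func(**inputs) unless an identical invocation is already cached.

    Returns a tuple (output, was_cached).
    """
    key = cache_key(step_name, code_version, inputs)
    if key in _artifact_cache:
        return _artifact_cache[key], True
    output = func(**inputs)
    _artifact_cache[key] = output
    return output, False


def double(values):
    """Toy step: double every value."""
    return [2 * v for v in values]


first = run_step("double", "v1", double, {"values": [1, 2, 3]})
second = run_step("double", "v1", double, {"values": [1, 2, 3]})
third = run_step("double", "v1", double, {"values": [4, 5]})

print(first)   # ([2, 4, 6], False): computed on the first run
print(second)  # ([2, 4, 6], True): identical invocation, served from cache
print(third)   # ([8, 10], False): new inputs, so the step runs again
```

The second call matches the first one exactly, so it is answered from the cache, just like the three steps in the rerun above.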

Let's now replace the SVC model in our ML pipeline with a decision tree and see what happens.

In [8]:
import numpy as np
from sklearn.base import ClassifierMixin
from sklearn.tree import DecisionTreeClassifier
from zenml.steps import step


@step(enable_cache=False)
def tree_trainer(
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> ClassifierMixin:
    """Train an sklearn decision tree classifier."""
    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)
    return model


# redefine and rerun our pipeline, this time with tree_trainer()
digits_tree_pipeline = digits_pipeline(
    importer=importer(), trainer=tree_trainer(), evaluator=evaluator()
)
digits_tree_pipeline.run()

latest_run = repo.get_pipeline("digits_pipeline").runs[-1]
PipelineRunLineageVisualizer().visualize(latest_run)

Creating run for pipeline: `digits_pipeline`
Cache enabled for pipeline `digits_pipeline`
Using stack `default` to run pipeline `digits_pipeline`...
Step `importer` has started.
Using cached version of `importer` [`importer`].
Step `importer` has finished in 0.023s.
Step `tree_trainer` has started.
Step `tree_trainer` has finished in 0.048s.
Step `evaluator` has started.
Test accuracy: 0.7697441601779755
Step `evaluator` has finished in 0.042s.
Pipeline run `digits_pipeline-05_May_22-18_23_02_752560` has finished in 0.128s.
Dash is running on http://127.0.0.1:8050/

<dash.dash.Dash at 0x156f51c70>

The visualization should now look as shown below. Since we changed the trainer, the corresponding node and all subsequent nodes are now blue again, meaning they were rerun and their artifacts were freshly created. (Note that `tree_trainer` is also declared with `enable_cache=False`, so this step is rerun on every execution.) However, note how the input data artifacts are still green: they did not have to be recreated. In a real production setting, this can save a great deal of time and resources, since those data artifacts might be the result of a complex, expensive preprocessing job.

![Dash Visualization Partly Cached](_assets/02_Artifact_Lineage/dash_partly_cached.png)
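Why does the evaluator rerun even though its own code did not change? One way to picture it: if a step's fingerprint also depends on the fingerprints of everything upstream, then changing one step invalidates all steps after it, while steps before it stay cached. The following sketch illustrates this under simplifying assumptions. It is not how ZenML is implemented internally, and `fingerprint`, `run_pipeline`, and the version labels `"svc"` and `"tree"` are hypothetical names chosen for this example.

```python
import hashlib


def fingerprint(*parts):
    """Hash a sequence of strings into one hex digest."""
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def run_pipeline(steps, cache):
    """Simulate a pipeline run and report which steps hit the cache.

    steps: ordered list of (name, code_version, upstream_step_names).
    cache: set of fingerprints seen in earlier runs (mutated in place).
    Returns a dict mapping step name to "cached" or "executed".
    """
    keys, status = {}, {}
    for name, code_version, upstream in steps:
        # A step's key covers its own code and the keys of its inputs,
        # so any upstream change produces a new key downstream as well.
        key = fingerprint(name, code_version, *(keys[u] for u in upstream))
        keys[name] = key
        if key in cache:
            status[name] = "cached"
        else:
            cache.add(key)
            status[name] = "executed"
    return status


svc_steps = [
    ("importer", "v1", []),
    ("trainer", "svc", ["importer"]),
    ("evaluator", "v1", ["importer", "trainer"]),
]
# Same pipeline, but with the trainer swapped for a decision tree.
tree_steps = [
    ("importer", "v1", []),
    ("trainer", "tree", ["importer"]),
    ("evaluator", "v1", ["importer", "trainer"]),
]

cache = set()
print(run_pipeline(svc_steps, cache))
# {'importer': 'executed', 'trainer': 'executed', 'evaluator': 'executed'}
print(run_pipeline(svc_steps, cache))
# {'importer': 'cached', 'trainer': 'cached', 'evaluator': 'cached'}
print(run_pipeline(tree_steps, cache))
# {'importer': 'cached', 'trainer': 'executed', 'evaluator': 'executed'}
```

The three printed results correspond to the three Dash visualizations in this lesson: everything freshly computed, everything cached, and finally only the importer cached after the trainer was replaced.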
