<a href="https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/quickstart/quickstart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ZenML Quickstart Guide

Our goal here is to give you a first hands-on experience with ZenML and a brief overview of some of its basic functionality. We'll create a training pipeline for a small [MNIST](http://yann.lecun.com/exdb/mnist/)-style handwritten digits dataset, loaded with the `get_digits` helper from ZenML's scikit-learn integration, and then rerun the same pipeline with a different trainer.

If you want to run this notebook in an interactive environment, feel free to run it in a [Google Colab](https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/quickstart/quickstart.ipynb) or view it on [GitHub](https://github.com/zenml-io/zenml/tree/main/examples/quickstart) directly.


## Purpose

This quickstart guide is designed to provide a practical introduction to some of the main concepts and paradigms used by the ZenML framework. If you want more detail, our [full documentation](https://docs.zenml.io/) provides more on the concepts and how to implement them.

## Using Google Colab

You will want to use a GPU for this example. If you are following this quickstart in Google's Colab, follow these steps:

- Before running anything, you need to tell Colab that you want to use a GPU. You can do this by clicking on the ‘Runtime’ tab and selecting ‘Change runtime type’. A pop-up window will open up with a drop-down menu.
- Select ‘GPU’ from the menu and click ‘Save’.
- It may ask if you want to restart the runtime. If so, go ahead and do that.

<!-- The code for the MNIST training borrows heavily from [this](https://www.tensorflow.org/datasets/keras_example) -->

## Relation to quickstart.py
This notebook is a variant of [quickstart.py](https://github.com/zenml-io/zenml/blob/main/examples/quickstart/quickstart.py), which is featured in the [ZenML Docs](https://docs.zenml.io). The main difference is that this notebook adds a modular aspect to the importer step and shows how to fetch pipelines, runs, and artifacts in the post-execution workflow.

## Install libraries

In [None]:
# Install ZenML and its scikit-learn integration
!pip install zenml
!zenml integration install sklearn

Once the installation is completed, you can go ahead and create your first ZenML repository for your project. As ZenML repositories are built on top of Git repositories, you can create yours in a desired empty directory through:

In [None]:
# Initialize a ZenML repository
!zenml init

Now, the setup is completed. For the next steps, just make sure that you are executing the code within your ZenML repository.

## Import relevant packages

We will use pipelines and steps to train our model.

In [1]:
import numpy as np
from sklearn.base import ClassifierMixin
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

from zenml.integrations.sklearn.helpers.digits import get_digits
from zenml.pipelines import pipeline
from zenml.steps import Output, step

## Define ZenML Steps

In the code that follows, you can see that we are defining the various steps of our pipeline. Each step is decorated with `@step`, the main abstraction that is currently available for creating pipeline steps.

The first step is an `importer` step that loads the digits dataset and returns four NumPy arrays as its output.

In [2]:
@step
def importer() -> Output(
    X_train=np.ndarray, X_test=np.ndarray, y_train=np.ndarray, y_test=np.ndarray
):
    """Loads the digits array as normal numpy arrays."""
    X_train, X_test, y_train, y_test = get_digits()
    return X_train, X_test, y_train, y_test

We then add a `Trainer` step that takes the imported training data and trains a scikit-learn classifier on it. Note that the model is not explicitly saved within the step. Under the hood, ZenML uses Materializers to automatically persist the Artifacts that result from each step into the Artifact Store.

In [3]:
@step
def decision_tree_trainer(
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> ClassifierMixin:
    """Train a simple sklearn decision tree classifier for the digits dataset."""
    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)
    return model
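
To make the idea of automatic persistence a bit more concrete, here is a minimal conceptual sketch of what "saving a step output and loading it back later" can look like. It is not ZenML's Materializer API: `save_artifact`, `load_artifact`, the pickle format, and the directory layout are all hypothetical and only illustrate the round trip.

```python
import pickle
import tempfile
from pathlib import Path


def save_artifact(obj, artifact_dir: Path, name: str) -> Path:
    """Serialize a step output to disk (hypothetical helper, not ZenML API)."""
    artifact_dir.mkdir(parents=True, exist_ok=True)
    path = artifact_dir / f"{name}.pkl"
    with path.open("wb") as f:
        pickle.dump(obj, f)
    return path


def load_artifact(path: Path):
    """Read a previously stored step output back into memory."""
    with path.open("rb") as f:
        return pickle.load(f)


# Stand-in for a trained model: any picklable Python object works here.
toy_model = {"kind": "decision_tree", "max_depth": 3}

with tempfile.TemporaryDirectory() as tmp:
    # One folder per step, one file per output (an assumed layout).
    stored_at = save_artifact(toy_model, Path(tmp) / "trainer", "output")
    restored = load_artifact(stored_at)

print(restored == toy_model)
```

The point is only that the step itself never calls a save function: something outside the step writes the return value to a store and reads it back when a later step needs it.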

Finally, we add an `Evaluator` step that takes as input the test set and the trained model and evaluates some final metrics.

In [4]:
@step
def evaluator(
    X_test: np.ndarray,
    y_test: np.ndarray,
    model: ClassifierMixin,
) -> float:
    """Calculate the accuracy on the test set"""
    test_acc = model.score(X_test, y_test)
    print(f"Test accuracy: {test_acc}")
    return test_acc
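
For classifiers, scikit-learn's `score` method returns the mean accuracy on the given test data. If you want to see what that metric boils down to, here is a small plain-Python sketch. The `accuracy` function and the toy labels are made up for illustration and are not part of the pipeline.

```python
def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels, made up for illustration only.
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 2, 1, 1]
toy_acc = accuracy(y_true, y_pred)
print(toy_acc)  # 4 of 5 predictions match
```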

## Define ZenML Pipeline

A pipeline is defined with the `@pipeline` decorator. This defines the various steps of the pipeline and specifies the dependencies between the steps, thereby determining the order in which they will be run.

In [5]:
@pipeline
def mnist_pipeline(
    importer,
    trainer,
    evaluator,
):
    """Links all the steps together in a pipeline"""
    X_train, X_test, y_train, y_test = importer()
    model = trainer(X_train=X_train, y_train=y_train)
    evaluator(X_test=X_test, y_test=y_test, model=model)
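
The way data flows between the steps is what fixes their execution order. As a rough illustration (not how ZenML resolves the order internally), the same dependencies can be written down as a graph and sorted with Python's standard-library `graphlib` (Python 3.9+). The `dependencies` dictionary below is a hand-written assumption that mirrors the pipeline function above.

```python
from graphlib import TopologicalSorter

# Each key is a step; its value is the set of steps whose outputs it consumes.
# This mirrors the data flow in the pipeline function above.
dependencies = {
    "importer": set(),
    "trainer": {"importer"},
    "evaluator": {"importer", "trainer"},
}

# A topological sort yields an order in which every step runs after its inputs.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Because the evaluator needs the trainer's model and the trainer needs the importer's data, there is only one valid order: importer, trainer, evaluator.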

## Run the pipeline

Running the pipeline is as simple as calling the `run()` method on an instance of the defined pipeline. Here we explicitly name our pipeline run to make it easier to access later on. Be aware that you can only run the pipeline once with this name. To rerun, rename the run or remove the run name.
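
If you expect to rerun a pipeline often but still want readable run names, one option is to generate a fresh name each time. The sketch below is plain Python and independent of ZenML; `unique_run_name` is a hypothetical helper, and the timestamp-plus-suffix format is just one possible convention.

```python
import uuid
from datetime import datetime


def unique_run_name(base: str) -> str:
    """Append a timestamp and a short random suffix to a base run name."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    suffix = uuid.uuid4().hex[:8]
    return f"{base}_{timestamp}_{suffix}"


name_a = unique_run_name("decision_tree_mnist_training_run")
name_b = unique_run_name("decision_tree_mnist_training_run")
print(name_a)
print(name_a != name_b)
```

Keep in mind that a generated name has to be stored somewhere if you want to look the run up by name later.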

In [6]:
RUN_NAME_1 = "decision_tree_mnist_training_run"

# Initialize the pipeline
first_pipeline = mnist_pipeline(
    importer=importer(),
    trainer=decision_tree_trainer(),
    evaluator=evaluator(),
)
first_pipeline.run(run_name=RUN_NAME_1)  # Make sure to change the name if you want to rerun

Creating run for pipeline: `mnist_pipeline`
Cache enabled for pipeline `mnist_pipeline`
Using stack `local_stack` to run pipeline `mnist_pipeline`...
Step `importer` has started.
Step `importer` has finished in 0.120s.
Step `decision_tree_trainer` has started.
Step `decision_tree_trainer` has finished in 0.065s.
Step `evaluator` has started.
Test accuracy: 0.7619577308120133
Step `evaluator` has finished in 0.051s.
Pipeline run `standard_mnist_training_run` has finished in 0.250s.


## Swapping the trainer

The decision tree reached a test accuracy of roughly 76%, which is a reasonable baseline, but maybe we want to see how the same training pipeline performs with a different model.

You can see how easy it is to switch out one trainer step for another in our pipeline.

In [7]:
@step
def svc_trainer(
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> ClassifierMixin:
    """Train another simple sklearn classifier for the digits dataset."""
    model = SVC(gamma=0.001)
    model.fit(X_train, y_train)
    return model
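
The reason the swap is this simple is that the pipeline only depends on the trainer's interface (training data in, model out), not on a specific trainer. Here is a toy, ZenML-free sketch of that pattern. All names, the two toy trainers, and the tiny dataset are made up for illustration.

```python
from collections import Counter


def majority_trainer(X, y):
    """Toy trainer: always predict the most common training label."""
    most_common = Counter(y).most_common(1)[0][0]
    return lambda features: most_common


def nearest_neighbor_trainer(X, y):
    """Toy trainer: predict the label of the closest training point."""

    def predict(features):
        distances = [
            sum((a - b) ** 2 for a, b in zip(features, row)) for row in X
        ]
        return y[distances.index(min(distances))]

    return predict


def run_toy_pipeline(trainer) -> float:
    """Fixed data and evaluation; only the trainer is passed in."""
    X_train, y_train = [[0.0], [0.1], [1.0], [1.1], [0.2]], [0, 0, 1, 1, 0]
    X_test, y_test = [[0.05], [0.95], [1.05], [0.15]], [0, 1, 1, 0]
    model = trainer(X_train, y_train)
    hits = sum(model(x) == label for x, label in zip(X_test, y_test))
    return hits / len(y_test)


majority_acc = run_toy_pipeline(majority_trainer)
neighbor_acc = run_toy_pipeline(nearest_neighbor_trainer)
print(majority_acc, neighbor_acc)
```

Both trainers plug into the same `run_toy_pipeline` without any change to the importer or evaluator logic.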

In [8]:
RUN_NAME_2 = "svc_mnist_training_run"


# Initialize a new pipeline
second_pipeline = mnist_pipeline(
    importer=importer(),
    trainer=svc_trainer(),
    evaluator=evaluator(),
)

# Run the new pipeline
second_pipeline.run(run_name=RUN_NAME_2)  # Make sure to change the name if you want to rerun

Creating run for pipeline: `mnist_pipeline`
Cache enabled for pipeline `mnist_pipeline`
Using stack `local_stack` to run pipeline `mnist_pipeline`...
Step `importer` has started.
Step `importer` has finished in 0.028s.
Step `svc_trainer` has started.
Step `svc_trainer` has finished in 0.063s.
Step `evaluator` has started.
Test accuracy: 0.9688542825361512
Step `evaluator` has finished in 0.058s.
Pipeline run `mnist_training_run_2` has finished in 0.164s.


# Post execution workflow

As mentioned above, Materializers take care of persisting your artifacts for you. But how do you access your runs and their associated artifacts from code? Let's do that step by step.

## Get repo

First off, we load your repository: this is where all your pipelines live. 

In [9]:
from zenml.repository import Repository

repo = Repository()

## Pipelines 

This is how you get all of the pipelines within your repository. Above we reused the same pipeline twice with different trainers, so we should expect to see only one pipeline named `mnist_pipeline` here.

In [10]:
pipelines = repo.get_pipelines()
print(pipelines)

[PipelineView(id=1, name='mnist_pipeline')]


## Retrieve the pipeline

We could now just take the pipeline from above by index using `pipelines[0]`.
Alternatively, we can get a pipeline by name from our repo. If no name is specified, the pipeline name defaults to the function name.

In [11]:
mnist_pipeline = repo.get_pipeline(pipeline_name="mnist_pipeline")

## Get the runs
All runs are saved chronologically within the corresponding pipeline.

In [12]:
runs = mnist_pipeline.runs  # chronologically ordered
print(runs)

[PipelineRunView(id=2, name='standard_mnist_training_run'), PipelineRunView(id=10, name='mnist_training_run_2')]


In [13]:
# Let's first extract the first run, which used the decision tree trainer
decision_tree_mnist_run = mnist_pipeline.get_run(RUN_NAME_1)

# Now we can extract our second run, which used the SVC trainer
svc_mnist_run = mnist_pipeline.get_run(RUN_NAME_2)

## Get the steps

In [14]:
decision_tree_mnist_run.steps

[StepView(id=1, name='importer', entrypoint_name='importer'parameters={}),
 StepView(id=2, name='trainer', entrypoint_name='decision_tree_trainer'parameters={}),
 StepView(id=3, name='evaluator', entrypoint_name='evaluator'parameters={})]

In [15]:
svc_mnist_run.steps

[StepView(id=4, name='importer', entrypoint_name='importer'parameters={}),
 StepView(id=5, name='trainer', entrypoint_name='svc_trainer'parameters={}),
 StepView(id=6, name='evaluator', entrypoint_name='evaluator'parameters={})]

## Check the results of the evaluator and compare

In [16]:
decision_tree_eval_step = decision_tree_mnist_run.get_step(name='evaluator')
svc_eval_step = svc_mnist_run.get_step(name='evaluator')

In [17]:
# A single output is simply called `output`; multiple outputs are in a dict called `outputs`.
decision_tree_eval_step.output.read()

0.7619577308120133

In [18]:
svc_eval_step.output.read()

0.9688542825361512
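
With both values in hand, comparing the runs is ordinary Python. The sketch below hard-codes the two accuracies printed above so that it runs on its own. In the notebook you would use the values returned by `output.read()` instead. The `results` dictionary is just an illustrative structure.

```python
# Accuracies copied from the two evaluator outputs shown in this notebook.
results = {
    "decision_tree_mnist_training_run": 0.7619577308120133,
    "svc_mnist_training_run": 0.9688542825361512,
}

# Pick the run with the highest test accuracy.
best_run = max(results, key=results.get)
improvement = (
    results["svc_mnist_training_run"]
    - results["decision_tree_mnist_training_run"]
)
print(f"Best run: {best_run}")
print(f"Absolute accuracy gain: {improvement:.4f}")
```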

# Congratulations!

… and that's it for the quickstart. If you got here without a hiccup, you have successfully installed ZenML, set up a ZenML repo, configured a training pipeline, executed it, and evaluated the results. And this is just the tip of the iceberg of ZenML's capabilities.

However, if you hit a snag or have suggestions or questions about our framework, you can always check our [docs](https://docs.zenml.io/) or our [GitHub](https://github.com/zenml-io/zenml), or even better, join us on our [Slack channel](https://zenml.io/slack-invite).

Cheers!

For more detailed information on all the components and steps that went into this short example, please continue reading [our more detailed documentation pages](https://docs.zenml.io/).