<a href="https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/kubeflow/run.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ZenML: Create production-ready ML pipelines

The goal of this notebook is to give you some first practical experience with ZenML and a brief overview of its basic functionality. We'll create a training pipeline for the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset. We will start locally in the Jupyter notebook and then transition to a more robust environment with Kubeflow Pipelines.

If you want to run this notebook in an interactive environment, feel free to run it in a [Google Colab](https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/kubeflow/run.ipynb) or view it on [GitHub](https://github.com/zenml-io/zenml/tree/main/examples/kubeflow) directly.


## Purpose

This quickstart guide is designed to provide a practical introduction to some of the main concepts and paradigms used by the ZenML framework. If you want more detail, our [full documentation](https://docs.zenml.io/) provides more on the concepts and how to implement them.

## Using Google Colab

You will want to use a GPU for this example. If you are following this quickstart in Google's Colab, follow these steps:

- Before running anything, you need to tell Colab that you want to use a GPU. You can do this by clicking on the ‘Runtime’ tab and selecting ‘Change runtime type’. A pop-up window will open up with a drop-down menu.
- Select ‘GPU’ from the menu and click ‘Save’.
- It may ask if you want to restart the runtime. If so, go ahead and do that.

# Start developing locally

## Install libraries

In [None]:
# Install the ZenML CLI tool and the Kubeflow and scikit-learn integrations
!pip install zenml
!zenml integration install kubeflow -f
!zenml integration install sklearn -f

Once the installation is complete, you can go ahead and create your first ZenML repository for your project. As ZenML repositories are built on top of Git repositories, you can create yours in an empty directory of your choice with:

In [None]:
# Initialize a git repository
!git init

# Initialize ZenML's .zen file
!zenml init

Now, the setup is completed. For the next steps, just make sure that you are executing the code within your ZenML repository.

## Import relevant packages

We will use pipelines and steps to train our model.

In [None]:
import numpy as np
from sklearn.base import ClassifierMixin

from zenml.integrations.sklearn.helpers.digits import get_digits, get_digits_model
from zenml.pipelines import pipeline
from zenml.steps import step
from zenml.steps.step_output import Output

## Define ZenML Steps

In the code that follows, you can see that we are defining the various steps of our pipeline. Each step is decorated with `@step`, the main abstraction that is currently available for creating pipeline steps.

The first step is an `importer` step that downloads a sample of the MNIST dataset.

In [None]:
@step
def importer() -> Output(
    X_train=np.ndarray, X_test=np.ndarray, y_train=np.ndarray, y_test=np.ndarray
):
    """Loads the digits array as normal numpy arrays."""
    X_train, X_test, y_train, y_test = get_digits()
    return X_train, X_test, y_train, y_test
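
If decorators are new to you, the following toy example may help. It is only a sketch of the general pattern and not how ZenML implements `@step`: the names `toy_step` and `STEP_REGISTRY` are hypothetical, and the real decorator does considerably more (for example, it tracks inputs and outputs).

```python
from typing import Any, Callable, Dict

# Hypothetical registry for illustration; not part of ZenML.
STEP_REGISTRY: Dict[str, Callable[..., Any]] = {}


def toy_step(func: Callable[..., Any]) -> Callable[..., Any]:
    """Register `func` under its own name and return it unchanged."""
    STEP_REGISTRY[func.__name__] = func
    return func


@toy_step
def importer() -> list:
    """Stand-in for a data-loading step."""
    return [1, 2, 3]


print(sorted(STEP_REGISTRY))  # ['importer']
print(STEP_REGISTRY["importer"]())  # [1, 2, 3]
```

The point is simply that decorating a function lets a framework learn about it by name, which is what makes it possible to wire steps together later.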

Then we add a `normalizer` step that takes the training and test images as input and scales their values so that they lie between 0 and 1.

In [None]:
@step
def normalizer(
    X_train: np.ndarray, X_test: np.ndarray
) -> Output(X_train_normed=np.ndarray, X_test_normed=np.ndarray):
    """Normalize the values for all the images so they are between 0 and 1"""
    X_train_normed = X_train / 255.0
    X_test_normed = X_test / 255.0
    return X_train_normed, X_test_normed
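
To see what this scaling does to individual values, here is a minimal sketch in plain Python. It assumes 8-bit pixel intensities in the range 0 to 255, as the docstring of the step implies; the helper `normalize_pixels` and the tiny 2x2 image are made up for illustration.

```python
from typing import List


def normalize_pixels(
    image: List[List[int]], max_value: float = 255.0
) -> List[List[float]]:
    """Divide every pixel by `max_value`, mirroring `X / 255.0` in the step."""
    return [[pixel / max_value for pixel in row] for row in image]


# A tiny 2x2 "image" with assumed 8-bit intensities.
image = [[0, 51], [102, 255]]
normed = normalize_pixels(image)
print(normed)  # [[0.0, 0.2], [0.4, 1.0]]
```

The step itself does the same thing in one expression because NumPy applies the division to every element of the array at once.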

We then add a `trainer` step that takes the normalized data and trains a sklearn model on it.

In [None]:
@step
def trainer(
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> ClassifierMixin:
    """Train a simple sklearn classifier for the digits dataset."""
    model = get_digits_model()
    model.fit(X_train, y_train)
    return model

Finally, we add an `evaluator` to see how we did on the dataset!

In [None]:
@step
def evaluator(
    X_test: np.ndarray,
    y_test: np.ndarray,
    model: ClassifierMixin,
) -> float:
    """Calculate the accuracy on the test set"""
    test_acc = model.score(X_test, y_test)
    print(f"Test accuracy: {test_acc}")
    return test_acc
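
For sklearn classifiers, `model.score` returns the mean accuracy on the given data, that is, the fraction of samples whose predicted label matches the true label. As a rough sketch of that calculation in plain Python (the helper `accuracy` and the example labels are ours, not part of sklearn or ZenML):

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Three of the four predictions are right.
print(accuracy([3, 8, 1, 1], [3, 8, 7, 1]))  # 0.75
```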

## Define ZenML Pipeline

A pipeline is defined with the `@pipeline` decorator. This defines the various steps of the pipeline and specifies the dependencies between the steps, thereby determining the order in which they will be run.

In [None]:
@pipeline
def mnist_pipeline(
    importer,
    normalizer,
    trainer,
    evaluator,
):
    # Link all the steps together
    X_train, X_test, y_train, y_test = importer()
    X_trained_normed, X_test_normed = normalizer(X_train=X_train, X_test=X_test)
    model = trainer(X_train=X_trained_normed, y_train=y_train)
    evaluator(X_test=X_test_normed, y_test=y_test, model=model)
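
To make the idea of "dependencies determine the order" concrete, here is a small sketch that derives a valid execution order from the data flow in `mnist_pipeline`. It uses `graphlib` from the Python standard library (Python 3.9+) and is only an illustration; we are not claiming that ZenML or Kubeflow computes the order this way. The `dependencies` dictionary is written by hand from the calls in the pipeline above.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it consumes,
# read off the calls inside `mnist_pipeline`.
dependencies = {
    "importer": set(),
    "normalizer": {"importer"},
    "trainer": {"normalizer", "importer"},
    "evaluator": {"normalizer", "importer", "trainer"},
}

# A topological sort places every step after all the steps it depends on.
order = list(TopologicalSorter(dependencies).static_order())
print(order)  # ['importer', 'normalizer', 'trainer', 'evaluator']
```

Because each step here needs the output of the one before it, only one order is possible. In a pipeline with independent branches, several orders would be valid.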

## Run the pipeline

Running the pipeline is as simple as calling the `run()` method on an instance of the defined pipeline.

In [None]:
# Initialise the pipeline
first_pipeline = mnist_pipeline(
    importer=importer(),
    normalizer=normalizer(),
    trainer=trainer(),
    evaluator=evaluator(),
)

first_pipeline.run()

# Transitioning to Kubeflow Pipelines

We got pretty good results on the MNIST model that we trained, but maybe we want to see how a similar training pipeline would work on a different dataset.

You can see how easy it is to switch out one data import step and processing for another in our pipeline.

## Pre-requisites

In order to run this example, you need to have installed:

* Docker
* [K3D](https://k3d.io/v5.2.1/)
* Kubectl

## Define requirements

In [6]:
%%writefile requirements.txt
scikit-learn
pandas
numpy

Overwriting requirements.txt


In [4]:
import os

requirements_file = os.path.join(os.path.abspath(''), "requirements.txt")
with open(requirements_file, 'r') as f:
    c = f.read()
    
c.split()

['scikit-learn', 'pandas', 'numpy']
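
Splitting on whitespace works for the simple file above, but it would break up a line such as `numpy >= 1.20` and would not skip comments. If your requirements file grows, a line-based reader is a safer way to inspect it. The following is a minimal sketch; the helper `read_requirements` is hypothetical, and it runs against a throwaway file so that the real `requirements.txt` stays untouched.

```python
import tempfile
from pathlib import Path
from typing import List


def read_requirements(path: Path) -> List[str]:
    """Return requirement lines, skipping blank lines and `#` comments."""
    requirements = []
    for line in path.read_text().splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            requirements.append(stripped)
    return requirements


# Demonstrate on a temporary file with a comment and a blank line.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "requirements.txt"
    demo_file.write_text("# core deps\nscikit-learn\n\npandas\nnumpy\n")
    parsed = read_requirements(demo_file)

print(parsed)  # ['scikit-learn', 'pandas', 'numpy']
```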

## Create a Kubeflow Stack

In [None]:
!zenml container-registry register local_registry localhost:5000

In [None]:
!zenml orchestrator register kubeflow_orchestrator kubeflow

In [None]:
!zenml stack register local_kubeflow_stack -m local_metadata_store -a local_artifact_store -o kubeflow_orchestrator -c local_registry

In [None]:
!zenml stack set local_kubeflow_stack

## Let's spin the stack up

In [None]:
!zenml stack up

## Write the pipeline to disk

In [None]:
%%writefile run.py
#  Copyright (c) ZenML GmbH 2021. All Rights Reserved.
#
#  Licensed under the Apache License, Version 2.0 (the "License");
#  you may not use this file except in compliance with the License.
#  You may obtain a copy of the License at:
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
#  Unless required by applicable law or agreed to in writing, software
#  distributed under the License is distributed on an "AS IS" BASIS,
#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
#  or implied. See the License for the specific language governing
#  permissions and limitations under the License.

import os

import numpy as np
import pandas as pd
import requests
from sklearn.base import ClassifierMixin
from sklearn.linear_model import LogisticRegression

from zenml.pipelines import pipeline
from zenml.steps import step
from zenml.steps.base_step_config import BaseStepConfig
from zenml.steps.step_output import Output

# Path to a pip requirements file that contains requirements necessary to run
# the pipeline
requirements_file = os.path.join(os.path.dirname(__file__), "requirements.txt")


class ImporterConfig(BaseStepConfig):
    n_days: int = 1


def get_X_y_from_api(n_days: int = 1, is_train: bool = True):
    url = (
        "https://storage.googleapis.com/zenml-public-bucket/mnist"
        "/mnist_handwritten_train.json"
        if is_train
        else "https://storage.googleapis.com/zenml-public-bucket/mnist"
        "/mnist_handwritten_test.json"
    )
    df = pd.DataFrame(requests.get(url).json())
    X = df["image"].map(lambda x: np.array(x)).values
    X = np.array([x.reshape(28, 28) for x in X])
    y = df["label"].map(lambda y: np.array(y)).values
    return X, y


@step
def importer(
    config: ImporterConfig,
) -> Output(
    X_train=np.ndarray, y_train=np.ndarray, X_test=np.ndarray, y_test=np.ndarray
):
    """Downloads the latest data from a mock API."""
    X_train, y_train = get_X_y_from_api(n_days=config.n_days, is_train=True)
    X_test, y_test = get_X_y_from_api(n_days=config.n_days, is_train=False)
    return X_train, y_train, X_test, y_test


@step
def normalizer(
    X_train: np.ndarray, X_test: np.ndarray
) -> Output(X_train_normed=np.ndarray, X_test_normed=np.ndarray):
    """Normalize the values for all the images so they are between 0 and 1"""
    X_train_normed = X_train / 255.0
    X_test_normed = X_test / 255.0
    return X_train_normed, X_test_normed


@step
def trainer(
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> ClassifierMixin:
    """Train a logistic regression classifier from sklearn."""
    clf = LogisticRegression(penalty="l1", solver="saga", tol=0.1)
    clf.fit(X_train.reshape((X_train.shape[0], -1)), y_train)
    return clf


@step
def evaluator(
    X_test: np.ndarray,
    y_test: np.ndarray,
    model: ClassifierMixin,
) -> float:
    """Calculate accuracy score with classifier."""
    test_acc = model.score(X_test.reshape((X_test.shape[0], -1)), y_test)
    return test_acc


@pipeline(requirements_file=requirements_file)
def mnistpipeline(
    importer,
    normalizer,
    trainer,
    evaluator,
):
    # Link all the steps together
    X_train, y_train, X_test, y_test = importer()
    X_trained_normed, X_test_normed = normalizer(X_train=X_train, X_test=X_test)
    model = trainer(X_train=X_trained_normed, y_train=y_train)
    evaluator(X_test=X_test_normed, y_test=y_test, model=model)


if __name__ == "__main__":
    # Run the pipeline
    p = mnistpipeline(
        importer=importer(),
        normalizer=normalizer(),
        trainer=trainer(),
        evaluator=evaluator(),
    )
    p.run()


In [None]:
# Initialise a new pipeline
!python run.py
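
One detail in `run.py` that is easy to miss: the `trainer` and `evaluator` steps call `reshape((n, -1))` before handing the data to sklearn, which turns each 28x28 image into a single row of 784 values. Here is a plain-Python sketch of the same row-by-row flattening, using tiny 2x2 stand-in images; the helper `flatten_images` is ours and only illustrates the idea.

```python
from typing import List


def flatten_images(images: List[List[List[float]]]) -> List[List[float]]:
    """Turn each 2-D image into a single flat row of pixels, row by row."""
    return [[pixel for row in image for pixel in row] for image in images]


# Two tiny 2x2 "images" stand in for the 28x28 MNIST arrays.
images = [
    [[0.0, 0.1], [0.2, 0.3]],
    [[0.4, 0.5], [0.6, 0.7]],
]
flat = flatten_images(images)
print(len(flat), len(flat[0]))  # 2 4
print(flat[0])  # [0.0, 0.1, 0.2, 0.3]
```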

# Post execution workflow

In [None]:
from zenml.core.repo import Repository

## Get repo

In [None]:
repo = Repository()

## Pipelines 

In [None]:
pipelines = repo.get_pipelines()

## Retrieve the pipeline

In [None]:
mnist_pipeline = pipelines[0]

## Get the first run

In [None]:
runs = mnist_pipeline.runs  # chronologically ordered
mnist_run = runs[0]

## Get the second run

In [None]:
fashion_mnist_run = runs[1]

## Get the steps (note the first step name is different)

In [None]:
mnist_run.steps

In [None]:
fashion_mnist_run.steps

## Check the results of the evaluator and compare

In [None]:
mnist_eval_step = mnist_run.get_step(name='evaluator')
fashion_mnist_eval_step = fashion_mnist_run.get_step(name='evaluator')

In [None]:
# One output is simply called `output`, multiple is a dict called `outputs`.
mnist_eval_step.output.read()

In [None]:
fashion_mnist_eval_step.output.read()

# Congratulations!

… and that's it for the quickstart. If you got here without a hiccup, you have successfully installed ZenML, set up a ZenML repo, configured a training pipeline, executed it and evaluated the results. And this is just the tip of the iceberg of what ZenML can do.

However, if you did hit a snag, or you have suggestions or questions about our framework, you can always check our [docs](https://docs.zenml.io/) or our [GitHub](https://github.com/zenml-io/zenml), or, even better, join us on our [Slack channel](https://zenml.io/slack-invite).

Cheers!

For more detailed information on all the components and steps that went into this short example, please continue reading [our more detailed documentation pages](https://docs.zenml.io/).