# ZenML Quickstart Guide

- introduction to what the quickstart is about
- what will be covered here / what we'll do

## Intro

- what is zenml
- diagram showing the quickstart workflow etc

# Installation

things that need installing

In [None]:
#TODO: add things relating to cloudflare pipelines etc and zenml installation

In [1]:
!zenml init

[?25l[2;36mFound existing ZenML repository at path [0m
[2;32m'/Users/strickvl/coding/zenml/repos/zenml/examples/quickstart/new_quickstart'[0m[2;36m.[0m
[2;32m⠋[0m[2;36m [0m[2;36mInitializing ZenML repository at [0m
[2;36m/Users/strickvl/coding/zenml/repos/zenml/examples/quickstart/new_quickstart.[0m
[2K[1A[2K[1A[2K[32m⠋[0m Initializing ZenML repository at 
/Users/strickvl/coding/zenml/repos/zenml/examples/quickstart/new_quickstart.

[?25h[1A[2K[1A[2K[1A[2K

In [2]:
# !zenml integration install sklearn mlflow -y


# # automatically restart kernel
# import IPython
# IPython.Application.instance().kernel.do_shutdown(restart=True)

## Setup for Google Colab

## Register our local stack

Register a local stack that can run the code in this notebook.

- go through the different parts of it
- also diagrams
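As a rough map of what the stack registration wires together, the sketch below pairs each short flag of `zenml stack register` with the component type it sets, then rebuilds the command as a string. The dictionary and helper names are illustrative only and are not part of ZenML; the component names are the ones used in this notebook.

```python
import shlex

# Short flags of `zenml stack register` and the component type each one sets.
STACK_FLAGS = {
    "-a": "artifact store",
    "-o": "orchestrator",
    "-d": "model deployer",
    "-e": "experiment tracker",
    "-r": "model registry",
}

# Component names used for the quickstart stack in this notebook.
quickstart_components = {
    "-a": "default",
    "-o": "default",
    "-d": "mlflow_deployer",
    "-e": "mlflow_tracker",
    "-r": "mlflow_registry",
}


def build_register_command(stack_name: str, components: dict) -> str:
    """Assemble the CLI call as one shell-safe string (illustrative helper)."""
    args = ["zenml", "stack", "register", stack_name]
    for flag, component_name in components.items():
        args += [flag, component_name]
    return shlex.join(args)


for flag, component_name in quickstart_components.items():
    print(f"{flag}: {STACK_FLAGS[flag]} -> {component_name}")
print(build_register_command("quickstart", quickstart_components))
```

The artifact store and orchestrator reuse the `default` components, so only the three MLflow components need to be registered beforehand.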

In [3]:
# Register the MLflow experiment tracker
!zenml experiment-tracker register mlflow_tracker --flavor=mlflow

# Register the MLflow model registry
!zenml model-registry register mlflow_registry --flavor=mlflow

# Register the MLflow model deployer
!zenml model-deployer register mlflow_deployer --flavor=mlflow

# Register a new stack with the new stack components
!zenml stack register quickstart -a default \
                                 -o default \
                                 -d mlflow_deployer \
                                 -e mlflow_tracker \
                                 -r mlflow_registry

!zenml stack set quickstart

The current repo active workspace is no longer available. Resetting the active workspace to 'default'.
The current repo active stack is no longer available. Resetting the active stack to default.
Connected to the ZenML server: 'http://127.0.0.1:8237'
Running with active workspace: 'default' (repository)
Running with active stack: 'default' (repository)
Successfully registered experiment_tracker `mlflow_tracker`.
Connected to the ZenML server: 'http://127.0.0.1:8237'
Running with active workspace: 'default'

# Explain the example / what we're doing

- the dataset we're using
- how this is part of a common workflow
- trying some things out / train some baseline models

In [23]:
import pandas as pd
from sklearn.model_selection import train_test_split

from zenml import step
from zenml.steps import Output


@step(enable_cache=False)
def training_data_loader() -> (
    Output(
        X_train=pd.DataFrame,
        X_test=pd.DataFrame,
        y_train=pd.Series,
        y_test=pd.Series,
    )
):
    """Load the Census Income dataset as tuple of Pandas DataFrame / Series."""
    # Load the dataset
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
    column_names = [
        "age",
        "workclass",
        "fnlwgt",
        "education",
        "education-num",
        "marital-status",
        "occupation",
        "relationship",
        "race",
        "sex",
        "capital-gain",
        "capital-loss",
        "hours-per-week",
        "native-country",
        "income",
    ]
    data = pd.read_csv(
        url, names=column_names, na_values="?", skipinitialspace=True
    )

    # Drop rows with missing values
    data = data.dropna()

    # Encode categorical features and drop original columns
    categorical_cols = [
        "workclass",
        "education",
        "marital-status",
        "occupation",
        "relationship",
        "race",
        "sex",
        "native-country",
    ]
    data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)

    # Encode target feature
    data["income"] = data["income"].apply(
        lambda x: 1 if x.strip() == ">50K" else 0
    )

    # Separate features and target
    X = data.drop("income", axis=1)
    y = data["income"]

    # Split the dataset into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    return X_train, X_test, y_train, y_test
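To make the loader's cleaning logic easier to follow without downloading anything, here is a standard-library sketch of two of its steps: dropping rows with missing values and encoding the target. The three sample rows are made up and cut down to four columns; they only imitate the comma-plus-space layout of `adult.data`.

```python
import csv
import io

# Three made-up rows in the same comma-plus-space layout as adult.data,
# cut down to four columns for brevity (hypothetical sample, not real data).
RAW = """39, State-gov, Bachelors, <=50K
52, ?, HS-grad, >50K
31, Private, Masters, >50K
"""
COLUMNS = ["age", "workclass", "education", "income"]

# skipinitialspace=True drops the blank that follows each comma.
reader = csv.reader(io.StringIO(RAW), skipinitialspace=True)
rows = [dict(zip(COLUMNS, record)) for record in reader if record]

# Treat "?" as missing and drop any row that contains it (like dropna()).
complete = [row for row in rows if "?" not in row.values()]

# Encode the target: 1 for ">50K", 0 otherwise.
for row in complete:
    row["income"] = 1 if row["income"].strip() == ">50K" else 0

print(complete)
```

The second row is discarded because its `workclass` is missing, which mirrors what `na_values="?"` followed by `dropna()` does in the step.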

To try the step out, we can call it directly as a plain Python function, independently of any ZenML pipeline.

In [24]:
X_train, X_test, y_train, y_test = training_data_loader()

In [25]:
X_train

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass_Local-gov,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
19863,53,168539,5,0,0,70,False,False,False,True,...,False,False,False,False,False,False,False,True,False,False
24342,49,56841,13,0,0,70,False,False,False,True,...,False,False,False,False,False,False,False,True,False,False
10027,28,154571,10,0,0,40,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
25710,60,188236,6,0,0,40,False,True,False,False,...,False,False,False,False,False,False,False,True,False,False
13824,53,87158,9,0,0,40,False,True,False,False,...,False,False,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32171,40,67852,9,0,0,35,False,True,False,False,...,False,False,False,False,False,False,False,True,False,False
5875,41,120539,10,3103,0,40,False,False,False,True,...,False,False,False,False,False,False,False,True,False,False
935,37,176900,9,0,0,99,False,True,False,False,...,False,False,False,False,False,False,False,True,False,False
17056,56,51662,7,0,0,40,False,False,False,True,...,False,False,False,False,False,False,False,True,False,False
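The boolean columns in this output, such as `workclass_Local-gov`, come from the one-hot encoding in the loader. The sketch below imitates the column naming of `pd.get_dummies(..., drop_first=True)` in plain Python, assuming categories are ordered alphabetically and the first one is dropped. The helper name and sample labels are illustrative only.

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode category labels, skipping the first category.

    Categories are sorted, the first is dropped, and the rest become
    "<prefix>_<category>" boolean columns.
    """
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category is implied by all-False
    return [
        {f"{prefix}_{category}": value == category for category in kept}
        for value in values
    ]


workclass = ["Private", "Local-gov", "Federal-gov", "Private"]
for row in one_hot_drop_first(workclass, "workclass"):
    print(row)
```

A row whose columns are all `False` belongs to the dropped category, here `Federal-gov`.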


Next, we train two models:

- SGD Classifier
- Random Forest Classifier

We use MLflow to track the hyperparameters and metrics of both.

In [26]:
import mlflow

from sklearn.base import ClassifierMixin
from sklearn.ensemble import RandomForestClassifier

from zenml.client import Client

experiment_tracker = Client().active_stack.experiment_tracker


@step(enable_cache=False, experiment_tracker=experiment_tracker.name)
def random_forest_trainer_mlflow(
    X_train: pd.DataFrame,
    y_train: pd.Series,
) -> ClassifierMixin:
    """Train a sklearn Random Forest classifier and log to MLflow."""
    mlflow.sklearn.autolog()  # log all model hparams and metrics to MLflow
    model = RandomForestClassifier()
    model.fit(X_train.to_numpy(), y_train.to_numpy())
    train_acc = model.score(X_train.to_numpy(), y_train.to_numpy())
    print(f"Train accuracy: {train_acc}")
    return model

from sklearn.linear_model import SGDClassifier


@step(enable_cache=False, experiment_tracker=experiment_tracker.name)
def sgd_trainer_mlflow(
    X_train: pd.DataFrame,
    y_train: pd.Series,
) -> ClassifierMixin:
    """Train a SGD classifier and log to MLflow."""
    mlflow.sklearn.autolog()  # log all model hparams and metrics to MLflow
    model = SGDClassifier()
    model.fit(X_train.to_numpy(), y_train.to_numpy())
    train_acc = model.score(X_train.to_numpy(), y_train.to_numpy())
    print(f"Train accuracy: {train_acc}")
    return model

Reloading configuration file /Users/strickvl/coding/zenml/repos/zenml/examples/quickstart/new_quickstart/.zen/config.yaml


Now we add an evaluator step that returns the better-performing of the two models.

In [27]:
@step
def evaluator(
    X_test: pd.DataFrame,
    y_test: pd.Series,
    model1: ClassifierMixin,
    model2: ClassifierMixin,
) -> ClassifierMixin:
    """Calculate the accuracy on the test set and return the best model of two."""
    test_acc1 = model1.score(X_test.to_numpy(), y_test.to_numpy())
    test_acc2 = model2.score(X_test.to_numpy(), y_test.to_numpy())
    print(f"Test accuracy ({model1.__class__.__name__}): {test_acc1}")
    print(f"Test accuracy ({model2.__class__.__name__}): {test_acc2}")
    return model1 if test_acc1 > test_acc2 else model2
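The selection rule in the evaluator can be tried in isolation with stand-in models. This is a minimal sketch: `FixedScoreModel` and `pick_best` are hypothetical names, and the accuracies are placeholders rather than results from this notebook.

```python
class FixedScoreModel:
    """Stand-in for a fitted classifier; `score` returns a preset accuracy."""

    def __init__(self, name, accuracy):
        self.name = name
        self.accuracy = accuracy

    def score(self, X, y):
        return self.accuracy


def pick_best(model1, model2, X_test, y_test):
    """Same comparison as the evaluator step: model2 wins ties."""
    acc1 = model1.score(X_test, y_test)
    acc2 = model2.score(X_test, y_test)
    return model1 if acc1 > acc2 else model2


forest = FixedScoreModel("forest", 0.85)
sgd = FixedScoreModel("sgd", 0.76)
print(pick_best(forest, sgd, None, None).name)  # forest
```

Because the comparison uses a strict `>`, two models with identical test accuracy resolve to the second argument.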

Define a step that registers the best model in our model registry.

In [28]:
from zenml.integrations.mlflow.steps.mlflow_registry import (
    mlflow_register_model_step,
)

model_name = "zenml-quickstart-model"

register_model = mlflow_register_model_step.with_options(
    parameters=dict(
        name=model_name,
        description="The first run of the Quickstart pipeline.",
    )
)

Now we can define the pipeline itself

- explain a bit about pipelines

In [29]:
from zenml import pipeline

@pipeline(enable_cache=False)
def train_and_register_model_pipeline() -> None:
    """Train a model."""
    register_model.after(evaluator)
    
    X_train, X_test, y_train, y_test = training_data_loader()
    model1 = random_forest_trainer_mlflow(X_train=X_train, y_train=y_train)
    model2 = sgd_trainer_mlflow(X_train=X_train, y_train=y_train)
    best_model = evaluator(
        X_test=X_test, y_test=y_test, model1=model1, model2=model2
    )
    register_model(best_model)
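One way to think about the pipeline above is as a directed acyclic graph: each step can only run once the steps that produce its inputs have finished. The sketch below derives a valid execution order from those data dependencies using the standard library. It illustrates the idea only and is not how ZenML schedules steps internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps whose outputs it consumes.
dependencies = {
    "training_data_loader": set(),
    "random_forest_trainer_mlflow": {"training_data_loader"},
    "sgd_trainer_mlflow": {"training_data_loader"},
    "evaluator": {
        "training_data_loader",
        "random_forest_trainer_mlflow",
        "sgd_trainer_mlflow",
    },
    "mlflow_register_model_step": {"evaluator"},
}

execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

The two trainers depend only on the data loader, so their relative order is not fixed by the graph; everything else has a single valid position.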

Run the pipeline

In [30]:
train_and_register_model_pipeline()

Registered pipeline train_and_register_model_pipeline (version 2).
Running pipeline train_and_register_model_pipeline on stack quickstart (caching disabled)
Step training_data_loader has started.
Step training_data_loader has finished in 3.647s.
Step random_forest_trainer_mlflow has started.
Train accuracy: 0.9999585560943264
Step random_forest_trainer_mlflow has finished in 12.816s.
Step sgd_trainer_mlflow has started.
Train accuracy: 0.7674582452650338
Step sgd_trainer_mlflow has finished in 6.984s.
Step evaluator has started.
Test accuracy (RandomForestClassifier): 0.8509862423338306
Test accuracy (SGDClassifier): 0.7626388198242997
Step evaluator has finished in 0.592s.
Step mlflow_register_model_step has started.
MLflow model registry does not take a version as an argument. Registering a new version for the model 'zenml-quickstart-model' a version will be assigned automatically.
2023/06/23 14:46:55 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: zenml-quickstart-model, version 2
Registered model zenml-quickstart-model with version 2 from source file:///Users/strickvl/Library/Application Support/zenml/local_stores/35fcee21-53e6-4de0-a49f-d238ea1d5040/mlruns/729179721969064096/3dc28d417f0948a393f4aad2d77a9760/artifacts/model.
Step mlflow_register_model_step has finished in 0.704s.
Pipeline run train_and_register_model_pipeline-2023_06_23-12_46_28_401417 has finished in 27.272s.
Dashboard URL: http://127.0.0.1:8237/workspaces/default/pipelines/15e1a708-bad2-43b9-823b-b8403a752a40/runs


TODO: maybe show the ZenML dashboard here

TODO: maybe also show the MLflow UI and how to access it here

In [14]:
from zenml.integrations.mlflow.mlflow_utils import get_tracking_uri

get_tracking_uri()

'file:/Users/strickvl/Library/Application Support/zenml/local_stores/35fcee21-53e6-4de0-a49f-d238ea1d5040/mlruns'

In [15]:
# !mlflow ui --backend-store-uri 'file:/Users/strickvl/Library/Application Support/zenml/local_stores/35fcee21-53e6-4de0-a49f-d238ea1d5040/mlruns'
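The tracking URI contains a space (in `Application Support`), which is why the commented-out command wraps it in quotes. A small sketch of building that command safely from Python, using the URI printed above:

```python
import shlex

# Tracking URI as returned by get_tracking_uri() above; note the space
# in "Application Support".
tracking_uri = (
    "file:/Users/strickvl/Library/Application Support/zenml/local_stores/"
    "35fcee21-53e6-4de0-a49f-d238ea1d5040/mlruns"
)

# shlex.quote wraps the value in single quotes only when the shell needs it.
command = f"mlflow ui --backend-store-uri {shlex.quote(tracking_uri)}"
print(command)
```

On another machine the URI will differ, so in practice the value would come from `get_tracking_uri()` rather than a hard-coded string.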

Talk about the pipeline output

Now that we've trained our models and found the best one, we want to deploy it and run inference against the deployed model.

In [45]:
from zenml.integrations.mlflow.steps.mlflow_deployer import mlflow_model_registry_deployer_step
from zenml.integrations.mlflow.steps.mlflow_registry import mlflow_register_model_step
from zenml.model_registries.base_model_registry import ModelRegistryModelMetadata

model_deployer = mlflow_model_registry_deployer_step.with_options(
    parameters=dict(
        registry_model_name=model_name,
        registry_model_version=2,
    )
)

Something about services + why we're doing it that way

In [46]:
from zenml.services import BaseService
from zenml.client import Client


@step(enable_cache=False)
def prediction_service_loader() -> BaseService:
    """Load the model service of our train_and_register_model_pipeline."""
    client = Client()
    model_deployer = client.active_stack.model_deployer
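    services = model_deployer.find_model_server(
        pipeline_name="train_and_register_model_pipeline",
        running=True,
    )
    # Fail with a clear message instead of an IndexError if nothing is running
    if not services:
        raise RuntimeError(
            "No running model server found for "
            "'train_and_register_model_pipeline'."
        )
    return services[0]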

@step
def predictor(
    service: BaseService,
    data: pd.DataFrame,
) -> Output(predictions=list):
    """Run a inference request against a prediction service"""
    service.start(timeout=10)  # should be a NOP if already started
    prediction = service.predict(data.to_numpy())
    prediction = prediction.argmax(axis=-1)
    print(f"Prediction is: {[prediction.tolist()]}")
    return [prediction.tolist()]
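The deployed service is an HTTP server, so a prediction is ultimately a POST request to its `/invocations` endpoint. The sketch below builds such a request without sending it. It assumes the server accepts a JSON body of the form `{"instances": [...]}`; the exact payload format depends on the MLflow version, so treat this as an assumption to verify. The URL is the one reported in this notebook's deployment log, and the feature rows are shortened placeholders, not full inputs.

```python
import json
import urllib.request

# Endpoint printed by the deployer step in this notebook's run.
PREDICTION_URL = "http://127.0.0.1:8003/invocations"


def build_prediction_request(rows):
    """Build (but do not send) a JSON POST request for the scoring endpoint.

    Assumption: the server accepts a body of the form {"instances": [...]},
    with one inner list per row of features.
    """
    body = json.dumps({"instances": rows}).encode("utf-8")
    return urllib.request.Request(
        PREDICTION_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder rows with three values each; a real request needs every feature.
request = build_prediction_request([[53, 168539, 5], [49, 56841, 13]])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request with `urllib.request.urlopen(request)` only works while the service is running, which is why the step calls `service.start(...)` first.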

Explain our new pipeline

In [55]:
@pipeline
def register_and_deploy_model() -> None:
    """Print the name of the model."""
    prediction_service_loader.after(model_deployer)
    predictor.after(prediction_service_loader)
    model_deployer()
    _, inference_data, _, _ = training_data_loader()
    model_deployment_service = prediction_service_loader()
    predictor(service=model_deployment_service, data=inference_data)

In [56]:
register_and_deploy_model()

Reloading configuration file /Users/strickvl/coding/zenml/repos/zenml/examples/quickstart/new_quickstart/.zen/config.yaml
Registered pipeline register_and_deploy_model (version 4).
Running pipeline register_and_deploy_model on stack quickstart (caching enabled)
Step mlflow_model_registry_deployer_step has started.
Updating an existing MLflow deployment service: MLFlowDeploymentService[ac19964c-358a-4aeb-84d3-ba6f27194822] (type: model-serving, flavor: mlflow)
MLflow deployment service started and reachable at:
    http://127.0.0.1:8003/invocations
Step mlflow_model_registry_deployer_step has finished in 13.104s.
Step training_data_loader has started.
Step training_data_loader has finished in 3.301s.
Step prediction_service_loader has started.
Step prediction_service_loader has finished in 0.189s.
Step predictor has started.
Prediction is: [75]
Step predictor has finished in 0.305s.
Pipeline run register_and_deploy_model-2023_06_23-14_36_51_918325 has finished in 18.196s.
Dashboard URL: http://127.0.0.1:8237/workspaces/default/pipelines/280971aa-45b8-4f70-90ea-3c9b9fb23aa1/runs


In [49]:
!zenml model-registry models list

Connected to the ZenML server: 'http://127.0.0.1:8237'
Running with active workspace: 'default' (repository)
Running with active stack: 'quickstart' (repository)
┏━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━┯━━━━━━━━━━┓
┃          NAME          │ DESCRIPTION │ METADATA ┃
┠────────────────────────┼─────────────┼──────────┨
┃ zenml-quickstart-model │             │          ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━┷━━━━━━━━━━┛


In [50]:
!zenml model-registry models list-versions zenml-quickstart-model

Connected to the ZenML server: 'http://127.0.0.1:8237'
Running with active workspace: 'default' (repository)
Running with active stack: 'quickstart' (repository)
┏━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━┓
┃                    │               │ VERSION_DESCRIPTIO │                    ┃
┃        NAME        │ MODEL_VERSION │ N                  │ METADATA           ┃
┠────────────────────┼───────────────┼────────────────────┼────────────────────┨
┃ zenml-quickstart-m │ 2             │ The first run of   │ {'zenml_version':  ┃
┃        odel        │               │ the Quickstart     │ '0.40.3',          ┃
┃                    │               │ pipelin

In [51]:
!zenml model-deployer models list

Connected to the ZenML server: 'http://127.0.0.1:8237'
Running with active workspace: 'default' (repository)
Running with active stack: 'quickstart' (repository)
┏━━━━━━━━┯━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━┓
┃        │                  │                  │ PIPELINE_STEP_NA │            ┃
┃ STATUS │ UUID             │ PIPELINE_NAME    │ ME               │ MODEL_NAME ┃
┠────────┼──────────────────┼──────────────────┼──────────────────┼────────────┨
┃   ✅   │ ac19964c-358a-4a │ train_and_regist │                  │ model      ┃
┃        │ eb-84d3-ba6f2719 │ er_model_pipelin │                  │            ┃
┃        │ 4822

In [54]:
!zenml model-deployer models describe "ac19964c-358a-4aeb-84d3-ba6f27194822"

Connected to the ZenML server: 'http://127.0.0.1:8237'
Running with active workspace: 'default' (repository)
Running with active stack: 'quickstart' (repository)
        Properties of Served Model ac19964c-358a-4aeb-84d3-ba6f27194822         
┏━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ MODEL SERVICE PROPERTY │ VALUE                                               ┃
┠────────────────────────┼─────────────────────────────────────────────────────┨
┃ DAEMON_PID             │ 76499                                               ┃
┠────────────────────────┼─────────────────────────────────────────────────────┨
┃ MODEL_NAME             │ model                                               ┃
┠────────────────────────┼───────────────────────────