# Intro to MLOps using ZenML

## 🌍 Overview

This repository is a minimalistic MLOps project intended as a starting point to learn how to put ML workflows in production. It features: 

- A feature engineering pipeline that loads data and prepares it for training.
- A training pipeline that loads the preprocessed dataset and trains a model.
- A batch inference pipeline that runs predictions on the trained model with new data.

Follow along with this notebook to understand how you can use ZenML to productionize your ML workflows!

<img src="assets/pipelines_overview.png" alt="Pipelines Overview">

## Run on Colab

You can use Google Colab to see ZenML in action, no signup / installation
required!

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](
https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/quickstart/run.ipynb)

# 👶 Step 0. Install Requirements

Let's install ZenML to get started. First we'll install the latest version of
ZenML, followed by its `sklearn` and `mlflow` integrations:

In [1]:
!pip install "zenml[server]"



In [2]:
from zenml.environment import Environment

if Environment.in_google_colab():
    # Install Cloudflare Tunnel binary
    !wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb && dpkg -i cloudflared-linux-amd64.deb


In [3]:
!zenml integration install sklearn mlflow -y

import IPython
IPython.Application.instance().kernel.do_shutdown(restart=True)

Installing integrations...
Collecting mlflow<=2.6.0,>=2.1.1
  Using cached mlflow-2.6.0-py3-none-any.whl.metadata (12 kB)
Collecting mlserver>=1.3.3
  Using cached mlserver-1.3.5-py3-none-any.whl.metadata (6.3 kB)
Collecting mlserver-mlflow>=1.3.3
  Using cached mlserver_mlflow-1.3.5-py3-none-any.whl.metadata (1.2 kB)
Collecting databricks-cli<1,>=0.8.7 (from mlflow<=2.6.0,>=2.1.1)
  Using cached databricks_cli-0.18.0-py2.py3-none-any.whl.metadata (4.0 kB)
Collecting entrypoints<1 (from mlflow<=2.6.0,>=2.1.1)
  Using cached entrypoints-0.4-py3-none-any.whl (5.3 kB)
Collecting protobuf<5,>=3.12.0 (from mlflow<=2.6.0,>=2.1.1)
  Downloading protobuf-4.25.1-cp37-abi3-manylinux2014_x86_64.whl.metadata (541 bytes)
Collecting sqlparse<1,

{'status': 'ok', 'restart': True}


Please wait for the installation to complete before running subsequent cells. At
the end of the installation, the notebook kernel will automatically restart.

Optional: If you are using ZenML Cloud, execute the following cell with your tenant URL. Otherwise ignore.

In [None]:
zenml_server_url = "PLEASE_UPDATE_ME"  # in the form "https://URL_TO_SERVER"

!zenml connect --url $zenml_server_url

In [1]:
# Initialize ZenML and set the default stack
!zenml init

!zenml stack set default

Found existing ZenML repository at path
'/home/htahir1/workspace/zenml_io/template-starter/template'.
Initializing ZenML repository at
/home/htahir1/workspace/zenml_io/template-starter/template.
Active repository stack set to: 'default'

In [19]:
# Do the imports at the top
import os
import random
from typing import List, Optional
from uuid import UUID

from zenml import ExternalArtifact, ModelVersion, pipeline
from zenml.client import Client
from zenml.logger import get_logger

from steps import (
    data_loader,
    data_preprocessor,
    data_splitter,
    model_evaluator,
    model_trainer,
    inference_predict,
    inference_preprocessor
)

logger = get_logger(__name__)

client = Client()

# 🥇 Step 1: Load your data and execute feature engineering

We'll start off by importing our data. In this quickstart we'll be working with
[the Breast Cancer](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic) dataset,
which is publicly available on the UCI Machine Learning Repository. The task is a binary classification
problem: predicting whether a breast tumor is malignant or benign.

When you're getting started with a machine learning problem you'll want to do
something similar to this: import your data and get it in the right shape for
your training. ZenML mostly gets out of your way when you're writing your Python
code, as you'll see from the following cell.

In [3]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from typing_extensions import Annotated
from zenml import step
from zenml.logger import get_logger

logger = get_logger(__name__)


@step
def data_loader_simplified(
    random_state: int, is_inference: bool = False, target: str = "target"
) -> Annotated[pd.DataFrame, "dataset"]:  # We name the dataset 
    """Dataset reader step."""
    dataset = load_breast_cancer(as_frame=True)
    inference_size = int(len(dataset.target) * 0.05)
    dataset: pd.DataFrame = dataset.frame
    inference_subset = dataset.sample(inference_size, random_state=random_state)
    if is_inference:
        dataset = inference_subset
        dataset.drop(columns=target, inplace=True)
    else:
        dataset.drop(inference_subset.index, inplace=True)
    dataset.reset_index(drop=True, inplace=True)
    logger.info(f"Dataset with {len(dataset)} records loaded!")
    return dataset


The whole function is decorated with the `@step` decorator, which
tells ZenML to track this function as a step in the pipeline. This means that
ZenML will automatically version, track, and cache the data that is produced by
this function as an `artifact`. This is a very powerful feature, as it means that you can
reproduce your data at any point in the future, even if the original data source
changes or disappears. 

Note the use of the `Annotated` type hint (imported here from `typing_extensions`) in the output of the
step. We're using this to give a name to the output of the step, which will make
it possible to access it via a keyword later on.

You'll also notice that we have included type hints for the inputs and outputs
of the function. These are not only useful for anyone reading your code, but
help ZenML process your data in a way appropriate to the specific data types.
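
To make the naming mechanism less magical, here is a minimal sketch of how any framework could read a name out of an `Annotated` return type using only the standard library (Python 3.9+). This is not ZenML's actual implementation; the helper `output_name`, the toy function `load_numbers`, and the fallback name `"output"` are assumptions made for illustration.

```python
from typing import Annotated, get_args, get_origin, get_type_hints


def load_numbers() -> Annotated[list, "dataset"]:
    """Toy function whose single output is named 'dataset'."""
    return [1, 2, 3]


def output_name(func) -> str:
    """Return the name attached to the return annotation (hypothetical helper)."""
    hints = get_type_hints(func, include_extras=True)
    return_hint = hints.get("return")
    if get_origin(return_hint) is Annotated:
        # get_args yields (underlying_type, *metadata)
        _, *metadata = get_args(return_hint)
        for item in metadata:
            if isinstance(item, str):
                return item
    # Assumed fallback name for outputs that carry no annotation metadata
    return "output"


print(output_name(load_numbers))  # prints: dataset
```

The underlying type (`list` here, `pd.DataFrame` in the step above) stays available alongside the name, which is why one annotation can serve both purposes.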

ZenML is built in a way that allows you to experiment with your data and build
your pipelines as you work, so if you want to call this function to see how it
works, you can just call it directly. Here we take a look at the first few rows
of your training dataset.

In [4]:
data_loader_simplified(random_state=42).head()

Dataset with 541 records loaded!


| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


Everything looks as we'd expect and the values are all in the right format 🥳.
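
If you're wondering where the 541 in the log comes from, the arithmetic from `data_loader_simplified` can be reproduced by hand. The sketch below assumes the 569 rows that scikit-learn's breast cancer dataset ships with; the 5% fraction is the one used in the step above.

```python
total_records = 569        # rows in scikit-learn's breast cancer dataset
inference_fraction = 0.05  # same fraction as in data_loader_simplified

# int() truncates 28.45 down to 28 records held out for inference
inference_size = int(total_records * inference_fraction)
training_size = total_records - inference_size

print(inference_size)  # prints: 28
print(training_size)   # prints: 541
```

The 28 held-out records are what the step returns when it is called with `is_inference=True`.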

We're now at the point where we can bring this step and some others together into a single
pipeline, the top-level organising entity for code in ZenML. Creating such a pipeline is
as simple as adding a `@pipeline` decorator to a function. This specific
pipeline returns the preprocessed training and test datasets, so other pipelines can consume them.

In [5]:
@pipeline
def feature_engineering(
    test_size: float = 0.2,
    drop_na: Optional[bool] = None,
    normalize: Optional[bool] = None,
    drop_columns: Optional[List[str]] = None,
    target: Optional[str] = "target",
):
    """Feature engineering pipeline."""
    # Link all the steps together by calling them and passing the output
    # of one step as the input of the next step.
    raw_data = data_loader(random_state=random.randint(0, 100), target=target)
    dataset_trn, dataset_tst = data_splitter(
        dataset=raw_data,
        test_size=test_size,
    )
    dataset_trn, dataset_tst, _ = data_preprocessor(
        dataset_trn=dataset_trn,
        dataset_tst=dataset_tst,
        drop_na=drop_na,
        normalize=normalize,
        drop_columns=drop_columns,
        target=target,
    )
    
    return dataset_trn, dataset_tst

We're ready to run the pipeline now. As with the step, we can do this by calling the
pipeline function itself:

In [6]:
feature_engineering()

Initiating a new run for the pipeline: feature_engineering.
Reusing registered version: (version: 3).
Executing a new run.
Using user: hamza@zenml.io
Using stack: default
  artifact_store: default
  orchestrator: default
Step data_loader has started.
Dataset with 541 records loaded!
Step data_loader has finished in 1.950s.
Step data_splitter has started.
Step data_splitter has finished in 3.650s.
Step data_preprocessor has started.
Step data_preprocessor has finished in 4.506s.
Run feature_engineering-2023_12_07-16_50_18_998605 has finished in

Let's run this again with a slightly different test size, to create another dataset:

In [7]:
feature_engineering(test_size=0.3)

Initiating a new run for the pipeline: feature_engineering.
Reusing registered version: (version: 4).
Executing a new run.
Using user: hamza@zenml.io
Using stack: default
  artifact_store: default
  orchestrator: default
Using cached version of data_loader.
Step data_loader has started.
Step data_splitter has started.
Step data_splitter has finished in 3.236s.
Step data_preprocessor has started.
Step data_preprocessor has finished in 4.466s.
Run feature_engineering-2023_12_07-16_50_36_291181 has finished in 10.784s.
Dashboard URL: https://1cf18d

Notice that the data loader step was cached, while the rest of the pipeline was rerun. 
This is because ZenML automatically determined that nothing had changed in the data loader step, 
so it didn't need to rerun it.
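
To build some intuition for this behaviour, here is a toy sketch of step caching. It is not how ZenML implements caching (the real logic takes more into account); it only illustrates the idea that a step is skipped when its name, parameters, and upstream inputs are all unchanged. Every name in it (`cache_key`, `run_step`, `load`, `split`) is hypothetical.

```python
import hashlib
import json

_cache = {}  # hypothetical in-memory cache: key -> stored output


def cache_key(step_name, params, upstream_keys):
    """Hash everything that can change a step's result."""
    payload = json.dumps(
        {"step": step_name, "params": params, "upstream": upstream_keys},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def run_step(step_name, func, params, upstream=()):
    """Run func unless an identical invocation is cached.

    `upstream` is a sequence of (key, value) pairs from earlier steps.
    Returns (key, value, was_cached).
    """
    key = cache_key(step_name, params, [k for k, _ in upstream])
    if key in _cache:
        return key, _cache[key], True
    value = func(*[v for _, v in upstream], **params)
    _cache[key] = value
    return key, value, False


def load(seed):
    return list(range(10))


def split(data, test_size):
    cut = int(len(data) * (1 - test_size))
    return data[:cut], data[cut:]


for test_size in (0.2, 0.3):
    key, data, load_cached = run_step("data_loader", load, {"seed": 42})
    _, _, split_cached = run_step(
        "data_splitter", split, {"test_size": test_size}, [(key, data)]
    )
    print(test_size, load_cached, split_cached)
# prints:
# 0.2 False False
# 0.3 True False
```

On the second pass only `test_size` changed, so the loader is served from the cache while the splitter runs again, which mirrors the log output above.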

At this point you might be interested to view your pipeline runs in the ZenML
Dashboard. You can spin this up by executing the next cell. This will start a
server which you can access by clicking on the link that appears in the output
of the cell.

Log in to the Dashboard using the default credentials (username `default` and
password left blank). From there you can inspect the pipeline or the specific
pipeline run.


In [8]:
from zenml.environment import Environment

if Environment.in_google_colab():
    # run ZenML through a cloudflare tunnel to get a public endpoint
    !zenml up --port 8237 & cloudflared tunnel --url http://localhost:8237
else:
    !zenml up

Error: Your ZenML client is already connected to a remote server. If you want to spin up a local ZenML server, please disconnect from the remote server first by running `zenml disconnect`.


We can also fetch the pipeline from the server and view our results directly in the notebook:

In [9]:
client = Client()
run = client.get_pipeline("feature_engineering").last_run
print(run.name)

feature_engineering-2023_12_07-16_50_36_291181


We can also see the data artifacts that were produced by the last step of the pipeline:

In [10]:
run.steps["data_preprocessor"].outputs

{'dataset_tst': ArtifactResponse(id=UUID('77b1c2eb-c9bd-4030-b42c-21d06927d2b9'), permission_denied=False, body=ArtifactResponseBody(created=datetime.datetime(2023, 12, 7, 16, 50, 45), updated=datetime.datetime(2023, 12, 7, 16, 50, 45), user=UserResponse(id=UUID('c6fcdcc8-69e1-4ff5-9eb2-6a53aa81a08b'), permission_denied=False, body=UserResponseBody(created=datetime.datetime(2023, 10, 24, 7, 36, 26), updated=datetime.datetime(2023, 12, 7, 16, 6, 42), active=True, activation_token=None, full_name='Hamza Tahir', email_opted_in=True, is_service_account=False), metadata=None, name='hamza@zenml.io'), version='80', uri='/home/htahir1/.config/zenml/local_stores/466b79ce-3df9-4549-a50b-67ed433461f3/data_preprocessor/dataset_tst/b277e9ac-2631-4038-86f5-a71b7da09104', type=<ArtifactType.DATA: 'DataArtifact'>), metadata=None, name='dataset_tst'),
 'dataset_trn': ArtifactResponse(id=UUID('92cbfbef-0bf7-4247-9448-73c429465b82'), permission_denied=False, body=ArtifactResponseBody(created=datetime.dat

In [11]:
# Read one of the datasets. This is the one with a 0.3 test split
run.steps["data_preprocessor"].outputs["dataset_trn"].load()

ValidationError: 1 validation error for DistributionPackageSource
package_name
  field required (type=value_error.missing)

We can also get the artifacts directly. 

In [12]:
dataset_trn_artifact = client.get_artifact("dataset_trn")
dataset_tst_artifact = client.get_artifact("dataset_tst")

dataset_trn_artifact

ArtifactResponse(id=UUID('92cbfbef-0bf7-4247-9448-73c429465b82'), permission_denied=False, body=ArtifactResponseBody(created=datetime.datetime(2023, 12, 7, 16, 50, 44), updated=datetime.datetime(2023, 12, 7, 16, 50, 44), user=UserResponse(id=UUID('c6fcdcc8-69e1-4ff5-9eb2-6a53aa81a08b'), permission_denied=False, body=UserResponseBody(created=datetime.datetime(2023, 10, 24, 7, 36, 26), updated=datetime.datetime(2023, 12, 7, 16, 6, 42), active=True, activation_token=None, full_name='Hamza Tahir', email_opted_in=True, is_service_account=False), metadata=None, name='hamza@zenml.io'), version='82', uri='/home/htahir1/.config/zenml/local_stores/466b79ce-3df9-4549-a50b-67ed433461f3/data_preprocessor/dataset_trn/b277e9ac-2631-4038-86f5-a71b7da09104', type=<ArtifactType.DATA: 'DataArtifact'>), metadata=None, name='dataset_trn')

We'll use these artifacts in our next pipeline.

# ⌚ Step 2: Training pipeline

Now that we have our data it makes sense to train some models to get a sense of
how difficult the task is. The Breast Cancer dataset is complex enough that we're
unlikely to train a model that behaves perfectly, but we can get a sense of what a
reasonable baseline looks like.

We'll start with two simple models, an SGD Classifier and a Random Forest
Classifier, both batteries-included from `sklearn`. We'll train them both on the
same data and then compare their performance.

In [28]:
import pandas as pd
from sklearn.base import ClassifierMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from typing_extensions import Annotated
from zenml import ArtifactConfig, step
from zenml.logger import get_logger

logger = get_logger(__name__)


@step
def model_trainer(
    dataset_trn: pd.DataFrame,
    model_type: str = "sgd",
) -> Annotated[ClassifierMixin, ArtifactConfig(name="model", is_model_artifact=True)]:
    """Configure and train a model on the training dataset."""
    target = "target"
    if model_type == "sgd":
        model = SGDClassifier()
    elif model_type == "rf":
        model = RandomForestClassifier()
    else:
        raise ValueError(f"Unknown model type {model_type}")   

    logger.info(f"Training model {model}...")

    model.fit(
        dataset_trn.drop(columns=[target]),
        dataset_trn[target],
    )
    return model


Depending on `model_type`, our training step returns a different kind of `sklearn` classifier
model, so we use the generic `ClassifierMixin` type hint for the return type.

ZenML allows you to load any version of any dataset that is tracked by the framework
directly into a pipeline using the `ExternalArtifact` interface. This is very convenient
in this case, as we'd like to send our preprocessed dataset from the older pipeline directly
into the training pipeline.

In [29]:
@pipeline
def training(
    train_dataset_id: Optional[UUID] = None,
    test_dataset_id: Optional[UUID] = None,
    model_type: str = "sgd",
    min_train_accuracy: float = 0.0,
    min_test_accuracy: float = 0.0,
):
    """Model training pipeline.""" 
    if train_dataset_id is None or test_dataset_id is None:
        # If we don't pass the IDs, this will run the feature engineering pipeline
        dataset_trn, dataset_tst = feature_engineering()
    else:
        # Load the datasets from an older pipeline
        dataset_trn = ExternalArtifact(id=train_dataset_id)
        dataset_tst = ExternalArtifact(id=test_dataset_id) 

    trained_model = model_trainer(
        dataset_trn=dataset_trn,
        model_type=model_type,
    )

    model_evaluator(
        model=trained_model,
        dataset_trn=dataset_trn,
        dataset_tst=dataset_tst,
        min_train_accuracy=min_train_accuracy,
        min_test_accuracy=min_test_accuracy,
    )

The end goal of this quick baseline evaluation is to understand which of the two
models performs better. We'll use the `model_evaluator` step to compare the two
models. This step takes in the model from the trainer step and computes its score
on the training and test sets.
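
The `model_evaluator` step itself lives in the `steps` module and isn't shown in this notebook. As a rough idea of what such a step computes, here is a minimal, library-free sketch. The function names and the choice to collect warning messages when a threshold is missed are assumptions for illustration, not the actual step code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def evaluate(y_trn, pred_trn, y_tst, pred_tst,
             min_train_accuracy=0.0, min_test_accuracy=0.0):
    """Return the test accuracy plus a message for every missed threshold."""
    trn_acc = accuracy(y_trn, pred_trn)
    tst_acc = accuracy(y_tst, pred_tst)
    messages = []
    if trn_acc < min_train_accuracy:
        messages.append(
            f"Train accuracy {trn_acc:.2%} is below {min_train_accuracy:.2%}"
        )
    if tst_acc < min_test_accuracy:
        messages.append(
            f"Test accuracy {tst_acc:.2%} is below {min_test_accuracy:.2%}"
        )
    return tst_acc, messages


# Made-up labels and predictions, for illustration only
test_accuracy, messages = evaluate(
    y_trn=[0, 1, 1, 0], pred_trn=[0, 1, 1, 0],
    y_tst=[1, 0, 1, 1], pred_tst=[1, 0, 0, 1],
    min_test_accuracy=0.8,
)
print(f"{test_accuracy:.2%}")  # prints: 75.00%
print(messages)                # prints: ['Test accuracy 75.00% is below 80.00%']
```

Returning a single float is what makes it easy to compare two runs later on.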

In [30]:
# Use a random forest model
training(model_type="rf", train_dataset_id=dataset_trn_artifact.id, test_dataset_id=dataset_tst_artifact.id)

rf_run = client.get_pipeline("training").last_run

Initiating a new run for the pipeline: training.
Registered new version: (version 13).
Executing a new run.
Using user: hamza@zenml.io
Using stack: default
  artifact_store: default
  orchestrator: default
Step model_trainer has started.
Training model RandomForestClassifier()...
Step model_trainer has finished in 2.695s.
Step model_evaluator has started.
Train accuracy=100.00%
Test accuracy=95.71%
Step model_evaluator has finished

In [31]:
# Use an SGD classifier
training(model_type="sgd", train_dataset_id=dataset_trn_artifact.id, test_dataset_id=dataset_tst_artifact.id)

sgd_run = client.get_pipeline("training").last_run

Initiating a new run for the pipeline: training.
Registered new version: (version 14).
Executing a new run.
Using user: hamza@zenml.io
Using stack: default
  artifact_store: default
  orchestrator: default
Step model_trainer has started.
Training model SGDClassifier()...
Step model_trainer has finished in 2.781s.
Step model_evaluator has started.
Train accuracy=87.57%
Test accuracy=90.18%
Step model_evaluator has finished in 3.756s.
Run

You can see from the logs already how our model training went: the
`RandomForestClassifier` performed considerably better than the `SGDClassifier`.
We can use the ZenML `Client` to verify this:

In [32]:
# The evaluator returns a float value with the accuracy
rf_run.steps["model_evaluator"].output.load() > sgd_run.steps["model_evaluator"].output.load()

True

# ⌚ Step 3: Associating a model with your pipeline

You can see it is relatively easy to train ML models using ZenML pipelines. But it can be somewhat clunky to track
all the models produced as you develop your experiments and use-cases. Luckily, ZenML offers a *Model Control Plane*,
which is a central register of all your ML models.

You can easily create a ZenML Model and associate it with your pipelines using the `ModelVersion` object:

In [33]:
pipeline_settings = {}
pipeline_settings["model_version"] = ModelVersion(
    name="breast_cancer_classifier",
    license="Apache 2.0",
    description="A breast cancer classifier",
    tags=["classification", "sklearn"],
)

# the `with_options` method allows us to pass in pipeline settings
#  and returns a configured pipeline
training_configured = training.with_options(**pipeline_settings)

In [35]:
# We can now run this as usual
training_configured(model_type="sgd", train_dataset_id=dataset_trn_artifact.id, test_dataset_id=dataset_tst_artifact.id)

Initiating a new run for the pipeline: training.
Reusing registered version: (version: 14).
New model version 13 was created.
Executing a new run.
Using user: hamza@zenml.io
Using stack: default
  artifact_store: default
  orchestrator: default
Using cached version of model_trainer.
Step model_trainer has started.
Using cached version of model_evaluator.
Linking artifact output to model None version None implicitly.
Step model_evaluator has started.
Run training-2023_12_07-17_08_57_860304 has finished in 6.124s.
Dashboard URL: https://1cf18d95-z

In [36]:
# We can now run this as usual
training_configured(model_type="rf", train_dataset_id=dataset_trn_artifact.id, test_dataset_id=dataset_tst_artifact.id)

Initiating a new run for the pipeline: training.
Reusing registered version: (version: 13).
New model version 14 was created.
Executing a new run.
Using user: hamza@zenml.io
Using stack: default
  artifact_store: default
  orchestrator: default
Using cached version of model_trainer.
Step model_trainer has started.
Using cached version of model_evaluator.
Linking artifact output to model None version None implicitly.
Step model_evaluator has started.
Run training-2023_12_07-17_09_08_682638 has finished in 5.989s.
Dashboard URL: https://1cf18d95-z


You can list your ZenML model and its versions as follows:



In [37]:
client = Client()
zenml_model = client.get_model("breast_cancer_classifier")
print(zenml_model)

print(f"Model {zenml_model.name} has {len(zenml_model.versions)} versions")

name='breast_cancer_classifier' license='Apache 2.0' description='Classification of Breast Cancer Dataset.' audience=None use_cases=None limitations=None trade_offs=None ethics=None id=UUID('952d7089-dac6-4402-874a-89d81e308e33') created=datetime.datetime(2023, 12, 7, 14, 17, 13) updated=datetime.datetime(2023, 12, 7, 14, 17, 13) missing_permissions=False user=UserResponse(id=UUID('c6fcdcc8-69e1-4ff5-9eb2-6a53aa81a08b'), permission_denied=False, body=UserResponseBody(created=datetime.datetime(2023, 10, 24, 7, 36, 26), updated=datetime.datetime(2023, 12, 7, 16, 6, 42), active=True, activation_token=None, full_name='Hamza Tahir', email_opted_in=True, is_service_account=False), metadata=None, name='hamza@zenml.io') workspace=WorkspaceResponse(id=UUID('f3a544f2-afb5-4672-934a-7a465c66201c'), permission_denied=False, body=WorkspaceResponseBody(created=datetime.datetime(2023, 10, 23, 15, 34, 47), updated=datetime.datetime(2023, 10, 23, 15, 34, 47)), metadata=None, name='default') tags=[TagRe

You can see a new model version was created when the `training` pipeline was run. 

In [None]:
fe_t_configured()

In [None]:
@pipeline
def batch_inference():
    """
    Model batch inference pipeline.

    This is a pipeline that loads the inference data, preprocesses
    it, and runs inference.
    """
    ### ADD YOUR OWN CODE HERE - THIS IS JUST AN EXAMPLE ###
    # Link all the steps together by calling them and passing the output
    # of one step as the input of the next step.
    ########## ETL stage  ##########
    random_state = client.get_artifact("dataset").run_metadata["random_state"].value
    target = client.get_artifact("dataset_trn").run_metadata['target'].value
    df_inference = data_loader(
        random_state=random_state, is_inference=True
    )
    df_inference = inference_preprocessor(
        dataset_inf=df_inference,
        preprocess_pipeline=ExternalArtifact(name="preprocess_pipeline"),
        target=target,
    )
    inference_predict(
        dataset_inf=df_inference,
    )


In [None]:
pipeline_args = {}
pipeline_args["config_path"] = os.path.join("configs", "inference.yaml")
fe_b_configured = batch_inference.with_options(**pipeline_args)

In [None]:
fe_b_configured()

# 🍳 Breaking it down





In [None]:
from datasets import DatasetDict, load_dataset


@step
def data_loader() -> Annotated[DatasetDict, "dataset"]:
    logger.info("Loading dataset airline_reviews... ")
    hf_dataset = load_dataset("Shayanvsf/US_Airline_Sentiment")
    hf_dataset = hf_dataset.rename_column("airline_sentiment", "label")
    hf_dataset = hf_dataset.remove_columns(
        ["airline_sentiment_confidence", "negativereason_confidence"]
    )
    return hf_dataset

Notice that you can give each dataset a name with Python's `Annotated` type hint. The `DatasetDict` is a native Hugging Face dataset which ZenML knows how to persist through steps. This flow ensures reproducibility and version control for every dataset iteration.

Also notice this is a simple Python function that can be called with the `entrypoint` wrapper:

In [None]:
hf_dataset = data_loader.entrypoint()
print(hf_dataset)

Now we put this into a full feature engineering pipeline. Each run of the feature engineering pipeline produces a new dataset to use for the training pipeline. ZenML versions this data as it flows through the pipeline.

<img src="assets/pipelines_feature_eng.png" alt="Pipelines Feature Engineering">

### Set your stack

In [None]:
!zenml stack describe hf-sagemaker-local

In [None]:
!zenml stack set hf-sagemaker-local

In [None]:
!zenml stack get

### Run the pipeline

In [None]:
@pipeline(on_failure=notify_on_failure)
def sentinment_analysis_feature_engineering_pipeline(
    lower_case: Optional[bool] = True,
    padding: Optional[str] = "max_length",
    max_seq_length: Optional[int] = 128,
    text_column: Optional[str] = "text",
    label_column: Optional[str] = "label",
):
    # Link all the steps together by calling them and passing the output
    # of one step as the input of the next step.

    ########## Load Dataset stage ##########
    dataset = data_loader()

    ########## Data Quality stage ##########
    reference_dataset, comparison_dataset = generate_reference_and_comparison_datasets(
        dataset
    )
    text_data_report = evidently_report_step.with_options(
        parameters=dict(
            column_mapping=EvidentlyColumnMapping(
                target="label",
                text_features=["text"],
            ),
            metrics=[
                EvidentlyMetricConfig.metric("DataQualityPreset"),
                EvidentlyMetricConfig.metric(
                    "TextOverviewPreset", column_name="text"
                ),
            ],
            # We need to download the NLTK data for the TextOverviewPreset
            download_nltk_data=True,
        ),
    )
    text_data_report(reference_dataset, comparison_dataset)

    ########## Tokenization stage ##########
    tokenizer = tokenizer_loader(lower_case=lower_case)
    tokenized_data = tokenization_step(
        dataset=dataset,
        tokenizer=tokenizer,
        padding=padding,
        max_seq_length=max_seq_length,
        text_column=text_column,
        label_column=label_column,
    )
    return tokenizer, tokenized_data

In [None]:
from datetime import datetime as dt

# Run a pipeline with the required parameters.
no_cache: bool = True
zenml_model_name: str = "distil_bert_sentiment_analysis"
max_seq_length = 512

# This executes all steps in the pipeline in the correct order using the orchestrator
# stack component that is configured in your active ZenML stack.
model_config = ModelConfig(
    name=zenml_model_name,
    license="Apache 2.0",
    description="Show case Model Control Plane.",
    create_new_model_version=True,
    delete_new_version_on_failure=True,
    tags=["sentiment_analysis", "huggingface"],
)

pipeline_args = {}

if no_cache:
    pipeline_args["enable_cache"] = False

# Execute Feature Engineering Pipeline
pipeline_args["model_config"] = model_config
pipeline_args["config_path"] = os.path.join("configs", "feature_engineering_config.yaml")
run_args_feature = {
    "max_seq_length": max_seq_length,
}
pipeline_args[
    "run_name"
] = f"sentinment_analysis_feature_engineering_pipeline_run_{dt.now().strftime('%Y_%m_%d_%H_%M_%S')}"
p = sentinment_analysis_feature_engineering_pipeline.with_options(**pipeline_args)
p(**run_args_feature)

In [None]:
from zenml.client import Client
from IPython.display import display, HTML

client = Client()
# CHANGE THIS TO THE LATEST RUN ID
latest_run = client.get_pipeline_run("sentinment_analysis_feature_engineering_pipeline_run_2023_11_21_10_55_56")
html = latest_run.steps["evidently_report_step"].outputs['report_html'].load()
display(HTML(html))

## 💪 Step 2: Train the model with Huggingface Hub as the model registry
 

Once the feature engineering pipeline has run a few times, we have many datasets to choose from. We can feed our desired one into a function that trains the model on the data. Thanks to the ZenML Huggingface integration, this data is loaded directly from the ZenML artifact store.

<img src="assets/training_pipeline_overview.png" alt="Pipelines Trains">

On the left side, we see our local MLOps stack, which defines the infrastructure and tooling we are using for this particular pipeline. ZenML makes it easy to run on a local stack on your development machine, or to switch out the stack to run on an AWS Kubeflow-based stack (if you want to scale up).

On the right side is the new kid on the block: the ZenML Model Control Plane. The Model Control Plane is a new feature in ZenML that gives users a complete overview of their machine learning models. It lets teams consolidate all artifacts related to their ML models in one place and manage each model's lifecycle easily, as you can see in the ZenML Cloud.

In [None]:
pipeline_args["config_path"] = os.path.join("configs", "trainer_config.yaml")

pipeline_args["enable_cache"] = True

run_args_train = {
    "num_epochs": 1,
    "train_batch_size": 64,
    "eval_batch_size": 64,
    "learning_rate": 2e-4,
    "weight_decay": 0.01,
    "max_seq_length": 512,
}

# Use versioned artifacts from the last step
# run_args_train["dataset_artifact_id"] = latest_run.steps['tokenization_step'].output.id
# run_args_train["tokenizer_artifact_id"] = latest_run.steps['tokenizer_loader'].output.id

# Configure the model
pipeline_args["model_config"] = model_config

pipeline_args[
    "run_name"
] = f"sentinment_analysis_training_run_{dt.now().strftime('%Y_%m_%d_%H_%M_%S')}"

In [None]:
sentinment_analysis_training_pipeline.with_options(**pipeline_args)(
    **run_args_train
)

In [None]:
### Check out a new stack
!zenml stack describe hf-sagemaker-airflow

In [None]:
### Change the stack
!zenml stack set hf-sagemaker-airflow

In [None]:
sentinment_analysis_training_pipeline.with_options(**pipeline_args)(
    **run_args_train
)

## 🫅 Step 3: Promote the model to production


Following training, the automated promotion pipeline evaluates models against predefined metrics, identifying and marking the most performant one as 'Production ready'. This is another common use case for the Model Control Plane; we store the relevant metrics there to access them easily later.
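The actual promotion rule lives in the promotion pipeline and `promoting_config.yaml`, neither of which is shown here. As a rough illustration of the idea only, the sketch below compares a candidate's metrics against those of the current production model and promotes the candidate when it scores higher. The function `should_promote`, the `accuracy` metric name, and the numbers are assumptions for this example, not ZenML APIs or results from this project.

```python
from typing import Mapping, Optional


def should_promote(
    candidate_metrics: Mapping[str, float],
    production_metrics: Optional[Mapping[str, float]],
    metric: str = "accuracy",
) -> bool:
    """Decide whether a candidate should replace the production model.

    Assumes a higher value of `metric` is better (hypothetical rule).
    """
    if production_metrics is None:
        # Nothing is in production yet, so the first candidate is promoted.
        return True
    return candidate_metrics[metric] > production_metrics[metric]


# Made-up metric values, for illustration only.
print(should_promote({"accuracy": 0.91}, {"accuracy": 0.88}))  # True
print(should_promote({"accuracy": 0.85}, {"accuracy": 0.88}))  # False
print(should_promote({"accuracy": 0.85}, None))                # True
```

Keeping the decision in one small function like this makes the rule easy to test separately from the pipeline that applies it.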

<img src="assets/promoting_pipeline_overview.png" alt="Pipelines Trains">

In [None]:
!zenml stack set hf-sagemaker-local

In [None]:
run_args_promoting = {}
model_config = ModelConfig(name=zenml_model_name)
pipeline_args["config_path"] = os.path.join("configs", "promoting_config.yaml")

pipeline_args["model_config"] = model_config

pipeline_args[
    "run_name"
] = f"sentinment_analysis_promoting_pipeline_run_{dt.now().strftime('%Y_%m_%d_%H_%M_%S')}"

In [None]:
sentinment_analysis_promote_pipeline.with_options(**pipeline_args)(
    **run_args_promoting
)

## 💯 Step 4: Deploy the model to AWS Sagemaker Endpoints


This is the final step: automating the deployment of the model slated for production to a Sagemaker endpoint. The deployment pipeline handles the complexities of AWS interactions and ensures that the model, along with its full history and context, is transitioned into a live environment ready for use. Here again we use the Model Control Plane interface to query the Huggingface revision and use that information to push to Huggingface Hub.

<img src="assets/deploying_pipeline_overview.png" alt="Pipelines Trains">


In [None]:
!zenml stack set hf-sagemaker-local

## Congratulations!

You just ran several ML pipelines! You engineered features, trained the model
on two different stacks, promoted the best version with the ZenML Model Control
Plane, and saw how it is deployed to a Sagemaker endpoint. You also learned how
to iterate on your models and data by using some of the ZenML utility
abstractions, and how to inspect and switch stacks via the CLI.

And that is just the tip of the iceberg of what ZenML can do; check out the [**docs**](https://docs.zenml.io/) to learn more
about the capabilities of ZenML.

## What to do now

* If you have questions or feedback... join our [**Slack Community**](https://zenml.io/slack) and become part of the ZenML family!
* If you want to try ZenML in a real-world setting... check out the [ZenML Cloud](https://cloud.zenml.io/).