<a href="https://colab.research.google.com/github/siwarnasri/MlOps_CustomerSatisfaction/blob/main/1_1_Pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook 1.1: ML Pipelines

In this notebook, you will learn how to convert existing ML code into ML pipelines using ZenML.

Since we will be creating models using Sklearn, you must have the ZenML Sklearn integration installed. You can install ZenML and the Sklearn integration with the following commands, which will also restart your notebook's kernel.

In [1]:
%pip install "zenml[server]"
!zenml integration install sklearn -y
%pip install pyparsing==2.4.2  # required for Colab

import IPython

# automatically restart kernel
IPython.Application.instance().kernel.do_shutdown(restart=True)

Collecting zenml[server]
  Downloading zenml-0.44.3-py3-none-any.whl (6.0 MB)
Collecting alembic<1.9.0,>=1.8.1 (from zenml[server])
  Downloading alembic-1.8.1-py3-none-any.whl (209 kB)
Collecting azure-mgmt-resource>=21.0.0 (from zenml[server])
  Downloading azure_mgmt_resource-23.0.1-py3-none-any.whl (2.5 MB)
Collecting click<8.1.4,>=8.0.1 (from zenml[server])
  Downloading click-8.1.3-py3-none-any.whl (96 kB)
Collecting click-params<0.4.0,>=0.3.0 (from zenml[server])
  Downloading click_pa

NumExpr defaulting to 2 threads.
Installing integrations...
Collecting pyparsing==2.4.2
  Downloading pyparsing-2.4.2-py2.py3-none-any.whl (65 kB)
Installing collected packages: pyparsing
  Attempting uninstall: pyparsing
    Found existing installation: pyparsing 2.4.7
    Uninstalling pyparsing-2.4.7:
      Successfully uninstalled pyparsing-2.4.7
Successfully installed pyparsing-2.4.2


{'status': 'ok', 'restart': True}

**Colab Note:** On Colab, you need an [ngrok account](https://dashboard.ngrok.com/signup) to view some of the visualizations later. Please set up an account, then set your user token below:

In [2]:
NGROK_TOKEN = "<YOUR_NGROK_TOKEN>"  # TODO: set your ngrok token if you are working on Colab; never commit a real token

In [3]:
from zenml.environment import Environment

if Environment.in_google_colab():  # Colab only setup

    # install and authenticate ngrok
    !pip install pyngrok
    !ngrok authtoken {NGROK_TOKEN}

Collecting pyngrok
  Downloading pyngrok-7.0.0.tar.gz (718 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyngrok
  Building wheel for pyngrok (setup.py) ... done
  Created wheel for pyngrok: filename=pyngrok-7.0.0-py3-none-any.whl size=21129 sha256=26e8bc0878660051c903748ed0d6aab4de92db929f8a4f85e79971e61b274636
  Stored in directory: /root/.cache/pip/wheels/60/29/7b/f64332aa7e5e88fbd56d4002185ae22dcdc83b35b3d1c2cbf5
Succe

As an ML practitioner, you are probably familiar with building ML models using Scikit-learn, PyTorch, TensorFlow, or similar. An ML pipeline is simply an extension that includes other steps you would normally perform before or after creating a model, such as data collection, preprocessing, model deployment, or monitoring. The ML pipeline essentially defines a step-by-step process for your work as an ML practitioner. Defining ML pipelines explicitly in code is great because:

- We can easily repeat all of our work, not just the model, to eliminate errors and make our models easier to reproduce.
- Data and models can be versioned and tracked, so we can see at a glance which dataset a model was trained on and how it compares to other models.
- When the entire pipeline is coded, we can automate many operational tasks, such as re-training and redeploying models when the underlying problem or data changes, or rolling out new and improved models with CI/CD workflows.

A well-defined ML pipeline is essential for ML teams looking to deploy models at scale.

## ZenML Setup
Throughout this series, we will define our ML pipelines using [ZenML](https://github.com/zenml-io/zenml/). ZenML is an excellent tool for this task, as it is easy and intuitive to use, and has [integrations](https://zenml.io/integrations) with most of the advanced MLOps tools we will use later. Make sure you have ZenML installed (via `pip install zenml`). Next, we run some commands to make sure you start with a fresh ML stack.

In [4]:
!rm -rf .zen
!zenml init

NumExpr defaulting to 2 threads.
Initializing the ZenML global configuration version to 0.44.3
Initializing ZenML repository at /content.

## Example of experimental ML code
Let's start with a simple example of experimental ML code. Below, we train a Scikit-learn SVC classifier to classify images of handwritten digits. We load the data, train a model on the training set, and then evaluate it on the test set.

In [5]:
import numpy as np
from sklearn.base import ClassifierMixin
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split


def train_test() -> None:
    """Train and test a Scikit-learn SVC classifier on digits"""
    digits = load_digits()
    data = digits.images.reshape((len(digits.images), -1))
    X_train, X_test, y_train, y_test = train_test_split(
        data, digits.target, test_size=0.2, shuffle=False
    )
    model = SVC(gamma=0.001)
    model.fit(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"Test accuracy: {test_acc}")


train_test()

Test accuracy: 0.9583333333333334
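
Because the split above uses `shuffle=False`, the test set is simply the tail of the dataset rather than a random sample. The plain-Python sketch below shows how the split sizes come about for the 1,797 digit images, assuming the test-set size is rounded up. `ordered_split` is a hypothetical helper written for illustration, not a Scikit-learn function.

```python
import math
from typing import Tuple


def ordered_split(n_samples: int, test_size: float) -> Tuple[range, range]:
    """Return (train_indices, test_indices) for an unshuffled split."""
    # Assumption: the number of test samples is rounded up.
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    # Without shuffling, the first rows form the training set
    # and the remaining rows form the test set.
    return range(n_train), range(n_train, n_samples)


train_idx, test_idx = ordered_split(n_samples=1797, test_size=0.2)
print(len(train_idx), len(test_idx))
```

Under this assumption the test set holds 360 images, so the accuracy printed above corresponds to 345 correctly classified digits.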


## Turning experiments into ML pipelines with ZenML

In practice, of course, your ML workflows will be much more complicated. You may have complex preprocessing that you don't want to repeat every time you train a model, you may need to compare the performance of different models, deploy them in a production environment, and more. This is where ML pipelines come in, allowing us to define our workflows in modular steps that we can then mix and match.

<!-- ![Digits pipeline](_assets/1-1/digits_pipeline.png) -->

In our example, we see three distinct steps: loading the data, training the model, and evaluating the model. Let's now define each of these steps as a ZenML **[pipeline step](https://docs.zenml.io/user-guide/starter-guide)** by moving each step into its own function and decorating it with the ZenML `@step` [Python decorator](https://realpython.com/primer-on-python-decorators/).
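
If decorators are new to you, the following plain-Python sketch may help before we turn to the ZenML version. It shows one way a step-like decorator could record a function's typed inputs and output so that a framework knows what flows in and out. The names `toy_step` and `STEP_REGISTRY` are hypothetical and exist only for this illustration; this is not how ZenML implements `@step`.

```python
import inspect
from typing import Any, Callable, Dict

# Hypothetical registry that maps a step name to its declared types.
STEP_REGISTRY: Dict[str, Dict[str, Any]] = {}


def toy_step(func: Callable) -> Callable:
    """Record the function's type annotations, then return it unchanged."""
    signature = inspect.signature(func)
    STEP_REGISTRY[func.__name__] = {
        "inputs": {
            name: param.annotation
            for name, param in signature.parameters.items()
        },
        "output": signature.return_annotation,
    }
    return func


@toy_step
def scale(values: list, factor: float) -> list:
    """Multiply every value by a constant factor."""
    return [value * factor for value in values]


print(STEP_REGISTRY["scale"]["inputs"])
print(scale([1, 2, 3], 2.0))
```

This is also why the type annotations on the step functions below matter: they are the only place where the inputs and outputs of a step are declared.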

In [6]:
from zenml.steps import step, Output


@step
def importer() -> Output(
    X_train=np.ndarray,
    X_test=np.ndarray,
    y_train=np.ndarray,
    y_test=np.ndarray,
):
    """Load the digits dataset as numpy arrays."""
    digits = load_digits()
    data = digits.images.reshape((len(digits.images), -1))
    X_train, X_test, y_train, y_test = train_test_split(
        data, digits.target, test_size=0.2, shuffle=False
    )
    return X_train, X_test, y_train, y_test


@step
def svc_trainer(
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> ClassifierMixin:
    """Train an sklearn SVC classifier."""
    model = SVC(gamma=0.001)
    model.fit(X_train, y_train)
    return model


@step
def evaluator(
    X_test: np.ndarray,
    y_test: np.ndarray,
    model: ClassifierMixin,
) -> float:
    """Calculate the test set accuracy of an sklearn model."""
    test_acc = model.score(X_test, y_test)
    print(f"Test accuracy: {test_acc}")
    return test_acc

INFO:numexpr.utils:NumExpr defaulting to 2 threads.

NumExpr defaulting to 2 threads.
The @step decorator that you used to define your importer step is deprecated. Check out the 0.40.0 migration guide for more information on how to migrate your steps to the new syntax: https://docs.zenml.io/reference/migration-guide/migration-zero-forty
Using the Output class to define the outputs of your steps is deprecated. You should instead use the standard Python way of type annotating your functions. Check out our documentation https://docs.zenml.io/user-guide/advanced-guide/pipelining-features/configure-steps-pipelines#step-output-names for more information on how to assign custom names to your step outputs.
The @step decorator that you used to define your svc_trainer step is deprecated. Check out the 0.40.0 migration guide for more information on how to migrate your steps to the new syntax: https://docs.zenml.io/reference/migration-guide/migration-zero-forty

Similarly, we can use the ZenML decorator `@pipeline` to connect all our steps into an ML pipeline.

Note that the pipeline definition does not depend on the specific step functions we defined above; it simply specifies a recipe for how the data passes through the steps. This means that we can substitute steps as we see fit, for example, to run the same pipeline with different models and compare their performance.

In [7]:
from zenml.pipelines import pipeline


@pipeline
def digits_pipeline(importer, trainer, evaluator):
    """Links all the steps together in a pipeline"""
    X_train, X_test, y_train, y_test = importer()
    model = trainer(X_train=X_train, y_train=y_train)
    evaluator(X_test=X_test, y_test=y_test, model=model)

The @pipeline decorator that you used to define your digits_pipeline pipeline is deprecated. Check out the 0.40.0 migration guide for more information on how to migrate your pipelines to the new syntax: https://docs.zenml.io/reference/migration-guide/migration-zero-forty.html


## Running ZenML Pipelines
Finally, we initialize our pipeline with concrete step functions and call the `run()` method to execute it.

In [None]:
digits_svc_pipeline = digits_pipeline(
    importer=importer(), trainer=svc_trainer(), evaluator=evaluator()
)
digits_svc_pipeline.run(unlisted=True)
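
Since the pipeline only fixes how data flows between steps, swapping one step for another leaves the recipe itself untouched. The sketch below illustrates this idea in plain Python, without ZenML, using toy "models" that predict a single constant label. All function names in it are hypothetical and chosen purely for illustration.

```python
from typing import Callable, List, Tuple


def toy_pipeline(importer: Callable, trainer: Callable, evaluator: Callable) -> float:
    """Fix the data flow; the concrete steps are supplied by the caller."""
    train, test = importer()
    model = trainer(train)
    return evaluator(test, model)


def toy_importer() -> Tuple[List[int], List[int]]:
    """Return a tiny (train, test) pair of integer labels."""
    return [0, 0, 1, 0], [0, 1, 0, 0]


def majority_trainer(train: List[int]) -> int:
    """'Model' that always predicts the most common training label."""
    return max(set(train), key=train.count)


def constant_one_trainer(train: List[int]) -> int:
    """Alternative 'model' that always predicts the label 1."""
    return 1


def accuracy_evaluator(test: List[int], model: int) -> float:
    """Fraction of test labels that match the constant prediction."""
    return sum(label == model for label in test) / len(test)


# Same recipe, two different trainers: only the trainer argument changes.
for trainer in (majority_trainer, constant_one_trainer):
    score = toy_pipeline(toy_importer, trainer, accuracy_evaluator)
    print(f"{trainer.__name__}: {score:.2f}")
```

In the ZenML pipeline above, the same substitution would amount to passing a different trainer step when initializing `digits_pipeline`.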

And that's it, we just built and ran our first ML pipeline! Well done!

You can now visualize the pipeline run in the ZenML dashboard. To do so, run
`zenml up` to create a local ZenML dashboard, log in with the username `default`
and a blank password and navigate to the Runs tab in the Pipelines section.

In [None]:
from zenml.environment import Environment

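def start_zenml_dashboard(port=8237):
    """Start the local ZenML dashboard, tunnelling through ngrok on Colab."""
    if Environment.in_google_colab():
        from pyngrok import ngrok

        # Colab cannot serve localhost directly, so expose the port publicly
        public_url = ngrok.connect(port)
        print(f"\x1b[31mIn Colab, use this URL instead: {public_url}!\x1b[0m")
        !zenml up --blocking --port {port}

    else:
        !zenml up --port {port}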

start_zenml_dashboard()

In the next notebook, you'll see one of the best features of ML pipelines in action: automatic versioning and caching of artifacts. See you there!