# Lesson 1.1: ML Pipelines with ZenML

***Key Concepts:*** *ML Pipelines, Steps*

In this notebook we will learn how to easily convert existing ML code into ML pipelines using ZenML.

Since we will build models with [sklearn](https://scikit-learn.org/stable/), you will need to have the ZenML sklearn integration installed. If you have not done so before, install it with the following command, then restart the kernel of your notebook.

In [None]:
!zenml integration install sklearn -f

As an ML practitioner, you are probably familiar with how to build ML models with Scikit-learn, PyTorch, TensorFlow, or similar.
An **[ML Pipeline](https://docs.zenml.io/core-concepts#pipeline)** is simply an extension of that, which also includes other steps you would typically do before or after building a model, like data acquisition, preprocessing, model deployment, or monitoring. In essence, the ML pipeline defines a step-by-step procedure of your work as an ML practitioner.
Defining ML pipelines explicitly in code is great because:
- We can easily rerun *all* of our work, not just the model. This eliminates bugs and makes our models easier to reproduce.
- Data and models can be versioned and tracked, so we can see at a glance which dataset a model was trained on and how it compares to other models.
- If the entire pipeline is coded up, we can automate many operational tasks, like retraining and redeploying models when the underlying problem or data changes, or rolling out new and improved models with CI/CD workflows.

For ML teams that aim to serve models at large scale, having a clearly defined ML pipeline is a must.

## ZenML Setup
Throughout this series, we will define our ML pipelines using [ZenML](https://github.com/zenml-io/zenml/). ZenML is an excellent tool for this task, as it is very easy and intuitive to use and has [integrations](https://docs.zenml.io/features/integrations) with most of the advanced MLOps tools we will want to use later. Make sure you have ZenML installed (via `pip install zenml`). In the following, we run some commands to make sure you start out with a fresh ML stack. You can ignore this for now as it will be explained in more detail in a later chapter.

In [None]:
!rm -rf .zen
!zenml init
!zenml stack set default

## Example Experimentation ML Code
Let us get started with some simple example ML code. In the following, we train a Scikit-learn SVC classifier to classify images of handwritten digits. We load the data, train a model on the training set, then evaluate it on the test set.

In [None]:
import numpy as np
from sklearn.base import ClassifierMixin
from sklearn.svm import SVC
from zenml.integrations.sklearn.helpers.digits import get_digits


def train_test() -> None:
    """Train and test a Scikit-learn SVC classifier on digits"""
    X_train, X_test, y_train, y_test = get_digits()
    model = SVC(gamma=0.001)
    model.fit(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"Test accuracy: {test_acc}")


train_test()

## Turning experiments into ML pipelines with ZenML

In practice, your ML workflows will be much more complicated than this. You might have complex preprocessing that you do not want to redo every time you train a model, you will need to compare the performance of different models and deploy them in a production setting, and much more. This is where ML pipelines come into play: they let us define our workflows as distinct, modular steps that we can then mix and match.

![Digits Pipeline](_assets/1-1/digits_pipeline.png)

In our example, we can identify three distinct steps: data loading, model training, and model evaluation. Let us now define each of them as a ZenML **[Pipeline Step](https://docs.zenml.io/core-concepts#step)**, simply by moving each step to its own function and decorating it with ZenML's `@step` [Python decorator](https://realpython.com/primer-on-python-decorators/).

In [None]:
from zenml.steps import step, Output


@step
def importer() -> Output(
    X_train=np.ndarray,
    X_test=np.ndarray,
    y_train=np.ndarray,
    y_test=np.ndarray,
):
    """Load the digits dataset as numpy arrays."""
    X_train, X_test, y_train, y_test = get_digits()
    return X_train, X_test, y_train, y_test


@step
def svc_trainer(
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> ClassifierMixin:
    """Train a sklearn SVC classifier."""
    model = SVC(gamma=0.001)
    model.fit(X_train, y_train)
    return model


@step
def evaluator(
    X_test: np.ndarray,
    y_test: np.ndarray,
    model: ClassifierMixin,
) -> float:
    """Calculate the test set accuracy of an sklearn model."""
    test_acc = model.score(X_test, y_test)
    print(f"Test accuracy: {test_acc}")
    return test_acc
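
If decorators are new to you, the toy sketch below may help build intuition for what decorating a function with something like `@step` involves. It is plain Python, not ZenML code: `toy_step`, `STEP_REGISTRY`, and `add_one` are hypothetical names invented for this illustration, and ZenML's real `@step` does considerably more behind the scenes.

```python
import functools
from typing import Any, Callable, Dict

# Hypothetical registry for this illustration only (not part of ZenML).
STEP_REGISTRY: Dict[str, Callable[..., Any]] = {}


def toy_step(func: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap `func` so each call is logged, and register it under its name."""

    @functools.wraps(func)  # keep the original name and docstring
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        print(f"Running step: {func.__name__}")
        return func(*args, **kwargs)

    STEP_REGISTRY[func.__name__] = wrapper
    return wrapper


@toy_step
def add_one(x: int) -> int:
    """Toy step that increments its input."""
    return x + 1


result = add_one(41)
```

The decorated function still behaves like the original when called, but the decorator had a chance to record it and to run extra logic around every call. That is roughly the hook a pipeline framework needs in order to treat ordinary functions as steps.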

Similarly, we can use ZenML's `@pipeline` decorator to connect all of our steps into an ML pipeline.

Note that the pipeline definition does not depend on the concrete step functions we defined above; it merely defines a recipe for how data moves through the steps. This means we can replace steps as we wish, e.g., to run the same pipeline with different models and compare their performance.

In [None]:
from zenml.pipelines import pipeline


@pipeline
def digits_pipeline(importer, trainer, evaluator):
    """Links all the steps together in a pipeline"""
    X_train, X_test, y_train, y_test = importer()
    model = trainer(X_train=X_train, y_train=y_train)
    evaluator(X_test=X_test, y_test=y_test, model=model)
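
To make the idea of a pipeline as a recipe concrete, here is a plain-Python sketch that mirrors the structure above without any ZenML machinery. Everything in it is an assumption made for illustration: `toy_pipeline`, `toy_importer`, `threshold_trainer`, `majority_trainer`, and `accuracy_evaluator` are hypothetical names, and the tiny one-dimensional dataset is made up. The point is only that the same recipe can run with different trainers.

```python
from typing import Callable, List, Tuple

Sample = Tuple[float, int]       # (feature, label)
Model = Callable[[float], int]   # maps a feature to a predicted label


def toy_pipeline(importer, trainer, evaluator) -> float:
    """Recipe only: describes how data flows, not which steps are used."""
    train, test = importer()
    model = trainer(train)
    return evaluator(test, model)


def toy_importer() -> Tuple[List[Sample], List[Sample]]:
    """Return a tiny made-up train/test split."""
    train = [(0.1, 0), (0.2, 0), (0.4, 0), (0.8, 1), (0.9, 1)]
    test = [(0.3, 0), (0.7, 1), (0.6, 1), (0.0, 0)]
    return train, test


def threshold_trainer(train: List[Sample]) -> Model:
    """Predict 1 when the feature exceeds the midpoint of the class means."""
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: int(x > threshold)


def majority_trainer(train: List[Sample]) -> Model:
    """Always predict the most frequent training label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def accuracy_evaluator(test: List[Sample], model: Model) -> float:
    """Fraction of test samples the model labels correctly."""
    return sum(model(x) == y for x, y in test) / len(test)


# Same recipe, two different trainers.
threshold_acc = toy_pipeline(toy_importer, threshold_trainer, accuracy_evaluator)
majority_acc = toy_pipeline(toy_importer, majority_trainer, accuracy_evaluator)
print(f"threshold: {threshold_acc}, majority: {majority_acc}")
```

Only the `trainer` argument changes between the two runs; the recipe itself is untouched, which is what makes comparing models straightforward.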

## Running ZenML Pipelines
Finally, to run our pipeline, we simply initialize it with concrete step functions and call the `run()` method.

In [None]:
digits_svc_pipeline = digits_pipeline(
    importer=importer(), trainer=svc_trainer(), evaluator=evaluator()
)
digits_svc_pipeline.run()

And that's it, we just built our first ML pipeline! Great job!

In the [next lesson](1-2_Artifact_Lineage.ipynb), you will see one of the coolest features of ML pipelines in action: automated artifact versioning and caching. See you there!