# How Data Flows In ZenML

Pipelines in ZenML are data-centric: data forms the link between the steps of a pipeline. In other words, the flow of pipeline execution is driven by data rather than by tasks or functions.

Since data holds such an integral position in the workflow, it is important to be able to track and maintain it seamlessly across all steps. In this chapter, we will see how ZenML automatically tracks your artifacts and all relevant metadata, and makes them available for use and analysis through first-class integrations with tools like Evidently, MLflow and Wandb, among others!

## ZenML behind the scenes

Let's look at a simple pipeline from before. ZenML works behind the scenes to store the outputs of each step and make them accessible to all other steps. 

In [None]:
from steps import importer, trainer, evaluator
from zenml.pipelines import pipeline
from zenml.steps import Output, step

In [None]:
# definition of our pipeline
@pipeline
def digits_pipeline(
    importer,
    trainer,
    evaluator,
):
    """Links all the steps together in a pipeline"""
    X_train, X_test, y_train, y_test = importer()
    model = trainer(X_train=X_train, y_train=y_train)
    evaluator(X_test=X_test, y_test=y_test, model=model)
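
To make the idea of data-driven execution concrete, here is a minimal sketch in plain Python that does not use ZenML at all. The `run_data_flow` helper, the dictionary-based store and the toy step functions are hypothetical stand-ins, assumed only for illustration; ZenML's actual implementation is considerably more involved.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-in for an artifact store: "step.output" -> stored value.
ArtifactStore = Dict[str, Any]
# A step is described by its name, its function, and a map from each input
# argument to the stored output it should consume.
StepSpec = Tuple[str, Callable[..., Dict[str, Any]], Dict[str, str]]


def run_data_flow(steps: List[StepSpec]) -> ArtifactStore:
    """Run steps in order, wiring each input to a previously stored output."""
    store: ArtifactStore = {}
    for name, func, input_map in steps:
        # Resolve every input from an artifact written by an earlier step.
        kwargs = {arg: store[source] for arg, source in input_map.items()}
        outputs = func(**kwargs)
        # Persist every named output so that later steps can consume it.
        for output_name, value in outputs.items():
            store[f"{name}.{output_name}"] = value
    return store


def importer() -> Dict[str, Any]:
    return {"train": [1, 2, 3, 4], "test": [5, 6]}


def trainer(train: List[int]) -> Dict[str, Any]:
    # The "model" in this toy example is just the mean of the training data.
    return {"model": sum(train) / len(train)}


def evaluator(test: List[int], model: float) -> Dict[str, Any]:
    # Mean absolute error of always predicting the training mean.
    return {"error": sum(abs(x - model) for x in test) / len(test)}


store = run_data_flow([
    ("importer", importer, {}),
    ("trainer", trainer, {"train": "importer.train"}),
    ("evaluator", evaluator, {"test": "importer.test", "model": "trainer.model"}),
])
```

The steps never call each other directly; they only agree on the names of the data they produce and consume, which is the property the pipeline definition above relies on.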

### Diving deeper
We can see that these steps are linked together with their inputs and outputs. If we dive into the code of one of the steps, we can notice that the artifacts for this step are strongly typed. In the example below, the output is clearly specified as an object of type `ClassifierMixin`. 

In [None]:
import numpy as np
from sklearn.base import ClassifierMixin
from sklearn.svm import SVC

@step
def svc_trainer(
    X_train: np.ndarray,
    y_train: np.ndarray,
) -> ClassifierMixin:
    """Train another simple sklearn classifier for the digits dataset."""
    # Fit a support vector classifier on the training data
    model = SVC(gamma=0.001)
    model.fit(X_train, y_train)
    return model

### Materializers

Knowing the type of the artifacts produced by a step allows ZenML to pair each type with its corresponding "materializer". Materializers in ZenML define the logic for storing an artifact as a specific file type and for reading it back. Built-in materializers support many types right out of the box, including those from libraries like NumPy, pandas, PyTorch and scikit-learn. If your output type is not yet supported by ZenML, you can easily implement a materializer of your own!
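
The pairing of a type with a materializer can be pictured as a lookup table. The following sketch uses plain Python only; `ToyMaterializer`, `build_registry` and `find_materializer` are hypothetical names invented for this illustration, and the fallback to parent types is an assumption of the sketch rather than a description of ZenML's internals.

```python
from typing import Dict, Type


class ToyMaterializer:
    """Hypothetical base class for this sketch (not ZenML's BaseMaterializer)."""

    ASSOCIATED_TYPES: tuple = ()


class NumberMaterializer(ToyMaterializer):
    ASSOCIATED_TYPES = (int, float)


class TextMaterializer(ToyMaterializer):
    ASSOCIATED_TYPES = (str,)


def build_registry(
    *materializers: Type[ToyMaterializer],
) -> Dict[type, Type[ToyMaterializer]]:
    """Map every associated type to the materializer class that handles it."""
    registry: Dict[type, Type[ToyMaterializer]] = {}
    for materializer in materializers:
        for data_type in materializer.ASSOCIATED_TYPES:
            registry[data_type] = materializer
    return registry


def find_materializer(
    registry: Dict[type, Type[ToyMaterializer]], data_type: type
) -> Type[ToyMaterializer]:
    """Return the materializer for a type, falling back to its parent types."""
    # Walking the MRO is an assumption of this sketch: it lets a subclass
    # reuse the materializer registered for one of its base classes.
    for candidate in data_type.__mro__:
        if candidate in registry:
            return registry[candidate]
    raise KeyError(f"No materializer registered for {data_type.__name__}")


registry = build_registry(NumberMaterializer, TextMaterializer)
float_materializer = find_materializer(registry, float)
bool_materializer = find_materializer(registry, bool)  # bool subclasses int
```

A type with no entry in the table raises a `KeyError` here, which corresponds to the case where you would need to write a materializer yourself.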

Let's build a custom materializer for the `ClassifierMixin` type used by our trainer step. ZenML already ships an implementation for this type, so ours is redundant, but it is a good exercise that shows how easy it is to replace a few values and have your own materializer.

In [None]:
import os
import pickle
from typing import Any, Type

from sklearn.base import ClassifierMixin
from zenml.io import fileio
from zenml.materializers.base_materializer import BaseMaterializer

DEFAULT_FILENAME = 'model'

class SklearnMaterializer(BaseMaterializer):
    """Materializer to read data to and from sklearn."""

    ASSOCIATED_TYPES = (
        ClassifierMixin,
    )

    def handle_input(
        self, data_type: Type[Any]
    ) -> ClassifierMixin:
        """Reads a ClassifierMixin model from a pickle file."""
        super().handle_input(data_type)
        filepath = os.path.join(self.artifact.uri, DEFAULT_FILENAME)
        with fileio.open(filepath, "rb") as fid:
            clf = pickle.load(fid)
        return clf

    def handle_return(
        self,
        clf: ClassifierMixin
    ) -> None:
        """Creates a pickle for a ClassifierMixin model

        Args:
            clf: A ClassifierMixin model.
        """
        super().handle_return(clf)
        filepath = os.path.join(self.artifact.uri, DEFAULT_FILENAME)
        with fileio.open(filepath, "wb") as fid:
            pickle.dump(clf, fid)


#### A few important points to notice
- The `ASSOCIATED_TYPES` field contains the types which you want this materializer to be used for. 
- The `handle_input` function holds the logic for reading the specific type.
- The `handle_return` function holds the logic for storing the type to a file format of your choice.

In this example, we have used `pickle` to save and load our `ClassifierMixin` Python object for simplicity. You can choose other, more specialised implementations depending on the type used and its corresponding best practices.
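
As an illustration of a format other than `pickle`, here is a standalone sketch that stores a dictionary as JSON. It deliberately does not inherit from `BaseMaterializer` and uses the built-in `open` instead of `fileio`, so that it runs without ZenML; the class name `JsonDictMaterializer`, the `uri` constructor argument and the file name are assumptions made for this example. Only the method names mirror the materializer above.

```python
import json
import os
import tempfile
from typing import Any, Dict

DEFAULT_FILENAME = "params.json"


class JsonDictMaterializer:
    """Standalone sketch: reads and writes a dict as a JSON file."""

    ASSOCIATED_TYPES = (dict,)

    def __init__(self, uri: str) -> None:
        # A local directory standing in for the artifact URI.
        self.uri = uri

    def handle_return(self, data: Dict[str, Any]) -> None:
        """Write the dict to a JSON file inside the artifact directory."""
        filepath = os.path.join(self.uri, DEFAULT_FILENAME)
        with open(filepath, "w", encoding="utf-8") as fid:
            json.dump(data, fid, sort_keys=True)

    def handle_input(self) -> Dict[str, Any]:
        """Read the dict back from the JSON file."""
        filepath = os.path.join(self.uri, DEFAULT_FILENAME)
        with open(filepath, "r", encoding="utf-8") as fid:
            return json.load(fid)


with tempfile.TemporaryDirectory() as tmp_dir:
    materializer = JsonDictMaterializer(tmp_dir)
    materializer.handle_return({"gamma": 0.001, "kernel": "rbf"})
    restored = materializer.handle_input()
```

Compared with `pickle`, a text format like JSON is human-readable and does not execute code on load, but it only works for types that can be expressed as plain JSON values.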

You can reuse this code and replace the values in the associated types to quickly build a materializer for your custom needs. For more examples and different implementations, check out our docs on [custom materializers](https://docs.zenml.io/guides/functional-api/materialize-artifacts#create-custom-materializer) and the code for built-in materializers on our GitHub!

When running the pipeline, you specify the custom materializer for your step with the `with_return_materializers` method.

In [None]:
# Initialize and run the pipeline
first_pipeline = digits_pipeline(
    importer=importer.importer(),
    trainer=trainer.svc_trainer_mlflow().with_return_materializers(SklearnMaterializer),
    evaluator=evaluator.evaluator(),
)
first_pipeline.run()

## Where do the artifacts go?
The artifacts are stored in an artifact store, which you can configure as a part of your stack. Chapter 2 deals with switching between stacks and shows how easy the process is.
The artifacts are referenced through a metadata store, which is also configurable through the stack. It holds the artifact URIs for all steps across all pipeline runs! Head over to our [concepts page](https://docs.zenml.io/core-concepts) to learn more.
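
The division of labour between the two stores can be sketched in a few lines of plain Python. `ToyArtifactStore` and `ToyMetadataStore` are hypothetical classes written for this illustration, with a temporary local folder assumed as the storage backend; they are not ZenML's implementations.

```python
import os
import pickle
import tempfile
from typing import Any, Dict, List, Tuple

# Key that identifies one output of one step in one pipeline run.
ArtifactKey = Tuple[str, str, str]


class ToyArtifactStore:
    """Hypothetical artifact store that pickles values under a root folder."""

    def __init__(self, root: str) -> None:
        self.root = root

    def write(self, key: ArtifactKey, value: Any) -> str:
        """Store the value and return the URI it was written to."""
        run, step, output = key
        directory = os.path.join(self.root, run, step)
        os.makedirs(directory, exist_ok=True)
        uri = os.path.join(directory, f"{output}.pkl")
        with open(uri, "wb") as fid:
            pickle.dump(value, fid)
        return uri

    def read(self, uri: str) -> Any:
        with open(uri, "rb") as fid:
            return pickle.load(fid)


class ToyMetadataStore:
    """Hypothetical metadata store that only remembers where artifacts live."""

    def __init__(self) -> None:
        self.uris: Dict[ArtifactKey, str] = {}

    def register(self, key: ArtifactKey, uri: str) -> None:
        self.uris[key] = uri

    def uris_for_step(self, step: str) -> List[str]:
        """Return the URIs of all outputs of a step across all runs."""
        return [uri for (_, s, _), uri in self.uris.items() if s == step]


with tempfile.TemporaryDirectory() as root:
    artifact_store = ToyArtifactStore(root)
    metadata_store = ToyMetadataStore()
    for run, value in [("run_1", {"gamma": 0.001}), ("run_2", {"gamma": 0.01})]:
        key = (run, "trainer", "output")
        metadata_store.register(key, artifact_store.write(key, value))
    trainer_uris = metadata_store.uris_for_step("trainer")
    loaded = [artifact_store.read(uri) for uri in trainer_uris]
```

The metadata store never holds the data itself, only references to it, which is why it can cheaply answer questions that span every run of a pipeline.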

### Accessing artifacts from within a step
The metadata store can be accessed from inside a step, which opens up many possibilities when it comes to interacting with your data and making smart decisions based on it.
Let's modify one of our steps to take a `StepContext` object as a parameter. `StepContext` provides additional context and is used to access materializers and artifact URIs inside a step function.

We will use the metadata store to fetch the trained models from all past runs of our pipeline and then select the best performing model.

In [None]:
from zenml.steps import StepContext

@step
def evaluate_best_model(
    X_test: np.ndarray,
    y_test: np.ndarray,  # must match the keyword used in the pipeline definition
    model: ClassifierMixin,
    context: StepContext,
) -> Output(current_acc=float, best_acc=float, model=ClassifierMixin):
    """Calculate the test accuracy and return the best model across all runs."""
    current_acc = model.score(X_test, y_test)
    best_acc = current_acc
    best_model = model
    print(f"Current test accuracy: {current_acc}")

    metadata_store = context.metadata_store  # can access all of metadata store's functions here

    pipeline_runs = metadata_store.get_pipeline("digits_pipeline").runs
    for run in pipeline_runs:
        # get the trained model of each past pipeline run
        past_model = run.get_step("trainer").output.read()
        accuracy = past_model.score(X_test, y_test)
        if accuracy > best_acc:
            # if the model accuracy is better than our currently-best model,
            # store it
            best_acc = accuracy
            best_model = past_model

    return current_acc, best_acc, best_model

Since this was the last step in our pipeline and we are not using the outputs from this step in any other steps, we won't need to redefine our pipeline for it to work. The only difference between the newer runs and the older runs would be that, after execution, we will now be able to get the best model stored as an output artifact for the last step (along with the accuracies). We will see more of how to access those artifacts in the next section.

In [None]:
# Initialize and run the pipeline
second_pipeline = digits_pipeline(
    importer=importer.importer(),
    trainer=trainer.svc_trainer_mlflow().with_return_materializers(SklearnMaterializer),
    evaluator=evaluate_best_model(),
)
second_pipeline.run()
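
The selection logic inside `evaluate_best_model` can be tried out in isolation. The sketch below is plain Python with no ZenML or scikit-learn involved: the `accuracy` and `select_best` helpers and the three toy "models" are hypothetical and exist only to show the comparison loop, with a list of callables assumed in place of the models read from past runs.

```python
from typing import Callable, Iterable, Sequence, Tuple

# In this sketch a "model" is any function that maps one input to a label.
Model = Callable[[int], int]


def accuracy(model: Model, X: Sequence[int], y: Sequence[int]) -> float:
    """Fraction of inputs for which the model predicts the right label."""
    correct = sum(1 for x, label in zip(X, y) if model(x) == label)
    return correct / len(y)


def select_best(
    current: Model,
    past_models: Iterable[Model],
    X: Sequence[int],
    y: Sequence[int],
) -> Tuple[float, float, Model]:
    """Return the current accuracy, the best accuracy and the best model."""
    current_acc = accuracy(current, X, y)
    best_acc, best_model = current_acc, current
    for candidate in past_models:
        candidate_acc = accuracy(candidate, X, y)
        # Strict comparison: on a tie, the earlier best model is kept.
        if candidate_acc > best_acc:
            best_acc, best_model = candidate_acc, candidate
    return current_acc, best_acc, best_model


def always_zero(x: int) -> int:
    return 0


def always_one(x: int) -> int:
    return 1


def parity(x: int) -> int:
    return x % 2


X_test = [0, 1, 2, 3]
y_test = [0, 1, 0, 1]
current_acc, best_acc, best_model = select_best(
    always_zero, [always_one, parity], X_test, y_test
)
```

Because every candidate is scored on the same test set, the accuracies are directly comparable, and `best_acc` can never be lower than `current_acc`.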

## Analyzing data from a pipeline run
Now that you understand the logic behind making your artifacts accessible between steps, let's move forward with our data journey. Your data is available to you not only between and within steps but also after a pipeline has finished executing. Data from historical pipeline runs is saved and versioned by ZenML and can be accessed in the post-execution workflow.
We will now use this feature to investigate the previous two pipeline runs and check whether the second pipeline returned the best accuracy value, as we designed it to.

In [None]:
from zenml.repository import Repository

repo = Repository()
# get our pipeline
pipeline = repo.get_pipeline(pipeline_name='digits_pipeline')
# this is the first_pipeline run
second_latest_run = pipeline.runs[-2]
# in this run, the third step (evaluator) has only one output,
# so we use the step's `output` property and read its value
accuracy_from_first_run = second_latest_run.steps[2].output.read()
accuracy_from_first_run

In [None]:
# this is the second_pipeline run
latest_run = pipeline.runs[-1]
# here, there are multiple outputs. we choose the current accuracy output
# and read its value
accuracy_from_second_run = latest_run.steps[2].outputs["current_acc"].read()
accuracy_from_second_run

Now, let's check what the best accuracy value from the latest pipeline run is, and whether it is the higher of the two accuracies that we have obtained.

In [None]:
# get the best accuracy output
best_accuracy = latest_run.steps[2].outputs["best_acc"].read()
best_accuracy

Note: The code above is valid only if the order of execution is `first_pipeline` first, then `second_pipeline`, and then this code, because the runs are selected by their position from the end of `pipeline.runs`.

## Visualizing data through integrations

ZenML offers integrations with tools like Evidently, MLflow, TensorBoard and Wandb, among others, to help you better track, visualize and learn from your data. [Chapter 1](./01%20-%20Training%20Pipeline.ipynb) shows how you can leverage the MLflow integration to track your artifacts and then visualize them through the MLflow UI.