---
description: Add more resources to your pipeline configuration.
---
Now that we have our pipeline up and running in the cloud, you might be wondering how ZenML figured out what sort of dependencies to install in the Docker image that we just ran on the VM. The answer lies in the runner script we executed (i.e., `run.py`), in particular, these lines:
```python
pipeline_args["config_path"] = os.path.join(
    config_folder, "training_rf.yaml"
)

# Configure the pipeline
training_pipeline_configured = training_pipeline.with_options(**pipeline_args)

# Create a run
training_pipeline_configured()
```
The above commands configure our training pipeline with a YAML configuration file called `training_rf.yaml` (found here in the source code). Let's learn more about this configuration file.
{% hint style="info" %}
The `with_options` command that points to a YAML config is only one way to configure a pipeline. We can also configure a pipeline or a step directly in the decorator: `@pipeline(settings=...)`. However, it is best not to mix code-based and YAML-based configuration, so that the codebase keeps a clean separation of concerns.
{% endhint %}
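For illustration only, decorator-based configuration of a single step might look like the following sketch. The step name and the requirement are placeholders; in this guide we stick to the YAML approach:

```python
from zenml import step
from zenml.config import DockerSettings

# Illustrative only: the step name and requirement are placeholders
@step(settings={"docker": DockerSettings(requirements=["pyarrow"])})
def my_step() -> None:
    ...
```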
The YAML configuration of a ZenML pipeline can be very simple, as in this case. Let's break it down and go through each section one by one:
```yaml
settings:
  docker:
    required_integrations:
      - sklearn
    requirements:
      - pyarrow
```
The first section is the so-called `settings` of the pipeline. This section has a `docker` key, which controls the containerization process. Here, we are simply telling ZenML that we need `pyarrow` as a pip requirement, and that we want to enable the `sklearn` integration of ZenML, which will in turn install the `scikit-learn` library. This Docker section can be populated with many different options, which correspond to the `DockerSettings` class in the Python SDK.
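For reference, here is a rough in-code counterpart of the Docker section above, a sketch only; in this guide the configuration stays in the YAML file:

```python
from zenml import pipeline
from zenml.config import DockerSettings

# The same Docker configuration as the YAML above, expressed in code
docker_settings = DockerSettings(
    required_integrations=["sklearn"],
    requirements=["pyarrow"],
)

@pipeline(settings={"docker": docker_settings})
def training_pipeline(model_type: str):
    ...
```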
The next section is about associating a ZenML Model with this pipeline.
```yaml
# Configuration of the Model Control Plane
model:
  name: breast_cancer_classifier
  version: rf
  license: Apache 2.0
  description: A breast cancer classifier
  tags: ["breast_cancer", "classifier"]
```
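The same association could also be made in code by passing a `Model` object to the pipeline decorator. This is a sketch, assuming a recent ZenML version where `Model` is importable from the top-level `zenml` package:

```python
from zenml import Model, pipeline

# The same Model Control Plane configuration as the YAML above
model = Model(
    name="breast_cancer_classifier",
    version="rf",
    license="Apache 2.0",
    description="A breast cancer classifier",
    tags=["breast_cancer", "classifier"],
)

@pipeline(model=model)
def training_pipeline(model_type: str):
    ...
```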
You will see that this configuration lines up with the model created after executing these pipelines:
{% tabs %}
{% tab title="CLI" %}

```shell
# List all versions of the breast_cancer_classifier
zenml model version list breast_cancer_classifier
```

{% endtab %}
{% tab title="Dashboard" %}
ZenML Pro ships with a Model Control Plane dashboard where you can visualize all the versions:

*All model versions listed*
{% endtab %}
{% endtabs %}

The last part of the config YAML is the `parameters` key:
```yaml
# Configure the pipeline
parameters:
  model_type: "rf"  # Choose between rf/sgd
```
This `parameters` key aligns with the parameters that the pipeline expects. In this case, the pipeline expects a string called `model_type` that will inform it which type of model to use:
```python
@pipeline
def training_pipeline(model_type: str):
    ...
```
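In other words, the `parameters` key is just another way of supplying these function arguments. Calling the pipeline directly with the same argument would have an equivalent effect, as in this small sketch:

```python
# Equivalent to `model_type: "rf"` under the YAML `parameters` key:
# pipeline parameters are ordinary function arguments
training_pipeline(model_type="rf")
```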
So you can see that the YAML config is fairly easy to use and is an important part of the codebase for controlling the execution of our pipeline. You can read more about how to configure a pipeline in the how-to section, but for now, we can move on to scaling our pipeline.
When we ran our pipeline with the above config, ZenML used some sane defaults to pick the resource requirements for that pipeline. However, in the real world, you might want to add more memory, CPU, or even a GPU depending on the pipeline at hand.
This is as easy as adding the following section to your local `training_rf.yaml` file:
```yaml
# These are the resources for the entire pipeline, i.e., each step
settings:
  ...
  # Adapt this to vm_gcp accordingly
  orchestrator:
    memory: 32  # in GB
...
steps:
  model_trainer:
    settings:
      orchestrator:
        cpus: 8
```
Here we are configuring the entire pipeline with a certain amount of memory, while for the trainer step we are additionally configuring 8 CPU cores. The `orchestrator` key corresponds to the `SkypilotBaseOrchestratorSettings` class in the Python SDK.
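If you prefer to keep this in code, the same values can be passed as a settings dictionary in the decorator (or via `with_options`); ZenML maps the `orchestrator` key to the active orchestrator's settings class. A minimal sketch, with the YAML left as the source of truth in this guide:

```python
from zenml import pipeline

# Same idea as the pipeline-level YAML above; the dictionary values are
# validated against the active orchestrator's settings class
# (SkypilotBaseOrchestratorSettings in this case)
@pipeline(settings={"orchestrator": {"memory": 32}})
def training_pipeline(model_type: str):
    ...
```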
Instructions for Microsoft Azure Users
As discussed before, we are using the Kubernetes orchestrator for Azure users. In order to scale compute for the Kubernetes orchestrator, the YAML file needs to look like this:
```yaml
# These are the resources for the entire pipeline, i.e., each step
settings:
  ...
  resources:
    memory: "32GB"
...
steps:
  model_trainer:
    settings:
      resources:
        memory: "8GB"
```
{% hint style="info" %} Read more about settings in ZenML here and here {% endhint %}
Now let's run the pipeline again:
```shell
python run.py --training-pipeline
```
You should notice that the machine provisioned on your cloud provider has a different configuration compared to last time. As easy as that!
Bear in mind that not every orchestrator supports `ResourceSettings` directly. To learn more, you can read about `ResourceSettings` here, including the ability to attach a GPU.
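For orchestrators that do support it, a per-step resource request, including a GPU, might look like the following sketch in code; the step name is illustrative:

```python
from zenml import step
from zenml.config import ResourceSettings

# Request CPU, memory, and a GPU for a single step
@step(settings={"resources": ResourceSettings(cpu_count=8, gpu_count=1, memory="8GB")})
def model_trainer() -> None:
    ...
```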