# Run pipelines in Azure Machine Learning
In Azure Machine Learning, you can experiment in notebooks and train (and retrain) machine learning models by running scripts as jobs.

In an enterprise data science process, you'll want to separate the overall process into individual tasks. You can group tasks together as pipelines. Pipelines are key to implementing an effective Machine Learning Operations (MLOps) solution in Azure.

You'll learn how to create components of individual tasks, making it easier to reuse and share code. You'll then combine components into an Azure Machine Learning pipeline, which you'll run as a pipeline job.

>The term pipeline is used extensively across various domains, including machine learning and software engineering. In Azure Machine Learning, a pipeline contains steps related to the training of a machine learning model. In Azure DevOps or GitHub, a pipeline can refer to a build or release pipeline, which performs the build and configuration tasks required to deliver software. In Azure Synapse Analytics, a pipeline is used to define the data ingestion and transformation process. The focus of this module is on Azure Machine Learning pipelines. However, bear in mind that pipelines in different services can interact with each other. For example, an Azure DevOps or Azure Synapse Analytics pipeline can trigger an Azure Machine Learning pipeline.

Learn more about MLOps in relation to Azure Machine Learning with an [introduction to machine learning operations](https://learn.microsoft.com/en-us/training/paths/introduction-machine-learn-operations).

## Create components
Components allow you to create reusable scripts that can easily be shared across users within the same Azure Machine Learning workspace. You can also use components to build an Azure Machine Learning pipeline.

There are two main reasons why you'd use components:
- To build a pipeline.
- To share ready-to-go code.

You'll want to create components when you're preparing your code for scale, that is, when you're done with experimenting and developing and are ready to move your model to production.

Within Azure Machine Learning, you can create a component to store code (in your preferred language) within the workspace. Ideally, you design a component to perform a specific action that is relevant to your machine learning workflow.

For example, a component may consist of a Python script that normalizes your data, trains a machine learning model, or evaluates a model.

Components can be easily shared with other Azure Machine Learning users, who can reuse them in their own Azure Machine Learning pipelines.

![alt text](assets/01-01-components.png)

A component consists of three parts:
- Metadata: Includes the component's name, version, etc.
- Interface: Includes the expected input parameters (like a dataset or hyperparameter) and expected output (like metrics and artifacts).
- Command, code and environment: Specifies how to run the code.

To create a component, you need two files:
- A script that contains the workflow you want to execute.
- A YAML file to define the metadata, interface, and command, code, and environment of the component.

You can write the YAML file yourself, or use the `command_component()` function as a decorator to define the component directly in Python.

>Here, we'll focus on creating a YAML file to create a component. Alternatively, learn more about [how to create components using command_component()](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-component-pipeline-python).


For example, you may have a Python script `prep.py` that prepares the data by removing missing values and normalizing the data:


In [None]:
# import libraries
import argparse
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.preprocessing import MinMaxScaler

# setup arg parser
parser = argparse.ArgumentParser()

# add arguments
parser.add_argument("--input_data", dest='input_data',
                    type=str)
parser.add_argument("--output_data", dest='output_data',
                    type=str)

# parse args
args = parser.parse_args()

# read the data
df = pd.read_csv(args.input_data)

# remove missing values
df = df.dropna()

# normalize the data    
scaler = MinMaxScaler()
num_cols = ['feature1','feature2','feature3','feature4']
df[num_cols] = scaler.fit_transform(df[num_cols])

# save the data as a csv
# (to_csv returns None when given a path, so there is nothing to assign)
df.to_csv(
    Path(args.output_data) / "prepped-data.csv",
    index=False,
)
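
To make the two preparation steps concrete, here is a minimal standard-library sketch of dropping incomplete rows and min-max scaling each column with (x - min) / (max - min). The sample rows and the helper names `drop_missing` and `min_max_scale` are made up for illustration; `prep.py` itself relies on `DataFrame.dropna` and `MinMaxScaler`.

```python
def drop_missing(rows):
    """Keep only rows in which every value is present."""
    return [row for row in rows if all(value is not None for value in row)]


def min_max_scale(column):
    """Rescale a list of numbers to the [0, 1] range: (x - min) / (max - min)."""
    low, high = min(column), max(column)
    if high == low:
        # A constant column has no spread; this sketch maps it to 0.0.
        return [0.0 for _ in column]
    return [(value - low) / (high - low) for value in column]


# Hypothetical two-column dataset; the second row has a missing value.
rows = [(10.0, 1.0), (None, 2.0), (20.0, 3.0), (30.0, 5.0)]
clean_rows = drop_missing(rows)

# Scale each column independently, then rebuild the rows.
columns = [min_max_scale(list(col)) for col in zip(*clean_rows)]
scaled_rows = list(zip(*columns))
print(scaled_rows)  # [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
```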

To create a component for the `prep.py` script, you'll need a YAML file `prep.yml`:

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: prep_data
display_name: Prepare training data
version: 1
type: command
inputs:
  input_data: 
    type: uri_file
outputs:
  output_data:
    type: uri_folder
code: ./src
environment: azureml:AzureML-sklearn-0.24-ubuntu18.04-py37-cpu@latest
command: >-
  python prep.py 
  --input_data ${{inputs.input_data}}
  --output_data ${{outputs.output_data}}
```

Notice that the YAML file refers to the `prep.py` script, which is stored in the `src` folder. You can load the component with the following code:


In [None]:
from azure.ai.ml import load_component
parent_dir = ""

loaded_component_prep = load_component(source=parent_dir + "./prep.yml")

When you've loaded the component, you can use it in a pipeline or register the component.
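
The `${{inputs.input_data}}` and `${{outputs.output_data}}` placeholders in the component's `command` are filled in with concrete paths when the component runs. The following sketch imitates that substitution with a regular expression. It only illustrates the idea and is not how Azure Machine Learning implements it; the `render_command` helper and the example paths are hypothetical.

```python
import re

# Matches placeholders such as ${{inputs.input_data}} or ${{outputs.output_data}}.
PLACEHOLDER = re.compile(r"\$\{\{\s*(inputs|outputs)\.(\w+)\s*\}\}")


def render_command(template, inputs, outputs):
    """Replace each ${{inputs.x}} / ${{outputs.y}} placeholder with a concrete path."""
    values = {"inputs": inputs, "outputs": outputs}

    def substitute(match):
        section, name = match.group(1), match.group(2)
        # Raises KeyError if the name is not declared in the interface.
        return values[section][name]

    return PLACEHOLDER.sub(substitute, template)


template = (
    "python prep.py "
    "--input_data ${{inputs.input_data}} "
    "--output_data ${{outputs.output_data}}"
)

# Hypothetical paths, chosen only for this example.
command = render_command(
    template,
    inputs={"input_data": "/mnt/inputs/raw.csv"},
    outputs={"output_data": "/mnt/outputs/prepped"},
)
print(command)
```

Because every placeholder must resolve to a name declared under `inputs` or `outputs`, a typo in the command shows up as a missing key instead of being passed to the script unnoticed.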

## Register a component
To use components in a pipeline, you'll need the script and the YAML file. To make the components accessible to other users in the workspace, you can also register components to the Azure Machine Learning workspace.

You can register a component with the following code:


In [None]:
# register the component that was loaded from prep.yml
prep = ml_client.components.create_or_update(loaded_component_prep)
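
The `ml_client` object used above isn't defined in this section. A minimal sketch of how it might be created is shown below, assuming the `azure-ai-ml` and `azure-identity` packages are installed and that `DefaultAzureCredential` can find valid credentials. The `connect_to_workspace` helper and its argument values are placeholders, not part of the SDK.

```python
def connect_to_workspace(subscription_id, resource_group, workspace_name):
    """Return an MLClient for the given workspace."""
    # Imported here so the SDK is only required when a connection is actually made.
    from azure.ai.ml import MLClient
    from azure.identity import DefaultAzureCredential

    return MLClient(
        credential=DefaultAzureCredential(),
        subscription_id=subscription_id,
        resource_group_name=resource_group,
        workspace_name=workspace_name,
    )


# Usage (placeholder values; replace them with your own):
# ml_client = connect_to_workspace(
#     "<subscription-id>", "<resource-group>", "<workspace-name>"
# )
```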

# Create a pipeline
In Azure Machine Learning, a pipeline is a workflow of machine learning tasks in which each task is defined as a component.

Components can be arranged sequentially or in parallel, enabling you to build sophisticated flow logic to orchestrate machine learning operations. Each component can be run on a specific compute target, making it possible to combine different types of processing as required to achieve an overall goal.

A pipeline can be executed as a process by running the pipeline as a pipeline job. Each component is executed as a child job as part of the overall pipeline job.
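
One way to picture sequential and parallel arrangement is as a dependency graph: a step can start once every step it depends on has finished, and steps with no dependency on each other can run side by side. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to group steps into stages. The `profile_data` and `evaluate_model` steps are hypothetical additions for illustration, and this isn't how Azure Machine Learning schedules child jobs internally.

```python
from graphlib import TopologicalSorter

# Each key is a step; the set lists the steps whose outputs it consumes.
dependencies = {
    "prep_data": set(),
    "train_model": {"prep_data"},
    "profile_data": {"prep_data"},      # hypothetical step
    "evaluate_model": {"train_model"},  # hypothetical step
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

stages = []
while sorter.is_active():
    # Steps whose dependencies have all completed can run in parallel.
    ready = sorted(sorter.get_ready())
    stages.append(ready)
    sorter.done(*ready)

print(stages)
# [['prep_data'], ['profile_data', 'train_model'], ['evaluate_model']]
```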

## Build a pipeline
An Azure Machine Learning pipeline is defined in a YAML file. The YAML file includes the pipeline job name, inputs, outputs, and settings.
You can create the YAML file, or use the `@pipeline()` function to create the YAML file.

>Review the [reference documentation for the @pipeline() function](https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.dsl).

For example, if you want to build a pipeline that first prepares the data, and then trains the model, you can use the following code:


In [None]:
from azure.ai.ml.dsl import pipeline

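# loaded_component_train is assumed to have been loaded with load_component(),
# in the same way as loaded_component_prep
@pipeline()
def pipeline_function_name(pipeline_job_input):
    prep_data = loaded_component_prep(input_data=pipeline_job_input)
    train_model = loaded_component_train(training_data=prep_data.outputs.output_data)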

    return {
        "pipeline_job_transformed_data": prep_data.outputs.output_data,
        "pipeline_job_trained_model": train_model.outputs.model_output,
    }

To pass a registered data asset as the pipeline job input, you can call the function you created with the data asset as input:


In [None]:
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

pipeline_job = pipeline_function_name(
    Input(type=AssetTypes.URI_FILE, path="azureml:data:1")
)

The `@pipeline()` function builds a pipeline consisting of two sequential steps, represented by the two loaded components.

To understand the pipeline built in the example, let's explore it step by step:

1. The pipeline is built by defining the function `pipeline_function_name`.
2. The pipeline function expects `pipeline_job_input` as the overall pipeline input.
3. The first pipeline step requires a value for the input parameter `input_data`. The value for the input will be the value of `pipeline_job_input`.
4. The first pipeline step is defined by the loaded component for `prep_data`.
5. The value of the `output_data` of the first pipeline step is used for the expected input `training_data` of the second pipeline step.
6. The second pipeline step is defined by the loaded component for `train_model` and results in a trained model referred to by `model_output`.
7. Pipeline outputs are defined by returning variables from the pipeline function. There are two outputs:
    - `pipeline_job_transformed_data` with the value of `prep_data.outputs.output_data`
    - `pipeline_job_trained_model` with the value of `train_model.outputs.model_output`

![alt text](assets/pipeline-overview.png)

Calling the function decorated with `@pipeline()` returns a pipeline job object. You can review its configuration, which is rendered as YAML, by printing the `pipeline_job` object you created when calling the function:

```py
print(pipeline_job)
```

The output will be formatted as a YAML file, which includes the configuration of the pipeline and its components. Some parameters included in the YAML file are shown in the following example.

```yaml
display_name: pipeline_function_name
type: pipeline
inputs:
  pipeline_job_input:
    type: uri_file
    path: azureml:data:1
outputs:
  pipeline_job_transformed_data: null
  pipeline_job_trained_model: null
jobs:
  prep_data:
    type: command
    inputs:
      input_data:
        path: ${{parent.inputs.pipeline_job_input}}
    outputs:
      output_data: ${{parent.outputs.pipeline_job_transformed_data}}
  train_model:
    type: command
    inputs:
      training_data:
        path: ${{parent.jobs.prep_data.outputs.output_data}}
    outputs:
      model_output: ${{parent.outputs.pipeline_job_trained_model}}
tags: {}
properties: {}
settings: {}
```

>Learn more about [the pipeline job YAML schema to explore which parameters are included when building a component-based pipeline](https://learn.microsoft.com/en-us/azure/machine-learning/reference-yaml-job-pipeline).


# Run a pipeline job
When you've built a component-based pipeline in Azure Machine Learning, you can run the workflow as a pipeline job.

## Configure a pipeline job
A pipeline is defined in a YAML file, which you can also create using the `@pipeline()` function. After you've used the function, you can edit the pipeline configurations by specifying which parameters you want to change and the new value.

For example, you may want to change the output mode for the pipeline job outputs:

In [None]:
# change the output mode
pipeline_job.outputs.pipeline_job_transformed_data.mode = "upload"
pipeline_job.outputs.pipeline_job_trained_model.mode = "upload"

Or, you may want to set the default pipeline compute. When a compute isn't specified for a component, it will use the default compute instead:

In [None]:
# set pipeline level compute
pipeline_job.settings.default_compute = "aml-cluster"
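
The fallback rule described above can be sketched in a few lines of plain Python. The `resolve_compute` helper, the `gpu-cluster` name, and the step dictionary are hypothetical and only illustrate the rule; they aren't part of the SDK.

```python
def resolve_compute(step_compute, default_compute):
    """Return the compute a step runs on: its own if set, else the pipeline default."""
    if step_compute is not None:
        return step_compute
    if default_compute is None:
        raise ValueError("No compute set on the step and no pipeline default configured")
    return default_compute


# Hypothetical steps: only train_model overrides the compute.
step_computes = {"prep_data": None, "train_model": "gpu-cluster"}
resolved = {
    name: resolve_compute(compute, default_compute="aml-cluster")
    for name, compute in step_computes.items()
}
print(resolved)  # {'prep_data': 'aml-cluster', 'train_model': 'gpu-cluster'}
```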

You may also want to change the default datastore, which is where all outputs are stored:

In [None]:
# set pipeline level datastore
pipeline_job.settings.default_datastore = "workspaceblobstore"

To review your pipeline configuration, you can print the pipeline job object:

In [None]:
print(pipeline_job)

## Run a pipeline job
When you've configured the pipeline, you're ready to run the workflow as a pipeline job.

To submit the pipeline job, run the following code:

In [None]:
# submit job to workspace
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="pipeline_job"
)
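
In a notebook or script, you often want to wait for the submitted job to finish before continuing. A minimal sketch follows, assuming `ml_client` is an authenticated `MLClient`. The `submit_and_wait` helper is hypothetical; it wraps the SDK's `jobs.create_or_update`, `jobs.stream`, and `jobs.get` calls. Depending on the SDK version, `jobs.stream` may raise an exception when the job fails, in which case no status is returned.

```python
def submit_and_wait(ml_client, pipeline_job, experiment_name):
    """Submit a pipeline job, stream its logs until it ends, and return the final status."""
    submitted = ml_client.jobs.create_or_update(
        pipeline_job, experiment_name=experiment_name
    )
    # jobs.stream blocks and prints log output until the job reaches a terminal state.
    ml_client.jobs.stream(submitted.name)
    # Fetch the job again so the returned status reflects the finished run.
    return ml_client.jobs.get(submitted.name).status


# Usage:
# status = submit_and_wait(ml_client, pipeline_job, experiment_name="pipeline_job")
```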

After you submit a pipeline job, a new job will be created in the Azure Machine Learning workspace. A pipeline job also contains child jobs, which represent the execution of the individual components. The Azure Machine Learning studio creates a graphical representation of your pipeline. You can expand the Job overview to explore the pipeline parameters, outputs, and child jobs:

![alt text](assets/pipeline-output.png)

To troubleshoot a failed pipeline, you can check the outputs and logs of the pipeline job and its child jobs.

- If there's an issue with the configuration of the pipeline itself, you'll find more information in the outputs and logs of the pipeline job.
- If there's an issue with the configuration of a component, you'll find more information in the outputs and logs of the child job of the failed component.
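
To find out which component failed without opening the studio, you can list the child jobs of a pipeline job and filter on their status. The sketch below assumes `ml_client` is an authenticated `MLClient` and that a failed job reports the status string `"Failed"`. The `failed_child_jobs` helper is hypothetical.

```python
def failed_child_jobs(ml_client, pipeline_job_name):
    """Return the names of the child jobs of a pipeline job whose status is 'Failed'."""
    children = ml_client.jobs.list(parent_job_name=pipeline_job_name)
    return [child.name for child in children if child.status == "Failed"]


# Usage:
# for name in failed_child_jobs(ml_client, pipeline_job.name):
#     print("Check the logs of child job:", name)
```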


## Schedule a pipeline job
A pipeline is ideal if you want to get your model ready for production. Pipelines are especially useful for automating the retraining of a machine learning model. To automate the retraining of a model, you can schedule a pipeline.

To schedule a pipeline job, you'll use the `JobSchedule` class to associate a schedule with a pipeline job.

There are various ways to create a schedule. A simple approach is to create a time-based schedule using the `RecurrenceTrigger` class with the following parameters:
- `frequency`: Unit of time to describe how often the schedule fires. Value can be either `minute`, `hour`, `day`, `week`, or `month`.
- `interval`: Number of frequency units to describe how often the schedule fires. Value needs to be an integer.
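
To see how `frequency` and `interval` combine, the sketch below computes the first few fire times of a simple recurrence with the standard `datetime` module. It is illustrative arithmetic only, not the service's scheduling logic. The `next_fire_times` helper is hypothetical, and `month` is left out because months vary in length.

```python
from datetime import datetime, timedelta

# Fixed-length frequency units only.
UNIT_TO_DELTA = {
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
}


def next_fire_times(start, frequency, interval, count):
    """Return the first `count` fire times after `start` for a simple recurrence."""
    if not isinstance(interval, int) or interval < 1:
        raise ValueError("interval must be a positive integer")
    step = UNIT_TO_DELTA[frequency] * interval
    return [start + step * n for n in range(1, count + 1)]


start = datetime(2024, 1, 1, 9, 0)
times = next_fire_times(start, frequency="hour", interval=6, count=3)
print([t.isoformat() for t in times])
# ['2024-01-01T15:00:00', '2024-01-01T21:00:00', '2024-01-02T03:00:00']
```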

To create a schedule that fires every minute, run the following code:


In [None]:
from azure.ai.ml.entities import RecurrenceTrigger

schedule_name = "run_every_minute"

recurrence_trigger = RecurrenceTrigger(
    frequency="minute",
    interval=1,
)

To schedule a pipeline, you'll need `pipeline_job` to represent the pipeline you've built:


In [None]:
from azure.ai.ml.entities import JobSchedule

job_schedule = JobSchedule(
    name=schedule_name, trigger=recurrence_trigger, create_job=pipeline_job
)

job_schedule = ml_client.schedules.begin_create_or_update(
    schedule=job_schedule
).result()

The display names of the jobs triggered by the schedule will be prefixed with the name of your schedule. You can review the jobs in the Azure Machine Learning studio:

![alt text](assets/scheduled-jobs.png)

To delete a schedule, you first need to disable it:

In [None]:
ml_client.schedules.begin_disable(name=schedule_name).result()
ml_client.schedules.begin_delete(name=schedule_name).result()

>Learn more about [the schedules you can create to trigger pipeline jobs in Azure Machine Learning](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-schedule-pipeline-job?tabs=python%3Fazure-portal%3Dtrue). Or, explore an [example notebook to learn how to work with schedules](https://github.com/Azure/azureml-examples/blob/main/sdk/python/schedules/job-schedule.ipynb).