**What is Kubeflow Pipelines?**  
Kubeflow Pipelines (KFP) is a platform for building and deploying portable and scalable machine learning (ML) workflows using containers on Kubernetes-based systems.

With KFP you can author components and pipelines using the KFP Python SDK, compile pipelines to an intermediate representation YAML, and submit the pipeline to run on a KFP-conformant backend such as the open source KFP backend or Google Cloud Vertex AI Pipelines.

**What is a pipeline?**  
A pipeline is a description of a machine learning (ML) workflow, including all of the components in the workflow and how the components relate to each other in the form of a directed acyclic graph (DAG). The pipeline configuration includes the definition of the inputs (parameters) required to run the pipeline and the inputs and outputs of each component.

When you run a pipeline, the system launches one or more Kubernetes Pods corresponding to the steps (components) in your workflow (pipeline). The Pods start Docker containers, and the containers in turn start your programs. Pipelines may also feature control flow.
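To make the DAG idea concrete, the following minimal sketch uses Python's standard-library `graphlib` to derive a valid execution order from step dependencies. The step names are hypothetical, and this is a plain-Python illustration of dependency ordering, not how the KFP backend schedules Pods.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a step, each value is the set of
# steps whose outputs it depends on.
dependencies = {
    "load_data": set(),
    "preprocess": {"load_data"},
    "train": {"preprocess"},
    "evaluate": {"train", "preprocess"},
}

# static_order() yields every step only after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

A cycle in `dependencies` would raise `graphlib.CycleError`, which is why a workflow must be acyclic to have a well-defined execution order.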

## Concepts

### Pipeline Root
[Pipeline root](https://www.kubeflow.org/docs/components/pipelines/concepts/pipeline-root/) represents the path within an object store bucket where Kubeflow Pipelines stores a pipeline’s artifacts.

### Component
A pipeline component is the fundamental building block of a Kubeflow Pipelines pipeline. A component packages a functional unit of code along with its dependencies, so that it can be run as part of a workflow in a Kubernetes environment. Components can be combined in a pipeline that creates a repeatable workflow, with individual components coordinating on inputs and outputs like parameters and artifacts.

A component is similar to a programming function. It is most often implemented as a wrapper to a Python function using the KFP Python SDK. However, a KFP component goes further than a simple function, with support for code dependencies, runtime environments, and distributed execution requirements.

KFP components are designed to simplify constructing and running ML workflows in a Kubernetes environment.

#### What Does a Component Consist Of?
A KFP component consists of the following key elements:

1. Code  
    - Typically a Python function, but can be other code such as a Bash command.
2. Dependency Support  
    - **Python libraries** - to be installed at runtime
    - **Environment variables** - to be available in the runtime environment
    - **Python package indices** - (for example, private PyPI servers) if needed to support installations
    - **Cluster resources** - to support use of ConfigMaps, Secrets, PersistentVolumeClaims, and more
    - **Runtime dependencies** - to support CPU, memory, and GPU requests and limits
3. Base Image
    - Defines the base container runtime environment (defaults to a generic Python base image)
    - May include system dependencies and pre-installed Python libraries
4. Input/Output (I/O) Specification
    - Individual components cannot share in-memory data with each other, so they use the following concepts to support exchanging information and publishing results:
        - **Parameters** - for small values
        - **Artifacts** - for larger data like model files, processed datasets, and metadata

#### Python-Based Components
- The `@component` decorator helps the KFP Python SDK supply the context needed to run these functions in containers as part of a KFP pipeline.
- By default, Python functions run on the default base image (`kfp.dsl.component_factory._DEFAULT_BASE_IMAGE`).
- Layers of customization can be added to a component by supplying a specific `base_image` and a list of `packages_to_install`.
- Because the function runs inside a container (and won’t have the script context), all Python library dependencies must be imported within the component function.
- Use KFP’s `Output[<Artifact>]` class to create a KFP artifact type output.
- Inputs and outputs are defined as Python function parameters.
- Dependencies can often be installed at runtime, avoiding the need for custom base containers.
- Python-based components give close access to the Python tools that ML experimenters rely on, like modules and imports, usage information, type hints, and debugging tools.

#### YAML-Based Components
The KFP backend uses YAML-based definitions to specify components. While the KFP Python SDK performs this conversion automatically when a Python-based pipeline is submitted, some use cases benefit from writing the YAML-based component directly.

A YAML-based component definition has the following parts:
- **Metadata**: name, description, etc.
- **Interface**: input/output specifications (name, type, description, default value, etc.).
- **Implementation**: A specification of how to run the component given a set of argument values for the component’s inputs. The implementation section also describes how to get the output values from the component once the component has finished running.

YAML-based components support system commands directly. In fact, any command (or binary) that exists on the base image can be run.

YAML-based components can be loaded for use in the Python SDK alongside Python-based components:

```python
from kfp.components import load_component_from_file

my_comp = load_component_from_file("my_component.yaml")
```

#### “Containerize” a Component
The KFP command-line tool contains a build command to help users “containerize” a component. This can be used to create the `Dockerfile`, `runtime-dependencies.txt`, and other supporting files, or even to build the custom image and push it to a registry. In order to use this utility, the `target_image` parameter must be set in the Python-based component definition, which itself is saved in a file.

```bash
# build Dockerfile and runtime-dependencies.txt
kfp component build --component-filepattern the_component.py --no-build-image --platform linux/amd64 .
```

Creating and maintaining custom containers can carry a significant maintenance burden. In general, a 1-to-1 relationship between components and containers is not needed or recommended, as AI/ML work is often highly iterative. A best practice is to work with a small set of base images that can support many components. If you need more control over the container build than the kfp CLI provides, consider using a container CLI like docker or podman.

### Run and Recurring Run
A **run** is a single execution of a pipeline. Runs comprise an immutable log of all experiments that you attempt, and are designed to be self-contained to allow for reproducibility. 

A **recurring run**, or job in the Kubeflow Pipelines backend APIs, is a repeatable run of a pipeline. The configuration for a recurring run includes a copy of a pipeline with all parameter values specified and a run trigger.

### Run Trigger
A run trigger is a flag that tells the system when a recurring run configuration spawns a new run. The following types of run trigger are available:

- Periodic: for an interval-based scheduling of runs (for example: every 2 hours or every 45 minutes).
- Cron: for specifying `cron` semantics for scheduling runs.
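As a rough illustration of the periodic trigger, the sketch below computes the first few trigger times of an interval-based schedule using only the standard library. The function name `next_periodic_runs` is hypothetical, and the choice that the first run fires one interval after the start time is an assumption made for this example, not a statement about KFP scheduler behavior.

```python
from datetime import datetime, timedelta
from typing import List


def next_periodic_runs(start: datetime, interval: timedelta, count: int) -> List[datetime]:
    """Return the first `count` trigger times of an interval-based schedule."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    # Assumption: the first run fires one full interval after `start`.
    return [start + i * interval for i in range(1, count + 1)]


start = datetime(2024, 1, 1, 0, 0)
runs = next_periodic_runs(start, timedelta(minutes=45), count=3)
print([t.strftime("%H:%M") for t in runs])
```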

### Step
A step is an execution of one of the components in the pipeline. The relationship between a step and its component is one of instantiation, much like the relationship between a run and its pipeline. In a complex pipeline, components can execute multiple times in loops, or conditionally after resolving an if/else-like clause in the pipeline code.

### Output Artifact
An output artifact is an output emitted by a pipeline component. It’s useful for pipeline components to include artifacts so that you can provide for performance evaluation, quick decision making for the run, or comparison across different runs. Artifacts also make it possible to understand how the pipeline’s various components work. An artifact can range from a plain textual view of the data to rich interactive visualizations.

Within the scope of a component, artifacts can be read (for inputs) and written (for outputs) via the `.path` attribute. The KFP backend ensures that input artifact files are copied to the executing pod’s local file system from the remote storage at runtime, so that the component function can read input artifacts from the local file system. By comparison, output artifact files are copied from the local file system of the pod to remote storage, when the component finishes running. This way, the output artifacts persist outside the pod. In both cases, the component author needs to interact with the local file system only to create persistent artifacts.

```python
from kfp import dsl
from kfp.dsl import Dataset, Output


@dsl.component(packages_to_install=['pandas==1.3.5'])
def create_dataset(iris_dataset: Output[Dataset]):
    import pandas as pd

    csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    col_names = [
        'Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Labels'
    ]
    df = pd.read_csv(csv_url, names=col_names)

    with open(iris_dataset.path, 'w') as f:
        df.to_csv(f)
```

### ML Metadata
The Kubeflow Pipelines backend stores runtime information about a pipeline run in the Metadata store. Runtime information includes the status of a task, the availability of artifacts, custom properties associated with an Execution or Artifact, and so on.

You can view the connection between Artifacts and Executions across Pipeline Runs, if one Artifact is being used by multiple Executions in different Runs. This connection visualization is called a **Lineage Graph**.
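The idea behind a lineage graph can be sketched with plain dictionaries. The execution records below are hypothetical and the data model is a simplification for illustration; it does not reflect the actual Metadata store schema.

```python
from collections import defaultdict

# Hypothetical execution records:
# (run, execution, input artifacts, output artifacts)
executions = [
    ("run-1", "create_dataset", [], ["iris.csv"]),
    ("run-1", "train", ["iris.csv"], ["model-a"]),
    ("run-2", "train", ["iris.csv"], ["model-b"]),
]

producers = {}                 # artifact -> (run, execution) that wrote it
consumers = defaultdict(list)  # artifact -> [(run, execution)] that read it

for run, execution, inputs, outputs in executions:
    for artifact in inputs:
        consumers[artifact].append((run, execution))
    for artifact in outputs:
        producers[artifact] = (run, execution)

# One artifact consumed by executions in two different runs.
print(producers["iris.csv"], consumers["iris.csv"])
```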

## Core Functions

### Compile a Pipeline
The compiler writes a YAML file, for example `pipeline.yaml`, which contains a hermetic representation of your pipeline. The output is called the Intermediate Representation (IR) YAML, which is a serialized `PipelineSpec` protocol buffer message.

Because components are actually pipelines, you may also compile them to IR YAML.

#### Type checking
By default, the DSL compiler statically type checks a pipeline to ensure type consistency between components that pass data between one another. Static type checking helps identify component I/O inconsistencies without having to run the pipeline, shortening development iterations.

Specifically, the type checker checks for type equality between the type of data a component input expects and the type of the data provided.

For example, for parameters, a list input may only be passed to parameters with a `typing.List` annotation. Similarly, a float may only be passed to parameters with a `float` annotation.

Input data types and annotations must also match for artifacts, with one exception: the `Artifact` type is compatible with all other artifact types. In this sense, the `Artifact` type is both the default artifact type and an artifact “any” type.
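The compatibility rules above can be modeled in a few lines. This is a simplified sketch of the stated rules only, not the KFP compiler's implementation; the function names are hypothetical, and artifact types are represented as plain strings for brevity.

```python
GENERIC_ARTIFACT = "Artifact"


def parameter_types_compatible(provided: type, expected: type) -> bool:
    """Parameters require strict type equality."""
    return provided is expected


def artifact_types_compatible(provided: str, expected: str) -> bool:
    """Artifact types must match, except that the generic `Artifact`
    type is compatible with every other artifact type."""
    if GENERIC_ARTIFACT in (provided, expected):
        return True
    return provided == expected


print(parameter_types_compatible(int, float))          # False
print(artifact_types_compatible("Dataset", "Model"))    # False
print(artifact_types_compatible("Dataset", "Artifact"))  # True
```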

#### Compiler arguments
The `Compiler.compile` method accepts the following arguments:

Name | Type | Description
-----|------|------------
`pipeline_func` | `function` | (Required) Pipeline function constructed with the `@dsl.pipeline` decorator, or component constructed with the `@dsl.component` decorator.
`package_path` | `string` | (Required) Output YAML file path. For example, `~/my_pipeline.yaml` or `~/my_component.yaml`.
`pipeline_name` | `string` | (Optional) If specified, sets the name of the pipeline template in the `pipelineInfo.name` field in the compiled IR YAML output. Overrides the `name` of the pipeline or component specified by the `name` parameter in the `@dsl.pipeline` decorator.
`pipeline_parameters` | `Dict[str, Any]` | (Optional) Map of parameter names to argument values. This lets you provide default values for pipeline or component parameters. You can override these default values during pipeline submission.
`type_check` | `bool` | (Optional) Indicates whether static type checking is enabled during compilation.

#### IR YAML
The IR YAML is an intermediate representation of a compiled pipeline or component. It is an instance of the `PipelineSpec` protocol buffer message type, which is a platform-agnostic pipeline representation protocol. It is considered an intermediate representation because the KFP backend compiles `PipelineSpec` to `Argo Workflow` YAML as the final pipeline definition for execution.

### Control Flow
The core types of control flow in KFP pipelines are:
1. Conditions
2. Loops
3. Exit handling

#### dsl.If / dsl.Elif / dsl.Else
The `dsl.If` context manager enables conditional execution of tasks within its scope based on the output of an upstream task or pipeline input parameter.

The context manager takes two arguments: a required `condition` and an optional `name`. The `condition` is a comparative expression where at least one of the two operands is an output from an upstream task or a pipeline input parameter.

You may also use `dsl.Elif` and `dsl.Else` context managers immediately downstream of `dsl.If` for additional conditional control flow functionality. 

#### dsl.OneOf (unsupported)
`dsl.OneOf` can be used to gather outputs from mutually exclusive branches into a single task output, which can be consumed by a downstream task or returned from a pipeline. Branches are mutually exclusive if exactly one will be executed. To enforce this, the KFP SDK compiler requires that `dsl.OneOf` consume from tasks within a logically associated group of conditional branches and that one of the branches be a `dsl.Else` branch.

You should provide task outputs to the `dsl.OneOf` using `.output` or `.outputs[<key>]`, just as you would pass an output to a downstream task. The outputs provided to `dsl.OneOf` must be of the same type and cannot be other instances of `dsl.OneOf` or `dsl.Collected`.

```python
@dsl.pipeline
def my_pipeline() -> str:
    coin_flip_task = flip_three_sided_coin()
    with dsl.If(coin_flip_task.output == 'heads'):
        t1 = print_and_return(text='Got heads!')
    with dsl.Elif(coin_flip_task.output == 'tails'):
        t2 = print_and_return(text='Got tails!')
    with dsl.Else():
        t3 = print_and_return(text='Draw!')

    oneof = dsl.OneOf(t1.output, t2.output, t3.output)
    # KFP tasks must be created with keyword arguments; `result` is an assumed parameter name.
    announce_result(result=oneof)
    return oneof
```

### Loops
Kubeflow Pipelines supports loops which cause fan-out and fan-in of tasks.

#### dsl.ParallelFor (unsupported)
The `dsl.ParallelFor` context manager allows parallel execution of tasks over a static set of items.

The context manager takes three arguments:
- `items`: the static set of items to loop over
- `name` (optional): the name of the loop context
- `parallelism` (optional): the maximum number of concurrent iterations while executing the `dsl.ParallelFor` group
    - Note: `parallelism=0` indicates unconstrained parallelism
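The effect of the `parallelism` argument can be approximated locally with a thread pool. The sketch below is only an analogy under stated assumptions: it runs iterations as threads in one process rather than as Pods, and the helper `run_loop` is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_loop(items: Sequence[int], task: Callable[[int], int], parallelism: int = 0) -> List[int]:
    """Run `task` once per item with at most `parallelism` concurrent iterations."""
    # Mirror the convention above: 0 means no limit, so use one worker per item.
    max_workers = parallelism if parallelism > 0 else max(len(items), 1)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in item order, regardless of completion order.
        return list(pool.map(task, items))


results = run_loop([1, 5, 10, 25], lambda epochs: epochs * 2, parallelism=2)
print(results)
```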

#### dsl.Collected (unsupported)
Use `dsl.Collected` with `dsl.ParallelFor` to gather outputs from a parallel loop of tasks. Downstream tasks might consume `dsl.Collected` outputs via an input annotated with a `List` of parameters or a `List` of artifacts.

You can use `dsl.Collected` to collect outputs from nested loops in a nested list of parameters.
For example, output parameters from two nested `dsl.ParallelFor` groups are collected in a multilevel nested list of parameters, where each nested list contains the output parameters from one of the `dsl.ParallelFor` groups. The number of nested levels is based on the number of nested `dsl.ParallelFor` contexts.

By comparison, artifacts created in nested loops are collected in a flat list.

```python
@dsl.pipeline
def my_pipeline():
    
    # Train a model for 1, 5, 10, and 25 epochs
    with dsl.ParallelFor(
        items=[1, 5, 10, 25],
    ) as epochs:
        train_model_task = train_model(epochs=epochs)
        
    # Find the model with the highest accuracy
    max_accuracy(
        models=dsl.Collected(train_model_task.outputs['model'])
    )
```
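The difference in shape between collected parameters and collected artifacts from nested loops, as described above, can be pictured with plain Python lists. The item values and naming scheme are hypothetical; the sketch only illustrates the resulting list structure, not how `dsl.Collected` is implemented.

```python
outer_items = ["a", "b"]
inner_items = [1, 2, 3]

# Parameters from two nested loops: one inner list per outer iteration,
# so the nesting depth matches the number of nested loops.
collected_parameters = [
    [f"{outer}-{inner}" for inner in inner_items] for outer in outer_items
]

# Artifacts from the same nested loops: a single flat list.
collected_artifacts = [
    f"model-{outer}-{inner}" for outer in outer_items for inner in inner_items
]

print(collected_parameters)
print(collected_artifacts)
```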

### Exit handling
Kubeflow Pipelines supports exit handlers for implementing cleanup and error handling tasks that run after the main pipeline tasks finish execution.

#### dsl.ExitHandler 
The `dsl.ExitHandler` context manager allows pipeline authors to specify an exit task which will run after the tasks within the context manager’s scope finish execution, even if one of those tasks fails.

This construct is analogous to using a `try:` block followed by a `finally:` block in normal Python, where the exit task is in the `finally:` block. The context manager takes two arguments: a required `exit_task` and an optional `name`. `exit_task` accepts an instantiated `PipelineTask`.
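The analogy can be spelled out in plain Python. The function names below are hypothetical stand-ins for pipeline tasks; the sketch shows only the `try`/`finally` ordering, not KFP execution.

```python
events = []


def main_tasks() -> None:
    events.append("create_datasets")
    # Simulate a failing task inside the exit handler's scope.
    raise RuntimeError("training failed")


def clean_up() -> None:
    events.append("clean_up_resources")


try:
    try:
        main_tasks()
    finally:
        # Runs whether or not main_tasks() raised, like an exit task.
        clean_up()
except RuntimeError as err:
    events.append(f"pipeline failed: {err}")

print(events)
```

The cleanup step is recorded before the failure is reported, mirroring how the exit task runs even when a task in scope fails.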

The most common use case for `dsl.ExitHandler` is to run a cleanup task after the main pipeline tasks finish execution.

The task you use as an exit task may use a special input that provides access to pipeline and task status metadata, including pipeline failure or success status.

You can use this special input by annotating your exit task with the `dsl.PipelineTaskFinalStatus` annotation. The argument for this parameter will be provided by the backend automatically at runtime. You should not provide any input to this annotation when you instantiate your exit task.

The `.ignore_upstream_failure()` task method on `PipelineTask` enables another approach to author pipelines with exit handling behavior. Calling this method on a task causes the task to ignore failures of any specified upstream tasks (as established by data exchange or by use of `.after()`). If the task has no upstream tasks, this method has no effect.

```python
@dsl.pipeline
def my_pipeline():
    clean_up_task = clean_up_resources()
    with dsl.ExitHandler(exit_task=clean_up_task):
        dataset_task = create_datasets()
        train_task = train_and_save_models(dataset=dataset_task.output)
```

### Caching
Caching in KFP is a feature that allows you to cache the results of a component execution and reuse them in subsequent runs. When caching is enabled for a component, KFP will reuse the component’s outputs if the component is executed again with the same inputs and parameters (and the output is still available).

Caching is particularly useful when you have components that take a long time to execute or when you have components that are executed multiple times with the same inputs and parameters.
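The reuse behavior can be modeled as a lookup keyed on the component and its inputs. This is a simplified sketch, assuming a key derived only from a component name and its input values; the helper names are hypothetical, and the KFP backend's actual cache key and storage are not described here.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Tuple

_cache: Dict[str, Any] = {}


def cache_key(component_name: str, inputs: Dict[str, Any]) -> str:
    """Derive a deterministic key from the component name and its inputs."""
    payload = json.dumps({"component": component_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_with_cache(
    component_name: str,
    func: Callable[..., Any],
    inputs: Dict[str, Any],
    enable_caching: bool = True,
) -> Tuple[Any, bool]:
    """Return (output, cache_hit), executing `func` only on a cache miss."""
    key = cache_key(component_name, inputs)
    if enable_caching and key in _cache:
        return _cache[key], True
    output = func(**inputs)
    _cache[key] = output
    return output, False


def say_hello(name: str) -> str:
    return f"Hello, {name}"


first = run_with_cache("say_hello", say_hello, {"name": "World!"})
second = run_with_cache("say_hello", say_hello, {"name": "World!"})
third = run_with_cache("say_hello", say_hello, {"name": "World!"}, enable_caching=False)
print(first, second, third)
```

The second call is served from the cache because the inputs are identical, while the third call re-executes because caching is disabled for it.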

If a task’s results are retrieved from cache, its representation in the UI will be marked with a green “arrow from cloud” icon.

Caching is enabled by default for all components in KFP. You can disable caching for a component by calling `.set_caching_options(enable_caching=False)` on a task object.

```python
@dsl.pipeline
def hello_pipeline(recipient: str = 'World!') -> str:
    hello_task = say_hello(name=recipient)
    # Disable caching for this task only.
    hello_task.set_caching_options(enable_caching=False)
    return hello_task.output
```

You can also enable or disable caching for all components in a pipeline by setting the `enable_caching` argument when submitting a pipeline for execution. This overrides the caching settings for all components in the pipeline.

```python
client.create_run_from_pipeline_func(
    hello_pipeline,
    enable_caching=True,  # overrides the above disabling of caching
)
```