<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/MLOps/Pipelines/Vertex%20AI%20Pipelines%20-%20Pattern%20-%20Modular%20and%20Reusable.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FMLOps%2FPipelines%2FVertex%2520AI%2520Pipelines%2520-%2520Pattern%2520-%2520Modular%2520and%2520Reusable.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/MLOps/Pipelines/Vertex%20AI%20Pipelines%20-%20Pattern%20-%20Modular%20and%20Reusable.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/MLOps/Pipelines/Vertex%20AI%20Pipelines%20-%20Pattern%20-%20Modular%20and%20Reusable.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

---
This notebook presents a pattern, or common workflow, with **pipelines**.  The concepts used in the notebook are introduced and explained in more detail in this [series of notebook based workflows](./readme.md) that teach all the ways to use pipelines within Vertex AI. The suggested order and description/reason is:

|Notebook Workflow|Description|
|---|---|
|[Vertex AI Pipelines - Start Here](./Vertex%20AI%20Pipelines%20-%20Start%20Here.ipynb)|What are pipelines? Start here to go from code to pipeline and see it in action.|
|[Vertex AI Pipelines - Introduction](./Vertex%20AI%20Pipelines%20-%20Introduction.ipynb)|Introduction to pipelines with the console and Vertex AI SDK|
|[Vertex AI Pipelines - Components](./Vertex%20AI%20Pipelines%20-%20Components.ipynb)|An introduction to all the ways to create pipeline components from your code|
|[Vertex AI Pipelines - IO](./Vertex%20AI%20Pipelines%20-%20IO.ipynb)|An overview of all the types of inputs and outputs for pipeline components|
|[Vertex AI Pipelines - Control](./Vertex%20AI%20Pipelines%20-%20Control.ipynb)|An overview of controlling the flow of execution for pipelines|
|[Vertex AI Pipelines - Secret Manager](./Vertex%20AI%20Pipelines%20-%20Secret%20Manager.ipynb)|How to pass sensitive information to pipelines and components|
|[Vertex AI Pipelines - GCS Read and Write](./Vertex%20AI%20Pipelines%20-%20GCS%20Read%20and%20Write.ipynb)|How to read/write to GCS from components, including container components.|
|[Vertex AI Pipelines - Scheduling](./Vertex%20AI%20Pipelines%20-%20Scheduling.ipynb)|How to schedule pipeline execution|
|[Vertex AI Pipelines - Notifications](./Vertex%20AI%20Pipelines%20-%20Notifications.ipynb)|How to send email notification of pipeline status.|
|[Vertex AI Pipelines - Management](./Vertex%20AI%20Pipelines%20-%20Management.ipynb)|Managing, Reusing, and Storing pipelines and components|
|[Vertex AI Pipelines - Testing](./Vertex%20AI%20Pipelines%20-%20Testing.ipynb)|Strategies for testing components and pipelines locally and remotely to aid development.|
|[Vertex AI Pipelines - Managing Pipeline Jobs](./Vertex%20AI%20Pipelines%20-%20Managing%20Pipeline%20Jobs.ipynb)|Manage runs of pipelines in an environment: list, check status, filtered list, cancel and delete jobs.|


To discover these notebooks as part of an introduction to MLOps orchestration [start here](./readme.md).  To read more about MLOps also check out [the parent folder](../readme.md).

---

# Vertex AI Pipelines - Pattern - Modular and Reusable

The pattern makes use of modular components and pipelines while also showing that pipelines can be used within pipelines for ultimate modularity.  The components and pipelines are managed and stored in Artifact Registry for easy recall and reloading.

This notebook walks through:
- Example: An example pipeline, completely local build/compile
- Example 1: Save the pipeline to Artifact Registry and run directly on Vertex AI Pipelines
- Example 2: Save a component to Artifact Registry and recall it (download and import) for use in a new pipeline
- Example 3: Creating a sub-pipeline of multiple components, saving to Artifact Registry.  Then, recall the pipeline and use it as a component in a new pipeline.

---
## Colab Setup

To run this notebook in Colab run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
    print('Colab authorized to GCP')
except Exception:
    print('Not a Colab Environment')
    pass

Not a Colab Environment


---
## Installs

The list `packages` contains tuples of package import names and install names.  If the import name is not found then the install name is used to install quietly for the current user.

In [18]:
# tuples of (import name, install name, min_version)
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform'),
    ('google.cloud.artifactregistry_v1', 'google-cloud-artifact-registry'),
    ('kfp', 'kfp'),
    ('google_cloud_pipeline_components', 'google-cloud-pipeline-components'),
]

import importlib.util
import importlib.metadata
install = False
for package in packages:
    # install when the import name cannot be found
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    # when a min_version is provided, look up the installed version by install (distribution) name
    elif len(package) == 3:
        if importlib.metadata.version(package[1]) < package[2]:
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user
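
The version check in the cell above compares version strings directly, which can mis-order releases (as strings, `'2.9.0'` sorts after `'2.10.0'`). A minimal sketch of a numeric comparison using only the standard library follows. The helper names `version_tuple` and `is_older` are hypothetical and not part of this notebook, and pre-release suffixes are simply ignored.

```python
import re


def version_tuple(version: str) -> tuple:
    """Convert '2.14.0' to (2, 14, 0), keeping only the leading digits of each part."""
    parts = []
    for piece in version.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            break  # stop at the first part with no leading digits
        parts.append(int(match.group()))
    return tuple(parts)


def is_older(installed: str, minimum: str) -> bool:
    """True when the installed version is numerically below the minimum."""
    return version_tuple(installed) < version_tuple(minimum)


print(is_older('2.9.0', '2.10.0'))  # numeric comparison
print('2.9.0' < '2.10.0')           # plain string comparison
```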

### API Enablement

In [6]:
!gcloud services enable aiplatform.googleapis.com
!gcloud services enable artifactregistry.googleapis.com

### Restart Kernel (If Installs Occurred)

After a kernel restart the code submission can start with the next cell after this one.

In [7]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
    IPython.display.display(IPython.display.Markdown("""<div class=\"alert alert-block alert-warning\">
        <b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. The previous cells do not need to be run again⚠️</b>
        </div>"""))

---
## Setup

Inputs

In [10]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [11]:
REGION = 'us-central1'
EXPERIMENT = 'pipeline-pattern-modular'
SERIES = 'mlops'

# gcs bucket
GCS_BUCKET = PROJECT_ID

Packages

In [19]:
import os
import yaml
import time
import importlib
from google.cloud import aiplatform
from google.cloud import artifactregistry_v1
import kfp
from google_cloud_pipeline_components.types import artifact_types
from typing import NamedTuple
from IPython.display import Markdown

Clients

In [13]:
# vertex ai clients
aiplatform.init(project = PROJECT_ID, location = REGION)

# artifact registry client
ar_client = artifactregistry_v1.ArtifactRegistryClient()

parameters:

In [14]:
DIR = f"temp/{SERIES}-{EXPERIMENT}"

In [15]:
SERVICE_ACCOUNT = !gcloud config list --format='value(core.account)' 
SERVICE_ACCOUNT = SERVICE_ACCOUNT[0]
SERVICE_ACCOUNT

'1026793852137-compute@developer.gserviceaccount.com'

environment:
- make a local folder for temporary storage

In [17]:
if not os.path.exists(DIR):
    os.makedirs(DIR)

---
## Example Pipeline: Components and Pipeline are Local

> The pipeline used here is also used and explained in the workflow: [Vertex AI Pipelines - IO](./Vertex%20AI%20Pipelines%20-%20IO.ipynb).

This is an example pipeline that makes use of:
- all 8 `kfp` artifact types and multiple Google Cloud Artifact Types
- passing artifacts as outputs and inputs between components
- returning multiple artifacts from components
- saving content for multiple artifacts with the same component

It has components that do the following:
- `data_source` Creates a BigQuery Table artifact from the metadata (project, dataset, table name)
- `data_prep` From a BigQuery Table artifact it retrieves, splits and saves the prepared data while defining dataset artifacts that point to train and test data.  It also creates a generic artifact that contains lists for feature names and classification labels in the datasets.
- `model_gb` Retrieves training data and feature information to use with Scikit-Learn with a pipeline that imputes, scales, encodes and then fits a gradient boosting classifier that is then saved as a model artifact.
- `model_rf` Retrieves training data and feature information to use with Scikit-Learn in a pipeline that imputes, scales, encodes and then fits a random forest classifier that is then saved as a model artifact.
- `metrics` Retrieves data, feature info, and model files to compute evaluation metrics that are saved as artifacts.
- `overview` Compiles metrics results from both models on train and test data and then saves as HTML and Markdown artifacts for easy review.

### Create Pipeline Components

These are simple Python components, specifically lightweight Python components.  For more details on the types of components check out this workflow in the same repository:
- [Vertex AI Pipelines - Components](./Vertex%20AI%20Pipelines%20-%20Components.ipynb)

#### Component: `data_source`

This component defines an artifact that points to the data source in place, in BigQuery.  It uses the Google Cloud Artifact Type for BigQuery Tables: [`google_cloud_pipeline_components.types.artifact_types.BQTable()`](https://google-cloud-pipeline-components.readthedocs.io/en/google-cloud-pipeline-components-2.14.0/api/artifact_types.html#google_cloud_pipeline_components.types.artifact_types.BQTable).

>**NOTE:** This could be done with an importer component `kfp.dsl.importer` - see [Vertex AI Pipelines - Components](./Vertex%20AI%20Pipelines%20-%20Components.ipynb).

In [20]:
@kfp.dsl.component(
    base_image = "python:3.11",
    packages_to_install = ["google-cloud-pipeline-components"]
)
def data_source(
    bq_project: str,
    bq_dataset: str,
    bq_table: str,
    bq_table_artifact: kfp.dsl.Output[artifact_types.BQTable]
):
    
    bq_table_artifact.uri = f'https://www.googleapis.com/bigquery/v2/projects/{bq_project}/datasets/{bq_dataset}/tables/{bq_table}'
    bq_table_artifact.metadata['projectId'] = bq_project
    bq_table_artifact.metadata['datasetId'] = bq_dataset
    bq_table_artifact.metadata['tableId'] = bq_table
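
The `data_source` component only records where the table lives; no data is copied. The sketch below mirrors that bookkeeping in plain Python, with a hypothetical helper `bq_table_reference` standing in for the artifact, and shows how a downstream step can rebuild the `project.dataset.table` identifier from the metadata, as `data_prep` does.

```python
def bq_table_reference(project: str, dataset: str, table: str) -> dict:
    """Build the URI and metadata fields that the data_source component sets."""
    return {
        'uri': (
            'https://www.googleapis.com/bigquery/v2/'
            f'projects/{project}/datasets/{dataset}/tables/{table}'
        ),
        'metadata': {'projectId': project, 'datasetId': dataset, 'tableId': table},
    }


def table_id(reference: dict) -> str:
    """Rebuild the fully qualified table name from the metadata."""
    meta = reference['metadata']
    return f"{meta['projectId']}.{meta['datasetId']}.{meta['tableId']}"


ref = bq_table_reference('bigquery-public-data', 'ml_datasets', 'penguins')
print(ref['uri'])
print(table_id(ref))
```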

#### Component: `data_prep`

A lightweight Python component that:
- read data from BigQuery Table using Input Artifact for BigQuery Table
- split data into train and test sets
- create output artifacts (`kfp.dsl.Dataset`) for each split of the data
- create output artifact (`kfp.dsl.Artifact`) with feature information from the data

In [21]:
@kfp.dsl.component(
    base_image = "python:3.11",
    packages_to_install = ["google-cloud-pipeline-components", "bigframes", "scikit-learn"]
)
def data_prep(
    project_id: str,
    bq_source: kfp.dsl.Input[artifact_types.BQTable],
) -> NamedTuple(
        'output',
        train=kfp.dsl.Dataset,
        test=kfp.dsl.Dataset,
        features=kfp.dsl.Artifact
):
    from typing import NamedTuple
    outputs = NamedTuple(
            'output',
            train=kfp.dsl.Dataset,
            test=kfp.dsl.Dataset,
            features=kfp.dsl.Artifact
    )
    
    # connect to BigQuery table, ELT, read to local
    import bigframes.pandas as bpd
    bpd.options.bigquery.project = project_id
    bpd.options.bigquery.location = 'us'
    ds = bpd.read_gbq(f"{bq_source.metadata['projectId']}.{bq_source.metadata['datasetId']}.{bq_source.metadata['tableId']}")
    # fix data quality issue
    ds['sex'] = ds['sex'].replace('.', None)
    full_ds = ds.to_pandas()
    
    # split data into train/test
    from sklearn.model_selection import train_test_split
    train_ds, test_ds = train_test_split(full_ds, test_size = 0.25)
    
    # write test and train to Dataset artifacts - with specific subfolders
    import os
    #train
    train = kfp.dsl.Dataset(
        uri = kfp.dsl.get_uri(suffix = 'train'),
        metadata = dict(
            samples = train_ds.shape[0],
            filename = 'data.txt'
        )
    )
    path = train.path + '/data.txt'
    os.makedirs(os.path.dirname(path), exist_ok = True)
    with open(path, 'w') as f:
        f.write(train_ds.to_json(orient='records'))
    # test    
    test = kfp.dsl.Dataset(
        uri = kfp.dsl.get_uri(suffix = 'test'),
        metadata = dict(
            samples = test_ds.shape[0],
            filename = 'data.txt'
        )
    )
    path = test.path + '/data.txt'
    os.makedirs(os.path.dirname(path), exist_ok = True)
    with open(path, 'w') as f:
        f.write(test_ds.to_json(orient='records'))
    
    # add feature info the feature Artifact
    features = kfp.dsl.Artifact(
        metadata = dict(
            label_col = 'species',
            label_values = ds['species'].unique().to_list(),
            train_n = train_ds.shape[0],
            test_n = test_ds.shape[0],
            features = [x for x in ds.columns.to_list() if x != 'species']
        )
    )
    
    return outputs(train, test, features)
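
`data_prep` treats each Dataset artifact as a folder: the file is written to `<artifact path>/data.txt` and the file name is stored in the artifact metadata so that downstream components can find it. A minimal local sketch of that convention follows, assuming a temporary directory in place of the GCS-backed artifact path, the standard `json` module in place of pandas, and two made-up records.

```python
import json
import os
import tempfile

# made-up records for illustration only
records = [
    {'species': 'Adelie', 'body_mass_g': 3750.0},
    {'species': 'Gentoo', 'body_mass_g': 5000.0},
]

# stand-in for train.path; in a pipeline this is resolved from the artifact URI
artifact_path = os.path.join(tempfile.mkdtemp(), 'train')
metadata = {'samples': len(records), 'filename': 'data.txt'}

# writer side: create the folder, then write the records as JSON
path = os.path.join(artifact_path, metadata['filename'])
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, 'w') as f:
    f.write(json.dumps(records))

# reader side: locate the file using the file name stored in the metadata
with open(os.path.join(artifact_path, metadata['filename']), 'r') as f:
    loaded = json.loads(f.read())

print(loaded == records)
```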

#### Component: `model_gb`

A lightweight Python component that:
- inputs artifacts for training data as well as feature information created by the `data_prep` component
- creates a model with `sklearn.ensemble.GradientBoostingClassifier`
- output artifact for the model (`kfp.dsl.Model`)

In [22]:
@kfp.dsl.component(
    base_image = "python:3.11",
    packages_to_install = ["pandas", "scikit-learn"]
)
def model_gb(
    train: kfp.dsl.Dataset,
    features: kfp.dsl.Artifact
) -> kfp.dsl.Model:
    
    # import data
    import pandas as pd
    from io import StringIO
    with open(train.path + f"/{train.metadata['filename']}", 'r') as f:
        train_ds = f.read()
    train_ds = pd.read_json(StringIO(train_ds), orient='records')
    
    # prepare data for training: split the features (x) and label (y)
    train_x = train_ds[features.metadata['features']]
    train_y = train_ds[features.metadata['label_col']] 
    
    # create pipeline with preprocessing and training
    import sklearn.ensemble
    import sklearn.impute
    import sklearn.pipeline
    import sklearn.preprocessing
    import sklearn.compose
    import numpy as np
    numerical_transformer = sklearn.pipeline.Pipeline([
        ('imputer', sklearn.impute.SimpleImputer(strategy = 'mean')),
        ('scaler', sklearn.preprocessing.MinMaxScaler()),
    ])
    categorical_transformer = sklearn.pipeline.Pipeline([
        ('imputer', sklearn.impute.SimpleImputer(strategy = 'most_frequent', add_indicator = True)),
        ('encoder', sklearn.preprocessing.OrdinalEncoder()),
    ])
    preprocessor = sklearn.compose.ColumnTransformer(
        transformers = [
            ('numerical', numerical_transformer, [c for c in train_x.columns if train_x[c].isna().any() and train_x[c].dtypes == 'float64']),
            ('categorical', categorical_transformer, [c for c in train_x.columns if train_x[c].isna().any() and train_x[c].dtypes == 'object'])
        ]
    )
    pipeline = sklearn.pipeline.Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', sklearn.ensemble.GradientBoostingClassifier(n_estimators = 100, learning_rate = 0.125, max_depth = 3)),
    ])
    
    # fit/train model
    pipeline.fit(train_x, train_y)
    
    # save model and create artifact
    import pickle, os
    model = kfp.dsl.Model(
        uri = kfp.dsl.get_uri(),
        metadata = dict(
            accuracy = pipeline.score(train_x, train_y)
        )
    )
    path = model.path + '/model.pkl'
    os.makedirs(os.path.dirname(path), exist_ok = True)
    with open(path, 'wb') as f:
        pickle.dump(pipeline, f)
        
    return model

#### Component: `model_rf`

A lightweight Python component that:
- inputs artifacts for training data as well as feature information created by the `data_prep` component
- creates a model with `sklearn.ensemble.RandomForestClassifier`
- output artifacts for the model (`kfp.dsl.Model`)

In [23]:
@kfp.dsl.component(
    base_image = "python:3.11",
    packages_to_install = ["pandas", "scikit-learn"]
)
def model_rf(
    train: kfp.dsl.Dataset,
    features: kfp.dsl.Artifact
) -> kfp.dsl.Model:
    
    # import data
    import pandas as pd
    from io import StringIO
    with open(train.path + f"/{train.metadata['filename']}", 'r') as f:
        train_ds = f.read()
    train_ds = pd.read_json(StringIO(train_ds), orient='records')
    
    # prepare data for training: split the features (x) and label (y)
    train_x = train_ds[features.metadata['features']]
    train_y = train_ds[features.metadata['label_col']] 
    
    # create pipeline with preprocessing and training
    import sklearn.ensemble
    import sklearn.impute
    import sklearn.pipeline
    import sklearn.preprocessing
    import sklearn.compose
    import numpy as np
    numerical_transformer = sklearn.pipeline.Pipeline([
        ('imputer', sklearn.impute.SimpleImputer(strategy = 'mean')),
        ('scaler', sklearn.preprocessing.MinMaxScaler()),
    ])
    categorical_transformer = sklearn.pipeline.Pipeline([
        ('imputer', sklearn.impute.SimpleImputer(strategy = 'most_frequent', add_indicator = True)),
        ('encoder', sklearn.preprocessing.OrdinalEncoder()),
    ])
    preprocessor = sklearn.compose.ColumnTransformer(
        transformers = [
            ('numerical', numerical_transformer, [c for c in train_x.columns if train_x[c].isna().any() and train_x[c].dtypes == 'float64']),
            ('categorical', categorical_transformer, [c for c in train_x.columns if train_x[c].isna().any() and train_x[c].dtypes == 'object'])
        ]
    )
    pipeline = sklearn.pipeline.Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', sklearn.ensemble.RandomForestClassifier(n_estimators = 200, max_depth = 3)),
    ])
    
    # fit/train model
    pipeline.fit(train_x, train_y)
    
    # save model and create artifact
    import pickle, os
    model = kfp.dsl.Model(
        uri = kfp.dsl.get_uri(),
        metadata = dict(
            accuracy = pipeline.score(train_x, train_y)
        )
    )
    path = model.path + '/model.pkl'
    os.makedirs(os.path.dirname(path), exist_ok = True)
    with open(path, 'wb') as f:
        pickle.dump(pipeline, f)
        
    return model
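
Both model components persist the fitted pipeline with `pickle` at `<model path>/model.pkl`, and the `metrics` component later reloads it from the same location. A minimal sketch of that round trip follows, assuming a plain dictionary as a stand-in for the fitted scikit-learn pipeline and a temporary directory in place of the artifact path.

```python
import os
import pickle
import tempfile

# stand-in for the fitted sklearn pipeline (hypothetical object for illustration)
fitted = {'classifier': 'RandomForestClassifier', 'n_estimators': 200, 'max_depth': 3}

# stand-in for model.path
model_path = os.path.join(tempfile.mkdtemp(), 'model')
path = os.path.join(model_path, 'model.pkl')

# save: create the folder, then pickle the object
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, 'wb') as f:
    pickle.dump(fitted, f)

# load: the consumer only needs to know the folder and the file name
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == fitted)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, so it can help to pin the same `scikit-learn` version in every component that touches the model.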

#### Component: `metrics`

A lightweight Python component that:
- inputs artifacts for a dataset and a model
- create artifacts for:
    - Metrics with `kfp.dsl.Metrics`
    - Classification metrics with `kfp.dsl.ClassificationMetrics`
    - Sliced Classification metrics with `kfp.dsl.SlicedClassificationMetrics`

In [24]:
@kfp.dsl.component(
    base_image = "python:3.11",
    packages_to_install = ["pandas", "numpy", "scikit-learn"]
)
def metrics(
    data: kfp.dsl.Dataset,
    features: kfp.dsl.Artifact,
    model: kfp.dsl.Model
) -> NamedTuple(
        'output',
        metrics=kfp.dsl.Metrics,
        class_metrics=kfp.dsl.ClassificationMetrics,
        #slice_class_metrics=kfp.dsl.SlicedClassificationMetrics
):
    from typing import NamedTuple
    outputs = NamedTuple(
            'output',
            metrics=kfp.dsl.Metrics,
            class_metrics=kfp.dsl.ClassificationMetrics,
            #slice_class_metrics=kfp.dsl.SlicedClassificationMetrics
    )
    
    # import data
    import pandas as pd
    from io import StringIO
    with open(data.path + f"/{data.metadata['filename']}", 'r') as f:
        ds = f.read()
    ds = pd.read_json(StringIO(ds), orient='records')
    
    # get the ground truth
    x = ds[features.metadata['features']]
    y = ds[features.metadata['label_col']]
    
    # import model
    import pickle
    with open(model.path+'/model.pkl', 'rb') as f:
        classifier = pickle.load(f)
    pred = classifier.predict(x)
    proba = classifier.predict_proba(x)
    
    # metrics artifact
    import sklearn.metrics
    metrics = kfp.dsl.Metrics()
    metrics.log_metric('accuracy', classifier.score(x, y))
    if len(features.metadata['label_values'])>2:
        metrics.log_metric('precision', sklearn.metrics.precision_score(y, pred, average='macro'))
        metrics.log_metric('recall', sklearn.metrics.recall_score(y, pred, average='macro'))
        metrics.log_metric('f1', sklearn.metrics.f1_score(y, pred, average='macro'))
        metrics.log_metric('average_precision', sklearn.metrics.average_precision_score(y, proba, average='macro'))
    else:
        metrics.log_metric('precision', sklearn.metrics.precision_score(y, pred, average='binary'))
        metrics.log_metric('recall', sklearn.metrics.recall_score(y, pred, average='binary'))
        metrics.log_metric('f1', sklearn.metrics.f1_score(y, pred, average='binary'))
        metrics.log_metric('average_precision', sklearn.metrics.average_precision_score(y, proba, average='binary'))
    
    # classification metrics artifact
    class_metrics = kfp.dsl.ClassificationMetrics()
    class_metrics.log_confusion_matrix(
        categories = classifier.classes_,
        matrix = sklearn.metrics.confusion_matrix(y, pred).tolist()
    )
    
    # sliced classification metrics artifact
    #import numpy as np
    #import sklearn.preprocessing
    #slice_class_metrics = kfp.dsl.SlicedClassificationMetrics()
    #labeler = sklearn.preprocessing.LabelBinarizer().fit(y)
    #for c in classifier.classes_:
    #    i = np.where(labeler.transform([c]) == 1)[0][0]
    #    fpr, tpr, thresholds = sklearn.metrics.roc_curve(
    #        y_true = labeler.transform(y)[:, i],
    #        y_score = classifier.predict_proba(x)[:, i]
    #    )
    #    infs = [t==np.inf for t in thresholds.tolist()]
    #    
    #    slice_class_metrics.load_roc_readings(
    #        c,
    #        [
    #            [t for i,t in enumerate(thresholds.tolist()) if infs[i]==True],
    #            [t for i,t in enumerate(tpr.tolist()) if infs[i]==True],
    #            [t for i,t in enumerate(fpr.tolist()) if infs[i]==True]
    #        ]
    #    )
        
                       
    return outputs(metrics, class_metrics) #, slice_class_metrics)
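
For the multi-class case the component uses `average='macro'`, which computes the score for each class separately and then takes the unweighted mean. A small worked sketch in plain Python follows, using made-up labels rather than the penguins data, with the confusion matrix laid out the same way as `sklearn.metrics.confusion_matrix` (rows are true labels, columns are predictions).

```python
# made-up labels for illustration only
y_true = ['a', 'a', 'b', 'b', 'c', 'c']
y_pred = ['a', 'b', 'b', 'b', 'c', 'a']
labels = sorted(set(y_true))
index = {label: i for i, label in enumerate(labels)}

# confusion matrix: matrix[true][predicted]
matrix = [[0] * len(labels) for _ in labels]
for t, p in zip(y_true, y_pred):
    matrix[index[t]][index[p]] += 1

precisions, recalls = [], []
for i in range(len(labels)):
    true_positive = matrix[i][i]
    predicted_as_i = sum(row[i] for row in matrix)  # column total
    actually_i = sum(matrix[i])                      # row total
    precisions.append(true_positive / predicted_as_i if predicted_as_i else 0.0)
    recalls.append(true_positive / actually_i if actually_i else 0.0)

# macro average: unweighted mean of the per-class scores
macro_precision = sum(precisions) / len(labels)
macro_recall = sum(recalls) / len(labels)
print(matrix)
print(round(macro_precision, 4), round(macro_recall, 4))
```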

#### Component: `overview`

A lightweight Python component that:
- inputs a list of metric artifacts
- creates a `kfp.dsl.HTML` artifact
- creates a `kfp.dsl.Markdown` artifact

In [25]:
@kfp.dsl.component(
    base_image = "python:3.11",
    packages_to_install = ["pandas", "tabulate"]
)
def overview(
    metrics_0: kfp.dsl.Metrics,
    metrics_1: kfp.dsl.Metrics,
    metrics_2: kfp.dsl.Metrics,
    metrics_3: kfp.dsl.Metrics,
    models: list,
    data: list
) -> NamedTuple(
        'output',
        html=kfp.dsl.HTML,
        md=kfp.dsl.Markdown
):
    from typing import NamedTuple
    outputs = NamedTuple(
            'output',
            html=kfp.dsl.HTML,
            md=kfp.dsl.Markdown
    )
    
    # construct dataframe
    import pandas as pd
    metrics = [metrics_0.metadata, metrics_1.metadata, metrics_2.metadata, metrics_3.metadata]
    records = []
    for m, metric in enumerate(metrics):
        records.append(
            dict(
                model = models[m],
                data = data[m]
            )|metrics[m]
        )
    df = pd.DataFrame(records)
 
    import os
    # html artifact
    html = kfp.dsl.HTML(uri = kfp.dsl.get_uri('html.html'))
    os.makedirs(os.path.dirname(html.path), exist_ok = True)
    with open(html.path, 'w') as f:
        f.write(df.to_html(index = False))
    
    
    # markdown artifact
    md = kfp.dsl.Markdown(uri = kfp.dsl.get_uri('md.md'))
    os.makedirs(os.path.dirname(md.path), exist_ok = True)
    with open(md.path, 'w') as f:
        f.write(df.to_markdown(index = False))
    
    return outputs(html, md)
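
The `overview` component builds one row per model and data split by merging a small label dictionary with the logged metrics using the dictionary union operator `|` (Python 3.9 or later). A minimal sketch follows with made-up metric values, rendering the Markdown table by hand instead of using `DataFrame.to_markdown`.

```python
models = ['GB', 'GB']
data = ['Train', 'Test']
# made-up metric values for illustration only
metrics = [{'accuracy': 0.99, 'f1': 0.98}, {'accuracy': 0.97, 'f1': 0.96}]

# one record per metrics artifact: labels first, then the metric values
records = [dict(model=models[m], data=data[m]) | metric for m, metric in enumerate(metrics)]

# render a simple Markdown table from the records
columns = list(records[0])
lines = [
    '|' + '|'.join(columns) + '|',
    '|' + '|'.join('---' for _ in columns) + '|',
]
for record in records:
    lines.append('|' + '|'.join(str(record[c]) for c in columns) + '|')
table = '\n'.join(lines)
print(table)
```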

### Create Pipeline

In [26]:
pipeline_name = f'{SERIES}-{EXPERIMENT}-preview'

In [27]:
@kfp.dsl.pipeline(
    name = pipeline_name,
    description = 'A simple pipeline for testing',
    pipeline_root = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/pipeline_root'
)
def pipeline(
    project_id: str,
    bq_project: str,
    bq_dataset: str,
    bq_table: str
):
    
    bq_source = data_source(
        bq_project = bq_project,
        bq_dataset = bq_dataset,
        bq_table = bq_table
    )
    train_data = data_prep(
        project_id = project_id,
        bq_source = bq_source.output
    )
    model_1 = model_gb(
        train = train_data.outputs['train'],
        features = train_data.outputs['features']
    )
    model_2 = model_rf(
        train = train_data.outputs['train'],
        features = train_data.outputs['features']
    )
    metrics_1_train = metrics(
        data = train_data.outputs['train'],
        features = train_data.outputs['features'],
        model = model_1.output,
    ).set_display_name('Metrics: Training Data')
    metrics_1_test = metrics(
        data = train_data.outputs['test'],
        features = train_data.outputs['features'],
        model = model_1.output,
    ).set_display_name('Metrics: Test Data')
    metrics_2_train = metrics(
        data = train_data.outputs['train'],
        features = train_data.outputs['features'],
        model = model_2.output,
    ).set_display_name('Metrics: Training Data')
    metrics_2_test = metrics(
        data = train_data.outputs['test'],
        features = train_data.outputs['features'],
        model = model_2.output,
    ).set_display_name('Metrics: Test Data')
    
    review = overview(
        metrics_0 = metrics_1_train.outputs['metrics'],
        metrics_1 = metrics_1_test.outputs['metrics'],
        metrics_2 = metrics_2_train.outputs['metrics'],
        metrics_3 = metrics_2_test.outputs['metrics'],
        models = ['GB', 'GB', 'RF', 'RF'],
        data = ['Train', 'Test', 'Train', 'Test']
    )
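
In the pipeline function, passing one task's output as another task's input is what defines the execution order; tasks with no dependency between them, such as `model_gb` and `model_rf`, are free to run in parallel. The sketch below writes out the dependencies implied by the pipeline above as a plain dictionary (the task names are shorthand chosen for this illustration) and uses the standard library's `graphlib` (Python 3.9 or later) to derive one valid order.

```python
from graphlib import TopologicalSorter

# task -> set of tasks whose outputs it consumes
dependencies = {
    'data_source': set(),
    'data_prep': {'data_source'},
    'model_gb': {'data_prep'},
    'model_rf': {'data_prep'},
    'metrics_gb_train': {'data_prep', 'model_gb'},
    'metrics_gb_test': {'data_prep', 'model_gb'},
    'metrics_rf_train': {'data_prep', 'model_rf'},
    'metrics_rf_test': {'data_prep', 'model_rf'},
    'overview': {'metrics_gb_train', 'metrics_gb_test', 'metrics_rf_train', 'metrics_rf_test'},
}

# static_order yields every task after all of its dependencies
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```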

### Compile Pipeline

In [28]:
kfp.compiler.Compiler().compile(
    pipeline_func = pipeline,
    package_path = f'{DIR}/{pipeline_name}.yaml'
)

### Create Pipeline Job

In [29]:
parameters = dict(
    project_id = PROJECT_ID,
    bq_project = 'bigquery-public-data',
    bq_dataset = 'ml_datasets',
    bq_table = 'penguins'
)

In [30]:
pipeline_job = aiplatform.PipelineJob(
    display_name = pipeline_name,
    template_path = f"{DIR}/{pipeline_name}.yaml",
    parameter_values = parameters,
    pipeline_root = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/pipeline_root',
    enable_caching = None # True (enabled), False (disable), None (defer to component level caching) 
)

### Submit Pipeline Job

In [31]:
response = pipeline_job.submit(
    service_account = SERVICE_ACCOUNT
)

Creating PipelineJob
PipelineJob created. Resource name: projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-preview-20240801123852
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-preview-20240801123852')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/mlops-pipeline-pattern-modular-preview-20240801123852?project=1026793852137


In [32]:
print(f'The Dashboard can be viewed here:\n{pipeline_job._dashboard_uri()}')

The Dashboard can be viewed here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/mlops-pipeline-pattern-modular-preview-20240801123852?project=1026793852137


In [33]:
pipeline_job.wait()

PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-preview-20240801123852 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-preview-20240801123852 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-preview-20240801123852 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-preview-20240801123852 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-preview-20240801123852 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-preview-20240801123

### Retrieve Pipeline Information

In [34]:
aiplatform.get_pipeline_df(pipeline = f'{pipeline_name}')[0:10]

| |pipeline_name|run_name|param.input:project_id|param.vertex-ai-pipelines-artifact-argument-binding|param.input:bq_project|param.input:bq_table|param.vmlmd_lineage_integration|param.input:bq_dataset|metric.precision|metric.average_precision|metric.f1|metric.accuracy|metric.recall|metric.confusionMatrix|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|mlops-pipeline-pattern-modular-preview|mlops-pipeline-pattern-modular-preview-2024080...|statmike-mlops-349915|{'output:metrics-class_metrics': ['projects/10...|bigquery-public-data|penguins|{'pipeline_run_component': {'parent_task_names...|ml_datasets|0.997199|0.99994|0.996848|0.996124|0.996528|{'annotationSpecs': [{'displayName': 'Adelie P...|


---
## Manage Pipelines With Artifact Registry

The YAML file created above is a complete specification of a workflow with input parameters and a staging bucket.  This could be very useful to save, share, and reuse.  This section covers using Artifact Registry to do this with a combination of:
- Kubeflow Pipelines SDK and the included [`kfp.registry.RegistryClient`](https://kubeflow-pipelines.readthedocs.io/en/latest/source/registry.html)
- Google Cloud [Artifact Registry](https://cloud.google.com/artifact-registry/docs/overview) with native format for [Kubeflow pipeline templates](https://cloud.google.com/artifact-registry/docs/kfp)
- [Integration with Vertex AI](https://cloud.google.com/vertex-ai/docs/pipelines/create-pipeline-template#kubeflow-pipelines-sdk-client) for creating, uploading and using pipeline templates

### Setup Artifact Registry For KFP Repository

[Artifact Registry](https://cloud.google.com/artifact-registry/docs) organizes artifacts with repositories.  Each repository contains packages and is designated to hold a particular format of package: Docker images, Python packages and [others](https://cloud.google.com/artifact-registry/docs/supported-formats#package).  There is even a repository format specifically for [Kubeflow pipeline templates](https://cloud.google.com/artifact-registry/docs/kfp?hl=en), which is the focus of this notebook's workflow. It specifically stores and manages `kfp` pipeline files (compiled pipelines are YAML files).

#### List All Repositories In Project/Region

This may be empty if no repositories have been created for this project

In [35]:
for repo in ar_client.list_repositories(parent = f'projects/{PROJECT_ID}/locations/{REGION}'):
    print(repo.name)

projects/statmike-mlops-349915/locations/us-central1/repositories/gcf-artifacts
projects/statmike-mlops-349915/locations/us-central1/repositories/mlops
projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915
projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-docker
projects/statmike-mlops-349915/locations/us-central1/repositories/statmike-mlops-349915-python


#### Create/Retrieve KFP Repository

Create an Artifact Registry repository to hold the KFP pipeline templates created by this notebook.  First, check whether it was already created by a previous run and retrieve it if so.  Otherwise, create one named for this series (`SERIES`).

In [36]:
try:
    kfp_repo = ar_client.get_repository(name = f'projects/{PROJECT_ID}/locations/{REGION}/repositories/{SERIES}')
except Exception:
    operation = ar_client.create_repository(
        parent = f'projects/{PROJECT_ID}/locations/{REGION}',
        repository_id = SERIES,
        repository = artifactregistry_v1.Repository(
            name = SERIES,
            format = artifactregistry_v1.Repository.Format.KFP
        )
    )
    kfp_repo = operation.result()

In [37]:
kfp_repo.name, kfp_repo.format_.name

('projects/statmike-mlops-349915/locations/us-central1/repositories/mlops',
 'KFP')

In [38]:
REPOSITORY = f"https://{REGION}-{kfp_repo.format_.name}.pkg.dev/{PROJECT_ID}/{kfp_repo.name.split('/')[-1]}"

In [39]:
REPOSITORY

'https://us-central1-KFP.pkg.dev/statmike-mlops-349915/mlops'
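The URL above carries the enum name `KFP` in uppercase, which is why the later `PipelineJob` cells call `REPOSITORY.lower()`. A minimal sketch of building the lowercase form up front, assuming the URL pattern shown in the output above; `kfp_host` and `repository_id_from_name` are hypothetical helpers, not part of any SDK:

```python
def kfp_host(region: str, project_id: str, repository_id: str) -> str:
    """Return the lowercase Artifact Registry URL for a KFP repository."""
    return f"https://{region}-kfp.pkg.dev/{project_id}/{repository_id}".lower()


def repository_id_from_name(resource_name: str) -> str:
    """Extract the trailing repository ID from a full resource name."""
    return resource_name.rstrip("/").split("/")[-1]


# Values copied from the cell outputs above.
name = "projects/statmike-mlops-349915/locations/us-central1/repositories/mlops"
host = kfp_host("us-central1", "statmike-mlops-349915", repository_id_from_name(name))
print(host)
```

With the lowercase form stored once, the `.lower()` calls at each point of use would no longer be needed.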

---
## Example 1: Full Pipeline Saved To Artifact Registry

> For more details on using the KFP artifact registry to manage pipelines see the workflow: [Vertex AI Pipelines - Management](./Vertex%20AI%20Pipelines%20-%20Management.ipynb)

This example uploads a complete pipeline to the registry and then creates runs that reference the repository directly.

### Upload Pipeline To Registry

Uploading the pipeline's compiled YAML file can be done in several ways:
- Directly in the [Console For Vertex AI Pipelines](https://cloud.google.com/vertex-ai/docs/pipelines/create-pipeline-template#upload-the-template)
- With code using the KFP SDK [`kfp.registry`](https://kubeflow-pipelines.readthedocs.io/en/latest/source/registry.html)

This section uses code to manage the upload.

#### Setup KFP Registry Client

In [40]:
import kfp.registry

In [126]:
token = !gcloud auth application-default print-access-token
kfp_registry = kfp.registry.RegistryClient(
    host = REPOSITORY,
    auth = kfp.registry.ApiAuth(token[0])
)

#### Upload YAML File To Registry

Like other repository types, a KFP repository references artifacts in two ways: by version and by tag.  Versions are assigned and managed by the repository, while tags supply readable names for versions.

In [42]:
!ls {DIR}

mlops-pipeline-pattern-modular-preview.yaml


In [43]:
pipeline_name

'mlops-pipeline-pattern-modular-preview'

In [44]:
template, version = kfp_registry.upload_pipeline(
    file_name = f"{DIR}/{pipeline_name}.yaml",
    tags = ['v1', 'new'],
    extra_headers = dict(description = 'Full Pipeline Template', note = 'This is an example for the full pipeline.')
)

In [45]:
template, version

('mlops-pipeline-pattern-modular-preview',
 'sha256:320595151ec9f4e0e884a61cb2093634c717871dcb272cfe0705e44dd40d0d58')
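Because several tags can point at the same version, it can help to see which tags resolve to which digest. The following is a minimal sketch, assuming a client that behaves like `kfp.registry.RegistryClient` and whose `list_tags` returns one dict per tag with `name` and `version` resource paths (that response shape is an assumption here). `tags_by_version` and `FakeRegistryClient` are hypothetical names; the fake client only exists so the sketch runs without credentials:

```python
from collections import defaultdict


def tags_by_version(client, package_name: str) -> dict:
    """Group tag short names by the version digest each tag points to."""
    grouped = defaultdict(list)
    for tag in client.list_tags(package_name):
        # Assumed shape: full resource paths under 'name' and 'version'.
        digest = tag["version"].split("/")[-1]
        grouped[digest].append(tag["name"].split("/")[-1])
    return {digest: sorted(names) for digest, names in grouped.items()}


class FakeRegistryClient:
    """Stand-in for a real registry client (illustration only)."""

    def list_tags(self, package_name):
        base = f"projects/p/locations/l/repositories/r/packages/{package_name}"
        return [
            {"name": f"{base}/tags/v1", "version": f"{base}/versions/sha256:abc"},
            {"name": f"{base}/tags/new", "version": f"{base}/versions/sha256:abc"},
        ]


result = tags_by_version(FakeRegistryClient(), "demo-pipeline")
print(result)
```

In the notebook, the call would be `tags_by_version(kfp_registry, template)`, provided the real response matches the assumed shape.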

### Pipeline Runs From The Registry

The pipeline could be downloaded locally first, but it is also possible to reference the pipeline in the repository directly when creating a run.  More methods, including no-code runs from the console, are covered in the workflow: [Vertex AI Pipelines - Management](./Vertex%20AI%20Pipelines%20-%20Management.ipynb).

#### Remote: Vertex AI SDK

This is identical to the normal use of the Vertex AI SDK for pipelines; only the `template_path` needs updating to point to the version in Artifact Registry.
- [`aiplatform.PipelineJob()`](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.PipelineJob)

The `template_path` can refer directly to a version (`@<version>`) or to any tag (`/<tag>`):

In [47]:
parameters = dict(
    project_id = PROJECT_ID,
    bq_project = 'bigquery-public-data',
    bq_dataset = 'ml_datasets',
    bq_table = 'penguins'
)

In [52]:
pipeline_job = aiplatform.PipelineJob(
    display_name = pipeline_name,
    template_path = f"{REPOSITORY.lower()}/{pipeline_name}/{version}",
    parameter_values = parameters,
    pipeline_root = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/pipeline_root',
    enable_caching = False # True (enabled), False (disable), None (defer to component level caching) 
)
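The two reference styles described for `template_path` can be sketched as a small helper. This is an illustration only: `template_uri` is a hypothetical function, not part of the Vertex AI SDK, and it follows the `@<version>` and `/<tag>` forms named in the text. The cell above appends the digest after a `/` instead, which is the form used in this notebook's runs.

```python
from typing import Optional


def template_uri(host: str, package: str, *,
                 version: Optional[str] = None,
                 tag: Optional[str] = None) -> str:
    """Build host/package@version or host/package/tag."""
    if (version is None) == (tag is None):
        raise ValueError("Pass exactly one of `version` or `tag`.")
    base = f"{host.lower().rstrip('/')}/{package}"
    return f"{base}@{version}" if version is not None else f"{base}/{tag}"


# Host, package name, and digest copied from the cell outputs above.
host = "https://us-central1-KFP.pkg.dev/statmike-mlops-349915/mlops"
package = "mlops-pipeline-pattern-modular-preview"
digest = "sha256:320595151ec9f4e0e884a61cb2093634c717871dcb272cfe0705e44dd40d0d58"

by_tag = template_uri(host, package, tag="v1")
by_version = template_uri(host, package, version=digest)
print(by_tag)
print(by_version)
```

A tag such as `v1` can be moved to a newer version later, so referencing the digest is the choice to make when a run must be reproducible.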

In [53]:
response = pipeline_job.submit(
    service_account = SERVICE_ACCOUNT
)

Creating PipelineJob
PipelineJob created. Resource name: projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-preview-20240801142330
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-preview-20240801142330')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/mlops-pipeline-pattern-modular-preview-20240801142330?project=1026793852137


In [54]:
pipeline_job.wait()

PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-preview-20240801142330 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-preview-20240801142330 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-preview-20240801142330 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-preview-20240801142330 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-preview-20240801142330 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-preview-20240801142

In [55]:
aiplatform.get_pipeline_df(pipeline = f'{pipeline_name}')

Unnamed: 0,pipeline_name,run_name,param.input:bq_project,param.input:bq_table,param.vmlmd_lineage_integration,param.input:project_id,param.input:bq_dataset,param.vertex-ai-pipelines-artifact-argument-binding,metric.f1,metric.average_precision,metric.precision,metric.accuracy,metric.recall,metric.confusionMatrix
0,mlops-pipeline-pattern-modular-preview,mlops-pipeline-pattern-modular-preview-2024080...,bigquery-public-data,penguins,{'pipeline_template_component': {'version_sha2...,statmike-mlops-349915,ml_datasets,{'output:metrics-class_metrics': ['projects/10...,0.986425,0.999493,0.990991,0.988372,0.982456,"{'rows': [{'row': [114.0, 2.0, 0.0]}, {'row': ..."
1,mlops-pipeline-pattern-modular-preview,mlops-pipeline-pattern-modular-preview-2024080...,bigquery-public-data,penguins,{'pipeline_template_component': {'template_id'...,statmike-mlops-349915,ml_datasets,{'output:metrics-2-class_metrics': ['projects/...,0.964231,0.995937,0.96646,0.965116,0.962418,"{'rows': [{'row': [33.0, 1.0, 0.0]}, {'row': [..."
2,mlops-pipeline-pattern-modular-preview,mlops-pipeline-pattern-modular-preview-2024080...,bigquery-public-data,penguins,{'pipeline_run_component': {'location_id': 'us...,statmike-mlops-349915,ml_datasets,{'output:metrics-3-class_metrics': ['projects/...,0.996848,0.99994,0.997199,0.996124,0.996528,{'annotationSpecs': [{'displayName': 'Adelie P...


---
## Example 2: Local Pipeline From Modular Components Saved in Artifact Registry

Components can be compiled and the resulting YAML can be treated as source and recalled easily from local directories as files, from URLs, and from strings. 

> Learn more about modular components and their structure in the workflow: [Vertex AI Pipelines - Management](./Vertex%20AI%20Pipelines%20-%20Management.ipynb).

### Compile Components

Just as pipelines can be compiled into YAML, so can components.  The following compiles the `data_source` component created above.
- [`kfp.compiler`](https://kubeflow-pipelines.readthedocs.io/en/latest/source/compiler.html)

In [60]:
type(data_source)

kfp.dsl.python_component.PythonComponent

In [56]:
kfp.compiler.Compiler().compile(
    data_source,
    package_path = f'{DIR}/component_data_source.yaml'
)

In [57]:
!ls {DIR}

component_data_source.yaml  mlops-pipeline-pattern-modular-preview.yaml


### Loading A Compiled Component

Managing (saving, reusing, and sharing) a compiled component file also requires loading the component back for use in future pipelines.  The [`kfp.components`](https://kubeflow-pipelines.readthedocs.io/en/latest/source/components.html) module offers three functions for loading components compiled as YAML, from a file, a URL, or a string:
- [`kfp.components.load_component_from_file`](https://kubeflow-pipelines.readthedocs.io/en/latest/source/components.html#kfp.components.load_component_from_file)
- [`kfp.components.load_component_from_url`](https://kubeflow-pipelines.readthedocs.io/en/latest/source/components.html#kfp.components.load_component_from_url)
- [`kfp.components.load_component_from_text`](https://kubeflow-pipelines.readthedocs.io/en/latest/source/components.html#kfp.components.load_component_from_text)

The following cell loads the `data_source` component from its compiled file and creates a new local component named `import_data_source`.

In [58]:
import_data_source = kfp.components.load_component_from_file(f'{DIR}/component_data_source.yaml')

In [59]:
type(import_data_source)

kfp.dsl.yaml_component.YamlComponent

In [61]:
type(data_source)

kfp.dsl.python_component.PythonComponent

### Save A Component To Artifact Registry

The compiled YAML file can also be saved in the KFP repository in Artifact Registry, just like the complete pipeline shown before.

In [62]:
template, version = kfp_registry.upload_pipeline(
    file_name = f"{DIR}/component_data_source.yaml",
    tags = ['v1', 'new'],
    extra_headers = dict(description = 'A single component: data_source', note = 'This is an example for a single component.')
)

In [69]:
template, version

('data-source',
 'sha256:e0ea97bcbe73b2b2f8b50cd1633f863562922eb38ff762e6957191bc7996d1a1')

### Download And Import Component From Artifact Registry

In [127]:
kfp_registry.download_pipeline(
    package_name = template,
    version = version,
    file_name = f'{DIR}/downloaded_component_data_source.yaml'
)

'temp/mlops-pipeline-pattern-modular/downloaded_component_data_source.yaml'

In [128]:
!ls {DIR}

component_data_source.yaml
downloaded_component_data_source.yaml
mlops-pipeline-pattern-modular-preview.yaml


In [129]:
import_data_source = kfp.components.load_component_from_file(f'{DIR}/downloaded_component_data_source.yaml')

In [130]:
type(import_data_source)

kfp.dsl.yaml_component.YamlComponent
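One quick check after a download is to compare the local copy with the file that was uploaded. A minimal sketch using only the standard library; `file_sha256` and `same_content` are hypothetical helpers, and whether the registry returns the file byte-for-byte is an assumption this check would confirm or refute. The temporary files below are stand-ins so the sketch runs on its own:

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def same_content(path_a, path_b) -> bool:
    """True when both files contain identical bytes."""
    return file_sha256(path_a) == file_sha256(path_b)


# Stand-in files; in the notebook these would be the two YAML files in DIR.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp, "component.yaml")
    downloaded = Path(tmp, "downloaded_component.yaml")
    other = Path(tmp, "other.yaml")
    original.write_text("name: data-source\n")
    downloaded.write_text("name: data-source\n")
    other.write_text("name: something-else\n")
    match = same_content(original, downloaded)
    mismatch = same_content(original, other)

print(match, mismatch)
```

In the notebook, the comparison would be `same_content(f'{DIR}/component_data_source.yaml', f'{DIR}/downloaded_component_data_source.yaml')`.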

### Create Pipeline

This is the same pipeline specification as above, but it uses the component downloaded and imported from Artifact Registry: `data_source`, now named `import_data_source`.

In [131]:
pipeline_name = f'{SERIES}-{EXPERIMENT}-example-2'

In [132]:
@kfp.dsl.pipeline(
    name = pipeline_name,
    description = 'A simple pipeline for testing',
    pipeline_root = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/pipeline_root'
)
def pipeline(
    project_id: str,
    bq_project: str,
    bq_dataset: str,
    bq_table: str
):
    # use the downloaded/imported component for data_source here
    bq_source = import_data_source(
        bq_project = bq_project,
        bq_dataset = bq_dataset,
        bq_table = bq_table
    )
    train_data = data_prep(
        project_id = project_id,
        bq_source = bq_source.output
    )
    model_1 = model_gb(
        train = train_data.outputs['train'],
        features = train_data.outputs['features']
    )
    model_2 = model_rf(
        train = train_data.outputs['train'],
        features = train_data.outputs['features']
    )
    metrics_1_train = metrics(
        data = train_data.outputs['train'],
        features = train_data.outputs['features'],
        model = model_1.output,
    ).set_display_name('Metrics: Training Data')
    metrics_1_test = metrics(
        data = train_data.outputs['test'],
        features = train_data.outputs['features'],
        model = model_1.output,
    ).set_display_name('Metrics: Test Data')
    metrics_2_train = metrics(
        data = train_data.outputs['train'],
        features = train_data.outputs['features'],
        model = model_2.output,
    ).set_display_name('Metrics: Training Data')
    metrics_2_test = metrics(
        data = train_data.outputs['test'],
        features = train_data.outputs['features'],
        model = model_2.output,
    ).set_display_name('Metrics: Test Data')
    
    review = overview(
        metrics_0 = metrics_1_train.outputs['metrics'],
        metrics_1 = metrics_1_test.outputs['metrics'],
        metrics_2 = metrics_2_train.outputs['metrics'],
        metrics_3 = metrics_2_test.outputs['metrics'],
        models = ['GB', 'GB', 'RF', 'RF'],
        data = ['Train', 'Test', 'Train', 'Test']
    )

### Compile Pipeline

In [133]:
kfp.compiler.Compiler().compile(
    pipeline_func = pipeline,
    package_path = f'{DIR}/{pipeline_name}.yaml'
)

### Create Pipeline Job

In [134]:
parameters = dict(
    project_id = PROJECT_ID,
    bq_project = 'bigquery-public-data',
    bq_dataset = 'ml_datasets',
    bq_table = 'penguins'
)

In [135]:
pipeline_job = aiplatform.PipelineJob(
    display_name = pipeline_name,
    template_path = f"{DIR}/{pipeline_name}.yaml",
    parameter_values = parameters,
    pipeline_root = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/pipeline_root',
    enable_caching = None # True (enabled), False (disable), None (defer to component level caching) 
)

### Submit Pipeline Job

In [136]:
response = pipeline_job.submit(
    service_account = SERVICE_ACCOUNT
)

Creating PipelineJob
PipelineJob created. Resource name: projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-2-20240801170641
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-2-20240801170641')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/mlops-pipeline-pattern-modular-example-2-20240801170641?project=1026793852137


In [137]:
print(f'The Dashboard can be viewed here:\n{pipeline_job._dashboard_uri()}')

The Dashboard can be viewed here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/mlops-pipeline-pattern-modular-example-2-20240801170641?project=1026793852137


In [138]:
pipeline_job.wait()

PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-2-20240801170641 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-2-20240801170641 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-2-20240801170641 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-2-20240801170641 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-2-20240801170641 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-2

### Retrieve Pipeline Information

In [139]:
aiplatform.get_pipeline_df(pipeline = f'{pipeline_name}')[0:10]

Unnamed: 0,pipeline_name,run_name,param.input:bq_table,param.input:bq_project,param.input:project_id,param.vertex-ai-pipelines-artifact-argument-binding,param.vmlmd_lineage_integration,param.input:bq_dataset,metric.precision,metric.f1,metric.accuracy,metric.average_precision,metric.recall,metric.confusionMatrix
0,mlops-pipeline-pattern-modular-example-2,mlops-pipeline-pattern-modular-example-2-20240...,penguins,bigquery-public-data,statmike-mlops-349915,{'output:metrics-3-class_metrics': ['projects/...,{'pipeline_run_component': {'location_id': 'us...,ml_datasets,0.975334,0.973245,0.976744,0.997809,0.971285,"{'rows': [{'row': [39.0, 0.0, 0.0]}, {'row': [..."


---
## Example 3: Pipelines As Components

This example combines several components into a pipeline.  It then saves this pipeline and recalls it (download and import) as a component in a new pipeline.  The new pipeline is saved to Artifact Registry and used in a pipeline run.

### Pipeline: Data Preparation

This pipeline combines the `data_source` and `data_prep` components from the full pipeline into a new, smaller pipeline.

In [155]:
pipeline_name = f'{SERIES}-{EXPERIMENT}-example-3-data-preparation'

In [156]:
# NamedTuple is used in the signature below, so import it before the definition
from typing import NamedTuple

@kfp.dsl.pipeline(
    name = pipeline_name,
    description = 'A simple pipeline for testing',
    pipeline_root = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/pipeline_root'
)
def pipeline(
    project_id: str,
    bq_project: str,
    bq_dataset: str,
    bq_table: str
) -> NamedTuple(
        'output',
        train=kfp.dsl.Dataset,
        test=kfp.dsl.Dataset,
        features=kfp.dsl.Artifact
):
    
    # use the downloaded/imported component for data_source here
    bq_source = import_data_source(
        bq_project = bq_project,
        bq_dataset = bq_dataset,
        bq_table = bq_table
    )
    train_data = data_prep(
        project_id = project_id,
        bq_source = bq_source.output
    )

    # same output structure as the return annotation
    outputs = NamedTuple(
            'output',
            train=kfp.dsl.Dataset,
            test=kfp.dsl.Dataset,
            features=kfp.dsl.Artifact
    )
    
    return outputs(
        train_data.outputs['train'],
        train_data.outputs['test'],
        train_data.outputs['features']
    )
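The return annotation and the `outputs` type inside the body describe the same three fields. A minimal sketch of declaring that type once and reusing it, with plain `str` standing in for `kfp.dsl.Dataset` and `kfp.dsl.Artifact` so it runs without KFP. `DataPrepOutputs` and `fake_data_prep` are hypothetical names, and whether a given KFP release accepts a pre-declared type in a pipeline signature should be checked against its documentation. The sketch uses the list-of-pairs form because the keyword form (`NamedTuple('output', train=...)`) is deprecated as of Python 3.13.

```python
from typing import NamedTuple

# Declared once; plain str stands in for the KFP artifact types.
DataPrepOutputs = NamedTuple(
    "DataPrepOutputs",
    [("train", str), ("test", str), ("features", str)],
)


def fake_data_prep() -> DataPrepOutputs:
    """Return the three outputs by name, mirroring the pipeline's return."""
    return DataPrepOutputs(
        train="train.csv",
        test="test.csv",
        features="features.json",
    )


result = fake_data_prep()
print(result.train, result._fields)
```

Returning the fields by keyword, rather than by position as in the cell above, guards against accidentally swapping `train` and `test`.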

#### Compile Pipeline

In [157]:
kfp.compiler.Compiler().compile(
    pipeline_func = pipeline,
    package_path = f'{DIR}/{pipeline_name}.yaml'
)

#### Upload Pipeline To Registry

Like other repository types, a KFP repository references artifacts in two ways: by version and by tag.  Versions are assigned and managed by the repository, while tags supply readable names for versions.

In [158]:
!ls {DIR}

component_data_source.yaml
downloaded_component_data_source.yaml
mlops-pipeline-pattern-modular-example-2.yaml
mlops-pipeline-pattern-modular-example-3-data-preparation.yaml
mlops-pipeline-pattern-modular-example-3-data_preparation.yaml
mlops-pipeline-pattern-modular-preview.yaml


In [159]:
pipeline_name

'mlops-pipeline-pattern-modular-example-3-data-preparation'

In [160]:
template, version = kfp_registry.upload_pipeline(
    file_name = f"{DIR}/{pipeline_name}.yaml",
    tags = ['v1', 'new'],
    extra_headers = dict(description = 'Data Prep Pipeline Template', note = 'This is an example for the data prep pipeline.')
)

In [161]:
template, version

('mlops-pipeline-pattern-modular-example-3-data-preparation',
 'sha256:b36f6b0b6dc7df584a669f600ef36dacbda23ff23244216b8aa1700691df4a0f')

#### Pipeline Runs From The Registry

The pipeline could be downloaded locally first, but it is also possible to reference the pipeline in the repository directly when creating a run.  More methods, including no-code runs from the console, are covered in the workflow: [Vertex AI Pipelines - Management](./Vertex%20AI%20Pipelines%20-%20Management.ipynb).

In [162]:
parameters = dict(
    project_id = PROJECT_ID,
    bq_project = 'bigquery-public-data',
    bq_dataset = 'ml_datasets',
    bq_table = 'penguins'
)

In [164]:
pipeline_job = aiplatform.PipelineJob(
    display_name = pipeline_name,
    template_path = f"{REPOSITORY.lower()}/{pipeline_name}/{version}",
    parameter_values = parameters,
    pipeline_root = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/pipeline_root',
    enable_caching = False # True (enabled), False (disable), None (defer to component level caching) 
)

In [165]:
response = pipeline_job.submit(
    service_account = SERVICE_ACCOUNT
)

Creating PipelineJob
PipelineJob created. Resource name: projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-3-data-preparation-20240801173254
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-3-data-preparation-20240801173254')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/mlops-pipeline-pattern-modular-example-3-data-preparation-20240801173254?project=1026793852137


In [166]:
pipeline_job.wait()

PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-3-data-preparation-20240801173254 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-3-data-preparation-20240801173254 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-3-data-preparation-20240801173254 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-3-data-preparation-20240801173254 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-3-data-preparation-20240801173254 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob run completed

In [167]:
aiplatform.get_pipeline_df(pipeline = f'{pipeline_name}')

Unnamed: 0,pipeline_name,run_name,param.input:bq_project,param.vertex-ai-pipelines-artifact-argument-binding,param.input:bq_table,param.input:bq_dataset,param.input:project_id,param.vmlmd_lineage_integration,metric.filename,metric.samples,metric.train_n,metric.label_col,metric.features,metric.label_values,metric.test_n
0,mlops-pipeline-pattern-modular-example-3-data-...,mlops-pipeline-pattern-modular-example-3-data-...,bigquery-public-data,{'output:train': ['projects/1026793852137/loca...,penguins,ml_datasets,statmike-mlops-349915,{'pipeline_template_component': {'task_name': ...,data.txt,258.0,258.0,species,"[island, culmen_length_mm, culmen_depth_mm, fl...","[Gentoo penguin (Pygoscelis papua), Adelie Pen...",86.0


### Pipeline: Full Pipeline With Data Prep Pipeline as Component

Reconstruct the full pipeline using the downloaded data prep pipeline as a component.

#### Download and Import Data Prep Pipeline

In [168]:
kfp_registry.download_pipeline(
    package_name = template,
    version = version,
    file_name = f'{DIR}/downloaded_data_prep_pipeline.yaml'
)

'temp/mlops-pipeline-pattern-modular/downloaded_data_prep_pipeline.yaml'

In [169]:
!ls {DIR}

component_data_source.yaml
downloaded_component_data_source.yaml
downloaded_data_prep_pipeline.yaml
mlops-pipeline-pattern-modular-example-2.yaml
mlops-pipeline-pattern-modular-example-3-data-preparation.yaml
mlops-pipeline-pattern-modular-example-3-data_preparation.yaml
mlops-pipeline-pattern-modular-preview.yaml


In [170]:
data_prep_pipeline = kfp.components.load_component_from_file(f'{DIR}/downloaded_data_prep_pipeline.yaml')

In [171]:
type(data_prep_pipeline)

kfp.dsl.yaml_component.YamlComponent

#### Build Pipeline Using Sub-Pipeline As Component

In [175]:
pipeline_name = f'{SERIES}-{EXPERIMENT}-example-3'

In [176]:
@kfp.dsl.pipeline(
    name = pipeline_name,
    description = 'A simple pipeline for testing',
    pipeline_root = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/pipeline_root'
)
def pipeline(
    project_id: str,
    bq_project: str,
    bq_dataset: str,
    bq_table: str
):

    # use the pipeline as a component:
    train_data = data_prep_pipeline(
        project_id = project_id,
        bq_project = bq_project,
        bq_dataset = bq_dataset,
        bq_table = bq_table
    )
    model_1 = model_gb(
        train = train_data.outputs['train'],
        features = train_data.outputs['features']
    )
    model_2 = model_rf(
        train = train_data.outputs['train'],
        features = train_data.outputs['features']
    )
    metrics_1_train = metrics(
        data = train_data.outputs['train'],
        features = train_data.outputs['features'],
        model = model_1.output,
    ).set_display_name('Metrics: Training Data')
    metrics_1_test = metrics(
        data = train_data.outputs['test'],
        features = train_data.outputs['features'],
        model = model_1.output,
    ).set_display_name('Metrics: Test Data')
    metrics_2_train = metrics(
        data = train_data.outputs['train'],
        features = train_data.outputs['features'],
        model = model_2.output,
    ).set_display_name('Metrics: Training Data')
    metrics_2_test = metrics(
        data = train_data.outputs['test'],
        features = train_data.outputs['features'],
        model = model_2.output,
    ).set_display_name('Metrics: Test Data')
    
    review = overview(
        metrics_0 = metrics_1_train.outputs['metrics'],
        metrics_1 = metrics_1_test.outputs['metrics'],
        metrics_2 = metrics_2_train.outputs['metrics'],
        metrics_3 = metrics_2_test.outputs['metrics'],
        models = ['GB', 'GB', 'RF', 'RF'],
        data = ['Train', 'Test', 'Train', 'Test']
    )

#### Compile Pipeline

In [177]:
kfp.compiler.Compiler().compile(
    pipeline_func = pipeline,
    package_path = f'{DIR}/{pipeline_name}.yaml'
)

#### Upload Pipeline To Registry

Like other repository types, a KFP repository references artifacts in two ways: by version and by tag.  Versions are assigned and managed by the repository, while tags supply readable names for versions.

In [178]:
!ls {DIR}

component_data_source.yaml
downloaded_component_data_source.yaml
downloaded_data_prep_pipeline.yaml
mlops-pipeline-pattern-modular-example-2.yaml
mlops-pipeline-pattern-modular-example-3-data-preparation.yaml
mlops-pipeline-pattern-modular-example-3-data_preparation.yaml
mlops-pipeline-pattern-modular-example-3.yaml
mlops-pipeline-pattern-modular-preview.yaml


In [179]:
pipeline_name

'mlops-pipeline-pattern-modular-example-3'

In [180]:
template, version = kfp_registry.upload_pipeline(
    file_name = f"{DIR}/{pipeline_name}.yaml",
    tags = ['v1', 'new'],
    extra_headers = dict(description = 'Example 3 pipeline built from pipelines', note = 'This is an example of a modular pipeline construction.')
)

In [181]:
template, version

('mlops-pipeline-pattern-modular-example-3',
 'sha256:299e61e951dfc528ae9ab16d9251a1df9971bfbed409d924a3fdb41c5fc05775')

#### Pipeline Runs From The Registry

The pipeline could be downloaded locally first, but it is also possible to reference the pipeline in the repository directly when creating a run.  More methods, including no-code runs from the console, are covered in the workflow: [Vertex AI Pipelines - Management](./Vertex%20AI%20Pipelines%20-%20Management.ipynb).

In [182]:
parameters = dict(
    project_id = PROJECT_ID,
    bq_project = 'bigquery-public-data',
    bq_dataset = 'ml_datasets',
    bq_table = 'penguins'
)

In [183]:
pipeline_job = aiplatform.PipelineJob(
    display_name = pipeline_name,
    template_path = f"{REPOSITORY.lower()}/{pipeline_name}/{version}",
    parameter_values = parameters,
    pipeline_root = f'gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}/pipeline_root',
    enable_caching = False # True (enabled), False (disable), None (defer to component level caching) 
)

In [184]:
response = pipeline_job.submit(
    service_account = SERVICE_ACCOUNT
)

Creating PipelineJob
PipelineJob created. Resource name: projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-3-20240801174507
To use this PipelineJob in another session:
pipeline_job = aiplatform.PipelineJob.get('projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-3-20240801174507')
View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/mlops-pipeline-pattern-modular-example-3-20240801174507?project=1026793852137


In [185]:
pipeline_job.wait()

PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-3-20240801174507 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-3-20240801174507 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-3-20240801174507 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-3-20240801174507 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-3-20240801174507 current state:
PipelineState.PIPELINE_STATE_RUNNING
PipelineJob projects/1026793852137/locations/us-central1/pipelineJobs/mlops-pipeline-pattern-modular-example-3

In [186]:
aiplatform.get_pipeline_df(pipeline = f'{pipeline_name}')

Unnamed: 0,pipeline_name,run_name,param.vertex-ai-pipelines-artifact-argument-binding,param.vmlmd_lineage_integration,param.input:project_id,param.input:bq_table,param.input:bq_project,param.input:bq_dataset,metric.confusionMatrix,metric.precision,metric.f1,metric.average_precision,metric.accuracy,metric.recall
0,mlops-pipeline-pattern-modular-example-3,mlops-pipeline-pattern-modular-example-3-20240...,{'output:metrics-2-metrics': ['projects/102679...,{'pipeline_template_component': {'template_id'...,statmike-mlops-349915,penguins,bigquery-public-data,ml_datasets,"{'rows': [{'row': [118.0, 0.0, 0.0]}, {'row': ...",0.152455,0.20922,0.345348,0.457364,0.333333
