## Using Vertex ML Metadata with Pipelines

In this lab, you will learn how to analyze metadata from your Vertex Pipelines runs with Vertex ML Metadata.

### What you'll learn

You'll learn how to:

- Use the Kubeflow Pipelines SDK to build an ML pipeline that creates a dataset in Vertex AI, and trains and deploys a custom Scikit-learn model on that dataset
- Write custom pipeline components that generate artifacts and metadata
- Compare Vertex Pipelines runs, both in the Cloud console and programmatically
- Trace the lineage for pipeline-generated artifacts
- Query your pipeline run metadata



The focus of this lab is on understanding metadata from pipeline runs. To do that, we first need a pipeline to run on Vertex Pipelines, so that is where we'll start. Here we'll define a 3-step pipeline with the following custom components:

- get_dataframe: Retrieve data from a BigQuery table and convert it into a Pandas DataFrame
- sklearn_train: Use the Pandas DataFrame to train and export a Scikit-learn model, along with some metrics
- deploy_model: Deploy the exported Scikit-learn model to an endpoint in Vertex AI

In this pipeline, we'll use the UCI Machine Learning Dry Beans dataset, from: KOKLU, M. and OZKAN, I.A., (2020), "Multiclass Classification of Dry Beans Using Computer Vision and Machine Learning Techniques." In Computers and Electronics in Agriculture, 174, 105507. DOI.

This is a tabular dataset, and in our pipeline we'll use the dataset to train, evaluate, and deploy a Scikit-learn model that classifies beans into one of 7 types based on their characteristics. Let's start coding!

In [None]:
!pip install kfp

In [17]:
import matplotlib.pyplot as plt
import pandas as pd

from kfp.v2 import compiler, dsl
from kfp.v2.dsl import pipeline, component, Artifact, Dataset, Input, Metrics, Model, Output, InputPath, OutputPath

from google.cloud import aiplatform

# We'll use this namespace for metadata querying
from google.cloud import aiplatform_v1

In [18]:
BUCKET_NAME = "gs://demogct-wd"
REGION = "us-central1"
PROJECT_ID = "demogct"

In [19]:

PIPELINE_ROOT = f"{BUCKET_NAME}/pipeline_root/"
PIPELINE_ROOT

'gs://demogct-wd/pipeline_root/'

### Create DataFrame component


In [20]:
# Generic data-loading component: reads whichever BigQuery table is passed in as bq_table
@component(
    packages_to_install=["google-cloud-bigquery", "pandas", "pyarrow"],
    base_image="python:3.9",
    output_component_file="create_dataset.yaml"
)
def get_dataframe(
    bq_table: str,
    output_data_path: OutputPath("Dataset")
):
    from google.cloud import bigquery
    import pandas as pd

    bqclient = bigquery.Client()
    table = bigquery.TableReference.from_string(
        bq_table
    )
    rows = bqclient.list_rows(
        table
    )
    dataframe = rows.to_dataframe(
        create_bqstorage_client=True,
    )
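    # Shuffle the rows (fixed seed for reproducibility) before writing them out
    dataframe = dataframe.sample(frac=1, random_state=2)
    # index=False keeps the shuffled row index from becoming an extra feature column
    dataframe.to_csv(output_data_path, index=False)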

Let's take a closer look at what's happening in this component:

- The @component decorator compiles this function to a component when the pipeline is run. You'll use this anytime you write a custom component.
- The base_image parameter specifies the container image this component will use.
- This component uses a few Python libraries, which we specify via the packages_to_install parameter.
- The output_component_file parameter is optional, and specifies the YAML file to write the compiled component to. After running the cell you should see that file written to your notebook instance.
- Finally, the component body uses the BigQuery Python client library to download our data from BigQuery into a Pandas DataFrame, shuffles the rows, and writes the result to a CSV file as an output artifact. This artifact is passed as input to our next component.
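
To make the artifact hand-off concrete, here is a minimal local sketch that runs without KFP or BigQuery. It assumes that an OutputPath parameter reaches the function as an ordinary file path string. The helper name write_dataset and the sample rows are hypothetical, and the standard library's random and csv modules stand in for the pandas sample and to_csv calls.

```python
import csv
import os
import random
import tempfile


def write_dataset(rows, header, output_data_path, seed=2):
    """Shuffle rows with a fixed seed and write them to the given path as CSV."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    with open(output_data_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(shuffled)
    return shuffled


# Hypothetical rows standing in for the BigQuery table contents.
header = ["Amount", "Class"]
rows = [[10.0, 0], [25.5, 0], [99.9, 1], [3.2, 0]]

with tempfile.TemporaryDirectory() as tmp:
    # The pipeline backend would normally choose this path for the component.
    path = os.path.join(tmp, "dataset.csv")
    first = write_dataset(rows, header, path)
    second = write_dataset(rows, header, path)
    with open(path, newline="") as f:
        written = list(csv.reader(f))
```

Because the seed is fixed, both calls produce the same row order, which keeps repeated runs on the same table comparable.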


In [22]:
# Custom training component
# The PyPI package is named scikit-learn; the "sklearn" alias is deprecated
@component(
    packages_to_install=["scikit-learn", "pandas", "joblib"],
    base_image="python:3.9",
    output_component_file="fraud_model_component.yaml",
)
def sklearn_train(
    dataset: Input[Dataset],
    metrics: Output[Metrics],
    model: Output[Model]
):
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from joblib import dump

    import pandas as pd
    df = pd.read_csv(dataset.path)
    labels = df.pop("Class").tolist()
    data = df.values.tolist()
    x_train, x_test, y_train, y_test = train_test_split(data, labels)

    skmodel = DecisionTreeClassifier()
    skmodel.fit(x_train, y_train)
    score = skmodel.score(x_test, y_test)
    print("accuracy is:", score)

    # Record metrics as metadata on the Metrics output artifact
    metrics.log_metric("accuracy", score * 100.0)
    metrics.log_metric("framework", "Scikit Learn")
    metrics.log_metric("dataset_size", len(df))
    # Save as model.joblib, the file name the prebuilt scikit-learn serving container expects
    dump(skmodel, model.path + ".joblib")
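
If you want to check the metric logic outside a pipeline, something like the following may help. RecordingMetrics is a hypothetical stand-in that only mimics the log_metric(name, value) call used in the component; it is not part of the KFP SDK, and the labels and predictions are made up for illustration.

```python
class RecordingMetrics:
    """Hypothetical stand-in for a Metrics artifact: stores logged values in a dict."""

    def __init__(self):
        self.metadata = {}

    def log_metric(self, name, value):
        self.metadata[name] = value


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels (mean accuracy)."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels and predictions; 7 of the 8 predictions are correct.
y_test = [0, 0, 1, 0, 1, 0, 0, 1]
y_pred = [0, 0, 1, 0, 0, 0, 0, 1]

metrics = RecordingMetrics()
score = accuracy(y_test, y_pred)
metrics.log_metric("accuracy", score * 100.0)
metrics.log_metric("framework", "Scikit Learn")
metrics.log_metric("dataset_size", len(y_test))
```

The resulting dictionary has the same three keys that the component attaches to its Metrics artifact.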

### Define a component to upload and deploy the model to Vertex AI

Finally, our last component will take the trained model from the previous step, upload it to Vertex AI, and deploy it to an endpoint:

In [8]:
@component(
    packages_to_install=["google-cloud-aiplatform"],
    base_image="python:3.9",
    output_component_file="fraud_deploy_component.yaml",
)
def deploy_model(
    model: Input[Model],
    project: str,
    region: str,
    vertex_endpoint: Output[Artifact],
    vertex_model: Output[Model]
):
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=region)

    deployed_model = aiplatform.Model.upload(
        display_name="fraud-model-pipeline",
        # Point at the directory that holds model.joblib rather than at the artifact path itself
        artifact_uri=model.uri.replace("model", ""),
        serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-24:latest"
    )
    endpoint = deployed_model.deploy(machine_type="n1-standard-4")

    # Save data to the output params
    vertex_endpoint.uri = endpoint.resource_name
    vertex_model.uri = deployed_model.resource_name
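
One caveat with model.uri.replace("model", ""): str.replace removes every occurrence of the substring, so a bucket or folder name that contains "model" would be altered as well. A more defensive option, sketched below, strips only the final path segment. The helper name parent_dir and the example URI are hypothetical.

```python
def parent_dir(uri: str) -> str:
    """Return the URI of the directory containing the artifact, with a trailing slash."""
    return uri.rstrip("/").rsplit("/", 1)[0] + "/"


# Hypothetical artifact URI whose bucket name happens to contain "model".
uri = "gs://my-models-bucket/pipeline_root/run-1/sklearn-train/model"

naive = uri.replace("model", "")  # also edits the bucket name
safe = parent_dir(uri)            # only drops the last path segment
```

With the bucket used in this lab the two approaches give the same result, so this only matters if your paths contain the word "model" elsewhere.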

### Define and compile the pipeline

Now that we've defined our three components, we'll create our pipeline definition, which describes how input and output artifacts flow between steps:

In [23]:
# dsl.pipeline is used instead of the bare pipeline decorator because the function
# below is also named pipeline and would shadow the decorator if this cell is re-run.
@dsl.pipeline(
    # Default pipeline root. You can override it when submitting the pipeline.
    pipeline_root=PIPELINE_ROOT,
    # A name for the pipeline.
    name="fraud-pipeline"
)
def pipeline(
    bq_table: str = "",
    output_data_path: str = "data.csv",
    project: str = PROJECT_ID,
    region: str = REGION
):
    # Read the training data from BigQuery
    dataset_task = get_dataframe(bq_table=bq_table)

    # Train on the dataset artifact produced by the previous step
    model_task = sklearn_train(
        dataset=dataset_task.output
    )

    deploy_task = deploy_model(
        model=model_task.outputs["model"],
        project=project,
        region=region
    )
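
The data dependencies declared in the pipeline function determine execution order: a step can only start once the steps whose outputs it consumes have finished. As a rough illustration that does not involve KFP, the same dependency graph can be written as a dictionary and ordered with the standard library's graphlib module (Python 3.9+). The dictionary is a hand-written assumption that mirrors the pipeline function, not something KFP produces.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it consumes.
dependencies = {
    "get_dataframe": set(),
    "sklearn_train": {"get_dataframe"},  # consumes dataset_task.output
    "deploy_model": {"sklearn_train"},   # consumes model_task.outputs["model"]
}

# static_order() yields each step only after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```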

In [24]:
compiler.Compiler().compile(
    pipeline_func=pipeline, package_path="fraud_pipeline.json"
)

### Start two pipeline runs
Next we'll kick off two runs of our pipeline. First let's define a timestamp to use for our pipeline job IDs:

In [12]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
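
The format string produces a compact suffix with one-second resolution that sorts chronologically. The small sketch below uses a fixed datetime, chosen to match the job ID that appears in the log output later in this section; make_job_id is a hypothetical helper.

```python
from datetime import datetime


def make_job_id(prefix: str, when: datetime) -> str:
    """Build a job ID of the form <prefix>-YYYYmmddHHMMSS."""
    return "{0}-{1}".format(prefix, when.strftime("%Y%m%d%H%M%S"))


job_id = make_job_id("fraud-pipeline", datetime(2021, 12, 11, 15, 15, 21))
print(job_id)
```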

Remember that the only parameter we need to supply at run time is bq_table, the table to use for training data (the other parameters have defaults). This first run uses the bigquery-public-data.ml_datasets.ulb_fraud_detection table:

In [27]:
# First run: the small dataset
run1 = aiplatform.PipelineJob(
    display_name="fraud-pipeline",
    template_path="fraud_pipeline.json",
    job_id="fraud-pipeline-{0}".format(TIMESTAMP),
    parameter_values={"bq_table": "bigquery-public-data.ml_datasets.ulb_fraud_detection"},
    enable_caching=True,
)

The second run uses the larger dataset. It is defined the same way as the first; only parameter_values changes:

parameter_values={"bq_table": "sara-vertex-demos.beans_demo.large_dataset"},
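
Since both runs are created in the same session they share TIMESTAMP, so each one needs its own job ID. One possible way to keep the two definitions in sync is to build the keyword arguments with a small helper and pass them on to aiplatform.PipelineJob. The helper name, the "small" and "large" labels, and the fixed timestamp below are assumptions for illustration; only the table names and the template path come from this lab.

```python
def pipeline_job_kwargs(label: str, bq_table: str, timestamp: str) -> dict:
    """Keyword arguments for one run; the label keeps job IDs distinct."""
    return {
        "display_name": "fraud-pipeline-{0}".format(label),
        "template_path": "fraud_pipeline.json",
        "job_id": "fraud-pipeline-{0}-{1}".format(label, timestamp),
        "parameter_values": {"bq_table": bq_table},
        "enable_caching": True,
    }


TIMESTAMP = "20211211151521"  # fixed here so the example is reproducible

run1_kwargs = pipeline_job_kwargs(
    "small", "bigquery-public-data.ml_datasets.ulb_fraud_detection", TIMESTAMP
)
run2_kwargs = pipeline_job_kwargs(
    "large", "sara-vertex-demos.beans_demo.large_dataset", TIMESTAMP
)

# In the notebook these would be passed on as, for example:
# run2 = aiplatform.PipelineJob(**run2_kwargs)
```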

Finally, kick off pipeline executions for both runs. It's best to do this in two separate notebook cells so you can see the output for each run.

In [26]:
run1.submit()

INFO:google.cloud.aiplatform.pipeline_jobs:Creating PipelineJob
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob created. Resource name: projects/313681173937/locations/us-central1/pipelineJobs/fraud-pipeline-20211211151521
INFO:google.cloud.aiplatform.pipeline_jobs:To use this PipelineJob in another session:
INFO:google.cloud.aiplatform.pipeline_jobs:pipeline_job = aiplatform.PipelineJob.get('projects/313681173937/locations/us-central1/pipelineJobs/fraud-pipeline-20211211151521')
INFO:google.cloud.aiplatform.pipeline_jobs:View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/fraud-pipeline-20211211151521?project=313681173937
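
The resource name in the log follows the pattern projects/<project>/locations/<region>/pipelineJobs/<job id>. If you need the parts separately, for example to look a run up again later, a simple parser along these lines should do. parse_pipeline_job_name is a hypothetical helper, not part of the SDK.

```python
import re

# Pattern inferred from the resource name printed in the log output.
_PIPELINE_JOB_NAME = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/pipelineJobs/(?P<job_id>[^/]+)$"
)


def parse_pipeline_job_name(resource_name: str) -> dict:
    """Split a pipeline job resource name into project, location, and job ID."""
    match = _PIPELINE_JOB_NAME.match(resource_name)
    if match is None:
        raise ValueError("Not a pipeline job resource name: " + resource_name)
    return match.groupdict()


parts = parse_pipeline_job_name(
    "projects/313681173937/locations/us-central1/pipelineJobs/fraud-pipeline-20211211151521"
)
print(parts)
```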


#### Submit the second Job

In [31]:
%%bigquery

select * from bigquery-public-data.ml_datasets.ulb_fraud_detection limit 100;




Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,282.0,-0.356466,0.725418,1.971749,0.831343,0.369681,-0.107776,0.751610,-0.120166,-0.420675,...,0.020804,0.424312,-0.015989,0.466754,-0.809962,0.657334,-0.043150,-0.046401,0.0,0
1,380.0,-1.299837,0.881817,1.452842,-1.293698,-0.025105,-1.170103,0.861610,-0.193934,0.592001,...,-0.272563,-0.360853,0.223911,0.598930,-0.397705,0.637141,0.234872,0.021379,0.0,0
2,403.0,1.237413,0.512365,0.687746,1.693872,-0.236323,-0.650232,0.118066,-0.230545,-0.808523,...,-0.077543,-0.178220,0.038722,0.471218,0.289249,0.871803,-0.066884,0.012986,0.0,0
3,430.0,-1.860258,-0.629859,0.966570,0.844632,0.759983,-1.481173,-0.509681,0.540722,-0.733623,...,0.268028,0.125515,-0.225029,0.586664,-0.031598,0.570168,-0.043007,-0.223739,0.0,0
4,711.0,-0.431349,1.027694,2.670816,2.084787,-0.274567,0.286856,0.152110,0.200872,-0.596505,...,0.001241,0.154170,-0.141533,0.384610,-0.147132,-0.087100,0.101117,0.077944,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,13746.0,1.109824,0.331352,1.560526,2.719556,-0.483555,0.712093,-0.741679,0.236268,1.119022,...,-0.204147,-0.206828,0.000416,-0.067350,0.297569,-0.079282,0.013765,0.016206,0.0,0
96,13789.0,1.176946,0.455849,0.667277,1.662556,0.167902,0.360023,-0.219811,0.043702,0.363473,...,-0.180051,-0.267334,0.037060,-0.306686,0.178491,0.868662,-0.086905,-0.012165,0.0,0
97,14123.0,-1.812070,-2.323482,2.055503,-1.650284,1.417647,-1.071647,-2.022478,0.531297,1.006664,...,0.508821,1.043305,0.249980,-0.384754,-0.540387,-0.489493,0.088495,0.198365,0.0,0
98,14126.0,1.184445,0.254904,1.047054,2.539504,0.108843,1.459859,-0.787241,0.337184,1.552502,...,-0.306462,-0.459128,-0.080633,-1.399512,0.374303,0.035759,0.028059,0.010794,0.0,0


In [32]:
%%bigquery
select count(*) from bigquery-public-data.ml_datasets.ulb_fraud_detection limit 100;



Unnamed: 0,f0_
0,284807
