# 03c - BQML + Vertex AI > Pipelines - automated pipelines for updating models
As time goes on, change occurs:

* inputs to our models may shift in distribution compared to when the model was trained - called training-serving skew
* inputs to our models may shift over time - called prediction drift
* new inputs/features may become available
* a better model may be created
<br>

* In the 03b notebook we deployed the model built with BQML in the 03a notebook to a Vertex AI Endpoint for online prediction. 
* In this notebook we will build a challenger model with the same training data, also using BQML but with a different model type - a deep neural network similar to what we build in the 05 series of notebooks.
* We will construct a Vertex AI Pipeline to orchestrate the process of building the new model, comparing it to the deployed model, and conditionally replacing the deployed model with the new one.

This process could be triggered by elapsed time, by the amount of new data, or by training-serving skew or prediction drift detected with Vertex AI Model Monitoring.
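
To make a drift-based trigger concrete, here is a minimal sketch of one way a distribution shift could be scored: the Jensen-Shannon divergence between a training histogram and a serving histogram of the same feature. The function names, the example bin counts, and the 0.1 threshold are illustrative assumptions; they are not taken from this notebook or from the monitoring service's configuration.

```python
import math
from typing import Sequence


def _normalize(counts: Sequence[float]) -> list:
    """Turn raw bin counts into probabilities that sum to 1."""
    total = float(sum(counts))
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    return [c / total for c in counts]


def js_divergence(train_counts: Sequence[float], serve_counts: Sequence[float]) -> float:
    """Jensen-Shannon divergence (base 2) between two histograms over the same bins.

    Returns 0.0 for identical distributions and 1.0 for distributions with no overlap.
    """
    if len(train_counts) != len(serve_counts):
        raise ValueError("both histograms must use the same bins")
    p = _normalize(train_counts)
    q = _normalize(serve_counts)
    # Mixture distribution: the midpoint of p and q.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Bins where a_i == 0 contribute nothing to the KL divergence.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def should_retrain(train_counts, serve_counts, threshold: float = 0.1) -> bool:
    """Hypothetical trigger: retrain when the divergence exceeds the threshold."""
    return js_divergence(train_counts, serve_counts) > threshold
```

A scheduled job could call `should_retrain` on per-feature histograms and submit the pipeline only when it returns `True`; the right threshold depends on the feature and would need tuning.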

---
## Prerequisites:
* 03a - BigQuery Machine Learning (BQML) - Machine Learning with SQL
* 03b - Vertex AI + BQML - Online Predictions with BQML Models

---
## Overview:
* Build Custom Pipeline Components
    * Use BigQuery ML to Get Predictions and Scikit-Learn to calculate model metrics
    * Use BigQuery ML to train a new model - A Deep Neural Network
    * Compare model metrics for baseline and challenger model
    * Export BigQuery ML model to Google Cloud Storage
    * Replace a model deployed to an endpoint (03b) with the challenger model, undeploy previous model
* Define the Pipeline Flow
* Compile the Pipeline
* Run the Pipeline in Vertex AI
* Get Predictions from the updated Endpoint

---
## Setup

Inputs:

In [1]:
REGION = 'us-central1'
PROJECT_ID='nguyen-demo5'
DATANAME = 'taxi'
NOTEBOOK = '03c'

# Resources
DEPLOY_IMAGE='us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-3:latest'
DEPLOY_COMPUTE = 'n1-standard-4'

# Model Training
VAR_TARGET = 'tips_label'
# Based on the best results among the models, logistic regression version 4 (taxi_lr_v4) was selected for online prediction.
VAR_OMIT= 'unique_key taxi_id trip_start_timestamp trip_end_timestamp trip_miles pickup_census_tract dropoff_census_tract pickup_community_area dropoff_community_area tips extras trip_total pickup_latitude pickup_longitude dropoff_latitude dropoff_longitude' # add more variables to the string with space delimiters

Packages:

In [2]:
from google.cloud import aiplatform
from datetime import datetime
from typing import NamedTuple
import kfp # used for dsl.pipeline
import kfp.v2.dsl as dsl # used for dsl.component, dsl.Output, dsl.Input, dsl.Artifact, dsl.Model, ...
from google_cloud_pipeline_components import aiplatform as gcc_aip

from google.cloud import bigquery
from google.protobuf import json_format
from google.protobuf.struct_pb2 import Value
import json
import numpy as np

Clients:

In [3]:
aiplatform.init(project=PROJECT_ID, location=REGION)
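# Note: this rebinds the name `bigquery` from the module to a client instance;
# later cells call `bigquery.query(...)` on this client.
bigquery = bigquery.Client()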

Parameters:

In [4]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
BUCKET = PROJECT_ID
URI = f"gs://{BUCKET}/{DATANAME}/models/{NOTEBOOK}"
DIR = f"temp/{NOTEBOOK}"

In [5]:
# Give the service account the roles/storage.objectAdmin role
# Console > IAM > select account <projectnumber>-compute@developer.gserviceaccount.com > Edit > add role
SERVICE_ACCOUNT = !gcloud config list --format='value(core.account)' 
SERVICE_ACCOUNT = SERVICE_ACCOUNT[0]
SERVICE_ACCOUNT

'716133108361-compute@developer.gserviceaccount.com'

Environment:

In [6]:
!rm -rf {DIR}
!mkdir -p {DIR}

---
## Custom Components (KFP)
Vertex AI Pipelines are made up of components that run independently, with inputs and outputs that connect them into a graph: the pipeline. In this notebook, custom components orchestrate training a challenger model, evaluating the challenger and the existing model, comparing the two on model metrics, and, if the challenger is better, replacing the model already deployed on the existing endpoint. These custom components are written as Python functions.
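
Because each component is an ordinary Python function, its logic can be prototyped and tested locally before it is wrapped for the pipeline. The following is a minimal sketch of a comparison step. The function name `compare_models`, its parameters, and the output fields `replace` and `best_metric` are hypothetical, and the KFP component decorator is left off so the snippet runs without KFP installed.

```python
from typing import NamedTuple


def compare_models(
    base_metric: float,
    challenger_metric: float,
) -> NamedTuple("outputs", [("replace", bool), ("best_metric", float)]):
    """Decide whether the challenger should replace the deployed model.

    Higher metric values are assumed to be better (for example auPRC).
    On a tie the deployed model is kept, which avoids a needless redeploy.
    """
    # Imports live inside the function so it stays self-contained,
    # which is how lightweight function-based components are usually written.
    from collections import namedtuple

    outputs = namedtuple("outputs", ["replace", "best_metric"])
    replace = challenger_metric > base_metric
    best_metric = challenger_metric if replace else base_metric
    return outputs(replace, best_metric)


result = compare_models(base_metric=0.90, challenger_metric=0.95)
print(result.replace, result.best_metric)
```

In a pipeline, the `replace` output could feed a condition that guards the export and deploy steps, so they run only when the challenger wins.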

Model Metrics
* Get Predictions for Test data from BigQuery Model
* Calculate average precision for the precision-recall curve
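
To show what the average precision summary measures, here is a small pure-Python sketch of the summation AP = Σ (Rₙ − Rₙ₋₁) · Pₙ over score thresholds, which is the definition documented for scikit-learn's `average_precision_score`. The function name `average_precision` and the toy inputs are illustrative assumptions; the notebook itself uses the scikit-learn function.

```python
from typing import Sequence


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: sum over thresholds of (recall step) * precision."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("y_true must contain at least one positive label")

    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)

    ap = 0.0
    tp = 0
    fp = 0
    prev_recall = 0.0
    i = 0
    while i < len(ranked):
        # Consume every example that shares this score so ties form one threshold.
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Toy example: two negatives and two positives.
print(average_precision([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Only thresholds where recall increases add to the total, so a model that ranks every positive above every negative scores 1.0.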

In [18]:
var_target = VAR_TARGET
project = PROJECT_ID
dataname = DATANAME
model= 'taxi_lr_v4'

In [19]:
query = f"""
    SELECT {var_target}, predicted_{var_target}, prob, splits 
    FROM ML.PREDICT (MODEL `{project}.{dataname}.{model}`,(
        SELECT *
        FROM `{project}.{dataname}.{dataname}_prepped`
        WHERE splits = 'TEST')
      ), UNNEST(predicted_{var_target}_probs)
    WHERE label='YES'
    """
pred = bigquery.query(query = query).to_dataframe()
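
The same query has to run for both the baseline and the challenger model, so it can help to build it in a function. In the query, `predicted_<target>_probs` is an array of `(label, prob)` structs; `UNNEST` expands it to one row per class, and the `label='YES'` filter keeps only the positive-class probability. The helper name `build_predict_query` below is hypothetical; this is a sketch of the pattern, not the notebook's component code.

```python
def build_predict_query(project: str, dataname: str, model: str, var_target: str) -> str:
    """Return the ML.PREDICT query for the TEST split of the prepped table."""
    return f"""
    SELECT {var_target}, predicted_{var_target}, prob, splits
    FROM ML.PREDICT (MODEL `{project}.{dataname}.{model}`,(
        SELECT *
        FROM `{project}.{dataname}.{dataname}_prepped`
        WHERE splits = 'TEST')
      ), UNNEST(predicted_{var_target}_probs)
    WHERE label='YES'
    """


# The project ID here is a placeholder, not a real project.
query_text = build_predict_query("my-project", "taxi", "taxi_lr_v4", "tips_label")
print(query_text)
```

The returned string can be passed to the BigQuery client's `query` method exactly as in the cell above.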


In [20]:
pred

Unnamed: 0,tips_label,predicted_tips_label,prob,splits
0,NO,NO,0.002929,TEST
1,NO,NO,0.002663,TEST
2,NO,NO,0.002872,TEST
3,NO,NO,0.002658,TEST
4,NO,NO,0.002665,TEST
...,...,...,...,...
744053,NO,NO,0.008251,TEST
744054,YES,YES,0.970322,TEST
744055,NO,NO,0.017442,TEST
744056,NO,NO,0.008396,TEST


In [28]:
pred['tips_label_num'] = np.where(
    pred['tips_label'] == 'YES', 1, 0)
pred

Unnamed: 0,tips_label,predicted_tips_label,prob,splits,tips_label_num
0,NO,NO,0.002929,TEST,0
1,NO,NO,0.002663,TEST,0
2,NO,NO,0.002872,TEST,0
3,NO,NO,0.002658,TEST,0
4,NO,NO,0.002665,TEST,0
...,...,...,...,...,...
744053,NO,NO,0.008251,TEST,0
744054,YES,YES,0.970322,TEST,1
744055,NO,NO,0.017442,TEST,0
744056,NO,NO,0.008396,TEST,0


In [31]:
pred['tips_label_num']

0         0
1         0
2         0
3         0
4         0
         ..
744053    0
744054    1
744055    0
744056    0
744057    1
Name: tips_label_num, Length: 744058, dtype: int64

In [39]:
pred[f'{var_target}_num']

0         0
1         0
2         0
3         0
4         0
         ..
744053    0
744054    1
744055    0
744056    0
744057    1
Name: tips_label_num, Length: 744058, dtype: int64

In [24]:
pred['prob']

0         0.002929
1         0.002663
2         0.002872
3         0.002658
4         0.002665
            ...   
744053    0.008251
744054    0.970322
744055    0.017442
744056    0.008396
744057    0.970987
Name: prob, Length: 744058, dtype: float64

In [40]:
from collections import namedtuple
from sklearn.metrics import average_precision_score, confusion_matrix
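# Area under the precision-recall curve, summarized as average precision.
# For a binary target the `average` setting does not change the result.
auPRC = average_precision_score(pred[f'{var_target}_num'], pred['prob'], average='micro')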

In [41]:
auPRC

0.9774253846153457

In [46]:
# `log_metric` is a method of the KFP `Metrics` output artifact, not of `sklearn.metrics`,
# so it is only available inside a pipeline component. In the notebook, display the value instead.
print(f'auPRC: {auPRC}')
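
Inside a component, the metric would be logged through the `Metrics` artifact that KFP passes in as an output, which offers a `log_metric(name, value)` method. The sketch below shows that call pattern with a local stand-in so it runs without KFP. The class `LocalMetrics` and the function `record_auprc` are hypothetical names introduced for illustration only.

```python
class LocalMetrics:
    """Hypothetical local stand-in exposing a log_metric(name, value) method."""

    def __init__(self):
        self.metadata = {}

    def log_metric(self, metric: str, value: float) -> None:
        # Store the metric by name so it can be inspected after the call.
        self.metadata[metric] = value


def record_auprc(metrics, auprc: float) -> None:
    """Log auPRC on any object that offers log_metric(name, value)."""
    metrics.log_metric("auPRC", float(auprc))


local = LocalMetrics()
record_auprc(local, 0.9774253846153457)
print(local.metadata)
```

Keeping the logging call behind a small function like this means the same body can be exercised in the notebook and later reused inside a component that receives the real artifact.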