# Scikit-Learn Workflow

A template workflow for training and serving [scikit-learn](https://scikit-learn.org/stable/index.html) models in Vertex AI.

**Prerequisites:**
-  [01 - BigQuery - Table Data Source](../01%20-%20Data%20Sources/01%20-%20BigQuery%20-%20Table%20Data%20Source.ipynb)

---
## Colab Setup

To run this notebook in Colab, run the cells in this section. Otherwise, skip this section.

The cells below set the project ID and authenticate to GCP (follow the prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [5]:
try:
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
    print('Colab authorized to GCP')
except Exception:
    print('Not a Colab Environment')
    pass

Not a Colab Environment


---
## Installs

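The list `packages` contains tuples of package import names and install names, with an optional minimum version. If the import name is not found, the install name is used to install the package quietly for the current user. If a minimum version is given and the installed version is older, the package is upgraded the same way.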

In [15]:
# tuples of (import name, install name, min_version)
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform'),
    ('google.cloud.bigquery', 'google-cloud-bigquery'),
    ('bigframes', 'bigframes')
]

import importlib.util
import importlib.metadata
install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # importlib.metadata looks up versions by distribution (install) name, not import name
        if importlib.metadata.version(package[1]) < package[2]:
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user
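
The minimum-version check above compares version strings, and string comparison is lexicographic: `'1.9.0' < '1.10.0'` evaluates to `False`. A minimal sketch of a numeric comparison follows, assuming plain dotted versions with an optional suffix such as `rc1`. The helper names `parse_version` and `needs_update` are hypothetical and not part of this notebook.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version such as '1.10.0' into a tuple of ints, (1, 10, 0).

    Assumption: only the leading digits of each dotted piece matter, so a
    suffix like 'rc1' is ignored ('2.0.0rc1' becomes (2, 0, 0)).
    """
    numbers = []
    for piece in version.split("."):
        digits = ""
        for char in piece:
            if char not in "0123456789":
                break
            digits += char
        if not digits:
            break
        numbers.append(int(digits))
    return tuple(numbers)


def needs_update(installed: str, minimum: str) -> bool:
    """Return True when the installed version is older than the minimum."""
    return parse_version(installed) < parse_version(minimum)


print("1.9.0" < "1.10.0")               # False: compared character by character
print(needs_update("1.9.0", "1.10.0"))  # True: compared number by number
```

If the `packaging` library is available in the environment, its version parser is a more complete choice than this hand-rolled sketch.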

## API Enablement

In [16]:
!gcloud services enable aiplatform.googleapis.com

### Restart Kernel (If Installs Occurred)

After the kernel restarts, continue running from the cell that follows this one.

In [17]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

Inputs

In [1]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [51]:
REGION = 'us-central1'
EXPERIMENT = 'sklearn-workflow'
SERIES = 'dev'

# gcs bucket
GCS_BUCKET = PROJECT_ID

# source data
BQ_PROJECT = PROJECT_ID
BQ_DATASET = 'fraud'
BQ_TABLE = 'fraud_prepped'

Packages

In [72]:
import os
import sklearn.ensemble
import pickle
import importlib
from datetime import datetime
from google.cloud import aiplatform
from google.cloud import bigquery
import bigframes.pandas as bpd

Clients

In [4]:
# vertex ai clients
aiplatform.init(project = PROJECT_ID, location = REGION)

# bigquery clients
bq = bigquery.Client(project = PROJECT_ID)
bpd.options.bigquery.project = PROJECT_ID

Parameters

In [49]:
DIR = f"temp/{EXPERIMENT}"
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

In [122]:
RUN_NAME = f'run-{TIMESTAMP}'
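
`TIMESTAMP` and `RUN_NAME` are reused later for the job display name, the model artifact path, and the model version alias. A small sketch of how the fixed-width format behaves, using a hard-coded example time instead of `datetime.now()`. The helper name `make_run_name` is hypothetical.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y%m%d%H%M%S"


def make_run_name(moment: datetime) -> str:
    """Build a run name like 'run-20240225203750' from a datetime."""
    return f"run-{moment.strftime(TIMESTAMP_FORMAT)}"


example = datetime(2024, 2, 25, 20, 37, 50)
run_name = make_run_name(example)
print(run_name)  # run-20240225203750

# The timestamp can be recovered from the run name
recovered = datetime.strptime(run_name[len("run-"):], TIMESTAMP_FORMAT)
print(recovered == example)  # True

# Fixed-width, zero-padded timestamps sort chronologically as plain strings
later = make_run_name(datetime(2024, 2, 26, 9, 0, 0))
print(sorted([later, run_name]) == [run_name, later])  # True
```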

In [6]:
SERVICE_ACCOUNT = !gcloud config list --format='value(core.account)' 
SERVICE_ACCOUNT = SERVICE_ACCOUNT[0]
SERVICE_ACCOUNT

'1026793852137-compute@developer.gserviceaccount.com'

Environment

In [7]:
if not os.path.exists(DIR):
    os.makedirs(DIR)

---
## Data Source

In [13]:
data = bq.query(f'SELECT * FROM {BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}').to_dataframe()

In [14]:
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V23,V24,V25,V26,V27,V28,Amount,Class,transaction_id,splits
0,35337,1.092844,-0.01323,1.359829,2.731537,-0.707357,0.873837,-0.79613,0.437707,0.39677,...,-0.167647,0.027557,0.592115,0.219695,0.03697,0.010984,0.0,0,a1b10547-d270-48c0-b902-7a0f735dadc7,TEST
1,60481,1.238973,0.035226,0.063003,0.641406,-0.260893,-0.580097,0.049938,-0.034733,0.405932,...,-0.057718,0.104983,0.537987,0.589563,-0.046207,-0.006212,0.0,0,814c62c8-ade4-47d5-bf83-313b0aafdee5,TEST
2,139587,1.870539,0.211079,0.224457,3.889486,-0.380177,0.249799,-0.577133,0.179189,-0.120462,...,0.180776,-0.060226,-0.228979,0.080827,0.009868,-0.036997,0.0,0,d08a1bfa-85c5-4f1b-9537-1c5a93e6afd0,TEST
3,162908,-3.368339,-1.980442,0.153645,-0.159795,3.847169,-3.516873,-1.209398,-0.292122,0.760543,...,-1.171627,0.214333,-0.159652,-0.060883,1.294977,0.120503,0.0,0,802f3307-8e5a-4475-b795-5d5d8d7d0120,TEST
4,165236,2.180149,0.218732,-2.637726,0.348776,1.063546,-1.249197,0.942021,-0.547652,-0.087823,...,-0.176957,0.563779,0.730183,0.707494,-0.131066,-0.090428,0.0,0,c8a5b93a-1598-4689-80be-4f9f5df0b8ce,TEST


---
## Model Training: Local

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html

In [92]:
train_x = data.loc[data['splits']=='TRAIN', ~data.columns.isin(['transaction_id', 'splits'])]
train_y = train_x.pop('Class').astype('int')

In [93]:
classifier = sklearn.ensemble.HistGradientBoostingClassifier().fit(train_x, train_y)

In [94]:
classifier.score(train_x, train_y)

0.9984083205808972

In [95]:
classifier.predict(train_x[0:5]), train_y[0:5].values

(array([0, 0, 0, 0, 0]), array([0, 0, 0, 0, 0]))

In [96]:
classifier.predict_proba(train_x[0:5])

array([[0.99854863, 0.00145137],
       [0.99854863, 0.00145137],
       [0.99854863, 0.00145137],
       [0.99854863, 0.00145137],
       [0.99854863, 0.00145137]])

In [114]:
classifier.classes_

array([0, 1])

In [155]:
with open(f'{DIR}/model.pkl','wb') as f:
    pickle.dump(classifier, f)

In [156]:
with open(f'{DIR}/model.pkl','rb') as f:
    classifier_import = pickle.load(f)

In [160]:
classifier_import.predict_proba(train_x[0:5])

array([[0.99854863, 0.00145137],
       [0.99854863, 0.00145137],
       [0.99854863, 0.00145137],
       [0.99854863, 0.00145137],
       [0.99854863, 0.00145137]])

---
## Model Training: Vertex AI Training Custom Job
- https://cloud.google.com/vertex-ai/docs/training/create-custom-job#create_custom_job-python_vertex_ai_sdk
- https://cloud.google.com/vertex-ai/docs/training/exporting-model-artifacts#scikit-learn

In [104]:
%%writefile {DIR}/train.py
# imports
from google.cloud import bigquery
import sklearn.ensemble
import argparse
import pickle
import os
import logging

# setup logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

# import argument to local variables
parser = argparse.ArgumentParser()
parser.add_argument('--project_id', dest = 'PROJECT_ID', type=str)
parser.add_argument('--bq_project', dest = 'BQ_PROJECT', type=str)
parser.add_argument('--bq_dataset', dest = 'BQ_DATASET', type=str)
parser.add_argument('--bq_table', dest = 'BQ_TABLE', type=str)
args = parser.parse_args()
logging.info('Finished parsing input parameters.')

# bigquery client
bq = bigquery.Client(project = args.PROJECT_ID)

# download data
data = bq.query(f'SELECT * FROM {args.BQ_PROJECT}.{args.BQ_DATASET}.{args.BQ_TABLE}').to_dataframe()
logging.info('Read data from BQ.')

# prepare training data
train_x = data.loc[data['splits']=='TRAIN', ~data.columns.isin(['transaction_id', 'splits'])]
train_y = train_x.pop('Class').astype('int')
logging.info('Prepared training data.')

# fit model
classifier = sklearn.ensemble.HistGradientBoostingClassifier().fit(train_x, train_y)
logging.info('Model training complete.')

# Use the predefined AIP_MODEL_DIR environment variable (a gs:// URI) to build the matching Cloud Storage FUSE path under /gcs/
storage_path = os.path.join('/gcs', os.environ['AIP_MODEL_DIR'][5:], 'model.pkl')
os.makedirs(os.path.dirname(storage_path), exist_ok=True)

# output the model save files directly to GCS destination
with open(storage_path,'wb') as f:
    pickle.dump(classifier, f)
logging.info('Model saved to GCS.')

Overwriting temp/sklearn-workflow/train.py
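
The training script turns the `gs://` URI in `AIP_MODEL_DIR` into a local path under `/gcs/` by dropping the first five characters. A sketch of that conversion as a standalone function follows. The helper name `gcs_uri_to_fuse_path` and the example bucket path are hypothetical; the function also checks the prefix and tolerates a missing trailing slash.

```python
import posixpath


def gcs_uri_to_fuse_path(gcs_uri: str, filename: str) -> str:
    """Map 'gs://bucket/path/' plus a filename to '/gcs/bucket/path/filename'."""
    prefix = "gs://"
    if not gcs_uri.startswith(prefix):
        raise ValueError(f"expected a gs:// URI, got: {gcs_uri!r}")
    # posixpath.join inserts a separator only where one is missing
    return posixpath.join("/gcs", gcs_uri[len(prefix):], filename)


# Hypothetical value of AIP_MODEL_DIR, with and without a trailing slash
print(gcs_uri_to_fuse_path("gs://example-bucket/dev/models/run1/model/", "model.pkl"))
print(gcs_uri_to_fuse_path("gs://example-bucket/dev/models/run1/model", "model.pkl"))
# both print /gcs/example-bucket/dev/models/run1/model/model.pkl
```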


https://cloud.google.com/vertex-ai/docs/training/pre-built-containers#scikit-learn

In [105]:
CMDARGS = [
    "--project_id=" + PROJECT_ID,
    "--bq_project=" + BQ_PROJECT,
    "--bq_dataset=" + BQ_DATASET,
    "--bq_table=" + BQ_TABLE
]

TRAIN_IMAGE = 'us-docker.pkg.dev/vertex-ai/training/sklearn-cpu.1-0:latest'
TRAIN_COMPUTE = 'n1-standard-4'
URI = f"gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}"
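
The entries in `CMDARGS` are passed to `train.py` as command line arguments and parsed there with `argparse`. The sketch below rebuilds the same parser and feeds it a list directly, which is a quick way to check the flags locally before submitting a job. The project value `example-project` is a placeholder, and `build_parser` is a hypothetical helper name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Recreate the four string arguments that train.py defines."""
    parser = argparse.ArgumentParser()
    for flag in ("project_id", "bq_project", "bq_dataset", "bq_table"):
        # dest is the upper-case name, matching args.PROJECT_ID etc. in train.py
        parser.add_argument(f"--{flag}", dest=flag.upper(), type=str)
    return parser


cmdargs = [
    "--project_id=" + "example-project",
    "--bq_project=" + "example-project",
    "--bq_dataset=" + "fraud",
    "--bq_table=" + "fraud_prepped",
]

# Passing the list explicitly avoids reading sys.argv
args = build_parser().parse_args(cmdargs)
print(f"{args.BQ_PROJECT}.{args.BQ_DATASET}.{args.BQ_TABLE}")
# example-project.fraud.fraud_prepped
```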

In [106]:
customJob = aiplatform.CustomJob.from_local_script(
    display_name = f'{SERIES}_{EXPERIMENT}_{TIMESTAMP}',
    script_path = f'{DIR}/train.py',
    container_uri = TRAIN_IMAGE,
    args = CMDARGS,
    requirements = ['db-dtypes', 'google-cloud-bigquery'],
    replica_count = 1,
    machine_type = TRAIN_COMPUTE,
    accelerator_count = 0,
    base_output_dir = f"{URI}/models/{TIMESTAMP}",
    staging_bucket = f"{URI}/models/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}'}
)

Training script copied to:
gs://statmike-mlops-349915/dev/sklearn-workflow/models/20240225203750/aiplatform-2024-02-25-21:56:30.388-aiplatform_custom_trainer_script-0.1.tar.gz.


In [107]:
customJob.run(
    service_account = SERVICE_ACCOUNT
)

Creating CustomJob
CustomJob created. Resource name: projects/1026793852137/locations/us-central1/customJobs/1880269818137935872
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/1026793852137/locations/us-central1/customJobs/1880269818137935872')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/1880269818137935872?project=1026793852137
CustomJob projects/1026793852137/locations/us-central1/customJobs/1880269818137935872 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/1026793852137/locations/us-central1/customJobs

In [108]:
customJob.display_name

'dev_sklearn-workflow_20240225203750'

In [109]:
customJob.resource_name

'projects/1026793852137/locations/us-central1/customJobs/1880269818137935872'

Create a hyperlink to the job in the console:

In [110]:
job_link = f"https://console.cloud.google.com/vertex-ai/locations/{REGION}/training/{customJob.resource_name.split('/')[-1]}/cpu?cloudshell=false&project={PROJECT_ID}"
print(f'Review the Custom Job here:\n{job_link}')

Review the Custom Job here:
https://console.cloud.google.com/vertex-ai/locations/us-central1/training/1880269818137935872/cpu?cloudshell=false&project=statmike-mlops-349915
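
The link above takes the job ID from the last segment of the resource name. A sketch of the same step with a basic check on the resource name layout, assuming the `projects/.../locations/.../customJobs/...` pattern seen in the output above. The function names are hypothetical.

```python
def custom_job_id(resource_name: str) -> str:
    """Return the job ID from 'projects/P/locations/L/customJobs/ID'."""
    parts = resource_name.split("/")
    layout_ok = (
        len(parts) == 6
        and parts[0] == "projects"
        and parts[2] == "locations"
        and parts[4] == "customJobs"
    )
    if not layout_ok:
        raise ValueError(f"unexpected resource name: {resource_name!r}")
    return parts[5]


def custom_job_link(resource_name: str, region: str, project_id: str) -> str:
    """Build the console URL in the same shape as the cell above."""
    job_id = custom_job_id(resource_name)
    return (
        f"https://console.cloud.google.com/vertex-ai/locations/{region}"
        f"/training/{job_id}/cpu?cloudshell=false&project={project_id}"
    )


name = "projects/1026793852137/locations/us-central1/customJobs/1880269818137935872"
print(custom_job_id(name))  # 1880269818137935872
print(custom_job_link(name, "us-central1", "statmike-mlops-349915"))
```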


---
## Register Model: Vertex AI Model Registry

- https://cloud.google.com/vertex-ai/docs/predictions/pre-built-containers
- https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Model#google_cloud_aiplatform_Model_training_job

In [163]:
DEPLOY_IMAGE = 'us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest'
DEPLOY_COMPUTE = 'n1-standard-4'

In [164]:
upload_model = True
try:
    model = aiplatform.Model(
        project = PROJECT_ID,
        location = REGION,
        model_name = f'model_{SERIES}_{EXPERIMENT}'
    )
    print('Model already in registry')
    if RUN_NAME in model.version_aliases:
        upload_model = False
        print("This version already loaded, no action taken.")
    else:
        print('Loading model as new default version.')
        parent_model = model.resource_name
except Exception:
    print('This is a new model, creating in model registry')
    parent_model = ''

if upload_model:
    print('Uploading Model now...')
    model = aiplatform.Model.upload(
        display_name = f'{SERIES}_{EXPERIMENT}',
        model_id = f'model_{SERIES}_{EXPERIMENT}',
        parent_model =  parent_model,
        serving_container_image_uri = DEPLOY_IMAGE,
        artifact_uri = f"{URI}/models/{TIMESTAMP}/model",
        is_default_version = True,
        version_aliases = [RUN_NAME],
        version_description = RUN_NAME
    )

This is a new model, creating in model registry
Uploading Model now...
Creating Model
Create Model backing LRO: projects/1026793852137/locations/us-central1/models/model_dev_sklearn-workflow/operations/489564183696769024
Model created. Resource name: projects/1026793852137/locations/us-central1/models/model_dev_sklearn-workflow@1
To use this Model in another session:
model = aiplatform.Model('projects/1026793852137/locations/us-central1/models/model_dev_sklearn-workflow@1')


In [165]:
model.versioned_resource_name

'projects/1026793852137/locations/us-central1/models/model_dev_sklearn-workflow@1'
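
The versioned resource name ends with the model ID followed by `@` and the version. A sketch for splitting the two apart, for example when logging which version was registered. The function name `split_versioned_name` is hypothetical.

```python
def split_versioned_name(versioned_resource_name: str):
    """Return (model_id, version) from '.../models/MODEL_ID@VERSION'.

    The version is None when the name carries no '@' suffix.
    """
    last_segment = versioned_resource_name.rsplit("/", 1)[-1]
    model_id, separator, version = last_segment.partition("@")
    return model_id, (version if separator else None)


name = "projects/1026793852137/locations/us-central1/models/model_dev_sklearn-workflow@1"
print(split_versioned_name(name))
# ('model_dev_sklearn-workflow', '1')
```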

---
## Model Serving: Online with Vertex AI Prediction Endpoints

- https://cloud.google.com/vertex-ai/docs/general/deployment
- https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Endpoint

### Create/Retrieve Endpoint

In [166]:
endpoints = aiplatform.Endpoint.list(filter = f"display_name={SERIES}")
if endpoints:
    endpoint = endpoints[0]
    print(f'Endpoint Exists: {endpoint.resource_name}')
else:
    endpoint = aiplatform.Endpoint.create(
        display_name = SERIES
    )
    print(f'Endpoint Created: {endpoint.resource_name}')
    
print(f'Review the Endpoint in the Console:\nhttps://console.cloud.google.com/vertex-ai/locations/{REGION}/endpoints/{endpoint.name}?project={PROJECT_ID}')

Endpoint Exists: projects/1026793852137/locations/us-central1/endpoints/8609714806183690240
Review the Endpoint in the Console:
https://console.cloud.google.com/vertex-ai/locations/us-central1/endpoints/8609714806183690240?project=statmike-mlops-349915


In [167]:
endpoint.display_name

'dev'

In [168]:
endpoint.traffic_split

{}

In [169]:
deployed_models = endpoint.list_models()
#deployed_models

### Deploy Model To Endpoint

In [None]:
#if upload_model and 
endpoint.deploy(
    model = model,
    deployed_model_display_name = model.display_name,
    traffic_percentage = 100,
    machine_type = DEPLOY_COMPUTE,
    min_replica_count = 1,  
    max_replica_count = 1
)
    

Deploying Model projects/1026793852137/locations/us-central1/models/model_dev_sklearn-workflow to Endpoint : projects/1026793852137/locations/us-central1/endpoints/8609714806183690240
Deploy Endpoint model backing LRO: projects/1026793852137/locations/us-central1/endpoints/8609714806183690240/operations/240740304284549120


In [None]:
for deployed_model in endpoint.list_models():
    if deployed_model.id in endpoint.traffic_split:
        print(f"Model {deployed_model.display_name} with version {deployed_model.model_version_id} has traffic = {endpoint.traffic_split[deployed_model.id]}")
    else:
        endpoint.undeploy(deployed_model_id = deployed_model.id)
        print(f"Undeploying {deployed_model.display_name} with version {deployed_model.model_version_id} because it has no traffic.")
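
The loop above keeps deployed models that appear in the endpoint's traffic split and undeploys the rest. The selection logic can be sketched offline with plain Python values. The IDs below are made up, and `models_to_undeploy` is a hypothetical helper. Note one small extension: this sketch also treats an explicit share of 0 as no traffic, whereas the cell above only checks membership.

```python
def models_to_undeploy(deployed_model_ids, traffic_split):
    """Return the deployed model IDs that receive no traffic.

    deployed_model_ids: list of ID strings
    traffic_split: dict mapping deployed model ID to a traffic percentage
    """
    return [
        model_id
        for model_id in deployed_model_ids
        if traffic_split.get(model_id, 0) == 0
    ]


deployed = ["111", "222", "333"]    # hypothetical deployed model IDs
split = {"222": 100, "333": 0}      # hypothetical traffic split
print(models_to_undeploy(deployed, split))  # ['111', '333']
```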

In [None]:
endpoint.traffic_split

In [None]:
endpoint.list_models()

### Get Predictions
- https://cloud.google.com/vertex-ai/docs/predictions/get-online-predictions

test_x = data.loc[data['splits']=='TEST', ~data.columns.isin(['transaction_id', 'splits'])]
test_y = test_x.pop('Class').astype('int')

instances = [{k:[v] for k,v in instance.items()} for instance in test_x.to_dict(orient = 'records')]
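
As an assumption to verify against the online prediction documentation linked above: the prebuilt scikit-learn serving container is generally described as accepting each instance as a flat list of feature values in the column order used for training. The sketch below converts records (as produced by `to_dict(orient = 'records')`) into that shape and splits them into small batches. The function names and the tiny example records are hypothetical.

```python
def records_to_instances(records, feature_order):
    """Turn a list of {column: value} dicts into lists of values in a fixed order."""
    return [[record[name] for name in feature_order] for record in records]


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical records with only three feature columns for brevity
records = [
    {"Time": 35337, "V1": 1.09, "Amount": 0.0},
    {"Time": 60481, "V1": 1.24, "Amount": 0.0},
    {"Time": 139587, "V1": 1.87, "Amount": 0.0},
]
feature_order = ["Time", "V1", "Amount"]

instances = records_to_instances(records, feature_order)
print(instances[0])                             # [35337, 1.09, 0.0]
print([len(b) for b in batches(instances, 2)])  # [2, 1]
```

Each batch could then be sent with `endpoint.predict(instances = batch)`, keeping request sizes small.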

---
## Model Serving: Customize Online Serving With Vertex AI Prediction Endpoints

### Build Custom Prediction Routine

### Run Custom Prediction Routine: Local

### Save Image To Artifact Registry

### Register Model

### Deploy Model To Endpoint

### Get Predictions

---
## Model Serving: Batch With BigQuery ML

### Convert Model To ONNX

### Import Model With BigQuery ML

### Get Predictions

---
## Model Evaluations: With SDK

https://cloud.google.com/vertex-ai/docs/evaluation/introduction#tabular

---
## Model Evaluations: With Pipeline Components