<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/03%20-%20BigQuery%20ML%20%28BQML%29/BQML%20Import%20Model%20-%20scikit-learn.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A//raw.githubusercontent.com/statmike/vertex-ai-mlops/main/03%20-%20BigQuery%20ML%20%28BQML%29/BQML%20Import%20Model%20-%20scikit-learn.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/03%20-%20BigQuery%20ML%20%28BQML%29/BQML%20Import%20Model%20-%20scikit-learn.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https%3A//raw.githubusercontent.com/statmike/vertex-ai-mlops/main/03%20-%20BigQuery%20ML%20%28BQML%29/BQML%20Import%20Model%20-%20scikit-learn.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# BQML Import A Scikit-Learn Model Using ONNX

This notebook shows how to import a [scikit-learn](https://scikit-learn.org/stable/) model/pipeline into BigQuery ML for prediction directly inside BigQuery with the [ML.PREDICT()](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-predict) function.  This is accomplished by converting the `scikit-learn` model/pipeline to the [ONNX](https://onnx.ai/) format, an open standard for machine learning interoperability, and then importing it directly into BigQuery ML.

$$\textrm{scikit-learn} \Longrightarrow \textrm{ONNX} \Longrightarrow \textrm{BigQuery ML}$$

**BigQuery ML Inference Engine**

With BigQuery ML you can [import models trained outside of BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/inference-overview#inference_using_imported_models) in formats like [TensorFlow](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-tensorflow), [TensorFlow Lite](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-tflite), [XGBoost](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-xgboost), and [ONNX](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-onnx). This is part of the BigQuery ML [Inference Engine](https://cloud.google.com/bigquery/docs/reference/standard-sql/inference-overview#inference_using_imported_models), which provides methods for working with models trained or hosted outside of BigQuery ML through the same SQL API.  For a good overview, read the [blog post announcing the inference engine](https://cloud.google.com/blog/products/data-analytics/introducing-bigquery-ml-inference-engine) from March 2023.

**ONNX - Open Neural Network Exchange**

The [ONNX](https://onnx.ai/) format is an open standard for machine learning interoperability.  This makes models usable across many frameworks, tools, and runtimes.  
- [supported frameworks](https://onnx.ai/supported-tools.html#buildModel)
- and more with [onnxmltools](https://github.com/onnx/onnxmltools)

**Converting scikit-learn to ONNX**

While [onnxmltools](https://github.com/onnx/onnxmltools) has a wrapper for [skl2onnx](https://github.com/onnx/sklearn-onnx/), this notebook shows using the `skl2onnx` package directly.


**Inference for ONNX format with ONNXRuntime**

The [onnxruntime](https://onnxruntime.ai/) package can be used for prediction/inference with a model in the ONNX format.  It is demonstrated in this notebook but is not required for BigQuery ML, which provides fully managed execution of the model, so users do not need to configure a runtime.

**Prerequisites**

This notebook uses a scikit-learn model built previously and stored in GCS as a pickle file (.pkl).  It retrieves this model by looking for the results of any of the notebooks in the [04 - scikit-learn](../04%20-%20scikit-learn) series.  Each of the notebooks in that series (04a - 04i) results in a Vertex AI Prediction Endpoint, which this notebook finds and uses to identify a current model file.  It is also possible to edit the code and bypass this step by pointing directly to any saved model file on GCS.

**Resources**
- Tutorial [Make predictions with scikit-learn models in ONNX format](https://cloud.google.com/bigquery/docs/making-predictions-with-sklearn-models-in-onnx-format)


---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/03%20-%20BigQuery%20ML%20(BQML)/BQML%20Import%20Model%20-%20scikit-learn.ipynb) and run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [473]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Setup

installations:

In [42]:
try:
    import skl2onnx
except ImportError:
    !pip install --user skl2onnx onnxruntime  -U -q

inputs:

In [3]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [89]:
REGION = 'us-central1'
EXPERIMENT = 'import-onnx-sklearn'
SERIES = 'bqml'

# source data
BQ_PROJECT = PROJECT_ID
BQ_DATASET = 'fraud'
BQ_TABLE = 'fraud_prepped'

# Model Training
VAR_TARGET = 'Class'
VAR_OMIT = 'transaction_id' # add more variables to the string with space delimiters

packages:

In [44]:
from datetime import datetime

from google.cloud import aiplatform
from google.cloud import bigquery
from google.cloud import storage

import numpy as np
import sklearn
import pickle

import skl2onnx
import onnxruntime

clients:

In [6]:
aiplatform.init(project = PROJECT_ID, location = REGION)
bq = bigquery.Client(project = PROJECT_ID)
gcs = storage.Client()

---
## Get Data For Predictions

### Retrieve Records For Prediction

In [58]:
n = 10
pred = bq.query(
    query = f"""
        SELECT * EXCEPT({VAR_TARGET}, {VAR_OMIT}, splits)
        FROM {BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}
        WHERE splits='TEST'
        LIMIT {n}
        """
).to_dataframe()
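
The comment next to `VAR_OMIT` says more variables can be added with space delimiters, while `EXCEPT(...)` in the query expects a comma-separated column list. A minimal sketch of bridging the two, assuming a hypothetical helper named `except_clause` (not part of this notebook or any library):

```python
def except_clause(target, omit, extra=("splits",)):
    """Build a BigQuery `EXCEPT(...)` column list.

    `omit` is a space-delimited string of column names, following the
    convention noted next to VAR_OMIT. `except_clause` is a hypothetical
    helper used only for illustration.
    """
    columns = [target, *omit.split(), *extra]
    # Drop duplicates while preserving the original order.
    unique = list(dict.fromkeys(columns))
    return f"EXCEPT({', '.join(unique)})"


print(except_clause("Class", "transaction_id"))
```

The returned string could be placed after `SELECT *` in the f-string query in place of the hand-written `EXCEPT(...)` clause.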

In [60]:
pred.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,35337,1.092844,-0.01323,1.359829,2.731537,-0.707357,0.873837,-0.79613,0.437707,0.39677,...,-0.240428,0.037603,0.380026,-0.167647,0.027557,0.592115,0.219695,0.03697,0.010984,0.0
1,60481,1.238973,0.035226,0.063003,0.641406,-0.260893,-0.580097,0.049938,-0.034733,0.405932,...,-0.26508,-0.060003,-0.053585,-0.057718,0.104983,0.537987,0.589563,-0.046207,-0.006212,0.0
2,139587,1.870539,0.211079,0.224457,3.889486,-0.380177,0.249799,-0.577133,0.179189,-0.120462,...,-0.374356,0.196006,0.656552,0.180776,-0.060226,-0.228979,0.080827,0.009868,-0.036997,0.0
3,162908,-3.368339,-1.980442,0.153645,-0.159795,3.847169,-3.516873,-1.209398,-0.292122,0.760543,...,-0.923275,-0.545992,-0.252324,-1.171627,0.214333,-0.159652,-0.060883,1.294977,0.120503,0.0
4,165236,2.180149,0.218732,-2.637726,0.348776,1.063546,-1.249197,0.942021,-0.547652,-0.087823,...,-0.250653,0.234502,0.825237,-0.176957,0.563779,0.730183,0.707494,-0.131066,-0.090428,0.0


Shape as instances: dictionaries of key:value pairs for only the features used in the model

In [61]:
newobs = pred.to_dict(orient='records')
#newobs[0]

In [62]:
len(newobs)

10

In [63]:
newobs[0]

{'Time': 35337,
 'V1': 1.0928441854981998,
 'V2': -0.0132303486713432,
 'V3': 1.35982868199426,
 'V4': 2.7315370965921004,
 'V5': -0.707357349219652,
 'V6': 0.8738370029866129,
 'V7': -0.7961301510622031,
 'V8': 0.437706509544851,
 'V9': 0.39676985012996396,
 'V10': 0.587438102569443,
 'V11': -0.14979756231827498,
 'V12': 0.29514781622888103,
 'V13': -1.30382621882143,
 'V14': -0.31782283120234495,
 'V15': -2.03673231037199,
 'V16': 0.376090905274179,
 'V17': -0.30040350116459497,
 'V18': 0.433799615590844,
 'V19': -0.145082264348681,
 'V20': -0.240427548108996,
 'V21': 0.0376030733329398,
 'V22': 0.38002620963091405,
 'V23': -0.16764742731151097,
 'V24': 0.0275573495476881,
 'V25': 0.59211469704354,
 'V26': 0.219695164116351,
 'V27': 0.0369695108704894,
 'V28': 0.010984441006191,
 'Amount': 0.0}

---

## Get A Model For Predictions

This section retrieves a currently active model being used on a Vertex AI Prediction Endpoint.

>If you already know the location of your model files in a GCS bucket, then this section can be bypassed by storing the model location with: `model_uri = 'gs://bucket/path/to/files'`.

In [66]:
# Series 04 creates scikit-learn based models
PREVIOUS_SERIES = '04'

### Get Endpoint

Reference: [aiplatform.Endpoint](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Endpoint)

In [67]:
endpoints = aiplatform.Endpoint.list(filter = f"labels.series={PREVIOUS_SERIES}")
endpoint = endpoints[0]

In [68]:
print(f'Review the Endpoint in the Console:\nhttps://console.cloud.google.com/vertex-ai/locations/{REGION}/endpoints/{endpoint.name}?project={PROJECT_ID}')

Review the Endpoint in the Console:
https://console.cloud.google.com/vertex-ai/locations/us-central1/endpoints/5984848498170789888?project=statmike-mlops-349915


In [69]:
[list(newobs[0].values())]

[[35337,
  1.0928441854981998,
  -0.0132303486713432,
  1.35982868199426,
  2.7315370965921004,
  -0.707357349219652,
  0.8738370029866129,
  -0.7961301510622031,
  0.437706509544851,
  0.39676985012996396,
  0.587438102569443,
  -0.14979756231827498,
  0.29514781622888103,
  -1.30382621882143,
  -0.31782283120234495,
  -2.03673231037199,
  0.376090905274179,
  -0.30040350116459497,
  0.433799615590844,
  -0.145082264348681,
  -0.240427548108996,
  0.0376030733329398,
  0.38002620963091405,
  -0.16764742731151097,
  0.0275573495476881,
  0.59211469704354,
  0.219695164116351,
  0.0369695108704894,
  0.010984441006191,
  0.0]]

In [70]:
endpoint.predict(instances = [list(newobs[0].values())]).predictions

[0.0]

### Review The Model Information

Reference: [aiplatform.Model](https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Model)

In [71]:
vertex_model = aiplatform.Model(
    model_name = endpoint.list_models()[0].model + f'@{endpoint.list_models()[0].model_version_id}'
)

In [72]:
vertex_model.display_name

'04_04a'

In [73]:
vertex_model.version_id

'2'

In [74]:
vertex_model.name

'model_04_04a'

In [75]:
vertex_model.uri

'gs://statmike-mlops-349915/04/04a/models/20230430233126/model'

In [76]:
!gsutil ls {vertex_model.uri}

gs://statmike-mlops-349915/04/04a/models/20230430233126/model/
gs://statmike-mlops-349915/04/04a/models/20230430233126/model/model.pkl


In [77]:
bucket = gcs.bucket(PROJECT_ID)
for blob in bucket.list_blobs(prefix = vertex_model.uri.split(f'gs://{PROJECT_ID}/')[1]):
    print(blob.name)
    if blob.name.endswith('.pkl'): break  # stop at the first pickle file; `blob` is reused below

04/04a/models/20230430233126/model/
04/04a/models/20230430233126/model/model.pkl
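
The listing above derives the blob prefix by splitting `vertex_model.uri` on `gs://{PROJECT_ID}/`, which works only when the bucket is named after the project. As a sketch under that caveat, a hypothetical helper `split_gcs_uri` (an illustration, not part of the notebook) can separate bucket and prefix for any `gs://` URI using only the standard library:

```python
from urllib.parse import urlparse


def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/object' into (bucket_name, object_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a GCS URI: {uri!r}")
    # The parsed path keeps its leading slash; GCS blob names do not have one.
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, prefix = split_gcs_uri(
    "gs://statmike-mlops-349915/04/04a/models/20230430233126/model"
)
print(bucket_name)
print(prefix)
```

The two values could then be passed to `gcs.bucket(bucket_name)` and `list_blobs(prefix = prefix)`.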


In [97]:
print(f'Review the model in the Vertex AI Model Registry:\nhttps://console.cloud.google.com/vertex-ai/locations/{REGION}/models/{vertex_model.name}/versions/{vertex_model.version_id}/properties?project={PROJECT_ID}')

Review the model in the Vertex AI Model Registry:
https://console.cloud.google.com/vertex-ai/locations/us-central1/models/model_04_04a/versions/2/properties?project=statmike-mlops-349915


---
## Local Model: scikit-learn

Load the model to the notebook.

In [78]:
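# Unpickling can execute arbitrary code, so only load model files from a trusted source.
pickle_in = blob.download_as_bytes()  # replaces the deprecated download_as_string()
local_model = pickle.loads(pickle_in)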

https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations


Get local predictions using the model:

In [79]:
[list(newobs[0].values())]

[[35337,
  1.0928441854981998,
  -0.0132303486713432,
  1.35982868199426,
  2.7315370965921004,
  -0.707357349219652,
  0.8738370029866129,
  -0.7961301510622031,
  0.437706509544851,
  0.39676985012996396,
  0.587438102569443,
  -0.14979756231827498,
  0.29514781622888103,
  -1.30382621882143,
  -0.31782283120234495,
  -2.03673231037199,
  0.376090905274179,
  -0.30040350116459497,
  0.433799615590844,
  -0.145082264348681,
  -0.240427548108996,
  0.0376030733329398,
  0.38002620963091405,
  -0.16764742731151097,
  0.0275573495476881,
  0.59211469704354,
  0.219695164116351,
  0.0369695108704894,
  0.010984441006191,
  0.0]]

In [80]:
local_model.predict([list(newobs[0].values())])

array([0])

In [81]:
local_model.predict_proba([list(newobs[0].values())])

array([[0.99819833, 0.00180167]])

---
## Convert Model: to ONNX

- Using [sklearn-onnx](https://onnx.ai/sklearn-onnx/)
    - All the available data types in [the source](https://github.com/onnx/sklearn-onnx/blob/main/skl2onnx/common/data_types.py)
    - more on zipmap option [here](https://onnx.ai/sklearn-onnx/auto_tutorial/plot_dbegin_options_zipmap.html)

In [82]:
initial_types = []
for feature in pred.dtypes.apply(lambda x: x.name).to_dict().items():  # `pred` is the DataFrame retrieved above
    if feature[1] == 'Int64': tensor_type = skl2onnx.common.data_types.Int64TensorType([None, 1])
    elif feature[1] == 'float64': tensor_type = skl2onnx.common.data_types.FloatTensorType([None, 1])
    # more data types here as needed
    initial_types.append((feature[0], tensor_type))
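
In the loop above, a column whose dtype matches neither branch either reuses the tensor type left over from the previous column or, if it is the first column, raises a `NameError`. A minimal sketch of a stricter mapping, assuming hypothetical names `DTYPE_TO_TENSOR` and `tensor_type_names` and using only plain Python so it runs without `skl2onnx` installed:

```python
# Lookup from pandas dtype names to skl2onnx tensor type class names.
# The two entries mirror the branches of the loop above; extend as needed.
DTYPE_TO_TENSOR = {
    "Int64": "Int64TensorType",
    "float64": "FloatTensorType",
}


def tensor_type_names(dtypes):
    """Map {column: dtype_name} to [(column, tensor_type_class_name), ...].

    Raises TypeError for a dtype with no mapping instead of silently
    reusing the tensor type chosen for an earlier column.
    """
    specs = []
    for column, dtype_name in dtypes.items():
        if dtype_name not in DTYPE_TO_TENSOR:
            raise TypeError(f"no tensor type mapped for {column!r} ({dtype_name})")
        specs.append((column, DTYPE_TO_TENSOR[dtype_name]))
    return specs


specs = tensor_type_names({"Time": "Int64", "V1": "float64", "Amount": "float64"})
print(specs)
```

Each class name can then be resolved with `getattr(skl2onnx.common.data_types, name)([None, 1])` to build the `initial_types` list.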

In [83]:
onnx_model = skl2onnx.convert_sklearn(local_model, initial_types = initial_types, options = {id(local_model): {'zipmap': False}})

## Local Test of ONNX Model Predictions

- With [onnxruntime](https://onnxruntime.ai/)

In [84]:
local_onnx = onnxruntime.InferenceSession(onnx_model.SerializeToString())

In [85]:
test_ob = newobs[0].copy()
for v in test_ob:
    if isinstance(test_ob[v], int):
        test_ob[v] = np.array([[test_ob[v]]], dtype = np.int64)
    elif isinstance(test_ob[v], float):
        test_ob[v] = np.array([[test_ob[v]]], dtype = np.float32)
test_ob

{'Time': array([[35337]]),
 'V1': array([[1.0928441]], dtype=float32),
 'V2': array([[-0.01323035]], dtype=float32),
 'V3': array([[1.3598287]], dtype=float32),
 'V4': array([[2.731537]], dtype=float32),
 'V5': array([[-0.70735735]], dtype=float32),
 'V6': array([[0.873837]], dtype=float32),
 'V7': array([[-0.7961302]], dtype=float32),
 'V8': array([[0.4377065]], dtype=float32),
 'V9': array([[0.39676985]], dtype=float32),
 'V10': array([[0.5874381]], dtype=float32),
 'V11': array([[-0.14979756]], dtype=float32),
 'V12': array([[0.2951478]], dtype=float32),
 'V13': array([[-1.3038262]], dtype=float32),
 'V14': array([[-0.31782284]], dtype=float32),
 'V15': array([[-2.0367322]], dtype=float32),
 'V16': array([[0.3760909]], dtype=float32),
 'V17': array([[-0.3004035]], dtype=float32),
 'V18': array([[0.43379962]], dtype=float32),
 'V19': array([[-0.14508227]], dtype=float32),
 'V20': array([[-0.24042755]], dtype=float32),
 'V21': array([[0.03760307]], dtype=float32),
 'V22': array([[0.38

In [86]:
local_onnx.run(None, test_ob)

[array([0], dtype=int64), array([[0.9986229 , 0.00137711]], dtype=float32)]

In [87]:
local_onnx.run(None, test_ob)[0]

array([0], dtype=int64)
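
The ONNX probabilities (0.9986229, 0.00137711) are close to, but not identical to, the scikit-learn ones (0.99819833, 0.00180167), while the predicted class agrees. One likely contributor, offered here as an assumption rather than a diagnosis, is that the converted model receives float32 inputs while scikit-learn evaluated float64 values. The sketch below uses the standard library to show how much a single feature changes when rounded to float32; `as_float32` is a hypothetical helper name:

```python
import struct


def as_float32(value):
    """Round a Python float (float64) to the nearest float32 and back."""
    return struct.unpack("f", struct.pack("f", value))[0]


v1 = 1.0928441854981998  # 'V1' from the first record above
v1_f32 = as_float32(v1)
# The difference is tiny, but a value sitting near a decision threshold
# could still fall on the other side after rounding.
print(v1_f32, abs(v1 - v1_f32))
```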

---
## BigQuery ML Model Import

Reference: [The CREATE MODEL statement for importing ONNX models](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-onnx)

Save the ONNX model in GCS:

In [91]:
blob = bucket.blob(f'{SERIES}/{EXPERIMENT}/model.onnx')
blob.upload_from_string(onnx_model.SerializeToString())

Create BigQuery ML Model:

In [92]:
query = f"""
CREATE OR REPLACE MODEL `{BQ_PROJECT}.{BQ_DATASET}.{SERIES}-{EXPERIMENT}`
    OPTIONS(
        MODEL_TYPE = 'ONNX',
        MODEL_PATH = 'gs://{PROJECT_ID}/{SERIES}/{EXPERIMENT}/*'
    )
"""
print(query)


CREATE OR REPLACE MODEL `statmike-mlops-349915.fraud.bqml-import-onnx-sklearn`
    OPTIONS(
        MODEL_TYPE = 'ONNX',
        MODEL_PATH = 'gs://statmike-mlops-349915/bqml/import-onnx-sklearn/*'
    )



In [93]:
job = bq.query(query = query)
job.result()
(job.ended-job.started).total_seconds()

6.016

## Predictions with BigQuery ML: ML.PREDICT

In [94]:
query = f"""
SELECT *
FROM ML.PREDICT (MODEL `{BQ_PROJECT}.{BQ_DATASET}.{SERIES}-{EXPERIMENT}`,(
    SELECT * 
    FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`
    WHERE splits = 'TEST'
    LIMIT 1)
  )
"""
pred = bq.query(query = query).to_dataframe()

In [95]:
pred

Unnamed: 0,label,probabilities,Time,V1,V2,V3,V4,V5,V6,V7,...,V23,V24,V25,V26,V27,V28,Amount,Class,transaction_id,splits
0,0,"[0.9986228942871094, 0.001377105712890625]",35337,1.092844,-0.01323,1.359829,2.731537,-0.707357,0.873837,-0.79613,...,-0.167647,0.027557,0.592115,0.219695,0.03697,0.010984,0.0,0,a1b10547-d270-48c0-b902-7a0f735dadc7,TEST
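
The result adds a `label` column and a `probabilities` array to the input columns. As a small sketch, and assuming the array is ordered like the model's classes (`0`, then `1`), which is an assumption about this particular converted model, the label can be cross-checked against the largest probability. `predicted_class` is a hypothetical helper:

```python
def predicted_class(probabilities, classes=(0, 1)):
    """Return the class whose probability is largest.

    Assumes `probabilities` is ordered like `classes`; that ordering is an
    assumption for illustration, not something verified against BigQuery ML.
    """
    if len(probabilities) != len(classes):
        raise ValueError("probabilities and classes must have the same length")
    best = max(range(len(classes)), key=lambda i: probabilities[i])
    return classes[best]


# Values copied from the ML.PREDICT output row above.
row = {"label": 0, "probabilities": [0.9986228942871094, 0.001377105712890625]}
print(predicted_class(row["probabilities"]) == row["label"])
```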


In [96]:
print(query)


SELECT *
FROM ML.PREDICT (MODEL `statmike-mlops-349915.fraud.bqml-import-onnx-sklearn`,(
    SELECT * 
    FROM `statmike-mlops-349915.fraud.fraud_prepped`
    WHERE splits = 'TEST'
    LIMIT 1)
  )

