![ga4](https://www.google-analytics.com/collect?v=2&tid=G-6VDTYWLKX6&cid=1&en=page_view&sid=1&dl=statmike%2Fvertex-ai-mlops%2F03+-+BigQuery+ML+%28BQML%29&dt=BQML+Feature+Engineering+-+reuseable+modular.ipynb)

# BigQuery ML (BQML) Feature Engineering - Reusable and Modular

---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Vertex%20AI%20GenAI%20For%20Document%20Q&A%20v2%20-%20MLB%20Rules%20For%20Baseball.ipynb) and run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

## Installs (If Needed)

The clients packages may need installing in this environment. 

In [77]:
# tuples of (import name, install name)
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform'),
    ('google.cloud.documentai', 'google-cloud-bigquery'),
    ('google.cloud.storage', 'google-cloud-storage'),
    ('bigframes', 'bigframes'),
    ('pandas', 'pandas')
]

import importlib
install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user

### Restart Kernel (If Installs Occured)

After a kernel restart the code submission can start with the next cell after this one.

In [4]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

Inputs

In [5]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [22]:
REGION = 'us-central1'

# specify a GCS Bucket
GCS_BUCKET = PROJECT_ID

# BigQuery Source Table
BQ_SOURCE_TABLE = 'bigquery-public-data.ml_datasets.penguins'

# BigQuery Environment Parameters
BQ_REGION = REGION[0:2] # use multi-region
BQ_PROJECT = PROJECT_ID
BQ_DATASET = 'bqml'
BQ_TABLE_PREFIX = 'feature-engineering'

Packages

In [None]:
from google.cloud import bigquery
from google.cloud import aiplatform
from google.cloud import storage
import bigframes.pandas as bf
import bigframes.ml as bfml
import pandas as pd

# load BigQuery IPython Magics (for Jupyter Notebooks)
%load_ext google.cloud.bigquery

Clients

In [79]:
# bigquery client
bq = bigquery.Client(project = PROJECT_ID)

# vertex ai client
aiplatform.init(project = PROJECT_ID, location = REGION)

# gcs client
gcs = storage.Client(project = PROJECT_ID)

# setup BigFrames API
bf.reset_session()
bf.options.bigquery.project = BQ_PROJECT
bf.options.bigquery.location = BQ_REGION
bf_session = bf.get_global_session()

---
## BigQuery Source Data

The source table is a BigQuery Public Dataset table.  The following cell uses the BigQuery IPython magic to retrieve 5 rows of the table for review.  This data is known as [Palmer Penguins](https://allisonhorst.github.io/palmerpenguins/) data: 

```
@Manual{,
  title = {palmerpenguins: Palmer Archipelago (Antarctica) penguin data},
  author = {Allison Marie Horst and Alison Presmanes Hill and Kristen B Gorman},
  year = {2020},
  note = {R package version 0.1.0},
  doi = {10.5281/zenodo.3960218},
  url = {https://allisonhorst.github.io/palmerpenguins/},
}
```


There are 334 observations of 4 numerical features (culman length, culmen depth, flipper length, body mass) and 2  categorical features (island, sex) that represent 3 species of penguins.

In [9]:
%%bigquery
SELECT *
FROM `bigquery-public-data.ml_datasets.penguins`
LIMIT 5

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie Penguin (Pygoscelis adeliae),Dream,36.6,18.4,184.0,3475.0,FEMALE
1,Adelie Penguin (Pygoscelis adeliae),Dream,39.8,19.1,184.0,4650.0,MALE
2,Adelie Penguin (Pygoscelis adeliae),Dream,40.9,18.9,184.0,3900.0,MALE
3,Chinstrap penguin (Pygoscelis antarctica),Dream,46.5,17.9,192.0,3500.0,FEMALE
4,Adelie Penguin (Pygoscelis adeliae),Dream,37.3,16.8,192.0,3000.0,FEMALE


Review the mean values of each measurement within `species`:

In [10]:
%%bigquery
SELECT species, count(*) as count,
    AVG(culmen_length_mm) as mean_culmen_length,
    AVG(culmen_depth_mm) as mean_culment_depth,
    AVG(flipper_length_mm) as mean_flipper_length,
    AVG(body_mass_g) as mean_body_mass
FROM `bigquery-public-data.ml_datasets.penguins`
GROUP BY species

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,species,count,mean_culmen_length,mean_culment_depth,mean_flipper_length,mean_body_mass
0,Adelie Penguin (Pygoscelis adeliae),152,38.791391,18.346358,189.953642,3700.662252
1,Chinstrap penguin (Pygoscelis antarctica),68,48.833824,18.420588,195.823529,3733.088235
2,Gentoo penguin (Pygoscelis papua),124,47.504878,14.982114,217.186992,5076.01626


Review the mean values of each measurement within `species` and `island`:

In [11]:
%%bigquery
SELECT species, island, count(*) as count,
    AVG(culmen_length_mm) as mean_culmen_length,
    AVG(culmen_depth_mm) as mean_culment_depth,
    AVG(flipper_length_mm) as mean_flipper_length,
    AVG(body_mass_g) as mean_body_mass
FROM `bigquery-public-data.ml_datasets.penguins`
GROUP BY species, island

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,species,island,count,mean_culmen_length,mean_culment_depth,mean_flipper_length,mean_body_mass
0,Adelie Penguin (Pygoscelis adeliae),Dream,56,38.501786,18.251786,189.732143,3688.392857
1,Chinstrap penguin (Pygoscelis antarctica),Dream,68,48.833824,18.420588,195.823529,3733.088235
2,Gentoo penguin (Pygoscelis papua),Biscoe,124,47.504878,14.982114,217.186992,5076.01626
3,Adelie Penguin (Pygoscelis adeliae),Biscoe,44,38.975,18.370455,188.795455,3709.659091
4,Adelie Penguin (Pygoscelis adeliae),Torgersen,52,38.95098,18.429412,191.196078,3706.372549


Review the mean values of each measurment within `species` and `sex`:

In [12]:
%%bigquery
SELECT species, sex, count(*) as count,
    AVG(culmen_length_mm) as mean_culmen_length,
    AVG(culmen_depth_mm) as mean_culment_depth,
    AVG(flipper_length_mm) as mean_flipper_length,
    AVG(body_mass_g) as mean_body_mass
FROM `bigquery-public-data.ml_datasets.penguins`
GROUP BY species, sex

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,species,sex,count,mean_culmen_length,mean_culment_depth,mean_flipper_length,mean_body_mass
0,Adelie Penguin (Pygoscelis adeliae),FEMALE,73,37.257534,17.621918,187.794521,3368.835616
1,Adelie Penguin (Pygoscelis adeliae),MALE,73,40.390411,19.072603,192.410959,4043.493151
2,Chinstrap penguin (Pygoscelis antarctica),FEMALE,34,46.573529,17.588235,191.735294,3527.205882
3,Chinstrap penguin (Pygoscelis antarctica),MALE,34,51.094118,19.252941,199.911765,3938.970588
4,Adelie Penguin (Pygoscelis adeliae),,6,37.84,18.32,185.6,3540.0
5,Gentoo penguin (Pygoscelis papua),,4,46.0,14.166667,215.333333,4491.666667
6,Gentoo penguin (Pygoscelis papua),FEMALE,58,45.563793,14.237931,212.706897,4679.741379
7,Gentoo penguin (Pygoscelis papua),MALE,61,49.47377,15.718033,221.540984,5484.836066
8,Gentoo penguin (Pygoscelis papua),.,1,44.5,15.7,217.0,4875.0


Which observations have missing values?

In [13]:
%%bigquery
SELECT *
FROM `bigquery-public-data.ml_datasets.penguins`
WHERE sex IS NULL OR sex = '.'
    OR culmen_length_mm IS NULL
    OR culmen_depth_mm IS NULL
    OR flipper_length_mm IS NULL
    OR body_mass_g IS NULL

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie Penguin (Pygoscelis adeliae),Dream,37.5,18.9,179.0,2975.0,
1,Gentoo penguin (Pygoscelis papua),Biscoe,,,,,
2,Gentoo penguin (Pygoscelis papua),Biscoe,47.3,13.8,216.0,4725.0,
3,Gentoo penguin (Pygoscelis papua),Biscoe,44.5,14.3,216.0,4100.0,
4,Gentoo penguin (Pygoscelis papua),Biscoe,44.5,15.7,217.0,4875.0,.
5,Gentoo penguin (Pygoscelis papua),Biscoe,46.2,14.4,214.0,4650.0,
6,Adelie Penguin (Pygoscelis adeliae),Torgersen,,,,,
7,Adelie Penguin (Pygoscelis adeliae),Torgersen,34.1,18.1,193.0,3475.0,
8,Adelie Penguin (Pygoscelis adeliae),Torgersen,37.8,17.1,186.0,3300.0,
9,Adelie Penguin (Pygoscelis adeliae),Torgersen,37.8,17.3,180.0,3700.0,


### Processing As Dataframes Using BigFrames API

It can be helpful to use the `.describe()` method from Pandas.  The [BigFrames](https://cloud.google.com/python/docs/reference/bigframes/latest) API allows you to work in Python with dataframe like objects while the execution remains inside of BigQuery.

In [14]:
df = bf.read_gbq(BQ_SOURCE_TABLE)

HTML(value='Query job a2c9ecf8-425f-4ade-835c-35b0a504a8bf is RUNNING. <a target="_blank" href="https://consol…

In [15]:
df.describe()

HTML(value='Query job 1dd51f19-0155-4fa5-a119-62c3448a3124 is DONE. 0 Bytes processed. <a target="_blank" href…

HTML(value='Query job 94c259a8-5ee7-42ec-b41f-13c274dfe3a4 is DONE. 10.9 kB processed. <a target="_blank" href…

HTML(value='Query job 1b9ddefc-9300-4a59-b5c3-37e26ae125c2 is DONE. 0 Bytes processed. <a target="_blank" href…

Unnamed: 0,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
count,342.0,342.0,342.0,342.0
mean,43.92193,17.15117,200.915205,4201.754386
std,5.459584,1.974793,14.061714,801.954536
min,32.1,13.1,172.0,2700.0
25%,39.2,15.6,190.0,3550.0
50%,44.4,17.3,197.0,4050.0
75%,48.5,18.7,213.0,4750.0
max,59.6,21.5,231.0,6300.0


---
## BigQuery Setup

This workflow uses a BigQuery Public Dataset table (reviewed above).  This section creates (or links existing) dataset in the users BigQuery project.  This dataset is used to store the model object created below.

Create the dataset if missing:

In [16]:
ds = bigquery.Dataset(f"{BQ_PROJECT}.{BQ_DATASET}")
ds.location = BQ_REGION
ds = bq.create_dataset(dataset = ds, exists_ok = True)

Review dataset attributes:

In [17]:
ds.dataset_id

'bqml'

In [18]:
ds.project

'statmike-mlops-349915'

In [19]:
ds.full_dataset_id

'statmike-mlops-349915:bqml'

In [20]:
ds.path

'/projects/statmike-mlops-349915/datasets/bqml'

In [21]:
ds.location

'US'

### Add Train/Text Splits To Source Data

Make a copy of the source data in the local project including a column `split` with values of 'TRAIN' and 'TEST'.  The code below shows how to do stratified sampling for balance across the categorical variable `island` within each `species`.

Use the Python Client for BigQuery to create the source table with `split` column.  This uses an formated string in Python to construct the query string using parameters.

In [23]:
query_job = bq.query(
    f'''
    CREATE OR REPLACE TABLE `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE_PREFIX}-source` AS
        SELECT * EXCEPT(seq, count),
            CASE
                WHEN seq <= CEIL(.1 * count) THEN 'EVAL'
                WHEN species IS NULL THEN 'EVAL'
                ELSE 'TRAIN'
            END AS split
        FROM (
            SELECT * EXCEPT(sex),
                CASE WHEN sex = '.' THEN NULL ELSE sex END AS sex,
                ROW_NUMBER() OVER (PARTITION BY species, island ORDER BY RAND()) as seq
            FROM `{BQ_SOURCE_TABLE}`
        )
        LEFT OUTER JOIN (
            SELECT species, island, COUNT(*) as count
            FROM `{BQ_SOURCE_TABLE}`
            GROUP BY species, island
        )
        USING(species, island)
    '''
)
query_job.result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7f4003e923b0>

Print out the able name:

In [24]:
print(f'{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE_PREFIX}-source')

statmike-mlops-349915.bqml.feature-engineering-source


Review the TRAIN/TEST split:

In [25]:
%%bigquery
SELECT species, island,
    COUNT(*) as count,
    100 * COUNTIF(split = 'TRAIN')/COUNT(*) AS TRAIN_PCT,
    100 * COUNTIF(split = 'EVAL')/COUNT(*) AS EVAL_PCT
FROM `statmike-mlops-349915.bqml.reuse-feature-engineering-source`
GROUP BY species, island

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,species,island,count,TRAIN_PCT,EVAL_PCT
0,Adelie Penguin (Pygoscelis adeliae),Dream,56,89.285714,10.714286
1,Chinstrap penguin (Pygoscelis antarctica),Dream,68,89.705882,10.294118
2,Gentoo penguin (Pygoscelis papua),Biscoe,124,89.516129,10.483871
3,Adelie Penguin (Pygoscelis adeliae),Biscoe,44,88.636364,11.363636
4,Adelie Penguin (Pygoscelis adeliae),Torgersen,52,88.461538,11.538462


Which observations have missing values?  

In [26]:
%%bigquery
SELECT *
FROM `statmike-mlops-349915.bqml.reuse-feature-engineering-source`
WHERE sex IS NULL OR sex = '.'
    OR culmen_length_mm IS NULL
    OR culmen_depth_mm IS NULL
    OR flipper_length_mm IS NULL
    OR body_mass_g IS NULL

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,split
0,Adelie Penguin (Pygoscelis adeliae),Dream,37.5,18.9,179.0,2975.0,,TRAIN
1,Gentoo penguin (Pygoscelis papua),Biscoe,,,,,,TRAIN
2,Gentoo penguin (Pygoscelis papua),Biscoe,44.5,14.3,216.0,4100.0,,TRAIN
3,Gentoo penguin (Pygoscelis papua),Biscoe,47.3,13.8,216.0,4725.0,,TRAIN
4,Gentoo penguin (Pygoscelis papua),Biscoe,44.5,15.7,217.0,4875.0,,TRAIN
5,Gentoo penguin (Pygoscelis papua),Biscoe,46.2,14.4,214.0,4650.0,,TRAIN
6,Adelie Penguin (Pygoscelis adeliae),Torgersen,,,,,,TRAIN
7,Adelie Penguin (Pygoscelis adeliae),Torgersen,34.1,18.1,193.0,3475.0,,TRAIN
8,Adelie Penguin (Pygoscelis adeliae),Torgersen,37.8,17.1,186.0,3300.0,,TRAIN
9,Adelie Penguin (Pygoscelis adeliae),Torgersen,37.8,17.3,180.0,3700.0,,TRAIN


---
## Embedded Preprocessing

### Create Model Using `TRANSFORM` statement

Using the `TRANSFORM` clause, you can specify the desired preprocessing of column into features.  In the case of this data there are several desired preprocessing steps based on the data review above:
- impute missing values with `ML.IMPUTER`
- scale the `body_mass_g` column with `ML.ROBUST_SCALER`
- scale the other numerical columns with `ML.STANDARD_SCALER`

The model specification below does the data imputation in the inpute query and the scaling embedded in the model with a `TRANSFORM` clause.

In [41]:
%%bigquery
CREATE OR REPLACE MODEL `statmike-mlops-349915.bqml.embedded_preprocessing`
    TRANSFORM(
        species, sex, island, split,
        ML.ROBUST_SCALER(body_mass_g) OVER() AS body_mass_g,
        ML.STANDARD_SCALER(culmen_length_mm) OVER() AS culmen_length_mm,
        ML.STANDARD_SCALER(culmen_depth_mm) OVER() AS culmen_depth_mm,
        ML.STANDARD_SCALER(flipper_length_mm) OVER() AS flipper_length_mm
    )
    OPTIONS(
        model_type = 'BOOSTED_TREE_CLASSIFIER',
        input_label_cols = ['species'],
        data_split_method = 'CUSTOM',
        data_split_col = 'split',
        model_registry = 'VERTEX_AI',
        VERTEX_AI_MODEL_ID = 'bqml_embedded_preprocessing'
    )
AS
SELECT species,
    CASE WHEN split = 'TRAIN' THEN FALSE ELSE TRUE END AS split,
    ML.IMPUTER(sex, 'most_frequent') OVER() AS sex,
    ML.IMPUTER(body_mass_g, 'median') OVER() AS body_mass_g,
    ML.IMPUTER(culmen_length_mm, 'mean') OVER() AS culmen_length_mm,
    ML.IMPUTER(culmen_depth_mm, 'mean') OVER() AS culmen_depth_mm,
    ML.IMPUTER(flipper_length_mm, 'mean') OVER() AS flipper_length_mm, 
FROM `statmike-mlops-349915.bqml.reuse-feature-engineering-source`

Query is running:   0%|          |

The feature information for the model can be reviewed with ML.FEATURE_INFO.  This shows summary statistics pre-transformation.  Notice that the `null_count` is 0 for all features because the `ML.IMPUTER` functions filled in the missing values as instructed on the query statement.

In [42]:
%%bigquery
SELECT *
FROM ML.FEATURE_INFO(MODEL `statmike-mlops-349915.bqml.embedded_preprocessing`)

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,input,min,max,mean,median,stddev,category_count,null_count,dimension
0,body_mass_g,2700.0,6300.0,4221.498371,4050.0,790.485415,,0,
1,culmen_length_mm,32.1,59.6,44.062032,44.5,5.43099,,0,
2,culmen_depth_mm,13.1,21.5,17.195447,17.3,1.997309,,0,
3,flipper_length_mm,172.0,231.0,201.077623,197.0,13.98624,,0,


The `ML.EVALUATE` function can be used to review the evaluation metrics, here for both splits combined.  Notice that the imputation with `ML.IMPUTER` function needs to be repeated because it was not embedded in the mdoel above.

In [43]:
%%bigquery
WITH
    imputed AS (
        SELECT species, island, split,
            ML.IMPUTER(sex, 'most_frequent') OVER() AS sex,
            ML.IMPUTER(body_mass_g, 'median') OVER() AS body_mass_g,
            ML.IMPUTER(culmen_length_mm, 'mean') OVER() AS culmen_length_mm,
            ML.IMPUTER(culmen_depth_mm, 'mean') OVER() AS culmen_depth_mm,
            ML.IMPUTER(flipper_length_mm, 'mean') OVER() AS flipper_length_mm, 
        FROM `statmike-mlops-349915.bqml.reuse-feature-engineering-source`  
    )
SELECT *
FROM ML.EVALUATE(
    MODEL `statmike-mlops-349915.bqml.embedded_preprocessing`,
    (SELECT * FROM imputed)
)

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.995671,0.99241,0.994186,0.994003,0.024628,1.0


---
## Reusable Preprocessing

With the `ML.TRANSFORM` function you can transform the results of a query statement using the transformation of a previously create model.  That makes models TRANSFORM statements complete reusuable.  This is helpful because the transform statement also remember values that were calculated when they were created - like the mean and standard-deviation used with `ML.STANDARD_SCALER`.

### Using ML.TRANSFORM

First, a sample of raw data:

In [44]:
%%bigquery
SELECT *
FROM `statmike-mlops-349915.bqml.reuse-feature-engineering-source`
WHERE sex IS NULL AND island = 'Biscoe'

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,split
0,Gentoo penguin (Pygoscelis papua),Biscoe,,,,,,TRAIN
1,Gentoo penguin (Pygoscelis papua),Biscoe,44.5,14.3,216.0,4100.0,,TRAIN
2,Gentoo penguin (Pygoscelis papua),Biscoe,47.3,13.8,216.0,4725.0,,TRAIN
3,Gentoo penguin (Pygoscelis papua),Biscoe,44.5,15.7,217.0,4875.0,,TRAIN
4,Gentoo penguin (Pygoscelis papua),Biscoe,46.2,14.4,214.0,4650.0,,TRAIN


Second, the same raw sample processed with the transformation from the model created above:

In [45]:
%%bigquery
SELECT *
FROM ML.TRANSFORM(
    MODEL `statmike-mlops-349915.bqml.embedded_preprocessing`,
    (SELECT *
     FROM `statmike-mlops-349915.bqml.reuse-feature-engineering-source`
     WHERE sex IS NULL AND island = 'Biscoe'
    )
)

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,body_mass_g,culmen_length_mm,culmen_depth_mm,flipper_length_mm,species,island,sex,split
0,,,,,Gentoo penguin (Pygoscelis papua),Biscoe,,TRAIN
1,0.043478,0.08078,-1.45202,1.068651,Gentoo penguin (Pygoscelis papua),Biscoe,,TRAIN
2,0.586957,0.597181,-1.702766,1.068651,Gentoo penguin (Pygoscelis papua),Biscoe,,TRAIN
3,0.717391,0.08078,-0.749931,1.140267,Gentoo penguin (Pygoscelis papua),Biscoe,,TRAIN
4,0.521739,0.394309,-1.401871,0.925419,Gentoo penguin (Pygoscelis papua),Biscoe,,TRAIN


Third, the same raw sample first with imputed missing values, then processed with the transformation from the model created above:

In [46]:
%%bigquery
WITH
    imputed AS (
        SELECT species, split, island,
            CASE WHEN sex IS NULL THEN TRUE ELSE FALSE END AS sex_null,
            ML.IMPUTER(sex, 'most_frequent') OVER() AS sex,
            ML.IMPUTER(body_mass_g, 'median') OVER() AS body_mass_g,
            ML.IMPUTER(culmen_length_mm, 'mean') OVER() AS culmen_length_mm,
            ML.IMPUTER(culmen_depth_mm, 'mean') OVER() AS culmen_depth_mm,
            ML.IMPUTER(flipper_length_mm, 'mean') OVER() AS flipper_length_mm, 
        FROM `statmike-mlops-349915.bqml.reuse-feature-engineering-source`  
    )
SELECT *
FROM ML.TRANSFORM(
    MODEL `statmike-mlops-349915.bqml.embedded_preprocessing`,
    (SELECT *
     FROM imputed
     WHERE sex_null AND island = 'Biscoe'
    )
)

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,sex,body_mass_g,culmen_length_mm,culmen_depth_mm,flipper_length_mm,species,split,island,sex_null
0,MALE,0.0,-0.025833,-0.022181,-0.011659,Gentoo penguin (Pygoscelis papua),TRAIN,Biscoe,True
1,MALE,0.043478,0.08078,-1.45202,1.068651,Gentoo penguin (Pygoscelis papua),TRAIN,Biscoe,True
2,MALE,0.586957,0.597181,-1.702766,1.068651,Gentoo penguin (Pygoscelis papua),TRAIN,Biscoe,True
3,MALE,0.717391,0.08078,-0.749931,1.140267,Gentoo penguin (Pygoscelis papua),TRAIN,Biscoe,True
4,MALE,0.521739,0.394309,-1.401871,0.925419,Gentoo penguin (Pygoscelis papua),TRAIN,Biscoe,True


---
## Modular Preprocessing

What if you want to take adavantage of the `TRANSFORM` clauses results repeatedly across many models and other parts of your workflow?  What if you want to apply multiple `TRANSFORM` clauses in sequence: like imputation then scaling? You can build a model with only transformations using the `model_type = 'TRANSFORM_ONLY'` as follows.

### Create `TRANSFORM_ONLY` Model - For Imputation

In [47]:
%%bigquery
CREATE OR REPLACE MODEL `statmike-mlops-349915.bqml.modular_preprocessing_impute`
    TRANSFORM(
        ML.IMPUTER(sex, 'most_frequent') OVER() AS sex,
        ML.IMPUTER(body_mass_g, 'median') OVER() AS body_mass_g,
        ML.IMPUTER(culmen_length_mm, 'mean') OVER() AS culmen_length_mm,
        ML.IMPUTER(culmen_depth_mm, 'mean') OVER() AS culmen_depth_mm,
        ML.IMPUTER(flipper_length_mm, 'mean') OVER() AS flipper_length_mm
    )
    OPTIONS(
        model_type = 'TRANSFORM_ONLY',
        model_registry = 'VERTEX_AI',
        VERTEX_AI_MODEL_ID = 'bqml_modular_preprocessing_impute'
    )
AS
SELECT * 
FROM `statmike-mlops-349915.bqml.reuse-feature-engineering-source`
WHERE split = 'TRAIN'

Query is running:   0%|          |

Now apply the `TRANSFORM_ONLY` model using `ML.TRANSFORM`:

In [48]:
%%bigquery
SELECT *
FROM ML.TRANSFORM(
    MODEL `statmike-mlops-349915.bqml.modular_preprocessing_impute`,
    (SELECT *
        FROM `statmike-mlops-349915.bqml.reuse-feature-engineering-source`
        WHERE sex IS NULL and island = 'Biscoe')
)

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,sex,body_mass_g,culmen_length_mm,culmen_depth_mm,flipper_length_mm,species,island,split
0,MALE,4050.0,44.063,17.1957,201.079,Gentoo penguin (Pygoscelis papua),Biscoe,TRAIN
1,MALE,4100.0,44.5,14.3,216.0,Gentoo penguin (Pygoscelis papua),Biscoe,TRAIN
2,MALE,4725.0,47.3,13.8,216.0,Gentoo penguin (Pygoscelis papua),Biscoe,TRAIN
3,MALE,4875.0,44.5,15.7,217.0,Gentoo penguin (Pygoscelis papua),Biscoe,TRAIN
4,MALE,4650.0,46.2,14.4,214.0,Gentoo penguin (Pygoscelis papua),Biscoe,TRAIN


### Create `TRANSFORM_ONLY` Model - For Scaling

In [49]:
%%bigquery
CREATE OR REPLACE MODEL `statmike-mlops-349915.bqml.modular_preprocessing_scale`
    TRANSFORM(
        ML.ROBUST_SCALER(body_mass_g) OVER() AS body_mass_g,
        ML.STANDARD_SCALER(culmen_length_mm) OVER() AS culmen_length_mm,
        ML.STANDARD_SCALER(culmen_depth_mm) OVER() AS culmen_depth_mm,
        ML.STANDARD_SCALER(flipper_length_mm) OVER() AS flipper_length_mm
    )
    OPTIONS(
        model_type = 'TRANSFORM_ONLY',
        model_registry = 'VERTEX_AI',
        VERTEX_AI_MODEL_ID = 'bqml_modular_preprocessing_scale'
)
AS
SELECT * 
FROM `statmike-mlops-349915.bqml.reuse-feature-engineering-source`
WHERE split = 'TRAIN'

Query is running:   0%|          |

Now apply the `TRANSFORM_ONLY` model using `ML.TRANSFORM`:

In [50]:
%%bigquery
SELECT *
FROM ML.TRANSFORM(
    MODEL `statmike-mlops-349915.bqml.modular_preprocessing_scale`,
    (SELECT *
        FROM `statmike-mlops-349915.bqml.reuse-feature-engineering-source`
        WHERE sex IS NULL and island = 'Biscoe')
)

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,body_mass_g,culmen_length_mm,culmen_depth_mm,flipper_length_mm,species,island,sex,split
0,,,,,Gentoo penguin (Pygoscelis papua),Biscoe,,TRAIN
1,0.043478,0.080333,-1.44743,1.065093,Gentoo penguin (Pygoscelis papua),Biscoe,,TRAIN
2,0.586957,0.595051,-1.697358,1.065093,Gentoo penguin (Pygoscelis papua),Biscoe,,TRAIN
3,0.717391,0.080333,-0.747633,1.136476,Gentoo penguin (Pygoscelis papua),Biscoe,,TRAIN
4,0.521739,0.39284,-1.397445,0.922329,Gentoo penguin (Pygoscelis papua),Biscoe,,TRAIN


### Apply Multiple `TRANSFORM_ONLY` Models - Feature Pipeline

In [51]:
%%bigquery
WITH
    raw AS (
        SELECT *
        FROM `statmike-mlops-349915.bqml.reuse-feature-engineering-source`
        WHERE sex IS NULL and island = 'Biscoe'
    ),
    impute AS (
        SELECT *
        FROM ML.TRANSFORM(
            MODEL `statmike-mlops-349915.bqml.modular_preprocessing_impute`,
            (SELECT * FROM raw)
        )
    ),
    scale AS (
        SELECT *
        FROM ML.TRANSFORM(
            MODEL `statmike-mlops-349915.bqml.modular_preprocessing_scale`,
            (SELECT * FROM impute)
        )
    
    )
SELECT *
FROM scale

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,body_mass_g,culmen_length_mm,culmen_depth_mm,flipper_length_mm,sex,species,island,split
0,0.0,0.0,0.0,0.0,MALE,Gentoo penguin (Pygoscelis papua),Biscoe,TRAIN
1,0.043478,0.080333,-1.44743,1.065093,MALE,Gentoo penguin (Pygoscelis papua),Biscoe,TRAIN
2,0.586957,0.595051,-1.697358,1.065093,MALE,Gentoo penguin (Pygoscelis papua),Biscoe,TRAIN
3,0.717391,0.080333,-0.747633,1.136476,MALE,Gentoo penguin (Pygoscelis papua),Biscoe,TRAIN
4,0.521739,0.39284,-1.397445,0.922329,MALE,Gentoo penguin (Pygoscelis papua),Biscoe,TRAIN


### Create Model Using `TRANSFORM_ONLY` Models As Feature Pipeline

In [52]:
%%bigquery
CREATE OR REPLACE MODEL `statmike-mlops-349915.bqml.modular_preprocessing`
    OPTIONS(
        model_type = 'BOOSTED_TREE_CLASSIFIER',
        input_label_cols = ['species'],
        data_split_method = 'CUSTOM',
        data_split_col = 'split',
        model_registry = 'VERTEX_AI',
        VERTEX_AI_MODEL_ID = 'bqml_modular_preprocessing'
    )
AS
WITH
    raw AS (
        SELECT *
        FROM `statmike-mlops-349915.bqml.reuse-feature-engineering-source`
    ),
    impute AS (
        SELECT *
        FROM ML.TRANSFORM(
            MODEL `statmike-mlops-349915.bqml.modular_preprocessing_impute`,
            (SELECT * FROM raw)
        )
    ),
    scale AS (
        SELECT *
        FROM ML.TRANSFORM(
            MODEL `statmike-mlops-349915.bqml.modular_preprocessing_scale`,
            (SELECT * FROM impute)
        )
    )
SELECT * EXCEPT(split),
    CASE WHEN split = 'TRAIN' THEN FALSE ELSE TRUE END AS split
FROM scale

Query is running:   0%|          |

The feature information for the model can be reviewed with ML.FEATURE_INFO.  This shows summary statistics pre-transformation inside the model, but since the features were pre-processed using modular `TRANSFORM_ONLY` models, the input featurs are already imputed and scaled.

In [53]:
%%bigquery
SELECT *
FROM ML.FEATURE_INFO(MODEL `statmike-mlops-349915.bqml.modular_preprocessing`)

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,input,min,max,mean,median,stddev,category_count,null_count,dimension
0,body_mass_g,-1.173913,1.956522,0.149129,0.0,0.687379,,0,
1,culmen_length_mm,-2.199133,2.856134,-9e-06,0.080333,0.998365,,0,
2,culmen_depth_mm,-2.047256,2.151526,1.9e-05,0.052135,0.998363,,0,
3,flipper_length_mm,-2.075722,2.135826,-2.2e-05,-0.219786,0.998368,,0,
4,sex,,,,,,2.0,0,
5,island,,,,,,3.0,0,


The `ML.EVALUATE` function can be used to review the evaluation metrics, here for both splits combined.  Notice that the feature pipeline needs to be repeated because it is not embedded in the model in this case.

In [54]:
%%bigquery
WITH
    raw AS (
        SELECT *
        FROM `statmike-mlops-349915.bqml.reuse-feature-engineering-source`
    ),
    impute AS (
        SELECT *
        FROM ML.TRANSFORM(
            MODEL `statmike-mlops-349915.bqml.modular_preprocessing_impute`,
            (SELECT * FROM raw)
        )
    ),
    scale AS (
        SELECT *
        FROM ML.TRANSFORM(
            MODEL `statmike-mlops-349915.bqml.modular_preprocessing_scale`,
            (SELECT * FROM impute)
        )
    )
SELECT *
FROM ML.EVALUATE(
    MODEL `statmike-mlops-349915.bqml.modular_preprocessing`,
    (SELECT * FROM scale)
)

Query is running:   0%|          |

Downloading:   0%|          |

Unnamed: 0,precision,recall,accuracy,f1_score,log_loss,roc_auc
0,0.997821,0.995098,0.997093,0.996438,0.021294,1.0


---
## Using Models With BigFrames API

The model with embedded preprocessing can be used directly with the BigFrame API.

In [None]:
model = bf.read_gbq_model(f'{BQ_PROJECT}.{BQ_DATASET}.embedded_preprocessing')

In [110]:
test_dict = dict(
    island = 'Dream',
    culmen_length_mm = 40.9,
    culmen_depth_mm = 18.9,
    flipper_length_mm = 184.0,
    sex = 'MALE',
    body_mass_g = 3650.0
)
test_dict

{'island': 'Dream',
 'culmen_length_mm': 40.9,
 'culmen_depth_mm': 18.9,
 'flipper_length_mm': 184.0,
 'sex': 'MALE',
 'body_mass_g': 3650.0}

In [111]:
test_df = pd.Series(test_dict).to_frame().T
test_bf = bf.read_pandas(test_df)
test_bf.head()

HTML(value='Load job 92c50bdf-798f-4c12-9234-7cbf4eabc747 is RUNNING. <a target="_blank" href="https://console…

HTML(value='Query job 9b2c6404-f6a2-4229-81e0-acfb032ef923 is DONE. 8 Bytes processed. <a target="_blank" href…

HTML(value='Query job 79b83d19-bce2-4498-b6a5-0567b2f6b7e2 is DONE. 53 Bytes processed. <a target="_blank" hre…

HTML(value='Query job 1b2554ee-ef7c-43ef-848e-cb8e0e9cc9df is DONE. 0 Bytes processed. <a target="_blank" href…

Unnamed: 0,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,sex,body_mass_g
0,Dream,40.9,18.9,184.0,MALE,3650.0


In [73]:
model.predict(test_bf)

HTML(value='Query job eda0d29e-6f4a-47a8-b3c6-5f4c3888af59 is RUNNING. <a target="_blank" href="https://consol…

HTML(value='Query job 2d5a57d9-00f5-461f-ba48-2fdf771b1cb9 is DONE. 8 Bytes processed. <a target="_blank" href…

HTML(value='Query job 4b12b399-bfe1-4331-a6f8-6b40bb1723e5 is DONE. 0 Bytes processed. <a target="_blank" href…

HTML(value='Query job 37ff546f-01cc-4af9-860f-6388d52275e4 is DONE. 51 Bytes processed. <a target="_blank" hre…

HTML(value='Query job 9c59a1a9-dbb3-4112-a802-ae07030cb5a4 is DONE. 0 Bytes processed. <a target="_blank" href…

Unnamed: 0,predicted_species
0,Chinstrap penguin (Pygoscelis antarctica)


---
## Export To GCS For Complete Portability

In [81]:
gcs.lookup_bucket(GCS_BUCKET)

<Bucket: statmike-mlops-349915>

In [88]:
%%bigquery
EXPORT MODEL `statmike-mlops-349915.bqml.embedded_preprocessing`
    OPTIONS(URI = 'gs://statmike-mlops-349915/bqml/feature-engineering/models/embedded_preprocessing');
EXPORT MODEL `statmike-mlops-349915.bqml.modular_preprocessing`
    OPTIONS(URI = 'gs://statmike-mlops-349915/bqml/feature-engineering/models/modular_preprocessing');
EXPORT MODEL `statmike-mlops-349915.bqml.modular_preprocessing_impute`
    OPTIONS(URI = 'gs://statmike-mlops-349915/bqml/feature-engineering/models/modular_preprocessing_impute');
EXPORT MODEL `statmike-mlops-349915.bqml.modular_preprocessing_scale`
    OPTIONS(URI = 'gs://statmike-mlops-349915/bqml/feature-engineering/models/modular_preprocessing/scale');

Query is running:   0%|          |

In [91]:
for blob in list(gcs.bucket(GCS_BUCKET).list_blobs(prefix = 'bqml/feature-engineering/')):
    print(blob.name)

bqml/feature-engineering/
bqml/feature-engineering/models/
bqml/feature-engineering/models/embedded_preprocessing/
bqml/feature-engineering/models/embedded_preprocessing/assets/0_categorical_label.txt
bqml/feature-engineering/models/embedded_preprocessing/assets/model_metadata.json
bqml/feature-engineering/models/embedded_preprocessing/explanation_metadata.json
bqml/feature-engineering/models/embedded_preprocessing/main.py
bqml/feature-engineering/models/embedded_preprocessing/model.bst
bqml/feature-engineering/models/embedded_preprocessing/transform/
bqml/feature-engineering/models/embedded_preprocessing/transform/assets/
bqml/feature-engineering/models/embedded_preprocessing/transform/fingerprint.pb
bqml/feature-engineering/models/embedded_preprocessing/transform/saved_model.pb
bqml/feature-engineering/models/embedded_preprocessing/transform/variables/
bqml/feature-engineering/models/embedded_preprocessing/transform/variables/variables.data-00000-of-00001
bqml/feature-engineering/mod

---
## Online Serving With Vertex AI

In [None]:
## vertex ai client
aiplatform.init(project = PROJECT_ID, location = REGION)

In [96]:
for model in aiplatform.Model.list():
    if model.name.startswith('bqml_modular'): print(model.name)
    if model.name.startswith('bqml_embedded'): print(model.name)

bqml_modular_preprocessing
bqml_modular_preprocessing_scale
bqml_modular_preprocessing_impute
bqml_embedded_preprocessing


In [103]:
vertex_model = aiplatform.Model(model_name = 'bqml_embedded_preprocessing')

In [104]:
endpoint = vertex_model.deploy()

Creating Endpoint


[INFO][2023-11-15 00:00:28,405][google.cloud.aiplatform.models] Creating Endpoint


Create Endpoint backing LRO: projects/1026793852137/locations/us-central1/endpoints/4638340379307933696/operations/2973200868821696512


[INFO][2023-11-15 00:00:28,407][google.cloud.aiplatform.models] Create Endpoint backing LRO: projects/1026793852137/locations/us-central1/endpoints/4638340379307933696/operations/2973200868821696512


Endpoint created. Resource name: projects/1026793852137/locations/us-central1/endpoints/4638340379307933696


[INFO][2023-11-15 00:00:30,084][google.cloud.aiplatform.models] Endpoint created. Resource name: projects/1026793852137/locations/us-central1/endpoints/4638340379307933696


To use this Endpoint in another session:


[INFO][2023-11-15 00:00:30,085][google.cloud.aiplatform.models] To use this Endpoint in another session:


endpoint = aiplatform.Endpoint('projects/1026793852137/locations/us-central1/endpoints/4638340379307933696')


[INFO][2023-11-15 00:00:30,086][google.cloud.aiplatform.models] endpoint = aiplatform.Endpoint('projects/1026793852137/locations/us-central1/endpoints/4638340379307933696')


Deploying model to Endpoint : projects/1026793852137/locations/us-central1/endpoints/4638340379307933696


[INFO][2023-11-15 00:00:30,174][google.cloud.aiplatform.models] Deploying model to Endpoint : projects/1026793852137/locations/us-central1/endpoints/4638340379307933696


Using default machine_type: n1-standard-2


[INFO][2023-11-15 00:00:30,177][google.cloud.aiplatform.models] Using default machine_type: n1-standard-2


Deploy Endpoint model backing LRO: projects/1026793852137/locations/us-central1/endpoints/4638340379307933696/operations/1870381910069346304


[INFO][2023-11-15 00:00:30,314][google.cloud.aiplatform.models] Deploy Endpoint model backing LRO: projects/1026793852137/locations/us-central1/endpoints/4638340379307933696/operations/1870381910069346304


Endpoint model deployed. Resource name: projects/1026793852137/locations/us-central1/endpoints/4638340379307933696


[INFO][2023-11-15 00:18:38,427][google.cloud.aiplatform.models] Endpoint model deployed. Resource name: projects/1026793852137/locations/us-central1/endpoints/4638340379307933696


In [112]:
endpoint.predict(instances = [test_dict])

FailedPrecondition: 400 "Prediction failed: Exception during predicting with bqml model with transform clause: \"island\" in not an input of the TRANSFORM."


---
## Review Objects in GCP Console: BigQuery Models, GCS Exports, Vertex AI Models, Vertex AI Endpoints

### Console: BigQuery Models

### Console: GCS Model Files

### Console: Vertex AI Model Registry

### Console: Vertex AI Endpoint

---
## Remove Resources Created In This Notebook

- Dataset In BigQuery
- Model Objects In BigQuery
- Model Exports in GCS
- Endpoints In Vertex AI