<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/03%20-%20BigQuery%20ML%20%28BQML%29/BQML%20Cross-validation%20Example.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A//raw.githubusercontent.com/statmike/vertex-ai-mlops/main/03%20-%20BigQuery%20ML%20%28BQML%29/BQML%20Cross-validation%20Example.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/03%20-%20BigQuery%20ML%20%28BQML%29/BQML%20Cross-validation%20Example.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https%3A//raw.githubusercontent.com/statmike/vertex-ai-mlops/main/03%20-%20BigQuery%20ML%20%28BQML%29/BQML%20Cross-validation%20Example.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# BQML Cross-validation Examples

This notebook uses BigQuery ML (BQML) to perform k-fold cross-validation as a way to validate a model.

This builds on the example in [03a - BQML Logistic Regression](./03a%20-%20BQML%20Logistic%20Regression.ipynb). In that example, a holdout sample was used for testing, based on the `splits` column created during the data setup in notebook [01 - BigQuery - Table Data Source](../01%20-%20Data%20Sources/01%20-%20BigQuery%20-%20Table%20Data%20Source.ipynb).
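
As a rough illustration of the idea (not code from the referenced notebooks), k-fold cross-validation assigns every row to one of k folds and then trains k models, each holding out a different fold for evaluation. The helper name `k_fold_splits` and the round-robin fold assignment are assumptions made for this sketch; later in this notebook the fold is derived from a hash of `transaction_id` instead.

```python
def k_fold_splits(n_rows: int, k: int):
    """Yield (fold, train_idx, test_idx) for each of the k folds."""
    # Round-robin assignment: row i goes to fold (i mod k) + 1.
    folds = [i % k + 1 for i in range(n_rows)]
    for fold in range(1, k + 1):
        test_idx = [i for i, f in enumerate(folds) if f == fold]
        train_idx = [i for i, f in enumerate(folds) if f != fold]
        yield fold, train_idx, test_idx


splits = list(k_fold_splits(n_rows=20, k=5))
for fold, train_idx, test_idx in splits:
    print(f"fold {fold}: train on {len(train_idx)} rows, evaluate on {len(test_idx)} rows")
```

Every row is used for evaluation exactly once and for training in the other k - 1 rounds, which is what distinguishes this from a single holdout split.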

**Prerequisites:**
- [01 - BigQuery - Table Data Source](../01%20-%20Data%20Sources/01%20-%20BigQuery%20-%20Table%20Data%20Source.ipynb)

**Resources:**
- [BigQuery ML (BQML) Overview](https://cloud.google.com/bigquery-ml/docs/introduction)
- [Overview of BQML methods and workflows](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-e2e-journey)
- [BigQuery](https://cloud.google.com/bigquery)
    - [Documentation:](https://cloud.google.com/bigquery/docs/query-overview)
    - [API:](https://cloud.google.com/bigquery/docs/reference/libraries-overview)
        - [Clients](https://cloud.google.com/bigquery/docs/reference/libraries)
            - [Python SDK:](https://github.com/googleapis/python-bigquery)
            - [Python Library Reference:](https://cloud.google.com/python/docs/reference/bigquery/latest)
- [Vertex AI](https://cloud.google.com/vertex-ai)
    - [Documentation:](https://cloud.google.com/vertex-ai/docs/start/introduction-unified-platform)
    - [API:](https://cloud.google.com/vertex-ai/docs/reference)
        - [Clients:](https://cloud.google.com/vertex-ai/docs/start/client-libraries)
            - [Python SDK:](https://github.com/googleapis/python-aiplatform)
            - [Python Library Reference:](https://cloud.google.com/python/docs/reference/aiplatform/latest)

---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/03%20-%20BigQuery%20ML%20(BQML)/BQML%20Cross-validation%20Example.ipynb) and run the cells in this section.  Otherwise, skip this section.

This cell will authenticate to GCP (follow prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [473]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Setup

inputs:

In [65]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [66]:
REGION = 'us-central1'
EXPERIMENT = 'crossval'
SERIES = '03'

# source data
BQ_PROJECT = PROJECT_ID
BQ_DATASET = 'fraud'
BQ_TABLE = 'fraud_prepped'

# Model Training
VAR_TARGET = 'Class'
VAR_OMIT = 'transaction_id' # add more variables to the string with space delimiters
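
The model queries later in this notebook turn `VAR_OMIT` into a `SELECT * EXCEPT(...)` column list by splitting the string on whitespace. A minimal sketch of that expansion follows; the helper name `except_clause` is hypothetical and not part of the notebook, and the empty-string handling is an added assumption (an empty `EXCEPT()` is expected to be rejected by BigQuery).

```python
def except_clause(var_omit: str, *extra: str) -> str:
    """Build an `EXCEPT(...)` clause from a space-delimited string of column names."""
    columns = var_omit.split() + list(extra)
    # Return nothing when there are no columns to omit, rather than an empty EXCEPT().
    return f"EXCEPT({', '.join(columns)})" if columns else ""


print(except_clause('transaction_id', 'k'))
print(except_clause('transaction_id Time', 'k'))
```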

packages:

In [67]:
from google.cloud import bigquery
from datetime import datetime
import matplotlib.pyplot as plt
from sklearn import metrics

clients:

In [68]:
bq = bigquery.Client(project = PROJECT_ID)

parameters:

In [69]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
RUN_NAME = f'run-{TIMESTAMP}'

BQ_MODEL = f'{SERIES}_{EXPERIMENT}_{TIMESTAMP}'

---
## Review Data

The data source here was prepared in [01 - BigQuery - Table Data Source](../01%20-%20Data%20Sources/01%20-%20BigQuery%20-%20Table%20Data%20Source.ipynb).  In this notebook we will use the prepared BigQuery table to build a model with BigQuery ML (BQML).

This is a table of 284,807 credit card transactions classified as fraudulent or normal in the column `Class`.  In order to protect confidentiality, the original features have been transformed using [principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) into 28 features named `V1, V2, ... V28` (float).  Two descriptive features are provided without transformation by PCA:
- `Time` (integer) is the seconds elapsed between the transaction and the earliest transaction in the table
- `Amount` (float) is the value of the transaction

The data preparation added splits for machine learning in a column named `splits`, with 80% for training (`TRAIN`), 10% for validation (`VALIDATE`) and 10% for testing (`TEST`).  Additionally, a unique identifier, `transaction_id`, was added to each transaction.

Review the number of records for each level of Class (VAR_TARGET) for each of the data splits:

In [70]:
query = f"""
SELECT splits, {VAR_TARGET}, count(*) as n
FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`
GROUP BY splits, {VAR_TARGET}
"""
bq.query(query = query).to_dataframe()

| | splits | Class | n |
|---|---|---|---|
| 0 | TEST | 0 | 28455 |
| 1 | TEST | 1 | 47 |
| 2 | TRAIN | 0 | 227664 |
| 3 | TRAIN | 1 | 397 |
| 4 | VALIDATE | 0 | 28196 |
| 5 | VALIDATE | 1 | 48 |


Further review the balance of the target variable (VAR_TARGET) for each split as a percentage of the split:

In [71]:
query = f"""
WITH
    COUNTS as (SELECT splits, {VAR_TARGET}, count(*) as n FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` GROUP BY splits, {VAR_TARGET})
    
SELECT *,
    SUM(n) OVER() as total,
    SAFE_DIVIDE(n, SUM(n) OVER(PARTITION BY {VAR_TARGET})) as n_pct_class,
    SAFE_DIVIDE(n, SUM(n) OVER(PARTITION BY splits)) as n_pct_split,
    SAFE_DIVIDE(SUM(n) OVER(PARTITION BY {VAR_TARGET}), SUM(n) OVER()) as class_pct_total
FROM COUNTS
"""
review = bq.query(query = query).to_dataframe()
review

| | splits | Class | n | total | n_pct_class | n_pct_split | class_pct_total |
|---|---|---|---|---|---|---|---|
| 0 | TEST | 0 | 28455 | 284807 | 0.100083 | 0.998351 | 0.998273 |
| 1 | TEST | 1 | 47 | 284807 | 0.095528 | 0.001649 | 0.001727 |
| 2 | VALIDATE | 0 | 28196 | 284807 | 0.099172 | 0.998301 | 0.998273 |
| 3 | VALIDATE | 1 | 48 | 284807 | 0.097561 | 0.001699 | 0.001727 |
| 4 | TRAIN | 0 | 227664 | 284807 | 0.800746 | 0.998259 | 0.998273 |
| 5 | TRAIN | 1 | 397 | 284807 | 0.806911 | 0.001741 | 0.001727 |
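
The window functions in the query above can be checked by hand. The following sketch recomputes the three percentage columns in plain Python from the per-split counts returned by the first review query; the variable names are chosen for this illustration only.

```python
from collections import defaultdict

# (splits, Class) -> n, copied from the count query results above.
counts = {
    ("TEST", 0): 28455, ("TEST", 1): 47,
    ("TRAIN", 0): 227664, ("TRAIN", 1): 397,
    ("VALIDATE", 0): 28196, ("VALIDATE", 1): 48,
}

total = sum(counts.values())   # SUM(n) OVER()
by_split = defaultdict(int)    # SUM(n) OVER(PARTITION BY splits)
by_class = defaultdict(int)    # SUM(n) OVER(PARTITION BY Class)
for (split, label), n in counts.items():
    by_split[split] += n
    by_class[label] += n

rows = {
    (split, label): {
        "n_pct_class": n / by_class[label],
        "n_pct_split": n / by_split[split],
        "class_pct_total": by_class[label] / total,
    }
    for (split, label), n in counts.items()
}

for key, values in rows.items():
    print(key, {name: round(value, 6) for name, value in values.items()})
```

Roughly 0.17% of each split is `Class = 1`, so the three splits have nearly the same class balance as the full table.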


---
## Prepare Data For Cross-validation



### Create BigQuery Dataset

List BigQuery datasets in the project:

In [72]:
datasets = list(bq.list_datasets())
for d in datasets:
    print(d.dataset_id)

applied_forecasting
explained_columns
forecasting_8_tournament
fraud
github_api
model_deployment_monitoring_1961322035766362112


Create the dataset if missing:

In [73]:
ds = bigquery.Dataset(f"{BQ_PROJECT}.crossvalidation")
ds.location = REGION
ds.labels = {'experiment': f'{EXPERIMENT}'}
ds = bq.create_dataset(dataset = ds, exists_ok = True)

List BigQuery datasets in the project:

In [74]:
datasets = list(bq.list_datasets())
for d in datasets:
    print(d.dataset_id)

applied_forecasting
crossvalidation
explained_columns
forecasting_8_tournament
fraud
github_api
model_deployment_monitoring_1961322035766362112


### Create BigQuery View With k-folds

In [75]:
k = 10

In [76]:
query = f"""
    CREATE VIEW IF NOT EXISTS `{BQ_PROJECT}.crossvalidation.{BQ_TABLE}_{k}_fold` AS
    SELECT * EXCEPT(splits),
        MOD(ABS(FARM_FINGERPRINT(transaction_id)), {k}) + 1 AS k
    FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`
"""
print(query)


    CREATE VIEW IF NOT EXISTS `statmike-mlops-349915.crossvalidation.fraud_prepped_10_fold` AS
    SELECT * EXCEPT(splits),
        MOD(ABS(FARM_FINGERPRINT(transaction_id)), 10) + 1 AS k
    FROM `statmike-mlops-349915.fraud.fraud_prepped`



In [77]:
job = bq.query(query = query)
job.result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7f5b9b7e8d10>
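
The fold column comes from hashing `transaction_id` and taking the result modulo `k`, so the same transaction always lands in the same fold and the folds come out roughly equal in size. The sketch below imitates that idea locally. `FARM_FINGERPRINT` is not available in the Python standard library, so SHA-256 is used here as a stand-in (an assumption of this sketch); the fold numbers it produces will not match the ones BigQuery assigns, and the `txn-…` identifiers are made up for the demonstration.

```python
import hashlib
from collections import Counter


def assign_fold(transaction_id: str, k: int = 10) -> int:
    """Map an identifier to a fold in 1..k using a deterministic hash."""
    digest = hashlib.sha256(transaction_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then MOD(..., k) + 1 as in the view.
    value = int.from_bytes(digest[:8], "big")
    return value % k + 1


# Hypothetical identifiers, only to show how evenly the folds fill up.
ids = [f"txn-{i}" for i in range(10_000)]
fold_counts = Counter(assign_fold(t) for t in ids)
for fold in sorted(fold_counts):
    print(fold, fold_counts[fold])
```

Because the assignment depends only on the identifier, re-running the view definition reproduces the same folds without storing a random seed.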

In [78]:
query = f"""
WITH
    COUNTS as (SELECT k, {VAR_TARGET}, count(*) as n FROM `{BQ_PROJECT}.crossvalidation.{BQ_TABLE}_{k}_fold` GROUP BY k, {VAR_TARGET})
    
SELECT *,
    SUM(n) OVER() as total,
    SAFE_DIVIDE(n, SUM(n) OVER(PARTITION BY {VAR_TARGET})) as n_pct_class,
    SAFE_DIVIDE(n, SUM(n) OVER(PARTITION BY k)) as n_pct_k,
    SAFE_DIVIDE(SUM(n) OVER(PARTITION BY {VAR_TARGET}), SUM(n) OVER()) as class_pct_total
FROM COUNTS
ORDER BY Class, k
"""
review = bq.query(query = query).to_dataframe()
review

Unnamed: 0,k,Class,n,total,n_pct_class,n_pct_k,class_pct_total
0,1,0,28604,284807,0.100607,0.998151,0.998273
1,2,0,28258,284807,0.09939,0.998163,0.998273
2,3,0,28798,284807,0.101289,0.998405,0.998273
3,4,0,28379,284807,0.099815,0.998452,0.998273
4,5,0,28132,284807,0.098947,0.997836,0.998273
5,6,0,28663,284807,0.100814,0.99805,0.998273
6,7,0,28447,284807,0.100055,0.998456,0.998273
7,8,0,28383,284807,0.099829,0.998558,0.998273
8,9,0,28196,284807,0.099172,0.998301,0.998273
9,10,0,28455,284807,0.100083,0.998351,0.998273


---
## Cross-Validation Models

This submits a separate model creation query for each of the k folds.  Note that BigQuery has [default limits](https://cloud.google.com/bigquery/quotas#query_jobs) for concurrent queries (100) and queued queries (1000) within a project, which might affect how this works if there are many folds or many other queries running in the project.  Workarounds range from running the queries sequentially, to running them in groups, to managing the size of the [query queue](https://cloud.google.com/bigquery/docs/query-queues).

In [79]:
jobs = []
for i in range(k):
    query = f"""
        CREATE OR REPLACE MODEL `{BQ_PROJECT}.crossvalidation.{BQ_MODEL}_fold_{i+1}`
        OPTIONS (
                model_type = 'LOGISTIC_REG',
                auto_class_weights = TRUE,
                input_label_cols = ['{VAR_TARGET}'],
                data_split_method = 'NO_SPLIT'
            ) AS
        SELECT * EXCEPT({','.join(VAR_OMIT.split())}, k)
        FROM `{BQ_PROJECT}.crossvalidation.{BQ_TABLE}_{k}_fold`
        WHERE k != {i+1}   
    """
    job = bq.query(query = query)
    jobs.append(job)
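
If the concurrency limits mentioned above become a problem, one of the workarounds, running in groups, could be sketched as follows. This is not what the notebook does; `submit_in_batches` and its parameters are hypothetical names, and `submit` is assumed to be any callable that starts a job and returns an object with a blocking `result()` method, as `bq.query` does.

```python
from typing import Any, Callable, List, Sequence


def submit_in_batches(
    queries: Sequence[str],
    submit: Callable[[str], Any],
    batch_size: int = 5,
) -> List[Any]:
    """Submit queries in groups, waiting for each group before starting the next."""
    jobs: List[Any] = []
    for start in range(0, len(queries), batch_size):
        # Start every query in this group so they run concurrently.
        batch = [submit(q) for q in queries[start:start + batch_size]]
        # Block until the whole group has finished before submitting more.
        for job in batch:
            job.result()
        jobs.extend(batch)
    return jobs
```

In this notebook that could look like `submit_in_batches(queries, lambda q: bq.query(query = q), batch_size = 5)`, where `queries` is a list holding the per-fold `CREATE OR REPLACE MODEL` statements. Setting `batch_size = 1` gives the fully sequential variant.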

Wait for all the jobs to complete:

In [80]:
from time import sleep
while not all(job.done() for job in jobs):
    print('waiting for jobs to finish ... sleeping for 10s')
    sleep(10)
print('All jobs are complete!')

waiting for jobs to finish ... sleeping for 10s
waiting for jobs to finish ... sleeping for 10s
waiting for jobs to finish ... sleeping for 10s
waiting for jobs to finish ... sleeping for 10s
waiting for jobs to finish ... sleeping for 10s
waiting for jobs to finish ... sleeping for 10s
waiting for jobs to finish ... sleeping for 10s
waiting for jobs to finish ... sleeping for 10s
waiting for jobs to finish ... sleeping for 10s
waiting for jobs to finish ... sleeping for 10s
waiting for jobs to finish ... sleeping for 10s
waiting for jobs to finish ... sleeping for 10s
waiting for jobs to finish ... sleeping for 10s
waiting for jobs to finish ... sleeping for 10s
All jobs are complete!
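
`job.done()` reports that a job has finished, not that it succeeded. A small check like the following could be run after the wait loop. It assumes each job exposes `job_id` and `error_result`, as query jobs from `google.cloud.bigquery` do; the helper name `failed_jobs` is hypothetical, and the demonstration uses stand-in objects rather than real jobs.

```python
from types import SimpleNamespace


def failed_jobs(jobs):
    """Return (job_id, error_result) for every finished job that reported an error."""
    return [(job.job_id, job.error_result) for job in jobs if job.error_result]


# Stand-in objects for illustration only; in the notebook, pass the real `jobs` list.
demo_jobs = [
    SimpleNamespace(job_id="job-1", error_result=None),
    SimpleNamespace(job_id="job-2", error_result={"reason": "invalidQuery", "message": "example"}),
]
for job_id, error in failed_jobs(demo_jobs):
    print(f"{job_id} failed: {error['message']}")
```

An empty list from `failed_jobs(jobs)` means all k fold models were created and the evaluation step can proceed.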


### Review The Cross-Validation

In [81]:
query = ''
for i in range(k):
    query += f"""
        SELECT {i+1} as k, * FROM ML.EVALUATE (MODEL `{BQ_PROJECT}.crossvalidation.{BQ_MODEL}_fold_{i+1}`,
            (SELECT * FROM `{BQ_PROJECT}.crossvalidation.{BQ_TABLE}_{k}_fold` WHERE k = {i+1}))
    """
    if i < k-1: query += "UNION ALL"
    else: query += 'ORDER BY k'
print(query)


        SELECT 1 as k, * FROM ML.EVALUATE (MODEL `statmike-mlops-349915.crossvalidation.03_crossval_20230206210458_fold_1`,
            (SELECT * FROM `statmike-mlops-349915.crossvalidation.fraud_prepped_10_fold` WHERE k = 1))
    UNION ALL
        SELECT 2 as k, * FROM ML.EVALUATE (MODEL `statmike-mlops-349915.crossvalidation.03_crossval_20230206210458_fold_2`,
            (SELECT * FROM `statmike-mlops-349915.crossvalidation.fraud_prepped_10_fold` WHERE k = 2))
    UNION ALL
        SELECT 3 as k, * FROM ML.EVALUATE (MODEL `statmike-mlops-349915.crossvalidation.03_crossval_20230206210458_fold_3`,
            (SELECT * FROM `statmike-mlops-349915.crossvalidation.fraud_prepped_10_fold` WHERE k = 3))
    UNION ALL
        SELECT 4 as k, * FROM ML.EVALUATE (MODEL `statmike-mlops-349915.crossvalidation.03_crossval_20230206210458_fold_4`,
            (SELECT * FROM `statmike-mlops-349915.crossvalidation.fraud_prepped_10_fold` WHERE k = 4))
    UNION ALL
        SELECT 5 as k, * FROM ML.EV

In [82]:
bq.query(query = query).to_dataframe()

| | k | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.087687 | 0.886792 | 0.982727 | 0.159593 | 0.111964 | 0.977325 |
| 1 | 2 | 0.082353 | 0.942308 | 0.980608 | 0.151468 | 0.126503 | 0.982966 |
| 2 | 3 | 0.076655 | 0.956522 | 0.981556 | 0.141935 | 0.120929 | 0.994974 |
| 3 | 4 | 0.081028 | 0.931818 | 0.983534 | 0.149091 | 0.116855 | 0.996614 |
| 4 | 5 | 0.093103 | 0.885246 | 0.981095 | 0.168487 | 0.117733 | 0.985989 |
| 5 | 6 | 0.092527 | 0.928571 | 0.982102 | 0.168285 | 0.119718 | 0.98939 |
| 6 | 7 | 0.079909 | 0.795455 | 0.985539 | 0.145228 | 0.107578 | 0.958209 |
| 7 | 8 | 0.072144 | 0.878049 | 0.983535 | 0.133333 | 0.112189 | 0.979634 |
| 8 | 9 | 0.078324 | 0.895833 | 0.981908 | 0.144054 | 0.118807 | 0.948145 |
| 9 | 10 | 0.074468 | 0.893617 | 0.98151 | 0.13748 | 0.114005 | 0.977249 |
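
Cross-validation results are usually reported as a summary across folds rather than fold by fold. The sketch below computes the mean, sample standard deviation, and range of `recall` and `roc_auc` from the per-fold values returned above; the `summarize` helper is a hypothetical name, and the same approach would apply to the other metric columns.

```python
from statistics import mean, stdev

# Per-fold values copied from the ML.EVALUATE results above (folds 1 to 10).
recall = [0.886792, 0.942308, 0.956522, 0.931818, 0.885246,
          0.928571, 0.795455, 0.878049, 0.895833, 0.893617]
roc_auc = [0.977325, 0.982966, 0.994974, 0.996614, 0.985989,
           0.98939, 0.958209, 0.979634, 0.948145, 0.977249]


def summarize(values):
    """Summarize one metric across folds."""
    return {"mean": mean(values), "std": stdev(values), "min": min(values), "max": max(values)}


summary = {name: summarize(values) for name, values in {"recall": recall, "roc_auc": roc_auc}.items()}
for name, stats in summary.items():
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

The spread across folds gives a sense of how much a metric could move on a different sample, which a single holdout split cannot show.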


---
## Compare to Model on Full Table

In [83]:
query = f"""
        CREATE OR REPLACE MODEL `{BQ_PROJECT}.crossvalidation.{BQ_MODEL}`
        OPTIONS (
                model_type = 'LOGISTIC_REG',
                auto_class_weights = TRUE,
                input_label_cols = ['{VAR_TARGET}'],
                data_split_method = 'NO_SPLIT'
            ) AS
        SELECT * EXCEPT({','.join(VAR_OMIT.split())}, k)
        FROM `{BQ_PROJECT}.crossvalidation.{BQ_TABLE}_{k}_fold`
"""
print(query)


        CREATE OR REPLACE MODEL `statmike-mlops-349915.crossvalidation.03_crossval_20230206210458`
        OPTIONS (
                model_type = 'LOGISTIC_REG',
                auto_class_weights = TRUE,
                input_label_cols = ['Class'],
                data_split_method = 'NO_SPLIT'
            ) AS
        SELECT * EXCEPT(transaction_id, k)
        FROM `statmike-mlops-349915.crossvalidation.fraud_prepped_10_fold`



In [84]:
job = bq.query(query = query)
job.result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7f5b9b65e690>

In [85]:
query = f"""
SELECT * FROM ML.EVALUATE (MODEL `{BQ_PROJECT}.crossvalidation.{BQ_MODEL}`,
    (SELECT * FROM `{BQ_PROJECT}.crossvalidation.{BQ_TABLE}_{k}_fold`))
"""
review = bq.query(query = query).to_dataframe()
review

| | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|
| 0 | 0.083707 | 0.910569 | 0.982627 | 0.15332 | 0.11659 | 0.985649 |
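
Note that the full-table model is evaluated on the same rows it was trained on, so these figures describe fit rather than holdout performance. One way to relate them to the cross-validation results is to check whether each full-table metric falls inside the range seen across the folds. The sketch below does this for `precision`, using values copied from the two result tables; the helper name `compare_to_folds` is hypothetical.

```python
from statistics import mean

# Per-fold precision from the cross-validation review (folds 1 to 10).
fold_precision = [0.087687, 0.082353, 0.076655, 0.081028, 0.093103,
                  0.092527, 0.079909, 0.072144, 0.078324, 0.074468]
# Precision of the model trained and evaluated on the full table.
full_precision = 0.083707


def compare_to_folds(full_value, fold_values):
    """Compare one full-table metric with the distribution across folds."""
    return {
        "cv_mean": mean(fold_values),
        "gap": full_value - mean(fold_values),
        "within_fold_range": min(fold_values) <= full_value <= max(fold_values),
    }


comparison = compare_to_folds(full_precision, fold_precision)
print({key: round(value, 6) if isinstance(value, float) else value
       for key, value in comparison.items()})
```

A full-table value that sits inside the fold range, as it does here, suggests the single model behaves much like the fold models did on held-out data.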
