# 03 - Introduction To BigQuery ML (BQML)

BigQuery Machine Learning (BQML) lets you construct an ML workflow in `SQL`.  This is a great leap in productivity and flexibility when the data source is BigQuery and users are already familiar with `SQL`. Multiple techniques are available for model training directly in `SQL`, including hyperparameter tuning.  Predictions can be served directly in BigQuery and can include explainability. Models can also be exported to Vertex AI for online predictions on Vertex AI Endpoints.

**If you know:**
```SQL
CREATE TABLE {table name}
    AS
        SELECT ...
        FROM ...
        WHERE ...
```

**Then you are ready to try:**
```SQL
CREATE MODEL {model name} 
    OPTIONS (
        MODEL_TYPE = ...
        {more options}
    )
    AS
        SELECT ...
        FROM ...
        WHERE ...
```

**You can even use:**
- ```SQL
  CREATE OR REPLACE MODEL {model name}...
  ```
- ```SQL
  CREATE MODEL IF NOT EXISTS {model name}...
  ```

**Functions for your ML Workflow**
- Model Evaluation with `ML.EVALUATE`:
    - ```SQL
    SELECT *
    FROM ML.EVALUATE(
        MODEL {model name},
        (
            SELECT ...
            FROM ...
            WHERE ...),
        STRUCT({optional parameters})
    )
    ```
- Prediction Using Models with `ML.PREDICT`:
    - ```SQL
    SELECT *
    FROM ML.PREDICT(
        MODEL {model name},
        (
            SELECT ...
            FROM ...
            WHERE ...),
        STRUCT({optional parameters})
    )
    ```
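
As a rough, concrete illustration of the templates above, the following Python sketch fills them in and returns the statements as strings. It assumes a source table with a `Class` label column plus `transaction_id` and `splits` columns (like the fraud table queried later in this notebook). The project ID, the model name `example_logistic`, the helper name `bqml_statements`, and the choice of `LOGISTIC_REG` are placeholders picked for illustration and are not part of the original workflow. Each string could then be submitted with a BigQuery client such as the one created in the Setup section.

```python
def bqml_statements(source_table: str, model_name: str, label_col: str = "Class") -> dict:
    """Return CREATE MODEL, ML.EVALUATE, and ML.PREDICT statements as SQL strings."""
    # Assumption: transaction_id and splits identify rows and data splits,
    # so they are excluded from the model inputs.
    features = "* EXCEPT(transaction_id, splits)"
    create = f"""
CREATE OR REPLACE MODEL `{model_name}`
    OPTIONS (
        MODEL_TYPE = 'LOGISTIC_REG',
        INPUT_LABEL_COLS = ['{label_col}']
    )
    AS
        SELECT {features}
        FROM `{source_table}`
        WHERE splits = 'TRAIN'
"""
    evaluate = f"""
SELECT *
FROM ML.EVALUATE(
    MODEL `{model_name}`,
    (
        SELECT {features}
        FROM `{source_table}`
        WHERE splits = 'TEST')
)
"""
    predict = f"""
SELECT *
FROM ML.PREDICT(
    MODEL `{model_name}`,
    (
        SELECT *
        FROM `{source_table}`
        WHERE splits = 'TEST')
)
"""
    return {"create": create, "evaluate": evaluate, "predict": predict}


statements = bqml_statements(
    source_table="my-project.fraud.fraud_prepped",   # placeholder project ID
    model_name="my-project.fraud.example_logistic",  # hypothetical model name
)
print(statements["create"])
```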
---

**Getting Started**
<p align="center" width="100%">
A great place to start exploring the available model types, and the functions that help create an ML workflow with each one, is this site:
    <center>
        <a href="https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-e2e-journey">
            <h1>End-to-end user journey for each model</h1>
        </a>
    </center>
Another great resource for getting started is the "What is BigQuery ML?" starting page, which includes a model selection guide.
    <center>
        <a href="https://cloud.google.com/bigquery-ml/docs/introduction">
        <h2>What is BigQuery ML?</h2>
        </a>
    </center>
</p>

---

**Prerequisites:**
- [01 - BigQuery - Table Data Source](../01%20-%20Data%20Sources/01%20-%20BigQuery%20-%20Table%20Data%20Source.ipynb)

**Resources:**
- [BigQuery ML (BQML) Overview](https://cloud.google.com/bigquery-ml/docs/introduction)
- [Overview of BQML methods and workflows](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-e2e-journey)
- [BigQuery](https://cloud.google.com/bigquery)
    - [Documentation:](https://cloud.google.com/bigquery/docs/query-overview)
    - [API:](https://cloud.google.com/bigquery/docs/reference/libraries-overview)
        - [Clients](https://cloud.google.com/bigquery/docs/reference/libraries)
            - [Python SDK:](https://github.com/googleapis/python-bigquery)
            - [Python Library Reference:](https://cloud.google.com/python/docs/reference/bigquery/latest)

**Conceptual Flow & Workflow**
<p align="center">
  <img alt="Conceptual Flow" src="../architectures/slides/03_arch.png" width="45%">
&nbsp; &nbsp; &nbsp; &nbsp;
  <img alt="Workflow" src="../architectures/slides/03_console.png" width="45%">
</p>

---
## Setup

inputs:

In [1]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [2]:


# source data
BQ_PROJECT = PROJECT_ID
BQ_DATASET = 'fraud'
BQ_TABLE = 'fraud_prepped'

packages:

In [3]:
import pandas as pd
from google.cloud import bigquery

clients:

In [4]:
bq = bigquery.Client()

parameters:

In [5]:
BQ_SOURCE = f'{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}'

---
## Using BigQuery From Jupyter and Python

> **Notes:**
> - The `LIMIT 5` clause limits the number of rows returned by BigQuery to 5, but BigQuery still does a full table scan.  If you have a table larger than 1GB and want to limit the rows scanned for a quick review like this, then replacing `LIMIT 5` with `TABLESAMPLE SYSTEM (1 PERCENT)` would be more efficient.  For tables under 1GB it will still return the full table.  More on [Table Sampling](https://cloud.google.com/bigquery/docs/table-sampling)
> - The examples below all query the same table in BigQuery.  Query results are cached on the first run for approximately 24 hours.  This means a subsequent, identical query will not scan the data and will instead use the cached results table.  More information on [Using cached query results](https://cloud.google.com/bigquery/docs/cached-results).
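
The two notes above can be explored with a small sketch like the one below. The helper names (`limit_preview`, `sample_preview`, `run_uncached`) and the project ID are hypothetical. `run_uncached` disables the query cache through `QueryJobConfig(use_query_cache=False)` and reads the job's `total_bytes_processed` and `cache_hit` properties, so the scan cost of each preview style can be compared. The actual numbers depend on the table size and how it is stored, so none are shown here.

```python
def limit_preview(table: str, n: int = 5) -> str:
    """Preview that caps the rows returned; the table is still fully scanned."""
    return f"SELECT *\nFROM `{table}`\nLIMIT {n}"


def sample_preview(table: str, percent: int = 1) -> str:
    """Preview that samples data blocks instead of limiting returned rows."""
    return f"SELECT *\nFROM `{table}` TABLESAMPLE SYSTEM ({percent} PERCENT)"


def run_uncached(client, query: str):
    """Run a query with caching disabled; return (bytes processed, cache hit flag)."""
    # Imported here so the query builders above work without the client library.
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(use_query_cache=False)
    job = client.query(query, job_config=job_config)
    job.result()  # wait for completion so the job statistics are populated
    return job.total_bytes_processed, job.cache_hit


table = "my-project.fraud.fraud_prepped"  # placeholder project ID
print(limit_preview(table))
print(sample_preview(table))
```

With a client available, `run_uncached(bq, limit_preview(table))` and `run_uncached(bq, sample_preview(table))` could be compared side by side.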

### BigQuery Cell Magic

In [12]:
%%bigquery
SELECT COUNT(*) as n
FROM `statmike-mlops-349915.fraud.fraud_prepped` # this cannot be parameterized with magics



|   | n |
|---|---|
| 0 | 284807 |


### BigQuery Python Client

In [13]:
query = f"""
    SELECT splits, COUNT(*) as n 
    FROM `{BQ_SOURCE}`
    GROUP BY splits
"""
preview = bq.query(query = query).to_dataframe()
preview

|   | splits | n |
|---|---|---|
| 0 | TEST | 28502 |
| 1 | TRAIN | 228061 |
| 2 | VALIDATE | 28244 |
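
The f-string above substitutes the table name into the SQL text. For values (as opposed to identifiers such as table names), the Python client also supports named query parameters. The following is a minimal sketch under that assumption; the helper names `split_count_query` and `count_rows_in_split` are hypothetical, and the table name is still formatted into the string because query parameters cannot stand in for table names.

```python
def split_count_query(table: str) -> str:
    """SQL with a named @split parameter; the table name is formatted in directly."""
    return f"SELECT COUNT(*) AS n\nFROM `{table}`\nWHERE splits = @split"


def count_rows_in_split(client, table: str, split: str) -> int:
    """Count the rows of one split, passing the split name as a query parameter."""
    # Imported here so the query builder above works without the client library.
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("split", "STRING", split)]
    )
    rows = client.query(split_count_query(table), job_config=job_config).result()
    # The query returns exactly one row with a single column named n.
    return next(iter(rows))["n"]


print(split_count_query("my-project.fraud.fraud_prepped"))  # placeholder project ID
```

With the objects from the Setup section, a call would look like `count_rows_in_split(bq, BQ_SOURCE, 'TEST')`.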


### BigQuery Python Client: Helper Function

In [14]:
def bq_runner(query):
    # submit the query and return the QueryJob so its methods and properties stay available
    return bq.query(query = query)

In [15]:
bq_runner(
    query = f"""
        SELECT splits, Class, COUNT(*) as n 
        FROM `{BQ_SOURCE}`
        GROUP BY splits, Class
    """
).to_dataframe()

|   | splits | Class | n |
|---|---|---|---|
| 0 | TEST | 0 | 28455 |
| 1 | TEST | 1 | 47 |
| 2 | TRAIN | 0 | 227664 |
| 3 | TRAIN | 1 | 397 |
| 4 | VALIDATE | 0 | 28196 |
| 5 | VALIDATE | 1 | 48 |


### BigQuery Python Client: Using Query Job Properties and Methods

Query Jobs have Methods and Properties that can benefit the Python workflow:
- Query Job [Methods](https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.QueryJob#methods)
- Query Job [Properties](https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.QueryJob#properties)

BigQuery Query Job (using helper function):

In [18]:
job = bq_runner(
    query = f"""
    WITH
        COUNTS AS (
            SELECT splits, Class, COUNT(*) as n 
            FROM `{BQ_SOURCE}`
            GROUP BY splits, Class
        )
    SELECT *,
        SUM(n) OVER() as total,
        SAFE_DIVIDE(n, SUM(n) OVER(PARTITION BY Class)) as n_pct_class,
        SAFE_DIVIDE(n, SUM(n) OVER(PARTITION BY splits)) as n_pct_split,
        SAFE_DIVIDE(SUM(n) OVER(PARTITION BY Class), SUM(n) OVER()) as class_pct_to  
    FROM COUNTS
    """
)

Using Query Job Properties to get timing:

In [19]:
job.result()
(job.ended-job.started).total_seconds()

0.857
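
The timing calculation above uses two of the query job's properties. A small helper along the following lines could gather a few of them at once. This is only a sketch: `job_summary` is a hypothetical name, and the stand-in object with made-up values exists only so the function can be tried without a BigQuery connection. The properties it reads (`started`, `ended`, `total_bytes_processed`, `cache_hit`) are ones a finished `QueryJob` exposes.

```python
from datetime import datetime, timedelta, timezone
from types import SimpleNamespace


def job_summary(job) -> dict:
    """Collect timing and scan statistics from a finished query job."""
    return {
        "elapsed_seconds": (job.ended - job.started).total_seconds(),
        "bytes_processed": job.total_bytes_processed,
        "cache_hit": job.cache_hit,
    }


# Stand-in object with made-up values, used only to exercise the function offline.
# With a real job, call job.result() first and pass the QueryJob itself.
start = datetime(2022, 1, 1, tzinfo=timezone.utc)
fake_job = SimpleNamespace(
    started=start,
    ended=start + timedelta(milliseconds=857),
    total_bytes_processed=0,
    cache_hit=True,
)
print(job_summary(fake_job))
```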

Using Query Job Methods to retrieve result to Pandas dataframe:

In [20]:
job.to_dataframe()

|   | splits | Class | n | total | n_pct_class | n_pct_split | class_pct_to |
|---|---|---|---|---|---|---|---|
| 0 | TRAIN | 0 | 227664 | 284807 | 0.800746 | 0.998259 | 0.998273 |
| 1 | TRAIN | 1 | 397 | 284807 | 0.806911 | 0.001741 | 0.001727 |
| 2 | TEST | 0 | 28455 | 284807 | 0.100083 | 0.998351 | 0.998273 |
| 3 | TEST | 1 | 47 | 284807 | 0.095528 | 0.001649 | 0.001727 |
| 4 | VALIDATE | 0 | 28196 | 284807 | 0.099172 | 0.998301 | 0.998273 |
| 5 | VALIDATE | 1 | 48 | 284807 | 0.097561 | 0.001699 | 0.001727 |


### Indirect use with pandas-gbq

When working with [Pandas](https://pandas.pydata.org/docs/user_guide/index.html#user-guide), the methods above show the client returning data to pandas dataframes.  This section shows a pandas module, [pandas-gbq](https://pandas-gbq.readthedocs.io/en/latest/), that wraps the BigQuery client so that pandas can retrieve BigQuery data directly into dataframes.

References:
- [Comparison of BigQuery Client with pandas-gbq](https://cloud.google.com/bigquery/docs/pandas-gbq-migration)

#### Package Install (if needed)

In [21]:
try:
    import pandas_gbq
except ImportError:
    print('You need to pip install pandas-gbq')
    !pip install pandas-gbq -q
    import pandas_gbq  # import again now that the package is installed

#### Using pandas-gbq

In [22]:
query = f"""
SELECT * 
FROM `{BQ_SOURCE}`
LIMIT 5
"""
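# pd.read_gbq is deprecated as of pandas 2.2; pandas_gbq.read_gbq is the direct replacement
import pandas_gbq
df = pandas_gbq.read_gbq(query, project_id = BQ_PROJECT)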

In [23]:
df

|   | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class | transaction_id | splits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35337 | 1.092844 | -0.01323 | 1.359829 | 2.731537 | -0.707357 | 0.873837 | -0.79613 | 0.437707 | 0.39677 | ... | -0.167647 | 0.027557 | 0.592115 | 0.219695 | 0.03697 | 0.010984 | 0.0 | 0 | a1b10547-d270-48c0-b902-7a0f735dadc7 | TEST |
| 1 | 60481 | 1.238973 | 0.035226 | 0.063003 | 0.641406 | -0.260893 | -0.580097 | 0.049938 | -0.034733 | 0.405932 | ... | -0.057718 | 0.104983 | 0.537987 | 0.589563 | -0.046207 | -0.006212 | 0.0 | 0 | 814c62c8-ade4-47d5-bf83-313b0aafdee5 | TEST |
| 2 | 139587 | 1.870539 | 0.211079 | 0.224457 | 3.889486 | -0.380177 | 0.249799 | -0.577133 | 0.179189 | -0.120462 | ... | 0.180776 | -0.060226 | -0.228979 | 0.080827 | 0.009868 | -0.036997 | 0.0 | 0 | d08a1bfa-85c5-4f1b-9537-1c5a93e6afd0 | TEST |
| 3 | 162908 | -3.368339 | -1.980442 | 0.153645 | -0.159795 | 3.847169 | -3.516873 | -1.209398 | -0.292122 | 0.760543 | ... | -1.171627 | 0.214333 | -0.159652 | -0.060883 | 1.294977 | 0.120503 | 0.0 | 0 | 802f3307-8e5a-4475-b795-5d5d8d7d0120 | TEST |
| 4 | 165236 | 2.180149 | 0.218732 | -2.637726 | 0.348776 | 1.063546 | -1.249197 | 0.942021 | -0.547652 | -0.087823 | ... | -0.176957 | 0.563779 | 0.730183 | 0.707494 | -0.131066 | -0.090428 | 0.0 | 0 | c8a5b93a-1598-4689-80be-4f9f5df0b8ce | TEST |
