# 01 - BigQuery - Table Data Source
Use BigQuery to load and prepare data for machine learning:

**Prerequisites:**
-  00 - Environment Setup

**Overview:**
-  Setup BigQuery
   -  Create a Dataset
      -  Use BigQuery Python Client
   -  Create Tables
      -  Copy from another Project:Dataset
         -  SQL with BigQuery Jupyter Magic (%%bigquery)
      -  Load data from GCS Bucket
         -  BigQuery Python Client (load_table_from_uri)
   -  Prepare Data For Analysis
      -  Run SQL Queries to prepare Unique IDs and Train/Test Splits

**Resources:**
-  [Python Client For Google BigQuery](https://googleapis.dev/python/bigquery/latest/index.html)
-  [Download BigQuery Data to Pandas](https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas)
-  [BigQuery Template Notebooks](https://github.com/GoogleCloudPlatform/bigquery-notebooks/tree/main/notebooks/official/template_notebooks)

**Related Training:**
-  todo


---
## Vertex AI - Conceptual Flow

<img src="architectures/slides/slide_05.png">

---
## Vertex AI - Workflow

<img src="architectures/slides/slide_06.png">

---
## Notes

This notebook uses [BigQuery Jupyter Magics](https://googleapis.dev/python/bigquery/latest/magics.html).  The magics accept query parameters through `--params`, but a parameter can only stand in for a value, not for a dataset or table name.  For this reason, the qualified names (for example `statmike-mlops.fraud` and `fraud.fraud`) are written out by hand in each query and must be replaced if you change the variables `PROJECT_ID` and/or `DATANAME`.
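One way to avoid hand-editing names is to assemble the SQL text in Python, where the table identifier can come from variables. The following is a minimal sketch, not part of the notebook: `qualified_table` and `build_preview_sql` are hypothetical helper names, and the values repeat the notebook's `PROJECT_ID` and `DATANAME`.

```python
# Hypothetical helpers (assumptions, not from the notebook): build a
# backtick-quoted table identifier and splice it into a SQL string.
PROJECT_ID = "statmike-mlops"
DATANAME = "fraud"


def qualified_table(project_id: str, dataset: str, table: str) -> str:
    """Return a backtick-quoted `project.dataset.table` identifier."""
    for part in (project_id, dataset, table):
        # Reject empty parts and backticks so a name cannot escape the quoting.
        if not part or "`" in part:
            raise ValueError(f"invalid identifier part: {part!r}")
    return f"`{project_id}.{dataset}.{table}`"


def build_preview_sql(table_id: str, limit: int = 5) -> str:
    """Build a SELECT that previews the first `limit` rows of a table."""
    return f"SELECT * FROM {table_id} LIMIT {int(limit)}"


sql = build_preview_sql(qualified_table(PROJECT_ID, DATANAME, DATANAME))
print(sql)
```

A string built this way could be passed to the `query` method of a `bigquery.Client`, at the cost of losing the convenience of the `%%bigquery` cell magic.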

---
## Setup

inputs:

In [31]:
PROJECT_ID = "statmike-mlops"
REGION = 'us-central1'
DATANAME = 'fraud'
NOTEBOOK = '01'

# Model Training
VAR_TARGET = 'Class'
VAR_OMIT = '' # add more variables to the string with space delimiters

packages:

In [32]:
from google.cloud import bigquery

clients:

In [33]:
bq = bigquery.Client(project = PROJECT_ID)

parameters:

In [34]:
BUCKET = PROJECT_ID

---
## Create Dataset

List BigQuery datasets in the project:

In [35]:
%%bigquery
SELECT schema_name FROM `statmike-mlops.INFORMATION_SCHEMA.SCHEMATA`

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 532.68query/s]                          
Downloading: 0rows [00:00, ?rows/s]


Unnamed: 0,schema_name


Create the dataset if missing:

In [36]:
%%bigquery
CREATE SCHEMA IF NOT EXISTS `statmike-mlops.fraud`
OPTIONS(
    location='us-central1',
    labels = [('notebook','01')]
)

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 1030.04query/s]
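The same two steps (list datasets, create one if missing) can also be sketched with the BigQuery Python client instead of SQL. This is an illustration under assumptions: `dataset_exists` and `ensure_dataset` are hypothetical helper names, and the stand-in object at the end only imitates a client so the snippet runs offline.

```python
from types import SimpleNamespace


def dataset_exists(bq, dataset_id: str) -> bool:
    """Return True if `dataset_id` is listed in the client's project."""
    return any(d.dataset_id == dataset_id for d in bq.list_datasets())


def ensure_dataset(bq, project_id: str, dataset_id: str, location: str, labels: dict):
    """Create the dataset if it is missing and return it."""
    # Imported lazily so dataset_exists() works without the library installed.
    from google.cloud import bigquery

    dataset = bigquery.Dataset(f"{project_id}.{dataset_id}")
    dataset.location = location
    dataset.labels = labels
    # exists_ok=True turns the call into a no-op when the dataset already exists.
    return bq.create_dataset(dataset, exists_ok=True)


# Offline check with a stand-in object (hypothetical); a real run would pass
# a bigquery.Client instead.
stand_in = SimpleNamespace(
    list_datasets=lambda: [SimpleNamespace(dataset_id="fraud")]
)
found = dataset_exists(stand_in, "fraud")
print(found)
```

With the notebook's variables, a real call might look like `ensure_dataset(bq, PROJECT_ID, DATANAME, REGION, {'notebook': NOTEBOOK})`.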


---
## Create Table
- Import data from a Cloud Storage bucket
- Reference: [Loading CSV data from Cloud Storage](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv)

In [38]:
destination = bigquery.TableReference.from_string(f"{PROJECT_ID}.{DATANAME}.{DATANAME}")
job_config = bigquery.LoadJobConfig(
    write_disposition = 'WRITE_TRUNCATE',
    source_format = bigquery.SourceFormat.CSV,
    autodetect = True,
    labels = {'notebook':'01'}
)
job = bq.load_table_from_uri(f"gs://{BUCKET}/{DATANAME}/data/{DATANAME}.csv", destination, job_config = job_config)

In [39]:
job.result()

<google.cloud.bigquery.job.load.LoadJob at 0x7f7f94677290>

In [40]:
%%bigquery
SELECT * FROM `fraud.fraud` LIMIT 5

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 444.36query/s]                          
Downloading: 100%|██████████| 5/5 [00:00<00:00,  5.73rows/s]


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,282,-0.356466,0.725418,1.971749,0.831343,0.369681,-0.107776,0.75161,-0.120166,-0.420675,...,0.020804,0.424312,-0.015989,0.466754,-0.809962,0.657334,-0.04315,-0.046401,0.0,0
1,380,-1.299837,0.881817,1.452842,-1.293698,-0.025105,-1.170103,0.86161,-0.193934,0.592001,...,-0.272563,-0.360853,0.223911,0.59893,-0.397705,0.637141,0.234872,0.021379,0.0,0
2,403,1.237413,0.512365,0.687746,1.693872,-0.236323,-0.650232,0.118066,-0.230545,-0.808523,...,-0.077543,-0.17822,0.038722,0.471218,0.289249,0.871803,-0.066884,0.012986,0.0,0
3,430,-1.860258,-0.629859,0.96657,0.844632,0.759983,-1.481173,-0.509681,0.540722,-0.733623,...,0.268028,0.125515,-0.225029,0.586664,-0.031598,0.570168,-0.043007,-0.223739,0.0,0
4,711,-0.431349,1.027694,2.670816,2.084787,-0.274567,0.286856,0.15211,0.200872,-0.596505,...,0.001241,0.15417,-0.141533,0.38461,-0.147132,-0.0871,0.101117,0.077944,0.0,0
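Beyond previewing rows, the table's metadata gives a quick check that the load with `autodetect = True` produced the expected row count and columns. A minimal sketch, assuming a hypothetical helper named `summarize_table`; the stand-in object imitates a client so the snippet runs without credentials.

```python
from types import SimpleNamespace


def summarize_table(bq, table_id: str) -> dict:
    """Return the row count and column names of a table."""
    table = bq.get_table(table_id)  # fetches table metadata, not the rows
    return {
        "table": table_id,
        "rows": table.num_rows,
        "columns": [field.name for field in table.schema],
    }


# Offline check with a stand-in object (hypothetical values); a real run would
# pass a bigquery.Client and the destination table ID.
stand_in = SimpleNamespace(
    get_table=lambda table_id: SimpleNamespace(
        num_rows=3,
        schema=[SimpleNamespace(name="Time"), SimpleNamespace(name="Class")],
    )
)
summary = summarize_table(stand_in, "statmike-mlops.fraud.fraud")
print(summary)
```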


---
## Prepare Data for Analysis

Create a prepped version of the data with train/validate/test splits using SQL DDL:

In [41]:
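%%bigquery
CREATE OR REPLACE TABLE `fraud.fraud_prepped` AS
SELECT *,
    -- Each GENERATE_UUID() call is a separate random draw, so the second WHEN
    -- applies a fresh draw to the roughly 20% of rows not labeled TRAIN:
    -- about 80% TRAIN, 10% VALIDATE (half of the remainder), 10% TEST.
    CASE 
        WHEN MOD(ABS(FARM_FINGERPRINT(GENERATE_UUID())),10) < 8 THEN "TRAIN" 
        WHEN MOD(ABS(FARM_FINGERPRINT(GENERATE_UUID())),10) < 5 THEN "VALIDATE"
        ELSE "TEST"
    END AS splits
FROM `fraud.fraud`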

Query complete after 0.00s: 100%|██████████| 3/3 [00:00<00:00, 1731.75query/s]                        
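The thresholds `< 8` and `< 5` in the `CASE` expression can look inconsistent at first. Because each `WHEN` evaluates its own `GENERATE_UUID()`, the second test is an independent draw applied only to rows that failed the first. The sketch below works through the expected proportions and simulates the logic in plain Python, assuming the hashed buckets 0 to 9 are uniformly distributed; `assign_split` is a hypothetical name.

```python
import random
from fractions import Fraction

# Exact probabilities, assuming each draw lands uniformly in buckets 0..9.
p_train = Fraction(8, 10)                     # first draw < 8
p_validate = (1 - p_train) * Fraction(5, 10)  # first draw >= 8, second draw < 5
p_test = 1 - p_train - p_validate             # everything else


def assign_split(rng: random.Random) -> str:
    """Mirror the CASE expression: one fresh draw per WHEN branch."""
    if rng.randrange(10) < 8:
        return "TRAIN"
    if rng.randrange(10) < 5:
        return "VALIDATE"
    return "TEST"


# Simulate many rows and compare observed shares with the exact values.
rng = random.Random(0)
n = 100_000
counts = {"TRAIN": 0, "VALIDATE": 0, "TEST": 0}
for _ in range(n):
    counts[assign_split(rng)] += 1
shares = {split: count / n for split, count in counts.items()}

print(p_train, p_validate, p_test)
print(shares)
```

Since the assignment is random, rerunning the `CREATE OR REPLACE TABLE` statement places rows in different splits each time.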


Review the size of each split:

In [48]:
%%bigquery
SELECT splits, count(*) as Count, 100*count(*) / (sum(count(*)) OVER()) as Percentage
FROM `fraud.fraud_prepped`
GROUP BY splits

Query complete after 0.00s: 100%|██████████| 5/5 [00:00<00:00, 2119.41query/s]                        
Downloading: 100%|██████████| 3/3 [00:00<00:00,  3.81rows/s]


Unnamed: 0,splits,Count,Percentage
0,TEST,28304,9.937958
1,VALIDATE,28388,9.967452
2,TRAIN,228115,80.09459
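In the query, `sum(count(*)) OVER()` adds the per-group counts across all groups, so each percentage is a group's share of the full table. The arithmetic can be reproduced from the counts in the result above, as in this short sketch:

```python
# Counts copied from the query result above.
counts = {"TEST": 28304, "VALIDATE": 28388, "TRAIN": 228115}

# Denominator: the total over all groups, as sum(count(*)) OVER () computes.
total = sum(counts.values())

# Numerator: 100 * count(*) for each group.
percentages = {split: 100 * n / total for split, n in counts.items()}

for split, pct in percentages.items():
    print(f"{split:<8} {counts[split]:>7} {pct:9.4f}")
```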


Retrieve a subset of the data to a Pandas dataframe:

In [44]:
%%bigquery fraud
SELECT * FROM `fraud.fraud_prepped` WHERE Class = 1

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 740.91query/s] 
Downloading: 100%|██████████| 492/492 [00:00<00:00, 753.25rows/s] 


In [45]:
fraud.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V22,V23,V24,V25,V26,V27,V28,Amount,Class,splits
0,32686,0.287953,1.728735,-1.652173,3.813544,-1.090927,-0.984745,-2.202318,0.555088,-2.033892,...,-0.633528,0.092891,0.187613,0.368708,-0.132474,0.576561,0.309843,0.0,1,TEST
1,129371,1.183931,3.05725,-6.161997,5.543972,1.617041,-1.848006,-1.005508,0.339937,-2.959806,...,-0.931072,-0.064175,-0.007013,0.345419,0.064558,0.476629,0.32374,0.0,1,TEST
2,84204,-0.937843,3.462889,-6.445104,4.932199,-2.233983,-2.291561,-5.695594,1.338825,-4.322377,...,-0.521657,-0.319917,-0.405859,0.906802,1.165784,1.374495,0.729889,0.0,1,TEST
3,85181,-3.003459,2.09615,-0.48703,3.069453,-1.774329,0.251804,-4.328776,-2.425478,-0.985222,...,1.245648,-0.269241,0.537102,-0.220757,-0.059555,0.46071,-0.033551,2.0,1,TEST
4,149640,0.754316,2.379822,-5.137274,3.818392,0.043203,-1.285451,-1.766684,0.756711,-1.765722,...,0.141165,0.171985,0.394274,-0.444642,-0.263189,0.304703,-0.044362,2.0,1,TEST
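Outside a notebook, where the `%%bigquery fraud` cell magic is not available, a similar result could be obtained through the Python client. A minimal sketch, assuming a hypothetical helper named `query_to_dataframe`; with a real client, `to_dataframe()` requires pandas to be installed. The stand-in object only imitates the two calls so the snippet runs offline.

```python
from types import SimpleNamespace


def query_to_dataframe(bq, sql: str, job_config=None):
    """Run `sql` and return the result as a pandas DataFrame."""
    # With a real client, query() returns a QueryJob and to_dataframe()
    # waits for the job to finish before downloading the rows.
    return bq.query(sql, job_config=job_config).to_dataframe()


# Offline check with a stand-in object (hypothetical); it returns a list of
# dicts in place of a DataFrame.
stand_in = SimpleNamespace(
    query=lambda sql, job_config=None: SimpleNamespace(
        to_dataframe=lambda: [{"Class": 1, "splits": "TEST"}]
    )
)
rows = query_to_dataframe(
    stand_in, "SELECT * FROM `fraud.fraud_prepped` WHERE Class = 1"
)
print(rows)
```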


---
## Remove Resources
See notebook "XX - Cleanup".