# 01 - BigQuery - Table Data Source
Use BigQuery to load and prepare data for machine learning:

**Prerequisites:**
-  00 - Environment Setup

**Overview:**
-  Setup BigQuery
   -  Create a Dataset
      -  Use BigQuery Python Client
   -  Create Tables
      -  Copy from another Project:Dataset
         -  SQL with BigQuery Jupyter Magic (%%bigquery)
      -  Load data from GCS Bucket
         -  BigQuery Python Client (load_table_from_uri)
   -  Prepare Data For Analysis
      -  Run SQL Queries to prepare Unique IDs and Train/Test Splits

**Resources:**
-  [Python Client For Google BigQuery](https://googleapis.dev/python/bigquery/latest/index.html)
-  [Download BigQuery Data to Pandas](https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas)
-  [BigQuery Template Notebooks](https://github.com/GoogleCloudPlatform/bigquery-notebooks/tree/main/notebooks/official/template_notebooks)

**Related Training:**
-  todo


---
## Conceptual Architecture

<img src="architectures/statmike-mlops-01.png">

---
## Notes

This notebook uses [BigQuery Jupyter Magics](https://googleapis.dev/python/bigquery/latest/magics.html). The magics accept query parameters as inputs with `--params`, but a dataset or table reference cannot be parameterized. For this reason, the table references (for example `digits.digits`) are written out literally in the SQL cells and must be replaced by hand if you change the variables `PROJECT_ID` and/or `DATANAME`.
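
Outside the magics, one possible workaround is to assemble the table reference in Python from the notebook variables while still passing values as query parameters. The sketch below is only an illustration: `qualified_table`, `build_query`, and `run_query` are hypothetical helper names, and the `target` filter is borrowed from a query later in this notebook. Only interpolate names that come from trusted variables, since the table reference is inserted as plain text.

```python
PROJECT_ID = "statmike-mlops"
DATANAME = "digits"


def qualified_table(project_id: str, dataset_id: str, table_id: str) -> str:
    """Return a backtick-quoted project.dataset.table reference."""
    return f"`{project_id}.{dataset_id}.{table_id}`"


def build_query(project_id: str, dataset_id: str, table_id: str) -> str:
    # The table name is interpolated as text; @target stays a query parameter.
    table = qualified_table(project_id, dataset_id, table_id)
    return f"SELECT * FROM {table} WHERE target = @target"


def run_query(target: int):
    """Run the query with the BigQuery Python client (needs credentials)."""
    from google.cloud import bigquery  # imported lazily; only needed to run

    client = bigquery.Client(project=PROJECT_ID)
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("target", "INT64", target)]
    )
    sql = build_query(PROJECT_ID, DATANAME, DATANAME)
    return client.query(sql, job_config=job_config).to_dataframe()


print(build_query(PROJECT_ID, DATANAME, DATANAME))
```

Calling `run_query(2)` would then return the matching rows as a DataFrame, assuming the project, dataset, and credentials are in place.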

---
## Setup

inputs:

In [2]:
PROJECT_ID = "statmike-mlops"
REGION = 'us-central1'
DATANAME = 'digits'

derived inputs:

In [11]:
BUCKET = PROJECT_ID
URI = f"gs://{BUCKET}/{DATANAME}/data"

DATASET_ID = DATANAME
TABLE_ID = DATANAME

packages:

In [4]:
from google.cloud import bigquery

clients:

In [None]:
bq = bigquery.Client(project=PROJECT_ID)

---
## Create Dataset

List BigQuery datasets in the project:

In [8]:
datum = [ds.dataset_id for ds in bq.list_datasets()]
print(datum)

[]


Create the dataset if missing:

In [9]:
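if DATASET_ID not in datum:
    # Pass the Dataset object (not just the ID) so the location is applied
    dataset = bigquery.Dataset(bigquery.DatasetReference(PROJECT_ID, DATASET_ID))
    dataset.location = REGION
    dataset = bq.create_dataset(dataset)
else:
    # Fetch the existing dataset so `dataset` is defined either way
    dataset = bq.get_dataset(DATASET_ID)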

In [10]:
print(dataset)

Dataset(DatasetReference('statmike-mlops', 'digits'))


---
## Create Table

Load data to a table in the dataset:
- define job inputs
- run load job
- review resulting table

In [12]:
# Client.dataset() is deprecated; build the references directly
dataset_ref = bigquery.DatasetReference(PROJECT_ID, DATASET_ID)
table_ref = dataset_ref.table(TABLE_ID)

# Overwrite the table on each run and let BigQuery infer the schema from the CSV
job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,
)

job = bq.load_table_from_uri(f"{URI}/{DATANAME}.csv", table_ref, job_config=job_config)
print(f"Starting job {job.job_id}")
job.result()  # block until the load job finishes

bq_table = bq.get_table(table_ref)
print(f"Loaded {bq_table.num_rows} rows and {len(bq_table.schema)} columns to {bq_table}.")

Starting job e2a69d0a-bb03-4a88-b8f3-42d2c58a367c
Loaded 1797 rows and 66 columns to Table(TableReference(DatasetReference('statmike-mlops', 'digits'), 'digits')).
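
The load job above relies on schema autodetection. If you would rather pin the schema, a minimal sketch follows. It assumes the 66 columns are `p0` through `p63` plus `target` and `target_OE`, as the table preview in this notebook suggests, and it guesses the types (`FLOAT`, `INTEGER`, `STRING`) from the previewed values. `digits_schema_spec` and `to_schema_fields` are hypothetical helper names.

```python
def digits_schema_spec():
    """Return (name, type) pairs for the columns assumed to be in the CSV."""
    pixels = [(f"p{i}", "FLOAT") for i in range(64)]  # p0 .. p63 (assumed FLOAT)
    labels = [("target", "INTEGER"), ("target_OE", "STRING")]  # assumed types
    return pixels + labels


def to_schema_fields(spec):
    """Convert the pairs to SchemaField objects (requires google-cloud-bigquery)."""
    from google.cloud import bigquery  # imported lazily; only needed to run

    return [bigquery.SchemaField(name, kind) for name, kind in spec]


spec = digits_schema_spec()
print(len(spec), spec[0], spec[-1])
```

The result of `to_schema_fields(spec)` could be passed as `schema=` to `bigquery.LoadJobConfig` in place of `autodetect=True`. With an explicit schema, a CSV header row would also need `skip_leading_rows=1`.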


Use the BigQuery Jupyter Magic to review a few records:

In [14]:
%%bigquery
SELECT * FROM `digits.digits` LIMIT 5

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 511.06query/s]                          
Downloading: 100%|██████████| 5/5 [00:00<00:00,  6.23rows/s]


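|   | p0 | p1 | p2 | p3 | p4 | p5 | p6 | p7 | p8 | p9 | ... | p56 | p57 | p58 | p59 | p60 | p61 | p62 | p63 | target | target_OE |
|---|----|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|--------|-----------|
| 0 | 0.0 | 5.0 | 16.0 | 15.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | ... | 0.0 | 6.0 | 16.0 | 16.0 | 16.0 | 16.0 | 7.0 | 0.0 | 2 | Even |
| 1 | 0.0 | 5.0 | 16.0 | 12.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | ... | 0.0 | 8.0 | 16.0 | 16.0 | 16.0 | 16.0 | 4.0 | 0.0 | 2 | Even |
| 2 | 0.0 | 5.0 | 15.0 | 16.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | ... | 0.0 | 6.0 | 16.0 | 16.0 | 16.0 | 13.0 | 3.0 | 0.0 | 2 | Even |
| 3 | 0.0 | 4.0 | 15.0 | 15.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | ... | 0.0 | 7.0 | 14.0 | 11.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | Even |
| 4 | 0.0 | 6.0 | 16.0 | 16.0 | 16.0 | 15.0 | 10.0 | 0.0 | 0.0 | 9.0 | ... | 0.0 | 9.0 | 16.0 | 11.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5 | Odd |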


---
## Prepare Data for Analysis

Create a prepped version of the data with a `splits` column (TRAIN, VALIDATE, or TEST) using SQL DDL:

In [31]:
%%bigquery
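CREATE OR REPLACE TABLE `digits.digits_prepped` AS
SELECT *,
    CASE
        -- First random draw (0-9): about 80% of rows go to TRAIN
        WHEN MOD(ABS(FARM_FINGERPRINT(GENERATE_UUID())),10) < 8 THEN "TRAIN"
        -- GENERATE_UUID() is called again, so this is a fresh draw:
        -- about half of the remaining rows (10% overall) go to VALIDATE
        WHEN MOD(ABS(FARM_FINGERPRINT(GENERATE_UUID())),10) < 5 THEN "VALIDATE"
        -- Everything else (about 10% overall) goes to TEST
        ELSE "TEST"
    END AS splits
FROM `digits.digits`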

Query complete after 0.00s: 100%|██████████| 3/3 [00:00<00:00, 2130.89query/s]                        
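
One way to sanity-check the proportions this `CASE` expression should produce is to simulate it locally. The sketch below is an approximation under stated assumptions: `random.randrange(10)` stands in for `MOD(ABS(FARM_FINGERPRINT(GENERATE_UUID())),10)`, and each `WHEN` branch is assumed to get its own independent draw. `assign_split` and `simulate` are hypothetical helper names.

```python
import random
from collections import Counter


def assign_split(rng: random.Random) -> str:
    """Mimic the CASE expression: each WHEN uses an independent draw in 0..9."""
    if rng.randrange(10) < 8:  # first draw: about 80% TRAIN
        return "TRAIN"
    if rng.randrange(10) < 5:  # fresh draw: about half of the remainder
        return "VALIDATE"
    return "TEST"


def simulate(n_rows: int, seed: int = 0) -> Counter:
    """Assign a split to n_rows simulated rows and count each label."""
    rng = random.Random(seed)
    return Counter(assign_split(rng) for _ in range(n_rows))


N = 100_000
counts = simulate(N)
for name in ("TRAIN", "VALIDATE", "TEST"):
    print(name, round(counts[name] / N, 3))
```

Under these assumptions the shares should land near 0.8, 0.1, and 0.1. Because the split is random, rerunning the SQL cell assigns rows differently each time.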


Review the row counts for each split:

In [32]:
%%bigquery
SELECT splits, count(*) as Count
FROM `digits.digits_prepped`
GROUP BY splits

Query complete after 0.00s: 100%|██████████| 2/2 [00:00<00:00, 1164.44query/s]                        
Downloading: 100%|██████████| 3/3 [00:01<00:00,  2.67rows/s]


|   | splits | Count |
|---|--------|-------|
| 0 | TRAIN | 1438 |
| 1 | TEST | 163 |
| 2 | VALIDATE | 196 |
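
To compare these counts with the intended proportions, the shares can be worked out directly. This is a small illustrative calculation that uses only the numbers from the output above:

```python
# Row counts copied from the query output above
counts = {"TRAIN": 1438, "TEST": 163, "VALIDATE": 196}

total = sum(counts.values())
# Percentage of all rows in each split, rounded to one decimal place
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

print(total, shares)
```

The total matches the 1797 rows reported by the load job, and the shares are close to an 80/10/10 split.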


Retrieve a subset of the data into a pandas DataFrame named `digits`:

In [22]:
%%bigquery digits
SELECT * FROM `digits.digits_prepped` WHERE target = 2

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 455.65query/s]                          
Downloading: 100%|██████████| 177/177 [00:01<00:00, 173.43rows/s]


In [19]:
digits.head()

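|   | p0 | p1 | p2 | p3 | p4 | p5 | p6 | p7 | p8 | p9 | ... | p57 | p58 | p59 | p60 | p61 | p62 | p63 | target | target_OE | SPLITS |
|---|----|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|-----|-----|--------|-----------|--------|
| 0 | 0.0 | 0.0 | 3.0 | 12.0 | 16.0 | 16.0 | 3.0 | 0.0 | 0.0 | 2.0 | ... | 0.0 | 0.0 | 15.0 | 16.0 | 8.0 | 0.0 | 0.0 | 2 | Even | TEST |
| 1 | 0.0 | 0.0 | 0.0 | 4.0 | 15.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 3.0 | 11.0 | 16.0 | 9.0 | 0.0 | 2 | Even | TRAIN |
| 2 | 0.0 | 0.0 | 0.0 | 5.0 | 14.0 | 12.0 | 2.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 6.0 | 12.0 | 13.0 | 3.0 | 0.0 | 2 | Even | TEST |
| 3 | 0.0 | 0.0 | 1.0 | 13.0 | 16.0 | 10.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 14.0 | 16.0 | 11.0 | 0.0 | 0.0 | 2 | Even | TEST |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 14.0 | 14.0 | 3.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 13.0 | 16.0 | 5.0 | 0.0 | 2 | Even | TRAIN |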


---
## Remove Resources
See notebook "XX - Cleanup".