# BigQuery Data

This notebook creates a BigQuery dataset and loads the project's source data into a new table. It also walks through creating test/train splits of the data for ML processes.

**Prerequisites**
- `00 - Initial Setup`

**Overview**

<img src="architectures/statmike-mlops-01.png">

---
## Setup

Set the GCP Project and other parameters:

In [1]:
PROJECT_ID = 'statmike-mlops'
REGION = 'us-central1'

FILE = 'digits.csv'
DATASET_ID = 'digits'
TABLE_ID = 'digits_source'
BUCKET = PROJECT_ID
URI = f"gs://{BUCKET}/digits/data"
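Later cells refer to tables by fully qualified names such as `<PROJECT_ID>.digits.digits_source`. As a minimal sketch, those names and the source file URI could be derived from the parameters rather than typed by hand. The helper names `table_fqn` and `gcs_uri` are hypothetical and not part of the notebook.

```python
def table_fqn(project_id: str, dataset_id: str, table_id: str) -> str:
    """Return a fully qualified BigQuery table ID: project.dataset.table."""
    return f"{project_id}.{dataset_id}.{table_id}"


def gcs_uri(bucket: str, *parts: str) -> str:
    """Join a bucket name and path segments into a gs:// URI."""
    return "gs://" + "/".join([bucket, *parts])


# Values copied from the parameter cell
print(table_fqn("statmike-mlops", "digits", "digits_source"))
print(gcs_uri("statmike-mlops", "digits", "data", "digits.csv"))
```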

Make a client connection to BigQuery:

In [2]:
from google.cloud import bigquery
bq = bigquery.Client(project=PROJECT_ID)

---
## Create Dataset

List BigQuery datasets in the project:

In [3]:
datum = [ds.dataset_id for ds in bq.list_datasets()]
print(datum)

[]


Create the dataset if missing:

In [4]:
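if DATASET_ID not in datum:
    # Pass the configured Dataset object so the location is applied at creation
    dataset = bigquery.Dataset(bigquery.DatasetReference(PROJECT_ID, DATASET_ID))
    dataset.location = REGION
    dataset = bq.create_dataset(dataset)
else:
    # Reuse the existing dataset so `dataset` is defined for the next cell
    dataset = bq.get_dataset(DATASET_ID)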

In [5]:
print(dataset)

Dataset(DatasetReference('statmike-mlops', 'digits'))
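The list-then-create pattern can also be collapsed into a single call, because `Client.create_dataset` accepts `exists_ok=True`. The following is a minimal sketch under that assumption; the helper name `ensure_dataset` is hypothetical, and the import is placed inside the function only so the helper can be defined without the library loaded.

```python
def ensure_dataset(client, project_id, dataset_id, location):
    """Create the dataset in `location`, or return the existing one."""
    # Imported lazily; requires the google-cloud-bigquery package at call time
    from google.cloud import bigquery

    dataset = bigquery.Dataset(f"{project_id}.{dataset_id}")
    dataset.location = location  # only takes effect when the dataset is created
    # exists_ok=True suppresses the Conflict error raised for an existing dataset
    return client.create_dataset(dataset, exists_ok=True)
```

In this notebook it would be called as `dataset = ensure_dataset(bq, PROJECT_ID, DATASET_ID, REGION)`.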


---
## Create Table

Load data to a table in the dataset:
- define job inputs
- run load job
- review resulting table

In [6]:
FILE_URI = f"{URI}/{FILE}"

# Client.dataset() is deprecated, so build the table reference directly
table_ref = bigquery.DatasetReference(PROJECT_ID, DATASET_ID).table(TABLE_ID)

# Overwrite the table if it exists and let BigQuery infer the schema from the CSV
job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,
)

job = bq.load_table_from_uri(FILE_URI, table_ref, job_config=job_config)
print("Starting job {}".format(job.job_id))
job.result()  # block until the load job finishes

bq_table = bq.get_table(table_ref)
print("Loaded {} rows and {} columns to {}.".format(bq_table.num_rows, len(bq_table.schema), bq_table))

Starting job 299f372b-748d-4252-b705-53c25a4f7f8d
Loaded 1797 rows and 66 columns to Table(TableReference(DatasetReference('statmike-mlops', 'digits'), 'digits_source')).
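The load job relies on `autodetect` to infer the 66 columns. As a hedged alternative, the schema could be written out explicitly. The sketch below builds it as a list of dictionaries in the `name`/`type`/`mode` shape used by the BigQuery REST API. The column types are assumptions based on the values in the data (floats for `p0` to `p63`, an integer `target`, a string `target_OE`), and the helper name `digits_schema` is hypothetical.

```python
def digits_schema(n_pixels=64):
    """Describe the table columns as REST-style schema entries."""
    # Assumed types: pixel columns as FLOAT, label as INTEGER, odd/even flag as STRING
    fields = [
        {"name": f"p{i}", "type": "FLOAT", "mode": "NULLABLE"}
        for i in range(n_pixels)
    ]
    fields.append({"name": "target", "type": "INTEGER", "mode": "NULLABLE"})
    fields.append({"name": "target_OE", "type": "STRING", "mode": "NULLABLE"})
    return fields


schema = digits_schema()
print(len(schema))  # 66, the column count reported by the load job
```

To use it, each entry could be converted with `bigquery.SchemaField.from_api_repr` and passed as `schema` to `LoadJobConfig` with `autodetect` turned off. If the CSV has a header row, `skip_leading_rows=1` would also be needed.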


Use the `%%bigquery` cell magic to review a few records (results are downloaded through the BigQuery Storage API):

In [7]:
%%bigquery
SELECT * FROM `digits.digits_source` LIMIT 5

Query complete after 0.01s: 100%|██████████| 1/1 [00:00<00:00, 400.76query/s]                          
Downloading: 100%|██████████| 5/5 [00:01<00:00,  4.25rows/s]


Unnamed: 0,p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,...,p56,p57,p58,p59,p60,p61,p62,p63,target,target_OE
0,0.0,5.0,16.0,15.0,5.0,0.0,0.0,0.0,0.0,2.0,...,0.0,6.0,16.0,16.0,16.0,16.0,7.0,0.0,2,Even
1,0.0,5.0,16.0,12.0,1.0,0.0,0.0,0.0,0.0,5.0,...,0.0,8.0,16.0,16.0,16.0,16.0,4.0,0.0,2,Even
2,0.0,5.0,15.0,16.0,6.0,0.0,0.0,0.0,0.0,11.0,...,0.0,6.0,16.0,16.0,16.0,13.0,3.0,0.0,2,Even
3,0.0,4.0,15.0,15.0,8.0,0.0,0.0,0.0,0.0,8.0,...,0.0,7.0,14.0,11.0,0.0,0.0,0.0,0.0,2,Even
4,0.0,6.0,16.0,16.0,16.0,15.0,10.0,0.0,0.0,9.0,...,0.0,9.0,16.0,11.0,0.0,0.0,0.0,0.0,5,Odd
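The same preview can be produced outside a notebook cell magic with the client created earlier. This is a minimal sketch, assuming a `bigquery.Client`-like object is passed in; the function name `preview_table` is hypothetical.

```python
def preview_table(client, table, limit=5):
    """Run SELECT * ... LIMIT n and return the rows as a pandas DataFrame."""
    # Table names cannot be query parameters, so only pass trusted names here
    sql = f"SELECT * FROM `{table}` LIMIT {int(limit)}"
    # query() starts the job; to_dataframe() waits for it and downloads the rows
    return client.query(sql).to_dataframe()
```

For example, `preview_table(bq, 'digits.digits_source')` should return the same five-row frame as the magic cell.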


---
## Prepare Data for Analysis

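Create a prepped version of the data with test/train splits using SQL DDL. Hashing a fresh `GENERATE_UUID()` for each row and taking the result modulo 10 assigns roughly 80% of rows to `TRAIN` and 20% to `TEST`. Because the UUID is random, the assignment changes every time the table is rebuilt: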

In [8]:
%%bigquery
CREATE OR REPLACE TABLE `digits.digits_prepped` AS
SELECT *, 
    CASE WHEN MOD(ABS(FARM_FINGERPRINT(GENERATE_UUID())),10) < 8 THEN 'TRAIN' ELSE 'TEST' END AS SPLITS
FROM `digits.digits_source`

Query complete after 0.00s: 100%|██████████| 3/3 [00:00<00:00, 926.71query/s]                         
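The query hashes a random UUID, so the split is not reproducible. If a stable split is wanted, one common approach is to hash a stable row key instead. The sketch below mirrors the `MOD(ABS(hash), 10) < 8` logic in plain Python. It is an illustration only: SHA-256 stands in for `FARM_FINGERPRINT`, and the row keys are hypothetical because the source table has no documented ID column.

```python
import hashlib


def assign_split(key: str, train_buckets: int = 8, n_buckets: int = 10) -> str:
    """Map a stable row key to TRAIN or TEST by hashing it into n_buckets."""
    # SHA-256 is a stand-in for FARM_FINGERPRINT, which is not in the standard library
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % n_buckets
    return "TRAIN" if bucket < train_buckets else "TEST"


# Hypothetical row keys, one per row of the 1797-row table
keys = [f"row-{i}" for i in range(1797)]
splits = [assign_split(k) for k in keys]

# The TRAIN share is expected to land near 0.8 and is identical on every run
print(splits.count("TRAIN") / len(splits))
```

The same key always lands in the same bucket, so rebuilding the table would leave every row in its original split.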


Review the test/train split:

In [9]:
%%bigquery
SELECT splits, count(*) as Count
FROM `digits.digits_prepped`
GROUP BY splits

Query complete after 0.00s: 100%|██████████| 2/2 [00:00<00:00, 870.37query/s]                         
Downloading: 100%|██████████| 2/2 [00:01<00:00,  1.72rows/s]


Unnamed: 0,splits,Count
0,TEST,339
1,TRAIN,1458
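As a quick sanity check on these counts, the observed `TRAIN` share can be compared with the 80% target. The short calculation below treats each row as an independent 80% draw, which is an assumption about how the random split behaves. Under it, the observed share of about 81% sits roughly 1.2 standard deviations above the target, which is ordinary sampling noise.

```python
import math

train, test = 1458, 339   # counts from the query output
total = train + test      # 1797, the row count reported by the load job
train_share = train / total

# Standard deviation of the TRAIN share if each row is an independent 80% draw
sd = math.sqrt(0.8 * 0.2 / total)
z = (train_share - 0.8) / sd

print(round(train_share, 3), round(sd, 4), round(z, 1))
```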


Retrieve a subset of the data to a Pandas dataframe:

In [10]:
%%bigquery digits
SELECT * FROM `digits.digits_prepped` WHERE target = 2

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 431.73query/s]                          
Downloading: 100%|██████████| 177/177 [00:01<00:00, 151.95rows/s]


In [11]:
digits.head()

Unnamed: 0,p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,...,p57,p58,p59,p60,p61,p62,p63,target,target_OE,SPLITS
0,0.0,0.0,0.0,4.0,15.0,12.0,0.0,0.0,0.0,0.0,...,0.0,0.0,3.0,11.0,16.0,9.0,0.0,2,Even,TRAIN
1,0.0,0.0,0.0,1.0,9.0,11.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,9.0,13.0,3.0,0.0,2,Even,TRAIN
2,0.0,0.0,1.0,13.0,16.0,10.0,0.0,0.0,0.0,1.0,...,0.0,0.0,14.0,16.0,11.0,0.0,0.0,2,Even,TRAIN
3,0.0,0.0,0.0,0.0,9.0,13.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,9.0,15.0,9.0,0.0,2,Even,TRAIN
4,0.0,0.0,0.0,8.0,15.0,8.0,0.0,0.0,0.0,0.0,...,0.0,0.0,9.0,12.0,14.0,4.0,0.0,2,Even,TRAIN


---
## Remove Resources
- delete table `<PROJECT_ID>.digits.digits_prepped`
- delete table `<PROJECT_ID>.digits.digits_source`
- delete dataset `<PROJECT_ID>.digits`
    - This fails if you have run notebook `02`, because it creates a model in the dataset
    - Add `delete_contents=True` to force deletion of the dataset along with its contents

In [12]:
bq.delete_table(f"{PROJECT_ID}.{DATASET_ID}.digits_prepped", not_found_ok=True)
bq.delete_table(f"{PROJECT_ID}.{DATASET_ID}.{TABLE_ID}", not_found_ok=True)
bq.delete_dataset(f"{PROJECT_ID}.{DATASET_ID}", not_found_ok=True)
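If the dataset still holds other objects, the forced deletion mentioned in the list above could look like the following. This is a minimal sketch, assuming a `bigquery.Client`-like object is passed in; the function name `delete_dataset_and_contents` is hypothetical.

```python
def delete_dataset_and_contents(client, project_id, dataset_id):
    """Delete a dataset along with its contents."""
    client.delete_dataset(
        f"{project_id}.{dataset_id}",
        delete_contents=True,  # remove the objects stored in the dataset as well
        not_found_ok=True,     # do nothing if the dataset is already gone
    )
```

Calling `delete_dataset_and_contents(bq, PROJECT_ID, DATASET_ID)` makes the two `delete_table` calls unnecessary, since the tables are removed with the dataset.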