# BigQuery Data

This notebook creates a BigQuery dataset and loads the project's source data into a new table. It also walks through creating test/train splits of the data for ML processes.

**Prerequisites**
- `00 - Initial Setup`

**Overview**

<img src="architectures/statmike-mlops-01.png">

---
## Setup

Set the GCP Project and other parameters:

In [24]:
PROJECT_ID = "statmike-mlops"
REGION = 'us-central1'

FILE = 'digits.csv'
DATASET_ID = 'digits'
TABLE_ID = 'digits_source'
BUCKET = PROJECT_ID
URI = "gs://{}/digits/data".format(BUCKET)
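
The SQL cells later in this notebook spell out `statmike-mlops.digits.digits_source` by hand. As a minimal sketch, the same identifiers could be derived from the parameters above so that changing `PROJECT_ID` updates them everywhere. The helper names `table_fqn` and `source_uri` are hypothetical and not part of the original notebook:

```python
# Parameters copied from the setup cell above.
PROJECT_ID = "statmike-mlops"
DATASET_ID = "digits"
TABLE_ID = "digits_source"
FILE = "digits.csv"


def table_fqn(project, dataset, table):
    """Return a fully qualified table ID in project.dataset.table form (hypothetical helper)."""
    return f"{project}.{dataset}.{table}"


def source_uri(bucket, file_name):
    """Return the Cloud Storage URI of the source file (hypothetical helper)."""
    return f"gs://{bucket}/digits/data/{file_name}"


SOURCE_TABLE = table_fqn(PROJECT_ID, DATASET_ID, TABLE_ID)
SOURCE_FILE_URI = source_uri(PROJECT_ID, FILE)

print(SOURCE_TABLE)     # statmike-mlops.digits.digits_source
print(SOURCE_FILE_URI)  # gs://statmike-mlops/digits/data/digits.csv
```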

Make a client connection to BigQuery:

In [25]:
from google.cloud import bigquery
bq = bigquery.Client(project=PROJECT_ID)

---
## Create Dataset

List BigQuery datasets in the project:

In [26]:
datum = [ds.dataset_id for ds in bq.list_datasets()]
print(datum)

[]


Create the dataset if missing:

In [27]:
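if DATASET_ID not in datum:
    dataset = bigquery.Dataset(bigquery.DatasetReference(PROJECT_ID, DATASET_ID))
    dataset.location = REGION
    # pass the configured Dataset object so the location setting is applied
    dataset = bq.create_dataset(dataset)
else:
    # reuse the existing dataset so `dataset` is defined for the next cell
    dataset = bq.get_dataset(DATASET_ID)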

In [28]:
print(dataset)

Dataset(DatasetReference('statmike-mlops', 'digits'))


---
## Create Table

Load data to a table in the dataset:
- define job inputs
- run load job
- review resulting table

In [29]:
FILE_URI = '%s/%s' % (URI,FILE)

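# bq.dataset() is deprecated in recent client versions; build the reference directly
dataset_ref = bigquery.DatasetReference(PROJECT_ID, DATASET_ID)
table_ref = dataset_ref.table(TABLE_ID)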

job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE")
job_config.source_format = bigquery.SourceFormat.CSV
job_config.autodetect = True

job = bq.load_table_from_uri(FILE_URI, table_ref, job_config=job_config)
print("Starting job {}".format(job.job_id))
job.result()
      
bq_table = bq.get_table(table_ref)
print("Loaded {} rows and {} columns to {}.".format(bq_table.num_rows, len(bq_table.schema), bq_table))

Starting job ea5d6162-35f7-45d6-ad8a-b03f787d81e9
Loaded 1797 rows and 66 columns to Table(TableReference(DatasetReference('statmike-mlops', 'digits'), 'digits_source')).
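
The load job relies on `autodetect` to infer the schema. As a rough sketch of what an explicit schema might look like, the block below builds a list of field definitions in BigQuery's JSON schema form. The column types are assumptions read off the record preview in this notebook (`p0` to `p63` as floats, `target` as an integer, `target_OE` as a string), and `digits_schema` is a hypothetical helper:

```python
import json


def digits_schema(n_pixels=64):
    """Build field definitions for the digits table (types are assumptions)."""
    # One FLOAT column per pixel: p0 ... p63.
    fields = [
        {"name": f"p{i}", "type": "FLOAT", "mode": "NULLABLE"}
        for i in range(n_pixels)
    ]
    # Label columns seen in the table preview.
    fields.append({"name": "target", "type": "INTEGER", "mode": "NULLABLE"})
    fields.append({"name": "target_OE", "type": "STRING", "mode": "NULLABLE"})
    return fields


schema = digits_schema()
print(len(schema))  # 66
print(json.dumps(schema[-2:], indent=2))
```

The count of 66 fields agrees with the column count reported by the load job. If this were used in place of `autodetect`, each dictionary could likely be converted with `bigquery.SchemaField.from_api_repr` and assigned to `job_config.schema`, with `job_config.skip_leading_rows = 1` set on the assumption that the CSV has a header row.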


Use the BigQuery magic to review a few records (this uses the BigQuery Storage API):

In [30]:
%%bigquery
SELECT * FROM `statmike-mlops.digits.digits_source` LIMIT 5

Unnamed: 0,p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,...,p56,p57,p58,p59,p60,p61,p62,p63,target,target_OE
0,0.0,5.0,16.0,15.0,5.0,0.0,0.0,0.0,0.0,2.0,...,0.0,6.0,16.0,16.0,16.0,16.0,7.0,0.0,2,Even
1,0.0,5.0,16.0,12.0,1.0,0.0,0.0,0.0,0.0,5.0,...,0.0,8.0,16.0,16.0,16.0,16.0,4.0,0.0,2,Even
2,0.0,5.0,15.0,16.0,6.0,0.0,0.0,0.0,0.0,11.0,...,0.0,6.0,16.0,16.0,16.0,13.0,3.0,0.0,2,Even
3,0.0,4.0,15.0,15.0,8.0,0.0,0.0,0.0,0.0,8.0,...,0.0,7.0,14.0,11.0,0.0,0.0,0.0,0.0,2,Even
4,0.0,6.0,16.0,16.0,16.0,15.0,10.0,0.0,0.0,9.0,...,0.0,9.0,16.0,11.0,0.0,0.0,0.0,0.0,5,Odd


---
## Prepare Data for Analysis

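Create a prepared copy of the data with a `SPLITS` column using SQL DDL. Each row gets a random UUID whose hash, modulo 10, assigns roughly 80% of rows to `TRAIN` and 20% to `TEST`. Because the UUID is regenerated on every run, the split is not reproducible: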

In [31]:
%%bigquery
CREATE OR REPLACE TABLE `statmike-mlops.digits.digits_prepped` AS
SELECT *, 
    CASE WHEN MOD(ABS(FARM_FINGERPRINT(GENERATE_UUID())),10) < 8 THEN 'TRAIN' ELSE 'TEST' END AS SPLITS
FROM `statmike-mlops.digits.digits_source`
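
The bucketing logic in the query above (hash a value, take it modulo 10, send buckets 0 to 7 to `TRAIN`) can be illustrated outside BigQuery. The sketch below is pure Python and makes two assumptions that are not in the notebook: it uses SHA-256 as a stand-in for `FARM_FINGERPRINT`, so bucket assignments will not match BigQuery's, and it hashes a stable row key (here a plain integer) instead of a random UUID, which makes the assignment repeatable:

```python
import hashlib


def assign_split(key, train_buckets=8, n_buckets=10):
    """Assign TRAIN or TEST from a stable hash of the row key."""
    # SHA-256 is a stand-in for FARM_FINGERPRINT (assumption for illustration).
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    # Use the first 8 bytes as a non-negative integer, then bucket it.
    bucket = int.from_bytes(digest[:8], "big") % n_buckets
    return "TRAIN" if bucket < train_buckets else "TEST"


# Hypothetical row keys: one integer per row of the 1797-row table.
splits = [assign_split(key) for key in range(1797)]
train_share = splits.count("TRAIN") / len(splits)
print(f"TRAIN share: {train_share:.3f}")
```

Because the key is stable, calling `assign_split` again on the same key returns the same label, which is the property the `GENERATE_UUID()` version gives up.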

Review the test/train split:

In [32]:
%%bigquery
SELECT splits, count(*) as Count
FROM `statmike-mlops.digits.digits_prepped`
GROUP BY splits

Unnamed: 0,splits,Count
0,TRAIN,1463
1,TEST,334
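
As a quick sanity check on these counts, the sketch below compares the observed `TRAIN` share with the 80% implied by the `< 8` threshold. It treats each row as an independent draw with probability 0.8, which is an approximation, not something the notebook states:

```python
import math

# Counts copied from the query output above.
counts = {"TRAIN": 1463, "TEST": 334}
total = sum(counts.values())

expected_train = 0.8  # buckets 0-7 out of 10
train_share = counts["TRAIN"] / total

# Standard error of a proportion under the independent-draw approximation.
std_err = math.sqrt(expected_train * (1 - expected_train) / total)
z = (train_share - expected_train) / std_err

print(f"total={total}, TRAIN share={train_share:.3f}, z={z:.2f}")
```

The total matches the 1797 rows loaded earlier, and the `TRAIN` share sits about 1.5 standard errors above 0.8, which is within ordinary random variation.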


Retrieve a subset of the data (rows where `target = 2`) into a pandas DataFrame named `digits`:

In [33]:
%%bigquery digits
SELECT * FROM `statmike-mlops.digits.digits_prepped` WHERE target = 2

In [34]:
digits.head()

Unnamed: 0,p0,p1,p2,p3,p4,p5,p6,p7,p8,p9,...,p57,p58,p59,p60,p61,p62,p63,target,target_OE,SPLITS
0,0.0,0.0,0.0,0.0,11.0,15.0,4.0,0.0,0.0,0.0,...,0.0,0.0,1.0,11.0,16.0,12.0,0.0,2,Even,TRAIN
1,0.0,0.0,0.0,0.0,9.0,13.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,9.0,15.0,9.0,0.0,2,Even,TRAIN
2,0.0,0.0,0.0,3.0,15.0,10.0,1.0,0.0,0.0,0.0,...,0.0,0.0,4.0,9.0,14.0,7.0,0.0,2,Even,TRAIN
3,0.0,0.0,0.0,3.0,15.0,13.0,1.0,0.0,0.0,0.0,...,4.0,4.0,5.0,13.0,6.0,0.0,0.0,2,Even,TRAIN
4,0.0,0.0,0.0,4.0,15.0,12.0,0.0,0.0,0.0,0.0,...,0.0,0.0,3.0,11.0,16.0,9.0,0.0,2,Even,TRAIN


---
## Remove Resources
- delete table `statmike-mlops.digits.digits_prepped`
- delete table `statmike-mlops.digits.digits_source`
- delete dataset `statmike-mlops.digits`

In [23]:
bq.delete_table('statmike-mlops.digits.digits_prepped',not_found_ok=True)
bq.delete_table('statmike-mlops.digits.digits_source',not_found_ok=True)
bq.delete_dataset('statmike-mlops.digits',not_found_ok=True)
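
The cleanup calls above could also be wrapped in a small function. The sketch below is one possible shape: `remove_resources` is a hypothetical helper, and `RecordingClient` is a hypothetical stand-in that only records calls so the example runs without touching BigQuery. It passes `delete_contents=True` to `delete_dataset`, which is meant to remove any tables still left in the dataset:

```python
def remove_resources(client, dataset_id, table_ids=()):
    """Delete the listed tables, then the dataset and anything left in it."""
    for table_id in table_ids:
        client.delete_table(f"{dataset_id}.{table_id}", not_found_ok=True)
    client.delete_dataset(dataset_id, delete_contents=True, not_found_ok=True)


class RecordingClient:
    """Hypothetical stand-in for bigquery.Client that only records calls."""

    def __init__(self):
        self.calls = []

    def delete_table(self, table, not_found_ok=False):
        self.calls.append(("delete_table", table))

    def delete_dataset(self, dataset, delete_contents=False, not_found_ok=False):
        self.calls.append(("delete_dataset", dataset, delete_contents))


# Dry run: inspect which calls would be made, in order.
dry_run = RecordingClient()
remove_resources(dry_run, "statmike-mlops.digits", ["digits_prepped", "digits_source"])
for call in dry_run.calls:
    print(call)
```

In the notebook, the real client would be passed instead: `remove_resources(bq, 'statmike-mlops.digits', ['digits_prepped', 'digits_source'])`.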