# 01 - BigQuery - Table Data Source
Use BigQuery to load and prepare data for machine learning:

**Prerequisites:**
-  00 - Environment Setup

**Overview:**
-  Setup BigQuery
   -  Create a Dataset
      -  Use BigQuery Python Client
   -  Create Tables
      -  Copy from another Project:Dataset
         -  SQL with BigQuery Jupyter Magic (%%bigquery)
      -  Load data from GCS Bucket
         -  BigQuery Python Client (load_table_from_uri)
   -  Prepare Data For Analysis
      -  Run SQL Queries to prepare Unique ID's and Train/Test Splits

**Resources:**
-  [Python Client For Google BigQuery](https://googleapis.dev/python/bigquery/latest/index.html)
-  [Download BigQuery Data to Pandas](https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas)
-  [BigQuery Template Notebooks](https://github.com/GoogleCloudPlatform/bigquery-notebooks/tree/main/notebooks/official/template_notebooks)

**Related Training:**
-  todo


---
## Vertex AI - Conceptual Flow

<img src="architectures/slides/01_arch.png">

---
## Vertex AI - Workflow

<img src="architectures/slides/01_console.png">

---
## Source Data

**Overview**

This notebook imports source data for this project into Google BigQuery.  All the remaining notebooks utilize BigQuery as the source and leverage API's native to the machine learning approaches they feature.

In the enviornment setup notebook (00), a BigQuery source table was exported to CSV format in a Cloud Storage Bucket. This notebook, `01 - BigQuery - Table Data Source`, start the machine learning lifecycle by importing a source and preparing it for machine learning.  To customize this series of notebooks change the source referenced here or in notebook `00 - Environment Setup`.

All of these workflows utilize tabular data to fit a supervised learning model: predict a target variable by learning patterns in feature columns.  The type of supervised learning used in these projects is classification: models with a target variable that has multiple discrete classes.  

**The Data**

The source data is exported to Google Cloud Storage in CSV format by the `00 - Environment Setup` notebook.  The BigQuery source table is `bigquery-public-data.ml_datasets.ulb_fraud_detection`.  This is a table of credit card transactions that are classified as fradulant, `Class = 1`, or normal `Class = 0`.  

The data can be researched further at this [Kaggle link](https://www.kaggle.com/mlg-ulb/creditcardfraud).

**Description of the Data**

This is a table of 284,207 credit card transactions classified as fradulant or normal in the column `Class`.  In order protect confidentiality, the original features have been transformed using [principle component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) into 28 features named `V1, V2, ... V28` (float).  Two descriptive features are provided without transformation by PCA:
- `Time` (integer) is the seconds elapsed between the transaction and the earliest transaction in the table
- `Amount` (float) is the value of the transaction
>**Quick Note on PCA**

>PCA is an unsupervised learning technique: there is not a target variable.  PCA is commonlly used as a variable/feature reduction technique.  If you have 100 features then you could reduce it to a number p (say 10) projected features.  The choice of this number is a balance of how well it can explain the variance of the full feature space and reducing the number of features.  Each projected feature is orthogonal to each other feature, meaning there is no correlation between these new projected features.

**Preparation of the Data**

This notebook adds two columns to the source data and stores it in a new table with suffix `_prepped`.  
- `transaction_id` (string) a unique id for the row/transaction
- `splits` (string) this divided the tranactions into sets for `TRAIN` (80%), `VALIDATA` (10%), and `TEST` (10%)

---
## Setup

inputs:

In [1]:
PROJECT_ID = "statmike-mlops"
REGION = 'us-central1'
DATANAME = 'fraud'
NOTEBOOK = '01'

packages:

In [2]:
from google.cloud import bigquery

clients:

In [3]:
bq = bigquery.Client(project = PROJECT_ID)

parameters:

In [4]:
BUCKET = PROJECT_ID

---
## Create Dataset

List BigQuery datasets in the project:

In [5]:
query = f"""
SELECT schema_name
FROM `{PROJECT_ID}.INFORMATION_SCHEMA.SCHEMATA`
"""
bq.query(query = query).to_dataframe()

Unnamed: 0,schema_name


Create the dataset if missing:

In [6]:
query = f"""
CREATE SCHEMA IF NOT EXISTS `{PROJECT_ID}.{DATANAME}`
OPTIONS(
    location = '{REGION}',
    labels = [('notebook','{NOTEBOOK}')]
)
"""
job = bq.query(query = query)
job.result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7f94a95be710>

In [7]:
(job.ended-job.started).total_seconds()

0.677

---
## Create Table
- import data from Cloud Storage Bucket
- https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv

In [8]:
destination = bigquery.TableReference.from_string(f"{PROJECT_ID}.{DATANAME}.{DATANAME}")
job_config = bigquery.LoadJobConfig(
    write_disposition = 'WRITE_TRUNCATE',
    source_format = bigquery.SourceFormat.CSV,
    autodetect = True,
    labels = {'notebook':f'{NOTEBOOK}'}
)
job = bq.load_table_from_uri(f"gs://{BUCKET}/{DATANAME}/data/{DATANAME}.csv", destination, job_config = job_config)
job.result()

LoadJob<project=statmike-demo2, location=us-central1, id=41065df0-5dc5-4667-9bdc-c26aaeab46e9>

In [9]:
(job.ended-job.started).total_seconds()

12.326

In [10]:
query = f"""
SELECT *
FROM `{DATANAME}.{DATANAME}`
LIMIT 5
"""
bq.query(query = query).to_dataframe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,282,-0.356466,0.725418,1.971749,0.831343,0.369681,-0.107776,0.75161,-0.120166,-0.420675,...,0.020804,0.424312,-0.015989,0.466754,-0.809962,0.657334,-0.04315,-0.046401,0.0,0
1,380,-1.299837,0.881817,1.452842,-1.293698,-0.025105,-1.170103,0.86161,-0.193934,0.592001,...,-0.272563,-0.360853,0.223911,0.59893,-0.397705,0.637141,0.234872,0.021379,0.0,0
2,403,1.237413,0.512365,0.687746,1.693872,-0.236323,-0.650232,0.118066,-0.230545,-0.808523,...,-0.077543,-0.17822,0.038722,0.471218,0.289249,0.871803,-0.066884,0.012986,0.0,0
3,430,-1.860258,-0.629859,0.96657,0.844632,0.759983,-1.481173,-0.509681,0.540722,-0.733623,...,0.268028,0.125515,-0.225029,0.586664,-0.031598,0.570168,-0.043007,-0.223739,0.0,0
4,711,-0.431349,1.027694,2.670816,2.084787,-0.274567,0.286856,0.15211,0.200872,-0.596505,...,0.001241,0.15417,-0.141533,0.38461,-0.147132,-0.0871,0.101117,0.077944,0.0,0


### Check out this table in BigQuery Console:
- Click: https://console.cloud.google.com/bigquery
- Make sure project selected is the one from this notebook
- Under Explore, expand this project and review the dataset and table

---
## Prepare Data for Analysis

Create a prepped version of the data with test/train splits using SQL DDL:

In [13]:
query = f"""
CREATE OR REPLACE TABLE `{DATANAME}.{DATANAME}_prepped` AS
WITH add_id AS(SELECT *, GENERATE_UUID() transaction_id FROM `{DATANAME}.{DATANAME}`)
SELECT *,
    CASE 
        WHEN MOD(ABS(FARM_FINGERPRINT(transaction_id)),10) < 8 THEN "TRAIN" 
        WHEN MOD(ABS(FARM_FINGERPRINT(transaction_id)),10) < 5 THEN "VALIDATE"
        ELSE "TEST"
    END AS splits
FROM add_id
"""
job = bq.query(query = query)
job.result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7f94a8449b50>

In [14]:
(job.ended-job.started).total_seconds()

12.813

In [15]:
job.estimated_bytes_processed/1000000 #MB

70.632136

Review the test/train split:

In [16]:
query = f"""
SELECT splits, count(*) as Count, 100*count(*) / (sum(count(*)) OVER()) as Percentage
FROM `{DATANAME}.{DATANAME}_prepped`
GROUP BY splits
"""
bq.query(query = query).to_dataframe()

Unnamed: 0,splits,Count,Percentage
0,TEST,57131,20.059549
1,TRAIN,227676,79.940451


Retrieve a subset of the data to a Pandas dataframe:

In [17]:
query = f"""
SELECT * 
FROM `{DATANAME}.{DATANAME}_prepped`
LIMIT 5
"""
data = bq.query(query = query).to_dataframe()

In [18]:
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V23,V24,V25,V26,V27,V28,Amount,Class,transaction_id,splits
0,49561,1.333331,-0.845997,1.161578,-0.610965,-1.635783,-0.198304,-1.331531,0.212857,-0.208834,...,-0.095332,0.367347,0.398274,-0.05981,0.041674,0.011653,0.0,0,69a8d126-24fe-4cb8-8606-e6d5d87cba39,TEST
1,75948,1.177538,0.191202,0.995249,2.278516,-0.065408,1.24637,-0.671976,0.383296,-0.096174,...,-0.134489,-0.996634,0.454516,0.020868,0.039035,0.012567,0.0,0,2cdb1887-d890-4054-9508-38964404511b,TEST
2,126360,-0.565995,1.152597,-0.520348,-0.796012,0.659413,-0.649954,0.81957,0.414709,-0.544948,...,-0.144623,0.779525,-0.226338,0.440692,0.191961,0.158301,0.0,0,06f77c59-cb18-4b28-ae33-24e7e5581bfb,TEST
3,169248,-0.661771,0.964343,0.043392,0.132295,1.73737,-1.495161,1.618268,-0.526928,-1.481898,...,-0.746222,0.054143,1.361398,1.032193,-0.070334,0.027212,0.0,0,48e0386a-b746-4a66-928b-0f442f400022,TEST
4,36072,-1.49929,1.669549,2.576758,3.109448,-0.79046,1.417691,-1.345957,-2.768534,-0.058562,...,0.000104,0.381183,0.342781,0.524033,0.275742,0.121231,0.0,0,d15c6e63-3de9-4533-801d-c52ad93e3b43,TEST


---
## Remove Resources
see notebook "99 - Cleanup"