# BigQuery - BQML

This notebook uses BigQuery ML (BQML) to train a model that classifies handwritten digits stored in the BigQuery table `<PROJECT_ID>.digits.digits_prepped`. The model is then evaluated and used for prediction directly in BigQuery.

**Prerequisites**
- `00 - Initial Setup`
- `01 - BigQuery - Data`

**Overview**

<img src="architectures/statmike-mlops-02.png">

---
## Setup

Parameters:

In [1]:
PROJECT_ID = "statmike-mlops"
REGION = 'us-central1'

---
## Train Model

Use BigQuery ML to train a multiclass logistic regression model:

In [2]:
%%bigquery
CREATE OR REPLACE MODEL `digits.digits_lr`
OPTIONS
  ( model_type='LOGISTIC_REG',
    auto_class_weights=TRUE,
    input_label_cols=['target']
  ) AS
SELECT * EXCEPT(splits,target_OE)
FROM `digits.digits_prepped`
WHERE splits = 'TRAIN'

Query complete after 0.01s: 100%|██████████| 3/3 [00:00<00:00, 1445.98query/s]                        
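
Outside a notebook, the same statement can be submitted with the BigQuery Python client instead of the `%%bigquery` magic. The following is a minimal sketch, assuming the `google-cloud-bigquery` package is installed and credentials are configured. The helper names `training_sql` and `train` are hypothetical and not part of this notebook.

```python
def training_sql(dataset: str = "digits", model: str = "digits_lr",
                 table: str = "digits_prepped") -> str:
    """Build the CREATE MODEL statement used in the cell above."""
    return f"""
CREATE OR REPLACE MODEL `{dataset}.{model}`
OPTIONS
  ( model_type='LOGISTIC_REG',
    auto_class_weights=TRUE,
    input_label_cols=['target']
  ) AS
SELECT * EXCEPT(splits, target_OE)
FROM `{dataset}.{table}`
WHERE splits = 'TRAIN'
""".strip()


def train(project_id: str) -> None:
    """Submit the statement and block until the training job finishes."""
    # Imported lazily so the SQL helper works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    client.query(training_sql()).result()  # .result() waits for completion
```

Calling `train(PROJECT_ID)` would run the training job. `training_sql()` on its own only returns the statement text, which is convenient for logging or review.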


Review the iterations from training:

In [3]:
%%bigquery
SELECT *
FROM ML.TRAINING_INFO(MODEL `digits.digits_lr`)
ORDER BY iteration

Query complete after 0.00s: 100%|██████████| 3/3 [00:00<00:00, 1329.56query/s]                        
Downloading: 100%|██████████| 10/10 [00:01<00:00,  8.16rows/s]


| | training_run | iteration | loss | eval_loss | learning_rate | duration_ms |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.194594 | 0.194249 | 0.2 | 5393 |
| 1 | 0 | 1 | 0.141378 | 0.140936 | 0.4 | 9026 |
| 2 | 0 | 2 | 0.087078 | 0.087505 | 0.8 | 7650 |
| 3 | 0 | 3 | 0.052062 | 0.053845 | 1.6 | 6668 |
| 4 | 0 | 4 | 0.032789 | 0.035363 | 3.2 | 6633 |
| 5 | 0 | 5 | 0.021841 | 0.024161 | 6.4 | 6206 |
| 6 | 0 | 6 | 0.015789 | 0.019719 | 12.8 | 6021 |
| 7 | 0 | 7 | 0.013974 | 0.017063 | 3.2 | 6601 |
| 8 | 0 | 8 | 0.012652 | 0.016316 | 6.4 | 6527 |
| 9 | 0 | 9 | 0.010724 | 0.014669 | 12.8 | 6760 |
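
A quick way to read the training log is to check that both losses fall at every iteration and to look at the gap between them at the end. The sketch below hard-codes the `loss` and `eval_loss` values from the output above so that it runs without a BigQuery connection. The helper `is_strictly_decreasing` is a hypothetical name introduced for illustration.

```python
# (iteration, loss, eval_loss) copied from the ML.TRAINING_INFO output above.
history = [
    (0, 0.194594, 0.194249),
    (1, 0.141378, 0.140936),
    (2, 0.087078, 0.087505),
    (3, 0.052062, 0.053845),
    (4, 0.032789, 0.035363),
    (5, 0.021841, 0.024161),
    (6, 0.015789, 0.019719),
    (7, 0.013974, 0.017063),
    (8, 0.012652, 0.016316),
    (9, 0.010724, 0.014669),
]


def is_strictly_decreasing(values):
    """Return True if every value is smaller than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


train_loss = [row[1] for row in history]
eval_loss = [row[2] for row in history]

print("training loss decreasing:", is_strictly_decreasing(train_loss))
print("evaluation loss decreasing:", is_strictly_decreasing(eval_loss))

# Gap between evaluation and training loss at the final iteration.
final_gap = eval_loss[-1] - train_loss[-1]
print("final gap:", round(final_gap, 6))
```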


---
## Evaluate Model

Review the model evaluation statistics for the TRAIN and TEST splits:

In [4]:
%%bigquery
SELECT 'TRAIN' as SPLIT, * FROM ML.EVALUATE (MODEL `digits.digits_lr`,
    (SELECT * FROM `digits.digits_prepped` WHERE SPLITS='TRAIN'))
UNION ALL
SELECT 'TEST' as SPLIT, * FROM ML.EVALUATE (MODEL `digits.digits_lr`,
    (SELECT * FROM `digits.digits_prepped` WHERE SPLITS='TEST'))

Query complete after 0.00s: 100%|██████████| 19/19 [00:00<00:00, 8005.20query/s]                       
Downloading: 100%|██████████| 2/2 [00:01<00:00,  1.33rows/s]


| | SPLIT | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|---|
| 0 | TRAIN | 0.980935 | 0.980734 | 0.980861 | 0.980777 | 0.115419 | 0.999717 |
| 1 | TEST | 0.95894 | 0.95913 | 0.958084 | 0.9588 | 0.177642 | 0.998527 |


Review the confusion matrix for each split:

In [5]:
%%bigquery
SELECT *
FROM ML.CONFUSION_MATRIX (MODEL `digits.digits_lr`,(
    SELECT *
    FROM `digits.digits_prepped`
    WHERE splits = 'TRAIN')
  );

Query complete after 0.00s: 100%|██████████| 9/9 [00:00<00:00, 4143.66query/s]                        
Downloading: 100%|██████████| 10/10 [00:01<00:00,  8.58rows/s]


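| | expected_label | _0 | _1 | _2 | _3 | _4 | _5 | _6 | _7 | _8 | _9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 145 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 140 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 2 | 2 | 0 | 0 | 150 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 0 | 0 | 1 | 136 | 0 | 1 | 0 | 1 | 2 | 0 |
| 4 | 4 | 0 | 0 | 0 | 0 | 139 | 0 | 0 | 0 | 3 | 0 |
| 5 | 5 | 0 | 0 | 0 | 0 | 0 | 154 | 0 | 0 | 0 | 2 |
| 6 | 6 | 0 | 1 | 0 | 0 | 1 | 0 | 141 | 0 | 0 | 0 |
| 7 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 147 | 0 | 1 |
| 8 | 8 | 0 | 5 | 0 | 0 | 0 | 2 | 1 | 0 | 139 | 0 |
| 9 | 9 | 0 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 2 | 137 |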


In [6]:
%%bigquery
SELECT *
FROM ML.CONFUSION_MATRIX (MODEL `digits.digits_lr`,(
    SELECT *
    FROM `digits.digits_prepped`
    WHERE splits = 'TEST')
  );

Query complete after 0.00s: 100%|██████████| 9/9 [00:00<00:00, 4261.06query/s]                        
Downloading: 100%|██████████| 10/10 [00:01<00:00,  8.94rows/s]


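| | expected_label | _0 | _1 | _2 | _3 | _4 | _5 | _6 | _7 | _8 | _9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 39 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 0 | 0 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 0 | 0 | 0 | 40 | 0 | 0 | 0 | 0 | 2 | 0 |
| 4 | 4 | 0 | 0 | 0 | 0 | 36 | 0 | 0 | 2 | 0 | 1 |
| 5 | 5 | 0 | 0 | 0 | 0 | 0 | 25 | 1 | 0 | 0 | 0 |
| 6 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 38 | 0 | 0 | 0 |
| 7 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 31 | 0 | 0 |
| 8 | 8 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 25 | 0 |
| 9 | 9 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 2 | 32 |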
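
Overall accuracy and per-class recall can be read straight off a confusion matrix: the diagonal holds the correct predictions and each row sums to the number of examples with that expected label. The sketch below hard-codes the TEST matrix shown above so that it runs without BigQuery. The function names are hypothetical. The values describe only this matrix and may not equal the `ML.EVALUATE` figures if the two outputs were captured from different runs.

```python
# Rows are expected labels 0-9, columns are predicted labels 0-9,
# copied from the TEST confusion matrix above.
test_matrix = [
    [33, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 39, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 27, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 40, 0, 0, 0, 0, 2, 0],
    [0, 0, 0, 0, 36, 0, 0, 2, 0, 1],
    [0, 0, 0, 0, 0, 25, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 38, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 31, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0, 25, 0],
    [0, 0, 0, 1, 0, 1, 0, 1, 2, 32],
]


def accuracy(matrix):
    """Fraction of examples on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall_per_class(matrix):
    """For each expected label, the share of its row predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


print("accuracy:", round(accuracy(test_matrix), 4))
for label, value in enumerate(recall_per_class(test_matrix)):
    print(f"recall for digit {label}: {value:.3f}")
```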


---
## Predictions

Create a pandas dataframe with predictions for the test data in the table:

In [7]:
%%bigquery pred
SELECT *
FROM ML.PREDICT(MODEL `digits.digits_lr`,(
    SELECT *
    FROM `digits.digits_prepped`
    WHERE splits='TEST')
  )

Query complete after 0.00s: 100%|██████████| 8/8 [00:00<00:00, 2952.44query/s]                        
Downloading: 100%|██████████| 339/339 [00:01<00:00, 259.88rows/s]


Review the columns of the predictions. Note that the query added columns with the prefix `predicted_`:

In [8]:
pred.columns

Index(['predicted_target', 'predicted_target_probs', 'p0', 'p1', 'p2', 'p3',
       'p4', 'p5', 'p6', 'p7', 'p8', 'p9', 'p10', 'p11', 'p12', 'p13', 'p14',
       'p15', 'p16', 'p17', 'p18', 'p19', 'p20', 'p21', 'p22', 'p23', 'p24',
       'p25', 'p26', 'p27', 'p28', 'p29', 'p30', 'p31', 'p32', 'p33', 'p34',
       'p35', 'p36', 'p37', 'p38', 'p39', 'p40', 'p41', 'p42', 'p43', 'p44',
       'p45', 'p46', 'p47', 'p48', 'p49', 'p50', 'p51', 'p52', 'p53', 'p54',
       'p55', 'p56', 'p57', 'p58', 'p59', 'p60', 'p61', 'p62', 'p63', 'target',
       'target_OE', 'SPLITS'],
      dtype='object')

Print the first few rows for the columns related to the actual and predicted values:

In [9]:
pred[['target', 'predicted_target', 'predicted_target_probs', 'SPLITS']].head()

| | target | predicted_target | predicted_target_probs | SPLITS |
|---|---|---|---|---|
| 0 | 6 | 6 | `[{'label': 6, 'prob': 0.9962357632641997}, {'l...` | TEST |
| 1 | 0 | 0 | `[{'label': 0, 'prob': 0.9624699529574724}, {'l...` | TEST |
| 2 | 5 | 5 | `[{'label': 5, 'prob': 0.9968860686892316}, {'l...` | TEST |
| 3 | 3 | 3 | `[{'label': 3, 'prob': 0.931602682994269}, {'la...` | TEST |
| 4 | 9 | 9 | `[{'label': 9, 'prob': 0.9822555210903531}, {'l...` | TEST |


Notice the nested structure of the predicted probabilities. In BigQuery this is a repeated Record (an array of structs) with nested fields `label` and `prob`. It is returned to the pandas dataframe as a list of dictionaries, one per class.

The following code sorts the list for the first record by `label`:

In [10]:
[sorted(x, key = lambda x: x['label']) for x in [pred.predicted_target_probs[0]]]

[[{'label': 0, 'prob': 0.000713107108618018},
  {'label': 1, 'prob': 4.4982900313117e-06},
  {'label': 2, 'prob': 0.0001007608919244365},
  {'label': 3, 'prob': 2.3252700887197862e-05},
  {'label': 4, 'prob': 0.0019376754329717694},
  {'label': 5, 'prob': 0.00021023272581067556},
  {'label': 6, 'prob': 0.9962357632641997},
  {'label': 7, 'prob': 7.296668104531445e-06},
  {'label': 8, 'prob': 0.000765080272162334},
  {'label': 9, 'prob': 2.3326452901137037e-06}]]
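
To rank the classes by probability rather than by label, sort on the `prob` key in descending order. The sketch below hard-codes the probabilities printed above for the first record so that it runs without the `pred` dataframe. The helper name `top_k` is hypothetical.

```python
# Probabilities for the first test record, copied from the output above.
probs = [
    {'label': 0, 'prob': 0.000713107108618018},
    {'label': 1, 'prob': 4.4982900313117e-06},
    {'label': 2, 'prob': 0.0001007608919244365},
    {'label': 3, 'prob': 2.3252700887197862e-05},
    {'label': 4, 'prob': 0.0019376754329717694},
    {'label': 5, 'prob': 0.00021023272581067556},
    {'label': 6, 'prob': 0.9962357632641997},
    {'label': 7, 'prob': 7.296668104531445e-06},
    {'label': 8, 'prob': 0.000765080272162334},
    {'label': 9, 'prob': 2.3326452901137037e-06},
]


def top_k(class_probs, k=3):
    """Return the k most probable classes, highest probability first."""
    return sorted(class_probs, key=lambda d: d['prob'], reverse=True)[:k]


best = top_k(probs, k=1)[0]
print("most likely digit:", best['label'])
print("top three labels:", [d['label'] for d in top_k(probs)])

# The class probabilities of one record should sum to approximately 1.
total = sum(d['prob'] for d in probs)
print("sum of probabilities:", round(total, 6))
```

With the dataframe available, the same helper could be applied to every row, for example `pred.predicted_target_probs.apply(top_k)`.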

---
## Remove Resources
- delete model `<PROJECT_ID>.digits.digits_lr`

In [1]:
from google.cloud import bigquery
bq = bigquery.Client()

In [2]:
# NOTE: The model is an input for notebook `03 - BigQuery - BQML Online Predictions`;
# skip this cell if that notebook will be run next.
bq.delete_model(f'{PROJECT_ID}.digits.digits_lr')