# 03a - BigQuery Machine Learning (BQML) - Machine Learning with SQL

BigQuery has a number of machine learning algorithms that are callable directly from SQL.  This makes it possible to train a model with a familiar SQL statement: `CREATE MODEL …`.  The library of available models keeps growing and covers supervised, unsupervised, and time series methods, along with functions for evaluation, anomaly detection, explainability, and hyperparameter tuning.  A good starting point for seeing the scope of available methods is the [user journey for models](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-e2e-journey).

In this demonstration, BigQuery ML (BQML) is used to create a logistic regression model.

**Prerequisites:**

-  01 - BigQuery - Table Data Source

**Overview:**

-  Train a logistic regression model with BQML
   -  CREATE MODEL … model_type='LOGISTIC_REG'
-  Review training information
   -  SELECT * FROM ML.TRAINING_INFO…
-  Evaluate the model's performance
   -  SELECT * FROM ML.EVALUATE…
-  Review the classification errors with a confusion matrix
   -  SELECT * FROM ML.CONFUSION_MATRIX…
-  Create predictions for data in BigQuery
   -  SELECT * FROM ML.PREDICT…

**Resources:**

-  [BigQuery ML (BQML) Overview](https://cloud.google.com/bigquery-ml/docs/introduction)
-  [Overview of BQML methods and workflows](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-e2e-journey)
-  [BigQuery magics for jupyter notebooks](https://github.com/GoogleCloudPlatform/bigquery-notebooks/blob/main/notebooks/official/template_notebooks/bigquery_magic.ipynb)

---
## Conceptual Architecture

<img src="architectures/statmike-mlops-02.png">

---
## Setup

inputs:

In [2]:
REGION = 'us-central1'
PROJECT_ID='statmike-mlops'
DATANAME = 'digits'
NOTEBOOK = '03a'
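
The queries later in this notebook hard-code the names `digits.digits_prepped` and `digits.digits_lr`. As a minimal sketch, assuming the dataset is named after `DATANAME` and lives in `PROJECT_ID`, those names could be derived from the inputs. The `table_id` helper is hypothetical and not part of the notebook or of any BigQuery library.

```python
# Values copied from the Setup cell above.
PROJECT_ID = 'statmike-mlops'
DATANAME = 'digits'


def table_id(project: str, dataset: str, name: str) -> str:
    """Hypothetical helper: join the parts of a fully qualified BigQuery name."""
    return f"{project}.{dataset}.{name}"


# Assumption: the prepared table and the model follow the naming used in the queries.
source_table = table_id(PROJECT_ID, DATANAME, f"{DATANAME}_prepped")
model_name = table_id(PROJECT_ID, DATANAME, f"{DATANAME}_lr")

print(source_table)
print(model_name)
```

The queries in this notebook omit the project prefix, so they rely on the default project of the BigQuery client.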

---
## Train Model

Use BigQuery ML to train a multiclass logistic regression model:

In [5]:
%%bigquery
CREATE OR REPLACE MODEL `digits.digits_lr`
OPTIONS
    (model_type='LOGISTIC_REG',
        auto_class_weights=TRUE,
        input_label_cols=['target'],
        data_split_col = 'custom_splits',
        data_split_method = 'CUSTOM'
    ) AS
SELECT * EXCEPT(splits, target_OE),
    CASE
        WHEN splits = 'TRAIN' THEN FALSE
        ELSE TRUE
    END AS custom_splits
FROM `digits.digits_prepped`

Query complete after 0.01s: 100%|██████████| 3/3 [00:00<00:00, 1447.48query/s]                        
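
With `data_split_method = 'CUSTOM'`, BigQuery ML reads the boolean column named in `data_split_col`: rows with `FALSE` are used for training and rows with `TRUE` are held out for evaluation. The sketch below mirrors the `CASE` expression from the query in plain Python to show which rows end up on each side. The sample values are hypothetical, and `'VALIDATE'` is included only to show that any value other than `'TRAIN'` is held out.

```python
def custom_split(splits):
    """Mirror of the SQL CASE expression: FALSE for 'TRAIN', TRUE otherwise."""
    return splits != 'TRAIN'


# Hypothetical sample of values from the `splits` column.
sample = ['TRAIN', 'TEST', 'TRAIN', 'VALIDATE']
flags = [custom_split(s) for s in sample]

# FALSE rows train the model, TRUE rows are held out for evaluation.
training_rows = [s for s, held_out in zip(sample, flags) if not held_out]
evaluation_rows = [s for s, held_out in zip(sample, flags) if held_out]

print(flags)
print(training_rows, evaluation_rows)
```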


Review the iterations from training:

In [1]:
%%bigquery
SELECT *
FROM ML.TRAINING_INFO(MODEL `digits.digits_lr`)
ORDER BY iteration

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 1100.87query/s]
Downloading: 100%|██████████| 17/17 [00:00<00:00, 18.66rows/s]


|   | training_run | iteration | loss | eval_loss | learning_rate | duration_ms |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.194579 | 0.195251 | 0.2 | 7281 |
| 1 | 0 | 1 | 0.141465 | 0.142899 | 0.4 | 10496 |
| 2 | 0 | 2 | 0.087307 | 0.089117 | 0.8 | 6479 |
| 3 | 0 | 3 | 0.05234 | 0.054244 | 1.6 | 7451 |
| 4 | 0 | 4 | 0.033017 | 0.035164 | 3.2 | 7248 |
| 5 | 0 | 5 | 0.022044 | 0.024556 | 6.4 | 7240 |
| 6 | 0 | 6 | 0.016089 | 0.019642 | 12.8 | 8316 |
| 7 | 0 | 7 | 0.014276 | 0.017937 | 3.2 | 10697 |
| 8 | 0 | 8 | 0.012984 | 0.017248 | 6.4 | 11747 |
| 9 | 0 | 9 | 0.011335 | 0.016269 | 12.8 | 9882 |
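
A quick way to read this output is to check that both losses keep falling and to see where the learning rate was reduced. The sketch below does that with the ten rows shown above, copied by hand. The doubling of the learning rate followed by a drop looks like a line-search style schedule, but that reading is an assumption based on this output alone.

```python
# (iteration, loss, eval_loss, learning_rate) copied from the output above.
training_info = [
    (0, 0.194579, 0.195251, 0.2),
    (1, 0.141465, 0.142899, 0.4),
    (2, 0.087307, 0.089117, 0.8),
    (3, 0.05234, 0.054244, 1.6),
    (4, 0.033017, 0.035164, 3.2),
    (5, 0.022044, 0.024556, 6.4),
    (6, 0.016089, 0.019642, 12.8),
    (7, 0.014276, 0.017937, 3.2),
    (8, 0.012984, 0.017248, 6.4),
    (9, 0.011335, 0.016269, 12.8),
]

losses = [row[1] for row in training_info]
eval_losses = [row[2] for row in training_info]

# True when every iteration lowers the loss relative to the previous one.
loss_decreasing = all(b < a for a, b in zip(losses, losses[1:]))
eval_decreasing = all(b < a for a, b in zip(eval_losses, eval_losses[1:]))

# Iterations where the learning rate was lowered instead of raised.
rate_drops = [cur[0] for prev, cur in zip(training_info, training_info[1:])
              if cur[3] < prev[3]]

# Gap between evaluation loss and training loss at the last listed iteration.
final_gap = eval_losses[-1] - losses[-1]

print(loss_decreasing, eval_decreasing, rate_drops, round(final_gap, 6))
```

A gap between `eval_loss` and `loss` that widens while `loss` keeps falling would be a sign of overfitting; here the gap stays small.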


---
## Evaluate Model

Review the model evaluation statistics on the Test/Train splits:

In [2]:
%%bigquery
SELECT 'TRAIN' as SPLIT, * FROM ML.EVALUATE (MODEL `digits.digits_lr`,
    (SELECT * FROM `digits.digits_prepped` WHERE SPLITS='TRAIN'))
UNION ALL
SELECT 'TEST' as SPLIT, * FROM ML.EVALUATE (MODEL `digits.digits_lr`,
    (SELECT * FROM `digits.digits_prepped` WHERE SPLITS='TEST'))

Query complete after 0.00s: 100%|██████████| 19/19 [00:00<00:00, 7539.43query/s]                       
Downloading: 100%|██████████| 2/2 [00:01<00:00,  1.85rows/s]


|   | SPLIT | precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|---|---|
| 0 | TEST | 0.988854 | 0.986404 | 0.98773 | 0.987216 | 0.084713 | 0.999121 |
| 1 | TRAIN | 0.992325 | 0.992292 | 0.99235 | 0.992263 | 0.066866 | 1.0 |


Review the confusion matrix for each split:

In [3]:
%%bigquery
SELECT *
FROM ML.CONFUSION_MATRIX (MODEL `digits.digits_lr`,(
    SELECT *
    FROM `digits.digits_prepped`
    WHERE splits = 'TRAIN')
  );

Query complete after 0.00s: 100%|██████████| 9/9 [00:00<00:00, 5367.37query/s]                        
Downloading: 100%|██████████| 10/10 [00:00<00:00, 10.91rows/s]


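|   | expected_label | _0 | _1 | _2 | _3 | _4 | _5 | _6 | _7 | _8 | _9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 149 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 140 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 0 | 0 | 134 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 0 | 0 | 0 | 146 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 0 | 0 | 0 | 0 | 156 | 0 | 0 | 0 | 1 | 0 |
| 5 | 5 | 0 | 0 | 0 | 0 | 0 | 146 | 1 | 0 | 0 | 1 |
| 6 | 6 | 0 | 0 | 0 | 0 | 1 | 0 | 141 | 0 | 0 | 0 |
| 7 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 142 | 0 | 0 |
| 8 | 8 | 0 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 134 | 0 |
| 9 | 9 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 139 |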
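
Each row of the matrix is an actual digit and each column a predicted digit, so every non-zero cell off the diagonal is a classification error. As a small sketch, the TRAIN matrix above (copied by hand) can be scanned for those cells to list the most frequent confusions first.

```python
# TRAIN confusion matrix copied from the output above:
# rows are expected labels 0-9, columns are predicted labels 0-9.
train_matrix = [
    [149, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 140, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 134, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 146, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 156, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 146, 1, 0, 0, 1],
    [0, 0, 0, 0, 1, 0, 141, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 142, 0, 0],
    [0, 4, 1, 0, 0, 0, 0, 0, 134, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 139],
]

# Collect (expected, predicted, count) for every off-diagonal, non-zero cell.
errors = [
    (expected, predicted, count)
    for expected, row in enumerate(train_matrix)
    for predicted, count in enumerate(row)
    if expected != predicted and count > 0
]
errors.sort(key=lambda e: e[2], reverse=True)  # most frequent confusion first

total = sum(sum(row) for row in train_matrix)
misclassified = sum(count for _, _, count in errors)

for expected, predicted, count in errors:
    print(f"expected {expected}, predicted {predicted}: {count}")
print(f"{misclassified} of {total} rows misclassified")
```

The most common error in this output is an 8 predicted as a 1, which accounts for 4 of the 11 misclassified training rows.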


In [4]:
%%bigquery
SELECT *
FROM ML.CONFUSION_MATRIX (MODEL `digits.digits_lr`,(
    SELECT *
    FROM `digits.digits_prepped`
    WHERE splits = 'TEST')
  );

Query complete after 0.00s: 100%|██████████| 9/9 [00:00<00:00, 4286.22query/s]                        
Downloading: 100%|██████████| 10/10 [00:00<00:00, 11.45rows/s]


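|   | expected_label | _0 | _1 | _2 | _3 | _4 | _5 | _6 | _7 | _8 | _9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 0 | 0 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 0 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 4 | 0 | 0 | 0 | 0 | 12 | 0 | 0 | 0 | 0 | 0 |
| 5 | 5 | 0 | 0 | 0 | 0 | 0 | 15 | 0 | 0 | 0 | 0 |
| 6 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 0 | 0 | 0 |
| 7 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 |
| 8 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 0 |
| 9 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 18 |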
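
The confusion matrix also makes it possible to check the `ML.EVALUATE` statistics by hand. The sketch below computes accuracy plus per-class precision, recall, and F1 from the TEST matrix above, then averages the per-class values with equal weight (macro averaging). The results agree with the TEST row of `ML.EVALUATE` shown earlier to the printed precision; that agreement is an observation from this output, not a description of BigQuery ML internals.

```python
# TEST confusion matrix copied from the output above:
# rows are expected labels 0-9, columns are predicted labels 0-9.
test_matrix = [
    [14, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 19, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 20, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 11, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 12, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 15, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 18, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 16, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 18, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 18],
]

n = len(test_matrix)
total = sum(sum(row) for row in test_matrix)
accuracy = sum(test_matrix[i][i] for i in range(n)) / total

precisions, recalls, f1s = [], [], []
for i in range(n):
    tp = test_matrix[i][i]
    predicted_i = sum(test_matrix[r][i] for r in range(n))  # column sum
    actual_i = sum(test_matrix[i])                          # row sum
    p = tp / predicted_i if predicted_i else 0.0
    r = tp / actual_i if actual_i else 0.0
    precisions.append(p)
    recalls.append(r)
    f1s.append(2 * p * r / (p + r) if p + r else 0.0)

# Macro averages: every class counts equally, regardless of its size.
macro_precision = sum(precisions) / n
macro_recall = sum(recalls) / n
macro_f1 = sum(f1s) / n

print(total, round(accuracy, 6), round(macro_precision, 6),
      round(macro_recall, 6), round(macro_f1, 6))
```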


---
## Predictions

Create a pandas dataframe with predictions for the test data in the table:

In [6]:
%%bigquery pred
SELECT *
FROM ML.PREDICT(MODEL `digits.digits_lr`,(
    SELECT *
    FROM `digits.digits_prepped`
    WHERE splits='TEST')
  )

Query complete after 0.00s: 100%|██████████| 8/8 [00:00<00:00, 3606.84query/s]                        
Downloading: 100%|██████████| 163/163 [00:01<00:00, 154.13rows/s]


Review the columns of the predictions.  Note that the query added columns with the prefix `predicted_`:

In [7]:
pred.columns

Index(['predicted_target', 'predicted_target_probs', 'p0', 'p1', 'p2', 'p3',
       'p4', 'p5', 'p6', 'p7', 'p8', 'p9', 'p10', 'p11', 'p12', 'p13', 'p14',
       'p15', 'p16', 'p17', 'p18', 'p19', 'p20', 'p21', 'p22', 'p23', 'p24',
       'p25', 'p26', 'p27', 'p28', 'p29', 'p30', 'p31', 'p32', 'p33', 'p34',
       'p35', 'p36', 'p37', 'p38', 'p39', 'p40', 'p41', 'p42', 'p43', 'p44',
       'p45', 'p46', 'p47', 'p48', 'p49', 'p50', 'p51', 'p52', 'p53', 'p54',
       'p55', 'p56', 'p57', 'p58', 'p59', 'p60', 'p61', 'p62', 'p63', 'target',
       'target_OE', 'splits'],
      dtype='object')
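
The column list falls into three groups: the columns added by `ML.PREDICT`, the 64 `p0` to `p63` columns (presumably the model's input features), and the remaining columns passed through from the input query. A minimal sketch of that grouping, using a plain list that reproduces the column names above instead of the `pred` dataframe:

```python
# Column names reproduced from the `pred.columns` output above.
columns = (
    ['predicted_target', 'predicted_target_probs']
    + [f'p{i}' for i in range(64)]
    + ['target', 'target_OE', 'splits']
)

# Columns added by ML.PREDICT carry the `predicted_` prefix.
added = [c for c in columns if c.startswith('predicted_')]

# Assumption: columns named `p` followed by digits are the input features.
features = [c for c in columns if c[0] == 'p' and c[1:].isdigit()]

# Everything else was passed through unchanged from the input query.
passthrough = [c for c in columns if c not in added and c not in features]

print(added)
print(len(features))
print(passthrough)
```

With the real dataframe, the same comprehensions would work on `list(pred.columns)`.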

Print the first few rows for the columns related to the actual and predicted values:

In [9]:
pred[['target', 'predicted_target', 'predicted_target_probs', 'splits']].head()

|   | target | predicted_target | predicted_target_probs | splits |
|---|---|---|---|---|
| 0 | 5 | 5 | [{'label': 5, 'prob': 0.992310542043166}, {'la... | TEST |
| 1 | 9 | 9 | [{'label': 9, 'prob': 0.9709476657242518}, {'l... | TEST |
| 2 | 8 | 8 | [{'label': 8, 'prob': 0.9756974401078998}, {'l... | TEST |
| 3 | 2 | 2 | [{'label': 2, 'prob': 0.9937757294655372}, {'l... | TEST |
| 4 | 6 | 6 | [{'label': 6, 'prob': 0.9893872300958743}, {'l... | TEST |


Notice the nested structure for the predicted probabilities.  In BigQuery this is a repeated Record type (an array of structs) with nested fields for `label` and `prob`.  It is returned to the pandas dataframe as a list of dictionaries, one per class.

The following code sorts the list of dictionaries for the first record by `label`:

In [10]:
[sorted(x, key = lambda x: x['label']) for x in [pred.predicted_target_probs[0]]]

[[{'label': 0, 'prob': 2.827550788524117e-05},
  {'label': 1, 'prob': 0.00022835558787124367},
  {'label': 2, 'prob': 0.0007155590617441531},
  {'label': 3, 'prob': 0.0015166203808304494},
  {'label': 4, 'prob': 8.582708806544684e-05},
  {'label': 5, 'prob': 0.992310542043166},
  {'label': 6, 'prob': 3.0936262096098616e-06},
  {'label': 7, 'prob': 0.0010702609786502797},
  {'label': 8, 'prob': 0.0014902489137406387},
  {'label': 9, 'prob': 0.0025512168118369694}]]
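
Sorting by `prob` instead of `label` ranks the classes from most to least likely. The sketch below uses the probabilities printed above, copied into a plain list so that it runs without the `pred` dataframe, and also checks that they sum to 1.

```python
# Probabilities for the first record, copied from the output above.
probs = [
    {'label': 0, 'prob': 2.827550788524117e-05},
    {'label': 1, 'prob': 0.00022835558787124367},
    {'label': 2, 'prob': 0.0007155590617441531},
    {'label': 3, 'prob': 0.0015166203808304494},
    {'label': 4, 'prob': 8.582708806544684e-05},
    {'label': 5, 'prob': 0.992310542043166},
    {'label': 6, 'prob': 3.0936262096098616e-06},
    {'label': 7, 'prob': 0.0010702609786502797},
    {'label': 8, 'prob': 0.0014902489137406387},
    {'label': 9, 'prob': 0.0025512168118369694},
]

# Rank the classes from most to least likely.
by_prob = sorted(probs, key=lambda d: d['prob'], reverse=True)
top_label = by_prob[0]['label']
top3 = [d['label'] for d in by_prob[:3]]

# The class probabilities of one record should add up to 1.
total_prob = sum(d['prob'] for d in probs)

print(top_label, top3)
```

The top-ranked label, 5, matches the `predicted_target` value for this record.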

---
## Remove Resources
See notebook "XX - Cleanup".