# Test Skew

---
## Colab Setup

When running this notebook in [Colab](https://colab.google/) or [Colab Enterprise](https://cloud.google.com/colab/docs/introduction), this section will authenticate to GCP (follow prompts in the popup) and set the current project for the session.

In [21]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [22]:
try:
    from google.colab import auth
    auth.authenticate_user(project_id = PROJECT_ID)
    print('Colab authorized to GCP')
except Exception:
    print('Not a Colab Environment')
    pass

Not a Colab Environment


---
## Setup

Packages:

In [23]:
import json

Clients:

In [24]:
%load_ext google.cloud.bigquery

The google.cloud.bigquery extension is already loaded. To reload it, use:
  %reload_ext google.cloud.bigquery


Prepare the code below for your environment.

This notebook uses the [BigQuery IPython magic](https://cloud.google.com/python/docs/reference/bigquery/latest/magics) for legibility and for easy copy/pasting into the BigQuery SQL editor.  It can be run from any environment that runs notebooks, such as Colab, Colab Enterprise, Vertex AI Workbench Instances, or BigQuery Studio with a Python Notebook, but it needs some preparation first.  The SQL code in these cells uses fully qualified [BigQuery table](https://cloud.google.com/bigquery/docs/tables-intro) names in the form `projectname.datasetname.tablename`, with the project ID hard-coded.  Prepare for your environment by replacing the project ID:
- Edit > Find
    - Find: statmike-mlops-349915
    - Replace: <your project id>
    - Replace All
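
As an alternative to find and replace, the substitution can be done in Python before a query is submitted. The following is only a sketch: `with_project` and `DEFAULT_PROJECT` are hypothetical names that are not part of this notebook, and `my-project` is a placeholder.

```python
# Hypothetical helper (not part of the notebook): swap the hard-coded
# project ID in a SQL string for your own before running it.
DEFAULT_PROJECT = 'statmike-mlops-349915'  # project ID hard-coded in the SQL cells


def with_project(sql: str, project_id: str, default: str = DEFAULT_PROJECT) -> str:
    """Return `sql` with every occurrence of the default project ID replaced."""
    return sql.replace(default, project_id)


sql = "SELECT * FROM `statmike-mlops-349915.bqml_central_monitoring_test.source`"
print(with_project(sql, 'my-project'))
```

The rewritten string could then be submitted with the BigQuery Python client instead of the `%%bigquery` cell magic.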

### Create A BigQuery Dataset

Create a new [BigQuery Dataset](https://cloud.google.com/bigquery/docs/datasets) as a working location for this workflow:

In [25]:
%%bigquery
CREATE SCHEMA IF NOT EXISTS `statmike-mlops-349915.bqml_central_monitoring_test`
    OPTIONS(
        location = 'us'
    )

### Prepare The Source Data

Make a copy of the source table in the new BigQuery dataset, fixing the data quality issue in the `sex` column, where some values are recorded as `.`; these are replaced with `NULL`.

> Note: A copy is made in this case because the source project is `bigquery-public-data`, which is not editable.

In [26]:
%%bigquery
CREATE OR REPLACE TABLE `statmike-mlops-349915.bqml_central_monitoring_test.source` AS
    SELECT * EXCEPT(sex),
        CASE WHEN sex = '.' THEN NULL ELSE sex END AS sex
    FROM `bigquery-public-data.ml_datasets.penguins`

### Split The Data

Depending on the ML technique, it may be desirable to split the data into partitions for training, evaluation, and testing (used here as the monitoring examples).

The following cell creates a version of the table with a new column named `splits` with values [`TRAIN`, `EVAL`, `TEST`].  The rows are first grouped (stratified) by the values of `species` and `island` so that each split keeps the same group proportions, including any imbalance, as the full table.  Within each group, roughly 20% of rows are assigned to `TEST`, the next 10% to `EVAL`, and the remaining 70% to `TRAIN`.

In [27]:
%%bigquery
CREATE OR REPLACE TABLE `statmike-mlops-349915.bqml_central_monitoring_test.source_split` AS
    WITH
        # randomized numbering within groups (species, island)
        RANDOM AS (
            SELECT *,
                ROW_NUMBER() OVER (PARTITION BY species, island ORDER BY RAND()) AS sequence
            FROM `statmike-mlops-349915.bqml_central_monitoring_test.source`
        ),
        # get group sizes
        GROUP_SIZES AS (
            SELECT species, island, COUNT(*) AS count
            FROM `statmike-mlops-349915.bqml_central_monitoring_test.source`
            GROUP BY species, island
        )
    SELECT
        * EXCEPT(sequence, count),
        CASE
            WHEN sequence <= CEIL(0.2 * count) AND species IS NOT NULL THEN 'TEST'
            WHEN sequence <= CEIL(0.3 * count) THEN 'EVAL'
            ELSE 'TRAIN'
        END AS splits
    FROM RANDOM
    LEFT OUTER JOIN GROUP_SIZES USING(species, island)

Review the data by `splits`:

In [28]:
%%bigquery
SELECT species, island,
    COUNT(*) as count,
    100 * COUNTIF(splits = 'TRAIN')/COUNT(*) AS TRAIN_PCT,
    100 * COUNTIF(splits = 'EVAL')/COUNT(*) AS EVAL_PCT,
    100 * COUNTIF(splits = 'TEST')/COUNT(*) AS TEST_PCT
FROM `statmike-mlops-349915.bqml_central_monitoring_test.source_split`
GROUP BY species, island

| | species | island | count | TRAIN_PCT | EVAL_PCT | TEST_PCT |
|---|---|---|---|---|---|---|
| 0 | Gentoo penguin (Pygoscelis papua) | Biscoe | 124 | 69.354839 | 10.483871 | 20.16129 |
| 1 | Adelie Penguin (Pygoscelis adeliae) | Torgersen | 52 | 69.230769 | 9.615385 | 21.153846 |
| 2 | Adelie Penguin (Pygoscelis adeliae) | Dream | 56 | 69.642857 | 8.928571 | 21.428571 |
| 3 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 44 | 68.181818 | 11.363636 | 20.454545 |
| 4 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 68 | 69.117647 | 10.294118 | 20.588235 |
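
The percentages are not exactly 70/10/20 because the cutoffs in the split query are rounded up with `CEIL`. The sketch below mirrors that arithmetic in Python for a single group. The function names are hypothetical, and it assumes `species` is not null, so the extra condition on the `TEST` branch is ignored.

```python
import math


def split_counts(group_size: int) -> dict:
    """Rows per split for one (species, island) group, mirroring the CASE logic."""
    test_cutoff = math.ceil(0.2 * group_size)  # sequence <= CEIL(.2 * count) -> TEST
    eval_cutoff = math.ceil(0.3 * group_size)  # sequence <= CEIL(.3 * count) -> EVAL
    return {
        'TEST': test_cutoff,
        'EVAL': eval_cutoff - test_cutoff,
        'TRAIN': group_size - eval_cutoff,     # everything past the EVAL cutoff
    }


def split_percentages(group_size: int) -> dict:
    """Same counts expressed as a percentage of the group."""
    return {k: 100 * v / group_size for k, v in split_counts(group_size).items()}


# Group sizes taken from the `count` column of the review query
for size in (124, 52, 56, 44, 68):
    print(size, split_counts(size), split_percentages(size))
```

For the group of 124 rows this gives 25 `TEST`, 13 `EVAL`, and 86 `TRAIN` rows, which matches the 20.16%, 10.48%, and 69.35% reported by the review query.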


---
## Create/Train A Model

Create a model trained to classify `species` from the training records.  Here, directly in BigQuery, the [`CREATE MODEL` statement for generalized linear models](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-glm) is used with `MODEL_TYPE = 'LOGISTIC_REG'`.  Preprocessing (scaling, bucketizing, and imputation) is embedded in the model with the `TRANSFORM` clause, so the same transformations are applied automatically during evaluation and prediction.  The `TEST` rows are excluded from model creation, and a boolean `split` column sends `EVAL` rows to evaluation and `TRAIN` rows to training.


> Note: This runs for about 15-16 minutes

In [30]:
%%bigquery
CREATE MODEL IF NOT EXISTS `statmike-mlops-349915.bqml_central_monitoring_test.classify_species_logistic`
    TRANSFORM(
        ML.ROBUST_SCALER(body_mass_g) OVER() AS body_mass_g,
        ML.STANDARD_SCALER(culmen_length_mm) OVER() AS culmen_length_mm,
        ML.STANDARD_SCALER(culmen_depth_mm) OVER() AS culmen_depth_mm,
        ML.QUANTILE_BUCKETIZE(flipper_length_mm, 3) OVER() AS flipper_length_mm,
        ML.IMPUTER(sex, 'most_frequent') OVER() AS sex,
        ML.IMPUTER(island, 'most_frequent') OVER() AS island,
        species, split
    )
    OPTIONS(
        MODEL_TYPE = 'LOGISTIC_REG',
        INPUT_LABEL_COLS = ['species'],
        
        # data specifics
        DATA_SPLIT_METHOD = 'CUSTOM',
        DATA_SPLIT_COL = 'split',
        
        # model specifics
        AUTO_CLASS_WEIGHTS = TRUE
    )
AS
    SELECT * EXCEPT(splits),
        CASE WHEN splits = 'TRAIN' THEN FALSE
        ELSE TRUE END AS split
    FROM `statmike-mlops-349915.bqml_central_monitoring_test.source_split`
    WHERE splits != 'TEST'   

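
With `DATA_SPLIT_METHOD = 'CUSTOM'`, BigQuery ML uses rows where the `DATA_SPLIT_COL` value is `TRUE` for evaluation and rows where it is `FALSE` for training. The following sketch reproduces the query's row preparation on plain Python dictionaries to make that mapping explicit. The function name and the sample rows are hypothetical.

```python
def to_custom_split(rows):
    """Drop TEST rows and replace `splits` with a boolean `split` flag."""
    prepared = []
    for row in rows:
        if row['splits'] == 'TEST':
            continue  # WHERE splits != 'TEST'
        # SELECT * EXCEPT(splits)
        new_row = {k: v for k, v in row.items() if k != 'splits'}
        # CASE WHEN splits = 'TRAIN' THEN FALSE ELSE TRUE END AS split
        new_row['split'] = row['splits'] != 'TRAIN'
        prepared.append(new_row)
    return prepared


# Hypothetical sample rows, for illustration only
sample = [
    {'island': 'Biscoe', 'splits': 'TRAIN'},
    {'island': 'Dream', 'splits': 'EVAL'},
    {'island': 'Torgersen', 'splits': 'TEST'},
]
print(to_custom_split(sample))
```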

### Monitoring Skew - Model To Test Split

In [31]:
%%bigquery
SELECT *
FROM ML.VALIDATE_DATA_SKEW(
    MODEL `statmike-mlops-349915.bqml_central_monitoring_test.classify_species_logistic`,
    (SELECT * FROM `statmike-mlops-349915.bqml_central_monitoring_test.source_split` WHERE splits = 'TEST')
);

| | input | metric | threshold | value | is_anomaly |
|---|---|---|---|---|---|
| 0 | body_mass_g | JENSEN_SHANNON_DIVERGENCE | 0.3 | 0.039228 | False |
| 1 | culmen_depth_mm | JENSEN_SHANNON_DIVERGENCE | 0.3 | 0.107733 | False |
| 2 | culmen_length_mm | JENSEN_SHANNON_DIVERGENCE | 0.3 | 0.104672 | False |
| 3 | flipper_length_mm | JENSEN_SHANNON_DIVERGENCE | 0.3 | 0.051873 | False |
| 4 | island | L_INFTY | 0.3 | 0.008522 | False |
| 5 | sex | L_INFTY | 0.3 | 0.105797 | False |
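
Numerical features are compared with Jensen-Shannon divergence, and a feature is flagged when the value exceeds the threshold (0.3 here). The sketch below shows the calculation on two histograms that share the same bins. It is an illustration under assumptions: the bin counts are made up, and the binning and logarithm base that BigQuery uses internally are not taken from this notebook (base 2 is used here, which bounds the result to [0, 1]).

```python
import math


def normalize(counts):
    """Convert bin counts to probabilities."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(base_counts, compare_counts):
    """Jensen-Shannon divergence between two histograms over the same bins."""
    p, q = normalize(base_counts), normalize(compare_counts)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical bin counts for a training sample and a test sample
train_hist = [40, 90, 70, 40]
test_hist = [12, 25, 22, 10]
value = js_divergence(train_hist, test_hist)
print(value, value > 0.3)  # second item plays the role of is_anomaly
```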


### Monitoring Skew - (Manually) Drift From Train To Test Split

In [32]:
%%bigquery
SELECT *
FROM ML.VALIDATE_DATA_DRIFT(
    (SELECT * FROM `statmike-mlops-349915.bqml_central_monitoring_test.source_split` WHERE splits = 'TRAIN'),
    (SELECT * FROM `statmike-mlops-349915.bqml_central_monitoring_test.source_split` WHERE splits = 'TEST')
);

| | input | metric | threshold | value | is_anomaly |
|---|---|---|---|---|---|
| 0 | body_mass_g | JENSEN_SHANNON_DIVERGENCE | 0.3 | 0.039228 | False |
| 1 | culmen_depth_mm | JENSEN_SHANNON_DIVERGENCE | 0.3 | 0.107733 | False |
| 2 | culmen_length_mm | JENSEN_SHANNON_DIVERGENCE | 0.3 | 0.104672 | False |
| 3 | flipper_length_mm | JENSEN_SHANNON_DIVERGENCE | 0.3 | 0.051873 | False |
| 4 | island | L_INFTY | 0.3 | 0.008522 | False |
| 5 | sex | L_INFTY | 0.3 | 0.105797 | False |
| 6 | species | L_INFTY | 0.3 | 0.009528 | False |
| 7 | splits | L_INFTY | 0.3 | 1.0 | True |
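
The `splits` column is flagged only because of how the two inputs were built: every row in the first query has `splits = 'TRAIN'` and every row in the second has `splits = 'TEST'`. Categorical features are compared with the L-infinity distance, the largest absolute difference in relative frequency across categories, so the value is 1.0 by construction. A minimal sketch of that calculation follows. The function name is hypothetical, and how BigQuery treats `NULL` values is not covered here.

```python
from collections import Counter


def l_infinity(base_values, compare_values):
    """Largest absolute difference in category frequency between two samples."""
    base, comp = Counter(base_values), Counter(compare_values)
    n_base, n_comp = len(base_values), len(compare_values)
    categories = set(base) | set(comp)
    # A Counter returns 0 for a category missing from one side
    return max(abs(base[c] / n_base - comp[c] / n_comp) for c in categories)


# The `splits` column: disjoint categories give the maximum distance
print(l_infinity(['TRAIN'] * 7, ['TEST'] * 2))

# A hypothetical categorical feature with a modest frequency shift
print(l_infinity(['MALE', 'MALE', 'FEMALE', 'FEMALE'],
                 ['MALE', 'MALE', 'MALE', 'FEMALE']))
```

Excluding the column from both inputs, for example with `SELECT * EXCEPT(splits)`, would remove this expected anomaly from the results.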


### Monitoring Skew - (Manually) Drift From Train+Eval To Test Split

In [None]:
%%bigquery
SELECT *
FROM ML.VALIDATE_DATA_DRIFT(
    (SELECT * FROM `statmike-mlops-349915.bqml_central_monitoring_test.source_split` WHERE splits != 'TEST'),
    (SELECT * FROM `statmike-mlops-349915.bqml_central_monitoring_test.source_split` WHERE splits = 'TEST')
);
