# DEV - Autoencoder

- Overview: try PCA, k-means, and an **autoencoder**
- Idea: compare detected anomalies to the predicted class
- Thought: but these features are already principal components...

**Hyperparameter Tuning**

When training a machine learning model, it is helpful to find the optimal values for hyperparameters, which are parameters set before training begins. These are not learned parameters like the coefficients of a model. Rather than manually iterating over these values, we want to test them sequentially and focus in on the optimal values. In BQML, the focusing part of the iterations is done by the [Vertex AI Vizier](https://cloud.google.com/vertex-ai/docs/vizier/overview) service by default.

Each `MODEL_TYPE` in BQML has parameters that can be tuned, as [listed here](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-hyperparameter-tuning#hyperparameters_and_objectives).
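To make the idea of a search space concrete, here is a minimal sketch in plain Python. It mirrors the two ways BQML declares a tunable option, a range (`HPARAM_RANGE`) and a discrete list (`HPARAM_CANDIDATES`), and draws trial configurations from them. The names `SEARCH_SPACE` and `sample_trial` are hypothetical, and uniform random sampling is only a stand-in here: Vertex AI Vizier picks later trials based on earlier results rather than sampling blindly.

```python
import random

# Hypothetical search space: a tuple is a (low, high) range,
# a list is a set of discrete candidates.
SEARCH_SPACE = {
    "dropout": (0.1, 0.9),
    "batch_size": (15, 100),
    "optimizer": ["SGD", "ADAM"],
    "activation_fn": ["RELU", "TANH"],
}


def sample_trial(space, rng):
    """Draw one hyperparameter configuration from the search space."""
    trial = {}
    for name, spec in space.items():
        if isinstance(spec, tuple):
            low, high = spec
            if isinstance(low, int) and isinstance(high, int):
                # Integer range, e.g. batch_size
                trial[name] = rng.randint(low, high)
            else:
                # Continuous range, e.g. dropout
                trial[name] = rng.uniform(low, high)
        else:
            # Discrete candidates, e.g. optimizer
            trial[name] = rng.choice(spec)
    return trial


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sampled_trials = [sample_trial(SEARCH_SPACE, rng) for _ in range(5)]
for t in sampled_trials:
    print(t)
```

Each printed dictionary corresponds to one trial. A tuning service would train a model per trial, record the objective, and use that history to propose the next configuration.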

**Prerequisites:**
-  01 - BigQuery - Table Data Source

**Resources:**
-  [BigQuery ML (BQML) Overview](https://cloud.google.com/bigquery-ml/docs/introduction)
-  [Overview of BQML methods and workflows](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-e2e-journey)

**Conceptual Flow & Workflow**


---
## Setup

inputs:

In [1]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [2]:
REGION = 'us-central1'
EXPERIMENT = 'autoencoder'
SERIES = '03'

# source data
BQ_PROJECT = PROJECT_ID
BQ_DATASET = 'fraud'
BQ_TABLE = 'fraud_prepped'

# Resources for serving BigQuery Model Exports
TF_DEPLOY_IMAGE = 'us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-3:latest'
XGB_DEPLOY_IMAGE = 'us-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.0-82:latest'

# Model Training
VAR_TARGET = 'Class'
VAR_OMIT = 'transaction_id' # add more variables to the string with space delimiters

packages:

In [3]:
from google.cloud import bigquery
from google.cloud import aiplatform
from datetime import datetime
import matplotlib.pyplot as plt

clients:

In [4]:
bq = bigquery.Client()
aiplatform.init(project=PROJECT_ID, location=REGION)

parameters:

In [5]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
BUCKET = PROJECT_ID
URI = f"gs://{BUCKET}/{SERIES}/{EXPERIMENT}"
RUN_NAME = f'run-{TIMESTAMP}'

BQ_MODEL = f'{SERIES}_{EXPERIMENT}_{TIMESTAMP}'

---
## This Run

In [6]:
print(f'This run will create BQML model: {BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}')
print(f'The Timestamp Is: {TIMESTAMP}')

This run will create BQML model: statmike-mlops-349915.fraud.03_autoencoder_20221005002145
The Timestamp Is: 20221005002145


---
## Train Model

Use BigQuery ML to train an autoencoder model with hyperparameter tuning:
- [Autoencoder](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-autoencoder) with BigQuery ML (BQML)
- This uses the `splits` column that notebook `01` created
- `data_split_method = CUSTOM` uses the column in `data_split_col` to assign `TRAIN`, `EVAL`, and `TEST` data splits.
    - the `CASE` statement maps the validation data to `EVAL` as expected by hyperparameter tuning (rather than `VALIDATE`)
    - note that this is different behavior for `data_split_col` with hyperparameter tuning than without hyperparameter tuning
    - hyperparameter suggestions are based on the metric calculated with the evaluation data at each intermediate step

In [None]:
query = f"""
CREATE OR REPLACE MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`
OPTIONS (
        model_type = 'AUTOENCODER',
        activation_fn = HPARAM_CANDIDATES(['RELU', 'TANH']),
        batch_size = HPARAM_RANGE(15, 100),
        dropout = HPARAM_RANGE(.1, .9),
        early_stop = TRUE,
        hidden_units = HPARAM_CANDIDATES([struct([64, 8, 64]), struct([128, 64, 8, 64, 128]), struct([256, 64, 8, 64, 128])]),
        max_iterations = 50,
        min_rel_progress = 0.0001,
        optimizer = HPARAM_CANDIDATES(['SGD', 'ADAM']),
        data_split_col = 'custom_splits',
        data_split_method = 'CUSTOM',
        hparam_tuning_algorithm = 'VIZIER_DEFAULT',
        hparam_tuning_objectives = ['mean_absolute_error'],
        num_trials = 40,
        max_parallel_trials = 5
    ) AS
SELECT * EXCEPT({','.join(VAR_OMIT.split())}, splits, {VAR_TARGET}),
    CASE
        WHEN splits = 'VALIDATE' THEN 'EVAL'
        ELSE splits
    END AS custom_splits
FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`
"""
job = bq.query(query = query)
job.result()

In [None]:
(job.ended-job.started).total_seconds()

In [12]:
feature_info = bq.query(
    query = f"""
        SELECT *
        FROM ML.FEATURE_INFO(MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`)
        """
).to_dataframe()
feature_info

|   | input | min | max | mean | median | stddev | category_count | null_count | dimension |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Amount | 0.0 | 25691.16 | 87.977372 | 21.61 | 242.05578 |  | 0 |  |
| 1 | Time | 0.0 | 172792.0 | 94865.379565 | 85104.0 | 47485.090893 |  | 0 |  |
| 2 | V1 | -56.40751 | 2.45493 | 0.00253 | 0.004484 | 1.954377 |  | 0 |  |
| 3 | V10 | -24.588262 | 23.745136 | 0.00038 | -0.098514 | 1.088245 |  | 0 |  |
| 4 | V11 | -4.797473 | 12.018913 | 0.000876 | -0.039446 | 1.0212 |  | 0 |  |
| 5 | V12 | -18.683715 | 7.848392 | -0.000217 | 0.135537 | 1.00153 |  | 0 |  |
| 6 | V13 | -5.791881 | 7.126883 | -0.000889 | -0.024213 | 0.995349 |  | 0 |  |
| 7 | V14 | -19.214325 | 10.526766 | -0.000488 | 0.039 | 0.960284 |  | 0 |  |
| 8 | V15 | -4.498945 | 8.877742 | 0.000631 | 0.051657 | 0.915896 |  | 0 |  |
| 9 | V16 | -14.129855 | 17.315112 | -0.000698 | 0.066888 | 0.87671 |  | 0 |  |


In [13]:
trials = bq.query(
    query = f"""
        SELECT *
        FROM ML.TRIAL_INFO(MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`)
        """
).to_dataframe()
trials

|   | trial_id | hyperparameters | hparam_tuning_evaluation_metrics | training_loss | eval_loss | status | error_message | is_optimal |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | {'hidden_units': [64, 8, 64], 'batch_size': 32... | {'mean_absolute_error': 0.33118679988068384} | 0.010028 | 1.7e-05 | SUCCEEDED |  | False |
| 1 | 2 | {'hidden_units': [128, 64, 8, 64, 128], 'batch... | {'mean_absolute_error': 0.4067171805501922} | 0.014132 | 2e-05 | SUCCEEDED |  | False |
| 2 | 3 | {'hidden_units': [256, 64, 8, 64, 128], 'batch... | {'mean_absolute_error': 0.2495703042107863} | 0.00901 | 1.3e-05 | SUCCEEDED |  | False |
| 3 | 4 | {'hidden_units': [256, 64, 8, 64, 128], 'batch... | {'mean_absolute_error': 0.5526720246906225} | 0.028767 | 2.9e-05 | SUCCEEDED |  | False |
| 4 | 5 | {'hidden_units': [64, 8, 64], 'batch_size': 55... | {'mean_absolute_error': 0.27614641235369064} | 0.002907 | 1.4e-05 | SUCCEEDED |  | False |
| 5 | 6 | {'hidden_units': [256, 64, 8, 64, 128], 'batch... | {'mean_absolute_error': 0.25781977268588313} | 0.009239 | 1.3e-05 | SUCCEEDED |  | False |
| 6 | 7 | {'hidden_units': [256, 64, 8, 64, 128], 'batch... | {'mean_absolute_error': 0.2557016759934491} | 0.006807 | 1.3e-05 | SUCCEEDED |  | False |
| 7 | 8 | {'hidden_units': [256, 64, 8, 64, 128], 'batch... | {'mean_absolute_error': 0.23901097992636666} | 0.004654 | 1.3e-05 | SUCCEEDED |  | True |
| 8 | 9 | {'hidden_units': [64, 8, 64], 'batch_size': 15... | {'mean_absolute_error': 0.2734102713835557} | 0.007059 | 1.4e-05 | SUCCEEDED |  | False |
| 9 | 10 | {'hidden_units': [256, 64, 8, 64, 128], 'batch... | {'mean_absolute_error': 0.25222028827463244} | 0.002257 | 1.3e-05 | SUCCEEDED |  | False |
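The `is_optimal` flag can be cross-checked by hand. The sketch below copies the `mean_absolute_error` values of the ten trials shown above into a plain list and picks the smallest one, since a lower reconstruction error is better for this objective. The variable names are illustrative only.

```python
# (trial_id, mean_absolute_error) pairs copied from the ML.TRIAL_INFO output above
shown_trials = [
    (1, 0.33118679988068384),
    (2, 0.4067171805501922),
    (3, 0.2495703042107863),
    (4, 0.5526720246906225),
    (5, 0.27614641235369064),
    (6, 0.25781977268588313),
    (7, 0.2557016759934491),
    (8, 0.23901097992636666),
    (9, 0.2734102713835557),
    (10, 0.25222028827463244),
]

# The tuning objective is minimized, so the best trial has the lowest error.
best_id, best_mae = min(shown_trials, key=lambda pair: pair[1])
print(f"best trial among those shown: {best_id} (MAE {best_mae:.4f})")
```

Among the rows shown, this agrees with the trial that BQML marks as `is_optimal`. With the `trials` DataFrame in memory, filtering on the `is_optimal` column should return the same row.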


In [15]:
# named `evaluation` to avoid shadowing Python's built-in `eval`
evaluation = bq.query(
    query = f"""
        SELECT *
        FROM ML.EVALUATE(MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`)
        """
).to_dataframe()
evaluation

|   | trial_id | mean_absolute_error | mean_squared_error | mean_squared_log_error |
|---|---|---|---|---|
| 0 | 1 | 0.327874 | 0.425683 | 0.039454 |
| 1 | 2 | 0.403131 | 0.537158 | 0.0332 |
| 2 | 3 | 0.245562 | 0.338232 | 0.025611 |
| 3 | 4 | 0.549455 | 0.803867 | 0.11827 |
| 4 | 5 | 0.272348 | 0.364448 | 0.029854 |
| 5 | 6 | 0.253953 | 0.346932 | 0.027248 |
| 6 | 7 | 0.252237 | 0.34753 | 0.026611 |
| 7 | 8 | 0.235408 | 0.332809 | 0.024274 |
| 8 | 9 | 0.270077 | 0.370347 | 0.030345 |
| 9 | 10 | 0.248972 | 0.342107 | 0.02614 |


In [None]:
query = f"""
SELECT *
FROM ML.PREDICT (MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`,(
    SELECT *
    FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`
    WHERE splits = 'TEST')
  )
"""
pred = bq.query(query = query).to_dataframe()

In [None]:
pred.head()

In [21]:
query = f"""
SELECT *
FROM ML.DETECT_ANOMALIES (
    MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`,
    STRUCT (0.01 AS contamination),
    (SELECT *
    FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`
    WHERE splits = 'TEST')
  )
"""
anomalies = bq.query(query = query).to_dataframe()

In [None]:
anomalies
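A rough sketch of how `contamination` turns reconstruction errors into anomaly flags. It assumes, following the BQML documentation for autoencoder models, that the cut-off is taken from the largest reconstruction errors seen in training and that a row is flagged when its error is higher than that cut-off. The exact calculation inside BQML is not reproduced here, so `error_threshold` and `flag_anomalies` are hypothetical helpers working on made-up numbers.

```python
def error_threshold(train_errors, contamination):
    """Return the error value that separates the top `contamination` share of training errors."""
    ranked = sorted(train_errors, reverse=True)       # largest errors first
    k = max(1, int(round(contamination * len(ranked))))
    return ranked[k - 1]


def flag_anomalies(errors, threshold):
    """Flag rows whose reconstruction error is higher than the threshold."""
    return [e > threshold for e in errors]


# Made-up reconstruction errors, for illustration only
train_errors = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
threshold = error_threshold(train_errors, contamination=0.2)
flags = flag_anomalies([0.95, 0.5], threshold)
print(threshold, flags)
```

A smaller `contamination` pushes the threshold toward the largest errors, so fewer rows are flagged.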

In [None]:
query = f"""
WITH ANOMALIES AS (
        SELECT is_anomaly, {VAR_TARGET}
        FROM ML.DETECT_ANOMALIES (
            MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`,
            STRUCT (0.001 AS contamination),
            (SELECT *
            FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`
            WHERE splits = 'TEST')
          )
      )
SELECT is_anomaly, {VAR_TARGET}, count(*) as count
FROM ANOMALIES
GROUP BY is_anomaly, {VAR_TARGET}
"""
bq.query(query = query).to_dataframe()
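The grouped counts from this query form a 2x2 confusion table, which can be summarized as precision and recall. The sketch below assumes that `Class = 1` marks the positive (fraud) case; the helper `precision_recall` is hypothetical and the counts are made up for illustration, not results from this run.

```python
def precision_recall(rows, positive=1):
    """Compute precision and recall from rows of {is_anomaly, Class, count}."""
    tp = sum(r["count"] for r in rows if r["is_anomaly"] and r["Class"] == positive)
    fp = sum(r["count"] for r in rows if r["is_anomaly"] and r["Class"] != positive)
    fn = sum(r["count"] for r in rows if not r["is_anomaly"] and r["Class"] == positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up counts in the same shape as the query result
example_rows = [
    {"is_anomaly": False, "Class": 0, "count": 900},
    {"is_anomaly": True, "Class": 0, "count": 20},
    {"is_anomaly": False, "Class": 1, "count": 5},
    {"is_anomaly": True, "Class": 1, "count": 15},
]

precision, recall = precision_recall(example_rows)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

With the real result, calling `to_dict('records')` on the returned DataFrame gives rows in this shape.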