# DEV - K-Means

- Overview: try PCA, **K-Means**, autoencoder
- Idea: compare anomalies to the predicted class
- Thought: but these features are already principal components...

**Hyperparameter Tuning**

When training a machine learning model, it helps to find good values for hyperparameters: parameters that are set before training begins. They are not learned during training the way a model's coefficients are. Rather than iterating over these values manually, we want to test them sequentially and home in on the optimal values. By default, BQML handles the homing-in part of the iterations with the [Vertex AI Vizier](https://cloud.google.com/vertex-ai/docs/vizier/overview) service.

Each `MODEL_TYPE` in BQML has parameters that can be tuned, as [listed here](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-hyperparameter-tuning#hyperparameters_and_objectives).

**Prerequisites:**
-  01 - BigQuery - Table Data Source

**Resources:**
-  [BigQuery ML (BQML) Overview](https://cloud.google.com/bigquery-ml/docs/introduction)
-  [Overview of BQML methods and workflows](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-e2e-journey)

**Conceptual Flow & Workflow**


---
## Setup

inputs:

In [1]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [2]:
REGION = 'us-central1'
EXPERIMENT = 'kmeans'
SERIES = '03'

# source data
BQ_PROJECT = PROJECT_ID
BQ_DATASET = 'fraud'
BQ_TABLE = 'fraud_prepped'

# Resources for serving BigQuery Model Exports
TF_DEPLOY_IMAGE = 'us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-3:latest'
XGB_DEPLOY_IMAGE = 'us-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.0-82:latest'

# Model Training
VAR_TARGET = 'Class'
VAR_OMIT = 'transaction_id' # add more variables to the string with space delimiters
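
The space-delimited `VAR_OMIT` string is expanded later into the comma-separated column list of the `EXCEPT(...)` clause in the training query. A minimal sketch of that expansion follows; the helper name `except_clause` and the second column `customer_ref` are hypothetical and only illustrate the multi-column case:

```python
# Sketch: how a space-delimited VAR_OMIT becomes the EXCEPT(...) column list.
# `except_clause` and the column `customer_ref` are hypothetical names.

def except_clause(var_omit: str, var_target: str) -> str:
    """Build the EXCEPT(...) list that drops non-feature columns."""
    omitted = ','.join(var_omit.split())  # 'a b' -> 'a,b'
    return f"EXCEPT({omitted}, splits, {var_target})"

single = except_clause('transaction_id', 'Class')
multiple = except_clause('transaction_id customer_ref', 'Class')

print(single)    # EXCEPT(transaction_id, splits, Class)
print(multiple)  # EXCEPT(transaction_id,customer_ref, splits, Class)
```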

packages:

In [3]:
from google.cloud import bigquery
from google.cloud import aiplatform
from datetime import datetime
import matplotlib.pyplot as plt

clients:

In [4]:
bq = bigquery.Client()
aiplatform.init(project=PROJECT_ID, location=REGION)

parameters:

In [5]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
BUCKET = PROJECT_ID
URI = f"gs://{BUCKET}/{SERIES}/{EXPERIMENT}"
RUN_NAME = f'run-{TIMESTAMP}'

BQ_MODEL = f'{SERIES}_{EXPERIMENT}_{TIMESTAMP}'

---
## This Run

In [6]:
print(f'This run will create BQML model: {BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}')
print(f'The Timestamp Is: {TIMESTAMP}')

This run will create BQML model: statmike-mlops-349915.fraud.03_kmeans_20221005000845
The Timestamp Is: 20221005000845


---
## Train Model

Use BigQuery ML to train a K-Means clustering model with hyperparameter tuning:
- [K-Means](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-kmeans) with BigQuery ML (BQML)
- This uses the `splits` column that notebook `01` created
- The query below does not set `data_split_method` or `data_split_col`. Instead it filters to rows where `splits = 'TRAIN'`.
    - the `EXCEPT(...)` clause drops the `splits` column, the columns listed in `VAR_OMIT`, and the target `Class`, so clustering uses only the features
    - `num_clusters = HPARAM_RANGE(2, 100)` is the tuned hyperparameter, searched over `num_trials = 20` trials with at most two running in parallel
    - hyperparameter suggestions are based on the `davies_bouldin_index` named in `hparam_tuning_objectives`, where lower values indicate tighter, better-separated clusters
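
The tuning objective in the training query is `davies_bouldin_index`. As a rough illustration of what this metric measures, here is a NumPy sketch of the textbook definition: for each cluster, take the worst ratio of (own scatter + other scatter) to the distance between the two centroids, then average over clusters. The function name and the toy points are assumptions for illustration, and BQML's own computation (for example on standardized features) may differ in detail.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Textbook Davies-Bouldin index; lower means tighter, better-separated clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    # Centroid of each cluster.
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # Scatter: mean distance from each cluster's points to its centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    k = len(ids)
    worst = np.zeros(k)
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        worst[i] = max(
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i
        )
    return worst.mean()

# Toy data (made up): two tight groups of points, far apart.
X = [[0, 0], [0, 1], [10, 0], [10, 1]]
tight_labels = [0, 0, 1, 1]   # clusters match the two groups
mixed_labels = [0, 1, 0, 1]   # clusters cut across the groups

print(davies_bouldin(X, tight_labels))  # small value: good clustering
print(davies_bouldin(X, mixed_labels))  # large value: poor clustering
```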

In [None]:
query = f"""
CREATE OR REPLACE MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`
OPTIONS (
        model_type = 'KMEANS',
        num_clusters = HPARAM_RANGE(2, 100),
        kmeans_init_method = 'KMEANS++',
        distance_type = 'EUCLIDEAN', 
        standardize_features = TRUE,
        early_stop = FALSE,
        hparam_tuning_algorithm = 'VIZIER_DEFAULT',
        hparam_tuning_objectives = ['davies_bouldin_index'],
        num_trials = 20,
        max_parallel_trials = 2
    ) AS
SELECT * EXCEPT({','.join(VAR_OMIT.split())}, splits, {VAR_TARGET})
FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`
WHERE splits = 'TRAIN'
"""
job = bq.query(query = query)
job.result()

In [None]:
(job.ended-job.started).total_seconds()

In [11]:
feature_info = bq.query(
    query = f"""
        SELECT *
        FROM ML.FEATURE_INFO(MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`)
        """
).to_dataframe()
feature_info

| | input | min | max | mean | median | stddev | category_count | null_count | dimension |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Time | 0.0 | 172792.0 | 94813.86 | 85945.0 | 47488.145955 | | 0 | |
| 1 | V1 | -56.40751 | 2.45493 | 1.265369e-15 | -0.036327 | 1.958696 | | 0 | |
| 2 | V2 | -72.715728 | 22.057729 | 1.953067e-15 | 0.052912 | 1.651309 | | 0 | |
| 3 | V3 | -48.325589 | 9.382558 | -1.608274e-15 | 0.204649 | 1.516255 | | 0 | |
| 4 | V4 | -5.683171 | 16.875344 | 1.511935e-15 | -0.004035 | 1.415869 | | 0 | |
| 5 | V5 | -113.743307 | 34.801666 | 3.191722e-16 | -0.056772 | 1.380247 | | 0 | |
| 6 | V6 | -26.160506 | 73.301626 | 1.478562e-15 | -0.272171 | 1.332271 | | 0 | |
| 7 | V7 | -43.557242 | 120.589494 | -2.495469e-16 | 0.03211 | 1.237094 | | 0 | |
| 8 | V8 | -73.216718 | 20.007208 | 1.789982e-17 | 0.025596 | 1.194353 | | 0 | |
| 9 | V9 | -13.434066 | 15.594995 | -1.991638e-15 | -0.036182 | 1.098632 | | 0 | |


In [14]:
trials = bq.query(
    query = f"""
        SELECT *
        FROM ML.TRIAL_INFO(MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`)
        """
).to_dataframe()
trials

| | trial_id | hyperparameters | hparam_tuning_evaluation_metrics | training_loss | eval_loss | status | error_message | is_optimal |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | {'num_clusters': 51} | {'davies_bouldin_index': 1.8537163907415874} | 15.009085 | | SUCCEEDED | | False |
| 1 | 2 | {'num_clusters': 73} | {'davies_bouldin_index': 2.082179750734141} | 13.896301 | | SUCCEEDED | | False |
| 2 | 3 | {'num_clusters': 27} | {'davies_bouldin_index': 2.247807526819555} | 18.409255 | | SUCCEEDED | | False |
| 3 | 4 | {'num_clusters': 44} | {'davies_bouldin_index': 1.8148688451118147} | 15.821209 | | SUCCEEDED | | True |
| 4 | 5 | {'num_clusters': 100} | {'davies_bouldin_index': 2.1268906398614535} | 12.697958 | | SUCCEEDED | | False |
| 5 | 6 | {'num_clusters': 47} | {'davies_bouldin_index': 2.147572198995448} | 16.196316 | | SUCCEEDED | | False |
| 6 | 7 | {'num_clusters': 61} | {'davies_bouldin_index': 2.0466808052250327} | 14.473987 | | SUCCEEDED | | False |
| 7 | 8 | {'num_clusters': 38} | {'davies_bouldin_index': 1.9749450498547934} | 16.654206 | | SUCCEEDED | | False |
| 8 | 9 | {'num_clusters': 2} | {'davies_bouldin_index': 5.167350476826834} | 29.452045 | | SUCCEEDED | | False |
| 9 | 10 | {'num_clusters': 87} | {'davies_bouldin_index': 2.072795598671615} | 13.276475 | | SUCCEEDED | | False |
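
Because a lower Davies-Bouldin index is better, the preferred trial is the one with the minimum value. The sketch below rebuilds only the ten rows displayed above (rounded to six decimals) in a pandas DataFrame and picks that minimum; the name `trials_shown` is hypothetical, and the remaining trials of the run are not included because their values are not shown here.

```python
import pandas as pd

# The ten trials displayed above, values rounded to six decimals.
trials_shown = pd.DataFrame({
    'trial_id': list(range(1, 11)),
    'num_clusters': [51, 73, 27, 44, 100, 47, 61, 38, 2, 87],
    'davies_bouldin_index': [
        1.853716, 2.082180, 2.247808, 1.814869, 2.126891,
        2.147572, 2.046681, 1.974945, 5.167350, 2.072796,
    ],
})

# Lower Davies-Bouldin index is better, so take the row with the minimum.
best = trials_shown.loc[trials_shown['davies_bouldin_index'].idxmin()]
best_trial_id = int(best['trial_id'])
best_num_clusters = int(best['num_clusters'])

print(best_trial_id, best_num_clusters)  # 4 44
```

This agrees with the `is_optimal` flag, which is `True` only for trial 4 in the rows shown.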


In [15]:
centroids = bq.query(
    query = f"""
        SELECT *
        FROM ML.CENTROIDS(MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`)
        """
).to_dataframe()
centroids

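| | trial_id | centroid_id | feature | numerical_value | categorical_value |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Time | 106411.032258 | [] |
| 1 | 1 | 1 | V1 | -18.120056 | [] |
| 2 | 1 | 1 | V2 | -17.985045 | [] |
| 3 | 1 | 1 | V3 | -7.968524 | [] |
| 4 | 1 | 1 | V4 | 4.907112 | [] |
| ... | ... | ... | ... | ... | ... |
| 31057 | 20 | 21 | V26 | -0.119341 | [] |
| 31058 | 20 | 21 | V27 | -0.309476 | [] |
| 31059 | 20 | 21 | V28 | -0.023670 | [] |
| 31060 | 20 | 21 | Amount | 104.979444 | [] |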
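
`ML.CENTROIDS` returns one row per trial, centroid, and feature. For comparing centroids side by side it can help to pivot this long layout into one row per centroid. A minimal pandas sketch follows, using a tiny stand-in frame: the first two rows copy values shown above, while the values for centroid 2 are made up for illustration.

```python
import pandas as pd

# Stand-in for the long-format centroid output.
# Rows for centroid 1 copy values shown above; centroid 2 values are made up.
centroids_long = pd.DataFrame({
    'trial_id': [1, 1, 1, 1],
    'centroid_id': [1, 1, 2, 2],
    'feature': ['Time', 'V1', 'Time', 'V1'],
    'numerical_value': [106411.032258, -18.120056, 50000.0, 0.5],
})

# One row per (trial, centroid), one column per feature.
centroids_wide = centroids_long.pivot_table(
    index=['trial_id', 'centroid_id'],
    columns='feature',
    values='numerical_value',
)

print(centroids_wide)
```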


In [16]:
# named `evaluation` to avoid shadowing Python's built-in eval()
evaluation = bq.query(
    query = f"""
        SELECT *
        FROM ML.EVALUATE(MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`)
        """
).to_dataframe()
evaluation

| | trial_id | davies_bouldin_index | mean_squared_distance |
|---|---|---|---|
| 0 | 1 | 1.853716 | 15.009085 |
| 1 | 2 | 2.08218 | 13.896301 |
| 2 | 3 | 2.247808 | 18.409255 |
| 3 | 4 | 1.814869 | 15.821209 |
| 4 | 5 | 2.126891 | 12.697958 |
| 5 | 6 | 2.147572 | 16.196316 |
| 6 | 7 | 2.046681 | 14.473987 |
| 7 | 8 | 1.974945 | 16.654206 |
| 8 | 9 | 5.16735 | 29.452045 |
| 9 | 10 | 2.072796 | 13.276475 |


In [None]:
query = f"""
SELECT *
FROM ML.PREDICT (MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`,(
    SELECT *
    FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`
    WHERE splits = 'TEST')
  )
"""
pred = bq.query(query = query).to_dataframe()

In [None]:
pred.head()

In [21]:
query = f"""
SELECT *
FROM ML.DETECT_ANOMALIES (
    MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`,
    STRUCT (0.01 AS contamination),
    (SELECT *
    FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`
    WHERE splits = 'TEST')
  )
"""
anomalies = bq.query(query = query).to_dataframe()

In [None]:
anomalies
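
The `contamination` value passed to `ML.DETECT_ANOMALIES` can be read as the expected fraction of anomalous points, which is turned into a distance cutoff. The sketch below illustrates that idea under a stated assumption: the cutoff is the `1 - contamination` quantile of the training points' distance to their nearest centroid. This is an illustration of the concept on synthetic distances, not a description of BQML's internal implementation, and the function name is hypothetical.

```python
import numpy as np

def flag_anomalies(train_distances, new_distances, contamination):
    """Flag points whose distance exceeds the (1 - contamination) training quantile."""
    threshold = np.quantile(train_distances, 1.0 - contamination)
    return np.asarray(new_distances) > threshold, threshold

# Synthetic nearest-centroid distances (made up for illustration).
rng = np.random.default_rng(0)
train = rng.exponential(scale=1.0, size=10_000)

flags_1pct, thr_1pct = flag_anomalies(train, train, contamination=0.01)
flags_01pct, thr_01pct = flag_anomalies(train, train, contamination=0.001)

print(flags_1pct.mean(), thr_1pct)    # about 1% of training points flagged
print(flags_01pct.mean(), thr_01pct)  # about 0.1% flagged, with a higher cutoff
```

A smaller `contamination` raises the cutoff, so fewer points are flagged.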

In [None]:
query = f"""
WITH ANOMALIES AS (
        SELECT is_anomaly, {VAR_TARGET}
        FROM ML.DETECT_ANOMALIES (
            MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`,
            STRUCT (0.001 AS contamination),
            (SELECT *
            FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`
            WHERE splits = 'TEST')
          )
      )
SELECT is_anomaly, {VAR_TARGET}, count(*) as count
FROM ANOMALIES
GROUP BY is_anomaly, {VAR_TARGET}
"""
bq.query(query = query).to_dataframe()
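
The grouped counts from the query above form a 2x2 table of `is_anomaly` against `Class`, from which precision and recall of the anomaly flag can be derived. A minimal sketch follows. It assumes `Class = 1` marks fraud, and the counts are hypothetical placeholders rather than results from this run.

```python
import pandas as pd

def precision_recall(counts: pd.DataFrame, target: str = 'Class'):
    """Precision and recall of is_anomaly, treating target == 1 as the positive class."""
    flagged = counts['is_anomaly'].astype(bool)
    positive = counts[target] == 1
    tp = counts.loc[flagged & positive, 'count'].sum()    # flagged and fraud
    fp = counts.loc[flagged & ~positive, 'count'].sum()   # flagged, not fraud
    fn = counts.loc[~flagged & positive, 'count'].sum()   # fraud, not flagged
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical counts in the same shape as the query result.
example_counts = pd.DataFrame({
    'is_anomaly': [False, False, True, True],
    'Class': [0, 1, 0, 1],
    'count': [28000, 30, 20, 10],
})

precision, recall = precision_recall(example_counts)
print(precision, recall)
```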