![tracker](https://us-central1-vertex-ai-mlops-369716.cloudfunctions.net/pixel-tracking?path=statmike%2Fvertex-ai-mlops%2FApplied+GenAI%2FEmbeddings&file=BQML+Autoencoder+As+Table+Embedding.ipynb)
<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Embeddings/BQML%20Autoencoder%20As%20Table%20Embedding.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FApplied%2520GenAI%2FEmbeddings%2FBQML%2520Autoencoder%2520As%2520Table%2520Embedding.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Embeddings/BQML%20Autoencoder%20As%20Table%20Embedding.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Applied%20GenAI/Embeddings/BQML%20Autoencoder%20As%20Table%20Embedding.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

---

**File Move Notices**

This file moved locations:
- On 09/13/2024 (mm/dd/yyyy)
	- From: `Working With/Embeddings/BQML Autoencoder As Table Embedding.ipynb`
	- To: `Applied GenAI/Embeddings/BQML Autoencoder As Table Embedding.ipynb`
---
<!---end of move notices--->

# BigQuery ML Autoencoder As Table Embedding

Autoencoders are a type of neural network designed for unsupervised learning. They learn efficient representations (encodings) of input data by compressing it into a smaller representation called the "latent space." This latent space captures the most essential features of the input.

> The autoencoder also learns to reconstruct the original input data from this compressed representation using a decoder. The training process involves minimizing the difference (loss) between the original input and the reconstructed output. This comparison between input and reconstructed output serves as a form of supervision, even though the task itself is considered unsupervised.
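
To make the encoder/decoder idea concrete, here is a minimal NumPy sketch of a *linear* autoencoder trained by gradient descent on toy data. It is only an illustration under stated assumptions: the toy data, the layer sizes (6 inputs, 2 latent dimensions), the learning rate, and the names `W_enc` and `W_dec` are all hypothetical, and it is not how BigQuery ML trains its models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 rows with 6 columns that lie on a 2-dimensional subspace.
latent_true = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent_true @ mixing

# Linear encoder (6 -> 2) and decoder (2 -> 6), started from small random weights.
W_enc = rng.normal(scale=0.1, size=(6, 2))
W_dec = rng.normal(scale=0.1, size=(2, 6))

def reconstruction_mse(X, W_enc, W_dec):
    Z = X @ W_enc        # latent space (the "embedding")
    X_hat = Z @ W_dec    # reconstruction of the input
    return float(np.mean((X - X_hat) ** 2))

loss_before = reconstruction_mse(X, W_enc, W_dec)

lr = 0.01
for _ in range(2000):
    Z = X @ W_enc
    err = Z @ W_dec - X                        # reconstruction error
    # Gradients of the squared reconstruction error (up to a constant factor).
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

loss_after = reconstruction_mse(X, W_enc, W_dec)
embeddings = X @ W_enc                         # one 2-dimensional vector per row
print(loss_before, loss_after, embeddings.shape)
```

The only "label" used during training is the input itself, which is why the task is still considered unsupervised.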

With [BigQuery ML](https://cloud.google.com/bigquery/docs/bqml-introduction) you can train an [autoencoder](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-autoencoder) on tabular data.  Prediction with the [`ML.PREDICT`](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-predict#autoencoder_models) function returns the latent space (the encoder output) of the trained model.  These representations can be used as embeddings for tasks such as matching similar rows to a query row.

For a detailed review of BigQuery ML Autoencoders, check out the end-to-end workflow in this repository: [BQML Autoencoder with Anomaly Detection](../../03%20-%20BigQuery%20ML%20%28BQML%29/03i%20-%20BQML%20Autoencoder%20with%20Anomaly%20Detection.ipynb)

---
## Colab Setup

When running this notebook in [Colab](https://colab.google/) or [Colab Enterprise](https://cloud.google.com/colab/docs/introduction), this section will authenticate to GCP (follow prompts in the popup) and set the current project for the session.

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    from google.colab import auth
    auth.authenticate_user(project_id = PROJECT_ID)
    print('Colab authorized to GCP')
except Exception:
    print('Not a Colab Environment')
    pass

Not a Colab Environment


---
## Installs

The list `packages` contains tuples of package import names and install names.  If the import name is not found then the install name is used to install quietly for the current user.

In [3]:
# tuples of (import name, install name)
packages = [
    ('google.cloud.bigquery', 'google-cloud-bigquery'),
]

import importlib.util
install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
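
One caveat with this pattern: for a dotted name, `importlib.util.find_spec` imports the parent package first and raises `ModuleNotFoundError` when that parent is missing, rather than returning `None`. A small sketch of a more defensive check follows; the helper name `is_installed` is hypothetical and not part of this notebook.

```python
import importlib.util

def is_installed(import_name: str) -> bool:
    """Return True when a spec can be found for `import_name`."""
    try:
        return importlib.util.find_spec(import_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is itself missing.
        return False

print(is_installed("json"))
print(is_installed("a_package_that_does_not_exist_123.sub"))
```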

### Restart Kernel (If Installs Occurred)

After a kernel restart the code submission can start with the next cell after this one.

In [4]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
    IPython.display.display(IPython.display.Markdown("""<div class=\"alert alert-block alert-warning\">
        <b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. The previous cells do not need to be run again⚠️</b>
        </div>"""))

---
## Setup

inputs:

In [5]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [36]:
REGION = 'us-central1'
EXPERIMENT = 'bqml-autoencoder'
SERIES = 'applied-genai-embeddings'

# Data source for this series of notebooks: described in the Review Source Data section below
BQ_SOURCE = 'bigquery-public-data.ml_datasets.ulb_fraud_detection'

# make this the BQ Project / Dataset / Table prefix to store results
BQ_PROJECT = PROJECT_ID
BQ_DATASET = SERIES.replace('-', '_')
BQ_TABLE = EXPERIMENT
BQ_REGION = REGION[0:2] # use multi-region (the first two characters)
BQ_MODEL = EXPERIMENT+'-model'

packages:

In [10]:
import numpy as np
from google.cloud import bigquery

clients:

In [11]:
bq = bigquery.Client(project = PROJECT_ID)

---
## Review Source Data

This is a BigQuery public table of 284,807 credit card transactions classified as fraudulent or normal in the column `Class`.
- The data can be researched further at this [Kaggle link](https://www.kaggle.com/mlg-ulb/creditcardfraud).
- Read more about BigQuery public datasets [here](https://cloud.google.com/bigquery/public-data)

In order to protect confidentiality, the original features have been transformed using [principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) into 28 features named `V1, V2, ... V28` (float).  Two descriptive features are provided without transformation by PCA:
- `Time` (integer) is the seconds elapsed between the transaction and the earliest transaction in the table
- `Amount` (float) is the value of the transaction
 

### Review BigQuery table:

In [22]:
source_data = bq.query(f'SELECT * FROM `{BQ_SOURCE}` LIMIT 5').to_dataframe()
source_data

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,282.0,-0.356466,0.725418,1.971749,0.831343,0.369681,-0.107776,0.75161,-0.120166,-0.420675,...,0.020804,0.424312,-0.015989,0.466754,-0.809962,0.657334,-0.04315,-0.046401,0.0,0
1,14332.0,1.07195,0.340678,1.784068,2.846396,-0.751538,0.403028,-0.73492,0.205807,1.092726,...,-0.169632,-0.113604,0.067643,0.468669,0.223541,-0.112355,0.014015,0.021504,0.0,0
2,32799.0,1.153477,-0.047859,1.358363,1.48062,-1.222598,-0.48169,-0.654461,0.128115,0.907095,...,0.125514,0.480049,-0.025964,0.701843,0.417245,-0.257691,0.060115,0.035332,0.0,0
3,35799.0,-0.769798,0.622325,0.242491,-0.586652,0.527819,-0.104512,0.209909,0.669861,-0.304509,...,0.152738,0.255654,-0.130237,-0.660934,-0.493374,0.331855,-0.011101,0.049089,0.0,0
4,36419.0,1.04796,0.145048,1.624573,2.932652,-0.726574,0.690451,-0.627288,0.278709,0.318434,...,0.078499,0.658942,-0.06781,0.476882,0.52683,0.219902,0.070627,0.028488,0.0,0


In [23]:
source_data.dtypes

Time      float64
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class       Int64
dtype: object

---
## Prepare Data Source

Data preparation adds a column named `splits` that assigns 80% of rows to training (`TRAIN`), 10% to validation (`VALIDATE`) and 10% to testing (`TEST`).  It also adds a unique identifier for each transaction, `transaction_id`. 

### Create/Recall Dataset

In [24]:
dataset = bigquery.Dataset(f"{BQ_PROJECT}.{BQ_DATASET}")
dataset.location = BQ_REGION
bq_dataset = bq.create_dataset(dataset, exists_ok = True)

### Create/Recall Table With Preparation For ML

Copy the data from the source while adding columns:
- `transaction_id` as a unique identifier for the row
    - Use the `GENERATE_UUID()` function
- `splits` column to randomly assign rows to "TRAIN", "VALIDATE" and "TEST" groups
    - Use a CASE statement on the last digit of `ABS(FARM_FINGERPRINT(transaction_id))`, obtained with `MOD(..., 10)`, to assign "TRAIN" when the digit is 0 through 7, "VALIDATE" when 8, and "TEST" when 9.  This gives an approximately 80/10/10 split. 
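
The hash-then-bucket idea can be sketched outside BigQuery as follows. This is an illustration only: it uses SHA-256 from the Python standard library as a stand-in for `FARM_FINGERPRINT`, so the bucket assigned to a given ID will not match what BigQuery computes, and the function name `assign_split` and the generated IDs are hypothetical.

```python
import hashlib
import uuid
from collections import Counter

def assign_split(transaction_id: str) -> str:
    # Stand-in hash: SHA-256 instead of FARM_FINGERPRINT (buckets differ from BigQuery's).
    digest = hashlib.sha256(transaction_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10   # a value from 0 to 9
    if bucket < 8:
        return "TRAIN"
    if bucket < 9:
        return "VALIDATE"
    return "TEST"

# Deterministic IDs for illustration; the notebook uses GENERATE_UUID() instead.
ids = [str(uuid.UUID(int=i)) for i in range(10_000)]
counts = Counter(assign_split(i) for i in ids)
shares = {name: counts[name] / len(ids) for name in ("TRAIN", "VALIDATE", "TEST")}
print(shares)
```

Because the split is a pure function of the ID, the same row always lands in the same group, which keeps the assignment stable once the IDs are stored in the table.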

In [31]:
job = bq.query(f"""
CREATE TABLE IF NOT EXISTS `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` AS
WITH add_id AS (SELECT *, GENERATE_UUID() transaction_id FROM `{BQ_SOURCE}`)
SELECT *,
    CASE 
        WHEN MOD(ABS(FARM_FINGERPRINT(transaction_id)),10) < 8 THEN "TRAIN" 
        WHEN MOD(ABS(FARM_FINGERPRINT(transaction_id)),10) < 9 THEN "VALIDATE"
        ELSE "TEST"
    END AS splits
FROM add_id
""")
job.result()
(job.ended-job.started).total_seconds()

0.341

In [32]:
bq.query(f'SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` LIMIT 5').to_dataframe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V23,V24,V25,V26,V27,V28,Amount,Class,transaction_id,splits
0,139889.0,2.180042,0.195697,-2.906568,0.047321,1.575817,-0.405608,0.861374,-0.372075,-0.416297,...,-0.28167,-0.245863,0.808016,0.737577,-0.127611,-0.106866,0.0,0,2056d104-31a6-406b-934f-daae83ae775d,TEST
1,50570.0,-0.922055,1.514612,1.846089,0.701321,0.212083,-0.412444,1.06109,-0.67687,-0.007496,...,-0.101959,0.612442,-0.420179,-0.541617,-0.729131,-0.24372,0.0,0,aa620abf-28a1-451c-9c58-1b8fc3ee4492,TEST
2,84079.0,-0.959494,0.94699,1.760368,2.714721,0.641016,0.217552,0.379313,0.005477,-1.744816,...,0.235519,0.026976,-0.182005,0.109672,-0.024543,0.062964,0.0,0,db0d156d-1d1e-4c02-8ed1-a745c7a41043,TEST
3,159888.0,-3.146402,2.543688,-0.328957,2.499684,-0.112949,0.959888,-0.501032,0.632631,0.272793,...,0.085183,0.21283,-0.312526,-0.24838,-2.73144,-0.754864,0.0,0,85ede956-7e8c-4428-b671-884c334429de,TEST
4,24481.0,-0.257764,1.496135,1.775952,2.024421,0.821847,-0.599382,1.252612,-0.447357,-0.182781,...,-0.047306,0.249045,-0.167503,-0.435957,-0.168028,-0.168385,0.0,0,725437fc-5c21-4196-9f4a-44b150e3528c,TEST


### Review the number of records for each level of `Class` for each of the data splits:

In [33]:
bq.query(f'SELECT splits, class, count(*) as count FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` GROUP BY splits, class').to_dataframe()

Unnamed: 0,splits,class,count
0,TEST,0,28471
1,TEST,1,47
2,TRAIN,0,227429
3,TRAIN,1,405
4,VALIDATE,0,28415
5,VALIDATE,1,40
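
As a quick sanity check on the counts above, the following sketch recomputes each split's share of the rows and its fraud rate. The numbers are copied from the query result shown above; the variable names are just for illustration.

```python
# Counts copied from the query output above: (split, Class) -> rows
counts = {
    ("TEST", 0): 28471, ("TEST", 1): 47,
    ("TRAIN", 0): 227429, ("TRAIN", 1): 405,
    ("VALIDATE", 0): 28415, ("VALIDATE", 1): 40,
}

total = sum(counts.values())
summary = {}
for split in ("TRAIN", "VALIDATE", "TEST"):
    rows = counts[(split, 0)] + counts[(split, 1)]
    summary[split] = {
        "rows": rows,
        "share": rows / total,                    # fraction of all rows in this split
        "fraud_rate": counts[(split, 1)] / rows,  # fraction of the split with Class = 1
    }

print(total)
for split, values in summary.items():
    print(split, values)
```

The total matches the 284,807 rows of the source table, the shares are close to 80/10/10, and fraud is well under 1% of the rows in every split.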


---
## Train Model

Use BigQuery ML to train an unsupervised autoencoder model:
- [Autoencoder](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-autoencoder) with BigQuery ML (BQML)
- The `splits` column is not used directly by the `AUTOENCODER` training; it is used here only to subset the training data to rows where `splits = 'TRAIN'`

In [37]:
query = f"""
CREATE OR REPLACE MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`
OPTIONS (
        model_type = 'AUTOENCODER',
        activation_fn = 'RELU',
        batch_size = 30,
        dropout = .5,
        early_stop = TRUE,
        hidden_units = [128, 64, 8, 64, 128],
        max_iterations = 30,
        min_rel_progress = 0.001,
        optimizer = 'ADAM'
    ) AS
SELECT * EXCEPT(Class, splits, transaction_id),
FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`
WHERE splits = 'TRAIN'
"""
print(query)


CREATE OR REPLACE MODEL `statmike-mlops-349915.applied_genai_embeddings.bqml-autoencoder-model`
OPTIONS (
        model_type = 'AUTOENCODER',
        activation_fn = 'RELU',
        batch_size = 30,
        dropout = .5,
        early_stop = TRUE,
        hidden_units = [128, 64, 8, 64, 128],
        max_iterations = 30,
        min_rel_progress = 0.001,
        optimizer = 'ADAM'
    ) AS
SELECT * EXCEPT(Class, splits, transaction_id),
FROM `statmike-mlops-349915.applied_genai_embeddings.bqml-autoencoder`
WHERE splits = 'TRAIN'



In [38]:
job = bq.query(query = query)
job.result()
(job.ended-job.started).total_seconds()

1952.27

In [39]:
job.total_bytes_processed/1e6 #mb

48227.393777

### Evaluate Model

Calculate evaluation statistics with [ML.EVALUATE](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-evaluate):

In [40]:
bq.query(
    query = f"""
        SELECT *
        FROM ML.EVALUATE(
            MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`,
            (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` WHERE splits = 'TRAIN')
        )
        """
).to_dataframe()

Unnamed: 0,mean_absolute_error,mean_squared_error,mean_squared_log_error
0,0.454548,0.646509,0.037982


In [41]:
bq.query(
    query = f"""
        SELECT *
        FROM ML.EVALUATE(
            MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`,
            (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` WHERE splits = 'VALIDATE')
        )
        """
).to_dataframe()

Unnamed: 0,mean_absolute_error,mean_squared_error,mean_squared_log_error
0,0.455129,0.666135,0.038142


In [42]:
bq.query(
    query = f"""
        SELECT *
        FROM ML.EVALUATE(
            MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`,
            (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` WHERE splits = 'TEST')
        )
        """
).to_dataframe()

Unnamed: 0,mean_absolute_error,mean_squared_error,mean_squared_log_error
0,0.456433,0.658797,0.038213
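
A short sketch for comparing the three evaluations: it expresses the validation and test mean squared error relative to the training value. The numbers are copied from the outputs above. Reading a small gap as a sign that reconstruction quality carries over to held-out rows is an interpretation, not something `ML.EVALUATE` reports.

```python
# mean_squared_error values copied from the ML.EVALUATE outputs above
mse = {"TRAIN": 0.646509, "VALIDATE": 0.666135, "TEST": 0.658797}

# Relative difference of each split's error from the training error
relative_gap = {split: value / mse["TRAIN"] - 1 for split, value in mse.items()}

for split, gap in relative_gap.items():
    print(f"{split}: {gap:+.1%}")
```

Here the held-out errors are within a few percent of the training error.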


### Predictions

Retrieve the latent space prediction with [ML.PREDICT](https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-predict).

In [43]:
query = f"""
SELECT [latent_col_1, latent_col_2, latent_col_3, latent_col_4, latent_col_5, latent_col_6, latent_col_7, latent_col_8] as embedding, transaction_id
FROM ML.PREDICT (
    MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`,
    (SELECT *
    FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`
    WHERE splits = 'TEST'
    LIMIT 5)
)
"""
pred = bq.query(query = query).to_dataframe()
pred

Unnamed: 0,embedding,transaction_id
0,"[7.203158378601074, 4.551510810852051, 0.48663...",2056d104-31a6-406b-934f-daae83ae775d
1,"[1.6095829010009766, 3.792853832244873, 2.4992...",aa620abf-28a1-451c-9c58-1b8fc3ee4492
2,"[2.7257955074310303, 5.712514877319336, 3.3487...",db0d156d-1d1e-4c02-8ed1-a745c7a41043
3,"[4.329518795013428, 5.630748748779297, 4.49882...",85ede956-7e8c-4428-b671-884c334429de
4,"[2.881032943725586, 8.394695281982422, 3.82335...",725437fc-5c21-4196-9f4a-44b150e3528c


### Embeddings

Building the embedding array from the `ML.PREDICT` result requires knowing the dimension of the latent space (the encoder output layer) of the model.  The `ML.GENERATE_EMBEDDING` function makes this easier by automatically creating an array from the latent space as an output column named `ml_generate_embedding_result`.

- Retrieve the latent space prediction as embeddings with [ML.GENERATE_EMBEDDING](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-embedding).
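
If you do stay with `ML.PREDICT`, the array expression can be generated instead of typed by hand. The sketch below assumes the latent layer is the narrowest entry of `hidden_units` (8 for the model trained above) and that the output columns follow the `latent_col_N` naming seen in the `ML.PREDICT` query. The helper name `latent_array_sql` is hypothetical.

```python
def latent_array_sql(hidden_units: list[int]) -> str:
    """Build the `[latent_col_1, ...]` array expression for an ML.PREDICT select list."""
    latent_dim = min(hidden_units)  # assumption: the narrowest layer is the latent layer
    columns = ", ".join(f"latent_col_{i}" for i in range(1, latent_dim + 1))
    return f"[{columns}]"

print(latent_array_sql([128, 64, 8, 64, 128]))
```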

In [46]:
query = f"""
SELECT ml_generate_embedding_result as embedding, transaction_id
FROM ML.GENERATE_EMBEDDING (
    MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`,
    (SELECT *
    FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`
    WHERE splits = 'TEST'
    LIMIT 5)
)
"""
embeddings = bq.query(query = query).to_dataframe()
embeddings

Unnamed: 0,embedding,transaction_id
0,"[7.203158378601074, 4.551510810852051, 0.48663...",2056d104-31a6-406b-934f-daae83ae775d
1,"[1.6095829010009766, 3.792853832244873, 2.4992...",aa620abf-28a1-451c-9c58-1b8fc3ee4492
2,"[2.7257955074310303, 5.712514877319336, 3.3487...",db0d156d-1d1e-4c02-8ed1-a745c7a41043
3,"[4.329518795013428, 5.630748748779297, 4.49882...",85ede956-7e8c-4428-b671-884c334429de
4,"[2.881032943725586, 8.394695281982422, 3.82335...",725437fc-5c21-4196-9f4a-44b150e3528c


### Normalized Embeddings

For many applications, embeddings are used to find similar entities (semantic search, recommendation) or the most dissimilar entities (outlier detection).  These methods are aided by a normalized version of the embeddings, which can be computed directly in BigQuery by using the `ML.NORMALIZER` function on the array of values.
- Normalize the latent space embeddings with [ML.NORMALIZER](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-normalizer)

Read much more on [The Math of Similarity](./The%20Math%20of%20Similarity.ipynb)!
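
As a local illustration of what L2 normalization does, this NumPy sketch scales each vector to unit length and checks that the dot product of the normalized vectors equals the cosine similarity of the originals. The toy vectors and the helper name `l2_normalize` are assumptions for the example. The sketch does not guard against zero vectors, which would cause a division by zero.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit Euclidean (L2) length."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

embeddings = np.array([[3.0, 4.0], [1.0, 0.0], [2.0, 2.0]])  # toy vectors
unit = l2_normalize(embeddings)

# Cosine similarity computed directly from the unnormalized vectors
norms = np.linalg.norm(embeddings, axis=1)
cosine = (embeddings @ embeddings.T) / np.outer(norms, norms)

print(unit)
print(np.allclose(unit @ unit.T, cosine))
```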

In [47]:
query = f"""
SELECT ML.NORMALIZER(ml_generate_embedding_result, 2) as embedding_normalzied, transaction_id
FROM ML.GENERATE_EMBEDDING (
    MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`,
    (SELECT *
    FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`
    WHERE splits = 'TEST'
    LIMIT 5)
)
"""
embeddings_normalized = bq.query(query = query).to_dataframe()
embeddings_normalized

Unnamed: 0,embedding_normalzied,transaction_id
0,"[0.5572484631959302, 0.3521125416458699, 0.037...",2056d104-31a6-406b-934f-daae83ae775d
1,"[0.18012424434948443, 0.4244484270032261, 0.27...",aa620abf-28a1-451c-9c58-1b8fc3ee4492
2,"[0.24128518177106906, 0.5056671297558105, 0.29...",db0d156d-1d1e-4c02-8ed1-a745c7a41043
3,"[0.295824461603445, 0.3847340307080803, 0.3073...",85ede956-7e8c-4428-b671-884c334429de
4,"[0.19260856175990754, 0.5612189156658676, 0.25...",725437fc-5c21-4196-9f4a-44b150e3528c


---
## Similarity Matching With Embeddings And Vector Search

The normalized embeddings above can be used with vector search in BigQuery to identify the distance between observations (rows), which is helpful for many use cases.

In BigQuery we can [create vector indexes](https://cloud.google.com/bigquery/docs/vector-index) and search them with the [VECTOR_SEARCH function](https://cloud.google.com/bigquery/docs/reference/standard-sql/search_functions#vector_search).
- Read more about [vector search in BigQuery](https://cloud.google.com/bigquery/docs/vector-search)

### Add Normalized Embeddings To Table

Add the normalized embeddings to a table/view for use in similarity searches with vector search. See the content above for a description of this process.

In [63]:
query = f'''
CREATE OR REPLACE TABLE `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_embeddings` AS
SELECT * EXCEPT(ml_generate_embedding_result),
    ml_generate_embedding_result as embedding,
    ML.NORMALIZER(ml_generate_embedding_result, 2) as embedding_normalized 
FROM ML.GENERATE_EMBEDDING (
    MODEL `{BQ_PROJECT}.{BQ_DATASET}.{BQ_MODEL}`,
    (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`)
)
'''
job = bq.query(query)
job.result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7f1045c5a4d0>

In [64]:
bq.query(f'SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_embeddings` LIMIT 5').to_dataframe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V25,V26,V27,V28,Amount,Class,transaction_id,splits,embedding,embedding_normalized
0,61698.0,-0.768682,0.593944,1.906717,0.547216,0.185228,0.738125,-0.060125,0.268806,0.34314,...,0.094026,-0.218763,-0.103316,0.04478,0.0,0,2c9c234d-5d02-442e-bfcd-117f434082c1,TEST,"[2.4221367835998535, 3.1980233192443848, 1.964...","[0.29038063860808216, 0.38339868334997335, 0.2..."
1,151415.0,-1.624698,0.611314,2.394833,4.59714,1.24529,0.557947,-0.494394,0.212051,-1.827552,...,-0.257313,0.022158,-0.120621,0.150415,0.0,0,b6356bca-8224-44c2-bdbe-8164376948f0,TEST,"[3.86045241355896, 6.940500259399414, 5.306707...","[0.26914002293401107, 0.4838724063603171, 0.36..."
2,136813.0,1.923606,1.593571,-2.816262,4.471087,1.583987,-1.230085,0.836307,-0.26932,-1.238581,...,0.192239,0.027997,-0.006823,0.036704,0.0,0,5c8d987d-ff08-4496-916a-777286e86fd4,TEST,"[6.2652668952941895, 8.455583572387695, 8.5624...","[0.22088052507215444, 0.2980996293491149, 0.30..."
3,46801.0,1.069709,0.019603,1.454666,2.87182,-0.711671,0.838489,-0.74357,0.391116,0.584745,...,0.515763,0.245495,0.059666,0.021725,0.0,0,c322691a-8b16-464b-a0f3-b948468b7d58,TEST,"[4.502050399780273, 4.690890312194824, 3.86994...","[0.45989533621220974, 0.47918578996066463, 0.3..."
4,2519.0,-0.707495,1.634484,1.90961,2.58381,0.606477,0.411204,0.905654,-0.630049,-1.006667,...,-0.652769,-0.01128,-0.973249,-0.282698,0.0,0,f64d004b-2a23-4c52-9386-a8f6213a9ab0,TEST,"[2.4852662086486816, 5.80500602722168, 3.50516...","[0.2122472040373211, 0.49576029095391305, 0.29..."


### Get Matches With Vector Search (brute force) - No Vector Index

The [VECTOR_SEARCH function](https://cloud.google.com/bigquery/docs/reference/standard-sql/search_functions#vector_search) can be used directly to find matches without a [vector index](https://cloud.google.com/bigquery/docs/vector-index).  While vector indexes offer efficient ways of finding matches based on embeddings, the brute force approach of directly using `VECTOR_SEARCH` compares a query embedding to all embeddings in the table.

The options for `VECTOR_SEARCH` are:
- `base_table` is the table to search
- `column_to_search` is the column of embedding vectors to match
- `top_k` is the number of matches to return
- `distance_type` is the distance measure to use: `EUCLIDEAN`, `DOT_PRODUCT`, or `COSINE`.
    - Use `DOT_PRODUCT` here for these normalized embeddings and read more about why in [The Math of Similarity](./The%20Math%20of%20Similarity.ipynb)
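
To see why the choice of distance matters less once embeddings are normalized, here is a brute-force top-k search in NumPy over random unit vectors (toy data, not the transaction embeddings). It ranks by negative dot product, Euclidean distance, and cosine distance and shows that the three rankings agree. Treating the dot-product distance as the negated dot product is an assumption chosen to match the `-1.0` self-match distance in this notebook's results.

```python
import numpy as np

rng = np.random.default_rng(7)
base = rng.normal(size=(1000, 8))
base /= np.linalg.norm(base, axis=1, keepdims=True)   # unit-length rows
query = base[0]                                        # search for neighbors of row 0

dot_distance = -(base @ query)                 # negative dot product: smaller is closer
euclidean = np.linalg.norm(base - query, axis=1)
cosine_distance = 1.0 - (base @ query)         # valid because every vector has length 1

top_k = 10
by_dot = np.argsort(dot_distance, kind="stable")[:top_k]
by_euclidean = np.argsort(euclidean, kind="stable")[:top_k]
by_cosine = np.argsort(cosine_distance, kind="stable")[:top_k]

print(by_dot)
print(np.array_equal(by_dot, by_euclidean), np.array_equal(by_dot, by_cosine))
```

For unit vectors the squared Euclidean distance is `2 - 2 * dot`, so all three measures order neighbors the same way, and the dot product is the cheapest of the three to compute.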

In [65]:
query = f'''
SELECT
    query.transaction_id AS transaction_id,
    base.transaction_id AS match_transaction_id,
    distance
FROM
    VECTOR_SEARCH(
        TABLE `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_embeddings`,
        'embedding_normalized',
        (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_embeddings` WHERE splits = 'TEST' LIMIT 1),
        'embedding_normalized',
        top_k => 10,
        distance_type => 'DOT_PRODUCT'
        #,options => '{{"use_brute_force":true}}'
    )
ORDER BY distance
'''
bq.query(query).to_dataframe()

Unnamed: 0,transaction_id,match_transaction_id,distance
0,2c9c234d-5d02-442e-bfcd-117f434082c1,2c9c234d-5d02-442e-bfcd-117f434082c1,-1.0
1,2c9c234d-5d02-442e-bfcd-117f434082c1,9975c120-5108-4a2d-80fb-191522708dac,-0.999311
2,2c9c234d-5d02-442e-bfcd-117f434082c1,84eec6c7-d913-482d-9720-cf62795f5724,-0.998537
3,2c9c234d-5d02-442e-bfcd-117f434082c1,2f2cac8b-b202-45de-aa00-0170a1a26da4,-0.998111
4,2c9c234d-5d02-442e-bfcd-117f434082c1,aef1bc62-0e13-49b1-ad6f-ffabdfbdb9b3,-0.997749
5,2c9c234d-5d02-442e-bfcd-117f434082c1,46518583-ce2c-47a8-b979-5c1175bbc4d8,-0.997703
6,2c9c234d-5d02-442e-bfcd-117f434082c1,6e328df7-4eea-42f0-8f54-9b1858673168,-0.997698
7,2c9c234d-5d02-442e-bfcd-117f434082c1,cfd7229f-0ae1-4e17-b9df-7181e396bf64,-0.997634
8,2c9c234d-5d02-442e-bfcd-117f434082c1,05f0f5f9-653e-46b9-b865-9a7d9705655a,-0.997591
9,2c9c234d-5d02-442e-bfcd-117f434082c1,f514590b-93e9-42a4-9742-b081ac0678d4,-0.997564


### Create A Vector Index

To efficiently do a vector search it can be helpful to [create a vector index](https://cloud.google.com/bigquery/docs/vector-index) for the embeddings column.  
 - This is not required, though, because a [brute force search](https://cloud.google.com/bigquery/docs/vector-search#use_the_vector_search_function_with_brute_force) is possible as shown in the previous section.  An existing index can also be bypassed to force a brute force search, as shown later in this workflow.

There are two types of vector indexes that can be chosen in BigQuery:
- IVF Index: `index_type = "IVF"`
    - uses k-means to cluster the embedding vectors and then uses the clusters as partitions
    - k can be specified with the `NUM_LISTS` option as any INT64 <= 5000
    - updates are handled automatically
- TreeAH Index: `index_type = "TREE_AH"`
    - uses the [ScaNN algorithm](https://github.com/google-research/google-research/blob/master/scann/docs/algorithms.md) developed by Google
    - the number of partitions/clusters is trained based on the `leaf_node_embedding_count` option
        - each leaf node will contain up to the value of the parameter, which defaults to 1000 and can be any INT64 >= 500
    
Options for both index types:
- Set the default `DISTANCE_TYPE` as `EUCLIDEAN`, `DOT_PRODUCT` or `COSINE`.
    - Use `DOT_PRODUCT` here for these normalized embeddings and read more about why in [The Math of Similarity](./The%20Math%20of%20Similarity.ipynb)
    - This default can be overridden with an alternative choice when using the `VECTOR_SEARCH` function
- `NORMALIZATION_TYPE` can be chosen to normalize the vectors for you.
    - In this case the embeddings are already normalized in the `embedding_normalized` column
    - The default is `NONE` but it can be set to `L2`, which is what we used above with the `ML.NORMALIZER` function
    
Read about the [CREATE VECTOR INDEX](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_vector_index_statement) DDL statement.
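
For a rough sense of scale, this back-of-the-envelope sketch takes the description above at face value (each leaf node holds at most `leaf_node_embedding_count` embeddings) and uses the minimum allowed value of 500 with the 284,807 rows in this table. The actual partitioning is decided by BigQuery when it trains the index, so treat the result only as a lower bound under that assumption.

```python
import math

rows = 284_807                    # rows in the embeddings table (all source rows)
leaf_node_embedding_count = 500   # smallest value allowed for the option

# If no leaf holds more than 500 embeddings, at least this many leaves are needed.
min_leaf_nodes = math.ceil(rows / leaf_node_embedding_count)
print(min_leaf_nodes)
```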

In [67]:
query = f'''
CREATE VECTOR INDEX row_index ON `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_embeddings`(embedding_normalized)
OPTIONS(
    index_type = 'TREE_AH',
    tree_ah_options = '{{"leaf_node_embedding_count":500}}',
    distance_type = 'DOT_PRODUCT'
)
'''
job = bq.query(query)
job.result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7f1045c8a020>

### Check The Vector Index Status


[Get information about vector indexes](https://cloud.google.com/bigquery/docs/vector-index#get_information_about_vector_indexes)

In [68]:
query = f'''
SELECT *
FROM `{BQ_PROJECT}.{BQ_DATASET}.INFORMATION_SCHEMA.VECTOR_INDEXES`
WHERE index_status = 'ACTIVE'
    AND table_name = '{BQ_TABLE}_embeddings'
'''
bq.query(query).to_dataframe()

Unnamed: 0,index_catalog,index_schema,table_name,index_name,index_status,creation_time,last_modification_time,last_refresh_time,disable_time,disable_reason,ddl,coverage_percentage,unindexed_row_count,total_logical_bytes,total_storage_bytes
0,statmike-mlops-349915,applied_genai_embeddings,bqml-autoencoder_embeddings,row_index,ACTIVE,2024-09-16 18:58:21.032000+00:00,2024-09-16 18:58:21.032000+00:00,NaT,NaT,,CREATE VECTOR INDEX `row_index` ON `statmike-m...,0,284807,0,0


### Get Matches With Vector Search Index: Using Index

In [89]:
query = f'''
SELECT
    query.transaction_id AS transaction_id,
    base.transaction_id AS match_transaction_id,
    distance
FROM
    VECTOR_SEARCH(
        TABLE `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_embeddings`,
        'embedding_normalized',
        (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_embeddings` WHERE splits = 'TEST' LIMIT 1),
        'embedding_normalized',
        top_k => 10
        #,options => '{{"fraction_lists_to_search": 0.03}}'
    )
ORDER BY distance
'''
job = bq.query(query, job_config = bigquery.QueryJobConfig(use_query_cache=False))
job.result()
job.to_dataframe()

Unnamed: 0,transaction_id,match_transaction_id,distance
0,2c9c234d-5d02-442e-bfcd-117f434082c1,2c9c234d-5d02-442e-bfcd-117f434082c1,-1.0
1,2c9c234d-5d02-442e-bfcd-117f434082c1,9975c120-5108-4a2d-80fb-191522708dac,-0.999311
2,2c9c234d-5d02-442e-bfcd-117f434082c1,84eec6c7-d913-482d-9720-cf62795f5724,-0.998537
3,2c9c234d-5d02-442e-bfcd-117f434082c1,2f2cac8b-b202-45de-aa00-0170a1a26da4,-0.998111
4,2c9c234d-5d02-442e-bfcd-117f434082c1,aef1bc62-0e13-49b1-ad6f-ffabdfbdb9b3,-0.997749
5,2c9c234d-5d02-442e-bfcd-117f434082c1,46518583-ce2c-47a8-b979-5c1175bbc4d8,-0.997703
6,2c9c234d-5d02-442e-bfcd-117f434082c1,6e328df7-4eea-42f0-8f54-9b1858673168,-0.997698
7,2c9c234d-5d02-442e-bfcd-117f434082c1,cfd7229f-0ae1-4e17-b9df-7181e396bf64,-0.997634
8,2c9c234d-5d02-442e-bfcd-117f434082c1,05f0f5f9-653e-46b9-b865-9a7d9705655a,-0.997591
9,2c9c234d-5d02-442e-bfcd-117f434082c1,f514590b-93e9-42a4-9742-b081ac0678d4,-0.997564


Check to see if the index was used with [vector index usage information](https://cloud.google.com/bigquery/docs/vector-index#vector_index_usage)

In [90]:
job._properties['statistics']['query']['vectorSearchStatistics']

{'indexUsageMode': 'PARTIALLY_USED',
 'indexUnusedReasons': [{'code': 'ESTIMATED_PERFORMANCE_GAIN_TOO_LOW',
   'message': 'The number of queries is too low for the vector index on the base table `statmike-mlops-349915.applied_genai_embeddings.bqml-autoencoder_embeddings` to be effective.',
   'baseTable': {'projectId': 'statmike-mlops-349915',
    'datasetId': 'applied_genai_embeddings',
    'tableId': 'bqml-autoencoder_embeddings'}}]}
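
The statistics dictionary can be condensed into a one-line summary for logging. The helper below is a hypothetical convenience, not part of the BigQuery client. It only reads the `indexUsageMode` and `indexUnusedReasons` keys that appear in the outputs in this notebook.

```python
def summarize_index_usage(stats: dict) -> str:
    """Return the usage mode followed by any reason codes, e.g. 'UNUSED (CODE)'."""
    mode = stats.get("indexUsageMode", "UNKNOWN")
    codes = [reason.get("code", "UNKNOWN") for reason in stats.get("indexUnusedReasons", [])]
    if codes:
        return f"{mode} ({', '.join(codes)})"
    return mode

# Trimmed copy of the output shown above
example = {
    "indexUsageMode": "PARTIALLY_USED",
    "indexUnusedReasons": [{"code": "ESTIMATED_PERFORMANCE_GAIN_TOO_LOW"}],
}
print(summarize_index_usage(example))
print(summarize_index_usage({"indexUsageMode": "FULLY_USED"}))
```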

### Get Matches With Vector Search Index: Using Index And Specify Fraction of Lists

When setting up the index we specified `leaf_node_embedding_count = 500`, which led to a number of lists being created.  We can guide the vector search to use a larger or smaller portion of these lists by setting the `fraction_lists_to_search` option. See [VECTOR_SEARCH function details](https://cloud.google.com/bigquery/docs/reference/standard-sql/search_functions#vector_search).
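
The trade-off behind this option can be sketched with a toy partitioned search in NumPy. This is a simplified, IVF-style illustration and not BigQuery's TreeAH/ScaNN implementation: the random data, the use of randomly chosen rows as centroids, and the `search` function are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
base = rng.normal(size=(2000, 8))
base /= np.linalg.norm(base, axis=1, keepdims=True)   # unit-length rows

num_lists = 20
# Crude partitioning: random rows act as centroids (a real index trains these).
centroids = base[rng.choice(len(base), size=num_lists, replace=False)]
assignment = np.argmax(base @ centroids.T, axis=1)     # list of the closest centroid

def search(query, top_k=10, fraction_lists_to_search=0.25):
    # Only scan the lists whose centroids are closest to the query.
    n_probe = max(1, int(round(fraction_lists_to_search * num_lists)))
    nearest_lists = np.argsort(-(centroids @ query))[:n_probe]
    candidates = np.flatnonzero(np.isin(assignment, nearest_lists))
    distances = -(base[candidates] @ query)             # negative dot product
    return candidates[np.argsort(distances)[:top_k]]

query = base[0]
exact = search(query, fraction_lists_to_search=1.0)    # all lists: same as brute force
approx = search(query, fraction_lists_to_search=0.25)  # 5 of the 20 lists
recall = len(set(exact) & set(approx)) / len(exact)
print(recall)
```

Searching a larger fraction of the lists scans more rows and moves the result toward the exact answer. A smaller fraction is faster but can miss neighbors that fall in lists that were not scanned.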

In [95]:
query = f'''
SELECT
    query.transaction_id AS transaction_id,
    base.transaction_id AS match_transaction_id,
    distance
FROM
    VECTOR_SEARCH(
        TABLE `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_embeddings`,
        'embedding_normalized',
        (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_embeddings` WHERE splits = 'TEST' LIMIT 1),
        'embedding_normalized',
        top_k => 10,
        options => '{{"fraction_lists_to_search": 0.25}}'
    )
ORDER BY distance
'''
job = bq.query(query, job_config = bigquery.QueryJobConfig(use_query_cache=False))
job.result()
job.to_dataframe()

Unnamed: 0,transaction_id,match_transaction_id,distance
0,2c9c234d-5d02-442e-bfcd-117f434082c1,2c9c234d-5d02-442e-bfcd-117f434082c1,-1.0
1,2c9c234d-5d02-442e-bfcd-117f434082c1,9975c120-5108-4a2d-80fb-191522708dac,-0.999311
2,2c9c234d-5d02-442e-bfcd-117f434082c1,84eec6c7-d913-482d-9720-cf62795f5724,-0.998537
3,2c9c234d-5d02-442e-bfcd-117f434082c1,2f2cac8b-b202-45de-aa00-0170a1a26da4,-0.998111
4,2c9c234d-5d02-442e-bfcd-117f434082c1,aef1bc62-0e13-49b1-ad6f-ffabdfbdb9b3,-0.997749
5,2c9c234d-5d02-442e-bfcd-117f434082c1,46518583-ce2c-47a8-b979-5c1175bbc4d8,-0.997703
6,2c9c234d-5d02-442e-bfcd-117f434082c1,6e328df7-4eea-42f0-8f54-9b1858673168,-0.997698
7,2c9c234d-5d02-442e-bfcd-117f434082c1,cfd7229f-0ae1-4e17-b9df-7181e396bf64,-0.997634
8,2c9c234d-5d02-442e-bfcd-117f434082c1,05f0f5f9-653e-46b9-b865-9a7d9705655a,-0.997591
9,2c9c234d-5d02-442e-bfcd-117f434082c1,f514590b-93e9-42a4-9742-b081ac0678d4,-0.997564


Check to see if the index was used with [vector index usage information](https://cloud.google.com/bigquery/docs/vector-index#vector_index_usage)

In [96]:
job._properties['statistics']['query']['vectorSearchStatistics']

{'indexUsageMode': 'PARTIALLY_USED',
 'indexUnusedReasons': [{'code': 'ESTIMATED_PERFORMANCE_GAIN_TOO_LOW',
   'message': 'The number of queries is too low for the vector index on the base table `statmike-mlops-349915.applied_genai_embeddings.bqml-autoencoder_embeddings` to be effective.',
   'baseTable': {'projectId': 'statmike-mlops-349915',
    'datasetId': 'applied_genai_embeddings',
    'tableId': 'bqml-autoencoder_embeddings'}}]}

### Get Matches With Vector Search Index: Force Brute Force

Rather than using the index, find the exact nearest neighbors by searching all of the embeddings with the `use_brute_force` option set to `true`:

In [97]:
query = f'''
SELECT
    query.transaction_id AS transaction_id,
    base.transaction_id AS match_transaction_id,
    distance
FROM
    VECTOR_SEARCH(
        TABLE `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_embeddings`,
        'embedding_normalized',
        (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_embeddings` WHERE splits = 'TEST' LIMIT 1),
        'embedding_normalized',
        top_k => 10,
        options => '{{"use_brute_force":true}}'
    )
ORDER BY distance
'''
job = bq.query(query, job_config = bigquery.QueryJobConfig(use_query_cache=False))
job.result()
job.to_dataframe()

Unnamed: 0,transaction_id,match_transaction_id,distance
0,2c9c234d-5d02-442e-bfcd-117f434082c1,2c9c234d-5d02-442e-bfcd-117f434082c1,-1.0
1,2c9c234d-5d02-442e-bfcd-117f434082c1,9975c120-5108-4a2d-80fb-191522708dac,-0.999311
2,2c9c234d-5d02-442e-bfcd-117f434082c1,84eec6c7-d913-482d-9720-cf62795f5724,-0.998537
3,2c9c234d-5d02-442e-bfcd-117f434082c1,2f2cac8b-b202-45de-aa00-0170a1a26da4,-0.998111
4,2c9c234d-5d02-442e-bfcd-117f434082c1,aef1bc62-0e13-49b1-ad6f-ffabdfbdb9b3,-0.997749
5,2c9c234d-5d02-442e-bfcd-117f434082c1,46518583-ce2c-47a8-b979-5c1175bbc4d8,-0.997703
6,2c9c234d-5d02-442e-bfcd-117f434082c1,6e328df7-4eea-42f0-8f54-9b1858673168,-0.997698
7,2c9c234d-5d02-442e-bfcd-117f434082c1,cfd7229f-0ae1-4e17-b9df-7181e396bf64,-0.997634
8,2c9c234d-5d02-442e-bfcd-117f434082c1,05f0f5f9-653e-46b9-b865-9a7d9705655a,-0.997591
9,2c9c234d-5d02-442e-bfcd-117f434082c1,f514590b-93e9-42a4-9742-b081ac0678d4,-0.997564


In [98]:
job._properties['statistics']['query']['vectorSearchStatistics']

{'indexUsageMode': 'UNUSED',
 'indexUnusedReasons': [{'code': 'INDEX_SUPPRESSED_BY_FUNCTION_OPTION',
   'message': 'The vector index `row_index` of the base table `statmike-mlops-349915:applied_genai_embeddings.bqml-autoencoder_embeddings` was not used because use_brute_force option has been specified.',
   'baseTable': {'projectId': 'statmike-mlops-349915',
    'datasetId': 'applied_genai_embeddings',
    'tableId': 'bqml-autoencoder_embeddings'},
   'indexName': 'row_index'}]}

### Get Batch Matches With Vector Search

Get a list of matches for multiple query embeddings in a single query:

In [104]:
query = f'''
SELECT
    query.transaction_id AS transaction_id,
    ARRAY_AGG(base.transaction_id ORDER BY distance) AS matches
FROM
    VECTOR_SEARCH(
        TABLE `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_embeddings`,
        'embedding_normalized',
        (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}_embeddings` WHERE splits = 'TEST'),
        top_k => 5
    )
GROUP BY transaction_id
LIMIT 5
'''
job = bq.query(query, job_config = bigquery.QueryJobConfig(use_query_cache=False))
job.result()
job.to_dataframe()

Unnamed: 0,transaction_id,matches
0,d2f77720-1b85-4e55-ada6-1d7700f94b7d,"[ef268a30-9041-4b8c-98ab-c268ce8dbf62, 0d46a4c..."
1,fa3e1c8a-0378-4be0-900b-6566bbbc4a07,"[cfbf76cd-03e9-4f77-aba6-0b219ae91053, 24bec56..."
2,11d18c98-edbe-40c3-bfc0-f41136bca74a,"[5616f04f-2c66-4a94-a8ad-b3d0ded1b0d9, 9120a5b..."
3,02c44ddd-5d07-4b75-aa85-b68397589a49,"[0e773bdc-9a97-4c60-a685-b7835e518684, 1cb4a45..."
4,ee6ef95e-dead-4652-b483-475278bf2199,"[38291fc4-9d0e-4fa9-92af-060304407678, ed76f28..."


In [105]:
job._properties['statistics']['query']['vectorSearchStatistics']

{'indexUsageMode': 'FULLY_USED'}
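
The `ARRAY_AGG(base.transaction_id ORDER BY distance)` step in the batch query groups the matches for each query row into a list ordered from closest to farthest. A plain-Python sketch of the same grouping is shown below. The rows are hypothetical values made up for illustration, and `aggregate_matches` is not part of any library.

```python
from collections import defaultdict

# (query_id, match_id, distance): hypothetical values for illustration
rows = [
    ("q1", "a", -0.91), ("q1", "q1", -1.00), ("q1", "b", -0.97),
    ("q2", "c", -0.88), ("q2", "q2", -1.00),
]

def aggregate_matches(rows):
    """Group match IDs by query ID, ordered by ascending distance."""
    grouped = defaultdict(list)
    for query_id, match_id, distance in rows:
        grouped[query_id].append((distance, match_id))
    return {query_id: [match for _, match in sorted(pairs)] for query_id, pairs in grouped.items()}

matches = aggregate_matches(rows)
print(matches)
```

As in the query results above, each query row's closest match is typically itself when the query rows are drawn from the table being searched.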