![tracker](https://us-central1-vertex-ai-mlops-369716.cloudfunctions.net/pixel-tracking?path=statmike%2Fvertex-ai-mlops%2FApplied+GenAI%2FRetrieval&file=Retrieval+-+BigQuery+Vector+Indexing+And+Search.ipynb)
<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Retrieval/Retrieval%20-%20BigQuery%20Vector%20Indexing%20And%20Search.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FApplied%2520GenAI%2FRetrieval%2FRetrieval%2520-%2520BigQuery%2520Vector%2520Indexing%2520And%2520Search.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Retrieval/Retrieval%20-%20BigQuery%20Vector%20Indexing%20And%20Search.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Applied%20GenAI/Retrieval/Retrieval%20-%20BigQuery%20Vector%20Indexing%20And%20Search.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# Retrieval - BigQuery Vector Indexing And Search

In prior workflows, a series of documents was [processed into chunks](../Chunking/readme.md), and for each chunk, [embeddings](../Embeddings/readme.md) were created:

- Process: [Large Document Processing - Document AI Layout Parser](../Chunking/Large%20Document%20Processing%20-%20Document%20AI%20Layout%20Parser.ipynb)
- Embed: [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb)

Retrieving chunks for a query involves calculating the embedding for the query and then using similarity metrics to find relevant chunks. A thorough review of similarity matching can be found in [The Math of Similarity](../Embeddings/The%20Math%20of%20Similarity.ipynb) - use dot product! As development moves from experiment to application, the process of storing and computing similarity is migrated to a [retrieval](./readme.md) system. This workflow is part of a [series of workflows exploring many retrieval systems](./readme.md).  

A detailed [comparison of many retrieval systems](./readme.md#comparison-of-vector-database-solutions) can be found in the readme as well.

---

**BigQuery For Storage, Indexing, And Search**

This workflow builds a retrieval system using Google [BigQuery](https://cloud.google.com/bigquery). BigQuery is a fully managed data warehouse where SQL queries run without the need to plan for storage or compute requirements.  Built into this solution is the  `VECTOR_SEARCH` function that can perform brute-force searches for neighboring embeddings and utilize an index for efficient search. BigQuery offers two built-in methods for [creating vector indexes](https://cloud.google.com/bigquery/docs/vector-index): the [Inverted File (IVF) index](https://cloud.google.com/bigquery/docs/vector-index#ivf-index) and the [TreeAH index](https://cloud.google.com/bigquery/docs/vector-index#tree-ah-index).

---

**Use Case Data**

Buying a home usually involves borrowing money from a lending institution, typically through a mortgage secured by the home's value. But how do these institutions manage the risks associated with such large loans, and how are lending standards established?

In the United States, two government-sponsored enterprises (GSEs) play a vital role in the housing market:

- Federal National Mortgage Association ([Fannie Mae](https://www.fanniemae.com/))
- Federal Home Loan Mortgage Corporation ([Freddie Mac](https://www.freddiemac.com/))

These GSEs purchase mortgages from lenders, enabling those lenders to offer more loans. This process also allows Fannie Mae and Freddie Mac to set standards for mortgages, ensuring they are responsible and borrowers are more likely to repay them. This system makes homeownership more affordable and stabilizes the housing market by maintaining a steady flow of liquidity for lenders and keeping interest rates controlled.

However, navigating the complexities of these GSEs and their extensive servicing guides can be challenging.

**Approaches**

[This series](../readme.md) covers many generative AI workflows. These documents are used directly as long context for Gemini in the workflow [Long Context Retrieval With The Vertex AI Gemini API](../Generate/Long%20Context%20Retrieval%20With%20The%20Vertex%20AI%20Gemini%20API.ipynb). The workflow below uses a [retrieval](./readme.md) approach with the already generated chunks and embeddings.

---
## Colab Setup

When running this notebook in [Colab](https://colab.google/) or [Colab Enterprise](https://cloud.google.com/colab/docs/introduction), this section will authenticate to GCP (follow prompts in the popup) and set the current project for the session.

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs and API Enablement

The clients packages may need installing in this environment. 

### Installs (If Needed)

In [3]:
# tuples of (import name, install name, min_version)
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform', '1.69.0'),
    ('google.cloud.bigquery', 'google-cloud-bigquery')
]

import importlib
install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        if importlib.metadata.version(package[0]) < package[2]:
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user

### API Enablement

In [4]:
!gcloud services enable aiplatform.googleapis.com

### Restart Kernel (If Installs Occured)

After a kernel restart the code submission can start with the next cell after this one.

In [5]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
    IPython.display.display(IPython.display.Markdown("""<div class=\"alert alert-block alert-warning\">
        <b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. The previous cells do not need to be run again⚠️</b>
        </div>"""))

---
## Setup

Inputs

In [6]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [7]:
REGION = 'us-central1'
SERIES = 'applied-genai'
EXPERIMENT = 'retrieval-bigquery'

# make this the BigQuery Project / Dataset / Table prefix to store results
BQ_PROJECT = PROJECT_ID
BQ_DATASET = SERIES.replace('-', '_')
BQ_TABLE = EXPERIMENT
BQ_REGION = REGION[0:2]

Packages

In [8]:
import os, json, time, glob

import numpy as np

# Vertex AI
from google.cloud import aiplatform
import vertexai.language_models # for embeddings API
import vertexai.generative_models # for Gemini Models

# BigQuery
from google.cloud import bigquery

In [9]:
aiplatform.__version__

'1.78.0'

In [10]:
bigquery.__version__

'3.29.0'

Clients

In [11]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)

# bigquery client
bq = bigquery.Client(project = PROJECT_ID)

---
## Text & Embeddings For Examples

This repository contains a [section for document processing (chunking)](../Chunking/readme.md) that includes an example of processing mulitple large pdfs (over 1000 pages) into chunks: [Large Document Processing - Document AI Layout Parser](../Chunking/Large%20Document%20Processing%20-%20Document%20AI%20Layout%20Parser.ipynb).  The chunks of text from that workflow are stored with this repository and loaded by another companion workflow that augments the chunks with text embeddings: [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb).

The following code will load the version of the chunks that includes text embeddings and prepare it for a local example of retrival augmented generation.

### Get The Documents

If you are working from a clone of this notebooks [repository](https://github.com/statmike/vertex-ai-mlops) then the documents are already present. The following cell checks for the documents folder and if it is missing gets it (`git clone`):

In [13]:
local_dir = '../Embeddings/files/embeddings-api'

In [14]:
if not os.path.exists(local_dir):
    print('Retrieving documents...')
    parent_dir = os.path.dirname(local_dir)
    temp_dir = os.path.join(parent_dir, 'temp')
    if not os.path.exists(temp_dir):
        os.makedirs(temp_dir)
    !git clone https://www.github.com/statmike/vertex-ai-mlops {temp_dir}/vertex-ai-mlops
    shutil.copytree(f'{temp_dir}/vertex-ai-mlops/Applied GenAI/Embeddings/files/embeddings-api', local_dir)
    shutil.rmtree(temp_dir)
    print(f'Documents are now in folder `{local_dir}`')
else:
    print(f'Documents Found in folder `{local_dir}`')             

Documents Found in folder `../Embeddings/files/embeddings-api`


### Load The Chunks

In [15]:
jsonl_files = glob.glob(f"{local_dir}/large-files*.jsonl")
jsonl_files.sort()
jsonl_files

['../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0000.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0001.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0002.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0003.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0004.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0005.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0006.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0007.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0008.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0009.jsonl']

In [16]:
chunks = []
for file in jsonl_files:
    with open(file, 'r') as f:
        chunks.extend([json.loads(line) for line in f])
len(chunks)

9040

### Review A Chunk

In [17]:
chunks[0].keys()

dict_keys(['instance', 'predictions', 'status'])

In [18]:
chunks[0]['instance']['chunk_id']

'fannie_part_0_c17'

In [19]:
print(chunks[0]['instance']['content'])

# Selling Guide Fannie Mae Single Family

## Fannie Mae Copyright Notice

### Fannie Mae Copyright Notice

|-|
| Section B3-4.2, Verification of Depository Assets 402 |
| B3-4.2-01, Verification of Deposits and Assets (05/04/2022) 403 |
| B3-4.2-02, Depository Accounts (12/14/2022) 405 |
| B3-4.2-03, Individual Development Accounts (02/06/2019) 408 |
| B3-4.2-04, Pooled Savings (Community Savings Funds) (04/01/2009) 411 |
| B3-4.2-05, Foreign Assets (05/04/2022) 411 |
| Section B3-4.3, Verification of Non-Depository Assets 412 |
| B3-4.3-01, Stocks, Stock Options, Bonds, and Mutual Funds (06/30/2015) 412 |
| B3-4.3-02, Trust Accounts (04/01/2009) 413 |
| B3-4.3-03, Retirement Accounts (06/30/2015) 414 |
| B3-4.3-04, Personal Gifts (09/06/2023) 415 |
| B3-4.3-05, Gifts of Equity (10/07/2020) 418 |
| B3-4.3-06, Grants and Lender Contributions (12/14/2022) 419 |
| B3-4.3-07, Disaster Relief Grants or Loans (04/01/2009) 423 |
| B3-4.3-08, Employer Assistance (09/29/2015) 423 |
| B3-4.3-09,

In [20]:
chunks[0]['predictions'][0]['embeddings']['values'][0:10]

[0.031277116388082504,
 0.03056905046105385,
 0.010865348391234875,
 0.0623614676296711,
 0.03228681534528732,
 0.05066155269742012,
 0.046544693410396576,
 0.05509665608406067,
 -0.014074751175940037,
 0.008380400016903877]

### Prepare Chunk Structure

Make a list of dictionaries with information for each chunk:

In [21]:
content_chunks = [
    dict(
        gse = chunk['instance']['gse'],
        chunk_id = chunk['instance']['chunk_id'],
        content = chunk['instance']['content'],
        embedding = chunk['predictions'][0]['embeddings']['values']
    ) for chunk in chunks
]

### Query Embedding

Create a query, or prompt, and get the embedding for it:

Connect to models for text embeddings. Learn more about the model API:
- [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb)

In [22]:
question = "Does a lender have to perform servicing functions directly?"

In [23]:
embedder = vertexai.language_models.TextEmbeddingModel.from_pretrained('text-embedding-004')

In [24]:
question_embedding = embedder.get_embeddings([question])[0].values
question_embedding[0:10]

[-0.0005117303808219731,
 0.009651427157223225,
 0.01768726110458374,
 0.014538003131747246,
 -0.01829824410378933,
 0.027877431362867355,
 -0.021124685183167458,
 0.008830446749925613,
 -0.02669006586074829,
 0.06414774805307388]

---
## Load To BigQuery

In this case the information to load to BigQuery is local.  It could be in GCS or other BigQuery sources.  You can also get embeddings for information within BigQuery using the [ML.GENERATE_EMBEDDING function](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-embedding) and even use the exact same model as was used for the data imported above.

### Create/Recall Dataset

In [25]:
dataset = bigquery.Dataset(f"{BQ_PROJECT}.{BQ_DATASET}")
dataset.location = BQ_REGION
bq_dataset = bq.create_dataset(dataset, exists_ok = True)

### Load JSON TO BigQuery Table

In [26]:
bq_table = bq_dataset.table(BQ_TABLE)

In [29]:
job_config = bigquery.LoadJobConfig(
    source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE,
    autodetect = True
)

In [30]:
load_job = bq.load_table_from_json(
    json_rows = content_chunks,
    destination = bq_table,
    job_config = job_config
)
load_job.result()

LoadJob<project=statmike-mlops-349915, location=US, id=f44d223f-4db7-4580-a88b-72d0e87d7707>

In [31]:
bq.query(f"SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` LIMIT 5").to_dataframe()

Unnamed: 0,chunk_id,embedding,content,gse
0,fannie_part_0_c17,"[0.031277116388082504, 0.03056905046105385, 0....",# Selling Guide Fannie Mae Single Family\n\n##...,fannie
1,fannie_part_0_c418,"[0.0002988415362779051, -0.002309585688635707,...",# Additional Financial Requirements\n\nLender ...,fannie
2,fannie_part_0_c725,"[-0.0030626351945102215, -0.017544567584991455...","# B3-1-01, Comprehensive Risk Assessment (12/1...",fannie
3,fannie_part_0_c882,"[0.012272126041352749, -0.050955940037965775, ...",# Calculating Monthly Qualifying Rental Income...,fannie
4,fannie_part_0_c315,"[0.008982143364846706, -0.0069817849434912205,...",# Compliance with Fannie Mae Data Breach Incid...,fannie


---
## Retrieval With BigQuery

BigQuery has a built in [VECTOR_SEARCH function](https://cloud.google.com/bigquery/docs/reference/standard-sql/search_functions#vector_search) that by itself will do brute force matching to find the `top_k` neighbors with the selected distance metric using `distance_type` ('EUCLIDEAN', 'COSINE', or 'DOT_PRODUCT').  Remember to use 'DOT_PRODUCT' and read why in this companion workflow: [The Math of Similarity](../Embeddings/The%20Math%20of%20Similarity.ipynb).

[BigQuery also has vector indexing](https://cloud.google.com/bigquery/docs/vector-search-intro) options for approximate nearest neighbor search with two options for index type:
- [Inverted File (IVF) Index](https://cloud.google.com/bigquery/docs/vector-index#ivf-index)
    - Automatically creates an inverted list from assigning embeddings to k clusters using k-means clustering and uses the clusters as partions
    - Reduces the search space to only partions (clusters) near the query embedding (configurable)
- [TreeAH Index](https://cloud.google.com/bigquery/docs/vector-index#tree-ah-index)
    - Uses the Google [ScaNN algorithm](https://github.com/google-research/google-research/blob/master/scann/docs/algorithms.md)
    - Train a clustering model and bases the number of clusters on the value of `leaf_node_embedding_count`
    - Creates a candidate list at query time using asymmetric hashing - super fast!
    
The following sections explore vector search in BigQuery with and without the use of vector indexes.

### Get Matches: Vector Search With No Vector Index - Brute Force

The [VECTOR_SEARCH function](https://cloud.google.com/bigquery/docs/reference/standard-sql/search_functions#vector_search) function can be directly used to find matches without a [vector index](https://cloud.google.com/bigquery/docs/vector-index).  While vector indexes offer efficient ways of finding matches based on embeddings, the brute force approach of directly using `VECTOR_SEARCH` compares a query embedding to all embeddings in the table.

The options for `VECTOR_SEARCH` are:
- `base_table` is the table to search
- `column_to_search` is the column of embeddings vector to match
- `top_k` is the number of matches to return
- `distance_type` is the distance measure to use from `EUCLIDEAN`, `DOT_PRODUCT`, and `COSINE`.
    - Use `DOT_PRODUCT` here for these normalized embeddings and read more about why in [The Math of Similarity](../Embeddings/The%20Math%20of%20Similarity.ipynb)

#### Matches For A Given `chunk_id`

In [32]:
example_chunk = 'fannie_part_0_c17'
query = f'''
SELECT
    query.chunk_id AS chunk_id,
    base.chunk_id AS match,
    distance
FROM VECTOR_SEARCH(
    TABLE `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`,
    'embedding',
    (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` WHERE chunk_id = '{example_chunk}'),
    'embedding',
    top_k => 5,
    distance_type => 'DOT_PRODUCT'
)
ORDER BY distance
'''
bq.query(query).to_dataframe()

Unnamed: 0,chunk_id,match,distance
0,fannie_part_0_c17,fannie_part_0_c17,-0.999799
1,fannie_part_0_c17,fannie_part_0_c15,-0.897096
2,fannie_part_0_c17,fannie_part_0_c20,-0.889913
3,fannie_part_0_c17,fannie_part_0_c11,-0.883223
4,fannie_part_0_c17,fannie_part_0_c22,-0.877935


#### Matches For An Input Embedding

In [33]:
query = f'''
SELECT
    query.question AS question,
    base.chunk_id AS match,
    base.gse AS gse,
    distance
FROM VECTOR_SEARCH(
    TABLE `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`,
    'embedding',
    (SELECT {question_embedding} AS embedding, '{question}' AS question),
    'embedding',
    top_k => 5,
    distance_type => 'DOT_PRODUCT'
)
ORDER BY distance
'''
bq.query(query).to_dataframe()

Unnamed: 0,question,match,gse,distance
0,Does a lender have to perform servicing functi...,fannie_part_0_c352,fannie,-0.709984
1,Does a lender have to perform servicing functi...,freddie_part_4_c509,freddie,-0.680526
2,Does a lender have to perform servicing functi...,freddie_part_4_c510,freddie,-0.67533
3,Does a lender have to perform servicing functi...,fannie_part_0_c353,fannie,-0.672371
4,Does a lender have to perform servicing functi...,fannie_part_0_c326,fannie,-0.66835


#### Pre-Filter Matches For An Input Embedding

In [34]:
example_gse = 'fannie'
query = f'''
SELECT
    query.question AS question,
    base.chunk_id AS match,
    base.gse AS gse,
    distance
FROM VECTOR_SEARCH(
    (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` WHERE gse = '{example_gse}'),
    'embedding',
    (SELECT {question_embedding} AS embedding, '{question}' AS question),
    'embedding',
    top_k => 5,
    distance_type => 'DOT_PRODUCT'
)
ORDER BY distance
'''
bq.query(query).to_dataframe()

Unnamed: 0,question,match,gse,distance
0,Does a lender have to perform servicing functi...,fannie_part_0_c352,fannie,-0.709984
1,Does a lender have to perform servicing functi...,fannie_part_0_c353,fannie,-0.672371
2,Does a lender have to perform servicing functi...,fannie_part_0_c326,fannie,-0.66835
3,Does a lender have to perform servicing functi...,fannie_part_0_c92,fannie,-0.661434
4,Does a lender have to perform servicing functi...,fannie_part_0_c240,fannie,-0.660858


### Create A Vector Index

To efficiently do a vector search it can be helpful to [create a vector index](https://cloud.google.com/bigquery/docs/vector-index) for the embeddings column.  
 - This is not required though as a [brute force search](https://cloud.google.com/bigquery/docs/vector-search#use_the_vector_search_function_with_brute_force) is possible as shown in the previous section.  An Index can also be ignored and force brute force search as will be shown later in this workflow.

There are two types of vector indexes that can be chosen in BigQuery:
- IVF Index: `index_type = "IVF"`
    - use k-means to cluser the embeddings vectors and then uses the clusters as partions
    - k can be specifed with the NUM_LISTS option as any INT64 <= 5000
        - The choice of value can impact the performance of queries.
        - **NOTE:** If omitted then by default BigQuery calculates an appropriate value!
    - updates are handled automatically
- TreeAH Index `index_type = "TREE_AH"`
    - uses the [ScaNN algorithm](https://github.com/google-research/google-research/blob/master/scann/docs/algorithms.md) developed by Google
    - the number of partitions/clusters is trained based on the `leaf_node_embedding_count` option
        - each leaf node will contain up to the value of the parameter which defaults to 1000 and can be any INT64 >= 500
    
Options for both index types:
- Set the default `DISTANCE_TYPE` as `EUCLIDEAN`, `DOT_PRODUCT` or `COSINE`.
    - Use `DOT_PRODUCT` here for these normalized embeddings and read more about why in [The Math of Similarity](../Embeddings/The%20Math%20of%20Similarity.ipynb)
    - This option can be preempted with an alternative choice when using the VECTOR_SEARCH function
- `NORMALIZATION_TYPE` can be chosen to normalize the vectors for you.
    - In this case the embedding are already normalized
    - The default is `NONE` but can be set to `L2`
- `STORING` Clause can be used to set additional columns as part of the index which can be helpful for pre-filtering searchs to a subset.  This is used below to enable subsetting the search to the source of the information based on the columns `gse` which has values 'freddie' and 'fannie'

Read about the [CREATE VECTOR INDEX](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_vector_index_statement) DDL statement.

The workflow below uses the `IVF` index type.  See an example of the workflow below using the `TREE_AH` index type in the workflow **[BQML Autoencoder As Table Embedding](../Embeddings/BQML%20Autoencoder%20As%20Table%20Embedding.ipynb)**


In [35]:
query = f'''
CREATE VECTOR INDEX IF NOT EXISTS row_index ON `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`(embedding)
STORING(gse, chunk_id) # add chunk_id here also!
OPTIONS(
    index_type = 'IVF',
    ivf_options = '{{"num_lists":100}}',
    distance_type = 'DOT_PRODUCT'
)
'''
job = bq.query(query)
job.result()

<google.cloud.bigquery.table._EmptyRowIterator at 0x7fe600809780>

### Check The Vector Index Status

[Get information about vector indexes](https://cloud.google.com/bigquery/docs/vector-index#get_information_about_vector_indexes)

In [36]:
query = f'''
SELECT *
FROM `{BQ_PROJECT}.{BQ_DATASET}.INFORMATION_SCHEMA.VECTOR_INDEXES`
WHERE index_status = 'ACTIVE'
    AND table_name = '{BQ_TABLE}'
'''
bq.query(query).to_dataframe()

Unnamed: 0,index_catalog,index_schema,table_name,index_name,index_status,creation_time,last_modification_time,last_refresh_time,disable_time,disable_reason,ddl,coverage_percentage,unindexed_row_count,total_logical_bytes,total_storage_bytes
0,statmike-mlops-349915,applied_genai,retrieval-bigquery,row_index,ACTIVE,2025-02-21 16:53:15.687000+00:00,2025-02-21 16:53:15.687000+00:00,NaT,NaT,,CREATE VECTOR INDEX `row_index` ON `statmike-m...,0,9040,0,0


It can take a few minutes for the index to be built.  Notice the output above show the `index_status = ACTIVE` but the `coverage_percentage = 0`.  The next cell will for a wait of 10 minutes then rerun the check for status:

In [37]:
time.sleep(600) # sleep for 10 minutes

In [38]:
query = f'''
SELECT *
FROM `{BQ_PROJECT}.{BQ_DATASET}.INFORMATION_SCHEMA.VECTOR_INDEXES`
WHERE index_status = 'ACTIVE'
    AND table_name = '{BQ_TABLE}'
'''
bq.query(query).to_dataframe()

Unnamed: 0,index_catalog,index_schema,table_name,index_name,index_status,creation_time,last_modification_time,last_refresh_time,disable_time,disable_reason,ddl,coverage_percentage,unindexed_row_count,total_logical_bytes,total_storage_bytes
0,statmike-mlops-349915,applied_genai,retrieval-bigquery,row_index,ACTIVE,2025-02-21 16:53:15.687000+00:00,2025-02-21 16:53:15.687000+00:00,2025-02-21 17:00:42.584000+00:00,NaT,,CREATE VECTOR INDEX `row_index` ON `statmike-m...,100,0,56728466,30824694


### Get Matches: Vector Search With Vector Index

Directly query the table with the index as the base table:

#### Matches For A Given `chunk_id`

In [39]:
example_chunk = 'fannie_part_0_c17'
query = f'''
SELECT
    query.chunk_id AS chunk_id,
    base.chunk_id AS match,
    distance
FROM VECTOR_SEARCH(
    TABLE `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`,
    'embedding',
    (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` WHERE chunk_id = '{example_chunk}'),
    'embedding',
    top_k => 5,
    distance_type => 'DOT_PRODUCT'
)
ORDER BY distance
'''
job = bq.query(query, job_config = bigquery.QueryJobConfig(use_query_cache=False))
job.result()
job.to_dataframe()

Unnamed: 0,chunk_id,match,distance
0,fannie_part_0_c17,fannie_part_0_c17,-0.999799
1,fannie_part_0_c17,fannie_part_0_c15,-0.897096
2,fannie_part_0_c17,fannie_part_0_c20,-0.889913
3,fannie_part_0_c17,fannie_part_0_c11,-0.883223
4,fannie_part_0_c17,fannie_part_0_c22,-0.877935


Check to see if the index was used with [vector index usage information](https://cloud.google.com/bigquery/docs/vector-index#vector_index_usage)

In [40]:
job._properties['statistics']['query']['vectorSearchStatistics']

{'indexUsageMode': 'FULLY_USED'}

#### Matches For An Input Embedding

In [41]:
query = f'''
SELECT
    query.question AS question,
    base.chunk_id AS match,
    base.gse AS gse,
    distance
FROM VECTOR_SEARCH(
    TABLE `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`,
    'embedding',
    (SELECT {question_embedding} AS embedding, '{question}' AS question),
    'embedding',
    top_k => 5,
    distance_type => 'DOT_PRODUCT'
)
ORDER BY distance
'''
job = bq.query(query, job_config = bigquery.QueryJobConfig(use_query_cache=False))
job.result()
job.to_dataframe()

Unnamed: 0,question,match,gse,distance
0,Does a lender have to perform servicing functi...,fannie_part_0_c352,fannie,-0.709984
1,Does a lender have to perform servicing functi...,freddie_part_4_c509,freddie,-0.680526
2,Does a lender have to perform servicing functi...,freddie_part_4_c510,freddie,-0.67533
3,Does a lender have to perform servicing functi...,freddie_part_4_c472,freddie,-0.661984
4,Does a lender have to perform servicing functi...,freddie_part_5_c360,freddie,-0.655732


Check to see if the index was used with [vector index usage information](https://cloud.google.com/bigquery/docs/vector-index#vector_index_usage)

In [42]:
job._properties['statistics']['query']['vectorSearchStatistics']

{'indexUsageMode': 'FULLY_USED'}

#### Pre-Filter Matches For An Input Embedding

In [43]:
example_gse = 'fannie'
query = f'''
SELECT
    query.question AS question,
    base.chunk_id AS match,
    base.gse AS gse,
    distance
FROM VECTOR_SEARCH(
    (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` WHERE gse = '{example_gse}'),
    'embedding',
    (SELECT {question_embedding} AS embedding, '{question}' AS question),
    'embedding',
    top_k => 5,
    distance_type => 'DOT_PRODUCT'
)
ORDER BY distance
'''
job = bq.query(query, job_config = bigquery.QueryJobConfig(use_query_cache=False))
job.result()
job.to_dataframe()

Unnamed: 0,question,match,gse,distance
0,Does a lender have to perform servicing functi...,fannie_part_0_c352,fannie,-0.709984
1,Does a lender have to perform servicing functi...,fannie_part_0_c335,fannie,-0.652192
2,Does a lender have to perform servicing functi...,fannie_part_0_c337,fannie,-0.637705
3,Does a lender have to perform servicing functi...,fannie_part_0_c336,fannie,-0.632683
4,Does a lender have to perform servicing functi...,fannie_part_0_c338,fannie,-0.632171


Check to see if the index was used with [vector index usage information](https://cloud.google.com/bigquery/docs/vector-index#vector_index_usage)

In [44]:
job._properties['statistics']['query']['vectorSearchStatistics']

{'indexUsageMode': 'FULLY_USED'}

### Get Matches: Vector Search With Vector Index - Modify Distance Metric

When setting up the index we specified `distance_type = 'DOT_PRODUCT'`.  We can modify the vector search to use a different distance measure by setting the option for `distance_type`. See [VECTOR_SEARCH function details](https://cloud.google.com/bigquery/docs/reference/standard-sql/search_functions#vector_search).  Does this modification still use the index?  Yes!

#### Matches For A Given `chunk_id`

In [45]:
example_chunk = 'fannie_part_0_c17'
query = f'''
SELECT
    query.chunk_id AS chunk_id,
    base.chunk_id AS match,
    distance
FROM VECTOR_SEARCH(
    TABLE `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`,
    'embedding',
    (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` WHERE chunk_id = '{example_chunk}'),
    'embedding',
    top_k => 5,
    distance_type => 'COSINE'
)
ORDER BY distance
'''
job = bq.query(query, job_config = bigquery.QueryJobConfig(use_query_cache=False))
job.result()
job.to_dataframe()

Unnamed: 0,chunk_id,match,distance
0,fannie_part_0_c17,fannie_part_0_c17,-2.220446e-16
1,fannie_part_0_c17,fannie_part_0_c15,0.1027482
2,fannie_part_0_c17,fannie_part_0_c20,0.1099197
3,fannie_part_0_c17,fannie_part_0_c11,0.116623
4,fannie_part_0_c17,fannie_part_0_c22,0.1218969


Check to see if the index was used with [vector index usage information](https://cloud.google.com/bigquery/docs/vector-index#vector_index_usage)

In [46]:
job._properties['statistics']['query']['vectorSearchStatistics']

{'indexUsageMode': 'FULLY_USED'}

#### Matches For An Input Embedding

In [47]:
query = f'''
SELECT
    query.question AS question,
    base.chunk_id AS match,
    base.gse AS gse,
    distance
FROM VECTOR_SEARCH(
    TABLE `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`,
    'embedding',
    (SELECT {question_embedding} AS embedding, '{question}' AS question),
    'embedding',
    top_k => 5,
    distance_type => 'COSINE'
)
ORDER BY distance
'''
job = bq.query(query, job_config = bigquery.QueryJobConfig(use_query_cache=False))
job.result()
job.to_dataframe()

Unnamed: 0,question,match,gse,distance
0,Does a lender have to perform servicing functi...,fannie_part_2_c417,fannie,0.344078
1,Does a lender have to perform servicing functi...,freddie_part_4_c508,freddie,0.34806
2,Does a lender have to perform servicing functi...,fannie_part_2_c793,fannie,0.354459
3,Does a lender have to perform servicing functi...,fannie_part_2_c788,fannie,0.355131
4,Does a lender have to perform servicing functi...,fannie_part_2_c815,fannie,0.359223


Check to see if the index was used with [vector index usage information](https://cloud.google.com/bigquery/docs/vector-index#vector_index_usage)

In [48]:
job._properties['statistics']['query']['vectorSearchStatistics']

{'indexUsageMode': 'FULLY_USED'}

#### Pre-Filter Matches For An Input Embedding

In [49]:
example_gse = 'fannie'
query = f'''
SELECT
    query.question AS question,
    base.chunk_id AS match,
    base.gse AS gse,
    distance
FROM VECTOR_SEARCH(
    (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` WHERE gse = '{example_gse}'),
    'embedding',
    (SELECT {question_embedding} AS embedding, '{question}' AS question),
    'embedding',
    top_k => 5,
    distance_type => 'COSINE'
)
ORDER BY distance
'''
job = bq.query(query, job_config = bigquery.QueryJobConfig(use_query_cache=False))
job.result()
job.to_dataframe()

Unnamed: 0,question,match,gse,distance
0,Does a lender have to perform servicing functi...,fannie_part_2_c417,fannie,0.344078
1,Does a lender have to perform servicing functi...,fannie_part_2_c793,fannie,0.354459
2,Does a lender have to perform servicing functi...,fannie_part_2_c788,fannie,0.355131
3,Does a lender have to perform servicing functi...,fannie_part_2_c815,fannie,0.359223
4,Does a lender have to perform servicing functi...,fannie_part_2_c814,fannie,0.363247


Check to see if the index was used with [vector index usage information](https://cloud.google.com/bigquery/docs/vector-index#vector_index_usage)

In [50]:
job._properties['statistics']['query']['vectorSearchStatistics']

{'indexUsageMode': 'FULLY_USED'}

### Get Matches: Vector Search With Vector Index - Modify Search Size

When setting up the index we specified `num_lists = 1000` which led to a number of list being created.  We can guide the vector search to use a larger/smaller portion of these list by seting the option for `fraction_list_to_search`. See [VECTOR_SEARCH function details](https://cloud.google.com/bigquery/docs/reference/standard-sql/search_functions#vector_search).

#### Matches For A Given `chunk_id`

In [51]:
example_chunk = 'fannie_part_0_c17'
query = f'''
SELECT
    query.chunk_id AS chunk_id,
    base.chunk_id AS match,
    distance
FROM VECTOR_SEARCH(
    TABLE `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`,
    'embedding',
    (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` WHERE chunk_id = '{example_chunk}'),
    'embedding',
    top_k => 5,
    distance_type => 'DOT_PRODUCT',
    options => '{{"fraction_lists_to_search": 0.25}}'
)
ORDER BY distance
'''
job = bq.query(query, job_config = bigquery.QueryJobConfig(use_query_cache=False))
job.result()
job.to_dataframe()

Unnamed: 0,chunk_id,match,distance
0,fannie_part_0_c17,fannie_part_0_c17,-0.999799
1,fannie_part_0_c17,fannie_part_0_c15,-0.897096
2,fannie_part_0_c17,fannie_part_0_c20,-0.889913
3,fannie_part_0_c17,fannie_part_0_c11,-0.883223
4,fannie_part_0_c17,fannie_part_0_c22,-0.877935


Check to see if the index was used with [vector index usage information](https://cloud.google.com/bigquery/docs/vector-index#vector_index_usage)

In [52]:
job._properties['statistics']['query']['vectorSearchStatistics']

{'indexUsageMode': 'FULLY_USED'}

#### Matches For An Input Embedding

In [53]:
query = f'''
SELECT
    query.question AS question,
    base.chunk_id AS match,
    base.gse AS gse,
    distance
FROM VECTOR_SEARCH(
    TABLE `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`,
    'embedding',
    (SELECT {question_embedding} AS embedding, '{question}' AS question),
    'embedding',
    top_k => 5,
    distance_type => 'DOT_PRODUCT',
    options => '{{"fraction_lists_to_search": 0.25}}'
)
ORDER BY distance
'''
job = bq.query(query, job_config = bigquery.QueryJobConfig(use_query_cache=False))
job.result()
job.to_dataframe()

Unnamed: 0,question,match,gse,distance
0,Does a lender have to perform servicing functi...,fannie_part_0_c352,fannie,-0.709984
1,Does a lender have to perform servicing functi...,freddie_part_4_c509,freddie,-0.680526
2,Does a lender have to perform servicing functi...,freddie_part_4_c510,freddie,-0.67533
3,Does a lender have to perform servicing functi...,fannie_part_0_c353,fannie,-0.672371
4,Does a lender have to perform servicing functi...,fannie_part_0_c326,fannie,-0.66835


Check to see if the index was used with [vector index usage information](https://cloud.google.com/bigquery/docs/vector-index#vector_index_usage)

In [54]:
job._properties['statistics']['query']['vectorSearchStatistics']

{'indexUsageMode': 'FULLY_USED'}

#### Pre-Filter Matches For An Input Embedding

In [55]:
example_gse = 'fannie'
query = f'''
SELECT
    query.question AS question,
    base.chunk_id AS match,
    base.gse AS gse,
    distance
FROM VECTOR_SEARCH(
    (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` WHERE gse = '{example_gse}'),
    'embedding',
    (SELECT {question_embedding} AS embedding, '{question}' AS question),
    'embedding',
    top_k => 5,
    distance_type => 'DOT_PRODUCT',
    options => '{{"fraction_lists_to_search": 0.25}}'
)
ORDER BY distance
'''
job = bq.query(query, job_config = bigquery.QueryJobConfig(use_query_cache=False))
job.result()
job.to_dataframe()

Unnamed: 0,question,match,gse,distance
0,Does a lender have to perform servicing functi...,fannie_part_0_c352,fannie,-0.709984
1,Does a lender have to perform servicing functi...,fannie_part_0_c353,fannie,-0.672371
2,Does a lender have to perform servicing functi...,fannie_part_0_c326,fannie,-0.66835
3,Does a lender have to perform servicing functi...,fannie_part_0_c92,fannie,-0.661434
4,Does a lender have to perform servicing functi...,fannie_part_2_c417,fannie,-0.655913


Check to see if the index was used with [vector index usage information](https://cloud.google.com/bigquery/docs/vector-index#vector_index_usage)

In [56]:
job._properties['statistics']['query']['vectorSearchStatistics']

{'indexUsageMode': 'FULLY_USED'}

### Get Matches: Vector Search With Vector Index - Brute Force

Rather than using the index, find the exact nearest neighbor by searching all of the embeddings with options value `use_brute_force` set to `true`:

#### Matches For A Given `chunk_id`

In [57]:
example_chunk = 'fannie_part_0_c17'
query = f'''
SELECT
    query.chunk_id AS chunk_id,
    base.chunk_id AS match,
    distance
FROM VECTOR_SEARCH(
    TABLE `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`,
    'embedding',
    (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` WHERE chunk_id = '{example_chunk}'),
    'embedding',
    top_k => 5,
    distance_type => 'DOT_PRODUCT',
    options => '{{"use_brute_force":true}}'
)
ORDER BY distance
'''
ob = bq.query(query, job_config = bigquery.QueryJobConfig(use_query_cache=False))
job.result()
job.to_dataframe()

Unnamed: 0,question,match,gse,distance
0,Does a lender have to perform servicing functi...,fannie_part_0_c352,fannie,-0.709984
1,Does a lender have to perform servicing functi...,fannie_part_0_c353,fannie,-0.672371
2,Does a lender have to perform servicing functi...,fannie_part_0_c326,fannie,-0.66835
3,Does a lender have to perform servicing functi...,fannie_part_0_c92,fannie,-0.661434
4,Does a lender have to perform servicing functi...,fannie_part_2_c417,fannie,-0.655913


Check to see if the index was used with [vector index usage information](https://cloud.google.com/bigquery/docs/vector-index#vector_index_usage)

In [58]:
job._properties['statistics']['query']['vectorSearchStatistics']

{'indexUsageMode': 'FULLY_USED'}

#### Matches For An Input Embedding

In [59]:
query = f'''
SELECT
    query.question AS question,
    base.chunk_id AS match,
    base.gse AS gse,
    distance
FROM VECTOR_SEARCH(
    TABLE `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`,
    'embedding',
    (SELECT {question_embedding} AS embedding, '{question}' AS question),
    'embedding',
    top_k => 5,
    distance_type => 'DOT_PRODUCT',
    options => '{{"use_brute_force":true}}'
)
ORDER BY distance
'''
ob = bq.query(query, job_config = bigquery.QueryJobConfig(use_query_cache=False))
job.result()
job.to_dataframe()

Unnamed: 0,question,match,gse,distance
0,Does a lender have to perform servicing functi...,fannie_part_0_c352,fannie,-0.709984
1,Does a lender have to perform servicing functi...,fannie_part_0_c353,fannie,-0.672371
2,Does a lender have to perform servicing functi...,fannie_part_0_c326,fannie,-0.66835
3,Does a lender have to perform servicing functi...,fannie_part_0_c92,fannie,-0.661434
4,Does a lender have to perform servicing functi...,fannie_part_2_c417,fannie,-0.655913


Check to see if the index was used with [vector index usage information](https://cloud.google.com/bigquery/docs/vector-index#vector_index_usage)

In [60]:
job._properties['statistics']['query']['vectorSearchStatistics']

{'indexUsageMode': 'FULLY_USED'}

#### Pre-Filter Matches For An Input Embedding

In [61]:
example_gse = 'fannie'
query = f'''
SELECT
    query.question AS question,
    base.chunk_id AS match,
    base.gse AS gse,
    distance
FROM VECTOR_SEARCH(
    (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` WHERE gse = '{example_gse}'),
    'embedding',
    (SELECT {question_embedding} AS embedding, '{question}' AS question),
    'embedding',
    top_k => 5,
    distance_type => 'DOT_PRODUCT',
    options => '{{"use_brute_force":true}}'
)
ORDER BY distance
'''
job = bq.query(query, job_config = bigquery.QueryJobConfig(use_query_cache=False))
job.result()
job.to_dataframe()

Unnamed: 0,question,match,gse,distance
0,Does a lender have to perform servicing functi...,fannie_part_0_c352,fannie,-0.709984
1,Does a lender have to perform servicing functi...,fannie_part_0_c353,fannie,-0.672371
2,Does a lender have to perform servicing functi...,fannie_part_0_c326,fannie,-0.66835
3,Does a lender have to perform servicing functi...,fannie_part_0_c92,fannie,-0.661434
4,Does a lender have to perform servicing functi...,fannie_part_0_c240,fannie,-0.660858


Check to see if the index was used with [vector index usage information](https://cloud.google.com/bigquery/docs/vector-index#vector_index_usage)

In [62]:
job._properties['statistics']['query']['vectorSearchStatistics']

{'indexUsageMode': 'UNUSED',
 'indexUnusedReasons': [{'code': 'INDEX_SUPPRESSED_BY_FUNCTION_OPTION',
   'message': 'The vector index `row_index` of the base table `statmike-mlops-349915:applied_genai.retrieval-bigquery` was not used because use_brute_force option has been specified.',
   'baseTable': {'projectId': 'statmike-mlops-349915',
    'datasetId': 'applied_genai',
    'tableId': 'retrieval-bigquery'},
   'indexName': 'row_index'}]}

### Get Matches: Batch Matching Is A BigQuery Advantage!

Getting a match for a single input, or few inputs, is common in online applications.  Many applicationn can benefit from batch matching and BigQuery easily extends to this.  Here is an example of finding the top matching chunks from the `gse = freddie` chunks for each `gse = fannie` chunk.  Thats thousands of simoutaneous matches! 

In [63]:
query = f'''
SELECT
    query.chunk_id AS fannie_chunk_id,
    ARRAY_AGG(base.chunk_id ORDER BY distance) as matches
FROM VECTOR_SEARCH(
    (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` WHERE gse = 'freddie'),
    'embedding',
    (SELECT * FROM `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}` WHERE gse = 'fannie'),
    'embedding',
    top_k => 2,
    distance_type => 'DOT_PRODUCT'
)
GROUP BY fannie_chunk_id
ORDER BY fannie_chunk_id
'''
job = bq.query(query, job_config = bigquery.QueryJobConfig(use_query_cache=False))
job.result()
results = job.to_dataframe()
results.head()

Unnamed: 0,fannie_chunk_id,matches
0,fannie_part_0_c10,"[freddie_part_0_c7, freddie_part_0_c90]"
1,fannie_part_0_c100,"[freddie_part_1_c339, freddie_part_1_c343]"
2,fannie_part_0_c1000,"[freddie_part_2_c308, freddie_part_2_c347]"
3,fannie_part_0_c1001,"[freddie_part_2_c347, freddie_part_2_c315]"
4,fannie_part_0_c1002,"[freddie_part_2_c355, freddie_part_2_c316]"


In [64]:
results.shape

(2603, 2)

Check to see if the index was used with [vector index usage information](https://cloud.google.com/bigquery/docs/vector-index#vector_index_usage)

In [65]:
job._properties['statistics']['query']['vectorSearchStatistics']

{'indexUsageMode': 'FULLY_USED'}

How many megabytes were scanned by the job:

In [66]:
job.total_bytes_processed / (1024*1024)

80.48384284973145

What was the compute time (slot time) of the job in BigQuery in seconds:

In [67]:
job.slot_millis/1000

41.543

In [68]:
(job.ended - job.started).total_seconds()

2.476

Wow!! This found the top matches for each of the Fannie Mae chunks within the thousands of Freddie Mac chunks in only a few seconds.

### Understanding The Vector Index

The advantage of a vector index is reading fewer rows when looking for matches.  This section attempts to show how this work the index.

In [164]:
query = f'''
SELECT
    COUNT(*) AS row_count
FROM VECTOR_SEARCH(
    TABLE `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`,
    'embedding',
    (SELECT {question_embedding} AS embedding, '{question}' AS question),
    'embedding',
    top_k => 10,
    distance_type => 'DOT_PRODUCT',
    options => '{{"fraction_lists_to_search": 0.01}}'
)
'''
job = bq.query(query, job_config = bigquery.QueryJobConfig(use_query_cache=False))
job.result()
job.to_dataframe()['row_count'][0]

10

Using the job object, the query plan can be traversed.  In the code below a summary of each stage is printed.  If a stage has substeps that include a reference to `centroid_id` then the details of the substep are presented:

In [165]:
for stage in job.query_plan:
    print('\nName: ', stage.name, '\nEntry ID: ', stage.entry_id, '\nStatus: ', stage.status, '\nRecords :', stage.records_read)
    for step in stage.steps:
        for sstep in step.substeps:
            if 'centroid_id' in sstep:
                print('\t', step.__dict__)


Name:  S00: Input 
Entry ID:  0 
Status:  COMPLETE 
Records : 0

Name:  S01: Compute 
Entry ID:  1 
Status:  COMPLETE 
Records : 1

Name:  S02: Input 
Entry ID:  2 
Status:  COMPLETE 
Records : 100
	 {'kind': 'READ', 'substeps': ['$20:centroid_id, $21:embedded_col', 'FROM statmike-mlops-349915.applied_genai.retrieval-bigquery_row_index']}

Name:  S03: Coalesce 
Entry ID:  3 
Status:  COMPLETE 
Records : 1

Name:  S04: Join+ 
Entry ID:  4 
Status:  COMPLETE 
Records : 140

Name:  S05: Compute 
Entry ID:  5 
Status:  COMPLETE 
Records : 1

Name:  S06: Input 
Entry ID:  6 
Status:  COMPLETE 
Records : 0

Name:  S07: Aggregate 
Entry ID:  7 
Status:  COMPLETE 
Records : 36

Name:  S09: Coalesce 
Entry ID:  9 
Status:  COMPLETE 
Records : 0

Name:  S0A: Coalesce 
Entry ID:  10 
Status:  COMPLETE 
Records : 1

Name:  S0B: Join+ 
Entry ID:  11 
Status:  COMPLETE 
Records : 9042
	 {'kind': 'READ', 'substeps': ['$1:centroid_id, $2:indexed_original, $3:_s_chunk_id', 'FROM statmike-mlops-349915.

**Notice:**

The first state with a reference to `centroid_id` is within a stage that referenced 100 records. This matches the number of clusters (`num_lists`) defined when the index was created.  If the number of clusters had not be defined it would have been automatically determined and this method could be used to understand how many clusters were created.

---
## Retrieval Augmented Generation (RAG)

Build a simple retrieval augmented generation process that enhances a query by retrieving context.  This is done here by constructing three functions for the stages:
- `retrieve` - a function that uses an embedding to search for matching context parts, pieces of texts
    - this uses the system built earlier in this workflow!
- `augment` - prepare chunks into a prompt
- `generate` - make the llm request with the augmented prompt

A final function is used to execute the workflow of rag:
- `rag` - a function that receives the query an orchestrates the workflow through `retrieve` > `augment` > `generate`

### Clients

In [69]:
embedder = vertexai.language_models.TextEmbeddingModel.from_pretrained('text-embedding-004')
llm = vertexai.generative_models.GenerativeModel("gemini-1.5-flash-002")

### Retrieve Function

In [92]:
def retrieve_bigquery(query_embedding, n_matches = 5):

    # query notes: brute force
    query = f'''
    SELECT
        base.chunk_id as chunk_id,
        base.content AS content,
    FROM VECTOR_SEARCH(
        TABLE `{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}`,
        'embedding',
        (SELECT {question_embedding} AS embedding),
        'embedding',
        top_k => {n_matches},
        distance_type => 'DOT_PRODUCT'
        ,options => '{{"use_brute_force":true}}'
    )
    ORDER BY distance
    '''
    timings = {}
    start_time = time.time()
    
    # Method 1: This common approach with the API is not as fast as the alternative that follows!
    #job = bq.query(query, job_config = bigquery.QueryJobConfig(use_query_cache=False))
    #job.result()
    #results = job.to_dataframe()
    
    # Method 2: Much less latency than the common approach of Method 1
    results = bq.query_and_wait(query, job_config = bigquery.QueryJobConfig(use_query_cache=False)).to_dataframe()
    matches = results.to_dict('records')
    
    # Method 2 alt: The result of client.query_and_wait is a row iterator. This approach directly uses the iterator.  It is not as efficient as the dataframe approach of Method 2 though.
    #matches = []
    #for row in results:
    #    matches.append(dict(row))
    
    return matches

### Augment Function

In [93]:
def augment(matches):

    prompt = ''
    for m, match in enumerate(matches):
        prompt += f"Context {m+1}:\n{match['content']}\n\n"
    prompt += f'Answer the following question using the provided contexts:\n'

    return prompt

### Generate Function

In [94]:
def generate(prompt):

    result = llm.generate_content(prompt)

    return result

### RAG Function

In [95]:
def rag(query):
    
    query_embedding = embedder.get_embeddings([query])[0].values
    matches = retrieve_bigquery(query_embedding)
    prompt = augment(matches) + query
    result = generate(prompt)
    
    return result.text

### Example In Use

In [96]:
question

'Does a lender have to perform servicing functions directly?'

In [97]:
print(rag(question))

No.  A lender may use other organizations to perform some or all of its servicing functions through subservicing arrangements (Context 1).  However, the lender (master servicer) remains contractually responsible, and there are stipulations regarding Fannie Mae approval and maintaining the ability to meet reporting requirements (Context 4).



---
### Profiling Performance In Testing Environments

Profile the timing of each step in the RAG function for sequential calls. The environment choosen for this workflow is a minimal testing enviornment so load testing (simoultaneous requests) would not be appropriate for some solutions.

In [101]:
profile = []

In [102]:
def rag(query, profile = profile):
    
    timings = {}
    start_time = time.time()
    
    # 1. Get embeddings
    embedding_start = time.time()
    query_embedding = embedder.get_embeddings([query])[0].values
    timings['embedding'] = time.time() - embedding_start

    # 2. Retrieve from BigQuery
    retrieval_start = time.time()
    matches = retrieve_bigquery(query_embedding)
    timings['retrieve_bigquery'] = time.time() - retrieval_start

    # 3. Augment the prompt
    augment_start = time.time()
    prompt = augment(matches) + query
    timings['augment'] = time.time() - augment_start

    # 4. Generate text
    generate_start = time.time()
    result = generate(prompt)
    timings['generate'] = time.time() - generate_start

    total_time = time.time() - start_time
    timings['total'] = total_time
    
    profile.append(timings)
    
    return result.text

In [103]:
print(rag(question))

No, a lender does not have to perform servicing functions directly.  Context 1 explicitly states that a lender may use other organizations to perform some or all of its servicing functions, referring to this as "subservicing."  This allows for a "master servicer" to utilize a "subservicer" to handle some of the workload.  However, the master servicer remains contractually responsible.



In [104]:
profile

[{'embedding': 0.09870719909667969,
  'retrieve_bigquery': 0.9294946193695068,
  'augment': 3.528594970703125e-05,
  'generate': 0.7410991191864014,
  'total': 1.7693414688110352}]

### Report From Profile

In [105]:
for i in range(100):
    response = rag(question)

In [106]:
all_timings = {}
for timings in profile:
    for key, value in timings.items():
        if key not in all_timings:
            all_timings[key] = []
        all_timings[key].append(value)

In [107]:
for key, values in all_timings.items():
    arr = np.array(values)
    print(f"Statistics for '{key}':")
    print(f"  Min: {np.min(arr):.4f} seconds")
    print(f"  Max: {np.max(arr):.4f} seconds")
    print(f"  Mean: {np.mean(arr):.4f} seconds")
    print(f"  Median: {np.median(arr):.4f} seconds")
    print(f"  Std Dev: {np.std(arr):.4f} seconds")
    print(f"  P95: {np.percentile(arr, 95):.4f} seconds")
    print(f"  P99: {np.percentile(arr, 99):.4f} seconds")
    print("")

Statistics for 'embedding':
  Min: 0.0474 seconds
  Max: 10.0669 seconds
  Mean: 0.1697 seconds
  Median: 0.0555 seconds
  Std Dev: 0.9947 seconds
  P95: 0.0987 seconds
  P99: 1.0568 seconds

Statistics for 'retrieve_bigquery':
  Min: 0.7869 seconds
  Max: 1.3211 seconds
  Mean: 0.9671 seconds
  Median: 0.9480 seconds
  Std Dev: 0.1064 seconds
  P95: 1.1408 seconds
  P99: 1.2350 seconds

Statistics for 'augment':
  Min: 0.0000 seconds
  Max: 0.0001 seconds
  Mean: 0.0000 seconds
  Median: 0.0000 seconds
  Std Dev: 0.0000 seconds
  P95: 0.0001 seconds
  P99: 0.0001 seconds

Statistics for 'generate':
  Min: 0.5702 seconds
  Max: 1.1724 seconds
  Mean: 0.7434 seconds
  Median: 0.7365 seconds
  Std Dev: 0.1000 seconds
  P95: 0.9134 seconds
  P99: 1.0414 seconds

Statistics for 'total':
  Min: 1.5194 seconds
  Max: 11.7873 seconds
  Mean: 1.8802 seconds
  Median: 1.7625 seconds
  Std Dev: 1.0054 seconds
  P95: 2.0123 seconds
  P99: 2.8446 seconds



---
## Low Latency Vector Search With BigQuery

Did you know that Vertex AI Feature Store synchronizes BigQuery sources to an online store with a high speed client?  And that embeddings matching is a built-in feature of the online store?  Continue on to the next workflow in the [Retrieval](./readme.md) section that extends this workflow to Vertex AI Feature Store: [Retrieval - Vertex AI Feature Store](Retrieval%20-%20Vertex%20AI%20Feature%20Store.ipynb).

---
## Remove Resources

The resources created above in BigQuery will persist unless deleted.  To remove the table created above uncomment and run the following cell:

In [28]:
#bq.delete_table(bq_table)