<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Retrieval/Retrieval%20-%20AlloyDB%20For%20PostgreSQL.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FApplied%2520GenAI%2FRetrieval%2FRetrieval%2520-%2520AlloyDB%2520For%2520PostgreSQL.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Retrieval/Retrieval%20-%20AlloyDB%20For%20PostgreSQL.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Applied%20GenAI/Retrieval/Retrieval%20-%20AlloyDB%20For%20PostgreSQL.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# Retrieval - AlloyDB For PostgreSQL
<p style="font-size: 45px;">IN PROGRESS - NOT COMPLETE</p>

In prior workflows, a series of documents was [processed into chunks](../Chunking/readme.md), and for each chunk, [embeddings](../Embeddings/readme.md) were created:

- Process: [Large Document Processing - Document AI Layout Parser](../Chunking/Large%20Document%20Processing%20-%20Document%20AI%20Layout%20Parser.ipynb)
- Embed: [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb)

Retrieving chunks for a query involves calculating the embedding for the query and then using similarity metrics to find relevant chunks. A thorough review of similarity matching can be found in [The Math of Similarity](../Embeddings/The%20Math%20of%20Similarity.ipynb) - use dot product! As development moves from experiment to application, the process of storing and computing similarity is migrated to a [retrieval](./readme.md) system. This workflow is part of a [series of workflows exploring many retrieval systems](./readme.md).
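The retrieval step described above can be sketched without any database: score every chunk against the query embedding and keep the best matches. This is only an illustration with made-up three-dimensional vectors; `toy_chunks`, `toy_query`, `dot`, and `top_k` are hypothetical names and not part of this workflow.

```python
# Toy 3-dimensional embeddings; the real embeddings in this workflow are much longer.
toy_chunks = [
    {"chunk_id": "a", "embedding": [1.0, 0.0, 0.0]},
    {"chunk_id": "b", "embedding": [0.6, 0.8, 0.0]},
    {"chunk_id": "c", "embedding": [0.0, 0.0, 1.0]},
]
toy_query = [0.8, 0.6, 0.0]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def top_k(query, chunks, k=2):
    """Score every chunk against the query and return the k highest-scoring (id, score) pairs."""
    scored = [(c["chunk_id"], dot(query, c["embedding"])) for c in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


matches = top_k(toy_query, toy_chunks)
print(matches)
```

A retrieval system such as AlloyDB takes over both the storage of the vectors and this scoring step once the number of chunks grows.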

**AlloyDB For Storage, Indexing, And Search**

- [**AlloyDB for PostgreSQL**](https://cloud.google.com/alloydb) is a fully managed, PostgreSQL-compatible database service on Google Cloud, positioned by Google as faster than standard PostgreSQL.
- AlloyDB includes a set of [generative AI features](https://cloud.google.com/alloydb/docs/ai), such as generating embeddings and predictions through integration with Vertex AI. Predictions can also be requested from any [endpoint in Google Cloud](https://cloud.google.com/alloydb/docs/ai/model-endpoint-register-model#add-generic).
- Similarity metrics are built into AlloyDB through an optimized implementation of [`pgvector`](https://github.com/pgvector/pgvector?tab=readme-ov-file#indexing), simply called [`vector`](https://cloud.google.com/alloydb/docs/ai/store-embeddings#required-extension) in AlloyDB. This allows for the creation of inverted file (IVF) indexes that accelerate similarity search.
- Additionally, AlloyDB offers the `alloydb_scann` extension, which implements the [ScaNN algorithm](https://github.com/google-research/google-research/blob/master/scann/docs/algorithms.md) for efficient approximate nearest neighbor matching.

**Use Case Data**

Buying a home usually involves borrowing money from a lending institution, typically through a mortgage secured by the home's value. But how do these institutions manage the risks associated with such large loans, and how are lending standards established?

In the United States, two government-sponsored enterprises (GSEs) play a vital role in the housing market:

- Federal National Mortgage Association ([Fannie Mae](https://www.fanniemae.com/))
- Federal Home Loan Mortgage Corporation ([Freddie Mac](https://www.freddiemac.com/))

These GSEs purchase mortgages from lenders, enabling those lenders to offer more loans. This process also allows Fannie Mae and Freddie Mac to set standards for mortgages, ensuring they are responsible and borrowers are more likely to repay them. This system makes homeownership more affordable and stabilizes the housing market by maintaining a steady flow of liquidity for lenders and keeping interest rates controlled.

However, navigating the complexities of these GSEs and their extensive servicing guides can be challenging.

**Approaches**

[This series](../readme.md) covers many generative AI workflows. These documents are used directly as long context for Gemini in the workflow [Long Context Retrieval With The Vertex AI Gemini API](../Generate/Long%20Context%20Retrieval%20With%20The%20Vertex%20AI%20Gemini%20API.ipynb). The workflow below uses a [retrieval](./readme.md) approach with the already generated chunks and embeddings.

---
## Colab Setup

When running this notebook in [Colab](https://colab.google/) or [Colab Enterprise](https://cloud.google.com/colab/docs/introduction), this section will authenticate to GCP (follow prompts in the popup) and set the current project for the session.

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs and API Enablement

The client packages may need to be installed in this environment.

### Installs (If Needed)

In [3]:
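# tuples of (import name, install name, min_version)
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform', '1.69.0'),
    ('google.cloud.alloydb', 'google-cloud-alloydb'),
    ('google.cloud.alloydb.connector', 'google-cloud-alloydb-connector'),
    ('sqlalchemy', 'sqlalchemy'),
    ('pg8000', 'pg8000'),
    ('asyncpg', 'asyncpg')
]

import importlib.util, importlib.metadata

def version_tuple(version):
    # compare release numbers numerically; as strings '1.9.0' would compare as newer than '1.69.0'
    return tuple(int(part) for part in version.split('.') if part.isdigit())

install = False
for package in packages:
    try:
        # find_spec raises ModuleNotFoundError when a parent package (e.g. google.cloud) is missing
        found = importlib.util.find_spec(package[0]) is not None
    except ModuleNotFoundError:
        found = False
    if not found:
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # installed versions are looked up by distribution (install) name
        if version_tuple(importlib.metadata.version(package[1])) < version_tuple(package[2]):
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user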

### API Enablement

In [4]:
!gcloud services enable aiplatform.googleapis.com
!gcloud services enable alloydb.googleapis.com

### Restart Kernel (If Installs Occurred)

After the kernel restarts, continue running from the cell that follows the restart cell. Earlier cells do not need to be rerun.

In [5]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
    IPython.display.display(IPython.display.Markdown("""<div class=\"alert alert-block alert-warning\">
        <b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. The previous cells do not need to be run again⚠️</b>
        </div>"""))

---
## Setup

Inputs

In [6]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [7]:
REGION = 'us-central1'
SERIES = 'applied-genai'
EXPERIMENT = 'retrieval-alloydb'

# AlloyDB Names
ALLOYDB_CLUSTER_NAME = PROJECT_ID
ALLOYDB_INSTANCE_NAME = PROJECT_ID
ALLOYDB_DATABASE_NAME = SERIES
ALLOYDB_TABLE_NAME = EXPERIMENT

ALLOYDB_USER = 'test_alloydb'
ALLOYDB_PASS = 'test_alloydb_pass'

Packages

In [8]:
#!pip install google-cloud-aiplatform -U -q --user --force-reinstall

In [9]:
import os, json, time, glob, datetime, asyncio, shutil

import numpy as np

# Vertex AI
from google.cloud import aiplatform
import vertexai.language_models # for embeddings API
import vertexai.generative_models # for Gemini Models
from vertexai.resources.preview import feature_store

# AlloyDB
from google.cloud import alloydb
import google.cloud.alloydb.connector
import pg8000
import sqlalchemy
import asyncpg
import sqlalchemy.ext.asyncio

In [10]:
aiplatform.__version__

'1.71.0'

Clients

In [11]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)

# alloydb
alloydb_client = alloydb.AlloyDBAdminClient()

---
## Text & Embeddings For Examples

This repository contains a [section for document processing (chunking)](../Chunking/readme.md) that includes an example of processing multiple large PDFs (over 1000 pages) into chunks: [Large Document Processing - Document AI Layout Parser](../Chunking/Large%20Document%20Processing%20-%20Document%20AI%20Layout%20Parser.ipynb). The chunks of text from that workflow are stored with this repository and loaded by a companion workflow that augments the chunks with text embeddings: [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb).

The following code loads the version of the chunks that includes text embeddings and prepares it for a local example of retrieval augmented generation.

### Get The Documents

If you are working from a clone of this notebook's [repository](https://github.com/statmike/vertex-ai-mlops), then the documents are already present. The following cell checks for the documents folder and, if it is missing, retrieves it with `git clone`:

In [12]:
local_dir = '../Embeddings/files/embeddings-api'

In [13]:
if not os.path.exists(local_dir):
    print('Retrieving documents...')
    parent_dir = os.path.dirname(local_dir)
    temp_dir = os.path.join(parent_dir, 'temp')
    if not os.path.exists(temp_dir):
        os.makedirs(temp_dir)
    !git clone https://www.github.com/statmike/vertex-ai-mlops {temp_dir}/vertex-ai-mlops
    shutil.copytree(f'{temp_dir}/vertex-ai-mlops/Applied GenAI/Embeddings/files/embeddings-api', local_dir)
    shutil.rmtree(temp_dir)
    print(f'Documents are now in folder `{local_dir}`')
else:
    print(f'Documents Found in folder `{local_dir}`')             

Documents Found in folder `../Embeddings/files/embeddings-api`


### Load The Chunks

In [14]:
jsonl_files = glob.glob(f"{local_dir}/large-files*.jsonl")
jsonl_files.sort()
jsonl_files

['../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0000.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0001.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0002.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0003.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0004.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0005.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0006.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0007.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0008.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0009.jsonl']

In [15]:
chunks = []
for file in jsonl_files:
    with open(file, 'r') as f:
        chunks.extend([json.loads(line) for line in f])
len(chunks)

9040

### Review A Chunk

In [16]:
chunks[0].keys()

dict_keys(['instance', 'predictions', 'status'])

In [17]:
chunks[0]['instance']['chunk_id']

'fannie_part_0_c17'

In [18]:
print(chunks[0]['instance']['content'])

# Selling Guide Fannie Mae Single Family

## Fannie Mae Copyright Notice

### Fannie Mae Copyright Notice

|-|
| Section B3-4.2, Verification of Depository Assets 402 |
| B3-4.2-01, Verification of Deposits and Assets (05/04/2022) 403 |
| B3-4.2-02, Depository Accounts (12/14/2022) 405 |
| B3-4.2-03, Individual Development Accounts (02/06/2019) 408 |
| B3-4.2-04, Pooled Savings (Community Savings Funds) (04/01/2009) 411 |
| B3-4.2-05, Foreign Assets (05/04/2022) 411 |
| Section B3-4.3, Verification of Non-Depository Assets 412 |
| B3-4.3-01, Stocks, Stock Options, Bonds, and Mutual Funds (06/30/2015) 412 |
| B3-4.3-02, Trust Accounts (04/01/2009) 413 |
| B3-4.3-03, Retirement Accounts (06/30/2015) 414 |
| B3-4.3-04, Personal Gifts (09/06/2023) 415 |
| B3-4.3-05, Gifts of Equity (10/07/2020) 418 |
| B3-4.3-06, Grants and Lender Contributions (12/14/2022) 419 |
| B3-4.3-07, Disaster Relief Grants or Loans (04/01/2009) 423 |
| B3-4.3-08, Employer Assistance (09/29/2015) 423 |
| B3-4.3-09,

In [19]:
chunks[0]['predictions'][0]['embeddings']['values'][0:10]

[0.031277116388082504,
 0.03056905046105385,
 0.010865348391234875,
 0.0623614676296711,
 0.03228681534528732,
 0.05066155269742012,
 0.046544693410396576,
 0.05509665608406067,
 -0.014074751175940037,
 0.008380400016903877]

### Prepare Chunk Structure

Make a list of dictionaries with information for each chunk:

In [20]:
content_chunks = [
    dict(
        gse = chunk['instance']['gse'],
        chunk_id = chunk['instance']['chunk_id'],
        content = chunk['instance']['content'],
        embedding = chunk['predictions'][0]['embeddings']['values']
    ) for chunk in chunks
]
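To make the record layout concrete, here is a minimal sketch that parses one made-up JSONL line shaped like the records above and flattens it the same way. The values, including the empty `status`, are illustrative only, and `to_row` is a hypothetical helper name.

```python
import json

# One made-up JSONL line with the same top-level keys as the loaded chunks.
line = json.dumps({
    "instance": {"gse": "fannie", "chunk_id": "toy_c0", "content": "toy text"},
    "predictions": [{"embeddings": {"values": [0.1, 0.2, 0.3]}}],
    "status": "",
})


def to_row(record):
    """Flatten one nested record into the four fields kept for each chunk."""
    return dict(
        gse=record["instance"]["gse"],
        chunk_id=record["instance"]["chunk_id"],
        content=record["instance"]["content"],
        embedding=record["predictions"][0]["embeddings"]["values"],
    )


row = to_row(json.loads(line))
print(row)
```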

### Query Embedding

Create a query, or prompt, and get the embedding for it:

Connect to models for text embeddings. Learn more about the model API:
- [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb)

In [21]:
question = "Does a lender have to perform servicing functions directly?"

In [22]:
embedder = vertexai.language_models.TextEmbeddingModel.from_pretrained('text-embedding-004')

In [23]:
question_embedding = embedder.get_embeddings([question])[0].values
question_embedding[0:10]

[-0.0005117303808219731,
 0.009651427157223225,
 0.01768726110458374,
 0.014538003131747246,
 -0.01829824410378933,
 0.027877431362867355,
 -0.021124685183167458,
 0.008830446749925613,
 -0.02669006586074829,
 0.06414774805307388]

---
## Setup AlloyDB

- Cluster
    - Instance
        - Connect to Instance With SQL Client
        - Database
            - Table

### Create/Retrieve Cluster

https://cloud.google.com/alloydb/docs/cluster-create

https://cloud.google.com/alloydb/docs/cluster-settings?resource=prim-instance

https://cloud.google.com/python/docs/reference/alloydb/latest/google.cloud.alloydb_v1.services.alloy_db_admin.AlloyDBAdminClient#google_cloud_alloydb_v1_services_alloy_db_admin_AlloyDBAdminClient_create_cluster

In [24]:
try:
    alloydb_cluster = alloydb_client.get_cluster(
        name = f'projects/{PROJECT_ID}/locations/{REGION}/clusters/{ALLOYDB_CLUSTER_NAME}'
    )
    print(f"Found the cluster: {alloydb_cluster.name}")
except Exception:
    print('Creating a cluster ...')
    create_cluster = alloydb_client.create_cluster(
        parent = f"projects/{PROJECT_ID}/locations/{REGION}",
        cluster_id = ALLOYDB_CLUSTER_NAME,
        cluster = alloydb.Cluster(
            #network = f"projects/{PROJECT_ID}/global/networks/your-network"
            initial_user = alloydb.UserPassword(
                user = ALLOYDB_USER,
                password = ALLOYDB_PASS
            ),
        )
    )
    alloydb_cluster = create_cluster.result()
    print(f"Created the cluster: {alloydb_cluster.name}")

Found the cluster: projects/statmike-mlops-349915/locations/us-central1/clusters/statmike-mlops-349915


In [25]:
#alloydb_cluster

### Create/Retrieve Instance

https://cloud.google.com/alloydb/docs/cluster-create

https://cloud.google.com/alloydb/docs/cluster-settings?resource=prim-instance

https://cloud.google.com/python/docs/reference/alloydb/latest/google.cloud.alloydb_v1.services.alloy_db_admin.AlloyDBAdminClient#google_cloud_alloydb_v1_services_alloy_db_admin_AlloyDBAdminClient_create_instance

In [26]:
try:
    alloydb_instance = alloydb_client.get_instance(
        name = f"{alloydb_cluster.name}/instances/{ALLOYDB_INSTANCE_NAME}"
    )
    print(f"Found the instance: {alloydb_instance.name}")
except Exception:
    print('Creating an instance ...')
    create_instance = alloydb_client.create_instance(
        parent = alloydb_cluster.name,
        instance_id = ALLOYDB_INSTANCE_NAME,
        instance = alloydb.Instance(
            instance_type = alloydb.Instance.InstanceType.PRIMARY, # PRIMARY supports read/write, READ_POOL supports read only
            machine_config = alloydb.Instance.MachineConfig(cpu_count = 2),
            availability_type = alloydb.Instance.AvailabilityType.ZONAL, # ZONAL is a single zone, REGIONAL is multi-zone within the region (high availability)
            gce_zone = REGION+'-a', # with availability_type = ZONAL, pick a zone within the region
        )
    )
    alloydb_instance = create_instance.result()
    print(f"Created the instance: {alloydb_instance.name}")

Found the instance: projects/statmike-mlops-349915/locations/us-central1/clusters/statmike-mlops-349915/instances/statmike-mlops-349915


In [27]:
#alloydb_instance

### Create/Retrieve User

In [29]:
alloydb_client.get_user(name = f"{alloydb_cluster.name}/users/{ALLOYDB_USER}")

name: "projects/statmike-mlops-349915/locations/us-central1/clusters/statmike-mlops-349915/users/test_alloydb"
database_roles: "alloydbsuperuser"
user_type: ALLOYDB_BUILT_IN

### Connections To Databases

AlloyDB is PostgreSQL-compatible and includes a default database named `postgres`.

There are many ways to connect to a database: https://cloud.google.com/alloydb/docs/connection-overview

This workflow uses Python with the AlloyDB language connectors: https://cloud.google.com/alloydb/docs/language-connectors-overview

Both sync and async connectors are created: https://cloud.google.com/alloydb/docs/connect-language-connectors#install-connectors

Connections have three parts:
- a connection tool, in this case provided by: https://github.com/GoogleCloudPlatform/alloydb-python-connector/tree/main
- a driver to create a connection pool, pg8000 (sync) and asyncpg (async)
- a client library that can use connection pools to execute SQL queries, SQLAlchemy
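The first two parts fit together through a small pattern: a factory function returns a zero-argument callable, and the pool calls it whenever it needs a new connection. A minimal sketch of that pattern, using the standard library `sqlite3` module as a stand-in for the AlloyDB connector and a hypothetical `TinyPool` class as a stand-in for a real connection pool (neither is part of this workflow):

```python
import sqlite3


def get_conn_factory(db):
    """Return a zero-argument callable that opens a new connection to `db`."""
    def getconn():
        return sqlite3.connect(db)
    return getconn


class TinyPool:
    """Illustrative stand-in for a pool: opens a connection on demand via `creator`.

    A real pool would also keep connections open and reuse them.
    """

    def __init__(self, creator):
        self.creator = creator

    def execute(self, sql):
        conn = self.creator()
        try:
            return conn.execute(sql).fetchall()
        finally:
            conn.close()


pool = TinyPool(get_conn_factory(":memory:"))
print(pool.execute("SELECT 'Success' AS did_it_work"))
```

Keeping the connection details inside the factory means the pool never needs to know the user, password, or database name.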

#### Connection Tool

In [30]:
sync_connector = google.cloud.alloydb.connector.Connector()
async_connector = google.cloud.alloydb.connector.AsyncConnector()

#### Connection

In [31]:
def get_sync_conn(
    connector: google.cloud.alloydb.connector.Connector,
    db: str
):
    def getconn():
        conn = connector.connect(
            alloydb_instance.name,
            "pg8000",
            user = ALLOYDB_USER,
            password = ALLOYDB_PASS,
            db = db
        )
        return conn
    return getconn

In [32]:
async def get_async_conn(
    connector: google.cloud.alloydb.connector.AsyncConnector,
    db: str
):
    async def getconn():
        conn = await connector.connect(
            alloydb_instance.name,
            "asyncpg",
            user = ALLOYDB_USER,
            password = ALLOYDB_PASS,
            db = db
        )
        return conn
    return getconn

#### Connection Pool

In [33]:
def get_sync_pool(
    connector: google.cloud.alloydb.connector.Connector,
    db: str
) -> sqlalchemy.engine.Engine:

    pool = sqlalchemy.create_engine(
        "postgresql+pg8000://",
        creator = get_sync_conn(connector, db)
    )
    pool.dialect.description_encoding = None
    pool.execution_options(isolation_level="AUTOCOMMIT")
    return pool

In [34]:
async def get_async_pool(
    connector: google.cloud.alloydb.connector.AsyncConnector,
    db: str
) -> sqlalchemy.ext.asyncio.AsyncEngine:

    pool = sqlalchemy.ext.asyncio.create_async_engine(
        "postgresql+asyncpg://",
        async_creator = await get_async_conn(connector, db)
    )
    pool.dialect.description_encoding = None
    pool.execution_options(isolation_level="AUTOCOMMIT")
    return pool

In [35]:
sync_pool = get_sync_pool(sync_connector, 'postgres')

In [36]:
async_pool = await get_async_pool(async_connector, 'postgres')

#### Query Orchestrator

Use a pool connection as a context manager to orchestrate the query:

In [37]:
def run_query(query, pool = None, connector = sync_connector):
    # get the current connection pool:
    if pool is None:
        pool = sync_pool
        
    # run the query and get the response as 'result'
    with pool.connect().execution_options(isolation_level="AUTOCOMMIT") as connection:
        result = connection.execute(query)
        connector.close()
        
    # prepare the response
    rows = []
    try:
        for row in result:
            rows.append(dict(zip(result.keys(), row)))
    except Exception:
        pass
    
    # return the response
    return rows[0] if len(rows) == 1 else rows

In [38]:
async def async_run_query(query, pool = None, connector = async_connector):
    # get the current connection pool
    if pool is None:
        pool = async_pool
        
    # run the query and get the response as 'result'
    async with pool.connect() as connection:
        result = await connection.execute(query)
        await connection.commit()
        await connector.close()
        
    # prepare the response
    rows = []
    try:
        for row in result:
            rows.append(dict(zip(result.keys(), row)))
    except Exception:
        pass
    
    # return the response
    return rows[0] if len(rows) == 1 else rows
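Both helpers end with the same return convention: a single dictionary when exactly one row comes back, and a list otherwise (an empty list when a statement returns no rows). A minimal sketch of that convention with made-up rows, where `shape_response` and `as_list` are hypothetical names:

```python
def shape_response(rows):
    """Mirror the helpers' last line: one row becomes a dict, anything else stays a list."""
    return rows[0] if len(rows) == 1 else rows


def as_list(response):
    """Normalize a shaped response back to a list for callers that always want one."""
    return [response] if isinstance(response, dict) else response


no_rows = shape_response([])
one_row = shape_response([{"did_it_work": "Success"}])
two_rows = shape_response([{"n": 1}, {"n": 2}])

print(no_rows, one_row, two_rows)
```

This is why a check such as `if not result:` works for an empty result, while single-row results can be indexed directly by column name.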

### Test Query

When submitting SQL statements either of the connectors should work for DML (SELECT, INSERT, DELETE, UPDATE) but only the synchronous connector appears to work for DDL (CREATE, ALTER, DROP) statements.

In [39]:
query = sqlalchemy.text("SELECT 'Success' as did_it_work")

In [40]:
run_query(query)

{'did_it_work': 'Success'}

In [41]:
await async_run_query(query)

{'did_it_work': 'Success'}

---
## Working With AlloyDB

### Create A Database

In [42]:
query = sqlalchemy.text(f"SELECT datname FROM pg_database WHERE datname = '{ALLOYDB_DATABASE_NAME}'")
result = run_query(query)
result

{'datname': 'applied-genai'}

In [43]:
if not result:
    query = sqlalchemy.text(f"CREATE DATABASE \"{ALLOYDB_DATABASE_NAME}\"")
    run_query(query)

In [44]:
query = sqlalchemy.text(f"SELECT * FROM pg_database WHERE datname = '{ALLOYDB_DATABASE_NAME}'")
run_query(query)

{'oid': 21630,
 'datname': 'applied-genai',
 'datdba': 16470,
 'encoding': 6,
 'datlocprovider': 'i',
 'datistemplate': False,
 'datallowconn': True,
 'datconnlimit': -1,
 'datfrozenxid': 720,
 'datminmxid': 1,
 'dattablespace': 1663,
 'datcollate': 'C',
 'datctype': 'C',
 'daticulocale': 'und-x-icu',
 'datcollversion': '153.112',
 'datacl': None}

### Move Connection To New Database

In [45]:
run_query(sqlalchemy.text('SELECT current_database()'))

{'current_database': 'postgres'}

In [46]:
await async_run_query(sqlalchemy.text('SELECT current_database()'))

{'current_database': 'postgres'}

In [47]:
sync_pool.dispose()
sync_connector.close()
sync_connector = google.cloud.alloydb.connector.Connector()
sync_pool = get_sync_pool(sync_connector, ALLOYDB_DATABASE_NAME)

await async_pool.dispose()
await async_connector.close()
async_connector = google.cloud.alloydb.connector.AsyncConnector()
async_pool = await get_async_pool(async_connector, ALLOYDB_DATABASE_NAME)

In [48]:
run_query(sqlalchemy.text('SELECT current_database()'))

{'current_database': 'applied-genai'}

In [49]:
await async_run_query(sqlalchemy.text('SELECT current_database()'))

{'current_database': 'applied-genai'}

### Create Table

In [50]:
result = run_query(sqlalchemy.text(f"SELECT * from information_schema.tables WHERE table_name = '{ALLOYDB_TABLE_NAME}'"))
result

{'table_catalog': 'applied-genai',
 'table_schema': 'public',
 'table_name': 'retrieval-alloydb',
 'table_type': 'BASE TABLE',
 'self_referencing_column_name': None,
 'reference_generation': None,
 'user_defined_type_catalog': None,
 'user_defined_type_schema': None,
 'user_defined_type_name': None,
 'is_insertable_into': 'YES',
 'is_typed': 'NO',
 'commit_action': None}

In [51]:
run_query(sqlalchemy.text(f"DROP TABLE IF EXISTS \"{ALLOYDB_TABLE_NAME}\""))

[]

In [52]:
run_query(
    sqlalchemy.text(f"""
            CREATE TABLE IF NOT EXISTS \"{ALLOYDB_TABLE_NAME}\" (
                chunk_id VARCHAR(100) NOT NULL PRIMARY KEY,
                gse VARCHAR(50),
                content TEXT,
                embedding REAL[]
            );
        """
    )
)

[]
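The database and table names in this workflow contain hyphens, which is why every statement wraps them in escaped double quotes. A minimal sketch of a helper that applies PostgreSQL identifier quoting, where `quote_identifier` is a hypothetical name and not part of any library used here:

```python
def quote_identifier(name: str) -> str:
    """Wrap a name in double quotes, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


statement = f"SELECT COUNT(*) AS count FROM {quote_identifier('retrieval-alloydb')}"
print(statement)
```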

In [53]:
result = run_query(sqlalchemy.text(f"SELECT * from information_schema.tables WHERE table_name = '{ALLOYDB_TABLE_NAME}'"))
result

{'table_catalog': 'applied-genai',
 'table_schema': 'public',
 'table_name': 'retrieval-alloydb',
 'table_type': 'BASE TABLE',
 'self_referencing_column_name': None,
 'reference_generation': None,
 'user_defined_type_catalog': None,
 'user_defined_type_schema': None,
 'user_defined_type_name': None,
 'is_insertable_into': 'YES',
 'is_typed': 'NO',
 'commit_action': None}

In [54]:
run_query(sqlalchemy.text(f"SELECT column_name, data_type FROM information_schema.columns WHERE table_name = '{ALLOYDB_TABLE_NAME}'"))

[{'column_name': 'embedding', 'data_type': 'ARRAY'},
 {'column_name': 'chunk_id', 'data_type': 'character varying'},
 {'column_name': 'gse', 'data_type': 'character varying'},
 {'column_name': 'content', 'data_type': 'text'}]

### Add, Retrieve, And Delete Rows

#### Get A Record

Dictionaries for each record/row are stored in `content_chunks` from earlier in this workflow:

In [55]:
first_record = content_chunks[0]

In [56]:
first_record.keys()

dict_keys(['gse', 'chunk_id', 'content', 'embedding'])

In [57]:
first_record['chunk_id']

'fannie_part_0_c17'

#### Insert Row

In [58]:
table = sqlalchemy.Table(
    ALLOYDB_TABLE_NAME,
    sqlalchemy.MetaData(),
    autoload_with = sync_pool
)

In [59]:
for c in table.columns:
    print(c)

retrieval-alloydb.chunk_id
retrieval-alloydb.gse
retrieval-alloydb.content
retrieval-alloydb.embedding


In [60]:
insert_row = sqlalchemy.insert(table).values(first_record)

In [61]:
run_query(insert_row)

[]

#### Retrieve Row

In [62]:
query = sqlalchemy.text(f"SELECT * FROM \"{ALLOYDB_TABLE_NAME}\" WHERE chunk_id = '{first_record['chunk_id']}'")
result = run_query(query)

In [63]:
result.keys()

dict_keys(['chunk_id', 'gse', 'content', 'embedding'])

In [64]:
result['chunk_id']

'fannie_part_0_c17'

In [65]:
query = sqlalchemy.select(table).where(table.columns.chunk_id == first_record['chunk_id'])
result = run_query(query)

In [66]:
result.keys()

dict_keys(['chunk_id', 'gse', 'content', 'embedding'])

In [67]:
result['chunk_id']

'fannie_part_0_c17'

In [68]:
type(result['embedding'])

list

In [69]:
result['embedding'][0:10]

[0.031277116,
 0.03056905,
 0.010865348,
 0.062361468,
 0.032286815,
 0.050661553,
 0.046544693,
 0.055096656,
 -0.014074751,
 0.0083804]

#### Delete Row

In [70]:
run_query(sqlalchemy.text(f"SELECT COUNT(*) as count FROM \"{ALLOYDB_TABLE_NAME}\""))

{'count': 1}

In [71]:
run_query(sqlalchemy.text(f"DELETE FROM \"{ALLOYDB_TABLE_NAME}\" WHERE chunk_id = '{first_record['chunk_id']}'"))

[]

In [72]:
run_query(sqlalchemy.text(f"SELECT COUNT(*) as count FROM \"{ALLOYDB_TABLE_NAME}\""))

{'count': 0}

## Load Data 

There are many rows to load, so the async method is used here to submit the inserts concurrently:

In [73]:
queries = [sqlalchemy.insert(table).values(c) for c in content_chunks]

In [74]:
tasks = [async_run_query(query) for query in queries]

In [75]:
results = await asyncio.gather(*tasks)

In [76]:
run_query(sqlalchemy.text(f"SELECT COUNT(*) as count FROM \"{ALLOYDB_TABLE_NAME}\""))

{'count': 9040}
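The cells above create one task per row and gather them all at once. If the number of simultaneous requests ever becomes a concern, one option is to cap concurrency with `asyncio.Semaphore`. A minimal sketch under that assumption, where `fake_insert` is a hypothetical stand-in for `async_run_query` that only tracks how many calls overlap:

```python
import asyncio


async def fake_insert(i, state):
    """Stand-in for a real insert: records how many calls are in flight at once."""
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    await asyncio.sleep(0)  # yield control, as a real network call would
    state["active"] -= 1
    return i


async def bounded_gather(n_tasks, limit):
    """Run n_tasks inserts with at most `limit` of them in flight at any time."""
    semaphore = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}

    async def guarded(i):
        async with semaphore:
            return await fake_insert(i, state)

    results = await asyncio.gather(*(guarded(i) for i in range(n_tasks)))
    return results, state["peak"]


# In a notebook, use `await bounded_gather(...)` instead of `asyncio.run(...)`.
results, peak = asyncio.run(bounded_gather(n_tasks=50, limit=5))
print(len(results), peak)
```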

---
## Setup AlloyDB for Vector Similarity Search

To store embeddings as vectors and then index and match them, the database needs the following extensions.

**Store Embeddings As Vectors**

[Google provides](https://cloud.google.com/alloydb/docs/ai/store-embeddings#required-extension) a version of [`pgvector`](https://github.com/pgvector/pgvector#indexing) named `vector` that includes functions and operators for working with vector values.

```CREATE EXTENSION IF NOT EXISTS vector```

**Create Indexes For Vectors**

Indexing vectors allows for faster approximate search. The `vector` extension above includes the IVF, IVFFlat, and HNSW index types. The Google-developed [ScaNN index](https://github.com/google-research/google-research/blob/master/scann/docs/algorithms.md) can be added as an extension named `alloydb_scann`.

```CREATE EXTENSION IF NOT EXISTS alloydb_scann```


In [77]:
run_query(sqlalchemy.text(f"CREATE EXTENSION IF NOT EXISTS vector"))

[]

In [78]:
run_query(sqlalchemy.text(f"CREATE EXTENSION IF NOT EXISTS alloydb_scann"))

[]

### Convert `embedding` Column To Vector Data Type

The data was loaded/inserted above with the embedding stored in a column named 'embedding' as an ARRAY of float values.  This column can now be converted to the vector type with the specific dimension using an `ALTER TABLE` command.

In [79]:
run_query(sqlalchemy.text(f"ALTER TABLE \"{ALLOYDB_TABLE_NAME}\" ALTER COLUMN embedding TYPE vector({len(question_embedding)});"))

[]

In [80]:
run_query(sqlalchemy.text(f"SELECT column_name, data_type FROM information_schema.columns WHERE table_name = '{ALLOYDB_TABLE_NAME}'"))

[{'column_name': 'embedding', 'data_type': 'USER-DEFINED'},
 {'column_name': 'chunk_id', 'data_type': 'character varying'},
 {'column_name': 'gse', 'data_type': 'character varying'},
 {'column_name': 'content', 'data_type': 'text'}]

In [81]:
query = sqlalchemy.text(f"SELECT * FROM \"{ALLOYDB_TABLE_NAME}\" WHERE chunk_id = '{first_record['chunk_id']}'")
result = run_query(query)
result.keys()

dict_keys(['chunk_id', 'gse', 'content', 'embedding'])

In [82]:
type(result['embedding'])

str

In [83]:
result['embedding'][0:100]

'[0.031277116,0.03056905,0.010865348,0.062361468,0.032286815,0.050661553,0.046544693,0.055096656,-0.0'
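After the conversion, the `embedding` column comes back as text rather than a Python list. Since that text is a bracketed, comma-separated list of numbers, one way to recover a list is to parse it as JSON. A minimal sketch with a shortened, illustrative value:

```python
import json

# Shortened, illustrative text in the same form as the value returned above.
vector_text = "[0.031277116,0.03056905,0.010865348]"
values = json.loads(vector_text)
print(type(values), values)

# Going the other way: str() of a Python list of floats produces the same
# bracketed form, which is what the f-string queries in this notebook rely on.
literal = str([0.5, -0.25, 1.0])
print(literal)
```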

---
## Vector Similarity Search, Matching

- brute force, with and without prefiltering
- create indexes
- search with index, specify index
- search and let the optimizer pick the index
- detect which index is picked, with and without prefiltering
- search and force brute force even when index exists

### Brute Force Search - No Index

Easily run a brute force (compare to all rows) match with a choice of distance measure:
- `<->` for L2, Euclidean distance
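- `<#>` for inner (dot) product; `pgvector` returns the negative inner product, so ascending order puts the most similar rows first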
- `<=>` for Cosine distance
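The three measures are closely related when vectors have unit length: cosine distance equals one minus the dot product, and squared Euclidean distance equals two minus twice the dot product, so all three produce the same ranking. The outputs recorded in the cells below are consistent with embeddings that are close to unit length. A minimal sketch with two made-up unit vectors; the function names are hypothetical and only mirror what each operator computes:

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def negative_inner_product(u, v):
    """What `<#>` returns: the dot product with its sign flipped."""
    return -dot(u, v)


def l2_distance(u, v):
    """What `<->` returns: Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def cosine_distance(u, v):
    """What `<=>` returns: one minus cosine similarity."""
    return 1 - dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


u = [0.6, 0.8]  # unit length
v = [1.0, 0.0]  # unit length

print(negative_inner_product(u, v), l2_distance(u, v), cosine_distance(u, v))
```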

In [219]:
run_query(sqlalchemy.text(f"SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product FROM \"{ALLOYDB_TABLE_NAME}\" ORDER BY dot_product LIMIT 5"))

[{'chunk_id': 'fannie_part_0_c352', 'dot_product': -0.7099841833114624},
 {'chunk_id': 'freddie_part_4_c509', 'dot_product': -0.680526077747345},
 {'chunk_id': 'freddie_part_4_c510', 'dot_product': -0.6753296852111816},
 {'chunk_id': 'fannie_part_0_c353', 'dot_product': -0.6723706722259521},
 {'chunk_id': 'fannie_part_0_c326', 'dot_product': -0.66834956407547}]

In [220]:
run_query(sqlalchemy.text(f"SELECT chunk_id, embedding <-> '{question_embedding}' AS euclidean_distance FROM \"{ALLOYDB_TABLE_NAME}\" ORDER BY euclidean_distance LIMIT 5"))

[{'chunk_id': 'fannie_part_0_c352', 'euclidean_distance': 0.7615658337594855},
 {'chunk_id': 'freddie_part_4_c509', 'euclidean_distance': 0.7992875100367289},
 {'chunk_id': 'freddie_part_4_c510', 'euclidean_distance': 0.8057848660615564},
 {'chunk_id': 'fannie_part_0_c353', 'euclidean_distance': 0.8094337265330812},
 {'chunk_id': 'fannie_part_0_c326', 'euclidean_distance': 0.8144253147417732}]

In [221]:
run_query(sqlalchemy.text(f"SELECT chunk_id, embedding <=> '{question_embedding}' AS cosine_distance FROM \"{ALLOYDB_TABLE_NAME}\" ORDER BY cosine_distance LIMIT 5"))

[{'chunk_id': 'fannie_part_0_c352', 'cosine_distance': 0.2899983636254655},
 {'chunk_id': 'freddie_part_4_c509', 'cosine_distance': 0.31944424887732137},
 {'chunk_id': 'freddie_part_4_c510', 'cosine_distance': 0.3246529458452222},
 {'chunk_id': 'fannie_part_0_c353', 'cosine_distance': 0.32760391792511945},
 {'chunk_id': 'fannie_part_0_c326', 'cosine_distance': 0.33164633285935186}]

### Brute Force Search With Pre-Filtering - No Index

Extending a brute force match with pre-filtering means adding a `WHERE` clause so that only rows passing the filter are compared to the query:

Matches for rows where `gse = 'fannie'`:

In [222]:
run_query(sqlalchemy.text(f"SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product FROM \"{ALLOYDB_TABLE_NAME}\" WHERE gse = 'fannie' ORDER BY dot_product LIMIT 5"))

[{'chunk_id': 'fannie_part_0_c352', 'dot_product': -0.7099841833114624},
 {'chunk_id': 'fannie_part_0_c353', 'dot_product': -0.6723706722259521},
 {'chunk_id': 'fannie_part_0_c326', 'dot_product': -0.66834956407547},
 {'chunk_id': 'fannie_part_0_c92', 'dot_product': -0.6614338159561157},
 {'chunk_id': 'fannie_part_0_c240', 'dot_product': -0.6608578562736511}]

Matches for rows where `gse = 'freddie'`:

In [223]:
run_query(sqlalchemy.text(f"SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product FROM \"{ALLOYDB_TABLE_NAME}\" WHERE gse = 'freddie' ORDER BY dot_product LIMIT 5"))

[{'chunk_id': 'freddie_part_4_c509', 'dot_product': -0.680526077747345},
 {'chunk_id': 'freddie_part_4_c510', 'dot_product': -0.6753296852111816},
 {'chunk_id': 'freddie_part_4_c472', 'dot_product': -0.661984384059906},
 {'chunk_id': 'freddie_part_6_c439', 'dot_product': -0.6604534983634949},
 {'chunk_id': 'freddie_part_4_c558', 'dot_product': -0.6575403213500977}]
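Conceptually, the `WHERE` clause narrows the candidate rows before any distance is ranked. The following in-memory sketch shows that order of operations. The function name, the toy rows, and the toy embeddings are hypothetical; the real work happens inside the database.

```python
def prefiltered_top_k(rows, query, gse, k=5):
    """Keep only rows with the requested `gse`, then rank by negative dot product."""
    candidates = [r for r in rows if r["gse"] == gse]          # the WHERE step
    scored = [
        (r["chunk_id"], -sum(x * y for x, y in zip(r["embedding"], query)))
        for r in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1])[:k]        # ORDER BY ... LIMIT k

# Toy rows (made up for illustration).
rows = [
    {"chunk_id": "fannie_a", "gse": "fannie", "embedding": [1.0, 0.0]},
    {"chunk_id": "fannie_b", "gse": "fannie", "embedding": [0.0, 1.0]},
    {"chunk_id": "freddie_a", "gse": "freddie", "embedding": [0.8, 0.6]},
]
query = [0.6, 0.8]

fannie_matches = prefiltered_top_k(rows, query, "fannie", k=2)
print(fannie_matches)
```

Because the filter runs first, a row from the other group never appears in the result, even when it would have scored better than the filtered matches.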

### Create An Index

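Indexes make search across many rows more efficient by first matching partitions of rows and then comparing the query only to rows within the selected partitions. The trade-off is that results become approximate: a true nearest neighbor can be missed when it sits in a partition that was not searched. This section covers [creating indexes](https://cloud.google.com/alloydb/docs/ai/store-index-query-vectors?resource=scann) and using them in queries.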

- IVF: Inverted File Lists, a general approach where the quantization can be selected
    - partitions rows into lists, only searches the subset of lists closest to the query vector
    - fast build, low memory usage, slower query
    - can increase the number of lists searched at query time for greater recall
- IVFFlat: Inverted File Lists, specifically with flat quantization
    - partitions rows into lists, only searches the subset of lists closest to the query vector
    - fast build, low memory usage, slower query
    - can increase the number of lists searched at query time for greater recall
- [HNSW](https://arxiv.org/abs/1603.09320): Hierarchical Navigable Small World graphs
    - creates a multilayer graph
    - slower build, more memory, faster query
    - can increase the number of candidates in the search for greater recall
- ScaNN: Scalable Nearest Neighbors, [developed by Google](https://github.com/google-research/google-research/blob/master/scann/docs/algorithms.md)
    - tree-based quantization index
    - faster build with less memory than HNSW
    - faster query times
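To make the partition idea concrete, here is a toy inverted-file search in pure Python. It is a sketch of the general technique only, not how AlloyDB implements it: the centroids are given by hand rather than trained, and all names and vectors are made up for illustration.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_lists(vectors, centroids):
    """Assign every vector index to the list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(idx)
    return lists

def ivf_search(query, vectors, centroids, lists, probes=1, k=1):
    """Scan only the `probes` lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [idx for c in order[:probes] for idx in lists[c]]
    return sorted(candidates, key=lambda idx: l2(query, vectors[idx]))[:k]

vectors = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (0.55, 0.5)]
centroids = [(0.0, 0.0), (1.0, 1.0)]   # hand-picked, normally learned from the data
lists = build_lists(vectors, centroids)
query = (0.45, 0.45)

one_probe = ivf_search(query, vectors, centroids, lists, probes=1)
two_probes = ivf_search(query, vectors, centroids, lists, probes=2)
print(one_probe, two_probes)
```

With a single probe the search only scans the list nearest the query and returns vector 1, while the true nearest neighbor (vector 4) lives in the other list and is found only when both lists are probed. This is the recall versus speed trade-off that the probe settings control.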

#### IVF

https://cloud.google.com/alloydb/docs/ai/store-index-query-vectors?resource=ivf#create-index

In [283]:
run_query(sqlalchemy.text(f"""
    CREATE INDEX IF NOT EXISTS ivf_index
    ON \"{ALLOYDB_TABLE_NAME}\"
    USING ivf (embedding vector_ip_ops)
    WITH (lists = 100, quantizer = 'FLAT')
"""))

[]

In [284]:
run_query(sqlalchemy.text(f"SELECT * FROM pg_indexes  WHERE tablename = '{ALLOYDB_TABLE_NAME}' AND indexname = 'ivf_index'"))

{'schemaname': 'public',
 'tablename': 'retrieval-alloydb',
 'indexname': 'ivf_index',
 'tablespace': None,
 'indexdef': 'CREATE INDEX ivf_index ON public."retrieval-alloydb" USING ivf (embedding vector_ip_ops) WITH (lists=\'100\', quantizer=\'FLAT\')'}

In [285]:
run_query(sqlalchemy.text('SELECT * FROM pg_stat_progress_create_index'))

[]

In [286]:
run_query(sqlalchemy.text('SELECT * FROM pg_stat_ann_indexes'))

[]

In [287]:
run_query(sqlalchemy.text(f"""
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_0_c352', 'dot_product': -0.7099841833114624},
 {'chunk_id': 'fannie_part_0_c353', 'dot_product': -0.6723706722259521},
 {'chunk_id': 'fannie_part_0_c92', 'dot_product': -0.6614338159561157},
 {'chunk_id': 'fannie_part_0_c240', 'dot_product': -0.6608578562736511},
 {'chunk_id': 'fannie_part_2_c417', 'dot_product': -0.6559132933616638}]

In [288]:
result = run_query(sqlalchemy.text(f"""
    EXPLAIN ANALYZE
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    ORDER BY dot_product
    LIMIT 5
"""))
result[0:2] + result[-2:]

[{'QUERY PLAN': 'Limit  (cost=115.68..115.88 rows=5 width=27) (actual time=0.201..0.224 rows=5 loops=1)'},
 {'QUERY PLAN': '  ->  Index Scan using ivf_index on "retrieval-alloydb"  (cost=115.68..487.18 rows=9040 width=27) (actual time=0.199..0.222 rows=5 loops=1)'},
 {'QUERY PLAN': 'Planning Time: 0.076 ms'},
 {'QUERY PLAN': 'Execution Time: 0.242 ms'}]
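To detect which index the optimizer picked without reading the plan by eye, the rows returned by `EXPLAIN ANALYZE` can be scanned for an `Index Scan using ...` node. This is a small sketch: `index_used` is a hypothetical helper, and it assumes the plan rows are dictionaries keyed by `'QUERY PLAN'` as in the output above. It only recognizes plain index scan nodes with unquoted index names.

```python
import re

def index_used(plan_rows):
    """Return the name of the first index named in an 'Index Scan using' node, else None."""
    for row in plan_rows:
        match = re.search(r"Index Scan using (\S+) on", row["QUERY PLAN"])
        if match:
            return match.group(1)
    return None

# Plan rows in the shape returned by the EXPLAIN ANALYZE cell above.
plan = [
    {"QUERY PLAN": "Limit  (cost=115.68..115.88 rows=5 width=27) (actual time=0.201..0.224 rows=5 loops=1)"},
    {"QUERY PLAN": '  ->  Index Scan using ivf_index on "retrieval-alloydb"  (cost=115.68..487.18 rows=9040 width=27) (actual time=0.199..0.222 rows=5 loops=1)'},
    {"QUERY PLAN": "Planning Time: 0.076 ms"},
    {"QUERY PLAN": "Execution Time: 0.242 ms"},
]

print(index_used(plan))
```

A return value of `None` means no index scan node was found in the rows that were passed in, which suggests the query ran without the vector index.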

In [307]:
run_query(sqlalchemy.text(f"""
    SET LOCAL ivf.probes = 10;
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_0_c352', 'dot_product': -0.7099841833114624},
 {'chunk_id': 'freddie_part_4_c509', 'dot_product': -0.680526077747345},
 {'chunk_id': 'freddie_part_4_c510', 'dot_product': -0.6753296852111816},
 {'chunk_id': 'fannie_part_0_c353', 'dot_product': -0.6723706722259521},
 {'chunk_id': 'fannie_part_0_c326', 'dot_product': -0.66834956407547}]

In [308]:
result = run_query(sqlalchemy.text(f"""
    SET LOCAL ivf.probes = 10;
    EXPLAIN ANALYZE
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    ORDER BY dot_product
    LIMIT 5
"""))
result[0:2] + result[-2:]

[{'QUERY PLAN': 'Limit  (cost=1156.78..1158.24 rows=5 width=27) (actual time=1.007..1.031 rows=5 loops=1)'},
 {'QUERY PLAN': '  ->  Index Scan using ivf_index on "retrieval-alloydb"  (cost=1156.78..3788.42 rows=9040 width=27) (actual time=1.005..1.028 rows=5 loops=1)'},
 {'QUERY PLAN': 'Planning Time: 0.079 ms'},
 {'QUERY PLAN': 'Execution Time: 1.054 ms'}]
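The effect of raising `ivf.probes` can be quantified as recall against the brute force results. The sketch below copies the `chunk_id` values from the outputs above; `recall_at_k` is a hypothetical helper, not part of the notebook.

```python
def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact (brute force) matches that the index search also returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)

# Top 5 from the brute force dot product search earlier in this section.
brute_force = [
    "fannie_part_0_c352", "freddie_part_4_c509", "freddie_part_4_c510",
    "fannie_part_0_c353", "fannie_part_0_c326",
]
# Top 5 from the IVF index with default settings.
ivf_default = [
    "fannie_part_0_c352", "fannie_part_0_c353", "fannie_part_0_c92",
    "fannie_part_0_c240", "fannie_part_2_c417",
]
# Top 5 from the IVF index with ivf.probes = 10.
ivf_probes_10 = [
    "fannie_part_0_c352", "freddie_part_4_c509", "freddie_part_4_c510",
    "fannie_part_0_c353", "fannie_part_0_c326",
]

print(recall_at_k(brute_force, ivf_default))
print(recall_at_k(brute_force, ivf_probes_10))
```

For this one question, the default setting recovers two of the five exact matches, and ten probes recovers all five at the cost of a longer execution time in the plans above. A single query is only an illustration; a meaningful recall estimate needs many queries.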

In [289]:
run_query(sqlalchemy.text(f"""
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    WHERE gse = 'fannie'
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_0_c352', 'dot_product': -0.7099841833114624},
 {'chunk_id': 'fannie_part_0_c353', 'dot_product': -0.6723706722259521},
 {'chunk_id': 'fannie_part_0_c92', 'dot_product': -0.6614338159561157},
 {'chunk_id': 'fannie_part_0_c240', 'dot_product': -0.6608578562736511},
 {'chunk_id': 'fannie_part_2_c417', 'dot_product': -0.6559132933616638}]

In [290]:
result = run_query(sqlalchemy.text(f"""
    EXPLAIN ANALYZE
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    WHERE gse = 'fannie'
    ORDER BY dot_product
    LIMIT 5
"""))
result[0:2] + result[-2:]

[{'QUERY PLAN': 'Limit  (cost=115.68..116.27 rows=5 width=27) (actual time=0.183..0.208 rows=5 loops=1)'},
 {'QUERY PLAN': '  ->  Index Scan using ivf_index on "retrieval-alloydb"  (cost=115.68..472.38 rows=3032 width=27) (actual time=0.182..0.206 rows=5 loops=1)'},
 {'QUERY PLAN': 'Planning Time: 0.092 ms'},
 {'QUERY PLAN': 'Execution Time: 0.227 ms'}]

In [309]:
run_query(sqlalchemy.text('DROP INDEX IF EXISTS ivf_index'))

[]

#### IVFFlat

https://cloud.google.com/alloydb/docs/ai/store-index-query-vectors?resource=ivfflat#create-index

In [310]:
run_query(sqlalchemy.text(f"""
    CREATE INDEX IF NOT EXISTS ivfflat_index
    ON \"{ALLOYDB_TABLE_NAME}\"
    USING ivfflat (embedding vector_ip_ops)
    WITH (lists = 100)
"""))

[]

In [311]:
run_query(sqlalchemy.text(f"SELECT * FROM pg_indexes  WHERE tablename = '{ALLOYDB_TABLE_NAME}' AND indexname = 'ivfflat_index'"))

{'schemaname': 'public',
 'tablename': 'retrieval-alloydb',
 'indexname': 'ivfflat_index',
 'tablespace': None,
 'indexdef': 'CREATE INDEX ivfflat_index ON public."retrieval-alloydb" USING ivfflat (embedding vector_ip_ops) WITH (lists=\'100\')'}

In [312]:
run_query(sqlalchemy.text('SELECT * FROM pg_stat_progress_create_index'))

[]

In [313]:
run_query(sqlalchemy.text('SELECT * FROM pg_stat_ann_indexes'))

[]

In [314]:
run_query(sqlalchemy.text(f"""
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'freddie_part_4_c509', 'dot_product': -0.680526077747345},
 {'chunk_id': 'freddie_part_4_c510', 'dot_product': -0.6753296852111816},
 {'chunk_id': 'freddie_part_6_c410', 'dot_product': -0.6573532819747925},
 {'chunk_id': 'freddie_part_5_c360', 'dot_product': -0.6557323932647705},
 {'chunk_id': 'freddie_part_4_c508', 'dot_product': -0.6519314646720886}]

In [315]:
result = run_query(sqlalchemy.text(f"""
    EXPLAIN ANALYZE
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    ORDER BY dot_product
    LIMIT 5
"""))
result[0:2] + result[-2:]

[{'QUERY PLAN': 'Limit  (cost=118.18..118.38 rows=5 width=27) (actual time=0.181..0.204 rows=5 loops=1)'},
 {'QUERY PLAN': '  ->  Index Scan using ivfflat_index on "retrieval-alloydb"  (cost=118.18..489.68 rows=9040 width=27) (actual time=0.179..0.201 rows=5 loops=1)'},
 {'QUERY PLAN': 'Planning Time: 0.086 ms'},
 {'QUERY PLAN': 'Execution Time: 0.222 ms'}]

In [316]:
run_query(sqlalchemy.text(f"""
    SET LOCAL ivfflat.probes = 10;
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_0_c352', 'dot_product': -0.7099841833114624},
 {'chunk_id': 'freddie_part_4_c509', 'dot_product': -0.680526077747345},
 {'chunk_id': 'freddie_part_4_c510', 'dot_product': -0.6753296852111816},
 {'chunk_id': 'fannie_part_0_c353', 'dot_product': -0.6723706722259521},
 {'chunk_id': 'fannie_part_0_c326', 'dot_product': -0.66834956407547}]

In [317]:
result = run_query(sqlalchemy.text(f"""
    SET LOCAL ivfflat.probes = 10;
    EXPLAIN ANALYZE
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    ORDER BY dot_product
    LIMIT 5
"""))
result[0:2] + result[-2:]

[{'QUERY PLAN': 'Limit  (cost=1159.28..1160.74 rows=5 width=27) (actual time=0.858..0.882 rows=5 loops=1)'},
 {'QUERY PLAN': '  ->  Index Scan using ivfflat_index on "retrieval-alloydb"  (cost=1159.28..3790.92 rows=9040 width=27) (actual time=0.857..0.879 rows=5 loops=1)'},
 {'QUERY PLAN': 'Planning Time: 0.056 ms'},
 {'QUERY PLAN': 'Execution Time: 0.902 ms'}]

In [318]:
run_query(sqlalchemy.text(f"""
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    WHERE gse = 'fannie'
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_2_c793', 'dot_product': -0.6455321311950684},
 {'chunk_id': 'fannie_part_2_c795', 'dot_product': -0.6292608976364136},
 {'chunk_id': 'fannie_part_2_c792', 'dot_product': -0.6290532350540161},
 {'chunk_id': 'fannie_part_2_c791', 'dot_product': -0.6204110383987427},
 {'chunk_id': 'fannie_part_2_c798', 'dot_product': -0.6203723549842834}]

In [319]:
result = run_query(sqlalchemy.text(f"""
    EXPLAIN ANALYZE
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    WHERE gse = 'fannie'
    ORDER BY dot_product
    LIMIT 5
"""))
result[0:2] + result[-2:]

[{'QUERY PLAN': 'Limit  (cost=118.18..118.77 rows=5 width=27) (actual time=0.218..0.245 rows=5 loops=1)'},
 {'QUERY PLAN': '  ->  Index Scan using ivfflat_index on "retrieval-alloydb"  (cost=118.18..474.88 rows=3032 width=27) (actual time=0.217..0.243 rows=5 loops=1)'},
 {'QUERY PLAN': 'Planning Time: 0.087 ms'},
 {'QUERY PLAN': 'Execution Time: 0.263 ms'}]

In [320]:
run_query(sqlalchemy.text('DROP INDEX IF EXISTS ivfflat_index'))

[]

#### HNSW

https://cloud.google.com/alloydb/docs/ai/store-index-query-vectors?resource=hnsw#create-index

In [321]:
run_query(sqlalchemy.text(f"""
    CREATE INDEX IF NOT EXISTS hnsw_index
    ON \"{ALLOYDB_TABLE_NAME}\"
    USING hnsw (embedding vector_ip_ops)
    WITH (m = 10, ef_construction = 40)
"""))

[]

In [322]:
run_query(sqlalchemy.text(f"SELECT * FROM pg_indexes  WHERE tablename = '{ALLOYDB_TABLE_NAME}' AND indexname = 'hnsw_index'"))

{'schemaname': 'public',
 'tablename': 'retrieval-alloydb',
 'indexname': 'hnsw_index',
 'tablespace': None,
 'indexdef': 'CREATE INDEX hnsw_index ON public."retrieval-alloydb" USING hnsw (embedding vector_ip_ops) WITH (m=\'10\', ef_construction=\'40\')'}

In [323]:
run_query(sqlalchemy.text('SELECT * FROM pg_stat_progress_create_index'))

[]

In [324]:
run_query(sqlalchemy.text('SELECT * FROM pg_stat_ann_indexes'))

[]

In [325]:
run_query(sqlalchemy.text(f"""
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_0_c352', 'dot_product': -0.7099841833114624},
 {'chunk_id': 'freddie_part_4_c509', 'dot_product': -0.680526077747345},
 {'chunk_id': 'freddie_part_4_c510', 'dot_product': -0.6753296852111816},
 {'chunk_id': 'fannie_part_0_c353', 'dot_product': -0.6723706722259521},
 {'chunk_id': 'fannie_part_0_c326', 'dot_product': -0.66834956407547}]

In [326]:
result = run_query(sqlalchemy.text(f"""
    EXPLAIN ANALYZE
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    ORDER BY dot_product
    LIMIT 5
"""))
result[0:2] + result[-2:]

[{'QUERY PLAN': 'Limit  (cost=100.38..102.98 rows=5 width=27) (actual time=0.588..0.613 rows=5 loops=1)'},
 {'QUERY PLAN': '  ->  Index Scan using hnsw_index on "retrieval-alloydb"  (cost=100.38..4817.38 rows=9040 width=27) (actual time=0.586..0.610 rows=5 loops=1)'},
 {'QUERY PLAN': 'Planning Time: 0.076 ms'},
 {'QUERY PLAN': 'Execution Time: 0.632 ms'}]

In [327]:
run_query(sqlalchemy.text(f"""
    SET LOCAL hnsw.ef_search = 20;
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_0_c352', 'dot_product': -0.7099841833114624},
 {'chunk_id': 'freddie_part_4_c509', 'dot_product': -0.680526077747345},
 {'chunk_id': 'freddie_part_4_c510', 'dot_product': -0.6753296852111816},
 {'chunk_id': 'fannie_part_0_c353', 'dot_product': -0.6723706722259521},
 {'chunk_id': 'fannie_part_0_c326', 'dot_product': -0.66834956407547}]

In [328]:
result = run_query(sqlalchemy.text(f"""
    SET LOCAL hnsw.ef_search = 20;
    EXPLAIN ANALYZE
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    ORDER BY dot_product
    LIMIT 5
"""))
result[0:2] + result[-2:]

[{'QUERY PLAN': 'Limit  (cost=100.38..102.98 rows=5 width=27) (actual time=0.356..0.379 rows=5 loops=1)'},
 {'QUERY PLAN': '  ->  Index Scan using hnsw_index on "retrieval-alloydb"  (cost=100.38..4817.38 rows=9040 width=27) (actual time=0.354..0.376 rows=5 loops=1)'},
 {'QUERY PLAN': 'Planning Time: 0.076 ms'},
 {'QUERY PLAN': 'Execution Time: 0.400 ms'}]

In [329]:
run_query(sqlalchemy.text(f"""
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    WHERE gse = 'fannie'
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_0_c352', 'dot_product': -0.7099841833114624},
 {'chunk_id': 'fannie_part_0_c353', 'dot_product': -0.6723706722259521},
 {'chunk_id': 'fannie_part_0_c326', 'dot_product': -0.66834956407547},
 {'chunk_id': 'fannie_part_0_c92', 'dot_product': -0.6614338159561157},
 {'chunk_id': 'fannie_part_0_c240', 'dot_product': -0.6608578562736511}]

In [330]:
result = run_query(sqlalchemy.text(f"""
    EXPLAIN ANALYZE
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    WHERE gse = 'fannie'
    ORDER BY dot_product
    LIMIT 5
"""))
result[0:2] + result[-2:]

[{'QUERY PLAN': 'Limit  (cost=100.38..108.17 rows=5 width=27) (actual time=0.585..0.618 rows=5 loops=1)'},
 {'QUERY PLAN': '  ->  Index Scan using hnsw_index on "retrieval-alloydb"  (cost=100.38..4824.95 rows=3032 width=27) (actual time=0.584..0.616 rows=5 loops=1)'},
 {'QUERY PLAN': 'Planning Time: 0.077 ms'},
 {'QUERY PLAN': 'Execution Time: 0.638 ms'}]

In [331]:
run_query(sqlalchemy.text('DROP INDEX IF EXISTS hnsw_index'))

[]

#### ScaNN

https://cloud.google.com/alloydb/docs/ai/store-index-query-vectors?resource=scann#create-index

In [332]:
run_query(sqlalchemy.text(f"""
    CREATE INDEX IF NOT EXISTS scann_index
    ON \"{ALLOYDB_TABLE_NAME}\"
    USING scann (embedding dot_product)
    WITH (num_leaves = 100)
"""))

[]

In [333]:
run_query(sqlalchemy.text(f"SELECT * FROM pg_indexes  WHERE tablename = '{ALLOYDB_TABLE_NAME}' AND indexname = 'scann_index'"))

{'schemaname': 'public',
 'tablename': 'retrieval-alloydb',
 'indexname': 'scann_index',
 'tablespace': None,
 'indexdef': 'CREATE INDEX scann_index ON public."retrieval-alloydb" USING scann (embedding dot_product) WITH (num_leaves=\'100\')'}

In [334]:
run_query(sqlalchemy.text('SELECT * FROM pg_stat_progress_create_index'))

[]

In [335]:
run_query(sqlalchemy.text('SELECT * FROM pg_stat_ann_indexes'))

{'relid': 71137,
 'indexrelid': 90890,
 'schemaname': 'public',
 'relname': 'retrieval-alloydb',
 'indexrelname': 'scann_index',
 'indextype': 'scann',
 'indexconfig': ['num_leaves=100'],
 'indexsize': '2120 kB',
 'indexscan': 0,
 'insertcount': 9041,
 'deletecount': 1,
 'updatecount': 0,
 'partitioncount': 100,
 'distribution': {'average': 165.5,
  'maximum': 628,
  'minimum': 37,
  'outliers': [628, 623, 561, 527, 407, 403, 382, 341, 337, 317]}}

In [336]:
run_query(sqlalchemy.text(f"""
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_0_c352', 'dot_product': -0.7099841833114624},
 {'chunk_id': 'fannie_part_0_c92', 'dot_product': -0.6614338159561157},
 {'chunk_id': 'fannie_part_0_c335', 'dot_product': -0.6521918177604675},
 {'chunk_id': 'freddie_part_4_c472', 'dot_product': -0.661984384059906},
 {'chunk_id': 'freddie_part_5_c360', 'dot_product': -0.6557323932647705}]

In [337]:
result = run_query(sqlalchemy.text(f"""
    EXPLAIN ANALYZE
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    ORDER BY dot_product
    LIMIT 5
"""))
result[0:2] + result[-2:]

[{'QUERY PLAN': 'Limit  (cost=2.51..6.52 rows=1 width=27) (actual time=0.173..0.210 rows=5 loops=1)'},
 {'QUERY PLAN': '  ->  Index Scan using scann_index on "retrieval-alloydb"  (cost=2.51..6.52 rows=1 width=27) (actual time=0.170..0.206 rows=5 loops=1)'},
 {'QUERY PLAN': 'Planning Time: 0.093 ms'},
 {'QUERY PLAN': 'Execution Time: 0.238 ms'}]

In [340]:
run_query(sqlalchemy.text(f"""
    SET LOCAL scann.num_leaves_to_search = 2;
    SET LOCAL scann.pre_reordering_num_neighbors=50;
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_0_c352', 'dot_product': -0.7099841833114624},
 {'chunk_id': 'freddie_part_4_c509', 'dot_product': -0.680526077747345},
 {'chunk_id': 'freddie_part_4_c510', 'dot_product': -0.6753296852111816},
 {'chunk_id': 'fannie_part_0_c353', 'dot_product': -0.6723706722259521},
 {'chunk_id': 'fannie_part_0_c326', 'dot_product': -0.66834956407547}]

In [341]:
result = run_query(sqlalchemy.text(f"""
    SET LOCAL scann.num_leaves_to_search = 2;
    SET LOCAL scann.pre_reordering_num_neighbors=50;
    EXPLAIN ANALYZE
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    ORDER BY dot_product
    LIMIT 5
"""))
result[0:2] + result[-2:]

[{'QUERY PLAN': 'Limit  (cost=2.51..6.52 rows=1 width=27) (actual time=0.426..0.439 rows=5 loops=1)'},
 {'QUERY PLAN': '  ->  Index Scan using scann_index on "retrieval-alloydb"  (cost=2.51..6.52 rows=1 width=27) (actual time=0.424..0.436 rows=5 loops=1)'},
 {'QUERY PLAN': 'Planning Time: 0.086 ms'},
 {'QUERY PLAN': 'Execution Time: 0.471 ms'}]

In [342]:
run_query(sqlalchemy.text(f"""
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    WHERE gse = 'fannie'
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_0_c352', 'dot_product': -0.7099841833114624},
 {'chunk_id': 'fannie_part_0_c353', 'dot_product': -0.6723706722259521},
 {'chunk_id': 'fannie_part_0_c326', 'dot_product': -0.66834956407547},
 {'chunk_id': 'fannie_part_0_c92', 'dot_product': -0.6614338159561157},
 {'chunk_id': 'fannie_part_0_c240', 'dot_product': -0.6608578562736511}]

In [343]:
result = run_query(sqlalchemy.text(f"""
    EXPLAIN ANALYZE
    SELECT chunk_id, embedding <#> '{question_embedding}' AS dot_product
    FROM \"{ALLOYDB_TABLE_NAME}\"
    WHERE gse = 'fannie'
    ORDER BY dot_product
    LIMIT 5
"""))
result[0:2] + result[-2:]

[{'QUERY PLAN': 'Limit  (cost=1151.01..1151.02 rows=1 width=27) (actual time=11.990..11.992 rows=5 loops=1)'},
 {'QUERY PLAN': '  ->  Sort  (cost=1151.01..1151.02 rows=1 width=27) (actual time=11.989..11.990 rows=5 loops=1)'},
 {'QUERY PLAN': 'Planning Time: 0.083 ms'},
 {'QUERY PLAN': 'Execution Time: 12.015 ms'}]
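The filtered ScaNN query above shows a `Sort` node where the unfiltered query showed `Index Scan using scann_index`, along with a much longer execution time. A small helper can summarize this from the plan rows. This is a sketch with a hypothetical function name, and it only sees the rows kept by `result[0:2] + result[-2:]`, so whatever sits below the `Sort` node is not visible to it.

```python
import re

def summarize_plan(plan_rows):
    """Return (first child node description, execution time in ms) from EXPLAIN ANALYZE rows."""
    node, elapsed = None, None
    for row in plan_rows:
        line = row["QUERY PLAN"]
        node_match = re.search(r"->\s+(.*?)\s+\(cost", line)
        if node_match and node is None:
            node = node_match.group(1)
        time_match = re.search(r"Execution Time: ([\d.]+) ms", line)
        if time_match:
            elapsed = float(time_match.group(1))
    return node, elapsed

# Rows copied from the two ScaNN EXPLAIN ANALYZE outputs above.
unfiltered = [
    {"QUERY PLAN": "Limit  (cost=2.51..6.52 rows=1 width=27) (actual time=0.173..0.210 rows=5 loops=1)"},
    {"QUERY PLAN": '  ->  Index Scan using scann_index on "retrieval-alloydb"  (cost=2.51..6.52 rows=1 width=27) (actual time=0.170..0.206 rows=5 loops=1)'},
    {"QUERY PLAN": "Planning Time: 0.093 ms"},
    {"QUERY PLAN": "Execution Time: 0.238 ms"},
]
filtered = [
    {"QUERY PLAN": "Limit  (cost=1151.01..1151.02 rows=1 width=27) (actual time=11.990..11.992 rows=5 loops=1)"},
    {"QUERY PLAN": "  ->  Sort  (cost=1151.01..1151.02 rows=1 width=27) (actual time=11.989..11.990 rows=5 loops=1)"},
    {"QUERY PLAN": "Planning Time: 0.083 ms"},
    {"QUERY PLAN": "Execution Time: 12.015 ms"},
]

print(summarize_plan(unfiltered))
print(summarize_plan(filtered))
```

The filtered results also match the brute force pre-filtered results from earlier in this section exactly, which is consistent with the index not being used for that query.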

In [344]:
run_query(sqlalchemy.text('DROP INDEX IF EXISTS scann_index'))

[]

In [None]:
### Search And Override Index - Brute Force Search
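One possible approach, sketched here under assumptions rather than verified against this AlloyDB instance: standard PostgreSQL exposes the planner setting `enable_indexscan`, and turning it off for the current transaction discourages the planner from using an index scan. The helper below is hypothetical; it only builds the SQL text in the same multi-statement style used by the cells above, and the embedding literal is a placeholder.

```python
def brute_force_query(table_name, embedding_literal, limit=5):
    """Build SQL that turns off index scans for this transaction, then runs the match.

    Assumption: the planner setting `enable_indexscan` also applies to the vector
    indexes created in this notebook. Confirm with EXPLAIN ANALYZE before relying on it.
    """
    return f"""
    SET LOCAL enable_indexscan = off;
    SELECT chunk_id, embedding <#> '{embedding_literal}' AS dot_product
    FROM "{table_name}"
    ORDER BY dot_product
    LIMIT {int(limit)}
    """

# Placeholder vector literal; in the notebook this would be the question embedding.
sql = brute_force_query("retrieval-alloydb", "[0.1,0.2,0.3]")
print(sql)
```

The resulting string could be passed to `run_query(sqlalchemy.text(sql))` like the other cells. Prefixing the `SELECT` with `EXPLAIN ANALYZE` should then show whether an index scan node is still present in the plan.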

## Remove Resources

In [143]:
# can't drop the database of an active connection, switch connection to postgres (default) database
#query = sqlalchemy.text(f"DROP DATABASE IF EXISTS \"{ALLOYDB_DATABASE_NAME}\"")
#run_query(query)

In [56]:
#delete_instance = alloydb_client.delete_instance(name = alloydb_instance.name)
#delete_instance.result()

<google.api_core.operation.Operation at 0x7f918e7714e0>

In [60]:
#delete_cluster = alloydb_client.delete_cluster(request = dict(name = alloydb_cluster.name, force = True))
#delete_cluster.result()

<google.api_core.operation.Operation at 0x7f918e6dd0f0>