<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Retrieval/Retrieval%20-%20Cloud%20SQL%20For%20MySQL.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FApplied%2520GenAI%2FRetrieval%2FRetrieval%2520-%2520Cloud%2520SQL%2520For%2520MySQL.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Retrieval/Retrieval%20-%20Cloud%20SQL%20For%20MySQL.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Applied%20GenAI/Retrieval/Retrieval%20-%20Cloud%20SQL%20For%20MySQL.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# Retrieval - Cloud SQL For MySQL

In prior workflows, a series of documents was [processed into chunks](../Chunking/readme.md), and for each chunk, [embeddings](../Embeddings/readme.md) were created:

- Process: [Large Document Processing - Document AI Layout Parser](../Chunking/Large%20Document%20Processing%20-%20Document%20AI%20Layout%20Parser.ipynb)
- Embed: [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb)

Retrieving chunks for a query involves calculating the embedding for the query and then using similarity metrics to find relevant chunks. A thorough review of similarity matching can be found in [The Math of Similarity](../Embeddings/The%20Math%20of%20Similarity.ipynb) - use dot product! As development moves from experiment to application, the process of storing and computing similarity is migrated to a [retrieval](./readme.md) system. This workflow is part of a [series of workflows exploring many retrieval systems](./readme.md).  

A detailed [comparison of many retrieval systems](./readme.md#comparison-of-vector-database-solutions) can be found in the readme as well.

---

**Cloud SQL For MySQL For Storage, Indexing, And Search**

[Cloud SQL for MySQL](https://cloud.google.com/sql/docs/mysql) is a fully managed relational database service on Google Cloud that offers compatibility with the popular MySQL open-source database. It provides high availability, scalability, and security for your applications.

- **Key Features:**

    - **Fully Managed:** Cloud SQL takes care of database management tasks, including patching, backups, and high availability, allowing you to focus on your applications.
    - **Scalability:** You can easily scale your Cloud SQL instances to accommodate growing data and traffic demands.
    - **Security:** Cloud SQL provides robust security features, including data encryption and network security, to protect your data.
    - **MySQL Compatibility:** Cloud SQL is compatible with standard MySQL, making it easy to migrate existing applications or use familiar tools and frameworks.

- **Vector Search in MySQL:**

    - Cloud SQL for MySQL supports [vector search functionality](https://cloud.google.com/sql/docs/mysql/work-with-vectors), allowing you to store and query vector embeddings directly in your database. This includes [creating indexes](https://cloud.google.com/sql/docs/mysql/work-with-vectors#work-with-vectors) for efficient approximate nearest neighbor (ANN) searches.

---

**Use Case Data**

Buying a home usually involves borrowing money from a lending institution, typically through a mortgage secured by the home's value. But how do these institutions manage the risks associated with such large loans, and how are lending standards established?

In the United States, two government-sponsored enterprises (GSEs) play a vital role in the housing market:

- Federal National Mortgage Association ([Fannie Mae](https://www.fanniemae.com/))
- Federal Home Loan Mortgage Corporation ([Freddie Mac](https://www.freddiemac.com/))

These GSEs purchase mortgages from lenders, enabling those lenders to offer more loans. This process also allows Fannie Mae and Freddie Mac to set standards for mortgages, ensuring they are responsible and borrowers are more likely to repay them. This system makes homeownership more affordable and stabilizes the housing market by maintaining a steady flow of liquidity for lenders and keeping interest rates controlled.

However, navigating the complexities of these GSEs and their extensive servicing guides can be challenging.

**Approaches**

[This series](../readme.md) covers many generative AI workflows. These documents are used directly as long context for Gemini in the workflow [Long Context Retrieval With The Vertex AI Gemini API](../Generate/Long%20Context%20Retrieval%20With%20The%20Vertex%20AI%20Gemini%20API.ipynb). The workflow below uses a [retrieval](./readme.md) approach with the already generated chunks and embeddings.

---
## Colab Setup

When running this notebook in [Colab](https://colab.google/) or [Colab Enterprise](https://cloud.google.com/colab/docs/introduction), this section will authenticate to GCP (follow prompts in the popup) and set the current project for the session.

In [6]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [7]:
try:
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs and API Enablement

The client packages may need to be installed in this environment.

### Installs (If Needed)

In [8]:
# tuples of (import name, install name, min_version)
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform', '1.69.0'),
    ('google.cloud.sql', 'cloud-sql-python-connector'),
    ('sqlalchemy', 'sqlalchemy'),
    ('pymysql', 'pymysql')
]

import importlib.util, importlib.metadata
from packaging.version import Version # compares versions numerically, not as strings

install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # look up the installed version by the install (distribution) name
        if Version(importlib.metadata.version(package[1])) < Version(package[2]):
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user

### API Enablement

In [9]:
!gcloud services enable aiplatform.googleapis.com
!gcloud services enable sqladmin.googleapis.com

### Restart Kernel (If Installs Occurred)

After the kernel restarts, resume running the notebook from the cell after this one.

In [10]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
    IPython.display.display(IPython.display.Markdown("""<div class=\"alert alert-block alert-warning\">
        <b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. The previous cells do not need to be run again⚠️</b>
        </div>"""))

---
## Setup

Inputs

In [11]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [12]:
REGION = 'us-central1'
SERIES = 'applied-genai'
EXPERIMENT = 'retrieval-cloudsql-mysql'

# Cloud SQL Names
CLOUDSQL_INSTANCE_NAME = EXPERIMENT
CLOUDSQL_DATABASE_NAME = SERIES.replace('-', '_')
CLOUDSQL_TABLE_NAME = EXPERIMENT.replace('-', '_')

CLOUDSQL_USER = 'test_db'
CLOUDSQL_PASS = 'test_db_pass'

Packages

In [13]:
import os, json, time, glob, datetime, copy, shutil
import concurrent.futures

import numpy as np

# Vertex AI
from google.cloud import aiplatform
import vertexai.language_models # for embeddings API
import vertexai.generative_models # for Gemini Models
from vertexai.resources.preview import feature_store

# Cloud SQL
import google.cloud.sql.connector
import sqlalchemy
import pymysql

In [14]:
aiplatform.__version__

'1.71.0'

Clients

In [15]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)

---
## Text & Embeddings For Examples

This repository contains a [section for document processing (chunking)](../Chunking/readme.md) that includes an example of processing multiple large PDFs (over 1000 pages) into chunks: [Large Document Processing - Document AI Layout Parser](../Chunking/Large%20Document%20Processing%20-%20Document%20AI%20Layout%20Parser.ipynb).  The chunks of text from that workflow are stored with this repository and loaded by another companion workflow that augments the chunks with text embeddings: [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb).

The following code will load the version of the chunks that includes text embeddings and prepare it for a local example of retrieval augmented generation.

### Get The Documents

If you are working from a clone of this notebook's [repository](https://github.com/statmike/vertex-ai-mlops) then the documents are already present. The following cell checks for the documents folder and, if it is missing, retrieves it with `git clone`:

In [11]:
local_dir = '../Embeddings/files/embeddings-api'

In [12]:
if not os.path.exists(local_dir):
    print('Retrieving documents...')
    parent_dir = os.path.dirname(local_dir)
    temp_dir = os.path.join(parent_dir, 'temp')
    if not os.path.exists(temp_dir):
        os.makedirs(temp_dir)
    !git clone https://www.github.com/statmike/vertex-ai-mlops {temp_dir}/vertex-ai-mlops
    shutil.copytree(f'{temp_dir}/vertex-ai-mlops/Applied GenAI/Embeddings/files/embeddings-api', local_dir)
    shutil.rmtree(temp_dir)
    print(f'Documents are now in folder `{local_dir}`')
else:
    print(f'Documents Found in folder `{local_dir}`')             

Documents Found in folder `../Embeddings/files/embeddings-api`


### Load The Chunks

In [13]:
jsonl_files = glob.glob(f"{local_dir}/large-files*.jsonl")
jsonl_files.sort()
jsonl_files

['../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0000.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0001.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0002.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0003.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0004.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0005.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0006.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0007.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0008.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0009.jsonl']

In [14]:
chunks = []
for file in jsonl_files:
    with open(file, 'r') as f:
        chunks.extend([json.loads(line) for line in f])
len(chunks)

9040

### Review A Chunk

In [15]:
chunks[0].keys()

dict_keys(['instance', 'predictions', 'status'])

In [16]:
chunks[0]['instance']['chunk_id']

'fannie_part_0_c17'

In [17]:
print(chunks[0]['instance']['content'])

# Selling Guide Fannie Mae Single Family

## Fannie Mae Copyright Notice

### Fannie Mae Copyright Notice

|-|
| Section B3-4.2, Verification of Depository Assets 402 |
| B3-4.2-01, Verification of Deposits and Assets (05/04/2022) 403 |
| B3-4.2-02, Depository Accounts (12/14/2022) 405 |
| B3-4.2-03, Individual Development Accounts (02/06/2019) 408 |
| B3-4.2-04, Pooled Savings (Community Savings Funds) (04/01/2009) 411 |
| B3-4.2-05, Foreign Assets (05/04/2022) 411 |
| Section B3-4.3, Verification of Non-Depository Assets 412 |
| B3-4.3-01, Stocks, Stock Options, Bonds, and Mutual Funds (06/30/2015) 412 |
| B3-4.3-02, Trust Accounts (04/01/2009) 413 |
| B3-4.3-03, Retirement Accounts (06/30/2015) 414 |
| B3-4.3-04, Personal Gifts (09/06/2023) 415 |
| B3-4.3-05, Gifts of Equity (10/07/2020) 418 |
| B3-4.3-06, Grants and Lender Contributions (12/14/2022) 419 |
| B3-4.3-07, Disaster Relief Grants or Loans (04/01/2009) 423 |
| B3-4.3-08, Employer Assistance (09/29/2015) 423 |
| B3-4.3-09,

In [18]:
chunks[0]['predictions'][0]['embeddings']['values'][0:10]

[0.031277116388082504,
 0.03056905046105385,
 0.010865348391234875,
 0.0623614676296711,
 0.03228681534528732,
 0.05066155269742012,
 0.046544693410396576,
 0.05509665608406067,
 -0.014074751175940037,
 0.008380400016903877]

### Prepare Chunk Structure

Make a list of dictionaries with information for each chunk:

In [19]:
content_chunks = [
    dict(
        gse = chunk['instance']['gse'],
        chunk_id = chunk['instance']['chunk_id'],
        content = chunk['instance']['content'],
        embedding = chunk['predictions'][0]['embeddings']['values']
    ) for chunk in chunks
]

### Query Embedding

Create a query, or prompt, and get the embedding for it:

Connect to models for text embeddings. Learn more about the model API:
- [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb)

In [16]:
question = "Does a lender have to perform servicing functions directly?"

In [17]:
embedder = vertexai.language_models.TextEmbeddingModel.from_pretrained('text-embedding-004')

In [18]:
question_embedding = embedder.get_embeddings([question])[0].values
question_embedding[0:10]

[-0.0005117303808219731,
 0.009651427157223225,
 0.01768726110458374,
 0.014538003131747246,
 -0.01829824410378933,
 0.027877431362867355,
 -0.021124685183167458,
 0.008830446749925613,
 -0.02669006586074829,
 0.06414774805307388]
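
Before storage and search move into a database, the matching step itself can be sketched locally. The following is a minimal illustration, not the method used later in this workflow: it ranks a few made-up 3-dimensional vectors against a made-up query by dot product with NumPy. The names `chunk_ids`, `chunk_embeddings`, `query_embedding`, and `top_k_dot_product` are assumptions for this sketch only.

```python
import numpy as np

# Toy stand-ins for the notebook's data: 4 chunk embeddings with 3 dimensions
# (the real embeddings in this workflow are much longer).
chunk_ids = ['c0', 'c1', 'c2', 'c3']
chunk_embeddings = np.array([
    [0.1, 0.9, 0.0],
    [0.8, 0.1, 0.1],
    [0.0, 0.2, 0.9],
    [0.7, 0.6, 0.1],
])
query_embedding = np.array([0.9, 0.2, 0.0])

def top_k_dot_product(query, matrix, k = 2):
    """Return (row indexes, scores) of the k rows with the largest dot product."""
    scores = matrix @ query                 # one dot product per row
    order = np.argsort(scores)[::-1][:k]    # row indexes by descending score
    return order, scores[order]

indexes, scores = top_k_dot_product(query_embedding, chunk_embeddings, k = 2)
for i, s in zip(indexes, scores):
    print(chunk_ids[i], round(float(s), 3))
```

With the data loaded above, the same function could be applied to `np.array([c['embedding'] for c in content_chunks])` and `np.array(question_embedding)`. This brute-force scan compares the query to every row, which is the work a retrieval system takes over.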

---
## Setup: Cloud SQL For MySQL

Cloud SQL For MySQL is a managed instance of [MySQL](https://www.mysql.com/) that is easy to deploy and use.  This workflow creates an instance, configures it, and then loads and uses data.  While failover replicas and read-only replicas are possible, this example uses a single instance with the minimum possible specifications.  The choices made when creating an instance fall into categories called [editions](https://cloud.google.com/sql/docs/mysql/editions-intro).  The configuration choices also determine the overall cost of running the instance - [see pricing](https://cloud.google.com/sql/pricing#mysql-pg-pricing).  This workflow uses a minimal configuration for testing purposes to keep the cost of this example very small.  At the end of this notebook is a section that can be used to shut down and delete the resources that incur ongoing costs.

### Create/Retrieve Instance

The starting point for using Cloud SQL for MySQL is creating an instance.  There is not a Python admin client for Cloud SQL, so `gcloud` CLI commands are run from this notebook by prefixing them with `!`, which executes them in the underlying system shell.

The documentation can be referenced for:
- [Create instances](https://cloud.google.com/sql/docs/mysql/create-instance)
- [Instance settings](https://cloud.google.com/sql/docs/mysql/instance-settings)
- [Cloud SDK (gcloud) CLI For SQL instances](https://cloud.google.com/sdk/gcloud/reference/sql/instances)

When creating an instance the `--database-version` can be set, and there is a [minimum version requirement](https://cloud.google.com/sql/docs/mysql/work-with-vectors#before-you-begin) for the vector storage, indexing, and retrieval functions. Choose a version that meets this requirement.

In [19]:
list_instances = !gcloud sql instances list --format="json(name)"
list_instances = json.loads(''.join(list_instances))

if CLOUDSQL_INSTANCE_NAME in [i['name'] for i in list_instances]:
    instance_describe = !gcloud sql instances describe $CLOUDSQL_INSTANCE_NAME --format=json
    instance_describe = json.loads(''.join(instance_describe))
    print(f"Found the instance: {instance_describe['name']}")
else:
    print('Creating an instance...')
    instance_create = !gcloud sql instances create $CLOUDSQL_INSTANCE_NAME \
        --database-version=MYSQL_8_0_36 \
        --tier=db-g1-small \
        --region=us-central1 \
        --quiet
    instance_describe = !gcloud sql instances describe $CLOUDSQL_INSTANCE_NAME --format=json
    instance_describe = json.loads(''.join(instance_describe))
    print(f"Created the instance: {instance_describe['name']}")

Found the instance: retrieval-cloudsql-mysql


In [45]:
#instance_describe

### Create User For This Workflow

We need a user account to log in and use the MySQL instance.  In production this should be taken very seriously and access should be configured carefully to protect the environment.  This example creates a user with a password for demonstration and testing.  Read more about controlling access in [About MySQL users](https://cloud.google.com/sql/docs/mysql/users).

**Reference:**
- [Cloud SDK (gcloud) CLI for SQL users](https://cloud.google.com/sdk/gcloud/reference/sql/users)

In [21]:
list_users = !gcloud sql users list --format="json" --instance=$CLOUDSQL_INSTANCE_NAME
list_users = json.loads(''.join(list_users))

if CLOUDSQL_USER in [i['name'] for i in list_users]:
    user_describe = !gcloud sql users describe $CLOUDSQL_USER --instance=$CLOUDSQL_INSTANCE_NAME --format=json
    user_describe = json.loads(''.join(user_describe))
    print(f"Found the user: {user_describe['name']}")
else:
    print('Creating the user...')
    user_create = !gcloud sql users create $CLOUDSQL_USER \
        --instance=$CLOUDSQL_INSTANCE_NAME \
        --password=$CLOUDSQL_PASS
    user_describe = !gcloud sql users describe $CLOUDSQL_USER --instance=$CLOUDSQL_INSTANCE_NAME --format=json
    user_describe = json.loads(''.join(user_describe))
    print(f"Created the user: {user_describe['name']}")

Found the user: test_db


In [22]:
#user_describe

### Connect To Database

Cloud SQL For MySQL has default databases, [like mysql](https://cloud.google.com/sql/docs/mysql/create-manage-databases#gcloud).

There are many ways to [connect to a database](https://cloud.google.com/sql/docs/mysql/connect-overview) depending on where and how you need to connect.

Here we want to use Python and will use the convenient [Cloud SQL Language Connectors](https://cloud.google.com/sql/docs/mysql/language-connectors).

That means we will [create a connector](https://cloud.google.com/sql/docs/mysql/connect-connectors#python) and interact with the database through the connector.  The setup has three parts:
- a **connection tool**, in this case provided by [Cloud SQL Language Connectors](https://github.com/GoogleCloudPlatform/cloud-sql-python-connector)
- a database driver that the connector uses to open each **connection**, [pymysql](https://github.com/PyMySQL/PyMySQL)
- a client library that manages a **connection pool** and can **orchestrate SQL queries**, [SQLAlchemy](https://www.sqlalchemy.org/)

#### Connection Tool

In [23]:
sync_connector = google.cloud.sql.connector.Connector()

#### Connection

In [24]:
def get_sync_conn(
    connector: google.cloud.sql.connector.Connector,
    db: str
):
    def getconn():
        conn = connector.connect(
            instance_describe['connectionName'],
            "pymysql",
            user = CLOUDSQL_USER,
            password = CLOUDSQL_PASS,
            db = db
        )
        return conn
    return getconn

#### Connection Pool

In [25]:
def get_sync_pool(
    connector: google.cloud.sql.connector.Connector,
    db: str
) -> sqlalchemy.engine.Engine:

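    pool = sqlalchemy.create_engine(
        "mysql+pymysql://",
        creator = get_sync_conn(connector, db),
        # set on the engine itself: Engine.execution_options() returns a copy,
        # so calling it without keeping the result has no effect
        isolation_level = "AUTOCOMMIT"
    )
    pool.dialect.description_encoding = None
    return pool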

In [26]:
sync_pool = get_sync_pool(sync_connector, 'mysql')

#### Orchestrate SQL Queries

Use a connection from the pool as a context manager to orchestrate queries:

In [27]:
def run_query(query, pool = None, connector = sync_connector):
    # get the current connection pool:
    if pool is None:
        pool = sync_pool

    # run the query and collect the rows while the connection is still open
    rows = []
    with pool.connect().execution_options(isolation_level="AUTOCOMMIT") as connection:
        result = connection.execute(query)
        # statements such as INSERT, DELETE, and DDL do not return rows
        if result.returns_rows:
            rows = [dict(row) for row in result.mappings()]

    # return a single dict for one row, otherwise a list of dicts
    return rows[0] if len(rows) == 1 else rows

#### Execute A Test Query

In [28]:
query = sqlalchemy.text("SELECT 'Success' as did_it_work")

In [29]:
run_query(query)

{'did_it_work': 'Success'}

#### Execute Async Queries

There is not a supported async driver for working with the connector and MySQL.  A workaround is to use multiple threads to run queries concurrently.  Although this creates a separate thread for each query, the expected workload of each thread is minimal because it only passes the query to the MySQL instance for execution and waits for the response. The example below shows how to use [`concurrent.futures`](https://docs.python.org/3/library/concurrent.futures.html) to launch 5 parallel tasks:

In [30]:
with concurrent.futures.ThreadPoolExecutor(max_workers = 5) as executor:
    queries = [query]* 5
    futures = [executor.submit(run_query, query) for query in queries]
    for future in futures:
        print(future.result())

{'did_it_work': 'Success'}
{'did_it_work': 'Success'}
{'did_it_work': 'Success'}
{'did_it_work': 'Success'}
{'did_it_work': 'Success'}


---
## Working With Cloud SQL For MySQL

Now that a connection to MySQL is established, the environment can be interacted with using SQL!

### Create A Database

[Creating and Selecting a Database](https://dev.mysql.com/doc/refman/8.4/en/creating-database.html)

In [35]:
query = sqlalchemy.text(f"SELECT schema_name FROM INFORMATION_SCHEMA.SCHEMATA WHERE schema_name = '{CLOUDSQL_DATABASE_NAME}'")
result = run_query(query)
result

[]

In [36]:
if not result:
    query = sqlalchemy.text(f"CREATE DATABASE `{CLOUDSQL_DATABASE_NAME}`")
    run_query(query)

In [37]:
query = sqlalchemy.text(f"SELECT schema_name FROM INFORMATION_SCHEMA.SCHEMATA WHERE schema_name = '{CLOUDSQL_DATABASE_NAME}'")
result = run_query(query)
result

{'SCHEMA_NAME': 'applied_genai'}

In [38]:
query = sqlalchemy.text(f"SELECT * FROM INFORMATION_SCHEMA.SCHEMATA WHERE schema_name = '{CLOUDSQL_DATABASE_NAME}'")
run_query(query)

{'CATALOG_NAME': 'def',
 'SCHEMA_NAME': 'applied_genai',
 'DEFAULT_CHARACTER_SET_NAME': 'utf8mb4',
 'DEFAULT_COLLATION_NAME': 'utf8mb4_0900_ai_ci',
 'SQL_PATH': None,
 'DEFAULT_ENCRYPTION': 'NO'}

### Move Connection To New Database

Note that the connection pool connects to a specific database.  Now that a new database is created we can switch the connection pool to it by first closing the existing connection pool and creating a new one.

Verify database for current connection:

In [39]:
run_query(sqlalchemy.text('SELECT database()'))

{'database()': 'mysql'}

Close the current connection and create a new one:

In [31]:
sync_pool.dispose()
sync_connector.close()

sync_connector = google.cloud.sql.connector.Connector()
sync_pool = get_sync_pool(sync_connector, CLOUDSQL_DATABASE_NAME)

Verify the database of the new connection:

In [41]:
run_query(sqlalchemy.text('SELECT database()'))

{'database()': 'applied_genai'}

### Create Table

**References:**
- [CREATE TABLE statement](https://dev.mysql.com/doc/refman/8.4/en/create-table.html)
- [Information_Schema Tables](https://dev.mysql.com/doc/mysql-infoschema-excerpt/8.0/en/information-schema-table-reference.html)

In [42]:
result = run_query(sqlalchemy.text(f"SELECT * from information_schema.tables WHERE table_name = '{CLOUDSQL_TABLE_NAME}'"))
result

[]

In [43]:
run_query(sqlalchemy.text(f"DROP TABLE IF EXISTS `{CLOUDSQL_TABLE_NAME}`"))

[]

In [44]:
run_query(
    sqlalchemy.text(f"""
            CREATE TABLE IF NOT EXISTS `{CLOUDSQL_TABLE_NAME}` (
                chunk_id VARCHAR(100) NOT NULL PRIMARY KEY,
                gse VARCHAR(50),
                content TEXT,
                embedding TEXT
            );
        """
    )
)

[]

In [45]:
result = run_query(sqlalchemy.text(f"SELECT * from information_schema.tables WHERE table_name = '{CLOUDSQL_TABLE_NAME}'"))
result

{'TABLE_CATALOG': 'def',
 'TABLE_SCHEMA': 'applied_genai',
 'TABLE_NAME': 'retrieval_cloudsql_mysql',
 'TABLE_TYPE': 'BASE TABLE',
 'ENGINE': 'InnoDB',
 'VERSION': 10,
 'ROW_FORMAT': 'Dynamic',
 'TABLE_ROWS': 0,
 'AVG_ROW_LENGTH': 0,
 'DATA_LENGTH': 16384,
 'MAX_DATA_LENGTH': 0,
 'INDEX_LENGTH': 0,
 'DATA_FREE': 0,
 'AUTO_INCREMENT': None,
 'CREATE_TIME': datetime.datetime(2024, 11, 6, 22, 47, 28),
 'UPDATE_TIME': None,
 'CHECK_TIME': None,
 'TABLE_COLLATION': 'utf8mb4_0900_ai_ci',
 'CHECKSUM': None,
 'CREATE_OPTIONS': '',
 'TABLE_COMMENT': ''}

In [46]:
run_query(sqlalchemy.text(f"SELECT column_name, data_type FROM information_schema.columns WHERE table_name = '{CLOUDSQL_TABLE_NAME}'"))

[{'COLUMN_NAME': 'chunk_id', 'DATA_TYPE': 'varchar'},
 {'COLUMN_NAME': 'gse', 'DATA_TYPE': 'varchar'},
 {'COLUMN_NAME': 'content', 'DATA_TYPE': 'text'},
 {'COLUMN_NAME': 'embedding', 'DATA_TYPE': 'text'}]

### Add, Retrieve, And Delete Rows

Learn about inserting, retrieving and deleting records/rows with the following simple examples.

#### Get A Record

Dictionaries for each record/row are stored in `content_chunks` from earlier in this workflow:

In [47]:
first_record = content_chunks[0].copy()

In [48]:
first_record.keys()

dict_keys(['gse', 'chunk_id', 'content', 'embedding'])

In [49]:
first_record['chunk_id']

'fannie_part_0_c17'

In [50]:
type(first_record['embedding'])

list

Convert embedding to string for storage in MySQL. Later in this workflow the vector extension will be added and these values can be converted to a native vector type.

In [51]:
first_record['embedding'] = json.dumps(first_record['embedding'])

In [52]:
type(first_record['embedding'])

str

In [53]:
first_record['embedding'][0:50] + ' ... ' + first_record['embedding'][-50:]

'[0.031277116388082504, 0.03056905046105385, 0.0108 ... 255325, 0.06677839905023575, -0.03832581639289856]'
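
As a small aside, the conversion above is reversible: the string written to the `TEXT` column can be parsed back into the same list of floats. The sketch below illustrates this with a shortened, three-value stand-in for an embedding; the variable names are assumptions for this example only.

```python
import json

# A short stand-in for one embedding (the real lists are much longer).
embedding = [0.031277116388082504, -0.014074751175940037, 0.008380400016903877]

as_text = json.dumps(embedding)   # list of floats -> string for a TEXT column
restored = json.loads(as_text)    # string read back from the table -> list of floats

print(type(as_text).__name__, type(restored).__name__)
print(restored == embedding)
```

Python writes floats with their shortest round-trip representation, so the restored values compare equal to the originals.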

#### Insert Row

In [54]:
table = sqlalchemy.Table(
    CLOUDSQL_TABLE_NAME,
    sqlalchemy.MetaData(),
    autoload_with = sync_pool
)

In [55]:
for c in table.columns:
    print(c)

retrieval_cloudsql_mysql.chunk_id
retrieval_cloudsql_mysql.gse
retrieval_cloudsql_mysql.content
retrieval_cloudsql_mysql.embedding


In [56]:
insert_row = sqlalchemy.insert(table).values(first_record)

In [57]:
run_query(insert_row)

[]

#### Retrieve Row

There are two helpful ways to retrieve rows: with SQL text and with the SQLAlchemy client's `select` method.  Both are demonstrated here.

Using SQL:

In [58]:
query = sqlalchemy.text(f"SELECT * FROM `{CLOUDSQL_TABLE_NAME}` WHERE chunk_id = '{first_record['chunk_id']}'")
result = run_query(query)

In [59]:
result.keys()

dict_keys(['chunk_id', 'gse', 'content', 'embedding'])

In [60]:
result['chunk_id']

'fannie_part_0_c17'

Using the SQLAlchemy client's `select` method:

In [61]:
query = sqlalchemy.select(table).where(table.columns.chunk_id == first_record['chunk_id'])
result = run_query(query)

In [62]:
result.keys()

dict_keys(['chunk_id', 'gse', 'content', 'embedding'])

In [63]:
result['chunk_id']

'fannie_part_0_c17'

In [64]:
type(result['embedding'])

str

In [65]:
json.loads(result['embedding'])[0:10]

[0.031277116388082504,
 0.03056905046105385,
 0.010865348391234875,
 0.0623614676296711,
 0.03228681534528732,
 0.05066155269742012,
 0.046544693410396576,
 0.05509665608406067,
 -0.014074751175940037,
 0.008380400016903877]
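
The SQL text queries in this section place values directly into the string with f-strings, which is fine for trusted values in a demonstration. For values that come from users, bound parameters are a common alternative. The sketch below is one possible way to do this with `sqlalchemy.text` and `bindparams`; it runs against an in-memory SQLite database as a stand-in for the Cloud SQL connection pool, and the table `demo_chunks` and the `gse` value are made up for illustration.

```python
import sqlalchemy

# ASSUMPTION: in-memory SQLite stands in for the Cloud SQL connection pool
# so that this sketch is self-contained.
engine = sqlalchemy.create_engine("sqlite://")

with engine.connect() as connection:
    connection.execute(sqlalchemy.text(
        "CREATE TABLE demo_chunks (chunk_id VARCHAR(100) PRIMARY KEY, gse VARCHAR(50))"
    ))
    insert = sqlalchemy.text(
        "INSERT INTO demo_chunks (chunk_id, gse) VALUES (:chunk_id, :gse)"
    ).bindparams(chunk_id = "fannie_part_0_c17", gse = "fannie")
    connection.execute(insert)

    # :chunk_id is a placeholder; the value travels separately from the SQL text
    query = sqlalchemy.text(
        "SELECT chunk_id, gse FROM demo_chunks WHERE chunk_id = :chunk_id"
    ).bindparams(chunk_id = "fannie_part_0_c17")
    found = dict(connection.execute(query).mappings().one())

print(found)
```

Because `bindparams` attaches the values to the statement object, a statement built this way could be passed to `run_query` unchanged. Identifiers such as the table name cannot be bound as parameters, so those would still be formatted into the string.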

#### Delete Row

Delete the row added here.  Verify the action by counting the rows before and after the deletion.

In [66]:
run_query(sqlalchemy.text(f"SELECT COUNT(*) as count FROM `{CLOUDSQL_TABLE_NAME}`"))

{'count': 1}

In [67]:
run_query(sqlalchemy.text(f"DELETE FROM `{CLOUDSQL_TABLE_NAME}` WHERE chunk_id = '{first_record['chunk_id']}'"))

[]

In [68]:
run_query(sqlalchemy.text(f"SELECT COUNT(*) as count FROM `{CLOUDSQL_TABLE_NAME}`"))

{'count': 0}

---
## Load Data 

There are a lot of rows to load and this workflow uses concurrency with `concurrent.futures` since there is not an async driver for the connection (described earlier).  The expected computation for each task is minimal, so it is reasonable to request many more threads than this compute environment has CPUs.  In this case 1000 threads are used on a 4 vCPU machine and all 9000+ records are written in only a few seconds.

Convert the embeddings to string values:

In [69]:
prep_content_chunks  = copy.deepcopy(content_chunks)
for chunk in prep_content_chunks:
    chunk['embedding'] = json.dumps(chunk['embedding'])

Create a list of queries, one for each record to insert:

In [70]:
queries = [sqlalchemy.insert(table).values(c) for c in prep_content_chunks]

Launch all the queries across 1000 threads managed by the [ThreadPoolExecutor](https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor):

In [71]:
with concurrent.futures.ThreadPoolExecutor(max_workers = 1000) as executor:
    futures = [executor.submit(run_query, query) for query in queries]

Verify the results with a row count:

In [72]:
run_query(sqlalchemy.text(f"SELECT COUNT(*) as count FROM `{CLOUDSQL_TABLE_NAME}`"))

{'count': 9040}
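
The load above submits every insert but does not inspect the returned futures, so a failed insert would only show up as a lower row count. One possible way to surface failures is sketched below with `concurrent.futures.as_completed`. The function `fake_insert` is a hypothetical stand-in for `run_query` that fails for one record on purpose.

```python
import concurrent.futures

def fake_insert(record_id):
    """Hypothetical stand-in for run_query(insert_statement); record 3 fails on purpose."""
    if record_id == 3:
        raise ValueError(f"insert failed for record {record_id}")
    return record_id

record_ids = list(range(10))
succeeded, failed = [], []

with concurrent.futures.ThreadPoolExecutor(max_workers = 4) as executor:
    # map each future back to the record it was created for
    futures = {executor.submit(fake_insert, r): r for r in record_ids}
    for future in concurrent.futures.as_completed(futures):
        record_id = futures[future]
        error = future.exception()   # None when the task finished without raising
        if error is None:
            succeeded.append(record_id)
        else:
            failed.append((record_id, str(error)))

print(f"succeeded: {len(succeeded)}, failed: {len(failed)}")
```

The records listed in `failed` could then be retried or logged. Note also that SQLAlchemy's default pool settings (`pool_size=5`, `max_overflow=10`) cap the number of open connections, so most of the threads in a large executor wait their turn for a connection rather than all writing at once.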

---
## Setup Cloud SQL For MySQL For Vector Similarity Search

[Working with vector embeddings](https://cloud.google.com/sql/docs/mysql/work-with-vectors) requires [configuring the instance](https://cloud.google.com/sql/docs/mysql/work-with-vectors#configure-instance) to support vectors.  Vector indexes are held in memory and the allocation for memory used by vector indexes can be [set on the instance](https://cloud.google.com/sql/docs/mysql/work-with-vectors#manage-memory-indexes) and defaults to 1GB with an upper limit of half the buffer pool size.

There are two primary [limitations](https://cloud.google.com/sql/docs/mysql/work-with-vectors#limitations) to work with:
- Only one vector column per table
- Only one vector search index per table
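
For a rough sense of scale against the memory allocation mentioned above, the raw size of the embeddings in this workflow can be estimated with simple arithmetic. This is only a back-of-the-envelope sketch: it assumes each value occupies 4 bytes (a 32-bit float), and the actual memory used by a vector index depends on the index type and its overhead.

```python
n_vectors = 9040       # rows loaded into the table above
dimensions = 768       # length of each embedding in this workflow
bytes_per_value = 4    # ASSUMPTION: each value is stored as a 32-bit float

raw_bytes = n_vectors * dimensions * bytes_per_value
print(f"{raw_bytes:,} bytes = {raw_bytes / 2**20:.1f} MiB")
```

Under that assumption the raw vectors come to roughly 26.5 MiB, well below the 1GB default.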

### Verify Version of MySQL

Check the [Required version of MySQL](https://cloud.google.com/sql/docs/mysql/work-with-vectors#before-you-begin).  If necessary update the instance version with [`gcloud sql instances patch`](https://cloud.google.com/sdk/gcloud/reference/sql/instances/patch) for:
- [Upgrade the database major version in-place](https://cloud.google.com/sql/docs/mysql/upgrade-major-db-version-inplace#perform-upgrade)
- [Upgrade the database minor version in-place](https://cloud.google.com/sql/docs/mysql/upgrade-minor-db-version#minor-ver-upgrade)

In [257]:
instance_describe['maintenanceVersion']

'MYSQL_8_0_36.R20241020.00_00'

In [258]:
instance_describe['upgradableDatabaseVersions']

[{'displayName': 'MySQL 8.0.35',
  'majorVersion': 'MYSQL_8_0',
  'name': 'MYSQL_8_0_35'},
 {'displayName': 'MySQL 8.0.37',
  'majorVersion': 'MYSQL_8_0',
  'name': 'MYSQL_8_0_37'},
 {'displayName': 'MySQL 8.0.39',
  'majorVersion': 'MYSQL_8_0',
  'name': 'MYSQL_8_0_39'}]
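
If the comparison against the documented requirement needs to be automated, the version string can be split into comparable parts. The helper below is a hypothetical sketch: it assumes the format seen in the `maintenanceVersion` output above, and the `required` value is a placeholder, not the documented minimum, which should be taken from the linked page.

```python
import re

def parse_maintenance_version(value):
    """Split a string like 'MYSQL_8_0_36.R20241020.00_00' into comparable parts."""
    match = re.match(r"MYSQL_(\d+)_(\d+)_(\d+)\.R(\d{8})", value)
    if match is None:
        raise ValueError(f"unrecognized version string: {value}")
    major, minor, patch, release_date = match.groups()
    return (int(major), int(minor), int(patch)), release_date

# value shown in the maintenanceVersion output above
current = parse_maintenance_version('MYSQL_8_0_36.R20241020.00_00')

# PLACEHOLDER: replace with the requirement from the linked documentation
required = ((8, 0, 36), '20240101')

print(current)
print('meets placeholder minimum:', current >= required)
```

Tuples compare element by element, so the version numbers are checked first and the release date only breaks ties.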

In [259]:
#instance_patch = !gcloud sql instances patch $CLOUDSQL_INSTANCE_NAME --database-version= --quiet

In [260]:
#instance_describe = !gcloud sql instances describe $CLOUDSQL_INSTANCE_NAME --format=json
#instance_describe = json.loads(''.join(instance_describe))

### Enable Vector Features

[Enable vector features](https://cloud.google.com/sql/docs/mysql/work-with-vectors#enable-support) by setting the `cloudsql_vector` database flag to `on`:

In [77]:
enable_vector = !gcloud sql instances patch $CLOUDSQL_INSTANCE_NAME --database-flags=cloudsql_vector=on --quiet

In [78]:
enable_vector

['The following message will be used for the patch API method.',
 '{"name": "retrieval-cloudsql-mysql", "project": "statmike-mlops-349915", "settings": {"databaseFlags": [{"name": "cloudsql_vector", "value": "on"}]}}',
 'Patching Cloud SQL instance...',
 '......done.',
 'Updated [https://sqladmin.googleapis.com/sql/v1beta4/projects/statmike-mlops-349915/instances/retrieval-cloudsql-mysql].']

Update the connection after the instance is updated/patched:

In [140]:
sync_pool.dispose()
sync_connector.close()
sync_connector = google.cloud.sql.connector.Connector()
sync_pool = get_sync_pool(sync_connector, CLOUDSQL_DATABASE_NAME)

### Create `embedding_vector` Column As Vector Data Type

The data was loaded/inserted above with the embedding stored in a column named `embedding` as a string representation of an array of float values.  This column can now be used to create a new column, `embedding_vector`, that holds the data in `VECTOR` form, which is stored with the `VARBINARY` data type.

To [store vector embeddings](https://cloud.google.com/sql/docs/mysql/work-with-vectors#store-vector-embeddings), use:

- Create columns as VECTOR(dim) with the VARBINARY extension:
    - ```embedding_vector VECTOR(768) USING VARBINARY```
- Functions to convert string representations to vectors and back:
    - ```string_to_vector('[]')```
    - ```vector_to_string()```

In [82]:
run_query(sqlalchemy.text(f"ALTER TABLE `{CLOUDSQL_TABLE_NAME}` ADD COLUMN embedding_vector VECTOR({len(question_embedding)}) USING VARBINARY;"))

[]

In [83]:
run_query(sqlalchemy.text(f"SELECT column_name, data_type FROM information_schema.columns WHERE table_name = '{CLOUDSQL_TABLE_NAME}'"))

[{'COLUMN_NAME': 'chunk_id', 'DATA_TYPE': 'varchar'},
 {'COLUMN_NAME': 'content', 'DATA_TYPE': 'text'},
 {'COLUMN_NAME': 'embedding', 'DATA_TYPE': 'text'},
 {'COLUMN_NAME': 'embedding_vector', 'DATA_TYPE': 'varbinary'},
 {'COLUMN_NAME': 'gse', 'DATA_TYPE': 'varchar'}]

In [84]:
run_query(sqlalchemy.text(f"UPDATE `{CLOUDSQL_TABLE_NAME}` SET embedding_vector = string_to_vector(embedding)"))

[]

In [85]:
query = sqlalchemy.text(f"SELECT *, vector_to_string(embedding_vector) as embedding_vector FROM `{CLOUDSQL_TABLE_NAME}` WHERE chunk_id = '{first_record['chunk_id']}'")
result = run_query(query)
result.keys()

dict_keys(['chunk_id', 'gse', 'content', 'embedding', 'embedding_vector'])

In [86]:
type(result['embedding_vector'])

str

In [87]:
result['embedding_vector'][0:100]

'[0.0312771,0.0305691,0.0108653,0.0623615,0.0322868,0.0506616,0.0465447,0.0550967,-0.0140748,0.008380'
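
The string returned by `vector_to_string()` above shows about six significant digits per value, which is what you would expect if the values are held as 32-bit floats (an assumption based on the output, not a statement about the storage format). A round-trip check should therefore compare with a tolerance rather than for exact equality. The sketch below uses hypothetical sample values and hypothetical helper names (`to_float32`, `vectors_match`):

```python
import json
import math
import struct

def to_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def vectors_match(original, returned_string, rel_tol=1e-5, abs_tol=1e-7):
    """Compare the original floats to a '[v1,v2,...]' string within a tolerance."""
    returned = json.loads(returned_string)
    return len(original) == len(returned) and all(
        math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, b in zip(original, returned)
    )

# Hypothetical 64-bit values standing in for the first entries of an embedding
original = [0.031277123456789, 0.030569087654321, -0.014074812345678]

# Simulate a 32-bit round trip printed with six significant digits
returned_string = "[" + ",".join(f"{to_float32(v):.6g}" for v in original) + "]"

print(returned_string)
print(vectors_match(original, returned_string))
```

In the notebook, `original` would be the embedding that was inserted for the record and `returned_string` would be `result['embedding_vector']`.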

---
## Vector Similarity Search, Matching

This section covers using a vector similarity metric to find the nearest neighbors of a query vector, both with and without an index.  To understand similarity metrics and build the intuition for choosing one (choose dot product), check out [The Math of Similarity](../Embeddings/The%20Math%20of%20Similarity.ipynb).


### Check For Vector Indexes

At this point in the workflow no vector indexes have been created.  The following cells show how to check for indexes and will be reused later in the workflow to verify the details of indexes after they are created.

The system table `mysql.vector_indexes` lists indexes available by database and table on the instance:

In [145]:
run_query(sqlalchemy.text(f"SELECT * FROM mysql.vector_indexes"))

[]

The `information_schema.innodb_vector_indexes` view contains detailed information about each index in the instance.  The sample output shown here was captured while a `BRUTE_FORCE` index existed:

In [141]:
run_query(sqlalchemy.text(f"SELECT * FROM information_schema.innodb_vector_indexes"))

{'INDEX_NAME': 'applied_genai.embedding_index',
 'TABLE_NAME': 'applied_genai.retrieval_cloudsql_mysql',
 'INDEX_TYPE': 'BRUTE_FORCE',
 'DIMENSION': 768,
 'DIST_MEASURE': 'DotProductDistance',
 'STATUS': 'Ready',
 'STATE': 'INDEX_READY_TO_USE',
 'PARTITIONS': 0,
 'SEARCH_PARTITIONS': 0,
 'INITIAL_SIZE': 9040,
 'CURRENT_SIZE': 9040,
 'QUERIES': 3,
 'MUTATIONS': 0,
 'INDEX_MEMORY': 27770880,
 'DATASET_MEMORY': 0}

### Brute Force Search - No Index

Without an index, you can still use distance measures to find nearest neighbor matches through a brute force search that compares the query embedding to every row.

Run a brute force match with a choice of distance measure:
- `cosine_distance` for Cosine distance
- `l2_squared_distance` for L2, Euclidean distance
    - Note that this is the square of the L2 distance rather than the L2 distance itself.  It is cheaper to compute because it skips the square root operation, and it ranks neighbors in the same order.  Read more about these metrics and their calculation in [The Math of Similarity](../Embeddings/The%20Math%20of%20Similarity.ipynb).
- `dot_product` for Dot product

Dot product with `dot_product()`:

In [150]:
run_query(sqlalchemy.text(f"""
    SELECT
        chunk_id,
        dot_product(embedding_vector, string_to_vector('{str(question_embedding)}')) AS dot_product
    FROM `{CLOUDSQL_TABLE_NAME}`
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_0_c352', 'dot_product': -0.7099842034905741},
 {'chunk_id': 'freddie_part_4_c509', 'dot_product': -0.6805260858670844},
 {'chunk_id': 'freddie_part_4_c510', 'dot_product': -0.675329698947003},
 {'chunk_id': 'fannie_part_0_c353', 'dot_product': -0.6723706805978971},
 {'chunk_id': 'fannie_part_0_c326', 'dot_product': -0.6683496294372162}]

Squared Euclidean distance with `l2_squared_distance()`:

In [152]:
run_query(sqlalchemy.text(f"""
    SELECT
        chunk_id,
        l2_squared_distance(embedding_vector, string_to_vector('{str(question_embedding)}')) AS euclidean_distance
    FROM `{CLOUDSQL_TABLE_NAME}`
    ORDER BY euclidean_distance
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_0_c352', 'euclidean_distance': 0.5799824894033581},
 {'chunk_id': 'freddie_part_4_c509', 'euclidean_distance': 0.63886055817837},
 {'chunk_id': 'freddie_part_4_c510', 'euclidean_distance': 0.6492892334551303},
 {'chunk_id': 'fannie_part_0_c353', 'euclidean_distance': 0.6551829273195239},
 {'chunk_id': 'fannie_part_0_c326', 'euclidean_distance': 0.6632886194891773}]

Cosine distance with `cosine_distance()`:

In [153]:
run_query(sqlalchemy.text(f"""
    SELECT
        chunk_id,
        cosine_distance(embedding_vector, string_to_vector('{str(question_embedding)}')) AS cosine_distance
    FROM `{CLOUDSQL_TABLE_NAME}`
    ORDER BY cosine_distance
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_0_c352', 'cosine_distance': 0.28999836618279184},
 {'chunk_id': 'freddie_part_4_c509', 'cosine_distance': 0.319444218286727},
 {'chunk_id': 'freddie_part_4_c510', 'cosine_distance': 0.32465295628697444},
 {'chunk_id': 'fannie_part_0_c353', 'cosine_distance': 0.32760386778331907},
 {'chunk_id': 'fannie_part_0_c326', 'cosine_distance': 0.3316463216836467}]
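
All three measures return the same five chunks in the same order, and the values line up numerically. For unit-length vectors with inner product `s`, the squared L2 distance is `2 - 2s` and the cosine distance is `1 - s`. The reported values fit that pattern if the embeddings are approximately unit length and `dot_product()` returns the negated inner product; both points are inferences from the outputs above, not documented statements. A minimal pure-Python sketch with hypothetical vectors, followed by a check against the values reported for the top match:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def inner(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_dist(a, b):
    return 1.0 - inner(a, b) / (math.sqrt(inner(a, a)) * math.sqrt(inner(b, b)))

# Hypothetical vectors, normalized to unit length
a = normalize([0.3, -1.2, 0.8, 2.0])
b = normalize([0.1, -0.9, 1.1, 1.7])
s = inner(a, b)

print(l2_squared(a, b), 2 - 2 * s)   # the two values agree for unit vectors
print(cosine_dist(a, b), 1 - s)      # the two values agree for unit vectors

# Values reported above for chunk 'fannie_part_0_c352'
reported_dot = -0.7099842034905741
reported_l2_squared = 0.5799824894033581
reported_cosine = 0.28999836618279184

# Treat reported_dot as the negated inner product, so s = -reported_dot
print(2 + 2 * reported_dot, reported_l2_squared)
print(1 + reported_dot, reported_cosine)
```

The reported values agree with these identities to about four decimal places, which is why the three measures produce the same ranking for this data.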

### Brute Force Search With Pre-Filtering - No Index

Extending a brute force match with pre-filtering means including a `WHERE` clause to first filter to the rows that meet a desired condition:

Find the top 5 matches where the GSE is 'fannie':

In [154]:
run_query(sqlalchemy.text(f"""
    SELECT
        chunk_id,
        dot_product(embedding_vector, string_to_vector('{str(question_embedding)}')) AS dot_product
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE gse = 'fannie'
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_0_c352', 'dot_product': -0.7099842034905741},
 {'chunk_id': 'fannie_part_0_c353', 'dot_product': -0.6723706805978971},
 {'chunk_id': 'fannie_part_0_c326', 'dot_product': -0.6683496294372162},
 {'chunk_id': 'fannie_part_0_c92', 'dot_product': -0.6614337365039091},
 {'chunk_id': 'fannie_part_0_c240', 'dot_product': -0.6608578612144242}]

Find the top 5 matches where the GSE is 'freddie':

In [155]:
run_query(sqlalchemy.text(f"""
    SELECT
        chunk_id,
        dot_product(embedding_vector, string_to_vector('{str(question_embedding)}')) AS dot_product
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE gse = 'freddie'
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'freddie_part_4_c509', 'dot_product': -0.6805260858670844},
 {'chunk_id': 'freddie_part_4_c510', 'dot_product': -0.675329698947003},
 {'chunk_id': 'freddie_part_4_c472', 'dot_product': -0.6619843865670703},
 {'chunk_id': 'freddie_part_6_c439', 'dot_product': -0.660453491838382},
 {'chunk_id': 'freddie_part_4_c558', 'dot_product': -0.6575403557118698}]

### Create And Use An Index

Indexes make search across many rows more efficient by first matching partitions of rows and then only comparing to rows within the matched partitions.  This section covers [creating indexes](https://cloud.google.com/sql/docs/mysql/work-with-vectors#create-vector-search-index) and using them in queries.

- Brute Force: `BRUTE_FORCE`
    - search covers all rows, like using the distance measures above
        - it can still be helpful to create an index of this type to preset a default number of nearest neighbors and make use of the `NEAREST() ... TO ()` predicate. Also, since indexes are held in memory, it can be faster to scan all rows with an index.
    - This is the default for tables under 10,000 rows
- Tree SQ: `TREE_SQ`
    - scalable nearest neighbors with k-means
    - approximate search
    - high accuracy but can be slower for larger datasets
    - this is the default for tables with 10,000 or more rows
- Tree AH: `TREE_AH`
    - asymmetric hashing, traverse trees to find nearest neighbors
    - scalable and fast for large datasets
    
Two primary [limitations](https://cloud.google.com/sql/docs/mysql/work-with-vectors#limitations) to work with:
- Only one vector column per table
- Only one vector search index per table

**NOTE:** Indexes are used for approximate nearest neighbor search only when the `NEAREST ... TO` predicate is used.  When a distance function is called directly, all rows are scanned (brute force).
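
To build intuition for partition-based search, here is a toy pure-Python sketch. It is not Cloud SQL's algorithm: the centroids are simply sampled rows rather than learned with k-means, squared L2 distance is used instead of dot product, and the function names and parameters (`build_partitions`, `search`, `search_partitions`) are hypothetical. It shows the two steps described above: rank the partitions against the query, then compare only to rows in the closest partitions.

```python
import random

def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_partitions(vectors, num_partitions, seed=0):
    """Pick sample rows as centroids and assign every row to its nearest centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, num_partitions)
    partitions = [[] for _ in centroids]
    for row_id, vector in enumerate(vectors):
        nearest = min(range(num_partitions), key=lambda i: l2_squared(vector, centroids[i]))
        partitions[nearest].append(row_id)
    return centroids, partitions

def search(query, vectors, centroids, partitions, num_neighbors, search_partitions):
    """Scan only the rows in the partitions whose centroids are closest to the query."""
    ranked = sorted(range(len(centroids)), key=lambda i: l2_squared(query, centroids[i]))
    candidates = [row_id for i in ranked[:search_partitions] for row_id in partitions[i]]
    candidates.sort(key=lambda row_id: l2_squared(query, vectors[row_id]))
    return candidates[:num_neighbors]

# Hypothetical data: 500 random 8-dimensional vectors and one query
rng = random.Random(42)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(500)]
query = [rng.gauss(0, 1) for _ in range(8)]
centroids, partitions = build_partitions(vectors, num_partitions=10)

# Exact answer from a full scan
exact = sorted(range(len(vectors)), key=lambda row_id: l2_squared(query, vectors[row_id]))[:5]
# Approximate answer from the 3 closest partitions, and from all 10 partitions
approx = search(query, vectors, centroids, partitions, num_neighbors=5, search_partitions=3)
full = search(query, vectors, centroids, partitions, num_neighbors=5, search_partitions=10)

print("exact :", exact)
print("approx:", approx)
print("full  :", full)
```

Searching every partition reproduces the exact result, while searching fewer partitions trades some recall for less work. That is the same trade-off controlled by the number of partitions searched in the queries that follow.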

#### Index: `TREE_SQ`

Create the index:

In [222]:
run_query(sqlalchemy.text(f"""
    call mysql.create_vector_index(
        'embedding_index',
        '{CLOUDSQL_DATABASE_NAME}.{CLOUDSQL_TABLE_NAME}',
        'embedding_vector',
        'index_type = tree_sq, distance_measure = dot_product, num_neighbors = 5, num_partitions = 90'
    );
"""))

[]
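
The last argument to `mysql.create_vector_index` is a single string of comma-separated `key = value` options. When experimenting with several index types it can be convenient to build that string from keyword arguments and to parse it back from the `index_options` column. The helpers below are a hypothetical convenience (`build_index_options` and `parse_index_options` are not part of any library) and they do not validate option names or values:

```python
def build_index_options(**options):
    """Join keyword options into the 'key = value, key = value' string format."""
    return ", ".join(f"{key} = {value}" for key, value in options.items())

def parse_index_options(options_string):
    """Split an options string back into a dict of string values."""
    pairs = (item.split("=", 1) for item in options_string.split(","))
    return {key.strip(): value.strip() for key, value in pairs}

index_options = build_index_options(
    index_type="tree_sq",
    distance_measure="dot_product",
    num_neighbors=5,
    num_partitions=90,
)
print(index_options)
print(parse_index_options(index_options))
```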

Review the index details:

In [160]:
run_query(sqlalchemy.text(f"SELECT * FROM information_schema.innodb_vector_indexes"))

{'INDEX_NAME': 'applied_genai.embedding_index',
 'TABLE_NAME': 'applied_genai.retrieval_cloudsql_mysql',
 'INDEX_TYPE': 'TREE_SQ',
 'DIMENSION': 768,
 'DIST_MEASURE': 'DotProductDistance',
 'STATUS': 'Ready',
 'STATE': 'INDEX_READY_TO_USE',
 'PARTITIONS': 90,
 'SEARCH_PARTITIONS': 53,
 'INITIAL_SIZE': 9040,
 'CURRENT_SIZE': 9040,
 'QUERIES': 0,
 'MUTATIONS': 0,
 'INDEX_MEMORY': 6942720,
 'DATASET_MEMORY': 0}

In [161]:
run_query(sqlalchemy.text(f"SELECT * FROM mysql.vector_indexes"))

{'index_name': 'applied_genai.embedding_index',
 'table_name': 'applied_genai.retrieval_cloudsql_mysql',
 'column_name': 'embedding_vector',
 'index_options': 'index_type = tree_sq, distance_measure = dot_product, num_neighbors = 5, num_partitions = 90',
 'status': 'ACTIVE',
 'create_time': datetime.datetime(2024, 11, 8, 12, 46, 58),
 'update_time': datetime.datetime(2024, 11, 8, 12, 46, 58)}

Use a distance measure directly, like brute force above:

In [162]:
run_query(sqlalchemy.text(f"""
    SELECT
        chunk_id, 
        dot_product(embedding_vector, string_to_vector('{str(question_embedding)}')) AS dot_product
    FROM `{CLOUDSQL_TABLE_NAME}`
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_0_c352', 'dot_product': -0.7099842034905741},
 {'chunk_id': 'freddie_part_4_c509', 'dot_product': -0.6805260858670844},
 {'chunk_id': 'freddie_part_4_c510', 'dot_product': -0.675329698947003},
 {'chunk_id': 'fannie_part_0_c353', 'dot_product': -0.6723706805978971},
 {'chunk_id': 'fannie_part_0_c326', 'dot_product': -0.6683496294372162}]

Use `EXPLAIN ANALYZE` to understand the query execution.  Note that the index was not used:

In [163]:
run_query(sqlalchemy.text(f"""
EXPLAIN ANALYZE
    SELECT
        chunk_id, 
        dot_product(embedding_vector, string_to_vector('{str(question_embedding)}')) AS dot_product
    FROM `{CLOUDSQL_TABLE_NAME}`
    ORDER BY dot_product
    LIMIT 5
"""))

{'EXPLAIN': '-> Limit: 5 row(s)  (cost=21938 rows=5) (actual time=1751..1751 rows=5 loops=1)\n    -> Sort: dot_product, limit input to 5 row(s) per chunk  (cost=21938 rows=5625) (actual time=1751..1751 rows=5 loops=1)\n        -> Table scan on retrieval_cloudsql_mysql  (cost=21938 rows=5625) (actual time=0.0928..19.3 rows=9040 loops=1)\n'}

Now use the `NEAREST ... TO` predicate to invoke the use of the index:

In [164]:
run_query(sqlalchemy.text(f"""
    SELECT chunk_id
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE NEAREST(embedding_vector) TO (string_to_vector('{str(question_embedding)}'))
"""))

[{'chunk_id': 'fannie_part_0_c326'},
 {'chunk_id': 'fannie_part_0_c352'},
 {'chunk_id': 'fannie_part_0_c353'},
 {'chunk_id': 'freddie_part_4_c509'},
 {'chunk_id': 'freddie_part_4_c510'}]

Use `EXPLAIN ANALYZE` to understand the query execution.  Note that the index was used to make the execution much more efficient than above:

In [166]:
run_query(sqlalchemy.text(f"""
EXPLAIN ANALYZE
    SELECT chunk_id
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE NEAREST(embedding_vector) TO (string_to_vector('{str(question_embedding)}'))
"""))

{'EXPLAIN': "-> Filter: ((retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c326') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c353') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c510') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c509') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c352'))  (cost=5.41 rows=5) (actual time=0.0519..0.0895 rows=5 loops=1)\n    -> Index range scan on retrieval_cloudsql_mysql using PRIMARY over (chunk_id = 'fannie_part_0_c326') OR (chunk_id = 'fannie_part_0_c352') OR (3 more)  (cost=5.41 rows=5) (actual time=0.0488..0.0809 rows=5 loops=1)\n"}
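
The `EXPLAIN` value is a single string with embedded newlines, so printing it line by line makes the plan tree easier to read. The sketch below also pulls the `actual time` range (reported in milliseconds) from the top plan node so the two plans above can be compared. The helper name `root_actual_time_ms` is hypothetical, and the second plan is abbreviated here with `(...)` in place of the long predicates. One caveat: the `NEAREST` plan already lists the matching `chunk_id` values as constants, so its timings appear to cover only the primary key lookup and may not include the vector index search itself.

```python
import re

TIME_PATTERN = re.compile(r"actual time=(\d+(?:\.\d+)?)\.\.(\d+(?:\.\d+)?)")

def root_actual_time_ms(plan_text):
    """Return (time to first row, time to all rows) in ms for the top plan node."""
    first_line = plan_text.splitlines()[0]
    match = TIME_PATTERN.search(first_line)
    if match is None:
        raise ValueError("no 'actual time' found in the first plan line")
    return float(match.group(1)), float(match.group(2))

# Plan text copied from the distance-function query above
table_scan_plan = (
    "-> Limit: 5 row(s)  (cost=21938 rows=5) (actual time=1751..1751 rows=5 loops=1)\n"
    "    -> Sort: dot_product, limit input to 5 row(s) per chunk  (cost=21938 rows=5625) (actual time=1751..1751 rows=5 loops=1)\n"
    "        -> Table scan on retrieval_cloudsql_mysql  (cost=21938 rows=5625) (actual time=0.0928..19.3 rows=9040 loops=1)\n"
)

# Plan text from the NEAREST ... TO query above, with predicates abbreviated as (...)
nearest_plan = (
    "-> Filter: (...)  (cost=5.41 rows=5) (actual time=0.0519..0.0895 rows=5 loops=1)\n"
    "    -> Index range scan on retrieval_cloudsql_mysql using PRIMARY over (...)  (cost=5.41 rows=5) (actual time=0.0488..0.0809 rows=5 loops=1)\n"
)

for line in table_scan_plan.splitlines():
    print(line)

scan_first_ms, scan_all_ms = root_actual_time_ms(table_scan_plan)
nearest_first_ms, nearest_all_ms = root_actual_time_ms(nearest_plan)
print(scan_all_ms, nearest_all_ms)
```

In the notebook, the plan text is the value under the `'EXPLAIN'` key of the result returned by `run_query`.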

Use the `num_neighbors` parameter to override the default value of 5 set during the index creation:

In [167]:
run_query(sqlalchemy.text(f"""
    SELECT chunk_id
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE NEAREST(embedding_vector) TO (string_to_vector('{str(question_embedding)}'), 'num_neighbors=7')
"""))

[{'chunk_id': 'fannie_part_0_c326'},
 {'chunk_id': 'fannie_part_0_c352'},
 {'chunk_id': 'fannie_part_0_c353'},
 {'chunk_id': 'fannie_part_0_c92'},
 {'chunk_id': 'freddie_part_4_c472'},
 {'chunk_id': 'freddie_part_4_c509'},
 {'chunk_id': 'freddie_part_4_c510'}]

Add a filter, `gse = 'fannie'`, to the query and note that fewer neighbors than the requested 7 are returned.  This means the filter is applied as a post-filter: the 7 nearest neighbors were found first, and then the filter was applied to those 7:

In [118]:
run_query(sqlalchemy.text(f"""
    SELECT chunk_id
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE NEAREST(embedding_vector) TO (string_to_vector('{str(question_embedding)}'), 'num_neighbors=7')
    AND gse = 'fannie'
"""))

[{'chunk_id': 'fannie_part_0_c326'},
 {'chunk_id': 'fannie_part_0_c352'},
 {'chunk_id': 'fannie_part_0_c353'},
 {'chunk_id': 'fannie_part_0_c92'}]
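
The difference between post-filtering and pre-filtering can be reproduced in a few lines of pure Python. The rows below are the chunk ids and `dot_product` values reported in the brute force outputs earlier in this notebook, rounded to four decimals; the function names `post_filter` and `pre_filter` are hypothetical and only mimic the order of operations, not how the database executes the query.

```python
# (chunk_id, gse, dot_product) rounded from the brute force outputs above
rows = [
    ("fannie_part_0_c352", "fannie", -0.7100),
    ("freddie_part_4_c509", "freddie", -0.6805),
    ("freddie_part_4_c510", "freddie", -0.6753),
    ("fannie_part_0_c353", "fannie", -0.6724),
    ("fannie_part_0_c326", "fannie", -0.6683),
    ("freddie_part_4_c472", "freddie", -0.6620),
    ("fannie_part_0_c92", "fannie", -0.6614),
    ("fannie_part_0_c240", "fannie", -0.6609),
    ("freddie_part_6_c439", "freddie", -0.6605),
    ("freddie_part_4_c558", "freddie", -0.6575),
]

def post_filter(rows, num_neighbors, gse):
    """Find the nearest neighbors first, then keep only those matching the filter."""
    nearest = sorted(rows, key=lambda row: row[2])[:num_neighbors]
    return [chunk_id for chunk_id, row_gse, _ in nearest if row_gse == gse]

def pre_filter(rows, num_neighbors, gse):
    """Apply the filter first, then find the nearest neighbors among the remaining rows."""
    kept = [row for row in rows if row[1] == gse]
    return [chunk_id for chunk_id, _, _ in sorted(kept, key=lambda row: row[2])[:num_neighbors]]

post = post_filter(rows, num_neighbors=7, gse="fannie")
pre = pre_filter(rows, num_neighbors=5, gse="fannie")
print(post)
print(pre)
```

Post-filtering 7 neighbors leaves the same 4 `fannie` chunks returned by the query above, while pre-filtering returns a full set of 5.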

Using `EXPLAIN ANALYZE` shows that even though the filter was applied after the neighbor search rather than before it, the query still used the index:

In [170]:
run_query(sqlalchemy.text(f"""
EXPLAIN ANALYZE
    SELECT chunk_id
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE NEAREST(embedding_vector) TO (string_to_vector('{str(question_embedding)}'), 'num_neighbors=7')
    AND gse = 'fannie'
"""))

{'EXPLAIN': "-> Filter: ((retrieval_cloudsql_mysql.gse = 'fannie') and ((retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c92') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c472') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c326') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c353') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c510') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c509') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c352')))  (cost=7.57 rows=0.7) (actual time=0.0428..0.0825 rows=4 loops=1)\n    -> Index range scan on retrieval_cloudsql_mysql using PRIMARY over (chunk_id = 'fannie_part_0_c326') OR (chunk_id = 'fannie_part_0_c352') OR (5 more)  (cost=7.57 rows=7) (actual time=0.0397..0.0754 rows=7 loops=1)\n"}

Force the index to scan a larger number of partitions by setting the `num_partitions` parameter.  In this case it is set to the full number of partitions, which essentially forces a brute force query across all rows:

In [171]:
run_query(sqlalchemy.text(f"""
    SELECT chunk_id
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE NEAREST(embedding_vector) TO (string_to_vector('{str(question_embedding)}'), 'num_neighbors=7, num_partitions=90')
"""))

[{'chunk_id': 'fannie_part_0_c326'},
 {'chunk_id': 'fannie_part_0_c352'},
 {'chunk_id': 'fannie_part_0_c353'},
 {'chunk_id': 'fannie_part_0_c92'},
 {'chunk_id': 'freddie_part_4_c472'},
 {'chunk_id': 'freddie_part_4_c509'},
 {'chunk_id': 'freddie_part_4_c510'}]

Use `EXPLAIN ANALYZE` to evaluate the execution of the query set to scan all partitions of the index:

In [173]:
run_query(sqlalchemy.text(f"""
EXPLAIN ANALYZE
    SELECT chunk_id
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE NEAREST(embedding_vector) TO (string_to_vector('{str(question_embedding)}'), 'num_neighbors=7, num_partitions=90')
"""))

{'EXPLAIN': "-> Filter: ((retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c92') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c472') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c326') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c353') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c510') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c509') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c352'))  (cost=7.57 rows=7) (actual time=0.0439..0.0901 rows=7 loops=1)\n    -> Index range scan on retrieval_cloudsql_mysql using PRIMARY over (chunk_id = 'fannie_part_0_c326') OR (chunk_id = 'fannie_part_0_c352') OR (5 more)  (cost=7.57 rows=7) (actual time=0.041..0.0818 rows=7 loops=1)\n"}

And compare to a brute force query with pre-filtering:

In [174]:
run_query(sqlalchemy.text(f"""
    SELECT
        chunk_id, 
        dot_product(embedding_vector, string_to_vector('{str(question_embedding)}')) AS dot_product
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE gse = 'fannie'
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_0_c352', 'dot_product': -0.7099842034905741},
 {'chunk_id': 'fannie_part_0_c353', 'dot_product': -0.6723706805978971},
 {'chunk_id': 'fannie_part_0_c326', 'dot_product': -0.6683496294372162},
 {'chunk_id': 'fannie_part_0_c92', 'dot_product': -0.6614337365039091},
 {'chunk_id': 'fannie_part_0_c240', 'dot_product': -0.6608578612144242}]

Uncomment the following cell to delete the index.  Leave it commented out to continue: the next section expects the index to be present and replaces it with a different index type:

In [120]:
#run_query(sqlalchemy.text(f"call mysql.drop_vector_index('{CLOUDSQL_DATABASE_NAME}.embedding_index');"))

#### Index: `TREE_AH`

Change the index with the `mysql.alter_vector_index()` procedure, which overwrites the previous index:

In [175]:
run_query(sqlalchemy.text(f"""
    call mysql.alter_vector_index(
        '{CLOUDSQL_DATABASE_NAME}.embedding_index',
        'index_type = tree_ah, distance_measure = dot_product, num_neighbors = 5, num_partitions = 90'
    );
"""))

{'Alter Index Status': 'Success: Done'}

Review the index details:

In [176]:
run_query(sqlalchemy.text(f"SELECT * FROM information_schema.innodb_vector_indexes"))

{'INDEX_NAME': 'applied_genai.embedding_index',
 'TABLE_NAME': 'applied_genai.retrieval_cloudsql_mysql',
 'INDEX_TYPE': 'TREE_AH',
 'DIMENSION': 768,
 'DIST_MEASURE': 'DotProductDistance',
 'STATUS': 'Ready',
 'STATE': 'INDEX_READY_TO_USE',
 'PARTITIONS': 90,
 'SEARCH_PARTITIONS': 53,
 'INITIAL_SIZE': 9040,
 'CURRENT_SIZE': 9040,
 'QUERIES': 0,
 'MUTATIONS': 0,
 'INDEX_MEMORY': 29506560,
 'DATASET_MEMORY': 0}
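
The `INDEX_MEMORY` values reported for the three index types can be related to the table size with simple arithmetic. Assuming the values are in bytes (an assumption, the unit is not shown in the output), the sketch below divides each by the 9040 rows and the 768 dimensions. The `BRUTE_FORCE` and `TREE_AH` figures are copied from the index details shown earlier and just above; the `TREE_SQ` figure is from the previous section.

```python
row_count = 9040   # INITIAL_SIZE / CURRENT_SIZE reported for the index
dimension = 768    # DIMENSION reported for the index

# INDEX_MEMORY values reported by information_schema.innodb_vector_indexes
index_memory = {
    "BRUTE_FORCE": 27770880,
    "TREE_SQ": 6942720,
    "TREE_AH": 29506560,
}

bytes_per_vector = {}
for index_type, total in index_memory.items():
    bytes_per_vector[index_type] = total / row_count
    print(index_type, bytes_per_vector[index_type], bytes_per_vector[index_type] / dimension)
```

The result is 4 bytes per dimension for `BRUTE_FORCE`, which matches 32-bit floats, and 1 byte per dimension for `TREE_SQ`, which would fit a quantized representation. `TREE_AH` works out to 3264 bytes per vector. These are observations from this one table, not a documented sizing formula.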

In [177]:
run_query(sqlalchemy.text(f"SELECT * FROM mysql.vector_indexes"))

{'index_name': 'applied_genai.embedding_index',
 'table_name': 'applied_genai.retrieval_cloudsql_mysql',
 'column_name': 'embedding_vector',
 'index_options': 'index_type = tree_ah, distance_measure = dot_product, num_neighbors = 5, num_partitions = 90',
 'status': 'ACTIVE',
 'create_time': datetime.datetime(2024, 11, 8, 12, 46, 58),
 'update_time': datetime.datetime(2024, 11, 8, 12, 59, 52)}

Use a distance measure directly, like brute force above:

In [178]:
run_query(sqlalchemy.text(f"""
    SELECT
        chunk_id, 
        dot_product(embedding_vector, string_to_vector('{str(question_embedding)}')) AS dot_product
    FROM `{CLOUDSQL_TABLE_NAME}`
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_0_c352', 'dot_product': -0.7099842034905741},
 {'chunk_id': 'freddie_part_4_c509', 'dot_product': -0.6805260858670844},
 {'chunk_id': 'freddie_part_4_c510', 'dot_product': -0.675329698947003},
 {'chunk_id': 'fannie_part_0_c353', 'dot_product': -0.6723706805978971},
 {'chunk_id': 'fannie_part_0_c326', 'dot_product': -0.6683496294372162}]

Use `EXPLAIN ANALYZE` to understand the query execution.  Note that the index was not used:

In [179]:
run_query(sqlalchemy.text(f"""
EXPLAIN ANALYZE
    SELECT
        chunk_id, 
        dot_product(embedding_vector, string_to_vector('{str(question_embedding)}')) AS dot_product
    FROM `{CLOUDSQL_TABLE_NAME}`
    ORDER BY dot_product
    LIMIT 5
"""))

{'EXPLAIN': '-> Limit: 5 row(s)  (cost=21938 rows=5) (actual time=1736..1736 rows=5 loops=1)\n    -> Sort: dot_product, limit input to 5 row(s) per chunk  (cost=21938 rows=5625) (actual time=1736..1736 rows=5 loops=1)\n        -> Table scan on retrieval_cloudsql_mysql  (cost=21938 rows=5625) (actual time=0.0737..18.7 rows=9040 loops=1)\n'}

Now use the `NEAREST ... TO` predicate to invoke the use of the index:

In [180]:
run_query(sqlalchemy.text(f"""
    SELECT chunk_id
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE NEAREST(embedding_vector) TO (string_to_vector('{str(question_embedding)}'))
"""))

[{'chunk_id': 'fannie_part_0_c326'},
 {'chunk_id': 'fannie_part_0_c352'},
 {'chunk_id': 'fannie_part_0_c353'},
 {'chunk_id': 'freddie_part_4_c509'},
 {'chunk_id': 'freddie_part_4_c510'}]

Use `EXPLAIN ANALYZE` to understand the query execution.  Note that the index was used to make the execution much more efficient than above:

In [181]:
run_query(sqlalchemy.text(f"""
EXPLAIN ANALYZE
    SELECT chunk_id
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE NEAREST(embedding_vector) TO (string_to_vector('{str(question_embedding)}'))
"""))

{'EXPLAIN': "-> Filter: ((retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c326') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c353') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c510') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c509') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c352'))  (cost=5.41 rows=5) (actual time=0.0379..0.0597 rows=5 loops=1)\n    -> Index range scan on retrieval_cloudsql_mysql using PRIMARY over (chunk_id = 'fannie_part_0_c326') OR (chunk_id = 'fannie_part_0_c352') OR (3 more)  (cost=5.41 rows=5) (actual time=0.0351..0.0536 rows=5 loops=1)\n"}

Use the `num_neighbors` parameter to override the default value of 5 set during the index creation:

In [182]:
run_query(sqlalchemy.text(f"""
    SELECT chunk_id
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE NEAREST(embedding_vector) TO (string_to_vector('{str(question_embedding)}'), 'num_neighbors=7')
"""))

[{'chunk_id': 'fannie_part_0_c326'},
 {'chunk_id': 'fannie_part_0_c352'},
 {'chunk_id': 'fannie_part_0_c353'},
 {'chunk_id': 'fannie_part_0_c92'},
 {'chunk_id': 'freddie_part_4_c472'},
 {'chunk_id': 'freddie_part_4_c509'},
 {'chunk_id': 'freddie_part_4_c510'}]

Add a filter, `gse = 'fannie'`, to the query and note that fewer neighbors than the requested 7 are returned.  This means the filter is applied as a post-filter: the 7 nearest neighbors were found first, and then the filter was applied to those 7:

In [183]:
run_query(sqlalchemy.text(f"""
    SELECT chunk_id
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE NEAREST(embedding_vector) TO (string_to_vector('{str(question_embedding)}'), 'num_neighbors=7')
    AND gse = 'fannie'
"""))

[{'chunk_id': 'fannie_part_0_c326'},
 {'chunk_id': 'fannie_part_0_c352'},
 {'chunk_id': 'fannie_part_0_c353'},
 {'chunk_id': 'fannie_part_0_c92'}]

Using `EXPLAIN ANALYZE` shows that even though the filter was applied after the neighbor search rather than before it, the query still used the index:

In [184]:
run_query(sqlalchemy.text(f"""
EXPLAIN ANALYZE
    SELECT chunk_id
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE NEAREST(embedding_vector) TO (string_to_vector('{str(question_embedding)}'), 'num_neighbors=7')
    AND gse = 'fannie'
"""))

{'EXPLAIN': "-> Filter: ((retrieval_cloudsql_mysql.gse = 'fannie') and ((retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c92') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c472') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c326') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c353') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c510') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c509') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c352')))  (cost=7.57 rows=0.7) (actual time=0.0478..0.0788 rows=4 loops=1)\n    -> Index range scan on retrieval_cloudsql_mysql using PRIMARY over (chunk_id = 'fannie_part_0_c326') OR (chunk_id = 'fannie_part_0_c352') OR (5 more)  (cost=7.57 rows=7) (actual time=0.0444..0.0714 rows=7 loops=1)\n"}

Force the index to scan a larger number of partitions by setting the `num_partitions` parameter.  In this case it is set to the full number of partitions, which essentially forces a brute force query across all rows:

In [185]:
run_query(sqlalchemy.text(f"""
    SELECT chunk_id
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE NEAREST(embedding_vector) TO (string_to_vector('{str(question_embedding)}'), 'num_neighbors=7, num_partitions=90')
"""))

[{'chunk_id': 'fannie_part_0_c326'},
 {'chunk_id': 'fannie_part_0_c352'},
 {'chunk_id': 'fannie_part_0_c353'},
 {'chunk_id': 'fannie_part_0_c92'},
 {'chunk_id': 'freddie_part_4_c472'},
 {'chunk_id': 'freddie_part_4_c509'},
 {'chunk_id': 'freddie_part_4_c510'}]

Use `EXPLAIN ANALYZE` to evaluate the execution of the query set to scan all partitions of the index:

In [186]:
run_query(sqlalchemy.text(f"""
EXPLAIN ANALYZE
    SELECT chunk_id
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE NEAREST(embedding_vector) TO (string_to_vector('{str(question_embedding)}'), 'num_neighbors=7, num_partitions=90')
"""))

{'EXPLAIN': "-> Filter: ((retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c92') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c472') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c326') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c353') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c510') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c509') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c352'))  (cost=7.57 rows=7) (actual time=0.0514..0.0909 rows=7 loops=1)\n    -> Index range scan on retrieval_cloudsql_mysql using PRIMARY over (chunk_id = 'fannie_part_0_c326') OR (chunk_id = 'fannie_part_0_c352') OR (5 more)  (cost=7.57 rows=7) (actual time=0.0481..0.0821 rows=7 loops=1)\n"}

And compare to a brute force query with pre-filtering:

In [187]:
run_query(sqlalchemy.text(f"""
    SELECT
        chunk_id, 
        dot_product(embedding_vector, string_to_vector('{str(question_embedding)}')) AS dot_product
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE gse = 'fannie'
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_0_c352', 'dot_product': -0.7099842034905741},
 {'chunk_id': 'fannie_part_0_c353', 'dot_product': -0.6723706805978971},
 {'chunk_id': 'fannie_part_0_c326', 'dot_product': -0.6683496294372162},
 {'chunk_id': 'fannie_part_0_c92', 'dot_product': -0.6614337365039091},
 {'chunk_id': 'fannie_part_0_c240', 'dot_product': -0.6608578612144242}]

Uncomment the following cell to delete the index.  Leave it commented out to continue: the next section expects the index to be present and replaces it with a different index type:

In [188]:
#run_query(sqlalchemy.text(f"call mysql.drop_vector_index('{CLOUDSQL_DATABASE_NAME}.embedding_index');"))

#### Index: `BRUTE_FORCE`

Since the distance measure functions already perform brute force scans, it might seem odd to create an index with brute force as its type.  It can be beneficial because it lets you preset the distance measure and the default number of nearest neighbors, so at query time you can simply use the `NEAREST ... TO` predicate.  It can also be faster because indexes are loaded into memory, which speeds up even brute force queries.

Change the index with the `mysql.alter_vector_index()` procedure, which overwrites the previous index:

In [223]:
run_query(sqlalchemy.text(f"""
    call mysql.alter_vector_index(
        '{CLOUDSQL_DATABASE_NAME}.embedding_index',
        'index_type = brute_force, distance_measure = dot_product, num_neighbors = 5'
    );
"""))

{'Alter Index Status': 'Success: Done'}

Review the index details:

In [224]:
run_query(sqlalchemy.text(f"SELECT * FROM information_schema.innodb_vector_indexes"))

{'INDEX_NAME': 'applied_genai.embedding_index',
 'TABLE_NAME': 'applied_genai.retrieval_cloudsql_mysql',
 'INDEX_TYPE': 'BRUTE_FORCE',
 'DIMENSION': 768,
 'DIST_MEASURE': 'DotProductDistance',
 'STATUS': 'Ready',
 'STATE': 'INDEX_READY_TO_USE',
 'PARTITIONS': 0,
 'SEARCH_PARTITIONS': 0,
 'INITIAL_SIZE': 9040,
 'CURRENT_SIZE': 9040,
 'QUERIES': 0,
 'MUTATIONS': 0,
 'INDEX_MEMORY': 27770880,
 'DATASET_MEMORY': 0}

In [225]:
run_query(sqlalchemy.text(f"SELECT * FROM mysql.vector_indexes"))

{'index_name': 'applied_genai.embedding_index',
 'table_name': 'applied_genai.retrieval_cloudsql_mysql',
 'column_name': 'embedding_vector',
 'index_options': 'index_type = brute_force, distance_measure = dot_product, num_neighbors = 5',
 'status': 'ACTIVE',
 'create_time': datetime.datetime(2024, 11, 8, 17, 21, 25),
 'update_time': datetime.datetime(2024, 11, 8, 17, 21, 34)}

Use a distance measure directly, like brute force above:

In [226]:
run_query(sqlalchemy.text(f"""
    SELECT
        chunk_id, 
        dot_product(embedding_vector, string_to_vector('{str(question_embedding)}')) AS dot_product
    FROM `{CLOUDSQL_TABLE_NAME}`
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_0_c352', 'dot_product': -0.7099842034905741},
 {'chunk_id': 'freddie_part_4_c509', 'dot_product': -0.6805260858670844},
 {'chunk_id': 'freddie_part_4_c510', 'dot_product': -0.675329698947003},
 {'chunk_id': 'fannie_part_0_c353', 'dot_product': -0.6723706805978971},
 {'chunk_id': 'fannie_part_0_c326', 'dot_product': -0.6683496294372162}]

Use `EXPLAIN ANALYZE` to understand the query execution.  Note that the index was not used:

In [227]:
run_query(sqlalchemy.text(f"""
EXPLAIN ANALYZE
    SELECT
        chunk_id, 
        dot_product(embedding_vector, string_to_vector('{str(question_embedding)}')) AS dot_product
    FROM `{CLOUDSQL_TABLE_NAME}`
    ORDER BY dot_product
    LIMIT 5
"""))

{'EXPLAIN': '-> Limit: 5 row(s)  (cost=21938 rows=5) (actual time=1764..1764 rows=5 loops=1)\n    -> Sort: dot_product, limit input to 5 row(s) per chunk  (cost=21938 rows=5625) (actual time=1764..1764 rows=5 loops=1)\n        -> Table scan on retrieval_cloudsql_mysql  (cost=21938 rows=5625) (actual time=0.0739..17.7 rows=9040 loops=1)\n'}

Now use the `NEAREST ... TO` predicate to invoke the use of the index:

In [228]:
run_query(sqlalchemy.text(f"""
    SELECT chunk_id
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE NEAREST(embedding_vector) TO (string_to_vector('{str(question_embedding)}'))
"""))

[{'chunk_id': 'fannie_part_0_c326'},
 {'chunk_id': 'fannie_part_0_c352'},
 {'chunk_id': 'fannie_part_0_c353'},
 {'chunk_id': 'freddie_part_4_c509'},
 {'chunk_id': 'freddie_part_4_c510'}]

Use `EXPLAIN ANALYZE` to understand the query execution.  Note that the index was used, which makes the execution much more efficient than above:

In [229]:
run_query(sqlalchemy.text(f"""
EXPLAIN ANALYZE
    SELECT chunk_id
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE NEAREST(embedding_vector) TO (string_to_vector('{str(question_embedding)}'))
"""))

{'EXPLAIN': "-> Filter: ((retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c326') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c353') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c510') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c509') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c352'))  (cost=5.41 rows=5) (actual time=0.0356..0.0576 rows=5 loops=1)\n    -> Index range scan on retrieval_cloudsql_mysql using PRIMARY over (chunk_id = 'fannie_part_0_c326') OR (chunk_id = 'fannie_part_0_c352') OR (3 more)  (cost=5.41 rows=5) (actual time=0.0332..0.0519 rows=5 loops=1)\n"}
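
To compare the two plans numerically, the following sketch pulls the `actual time` range out of the top node of each `EXPLAIN ANALYZE` result. The plan strings are shortened copies of the two outputs above, and the helper `root_actual_time_ms` is a hypothetical name. MySQL reports these times in milliseconds. One caveat: the indexed plan shows only the primary-key lookups for the neighbors, so its reported time may not include the vector search itself.

```python
import re

# Matches "actual time=<first row>..<last row>" in EXPLAIN ANALYZE output.
ACTUAL_TIME = re.compile(r"actual time=(\d+(?:\.\d+)?)\.\.(\d+(?:\.\d+)?)")


def root_actual_time_ms(explain_text):
    """Return (time to first row, time to last row) for the top plan node."""
    match = ACTUAL_TIME.search(explain_text)
    if match is None:
        raise ValueError("no 'actual time' entry found")
    return float(match.group(1)), float(match.group(2))


# Top lines of the two plans shown above (shortened copies).
scan_plan = "-> Limit: 5 row(s)  (cost=21938 rows=5) (actual time=1764..1764 rows=5 loops=1)"
indexed_plan = "-> Filter: (...)  (cost=5.41 rows=5) (actual time=0.0356..0.0576 rows=5 loops=1)"

scan_ms = root_actual_time_ms(scan_plan)[1]
indexed_ms = root_actual_time_ms(indexed_plan)[1]
print(f"table scan: {scan_ms} ms, indexed: {indexed_ms} ms")
```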

Use the `num_neighbors` parameter to override the default value of 5 set during index creation:

In [230]:
run_query(sqlalchemy.text(f"""
    SELECT chunk_id
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE NEAREST(embedding_vector) TO (string_to_vector('{str(question_embedding)}'), 'num_neighbors=7')
"""))

[{'chunk_id': 'fannie_part_0_c326'},
 {'chunk_id': 'fannie_part_0_c352'},
 {'chunk_id': 'fannie_part_0_c353'},
 {'chunk_id': 'fannie_part_0_c92'},
 {'chunk_id': 'freddie_part_4_c472'},
 {'chunk_id': 'freddie_part_4_c509'},
 {'chunk_id': 'freddie_part_4_c510'}]

Add a filter, `gse = 'fannie'`, to the query and note that fewer neighbors are returned than requested.  This indicates the filter is applied as a post-filter: the 7 nearest neighbors are found first, and the filter is then applied to those 7:

In [231]:
run_query(sqlalchemy.text(f"""
    SELECT chunk_id
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE NEAREST(embedding_vector) TO (string_to_vector('{str(question_embedding)}'), 'num_neighbors=7')
    AND gse = 'fannie'
"""))

[{'chunk_id': 'fannie_part_0_c326'},
 {'chunk_id': 'fannie_part_0_c352'},
 {'chunk_id': 'fannie_part_0_c353'},
 {'chunk_id': 'fannie_part_0_c92'}]

`EXPLAIN ANALYZE` shows that the index was still used, even though the filter was applied after the neighbor search rather than before it:

In [232]:
run_query(sqlalchemy.text(f"""
EXPLAIN ANALYZE
    SELECT chunk_id
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE NEAREST(embedding_vector) TO (string_to_vector('{str(question_embedding)}'), 'num_neighbors=7')
    AND gse = 'fannie'
"""))

{'EXPLAIN': "-> Filter: ((retrieval_cloudsql_mysql.gse = 'fannie') and ((retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c92') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c472') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c326') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c353') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c510') or (retrieval_cloudsql_mysql.chunk_id = 'freddie_part_4_c509') or (retrieval_cloudsql_mysql.chunk_id = 'fannie_part_0_c352')))  (cost=7.57 rows=0.7) (actual time=0.0366..0.0665 rows=4 loops=1)\n    -> Index range scan on retrieval_cloudsql_mysql using PRIMARY over (chunk_id = 'fannie_part_0_c326') OR (chunk_id = 'fannie_part_0_c352') OR (5 more)  (cost=7.57 rows=7) (actual time=0.0337..0.0597 rows=7 loops=1)\n"}

And compare to a brute force query with pre-filtering:

In [233]:
run_query(sqlalchemy.text(f"""
    SELECT
        chunk_id, 
        dot_product(embedding_vector, string_to_vector('{str(question_embedding)}')) AS dot_product
    FROM `{CLOUDSQL_TABLE_NAME}`
    WHERE gse = 'fannie'
    ORDER BY dot_product
    LIMIT 5
"""))

[{'chunk_id': 'fannie_part_0_c352', 'dot_product': -0.7099842034905741},
 {'chunk_id': 'fannie_part_0_c353', 'dot_product': -0.6723706805978971},
 {'chunk_id': 'fannie_part_0_c326', 'dot_product': -0.6683496294372162},
 {'chunk_id': 'fannie_part_0_c92', 'dot_product': -0.6614337365039091},
 {'chunk_id': 'fannie_part_0_c240', 'dot_product': -0.6608578612144242}]
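
The difference between the two filtering orders seen above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how Cloud SQL implements it: `nearest`, `post_filter`, `pre_filter`, and the sample rows are all hypothetical, and the distances are made-up values.

```python
def nearest(rows, k):
    """Return the k rows with the smallest distance."""
    return sorted(rows, key=lambda row: row["distance"])[:k]


def post_filter(rows, k, predicate):
    """Find k neighbors first, then drop those failing the predicate."""
    return [row for row in nearest(rows, k) if predicate(row)]


def pre_filter(rows, k, predicate):
    """Drop rows failing the predicate first, then find k neighbors."""
    return nearest([row for row in rows if predicate(row)], k)


# Made-up rows for illustration only.
rows = [
    {"chunk_id": "f1", "gse": "fannie", "distance": -0.71},
    {"chunk_id": "r1", "gse": "freddie", "distance": -0.68},
    {"chunk_id": "r2", "gse": "freddie", "distance": -0.67},
    {"chunk_id": "f2", "gse": "fannie", "distance": -0.66},
    {"chunk_id": "f3", "gse": "fannie", "distance": -0.65},
    {"chunk_id": "f4", "gse": "fannie", "distance": -0.60},
]


def is_fannie(row):
    return row["gse"] == "fannie"


post_ids = [row["chunk_id"] for row in post_filter(rows, 3, is_fannie)]
pre_ids = [row["chunk_id"] for row in pre_filter(rows, 3, is_fannie)]
print("post-filter:", post_ids)
print("pre-filter: ", pre_ids)
```

Post-filtering can return fewer than `k` rows, while pre-filtering returns `k` rows whenever enough rows pass the filter. One possible workaround when post-filtering is the only option is to request more neighbors than needed and trim afterwards.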

Uncomment the following cell to delete the index.  The next section expects the index to be present and instead rebuilds/replaces it with a different index:

In [234]:
#run_query(sqlalchemy.text(f"call mysql.drop_vector_index('{CLOUDSQL_DATABASE_NAME}.embedding_index');"))

---
## Retrieval Augmented Generation (RAG)

Build a simple retrieval augmented generation process that enhances a query by retrieving context.  This is done here by constructing one function for each of the three stages:
- `retrieve` - a function that uses an embedding to search for matching pieces of text to use as context
    - this uses the system built earlier in this workflow!
- `augment` - prepare chunks into a prompt
- `generate` - make the LLM request with the augmented prompt

A final function executes the full RAG workflow:
- `rag` - a function that receives the query and orchestrates the workflow through `retrieve` > `augment` > `generate`

### Clients

In [33]:
embedder = vertexai.language_models.TextEmbeddingModel.from_pretrained('text-embedding-004')
llm = vertexai.generative_models.GenerativeModel("gemini-1.5-flash-002")

### Retrieve Function

In [34]:
def retrieve_cloudsql_mysql(query_embedding, n_matches = 5):
    
    matches = run_query(
        sqlalchemy.text(
            f"""
                SELECT chunk_id, content
                FROM `{CLOUDSQL_TABLE_NAME}`
                WHERE NEAREST(embedding_vector) TO (string_to_vector('{str(query_embedding)}'), 'num_neighbors={n_matches}')
            """)
    )
    
    return matches
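
The retrieve function above interpolates the embedding directly into the SQL text. That is reasonable when the values come straight from the embedding model, but a small guard is cheap. The sketch below is a hypothetical helper, `to_vector_literal`, that checks every value is a finite number before producing the same bracketed string that `str(query_embedding)` would. Whether `NEAREST ... TO` accepts bound parameters is not shown in this workflow, so the sketch keeps the string approach rather than assuming it does.

```python
import math


def to_vector_literal(values):
    """Format numbers as '[v1, v2, ...]' after validating each one.

    Raises ValueError for empty input, non-numeric values, NaN, or infinity,
    so nothing other than plain numbers can reach the SQL string.
    """
    floats = [float(value) for value in values]  # float() rejects non-numeric text
    if not floats:
        raise ValueError("embedding must not be empty")
    if not all(math.isfinite(value) for value in floats):
        raise ValueError("embedding values must be finite")
    return "[" + ", ".join(repr(value) for value in floats) + "]"


literal = to_vector_literal([0.1, -0.25, 3])
print(literal)
```

The result could replace `str(query_embedding)` inside the f-string of the retrieve function, and `int(n_matches)` would give a similar check for the neighbor count.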

### Augment Function

In [35]:
def augment(matches):

    prompt = ''
    for m, match in enumerate(matches):
        prompt += f"Context {m+1}:\n{match['content']}\n\n"
    prompt += 'Answer the following question using the provided contexts:\n'

    return prompt

### Generate Function

In [36]:
def generate(prompt):

    result = llm.generate_content(prompt)

    return result

### RAG Function

In [249]:
def rag(query):
    
    query_embedding = embedder.get_embeddings([query])[0].values
    matches = retrieve_cloudsql_mysql(query_embedding)
    prompt = augment(matches) + query
    result = generate(prompt)
    
    return result.text
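
To see how the stages fit together without calling any cloud service, here is a dry-run sketch. It copies the prompt layout of `augment` and wires it to stand-in functions. The names `rag_dry_run`, `fake_embed`, `fake_retrieve`, and `echo_generate` are hypothetical and exist only for this illustration; the stand-in "model" simply returns the prompt it receives so the final prompt can be inspected.

```python
def augment(matches):
    # Same prompt layout as the augment function in this workflow.
    prompt = ''
    for m, match in enumerate(matches):
        prompt += f"Context {m+1}:\n{match['content']}\n\n"
    prompt += 'Answer the following question using the provided contexts:\n'
    return prompt


def rag_dry_run(query, embed, retrieve, generate):
    """Hypothetical variant of rag() that takes each stage as an argument."""
    query_embedding = embed(query)
    matches = retrieve(query_embedding)
    prompt = augment(matches) + query
    return generate(prompt)


def fake_embed(query):
    # Stand-in for the embedding model: a fixed toy vector.
    return [0.0, 1.0]


def fake_retrieve(query_embedding):
    # Stand-in for the database lookup: two made-up chunks.
    return [
        {"chunk_id": "c1", "content": "First chunk."},
        {"chunk_id": "c2", "content": "Second chunk."},
    ]


def echo_generate(prompt):
    # Stand-in for the LLM: return the prompt unchanged.
    return prompt


final_prompt = rag_dry_run("What is in the chunks?", fake_embed, fake_retrieve, echo_generate)
print(final_prompt)
```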

### Example In Use

In [250]:
question

'Does a lender have to perform servicing functions directly?'

In [251]:
print(rag(question))

No, a lender does not have to perform servicing functions directly.  Context 2 explicitly states that a lender may use other organizations (subservicers) to perform some or all of its servicing functions.  The use of a subservicer is permissible as long as it doesn't interfere with the lender's ability to meet Fannie Mae's requirements (Context 3).  The contexts also discuss master servicers and servicing agents, further illustrating that servicing can be outsourced.



---
### Profiling Performance

Profile the timing of each step in the RAG function across sequential calls. The environment chosen for this workflow is a minimal testing environment, so load testing (simultaneous requests) would not be helpful.

In [37]:
profile = []

In [38]:
def rag(query, profile = profile):
    
    timings = {}
    start_time = time.time()
    
    
    # 1. Get embeddings
    embedding_start = time.time()
    query_embedding = embedder.get_embeddings([query])[0].values
    timings['embedding'] = time.time() - embedding_start

    # 2. Retrieve from Cloud SQL for MySQL
    retrieval_start = time.time()
    matches = retrieve_cloudsql_mysql(query_embedding)
    timings['retrieve_cloudsql_mysql'] = time.time() - retrieval_start

    # 3. Augment the prompt
    augment_start = time.time()
    prompt = augment(matches) + query
    timings['augment'] = time.time() - augment_start

    # 4. Generate text
    generate_start = time.time()
    result = generate(prompt)
    timings['generate'] = time.time() - generate_start

    total_time = time.time() - start_time
    timings['total'] = total_time
    
    profile.append(timings)
    
    return result.text
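
The start/stop pattern in the profiled function repeats for every stage. One possible way to reduce that repetition is a small context manager, sketched below. The name `timed` is hypothetical, and the sketch uses `time.perf_counter`, which is intended for measuring short durations, in place of `time.time`. The timed blocks here are placeholders, not the real RAG stages.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(timings, key):
    """Record the elapsed seconds of the enclosed block in timings[key]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so partial timings are kept.
        timings[key] = time.perf_counter() - start


timings = {}
with timed(timings, 'total'):
    with timed(timings, 'sleep'):
        time.sleep(0.01)  # placeholder for a slow stage
    with timed(timings, 'sum'):
        total = sum(range(1000))  # placeholder for a fast stage

print(timings)
```

Each stage of the profiled function could be wrapped the same way, with `profile.append(timings)` left unchanged at the end.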

In [39]:
print(rag(question))

No.  Context 2 explicitly states that a lender may use other organizations (subservicers) to perform some or all of its servicing functions.  This is referred to as a "subservicing" arrangement.  However,  the lender remains ultimately responsible, acting as the "master servicer," and must meet Fannie Mae's requirements even when using a subservicer.



In [40]:
profile

[{'embedding': 0.13007521629333496,
  'retrieve_cloudsql_mysql': 0.3063802719116211,
  'augment': 2.4557113647460938e-05,
  'generate': 0.7158539295196533,
  'total': 1.1523406505584717}]

In [41]:
for i in range(100):
    response = rag(question)

### Report From Profile

In [42]:
all_timings = {}
for timings in profile:
    for key, value in timings.items():
        if key not in all_timings:
            all_timings[key] = []
        all_timings[key].append(value)

In [43]:
for key, values in all_timings.items():
    arr = np.array(values)
    print(f"Statistics for '{key}':")
    print(f"  Min: {np.min(arr):.4f} seconds")
    print(f"  Max: {np.max(arr):.4f} seconds")
    print(f"  Mean: {np.mean(arr):.4f} seconds")
    print(f"  Median: {np.median(arr):.4f} seconds")
    print(f"  Std Dev: {np.std(arr):.4f} seconds")
    print(f"  P95: {np.percentile(arr, 95):.4f} seconds")
    print(f"  P99: {np.percentile(arr, 99):.4f} seconds")
    print("")

Statistics for 'embedding':
  Min: 0.0469 seconds
  Max: 10.0762 seconds
  Mean: 0.2577 seconds
  Median: 0.0521 seconds
  Std Dev: 1.3952 seconds
  P95: 0.0907 seconds
  P99: 10.0652 seconds

Statistics for 'retrieve_cloudsql_mysql':
  Min: 0.0142 seconds
  Max: 0.3064 seconds
  Mean: 0.0194 seconds
  Median: 0.0159 seconds
  Std Dev: 0.0288 seconds
  P95: 0.0212 seconds
  P99: 0.0238 seconds

Statistics for 'augment':
  Min: 0.0000 seconds
  Max: 0.0001 seconds
  Mean: 0.0000 seconds
  Median: 0.0000 seconds
  Std Dev: 0.0000 seconds
  P95: 0.0000 seconds
  P99: 0.0001 seconds

Statistics for 'generate':
  Min: 0.5215 seconds
  Max: 0.9846 seconds
  Mean: 0.6994 seconds
  Median: 0.6847 seconds
  Std Dev: 0.0825 seconds
  P95: 0.8642 seconds
  P99: 0.9639 seconds

Statistics for 'total':
  Min: 0.5868 seconds
  Max: 10.8077 seconds
  Mean: 0.9765 seconds
  Median: 0.7594 seconds
  Std Dev: 1.3983 seconds
  P95: 1.0279 seconds
  P99: 10.7732 seconds
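
In the report above, the maximum for `retrieve_cloudsql_mysql` (0.3064 seconds) matches the very first call recorded in `profile`, so a single cold call can dominate the maximum and pull up the mean. The sketch below shows one way to compare statistics with and without the first call. The helper `summarize` is hypothetical, and only the first value in the sample list comes from the output above; the rest are illustrative values chosen within the reported range.

```python
import statistics


def summarize(values):
    """Return a few summary statistics for a list of timings in seconds."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


# First value is the first profiled call; the others are illustrative only.
retrieve_times = [0.3064, 0.0159, 0.0142, 0.0212, 0.0160]

all_calls = summarize(retrieve_times)
warm_calls = summarize(retrieve_times[1:])  # skip the first (cold) call

print("all calls: ", all_calls)
print("warm calls:", warm_calls)
```

The median barely moves when the cold call is dropped, which is one reason the median and percentiles are often more informative than the mean for latency data.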



## Remove Resources

In [82]:
# can't drop the database of an active connection, switch connection to mysql (default) database

#sync_pool.dispose()
#sync_connector.close()
#sync_connector = google.cloud.sql.connector.Connector()
#sync_pool = get_sync_pool(sync_connector, 'mysql')

#query = sqlalchemy.text(f"DROP DATABASE IF EXISTS `{CLOUDSQL_DATABASE_NAME}`")
#run_query(query)

In [107]:
#user_delete = !gcloud sql users delete $CLOUDSQL_USER --instance=$CLOUDSQL_INSTANCE_NAME --quiet

In [255]:
#instance_delete = !gcloud sql instances delete $CLOUDSQL_INSTANCE_NAME --quiet --format=json