<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Retrieval/Retrieval%20-%20Cloud%20SQL%20For%20MySQL.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FApplied%2520GenAI%2FRetrieval%2FRetrieval%2520-%2520Cloud%2520SQL%2520For%2520MySQL.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Retrieval/Retrieval%20-%20Cloud%20SQL%20For%20MySQL.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Applied%20GenAI/Retrieval/Retrieval%20-%20Cloud%20SQL%20For%20MySQL.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# Retrieval - Cloud SQL For MySQL
<p style="font-size: 45px;">IN PROGRESS - NOT COMPLETE</p>

In prior workflows, a series of documents was [processed into chunks](../Chunking/readme.md), and for each chunk, [embeddings](../Embeddings/readme.md) were created:

- Process: [Large Document Processing - Document AI Layout Parser](../Chunking/Large%20Document%20Processing%20-%20Document%20AI%20Layout%20Parser.ipynb)
- Embed: [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb)

Retrieving chunks for a query involves calculating the embedding for the query and then using similarity metrics to find relevant chunks. A thorough review of similarity matching can be found in [The Math of Similarity](../Embeddings/The%20Math%20of%20Similarity.ipynb) - use dot product! As development moves from experiment to application, the process of storing and computing similarity is migrated to a [retrieval](./readme.md) system. This workflow is part of a [series of workflows exploring many retrieval systems](./readme.md).

**Cloud SQL For MySQL For Storage, Indexing, And Search**



**Use Case Data**

Buying a home usually involves borrowing money from a lending institution, typically through a mortgage secured by the home's value. But how do these institutions manage the risks associated with such large loans, and how are lending standards established?

In the United States, two government-sponsored enterprises (GSEs) play a vital role in the housing market:

- Federal National Mortgage Association ([Fannie Mae](https://www.fanniemae.com/))
- Federal Home Loan Mortgage Corporation ([Freddie Mac](https://www.freddiemac.com/))

These GSEs purchase mortgages from lenders, enabling those lenders to offer more loans. This process also allows Fannie Mae and Freddie Mac to set standards for mortgages, ensuring they are responsible and borrowers are more likely to repay them. This system makes homeownership more affordable and stabilizes the housing market by maintaining a steady flow of liquidity for lenders and keeping interest rates controlled.

However, navigating the complexities of these GSEs and their extensive servicing guides can be challenging.

**Approaches**

[This series](../readme.md) covers many generative AI workflows. These documents are used directly as long context for Gemini in the workflow [Long Context Retrieval With The Vertex AI Gemini API](../Generate/Long%20Context%20Retrieval%20With%20The%20Vertex%20AI%20Gemini%20API.ipynb). The workflow below uses a [retrieval](./readme.md) approach with the already generated chunks and embeddings.

---
## Colab Setup

When running this notebook in [Colab](https://colab.google/) or [Colab Enterprise](https://cloud.google.com/colab/docs/introduction), this section will authenticate to GCP (follow prompts in the popup) and set the current project for the session.

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs and API Enablement

The client packages may need to be installed in this environment.

### Installs (If Needed)

In [3]:
# tuples of (import name, install name, min_version)
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform', '1.69.0'),
    ('google.cloud.sql', 'cloud-sql-python-connector'),
    ('sqlalchemy', 'sqlalchemy'),
    ('pymysql', 'pymysql')
]

import importlib.util, importlib.metadata

def version_key(version):
    # compare release numbers as integers: as strings, '1.9.0' sorts after '1.69.0'
    return tuple(int(part) for part in version.split('.') if part.isdigit())

install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # installed versions are looked up by distribution (install) name
        if version_key(importlib.metadata.version(package[1])) < version_key(package[2]):
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user

### API Enablement

In [4]:
!gcloud services enable aiplatform.googleapis.com
!gcloud services enable sqladmin.googleapis.com

### Restart Kernel (If Installs Occurred)

After a kernel restart the code submission can start with the next cell after this one.

In [5]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
    IPython.display.display(IPython.display.Markdown("""<div class=\"alert alert-block alert-warning\">
        <b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. The previous cells do not need to be run again⚠️</b>
        </div>"""))

---
## Setup

Inputs

In [6]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [7]:
REGION = 'us-central1'
SERIES = 'applied-genai'
EXPERIMENT = 'retrieval-cloudsql-mysql'

# Cloud SQL Names
CLOUDSQL_INSTANCE_NAME = EXPERIMENT
CLOUDSQL_DATABASE_NAME = SERIES
CLOUDSQL_TABLE_NAME = EXPERIMENT

CLOUDSQL_USER = 'test_db'
CLOUDSQL_PASS = 'test_db_pass'

Packages

In [75]:
import os, json, time, glob, datetime, copy, shutil
import concurrent.futures

import numpy as np

# Vertex AI
from google.cloud import aiplatform
import vertexai.language_models # for embeddings API
import vertexai.generative_models # for Gemini Models
from vertexai.resources.preview import feature_store

# Cloud SQL
import google.cloud.sql.connector
import sqlalchemy
import pymysql

In [9]:
aiplatform.__version__

'1.71.0'

Clients

In [10]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)

---
## Text & Embeddings For Examples

This repository contains a [section for document processing (chunking)](../Chunking/readme.md) that includes an example of processing multiple large PDFs (over 1000 pages) into chunks: [Large Document Processing - Document AI Layout Parser](../Chunking/Large%20Document%20Processing%20-%20Document%20AI%20Layout%20Parser.ipynb). The chunks of text from that workflow are stored with this repository and loaded by a companion workflow that augments the chunks with text embeddings: [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb).

The following code loads the version of the chunks that includes text embeddings and prepares it for a local example of retrieval augmented generation.

### Get The Documents

If you are working from a clone of this notebook's [repository](https://github.com/statmike/vertex-ai-mlops) then the documents are already present. The following cell checks for the documents folder and, if it is missing, retrieves it (`git clone`):

In [11]:
local_dir = '../Embeddings/files/embeddings-api'

In [12]:
if not os.path.exists(local_dir):
    print('Retrieving documents...')
    parent_dir = os.path.dirname(local_dir)
    temp_dir = os.path.join(parent_dir, 'temp')
    if not os.path.exists(temp_dir):
        os.makedirs(temp_dir)
    !git clone https://www.github.com/statmike/vertex-ai-mlops {temp_dir}/vertex-ai-mlops
    shutil.copytree(f'{temp_dir}/vertex-ai-mlops/Applied GenAI/Embeddings/files/embeddings-api', local_dir)
    shutil.rmtree(temp_dir)
    print(f'Documents are now in folder `{local_dir}`')
else:
    print(f'Documents Found in folder `{local_dir}`')             

Documents Found in folder `../Embeddings/files/embeddings-api`


### Load The Chunks

In [13]:
jsonl_files = glob.glob(f"{local_dir}/large-files*.jsonl")
jsonl_files.sort()
jsonl_files

['../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0000.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0001.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0002.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0003.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0004.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0005.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0006.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0007.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0008.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0009.jsonl']

In [14]:
chunks = []
for file in jsonl_files:
    with open(file, 'r') as f:
        chunks.extend([json.loads(line) for line in f])
len(chunks)

9040

### Review A Chunk

In [15]:
chunks[0].keys()

dict_keys(['instance', 'predictions', 'status'])

In [16]:
chunks[0]['instance']['chunk_id']

'fannie_part_0_c17'

In [17]:
print(chunks[0]['instance']['content'])

# Selling Guide Fannie Mae Single Family

## Fannie Mae Copyright Notice

### Fannie Mae Copyright Notice

|-|
| Section B3-4.2, Verification of Depository Assets 402 |
| B3-4.2-01, Verification of Deposits and Assets (05/04/2022) 403 |
| B3-4.2-02, Depository Accounts (12/14/2022) 405 |
| B3-4.2-03, Individual Development Accounts (02/06/2019) 408 |
| B3-4.2-04, Pooled Savings (Community Savings Funds) (04/01/2009) 411 |
| B3-4.2-05, Foreign Assets (05/04/2022) 411 |
| Section B3-4.3, Verification of Non-Depository Assets 412 |
| B3-4.3-01, Stocks, Stock Options, Bonds, and Mutual Funds (06/30/2015) 412 |
| B3-4.3-02, Trust Accounts (04/01/2009) 413 |
| B3-4.3-03, Retirement Accounts (06/30/2015) 414 |
| B3-4.3-04, Personal Gifts (09/06/2023) 415 |
| B3-4.3-05, Gifts of Equity (10/07/2020) 418 |
| B3-4.3-06, Grants and Lender Contributions (12/14/2022) 419 |
| B3-4.3-07, Disaster Relief Grants or Loans (04/01/2009) 423 |
| B3-4.3-08, Employer Assistance (09/29/2015) 423 |
| B3-4.3-09,

In [18]:
chunks[0]['predictions'][0]['embeddings']['values'][0:10]

[0.031277116388082504,
 0.03056905046105385,
 0.010865348391234875,
 0.0623614676296711,
 0.03228681534528732,
 0.05066155269742012,
 0.046544693410396576,
 0.05509665608406067,
 -0.014074751175940037,
 0.008380400016903877]

### Prepare Chunk Structure

Make a list of dictionaries with information for each chunk:

In [19]:
content_chunks = [
    dict(
        gse = chunk['instance']['gse'],
        chunk_id = chunk['instance']['chunk_id'],
        content = chunk['instance']['content'],
        embedding = chunk['predictions'][0]['embeddings']['values']
    ) for chunk in chunks
]

### Query Embedding

Create a query, or prompt, and get the embedding for it:

Connect to models for text embeddings. Learn more about the model API:
- [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb)

In [20]:
question = "Does a lender have to perform servicing functions directly?"

In [21]:
embedder = vertexai.language_models.TextEmbeddingModel.from_pretrained('text-embedding-004')

In [22]:
question_embedding = embedder.get_embeddings([question])[0].values
question_embedding[0:10]

[-0.0005117303808219731,
 0.009651427157223225,
 0.01768726110458374,
 0.014538003131747246,
 -0.01829824410378933,
 0.027877431362867355,
 -0.021124685183167458,
 0.008830446749925613,
 -0.02669006586074829,
 0.06414774805307388]
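
With chunk embeddings and a query embedding available, a brute-force ranking by dot product gives a local baseline to compare against any retrieval system. The sketch below is a minimal illustration with toy two-dimensional vectors; the helper names `dot` and `top_k_by_dot_product` and the toy records are assumptions for illustration, not part of this workflow:

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot_product(
    query: Sequence[float], records: List[Dict], k: int = 3
) -> List[Tuple[float, str]]:
    """Return the k (score, chunk_id) pairs with the largest dot product."""
    scored = ((dot(query, r["embedding"]), r["chunk_id"]) for r in records)
    return heapq.nlargest(k, scored)


# toy records shaped like the entries of `content_chunks` (hypothetical values)
toy_records = [
    {"chunk_id": "a", "embedding": [1.0, 0.0]},
    {"chunk_id": "b", "embedding": [0.6, 0.8]},
    {"chunk_id": "c", "embedding": [0.0, 1.0]},
]
toy_query = [0.8, 0.6]

matches = top_k_by_dot_product(toy_query, toy_records, k=2)
for score, chunk_id in matches:
    print(chunk_id, round(score, 2))
```

For the real data, `content_chunks` and `question_embedding` could be passed in place of the toy values; NumPy would vectorize the same computation.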

---
## Setup Cloud SQL For MySQL

Create an instance. While failover replicas and read-only replicas are possible, this example uses a single instance with minimal specifications. Creating an instance also involves choosing a Cloud SQL edition.

https://cloud.google.com/sql/docs/mysql/editions-intro

https://cloud.google.com/sql/pricing#mysql-pg-pricing

### Create/Retrieve Instance

https://cloud.google.com/sql/docs/mysql/create-instance

https://cloud.google.com/sql/docs/mysql/instance-settings

https://cloud.google.com/sdk/gcloud/reference/sql/instances

In [23]:
list_instances = !gcloud sql instances list --format="json(name)"
list_instances = json.loads(''.join(list_instances))

if CLOUDSQL_INSTANCE_NAME in [i['name'] for i in list_instances]:
    instance_describe = !gcloud sql instances describe $CLOUDSQL_INSTANCE_NAME --format=json
    instance_describe = json.loads(''.join(instance_describe))
    print(f"Found the instance: {instance_describe['name']}")
else:
    print('Creating an instance...')
    instance_create = !gcloud sql instances create $CLOUDSQL_INSTANCE_NAME \
        --database-version=MYSQL_8_0 \
        --tier=db-g1-small \
        --region=$REGION \
        --quiet
    instance_describe = !gcloud sql instances describe $CLOUDSQL_INSTANCE_NAME --format=json
    instance_describe = json.loads(''.join(instance_describe))
    print(f"Created the instance: {instance_describe['name']}")

Found the instance: retrieval-cloudsql-mysql


In [24]:
#instance_describe

### Create User For This Workflow

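This workflow is a demonstration of capabilities, so it creates a built-in database user with the simple password defined in the inputs above. Avoid hard-coded credentials outside of a demonstration.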

https://cloud.google.com/sql/docs/mysql/users

In [25]:
list_users = !gcloud sql users list --format="json" --instance=$CLOUDSQL_INSTANCE_NAME
list_users = json.loads(''.join(list_users))

if CLOUDSQL_USER in [i['name'] for i in list_users]:
    user_describe = !gcloud sql users describe $CLOUDSQL_USER --instance=$CLOUDSQL_INSTANCE_NAME --format=json
    user_describe = json.loads(''.join(user_describe))
    print(f"Found the user: {user_describe['name']}")
else:
    print('Creating the user...')
    user_create = !gcloud sql users create $CLOUDSQL_USER \
        --instance=$CLOUDSQL_INSTANCE_NAME \
        --password=$CLOUDSQL_PASS
    user_describe = !gcloud sql users describe $CLOUDSQL_USER --instance=$CLOUDSQL_INSTANCE_NAME --format=json
    user_describe = json.loads(''.join(user_describe))
    print(f"Created the user: {user_describe['name']}")

Found the user: test_db


In [26]:
#user_describe

### Connection To Databases

Cloud SQL For MySQL has default databases, like mysql: https://cloud.google.com/sql/docs/mysql/create-manage-databases#gcloud

There are many ways to connect to a database: https://cloud.google.com/sql/docs/mysql/connect-overview

Here we want to use Python and will use the Cloud SQL Language Connectors: https://cloud.google.com/sql/docs/mysql/language-connectors

We will create a connector: https://cloud.google.com/sql/docs/mysql/connect-connectors#python

Connections have three parts:
- a connection tool, in this case provided by: https://github.com/GoogleCloudPlatform/cloud-sql-python-connector
- a database driver that opens each connection, pymysql
- a client library that manages a connection pool and uses it to execute SQL queries, SQLAlchemy
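
The connection function passed to the pool follows a factory pattern: an outer function captures the settings and returns a zero-argument callable that opens a new connection each time it is called. A minimal sketch of that pattern, using the standard library's `sqlite3` as a stand-in for the Cloud SQL connector and driver (the name `make_creator` is hypothetical):

```python
import sqlite3
from typing import Callable


def make_creator(database: str) -> Callable[[], sqlite3.Connection]:
    """Return a zero-argument callable that opens a connection to `database`."""

    def getconn() -> sqlite3.Connection:
        # every call opens a fresh connection; a pool calls this on demand
        return sqlite3.connect(database)

    return getconn


# in-memory sqlite database, used here only as a stand-in for a real server
creator = make_creator(":memory:")
conn = creator()
row = conn.execute("SELECT 'Success' AS did_it_work").fetchone()
conn.close()
print(row[0])
```

Keeping the database name as an argument of the outer function is what allows the same code to build a pool for a different database later.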

#### Connection Tool

In [27]:
sync_connector = google.cloud.sql.connector.Connector()

#### Connection

In [28]:
def get_sync_conn(
    connector: google.cloud.sql.connector.Connector,
    db: str
):
    def getconn():
        conn = connector.connect(
            instance_describe['connectionName'],
            "pymysql",
            user = CLOUDSQL_USER,
            password = CLOUDSQL_PASS,
            db = db
        )
        return conn
    return getconn

#### Connection Pool

In [29]:
def get_sync_pool(
    connector: google.cloud.sql.connector.Connector,
    db: str
) -> sqlalchemy.engine.Engine:

    pool = sqlalchemy.create_engine(
        "mysql+pymysql://",
        creator = get_sync_conn(connector, db)
    )
    pool.dialect.description_encoding = None
    pool.execution_options(isolation_level="AUTOCOMMIT")
    return pool

In [30]:
sync_pool = get_sync_pool(sync_connector, 'mysql')

#### Query Orchestrator

Use a connection from the pool as a context manager to orchestrate the query:

In [31]:
def run_query(query, pool = None, connector = sync_connector):
    # get the current connection pool:
    if pool is None:
        pool = sync_pool

    # run the query and read the response while the connection is still open
    rows = []
    with pool.connect().execution_options(isolation_level="AUTOCOMMIT") as connection:
        result = connection.execute(query)
        # DDL and most DML statements return no rows
        if result.returns_rows:
            keys = list(result.keys())
            rows = [dict(zip(keys, row)) for row in result]

    # return a single dict for one row, otherwise a list of dicts
    return rows[0] if len(rows) == 1 else rows

### Test Query

This workflow uses a single synchronous connector. It handles both DML (SELECT, INSERT, DELETE, UPDATE) and DDL (CREATE, ALTER, DROP) statements, and both kinds are used in the sections that follow.

In [32]:
query = sqlalchemy.text("SELECT 'Success' as did_it_work")

In [33]:
run_query(query)

{'did_it_work': 'Success'}

In [34]:
with concurrent.futures.ThreadPoolExecutor(max_workers = 5) as executor:
    queries = [query]* 5
    futures = [executor.submit(run_query, query) for query in queries]
    for future in futures:
        print(future.result())

{'did_it_work': 'Success'}
{'did_it_work': 'Success'}
{'did_it_work': 'Success'}
{'did_it_work': 'Success'}
{'did_it_work': 'Success'}


---
## Working With Cloud SQL For MySQL

### Create A Database

In [35]:
query = sqlalchemy.text(f"SELECT schema_name FROM INFORMATION_SCHEMA.SCHEMATA WHERE schema_name = '{CLOUDSQL_DATABASE_NAME}'")
result = run_query(query)
result

[]

In [40]:
if not result:
    query = sqlalchemy.text(f"CREATE DATABASE `{CLOUDSQL_DATABASE_NAME}`")
    run_query(query)

In [41]:
query = sqlalchemy.text(f"SELECT schema_name FROM INFORMATION_SCHEMA.SCHEMATA WHERE schema_name = '{CLOUDSQL_DATABASE_NAME}'")
result = run_query(query)
result

{'SCHEMA_NAME': 'applied-genai'}

In [42]:
query = sqlalchemy.text(f"SELECT * FROM INFORMATION_SCHEMA.SCHEMATA WHERE schema_name = '{CLOUDSQL_DATABASE_NAME}'")
run_query(query)

{'CATALOG_NAME': 'def',
 'SCHEMA_NAME': 'applied-genai',
 'DEFAULT_CHARACTER_SET_NAME': 'utf8mb4',
 'DEFAULT_COLLATION_NAME': 'utf8mb4_0900_ai_ci',
 'SQL_PATH': None,
 'DEFAULT_ENCRYPTION': 'NO'}

### Move Connection To New Database

In [43]:
run_query(sqlalchemy.text('SELECT database()'))

{'DATABASE()': 'mysql'}

In [46]:
sync_pool.dispose()
sync_connector.close()
sync_connector = google.cloud.sql.connector.Connector()
sync_pool = get_sync_pool(sync_connector, CLOUDSQL_DATABASE_NAME)

In [47]:
run_query(sqlalchemy.text('SELECT database()'))

{'database()': 'applied-genai'}

### Create Table

In [48]:
result = run_query(sqlalchemy.text(f"SELECT * from information_schema.tables WHERE table_name = '{CLOUDSQL_TABLE_NAME}'"))
result

[]

In [49]:
run_query(sqlalchemy.text(f"DROP TABLE IF EXISTS `{CLOUDSQL_TABLE_NAME}`"))

[]

In [50]:
run_query(
    sqlalchemy.text(f"""
            CREATE TABLE IF NOT EXISTS `{CLOUDSQL_TABLE_NAME}` (
                chunk_id VARCHAR(100) NOT NULL PRIMARY KEY,
                gse VARCHAR(50),
                content TEXT,
                embedding JSON
            );
        """
    )
)

[]

In [51]:
result = run_query(sqlalchemy.text(f"SELECT * from information_schema.tables WHERE table_name = '{CLOUDSQL_TABLE_NAME}'"))
result

{'TABLE_CATALOG': 'def',
 'TABLE_SCHEMA': 'applied-genai',
 'TABLE_NAME': 'retrieval-cloudsql-mysql',
 'TABLE_TYPE': 'BASE TABLE',
 'ENGINE': 'InnoDB',
 'VERSION': 10,
 'ROW_FORMAT': 'Dynamic',
 'TABLE_ROWS': 0,
 'AVG_ROW_LENGTH': 0,
 'DATA_LENGTH': 16384,
 'MAX_DATA_LENGTH': 0,
 'INDEX_LENGTH': 0,
 'DATA_FREE': 0,
 'AUTO_INCREMENT': None,
 'CREATE_TIME': datetime.datetime(2024, 11, 5, 14, 45, 6),
 'UPDATE_TIME': None,
 'CHECK_TIME': None,
 'TABLE_COLLATION': 'utf8mb4_0900_ai_ci',
 'CHECKSUM': None,
 'CREATE_OPTIONS': '',
 'TABLE_COMMENT': ''}

In [52]:
run_query(sqlalchemy.text(f"SELECT column_name, data_type FROM information_schema.columns WHERE table_name = '{CLOUDSQL_TABLE_NAME}'"))

[{'COLUMN_NAME': 'chunk_id', 'DATA_TYPE': 'varchar'},
 {'COLUMN_NAME': 'content', 'DATA_TYPE': 'text'},
 {'COLUMN_NAME': 'embedding', 'DATA_TYPE': 'json'},
 {'COLUMN_NAME': 'gse', 'DATA_TYPE': 'varchar'}]

### Add, Retrieve, And Delete Rows

#### Get A Record

Dictionaries for each record/row are stored in `content_chunks` from earlier in this workflow:

In [53]:
first_record = content_chunks[0]

In [54]:
first_record.keys()

dict_keys(['gse', 'chunk_id', 'content', 'embedding'])

In [55]:
first_record['chunk_id']

'fannie_part_0_c17'

Convert embedding to JSON:

In [58]:
first_record['embedding'] = json.dumps(first_record['embedding'])
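
The embedding is stored in a JSON column, so it travels as text: `json.dumps` turns the list of floats into a string and `json.loads` restores it. A minimal sketch of that round trip, using three values copied from the first chunk's embedding shown earlier (the variable names are illustrative only):

```python
import json

# three values taken from the first chunk's embedding printed above
toy_embedding = [0.031277116388082504, -0.014074751175940037, 0.008380400016903877]

as_text = json.dumps(toy_embedding)   # what gets written to the JSON column
restored = json.loads(as_text)        # what is rebuilt after reading it back

print(type(as_text).__name__, restored == toy_embedding)
```

Python writes floats using their shortest round-trip representation, so the restored list compares equal to the original.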

#### Insert Row

In [59]:
table = sqlalchemy.Table(
    CLOUDSQL_TABLE_NAME,
    sqlalchemy.MetaData(),
    autoload_with = sync_pool
)

In [60]:
for c in table.columns:
    print(c)

retrieval-cloudsql-mysql.chunk_id
retrieval-cloudsql-mysql.gse
retrieval-cloudsql-mysql.content
retrieval-cloudsql-mysql.embedding


In [61]:
insert_row = sqlalchemy.insert(table).values(first_record)

In [62]:
run_query(insert_row)

[]

#### Retrieve Row

In [63]:
query = sqlalchemy.text(f"SELECT * FROM `{CLOUDSQL_TABLE_NAME}` WHERE chunk_id = '{first_record['chunk_id']}'")
result = run_query(query)

In [64]:
result.keys()

dict_keys(['chunk_id', 'gse', 'content', 'embedding'])

In [65]:
result['chunk_id']

'fannie_part_0_c17'

In [66]:
query = sqlalchemy.select(table).where(table.columns.chunk_id == first_record['chunk_id'])
result = run_query(query)

In [67]:
result.keys()

dict_keys(['chunk_id', 'gse', 'content', 'embedding'])

In [68]:
result['chunk_id']

'fannie_part_0_c17'

In [69]:
type(result['embedding'])

str

In [70]:
json.loads(result['embedding'])[0:10]

[0.031277116388082504,
 0.03056905046105385,
 0.010865348391234875,
 0.0623614676296711,
 0.03228681534528732,
 0.05066155269742012,
 0.046544693410396576,
 0.05509665608406067,
 -0.014074751175940037,
 0.008380400016903877]

#### Delete Row

In [71]:
run_query(sqlalchemy.text(f"SELECT COUNT(*) as count FROM `{CLOUDSQL_TABLE_NAME}`"))

{'count': 1}

In [72]:
run_query(sqlalchemy.text(f"DELETE FROM `{CLOUDSQL_TABLE_NAME}` WHERE chunk_id = '{first_record['chunk_id']}'"))

[]

In [73]:
run_query(sqlalchemy.text(f"SELECT COUNT(*) as count FROM `{CLOUDSQL_TABLE_NAME}`"))

{'count': 0}
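
Several of the text queries above place values into the SQL string with f-strings. That works for trusted values in a demonstration, but bind parameters are the safer habit because the value is sent separately from the statement. The sketch below shows the idea with the standard library's `sqlite3` as a stand-in database; the table name `chunks` and the `gse` value are toy assumptions. SQLAlchemy's `sqlalchemy.text` accepts the same `:name` placeholder style, with values supplied at execution time or through `.bindparams(...)`.

```python
import sqlite3

# in-memory stand-in database with a toy version of the chunk table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (chunk_id TEXT PRIMARY KEY, gse TEXT)")
conn.execute(
    "INSERT INTO chunks (chunk_id, gse) VALUES (:chunk_id, :gse)",
    {"chunk_id": "fannie_part_0_c17", "gse": "fannie"},
)

select_sql = "SELECT chunk_id FROM chunks WHERE chunk_id = :chunk_id"

# a normal lookup: the value is bound, not pasted into the SQL text
found = conn.execute(select_sql, {"chunk_id": "fannie_part_0_c17"}).fetchall()

# a value containing quotes is treated as data, so it matches nothing
tricky = conn.execute(select_sql, {"chunk_id": "x' OR '1'='1"}).fetchall()

conn.close()
print(found, tricky)
```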

## Load Data 

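There are many rows to load, so the inserts run concurrently. No async MySQL driver is used here, so threads provide the concurrency. The executor is capped at 1000 threads; each thread only sends data and waits, so there is no intense CPU work and the OS should be able to manage the high thread count. Note that SQLAlchemy's default connection pool allows at most 15 simultaneous connections (`pool_size` of 5 plus `max_overflow` of 10), so most threads spend their time waiting for a free connection.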

In [76]:
prep_content_chunks  = copy.deepcopy(content_chunks)
for chunk in prep_content_chunks:
    chunk['embedding'] = json.dumps(chunk['embedding'])

In [77]:
queries = [sqlalchemy.insert(table).values(c) for c in prep_content_chunks]

In [79]:
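with concurrent.futures.ThreadPoolExecutor(max_workers = 1000) as executor:
    futures = [executor.submit(run_query, query) for query in queries]
    # reading each result re-raises the error from any insert that failed
    for future in futures:
        future.result()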

In [80]:
run_query(sqlalchemy.text(f"SELECT COUNT(*) as count FROM `{CLOUDSQL_TABLE_NAME}`"))

{'count': 9040}
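
When many inserts run on threads, a failed insert only surfaces when its future's result is read. The sketch below is one hedged way to separate successes from failures so that failed records can be retried; `fake_insert` is a hypothetical stand-in for a database insert and fails on purpose for one record:

```python
import concurrent.futures
import time


def fake_insert(record_id: int) -> int:
    """Stand-in for an I/O-bound insert; record 3 fails on purpose."""
    time.sleep(0.01)  # simulate waiting on the network
    if record_id == 3:
        raise ValueError(f"insert failed for record {record_id}")
    return record_id


succeeded, failed = [], []
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    # map each future back to the record it was created for
    future_to_id = {executor.submit(fake_insert, i): i for i in range(10)}
    for future in concurrent.futures.as_completed(future_to_id):
        record_id = future_to_id[future]
        try:
            future.result()  # re-raises any exception from the worker thread
            succeeded.append(record_id)
        except ValueError:
            failed.append(record_id)

print(sorted(succeeded), failed)
```

The `failed` list can then be resubmitted, which avoids reloading every row after a partial failure.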

---
## Setup Cloud SQL For MySQL For Vector Similarity Search

https://cloud.google.com/sql/docs/mysql/work-with-vectors

## Remove Resources

In [82]:
# can't drop the database of an active connection, switch connection to mysql (default) database

#sync_pool.dispose()
#sync_connector.close()
#sync_connector = google.cloud.sql.connector.Connector()
#sync_pool = get_sync_pool(sync_connector, 'mysql')

#query = sqlalchemy.text(f"DROP DATABASE IF EXISTS `{CLOUDSQL_DATABASE_NAME}`")
#run_query(query)

In [107]:
#user_delete = !gcloud sql users delete $CLOUDSQL_USER --instance=$CLOUDSQL_INSTANCE_NAME --quiet

In [100]:
#instance_delete = !gcloud sql instances delete $CLOUDSQL_INSTANCE_NAME --quiet --format=json