<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Retrieval/Retrieval%20-%20Spanner.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FApplied%2520GenAI%2FRetrieval%2FRetrieval%2520-%2520Spanner.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Retrieval/Retrieval%20-%20Spanner.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Applied%20GenAI/Retrieval/Retrieval%20-%20Spanner.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# Retrieval - Spanner

In prior workflows, a series of documents was [processed into chunks](../Chunking/readme.md), and for each chunk, [embeddings](../Embeddings/readme.md) were created:

- Process: [Large Document Processing - Document AI Layout Parser](../Chunking/Large%20Document%20Processing%20-%20Document%20AI%20Layout%20Parser.ipynb)
- Embed: [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb)

Retrieving chunks for a query involves calculating the embedding for the query and then using similarity metrics to find relevant chunks. A thorough review of similarity matching can be found in [The Math of Similarity](../Embeddings/The%20Math%20of%20Similarity.ipynb) - use dot product! As development moves from experiment to application, the process of storing and computing similarity is migrated to a [retrieval](./readme.md) system. This workflow is part of a [series of workflows exploring many retrieval systems](./readme.md).
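
As a minimal sketch of the retrieval step described above (not this workflow's Spanner implementation), the following ranks a few made-up chunk embeddings against a made-up query embedding by dot product. The vectors, the chunk ids, and the `top_k` helper are illustrative assumptions. Larger dot products indicate closer matches, so the ranking sorts in descending order.

```python
# Toy example: rank chunks by dot product with a query embedding.
# All vectors and ids here are invented for illustration.

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k(query, chunks, k = 2):
    """Return the k chunk ids with the largest dot product to the query."""
    scored = [(dot(query, c['embedding']), c['chunk_id']) for c in chunks]
    scored.sort(reverse = True)  # larger dot product = more similar
    return [chunk_id for _, chunk_id in scored[:k]]

toy_chunks = [
    dict(chunk_id = 'a', embedding = [1.0, 0.0]),
    dict(chunk_id = 'b', embedding = [0.6, 0.8]),
    dict(chunk_id = 'c', embedding = [0.0, 1.0]),
]
toy_query = [0.8, 0.6]

print(top_k(toy_query, toy_chunks))
```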

---

**Spanner For Storage, Indexing, And Search**

**Spanner** ([https://cloud.google.com/spanner](https://cloud.google.com/spanner)) is Google Cloud's globally distributed, scalable database, suitable for a wide range of applications, from gaming databases to financial ledgers.

- **Effortless Scalability:** Spanner decouples compute and storage, enabling effortless scalability to accommodate growing data and traffic demands.
- **Always-On Availability:**  Spanner provides automatic maintenance with zero downtime and allows for 100% online schema changes, even with synchronous replication, ensuring continuous availability.
- **Spanner Graph:**  Leverage [Spanner Graph](https://cloud.google.com/spanner/docs/graph/overview) for knowledge graphs, social networks, GraphRAG, and more, using the ISO Graph Query Language.
- **Vertex AI Integration:** Spanner integrates with [Vertex AI](https://cloud.google.com/spanner/docs/ml) for generative AI and custom ML model inference.
- **LangChain Integration:**  Build LLM-powered applications with Spanner's integration with [LangChain](https://cloud.google.com/spanner/docs/langchain).
- **Vector Similarity Search:** Spanner offers [built-in vector similarity search](https://cloud.google.com/spanner/docs/find-k-nearest-neighbors) with indexing for efficient approximate nearest neighbor search in applications like retrieval augmented generation (RAG).

---

**Use Case Data**

Buying a home usually involves borrowing money from a lending institution, typically through a mortgage secured by the home's value. But how do these institutions manage the risks associated with such large loans, and how are lending standards established?

In the United States, two government-sponsored enterprises (GSEs) play a vital role in the housing market:

- Federal National Mortgage Association ([Fannie Mae](https://www.fanniemae.com/))
- Federal Home Loan Mortgage Corporation ([Freddie Mac](https://www.freddiemac.com/))

These GSEs purchase mortgages from lenders, enabling those lenders to offer more loans. This process also allows Fannie Mae and Freddie Mac to set standards for mortgages, ensuring they are responsible and borrowers are more likely to repay them. This system makes homeownership more affordable and stabilizes the housing market by maintaining a steady flow of liquidity for lenders and keeping interest rates controlled.

However, navigating the complexities of these GSEs and their extensive servicing guides can be challenging.

**Approaches**

[This series](../readme.md) covers many generative AI workflows. These documents are used directly as long context for Gemini in the workflow [Long Context Retrieval With The Vertex AI Gemini API](../Generate/Long%20Context%20Retrieval%20With%20The%20Vertex%20AI%20Gemini%20API.ipynb). The workflow below uses a [retrieval](./readme.md) approach with the already generated chunks and embeddings.

---
## Colab Setup

When running this notebook in [Colab](https://colab.google/) or [Colab Enterprise](https://cloud.google.com/colab/docs/introduction), this section will authenticate to GCP (follow prompts in the popup) and set the current project for the session.

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs and API Enablement

The client packages may need to be installed in this environment.

### Installs (If Needed)

In [48]:
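# tuples of (import name, install name, min_version)
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform', '1.69.0'),
    ('google.cloud.spanner', 'google-cloud-spanner'),
    ('google.cloud.sqlalchemy_spanner', 'sqlalchemy-spanner')
]

import importlib.util, importlib.metadata
from packaging.version import Version # compare versions numerically, not as strings

install = False
for package in packages:
    try:
        found = importlib.util.find_spec(package[0]) is not None
    except ModuleNotFoundError: # a parent package (e.g. google.cloud) is missing
        found = False
    if not found:
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # installed versions are looked up by distribution (install) name
        if Version(importlib.metadata.version(package[1])) < Version(package[2]):
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user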

### API Enablement

In [4]:
!gcloud services enable aiplatform.googleapis.com
!gcloud services enable spanner.googleapis.com

### Restart Kernel (If Installs Occurred)

After a kernel restart the code submission can start with the next cell after this one.

In [5]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
    IPython.display.display(IPython.display.Markdown("""<div class=\"alert alert-block alert-warning\">
        <b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. The previous cells do not need to be run again⚠️</b>
        </div>"""))

---
## Setup

Inputs

In [6]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [7]:
REGION = 'us-central1'
SERIES = 'applied-genai'
EXPERIMENT = 'retrieval-spanner'

# Spanner names
SPANNER_INSTANCE_NAME = PROJECT_ID
SPANNER_DATABASE_NAME = SERIES
SPANNER_TABLE_NAME = EXPERIMENT.replace('-', '_')

Packages

In [49]:
import os, json, time, glob, shutil, datetime, asyncio

import numpy as np

# Vertex AI
from google.cloud import aiplatform
import vertexai.language_models # for embeddings API
import vertexai.generative_models # for Gemini Models
from vertexai.resources.preview import feature_store

# spanner
from google.cloud import spanner
from google.cloud import spanner_admin_instance_v1
from google.cloud import spanner_admin_database_v1
import sqlalchemy

In [9]:
aiplatform.__version__

'1.71.0'

Clients

In [10]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)

# spanner client
spanner_client = spanner.Client(project = PROJECT_ID)
spanner_instance_client = spanner_admin_instance_v1.InstanceAdminClient()
spanner_database_client = spanner_admin_database_v1.DatabaseAdminClient()

---
## Text & Embeddings For Examples

This repository contains a [section for document processing (chunking)](../Chunking/readme.md) that includes an example of processing multiple large PDFs (over 1000 pages) into chunks: [Large Document Processing - Document AI Layout Parser](../Chunking/Large%20Document%20Processing%20-%20Document%20AI%20Layout%20Parser.ipynb).  The chunks of text from that workflow are stored with this repository and loaded by another companion workflow that augments the chunks with text embeddings: [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb).

The following code will load the version of the chunks that includes text embeddings and prepare it for a local example of retrieval augmented generation.

### Get The Documents

If you are working from a clone of this notebook's [repository](https://github.com/statmike/vertex-ai-mlops), then the documents are already present. The following cell checks for the documents folder and, if it is missing, gets it (`git clone`):

In [11]:
local_dir = '../Embeddings/files/embeddings-api'

In [12]:
if not os.path.exists(local_dir):
    print('Retrieving documents...')
    parent_dir = os.path.dirname(local_dir)
    temp_dir = os.path.join(parent_dir, 'temp')
    if not os.path.exists(temp_dir):
        os.makedirs(temp_dir)
    !git clone https://www.github.com/statmike/vertex-ai-mlops {temp_dir}/vertex-ai-mlops
    shutil.copytree(f'{temp_dir}/vertex-ai-mlops/Applied GenAI/Embeddings/files/embeddings-api', local_dir)
    shutil.rmtree(temp_dir)
    print(f'Documents are now in folder `{local_dir}`')
else:
    print(f'Documents Found in folder `{local_dir}`')             

Documents Found in folder `../Embeddings/files/embeddings-api`


### Load The Chunks

In [13]:
jsonl_files = glob.glob(f"{local_dir}/large-files*.jsonl")
jsonl_files.sort()
jsonl_files

['../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0000.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0001.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0002.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0003.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0004.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0005.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0006.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0007.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0008.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0009.jsonl']

In [14]:
chunks = []
for file in jsonl_files:
    with open(file, 'r') as f:
        chunks.extend([json.loads(line) for line in f])
len(chunks)

9040

### Review A Chunk

In [15]:
chunks[0].keys()

dict_keys(['instance', 'predictions', 'status'])

In [16]:
chunks[0]['instance']['chunk_id']

'fannie_part_0_c17'

In [17]:
print(chunks[0]['instance']['content'])

# Selling Guide Fannie Mae Single Family

## Fannie Mae Copyright Notice

### Fannie Mae Copyright Notice

|-|
| Section B3-4.2, Verification of Depository Assets 402 |
| B3-4.2-01, Verification of Deposits and Assets (05/04/2022) 403 |
| B3-4.2-02, Depository Accounts (12/14/2022) 405 |
| B3-4.2-03, Individual Development Accounts (02/06/2019) 408 |
| B3-4.2-04, Pooled Savings (Community Savings Funds) (04/01/2009) 411 |
| B3-4.2-05, Foreign Assets (05/04/2022) 411 |
| Section B3-4.3, Verification of Non-Depository Assets 412 |
| B3-4.3-01, Stocks, Stock Options, Bonds, and Mutual Funds (06/30/2015) 412 |
| B3-4.3-02, Trust Accounts (04/01/2009) 413 |
| B3-4.3-03, Retirement Accounts (06/30/2015) 414 |
| B3-4.3-04, Personal Gifts (09/06/2023) 415 |
| B3-4.3-05, Gifts of Equity (10/07/2020) 418 |
| B3-4.3-06, Grants and Lender Contributions (12/14/2022) 419 |
| B3-4.3-07, Disaster Relief Grants or Loans (04/01/2009) 423 |
| B3-4.3-08, Employer Assistance (09/29/2015) 423 |
| B3-4.3-09,

In [18]:
chunks[0]['predictions'][0]['embeddings']['values'][0:10]

[0.031277116388082504,
 0.03056905046105385,
 0.010865348391234875,
 0.0623614676296711,
 0.03228681534528732,
 0.05066155269742012,
 0.046544693410396576,
 0.05509665608406067,
 -0.014074751175940037,
 0.008380400016903877]

### Prepare Chunk Structure

Make a list of dictionaries with information for each chunk:

In [19]:
content_chunks = [
    dict(
        gse = chunk['instance']['gse'],
        chunk_id = chunk['instance']['chunk_id'],
        content = chunk['instance']['content'],
        embedding = chunk['predictions'][0]['embeddings']['values']
    ) for chunk in chunks
]
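
To make the reshaping above concrete without loading the files, here is a small sketch that applies the same flattening to one hand-written record. The record's values are invented; only the key layout (`instance`, `predictions`, `status`) mirrors what the loaded chunks show, and the `flatten` helper is a hypothetical name.

```python
import json

# One made-up JSONL line with the same nesting as the loaded chunks.
line = json.dumps({
    'instance': {'gse': 'fannie', 'chunk_id': 'example_c0', 'content': 'Example text.'},
    'predictions': [{'embeddings': {'values': [0.1, 0.2, 0.3]}}],
    'status': ''
})

def flatten(chunk):
    """Keep only the fields needed for retrieval, one flat dict per chunk."""
    return dict(
        gse = chunk['instance']['gse'],
        chunk_id = chunk['instance']['chunk_id'],
        content = chunk['instance']['content'],
        embedding = chunk['predictions'][0]['embeddings']['values'],
    )

record = flatten(json.loads(line))
print(list(record.keys()))
```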

### Query Embedding

Create a query, or prompt, and get the embedding for it:

Connect to models for text embeddings. Learn more about the model API:
- [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb)

In [20]:
question = "Does a lender have to perform servicing functions directly?"

In [21]:
embedder = vertexai.language_models.TextEmbeddingModel.from_pretrained('text-embedding-004')

In [24]:
question_embedding = embedder.get_embeddings([question])[0].values
question_embedding[0:10]

[-0.0005117303808219731,
 0.009651427157223225,
 0.01768726110458374,
 0.014538003131747246,
 -0.01829824410378933,
 0.027877431362867355,
 -0.021124685183167458,
 0.008830446749925613,
 -0.02669006586074829,
 0.06414774805307388]

---
## Setup Spanner

Spanner is Google's fully managed, globally distributed, scalable database service.  This workflow walks through creating an instance, creating a database, adding a table to the database, and loading records into the table.  [Spanner pricing](https://cloud.google.com/spanner/pricing#spanner-pricing) is based on compute, storage, and data movement (replication, egress).  There are three tiers, called [Spanner editions](https://cloud.google.com/spanner/docs/editions-overview), based on the features needed and on the number of regions and replicas needed.  This workflow uses a minimal configuration: a single node of compute, no replication, and the lowest tier that has the vector search capabilities (Enterprise edition).

The setup here is done with the [Python Client for Cloud Spanner](https://cloud.google.com/python/docs/reference/spanner/latest).  Alternatively, the [console](https://cloud.google.com/spanner/docs/create-query-database-console) and [Cloud SDK `gcloud spanner`](https://cloud.google.com/spanner/docs/getting-started/gcloud) as well as [client libraries](https://cloud.google.com/spanner/docs/getting-started/set-up) in other languages can be used.

### Create/Retrieve An Instance

The starting point for using Spanner is creating an instance.  This is where compute size (number of nodes), location(s), and replication are specified.

Documentation References:
- [Instances overview](https://cloud.google.com/spanner/docs/instances)
- [Instance Configurations (regional, dual-region, and multi-region)](https://cloud.google.com/spanner/docs/instance-configurations)
- [Python SDK for Cloud Spanner: `InstanceAdminClient`](https://cloud.google.com/python/docs/reference/spanner/latest/google.cloud.spanner_admin_instance_v1.services.instance_admin)
    - Set up as `spanner_instance_client` here

Use the client to list configurations, if needed:

In [25]:
#spanner_instance_client.list_instance_configs(parent = f"projects/{PROJECT_ID}")

Make a [configuration choice](https://cloud.google.com/spanner/docs/instance-configurations), here a single region. 

In [26]:
# regional configuration
config_id = 'regional-us-central1'

In [27]:
try:
    spanner_instance = spanner_instance_client.get_instance(
        name = f'projects/{PROJECT_ID}/instances/{SPANNER_INSTANCE_NAME}'
    )
    print(f"Found the instance: {spanner_instance.name}")
except Exception:
    print('Creating an instance ...')
    create_instance = spanner_instance_client.create_instance(
        parent = f"projects/{PROJECT_ID}",
        instance_id = SPANNER_INSTANCE_NAME,
        instance = spanner_admin_instance_v1.Instance(
            name = f'projects/{PROJECT_ID}/instances/{SPANNER_INSTANCE_NAME}',
            config = f'projects/{PROJECT_ID}/instanceConfigs/{config_id}',
            display_name = SPANNER_INSTANCE_NAME,
            node_count = 1,
            edition = 'ENTERPRISE' # minimum needed for vector indexing features
        )
    )
    spanner_instance = create_instance.result()
    spanner_instance = spanner_instance_client.get_instance(
        name = f'projects/{PROJECT_ID}/instances/{SPANNER_INSTANCE_NAME}'
    )
    print(f"Created the instance: {spanner_instance.name}")

Found the instance: projects/statmike-mlops-349915/instances/statmike-mlops-349915


In [28]:
spanner_instance

name: "projects/statmike-mlops-349915/instances/statmike-mlops-349915"
config: "projects/statmike-mlops-349915/instanceConfigs/regional-us-central1"
display_name: "statmike-mlops-349915"
node_count: 1
state: READY
processing_units: 1000
create_time {
  seconds: 1729873343
  nanos: 540788000
}
update_time {
  seconds: 1729873343
  nanos: 540788000
}
edition: ENTERPRISE

---
## Working With Spanner

Now that an instance is created it can be used to create a database, add a table, and load the records.

### Create/Retrieve A Database

[Spanner databases](https://cloud.google.com/spanner/docs/databases) are the containers for tables, views, and indexes.  There can be multiple databases on an instance.  Spanner has two available dialects for databases, chosen at database creation: GoogleSQL or PostgreSQL.  The vector storage and indexing features are specific to the GoogleSQL dialect, so it is used in this workflow.  Here a single database with the GoogleSQL dialect is created and used for this workflow.

Documentation References:
- [Database overview](https://cloud.google.com/spanner/docs/databases)
- [Choosing the Right Dialect for Your Spanner Database](https://cloud.google.com/spanner/docs/choose-googlesql-or-postgres)
- [Create and manage databases](https://cloud.google.com/spanner/docs/create-manage-databases)
- [Python SDK for Cloud Spanner `DatabaseAdminClient`](https://cloud.google.com/python/docs/reference/spanner/latest/google.cloud.spanner_admin_database_v1.services.database_admin)
    - Set up as `spanner_database_client` here

In [62]:
try:
    spanner_database = spanner_database_client.get_database(
        name = f"{spanner_instance.name}/databases/{SPANNER_DATABASE_NAME}"
    )
    print(f"Found the database: {spanner_database.name}")
except Exception:
    print('Creating a database ...')
    create_database = spanner_database_client.create_database(
        request = spanner_admin_database_v1.types.CreateDatabaseRequest(
            parent = spanner_instance.name,
            create_statement = f'CREATE DATABASE `{SPANNER_DATABASE_NAME}`',
            extra_statements = [], # you could go ahead and CREATE TABLE here   
        )
    )
    spanner_database = create_database.result()
    spanner_database = spanner_database_client.get_database(
        name = f"{spanner_instance.name}/databases/{SPANNER_DATABASE_NAME}"
    )
    print(f"Created the database: {spanner_database.name}")

Found the database: projects/statmike-mlops-349915/instances/statmike-mlops-349915/databases/applied-genai


In [63]:
spanner_database

name: "projects/statmike-mlops-349915/instances/statmike-mlops-349915/databases/applied-genai"
state: READY
create_time {
  seconds: 1729873484
  nanos: 192873000
}
version_retention_period: "1h"
earliest_version_time {
  seconds: 1731511990
  nanos: 491087000
}
encryption_info {
  encryption_type: GOOGLE_DEFAULT_ENCRYPTION
}
database_dialect: GOOGLE_STANDARD_SQL

### Connection To Databases With Client

For work inside a database, the Cloud Spanner API is invoked here using the [Python Client For Cloud Spanner](https://cloud.google.com/python/docs/reference/spanner/latest/google.cloud.spanner_v1.client), which was set up above as the `spanner_client` object.

Documentation References:
- [Data type in GoogleSQL](https://cloud.google.com/spanner/docs/reference/standard-sql/data-types)
- [Python Client For Cloud Spanner database Module](https://cloud.google.com/python/docs/reference/spanner/latest/google.cloud.spanner_v1.database)
- [Python Client For Cloud Spanner table Module](https://cloud.google.com/python/docs/reference/spanner/latest/google.cloud.spanner_v1.table.Table)


Use the client, `spanner_client`, to return the database as a Python object:

In [145]:
instance = spanner_client.instance(SPANNER_INSTANCE_NAME)
database = instance.database(SPANNER_DATABASE_NAME)

In [146]:
database.name

'projects/statmike-mlops-349915/instances/statmike-mlops-349915/databases/applied-genai'

Test a query (SELECT) with the client:

In [147]:
with database.snapshot() as snapshot:
    result = snapshot.execute_sql("SELECT 'Success' as did_it_work")
for r in result:
    print(r)

['Success']


### Connection To Databases With Client Using SQLAlchemy

The Spanner Client also integrates with various language frameworks.  Of note here, from Python, is [integration with SQLAlchemy](https://cloud.google.com/spanner/docs/use-sqlalchemy).  [SQLAlchemy](https://www.sqlalchemy.org/) is a client library that can use connections to orchestrate SQL queries.  

The method of using a connection is called an engine.  The [Spanner dialect for SQLAlchemy](https://github.com/googleapis/python-spanner-sqlalchemy/tree/main) is a Python package that adds engine support for Spanner to SQLAlchemy through the installation with `pip install sqlalchemy-spanner`.  This section shows how to use the client created and used in the previous section as an engine with SQLAlchemy.

Use the client, `spanner_client`, with SQLAlchemy to create an engine:

In [149]:
engine = sqlalchemy.create_engine(
    f"spanner+spanner:///{database.name}",
    connect_args = dict(client = spanner_client)
)
# https://github.com/googleapis/python-spanner-sqlalchemy/tree/main?tab=readme-ov-file#autocommit-mode
autocommit_engine = engine.execution_options(isolation_level = "AUTOCOMMIT")

Test a query (SELECT) with SQLAlchemy:

In [150]:
with autocommit_engine.connect() as connection:
    result = connection.execute(sqlalchemy.text("SELECT 'Success' as did_it_work"))
for r in result:
    print(r)

('Success',)


### Query Orchestrator

Use the client and SQLAlchemy engine as the basis for a function that executes queries and returns results:

In [155]:
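def run_query(query, database = database, engine = autocommit_engine):
    """Execute a query and return rows as dicts (a single dict when exactly one row is returned)."""
    rows = []
    with engine.connect() as connection:
        result = connection.execute(query)
        # read rows while the connection is still open;
        # statements without a result set simply return an empty list
        if result.returns_rows:
            keys = list(result.keys())
            rows = [dict(zip(keys, row)) for row in result]

    # return the response
    return rows[0] if len(rows) == 1 else rows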

In [156]:
run_query(sqlalchemy.text("SELECT 'Success' as did_it_work"))

{'did_it_work': 'Success'}
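
The row-to-dictionary step in `run_query` can be tried without a Spanner connection. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in database (an assumption for illustration, not part of this workflow), and the `rows_as_dicts` helper is a hypothetical name.

```python
import sqlite3

def rows_as_dicts(cursor):
    """Pair each row with the cursor's column names."""
    keys = [d[0] for d in cursor.description]
    return [dict(zip(keys, row)) for row in cursor]

connection = sqlite3.connect(':memory:')  # stand-in for a Spanner connection
cursor = connection.execute("SELECT 'Success' AS did_it_work")
rows = rows_as_dicts(cursor)
connection.close()

# same convention as run_query: a single row is returned as a dict
print(rows[0] if len(rows) == 1 else rows)
```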

In [78]:
inspector = sqlalchemy.inspect(engine)

In [79]:
inspector.has_table(SPANNER_TABLE_NAME)

True

In [99]:
# https://github.com/googleapis/python-spanner-sqlalchemy/tree/main?tab=readme-ov-file#readonly-transactions
with engine.connect().execution_options(read_only = True) as connection:
    result = connection.execute(sqlalchemy.text(f"SELECT * FROM information_schema.tables  WHERE table_name = '{SPANNER_TABLE_NAME}'"))

In [85]:
for r in result:
    print(r)

('', '', 'retrieval_spanner', None, None, 'BASE TABLE', 'COMMITTED', None, None)


In [93]:
with autocommit_engine.connect() as connection:
    result = connection.execute(sqlalchemy.text(f"SELECT * FROM information_schema.columns  WHERE table_name = '{SPANNER_TABLE_NAME}'"))

In [97]:
for r in result:
    print(r)

('', '', 'retrieval_spanner', 'gse', 1, None, None, 'YES', 'STRING(100)', 'NEVER', None, None, False, 'COMMITTED', 'NO', None, None, None, None, None)
('', '', 'retrieval_spanner', 'chunk_id', 2, None, None, 'YES', 'STRING(100)', 'NEVER', None, None, False, 'COMMITTED', 'NO', None, None, None, None, None)
('', '', 'retrieval_spanner', 'content', 3, None, None, 'YES', 'STRING(MAX)', 'NEVER', None, None, False, 'COMMITTED', 'NO', None, None, None, None, None)
('', '', 'retrieval_spanner', 'embedding', 4, None, None, 'YES', 'ARRAY<FLOAT32>(vector_length=>768)', 'NEVER', None, None, False, 'COMMITTED', 'NO', None, None, None, None, None)


### Create/Retrieve A Table

The next step is creating a table within the database.  As in the previous sections, work inside the database uses the [Python Client For Cloud Spanner](https://cloud.google.com/python/docs/reference/spanner/latest/google.cloud.spanner_v1.client) through the `spanner_client` object.


Use the client, `spanner_client`, to return the database as a Python object:

In [32]:
instance = spanner_client.instance(SPANNER_INSTANCE_NAME)
database = instance.database(SPANNER_DATABASE_NAME)

Check to see if the table already exists:

In [34]:
result = database.table(SPANNER_TABLE_NAME).exists()
result

True

Define the columns and their Spanner data types from the prepared chunks, then create the table if it does not already exist:

In [34]:
columns = content_chunks[0].keys()
columns

dict_keys(['gse', 'chunk_id', 'content', 'embedding'])

In [39]:
data_types = dict(
    chunk_id = 'STRING(100)',
    embedding = f"ARRAY<FLOAT32>(vector_length=>{len(content_chunks[0]['embedding'])})",
    gse = 'STRING(100)',
    content = 'STRING(MAX)'
)
data_types

{'chunk_id': 'STRING(100)',
 'embedding': 'ARRAY<FLOAT32>(vector_length=>768)',
 'gse': 'STRING(100)',
 'content': 'STRING(MAX)'}

In [40]:
if database.table(SPANNER_TABLE_NAME).exists():
    print(f'Found Table: {SPANNER_TABLE_NAME}')
else:
    print(f'Creating Table: {SPANNER_TABLE_NAME}')
    
    create_table = database.update_ddl([
f"""CREATE TABLE `{SPANNER_TABLE_NAME}` (
{', '.join([f'{c} {data_types[c]}' for c in columns])}
) PRIMARY KEY (chunk_id)"""
    ])
    create_table.result()

Creating Table: retrieval_spanner
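
To see the DDL text that the f-string above produces, the sketch below assembles it from literal values without contacting Spanner. The column names and types are copied from the outputs in this section; the variable names prefixed with `example_` are hypothetical, and the line breaks between columns are added only for readability.

```python
example_columns = ['gse', 'chunk_id', 'content', 'embedding']
example_types = dict(
    chunk_id = 'STRING(100)',
    embedding = 'ARRAY<FLOAT32>(vector_length=>768)',
    gse = 'STRING(100)',
    content = 'STRING(MAX)',
)
example_table = 'retrieval_spanner'

# one "name TYPE" entry per column, in the order of the columns list
column_sql = ',\n    '.join(f'{c} {example_types[c]}' for c in example_columns)
ddl = f'CREATE TABLE `{example_table}` (\n    {column_sql}\n) PRIMARY KEY (chunk_id)'
print(ddl)
```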


### Add Records To The Table

In [41]:
async def insert_data(database, input_data):
    """Inserts data rows into the Spanner table in one batch, committed when the `with` block exits."""
    with database.batch() as batch:
        for record in input_data:
            batch.insert(
                table=SPANNER_TABLE_NAME,
                columns=tuple(record.keys()),
                values=[tuple(record.values())]
            )

In [43]:
await insert_data(database, content_chunks)
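
Spanner enforces per-commit limits on the number of mutations and on commit size, so larger loads are typically split across several commits. The sketch below shows one way to slice records into fixed-size batches. The `batched` helper, the example records, and the batch size of 500 are illustrative assumptions, not values from this workflow.

```python
def batched(records, batch_size = 500):
    """Yield consecutive slices of `records` with at most `batch_size` items."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

example_records = [dict(chunk_id = f'example_c{i}') for i in range(1200)]
sizes = [len(b) for b in batched(example_records)]
print(sizes)

# Each slice could then be written in its own commit, for example:
# for records in batched(content_chunks):
#     with database.batch() as batch:
#         batch.insert(
#             table = SPANNER_TABLE_NAME,
#             columns = tuple(records[0].keys()),
#             values = [tuple(r.values()) for r in records]
#         )
```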

### Retrieve: An Entity With Read Method

In [44]:
entity = 'fannie_part_0_c40'

with database.snapshot() as snapshot:
    response = snapshot.read(
        table = SPANNER_TABLE_NAME,
        columns = columns, # or a subset provided as a list [column names]
        keyset = spanner.KeySet(keys=[(entity,)]) # or all rows with spanner.KeySet(all_ = True)
    )
    results = []
    for row in list(response):
        results.append(dict(zip(columns, row)))

In [45]:
len(results)

1

In [46]:
results[0].keys()

dict_keys(['gse', 'chunk_id', 'content', 'embedding'])

In [49]:
results[0]['chunk_id']

'fannie_part_0_c40'

In [50]:
results[0]['gse']

'fannie'

In [51]:
results[0]['content']

'# Selling Guide Fannie Mae Single Family\n\n## Fannie Mae Copyright Notice\n\n### Fannie Mae Copyright Notice\n\nGlossary of Defined Terms: V (03/01/2023) and E-3-22, Acronyms 1160\n| E-3-23, Acronyms and Glossary of Defined Terms: W (11/10/2019) 1160 |\n| E-3-24, Acronyms and Glossary of Defined Terms: X (04/01/2009) 1161 |\n| No applicable terms. 1161 |\n| E-3-25, Acronyms and Glossary of Defined Terms: Y (05/30/2017) 1161 |\n| E-3-26, Acronyms and Glossary of Defined Terms: Z (04/01/2009) 1162 |\n\n|-|-|\n| No applicable terms. | 1162 |\n\n'

### Retrieve: An Entity With Query Method

In [52]:
entity = 'fannie_part_0_c40'

with database.snapshot() as snapshot:
    response = snapshot.execute_sql(
        f"SELECT {', '.join(columns)} FROM `{SPANNER_TABLE_NAME}` WHERE chunk_id = @id_value",
        params = {'id_value': entity},
        param_types = {'id_value': spanner.param_types.STRING}
    )
    results = []
    for row in list(response):
        results.append(dict(zip(columns, row)))

In [53]:
len(results)

1

In [54]:
results[0].keys()

dict_keys(['gse', 'chunk_id', 'content', 'embedding'])

In [55]:
with database.snapshot() as snapshot:
    response = snapshot.execute_sql(
        f"""
        SELECT {', '.join(columns)}
        FROM `{SPANNER_TABLE_NAME}`
        WHERE chunk_id = '{entity}'
        """,
    )
    results = []
    for row in list(response):
        results.append(dict(zip(columns, row)))

In [56]:
len(results)

1

In [57]:
results[0].keys()

dict_keys(['gse', 'chunk_id', 'content', 'embedding'])

In [59]:
results[0]['chunk_id']

'fannie_part_0_c40'

In [60]:
results[0]['gse']

'fannie'

In [61]:
results[0]['content']

'# Selling Guide Fannie Mae Single Family\n\n## Fannie Mae Copyright Notice\n\n### Fannie Mae Copyright Notice\n\nGlossary of Defined Terms: V (03/01/2023) and E-3-22, Acronyms 1160\n| E-3-23, Acronyms and Glossary of Defined Terms: W (11/10/2019) 1160 |\n| E-3-24, Acronyms and Glossary of Defined Terms: X (04/01/2009) 1161 |\n| No applicable terms. 1161 |\n| E-3-25, Acronyms and Glossary of Defined Terms: Y (05/30/2017) 1161 |\n| E-3-26, Acronyms and Glossary of Defined Terms: Z (04/01/2009) 1162 |\n\n|-|-|\n| No applicable terms. | 1162 |\n\n'

---
## Vector Similarity Search, Matching

### Matches: For Query Embedding

In [62]:
with database.snapshot() as snapshot:
    response = snapshot.execute_sql(
        f"""
        SELECT {', '.join(columns)}
        FROM `{SPANNER_TABLE_NAME}`
        ORDER BY DOT_PRODUCT(embedding, ARRAY<FLOAT32>{question_embedding})
        LIMIT 5
        """,
    )
    results = []
    for row in list(response):
        results.append(dict(zip(columns, row)))

In [63]:
len(results)

5

In [65]:
[r['chunk_id'] for r in results]

['freddie_part_2_c180',
 'freddie_part_2_c250',
 'freddie_part_2_c243',
 'freddie_part_2_c157',
 'freddie_part_2_c227']
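
The query above embeds the query vector directly in the SQL text. Formatting a Python list of floats in an f-string produces bracketed, comma-separated values, which is the form of a GoogleSQL array literal after `ARRAY<FLOAT32>`. A short sketch with a made-up three-value vector (the variable names are hypothetical):

```python
example_embedding = [0.25, -0.5, 0.125]  # invented values, far shorter than a real embedding

sql_fragment = f"DOT_PRODUCT(embedding, ARRAY<FLOAT32>{example_embedding})"
print(sql_fragment)
```

For untrusted input, query parameters (as in the `@id_value` example earlier) avoid building SQL from strings.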

In [66]:
with database.snapshot() as snapshot:
    response = snapshot.execute_sql(
        f"""
        SELECT {', '.join(columns)}
        FROM `{SPANNER_TABLE_NAME}`
        WHERE gse = 'fannie'
        ORDER BY DOT_PRODUCT(embedding, ARRAY<FLOAT32>{question_embedding})
        LIMIT 5
        """,
    )
    results = []
    for row in list(response):
        results.append(dict(zip(columns, row)))

In [67]:
len(results)

5

In [68]:
[r['chunk_id'] for r in results]

['fannie_part_0_c839',
 'fannie_part_0_c987',
 'fannie_part_1_c519',
 'fannie_part_0_c753',
 'fannie_part_0_c990']

### Create Vector Index

Documentation References:
- [Create a vector index](https://cloud.google.com/spanner/docs/find-approximate-nearest-neighbors#vector-index)
- [Rebuild a vector index](https://cloud.google.com/spanner/docs/find-approximate-nearest-neighbors#rebuild)
- [Vector index DDL statements](https://cloud.google.com/spanner/docs/reference/standard-sql/data-definition-language#vector_index_statements)

In [None]:
create_index = database.update_ddl([
f"""
CREATE VECTOR INDEX embedding_index ON `{SPANNER_TABLE_NAME}`(embedding)
WHERE embedding IS NOT NULL
OPTIONS(
    distance_type = 'DOT_PRODUCT',
    tree_depth = 2,
    num_leaves = 100
)
"""
])
#create_index.result()

In [80]:
while not create_index.done():
    print('Waiting on index creation ...')
    time.sleep(60)
    
print('Index Created')

Index Created
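
The loop above waits indefinitely. A variation with an upper bound on the wait is sketched below. `wait_until_done`, its parameters, and the `FakeOperation` stand-in are hypothetical names used so the example runs without a Spanner operation; the sketch relies only on the `done()` method used above.

```python
import time

def wait_until_done(operation, poll_seconds = 60, timeout_seconds = 3600):
    """Poll operation.done() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while not operation.done():
        if time.monotonic() >= deadline:
            return False  # gave up before the operation finished
        time.sleep(poll_seconds)
    return True

class FakeOperation:
    """Stand-in that reports completion on the third poll."""
    def __init__(self):
        self.polls = 0
    def done(self):
        self.polls += 1
        return self.polls >= 3

finished = wait_until_done(FakeOperation(), poll_seconds = 0.01, timeout_seconds = 5)
print(finished)
```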


In [117]:
with database.snapshot() as snapshot:
    response = snapshot.execute_sql(
        f"""
        SELECT *
        FROM INFORMATION_SCHEMA.INDEXES
        WHERE INDEX_NAME = 'embedding_index'
            AND TABLE_NAME = '{SPANNER_TABLE_NAME}'
        """,
    )
    rows = list(response)
    # column names are available from the result set once rows have been read
    column_names = [field.name for field in response.fields]
    results = [dict(zip(column_names, row)) for row in rows]

In [118]:
results

[{'TABLE_CATALOG': '',
  'TABLE_SCHEMA': '',
  'TABLE_NAME': 'retrieval_spanner',
  'INDEX_NAME': 'embedding_index',
  'INDEX_TYPE': 'VECTOR',
  'PARENT_TABLE_NAME': '',
  'IS_UNIQUE': False,
  'IS_NULL_FILTERED': False,
  'INDEX_STATE': 'READ_WRITE',
  'FILTER': 'embedding IS NOT NULL',
  'SPANNER_IS_MANAGED': False,
  'SEARCH_PARTITION_BY': None,
  'SEARCH_ORDER_BY': None}]

### Match: With Index

In [123]:
with database.snapshot() as snapshot:
    response = snapshot.execute_sql(
        f"""
        SELECT {', '.join(columns)}
        FROM `{SPANNER_TABLE_NAME}`@{{FORCE_INDEX=embedding_index}}
        WHERE embedding IS NOT NULL
            AND gse = 'fannie'
        ORDER BY APPROX_DOT_PRODUCT(ARRAY<FLOAT32>{question_embedding}, embedding, options => JSON '{{"num_leaves_to_search": 10}}')
        LIMIT 5
        """,
    )
    results = []
    for row in list(response):
        results.append(dict(zip(columns, row)))

In [124]:
len(results)

5

In [125]:
[r['chunk_id'] for r in results]

['fannie_part_0_c839',
 'fannie_part_0_c987',
 'fannie_part_1_c519',
 'fannie_part_0_c753',
 'fannie_part_0_c990']
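
As a conceptual illustration of what `num_leaves` and `num_leaves_to_search` control, the toy below groups vectors into leaves by their best-matching centroid and then scores only the vectors in the leaves closest to the query. This is a simplified sketch, not Spanner's implementation; the centroids, vectors, and function names are invented.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_leaves(vectors, centroids):
    """Assign each vector id to the centroid with the largest dot product."""
    leaves = {i: [] for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        best = max(range(len(centroids)), key = lambda i: dot(vec, centroids[i]))
        leaves[best].append(vec_id)
    return leaves

def approx_search(query, vectors, centroids, leaves, leaves_to_search = 1, k = 2):
    """Score only vectors in the leaves whose centroids best match the query."""
    ranked = sorted(range(len(centroids)), key = lambda i: dot(query, centroids[i]), reverse = True)
    candidates = [v for i in ranked[:leaves_to_search] for v in leaves[i]]
    candidates.sort(key = lambda v: dot(query, vectors[v]), reverse = True)
    return candidates[:k]

toy_centroids = [[1.0, 0.0], [0.0, 1.0]]
toy_vectors = {
    'a': [0.9, 0.1], 'b': [0.8, 0.3],
    'c': [0.1, 0.9], 'd': [0.2, 0.7],
}
toy_leaves = build_leaves(toy_vectors, toy_centroids)

print(toy_leaves)
print(approx_search([1.0, 0.2], toy_vectors, toy_centroids, toy_leaves, leaves_to_search = 1))
```

Searching more leaves examines more candidates, which trades speed for recall.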

## Remove Resources

In [133]:
#spanner_database_client.drop_database(database = spanner_database.name)

In [134]:
#spanner_instance_client.delete_instance(name = spanner_instance.name)