![tracker](https://us-central1-vertex-ai-mlops-369716.cloudfunctions.net/pixel-tracking?path=statmike%2Fvertex-ai-mlops%2FApplied+GenAI%2FRetrieval&file=Retrieval+-+Bigtable.ipynb)
<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Retrieval/Retrieval%20-%20Bigtable.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FApplied%2520GenAI%2FRetrieval%2FRetrieval%2520-%2520Bigtable.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Retrieval/Retrieval%20-%20Bigtable.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Applied%20GenAI/Retrieval/Retrieval%20-%20Bigtable.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# Retrieval - Bigtable

In prior workflows, a series of documents was [processed into chunks](../Chunking/readme.md), and for each chunk, [embeddings](../Embeddings/readme.md) were created:

- Process: [Large Document Processing - Document AI Layout Parser](../Chunking/Large%20Document%20Processing%20-%20Document%20AI%20Layout%20Parser.ipynb)
- Embed: [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb)

Retrieving chunks for a query involves calculating the embedding for the query and then using similarity metrics to find relevant chunks. A thorough review of similarity matching can be found in [The Math of Similarity](../Embeddings/The%20Math%20of%20Similarity.ipynb) - use dot product! As development moves from experiment to application, the process of storing and computing similarity is migrated to a [retrieval](./readme.md) system. This workflow is part of a [series of workflows exploring many retrieval systems](./readme.md).

---

**Bigtable For Storage And Search**

[Google Cloud Bigtable](https://cloud.google.com/bigtable) is a fully managed, scalable NoSQL wide-column database service. It's the same technology that powers many core Google services, including Search, Maps, and Gmail! Bigtable is designed for high throughput and low latency, making it suitable for large-scale analytical and operational workloads.

- **Key Features:**

    - **Scalability:** Bigtable can handle massive amounts of data and traffic, scaling seamlessly to meet your needs.
    - **Performance:** It provides low latency and high throughput, making it ideal for applications that require fast data access.
    - **Wide-Column Data Model:** Bigtable's flexible data model allows you to store and query data with a large number of columns, making it suitable for sparse and semi-structured data.
    - **Integration:** Bigtable integrates with other Google Cloud services, such as BigQuery and Dataflow, for data analytics and processing.

- **Built-in Nearest Neighbor Search:**

    - Bigtable now offers [built-in support for nearest neighbor search](https://cloud.google.com/bigtable/docs/find-k-nearest-neighbors#perform-knn-search) with vector embeddings.
    - You [write embeddings to Bigtable](https://cloud.google.com/bigtable/docs/find-k-nearest-neighbors#write-embeddings) as [byte objects](https://cloud.google.com/bigtable/docs/find-k-nearest-neighbors#define-conversion-functions).
    - Using the [new SQL interface in Bigtable](https://cloud.google.com/bigtable/docs/introduction-sql), you can perform efficient nearest neighbor searches using embeddings.
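The byte conversion mentioned above can be sketched as follows, assuming the embedding is a list of Python floats stored as big-endian 32-bit floats (4 bytes per value), which is the same layout used by the `struct.pack` calls later in this workflow. The helper names `embedding_to_bytes` and `bytes_to_embedding` are hypothetical and are not part of the Bigtable client.

```python
import struct

def embedding_to_bytes(values):
    """Pack a list of floats as big-endian 32-bit floats (4 bytes per value)."""
    return struct.pack(f'>{len(values)}f', *values)

def bytes_to_embedding(raw):
    """Unpack big-endian 32-bit floats back into a list of Python floats."""
    return list(struct.unpack(f'>{len(raw) // 4}f', raw))

example = [0.5, -0.25, 0.125]  # values that are exactly representable in float32
raw = embedding_to_bytes(example)
restored = bytes_to_embedding(raw)
print(len(raw), restored)  # 12 [0.5, -0.25, 0.125]
```

Values that are not exactly representable in 32 bits (for example `0.1`) come back slightly rounded, so a round trip should be compared with a tolerance rather than for exact equality.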

---

**Use Case Data**

Buying a home usually involves borrowing money from a lending institution, typically through a mortgage secured by the home's value. But how do these institutions manage the risks associated with such large loans, and how are lending standards established?

In the United States, two government-sponsored enterprises (GSEs) play a vital role in the housing market:

- Federal National Mortgage Association ([Fannie Mae](https://www.fanniemae.com/))
- Federal Home Loan Mortgage Corporation ([Freddie Mac](https://www.freddiemac.com/))

These GSEs purchase mortgages from lenders, enabling those lenders to offer more loans. This process also allows Fannie Mae and Freddie Mac to set standards for mortgages, ensuring they are responsible and borrowers are more likely to repay them. This system makes homeownership more affordable and stabilizes the housing market by maintaining a steady flow of liquidity for lenders and keeping interest rates controlled.

However, navigating the complexities of these GSEs and their extensive servicing guides can be challenging.

**Approaches**

[This series](../readme.md) covers many generative AI workflows. These documents are used directly as long context for Gemini in the workflow [Long Context Retrieval With The Vertex AI Gemini API](../Generate/Long%20Context%20Retrieval%20With%20The%20Vertex%20AI%20Gemini%20API.ipynb). The workflow below uses a [retrieval](./readme.md) approach with the already generated chunks and embeddings.

---
## Colab Setup

When running this notebook in [Colab](https://colab.google/) or [Colab Enterprise](https://cloud.google.com/colab/docs/introduction), this section will authenticate to GCP (follow prompts in the popup) and set the current project for the session.

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs and API Enablement

The client packages may need to be installed in this environment.

### Installs (If Needed)

In [3]:
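# tuples of (import name, install name, min_version)
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform', '1.69.0'),
    ('google.cloud.bigtable', 'google-cloud-bigtable'),
]

import importlib.util, importlib.metadata
# assumes the `packaging` library is available (it ships with most notebook environments)
from packaging.version import Version
install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # look up the installed version by distribution (install) name and
        # compare as versions, not strings ('1.9.0' < '1.69.0' is False as text)
        if Version(importlib.metadata.version(package[1])) < Version(package[2]):
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user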

### API Enablement

In [28]:
!gcloud services enable aiplatform.googleapis.com
!gcloud services enable bigtable.googleapis.com
!gcloud services enable bigtableadmin.googleapis.com

### Restart Kernel (If Installs Occurred)

After a kernel restart, execution can resume with the cell that follows this one.

In [5]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
    IPython.display.display(IPython.display.Markdown("""<div class=\"alert alert-block alert-warning\">
        <b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. The previous cells do not need to be run again⚠️</b>
        </div>"""))

---
## Setup

Inputs

In [6]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [99]:
REGION = 'us-central1'
SERIES = 'applied-genai'
EXPERIMENT = 'retrieval-bigtable'

# Bigtable names
BIGTABLE_INSTANCE_NAME = PROJECT_ID
BIGTABLE_TABLE_NAME = SERIES
BIGTABLE_COLUMN_FAMILY_NAME = EXPERIMENT

Packages

In [285]:
import os, json, time, glob, datetime, struct, tempfile
import shutil # used when copying the documents folder below

import numpy as np

# Vertex AI
from google.cloud import aiplatform
import vertexai.language_models # for embeddings API
import vertexai.generative_models # for Gemini Models

# bigtable
from google.cloud import bigtable
import google.cloud.bigtable.data
import google.cloud.bigtable.row_filters

In [9]:
aiplatform.__version__

'1.69.0'

Clients

In [25]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)

# bigtable client
bigtable_client = bigtable.client.Client(project = PROJECT_ID, admin = True)

---
## Text & Embeddings For Examples

This repository contains a [section for document processing (chunking)](../Chunking/readme.md) that includes an example of processing multiple large PDFs (over 1000 pages) into chunks: [Large Document Processing - Document AI Layout Parser](../Chunking/Large%20Document%20Processing%20-%20Document%20AI%20Layout%20Parser.ipynb). The chunks of text from that workflow are stored with this repository and loaded by another companion workflow that augments the chunks with text embeddings: [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb).

The following code loads the version of the chunks that includes text embeddings and prepares it for a local example of retrieval augmented generation.

### Get The Documents

If you are working from a clone of this notebook's [repository](https://github.com/statmike/vertex-ai-mlops), then the documents are already present. The following cell checks for the documents folder and, if it is missing, retrieves it (`git clone`):

In [11]:
local_dir = '../Embeddings/files/embeddings-api'

In [12]:
if not os.path.exists(local_dir):
    print('Retrieving documents...')
    parent_dir = os.path.dirname(local_dir)
    temp_dir = os.path.join(parent_dir, 'temp')
    if not os.path.exists(temp_dir):
        os.makedirs(temp_dir)
    !git clone https://www.github.com/statmike/vertex-ai-mlops {temp_dir}/vertex-ai-mlops
    shutil.copytree(f'{temp_dir}/vertex-ai-mlops/Applied GenAI/Embeddings/files/embeddings-api', local_dir)
    shutil.rmtree(temp_dir)
    print(f'Documents are now in folder `{local_dir}`')
else:
    print(f'Documents Found in folder `{local_dir}`')             

Documents Found in folder `../Embeddings/files/embeddings-api`


### Load The Chunks

In [13]:
jsonl_files = glob.glob(f"{local_dir}/large-files*.jsonl")
jsonl_files.sort()
jsonl_files

['../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0000.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0001.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0002.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0003.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0004.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0005.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0006.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0007.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0008.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0009.jsonl']

In [14]:
chunks = []
for file in jsonl_files:
    with open(file, 'r') as f:
        chunks.extend([json.loads(line) for line in f])
len(chunks)

9040

### Review A Chunk

In [15]:
chunks[0].keys()

dict_keys(['instance', 'predictions', 'status'])

In [16]:
chunks[0]['instance']['chunk_id']

'fannie_part_0_c17'

In [17]:
print(chunks[0]['instance']['content'])

# Selling Guide Fannie Mae Single Family

## Fannie Mae Copyright Notice

### Fannie Mae Copyright Notice

|-|
| Section B3-4.2, Verification of Depository Assets 402 |
| B3-4.2-01, Verification of Deposits and Assets (05/04/2022) 403 |
| B3-4.2-02, Depository Accounts (12/14/2022) 405 |
| B3-4.2-03, Individual Development Accounts (02/06/2019) 408 |
| B3-4.2-04, Pooled Savings (Community Savings Funds) (04/01/2009) 411 |
| B3-4.2-05, Foreign Assets (05/04/2022) 411 |
| Section B3-4.3, Verification of Non-Depository Assets 412 |
| B3-4.3-01, Stocks, Stock Options, Bonds, and Mutual Funds (06/30/2015) 412 |
| B3-4.3-02, Trust Accounts (04/01/2009) 413 |
| B3-4.3-03, Retirement Accounts (06/30/2015) 414 |
| B3-4.3-04, Personal Gifts (09/06/2023) 415 |
| B3-4.3-05, Gifts of Equity (10/07/2020) 418 |
| B3-4.3-06, Grants and Lender Contributions (12/14/2022) 419 |
| B3-4.3-07, Disaster Relief Grants or Loans (04/01/2009) 423 |
| B3-4.3-08, Employer Assistance (09/29/2015) 423 |
| B3-4.3-09,

In [18]:
chunks[0]['predictions'][0]['embeddings']['values'][0:10]

[0.031277116388082504,
 0.03056905046105385,
 0.010865348391234875,
 0.0623614676296711,
 0.03228681534528732,
 0.05066155269742012,
 0.046544693410396576,
 0.05509665608406067,
 -0.014074751175940037,
 0.008380400016903877]

### Prepare Chunk Structure

Make a list of dictionaries with information for each chunk:

In [19]:
content_chunks = [
    dict(
        gse = chunk['instance']['gse'],
        chunk_id = chunk['instance']['chunk_id'],
        content = chunk['instance']['content'],
        embedding = chunk['predictions'][0]['embeddings']['values']
    ) for chunk in chunks
]

### Query Embedding

Create a query, or prompt, and get the embedding for it:

Connect to models for text embeddings. Learn more about the model API:
- [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb)

In [20]:
question = "Does a lender have to perform servicing functions directly?"

In [21]:
embedder = vertexai.language_models.TextEmbeddingModel.from_pretrained('text-embedding-004')

In [22]:
question_embedding = embedder.get_embeddings([question])[0].values
question_embedding[0:10]

[-0.0005117303808219731,
 0.009651427157223225,
 0.01768726110458374,
 0.014538003131747246,
 -0.01829824410378933,
 0.027877431362867355,
 -0.021124685183167458,
 0.008830446749925613,
 -0.02669006586074829,
 0.06414774805307388]

---
## Setup Bigtable



---
## Retrieval With Bigtable

**Bigtable Data Structure**

- Instance (cluster of nodes, min of 1)
    - Table
        - Row
            - Column Family
                - Column
                    - Cell

**Understanding Costs**

- Compute
- Storage (including backups)
- Network
- Replication (multiple clusters in different regions)

### Create/Retrieve An Instance

https://cloud.google.com/python/docs/reference/bigtable/latest/instance

https://cloud.google.com/bigtable/docs/instances-clusters-nodes

https://cloud.google.com/bigtable/docs/creating-instance

https://cloud.google.com/bigtable/docs/locations#north_america

In [65]:
bigtable_instance = bigtable_client.instance(BIGTABLE_INSTANCE_NAME)

if bigtable_instance.exists():
    print(f"Found the instance: {bigtable_instance.name}")
else:
    bigtable_cluster = bigtable_instance.cluster(
        BIGTABLE_INSTANCE_NAME,
        location_id = REGION+'-a', # bigtable is a zonal resource 
        serve_nodes = 1,
        default_storage_type = bigtable.enums.StorageType.HDD,  # Choice of HDD or SSD
    )
    print(f"Creating the instance ...")
    create_instance = bigtable_instance.create(clusters = [bigtable_cluster])
    while not create_instance.done():
        print('Waiting for instance creation...')
        time.sleep(10)
    bigtable_instance.reload()
    print(f"Created Bigtable instance: {bigtable_instance.name}")

Creating the instance ...
Waiting for instance creation...
Created Bigtable instance: projects/statmike-mlops-349915/instances/statmike-mlops-349915


In [81]:
# list_clusters() returns a tuple: (clusters, failed_locations)
bigtable_clusters, failed_locations = bigtable_instance.list_clusters()
for cluster in bigtable_clusters:
    print('Name: ', cluster.name)
    print('Nodes: ', cluster.serve_nodes)
    print('Storage: ', cluster.default_storage_type)

Name:  projects/statmike-mlops-349915/instances/statmike-mlops-349915/clusters/statmike-mlops-349915
Nodes:  1
Storage:  StorageType.HDD


### Create/Retrieve Table

https://cloud.google.com/bigtable/docs/managing-tables

https://cloud.google.com/python/docs/reference/bigtable/latest/table

In [102]:
bigtable_table = bigtable_instance.table(BIGTABLE_TABLE_NAME)

if bigtable_table.exists():
    print(f"Found the table: {bigtable_table.name}")
else:
    print(f"Creating the table...")
    bigtable_table.create()
    print(f"Created Bigtable table: {bigtable_table.name}")

Creating the table...
Created Bigtable table: projects/statmike-mlops-349915/instances/statmike-mlops-349915/tables/applied-genai


### Create/Retrieve Column Family

https://cloud.google.com/bigtable/docs/managing-tables#add-column-families

In [123]:
bigtable_column_family = bigtable_table.column_family(BIGTABLE_COLUMN_FAMILY_NAME)

if bigtable_column_family.name in [c.name for c in list(bigtable_table.list_column_families().values())]:
    print(f"Found the column family: {bigtable_column_family.name}")
else:
    print(f"Creating the column family...")
    bigtable_column_family.create()
    print(f"Created column family: {bigtable_column_family.name}")

Found the column family: projects/statmike-mlops-349915/instances/statmike-mlops-349915/tables/applied-genai/columnFamilies/retrieval-bigtable


In [131]:
bigtable_column_family.column_family_id

'retrieval-bigtable'

### Add/Retrieve/Delete Rows In The Table

A row has a structure of:
- row key
    - column families
        - column qualifiers
            - cell value
            
            
There are three ways to write data to a table:
- direct = add, overwrite, or delete cells
- conditional = use a filter to check whether the row is a match and then execute the change/mutation
- append = modify the existing value of a cell as part of the change/mutation, for instance add to a string or increment a counter

Every write attaches a timestamp to the cell: either assigned by the Bigtable server or supplied in the call that sets the value. A column in a row can hold multiple values with different timestamps. Think of this as a z-axis, or time axis, for the table, where each column of each row has its own history. Each point in that history (row key : column family : column : timestamp) is referred to as a cell, so reading a column can return a list of timestamped cells.
- more on time-series data schema options: https://cloud.google.com/bigtable/docs/schema-design-time-series#new-columns
- You can also set a TTL or max number of versions as part of garbage collection: https://cloud.google.com/bigtable/docs/garbage-collection
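The versioning behavior described above can be illustrated with a toy in-memory model. This is only a sketch for intuition: the `VersionedRow` class is hypothetical, runs entirely in local memory, and is not the Bigtable client, although its `set_cell` method is named after the client method used in this workflow. Integer timestamps are used for readability.

```python
from collections import defaultdict

class VersionedRow:
    """Toy in-memory model of one row: each column keeps timestamped cells."""

    def __init__(self):
        # (column family, column qualifier) -> list of (timestamp, value)
        self.cells = defaultdict(list)

    def set_cell(self, family, qualifier, value, timestamp):
        """Add a new version instead of overwriting the previous one."""
        versions = self.cells[(family, qualifier)]
        versions.append((timestamp, value))
        versions.sort(key=lambda cell: cell[0], reverse=True)  # newest first

    def read(self, max_versions=None):
        """Return up to max_versions of the newest cells for every column."""
        return {column: versions[:max_versions] for column, versions in self.cells.items()}

row = VersionedRow()
row.set_cell('retrieval-bigtable', 'gse', b'fannie', timestamp=1)
row.set_cell('retrieval-bigtable', 'gse', b'fannie', timestamp=2)  # same value, new version

latest = row.read(max_versions=1)   # only the newest cell per column
history = row.read(max_versions=2)  # the two newest cells per column
print(latest)
print(history)
```

Writing the same value twice produces two cells, which is why limiting a read to the newest cell per column matters when only the current value is wanted.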

https://cloud.google.com/python/docs/reference/bigtable/latest/data-api

https://cloud.google.com/bigtable/docs/samples-python-hello#write_rows_to_a_table

In [238]:
first_record = content_chunks[0]

In [239]:
first_record.keys()

dict_keys(['gse', 'chunk_id', 'content', 'embedding'])

In [240]:
first_record['chunk_id']

'fannie_part_0_c17'

In [241]:
row = bigtable_table.row(first_record['chunk_id'])

In [242]:
for k, v in first_record.items():
    if k == 'embedding':
        v = struct.pack(f'>{len(v)}f', *v)
    row.set_cell(bigtable_column_family.column_family_id, k, v)

In [243]:
row.commit()



row filters:

https://cloud.google.com/python/docs/reference/bigtable/latest/row-filters

In [244]:
row_filter = bigtable.row_filters.CellsColumnLimitFilter(1) # most recent cell in each column only

In [245]:
read_row = bigtable_table.read_row(first_record['chunk_id'], row_filter)

In [246]:
read_row.cells[bigtable_column_family.column_family_id]['gse'.encode()][0].value.decode()

'fannie'

In [247]:
for column, cells in read_row.cells[bigtable_column_family.column_family_id].items():
    for cell in cells:
        c = column.decode()
        if c == 'embedding':
            v = list(struct.unpack(f'>{len(cell.value)//4}f', cell.value))
        else:
            v = cell.value.decode()
        print(f"{c}: {v}")

chunk_id: fannie_part_0_c17
content: # Selling Guide Fannie Mae Single Family

## Fannie Mae Copyright Notice

### Fannie Mae Copyright Notice

|-|
| Section B3-4.2, Verification of Depository Assets 402 |
| B3-4.2-01, Verification of Deposits and Assets (05/04/2022) 403 |
| B3-4.2-02, Depository Accounts (12/14/2022) 405 |
| B3-4.2-03, Individual Development Accounts (02/06/2019) 408 |
| B3-4.2-04, Pooled Savings (Community Savings Funds) (04/01/2009) 411 |
| B3-4.2-05, Foreign Assets (05/04/2022) 411 |
| Section B3-4.3, Verification of Non-Depository Assets 412 |
| B3-4.3-01, Stocks, Stock Options, Bonds, and Mutual Funds (06/30/2015) 412 |
| B3-4.3-02, Trust Accounts (04/01/2009) 413 |
| B3-4.3-03, Retirement Accounts (06/30/2015) 414 |
| B3-4.3-04, Personal Gifts (09/06/2023) 415 |
| B3-4.3-05, Gifts of Equity (10/07/2020) 418 |
| B3-4.3-06, Grants and Lender Contributions (12/14/2022) 419 |
| B3-4.3-07, Disaster Relief Grants or Loans (04/01/2009) 423 |
| B3-4.3-08, Employer Assis

Write the same row again, then change the filter limit to 2 to show that a column can now return multiple timestamped cells:

In [248]:
row = bigtable_table.row(first_record['chunk_id'])

In [249]:
for k, v in first_record.items():
    if k == 'embedding':
        v = struct.pack(f'>{len(v)}f', *v)
    row.set_cell(bigtable_column_family.column_family_id, k, v)

In [250]:
row.commit()




In [251]:
row_filter = bigtable.row_filters.CellsColumnLimitFilter(2) # up to the 2 most recent cells in each column

In [252]:
read_row = bigtable_table.read_row(first_record['chunk_id'], row_filter)

In [253]:
read_row.cells[bigtable_column_family.column_family_id]['gse'.encode()][0].value.decode()

'fannie'

In [254]:
for column, cells in read_row.cells[bigtable_column_family.column_family_id].items():
    for cell in cells:
        c = column.decode()
        if c == 'embedding':
            v = list(struct.unpack(f'>{len(cell.value)//4}f', cell.value))
        else:
            v = cell.value.decode()
        print(f"{c}: {v}")

chunk_id: fannie_part_0_c17
chunk_id: fannie_part_0_c17
content: # Selling Guide Fannie Mae Single Family

## Fannie Mae Copyright Notice

### Fannie Mae Copyright Notice

|-|
| Section B3-4.2, Verification of Depository Assets 402 |
| B3-4.2-01, Verification of Deposits and Assets (05/04/2022) 403 |
| B3-4.2-02, Depository Accounts (12/14/2022) 405 |
| B3-4.2-03, Individual Development Accounts (02/06/2019) 408 |
| B3-4.2-04, Pooled Savings (Community Savings Funds) (04/01/2009) 411 |
| B3-4.2-05, Foreign Assets (05/04/2022) 411 |
| Section B3-4.3, Verification of Non-Depository Assets 412 |
| B3-4.3-01, Stocks, Stock Options, Bonds, and Mutual Funds (06/30/2015) 412 |
| B3-4.3-02, Trust Accounts (04/01/2009) 413 |
| B3-4.3-03, Retirement Accounts (06/30/2015) 414 |
| B3-4.3-04, Personal Gifts (09/06/2023) 415 |
| B3-4.3-05, Gifts of Equity (10/07/2020) 418 |
| B3-4.3-06, Grants and Lender Contributions (12/14/2022) 419 |
| B3-4.3-07, Disaster Relief Grants or Loans (04/01/2009) 423 |

In [255]:
row_filter = bigtable.row_filters.CellsRowLimitFilter(2) # first 2 cells in the row, across all columns

In [256]:
read_row = bigtable_table.read_row(first_record['chunk_id'], row_filter)

In [257]:
for column, cells in read_row.cells[bigtable_column_family.column_family_id].items():
    for cell in cells:
        c = column.decode()
        if c == 'embedding':
            v = list(struct.unpack(f'>{len(cell.value)//4}f', cell.value))
        else:
            v = cell.value.decode()
        print(f"{c}: {v}")

chunk_id: fannie_part_0_c17
chunk_id: fannie_part_0_c17


Query the row with GoogleSQL:

https://cloud.google.com/bigtable/docs/googlesql-examples

In [311]:
async def execute_query(query):
    """Run a GoogleSQL query against the Bigtable instance and return all result rows."""
    async with bigtable.data.BigtableDataClientAsync(project=PROJECT_ID) as client:
        rows = []
        async for row in await client.execute_query(query, BIGTABLE_INSTANCE_NAME):
            rows.append(row)
        return rows

In [321]:
query = f"""
      SELECT `{BIGTABLE_COLUMN_FAMILY_NAME}`['gse'] as gse, _key
      FROM `{BIGTABLE_TABLE_NAME}`(with_history => TRUE)
      WHERE _key = '{first_record['chunk_id']}';
      """

results = await execute_query(query)

In [322]:
results

[QueryResultRow([('gse', [Struct([('timestamp', DatetimeWithNanoseconds(2024, 10, 30, 14, 56, 9, 479000, tzinfo=datetime.timezone.utc)), ('value', b'fannie')]), Struct([('timestamp', DatetimeWithNanoseconds(2024, 10, 30, 14, 56, 6, 726000, tzinfo=datetime.timezone.utc)), ('value', b'fannie')])]), ('_key', b'fannie_part_0_c17')])]

In [323]:
for row in results:
    print(
        row['_key'].decode(),
        [r['value'].decode() for r in row['gse']],
        [r['timestamp'].strftime("%m/%d/%y %H:%M:%S.%f") for r in row['gse']]
    )

fannie_part_0_c17 ['fannie', 'fannie'] ['10/30/24 14:56:09.479000', '10/30/24 14:56:06.726000']


In [324]:
query = f"""
      SELECT `{BIGTABLE_COLUMN_FAMILY_NAME}`['gse'] as gse, _key
      FROM `{BIGTABLE_TABLE_NAME}`
      WHERE _key = '{first_record['chunk_id']}';
      """

results = await execute_query(query)

In [325]:
results

[QueryResultRow([('gse', b'fannie'), ('_key', b'fannie_part_0_c17')])]

In [328]:
for row in results:
    print(
        row['_key'].decode(), 
        row['gse'].decode()
    )

fannie_part_0_c17 fannie


In [335]:
query = f"""
      SELECT _key, `{BIGTABLE_COLUMN_FAMILY_NAME}`
      FROM `{BIGTABLE_TABLE_NAME}`
      WHERE _key = '{first_record['chunk_id']}';
      """

results = await execute_query(query)

In [346]:
for row in results:
    print(
        row['_key'].decode(), 
        row[BIGTABLE_COLUMN_FAMILY_NAME][b'chunk_id'].decode(),
        row[BIGTABLE_COLUMN_FAMILY_NAME][b'gse'].decode(),
        row[BIGTABLE_COLUMN_FAMILY_NAME][b'content'].decode(),
    )

fannie_part_0_c17 fannie_part_0_c17 fannie # Selling Guide Fannie Mae Single Family

## Fannie Mae Copyright Notice

### Fannie Mae Copyright Notice

|-|
| Section B3-4.2, Verification of Depository Assets 402 |
| B3-4.2-01, Verification of Deposits and Assets (05/04/2022) 403 |
| B3-4.2-02, Depository Accounts (12/14/2022) 405 |
| B3-4.2-03, Individual Development Accounts (02/06/2019) 408 |
| B3-4.2-04, Pooled Savings (Community Savings Funds) (04/01/2009) 411 |
| B3-4.2-05, Foreign Assets (05/04/2022) 411 |
| Section B3-4.3, Verification of Non-Depository Assets 412 |
| B3-4.3-01, Stocks, Stock Options, Bonds, and Mutual Funds (06/30/2015) 412 |
| B3-4.3-02, Trust Accounts (04/01/2009) 413 |
| B3-4.3-03, Retirement Accounts (06/30/2015) 414 |
| B3-4.3-04, Personal Gifts (09/06/2023) 415 |
| B3-4.3-05, Gifts of Equity (10/07/2020) 418 |
| B3-4.3-06, Grants and Lender Contributions (12/14/2022) 419 |
| B3-4.3-07, Disaster Relief Grants or Loans (04/01/2009) 423 |
| B3-4.3-08, Employer

In [351]:
row = bigtable_table.row(first_record['chunk_id'])
row.delete()
row.commit()



### Load All Rows To The Table

https://cloud.google.com/bigtable/docs/samples-python-hello#write_rows_to_a_table

Note: the column family could be updated with a garbage collection policy that keeps a maximum of 1 version. With that policy, rewriting a row, as in the single row example above, would eventually leave only the most recent cell in each column.

The Bigtable clients have many features for handling mutations/changes/writes to fit the specifics of a use case. Here, over 9,000 new rows need to be written to the table. Some of the techniques that can be used:

- `table.mutate_rows()` **used here**
    - cell operations are grouped into row objects and submitted together; the operations for each row are applied atomically, but the batch as a whole is not
    - async alternative: `await table.bulk_mutate_rows()`
        - asynchronous submission of many row objects containing cell operations
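When a list of rows is too large to submit comfortably in one call, one option is to send it in fixed-size batches. The sketch below is a minimal illustration under stated assumptions: the `batched` helper is hypothetical, the batch size of 1000 is arbitrary, and a list of integers stands in for the row objects.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Stand-in for a list of row objects; plain integers are used here.
pending_rows = list(range(9040))
batches = list(batched(pending_rows, 1000))
print(len(batches), len(batches[-1]))  # 10 40
```

In this workflow, each batch could be passed to `bigtable_table.mutate_rows(batch)` and the returned statuses checked per batch.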


In [379]:
rows = []
for chunk in content_chunks:
    row = bigtable_table.row(chunk['chunk_id'])
    for k, v in chunk.items():
        if k == 'embedding':
            v = struct.pack(f'>{len(v)}f', *v)
        else:
            v = v.encode()
        row.set_cell(bigtable_column_family.column_family_id, k, v)
    rows.append(row)
results = bigtable_table.mutate_rows(rows)

In [384]:
len(results), all(results)

(9040, True)

Get a sample of keys to verify that rows were written:

In [376]:
[r.row_key.decode() for r in bigtable_table.read_rows(limit=10)]

['fannie_part_0_c1',
 'fannie_part_0_c10',
 'fannie_part_0_c100',
 'fannie_part_0_c1000',
 'fannie_part_0_c1001',
 'fannie_part_0_c1002',
 'fannie_part_0_c1003',
 'fannie_part_0_c1004',
 'fannie_part_0_c1005',
 'fannie_part_0_c1006']
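The order of the keys above reflects that Bigtable stores rows sorted lexicographically by row key, so `c10` sorts before `c2`. A small local sketch of that ordering follows. The `padded_key` helper is a hypothetical illustration of zero-padding a numeric suffix and is not something this workflow does.

```python
# Byte strings sort lexicographically, byte by byte, much like row keys.
keys = [b'fannie_part_0_c1', b'fannie_part_0_c2', b'fannie_part_0_c10', b'fannie_part_0_c100']
ordered = sorted(keys)
print([k.decode() for k in ordered])

def padded_key(prefix, number, width=5):
    """Hypothetical helper: zero-pad the number so byte order matches numeric order."""
    return f'{prefix}{number:0{width}d}'.encode()

padded = sorted(padded_key('fannie_part_0_c', n) for n in [100, 2, 10, 1])
print([k.decode() for k in padded])
```

Without padding, the sorted order is `c1`, `c10`, `c100`, `c2`; with padding, it follows the numeric order of the suffix.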

### Matching

Bigtable offers only brute-force matching: there is no vector index and no approximation method (IVF, ScaNN). For distance measures, dot product is not available; only cosine and Euclidean distance are.

In [399]:
query = f"""
      SELECT _key, `{BIGTABLE_COLUMN_FAMILY_NAME}`['content'] as content
      FROM `{BIGTABLE_TABLE_NAME}`
      ORDER BY COSINE_DISTANCE(
              TO_VECTOR32(`{BIGTABLE_COLUMN_FAMILY_NAME}`['embedding']),
              {question_embedding}
          ) ASC
      LIMIT 5;
      """

results = await execute_query(query)

In [401]:
for row in results:
    print(
        row['_key'].decode(),
        #row['content'].decode()
    )

fannie_part_0_c352
freddie_part_4_c509
freddie_part_4_c510
fannie_part_0_c353
fannie_part_0_c326
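Cosine distance is one minus cosine similarity, so smaller values mean closer matches. For vectors scaled to unit length, ranking by ascending cosine distance gives the same order as ranking by descending dot product. The sketch below checks this on a few made-up vectors; whether the stored embeddings are unit length is an assumption that would need to be verified for the embedding model in use.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity, so smaller means more similar."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Made-up vectors for illustration, all scaled to unit length.
query = normalize([1.0, 2.0, 2.0])
stored = {
    'a': normalize([2.0, 4.0, 4.1]),
    'b': normalize([-1.0, 0.5, 3.0]),
    'c': normalize([3.0, 0.0, 0.0]),
}

by_cosine = sorted(stored, key=lambda k: cosine_distance(stored[k], query))
by_dot = sorted(stored, key=lambda k: dot(stored[k], query), reverse=True)
print(by_cosine, by_dot)
```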


### Matching with Filtering

Because the only method is brute force, filtering is easily accomplished with a `WHERE` clause:

In [402]:
query = f"""
      SELECT _key, 
          `{BIGTABLE_COLUMN_FAMILY_NAME}`['content'] as content,
          COSINE_DISTANCE(
              TO_VECTOR32(`{BIGTABLE_COLUMN_FAMILY_NAME}`['embedding']),
              {question_embedding}
          ) as distance,
      FROM `{BIGTABLE_TABLE_NAME}`
      WHERE `{BIGTABLE_COLUMN_FAMILY_NAME}`['gse'] = 'fannie'
      ORDER BY COSINE_DISTANCE(
              TO_VECTOR32(`{BIGTABLE_COLUMN_FAMILY_NAME}`['embedding']),
              {question_embedding}
          ) ASC
      LIMIT 5;
      """

results = await execute_query(query)

In [403]:
for row in results:
    print(
        row['_key'].decode(),
        row['distance'],
        #row['content'].decode()
    )

fannie_part_0_c352 0.28999842323157043
fannie_part_0_c353 0.32760391792511945
fannie_part_0_c326 0.3316462732543465
fannie_part_0_c92 0.3385544159915941
fannie_part_0_c240 0.33913578264492195
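The filtered search above amounts to: scan every row, drop the rows that fail the filter, score the rest, and keep the closest few. A minimal local sketch of that logic follows, using made-up two-dimensional embeddings. The `nearest` function and its `predicate` parameter are hypothetical names for illustration and do not describe how Bigtable executes the query internally.

```python
import heapq
import math

def cosine_distance(a, b):
    """1 - cosine similarity, so smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest(rows, query, k, predicate=lambda row: True):
    """Scan every row, keep those passing predicate, return the k closest."""
    candidates = (
        (cosine_distance(row['embedding'], query), row['chunk_id'])
        for row in rows
        if predicate(row)
    )
    return heapq.nsmallest(k, candidates)

# Made-up rows for illustration.
rows = [
    {'chunk_id': 'fannie_1', 'gse': 'fannie', 'embedding': [1.0, 0.0]},
    {'chunk_id': 'freddie_1', 'gse': 'freddie', 'embedding': [0.9, 0.1]},
    {'chunk_id': 'fannie_2', 'gse': 'fannie', 'embedding': [0.0, 1.0]},
]
query = [1.0, 0.1]

everything = nearest(rows, query, k=2)
fannie_only = nearest(rows, query, k=2, predicate=lambda row: row['gse'] == 'fannie')
print([chunk_id for _, chunk_id in everything])
print([chunk_id for _, chunk_id in fannie_only])
```

Because every row is scored anyway, the filter only changes which rows are scored; it never causes a relevant row to be missed, which can happen with some approximate indexes.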


## Remove Resources

In [None]:
#bigtable_column_family.delete()

In [101]:
#bigtable_table.delete()

In [64]:
#bigtable_instance.delete()