<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Retrieval/Retrieval%20-%20Bigtable.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FApplied%2520GenAI%2FRetrieval%2FRetrieval%2520-%2520Bigtable.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Retrieval/Retrieval%20-%20Bigtable.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Applied%20GenAI/Retrieval/Retrieval%20-%20Bigtable.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# Retrieval - Bigtable

In prior workflows, a series of documents was [processed into chunks](../Chunking/readme.md), and for each chunk, [embeddings](../Embeddings/readme.md) were created:

- Process: [Large Document Processing - Document AI Layout Parser](../Chunking/Large%20Document%20Processing%20-%20Document%20AI%20Layout%20Parser.ipynb)
- Embed: [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb)

Retrieving chunks for a query involves calculating the embedding for the query and then using similarity metrics to find relevant chunks. A thorough review of similarity matching can be found in [The Math of Similarity](../Embeddings/The%20Math%20of%20Similarity.ipynb) - use dot product! As development moves from experiment to application, the process of storing and computing similarity is migrated to a [retrieval](./readme.md) system. This workflow is part of a [series of workflows exploring many retrieval systems](./readme.md).  

A detailed [comparison of many retrieval systems](./readme.md#comparison-of-vector-database-solutions) can be found in the readme as well.
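
The retrieval step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used later in this workflow: the names `dot`, `top_k`, `toy_chunks`, and `toy_query` are hypothetical, and the three-dimensional vectors stand in for real embeddings.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k(query, chunks, k=2):
    """Return the chunk_ids of the k chunks with the largest dot product."""
    scored = [(dot(query, c['embedding']), c['chunk_id']) for c in chunks]
    scored.sort(key=lambda s: s[0], reverse=True)  # highest score first
    return [chunk_id for _, chunk_id in scored[:k]]

# Toy data (assumption): tiny hand-made vectors instead of real embeddings
toy_chunks = [
    {'chunk_id': 'a', 'embedding': [1.0, 0.0, 0.0]},
    {'chunk_id': 'b', 'embedding': [0.0, 1.0, 0.0]},
    {'chunk_id': 'c', 'embedding': [0.6, 0.8, 0.0]},
]
toy_query = [0.8, 0.6, 0.0]

ranked = top_k(toy_query, toy_chunks, k=2)
print(ranked)  # ['c', 'a']
```

A retrieval system performs the same scoring close to where the embeddings are stored, so the full set of vectors does not need to be loaded into local memory.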

---

**Bigtable For Storage, And Search**

[Google Cloud Bigtable](https://cloud.google.com/bigtable) is a fully managed, scalable NoSQL wide-column database service. It's the same technology that powers many core Google services, including Search, Maps, and Gmail! Bigtable is designed for high throughput and low latency, making it suitable for large-scale analytical and operational workloads.

- **Key Features:**

    - **Scalability:** Bigtable can handle massive amounts of data and traffic, scaling seamlessly to meet your needs.
    - **Performance:** It provides low latency and high throughput, making it ideal for applications that require fast data access.
    - **Wide-Column Data Model:** Bigtable's flexible data model allows you to store and query data with a large number of columns, making it suitable for sparse and semi-structured data.
    - **Integration:** Bigtable integrates with other Google Cloud services, such as BigQuery and Dataflow, for data analytics and processing.

- **Built-in Nearest Neighbor Search:**

    - Bigtable now offers [built-in support for nearest neighbor search](https://cloud.google.com/bigtable/docs/find-k-nearest-neighbors#perform-knn-search) with vector embeddings.
    - You [write embeddings to Bigtable](https://cloud.google.com/bigtable/docs/find-k-nearest-neighbors#write-embeddings) as [byte objects](https://cloud.google.com/bigtable/docs/find-k-nearest-neighbors#define-conversion-functions).
    - Using the [new SQL interface in Bigtable](https://cloud.google.com/bigtable/docs/introduction-sql), you can perform efficient nearest neighbor searches using embeddings.

---

**Use Case Data**

Buying a home usually involves borrowing money from a lending institution, typically through a mortgage secured by the home's value. But how do these institutions manage the risks associated with such large loans, and how are lending standards established?

In the United States, two government-sponsored enterprises (GSEs) play a vital role in the housing market:

- Federal National Mortgage Association ([Fannie Mae](https://www.fanniemae.com/))
- Federal Home Loan Mortgage Corporation ([Freddie Mac](https://www.freddiemac.com/))

These GSEs purchase mortgages from lenders, enabling those lenders to offer more loans. This process also allows Fannie Mae and Freddie Mac to set standards for mortgages, ensuring they are responsible and borrowers are more likely to repay them. This system makes homeownership more affordable and stabilizes the housing market by maintaining a steady flow of liquidity for lenders and keeping interest rates controlled.

However, navigating the complexities of these GSEs and their extensive servicing guides can be challenging.

**Approaches**

[This series](../readme.md) covers many generative AI workflows. These documents are used directly as long context for Gemini in the workflow [Long Context Retrieval With The Vertex AI Gemini API](../Generate/Long%20Context%20Retrieval%20With%20The%20Vertex%20AI%20Gemini%20API.ipynb). The workflow below uses a [retrieval](./readme.md) approach with the already generated chunks and embeddings.

---
## Colab Setup

When running this notebook in [Colab](https://colab.google/) or [Colab Enterprise](https://cloud.google.com/colab/docs/introduction), this section will authenticate to GCP (follow prompts in the popup) and set the current project for the session.

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs and API Enablement

The client packages may need to be installed in this environment.

### Installs (If Needed)

In [3]:
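# tuples of (import name, install name, min_version)
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform', '1.69.0'),
    ('google.cloud.bigtable', 'google-cloud-bigtable'),
]

# import the submodules explicitly; `import importlib` alone does not guarantee them
import importlib.util, importlib.metadata

def version_tuple(version):
    # compare versions numerically so that '1.100.0' > '1.69.0'
    return tuple(int(part) for part in version.split('.')[:3] if part.isdigit())

install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # installed versions are looked up by the install (distribution) name
        if version_tuple(importlib.metadata.version(package[1])) < version_tuple(package[2]):
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user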

### API Enablement

In [4]:
!gcloud services enable aiplatform.googleapis.com
!gcloud services enable bigtable.googleapis.com
!gcloud services enable bigtableadmin.googleapis.com

### Restart Kernel (If Installs Occurred)

After a kernel restart, execution can resume from the cell that follows this one.

In [5]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
    IPython.display.display(IPython.display.Markdown("""<div class=\"alert alert-block alert-warning\">
        <b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. The previous cells do not need to be run again⚠️</b>
        </div>"""))

---
## Setup

Inputs

In [6]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [7]:
REGION = 'us-central1'
SERIES = 'applied-genai'
EXPERIMENT = 'retrieval-bigtable'

# Bigtable Names
BIGTABLE_INSTANCE_NAME = PROJECT_ID
BIGTABLE_TABLE_NAME = SERIES
BIGTABLE_COLUMN_FAMILY_NAME = EXPERIMENT

Packages

In [241]:
import os, json, time, glob, datetime, struct, asyncio, shutil

# Vertex AI
from google.cloud import aiplatform
import vertexai.language_models # for embeddings API
import vertexai.generative_models # for Gemini Models
from vertexai.resources.preview import feature_store

# bigtable
from google.cloud import bigtable
import google.cloud.bigtable.data
import google.cloud.bigtable.row_filters
#from google.cloud.bigtable import row_filters

In [242]:
aiplatform.__version__

'1.71.0'

Clients

In [10]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)

# bigtable client
bigtable_client = bigtable.client.Client(project = PROJECT_ID, admin = True)

---
## Text & Embeddings For Examples

This repository contains a [section for document processing (chunking)](../Chunking/readme.md) that includes an example of processing multiple large PDFs (over 1000 pages) into chunks: [Large Document Processing - Document AI Layout Parser](../Chunking/Large%20Document%20Processing%20-%20Document%20AI%20Layout%20Parser.ipynb).  The chunks of text from that workflow are stored with this repository and loaded by another companion workflow that augments the chunks with text embeddings: [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb).

The following code will load the version of the chunks that includes text embeddings and prepare it for a local example of retrieval augmented generation.

### Get The Documents

If you are working from a clone of this notebook's [repository](https://github.com/statmike/vertex-ai-mlops), then the documents are already present. The following cell checks for the documents folder and, if it is missing, gets it (`git clone`):

In [11]:
local_dir = '../Embeddings/files/embeddings-api'

In [12]:
if not os.path.exists(local_dir):
    print('Retrieving documents...')
    parent_dir = os.path.dirname(local_dir)
    temp_dir = os.path.join(parent_dir, 'temp')
    if not os.path.exists(temp_dir):
        os.makedirs(temp_dir)
    !git clone https://www.github.com/statmike/vertex-ai-mlops {temp_dir}/vertex-ai-mlops
    shutil.copytree(f'{temp_dir}/vertex-ai-mlops/Applied GenAI/Embeddings/files/embeddings-api', local_dir)
    shutil.rmtree(temp_dir)
    print(f'Documents are now in folder `{local_dir}`')
else:
    print(f'Documents Found in folder `{local_dir}`')             

Documents Found in folder `../Embeddings/files/embeddings-api`


### Load The Chunks

In [13]:
jsonl_files = glob.glob(f"{local_dir}/large-files*.jsonl")
jsonl_files.sort()
jsonl_files

['../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0000.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0001.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0002.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0003.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0004.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0005.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0006.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0007.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0008.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0009.jsonl']

In [14]:
chunks = []
for file in jsonl_files:
    with open(file, 'r') as f:
        chunks.extend([json.loads(line) for line in f])
len(chunks)

9040

### Review A Chunk

In [15]:
chunks[0].keys()

dict_keys(['instance', 'predictions', 'status'])

In [16]:
chunks[0]['instance']['chunk_id']

'fannie_part_0_c17'

In [17]:
print(chunks[0]['instance']['content'])

# Selling Guide Fannie Mae Single Family

## Fannie Mae Copyright Notice

### Fannie Mae Copyright Notice

|-|
| Section B3-4.2, Verification of Depository Assets 402 |
| B3-4.2-01, Verification of Deposits and Assets (05/04/2022) 403 |
| B3-4.2-02, Depository Accounts (12/14/2022) 405 |
| B3-4.2-03, Individual Development Accounts (02/06/2019) 408 |
| B3-4.2-04, Pooled Savings (Community Savings Funds) (04/01/2009) 411 |
| B3-4.2-05, Foreign Assets (05/04/2022) 411 |
| Section B3-4.3, Verification of Non-Depository Assets 412 |
| B3-4.3-01, Stocks, Stock Options, Bonds, and Mutual Funds (06/30/2015) 412 |
| B3-4.3-02, Trust Accounts (04/01/2009) 413 |
| B3-4.3-03, Retirement Accounts (06/30/2015) 414 |
| B3-4.3-04, Personal Gifts (09/06/2023) 415 |
| B3-4.3-05, Gifts of Equity (10/07/2020) 418 |
| B3-4.3-06, Grants and Lender Contributions (12/14/2022) 419 |
| B3-4.3-07, Disaster Relief Grants or Loans (04/01/2009) 423 |
| B3-4.3-08, Employer Assistance (09/29/2015) 423 |
| B3-4.3-09,

In [18]:
chunks[0]['predictions'][0]['embeddings']['values'][0:10]

[0.031277116388082504,
 0.03056905046105385,
 0.010865348391234875,
 0.0623614676296711,
 0.03228681534528732,
 0.05066155269742012,
 0.046544693410396576,
 0.05509665608406067,
 -0.014074751175940037,
 0.008380400016903877]

### Prepare Chunk Structure

Make a list of dictionaries with information for each chunk:

In [19]:
content_chunks = [
    dict(
        gse = chunk['instance']['gse'],
        chunk_id = chunk['instance']['chunk_id'],
        content = chunk['instance']['content'],
        embedding = chunk['predictions'][0]['embeddings']['values']
    ) for chunk in chunks
]

### Query Embedding

Create a query, or prompt, and get the embedding for it:

Connect to models for text embeddings. Learn more about the model API:
- [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb)

In [20]:
question = "Does a lender have to perform servicing functions directly?"

In [21]:
embedder = vertexai.language_models.TextEmbeddingModel.from_pretrained('text-embedding-004')

In [22]:
question_embedding = embedder.get_embeddings([question])[0].values
question_embedding[0:10]

[-0.0005117303808219731,
 0.009651427157223225,
 0.01768726110458374,
 0.014538003131747246,
 -0.01829824410378933,
 0.027877431362867355,
 -0.021124685183167458,
 0.008830446749925613,
 -0.02669006586074829,
 0.06414774805307388]

---
## Setup Bigtable

Bigtable is like a tabular canvas that seems infinite - billions of rows and millions of columns.  You can choose among many data schemas, with the only hard limit being a maximum of 256MB per row.  This makes it easy to store and retrieve massive data - up to petabytes.  Bigtable can still be viewed simply as rows and columns.  Each row has an indexed key - the row key.  Columns are arranged in groupings called column families.  A cell is the intersection of a row and a column, and it is the unit where data is stored.  Bigtable cells offer a lot of flexibility in how they are written, including timestamped values for a history-based schema at the cell level; appends and increments are also available, depending on data type.

**Bigtable Layout**

Tables hold data and are served by nodes - the compute resource for Bigtable.  Behind the scenes, Bigtable manages tables in fractions called tablets, which nodes read from disk based on incoming queries.  Nodes are grouped into a cluster, which is located in a single zone.  An instance is the container for one or more clusters, and since an instance can cover multiple regions (up to 8), scaling Bigtable is incredibly flexible:
- [Instance, clusters, and nodes](https://cloud.google.com/bigtable/docs/instances-clusters-nodes)
- [Understand performance](https://cloud.google.com/bigtable/docs/performance)

**Bigtable Pricing**

The [pricing of Bigtable](https://cloud.google.com/bigtable/pricing) is based on the compute (number and location of nodes), storage (type and amount of storage used over time), and network bandwidth (inter-region and inter-continental data transfer).

**Bigtable Data Structure**

It's a table of rows and columns.  Columns are grouped into a construct called a column family, and data is written to a row > column family > column location, which is a cell. Cells can hold data in many [data types](https://cloud.google.com/bigtable/docs/reference/sql/data-types).  Cells can also hold multiple timestamped versions of their data.  Changing a cell is called a mutation.  Mutations include writing a value (called setting) and deleting.  Cells can also be appended to or, in the case of numeric values, incremented.
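
The row > column family > column > versions layout can be pictured with a small in-memory model. This is only a teaching sketch: `ToyTable` and its methods are hypothetical and are not part of the Bigtable client, and the explicit integer timestamps are chosen just to make the ordering obvious.

```python
import time
from collections import defaultdict

class ToyTable:
    """In-memory stand-in for a Bigtable table (illustration only)."""

    def __init__(self):
        # row key -> column family -> column -> list of (timestamp, value)
        self.rows = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

    def set_cell(self, row_key, family, column, value, timestamp=None):
        """A 'set' mutation: add a new timestamped version to a cell."""
        ts = time.time() if timestamp is None else timestamp
        versions = self.rows[row_key][family][column]
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # newest first

    def read_cell(self, row_key, family, column, limit=1):
        """Return up to `limit` of the most recent values for a cell."""
        return [value for _, value in self.rows[row_key][family][column][:limit]]

table = ToyTable()
table.set_cell('fannie_part_0_c17', 'retrieval-bigtable', 'gse', b'fannie', timestamp=1)
table.set_cell('fannie_part_0_c17', 'retrieval-bigtable', 'gse', b'fannie-v2', timestamp=2)

latest = table.read_cell('fannie_part_0_c17', 'retrieval-bigtable', 'gse', limit=1)
history = table.read_cell('fannie_part_0_c17', 'retrieval-bigtable', 'gse', limit=2)
print(latest)   # [b'fannie-v2']
print(history)  # [b'fannie-v2', b'fannie']
```

Writing to the same cell twice adds a version rather than overwriting, which is why reads usually limit how many versions come back.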

**Working With Bigtable**

There is a wide array of ways to interact with and administer Bigtable as [listed here](https://cloud.google.com/bigtable/docs/apis).  In this workflow the [Python Client for Google Cloud Bigtable](https://cloud.google.com/python/docs/reference/bigtable/latest) is used for all tasks: creating an instance, creating a table, adding data, and retrieving data, including with vector similarity search.


### Create/Retrieve An Instance

The starting point for Bigtable is an instance.  This is where the cluster of node(s) is specified and launched.  In a single step this workflow creates:
- an instance
- with a single cluster
- in a single region
- in a single zone
- with a single node

Documentation References:
- [Python Client For Instance](https://cloud.google.com/python/docs/reference/bigtable/latest/instance)
- [Instances, clusters, and nodes](https://cloud.google.com/bigtable/docs/instances-clusters-nodes)
- [Create An Instance](https://cloud.google.com/bigtable/docs/creating-instance)
- [Bigtable locations](https://cloud.google.com/bigtable/docs/locations)

In [23]:
bigtable_instance = bigtable_client.instance(BIGTABLE_INSTANCE_NAME)

if bigtable_instance.exists():
    print(f"Found the instance: {bigtable_instance.name}")
else:
    bigtable_cluster = bigtable_instance.cluster(
        BIGTABLE_INSTANCE_NAME,
        location_id = REGION+'-a', # bigtable is a zonal resource 
        serve_nodes = 1,
        default_storage_type = bigtable.enums.StorageType.HDD,  # Choice of HDD or SSD
    )
    print(f"Creating the instance ...")
    create_instance = bigtable_instance.create(clusters = [bigtable_cluster])
    while not create_instance.done():
        print('Waiting for instance creation...')
        time.sleep(10)
    bigtable_instance.reload()
    print(f"Created Bigtable instance: {bigtable_instance.name}")

Found the instance: projects/statmike-mlops-349915/instances/statmike-mlops-349915


Review the cluster information:

In [26]:
# list_clusters() returns a tuple: (clusters, failed_locations)
bigtable_clusters, failed_locations = bigtable_instance.list_clusters()
for cluster in bigtable_clusters:
    print('Name: ', cluster.name)
    print('Nodes: ', cluster.serve_nodes)
    print('Storage: ', cluster.default_storage_type)

Name:  projects/statmike-mlops-349915/instances/statmike-mlops-349915/clusters/statmike-mlops-349915
Nodes:  1
Storage:  StorageType.HDD


---
## Working With Bigtable

Create a table, a column family, and add/retrieve/delete data.

### Create/Retrieve Table

Create a table to hold data:

Documentation References:

- [Create and manage tables](https://cloud.google.com/bigtable/docs/managing-tables)
- [Python Client For Tables](https://cloud.google.com/python/docs/reference/bigtable/latest/table)

In [28]:
bigtable_table = bigtable_instance.table(BIGTABLE_TABLE_NAME)

if bigtable_table.exists():
    print(f"Found the table: {bigtable_table.name}")
else:
    print(f"Creating the table...")
    bigtable_table.create()
    print(f"Created Bigtable table: {bigtable_table.name}")

Found the table: projects/statmike-mlops-349915/instances/statmike-mlops-349915/tables/applied-genai


In [32]:
bigtable_table.name

'projects/statmike-mlops-349915/instances/statmike-mlops-349915/tables/applied-genai'

### Create/Retrieve Column Family

Add a column family to the table to hold the data for this workflow:

- [Column Families](https://cloud.google.com/bigtable/docs/managing-tables#add-column-families)

In [29]:
bigtable_column_family = bigtable_table.column_family(BIGTABLE_COLUMN_FAMILY_NAME)

if bigtable_column_family.name in [c.name for c in list(bigtable_table.list_column_families().values())]:
    print(f"Found the column family: {bigtable_column_family.name}")
else:
    print(f"Creating the column family...")
    bigtable_column_family.create()
    print(f"Created column family: {bigtable_column_family.name}")

Creating the column family...
Created column family: projects/statmike-mlops-349915/instances/statmike-mlops-349915/tables/applied-genai/columnFamilies/retrieval-bigtable


In [30]:
bigtable_column_family.column_family_id

'retrieval-bigtable'

### Prepare Data For Bigtable

Data is written to a column family with a column name.  There are many [supported data types](https://cloud.google.com/bigtable/docs/reference/sql/data-types).  Embedding vectors require a conversion to bytes before being saved to Bigtable - [read more here](https://cloud.google.com/bigtable/docs/find-k-nearest-neighbors#define-conversion-functions).

#### Get A Record

Dictionaries for each record/row are stored in `content_chunks` from earlier in this workflow:

In [33]:
first_record = content_chunks[0].copy()

In [34]:
first_record.keys()

dict_keys(['gse', 'chunk_id', 'content', 'embedding'])

In [35]:
first_record['chunk_id']

'fannie_part_0_c17'

In [36]:
type(first_record['embedding'])

list

#### Prepare The Record

Convert the embedding array (list of floats) into the `bytes` data type for Bigtable [per these instructions](https://cloud.google.com/bigtable/docs/find-k-nearest-neighbors#define-conversion-functions):

In [183]:
def vector_as_bytes(embedding):
    return struct.pack(f'>{len(embedding)}f', *embedding)

def bytes_as_vector(embedding):
    return list(struct.unpack(f'>{len(embedding)//4}f', embedding))

In [41]:
type(first_record['embedding'])

list

In [42]:
first_record['embedding'] = vector_as_bytes(first_record['embedding'])

In [184]:
type(first_record['embedding'])

bytes

In [185]:
#convert back to list of floats:
#bytes_as_vector(first_record['embedding'])

### Add, Retrieve, And Delete Data To The Column Family

- [Python Bigtable Data API](https://cloud.google.com/python/docs/reference/bigtable/latest/data-api)
- [Write rows to a table](https://cloud.google.com/bigtable/docs/samples-python-hello#write_rows_to_a_table)

#### Insert Record

Three steps:
- create row with key
- set the value of cells in the column family
- commit the data

In [186]:
row = bigtable_table.row(first_record['chunk_id'])

In [187]:
for key, value in first_record.items():
    row.set_cell(bigtable_column_family.column_family_id, key, value)

In [188]:
row.commit()



#### Retrieve Record

Prior to reading data create a row filter to limit what Bigtable returns on the request.  Since cells can have a history of values a common row filter to use is the `CellsColumnLimitFilter(1)` which will only return the most recent value for each column.

- [Create a filter](https://cloud.google.com/bigtable/docs/samples-python-hello#creating-filter)
- [Bigtable Row Filters](https://cloud.google.com/python/docs/reference/bigtable/latest/row-filters)
- [Read a row by its row key](https://cloud.google.com/bigtable/docs/samples-python-hello#read_a_row_by_its_row_key)

Create a row filter that returns only the most recent value for each column:

In [52]:
row_filter = bigtable.row_filters.CellsColumnLimitFilter(1) # return only the most recent value for each column

Read the row:

In [64]:
read_row = bigtable_table.read_row(first_record['chunk_id'], row_filter)

In [65]:
type(read_row)

google.cloud.bigtable.row.PartialRowData

Examine the row response and filter down to the data in the column family:

In [66]:
type(read_row.cells)

collections.OrderedDict

In [67]:
type(read_row.cells[bigtable_column_family.column_family_id])

collections.OrderedDict

In [68]:
results = read_row.cells[bigtable_column_family.column_family_id]

In [69]:
results.keys()

odict_keys([b'chunk_id', b'content', b'embedding', b'gse'])

Remember that [Bigtable stores all data as raw byte strings](https://cloud.google.com/bigtable/docs/overview#data-types) for most objects, so `.encode` and `.decode` are essential for interpreting results:

In [70]:
results['gse'.encode()]

[<Cell value=b'fannie' timestamp=2024-11-17 17:38:39.661000+00:00>]

In [71]:
results['gse'.encode()][0].value.decode()

'fannie'

If reading the embedding value then recall that a function was created to convert it back to a list of floats above:

In [74]:
bytes_as_vector(results['embedding'.encode()][0].value)[0:5]

[0.031277116388082504,
 0.03056905046105385,
 0.010865348391234875,
 0.0623614676296711,
 0.03228681534528732]

Create a dictionary from the row:

In [79]:
response = {}
for key, values in results.items():
    if key == 'embedding'.encode():
        response[key.decode()] = bytes_as_vector(values[0].value)
    else:
        response[key.decode()] = values[0].value.decode()

In [80]:
response.keys()

dict_keys(['chunk_id', 'content', 'embedding', 'gse'])

#### Insert Record - Again

This creates a newer value for each cell without removing or overwriting the older value - a history.

In [81]:
row = bigtable_table.row(first_record['chunk_id'])

In [82]:
for key, value in first_record.items():
    row.set_cell(bigtable_column_family.column_family_id, key, value)

In [83]:
row.commit()



#### Retrieve Record - With History

This time the row filter is set to `CellsColumnLimitFilter(2)`, which will return the 2 most recent values for each column.

Create a row filter that returns only the 2 most recent values for each column:

In [84]:
row_filter = bigtable.row_filters.CellsColumnLimitFilter(2) # return only the 2 most recent values for each column

Read the row:

In [85]:
read_row = bigtable_table.read_row(first_record['chunk_id'], row_filter)

In [86]:
type(read_row)

google.cloud.bigtable.row.PartialRowData

Examine the row response and filter down to the data in the column family:

In [87]:
results = read_row.cells[bigtable_column_family.column_family_id]

In [88]:
results.keys()

odict_keys([b'chunk_id', b'content', b'embedding', b'gse'])

Now the columns have multiple values:

In [89]:
results['gse'.encode()]

[<Cell value=b'fannie' timestamp=2024-11-17 18:03:14.152000+00:00>,
 <Cell value=b'fannie' timestamp=2024-11-17 17:38:39.661000+00:00>]

#### Retrieve Record - Only The Most Recent Value From History

This time the row filter is set to `CellsColumnLimitFilter(1)`, which will return the most recent value for each column even though we know multiple values exist.

Create a row filter that returns only the most recent value for each column:

In [90]:
row_filter = bigtable.row_filters.CellsColumnLimitFilter(1) # return only the most recent value for each column

Read the row:

In [91]:
read_row = bigtable_table.read_row(first_record['chunk_id'], row_filter)

In [92]:
type(read_row)

google.cloud.bigtable.row.PartialRowData

Examine the row response and filter down to the data in the column family:

In [93]:
results = read_row.cells[bigtable_column_family.column_family_id]

In [94]:
results.keys()

odict_keys([b'chunk_id', b'content', b'embedding', b'gse'])

Now the column has only a single, most recent, value returned:

In [96]:
results['gse'.encode()]

[<Cell value=b'fannie' timestamp=2024-11-17 18:03:14.152000+00:00>]

#### Retrieve Record With SQL

Bigtable also supports SQL queries.

Documentation References:
- [GoogleSQL for Bigtable overview](https://cloud.google.com/bigtable/docs/googlesql-overview)
- [Python Bigtable Data Client Async](https://cloud.google.com/python/docs/reference/bigtable/latest/async_data_client)

Define a function to execute queries:

In [212]:
async def execute_query(query):
    async with bigtable.data.BigtableDataClientAsync(project = PROJECT_ID) as client:
        rows = []
        async for row in await client.execute_query(query, BIGTABLE_INSTANCE_NAME):
            record = {}
            for col in row.fields:
                if type(col[1]) == bytes:
                     record[col[0]] = col[1].decode()
                else:
                     record[col[0]] = col[1]
            rows.append(record)
        return rows

Execute a SQL query for the record in this example:

In [216]:
query = f"""
      SELECT
           `{BIGTABLE_COLUMN_FAMILY_NAME}`['chunk_id'] as chunk_id, 
           `{BIGTABLE_COLUMN_FAMILY_NAME}`['content'] as content
      FROM `{BIGTABLE_TABLE_NAME}`(with_history => FALSE)
      WHERE _key = '{first_record['chunk_id']}';
      """

results = await execute_query(query)

In [217]:
results[0]['chunk_id']

'fannie_part_0_c17'

In [218]:
results[0]['content']

'# Selling Guide Fannie Mae Single Family\n\n## Fannie Mae Copyright Notice\n\n### Fannie Mae Copyright Notice\n\n|-|\n| Section B3-4.2, Verification of Depository Assets 402 |\n| B3-4.2-01, Verification of Deposits and Assets (05/04/2022) 403 |\n| B3-4.2-02, Depository Accounts (12/14/2022) 405 |\n| B3-4.2-03, Individual Development Accounts (02/06/2019) 408 |\n| B3-4.2-04, Pooled Savings (Community Savings Funds) (04/01/2009) 411 |\n| B3-4.2-05, Foreign Assets (05/04/2022) 411 |\n| Section B3-4.3, Verification of Non-Depository Assets 412 |\n| B3-4.3-01, Stocks, Stock Options, Bonds, and Mutual Funds (06/30/2015) 412 |\n| B3-4.3-02, Trust Accounts (04/01/2009) 413 |\n| B3-4.3-03, Retirement Accounts (06/30/2015) 414 |\n| B3-4.3-04, Personal Gifts (09/06/2023) 415 |\n| B3-4.3-05, Gifts of Equity (10/07/2020) 418 |\n| B3-4.3-06, Grants and Lender Contributions (12/14/2022) 419 |\n| B3-4.3-07, Disaster Relief Grants or Loans (04/01/2009) 423 |\n| B3-4.3-08, Employer Assistance (09/29/20
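
When the same single-row query is needed for different keys or columns, the SQL text can be assembled by a small helper. This is a sketch under assumptions: `build_row_query` is a hypothetical name, the query shape simply mirrors the one used above, and the backslash escaping of quotes assumes standard GoogleSQL string-literal rules.

```python
def build_row_query(table, column_family, columns, row_key):
    """Build SQL text that reads selected columns of one row by its key."""
    select_list = ",\n    ".join(
        f"`{column_family}`['{col}'] AS {col}" for col in columns
    )
    # Escape backslashes and single quotes so the key stays inside the literal
    safe_key = row_key.replace("\\", "\\\\").replace("'", "\\'")
    return (
        f"SELECT\n    {select_list}\n"
        f"FROM `{table}`(with_history => FALSE)\n"
        f"WHERE _key = '{safe_key}'"
    )

example_query = build_row_query(
    table='applied-genai',
    column_family='retrieval-bigtable',
    columns=['chunk_id', 'content'],
    row_key='fannie_part_0_c17',
)
print(example_query)
```

The returned string can be passed to the `execute_query` function defined earlier in this section.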

#### Delete Record

In [190]:
if bigtable_table.read_row(first_record['chunk_id']): print('Row found')
else: print('Row not found')

Row found


In [191]:
row = bigtable_table.row(first_record['chunk_id'])
row.delete()
row.commit()



In [192]:
if bigtable_table.read_row(first_record['chunk_id']): print('Row found')
else: print('Row not found')

Row not found


### Load Data

There are many great features in the Bigtable clients for handling mutations/changes/writes to fit the specifics of a use case.  Here we need to write over 9000 new rows to the table.  Some of the techniques that can be used:

- `table.mutate_rows()` **used here**
    - cell operations are grouped into row objects and submitted together; each row's mutations are applied atomically, but the batch as a whole is not atomic
    - async alternative: `await table.bulk_mutate_rows()`
        - asynchronous submission of many row objects containing cell operations
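
For larger loads it can help to submit rows in fixed-size batches so that each request stays within the service's per-request mutation limits. The helper below is a generic sketch: `batched` and `toy_rows` are hypothetical names, the batch size of 4 is arbitrary, and plain strings stand in for row objects.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Toy stand-ins for row objects (assumption: real code would hold Bigtable rows)
toy_rows = [f'row-{i}' for i in range(10)]
batches = list(batched(toy_rows, 4))
print([len(batch) for batch in batches])  # [4, 4, 2]

# With a real table, each batch would be submitted separately, for example:
# for batch in batched(rows, 1000):
#     statuses = bigtable_table.mutate_rows(batch)
```

The roughly 9000 rows in this workflow are small enough to submit in one call, so batching is optional here.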

#### Limit Cell History With Garbage Collection Settings

Update the column family and set the garbage collection policy to keep at most 1 version.

Documentation References:
- [Garbage Collection Overview](https://cloud.google.com/bigtable/docs/garbage-collection)
- [Configure Garbage Collection](https://cloud.google.com/bigtable/docs/configuring-garbage-collection)
    - [Garbage collection based on the number of versions](https://cloud.google.com/bigtable/docs/configuring-garbage-collection#versions)

In [169]:
bigtable_column_family.gc_rule = bigtable.column_family.MaxVersionsGCRule(1)

In [170]:
bigtable_column_family.update()

#### Prepare The Embedding Values As Bytes

As we did in the single record example above, the embedding for each record now needs to be prepared by converting it to bytes:

In [178]:
for chunk in content_chunks:
    if type(chunk['embedding']) == list:
        chunk['embedding'] = vector_as_bytes(chunk['embedding'])

In [179]:
type(content_chunks[0]['embedding'])

bytes

#### Write Records To Bigtable

[Write rows to a table](https://cloud.google.com/bigtable/docs/samples-python-hello#write_rows_to_a_table)

In [197]:
rows = []
for chunk in content_chunks:
    row = bigtable_table.row(chunk['chunk_id'])
    for key, value in chunk.items():
        if not isinstance(value, bytes):
            value = value.encode()
        row.set_cell(bigtable_column_family.column_family_id, key, value)
    rows.append(row)

In [198]:
results = bigtable_table.mutate_rows(rows)

In [199]:
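# each entry is the status for one row; a code of 0 (OK) means the row was written
len(results), all(status.code == 0 for status in results)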

(9040, True)

Get a sample of keys to verify that rows were written:

In [200]:
[r.row_key.decode() for r in bigtable_table.read_rows(limit=10)]

['fannie_part_0_c1',
 'fannie_part_0_c10',
 'fannie_part_0_c100',
 'fannie_part_0_c1000',
 'fannie_part_0_c1001',
 'fannie_part_0_c1002',
 'fannie_part_0_c1003',
 'fannie_part_0_c1004',
 'fannie_part_0_c1005',
 'fannie_part_0_c1006']
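
The order of the keys above follows from Bigtable storing rows sorted lexicographically by row key bytes, so `c10` and `c100` come before `c2`. A small illustration with made-up keys in the same style:

```python
# made-up keys in the same style as the chunk ids above
keys = ['fannie_part_0_c2', 'fannie_part_0_c10', 'fannie_part_0_c1', 'fannie_part_0_c100']

# sorting the encoded keys mimics byte-wise lexicographic ordering
ordered = [key.decode() for key in sorted(key.encode() for key in keys)]
print(ordered)
# ['fannie_part_0_c1', 'fannie_part_0_c10', 'fannie_part_0_c100', 'fannie_part_0_c2']
```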

---
## Vector Similarity Search, Matching


This section covers using a vector similarity metric to find the nearest neighbors of a query vector.  To understand similarity metrics and build the intuition for choosing one (choose dot product), check out [The Math of Similarity](../Embeddings/The%20Math%20of%20Similarity.ipynb).


**Notes On [Vector Search](https://cloud.google.com/bigtable/docs/find-k-nearest-neighbors#perform-knn-search) With Bigtable**

The workflow below uses the embedding column's cell values for vector search.  Searching does not require an index, and no indexing or approximate nearest neighbors functionality is available.  Every search is a brute force scan using the chosen distance function.
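
To make "brute force" concrete, the sketch below does the same kind of work in memory: compute the distance from the query to every stored vector, then keep the `k` smallest. It is an illustration only. The toy vectors and the function names are made up, and the actual search in this workflow runs inside Bigtable through SQL.

```python
import heapq
import math

def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def brute_force_search(rows, query, k):
    """Score every row against the query and return the k closest (key, distance) pairs."""
    scored = ((cosine_distance(vector, query), key) for key, vector in rows.items())
    return [(key, distance) for distance, key in heapq.nsmallest(k, scored)]

# toy two-dimensional "embeddings"
toy_rows = {
    'a': [1.0, 0.0],
    'b': [0.0, 1.0],
    'c': [1.0, 1.0],
}
top = brute_force_search(toy_rows, [1.0, 0.0], k=2)
print([key for key, _ in top])  # ['a', 'c']
```

The cost grows linearly with the number of rows, which is the trade-off for not maintaining an index.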

### Brute Force Search

Using the [GoogleSQL for Bigtable](https://cloud.google.com/bigtable/docs/googlesql-overview) functionality gives access to the [GoogleSQL mathematical functions](https://cloud.google.com/spanner/docs/reference/standard-sql/mathematical_functions).  This includes the following functions that are useful for vector similarity search:

- `COSINE_DISTANCE` 
- `EUCLIDEAN_DISTANCE`
- `DOT_PRODUCT`

Cosine distance with `COSINE_DISTANCE`:

In [276]:
query = f"""
      SELECT
          `{BIGTABLE_COLUMN_FAMILY_NAME}`['chunk_id'] as chunk_id,
          `{BIGTABLE_COLUMN_FAMILY_NAME}`['content'] as content,
          COSINE_DISTANCE(
              TO_VECTOR32(`{BIGTABLE_COLUMN_FAMILY_NAME}`['embedding']),
              {question_embedding}
          ) as cosine_distance
      FROM `{BIGTABLE_TABLE_NAME}`
      ORDER BY cosine_distance
      LIMIT 5;
      """

results = await execute_query(query)
[(r['chunk_id'], r['cosine_distance']) for r in results]

[('fannie_part_0_c352', 0.28999842323157043),
 ('freddie_part_4_c509', 0.31944418927007767),
 ('freddie_part_4_c510', 0.3246529853234844),
 ('fannie_part_0_c353', 0.32760391792511945),
 ('fannie_part_0_c326', 0.3316462732543465)]

Euclidean distance with `EUCLIDEAN_DISTANCE`:

In [277]:
query = f"""
      SELECT
          `{BIGTABLE_COLUMN_FAMILY_NAME}`['chunk_id'] as chunk_id,
          `{BIGTABLE_COLUMN_FAMILY_NAME}`['content'] as content,
          EUCLIDEAN_DISTANCE(
              TO_VECTOR32(`{BIGTABLE_COLUMN_FAMILY_NAME}`['embedding']),
              {question_embedding}
          ) as euclidean_distance
      FROM `{BIGTABLE_TABLE_NAME}`
      ORDER BY euclidean_distance
      LIMIT 5;
      """

results = await execute_query(query)
[(r['chunk_id'], r['euclidean_distance']) for r in results]

[('fannie_part_0_c352', 0.7615658337594855),
 ('freddie_part_4_c509', 0.7992875473228386),
 ('freddie_part_4_c510', 0.8057848660615564),
 ('fannie_part_0_c353', 0.8094336897143497),
 ('fannie_part_0_c326', 0.8144253147417732)]
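
The two searches above return the same chunks in the same order. That is what would be expected if the embeddings are (approximately) unit length, because for unit vectors the squared Euclidean distance equals twice the cosine distance. The sketch below checks that identity on two made-up vectors and then applies it to the first cosine distance reported above; the helper names are hypothetical.

```python
import math

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# two made-up vectors, scaled to unit length
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])
cos = cosine_distance(a, b)
euc = euclidean_distance(a, b)
print(math.isclose(euc ** 2, 2 * cos))  # True

# apply the identity to the first cosine distance reported above
implied_euclidean = math.sqrt(2 * 0.28999842323157043)
print(round(implied_euclidean, 4))  # 0.7616
```

The implied value is close to the first Euclidean distance reported above, with the small gap plausibly due to 32-bit storage and embeddings that are not exactly unit length.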

### Brute Force Search With Pre-Filtering

Extending a brute force match with pre-filtering means including a `WHERE` clause to first filter to rows that meet a desired condition:

Find the top 5 matches where the GSE is 'fannie':

In [278]:
query = f"""
      SELECT
          `{BIGTABLE_COLUMN_FAMILY_NAME}`['chunk_id'] as chunk_id,
          `{BIGTABLE_COLUMN_FAMILY_NAME}`['content'] as content,
          COSINE_DISTANCE(
              TO_VECTOR32(`{BIGTABLE_COLUMN_FAMILY_NAME}`['embedding']),
              {question_embedding}
          ) as cosine_distance
      FROM `{BIGTABLE_TABLE_NAME}`
      WHERE `{BIGTABLE_COLUMN_FAMILY_NAME}`['gse'] = 'fannie'
      ORDER BY cosine_distance
      LIMIT 5;
      """

results = await execute_query(query)
[(r['chunk_id'], r['cosine_distance']) for r in results]

[('fannie_part_0_c352', 0.28999842323157043),
 ('fannie_part_0_c353', 0.32760391792511945),
 ('fannie_part_0_c326', 0.3316462732543465),
 ('fannie_part_0_c92', 0.3385544159915941),
 ('fannie_part_0_c240', 0.33913578264492195)]

Find the top 5 matches where the GSE is 'freddie':

In [279]:
query = f"""
      SELECT
          `{BIGTABLE_COLUMN_FAMILY_NAME}`['chunk_id'] as chunk_id,
          `{BIGTABLE_COLUMN_FAMILY_NAME}`['content'] as content,
          COSINE_DISTANCE(
              TO_VECTOR32(`{BIGTABLE_COLUMN_FAMILY_NAME}`['embedding']),
              {question_embedding}
          ) as cosine_distance
      FROM `{BIGTABLE_TABLE_NAME}`
      WHERE `{BIGTABLE_COLUMN_FAMILY_NAME}`['gse'] = 'freddie'
      ORDER BY cosine_distance
      LIMIT 5;
      """

results = await execute_query(query)
[(r['chunk_id'], r['cosine_distance']) for r in results]

[('freddie_part_4_c509', 0.31944418927007767),
 ('freddie_part_4_c510', 0.3246529853234844),
 ('freddie_part_4_c472', 0.33799758363971266),
 ('freddie_part_6_c439', 0.3395277423685359),
 ('freddie_part_4_c558', 0.342446390396589)]
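
The four queries above differ only in the distance function and the optional `WHERE` clause, so the SQL text could be assembled by a small helper. This is a hypothetical sketch: `build_search_query`, `my_family`, and `my_table` are illustrative names, and nothing is executed against Bigtable. Because values are interpolated directly into the SQL text, the helper restricts the inputs it accepts.

```python
ALLOWED_DISTANCES = {"COSINE_DISTANCE", "EUCLIDEAN_DISTANCE"}

def build_search_query(column_family, table, embedding, n_matches=5,
                       gse=None, distance="COSINE_DISTANCE"):
    """Return the search SQL as a string; nothing is executed here."""
    if distance not in ALLOWED_DISTANCES:
        raise ValueError(f"unsupported distance function: {distance}")
    if gse is not None and not gse.isalnum():
        raise ValueError("gse must be alphanumeric")
    lines = [
        "SELECT",
        f"    `{column_family}`['chunk_id'] AS chunk_id,",
        f"    `{column_family}`['content'] AS content,",
        f"    {distance}(",
        f"        TO_VECTOR32(`{column_family}`['embedding']),",
        f"        {embedding}",
        "    ) AS distance",
        f"FROM `{table}`",
    ]
    # pre-filtering: only add the WHERE clause when a filter value is given
    if gse is not None:
        lines.append(f"WHERE `{column_family}`['gse'] = '{gse}'")
    lines += ["ORDER BY distance", f"LIMIT {int(n_matches)};"]
    return "\n".join(lines)

# placeholder names and a tiny made-up embedding
query_all = build_search_query("my_family", "my_table", [0.1, 0.2], n_matches=3)
query_fannie = build_search_query("my_family", "my_table", [0.1, 0.2], gse="fannie")
print(query_fannie)
```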

---
## Retrieval Augmented Generation (RAG)

Build a simple retrieval augmented generation process that enhances a query by retrieving context.  This is done here by constructing three functions for the stages:
- `retrieve` - a function that uses an embedding to search for matching pieces of text to use as context
    - this uses the system built earlier in this workflow!
- `augment` - prepare the retrieved chunks into a prompt
- `generate` - make the LLM request with the augmented prompt

A final function is used to execute the RAG workflow:
- `rag` - a function that receives the query and orchestrates the workflow through `retrieve` > `augment` > `generate`

### Clients

In [226]:
embedder = vertexai.language_models.TextEmbeddingModel.from_pretrained('text-embedding-004')
llm = vertexai.generative_models.GenerativeModel("gemini-1.5-flash-002")

### Retrieve Function

In [268]:
def retrieve_bigtable(query_embedding, n_matches = 5):

    query = f"""
          SELECT
              `{BIGTABLE_COLUMN_FAMILY_NAME}`['chunk_id'] as chunk_id,
              `{BIGTABLE_COLUMN_FAMILY_NAME}`['content'] as content,
              COSINE_DISTANCE(
                  TO_VECTOR32(`{BIGTABLE_COLUMN_FAMILY_NAME}`['embedding']),
                  {query_embedding}
              ) as cosine_distance
          FROM `{BIGTABLE_TABLE_NAME}`
          ORDER BY cosine_distance
          LIMIT {n_matches};
          """
    matches = execute_query(query)
    
    return matches
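
Note that `retrieve_bigtable` is a regular function that returns the result of `execute_query` without awaiting it, so the caller is expected to `await` the returned value. The sketch below shows that pattern in isolation, assuming `execute_query` is a coroutine function as its earlier use with `await` suggests. The stand-in `execute_query` and the `retrieve` and `main` names are hypothetical, and `asyncio.run` is used so the example works as a script (inside a notebook, `await main()` would be used instead).

```python
import asyncio

async def execute_query(query):
    # stand-in for the async helper defined earlier in the workflow
    await asyncio.sleep(0)
    return [{'chunk_id': 'example', 'content': query}]

def retrieve(query):
    # no await here: the coroutine object is handed back to the caller
    return execute_query(query)

async def main():
    pending = retrieve('SELECT 1')
    is_coroutine = asyncio.iscoroutine(pending)
    rows = await pending  # the query only runs once it is awaited
    return is_coroutine, rows

is_coroutine, rows = asyncio.run(main())
print(is_coroutine, rows)
```

Declaring the retrieve function with `async def` and awaiting inside it would behave the same for the caller and makes the requirement to `await` more visible.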

### Augment Function

In [269]:
def augment(matches):

    prompt = ''
    for m, match in enumerate(matches):
        prompt += f"Context {m+1}:\n{match['content']}\n\n"
    prompt += f'Answer the following question using the provided contexts:\n'

    return prompt
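
To see the shape of the prompt this produces, here is the same function applied to two made-up matches. The sample passages and question are invented for illustration.

```python
def augment(matches):
    # number each retrieved passage, then add the instruction line
    prompt = ''
    for m, match in enumerate(matches):
        prompt += f"Context {m+1}:\n{match['content']}\n\n"
    prompt += 'Answer the following question using the provided contexts:\n'
    return prompt

# invented sample data
sample_matches = [{'content': 'First passage.'}, {'content': 'Second passage.'}]
prompt = augment(sample_matches) + 'What do the passages say?'
print(prompt)
# Context 1:
# First passage.
#
# Context 2:
# Second passage.
#
# Answer the following question using the provided contexts:
# What do the passages say?
```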

### Generate Function

In [270]:
def generate(prompt):

    result = llm.generate_content(prompt)

    return result

### RAG Function

In [272]:
async def rag(query):
    
    query_embedding = embedder.get_embeddings([query])[0].values
    matches = await retrieve_bigtable(query_embedding)
    prompt = augment(matches) + query
    result = generate(prompt)
    
    return result.text

### Example In Use

In [273]:
question

'Does a lender have to perform servicing functions directly?'

In [274]:
print(await rag(question))

No.  A lender may use other organizations to perform some or all of its servicing functions through subservicing arrangements (Context 1).  However, the lender (master servicer) remains contractually responsible (Context 1).  There are specific requirements and guidelines for these subservicing arrangements, including the need for both the master servicer and subservicer to be Fannie Mae-approved (Context 4).



---
### Profiling Performance

Profile the timing of each step in the RAG function for sequential calls. The environment chosen for this workflow is a minimal testing environment, so load testing (simultaneous requests) would not be helpful.

In [287]:
profile = []

In [288]:
async def rag(query, profile = profile):
    
    timings = {}
    start_time = time.time()
    
    
    # 1. Get embeddings
    embedding_start = time.time()
    query_embedding = embedder.get_embeddings([query])[0].values
    timings['embedding'] = time.time() - embedding_start

    # 2. Retrieve from Bigtable
    retrieval_start = time.time()
    matches = await retrieve_bigtable(query_embedding)
    timings['retrieval_bigtable'] = time.time() - retrieval_start

    # 3. Augment the prompt
    augment_start = time.time()
    prompt = augment(matches) + query
    timings['augment'] = time.time() - augment_start

    # 4. Generate text
    generate_start = time.time()
    result = generate(prompt)
    timings['generate'] = time.time() - generate_start

    total_time = time.time() - start_time
    timings['total'] = total_time
    
    profile.append(timings)
    
    return result.text
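
The repeated start/stop bookkeeping above could also be expressed with a small context manager. This is an optional sketch, not part of the workflow: `timed` is a hypothetical helper, and it uses `time.perf_counter()`, which is intended for measuring short durations.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(timings, name):
    """Record the elapsed seconds of the enclosed block under timings[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

timings = {}
with timed(timings, 'sleep'):
    time.sleep(0.01)  # stand-in for a step such as retrieval
print(timings)
```

Each stage of the RAG function could be wrapped in its own `with timed(timings, ...)` block to collect the same dictionary of timings.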

In [289]:
print(await rag(question))

No, a lender does not have to perform servicing functions directly.  Context 1 explicitly states that a lender may use other organizations ("subservicing arrangements") to perform some or all of its servicing functions.  However, the lender remains contractually responsible (the "master servicer") unless they sell and assign servicing to another lender.  Even then,  the original lender may continue to be contractually responsible.



In [290]:
profile

[{'embedding': 0.08796429634094238,
  'retrieval_bigtable': 0.22971844673156738,
  'augment': 2.4080276489257812e-05,
  'generate': 0.7621994018554688,
  'total': 1.0799107551574707}]

In [291]:
for i in range(100):
    response = await rag(question)

### Report From Profile

In [292]:
all_timings = {}
for timings in profile:
    for key, value in timings.items():
        if key not in all_timings:
            all_timings[key] = []
        all_timings[key].append(value)

In [293]:
for key, values in all_timings.items():
    arr = np.array(values)
    print(f"Statistics for '{key}':")
    print(f"  Min: {np.min(arr):.4f} seconds")
    print(f"  Max: {np.max(arr):.4f} seconds")
    print(f"  Mean: {np.mean(arr):.4f} seconds")
    print(f"  Median: {np.median(arr):.4f} seconds")
    print(f"  Std Dev: {np.std(arr):.4f} seconds")
    print(f"  P95: {np.percentile(arr, 95):.4f} seconds")
    print(f"  P99: {np.percentile(arr, 99):.4f} seconds")
    print("")

Statistics for 'embedding':
  Min: 0.0473 seconds
  Max: 0.1127 seconds
  Mean: 0.0558 seconds
  Median: 0.0530 seconds
  Std Dev: 0.0104 seconds
  P95: 0.0798 seconds
  P99: 0.0880 seconds

Statistics for 'retrieval_bigtable':
  Min: 0.1362 seconds
  Max: 0.2672 seconds
  Mean: 0.1697 seconds
  Median: 0.1652 seconds
  Std Dev: 0.0213 seconds
  P95: 0.2101 seconds
  P99: 0.2297 seconds

Statistics for 'augment':
  Min: 0.0000 seconds
  Max: 0.0000 seconds
  Mean: 0.0000 seconds
  Median: 0.0000 seconds
  Std Dev: 0.0000 seconds
  P95: 0.0000 seconds
  P99: 0.0000 seconds

Statistics for 'generate':
  Min: 0.5664 seconds
  Max: 0.9706 seconds
  Mean: 0.7367 seconds
  Median: 0.7216 seconds
  Std Dev: 0.0993 seconds
  P95: 0.9135 seconds
  P99: 0.9691 seconds

Statistics for 'total':
  Min: 0.7704 seconds
  Max: 1.1772 seconds
  Mean: 0.9622 seconds
  Median: 0.9405 seconds
  Std Dev: 0.1013 seconds
  P95: 1.1378 seconds
  P99: 1.1650 seconds
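
The `augment` step reports `0.0000` for every statistic because its duration is far below the four decimal places printed, not because it takes no time. Using the value from the single-run profile above:

```python
# 'augment' timing from the single-run profile shown earlier
augment_seconds = 2.4080276489257812e-05

in_seconds = f"{augment_seconds:.4f} seconds"
in_microseconds = f"{augment_seconds * 1e6:.1f} microseconds"
print(in_seconds)       # 0.0000 seconds
print(in_microseconds)  # 24.1 microseconds
```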



## Remove Resources

In [None]:
#bigtable_column_family.delete()

In [101]:
#bigtable_table.delete()

In [64]:
#bigtable_instance.delete()