<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Retrieval/Retrieval%20-%20Local%20With%20Numpy.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FApplied%2520GenAI%2FRetrieval%2FRetrieval%2520-%2520Local%2520With%2520Numpy.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Applied%20GenAI/Retrieval/Retrieval%20-%20Local%20With%20Numpy.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Applied%20GenAI/Retrieval/Retrieval%20-%20Local%20With%20Numpy.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# Retrieval - Local With NumPy

In prior workflows, a series of documents was [processed into chunks](../Chunking/readme.md), and for each chunk, [embeddings](../Embeddings/readme.md) were created:

- Process: [Large Document Processing - Document AI Layout Parser](../Chunking/Large%20Document%20Processing%20-%20Document%20AI%20Layout%20Parser.ipynb)
- Embed: [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb)

Retrieving chunks for a query involves calculating the embedding for the query and then using similarity metrics to find relevant chunks. A thorough review of similarity matching can be found in [The Math of Similarity](../Embeddings/The%20Math%20of%20Similarity.ipynb) - use dot product! As development moves from experiment to application, the process of storing and computing similarity is migrated to a [retrieval](./readme.md) system. This workflow is part of a [series of workflows exploring many retrieval systems](./readme.md).  

A detailed [comparison of many retrieval systems](./readme.md#comparison-of-vector-database-solutions) can be found in the readme as well.

---

**NumPy For Local Search (and indexing)**

This workflow builds a retrieval system locally using [NumPy](https://numpy.org/)! NumPy is a powerful Python library for numerical computation and provides an easy-to-implement local solution for similarity search. This workflow also extends NumPy to approximate nearest neighbor search by building an Inverted File (IVF) index using [k-means](https://en.wikipedia.org/wiki/K-means_clustering) clustering with the [scikit-learn](https://scikit-learn.org/stable/) package.

---

**Use Case Data**

Buying a home usually involves borrowing money from a lending institution, typically through a mortgage secured by the home's value. But how do these institutions manage the risks associated with such large loans, and how are lending standards established?

In the United States, two government-sponsored enterprises (GSEs) play a vital role in the housing market:

- Federal National Mortgage Association ([Fannie Mae](https://www.fanniemae.com/))
- Federal Home Loan Mortgage Corporation ([Freddie Mac](https://www.freddiemac.com/))

These GSEs purchase mortgages from lenders, enabling those lenders to offer more loans. This process also allows Fannie Mae and Freddie Mac to set standards for mortgages, ensuring they are responsible and borrowers are more likely to repay them. This system makes homeownership more affordable and stabilizes the housing market by maintaining a steady flow of liquidity for lenders and keeping interest rates controlled.

However, navigating the complexities of these GSEs and their extensive servicing guides can be challenging.

**Approaches**

[This series](../readme.md) covers many generative AI workflows. These documents are used directly as long context for Gemini in the workflow [Long Context Retrieval With The Vertex AI Gemini API](../Generate/Long%20Context%20Retrieval%20With%20The%20Vertex%20AI%20Gemini%20API.ipynb). The workflow below uses a [retrieval](./readme.md) approach with the already generated chunks and embeddings.

---
## Colab Setup

When running this notebook in [Colab](https://colab.google/) or [Colab Enterprise](https://cloud.google.com/colab/docs/introduction), this section will authenticate to GCP (follow prompts in the popup) and set the current project for the session.

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs and API Enablement

The client packages may need to be installed in this environment.

### Installs (If Needed)

In [3]:
# tuples of (import name, install name, min_version)
packages = [
    ('google.cloud.aiplatform', 'google-cloud-aiplatform', '1.69.0'),
    ('numpy', 'numpy'),
    ('sklearn', 'scikit-learn'),
    ('psutil', 'psutil'),
    ('GPUtil', 'GPUtil')
]

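import importlib.util
import importlib.metadata

def version_tuple(version):
    # numeric comparison of release parts: '1.69.0' -> (1, 69, 0)
    return tuple(int(part) for part in version.split('.') if part.isdigit())

install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # installed version is looked up by distribution (install) name
        if version_tuple(importlib.metadata.version(package[1])) < version_tuple(package[2]):
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user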

### API Enablement

In [4]:
!gcloud services enable aiplatform.googleapis.com

### Restart Kernel (If Installs Occurred)

After a kernel restart, code execution can resume from the next cell after this one.

In [5]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
    IPython.display.display(IPython.display.Markdown("""<div class=\"alert alert-block alert-warning\">
        <b>⚠️ The kernel is going to restart. Please wait until it is finished before continuing to the next step. The previous cells do not need to be run again⚠️</b>
        </div>"""))

---
## Setup

Inputs

In [6]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [7]:
REGION = 'us-central1'
SERIES = 'applied-genai'
EXPERIMENT = 'retrieval-numpy'

Packages

In [8]:
import os, json, sys, time, glob, shutil

import numpy as np
import sklearn.cluster
import psutil
import GPUtil

# Vertex AI
from google.cloud import aiplatform
import vertexai.language_models # for embeddings API
import vertexai.generative_models # for Gemini Models

In [9]:
aiplatform.__version__

'1.71.0'

Clients

In [10]:
# vertex ai clients
vertexai.init(project = PROJECT_ID, location = REGION)

---
## Text & Embeddings For Examples

This repository contains a [section for document processing (chunking)](../Chunking/readme.md) that includes an example of processing multiple large PDFs (over 1000 pages) into chunks: [Large Document Processing - Document AI Layout Parser](../Chunking/Large%20Document%20Processing%20-%20Document%20AI%20Layout%20Parser.ipynb).  The chunks of text from that workflow are stored with this repository and loaded by another companion workflow that augments the chunks with text embeddings: [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb).

The following code will load the version of the chunks that includes text embeddings and prepare it for a local example of retrieval augmented generation.

### Get The Documents

If you are working from a clone of this notebook's [repository](https://github.com/statmike/vertex-ai-mlops), then the documents are already present. The following cell checks for the documents folder and, if it is missing, gets it (`git clone`):

In [11]:
local_dir = '../Embeddings/files/embeddings-api'

In [12]:
if not os.path.exists(local_dir):
    print('Retrieving documents...')
    parent_dir = os.path.dirname(local_dir)
    temp_dir = os.path.join(parent_dir, 'temp')
    if not os.path.exists(temp_dir):
        os.makedirs(temp_dir)
    !git clone https://www.github.com/statmike/vertex-ai-mlops {temp_dir}/vertex-ai-mlops
    shutil.copytree(f'{temp_dir}/vertex-ai-mlops/Applied GenAI/Embeddings/files/embeddings-api', local_dir)
    shutil.rmtree(temp_dir)
    print(f'Documents are now in folder `{local_dir}`')
else:
    print(f'Documents Found in folder `{local_dir}`')             

Documents Found in folder `../Embeddings/files/embeddings-api`


### Load The Chunks

In [13]:
jsonl_files = glob.glob(f"{local_dir}/large-files*.jsonl")
jsonl_files.sort()
jsonl_files

['../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0000.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0001.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0002.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0003.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0004.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0005.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0006.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0007.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0008.jsonl',
 '../Embeddings/files/embeddings-api/large-files-chunk-embeddings-0009.jsonl']

In [14]:
chunks = []
for file in jsonl_files:
    with open(file, 'r') as f:
        chunks.extend([json.loads(line) for line in f])
len(chunks)

9040

### Review A Chunk

In [15]:
chunks[0].keys()

dict_keys(['instance', 'predictions', 'status'])

In [16]:
chunks[0]['instance']['chunk_id']

'fannie_part_0_c17'

In [17]:
print(chunks[0]['instance']['content'])

# Selling Guide Fannie Mae Single Family

## Fannie Mae Copyright Notice

### Fannie Mae Copyright Notice

|-|
| Section B3-4.2, Verification of Depository Assets 402 |
| B3-4.2-01, Verification of Deposits and Assets (05/04/2022) 403 |
| B3-4.2-02, Depository Accounts (12/14/2022) 405 |
| B3-4.2-03, Individual Development Accounts (02/06/2019) 408 |
| B3-4.2-04, Pooled Savings (Community Savings Funds) (04/01/2009) 411 |
| B3-4.2-05, Foreign Assets (05/04/2022) 411 |
| Section B3-4.3, Verification of Non-Depository Assets 412 |
| B3-4.3-01, Stocks, Stock Options, Bonds, and Mutual Funds (06/30/2015) 412 |
| B3-4.3-02, Trust Accounts (04/01/2009) 413 |
| B3-4.3-03, Retirement Accounts (06/30/2015) 414 |
| B3-4.3-04, Personal Gifts (09/06/2023) 415 |
| B3-4.3-05, Gifts of Equity (10/07/2020) 418 |
| B3-4.3-06, Grants and Lender Contributions (12/14/2022) 419 |
| B3-4.3-07, Disaster Relief Grants or Loans (04/01/2009) 423 |
| B3-4.3-08, Employer Assistance (09/29/2015) 423 |
| B3-4.3-09,

In [18]:
chunks[0]['predictions'][0]['embeddings']['values'][0:10]

[0.031277116388082504,
 0.03056905046105385,
 0.010865348391234875,
 0.0623614676296711,
 0.03228681534528732,
 0.05066155269742012,
 0.046544693410396576,
 0.05509665608406067,
 -0.014074751175940037,
 0.008380400016903877]

### Prepare Chunk Structure

Make a list of dictionaries with information for each chunk:

In [19]:
content_chunks = [
    dict(
        gse = chunk['instance']['gse'],
        chunk_id = chunk['instance']['chunk_id'],
        content = chunk['instance']['content'],
        embedding = chunk['predictions'][0]['embeddings']['values']
    ) for chunk in chunks
]

### Query Embedding

Create a query, or prompt, and get the embedding for it:

Connect to models for text embeddings. Learn more about the model API:
- [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb)

In [20]:
question = "Does a lender have to perform servicing functions directly?"

In [21]:
embedder = vertexai.language_models.TextEmbeddingModel.from_pretrained('text-embedding-004')

In [22]:
question_embedding = embedder.get_embeddings([question])[0].values
question_embedding[0:10]

[-0.0005117303808219731,
 0.009651427157223225,
 0.01768726110458374,
 0.014538003131747246,
 -0.01829824410378933,
 0.027877431362867355,
 -0.021124685183167458,
 0.008830446749925613,
 -0.02669006586074829,
 0.06414774805307388]

---
## NumPy For Vector Similarity Search

Embeddings can be used with math to measure similarity.  For more detail, check out the companion workflow: [The Math of Similarity](../Embeddings/The%20Math%20of%20Similarity.ipynb).  Retrieval systems handle the storage and math of similarity as a service.  For an overview of Google Cloud based solutions for retrieval, check out [this companion series](../Retrieval/readme.md).

The content below motivates retrieval with the embeddings that accompany the text chunks, using a local vector database with brute force matching in [NumPy](https://numpy.org/)!

### Vector DB With NumPy

In [23]:
vector_db = [
    [
        chunk['instance']['chunk_id'],
        chunk['predictions'][0]['embeddings']['values'],
    ]
    for chunk in chunks
]
vector_index = np.array([row[1] for row in vector_db])

In [24]:
len(vector_db)

9040

In [25]:
vector_index.shape

(9040, 768)

### Matching With NumPy

Use dot product to calculate similarity and find matches for a query embedding.  Why dot product?  Check out the companion workflow: [The Math of Similarity](../Embeddings/The%20Math%20of%20Similarity.ipynb)

> **NOTE:**  This calculates the similarity for all embedding vectors stored in the local vector db, which is just a NumPy array here.  This is very fast because there are fewer than 10,000 embedding vectors.  As this scales, it would be better to consider a solution that searches a subset of embeddings.  More details on retrieval solutions can be found in [Retrieval](../Retrieval/readme.md).  One method is to partition the embeddings and search only the partitions near the query embedding.  This is covered as an example later in this workflow (below).
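
As a small illustration of why dot product is enough for ranking: when vectors have unit length, dot product and cosine similarity give the same values. The sketch below uses random stand-in vectors (not the real embeddings) and normalizes them explicitly. Whether a given embedding model returns unit-length vectors is an assumption worth checking with `np.linalg.norm`.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in data: 5 random "chunk" vectors and 1 "query" vector (not real embeddings)
index = rng.normal(size=(5, 8))
query = rng.normal(size=8)

# normalize every vector to unit length
index_unit = index / np.linalg.norm(index, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)

# cosine similarity computed on the raw vectors
cosine = (index @ query) / (np.linalg.norm(index, axis=1) * np.linalg.norm(query))

# dot product computed on the unit-length vectors
dot = index_unit @ query_unit

print(np.allclose(cosine, dot))
```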

In [26]:
similarity = np.dot(question_embedding, vector_index.T)
matches = np.argsort(similarity)[-5:].tolist()
matches.reverse()
matches = [(match, similarity[match]) for match in matches]
matches

[(141, 0.7099842015202706),
 (6673, 0.6805260859043876),
 (7246, 0.6753296984114661),
 (698, 0.6723706814818046),
 (327, 0.6683496311110356)]
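
The full `np.argsort` above sorts every similarity score even though only the top few are needed. A possible alternative, sketched here with made-up scores, is `np.argpartition`, which selects the top `k` without a full sort. The helper name `top_k` is hypothetical and not part of this workflow.

```python
import numpy as np

def top_k(similarity, k):
    """Return indices of the k largest values, ordered from most to least similar."""
    k = min(k, similarity.shape[0])
    # argpartition places the k largest values in the last k positions (unordered)
    candidates = np.argpartition(similarity, -k)[-k:]
    # sort only those k candidates, descending by similarity
    return candidates[np.argsort(similarity[candidates])[::-1]]

rng = np.random.default_rng(0)
similarity = rng.random(10_000)   # stand-in scores, not real similarities

# reference result using a full sort, as in the cell above
full_sort = np.argsort(similarity)[-5:][::-1]
print(np.array_equal(top_k(similarity, 5), full_sort))
```

At fewer than 10,000 vectors the difference is likely negligible; the approach becomes more relevant as the index grows.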

### Get Text For Matches

Make a dictionary to look up chunk content by chunk ID:

In [27]:
chunk_lookup = {}
for chunk in chunks:
    chunk_lookup[chunk['instance']['chunk_id']] = chunk['instance']['content']

In [28]:
print(chunk_lookup['fannie_part_0_c17'])

# Selling Guide Fannie Mae Single Family

## Fannie Mae Copyright Notice

### Fannie Mae Copyright Notice

|-|
| Section B3-4.2, Verification of Depository Assets 402 |
| B3-4.2-01, Verification of Deposits and Assets (05/04/2022) 403 |
| B3-4.2-02, Depository Accounts (12/14/2022) 405 |
| B3-4.2-03, Individual Development Accounts (02/06/2019) 408 |
| B3-4.2-04, Pooled Savings (Community Savings Funds) (04/01/2009) 411 |
| B3-4.2-05, Foreign Assets (05/04/2022) 411 |
| Section B3-4.3, Verification of Non-Depository Assets 412 |
| B3-4.3-01, Stocks, Stock Options, Bonds, and Mutual Funds (06/30/2015) 412 |
| B3-4.3-02, Trust Accounts (04/01/2009) 413 |
| B3-4.3-03, Retirement Accounts (06/30/2015) 414 |
| B3-4.3-04, Personal Gifts (09/06/2023) 415 |
| B3-4.3-05, Gifts of Equity (10/07/2020) 418 |
| B3-4.3-06, Grants and Lender Contributions (12/14/2022) 419 |
| B3-4.3-07, Disaster Relief Grants or Loans (04/01/2009) 423 |
| B3-4.3-08, Employer Assistance (09/29/2015) 423 |
| B3-4.3-09,

In [29]:
for m, match in enumerate(matches):
    print(f"Match {m+1} ({match[1]:.2f}) is chunk {vector_db[match[0]][0]}:\n{chunk_lookup[vector_db[match[0]][0]]}\n###################################################")

Match 1 (0.71) is chunk fannie_part_0_c352:
# A3-3-03, Other Servicing Arrangements (12/15/2015)

Introduction This topic provides an overview of other servicing arrangements, including: • Subservicing • General Requirements for Subservicing Arrangements • Pledge of Servicing Rights and Transfer of Interest in Servicing Income

## Subservicing

A lender may use other organizations to perform some or all of its servicing functions. Fannie Mae refers to these arrangements as “subservicing” arrangements, meaning that a servicer (the “subservicer”) other than the contractually responsible servicer (the “master” servicer) is performing the servicing functions. The following are not considered to be subservicing arrangements: • when a computer service bureau is used to perform accounting and reporting functions; • when the originating lender sells and assigns servicing to another lender, unless the originating lender continues to be the contractually responsible servicer.
###################

---
## Retrieval Augmented Generation (RAG)

Build a simple retrieval augmented generation process that enhances a query by retrieving context.  This is done here by constructing three functions for the stages:
- `retrieve` - a function that uses an embedding to search for matching context parts, pieces of text
    - this uses the system built earlier in this workflow!
- `augment` - prepare chunks into a prompt
- `generate` - make the LLM request with the augmented prompt

A final function is used to execute the RAG workflow:
- `rag` - a function that receives the query and orchestrates the workflow through `retrieve` > `augment` > `generate`

### Clients

In [30]:
embedder = vertexai.language_models.TextEmbeddingModel.from_pretrained('text-embedding-004')
llm = vertexai.generative_models.GenerativeModel("gemini-1.5-flash-002")

### Retrieve Function

In [43]:
def retrieve_numpy(query_embedding, n_matches = 5):
    
    similarity = np.dot(query_embedding, vector_index.T)
    bf_matches = np.argsort(similarity)[-(n_matches):].tolist()
    bf_matches.reverse()
    bf_matches = [(match, similarity[match]) for match in bf_matches]
    
    matches = []
    for m, match in enumerate(bf_matches):
        matches.append(dict(
            chunk_id = vector_db[match[0]][0],
            content = chunk_lookup[vector_db[match[0]][0]]
        ))
    
    return matches

### Augment Function

In [44]:
def augment(matches):

    prompt = ''
    for m, match in enumerate(matches):
        prompt += f"Context {m+1}:\n{match['content']}\n\n"
    prompt += f'Answer the following question using the provided contexts:\n'

    return prompt
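
To make the prompt layout concrete, the sketch below repeats the `augment` logic so it runs on its own and applies it to two hypothetical matches. The chunk IDs, contents, and question are made up for illustration.

```python
def augment(matches):
    # same logic as the `augment` function above, repeated so this block runs on its own
    prompt = ''
    for m, match in enumerate(matches):
        prompt += f"Context {m+1}:\n{match['content']}\n\n"
    prompt += 'Answer the following question using the provided contexts:\n'
    return prompt

# hypothetical matches, shaped like the output of `retrieve_numpy`
toy_matches = [
    dict(chunk_id = 'example_c1', content = 'First retrieved chunk.'),
    dict(chunk_id = 'example_c2', content = 'Second retrieved chunk.'),
]

# the question is appended after the instruction line, as in the `rag` function
prompt = augment(toy_matches) + 'What do the chunks say?'
print(prompt)
```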

### Generate Function

In [45]:
def generate(prompt):

    result = llm.generate_content(prompt)

    return result

### RAG Function

In [46]:
def rag(query):
    
    query_embedding = embedder.get_embeddings([query])[0].values
    matches = retrieve_numpy(query_embedding)
    prompt = augment(matches) + query
    result = generate(prompt)
    
    return result.text

### Example In Use

In [47]:
question

'Does a lender have to perform servicing functions directly?'

In [48]:
print(rag(question))

No, a lender does not have to perform servicing functions directly.  Context 1 explicitly states that a lender "may use other organizations to perform some or all of its servicing functions," referring to this as "subservicing."  This involves a "master servicer" and a "subservicer," where the subservicer performs functions on behalf of the master servicer.  The contexts also describe procedures and regulations surrounding these arrangements.



---
### Profiling Performance

Profile the timing of each step in the RAG function for sequential calls. The environment chosen for this workflow is a minimal testing environment, so load testing (simultaneous requests) would not be helpful.

In [49]:
profile = []

In [50]:
def rag(query, profile = profile):
    
    timings = {}
    start_time = time.time()
    
    
    # 1. Get embeddings
    embedding_start = time.time()
    query_embedding = embedder.get_embeddings([query])[0].values
    timings['embedding'] = time.time() - embedding_start

    # 2. Retrieve matches with NumPy
    retrieval_start = time.time()
    matches = retrieve_numpy(query_embedding)
    timings['retrieve_numpy'] = time.time() - retrieval_start

    # 3. Augment the prompt
    augment_start = time.time()
    prompt = augment(matches) + query
    timings['augment'] = time.time() - augment_start

    # 4. Generate text
    generate_start = time.time()
    result = generate(prompt)
    timings['generate'] = time.time() - generate_start

    total_time = time.time() - start_time
    timings['total'] = total_time
    
    profile.append(timings)
    
    return result.text
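
The repeated start/stop pattern above could also be written with a small context manager. This is a sketch of one possible design, not the approach used in this workflow: the name `timed` is hypothetical, and `time.perf_counter` is swapped in for `time.time` because it is intended for measuring short intervals.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(timings, key):
    """Record the elapsed wall-clock seconds of the enclosed block in timings[key]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[key] = time.perf_counter() - start

timings = {}
with timed(timings, 'total'):
    with timed(timings, 'sleep'):
        time.sleep(0.01)   # stand-in for a real step such as retrieval

print(timings)
```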

In [51]:
print(rag(question))

No, a lender does not have to perform servicing functions directly.  Context 1 explicitly states that a lender may use other organizations to perform some or all of its servicing functions, referring to this as "subservicing."  This involves a "master servicer" and a "subservicer," where the subservicer performs the servicing functions on behalf of the master servicer.  However,  the master servicer remains contractually responsible.



In [52]:
profile

[{'embedding': 0.10621380805969238,
  'retrieve_numpy': 0.0040607452392578125,
  'augment': 3.981590270996094e-05,
  'generate': 0.8290009498596191,
  'total': 0.9393248558044434}]

In [53]:
for i in range(100):
    response = rag(question)

### Report From Profile

In [54]:
all_timings = {}
for timings in profile:
    for key, value in timings.items():
        if key not in all_timings:
            all_timings[key] = []
        all_timings[key].append(value)

In [55]:
for key, values in all_timings.items():
    arr = np.array(values)
    print(f"Statistics for '{key}':")
    print(f"  Min: {np.min(arr):.4f} seconds")
    print(f"  Max: {np.max(arr):.4f} seconds")
    print(f"  Mean: {np.mean(arr):.4f} seconds")
    print(f"  Median: {np.median(arr):.4f} seconds")
    print(f"  Std Dev: {np.std(arr):.4f} seconds")
    print(f"  P95: {np.percentile(arr, 95):.4f} seconds")
    print(f"  P99: {np.percentile(arr, 99):.4f} seconds")
    print("")

Statistics for 'embedding':
  Min: 0.0462 seconds
  Max: 0.3017 seconds
  Mean: 0.0602 seconds
  Median: 0.0528 seconds
  Std Dev: 0.0332 seconds
  P95: 0.0904 seconds
  P99: 0.2560 seconds

Statistics for 'retrieve_numpy':
  Min: 0.0038 seconds
  Max: 0.0129 seconds
  Mean: 0.0042 seconds
  Median: 0.0040 seconds
  Std Dev: 0.0011 seconds
  P95: 0.0054 seconds
  P99: 0.0079 seconds

Statistics for 'augment':
  Min: 0.0000 seconds
  Max: 0.0001 seconds
  Mean: 0.0000 seconds
  Median: 0.0000 seconds
  Std Dev: 0.0000 seconds
  P95: 0.0001 seconds
  P99: 0.0001 seconds

Statistics for 'generate':
  Min: 0.5788 seconds
  Max: 1.2178 seconds
  Mean: 0.7623 seconds
  Median: 0.7355 seconds
  Std Dev: 0.1136 seconds
  P95: 0.9955 seconds
  P99: 1.0743 seconds

Statistics for 'total':
  Min: 0.6290 seconds
  Max: 1.2720 seconds
  Mean: 0.8267 seconds
  Median: 0.7922 seconds
  Std Dev: 0.1181 seconds
  P95: 1.0569 seconds
  P99: 1.1332 seconds



---
## Deeper Profiling



### Size Of Objects

The design above involves three objects:
- `vector_db` - a Python list of lists, each containing a chunk_id and the embedding vector for the chunk
- `vector_index` - a NumPy array with one row per embedding vector
- `chunk_lookup` - a Python dict with a key for each chunk_id and the text of the chunk as the value

These are used by finding the row index of matching embeddings in `vector_index`, looking up the corresponding chunk_id in `vector_db`, and finally retrieving the text of the chunk from `chunk_lookup`.

#### Object: `vector_db`

In [56]:
type(vector_db)

list

In [57]:
len(vector_db)

9040

In [58]:
# size in bytes (shallow: counts the outer list only, not the nested rows)
sys.getsizeof(vector_db)

75672

In [59]:
# size in megabytes
sys.getsizeof(vector_db)/ (1024*1024)

0.07216644287109375
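
Note that `sys.getsizeof` reports only the shallow size of a container, so the number above covers the outer list and not the embeddings nested inside it. The sketch below shows the difference on a toy stand-in for `vector_db`. The helper `deep_size` is hypothetical and only a rough estimate, since it handles nested lists only and does not account for shared objects.

```python
import sys

# toy stand-in for `vector_db`: [chunk_id, embedding] pairs (values are made up)
toy_db = [[f'chunk_{i}', [float(i + j) for j in range(768)]] for i in range(100)]

def deep_size(obj):
    """Roughly sum sys.getsizeof over nested lists and their items."""
    size = sys.getsizeof(obj)
    if isinstance(obj, list):
        size += sum(deep_size(item) for item in obj)
    return size

shallow = sys.getsizeof(toy_db)   # outer list only: a header plus one pointer per row
deep = deep_size(toy_db)          # also counts row lists, id strings, and float objects

print(shallow, deep)
```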

#### Object: `vector_index`

In [60]:
type(vector_index)

numpy.ndarray

In [61]:
vector_index.shape

(9040, 768)

In [62]:
# size in bytes
sys.getsizeof(vector_index)

55541888

In [63]:
# size in megabytes
sys.getsizeof(vector_index)/ (1024*1024)

52.9688720703125
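
The reported size is consistent with 9040 x 768 values stored as 8-byte `float64`, plus a small array-object overhead. The sketch below works through that arithmetic and shows what the same shape would need as `float32`. Whether `float32` precision is acceptable for similarity ranking is an assumption that is not evaluated in this workflow.

```python
import numpy as np

n_vectors, n_dims = 9040, 768   # the shape of `vector_index` shown above

# float64 uses 8 bytes per value, float32 uses 4
bytes_float64 = n_vectors * n_dims * np.dtype(np.float64).itemsize
bytes_float32 = n_vectors * n_dims * np.dtype(np.float32).itemsize

print(f"float64: {bytes_float64 / (1024*1024):.2f} MB")
print(f"float32: {bytes_float32 / (1024*1024):.2f} MB")

# casting a small stand-in array shows the same 2x reduction via `nbytes`
small = np.zeros((10, n_dims), dtype=np.float64)
print(small.nbytes, small.astype(np.float32).nbytes)
```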

#### Object: `chunk_lookup`

In [64]:
type(chunk_lookup)

dict

In [65]:
len(chunk_lookup)

9040

In [66]:
# size in bytes (shallow: counts the dict's own table only, not the key and value strings)
sys.getsizeof(chunk_lookup)

295000

In [67]:
# size in megabytes
sys.getsizeof(chunk_lookup)/ (1024*1024)

0.28133392333984375

### Local Compute Environment

In [68]:
# Get CPU count
cpu_count = psutil.cpu_count(logical=True)  # Includes logical cores (hyperthreading)
print(f"CPU Count: {cpu_count}")

# Get CPU frequency
cpu_freq = psutil.cpu_freq()
print(f"CPU Frequency: {cpu_freq.current} MHz")

CPU Count: 4
CPU Frequency: 2199.998 MHz


In [69]:
mem = psutil.virtual_memory()
print(f"Total Memory: {mem.total / (1024**3):.2f} GB")
print(f"Used Memory: {mem.used / (1024**3):.2f} GB")
print(f"Available Memory: {mem.available / (1024**3):.2f} GB")
print(f"Memory Percentage Used: {mem.percent}%")

Total Memory: 31.36 GB
Used Memory: 14.82 GB
Available Memory: 16.11 GB
Memory Percentage Used: 48.6%


In [70]:
# Get all available GPUs
gpus = GPUtil.getGPUs()

if gpus:
    for gpu in gpus:
        print(f"GPU Name: {gpu.name}")
        print(f"GPU Memory Total: {gpu.memoryTotal} MB")
        print(f"GPU Memory Used: {gpu.memoryUsed} MB")
        print(f"GPU Memory Free: {gpu.memoryFree} MB")
        print(f"GPU Load: {gpu.load*100}%")
else:
    print("No GPUs found.")

No GPUs found.


### Timing Sequential Operations

Get the timing for sequential request loads of increasing size: 1, 10, 100, and 1000 requests.

Break this down by the tasks:
- Get Embedding Vector For Question
- Get Matching Chunks
- Construct Prompt From Matches

#### Get Embedding Vector For Question: API

This tests sequential requests.  The Text Embeddings API has many options for asynchronous and multi-instance requests that could also be used for efficiency.  See more in [Vertex AI Text Embeddings API](../Embeddings/Vertex%20AI%20Text%20Embeddings%20API.ipynb).
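
As a rough illustration of the multi-instance idea, the sketch below groups several questions into batches so that each call handles more than one text. It uses a hypothetical stand-in, `fake_get_embeddings`, in place of `embedder.get_embeddings`, and the batch size is an arbitrary illustrative value. The API's actual per-request limits are not covered here and should be checked in the linked workflow.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_get_embeddings(texts):
    # hypothetical stand-in for `embedder.get_embeddings(texts)`:
    # returns one made-up vector per input text
    return [[float(len(text))] for text in texts]

questions = [f'question {i}' for i in range(7)]

embeddings = []
for batch in batched(questions, batch_size = 3):   # batch_size is an illustrative value
    embeddings.extend(fake_get_embeddings(batch))  # one call per batch, not per question

print(len(embeddings))
```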

In [71]:
embed_time = []
for x in range(4):
    n = 10**x
    start_time = time.time()
    for i in range(n):
        # get embedding for question
        question_embedding = embedder.get_embeddings([question])[0].values
    end_time = time.time()
    execution_time = end_time - start_time
    print(f"Execution time for n = {n}: {execution_time:.6f} seconds")
    embed_time.append(execution_time)

Execution time for n = 1: 0.080102 seconds
Execution time for n = 10: 0.669986 seconds
Execution time for n = 100: 5.302307 seconds
Execution time for n = 1000: 52.841818 seconds


#### Get Matching Chunks: Python + NumPy

In [72]:
# get embedding for question
question_embedding = embedder.get_embeddings([question])[0].values

match_time = []
for x in range(4):
    n = 10**x
    start_time = time.time()
    for i in range(n):
        # get top_n matches:
        top_n = 10
        matches = retrieve_numpy(question_embedding, n_matches = top_n)
    end_time = time.time()
    execution_time = end_time - start_time
    print(f"Execution time for n = {n}: {execution_time:.6f} seconds")
    match_time.append(execution_time)

Execution time for n = 1: 0.004354 seconds
Execution time for n = 10: 0.041247 seconds
Execution time for n = 100: 0.402753 seconds
Execution time for n = 1000: 4.015273 seconds
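
The loop above matches one query at a time. When several query embeddings are available together, a single matrix product can score all of them at once. The sketch below uses random stand-in data with small, made-up dimensions and compares the batched result with the one-at-a-time result.

```python
import numpy as np

rng = np.random.default_rng(0)
index = rng.normal(size=(1000, 16))    # stand-in for `vector_index`
queries = rng.normal(size=(4, 16))     # stand-in for several query embeddings

top_n = 3

# one matrix product gives a (n_queries, n_vectors) similarity matrix
similarity = queries @ index.T

# per-row indices of the top_n largest similarities, most similar first
batch_matches = np.argsort(similarity, axis=1)[:, -top_n:][:, ::-1]

# the same result computed one query at a time, as in the loop above
loop_matches = np.array([
    np.argsort(np.dot(q, index.T))[-top_n:][::-1] for q in queries
])

print(np.array_equal(batch_matches, loop_matches))
```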


### Profile Sequential Operations Timing

Now collect the individual timings for local operations and review the profile of timing:

In [73]:
# get embedding for question
question_embedding = embedder.get_embeddings([question])[0].values

combined_time_profile = []

n = 10000

for i in range(n):
    start_time = time.time()
    
    # get top_n matches:
    top_n = 10
    matches = retrieve_numpy(question_embedding, n_matches = top_n)
    
    end_time = time.time()
    execution_time = end_time - start_time
    combined_time_profile.append(execution_time)

In [74]:
# Total time for all requests
total_time = sum(combined_time_profile)
print(f"Total time for all requests: {total_time:.6f} seconds")

# Average time per request
average_time = total_time / len(combined_time_profile)
print(f"Average time per request: {average_time:.6f} seconds")

# Range of time across all requests
time_range = max(combined_time_profile) - min(combined_time_profile)
print(f"Range of time across all requests: {time_range:.6f} seconds")

# 99th percentile of request times
percentile_99 = np.percentile(combined_time_profile, 99)
print(f"99th percentile of request times: {percentile_99:.6f} seconds") 

Total time for all requests: 47.152343 seconds
Average time per request: 0.004715 seconds
Range of time across all requests: 0.023218 seconds
99th percentile of request times: 0.012341 seconds


---
## Approximate Search With IVF using K-Means

The solution above is fast at the current size and scale.  As the number of embeddings increases, it could be helpful to search a subset of embeddings for faster responses.  A simple way to extend the brute force search to a subset is an Inverted File (IVF) index. How?
- Cluster the embeddings into k groups, using [k-means clustering](https://en.wikipedia.org/wiki/K-means_clustering)
- Create an inverted list that assigns embeddings to clusters
- Search by first finding the closest clusters, then searching only within those

Here the clustering with k-means is trained with [scikit-learn `sklearn.cluster.KMeans`](https://scikit-learn.org/1.5/modules/generated/sklearn.cluster.KMeans.html).

### Cluster With k-means

In [75]:
k = 100
kmeans = sklearn.cluster.KMeans(n_clusters = k, random_state = 0)
cluster_assignments = kmeans.fit_predict(vector_index)

### Create Inverted Lists

In [76]:
inverted_lists = [[] for _ in range(k)]
for i, cluster_id in enumerate(cluster_assignments):
    inverted_lists[cluster_id].append(i)
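
The same structure can be seen on a toy example. The sketch below builds the inverted lists for eight made-up cluster assignments, shows an equivalent array-based version using `np.flatnonzero`, and counts cluster sizes with `np.bincount`. The names `inverted_arrays` and `cluster_sizes` are illustrative and not used elsewhere in this workflow.

```python
import numpy as np

k = 3
# toy cluster assignments for 8 vectors (made-up values)
cluster_assignments = np.array([2, 0, 1, 0, 2, 2, 1, 0])

# list-of-lists version, as built above
inverted_lists = [[] for _ in range(k)]
for i, cluster_id in enumerate(cluster_assignments):
    inverted_lists[cluster_id].append(i)

# array version: one integer array of member indices per cluster
inverted_arrays = [np.flatnonzero(cluster_assignments == c) for c in range(k)]

# cluster sizes, useful for spotting very large or empty clusters
cluster_sizes = np.bincount(cluster_assignments, minlength = k)

print(inverted_lists)
print(cluster_sizes)
```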

In [77]:
len(inverted_lists)

100

In [78]:
inverted_lists[0]

[1072,
 1148,
 1199,
 1233,
 1247,
 1311,
 1327,
 1363,
 1413,
 1421,
 1432,
 1443,
 1474,
 1505,
 1517,
 1526,
 1580,
 1610,
 1621,
 1631,
 1690,
 1700,
 1711,
 1744,
 1820,
 1865,
 1891,
 2061,
 4798,
 4848,
 4850,
 4853,
 4862,
 4876,
 4881,
 4884,
 4893,
 4909,
 4943,
 4953,
 4961,
 4974,
 5007,
 5016,
 5020,
 5061,
 5065,
 5131,
 5184,
 5211,
 5219,
 5230,
 5264,
 5278,
 5333,
 5335,
 5336,
 5363,
 5375,
 5420,
 5439,
 5469,
 5481,
 5505,
 5522,
 5600]

### Search With IVF Index

#### Find Closest Clusters

The center of each cluster is stored:

In [79]:
kmeans.cluster_centers_.shape

(100, 768)

In [80]:
# get embedding for question
question_embedding = embedder.get_embeddings([question])[0].values

In [81]:
cluster_similarity = np.dot(question_embedding, kmeans.cluster_centers_.T)
nearest_clusters = np.argsort(cluster_similarity)[-10:]

In [82]:
nearest_clusters

array([72, 52, 34, 76, 79, 14, 20, 24, 33, 84])
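
One detail worth keeping in mind: k-means assigns vectors to clusters by Euclidean distance, while the cell above ranks cluster centers by dot product. Cluster centers are averages and are generally not unit length, so the two rankings can disagree. The constructed 2-D example below (made-up values) shows such a case. Whether this affects results on this dataset is not evaluated here.

```python
import numpy as np

# made-up 2-D centroids with different lengths, and a unit-length query
centroids = np.array([
    [0.9, 0.0],   # short centroid pointing the same way as the query
    [2.0, 1.5],   # long centroid, further from the query
])
query = np.array([1.0, 0.0])

# rank centroids by dot product, largest first
by_dot = np.argsort(centroids @ query)[::-1]

# rank centroids by Euclidean distance, smallest first
by_distance = np.argsort(np.linalg.norm(centroids - query, axis = 1))

print(by_dot, by_distance)
```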

#### Search Within Clusters

In [83]:
candidate_indicies = [idv for cluster_id in nearest_clusters for idv in inverted_lists[cluster_id]]

#### Top Matches Within Candidate Clusters

In [84]:
candidate_index = vector_index[candidate_indicies]
candidate_similarity = np.dot(question_embedding, candidate_index.T)
ivf_matches = [[candidate_indicies[match], candidate_similarity[match]] for match in np.argsort(candidate_similarity)[-top_n:].tolist()]
ivf_matches.reverse()
ivf_matches

[[141, 0.7099842015202706],
 [6673, 0.6805260859043876],
 [7246, 0.6753296984114661],
 [698, 0.6723706814818046],
 [327, 0.6683496311110356],
 [7166, 0.6619843862150242],
 [190, 0.661433734537568],
 [8724, 0.6604534921319808],
 [7264, 0.6575403552108654],
 [8440, 0.6573532750493309]]

#### Compare To Top Matches From Brute Force

In [85]:
top_n = 10
similarity = np.dot(question_embedding, vector_index.T)
matches = np.argsort(similarity)[-top_n:].tolist()
matches = [[match, similarity[match]] for match in matches]
matches.reverse()
matches

[[141, 0.7099842015202706],
 [6673, 0.6805260859043876],
 [7246, 0.6753296984114661],
 [698, 0.6723706814818046],
 [327, 0.6683496311110356],
 [7166, 0.6619843862150242],
 [190, 0.661433734537568],
 [506, 0.6608578617010463],
 [8724, 0.6604534921319808],
 [7264, 0.6575403552108654]]

In [86]:
[i[0] for i in matches] == [i[0] for i in ivf_matches]

False
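The `False` result reflects the approximate nature of IVF: a true top-10 row that sits outside the probed clusters is never scored. A small sketch that quantifies the gap as recall, using the row ids printed above (`recall_at_k` is a hypothetical helper, not part of the notebook):

```python
# Row ids copied from the two result lists above.
brute_force_ids = [141, 6673, 7246, 698, 327, 7166, 190, 506, 8724, 7264]
ivf_ids = [141, 6673, 7246, 698, 327, 7166, 190, 8724, 7264, 8440]

def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact top-k that the approximate search also returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)

missed = [i for i in brute_force_ids if i not in ivf_ids]
print(f"recall@10 = {recall_at_k(brute_force_ids, ivf_ids):.2f}")  # recall@10 = 0.90
print(f"missed by IVF: {missed}")  # missed by IVF: [506]
```

Here IVF returned 9 of the 10 exact matches. Row 506 was missed, so row 8440 took the last position instead.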

#### Put Steps Together For Efficiency

In [87]:
top_c = 10
top_n = 10
nearest_clusters = np.argsort(np.dot(question_embedding, kmeans.cluster_centers_.T))[-top_c:]
candidate_indices = np.concatenate([inverted_lists[cluster_id] for cluster_id in nearest_clusters])
candidate_similarity = np.dot(question_embedding, vector_index[candidate_indices].T)
top_indices = np.argsort(candidate_similarity)[-top_n:]
ivf_matches = [[candidate_indices[i], candidate_similarity[i]] for i in top_indices]
ivf_matches.reverse()
ivf_matches

[[141, 0.7099842015202706],
 [6673, 0.6805260859043876],
 [7246, 0.6753296984114661],
 [698, 0.6723706814818046],
 [327, 0.6683496311110356],
 [7166, 0.6619843862150242],
 [190, 0.661433734537568],
 [8724, 0.6604534921319808],
 [7264, 0.6575403552108654],
 [8440, 0.6573532750493309]]
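The `top_c` value controls the trade-off between speed and recall: probing more clusters scores more rows but misses fewer true matches. The sketch below illustrates this on synthetic data only. It assumes unit-length random vectors and a crude NumPy k-means in place of `sklearn.cluster.KMeans`, so the numbers say nothing about the real index:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic unit-length vectors standing in for `vector_index` (assumption).
vectors = rng.normal(size=(2000, 32))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query = vectors[0]

def assign(points, centers):
    """Index of the nearest center (squared Euclidean distance) per point."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Crude coarse quantizer: a few Lloyd iterations started from random rows.
k = 20
centroids = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
for _ in range(5):
    labels = assign(vectors, centroids)
    for c in range(k):
        members = vectors[labels == c]
        if len(members):  # keep the old centroid if a cluster empties out
            centroids[c] = members.mean(axis=0)

labels = assign(vectors, centroids)
inverted = [np.flatnonzero(labels == c) for c in range(k)]

top_n = 10
exact = set(np.argsort(vectors @ query)[-top_n:].tolist())
cluster_order = np.argsort(centroids @ query)[::-1]  # best cluster first

recalls = []
for top_c in (1, 2, 5, 10, k):
    candidates = np.concatenate([inverted[c] for c in cluster_order[:top_c]])
    scores = vectors[candidates] @ query
    found = set(candidates[np.argsort(scores)[-top_n:]].tolist())
    recalls.append(len(exact & found) / top_n)
    print(f"top_c={top_c:>2}  candidates={len(candidates):>4}  recall@{top_n}={recalls[-1]:.2f}")
```

Probing every cluster (`top_c = k`) scores all rows and reproduces brute force. Smaller values scan fewer candidates at the cost of possibly missing matches.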

### RAG: Enhanced With IVF Method

The `augment` and `generate` functions can remain the same.  Here, a new `retrieve_numpy_ivf` function is created along with an updated `rag` function that points to it.

#### Retrieve Function

In [88]:
def retrieve_numpy_ivf(query_embedding, n_matches = 5):

    # rank clusters by similarity to the query and keep the closest n_clusters
    n_clusters = 10
    nearest_clusters = np.argsort(np.dot(query_embedding, kmeans.cluster_centers_.T))[-n_clusters:]
    # gather the row ids stored in those clusters and score only those rows
    candidate_indices = np.concatenate([inverted_lists[cluster_id] for cluster_id in nearest_clusters])
    candidate_similarity = np.dot(query_embedding, vector_index[candidate_indices].T)
    top_indices = np.argsort(candidate_similarity)[-n_matches:]
    ivf_matches = [[candidate_indices[i], candidate_similarity[i]] for i in top_indices]
    ivf_matches.reverse()
    
    # look up the chunk id and content for each match, best match first
    matches = []
    for match in ivf_matches:
        matches.append(dict(
            chunk_id = vector_db[match[0]][0],
            content = chunk_lookup[vector_db[match[0]][0]]
        ))
    
    return matches

#### RAG Function

In [89]:
def rag(query):
    
    query_embedding = embedder.get_embeddings([query])[0].values
    matches = retrieve_numpy_ivf(query_embedding)
    prompt = augment(matches) + query
    result = generate(prompt)
    
    return result.text

#### Example In Use

In [90]:
question

'Does a lender have to perform servicing functions directly?'

In [91]:
print(rag(question))

No.  A lender may use other organizations to perform some or all of its servicing functions through subservicing arrangements (Context 1).  However, the lender (master servicer) remains contractually responsible (Context 1), and there are specific requirements for these arrangements, including ensuring the subservicer's ability to meet Fannie Mae's requirements (Context 4).



---
### Profiling Performance

Profile the timing of each step in the RAG function across sequential calls. The environment chosen for this workflow is a minimal testing environment, so load testing (simultaneous requests) would not be informative.

In [97]:
profile = []

In [98]:
def rag(query, profile = profile):
    
    timings = {}
    start_time = time.time()
    
    
    # 1. Get embeddings
    embedding_start = time.time()
    query_embedding = embedder.get_embeddings([query])[0].values
    timings['embedding'] = time.time() - embedding_start

    # 2. Retrieve matches with the NumPy IVF index
    retrieval_start = time.time()
    matches = retrieve_numpy_ivf(query_embedding)
    timings['retrieve_numpy_ivf'] = time.time() - retrieval_start

    # 3. Augment the prompt
    augment_start = time.time()
    prompt = augment(matches) + query
    timings['augment'] = time.time() - augment_start

    # 4. Generate text
    generate_start = time.time()
    result = generate(prompt)
    timings['generate'] = time.time() - generate_start

    total_time = time.time() - start_time
    timings['total'] = total_time
    
    profile.append(timings)
    
    return result.text

In [99]:
print(rag(question))

No.  A lender may use other organizations to perform some or all of its servicing functions through subservicing arrangements (Context 1).  However, the lender (master servicer) remains contractually responsible (Context 1).  There are specific requirements and limitations on the use of subservicers, including Fannie Mae approval (Context 4).



In [100]:
profile

[{'embedding': 0.056551456451416016,
  'retrieve_numpy_ivf': 0.008069276809692383,
  'augment': 3.981590270996094e-05,
  'generate': 0.6398637294769287,
  'total': 0.7045333385467529}]
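The start/stop pattern repeated for each step can be factored into a small context manager. This is a sketch only: `timed` is a hypothetical helper, and the `time.sleep` calls are placeholders for the real embedding and retrieval calls. It uses `time.perf_counter`, which is monotonic and intended for measuring short durations, whereas `time.time` can be affected by system clock adjustments:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(timings, label):
    """Record the wall-clock duration of the enclosed block in timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

# Usage with stand-in steps (the sleeps are placeholders for real calls).
timings = {}
with timed(timings, "total"):
    with timed(timings, "embedding"):
        time.sleep(0.01)
    with timed(timings, "retrieve"):
        time.sleep(0.002)

print({key: round(value, 4) for key, value in timings.items()})
```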

In [101]:
for i in range(100):
    response = rag(question)

### Report From Profile

In [102]:
all_timings = {}
for timings in profile:
    for key, value in timings.items():
        if key not in all_timings:
            all_timings[key] = []
        all_timings[key].append(value)

In [103]:
for key, values in all_timings.items():
    arr = np.array(values)
    print(f"Statistics for '{key}':")
    print(f"  Min: {np.min(arr):.4f} seconds")
    print(f"  Max: {np.max(arr):.4f} seconds")
    print(f"  Mean: {np.mean(arr):.4f} seconds")
    print(f"  Median: {np.median(arr):.4f} seconds")
    print(f"  Std Dev: {np.std(arr):.4f} seconds")
    print(f"  P95: {np.percentile(arr, 95):.4f} seconds")
    print(f"  P99: {np.percentile(arr, 99):.4f} seconds")
    print("")

Statistics for 'embedding':
  Min: 0.0471 seconds
  Max: 0.1558 seconds
  Mean: 0.0548 seconds
  Median: 0.0516 seconds
  Std Dev: 0.0132 seconds
  P95: 0.0714 seconds
  P99: 0.1010 seconds

Statistics for 'retrieve_numpy_ivf':
  Min: 0.0026 seconds
  Max: 0.0089 seconds
  Mean: 0.0038 seconds
  Median: 0.0036 seconds
  Std Dev: 0.0010 seconds
  P95: 0.0048 seconds
  P99: 0.0088 seconds

Statistics for 'augment':
  Min: 0.0000 seconds
  Max: 0.0001 seconds
  Mean: 0.0000 seconds
  Median: 0.0000 seconds
  Std Dev: 0.0000 seconds
  P95: 0.0001 seconds
  P99: 0.0001 seconds

Statistics for 'generate':
  Min: 0.5629 seconds
  Max: 0.9085 seconds
  Mean: 0.7162 seconds
  Median: 0.7038 seconds
  Std Dev: 0.0847 seconds
  P95: 0.8653 seconds
  P99: 0.8972 seconds

Statistics for 'total':
  Min: 0.6165 seconds
  Max: 0.9648 seconds
  Mean: 0.7749 seconds
  Median: 0.7603 seconds
  Std Dev: 0.0867 seconds
  P95: 0.9280 seconds
  P99: 0.9509 seconds
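To put the steps in proportion, the mean of each step can be expressed as a share of the mean total. The values below are copied from the report above and are rounded to four decimals, so the shares are approximate:

```python
# Mean seconds per step, copied from the printed report above.
mean_seconds = {
    "embedding": 0.0548,
    "retrieve_numpy_ivf": 0.0038,
    "augment": 0.0000,
    "generate": 0.7162,
}
mean_total = 0.7749

# Share of the mean total time spent in each step, in percent.
share = {step: 100 * value / mean_total for step, value in mean_seconds.items()}
for step, pct in share.items():
    print(f"{step:>20}: {pct:5.1f}% of total")
```

On these numbers the generation call dominates the request time, while local retrieval accounts for well under 1% of the total.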



### Time Sequential Operations

Similar to the brute force timing above, calculate the time for various numbers of sequential operations:

#### Get Matching Chunks: Python + NumPy

In [104]:
# get embedding for question
question_embedding = embedder.get_embeddings([question])[0].values

match_time_ivf = []
for x in range(4):
    n = 10**x
    start_time = time.time()
    for i in range(n):
        # get top_n matches:
        top_n = 10
        matches = retrieve_numpy_ivf(question_embedding, n_matches = top_n)
    end_time = time.time()
    execution_time = end_time - start_time
    print(f"Execution time for n = {n}: {execution_time:.6f} seconds")
    match_time_ivf.append(execution_time)

Execution time for n = 1: 0.003629 seconds
Execution time for n = 10: 0.056653 seconds
Execution time for n = 100: 0.328526 seconds
Execution time for n = 1000: 3.100171 seconds


Compare timing:

In [105]:
for i, (t, ivf_t) in enumerate(zip(match_time, match_time_ivf)):
    adiff = abs(t-ivf_t)
    if t <= ivf_t:
        print(f"For {10**i} iterations: IVF was {100*(adiff/t):.2f}% slower")
    else:
        print(f"For {10**i} iterations: IVF was {100*(adiff/t):.2f}% faster")

For 1 iterations: IVF was 16.66% faster
For 10 iterations: IVF was 37.35% slower
For 100 iterations: IVF was 18.43% faster
For 1000 iterations: IVF was 22.79% faster


### Profile Sequential Operations Timing

Now collect the timing of each individual local retrieval call and review the resulting profile:

In [106]:
# get embedding for question
question_embedding = embedder.get_embeddings([question])[0].values

combined_time_profile_ivf = []

n = 10000

for i in range(n):
    start_time = time.time()
    
    # get top_n matches:
    top_n = 10
    matches = retrieve_numpy_ivf(question_embedding, n_matches = top_n)
    
    end_time = time.time()
    execution_time = end_time - start_time
    combined_time_profile_ivf.append(execution_time)

In [107]:
# Total time for all requests
total_time = sum(combined_time_profile_ivf)
print(f"Total time for all requests: {total_time:.6f} seconds")

# Average time per request
average_time = total_time / len(combined_time_profile_ivf)
print(f"Average time per request: {average_time:.6f} seconds")

# Range of time across all requests
time_range = max(combined_time_profile_ivf) - min(combined_time_profile_ivf)
print(f"Range of time across all requests: {time_range:.6f} seconds")

# 99th percentile of request times
percentile_99 = np.percentile(combined_time_profile_ivf, 99)
print(f"99th percentile of request times: {percentile_99:.6f} seconds") 

Total time for all requests: 32.304379 seconds
Average time per request: 0.003230 seconds
Range of time across all requests: 0.034760 seconds
99th percentile of request times: 0.008415 seconds


Compare timing:

In [108]:
def print_time_stats(label, times):
    """Prints timing statistics for a given list of times."""
    total_time = sum(times)
    average_time = total_time / len(times)
    time_range = max(times) - min(times)
    percentile_99 = np.percentile(times, 99)
    percentile_97 = np.percentile(times, 97)
    percentile_95 = np.percentile(times, 95)
    print(f"----- {label} -----")
    print(f"Total time: {total_time:.6f} seconds")
    print(f"Average time: {average_time:.6f} seconds")
    print(f"Range: {time_range:.6f} seconds")
    print(f"99th percentile: {percentile_99:.6f} seconds")
    print(f"97th percentile: {percentile_97:.6f} seconds")
    print(f"95th percentile: {percentile_95:.6f} seconds")
    return (total_time, average_time, time_range, percentile_99, percentile_97, percentile_95)
    
# Print individual statistics
results_ivf = print_time_stats("IVF", combined_time_profile_ivf)
results_bf = print_time_stats("Brute Force", combined_time_profile)

----- IVF -----
Total time: 32.304379 seconds
Average time: 0.003230 seconds
Range: 0.034760 seconds
99th percentile: 0.008415 seconds
97th percentile: 0.006985 seconds
95th percentile: 0.006576 seconds
----- Brute Force -----
Total time: 47.152343 seconds
Average time: 0.004715 seconds
Range: 0.023218 seconds
99th percentile: 0.012341 seconds
97th percentile: 0.009365 seconds
95th percentile: 0.008123 seconds


In [109]:
metrics = {'1' : 'Total time', '2' : 'Average Time', '3' : 'Range', '4' : '99th percentile', '5':'97th percentile', '6':'95th percentile'}
for i, (bf, ivf) in enumerate(zip(results_bf, results_ivf)):
    adiff = abs(bf-ivf)
    if bf <= ivf:
        print(f"For the '{metrics[str(i+1)]}' IVF was {100*(adiff/bf):.2f}% slower:\n\tbrute force = {bf:.6f}\n\tIVF = {ivf:.6f}")
    else:
        print(f"For the '{metrics[str(i+1)]}' IVF was {100*(adiff/bf):.2f}% faster:\n\tbrute force = {bf:.6f}\n\tIVF = {ivf:.6f}")


For the 'Total time' IVF was 31.49% faster:
	brute force = 47.152343
	IVF = 32.304379
For the 'Average Time' IVF was 31.49% faster:
	brute force = 0.004715
	IVF = 0.003230
For the 'Range' IVF was 49.72% slower:
	brute force = 0.023218
	IVF = 0.034760
For the '99th percentile' IVF was 31.81% faster:
	brute force = 0.012341
	IVF = 0.008415
For the '97th percentile' IVF was 25.41% faster:
	brute force = 0.009365
	IVF = 0.006985
For the '95th percentile' IVF was 19.05% faster:
	brute force = 0.008123
	IVF = 0.006576
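The faster/slower percentages above all come from the same arithmetic: the difference divided by the brute force value. A minimal sketch with a signed helper (`percent_change` is a hypothetical name), applied to the totals printed above:

```python
def percent_change(baseline, value):
    """Signed change of `value` relative to `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline

# Totals copied from the printed output above (seconds for 10,000 calls).
brute_force_total = 47.152343
ivf_total = 32.304379

change = percent_change(brute_force_total, ivf_total)
speedup = brute_force_total / ivf_total
print(f"IVF total time changed by {change:.2f}% relative to brute force")
print(f"Equivalent speedup factor: {speedup:.2f}x")
```

A negative change means IVF took less time. For a metric such as the range, where "faster" and "slower" do not apply, the sign simply indicates a smaller or larger spread.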
