# Setting Up Vector DB & Indexing Profile Embeddings

**Purpose**:
  1. Initialize a connection to the chosen vector database.
     *MVP Example*: Using ChromaDB for its ease of setup for local development.
  2. Create a dedicated collection (or index) within the database to store the
     taxpayer profile embeddings.
  3. Load the taxpayer profile embeddings and their corresponding Taxpayer IDs
     generated in [Notebook 04](./notebook_04.ipynb).
  4. Ingest (add/index) these embeddings into the vector database collection,
     linking each vector to its Taxpayer ID.
  5. Perform basic verification to confirm that the embeddings have been indexed.

**Why a Vector Database?**
  Vector databases are specialized for storing, indexing, and efficiently querying
  high-dimensional vector embeddings by similarity. By building an approximate
  nearest-neighbor index (Chroma uses HNSW), they can quickly find taxpayers with
  similar profiles (represented by nearby vectors) without comparing a query vector
  to every single stored vector.
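
  For contrast, here is a minimal sketch of the exhaustive search that a vector index is meant to avoid. It is plain Python with made-up toy vectors and hypothetical helper names (`cosine_similarity`, `brute_force_top_k`); none of it comes from the notebook's pipeline.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_top_k(query, vectors, k=2):
    """Score the query against every stored vector: O(N * d) work per query."""
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "profiles"; IDs and values are invented for illustration.
toy_vectors = {
    "A": [1.0, 0.0, 0.0],
    "B": [0.9, 0.1, 0.0],
    "C": [0.0, 1.0, 0.0],
}
top = brute_force_top_k([1.0, 0.05, 0.0], toy_vectors, k=2)
print([vec_id for vec_id, _ in top])  # ['A', 'B']
```

  Every query touches all stored vectors here, which is acceptable for a few thousand rows but scales linearly with collection size. An approximate index trades a small amount of recall for far fewer comparisons.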

**Prerequisites**:
  - Successful completion of [Notebook 04](./notebook_04.ipynb).
  - Existence of the embeddings file (`embeddings.npy`).
  - Existence of the corresponding Taxpayer IDs file (`embedding_ids.csv`).
  - Vector Database client library installed (e.g., `pip install chromadb`).

**Outputs**:
  - An initialized vector database client.
  - A populated collection within the vector database.
  - Confirmation that the embeddings are indexed.

**Next Step**:
  [Notebook 06](./notebook_06.ipynb) will use the indexed embeddings to perform similarity searches.

In [1]:
!pip install --quiet chromadb

## Imports and Configuration

In [2]:
import pandas as pd
import numpy as np
import os
import chromadb # MVP Example Client
# No embedding function is imported: the embeddings are precomputed in Notebook 04.

# --- Configuration ---
PROCESSED_DATA_DIR = './data/processed' # Directory containing N04 output
VECTOR_DB_DIR = './vector_db' # Directory to persist ChromaDB data

EMBEDDINGS_INPUT_FILE = os.path.join(PROCESSED_DATA_DIR, 'embeddings.npy')
IDS_INPUT_FILE = os.path.join(PROCESSED_DATA_DIR, 'embedding_ids.csv')

# ChromaDB specific config
CHROMA_PERSIST_DIR = os.path.join(VECTOR_DB_DIR, 'chroma_persist')
COLLECTION_NAME = "taxpayer_profiles"
# Distance metric: 'cosine' and 'l2' (Euclidean) are the common choices.
# Cosine compares direction only; L2 is also sensitive to differences in magnitude.
# The features were scaled with StandardScaler (not normalized to unit length),
# so the two metrics can rank neighbors differently. L2 is arguably the more
# conventional choice here, but cosine is also widely used and is what we use.
DISTANCE_METRIC = "cosine"

# Create directories if they don't exist
os.makedirs(CHROMA_PERSIST_DIR, exist_ok=True)

print("Notebook 05: Setting Up Vector DB & Indexing Profile Embeddings")
print("-" * 50)
print("Using ChromaDB as the vector database example.")
print(f"Loading embeddings from: {EMBEDDINGS_INPUT_FILE}")
print(f"Loading IDs from: {IDS_INPUT_FILE}")
print(f"ChromaDB persistence directory: {CHROMA_PERSIST_DIR}")
print(f"Collection Name: {COLLECTION_NAME}")
print(f"Distance Metric: {DISTANCE_METRIC}")
print("-" * 50)

Notebook 05: Setting Up Vector DB & Indexing Profile Embeddings
--------------------------------------------------
Using ChromaDB as the vector database example.
Loading embeddings from: ./data/processed/embeddings.npy
Loading IDs from: ./data/processed/embedding_ids.csv
ChromaDB persistence directory: ./vector_db/chroma_persist
Collection Name: taxpayer_profiles
Distance Metric: cosine
--------------------------------------------------
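
The metric trade-off described in the configuration comments can be seen on a toy case. This is a small sketch in plain Python with made-up 2-D vectors; the helper names are hypothetical and this is not how Chroma computes distances internally.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: ignores vector length, compares direction only."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def l2_distance(a, b):
    """Euclidean distance: sensitive to both direction and magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


u = [1.0, 2.0]
v = [2.0, 4.0]  # same direction as u, twice the magnitude
w = [2.0, 1.0]  # same magnitude as u, different direction

# Under cosine, v is the closer neighbor of u; under L2, w is.
print(cosine_distance(u, v) < cosine_distance(u, w))  # True
print(l2_distance(u, v) < l2_distance(u, w))          # False
```

Because the profile features are standardized rather than unit-normalized, vector length carries information, so it is worth checking on real queries whether the neighbors returned under the chosen metric look sensible.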


## Load Embeddings and IDs

In [3]:
try:
    embeddings = np.load(EMBEDDINGS_INPUT_FILE)
    print(f"Successfully loaded embeddings array with shape: {embeddings.shape}")
except FileNotFoundError:
    print(f"ERROR: Embeddings file not found at {EMBEDDINGS_INPUT_FILE}.")
    print("Please ensure Notebook 04 was run successfully and saved the file.")
    raise
except Exception as e:
    print(f"ERROR loading embeddings file: {e}")
    raise

try:
    ids_df = pd.read_csv(IDS_INPUT_FILE)
    # Ensure IDs are strings, as required by ChromaDB
    id_list = ids_df['Taxpayer ID'].astype(str).tolist()
    print(f"Successfully loaded {len(id_list)} Taxpayer IDs.")
except FileNotFoundError:
    print(f"ERROR: IDs file not found at {IDS_INPUT_FILE}.")
    print("Please ensure Notebook 04 was run successfully and saved the file.")
    raise
except Exception as e:
    print(f"ERROR loading IDs file: {e}")
    raise

# Validation
if embeddings.shape[0] != len(id_list):
    print(f"ERROR: Mismatch between number of embeddings ({embeddings.shape[0]}) and number of IDs ({len(id_list)})!")
    raise ValueError("Mismatch in length between embeddings and IDs.")
else:
    print("Validation: Number of embeddings matches number of IDs.")

# Convert embeddings to list of lists for ChromaDB ingestion if needed (depends on client version)
# Current versions often handle numpy arrays directly, but tolist() is safe.
embeddings_list = embeddings.tolist()

Successfully loaded embeddings array with shape: (4900, 28)
Successfully loaded 4900 Taxpayer IDs.
Validation: Number of embeddings matches number of IDs.
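
The length check above does not catch repeated IDs or non-finite values, either of which could cause trouble at ingestion time. Here is a minimal sketch of two extra checks, using hypothetical helper names and made-up sample data:

```python
import math
from collections import Counter


def find_duplicate_ids(ids):
    """Return the IDs that appear more than once, sorted."""
    counts = Counter(ids)
    return sorted(item_id for item_id, n in counts.items() if n > 1)


def find_non_finite_rows(rows):
    """Return the indices of rows containing NaN or infinite values."""
    return [
        idx for idx, row in enumerate(rows)
        if not all(math.isfinite(value) for value in row)
    ]


# Made-up sample data for illustration only.
sample_ids = ["TXP_1", "TXP_2", "TXP_2"]
sample_rows = [[0.1, 0.2], [float("nan"), 0.3], [0.4, 0.5]]

print(find_duplicate_ids(sample_ids))     # ['TXP_2']
print(find_non_finite_rows(sample_rows))  # [1]
```

In this notebook the same helpers could be applied as `find_duplicate_ids(id_list)` and `find_non_finite_rows(embeddings_list)`. On the NumPy array itself, `np.isfinite(embeddings).all()` gives an equivalent yes/no answer.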


## Initialize Vector Database Client (ChromaDB Example)

In [4]:
try:
    # Using PersistentClient to save data to disk
    # Use chromadb.Client() for a purely in-memory instance (data lost on restart)
    client = chromadb.PersistentClient(path=CHROMA_PERSIST_DIR)
    print(f"ChromaDB Persistent Client initialized. Data will be stored in: {CHROMA_PERSIST_DIR}")
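    # You might want to reset the DB for repeatable runs during development.
    # Chroma disables reset() by default; it has to be enabled through the client
    # settings (Settings(allow_reset=True)) before the call below will work.
    # client.reset() # Uncomment carefully - this deletes all collections!
    # print("ChromaDB client reset (collections deleted).")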

except Exception as e:
    print(f"ERROR initializing ChromaDB client: {e}")
    raise

ChromaDB Persistent Client initialized. Data will be stored in: ./vector_db/chroma_persist


## Create or Get Collection

In [5]:
try:
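    # Using get_or_create_collection is idempotent: creates if not exists, gets if it does.
    # Specify the distance metric in the metadata. The index keeps the metric it was
    # created with, so changing DISTANCE_METRIC later does not re-index an existing collection.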
    collection = client.get_or_create_collection(
        name=COLLECTION_NAME,
        metadata={"hnsw:space": DISTANCE_METRIC} # HNSW is a common index type used by Chroma
    )
    print(f"Successfully got or created collection: '{COLLECTION_NAME}' with distance metric '{DISTANCE_METRIC}'.")

except Exception as e:
    print(f"ERROR getting or creating ChromaDB collection '{COLLECTION_NAME}': {e}")
    raise

Successfully got or created collection: 'taxpayer_profiles' with distance metric 'cosine'.


## Ingest / Index Embeddings

In [6]:
print(f"Preparing to add {len(id_list)} embeddings to the '{COLLECTION_NAME}' collection.")

# Note: Larger datasets should be added in batches, both for memory management and
# because Chroma limits how many items a single `add` call may contain (the exact
# limit depends on the version). For this MVP scale (~5k), one batch is fine.
BATCH_SIZE = 5000 # Example batch size
num_batches = (len(id_list) + BATCH_SIZE - 1) // BATCH_SIZE

try:
    for i in range(num_batches):
        start_idx = i * BATCH_SIZE
        end_idx = min((i + 1) * BATCH_SIZE, len(id_list))

        batch_ids = id_list[start_idx:end_idx]
        batch_embeddings = embeddings_list[start_idx:end_idx]

        print(f"Adding batch {i+1}/{num_batches} ({len(batch_ids)} items)...")

        collection.add(
            embeddings=batch_embeddings,
            ids=batch_ids
            # metadatas=[{"id": tid} for tid in batch_ids] # Optional: Add metadata if needed
        )
    print("Finished adding all batches.")

except Exception as e:
    print(f"ERROR adding embeddings to collection '{COLLECTION_NAME}': {e}")
    # Consider potential issues: duplicate IDs, incorrect embedding dimensions, DB connection errors.
    raise

Preparing to add 4900 embeddings to the 'taxpayer_profiles' collection.
Adding batch 1/1 (4900 items)...
Finished adding all batches.


## Verify Indexing

In [7]:
try:
    # Get the total count of items in the collection
    count = collection.count()
    print(f"Verification: Collection '{COLLECTION_NAME}' now contains {count} items.")

    # Assert that the count matches the number of items we intended to add
    expected_count = len(id_list)
    assert count == expected_count, f"Count mismatch! Expected {expected_count}, found {count}."
    print("Verification successful: Item count matches expected count.")

    # Optional: Retrieve a sample item to ensure it was added correctly
    if count > 0:
        sample_id = id_list[0]
        retrieved_item = collection.get(ids=[sample_id], include=['embeddings']) # Can also include 'metadatas', 'documents'
        if retrieved_item and retrieved_item['ids'] and retrieved_item['ids'][0] == sample_id:
            print(f"Successfully retrieved sample item with ID: {sample_id}.")
            # print(f"  Retrieved Embedding (first 5 dims): {retrieved_item['embeddings'][0][:5]}...") # Optional check
        else:
            print(f"Warning: Could not retrieve sample item with ID: {sample_id}.")

except Exception as e:
    print(f"ERROR during verification: {e}")
    raise

Verification: Collection 'taxpayer_profiles' now contains 4900 items.
Verification successful: Item count matches expected count.
Successfully retrieved sample item with ID: TXP_0A78B11C9A.
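
The count and `get` checks confirm that items are stored, but they do not exercise the similarity index. One further sanity check is to query with a stored embedding and see whether its own ID comes back as the nearest neighbor. Below is a minimal sketch that assumes the collection exposes Chroma's `query(query_embeddings=..., n_results=...)` call. `nearest_id` is a hypothetical helper, and `_FakeCollection` is a brute-force stand-in used only so the example runs offline.

```python
def nearest_id(collection, embedding):
    """Return the ID of the single nearest stored vector."""
    result = collection.query(query_embeddings=[embedding], n_results=1)
    # The result holds one list of IDs per query embedding.
    return result["ids"][0][0]


class _FakeCollection:
    """Hypothetical stand-in with an exhaustive squared-L2 query."""

    def __init__(self, items):
        self.items = items  # maps ID -> vector

    def query(self, query_embeddings, n_results):
        query = query_embeddings[0]

        def squared_l2(item_id):
            return sum((a - b) ** 2 for a, b in zip(self.items[item_id], query))

        ranked = sorted(self.items, key=squared_l2)
        return {"ids": [ranked[:n_results]]}


fake = _FakeCollection({"TXP_A": [0.0, 1.0], "TXP_B": [1.0, 0.0]})
is_self_match = nearest_id(fake, [0.0, 1.0]) == "TXP_A"
print(is_self_match)  # True
```

Against the real collection this would be `nearest_id(collection, embeddings_list[0]) == id_list[0]`. Since the index is approximate and two taxpayers may share identical vectors, a mismatch is better treated as a warning to investigate than as a hard failure.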


## Notes on Other Vector Databases

In [8]:
print("-" * 50)
print("""
This notebook used ChromaDB as an example due to its simplicity for local MVPs.
If using other vector databases, the client initialization, collection creation, ingestion, and verification steps would change:

* **Milvus:**
    * Requires connection setup (`pymilvus.connections.connect`, `utility.has_collection`).
    * Need to define a schema for the collection (specifying fields like ID, embedding vector, dimension, index type like HNSW or IVF_FLAT, metric type).
    * Create the collection using the schema (`Collection(...)`).
    * Insert data typically as lists of entities or Pandas DataFrames matching the schema (`collection.insert(...)`).
    * Explicitly create an index on the vector field (`collection.create_index(...)`) and load the collection (`collection.load()`) before querying.

* **Pinecone:**
    * Initialize connection using an API key (`pinecone.init(...)` with an environment in older client versions; `Pinecone(api_key=...)` in newer ones).
    * Create an index (`create_index(...)`) specifying name, dimension, metric, and pod or serverless configuration.
    * Connect to the index (`Index(...)`).
    * Upsert (add or update) data in batches, providing tuples of (ID, vector, optional_metadata) (`index.upsert(vectors=...)`).

* **Others (Weaviate, Qdrant, pgvector, Cloud Services):**
    * Each has its own specific client library, connection methods, schema/collection/index definition process, data insertion API (`.add`, `.upsert`, SQL `INSERT`, etc.), and configuration details (indexing parameters, distance metrics).

Refer to the official documentation of your chosen vector database for the exact API calls and procedures. The core concepts (connect, create structure, index data, verify) remain similar.
""")
print("-" * 50)

## Conclusion

In [10]:
print("Notebook 05 finished.")
print(f"Successfully set up the vector database using ChromaDB (persisted at {CHROMA_PERSIST_DIR}).")
print(f"  - Created or connected to the '{COLLECTION_NAME}' collection.")
print(f"  - Indexed {count} taxpayer profile embeddings.")
print("  - Verified the item count in the collection.")
print("\nThe vector database is now populated and ready for similarity queries.")

Notebook 05 finished.
Successfully set up the vector database using ChromaDB (persisted at ./vector_db/chroma_persist).
  - Created or connected to the 'taxpayer_profiles' collection.
  - Indexed 4900 taxpayer profile embeddings.
  - Verified the item count in the collection.

The vector database is now populated and ready for similarity queries.


Proceed to [Notebook 06](./notebook_06.ipynb): Querying for Cross-Source Similarity.