# Querying for Cross-Source Similarity

**Purpose**:
  1. Connect to the vector database and collection initialized in Notebook 05.
  2. Identify a specific taxpayer profile in the synthetic data known to exhibit
     suspicious cross-source patterns (e.g., low declared income combined with
     high-value property ownership, based on Notebook 00 generation logic).
  3. Retrieve the vector embedding for this chosen 'query' profile from the DB.
  4. Perform a similarity search (query) against the vector database using the
     query vector to find the N most similar taxpayer profiles based on their
     overall characteristics captured in the embeddings.
  5. Extract and display the Taxpayer IDs and similarity scores/distances of the
     top results.
  6. Save the query results for detailed analysis in the next notebook.

**Demonstrating Value**:
  This notebook showcases the core functionality: leveraging the unified profile
  embeddings and vector search to proactively identify entities that resemble a
  known 'bad' or suspicious pattern, even when that pattern requires combining
  information from multiple original sources.

**Prerequisites**:
  - Successful completion of [Notebooks 00-05](../).
  - Vector database populated with embeddings and IDs (via [Notebook 05](./notebook_05.ipynb)).
  - Existence of the original synthetic data files (needed to identify the query profile based on Notebook 00 logic).
  - Vector Database client library installed (e.g., `pip install chromadb`).

**Outputs**:
  - The Taxpayer ID of the selected query profile.
  - A list of Taxpayer IDs for the N most similar profiles found.
  - Corresponding similarity scores/distances.
  - These results saved to a file (e.g., 'query_results.json').

**Next Step**:
  [Notebook 07](./notebook_07.ipynb) will analyze the original source data for these identified similar
  profiles to see if they consistently exhibit the suspicious cross-source pattern.

## Imports and Configuration

In [1]:
import pandas as pd
import numpy as np
import os
import chromadb # MVP Example Client
import json # For saving results

# --- Configuration ---
# Vector DB / Collection Config (should match N05)
VECTOR_DB_DIR = './vector_db'
CHROMA_PERSIST_DIR = os.path.join(VECTOR_DB_DIR, 'chroma_persist')
COLLECTION_NAME = "taxpayer_profiles"

# Original Data Sources (needed to find query profile)
DATA_DIR = './data'
TAX_FILE_RAW = os.path.join(DATA_DIR, 'synthetic_tax_filings.csv')
PROP_FILE_RAW = os.path.join(DATA_DIR, 'synthetic_property_ownership.csv')

# Query Parameters
# Define thresholds based on Notebook 00's fraud pattern generation
# These values should ideally match those used in Notebook 00's configuration
FRAUD_LOW_INCOME_MAX = 20000  # Example threshold
FRAUD_HIGH_PROP_VALUE_MIN = 800000 # Example threshold
N_RESULTS = 10 # Number of similar profiles to retrieve (excluding the query profile itself)

# Output file
RESULTS_OUTPUT_FILE = os.path.join('./data/processed', 'query_results.json') # Save query + results
OUTPUT_DIR = os.path.dirname(RESULTS_OUTPUT_FILE)
os.makedirs(OUTPUT_DIR, exist_ok=True) # Ensure output dir exists


print("Notebook 06: Querying for Cross-Source Similarity")
print("-" * 50)
print(f"Connecting to ChromaDB collection '{COLLECTION_NAME}' from: {CHROMA_PERSIST_DIR}")
print(f"Will query for top {N_RESULTS} similar profiles.")
print(f"Query profile selection criteria: Income <= {FRAUD_LOW_INCOME_MAX}, Max Property Value >= {FRAUD_HIGH_PROP_VALUE_MIN}")
print(f"Saving results to: {RESULTS_OUTPUT_FILE}")
print("-" * 50)

Notebook 06: Querying for Cross-Source Similarity
--------------------------------------------------
Connecting to ChromaDB collection 'taxpayer_profiles' from: ./vector_db/chroma_persist
Will query for top 10 similar profiles.
Query profile selection criteria: Income <= 20000, Max Property Value >= 800000
Saving results to: ./data/processed/query_results.json
--------------------------------------------------


## Connect to Vector Database and Collection

In [2]:
try:
    client = chromadb.PersistentClient(path=CHROMA_PERSIST_DIR)
    print("ChromaDB Persistent Client initialized.")

    # Get the existing collection
    collection = client.get_collection(name=COLLECTION_NAME)
    print(f"Successfully connected to collection: '{collection.name}'.")
    item_count = collection.count() # Count once and reuse the value
    print(f"Items currently in collection: {item_count}")
    if item_count == 0:
        print("ERROR: Collection is empty. Please run Notebook 05 first.")
        raise ValueError("Collection is empty")

except Exception as e:
    print(f"ERROR connecting to ChromaDB or getting collection '{COLLECTION_NAME}': {e}")
    print("Ensure the path is correct and Notebook 05 ran successfully.")
    raise

ChromaDB Persistent Client initialized.
Successfully connected to collection: 'taxpayer_profiles'.
Items currently in collection: 4900


## Identify a Suspicious Query Profile

In [3]:
print(f"Searching original data for a profile with Income <= {FRAUD_LOW_INCOME_MAX} and owns property >= {FRAUD_HIGH_PROP_VALUE_MIN}...")

try:
    # Load original source data
    tax_df_raw = pd.read_csv(TAX_FILE_RAW)
    prop_df_raw = pd.read_csv(PROP_FILE_RAW)
    print("Loaded original tax and property data.")

    # Find IDs with low income
    low_income_ids = set(tax_df_raw[tax_df_raw['Declared Income'] <= FRAUD_LOW_INCOME_MAX]['Taxpayer ID'].astype(str))
    print(f"Found {len(low_income_ids)} IDs with income <= {FRAUD_LOW_INCOME_MAX}.")

    # Find IDs with high property value (check max value if multiple properties)
    high_value_prop_ids = set(prop_df_raw[prop_df_raw['Property Value'] >= FRAUD_HIGH_PROP_VALUE_MIN]['Taxpayer ID'].astype(str))
    print(f"Found {len(high_value_prop_ids)} IDs owning property >= {FRAUD_HIGH_PROP_VALUE_MIN}.")

    # Find IDs present in BOTH sets (intersection)
    suspicious_ids = list(low_income_ids.intersection(high_value_prop_ids))
    print(f"Found {len(suspicious_ids)} IDs matching BOTH criteria (low income AND high property value).")

    if not suspicious_ids:
        print("ERROR: No Taxpayer IDs found matching the specified suspicious pattern.")
        print("Check the thresholds or the data generation logic in Notebook 00.")
        raise ValueError("No matching suspicious profiles found.")

    # Select one ID as our query profile
    # Let's check if this ID exists in our vector DB collection
    query_taxpayer_id = None
    available_ids_in_db = set(collection.get(include=[])['ids']) # Get all IDs efficiently

    for potential_id in suspicious_ids:
        if potential_id in available_ids_in_db:
            query_taxpayer_id = potential_id
            break # Use the first one found that's definitely in the DB

    if query_taxpayer_id is None:
        print("ERROR: None of the identified suspicious IDs were found in the vector DB collection.")
        print("This might indicate an issue with ID consistency or the indexing process.")
        raise ValueError("Identified suspicious IDs not found in vector DB.")

    print(f"Selected Query Taxpayer ID: {query_taxpayer_id}")

except FileNotFoundError:
    print("ERROR: Could not find original synthetic data files needed to select query profile.")
    print(f"Ensure '{TAX_FILE_RAW}' and '{PROP_FILE_RAW}' exist.")
    raise
except Exception as e:
    print(f"An error occurred during query profile selection: {e}")
    raise


Searching original data for a profile with Income <= 20000 and owns property >= 800000...
Loaded original tax and property data.
Found 500 IDs with income <= 20000.
Found 493 IDs owning property >= 800000.
Found 141 IDs matching BOTH criteria (low income AND high property value).
Selected Query Taxpayer ID: TXP_595479B867
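
The selection above intersects two ID sets. The same cross-source rule can also be written as a join, which makes the "maximum property value per taxpayer" step explicit. The following is a minimal sketch on toy data, not the notebook's actual method: the column names match those used above, while the toy values and the names `max_prop`, `joined`, and `flagged_ids` are illustrative assumptions.

```python
import pandas as pd

# Toy stand-ins for the two source files (values are illustrative only).
tax = pd.DataFrame({
    "Taxpayer ID": ["A", "B", "C"],
    "Declared Income": [15000, 90000, 18000],
})
prop = pd.DataFrame({
    "Taxpayer ID": ["A", "A", "B", "C"],
    "Property Value": [300000, 950000, 1200000, 400000],
})

LOW_INCOME_MAX = 20000        # same role as FRAUD_LOW_INCOME_MAX
HIGH_PROP_VALUE_MIN = 800000  # same role as FRAUD_HIGH_PROP_VALUE_MIN

# Collapse to one row per taxpayer: the most valuable property they own.
max_prop = (
    prop.groupby("Taxpayer ID", as_index=False)["Property Value"]
    .max()
    .rename(columns={"Property Value": "Max Property Value"})
)

# An inner join keeps only taxpayers present in both sources.
joined = tax.merge(max_prop, on="Taxpayer ID", how="inner")

# Apply both conditions at once: low income AND high maximum property value.
flagged = joined[
    (joined["Declared Income"] <= LOW_INCOME_MAX)
    & (joined["Max Property Value"] >= HIGH_PROP_VALUE_MIN)
]

# Sorting gives a deterministic order, unlike iterating over a set.
flagged_ids = sorted(flagged["Taxpayer ID"].astype(str))
print(flagged_ids)
```

Only taxpayer `A` satisfies both conditions here: `B` has high income, and `C` owns no property above the threshold.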


## Retrieve Query Vector

In [4]:
try:
    # Fetch the embedding for the selected Taxpayer ID
    query_result = collection.get(
        ids=[query_taxpayer_id],
        include=['embeddings'] # We only need the embedding vector
    )

    # Use .get() so a missing 'embeddings' key yields None instead of raising KeyError
    embeddings_value = query_result.get('embeddings')

    # Check explicitly for None or an empty result. Depending on the Chroma version,
    # the value can be a list or a NumPy array, and `if not array` is ambiguous.
    if embeddings_value is None or len(embeddings_value) == 0:
        print(f"ERROR: Could not retrieve a valid embedding for query ID: {query_taxpayer_id}")
        print(f"Debug: Embeddings value received: {embeddings_value}") # Show what was received
        raise ValueError("Failed to retrieve valid query embedding.")

    # If the check passes, we have at least one embedding. Get the first one.
    query_vector = embeddings_value[0]
    print(f"Successfully retrieved embedding vector for ID {query_taxpayer_id}.")
    # print(f"Query vector (first 10 dims): {query_vector[:10]}...") # Optional print

except Exception as e:
    print(f"ERROR retrieving embedding from ChromaDB: {e}")
    raise

Successfully retrieved embedding vector for ID TXP_595479B867.


## Perform Similarity Query

In [5]:
print(f"Querying collection '{COLLECTION_NAME}' to find top {N_RESULTS} profiles similar to ID {query_taxpayer_id}...")

try:
    # Perform the query
    # Request N+1 results because the query item itself is usually the closest match
    similarity_results = collection.query(
        query_embeddings=[query_vector], # Chroma expects a list of query embeddings
        n_results=N_RESULTS + 1,
        include=['distances'] # Chroma returns distances (lower = more similar). Add 'metadatas' if you stored them.
    )
    print("Query executed successfully.")

except Exception as e:
    print(f"ERROR during ChromaDB query: {e}")
    raise

Querying collection 'taxpayer_profiles' to find top 10 profiles similar to ID TXP_595479B867...
Query executed successfully.
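
To make concrete what the query computes, here is a brute-force version of the same idea in NumPy. It is a sketch under stated assumptions: it uses squared Euclidean distance (Chroma's default metric unless the collection was configured differently in Notebook 05), and it scans every vector exactly, whereas Chroma uses an approximate index. The function name `brute_force_query` and the toy vectors are hypothetical.

```python
import numpy as np

def brute_force_query(embeddings, ids, query_vector, n_results):
    """Return the n_results closest (id, squared L2 distance) pairs, nearest first."""
    embeddings = np.asarray(embeddings, dtype=float)
    query_vector = np.asarray(query_vector, dtype=float)

    # Squared Euclidean distance from the query to every stored vector.
    diffs = embeddings - query_vector
    distances = np.einsum("ij,ij->i", diffs, diffs)

    # Indices of the smallest distances; a stable sort keeps ties in input order.
    order = np.argsort(distances, kind="stable")[:n_results]
    return [(ids[i], float(distances[i])) for i in order]

# Toy collection: the query vector is itself one of the stored items.
toy_ids = ["q", "near", "far"]
toy_embeddings = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]]

hits = brute_force_query(toy_embeddings, toy_ids, toy_embeddings[0], n_results=2)
for item_id, dist in hits:
    print(f"{item_id}: {dist:.4f}")
```

The stored copy of the query comes back first with distance 0, which is why the notebook asks for `N_RESULTS + 1` items and then drops the query ID.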


## Process and Display Results

In [6]:
# The results object is a dictionary containing lists of lists (one list per query vector)
if not similarity_results or not similarity_results.get('ids') or not similarity_results['ids'][0]:
    print("Warning: Query returned no results.")
    top_n_similar = [] # Defined here too, because the save step reads it
    similar_ids = []
    distances = []
else:
    result_ids = similarity_results['ids'][0]
    result_distances = similarity_results['distances'][0] # Lower distance = more similar

    print(f"\nRaw results (Top {len(result_ids)}):")
    for i, (res_id, dist) in enumerate(zip(result_ids, result_distances)):
        print(f"  {i+1}. ID: {res_id}, Distance: {dist:.4f}")

    # Filter out the query profile itself from the results
    filtered_results = []
    for res_id, dist in zip(result_ids, result_distances):
        if res_id != query_taxpayer_id:
            filtered_results.append({'Taxpayer ID': res_id, 'Distance': dist})

    # Keep only the top N results after filtering
    top_n_similar = filtered_results[:N_RESULTS]

    print(f"\nQuery Profile ID: {query_taxpayer_id}")
    print(f"\nTop {len(top_n_similar)} Most Similar Profiles (excluding query profile):")
    if top_n_similar:
        results_df = pd.DataFrame(top_n_similar)
        print(results_df)
        similar_ids = results_df['Taxpayer ID'].tolist()
        distances = results_df['Distance'].tolist()
    else:
        print("No other similar profiles found within the requested limit.")
        similar_ids = []
        distances = []



Raw results (Top 11):
  1. ID: TXP_595479B867, Distance: 0.0000
  2. ID: TXP_3BF21F2AE7, Distance: 0.1354
  3. ID: TXP_BAA950F168, Distance: 0.1524
  4. ID: TXP_D33C9D72AD, Distance: 0.1560
  5. ID: TXP_D9351E6FD0, Distance: 0.1891
  6. ID: TXP_2D85DCB938, Distance: 0.1900
  7. ID: TXP_8ACB54CE81, Distance: 0.2235
  8. ID: TXP_58584DB75E, Distance: 0.2265
  9. ID: TXP_20B925A862, Distance: 0.2328
  10. ID: TXP_248D37CEF3, Distance: 0.2391
  11. ID: TXP_A2E4FF8EC6, Distance: 0.2515

Query Profile ID: TXP_595479B867

Top 10 Most Similar Profiles (excluding query profile):
      Taxpayer ID  Distance
0  TXP_3BF21F2AE7  0.135367
1  TXP_BAA950F168  0.152423
2  TXP_D33C9D72AD  0.156041
3  TXP_D9351E6FD0  0.189115
4  TXP_2D85DCB938  0.189992
5  TXP_8ACB54CE81  0.223525
6  TXP_58584DB75E  0.226491
7  TXP_20B925A862  0.232777
8  TXP_248D37CEF3  0.239142
9  TXP_A2E4FF8EC6  0.251476
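
A quick sanity check before the detailed analysis is to measure how many of the returned profiles already belong to the rule-based suspicious set found earlier. The helper below is a hypothetical sketch (the name `hit_rate` and the example IDs are not part of the original notebook); it only counts overlap and says nothing about why a profile was returned.

```python
def hit_rate(result_ids, known_ids):
    """Fraction of result_ids that appear in known_ids (0.0 when there are no results)."""
    if not result_ids:
        return 0.0
    known = set(known_ids)  # Set lookup keeps the membership test cheap
    matches = sum(1 for rid in result_ids if rid in known)
    return matches / len(result_ids)

# Illustrative IDs only, not taken from the synthetic data.
example_results = ["T1", "T2", "T3", "T4"]
example_known = {"T1", "T3", "T9"}

rate = hit_rate(example_results, example_known)
print(f"Hit rate: {rate:.2f}")
```

In this notebook the equivalent call would be `hit_rate(similar_ids, suspicious_ids)`, using the variables defined in the cells above.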


## Save Query Results for Next Step

In [7]:
query_output = {
    'query_taxpayer_id': query_taxpayer_id,
    'similar_profiles': top_n_similar # List of dictionaries [{'Taxpayer ID': id, 'Distance': dist}, ...]
}

try:
    with open(RESULTS_OUTPUT_FILE, 'w') as f:
        json.dump(query_output, f, indent=4)
    print(f"Successfully saved query ID and similar profile results to: {RESULTS_OUTPUT_FILE}")
except Exception as e:
    print(f"ERROR saving query results to JSON: {e}")
    raise # Notebook 07 depends on this file, so do not continue silently

Successfully saved query ID and similar profile results to: ./data/processed/query_results.json
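
The saved file has two keys: `query_taxpayer_id` and `similar_profiles`, a list of `{'Taxpayer ID': ..., 'Distance': ...}` dictionaries. A downstream notebook could read it back along the following lines. This is a sketch under assumptions: `load_query_results` is a hypothetical helper, and the demo writes a small sample to a temporary directory rather than touching the real output file.

```python
import json
import os
import tempfile

def load_query_results(path):
    """Read a saved results file and return (query_id, similar_ids, distances)."""
    with open(path) as f:
        payload = json.load(f)
    query_id = payload["query_taxpayer_id"]
    profiles = payload["similar_profiles"]
    similar_ids = [p["Taxpayer ID"] for p in profiles]
    distances = [p["Distance"] for p in profiles]
    return query_id, similar_ids, distances

# Round-trip demo with placeholder IDs, written to a throwaway directory.
sample = {
    "query_taxpayer_id": "TXP_EXAMPLE",
    "similar_profiles": [{"Taxpayer ID": "TXP_OTHER", "Distance": 0.25}],
}
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "query_results.json")
    with open(sample_path, "w") as f:
        json.dump(sample, f, indent=4)
    loaded = load_query_results(sample_path)

print(loaded)
```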


## Conclusion

In [8]:
print("Notebook 06 finished.")
print(f"  - Identified a query profile ({query_taxpayer_id}) exhibiting suspicious cross-source patterns.")
print("  - Retrieved its embedding and queried the vector database.")
print(f"  - Found the top {len(similar_ids)} similar profiles based on embedding similarity.")
print("  - Saved the query ID and results for further analysis.")

Notebook 06 finished.
  - Identified a query profile (TXP_595479B867) exhibiting suspicious cross-source patterns.
  - Retrieved its embedding and queried the vector database.
  - Found the top 10 similar profiles based on embedding similarity.
  - Saved the query ID and results for further analysis.


Ready to proceed to [Notebook 07](./notebook_07.ipynb): Analyzing Cross-Source Patterns in Similar Profiles.