# NYC Landmarks Vector Database - Query Testing

This notebook tests and demonstrates the vector query capabilities of the NYC Landmarks Vector Database. It shows how to connect to the Pinecone database, run several types of queries, and analyze the results.

## Objectives

1. Test basic vector search functionality
2. Demonstrate filtering capabilities
3. Analyze query performance and result relevance
4. Visualize search results

This notebook represents Phase 1 of the Query API Enhancement, focusing on establishing the foundations for more advanced query capabilities.

## 1. Setup & Imports

First, we'll import the necessary libraries and set up the environment.

In [None]:
# Standard libraries
import sys
import time

# Visualization libraries
import matplotlib.pyplot as plt

# Data analysis libraries
import numpy as np
import seaborn as sns

# Add project directory to path
sys.path.append("..")

# Set visualization style
plt.style.use("seaborn-v0_8-whitegrid")
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (12, 8)

# Set random seed for reproducibility
np.random.seed(42)

In [None]:
# Configure logging
import logging

# Import project modules
from nyc_landmarks.config.settings import settings
from nyc_landmarks.embeddings.generator import EmbeddingGenerator
from nyc_landmarks.vectordb.pinecone_db import PineconeDB

logger = logging.getLogger()
logging.basicConfig(
    level=settings.LOG_LEVEL.value,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)

## 2. Pinecone Connection

Next, we'll establish a connection to the Pinecone vector database and verify the connection.

In [None]:
# Initialize the Pinecone database client
pinecone_db = PineconeDB()

# Check if the connection was successful
if pinecone_db.index:
    print(f"✅ Successfully connected to Pinecone index: {pinecone_db.index_name}")
    print(f"Namespace: {pinecone_db.namespace}")
    print(f"Dimensions: {pinecone_db.dimensions}")
else:
    print(
        "❌ Failed to connect to Pinecone. Check your credentials and network connection."
    )

In [None]:
# Get index statistics
stats = pinecone_db.get_index_stats()

# Check for errors
if "error" in stats:
    print(f"❌ Error retrieving index stats: {stats['error']}")
    # Create fallback mock stats for demonstration
    total_vector_count = 0
    namespaces = {}
else:
    print("✅ Successfully retrieved index stats")
    total_vector_count = stats.get("total_vector_count", 0)
    namespaces = stats.get("namespaces", {})

print("\n📊 Index Statistics:")
print(f"Total Vector Count: {total_vector_count:,}")
print(f"Dimension: {stats.get('dimension')}")
print(f"Index Fullness: {stats.get('index_fullness')}")
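
The `namespaces` mapping retrieved above is never displayed. The following is a minimal sketch of a per-namespace summary. It assumes each entry is either a plain count or a mapping (or object) exposing `vector_count`; `summarize_namespaces` and the sample data are hypothetical and not part of the project.

```python
def summarize_namespaces(namespaces):
    """Return (name, vector_count) pairs sorted by count, largest first."""
    rows = []
    for name, info in namespaces.items():
        if isinstance(info, dict):
            count = info.get("vector_count", 0)
        else:
            # Accept an object with a vector_count attribute, or a bare number
            count = getattr(info, "vector_count", info)
        # An empty name is shown as "(default)" for readability
        rows.append((name or "(default)", int(count)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up sample data so the sketch runs without a Pinecone connection
example = {"landmarks": {"vector_count": 1200}, "": {"vector_count": 35}}
for name, count in summarize_namespaces(example):
    print(f"{name}: {count:,} vectors")
```

In the notebook, the same helper could be called as `summarize_namespaces(namespaces)` once the statistics cell has run.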

## 3. Basic Vector Search Test

Now let's test the basic vector search capabilities using sample queries about NYC landmarks.

In [None]:
# Initialize the embedding generator
embedding_generator = EmbeddingGenerator()

# Define some sample queries about NYC landmarks
sample_queries = [
    "What is the Empire State Building?",
    "Tell me about the Brooklyn Bridge",
    "What are the historic districts in Manhattan?",
    "What is the architectural style of Grand Central Terminal?",
    "When was the Statue of Liberty designated as a landmark?",
]

print(f"Defined {len(sample_queries)} sample queries for testing.")

## Understanding Vector Search Implementation

## PineconeDB Query Implementation

Below is the implementation of the `query_vectors` method from the `PineconeDB` class that we'll be using:

In [None]:
# Source: nyc_landmarks/vectordb/pinecone_db.py

'''
def query_vectors(
    self,
    query_vector: List[float],
    top_k: int = 5,
    filter_dict: Optional[Dict[str, Any]] = None,
) -> List[Dict[str, Any]]:
    """
    Query vectors from Pinecone index.

    Args:
        query_vector: Embedding of the query text
        top_k: Number of results to return
        filter_dict: Dictionary of metadata filters

    Returns:
        List of matching vectors with metadata
    """
    try:
        response = self.index.query(
            vector=query_vector,
            top_k=top_k,
            include_metadata=True,
            filter=filter_dict,
        )

        # Process the response to extract matches
        result_list: List[Dict[str, Any]] = []

        # Handle response.matches which can be a list or other iterable
        # Cast response to Any to handle different return types from Pinecone SDK
        from typing import Any as TypeAny
        from typing import cast

        response_dict = cast(TypeAny, response)

        # Access matches safely
        matches = getattr(response_dict, "matches", [])
        for match in matches:
            # Handle match objects
            match_dict: Dict[str, Any] = {}

            # Extract ID if available
            if hasattr(match, "id"):
                match_dict["id"] = match.id

            # Extract score if available
            if hasattr(match, "score"):
                match_dict["score"] = match.score

            # Extract metadata if available
            if hasattr(match, "metadata"):
                match_dict["metadata"] = match.metadata

            result_list.append(match_dict)

        return result_list

    except Exception as e:
        logger.error(f"Failed to query vectors: {e}")
        return []
'''

# Note: This code block is for reference only and won't be executed

## Our Notebook's Query Function

In this notebook, our `execute_query` function wraps the `PineconeDB.query_vectors` method to provide additional functionality:

1. **Timing and Performance Metrics**: We measure embedding generation time, query execution time, and total time
2. **Automatic Embedding Generation**: Our function handles text-to-vector embedding using `EmbeddingGenerator`
3. **Structured Results**: The response is packaged with both the results and performance metrics

Our implementation is shown below:

In [None]:
# Function to execute a query and measure performance


def execute_query(query_text, top_k=5, filter_dict=None):
    """Execute a vector search query and return the results along with performance metrics."""
    # Start timing
    start_time = time.time()

    # Generate embedding for the query
    embedding_start = time.time()
    query_embedding = embedding_generator.generate_embedding(query_text)
    embedding_time = time.time() - embedding_start

    # Execute the query
    query_start = time.time()
    results = pinecone_db.query_vectors(
        query_vector=query_embedding, top_k=top_k, filter_dict=filter_dict
    )
    query_time = time.time() - query_start

    # Calculate total time
    total_time = time.time() - start_time

    return {
        "query": query_text,
        "embedding": query_embedding,
        "results": results,
        "metrics": {
            "embedding_time": embedding_time,
            "query_time": query_time,
            "total_time": total_time,
            "result_count": len(results),
        },
    }

In [None]:
def display_query_results(query_result, max_results=None, show_metadata=True):
    """Display query results in a readable format.

    Args:
        query_result: The result dictionary returned by execute_query
        max_results: Maximum number of results to display (default: all)
        show_metadata: Whether to display metadata fields (default: True)
    """
    results = query_result["results"]
    if max_results:
        results = results[:max_results]

    print(f"Query: '{query_result['query']}'\n")
    print(
        f"🔍 Found {query_result['metrics']['result_count']} results (showing {len(results)})"
    )
    print(f"⏱️ Total query time: {query_result['metrics']['total_time']:.3f} seconds")
    print("-" * 80)

    for i, match in enumerate(results):
        print(f"\n📌 Result #{i+1} - Score: {match['score']:.4f}")
        print(f"ID: {match['id']}")

        if show_metadata and "metadata" in match:
            metadata = match["metadata"]
            print("\nMetadata:")

            # Print important fields first
            priority_fields = [
                "name",
                "borough",
                "landmark_type",
                "designation_date",
                "neighborhood",
            ]
            for field in priority_fields:
                if field in metadata and metadata[field]:
                    print(f"  {field.capitalize()}: {metadata[field]}")

            # Print content fields if available
            content_fields = ["text_chunk", "description", "text"]
            for field in content_fields:
                if field in metadata and metadata[field]:
                    content = metadata[field]
                    snippet = content[:250] + "..." if len(content) > 250 else content
                    print(f"\n  Content ({field}): {snippet}")
                    break

            # Print other metadata fields that weren't already shown
            other_fields = [
                f
                for f in metadata
                if f not in priority_fields and f not in content_fields
            ]
            if other_fields:
                print("\n  Other metadata:")
                for field in other_fields:
                    if metadata[field]:
                        # Don't print very long values completely
                        if (
                            isinstance(metadata[field], str)
                            and len(metadata[field]) > 50
                        ):
                            value = metadata[field][:50] + "..."
                        else:
                            value = metadata[field]
                        print(f"    {field}: {value}")

        print("-" * 80)

## Direct Query with PineconeDB

You can also query Pinecone directly using the `PineconeDB.query_vectors` method. This requires you to first generate the embedding vector for your query text.

In [None]:
# Example of direct querying with PineconeDB


# 1. First, generate embedding for query text
query_text = "What is the Brooklyn Bridge?"
print(f"Query: '{query_text}'")

# Generate the embedding vector using our embedding generator
embedding = embedding_generator.generate_embedding(query_text)

# 2. Now directly query Pinecone using the embedding vector
print("\nDirectly querying Pinecone with embedding vector...")
results = pinecone_db.query_vectors(query_vector=embedding, top_k=3)

# 3. Display the raw results
print(f"\nFound {len(results)} results from direct query:\n")

# Print raw results to show the direct output format
for i, match in enumerate(results):
    print(f"Result #{i+1}:")
    print(f"  ID: {match['id']}")
    print(f"  Score: {match['score']:.4f}")

    # Print some key metadata if available
    if "metadata" in match and "name" in match["metadata"]:
        print(f"  Name: {match['metadata']['name']}")
    if "metadata" in match and "borough" in match["metadata"]:
        print(f"  Borough: {match['metadata']['borough']}")
    print()

### Comparing Direct Query vs. execute_query Function

Let's compare the direct query approach with our `execute_query` function to see the differences in usage and output format:

In [None]:
# Compare the two approaches with the same query

# The query to test
comparison_query = "What are important landmarks in Manhattan?"
print(f"Query: '{comparison_query}'")

# Approach 1: Direct PineconeDB.query_vectors approach
# First generate embedding
print("\n----- Approach 1: Direct PineconeDB.query_vectors -----")
start_time = time.time()

# Generate embedding
emb_start = time.time()
query_embedding = embedding_generator.generate_embedding(comparison_query)
emb_time = time.time() - emb_start

# Execute direct query
query_start = time.time()
direct_results = pinecone_db.query_vectors(query_vector=query_embedding, top_k=3)
query_time = time.time() - query_start

# Calculate total time
total_direct_time = time.time() - start_time
print(f"Embedding time: {emb_time:.3f}s")
print(f"Query time: {query_time:.3f}s")
print(f"Total time: {total_direct_time:.3f}s")
print(f"Results returned: {len(direct_results)}")

# Approach 2: Using our execute_query function
print("\n----- Approach 2: Using execute_query function -----")
func_start_time = time.time()
function_results = execute_query(comparison_query, top_k=3)
total_function_time = time.time() - func_start_time

print(f"Embedding time: {function_results['metrics']['embedding_time']:.3f}s")
print(f"Query time: {function_results['metrics']['query_time']:.3f}s")
print(f"Total time: {function_results['metrics']['total_time']:.3f}s")
print(f"Results returned: {function_results['metrics']['result_count']}")

# Compare the top result IDs to verify both approaches return the same data
print("\n----- Top Result Comparison -----")
if direct_results and function_results["results"]:
    print(f"Direct query top result ID: {direct_results[0]['id']}")
    print(f"Function query top result ID: {function_results['results'][0]['id']}")

    # Check if they're the same
    print(
        f"\nSame top result: {direct_results[0]['id'] == function_results['results'][0]['id']}"
    )
else:
    print("No results to compare")
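
The timings printed above come from a single run each, so network latency can easily dominate the difference between the two approaches. A rough sketch of repeating a call and summarizing the durations is shown here. `benchmark` is a hypothetical helper, and the workload is a stand-in so the block runs without a Pinecone connection.

```python
import statistics
import time


def benchmark(fn, runs=5):
    """Call fn() `runs` times and summarize wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic clock intended for measuring intervals
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "min": min(durations),
        "max": max(durations),
    }


# Stand-in workload; replace with a real query when connected
summary = benchmark(lambda: sum(range(10_000)), runs=5)
print({key: round(value, 6) for key, value in summary.items()})
```

In the notebook, the stand-in could be replaced with `lambda: execute_query(comparison_query, top_k=3)`, keeping in mind that every run calls the embedding API again.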

### When to Use Each Approach

**Direct Query with `PineconeDB.query_vectors`:**
- When you already have embedding vectors (pre-computed)
- When you need custom embedding generation logic
- For low-level integration with other systems
- When you want complete control over the query process

**Using the `execute_query` Function:**
- For simplicity and convenience
- When you need built-in performance metrics
- For consistent formatting of results
- When working with text queries rather than vectors

Both approaches ultimately use the same underlying `query_vectors` method of the PineconeDB class, but the `execute_query` function provides additional convenience and metrics.

## Demonstration: Using Pre-computed Embeddings

In some scenarios, you might already have embedding vectors computed beforehand, such as:

1. **Batch Processing**: You've pre-computed embeddings for multiple queries to save time
2. **Cached Embeddings**: You're storing common query embeddings to avoid regenerating them
3. **Vector Operations**: You've modified vectors (e.g., combining or averaging multiple embeddings)
4. **External Systems**: You've generated embeddings using a different system or model

Let's demonstrate how to use pre-computed embeddings with Pinecone:

In [None]:
# Demonstration of using pre-computed embeddings

# 1. First, let's generate an embedding for our query
query_text = "Empire State Building history"
print(f"Original query: '{query_text}'")

# Generate and store the embedding vector
start_time = time.time()
pre_computed_embedding = embedding_generator.generate_embedding(query_text)
embedding_time = time.time() - start_time
print(f"Generated embedding in {embedding_time:.3f} seconds")
print(f"Embedding dimensions: {len(pre_computed_embedding)}")

# Show a small snippet of the embedding vector
print("\nEmbedding vector snippet (first 5 elements):")
print(pre_computed_embedding[:5])

# 2. Now imagine this embedding was computed earlier or in a different system
print("\n--- Scenario: Using a pre-computed embedding ---")
print("In a real scenario, this embedding might have been:")
print("- Loaded from a file or database")
print("- Retrieved from a cache")
print("- Generated by another system")
print("- Modified through vector operations")

# 3. Use the pre-computed embedding directly with query_vectors
print("\nQuerying Pinecone with pre-computed embedding...")
query_start = time.time()

# This is the key step - using the pre-computed embedding directly
results = pinecone_db.query_vectors(query_vector=pre_computed_embedding, top_k=3)

query_time = time.time() - query_start
print(f"Query executed in {query_time:.3f} seconds")
print(f"Found {len(results)} results")

# Display results
for i, match in enumerate(results):
    print(f"\nResult #{i+1}:")
    print(f"  ID: {match['id']}")
    print(f"  Score: {match['score']:.4f}")

    # Print name if available
    if "metadata" in match and "name" in match["metadata"]:
        print(f"  Name: {match['metadata']['name']}")
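
The "loaded from a file" scenario mentioned above can be sketched as follows. This assumes embeddings are plain lists of floats and uses JSON for storage. `save_embeddings`, `load_embeddings`, and the toy vector are hypothetical and only illustrate the round trip.

```python
import json
import os
import tempfile


def save_embeddings(cache, path):
    """Write a {query_text: embedding} mapping to a JSON file."""
    serializable = {query: [float(x) for x in vector] for query, vector in cache.items()}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(serializable, handle)


def load_embeddings(path):
    """Read a {query_text: embedding} mapping back from a JSON file."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


# Toy 3-dimensional vector; real embeddings have far more dimensions
toy_cache = {"Empire State Building history": [0.25, -0.5, 0.125]}
path = os.path.join(tempfile.mkdtemp(), "embeddings.json")
save_embeddings(toy_cache, path)
restored = load_embeddings(path)
print(restored == toy_cache)
```

A vector loaded this way could then be passed to `pinecone_db.query_vectors(query_vector=..., top_k=3)` exactly like `pre_computed_embedding` above, provided it was produced by the same embedding model as the indexed vectors.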

### Benefits of Using Pre-computed Embeddings

1. **Performance Optimization**: 
   - Embedding generation is often the most time-consuming part of the search process
   - Pre-computing embeddings can significantly reduce query latency
   - Especially important for real-time applications with strict response time requirements

2. **Cost Reduction**: 
   - API calls to embedding models (like OpenAI) cost money
   - Pre-computing and reusing embeddings reduces API calls

3. **Advanced Vector Operations**:
   - Combine multiple embeddings (e.g., average vectors for multiple related queries)
   - Apply dimensionality reduction techniques
   - Fine-tune or adjust vectors based on relevance feedback

4. **System Architecture Flexibility**:
   - Decouple embedding generation from vector search
   - Different systems or services can handle each part of the process
   - Enables batch processing for large-scale applications
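
As an illustration of the "combine multiple embeddings" idea above, here is a minimal sketch that averages several vectors and rescales the result to unit length. Rescaling is a design choice that suits cosine similarity; whether the project's index uses that metric is an assumption. `average_embeddings` is a hypothetical helper.

```python
import math


def average_embeddings(vectors):
    """Average equal-length vectors and rescale the result to unit length."""
    if not vectors:
        raise ValueError("expected at least one vector")
    dims = len(vectors[0])
    if any(len(vector) != dims for vector in vectors):
        raise ValueError("all vectors must have the same length")
    # Element-wise mean across the input vectors
    mean = [sum(column) / len(vectors) for column in zip(*vectors)]
    norm = math.sqrt(sum(x * x for x in mean))
    # Leave an all-zero mean untouched to avoid dividing by zero
    return [x / norm for x in mean] if norm > 0 else mean


# Toy 2-dimensional vectors standing in for real embeddings
combined = average_embeddings([[1.0, 0.0], [0.0, 1.0]])
print(combined)
```

The combined vector can be used as a query vector in the same way as any single embedding.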

### Practical Applications

- **Autocomplete Systems**: Pre-compute embeddings for common queries
- **Search Optimization**: Cache embeddings for frequent searches
- **Semantic Search Enhancement**: Modify embeddings to improve search quality
- **Cross-Modal Search**: Use embeddings from different modalities (text, image, etc.)

## How Pinecone Stores Vectors and Handles Common Queries

Pinecone is a vector database designed specifically for storing and searching pre-computed vectors. This is a core part of how vector databases work, so let's explore this concept in more detail:

### How Pinecone Stores Vectors

1. **Pre-computed Vectors**: In this project, Pinecone stores *only* pre-computed vectors. The index is not used to generate embeddings.
   - All vectors in the Pinecone database were computed elsewhere and then stored in Pinecone
   - In our NYC Landmarks project, we generated embeddings for all landmark documents using OpenAI's embedding API, then stored those vectors in Pinecone

2. **Vector Storage Structure**:
   - Each vector is stored with a unique ID
   - Each vector is associated with metadata (like the landmark name, borough, etc.)
   - Vectors are organized for efficient similarity searching using specialized indexing structures

3. **Query Process**:
   - When you query Pinecone, you provide a query vector (not text)
   - Pinecone finds the most similar vectors in its database
   - It returns those similar vectors' IDs, similarity scores, and metadata
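
To make the query process above concrete, the following sketch performs an exhaustive cosine-similarity search over a tiny in-memory list. It is illustrative only: Pinecone relies on specialized indexing rather than a full scan, and the similarity metric is configured per index. The function names and toy data are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def brute_force_query(query_vector, stored, top_k=2):
    """Score every stored vector and return the top_k matches, best first."""
    scored = [
        {
            "id": item["id"],
            "score": cosine_similarity(query_vector, item["values"]),
            "metadata": item["metadata"],
        }
        for item in stored
    ]
    return sorted(scored, key=lambda match: match["score"], reverse=True)[:top_k]


# Toy records: an ID, a vector, and metadata, mirroring the storage structure above
stored = [
    {"id": "a", "values": [1.0, 0.0], "metadata": {"name": "Toy landmark A"}},
    {"id": "b", "values": [0.0, 1.0], "metadata": {"name": "Toy landmark B"}},
    {"id": "c", "values": [1.0, 1.0], "metadata": {"name": "Toy landmark C"}},
]
matches = brute_force_query([1.0, 0.1], stored, top_k=2)
print([(match["id"], round(match["score"], 3)) for match in matches])
```

The returned dictionaries have the same `id`, `score`, and `metadata` keys as the results of `query_vectors`.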

### Common Queries and Pre-computation Strategies

For frequently asked questions or common queries, there are several strategies to optimize performance:

1. **Client-Side Caching**: 
   - Store embeddings for common queries in your application
   - When users ask similar questions, use the cached embeddings
   - Example: Cache embeddings for "What is the Empire State Building?", "Brooklyn Bridge history", etc.

2. **Query Pre-processing**:
   - Map various phrasings to canonical queries
   - Use the pre-computed embedding for the canonical form
   - Example: Map "Tell me about Empire State Building" and "What's the Empire State Building?" to the same cached embedding

3. **Vector Quantization**:
   - Create representative vectors for categories of questions
   - Use these as "template" embeddings for classes of queries
   - Example: Have a single "landmark history query" template vector for history-related questions

4. **Hybrid Search Systems**:
   - Use traditional search to identify the query type
   - Use pre-computed embeddings based on the identified query type
   - Example: Identify "Brooklyn Bridge" as the entity, then use a pre-computed "Brooklyn Bridge" embedding

Let's demonstrate a practical example of client-side caching for common queries:

In [None]:
# Demonstration: Client-side caching for common queries

# Let's create a simple embedding cache for common queries
embedding_cache = {}

# Common queries about NYC landmarks that we expect users to ask frequently
common_queries = [
    "What is the Empire State Building?",
    "Tell me about the Brooklyn Bridge",
    "What are the historic districts in Manhattan?",
    "When was the Statue of Liberty designated as a landmark?",
]

print("Generating embeddings for common queries and storing in cache...")

# Pre-compute embeddings for these common queries (in a real system, you might load these from a file)
for query in common_queries:
    # Generate and store the embedding
    start_time = time.time()
    embedding_cache[query] = embedding_generator.generate_embedding(query)
    elapsed = time.time() - start_time
    print(f"  - '{query}': Embedding generated in {elapsed:.3f}s")

print(f"\nCache contains {len(embedding_cache)} pre-computed embeddings\n")

# Simulate a user entering a query that's similar to a cached query
user_query = "What can you tell me about the Brooklyn Bridge?"
print(f"User query: '{user_query}'")

# In a real system, you might use a more sophisticated matching algorithm
# For this example, we'll use a simple substring match
cached_query = None
for query in embedding_cache.keys():
    if "Brooklyn Bridge" in query and "Brooklyn Bridge" in user_query:
        cached_query = query
        break

if cached_query:
    print(f"Found similar cached query: '{cached_query}'")

    # Use the cached embedding instead of generating a new one
    start_time = time.time()
    cached_embedding = embedding_cache[cached_query]
    print(
        f"Retrieved cached embedding in {(time.time() - start_time):.6f}s (vs. ~0.3s to generate new)"
    )

    # Query Pinecone with the cached embedding
    query_start = time.time()
    results = pinecone_db.query_vectors(query_vector=cached_embedding, top_k=3)
    query_time = time.time() - query_start

    print(f"Query executed in {query_time:.3f}s")
    print(f"Found {len(results)} results")

    # Display top result
    if results:
        top_result = results[0]
        print("\nTop result:")
        print(f"  ID: {top_result['id']}")
        print(f"  Score: {top_result['score']:.4f}")

        if "metadata" in top_result and "name" in top_result["metadata"]:
            print(f"  Name: {top_result['metadata']['name']}")
else:
    print("No matching cached query found - would need to generate new embedding")
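
The substring match above only works for one hard-coded entity. A slightly more general, still crude, sketch of the query pre-processing strategy maps different phrasings to a canonical cache key. The stop-word list and `canonical_key` are hypothetical and chosen only for this illustration; a production system would need a more careful matching method.

```python
import re

# Hypothetical stop-word list, chosen only for this illustration
STOP_WORDS = {"a", "about", "an", "can", "is", "me", "tell", "the", "what", "you"}


def canonical_key(query_text):
    """Lowercase, drop punctuation and stop words, and sort the remaining words."""
    words = re.findall(r"[a-z0-9]+", query_text.lower())
    return " ".join(sorted(word for word in words if word not in STOP_WORDS))


key_a = canonical_key("Tell me about the Brooklyn Bridge")
key_b = canonical_key("What can you tell me about the Brooklyn Bridge?")
print(f"'{key_a}' | '{key_b}' | same key: {key_a == key_b}")
```

With keys like these, the cache could be stored as `{canonical_key(query): embedding}` and looked up with `canonical_key(user_query)`, falling back to generating a new embedding on a miss.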

## Test with sample query

In [None]:
# Test query with detailed results display
query = "What is the Empire State Building?"
print(f"Executing query: '{query}'")

try:
    # Execute the query
    result = execute_query(query, top_k=3)

    # Display detailed results
    display_query_results(result)

except Exception as e:
    print(f"Error executing query: {e}")

## 4. Simple Filter Tests

Next, let's test basic filtering capabilities.

In [None]:
# Test with sample query
test_query = sample_queries[0]
print(f"Testing query: '{test_query}'")

try:
    query_result = execute_query(test_query)
    print("\n✅ Query executed successfully")
    print(f"Embedding time: {query_result['metrics']['embedding_time']:.3f}s")
    print(f"Query time: {query_result['metrics']['query_time']:.3f}s")
    print(f"Total time: {query_result['metrics']['total_time']:.3f}s")
    print(f"Results returned: {query_result['metrics']['result_count']}")
except Exception as e:
    print(f"\n❌ Error executing query: {e}")

In [None]:
# Test with a simple filter
try:
    filter_dict = {"borough": "Manhattan"}
    filtered_result = execute_query(test_query, filter_dict=filter_dict)

    print(f"Query: '{test_query}'")
    print("Filter: borough = Manhattan")
    print(f"Results returned: {filtered_result['metrics']['result_count']}")
    print(f"Total time: {filtered_result['metrics']['total_time']:.3f}s")
except Exception as e:
    print(f"Error executing filtered query: {e}")

In [None]:
# Let's try a different query about Brooklyn Bridge
brooklyn_query = sample_queries[1]  # "Tell me about the Brooklyn Bridge"
print(f"Testing query: '{brooklyn_query}'")

try:
    brooklyn_result = execute_query(brooklyn_query)
    print("\n✅ Query executed successfully")
    print(f"Embedding time: {brooklyn_result['metrics']['embedding_time']:.3f}s")
    print(f"Query time: {brooklyn_result['metrics']['query_time']:.3f}s")
    print(f"Total time: {brooklyn_result['metrics']['total_time']:.3f}s")
    print(f"Results returned: {brooklyn_result['metrics']['result_count']}")

    # Display the first result
    if brooklyn_result["results"]:
        first_result = brooklyn_result["results"][0]
        print(f"\nTop result score: {first_result['score']:.4f}")
        metadata = first_result.get("metadata", {})
        if metadata and "name" in metadata:
            print(f"Name: {metadata['name']}")
        if metadata and "borough" in metadata:
            print(f"Borough: {metadata['borough']}")
except Exception as e:
    print(f"\n❌ Error executing query: {e}")

In [None]:
# Let's examine the text content of the top results for the Brooklyn Bridge query
for i, result in enumerate(brooklyn_result["results"][:3]):
    print(f"\nResult #{i+1} - Score: {result['score']:.4f}")

    metadata = result.get("metadata", {})

    # Print basic information
    for key in ["name", "borough", "landmark_type", "designation_date"]:
        if key in metadata:
            print(f"{key.capitalize()}: {metadata[key]}")

    # Print a snippet of text content if available
    for content_key in ["text_chunk", "description"]:
        if content_key in metadata:
            content = metadata[content_key]
            snippet = content[:200] + "..." if len(content) > 200 else content
            print(f"\nSnippet: {snippet}")
            break

    print("---" * 20)

In [None]:
# Visualize similarity scores for the Brooklyn Bridge query
scores = []
names = []

for result in brooklyn_result["results"]:
    scores.append(result["score"])
    metadata = result.get("metadata", {})
    name = metadata.get("name", result["id"])
    names.append(name)

# Create a horizontal bar chart
plt.figure(figsize=(12, 6))
bars = plt.barh(names, scores, color="skyblue")

# Add data labels
for bar in bars:
    width = bar.get_width()
    plt.text(
        width + 0.01,
        bar.get_y() + bar.get_height() / 2,
        f"{width:.4f}",
        va="center",
        fontweight="bold",
    )

# Set chart properties
plt.title(f'Similarity Scores for Query: "{brooklyn_query}"', fontsize=14)
plt.xlabel("Similarity Score")
plt.ylabel("Result Name")
plt.xlim(0, 1.0)
plt.grid(axis="x", linestyle="--", alpha=0.7)
plt.tight_layout()
plt.show()

## 5. Examine Query Results

Let's examine the results of our queries in more detail.

In [None]:
# Display the results of the Manhattan-filtered query in a more readable format
results = filtered_result["results"]

print(f"Query: '{filtered_result['query']}'\n")
print(f"Top {len(results)} results:")

for i, match in enumerate(results):
    print(f"\n{i+1}. Score: {match['score']:.4f}")

    # Extract metadata
    metadata = match.get("metadata", {})

    # Display key metadata fields
    print(f"   ID: {match['id']}")
    if "name" in metadata:
        print(f"   Name: {metadata['name']}")
    if "borough" in metadata:
        print(f"   Borough: {metadata['borough']}")
    if "landmark_type" in metadata:
        print(f"   Type: {metadata['landmark_type']}")
    if "designation_date" in metadata:
        print(f"   Designation Date: {metadata['designation_date']}")

    # Display a snippet of the content if available
    if "text_chunk" in metadata:
        snippet = (
            metadata["text_chunk"][:150] + "..."
            if len(metadata["text_chunk"]) > 150
            else metadata["text_chunk"]
        )
        print(f"   Snippet: {snippet}")

In [None]:
# Visualize the similarity scores


def plot_similarity_scores(results, query_text):
    # Extract scores and names
    scores = [match["score"] for match in results]
    names = [match.get("metadata", {}).get("name", match["id"]) for match in results]

    # Create a horizontal bar chart
    plt.figure(figsize=(12, 6))
    bars = plt.barh(names, scores, color="skyblue")

    # Add data labels
    for bar in bars:
        width = bar.get_width()
        plt.text(
            width + 0.01,
            bar.get_y() + bar.get_height() / 2,
            f"{width:.4f}",
            va="center",
            fontweight="bold",
        )

    # Set chart properties
    plt.title(f'Similarity Scores for Query: "{query_text}"', fontsize=14)
    plt.xlabel("Similarity Score")
    plt.ylabel("Document")
    plt.xlim(0, 1.0)
    plt.grid(axis="x", linestyle="--", alpha=0.7)
    plt.tight_layout()

    return plt


# Plot the results
plot = plot_similarity_scores(filtered_result["results"], filtered_result["query"])
plot.show()
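
Alongside the chart, a few summary numbers can help when judging result relevance. The sketch below assumes results are dictionaries with a `score` key, as returned by `query_vectors`. `summarize_scores` and the toy scores are hypothetical.

```python
import statistics


def summarize_scores(results):
    """Summarize similarity scores; returns None when no scores are present."""
    scores = [match["score"] for match in results if "score" in match]
    if not scores:
        return None
    ordered = sorted(scores, reverse=True)
    return {
        "count": len(ordered),
        "top": ordered[0],
        "mean": statistics.mean(ordered),
        # Difference between the best and worst returned score
        "spread": ordered[0] - ordered[-1],
        # Lead of the best result over the runner-up
        "top_gap": ordered[0] - ordered[1] if len(ordered) > 1 else 0.0,
    }


# Toy scores standing in for real query results
toy_results = [
    {"id": "x", "score": 0.75},
    {"id": "y", "score": 0.5},
    {"id": "z", "score": 0.25},
]
print(summarize_scores(toy_results))
```

A large `top_gap` suggests one clearly dominant match, while a small `spread` suggests the returned chunks are about equally relevant. In the notebook this could be called as `summarize_scores(filtered_result["results"])`.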

## 6. Advanced Filtering

Let's test more advanced filtering capabilities by combining multiple criteria.

In [None]:
# Test with multiple filter conditions
query_text = "What are the historic districts?"
filter_conditions = {"borough": "Brooklyn", "landmark_type": "Historic District"}

print(f"Query: '{query_text}'")
print(f"Filters: {filter_conditions}")

try:
    multi_filter_result = execute_query(
        query_text, filter_dict=filter_conditions, top_k=10
    )
    print("\n✅ Query executed successfully")
    print(f"Total time: {multi_filter_result['metrics']['total_time']:.3f}s")
    print(f"Results returned: {multi_filter_result['metrics']['result_count']}")

    # Print the names of the returned landmarks
    if multi_filter_result["results"]:
        print("\nReturned Brooklyn Historic Districts:")
        for i, result in enumerate(multi_filter_result["results"]):
            name = result.get("metadata", {}).get("name", "Unknown")
            print(f"{i+1}. {name} (Score: {result['score']:.4f})")
    else:
        print("\nNo results found matching the filters")

except Exception as e:
    print(f"\n❌ Error executing query: {e}")

In [None]:
# Let's try with just the landmark_type filter
query_text = "What are the historic districts?"
filter_conditions = {"landmark_type": "Historic District"}

print(f"Query: '{query_text}'")
print(f"Filters: {filter_conditions}")

try:
    filter_result = execute_query(query_text, filter_dict=filter_conditions, top_k=5)
    print("\n✅ Query executed successfully")
    print(f"Total time: {filter_result['metrics']['total_time']:.3f}s")
    print(f"Results returned: {filter_result['metrics']['result_count']}")

    # Print the names and boroughs of the returned landmarks
    if filter_result["results"]:
        print("\nReturned Historic Districts:")
        for i, result in enumerate(filter_result["results"]):
            metadata = result.get("metadata", {})
            name = metadata.get("name", "Unknown")
            borough = metadata.get("borough", "Unknown")
            print(f"{i+1}. {name} (Borough: {borough}, Score: {result['score']:.4f})")
    else:
        print("\nNo results found matching the filters")

except Exception as e:
    print(f"\n❌ Error executing query: {e}")

In [None]:
# Now let's display the detailed results of the filter query using our display function
# We'll try with the borough filter since it worked in previous tests
filter_query = "Historic buildings"
filter_conditions = {"borough": "Brooklyn"}

print(f"Executing filtered query: '{filter_query}' with filters: {filter_conditions}\n")

try:
    # Execute the filtered query
    brooklyn_result = execute_query(
        filter_query, filter_dict=filter_conditions, top_k=3
    )

    # Display detailed results using our function
    display_query_results(brooklyn_result)

except Exception as e:
    print(f"Error executing query: {e}")

## 7. Exploring Metadata

To better understand the database, let's explore what metadata is available by performing some simple queries without filters.
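
One way to summarize what comes back is to count how often each metadata field appears across results, which indicates which fields are populated consistently enough to filter on. The sketch below assumes results shaped like the `results` list used throughout this notebook (an optional `metadata` dict per result). `field_coverage` is a hypothetical helper and the sample records are made up.

```python
from collections import Counter
from typing import Any, Dict, List


def field_coverage(results: List[Dict[str, Any]]) -> Dict[str, float]:
    """Return the fraction of results in which each metadata field is present."""
    if not results:
        return {}
    counts: Counter = Counter()
    for result in results:
        # Results without a metadata dict contribute no fields
        counts.update(result.get("metadata", {}).keys())
    return {field: count / len(results) for field, count in counts.items()}


# Made-up results for illustration; real results come from execute_query
sample_results = [
    {"score": 0.91, "metadata": {"name": "Example A", "borough": "Brooklyn"}},
    {"score": 0.88, "metadata": {"name": "Example B", "borough": "Queens", "style": "Gothic Revival"}},
    {"score": 0.80, "metadata": {"name": "Example C"}},
    {"score": 0.75},
]
coverage = field_coverage(sample_results)
for field, fraction in sorted(coverage.items()):
    print(f"{field}: {fraction:.0%}")
```

A field with low coverage is a likely cause of empty result sets when it is used as a filter.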

In [None]:
# Perform a general query to explore available metadata
general_query = "New York City landmarks"
print(f"Query: '{general_query}'")

try:
    general_result = execute_query(general_query, top_k=10)
    print("\n✅ Query executed successfully")
    print(f"Results returned: {general_result['metrics']['result_count']}")

    # Extract and display available metadata fields
    print("\nExploring metadata in results:")

    metadata_fields = set()
    landmark_types = set()
    boroughs = set()

    for result in general_result["results"]:
        metadata = result.get("metadata", {})
        metadata_fields.update(metadata.keys())

        if "landmark_type" in metadata:
            landmark_types.add(metadata["landmark_type"])

        if "borough" in metadata:
            boroughs.add(metadata["borough"])

    print(f"\nAvailable metadata fields: {sorted(metadata_fields)}")
    print(f"\nUnique landmark types: {sorted(landmark_types)}")
    print(f"\nUnique boroughs: {sorted(boroughs)}")

except Exception as e:
    print(f"\n❌ Error executing query: {e}")

In [None]:
# Let's try filtering by borough
query_text = "Historic buildings"
filter_conditions = {"borough": "Brooklyn"}

print(f"Query: '{query_text}'")
print(f"Filters: {filter_conditions}")

try:
    borough_filter_result = execute_query(
        query_text, filter_dict=filter_conditions, top_k=5
    )
    print("\n✅ Query executed successfully")
    print(f"Total time: {borough_filter_result['metrics']['total_time']:.3f}s")
    print(f"Results returned: {borough_filter_result['metrics']['result_count']}")

    # Print the names of the returned landmarks
    if borough_filter_result["results"]:
        print("\nReturned Brooklyn landmarks:")
        for i, result in enumerate(borough_filter_result["results"]):
            metadata = result.get("metadata", {})
            name = metadata.get("name", "Unknown")
            print(f"{i+1}. {name} (Score: {result['score']:.4f})")

            # Print additional metadata if available
            for key in ["style", "year_built", "neighborhood"]:
                if key in metadata and metadata[key]:
                    print(f"   {key.capitalize()}: {metadata[key]}")
    else:
        print("\nNo results found matching the filters")

except Exception as e:
    print(f"\n❌ Error executing query: {e}")

## 8. Summary and Conclusions

In this notebook, we successfully tested the NYC Landmarks Vector Database's query capabilities. Here's what we accomplished:

1. **Basic Vector Search**: Successfully performed semantic search on NYC landmark data
2. **Filtering**: Demonstrated the ability to filter by borough and other metadata fields
3. **Performance**: Tracked query times, with most queries returning in under 1 second
4. **Metadata Exploration**: Discovered the structure and available fields in the database
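
The timing observation can be checked by aggregating the `total_time` values recorded in each result's `metrics`. A minimal sketch, using made-up timings and a hypothetical `summarize_latency` helper:

```python
import statistics
from typing import Dict, List


def summarize_latency(times_s: List[float], threshold_s: float = 1.0) -> Dict[str, float]:
    """Summarize query times (in seconds) and the share under a threshold."""
    if not times_s:
        raise ValueError("times_s must not be empty")
    under = sum(1 for t in times_s if t < threshold_s)
    return {
        "count": len(times_s),
        "median_s": statistics.median(times_s),
        "max_s": max(times_s),
        "share_under_threshold": under / len(times_s),
    }


# Made-up timings; in the notebook these would be collected from
# result["metrics"]["total_time"] for each executed query
example_times = [0.42, 0.55, 0.61, 0.48, 1.30]
latency_summary = summarize_latency(example_times)
print(latency_summary)
```

Reporting the median alongside the maximum keeps a single slow query, such as a cold first request, from distorting the overall picture.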

### Next Steps

Future work could include:

1. Testing more advanced filter combinations
2. Visualizing search results geographically on a map
3. Implementing relevance feedback to improve search results
4. Expanding the API to support more complex queries