# NYC Landmarks Vector Database - Query Testing

This notebook tests and demonstrates the vector query capabilities of the NYC Landmarks Vector Database. It shows how to connect to the Pinecone database, execute several types of queries, and analyze the results.

## Objectives

1. Test basic vector search functionality
2. Demonstrate filtering capabilities
3. Analyze query performance and result relevance
4. Visualize search results

This notebook represents Phase 1 of the Query API Enhancement, focusing on establishing the foundations for more advanced query capabilities.

## 1. Setup & Imports

First, we'll import the necessary libraries and set up the environment.

In [None]:
# Standard libraries
import sys
import time

# Visualization libraries
import matplotlib.pyplot as plt

# Data analysis libraries
import numpy as np
import seaborn as sns

# Add project directory to path
sys.path.append("..")

# Set visualization style
plt.style.use("seaborn-v0_8-whitegrid")
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (12, 8)

# Set random seed for reproducibility
np.random.seed(42)

In [None]:
# Configure logging
import logging

# Import project modules
from nyc_landmarks.config.settings import settings
from nyc_landmarks.embeddings.generator import EmbeddingGenerator
from nyc_landmarks.vectordb.pinecone_db import PineconeDB

logger = logging.getLogger()
logging.basicConfig(
    level=settings.LOG_LEVEL.value,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)

## 2. Pinecone Connection

Next, we'll establish a connection to the Pinecone vector database and verify the connection.

In [None]:
# Initialize the Pinecone database client
pinecone_db = PineconeDB()

# Check if the connection was successful
if pinecone_db.index:
    print(f"✅ Successfully connected to Pinecone index: {pinecone_db.index_name}")
    print(f"Namespace: {pinecone_db.namespace}")
    print(f"Dimensions: {pinecone_db.dimensions}")
else:
    print(
        "❌ Failed to connect to Pinecone. Check your credentials and network connection."
    )

In [None]:
# Get index statistics
stats = pinecone_db.get_index_stats()

# Check for errors
if "error" in stats:
    print(f"❌ Error retrieving index stats: {stats['error']}")
    # Fall back to empty values so the summary below still prints
    total_vector_count = 0
    namespaces = {}
else:
    print("✅ Successfully retrieved index stats")
    total_vector_count = stats.get("total_vector_count", 0)
    namespaces = stats.get("namespaces", {})

print("\n📊 Index Statistics:")
print(f"Total Vector Count: {total_vector_count:,}")
print(f"Dimension: {stats.get('dimension')}")
print(f"Index Fullness: {stats.get('index_fullness')}")
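The cell above retrieves the `namespaces` mapping but never prints it. A minimal sketch of how it could be summarized, assuming each namespace entry is a dictionary with a `vector_count` key (the shape Pinecone reports for index statistics; whether `get_index_stats()` preserves it is an assumption). `summarize_namespaces` is a hypothetical helper, and the numbers are illustrative only.

```python
def summarize_namespaces(stats):
    """Return (namespace, vector_count) pairs, largest first.

    Assumes stats["namespaces"] maps namespace names to dicts that carry a
    "vector_count" key; a missing key is treated as zero.
    """
    namespaces = stats.get("namespaces") or {}
    rows = [
        # An empty namespace name is shown as "(default)"
        (name or "(default)", info.get("vector_count", 0))
        for name, info in namespaces.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up statistics, not values from the real index
example_stats = {
    "total_vector_count": 150,
    "namespaces": {"": {"vector_count": 30}, "landmarks": {"vector_count": 120}},
}
for name, count in summarize_namespaces(example_stats):
    print(f"{name}: {count:,} vectors")
```

In the notebook, the same helper could be called as `summarize_namespaces(stats)` once the stats have been retrieved without error.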

## 3. Basic Vector Search Test

Now let's test the basic vector search capabilities using sample queries about NYC landmarks.

In [None]:
# Initialize the embedding generator
embedding_generator = EmbeddingGenerator()

# Define some sample queries about NYC landmarks
sample_queries = [
    "What is the Empire State Building?",
    "Tell me about the Brooklyn Bridge",
    "What are the historic districts in Manhattan?",
    "What is the architectural style of Grand Central Terminal?",
    "When was the Statue of Liberty designated as a landmark?",
]

print(f"Defined {len(sample_queries)} sample queries for testing.")

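Conceptually, a vector search embeds the query text and ranks stored vectors by similarity to that embedding. The sketch below does this by brute force over a toy list, assuming cosine similarity as the metric (the notebook does not state which metric the index uses). A production index uses approximate search rather than this exhaustive scan, and `cosine_top_k` is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_top_k(query_vector, vectors, k=5):
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query_vector, v)) for i, v in enumerate(vectors)]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy two-dimensional vectors standing in for real embeddings
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for index, score in cosine_top_k([1.0, 0.0], toy_vectors, k=2):
    print(f"vector {index}: {score:.4f}")
```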

In [None]:
# Function to execute a query and measure performance


def execute_query(query_text, top_k=5, filter_dict=None):
    """Execute a vector search query and return the results along with performance metrics."""
    # Start timing (perf_counter is a monotonic clock suited to measuring intervals)
    start_time = time.perf_counter()

    # Generate embedding for the query
    embedding_start = time.perf_counter()
    query_embedding = embedding_generator.generate_embedding(query_text)
    embedding_time = time.perf_counter() - embedding_start

    # Execute the query
    query_start = time.perf_counter()
    results = pinecone_db.query_vectors(
        query_vector=query_embedding, top_k=top_k, filter_dict=filter_dict
    )
    query_time = time.perf_counter() - query_start

    # Calculate total time
    total_time = time.perf_counter() - start_time

    return {
        "query": query_text,
        "embedding": query_embedding,
        "results": results,
        "metrics": {
            "embedding_time": embedding_time,
            "query_time": query_time,
            "total_time": total_time,
            "result_count": len(results),
        },
    }
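A single timing is noisy, so it can help to run several queries and summarize the timings together. The sketch below assumes a list of dictionaries shaped like the return value of `execute_query` (a `metrics` dictionary holding `total_time`). `summarize_latency` is a hypothetical helper, and the timings are invented for illustration.

```python
import statistics


def summarize_latency(runs, key="total_time"):
    """Summarize one timing field across several execute_query-style results."""
    values = sorted(run["metrics"][key] for run in runs)
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "max": values[-1],
    }


# Illustrative timings only, not measurements from the real index
fake_runs = [
    {"metrics": {"total_time": 0.2}},
    {"metrics": {"total_time": 0.4}},
    {"metrics": {"total_time": 0.9}},
]
summary = summarize_latency(fake_runs)
print(f"n={summary['count']} mean={summary['mean']:.3f}s max={summary['max']:.3f}s")
```

In the notebook, `runs` could be built with `[execute_query(q) for q in sample_queries]`, and `key` can be switched to `"embedding_time"` or `"query_time"` to see which step dominates.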

## 4. Simple Filter Tests

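Next, let's test basic filtering capabilities. We first run a sample query without a filter to get a baseline, then repeat it with a borough filter.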

In [None]:
# Test with sample query
test_query = sample_queries[0]
print(f"Testing query: '{test_query}'")

try:
    query_result = execute_query(test_query)
    print("\n✅ Query executed successfully")
    print(f"Embedding time: {query_result['metrics']['embedding_time']:.3f}s")
    print(f"Query time: {query_result['metrics']['query_time']:.3f}s")
    print(f"Total time: {query_result['metrics']['total_time']:.3f}s")
    print(f"Results returned: {query_result['metrics']['result_count']}")
except Exception as e:
    print(f"\n❌ Error executing query: {e}")

In [None]:
# Test with a simple filter
try:
    filter_dict = {"borough": "Manhattan"}
    filtered_result = execute_query(test_query, filter_dict=filter_dict)

    print(f"Query: '{test_query}'")
    print("Filter: borough = Manhattan")
    print(f"Results returned: {filtered_result['metrics']['result_count']}")
    print(f"Total time: {filtered_result['metrics']['total_time']:.3f}s")
except Exception as e:
    print(f"Error executing filtered query: {e}")
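To see what the borough filter changed, one option is to compare match IDs between the unfiltered and filtered runs. This is a minimal sketch, assuming each result is a dictionary with an `id` key (as the later cells in this notebook use). `compare_result_ids` is a hypothetical helper and the IDs are made up.

```python
def compare_result_ids(unfiltered, filtered):
    """Split match IDs into shared, filtered-only, and unfiltered-only lists."""
    unfiltered_ids = [match["id"] for match in unfiltered]
    filtered_ids = [match["id"] for match in filtered]
    unfiltered_set = set(unfiltered_ids)
    filtered_set = set(filtered_ids)
    return {
        # Order follows the ranking of the list each ID is taken from
        "shared": [i for i in filtered_ids if i in unfiltered_set],
        "only_filtered": [i for i in filtered_ids if i not in unfiltered_set],
        "only_unfiltered": [i for i in unfiltered_ids if i not in filtered_set],
    }


# Made-up IDs for illustration
baseline = [{"id": "doc-1"}, {"id": "doc-2"}, {"id": "doc-3"}]
manhattan_only = [{"id": "doc-1"}, {"id": "doc-3"}, {"id": "doc-7"}]
print(compare_result_ids(baseline, manhattan_only))
```

In the notebook this would be called as `compare_result_ids(query_result["results"], filtered_result["results"])`. IDs that appear only in the unfiltered run are the matches the filter excluded.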

In [None]:
# Let's try a different query about Brooklyn Bridge
brooklyn_query = sample_queries[1]  # "Tell me about the Brooklyn Bridge"
print(f"Testing query: '{brooklyn_query}'")

try:
    brooklyn_result = execute_query(brooklyn_query)
    print("\n✅ Query executed successfully")
    print(f"Embedding time: {brooklyn_result['metrics']['embedding_time']:.3f}s")
    print(f"Query time: {brooklyn_result['metrics']['query_time']:.3f}s")
    print(f"Total time: {brooklyn_result['metrics']['total_time']:.3f}s")
    print(f"Results returned: {brooklyn_result['metrics']['result_count']}")

    # Display the first result
    if brooklyn_result["results"]:
        first_result = brooklyn_result["results"][0]
        print(f"\nTop result score: {first_result['score']:.4f}")
        metadata = first_result.get("metadata", {})
        if metadata and "name" in metadata:
            print(f"Name: {metadata['name']}")
        if metadata and "borough" in metadata:
            print(f"Borough: {metadata['borough']}")
except Exception as e:
    print(f"\n❌ Error executing query: {e}")

In [None]:
# Let's examine the text content of the top results for the Brooklyn Bridge query
for i, result in enumerate(brooklyn_result["results"][:3]):
    print(f"\nResult #{i+1} - Score: {result['score']:.4f}")

    metadata = result.get("metadata", {})

    # Print basic information
    for key in ["name", "borough", "landmark_type", "designation_date"]:
        if key in metadata:
            print(f"{key.capitalize()}: {metadata[key]}")

    # Print a snippet of text content if available
    for content_key in ["text_chunk", "description"]:
        if content_key in metadata:
            content = metadata[content_key]
            snippet = content[:200] + "..." if len(content) > 200 else content
            print(f"\nSnippet: {snippet}")
            break

    print("---" * 20)
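The slice-based snippet above can cut a word in half and keeps any line breaks from the source text. As an alternative sketch, the standard library's `textwrap.shorten` collapses whitespace and truncates at a word boundary. `make_snippet` is a hypothetical helper name, and the sample sentence is only an illustration.

```python
import textwrap


def make_snippet(text, limit=200):
    """Collapse whitespace and truncate at a word boundary to at most `limit` characters."""
    return textwrap.shorten(text, width=limit, placeholder="...")


sample = "The Brooklyn Bridge is a hybrid cable-stayed suspension bridge"
print(make_snippet(sample, limit=30))
print(make_snippet("Short text", limit=30))
```

Text that already fits within the limit comes back unchanged apart from whitespace normalization.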

In [None]:
# Visualize similarity scores for the Brooklyn Bridge query
scores = []
names = []

for result in brooklyn_result["results"]:
    scores.append(result["score"])
    metadata = result.get("metadata", {})
    name = metadata.get("name", result["id"])
    names.append(name)

# Create a horizontal bar chart
plt.figure(figsize=(12, 6))
bars = plt.barh(names, scores, color="skyblue")

# Add data labels
for bar in bars:
    width = bar.get_width()
    plt.text(
        width + 0.01,
        bar.get_y() + bar.get_height() / 2,
        f"{width:.4f}",
        va="center",
        fontweight="bold",
    )

# Set chart properties
plt.title(f'Similarity Scores for Query: "{brooklyn_query}"', fontsize=14)
plt.xlabel("Similarity Score")
plt.ylabel("Result Name")
plt.xlim(0, 1.0)
plt.grid(axis="x", linestyle="--", alpha=0.7)
plt.tight_layout()
plt.show()

## 6. Advanced Filtering

Let's test more advanced filtering capabilities by combining multiple criteria.

In [None]:
# Test with multiple filter conditions
query_text = "What are the historic districts?"
filter_conditions = {"borough": "Brooklyn", "landmark_type": "Historic District"}

print(f"Query: '{query_text}'")
print(f"Filters: {filter_conditions}")

try:
    multi_filter_result = execute_query(
        query_text, filter_dict=filter_conditions, top_k=10
    )
    print("\n✅ Query executed successfully")
    print(f"Total time: {multi_filter_result['metrics']['total_time']:.3f}s")
    print(f"Results returned: {multi_filter_result['metrics']['result_count']}")

    # Print the names of the returned landmarks
    if multi_filter_result["results"]:
        print("\nReturned Brooklyn Historic Districts:")
        for i, result in enumerate(multi_filter_result["results"]):
            name = result.get("metadata", {}).get("name", "Unknown")
            print(f"{i+1}. {name} (Score: {result['score']:.4f})")
    else:
        print("\nNo results found matching the filters")

except Exception as e:
    print(f"\n❌ Error executing query: {e}")

In [None]:
# Let's try with just the landmark_type filter
query_text = "What are the historic districts?"
filter_conditions = {"landmark_type": "Historic District"}

print(f"Query: '{query_text}'")
print(f"Filters: {filter_conditions}")

try:
    filter_result = execute_query(query_text, filter_dict=filter_conditions, top_k=5)
    print("\n✅ Query executed successfully")
    print(f"Total time: {filter_result['metrics']['total_time']:.3f}s")
    print(f"Results returned: {filter_result['metrics']['result_count']}")

    # Print the names and boroughs of the returned landmarks
    if filter_result["results"]:
        print("\nReturned Historic Districts:")
        for i, result in enumerate(filter_result["results"]):
            metadata = result.get("metadata", {})
            name = metadata.get("name", "Unknown")
            borough = metadata.get("borough", "Unknown")
            print(f"{i+1}. {name} (Borough: {borough}, Score: {result['score']:.4f})")
    else:
        print("\nNo results found matching the filters")

except Exception as e:
    print(f"\n❌ Error executing query: {e}")
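The filters above use plain key-value pairs, which Pinecone treats as equality checks combined with AND. Pinecone's filter syntax also has operator forms such as `$eq`, `$ne`, `$in`, `$nin`, `$and`, and `$or`. Whether `PineconeDB.query_vectors` passes `filter_dict` through unchanged is an assumption. The sketch below evaluates a small subset of that syntax locally against one metadata dictionary, only to illustrate how combined conditions are read. It is not Pinecone's implementation, and `matches` is a hypothetical helper.

```python
# Supported comparison operators; any other operator raises KeyError
OPERATORS = {
    "$eq": lambda value, operand: value == operand,
    "$ne": lambda value, operand: value != operand,
    "$in": lambda value, operand: value in operand,
    "$nin": lambda value, operand: value not in operand,
}


def matches(metadata, condition):
    """Return True if the metadata satisfies every part of the condition."""
    for key, expected in condition.items():
        if key == "$and":
            ok = all(matches(metadata, sub) for sub in expected)
        elif key == "$or":
            ok = any(matches(metadata, sub) for sub in expected)
        elif isinstance(expected, dict):
            value = metadata.get(key)
            ok = all(OPERATORS[op](value, operand) for op, operand in expected.items())
        else:
            # A bare value is shorthand for equality
            ok = metadata.get(key) == expected
        if not ok:
            return False
    return True


record = {"borough": "Brooklyn", "landmark_type": "Historic District"}
print(matches(record, {"borough": "Brooklyn", "landmark_type": "Historic District"}))
print(matches(record, {"borough": {"$in": ["Manhattan", "Queens"]}}))
print(matches(record, {"$or": [{"borough": "Manhattan"}, {"landmark_type": {"$eq": "Historic District"}}]}))
```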

## 7. Exploring Metadata

To better understand the database, let's explore what metadata is available by performing some simple queries without filters.

In [None]:
# Perform a general query to explore available metadata
general_query = "New York City landmarks"
print(f"Query: '{general_query}'")

try:
    general_result = execute_query(general_query, top_k=10)
    print("\n✅ Query executed successfully")
    print(f"Results returned: {general_result['metrics']['result_count']}")

    # Extract and display available metadata fields
    print("\nExploring metadata in results:")

    metadata_fields = set()
    landmark_types = set()
    boroughs = set()

    for result in general_result["results"]:
        metadata = result.get("metadata", {})
        metadata_fields.update(metadata.keys())

        if "landmark_type" in metadata:
            landmark_types.add(metadata["landmark_type"])

        if "borough" in metadata:
            boroughs.add(metadata["borough"])

    print(f"\nAvailable metadata fields: {sorted(metadata_fields)}")
    print(f"\nUnique landmark types: {sorted(landmark_types)}")
    print(f"\nUnique boroughs: {sorted(boroughs)}")

except Exception as e:
    print(f"\n❌ Error executing query: {e}")
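The set of field names shows which fields exist, but not how consistently they are populated. A minimal sketch of counting field coverage across a result list, assuming each result carries an optional `metadata` dictionary as in the cells above. `field_coverage` is a hypothetical helper and the records are made up.

```python
from collections import Counter


def field_coverage(results):
    """Count how many results carry each metadata field."""
    counts = Counter()
    for match in results:
        # Treat a missing or None metadata entry as empty
        counts.update((match.get("metadata") or {}).keys())
    return counts


# Made-up records for illustration
sample_results = [
    {"id": "a", "metadata": {"name": "Example Landmark", "borough": "Brooklyn"}},
    {"id": "b", "metadata": {"name": "Another Landmark"}},
    {"id": "c"},
]
coverage = field_coverage(sample_results)
for field, count in coverage.most_common():
    print(f"{field}: {count}/{len(sample_results)} results")
```

A field that appears in only a few results is a weak candidate for filtering, because a filter on it excludes every record that lacks the field.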

In [None]:
# Let's try filtering by borough
query_text = "Historic buildings"
filter_conditions = {"borough": "Brooklyn"}

print(f"Query: '{query_text}'")
print(f"Filters: {filter_conditions}")

try:
    borough_filter_result = execute_query(
        query_text, filter_dict=filter_conditions, top_k=5
    )
    print("\n✅ Query executed successfully")
    print(f"Total time: {borough_filter_result['metrics']['total_time']:.3f}s")
    print(f"Results returned: {borough_filter_result['metrics']['result_count']}")

    # Print the names of the returned landmarks
    if borough_filter_result["results"]:
        print("\nReturned Brooklyn landmarks:")
        for i, result in enumerate(borough_filter_result["results"]):
            metadata = result.get("metadata", {})
            name = metadata.get("name", "Unknown")
            print(f"{i+1}. {name} (Score: {result['score']:.4f})")

            # Print additional metadata if available
            for key in ["style", "year_built", "neighborhood"]:
                if key in metadata and metadata[key]:
                    print(f"   {key.capitalize()}: {metadata[key]}")
    else:
        print("\nNo results found matching the filters")

except Exception as e:
    print(f"\n❌ Error executing query: {e}")

## 8. Summary and Conclusions

In this notebook, we successfully tested the NYC Landmarks Vector Database's query capabilities. Here's what we accomplished:

1. **Basic Vector Search**: Successfully performed semantic search on NYC landmark data
2. **Filtering**: Demonstrated the ability to filter by borough and other metadata fields
3. **Performance**: Tracked query times, with most queries returning in under 1 second
4. **Metadata Exploration**: Discovered the structure and available fields in the database

### Next Steps

Future work could include:

1. More advanced filtering combinations
2. Geographic visualization of search results on a map
3. Implementing relevance feedback to improve search results
4. Expansion of the API to support more complex queries

## 5. Examine Query Results

Let's examine the results of our queries in more detail.

In [None]:
# Display the results of the Manhattan-filtered query (filtered_result) in a more readable format
results = filtered_result["results"]

print(f"Query: '{filtered_result['query']}'\n")
print(f"Top {len(results)} results:")

for i, match in enumerate(results):
    print(f"\n{i+1}. Score: {match['score']:.4f}")

    # Extract metadata
    metadata = match.get("metadata", {})

    # Display key metadata fields
    print(f"   ID: {match['id']}")
    if "name" in metadata:
        print(f"   Name: {metadata['name']}")
    if "borough" in metadata:
        print(f"   Borough: {metadata['borough']}")
    if "landmark_type" in metadata:
        print(f"   Type: {metadata['landmark_type']}")
    if "designation_date" in metadata:
        print(f"   Designation Date: {metadata['designation_date']}")

    # Display a snippet of the content if available
    if "text_chunk" in metadata:
        snippet = (
            metadata["text_chunk"][:150] + "..."
            if len(metadata["text_chunk"]) > 150
            else metadata["text_chunk"]
        )
        print(f"   Snippet: {snippet}")

In [None]:
# Visualize the similarity scores


def plot_similarity_scores(results, query_text):
    """Plot a horizontal bar chart of similarity scores and return the pyplot module."""
    # Extract scores and names
    scores = [match["score"] for match in results]
    names = [match.get("metadata", {}).get("name", match["id"]) for match in results]

    # Create a horizontal bar chart
    plt.figure(figsize=(12, 6))
    bars = plt.barh(names, scores, color="skyblue")

    # Add data labels
    for bar in bars:
        width = bar.get_width()
        plt.text(
            width + 0.01,
            bar.get_y() + bar.get_height() / 2,
            f"{width:.4f}",
            va="center",
            fontweight="bold",
        )

    # Set chart properties
    plt.title(f'Similarity Scores for Query: "{query_text}"', fontsize=14)
    plt.xlabel("Similarity Score")
    plt.ylabel("Document")
    plt.xlim(0, 1.0)
    plt.grid(axis="x", linestyle="--", alpha=0.7)
    plt.tight_layout()

    return plt


# Plot the results
plot = plot_similarity_scores(filtered_result["results"], filtered_result["query"])
plot.show()
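Two details of the bar chart are worth handling when several chunks of the same landmark are returned. Matplotlib treats string labels as categories, so results that share a name are drawn on the same row, and `barh` places the first item at the bottom of the chart. The sketch below prepares unique labels in an order that puts the best match on top. `bar_chart_labels` is a hypothetical helper and the records are made up.

```python
def bar_chart_labels(results):
    """Build unique y-axis labels and scores, ordered so the best match is drawn on top."""
    labels = []
    seen = {}
    for match in results:
        name = (match.get("metadata") or {}).get("name", match["id"])
        seen[name] = seen.get(name, 0) + 1
        # Append a counter to repeated names so each bar keeps its own row
        labels.append(name if seen[name] == 1 else f"{name} ({seen[name]})")
    scores = [match["score"] for match in results]
    # barh draws the first item at the bottom, so reverse both lists
    return labels[::-1], scores[::-1]


# Made-up results, already sorted by descending score
ranked = [
    {"id": "a", "score": 0.9, "metadata": {"name": "Brooklyn Bridge"}},
    {"id": "b", "score": 0.8, "metadata": {"name": "Brooklyn Bridge"}},
    {"id": "c", "score": 0.7, "metadata": {}},
]
labels, scores = bar_chart_labels(ranked)
print(labels)
print(scores)
```

The returned lists can be passed directly to `plt.barh(labels, scores)` in place of the `names` and `scores` lists built inside the plotting function.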

## 7. Summary and Future Enhancements

This notebook demonstrates the basic query capabilities of the Pinecone vector database for NYC landmarks. Future enhancements will include:

1. More advanced filtering options
2. Query optimization techniques
3. Better visualization of results
4. Integration with the chat API