# NYC Landmarks Vector Database Statistics

This notebook analyzes the NYC Landmarks data stored in the Pinecone vector database. It examines index statistics, vector counts by namespace, the distribution of metadata fields across a sample of vectors, and the geographical distribution of landmarks.

## Setup and Configuration

First, we'll import the necessary libraries and set up connections to the Pinecone database.

In [17]:
# Standard libraries
import os
import sys
import json
from datetime import datetime
from typing import Dict, List, Any, Tuple, Optional
from collections import Counter, defaultdict

# Data analysis libraries
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# For map visualizations
import folium
from folium.plugins import MarkerCluster

# Vector analysis
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA

# Add project directory to path
sys.path.append('..')

# Set visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)

# Set random seed for reproducibility
np.random.seed(42)

In [18]:
# Import project modules
from nyc_landmarks.config.settings import settings
from nyc_landmarks.vectordb.pinecone_db import PineconeDB
from nyc_landmarks.db.db_client import DbClient

# Use fetch_all_lpc_reports; there is no LandmarkReportFetcher class in this codebase
from nyc_landmarks.db.fetchers import fetch_all_lpc_reports

## Database Connection

Connect to the Pinecone database and verify the connection.

In [19]:
# Initialize the Pinecone database client
pinecone_db = PineconeDB()

# Check if the connection was successful
if pinecone_db.index:
    print(f"✅ Successfully connected to Pinecone index: {pinecone_db.index_name}")
    print(f"Namespace: {pinecone_db.namespace}")
    print(f"Dimensions: {pinecone_db.dimensions}")
    print(f"Metric: {pinecone_db.metric}")
else:
    print("❌ Failed to connect to Pinecone. Check your credentials and network connection.")

INFO:nyc_landmarks.db.db_client:Using CoreDataStore API client
INFO:nyc_landmarks.db.coredatastore_api:Initialized CoreDataStore API client
INFO:nyc_landmarks.db.coredatastore_api:Initialized CoreDataStore API client
INFO:nyc_landmarks.vectordb.pinecone_db:Initialized Pinecone in environment: us-central1-gcp
INFO:nyc_landmarks.vectordb.pinecone_db:Initialized Pinecone in environment: us-central1-gcp
INFO:nyc_landmarks.vectordb.pinecone_db:Connected to Pinecone index: nyc-landmarks


✅ Successfully connected to Pinecone index: nyc-landmarks
Namespace: landmarks
Dimensions: 1536
Metric: cosine


## Index Statistics

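Retrieve basic statistics about the Pinecone index. In the run recorded below, `get_index_stats()` logged an error, so the dimension and index fullness fall back to `N/A` and the total vector count is 0.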

In [11]:
# Get index statistics
index_stats = pinecone_db.get_index_stats()

# Display the statistics in a more readable format
print("\n📊 Index Statistics:")
print(f"Dimension: {index_stats.get('dimension', 'N/A')}")
print(f"Index Fullness: {index_stats.get('index_fullness', 'N/A')}")

# Extract namespace information
namespaces = index_stats.get('namespaces', {})
total_vector_count = sum(ns.get('vector_count', 0) for ns in namespaces.values())

print(f"\n🔢 Total Vector Count: {total_vector_count:,}")
print("\n📁 Namespace Statistics:")

ERROR:nyc_landmarks.vectordb.pinecone_db:Error getting index stats: 'NoneType' object is not callable



📊 Index Statistics:
Dimension: N/A
Index Fullness: N/A

🔢 Total Vector Count: 0

📁 Namespace Statistics:


In [None]:
# Create a DataFrame for namespace stats
namespace_data = []

for ns_name, ns_stats in namespaces.items():
    vector_count = ns_stats.get('vector_count', 0)
    percentage = (vector_count / total_vector_count * 100) if total_vector_count > 0 else 0
    namespace_data.append({
        'Namespace': ns_name if ns_name else 'default',
        'Vector Count': vector_count,
        'Percentage': percentage
    })

namespace_df = pd.DataFrame(namespace_data)
if not namespace_df.empty:
    namespace_df = namespace_df.sort_values('Vector Count', ascending=False).reset_index(drop=True)
    display(namespace_df)
else:
    print("No namespace data available.")
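The namespace totals and percentages computed above depend on the stats call succeeding, and the logged error shows that it may not. A minimal sketch of a defensive summary follows, assuming the stats payload is a plain dict shaped like `{'namespaces': {name: {'vector_count': n}}}`, which is the shape the cells above read. `summarize_namespaces` is a hypothetical helper, not part of the project, and the example counts are made up.

```python
from typing import Any, Dict, List, Optional


def summarize_namespaces(stats: Optional[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return one row per namespace with its vector count and share of the total.

    Hypothetical helper: assumes `stats` is a plain dict with a 'namespaces'
    mapping, and tolerates None or missing keys.
    """
    namespaces = (stats or {}).get("namespaces") or {}
    total = sum((ns or {}).get("vector_count", 0) for ns in namespaces.values())
    rows = []
    for name, ns in namespaces.items():
        count = (ns or {}).get("vector_count", 0)
        rows.append({
            "Namespace": name or "default",  # Pinecone's default namespace is ""
            "Vector Count": count,
            "Percentage": (count / total * 100) if total else 0.0,
        })
    # Largest namespace first, matching the sort order used for the DataFrame.
    return sorted(rows, key=lambda r: r["Vector Count"], reverse=True)


# Illustrative payload only; these counts are made up, not real index data.
example = {"namespaces": {"landmarks": {"vector_count": 300}, "": {"vector_count": 100}}}
rows = summarize_namespaces(example)
empty_rows = summarize_namespaces(None)  # a failed stats call yields no rows
```

The returned rows can be passed straight to `pd.DataFrame(rows)`, and an empty list makes the failure case explicit instead of silently reporting zero vectors.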

In [None]:
# Visualize namespace distribution
if not namespace_df.empty:
    plt.figure(figsize=(10, 6))
    bars = plt.bar(namespace_df['Namespace'], namespace_df['Vector Count'], color='skyblue')
    plt.title('Vector Count by Namespace')
    plt.xlabel('Namespace')
    plt.ylabel('Vector Count')
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add count labels on top of bars
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 5,
                f'{int(height):,}',
                ha='center', va='bottom', rotation=0)
    
    plt.tight_layout()
    plt.show()

## Vector Metadata Analysis

Let's analyze the metadata associated with the vectors to understand the distribution of landmark properties.

In [None]:
# Function to sample vectors and retrieve metadata
def sample_vectors(pinecone_db, sample_size=100):
    """
    Sample vectors from the Pinecone database to analyze metadata.
    """
    try:
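        # Pinecone has no direct "sample" operation, so query with a random
        # vector instead. The result is the top_k nearest neighbours of that
        # vector, an approximation rather than a uniform random sample.
        # Normal components spread query directions over the whole sphere;
        # np.random.rand would confine them to the positive orthant.
        random_vector = np.random.randn(pinecone_db.dimensions).tolist()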
        
        # Set a high top_k to get a good sample size
        results = pinecone_db.query_vectors(
            query_vector=random_vector,
            top_k=sample_size,
            filter_dict=None
        )
        
        return results
    except Exception as e:
        print(f"Error sampling vectors: {e}")
        return []

# Sample vectors for analysis
sample_size = 200  # Adjust based on your database size
vector_samples = sample_vectors(pinecone_db, sample_size)

print(f"Retrieved {len(vector_samples)} vector samples")
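Querying with a single random vector returns that vector's nearest neighbours, and under the cosine metric the way the vector is drawn affects which region of the space gets sampled. The sketch below uses only the standard library to compare two draws: components uniform in [0, 1) all point into the positive orthant, so two such vectors have an expected cosine similarity of roughly 0.75, while standard-normal components give nearly orthogonal directions. The dimension 1536 is taken from the connection output above; the rest is illustrative.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


rng = random.Random(42)
dim = 1536  # matches the index dimension reported above

# Two independent draws with components uniform in [0, 1)
uniform_a = [rng.random() for _ in range(dim)]
uniform_b = [rng.random() for _ in range(dim)]

# Two independent draws with standard-normal components
gauss_a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
gauss_b = [rng.gauss(0.0, 1.0) for _ in range(dim)]

cos_uniform = cosine(uniform_a, uniform_b)
cos_gauss = cosine(gauss_a, gauss_b)
print(f"uniform [0, 1):  {cos_uniform:.3f}")
print(f"standard normal: {cos_gauss:.3f}")
```

Because uniform draws all resemble one another, repeated queries with them keep landing in the same neighbourhood. Issuing several queries with normal-distributed vectors and merging the results by ID is one way to broaden the sample.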

In [None]:
# Analyze metadata fields
if vector_samples:
    # Extract all metadata fields
    all_metadata = [sample.get('metadata', {}) for sample in vector_samples]
    
    # Count metadata fields
    field_counts = Counter()
    for metadata in all_metadata:
        for key in metadata.keys():
            field_counts[key] += 1
    
    # Create DataFrame for field distribution
    field_df = pd.DataFrame({
        'Field': list(field_counts.keys()),
        'Count': list(field_counts.values()),
        'Percentage': [count/len(all_metadata)*100 for count in field_counts.values()]
    }).sort_values('Count', ascending=False).reset_index(drop=True)
    
    # Display field distribution
    display(field_df)
    
    # Visualize top 10 metadata fields
    top_fields = field_df.head(10)
    plt.figure(figsize=(10, 6))
    bars = plt.bar(top_fields['Field'], top_fields['Count'], color='lightgreen')
    plt.title('Top 10 Metadata Fields')
    plt.xlabel('Metadata Field')
    plt.ylabel('Count')
    plt.xticks(rotation=45, ha='right')
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    
    # Add count labels
    for bar in bars:
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height + 1,
                f'{int(height)}',
                ha='center', va='bottom', rotation=0)
    
    plt.tight_layout()
    plt.show()
else:
    print("No vector samples available for metadata analysis")
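The field-coverage calculation above can be checked on a few hand-written records. This sketch assumes each sample is a dict with a `metadata` key, as the cell above does. The field names and values are hypothetical, not the real schema, and one record carries `None` as its metadata to show why a guard is useful.

```python
from collections import Counter

# Illustrative records only; the field names are hypothetical, not the real schema.
samples = [
    {"id": "a", "metadata": {"landmark_id": "LP-1", "borough": "Manhattan"}},
    {"id": "b", "metadata": {"landmark_id": "LP-2"}},
    {"id": "c", "metadata": None},  # a match with no metadata at all
]

# `or {}` guards against a metadata value that is present but None.
all_metadata = [(s.get("metadata") or {}) for s in samples]

# Updating a Counter with a record's keys adds 1 for each field it contains.
field_counts = Counter()
for metadata in all_metadata:
    field_counts.update(metadata.keys())

# Share of sampled records that carry each field, most common first.
coverage = {
    field: count / len(all_metadata) * 100
    for field, count in field_counts.most_common()
}
```

Dividing by the number of sampled records, not by the number of records that have metadata, keeps the percentages comparable across fields.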

## Geographical Distribution

If the vectors contain location information, let's visualize the geographical distribution of landmarks.

In [None]:
# Extract geographical information and create a map
def create_landmark_map(vector_samples):
    """
    Create a map visualization of landmarks with geographical information.
    """
    # Extract latitude and longitude information if available
    geo_data = []
    for sample in vector_samples:
        metadata = sample.get('metadata', {})
        
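        # Check if the required fields exist - names may vary by your schema.
        # Compare against None so a legitimate 0.0 is not treated as missing.
        lat = metadata.get('latitude')
        if lat is None:
            lat = metadata.get('lat')
        lng = metadata.get('longitude')
        if lng is None:
            lng = metadata.get('lng')
        name = metadata.get('name') or metadata.get('title') or "Unknown"

        if lat is not None and lng is not None: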
            try:
                geo_data.append({
                    'name': name,
                    'lat': float(lat),
                    'lng': float(lng),
                    'metadata': metadata
                })
            except (ValueError, TypeError):
                # Skip if conversion to float fails
                pass
    
    # Create map if we have geo data
    if geo_data:
        # Create a dataframe for the geographical data
        geo_df = pd.DataFrame(geo_data)
        
        # Center the map on the mean coordinates
        center_lat = geo_df['lat'].mean()
        center_lng = geo_df['lng'].mean()
        
        # Create a map
        m = folium.Map(location=[center_lat, center_lng], zoom_start=12)
        
        # Add a marker cluster
        marker_cluster = MarkerCluster().add_to(m)
        
        # Add markers for each landmark
        for _, row in geo_df.iterrows():
            popup_html = f"<b>{row['name']}</b>"
            folium.Marker(
                location=[row['lat'], row['lng']],
                popup=folium.Popup(popup_html, max_width=300),
                icon=folium.Icon(color='blue', icon='info-sign')
            ).add_to(marker_cluster)
        
        # Display the map
        return m
    else:
        print("No geographical data found in the vector metadata")
        return None

# Create and display the map
landmark_map = create_landmark_map(vector_samples)
if landmark_map:
    display(landmark_map)
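Coordinates stored as metadata may be strings, may be missing, or may have latitude and longitude swapped, and a single bad point skews the map's mean-based centre. A minimal sketch of a stricter parser follows, using the same field names the cell above looks for. The bounding box is an approximate assumption for New York City, and `parse_coords` is a hypothetical helper.

```python
from typing import Any, Optional, Tuple

# Approximate NYC bounding box; an assumption for illustration, adjust as needed.
NYC_LAT_RANGE = (40.47, 40.92)
NYC_LNG_RANGE = (-74.26, -73.70)


def parse_coords(metadata: dict) -> Optional[Tuple[float, float]]:
    """Return (lat, lng) as floats, or None if missing, malformed, or out of bounds."""

    def first_present(*keys: str) -> Any:
        # Return the first value that is neither None nor an empty string.
        for key in keys:
            value = metadata.get(key)
            if value is not None and value != "":
                return value
        return None

    lat_raw = first_present("latitude", "lat")
    lng_raw = first_present("longitude", "lng")
    if lat_raw is None or lng_raw is None:
        return None
    try:
        lat, lng = float(lat_raw), float(lng_raw)
    except (TypeError, ValueError):
        return None
    in_bounds = (NYC_LAT_RANGE[0] <= lat <= NYC_LAT_RANGE[1]
                 and NYC_LNG_RANGE[0] <= lng <= NYC_LNG_RANGE[1])
    return (lat, lng) if in_bounds else None


ok = parse_coords({"latitude": "40.7484", "longitude": "-73.9857"})
swapped = parse_coords({"lat": -73.9857, "lng": 40.7484})  # rejected: out of bounds
missing = parse_coords({"name": "no coordinates"})
```

Counting how many samples return `None` gives a quick measure of how complete the location metadata is before plotting.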

## Summary

This notebook analyzes the NYC Landmarks vector database. The analysis covers:

1. Index statistics and vector counts by namespace
2. Metadata field distribution across a sample of vectors
3. Geographical distribution of landmarks whose metadata includes coordinates

Vector clustering and dimensionality reduction are not implemented in this notebook, although the setup cell already imports `TSNE`, `KMeans`, `DBSCAN`, and `PCA`.

These insights help us understand the structure and content of the vector database, enabling better optimization and usage of the data for landmark information retrieval.