# NYC Landmarks Vector Database - Processing Status Analysis

This notebook analyzes the processing status of NYC landmark records in the Pinecone vector database. It determines which landmarks have already been processed (have vectors in Pinecone) and which landmarks still need processing.

## Objectives

1. Connect to CoreDataStore API and Pinecone database
2. Fetch all available landmark IDs from the CoreDataStore API
3. Check which landmarks already have vectors in Pinecone
4. Generate statistics and visualizations of processing status
5. Export a list of unprocessed landmarks for batch processing
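6. Analyze coverage by source type (PDF and Wikipedia) and visualize it
7. Summarize the findings and how to use them for processing planning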


## 1. Setup & Imports

First, we'll import the necessary libraries and set up the environment.

In [None]:
# Standard libraries
import sys
import time

# Visualization libraries
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from tqdm.notebook import tqdm

# Add project directory to path
sys.path.append("..")

# Set visualization style
plt.style.use("seaborn-v0_8-whitegrid")
sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (12, 8)

# Set random seed for reproducibility
np.random.seed(42)

In [None]:
# Configure logging
import logging

# Import project modules
from nyc_landmarks.config.settings import settings
from nyc_landmarks.db.db_client import get_db_client
from nyc_landmarks.vectordb.pinecone_db import PineconeDB

# Set up logger
logger = logging.getLogger()
logging.basicConfig(
    level=settings.LOG_LEVEL.value,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)

## 2. Connect to Databases

Next, we'll establish connections to both the CoreDataStore API and the Pinecone vector database.

In [None]:
# Initialize the database client for CoreDataStore API
db_client = get_db_client()
print("✅ Initialized CoreDataStore API client")

In [None]:
# Initialize the Pinecone database client
try:
    # Create PineconeDB instance
    pinecone_db = PineconeDB()

    # Check if the connection was successful
    if pinecone_db.index:
        print(f"✅ Successfully connected to Pinecone index: {pinecone_db.index_name}")
        print(f"Namespace: {pinecone_db.namespace}")
        print(f"Dimensions: {pinecone_db.dimensions}")
    else:
        print(
            "❌ Failed to connect to Pinecone. Check your credentials and network connection."
        )
except Exception as e:
    print(f"❌ Error initializing Pinecone: {e}")

In [None]:
# Get index statistics from Pinecone
try:
    stats = pinecone_db.get_index_stats()

    # Check for errors
    if "error" in stats:
        print(f"❌ Error retrieving index stats: {stats['error']}")
        # Create fallback mock stats for demonstration
        total_vector_count = 0
        namespaces = {}
    else:
        print("✅ Successfully retrieved index stats")
        total_vector_count = stats.get("total_vector_count", 0)
        namespaces = stats.get("namespaces", {})
except Exception as e:
    print(f"❌ Error retrieving index stats: {e}")
    # Create fallback mock stats for demonstration
    total_vector_count = 0
    namespaces = {}
    stats = {}

print("\n📊 Index Statistics:")
print(f"Total Vector Count: {total_vector_count:,}")
print(f"Dimension: {stats.get('dimension')}")
print(f"Index Fullness: {stats.get('index_fullness')}")

## 3. Fetch All Landmark IDs

Now we'll fetch all landmark IDs from the CoreDataStore API to determine the total universe of landmarks.

In [None]:
def fetch_all_landmark_ids(
    start_page: int = 1,
    end_page: int = None,  # type: ignore
    page_size: int = 100,
    max_pages: int = 500,
) -> set[str]:
    """Fetch all landmark IDs from CoreDataStore API.

    Args:
        start_page: Starting page number (default: 1)
        end_page: Ending page number (default: None, fetch until no more results)
        page_size: Number of landmarks per page (default: 100)
        max_pages: Maximum number of pages to fetch (safety limit)

    Returns:
        Set of landmark IDs
    """
    all_landmark_ids: set[str] = set()
    current_page = start_page
    total_pages_fetched = 0

    try:
        with tqdm(desc="Fetching landmark IDs", unit="page") as pbar:
            while True:
                # Check if we've reached the end page or max pages
                if (
                    end_page and current_page > end_page
                ) or total_pages_fetched >= max_pages:
                    break

                # Fetch landmarks for the current page
                try:
                    # Fetch API results for current page
                    landmarks = db_client.get_landmarks_page(page_size, current_page)

                except Exception as e:
                    print(f"Error fetching page {current_page}: {e}")
                    # Try to continue with next page
                    current_page += 1
                    total_pages_fetched += 1
                    pbar.update(1)
                    continue

                # If no landmarks found, we've reached the end
                if not landmarks:
                    print(f"No landmarks found on page {current_page}, ending fetch")
                    break

                # Process the landmarks
                for landmark in landmarks:
                    landmark_id = landmark.get("id", "") or landmark.get("lpNumber", "")
                    if landmark_id:
                        all_landmark_ids.add(landmark_id)

                # Update progress
                pbar.set_postfix(
                    {
                        "page": current_page,
                        "landmarks": len(landmarks),
                        "total": len(all_landmark_ids),
                    }
                )
                pbar.update(1)

                # Move to next page
                current_page += 1
                total_pages_fetched += 1

                # Small delay to avoid rate limiting
                time.sleep(0.5)
    except Exception as e:
        print(f"Error fetching landmark IDs: {e}")

    print(
        f"Completed fetching {len(all_landmark_ids)} landmark IDs from {total_pages_fetched} pages"
    )
    return all_landmark_ids


# Try to get total records from API
try:
    # Force reload the db_client module to get the latest version with the new method
    import importlib

    import nyc_landmarks.db.db_client

    importlib.reload(nyc_landmarks.db.db_client)

    # Reinitialize the database client to get the updated class definition
    from nyc_landmarks.db.db_client import get_db_client

    db_client = get_db_client()
    print("✅ Reloaded and reinitialized the CoreDataStore API client")

    # Now try to use the get_total_record_count method (should be available after reload)
    total_records = db_client.get_total_record_count()
    print(f"Total landmark records available from API: {total_records}")
except Exception as e:
    print(f"Error getting total record count: {e}")
    # Use a reasonable default based on API documentation
    total_records = 1765
    print(f"Falling back to default record count: {total_records}")

# Configure fetch parameters
page_size = 100
start_page = 1

# Calculate the required number of pages (using ceiling division)
# This ensures we get all records even if the last page is partial
total_pages = (total_records + page_size - 1) // page_size

print(
    f"Will fetch {total_records} records using {total_pages} pages with {page_size} records per page"
)
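
The ceiling-division idiom used above can be checked against `math.ceil` on a few record counts. This is a minimal sketch: `pages_needed` is a hypothetical helper name, and 1765 is simply the fallback record count from the cell above.

```python
import math


def pages_needed(total_records: int, page_size: int) -> int:
    """Return the number of pages required to cover total_records."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Integer ceiling division: rounds up without going through floats
    return (total_records + page_size - 1) // page_size


# Compare against math.ceil for zero, exact multiples, and partial pages
for n in (0, 1, 100, 101, 1765):
    assert pages_needed(n, 100) == math.ceil(n / 100)

print(pages_needed(1765, 100))  # 18: 17 full pages plus one page of 65 records
```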

In [None]:
# Fetch all landmark IDs
start_time = time.time()
all_landmark_ids = fetch_all_landmark_ids(
    start_page=start_page, end_page=total_pages, page_size=page_size
)
elapsed_time = time.time() - start_time

print(
    f"Fetched {len(all_landmark_ids)} unique landmark IDs in {elapsed_time:.2f} seconds"
)
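
The fetch loop above moves on to the next page when a request raises, so the landmarks on that page silently drop out of the total. One option is to retry transient failures before giving up. The sketch below is an assumption, not part of `db_client`: `fetch_with_retry` and its parameters are hypothetical, and the demo uses a stand-in function instead of the real API.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying with exponential backoff on any exception."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: let the caller decide what to do
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Demo with a stand-in fetch that fails twice, then succeeds
calls = {"count": 0}


def flaky_page():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return [{"id": "LP-00001"}]  # Made-up ID for illustration


# Skip real sleeping in the demo by passing a no-op sleep
page = fetch_with_retry(flaky_page, sleep=lambda _: None)
print(page, calls["count"])
```

In the notebook, the call inside the loop would look like `fetch_with_retry(lambda: db_client.get_landmarks_page(page_size, current_page))`, with the existing `except` branch kept as the last resort.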

## 4. Check Processing Status in Pinecone

Now we'll check which landmarks already have vectors in Pinecone.

In [None]:
def check_landmark_processing_status_by_source(
    pinecone_db, landmark_ids, batch_size=10, top_k=1
):
    """Check which landmarks have vectors in Pinecone by source type.

    Args:
        pinecone_db: PineconeDB instance
        landmark_ids: Set of landmark IDs to check
        batch_size: Number of landmarks per batch; landmarks are queried one at a time, and progress updates and the rate-limit pause happen once per batch
        top_k: Number of vectors to retrieve per landmark (1 is sufficient to check existence)

    Returns:
        Dict keyed by source type ('pdf', 'wikipedia'); each value is a dict with 'processed' and 'unprocessed' sets of landmark IDs
    """
    # Generate a random query vector for searching
    random_vector = np.random.rand(pinecone_db.dimensions).tolist()

    results = {
        "pdf": {"processed": set(), "unprocessed": set()},
        "wikipedia": {"processed": set(), "unprocessed": set()},
    }

    # Convert set to list for iteration with tqdm
    landmark_ids_list = list(landmark_ids)

    with tqdm(
        total=len(landmark_ids_list) * 2, desc="Checking processing status by source"
    ) as pbar:
        # Check each source type separately
        for source_type in ["pdf", "wikipedia"]:
            pbar.set_description(f"Checking {source_type} vectors")

            for i in range(0, len(landmark_ids_list), batch_size):
                # Get the current batch
                batch = landmark_ids_list[i : i + batch_size]

                # Check each landmark in the batch
                for landmark_id in batch:
                    # Query Pinecone for vectors with this landmark_id AND source_type
                    filter_dict = {
                        "landmark_id": landmark_id,
                        "source_type": source_type,
                    }
                    try:
                        # We only need to know if vectors exist, so top_k=1 is sufficient
                        vectors = pinecone_db.query_vectors(
                            query_vector=random_vector,
                            top_k=top_k,
                            filter_dict=filter_dict,
                        )

                        # If vectors found, mark as processed, otherwise unprocessed
                        if vectors:
                            results[source_type]["processed"].add(landmark_id)
                        else:
                            results[source_type]["unprocessed"].add(landmark_id)
                    except Exception as e:
                        print(
                            f"Error checking {source_type} for landmark {landmark_id}: {e}"
                        )
                        # If we can't check, assume unprocessed to be safe
                        results[source_type]["unprocessed"].add(landmark_id)

                # Update progress
                pbar.update(len(batch))
                pbar.set_postfix(
                    {
                        f"{source_type}_processed": len(
                            results[source_type]["processed"]
                        ),
                        f"{source_type}_unprocessed": len(
                            results[source_type]["unprocessed"]
                        ),
                    }
                )

                # Small delay to avoid rate limiting
                time.sleep(0.2)

    return results
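
To exercise the status check without a live index, a small in-memory stand-in can mimic the one method the function calls. The sketch below is a test double built on assumptions: `FakeVectorStore` and `has_vectors` are hypothetical, the landmark IDs are made up, and the fake implements only exact-match metadata filtering while ignoring the query vector, which is not how Pinecone ranks results.

```python
from typing import Any


class FakeVectorStore:
    """Hypothetical in-memory stand-in exposing a query_vectors method."""

    def __init__(self, records: list[dict[str, Any]], dimensions: int = 4):
        self.records = records
        self.dimensions = dimensions

    def query_vectors(self, query_vector, top_k=1, filter_dict=None):
        # Keep records whose metadata equals every key/value in the filter
        wanted = filter_dict or {}
        matches = [
            r
            for r in self.records
            if all(r["metadata"].get(k) == v for k, v in wanted.items())
        ]
        return matches[:top_k]


def has_vectors(store, landmark_id: str, source_type: str) -> bool:
    """Return True if at least one vector matches the landmark and source."""
    probe = [0.0] * store.dimensions  # Value is irrelevant to the fake store
    found = store.query_vectors(
        query_vector=probe,
        top_k=1,
        filter_dict={"landmark_id": landmark_id, "source_type": source_type},
    )
    return bool(found)


store = FakeVectorStore(
    [
        {"id": "v1", "metadata": {"landmark_id": "LP-00001", "source_type": "pdf"}},
        {"id": "v2", "metadata": {"landmark_id": "LP-00001", "source_type": "wikipedia"}},
        {"id": "v3", "metadata": {"landmark_id": "LP-00002", "source_type": "pdf"}},
    ]
)
print(has_vectors(store, "LP-00001", "wikipedia"))  # True
print(has_vectors(store, "LP-00002", "wikipedia"))  # False
```

Because the stand-in has a `dimensions` attribute and accepts the same keyword arguments, it can also be passed to `check_landmark_processing_status_by_source` in place of `pinecone_db` for a dry run.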

In [None]:
# Check processing status for all landmarks by source type
start_time = time.time()
processing_results = check_landmark_processing_status_by_source(
    pinecone_db=pinecone_db,
    landmark_ids=all_landmark_ids,
    batch_size=10,  # Adjust based on API rate limits
    top_k=1,
)
elapsed_time = time.time() - start_time

print(f"\nProcessing status check completed in {elapsed_time:.2f} seconds")
print(f"Total landmarks: {len(all_landmark_ids)}")

# PDF results
pdf_processed = processing_results["pdf"]["processed"]
pdf_unprocessed = processing_results["pdf"]["unprocessed"]
print("\n📄 PDF Processing Status:")
print(
    f"  Processed landmarks: {len(pdf_processed)} ({len(pdf_processed)/len(all_landmark_ids)*100:.2f}%)"
)
print(
    f"  Unprocessed landmarks: {len(pdf_unprocessed)} ({len(pdf_unprocessed)/len(all_landmark_ids)*100:.2f}%)"
)

# Wikipedia results
wiki_processed = processing_results["wikipedia"]["processed"]
wiki_unprocessed = processing_results["wikipedia"]["unprocessed"]
print("\n📖 Wikipedia Processing Status:")
print(
    f"  Processed landmarks: {len(wiki_processed)} ({len(wiki_processed)/len(all_landmark_ids)*100:.2f}%)"
)
print(
    f"  Unprocessed landmarks: {len(wiki_unprocessed)} ({len(wiki_unprocessed)/len(all_landmark_ids)*100:.2f}%)"
)

# Analysis of coverage
landmarks_with_both = pdf_processed & wiki_processed
landmarks_with_pdf_only = pdf_processed - wiki_processed
landmarks_with_wiki_only = wiki_processed - pdf_processed
landmarks_with_neither = pdf_unprocessed & wiki_unprocessed

print("\n🔍 Coverage Analysis:")
print(
    f"  Landmarks with both PDF and Wikipedia: {len(landmarks_with_both)} ({len(landmarks_with_both)/len(all_landmark_ids)*100:.2f}%)"
)
print(
    f"  Landmarks with PDF only: {len(landmarks_with_pdf_only)} ({len(landmarks_with_pdf_only)/len(all_landmark_ids)*100:.2f}%)"
)
print(
    f"  Landmarks with Wikipedia only: {len(landmarks_with_wiki_only)} ({len(landmarks_with_wiki_only)/len(all_landmark_ids)*100:.2f}%)"
)
print(
    f"  Landmarks with neither: {len(landmarks_with_neither)} ({len(landmarks_with_neither)/len(all_landmark_ids)*100:.2f}%)"
)

# Identify landmarks that should have PDFs but don't
print("\n⚠️  Landmarks missing PDF content (should be investigated):")
if pdf_unprocessed:
    missing_pdf_sample = sorted(pdf_unprocessed)[:10]  # Show first 10
    for landmark_id in missing_pdf_sample:
        print(f"  - {landmark_id}")
    if len(pdf_unprocessed) > 10:
        print(f"  ... and {len(pdf_unprocessed) - 10} more")
else:
    print("  ✅ All landmarks have PDF content!")
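
The coverage buckets are plain set operations, and together they partition the full ID set when every landmark was checked for both sources. The sketch below verifies that on toy data. It also guards the percentage against an empty ID set, which would raise `ZeroDivisionError` in the prints above (for example, after a failed fetch). The IDs and the `pct` helper are illustrative assumptions.

```python
def pct(part: int, whole: int) -> float:
    """Percentage of whole, returning 0.0 when whole is zero."""
    return (part / whole * 100) if whole else 0.0


# Toy data standing in for the real results
all_ids = {"A", "B", "C", "D", "E"}
pdf_done = {"A", "B", "C"}
wiki_done = {"B", "C", "D"}

both = pdf_done & wiki_done
pdf_only = pdf_done - wiki_done
wiki_only = wiki_done - pdf_done
neither = all_ids - pdf_done - wiki_done

# The four buckets are disjoint and cover every ID
assert both | pdf_only | wiki_only | neither == all_ids
assert len(both) + len(pdf_only) + len(wiki_only) + len(neither) == len(all_ids)

buckets = [
    ("both", both),
    ("pdf only", pdf_only),
    ("wikipedia only", wiki_only),
    ("neither", neither),
]
for label, bucket in buckets:
    print(f"{label}: {len(bucket)} ({pct(len(bucket), len(all_ids)):.1f}%)")
```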

## 5. Export a list of unprocessed landmarks for batch processing

In [None]:
# Export landmarks that need PDF processing (every landmark should have a PDF, so these need investigation)
import os

# Create the output directory up front so every export below can use it,
# even when the list of landmarks missing PDFs is empty
output_dir = "../test_output"
os.makedirs(output_dir, exist_ok=True)

pdf_missing_landmarks = sorted(processing_results["pdf"]["unprocessed"])
print(f"\n📋 Landmarks missing PDF content ({len(pdf_missing_landmarks)} total):")
if pdf_missing_landmarks:
    print("First 20:")
    for landmark_id in pdf_missing_landmarks[:20]:
        print(f"  {landmark_id}")
    if len(pdf_missing_landmarks) > 20:
        print(f"  ... and {len(pdf_missing_landmarks) - 20} more")

    # Save to file for further investigation
    with open(f"{output_dir}/landmarks_missing_pdf.txt", "w") as f:
        f.write("# Landmarks missing PDF content\n")
        f.write(f"# Generated on: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")
        f.write(f"# Total count: {len(pdf_missing_landmarks)}\n\n")
        for landmark_id in pdf_missing_landmarks:
            f.write(f"{landmark_id}\n")
    print(f"\n💾 Saved list to {output_dir}/landmarks_missing_pdf.txt")
else:
    print("  ✅ All landmarks have PDF content!")

# Export landmarks that could benefit from Wikipedia processing (optional enhancement)
wiki_missing_landmarks = sorted(processing_results["wikipedia"]["unprocessed"])
print(
    f"\n📋 Landmarks without Wikipedia content ({len(wiki_missing_landmarks)} total):"
)
if wiki_missing_landmarks:
    print("First 20:")
    for landmark_id in wiki_missing_landmarks[:20]:
        print(f"  {landmark_id}")
    if len(wiki_missing_landmarks) > 20:
        print(f"  ... and {len(wiki_missing_landmarks) - 20} more")

    # Save to file for potential Wikipedia article creation/processing
    with open(f"{output_dir}/landmarks_missing_wikipedia.txt", "w") as f:
        f.write("# Landmarks without Wikipedia content\n")
        f.write(f"# Generated on: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")
        f.write(f"# Total count: {len(wiki_missing_landmarks)}\n")
        f.write(
            "# Note: This is optional enhancement - not all landmarks need Wikipedia articles\n\n"
        )
        for landmark_id in wiki_missing_landmarks:
            f.write(f"{landmark_id}\n")
    print(f"\n💾 Saved list to {output_dir}/landmarks_missing_wikipedia.txt")
else:
    print("  ✅ All landmarks have Wikipedia content!")

# Export landmarks with complete coverage (both PDF and Wikipedia)
complete_landmarks = sorted(landmarks_with_both)
print(f"\n📋 Landmarks with complete coverage ({len(complete_landmarks)} total):")
if complete_landmarks:
    print("First 10:")
    for landmark_id in complete_landmarks[:10]:
        print(f"  {landmark_id}")
    if len(complete_landmarks) > 10:
        print(f"  ... and {len(complete_landmarks) - 10} more")

    # Save to file
    with open(f"{output_dir}/landmarks_complete_coverage.txt", "w") as f:
        f.write("# Landmarks with both PDF and Wikipedia content\n")
        f.write(f"# Generated on: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")
        f.write(f"# Total count: {len(complete_landmarks)}\n\n")
        for landmark_id in complete_landmarks:
            f.write(f"{landmark_id}\n")
    print(f"\n💾 Saved list to {output_dir}/landmarks_complete_coverage.txt")
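
The three exports share the same header-then-IDs layout, so they could be factored into one helper. This is a minimal sketch assuming the same file format as above: `write_id_list` is a hypothetical name, the IDs are made up, and the demo writes to a temporary directory rather than `../test_output`.

```python
import tempfile
import time
from pathlib import Path
from typing import Iterable


def write_id_list(path: Path, title: str, ids: Iterable[str], note: str = "") -> int:
    """Write a commented header followed by one sorted ID per line."""
    sorted_ids = sorted(ids)
    path.parent.mkdir(parents=True, exist_ok=True)
    lines = [
        f"# {title}",
        f"# Generated on: {time.strftime('%Y-%m-%d %H:%M:%S')}",
        f"# Total count: {len(sorted_ids)}",
    ]
    if note:
        lines.append(f"# Note: {note}")
    lines.append("")  # Blank line separates the header from the IDs
    lines.extend(sorted_ids)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(sorted_ids)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "exports" / "landmarks_missing_pdf.txt"
    written = write_id_list(
        target, "Landmarks missing PDF content", {"LP-00002", "LP-00001"}
    )
    # Skip header lines and blanks to read the IDs back
    ids_back = [
        ln
        for ln in target.read_text(encoding="utf-8").splitlines()
        if ln and not ln.startswith("#")
    ]
    print(written, ids_back)
```

Creating the parent directory inside the helper also means no export depends on an earlier branch having created the output folder.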

## 6. Analysis and Visualizations

Now we'll analyze the processing status and create visualizations.

In [None]:
# Use the results from the source-type filtered processing check
pdf_processed_count = len(processing_results["pdf"]["processed"])
pdf_unprocessed_count = len(processing_results["pdf"]["unprocessed"])
wiki_processed_count = len(processing_results["wikipedia"]["processed"])
wiki_unprocessed_count = len(processing_results["wikipedia"]["unprocessed"])

total_landmarks = len(all_landmark_ids)

# Calculate percentages
pdf_processed_percentage = (pdf_processed_count / total_landmarks) * 100
pdf_unprocessed_percentage = (pdf_unprocessed_count / total_landmarks) * 100
wiki_processed_percentage = (wiki_processed_count / total_landmarks) * 100
wiki_unprocessed_percentage = (wiki_unprocessed_count / total_landmarks) * 100

# Display detailed statistics by source type
print("\nDetailed Processing Statistics by Source Type:")
print(f"Total landmarks: {total_landmarks}")
print("\nPDF Processing:")
print(f"  Processed landmarks: {pdf_processed_count} ({pdf_processed_percentage:.2f}%)")
print(
    f"  Unprocessed landmarks: {pdf_unprocessed_count} ({pdf_unprocessed_percentage:.2f}%)"
)
print("\nWikipedia Processing:")
print(
    f"  Processed landmarks: {wiki_processed_count} ({wiki_processed_percentage:.2f}%)"
)
print(
    f"  Unprocessed landmarks: {wiki_unprocessed_count} ({wiki_unprocessed_percentage:.2f}%)"
)

# Display sample of landmarks by source processing status
print("\nSample of landmarks with PDF content:")
for lid in sorted(processing_results["pdf"]["processed"])[:5]:
    print(f"  - {lid}")

print("\nSample of landmarks with Wikipedia content:")
for lid in sorted(processing_results["wikipedia"]["processed"])[:5]:
    print(f"  - {lid}")

print("\nSample of landmarks missing PDF content:")
for lid in sorted(processing_results["pdf"]["unprocessed"])[:5]:
    print(f"  - {lid}")

In [None]:
# Create visualization of processing status by source type
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# PDF processing status
ax1.bar(
    ["Processed", "Unprocessed"],
    [pdf_processed_count, pdf_unprocessed_count],
    color=["#4CAF50", "#F44336"],
)
for i, (count, percentage) in enumerate(
    [
        (pdf_processed_count, pdf_processed_percentage),
        (pdf_unprocessed_count, pdf_unprocessed_percentage),
    ]
):
    ax1.text(
        i,
        count / 2,
        f"{count}\n({percentage:.1f}%)",
        color="white",
        ha="center",
        va="center",
        fontweight="bold",
        fontsize=11,
    )
# Emoji omitted from the title: matplotlib's default font may not include the glyph
ax1.set_title("PDF Content Processing Status", fontsize=14, fontweight="bold")
ax1.set_xlabel("Status", fontsize=12)
ax1.set_ylabel("Number of Landmarks", fontsize=12)
ax1.grid(axis="y", alpha=0.3)

# Wikipedia processing status
ax2.bar(
    ["Processed", "Unprocessed"],
    [wiki_processed_count, wiki_unprocessed_count],
    color=["#2196F3", "#FF9800"],
)
for i, (count, percentage) in enumerate(
    [
        (wiki_processed_count, wiki_processed_percentage),
        (wiki_unprocessed_count, wiki_unprocessed_percentage),
    ]
):
    ax2.text(
        i,
        count / 2,
        f"{count}\n({percentage:.1f}%)",
        color="white",
        ha="center",
        va="center",
        fontweight="bold",
        fontsize=11,
    )
# Emoji omitted from the title: matplotlib's default font may not include the glyph
ax2.set_title("Wikipedia Content Processing Status", fontsize=14, fontweight="bold")
ax2.set_xlabel("Status", fontsize=12)
ax2.set_ylabel("Number of Landmarks", fontsize=12)
ax2.grid(axis="y", alpha=0.3)

plt.suptitle(
    "NYC Landmarks Processing Status by Source Type", fontsize=16, fontweight="bold"
)
plt.tight_layout()
plt.show()

# Define coverage analysis variables for the combined visualization
pdf_processed = processing_results["pdf"]["processed"]
pdf_unprocessed = processing_results["pdf"]["unprocessed"]
wiki_processed = processing_results["wikipedia"]["processed"]
wiki_unprocessed = processing_results["wikipedia"]["unprocessed"]

landmarks_with_both = pdf_processed & wiki_processed
landmarks_with_pdf_only = pdf_processed - wiki_processed
landmarks_with_wiki_only = wiki_processed - pdf_processed
landmarks_with_neither = pdf_unprocessed & wiki_unprocessed

# Create a combined coverage visualization
plt.figure(figsize=(12, 8))
coverage_labels = ["Both PDF & Wikipedia", "PDF Only", "Wikipedia Only", "Neither"]
coverage_counts = [
    len(landmarks_with_both),
    len(landmarks_with_pdf_only),
    len(landmarks_with_wiki_only),
    len(landmarks_with_neither),
]
coverage_colors = ["#4CAF50", "#2196F3", "#FF9800", "#F44336"]

bars = plt.bar(coverage_labels, coverage_counts, color=coverage_colors)

# Add count and percentage labels on bars
for bar, count in zip(bars, coverage_counts):
    percentage = (count / total_landmarks) * 100
    plt.text(
        bar.get_x() + bar.get_width() / 2.0,
        bar.get_height() / 2,
        f"{count}\n({percentage:.1f}%)",
        color="white",
        ha="center",
        va="center",
        fontweight="bold",
        fontsize=12,
    )

plt.title("Content Coverage Analysis", fontsize=16, fontweight="bold")
plt.xlabel("Coverage Type", fontsize=14)
plt.ylabel("Number of Landmarks", fontsize=14)
plt.xticks(rotation=45, ha="right")
plt.grid(axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

## 7. Summary

This notebook has analyzed the processing status of NYC landmark records in the Pinecone vector database by source type, providing separate analysis for PDF and Wikipedia content.

### Key Features

1. **Source Type Filtering**: The notebook distinguishes between:
   - **PDF Content**: Expected for all landmarks (official LPC reports)
   - **Wikipedia Content**: Optional enhancement, not all landmarks have Wikipedia articles

2. **Comprehensive Analysis**: The status check provides:
   - Individual processing status by source type (PDF vs Wikipedia)
   - Coverage analysis showing landmarks with both, either, or neither source type
   - Identification of landmarks missing expected PDF content
   - Export of filtered lists for targeted processing

3. **Dynamic Pagination**: The fetch step:
   - Reads the total record count from the API, falling back to a default of 1765 if the call fails
   - Calculates the exact number of pages needed from the page size
   - Processes all records without manual adjustments

### Key Insights

- **PDF Processing**: Shows which landmarks have their official LPC reports processed
- **Wikipedia Processing**: Shows which landmarks have additional Wikipedia content
- **Coverage Gaps**: Identifies landmarks that may need investigation (missing PDF content)
- **Enhancement Opportunities**: Lists landmarks that could benefit from Wikipedia article processing

### Outputs

The notebook generates several files for further action:
- `landmarks_missing_pdf.txt`: Landmarks that should be investigated (every landmark should have PDF)
- `landmarks_missing_wikipedia.txt`: Landmarks that could benefit from Wikipedia content (optional)
- `landmarks_complete_coverage.txt`: Landmarks with both PDF and Wikipedia content

### Usage for Processing Planning

This analysis helps prioritize processing efforts:
1. **Critical**: Address landmarks missing PDF content first
2. **Enhancement**: Consider Wikipedia article processing for landmarks without it
3. **Maintenance**: Verify landmarks with complete coverage are working correctly
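
The priority order above can be turned into a single work queue. This is a sketch under assumptions: the tier labels and the `build_work_queue` helper are hypothetical, the IDs are made up, and the two input sets correspond to `pdf_unprocessed` and `wiki_unprocessed` from the status check.

```python
def build_work_queue(
    pdf_unprocessed: set[str], wiki_unprocessed: set[str]
) -> list[tuple[str, str]]:
    """Return (tier, landmark_id) pairs with missing-PDF work ordered first."""
    queue = [("critical:pdf", lid) for lid in sorted(pdf_unprocessed)]
    queue += [("enhancement:wikipedia", lid) for lid in sorted(wiki_unprocessed)]
    return queue


# A landmark missing both sources appears once per tier
work_queue = build_work_queue({"LP-00003"}, {"LP-00003", "LP-00001"})
for tier, landmark_id in work_queue:
    print(tier, landmark_id)
```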

Together, these outputs make the notebook actionable for maintaining and expanding the landmark vector database.