# NYC Landmarks Wikipedia Integration Testing

This notebook tests the integration of Wikipedia articles for NYC landmarks into the vector database. It demonstrates the process of:

1. **Exploring Landmark Data**: Fetching landmark information from the CoreDataStore API with interactive pagination
2. **Retrieving Wikipedia Articles**: Finding Wikipedia articles associated with NYC landmarks
3. **Fetching and Processing Wikipedia Content**: Retrieving, cleaning, and chunking article content
4. **End-to-End Wikipedia Processing Test**: Complete workflow demonstration for processing and querying Wikipedia articles
5. **Analyzing Wikipedia Coverage**: Examining Wikipedia article distribution in the vector database, including statistics on coverage percentages, landmark representation, and interactive querying of Wikipedia content
6. **Summary Report**: Review of Wikipedia integration status and recommendations

The notebook provides both testing tools and interactive visualizations to demonstrate the Wikipedia integration capabilities for the NYC landmarks vector database.

## Environment Setup

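First, let's set up our environment by verifying the Python installation and installing any required dependencies.

In [None]:
# Verify the Python installation used by this kernel.
# Note: `!alias python=python3` would have no effect here, because each `!`
# command runs in its own subshell and the alias is gone before the next line.
import sys

print(sys.version)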

# Check if the project is installed correctly
!pip list | grep nyc-landmarks-vector-db || echo "Project not installed - install with 'pip install -e .'"

In [None]:
# Install the project in development (editable) mode when setup.py is present
import os

# Check if we're in the right directory structure
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
print(f"Project root directory: {project_root}")

# Check for setup.py to confirm we're in the right place
setup_py_path = os.path.join(project_root, "setup.py")
if os.path.exists(setup_py_path):
    print("setup.py found, installing project in development mode...")
    !cd {project_root} && pip install -e .
else:
    print(f"setup.py not found at {setup_py_path}, please check directory structure")

In [28]:
# Check for environment variables required by the project
import os

# Environment variables the project may require
env_vars = [
    "OPENAI_API_KEY",  # For OpenAI embeddings
    "PINECONE_API_KEY",  # For Pinecone vector DB
    "PINECONE_ENVIRONMENT",  # Pinecone environment
    "PINECONE_INDEX_NAME",  # Pinecone index name
]

print("Checking environment variables:")
for var in env_vars:
    if os.environ.get(var):  # Treat empty values as unset
        print(f"✓ {var} is set")
    else:
        print(f"✗ {var} is NOT set")

Checking environment variables:
✓ OPENAI_API_KEY is set
✓ PINECONE_API_KEY is set
✓ PINECONE_ENVIRONMENT is set
✓ PINECONE_INDEX_NAME is set
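
The check above only reports missing variables. As a minimal sketch (not part of the project), it could be wrapped in a reusable helper that later cells call before touching any external service. `missing_env_vars` is a hypothetical name, and treating empty values as missing is an assumption about the desired behavior.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    names: Iterable[str], environ: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Demonstration with a fake environment, so no real secrets are needed
fake_env = {"OPENAI_API_KEY": "example-value", "PINECONE_API_KEY": ""}
missing = missing_env_vars(
    ["OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_INDEX_NAME"], fake_env
)
print(missing)  # ['PINECONE_API_KEY', 'PINECONE_INDEX_NAME']
```

In the notebook, a non-empty result could be used to raise an error early instead of failing later inside an API call.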


## Setup and Imports

Next, let's import the necessary modules, set up logging, and initialize the clients used throughout the notebook.

In [None]:
import logging
import math
import os
import sys
from typing import List

import ipywidgets as widgets
import pandas as pd
from IPython.display import clear_output, display

# Add project root to path to ensure imports work correctly
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))

from nyc_landmarks.db.db_client import get_db_client
from nyc_landmarks.db.wikipedia_fetcher import WikipediaFetcher
from nyc_landmarks.embeddings.generator import EmbeddingGenerator
from nyc_landmarks.models.wikipedia_models import (
    WikipediaArticleModel,
)
from nyc_landmarks.vectordb.pinecone_db import PineconeDB

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger()

# Initialize the components
db_client = get_db_client()  # Client used to query the CoreDataStore API
wiki_fetcher = WikipediaFetcher()
embedding_generator = EmbeddingGenerator()
pinecone_db = PineconeDB()

## 1. Exploring Landmark Data

Let's start by fetching some landmarks from the CoreDataStore API and exploring the data structure using interactive pagination widgets.

In [31]:
# Module-level flags that prevent re-entrant handler calls
_is_updating_page = False
_button_click_in_progress = False

# The widgets and data below are module-level names, so they persist across
# cell executions; a `global` statement is unnecessary at this level.

# Get the total record count for pagination (only once)
if "total_records" not in globals():
    print("Getting total landmark record count...")
    total_records = db_client.get_total_record_count()
    print(f"Total landmark records: {total_records}")

# Initialize widgets only once
if (
    "dashboard" not in globals() or dashboard is None
):  # Check if dashboard is None as well
    # Create interactive widgets for landmark data pagination
    page_size_options = [10, 20, 50, 100]
    page_size_dropdown = widgets.Dropdown(
        options=page_size_options,
        value=10,
        description="Page size:",
        disabled=False,
        layout=widgets.Layout(width="200px"),
    )

    # Calculate max page number based on total records and page size
    def get_max_page(page_size):
        return math.ceil(total_records / page_size)

    # Create the page number input with validation
    page_number = widgets.BoundedIntText(
        value=1,
        min=1,
        max=get_max_page(page_size_dropdown.value),
        step=1,
        description="Page:",
        disabled=False,
        layout=widgets.Layout(width="150px"),
    )

    # Navigation buttons
    prev_button = widgets.Button(
        description="Previous",
        disabled=True,  # Disabled initially since we start at page 1
        button_style="",
        tooltip="Go to previous page",
        icon="arrow-left",
    )

    next_button = widgets.Button(
        description="Next",
        disabled=False,
        button_style="",
        tooltip="Go to next page",
        icon="arrow-right",
    )

    # Status label showing page info
    status_label = widgets.Label(
        value=f"Page 1 of {get_max_page(page_size_dropdown.value)} (Records: {total_records})"
    )

    # Output area for the dataframe
    output_area = widgets.Output()

    # Function to fetch and display landmarks
    def fetch_and_display_landmarks(page, page_size):
        # Clear any previous output
        with output_area:
            clear_output(wait=True)  # Clear previous output in the widget
            print(f"Fetching page {page} with {page_size} records per page...")
        try:
            # Fetch the data from the API
            response = db_client.get_lpc_reports(page=page, limit=page_size)

            # Check if we got results
            if not response.results:
                with output_area:
                    print(f"No landmarks found on page {page}")
                return None

            # Create a DataFrame for display
            landmarks_df = pd.DataFrame(
                [landmark.model_dump() for landmark in response.results]
            )

            # Calculate record range on current page
            start_record = (page - 1) * page_size + 1
            end_record = min(start_record + len(response.results) - 1, total_records)

            with output_area:
                print(f"Showing records {start_record}-{end_record} of {total_records}")
                display(landmarks_df)

            # Return the landmarks for potential further use
            return response.results
        except Exception as e:
            with output_area:
                clear_output(wait=True)
                print(f"Error fetching landmarks: {str(e)}")
            return None

    # Event handlers for widgets
    def on_page_change(change):
        global _is_updating_page
        if _is_updating_page:
            return

        if change["name"] == "value" and change["new"] != change["old"]:
            _is_updating_page = True
            try:
                page = change["new"]
                max_page = get_max_page(page_size_dropdown.value)

                # Update button states
                prev_button.disabled = page <= 1
                next_button.disabled = page >= max_page

                # Update status label
                status_label.value = (
                    f"Page {page} of {max_page} (Records: {total_records})"
                )

                # Fetch and display landmarks
                global landmarks
                landmarks = fetch_and_display_landmarks(page, page_size_dropdown.value)
            finally:
                _is_updating_page = False

    def on_page_size_change(change):
        global _is_updating_page
        if _is_updating_page:
            return

        if change["name"] == "value" and change["new"] != change["old"]:
            _is_updating_page = True
            try:
                # Recalculate max page
                new_page_size = change["new"]
                new_max_page = get_max_page(new_page_size)

                # Update page number widget range
                page_number.max = new_max_page

                # Adjust current page if needed
                if page_number.value > new_max_page:
                    page_number.value = new_max_page

                # Update status label
                status_label.value = f"Page {page_number.value} of {new_max_page} (Records: {total_records})"

                # Refetch with new page size
                global landmarks
                landmarks = fetch_and_display_landmarks(
                    page_number.value, new_page_size
                )

                # Update button states
                prev_button.disabled = page_number.value <= 1
                next_button.disabled = page_number.value >= new_max_page
            finally:
                _is_updating_page = False

    def on_prev_button_click(b):
        global _button_click_in_progress
        if _button_click_in_progress:
            return
        _button_click_in_progress = True
        try:
            if page_number.value > 1:
                page_number.value -= 1
        finally:
            _button_click_in_progress = False

    def on_next_button_click(b):
        global _button_click_in_progress
        if _button_click_in_progress:
            return
        _button_click_in_progress = True
        try:
            if page_number.value < get_max_page(page_size_dropdown.value):
                page_number.value += 1
        finally:
            _button_click_in_progress = False

    # Register event handlers
    # Remove all existing observers to prevent multiple calls if the cell is run multiple times
    page_number.unobserve_all()
    page_size_dropdown.unobserve_all()

    page_number.observe(on_page_change, names="value")
    page_size_dropdown.observe(on_page_size_change, names="value")
    prev_button.on_click(on_prev_button_click)
    next_button.on_click(on_next_button_click)

    # For VS Code compatibility, create alternative non-widget UI
    print("\nVS Code Widget Alternative Controls:")
    print(
        "Use these functions to navigate instead of the widgets if they don't display properly:"
    )

    def go_to_page(page_num):
        """Go to a specific page"""
        if 1 <= page_num <= get_max_page(page_size_dropdown.value):
            page_number.value = page_num
            return fetch_and_display_landmarks(page_num, page_size_dropdown.value)
        else:
            print(
                f"Page number must be between 1 and {get_max_page(page_size_dropdown.value)}"
            )

    def next_page():
        """Go to next page"""
        return go_to_page(page_number.value + 1)

    def prev_page():
        """Go to previous page"""
        return go_to_page(page_number.value - 1)

    def change_page_size(size):
        """Change page size"""
        if size in page_size_options:
            page_size_dropdown.value = size
            return fetch_and_display_landmarks(page_number.value, size)
        else:
            print(f"Page size must be one of {page_size_options}")

    print("\nExample usage:")
    print("go_to_page(2)      # Go to page 2")
    print("next_page()        # Go to next page")
    print("prev_page()        # Go to previous page")
    print("change_page_size(20) # Change to 20 records per page")

    # Layout the widgets - if they display properly in your environment
    # Put the pagination controls at the top
    controls = widgets.HBox([page_size_dropdown, page_number, prev_button, next_button])
    # Position status label and controls above the output area
    dashboard = widgets.VBox(
        [
            controls,  # Pagination controls on top
            status_label,  # Status label below controls
            output_area,  # Output area at the bottom
        ]
    )

# Always display the dashboard and fetch initial data when the cell is run
try:
    display(dashboard)
    # Initial display
    landmarks = fetch_and_display_landmarks(page_number.value, page_size_dropdown.value)
except Exception as e:
    print(
        f"Note: Widget display error: {str(e)}. Use the alternative functions above instead."
    )


Fetching page 4 with 10 records per page...
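
The pagination arithmetic used by the dashboard (maximum page, first and last record on a page) can be checked in isolation. The following is a minimal sketch under the same 1-based page convention. `page_bounds` is a hypothetical helper, and the record counts are made up for illustration.

```python
import math
from typing import Tuple


def page_bounds(page: int, page_size: int, total_records: int) -> Tuple[int, int, int]:
    """Return (max_page, start_record, end_record) for a 1-based page number."""
    max_page = math.ceil(total_records / page_size)
    if not 1 <= page <= max_page:
        raise ValueError(f"page must be between 1 and {max_page}")
    start_record = (page - 1) * page_size + 1
    # The last page may be only partially filled
    end_record = min(page * page_size, total_records)
    return max_page, start_record, end_record


print(page_bounds(page=4, page_size=10, total_records=35))  # (4, 31, 35)
```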


## 2. Retrieving Wikipedia Articles for Landmarks

This section explores which landmarks have associated Wikipedia articles and displays an interactive page-by-page view of the landmarks along with their Wikipedia articles. You can browse through landmarks and see which ones have Wikipedia content available for integration into the vector database.

In [None]:
# Function to check and display Wikipedia articles for a landmark


def check_wikipedia_articles(landmark_id: str) -> List[WikipediaArticleModel]:
    """Check if a landmark has associated Wikipedia articles.

    Args:
        landmark_id: ID of the landmark to check

    Returns:
        List of WikipediaArticleModel objects
    """
    articles = db_client.get_wikipedia_articles(landmark_id)
    print(f"Found {len(articles)} Wikipedia articles for landmark: {landmark_id}")
    return articles


# Check Wikipedia articles for each landmark on the current page
landmark_articles = {}
landmarks_data = []  # One entry per landmark, including those with no articles
for landmark in landmarks or []:
    landmark_id = landmark.lpNumber
    name = landmark.name
    print(f"Checking {name} ({landmark_id})...")
    articles = check_wikipedia_articles(landmark_id)
    if articles:
        landmark_articles[landmark_id] = articles
    # Record every landmark so the coverage percentage is not always 100%
    landmarks_data.append({"landmark_id": landmark_id, "article_count": len(articles)})
    print("-" * 40)

print(
    f"Found {len(landmark_articles)} landmarks with Wikipedia articles out of {len(landmarks_data)} total"
)


# Function to analyze Wikipedia article coverage for landmarks


def analyze_wikipedia_coverage(landmarks_data):
    """Analyze Wikipedia article coverage for a set of landmarks.

    Args:
        landmarks_data: List of landmark data including Wikipedia articles

    Returns:
        Dictionary with analysis results
    """
    total_landmarks = len(landmarks_data)
    landmarks_with_articles = sum(1 for ld in landmarks_data if ld["article_count"] > 0)
    coverage_percentage = (
        (landmarks_with_articles / total_landmarks * 100) if total_landmarks > 0 else 0
    )

    # Count total articles
    total_articles = sum(ld["article_count"] for ld in landmarks_data)

    # Calculate per-landmark statistics (the average is taken over landmarks
    # that have at least one article)
    articles_per_landmark = [ld["article_count"] for ld in landmarks_data]
    max_articles = max(articles_per_landmark) if articles_per_landmark else 0
    avg_articles = (
        sum(articles_per_landmark) / landmarks_with_articles
        if landmarks_with_articles > 0
        else 0
    )

    return {
        "total_landmarks": total_landmarks,
        "landmarks_with_articles": landmarks_with_articles,
        "coverage_percentage": coverage_percentage,
        "total_articles": total_articles,
        "max_articles_per_landmark": max_articles,
        "avg_articles_per_landmark": avg_articles,
    }


# Analyze the current page of landmarks
analysis = analyze_wikipedia_coverage(landmarks_data)

# Display analysis results
print("Wikipedia Coverage Analysis:")
print(f"- Landmarks analyzed: {analysis['total_landmarks']}")
print(
    f"- Landmarks with Wikipedia articles: {analysis['landmarks_with_articles']} ({analysis['coverage_percentage']:.1f}%)"
)
print(f"- Total Wikipedia articles: {analysis['total_articles']}")
print(f"- Max articles per landmark: {analysis['max_articles_per_landmark']}")
print(
    f"- Average articles per landmark with articles: {analysis['avg_articles_per_landmark']:.2f}"
)

# Transform landmark_articles into a list of dictionaries for sorting
landmarks_with_articles = [
    {"landmark_id": landmark_id, "article_count": len(articles)}
    for landmark_id, articles in landmark_articles.items()
]

# Sort landmarks by article count
sorted_landmarks = sorted(
    landmarks_with_articles, key=lambda x: x["article_count"], reverse=True
)

# Display top landmarks with the most articles
print("\nLandmarks with the most Wikipedia articles:")
for landmark in sorted_landmarks[:3]:  # Show top 3
    print(
        f"Landmark ID: {landmark['landmark_id']}, Articles: {landmark['article_count']}"
    )
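
A coverage percentage is only meaningful when landmarks without articles are counted in the denominator. The sketch below illustrates this with made-up landmark IDs and counts. `coverage_stats` is a hypothetical helper and does not use the project's client.

```python
from typing import Dict


def coverage_stats(article_counts: Dict[str, int]) -> Dict[str, float]:
    """Summarize coverage from a mapping of landmark ID to article count."""
    total = len(article_counts)
    covered = sum(1 for count in article_counts.values() if count > 0)
    total_articles = sum(article_counts.values())
    return {
        "total_landmarks": total,
        "landmarks_with_articles": covered,
        "coverage_percentage": covered / total * 100 if total else 0.0,
        "avg_articles_per_covered_landmark": (
            total_articles / covered if covered else 0.0
        ),
    }


# Made-up IDs and counts, purely for illustration
sample_counts = {"example-1": 2, "example-2": 0, "example-3": 1, "example-4": 0}
stats = coverage_stats(sample_counts)
print(stats["coverage_percentage"])  # 50.0
```

If the zero-count entries were dropped before the calculation, the same data would report 100% coverage.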

In [None]:
# Display a summary of the Wikipedia articles found across a sample of landmark pages


def get_all_wikipedia_articles():
    """Fetch Wikipedia articles for multiple pages of landmarks to get a larger sample"""
    all_articles_data = []
    max_pages_to_check = 3
    page_size = 20

    print(
        f"Fetching Wikipedia articles across {max_pages_to_check} pages (page size: {page_size})..."
    )

    for page in range(1, max_pages_to_check + 1):
        print(f"Fetching page {page}...")
        response = db_client.get_lpc_reports(page=page, limit=page_size)

        if not response.results:
            print(f"No landmarks found on page {page}")
            break

        landmarks_batch = response.results

        for landmark in landmarks_batch:
            landmark_id = landmark.lpNumber
            landmark_name = landmark.name
            articles = db_client.get_wikipedia_articles(landmark_id)

            for article in articles:
                all_articles_data.append(
                    {
                        "landmark_id": landmark_id,
                        "landmark_name": landmark_name,
                        "article_title": article.title,
                        "article_url": article.url,
                    }
                )

    print(f"Found {len(all_articles_data)} Wikipedia articles")
    return all_articles_data


# Get a sample of Wikipedia articles across multiple pages
all_wiki_articles = get_all_wikipedia_articles()

# Summarize the sampled articles
if all_wiki_articles:
    # Create a DataFrame for easier viewing
    articles_df = pd.DataFrame(all_wiki_articles)

    # Show summary statistics
    landmark_count = len(set(articles_df["landmark_id"]))
    print("\nSummary Statistics:")
    print(f"- Total Wikipedia articles found: {len(articles_df)}")
    print(f"- Total landmarks with articles: {landmark_count}")
    print(f"- Average articles per landmark: {len(articles_df) / landmark_count:.2f}")

    # Display the full DataFrame
    display(articles_df)

    # Display article title word cloud (if matplotlib and wordcloud are available)
    try:
        import matplotlib.pyplot as plt
        from wordcloud import WordCloud

        # Join all article titles
        all_titles = " ".join(articles_df["article_title"])

        # Generate word cloud
        wordcloud = WordCloud(
            width=800,
            height=400,
            background_color="white",
            max_words=100,
            contour_width=3,
        ).generate(all_titles)

        # Display the word cloud
        plt.figure(figsize=(10, 5))
        plt.imshow(wordcloud, interpolation="bilinear")
        plt.axis("off")
        plt.title("Word Cloud of Wikipedia Article Titles")
        plt.show()
    except ImportError:
        print(
            "\nNote: Install wordcloud and matplotlib packages to generate a word cloud visualization"
        )

else:
    print("No Wikipedia articles found in the sampled landmarks")

In [None]:
# Create an interactive display for landmarks and their Wikipedia articles with pagination

# Create widgets for pagination
landmark_page_size_options = [5, 10, 20, 50]
landmark_page_size_dropdown = widgets.Dropdown(
    options=landmark_page_size_options,
    value=5,
    description="Items/page:",
    disabled=False,
    layout=widgets.Layout(width="150px"),
)

# Initialize landmarks data
all_landmarks_with_articles = []

# Function to fetch Wikipedia articles for all landmarks on the current page


def fetch_landmark_articles(page, page_size):
    global all_landmarks_with_articles
    all_landmarks_with_articles = []

    # Fetch the current page of landmarks
    response = db_client.get_lpc_reports(page=page, limit=page_size)

    if not response.results:
        return []

    # Get Wikipedia articles for each landmark
    for landmark in response.results:
        landmark_id = landmark.lpNumber
        name = landmark.name

        # Fetch articles for this landmark
        articles = db_client.get_wikipedia_articles(landmark_id)

        # Add to our data structure
        landmark_data = {
            "landmark_id": landmark_id,
            "landmark_name": name,
            "articles": articles,
            "article_count": len(articles),
        }
        all_landmarks_with_articles.append(landmark_data)

    return all_landmarks_with_articles


# Calculate max page number


def get_landmarks_max_page(page_size):
    return math.ceil(total_records / page_size)


# Create the page number input
landmark_page_number = widgets.BoundedIntText(
    value=1,
    min=1,
    max=get_landmarks_max_page(landmark_page_size_dropdown.value),
    step=1,
    description="Page:",
    disabled=False,
    layout=widgets.Layout(width="150px"),
)

# Navigation buttons
landmark_prev_button = widgets.Button(
    description="Previous",
    disabled=True,  # Disabled initially
    button_style="",
    tooltip="Go to previous page",
    icon="arrow-left",
)

landmark_next_button = widgets.Button(
    description="Next",
    disabled=False,
    button_style="",
    tooltip="Go to next page",
    icon="arrow-right",
)

# Status label
landmark_status_label = widgets.Label(
    value=f"Page 1 of {get_landmarks_max_page(landmark_page_size_dropdown.value)} (Records: {total_records})"
)

# Output area
landmark_output_area = widgets.Output()

# Function to display landmarks with their Wikipedia articles


def display_landmarks_with_articles():
    # Render into the dashboard's output area, replacing any previous output
    with landmark_output_area:
        clear_output(wait=True)

        if not all_landmarks_with_articles:
            print("No landmarks found on this page")
            return

        # Create a basic table for landmarks
        landmarks_table = pd.DataFrame(
            [
                {
                    "ID": item["landmark_id"],
                    "Name": item["landmark_name"],
                    "Wikipedia Articles": item["article_count"],
                }
                for item in all_landmarks_with_articles
            ]
        )

        print("Landmarks:")
        display(landmarks_table)

        # For each landmark, show its articles if any exist
        for landmark in all_landmarks_with_articles:
            if landmark["article_count"] > 0:
                print(
                    f"\n{landmark['landmark_name']} ({landmark['landmark_id']}) - {landmark['article_count']} Wikipedia articles:"
                )

                # Create articles table
                articles_data = [
                    {"Title": article.title, "URL": article.url}
                    for article in landmark["articles"]
                ]

                articles_table = pd.DataFrame(articles_data)
                display(articles_table)
                print("-" * 80)  # Separator


# Event handlers


def on_landmark_page_change(change):
    if change["name"] == "value" and change["new"] != change["old"]:
        page = change["new"]
        page_size = landmark_page_size_dropdown.value
        max_page = get_landmarks_max_page(page_size)

        # Update button states
        landmark_prev_button.disabled = page <= 1
        landmark_next_button.disabled = page >= max_page

        # Update status label
        landmark_status_label.value = (
            f"Page {page} of {max_page} (Records: {total_records})"
        )

        # Fetch and display landmarks with articles
        print(f"Fetching page {page} with {page_size} landmarks per page...")
        fetch_landmark_articles(page, page_size)
        display_landmarks_with_articles()


def on_landmark_page_size_change(change):
    if change["name"] == "value" and change["new"] != change["old"]:
        new_page_size = change["new"]
        new_max_page = get_landmarks_max_page(new_page_size)

        # Update page number widget range
        landmark_page_number.max = new_max_page

        # Adjust current page if needed
        if landmark_page_number.value > new_max_page:
            landmark_page_number.value = new_max_page

        # Update status label
        landmark_status_label.value = f"Page {landmark_page_number.value} of {new_max_page} (Records: {total_records})"

        # Refetch with new page size
        fetch_landmark_articles(landmark_page_number.value, new_page_size)
        display_landmarks_with_articles()


def on_landmark_prev_button_click(b):
    if landmark_page_number.value > 1:
        landmark_page_number.value -= 1


def on_landmark_next_button_click(b):
    if landmark_page_number.value < get_landmarks_max_page(
        landmark_page_size_dropdown.value
    ):
        landmark_page_number.value += 1


# Register event handlers
# The widgets above are re-created each time this cell runs, so there are no
# stale handlers to remove before registering new ones.
landmark_page_number.observe(on_landmark_page_change, names="value")
landmark_page_size_dropdown.observe(on_landmark_page_size_change, names="value")
landmark_prev_button.on_click(on_landmark_prev_button_click)
landmark_next_button.on_click(on_landmark_next_button_click)

# For VS Code compatibility, create alternative non-widget UI
print("\nVS Code Widget Alternative Controls for Landmarks with Wikipedia Articles:")
print(
    "Use these functions to navigate instead of the widgets if they don't display properly:"
)


def view_landmarks_page(page_num):
    """Go to a specific page of landmarks with Wikipedia articles"""
    if 1 <= page_num <= get_landmarks_max_page(landmark_page_size_dropdown.value):
        landmark_page_number.value = page_num
        fetch_landmark_articles(page_num, landmark_page_size_dropdown.value)
        display_landmarks_with_articles()
    else:
        print(
            f"Page number must be between 1 and {get_landmarks_max_page(landmark_page_size_dropdown.value)}"
        )


def next_landmarks_page():
    """Go to next page of landmarks"""
    view_landmarks_page(landmark_page_number.value + 1)


def prev_landmarks_page():
    """Go to previous page of landmarks"""
    view_landmarks_page(landmark_page_number.value - 1)


def change_landmarks_page_size(size):
    """Change page size for landmarks"""
    if size in landmark_page_size_options:
        landmark_page_size_dropdown.value = size
        fetch_landmark_articles(landmark_page_number.value, size)
        display_landmarks_with_articles()
    else:
        print(f"Page size must be one of {landmark_page_size_options}")


print("\nExample usage:")
print("view_landmarks_page(2)        # Go to page 2")
print("next_landmarks_page()         # Go to next page")
print("prev_landmarks_page()         # Go to previous page")
print("change_landmarks_page_size(10) # Change to 10 records per page")

# Initial fetch and display
fetch_landmark_articles(landmark_page_number.value, landmark_page_size_dropdown.value)
display_landmarks_with_articles()

# Layout the widgets - display if they render properly in your environment
landmark_controls = widgets.HBox(
    [
        landmark_page_size_dropdown,
        landmark_page_number,
        landmark_prev_button,
        landmark_next_button,
    ]
)
landmark_dashboard = widgets.VBox(
    [
        widgets.HTML("<h3>Landmarks with Wikipedia Articles</h3>"),
        landmark_controls,
        landmark_status_label,
        landmark_output_area,
    ]
)

# Try displaying the dashboard
try:
    display(landmark_dashboard)
except Exception as e:
    print(
        f"\nNote: Widget display error: {str(e)}. Use the alternative functions above instead."
    )

## 3. Fetching and Processing Wikipedia Content

Now let's fetch the content of a single Wikipedia article and split it into overlapping chunks (`chunk_size=1000`, `chunk_overlap=200`) ready for embedding.

In [None]:
# Fetch and process content from a single Wikipedia article

# Create widgets for article selection
article_selector = None
if all_wiki_articles:
    # Create a dropdown with available articles
    article_options = [
        (f"{article['landmark_name']} - {article['article_title']}", i)
        for i, article in enumerate(all_wiki_articles)
    ]

    article_selector = widgets.Dropdown(
        options=article_options,
        description="Select article:",
        style={"description_width": "initial"},
        layout=widgets.Layout(width="600px"),
    )

    # Display the selector
    print("\nVS Code Widget Alternative for Article Selection:")
    print(
        "Use this function to select an article if the dropdown widget doesn't display properly:"
    )

    # Create a list of available articles for selection
    print("\nAvailable articles:")
    for i, (label, _) in enumerate(article_options):
        print(f"{i}: {label}")

    def select_article(index):
        """Select an article by its index"""
        if 0 <= index < len(article_options):
            fetch_article_content(index)
        else:
            print(f"Index must be between 0 and {len(article_options) - 1}")

    print("\nExample usage:")
    print("select_article(0)  # Select the first article")

    # Try to display the selector widget
    try:
        display(article_selector)
    except Exception as e:
        print(
            f"\nNote: Widget display error: {str(e)}. Use the select_article function instead."
        )

    # Create an output area for article content
    article_output = widgets.Output()
    display(article_output)

    # Function to fetch and display article content
    def fetch_article_content(article_idx):
        if article_idx is None:
            print("Please select an article")
            return

        article = all_wiki_articles[article_idx]
        print(f"Fetching content for: {article['article_title']}")
        print(f"URL: {article['article_url']}")
        print(f"Landmark: {article['landmark_name']} ({article['landmark_id']})")
        print("-" * 80)

        # Fetch article content
        content = wiki_fetcher.fetch_wikipedia_content(article["article_url"])
        if not content:
            print("Failed to fetch article content")
            return

        # Display content statistics
        print(f"Article length: {len(content)} characters")

        # Split into chunks
        chunks = wiki_fetcher.chunk_wikipedia_text(
            content, chunk_size=1000, chunk_overlap=200
        )

        print(f"Split article into {len(chunks)} chunks")

        # Display a sample of the content
        print("\nArticle preview:")
        preview_length = min(500, len(content))
        print(content[:preview_length] + "...")

        # Display first chunk
        if chunks:
            print("\nFirst chunk:")
            print(f"Text: {chunks[0]['text'][:300]}...")
            print(f"Metadata: {chunks[0]['metadata']}")

    # Observe article selection
    def on_article_select(change):
        if change["name"] == "value" and change["new"] is not None:
            # Route callback output into the Output widget created above
            with article_output:
                clear_output()
                fetch_article_content(change["new"])

    article_selector.observe(on_article_select, names="value")

    # Initial display if we have articles
    if article_options:
        fetch_article_content(article_options[0][1])

else:
    print("No Wikipedia articles available to process")
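
The cell above calls `wiki_fetcher.chunk_wikipedia_text` with `chunk_size=1000` and `chunk_overlap=200`. Its implementation isn't shown in this notebook, so as a rough illustration of what those two parameters mean, here is a minimal character-based sliding-window chunker. The name `chunk_text_sketch` and the metadata keys are hypothetical, and the real fetcher may well split on tokens or sentence boundaries instead.

```python
def chunk_text_sketch(text, chunk_size=1000, chunk_overlap=200):
    """Split text into overlapping character windows (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(
            {
                "text": text[start:end],
                # Hypothetical metadata layout, loosely mirroring chunk["metadata"]
                "metadata": {"chunk_index": len(chunks), "start": start, "end": end},
            }
        )
        if end == len(text):
            break  # the last window reached the end of the text
        start += step
    return chunks


sample_chunks = chunk_text_sketch("x" * 2500)
print([(c["metadata"]["start"], c["metadata"]["end"]) for c in sample_chunks])
```

With these settings each window starts 800 characters after the previous one, so consecutive chunks share 200 characters of context.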

## 4. End-to-End Wikipedia Processing Test

This section runs the full workflow for a single landmark: fetch its Wikipedia articles, chunk the content, generate embeddings, store them in the vector database, and run a test query against the stored vectors.
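
To get a feel for the store-then-query part of this workflow without network access or a Pinecone index, the sketch below dry-runs it in memory. Everything in it is a hypothetical stand-in: `toy_embed` is a hashed bag-of-words rather than a real embedding model, `InMemoryStore` is not the project's database client, and the landmark ID and texts are made up.

```python
import math
from collections import Counter


def toy_embed(text, dim=16):
    """Hypothetical stand-in for an embedding model: hashed bag-of-words."""
    vec = [0.0] * dim
    for word, count in Counter(text.lower().split()).items():
        # Deterministic bucket per word (Python's hash() varies between runs)
        vec[sum(ord(ch) for ch in word) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class InMemoryStore:
    """Hypothetical stand-in for the vector database client."""

    def __init__(self):
        self.vectors = {}

    def store_chunks(self, chunks, id_prefix):
        ids = []
        for i, chunk in enumerate(chunks):
            vector_id = f"{id_prefix}{i}"  # fixed IDs, so reruns overwrite
            self.vectors[vector_id] = chunk
            ids.append(vector_id)
        return ids

    def query(self, embedding, top_k=3, filter_dict=None):
        filter_dict = filter_dict or {}
        matches = []
        for vector_id, chunk in self.vectors.items():
            meta = chunk["metadata"]
            if any(meta.get(k) != v for k, v in filter_dict.items()):
                continue  # metadata filter: every key must match
            # Vectors are unit length, so the dot product is the cosine similarity
            score = sum(a * b for a, b in zip(embedding, chunk["embedding"]))
            matches.append({"id": vector_id, "score": score, "metadata": meta})
        return sorted(matches, key=lambda m: m["score"], reverse=True)[:top_k]


texts = [
    "The bridge has stone towers and steel cables",
    "The station has a vaulted ceiling over its main hall",
]
landmark_id = "LP-00001"  # made-up ID for the demo
chunks = [
    {
        "embedding": toy_embed(t),
        "metadata": {"text": t, "landmark_id": landmark_id, "source_type": "wikipedia"},
    }
    for t in texts
]

store = InMemoryStore()
stored_ids = store.store_chunks(chunks, id_prefix=f"wiki-{landmark_id}-")
results = store.query(
    toy_embed("stone towers of the bridge"),
    top_k=1,
    filter_dict={"landmark_id": landmark_id, "source_type": "wikipedia"},
)
print(stored_ids)
print(results[0]["metadata"]["text"], round(results[0]["score"], 4))
```

The ID prefix and the metadata filter follow the same shape as the calls in the cell below, which makes the sketch handy for checking filter logic before touching the real index.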

In [None]:
# Complete end-to-end test for a single landmark


def process_landmark_wikipedia_articles(landmark_id):
    """Process Wikipedia articles for a single landmark.

    Args:
        landmark_id: ID of the landmark to process

    Returns:
        Dictionary with processing results
    """
    print(f"Processing Wikipedia articles for landmark {landmark_id}")

    # Progress tracker
    progress_output = widgets.Output()
    progress_bar = widgets.IntProgress(
        value=0,
        min=0,
        max=5,  # 5 steps in our process
        description="Progress:",
        bar_style="info",
        orientation="horizontal",
    )
    display(
        widgets.VBox(
            [widgets.HTML("<b>Processing Steps</b>"), progress_bar, progress_output]
        )
    )

    def update_progress(step, message):
        progress_bar.value = step
        with progress_output:
            print(f"Step {step}/5: {message}")

    update_progress(1, "Fetching Wikipedia articles")

    # Step 1: Get Wikipedia articles for the landmark
    try:
        articles = db_client.get_wikipedia_articles(landmark_id)
        if not articles:
            with progress_output:
                print(f"No Wikipedia articles found for landmark {landmark_id}")
            return {"success": False, "reason": "No Wikipedia articles found"}

        with progress_output:
            print(f"Found {len(articles)} Wikipedia articles")
    except Exception as e:
        with progress_output:
            print(f"Error fetching Wikipedia articles: {str(e)}")
        return {"success": False, "reason": f"Error fetching articles: {str(e)}"}

    # Step 2: Process each article
    update_progress(2, "Fetching article content")
    all_chunks = []
    article_details = []

    for i, article in enumerate(articles):
        with progress_output:
            print(f"\nProcessing article {i+1}/{len(articles)}: {article.title}")

        # Fetch article content
        try:
            content = wiki_fetcher.fetch_wikipedia_content(article.url)
            if not content:
                with progress_output:
                    print(f"Failed to fetch content for article: {article.title}")
                continue

            with progress_output:
                print(f"Successfully fetched article content ({len(content)} chars)")

            article_details.append(
                {
                    "title": article.title,
                    "url": article.url,
                    "content_length": len(content),
                }
            )
        except Exception as e:
            with progress_output:
                print(f"Error fetching content for {article.title}: {str(e)}")
            continue

        # Chunk the content
        update_progress(3, "Chunking article content")
        try:
            chunks = wiki_fetcher.chunk_wikipedia_text(
                content, chunk_size=1000, chunk_overlap=200
            )

            with progress_output:
                print(f"Split article into {len(chunks)} chunks")

            # Add article metadata to chunks
            for chunk in chunks:
                chunk["metadata"]["article_title"] = article.title
                chunk["metadata"]["article_url"] = article.url
                chunk["metadata"]["source_type"] = "wikipedia"
                chunk["metadata"]["landmark_id"] = landmark_id

            all_chunks.extend(chunks)
        except Exception as e:
            with progress_output:
                print(f"Error chunking content for {article.title}: {str(e)}")
            continue

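    # Stop early if no article produced any chunks, so the store step below
    # never runs with an empty batch
    if not all_chunks:
        with progress_output:
            print("No chunks were generated from the fetched articles")
        return {"success": False, "reason": "No chunks generated"}

    # Step 4: Generate embeddings
    update_progress(4, "Generating embeddings")
    # Limit to first 5 chunks for testing to reduce processing time
    test_chunks = all_chunks[:5]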

    with progress_output:
        print(f"\nGenerating embeddings for {len(test_chunks)} chunks")

    try:
        chunks_with_embeddings = embedding_generator.process_chunks(test_chunks)
        with progress_output:
            print(f"Generated embeddings for {len(chunks_with_embeddings)} chunks")
    except Exception as e:
        with progress_output:
            print(f"Error generating embeddings: {str(e)}")
        return {"success": False, "reason": f"Error generating embeddings: {str(e)}"}

    # Step 5: Store in Pinecone
    update_progress(5, "Storing in vector database")
    with progress_output:
        print("\nStoring embeddings in Pinecone...")

    try:
        vector_ids = pinecone_db.store_chunks(
            chunks=chunks_with_embeddings,
            id_prefix=f"wiki-{landmark_id}-",
            landmark_id=landmark_id,
            use_fixed_ids=True,
            delete_existing=True,
        )

        with progress_output:
            print(f"Stored {len(vector_ids)} vectors in Pinecone")
    except Exception as e:
        with progress_output:
            print(f"Error storing vectors: {str(e)}")
        return {"success": False, "reason": f"Error storing vectors: {str(e)}"}

    # Verification: query the stored vectors (not counted as a progress step)
    with progress_output:
        print("\nTesting retrieval with a query...")

    landmark_name = next(
        (lm.name for lm in landmarks if lm.lpNumber == landmark_id), "landmark"
    )
    test_query = f"Tell me about the history of {landmark_name}"

    with progress_output:
        print(f"Test query: '{test_query}'")

    try:
        query_embedding = embedding_generator.generate_embedding(test_query)
        results = pinecone_db.query_vectors(
            query_embedding,
            top_k=3,
            filter_dict={"landmark_id": landmark_id, "source_type": "wikipedia"},
        )

        with progress_output:
            print(f"Found {len(results)} matching results")

            if results:
                for i, match in enumerate(results):
                    print(f"\nMatch {i+1} - Score: {match['score']:.4f}")
                    print(
                        f"Article: {match['metadata'].get('article_title', 'Unknown')}"
                    )
                    print(f"Text: {match['metadata'].get('text', '')[:150]}...")
    except Exception as e:
        with progress_output:
            print(f"Error querying vectors: {str(e)}")
        # Don't fail the overall process just for the query test

    # Return the results
    return {
        "success": True,
        "landmark_id": landmark_id,
        "landmark_name": landmark_name,
        "articles_processed": len(articles),
        "article_details": article_details,
        "chunks_generated": len(all_chunks),
        "vectors_stored": len(vector_ids),
        "query_results": results if "results" in locals() else [],
    }


# Create a dropdown to select a landmark for testing
if landmark_articles:
    # Create options for the dropdown
    landmark_options = [
        (
            f"{landmark_id} - {next((l.name for l in landmarks if l.lpNumber == landmark_id), 'Unknown')}",
            landmark_id,
        )
        for landmark_id in landmark_articles.keys()
    ]

    test_landmark_selector = widgets.Dropdown(
        options=landmark_options,
        description="Test landmark:",
        style={"description_width": "initial"},
        layout=widgets.Layout(width="600px"),
    )

    # Create a button to start the test
    test_button = widgets.Button(
        description="Run End-to-End Test",
        button_style="primary",
        tooltip="Process Wikipedia articles for the selected landmark",
    )

    # Display the selector and button
    display(
        widgets.VBox(
            [
                widgets.HTML("<h3>End-to-End Wikipedia Processing Test</h3>"),
                test_landmark_selector,
                test_button,
            ]
        )
    )

    # Create an output area for test results
    test_output = widgets.Output()
    display(test_output)

    # Function to run when button is clicked
    def on_test_button_click(b):
        with test_output:
            clear_output()
            if test_landmark_selector.value:
                print(
                    f"Starting end-to-end test for landmark: {test_landmark_selector.value}"
                )
                result = process_landmark_wikipedia_articles(
                    test_landmark_selector.value
                )

                if result["success"]:
                    print("\nTest completed successfully!")
                    print(
                        f"Landmark: {result['landmark_name']} ({result['landmark_id']})"
                    )
                    print(f"Articles processed: {result['articles_processed']}")
                    print(f"Chunks generated: {result['chunks_generated']}")
                    print(f"Vectors stored: {result['vectors_stored']}")
                else:
                    print(f"\nTest failed: {result['reason']}")
            else:
                print("Please select a landmark to test")

    # Register button click handler
    test_button.on_click(on_test_button_click)

    # Automatically run the test if only one landmark is available
    if len(landmark_options) == 1:
        with test_output:
            print(
                "Only one landmark with Wikipedia articles found. Running test automatically..."
            )
            test_landmark_selector.value = landmark_options[0][1]
            on_test_button_click(test_button)
else:
    print(
        "No landmarks with Wikipedia articles available for testing. Please browse to a page with landmarks that have Wikipedia articles."
    )

## 5. Analyzing Wikipedia Coverage in Vector Database

This section analyzes how Wikipedia articles are represented in the vector database. We'll examine:

- The total number of Wikipedia vectors compared to other content
- Coverage percentage of landmarks with Wikipedia content
- Distribution of Wikipedia content across different landmarks
- Results of interactive semantic searches against Wikipedia content

This analysis helps assess the current state of Wikipedia integration and identify opportunities for improvement in coverage and quality.
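
The coverage figures listed above come down to counting landmark IDs in vector metadata. As a minimal sketch, assuming each vector is a dict with a `metadata` mapping that carries `landmark_id` and `source_type`, the calculation can be written as a small pure function. The name `summarize_coverage` and the sample records are hypothetical.

```python
from collections import Counter


def summarize_coverage(vectors):
    """Compute per-landmark Wikipedia coverage from vector metadata records."""
    all_counts = Counter()
    wiki_counts = Counter()
    for vector in vectors:
        metadata = vector.get("metadata", {})
        landmark_id = metadata.get("landmark_id")
        if landmark_id is None:
            continue  # skip vectors that aren't tied to a landmark
        all_counts[landmark_id] += 1
        if metadata.get("source_type") == "wikipedia":
            wiki_counts[landmark_id] += 1

    coverage = len(wiki_counts) / len(all_counts) * 100 if all_counts else 0.0
    return {
        "landmarks_total": len(all_counts),
        "landmarks_with_wikipedia": len(wiki_counts),
        "coverage_percentage": coverage,
        "top_wikipedia_landmarks": wiki_counts.most_common(10),
    }


# Made-up sample records for illustration
sample_vectors = [
    {"metadata": {"landmark_id": "LP-00001", "source_type": "wikipedia"}},
    {"metadata": {"landmark_id": "LP-00001", "source_type": "wikipedia"}},
    {"metadata": {"landmark_id": "LP-00001", "source_type": "pdf"}},
    {"metadata": {"landmark_id": "LP-00002", "source_type": "pdf"}},
    {"metadata": {"source_type": "wikipedia"}},
]
summary = summarize_coverage(sample_vectors)
print(summary)
```

When the input is only a sample of the index, the resulting percentage is an estimate rather than an exact figure.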

In [None]:
# Analyze Wikipedia vectors in the Pinecone database

# Create an output area for database information
db_output = widgets.Output()
display(db_output)

with db_output:
    # Get statistics from Pinecone
    print("Checking vector database for Wikipedia content...")
    try:
        # Get index statistics
        index_stats = pinecone_db.get_index_stats()
        print(
            f"Total vectors in the database: {index_stats.get('total_vector_count', 'N/A')}"
        )
        print(f"Dimensions: {index_stats.get('dimension', 'N/A')}")

        # Query for Wikipedia vectors specifically
        wikipedia_count = pinecone_db.count_vectors({"source_type": "wikipedia"})
        print(f"Wikipedia vectors in the database: {wikipedia_count}")

        # Calculate percentage
        if index_stats.get("total_vector_count", 0) > 0:
            wikipedia_percentage = (
                wikipedia_count / index_stats["total_vector_count"] * 100
            )
            print(
                f"Wikipedia content makes up {wikipedia_percentage:.2f}% of the vector database"
            )

            # Create a pie chart of content distribution if matplotlib is available
            try:
                import matplotlib.pyplot as plt

                # Create the pie chart data
                labels = ["Wikipedia Content", "Other Content"]
                sizes = [
                    wikipedia_count,
                    index_stats["total_vector_count"] - wikipedia_count,
                ]
                colors = ["#ff9999", "#66b3ff"]
                explode = (0.1, 0)  # explode the 1st slice (Wikipedia)

                # Plot the pie chart
                plt.figure(figsize=(8, 6))
                plt.pie(
                    sizes,
                    explode=explode,
                    labels=labels,
                    colors=colors,
                    autopct="%1.1f%%",
                    shadow=True,
                    startangle=90,
                )
                plt.axis(
                    "equal"
                )  # Equal aspect ratio ensures that pie is drawn as a circle
                plt.title("Vector Database Content Distribution")
                plt.show()
            except ImportError:
                print("\nNote: Install matplotlib to generate visualization")

        # Get landmarks with Wikipedia content
        all_landmark_ids = set()
        wikipedia_landmark_ids = set()

        # This is a simplified approach - in a real production environment,
        # you might need to paginate through all vectors
        sample_size = 1000
        vectors = pinecone_db.list_vectors(limit=sample_size, filter_dict=None)

        # Data for analysis
        landmark_vector_counts = {}
        wikipedia_vector_counts = {}

        for vector in vectors:
            if "landmark_id" in vector["metadata"]:
                landmark_id = vector["metadata"]["landmark_id"]
                all_landmark_ids.add(landmark_id)

                # Count vectors per landmark
                landmark_vector_counts[landmark_id] = (
                    landmark_vector_counts.get(landmark_id, 0) + 1
                )

                if vector["metadata"].get("source_type") == "wikipedia":
                    wikipedia_landmark_ids.add(landmark_id)

                    # Count wikipedia vectors per landmark
                    wikipedia_vector_counts[landmark_id] = (
                        wikipedia_vector_counts.get(landmark_id, 0) + 1
                    )

        print(f"\nCoverage analysis (based on sample of {len(vectors)} vectors):")
        print(f"Landmarks with any content: {len(all_landmark_ids)}")
        print(f"Landmarks with Wikipedia content: {len(wikipedia_landmark_ids)}")

        if len(all_landmark_ids) > 0:
            coverage_percentage = (
                len(wikipedia_landmark_ids) / len(all_landmark_ids) * 100
            )
            print(f"Landmark Wikipedia coverage: {coverage_percentage:.2f}%")

            # Detailed distribution analysis
            if wikipedia_vector_counts:
                print("\nWikipedia content distribution:")
                average_wiki_vectors = sum(wikipedia_vector_counts.values()) / len(
                    wikipedia_vector_counts
                )
                max_wiki_vectors = max(wikipedia_vector_counts.values())
                min_wiki_vectors = min(wikipedia_vector_counts.values())

                print(
                    f"- Average Wikipedia vectors per landmark: {average_wiki_vectors:.2f}"
                )
                print(
                    f"- Maximum Wikipedia vectors for a single landmark: {max_wiki_vectors}"
                )
                print(f"- Minimum Wikipedia vectors for a landmark: {min_wiki_vectors}")

                # Plot distribution if matplotlib is available
                try:
                    import matplotlib.pyplot as plt

                    # Get top landmarks by Wikipedia vector count
                    top_landmarks = sorted(
                        wikipedia_vector_counts.items(),
                        key=lambda x: x[1],
                        reverse=True,
                    )[:10]

                    if top_landmarks:
                        # Create bar chart for top landmarks
                        landmark_ids = [item[0] for item in top_landmarks]
                        vector_counts = [item[1] for item in top_landmarks]

                        plt.figure(figsize=(10, 6))
                        bars = plt.bar(range(len(landmark_ids)), vector_counts)
                        plt.xticks(range(len(landmark_ids)), landmark_ids, rotation=45)
                        plt.xlabel("Landmark ID")
                        plt.ylabel("Wikipedia Vector Count")
                        plt.title("Top Landmarks by Wikipedia Vector Count")

                        # Add count labels on top of bars
                        for bar in bars:
                            height = bar.get_height()
                            plt.text(
                                bar.get_x() + bar.get_width() / 2.0,
                                height,
                                f"{height}",
                                ha="center",
                                va="bottom",
                                rotation=0,
                            )

                        plt.tight_layout()
                        plt.show()
                except ImportError:
                    print("Note: Install matplotlib to generate visualizations")

    except Exception as e:
        print(f"Error retrieving database statistics: {str(e)}")

# Create a section for testing semantic search capabilities
print("\nTest Semantic Search Against Wikipedia Content")
print("==================================================")

# Create input field for custom queries
query_input = widgets.Text(
    value="Tell me about the history of NYC landmarks",
    placeholder="Type your query here...",
    description="Query:",
    disabled=False,
    style={"description_width": "initial"},
    layout=widgets.Layout(width="600px"),
)

# Create filter options
filter_options = widgets.Dropdown(
    options=[("All Wikipedia Content", "all"), ("Filter by Landmark ID", "landmark")],
    value="all",
    description="Filter:",
    disabled=False,
    layout=widgets.Layout(width="300px"),
)

# Create input for landmark ID filter
landmark_id_input = widgets.Text(
    value="",
    placeholder="Enter landmark ID to filter results",
    description="Landmark ID:",
    disabled=True,
    style={"description_width": "initial"},
    layout=widgets.Layout(width="400px"),
)

# Number of results to return
top_k_slider = widgets.IntSlider(
    value=5,
    min=1,
    max=20,
    step=1,
    description="Results to show:",
    disabled=False,
    continuous_update=False,
    orientation="horizontal",
    readout=True,
    readout_format="d",
    layout=widgets.Layout(width="400px"),
)

# Create button to run query
query_button = widgets.Button(
    description="Run Query",
    button_style="info",
    tooltip="Test a semantic query against Wikipedia content",
    icon="search",
)

# Output area for results
query_output = widgets.Output()

# Function to update landmark ID input based on filter selection


def update_landmark_filter(change):
    if change["name"] == "value":
        landmark_id_input.disabled = change["new"] != "landmark"


# Register the callback
filter_options.observe(update_landmark_filter, names="value")

# Function to run when button is clicked


def on_query_button_click(b):
    with query_output:
        clear_output()
        query_text = query_input.value
        top_k = top_k_slider.value

        print(f"Query: '{query_text}'")
        print(f"Retrieving top {top_k} results")

        # Prepare filter
        filter_dict = {"source_type": "wikipedia"}
        if filter_options.value == "landmark" and landmark_id_input.value.strip():
            filter_dict["landmark_id"] = landmark_id_input.value.strip()
            print(f"Filtering by landmark ID: {landmark_id_input.value.strip()}")

        # Generate embedding for the query
        try:
            query_embedding = embedding_generator.generate_embedding(query_text)

            # Query the vector database
            results = pinecone_db.query_vectors(
                query_embedding, top_k=top_k, filter_dict=filter_dict
            )

            print(f"\nFound {len(results)} matching results")

            # Display the results
            if results:
                # Create a DataFrame for better visualization
                results_data = []
                for i, match in enumerate(results):
                    results_data.append(
                        {
                            "Rank": i + 1,
                            "Score": f"{match['score']:.4f}",
                            "Landmark ID": match["metadata"].get(
                                "landmark_id", "Unknown"
                            ),
                            "Article Title": match["metadata"].get(
                                "article_title", "Unknown"
                            ),
                            "Text Snippet": match["metadata"].get("text", "")[:100]
                            + "...",
                        }
                    )

                results_df = pd.DataFrame(results_data)
                display(results_df)

                # Show detailed view of the first result
                if results_data:
                    print("\nFirst Match Details:")
                    match = results[0]
                    print(f"Score: {match['score']:.4f}")
                    print(
                        f"Landmark ID: {match['metadata'].get('landmark_id', 'Unknown')}"
                    )
                    print(
                        f"Article: {match['metadata'].get('article_title', 'Unknown')}"
                    )
                    print(f"URL: {match['metadata'].get('article_url', 'Unknown')}")
                    print("\nText content:")
                    print(match["metadata"].get("text", "")[:500] + "...")
            else:
                print("No Wikipedia content found matching the query")
        except Exception as e:
            print(f"Error performing query: {str(e)}")


# Register button click handler
query_button.on_click(on_query_button_click)

# Display the widgets
display(widgets.HTML("<h3>Semantic Search of Wikipedia Content</h3>"))
display(query_input)
display(widgets.HBox([filter_options, landmark_id_input]))
display(top_k_slider)
display(query_button)
display(query_output)

## 6. Summary Report

This section summarizes the Wikipedia integration tests and the current state of Wikipedia content in the vector database. The report covers:

- Connectivity to the CoreDataStore API
- The number of Wikipedia articles found for the sampled landmarks
- Vector database connectivity and the share of vectors that come from Wikipedia
- Embedding generation
- An overall assessment of the integration, followed by recommendations

Together, these checks show whether each stage of the integration works and where coverage could be improved.
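
The overall assessment in this report depends on just two counts, so it can also be expressed as pure functions that are easy to test outside the notebook. The sketch below is one way to do that. The names `assess_integration` and `wikipedia_share` and the status labels are hypothetical, and the example numbers are made up.

```python
def assess_integration(total_articles, wiki_vector_count):
    """Classify integration status from two counts (None means unavailable)."""
    if total_articles > 0 and wiki_vector_count:
        return "working"
    if total_articles > 0:
        return "partial"
    return "needs_attention"


def wikipedia_share(wiki_vector_count, total_vector_count):
    """Percentage of vectors that come from Wikipedia, guarding against zero."""
    if not total_vector_count:
        return 0.0
    return wiki_vector_count / total_vector_count * 100


print(assess_integration(12, 340), f"{wikipedia_share(340, 4000):.2f}%")
```

Keeping the decision logic separate from the printing makes it possible to check each branch without a live API or database connection.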

In [None]:
# Generate a summary report of Wikipedia integration

# Create a styled output area for the summary report
summary_output = widgets.Output()
display(widgets.HTML("<h3>Wikipedia Integration Summary Report</h3>"), summary_output)

with summary_output:
    print("Generating Wikipedia integration summary report...\n")

    # 1. API Connectivity
    print("1. API Connectivity")
    try:
        record_count = db_client.get_total_record_count()
        print(
            f"✅ Successfully connected to CoreDataStore API. Found {record_count} landmark records."
        )
    except Exception as e:
        print(f"❌ Error connecting to CoreDataStore API: {str(e)}")

    # 2. Wikipedia Data
    print("\n2. Wikipedia Data")
    total_articles = (
        sum(len(articles) for articles in landmark_articles.values())
        if landmark_articles
        else 0
    )
    print(
        f"✅ Found {total_articles} Wikipedia articles across {len(landmark_articles) if landmark_articles else 0} landmarks in the sample."
    )

    # 3. Vector Database Integration
    print("\n3. Vector Database Integration")
    try:
        # Get vector database stats
        index_stats = pinecone_db.get_index_stats()
        wiki_count = pinecone_db.count_vectors({"source_type": "wikipedia"})

        print("✅ Vector database connection established.")
        print(f"   - Total vectors: {index_stats.get('total_vector_count', 'N/A')}")
        print(
            f"   - Wikipedia vectors: {wiki_count} ({(wiki_count/index_stats['total_vector_count']*100) if index_stats.get('total_vector_count', 0) > 0 else 0:.2f}%)"
        )
    except Exception as e:
        print(f"❌ Error connecting to vector database: {str(e)}")

    # 4. Embedding Generation
    print("\n4. Embedding Generation")
    try:
        test_text = "This is a test for embedding generation."
        embedding = embedding_generator.generate_embedding(test_text)
        print(f"✅ Embedding generation is working. Dimension: {len(embedding)}")
    except Exception as e:
        print(f"❌ Error generating embeddings: {str(e)}")

    # 5. Overall Assessment
    print("\n5. Overall Assessment")

    # Let's make a simple assessment based on what we found
    if total_articles > 0 and "wiki_count" in locals() and wiki_count > 0:
        print("✅ Wikipedia integration is functioning correctly.")
        print("   - Wikipedia articles can be fetched from the API")
        print("   - Articles can be processed and chunked")
        print("   - Embeddings can be generated")
        print("   - Vectors can be stored in and retrieved from the database")
    elif total_articles > 0:
        print("⚠️ Wikipedia integration is partially working.")
        print("   - Wikipedia articles can be fetched from the API")
        print("   - More testing needed for processing, embedding, and storage")
    else:
        print("❌ Wikipedia integration needs attention.")
        print("   - No Wikipedia articles found or unable to access them")
        print("   - Review API connectivity and data availability")

    # 6. Recommendations
    print("\n6. Recommendations")
    if "wiki_count" in locals() and index_stats.get("total_vector_count", 0) > 0:
        wiki_percentage = wiki_count / index_stats["total_vector_count"] * 100
        if wiki_percentage < 10:
            print("→ Consider increasing Wikipedia coverage in the vector database")
        if total_articles > wiki_count:
            print("→ Many Wikipedia articles are not yet processed into vectors")
    else:
        print("→ Begin processing Wikipedia articles into the vector database")

    print("→ Run the end-to-end test on more landmarks to ensure robustness")
    print("→ Consider implementing automatic updates when Wikipedia content changes")

    print("\nSummary report complete.")