# NYC Landmarks Wikipedia Integration Testing

This notebook tests the integration of Wikipedia articles for NYC landmarks into the vector database. It demonstrates the process of:

1. Fetching landmark information from the CoreDataStore API
2. Retrieving associated Wikipedia articles
3. Processing article content (fetching, cleaning, chunking)
4. Generating embeddings for article chunks
5. Storing embeddings in Pinecone vector database
6. Querying the vector database to retrieve Wikipedia content
7. Analyzing the distribution and quality of Wikipedia content in the vector database

The notebook serves as both a testing tool and a demonstration of the Wikipedia integration capabilities.

## Environment Setup

First, let's set up our environment by verifying the Python installation and installing any required dependencies.

In [None]:
# Verify the Python installation. Each `!` command runs in its own subshell,
# so a shell alias would not persist to the next line; call python3 directly.
!python3 --version

# Check if the project is installed correctly
!pip list | grep nyc-landmarks-vector-db || echo "Project not installed - install with 'pip install -e .'"

In [None]:
# Install the project in development mode if not already installed
import os
import sys

# Check if we're in the right directory structure
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
print(f"Project root directory: {project_root}")

# Check for setup.py to confirm we're in the right place
setup_py_path = os.path.join(project_root, "setup.py")
if os.path.exists(setup_py_path):
    print("setup.py found, installing project in development mode...")
    # Use the kernel's own interpreter so the package lands in this environment
    !{sys.executable} -m pip install -e "{project_root}"
else:
    print(f"setup.py not found at {setup_py_path}, please check directory structure")

In [None]:
# Check for environment variables required by the project
import os

# Environment variables the project is expected to need
env_vars = [
    "OPENAI_API_KEY",  # For OpenAI embeddings
    "PINECONE_API_KEY",  # For Pinecone vector DB
    "PINECONE_ENVIRONMENT",  # Pinecone environment
    "PINECONE_INDEX_NAME",  # Pinecone index name
]

print("Checking environment variables:")
for var in env_vars:
    if var in os.environ:
        print(f"✓ {var} is set")
    else:
        print(f"✗ {var} is NOT set")

## Setup and Imports

Next, let's import the necessary modules, set up logging, and initialize the pipeline components.

In [None]:
import logging
import os
import sys
import time
from typing import Dict, List, Optional, Tuple

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import HTML, display

# Add project root to path to ensure imports work correctly
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))

from nyc_landmarks.config.settings import settings
from nyc_landmarks.db.coredatastore_api import CoreDataStoreAPI
from nyc_landmarks.db.wikipedia_fetcher import WikipediaFetcher
from nyc_landmarks.embeddings.generator import EmbeddingGenerator
from nyc_landmarks.models.wikipedia_models import (
    WikipediaArticleModel,
    WikipediaContentModel,
    WikipediaProcessingResult,
)
from nyc_landmarks.vectordb.pinecone_db import PineconeDB

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger()

# Initialize the components
api_client = CoreDataStoreAPI()
wiki_fetcher = WikipediaFetcher()
embedding_generator = EmbeddingGenerator()
# Assumes pinecone_db exports its wrapper class as PineconeDB; calling the
# lowercase `pinecone` name would fail if it refers to the client module.
pinecone_db = PineconeDB()

## 1. Exploring Landmark Data

Let's start by fetching a few landmarks from the CoreDataStore API and exploring the data structure.

In [None]:
# Fetch a small number of landmarks for exploration
landmarks = api_client.get_all_landmarks(limit=10)

# Create a DataFrame for easier viewing
landmarks_df = pd.DataFrame(landmarks)
landmarks_df

## 2. Retrieving Wikipedia Articles for Landmarks

Now let's check which landmarks have associated Wikipedia articles and examine their structure.

In [None]:
# Function to check and display Wikipedia articles for a landmark


def check_wikipedia_articles(landmark_id: str) -> List[WikipediaArticleModel]:
    """Check if a landmark has associated Wikipedia articles.

    Args:
        landmark_id: ID of the landmark to check

    Returns:
        List of WikipediaArticleModel objects
    """
    articles = api_client.get_wikipedia_articles(landmark_id)
    print(f"Found {len(articles)} Wikipedia articles for landmark: {landmark_id}")
    return articles


# Check Wikipedia articles for each landmark
landmark_articles = {}
for landmark in landmarks:
    landmark_id = landmark["id"]
    name = landmark["name"]
    print(f"Checking {name} ({landmark_id})...")
    articles = check_wikipedia_articles(landmark_id)
    if articles:
        landmark_articles[landmark_id] = articles
    print("-" * 40)

print(
    f"Found {len(landmark_articles)} landmarks with Wikipedia articles out of {len(landmarks)} total"
)

In [None]:
# Display the Wikipedia articles we found
if landmark_articles:
    # Extract landmark ID, name, article title, and URL into a list of dictionaries
    articles_data = []
    for landmark_id, articles in landmark_articles.items():
        landmark_name = next(
            (lm["name"] for lm in landmarks if lm["id"] == landmark_id), "Unknown"
        )
        for article in articles:
            articles_data.append(
                {
                    "landmark_id": landmark_id,
                    "landmark_name": landmark_name,
                    "article_title": article.title,
                    "article_url": article.url,
                }
            )

    # Create a DataFrame for easier viewing
    articles_df = pd.DataFrame(articles_data)
    # A bare expression inside an if block is not rendered, so display it explicitly
    display(articles_df)
else:
    print("No landmarks with Wikipedia articles found in the sample")

## 3. Fetching and Processing Wikipedia Content

Now let's fetch the actual content from a Wikipedia article and process it for embedding.
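
To make the `chunk_size` and `chunk_overlap` parameters used below concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration under assumptions, not the project's implementation: `chunk_text` and the `start_char`/`end_char` metadata keys are hypothetical, and `WikipediaFetcher.chunk_wikipedia_text` may split on tokens or sentences rather than characters.

```python
from typing import Dict, List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[Dict]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance the window advances each time
    chunks: List[Dict] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(
            {
                "chunk_index": len(chunks),
                "text": text[start:end],
                # Assumed metadata keys, chosen only for this illustration
                "metadata": {"start_char": start, "end_char": end},
            }
        )
        if end == len(text):
            break  # the current window already reached the end of the text
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
demo_chunks = chunk_text(sample, chunk_size=1000, chunk_overlap=200)
print([(c["metadata"]["start_char"], c["metadata"]["end_char"]) for c in demo_chunks])
# → [(0, 1000), (800, 1800), (1600, 2500)]
```

Each chunk repeats the last 200 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.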

In [None]:
# Select a landmark with Wikipedia articles for testing
test_landmark_id = next(iter(landmark_articles.keys())) if landmark_articles else None

if test_landmark_id:
    test_articles = landmark_articles[test_landmark_id]
    test_article = test_articles[0]  # Just take the first article for the test

    print(f"Testing with landmark {test_landmark_id}, article: {test_article.title}")

    # Fetch the article content
    print("Fetching Wikipedia content...")
    content = wiki_fetcher.fetch_wikipedia_content(test_article.url)

    if content:
        print(f"Successfully fetched Wikipedia content ({len(content)} chars)")
        print("\nPreview of the content:")
        print(content[:500] + "..." if len(content) > 500 else content)
    else:
        print("Failed to fetch Wikipedia content")
else:
    print("No landmarks with Wikipedia articles found for testing")

In [None]:
# Process the article content (chunking)
chunks = []  # Defined up front so later cells never hit a NameError
if test_landmark_id and content:
    print("Chunking Wikipedia content...")
    chunks = wiki_fetcher.chunk_wikipedia_text(
        content, chunk_size=1000, chunk_overlap=200
    )

    print(f"Split Wikipedia article into {len(chunks)} chunks")

    # Display the first chunk
    if chunks:
        print("\nFirst chunk:")
        print(f"Chunk index: {chunks[0]['chunk_index']}")
        print(f"Content: {chunks[0]['text'][:300]}...")
        print(f"Metadata: {chunks[0]['metadata']}")

    # Enhance chunks with article metadata
    for chunk in chunks:
        chunk["metadata"]["article_title"] = test_article.title
        chunk["metadata"]["article_url"] = test_article.url
        chunk["metadata"]["source_type"] = "wikipedia"
        chunk["metadata"]["landmark_id"] = test_landmark_id

    print("\nEnhanced first chunk metadata:")
    print(chunks[0]["metadata"] if chunks else "No chunks")

## 4. Generating Embeddings for Wikipedia Content

Now let's generate embeddings for the Wikipedia chunks using OpenAI's embedding model.

In [None]:
# Generate embeddings for the chunks
chunks_with_embeddings = []  # Defined up front so the storage cell can test it safely
if test_landmark_id and chunks:
    print("Generating embeddings for Wikipedia chunks...")

    # To avoid excessive API calls in testing, use at most the first two chunks
    test_chunks = chunks[:2]

    # Generate embeddings
    chunks_with_embeddings = embedding_generator.process_chunks(test_chunks)

    print(f"Generated embeddings for {len(chunks_with_embeddings)} chunks")

    # Check the structure of a processed chunk
    if chunks_with_embeddings:
        print("\nProcessed chunk keys:")
        print(chunks_with_embeddings[0].keys())

        print("\nEmbedding dimensions:")
        print(len(chunks_with_embeddings[0]["embedding"]))

        print("\nEmbedding preview (first 5 values):")
        print(chunks_with_embeddings[0]["embedding"][:5])

## 5. Storing Wikipedia Embeddings in Pinecone

Now let's store the embeddings in Pinecone with appropriate metadata.
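
The cell below builds its ID prefix with `title.replace(' ', '_')`, which leaves punctuation and accented characters in place. As far as we know, Pinecone expects record IDs to be ASCII strings, so a stricter normalization may be worth considering. The sketch below shows one way to do it; `make_vector_id` and the `-chunk-N` suffix are hypothetical and may not match what `store_chunks` appends to the prefix.

```python
import re
import unicodedata


def make_vector_id(landmark_id: str, article_title: str, chunk_index: int) -> str:
    """Build a deterministic, ASCII-only vector ID (hypothetical helper)."""
    # Decompose accented characters, then drop anything outside ASCII
    ascii_title = (
        unicodedata.normalize("NFKD", article_title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of other characters into a single underscore
    slug = re.sub(r"[^A-Za-z0-9_-]+", "_", ascii_title).strip("_")
    # Assumed suffix format; the real store may number chunks differently
    return f"wiki-{landmark_id}-{slug}-chunk-{chunk_index}"


print(make_vector_id("LP-00001", "Café Carlyle (New York)", 0))
# → wiki-LP-00001-Cafe_Carlyle_New_York-chunk-0
```

Because the ID depends only on the landmark, the title, and the chunk position, re-running the notebook overwrites the same vectors instead of creating duplicates.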

In [None]:
# Store embeddings in Pinecone
if test_landmark_id and chunks_with_embeddings:
    print("Storing Wikipedia embeddings in Pinecone...")

    # Store with deterministic IDs (with wiki- prefix to distinguish from PDF chunks)
    vector_ids = pinecone_db.store_chunks(
        chunks=chunks_with_embeddings,
        id_prefix=f"wiki-{test_landmark_id}-{test_article.title.replace(' ', '_')}-",
        landmark_id=test_landmark_id,
        use_fixed_ids=True,
        delete_existing=True,  # Delete existing vectors for this landmark/article
    )

    print(f"Stored {len(vector_ids)} vectors in Pinecone")
    print(f"Vector IDs: {vector_ids}")

## 6. Querying Wikipedia Content from Pinecone

Now let's test querying the vector database to retrieve Wikipedia content.
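
Conceptually, a filtered vector query does two things: it keeps only the records whose metadata matches the filter, then ranks the survivors by similarity to the query embedding. The dependency-free sketch below mimics that behavior on toy two-dimensional vectors. It is not how Pinecone is implemented: `query_local` is a hypothetical stand-in, cosine similarity is assumed as the metric (the real metric is whatever the index was created with), and the `"pdf"` source type is an assumed label for the non-Wikipedia chunks.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def query_local(
    query: Sequence[float],
    records: List[Dict],
    top_k: int = 3,
    filter_dict: Optional[Dict[str, str]] = None,
) -> List[Dict]:
    """Exact-match metadata filter followed by cosine ranking (toy stand-in)."""
    filter_dict = filter_dict or {}
    candidates = [
        r
        for r in records
        if all(r["metadata"].get(key) == value for key, value in filter_dict.items())
    ]
    scored = [
        {
            "id": r["id"],
            "score": cosine_similarity(query, r["values"]),
            "metadata": r["metadata"],
        }
        for r in candidates
    ]
    return sorted(scored, key=lambda m: m["score"], reverse=True)[:top_k]


toy_records = [
    {"id": "wiki-a-0", "values": [1.0, 0.0], "metadata": {"landmark_id": "a", "source_type": "wikipedia"}},
    {"id": "pdf-a-0", "values": [0.9, 0.1], "metadata": {"landmark_id": "a", "source_type": "pdf"}},
    {"id": "wiki-b-0", "values": [0.0, 1.0], "metadata": {"landmark_id": "b", "source_type": "wikipedia"}},
]
matches = query_local([1.0, 0.0], toy_records, filter_dict={"source_type": "wikipedia"})
print([m["id"] for m in matches])
# → ['wiki-a-0', 'wiki-b-0']
```

Adding a second key to `filter_dict` narrows the candidates further, which is why the combined filter in the next cell can return fewer matches than either single filter.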

In [None]:
# Create a test query
if test_landmark_id:
    print("Generating test query...")

    # Create a test query based on the landmark name
    landmark_name = next(
        (lm["name"] for lm in landmarks if lm["id"] == test_landmark_id), "landmark"
    )
    test_query = f"What is the history of {landmark_name}?"

    print(f"Test query: {test_query}")

    # Generate embedding for the query
    query_embedding = embedding_generator.generate_embedding(test_query)

    print("\nQuerying Pinecone...")

    # Query Pinecone with different filter options
    # 1. No filter
    results_no_filter = pinecone_db.query_vectors(query_embedding, top_k=3)

    # 2. Filter by landmark ID
    results_landmark_filter = pinecone_db.query_vectors(
        query_embedding, top_k=3, filter_dict={"landmark_id": test_landmark_id}
    )

    # 3. Filter by source type (wikipedia)
    results_wiki_filter = pinecone_db.query_vectors(
        query_embedding, top_k=3, filter_dict={"source_type": "wikipedia"}
    )

    # 4. Combined filter (landmark ID and source type)
    results_combined_filter = pinecone_db.query_vectors(
        query_embedding,
        top_k=3,
        filter_dict={"landmark_id": test_landmark_id, "source_type": "wikipedia"},
    )

    # Display the results
    print(f"\nQuery results with no filter: {len(results_no_filter)} matches")
    print(f"Query results with landmark filter: {len(results_landmark_filter)} matches")
    print(f"Query results with wiki filter: {len(results_wiki_filter)} matches")
    print(f"Query results with combined filter: {len(results_combined_filter)} matches")

In [None]:
# Display the Wikipedia query results


def display_query_results(results: List[Dict], title: str) -> None:
    """Display query results in a formatted way.

    Args:
        results: List of query result dictionaries
        title: Title for the results section
    """
    print(f"\n{title} ({len(results)} results)")
    print("-" * 80)

    for i, result in enumerate(results):
        print(f"Result {i+1} - Score: {result['score']:.4f}")
        print(f"Source: {result['metadata'].get('source_type', 'unknown')}")

        # Display article info if available
        if "article_title" in result["metadata"]:
            print(f"Article: {result['metadata']['article_title']}")

        # Display landmark info
        print(f"Landmark: {result['metadata'].get('landmark_id', 'unknown')}")

        # Display text content (truncated for clarity)
        text = result["metadata"].get("text", "")
        if text:
            preview = text[:300] + "..." if len(text) > 300 else text
            print(f"\nContent: {preview}")

        print("-" * 80)


# Display the combined filter results (most relevant for our test)
if "results_combined_filter" in locals() and results_combined_filter:
    display_query_results(
        results_combined_filter, "Wikipedia Results (Landmark + Wiki Filter)"
    )
elif "results_wiki_filter" in locals() and results_wiki_filter:
    display_query_results(results_wiki_filter, "Wikipedia Results (Wiki Filter Only)")
else:
    print("No Wikipedia results to display")

## 7. End-to-End Wikipedia Processing Test

Now let's test the complete flow using the `process_landmark_wikipedia` function from the `scripts.process_wikipedia_articles` script.

In [None]:
# Import the processing function from our script
from scripts.process_wikipedia_articles import process_landmark_wikipedia

# Process Wikipedia articles for a specific landmark
if test_landmark_id:
    print(f"Processing Wikipedia articles for landmark: {test_landmark_id}")

    # Process the landmark's Wikipedia articles
    result = process_landmark_wikipedia(
        landmark_id=test_landmark_id,
        chunk_size=1000,
        chunk_overlap=200,
        recreate_index=False,
        delete_existing=True,
    )

    if result:
        print(f"\nProcessing summary: {result}")
    else:
        print("Processing failed")

## 8. Analyzing Wikipedia Coverage and Distribution

Let's analyze the coverage of Wikipedia articles across the landmarks in our database.
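
The coverage scan below makes one API request per landmark and pauses 0.5 seconds between calls to avoid rate limiting. If a request still fails transiently, retrying with exponential backoff is a common complement to the fixed pause. The sketch below is one possible shape for that; `call_with_backoff` is a hypothetical helper, and which exceptions the CoreDataStore client raises on throttling is an assumption to verify before use.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying failures with exponentially growing pauses."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:  # Assumption: narrow this to the client's transient errors
            if attempt == max_attempts:
                raise  # out of retries, surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # 0.5, 1.0, 2.0, ...
            sleep(delay + random.uniform(0, delay / 10))  # small random jitter
    raise AssertionError("unreachable")


# Demo with a function that fails twice and then succeeds; the pauses are
# recorded in a list instead of actually sleeping.
calls = {"count": 0}
delays = []


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return ["article"]


demo_result = call_with_backoff(flaky_fetch, sleep=delays.append)
print(demo_result, calls["count"], len(delays))
# → ['article'] 3 2
```

In the scan itself, the call could be wrapped as `call_with_backoff(lambda: api_client.get_wikipedia_articles(landmark_id))`.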

In [None]:
# Fetch a larger set of landmarks for analysis
print("Fetching landmarks for analysis...")
all_landmarks = api_client.get_all_landmarks(limit=50)  # Adjust limit as needed
print(f"Fetched {len(all_landmarks)} landmarks")

# Get landmarks with Wikipedia articles
print("\nChecking for Wikipedia articles...")
landmarks_with_wikipedia = {}
for landmark in all_landmarks:
    landmark_id = landmark["id"]
    articles = api_client.get_wikipedia_articles(landmark_id)
    if articles:
        landmarks_with_wikipedia[landmark_id] = articles
    time.sleep(0.5)  # Add a small delay to avoid rate limiting

print(
    f"Found {len(landmarks_with_wikipedia)} landmarks with Wikipedia articles out of {len(all_landmarks)} total"
)

# Calculate coverage percentage
coverage_percentage = (
    (len(landmarks_with_wikipedia) / len(all_landmarks)) * 100 if all_landmarks else 0
)
print(f"Wikipedia coverage: {coverage_percentage:.2f}%")

In [None]:
# Analyze Wikipedia articles per landmark
if landmarks_with_wikipedia:
    articles_per_landmark = [
        len(articles) for articles in landmarks_with_wikipedia.values()
    ]

    # Calculate statistics
    avg_articles = sum(articles_per_landmark) / len(articles_per_landmark)
    max_articles = max(articles_per_landmark)
    min_articles = min(articles_per_landmark)

    print("Articles per landmark statistics:")
    print(f"Average: {avg_articles:.2f}")
    print(f"Maximum: {max_articles}")
    print(f"Minimum: {min_articles}")

    # Create a distribution histogram
    plt.figure(figsize=(10, 6))
    plt.hist(
        articles_per_landmark,
        bins=range(1, max_articles + 2),
        alpha=0.7,
        color="skyblue",
        edgecolor="black",
    )
    plt.xlabel("Number of Wikipedia Articles")
    plt.ylabel("Number of Landmarks")
    plt.title("Distribution of Wikipedia Articles per Landmark")
    plt.grid(axis="y", alpha=0.75)
    plt.xticks(range(1, max_articles + 1))
    plt.show()

In [None]:
# Analyze landmark attributes that might correlate with Wikipedia coverage
# Create a DataFrame with landmark details and Wikipedia status
landmark_analysis_data = []
for landmark in all_landmarks:
    landmark_id = landmark["id"]
    has_wikipedia = landmark_id in landmarks_with_wikipedia
    wikipedia_articles_count = len(landmarks_with_wikipedia.get(landmark_id, []))

    landmark_analysis_data.append(
        {
            "id": landmark_id,
            "name": landmark["name"],
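            # Fall back to a placeholder so a missing field does not abort the analysis
            "borough": landmark.get("borough", "Unknown"),
            "type": landmark.get("type", "Unknown"),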
            "has_wikipedia": has_wikipedia,
            "wikipedia_articles_count": wikipedia_articles_count,
        }
    )

landmark_analysis_df = pd.DataFrame(landmark_analysis_data)

# Display the DataFrame
landmark_analysis_df.head()

In [None]:
# Analyze Wikipedia coverage by borough
if not landmark_analysis_df.empty:
    # Group by borough and calculate percentage with Wikipedia
    borough_analysis = (
        landmark_analysis_df.groupby("borough")
        .agg(
            {
                "has_wikipedia": "mean",  # Mean of True/False gives the covered fraction (0 to 1)
                "id": "count",  # Count total landmarks in borough
                "wikipedia_articles_count": "sum",  # Total Wikipedia articles
            }
        )
        .reset_index()
    )

    # Rename columns for clarity
    borough_analysis.columns = [
        "Borough",
        "Wikipedia Coverage",
        "Total Landmarks",
        "Total Wikipedia Articles",
    ]

    # Convert coverage to percentage
    borough_analysis["Wikipedia Coverage"] = (
        borough_analysis["Wikipedia Coverage"] * 100
    )

    # Calculate articles per landmark
    borough_analysis["Articles per Landmark"] = (
        borough_analysis["Total Wikipedia Articles"]
        / borough_analysis["Total Landmarks"]
    )

    # Sort by coverage percentage
    borough_analysis = borough_analysis.sort_values(
        "Wikipedia Coverage", ascending=False
    )

    # Display the analysis (a bare expression inside an if block is not rendered)
    display(borough_analysis)

In [None]:
# Visualize Wikipedia coverage by borough
if "borough_analysis" in locals() and not borough_analysis.empty:
    plt.figure(figsize=(12, 6))

    # Create bar chart
    bars = plt.bar(
        borough_analysis["Borough"],
        borough_analysis["Wikipedia Coverage"],
        color="skyblue",
        edgecolor="black",
    )

    # Add data labels
    for bar in bars:
        height = bar.get_height()
        plt.text(
            bar.get_x() + bar.get_width() / 2.0,
            height,
            f"{height:.1f}%",
            ha="center",
            va="bottom",
            rotation=0,
        )

    plt.xlabel("Borough")
    plt.ylabel("Wikipedia Coverage (%)")
    plt.title("Wikipedia Coverage by Borough")
    plt.grid(axis="y", alpha=0.3)
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

## 9. Summary and Conclusions

This notebook has demonstrated the process of integrating Wikipedia articles for NYC landmarks into the vector database. We've seen how to:

1. Fetch landmark information and associated Wikipedia articles
2. Process Wikipedia content by fetching, cleaning, and chunking
3. Generate embeddings for the article chunks
4. Store the embeddings in Pinecone with appropriate metadata
5. Query the vector database to retrieve Wikipedia content
6. Analyze the coverage and distribution of Wikipedia articles across the landmarks

The integration extends the existing PDF-based vector database with Wikipedia content, giving the vector search and chat functionality additional context to draw on.