# Wikipedia Embeddings Visualization Notebook

This notebook provides an interactive way to visualize 3D vectors from the SQLite database, one cluster at a time, with page titles as annotations.

## Instructions:
1. Make sure you have run the wp-embeddings pipeline to generate embeddings and 3D projections
2. Select a namespace and cluster ID to visualize
3. Use the interactive controls to explore the data

## Available Commands:
- `python visualize_embeddings.py --list-clusters` - List available clusters
- `python visualize_embeddings.py --cluster <id> --limit <number>` - Visualize specific cluster
- `python visualize_embeddings.py --cluster <id> --interactive` - Interactive visualization with controls

In [None]:
import sys
from pathlib import Path

# Check if database exists
if not Path('chunk_log.db').exists():
    print("Error: Database file 'chunk_log.db' not found.")
    print("Make sure you have run the wp-embeddings pipeline first.")
    sys.exit(1)

# Import our visualization module
try:
    from visualize_embeddings import EmbeddingVisualizer
    print("✓ Successfully imported visualization module")
except ImportError:
    print("Error: Could not import visualization module. Make sure you're running this from the project root.")
    sys.exit(1)

In [None]:
# List available clusters
namespace = 'enwiki_namespace_0'  # Change this if you have other namespaces
visualizer = EmbeddingVisualizer(namespace)

print(f"\nAvailable clusters in namespace '{namespace}':")
clusters = visualizer.get_available_clusters()

if not clusters:
    print("No clusters with 3D projections found.")
    print("Make sure you have run the 'project' command to generate 3D vectors.")
else:
    for cluster in clusters:
        cluster_id, name, description, page_count = cluster
        name_display = name or "Unnamed"
        desc_display = f" - {description}" if description else ""
        print(f"  Cluster {cluster_id}: {name_display}{desc_display} ({page_count} pages)")

In [None]:
# Configuration - modify these values to explore different clusters
cluster_id = 1  # Change this to the cluster ID you want to visualize
limit = 50     # Number of points to display (reduce for faster loading)
show_labels = True  # Show page title labels

# Create visualizer and load data
visualizer = EmbeddingVisualizer(namespace, cluster_id, limit)
data, sqlconn = visualizer.load_cluster_data()

if data is None or data.empty:
    print(f"No 3D vector data found for cluster {cluster_id} in namespace '{namespace}'")
else:
    print(f"✓ Loaded {len(data)} pages from cluster {cluster_id}")
    print(f"✓ Data shape: {data.shape}")
    print(f"✓ Columns: {list(data.columns)}")
    
    # Show sample data
    print("\nSample data:")
    display(data.head())
    
    # Get cluster info
    cluster_info = visualizer.get_available_clusters(sqlconn)
    cluster_info = next((c for c in cluster_info if c[0] == cluster_id), None)
    
    if cluster_info:
        print("\nCluster Information:")
        print(f"  ID: {cluster_info[0]}")
        print(f"  Name: {cluster_info[1] or 'Unnamed'}")
        print(f"  Description: {cluster_info[2] or 'No description available'}")
        print(f"  Total Pages: {cluster_info[3]}")
        print(f"  Showing: {min(len(data), cluster_info[3])} pages (limit: {limit})")

In [None]:
# Create static 3D visualization
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (only needed on older matplotlib to register the '3d' projection)

if data is not None and not data.empty:
    fig = plt.figure(figsize=(12, 9))
    ax = fig.add_subplot(111, projection='3d')
    
    # Plot points
    scatter = ax.scatter(
        data['x'], data['y'], data['z'],
        c=data.index,
        cmap='viridis',
        alpha=0.6,
        s=20,
        edgecolors='w',
        linewidths=0.5
    )
    
    # Add vectors from origin to each point
    for row in data.itertuples(index=False):
        ax.plot(
            [0, row.x],
            [0, row.y],
            [0, row.z],
            color='gray',
            alpha=0.18,  # Make vectors more transparent than points
            linewidth=0.5
        )
    
    # Add labels if requested
    if show_labels:
        # Sample labels to avoid overcrowding
        label_step = max(1, len(data) // 20)  # Show max 20 labels
        for i in range(0, len(data), label_step):
            ax.text(
                data.iloc[i]['x'],
                data.iloc[i]['y'],
                data.iloc[i]['z'],
                data.iloc[i]['title'],
                fontsize=8,
                alpha=0.7,
                bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.7)
            )
    
    # Set labels and title
    ax.set_xlabel('X Dimension')
    ax.set_ylabel('Y Dimension')
    ax.set_zlabel('Z Dimension')
    
    # Fall back to a generic label when the cluster is missing or unnamed
    cluster_name = cluster_info[1] if cluster_info and cluster_info[1] else f"Cluster {cluster_id}"
    ax.set_title(
        f'3D Vector Visualization: {cluster_name}\n{len(data)} pages shown',
        fontsize=14,
        fontweight='bold'
    )
    
    # Add colorbar
    cbar = plt.colorbar(scatter, ax=ax, shrink=0.5, aspect=20)
    cbar.set_label('Page Index', rotation=270, labelpad=15)
    
    # Adjust view angle
    ax.view_init(elev=20, azim=45)
    
    plt.tight_layout()
    plt.show()

In [None]:
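# Interactive visualization (requires an interactive matplotlib backend)
# The static cell above switched this notebook to the inline backend, so select
# an interactive one first (for example `%matplotlib qt`, or `%matplotlib widget`
# if ipympl is installed). The plot then appears with interactive controls.

# Uncomment the line below to run the interactive visualization
# visualizer.plot_interactive()

print("To run the interactive visualization, uncomment the line above and re-run this cell.")
print("Note: with a GUI backend this opens a new window with interactive controls.")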

## Analysis Tips:

1. **Cluster Exploration**: Try different cluster IDs to see how different topics group together
2. **Density Analysis**: Look for dense regions vs. sparse regions in the 3D space
3. **Outlier Detection**: Points far from the main cluster may represent unusual or unique content
4. **Topic Patterns**: Similar page titles near each other suggest related topics
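5. **Vector Visualization**: The gray lines show the direction and magnitude of each projected 3D vector from the origin. These are the 3D projections, not the original high-dimensional embeddings, so treat distances as approximate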
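
The outlier and magnitude tips above can be made concrete with a small helper. The sketch below is an illustration, not part of `visualize_embeddings`: the function name `find_outliers` and the 2-standard-deviation threshold are assumptions, and the sample points are synthetic stand-ins for the `x`, `y`, `z` columns. It uses only the standard library and flags points whose distance from the cluster centroid is unusually large.

```python
import math
import statistics


def find_outliers(points, threshold=2.0):
    """Return indices of points far from the centroid.

    A point is flagged when its centroid distance lies more than
    `threshold` population standard deviations above the mean distance.
    `find_outliers` is a hypothetical helper, not part of the pipeline.
    """
    points = [tuple(map(float, p)) for p in points]
    if len(points) < 2:
        return []
    # Centroid: per-axis mean of all points
    centroid = tuple(statistics.fmean(axis) for axis in zip(*points))
    distances = [math.dist(p, centroid) for p in points]
    mean_d = statistics.fmean(distances)
    std_d = statistics.pstdev(distances)
    if std_d == 0:
        return []  # all points are equally far from the centroid
    return [i for i, d in enumerate(distances) if (d - mean_d) / std_d > threshold]


# Synthetic stand-in for one cluster: six points near the origin, one far away
sample = [
    (0.1, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.0, 0.1),
    (-0.1, 0.0, 0.0), (0.0, -0.1, 0.0), (0.0, 0.0, -0.1),
    (5.0, 5.0, 5.0),
]
outlier_indices = find_outliers(sample, threshold=2.0)

# Length of each gray line drawn from the origin in the static plot
magnitudes = [math.hypot(*p) for p in sample]
print("Outlier indices:", outlier_indices)
```

With the notebook's DataFrame, something like `find_outliers(data[['x', 'y', 'z']].itertuples(index=False))` should work, and `data.iloc[i]['title']` then gives the title of each flagged page. The threshold is a tuning knob: in small clusters a single extreme point can never exceed a z-score of roughly the square root of `n - 1`, so lower it when `limit` is small.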

## Next Steps:
- Try different cluster IDs to explore different topics
- Adjust the `limit` parameter to see more or fewer points
- Use the command-line tool (`python visualize_embeddings.py --cluster <id> --interactive`) for interactive exploration with mouse controls
- Consider exporting the data for further analysis in other tools
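
For the export step, a minimal sketch follows. It assumes the loaded data can be turned into a list of dictionaries with `title`, `x`, `y` and `z` keys (the columns the plotting cell uses); the helper name `export_points` and the output file name are hypothetical. It writes the points to CSV with the standard library so other tools can read them.

```python
import csv
import tempfile
from pathlib import Path


def export_points(rows, path, columns=("title", "x", "y", "z")):
    """Write an iterable of dict rows to a CSV file and return the row count.

    Keys not listed in `columns` are ignored. `export_points` is a
    hypothetical helper, not part of the wp-embeddings pipeline.
    """
    rows = list(rows)
    with Path(path).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(columns), extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Synthetic rows standing in for the notebook's cluster data
sample_rows = [
    {"title": "Example page A", "x": 0.1, "y": 0.2, "z": 0.3},
    {"title": "Example page B", "x": -0.4, "y": 0.5, "z": 0.0},
]
out_path = Path(tempfile.gettempdir()) / "cluster_export_example.csv"
written = export_points(sample_rows, out_path)
print(f"Wrote {written} rows to {out_path}")
```

In the notebook itself, `data.to_dict("records")` produces rows in this shape, and `data.to_csv(path, index=False)` is the shorter pandas-native alternative when no column filtering is needed.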