# GPU-Accelerated Embedding Visualization with RAPIDS

This notebook loads pre-extracted embeddings and visualizes them using **RAPIDS cuML** for GPU-accelerated UMAP/t-SNE dimensionality reduction and **Plotly** for interactive 2D/3D visualizations.

## 📊 Overview
- **Data Source**: Pre-extracted embeddings from `embeddings_output/` (parquet files)
- **GPU Acceleration**: RAPIDS cuDF + cuML, typically 10-100x faster than CPU-based methods on large datasets
- **Algorithms**: cuML UMAP and t-SNE (GPU-accelerated)
- **Visualization**: Interactive 2D/3D Plotly scatter plots
- **Labels**: Flexible labeling from dataset metadata (split names) or none; cuBERT clustering is planned

## 🎯 Features
- ✅ **GPU-Accelerated**: cuML UMAP/t-SNE runs on the GPU
- ✅ **Memory Efficient**: cuDF for GPU DataFrame operations
- ✅ **Scalable**: Handles millions of embeddings
- ✅ **Interactive**: Plotly 2D/3D visualizations with hover details
- ✅ **Flexible Labels**: Metadata labels (split names) or unlabeled; cuBERT clustering is planned for a future version
- ✅ **Export**: Save to HTML for easy sharing

## 🔗 Reference
Based on [RAPIDS cuBERT Topic Modelling](https://github.com/rapidsai/rapids-examples/tree/main/cuBERT_topic_modelling)

---


## 🔧 Setup: Install RAPIDS and Dependencies


In [1]:
# Install RAPIDS and required packages
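# Note: RAPIDS wheels are built per CUDA major version. See: https://rapids.ai/start.html
# The package suffix must match your CUDA version, e.g. for CUDA 13.x:
# !pip install cudf-cu13 cuml-cu13 --extra-index-url=https://pypi.nvidia.com
# (for CUDA 12.x, use cudf-cu12 and cuml-cu12 instead)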

# Core dependencies (non-RAPIDS fallback available)
%pip install numpy pandas plotly tqdm pyarrow -q

# Check if RAPIDS is available
try:
    import cudf
    import cuml
    RAPIDS_AVAILABLE = True
    print("✅ RAPIDS (cuDF, cuML) is available - GPU acceleration enabled!")
except ImportError:
    RAPIDS_AVAILABLE = False
    print("⚠️  RAPIDS not available - falling back to CPU (numpy/sklearn)")
    print("   To install RAPIDS (match the -cuXX suffix to your CUDA version): pip install cudf-cu13 cuml-cu13 --extra-index-url=https://pypi.nvidia.com")

Note: you may need to restart the kernel to use updated packages.
✅ RAPIDS (cuDF, cuML) is available - GPU acceleration enabled!


In [2]:
import os
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from pathlib import Path
from tqdm.auto import tqdm
from typing import List, Dict, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

# RAPIDS imports (with CPU fallback)
if RAPIDS_AVAILABLE:
    import cudf
    import cupy as cp
    from cuml.manifold import UMAP as cumlUMAP
    from cuml.manifold import TSNE as cumlTSNE
    print("✅ RAPIDS imports successful (cuDF, cuML UMAP/TSNE)")
else:
    # CPU fallback
    try:
        from sklearn.manifold import TSNE as sklearnTSNE
        import umap as cpuUMAP
        print("✅ CPU fallback imports successful (sklearn TSNE, umap-learn)")
    except ImportError:
        print("⚠️  Installing CPU fallback packages...")
        import subprocess
        import sys
        # Use the kernel's own interpreter so the packages land in the active environment
        subprocess.run(
            [sys.executable, "-m", "pip", "install", "umap-learn", "scikit-learn", "-q"],
            check=True,
        )
        from sklearn.manifold import TSNE as sklearnTSNE
        import umap as cpuUMAP

# Configure Plotly for notebook rendering
import plotly.io as pio
pio.renderers.default = "notebook_connected"

print(f"\n🖥️  Using: {'GPU (RAPIDS cuML)' if RAPIDS_AVAILABLE else 'CPU (sklearn/umap-learn)'}")
print("✅ All imports successful!")


✅ RAPIDS imports successful (cuDF, cuML UMAP/TSNE)

🖥️  Using: GPU (RAPIDS cuML)
✅ All imports successful!


## ⚙️ Configuration

Configure paths, dataset selection, and visualization parameters.

### 📂 Path Options

| Option | When to Use | How to Configure |
|--------|-------------|------------------|
| **1. Direct Paths** | Simple, one-time setup | Edit paths directly in the cell below |
| **2. Relative Paths** | Portable, stays with notebook | Uncomment the relative path block |
| **3. Environment Variables** | CI/CD, shared environments | Set `EMBEDDINGS_DIR`, `DATASETS_DIR`, `OUTPUT_DIR` in shell |
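
As a rough illustration of Option 3, each directory can be read from its environment variable with a fallback default. This is only a sketch: the helper name `resolve_dir` is hypothetical and not part of the notebook, while the variable names are the ones listed in the table above.

```python
import os
from pathlib import Path


def resolve_dir(env_var: str, default: str) -> Path:
    """Return the path stored in env_var, or default when it is unset or empty."""
    # `or` also covers the case where the variable is set to an empty string.
    return Path(os.environ.get(env_var) or default)


EMBEDDINGS_DIR = resolve_dir("EMBEDDINGS_DIR", "./embeddings")
DATASETS_DIR = resolve_dir("DATASETS_DIR", "./datasets")
OUTPUT_DIR = resolve_dir("OUTPUT_DIR", "./outputs")
```

Treating an empty value the same as an unset one is a design choice of this sketch; it avoids silently resolving to the current directory when a variable is exported but left blank.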

### 📊 Dataset/Split Selection (similar to `extract_embeddings_parallel_shards.py`)

| Option | Example | Description |
|--------|---------|-------------|
| **A. Load All** | `LOAD_ALL = True` | Load all available embeddings |
| **B. Specific** | `["v1:chat", "v2:math"]` | Select exact dataset:split combinations |
| **C. By Dataset** | `["v1:*", "v2:*"]` | All splits from specified datasets |
| **D. By Split** | `["*:chat", "*:code"]` | Same split across all datasets |
| **E. Mixed** | `["v1:*", "*:safety"]` | Combine patterns |
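
The wildcard semantics in the table can be sketched in a few lines. This is a simplified, standalone illustration rather than the notebook's own implementation; the function name `matches` and the `available` list are made up for the example.

```python
from typing import List


def matches(dataset: str, split: str, patterns: List[str]) -> bool:
    """Return True if dataset:split matches any pattern; "*" matches anything."""
    for pattern in patterns:
        if ":" in pattern:
            pat_dataset, pat_split = pattern.split(":", 1)
        else:
            # Assumption: a pattern without ":" selects a whole dataset.
            pat_dataset, pat_split = pattern, "*"
        if pat_dataset in ("*", dataset) and pat_split in ("*", split):
            return True
    return False


# Hypothetical dataset/split combinations found on disk.
available = [("v1", "chat"), ("v1", "code"), ("v2", "chat"), ("llama-sft", "safety")]
patterns = ["v1:*", "*:safety"]  # Option E: mixed patterns
selected = [f"{d}:{s}" for d, s in available if matches(d, s, patterns)]
print(selected)  # ['v1:chat', 'v1:code', 'llama-sft:safety']
```

Here `v2:chat` is skipped because neither pattern covers it, while `llama-sft:safety` is picked up by the `*:safety` pattern.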

### Quick Edit:
```python
# Paths
EMBEDDINGS_DIR = Path("/your/path/to/embeddings")
DATASETS_DIR = Path("/your/path/to/datasets")  
OUTPUT_DIR = Path("/your/path/to/outputs")

# Selection
LOAD_ALL = False
SELECTED_SPLITS = ["v1:chat", "v1:code", "llama-sft:*"]
```


In [3]:
# =============================================================================
# CONFIGURATION - EDIT PATHS HERE
# =============================================================================

# ┌─────────────────────────────────────────────────────────────────────────────┐
# │ OPTION 1: Direct Paths (Recommended)                                        │
# │ Simply set absolute paths to your data directories                          │
# └─────────────────────────────────────────────────────────────────────────────┘

EMBEDDINGS_DIR = Path("/raid/embeddings")
DATASETS_DIR = Path("/raid/datasets")
OUTPUT_DIR = Path("/raid/outputs")

# ┌─────────────────────────────────────────────────────────────────────────────┐
# │ OPTION 2: Relative Paths (uncomment to use)                                 │
# │ Paths relative to notebook location                                         │
# └─────────────────────────────────────────────────────────────────────────────┘
# SCRIPT_DIR = Path(".").absolute()
# EMBEDDINGS_DIR = SCRIPT_DIR / "embeddings"
# DATASETS_DIR = SCRIPT_DIR / "datasets"
# OUTPUT_DIR = SCRIPT_DIR / "outputs"

# ┌─────────────────────────────────────────────────────────────────────────────┐
# │ OPTION 3: Environment Variables (uncomment to use)                          │
# │ Set via: export EMBEDDINGS_DIR=/path/to/embeddings                          │
# └─────────────────────────────────────────────────────────────────────────────┘
# EMBEDDINGS_DIR = Path(os.environ.get("EMBEDDINGS_DIR", "./embeddings"))
# DATASETS_DIR = Path(os.environ.get("DATASETS_DIR", "./datasets"))
# OUTPUT_DIR = Path(os.environ.get("OUTPUT_DIR", "./outputs"))

# =============================================================================
# DATASET / SPLIT SELECTION (similar to extract_embeddings_parallel_shards.py)
# =============================================================================
# Format: "dataset:split" - specify exactly which data to visualize
#
# OPTION A: Load ALL available embeddings
# LOAD_ALL = True
# SELECTED_SPLITS = []  # Ignored when LOAD_ALL = True

# OPTION B: Select specific dataset:split combinations (set LOAD_ALL = False)
# LOAD_ALL = False
# SELECTED_SPLITS = [
#     "v1:chat",
#     "v1:code", 
#     "v1:math",
#     "v2:stem",
#     "llama-sft:safety",
#     "llama-sft:science",
# ]

# OPTION C: Select by dataset only (all splits from those datasets)
# LOAD_ALL = False
# SELECTED_SPLITS = ["v1:*", "v2:*"]  # All splits from v1 and v2

# OPTION D: Select by split only (same split across all datasets)
# LOAD_ALL = False  
# SELECTED_SPLITS = ["*:chat", "*:code"]  # All chat and code splits

# OPTION E: Mix and match
# LOAD_ALL = False
# SELECTED_SPLITS = [
#     "v1:*",           # All v1 splits
#     "v2:chat",        # Only v2 chat
#     "*:safety",       # Safety from all datasets
#     "llama-sft:code", # Specific combination
# ]

LOAD_ALL = False
SELECTED_SPLITS = [
    # "llama-sft:chat", 
    # "llama-sft:code",
    # "llama-sft:math",
    # "llama-sft:science",
    # "llama-sft:safety",
    # "llama-sft:stem",
    # "llama-sft:tool_calling",
    "v2:chat",
    "v2:code",
    "v2:math",
    "v2:stem",
    
]  # Ignored when LOAD_ALL = True
# =============================================================================
# Validate paths exist
# =============================================================================
print("📂 Path Configuration:")
print(f"   EMBEDDINGS_DIR: {EMBEDDINGS_DIR}")
print(f"   DATASETS_DIR:   {DATASETS_DIR}")
print(f"   OUTPUT_DIR:     {OUTPUT_DIR}")

# Check if paths exist
if not EMBEDDINGS_DIR.exists():
    print("   ⚠️  Warning: EMBEDDINGS_DIR does not exist!")
if not DATASETS_DIR.exists():
    print("   ⚠️  Warning: DATASETS_DIR does not exist!")

# Create output directory
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
print("   ✅ OUTPUT_DIR created/verified")

# Show selection mode
print("\n📊 Data Selection:")
if LOAD_ALL:
    print("   Mode: Load ALL available embeddings")
else:
    print("   Mode: Selected splits only")
    for s in SELECTED_SPLITS:
        print(f"      • {s}")

# =============================================================================
# VISUALIZATION SETTINGS
# =============================================================================

SAMPLE_SIZE = None  # None = use all data, or set a number like 100000 for testing
RANDOM_SEED = 42

# Dimensionality reduction settings
REDUCTION_METHOD = "umap"  # "umap" or "tsne"
N_COMPONENTS_2D = 2
N_COMPONENTS_3D = 3

# UMAP parameters (GPU-optimized)
UMAP_N_NEIGHBORS = 15
UMAP_MIN_DIST = 0.1
UMAP_METRIC = "cosine"

# t-SNE parameters (GPU-optimized)
TSNE_PERPLEXITY = 30
TSNE_LEARNING_RATE = 200

# Color scheme for categories
CATEGORY_COLORS = {
    'chat': '#FF6B6B',
    'code': '#4ECDC4', 
    'math': '#45B7D1',
    'stem': '#96CEB4',
    'tool_calling': '#FFEAA7',
    'science': '#DDA0DD',
    'safety': '#FF7F50',
    # 'multilingual_ja': '#9B59B6',
    # 'multilingual_de': '#3498DB',
    # 'multilingual_it': '#E74C3C',
    # 'multilingual_es': '#F39C12',
    # 'multilingual_fr': '#1ABC9C',
    'unknown': '#95A5A6'
}

print("\n⚙️  Visualization Settings:")
print(f"   Reduction method: {REDUCTION_METHOD}")
print(f"   Sample size: {'All data' if SAMPLE_SIZE is None else f'{SAMPLE_SIZE:,}'}")
print(f"   UMAP: n_neighbors={UMAP_N_NEIGHBORS}, min_dist={UMAP_MIN_DIST}, metric={UMAP_METRIC}")
print(f"   t-SNE: perplexity={TSNE_PERPLEXITY}, learning_rate={TSNE_LEARNING_RATE}")

📂 Path Configuration:
   EMBEDDINGS_DIR: /raid/embeddings
   DATASETS_DIR:   /raid/datasets
   OUTPUT_DIR:     /raid/outputs
   ✅ OUTPUT_DIR created/verified

📊 Data Selection:
   Mode: Selected splits only
      • v2:chat
      • v2:code
      • v2:math
      • v2:stem

⚙️  Visualization Settings:
   Reduction method: umap
   Sample size: All data
   UMAP: n_neighbors=15, min_dist=0.1, metric=cosine
   t-SNE: perplexity=30, learning_rate=200


## 📥 Load Extracted Embeddings

Load pre-extracted embeddings from the parquet files under `EMBEDDINGS_DIR`, laid out as `<dataset>/<split>/*.parquet`.
These files were generated by `extract_embeddings_parallel_shards.py`.
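
Shard files follow a `<dataset>-<split>-<idx>-of-<total>.parquet` naming pattern such as `v1-chat-00000-of-00001.parquet`. The sketch below shows one way to pull the shard index and shard count out of such a name; `parse_shard_name` is a hypothetical helper, and the second and third file names are illustrative rather than taken from a real directory listing.

```python
from typing import Tuple


def parse_shard_name(stem: str) -> Tuple[int, int]:
    """Return (shard_idx, total_shards) from a stem like 'v1-chat-00000-of-00001'."""
    parts = stem.split("-")
    # Expect a trailing "<idx>-of-<total>" triple; otherwise assume a single shard.
    if (
        len(parts) >= 3
        and parts[-2] == "of"
        and parts[-3].isdigit()
        and parts[-1].isdigit()
    ):
        return int(parts[-3]), int(parts[-1])
    return 0, 1


print(parse_shard_name("v1-chat-00000-of-00001"))  # (0, 1)
print(parse_shard_name("v2-chat-00007-of-00012"))  # (7, 12)
print(parse_shard_name("embeddings"))              # (0, 1)
```

Anchoring on the last three dash-separated parts keeps the parse working for dataset names that themselves contain dashes, such as `llama-sft`.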


In [4]:
def matches_selection(dataset: str, split: str, selected_splits: List[str]) -> bool:
    """
    Check if a dataset:split combination matches any of the selection patterns.
    
    Supports wildcards:
    - "v1:chat"     - exact match
    - "v1:*"        - all splits from v1
    - "*:chat"      - chat split from all datasets
    - "*:*"         - everything (same as LOAD_ALL=True)
    
    Args:
        dataset: Dataset name (e.g., "v1", "llama-sft")
        split: Split name (e.g., "chat", "code")
        selected_splits: List of selection patterns
    
    Returns:
        True if matches any pattern
    """
    for pattern in selected_splits:
        if ':' not in pattern:
            # Treat as dataset-only pattern
            if pattern == dataset or pattern == '*':
                return True
            continue
        
        pat_dataset, pat_split = pattern.split(':', 1)
        
        # Check dataset match
        dataset_match = (pat_dataset == '*' or pat_dataset == dataset)
        
        # Check split match
        split_match = (pat_split == '*' or pat_split == split)
        
        if dataset_match and split_match:
            return True
    
    return False


def discover_embedding_files(
    embeddings_dir: Path,
    load_all: bool = True,
    selected_splits: Optional[List[str]] = None
) -> List[Dict]:
    """
    Discover parquet embedding files with optional filtering.
    
    Args:
        embeddings_dir: Path to embeddings directory
        load_all: If True, load all available embeddings
        selected_splits: List of "dataset:split" patterns to filter
                        Supports wildcards: "v1:*", "*:chat", "v1:chat"
    
    Returns:
        List of dicts with: dataset, split, shard_idx, total_shards, filepath
    """
    files = []
    
    if not embeddings_dir.exists():
        print(f"⚠️  Embeddings directory not found: {embeddings_dir}")
        print("   Run extract_embeddings_parallel_shards.py first to generate embeddings.")
        return files
    
    if selected_splits is None:
        selected_splits = []
    
    # Track what we find vs what was requested
    found_combinations = set()
    
    # Walk through embeddings directory
    for dataset_dir in sorted(embeddings_dir.iterdir()):
        if not dataset_dir.is_dir():
            continue
        
        dataset_name = dataset_dir.name
        
        for split_dir in sorted(dataset_dir.iterdir()):
            if not split_dir.is_dir():
                continue
            
            split_name = split_dir.name
            
            # Check if this combination should be included
            if not load_all and selected_splits:
                if not matches_selection(dataset_name, split_name, selected_splits):
                    continue
            
            found_combinations.add(f"{dataset_name}:{split_name}")
            
            # Find all parquet files
            parquet_files = sorted(split_dir.glob("*.parquet"))
            
            for pq_file in parquet_files:
                # Parse filename: v1-chat-00000-of-00001.parquet
                parts = pq_file.stem.split("-")
                if len(parts) >= 4 and "of" in parts:
                    of_idx = parts.index("of")
                    shard_idx = int(parts[of_idx - 1])
                    total_shards = int(parts[of_idx + 1])
                else:
                    shard_idx = 0
                    total_shards = 1
                
                files.append({
                    'dataset': dataset_name,
                    'split': split_name,
                    'shard_idx': shard_idx,
                    'total_shards': total_shards,
                    'filepath': pq_file
                })
    
    # Show what was found
    if not load_all and selected_splits:
        print("\n🎯 Selection filter active:")
        for pattern in selected_splits:
            print(f"   • {pattern}")
        print(f"\n   Matched {len(found_combinations)} dataset:split combination(s)")
    
    return files


def load_embeddings_from_parquet(
    file_infos: List[Dict],
    sample_size: Optional[int] = None,
    random_seed: int = 42
) -> Tuple[np.ndarray, pd.DataFrame]:
    """
    Load embeddings from parquet files into numpy array and metadata DataFrame.
    
    Args:
        file_infos: List of file info dicts from discover_embedding_files()
        sample_size: Optional limit on total samples to load
        random_seed: Random seed for sampling
    
    Returns:
        Tuple of (embeddings_array, metadata_df)
    """
    all_embeddings = []
    all_metadata = []
    
    np.random.seed(random_seed)
    
    print(f"📂 Loading embeddings from {len(file_infos)} parquet file(s)...")
    
    for file_info in tqdm(file_infos, desc="Loading files"):
        filepath = file_info['filepath']
        
        try:
            # Load parquet file
            if RAPIDS_AVAILABLE:
                df = cudf.read_parquet(str(filepath))
                # Convert embeddings column to numpy
                embeddings = df['embeddings'].to_pandas().tolist()
                indices = df['original_index'].to_pandas().tolist() if 'original_index' in df.columns else list(range(len(df)))
            else:
                df = pd.read_parquet(str(filepath))
                embeddings = df['embeddings'].tolist()
                indices = df['original_index'].tolist() if 'original_index' in df.columns else list(range(len(df)))
            
            # Add embeddings and metadata
            for emb, idx in zip(embeddings, indices):
                all_embeddings.append(emb)
                all_metadata.append({
                    'dataset': file_info['dataset'],
                    'split': file_info['split'],
                    'shard_idx': file_info['shard_idx'],
                    'original_index': idx,
                    'label': file_info['split']  # Default label = split name
                })
                
        except Exception as e:
            print(f"   ⚠️  Error loading {filepath.name}: {e}")
            continue
    
    if not all_embeddings:
        raise ValueError("No embeddings loaded! Check the embeddings directory.")
    
    # Convert to numpy array
    embeddings_array = np.array(all_embeddings, dtype=np.float32)
    metadata_df = pd.DataFrame(all_metadata)
    
    # Sample if requested
    if sample_size is not None and sample_size < len(embeddings_array):
        print(f"   📊 Sampling {sample_size:,} from {len(embeddings_array):,} embeddings...")
        indices = np.random.choice(len(embeddings_array), size=sample_size, replace=False)
        embeddings_array = embeddings_array[indices]
        metadata_df = metadata_df.iloc[indices].reset_index(drop=True)
    
    print(f"\n✅ Loaded {len(embeddings_array):,} embeddings")
    print(f"   Embedding dimension: {embeddings_array.shape[1]}")
    print(f"   Datasets: {metadata_df['dataset'].unique().tolist()}")
    print(f"   Splits: {metadata_df['split'].unique().tolist()}")
    
    return embeddings_array, metadata_df


In [5]:
# Discover and load embeddings
print("=" * 80)
print("📂 Discovering embedding files...")
print("=" * 80)

# Use selection parameters from configuration
file_infos = discover_embedding_files(
    EMBEDDINGS_DIR,
    load_all=LOAD_ALL,
    selected_splits=SELECTED_SPLITS
)

if file_infos:
    # Show discovered files
    print(f"\n📋 Found {len(file_infos)} embedding file(s):\n")
    
    # Group by dataset and split
    from collections import defaultdict
    grouped = defaultdict(list)
    for f in file_infos:
        grouped[f"{f['dataset']}/{f['split']}"].append(f)
    
    for key, files in sorted(grouped.items()):
        total_shards = files[0]['total_shards']
        print(f"   {key}: {len(files)} shard(s) of {total_shards}")
    
    # Load embeddings
    print("\n" + "=" * 80)
    embeddings, metadata_df = load_embeddings_from_parquet(
        file_infos, sample_size=SAMPLE_SIZE, random_seed=RANDOM_SEED
    )
    print("=" * 80)
    
    # Display metadata distribution
    print("\n📊 Data Distribution:")
    print(f"\nBy Dataset:")
    print(metadata_df['dataset'].value_counts().to_string())
    print(f"\nBy Split (Label):")
    print(metadata_df['split'].value_counts().to_string())
else:
    print("\n⚠️  No embedding files found!")
    print("   Expected directory structure:")
    print("   embeddings/")
    print("   ├── v1/")
    print("   │   ├── chat/")
    print("   │   │   └── v1-chat-00000-of-00001.parquet")
    print("   │   └── ...")
    print("   └── ...")
    print("\n   Run: python extract_embeddings_parallel_shards.py --all")
    embeddings = None
    metadata_df = None


📂 Discovering embedding files...

🎯 Selection filter active:
   • v2:chat
   • v2:code
   • v2:math
   • v2:stem

   Matched 4 dataset:split combination(s)

📋 Found 18 embedding file(s):

   v2/chat: 12 shard(s) of 12
   v2/code: 2 shard(s) of 2
   v2/math: 2 shard(s) of 2
   v2/stem: 2 shard(s) of 2

📂 Loading embeddings from 18 parquet file(s)...




✅ Loaded 1,397,187 embeddings
   Embedding dimension: 4096
   Datasets: ['v2']
   Splits: ['chat', 'code', 'math', 'stem']

📊 Data Distribution:

By Dataset:
dataset
v2    1397187

By Split (Label):
split
chat    627720
stem    355000
math    239467
code    175000


## 🚀 GPU-Accelerated Dimensionality Reduction

Apply **cuML UMAP** or **cuML t-SNE** for GPU-accelerated dimensionality reduction.
On large datasets this is typically 10-100x faster than CPU-based methods; the actual speedup depends on the GPU, the data size, and the parameters.
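
The GPU reduction functions in this notebook copy the full embedding matrix to device memory, so it can help to estimate its size first. The sketch below computes the footprint of a dense float32 array; `float32_footprint_gib` is a hypothetical helper, and the estimate covers only the input matrix, not the working memory the algorithms allocate on top of it.

```python
def float32_footprint_gib(n_samples: int, n_features: int) -> float:
    """Size in GiB of a dense float32 array of shape (n_samples, n_features)."""
    bytes_per_value = 4  # float32
    return n_samples * n_features * bytes_per_value / 2**30


# Shape reported by the loading cell in this notebook run.
print(f"{float32_footprint_gib(1_397_187, 4096):.2f} GiB")  # 21.32 GiB
```

If the estimate approaches the available GPU memory, setting `SAMPLE_SIZE` in the configuration cell is the simplest way to shrink the input.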


In [6]:
def apply_umap_gpu(embeddings: np.ndarray, n_components: int = 2) -> np.ndarray:
    """
    Apply GPU-accelerated UMAP using cuML.
    
    Args:
        embeddings: Input embeddings (n_samples, n_features)
        n_components: Output dimensions (2 or 3)
    
    Returns:
        Reduced embeddings (n_samples, n_components)
    """
    print("🚀 Applying cuML UMAP (GPU-accelerated)...")
    print(f"   Input shape: {embeddings.shape}")
    print(f"   Output dimensions: {n_components}")
    print(f"   Parameters: n_neighbors={UMAP_N_NEIGHBORS}, min_dist={UMAP_MIN_DIST}, metric={UMAP_METRIC}")
    
    # Convert to cupy array for GPU processing
    embeddings_gpu = cp.asarray(embeddings, dtype=cp.float32)
    
    # Initialize cuML UMAP
    reducer = cumlUMAP(
        n_components=n_components,
        n_neighbors=UMAP_N_NEIGHBORS,
        min_dist=UMAP_MIN_DIST,
        metric=UMAP_METRIC,
        random_state=RANDOM_SEED,
        verbose=True
    )
    
    # Fit and transform
    reduced = reducer.fit_transform(embeddings_gpu)
    
    # Convert back to numpy
    result = cp.asnumpy(reduced)
    
    print(f"✅ UMAP complete! Output shape: {result.shape}")
    return result


def apply_tsne_gpu(embeddings: np.ndarray, n_components: int = 2) -> np.ndarray:
    """
    Apply GPU-accelerated t-SNE using cuML.
    
    Args:
        embeddings: Input embeddings (n_samples, n_features)
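        n_components: Output dimensions (cuML's TSNE is documented as supporting only 2)
    
    Returns:
        Reduced embeddings (n_samples, n_components)
    """
    # cuML documents TSNE as supporting n_components=2 only, so fail early with a clear message
    if n_components != 2:
        raise ValueError("cuML TSNE supports only n_components=2; use UMAP for 3D projections")
    print("🚀 Applying cuML t-SNE (GPU-accelerated)...")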
    print(f"   Input shape: {embeddings.shape}")
    print(f"   Output dimensions: {n_components}")
    print(f"   Parameters: perplexity={TSNE_PERPLEXITY}, learning_rate={TSNE_LEARNING_RATE}")
    
    # Convert to cupy array for GPU processing
    embeddings_gpu = cp.asarray(embeddings, dtype=cp.float32)
    
    # Initialize cuML t-SNE
    reducer = cumlTSNE(
        n_components=n_components,
        perplexity=TSNE_PERPLEXITY,
        learning_rate=TSNE_LEARNING_RATE,
        random_state=RANDOM_SEED,
        verbose=True
    )
    
    # Fit and transform
    reduced = reducer.fit_transform(embeddings_gpu)
    
    # Convert back to numpy
    result = cp.asnumpy(reduced)
    
    print(f"✅ t-SNE complete! Output shape: {result.shape}")
    return result


def apply_umap_cpu(embeddings: np.ndarray, n_components: int = 2) -> np.ndarray:
    """CPU fallback for UMAP using umap-learn."""
    print("🐢 Applying CPU UMAP (umap-learn)...")
    print(f"   Input shape: {embeddings.shape}")
    
    reducer = cpuUMAP.UMAP(
        n_components=n_components,
        n_neighbors=UMAP_N_NEIGHBORS,
        min_dist=UMAP_MIN_DIST,
        metric=UMAP_METRIC,
        random_state=RANDOM_SEED,
        verbose=True
    )
    
    result = reducer.fit_transform(embeddings)
    print(f"✅ UMAP complete! Output shape: {result.shape}")
    return result


def apply_tsne_cpu(embeddings: np.ndarray, n_components: int = 2) -> np.ndarray:
    """CPU fallback for t-SNE using sklearn."""
    print("🐢 Applying CPU t-SNE (sklearn)...")
    print(f"   Input shape: {embeddings.shape}")
    
    reducer = sklearnTSNE(
        n_components=n_components,
        perplexity=TSNE_PERPLEXITY,
        learning_rate=TSNE_LEARNING_RATE,
        random_state=RANDOM_SEED,
        verbose=1
    )
    
    result = reducer.fit_transform(embeddings)
    print(f"✅ t-SNE complete! Output shape: {result.shape}")
    return result


def reduce_dimensions(
    embeddings: np.ndarray,
    method: str = "umap",
    n_components: int = 2
) -> np.ndarray:
    """
    Apply dimensionality reduction using GPU if available, else CPU.
    
    Args:
        embeddings: Input embeddings
        method: "umap" or "tsne"
        n_components: 2 or 3
    
    Returns:
        Reduced embeddings
    """
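    # Reject unknown method names instead of silently falling through to t-SNE
    method = method.lower()
    if method not in ("umap", "tsne"):
        raise ValueError(f"Unknown reduction method: {method!r} (expected 'umap' or 'tsne')")
    if RAPIDS_AVAILABLE:
        if method == "umap":
            return apply_umap_gpu(embeddings, n_components)
        else:
            return apply_tsne_gpu(embeddings, n_components)
    else:
        if method == "umap":
            return apply_umap_cpu(embeddings, n_components)
        else:
            return apply_tsne_cpu(embeddings, n_components)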


In [None]:
# Apply dimensionality reduction (2D and 3D)
if embeddings is not None:
    print("=" * 80)
    print(f"🗺️  Applying {REDUCTION_METHOD.upper()} Dimensionality Reduction")
    print("=" * 80)
    
    # 2D reduction
    print("\n📊 2D Projection:")
    embeddings_2d = reduce_dimensions(embeddings, method=REDUCTION_METHOD, n_components=2)
    
    # Add to metadata
    metadata_df['x'] = embeddings_2d[:, 0]
    metadata_df['y'] = embeddings_2d[:, 1]
    
    # 3D reduction
    print("\n📊 3D Projection:")
    embeddings_3d = reduce_dimensions(embeddings, method=REDUCTION_METHOD, n_components=3)
    
    # Add to metadata
    metadata_df['x3d'] = embeddings_3d[:, 0]
    metadata_df['y3d'] = embeddings_3d[:, 1]
    metadata_df['z3d'] = embeddings_3d[:, 2]
    
    print("\n" + "=" * 80)
    print("✅ Dimensionality reduction complete!")
    print(f"   2D range: x=[{metadata_df['x'].min():.2f}, {metadata_df['x'].max():.2f}], y=[{metadata_df['y'].min():.2f}, {metadata_df['y'].max():.2f}]")
    print(f"   3D range: x=[{metadata_df['x3d'].min():.2f}, {metadata_df['x3d'].max():.2f}], y=[{metadata_df['y3d'].min():.2f}, {metadata_df['y3d'].max():.2f}], z=[{metadata_df['z3d'].min():.2f}, {metadata_df['z3d'].max():.2f}]")
    print("=" * 80)
else:
    print("⚠️  No embeddings loaded - skipping dimensionality reduction")


🗺️  Applying UMAP Dimensionality Reduction

📊 2D Projection:
🚀 Applying cuML UMAP (GPU-accelerated)...
   Input shape: (1397187, 4096)
   Output dimensions: 2
   Parameters: n_neighbors=15, min_dist=0.1, metric=cosine
[2025-12-23 10:55:17.784] [CUML] [debug] Computing KNN Graph


## 📊 Interactive Plotly Visualizations

Create 2D and 3D interactive visualizations with Plotly.
Labels come from dataset metadata (split names). Future versions can use cuBERT clustering.
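
Since labels are plain strings, a palette lookup with a neutral fallback keeps unexpected labels visible instead of failing. The sketch below illustrates that idea in isolation; `build_color_map` and the `tool_use` label are hypothetical, and the gray default is the same value the notebook assigns to `unknown`.

```python
from typing import Dict, Iterable

DEFAULT_COLOR = "#95A5A6"  # same gray the notebook uses for 'unknown'


def build_color_map(labels: Iterable[str], palette: Dict[str, str]) -> Dict[str, str]:
    """Map each distinct label to its palette color, falling back to DEFAULT_COLOR."""
    # dict.fromkeys de-duplicates while preserving first-seen order.
    return {label: palette.get(label, DEFAULT_COLOR) for label in dict.fromkeys(labels)}


palette = {"chat": "#FF6B6B", "code": "#4ECDC4"}
color_map = build_color_map(["chat", "code", "chat", "tool_use"], palette)
print(color_map)
# {'chat': '#FF6B6B', 'code': '#4ECDC4', 'tool_use': '#95A5A6'}
```

A mapping of this shape is what Plotly Express accepts through its `color_discrete_map` argument.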


In [None]:
def create_2d_scatter(
    df: pd.DataFrame,
    color_col: str = 'split',
    title: str = "2D Embedding Visualization",
    color_map: Optional[Dict] = None
) -> go.Figure:
    """
    Create interactive 2D scatter plot with Plotly.
    
    Args:
        df: DataFrame with x, y columns and metadata
        color_col: Column to use for coloring points
        title: Plot title
        color_map: Optional custom color mapping
    
    Returns:
        Plotly Figure object
    """
    # Use custom colors if provided
    if color_map is None:
        color_map = CATEGORY_COLORS
    
    # Get unique values and assign colors
    unique_vals = df[color_col].unique()
    colors = {val: color_map.get(val, '#95A5A6') for val in unique_vals}
    
    fig = px.scatter(
        df,
        x='x',
        y='y',
        color=color_col,
        color_discrete_map=colors,
        hover_data=['dataset', 'split', 'label'],
        title=title,
        labels={'x': f'{REDUCTION_METHOD.upper()} Dimension 1', 'y': f'{REDUCTION_METHOD.upper()} Dimension 2'},
        template='plotly_white'
    )
    
    fig.update_traces(
        marker=dict(size=5, opacity=0.6, line=dict(width=0.3, color='white'))
    )
    
    fig.update_layout(
        width=1200,
        height=800,
        title_font_size=20,
        title_x=0.5,
        legend=dict(
            title=color_col.title(),
            yanchor="top",
            y=0.99,
            xanchor="left",
            x=1.01,
            bgcolor="rgba(255, 255, 255, 0.9)",
            bordercolor="gray",
            borderwidth=1
        ),
        hovermode='closest'
    )
    
    return fig


def create_3d_scatter(
    df: pd.DataFrame,
    color_col: str = 'split',
    title: str = "3D Embedding Visualization",
    color_map: Optional[Dict] = None
) -> go.Figure:
    """
    Create interactive 3D scatter plot with Plotly.
    
    Args:
        df: DataFrame with x3d, y3d, z3d columns and metadata
        color_col: Column to use for coloring points
        title: Plot title
        color_map: Optional custom color mapping
    
    Returns:
        Plotly Figure object
    """
    # Use custom colors if provided
    if color_map is None:
        color_map = CATEGORY_COLORS
    
    # Get unique values and assign colors
    unique_vals = df[color_col].unique()
    colors = {val: color_map.get(val, '#95A5A6') for val in unique_vals}
    
    fig = px.scatter_3d(
        df,
        x='x3d',
        y='y3d',
        z='z3d',
        color=color_col,
        color_discrete_map=colors,
        hover_data=['dataset', 'split', 'label'],
        title=title,
        labels={
            'x3d': f'{REDUCTION_METHOD.upper()} Dim 1',
            'y3d': f'{REDUCTION_METHOD.upper()} Dim 2',
            'z3d': f'{REDUCTION_METHOD.upper()} Dim 3'
        },
        template='plotly_white'
    )
    
    fig.update_traces(
        marker=dict(size=3, opacity=0.6, line=dict(width=0.2, color='white'))
    )
    
    fig.update_layout(
        width=1200,
        height=900,
        title_font_size=20,
        title_x=0.5,
        scene=dict(
            xaxis_title=f'{REDUCTION_METHOD.upper()} Dimension 1',
            yaxis_title=f'{REDUCTION_METHOD.upper()} Dimension 2',
            zaxis_title=f'{REDUCTION_METHOD.upper()} Dimension 3',
            camera=dict(eye=dict(x=1.5, y=1.5, z=1.2))
        ),
        legend=dict(
            title=color_col.title(),
            yanchor="top",
            y=0.99,
            xanchor="left",
            x=0.01,
            bgcolor="rgba(255, 255, 255, 0.9)",
            bordercolor="gray",
            borderwidth=1
        )
    )
    
    return fig


In [None]:
# Create and display visualizations
if metadata_df is not None and 'x' in metadata_df.columns:
    print("=" * 80)
    print("📊 Creating Interactive Visualizations")
    print("=" * 80)
    
    # =========================================================================
    # Visualization 1: 2D scatter by Split (default label)
    # =========================================================================
    print("\n🎨 Creating 2D visualization colored by Split...")
    fig_2d_split = create_2d_scatter(
        metadata_df,
        color_col='split',
        title=f'{REDUCTION_METHOD.upper()} 2D Projection - Colored by Split'
    )
    
    # Save to HTML
    output_file = OUTPUT_DIR / f'{REDUCTION_METHOD}_2d_by_split.html'
    fig_2d_split.write_html(str(output_file))
    print(f"   ✅ Saved: {output_file}")
    
    # Display
    fig_2d_split.show()
else:
    print("⚠️  No data to visualize - run previous cells first")


In [None]:
# =========================================================================
# Visualization 2: 2D scatter by Dataset
# =========================================================================
if metadata_df is not None and 'x' in metadata_df.columns:
    print("\n🎨 Creating 2D visualization colored by Dataset...")
    
    # Custom colors for datasets
    dataset_colors = {
        'v1': '#FF6B6B',
        'v2': '#4ECDC4',
        'llama-sft': '#45B7D1',
        'llama-rl': '#96CEB4',
        'v3-science': '#DDA0DD',
        'v3-math-proofs': '#F39C12',
        'v3-instruction-chat': '#9B59B6'
    }
    
    fig_2d_dataset = create_2d_scatter(
        metadata_df,
        color_col='dataset',
        title=f'{REDUCTION_METHOD.upper()} 2D Projection - Colored by Dataset',
        color_map=dataset_colors
    )
    
    # Save to HTML
    output_file = OUTPUT_DIR / f'{REDUCTION_METHOD}_2d_by_dataset.html'
    fig_2d_dataset.write_html(str(output_file))
    print(f"   ✅ Saved: {output_file}")
    
    # Display
    fig_2d_dataset.show()


In [None]:
# =========================================================================
# Visualization 3: 3D scatter by Split
# =========================================================================
if metadata_df is not None and 'x3d' in metadata_df.columns:
    print("\n🎨 Creating 3D visualization colored by Split...")
    fig_3d_split = create_3d_scatter(
        metadata_df,
        color_col='split',
        title=f'{REDUCTION_METHOD.upper()} 3D Projection - Colored by Split'
    )
    
    # Save to HTML
    output_file = OUTPUT_DIR / f'{REDUCTION_METHOD}_3d_by_split.html'
    fig_3d_split.write_html(str(output_file))
    print(f"   ✅ Saved: {output_file}")
    
    # Display
    fig_3d_split.show()


In [None]:
# =========================================================================
# Visualization 4: 3D scatter by Dataset
# =========================================================================
if metadata_df is not None and 'x3d' in metadata_df.columns:
    print("\n🎨 Creating 3D visualization colored by Dataset...")
    fig_3d_dataset = create_3d_scatter(
        metadata_df,
        color_col='dataset',
        title=f'{REDUCTION_METHOD.upper()} 3D Projection - Colored by Dataset',
        color_map=dataset_colors  # defined in the Visualization 2 cell; run that cell first
    )
    
    # Save to HTML
    output_file = OUTPUT_DIR / f'{REDUCTION_METHOD}_3d_by_dataset.html'
    fig_3d_dataset.write_html(str(output_file))
    print(f"   ✅ Saved: {output_file}")
    
    # Display
    fig_3d_dataset.show()
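
The four cells above repeat one pattern: build a figure, save it as `{method}_{dims}d_by_{column}.html`, then display it. A minimal sketch of how the file naming could be factored out is shown below. `plot_output_path` and `planned_outputs` are hypothetical helpers, not part of the notebook, and the directory name is a placeholder for `OUTPUT_DIR`.

```python
from pathlib import Path


def plot_output_path(output_dir, method, dims, color_col):
    """Build the HTML path using the naming scheme of the cells above.

    Hypothetical helper: mirrors f'{REDUCTION_METHOD}_2d_by_split.html'.
    """
    return Path(output_dir) / f"{method}_{dims}d_by_{color_col}.html"


def planned_outputs(output_dir, method, dims_list=(2, 3), color_cols=("split", "dataset")):
    """List every (dims, color_col, path) combination in a fixed order."""
    return [
        (dims, col, plot_output_path(output_dir, method, dims, col))
        for dims in dims_list
        for col in color_cols
    ]


# Placeholder directory name; substitute the notebook's OUTPUT_DIR.
for dims, col, path in planned_outputs("visualizations", "umap"):
    print(f"{dims}D by {col}: {path.name}")
```

Looping over such a list would also avoid the cross-cell dependency on `dataset_colors`, since the color map could be chosen per column inside the loop.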


## 📈 Summary and Export


In [None]:
# Display summary
if metadata_df is not None:
    print("=" * 80)
    print("📊 VISUALIZATION SUMMARY")
    print("=" * 80)
    
    print(f"\n✅ Total embeddings visualized: {len(metadata_df):,}")
    print(f"   Embedding dimension: {embeddings.shape[1] if embeddings is not None else 'N/A'}")
    print(f"   Reduction method: {REDUCTION_METHOD.upper()}")
    print(f"   Backend: {'GPU (RAPIDS cuML)' if RAPIDS_AVAILABLE else 'CPU'}")
    
    print(f"\n📁 Output files saved to: {OUTPUT_DIR}")
    for f in OUTPUT_DIR.glob(f"{REDUCTION_METHOD}_*.html"):
        print(f"   • {f.name}")
    
    print("\n📊 Data Distribution:")
    print("\nBy Dataset:")
    print(metadata_df['dataset'].value_counts().to_string())
    print("\nBy Split:")
    print(metadata_df['split'].value_counts().to_string())
    
    # Save metadata DataFrame for further analysis
    metadata_file = OUTPUT_DIR / 'visualization_metadata.parquet'
    metadata_df.to_parquet(str(metadata_file))
    print(f"\n💾 Metadata saved to: {metadata_file}")
    
    print("\n" + "=" * 80)
    print("✅ Visualization complete!")
    print("   Open the HTML files in a browser for interactive exploration.")
    print("=" * 80)
else:
    print("⚠️  No visualizations created - no embeddings loaded")


## 🔄 Optional: Compare UMAP vs t-SNE

Run this cell to also generate t-SNE visualizations for comparison.
Note: t-SNE is typically slower than UMAP, even with GPU acceleration.
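
To check the speed difference on your own data, one option is to time each reducer with a small harness. The sketch below is illustrative only: `time_reducers` is a hypothetical helper, and the two lambdas are trivial stand-ins for where calls such as `reduce_dimensions(embeddings, method="tsne", n_components=2)` would go.

```python
import time


def time_reducers(reducers, data):
    """Run each reducer once on `data` and return elapsed seconds per name.

    `reducers` maps a label to a callable taking `data`; hypothetical helper.
    """
    timings = {}
    for name, reduce_fn in reducers.items():
        start = time.perf_counter()
        reduce_fn(data)
        timings[name] = time.perf_counter() - start
    return timings


# Stand-in callables so the sketch runs without a GPU or the notebook state.
# In the notebook these would wrap reduce_dimensions(...) for each method.
fake_data = [[float(i), float(i) * 2.0] for i in range(1000)]
timings = time_reducers(
    {
        "umap": lambda rows: [r[:2] for r in rows],
        "tsne": lambda rows: [sorted(r)[:2] for r in rows],
    },
    fake_data,
)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

A single run includes one-time setup costs, so repeating the measurement gives a steadier comparison.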


In [None]:
# Optional: Generate t-SNE visualizations
# Set RUN_TSNE = True to generate t-SNE visualizations
RUN_TSNE = False

if RUN_TSNE and embeddings is not None:
    print("=" * 80)
    print("🔄 Generating t-SNE visualizations for comparison...")
    print("=" * 80)
    
    # 2D t-SNE
    print("\n📊 2D t-SNE:")
    tsne_2d = reduce_dimensions(embeddings, method="tsne", n_components=2)
    metadata_df['tsne_x'] = tsne_2d[:, 0]
    metadata_df['tsne_y'] = tsne_2d[:, 1]
    
    # Create 2D t-SNE plot
    fig_tsne_2d = px.scatter(
        metadata_df,
        x='tsne_x',
        y='tsne_y',
        color='split',
        color_discrete_map=CATEGORY_COLORS,
        hover_data=['dataset', 'split'],
        title='t-SNE 2D Projection - Colored by Split',
        template='plotly_white'
    )
    fig_tsne_2d.update_traces(marker=dict(size=5, opacity=0.6))
    fig_tsne_2d.update_layout(width=1200, height=800)
    
    output_file = OUTPUT_DIR / 'tsne_2d_by_split.html'
    fig_tsne_2d.write_html(str(output_file))
    print(f"   ✅ Saved: {output_file}")
    fig_tsne_2d.show()
    
    print("\n✅ t-SNE visualizations complete!")
elif RUN_TSNE:
    print("⚠️  No embeddings available for t-SNE")
else:
    print("ℹ️  t-SNE comparison skipped (set RUN_TSNE = True to enable)")


## 🔮 Future Work: Label Assignment

Labels can come from multiple sources:

1. **Dataset Metadata** (current): Using `split` names (chat, code, math, stem, etc.)
2. **cuBERT Clustering**: GPU-accelerated topic modeling with BERT embeddings
3. **K-Means/HDBSCAN**: Unsupervised clustering on the reduced embeddings
4. **Manual Labels**: Domain expert annotations

To add cuBERT-based labels, see: [RAPIDS cuBERT Topic Modelling](https://github.com/rapidsai/rapids-examples/tree/main/cuBERT_topic_modelling)
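
For option 3, HDBSCAN assigns the label `-1` to points it treats as noise, so cluster counts should exclude it. A minimal sketch of summarizing such labels follows; `summarize_clusters` is a hypothetical helper and the label list is made-up toy data, not output from this notebook.

```python
from collections import Counter


def summarize_clusters(labels, noise_label=-1):
    """Count clusters, noise points, and per-cluster sizes.

    Assumes the HDBSCAN convention that `noise_label` (-1) marks noise.
    """
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(noise_label, 0)
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "sizes": dict(sorted(counts.items())),
    }


# Toy labels for illustration only.
toy_labels = [0, 0, 1, -1, 1, 1, 2, -1]
summary = summarize_clusters(toy_labels)
print(summary)
```

Reporting the noise count alongside the cluster count helps when tuning `min_cluster_size` and `min_samples`, since a large noise fraction often means the settings are too strict for the data.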


In [None]:
# Example: Adding cluster labels with cuML HDBSCAN (optional)
ADD_CLUSTER_LABELS = False

if ADD_CLUSTER_LABELS and RAPIDS_AVAILABLE and embeddings is not None:
    import cupy as cp  # harmless if already imported in an earlier cell
    from cuml.cluster import HDBSCAN
    
    print("🔍 Computing HDBSCAN clusters on GPU...")
    
    # Use 2D reduced embeddings for clustering
    embeddings_for_clustering = cp.asarray(embeddings_2d, dtype=cp.float32)
    
    clusterer = HDBSCAN(
        min_cluster_size=50,
        min_samples=10,
        metric='euclidean'
    )
    
    cluster_labels = clusterer.fit_predict(embeddings_for_clustering)
    metadata_df['cluster'] = cp.asnumpy(cluster_labels)
    
    n_clusters = len(set(metadata_df['cluster'])) - (1 if -1 in metadata_df['cluster'].values else 0)
    print(f"   Found {n_clusters} clusters")
    
    # Visualize clusters
    fig_clusters = px.scatter(
        metadata_df,
        x='x', y='y',
        color='cluster',
        title=f'{REDUCTION_METHOD.upper()} with HDBSCAN Clusters ({n_clusters} clusters)',
        template='plotly_white'
    )
    fig_clusters.update_layout(width=1200, height=800)
    fig_clusters.show()
elif ADD_CLUSTER_LABELS:
    print("⚠️  Cluster labeling requires RAPIDS and loaded embeddings")
else:
    print("ℹ️  Cluster labeling skipped (set ADD_CLUSTER_LABELS = True to enable)")


In [None]:
# End of notebook
print("🎉 Notebook execution complete!")
print("\nNext steps:")
print("1. Open the HTML files in the visualizations/ folder for interactive exploration")
print("2. Set ADD_CLUSTER_LABELS = True to compute unsupervised clusters")
print("3. Set RUN_TSNE = True to compare UMAP vs t-SNE")
print("4. Integrate cuBERT for topic-based labeling")


