# NVIDIA Nemotron Datasets Visualization with UMAP

This notebook samples the Nemotron-Post-Training-Dataset v1 and v2 datasets, generates embeddings with parallel requests to an NVIDIA NIM embedding model (nv-embedqa-e5-v5, or whichever model `EMBEDDING_MODEL` names), and visualizes the data as a UMAP 2D projection with **interactive Plotly visualizations**.

## 📊 Overview
- **Datasets**: Nemotron-Post-Training-Dataset-v1 and v2
- **Embedding Model**: nvidia/nv-embedqa-e5-v5 (NVIDIA NIM)
- **Visualization**: Interactive UMAP 2D/3D plots with Plotly
- **Color Coding**: Based on dataset headers (category, reasoning, version, split)

## 🎯 Features
- ✅ Parallel processing for embedding generation
- ✅ NVIDIA NIM API integration for high-quality embeddings
- ✅ Interactive Plotly visualizations with hover details
- ✅ Multiple views: by category, version, reasoning, and faceted comparisons
- ✅ 3D visualization for deeper exploration
- ✅ Export to HTML for easy sharing

---


## 🔧 Setup and Imports


In [1]:
# Install required packages
%pip install python-dotenv datasets umap-learn numpy pandas plotly scikit-learn openai tqdm joblib "nbformat>=5.0.0" -q

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from datasets import load_dataset
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed
from tqdm.auto import tqdm
import pickle
import json
from openai import OpenAI
import umap
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')

# Configure Plotly for notebook rendering
import plotly.io as pio
pio.renderers.default = "notebook_connected"

print("✅ All imports successful!")


✅ All imports successful!


## 📦 Configure NVIDIA NIM Client

Set up the OpenAI client to connect to NVIDIA's NIM endpoint for embeddings.
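
The cell below falls back to a placeholder string when `NVIDIA_API_KEY` is not set, so a missing key only shows up later as a failed request. A minimal sketch of an early guard follows; `require_api_key` and `PLACEHOLDER_KEY` are hypothetical names introduced for illustration and are not part of the notebook.

```python
import os

# The fallback string used when the environment variable is missing (assumption:
# it matches the placeholder in the configuration cell).
PLACEHOLDER_KEY = "nvapi-YOUR_API_KEY"


def require_api_key(env=os.environ, name="NVIDIA_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or a placeholder."""
    key = env.get(name, "").strip()
    if not key or key == PLACEHOLDER_KEY:
        raise RuntimeError(
            f"{name} is not set; add it to your environment or .env file"
        )
    return key
```

Calling `require_api_key()` right after `load_dotenv()` would stop the notebook at the configuration step instead of at the first embedding request.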


In [3]:
from dotenv import load_dotenv
load_dotenv()

# Configure NVIDIA NIM client
nvidia_api_key = os.environ.get("NVIDIA_API_KEY", "nvapi-YOUR_API_KEY")
# print(nvidia_api_key)

In [9]:
# Model configuration
# EMBEDDING_MODEL = "nvidia/nv-embedqa-e5-v5"
# EMBEDDING_DIM = 1024  # Expected dimension for this model
EMBEDDING_MODEL = "nvidia/llama-3_2-nemoretriever-300m-embed-v2"
EMBEDDING_DIM = 4096  # Expected dimension for this model
BASE_URL = "https://integrate.api.nvidia.com/v1"
print(f"✅ NVIDIA NIM client configured with model: {EMBEDDING_MODEL}")
print(f"   Base URL: {BASE_URL}")

client = OpenAI(
    base_url=BASE_URL,
    api_key=nvidia_api_key
)


✅ NVIDIA NIM client configured with model: nvidia/llama-3_2-nemoretriever-300m-embed-v2
   Base URL: https://integrate.api.nvidia.com/v1


## 🧪 Test Embedding API Payload

Before running the full extraction, test embedding creation on a few fixed sample texts to verify:
1. The API endpoint is working correctly
2. The request payload format is accepted
3. The response format is as expected
4. The embedding dimensions are correct
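
The count and dimension checks in this list can also be written as a small pure function that runs without calling the API. This is a sketch under stated assumptions: `validate_embeddings` is a hypothetical helper, and it expects the embeddings as plain lists of floats (for example `[d.embedding for d in response.data]`).

```python
def validate_embeddings(vectors, expected_count, expected_dim):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if len(vectors) != expected_count:
        problems.append(f"expected {expected_count} embeddings, got {len(vectors)}")
    # Collect the distinct vector lengths that actually came back
    dims = {len(v) for v in vectors}
    if dims and dims != {expected_dim}:
        problems.append(f"expected dimension {expected_dim}, got {sorted(dims)}")
    return problems


# Stand-in vectors instead of a real API response
ok = validate_embeddings([[0.0] * 4, [0.0] * 4], expected_count=2, expected_dim=4)
bad = validate_embeddings([[0.0] * 3], expected_count=2, expected_dim=4)
```

Returning a list of problems rather than raising keeps the check usable both in an interactive cell and inside a retry loop.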


In [5]:
import json
import random

print("🧪 Testing Embedding API with Random Sample Data")
print("=" * 80)

# Generate random test data to simulate different types of content
np.random.seed(42)
random.seed(42)

# Sample texts simulating different categories
sample_texts = [
    "user: Write a Python function to calculate fibonacci numbers\nassistant: Here's a recursive implementation of fibonacci...",
    "user: Explain quantum entanglement in simple terms\nassistant: Quantum entanglement is a phenomenon where particles become connected...",
    "user: How do I solve quadratic equations?\nassistant: To solve ax² + bx + c = 0, you can use the quadratic formula...",
    "user: What are best practices for REST API design?\nassistant: RESTful API design should follow these principles...",
    "system: You are a helpful assistant\nuser: Tell me a joke\nassistant: Why don't scientists trust atoms? Because they make up everything!"
]

# Create random test dataframe
test_df = pd.DataFrame({
    'text': sample_texts,
    'version': ['v1', 'v1', 'v2', 'v2', 'v1'],
    'split': ['code', 'stem', 'math', 'chat', 'chat'],
    'category': ['code', 'stem', 'math', 'chat', 'chat'],
    'reasoning': ['on', 'on', 'off', 'on', 'off'],
    'idx': [0, 1, 2, 3, 4]
})

print(f"\n📊 Test Configuration:")
print(f"   Model: {EMBEDDING_MODEL}")
print(f"   Number of test samples: {len(test_df)}")
print(f"   Expected dimension: {EMBEDDING_DIM}")

# print(f"\n📝 Random test dataframe:")
# print(test_df[['category', 'version', 'reasoning']].to_string())

print(f"\n📝 Sample texts (first 80 chars):")
for i, text in enumerate(test_df['text'].tolist(), 1):
    preview = text.replace('\n', ' ')[:80]
    print(f"   {i}. {preview}...")

print(f"\n{'='*80}")
print(f"🔍 Testing SINGLE text embedding...")
print(f"{'='*80}")

test_texts = test_df['text'].tolist()

try:
    # Test with single text first
    response = client.embeddings.create(
        input=[test_texts[0]],
        model=EMBEDDING_MODEL,
        encoding_format="float",
        extra_body={"input_type": "query", "truncate": "NONE"}
    )
    
    print("✅ Single text embedding successful!")
    print(f"   Response type: {type(response)}")
    print(f"   Number of embeddings: {len(response.data)}")
    print(f"   Embedding dimension: {len(response.data[0].embedding)}")
    print(f"   First 10 values: {response.data[0].embedding[:10]}")
    
    # Verify dimension
    actual_dim = len(response.data[0].embedding)
    if actual_dim == EMBEDDING_DIM:
        print(f"   ✅ Dimension matches expected: {actual_dim} == {EMBEDDING_DIM}")
    else:
        print(f"   ⚠️  Dimension mismatch: {actual_dim} != {EMBEDDING_DIM}")
        print(f"      Updating EMBEDDING_DIM to {actual_dim}")
        EMBEDDING_DIM = actual_dim

except Exception as e:
    print(f"❌ Single text embedding failed!")
    print(f"   Error: {e}")
    print(f"   Error type: {type(e).__name__}")
    if hasattr(e, 'response'):
        print(f"   Response status: {getattr(e.response, 'status_code', 'N/A')}")

print(f"\n{'='*80}")
print(f"🔍 Testing BATCH embedding (all {len(test_texts)} samples)...")
print(f"{'='*80}")

try:
    # Test with batch
    response = client.embeddings.create(
        input=test_texts,
        model=EMBEDDING_MODEL,
        encoding_format="float",
        extra_body={"input_type": "passage", "truncate": "NONE"}
    )
    
    print("✅ Batch embedding successful!")
    print(f"   Response type: {type(response)}")
    print(f"   Number of embeddings returned: {len(response.data)}")
    print(f"   Expected number: {len(test_texts)}")
    print(f"   All dimensions: {[len(d.embedding) for d in response.data]}")
    
    # Verify all dimensions match
    dims = [len(d.embedding) for d in response.data]
    if len(set(dims)) == 1 and dims[0] == EMBEDDING_DIM:
        print(f"   ✅ All dimensions consistent: {dims[0]}")
    else:
        print(f"   ⚠️  Dimension inconsistency detected: {set(dims)}")
    
    # print(f"\n📊 Sample embedding statistics:")
    # sample_embedding = np.array(response.data[0].embedding)
    # print(f"   Mean: {sample_embedding.mean():.6f}")
    # print(f"   Std: {sample_embedding.std():.6f}")
    # print(f"   Min: {sample_embedding.min():.6f}")
    # print(f"   Max: {sample_embedding.max():.6f}")
    # print(f"   Norm (L2): {np.linalg.norm(sample_embedding):.6f}")
    
    # Print first embedding in full detail
    print(f"\n🔍 FULL FIRST TEST EMBEDDING:")
    print(f"   Text: '{test_texts[0][:80]}...'")
    print(f"   Dimension: {len(response.data[0].embedding)}")
    print(f"   Full embedding array:")
    embedding_1 = response.data[0].embedding
    # Print in rows of 10 values for readability
    for i in range(0, len(embedding_1), 10):
        chunk = embedding_1[i:i+10]
        values_str = ", ".join([f"{v:8.5f}" for v in chunk])
        print(f"      [{i:4d}:{min(i+10, len(embedding_1)):4d}] {values_str}")
    
    # Also save embeddings to test_df for verification
    test_df['embedding'] = [d.embedding for d in response.data]
    print(f"\n✅ Saved {len(response.data)} embeddings to test_df['embedding']")
    print(f"   test_df shape: {test_df.shape}")
    print(f"   Embedding column type: {type(test_df['embedding'].iloc[0])}")
    print(f"   First embedding length: {len(test_df['embedding'].iloc[0])}")
    
    # Test payload structure
    print(f"\n📦 API Request Payload Structure:")
    print(f"   ✅ input: list of {len(test_texts)} strings")
    print(f"   ✅ model: {EMBEDDING_MODEL}")
    print(f"   ✅ encoding_format: float")
    print(f"   ✅ extra_body: {{'input_type': 'passage', 'truncate': 'NONE'}}")

except Exception as e:
    print(f"❌ Batch embedding failed!")
    print(f"   Error: {e}")
    print(f"   Error type: {type(e).__name__}")
    
    # Try to extract more details
    if hasattr(e, 'response'):
        print(f"   Response status: {getattr(e.response, 'status_code', 'N/A')}")
        try:
            error_body = e.response.json() if hasattr(e.response, 'json') else str(e.response.text)
            print(f"   Response body: {error_body}")
        except Exception:
            pass

print("\n" + "=" * 80)
print("🧪 Embedding API test complete!")
print("=" * 80)


🧪 Testing Embedding API with Random Sample Data

📊 Test Configuration:
   Model: nvidia/llama-3_2-nemoretriever-300m-embed-v2
   Number of test samples: 5
   Expected dimension: 4096

📝 Sample texts (first 80 chars):
   1. user: Write a Python function to calculate fibonacci numbers assistant: Here's a...
   2. user: Explain quantum entanglement in simple terms assistant: Quantum entangleme...
   3. user: How do I solve quadratic equations? assistant: To solve ax² + bx + c = 0, ...
   4. user: What are best practices for REST API design? assistant: RESTful API design...
   5. system: You are a helpful assistant user: Tell me a joke assistant: Why don't sc...

🔍 Testing SINGLE text embedding...
❌ Single text embedding failed!
   Error: 404 page not found
   Error type: NotFoundError
   Response status: 404

🔍 Testing BATCH embedding (all 5 samples)...
❌ Batch embedding failed!
   Error: 404 page not found
   Error type: NotFoundError
   Response status: 404

🧪 Embedding API test complete!

## 📥 Load Datasets

Load both Nemotron v1 and v2 datasets. `load_dataset` downloads them into the given `cache_dir` on first use and reads from that local cache afterwards.


In [6]:
print("📥 Loading Nemotron datasets...\n")

# Load Nemotron v1
print("Loading Nemotron-Post-Training-Dataset-v1...")
dataset_v1 = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v1",
    cache_dir="./datasets/nemotron-v1"
)
print(f"   ✅ Loaded v1 with splits: {list(dataset_v1.keys())}")
print(f"   Total v1 samples: {sum(len(dataset_v1[split]) for split in dataset_v1.keys()):,}\n")

# Load Nemotron v2
print("Loading Nemotron-Post-Training-Dataset-v2...")
dataset_v2 = load_dataset(
    "nvidia/Nemotron-Post-Training-Dataset-v2",
    cache_dir="./datasets/nemotron-v2"
)
print(f"   ✅ Loaded v2 with splits: {list(dataset_v2.keys())}")
print(f"   Total v2 samples: {sum(len(dataset_v2[split]) for split in dataset_v2.keys()):,}\n")

print("✅ All datasets loaded successfully!")


📥 Loading Nemotron datasets...

Loading Nemotron-Post-Training-Dataset-v1...
   ✅ Loaded v1 with splits: ['chat', 'code', 'math', 'stem', 'tool_calling']
   Total v1 samples: 25,659,642

Loading Nemotron-Post-Training-Dataset-v2...
   ✅ Loaded v2 with splits: ['stem', 'chat', 'math', 'code', 'multilingual_ja', 'multilingual_de', 'multilingual_it', 'multilingual_es', 'multilingual_fr']
   Total v2 samples: 6,341,414

✅ All datasets loaded successfully!


## 🔍 Explore Dataset Structure

Examine the structure of the datasets to see which fields are available, particularly the most meaningful ones.


In [7]:
# Examine sample from v1
print("=" * 80)
print("📊 Nemotron v1 Dataset Structure")
print("=" * 80)
sample_v1 = dataset_v1['chat'][0]
print(f"\nFields in v1: {list(sample_v1.keys())}\n")
for key, value in sample_v1.items():
    if isinstance(value, str) and len(value) > 200:
        print(f"{key}: {value[:200]}...")
    elif isinstance(value, list) and len(value) > 3:
        print(f"{key}: {value[:3]}... (truncated)")
    else:
        print(f"{key}: {value}")

# Examine sample from v2
print("\n" + "=" * 80)
print("📊 Nemotron v2 Dataset Structure")
print("=" * 80)
sample_v2 = dataset_v2['chat'][0]
print(f"\nFields in v2: {list(sample_v2.keys())}\n")
for key, value in sample_v2.items():
    if isinstance(value, str) and len(value) > 200:
        print(f"{key}: {value[:200]}...")
    elif isinstance(value, list) and len(value) > 3:
        print(f"{key}: {value[:3]}... (truncated)")
    else:
        print(f"{key}: {value}")

print("\n" + "=" * 80)

# Analyze all unique values for categorical fields
print("\n📊 Analyzing categorical fields across all splits...\n")

# V1 analysis
print("V1 Categories and Reasoning:")
for split in list(dataset_v1.keys())[:2]:  # Check first 2 splits
    categories = set()
    reasoning_vals = set()
    for i in range(min(100, len(dataset_v1[split]))):
        sample = dataset_v1[split][i]
        if 'category' in sample:
            categories.add(sample['category'])
        if 'reasoning' in sample:
            reasoning_vals.add(sample['reasoning'])
    print(f"  {split}: categories={categories}, reasoning={reasoning_vals}")

print("\nV2 Categories and Reasoning:")
for split in list(dataset_v2.keys())[:2]:  # Check first 2 splits
    categories = set()
    reasoning_vals = set()
    for i in range(min(100, len(dataset_v2[split]))):
        sample = dataset_v2[split][i]
        if 'category' in sample:
            categories.add(sample['category'])
        if 'reasoning' in sample:
            reasoning_vals.add(sample['reasoning'])
    print(f"  {split}: categories={categories}, reasoning={reasoning_vals}")


📊 Nemotron v1 Dataset Structure

Fields in v1: ['uuid', 'license', 'generator', 'version', 'category', 'reasoning', 'messages', 'metadata']

uuid: 1b07b912-0135-4f23-b704-2ceea567f617
license: CC BY 4.0
generator: Qwen3-235B-A22B
version: v1
category: chat
reasoning: off
messages: [{'role': 'user', 'content': '', 'tool_calls': []}, {'role': 'assistant', 'content': "Understood. I'm ready to proceed with the activity. Please ask the first question.", 'tool_calls': []}]
metadata: {"conversation_id": "8e31a022d01d49748f6053a8805dfbd2", "source": "https://huggingface.co/datasets/lmsys/lmsys-chat-1m"}

📊 Nemotron v2 Dataset Structure

Fields in v2: ['uuid', 'license', 'generator', 'version', 'category', 'reasoning', 'messages']

uuid: 76242391-3c82-4471-a971-e51f57b2899e
license: CC BY 4.0
generator: Qwen3-235B-A22B, Qwen3-30B-A3B
version: v2
category: chat
reasoning: off
messages: [{'role': 'system', 'content': ''}, {'role': 'user', 'content': 'Write a description of Mijačija and Brs

## 📊 Sample and Prepare Data

Because the datasets are very large (millions of samples), we sample a subset for visualization. Each split of each version is sampled at random with the same fraction, a proportional stratified sample that preserves the distribution across splits and versions.
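
The per-split sampling rule can be sketched on its own, without loading any data. This is an illustration only: `sample_indices` is a hypothetical helper, the split sizes are made up, and the seeded local generator is an assumption added here for reproducibility.

```python
import random


def sample_indices(total, fraction, seed=42):
    """Pick max(1, int(total * fraction)) distinct row indices, sorted ascending."""
    n = max(1, int(total * fraction))
    rng = random.Random(seed)  # local generator, so the draw is reproducible
    return sorted(rng.sample(range(total), n))


# Hypothetical split sizes, not the real Nemotron counts
split_sizes = {"chat": 1000, "code": 250}
plan = {name: sample_indices(size, 0.01) for name, size in split_sizes.items()}
```

Sorted indices can be handed to `Dataset.select` from the Hugging Face `datasets` library, which may be faster than indexing rows one at a time on large splits.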


In [8]:
# Configuration
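# To force regeneration, delete the cache file first, e.g.:
# if os.path.exists("embeddings_cache.pkl"): os.remove("embeddings_cache.pkl")
SAMPLING_FRACTION = 0.01  # Fraction of data to sample (0.0 to 1.0, where 1.0 = 100%)
MAX_TEXT_LENGTH = 2000     # Max characters kept per sample; requests use truncate="NONE", so inputs over the model's token limit are rejected, not truncated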

# Custom color scheme for categories
CATEGORY_COLORS = {
    'chat': 'red',
    'code': 'darkorange', 
    'math': 'gold',
    'stem': 'turquoise',
    'tool_calling': 'darkgreen',
    'multilingual_ja': 'purple',
    'multilingual_de': 'pink',
    'multilingual_it': 'brown',
    'multilingual_es': 'olive',
    'multilingual_fr': 'cyan'
}

def extract_text_from_messages(messages):
    """Extract text from messages field."""
    if not isinstance(messages, list):
        return ""
    
    texts = []
    for msg in messages:
        if isinstance(msg, dict):
            # Extract content from message
            content = msg.get('content', '')
            role = msg.get('role', '')
            if content:
                texts.append(f"{role}: {content}")
    
    return ' '.join(texts)[:MAX_TEXT_LENGTH]

def get_category_and_reasoning(sample):
    """Extract category and reasoning fields."""
    category = sample.get('category', 'unknown')
    reasoning = sample.get('reasoning', 'unknown')
    return category, reasoning

# Sample data from both datasets
sampled_data = []

print("📊 Sampling data from Nemotron v1...")
for split_name in dataset_v1.keys():
    split_data = dataset_v1[split_name]
    total_samples = len(split_data)
    n_samples = max(1, int(total_samples * SAMPLING_FRACTION))  # At least 1 sample
    
    print(f"  {split_name}: sampling {n_samples:,} / {total_samples:,} ({SAMPLING_FRACTION*100:.1f}%)")
    
    # Random sampling
    indices = np.random.choice(total_samples, size=n_samples, replace=False)
    
    for idx in tqdm(indices, desc=f"  Processing {split_name}", leave=False):
        sample = split_data[int(idx)]
        
        # Extract text from messages
        messages = sample.get('messages', [])
        text = extract_text_from_messages(messages)
        
        # Get category and reasoning
        category, reasoning = get_category_and_reasoning(sample)
        
        sampled_data.append({
            'text': text,
            'version': 'v1',
            'split': split_name,
            'category': category,
            'reasoning': reasoning,
            'idx': int(idx)
        })

print(f"  ✅ Sampled {len(sampled_data)} samples from v1\n")

v1_count = len(sampled_data)

print("📊 Sampling data from Nemotron v2...")
for split_name in dataset_v2.keys():
    split_data = dataset_v2[split_name]
    total_samples = len(split_data)
    n_samples = max(1, int(total_samples * SAMPLING_FRACTION))  # At least 1 sample
    
    print(f"  {split_name}: sampling {n_samples:,} / {total_samples:,} ({SAMPLING_FRACTION*100:.1f}%)")
    
    # Random sampling
    indices = np.random.choice(total_samples, size=n_samples, replace=False)
    
    for idx in tqdm(indices, desc=f"  Processing {split_name}", leave=False):
        sample = split_data[int(idx)]
        
        # Extract text from messages
        messages = sample.get('messages', [])
        text = extract_text_from_messages(messages)
        
        # Get category and reasoning
        category, reasoning = get_category_and_reasoning(sample)
        
        sampled_data.append({
            'text': text,
            'version': 'v2',
            'split': split_name,
            'category': category,
            'reasoning': reasoning,
            'idx': int(idx)
        })

print(f"  ✅ Sampled {len(sampled_data) - v1_count} samples from v2\n")

# Create DataFrame
df = pd.DataFrame(sampled_data)

print("=" * 80)
print(f"📊 Total samples prepared: {len(df):,}")
print(f"   Sampling fraction: {SAMPLING_FRACTION*100:.1f}%")
print(f"   - v1: {len(df[df['version'] == 'v1']):,}")
print(f"   - v2: {len(df[df['version'] == 'v2']):,}")
print(f"\nCategory distribution:")
print(df['category'].value_counts())
print(f"\nReasoning distribution:")
print(df['reasoning'].value_counts())
print(f"\nSplit distribution:")
print(df['split'].value_counts())
print("=" * 80)


📊 Sampling data from Nemotron v1...
  chat: sampling 7,466 / 746,622 (1.0%)
  code: sampling 18,963 / 1,896,395 (1.0%)
  math: sampling 20,444 / 2,044,407 (1.0%)
  stem: sampling 206,621 / 20,662,167 (1.0%)

KeyboardInterrupt: 

## 🚀 Generate Embeddings with NVIDIA NIM

Process the text samples in parallel batches to generate embeddings with the NVIDIA NIM model configured in `EMBEDDING_MODEL`.
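
When batches are embedded in parallel, a failed batch can fall anywhere in the sequence, so results need to be keyed by row index rather than appended in arrival order. The sketch below shows that bookkeeping with a stand-in embedding function; `make_batches`, `reassemble`, and `fake_embed` are hypothetical names, and no API is called.

```python
def make_batches(items, batch_size):
    """Split items into (start_index, chunk) pairs of at most batch_size elements."""
    return [(i, items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def reassemble(results, n_items):
    """Return (kept_indices, vectors) in row order, skipping rows with no result."""
    kept = [i for i in range(n_items) if i in results]
    return kept, [results[i] for i in kept]


def fake_embed(chunk):
    """Stand-in for the API call: fails the whole batch if it contains 'bad'."""
    if "bad" in chunk:
        return None
    return [[float(len(text))] for text in chunk]


texts = ["a", "bb", "bad", "dddd", "eeeee"]
results = {}
for start, chunk in make_batches(texts, batch_size=2):
    vectors = fake_embed(chunk)
    if vectors is None:
        continue  # the whole batch is dropped, leaving a gap in the middle
    for offset, vector in enumerate(vectors):
        results[start + offset] = vector

kept, vectors = reassemble(results, len(texts))
```

Here the middle batch fails, so `kept` lists the surviving row positions. Selecting dataframe rows by those positions keeps each embedding next to the text it came from.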


In [None]:
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Embedding cache file
EMBEDDINGS_CACHE_FILE = "embeddings_cache.pkl"

def get_embedding_batch(texts, model=EMBEDDING_MODEL, max_retries=3):
    """Get embeddings for a batch of texts with retry logic."""
    for attempt in range(max_retries):
        try:
            response = client.embeddings.create(
                input=texts,
                model=model,
                # max_tokens=4096,
                encoding_format="float",
                extra_body={"input_type": "passage", "truncate": "NONE"}
            )
            return [data.embedding for data in response.data]
        except Exception as e:
            if attempt < max_retries - 1:
                wait_time = (attempt + 1) * 2
                print(f"   ⚠️  Error: {e}. Retrying in {wait_time}s...")
                time.sleep(wait_time)
            else:
                print(f"   ❌ Failed after {max_retries} attempts: {e}")
                return None
    return None

def process_batch(batch_texts, batch_indices):
    """Process a batch of texts and return embeddings with indices."""
    embeddings = get_embedding_batch(batch_texts)
    if embeddings:
        return list(zip(batch_indices, embeddings))
    return []

# Check if embeddings are already cached
if os.path.exists(EMBEDDINGS_CACHE_FILE):
    print(f"📦 Loading cached embeddings from {EMBEDDINGS_CACHE_FILE}...")
    with open(EMBEDDINGS_CACHE_FILE, 'rb') as f:
        embeddings_array = pickle.load(f)
    print(f"✅ Loaded {len(embeddings_array)} cached embeddings\n")
else:
    print("🚀 Generating embeddings using NVIDIA NIM...\n")
    
    # Batch processing configuration
    BATCH_SIZE = 32  # Process multiple texts per API call
    MAX_WORKERS = 8  # Number of parallel workers
    
    # Prepare batches
    texts = df['text'].tolist()
    n_samples = len(texts)
    batches = []
    
    for i in range(0, n_samples, BATCH_SIZE):
        batch_texts = texts[i:i+BATCH_SIZE]
        batch_indices = list(range(i, min(i+BATCH_SIZE, n_samples)))
        batches.append((batch_texts, batch_indices))
    
    print(f"📊 Processing {n_samples} texts in {len(batches)} batches (batch size: {BATCH_SIZE})")
    print(f"   Using {MAX_WORKERS} parallel workers\n")
    
    # Process batches in parallel
    embeddings_dict = {}
    
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        # Submit all batches
        future_to_batch = {
            executor.submit(process_batch, batch_texts, batch_indices): i 
            for i, (batch_texts, batch_indices) in enumerate(batches)
        }
        
        # Process results with progress bar
        with tqdm(total=len(batches), desc="Generating embeddings") as pbar:
            for future in as_completed(future_to_batch):
                batch_idx = future_to_batch[future]
                try:
                    results = future.result()
                    for idx, embedding in results:
                        embeddings_dict[idx] = embedding
                except Exception as e:
                    print(f"\n❌ Batch {batch_idx} failed: {e}")
                pbar.update(1)
    
    # Convert to a numpy array in row order, keeping track of which rows succeeded
    embedded_indices = sorted(embeddings_dict)
    embeddings_array = np.array([embeddings_dict[i] for i in embedded_indices])
    
    print(f"\n✅ Generated embeddings for {len(embeddings_array)}/{n_samples} samples")
    print(f"   Embedding shape: {embeddings_array.shape}")
    
    if len(embedded_indices) < n_samples:
        # Failed batches can fall anywhere, so select the surviving rows by index
        # instead of trimming the tail; this keeps rows and embeddings aligned
        print(f"\n⚠️  Only {len(embedded_indices)}/{n_samples} samples have embeddings")
        df = df.iloc[embedded_indices].reset_index(drop=True)
        print(f"   Trimmed dataframe to {len(df)} samples")
    else:
        # Cache only complete results so a cached array always covers every row
        print(f"\n💾 Saving embeddings to {EMBEDDINGS_CACHE_FILE}...")
        with open(EMBEDDINGS_CACHE_FILE, 'wb') as f:
            pickle.dump(embeddings_array, f)
        print("✅ Embeddings cached successfully!")

# Add embeddings to dataframe; the lengths must agree at this point
if len(embeddings_array) != len(df):
    raise ValueError(
        f"{len(embeddings_array)} embeddings for {len(df)} rows; "
        f"delete {EMBEDDINGS_CACHE_FILE} and regenerate"
    )
df['embedding'] = list(embeddings_array)
print(f"\n✅ All {len(df)} samples have embeddings!")


## 🗺️ UMAP Dimensionality Reduction

Apply UMAP to reduce the high-dimensional embeddings to 2D for visualization.
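
The UMAP call in this notebook sets `metric='cosine'`, which compares embeddings by direction rather than magnitude. As a rough illustration of what that metric measures, here is a standard-library sketch of cosine distance (one minus cosine similarity); `cosine_distance` is a hypothetical helper, not the routine UMAP runs internally.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b); both vectors must be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])   # scaling does not matter
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

Two vectors that differ only in length are at distance zero, so the projection groups texts by the direction of their embeddings, not by embedding norm.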


In [None]:
print("🗺️ Applying UMAP dimensionality reduction...\n")

# Prepare embedding matrix
embedding_matrix = np.array(df['embedding'].tolist())
print(f"Embedding matrix shape: {embedding_matrix.shape}")

# Apply UMAP
reducer = umap.UMAP(
    n_components=2,
    n_neighbors=15,
    min_dist=0.1,
    metric='cosine',
    random_state=42,
    verbose=True
)

print("\nFitting UMAP...")
umap_embeddings = reducer.fit_transform(embedding_matrix)

# Add UMAP coordinates to dataframe
df['umap_x'] = umap_embeddings[:, 0]
df['umap_y'] = umap_embeddings[:, 1]

print(f"\n✅ UMAP reduction complete!")
print(f"   2D coordinates shape: {umap_embeddings.shape}")
print(f"   X range: [{umap_embeddings[:, 0].min():.2f}, {umap_embeddings[:, 0].max():.2f}]")
print(f"   Y range: [{umap_embeddings[:, 1].min():.2f}, {umap_embeddings[:, 1].max():.2f}]")


In [None]:
# Map categories to numerical scores (0-4) for color mapping
CATEGORY_TO_SCORE = {
    'chat': 0,
    'code': 1,
    'math': 2,
    'stem': 3,
    'tool_calling': 4,
    # Multilingual categories mapped to existing scores
    'multilingual_ja': 2,
    'multilingual_de': 3,
    'multilingual_it': 1,
    'multilingual_es': 0,
    'multilingual_fr': 4
}

# Color mapping for scores 0-4 (matching matplotlib style)
SCORE_COLORS = ["red", "darkorange", "gold", "turquoise", "darkgreen"]

# Add Score column based on category
df['Score'] = df['category'].map(CATEGORY_TO_SCORE)

print("Score mapping:")
for cat, score in sorted(CATEGORY_TO_SCORE.items(), key=lambda x: x[1]):
    color = SCORE_COLORS[score]
    count = len(df[df['category'] == cat])
    print(f"  Score {score} ({color:12s}): {cat:20s} - {count:,} samples")


## 📊 Visualizations

Create multiple visualizations of the UMAP 2D projection, colored by different attributes from the dataset headers.


In [None]:
# Visualization 1: Score-based coloring with centroids (matching matplotlib style)
print("📊 Creating interactive visualizations with Plotly...\n")

# Extract coordinates
x = df['umap_x'].values
y = df['umap_y'].values
color_indices = df['Score'].values

# Create discrete color map for scores
score_color_map = {i: SCORE_COLORS[i] for i in range(5)}

# Create figure
fig = go.Figure()

# Add scatter plot for each score
for score in range(5):
    mask = df['Score'] == score
    if mask.sum() > 0:
        fig.add_trace(go.Scatter(
            x=df[mask]['umap_x'],
            y=df[mask]['umap_y'],
            mode='markers',
            marker=dict(
                color=SCORE_COLORS[score],
                size=8,
                opacity=0.3,
                line=dict(width=0.5, color='white')
            ),
            name=f'Score {score}',
            customdata=df[mask][['category', 'version', 'split', 'reasoning']].values,
            hovertemplate='<b>Score %{text}</b><br>' +
                         'Category: %{customdata[0]}<br>' +
                         'Version: %{customdata[1]}<br>' +
                         'Split: %{customdata[2]}<br>' +
                         'Reasoning: %{customdata[3]}<br>' +
                         'X: %{x:.2f}<br>' +
                         'Y: %{y:.2f}<extra></extra>',
            text=[score] * mask.sum()
        ))

# Calculate and add centroids for each score
for score in range(5):
    mask = df['Score'] == score
    if mask.sum() > 0:
        avg_x = df[mask]['umap_x'].mean()
        avg_y = df[mask]['umap_y'].mean()
        
        fig.add_trace(go.Scatter(
            x=[avg_x],
            y=[avg_y],
            mode='markers',
            marker=dict(
                symbol='x',
                size=15,
                color=SCORE_COLORS[score],
                line=dict(width=3)
            ),
            name=f'Score {score} centroid',
            showlegend=False,
            hovertemplate=f'<b>Score {score} Centroid</b><br>' +
                         f'X: {avg_x:.2f}<br>' +
                         f'Y: {avg_y:.2f}<extra></extra>'
        ))

fig.update_layout(
    title='UMAP Visualization - Colored by Score',
    title_font_size=20,
    title_x=0.5,
    xaxis_title='UMAP Dimension 1',
    yaxis_title='UMAP Dimension 2',
    width=1200,
    height=800,
    template='plotly_white',
    legend=dict(
        title='Score',
        yanchor="top",
        y=0.99,
        xanchor="left",
        x=1.01,
        bgcolor="rgba(255, 255, 255, 0.9)",
        bordercolor="gray",
        borderwidth=1
    ),
    hovermode='closest'
)

fig.write_html('umap_by_score.html')
print("✅ Saved: umap_by_score.html")
fig.show()


In [None]:
# Visualization 2: Interactive scatter plot colored by Version (v1 vs v2)
fig = px.scatter(
    df,
    x='umap_x',
    y='umap_y',
    color='version',
    hover_data=['version', 'split', 'reasoning', 'category'],
    title='UMAP Visualization - Colored by Dataset Version (v1 vs v2)',
    labels={'umap_x': 'UMAP Dimension 1', 'umap_y': 'UMAP Dimension 2'},
    width=800,
    height=600,
    template='plotly_white',
    color_discrete_map={'v1': '#FF6B6B', 'v2': '#4ECDC4'}
)

fig.update_traces(
    marker=dict(size=8, opacity=0.7, line=dict(width=0.5, color='white')),
)

fig.update_layout(
    title_font_size=20,
    title_x=0.5,
    legend=dict(
        title='Version',
        yanchor="top",
        y=0.99,
        xanchor="left",
        x=1.01,
        bgcolor="rgba(255, 255, 255, 0.9)",
        bordercolor="gray",
        borderwidth=1
    ),
    hovermode='closest'
)

fig.write_html('umap_by_version.html')
print("✅ Saved: umap_by_version.html")
fig.show()


In [None]:
# Visualization 3: Interactive scatter plot colored by Reasoning
fig = px.scatter(
    df,
    x='umap_x',
    y='umap_y',
    color='reasoning',
    hover_data=['version', 'split', 'reasoning', 'category'],
    title='UMAP Visualization - Colored by Reasoning',
    labels={'umap_x': 'UMAP Dimension 1', 'umap_y': 'UMAP Dimension 2'},
    width=800,
    height=600,
    template='plotly_white'
)

fig.update_traces(
    marker=dict(size=8, opacity=0.7, line=dict(width=0.5, color='white')),
)

fig.update_layout(
    title_font_size=20,
    title_x=0.5,
    legend=dict(
        title='Reasoning',
        yanchor="top",
        y=0.99,
        xanchor="left",
        x=1.01,
        bgcolor="rgba(255, 255, 255, 0.9)",
        bordercolor="gray",
        borderwidth=1
    ),
    hovermode='closest'
)

fig.write_html('umap_by_reasoning.html')
print("✅ Saved: umap_by_reasoning.html")
fig.show()


In [None]:
# Visualization 4: Side-by-side comparison with facets (v1 vs v2)
fig = px.scatter(
    df,
    x='umap_x',
    y='umap_y',
    color='category',
    facet_col='version',
    hover_data=['version', 'split', 'reasoning', 'category'],
    title='UMAP Visualization - v1 vs v2 Comparison (Faceted by Version)',
    labels={'umap_x': 'UMAP Dimension 1', 'umap_y': 'UMAP Dimension 2'},
    width=800,
    height=600,
    template='plotly_white'
)

fig.update_traces(
    marker=dict(size=7, opacity=0.7, line=dict(width=0.5, color='white')),
)

fig.update_layout(
    title_font_size=20,
    title_x=0.5,
    legend=dict(
        title='Category',
        yanchor="top",
        y=0.99,
        xanchor="left",
        x=1.01,
        bgcolor="rgba(255, 255, 255, 0.9)",
        bordercolor="gray",
        borderwidth=1
    ),
    hovermode='closest'
)

fig.for_each_annotation(lambda a: a.update(text=a.text.replace("version=", "Nemotron ")))

fig.write_html('umap_v1_vs_v2_comparison.html')
print("✅ Saved: umap_v1_vs_v2_comparison.html")
fig.show()


In [None]:
# Visualization 5: 3D scatter plot with all attributes
# Create a combined categorical label for better visualization
df['combined_label'] = df['version'] + ' - ' + df['category']

fig = px.scatter_3d(
    df,
    x='umap_x',
    y='umap_y',
    z=df.groupby('category').ngroup(),  # Use category as third dimension
    color='category',
    symbol='version',
    hover_data=['version', 'split', 'reasoning', 'category'],
    title='UMAP Visualization - 3D View with Category Grouping',
    labels={
        'umap_x': 'UMAP Dimension 1', 
        'umap_y': 'UMAP Dimension 2',
        'z': 'Category Group'
    },
    width=800,
    height=600,
    template='plotly_white'
)

fig.update_traces(
    marker=dict(size=5, opacity=0.7, line=dict(width=0.3, color='white')),
)

fig.update_layout(
    title_font_size=20,
    title_x=0.5,
    scene=dict(
        xaxis_title='UMAP Dimension 1',
        yaxis_title='UMAP Dimension 2',
        zaxis_title='Category Group',
        camera=dict(
            eye=dict(x=1.5, y=1.5, z=1.3)
        )
    ),
    legend=dict(
        yanchor="top",
        y=0.99,
        xanchor="left",
        x=0.01,
        bgcolor="rgba(255, 255, 255, 0.9)",
        bordercolor="gray",
        borderwidth=1
    )
)

fig.write_html('umap_3d_visualization.html')
print("✅ Saved: umap_3d_visualization.html")
fig.show()


## 📈 Summary Statistics


In [None]:
# Display summary statistics
print("=" * 80)
print("📊 VISUALIZATION SUMMARY")
print("=" * 80)

print(f"\n✅ Total samples visualized: {len(df):,}")
print(f"\n📁 Saved files:")
print("   • umap_by_score.html - Interactive plot colored by score (category)")
print("   • umap_by_version.html - Interactive plot showing v1 vs v2")
print("   • umap_by_reasoning.html - Interactive plot colored by reasoning")
print("   • umap_v1_vs_v2_comparison.html - Side-by-side comparison")
print("   • umap_3d_visualization.html - 3D interactive visualization")

print(f"\n📊 Dataset Distribution:")
print(f"\nBy Version:")
print(df['version'].value_counts())
print(f"\nBy Category:")
print(df['category'].value_counts())
print(f"\nBy Reasoning:")
print(df['reasoning'].value_counts())
print(f"\nBy Split:")
print(df['split'].value_counts())

print("\n" + "=" * 80)
print("✅ Visualization complete! Open the HTML files in a browser for interactive exploration.")
print("=" * 80)
