# Notebook 07: DGX Spark Performance Monitoring

This notebook provides comprehensive performance monitoring for NVIDIA DGX Spark systems.

## Features

- **SSH Connection Management**: Robust SSH setup with automatic reconnection
- **VPN Monitoring**: Tailscale and VPN connection health checks
- **GPU Metrics**: Real-time GPU utilization, memory, temperature, and power
- **System Metrics**: CPU, RAM, disk I/O, and system load
- **Network Metrics**: Bandwidth, latency, and interface statistics
- **Real-Time Dashboard**: Live updating visualizations
- **Historical Analysis**: Load and analyze past performance data
- **Experiment Integration**: Monitor during LLM experiment runs

## Usage

1. Configure your DGX Spark connection details in Section 1
2. Establish SSH connection in Section 2
3. Check VPN status in Section 3
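4. Collect GPU, system, and network metrics in Sections 4-6
5. View real-time dashboard in Section 7
6. Export data for analysis in Section 8
7. Monitor during LLM experiment runs in Section 9
8. Load and analyze past data in Section 10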


## Section 1: Setup and Configuration

First, let's import all necessary libraries and configure connection settings.


In [None]:
# Import necessary modules
import sys
from pathlib import Path
import time
import json
from datetime import datetime
from typing import Optional, Dict, List, Any

# Data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
import matplotlib.dates as mdates

# Add the src directory to Python path
project_root = Path().resolve().parent
sys.path.insert(0, str(project_root))

# Import our monitoring module
from src.dgx_monitoring import (
    SSHConnection,
    check_vpn_status,
    get_gpu_metrics_nvidia_smi,
    get_gpu_metrics_nvml,
    get_system_metrics,
    get_network_metrics,
    collect_all_metrics,
    export_metrics_to_json,
    export_metrics_to_csv
)

print("✅ All modules imported successfully!")


### Configuration

Configure your DGX Spark connection details here. Update these values to match your setup.
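
If you would rather not hard-code credentials in the notebook, the same settings could be read from environment variables. The sketch below is one possible approach, not part of `src.dgx_monitoring`: the `DGXConnectionConfig` class, the `load_config_from_env` helper, and the `DGX_*` environment variable names are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DGXConnectionConfig:
    """Hypothetical container mirroring the DGX_* settings used in this notebook."""
    hostname: str
    username: str
    port: int = 22
    key_path: Optional[str] = None


def load_config_from_env(env: Mapping[str, str] = os.environ) -> DGXConnectionConfig:
    """Build a config from (assumed) DGX_* environment variables, validating as we go."""
    hostname = env.get("DGX_HOSTNAME", "")
    username = env.get("DGX_USERNAME", "")
    if not hostname or not username:
        raise ValueError("DGX_HOSTNAME and DGX_USERNAME must both be set")

    port = int(env.get("DGX_SSH_PORT", "22"))
    if not 1 <= port <= 65535:
        raise ValueError(f"DGX_SSH_PORT out of range: {port}")

    # Expand "~" so a value like "~/.ssh/id_rsa" resolves to an absolute path
    raw_key_path = env.get("DGX_SSH_KEY_PATH")
    key_path = os.path.expanduser(raw_key_path) if raw_key_path else None

    return DGXConnectionConfig(hostname, username, port, key_path)
```

The resulting fields could then be assigned to `DGX_HOSTNAME`, `DGX_USERNAME`, `DGX_SSH_PORT`, and `DGX_SSH_KEY_PATH` in the configuration cell.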


In [None]:
# ============================================================================
# DGX SPARK CONNECTION CONFIGURATION
# ============================================================================
# Update these values to match your DGX Spark setup

# SSH Connection Settings
DGX_HOSTNAME = "spark-xxxx.local"  # Replace with your DGX Spark hostname or IP
DGX_USERNAME = "your_username"     # Replace with your SSH username
DGX_SSH_PORT = 22                  # Default SSH port (usually 22)
DGX_SSH_KEY_PATH = None            # Path to SSH private key (optional, e.g., "~/.ssh/id_rsa")
DGX_SSH_PASSWORD = None            # SSH password (optional, less secure than key)

# Monitoring Settings
MONITORING_INTERVAL_SECONDS = 5    # How often to collect metrics (in seconds)
USE_NVML_FOR_GPU = False           # Use NVML library (local only) vs nvidia-smi
ENABLE_REAL_TIME_DASHBOARD = True  # Enable live updating dashboard

# Alert Thresholds (for visual indicators)
GPU_TEMP_WARNING = 80              # GPU temperature warning threshold (°C)
GPU_TEMP_CRITICAL = 90             # GPU temperature critical threshold (°C)
MEMORY_WARNING_PCT = 85            # Memory usage warning threshold (%)
DISK_WARNING_PCT = 85              # Disk usage warning threshold (%)

# Data Storage
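# Anchor to project_root (defined in the import cell) so files land in
# <project>/notebooks/results rather than a nested notebooks/notebooks/ path
METRICS_SAVE_DIR = project_root / "notebooks" / "results" / "dgx_monitoring"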
METRICS_SAVE_DIR.mkdir(parents=True, exist_ok=True)

print("✅ Configuration loaded!")
print(f"   Hostname: {DGX_HOSTNAME}")
print(f"   Username: {DGX_USERNAME}")
print(f"   Monitoring interval: {MONITORING_INTERVAL_SECONDS}s")
print(f"   Metrics save directory: {METRICS_SAVE_DIR}")


## Section 2: SSH Connection Management

Establish and manage SSH connection to DGX Spark with automatic reconnection.
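
The reconnection logic itself lives inside `SSHConnection` and is not shown in this notebook. As a rough illustration of the idea, the sketch below retries a connect callable with exponential backoff. `connect_with_retry` and its parameters are hypothetical names introduced here, not part of `src.dgx_monitoring`.

```python
import time
from typing import Callable


def connect_with_retry(
    connect: Callable[[], bool],
    max_attempts: int = 4,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call connect() until it returns True, doubling the wait after each failure.

    Returns True on the first successful attempt, or False once max_attempts
    attempts have all failed. The sleep function is injectable for testing.
    """
    for attempt in range(max_attempts):
        if connect():
            return True
        # Back off before the next attempt, but do not sleep after the last one
        if attempt < max_attempts - 1:
            sleep(base_delay_s * (2 ** attempt))
    return False
```

With the connection object from this section, a call might look like `connect_with_retry(lambda: ssh_connection.connect(timeout=10))`.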


In [None]:
# Initialize SSH connection object
# This creates the connection object but doesn't connect yet
ssh_connection = SSHConnection(
    hostname=DGX_HOSTNAME,
    username=DGX_USERNAME,
    key_path=DGX_SSH_KEY_PATH,
    password=DGX_SSH_PASSWORD,
    port=DGX_SSH_PORT
)

print(f"SSH connection object created for {DGX_USERNAME}@{DGX_HOSTNAME}:{DGX_SSH_PORT}")
print("Ready to connect...")


In [None]:
# Establish SSH connection
# This will attempt to connect to the DGX Spark system
print(f"Attempting to connect to {DGX_HOSTNAME}...")

connection_successful = ssh_connection.connect(timeout=10)

if connection_successful:
    print("✅ SSH connection established successfully!")

    # Test the connection by running a simple command
    success, stdout, stderr = ssh_connection.execute_command("hostname && uptime")
    if success:
        print("\n📊 System Information:")
        print(stdout)
    else:
        print(f"⚠️  Connection test failed: {stderr}")
else:
    print("❌ SSH connection failed!")
    print("\nTroubleshooting tips:")
    print("  1. Verify hostname/IP is correct")
    print("  2. Check if VPN is connected (if required)")
    print("  3. Verify SSH key path or password is correct")
    print("  4. Check firewall/network settings")
    print("  5. Try connecting manually: ssh {}@{}".format(DGX_USERNAME, DGX_HOSTNAME))


In [None]:
# Connection health check function
# This function checks if the SSH connection is still alive and can be used periodically

def check_ssh_health(ssh: SSHConnection) -> Dict[str, Any]:
    """
    Check SSH connection health.
    
    Returns dictionary with connection status and latency information.
    """
    health = {
        "connected": False,
        "latency_ms": None,
        "uptime_seconds": None,
        "error": None
    }
    
    if not ssh:
        health["error"] = "SSH connection object not initialized"
        return health
    
    # Check if connected
    health["connected"] = ssh.is_connected()
    
    if health["connected"]:
        # Measure latency by running a simple command
        start_time = time.time()
        success, stdout, stderr = ssh.execute_command("echo 'ping'", timeout=5)
        end_time = time.time()
        
        if success:
            health["latency_ms"] = (end_time - start_time) * 1000
            
            # Get system uptime
            success, stdout, stderr = ssh.execute_command("cat /proc/uptime", timeout=5)
            if success:
                try:
                    uptime_seconds = float(stdout.split()[0])
                    health["uptime_seconds"] = uptime_seconds
                except (ValueError, IndexError):
                    # Unexpected /proc/uptime format; leave uptime_seconds as None
                    pass
        else:
            health["error"] = stderr
            health["connected"] = False
    
    return health

# Test connection health
if ssh_connection.is_connected():
    health_status = check_ssh_health(ssh_connection)
    print("📡 SSH Connection Health:")
    print(f"   Status: {'✅ Connected' if health_status['connected'] else '❌ Disconnected'}")
    if health_status['latency_ms'] is not None:
        print(f"   Latency: {health_status['latency_ms']:.2f} ms")
    if health_status['uptime_seconds'] is not None:
        uptime_days = health_status['uptime_seconds'] / 86400
        print(f"   System Uptime: {uptime_days:.1f} days")
else:
    print("⚠️  SSH connection not established. Run the connection cell above first.")


## Section 3: VPN Connection Monitoring

Check VPN connection status (Tailscale or other VPN services).
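
How `check_vpn_status` detects Tailscale is not shown here. One plausible approach is to run `tailscale status --json` and inspect the `BackendState` field of its output. The sketch below only parses such output; `parse_tailscale_status` is a hypothetical helper, and the returned keys simply mirror the ones this notebook reads from `vpn_status`.

```python
import json
from typing import Any, Dict


def parse_tailscale_status(status_json: str) -> Dict[str, Any]:
    """Interpret the text printed by `tailscale status --json`.

    Assumption: the VPN counts as connected when BackendState is "Running".
    """
    result: Dict[str, Any] = {
        "vpn_available": True,
        "vpn_type": "tailscale",
        "connected": False,
        "status_message": "",
    }
    try:
        data = json.loads(status_json)
    except json.JSONDecodeError as exc:
        result["status_message"] = f"Could not parse status output: {exc}"
        return result

    if not isinstance(data, dict):
        result["status_message"] = "Unexpected status output (not a JSON object)"
        return result

    state = data.get("BackendState", "Unknown")
    result["connected"] = state == "Running"
    result["status_message"] = f"Tailscale backend state: {state}"
    return result
```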


In [None]:
# Check VPN status
# This will check for Tailscale or other VPN services
# If SSH is connected, it checks on the remote host; otherwise checks locally

vpn_status = check_vpn_status(ssh_connection if ssh_connection.is_connected() else None)

print("🔒 VPN Connection Status:")
print(f"   VPN Available: {'✅ Yes' if vpn_status['vpn_available'] else '❌ No'}")
if vpn_status['vpn_type']:
    print(f"   VPN Type: {vpn_status['vpn_type']}")
print(f"   Connected: {'✅ Yes' if vpn_status['connected'] else '❌ No'}")
print(f"   Status: {vpn_status['status_message']}")

# Display VPN status indicator
if vpn_status['connected']:
    print("\n✅ VPN is connected - you should be able to access DGX Spark")
elif vpn_status['vpn_available']:
    print("\n⚠️  VPN is installed but not connected")
    print("   You may need to connect to VPN before establishing SSH connection")
else:
    print("\nℹ️  VPN not detected or not required for your setup")


## Section 4: GPU Performance Metrics

Monitor GPU utilization, memory, temperature, and power consumption.
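
The internals of `get_gpu_metrics_nvidia_smi` are not shown in this notebook. As a minimal sketch of how such a function might work, the code below parses the CSV that `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` prints. `parse_nvidia_smi_csv` is a hypothetical helper. The dictionary keys are chosen to match the ones the next cell reads, and values reported as `[N/A]` are mapped to `None`.

```python
from typing import Dict, List, Optional, Union

# Query fields passed to nvidia-smi; the order must match the parsing below
QUERY_FIELDS = (
    "index,name,utilization.gpu,memory.used,memory.total,"
    "temperature.gpu,power.draw,power.limit"
)
NVIDIA_SMI_COMMAND = f"nvidia-smi --query-gpu={QUERY_FIELDS} --format=csv,noheader,nounits"

GpuRecord = Dict[str, Union[int, str, float, None]]


def _to_float(value: str) -> Optional[float]:
    """Convert a CSV field to float, returning None for values such as '[N/A]'."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_nvidia_smi_csv(output: str) -> List[GpuRecord]:
    """Parse one CSV line per GPU into dictionaries."""
    gpus: List[GpuRecord] = []
    for line in output.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 8:
            continue  # skip malformed lines rather than guessing
        used, total = _to_float(parts[3]), _to_float(parts[4])
        used_pct = 100.0 * used / total if used is not None and total else None
        gpus.append({
            "index": int(parts[0]),
            "name": parts[1],
            "utilization_gpu": _to_float(parts[2]),
            "memory_used_mb": used,
            "memory_total_mb": total,
            "memory_used_pct": used_pct,
            "temperature_c": _to_float(parts[5]),
            "power_draw_w": _to_float(parts[6]),
            "power_limit_w": _to_float(parts[7]),
        })
    return gpus
```

Because a field can come back as `None`, display code that applies numeric format specifiers such as `:.1f` would need to check for `None` first.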


In [None]:
# Get GPU metrics
# This function collects GPU information using nvidia-smi (works both locally and remotely)
# For local monitoring with NVML, set USE_NVML_FOR_GPU = True in configuration

ssh = ssh_connection if ssh_connection.is_connected() else None

if USE_NVML_FOR_GPU and not ssh:
    # Use NVML for local monitoring (more efficient)
    gpu_metrics = get_gpu_metrics_nvml(ssh)
    method = "NVML"
else:
    # Use nvidia-smi (works both locally and remotely)
    gpu_metrics = get_gpu_metrics_nvidia_smi(ssh)
    method = "nvidia-smi"

print(f"📊 GPU Metrics (via {method}):")
print(f"   Found {len(gpu_metrics)} GPU(s)\n")

if gpu_metrics:
    for gpu in gpu_metrics:
        print(f"GPU {gpu['index']}: {gpu['name']}")
        print(f"   Utilization: {gpu['utilization_gpu']:.1f}%")
        print(f"   Memory: {gpu['memory_used_mb']:.0f} MB / {gpu['memory_total_mb']:.0f} MB ({gpu['memory_used_pct']:.1f}%)")
        print(f"   Temperature: {gpu['temperature_c']:.1f}°C")
        print(f"   Power: {gpu['power_draw_w']:.1f}W / {gpu['power_limit_w']:.1f}W")

        # Add warning indicators
        if gpu['temperature_c'] >= GPU_TEMP_CRITICAL:
            print(f"   ⚠️  CRITICAL: Temperature is very high!")
        elif gpu['temperature_c'] >= GPU_TEMP_WARNING:
            print(f"   ⚠️  WARNING: Temperature is high")

        print()
else:
    print("❌ No GPU metrics collected. Possible issues:")
    print("   - No GPUs available")
    print("   - nvidia-smi not available on target system")
    print("   - Permission issues")


## Section 5: System Metrics (CPU, RAM, Disk)

Monitor CPU utilization, memory usage, disk I/O, and system load.
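
For remote hosts, memory figures can be derived from the text of `/proc/meminfo` (for example via `ssh.execute_command("cat /proc/meminfo")`). The sketch below shows one way to do that. `parse_meminfo` is a hypothetical helper, and treating "used" as `MemTotal - MemAvailable` is an assumption about how the percentage should be defined, not necessarily what `get_system_metrics` does.

```python
from typing import Dict


def parse_meminfo(meminfo_text: str) -> Dict[str, float]:
    """Derive memory totals (GB) and usage (%) from /proc/meminfo text.

    /proc/meminfo reports sizes in kB, one "Key:   value kB" pair per line.
    """
    values_kb: Dict[str, int] = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values_kb[key.strip()] = int(fields[0])

    total_kb = values_kb["MemTotal"]
    available_kb = values_kb["MemAvailable"]
    used_kb = total_kb - available_kb  # assumed definition of "used"
    kb_per_gb = 1024 ** 2

    return {
        "memory_total_gb": total_kb / kb_per_gb,
        "memory_available_gb": available_kb / kb_per_gb,
        "memory_used_gb": used_kb / kb_per_gb,
        "memory_percent": 100.0 * used_kb / total_kb,
    }
```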


In [None]:
# Get system metrics
# This collects CPU, RAM, disk, and load information
# Note: For remote monitoring via SSH, some metrics may be limited

system_metrics = get_system_metrics(ssh)

print("💻 System Metrics:")

if system_metrics:
    # CPU Information
    if 'cpu_percent' in system_metrics:
        print(f"\n🖥️  CPU:")
        print(f"   Overall Utilization: {system_metrics['cpu_percent']:.1f}%")
        print(f"   CPU Cores: {system_metrics.get('cpu_count', 'N/A')}")
        if 'cpu_per_core' in system_metrics:
            per_core = system_metrics['cpu_per_core']
            print(f"   Per-Core Utilization: {', '.join([f'{c:.1f}%' for c in per_core])}")
        if 'load_avg_1min' in system_metrics:
            print(f"   Load Average (1min): {system_metrics['load_avg_1min']:.2f}")
    
    # Memory Information
    if 'memory_total_gb' in system_metrics:
        print(f"\n🧠 Memory:")
        print(f"   Total: {system_metrics['memory_total_gb']:.2f} GB")
        print(f"   Used: {system_metrics['memory_used_gb']:.2f} GB ({system_metrics['memory_percent']:.1f}%)")
        print(f"   Available: {system_metrics.get('memory_available_gb', 0):.2f} GB")
        
        if system_metrics['memory_percent'] >= MEMORY_WARNING_PCT:
            print(f"   ⚠️  WARNING: Memory usage is high!")
        
        if 'swap_total_gb' in system_metrics and system_metrics['swap_total_gb'] > 0:
            print(f"   Swap: {system_metrics['swap_used_gb']:.2f} GB / {system_metrics['swap_total_gb']:.2f} GB ({system_metrics['swap_percent']:.1f}%)")
    
    # Disk Information
    if 'disk_usage' in system_metrics:
        print(f"\n💾 Disk Usage:")
        for mountpoint, usage in system_metrics['disk_usage'].items():
            print(f"   {mountpoint}:")
            print(f"      Used: {usage['used_gb']:.2f} GB / {usage['total_gb']:.2f} GB ({usage['percent']:.1f}%)")
            print(f"      Free: {usage['free_gb']:.2f} GB")
            
            if usage['percent'] >= DISK_WARNING_PCT:
                print(f"      ⚠️  WARNING: Disk usage is high!")
        
        if 'disk_read_mb' in system_metrics:
            print(f"\n   Disk I/O:")
            print(f"      Read: {system_metrics['disk_read_mb']:.2f} MB")
            print(f"      Write: {system_metrics['disk_write_mb']:.2f} MB")
else:
    print("❌ Could not collect system metrics")
    print("   This may be normal if monitoring remotely via SSH")


## Section 6: Network Performance

Monitor network bandwidth, latency, and interface statistics.
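
The byte counters reported below are cumulative, so bandwidth has to be derived from the difference between two samples. A minimal sketch of that calculation follows. `throughput_mbps` is a hypothetical helper, and the counter-reset handling is an assumption about how a reset (for example after a reboot) should be treated.

```python
from typing import Tuple


def throughput_mbps(prev: Tuple[float, int], curr: Tuple[float, int]) -> float:
    """Average throughput in megabits per second between two counter samples.

    Each sample is a (timestamp_seconds, cumulative_bytes) pair.
    """
    (t0, bytes0), (t1, bytes1) = prev, curr
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")

    delta_bytes = bytes1 - bytes0
    if delta_bytes < 0:
        # Counter went backwards (assumed reset): count only bytes since the reset
        delta_bytes = bytes1

    return delta_bytes * 8 / elapsed / 1_000_000
```

For example, 6,250,000 bytes transferred over 5 seconds works out to 10 Mbit/s.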


In [None]:
# Get network metrics
network_metrics = get_network_metrics(ssh)

print("🌐 Network Metrics:")

if network_metrics:
    if 'bytes_sent_mb' in network_metrics:
        print(f"\n📤 Network I/O:")
        print(f"   Bytes Sent: {network_metrics['bytes_sent_mb']:.2f} MB")
        print(f"   Bytes Received: {network_metrics['bytes_recv_mb']:.2f} MB")
        print(f"   Packets Sent: {network_metrics.get('packets_sent', 0):,}")
        print(f"   Packets Received: {network_metrics.get('packets_recv', 0):,}")
    
    if 'connections' in network_metrics:
        print(f"   Active Connections: {network_metrics['connections']}")
    
    if 'interfaces' in network_metrics:
        print(f"\n🔌 Network Interfaces:")
        for interface, stats in network_metrics['interfaces'].items():
            status = "✅ UP" if stats.get('isup', False) else "❌ DOWN"
            print(f"   {interface}: {status}")
            if stats.get('speed_mbps', 0) > 0:
                print(f"      Speed: {stats['speed_mbps']} Mbps")
            if stats.get('mtu'):
                print(f"      MTU: {stats['mtu']}")
else:
    print("❌ Could not collect network metrics")


## Section 7: Real-Time Monitoring Dashboard

Live-updating dashboard showing GPU and system metrics in real time.
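
The dashboard in this section plots only the most recent 50 samples, slicing them out of lists that keep growing. If only the plotting window is needed, a bounded buffer does the trimming automatically. The sketch below uses `collections.deque` for this; `RollingWindow` is a hypothetical class, not something the notebook defines.

```python
from collections import deque
from datetime import datetime
from typing import Deque, List, Tuple


class RollingWindow:
    """Keep only the most recent max_samples (timestamp, value) pairs."""

    def __init__(self, max_samples: int = 50) -> None:
        # A deque with maxlen silently drops the oldest item when full
        self._points: Deque[Tuple[datetime, float]] = deque(maxlen=max_samples)

    def append(self, timestamp: datetime, value: float) -> None:
        self._points.append((timestamp, value))

    def series(self) -> Tuple[List[datetime], List[float]]:
        """Return parallel lists of timestamps and values, oldest first."""
        times = [t for t, _ in self._points]
        values = [v for _, v in self._points]
        return times, values

    def __len__(self) -> int:
        return len(self._points)
```

The trade-off is that older samples are discarded, so a full history would still need to be kept separately for the exports in Section 8.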


In [None]:
# Initialize data storage for real-time monitoring
# This will store metrics over time for plotting

monitoring_data = {
    "timestamps": [],
    "gpu_data": [],  # List of GPU metric dictionaries
    "system_data": [],
    "network_data": []
}

print("✅ Monitoring data storage initialized")
print(f"   Ready to collect metrics every {MONITORING_INTERVAL_SECONDS} seconds")


In [None]:
# Function to collect one sample of all metrics
def collect_metrics_sample():
    """
    Collect one sample of all metrics and store in monitoring_data.
    """
    ssh = ssh_connection if ssh_connection.is_connected() else None
    
    # Collect all metrics at once
    all_metrics = collect_all_metrics(ssh, use_nvml=USE_NVML_FOR_GPU)
    
    # Store in our data structure
    timestamp = datetime.now()
    monitoring_data["timestamps"].append(timestamp)
    monitoring_data["gpu_data"].append(all_metrics["gpu_metrics"])
    monitoring_data["system_data"].append(all_metrics["system_metrics"])
    monitoring_data["network_data"].append(all_metrics["network_metrics"])
    
    return all_metrics

# Test collection
print("Collecting initial metrics sample...")
sample = collect_metrics_sample()
print(f"✅ Collected metrics at {sample['timestamp_readable']}")
print(f"   GPUs: {len(sample['gpu_metrics'])}")
print(f"   SSH Connected: {sample['ssh_connected']}")
print(f"   VPN Connected: {sample['vpn_status']['connected']}")


In [None]:
# Create real-time monitoring dashboard
# This creates live-updating plots showing key metrics over time

if ENABLE_REAL_TIME_DASHBOARD:
    # Set up the figure with subplots
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('DGX Spark Performance Monitoring Dashboard', fontsize=16, fontweight='bold')
    
    # Flatten axes for easier indexing
    ax1, ax2, ax3, ax4 = axes.flatten()
    
    # Plot 1: GPU Utilization
    ax1.set_title('GPU Utilization (%)', fontweight='bold')
    ax1.set_xlabel('Time')
    ax1.set_ylabel('Utilization (%)')
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim(0, 100)
    
    # Plot 2: GPU Memory Usage
    ax2.set_title('GPU Memory Usage (%)', fontweight='bold')
    ax2.set_xlabel('Time')
    ax2.set_ylabel('Memory Usage (%)')
    ax2.grid(True, alpha=0.3)
    ax2.set_ylim(0, 100)
    
    # Plot 3: GPU Temperature
    ax3.set_title('GPU Temperature (°C)', fontweight='bold')
    ax3.set_xlabel('Time')
    ax3.set_ylabel('Temperature (°C)')
    ax3.grid(True, alpha=0.3)
    ax3.axhline(y=GPU_TEMP_WARNING, color='orange', linestyle='--', label=f'Warning ({GPU_TEMP_WARNING}°C)')
    ax3.axhline(y=GPU_TEMP_CRITICAL, color='red', linestyle='--', label=f'Critical ({GPU_TEMP_CRITICAL}°C)')
    ax3.legend()
    
    # Plot 4: System CPU and Memory
    ax4.set_title('System CPU and Memory Usage (%)', fontweight='bold')
    ax4.set_xlabel('Time')
    ax4.set_ylabel('Usage (%)')
    ax4.grid(True, alpha=0.3)
    ax4.set_ylim(0, 100)
    
    plt.tight_layout()
    
    def update_dashboard(frame):
        """Update function for animation."""
        # Collect new metrics
        collect_metrics_sample()
        
        if len(monitoring_data["timestamps"]) == 0:
            return
        
        # Clear and replot
        ax1.clear()
        ax2.clear()
        ax3.clear()
        ax4.clear()
        
        # Get recent data (last 50 samples or all if less)
        n_samples = min(50, len(monitoring_data["timestamps"]))
        recent_times = monitoring_data["timestamps"][-n_samples:]
        recent_gpu = monitoring_data["gpu_data"][-n_samples:]
        recent_system = monitoring_data["system_data"][-n_samples:]
        
        # Plot GPU Utilization (each value stays paired with its own timestamp,
        # even when some samples are empty or report fewer GPUs)
        for gpu_idx in range(max((len(g) for g in recent_gpu), default=0)):
            points = [(t, g[gpu_idx]['utilization_gpu']) for t, g in zip(recent_times, recent_gpu) if len(g) > gpu_idx]
            if points:
                ax1.plot(*zip(*points), label=f'GPU {gpu_idx}', marker='o', markersize=3)
        ax1.set_title('GPU Utilization (%)', fontweight='bold')
        ax1.set_xlabel('Time')
        ax1.set_ylabel('Utilization (%)')
        ax1.grid(True, alpha=0.3)
        ax1.set_ylim(0, 100)
        ax1.legend()
        ax1.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M:%S'))
        plt.setp(ax1.xaxis.get_majorticklabels(), rotation=45)
        
        # Plot GPU Memory (each value stays paired with its own timestamp)
        for gpu_idx in range(max((len(g) for g in recent_gpu), default=0)):
            points = [(t, g[gpu_idx]['memory_used_pct']) for t, g in zip(recent_times, recent_gpu) if len(g) > gpu_idx]
            if points:
                ax2.plot(*zip(*points), label=f'GPU {gpu_idx}', marker='o', markersize=3)
        ax2.set_title('GPU Memory Usage (%)', fontweight='bold')
        ax2.set_xlabel('Time')
        ax2.set_ylabel('Memory Usage (%)')
        ax2.grid(True, alpha=0.3)
        ax2.set_ylim(0, 100)
        ax2.legend()
        ax2.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M:%S'))
        plt.setp(ax2.xaxis.get_majorticklabels(), rotation=45)
        
        # Plot GPU Temperature (each value stays paired with its own timestamp)
        for gpu_idx in range(max((len(g) for g in recent_gpu), default=0)):
            points = [(t, g[gpu_idx]['temperature_c']) for t, g in zip(recent_times, recent_gpu) if len(g) > gpu_idx]
            if points:
                ax3.plot(*zip(*points), label=f'GPU {gpu_idx}', marker='o', markersize=3)
        ax3.set_title('GPU Temperature (°C)', fontweight='bold')
        ax3.set_xlabel('Time')
        ax3.set_ylabel('Temperature (°C)')
        ax3.grid(True, alpha=0.3)
        ax3.axhline(y=GPU_TEMP_WARNING, color='orange', linestyle='--', label=f'Warning ({GPU_TEMP_WARNING}°C)')
        ax3.axhline(y=GPU_TEMP_CRITICAL, color='red', linestyle='--', label=f'Critical ({GPU_TEMP_CRITICAL}°C)')
        ax3.legend()
        ax3.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M:%S'))
        plt.setp(ax3.xaxis.get_majorticklabels(), rotation=45)
        
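        # Plot System CPU and Memory
        # Pair each non-empty sample with its own timestamp so skipped samples
        # do not shift the time axis; avoid reusing the name of the sys module
        sys_points = [(t, s) for t, s in zip(recent_times, recent_system) if s]
        sys_times = [t for t, _ in sys_points]
        cpu_pcts = [s.get('cpu_percent', 0) for _, s in sys_points]
        mem_pcts = [s.get('memory_percent', 0) for _, s in sys_points]
        if cpu_pcts:
            ax4.plot(sys_times, cpu_pcts, label='CPU %', marker='o', markersize=3)
        if mem_pcts:
            ax4.plot(sys_times, mem_pcts, label='Memory %', marker='s', markersize=3)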
        ax4.set_title('System CPU and Memory Usage (%)', fontweight='bold')
        ax4.set_xlabel('Time')
        ax4.set_ylabel('Usage (%)')
        ax4.grid(True, alpha=0.3)
        ax4.set_ylim(0, 100)
        ax4.legend()
        ax4.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M:%S'))
        plt.setp(ax4.xaxis.get_majorticklabels(), rotation=45)
        
        plt.tight_layout()
    
    print("✅ Dashboard created!")
    print("   The dashboard will update automatically when you run the animation cell below.")
else:
    print("ℹ️  Real-time dashboard is disabled in configuration")


In [None]:
# Start real-time monitoring animation
# This will continuously collect metrics and update the dashboard
# Press the stop button in Jupyter to stop monitoring

if ENABLE_REAL_TIME_DASHBOARD and len(monitoring_data["timestamps"]) > 0:
    # Create animation that updates every MONITORING_INTERVAL_SECONDS
    # interval is in milliseconds
    interval_ms = MONITORING_INTERVAL_SECONDS * 1000
    
    print(f"🔄 Starting real-time monitoring...")
    print(f"   Update interval: {MONITORING_INTERVAL_SECONDS} seconds")
    print(f"   Press the stop button (■) to stop monitoring")
    
    # Create animation
    # Note: This will run until stopped manually
    ani = FuncAnimation(fig, update_dashboard, interval=interval_ms, blit=False, cache_frame_data=False)
    
    plt.show()
else:
    print("ℹ️  Dashboard not enabled or no data collected yet")


## Section 8: Data Collection and Storage

Export collected metrics to files for historical analysis.
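
Each collected sample is nested (a list of per-GPU dictionaries plus system and network dictionaries), while CSV is flat. The sketch below shows one possible flattening, producing one row per GPU per sample. It is not the implementation of `export_metrics_to_csv`, which is not shown here; `flatten_sample` and `samples_to_csv` are hypothetical helpers, and nested system values such as `disk_usage` are simply skipped.

```python
import csv
import io
from typing import Any, Dict, List


def flatten_sample(sample: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one nested sample into flat rows, one row per GPU."""
    base: Dict[str, Any] = {"timestamp": sample["timestamp"]}
    for key, value in (sample.get("system_metrics") or {}).items():
        # Keep scalar values only; nested dicts and lists are skipped in this sketch
        if isinstance(value, (int, float, str)):
            base[f"system_{key}"] = value

    rows = []
    for gpu in sample.get("gpu_metrics") or []:
        row = dict(base)
        row.update({f"gpu_{key}": value for key, value in gpu.items()})
        rows.append(row)
    # Still emit the system-level row when no GPU data was collected
    return rows or [base]


def samples_to_csv(samples: List[Dict[str, Any]]) -> str:
    """Serialize flattened samples to CSV text with a sorted header row."""
    rows = [row for sample in samples for row in flatten_sample(sample)]
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)  # missing keys are written as empty cells
    return buffer.getvalue()
```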


In [None]:
# Collect metrics for a specified duration
# This function collects metrics over time and stores them

def collect_metrics_duration(duration_seconds: int, interval_seconds: Optional[int] = None):
    """
    Collect metrics for a specified duration.
    
    Args:
        duration_seconds: How long to collect metrics (in seconds)
        interval_seconds: How often to collect (defaults to MONITORING_INTERVAL_SECONDS)
    """
    if interval_seconds is None:
        interval_seconds = MONITORING_INTERVAL_SECONDS
    
    print(f"📊 Collecting metrics for {duration_seconds} seconds...")
    print(f"   Interval: {interval_seconds} seconds")
    
    start_time = time.time()
    samples_collected = 0
    
    while time.time() - start_time < duration_seconds:
        collect_metrics_sample()
        samples_collected += 1
        
        elapsed = time.time() - start_time
        remaining = duration_seconds - elapsed
        
        if remaining > 0:
            print(f"   Collected {samples_collected} samples, {remaining:.1f}s remaining...", end='\r')
            time.sleep(min(interval_seconds, remaining))
        else:
            break
    
    print(f"\n✅ Collection complete! Collected {samples_collected} samples")

# Example: Collect metrics for 30 seconds
# Uncomment the line below to run
# collect_metrics_duration(30)


In [None]:
# Prepare collected metrics for export
# Convert our monitoring_data structure to the format expected by export functions

def prepare_metrics_for_export() -> List[Dict[str, Any]]:
    """
    Convert monitoring_data to list of metric dictionaries for export.
    """
    exported_metrics = []
    
    for i, timestamp in enumerate(monitoring_data["timestamps"]):
        metric = {
            "timestamp": timestamp.isoformat(),
            "timestamp_readable": timestamp.strftime("%Y-%m-%d %H:%M:%S"),
            "gpu_metrics": monitoring_data["gpu_data"][i] if i < len(monitoring_data["gpu_data"]) else [],
            # System and network metrics are dictionaries, so fall back to an empty dict
            "system_metrics": monitoring_data["system_data"][i] if i < len(monitoring_data["system_data"]) else {},
            "network_metrics": monitoring_data["network_data"][i] if i < len(monitoring_data["network_data"]) else {}
        }
        exported_metrics.append(metric)
    
    return exported_metrics

# Export to JSON
if len(monitoring_data["timestamps"]) > 0:
    metrics_to_export = prepare_metrics_for_export()
    
    # Generate filename with timestamp
    timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S")
    json_filepath = METRICS_SAVE_DIR / f"dgx_metrics_{timestamp_str}.json"
    
    export_metrics_to_json(metrics_to_export, json_filepath)
    print(f"✅ Metrics exported to JSON: {json_filepath}")
    print(f"   Total samples: {len(metrics_to_export)}")
else:
    print("⚠️  No metrics collected yet. Run monitoring cells above first.")


In [None]:
# Export to CSV
if len(monitoring_data["timestamps"]) > 0:
    metrics_to_export = prepare_metrics_for_export()
    
    # Generate filename with timestamp
    timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S")
    csv_filepath = METRICS_SAVE_DIR / f"dgx_metrics_{timestamp_str}.csv"
    
    export_metrics_to_csv(metrics_to_export, csv_filepath)
    print(f"✅ Metrics exported to CSV: {csv_filepath}")
    print(f"   Total samples: {len(metrics_to_export)}")
else:
    print("⚠️  No metrics collected yet. Run monitoring cells above first.")


## Section 9: Integration with LLM Experiments

Monitor DGX Spark performance during LLM experiment runs.
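
The monitoring function in this section blocks the notebook while it runs. To collect samples while the experiment itself executes in the same kernel, the collection loop could run in a background thread instead. The sketch below is one way to do that with `threading`; `BackgroundMonitor` is a hypothetical class, and it assumes the collect callable is safe to call from another thread.

```python
import threading
from typing import Any, Callable, List, Optional


class BackgroundMonitor:
    """Call collect() every interval_s seconds in a thread until the block exits."""

    def __init__(self, collect: Callable[[], Any], interval_s: float = 5.0) -> None:
        self._collect = collect
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None
        self.samples: List[Any] = []

    def _run(self) -> None:
        while True:
            self.samples.append(self._collect())
            # wait() returns True as soon as stop is set, ending the loop early
            if self._stop.wait(self._interval_s):
                break

    def __enter__(self) -> "BackgroundMonitor":
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
        return False  # never swallow exceptions raised by the experiment
```

Usage might look like `with BackgroundMonitor(collect_metrics_sample, MONITORING_INTERVAL_SECONDS) as monitor: run_experiment()`, where `run_experiment` stands in for whatever experiment code you are running.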


In [None]:
# Function to monitor during experiment execution
# This can be called from other notebooks or run in parallel

def monitor_during_experiment(experiment_name: str, duration_seconds: Optional[int] = None):
    """
    Monitor DGX Spark during an experiment run.
    
    Args:
        experiment_name: Name/identifier for this experiment
        duration_seconds: How long to monitor (None = until stopped manually)
    """
    print(f"🔬 Starting monitoring for experiment: {experiment_name}")
    
    # Create experiment-specific data storage
    experiment_data = {
        "experiment_name": experiment_name,
        "start_time": datetime.now(),
        "timestamps": [],
        "gpu_data": [],
        "system_data": [],
        "network_data": []
    }
    
    start_time = time.time()
    samples_collected = 0
    
    try:
        while duration_seconds is None or (time.time() - start_time) < duration_seconds:
            # Collect metrics
            ssh = ssh_connection if ssh_connection.is_connected() else None
            all_metrics = collect_all_metrics(ssh, use_nvml=USE_NVML_FOR_GPU)
            
            # Store in experiment data
            timestamp = datetime.now()
            experiment_data["timestamps"].append(timestamp)
            experiment_data["gpu_data"].append(all_metrics["gpu_metrics"])
            experiment_data["system_data"].append(all_metrics["system_metrics"])
            experiment_data["network_data"].append(all_metrics["network_metrics"])
            
            samples_collected += 1
            
            # Print status
            elapsed = time.time() - start_time
            if duration_seconds:
                remaining = duration_seconds - elapsed
                print(f"   [{elapsed:.0f}s] Collected {samples_collected} samples, {remaining:.1f}s remaining...", end='\r')
            else:
                print(f"   [{elapsed:.0f}s] Collected {samples_collected} samples...", end='\r')
            
            time.sleep(MONITORING_INTERVAL_SECONDS)
    
    except KeyboardInterrupt:
        print("\n⚠️  Monitoring stopped by user")
    
    experiment_data["end_time"] = datetime.now()
    experiment_data["duration_seconds"] = (experiment_data["end_time"] - experiment_data["start_time"]).total_seconds()
    
    print(f"\n✅ Monitoring complete for experiment: {experiment_name}")
    print(f"   Duration: {experiment_data['duration_seconds']:.1f} seconds")
    print(f"   Samples collected: {samples_collected}")
    
    # Export experiment data
    timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S")
    experiment_filepath = METRICS_SAVE_DIR / f"experiment_{experiment_name}_{timestamp_str}.json"
    
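    # Export this experiment's own samples (experiment_data), not the global
    # monitoring_data that prepare_metrics_for_export() reads
    export_data = {
        "experiment_info": {
            "name": experiment_name,
            "start_time": experiment_data["start_time"].isoformat(),
            "end_time": experiment_data["end_time"].isoformat(),
            "duration_seconds": experiment_data["duration_seconds"]
        },
        "metrics": [
            {
                "timestamp": ts.isoformat(),
                "timestamp_readable": ts.strftime("%Y-%m-%d %H:%M:%S"),
                "gpu_metrics": gpu,
                "system_metrics": system,
                "network_metrics": network
            }
            for ts, gpu, system, network in zip(
                experiment_data["timestamps"],
                experiment_data["gpu_data"],
                experiment_data["system_data"],
                experiment_data["network_data"]
            )
        ]
    }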
    
    with open(experiment_filepath, 'w') as f:
        json.dump(export_data, f, indent=2)
    
    print(f"   Data exported to: {experiment_filepath}")
    
    return experiment_data

# Example usage:
# experiment_monitoring = monitor_during_experiment("nolh_experiment_1", duration_seconds=300)


## Section 10: Historical Analysis

Load and analyze past performance data.
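
Beyond plotting, a loaded file can be reduced to summary statistics. The sketch below computes mean and peak utilization per GPU. `summarize_gpu_utilization` is a hypothetical helper, and it assumes the file's `metrics` entry is a list of samples shaped like the ones built by `prepare_metrics_for_export` in Section 8.

```python
from statistics import mean
from typing import Any, Dict, List


def summarize_gpu_utilization(metrics: List[Dict[str, Any]]) -> Dict[int, Dict[str, float]]:
    """Return mean, max, and sample count of utilization_gpu for each GPU index."""
    per_gpu: Dict[int, List[float]] = {}
    for sample in metrics:
        for gpu in sample.get("gpu_metrics") or []:
            value = gpu.get("utilization_gpu")
            if value is not None:  # skip readings that were unavailable
                per_gpu.setdefault(gpu["index"], []).append(value)

    return {
        index: {"mean": mean(values), "max": max(values), "samples": len(values)}
        for index, values in per_gpu.items()
    }
```

A call such as `summarize_gpu_utilization(historical_data.get('metrics', []))` would then give a quick per-GPU overview before any plots are drawn.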


In [None]:
# Function to load historical metrics from JSON file
def load_historical_metrics(filepath: Path) -> Dict[str, Any]:
    """
    Load historical metrics from a JSON export file.
    
    Args:
        filepath: Path to JSON file
    
    Returns:
        Dictionary with loaded metrics
    """
    if isinstance(filepath, Path):
        filepath = str(filepath)
    
    with open(filepath, 'r') as f:
        data = json.load(f)
    
    return data

# List available metric files
print("📁 Available metric files:")
metric_files = list(METRICS_SAVE_DIR.glob("dgx_metrics_*.json"))
experiment_files = list(METRICS_SAVE_DIR.glob("experiment_*.json"))

if metric_files:
    print("\n   General monitoring files:")
    for f in sorted(metric_files)[-5:]:  # Show last 5
        print(f"      - {f.name}")
else:
    print("   No general monitoring files found")

if experiment_files:
    print("\n   Experiment files:")
    for f in sorted(experiment_files)[-5:]:  # Show last 5
        print(f"      - {f.name}")
else:
    print("   No experiment files found")


In [None]:
# Load and visualize historical data
# Replace the filepath with an actual file from the list above

# Example: Load a specific file
# Uncomment and modify the path below to load a specific file
# historical_file = METRICS_SAVE_DIR / "dgx_metrics_20240115_143022.json"
# historical_data = load_historical_metrics(historical_file)

# if historical_data:
#     print(f"✅ Loaded historical data:")
#     print(f"   Export timestamp: {historical_data.get('export_timestamp_readable', 'N/A')}")
#     print(f"   Number of samples: {historical_data.get('num_samples', 0)}")
#     
#     # Extract GPU utilization over time
#     metrics = historical_data.get('metrics', [])
#     if metrics:
#         # Create visualization
#         fig, axes = plt.subplots(2, 2, figsize=(16, 12))
#         fig.suptitle('Historical Performance Analysis', fontsize=16, fontweight='bold')
#         ax1, ax2, ax3, ax4 = axes.flatten()
#         
#         # Extract timestamps and GPU data
#         timestamps = [datetime.fromisoformat(m['timestamp']) for m in metrics]
#         
#         # Plot GPU utilization
#         for gpu_idx in range(len(metrics[0].get('gpu_metrics', []))):
#             utilizations = [m['gpu_metrics'][gpu_idx]['utilization_gpu'] 
#                           for m in metrics if len(m.get('gpu_metrics', [])) > gpu_idx]
#             if utilizations:
#                 ax1.plot(timestamps[:len(utilizations)], utilizations, label=f'GPU {gpu_idx}')
#         ax1.set_title('GPU Utilization Over Time')
#         ax1.set_xlabel('Time')
#         ax1.set_ylabel('Utilization (%)')
#         ax1.legend()
#         ax1.grid(True, alpha=0.3)
#         
#         # Similar plots for memory, temperature, etc.
#         # ... (add more visualizations as needed)
#         
#         plt.tight_layout()
#         plt.show()
# else:
#     print("⚠️  No historical data file specified or file not found")

print("ℹ️  Uncomment the code above and specify a file path to load historical data")


## Troubleshooting and Error Handling

Common issues and solutions for DGX Spark monitoring.


In [None]:
# Comprehensive connection health check
# Run this if you're experiencing connection issues

def comprehensive_health_check():
    """
    Perform comprehensive health check of all connections and services.
    """
    print("🔍 Comprehensive Health Check\n")
    print("=" * 60)
    
    # 1. SSH Connection
    print("\n1. SSH Connection:")
    if ssh_connection:
        if ssh_connection.is_connected():
            print("   ✅ SSH connection is active")
            health = check_ssh_health(ssh_connection)
            if health['latency_ms'] is not None:
                print(f"   📊 Latency: {health['latency_ms']:.2f} ms")
        else:
            print("   ❌ SSH connection is not active")
            print("   💡 Try running the SSH connection cell again")
    else:
        print("   ⚠️  SSH connection object not initialized")
    
    # 2. VPN Status
    print("\n2. VPN Status:")
    vpn = check_vpn_status(ssh_connection if ssh_connection and ssh_connection.is_connected() else None)
    if vpn['connected']:
        print(f"   ✅ VPN connected ({vpn['vpn_type']})")
    elif vpn['vpn_available']:
        print(f"   ⚠️  VPN available but not connected ({vpn['vpn_type']})")
    else:
        print("   ℹ️  VPN not detected (may not be required)")
    
    # 3. GPU Availability
    print("\n3. GPU Availability:")
    ssh = ssh_connection if ssh_connection and ssh_connection.is_connected() else None
    gpus = get_gpu_metrics_nvidia_smi(ssh)
    if gpus:
        print(f"   ✅ Found {len(gpus)} GPU(s)")
        for gpu in gpus:
            print(f"      - GPU {gpu['index']}: {gpu['name']}")
    else:
        print("   ❌ No GPUs detected")
        print("   💡 Check if nvidia-smi is available on target system")
    
    # 4. System Metrics
    print("\n4. System Metrics:")
    system = get_system_metrics(ssh)
    if system:
        print("   ✅ System metrics available")
        if 'cpu_count' in system:
            print(f"      CPU Cores: {system['cpu_count']}")
        if 'memory_total_gb' in system:
            print(f"      Total Memory: {system['memory_total_gb']:.2f} GB")
    else:
        print("   ⚠️  System metrics not available")
        print("   💡 This may be normal for remote monitoring")
    
    # 5. Data Collection
    print("\n5. Data Collection:")
    if len(monitoring_data["timestamps"]) > 0:
        print(f"   ✅ {len(monitoring_data['timestamps'])} samples collected")
        print(f"      First sample: {monitoring_data['timestamps'][0]}")
        print(f"      Last sample: {monitoring_data['timestamps'][-1]}")
    else:
        print("   ℹ️  No data collected yet")
    
    print("\n" + "=" * 60)
    print("✅ Health check complete!")

# Run health check
comprehensive_health_check()


### Common Issues and Solutions

1. **SSH Connection Fails**
   - Verify hostname/IP is correct
   - Check VPN connection (if required)
   - Verify SSH key permissions: `chmod 600 ~/.ssh/id_rsa`
   - Test manual connection: `ssh username@hostname`

2. **No GPU Metrics**
   - Ensure nvidia-smi is installed on target system
   - Check GPU drivers are loaded: `nvidia-smi` on remote host
   - Verify user has permissions to access GPU

3. **High Latency**
   - Check network connection quality
   - Verify VPN is not causing delays
   - Consider monitoring locally if SSH latency is too high

4. **Missing Dependencies**
   - Install required packages: `pip install -r requirements.txt`
   - For NVML: Ensure NVIDIA drivers are installed locally

5. **Dashboard Not Updating**
   - Ensure animation cell is running
   - Check that metrics are being collected
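   - Verify the matplotlib backend supports animation: the default `inline` backend renders only a static figure, so switch to an interactive backend such as `%matplotlib widget` (requires `ipympl`)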
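
For the SSH key permission check listed under "SSH Connection Fails", the file mode can also be inspected from Python before connecting. The sketch below is a minimal, POSIX-only check. `key_permissions_ok` is a hypothetical helper, and "no access for group or others" is used here as the criterion that `chmod 600` satisfies.

```python
import os
import stat


def key_permissions_ok(key_path: str) -> bool:
    """Return True if the key file grants no permissions to group or others."""
    mode = stat.S_IMODE(os.stat(os.path.expanduser(key_path)).st_mode)
    # Any group/other bit set (e.g. 0o644) means the key is too permissive
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```

If this returns `False` for your key, running `chmod 600` on the file as suggested in the list should resolve it.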
