diff --git a/.gitignore b/.gitignore index 1dc4703..1e97e64 100644 --- a/.gitignore +++ b/.gitignore @@ -18,13 +18,19 @@ env/ .vscode/ *.swp -# Data +# Data and cache *.db metrics.db +.features_cache # Logs *.log +# Old/backup files +*.old +*.bak +*.tmp + # OS .DS_Store Thumbs.db diff --git a/DISTRIBUTION.md b/DISTRIBUTION.md new file mode 100644 index 0000000..bf1c561 --- /dev/null +++ b/DISTRIBUTION.md @@ -0,0 +1,119 @@ +# Distribution Setup Complete + +## Summary +Cluster Health Monitor v1.0.0 is now ready for portable ZIP distribution. + +## What Was Implemented + +### 1. Code Cleanup +- Removed debug print statements from workloads.py +- No emojis or verbose logging in code +- Clean, concise comments throughout + +### 2. Feature Detection & Caching +- `monitor/utils/features.py`: Runtime feature detection +- Detects: nvidia-smi, cupy, torch, gpu_benchmark availability +- Results cached in `.features_cache` JSON file +- Fast subsequent loads (no repeated checks) + +### 3. Requirements Simplified +- Single `requirements.txt` file +- Core dependencies required +- GPU libraries (cupy/torch) commented as optional +- Setup script prompts for GPU library installation + +### 4. PowerShell Setup Script +- `setup.ps1`: Automated Windows setup wizard +- Checks Python 3.8+ +- Detects NVIDIA drivers and CUDA version +- Creates virtual environment +- Installs dependencies +- Prompts for CuPy or PyTorch based on CUDA version +- Runs feature detection and caching +- Verifies installation + +### 5. Update Mechanism +- CLI: `python health_monitor.py --update` +- Web: "Check for Updates" button in header +- Checks GitHub releases API +- Downloads and applies updates automatically +- Preserves venv, config, and data + +### 6. 
Feature Graying in UI +- `/api/features` endpoint returns cached feature flags +- JavaScript checks features on page load +- Disables benchmark controls if GPU libraries not available +- Visual feedback: opacity 0.5, cursor not-allowed +- Alert message explains missing libraries + +### 7. Multi-GPU Support +- Already implemented in gpu.py collector +- Loops through all NVIDIA GPUs via NVML +- Web UI displays all GPUs in grid +- Benchmark supports any GPU (defaults to GPU 0) + +### 8. Portable ZIP Distribution +- `package.ps1`: Creates distribution ZIP +- Includes: monitor/, health_monitor.py, config.yaml, requirements.txt, setup.ps1, README.md, LICENSE +- Excludes: venv, __pycache__, .features_cache, *.db +- ~50KB compressed size +- Ready for GitHub releases + +### 9. Updated Documentation +- README.md rewritten for ZIP distribution +- Installation: Download → Extract → Run setup.ps1 +- Troubleshooting section updated +- Simplified project structure +- Removed development-focused content + +## Files Created/Modified + +### New Files +- `monitor/utils/features.py` - Feature detection +- `monitor/utils/update.py` - Update mechanism +- `monitor/utils/__init__.py` - Utils module exports +- `setup.ps1` - Windows setup wizard +- `package.ps1` - Distribution packaging script + +### Modified Files +- `health_monitor.py` - Added --update flag +- `monitor/api/server.py` - Added /api/features, /api/update/* endpoints +- `monitor/api/templates/index.html` - Update button, feature graying +- `monitor/benchmark/workloads.py` - Removed debug prints +- `requirements.txt` - Simplified to single file +- `README.md` - Complete rewrite for ZIP distribution + +### Removed Files +- `requirements-base.txt` - Merged into requirements.txt +- `requirements-gpu.txt` - Merged into requirements.txt +- `setup.py` - No longer using pip package +- `MANIFEST.in` - No longer needed +- `BUILD.md` - Removed +- `CHECKLIST.md` - Removed +- `RELEASE_NOTES.md` - Removed + +## Usage + +### For End Users 
+1. Download `cluster-health-monitor-v1.0.0.zip` from releases +2. Extract to desired location +3. Run `setup.ps1` in PowerShell +4. Activate venv and run: `python health_monitor.py monitor --web` +5. Access dashboard at http://localhost:8090 + +### For Distribution +1. Run `.\package.ps1` to create ZIP +2. Upload `cluster-health-monitor-v1.0.0.zip` to GitHub releases +3. Users download and follow above steps + +### For Updates +Users can update via: +- CLI: `python health_monitor.py --update` +- Web: Click "Check for Updates" button + +## Next Steps (Future) +- Create GitHub Actions workflow for automated releases +- Add version check on startup (optional notification) +- Multi-platform support (Linux setup.sh) +- Configuration wizard in web UI +- Export/import settings diff --git a/README.md b/README.md index f058a28..5418853 100644 --- a/README.md +++ b/README.md @@ -5,6 +5,7 @@ Real-time GPU and system monitoring with web dashboard and CLI interface. Featur ## Features ### Monitoring + - Real-time GPU metrics (utilization, memory, temperature, power) - System metrics (CPU, memory, disk I/O) - Web dashboard with live charts @@ -12,6 +13,7 @@ Real-time GPU and system monitoring with web dashboard and CLI interface. Featur - Historical data storage and alerting ### GPU Benchmarking + - GEMM (matrix multiplication) stress test - Particle simulation workload - Auto-scaling stress test (dynamically increases load to 98% GPU utilization) @@ -20,82 +22,74 @@ Real-time GPU and system monitoring with web dashboard and CLI interface. Featur ## Requirements -### Core Monitoring (Always Available) - Python 3.8+ - NVIDIA GPU with drivers installed -- `nvidia-smi` command available - -### GPU Benchmarking (Optional) -- CUDA Toolkit 12.0+ or compatible -- One of: - - CuPy: `pip install cupy-cuda12x` (or appropriate CUDA version) - - PyTorch: `pip install torch --index-url https://download.pytorch.org/whl/cu121` +- CUDA Toolkit 12.0+ (for benchmarking) ## Installation -### 1. 
Clone Repository -```bash -git clone https://github.com/DataBoySu/cluster-monitor.git -cd cluster-health-monitor -``` +### 1. Download -### 2. Create Virtual Environment -```bash -python -m venv .venv -``` - -Activate: -- Windows: `.venv\Scripts\activate` -- Linux/Mac: `source .venv/bin/activate` +Download the latest release ZIP from [Releases](https://github.com/DataBoySu/cluster-monitor/releases). -### 3. Install Dependencies +Extract to your desired location: -**Basic Monitoring:** -```bash -pip install -r requirements.txt +```powershell +Expand-Archive cluster-health-monitor-v1.0.0.zip -DestinationPath C:\Tools\ +cd C:\Tools\cluster-health-monitor ``` -**With GPU Benchmarking (CuPy):** -```bash -pip install -r requirements.txt -pip install cupy-cuda12x # Adjust for your CUDA version -``` +### 2. Run Setup -**With GPU Benchmarking (PyTorch):** -```bash -pip install -r requirements.txt -pip install torch --index-url https://download.pytorch.org/whl/cu121 +```powershell +.\setup.ps1 ``` -### 4. Verify Installation -```bash +The setup script will: + +- Check for NVIDIA drivers and CUDA +- Create Python virtual environment +- Install required dependencies +- Prompt for optional GPU benchmark libraries (CuPy or PyTorch) +- Verify installation + +### 3. Verify + +```powershell +.\venv\Scripts\Activate.ps1 python health_monitor.py --help ``` ## Usage -### Web Dashboard (Recommended) -```bash -python health_monitor.py monitor --web +### Web Dashboard (Default) + +```powershell +python health_monitor.py +# Change port: python health_monitor.py --port 3000 ``` Access at: http://localhost:8090 Features: + - Real-time GPU/system metrics - Interactive benchmark controls - Live performance charts - Historical data visualization +- In-dashboard updates ### Terminal Dashboard -```bash -python health_monitor.py monitor + +```powershell +python health_monitor.py cli ``` Displays live metrics in terminal with auto-refresh. 
### CLI Benchmark -```bash + +```powershell # Quick 15-second test python health_monitor.py benchmark --mode quick @@ -136,13 +130,15 @@ The Stress Test mode automatically increases workload intensity: 4. Continues scaling up to 15 times or until 98% GPU utilization achieved Example progression: -``` + +```text 100K particles → 200K → 400K → 800K → 1.2M → 1.8M → 2.2M → 2.6M (94% GPU util) ``` ## Benchmark Types ### GEMM (Matrix Multiplication) + Dense matrix multiplication for maximum compute stress. Measures TFLOPS. ```bash @@ -150,6 +146,7 @@ python health_monitor.py benchmark --type gemm --mode stress-test ``` ### Particle Simulation + Vectorized particle physics simulation with collision detection. Measures steps/second. ```bash @@ -178,33 +175,6 @@ storage: path: ./metrics.db ``` -## Project Structure - -``` -cluster-health-monitor/ -├── monitor/ -│ ├── benchmark/ -│ │ ├── config.py # Benchmark configuration -│ │ ├── storage.py # Baseline storage (SQLite) -│ │ ├── workloads.py # GPU workloads (GEMM/Particle) -│ │ └── runner.py # Benchmark orchestration -│ ├── collectors/ -│ │ ├── gpu.py # GPU metrics via nvidia-smi -│ │ ├── system.py # CPU, memory, disk -│ │ └── network.py # Network info -│ ├── storage/ -│ │ └── sqlite.py # Metrics persistence -│ ├── api/ -│ │ ├── server.py # FastAPI web server -│ │ └── templates/ -│ │ └── index.html # Web dashboard -│ └── cli/ -│ └── benchmark_cli.py # CLI commands -├── config.yaml # Configuration -├── requirements.txt # Dependencies -└── health_monitor.py # Main entry point -``` - ## API Endpoints When running web server (`--web`): @@ -215,57 +185,36 @@ When running web server (`--web`): - `POST /api/benchmark/start` - Start benchmark - `GET /api/benchmark/status` - Benchmark progress - `POST /api/benchmark/stop` - Stop benchmark -- `GET /api/benchmark/results` - Get results -- `GET /api/benchmark/baseline` - Get baseline for GPU -## Troubleshooting +## Updates -### "nvidia-smi not found" -- Install NVIDIA drivers -- Add 
nvidia-smi to PATH -- Verify: `nvidia-smi` in terminal +### CLI -### "No CUDA libraries found" -Benchmarking features disabled without CUDA libraries. Install CuPy or PyTorch. +```powershell +python health_monitor.py --update +``` -### Web dashboard not loading data -- Check terminal for errors -- Verify port 8090 is available -- Check firewall settings -- Try: `http://127.0.0.1:8090` +### Web Dashboard -### Benchmark not scaling GPU to 98% -- Increase max_scales in runner.py -- Check GPU has available memory -- Verify no other GPU workloads running -- Try different benchmark type (GEMM vs Particle) +Click the "Check for Updates" button in the dashboard. -## Performance Tips +## Troubleshooting -1. **Close other GPU applications** during benchmarking -2. **Adequate cooling** for stress tests -3. **Monitor temperatures** - tests will stop at temp limit -4. **Use Stress Test mode** to find maximum GPU performance -5. **Run Extended mode** for stability validation +### "nvidia-smi not found" +Install NVIDIA drivers from https://www.nvidia.com/download/index.aspx -## Development +### "No CUDA Toolkit found" +Download CUDA from https://developer.nvidia.com/cuda-downloads +Re-run `.\setup.ps1` after installation. -### Run Tests -```bash -pytest tests/ -``` +### Web dashboard not loading data +- Check port 8090 is available +- Try: `http://127.0.0.1:8090` +- Check firewall settings -### Code Structure -- Modular design: config, storage, workloads, runner separated -- Clean API exports via `__init__.py` -- Type hints throughout -- Comprehensive error handling +### Benchmark features grayed out -### Contributing -1. Fork repository -2. Create feature branch -3. Add tests for new features -4. Submit pull request +GPU benchmark libraries not installed. Run setup script and select CuPy or PyTorch installation. 
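To confirm why the benchmark controls are grayed out, a quick check along the following lines can show which optional libraries Python can see from the active virtual environment. This is a minimal sketch, not the project's `monitor/utils/features.py`: the function name `detect_gpu_features` and the exact flag names are assumptions for illustration, and it only tests whether the packages are importable and whether `nvidia-smi` is on `PATH`.

```python
import importlib.util
import shutil


def detect_gpu_features():
    """Return illustrative feature flags (hypothetical names, not the project's API)."""
    # find_spec returns None when a top-level package is not installed,
    # without actually importing it.
    has_cupy = importlib.util.find_spec("cupy") is not None
    has_torch = importlib.util.find_spec("torch") is not None
    return {
        # shutil.which returns None when the executable is not on PATH.
        "nvidia_smi": shutil.which("nvidia-smi") is not None,
        "cupy": has_cupy,
        "torch": has_torch,
        # Assumption: benchmarking needs at least one of CuPy or PyTorch.
        "gpu_benchmark": has_cupy or has_torch,
    }


if __name__ == "__main__":
    for name, available in detect_gpu_features().items():
        print(f"{name}: {'available' if available else 'missing'}")
```

If `gpu_benchmark` reports missing here, re-running `.\setup.ps1` and selecting CuPy or PyTorch should be the next step.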
## License @@ -279,6 +228,4 @@ MIT License - See LICENSE file ## Support -- Issues: GitHub Issues -- Documentation: This README -- CUDA setup: https://developer.nvidia.com/cuda-downloads +GitHub: https://github.com/DataBoySu/cluster-monitor diff --git a/health_monitor.py b/health_monitor.py index 50e917f..0c82fba 100644 --- a/health_monitor.py +++ b/health_monitor.py @@ -295,12 +295,40 @@ async def main(): @click.group(invoke_without_command=True) @click.option('--config', '-c', type=click.Path(), help='Configuration file path.') +@click.option('--port', '-p', type=int, help='Web server port (default: 8090).') +@click.option('--update', is_flag=True, help='Check for and install updates.') @click.pass_context -def cli(ctx, config): +def cli(ctx, config, port, update): """Cluster Health Monitor: Real-time GPU and system health monitoring.""" + if update: + from monitor.utils import check_for_updates, perform_update + console.print("\n[cyan]Checking for updates...[/cyan]") + + status = check_for_updates() + + if status.get('error'): + console.print(f"[red]{status['error']}[/red]") + return + + if not status['available']: + console.print(f"[green]You have the latest version ({status['current']})[/green]") + return + + console.print(f"\n[yellow]Update available:[/yellow]") + console.print(f" Current: {status['current']}") + console.print(f" Latest: {status['latest']}") + + if click.confirm("\nDownload and install update?"): + console.print("\n[cyan]Downloading update...[/cyan]") + if perform_update(): + console.print("[green]Update complete! Restart the application.[/green]") + else: + console.print("[red]Update failed. 
Try again later.[/red]") + return + ctx.obj = {'config_path': config} if ctx.invoked_subcommand is None: - _run_app(config, port=None, nodes=None, once=False, web_mode=True) + _run_app(config, port=port, nodes=None, once=False, web_mode=True) @cli.command() @click.option('--port', '-p', type=int, help='Web server port (overrides config).') diff --git a/monitor/__version__.py b/monitor/__version__.py new file mode 100644 index 0000000..09ad572 --- /dev/null +++ b/monitor/__version__.py @@ -0,0 +1,5 @@ +"""Version information for Cluster Health Monitor.""" + +__version__ = "1.0.0" +__author__ = "DataBoySu" +__license__ = "MIT" diff --git a/monitor/api/server.py b/monitor/api/server.py index 89dfff0..ce4b8be 100644 --- a/monitor/api/server.py +++ b/monitor/api/server.py @@ -11,6 +11,7 @@ from fastapi import FastAPI, HTTPException, BackgroundTasks from fastapi.responses import HTMLResponse, StreamingResponse, FileResponse +from fastapi.staticfiles import StaticFiles from monitor.collectors.gpu import GPUCollector from monitor.collectors.system import SystemCollector @@ -20,6 +21,7 @@ # Path to the templates directory, relative to this file TEMPLATE_DIR = Path(__file__).parent / "templates" +STATIC_DIR = Path(__file__).parent / "static" def create_app(config: Dict[str, Any]) -> FastAPI: @@ -29,6 +31,9 @@ def create_app(config: Dict[str, Any]) -> FastAPI: version="1.0.0" ) + # Mount static files + app.mount("/static", StaticFiles(directory=str(STATIC_DIR)), name="static") + storage = MetricsStorage(config['storage']['path']) alert_engine = AlertEngine(config.get('alerts', {})) @@ -81,7 +86,23 @@ async def get_gpus(): @app.get("/api/processes") async def get_processes(): collector = GPUCollector() - return {'processes': collector.collect_processes()} + gpus = collector.collect() + processes = collector.collect_processes() + + # Calculate total VRAM usage from processes + gpu_memory_stats = {} + for gpu in gpus: + if not gpu.get('error'): + gpu_memory_stats[gpu['index']] = 
{ + 'total': gpu.get('memory_total', 0), + 'used': gpu.get('memory_used', 0), + 'free': gpu.get('memory_free', 0) + } + + return { + 'processes': processes, + 'gpu_memory': gpu_memory_stats + } @app.get("/api/system") async def get_system(): @@ -110,6 +131,28 @@ async def get_available_metrics(): ] } + @app.get("/api/features") + async def get_features_endpoint(): + """Get available features (always fresh to detect newly installed packages).""" + from monitor.utils.features import detect_features + return detect_features(force=True) + + @app.post("/api/update/check") + async def check_update(): + """Check for available updates.""" + from monitor.utils import check_for_updates + return check_for_updates() + + @app.post("/api/update/install") + async def install_update(): + """Install available update.""" + from monitor.utils import perform_update + success = perform_update() + if success: + return {'status': 'success', 'message': 'Update installed. Restart application.'} + else: + return {'status': 'error', 'message': 'Update failed'} + @app.get("/api/export/json") async def export_json(hours: int = 24): metrics = await storage.query(hours=hours) diff --git a/monitor/api/static/main.js b/monitor/api/static/main.js new file mode 100644 index 0000000..da7f82a --- /dev/null +++ b/monitor/api/static/main.js @@ -0,0 +1,810 @@ +let countdown = 5; +let historyChart = null; + +// Tab switching +document.querySelectorAll('.tab').forEach(tab => { + tab.addEventListener('click', () => { + document.querySelectorAll('.tab').forEach(t => t.classList.remove('active')); + document.querySelectorAll('.tab-content').forEach(c => c.classList.remove('active')); + tab.classList.add('active'); + document.getElementById(tab.dataset.tab).classList.add('active'); + + if (tab.dataset.tab === 'history') loadHistory(); + if (tab.dataset.tab === 'processes') loadProcesses(); + if (tab.dataset.tab === 'benchmark') { loadBenchmarkResults(); loadBaseline(); } + }); +}); + +async function 
loadBenchmarkResults() { + try { + const response = await fetch('/api/benchmark/results'); + const results = await response.json(); + if (results && results.status !== 'no_results') { + displayBenchmarkResults(results); + } + } catch (error) { + console.error('Error loading benchmark results:', error); + } +} + +async function fetchStatus() { + try { + console.log('Fetching status from /api/status...'); + const response = await fetch('/api/status'); + console.log('Response status:', response.status); + + if (!response.ok) { + throw new Error(`HTTP error! status: ${response.status}`); + } + + const data = await response.json(); + console.log('Received data:', data); + updateDashboard(data); + } catch (error) { + console.error('Error fetching status:', error); + document.getElementById('gpu-list').innerHTML = '
Saved: ${new Date(baseline.timestamp).toLocaleString()}
+ `; + } else { + document.getElementById('baseline-info').style.display = 'none'; + } + } catch (error) { + console.error('Error loading baseline:', error); + } +} + +function selectBenchType(type) { + selectedBenchType = type; + // Reload baseline when benchmark type changes + loadBaseline(); + document.querySelectorAll('.type-btn').forEach(btn => { + btn.classList.toggle('active', btn.dataset.type === type); + }); + + // Update description + const descriptions = { + 'gemm': 'Dense matrix multiplication for maximum GPU compute stress. Measures TFLOPS.', + 'particle': '2D particle physics simulation with millions of particles. Measures steps/second.' + }; + document.getElementById('type-description').textContent = descriptions[type] || ''; + + // Show/hide type-specific settings in custom mode + document.getElementById('gemm-settings').style.display = type === 'gemm' ? 'block' : 'none'; + document.getElementById('particle-settings').style.display = type === 'particle' ? 'block' : 'none'; +} + +function selectMode(mode) { + selectedMode = mode; + document.querySelectorAll('.mode-btn').forEach(btn => { + btn.classList.toggle('active', btn.dataset.mode === mode); + }); + document.getElementById('custom-controls').style.display = mode === 'custom' ? 
'block' : 'none'; + + // Update mode description + const descriptions = { + 'quick': 'Quick baseline test - 15 seconds with fixed workload size', + 'standard': 'Standard benchmark - 60 seconds with fixed workload size', + 'extended': 'Extended burn-in test - 180 seconds with fixed workload size for thorough validation', + 'stress-test': 'Stress test - 60 seconds with AUTO-SCALING workload that dynamically increases to push GPU to 98% utilization', + 'custom': 'Custom configuration - set your own duration, limits, and workload parameters' + }; + document.getElementById('mode-description').textContent = descriptions[mode] || ''; +} + +function updateSliderValue(type) { + const slider = document.getElementById('custom-' + type); + const input = document.getElementById('custom-' + type + '-val'); + input.value = slider.value; +} + +// Sync input to slider +['duration', 'temp', 'memory', 'power', 'matrix', 'particles'].forEach(type => { + const input = document.getElementById('custom-' + type + '-val'); + if (input) { + input.addEventListener('change', () => { + document.getElementById('custom-' + type).value = input.value; + }); + } +}); + +async function startBenchmark() { + const btn = document.getElementById('start-bench-btn'); + const stopBtn = document.getElementById('stop-bench-btn'); + btn.disabled = true; + btn.textContent = 'Running...'; + stopBtn.style.display = 'inline-block'; + + document.getElementById('benchmark-progress').style.display = 'block'; + document.getElementById('benchmark-live-charts').style.display = 'block'; + document.getElementById('benchmark-results').innerHTML = ''; + document.getElementById('bench-stop-reason').textContent = ''; + document.getElementById('iteration-counter').style.display = 'inline'; + document.getElementById('iteration-counter').textContent = 'Iteration #0'; + + // Build URL with params + let url = '/api/benchmark/start?benchmark_type=' + selectedBenchType; + + // Handle different modes + if (selectedMode === 'quick') 
{ + url += '&mode=fixed&duration=15&auto_scale=false'; + } else if (selectedMode === 'standard') { + url += '&mode=fixed&duration=60&auto_scale=false'; + } else if (selectedMode === 'stress-test') { + url += '&mode=stress&duration=60&auto_scale=true'; + } else if (selectedMode === 'extended') { + url += '&mode=fixed&duration=180&auto_scale=false'; + } else if (selectedMode === 'custom') { + url += '&mode=custom&auto_scale=false'; + url += '&duration=' + document.getElementById('custom-duration-val').value; + url += '&temp_limit=' + document.getElementById('custom-temp-val').value; + url += '&memory_limit=' + document.getElementById('custom-memory-val').value; + url += '&power_limit=' + document.getElementById('custom-power-val').value; + if (selectedBenchType === 'gemm') { + url += '&matrix_size=' + document.getElementById('custom-matrix-val').value; + } else if (selectedBenchType === 'particle') { + const particles = Math.round(parseFloat(document.getElementById('custom-particles-val').value) * 1000000); + url += '&num_particles=' + particles; + } + } + + // Initialize live charts + initLiveCharts(); + + try { + await fetch(url, { method: 'POST' }); + benchmarkPollInterval = setInterval(pollBenchmarkStatus, 500); + } catch (error) { + console.error('Error starting benchmark:', error); + btn.disabled = false; + btn.textContent = 'Start Benchmark'; + stopBtn.style.display = 'none'; + } +} + +async function stopBenchmark() { + try { + await fetch('/api/benchmark/stop', { method: 'POST' }); + } catch (error) { + console.error('Error stopping benchmark:', error); + } +} + +function createSmallChart(canvasId, color, maxY = null) { + const ctx = document.getElementById(canvasId).getContext('2d'); + return new Chart(ctx, { + type: 'line', + data: { labels: [], datasets: [{ data: [], borderColor: color, backgroundColor: color + '20', fill: true, tension: 0.3, pointRadius: 0 }] }, + options: { + responsive: true, + plugins: { legend: { display: false } }, + scales: { + x: { 
display: false }, + y: { min: 0, max: maxY, ticks: { color: '#a0a0a0' }, grid: { color: '#4a4a4a' } } + } + } + }); +} + +function initLiveCharts() { + Object.values(benchCharts).forEach(c => c.destroy()); + benchCharts = { + utilization: createSmallChart('chartUtilization', '#76b900', 100), + temperature: createSmallChart('chartTemperature', '#ffc107', 100), + memory: createSmallChart('chartMemory', '#00a0ff'), + power: createSmallChart('chartPower', '#dc3545') + }; +} + +async function pollBenchmarkStatus() { + try { + const [statusRes, samplesRes] = await Promise.all([ + fetch('/api/benchmark/status'), + fetch('/api/benchmark/samples') + ]); + const status = await statusRes.json(); + const samplesData = await samplesRes.json(); + + document.getElementById('bench-progress-bar').style.width = status.progress + '%'; + document.getElementById('bench-percent').textContent = status.progress + '%'; + document.getElementById('iteration-counter').textContent = 'Iteration #' + (status.iterations || 0); + document.getElementById('workload-info').textContent = 'Workload: ' + (status.workload_type || 'N/A'); + document.getElementById('bench-workload').textContent = status.workload_type || ''; + + // Update live charts with samples + if (samplesData.samples && benchCharts.utilization) { + const samples = samplesData.samples; + const labels = samples.map(s => s.elapsed_sec + 's'); + + benchCharts.utilization.data.labels = labels; + benchCharts.utilization.data.datasets[0].data = samples.map(s => s.utilization || 0); + benchCharts.utilization.update('none'); + + benchCharts.temperature.data.labels = labels; + benchCharts.temperature.data.datasets[0].data = samples.map(s => s.temperature_c || 0); + benchCharts.temperature.update('none'); + + benchCharts.memory.data.labels = labels; + benchCharts.memory.data.datasets[0].data = samples.map(s => s.memory_used_mb || 0); + benchCharts.memory.update('none'); + + benchCharts.power.data.labels = labels; + 
benchCharts.power.data.datasets[0].data = samples.map(s => s.power_w || 0); + benchCharts.power.update('none'); + } + + if (!status.running) { + clearInterval(benchmarkPollInterval); + document.getElementById('start-bench-btn').disabled = false; + document.getElementById('start-bench-btn').textContent = 'Start Benchmark'; + document.getElementById('stop-bench-btn').style.display = 'none'; + document.getElementById('bench-status').textContent = 'Completed'; + + const resultsResponse = await fetch('/api/benchmark/results'); + const results = await resultsResponse.json(); + displayBenchmarkResults(results); + + // Reload baseline if saved + loadBaseline(); + } + } catch (error) { + console.error('Error polling benchmark:', error); + } +} + +function displayBenchmarkResults(results) { + if (!results || results.status === 'no_results') { + document.getElementById('benchmark-results').innerHTML = 'No results available
'; + return; + } + + const gpuInfo = results.gpu_info || {}; + const config = results.config || {}; + const scores = results.scores || {}; + const baseline = results.baseline; + const perf = results.performance || {}; + + // Show stop reason if benchmark was stopped early + if (results.stop_reason && results.stop_reason !== 'Duration completed') { + document.getElementById('bench-stop-reason').textContent = 'Stopped: ' + results.stop_reason; + } + + // Baseline comparison + let baselineComparison = ''; + if (baseline) { + const iterDiff = results.iterations_completed - baseline.iterations_completed; + const iterPct = ((iterDiff / baseline.iterations_completed) * 100).toFixed(1); + const iterColor = iterDiff >= 0 ? 'var(--accent-green)' : 'var(--accent-red)'; + baselineComparison = ` +Saved as new baseline
' : ''} +Benchmark completed at ${new Date(results.timestamp).toLocaleString()}
`; + + document.getElementById('benchmark-results').innerHTML = html; +} + +async function checkForUpdates() { + const btn = document.getElementById('update-btn'); + btn.disabled = true; + btn.textContent = 'Checking...'; + btn.removeAttribute('data-tooltip'); + + try { + const response = await fetch('/api/update/check', { method: 'POST' }); + const data = await response.json(); + + if (data.available) { + btn.textContent = `Update: ${data.latest}`; + btn.classList.remove('success', 'error'); + btn.disabled = false; + btn.setAttribute('data-tooltip', `Current: ${data.current} → Latest: ${data.latest}`); + + btn.onclick = async () => { + btn.textContent = 'Installing...'; + btn.disabled = true; + const install = await fetch('/api/update/install', { method: 'POST' }); + const result = await install.json(); + + if (result.status === 'success') { + btn.textContent = '✓ Restart App'; + btn.classList.add('success'); + btn.setAttribute('data-tooltip', 'Update installed - restart application'); + } else { + btn.textContent = '✗ Update Failed'; + btn.classList.add('error'); + btn.setAttribute('data-tooltip', result.message); + btn.disabled = false; + } + }; + } else if (data.error) { + btn.textContent = '✗ Check Failed'; + btn.classList.add('error'); + btn.setAttribute('data-tooltip', data.error); + btn.disabled = false; + } else { + btn.textContent = '✓ Latest Version'; + btn.classList.add('success'); + btn.setAttribute('data-tooltip', `Version ${data.current}`); + setTimeout(() => { + btn.textContent = 'Check for Updates'; + btn.classList.remove('success'); + btn.disabled = false; + btn.removeAttribute('data-tooltip'); + }, 3000); + } + } catch (error) { + btn.textContent = '✗ Network Error'; + btn.classList.add('error'); + btn.setAttribute('data-tooltip', 'Could not connect to update server'); + btn.disabled = false; + } +} + +function tick() { + countdown--; + document.getElementById('countdown').textContent = countdown; + if (countdown <= 0) { + countdown = 5; + 
fetchStatus(); + + // Auto-refresh active tab content + const activeTab = document.querySelector('.tab-content.active'); + if (activeTab) { + const tabId = activeTab.id; + if (tabId === 'processes-tab') { + loadProcesses(); + } else if (tabId === 'history-tab') { + const activeChart = document.querySelector('.chart-tab.active'); + if (activeChart) { + loadHistory(); + } + } + } + } +} + +async function loadFeatures() { + try { + const response = await fetch('/api/features'); + const features = await response.json(); + + // Disable benchmark controls if GPU benchmark not available + if (!features.gpu_benchmark) { + const benchTab = document.querySelector('[data-tab="benchmark"]'); + const startBtn = document.getElementById('start-bench-btn'); + const typeButtons = document.querySelectorAll('.type-btn'); + const modeButtons = document.querySelectorAll('.mode-btn'); + + if (benchTab) { + benchTab.classList.add('disabled'); + benchTab.setAttribute('data-tooltip', 'Install CuPy or PyTorch for GPU benchmarking'); + benchTab.style.pointerEvents = 'auto'; + } + + if (startBtn) { + startBtn.disabled = true; + startBtn.style.opacity = '0.5'; + startBtn.style.cursor = 'not-allowed'; + startBtn.title = 'GPU benchmark libraries not installed'; + } + + typeButtons.forEach(btn => { + btn.disabled = true; + btn.style.opacity = '0.5'; + btn.style.cursor = 'not-allowed'; + }); + + modeButtons.forEach(btn => { + btn.disabled = true; + btn.style.opacity = '0.5'; + btn.style.cursor = 'not-allowed'; + }); + } + } catch (error) { + console.error('Error loading features:', error); + } +} + +fetchStatus(); +loadBaseline(); +loadFeatures(); +setInterval(tick, 1000); diff --git a/monitor/api/static/style.css b/monitor/api/static/style.css new file mode 100644 index 0000000..0100125 --- /dev/null +++ b/monitor/api/static/style.css @@ -0,0 +1,374 @@ +@import url('https://fonts.googleapis.com/css2?family=JetBrains+Mono:wght@300;400;600;700&display=swap'); + +:root { + --bg-primary: #1a1a1a; + 
--bg-secondary: #2a2a2a; + --bg-tertiary: #3a3a3a; + --text-primary: #f0f0f0; + --text-secondary: #a0a0a0; + --accent-green: #76b900; + --accent-blue: #00a0ff; + --accent-yellow: #ffc107; + --accent-red: #dc3545; + --border-color: #4a4a4a; +} + +* { margin: 0; padding: 0; box-sizing: border-box; } + +body { + font-family: 'JetBrains Mono', 'Consolas', 'Monaco', monospace; + background: var(--bg-primary); + color: var(--text-primary); + line-height: 1.6; +} + +.container { max-width: 1400px; margin: 0 auto; padding: 20px; } + +header { + background: var(--accent-green); + padding: 20px 30px; + border-radius: 12px; + margin-bottom: 20px; + display: flex; + justify-content: space-between; + align-items: center; +} + +header h1 { + font-size: 1.5em; + color: #000; +} + +.header-right { + display: flex; + gap: 15px; + align-items: center; +} + +.update-btn { + padding: 8px 16px; + background: var(--accent-blue); + border: none; + border-radius: 6px; + color: white; + cursor: pointer; + font-weight: bold; + transition: all 0.2s; + position: relative; +} + +.update-btn:hover { + opacity: 0.8; +} + +.update-btn:disabled { + opacity: 0.5; + cursor: not-allowed; +} + +.update-btn.success { + background: var(--accent-green); + cursor: default; +} + +.update-btn.error { + background: var(--accent-red); +} + +.update-btn::before { + content: attr(data-tooltip); + position: absolute; + bottom: 100%; + right: 0; + background: rgba(0, 0, 0, 0.9); + color: white; + padding: 8px 12px; + border-radius: 6px; + font-size: 0.85em; + white-space: nowrap; + opacity: 0; + pointer-events: none; + transition: opacity 0.2s; + margin-bottom: 5px; +} + +.update-btn:hover::before { + opacity: 1; +} + +.tab { + padding: 10px 20px; + background: var(--bg-secondary); + border: 1px solid var(--border-color); + border-radius: 8px 8px 0 0; + cursor: pointer; + color: var(--text-secondary); + transition: all 0.2s; + position: relative; +} + +.tab:hover { background: var(--bg-tertiary); } +.tab.active { 
+ background: var(--accent-green); + color: #000; + border-color: var(--accent-green); +} + +.tab.disabled { + opacity: 0.5; + cursor: not-allowed; +} + +.tab.disabled::after { + content: attr(data-tooltip); + position: absolute; + bottom: 100%; + left: 50%; + transform: translateX(-50%); + background: rgba(0, 0, 0, 0.9); + color: white; + padding: 8px 12px; + border-radius: 6px; + font-size: 0.85em; + white-space: nowrap; + opacity: 0; + pointer-events: none; + transition: opacity 0.2s; + margin-bottom: 5px; +} + +.tab.disabled:hover::after { + opacity: 1; +} + +.status-badge { + padding: 6px 16px; + border-radius: 20px; + font-weight: bold; + text-transform: uppercase; + font-size: 0.85em; + color: #000; + position: relative; + cursor: default; +} + +.status-badge::before { + content: attr(data-tooltip); + position: absolute; + bottom: 100%; + right: 0; + background: rgba(0, 0, 0, 0.9); + color: white; + padding: 8px 12px; + border-radius: 6px; + font-size: 0.85em; + white-space: nowrap; + opacity: 0; + pointer-events: none; + transition: opacity 0.2s; + margin-bottom: 5px; +} + +.status-badge:hover::before { + opacity: 1; +} + +.status-healthy { background: var(--accent-blue); } +.status-info { background: var(--accent-green); } +.status-warning { background: var(--accent-yellow); } + +/* Tabs */ +.tabs { + display: flex; + gap: 5px; + margin-bottom: 20px; + border-bottom: 2px solid var(--border-color); + padding-bottom: 10px; +} + +.tab { + padding: 10px 20px; + background: var(--bg-secondary); + border: 1px solid var(--border-color); + border-radius: 8px 8px 0 0; + cursor: pointer; + color: var(--text-secondary); + transition: all 0.2s; +} + +.tab:hover { background: var(--bg-tertiary); } +.tab.active { + background: var(--accent-green); + color: #000; + border-color: var(--accent-green); +} + +.tab-content { display: none; } +.tab-content.active { display: block; } + +/* Cards */ +.grid { + display: grid; + grid-template-columns: repeat(auto-fit, 
minmax(300px, 1fr)); + gap: 20px; + margin-bottom: 20px; +} + +.card { + background: var(--bg-secondary); + border: 1px solid var(--border-color); + border-radius: 12px; + padding: 20px; +} + +.card h2 { + color: var(--accent-green); + margin-bottom: 15px; + font-size: 1.1em; +} + +.gpu-card { + background: var(--bg-tertiary); + border-radius: 8px; + padding: 15px; + margin-bottom: 10px; +} + +.gpu-header { + display: flex; + justify-content: space-between; + margin-bottom: 10px; +} + +.gpu-name { font-weight: bold; } +.gpu-temp { color: var(--accent-yellow); } +.gpu-temp.hot { color: var(--accent-red); } + +.progress-bar { + background: var(--bg-secondary); + border-radius: 4px; + height: 8px; + margin: 8px 0; + overflow: hidden; +} + +.progress-fill { + height: 100%; + background: var(--accent-green); + transition: width 0.3s ease; +} + +.progress-fill.warn { background: var(--accent-yellow); } +.progress-fill.crit { background: var(--accent-red); } + +.metric-row { + display: flex; + justify-content: space-between; + padding: 5px 0; + border-bottom: 1px solid var(--border-color); +} + +.metric-label { color: var(--text-secondary); font-size: 0.9em; } +.metric-value { font-weight: bold; } + +/* Process table */ +.process-table { + width: 100%; + border-collapse: collapse; + font-size: 0.9em; +} + +.process-table th, .process-table td { + padding: 10px; + text-align: left; + border-bottom: 1px solid var(--border-color); +} + +.process-table th { + background: var(--bg-tertiary); + color: var(--accent-green); +} + +.process-table tr:hover { background: var(--bg-tertiary); } + +/* Chart */ +.chart-container { + background: var(--bg-secondary); + border-radius: 12px; + padding: 20px; + margin-bottom: 20px; +} + +.chart-controls { + display: flex; + gap: 10px; + margin-bottom: 15px; + flex-wrap: wrap; +} + +.chart-controls select, .chart-controls button { + padding: 8px 15px; + background: var(--bg-tertiary); + border: 1px solid var(--border-color); + border-radius: 
6px; + color: var(--text-primary); + cursor: pointer; +} + +.chart-controls button:hover { + background: var(--accent-green); + color: #000; +} + +/* Export */ +.export-section { + display: flex; + gap: 15px; + flex-wrap: wrap; +} + +.export-btn { + padding: 12px 25px; + background: var(--accent-green); + border: none; + border-radius: 8px; + color: #000; + cursor: pointer; + font-size: 1em; + font-weight: bold; +} + +.export-btn:hover { opacity: 0.9; } +.export-btn.secondary { + background: var(--bg-tertiary); + border: 1px solid var(--border-color); + color: var(--text-primary); +} + +footer { + text-align: center; + padding: 20px; + color: var(--text-secondary); + font-size: 0.9em; +} + +.alert-item { + background: var(--bg-tertiary); + border-left: 4px solid var(--accent-yellow); + padding: 10px 15px; + margin-bottom: 10px; + border-radius: 0 8px 8px 0; +} + +.mode-btn, .type-btn { + padding: 10px 20px; + background: var(--bg-tertiary); + border: 1px solid var(--border-color); + border-radius: 6px; + color: var(--text-primary); + cursor: pointer; + transition: all 0.2s; +} + +.mode-btn:hover, .type-btn:hover { background: var(--bg-secondary); border-color: var(--accent-green); } +.mode-btn.active, .type-btn.active { background: var(--accent-green); color: #000; border-color: var(--accent-green); } diff --git a/monitor/api/templates/benchmark_cli.py b/monitor/api/templates/benchmark_cli.py deleted file mode 100644 index e69de29..0000000 diff --git a/monitor/api/templates/index.html b/monitor/api/templates/index.html index 8c6759f..954d22c 100644 --- a/monitor/api/templates/index.html +++ b/monitor/api/templates/index.html @@ -4,257 +4,10 @@| PID | Process | GPU | -GPU Memory | +GPU Util % | User | @@ -483,641 +248,6 @@