solusops · Dec 6, 2025
diff --git a/README.md b/README.md
@@ -1,156 +1,284 @@
 # Cluster Health Monitor
 
-A lightweight, real-time monitoring tool for NVIDIA GPUs. Track GPU utilization, memory, temperature, and power during ML training or any GPU workload.
+Real-time GPU and system monitoring with a web dashboard and a CLI. Includes GPU stress testing with auto-scaling workloads and per-GPU performance baselines.
 
-## System Requirements
+## Features
 
-### Hardware
+### Monitoring
+- Real-time GPU metrics (utilization, memory, temperature, power)
+- System metrics (CPU, memory, disk I/O)
+- Web dashboard with live charts
+- Terminal interface with auto-refresh
+- Historical data storage and alerting
 
-- NVIDIA GPU (GeForce, RTX, Quadro, Tesla, etc.)
+### GPU Benchmarking
+- GEMM (matrix multiplication) stress test
+- Particle simulation workload
+- Auto-scaling stress test (dynamically increases load to 98% GPU utilization)
+- Performance baseline tracking per GPU and benchmark type
+- Multiple test modes: Quick (15s), Standard (60s), Extended (180s), Stress Test, Custom
 
-### Software
+## Requirements
 
-- Windows 10/11 or Linux (Ubuntu 18.04+)
-- Python 3.8 or higher
-- NVIDIA Driver 450.0 or higher
+### Core Monitoring (Always Available)
+- Python 3.8+
+- NVIDIA GPU with drivers installed
+- `nvidia-smi` command available
 
-### Verify Your Setup
+### GPU Benchmarking (Optional)
+- CUDA Toolkit 12.0+ or compatible
+- One of:
+  - CuPy: `pip install cupy-cuda12x` (or appropriate CUDA version)
+  - PyTorch: `pip install torch --index-url https://download.pytorch.org/whl/cu121`
 
-Before installing, confirm your GPU is detected:
+## Installation
 
+### 1. Clone Repository
 ```bash
-nvidia-smi
+git clone https://github.com/DataBoySu/cluster-monitor.git
+cd cluster-monitor
 ```
 
-You should see your GPU listed with driver version. If this command fails, install NVIDIA drivers first.
+### 2. Create Virtual Environment
+```bash
+python -m venv .venv
+```
 
-## Installation
+Activate:
+- Windows: `.venv\Scripts\activate`
+- Linux/Mac: `source .venv/bin/activate`
 
-### Step 1: Clone the Repository
+### 3. Install Dependencies
 
-```git
-git clone https://github.com/DataBoySu/cluster-monitor.git
-cd cluster-monitor
+**Basic Monitoring:**
+```bash
+pip install -r requirements.txt
 ```
 
-### Step 2: Create Virtual Environment
-
-Windows:
-
-```python
-python -m venv venv
-venv\Scripts\activate
+**With GPU Benchmarking (CuPy):**
+```bash
+pip install -r requirements.txt
+pip install cupy-cuda12x  # Adjust for your CUDA version
 ```
 
-Linux/macOS:
+**With GPU Benchmarking (PyTorch):**
+```bash
+pip install -r requirements.txt
+pip install torch --index-url https://download.pytorch.org/whl/cu121
+```
 
-```python
-python3 -m venv venv
-source venv/bin/activate
+### 4. Verify Installation
+```bash
+python health_monitor.py --help
 ```
 
-### Step 3: Install Dependencies
+## Usage
 
-```python
-pip install -r requirements.txt
+### Web Dashboard (Recommended)
+```bash
+python health_monitor.py monitor --web
 ```
 
-### Step 4: Verify Installation
+Access at: http://localhost:8090
+
+Features:
+- Real-time GPU/system metrics
+- Interactive benchmark controls
+- Live performance charts
+- Historical data visualization
 
-```python
-python health_monitor.py --once
+### Terminal Dashboard
+```bash
+python health_monitor.py monitor
 ```
 
-This should print your GPU information once and exit.
+Displays live metrics in the terminal with auto-refresh.
 
-## Usage
+### CLI Benchmark
+```bash
+# Quick 15-second test
+python health_monitor.py benchmark --mode quick
 
-### CLI Dashboard (Terminal)
+# Standard 60-second test
+python health_monitor.py benchmark --mode standard
 
-Live monitoring in your terminal with auto-refresh:
+# Stress test with auto-scaling (pushes GPU to 98% util)
+python health_monitor.py benchmark --mode stress-test --type particle
 
-```python
-python health_monitor.py --cli
-```
+# Extended 180-second burn-in
+python health_monitor.py benchmark --mode extended
 
-Press Ctrl+C to exit.
+# Custom configuration
+python health_monitor.py benchmark --mode custom --duration 120 --temp-limit 85
+```
 
-### Single Snapshot
+## Benchmark Modes
 
-Print GPU info once and exit:
+| Mode | Duration | Workload | Auto-Scale | Use Case |
+|------|----------|----------|------------|----------|
+| Quick | 15s | Fixed | No | Quick baseline check |
+| Standard | 60s | Fixed | No | Standard benchmark |
+| Extended | 180s | Fixed | No | Long-term stability |
+| Stress Test | 60s | Dynamic | Yes | Maximum GPU load testing |
+| Custom | Variable | Fixed | Optional | User-defined parameters |
 
-```python
-python health_monitor.py --once
-```
+### Auto-Scaling Stress Test
 
-### Web Dashboard (Optional)
+The Stress Test mode automatically increases workload intensity:
 
-Start a web server with browser-based dashboard:
+1. Starts with baseline workload (2048x2048 GEMM or 100K particles)
+2. Every 2 seconds, checks GPU utilization
+3. Scales workload aggressively if GPU util < target:
+   - `<70% util`: 2.0x scaling
+   - `70-85% util`: 1.5x scaling
+   - `85-93% util`: 1.2x scaling
+   - `>93% util`: Target reached
+4. Continues scaling up to 15 times or until 98% GPU utilization is reached
 
-```python
-python health_monitor.py --web --port 8888
+Example progression:
+```
+100K particles → 200K → 400K → 800K → 1.2M → 1.8M → 2.2M → 2.6M (94% GPU util)
 ```
 
-Then open <http://localhost:8888> in your browser.
+## Benchmark Types
 
-## What You See
+### GEMM (Matrix Multiplication)
+Dense matrix multiplication for maximum compute stress. Measures TFLOPS.
+
+```bash
+python health_monitor.py benchmark --type gemm --mode stress-test
+```
 
-The monitor displays:
+### Particle Simulation
+Vectorized particle physics simulation with collision detection. Measures steps/second.
 
-- GPU utilization (%)
-- Memory usage (used/total GB)
-- Temperature (C)
-- Power draw (W)
-- CPU and RAM usage (system)
+```bash
+python health_monitor.py benchmark --type particle --mode stress-test
+```
 
 ## Configuration
 
-Edit `config.yaml` to customize:
+Edit `config.yaml`:
 
 ```yaml
 monitoring:
-  interval_seconds: 5    # How often to refresh
+  interval_seconds: 5
+  history_retention_hours: 168
 
 alerts:
-  gpu_temperature_warn: 80     # Warn at 80C
-  gpu_temperature_critical: 90 # Critical at 90C
+  gpu_temperature_warn: 80
+  gpu_temperature_critical: 90
+  gpu_memory_usage_warn: 90
+
+web:
+  host: 0.0.0.0
+  port: 8090
+
+storage:
+  path: ./metrics.db
 ```
 
-## Troubleshooting
+## Project Structure
 
-### "No NVIDIA GPU detected"
+```
+cluster-health-monitor/
+├── monitor/
+│   ├── benchmark/
+│   │   ├── config.py          # Benchmark configuration
+│   │   ├── storage.py         # Baseline storage (SQLite)
+│   │   ├── workloads.py       # GPU workloads (GEMM/Particle)
+│   │   └── runner.py          # Benchmark orchestration
+│   ├── collectors/
+│   │   ├── gpu.py             # GPU metrics via nvidia-smi
+│   │   ├── system.py          # CPU, memory, disk
+│   │   └── network.py         # Network info
+│   ├── storage/
+│   │   └── sqlite.py          # Metrics persistence
+│   ├── api/
+│   │   ├── server.py          # FastAPI web server
+│   │   └── templates/
+│   │       └── index.html     # Web dashboard
+│   └── cli/
+│       └── benchmark_cli.py   # CLI commands
+├── config.yaml                # Configuration
+├── requirements.txt           # Dependencies
+└── health_monitor.py          # Main entry point
+```
+
+## API Endpoints
+
+When running the web server (`--web`):
 
-- Run `nvidia-smi` to verify driver is installed
-- Make sure you have a discrete NVIDIA GPU (not Intel/AMD integrated)
+- `GET /` - Web dashboard
+- `GET /api/status` - Current metrics
+- `GET /api/history` - Historical data
+- `POST /api/benchmark/start` - Start benchmark
+- `GET /api/benchmark/status` - Benchmark progress
+- `POST /api/benchmark/stop` - Stop benchmark
+- `GET /api/benchmark/results` - Get results
+- `GET /api/benchmark/baseline` - Get baseline for GPU
 
-### "pynvml not found" or "ModuleNotFoundError"
+## Troubleshooting
 
-- Make sure virtual environment is activated
-- Run: `pip install pynvml`
+### "nvidia-smi not found"
+- Install NVIDIA drivers
+- Add nvidia-smi to PATH
+- Verify by running `nvidia-smi` in a terminal
 
-### "rich not found"
+### "No CUDA libraries found"
+Benchmarking features are disabled when no CUDA library is found. Install CuPy or PyTorch to enable them.
 
-- Run: `pip install rich`
+### Web dashboard not loading data
+- Check terminal for errors
+- Verify port 8090 is available
+- Check firewall settings
+- Try: `http://127.0.0.1:8090`
 
-### Web dashboard not loading
+### Benchmark not scaling GPU to 98%
+- Increase `max_scales` in `monitor/benchmark/runner.py`
+- Check that the GPU has free memory
+- Verify no other GPU workloads running
+- Try different benchmark type (GEMM vs Particle)
 
-- Install web dependencies: `pip install fastapi uvicorn`
-- Check if port 8080 is available
+## Performance Tips
 
-### High CPU usage
+1. **Close other GPU applications** during benchmarking
+2. **Ensure adequate cooling** for stress tests
+3. **Monitor temperatures** - tests will stop at temp limit
+4. **Use Stress Test mode** to find maximum GPU performance
+5. **Run Extended mode** for stability validation
 
-- Increase refresh interval in config.yaml
+## Development
 
-## Dependencies
+### Run Tests
+```bash
+pytest tests/
+```
 
-- pynvml - NVIDIA GPU metrics
-- psutil - System metrics (CPU, RAM, disk)
-- pyyaml - Configuration file parsing
-- click - Command line interface
-- rich - Terminal UI
-- fastapi - REST API
-- uvicorn - Web server
+### Code Structure
+- Modular design: config, storage, workloads, runner separated
+- Clean API exports via `__init__.py`
+- Type hints throughout
+- Comprehensive error handling
+
+### Contributing
+1. Fork repository
+2. Create feature branch
+3. Add tests for new features
+4. Submit pull request
 
 ## License
 
-MIT License
+MIT License - See LICENSE file
+
+## Acknowledgments
+
+- Built with FastAPI, Rich, Chart.js
+- GPU compute via CuPy and PyTorch
+- Inspired by nvidia-smi and GPU monitoring tools
+
+## Support
+
+- Issues: GitHub Issues
+- Documentation: This README
+- CUDA setup: https://developer.nvidia.com/cuda-downloads
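
Note on the "Auto-Scaling Stress Test" section added in this diff: the tiered scaling rule it describes can be sketched in a few lines of Python. This is only an illustration of the README's description, not the code in `runner.py`. The names `scale_factor`, `simulate_scaling`, and the `read_util` callback are hypothetical, the handling of the exact boundaries (70, 85, 93) is an assumption because the README leaves them ambiguous, and the 2-second polling delay is omitted.

```python
from typing import Callable, List


def scale_factor(gpu_util: float) -> float:
    """Return the workload multiplier for a GPU utilization given in percent."""
    if gpu_util < 70:
        return 2.0
    if gpu_util < 85:   # assumption: exactly 70 falls in the 1.5x tier
        return 1.5
    if gpu_util <= 93:  # assumption: exactly 85 and 93 fall in the 1.2x tier
        return 1.2
    return 1.0          # above 93%: the README treats the target as reached


def simulate_scaling(
    start_size: int,
    read_util: Callable[[int], float],
    max_scales: int = 15,
) -> List[int]:
    """Apply the tiered rule until the target tier is hit or max_scales is used up.

    read_util is a hypothetical stand-in for a real utilization reading:
    it maps the current workload size to a utilization percentage.
    """
    sizes = [start_size]
    for _ in range(max_scales):
        factor = scale_factor(read_util(sizes[-1]))
        if factor == 1.0:
            break  # target reached, stop scaling
        sizes.append(int(sizes[-1] * factor))
    return sizes


if __name__ == "__main__":
    # Toy utilization model for demonstration only: 1% per 10K particles, capped at 99%.
    print(simulate_scaling(100_000, lambda n: min(99.0, n / 10_000)))
```

With the toy utilization model in the `__main__` block, the workload doubles three times, takes one 1.5x step, and then stops once the simulated utilization exceeds 93%.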