Merged
300 changes: 214 additions & 86 deletions README.md
@@ -1,156 +1,284 @@
# Cluster Health Monitor

A lightweight, real-time monitoring tool for NVIDIA GPUs. Track GPU utilization, memory, temperature, and power during ML training or any GPU workload.
Real-time GPU and system monitoring with a web dashboard and a CLI. Includes GPU stress testing with auto-scaling workloads and per-GPU performance baselines.

## System Requirements
## Features

### Hardware
### Monitoring
- Real-time GPU metrics (utilization, memory, temperature, power)
- System metrics (CPU, memory, disk I/O)
- Web dashboard with live charts
- Terminal interface with auto-refresh
- Historical data storage and alerting

- NVIDIA GPU (GeForce, RTX, Quadro, Tesla, etc.)
### GPU Benchmarking
- GEMM (matrix multiplication) stress test
- Particle simulation workload
- Auto-scaling stress test (dynamically increases load to 98% GPU utilization)
- Performance baseline tracking per GPU and benchmark type
- Multiple test modes: Quick (15s), Standard (60s), Extended (180s), Stress Test, Custom

### Software
## Requirements

- Windows 10/11 or Linux (Ubuntu 18.04+)
- Python 3.8 or higher
- NVIDIA Driver 450.0 or higher
### Core Monitoring (Always Available)
- Python 3.8+
- NVIDIA GPU with drivers installed
- `nvidia-smi` command available

### Verify Your Setup
### GPU Benchmarking (Optional)
- CUDA Toolkit 12.0+ or compatible
- One of:
  - CuPy: `pip install cupy-cuda12x` (or appropriate CUDA version)
  - PyTorch: `pip install torch --index-url https://download.pytorch.org/whl/cu121`

Before installing, confirm your GPU is detected:
## Installation

### 1. Clone Repository
```bash
nvidia-smi
git clone https://github.com/DataBoySu/cluster-monitor.git
cd cluster-monitor
```

You should see your GPU listed with driver version. If this command fails, install NVIDIA drivers first.
### 2. Create Virtual Environment
```bash
python -m venv .venv
```

## Installation
Activate:
- Windows: `.venv\Scripts\activate`
- Linux/Mac: `source .venv/bin/activate`

### Step 1: Clone the Repository
### 3. Install Dependencies

```git
git clone https://github.com/DataBoySu/cluster-monitor.git
cd cluster-monitor
**Basic Monitoring:**
```bash
pip install -r requirements.txt
```

### Step 2: Create Virtual Environment

Windows:

```python
python -m venv venv
venv\Scripts\activate
**With GPU Benchmarking (CuPy):**
```bash
pip install -r requirements.txt
pip install cupy-cuda12x # Adjust for your CUDA version
```

Linux/macOS:
**With GPU Benchmarking (PyTorch):**
```bash
pip install -r requirements.txt
pip install torch --index-url https://download.pytorch.org/whl/cu121
```

```python
python3 -m venv venv
source venv/bin/activate
### 4. Verify Installation
```bash
python health_monitor.py --help
```

### Step 3: Install Dependencies
## Usage

```python
pip install -r requirements.txt
### Web Dashboard (Recommended)
```bash
python health_monitor.py monitor --web
```

### Step 4: Verify Installation
Access at: http://localhost:8090

Features:
- Real-time GPU/system metrics
- Interactive benchmark controls
- Live performance charts
- Historical data visualization

```python
python health_monitor.py --once
### Terminal Dashboard
```bash
python health_monitor.py monitor
```

This should print your GPU information once and exit.
Displays live metrics in the terminal with auto-refresh.

## Usage
### CLI Benchmark
```bash
# Quick 15-second test
python health_monitor.py benchmark --mode quick

### CLI Dashboard (Terminal)
# Standard 60-second test
python health_monitor.py benchmark --mode standard

Live monitoring in your terminal with auto-refresh:
# Stress test with auto-scaling (pushes GPU to 98% util)
python health_monitor.py benchmark --mode stress-test --type particle

```python
python health_monitor.py --cli
```
# Extended 180-second burn-in
python health_monitor.py benchmark --mode extended

Press Ctrl+C to exit.
# Custom configuration
python health_monitor.py benchmark --mode custom --duration 120 --temp-limit 85
```

### Single Snapshot
## Benchmark Modes

Print GPU info once and exit:
| Mode | Duration | Workload | Auto-Scale | Use Case |
|------|----------|----------|------------|----------|
| Quick | 15s | Fixed | No | Quick baseline check |
| Standard | 60s | Fixed | No | Standard benchmark |
| Extended | 180s | Fixed | No | Long-term stability |
| Stress Test | 60s | Dynamic | Yes | Maximum GPU load testing |
| Custom | Variable | Fixed | Optional | User-defined parameters |

```python
python health_monitor.py --once
```
### Auto-Scaling Stress Test

### Web Dashboard (Optional)
The Stress Test mode automatically increases workload intensity:

Start a web server with browser-based dashboard:
1. Starts with a baseline workload (2048x2048 GEMM or 100K particles)
2. Every 2 seconds, checks GPU utilization
3. Scales the workload aggressively while GPU utilization is below the target:
   - `<70% util`: 2.0x scaling
   - `70-85% util`: 1.5x scaling
   - `85-93% util`: 1.2x scaling
   - `>93% util`: Target reached
4. Continues scaling up to 15 times or until 98% GPU utilization is achieved
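
The scaling schedule above can be sketched roughly as follows. This is an illustration only, not the code in `runner.py`: `scale_factor`, `auto_scale`, and the `read_util` callback are hypothetical names, the handling of the exact 93% boundary is an assumption, and the 2-second wait between checks is left out.

```python
from typing import Callable, List


def scale_factor(gpu_util: float) -> float:
    """Map a GPU utilization reading (percent) to a workload multiplier."""
    if gpu_util < 70:
        return 2.0
    if gpu_util < 85:
        return 1.5
    if gpu_util <= 93:  # assumption: exactly 93% still counts as the 85-93 band
        return 1.2
    return 1.0  # target reached, no further scaling


def auto_scale(
    read_util: Callable[[int], float],
    start_size: int = 100_000,
    max_scales: int = 15,
) -> List[int]:
    """Return the sequence of workload sizes tried during a stress test.

    read_util is a hypothetical callback that runs the workload at the
    given size and returns the observed GPU utilization in percent.
    """
    sizes = [start_size]
    for _ in range(max_scales):
        factor = scale_factor(read_util(sizes[-1]))
        if factor == 1.0:
            break  # utilization is above the scaling bands
        sizes.append(int(sizes[-1] * factor))
    return sizes
```

Passing a fake `read_util` makes it possible to exercise the schedule without a GPU, which is one reason to keep the scaling decision separate from the measurement.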

```python
python health_monitor.py --web --port 8888
Example progression:
```
100K particles → 200K → 400K → 800K → 1.2M → 1.8M → 2.2M → 2.6M (94% GPU util)
```

Then open <http://localhost:8888> in your browser.
## Benchmark Types

## What You See
### GEMM (Matrix Multiplication)
Dense matrix multiplication for maximum compute stress. Measures TFLOPS.

```bash
python health_monitor.py benchmark --type gemm --mode stress-test
```
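
For reference, the TFLOPS figure of a GEMM benchmark is conventionally derived from the operation count of a dense N x N matrix multiplication, which is about 2N³ floating-point operations. The helper below is a sketch of that arithmetic only; `gemm_tflops` is a hypothetical name, and the tool's own calculation may differ.

```python
def gemm_tflops(n: int, seconds: float) -> float:
    """Estimate TFLOPS for one n x n by n x n matrix multiplication.

    Uses the conventional count of 2 * n**3 floating-point operations
    (one multiply and one add per inner-product term).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    flops = 2 * n**3
    return flops / seconds / 1e12


# A 2048x2048 GEMM that takes 1 ms corresponds to roughly 17.18 TFLOPS.
print(round(gemm_tflops(2048, 0.001), 2))
```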

The monitor displays:
### Particle Simulation
Vectorized particle physics simulation with collision detection. Measures steps/second.

- GPU utilization (%)
- Memory usage (used/total GB)
- Temperature (C)
- Power draw (W)
- CPU and RAM usage (system)
```bash
python health_monitor.py benchmark --type particle --mode stress-test
```

## Configuration

Edit `config.yaml` to customize:
Edit `config.yaml`:

```yaml
monitoring:
  interval_seconds: 5 # How often to refresh
  interval_seconds: 5
  history_retention_hours: 168

alerts:
  gpu_temperature_warn: 80 # Warn at 80C
  gpu_temperature_critical: 90 # Critical at 90C
  gpu_temperature_warn: 80
  gpu_temperature_critical: 90
  gpu_memory_usage_warn: 90

web:
  host: 0.0.0.0
  port: 8090

storage:
  path: ./metrics.db
```
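
To illustrate how the `alerts` thresholds might be applied, here is a minimal sketch. `classify_temperature` is a hypothetical helper, and treating a reading equal to a threshold as already crossing it is an assumption; the monitor's actual alerting code may behave differently.

```python
from typing import Dict

# Thresholds copied from the `alerts` section of config.yaml above.
ALERTS: Dict[str, int] = {
    "gpu_temperature_warn": 80,
    "gpu_temperature_critical": 90,
}


def classify_temperature(temp_c: float, alerts: Dict[str, int] = ALERTS) -> str:
    """Return 'critical', 'warn', or 'ok' for a GPU temperature in Celsius."""
    if temp_c >= alerts["gpu_temperature_critical"]:
        return "critical"
    if temp_c >= alerts["gpu_temperature_warn"]:
        return "warn"
    return "ok"
```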

## Troubleshooting
## Project Structure

### "No NVIDIA GPU detected"
```
cluster-health-monitor/
├── monitor/
│   ├── benchmark/
│   │   ├── config.py          # Benchmark configuration
│   │   ├── storage.py         # Baseline storage (SQLite)
│   │   ├── workloads.py       # GPU workloads (GEMM/Particle)
│   │   └── runner.py          # Benchmark orchestration
│   ├── collectors/
│   │   ├── gpu.py             # GPU metrics via nvidia-smi
│   │   ├── system.py          # CPU, memory, disk
│   │   └── network.py         # Network info
│   ├── storage/
│   │   └── sqlite.py          # Metrics persistence
│   ├── api/
│   │   ├── server.py          # FastAPI web server
│   │   └── templates/
│   │       └── index.html     # Web dashboard
│   └── cli/
│       └── benchmark_cli.py   # CLI commands
├── config.yaml                # Configuration
├── requirements.txt           # Dependencies
└── health_monitor.py          # Main entry point
```
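
The tree lists `collectors/gpu.py` as reading GPU metrics via `nvidia-smi`. As a rough sketch of what such a collector could look like (not the project's actual code), the snippet below uses the `--query-gpu` and `--format=csv,noheader,nounits` options of `nvidia-smi`. The names `QUERY_FIELDS`, `parse_gpu_line`, and `query_gpus` are hypothetical.

```python
import shutil
import subprocess
from typing import Dict, List, Optional

QUERY_FIELDS = [
    "utilization.gpu",
    "memory.used",
    "memory.total",
    "temperature.gpu",
    "power.draw",
]


def parse_gpu_line(line: str) -> Dict[str, Optional[float]]:
    """Parse one CSV line of nvidia-smi output into a field -> value dict."""
    result: Dict[str, Optional[float]] = {}
    for field, raw in zip(QUERY_FIELDS, line.split(",")):
        try:
            result[field] = float(raw.strip())
        except ValueError:
            # Some GPUs report values such as "[N/A]" for unsupported fields.
            result[field] = None
    return result


def query_gpus() -> Optional[List[Dict[str, Optional[float]]]]:
    """Return one dict per GPU, or None if nvidia-smi is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return None
    output = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_gpu_line(line) for line in output.splitlines() if line.strip()]
```

Keeping the parsing separate from the subprocess call lets the parser be tested on machines without a GPU.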

## API Endpoints

When running the web server (`--web`):

- Run `nvidia-smi` to verify driver is installed
- Make sure you have a discrete NVIDIA GPU (not Intel/AMD integrated)
- `GET /` - Web dashboard
- `GET /api/status` - Current metrics
- `GET /api/history` - Historical data
- `POST /api/benchmark/start` - Start benchmark
- `GET /api/benchmark/status` - Benchmark progress
- `POST /api/benchmark/stop` - Stop benchmark
- `GET /api/benchmark/results` - Get results
- `GET /api/benchmark/baseline` - Get baseline for GPU
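
As a hedged example of calling these endpoints from a script, the sketch below uses only the Python standard library. It assumes the server runs on the default `http://localhost:8090` and that the endpoints return JSON, which is typical for FastAPI but not stated here; the response fields are not documented in this README, so the example prints the payload as-is. `endpoint`, `get_json`, and `main` are hypothetical names.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:8090"  # default port from config.yaml


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an API path without doubling slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def get_json(path: str, timeout: float = 5.0):
    """GET an endpoint and decode the body as JSON (assumed format)."""
    with urlopen(endpoint(path), timeout=timeout) as response:
        return json.load(response)


def main() -> None:
    # Requires `python health_monitor.py monitor --web` to be running.
    print(get_json("/api/status"))
    print(get_json("/api/benchmark/status"))
```

Call `main()` while the dashboard is running to print the current metrics and benchmark progress.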

### "pynvml not found" or "ModuleNotFoundError"
## Troubleshooting

- Make sure virtual environment is activated
- Run: `pip install pynvml`
### "nvidia-smi not found"
- Install NVIDIA drivers
- Add nvidia-smi to PATH
- Verify: `nvidia-smi` in terminal

### "rich not found"
### "No CUDA libraries found"
Benchmarking features are disabled when no CUDA library is installed. Install CuPy or PyTorch.
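
A quick way to see which of the two libraries is installed in the active environment is sketched below. `detect_backend` is a hypothetical helper, not part of the tool. It only checks that the package can be found, so a CPU-only PyTorch build would still be reported as `torch`.

```python
import importlib.util
from typing import Optional


def detect_backend() -> Optional[str]:
    """Return 'cupy' or 'torch' if that package is installed, else None."""
    for name in ("cupy", "torch"):
        # find_spec locates the package without importing it.
        if importlib.util.find_spec(name) is not None:
            return name
    return None


print(detect_backend() or "No CUDA libraries found")
```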

- Run: `pip install rich`
### Web dashboard not loading data
- Check terminal for errors
- Verify port 8090 is available
- Check firewall settings
- Try: `http://127.0.0.1:8090`

### Web dashboard not loading
### Benchmark not scaling GPU to 98%
- Increase `max_scales` in `runner.py`
- Check GPU has available memory
- Verify no other GPU workloads running
- Try different benchmark type (GEMM vs Particle)

- Install web dependencies: `pip install fastapi uvicorn`
- Check if port 8080 is available
## Performance Tips

### High CPU usage
1. **Close other GPU applications** during benchmarking
2. **Adequate cooling** for stress tests
3. **Monitor temperatures** - tests will stop at temp limit
4. **Use Stress Test mode** to find maximum GPU performance
5. **Run Extended mode** for stability validation

- Increase refresh interval in config.yaml
## Development

## Dependencies
### Run Tests
```bash
pytest tests/
```

- pynvml - NVIDIA GPU metrics
- psutil - System metrics (CPU, RAM, disk)
- pyyaml - Configuration file parsing
- click - Command line interface
- rich - Terminal UI
- fastapi - REST API
- uvicorn - Web server
### Code Structure
- Modular design: config, storage, workloads, runner separated
- Clean API exports via `__init__.py`
- Type hints throughout
- Comprehensive error handling

### Contributing
1. Fork repository
2. Create feature branch
3. Add tests for new features
4. Submit pull request

## License

MIT License
MIT License - See LICENSE file

## Acknowledgments

- Built with FastAPI, Rich, Chart.js
- GPU compute via CuPy and PyTorch
- Inspired by nvidia-smi and GPU monitoring tools

## Support

- Issues: GitHub Issues
- Documentation: This README
- CUDA setup: https://developer.nvidia.com/cuda-downloads