Live demo: uhop.dev
UHOP is an open hardware optimization platform that unifies GPU acceleration across CUDA, ROCm/HIP, Metal, OpenCL, and future architectures. It detects your hardware, dispatches each operation to the best available backend, can generate kernels with AI, validates them, and caches the fastest path for reuse — so developers write simple code that runs fast everywhere.
Key capabilities today:
- Automatic backend detection: Torch (CUDA/MPS/CPU), OpenCL (GPU/CPU), Triton (Linux), CPU fallback
- Drop‑in acceleration via the `@uhop.optimize("op")` decorator (e.g., matmul)
- AI kernel generation (OpenAI) for OpenCL/CUDA/Python/Triton with validation/smoke tests
- On‑disk caching of selected kernels/implementations per device
- Friendly CLI for hardware info, demos, AI codegen, and cache tools
- Optional Local Agent so the web portal can run on your hardware
Vision: a universal, community-driven runtime optimizer that makes high‑performance computing approachable, portable, and fun — across vendors and form factors.
Planned (see `issues/`): multi‑backend benchmarking/policies, correctness suites, distributed training loops for AI‑generated kernels, a richer dashboard, and tighter framework integrations (PyTorch/JAX).
The platform has four layers working together:
- Frontend (Vite + React) — live controls, real‑time logs, and benchmarks
- Backend (Node/Express + ws) — routes jobs to your Local Agent or server runtime
- Local Agent (Python) — runs UHOP operations on your machine securely
- UHOP Core (Python) — backends, optimizer, AI codegen/validation, caching
See also: `docs/architecture.svg` (source image) for sharing in blogs/slides.
At a glance, the request flow prefers the Local Agent when connected, and falls back to server‑side execution when not.
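In other words, routing is a simple preference check. A minimal sketch of that decision, with hypothetical names (`LocalAgent`, `route_job`, and `run_server_side` are illustrative, not UHOP's actual internals):

```python
from typing import Any, Callable, Optional

class LocalAgent:
    """Hypothetical stand-in for a connected UHOP Local Agent."""
    def is_connected(self) -> bool: ...
    def run(self, job: Any) -> Any: ...

def route_job(job: Any, agent: Optional[LocalAgent],
              run_server_side: Callable[[Any], Any]) -> Any:
    # Prefer the user's own hardware when the Local Agent is reachable;
    # otherwise fall back to server-side execution.
    if agent is not None and agent.is_connected():
        return agent.run(job)
    return run_server_side(job)
```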
## Prereqs
- Python 3.10+
- OS: Windows, macOS, or Linux
- Drivers/toolchains as applicable: CUDA (NVIDIA), OpenCL runtime (AMD/Intel/NVIDIA), Apple MPS (macOS)
- Optional: `OPENAI_API_KEY` for AI codegen
## Install

```bash
git clone https://github.com/sevenloops/uhop.git
cd uhop
pip install -e .            # install CLI `uhop`

# optional extras
pip install -e .[dev]       # tests & notebooks
pip install -e .[amd]       # ROCm Python tools
pip install -e .[nvidia]    # CuPy for CUDA
```
## Verify your setup

```bash
uhop info
uhop info --json
```
## Run a demo

```bash
# Matmul: naive Python vs UHOP‑optimized
uhop demo --size 192 --iters 3

# Fused Conv2D+ReLU (OpenCL). Choose a device if multiple are present:
uhop demo-conv2d-relu --h 128 --w 128 --c-in 3 --c-out 32 --k 3 --stride 1 --padding 1
uhop demo-conv2d-relu --ocl-device 0

# OpenCL elementwise add vs naive Python
python examples/compare_elementwise_add_opencl_vs_naive.py --size 2000000
```
## Integrate in your code

```python
from uhop import optimize

@optimize("matmul")
def my_matmul(a, b):
    # write the simplest correct version — UHOP will dispatch/accelerate
    import numpy as np
    return np.array(a) @ np.array(b)
```
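For reference, calling the decorated function is an ordinary Python call; a minimal usage sketch built on the `my_matmul` defined above (shapes are arbitrary):

```python
import numpy as np

a = np.random.rand(128, 128)
b = np.random.rand(128, 128)

# The call site stays plain Python; UHOP selects the backend behind the scenes.
c = my_matmul(a, b)
assert c.shape == (128, 128)
```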
## Environment knobs

- `UHOP_OPENCL_DEVICE_INDEX=<idx>` — default OpenCL device override
- `UHOP_STRICT_VALIDATE=1` — tighten AI‑kernel validation during codegen
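Both knobs are read from the process environment, so you can export them in your shell or set them from Python. A sketch (the import-ordering caveat is an assumption about when UHOP reads them):

```python
import os

# Assumption: UHOP reads these at import/first-dispatch time, so set them
# before importing uhop; exporting them in your shell works regardless.
os.environ["UHOP_OPENCL_DEVICE_INDEX"] = "0"
os.environ["UHOP_STRICT_VALIDATE"] = "1"

import uhop  # noqa: E402
```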
## AI kernel generation

```bash
# Generate OpenCL matmul, validate build, run smoke test
python -m uhop.cli ai-generate matmul --target opencl --validate --smoke

# Generate fused Conv2D+ReLU and benchmark vs current fused backend
python -m uhop.cli ai-generate-fused --stride 1 --padding 1
```
Expose a local HTTP API for demos/automation:
```bash
uhop web-api --host 0.0.0.0 --port 5824
# or
python -m uhop.web_api --host 0.0.0.0 --port 5824
```
Endpoints:

- `GET /health`
- `GET /info`
- `POST /demo/matmul` with `{ "size": 256, "iters": 3 }`
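As a quick client-side check, a sketch using the third-party `requests` package against a locally running server (assumes JSON responses; the exact response shape is not specified here):

```python
import requests  # pip install requests

BASE = "http://localhost:5824"

print(requests.get(f"{BASE}/health").json())
print(requests.get(f"{BASE}/info").json())

# Run a small matmul demo through the API.
resp = requests.post(f"{BASE}/demo/matmul", json={"size": 256, "iters": 3})
resp.raise_for_status()
print(resp.json())
```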
## Docker

```bash
docker build -t uhop-demo-api -f api.Dockerfile .
docker run --rm -p 5824:5824 uhop-demo-api
```
We’re building UHOP as a friendly, long‑term open platform. All experience levels welcome — and we especially invite:
- GPU engineers (CUDA/ROCm/Metal/OpenCL)
- Compiler/runtime developers (Triton/MLIR/TVM)
- ML engineers and researchers (kernels, validation, datasets)
- Frontend devs (Vite/React/Tailwind, data viz)
Start here:
- Read `CONTRIBUTING.md` for local setup, tests, and PR tips
- Run `./contributing.sh setup` and `./contributing.sh test`
- Explore `issues/` for scoped design notes and milestones
Expectations:
- Keep public APIs stable; update docs/tests with behavior changes
- Aim for reproducible steps and minimal dependencies
- Small, focused PRs with clear titles (Conventional Commits encouraged)
| Milestone | Focus | Status |
|---|---|---|
| Pre‑MVP | Runtime decorator, hardware detection, caching, CLI demo | In progress |
| MVP | Multi‑backend benchmarking and selection policies | Planned |
| AI Kernels v1 | Automated validation, correctness suites, smoke tests | Planned |
| Dashboard | Logging, benchmark viz, local agent UX | Planned |
| Frameworks | PyTorch/JAX wrappers, training loop integration | Planned |
| All‑systems support | CUDA, ROCm/HIP, Metal, OpenCL (explore Vulkan/oneAPI) | Vision |
| All‑ops coverage | Elementwise, reductions, convs, attention, norms, fused ops | Vision |
| Protocol Spec v1.0 | Stable spec: device negotiation, cache manifests, kernel metadata | Vision |
See the `issues/` directory for detailed write‑ups:
- 01 Implement runtime decorator
- 02 Hardware detection refinement
- 03 Caching metadata schema
- 04 CLI demo
- 05 AI kernel validation
- 06 Logging & benchmark viz
- 07 Multi‑backend benchmarking
Jump in with these approachable starters:
- Improve OpenCL/kernel templates and add simple correctness tests
- Add a CUDA/HIP example matching the OpenCL elementwise add
- Enhance `uhop info --json` fields (driver versions, memory footprints)
- Add README snippets with Windows/macOS‑specific setup tips
- Polish the frontend build or add a minimal dashboard card
- Optimize CI/CD workflow and docs for PRs and promotions (badges, faster CI, templates) — see `issues/15-ci-cd-workflow-docs-promo.md`
Or pick one of the tracked proposals above in `issues/` and comment to claim it.
Run the test suite (GPU‑dependent tests skip automatically):

```bash
pytest -q
```

Targeted runs:

```bash
pytest -q tests/test_matmul.py
pytest -q -k "opencl or cuda or hip or metal"
```
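If you contribute a backend or kernel, a small correctness test against a NumPy reference is the usual starting point. A hypothetical example in that spirit (test name and tolerances are illustrative):

```python
import numpy as np
from uhop import optimize

@optimize("matmul")
def matmul(a, b):
    return np.array(a) @ np.array(b)

def test_matmul_matches_numpy():
    rng = np.random.default_rng(0)
    a = rng.random((64, 64))
    b = rng.random((64, 64))
    # Whichever backend UHOP selects must agree with the NumPy reference.
    np.testing.assert_allclose(matmul(a, b), a @ b, rtol=1e-5, atol=1e-6)
```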
MIT © UHOP Systems
Tags: gpu, compiler, rocm, cuda, opencl, metal, hpc, mlops, deep-learning, open-hardware