Performance benchmarking of operators on NVIDIA Hopper (SM90) GPUs. Measures equivalent bandwidth, latency, throughput (GFLOPS), and hardware utilization across PyTorch, CUDA, and Triton implementations.
| Operator | PyTorch | CUDA | Triton | Description |
|---|---|---|---|---|
| abs | ✅ | ✅ | ✅ | Element-wise absolute value |
| add | ✅ | ✅ | ✅ | Element-wise addition |
| arange | ✅ | ✅ | ✅ | Integer range tensor |
| argmax | ✅ | ✅ | ✅ | Argmax along last dim |
| clone | ✅ | ✅ | ✅ | Tensor copy |
| concat | ✅ | — | — | Tensor concatenation (cat) |
| div | ✅ | ✅ | ✅ | Element-wise division |
| embedding | ✅ | ✅ | ✅ | Embedding table lookup |
| expand | ✅ | ✅ | ✅ | Tensor broadcast expand |
| fill | ✅ | ✅ | ✅ | Fill tensor with scalar |
| fused_add_rms_norm | ✅ | ✅ | ✅ | Fused residual add + RMS norm |
| ge | ✅ | ✅ | ✅ | Greater-or-equal comparison |
| gemma_rms_norm | ✅ | ✅ | ✅ | Gemma-style RMS norm (post-norm) |
| grouped_topk | ✅ | ✅ | ✅ | Grouped top-k selection (MoE) |
| hardswish | ✅ | ✅ | ✅ | Hard-swish activation |
| index_put | ✅ | ✅ | — | Scatter-update (index_put_) |
| index_select | ✅ | ✅ | ✅ | Index-select along dim |
| l2_norm | ✅ | ✅ | ✅ | L2 (unit) normalization |
| masked_fill | ✅ | ✅ | ✅ | Masked in-place fill |
| max | ✅ | ✅ | ✅ | Element-wise maximum |
| mul | ✅ | ✅ | ✅ | Element-wise multiply |
| neg | ✅ | ✅ | ✅ | Element-wise negation |
| reduce_max | ✅ | ✅ | ✅ | Reduction max along last dim |
| reduce_sum | ✅ | ✅ | ✅ | Reduction sum along last dim |
| repeat | ✅ | ✅ | ✅ | Tensor repeat (tile) |
| rms_norm | ✅ | ✅ | ✅ | RMS normalization (dim=-1) |
| rms_norm_without_weight | ✅ | ✅ | ✅ | RMS norm without affine weight |
| rotary_embedding | ✅ | ✅ | ✅ | Rotary positional embedding (RoPE) |
| sigmoid | ✅ | ✅ | ✅ | Sigmoid activation |
| softmax | ✅ | ✅ | — | Row-wise softmax (dim=-1) |
| softplus | ✅ | ✅ | ✅ | Softplus activation |
| sort | ✅ | ✅ | — | Sort along last dim |
| stack | ✅ | — | — | Stack tensors along new dim |
| sum | ✅ | ✅ | ✅ | Reduction sum (all elements) |
| swi_glu | ✅ | ✅ | ✅ | SwiGLU activation (gate × swish) |
| topk | ✅ | ✅ | — | Top-k selection |
| where | ✅ | ✅ | ✅ | Conditional select (where) |
```bash
# Install core + dev dependencies
pip install -e ".[dev]"

# Optional: Triton support
pip install -e ".[triton]"

# Optional: PDF report generation
pip install markdown weasyprint

# For Chinese PDF output, also install CJK fonts (Ubuntu/Debian):
apt-get install -y fonts-noto-cjk
```
```bash
# Check which GPUs are free, then run benchmarks
nvidia-smi
CUDA_VISIBLE_DEVICES=<idle_gpu> python scripts/run_bench.py --op rms_norm --dtype fp16 fp32
CUDA_VISIBLE_DEVICES=<idle_gpu> python scripts/run_bench.py --all --dtype fp16 fp32

# Run tests
pytest tests/ -v
```

Note: always check `nvidia-smi` first and set `CUDA_VISIBLE_DEVICES` to an idle GPU. Never default to GPU 0.
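The "pick an idle GPU" step above can be scripted. A minimal sketch: the helper name `pick_idle_gpu`, the thresholds, and the exact `nvidia-smi` query flags are assumptions for illustration, not part of this repo.

```python
def pick_idle_gpu(smi_csv: str, max_mem_mib: int = 1024, max_util_pct: int = 5) -> int:
    """Return the index of the first idle GPU, parsed from the output of:

      nvidia-smi --query-gpu=index,memory.used,utilization.gpu \
                 --format=csv,noheader,nounits

    A GPU counts as idle when both memory use and utilization fall below
    the given thresholds. Raises instead of silently falling back to GPU 0.
    """
    for line in smi_csv.strip().splitlines():
        index, mem_used, util = (int(f.strip()) for f in line.split(","))
        if mem_used <= max_mem_mib and util <= max_util_pct:
            return index
    raise RuntimeError("no idle GPU found; do not fall back to GPU 0")


# GPU 0 is busy (78 GiB used, 97% util), GPU 1 is idle:
idle = pick_idle_gpu("0, 78132, 97\n1, 412, 0\n")  # → 1
```

The chosen index can then be exported as `CUDA_VISIBLE_DEVICES` before launching `scripts/run_bench.py`.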
```text
benchmarks/                        # One subfolder per operator
  <op>/
    __init__.py
    config.py                      # ShapeGenerator subclass + bytes_accessed/flops formulas
    pytorch_impl.py                # PyTorch baseline (required)
    cuda_impl/                     # CUDA kernel + pybind11 wrapper (required)
      __init__.py                  # JIT-loads the compiled extension
      <op>_kernel.cu               # C++17 kernel targeting SM90
      setup.py                     # Build script
    triton_impl.py                 # Triton kernel with autotuning (optional)
    bench.py                       # Benchmark runner for this operator
common/                            # Shared utilities
  benchmark.py                     # Harness: L2 flush, warm-up, CUDA event timing, verify_implementations()
  metrics.py                       # Bandwidth, throughput, utilization calculations
  report.py                        # Markdown + CSV report generation
  shape_gen.py                     # Shape generator framework + classify_workload()
  cache.py                         # L2 cache clearing helpers
  gpu_info.py                      # Hopper GPU detection (SM90)
model_shape/                       # Fixed model-derived shapes per operator
  __init__.py                      # load_model_shapes() utility
  <op>_shape.py                    # MODEL_SHAPES dict per operator (keyed by model name)
template/                          # Single-invocation latency benchmarks
  template_bench_<op>.py           # One file per operator (1000 runs, batch CUDA events)
reports/                           # Generated benchmark output (gitignored)
scripts/
  run_bench.py                     # CLI: run benchmarks for one or all operators
  generate_report.py               # CLI: produce per-operator report from saved CSV
  generate_summary_report.py       # CLI: aggregate summary across all operators
  generate_model_report.py         # CLI: cross-operator report grouped by model (English)
  generate_model_report_cn.py      # CLI: same report in Chinese
  generate_model_shapes.py         # CLI: scaffold model_shape/<op>_shape.py files
  add_operator.py                  # CLI: scaffold a new operator
  remove_operator.py               # CLI: remove an operator and all related files
  md_to_pdf.py                     # CLI: convert Markdown report to PDF
tests/                             # Unit tests
```
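The `bytes_accessed`/`flops` formulas in each `config.py` are simple closed forms over the shape. A sketch for element-wise `add` (function names and signature are illustrative; the repo's actual `config.py` API may differ):

```python
import math


def add_bytes_accessed(shape: tuple[int, ...], dtype_bytes: int) -> int:
    """Element-wise add reads two input tensors and writes one output,
    all of the same shape: 3 tensors' worth of traffic."""
    n = math.prod(shape)
    return 3 * n * dtype_bytes  # 2 reads + 1 write


def add_flops(shape: tuple[int, ...]) -> int:
    """One addition per output element."""
    return math.prod(shape)


# A (1024, 1024) fp16 tensor: 3 * 1,048,576 elements * 2 bytes = 6,291,456 bytes
# moved and 1,048,576 FLOPs; the harness divides both by the measured time.
```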
Each operator benchmark produces a Markdown report and CSV under `reports/<op>/`.
Reports are organized into three workload categories:
| Category | Condition |
|---|---|
| Small | Input tensor < 5 GB |
| Large | 5 GB ≤ input tensor < 40 GB |
| Model Shape | Fixed shapes from real model architectures |
Each category section includes per-implementation avg/min/max summary statistics and a per-shape detail table.
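The repo's `classify_workload()` (in `common/shape_gen.py`) maps each case to one of these categories. A minimal sketch of that logic, assuming the thresholds from the table and binary GiB units (the actual signature and unit convention may differ):

```python
GiB = 1024**3


def classify_workload(input_bytes: int, is_model_shape: bool = False) -> str:
    """Map a benchmark case to a report category.

    Thresholds follow the table: Small < 5 GB, 5 GB <= Large < 40 GB;
    fixed model-derived shapes bypass the size check.
    """
    if is_model_shape:
        return "model_shape"
    if input_bytes < 5 * GiB:
        return "small"
    if input_bytes < 40 * GiB:
        return "large"
    raise ValueError("input tensor >= 40 GB is outside the benchmark range")
```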
A cross-operator report grouped by model is produced by `generate_model_report.py` / `generate_model_report_cn.py`. Each model section has:
- Operator Summary — avg bandwidth and utilization per operator/dtype
- Operator Detailed — per-shape timing and bandwidth tables (with shape dimension labels)
```bash
# English
python scripts/generate_model_report.py

# Chinese
python scripts/generate_model_report_cn.py

# Convert to PDF
python scripts/md_to_pdf.py reports/model_report.md
python scripts/md_to_pdf.py reports/model_report_cn.md --toc-title "目录" --lang zh
```

`scripts/md_to_pdf.py` converts any Markdown report to PDF with a clickable table of contents, styled tables, code blocks, and embedded fonts.
```bash
python scripts/md_to_pdf.py <input.md> [-o output.pdf] [--page-size A4] \
    [--font-size 10.5] [--margin "20mm 18mm"] [--no-toc] \
    [--toc-title "Contents"] [--lang en]
```

Dependencies: `pip install markdown weasyprint`; for CJK output: `apt-get install -y fonts-noto-cjk`.
- Equivalent Bandwidth (GB/s): `(bytes_read + bytes_written) / median_time`
- Throughput (GFLOPS/TFLOPS): `flops / median_time`
- HW Utilization (%): `measured / peak × 100`
  - For matmul-like ops, the Tensor Core peak is the denominator.
  - For non-matmul ops (softmax, rms_norm, etc.), the CUDA Core peak is the denominator.
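The repo computes these in `common/metrics.py`; the function names below are illustrative, but the arithmetic follows the formulas above directly:

```python
def equivalent_bandwidth_gbs(bytes_read: int, bytes_written: int,
                             median_time_s: float) -> float:
    """Equivalent bandwidth in GB/s: total bytes moved over the median time."""
    return (bytes_read + bytes_written) / median_time_s / 1e9


def throughput_gflops(flops: int, median_time_s: float) -> float:
    """Throughput in GFLOPS: total FLOPs over the median time."""
    return flops / median_time_s / 1e9


def hw_utilization_pct(measured: float, peak: float) -> float:
    """Measured bandwidth or throughput as a percentage of the hardware peak.
    Pass the Tensor Core peak for matmul-like ops, the CUDA Core peak otherwise."""
    return measured / peak * 100.0
```

For example, a kernel that moves 3 GB in and 3 GB out in 2 ms reaches 3000 GB/s; against a 3350 GB/s HBM peak that is about 89.6% utilization.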
Peak specifications (these figures match the NVIDIA H100 SXM):

| Metric | FP16 | FP32 |
|---|---|---|
| HBM Bandwidth | 3.35 TB/s | 3.35 TB/s |
| Tensor Core | 989.5 TFLOPS | 494.7 TFLOPS |
| CUDA Core | 134.0 TFLOPS | 67.0 TFLOPS |
Peak specifications (these figures match the NVIDIA H20):

| Metric | FP16 | FP32 |
|---|---|---|
| HBM Bandwidth | 4.0 TB/s | 4.0 TB/s |
| Tensor Core | 148.0 TFLOPS | 74.0 TFLOPS |
| CUDA Core | — | 44.0 TFLOPS |
| Memory | 96 GB | — |
| L2 Cache | 64 MB | — |
| SMs | 78 | — |
Each operator's `config.py` defines a `ShapeGenerator` yielding ≤65 shapes per dtype:

- Small cases (~30): input tensor < 5 GB; covers powers of two, non-powers, small (≤64), large (≥8192), and non-aligned dims
- Large cases (~30): 5 GB ≤ input tensor < 40 GB; multi-dimensional shapes evenly distributed across the range

`generate()` must explicitly call `yield from self._large_cases(dtype)`; omitting this silently drops every large case.
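A minimal sketch of such a generator for an element-wise op. Only `_large_cases` and `generate` are named in this document; `AddShapeGenerator`, `_small_cases`, the base-class contract, and the specific dims are assumptions for illustration.

```python
from collections.abc import Iterator

DTYPE_BYTES = {"fp16": 2, "fp32": 4}
GiB = 1024**3


class AddShapeGenerator:
    """Yields 2-D shapes: small cases stay under 5 GB, large cases span
    the 5-40 GB band, and generate() chains both explicitly."""

    def _small_cases(self, dtype: str) -> Iterator[tuple[int, ...]]:
        # Powers of two, a non-power, and a non-aligned dim
        for n in (64, 1000, 4096, 8192, 65537):
            yield (n, n)

    def _large_cases(self, dtype: str) -> Iterator[tuple[int, ...]]:
        elem = DTYPE_BYTES[dtype]
        for gib in (5, 10, 20, 39):  # evenly spread across 5-40 GB
            rows = gib * GiB // (elem * 8192)
            yield (rows, 8192)

    def generate(self, dtype: str) -> Iterator[tuple[int, ...]]:
        yield from self._small_cases(dtype)
        yield from self._large_cases(dtype)  # forgetting this line drops all large cases


shapes = list(AddShapeGenerator().generate("fp16"))  # 5 small + 4 large = 9 shapes
```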
- Run `python scripts/add_operator.py <op>` to scaffold files, or create them manually: `benchmarks/<op>/__init__.py`, `config.py`, `pytorch_impl.py`, `cuda_impl/`, `bench.py`, `template/template_bench_<op>.py`, `model_shape/<op>_shape.py`
- Shape generator must yield ≤65 cases per dtype with both small and large coverage
- `bench.py` must call `verify_implementations()` before timing and `load_model_shapes()` for model shapes
- CUDA kernels: C++17, SM90, one kernel per `.cu` file, `AT_DISPATCH_FLOATING_TYPES_AND_HALF`
- Run `ruff check .` and `pytest tests/ -v` before committing
- Python 3.10+
- PyTorch ≥ 2.1.0 with CUDA
- NVIDIA Hopper GPU (SM ≥ 9.0)
- CUDA toolkit (for kernel compilation)
- `numpy ≥ 1.24`, `tabulate ≥ 0.9`, `pybind11 ≥ 2.11`
- Optional: `triton ≥ 2.2`
- Optional (PDF export): `markdown`, `weasyprint`, system package `fonts-noto-cjk` (for CJK)