Performance benchmarking of operators on NVIDIA Hopper (SM90) GPUs. Measures equivalent bandwidth, latency, throughput (GFLOPS), and hardware utilization across PyTorch, CUDA, and Triton implementations.
| Operator | PyTorch | CUDA | Triton | Description |
|---|---|---|---|---|
| abs | ✅ | ✅ | ✅ | Element-wise absolute value |
| add | ✅ | ✅ | ✅ | Element-wise addition |
| arange | ✅ | ✅ | ✅ | Integer range tensor |
| argmax | ✅ | ✅ | ✅ | Argmax along last dim |
| clone | ✅ | ✅ | ✅ | Tensor copy |
| concat | ✅ | — | — | Tensor concatenation (cat) |
| div | ✅ | ✅ | ✅ | Element-wise division |
| embedding | ✅ | ✅ | ✅ | Embedding table lookup |
| expand | ✅ | ✅ | ✅ | Tensor broadcast expand |
| fill | ✅ | ✅ | ✅ | Fill tensor with scalar |
| fused_add_rms_norm | ✅ | ✅ | ✅ | Fused residual add + RMS norm |
| ge | ✅ | ✅ | ✅ | Greater-or-equal comparison |
| gemma_rms_norm | ✅ | ✅ | ✅ | Gemma-style RMS norm (post-norm) |
| grouped_topk | ✅ | ✅ | ✅ | Grouped top-k selection (MoE) |
| hardswish | ✅ | ✅ | ✅ | Hard-swish activation |
| index_put | ✅ | ✅ | — | Scatter-update (index_put_) |
| index_select | ✅ | ✅ | ✅ | Index-select along dim |
| l2_norm | ✅ | ✅ | ✅ | L2 (unit) normalization |
| masked_fill | ✅ | ✅ | ✅ | Masked in-place fill |
| max | ✅ | ✅ | ✅ | Element-wise maximum |
| mul | ✅ | ✅ | ✅ | Element-wise multiply |
| neg | ✅ | ✅ | ✅ | Element-wise negation |
| reduce_max | ✅ | ✅ | ✅ | Reduction max along last dim |
| reduce_sum | ✅ | ✅ | ✅ | Reduction sum along last dim |
| repeat | ✅ | ✅ | ✅ | Tensor repeat (tile) |
| rms_norm | ✅ | ✅ | ✅ | RMS normalization (dim=-1) |
| rms_norm_without_weight | ✅ | ✅ | ✅ | RMS norm without affine weight |
| rotary_embedding | ✅ | ✅ | ✅ | Rotary positional embedding (RoPE) |
| sigmoid | ✅ | ✅ | ✅ | Sigmoid activation |
| softmax | ✅ | ✅ | — | Row-wise softmax (dim=-1) |
| softplus | ✅ | ✅ | ✅ | Softplus activation |
| sort | ✅ | ✅ | — | Sort along last dim |
| stack | ✅ | — | — | Stack tensors along new dim |
| sum | ✅ | ✅ | ✅ | Reduction sum (all elements) |
| swi_glu | ✅ | ✅ | ✅ | SwiGLU activation (gate × swish) |
| topk | ✅ | ✅ | — | Top-k selection |
| where | ✅ | ✅ | ✅ | Conditional select (where) |
```bash
# Install core + dev dependencies
pip install -e ".[dev]"

# Optional: Triton support
pip install -e ".[triton]"

# Optional: PDF report generation
pip install markdown weasyprint

# For Chinese PDF output, also install CJK fonts (Ubuntu/Debian):
apt-get install -y fonts-noto-cjk
```
```bash
# Check which GPUs are free, then run benchmarks
nvidia-smi
CUDA_VISIBLE_DEVICES=<idle_gpu> python scripts/run_bench.py --op rms_norm --dtype fp16 fp32
CUDA_VISIBLE_DEVICES=<idle_gpu> python scripts/run_bench.py --all --dtype fp16 fp32

# Run tests
pytest tests/ -v
```

Note: always check `nvidia-smi` first and set `CUDA_VISIBLE_DEVICES` to an idle GPU. Never default to GPU 0.
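The "pick an idle GPU" step above can be scripted. A minimal sketch: the helper name `pick_idle_gpu`, the thresholds, and the exact `nvidia-smi` query flags are assumptions for illustration, not part of this repo.

```python
def pick_idle_gpu(smi_csv: str, max_mem_mib: int = 1024, max_util_pct: int = 5) -> int:
    """Return the index of the first idle GPU, parsed from the output of:

      nvidia-smi --query-gpu=index,memory.used,utilization.gpu \
                 --format=csv,noheader,nounits

    A GPU counts as idle when both memory use and utilization fall below
    the given thresholds. Raises instead of silently falling back to GPU 0.
    """
    for line in smi_csv.strip().splitlines():
        index, mem_used, util = (int(f.strip()) for f in line.split(","))
        if mem_used <= max_mem_mib and util <= max_util_pct:
            return index
    raise RuntimeError("no idle GPU found; do not fall back to GPU 0")


# GPU 0 is busy (78 GiB used, 97% util), GPU 1 is idle:
idle = pick_idle_gpu("0, 78132, 97\n1, 412, 0\n")  # → 1
```

The chosen index can then be exported as `CUDA_VISIBLE_DEVICES` before launching `scripts/run_bench.py`.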
```text
benchmarks/                        # One subfolder per operator
  <op>/
    __init__.py
    config.py                      # ShapeGenerator subclass + bytes_accessed/flops formulas
    pytorch_impl.py                # PyTorch baseline (required)
    cuda_impl/                     # CUDA kernel + pybind11 wrapper (required)
      __init__.py                  # JIT-loads the compiled extension
      <op>_kernel.cu               # C++17 kernel targeting SM90
      setup.py                     # Build script
    triton_impl.py                 # Triton kernel with autotuning (optional)
    bench.py                       # Benchmark runner for this operator
common/                            # Shared utilities
  benchmark.py                     # Harness: L2 flush, warm-up, CUDA event timing, verify_implementations()
  metrics.py                       # Bandwidth, throughput, utilization calculations
  report.py                        # Markdown + CSV report generation
  shape_gen.py                     # Shape generator framework + classify_workload()
  cache.py                         # L2 cache clearing helpers
  gpu_info.py                      # Hopper GPU detection (SM90)
model_shape/                       # Fixed model-derived shapes per operator
  __init__.py                      # load_model_shapes() utility
  <op>_shape.py                    # MODEL_SHAPES dict per operator (keyed by model name)
template/                          # Single-invocation latency benchmarks
  template_bench_<op>.py           # One file per operator (1000 runs, batch CUDA events)
reports/                           # Generated benchmark output (gitignored)
scripts/
  run_bench.py                     # CLI: run benchmarks for one or all operators
  generate_report.py               # CLI: produce per-operator report from saved CSV
  generate_summary_report.py       # CLI: aggregate summary across all operators
  generate_model_report.py         # CLI: cross-operator report grouped by model (English)
  generate_model_report_cn.py      # CLI: same report in Chinese
  generate_model_shapes.py         # CLI: scaffold model_shape/<op>_shape.py files
  add_operator.py                  # CLI: scaffold a new operator
  remove_operator.py               # CLI: remove an operator and all related files
  md_to_pdf.py                     # CLI: convert Markdown report to PDF
tests/                             # Unit tests
```
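The `bytes_accessed`/`flops` formulas in each `config.py` are simple closed forms over the shape. A sketch for element-wise `add` (function names and signature are illustrative; the repo's actual `config.py` API may differ):

```python
import math


def add_bytes_accessed(shape: tuple[int, ...], dtype_bytes: int) -> int:
    """Element-wise add reads two input tensors and writes one output,
    all of the same shape: 3 tensors' worth of traffic."""
    n = math.prod(shape)
    return 3 * n * dtype_bytes  # 2 reads + 1 write


def add_flops(shape: tuple[int, ...]) -> int:
    """One addition per output element."""
    return math.prod(shape)


# A (1024, 1024) fp16 tensor: 3 * 1,048,576 elements * 2 bytes = 6,291,456 bytes
# moved and 1,048,576 FLOPs; the harness divides both by the measured time.
```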
Each operator benchmark produces a Markdown report and CSV under `reports/<op>/`.
Reports are organized into three workload categories:
| Category | Condition |
|---|---|
| Small | Input tensor < 5 GB |
| Large | 5 GB ≤ input tensor < 40 GB |
| Model Shape | Fixed shapes from real model architectures |
Each category section includes per-implementation avg/min/max summary statistics and a per-shape detail table.
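The repo's `classify_workload()` (in `common/shape_gen.py`) maps each case to one of these categories. A minimal sketch of that logic, assuming the thresholds from the table and binary GiB units (the actual signature and unit convention may differ):

```python
GiB = 1024**3


def classify_workload(input_bytes: int, is_model_shape: bool = False) -> str:
    """Map a benchmark case to a report category.

    Thresholds follow the table: Small < 5 GB, 5 GB <= Large < 40 GB;
    fixed model-derived shapes bypass the size check.
    """
    if is_model_shape:
        return "model_shape"
    if input_bytes < 5 * GiB:
        return "small"
    if input_bytes < 40 * GiB:
        return "large"
    raise ValueError("input tensor >= 40 GB is outside the benchmark range")
```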
A cross-operator report grouped by model is produced by `generate_model_report.py` / `generate_model_report_cn.py`. Each model section has:
- Operator Summary — avg bandwidth and utilization per operator/dtype
- Operator Detailed — per-shape timing and bandwidth tables (with shape dimension labels)
```bash
# English
python scripts/generate_model_report.py

# Chinese
python scripts/generate_model_report_cn.py

# Convert to PDF
python scripts/md_to_pdf.py reports/model_report.md
python scripts/md_to_pdf.py reports/model_report_cn.md --toc-title "目录" --lang zh
```

`scripts/md_to_pdf.py` converts any Markdown report to PDF with a clickable table of contents, styled tables, code blocks, and embedded fonts.
```bash
python scripts/md_to_pdf.py <input.md> [-o output.pdf] [--page-size A4] \
    [--font-size 10.5] [--margin "20mm 18mm"] [--no-toc] \
    [--toc-title "Contents"] [--lang en]
```

Dependencies: `pip install markdown weasyprint`; for CJK output: `apt-get install -y fonts-noto-cjk`.
- Equivalent Bandwidth (GB/s): `(bytes_read + bytes_written) / median_time`
- Throughput (GFLOPS/TFLOPS): `flops / median_time`
- HW Utilization (%): `measured / peak × 100`
  - For matmul-like ops, the Tensor Core peak is the denominator.
  - For non-matmul ops (softmax, rms_norm, etc.), the CUDA Core peak is the denominator.
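The repo computes these in `common/metrics.py`; the function names below are illustrative, but the arithmetic follows the formulas above directly:

```python
def equivalent_bandwidth_gbs(bytes_read: int, bytes_written: int,
                             median_time_s: float) -> float:
    """Equivalent bandwidth in GB/s: total bytes moved over the median time."""
    return (bytes_read + bytes_written) / median_time_s / 1e9


def throughput_gflops(flops: int, median_time_s: float) -> float:
    """Throughput in GFLOPS: total FLOPs over the median time."""
    return flops / median_time_s / 1e9


def hw_utilization_pct(measured: float, peak: float) -> float:
    """Measured bandwidth or throughput as a percentage of the hardware peak.
    Pass the Tensor Core peak for matmul-like ops, the CUDA Core peak otherwise."""
    return measured / peak * 100.0
```

For example, a kernel that moves 3 GB in and 3 GB out in 2 ms reaches 3000 GB/s; against a 3350 GB/s HBM peak that is about 89.6% utilization.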
Peak specifications (these figures match the NVIDIA H100 SXM):

| Metric | FP16 | FP32 |
|---|---|---|
| HBM Bandwidth | 3.35 TB/s | 3.35 TB/s |
| Tensor Core | 989.5 TFLOPS | 494.7 TFLOPS |
| CUDA Core | 134.0 TFLOPS | 67.0 TFLOPS |
Peak specifications (these figures match the NVIDIA H20):

| Metric | FP16 | FP32 |
|---|---|---|
| HBM Bandwidth | 4.0 TB/s | 4.0 TB/s |
| Tensor Core | 148.0 TFLOPS | 74.0 TFLOPS |
| CUDA Core | — | 44.0 TFLOPS |
| Memory | 96 GB | — |
| L2 Cache | 64 MB | — |
| SMs | 78 | — |
Each operator's `config.py` defines a `ShapeGenerator` yielding ≤65 shapes per dtype:

- Small cases (~30): input tensor < 5 GB; covers powers of two, non-powers, small (≤64), large (≥8192), and non-aligned dims
- Large cases (~30): 5 GB ≤ input tensor < 40 GB; multi-dimensional shapes evenly distributed across the range

`generate()` must explicitly call `yield from self._large_cases(dtype)`; omitting this silently drops every large case.
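A minimal sketch of such a generator for an element-wise op. Only `_large_cases` and `generate` are named in this document; `AddShapeGenerator`, `_small_cases`, the base-class contract, and the specific dims are assumptions for illustration.

```python
from collections.abc import Iterator

DTYPE_BYTES = {"fp16": 2, "fp32": 4}
GiB = 1024**3


class AddShapeGenerator:
    """Yields 2-D shapes: small cases stay under 5 GB, large cases span
    the 5-40 GB band, and generate() chains both explicitly."""

    def _small_cases(self, dtype: str) -> Iterator[tuple[int, ...]]:
        # Powers of two, a non-power, and a non-aligned dim
        for n in (64, 1000, 4096, 8192, 65537):
            yield (n, n)

    def _large_cases(self, dtype: str) -> Iterator[tuple[int, ...]]:
        elem = DTYPE_BYTES[dtype]
        for gib in (5, 10, 20, 39):  # evenly spread across 5-40 GB
            rows = gib * GiB // (elem * 8192)
            yield (rows, 8192)

    def generate(self, dtype: str) -> Iterator[tuple[int, ...]]:
        yield from self._small_cases(dtype)
        yield from self._large_cases(dtype)  # forgetting this line drops all large cases


shapes = list(AddShapeGenerator().generate("fp16"))  # 5 small + 4 large = 9 shapes
```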
- Run `python scripts/add_operator.py <op>` to scaffold files, or create them manually: `benchmarks/<op>/__init__.py`, `config.py`, `pytorch_impl.py`, `cuda_impl/`, `bench.py`, `template/template_bench_<op>.py`, `model_shape/<op>_shape.py`
- Shape generator must yield ≤65 cases per dtype with both small and large coverage
- `bench.py` must call `verify_implementations()` before timing and `load_model_shapes()` for model shapes
- CUDA kernels: C++17, SM90, one kernel per `.cu` file, `AT_DISPATCH_FLOATING_TYPES_AND_HALF`
- Run `ruff check .` and `pytest tests/ -v` before committing
- Python 3.10+
- PyTorch ≥ 2.1.0 with CUDA
- NVIDIA Hopper GPU (SM ≥ 9.0)
- CUDA toolkit (for kernel compilation)
- `numpy ≥ 1.24`, `tabulate ≥ 0.9`, `pybind11 ≥ 2.11`
- Optional: `triton ≥ 2.2`
- Optional (PDF export): `markdown`, `weasyprint`, system package `fonts-noto-cjk` (for CJK)