
wch18/Hopper_benchmark


Hopper GPU Operator Benchmark Suite

Performance benchmarking of operators on NVIDIA Hopper (SM90) GPUs. Measures equivalent bandwidth, latency, throughput (GFLOPS), and hardware utilization across PyTorch, CUDA, and Triton implementations.

Operators

Each operator has a PyTorch baseline and a CUDA kernel; Triton implementations are provided where available (see Project Structure).

| Operator | Description |
| --- | --- |
| abs | Element-wise absolute value |
| add | Element-wise addition |
| arange | Integer range tensor |
| argmax | Argmax along last dim |
| clone | Tensor copy |
| concat | Tensor concatenation (cat) |
| div | Element-wise division |
| embedding | Embedding table lookup |
| expand | Tensor broadcast expand |
| fill | Fill tensor with scalar |
| fused_add_rms_norm | Fused residual add + RMS norm |
| ge | Greater-or-equal comparison |
| gemma_rms_norm | Gemma-style RMS norm (post-norm) |
| grouped_topk | Grouped top-k selection (MoE) |
| hardswish | Hard-swish activation |
| index_put | Scatter-update (index_put_) |
| index_select | Index-select along dim |
| l2_norm | L2 (unit) normalization |
| masked_fill | Masked in-place fill |
| max | Element-wise maximum |
| mul | Element-wise multiply |
| neg | Element-wise negation |
| reduce_max | Reduction max along last dim |
| reduce_sum | Reduction sum along last dim |
| repeat | Tensor repeat (tile) |
| rms_norm | RMS normalization (dim=-1) |
| rms_norm_without_weight | RMS norm without affine weight |
| rotary_embedding | Rotary positional embedding (RoPE) |
| sigmoid | Sigmoid activation |
| softmax | Row-wise softmax (dim=-1) |
| softplus | Softplus activation |
| sort | Sort along last dim |
| stack | Stack tensors along new dim |
| sum | Reduction sum (all elements) |
| swi_glu | SwiGLU activation (gate × swish) |
| topk | Top-k selection |
| where | Conditional select (where) |

Quick Start

# Install core + dev dependencies
pip install -e ".[dev]"

# Optional: Triton support
pip install -e ".[triton]"

# Optional: PDF report generation
pip install markdown weasyprint
# For Chinese PDF output, also install CJK fonts (Ubuntu/Debian):
apt-get install -y fonts-noto-cjk

# Check which GPUs are free, then run benchmarks
nvidia-smi
CUDA_VISIBLE_DEVICES=<idle_gpu> python scripts/run_bench.py --op rms_norm --dtype fp16 fp32
CUDA_VISIBLE_DEVICES=<idle_gpu> python scripts/run_bench.py --all --dtype fp16 fp32

# Run tests
pytest tests/ -v

Note: Always check nvidia-smi first and set CUDA_VISIBLE_DEVICES to an idle GPU. Never default to GPU 0.

Project Structure

benchmarks/            # One subfolder per operator
  <op>/
    __init__.py
    config.py          # ShapeGenerator subclass + bytes_accessed/flops formulas
    pytorch_impl.py    # PyTorch baseline (required)
    cuda_impl/         # CUDA kernel + pybind11 wrapper (required)
      __init__.py      # JIT-loads the compiled extension
      <op>_kernel.cu   # C++17 kernel targeting SM90
      setup.py         # Build script
    triton_impl.py     # Triton kernel with autotuning (optional)
    bench.py           # Benchmark runner for this operator
common/                # Shared utilities
  benchmark.py         # Harness: L2 flush, warm-up, CUDA event timing, verify_implementations()
  metrics.py           # Bandwidth, throughput, utilization calculations
  report.py            # Markdown + CSV report generation
  shape_gen.py         # Shape generator framework + classify_workload()
  cache.py             # L2 cache clearing helpers
  gpu_info.py          # Hopper GPU detection (SM90)
model_shape/           # Fixed model-derived shapes per operator
  __init__.py          # load_model_shapes() utility
  <op>_shape.py        # MODEL_SHAPES dict per operator (keyed by model name)
template/              # Single-invocation latency benchmarks
  template_bench_<op>.py  # One file per operator (1000 runs, batch CUDA events)
reports/               # Generated benchmark output (gitignored)
scripts/
  run_bench.py              # CLI: run benchmarks for one or all operators
  generate_report.py        # CLI: produce per-operator report from saved CSV
  generate_summary_report.py  # CLI: aggregate summary across all operators
  generate_model_report.py  # CLI: cross-operator report grouped by model (English)
  generate_model_report_cn.py # CLI: same report in Chinese
  generate_model_shapes.py  # CLI: scaffold model_shape/<op>_shape.py files
  add_operator.py           # CLI: scaffold a new operator
  remove_operator.py        # CLI: remove an operator and all related files
  md_to_pdf.py              # CLI: convert Markdown report to PDF
tests/                 # Unit tests
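The harness in common/benchmark.py (L2 flush, warm-up, repeated runs, median) can be sketched in a simplified CPU-only form. The names and signature below are illustrative, not the repo's actual API; the real harness times with CUDA events and flushes L2 by writing a large scratch buffer:

```python
import statistics
import time

def bench(fn, *, warmup=10, iters=100, flush_cache=None):
    """Simplified timing loop: optional cache flush, warm-up, then
    per-iteration wall-clock timing; returns the median in seconds.
    (The real harness uses CUDA events instead of perf_counter.)"""
    for _ in range(warmup):          # warm-up: JIT compilation and caches settle
        fn()
    times = []
    for _ in range(iters):
        if flush_cache:
            flush_cache()            # e.g. overwrite a large buffer to evict L2
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return statistics.median(times)  # median is robust to timing outliers

# Example: time a trivial CPU workload
median_s = bench(lambda: sum(range(10_000)), warmup=2, iters=20)
```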

Reports

Each operator benchmark produces a Markdown report and CSV under reports/<op>/. Reports are organized into three workload categories:

| Category | Condition |
| --- | --- |
| Small | input tensor < 5 GB |
| Large | 5 GB ≤ input tensor < 40 GB |
| Model Shape | fixed shapes from real model architectures |

Each category section includes per-implementation avg/min/max summary statistics and a per-shape detail table.
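A minimal version of the size-based bucketing might look like the sketch below (the real classify_workload() lives in common/shape_gen.py; the function shape and dtype-size table here are assumptions):

```python
GB = 1024 ** 3
DTYPE_BYTES = {"fp16": 2, "fp32": 4}  # assumed dtype sizes in bytes

def classify_workload(shape, dtype):
    """Bucket a shape by input-tensor size: < 5 GB -> small, 5-40 GB -> large."""
    numel = 1
    for dim in shape:
        numel *= dim
    size_bytes = numel * DTYPE_BYTES[dtype]
    if size_bytes < 5 * GB:
        return "small"
    if size_bytes < 40 * GB:
        return "large"
    raise ValueError("input tensor must be < 40 GB")

print(classify_workload((4096, 4096), "fp16"))  # 32 MiB input -> "small"
```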

Model Report

A cross-operator report grouped by model is produced by generate_model_report.py / generate_model_report_cn.py. Each model section has:

  • Operator Summary — avg bandwidth and utilization per operator/dtype
  • Operator Detailed — per-shape timing and bandwidth tables (with shape dimension labels)
# English
python scripts/generate_model_report.py

# Chinese
python scripts/generate_model_report_cn.py

# Convert to PDF
python scripts/md_to_pdf.py reports/model_report.md
python scripts/md_to_pdf.py reports/model_report_cn.md --toc-title "目录" --lang zh

PDF Conversion

scripts/md_to_pdf.py converts any Markdown report to PDF with a clickable table of contents, styled tables, code blocks, and embedded fonts.

python scripts/md_to_pdf.py <input.md> [-o output.pdf] [--page-size A4] \
    [--font-size 10.5] [--margin "20mm 18mm"] [--no-toc] \
    [--toc-title "Contents"] [--lang en]

Dependencies: pip install markdown weasyprint; for CJK: apt-get install -y fonts-noto-cjk

Metrics

  • Equivalent Bandwidth (GB/s): (bytes_read + bytes_written) / median_time
  • Throughput (GFLOPS/TFLOPS): flops / median_time
  • HW Utilization (%): measured / peak × 100
      • For matmul-like ops: the Tensor Core peak is the denominator.
      • For non-matmul ops (softmax, rms_norm, etc.): the CUDA Core peak is the denominator.
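Applying these formulas to a hypothetical fp16 run on H100 SXM (the helper name and the run's numbers are illustrative):

```python
def metrics(bytes_accessed, flops, median_time_s, peak_bw_gbs, peak_tflops):
    """Equivalent bandwidth, throughput, and utilization per the formulas above."""
    bw_gbs = bytes_accessed / median_time_s / 1e9        # GB/s
    gflops = flops / median_time_s / 1e9                 # GFLOPS
    bw_util = bw_gbs / peak_bw_gbs * 100                 # % of HBM peak
    compute_util = gflops / (peak_tflops * 1e3) * 100    # % of compute peak
    return bw_gbs, gflops, bw_util, compute_util

# Hypothetical memory-bound op: 1 GB read+written in 0.5 ms on H100 fp16
# (peaks: 3350 GB/s HBM, 134 TFLOPS CUDA Core)
bw, gf, bw_u, c_u = metrics(1e9, 0.25e9, 0.5e-3, 3350.0, 134.0)
# bw = 2000 GB/s, i.e. ~59.7% of the 3.35 TB/s HBM peak
```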

H100 SXM Peak Specs

| Metric | FP16 | FP32 |
| --- | --- | --- |
| HBM Bandwidth | 3.35 TB/s | 3.35 TB/s |
| Tensor Core | 989.5 TFLOPS | 494.7 TFLOPS |
| CUDA Core | 134.0 TFLOPS | 67.0 TFLOPS |

H20 Peak Specs (Porting Reference)

| Metric | FP16 | FP32 |
| --- | --- | --- |
| HBM Bandwidth | 4.0 TB/s | 4.0 TB/s |
| Tensor Core | 148.0 TFLOPS | 74.0 TFLOPS |
| CUDA Core | 44.0 TFLOPS | |

| Metric | Value |
| --- | --- |
| Memory | 96 GB |
| L2 Cache | 64 MB |
| SMs | 78 |

Shape Generation

Each operator's config.py defines a ShapeGenerator yielding ≤65 shapes per dtype:

  • Small cases (~30): input tensor < 5 GB — powers-of-2, non-powers, small (≤64), large (≥8192), non-aligned dims
  • Large cases (~30): 5 GB ≤ input tensor < 40 GB — multi-dimensional shapes evenly distributed across the range

generate() must explicitly call yield from self._large_cases(dtype) — omitting this is a silent bug.
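A config.py generator following these rules might look like this sketch. The class layout and method names are modeled on the description above, not the repo's actual base class, and the shape values are illustrative:

```python
class ShapeGenerator:
    """Sketch of a per-operator shape generator: small cases plus large cases."""
    GB = 1024 ** 3
    DTYPE_BYTES = {"fp16": 2, "fp32": 4}

    def _small_cases(self, dtype):
        # Mix powers of 2, non-powers, and non-aligned dims (all well under 5 GB)
        for rows in (64, 1000, 4096):
            for cols in (64, 4097, 8192):
                yield (rows, cols)

    def _large_cases(self, dtype):
        # Spread input sizes evenly across the [5 GB, 40 GB) range
        elem = self.DTYPE_BYTES[dtype]
        for size_gb in (5, 16, 32):
            yield (size_gb * self.GB // (8192 * elem), 8192)

    def generate(self, dtype):
        yield from self._small_cases(dtype)
        # Easy to forget -- omitting this silently drops every large case:
        yield from self._large_cases(dtype)

shapes = list(ShapeGenerator().generate("fp16"))  # 9 small + 3 large shapes
```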

Adding a New Operator

  1. Run python scripts/add_operator.py <op> to scaffold files, or create manually:
    • benchmarks/<op>/__init__.py, config.py, pytorch_impl.py, cuda_impl/, bench.py
    • template/template_bench_<op>.py
    • model_shape/<op>_shape.py
  2. Shape generator must yield ≤65 cases per dtype with both small and large coverage
  3. bench.py must call verify_implementations() before timing and load_model_shapes() for model shapes
  4. CUDA kernels: C++17, SM90, one kernel per .cu file, AT_DISPATCH_FLOATING_TYPES_AND_HALF
  5. Run ruff check . and pytest tests/ -v before committing
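Step 1's model_shape/<op>_shape.py holds a MODEL_SHAPES dict keyed by model name. A hypothetical file for an rms_norm-like operator might look like the following; the model names, dimension layout, and entries are illustrative, real dims come from each model's config:

```python
# model_shape/<op>_shape.py -- illustrative entries only
MODEL_SHAPES = {
    "llama3-8b": [(1, 8192, 4096)],   # (batch, seq_len, hidden)
    "qwen2-72b": [(1, 8192, 8192)],
}

# bench.py then iterates (model, shape) pairs for the Model Shape category:
pairs = [(model, shape)
         for model, shapes in MODEL_SHAPES.items()
         for shape in shapes]
```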

Requirements

  • Python 3.10+
  • PyTorch ≥ 2.1.0 with CUDA
  • NVIDIA Hopper GPU (SM ≥ 9.0)
  • CUDA toolkit (for kernel compilation)
  • numpy ≥ 1.24, tabulate ≥ 0.9, pybind11 ≥ 2.11
  • Optional: triton ≥ 2.2
  • Optional (PDF export): markdown, weasyprint, system package fonts-noto-cjk (for CJK)
