
GenAI Layer Benchmark #158536


Closed

wants to merge 18 commits into from

Conversation

@BoyuanFeng (Contributor) commented Jul 17, 2025

This PR adds a GenAI layer benchmark. It compares PyTorch eager, torch.compile, liger, and quack.

It covers all kernels supported by quack (CrossEntropy Fwd/Bwd, Softmax Fwd/Bwd, RMSNorm Fwd/Bwd, LayerNorm Fwd), plus LayerNorm Bwd.

Motivations

  • Many OSS users have asked how to properly benchmark torch.compile-generated kernels. One common error is to compile a kernel/layer for one shape (e.g., batch size = 1) and benchmark it on another shape (e.g., batch size = 1024), which leads to bad performance. This PR provides a simple, clear example of proper benchmarking.
  • We recently added a GenAI model benchmark (based on vLLM). But it's usually hard to optimize models directly due to their complexity. Layer benchmarks are easier to reason about and optimize.

Key Settings

  • Avoid reusing a kernel specialized on one shape when benchmarking another shape.
torch._dynamo.config.automatic_dynamic_shapes = False
# Needed since changing args to function causes recompiles
torch._dynamo.config.recompile_limit = 1000000
  • For forward passes, people may mark the batch size as dynamic to avoid runtime recompilation. We respect that setting in this kernel-level benchmark.
torch._dynamo.mark_dynamic(x, 0)

GPU: H100 (devvm006.dkl0)

Results: P1874246170

Note: for numerical accuracy, we use the default tolerances of torch.testing.assert_close (i.e., for torch.bfloat16, rtol=1.6e-2 and atol=1e-5). This surfaces numerical issues for some backends and kernels.
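For reference, torch.testing.assert_close accepts a pair of values when |actual − expected| ≤ atol + rtol·|expected|. A plain-Python sketch of that rule with the bfloat16 defaults quoted above (the helper name is made up for illustration):

```python
# Sketch of the default bfloat16 tolerance check applied by
# torch.testing.assert_close; the helper name is hypothetical.
def within_default_bf16_tol(actual, expected, rtol=1.6e-2, atol=1e-5):
    return abs(actual - expected) <= atol + rtol * abs(expected)

# A ~1% relative error passes the default bfloat16 tolerance...
assert within_default_bf16_tol(1.01, 1.00)
# ...while a ~5% relative error fails it.
assert not within_default_bf16_tol(1.05, 1.00)
```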

Next steps are to add roofline analysis, add the benchmark to CI to check for regressions, cover more GenAI kernels, and include GenAI layers for common fusion patterns.

(Benchmark charts attached: CrossEntropy Fwd/Bwd, LayerNorm Fwd/Bwd, RMSNorm Fwd/Bwd, Softmax Fwd/Bwd.)

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Lucaskabela


pytorch-bot bot commented Jul 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158536

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit d39fa46 with merge base a3aacd6:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@BoyuanFeng BoyuanFeng marked this pull request as draft July 17, 2025 04:33
@vadimkantorov (Contributor)

Might also be interesting to add, in addition to liger:

@BoyuanFeng (Contributor, Author)

@vadimkantorov yes, quack is added. Thanks for letting me know that flash_attn also has ops besides attention.

@vadimkantorov (Contributor) commented Jul 17, 2025

There is also some triton-bench.com / triton-bench.ai effort: https://youtu.be/5e1YKqsP8i8?t=2123

https://github.com/pytorch-labs/tritonbench/

Also, it would maybe be good to add the vanilla examples from triton here.

@BoyuanFeng BoyuanFeng added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 18, 2025
@BoyuanFeng BoyuanFeng requested review from yf225, zou3519 and eellison July 18, 2025 20:12
@BoyuanFeng BoyuanFeng marked this pull request as ready for review July 18, 2025 20:12

torch._dynamo.config.automatic_dynamic_shapes = False
# Needed since changing args to function causes recompiles
torch._dynamo.config.recompile_limit = 1000000
Contributor

maybe we can call torch._dynamo.reset() after each benchmark? not sure if it has the same effect

Contributor Author

seems we only need torch._dynamo.config.automatic_dynamic_shapes = False and large recompile_limit. Maybe better to be explicit than torch._dynamo.reset()?

Comment on lines 28 to 32
torch._dynamo.config.automatic_dynamic_shapes = False
# Needed since changing args to function causes recompiles
torch._dynamo.config.recompile_limit = 1000000

torch._inductor.config.force_disable_caches = True
Contributor Author

force recompilation for each size

@eellison (Contributor) left a comment

How does this relate to tritonbench? cc @FindHao

@BoyuanFeng (Contributor, Author)

This benchmark (a) covers GenAI kernels and will be extended to layers and some common fusion patterns, and (b) compares inductor against other DSLs (e.g., quack, liger, helion, and more), so we can track regressions and motivate optimizations.

Triton Bench covers hand-tuned kernels but does not cover layers. It also focuses on triton and may not compare against other DSLs.

@FindHao (Member) commented Jul 18, 2025

How does this relate to tritonbench? cc @FindHao

I guess there is some duplication. @xuzhao9 has added two ops to tritonbench (pytorch-labs/tritonbench#289). Considering TritonBench has been integrated into many internal tests, nightly runs, different version checks, etc., I feel it's better to add ops directly to tritonbench.

@FindHao (Member) commented Jul 18, 2025

a) covers genai kernels and will be extended to layers/some common fusion patterns.

tritonbench covers many kernels, especially some internal kernels in the internal version. I agree tritonbench doesn't support layers or fusion patterns right now.

b) compares inductor against other DSL (e.g., quack, liger, helion, and more)

tritonbench supports comparisons with eager, inductor, liger, quack (2 ops for now), helion (I remember Xu mentioned it somewhere), jax, ThunderKittens, amd iter, etc. See the dependencies in this file: https://github.com/pytorch-labs/tritonbench/blob/main/install.py

@eellison (Contributor)

I know we've talked about showing tritonbench in dashboard (cc @drisspg) - until that's there I think the reality is folks may create other benchmarks.

@FindHao (Member) commented Jul 18, 2025

@xuzhao9 I remember you mentioned there are already some internal benchmark dashboards, right? Please correct me if I'm wrong.
@eellison I see. But if it is only for visualization, then we can just have a visualization component. I'm thinking that, in the long term, these divergences may lead to much more follow-up duplicated work.

@FindHao (Member) commented Jul 18, 2025

I'm not trying to push back on this PR, just want to point out the partial duplication and possible wasted effort. I'm totally OK with having multiple benchmarks for different needs.

@BoyuanFeng BoyuanFeng changed the title GenAI Kernel Benchmark GenAI Layer Benchmark Jul 18, 2025
@BoyuanFeng (Contributor, Author)

Discussed offline with @FindHao and decided that "GenAI Layer Benchmark" fits better.

@FindHao (Member) commented Jul 18, 2025

Yeah, I feel it's more like gptfast, having some reference implementations rather than being a benchmark framework.

@vadimkantorov (Contributor)

Some more kernels in https://github.com/pytorch-labs/applied-ai

@BoyuanFeng (Contributor, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.
