
GenAI Layer Benchmark #158536


Closed

wants to merge 18 commits into from

Conversation

@BoyuanFeng (Contributor) commented Jul 17, 2025

This PR adds a GenAI layer benchmark. It compares PyTorch eager, torch.compile, liger, and quack.

It covers all kernels supported by quack (CrossEntropy Fwd/Bwd, Softmax Fwd/Bwd, RMSNorm Fwd/Bwd, LayerNorm Fwd), plus LayerNorm Bwd.

Motivations

  • Many OSS users have asked how to properly benchmark torch.compile-generated kernels. One common error is to compile a kernel/layer for one shape (e.g., batch size = 1) and benchmark it on another shape (e.g., batch size = 1024), which leads to bad performance. This PR provides a simple, clear example of proper benchmarking.
  • We recently added a GenAI model benchmark (based on vLLM). But it's usually hard to optimize models directly due to their complexity. Layer benchmarks are easier to reason about and optimize.

Key Settings

  • Avoid reusing a kernel specialized on one shape when benchmarking another shape.
torch._dynamo.config.automatic_dynamic_shapes = False
# Needed since changing args to function causes recompiles
torch._dynamo.config.recompile_limit = 1000000
  • For forward passes, people may mark the batch size as dynamic to avoid runtime recompilation. We respect that setting in this kernel-level benchmark.
torch._dynamo.mark_dynamic(x, 0)

GPU: H100 (devvm006.dkl0)

Results: P1874246170

Note: for numerical accuracy, we use the default tolerances of torch.testing.assert_close (i.e., for torch.bfloat16, rtol=1.6e-2 and atol=1e-5). This surfaces numerical issues for some backends and kernels.
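For reference, torch.testing.assert_close accepts a pair of values when |actual − expected| ≤ atol + rtol·|expected|. A plain-Python sketch of that rule with the bfloat16 defaults quoted above (the helper name is made up for illustration):

```python
# Sketch of the default bfloat16 tolerance check applied by
# torch.testing.assert_close; the helper name is hypothetical.
def within_default_bf16_tol(actual, expected, rtol=1.6e-2, atol=1e-5):
    return abs(actual - expected) <= atol + rtol * abs(expected)

# A ~1% relative error passes the default bfloat16 tolerance...
assert within_default_bf16_tol(1.01, 1.00)
# ...while a ~5% relative error fails it.
assert not within_default_bf16_tol(1.05, 1.00)
```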

Next steps are to add roofline analysis, add the benchmark to CI to check for regressions, cover more GenAI kernels, and include GenAI layers for common fusion patterns.

(Benchmark charts attached: CrossEntropy Fwd/Bwd, LayerNorm Fwd/Bwd, RMSNorm Fwd/Bwd, Softmax Fwd/Bwd.)

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Lucaskabela


pytorch-bot bot commented Jul 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158536

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit d39fa46 with merge base a3aacd6:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@BoyuanFeng BoyuanFeng marked this pull request as draft July 17, 2025 04:33
@vadimkantorov (Contributor)

Might also be interesting to add, in addition to liger:

@BoyuanFeng (Contributor, Author)

@vadimkantorov yes, quack is added. Thanks for letting me know that flash_attn also has ops besides attention.

@vadimkantorov (Contributor) commented Jul 17, 2025

There is also some triton-bench.com / triton-bench.ai effort: https://youtu.be/5e1YKqsP8i8?t=2123

https://github.com/pytorch-labs/tritonbench/

Also, it would maybe be good to add the vanilla examples from triton here.

@BoyuanFeng BoyuanFeng added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 18, 2025
@BoyuanFeng BoyuanFeng requested review from yf225, zou3519 and eellison July 18, 2025 20:12
@BoyuanFeng BoyuanFeng marked this pull request as ready for review July 18, 2025 20:12

torch._dynamo.config.automatic_dynamic_shapes = False
# Needed since changing args to function causes recompiles
torch._dynamo.config.recompile_limit = 1000000
Contributor

maybe we can call torch._dynamo.reset() after each benchmark? not sure if it has the same effect

Contributor Author

seems we only need torch._dynamo.config.automatic_dynamic_shapes = False and large recompile_limit. Maybe better to be explicit than torch._dynamo.reset()?

Comment on lines 28 to 32
torch._dynamo.config.automatic_dynamic_shapes = False
# Needed since changing args to function causes recompiles
torch._dynamo.config.recompile_limit = 1000000

torch._inductor.config.force_disable_caches = True
Contributor Author

force recompilation for each size

@eellison (Contributor) left a comment

How does this relate to tritonbench? cc @FindHao

@BoyuanFeng (Contributor, Author)

This benchmark (a) covers GenAI kernels and will be extended to layers and some common fusion patterns, and (b) compares inductor against other DSLs (e.g., quack, liger, helion, and more), so we can track regressions and motivate optimizations.

Triton Bench covers hand-tuned kernels but does not cover layers. It also focuses on triton and may not compare against other DSLs.

@FindHao (Member) commented Jul 18, 2025

How does this relate to tritonbench? cc @FindHao

I guess there is some duplication. @xuzhao9 has added two ops to tritonbench (pytorch-labs/tritonbench#289). Considering TritonBench has been integrated into many internal tests, nightly runs, different version checks, etc., I feel it's better to add ops directly to tritonbench.

@FindHao (Member) commented Jul 18, 2025

a) covers genai kernels and will be extended to layers/some common fusion patterns.

tritonbench covers many kernels, especially some internal kernels in the internal version. I agree tritonbench doesn't support layers or fusion patterns right now.

b) compares inductor against other DSL (e.g., quack, liger, helion, and more)

tritonbench supports comparisons with eager, inductor, liger, quack (2 ops for now), helion (I remember Xu mentioned it somewhere), jax, ThunderKittens, amd iter, etc. See the dependencies in this file: https://github.com/pytorch-labs/tritonbench/blob/main/install.py

@eellison (Contributor)

I know we've talked about showing tritonbench in dashboard (cc @drisspg) - until that's there I think the reality is folks may create other benchmarks.

@FindHao (Member) commented Jul 18, 2025

@xuzhao9 I remember you mentioned there are already some internal benchmark dashboards, right? Please correct me if I'm wrong.
@eellison I see. But if it is only for visualization, then we can just have a visualization component. I'm thinking that, in the long term, these divergences may lead to much more follow-up duplicated work.

@FindHao (Member) commented Jul 18, 2025

I'm not trying to push back on this PR, just want to point out the partial duplication and possible wasted effort. I'm totally OK with having multiple benchmarks for different needs.

@BoyuanFeng BoyuanFeng changed the title GenAI Kernel Benchmark GenAI Layer Benchmark Jul 18, 2025
@BoyuanFeng (Contributor, Author)

Discussed offline with @FindHao and decided that "GenAI Layer Benchmark" fits better.

@FindHao (Member) commented Jul 18, 2025

Yeah, I feel it's more like gptfast, having some reference implementations rather than being a benchmark framework.

@vadimkantorov (Contributor)

Some more kernels in https://github.com/pytorch-labs/applied-ai

@BoyuanFeng (Contributor, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.
