GenAI Layer Benchmark #158536
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158536
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 Unrelated Failures) As of commit d39fa46 with merge base a3aacd6.
BROKEN TRUNK - The following job failed but was present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Might also be interesting to add, in addition to liger: quack and flash_attn?
@vadimkantorov Yes, quack is added. Thanks for letting me know flash_attn also has ops besides attention.
There is also the triton-bench.com / triton-bench.ai effort: https://youtu.be/5e1YKqsP8i8?t=2123, https://github.com/pytorch-labs/tritonbench/. It would also be good to add the vanilla examples from Triton here.
```python
torch._dynamo.config.automatic_dynamic_shapes = False
# Needed since changing args to function causes recompiles
torch._dynamo.config.recompile_limit = 1000000
```
Maybe we can call torch._dynamo.reset() after each benchmark? Not sure if it has the same effect.
It seems we only need torch._dynamo.config.automatic_dynamic_shapes = False and a large recompile_limit. Maybe better to be explicit than to call torch._dynamo.reset()?
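A minimal sketch of the trade-off under discussion (the loop, sizes, and layer_fn below are illustrative, not from the PR): with automatic dynamic shapes off, every new input size specializes and recompiles, so the recompile limit has to be large; torch._dynamo.reset() between sizes would instead wipe all compile state.

```python
import torch

# Explicit-config approach: every distinct input size compiles a fresh
# static-shape kernel, so the recompile limit must cover all sizes.
torch._dynamo.config.automatic_dynamic_shapes = False
torch._dynamo.config.recompile_limit = 1000000

def layer_fn(x):  # hypothetical kernel under test
    return torch.nn.functional.softmax(x, dim=-1)

compiled = torch.compile(layer_fn)
for n in (1024, 2048, 4096):  # illustrative sizes
    x = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
    compiled(x)  # recompiles for each size instead of going dynamic

# The alternative raised above: torch._dynamo.reset() between sizes clears
# all compile state, which also avoids the limit but is less explicit.
```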
```python
torch._inductor.config.force_disable_caches = True
```
force recompilation for each size
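A sketch of what disabling the caches buys for benchmarking (the timing helper below is illustrative, not from the PR): with Inductor caches off, the first call at each size pays the full compile cost, so per-size compile time is actually measured rather than served from a cache.

```python
import time
import torch

# With Inductor caches disabled, every compile is done from scratch.
torch._inductor.config.force_disable_caches = True

def measure_compile_time(fn, x):
    torch._dynamo.reset()        # drop previously compiled graphs
    compiled = torch.compile(fn)
    start = time.perf_counter()
    compiled(x)                  # first call triggers compilation
    torch.cuda.synchronize()
    return time.perf_counter() - start
```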
How does this relate to tritonbench? cc @FindHao
This benchmark (a) covers GenAI kernels and will be extended to layers and some common fusion patterns, and (b) compares Inductor against other DSLs (e.g., quack, liger, helion, and more), so we can track regressions and motivate optimizations. TritonBench covers hand-tuned kernels but does not cover layers; it also focuses on Triton and may not compare against other DSLs.
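A minimal sketch of that kind of comparison, assuming Triton's do_bench helper and an illustrative RMSNorm (the liger/quack entries would wrap their own kernels):

```python
import torch
from triton.testing import do_bench

def rmsnorm_eager(x, w, eps=1e-6):
    # eager PyTorch reference used as the baseline
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * w

x = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)
w = torch.randn(4096, device="cuda", dtype=torch.bfloat16)

backends = {
    "eager": rmsnorm_eager,
    "inductor": torch.compile(rmsnorm_eager),
    # "liger" / "quack" entries would go here, wrapping their kernels
}
for name, fn in backends.items():
    ms = do_bench(lambda: fn(x, w))  # median runtime in milliseconds
    print(f"{name}: {ms:.3f} ms")
```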
I guess there is some duplication. @xuzhao9 has added two ops to tritonbench (pytorch-labs/tritonbench#289). Considering TritonBench has been integrated into many internal tests, nightly runs, different version checks, etc., I feel it's better to add ops directly to tritonbench.
tritonbench covers many kernels, especially some internal kernels in the internal version. I agree tritonbench doesn't support layers or fusion patterns right now.
It supports comparisons with eager, inductor, liger, quack (2 ops for now), helion (I remember Xu mentioned it somewhere), JAX, ThunderKittens, AMD aiter, etc.; see the dependencies in https://github.com/pytorch-labs/tritonbench/blob/main/install.py
I know we've talked about showing tritonbench in a dashboard (cc @drisspg); until that's there, I think the reality is that folks may create other benchmarks.
@xuzhao9 I remember you mentioned there are already some internal benchmark dashboards, right? Please correct me if I'm wrong.
I'm not trying to push back on this PR, just want to point out the partial duplication and possible wasted effort. I'm totally OK with having multiple benchmarks for different needs.
Discussed offline with @FindHao and decided that "GenAI Layer Benchmark" fits better.
Yeah, I feel it's more like gpt-fast, with some reference implementations, rather than a benchmark framework.
Some more kernels in https://github.com/pytorch-labs/applied-ai |
@pytorchbot merge |
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This PR adds a GenAI layer benchmark. It compares PyTorch eager, PyTorch compiler (Inductor), liger, and quack.
It covers all kernels supported by quack (CrossEntropy Fwd/Bwd, Softmax Fwd/Bwd, RMSNorm Fwd/Bwd, LayerNorm Fwd), plus LayerNorm Bwd.
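As an illustration of one covered kernel pair, here is a minimal eager-mode CrossEntropy Fwd/Bwd (a sketch; the shapes are illustrative and the benchmark's actual reference implementations may differ):

```python
import torch

def cross_entropy_fwd_bwd(logits, target):
    logits = logits.detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(logits, target)  # forward
    loss.backward()                                           # backward
    return loss, logits.grad

# illustrative GenAI-scale shapes: (batch * seq) tokens x vocab size
logits = torch.randn(8192, 32768, device="cuda", dtype=torch.bfloat16)
target = torch.randint(0, 32768, (8192,), device="cuda")
loss, grad = cross_entropy_fwd_bwd(logits, target)
```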
Motivations
Key Settings
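These are the config knobs shown in the diff snippets above:

```python
torch._dynamo.config.automatic_dynamic_shapes = False
# Needed since changing args to function causes recompiles
torch._dynamo.config.recompile_limit = 1000000
# Force recompilation for each size
torch._inductor.config.force_disable_caches = True
```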
GPU: H100 (devvm006.dkl0)
Results: P1874246170
Note: for numerical accuracy, we use the default tolerances of torch.testing.assert_close (i.e., for torch.bfloat16, rtol=1.6e-2 and atol=1e-5). This surfaces numerical issues for some backends and kernels (a sketch of the check follows below).
Next steps are to add roofline analysis, add the benchmark to CI to catch regressions, cover more GenAI kernels, and include GenAI layers for common fusion patterns.
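A sketch of that accuracy check (the softmax and shapes are illustrative): a backend's output is compared against a reference, and any mismatch beyond assert_close's default tolerances raises.

```python
import torch

def softmax_ref(x):
    return torch.softmax(x, dim=-1)

x = torch.randn(1024, 1024, device="cuda", dtype=torch.bfloat16)
ref = softmax_ref(x.float()).to(torch.bfloat16)  # fp32 reference, cast back
out = torch.compile(softmax_ref)(x)

# For torch.bfloat16 the defaults are rtol=1.6e-2, atol=1e-5; failures here
# are the numerical issues mentioned in the note above.
torch.testing.assert_close(out, ref)
```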
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Lucaskabela