Conversation

pytorch-bot bot commented Sep 23, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163691

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 123 Pending

As of commit 35715a8 with merge base 0696a4b:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jainapurva and others added 28 commits September 26, 2025 10:42
Some properties with `cache_on_self` were previously annotated with `no_type_check` to work around mypy limitations. This PR replaces both annotations with `cache_property_on_self`, to enable type checking.
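
For context, a minimal sketch of what such a decorator can look like (this is not the actual inductor implementation, only an illustration of the typing trick):

```python
import functools
from collections.abc import Callable
from typing import TypeVar

_T = TypeVar("_T")

def cache_property_on_self(fn: Callable[..., _T]) -> _T:
    # Cache the computed value per instance while letting mypy see the attribute
    # as the property's return type, so no @no_type_check is needed.
    return functools.cached_property(fn)  # type: ignore[return-value]
```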

Pull Request resolved: #163570
Approved by: https://github.com/mlazos, https://github.com/PaulZhang12, https://github.com/Skylion007
…on x block (#162446)

Scale up XBLOCK for contiguous persistent reductions based on rnumel and number of loads + stores

<img width="928" height="656" alt="Screenshot 2025-09-18 at 5 02 57 PM" src="https://github.com/user-attachments/assets/ec3c561f-2a3f-4459-9e14-653715898da3" />

Differential Revision: [](https://our.internmc.facebook.com/intern/diff/)
Pull Request resolved: #162446
Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #162296
Introduces a variant of size-hint multi-kernel where, for novel runtime shapes, instead of performing full benchmarking to determine the optimal kernel, we select one of many kernels pre-generated from the multi-kernel hints, based on similarity between the hint and the runtime input/output shapes (L1 distance in log2 space); a selection sketch follows the caveats below.

Some caveats/changes:
- Size-hint multi-kernel now only kicks in if the kernel has dynamic shapes
- Pre-generation still only does a 1-d search over the specified hints, e.g. `matmul([s0, s1], [s1, s2])` with size-hints `[64, 256]` only generates 2 kernels, tuned for shapes ([64, 64], [64, 64]) and ([256, 256], [256, 256]). Extending this to a reasonable n-d search (perhaps via a user API) is left as a future extension.
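
A minimal sketch of the selection rule (the container layout and kernel names are hypothetical; only the log2-space L1 distance comes from the description above):

```python
import math

def pick_kernel(kernels_by_hint, runtime_shapes):
    """Pick the pre-generated kernel whose tuning shapes are closest to the
    observed runtime shapes under an L1 distance computed in log2 space."""
    def log2_l1(hint_shapes, shapes):
        return sum(
            abs(math.log2(h) - math.log2(s))
            for hs, ss in zip(hint_shapes, shapes)
            for h, s in zip(hs, ss)
        )
    return min(kernels_by_hint.items(), key=lambda kv: log2_l1(kv[0], runtime_shapes))[1]

# e.g. two kernels pre-generated for matmul([s0, s1], [s1, s2]) with hints 64 and 256:
kernels = {((64, 64), (64, 64)): "kernel_64", ((256, 256), (256, 256)): "kernel_256"}
print(pick_kernel(kernels, ((48, 70), (70, 300))))  # -> "kernel_64"
```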

Benchmarking results, compared to multi-kernel with full benchmarking (hints 64, 4096) and to compiling with the ground-truth hint:
<img width="1902" height="1222" alt="550541081_1088709150049684_6528797079439730237_n" src="https://github.com/user-attachments/assets/056cca48-c16a-4451-9b4a-fa13a7a058a9" />

Full benchmarking doing worse is extremely odd, but we did see similar spikes in #156628.

Pull Request resolved: #163090
Approved by: https://github.com/bobrenjc93
Fixes #ISSUE_NUMBER

Pull Request resolved: #163614
Approved by: https://github.com/eqy
Landing this instead of #162994.

Here is how I think the whole dynamo + frame construction logic works:
1) There is no way to create a frame object from Python land, since frames are created at runtime by CPython. That is why aot_compile creates FrameInfo this way (kind of like simulating the runtime). I guess you could write your own very simple eval_frame.c where you intercept frame construction, but we probably don't want that.
2) When there is no wrapper (the old export or aot_compile), we first assign sources by iterating over f_locals, which contains both local args and closure variables (this is an implementation detail of CPython frame construction). That is why closure variables end up getting LocalSource names, as shown in this test case (https://github.com/pytorch/pytorch/blob/f6ea41ead27205a5ece3e9a41b7af30fafe67d7a/test/export/test_export.py#L1369). Note that L["self"] here means we are referring to the local object self. The important thing to keep in mind is that this self is not actually the model's self, but the outer self.
3) When we switch to the wrapper case, we end up trying to inline the original inner module. When doing so, we need to track all locals and closures for this inner module, as can be seen here (https://github.com/pytorch/pytorch/blob/f6ea41ead27205a5ece3e9a41b7af30fafe67d7a/torch/_dynamo/variables/functions.py#L463). Here we are not looking into the inner frame's f_locals but directly at the closures. I guess this is because we are one frame up, so there is no access to the inner frame's f_locals at this point, and it is probably not a good idea to change dynamo's logic here. As a result, I get the following error message, which differs from old export:
"While exporting, we found certain side effects happened in the model.forward. Here are the list of potential sources you can double check: ["L['self']._export_root.forward.__func__.__closure__[1].cell_contents.bank", "L['self']._export_root.forward.__func__.__closure__[1].cell_contents.bank_dict", "L['self']._export_root.forward.__func__.__closure__[0].cell_contents"]"

My initial attempt at solving this was to take the inner closures and put them into f_locals for the frame I am constructing, but that turned out to be too complicated because we would need to muck around with bytecode instructions as well. So I am thinking we should just update the test to reflect the new names and follow up with a better post-processing step to produce nicer names.
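
As a standalone illustration of the f_locals vs. __closure__ distinction (plain CPython, no dynamo involved): inside a running frame, closure variables show up in f_locals alongside real locals, while from one frame up they are only reachable through __closure__ cells.

```python
import sys

def make_forward(bank):
    def forward(x):
        frame = sys._getframe()
        assert "bank" in frame.f_locals   # the closure variable appears like a local
        return x + bank
    return forward

fwd = make_forward(10)
fwd(1)
print(fwd.__closure__[0].cell_contents)   # -> 10, the only handle from outside the frame
```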

Differential Revision: [D82582029](https://our.internmc.facebook.com/intern/diff/D82582029)

Pull Request resolved: #163107
Approved by: https://github.com/avikchaudhuri
Update links in conf.py to docs.pytorch.org

Fixes #ISSUE_NUMBER

Pull Request resolved: #163682
Approved by: https://github.com/sekyondaMeta, https://github.com/albanD
…ng with dynamic shapes (#163365)

As per comment in source code:
```
            # If we are coalescing on xblock (not ReductionHint.INNER) and this is not a tiny kernel
            # (not ReductionHint.OUTER_TINY), do not use persistent reduction if it induces tile
            # quantization. Persistent reduction forces rblock == rnumel; if the bounds between lower
            # and upper are large, for the lower values we will be masking off a large % of reads/writes,
            # when we could expand the coalescing xblock instead.
```

For the test case in question, this PR improves perf from 0.8573521325143717 to 0.043151492193814305, because we were egregiously masking off rblock values (58 of the 64 values).
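
Roughly, the decision amounts to something like the following sketch (the function name and threshold are hypothetical, not inductor's actual code):

```python
def want_persistent_reduction(reduction_hint, rnumel_lower, rnumel_upper, max_spread=2):
    # Coalescing on the reduction dim (INNER) or tiny kernels (OUTER_TINY) keep
    # persistent reductions. Otherwise, persistence pins RBLOCK == rnumel, so a wide
    # dynamic range means small runtime rnumel would mask off most reads/writes;
    # prefer growing the coalesced XBLOCK instead in that case.
    if reduction_hint in ("INNER", "OUTER_TINY"):
        return True
    return rnumel_upper <= max_spread * rnumel_lower
```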

Differential Revision: [D82853279](https://our.internmc.facebook.com/intern/diff/D82853279)
Pull Request resolved: #163365
Approved by: https://github.com/shunting314, https://github.com/PaulZhang12, https://github.com/jansel, https://github.com/v0i0
I had been using tiling scores to essentially check whether this is an inner reduction. Since that is not fully rolled out for dynamic shapes, use the reduction hint when tiling scores are not available.

Pull Request resolved: #163371
Approved by: https://github.com/PaulZhang12
Fixes #ISSUE_NUMBER

Pull Request resolved: #163693
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Summary:
We add parsing for lists of strings. This is needed for AOTInductor
profiling of Triton kernel input information.

Test Plan:
Included in commit.
test_profiler_op_event_kwargs_list_of_strings

Pull Request resolved: #163593
Approved by: https://github.com/sraikund16
Pull Request resolved: #163653
Approved by: https://github.com/jansel
ghstack dependencies: #163648, #163649
Support for amin, amax, and aminmax
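
For reference, the eager semantics of the ops being covered (standard PyTorch API; the backend under test is whatever this stack targets):

```python
import torch

x = torch.tensor([[1.0, -2.0], [3.0, 0.5]])
print(torch.amin(x, dim=1))         # per-row minimum -> tensor([-2.0000, 0.5000])
print(torch.amax(x, dim=1))         # per-row maximum -> tensor([1.0000, 3.0000])
mn, mx = torch.aminmax(x, dim=1)    # both reductions in one call
```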

Test Plan: E2E tests in the stack with the benchmark suite pass.

Differential Revision: D83016894

Pull Request resolved: #163669
Approved by: https://github.com/albanD, https://github.com/malfet
Continued work to use std::filesystem in inductor.
Pull Request resolved: #163632
Approved by: https://github.com/Skylion007
…62688)

Summary: Restricts subprocess benchmarking to `TritonTemplateCaller` only, which is what the underlying `target` method expects. This had triggered a bug with large-K shapes because the decompose-k choice is a `SubgraphChoiceCaller`.

Test Plan:
mm autotuning with a large K and `TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1`
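
A hedged repro sketch of that test plan (shapes and dtype here are illustrative, not the exact test):

```python
import os
os.environ["TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC"] = "1"  # run autotune benchmarking in a subprocess

import torch

a = torch.randn(64, 65536, device="cuda", dtype=torch.bfloat16)   # large K dimension
b = torch.randn(65536, 64, device="cuda", dtype=torch.bfloat16)
mm = torch.compile(lambda x, y: x @ y, mode="max-autotune")
out = mm(a, b)
```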

Differential Revision: D82181924

Pull Request resolved: #162688
Approved by: https://github.com/PaulZhang12, https://github.com/eellison, https://github.com/mlazos
Test Plan: existing tests

Differential Revision: D82995546

Pull Request resolved: #163550
Approved by: https://github.com/tugsbayasgalan
Summary: When generating Triton kernels in the compile-time autotune blocks, it is useful to emit source information as code comments. Previously we skipped these comments for autotune code blocks because the generated main output code contains the same information, but that does not help when the generated autotune code crashes.

Pull Request resolved: #163600
Approved by: https://github.com/yushangdi
Fixes #162228

# Summary

The majority of our tests only compile flex-attention in isolation. This means that for fake tensor propagation, the input primals and all captured buffers do not go through any intermediate computation below autograd. As a result, they happen by chance to match the `requires_grad`-ness of the eager implementation, and this check passes. However, if score_mod is the result of some other intermediate fake tensor prop, then it is not guaranteed to have accurate requires_grad-ness, which is what was happening here.

TL;DR: this was a belt-and-suspenders check that was actually harmful, and we should just let the joint-graph machinery create the correct joint graph.
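
A hedged illustration of the pattern in question, where the buffer captured by score_mod is itself the result of intermediate computation rather than a plain input (shapes and the bias construction are made up):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

q, k, v = (torch.randn(1, 2, 128, 64, device="cuda", requires_grad=True) for _ in range(3))
bias = torch.randn(128, device="cuda").sigmoid()   # intermediate result captured below

def score_mod(score, b, h, q_idx, kv_idx):
    return score + bias[kv_idx]

out = torch.compile(flex_attention)(q, k, v, score_mod=score_mod)
out.sum().backward()
```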

Pull Request resolved: #163677
Approved by: https://github.com/ydwu4
…63344)

There is only one substantive change: the branch on
`global_offset[shard_dim] <= local_offset[shard_dim]`
is removed because it is unnecessary: you can always treat the
first shard uniformly with the rest of the shards, because your
global offset is guaranteed to be zero in this case anyway.

I also switch the shard_size case to sym_ite, to make it possible
for LocalTensor to deal with the MPMD-ness here, but it's equivalent
to the old if-then-else.

I tried to rewrite the comments to make it clearer what is going on
algorithmically here.
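
A minimal sketch of the uniform shard-size computation (the helper and its structure are hypothetical; `torch.sym_ite(cond, a, b)` is the symbolic if-then-else mentioned above):

```python
import torch

def shard_size(dim_size, num_chunks, rank):
    full = (dim_size + num_chunks - 1) // num_chunks   # ceil-divided chunk length
    global_offset = rank * full                        # zero for the first shard anyway
    remaining = dim_size - global_offset
    # symbolic if-then-else: a full chunk unless this is the (possibly empty) tail shard
    return torch.sym_ite(remaining >= full, full, torch.sym_max(remaining, 0))

print([shard_size(10, 4, r) for r in range(4)])        # -> [3, 3, 3, 1]
```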

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: #163344
Approved by: https://github.com/albanD, https://github.com/zpcore, https://github.com/tianyu-l
licy666 and others added 16 commits September 28, 2025 20:37
Replace the **runtime_error** of the vanilla C++ exceptions with **TORCH_CHECK** in **torch/nativert/***

Vanilla C++ exceptions should not exist in the core part of PyTorch because of its cross-language nature. Compared with vanilla C++ exceptions, TORCH_CHECK carries richer error context and goes through the unified error-handling mechanism. This commit replaces runtime_error with TORCH_CHECK in the files under torch/nativert/*.

Fixes part of #148114

Pull Request resolved: #163308
Approved by: https://github.com/dolpm
This adds basic support for subclass inputs in export (specifically for non-strict). I had to make fakify a little more complicated, which risks further divergence from dynamo fakification, but the dynamo one is so complex that I feel it is better to do it this way. Also improved the fake mode detection logic to recursively look into subclass inner tensors.

Differential Revision: [D83156489](https://our.internmc.facebook.com/intern/diff/D83156489)
Pull Request resolved: #163770
Approved by: https://github.com/avikchaudhuri
The export team is fixing up the old strict export implementation; as a result, it fails a check where we proxy the whole module under the given directories. _WrapperModule is a way for torchao to work around export requiring an nn.Module to trace, so it should never get proxied into the graph.

Differential Revision: [D82732613](https://our.internmc.facebook.com/intern/diff/D82732613)

Pull Request resolved: #163258
Approved by: https://github.com/anijain2305
ghstack dependencies: #163136, #163137
The [docs](https://docs.pytorch.org/functorch/stable/) for `functorch` have been migrated into the [PyTorch docs](https://docs.pytorch.org/docs/stable/func.html) since PyTorch 2.0, so I think we can remove them now to reduce compute resource usage.
Pull Request resolved: #162581
Approved by: https://github.com/ezyang
Remove two unnecessary `PY_VERSION_HEX` branches.

Pull Request resolved: #164055
Approved by: https://github.com/ezyang
This PR introduces a new "operator microbenchmark" CI workflow and GitHub Actions for operator microbenchmarks, updating test scripts and job matrices to support new parameters, and broadening the operator benchmark tests to include more data types, larger shapes, and gradient tests. The benchmark configurations now focus on different CUDA hardware and multiple dtypes (bf16, fp16, fp32), for both compile and eager mode.

**Benchmark Configuration and Coverage:**

* Expanded operator benchmark configurations in `addmm_test.py`, `bmm_test.py`, `matmul_test.py`, and `mm_test.py` to benchmark multiple dtypes on CUDA devices, in eager and compile mode, for forward and backward runs. The configs tagged "long" in the above files are run in CI.
* CI benchmarking runs on multiple hardware platforms: H100 and A100.
* The CI job also uploads the microbenchmarking outputs to a [HUD](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fpytorch&benchmarkName=PyTorch+operator+microbenchmark) dashboard.

Pull Request resolved: #162530
Approved by: https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
…164079)

This is rare but happens with executorch tests.

Pull Request resolved: #164079
Approved by: https://github.com/tugsbayasgalan
Apply the UP035 `ruff` rule in tests; some tests for `fx` and `dynamo` are excluded in case the old typing constructs are themselves the test target.
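
Roughly, what UP035 (deprecated-import) flags and the replacement it suggests:

```python
# Flagged by UP035 (deprecated-import):
#   from typing import List, Callable
# Preferred: the builtin generic plus the collections.abc home.
from collections.abc import Callable

def apply_all(fs: list[Callable[[int], int]], x: int) -> list[int]:
    return [f(x) for f in fs]
```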

Pull Request resolved: #163947
Approved by: https://github.com/ezyang
It is unclear why we had the disable in the first place. With install_free_tensors, we are tracing into this hook. A better way would be to place the tracer without any hook. For now, disable the checking while dynamo is tracing.
Pull Request resolved: #164084
Approved by: https://github.com/tugsbayasgalan
This PR removes skip conditions for ROCm <= 3.5.

Pull Request resolved: #164058
Approved by: https://github.com/kwen2501
@jainapurva jainapurva closed this Sep 29, 2025
@jainapurva jainapurva deleted the attention_benchmarking branch September 29, 2025 04:39