Add attention benchmarking to operator microbenchmarks #163691
Some properties decorated with `cache_on_self` were previously annotated with `no_type_check` to get around mypy limitations. This PR replaces both annotations with `cache_property_on_self`, enabling type checking. Pull Request resolved: #163570 Approved by: https://github.com/mlazos, https://github.com/PaulZhang12, https://github.com/Skylion007
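For context, a minimal sketch of what a typed cache-on-self property decorator can look like; the decorator name comes from the PR, but this body is an assumption, not the actual inductor implementation:

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def cache_property_on_self(fn: Callable[..., T]) -> property:
    # Cache the computed value on the instance under a private attribute,
    # so repeated property accesses skip recomputation.
    attr = f"_cached_{fn.__name__}"

    def getter(self) -> T:
        if not hasattr(self, attr):
            setattr(self, attr, fn(self))
        return getattr(self, attr)

    return property(getter)
```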
Fixes build timeouts >4h on libtorch build jobs: https://hud.pytorch.org/hud/pytorch/pytorch/75e7f49f9c70116d7c4f8f86c3d0688ade306284/1?per_page=50&name_filter=inux-binary-libtorch%20%2F%20libtorch-rocm&mergeEphemeralLF=true Brings back code to narrow down CK compilation targets from 69a25f6#diff-ce80f3115ab2f6be5142f0678a1fc92c6b2d7727766ce44f48726c99e720f777. gfx942 supports fp8. Don't enable gfx950 for now, until more optimizations are in place, as per https://github.com/pytorch/pytorch/pull/162648/files#r2369588738 Validation: [rocm6.4](https://github.com/pytorch/pytorch/actions/runs/17944766350/job/51028483128) and [rocm6.3](https://github.com/pytorch/pytorch/actions/runs/17944766350/job/51028483093) libtorch builds finished within 3.9h. Pull Request resolved: #162648 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>
…on x block (#162446) Scale up XBLOCK for contiguous persistent reductions based on rnumel and the number of loads + stores. [benchmark screenshot: https://github.com/user-attachments/assets/ec3c561f-2a3f-4459-9e14-653715898da3] Pull Request resolved: #162446 Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314 ghstack dependencies: #162296
Introduces a variant of size-hint multi-kernel where, for novel runtime shapes, instead of performing full benchmarking to determine the optimal kernel, we select one of many kernels pre-generated from multi-kernel hints, based on the similarity between hint and runtime input/output shapes (L1 distance in log2 space). Some caveats/changes: - Size-hint multi-kernel now only kicks in if the kernel has dynamic shapes - Pre-generation still only does a 1-d search over the specified hints, e.g. `matmul([s0, s1], [s1, s2])` with size hints `[64, 256]` only generates 2 kernels, based on tuning shapes ([64, 64], [64, 64]) and ([256, 256], [256, 256]). Extending this to a reasonable n-d search (via a user API?) is left as an extension. Benchmarking results, compared to multi-kernel with full benchmarking (hints 64, 4096) and compiling with the ground-truth hint: [benchmark chart: https://github.com/user-attachments/assets/056cca48-c16a-4451-9b4a-fa13a7a058a9] Full benchmarking doing worse is extremely weird, but we did see similar spikes in #156628 Pull Request resolved: #163090 Approved by: https://github.com/bobrenjc93
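A minimal sketch of the selection metric described above; `shape_distance` and the hint-to-kernel mapping are hypothetical illustrations, not the actual multi-kernel code:

```python
import math

def shape_distance(hint_shape, runtime_shape):
    # L1 distance between shapes in log2 space: nearby powers of two
    # count as similar regardless of absolute magnitude.
    return sum(abs(math.log2(h) - math.log2(r))
               for h, r in zip(hint_shape, runtime_shape))

# Pick the pre-generated kernel whose tuning shapes are closest to the
# runtime shapes, instead of benchmarking every kernel at runtime.
kernels = {64: "kernel_64", 256: "kernel_256"}  # hint -> compiled kernel
runtime = 96
best = min(kernels, key=lambda h: shape_distance([h, h], [runtime, runtime]))
print(kernels[best])  # kernel_64: log2(96) is closer to log2(64) than log2(256)
```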
Fixes #ISSUE_NUMBER Pull Request resolved: #163614 Approved by: https://github.com/eqy
Landing this instead of #162994. Here is how I think the whole dynamo + frame construction logic works: 1) There is no way to create a frame object in Python land, as it is created at runtime by CPython. That's why aot_compile creates FrameInfo this way (kind of like simulating the runtime). I guess you could write your own very simple eval_frame.c where you intercept frame construction, but we probably don't want that. 2) When there is no wrapper (the old export or aot_compile), we first assign sources by iterating over f_locals, which contains both local args and closure variables (this is an implementation detail of CPython frame construction). That's why closure variables end up getting LocalSource names, as shown in this test case (https://github.com/pytorch/pytorch/blob/f6ea41ead27205a5ece3e9a41b7af30fafe67d7a/test/export/test_export.py#L1369). Note that L["self"] here means we are referring to the local object self. The important thing to keep in mind is that this self is not actually the model's self, but the outer self. 3) When we switch to the wrapper case, we end up trying to inline the original inner module. When doing so, we need to track all locals and closures for this inner module, as can be seen here (https://github.com/pytorch/pytorch/blob/f6ea41ead27205a5ece3e9a41b7af30fafe67d7a/torch/_dynamo/variables/functions.py#L463). Here we are not looking into the inner frame's f_locals but directly at the closures. I guess this is because we are one frame up, so there is no access to the frame's f_locals at this point, and it is probably not a good idea to change dynamo's logic here. As a result, I get the following error message that differs from old export: "While exporting, we found certain side effects happened in the model.forward. Here are the list of potential sources you can double check: ["L['self']._export_root.forward.__func__.__closure__[1].cell_contents.bank", "L['self']._export_root.forward.__func__.__closure__[1].cell_contents.bank_dict", "L['self']._export_root.forward.__func__.__closure__[0].cell_contents"]" My initial attempt at solving this was to take the inner closures and put them into f_locals for the frame I am constructing, which turned out to be too complicated because we would also need to muck around with bytecode instructions. So I am thinking we should just update the test to reflect the new names and follow up with a better post-processing step to produce better names. Differential Revision: [D82582029](https://our.internmc.facebook.com/intern/diff/D82582029) Pull Request resolved: #163107 Approved by: https://github.com/avikchaudhuri
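A small self-contained illustration of point 2 above, showing that a frame's f_locals contains free (closure) variables alongside the actual locals:

```python
import sys

def outer():
    secret = 42  # becomes a free variable of inner's frame

    def inner(arg):
        frame = sys._getframe()
        # f_locals contains the locals 'arg' and 'frame' as well as the
        # closure variable 'secret' pulled in from the enclosing scope.
        print(sorted(frame.f_locals))  # ['arg', 'frame', 'secret']
        return secret + arg

    return inner(1)

outer()
```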
Fixes #162271 Pull Request resolved: #163485 Approved by: https://github.com/Skylion007
Pull Request resolved: #163607 Approved by: https://github.com/Skylion007 ghstack dependencies: #163485
Update links in conf.py to docs.pytorch.org Fixes #ISSUE_NUMBER Pull Request resolved: #163682 Approved by: https://github.com/sekyondaMeta, https://github.com/albanD
…ng with dynamic shapes (#163365) As per the comment in the source code:
```
# If we are coalescing on xblock (not ReductionHint.INNER) and this is not a tiny kernel
# (not ReductionHint.OUTER_TINY), do not use persistent reduction if it induces tile
# quantization. Persistent reduction forces rblock == rnumel; if the bounds between lower
# and upper are large, for the lower values we will be masking off a large % of reads/writes,
# when we could expand the coalescing xblock instead.
```
For the test case in question, this PR improves perf from 0.8573521325143717 -> 0.043151492193814305, because we were egregiously masking out rblock values (58/64 values). Differential Revision: [D82853279](https://our.internmc.facebook.com/intern/diff/D82853279) Pull Request resolved: #163365 Approved by: https://github.com/shunting314, https://github.com/PaulZhang12, https://github.com/jansel, https://github.com/v0i0
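The arithmetic behind the 58/64 figure, as a rough sketch (the exact bounds here are assumptions chosen to reproduce the quoted numbers):

```python
# A persistent reduction forces RBLOCK to rnumel's upper bound. If rnumel
# is dynamic with wide bounds, the low end wastes most of the block.
rblock = 64   # forced by the upper bound of rnumel
rnumel = 6    # a value near the lower bound at runtime
masked = rblock - rnumel
print(f"{masked}/{rblock} lanes masked")  # 58/64 reads/writes masked off
```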
I had been using tiling scores to essentially check whether this is an inner reduction. Since those are not fully rolled out for dynamic shapes, use the reduction hint when they are not available. Pull Request resolved: #163371 Approved by: https://github.com/PaulZhang12
Fixes #ISSUE_NUMBER Pull Request resolved: #163693 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>
…163281) Part of #162270 Pull Request resolved: #163281 Approved by: https://github.com/malfet
Summary: We add parsing for lists of strings. This is needed by AOTInductor profiling for the input information of Triton kernels. Test Plan: Included in commit: test_profiler_op_event_kwargs_list_of_strings Pull Request resolved: #163593 Approved by: https://github.com/sraikund16
Pull Request resolved: #163578 Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/atalman, https://github.com/v0i0
Pull Request resolved: #163653 Approved by: https://github.com/jansel ghstack dependencies: #163648, #163649
Support for amin, amax, and aminmax. Test Plan: E2E tests in the stack with the benchmark suite pass. Differential Revision: D83016894 Pull Request resolved: #163669 Approved by: https://github.com/albanD, https://github.com/malfet
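For reference, the three reduction ops in question at the Python level:

```python
import torch

x = torch.tensor([[1.0, -3.0], [2.0, 5.0]])
torch.amin(x, dim=1)     # tensor([-3., 2.])  per-row minimum values
torch.amax(x, dim=1)     # tensor([1., 5.])   per-row maximum values
torch.aminmax(x, dim=1)  # both reductions in a single pass
```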
…163073) Pull Request resolved: #163073 Approved by: https://github.com/eellison, https://github.com/jansel
Continued work to use std::fs in inductor. Pull Request resolved: #163632 Approved by: https://github.com/Skylion007
…62688) Summary: Restricts subprocess benchmarking to only `TritonTemplateCaller`, which is what the underlying `target` method expects. This triggered a bug with large-K shapes, because decompose-k is a `SubgraphChoiceCaller`. Test Plan: mm autotuning with a large K and `TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1` Differential Revision: D82181924 Pull Request resolved: #162688 Approved by: https://github.com/PaulZhang12, https://github.com/eellison, https://github.com/mlazos
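A hedged repro sketch of the test plan; the shapes are assumptions, but any matmul with K much larger than M and N under max-autotune should make decompose-k a candidate choice:

```python
import os
os.environ["TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC"] = "1"

import torch

# Large K relative to M/N makes decompose-k a candidate, which is a
# SubgraphChoiceCaller rather than a TritonTemplateCaller.
a = torch.randn(64, 32768, device="cuda", dtype=torch.float16)
b = torch.randn(32768, 64, device="cuda", dtype=torch.float16)
f = torch.compile(lambda x, y: x @ y, mode="max-autotune")
out = f(a, b)
```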
Test Plan: existing tests Differential Revision: D82995546 Pull Request resolved: #163550 Approved by: https://github.com/tugsbayasgalan
Summary: When generating Triton kernels in the compile-time autotune blocks, it is useful to generate source information as code comments. Previously we omitted these comments from autotune code blocks because the generated main output code contains the same information, but that doesn't help when the generated autotune code crashes. Pull Request resolved: #163600 Approved by: https://github.com/yushangdi
Fixes #162228 # Summary The majority of our tests only compile flex-attention in isolation. This means that for fake tensor propagation, the input primals and all captured buffers don't do any intermediate computation below autograd. As a result, they happen by chance to match the `requires_grad`-ness of the eager implementation, and this check passes. However, if score_mod is the result of some other intermediate fake tensor prop, it is not guaranteed to have accurate `requires_grad`-ness, which was happening here. TL;DR: this was a belt-and-suspenders check that was actually harmful, and we should just let the joint graph handle creating the correct joint graph. Pull Request resolved: #163677 Approved by: https://github.com/ydwu4
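A minimal sketch of the failure shape: a score_mod whose captured buffer is itself the result of intermediate computation, rather than a leaf tensor (tensor names and shapes here are assumptions):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

bias = torch.randn(8, 128, device="cuda", requires_grad=True)
scale = bias * 2.0  # captured buffer produced by intermediate computation

def score_mod(score, b, h, q_idx, kv_idx):
    # requires_grad of 'scale' must be propagated correctly through
    # fake tensor prop, not inferred from an eager rerun.
    return score + scale[h, q_idx]

q = k = v = torch.randn(1, 8, 128, 64, device="cuda")
out = torch.compile(flex_attention)(q, k, v, score_mod=score_mod)
```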
…63344) There is only one substantive change: the branch on `global_offset[shard_dim] <= local_offset[shard_dim]` is removed because it is unnecessary: you can always treat the first shard uniformly with the rest of the shards, because your global offset is guaranteed to be zero in this case anyway. I also switched the shard_size case to sym_ite, to make it possible for LocalTensor to deal with the MPMD-ness here, but it's equivalent to the old if-then-else. I tried to rewrite the comments to be clearer about what is going on algorithmically. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: #163344 Approved by: https://github.com/albanD, https://github.com/zpcore, https://github.com/tianyu-l
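A sketch of why the first shard needs no special case, assuming the usual ceil-division chunking (this is an illustration, not the exact DTensor code):

```python
def shard_offset_and_size(dim_size: int, world_size: int, rank: int):
    # Every rank uses the same formula; rank 0's offset is 0 by
    # construction, so no separate first-shard branch is needed.
    full_chunk = (dim_size + world_size - 1) // world_size  # ceil division
    offset = min(rank * full_chunk, dim_size)
    size = min(full_chunk, dim_size - offset)
    return offset, size

# e.g. dim_size=10, world_size=4 -> sizes 3,3,3,1 at offsets 0,3,6,9
print([shard_offset_and_size(10, 4, r) for r in range(4)])
```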
Replace the **runtime_error** vanilla C++ exceptions with **TORCH_CHECK** in **torch/nativert/***. Vanilla C++ exceptions should not exist in the core part of PyTorch because of its cross-language nature. Compared with vanilla C++ exceptions, TORCH_CHECK carries richer error context and goes through the unified error-handling mechanism. This commit replaces runtime_error with TORCH_CHECK in the files under torch/nativert/*. Fixes part of #148114 Pull Request resolved: #163308 Approved by: https://github.com/dolpm
Differential Revision: [D83354042](https://our.internmc.facebook.com/intern/diff/D83354042) Pull Request resolved: #163967 Approved by: https://github.com/avikchaudhuri
This adds basic support for subclass inputs in export (specifically for non-strict). I had to make fakify a little more complicated, which risks further divergence from dynamo fakification, but the dynamo one is so complex that I feel it is better to do it this way. Also improved the fake-mode detection logic to recursively look into subclass inner tensors. Differential Revision: [D83156489](https://our.internmc.facebook.com/intern/diff/D83156489) Pull Request resolved: #163770 Approved by: https://github.com/avikchaudhuri
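A sketch of the recursive detection idea, using the standard traceable-wrapper-subclass protocol; the helper name is hypothetical, and the real logic lives in export's fakify path:

```python
from torch._subclasses.fake_tensor import FakeTensor
from torch.utils._python_dispatch import is_traceable_wrapper_subclass

def find_fake_mode(t):
    # Walk into wrapper-subclass inner tensors until a FakeTensor is found.
    if isinstance(t, FakeTensor):
        return t.fake_mode
    if is_traceable_wrapper_subclass(t):
        attrs, _ctx = t.__tensor_flatten__()
        for name in attrs:
            mode = find_fake_mode(getattr(t, name))
            if mode is not None:
                return mode
    return None
```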
The export team is fixing up the old strict-export implementation; as a result, it fails a check where we proxy the whole module under given directories. _WrapperModule is a way for torchao to work around export requiring an nn.Module to trace, so it should never get proxied in the graph. Differential Revision: [D82732613](https://our.internmc.facebook.com/intern/diff/D82732613) Pull Request resolved: #163258 Approved by: https://github.com/anijain2305 ghstack dependencies: #163136, #163137
Differential Revision: [D82732614](https://our.internmc.facebook.com/intern/diff/D82732614) Pull Request resolved: #163259 Approved by: https://github.com/avikchaudhuri ghstack dependencies: #163136, #163137, #163258
The [docs](https://docs.pytorch.org/functorch/stable/) for `functorch` have been migrated into the [PyTorch docs](https://docs.pytorch.org/docs/stable/func.html) since PyTorch 2.0, so I think we can remove them now to reduce compute resource usage. Pull Request resolved: #162581 Approved by: https://github.com/ezyang
Remove two unnecessary `PY_VERSION_HEX` branches. Pull Request resolved: #164055 Approved by: https://github.com/ezyang
…64072) Fixes #164071. Typo correction done. Pull Request resolved: #164072 Approved by: https://github.com/Skylion007
Fixes #163637 Pull Request resolved: #163746 Approved by: https://github.com/malfet
This PR introduces a new "operator microbenchmark" CI workflow and GitHub Actions for operator microbenchmarks, updating test scripts and job matrices to support new parameters, and broadening the operator benchmark tests to include more data types, larger shapes, and gradient tests. The benchmark configurations now focus on different CUDA hardware and multiple dtypes (bf16, fp16, fp32), for both compile and eager mode. **Benchmark Configuration and Coverage:** * Expanded operator benchmark configurations in `addmm_test.py`, `bmm_test.py`, `matmul_test.py`, and `mm_test.py` to benchmark multiple dtypes on CUDA devices, in eager and compile mode, for forward and backward runs. The configs tagged "long" in the above files are run in CI. * The CI benchmarking runs on various hardware: H100 and A100. * The CI job also uploads the microbenchmark outputs to a [HUD](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fpytorch&benchmarkName=PyTorch+operator+microbenchmark) dashboard. Pull Request resolved: #162530 Approved by: https://github.com/huydhn Co-authored-by: Huy Do <huydhn@gmail.com>
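An illustrative sketch of the expanded "long" config style in the operator_benchmark framework; the shape and dtype values are assumptions, not the PR's exact configs:

```python
import operator_benchmark as op_bench
import torch

# Cross-product of shapes, device, and dtype, tagged "long" so CI picks it up.
mm_long_configs = op_bench.cross_product_configs(
    M=[256, 1024, 3000],
    N=[512, 4096],
    K=[512, 4096],
    device=["cuda"],
    dtype=[torch.float16, torch.bfloat16, torch.float32],
    tags=["long"],
)
```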
…164079) This is rare but happens with executorch tests. Pull Request resolved: #164079 Approved by: https://github.com/tugsbayasgalan
Pull Request resolved: #164082 Approved by: https://github.com/tugsbayasgalan ghstack dependencies: #164079
Apply the UP035 `ruff` rule in tests; some tests for `fx` and `dynamo` are excluded because the old typing constructs are themselves the test target. Pull Request resolved: #163947 Approved by: https://github.com/ezyang
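For reference, the kind of rewrite UP035 drives:

```python
# flagged by UP035: typing.List / typing.Dict are deprecated aliases
from typing import Dict, List

def old(xs: List[int]) -> Dict[str, int]:
    return {"n": len(xs)}

# preferred: builtin generics (PEP 585)
def new(xs: list[int]) -> dict[str, int]:
    return {"n": len(xs)}
```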
It's unclear why we had the disable in the first place. With install_free_tensors, we are tracing into this hook. A better approach would be to place the tracer without any hook. For now, disable the checking while dynamo is tracing. Pull Request resolved: #164084 Approved by: https://github.com/tugsbayasgalan
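A hedged sketch of the "skip the check while dynamo is tracing" pattern; the hook and check names here are hypothetical stand-ins, not the PR's actual code:

```python
import torch

def pre_forward_hook(module, args):
    if torch.compiler.is_compiling():
        return  # dynamo traces into this hook; skip the runtime-only check
    runtime_check(module)

def runtime_check(module):
    # hypothetical stand-in for the real validation
    for p in module.parameters():
        assert not p.isnan().any()
```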
Pull Request resolved: #164081 Approved by: https://github.com/tugsbayasgalan ghstack dependencies: #164084
This PR removes skip conditions for ROCm <= 3.5. Pull Request resolved: #164058 Approved by: https://github.com/kwen2501