Conversation

pytorch-bot bot commented Sep 23, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163691

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 123 Pending

As of commit 35715a8 with merge base 0696a4b:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jainapurva and others added 28 commits September 26, 2025 10:42
Some properties with `cache_on_self` were previously annotated with `no_type_check` to work around mypy limitations. This PR replaces both annotations with `cache_property_on_self`, to enable type checking.
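
For context, a minimal sketch of what such a decorator can look like (this is not the actual inductor implementation, only an illustration of the typing trick):

```python
import functools
from collections.abc import Callable
from typing import TypeVar

_T = TypeVar("_T")

def cache_property_on_self(fn: Callable[..., _T]) -> _T:
    # Cache the computed value per instance while letting mypy see the attribute
    # as the property's return type, so no @no_type_check is needed.
    return functools.cached_property(fn)  # type: ignore[return-value]
```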

Pull Request resolved: #163570
Approved by: https://github.com/mlazos, https://github.com/PaulZhang12, https://github.com/Skylion007
…on x block (#162446)

Scale up XBLOCK for contiguous persistent reductions based on rnumel and number of loads + stores

<img width="928" height="656" alt="Screenshot 2025-09-18 at 5 02 57 PM" src="https://github.com/user-attachments/assets/ec3c561f-2a3f-4459-9e14-653715898da3" />

Differential Revision: [](https://our.internmc.facebook.com/intern/diff/)
Pull Request resolved: #162446
Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #162296
Introduces a variant of size-hint multi-kernel where, for novel runtime shapes, instead of performing full benchmarking to determine the optimal kernel, we select one of many kernels pre-generated from the multi-kernel hints, based on similarity between the hint and the runtime input/output shapes (L1 distance in log2 space); a selection sketch follows the caveats below.

Some caveats/changes:
- Size-hint multi-kernel now only kicks in if the kernel has dynamic shapes
- Pre-generation still only does a 1-d search over the specified hints, e.g. `matmul([s0, s1], [s1, s2])` with size-hints `[64, 256]` only generates 2 kernels, tuned for shapes ([64, 64], [64, 64]) and ([256, 256], [256, 256]). Extending this to a reasonable n-d search (perhaps via a user API) is left as a future extension.
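
A minimal sketch of the selection rule (the container layout and kernel names are hypothetical; only the log2-space L1 distance comes from the description above):

```python
import math

def pick_kernel(kernels_by_hint, runtime_shapes):
    """Pick the pre-generated kernel whose tuning shapes are closest to the
    observed runtime shapes under an L1 distance computed in log2 space."""
    def log2_l1(hint_shapes, shapes):
        return sum(
            abs(math.log2(h) - math.log2(s))
            for hs, ss in zip(hint_shapes, shapes)
            for h, s in zip(hs, ss)
        )
    return min(kernels_by_hint.items(), key=lambda kv: log2_l1(kv[0], runtime_shapes))[1]

# e.g. two kernels pre-generated for matmul([s0, s1], [s1, s2]) with hints 64 and 256:
kernels = {((64, 64), (64, 64)): "kernel_64", ((256, 256), (256, 256)): "kernel_256"}
print(pick_kernel(kernels, ((48, 70), (70, 300))))  # -> "kernel_64"
```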

Benchmarking results, compared to multi-kernel with full benchmarking (hints 64, 4096) and to compiling with the ground-truth hint:
<img width="1902" height="1222" alt="550541081_1088709150049684_6528797079439730237_n" src="https://github.com/user-attachments/assets/056cca48-c16a-4451-9b4a-fa13a7a058a9" />

Full benchmarking doing worse is extremely odd, but we did see similar spikes in #156628.

Pull Request resolved: #163090
Approved by: https://github.com/bobrenjc93
Fixes #ISSUE_NUMBER

Pull Request resolved: #163614
Approved by: https://github.com/eqy
Landing this instead of #162994.

Here is how I think the whole dynamo + frame construction logic works:
1) There is no way to create a frame object from Python land, since frames are created at runtime by CPython. That is why aot_compile creates FrameInfo this way (kind of like simulating the runtime). I guess you could write your own very simple eval_frame.c where you intercept frame construction, but we probably don't want that.
2) When there is no wrapper (the old export or aot_compile), we first assign sources by iterating over f_locals, which contains both local args and closure variables (this is an implementation detail of CPython frame construction). That is why closure variables end up getting LocalSource names, as shown in this test case (https://github.com/pytorch/pytorch/blob/f6ea41ead27205a5ece3e9a41b7af30fafe67d7a/test/export/test_export.py#L1369). Note that L["self"] here means we are referring to the local object self. The important thing to keep in mind is that this self is not actually the model's self, but the outer self.
3) When we switch to the wrapper case, we end up trying to inline the original inner module. When doing so, we need to track all locals and closures for this inner module, as can be seen here (https://github.com/pytorch/pytorch/blob/f6ea41ead27205a5ece3e9a41b7af30fafe67d7a/torch/_dynamo/variables/functions.py#L463). Here we are not looking into the inner frame's f_locals but directly at the closures. I guess this is because we are one frame up, so there is no access to the inner frame's f_locals at this point, and it is probably not a good idea to change dynamo's logic here. As a result, I get the following error message, which differs from old export:
"While exporting, we found certain side effects happened in the model.forward. Here are the list of potential sources you can double check: ["L['self']._export_root.forward.__func__.__closure__[1].cell_contents.bank", "L['self']._export_root.forward.__func__.__closure__[1].cell_contents.bank_dict", "L['self']._export_root.forward.__func__.__closure__[0].cell_contents"]"

My initial attempt at solving this was to take the inner closures and put them into f_locals for the frame I am constructing, but that turned out to be too complicated because we would need to muck around with bytecode instructions as well. So I am thinking we should just update the test to reflect the new names and follow up with a better post-processing step to produce nicer names.
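
As a standalone illustration of the f_locals vs. __closure__ distinction (plain CPython, no dynamo involved): inside a running frame, closure variables show up in f_locals alongside real locals, while from one frame up they are only reachable through __closure__ cells.

```python
import sys

def make_forward(bank):
    def forward(x):
        frame = sys._getframe()
        assert "bank" in frame.f_locals   # the closure variable appears like a local
        return x + bank
    return forward

fwd = make_forward(10)
fwd(1)
print(fwd.__closure__[0].cell_contents)   # -> 10, the only handle from outside the frame
```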

Differential Revision: [D82582029](https://our.internmc.facebook.com/intern/diff/D82582029)

Pull Request resolved: #163107
Approved by: https://github.com/avikchaudhuri
Update links in conf.py to docs.pytorch.org

Fixes #ISSUE_NUMBER

Pull Request resolved: #163682
Approved by: https://github.com/sekyondaMeta, https://github.com/albanD
…ng with dynamic shapes (#163365)

As per comment in source code:
```
            # If we are coalescing on xblock (not ReductionHint.INNER) and this is not a tiny kernel
            # (not ReductionHint.OUTER_TINY), do not use persistent reduction if it induces tile
            # quantization. Persistent reduction forces rblock == rnumel; if the bounds between lower
            # and upper are large, for the lower values we will be masking off a large % of reads/writes,
            # when we could expand the coalescing xblock instead.
```

For the test case in question, this PR improves perf from 0.8573521325143717 to 0.043151492193814305, because we were egregiously masking off rblock values (58 of the 64 values).
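
Roughly, the decision amounts to something like the following sketch (the function name and threshold are hypothetical, not inductor's actual code):

```python
def want_persistent_reduction(reduction_hint, rnumel_lower, rnumel_upper, max_spread=2):
    # Coalescing on the reduction dim (INNER) or tiny kernels (OUTER_TINY) keep
    # persistent reductions. Otherwise, persistence pins RBLOCK == rnumel, so a wide
    # dynamic range means small runtime rnumel would mask off most reads/writes;
    # prefer growing the coalesced XBLOCK instead in that case.
    if reduction_hint in ("INNER", "OUTER_TINY"):
        return True
    return rnumel_upper <= max_spread * rnumel_lower
```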

Differential Revision: [D82853279](https://our.internmc.facebook.com/intern/diff/D82853279)
Pull Request resolved: #163365
Approved by: https://github.com/shunting314, https://github.com/PaulZhang12, https://github.com/jansel, https://github.com/v0i0
I had been using tiling scores to essentially check whether this is an inner reduction. Since that is not fully rolled out for dynamic shapes, use the reduction hint when tiling scores are not available.

Pull Request resolved: #163371
Approved by: https://github.com/PaulZhang12
Fixes #ISSUE_NUMBER

Pull Request resolved: #163693
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Summary:
We add parsing for lists of strings. This is needed for AOTInductor
profiling of Triton kernel input information.

Test Plan:
Included in commit.
test_profiler_op_event_kwargs_list_of_strings

Pull Request resolved: #163593
Approved by: https://github.com/sraikund16
Pull Request resolved: #163653
Approved by: https://github.com/jansel
ghstack dependencies: #163648, #163649
Support for amin, amax, and aminmax
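
For reference, the eager semantics of the ops being covered (standard PyTorch API; the backend under test is whatever this stack targets):

```python
import torch

x = torch.tensor([[1.0, -2.0], [3.0, 0.5]])
print(torch.amin(x, dim=1))         # per-row minimum -> tensor([-2.0000, 0.5000])
print(torch.amax(x, dim=1))         # per-row maximum -> tensor([1.0000, 3.0000])
mn, mx = torch.aminmax(x, dim=1)    # both reductions in one call
```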

Test Plan: E2E tests in the stack with the benchmark suite pass.

Differential Revision: D83016894

Pull Request resolved: #163669
Approved by: https://github.com/albanD, https://github.com/malfet
Continued work to use std::filesystem in inductor.
Pull Request resolved: #163632
Approved by: https://github.com/Skylion007
…62688)

Summary: Restricts subprocess benchmarking to `TritonTemplateCaller` only, which is what the underlying `target` method expects. This had triggered a bug with large-K shapes because the decompose-k choice is a `SubgraphChoiceCaller`.

Test Plan:
mm autotuning with a large K and `TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1`
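
A hedged repro sketch of that test plan (shapes and dtype here are illustrative, not the exact test):

```python
import os
os.environ["TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC"] = "1"  # run autotune benchmarking in a subprocess

import torch

a = torch.randn(64, 65536, device="cuda", dtype=torch.bfloat16)   # large K dimension
b = torch.randn(65536, 64, device="cuda", dtype=torch.bfloat16)
mm = torch.compile(lambda x, y: x @ y, mode="max-autotune")
out = mm(a, b)
```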

Differential Revision: D82181924

Pull Request resolved: #162688
Approved by: https://github.com/PaulZhang12, https://github.com/eellison, https://github.com/mlazos
Test Plan: existing tests

Differential Revision: D82995546

Pull Request resolved: #163550
Approved by: https://github.com/tugsbayasgalan
Summary: When generating Triton kernels in the compile-time autotune blocks, it is useful to emit source information as code comments. Previously we skipped these comments for autotune code blocks because the generated main output code contains the same information, but that does not help when the generated autotune code crashes.

Pull Request resolved: #163600
Approved by: https://github.com/yushangdi
Fixes #162228

# Summary

The majority of our tests only compile flex-attention in isolation. This means that for fake tensor propagation, the input primals and all captured buffers do not go through any intermediate computation below autograd. As a result, they happen by chance to match the `requires_grad`-ness of the eager implementation, and this check passes. However, if score_mod is the result of some other intermediate fake tensor prop, then it is not guaranteed to have accurate requires_grad-ness, which is what was happening here.

TL;DR: this was a belt-and-suspenders check that was actually harmful, and we should just let the joint-graph machinery create the correct joint graph.
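
A hedged illustration of the pattern in question, where the buffer captured by score_mod is itself the result of intermediate computation rather than a plain input (shapes and the bias construction are made up):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

q, k, v = (torch.randn(1, 2, 128, 64, device="cuda", requires_grad=True) for _ in range(3))
bias = torch.randn(128, device="cuda").sigmoid()   # intermediate result captured below

def score_mod(score, b, h, q_idx, kv_idx):
    return score + bias[kv_idx]

out = torch.compile(flex_attention)(q, k, v, score_mod=score_mod)
out.sum().backward()
```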

Pull Request resolved: #163677
Approved by: https://github.com/ydwu4
…63344)

There is only one substantive change: the branch on
`global_offset[shard_dim] <= local_offset[shard_dim]`
is removed because it is unnecessary: you can always treat the
first shard uniformly with the rest of the shards, because your
global offset is guaranteed to be zero in this case anyway.

I also switch the shard_size case to sym_ite, to make it possible
for LocalTensor to deal with the MPMD-ness here, but it's equivalent
to the old if-then-else.

I tried to rewrite the comments to make it clearer what is going on
algorithmically here.
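
A minimal sketch of the uniform shard-size computation (the helper and its structure are hypothetical; `torch.sym_ite(cond, a, b)` is the symbolic if-then-else mentioned above):

```python
import torch

def shard_size(dim_size, num_chunks, rank):
    full = (dim_size + num_chunks - 1) // num_chunks   # ceil-divided chunk length
    global_offset = rank * full                        # zero for the first shard anyway
    remaining = dim_size - global_offset
    # symbolic if-then-else: a full chunk unless this is the (possibly empty) tail shard
    return torch.sym_ite(remaining >= full, full, torch.sym_max(remaining, 0))

print([shard_size(10, 4, r) for r in range(4)])        # -> [3, 3, 3, 1]
```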

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: #163344
Approved by: https://github.com/albanD, https://github.com/zpcore, https://github.com/tianyu-l
licy666 and others added 16 commits September 28, 2025 20:37
Replace the **runtime_error** of the vanilla C++ exceptions with **TORCH_CHECK** in **torch/nativert/***

Vanilla C++ exceptions should not exist in the core part of PyTorch because of its cross-language nature. Compared with vanilla C++ exceptions, TORCH_CHECK carries richer error context and goes through the unified error-handling mechanism. This commit replaces runtime_error with TORCH_CHECK in the files under torch/nativert/*.

Fixes part of #148114

Pull Request resolved: #163308
Approved by: https://github.com/dolpm
This adds basic support for subclass inputs in export (specifically for non-strict). I had to make fakify a little more complicated, which risks further divergence from dynamo fakification, but the dynamo one is so complex that I feel it is better to do it this way. Also improved the fake mode detection logic to recursively look into subclass inner tensors.

Differential Revision: [D83156489](https://our.internmc.facebook.com/intern/diff/D83156489)
Pull Request resolved: #163770
Approved by: https://github.com/avikchaudhuri
The export team is fixing up the old strict export implementation; as a result, it fails a check where we proxy the whole module under the given directories. _WrapperModule is a way for torchao to work around export requiring an nn.Module to trace, so it should never get proxied into the graph.

Differential Revision: [D82732613](https://our.internmc.facebook.com/intern/diff/D82732613)

Pull Request resolved: #163258
Approved by: https://github.com/anijain2305
ghstack dependencies: #163136, #163137
The [docs](https://docs.pytorch.org/functorch/stable/) for `functorch` have been migrated into the [PyTorch docs](https://docs.pytorch.org/docs/stable/func.html) since PyTorch 2.0, so I think we can remove them now to reduce compute resource usage.
Pull Request resolved: #162581
Approved by: https://github.com/ezyang
Remove two unnecessary `PY_VERSION_HEX` branches.

Pull Request resolved: #164055
Approved by: https://github.com/ezyang
This PR introduces a new "operator microbenchmark" CI workflow and GitHub Actions for operator microbenchmarks, updating test scripts and job matrices to support new parameters, and broadening the operator benchmark tests to include more data types, larger shapes, and gradient tests. The benchmark configurations now focus on different CUDA hardware and multiple dtypes (bf16, fp16, fp32), for both compile and eager mode.

**Benchmark Configuration and Coverage:**

* Expanded operator benchmark configurations in `addmm_test.py`, `bmm_test.py`, `matmul_test.py`, and `mm_test.py` to benchmark multiple dtypes on CUDA devices, in eager and compile mode, for forward and backward runs. The configs tagged "long" in the above files are run in CI.
* CI benchmarking runs on multiple hardware platforms: H100 and A100.
* The CI job also uploads the microbenchmarking outputs to a [HUD](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fpytorch&benchmarkName=PyTorch+operator+microbenchmark) dashboard.

Pull Request resolved: #162530
Approved by: https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
…164079)

This is rare but happens with executorch tests.

Pull Request resolved: #164079
Approved by: https://github.com/tugsbayasgalan
Apply the UP035 `ruff` rule in tests; some tests for `fx` and `dynamo` are excluded in case the old typing constructs are themselves the test target.
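
Roughly, what UP035 (deprecated-import) flags and the replacement it suggests:

```python
# Flagged by UP035 (deprecated-import):
#   from typing import List, Callable
# Preferred: the builtin generic plus the collections.abc home.
from collections.abc import Callable

def apply_all(fs: list[Callable[[int], int]], x: int) -> list[int]:
    return [f(x) for f in fs]
```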

Pull Request resolved: #163947
Approved by: https://github.com/ezyang
It is unclear why we had the disable in the first place. With install_free_tensors, we are tracing into this hook. A better way would be to place the tracer without any hook. For now, disable the checking while dynamo is tracing.
Pull Request resolved: #164084
Approved by: https://github.com/tugsbayasgalan
This PR removes skip conditions for ROCm <= 3.5.

Pull Request resolved: #164058
Approved by: https://github.com/kwen2501
@jainapurva jainapurva closed this Sep 29, 2025
@jainapurva jainapurva deleted the attention_benchmarking branch September 29, 2025 04:39