
Conversation

desertfire
Contributor

@desertfire commented Jan 22, 2024


pytorch-bot bot commented Jan 22, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/117989

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit fbc07e7 with merge base b2a3d6b:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@desertfire
Contributor Author

@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Summary:

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler

Differential Revision: [D52965076](https://our.internmc.facebook.com/intern/diff/D52965076)

[ghstack-poisoned]
desertfire added a commit that referenced this pull request Jan 22, 2024
Summary:

ghstack-source-id: 3ecf878
Pull Request resolved: #117989
@desertfire
Contributor Author

@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

```python
)
self.writeline(f"auto {node.sym} = {data}.item().{convert_type}();")
if node.is_bool:
    self.writeline(f"bool {node.sym} = {data}.item() ? 1 : 0;")
```
Contributor


What's going on here lol

Contributor Author


I don't really remember why this is here, but the `not config.aot_inductor.abi_compatible` branch will go away anyway.
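For context, a minimal sketch of what the two `writeline` calls above emit. The placeholder values (`s0`, `buf0`, `toFloat`) are hypothetical stand-ins for `node.sym`, `data`, and `convert_type`, not the actual codegen inputs:

```python
# Hypothetical placeholder values; the real ones come from the IR node.
node_sym, data, convert_type = "s0", "buf0", "toFloat"

# Non-bool path: convert the .item() result to the target C++ type.
print(f"auto {node_sym} = {data}.item().{convert_type}();")
# emits: auto s0 = buf0.item().toFloat();

# Bool path (the line being asked about): coerce the .item() result to 0/1.
print(f"bool {node_sym} = {data}.item() ? 1 : 0;")
# emits: bool s0 = buf0.item() ? 1 : 0;
```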


```python
def use_thread_local_cached_output_tensor(idx, output):
    if self.cuda:
        return
```
Contributor


Just for my personal understanding, what is this doing?

Contributor Author


`use_thread_local_cached_output_tensor` is an optimization that AOTInductor uses on CPU to make sure output tensors can reuse CPU memory.
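For readers unfamiliar with the pattern, here is a minimal conceptual sketch in plain Python (not the actual generated C++ helper) of what the thread-local output cache buys: each thread keeps one cached CPU buffer and reuses it across calls instead of allocating a fresh output tensor every time.

```python
import threading

import torch

_tls = threading.local()  # one cache slot per thread

def cached_cpu_output(shape, dtype=torch.float32):
    # Conceptual sketch only: reuse this thread's CPU output buffer when the
    # requested shape/dtype matches; otherwise (re)allocate it for this thread.
    out = getattr(_tls, "out", None)
    if out is None or out.shape != shape or out.dtype != dtype:
        _tls.out = out = torch.empty(shape, dtype=dtype)
    return out  # later calls with the same shape/dtype reuse the same memory
```

The `if self.cuda: return` added in this PR simply opts CUDA runs out of that optimization; the follow-up commit referenced later in this thread (#118291) revisits that decision.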

Contributor

@ezyang left a comment


Looks legit. Is the testing not easy to do?

@desertfire
Contributor Author

@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Contributor

@chenyang78 left a comment


LGTM. Thanks!

@desertfire
Contributor Author

> Looks legit. Is the testing not easy to do?

We probably need the next scalar_to_tensor PR to do proper testing for this one.

@desertfire added the `ciflow/trunk` label (Trigger trunk jobs on your pull request) on Jan 24, 2024
@facebook-github-bot
Contributor

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (`-f`) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use `-f` as a last resort and instead consider `-i`/`--ignore-current` to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Jan 25, 2024
Summary: #117989 disabled use_thread_local_cached_output_tensor for CUDA, but that is not necessarily correct, because we can still have CPU tensors when running CUDA models.

Differential Revision: D53089956

Pull Request resolved: #118291
Approved by: https://github.com/Skylion007, https://github.com/frank-wei, https://github.com/chenyang78, https://github.com/khabinov
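As a rough illustration of the refinement described in that summary (hypothetical helper, not the actual #118291 change): the decision can be made per output tensor rather than per model, since a model compiled for CUDA can still return CPU tensors.

```python
import torch

def should_use_thread_local_cache(output: torch.Tensor) -> bool:
    # Hypothetical sketch: gate the thread-local output-tensor cache on the
    # device of each individual output rather than on whether the model runs
    # on CUDA, since CUDA models can still return CPU tensors.
    return output.device.type == "cpu"
```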
@facebook-github-bot deleted the gh/desertfire/311/head branch January 28, 2024 15:21
clrpackages referenced this pull request in clearlinux-pkgs/pytorch Jun 6, 2024
….3.0

Aarni Koskela (1):
      Increase hub download chunk size (#116536)

Aaron Bockover (6):
      [ONNX] Bump onnx submodule to 1.14.1; ONNX Runtime 1.16 (#106984)
      [ONNX] Add initial support for FP8 ONNX export (#107962)
      [ONNX] bump submodule to onnx==1.14.1 (#108895)
      [ONNX] bump ort-nightly==1.16.0.dev20230908001 (#109212)
      [ONNX] switch from onnxscript-preview to onnxscript (#109139)
      [ONNX] bump onnx submodule to rel-1.15.0 (#110663)

Aaron Enye Shi (12):
      Reland [Profiler] Improve the docstring for export_memory_timeline (#110983)
      [c10] Move profiler clock to libc10 for timestamps (#111972)
      [Profiler] Disable CUPTI Teardown when using CUDA Graphs (#112507)
      [Profiler] Manual Submodule Update for Kineto (#112540)
      [Profiler][Easy] Make timestamps in memory timelines be in microseconds (us) (#112772)
      [Kineto] Initialize libkineto profilers during torch init process during pybind set-up (#112623)
      [Memory Snapshot] Add timestamps to memory events collected in snapshots (#112266)
      [Kineto] Initialize libkineto profilers during torch init process during pybind set-up (#112623)
      [Memory Snapshot] Add CUDAAllocatorConfig details into snapshot metadata (#119404)
      [Memory Snapshot] Clean up elem text (#120245)
      [Memory Snapshot] Add Total memory used after allocation in Trace View (#120339)
      [Memory Snapshot] Stop clearing history when changing context (#120436)

Aaron Gokaslan (83):
      Update ruff to v0.0.286 (#108058)
      [BE]: Update ruff to v0.0.290 (#109435)
      [BE]: Replace undocumented constant in logging (#109434)
      [BE]: enable ruff rules PLR1722 and PLW3301 (#109461)
      Fix invalid arg to getLogger in torch distributed checkpoint (#110008)
      [BE]: Enable some basic pytest style rules (#110362)
      [BE]: Update NCCL submodule to v2.19.3 (#110827)
      [BE]: Enable ruff's flake8-PYI rules (#110830)
      [BE]: Update ruff to 0.1.0 (#111391)
      [BE]: Update lintrunner mypy to 1.6.0 (#111375)
      [BE]: Attach cause to some exceptions and enable RUFF TRY200 (#111496)
      Update RUFF to 0.1.1 (#111618)
      [BE]: remove unnecessary enumerate calls (#111690)
      [BE]: Apply subprocess check to github scripts (#111684)
      Update ruff to v0.1.4 (#112966)
      [BE]: Enable ruff PIE794 and fix bugs it found in test suite (#112989)
      [BE]: Apply FURB145 to make code more readable and idiomatic. (#112990)
      [BE]: Apply RUF015 to torch folder (#113025)
      Update ruff linter to v0.1.5 (#113355)
      [BE]: Remove useless lambdas (#113602)
      [BE]: ruff apply rule PLW1510 to find silent subprocess errors (#113644)
      [BE]: Enable ruff rule PIE800 - unnecessary nested dict expansion (#113880)
      [BE]: ruff - enable PIE804 (#113951)
      [BE][easy]: Update ruff to 0.1.6 (#114125)
      [BE]: ruff FURB136: replace ternary with min/max (preview) (#114382)
      [BE]: Enable Ruff + Flake8 G201,G202 logging format rule. (#114474)
      [BE]: Enable Ruff + Flake8 G201,G202 logging format rule. (#114474)
      [BE][Easy]: Enable NPY lint rules for ruff (#114476)
      [BE][Easy]: Enable flake8-exe rules in ruff too. (#114521)
      [BE][Easy]: add some PLR pylint checks and exclusions to ruff (#114519)
      [BE][Easy]: Apply RUF019: remove duplicate checks for dict access (#114478)
      [BE]: Enable more ruff PLW checks. Disable one PLR that is preview. (#114759)
      [BE]: Enable a PLC0131, PLC0132, PLC0205. Fix PLC0132 bug. (#115015)
      [BE]: Update ruff to v0.1.7 (#115169)
      [BE]: Enable RUF015 codebase wide (#115507)
      [BE]: Enable clang-tidy check for readability-string-compare (#115994)
      [BE]: enable readability-delete-null-pointer clang-tidy check (#116107)
      [Easy][BE]: Enable clang-tidy check for duplicate includes (#116193)
      [BE]: Enable RUFF PERF402 and apply fixes (#115505)
      [BE]: Apply FURB118 (prev): replaces unnecessary lambdas with operator. (#116027)
      [Easy][BE]: remove itertools.accumulate Python 2 shim and apply UFMT (#116192)
      [Easy][BE]: Enable RUF008 and RUF016 checks (#116195)
      [BE][Easy]: Enable clang-tidy check readability-misplaced-array-index (#116210)
      [Easy][BE]: Enable clang-tidy check for duplicate includes (#116193)
      [BE][Easy]: Enable clang-tidy check readability-misplaced-array-index (#116210)
      [BE][Easy]: Update ruff to 0.1.9 (#116290)
      [Easy][BE]: Fix none type comparison (#116399)
      [BE]: Use `iterable.chain.from_iterable` where possible (#116376)
      [BE]: Enable readability-simplify-subscript-expr clang-tidy check (#116356)
      [BE]: Enable readability-redundant-function-ptr-dereference check (#116538)
      [BE]: Use os.fspath and os.PathLike in torch serialization (#116562)
      [BE]: Use exist_ok arg for os.makedirs calls (#116561)
      [BE]: Add better handling of pathlib.Path with os calls (#116564)
      [BE]: Further improve pathlib checks in torch serialization (#116577)
      [BE]: Enable F821 and fix bugs (#116579)
      [BE]: Fix F821 error in torch/fx/experimental (#116587)
      [BE]: Improve typing to respect ruff PYI058 (#116588)
      [BE]: Update flake8 to v6.1.0 and fix lints (#116591)
      [BE]: Update ruff to 0.1.11 (#116704)
      [BE][Easy]: Update libfmt submodule to 10.2.1 (#116864)
      BugFix: Fix F632 bug in dynamo (if statement is always false) (#116867)
      Add bfloat16 CUDA support to RNN (#116927)
      Fix typo in CUDA Macro (#116930)
      Add bfloat16 CUDA support to gamma unary functions (#116929)
      Add bfloat16 CUDA support to smoothl1loss (#116933)
      Add bfloat16 CUDA support to binomial distribution (#116932)
      Add bfloat16 CUDA support to multinomial (#116951)
      Add float16 support to CUDA logaddexp2 (#116948)
      Add bfloat16 + fp16 support to fractional_max_pool for CUDA and CPU (#116950)
      [BE][dynamo]: Add operator is and is not tests to dynamo tests (#116397)
      Add dynamo support for operator.abs (#117442)
      [dynamo][easy]: Add support for `operator.truth` (#117463)
      [BE]: Add type alias typing annotation to prims_common (#117928)
      [BE][Easy]: Update ruff to 0.1.14 (#118466)
      [BE]: Apply RUF025 dict.fromkeys preview rule (#118637)
      [BE]: Add filelock typing to mypy stubs (#119390)
      [BE][Ez]: FURB129: remove unneeded readlines() (#119796)
      [BE][Ez] Update ruff to 0.2.2 (#120517)
      [BE]: Enable ruff LOG checks (#120674)
      [BE][Ez]: Update ruff to 0.3.0 (#121003)
      [BE]: FURB187 Use inplace reverse on lists: faster, more readable. (#121140)
      Add operator length hint support (#121495)
      Update NCCL submodule to v2.20.5 (#121635)

Aaron Meurer (3):
      Sort the output of TORCH_LOGS=help (#114657)
      Add a decomposition for take() (#114813)
      Add a decomposition for isin() (#115390)

Aaron Orenstein (15):
      Replace recursive stable_topological_sort() with iterative. (#116761)
      Add default parameters to rrelu_with_noise() (#117141)
      Ensure that deleter is called even for a no-data tensor. (#117418)
      Protect against modules without __file__ (#117445)
      Fix test_compressed_layout_conversions_coverage to check BSC format (#117951)
      Fix dynamo failure w/ astype (#117952)
      Rewrite group_batch_fusion.find_independent_subset_greedy() to be iterative. (#118324)
      Fix guards for field access through properties (#119719)
      Tweak to pr#119719 - eager & fullgraph (#119921)
      Update find_test_dir() to check for skip files relative to the local path first. (#120521)
      Limit loop unrolling (#120023)
      Fix guard for SUPPORTED_NODES (#120798)
      Fix guard for SUPPORTED_NODES (#120798)
      Prevent infinite recursion within Tensor.__repr__ (#120206)
      more passing dynamo tests (#121378)

Aaron Shi (4):
      [Profiler] Improve the docstring for export_memory_timeline (#110949)
      Back out "[Kineto] Initialize libkineto profilers during torch init process during pybind set-up (#112623)" (#116201)
      [Reference Cycle Detector] Ignore FakeTensor in cycle leak detection (#117116)
      [Memory Snapshot] Track context for SEGMENT_FREE and SEGMENT_UNMAP (#118055)

Adam J. Stewart (3):
      ReduceLROnPlateau: inherit LRScheduler (#108464)
      SummaryWriter.add_figure: add type hints (#110021)
      TransformerEncoder/Decoder: add type hints (#120550)

Adam Louly (1):
      [ONNX] Cast scale back to fp16 after _attention_scale. (#112554)

Adnan Akhundov (28):
      [inductor] Add input generation fn option for autotuning (#108242)
      Skip launching kernels with zero grid in AOT Inductor (#110312)
      Remove runtime assertions between export and AOT compilation (#110710)
      [inductor] Add aoti_torch_dtype_bool to AOTI ABI shim (#110713)
      [inductor] Add AOTI ABI shim function for repeat_interleave.Tensor (#110745)
      [inductor] Add size, stride, storage_offset to RAIIAtenTensorHandle (#110764)
      [inductor] Add AOTI ABI shim function for torch.nonzero (#110766)
      Fix size_hint call sites failing on unbacked SymInts (#110520)
      [indictor] Fix cat decomp when first tensor is empty (#113514)
      [export] Allow shifted constraint ranges in dynamo._export (#114024)
      [inductor] Add ABI shim function for torch.scatter (#114027)
      [inductor] Pass None and skip constexpr in custom Triton kernel calls from C++ (#114475)
      Preserve strides of custom Triton kernel args (#116219)
      Realize non-ReinterpretView Views in custom Triton kernel args (#117468)
      [inductor] Fix CPP wrapper codegen for ExternKernel args (#117931)
      Optimize recursive_add_node in fx splitter (#117969)
      Don't skip register-spilling configs in custom Triton kernel auto-tuning (#119634)
      Check alignment of ReinterpretView args of custom Triton kernels (#119649)
      [inductor] Refactor device guard Python codegen to allow nested indentation (#119673)
      [inductor] Recursivly unwrap_storage_for_input when convert_to_reinterpret_view fails (#119867)
      [inductor] Add torch.cond support to JIT Inductor (#119759)
      [inductor] Apply fx passes recursively to nested subgraphs (#120665)
      [inductor] Do not reuse buffers across  scopes in mem planning (#120777)
      Add equal_to_1 to triton_meta for user-written Triton kernels (#120579)
      Replace TTIR string parsing with structured MLIR walk in Triton kernel  mutation analysis (#120476)
      Remove ids_of_folded_args from test_triton_kernel_equal_to_1_arg (#121192)
      Add torch.cond support to AOT Inductor (#121120)
      Skip AOT Inductor test_cond_* tests on ROCm (#121522)

Adrian Wälchli (4):
      Fix pydocstyle errors in torch/nn/module (#112674)
      Fix pydocstyle errors in fully_sharded_data_parallel.py, api.py, graph_utils.py, distribute.py, iter_graph_module.py, comm_tensor.py, experimental_ops.py, batch_dim_utils.py, data_parallel.py, graph_optimization.py (#113216)
      Implement pass-through `state_dict` and `load_state_dict` for dynamo OptimizedModule (#113423)
      Support `pathlib.Path` as input to `torch.load` when `mmap=True` (#116104)

Aiden Brent (1):
      Fix type hints on nn.attention.sdpa_kernel (#119140)

Aidyn-A (11):
      [CUDA][CUDA Graphs] Fix CUDAGraph::reset function (#108896)
      [UCC][CUDA] Overlap p2p (#111608)
      [TEST] Skip test_schema_correctness for float8 dtype (#115757)
      [TEST] Increase numerical tolerances in test_torchinductor_opinfo:test_comprehensive (#115768)
      [ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size (#120925)
      [ATen][Native][CUDA] Decrease max_threads in ctc_loss (#120746)
      [ATen][CUDA][CUBLAS] cublasLtMatmul increase workspace_size (#120925)
      [C10d][UCC] Retain CUDA context in progress_loop (#121446)
      [CUDA graphs] Pool argument for make_graphed_callables (#121475)
      [C10d][NCCL] Refactor complex all_reduce and broadcast (#121045)
      [TEST] Prepare test_cumulative_trapezoid for SciPy 1.12 (#121541)

Akihiro Nitta (1):
      Update `torch.compiler_troubleshooting.rst` (#114530)

Albert Zeyer (1):
      DistributedDataParallel._post_forward, fix return (#114678)

Aleksandar Samardžić (5):
      Minor fixes in semi-structured sparse code (#105595)
      Fix typo in mixed dtypes linear operator implementation. (#111127)
      Update F32 sparse semi-structured support for CUTLASS back-end (#116017)
      Add out_dtype support for sparse semi-structured CUTLASS back-end (#116519)
      Add CUTLASS kernel as choice for _int_mm() Inductor autotuning (#119685)

Aleksei Nikiforov (16):
      Fix fallback FBGEMM implementation for Big Endian systems. (#96422)
      When byteorder record is missing load as little endian by default (#108343)
      s390x SIMD: update abs() function for complex numbers (#108515)
      s390x onnx: byteswap data when serializing it (#107963)
      Don't use cpuinfo on s390x (#109496)
      Don't link to libcpuinfo on s390x (#109875)
      S390x inductor support (#111367)
      Additional lint fixes (#111793)
      s390x vectorization: implement atanh for complex vectorized data (#111653)
      Skip test_fork_wait_4 and test_fork_wait_4_async (#112743)
      s390x: fix inductor constructing floats out of bytes (#112723)
      s390x: skip tests relying on specific openblas precision (#112843)
      S390x complex division (#108516)
      test_lazy: skip HashTest.Scalar (#112747)
      s390x: fix build (#114508)
      Add copy of scripts for setting up s390x workers (#120417)

Alexander Grund (13):
      Avoid undefined behavior in JIT-generated conversion code (#110212)
      Don't set CUDA_HOME when not compiled with CUDA support (#106310)
      Fix failing test_mkldnn_pattern_matcher if built without MKL (#113949)
      Fix failing test_invalid_input_csr_large (#114940)
      VSX: Fix vectorized abs function for complex tensors (#116859)
      cmake: Include `CheckCXXCompilerFlag` where it is used (#113028)
      VSX: Fix overflow in complex division (#116972)
      Fix failure of test_dynamo_distributed & test_inductor_collectives (#117741)
      Convert `requires_cuda` to full decorator (#118281)
      Don't check is_conj for `_refs.linalg.svd` (#117972)
      c10d: Don't add NCCL backend by default without CUDA (#119149)
      Skip test_wrap_bad if run under pytest (#115070)
      Check for releasing GIL at compiletime (#116695)

Alexander Kurakin (2):
      ReduceLROnPlateau init _last_lr (#119366) (#119556)
      Fix optim.lr_scheduler examples in doc to use optimizer vs self.opt (#119563)

Alexander Mols (1):
      caffe2: remove support for specifically running "flaky tests" (#112007)

Alexander Yermolovich (1):
      [llvm][oncall] Fix build for llvm-18+ (#115652)

Alin Pahontu (1):
      added path to correct directory containing headers (#110063)

AllenTiTaiWang (4):
      [ONNX] Move large scale models without non-persistent buffers to runtime test (#108084)
      [ONNX] Support None in fx.args as torchlib inputs (#108708)
      [ONNX] Refactor MaxPool to support dynamic inputs (#113318)
      [ONNX] Support more sympy operations in fx-onnx exporter (#112758)

Alperen ÜNLÜ (1):
      Fix docstrings on torch/nn/modules (#113260)

Amadeusz Skrzypczak (1):
      Fix type promotion of float8_e5m2 and float8_e4m3fn (#110279)

Ana Basalo (1):
      Chore: improve log message about cache size limit exceeded (#116557)

Andre Eid (1):
      Add quantized gelu (#119935)

Andrea D'Eusanio (1):
      Fixing searchsorted doc (#109364)

Andrei (1):
      Add code example for torch.stack() (#120304)

Andrei Gheorghe (6):
      Use global variables to register the return_types namedtuples (#107000)
      Fix finding Intel MKL on Windows, as well as LAPACK, cuDNN and cuSPARSELt (#108040)
      Use global variables to register the return_types namedtuples (#108832)
      Improved DDP checkpoint documentation (#106985)
      Fix aminmax on CUDA when input shape contains 0 (#107564)
      Use fmt::format in NCCLUtils and ProcessGroupNCCL instead of c10::str (#107268)

Andres Lugo-Reyes (6):
      [ROCM] Enable test_fn_fwgrad_..._functional_binary_cross_entropy on ROCM (#109038)
      [ROCm] Enable Lerp tests for complex32 (#108100)
      [ROCM] Enable bwd cross_entropy on ROCM now that eps tolerance update (#109384)
      [ROCm] Unskip functorch tests that now work (#110760)
      [ROCm] Unskip functorch tests that now work (#110760)
      [ROCm] Autocast RNN Support (#121539)

Andrew Calvano (5):
      Fix for out of bounds read in torch mobile flatbuffer loader (#108439)
      Fix for PyTorch mobile flatbuffer loader out of bounds reads (#110162)
      Fix for out of bounds read in mobile interpreter FORMAT opcode handler (#110303)
      Fix for out of bounds read in mobile interpreter INTERFACE_CALL opcode handler (#110301)
      Fix for out of bounds registers_ access in mobile TorchScript interpreter (#110300)

Andrew Gallagher (2):
      [aarch64][caffe2/torch/csrc/profiler] Support aarch64 in inline assembly (#104707)
      [caffe2] Add non-x86 stub definition for `libraryFor` too (#114023)

Andrew Gu (63):
      [FSDP] Only check exec order if DETAIL (#109049)
      [Easy] Fixed typo in `init_device_mesh` note (#111658)
      [PT-D] Updated Dynamo skip message for `@contract` tests (#112793)
      [PT-D] Made `_get_registry` return `None` if no APIs applied (#113654)
      [DTensor] Added `op_call` in no-mesh dispatch assert message (#113903)
      [DTensor] Renamed `shard_spec` -> `placements` in test file (#113917)
      [DTensor] Made `_Partial`, `Replicate` frozen dataclasses (#113919)
      [DTensor] Used new placements for neg dim in `redistribute` (#113924)
      [DTensor] Used new placements for neg dim in `from_local` (#114134)
      [DTensor] Ensured `grad_placements` was tuple (#113925)
      [DTensor] Used new placements for neg dim in `distribute_tensor` (#113930)
      [DTensor] Replaced neg dim normalization with assert in helper (#114141)
      [DTensor] Cached hash for `DTensorSpec` (#113915)
      [DTensor] Reduced to one `isinstance` call in `is_shard` (#114140)
      [DTensor] Computed `DTensorSpec` hash lazily (#114322)
      [FSDP] Added DDP parity test for CPU training (#114372)
      [FSDP] Passed `TORCH_NCCL_DESYNC_DEBUG` instead of `NCCL_DESYNC_DEBUG` (#114432)
      [DTensor] Made `DTensorSpec` hash recomputation lazy (#114379)
      [DTensor] Passed `dynamic=False` for compile tests (#114390)
      [FSDP] Simplified FSDP wrapping in ignored module test (#114611)
      [FSDP] Added test for `ignored_states` + auto wrap (#114612)
      [FSDP] Cloned unsharded tensor slice in optim state dict load (#117261)
      [ez][docs] Fixed render of `tensors` in `backward` (#117994)
      [DTensor] Relaxed `to_local` `requires_grad` warning (#118186)
      Added `"any"` mode to `register_multi_grad_hook` (#117984)
      [FSDP2] Introduced initial `fully_shard` frontend (#117776)
      [DeviceMesh] Removed print of `self._dim_group_infos` (#118527)
      [FSDP2][Reland] Introduced initial `fully_shard` frontend (#118525)
      [FSDP2] Added `mesh` arg, `FSDPState`, move to device (#117814)
      [FSDP2] Added initial `FSDPParamGroup`, `FSDPParam`, `ParamModuleInfo` (#117867)
      [FSDP2] Sharded parameter in `FSDPParam` (#117877)
      [FSDP2] Added initial `_lazy_init` and FQNs for debugging (#117881)
      [FSDP2] Added all-gather and unsharded parameter (#117950)
      [FSDP2] Added `_to_kwargs` root forward input cast (#117955)
      [FSDP2] Added forward unshard/wait for unshard/reshard (#117973)
      [FSDP2] Added reduce-scatter (#117975)
      [FSDP2] Added pre/post-backward (#118004)
      [FSDP] Fixed `device_mesh` and auto wrap (#119064)
      [FSDP2] Added `reshard_after_forward` (#118017)
      [FSDP2] Added backward prefetching (#118118)
      [FSDP2] Used `split_with_sizes_copy` for all-gather copy-out (#119451)
      [FSDP2] Replaced version-ctx with `no_grad`; removed `no_grad` (#119550)
      [FSDP] Added deprecation msg for `NO_SHARD` (#119553)
      [FSDP2] Added autograd/memory/overlap/frozen/2D/AC tests (#118136)
      [FSDP2] Added mixed precision (#118223)
      [BE] Enabled mypy in `common_fsdp.py` (#118755)
      [FSDP2][ez] Replaced `groupby` with `all` for same-dtype check (#119825)
      [FSDP2] Added gradient accumulation w/o reduction (#118298)
      [FSDP2][ez] Made typing more strict to avoid `cast` (#119985)
      [FSDP2] Used stream APIs for CUDA event handling (#120231)
      [FSDP2] Removed `super().__setattr__` call (#120340)
      [FSDP] Removed `.detach` in `clip_grad_norm_` (#120612)
      [FSDP2][ez] Combined communication test files (#120904)
      [FSDP] Added warning about unsupported double backwards (#120926)
      [DTensor] Supported `foreach=False` for `clip_grad_norm_` (#120238)
      [DTensor] Supported `foreach=True` for `clip_grad_norm_` (#120910)
      [FSDP2] Used `ReduceOp.AVG` if fp32 reduce-scatter (#120919)
      [FSDP2] Added initial meta-device init support (#120351)
      [DTensor] Initialized RNG tracker if needed (#121328)
      [FSDP2] Relaxed check for parent mesh (#121360)
      [FSDP2] Zeroed padded tensor in `_apply` (#121509)
      [DCP] Replaced `storage()` with `untyped_storage()` (#121538)
      [FSDP2][BE] Refactored `check_1d_sharded_parity` to use mesh (#121357)

Andrew Hoblitzell (1):
      docstyle _correct_bias.py _equalize.py _learnable_fake_quantize.py backend_config experimental fake_quantize.py fuse_modules.py fuser_method_mappings.py (#112992)

Andrew Hu (1):
      Add "device not supported" assert to inductor (#112001)

Andrew M. James (11):
      Memory leak from bsr_scatter_mm_indices_data argument cache (#112301)
      [SparseCompressed] Support `add(sparse_compressed, dense)` (#115432)
      [SparseCompressed] support csc layout for add sparse/dense. (#115433)
      [inductor] Handle special values correctly in ir.Scan codegen (#118788)
      [inductor] Implementing missing magic methods on IR values. (#118933)
      Add lowering for logcumsumexp (#118753)
      [testing][inductor] Allow grad tolerance override (#119844)
      Add decomp for linalg.cross (#119809)
      Add lowering for logcumsumexp (#118753)
      Add lowering for adaptive_max_pool2d (#120254)
      Add lowering for fraction_max_pool2d (#120460)

Andrew Or (2):
      [quant][pt2] Fix and rename `move_model_to_eval` (#108891)
      Back out "Enable pickling model prepared with QAT qconfig" (#110392)

Andrey Talman (12):
      Updates to patch version release plans (#110952)
      Add Python 3.12 as experimental to release 2.2 (#119705)
      [RELEASE ONLY CHANGES] Apply release only changes Release 2.3 (#121726)
      [Release Only] Build triton using pinned version rather branch (#121765)
      [RELEASE ONLY CHANGES] Apply release only changes Release 2.3 (#121813)
      [RELEASE ONLY CHANGES] Increase timeout for linux binary jobs, fix workflow lint (#121851)
      Triton wheel build using 2.3.x branch (#122403)
      Use temporary name for triton package, fix lint (#122438)
      Revert "CI: Specify libc and libstdcxx versions in conda environments" (#122497)
      Revert "Revert "CI: Specify libc and libstdcxx versions in conda environments"" (#122523)
      [Wheel] Change libtorch_cpu OpenMP search path (#123417) (#123442)
      [Release only] Release 2.3 start using triton package from pypi (#123580)

Andrzej Kotlowski (1):
      Add Bfloat16 scalar support to gloo backend (#113557)

Angel Yang (2):
      Fix S367052 to unblock ICVR MC3 (#109853)
      Fix S367052 to unblock ICVR MC3 (#109937)

Angela Yi (51):
      [export] Fix autogenerated stacktrace (#108217)
      [pytree] Allow register_pytree_node to take in 5 inputs (#108256)
      [export] Change _generate_new_graph_signature (#108571)
      [export] Lift constant tensors as buffes (reland) (#109040)
      [export] Separate out exported_program.py (#109147)
      [export] Update deserialized FakeTensorMode/ShapeEnv with same configs as export (#109522)
      [exir] Add lift constant tensors passes after aten_to_edge (#109382)
      [aotinductor] Update performance benchmark code (109560) (#109820)
      [export] Verifier for exported program (#109519)
      [export] Add dynamic_shapes to _export.aot_compile (#110101)
      [export] Add run_decomposition() function to ExportedProgram (#110236)
      [aotinductor] Use dynamic_shape instead of constraints (#110360)
      [export] Add ir spec (#110394)
      [export] Get export APIs ready for PTC (#110410)
      [export] Get export APIs ready for PTC (reland) (#111030)
      [export] Fix issue with internal model (#111140)
      [export][retry] Move lifted tensors out of state_dict (#113689)
      [aps] Sync thrift (#113810)
      [reland][aotinductor] Add example_value metadata to nodes (#113986)
      [export] Update schema (#114172)
      [export] Move serialized custom class objs to toplevel (#114371)
      [fx] Update symbolic_trace nn_module_stack (#114422)
      [dynamo][reland] `ExecutorchCallDelegateHigherOrderVariable` - add sanity check that input and output tensors are disjoint (#114167)
      [export] Fix state dict device serialization (#114695)
      [export] Remove convert_to_cpu flag (#114775)
      [export] Remove combine_args_kwargs (#114782)
      [export][reland] Remove runtime assertion pass (#115597)
      [export] Fix test to run internally (#116118)
      [export][refactor][4/n] Make equality_constraints optional (#116233)
      [export][refactor][6/n] Remove equality_constraints (#116979)
      [export][ez] Fix getting meta["val"] (#117313)
      [export] Add lifted constant obj to input (#116985)
      [exportdb] Remove torch/fb/exportdb (#117866)
      [export] Allow constant outputs + None input/outputs (#117894)
      [export] Add node meta into UnflattenedModule (#118138)
      [export] Various fixes to .module() (#118272)
      [export] Convert all export tests to .module() (#118425)
      [export] Fix graph signature for primitive outputs (#118655)
      [export] Only deepcopy graph in unlift (#118821)
      [reland][export] Fix graph signature for primitive outputs (#118818)
      [export] Move _create_graph_module_for_export to torch/export (#118893)
      [export] Prevent specialization on backends (#118683)
      [export] Remove CallSpec (#117671)
      [export] Convert internal tests to using .module() (#119105)
      [export] Remove torch._export.export (#119095)
      [export] Don't error if nn_module_stack doesn't contain a class (#119753)
      Add pixel_shuffle to core aten decomps (#119899)
      [export] Disable exported_program.__call__ (#119466)
      Add pixel_shuffle to core aten decomps (#120092)
      [dynamo] Reorder logs (#116106)
      [export][reland] Disable exported_program.__call__ (#120019)

Aniket Patil (1):
      Fixed typo in activation.py (#111358)

Animesh Jain (77):
      [reland][Dynamo] cache_size policy #107496 (#108069)
      [logging] Add more flags to default logs (#107912)
      [dynamo] Reduce cache size limit to 8 (#108526)
      [dynamo][activation checkpointing] Trace through ActivationWrapper (#108599)
      [dynamo][finishing colesbury's PR 100642] Guard on nn.Module dicts and type (#108528)
      reland [finishing colesbury's PR 100642] Guard on nn.Module dicts and type (#108883)
      Symintify repeat_interleave (#109133)
      [dynamo] Unblock a model with jit.isinstance (#109178)
      Reland [dynamo][activation checkpointing] Trace through ActivationWrapper (#109327)
      Reland 3rd try [finishing colesbury's PR 100642] Guard on nn.Module dicts and type (#109323)
      [dynamo] remove DummyGlobalSource (#109411)
      [dynamo] Graph break on rng get/set state - remove GeneratorStateSource (#109410)
      [dynamo][guards-log] Do not print duplicate guard entries (#110023)
      [dynamo][nn_module_guards] Config flag to disable nn_module_guards (#110039)
      [dynamo][guards-log] Print nn module guard saved dict versions for debugging (#110028)
      [dynamo][higher order op] Fix minor bug in error msgs (#110099)
      [dynamo][guards-log] Add debug msg for nn_module_guards only when log is enabled (#110167)
      [dynamo] Dont put nn module guards on torch inbuilt nn modules (#110230)
      [dynamo] Remove SuperSource (#110475)
      [dynamo][easy] Move code from GetAttrVariable to a suitable place (#110535)
      [dynamo] GetItemSource - restrict the supported index Source to be GlobalWeakRefSource (#117138)
      [inductor] Iterative percolate tags (#117306)
      [dynamo] GetItemSource - restrict the supported index Source to be GlobalWeakRefSource (#117138)
      [dynamo] LazyVariable - redirect __str__ to the realized variable __str__ (#117583)
      [dynamo] Extend LazyVariableTracker to tuples (#117426)
      [ac][pattern matcher] Do not percolate tags beyond the inputs of matched portion (#118034)
      [dynamo][assume_constant_result] Dont put symbolic guards for assume_constant_result (#118430)
      [dynamo][higher order ops] Remove restore side effects logic (#118420)
      [dynamo] Setup the globals for guard_fn without a reference to f_locals (#118447)
      [dynamo-must-fix] Use ID_MATCH for UserDefinedClass (#119853)
      [inductor][scheduler] Use set for origin (#119861)
      [dynamo][guards] Use EQUALS_MATCH for NAME_MATCH (#120132)
      [dynamo][refactor] Use TYPE_MATCH instead of manually constructing guard (#120140)
      [dynamo][refactor] Use existing helper functions for CLOSURE_MATCH (#120145)
      [dynamo] Use EQUALS_MATCH guard for mod.training (#120147)
      [dynamo][guards-c++-refactor] Introduce LeafGuard, GuardManager and GuardAccessor classes (#119822)
      [dynamo][guards-c++-refactor] EQUALS_MATCH guard (#119827)
      [dynamo][guards-cpp-refactor] GetAttrGuardAccessor (#119833)
      [dynamo][guards-cpp-refactor] DEFAULT_DEVICE guard (#120060)
      [dynamo][guards-cpp-refactor] GLOBAL_STATE guard (#120061)
      [dynamo][guards-cpp-refactor] DATA_PTR_MATCH guard (#120062)
      [dynamo][guards-cpp-refactor] TENSOR_ALIASING guard (#120064)
      [dynamo][guards-cpp-refactor] NO_TENSOR_ALIASING guard (#120065)
      [dynamo][guards-cpp-refactor] GetItemGuardAccessor (#120067)
      [dynamo][guards-cpp-refactor] GlobalsGuardAccessor (#120068)
      [dynamo][guards-cpp-refactor] TypeGuardAccessor (#120089)
      [dynamo][guards-cpp-refactor] TupleIteratorGetItemAccessor (#120091)
      [dynamo][guards-cpp-refactor] TUPLE_ITERATOR_LEN guard (#120119)
      [dynamo][guards-cpp-refactor] LENGTH_CHECK guard (#120123)
      [dynamo][guards-cpp-refactor] GlobalWeakRefGuardAccessor (#120093)
      [dynamo][guards-cpp-refactor] DYNAMIC_INDICES guard (#120096)
      [dynamo][guards-cpp-refactor] TENSOR_MATCH guard (#120342)
      [dynamo][refactor] Move some helper functions to global scope (#120426)
      [dynamo][guards-cpp-refactor] WEAKREF_ALIVE guard (#120344)
      [dynamo][guards-cpp-refactor] DictGuardManager (#120359)
      [dynamo][guards-cpp-refactor] DICT_VERSION guard (#120416)
      [dynamo] Reland 120147 - - Use EQUALS_MATCH guard for mod.training (#120578)
      [dynamo][guards-cpp-refactor] NO_HASATTR guard (#120469)
      [dynamo][compile-time] Collect guard debug stack info only with logs enabled (#120520)
      [dynamo][guards-cpp-refactor] DictGetItemGuardAccessor for f_locals (#120593)
      [dynamo] Desugar accumulate_grad, fix .grad handling (#120590)
      [dynamo][refactor] Rename LIST_LENGTH to SEQUENCE_LENGTH, separate DICT_LENGTH (#120721)
      [dynamo][refactor] Use originating_source for HASATTR (#120723)
      [dynamo] Fix source for default dict default_factory (#120864)
      [dynamo][guards-cpp-refactor] PythonLambdaGuardAccessor (#120730)
      [dynamo][easy] Dynamo test changes (#120927)
      [dynamo][guards-cpp-refactor] DICT_CONTAINS guard (#120673)
      [dynamo][guards-cpp-refactor] Skip type and length check guard for DictGuardManager (#120739)
      [dynamo][comp-time] BuiltinVariableTracker - inspect signature only on failure (#121053)
      [dynamo][compile-time] Remove unnecessary tree_map_only (#121052)
      [dynamo][guards-cpp-refactor] Add argnames in pybind'ings (#121121)
      [dynamo][guards-cpp-refactor] Simplify DictGuardManager by removing KeyValueDictGuardManager (#121147)
      [dynamo][guards-cpp-refactor] Pass source name for debug ease (#121154)
      [dynamo][guards-cpp-refactor] Prevent duplication of leaf guards (#121164)
      [dynamo][guards] Use lazy variable tracker for func defaults (#121388)
      [dynamo][guards-cpp-refactor] Permit dict version guard in DictGuardManager (#121327)
      [dynamo][guards-cpp-refactor] Func defaults and kwdefaults accessor (#121338)

Anthony Alayo (3):
      cmake: allow to build pytorch as a CMake subproject (#110373)
      Adding c10 device type to newly added DeviceAccelerator (#119961)
      Updating sleef submodule to eb3d97785 to fix export errors (#119953)

Anthony Shoumikhin (1):
      [executorch] Update iOS toolchain with a modern cmake syntax. (#115799)

Antoni Viros (6):
      Add embedding op to jagged NT (#112288)
      Implement narrow from a regular tensor to jagged tensor (#112770)
      Expose Flash attn to autograd (#114378)
      Add an SDPA dispatcher for nested tensors with jagged layouts (#114164)
      Add an SDPA dispatcher for nested tensors with jagged layouts (#114164)
      Fix for Wait kernel lowering in inductor not accepting MultiOutputs from non-collective calls (#121428)

Antoni Viros i Martin (1):
      Add requirement for input to AllGatherIntoTensor to be contiguous (#109561)

Antonio Kim (4):
      Add support for `torch.Generator` type in TorchScript (#110413)
      Add support for `torch.Generator` type in TorchScript (#110413)
      Add support for `torch.Generator` type in TorchScript (#110413)
      Add `reset_storage` method to FunctionalTensorWrapper (#115235)

Anupam Bhatnagar (2):
      HTA docs (#115060)
      Removing HTA documentation (#116513)

Arseny Kapoulkine (2):
      Use SEQUENTIAL posix_fadvise on mmapped files (#117805)
      Use SEQUENTIAL posix_fadvise on mmapped files (#117805)

Arun Ranganathan (1):
      Fix user input mutations for run_decompositions (#116382)

Aryan Gupta (3):
      Doc: Add and Fix docstrings for torch.util.data files  (#112817)
      fix: Flake8-BugBear code B-026 for PyTorch (#111362)
      Doc: Add and fix docstrings for torch.distributed files (#112735)

Ashvanth.S (1):
      Fix  docstring errors in default_hooks.py, optimizer_overlap.py, checkpoint_wrapper.py, copy.py, benchmark_ddp_rpc.py, utils.py, dependency.py, phony.py, pipeline.py, checkpoint.py, worker.py, batchnorm.py, quantization.py (#113511)

Atul Jangra (1):
      [torchx] Do not terminate parent process if exit code from child isn't valid (#111961)

Avik Chaudhuri (28):
      print equalities (#108427)
      enforce equalities (#108429)
      untracked inputs in constraints (#109037)
      New export API with dynamic shape specifications instead of constraints (#108448)
      deprecate constraints in favor of dynamic_shapes (#110143)
      dynamic_shapes + retrace exported program (#110276)
      constant output errors (#110472)
      export db links for user errors (#110555)
      remove replaced symbols from range_constraints (#110644)
      different bounds for same Dim name (#110638)
      [user errors] compulsory case names, allow multiple (#110733)
      [user errors] compulsory case names, allow multiple (#110878)
      [docs] export full aten opset (#111161)
      direct runtime assertions (#111262)
      non-strict export with dynamic shapes (#115862)
      non-strict export with dynamic shapes (#115862)
      [export][reland] non-strict export with dynamic shapes (#116048)
      ignore ill-formed solution of reduce_inequalities (#117310)
      non-strict improvements: constant args and kwargs (#119529)
      derived dim (#118729)
      [export] kill deprecated constraints API (#120860)
      fix dupe deprecated warning in dynamo export (#120896)
      remove constraints from aot_compile (#120979)
      remove constraints from capture_pre_autograd_graph (#120981)
      [non-strict export] support tensor attribute without other args (#121176)
      suggested fixes for congruences (#121418)
      fix accidental specialization with faketensor input checks (#121460)
      relax assertion on fake shape (#121599)

Axel Donath (1):
      Clarify maximize option in optimizer.py (#112724)

Ayham Tannous (1):
      Add file name and size to the serialization metadata logging (#113077)

BJ Hargrave (4):
      docs: Fix some docstring errors in torch.nn.utils parametrize/spectral_norm/stateless (#112786)
      docs: Add docstring for torch.masked._ops.logaddexp (#113206)
      docs: Fix docstring lint errors in torch/distributed/fsdp/_flat_param.py & torch/distributed/fsdp/_init_utils.py (#113358)
      fsdp: Unit test for ModuleWrapPolicy as a Callable (#117395)

Banit Agrawal (7):
      [CUDACaching Allocator] Release the allocator lock on the slow path (#108367)
      [PyTorch] Add the lazy init call for p2p access function (#1991) (#108589)
      [PyTorch CCA] Refactor caching allocator config code (#110123)
      [CUDA Host Allocator] Add support of CudaHostRegister (#108488)
      [PyTorch Pinned Allocator] Create per thread task pool for mapping memory space (#111545)
      [PyTorch] Mark USDT probes as noinline to avoid duplications in ThinLTO mode (#117381)
      [PyTorch] Back scalar value to pinned memory for .item() (#119202)

Behrang Javaherian (2):
      [raas][torch][jit] Allow not storing the optimized graph (#115381)
      [torch] Reduce the memory usage by adding flags to clearing intermediate graphs used for optimization during the ineference. (#115657)

Behzad Abghari (1):
      Avoid adding to lazy device cache if cache size is 0 (#113710)

Bert Maher (19):
      [AOTInductor] Add is_cpu for AOTInductorModelContainer (#109287)
      Added a flag is_cpu to the AOTInductor runtime (#109300)
      [inductor] Decompose torch.ops.quantized.embedding_bag_byte_unpack (#109398)
      [aot inductor] Make unit tests work on CPU (#109625)
      [inductor] Decompose addmm if it's a dot product on cpu (#110010)
      [aot_inductor] Lightweight model runner (#110158)
      [inductor] Add fbcode include path for cuda (#110240)
      [inductor][easy] Free functions in headers should be declared inline (#110445)
      [aoti] Remove pessimizing move (#110446)
      [inductor] Lower small gemvs on CPU (#110456)
      [inductor] get_system shouldn't error if CUDA is not installed (#110282)
      [inductor] Allow backend compiler to skip (#111153)
      [easy] Reapply D49842542 (remove pessimizing move) (#111910)
      Reland "[aot inductor] Move constant loading logic from Container to Model" (#112197)
      Remove cpp/tensorexpr benchmarks (#116868)
      [inductor] Use torch.cuda.clock_rate instead of triton.testing.nvsmi (#118662)
      [inductor] When generating debug logs don't fail if nvcc not found (#120346)
      [inductor] Colorization improvements for bandwidth profiler (#120343)
      [inductor] Log triton kernel source and metadata on failure (#120494)

Bin Bao (128):
      [inductor] Move test_inductor_sequence_nr out of test_aot_inductor (#108237)
      Revert "[AOTInductor] Include constants in AOTInductor .so file. (#10… (#108349)
      [CI] Enable max-autotune for Sunday dashboard run (#108386)
      [inductor] Add an aot_inductor class in inductor config (#108369)
      [inductor] Update how AOTInductor resizes output tensors (#108412)
      [inductor] Use empty_strided to create output tensors when testing AOTInductor (#108364)
      [inductor] Move AOTInductor runtime headers (#108564)
      [inductor] Refactor wrapper.py (#108653)
      [CI] Update the pinned timm version (#108076)
      [inductor] Switch to use the runtime interface for AOTInductor testing (#108663)
      [reland][inductor] Switch to use the runtime interface for AOTInductor testing (#108878)
      [inductor] Add a C shim layer for libtorch (#109391)
      [inductor] Forward fix a windows test error (#109449)
      [inductor] Fix CudaStreamGuard in AOTInductor ABI compatible mode (#109471)
      [inductor] Clean up AOTInductor runtime ABI (#109678)
      [inductor] Change AOTInductor to return output tensors (#109790)
      [inductor] Refactor some libtorch c shim interfaces (#109834)
      [inductor] Add back a missing header include (#109845)
      [aotinductor] Rename aot_runtime to aoti_runtime (#110007)
      [aotinductor] Relax the CUDAGuard device index check (#110030)
      [inductor] Enhance an input type assertion msg (#110176)
      [inductor] Add CI jobs to test AOTInductor (#108419)
      [aotinductor] Refactor test_aot_inductor (#110215)
      [aotinductor] Refactor test_aot_inductor to take different devices (#110216)
      [aotinductor] Fix a missing schema issue for repeat_interleave (#110105)
      [aotinductor] Remove output_spec from AOTInductorModelCache (#110462)
      [aotindutor] Refactor optional value codegen (#110233)
      [aotinductor] Clean up fallback kernel cpp name generation (#110267)
      [aotinductor] Enable test_non_default_cuda_device on CI (#110509)
      [aotinductor] Avoid generating redundant kernel loading code (#110510)
      [aotindutor] Forward fix a performance regression (#110800)
      [aotinductor] Add a perf smoke test for AOTInductor (#110972)
      [aotindutor] Update the cpp test example (#110652)
      [aotinductor] Add AOTIModelRunner as a utility class (#110891)
      [aotinductor] Add both cpu and cuda tests for the AOTInductor cpp test (#110920)
      [CI] Add auto label rule for torch/_export (#111181)
      [aotinductor] Relax ExternKernel kwargs checking (#111167)
      [aotinductor] Refactor the generated result (#111080)
      [aotinductor] Make writing of the weight files to be conditional (#111379)
      [aotinductor] Update test utility to use AOTIModelRunner (#111657)
      [aotinductor] Fix a problem when the generated graph is empty (#111822)
      [aotinductor] Fix duplicated unbacked symbol declarations (#111823)
      [aotinductor] Add a debug compile flag (#112021)
      [aotinductor] Allow specifying a .so name in the aot_inductor.output_path config (#112651)
      [aotinductor] Solves a problem where a tensor is returned more than once (#112177)
      [aotinductor] Move cache_dir to utils.py (#112728)
      [aotinductor] Solves a problem where a tensor is returned more than once (#112177)
      [aotinductor] Update the benchmarking script to clone an eager model (#113046)
      [aotinductor] Add a demo tutorial (#112457)
      [AOTI] Remove try_find_schema (#113617)
      [AOTI] Delay the fallback kernel naming decision to the codegen time (#113660)
      [AOTI] Improve the two-pass wrapper codegen (#114067)
      [CI] Rename the inductor test config names for dynamic shapes tests (#113574)
      [CI] Increase the shard numbers for torchbench tests (#113575)
      [CI] Remove CI skip list for inductor integration tests (#113446)
      [CI] Switch to check against expected result files for dynamo_eager and aot_eager benchmark tests (#113559)
      [CI] Switch to check against expected result files for cpu inductor integration tests (#113668)
      [AOTI] Fix a weight loading issue when the weight size can be 0 (#114280)
      [CI] Bump up the graph break count for DALLE2_pytorch temporarily (#114598)
      [dynamo] Support itertools.groupby (#114192)
      [CI] Remove an exception catching for Triton compiler error (#113064)
      [CI] Use linux.12xlarge for cpu_inductor integration tests (#114729)
      [CI] Dump more detailed error msg in PT2 integration tests (#114683)
      [CI] Update torchbench pin (#114694)
      [CI] Fix a REQUIRE_HIGHER_TOLERANCE comparison bug (#114870)
      [inductor] Update triton pin (#114772)
      [AOTI] Handle empty input args (#114682)
      [CI] Log load_model failures in csv (#114784)
      [CI] Add torch/_functorch/_aot_autograd to auto-label rule (#115283)
      [aotautograd] Fix an output shape error when inputs are aliased (#115279)
      [inductor] adapt to the get_max_simd_tflops Triton API change (#115288)
      [CI] Fix a missing write_csv_when_exception problem (#115370)
      [CI] Call torch.cuda.empty_cache to release device memory (#114663)
      [AOTI] Fix a missing declaration for the result of item() (#115175)
      [inductor] Fix an aliased output bug (#115373)
      [CI] Lower the smoketest speedup threshold for nangpt (#115562)
      [dynamo] Fix a closure cell empty error (#115541)
      [inductor] Fix an aliased output bug (#115373)
      [CI] Fix lint errors on master (#115627)
      [AOTI][refactor][1/n] Rename cpp_kernel to cpp_kernel_name (#115783)
      [AOTI][refactor][2/n] Rename kernel to python_kernel_name (#115766)
      [AOTI][refactor][3/n] Declare python_kernel_name and cpp_kernel_name in ExternKernel (#115831)
      [AOTI][refactor][3/n] Declare python_kernel_name and cpp_kernel_name in ExternKernel (#115972)
      [AOTI][refactor] Organize model runner files (#116022)
      [AOTI][refactor] Refactor model runner API (#116047)
      [AOTI][refactor] Remove model_container_runner_cuda.cpp (#116113)
      [inductor] Remove the float16 restriction for cpu cpp_wrapper (#116205)
      [inductor] fix cpp_wrapper inputs mismatch (#116197)
      [inductor] Fix cpp_wrapper codegen for ir.ComplexView (#116481)
      [inductor] Control the cpp_wrapper mode with an env variable (#116615)
      [AOTI] Add pybind for AOTIModelContainerRunnerCpu and AOTIModelContainerRunnerCuda (#116269)
      [AOTI] Forward fix a Windows build failure (#116790)
      [AOTI] Update AOTI runner util (#116971)
      [AOTI] Remove caching for compiled model.so (#117087)
      [CI] Catch more exception types when running eager in PT2 tests (#117120)
      [AOTI] Add torch._export.aot_load (#117610)
      [cpp_wrapper] Change CppWrapperCodeCache to use faster python binding (#117693)
      [AOTI] Fix a bug in the torch._export.aot_load API (#118039)
      Update Triton pin (#117873)
      [AOTI] Support .item() in the ABI-compatible mode (#117989)
      [AOTI] Refactor shim_common.cpp (#118168)
      [AOTI] Add _scaled_dot_product_efficient_attention to C shim (#118169)
      [AOTI] Fix a None as index codegen issue (#118187)
      [AOTI] Skip test_index_put_with_none_index on rocm (#118290)
      [AOTI] Forward fix https://github.com/pytorch/pytorch/pull/117989 (#118291)
      [AOTI] Support scalar to tensor in the ABI-compatible mode (#118024)
      [inductor] Refactor ir.ComplexView (#118704)
      [AOTI] Add aoti_torch_view_dtype in C shim (#118705)
      [AOTI] Support _embedding_bag in C shim (#118706)
      [inductor] Fix an internal test issue (#118903)
      [AOTI] Fix a RAIIAtenTensorHandle premature deallocation bug (#118963)
      [AOTI] Fix a cpp kernel missing arg type issue (#119021)
      [AOTI] Support copy_, _fft_c2c and view_as_real in C shim (#119125)
      [AOTI] Make abi_compatible as default for OSS CI (#119126)
      [AOTI] Rename config.aot_inductor.abi_compatible (#119065)
      [AOTI][refactor] Split common aoti_runtime utils into a separate header (#119066)
      [inductor] Update the compile options for CppPythonBindingsCodeCache (#119415)
      [AOTI][refactor] Move ThreadLocalCachedOutputTensor into a separate header (#119392)
      [inductor] Update JIT Inductor cpp wrapper entry function signature (#119280)
      [AOTI] Fix a typo (#120094)
      [AOTI] Fix a strict-aliasing warning (#120628)
      [AOTI][refactor] Move a few util functions in atoi_torch (#119987)
      [AOTI] Change the cpp wrapper codegen for sdpa (#120592)
      [AOTI] Store OpOverload in ir.ExternKernel (#120629)
      [AOTI] Use torchgen to generate C shim functions (#120513)
      [AOTI] Update cpp wrapper codegen to use v2 C shim (#120714)
      [Inductor] Enable ABI-compatible mode for cpp-wrapper JIT (#121309)
      [Inductor] Allocate another shard for testing cpp-wrapper JIT (#121310)

Bowen Bao (1):
      [ONNX] beartype to emit warning instead of error by default (#123363)

BowenBao (37):
      [ONNX] Move out onnx bench bash scripts (#103983)
      Benchmark flag to include slowdowns when computing gmean of speedups over eager (#108375)
      [ONNX] Add dynamo_onnx_aot_inline to bench (#110183)
      [ONNX] Benchmark to store test data along exported model (#111095)
      Ignore beartype if its version is 0.16.0 (#111859)
      [ONNX] Enable onnx inlining in benchmark for >2GB models (#111867)
      Apply same 'pick_grad' on generating fp64 reference outputs (#111593)
      Delete deepcopied model after use in benchmark to reduce memory consumption (#111868)
      [ONNX] A better way to safe guard 2GB  model serialization (#111984)
      Support 'BaseOutput' and subclasses from 'diffusers' in dynamo (#111978)
      [ONNX] Fix scalar type promotion between fp16 tensor and fp32 scalar (#113404)
      [ONNX] Fix scalar type promotion between fp16 tensor and fp32 scalar (#113404)
      [ONNX] Fix scalar type promotion between fp16 tensor and fp32 scalar (#113404)
      [ONNX][dynamo_export] Add 'aten::rsub' type promotion (#113697)
      [ONNX][dynamo_export] Add 'aten::rsub' type promotion (#113697)
      [ONNX] Fix bench w/ iobinding; Remove cpu fallback (#113703)
      [ONNX][dynamo_export] Turn off opmath for type promotion (#113780)
      [ONNX] Benchmark to save sample inputs to disk before running (#114163)
      [Experimental][ONNX] Export with symbolic shapes in proto (#112179)
      [ONNX][Bench] Relax tolerance for cuda accuracy check (#114767)
      [ONNX][Bench] Add warmup for onnx cuda runs (#114821)
      [ONNX] Add sanity check in CI for onnxbench (#110178)
      [ONNX][Bench] Remove double export and session init in perf test (#114907)
      [ONNX][Bench] Fix model name retrieval and remove unused argument (#115108)
      [ONNX][dynamo_export] Extend expected fx output types for int, float, bool (#115431)
      [ONNX] Set correct cuda.current_device for multi-device onnx performance bench (#115670)
      [ONNX] Dump sarif diagnostics for failed onnx exports in benchmark (#115673)
      [ONNX] Add proper iobinding synchronize for ONNX cuda bench (#115773)
      [ONNX] Add copy before export for perf bench to avoid mutating base model (#115945)
      [ONNX][Dort] Fix bug preventing running with OrtValueVector (#116124)
      [ONNX][dynamo_export] Decomposition skips using custom operator (#117314)
      [ONNX][bench] Deepcopy model to another device before export to avoid OOM (#118710)
      [ONNX] Fix upsample_bilinear2d decomp skip with output shape (#118823)
      [ONNX][dynamo_export] Turn off opmath type promotion for div (#119112)
      [ONNX][dynamo_export] Adjust to new symbolic shape name format in value_info (#119855)
      [ONNX][dynamo_export] Skip instance_norm decomp for export (#120866)
      Remove opmath cast for im2col decomp (#121363)

Boyuan Feng (7):
      Replace `constraints` with `dynamic_shapes` in deeplearning/aot_inductor test (#117573)
      Replace `constraints` with `dynamic_shapes` in scripts/sijiac/prototypes and test/inductor (#117915)
      Allow dynamic shapes of `tuple` type for inputs of `dataclass` type (#117917)
      Replace `constraints` with `dynamic_shapes` in export-to-executorch tutorial (#117916)
      Replace `constraints` with `dynamic_shapes` in caffe2/test/cpp & torchrec/distributed/tests/test_pt2 (#118026)
      [torch] Expose dynamic_shapes api at multiple levels (#118695)
      Add ATen Op _chunk_cat and _chunk_cat.out (#121081)
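
The `constraints` → `dynamic_shapes` migration above targets the torch.export API; a minimal sketch of the newer argument, assuming the `torch.export.Dim` surface:

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x.cos() + 1

batch = Dim("batch")  # symbolic batch dimension
# dynamic_shapes replaces the older `constraints` argument: map each forward()
# input to the dimensions that should remain dynamic in the exported program.
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": {0: batch}})
print(ep.graph_module.graph)
```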

Bradley Davis (2):
      [torch.export] fixes for unlifting lifted tensor constants (#116266)
      [ait] inspect get_attr nodes for _decline_if_input_dtype (#118760)

Brian (5):
      Update planner.py (#107998)
      Update chunk_sharding_spec.py (#108915)
      Allow public access for imports (#108914)
      Enable planner to be used for loading sharded optimizer state dict (#112259)
      Enable planner to be used for loading sharded optimizer state dict (#112259)

Brian Hirsh (53):
      better support for fakeifying and dynamoing through torch_dispatch subclasses (with dynamic shapes) (#107415)
      reorder proxy / fake modes so they always run last (#104482)
      add return_and_correct_aliasing() util for wrapper subclasses (#107915)
      add dynamic shapes support for subclasses that override size/stride (#107916)
      wrapper subclasses: support non-cpu device for dynamic shape overload (#107926)
      Fix inductor <> ddp_optimizer issue (#108081)
      error when using _dynamo.optimize_ddp=True and _inductor.keep_output_stride=False together (#108235)
      fix issue with lift_fresh_copy when using export + compile (#108243)
      Add TorchDispatch version of functionalization (#106404)
      python functionalization: add helpers, functionalize_sync and mirror_autograd_meta (#107917)
      return_and_correct_aliasing: massage some schemas to work with torchgen (#108897)
      reland "python functionalization: add helpers, functionalize_sync and mirror_autograd_meta (#107917)" (#109518)
      functorch: fallthrough on calls to custom size/stride/storage_offset calls (#109024)
      custom ops: don't error if autograd input is a tensor subclass (#109248)
      python functionalization: support higher order ops (#108656)
      fix subclass custom sizes dynamic shapes caching (#108654)
      _return_and_correct_aliasing: fix for schemas with mutable tensor in kwargs (#109662)
      fix infinite loop with primtorch and .to(meta) (#109632)
      Make FunctionalTensor subclass to be more like functorch (interaction with ZeroTensor + Conjugate key) (#109023)
      Update AOTAutograd to use FunctionalTensorMode instead of C++ functionalization (#106406)
      Reland "Update AOTAutograd to use FunctionalTensorMode instead of C++ functionalization (#106406)" (#109906)
      Reland attempt 2 of "Update AOTAutograd to use FunctionalTensorMode instead of C++ functionalization (#106406)" (#109906)" (#110079)
      AOTDispatch subclass (#104483)
      AOTAutograd: Go down inference path if no outputs require grad (#111011)
      torch.compile DTensor E2E (#105236)
      FunctionalTensor: avoid spurious not_implemented logging during proxy tracing (#111040)
      fix wrong meta for index_select.out (#111364)
      Reland "AOTAutograd: Go down inference path if no outputs require grad (#111011)" (#111347)
      dynamo: graph break on resize_ (#111553)
      python_arg_parser + dynamic shapes: fix segfault coercing symint to intlist (#111642)
      AOTAutograd: avoid intermediate_base logic when all aliased outputs came from a multi_output_view (#111411)
      Fix selective activation checkpointing with subclasses that override sizes() (#113380)
      AOTAutograd: handle set_(), detect metadata mutations that cancel out (#111554)
      graph break on out= ops with noncontiguous out args (#113267)
      handle cross-dtype views during AOTAutograd view-replay (#113416)
      aot_autograd: keep input mutations on requires_grad=True tensor out of the graph for inference (#113584)
      graph break on intermediate leaves that require grad (#113277)
      [test] AOTAutograd: support mutations on buffers that happen during the bw (#112906)
      AOTAutograd: keep input mutations in the graph if they are under no_grad, even if they require_grad (#114646)
      AOTAutograd: support mutations on buffers that happen during the bw (#114953)
      remove aot_config.keep_inference_input_mutations from assert_functional_graph (#115195)
      propagate torch stack trace metadata to copy_() nodes during input mutations (#117587)
      make flash_attn_bw impl correct w.r.t. meta when k and v have different strides (#119500)
      make flash_attn_bw impl correct w.r.t. meta when k and v have different strides (#119500)
      beef up non-overlapping checks for detecting false aliasing of graph inputs (#119826)
      dynamo: respect autograd.Function + multiple save_for_backward calls (#117667)
      dynamo: support attribute access on tensor subclasses without sources (#117666)
      DTensor: make tensor_flatten more compatible for dynamo getattr (#118209)
      fix multiple-fake-modes bug with compile + subclasses (#118191)
      DTensor: use memory_format in the hash for all aten ops that use that arg (e.g. aten.clone) (#118667)
      DTensor + dynamo: fix is_shard/replicate always inlining to False (#118668)
      add a test that non_overlapping checks dont generate too many guards (#120106)
      get CommsDebugMode to work with DTensor (#118769)

Brian Vaughan (2):
      fix an incorrect indent in documentation (#108273)
      improve annotation device parameters where a device ordinal is allowed (#113647)

CJMenart (1):
      Bugfix to MixtureSameFamily's _pad_mixture_dimension (#118947)
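
For context, the fix above touches `torch.distributions.MixtureSameFamily`; a small usage sketch of that distribution (a Gaussian mixture with arbitrarily chosen parameters):

```python
import torch
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

# 5-component mixture of 3-dimensional diagonal Gaussians
mix = Categorical(torch.ones(5))
comp = Independent(Normal(torch.randn(5, 3), torch.rand(5, 3) + 0.5), 1)
gmm = MixtureSameFamily(mix, comp)

samples = gmm.sample((4,))         # shape (4, 3)
log_probs = gmm.log_prob(samples)  # shape (4,)
```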

CK Luk (7):
      Back out "[Dynamo x FSDP] Add support for params, buffers, submodules on FSDPManagedNNModuleVariable (#107923)" (#108823)
      Back out "[Inductor] Break the loop fusion when node2 depends on node1 mutations (#109172)" (#110622)
      add use_fake_all_gather and use_fake_reduce_scatter to FSDP for ablation studies (#113106)
      Take 2 of "Add an option to log the source of the Triton kernels generated by torch._inductor" (#115979)
      Allow explicit shutdown of the compile-worker pools (#117664)
      Add torch.backends.mha.get_fastpath_enabled to FUNC_INLINELIST (#118979)
      Do not use warm_pool() if TorchTnT is used (#121047)

CYuxian (4):
      [ONNX] Fix indexing issue of meshgrid op (#109350)
      [ONNX] Fix export issue of aten::layer_norm in opset 17 (#114058)
      [ONNX] Consider negative dim in _index_fill_reshape_helper (#114050)
      [ONNX] Fix output mismatch issue of repeat_interleave when dim is None (#116689)

Cao E (1):
      Add Half support for softmax and log_softmax on CPU (#103315)
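
A quick sketch of what that support enables, assuming a build that includes the change (previously half inputs had to be upcast to float to run these ops on CPU):

```python
import torch

x = torch.randn(4, 8, dtype=torch.half)   # float16 tensor on CPU
probs = torch.softmax(x, dim=-1)          # runs natively in half precision
logp = torch.log_softmax(x, dim=-1)
print(probs.dtype, logp.dtype)            # torch.float16 torch.float16
```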

CaoE (33):
      Add channels_last3d support for mkldnn conv and mkldnn deconv (#95271)
      Add scalar conversion using avx instructions for half (#102140)
      add Half support for GroupNorm on CPU (#100234)
      add Half support for maxpool on CPU (#98819)
      add Half support for BatchNorm on CPU (#102070)
      add Half support for BatchNorm on CPU (#102070)
      add Half support for BatchNorm on CPU (#102070)
      add fp16 support for mkldnn conv and deconv on CPU (#99496)
      add fp16 support for native conv and deconv on CPU (#99497)
      add fp16 support for gemm (#99498)
      add Half support for bernoulli on CPU (#104176)
      Add Half support for addcmul, addcdiv, cumsum, and topk on CPU (#103319)
      add Half support for multinomial on CPU (#104178)
      Add Half support for aminmax on CPU (#106853)
      Add Half support for logspace and range on CPU (#112131)
      Add Half support for kthvalue, cross, hist, and logit on CPU (#112135)
      Add Half support for cummax, cummin, cumprod, logcumsumexp, and prod on CPU (#112132)
      Add Half support for poisson and use float for Half cumulative distribution on CPU (#112124)
      Add Half for atan2, logaddexp, logaddexp2, hypot, and nextafter on CPU (#112138)
      add Half support for AdaptiveAvgPool2d and AdaptiveMaxPool2d on CPU (#102079)
      Add Half support for CPU autocast on eager mode (#112484)
      Remove memory_format check for native_group_norm_backward (#115721)
      add _amp_foreach_non_finite_check_and_unscale_cpu_ and _amp_update_scale_cpu_ kernels on CPU (#109281)
      Add Half support for masked_softmax on CPU (#117028)
      Add half specializations for load of sum (#106454)
      Fix kaiser_window for lower precision data types on CPU (#117345)
      add GradScaler on CPU (#109993)
      enable mkl_gemm_f16f16f32 in cpublas::gemm (#118367)
      add Half support for flash attention on CPU (#118368)
      add test cases for GradScaler on CPU (#109994)
      add Half support for flash attention (#119247)
      Fix permuted sum precision issue for lower precision on CPU (#108559)
      Fix lower precision check for MKLDNN on Windows (#121618)
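
Several of the commits above (CPU autocast in eager mode, the `_amp_foreach_non_finite_check_and_unscale_cpu_` kernels, and GradScaler on CPU) combine into CPU mixed-precision training; a rough sketch, assuming the device-generic `torch.amp.GradScaler("cpu")` constructor and float16 CPU autocast are both available:

```python
import torch

model = torch.nn.Linear(8, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.amp.GradScaler("cpu")  # assumed: device-generic GradScaler constructor

for _ in range(3):
    x = torch.randn(16, 8)
    with torch.autocast(device_type="cpu", dtype=torch.float16):
        loss = model(x).sum()
    scaler.scale(loss).backward()     # scale the loss to avoid fp16 gradient underflow
    scaler.step(opt)                  # unscales grads and skips the step on inf/nan
    scaler.update()
    opt.zero_grad(set_to_none=True)
```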

Carlos Mocholí (4):
      Fix `torch.compiler.cudagraph_mark_step_begin` example (#112807)
      Fix cudagraph check message (#115664)
      Update device_mesh.py docs imports (#116074)
      Fix ColwiseParallel typo (#116151)
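
The first commit above fixes the documented `torch.compiler.cudagraph_mark_step_begin` example; roughly, the pattern looks like this (assumes a CUDA device and the cudagraph-backed `reduce-overhead` mode):

```python
import torch

@torch.compile(mode="reduce-overhead")  # enables CUDA graph trees
def step(x):
    return x * 2 + 1

for _ in range(3):
    # Mark the start of a new iteration so the cudagraphs backend knows that
    # outputs from the previous iteration may now be overwritten.
    torch.compiler.cudagraph_mark_step_begin()
    out = step(torch.randn(8, device="cuda"))
```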

Catherine Lee (88):
      Put logging in run_tests  (#107987)
      When patching dynamic shapes test class, don't run the original tests (#108681)
      [ez] Fix small type error in run_test (#109036)
      Forward fix lint (#109177)
      inductor/test_max_autotune serial in CI (#109209)
      Disable tests mentioned in 109213 (#109232)
      Add tensorboard to pip requirements (#109349)
      Revert "[PyTorch] Add Expanded call stack to nodes (#108426)" (#109468)
      Fix test_libtorch.bat not exiting on error (#109393)
      Increase timeout for slow tests (#109206)
      Clean up test_external_module_register (#110254)
      run_tests.py minor logging changes (#110188)
      Quieter logs in CI (#110033)
      [ez] Remove print in heuristics aggregation (#110621)
      [ci] Move step to get workflow job id before test step in linux (#111483)
      [ez] Remove unused code in upload_test_stats (#111504)
      [ci] Save various json files from test infra into folder (#111516)
      [TD] Historical edited files and profiling heuristics (#111510)
      [experiment][TD] Rating number system (#112676)
      [ci][ez] Add job_id to emit_metrics (#113099)
      [TD] Disable HistoricalClassFailurCorrelation (#113497)
      More random stepcurrent (#113620)
      [ez] Add some more pyre related files to gitignore (#113796)
      [ez] Don't retry onnx in shell (#113803)
      [ez] Hash update to reuse issues again (#113961)
      [td] Consistent pytest cache (#113804)
      Fix keep-going (#112098)
      Tests have main linter (#114882)
      Add call to run_tests for a few tests (#115097)
      [ez] Don't run workflows on forks (#115429)
      [ez] Remove unittest retries (#115460)
      [ez] Remove some args from run_test.py (#115459)
      Add main in dynamo/test_compile.py (#115941)
      Reset stepcurrent cache if file succeeds (#115775)
      Add call to run_tests for more tests? (#115781)
      [ez][td] Fix for emit metrics can't find JOB_NAME (#116748)
      [ez][td] Pipe TD logs to log file (#116796)
      Pytest do not rewrite assertions by default (#117060)
      Add super().setup in test_numeric (#117148)
      Add super().setUp() to TestFFT1D (#117329)
      Reduce pytest prints (#117069)
      Run some OOMing tests serially (#117759)
      Reduce pytest prints (#117069)
      OIDC for update_pytorch_labels (#117876)
      [ez] Provide a slightly better error message if process times out (#117865)
      Add environment for close-nonexistent-disable-issues (#117885)
      Mark DynamicShapesExportTests::test_retracibility_dynamic_shapes as slow (#117896)
      [ez] Serial when NUM_PROCS is 1 (#117977)
      Reduce pytest prints (#117069)
      Check if enable inside run call (#118101)
      Check if enable inside run call (#118101)
      [mergebot] Dry run for labels + easier to read Dr CI result (#118240)
      Various CI settings (#117668)
      [ez] Windows log printing + save successful test logs (#118124)
      Fix divergence between internal + external (#118509)
      [ez] Discover tests without importing torch (#118574)
      Enable possibly-undefined error code (#118533)
      Workaround for super() calls in test_torchinductor_dynamic_shapes (#118586)
      Merging heuristics (#118029)
      Workaround for super() calls in test_torchinductor_dynamic_shapes (#118586)
      [ez] Fix CI log file piping error (#118807)
      Fix internal failure D53291154 (#118907)
      Revert "add Half support for flash attention on CPU (#118368)" (#119204)
      Delete old branches (#117079)
      [ez] Lower windows timeout limit for trunk, set test step timeout (#119234)
      Fix delete branches (#119399)
      No TD (test removal) option in CI (#118808)
      Fix delete branches (#119399)
      Separate clang lint? (#119575)
      [ez] Add try catch for deleting old branches (#119696)
      [mergebot] No unique behavior for facebook bot re pending jobs (#119735)
      Alternate sharding (#119078)
      [ez] Explicit env for run_test (#120251)
      Update dynamo_test_failures list (#120271)
      Alternate sharding (#119078)
      Numbers based TD (#119901)
      [ez] Smaller weight for some TD heuristics (#120736)
      [dynamo] Fix inference_mode context variable (#120830)
      Test TD (test removal) on crossref (#119426)
      TD outside of test job (#118250)
      [ez] Add super() calls in test_custom_ops (#121239)
      Forward fix lint after 121202 (#121425)
      TD Heuristic for tests mentioned in PR body, less verbose TD printing (#120621)
      Fix round robin sharding (#121022)
      CI sanity check test for env vars (#120519)
      Disable test_torch_name_rule_map_updated in code (#120627)
      CI sanity check test for env vars (#120519)
      Fix round robin sharding (#121022)

ChanBong (3):
      fix docstring issues in torch.utils (#113335)
      fix docstring issues in torch.distributed (#113337)
      fix docstring issues in torch.utils.tensorboard (#1…