
@bar-qodo bar-qodo commented Nov 19, 2025

User description

Splits each torch library registration in the 2.10 folder into its own file. I had a script that parsed kernel.cpp to do this, but I felt that putting this responsibility on the user might be less error-prone.

Compiles each file targeting 2.9 and asserts that compilation fails. (There are two 2.9 kernels we use as negative tests where compilation is expected to succeed.)
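
As a rough illustration (not the actual harness in this PR), the per-file negative-compilation check could be driven by something like the sketch below; the helper name and build command are hypothetical:

```python
# Hedged sketch only: run one compile attempt per kernel file while targeting
# 2.9 and assert that it fails. build_cmd (e.g. a compiler invocation with the
# 2.9 TORCH_TARGET_VERSION define) is an assumption, not this PR's harness.
import subprocess
from pathlib import Path

def assert_fails_to_compile_for_2_9(source: Path, build_cmd: list[str]) -> None:
    result = subprocess.run([*build_cmd, str(source)], capture_output=True)
    assert result.returncode != 0, f"{source.name} unexpectedly compiled against 2.9"
```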

Stack from ghstack (oldest at bottom):


PR Type

Enhancement, Bug fix, Tests


Description

This is a large, multi-faceted PR that includes several major refactoring efforts and improvements across the PyTorch codebase:

PyObject Lifecycle Management Refactoring:

  • Simplified PyObject preservation and reference counting in intrusive_ptr, TensorImpl, and StorageImpl

  • Replaced complex MaybeOwned wrapper with direct tensor storage and atomic PyObject slot management

  • Added thread-safe PyObject initialization with atomic compare-exchange patterns

  • Removed resurrection logic and simplified Python object lifecycle tracking

Thread Safety Improvements:

  • Added mutex protection to cuBLAS workspace management with double-checked locking (a sketch of the pattern follows this list)

  • Improved JIT operator registry thread safety by returning copies instead of references

  • Enhanced PyInterpreter interface with try_incref() and refcnt() methods

  • Fixed cudagraph reference counting logic to account for multiple references
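
For reference, the double-checked locking mentioned above follows the standard pattern below; this is a minimal Python sketch of the idea only, since the actual change is C++ in aten/src/ATen/cuda/CublasHandlePool.cpp and all names here are illustrative:

```python
# Minimal double-checked locking sketch: check without the lock on the fast
# path, then re-check under the lock before creating the entry.
import threading

_workspaces: dict[int, bytearray] = {}
_lock = threading.Lock()

def get_workspace(key: int) -> bytearray:
    ws = _workspaces.get(key)            # fast path: entry already exists
    if ws is None:
        with _lock:
            ws = _workspaces.get(key)    # re-check under the lock
            if ws is None:
                ws = bytearray(1024)     # stand-in for allocating a workspace
                _workspaces[key] = ws
    return ws
```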

ROCm/HIP Removal:

  • Removed ROCm-specific code from static CUDA launcher, triton heuristics, and BLAS implementations

  • Simplified kernel binary format handling to CUDA-only (cubin)

  • Removed HIP-specific atomic add implementations and conditional compilation blocks

Device-Agnostic and Multi-Device Support:

  • Refactored distributed tests to use device-agnostic APIs and multi-device instantiation

  • Updated test utilities to support XPU alongside CUDA

  • Added device type detection and lazy initialization for checkpoint operations

  • Improved backend specification in distributed test decorators

Filesystem Dependency Removal:

  • Replaced c10::filesystem with custom cross-platform file utilities

  • Updated logging, exception handling, and JIT components to use string manipulation instead of filesystem APIs

Inductor and Compilation Improvements:

  • Simplified memory coalescing analysis by removing broadcast detection

  • Improved Welford reduction helper handling in C++ codegen

  • Added all-reduce bucketing pass configuration for distributed operations

  • Fixed fx_wrapper mode to properly handle symbolic scalars and flatten arguments

  • Added SIMD tiling score simplification

Numeric and Kernel Enhancements:

  • Added complex number support to logaddexp operations

  • Added MXFP4 GPU support validation for B200/B300 devices

  • Refactored CUDA BLAS bias handling with optional parameters

  • Added XPU graph memory pool management

Test Coverage Expansion:

  • Added complex number logaddexp CPU vs CUDA tests

  • Added thread safety tests for gradients and storage

  • Added run-to-run determinism tests for inductor models

  • Added data pointer accessor tests for stable ABI

  • Added MPS regression and broadcasting tests

  • Updated variable naming in dynamic shape and auto-functionalize tests

API Deprecations:

  • Added deprecation annotations to _check_is_size and guard_size_oblivious functions (see the sketch after this list)

  • Updated usages to use alternative APIs
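
A minimal sketch of the deprecation pattern described above, using typing_extensions.deprecated (the exact message text in the PR may differ):

```python
# Hedged sketch: mark a helper as deprecated while delegating to the
# recommended replacement.
from typing_extensions import deprecated
import torch

@deprecated("_check_is_size is deprecated; use torch._check(i >= 0) instead.")
def _check_is_size(i):
    torch._check(i >= 0)
```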

Configuration and Utilities:

  • Added bucket_all_reduces_fx configuration options for distributed operations

  • Enhanced performance CSV checking with detailed metrics

  • Added weights-only safety checks to model deserialization

  • Improved dataloader worker affinity testing


Diagram Walkthrough

flowchart LR
  A["PyObject Management<br/>Refactoring"] -->|Simplifies| B["TensorImpl &<br/>StorageImpl"]
  A -->|Adds atomic ops| C["PyObjectSlot"]
  D["Thread Safety<br/>Improvements"] -->|Protects| E["cuBLAS Workspace"]
  D -->|Secures| F["JIT Operator<br/>Registry"]
  G["ROCm/HIP<br/>Removal"] -->|Eliminates| H["HIP-specific Code"]
  G -->|Simplifies| I["CUDA Launcher"]
  J["Device-Agnostic<br/>Updates"] -->|Enables| K["Multi-Device<br/>Testing"]
  J -->|Supports| L["XPU Backend"]
  M["Filesystem<br/>Replacement"] -->|Removes| N["c10::filesystem<br/>Dependency"]
  O["Inductor<br/>Enhancements"] -->|Adds| P["All-Reduce<br/>Bucketing"]
  O -->|Improves| Q["Welford Helpers"]

File Walkthrough

Relevant files
Tests
18 files
test_dynamic_shapes.py
Update variable naming in dynamic shape test assertions   

test/test_dynamic_shapes.py

  • Updated variable naming in expected IR output strings to use
    simplified names (ge, ge_1, ge_2, etc.) instead of numbered suffixes
    (ge_1, ge_2, ge_3, etc.)
  • Changes reflect a renumbering scheme for generated intermediate
    variables in dynamic shape assertions
  • Multiple test assertions updated to match new variable naming patterns
+32/-32 
test_auto_functionalize.py
Update variable naming in auto-functionalize test outputs

test/inductor/test_auto_functionalize.py

  • Updated expected IR output strings to use simplified variable naming
    (ge instead of ge_1)
  • Changed intermediate variable references in assertion messages to
    match new naming scheme
  • Multiple test assertions updated for consistency with new variable
    naming patterns
+8/-8     
test_higher_order_ops.py
Reduce operation counts and remove size check operations 

test/dynamo/test_higher_order_ops.py

  • Reduced expected operation counts in dynamic shape tests (from 10 to
    9, 8 to 7, 17 to 15, 13 to 11)
  • Removed _check_is_size operation calls from expected IR output strings
  • Updated multiple test assertions to reflect fewer generated operations
+3/-8     
test_linalg.py
Add complex number logaddexp CPU vs CUDA test                       

test/test_linalg.py

  • Added new test method test_logaddexp_cpu_vs_cuda_complex() for complex
    number logaddexp operations
  • Tests logaddexp with complex values on CPU vs CUDA with various edge
    cases (infinity, NaN)
  • Validates that results are bitwise equivalent between CPU and GPU
    implementations
+59/-0   
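
A rough sketch of the kind of CPU-vs-CUDA comparison this test performs (the actual test covers more value combinations and asserts bitwise equality):

```python
# Hedged sketch: compare complex logaddexp results between CPU and CUDA,
# including inf/NaN edge cases. Requires a CUDA device.
import torch

def check_logaddexp_complex_cpu_vs_cuda():
    vals = torch.tensor(
        [0 + 0j, complex(float("inf"), 1.0), complex(float("nan"), 0.0)],
        dtype=torch.complex64,
    )
    a = vals.repeat(len(vals))                 # all pairs (a, b)
    b = vals.repeat_interleave(len(vals))
    cpu = torch.logaddexp(a, b)
    gpu = torch.logaddexp(a.cuda(), b.cuda()).cpu()
    torch.testing.assert_close(cpu, gpu, equal_nan=True)
```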
test_matmul_cuda.py
Expand addmm/baddmm tests with broadcast and output variants

test/test_matmul_cuda.py

  • Reduced parametrization ranges for N and batch_size parameters in
    test_addmm_baddmm_dtype_overload()
  • Added new parameters broadcast_self and high_precision_self to test
    method
  • Updated create_inputs() function to handle broadcast shapes for c
    tensor
  • Added tests for out variant of addmm and baddbmm operations
  • Enhanced test coverage for output tensor handling with different
    dtypes
+21/-7   
test_libtorch_agnostic.py
Add data pointer retrieval tests for stable ABI                   

test/cpp_extensions/test_libtorch_agnostic.py

  • Added get_supported_dtypes() function listing all supported dtypes for
    stable ABI
  • Added two new test methods: test_get_any_data_ptr() and
    test_get_template_any_data_ptr()
  • Tests validate data pointer retrieval with various dtypes and mutable
    flags
  • Added version check decorator @skipIfTorchVersionLessThan(2, 10) for
    new tests
+66/-0   
test_deterministic.py
Add run-to-run determinism test for inductor models           

test/inductor/test_deterministic.py

  • Added new test method test_run2run_determinism() with parametrization
    for model names, training/inference modes, and precision types
  • Tests run-to-run determinism for HuggingFace models using inductor
    backend
  • Validates bitwise equivalent results across multiple runs with
    deterministic mode enabled
  • Includes subprocess-based testing with environment variable
    configuration
+62/-0   
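
In spirit, the determinism check amounts to the sketch below; the actual test runs the models in subprocesses with environment-variable configuration:

```python
# Hedged sketch: run the same compiled model twice under deterministic mode and
# require identical outputs. Some ops may additionally need
# CUBLAS_WORKSPACE_CONFIG to be set for full determinism.
import torch

def check_run_to_run_determinism(model, example_inputs):
    torch.use_deterministic_algorithms(True)
    compiled = torch.compile(model)
    first = compiled(*example_inputs)
    second = compiled(*example_inputs)
    assert torch.equal(first, second), "outputs differ between runs"
```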
test_inductor_collectives.py
Add gloo backend NCCL estimator regression test                   

test/distributed/test_inductor_collectives.py

  • Removed custom _pass function for bucketing all-reduce operations
  • Added bucket_mode parameter to inductor config patch
  • Added new test method test_regression_use_nccl_estimate_with_gloo()
    for gloo backend compatibility
  • Added @requires_gloo() decorator to new test method
+46/-7   
test_mps.py
Add MPS regression and broadcasting tests                               

test/test_mps.py

+29/-1   
test_fake_distributed.py
Update fake distributed test expected output                         

test/dynamo/test_fake_distributed.py

  • Updated expected graph module output to reflect corrected variable
    naming
  • Changed variable names from ge_1, ge_3, ge_5 to ge, ge_1, ge_2 for
    consistency
+6/-6     
test_mix_order_reduction.py
Expand rms_norm_bwd test coverage with new shapes               

test/inductor/test_mix_order_reduction.py

  • Added new shape parameter (1000000, 256) to test_rms_norm_bwd test
  • Added add_1dim parametrization to test with additional dimension
  • Added resource optimization logic to skip non-critical tests
  • Modified test to conditionally reshape input tensor based on add_1dim
    parameter
+17/-2   
test_serialize.py
Add torch artifact deserialization test                                   

test/export/test_serialize.py

  • Added import for deserialize_torch_artifact function
  • Added new test test_deserialize_torch_artifact_dict to verify
    deserialization of dictionary objects
+11/-1   
test_autograd.py
Add gradient thread safety test                                                   

test/test_autograd.py

  • Added new test test_grad_thread_safety to verify thread-safe access to
    tensor gradients
  • Test uses ThreadPoolExecutor to concurrently access gradients and
    verify consistency
+28/-0   
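
The core idea of the test is roughly the following sketch (the real test checks additional invariants):

```python
# Hedged sketch: access a tensor's .grad concurrently from many threads and
# verify every thread observes the same gradient object.
from concurrent.futures import ThreadPoolExecutor
import torch

def check_grad_thread_safety():
    x = torch.randn(4, requires_grad=True)
    x.sum().backward()
    with ThreadPoolExecutor(max_workers=8) as pool:
        grads = list(pool.map(lambda _: x.grad, range(64)))
    assert all(g is grads[0] for g in grads)
```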
test_torchinductor.py
Add inner reduction detection test                                             

test/inductor/test_torchinductor.py

  • Added new test test_inner_reduction_detection to verify reduction hint
    detection
  • Test compiles function and checks for ReductionHint.OUTER in generated
    code
+15/-0   
test_custom_operators.cpp
Update tests for thread-safe operator registry changes     

test/cpp/jit/test_custom_operators.cpp

  • Changed all auto& references to auto for operator retrieval calls
  • Updated 6 test cases to work with copied operator vectors instead of
    references
+6/-7     
cuda_cublas_handle_pool_test.cpp
Add concurrent access test for cuBLAS handle pool               

aten/src/ATen/test/cuda_cublas_handle_pool_test.cpp

  • Added new concurrent stress test for cuBLAS handle pool and workspace
    management
  • Tests concurrent access from multiple threads with simultaneous
    workspace clearing
  • Verifies thread safety of getCurrentCUDABlasHandle() and
    getCUDABlasLtWorkspace()
+77/-0   
test_scalartype.cpp
Add test for quantized integer type detection                       

test/cpp/aoti_abi_check/test_scalartype.cpp

  • Added new test TestScalarType::isQIntType to verify quantized integer
    type detection
  • Tests both positive cases (QInt types) and negative cases (other
    scalar types)
+11/-0   
test_custom_ops.cpp
Update custom operator test for thread-safe registry         

test/custom_operator/test_custom_ops.cpp

  • Changed auto& to auto for operator retrieval from registry
  • Updated to work with copied operator vectors
+1/-1     
Enhancement
64 files
test_utils.py
Refactor tests for device-agnostic GPU support                     

test/test_utils.py

  • Replaced hardcoded CUDA device references with device-agnostic
    accelerator API calls
  • Added device_type variable using
    torch.accelerator.current_accelerator() for cross-device support
  • Replaced HAS_CUDA with TEST_GPU flag checking both XPU and CUDA
    availability
  • Updated test methods to use torch.get_device_module() and
    torch.accelerator APIs instead of torch.cuda directly
  • Modified device string formatting to use device_type variable for GPU
    tests
+51/-43 
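
The device-agnostic pattern these tests move to looks roughly like this (a sketch, not the exact test code):

```python
# Hedged sketch of the accelerator-based, device-agnostic pattern.
import torch

acc = torch.accelerator.current_accelerator()          # e.g. device("cuda") or device("xpu")
device_type = acc.type if acc is not None else "cpu"
device_module = torch.get_device_module(device_type)   # torch.cuda, torch.xpu, ...

if device_module.is_available():
    t = torch.ones(2, device=f"{device_type}:0")
```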
test_2d_composability.py
Simplify backend selection and fix decorator ordering       

test/distributed/_composable/test_composability/test_2d_composability.py

  • Removed curr_backend variable that was derived from
    dist.get_default_backend_for_device()
  • Updated backend property to use hardcoded backend strings based on
    TEST_XPU flag
  • Reordered decorator stacking for test methods (moved @with_comms
    before @skip_if_lt_x_gpu)
  • Simplified backend selection logic to conditionally return XPU or CUDA
    NCCL backends
+13/-14 
test_ddp_hooks.py
Migrate to MultiProcessTestCase with NCCL-specific setup 

test/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py

  • Changed base class from DistributedTestBase to MultiProcessTestCase
  • Added setUp() and tearDown() methods with process spawning and file
    cleanup
  • Implemented _get_process_group_nccl() method for NCCL process group
    initialization
  • Replaced @requires_accelerator_dist_backend() decorators with
    @requires_nccl()
  • Removed device-agnostic code and reverted to CUDA-specific
    implementations
  • Updated gpus_for_rank() function to use torch.cuda.device_count()
    directly
+38/-20 
test_c10d_object_collectives.py
Add device type instantiation for multi-device testing     

test/distributed/test_c10d_object_collectives.py

  • Added device type detection logic using TEST_HPU and TEST_CUDA flags
  • Replaced device-agnostic torch.accelerator calls with explicit device
    module selection
  • Added instantiate_device_type_tests() call to generate device-specific
    test variants
  • Updated test method signatures to accept device parameter
  • Modified with_comms decorator to pass device information to test
    methods
+31/-13 
tiling_utils.py
Simplify memory coalescing analysis and remove broadcast detection

torch/_inductor/tiling_utils.py

  • Removed find_broadcast_var() function that identified broadcast
    patterns in memory access
  • Removed try_get_buf_size() helper function for buffer size retrieval
  • Removed uncoalesced_addrs field from CoalesceVarAnalysis dataclass
  • Simplified get_score() function signature by removing buf_names
    parameter
  • Refactored memory coalescing analysis to remove buffer size
    constraints and broadcast variable handling
+13/-74 
test_pp_composability.py
Update backend requirements and GPU availability checks   

test/distributed/_composable/test_composability/test_pp_composability.py

  • Updated @requires_accelerator_dist_backend() decorators to specify
    backend list ["nccl", "xccl"]
  • Replaced at_least_x_gpu() checks with TEST_MULTIGPU and TEST_XPU flags
  • Imported TEST_MULTIGPU and TEST_XPU from common test utilities
  • Updated skip conditions to check for multi-GPU or XPU availability
+18/-9   
test_scaled_matmul_cuda.py
Add MXFP4 SM120+ device skip conditions                                   

test/test_scaled_matmul_cuda.py

  • Added SM120OrLater to imports from common CUDA utilities
  • Added skip conditions for MXFP4 tests on SM120+ devices (only
    supported on B200/B300)
  • Applied skip logic to three test methods:
    test_mxfp8_nvfp4_scaled_grouped_mm_2d_2d(),
    test_mxfp8_scaled_grouped_mm_2d_3d(), and
    test_blockwise_mxfp8_nvfp4_mxfp4_numerics()
+14/-0   
triton_heuristics.py
Remove fbcode and ROCm-specific logic from triton heuristics

torch/_inductor/runtime/triton_heuristics.py

  • Removed import of is_fbcode from torch._environment
  • Removed ROCm/HIP device type checks and binary format handling (hsaco
    vs cubin)
  • Changed device type validation to only accept CUDA devices
  • Removed conditional logic based on is_fbcode() for heuristic tuning
  • Simplified cubin path construction to always use .cubin extension
+9/-12   
check_perf_csv.py
Enhance performance CSV checking with detailed metrics     

benchmarks/dynamo/check_perf_csv.py

  • Added file existence check with error handling for missing CSV files
  • Enhanced output formatting to display detailed performance metrics
    (latency, compilation time, memory ratio)
  • Improved failure reporting with sorted list and percentage deviation
    from target
  • Added success message when all models pass threshold check
  • Fixed typo in help text ("multiple" to "multiply")
+41/-8   
static_cuda_launcher.py
Remove ROCm support from static CUDA launcher                       

torch/_inductor/runtime/static_cuda_launcher.py

  • Removed ROCm/HIP specific kernel ABI handling for hsaco binary format
  • Simplified kernel initialization to only handle CUDA cubin format
  • Removed is_rocm flag and associated conditional logic
  • Removed complex scratch space parameter handling specific to HIP
    kernel ABI
  • Simplified argument type handling for CUDA kernels only
+6/-49   
comm_analysis.py
Simplify NCCL estimator error handling and backend checks

torch/_inductor/comm_analysis.py

  • Removed try-except wrapper around NCCL time estimator calls
  • Moved backend support checks earlier in _nccl_estimate() function
  • Added explicit checks for fake backend and time estimate support
  • Removed conditional check for torch.distributed.is_nccl_available()
    before using estimator
  • Simplified error handling for NCCL estimator failures
+18/-19 
cpp.py
Improve welford reduction helper handling in C++ codegen 

torch/_inductor/codegen/cpp.py

  • Updated reduction_combine() to pass helper value to welford_combine()
    when available
  • Changed need_use_acc_helper() to always use helper for welford_reduce
    (removed scalar check)
  • Modified reduction code generation to use welford_helper_cse for
    welford reductions
  • Updated scalar helper initialization to include welford reductions
    alongside sum reductions
+24/-20 
test_dataloader.py
Simplify dataloader worker affinity testing                           

test/test_dataloader.py

  • Simplified test_ind_worker_queue() to use fixed batch sizes and worker
    counts
  • Removed CPU affinity detection logic and dynamic worker count
    calculation
  • Updated SetAffinityDataset to accept and store expected affinity value
  • Added _worker_set_affinity_init() function for worker initialization
  • Refactored affinity setting to pass expected value through dataset
    instead of worker function
+26/-33 
profiler.py
Add Python 3.2 compatibility and improve type annotations

torch/autograd/profiler.py

  • Added fallback implementation of ContextDecorator for Python < 3.2
    compatibility
  • Updated type annotations to use Optional[] instead of | union syntax
    for compatibility
  • Changed record_function base class to use _ContextDecorator with
    pyrefly ignore comment
  • Added type annotation comments for TorchScript compatibility
+32/-9   
test_zero_redundancy_optimizer.py
Refactor device type detection and determinism handling   

test/distributed/optim/test_zero_redundancy_optimizer.py

  • Removed unused contextmanager import from contextlib
  • Replaced custom deterministic_algorithms context manager with direct
    torch.use_deterministic_algorithms calls
  • Imported get_devtype from torch.testing._internal.common_fsdp
  • Simplified device type detection using get_devtype() instead of custom
    logic
+4/-13   
test_binary_ufuncs.py
Add torch.complex32 support to binary ufuncs                         

test/test_binary_ufuncs.py

  • Added torch.complex32 support to logaddexp and logaddexp2 operations
  • Updated dtype decorators to include torch.complex32 in CUDA tests
  • Added special handling for torch.complex32 in test helper functions
  • Removed expected failure skip for complex type promotion test
+21/-3   
common_methods_invocations.py
Update logaddexp dtype configuration and test skips           

torch/testing/_internal/common_methods_invocations.py

  • Updated logaddexp dtype support to include torch.complex32 for CUDA
  • Removed expected failure skip for complex type promotion test
  • Added test_python_ref_executor to expected failures for complex types
+8/-10   
test_c10d_functional_native.py
Refactor distributed test to use MultiProcessTestCase       

test/distributed/test_c10d_functional_native.py

  • Changed base class from DistributedTestBase to MultiProcessTestCase
  • Updated decorator to specify backends ["nccl", "xccl"]
  • Added setUp method to spawn processes
  • Replaced create_pg call with manual process group initialization using
    FileStore
+17/-4   
simd.py
Simplify SIMD tiling score calculation                                     

torch/_inductor/codegen/simd.py

  • Removed total_uncoalesced calculation and related penalty scoring
    logic
  • Simplified score_mod function to only consider tile size penalties
  • Removed uncoalesced memory penalty from tiling score calculation
+3/-12   
ops.py
Add data pointer accessor functions                                           

test/cpp_extensions/libtorch_agnostic_2_10_extension/libtorch_agnostic_2_10/ops.py

  • Added get_any_data_ptr function to return tensor data pointer value
  • Added get_template_any_data_ptr function for template-based data
    pointer retrieval with dtype checking
+26/-0   
test_cpu_repro.py
Add simdlen parametrization to CPU test                                   

test/inductor/test_cpu_repro.py

  • Modified test loop to parametrize over simdlen values [None, 0] and
    dynamic values [True, False]
  • Wrapped test logic with config.patch to set cpp.simdlen configuration
+11/-10 
common_dtensor.py
Simplify distributed tensor backend detection                       

torch/testing/_internal/distributed/_tensor/common_dtensor.py

  • Removed import of ACCELERATOR_DIST_BACKENDS
  • Simplified GPU check to specifically look for "nccl" in backend string
  • Reordered backend initialization logic
+4/-8     
test_device_mesh.py
Remove HPU skip condition from device mesh test                   

test/distributed/test_device_mesh.py

  • Removed TEST_HPU from skip condition in test decorator
  • Updated skip message to only mention XPU
+2/-2     
post_grad.py
Add all-reduce bucketing pass configuration                           

torch/_inductor/fx_passes/post_grad.py

  • Added new bucketing pass for all-reduce operations when
    config.bucket_all_reduces_fx is enabled
  • Integrated bucketing logic with configurable bucket size determinator
+12/-0   
serialize.py
Add weights_only safety check to deserialization                 

torch/_export/serde/serialize.py

  • Modified deserialize_torch_artifact to first attempt loading with
    weights_only=True
  • Falls back to weights_only=False on exception with warning log
+11/-1   
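
The fallback behavior described above is, in essence, the following (a simplified sketch; the real function also handles the serialized-bytes plumbing and logging details):

```python
# Hedged sketch: try the safe weights_only load first, then fall back with a warning.
import io
import logging
import torch

log = logging.getLogger(__name__)

def load_artifact(serialized: bytes):
    try:
        return torch.load(io.BytesIO(serialized), weights_only=True)
    except Exception:
        log.warning("weights_only=True load failed; retrying with weights_only=False")
        return torch.load(io.BytesIO(serialized), weights_only=False)
```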
checkpoint.py
Implement lazy device type detection for checkpoint           

torch/utils/checkpoint.py

  • Changed _default_device_type initialization from hardcoded "cuda" to
    None
  • Added lazy initialization logic in get_device_type to detect device
    type on first call
+4/-1     
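
A simplified sketch of the lazy detection (names illustrative; the real logic lives in torch/utils/checkpoint.py):

```python
# Hedged sketch: resolve the default device type on first use instead of
# hardcoding "cuda" at import time.
import torch

_default_device_type = None

def get_device_type() -> str:
    global _default_device_type
    if _default_device_type is None:
        acc = torch.accelerator.current_accelerator()
        _default_device_type = acc.type if acc is not None else "cuda"
    return _default_device_type
```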
test_sparse.py
Enable sparse mm test on MPS device                                           

test/test_sparse.py

  • Removed @onlyCPU decorator from test_mm method
  • Added @dtypesIfMPS decorator with float32 and complex64 support
+1/-1     
test_opaque_obj_v2.py
Replace deprecated _check_is_size usage                                   

test/test_opaque_obj_v2.py

  • Replaced torch._check_is_size(u0) call with torch._check(u0 >= 0)
+1/-1     
python_variable.cpp
Refactor tensor Python object wrapping and lifecycle         

torch/csrc/autograd/python_variable.cpp

  • Added using torch::utils::PyObjectPreservation declaration
  • Refactored THPVariable_Wrap to use new THPVariable_WrapWithType
    template function
  • Simplified Python object lifecycle management using
    PyObjectPreservation utility
  • Removed complex resurrection and ownership tracking logic
  • Updated THPVariable_traverse and THPVariable_clear to simplified
    implementations
  • Removed THPVariable_NewWithVar function in favor of template-based
    approach
+160/-613
Storage.cpp
Refactor storage Python object lifecycle management           

torch/csrc/Storage.cpp

  • Added using torch::utils::PyObjectPreservation declaration
  • Refactored THPStorage_NewWithStorage to use
    PyObjectPreservation::init_fresh_nonatomic
  • Simplified THPStorage_Wrap to use new preservation utility
  • Removed complex preservation and ownership tracking logic
  • Removed THPStorageMetaType metaclass definition
  • Updated THPStorageType to use standard PyType_Type as metaclass
+48/-279
Blas.cpp
Refactor CUDA BLAS bias handling and dtype checks               

aten/src/ATen/native/cuda/Blas.cpp

  • Changed launchGemmAndBiasCublasLt to accept std::optional for bias
    parameter
  • Simplified bias pointer extraction logic
  • Refactored addmm_out_cuda_impl to compute use_bias_ptr_lt earlier and
    pass optional bias
  • Removed is_bmm parameter from baddbmm_bmm_out_dtype_checks function
  • Fixed _baddbmm_dtype_cuda to properly initialize output tensor and
    copy self
  • Improved dtype checking and validation in _addmm_dtype_out_cuda
+38/-45 
XPUCachingAllocator.cpp
Add XPU graph memory pool management                                         

c10/xpu/XPUCachingAllocator.cpp

  • Added forward declaration for XPUAllocator class
  • Added PrivatePool struct to manage memory pools for XPU graphs
  • Added MempoolIdHash hash function for mempool IDs
  • Enhanced BlockPool to track owner PrivatePool
  • Added graph pool management with graph_pools and graph_pools_freeable
    maps
  • Updated get_pool to support graph-specific memory pools
  • Enhanced release_cached_blocks to handle graph-specific pool cleanup
  • Added create_or_incref_pool and get_private_pool methods
  • Updated malloc and emptyCache to support mempool IDs
+153/-19
ScaledBlas.cpp
Add MXFP4 GPU support validation                                                 

aten/src/ATen/native/cuda/ScaledBlas.cpp

  • Added _check_mxfp4_support function to validate MXFP4 support on
    B200/B300 GPUs
  • Added device property check in _scaled_mxfp4_mxfp4 function
+14/-0   
static_cuda_launcher.cpp
Remove ROCm support from static CUDA launcher                       

torch/csrc/inductor/static_cuda_launcher.cpp

  • Changed preprocessor guard from USE_CUDA || USE_ROCM to USE_CUDA &&
    !USE_ROCM with explanatory comment
  • Removed all USE_ROCM conditional code blocks and HIP-specific includes
  • Simplified function implementations to use only CUDA driver APIs
+6/-96   
model_package_loader.cpp
Replace c10::filesystem with custom cross-platform file utilities

torch/csrc/inductor/aoti_package/model_package_loader.cpp

  • Removed dependency on c10::filesystem and replaced with custom
    implementations
  • Added file_exists(), recursive_mkdir(), and recursive_rmdir() helper
    functions
  • Added Windows-specific macros for access and F_OK
  • Updated file operations to use custom implementations instead of
    c10::filesystem
+115/-15
PyInterpreter.cpp
Simplify PyObject reference management in PyInterpreter   

torch/csrc/PyInterpreter.cpp

  • Simplified decref() signature by removing has_pyobj_slot parameter
  • Added new methods try_incref() and refcnt() to PyInterpreter interface
  • Removed complex PyObject resurrection logic from decref()
    implementation
  • Updated set_tensor_attr_with_capsule() and get_set_cached_attr() to
    use simplified PyObject access
+25/-52 
operator.cpp
Improve thread safety of operator registry access               

torch/csrc/jit/runtime/operator.cpp

  • Added getOperatorsWithLockHeld() private method for lock-protected
    operator retrieval
  • Changed getOperators() return type from reference to value (copy) for
    thread safety
  • Added getSortedOperators() method to centralize operator sorting logic
  • Updated getAllSortedOperatorsFor() to delegate to getSortedOperators()
+41/-29 
pyobject_preservation.cpp
Refactor PyObject preservation with atomic initialization

torch/csrc/utils/pyobject_preservation.cpp

  • Replaced clear_slots() implementation with new PyObjectPreservation
    class
  • Added init_fresh_nonatomic() method for initializing PyObject on fresh
    targets
  • Added init_once() method with atomic compare-exchange for thread-safe
    initialization
  • Implemented proper reference counting and memory ordering semantics
+62/-14 
Module.cpp
Simplify tensor PyObject management and remove MaybeOwned wrapper

torch/csrc/Module.cpp

  • Changed THPVariable.cdata from c10::MaybeOwned to at::Tensor
  • Simplified THPModule_swap_tensor_impl() to use local tensor copies
    instead of complex PyObject slot manipulation
  • Updated PyObject slot operations to use store_pyobj() instead of
    init_pyobj()
  • Added guard condition !defined(USE_ROCM) to StaticCudaLauncher
    initialization
+17/-26 
kernel.cpp
Add data pointer accessor functions to test extension       

test/cpp_extensions/libtorch_agnostic_2_10_extension/libtorch_agnostic_2_10/csrc/kernel.cpp

  • Added get_any_data_ptr() function to retrieve tensor data pointers
  • Added get_template_any_data_ptr() templated function with scalar type
    dispatch
  • Registered new functions with STABLE_TORCH_LIBRARY_FRAGMENT and
    STABLE_TORCH_LIBRARY_IMPL
+39/-0   
StorageImpl.cpp
Add PyObject reference management methods to StorageImpl 

c10/core/StorageImpl.cpp

  • Added incref_pyobject() method with acquire fence for proper memory
    ordering
  • Added decref_pyobject() method for PyObject reference management
  • Added try_incref_pyobject() method with interpreter availability check
+24/-0   
TensorImpl.cpp
Add PyObject reference management methods to TensorImpl   

c10/core/TensorImpl.cpp

  • Removed pyobj_slot_.maybe_destroy_pyobj() call from
    release_resources()
  • Added incref_pyobject() method with acquire fence for proper memory
    ordering
  • Added decref_pyobject() method for PyObject reference management
  • Added try_incref_pyobject() method with interpreter availability check
+24/-1   
jit_log.cpp
Replace filesystem utilities with string manipulation       

torch/csrc/jit/jit_log.cpp

  • Replaced c10::filesystem::path usage with custom string manipulation
  • Added manual filename extraction using StripBasename() and string
    operations
  • Updated is_enabled() and jit_log_prefix() to use new string utilities
+8/-3     
init.cpp
Update JIT Python bindings for thread-safe operator access

torch/csrc/jit/python/init.cpp

  • Changed const auto& to auto for operator retrieval in three locations
  • Updated code to work with copied operator vectors instead of
    references
+3/-3     
Logging.cpp
Remove filesystem dependency from logging utilities           

c10/util/Logging.cpp

  • Removed the c10 filesystem #include dependency
  • Replaced c10::filesystem::path(file).filename() with StripBasename()
    utility
+2/-2     
inline_container.cc
Allow multiple serialization of triton binary files           

caffe2/serialize/inline_container.cc

  • Added special handling for triton binary files (.so, .cubin, .hsaco)
  • Allow multiple writes for triton extensions with warning log
  • Maintain strict single-write assertion for other file types
+14/-2   
PyInterpreter.cpp
Update noop PyInterpreter for simplified interface             

c10/core/impl/PyInterpreter.cpp

  • Updated NoopPyInterpreterVTable::decref() signature to remove
    has_pyobj_slot parameter
  • Added try_incref() method returning false
  • Added refcnt() method with panic assertion
+9/-2     
jit_opt_limit.cpp
Replace filesystem utilities in JIT optimization limit     

torch/csrc/jit/jit_opt_limit.cpp

  • Replaced c10::filesystem::path with custom string utilities
  • Added the required #include directives for the string utilities
  • Used StripBasename() and ExcludeFileExtension() for path manipulation
+5/-2     
shim_common.cpp
Add data pointer accessor functions to AOTI shim                 

torch/csrc/shim_common.cpp

  • Added torch_get_const_data_ptr() function to retrieve const tensor
    data pointers
  • Added torch_get_mutable_data_ptr() function to retrieve mutable tensor
    data pointers
  • Both functions use exception-to-error-code conversion pattern
+18/-0   
Exception.cpp
Remove filesystem dependency from exception utilities       

c10/util/Exception.cpp

  • Removed the c10 filesystem #include dependency
  • Replaced c10::filesystem::path(file).filename() with
    detail::StripBasename()
+1/-2     
StorageMethods.cpp
Simplify Storage cdata assignment                                               

torch/csrc/StorageMethods.cpp

  • Simplified THPStorage__setCdata() by removing explicit destructor call
  • Changed cdata assignment from MaybeOwned::owned() to direct
    c10::Storage assignment
+2/-3     
input_buffer.cpp
Use new tensor stealability check in accumulation logic   

torch/csrc/autograd/input_buffer.cpp

  • Replaced at::caching::adjusted_use_count(v) == 1 with
    impl::is_tensor_stealable() call
  • Added parameter accounting for cached tensor status
+2/-2     
schema_matching.cpp
Update schema matching for thread-safe operator access     

torch/csrc/jit/frontend/schema_matching.cpp

  • Changed const auto& to auto for operator variant retrieval
  • Updated to work with copied operator vectors
+1/-1     
alias_analysis.cpp
Update alias analysis for thread-safe operator access       

torch/csrc/jit/ir/alias_analysis.cpp

  • Changed const auto& to auto for operator candidate retrieval
  • Updated to work with copied operator vectors
+1/-1     
symbolic_shape_registry.cpp
Update symbolic shape registry for thread-safe operator access

torch/csrc/jit/runtime/symbolic_shape_registry.cpp

  • Changed auto& to auto for inplace variant operator retrieval
  • Updated to work with copied operator vectors
+1/-1     
ir.cpp
Update IR node schema matching for thread-safe access       

torch/csrc/jit/ir/ir.cpp

  • Changed const auto& to auto for operator candidate retrieval
  • Updated to work with copied operator vectors
+1/-1     
LogAddExpKernel.cu
Add complex number support to logaddexp CUDA kernel           

aten/src/ATen/native/cuda/LogAddExpKernel.cu

  • Added complex number support to logaddexp kernel with specialized
    implementations
  • Added helper functions for complex min/max, exponential, and log
    operations
  • Implemented jiterator string for complex logaddexp computation
  • Added conditional compilation for jiterator vs fallback
    implementations
+234/-1 
KernelUtils.cuh
Remove ROCm-specific atomic add implementations                   

aten/src/ATen/native/cuda/KernelUtils.cuh

  • Removed ROCm-specific atomic add implementations for __hip_bfloat162
    and __half2
  • Simplified to use standard unsafeAtomicAdd for ROCm
+1/-59   
intrusive_ptr.h
Implement PyObject preservation in intrusive_ptr                 

c10/util/intrusive_ptr.h

  • Added kHasPyObject constant to track PyObject wrapper presence in
    refcount
  • Added has_pyobject() helper function to check PyObject bit
  • Added TargetTraits template for PyObject support configuration
  • Updated retain_() and reset_() to manage PyObject lifecycle with
    refcount transitions
  • Added is_uniquely_owned() method for stronger uniqueness check
  • Updated weak_intrusive_ptr::lock() with PyObject incref logic
  • Added incref() function with PyObject management
+136/-15
PyObjectSlot.h
Simplify PyObjectSlot interface and add atomic accessors 

c10/core/impl/PyObjectSlot.h

  • Simplified PyObjectSlot interface by removing complex check/init
    methods
  • Added load_pyobj() and store_pyobj() atomic accessor methods
  • Added has_unique_reference() method to check PyObject refcount
  • Removed ownership tagging and hermetic context logic
  • Changed pyobj_ to atomic for thread-safe access
+36/-95 
cpp_prefix.h
Improve scalar type extraction for Welford helper               

torch/csrc/inductor/cpp_prefix.h

  • Added GetScalarType template to extract scalar type from vectorized
    types
  • Updated WelfordHelper::weight_recps to use GetScalarType for proper
    type extraction
  • Removed if constexpr (IsVecType::value) conditional in
    welford_combine()
  • Unified welford combine logic for both scalar and vectorized types
+40/-30 
tensor_inl.h
Add typed data pointer accessors to stable Tensor API       

torch/csrc/stable/tensor_inl.h

  • Added templated mutable_data_ptr() and const_data_ptr() methods with
    scalar type checking
  • Guarded new methods with TORCH_FEATURE_VERSION >= TORCH_VERSION_2_10_0
  • Implemented type-safe data pointer casting for all scalar types
+33/-0   
PyInterpreter.h
Update PyInterpreter interface for simplified PyObject management

c10/core/impl/PyInterpreter.h

  • Updated decref() signature to remove has_pyobj_slot parameter
  • Added try_incref() method taking PyObjectSlot reference
  • Added refcnt() method to retrieve PyObject reference count
  • Added forward declaration for PyObjectSlot
+9/-3     
TensorImpl.h
Add PyObject lifecycle management to TensorImpl                   

c10/core/TensorImpl.h

  • Added incref_pyobject(), decref_pyobject(), and try_incref_pyobject()
    override methods
  • Added TargetTraits specialization for TensorImpl to enable PyObject
    support
+19/-0   
Bug fix
11 files
test_state_dict_utils.py
Revert to CUDA-specific device implementations                     

test/distributed/checkpoint/test_state_dict_utils.py

  • Reverted device-agnostic changes back to CUDA-specific implementations
  • Changed torch.accelerator.device_count() back to
    torch.cuda.device_count()
  • Replaced self.device_type references with hardcoded "cuda" strings
  • Updated device checks to use is_cuda property instead of device.type
    comparisons
  • Replaced torch.accelerator.synchronize() with torch.cuda.synchronize()
+17/-19 
test_static_cuda_launcher.py
Remove ROCm support and use CUDA-only cubin format             

test/inductor/test_static_cuda_launcher.py

  • Removed conditional logic for ROCm/HIP binary format (hsaco vs cubin)
  • Changed to use only cubin format for all kernels
  • Added @skipIfRocm decorator to all test methods
  • Simplified kernel assembly handling to only expect CUDA cubin format
+19/-2   
test_torch.py
Fix weak reference and storage lifecycle tests                     

test/test_torch.py

  • Updated test_storage_use_count() to expect 2 references instead of 1
    (accounting for wrapper)
  • Changed exception type in test_as_subclass() from RuntimeError to
    TypeError
  • Rewrote test_tensor_dead_weak_ref() to verify tensor stays alive via
    weak reference
  • Simplified test_storage_dead_weak_ref() to verify storage lifecycle
    with weak references
  • Added new test_storage_thread_safety() method for concurrent storage
    access validation
+38/-18 
common.py
Add weights_only parameter to torch.load call                       

benchmarks/dynamo/common.py

  • Added weights_only=False parameter to torch.load() call when loading
    saved model outputs
  • Ensures compatibility with newer PyTorch versions that default to
    weights-only loading
+3/-1     
test_fxir_backend.py
Fix fx_wrapper argument handling and add reshape tests     

test/inductor/test_fxir_backend.py

  • Fixed fx_wrapper mode to flatten arguments before passing to compiled
    module
  • Added two new test methods test_reshape_dynamic_ph and
    test_reshape_dynamic_tmd for dynamic reshape operations
+35/-1   
cudagraph_trees.py
Fix cudagraph reference counting logic                                     

torch/_inductor/cudagraph_trees.py

  • Fixed reference count checking logic in expired property
  • Updated to account for two additional references when extra_ref_check
    is set
  • Added assertion to ensure storage count is non-negative
  • Enhanced check_refcount to handle cached tensor outputs with multiple
    references
+16/-3   
compile_fx.py
Fix fx_wrapper input handling for symbolic scalars             

torch/_inductor/compile_fx.py

  • Added conditional logic to only replace non-tensor inputs with None
    when not using fx_wrapper mode
  • Added type check to ensure fake inputs are tensors before device
    validation
+9/-6     
test_codecache.py
Simplify and fix codecache test skip conditions                   

test/inductor/test_codecache.py

  • Simplified CUDA bfloat16 skip condition by removing HIP version check
  • Added skip condition for static CUDA launcher with ROCM
+3/-6     
ir.py
Fix outer reduction detection for zero strides                     

torch/_inductor/ir.py

  • Updated stride check logic to treat 0 stride as non-contiguous
  • Added comment explaining that 0 stride can occur when reduction ranges
    contain 1
+3/-1     
_op_schema.py
Fix typo in OpStrategy string representation                         

torch/distributed/tensor/_op_schema.py

  • Fixed typo in __str__ method from OpStragety to OpStrategy
+1/-1     
CublasHandlePool.cpp
Add thread-safe mutex protection to cuBLAS workspace management

aten/src/ATen/cuda/CublasHandlePool.cpp

  • Introduced WorkspaceMapWithMutex struct to wrap workspace map with
    std::shared_mutex for thread safety
  • Added setWorkspaceForHandle() function with double-checked locking
    pattern
  • Updated clearCublasWorkspaces() and getCUDABlasLtWorkspace() to use
    mutex-protected access
  • Refactored workspace allocation to separate fast and slow paths
+75/-20 
Documentation
2 files
__init__.py
Add deprecation annotation to _check_is_size                         

torch/__init__.py

  • Added import of deprecated from typing_extensions
  • Added deprecation decorator to _check_is_size function with removal
    notice
+9/-2     
symbolic_shapes.py
Add deprecation annotation to guard_size_oblivious             

torch/fx/experimental/symbolic_shapes.py

  • Added deprecation decorator to guard_size_oblivious function
  • Deprecation message directs users to use explicit unbacked handling
    alternatives
+4/-0     
Configuration changes
1 file
config.py
Add all-reduce bucketing configuration options                     

torch/_inductor/config.py

  • Added bucket_all_reduces_fx configuration option with values "none" or
    "all"
  • Added bucket_all_reduces_fx_bucket_size_determinator optional callable
    configuration
+4/-0     
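
A hedged usage sketch of the new options (the determinator's exact signature is an assumption based on this summary):

```python
# Hedged sketch: enable fx all-reduce bucketing via inductor config.
import torch
from torch._inductor import config as inductor_config

with inductor_config.patch(
    {
        "bucket_all_reduces_fx": "all",
        # assumed signature: bucket index -> bucket size
        "bucket_all_reduces_fx_bucket_size_determinator": lambda bucket_idx: 2**24,
    }
):
    compiled_fn = torch.compile(lambda x: x + 1)
```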
Formatting
1 file
kernel.cpp
Fix kernel definition indentation                                               

test/cpp_extensions/libtorch_agnostic_2_9_extension/libtorch_agnostic_2_9/csrc/kernel.cpp

  • Fixed indentation of m.def("test_default_constructor(bool undefined)
    -> bool") line
+1/-2     
Additional files
34 files
build_cpu.sh +2/-0     
audio.txt +1/-1     
xla.txt +1/-1     
TensorBase.h +3/-0     
CUDAContextLight.h +8/-2     
Repeat.mm +22/-24 
TensorCompare.mm +34/-20 
native_functions.yaml +1/-1     
CMakeLists.txt +1/-0     
valgrind.sup +7/-0     
SafePyObject.h +2/-2     
ScalarType.h +0/-7     
StorageImpl.h +20/-0   
PyObjectSlot.cpp +0/-56   
XPUCachingAllocator.h +1/-1     
Codegen.cmake +1/-6     
test_export.py +0/-4     
test_ck_backend.py +0/-1     
test_loop_ordering.py +0/-14   
xpu.txt +1/-1     
__init__.pyi.in +2/-3     
Storage.h +3/-5     
accumulate_grad.h +4/-2     
grad_layout_contract.h +3/-1     
wrap_outputs.h +4/-0     
variable.h +17/-1   
static_cuda_launcher.h +1/-1     
script_type_parser.cpp +0/-6     
operator.h +3/-2     
shim.h +11/-0   
tensor_struct.h +22/-0   
pyobject_preservation.h +25/-1   
runtime_assert.py +0/-11   
ScalarType.h +8/-0     

cyyever and others added 30 commits November 16, 2025 07:19
This PR outputs chars to streams without building temporary strings.
The changes were generated by running (in the fish shell)
```
sed  -i -e 's/<< "\([^\\\']\)"/<< \'\1\'/g' (grep '<< "."' -r torch c10 aten -l)
```
and then reverting some invalid changes.

Pull Request resolved: pytorch#167899
Approved by: https://github.com/Skylion007
# Description
Fixes pytorch#114850; this ports the test utils and schema check to Intel GPU.
We enable Intel GPU with the following methods while keeping the original code style as much as possible:

# Changes
1. Get the device type from the accelerator and the get_devtype helper method.
2. Replace the requires-CUDA statements with device_type.
3. Add HAS_XPU and HAS_GPU checks to replace some of the existing device checks.

# Notify

Pull Request resolved: pytorch#166684
Approved by: https://github.com/ezyang, https://github.com/guangyey

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Summary: This diff would be a follow-up diff for D85883723.

Test Plan:
See D86719598. We are now able to publish the model.

Unit test:
```
buck run fbcode//mode/opt -c remoteexecution.local=enabled fbcode//sigmoid/inference/test:test_passes -m ovr_config//triton:experimental -- -r test_triton_hop_cpu
```

Differential Revision: D87091238

Pull Request resolved: pytorch#167862
Approved by: https://github.com/XueningXu
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: pytorch#167916
Approved by: https://github.com/Skylion007
**Summary:**
Optimize the scalar welford_reduce implementation by combining the Welford algorithm with cascade summation to improve numerical stability. Specifically:

1. Use the Welford algorithm to compute the mean and variance.
2. Use cascade summation when computing the sums over the input for both the mean and the variance.

**Example:**
Take pytorch#141541 as an example:
```
import torch
import torch.nn as nn
torch.manual_seed(0)

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.gn = nn.GroupNorm(num_groups=32, num_channels=32)

    def forward(self, x):
        return self.gn(x)

model = Model().eval()
x = torch.randn(1, 32, 128, 128, 128)

with torch.no_grad():
    output = model(x)
    with torch._inductor.config.patch({"cpp.simdlen": 0}):
        c_model = torch.compile(model)
        c_output = c_model(x)

print(torch.max(torch.abs(output - c_output)))
print(torch.allclose(output, c_output, 1.3e-6, 1e-5))
```
**logs**

- before
```
tensor(0.0005)
False
```
- After
```
tensor(1.4305e-06)
True
```

**Generated code:**
- before
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['float*', 'float*', 'const float*', 'const float*', 'const float*', 'float*'], '''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void  kernel(float* in_out_ptr0,
                       float* in_out_ptr1,
                       const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr2)
{
    auto out_ptr1 = in_out_ptr0;
    auto out_ptr0 = in_out_ptr1;
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                Welford<float> tmp_acc0 = Welford<float>();
                Welford<float> tmp_acc0_arr[4];
                for (int i = 0; i < 4; i++)
                {
                    tmp_acc0_arr[i] = Welford<float>();
                }
                #pragma omp parallel num_threads(4)
                {
                    int tid = omp_get_thread_num();
                    Welford<float> tmp_acc0_local = Welford<float>();
                    #pragma omp for
                    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(1L))
                    {
                        {
                            {
                                auto tmp0 = in_ptr0[static_cast<int64_t>(x1 + 2097152L*x0)];
                                tmp_acc0_local = welford_combine(tmp_acc0_local, tmp0);
                            }
                        }
                    }
                    tmp_acc0_arr[tid] = tmp_acc0_local;
                }
                for (int tid = 0; tid < 4; tid++)
                {
                    tmp_acc0 = welford_combine(tmp_acc0, tmp_acc0_arr[tid]);
                }
                in_out_ptr1[static_cast<int64_t>(x0)] = tmp_acc0.mean;
                in_out_ptr0[static_cast<int64_t>(x0)] = tmp_acc0.m2;
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                {
                    auto tmp0 = out_ptr1[static_cast<int64_t>(x0)];
                    auto tmp6 = in_ptr1[static_cast<int64_t>(x0)];
                    auto tmp8 = out_ptr0[static_cast<int64_t>(x0)];
                    auto tmp11 = in_ptr2[static_cast<int64_t>(x0)];
                    auto tmp1 = static_cast<float>(2097152.0);
                    auto tmp2 = tmp0 / tmp1;
                    auto tmp3 = static_cast<float>(1e-05);
                    auto tmp4 = float(tmp2 + tmp3);
                    auto tmp5 = 1 / std::sqrt(tmp4);
                    auto tmp7 = float(tmp5 * tmp6);
                    auto tmp9 = decltype(tmp8)(-tmp8);
                    auto tmp10 = float(tmp9 * tmp7);
                    auto tmp12 = float(tmp10 + tmp11);
                    in_out_ptr0[static_cast<int64_t>(x0)] = tmp7;
                    in_out_ptr1[static_cast<int64_t>(x0)] = tmp12;
                }
            }
        }
    }
    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
            {
                #pragma GCC ivdep
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        {
                            auto tmp0 = in_ptr0[static_cast<int64_t>(x1 + 2097152L*x0)];
                            auto tmp1 = in_out_ptr0[static_cast<int64_t>(x0)];
                            auto tmp3 = in_out_ptr1[static_cast<int64_t>(x0)];
                            auto tmp2 = float(tmp0 * tmp1);
                            auto tmp4 = float(tmp2 + tmp3);
                            out_ptr2[static_cast<int64_t>(x1 + 2097152L*x0)] = tmp4;
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, arg1_1, arg2_1 = args
        args.clear()
        assert_size_stride(arg0_1, (32, ), (1, ))
        assert_size_stride(arg1_1, (32, ), (1, ))
        assert_size_stride(arg2_1, (1, 32, 128, 128, 128), (67108864, 2097152, 16384, 128, 1))
        buf0 = empty_strided_cpu((1, 32, 1, 1), (32, 1, 32, 32), torch.float32)
        buf1 = empty_strided_cpu((1, 32, 1, 1), (32, 1, 32, 32), torch.float32)
        buf3 = reinterpret_tensor(buf1, (1, 32, 1, 1), (32, 1, 1, 1), 0); del buf1  # reuse
        buf4 = reinterpret_tensor(buf0, (1, 32, 1, 1), (32, 1, 1, 1), 0); del buf0  # reuse
        buf5 = empty_strided_cpu((1, 32, 128, 128, 128), (67108864, 2097152, 16384, 128, 1), torch.float32)
        # [Provenance debug handles] cpp_fused_native_group_norm_0:1
        cpp_fused_native_group_norm_0(buf3, buf4, arg2_1, arg0_1, arg1_1, buf5)
        del arg0_1
        del arg1_1
        del arg2_1
        return (buf5, )
```

- After
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['float*', 'float*', 'const float*', 'const float*', 'const float*', 'float*'], '''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void  kernel(float* in_out_ptr0,
                       float* in_out_ptr1,
                       const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr2)
{
    auto out_ptr1 = in_out_ptr0;
    auto out_ptr0 = in_out_ptr1;
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                Welford<float> tmp_acc0 = Welford<float>();
                Welford<float> tmp_acc0_arr[4];
                for (int i = 0; i < 4; i++)
                {
                    tmp_acc0_arr[i] = Welford<float>();
                }
                #pragma omp parallel num_threads(4)
                {
                    int tid = omp_get_thread_num();
                    WelfordHelper<float, float, 4096> scalar_welford_helper0(static_cast<int64_t>(524288L));
                    Welford<float> tmp_acc0_local = Welford<float>();
                    #pragma omp for
                    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(1L))
                    {
                        {
                            {
                                auto tmp0 = in_ptr0[static_cast<int64_t>(x1 + 2097152L*x0)];
                                tmp_acc0_local = welford_combine(tmp_acc0_local, tmp0, &scalar_welford_helper0);
                            }
                        }
                    }
                    tmp_acc0_local = welford_combine(tmp_acc0_local, &scalar_welford_helper0);
                    tmp_acc0_arr[tid] = tmp_acc0_local;
                }
                for (int tid = 0; tid < 4; tid++)
                {
                    tmp_acc0 = welford_combine(tmp_acc0, tmp_acc0_arr[tid]);
                }
                in_out_ptr1[static_cast<int64_t>(x0)] = tmp_acc0.mean;
                in_out_ptr0[static_cast<int64_t>(x0)] = tmp_acc0.m2;
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                {
                    auto tmp0 = out_ptr1[static_cast<int64_t>(x0)];
                    auto tmp6 = in_ptr1[static_cast<int64_t>(x0)];
                    auto tmp8 = out_ptr0[static_cast<int64_t>(x0)];
                    auto tmp11 = in_ptr2[static_cast<int64_t>(x0)];
                    auto tmp1 = static_cast<float>(2097152.0);
                    auto tmp2 = tmp0 / tmp1;
                    auto tmp3 = static_cast<float>(1e-05);
                    auto tmp4 = float(tmp2 + tmp3);
                    auto tmp5 = 1 / std::sqrt(tmp4);
                    auto tmp7 = float(tmp5 * tmp6);
                    auto tmp9 = decltype(tmp8)(-tmp8);
                    auto tmp10 = float(tmp9 * tmp7);
                    auto tmp12 = float(tmp10 + tmp11);
                    in_out_ptr0[static_cast<int64_t>(x0)] = tmp7;
                    in_out_ptr1[static_cast<int64_t>(x0)] = tmp12;
                }
            }
        }
    }
    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
            {
                #pragma GCC ivdep
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        {
                            auto tmp0 = in_ptr0[static_cast<int64_t>(x1 + 2097152L*x0)];
                            auto tmp1 = in_out_ptr0[static_cast<int64_t>(x0)];
                            auto tmp3 = in_out_ptr1[static_cast<int64_t>(x0)];
                            auto tmp2 = float(tmp0 * tmp1);
                            auto tmp4 = float(tmp2 + tmp3);
                            out_ptr2[static_cast<int64_t>(x1 + 2097152L*x0)] = tmp4;
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, arg1_1, arg2_1 = args
        args.clear()
        assert_size_stride(arg0_1, (32, ), (1, ))
        assert_size_stride(arg1_1, (32, ), (1, ))
        assert_size_stride(arg2_1, (1, 32, 128, 128, 128), (67108864, 2097152, 16384, 128, 1))
        buf0 = empty_strided_cpu((1, 32, 1, 1), (32, 1, 32, 32), torch.float32)
        buf1 = empty_strided_cpu((1, 32, 1, 1), (32, 1, 32, 32), torch.float32)
        buf3 = reinterpret_tensor(buf1, (1, 32, 1, 1), (32, 1, 1, 1), 0); del buf1  # reuse
        buf4 = reinterpret_tensor(buf0, (1, 32, 1, 1), (32, 1, 1, 1), 0); del buf0  # reuse
        buf5 = empty_strided_cpu((1, 32, 128, 128, 128), (67108864, 2097152, 16384, 128, 1), torch.float32)
        # [Provenance debug handles] cpp_fused_native_group_norm_0:1
        cpp_fused_native_group_norm_0(buf3, buf4, arg2_1, arg0_1, arg1_1, buf5)
        del arg0_1
        del arg1_1
        del arg2_1
        return (buf5, )
```

Pull Request resolved: pytorch#162709
Approved by: https://github.com/CaoE, https://github.com/jansel
This PR fixes a bug where `torch.clamp` on MPS fails when min/max tensors have more dimensions than the input tensor.
CPU already supports this broadcasting, but MPS raised a RuntimeError.

Example of failing case before the fix:
```python
x = torch.randn(2, 3, device="mps")
min_t = torch.randn(1, 2, 3, device="mps")
max_t = torch.randn(1, 2, 3, device="mps")
torch.clamp(x, min=min_t, max=max_t)  # RuntimeError
```
After this fix, MPS matches CPU behavior.

Fixes pytorch#160734

Pull Request resolved: pytorch#165058
Approved by: https://github.com/malfet
The PR pytorch#167401 reminded me that the removal of the old NVTX interface is long overdue, as the header-only NVTX3 has been around for more than 5 years and ships with all CUDA Toolkit 12+ versions. In addition, `libnvToolsExt.so` was removed in CUDA Toolkit 13 and onward.

Pull Request resolved: pytorch#167637
Approved by: https://github.com/eqy
…device allocator (pytorch#166831)

This implements the MemPool plan for XPU, which is a dependency of [XPUGraph](pytorch#166285), following the [RFC](pytorch#162143).

- [ ] ->pytorch#166831
- [ ] pytorch#166833
- [ ] pytorch#166843

Pull Request resolved: pytorch#166831
Approved by: https://github.com/EikanWang, https://github.com/gujinghui

Co-authored-by: Eikan Wang <eikan.wang@intel.com>
…lasLtWorkspace" (pytorch#167928)

Summary:
getCurrentCUDABlasHandle() and getCUDABlasLtWorkspace() use static mutable maps that are not protected from concurrent reads and writes. This leads to crashes.

This diff adds mutexes to synchronize access to the static maps.
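
A minimal sketch of the locking pattern described here, written in Python for brevity (the real change is in C++; `allocate_workspace` and the map name are hypothetical): take the fast path only when the entry already exists, and re-check under the mutex before populating.

```python
import threading

_workspace_lock = threading.Lock()
_workspaces: dict = {}  # hypothetical stand-in for the static per-stream/per-device map

def get_workspace(key):
    ws = _workspaces.get(key)
    if ws is not None:
        return ws  # fast path: entry already populated
    with _workspace_lock:
        ws = _workspaces.get(key)  # re-check; another thread may have created it
        if ws is None:
            ws = allocate_workspace(key)  # hypothetical allocation helper
            _workspaces[key] = ws
        return ws
```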

Re-land context:

This is a re-land of pytorch#167248.

A few issues were addressed:
- fix for a bug in the fast path: premature return in getCurrentCUDABlasHandle()
- fix for test flakiness (pytorch#167884)

Test Plan:
1. regression tests:
buck2 test mode/opt //caffe2/test:test_transformers_cuda
https://www.internalfb.com/intern/testinfra/testrun/6192449759713581

2. Use a GPU OD, run multi-threaded tests with TSAN:

buck test fbcode//mode/dev-tsan fbcode//caffe2:cuda_cublas_handle_pool_test  -- --stress-runs 100
https://www.internalfb.com/intern/testinfra/testrun/14355223937501118

Differential Revision: D87111985

Pull Request resolved: pytorch#167928
Approved by: https://github.com/Skylion007
…rnels (pytorch#158250)

Co-authored-by: Nikhil Gupta [nikhil.gupta2@arm.com](mailto:nikhil.gupta2@arm.com)

This PR enables the use of KleidiAI INT4 kernels that directly produce BF16 outputs within PyTorch to boost LLM prefill & decode performance

**This change improves decode throughput by ~15% and reduces the memory required to run inference on the model by 50%.**

### Benchmark Setup
```
Model: meta-llama/Llama-3.1-8B
Test Platform: Neoverse V2
```
### Detailed Results

| Metric                           | With `--compile`         | Without `--compile`      |
|----------------------------------|---------------------------|---------------------------|
| Quantization Scheme              | INT4 symmetric channelwise | INT4 symmetric channelwise |
| Input Precision                  | BF16                      | BF16                      |
| Number of Layers Quantized       | 32                        | 32                        |
| Average Compression Ratio        | 87.49%                    | 87.49%                    |
| Total Quantization Time (s)      | 9.62                      | 10.32                     |
| Compile Time (First) (s)         | 134.48                    | 1.69                      |
| Compile Time (Second) (s)        | 80.44                     | 1.60                      |
| Compile Time (Subsequent) (s)    | 0.19                      | 0.22                      |
| Prefill Tokens                   | 54                        | 54                        |
| Decoded Tokens                   | 33                        | 33                        |
| Prefill Time (s)                 | 0.19                      | 0.22                      |
| Decode Time (s)                  | 0.76                      | 1.38                      |
| E2E Generation Time (s)          | 0.95                      | 1.60                      |
| Prefill Throughput (tokens/s)    | 288.13                    | 249.91                    |
| Decode Throughput (tokens/s)     | 43.42                     | 23.83                     |
Pull Request resolved: pytorch#158250
Approved by: https://github.com/malfet, https://github.com/aditew01, https://github.com/fadara01

Co-authored-by: Nikhil Gupta <nikhil.gupta2@arm.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: pytorch#167968
Approved by: https://github.com/pytorchbot
Update the torch-xpu-ops commit pin to [intel/torch-xpu-ops@1e69f4](intel/torch-xpu-ops@1e69f40), which includes:

- Add PTL in the default AOT target list for both Win and Lin
- Use PyTorch p2p API in Copy kernel
- Add event cache and event timing to XCCL
- Add Float8_e8m0fnu support for copy
- Add CMAKE_SYCL_COMPILER_LAUNCHER for sccache
Pull Request resolved: pytorch#167698
Approved by: https://github.com/EikanWang
Exposes `_inductor.config.bucket_all_reduces_fx`, mirroring the existing all_gather and reduce_scatter options, with "all" as the only supported value.
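
A minimal usage sketch, assuming the knob is set like the other inductor bucketing options:

```python
import torch._inductor.config as inductor_config

# "all" is the only supported value per this PR
inductor_config.bucket_all_reduces_fx = "all"
```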

Pull Request resolved: pytorch#167634
Approved by: https://github.com/eellison
Make the PyObject preservation scheme thread-safe with free threaded (nogil) Python. The general idea is:

* Python Tensor and Storage objects always hold a strong reference to their underlying c10 object
* c10 objects hold a strong reference to their Python objects if there's at least one other reference to the c10 object

This is implemented in `intrusive_ptr`:

* The topmost bit (`kHasPyObject`) of the weakref count is now used to indicate whether the `intrusive_ptr_target` has an associated PyObject. So `kHasPyObject` is one bit, the weakref count is now 31 bits, and the strong refcount remains 32 bits.
* When the reference count increases from one to two and `kHasPyObject` is set, we incref the associated Python object to ensure that it's kept alive.
* When the reference count decreases from two to one (i.e., there are no C++ references to the `intrusive_ptr_target` other than from the Python object), we decref the associated Python object to break the cycle.

Other benefits:

* We can delete a lot of the copypasta from Python internal `subtype_dealloc`
* This fixes the weakref and GC bugs we had in the previous scheme. Python weakrefs on Tensors and Storages should just work as expected now.
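
A rough illustration of that weakref behavior, essentially the scenario exercised by the `test_storage_dead_weak_ref` snippet quoted in a suggestion further down this thread:

```python
import weakref
import torch

x = torch.UntypedStorage(2)
w_x = weakref.ref(x)
y = torch.tensor(x)        # holds a C++ reference to the underlying StorageImpl
del x
assert w_x() is not None   # PyObject preserved while another owner exists
del y
assert w_x() is None       # last C++ reference gone; the Python object is collected
```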

Risks:

* Extra branch for reference count operations on `intrusive_ptr<TensorImpl>`, `intrusive_ptr<StorageImpl>`, and the generic `intrusive_ptr<intrusive_ptr_target>` even when we're not using Python.
* It's a big change

(Second attempt at pytorch#166342)

Pull Request resolved: pytorch#167564
Approved by: https://github.com/albanD, https://github.com/Skylion007
Previously we hard-failed if the pg was "gloo".
Now we fall back on hardcoded formulas.

Pull Request resolved: pytorch#167827
Approved by: https://github.com/eellison
pytorch#166044 removed openblas from the wheel dependency list for the AArch64+CPU build, so this PR adds it back. This only affects the CPU build, since AArch64+CUDA uses NVPL.

Pull Request resolved: pytorch#167841
Approved by: https://github.com/tinglvv, https://github.com/malfet
Use standard HIP headers for unsafeAtomicAdd. Removes copy/paste of unsafeAtomicAdd as "preview" implementation for gfx942.

Pull Request resolved: pytorch#167661
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
…rch#165067)"

This reverts commit 96a4c4b.

Reverted pytorch#165067 on behalf of https://github.com/jeanschmidt due to breaks internal tests see D87036515, @albanD please help the author get this PR merged ([comment](pytorch#165067 (comment)))
This reverts commit e20ca3b.

Reverted pytorch#167049 on behalf of https://github.com/jeanschmidt due to breaks internal tests see D87120562, @Skylion007 please help the author get this PR merged ([comment](pytorch#167049 (comment)))
This reverts commit 2245d7d.

Reverted pytorch#167899 on behalf of https://github.com/jeanschmidt due to need to revert in order to revert pytorch#167899 ([comment](pytorch#167899 (comment)))
This reverts commit deabb3e.

Reverted pytorch#167821 on behalf of https://github.com/jeanschmidt due to Breaks internal tests, see D87148810. @Skylion007 may you help the author to get this PR merged? ([comment](pytorch#167821 (comment)))
Alas, one cannot use `repeat_interleave_common` for MPS tensors, as `data_offset` is not a valid pointer to `id<MTLTensor>`.
On the other hand, one does not need to use `AT_DISPATCH_INDEX_TYPES`, as dispatching happens on the shader side.

Fixes pytorch#167924
Pull Request resolved: pytorch#167961
Approved by: https://github.com/manuelcandales
Summary:

MXFP4 unit tests pass on B200, fail on RTX 5090 - disable non-B200
cases.

Also fail with a NotImplementedError on non-B200 devices to avoid
unhelpful failure messages.

Test Plan:

```
pytest -sv -k "mxfp4" test/test_scaled_matmul_cuda.py
```

Reviewers:

@nWEIdia

Subscribers:

Tasks:

Fixes pytorch#167850

Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: pytorch#167857
Approved by: https://github.com/nWEIdia, https://github.com/malfet
Upgrade all the ROCm docker images to ROCm 7.1 release version.

Pull Request resolved: pytorch#166743
Approved by: https://github.com/atalman, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Prachi Gupta <prachi.gupta@amd.com>
…7860)

getAllOperatorsFor returns a const reference to internal state that is protected by a lock. Presuming that the lock is necessary in the first place (about which I offer no opinion because it's unclear to what extent the GIL should help here), this is a straightforward way to cause callers to create race conditions.

This should fix those race conditions by copying the state instead. I modified calling code to stop binding a const reference to the result for clarity.
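
A language-agnostic sketch of the fix pattern, shown in Python with a hypothetical registry: snapshot the protected state while holding the lock instead of handing out a reference to it.

```python
import threading

class OperatorRegistry:
    def __init__(self):
        self._lock = threading.Lock()
        self._ops_by_name: dict[str, list] = {}  # hypothetical internal state

    def get_all_operators_for(self, name: str) -> list:
        with self._lock:
            # Return a copy made under the lock; callers iterate a snapshot
            # instead of racing with concurrent registrations.
            return list(self._ops_by_name.get(name, ()))
```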

Differential Revision: [D87088731](https://our.internmc.facebook.com/intern/diff/D87088731/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D87088731/)!

Pull Request resolved: pytorch#167860
Approved by: https://github.com/zou3519
…ytorch#161728)

Resolves pytorch#161290

## Summary

Expands `dynamo/check_perf_csv.py` output capabilities with latency, compile time and memory information:

- Displays the measured speedup and the % difference from the target
- Added clear messaging when all model tests pass and no regression is found
- Added error handling for a missing CSV file

### Example (Failing Check)

```bash
python benchmarks/dynamo/check_perf_csv.py -f reports-dir/inductor_training_smoketest.csv -t 1.40
```

**Example Output:**
```
Checking inductor_training_smoketest.csv (speedup threshold >= 1.40x)
hf_Bert                            speedup=1.005x, latency=390.8 ms/iter, compile=1.526s, mem_ratio=1.02x (eager=360.6 GB, dynamo=369.3 GB)
Error 1 model(s) performance regressed
    hf_Bert
  - hf_Bert: 1.005x (< 1.40x; -28.2% from target)
```

### Example (Passing Check)

```bash
python benchmarks/dynamo/check_perf_csv.py -f reports-dir/inductor_training_smoketest.csv -t 1.00
```

**Example Output:**
```
Checking inductor_training_smoketest.csv (speedup threshold >= 1.00x)
hf_Bert                            speedup=1.005x, latency=390.8 ms/iter, compile=1.526s, mem_ratio=1.02x (eager=360.6 GB, dynamo=369.3 GB)
All 1 model(s) passed threshold check (>= 1.00x)
```
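
A rough sketch of the pass/fail logic implied by the outputs above (column names and message formats are assumptions, not the script's actual code):

```python
import csv

def check_perf_csv(path: str, threshold: float) -> int:
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    failed = []
    for row in rows:
        speedup = float(row["speedup"])  # assumed column name
        if speedup < threshold:
            pct = (speedup - threshold) / threshold * 100.0
            failed.append(f"{row['name']}: {speedup:.3f}x (< {threshold:.2f}x; {pct:.1f}% from target)")
    if failed:
        print(f"Error {len(failed)} model(s) performance regressed")
        for line in failed:
            print(f"  - {line}")
        return 1
    print(f"All {len(rows)} model(s) passed threshold check (>= {threshold:.2f}x)")
    return 0
```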

Pull Request resolved: pytorch#161728
Approved by: https://github.com/isuruf
pytorchmergebot and others added 18 commits November 17, 2025 17:59
This reverts commit 99fdca8.

Reverted pytorch#166492 on behalf of https://github.com/jeanschmidt due to Internally we still depends on the old logic, so we need to find a way to maintain backwards compatibility, for now ([comment](pytorch#166492 (comment)))
…orch::stable::Tensor. (pytorch#161891)

This ghstack is a prerequisite for porting torchaudio C++ extensions to use torch stable ABI, see pytorch/audio#4074, pytorch/audio#4075, pytorch/audio#4076, pytorch/audio#4077, pytorch/audio#4078

Pull Request resolved: pytorch#161891
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: pytorch#167772
The following tests are failing with Python 3.14 on Linux machines:

* TestSetAffinity::test_set_affinity_in_worker_init
    * Why? 3.14 makes `forkserver` the default start method for multiprocessing. With it, local functions are not picklable and the unit test fails (see the sketch after this list).
* TestIndividualWorkerQueue::test_ind_worker_queue
    * Why? The test was hitting a timeout. This is also related to the start method. I am increasing the timeout and reducing the batch-size iterations to reduce total unit-test time.
    * Fixes pytorch#68643
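
A minimal repro sketch of the pickling failure (an assumed shape of the problem; the real test is `TestSetAffinity::test_set_affinity_in_worker_init`): under the forkserver start method, a `worker_init_fn` defined as a local function cannot be pickled, so worker startup fails.

```python
import multiprocessing as mp
from torch.utils.data import DataLoader, Dataset

class TinyDataset(Dataset):
    def __len__(self):
        return 4
    def __getitem__(self, idx):
        return idx

def main():
    def worker_init(worker_id):  # local function: not picklable
        pass
    loader = DataLoader(TinyDataset(), num_workers=1, worker_init_fn=worker_init)
    next(iter(loader))           # raises a pickling error when the worker is spawned

if __name__ == "__main__":
    mp.set_start_method("forkserver", force=True)  # the 3.14 default on Linux
    main()
```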

Pull Request resolved: pytorch#167429
Approved by: https://github.com/aelavender, https://github.com/ramanishsingh
This reverts commit 77acc66.

Reverted pytorch#166743 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#166743 (comment)))
Not sure if the paths are already set up properly so that I can call 'benchmarks/dynamo/huggingface.py' directly in a unit test. Let's see what CI says.

Pull Request resolved: pytorch#167482
Approved by: https://github.com/v0i0, https://github.com/mlazos
Inductor may treat an outer reduction as an inner reduction when the reduction ranges contain a 1. This causes a weird issue where we skip fusing with a mix-order reduction. While I'm still debugging why that happens, I think we should fix the decision here anyway.

Pull Request resolved: pytorch#167697
Approved by: https://github.com/jansel, https://github.com/v0i0
Fixes pytorch#158429

Updated LogAddExpKernel.cu to allow complex numbers. Also updated the unit test to run test_logaddexp on CUDA with complex dtypes, and added a unit test in test_linalg.py to compare results between CUDA and CPU.
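
A hedged sketch of the behavior this enables, mirroring the CPU-vs-CUDA comparison described above (the sample values echo the new test quoted in a suggestion further down):

```python
import torch

re = torch.tensor([0.052, -0.2115, 0.6913], dtype=torch.float64)
im = torch.tensor([-0.3229, -0.8374, 0.8391], dtype=torch.float64)
a = torch.complex(re, im)
b = torch.complex(re.flip(0), im.flip(0))

cpu_out = torch.logaddexp(a, b)
cuda_out = torch.logaddexp(a.cuda(), b.cuda())
torch.testing.assert_close(cuda_out.cpu(), cpu_out)  # should agree up to machine precision
```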

@drisspg
Pull Request resolved: pytorch#163509
Approved by: https://github.com/isuruf
Enables `mm` with `out=` for sparse tensors.
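
A minimal usage sketch, assuming this refers to the `out=` overload of `torch.mm` with a sparse COO operand:

```python
import torch

a = torch.randn(4, 5).to_sparse()  # sparse COO operand
b = torch.randn(5, 3)
out = torch.empty(4, 3)
torch.mm(a, b, out=out)            # out= variant covered by this change
```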
Pull Request resolved: pytorch#167908
Approved by: https://github.com/malfet
…#167931)

Per title
1) allows the `self` argument to have the same precision as the output
2) fixes broadcasting of the `self` argument - it used to allocate an incorrectly sized output and resize it later, causing a warning in addmm and an error in baddbmm
3) fixes `out` handling for the baddbmm `out` overload, where the implementation used uninitialized memory in `out` instead of copying `self` into it (see the sketch after this list)
4) removes a couple of unneeded IIFE patterns
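
A hedged sketch of the shapes items 2 and 3 describe (CUDA device assumed so the cuBLAS path in question is exercised):

```python
import torch

m1 = torch.randn(8, 16, device="cuda")
m2 = torch.randn(16, 32, device="cuda")
bias = torch.randn(32, device="cuda")     # 1-D `self`, broadcast across rows (item 2)
y = torch.addmm(bias, m1, m2)

b1 = torch.randn(4, 8, 16, device="cuda")
b2 = torch.randn(4, 16, 32, device="cuda")
inp = torch.randn(8, 32, device="cuda")   # `self` broadcast over the batch dimension
out = torch.empty(4, 8, 32, device="cuda")
torch.baddbmm(inp, b1, b2, out=out)       # `out` overload copies `self` into `out` (item 3)
```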

Pull Request resolved: pytorch#167931
Approved by: https://github.com/PaulZhang12, https://github.com/drisspg, https://github.com/malfet
…idiAI kernels (pytorch#158250)"

This reverts commit 53809f9.

Reverted pytorch#158250 on behalf of https://github.com/zou3519 due to reverting to see if it fixes inductor halide test failure ([comment](pytorch#158250 (comment)))
Summary:
add support for symint placeholders

added two test cases with dynamic reshape
- dynamic info coming from tmd on placeholders
- dynamic info coming from placeholders (symints)

Test Plan:
test_reshape_dynamic_ph
test_reshape_dynamic_tmd

Differential Revision: D86984100

Pull Request resolved: pytorch#167757
Approved by: https://github.com/blaine-rister
…locate test into `TestSaveLoad` (pytorch#158247)

This is a follow-up to [pytorch#154333](pytorch#154333), where I initially introduced a fallback mechanism in deserialize_torch_artifact.

In this revised PR:

Cleaned up commit history for clarity and reproducibility.

Relocated the test into the TestSaveLoad class in test_serialize.py.

There were some issues with the last PR, so I opened this one.

The previous PR had inconsistencies due to local branch issues and was closed in favor of this cleaner submission.

Feedback is very welcome
Pull Request resolved: pytorch#158247
Approved by: https://github.com/angelayi
This reverts commit 99117c1.

Reverted pytorch#167637 on behalf of https://github.com/yangw-dev due to breaks internal build with torch/csrc/profiler/stubs/cuda.cpp:4:10: fatal error: 'nvtx3/nvtx3.hpp' file not found 4 | #include <nvtx3/nvtx3.hpp>, please find a meta fella to resolve this issue and try again, diff:[D87229660] ([comment](pytorch#167637 (comment)))
This reverts commit 7ede33b.

Reverted pytorch#167771 on behalf of https://github.com/eellison due to needs one fix ([comment](pytorch#167771 (comment)))
… used where needed"

Splits each torch library registration in the 2.10 folder into its own file -- I had a script that parsed kernel.cpp to do this but I felt like forcing this responsibility on the user might be less error prone

Compiles each file targeting 2.9 and asserts that compilation fails. (There are 2 2.9 kernels we use as negative tests where compilation is expected to succeed)




[ghstack-poisoned]
… used where needed"

Splits each torch library registration in the 2.10 folder into its own file -- I had a script that parsed kernel.cpp to do this but I felt like forcing this responsibility on the user might be less error prone

Compiles each file targeting 2.9 and asserts that compilation fails. (There are 2 2.9 kernels we use as negative tests where compilation is expected to succeed)




[ghstack-poisoned]
@qodo-merge-pro
Copy link

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified. No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
No auditing: New test helpers initialize NCCL process groups and manipulate GPU state without adding
any audit logging of critical actions, but as this is test code and may rely on external
logging, it requires human verification.

Referred Code
def _get_process_group_nccl(self):
    store = dist.FileStore(self.file_name, self.world_size)
    dist.init_process_group(
        backend="nccl",
        world_size=self.world_size,
        rank=self.rank,
        store=store,
    )
    return dist.distributed_c10d._get_default_group()

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Error handling: The refactor of memory coalescing scoring removes buffer-size bounding and broadcast
handling which could affect edge cases without adding explicit error or boundary checks,
but behavior may be validated elsewhere by tests.

Referred Code
if indirect_expr:
    continue

size = get_score(memory_expr, var_ranges)
if size == 0:
    continue

maybe_coalesced_var = find_coalesced_var(memory_expr, var_ranges)

byte_multipler = 0
for buf_name in buf_names:
    if buf := V.graph.try_get_buffer(buf_name):
        byte_multipler += buf.dtype.itemsize

# coalesced writes more important
byte_multipler *= 1 if is_read else 2

if maybe_coalesced_var:
    coalesced_by_var[maybe_coalesced_var] += size * byte_multipler
else:
    uncoalesced_addrs[memory_expr] += size * byte_multipler



 ... (clipped 1 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend
🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@bar-qodo
Copy link
Author

/agentic_review

@bar-qodo
Copy link
Author

@sentry review

@qodo-merge-pro
Copy link

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Restore check for squeezable tensors

Restore the check for squeezable 1D bias tensors in launchGemmAndBiasCublasLt to
fix a regression.

aten/src/ATen/native/cuda/Blas.cpp [307]

-const auto* self_ptr = self.has_value() ? self.value().const_data_ptr<scalar_t>() : static_cast<const scalar_t*>(nullptr);
+const auto* self_ptr = (self.has_value() && (self.value().dim() == 1 || self.value().squeeze().dim() == 1))
+                           ? self.value().const_data_ptr<scalar_t>()
+                           : static_cast<const scalar_t*>(nullptr);
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a functional regression where the check for a bias tensor being squeezable to 1D was lost, preventing certain valid bias shapes from using an optimized code path.

Medium
Use robust floating-point comparison

In test_logaddexp_cpu_vs_cuda_complex, replace self.assertEqual with
torch.testing.assert_close(..., equal_nan=True) for robustly comparing
floating-point results, including inf and NaN.

test/test_linalg.py [10075-10131]

 @onlyCUDA
 def test_logaddexp_cpu_vs_cuda_complex(self, device):
     # test logaddexp with complex values produce the same values (up to machine precision) on cpu and CUDA.
     input_real = torch.tensor([0.052, -0.2115, 0.6913], dtype=torch.float64)
     input_img = torch.tensor([-0.3229, -0.8374, 0.8391], dtype=torch.float64)
     input_complex = torch.complex(input_real, input_img).cuda()
 
     other_real = torch.tensor([0.2550, 0.8769, -0.4884], dtype=torch.float64)
     other_img = torch.tensor([0.6063, 0.4343, -1.4166], dtype=torch.float64)
     other_complex = torch.complex(other_real, other_img).cuda()
 
     out_gpu = torch.logaddexp(input=input_complex, other=other_complex)
     out_cpu = torch.logaddexp(input=input_complex.cpu(), other=other_complex.cpu())
 
     torch.testing.assert_close(out_gpu.cpu(), out_cpu, rtol=1e-12, atol=1e-14)
 
     # test extreme cases (infty, -infty, and nan) are handled the same between cuda and cpu
     input_complex = torch.complex(torch.tensor(float('inf')), torch.tensor(float('inf')))
     other_complex = torch.complex(torch.tensor(float('inf')), torch.tensor(float('inf')))
     out_gpu = torch.logaddexp(input=input_complex, other=other_complex)
     out_cpu = torch.logaddexp(input=input_complex.cpu(), other=other_complex.cpu())
-    self.assertEqual(out_gpu.cpu(), out_cpu)
+    torch.testing.assert_close(out_gpu.cpu(), out_cpu, equal_nan=True)
     ...

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that self.assertEqual is not robust for comparing floating-point numbers, especially NaN, and recommends the more appropriate torch.testing.assert_close(..., equal_nan=True).

Medium
Handle equality case in complex comparison

Improve the _logaddexp_minmax function by adding a tie-breaking rule based on
imaginary parts when real parts are equal to ensure deterministic behavior.

aten/src/ATen/native/cuda/LogAddExpKernel.cu [23-36]

 template <typename scalar_t, bool min>
 __host__ __device__ c10::complex<scalar_t> _logaddexp_minmax(const c10::complex<scalar_t>& x, const c10::complex<scalar_t>& y) {
   scalar_t xr = std::real(x);
   scalar_t yr = std::real(y);
   if (::isnan(yr) || (::isnan(std::imag(y)))) {
     return y;
   } else if (::isnan(xr) || (::isnan(std::imag(x)))) {
     return x;
-  } else if (min) { // min
-    return (xr < yr) ? x : y;
-  } else { // max
-    return (xr >= yr) ? x : y;
   }
+
+  if (xr != yr) {
+    if (min) {
+      return (xr < yr) ? x : y;
+    }
+    return (xr > yr) ? x : y;
+  }
+
+  // real parts are equal, break tie with imaginary parts
+  scalar_t xi = std::imag(x);
+  scalar_t yi = std::imag(y);
+  if (min) {
+    return (xi < yi) ? x : y;
+  }
+  return (xi > yi) ? x : y;
 }
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies that the comparison for complex numbers is not deterministic when real parts are equal. Adding a tie-breaker based on the imaginary part improves correctness and determinism.

Low
General
Use stat instead of lstat

Replace lstat with stat in the file_exists function to correctly handle symbolic
links by checking the target file's existence.

torch/csrc/inductor/aoti_package/model_package_loader.cpp [82-89]

 bool file_exists(const std::string& path) {
 #ifdef _WIN32
   return fs::exists(path);
 #else
   struct stat rc{};
-  return lstat(path.c_str(), &rc) == 0;
+  return stat(path.c_str(), &rc) == 0;
 #endif
 }

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7

__

Why: The suggestion correctly points out that using stat instead of lstat is more appropriate for checking file existence by following symbolic links, which prevents potential failures in subsequent file operations.

Medium
Improve weak reference storage test

Improve test_storage_dead_weak_ref by adding an assertion to verify the storage
object is accessible via the weak reference before the final strong reference is
deleted.

test/test_torch.py [10345-10352]

 @skipIfTorchDynamo("https://github.com/pytorch/torchdynamo/issues/1993")
 def test_storage_dead_weak_ref(self):
     x = torch.UntypedStorage(2)
     w_x = weakref.ref(x)
     y = torch.tensor(x)
     del x
+
+    # Check that the storage is still alive and accessible
+    storage_from_weak_ref = w_x()
+    self.assertIsNotNone(storage_from_weak_ref)
+    # Perform an operation to ensure it's a valid storage object
+    self.assertEqual(storage_from_weak_ref.size(), 2)
+    del storage_from_weak_ref
+
     self.assertIsNotNone(w_x())
     del y
     self.assertIsNone(w_x())

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 4

__

Why: The suggestion correctly proposes adding an assertion to verify the storage object is alive and accessible via the weak reference before the final strong reference is deleted, making the test more robust.

Low

@bar-qodo
Copy link
Author

@sentry review

@bar-qodo bar-qodo requested a review from ravidqodo November 19, 2025 18:04
@bar-qodo
Copy link
Author

@sentry review
