
@bar-qodo bar-qodo commented Nov 19, 2025

User description

Splits each torch library registration in the 2.10 folder into its own file. I had a script that parsed kernel.cpp to do this, but I felt that putting this responsibility on the user might be less error-prone.

Compiles each file targeting 2.9 and asserts that compilation fails. (There are two 2.9 kernels we use as negative tests where compilation is expected to succeed.)
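
As a rough illustration (not the actual harness in this PR), the per-file negative-compilation check could be driven by something like the sketch below; the helper name and build command are hypothetical:

```python
# Hedged sketch only: run one compile attempt per kernel file while targeting
# 2.9 and assert that it fails. build_cmd (e.g. a compiler invocation with the
# 2.9 TORCH_TARGET_VERSION define) is an assumption, not this PR's harness.
import subprocess
from pathlib import Path

def assert_fails_to_compile_for_2_9(source: Path, build_cmd: list[str]) -> None:
    result = subprocess.run([*build_cmd, str(source)], capture_output=True)
    assert result.returncode != 0, f"{source.name} unexpectedly compiled against 2.9"
```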

Stack from ghstack (oldest at bottom):


PR Type

Enhancement, Bug fix, Tests


Description

This is a large, multi-faceted PR that includes several major refactoring efforts and improvements across the PyTorch codebase:

PyObject Lifecycle Management Refactoring:

  • Simplified PyObject preservation and reference counting in intrusive_ptr, TensorImpl, and StorageImpl

  • Replaced complex MaybeOwned wrapper with direct tensor storage and atomic PyObject slot management

  • Added thread-safe PyObject initialization with atomic compare-exchange patterns

  • Removed resurrection logic and simplified Python object lifecycle tracking

Thread Safety Improvements:

  • Added mutex protection to cuBLAS workspace management with double-checked locking (a sketch of the pattern follows this list)

  • Improved JIT operator registry thread safety by returning copies instead of references

  • Enhanced PyInterpreter interface with try_incref() and refcnt() methods

  • Fixed cudagraph reference counting logic to account for multiple references
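
For reference, the double-checked locking mentioned above follows the standard pattern below; this is a minimal Python sketch of the idea only, since the actual change is C++ in aten/src/ATen/cuda/CublasHandlePool.cpp and all names here are illustrative:

```python
# Minimal double-checked locking sketch: check without the lock on the fast
# path, then re-check under the lock before creating the entry.
import threading

_workspaces: dict[int, bytearray] = {}
_lock = threading.Lock()

def get_workspace(key: int) -> bytearray:
    ws = _workspaces.get(key)            # fast path: entry already exists
    if ws is None:
        with _lock:
            ws = _workspaces.get(key)    # re-check under the lock
            if ws is None:
                ws = bytearray(1024)     # stand-in for allocating a workspace
                _workspaces[key] = ws
    return ws
```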

ROCm/HIP Removal:

  • Removed ROCm-specific code from static CUDA launcher, triton heuristics, and BLAS implementations

  • Simplified kernel binary format handling to CUDA-only (cubin)

  • Removed HIP-specific atomic add implementations and conditional compilation blocks

Device-Agnostic and Multi-Device Support:

  • Refactored distributed tests to use device-agnostic APIs and multi-device instantiation

  • Updated test utilities to support XPU alongside CUDA

  • Added device type detection and lazy initialization for checkpoint operations

  • Improved backend specification in distributed test decorators

Filesystem Dependency Removal:

  • Replaced c10::filesystem with custom cross-platform file utilities

  • Updated logging, exception handling, and JIT components to use string manipulation instead of filesystem APIs

Inductor and Compilation Improvements:

  • Simplified memory coalescing analysis by removing broadcast detection

  • Improved Welford reduction helper handling in C++ codegen

  • Added all-reduce bucketing pass configuration for distributed operations

  • Fixed fx_wrapper mode to properly handle symbolic scalars and flatten arguments

  • Added SIMD tiling score simplification

Numeric and Kernel Enhancements:

  • Added complex number support to logaddexp operations

  • Added MXFP4 GPU support validation for B200/B300 devices

  • Refactored CUDA BLAS bias handling with optional parameters

  • Added XPU graph memory pool management

Test Coverage Expansion:

  • Added complex number logaddexp CPU vs CUDA tests

  • Added thread safety tests for gradients and storage

  • Added run-to-run determinism tests for inductor models

  • Added data pointer accessor tests for stable ABI

  • Added MPS regression and broadcasting tests

  • Updated variable naming in dynamic shape and auto-functionalize tests

API Deprecations:

  • Added deprecation annotations to _check_is_size and guard_size_oblivious functions (see the sketch after this list)

  • Updated usages to use alternative APIs
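
A minimal sketch of the deprecation pattern described above, using typing_extensions.deprecated (the exact message text in the PR may differ):

```python
# Hedged sketch: mark a helper as deprecated while delegating to the
# recommended replacement.
from typing_extensions import deprecated
import torch

@deprecated("_check_is_size is deprecated; use torch._check(i >= 0) instead.")
def _check_is_size(i):
    torch._check(i >= 0)
```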

Configuration and Utilities:

  • Added bucket_all_reduces_fx configuration options for distributed operations

  • Enhanced performance CSV checking with detailed metrics

  • Added weights-only safety checks to model deserialization

  • Improved dataloader worker affinity testing


Diagram Walkthrough

flowchart LR
  A["PyObject Management<br/>Refactoring"] -->|Simplifies| B["TensorImpl &<br/>StorageImpl"]
  A -->|Adds atomic ops| C["PyObjectSlot"]
  D["Thread Safety<br/>Improvements"] -->|Protects| E["cuBLAS Workspace"]
  D -->|Secures| F["JIT Operator<br/>Registry"]
  G["ROCm/HIP<br/>Removal"] -->|Eliminates| H["HIP-specific Code"]
  G -->|Simplifies| I["CUDA Launcher"]
  J["Device-Agnostic<br/>Updates"] -->|Enables| K["Multi-Device<br/>Testing"]
  J -->|Supports| L["XPU Backend"]
  M["Filesystem<br/>Replacement"] -->|Removes| N["c10::filesystem<br/>Dependency"]
  O["Inductor<br/>Enhancements"] -->|Adds| P["All-Reduce<br/>Bucketing"]
  O -->|Improves| Q["Welford Helpers"]

File Walkthrough

Relevant files
Tests
18 files
test_dynamic_shapes.py
Update variable naming in dynamic shape test assertions   

test/test_dynamic_shapes.py

  • Updated variable naming in expected IR output strings to use
    simplified names (ge, ge_1, ge_2, etc.) instead of numbered suffixes
    (ge_1, ge_2, ge_3, etc.)
  • Changes reflect a renumbering scheme for generated intermediate
    variables in dynamic shape assertions
  • Multiple test assertions updated to match new variable naming patterns
+32/-32 
test_auto_functionalize.py
Update variable naming in auto-functionalize test outputs

test/inductor/test_auto_functionalize.py

  • Updated expected IR output strings to use simplified variable naming
    (ge instead of ge_1)
  • Changed intermediate variable references in assertion messages to
    match new naming scheme
  • Multiple test assertions updated for consistency with new variable
    naming patterns
+8/-8     
test_higher_order_ops.py
Reduce operation counts and remove size check operations 

test/dynamo/test_higher_order_ops.py

  • Reduced expected operation counts in dynamic shape tests (from 10 to
    9, 8 to 7, 17 to 15, 13 to 11)
  • Removed _check_is_size operation calls from expected IR output strings
  • Updated multiple test assertions to reflect fewer generated operations
+3/-8     
test_linalg.py
Add complex number logaddexp CPU vs CUDA test                       

test/test_linalg.py

  • Added new test method test_logaddexp_cpu_vs_cuda_complex() for complex
    number logaddexp operations
  • Tests logaddexp with complex values on CPU vs CUDA with various edge
    cases (infinity, NaN)
  • Validates that results are bitwise equivalent between CPU and GPU
    implementations
+59/-0   
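
A rough sketch of the kind of CPU-vs-CUDA comparison this test performs (the actual test covers more value combinations and asserts bitwise equality):

```python
# Hedged sketch: compare complex logaddexp results between CPU and CUDA,
# including inf/NaN edge cases. Requires a CUDA device.
import torch

def check_logaddexp_complex_cpu_vs_cuda():
    vals = torch.tensor(
        [0 + 0j, complex(float("inf"), 1.0), complex(float("nan"), 0.0)],
        dtype=torch.complex64,
    )
    a = vals.repeat(len(vals))                 # all pairs (a, b)
    b = vals.repeat_interleave(len(vals))
    cpu = torch.logaddexp(a, b)
    gpu = torch.logaddexp(a.cuda(), b.cuda()).cpu()
    torch.testing.assert_close(cpu, gpu, equal_nan=True)
```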
test_matmul_cuda.py
Expand addmm/baddmm tests with broadcast and output variants

test/test_matmul_cuda.py

  • Reduced parametrization ranges for N and batch_size parameters in
    test_addmm_baddmm_dtype_overload()
  • Added new parameters broadcast_self and high_precision_self to test
    method
  • Updated create_inputs() function to handle broadcast shapes for c
    tensor
  • Added tests for out variant of addmm and baddbmm operations
  • Enhanced test coverage for output tensor handling with different
    dtypes
+21/-7   
test_libtorch_agnostic.py
Add data pointer retrieval tests for stable ABI                   

test/cpp_extensions/test_libtorch_agnostic.py

  • Added get_supported_dtypes() function listing all supported dtypes for
    stable ABI
  • Added two new test methods: test_get_any_data_ptr() and
    test_get_template_any_data_ptr()
  • Tests validate data pointer retrieval with various dtypes and mutable
    flags
  • Added version check decorator @skipIfTorchVersionLessThan(2, 10) for
    new tests
+66/-0   
test_deterministic.py
Add run-to-run determinism test for inductor models           

test/inductor/test_deterministic.py

  • Added new test method test_run2run_determinism() with parametrization
    for model names, training/inference modes, and precision types
  • Tests run-to-run determinism for HuggingFace models using inductor
    backend
  • Validates bitwise equivalent results across multiple runs with
    deterministic mode enabled
  • Includes subprocess-based testing with environment variable
    configuration
+62/-0   
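
In spirit, the determinism check amounts to the sketch below; the actual test runs the models in subprocesses with environment-variable configuration:

```python
# Hedged sketch: run the same compiled model twice under deterministic mode and
# require identical outputs. Some ops may additionally need
# CUBLAS_WORKSPACE_CONFIG to be set for full determinism.
import torch

def check_run_to_run_determinism(model, example_inputs):
    torch.use_deterministic_algorithms(True)
    compiled = torch.compile(model)
    first = compiled(*example_inputs)
    second = compiled(*example_inputs)
    assert torch.equal(first, second), "outputs differ between runs"
```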
test_inductor_collectives.py
Add gloo backend NCCL estimator regression test                   

test/distributed/test_inductor_collectives.py

  • Removed custom _pass function for bucketing all-reduce operations
  • Added bucket_mode parameter to inductor config patch
  • Added new test method test_regression_use_nccl_estimate_with_gloo()
    for gloo backend compatibility
  • Added @requires_gloo() decorator to new test method
+46/-7   
test_mps.py
Add MPS regression and broadcasting tests                               

test/test_mps.py

+29/-1   
test_fake_distributed.py
Update fake distributed test expected output                         

test/dynamo/test_fake_distributed.py

  • Updated expected graph module output to reflect corrected variable
    naming
  • Changed variable names from ge_1, ge_3, ge_5 to ge, ge_1, ge_2 for
    consistency
+6/-6     
test_mix_order_reduction.py
Expand rms_norm_bwd test coverage with new shapes               

test/inductor/test_mix_order_reduction.py

  • Added new shape parameter (1000000, 256) to test_rms_norm_bwd test
  • Added add_1dim parametrization to test with additional dimension
  • Added resource optimization logic to skip non-critical tests
  • Modified test to conditionally reshape input tensor based on add_1dim
    parameter
+17/-2   
test_serialize.py
Add torch artifact deserialization test                                   

test/export/test_serialize.py

  • Added import for deserialize_torch_artifact function
  • Added new test test_deserialize_torch_artifact_dict to verify
    deserialization of dictionary objects
+11/-1   
test_autograd.py
Add gradient thread safety test                                                   

test/test_autograd.py

  • Added new test test_grad_thread_safety to verify thread-safe access to
    tensor gradients
  • Test uses ThreadPoolExecutor to concurrently access gradients and
    verify consistency
+28/-0   
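
The core idea of the test is roughly the following sketch (the real test checks additional invariants):

```python
# Hedged sketch: access a tensor's .grad concurrently from many threads and
# verify every thread observes the same gradient object.
from concurrent.futures import ThreadPoolExecutor
import torch

def check_grad_thread_safety():
    x = torch.randn(4, requires_grad=True)
    x.sum().backward()
    with ThreadPoolExecutor(max_workers=8) as pool:
        grads = list(pool.map(lambda _: x.grad, range(64)))
    assert all(g is grads[0] for g in grads)
```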
test_torchinductor.py
Add inner reduction detection test                                             

test/inductor/test_torchinductor.py

  • Added new test test_inner_reduction_detection to verify reduction hint
    detection
  • Test compiles function and checks for ReductionHint.OUTER in generated
    code
+15/-0   
test_custom_operators.cpp
Update tests for thread-safe operator registry changes     

test/cpp/jit/test_custom_operators.cpp

  • Changed all auto& references to auto for operator retrieval calls
  • Updated 6 test cases to work with copied operator vectors instead of
    references
+6/-7     
cuda_cublas_handle_pool_test.cpp
Add concurrent access test for cuBLAS handle pool               

aten/src/ATen/test/cuda_cublas_handle_pool_test.cpp

  • Added new concurrent stress test for cuBLAS handle pool and workspace
    management
  • Tests concurrent access from multiple threads with simultaneous
    workspace clearing
  • Verifies thread safety of getCurrentCUDABlasHandle() and
    getCUDABlasLtWorkspace()
+77/-0   
test_scalartype.cpp
Add test for quantized integer type detection                       

test/cpp/aoti_abi_check/test_scalartype.cpp

  • Added new test TestScalarType::isQIntType to verify quantized integer
    type detection
  • Tests both positive cases (QInt types) and negative cases (other
    scalar types)
+11/-0   
test_custom_ops.cpp
Update custom operator test for thread-safe registry         

test/custom_operator/test_custom_ops.cpp

  • Changed auto& to auto for operator retrieval from registry
  • Updated to work with copied operator vectors
+1/-1     
Enhancement
64 files
test_utils.py
Refactor tests for device-agnostic GPU support                     

test/test_utils.py

  • Replaced hardcoded CUDA device references with device-agnostic
    accelerator API calls
  • Added device_type variable using
    torch.accelerator.current_accelerator() for cross-device support
  • Replaced HAS_CUDA with TEST_GPU flag checking both XPU and CUDA
    availability
  • Updated test methods to use torch.get_device_module() and
    torch.accelerator APIs instead of torch.cuda directly
  • Modified device string formatting to use device_type variable for GPU
    tests
+51/-43 
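
The device-agnostic pattern these tests move to looks roughly like this (a sketch, not the exact test code):

```python
# Hedged sketch of the accelerator-based, device-agnostic pattern.
import torch

acc = torch.accelerator.current_accelerator()          # e.g. device("cuda") or device("xpu")
device_type = acc.type if acc is not None else "cpu"
device_module = torch.get_device_module(device_type)   # torch.cuda, torch.xpu, ...

if device_module.is_available():
    t = torch.ones(2, device=f"{device_type}:0")
```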
test_2d_composability.py
Simplify backend selection and fix decorator ordering       

test/distributed/_composable/test_composability/test_2d_composability.py

  • Removed curr_backend variable that was derived from
    dist.get_default_backend_for_device()
  • Updated backend property to use hardcoded backend strings based on
    TEST_XPU flag
  • Reordered decorator stacking for test methods (moved @with_comms
    before @skip_if_lt_x_gpu)
  • Simplified backend selection logic to conditionally return XPU or CUDA
    NCCL backends
+13/-14 
test_ddp_hooks.py
Migrate to MultiProcessTestCase with NCCL-specific setup 

test/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py

  • Changed base class from DistributedTestBase to MultiProcessTestCase
  • Added setUp() and tearDown() methods with process spawning and file
    cleanup
  • Implemented _get_process_group_nccl() method for NCCL process group
    initialization
  • Replaced @requires_accelerator_dist_backend() decorators with
    @requires_nccl()
  • Removed device-agnostic code and reverted to CUDA-specific
    implementations
  • Updated gpus_for_rank() function to use torch.cuda.device_count()
    directly
+38/-20 
test_c10d_object_collectives.py
Add device type instantiation for multi-device testing     

test/distributed/test_c10d_object_collectives.py

  • Added device type detection logic using TEST_HPU and TEST_CUDA flags
  • Replaced device-agnostic torch.accelerator calls with explicit device
    module selection
  • Added instantiate_device_type_tests() call to generate device-specific
    test variants
  • Updated test method signatures to accept device parameter
  • Modified with_comms decorator to pass device information to test
    methods
+31/-13 
tiling_utils.py
Simplify memory coalescing analysis and remove broadcast detection

torch/_inductor/tiling_utils.py

  • Removed find_broadcast_var() function that identified broadcast
    patterns in memory access
  • Removed try_get_buf_size() helper function for buffer size retrieval
  • Removed uncoalesced_addrs field from CoalesceVarAnalysis dataclass
  • Simplified get_score() function signature by removing buf_names
    parameter
  • Refactored memory coalescing analysis to remove buffer size
    constraints and broadcast variable handling
+13/-74 
test_pp_composability.py
Update backend requirements and GPU availability checks   

test/distributed/_composable/test_composability/test_pp_composability.py

  • Updated @requires_accelerator_dist_backend() decorators to specify
    backend list ["nccl", "xccl"]
  • Replaced at_least_x_gpu() checks with TEST_MULTIGPU and TEST_XPU flags
  • Imported TEST_MULTIGPU and TEST_XPU from common test utilities
  • Updated skip conditions to check for multi-GPU or XPU availability
+18/-9   
test_scaled_matmul_cuda.py
Add MXFP4 SM120+ device skip conditions                                   

test/test_scaled_matmul_cuda.py

  • Added SM120OrLater to imports from common CUDA utilities
  • Added skip conditions for MXFP4 tests on SM120+ devices (only
    supported on B200/B300)
  • Applied skip logic to three test methods:
    test_mxfp8_nvfp4_scaled_grouped_mm_2d_2d(),
    test_mxfp8_scaled_grouped_mm_2d_3d(), and
    test_blockwise_mxfp8_nvfp4_mxfp4_numerics()
+14/-0   
triton_heuristics.py
Remove fbcode and ROCm-specific logic from triton heuristics

torch/_inductor/runtime/triton_heuristics.py

  • Removed import of is_fbcode from torch._environment
  • Removed ROCm/HIP device type checks and binary format handling (hsaco
    vs cubin)
  • Changed device type validation to only accept CUDA devices
  • Removed conditional logic based on is_fbcode() for heuristic tuning
  • Simplified cubin path construction to always use .cubin extension
+9/-12   
check_perf_csv.py
Enhance performance CSV checking with detailed metrics     

benchmarks/dynamo/check_perf_csv.py

  • Added file existence check with error handling for missing CSV files
  • Enhanced output formatting to display detailed performance metrics
    (latency, compilation time, memory ratio)
  • Improved failure reporting with sorted list and percentage deviation
    from target
  • Added success message when all models pass threshold check
  • Fixed typo in help text ("multiple" to "multiply")
+41/-8   
static_cuda_launcher.py
Remove ROCm support from static CUDA launcher                       

torch/_inductor/runtime/static_cuda_launcher.py

  • Removed ROCm/HIP specific kernel ABI handling for hsaco binary format
  • Simplified kernel initialization to only handle CUDA cubin format
  • Removed is_rocm flag and associated conditional logic
  • Removed complex scratch space parameter handling specific to HIP
    kernel ABI
  • Simplified argument type handling for CUDA kernels only
+6/-49   
comm_analysis.py
Simplify NCCL estimator error handling and backend checks

torch/_inductor/comm_analysis.py

  • Removed try-except wrapper around NCCL time estimator calls
  • Moved backend support checks earlier in _nccl_estimate() function
  • Added explicit checks for fake backend and time estimate support
  • Removed conditional check for torch.distributed.is_nccl_available()
    before using estimator
  • Simplified error handling for NCCL estimator failures
+18/-19 
cpp.py
Improve welford reduction helper handling in C++ codegen 

torch/_inductor/codegen/cpp.py

  • Updated reduction_combine() to pass helper value to welford_combine()
    when available
  • Changed need_use_acc_helper() to always use helper for welford_reduce
    (removed scalar check)
  • Modified reduction code generation to use welford_helper_cse for
    welford reductions
  • Updated scalar helper initialization to include welford reductions
    alongside sum reductions
+24/-20 
test_dataloader.py
Simplify dataloader worker affinity testing                           

test/test_dataloader.py

  • Simplified test_ind_worker_queue() to use fixed batch sizes and worker
    counts
  • Removed CPU affinity detection logic and dynamic worker count
    calculation
  • Updated SetAffinityDataset to accept and store expected affinity value
  • Added _worker_set_affinity_init() function for worker initialization
  • Refactored affinity setting to pass expected value through dataset
    instead of worker function
+26/-33 
profiler.py
Add Python 3.2 compatibility and improve type annotations

torch/autograd/profiler.py

  • Added fallback implementation of ContextDecorator for Python < 3.2
    compatibility
  • Updated type annotations to use Optional[] instead of | union syntax
    for compatibility
  • Changed record_function base class to use _ContextDecorator with
    pyrefly ignore comment
  • Added type annotation comments for TorchScript compatibility
+32/-9   
test_zero_redundancy_optimizer.py
Refactor device type detection and determinism handling   

test/distributed/optim/test_zero_redundancy_optimizer.py

  • Removed unused contextmanager import from contextlib
  • Replaced custom deterministic_algorithms context manager with direct
    torch.use_deterministic_algorithms calls
  • Imported get_devtype from torch.testing._internal.common_fsdp
  • Simplified device type detection using get_devtype() instead of custom
    logic
+4/-13   
test_binary_ufuncs.py
Add torch.complex32 support to binary ufuncs                         

test/test_binary_ufuncs.py

  • Added torch.complex32 support to logaddexp and logaddexp2 operations
  • Updated dtype decorators to include torch.complex32 in CUDA tests
  • Added special handling for torch.complex32 in test helper functions
  • Removed expected failure skip for complex type promotion test
+21/-3   
common_methods_invocations.py
Update logaddexp dtype configuration and test skips           

torch/testing/_internal/common_methods_invocations.py

  • Updated logaddexp dtype support to include torch.complex32 for CUDA
  • Removed expected failure skip for complex type promotion test
  • Added test_python_ref_executor to expected failures for complex types
+8/-10   
test_c10d_functional_native.py
Refactor distributed test to use MultiProcessTestCase       

test/distributed/test_c10d_functional_native.py

  • Changed base class from DistributedTestBase to MultiProcessTestCase
  • Updated decorator to specify backends ["nccl", "xccl"]
  • Added setUp method to spawn processes
  • Replaced create_pg call with manual process group initialization using
    FileStore
+17/-4   
simd.py
Simplify SIMD tiling score calculation                                     

torch/_inductor/codegen/simd.py

  • Removed total_uncoalesced calculation and related penalty scoring
    logic
  • Simplified score_mod function to only consider tile size penalties
  • Removed uncoalesced memory penalty from tiling score calculation
+3/-12   
ops.py
Add data pointer accessor functions                                           

test/cpp_extensions/libtorch_agnostic_2_10_extension/libtorch_agnostic_2_10/ops.py

  • Added get_any_data_ptr function to return tensor data pointer value
  • Added get_template_any_data_ptr function for template-based data
    pointer retrieval with dtype checking
+26/-0   
test_cpu_repro.py
Add simdlen parametrization to CPU test                                   

test/inductor/test_cpu_repro.py

  • Modified test loop to parametrize over simdlen values [None, 0] and
    dynamic values [True, False]
  • Wrapped test logic with config.patch to set cpp.simdlen configuration
+11/-10 
common_dtensor.py
Simplify distributed tensor backend detection                       

torch/testing/_internal/distributed/_tensor/common_dtensor.py

  • Removed import of ACCELERATOR_DIST_BACKENDS
  • Simplified GPU check to specifically look for "nccl" in backend string
  • Reordered backend initialization logic
+4/-8     
test_device_mesh.py
Remove HPU skip condition from device mesh test                   

test/distributed/test_device_mesh.py

  • Removed TEST_HPU from skip condition in test decorator
  • Updated skip message to only mention XPU
+2/-2     
post_grad.py
Add all-reduce bucketing pass configuration                           

torch/_inductor/fx_passes/post_grad.py

  • Added new bucketing pass for all-reduce operations when
    config.bucket_all_reduces_fx is enabled
  • Integrated bucketing logic with configurable bucket size determinator
+12/-0   
serialize.py
Add weights_only safety check to deserialization                 

torch/_export/serde/serialize.py

  • Modified deserialize_torch_artifact to first attempt loading with
    weights_only=True
  • Falls back to weights_only=False on exception with warning log
+11/-1   
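
The fallback behavior described above is, in essence, the following (a simplified sketch; the real function also handles the serialized-bytes plumbing and logging details):

```python
# Hedged sketch: try the safe weights_only load first, then fall back with a warning.
import io
import logging
import torch

log = logging.getLogger(__name__)

def load_artifact(serialized: bytes):
    try:
        return torch.load(io.BytesIO(serialized), weights_only=True)
    except Exception:
        log.warning("weights_only=True load failed; retrying with weights_only=False")
        return torch.load(io.BytesIO(serialized), weights_only=False)
```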
checkpoint.py
Implement lazy device type detection for checkpoint           

torch/utils/checkpoint.py

  • Changed _default_device_type initialization from hardcoded "cuda" to
    None
  • Added lazy initialization logic in get_device_type to detect device
    type on first call
+4/-1     
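
A simplified sketch of the lazy detection (names illustrative; the real logic lives in torch/utils/checkpoint.py):

```python
# Hedged sketch: resolve the default device type on first use instead of
# hardcoding "cuda" at import time.
import torch

_default_device_type = None

def get_device_type() -> str:
    global _default_device_type
    if _default_device_type is None:
        acc = torch.accelerator.current_accelerator()
        _default_device_type = acc.type if acc is not None else "cuda"
    return _default_device_type
```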
test_sparse.py
Enable sparse mm test on MPS device                                           

test/test_sparse.py

  • Removed @onlyCPU decorator from test_mm method
  • Added @dtypesIfMPS decorator with float32 and complex64 support
+1/-1     
test_opaque_obj_v2.py
Replace deprecated _check_is_size usage                                   

test/test_opaque_obj_v2.py

  • Replaced torch._check_is_size(u0) call with torch._check(u0 >= 0)
+1/-1     
python_variable.cpp
Refactor tensor Python object wrapping and lifecycle         

torch/csrc/autograd/python_variable.cpp

  • Added using torch::utils::PyObjectPreservation declaration
  • Refactored THPVariable_Wrap to use new THPVariable_WrapWithType
    template function
  • Simplified Python object lifecycle management using
    PyObjectPreservation utility
  • Removed complex resurrection and ownership tracking logic
  • Updated THPVariable_traverse and THPVariable_clear to simplified
    implementations
  • Removed THPVariable_NewWithVar function in favor of template-based
    approach
+160/-613
Storage.cpp
Refactor storage Python object lifecycle management           

torch/csrc/Storage.cpp

  • Added using torch::utils::PyObjectPreservation declaration
  • Refactored THPStorage_NewWithStorage to use
    PyObjectPreservation::init_fresh_nonatomic
  • Simplified THPStorage_Wrap to use new preservation utility
  • Removed complex preservation and ownership tracking logic
  • Removed THPStorageMetaType metaclass definition
  • Updated THPStorageType to use standard PyType_Type as metaclass
+48/-279
Blas.cpp
Refactor CUDA BLAS bias handling and dtype checks               

aten/src/ATen/native/cuda/Blas.cpp

  • Changed launchGemmAndBiasCublasLt to accept std::optional for bias
    parameter
  • Simplified bias pointer extraction logic
  • Refactored addmm_out_cuda_impl to compute use_bias_ptr_lt earlier and
    pass optional bias
  • Removed is_bmm parameter from baddbmm_bmm_out_dtype_checks function
  • Fixed _baddbmm_dtype_cuda to properly initialize output tensor and
    copy self
  • Improved dtype checking and validation in _addmm_dtype_out_cuda
+38/-45 
XPUCachingAllocator.cpp
Add XPU graph memory pool management                                         

c10/xpu/XPUCachingAllocator.cpp

  • Added forward declaration for XPUAllocator class
  • Added PrivatePool struct to manage memory pools for XPU graphs
  • Added MempoolIdHash hash function for mempool IDs
  • Enhanced BlockPool to track owner PrivatePool
  • Added graph pool management with graph_pools and graph_pools_freeable
    maps
  • Updated get_pool to support graph-specific memory pools
  • Enhanced release_cached_blocks to handle graph-specific pool cleanup
  • Added create_or_incref_pool and get_private_pool methods
  • Updated malloc and emptyCache to support mempool IDs
+153/-19
ScaledBlas.cpp
Add MXFP4 GPU support validation                                                 

aten/src/ATen/native/cuda/ScaledBlas.cpp

  • Added _check_mxfp4_support function to validate MXFP4 support on
    B200/B300 GPUs
  • Added device property check in _scaled_mxfp4_mxfp4 function
+14/-0   
static_cuda_launcher.cpp
Remove ROCm support from static CUDA launcher                       

torch/csrc/inductor/static_cuda_launcher.cpp

  • Changed preprocessor guard from USE_CUDA || USE_ROCM to USE_CUDA &&
    !USE_ROCM with explanatory comment
  • Removed all USE_ROCM conditional code blocks and HIP-specific includes
  • Simplified function implementations to use only CUDA driver APIs
+6/-96   
model_package_loader.cpp
Replace c10::filesystem with custom cross-platform file utilities

torch/csrc/inductor/aoti_package/model_package_loader.cpp

  • Removed dependency on c10::filesystem and replaced with custom
    implementations
  • Added file_exists(), recursive_mkdir(), and recursive_rmdir() helper
    functions
  • Added Windows-specific macros for access and F_OK
  • Updated file operations to use custom implementations instead of
    c10::filesystem
+115/-15
PyInterpreter.cpp
Simplify PyObject reference management in PyInterpreter   

torch/csrc/PyInterpreter.cpp

  • Simplified decref() signature by removing has_pyobj_slot parameter
  • Added new methods try_incref() and refcnt() to PyInterpreter interface
  • Removed complex PyObject resurrection logic from decref()
    implementation
  • Updated set_tensor_attr_with_capsule() and get_set_cached_attr() to
    use simplified PyObject access
+25/-52 
operator.cpp
Improve thread safety of operator registry access               

torch/csrc/jit/runtime/operator.cpp

  • Added getOperatorsWithLockHeld() private method for lock-protected
    operator retrieval
  • Changed getOperators() return type from reference to value (copy) for
    thread safety
  • Added getSortedOperators() method to centralize operator sorting logic
  • Updated getAllSortedOperatorsFor() to delegate to getSortedOperators()
+41/-29 
pyobject_preservation.cpp
Refactor PyObject preservation with atomic initialization

torch/csrc/utils/pyobject_preservation.cpp

  • Replaced clear_slots() implementation with new PyObjectPreservation
    class
  • Added init_fresh_nonatomic() method for initializing PyObject on fresh
    targets
  • Added init_once() method with atomic compare-exchange for thread-safe
    initialization
  • Implemented proper reference counting and memory ordering semantics
+62/-14 
Module.cpp
Simplify tensor PyObject management and remove MaybeOwned wrapper

torch/csrc/Module.cpp

  • Changed THPVariable.cdata from c10::MaybeOwned to at::Tensor
  • Simplified THPModule_swap_tensor_impl() to use local tensor copies
    instead of complex PyObject slot manipulation
  • Updated PyObject slot operations to use store_pyobj() instead of
    init_pyobj()
  • Added guard condition !defined(USE_ROCM) to StaticCudaLauncher
    initialization
+17/-26 
kernel.cpp
Add data pointer accessor functions to test extension       

test/cpp_extensions/libtorch_agnostic_2_10_extension/libtorch_agnostic_2_10/csrc/kernel.cpp

  • Added get_any_data_ptr() function to retrieve tensor data pointers
  • Added get_template_any_data_ptr() templated function with scalar type
    dispatch
  • Registered new functions with STABLE_TORCH_LIBRARY_FRAGMENT and
    STABLE_TORCH_LIBRARY_IMPL
+39/-0   
StorageImpl.cpp
Add PyObject reference management methods to StorageImpl 

c10/core/StorageImpl.cpp

  • Added incref_pyobject() method with acquire fence for proper memory
    ordering
  • Added decref_pyobject() method for PyObject reference management
  • Added try_incref_pyobject() method with interpreter availability check
+24/-0   
TensorImpl.cpp
Add PyObject reference management methods to TensorImpl   

c10/core/TensorImpl.cpp

  • Removed pyobj_slot_.maybe_destroy_pyobj() call from
    release_resources()
  • Added incref_pyobject() method with acquire fence for proper memory
    ordering
  • Added decref_pyobject() method for PyObject reference management
  • Added try_incref_pyobject() method with interpreter availability check
+24/-1   
jit_log.cpp
Replace filesystem utilities with string manipulation       

torch/csrc/jit/jit_log.cpp

  • Replaced c10::filesystem::path usage with custom string manipulation
  • Added manual filename extraction using StripBasename() and string
    operations
  • Updated is_enabled() and jit_log_prefix() to use new string utilities
+8/-3     
init.cpp
Update JIT Python bindings for thread-safe operator access

torch/csrc/jit/python/init.cpp

  • Changed const auto& to auto for operator retrieval in three locations
  • Updated code to work with copied operator vectors instead of
    references
+3/-3     
Logging.cpp
Remove filesystem dependency from logging utilities           

c10/util/Logging.cpp

  • Removed the c10 filesystem #include dependency
  • Replaced c10::filesystem::path(file).filename() with StripBasename()
    utility
+2/-2     
inline_container.cc
Allow multiple serialization of triton binary files           

caffe2/serialize/inline_container.cc

  • Added special handling for triton binary files (.so, .cubin, .hsaco)
  • Allow multiple writes for triton extensions with warning log
  • Maintain strict single-write assertion for other file types
+14/-2   
PyInterpreter.cpp
Update noop PyInterpreter for simplified interface             

c10/core/impl/PyInterpreter.cpp

  • Updated NoopPyInterpreterVTable::decref() signature to remove
    has_pyobj_slot parameter
  • Added try_incref() method returning false
  • Added refcnt() method with panic assertion
+9/-2     
jit_opt_limit.cpp
Replace filesystem utilities in JIT optimization limit     

torch/csrc/jit/jit_opt_limit.cpp

  • Replaced c10::filesystem::path with custom string utilities
  • Added the required #include directives for the string utilities
  • Used StripBasename() and ExcludeFileExtension() for path manipulation
+5/-2     
shim_common.cpp
Add data pointer accessor functions to AOTI shim                 

torch/csrc/shim_common.cpp

  • Added torch_get_const_data_ptr() function to retrieve const tensor
    data pointers
  • Added torch_get_mutable_data_ptr() function to retrieve mutable tensor
    data pointers
  • Both functions use exception-to-error-code conversion pattern
+18/-0   
Exception.cpp
Remove filesystem dependency from exception utilities       

c10/util/Exception.cpp

  • Removed the c10 filesystem #include dependency
  • Replaced c10::filesystem::path(file).filename() with
    detail::StripBasename()
+1/-2     
StorageMethods.cpp
Simplify Storage cdata assignment                                               

torch/csrc/StorageMethods.cpp

  • Simplified THPStorage__setCdata() by removing explicit destructor call
  • Changed cdata assignment from MaybeOwned::owned() to direct
    c10::Storage assignment
+2/-3     
input_buffer.cpp
Use new tensor stealability check in accumulation logic   

torch/csrc/autograd/input_buffer.cpp

  • Replaced at::caching::adjusted_use_count(v) == 1 with
    impl::is_tensor_stealable() call
  • Added parameter accounting for cached tensor status
+2/-2     
schema_matching.cpp
Update schema matching for thread-safe operator access     

torch/csrc/jit/frontend/schema_matching.cpp

  • Changed const auto& to auto for operator variant retrieval
  • Updated to work with copied operator vectors
+1/-1     
alias_analysis.cpp
Update alias analysis for thread-safe operator access       

torch/csrc/jit/ir/alias_analysis.cpp

  • Changed const auto& to auto for operator candidate retrieval
  • Updated to work with copied operator vectors
+1/-1     
symbolic_shape_registry.cpp
Update symbolic shape registry for thread-safe operator access

torch/csrc/jit/runtime/symbolic_shape_registry.cpp

  • Changed auto& to auto for inplace variant operator retrieval
  • Updated to work with copied operator vectors
+1/-1     
ir.cpp
Update IR node schema matching for thread-safe access       

torch/csrc/jit/ir/ir.cpp

  • Changed const auto& to auto for operator candidate retrieval
  • Updated to work with copied operator vectors
+1/-1     
LogAddExpKernel.cu
Add complex number support to logaddexp CUDA kernel           

aten/src/ATen/native/cuda/LogAddExpKernel.cu

  • Added complex number support to logaddexp kernel with specialized
    implementations
  • Added helper functions for complex min/max, exponential, and log
    operations
  • Implemented jiterator string for complex logaddexp computation
  • Added conditional compilation for jiterator vs fallback
    implementations
+234/-1 
KernelUtils.cuh
Remove ROCm-specific atomic add implementations                   

aten/src/ATen/native/cuda/KernelUtils.cuh

  • Removed ROCm-specific atomic add implementations for __hip_bfloat162
    and __half2
  • Simplified to use standard unsafeAtomicAdd for ROCm
+1/-59   
intrusive_ptr.h
Implement PyObject preservation in intrusive_ptr                 

c10/util/intrusive_ptr.h

  • Added kHasPyObject constant to track PyObject wrapper presence in
    refcount
  • Added has_pyobject() helper function to check PyObject bit
  • Added TargetTraits template for PyObject support configuration
  • Updated retain_() and reset_() to manage PyObject lifecycle with
    refcount transitions
  • Added is_uniquely_owned() method for stronger uniqueness check
  • Updated weak_intrusive_ptr::lock() with PyObject incref logic
  • Added incref() function with PyObject management
+136/-15
PyObjectSlot.h
Simplify PyObjectSlot interface and add atomic accessors 

c10/core/impl/PyObjectSlot.h

  • Simplified PyObjectSlot interface by removing complex check/init
    methods
  • Added load_pyobj() and store_pyobj() atomic accessor methods
  • Added has_unique_reference() method to check PyObject refcount
  • Removed ownership tagging and hermetic context logic
  • Changed pyobj_ to atomic for thread-safe access
+36/-95 
cpp_prefix.h
Improve scalar type extraction for Welford helper               

torch/csrc/inductor/cpp_prefix.h

  • Added GetScalarType template to extract scalar type from vectorized
    types
  • Updated WelfordHelper::weight_recps to use GetScalarType for proper
    type extraction
  • Removed if constexpr (IsVecType::value) conditional in
    welford_combine()
  • Unified welford combine logic for both scalar and vectorized types
+40/-30 
tensor_inl.h
Add typed data pointer accessors to stable Tensor API       

torch/csrc/stable/tensor_inl.h

  • Added templated mutable_data_ptr() and const_data_ptr() methods with
    scalar type checking
  • Guarded new methods with TORCH_FEATURE_VERSION >= TORCH_VERSION_2_10_0
  • Implemented type-safe data pointer casting for all scalar types
+33/-0   
PyInterpreter.h
Update PyInterpreter interface for simplified PyObject management

c10/core/impl/PyInterpreter.h

  • Updated decref() signature to remove has_pyobj_slot parameter
  • Added try_incref() method taking PyObjectSlot reference
  • Added refcnt() method to retrieve PyObject reference count
  • Added forward declaration for PyObjectSlot
+9/-3     
TensorImpl.h
Add PyObject lifecycle management to TensorImpl                   

c10/core/TensorImpl.h

  • Added incref_pyobject(), decref_pyobject(), and try_incref_pyobject()
    override methods
  • Added TargetTraits specialization for TensorImpl to enable PyObject
    support
+19/-0   
Bug fix
11 files
test_state_dict_utils.py
Revert to CUDA-specific device implementations                     

test/distributed/checkpoint/test_state_dict_utils.py

  • Reverted device-agnostic changes back to CUDA-specific implementations
  • Changed torch.accelerator.device_count() back to
    torch.cuda.device_count()
  • Replaced self.device_type references with hardcoded "cuda" strings
  • Updated device checks to use is_cuda property instead of device.type
    comparisons
  • Replaced torch.accelerator.synchronize() with torch.cuda.synchronize()
+17/-19 
test_static_cuda_launcher.py
Remove ROCm support and use CUDA-only cubin format             

test/inductor/test_static_cuda_launcher.py

  • Removed conditional logic for ROCm/HIP binary format (hsaco vs cubin)
  • Changed to use only cubin format for all kernels
  • Added @skipIfRocm decorator to all test methods
  • Simplified kernel assembly handling to only expect CUDA cubin format
+19/-2   
test_torch.py
Fix weak reference and storage lifecycle tests                     

test/test_torch.py

  • Updated test_storage_use_count() to expect 2 references instead of 1
    (accounting for wrapper)
  • Changed exception type in test_as_subclass() from RuntimeError to
    TypeError
  • Rewrote test_tensor_dead_weak_ref() to verify tensor stays alive via
    weak reference
  • Simplified test_storage_dead_weak_ref() to verify storage lifecycle
    with weak references
  • Added new test_storage_thread_safety() method for concurrent storage
    access validation
+38/-18 
common.py
Add weights_only parameter to torch.load call                       

benchmarks/dynamo/common.py

  • Added weights_only=False parameter to torch.load() call when loading
    saved model outputs
  • Ensures compatibility with newer PyTorch versions that default to
    weights-only loading
+3/-1     
test_fxir_backend.py
Fix fx_wrapper argument handling and add reshape tests     

test/inductor/test_fxir_backend.py

  • Fixed fx_wrapper mode to flatten arguments before passing to compiled
    module
  • Added two new test methods test_reshape_dynamic_ph and
    test_reshape_dynamic_tmd for dynamic reshape operations
+35/-1   
cudagraph_trees.py
Fix cudagraph reference counting logic                                     

torch/_inductor/cudagraph_trees.py

  • Fixed reference count checking logic in expired property
  • Updated to account for two additional references when extra_ref_check
    is set
  • Added assertion to ensure storage count is non-negative
  • Enhanced check_refcount to handle cached tensor outputs with multiple
    references
+16/-3   
compile_fx.py
Fix fx_wrapper input handling for symbolic scalars             

torch/_inductor/compile_fx.py

  • Added conditional logic to only replace non-tensor inputs with None
    when not using fx_wrapper mode
  • Added type check to ensure fake inputs are tensors before device
    validation
+9/-6     
test_codecache.py
Simplify and fix codecache test skip conditions                   

test/inductor/test_codecache.py

  • Simplified CUDA bfloat16 skip condition by removing HIP version check
  • Added skip condition for static CUDA launcher with ROCM
+3/-6     
ir.py
Fix outer reduction detection for zero strides                     

torch/_inductor/ir.py

  • Updated stride check logic to treat 0 stride as non-contiguous
  • Added comment explaining that 0 stride can occur when reduction ranges
    contain 1
+3/-1     
_op_schema.py
Fix typo in OpStrategy string representation                         

torch/distributed/tensor/_op_schema.py

  • Fixed typo in __str__ method from OpStragety to OpStrategy
+1/-1     
CublasHandlePool.cpp
Add thread-safe mutex protection to cuBLAS workspace management

aten/src/ATen/cuda/CublasHandlePool.cpp

  • Introduced WorkspaceMapWithMutex struct to wrap workspace map with
    std::shared_mutex for thread safety
  • Added setWorkspaceForHandle() function with double-checked locking
    pattern
  • Updated clearCublasWorkspaces() and getCUDABlasLtWorkspace() to use
    mutex-protected access
  • Refactored workspace allocation to separate fast and slow paths
+75/-20 
Documentation
2 files
__init__.py
Add deprecation annotation to _check_is_size                         

torch/__init__.py

  • Added import of deprecated from typing_extensions
  • Added deprecation decorator to _check_is_size function with removal
    notice
+9/-2     
symbolic_shapes.py
Add deprecation annotation to guard_size_oblivious             

torch/fx/experimental/symbolic_shapes.py

  • Added deprecation decorator to guard_size_oblivious function
  • Deprecation message directs users to use explicit unbacked handling
    alternatives
+4/-0     
Configuration changes
1 file
config.py
Add all-reduce bucketing configuration options                     

torch/_inductor/config.py

  • Added bucket_all_reduces_fx configuration option with values "none" or
    "all"
  • Added bucket_all_reduces_fx_bucket_size_determinator optional callable
    configuration
+4/-0     
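
A hedged usage sketch of the new options (the determinator's exact signature is an assumption based on this summary):

```python
# Hedged sketch: enable fx all-reduce bucketing via inductor config.
import torch
from torch._inductor import config as inductor_config

with inductor_config.patch(
    {
        "bucket_all_reduces_fx": "all",
        # assumed signature: bucket index -> bucket size
        "bucket_all_reduces_fx_bucket_size_determinator": lambda bucket_idx: 2**24,
    }
):
    compiled_fn = torch.compile(lambda x: x + 1)
```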
Formatting
1 file
kernel.cpp
Fix kernel definition indentation                                               

test/cpp_extensions/libtorch_agnostic_2_9_extension/libtorch_agnostic_2_9/csrc/kernel.cpp

  • Fixed indentation of m.def("test_default_constructor(bool undefined)
    -> bool") line
+1/-2     
Additional files
34 files
build_cpu.sh +2/-0     
audio.txt +1/-1     
xla.txt +1/-1     
TensorBase.h +3/-0     
CUDAContextLight.h +8/-2     
Repeat.mm +22/-24 
TensorCompare.mm +34/-20 
native_functions.yaml +1/-1     
CMakeLists.txt +1/-0     
valgrind.sup +7/-0     
SafePyObject.h +2/-2     
ScalarType.h +0/-7     
StorageImpl.h +20/-0   
PyObjectSlot.cpp +0/-56   
XPUCachingAllocator.h +1/-1     
Codegen.cmake +1/-6     
test_export.py +0/-4     
test_ck_backend.py +0/-1     
test_loop_ordering.py +0/-14   
xpu.txt +1/-1     
__init__.pyi.in +2/-3     
Storage.h +3/-5     
accumulate_grad.h +4/-2     
grad_layout_contract.h +3/-1     
wrap_outputs.h +4/-0     
variable.h +17/-1   
static_cuda_launcher.h +1/-1     
script_type_parser.cpp +0/-6     
operator.h +3/-2     
shim.h +11/-0   
tensor_struct.h +22/-0   
pyobject_preservation.h +25/-1   
runtime_assert.py +0/-11   
ScalarType.h +8/-0     

cyyever and others added 30 commits November 16, 2025 07:19
This PR outputs chars to streams without building temporary strings.
The changes were generated by running (in the fish shell)
```
sed  -i -e 's/<< "\([^\\\']\)"/<< \'\1\'/g' (grep '<< "."' -r torch c10 aten -l)
```
and then reverting some invalid changes.

Pull Request resolved: pytorch#167899
Approved by: https://github.com/Skylion007
# Description
Fixes pytorch#114850; this ports the test utils and schema check to Intel GPU.
We enable Intel GPU with the following methods while keeping the original code style as much as possible:

# Changes
1. Get the device type from the accelerator and the get_devtype helper method.
2. Replace the requires-CUDA statements with device_type.
3. Add HAS_XPU and HAS_GPU checks to replace some of the existing device checks.

# Notify

Pull Request resolved: pytorch#166684
Approved by: https://github.com/ezyang, https://github.com/guangyey

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Summary: This diff would be a follow-up diff for D85883723.

Test Plan:
See D86719598. We are now able to publish the model.

Unit test:
```
buck run fbcode//mode/opt -c remoteexecution.local=enabled fbcode//sigmoid/inference/test:test_passes -m ovr_config//triton:experimental -- -r test_triton_hop_cpu
```

Differential Revision: D87091238

Pull Request resolved: pytorch#167862
Approved by: https://github.com/XueningXu
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: pytorch#167916
Approved by: https://github.com/Skylion007
**Summary:**
Optimize the scalar welford_reduce implementation by combining the Welford algorithm with cascade summation to improve numerical stability. Specifically:

1. Use the Welford algorithm to compute the mean and variance.
2. Use cascade summation when computing the sums over the input for both the mean and the variance.

**Example:**
Take pytorch#141541 as an example:
```
import torch
import torch.nn as nn
torch.manual_seed(0)

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.gn = nn.GroupNorm(num_groups=32, num_channels=32)

    def forward(self, x):
        return self.gn(x)

model = Model().eval()
x = torch.randn(1, 32, 128, 128, 128)

with torch.no_grad():
    output = model(x)
    with torch._inductor.config.patch({"cpp.simdlen": 0}):
        c_model = torch.compile(model)
        c_output = c_model(x)

print(torch.max(torch.abs(output - c_output)))
print(torch.allclose(output, c_output, 1.3e-6, 1e-5))
```
**logs**

- before
```
tensor(0.0005)
False
```
- After
```
tensor(1.4305e-06)
True
```

**Generated code:**
- before
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['float*', 'float*', 'const float*', 'const float*', 'const float*', 'float*'], '''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void  kernel(float* in_out_ptr0,
                       float* in_out_ptr1,
                       const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr2)
{
    auto out_ptr1 = in_out_ptr0;
    auto out_ptr0 = in_out_ptr1;
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                Welford<float> tmp_acc0 = Welford<float>();
                Welford<float> tmp_acc0_arr[4];
                for (int i = 0; i < 4; i++)
                {
                    tmp_acc0_arr[i] = Welford<float>();
                }
                #pragma omp parallel num_threads(4)
                {
                    int tid = omp_get_thread_num();
                    Welford<float> tmp_acc0_local = Welford<float>();
                    #pragma omp for
                    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(1L))
                    {
                        {
                            {
                                auto tmp0 = in_ptr0[static_cast<int64_t>(x1 + 2097152L*x0)];
                                tmp_acc0_local = welford_combine(tmp_acc0_local, tmp0);
                            }
                        }
                    }
                    tmp_acc0_arr[tid] = tmp_acc0_local;
                }
                for (int tid = 0; tid < 4; tid++)
                {
                    tmp_acc0 = welford_combine(tmp_acc0, tmp_acc0_arr[tid]);
                }
                in_out_ptr1[static_cast<int64_t>(x0)] = tmp_acc0.mean;
                in_out_ptr0[static_cast<int64_t>(x0)] = tmp_acc0.m2;
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                {
                    auto tmp0 = out_ptr1[static_cast<int64_t>(x0)];
                    auto tmp6 = in_ptr1[static_cast<int64_t>(x0)];
                    auto tmp8 = out_ptr0[static_cast<int64_t>(x0)];
                    auto tmp11 = in_ptr2[static_cast<int64_t>(x0)];
                    auto tmp1 = static_cast<float>(2097152.0);
                    auto tmp2 = tmp0 / tmp1;
                    auto tmp3 = static_cast<float>(1e-05);
                    auto tmp4 = float(tmp2 + tmp3);
                    auto tmp5 = 1 / std::sqrt(tmp4);
                    auto tmp7 = float(tmp5 * tmp6);
                    auto tmp9 = decltype(tmp8)(-tmp8);
                    auto tmp10 = float(tmp9 * tmp7);
                    auto tmp12 = float(tmp10 + tmp11);
                    in_out_ptr0[static_cast<int64_t>(x0)] = tmp7;
                    in_out_ptr1[static_cast<int64_t>(x0)] = tmp12;
                }
            }
        }
    }
    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
            {
                #pragma GCC ivdep
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        {
                            auto tmp0 = in_ptr0[static_cast<int64_t>(x1 + 2097152L*x0)];
                            auto tmp1 = in_out_ptr0[static_cast<int64_t>(x0)];
                            auto tmp3 = in_out_ptr1[static_cast<int64_t>(x0)];
                            auto tmp2 = float(tmp0 * tmp1);
                            auto tmp4 = float(tmp2 + tmp3);
                            out_ptr2[static_cast<int64_t>(x1 + 2097152L*x0)] = tmp4;
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, arg1_1, arg2_1 = args
        args.clear()
        assert_size_stride(arg0_1, (32, ), (1, ))
        assert_size_stride(arg1_1, (32, ), (1, ))
        assert_size_stride(arg2_1, (1, 32, 128, 128, 128), (67108864, 2097152, 16384, 128, 1))
        buf0 = empty_strided_cpu((1, 32, 1, 1), (32, 1, 32, 32), torch.float32)
        buf1 = empty_strided_cpu((1, 32, 1, 1), (32, 1, 32, 32), torch.float32)
        buf3 = reinterpret_tensor(buf1, (1, 32, 1, 1), (32, 1, 1, 1), 0); del buf1  # reuse
        buf4 = reinterpret_tensor(buf0, (1, 32, 1, 1), (32, 1, 1, 1), 0); del buf0  # reuse
        buf5 = empty_strided_cpu((1, 32, 128, 128, 128), (67108864, 2097152, 16384, 128, 1), torch.float32)
        # [Provenance debug handles] cpp_fused_native_group_norm_0:1
        cpp_fused_native_group_norm_0(buf3, buf4, arg2_1, arg0_1, arg1_1, buf5)
        del arg0_1
        del arg1_1
        del arg2_1
        return (buf5, )
```

- After
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['float*', 'float*', 'const float*', 'const float*', 'const float*', 'float*'], '''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void  kernel(float* in_out_ptr0,
                       float* in_out_ptr1,
                       const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr2)
{
    auto out_ptr1 = in_out_ptr0;
    auto out_ptr0 = in_out_ptr1;
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                Welford<float> tmp_acc0 = Welford<float>();
                Welford<float> tmp_acc0_arr[4];
                for (int i = 0; i < 4; i++)
                {
                    tmp_acc0_arr[i] = Welford<float>();
                }
                #pragma omp parallel num_threads(4)
                {
                    int tid = omp_get_thread_num();
                    WelfordHelper<float, float, 4096> scalar_welford_helper0(static_cast<int64_t>(524288L));
                    Welford<float> tmp_acc0_local = Welford<float>();
                    #pragma omp for
                    for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(1L))
                    {
                        {
                            {
                                auto tmp0 = in_ptr0[static_cast<int64_t>(x1 + 2097152L*x0)];
                                tmp_acc0_local = welford_combine(tmp_acc0_local, tmp0, &scalar_welford_helper0);
                            }
                        }
                    }
                    tmp_acc0_local = welford_combine(tmp_acc0_local, &scalar_welford_helper0);
                    tmp_acc0_arr[tid] = tmp_acc0_local;
                }
                for (int tid = 0; tid < 4; tid++)
                {
                    tmp_acc0 = welford_combine(tmp_acc0, tmp_acc0_arr[tid]);
                }
                in_out_ptr1[static_cast<int64_t>(x0)] = tmp_acc0.mean;
                in_out_ptr0[static_cast<int64_t>(x0)] = tmp_acc0.m2;
            }
        }
    }
    {
        #pragma GCC ivdep
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
        {
            {
                {
                    auto tmp0 = out_ptr1[static_cast<int64_t>(x0)];
                    auto tmp6 = in_ptr1[static_cast<int64_t>(x0)];
                    auto tmp8 = out_ptr0[static_cast<int64_t>(x0)];
                    auto tmp11 = in_ptr2[static_cast<int64_t>(x0)];
                    auto tmp1 = static_cast<float>(2097152.0);
                    auto tmp2 = tmp0 / tmp1;
                    auto tmp3 = static_cast<float>(1e-05);
                    auto tmp4 = float(tmp2 + tmp3);
                    auto tmp5 = 1 / std::sqrt(tmp4);
                    auto tmp7 = float(tmp5 * tmp6);
                    auto tmp9 = decltype(tmp8)(-tmp8);
                    auto tmp10 = float(tmp9 * tmp7);
                    auto tmp12 = float(tmp10 + tmp11);
                    in_out_ptr0[static_cast<int64_t>(x0)] = tmp7;
                    in_out_ptr1[static_cast<int64_t>(x0)] = tmp12;
                }
            }
        }
    }
    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
            {
                #pragma GCC ivdep
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        {
                            auto tmp0 = in_ptr0[static_cast<int64_t>(x1 + 2097152L*x0)];
                            auto tmp1 = in_out_ptr0[static_cast<int64_t>(x0)];
                            auto tmp3 = in_out_ptr1[static_cast<int64_t>(x0)];
                            auto tmp2 = float(tmp0 * tmp1);
                            auto tmp4 = float(tmp2 + tmp3);
                            out_ptr2[static_cast<int64_t>(x1 + 2097152L*x0)] = tmp4;
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, arg1_1, arg2_1 = args
        args.clear()
        assert_size_stride(arg0_1, (32, ), (1, ))
        assert_size_stride(arg1_1, (32, ), (1, ))
        assert_size_stride(arg2_1, (1, 32, 128, 128, 128), (67108864, 2097152, 16384, 128, 1))
        buf0 = empty_strided_cpu((1, 32, 1, 1), (32, 1, 32, 32), torch.float32)
        buf1 = empty_strided_cpu((1, 32, 1, 1), (32, 1, 32, 32), torch.float32)
        buf3 = reinterpret_tensor(buf1, (1, 32, 1, 1), (32, 1, 1, 1), 0); del buf1  # reuse
        buf4 = reinterpret_tensor(buf0, (1, 32, 1, 1), (32, 1, 1, 1), 0); del buf0  # reuse
        buf5 = empty_strided_cpu((1, 32, 128, 128, 128), (67108864, 2097152, 16384, 128, 1), torch.float32)
        # [Provenance debug handles] cpp_fused_native_group_norm_0:1
        cpp_fused_native_group_norm_0(buf3, buf4, arg2_1, arg0_1, arg1_1, buf5)
        del arg0_1
        del arg1_1
        del arg2_1
        return (buf5, )
```

Pull Request resolved: pytorch#162709
Approved by: https://github.com/CaoE, https://github.com/jansel
This PR fixes a bug where `torch.clamp` on MPS fails when min/max tensors have more dimensions than the input tensor.
CPU already supports this broadcasting, but MPS raised a RuntimeError.

Example of failing case before the fix:
```python
x = torch.randn(2, 3, device="mps")
min_t = torch.randn(1, 2, 3, device="mps")
max_t = torch.randn(1, 2, 3, device="mps")
torch.clamp(x, min=min_t, max=max_t)  # RuntimeError
```
After this fix, MPS matches CPU behavior.

Fixes pytorch#160734

Pull Request resolved: pytorch#165058
Approved by: https://github.com/malfet
The PR pytorch#167401 reminded me that the removal of the old NVTX interface is long overdue, as the header-only NVTX3 has been around for more than 5 years and ships with all CUDA Toolkit 12+ versions. In addition, `libnvToolsExt.so` was removed in CUDA Toolkit 13 and onward.

Pull Request resolved: pytorch#167637
Approved by: https://github.com/eqy
…device allocator (pytorch#166831)

This implements the MemPool plan for XPU, which is a dependency of [XPUGraph](pytorch#166285), following the [RFC](pytorch#162143).

- [ ] ->pytorch#166831
- [ ] pytorch#166833
- [ ] pytorch#166843

Pull Request resolved: pytorch#166831
Approved by: https://github.com/EikanWang, https://github.com/gujinghui

Co-authored-by: Eikan Wang <eikan.wang@intel.com>
…lasLtWorkspace" (pytorch#167928)

Summary:
getCurrentCUDABlasHandle() and getCUDABlasLtWorkspace() use static mutable maps that are not protected from concurrent reads and writes. This leads to crashes.

This diff adds mutexes to synchronize access to the static maps.
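
A minimal sketch of the locking pattern described here, written in Python for brevity (the real change is in C++; `allocate_workspace` and the map name are hypothetical): take the fast path only when the entry already exists, and re-check under the mutex before populating.

```python
import threading

_workspace_lock = threading.Lock()
_workspaces: dict = {}  # hypothetical stand-in for the static per-stream/per-device map

def get_workspace(key):
    ws = _workspaces.get(key)
    if ws is not None:
        return ws  # fast path: entry already populated
    with _workspace_lock:
        ws = _workspaces.get(key)  # re-check; another thread may have created it
        if ws is None:
            ws = allocate_workspace(key)  # hypothetical allocation helper
            _workspaces[key] = ws
        return ws
```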

Re-land context:

This is a re-land of pytorch#167248.

A few issues were addressed:
- fix for a bug in the fast path: premature return in getCurrentCUDABlasHandle()
- fix for test flakiness (pytorch#167884)

Test Plan:
1. regression tests:
buck2 test mode/opt //caffe2/test:test_transformers_cuda
https://www.internalfb.com/intern/testinfra/testrun/6192449759713581

2. Use a GPU OD, run multi-threaded tests with TSAN:

buck test fbcode//mode/dev-tsan fbcode//caffe2:cuda_cublas_handle_pool_test  -- --stress-runs 100
https://www.internalfb.com/intern/testinfra/testrun/14355223937501118

Differential Revision: D87111985

Pull Request resolved: pytorch#167928
Approved by: https://github.com/Skylion007
…rnels (pytorch#158250)

Co-authored-by: Nikhil Gupta [nikhil.gupta2@arm.com](mailto:nikhil.gupta2@arm.com)

This PR enables the use of KleidiAI INT4 kernels that directly produce BF16 outputs within PyTorch to boost LLM prefill & decode performance

**This change improves decode throughput by ~15% and reduces the memory required to run inference on the model by 50%.**

### Benchmark Setup
```
Model: meta-llama/Llama-3.1-8B
Test Platform: Neoverse V2
```
### Detailed Results

| Metric                           | With `--compile`         | Without `--compile`      |
|----------------------------------|---------------------------|---------------------------|
| Quantization Scheme              | INT4 symmetric channelwise | INT4 symmetric channelwise |
| Input Precision                  | BF16                      | BF16                      |
| Number of Layers Quantized       | 32                        | 32                        |
| Average Compression Ratio        | 87.49%                    | 87.49%                    |
| Total Quantization Time (s)      | 9.62                      | 10.32                     |
| Compile Time (First) (s)         | 134.48                    | 1.69                      |
| Compile Time (Second) (s)        | 80.44                     | 1.60                      |
| Compile Time (Subsequent) (s)    | 0.19                      | 0.22                      |
| Prefill Tokens                   | 54                        | 54                        |
| Decoded Tokens                   | 33                        | 33                        |
| Prefill Time (s)                 | 0.19                      | 0.22                      |
| Decode Time (s)                  | 0.76                      | 1.38                      |
| E2E Generation Time (s)          | 0.95                      | 1.60                      |
| Prefill Throughput (tokens/s)    | 288.13                    | 249.91                    |
| Decode Throughput (tokens/s)     | 43.42                     | 23.83                     |
Pull Request resolved: pytorch#158250
Approved by: https://github.com/malfet, https://github.com/aditew01, https://github.com/fadara01

Co-authored-by: Nikhil Gupta <nikhil.gupta2@arm.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: pytorch#167968
Approved by: https://github.com/pytorchbot
Update the torch-xpu-ops commit pin to [intel/torch-xpu-ops@1e69f4](intel/torch-xpu-ops@1e69f40), which includes:

- Add PTL in the default AOT target list for both Win and Lin
- Use PyTorch p2p API in Copy kernel
- Add event cache and event timing to XCCL
- Add Float8_e8m0fnu support for copy
- Add CMAKE_SYCL_COMPILER_LAUNCHER for sccache
Pull Request resolved: pytorch#167698
Approved by: https://github.com/EikanWang
Exposes `_inductor.config.bucket_all_reduces_fx`, mirroring the existing all_gather and reduce_scatter options, with "all" as the only supported value.
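
A minimal usage sketch, assuming the knob is set like the other inductor bucketing options:

```python
import torch._inductor.config as inductor_config

# "all" is the only supported value per this PR
inductor_config.bucket_all_reduces_fx = "all"
```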

Pull Request resolved: pytorch#167634
Approved by: https://github.com/eellison
Make the PyObject preservation scheme thread-safe with free threaded (nogil) Python. The general idea is:

* Python Tensor and Storage objects always hold a strong reference to their underlying c10 object
* c10 objects hold a strong reference to their Python objects if there's at least one other reference to the c10 object

This is implemented in `intrusive_ptr`:

* The topmost bit (`kHasPyObject`) of the weakref count is now used to indicate whether the `intrusive_ptr_target` has an associated PyObject. So `kHasPyObject` is one bit, the weakref count is now 31 bits, and the strong refcount remains 32 bits.
* When the reference count increases from one to two and `kHasPyObject` is set, we incref the associated Python object to ensure that it's kept alive.
* When the reference count decreases from two to one (i.e., there are no C++ references to the `intrusive_ptr_target` other than from the Python object), we decref the associated Python object to break the cycle.

Other benefits:

* We can delete a lot of the copypasta from Python internal `subtype_dealloc`
* This fixes the weakref and GC bugs we had in the previous scheme. Python weakrefs on Tensors and Storages should just work as expected now.
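
A rough illustration of that weakref behavior, essentially the scenario exercised by the `test_storage_dead_weak_ref` snippet quoted in a suggestion further down this thread:

```python
import weakref
import torch

x = torch.UntypedStorage(2)
w_x = weakref.ref(x)
y = torch.tensor(x)        # holds a C++ reference to the underlying StorageImpl
del x
assert w_x() is not None   # PyObject preserved while another owner exists
del y
assert w_x() is None       # last C++ reference gone; the Python object is collected
```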

Risks:

* Extra branch for reference count operations on `intrusive_ptr<TensorImpl>`, `intrusive_ptr<StorageImpl>`, and the generic `intrusive_ptr<intrusive_ptr_target>` even when we're not using Python.
* It's a big change

(Second attempt at pytorch#166342)

Pull Request resolved: pytorch#167564
Approved by: https://github.com/albanD, https://github.com/Skylion007
Previously we hard-failed if the pg was "gloo".
Now we fall back on hardcoded formulas.

Pull Request resolved: pytorch#167827
Approved by: https://github.com/eellison
pytorch#166044 removed openblas from the wheel dependency list for the AArch64+CPU build, so this PR adds it back. This only affects the CPU build, since AArch64+CUDA uses NVPL.

Pull Request resolved: pytorch#167841
Approved by: https://github.com/tinglvv, https://github.com/malfet
Use standard HIP headers for unsafeAtomicAdd. Removes copy/paste of unsafeAtomicAdd as "preview" implementation for gfx942.

Pull Request resolved: pytorch#167661
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
…rch#165067)"

This reverts commit 96a4c4b.

Reverted pytorch#165067 on behalf of https://github.com/jeanschmidt due to breaks internal tests see D87036515, @albanD please help the author get this PR merged ([comment](pytorch#165067 (comment)))
This reverts commit e20ca3b.

Reverted pytorch#167049 on behalf of https://github.com/jeanschmidt due to breaks internal tests see D87120562, @Skylion007 please help the author get this PR merged ([comment](pytorch#167049 (comment)))
This reverts commit 2245d7d.

Reverted pytorch#167899 on behalf of https://github.com/jeanschmidt due to need to revert in order to revert pytorch#167899 ([comment](pytorch#167899 (comment)))
This reverts commit deabb3e.

Reverted pytorch#167821 on behalf of https://github.com/jeanschmidt due to Breaks internal tests, see D87148810. @Skylion007 may you help the author to get this PR merged? ([comment](pytorch#167821 (comment)))
Alas, one cannot use `repeat_interleave_common` for MPS tensors, as `data_offset` is not a valid pointer to `id<MTLTensor>`.
On the other hand, one does not need to use `AT_DISPATCH_INDEX_TYPES`, as dispatching happens on the shader side.

Fixes pytorch#167924
Pull Request resolved: pytorch#167961
Approved by: https://github.com/manuelcandales
Summary:

MXFP4 unit tests pass on B200, fail on RTX 5090 - disable non-B200
cases.

Also fail with a NotImplementedError on non-B200 devices to avoid
unhelpful failure messages.

Test Plan:

```
pytest -sv -k "mxfp4" test/test_scaled_matmul_cuda.py
```

Reviewers:

@nWEIdia

Subscribers:

Tasks:

Fixes pytorch#167850

Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: pytorch#167857
Approved by: https://github.com/nWEIdia, https://github.com/malfet
Upgrade all the ROCm docker images to ROCm 7.1 release version.

Pull Request resolved: pytorch#166743
Approved by: https://github.com/atalman, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Prachi Gupta <prachi.gupta@amd.com>
…7860)

getAllOperatorsFor returns a const reference to internal state that is protected by a lock. Presuming that the lock is necessary in the first place (about which I offer no opinion because it's unclear to what extent the GIL should help here), this is a straightforward way to cause callers to create race conditions.

This should fix those race conditions by copying the state instead. I modified calling code to stop binding a const reference to the result for clarity.
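
A language-agnostic sketch of the fix pattern, shown in Python with a hypothetical registry: snapshot the protected state while holding the lock instead of handing out a reference to it.

```python
import threading

class OperatorRegistry:
    def __init__(self):
        self._lock = threading.Lock()
        self._ops_by_name: dict[str, list] = {}  # hypothetical internal state

    def get_all_operators_for(self, name: str) -> list:
        with self._lock:
            # Return a copy made under the lock; callers iterate a snapshot
            # instead of racing with concurrent registrations.
            return list(self._ops_by_name.get(name, ()))
```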

Differential Revision: [D87088731](https://our.internmc.facebook.com/intern/diff/D87088731/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D87088731/)!

Pull Request resolved: pytorch#167860
Approved by: https://github.com/zou3519
…ytorch#161728)

Resolves pytorch#161290

## Summary

Expands `dynamo/check_perf_csv.py` output capabilities with latency, compile time and memory information:

- Displays the measured speedup and the % difference from the target
- Added clear messaging when all model tests pass and no regression is found
- Added error handling for a missing CSV file

### Example (Failing Check)

```bash
python benchmarks/dynamo/check_perf_csv.py -f reports-dir/inductor_training_smoketest.csv -t 1.40
```

**Example Output:**
```
Checking inductor_training_smoketest.csv (speedup threshold >= 1.40x)
hf_Bert                            speedup=1.005x, latency=390.8 ms/iter, compile=1.526s, mem_ratio=1.02x (eager=360.6 GB, dynamo=369.3 GB)
Error 1 model(s) performance regressed
    hf_Bert
  - hf_Bert: 1.005x (< 1.40x; -28.2% from target)
```

### Example (Passing Check)

```bash
python benchmarks/dynamo/check_perf_csv.py -f reports-dir/inductor_training_smoketest.csv -t 1.00
```

**Example Output:**
```
Checking inductor_training_smoketest.csv (speedup threshold >= 1.00x)
hf_Bert                            speedup=1.005x, latency=390.8 ms/iter, compile=1.526s, mem_ratio=1.02x (eager=360.6 GB, dynamo=369.3 GB)
All 1 model(s) passed threshold check (>= 1.00x)
```
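
A rough sketch of the pass/fail logic implied by the outputs above (column names and message formats are assumptions, not the script's actual code):

```python
import csv

def check_perf_csv(path: str, threshold: float) -> int:
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    failed = []
    for row in rows:
        speedup = float(row["speedup"])  # assumed column name
        if speedup < threshold:
            pct = (speedup - threshold) / threshold * 100.0
            failed.append(f"{row['name']}: {speedup:.3f}x (< {threshold:.2f}x; {pct:.1f}% from target)")
    if failed:
        print(f"Error {len(failed)} model(s) performance regressed")
        for line in failed:
            print(f"  - {line}")
        return 1
    print(f"All {len(rows)} model(s) passed threshold check (>= {threshold:.2f}x)")
    return 0
```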

Pull Request resolved: pytorch#161728
Approved by: https://github.com/isuruf
pytorchmergebot and others added 18 commits November 17, 2025 17:59
This reverts commit 99fdca8.

Reverted pytorch#166492 on behalf of https://github.com/jeanschmidt due to Internally we still depends on the old logic, so we need to find a way to maintain backwards compatibility, for now ([comment](pytorch#166492 (comment)))
…orch::stable::Tensor. (pytorch#161891)

This ghstack is a prerequisite for porting torchaudio C++ extensions to use torch stable ABI, see pytorch/audio#4074, pytorch/audio#4075, pytorch/audio#4076, pytorch/audio#4077, pytorch/audio#4078

Pull Request resolved: pytorch#161891
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: pytorch#167772
The following tests are failing with Python 3.14 on Linux machines:

* TestSetAffinity::test_set_affinity_in_worker_init
    * Why? 3.14 makes `forkserver` the default start method for multiprocessing. With it, local functions are not picklable and the unit test fails (see the sketch after this list).
* TestIndividualWorkerQueue::test_ind_worker_queue
    * Why? The test was hitting a timeout. This is also related to the start method. I am increasing the timeout and reducing the batch-size iterations to reduce total unit-test time.
    * Fixes pytorch#68643
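
A minimal repro sketch of the pickling failure (an assumed shape of the problem; the real test is `TestSetAffinity::test_set_affinity_in_worker_init`): under the forkserver start method, a `worker_init_fn` defined as a local function cannot be pickled, so worker startup fails.

```python
import multiprocessing as mp
from torch.utils.data import DataLoader, Dataset

class TinyDataset(Dataset):
    def __len__(self):
        return 4
    def __getitem__(self, idx):
        return idx

def main():
    def worker_init(worker_id):  # local function: not picklable
        pass
    loader = DataLoader(TinyDataset(), num_workers=1, worker_init_fn=worker_init)
    next(iter(loader))           # raises a pickling error when the worker is spawned

if __name__ == "__main__":
    mp.set_start_method("forkserver", force=True)  # the 3.14 default on Linux
    main()
```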

Pull Request resolved: pytorch#167429
Approved by: https://github.com/aelavender, https://github.com/ramanishsingh
This reverts commit 77acc66.

Reverted pytorch#166743 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#166743 (comment)))
Not sure if the paths are already set up properly so that I can call 'benchmarks/dynamo/huggingface.py' directly in a unit test. Let's see what CI says.

Pull Request resolved: pytorch#167482
Approved by: https://github.com/v0i0, https://github.com/mlazos
Inductor may treat an outer reduction as an inner reduction when the reduction ranges contain a 1. This causes a weird issue where we skip fusing with a mix-order reduction. While I'm still debugging why that happens, I think we should fix the decision here anyway.

Pull Request resolved: pytorch#167697
Approved by: https://github.com/jansel, https://github.com/v0i0
Fixes pytorch#158429

Updated LogAddExpKernel.cu to allow complex numbers. Also updated the unit test to run test_logaddexp on CUDA with complex dtypes, and added a unit test in test_linalg.py to compare results between CUDA and CPU.
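
A hedged sketch of the behavior this enables, mirroring the CPU-vs-CUDA comparison described above (the sample values echo the new test quoted in a suggestion further down):

```python
import torch

re = torch.tensor([0.052, -0.2115, 0.6913], dtype=torch.float64)
im = torch.tensor([-0.3229, -0.8374, 0.8391], dtype=torch.float64)
a = torch.complex(re, im)
b = torch.complex(re.flip(0), im.flip(0))

cpu_out = torch.logaddexp(a, b)
cuda_out = torch.logaddexp(a.cuda(), b.cuda())
torch.testing.assert_close(cuda_out.cpu(), cpu_out)  # should agree up to machine precision
```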

@drisspg
Pull Request resolved: pytorch#163509
Approved by: https://github.com/isuruf
Enables `mm` with `out=` for sparse tensors.
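
A minimal usage sketch, assuming this refers to the `out=` overload of `torch.mm` with a sparse COO operand:

```python
import torch

a = torch.randn(4, 5).to_sparse()  # sparse COO operand
b = torch.randn(5, 3)
out = torch.empty(4, 3)
torch.mm(a, b, out=out)            # out= variant covered by this change
```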
Pull Request resolved: pytorch#167908
Approved by: https://github.com/malfet
…#167931)

Per title
1) allows the `self` argument to have the same precision as the output
2) fixes broadcasting of the `self` argument - it used to allocate an incorrectly sized output and resize it later, causing a warning in addmm and an error in baddbmm
3) fixes `out` handling for the baddbmm `out` overload, where the implementation used uninitialized memory in `out` instead of copying `self` into it (see the sketch after this list)
4) removes a couple of unneeded IIFE patterns
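
A hedged sketch of the shapes items 2 and 3 describe (CUDA device assumed so the cuBLAS path in question is exercised):

```python
import torch

m1 = torch.randn(8, 16, device="cuda")
m2 = torch.randn(16, 32, device="cuda")
bias = torch.randn(32, device="cuda")     # 1-D `self`, broadcast across rows (item 2)
y = torch.addmm(bias, m1, m2)

b1 = torch.randn(4, 8, 16, device="cuda")
b2 = torch.randn(4, 16, 32, device="cuda")
inp = torch.randn(8, 32, device="cuda")   # `self` broadcast over the batch dimension
out = torch.empty(4, 8, 32, device="cuda")
torch.baddbmm(inp, b1, b2, out=out)       # `out` overload copies `self` into `out` (item 3)
```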

Pull Request resolved: pytorch#167931
Approved by: https://github.com/PaulZhang12, https://github.com/drisspg, https://github.com/malfet
…idiAI kernels (pytorch#158250)"

This reverts commit 53809f9.

Reverted pytorch#158250 on behalf of https://github.com/zou3519 due to reverting to see if it fixes inductor halide test failure ([comment](pytorch#158250 (comment)))
Summary:
add support for symint placeholders

added two test cases with dynamic reshape
- dynamic info coming from tmd on placeholders
- dynamic info coming from placeholders (symints)

Test Plan:
test_reshape_dynamic_ph
test_reshape_dynamic_tmd

Differential Revision: D86984100

Pull Request resolved: pytorch#167757
Approved by: https://github.com/blaine-rister
…locate test into `TestSaveLoad` (pytorch#158247)

This is a follow-up to [pytorch#154333](pytorch#154333), where I initially introduced a fallback mechanism in deserialize_torch_artifact.

In this revised PR:

Cleaned up commit history for clarity and reproducibility.

Relocated the test into the TestSaveLoad class in test_serialize.py.

There were some issues with the last PR, so I opened this one.

The previous PR had inconsistencies due to local branch issues and was closed in favor of this cleaner submission.

Feedback is very welcome
Pull Request resolved: pytorch#158247
Approved by: https://github.com/angelayi
This reverts commit 99117c1.

Reverted pytorch#167637 on behalf of https://github.com/yangw-dev due to breaks internal build with torch/csrc/profiler/stubs/cuda.cpp:4:10: fatal error: 'nvtx3/nvtx3.hpp' file not found 4 | #include <nvtx3/nvtx3.hpp>, please find a meta fella to resolve this issue and try again, diff:[D87229660] ([comment](pytorch#167637 (comment)))
This reverts commit 7ede33b.

Reverted pytorch#167771 on behalf of https://github.com/eellison due to needs one fix ([comment](pytorch#167771 (comment)))
… used where needed"

Splits each torch library registration in the 2.10 folder into its own file -- I had a script that parsed kernel.cpp to do this but I felt like forcing this responsibility on the user might be less error prone

Compiles each file targeting 2.9 and asserts that compilation fails. (There are 2 2.9 kernels we use as negative tests where compilation is expected to succeed)




[ghstack-poisoned]
… used where needed"

Splits each torch library registration in the 2.10 folder into its own file -- I had a script that parsed kernel.cpp to do this but I felt like forcing this responsibility on the user might be less error prone

Compiles each file targeting 2.9 and asserts that compilation fails. (There are 2 2.9 kernels we use as negative tests where compilation is expected to succeed)




[ghstack-poisoned]
@qodo-merge-pro
Copy link

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
🟢
No security concerns identified. No security vulnerabilities detected by AI analysis. Human verification advised for critical code.
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status: Passed

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
No auditing: New test helpers initialize NCCL process groups and manipulate GPU state without adding
any audit logging of critical actions, but as this is test code and may rely on external
logging, it requires human verification.

Referred Code
def _get_process_group_nccl(self):
    store = dist.FileStore(self.file_name, self.world_size)
    dist.init_process_group(
        backend="nccl",
        world_size=self.world_size,
        rank=self.rank,
        store=store,
    )
    return dist.distributed_c10d._get_default_group()

Learn more about managing compliance generic rules or creating your own custom rules

Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Error handling: The refactor of memory coalescing scoring removes buffer-size bounding and broadcast
handling which could affect edge cases without adding explicit error or boundary checks,
but behavior may be validated elsewhere by tests.

Referred Code
if indirect_expr:
    continue

size = get_score(memory_expr, var_ranges)
if size == 0:
    continue

maybe_coalesced_var = find_coalesced_var(memory_expr, var_ranges)

byte_multipler = 0
for buf_name in buf_names:
    if buf := V.graph.try_get_buffer(buf_name):
        byte_multipler += buf.dtype.itemsize

# coalesced writes more important
byte_multipler *= 1 if is_read else 2

if maybe_coalesced_var:
    coalesced_by_var[maybe_coalesced_var] += size * byte_multipler
else:
    uncoalesced_addrs[memory_expr] += size * byte_multipler



 ... (clipped 1 lines)

Learn more about managing compliance generic rules or creating your own custom rules

Compliance status legend
🟢 - Fully Compliant
🟡 - Partial Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@bar-qodo
Copy link
Author

/agentic_review

@bar-qodo
Copy link
Author

@sentry review

@qodo-merge-pro
Copy link

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Restore check for squeezable tensors

Restore the check for squeezable 1D bias tensors in launchGemmAndBiasCublasLt to
fix a regression.

aten/src/ATen/native/cuda/Blas.cpp [307]

-const auto* self_ptr = self.has_value() ? self.value().const_data_ptr<scalar_t>() : static_cast<const scalar_t*>(nullptr);
+const auto* self_ptr = (self.has_value() && (self.value().dim() == 1 || self.value().squeeze().dim() == 1))
+                           ? self.value().const_data_ptr<scalar_t>()
+                           : static_cast<const scalar_t*>(nullptr);
Suggestion importance[1-10]: 8

__

Why: The suggestion correctly identifies a functional regression where the check for a bias tensor being squeezable to 1D was lost, preventing certain valid bias shapes from using an optimized code path.

Medium
Use robust floating-point comparison

In test_logaddexp_cpu_vs_cuda_complex, replace self.assertEqual with
torch.testing.assert_close(..., equal_nan=True) for robustly comparing
floating-point results, including inf and NaN.

test/test_linalg.py [10075-10131]

 @onlyCUDA
 def test_logaddexp_cpu_vs_cuda_complex(self, device):
     # test logaddexp with complex values produce the same values (up to machine precision) on cpu and CUDA.
     input_real = torch.tensor([0.052, -0.2115, 0.6913], dtype=torch.float64)
     input_img = torch.tensor([-0.3229, -0.8374, 0.8391], dtype=torch.float64)
     input_complex = torch.complex(input_real, input_img).cuda()
 
     other_real = torch.tensor([0.2550, 0.8769, -0.4884], dtype=torch.float64)
     other_img = torch.tensor([0.6063, 0.4343, -1.4166], dtype=torch.float64)
     other_complex = torch.complex(other_real, other_img).cuda()
 
     out_gpu = torch.logaddexp(input=input_complex, other=other_complex)
     out_cpu = torch.logaddexp(input=input_complex.cpu(), other=other_complex.cpu())
 
     torch.testing.assert_close(out_gpu.cpu(), out_cpu, rtol=1e-12, atol=1e-14)
 
     # test extreme cases (infty, -infty, and nan) are handled the same between cuda and cpu
     input_complex = torch.complex(torch.tensor(float('inf')), torch.tensor(float('inf')))
     other_complex = torch.complex(torch.tensor(float('inf')), torch.tensor(float('inf')))
     out_gpu = torch.logaddexp(input=input_complex, other=other_complex)
     out_cpu = torch.logaddexp(input=input_complex.cpu(), other=other_complex.cpu())
-    self.assertEqual(out_gpu.cpu(), out_cpu)
+    torch.testing.assert_close(out_gpu.cpu(), out_cpu, equal_nan=True)
     ...

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies that self.assertEqual is not robust for comparing floating-point numbers, especially NaN, and recommends the more appropriate torch.testing.assert_close(..., equal_nan=True).

Medium
Handle equality case in complex comparison

Improve the _logaddexp_minmax function by adding a tie-breaking rule based on
imaginary parts when real parts are equal to ensure deterministic behavior.

aten/src/ATen/native/cuda/LogAddExpKernel.cu [23-36]

 template <typename scalar_t, bool min>
 __host__ __device__ c10::complex<scalar_t> _logaddexp_minmax(const c10::complex<scalar_t>& x, const c10::complex<scalar_t>& y) {
   scalar_t xr = std::real(x);
   scalar_t yr = std::real(y);
   if (::isnan(yr) || (::isnan(std::imag(y)))) {
     return y;
   } else if (::isnan(xr) || (::isnan(std::imag(x)))) {
     return x;
-  } else if (min) { // min
-    return (xr < yr) ? x : y;
-  } else { // max
-    return (xr >= yr) ? x : y;
   }
+
+  if (xr != yr) {
+    if (min) {
+      return (xr < yr) ? x : y;
+    }
+    return (xr > yr) ? x : y;
+  }
+
+  // real parts are equal, break tie with imaginary parts
+  scalar_t xi = std::imag(x);
+  scalar_t yi = std::imag(y);
+  if (min) {
+    return (xi < yi) ? x : y;
+  }
+  return (xi > yi) ? x : y;
 }
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies that the comparison for complex numbers is not deterministic when real parts are equal. Adding a tie-breaker based on the imaginary part improves correctness and determinism.

Low
General
Use stat instead of lstat

Replace lstat with stat in the file_exists function to correctly handle symbolic
links by checking the target file's existence.

torch/csrc/inductor/aoti_package/model_package_loader.cpp [82-89]

 bool file_exists(const std::string& path) {
 #ifdef _WIN32
   return fs::exists(path);
 #else
   struct stat rc{};
-  return lstat(path.c_str(), &rc) == 0;
+  return stat(path.c_str(), &rc) == 0;
 #endif
 }

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 7

__

Why: The suggestion correctly points out that using stat instead of lstat is more appropriate for checking file existence by following symbolic links, which prevents potential failures in subsequent file operations.

Medium
Improve weak reference storage test

Improve test_storage_dead_weak_ref by adding an assertion to verify the storage
object is accessible via the weak reference before the final strong reference is
deleted.

test/test_torch.py [10345-10352]

 @skipIfTorchDynamo("https://github.com/pytorch/torchdynamo/issues/1993")
 def test_storage_dead_weak_ref(self):
     x = torch.UntypedStorage(2)
     w_x = weakref.ref(x)
     y = torch.tensor(x)
     del x
+
+    # Check that the storage is still alive and accessible
+    storage_from_weak_ref = w_x()
+    self.assertIsNotNone(storage_from_weak_ref)
+    # Perform an operation to ensure it's a valid storage object
+    self.assertEqual(storage_from_weak_ref.size(), 2)
+    del storage_from_weak_ref
+
     self.assertIsNotNone(w_x())
     del y
     self.assertIsNone(w_x())

[To ensure code accuracy, apply this suggestion manually]

Suggestion importance[1-10]: 4

__

Why: The suggestion correctly proposes adding an assertion to verify the storage object is alive and accessible via the weak reference before the final strong reference is deleted, making the test more robust.

Low

@bar-qodo
Copy link
Author

@sentry review

@bar-qodo bar-qodo requested a review from ravidqodo November 19, 2025 18:04
@bar-qodo
Copy link
Author

@sentry review
