fake commit #8
Closed
Conversation
@pytorchbot test this please
@pytorchbot retest this please
@pytorchbot retest this please
Continuous builds now work for both pull requests and pushes to master. The build server is Jenkins (EC2) + Nimbix. Build scripts are here: https://github.com/pytorch/builder
apaszke pushed a commit that referenced this pull request on Jan 30, 2017
apaszke pushed a commit that referenced this pull request on Jan 31, 2017
yogi81 referenced this pull request on Aug 16, 2017
pytorchmergebot pushed a commit that referenced this pull request on Nov 22, 2024
See #140725 (comment)

Running `torch.mps.synchronize()` after a Metal kernel resulted in an infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]`:
```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12
    frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84
    frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40
    frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100
    frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92
    frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040
    frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200
    frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104
    frame #8: 0x0000000100fccbe4 Python`run_mod + 168
    frame #9: 0x0000000100fcb518 Python`pyrun_file + 164
    frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256
    frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80
    frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164
    frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72
    frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988
    frame #15: 0x0000000100ff1564 Python`pymain_main + 304
    frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40
    frame #17: 0x000000019f630274 dyld`start + 2840
```
Pull Request resolved: #141296
Approved by: https://github.com/huydhn
youssef62 pushed a commit to youssef62/pytorch that referenced this pull request on Nov 23, 2024
gglin001 pushed a commit to gglin001/pytorch that referenced this pull request on Nov 27, 2024
chunyuan-w pushed a commit to chunyuan-w/pytorch that referenced this pull request on Dec 2, 2024
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request on Dec 5, 2024
aostrowski-hbn added a commit to HabanaAI/pytorch-fork that referenced this pull request on Jan 7, 2025
drisspg added a commit that referenced this pull request on Jan 15, 2025
…ention"

Thanks to manman-ren, who verified that triton-lang/triton#4247 fixes this issue as well. This is not currently cherry-picked into pytorch-triton.

========= COMPUTE-SANITIZER
Test completed successfully!
========= ERROR SUMMARY: 0 errors

## NOTE:
Very interestingly, if `og_head_dim` is odd this works as expected; however, when `og_head_dim` is a multiple of 2 it segfaults here:

```Shell
(lldb) bt
* thread #67, name = 'pt_autograd_0', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x10)
  * frame #0: 0x00007ffed327fbfe libtriton.so`scheduleRemainingToLastStage(forOp=ForOp @ 0x00007ffcafdfd658, schedule=0x00007ffcafdfd9e0, afterPrologue=<unavailable>, numStages=2) at MatmulLoopPipeline.cpp:893:9
    frame #1: 0x00007ffed328d970 libtriton.so`mlir::triton::preProcessLoopAndGetSchedule(forOp=0x00007ffcafdfddc0, numStages=2, options=0x00007ffcafdfde80) at MatmulLoopPipeline.cpp:1230:31
    frame #2: 0x00007ffed32a6a43 libtriton.so`mlir::triton::gpu::PipelinePass::runOnOperation() [inlined] pipelineLoop(numStages=2, forOp=ForOp @ 0x00007ffcafdfddc0) at SoftwarePipeliner.cpp:79:47
    frame #3: 0x00007ffed32a6998 libtriton.so`mlir::triton::gpu::PipelinePass::runOnOperation(this=0x00007ffc54767f10) at SoftwarePipeliner.cpp:125:36
    frame #4: 0x00007ffed385147c libtriton.so`mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) + 700
    frame #5: 0x00007ffed3851df2 libtriton.so`mlir::detail::OpToOpPassAdaptor::runPipeline(mlir::OpPassManager&, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) + 354
    frame #6: 0x00007ffed385481c libtriton.so`mlir::PassManager::run(mlir::Operation*) + 876
    frame #7: 0x00007ffed3542bad libtriton.so`<lambda(mlir::PassManager&, mlir::ModuleOp&)>::operator(self=<unavailable>, mod=0x00007ffc54579280, __closure=<unavailable>)(mlir::PassManager &, mlir::ModuleOp &) at ir.cc:1625:19
    frame #8: 0x00007ffed3560108 libtriton.so`_FUN [inlined] operator(this=0x0000000000000000, call=0x00007ffcafdfe6e0) at cast.h:1480:37
    frame #9: 0x00007ffed35600f0 libtriton.so`_FUN((null)=0x00007ffcafdfe6e0) at pybind11.h:224:21
    frame #10: 0x00007ffed9ae5590 libtriton.so`typeinfo for pybind11::handle + 24
    frame #11: 0x00007ffed9ae5590 libtriton.so`typeinfo for pybind11::handle + 24
```

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
drisspg added a commit that referenced this pull request on Jan 15, 2025
pytorchmergebot pushed a commit that referenced this pull request on Jun 1, 2025
Which inherits from `RuntimeError` and contains `error_code`, which in the case of CUDA should contain the error returned by `cudaGetLastError`.

`torch::detail::_new_accelerator_error_object(c10::AcceleratorError&)` follows the pattern of CPython's [`PyErr_SetString`](https://github.com/python/cpython/blob/cb8a72b301f47e76d93a7fe5b259e9a5758792e1/Python/errors.c#L282), namely:
- Convert the C string into a Python string with `PyUnicode_FromString`
- Create a new exception object using `PyObject_CallOneArg`, just like it's done in [`_PyErr_CreateException`](https://github.com/python/cpython/blob/cb8a72b301f47e76d93a7fe5b259e9a5758792e1/Python/errors.c#L32)
- Set the `error_code` property using `PyObject_SetAttrString`
- Decref all temporary references

Test that it works and captures the C++ backtrace (in addition to CI) by running
```python
import os
os.environ['TORCH_SHOW_CPP_STACKTRACES'] = '1'
import torch

x = torch.rand(10, device="cuda")
y = torch.arange(20, device="cuda")
try:
    x[y] = 2
    print(x)
except torch.AcceleratorError as e:
    print("Exception was raised", e.args[0])
    print("Captured error code is ", e.error_code)
```
which produces the following output
```
Exception was raised CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /home/ubuntu/pytorch/c10/cuda/CUDAException.cpp:41 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) [clone .cold] from CUDAException.cpp:0
#7 void at::native::gpu_kernel_impl<at::native::AbsFunctor<float> >(at::TensorIteratorBase&, at::native::AbsFunctor<float> const&) [clone .isra.0] from tmpxft_000191fc_00000000-6_AbsKernel.cudafe1.cpp:0
#8 at::native::abs_kernel_cuda(at::TensorIteratorBase&) from ??:0
#9 at::Tensor& at::native::unary_op_impl_with_complex_to_float_out<at::native::abs_stub_DECLARE_DISPATCH_type>(at::Tensor&, at::Tensor const&, at::native::abs_stub_DECLARE_DISPATCH_type&, bool) [clone .constprop.0] from UnaryOps.cpp:0
#10 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_out_abs_out(at::Tensor const&, at::Tensor&) from RegisterCUDA_0.cpp:0
#11 at::_ops::abs_out::call(at::Tensor const&, at::Tensor&) from ??:0
#12 at::native::abs(at::Tensor const&) from ??:0
#13 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__abs>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeExplicitAutograd_0.cpp:0
#14 at::_ops::abs::redispatch(c10::DispatchKeySet, at::Tensor const&) from ??:0
#15 torch::autograd::VariableType::(anonymous namespace)::abs(c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::abs>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0
#17 at::_ops::abs::call(at::Tensor const&) from ??:0
#18 at::native::isfinite(at::Tensor const&) from ??:0
#19 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__isfinite>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeImplicitAutograd_0.cpp:0
#20 at::_ops::isfinite::call(at::Tensor const&) from ??:0
#21 torch::autograd::THPVariable_isfinite(_object*, _object*, _object*) from python_torch_functions_2.cpp:0
#22 PyObject_CallFunctionObjArgs from ??:0
#23 _PyObject_MakeTpCall from ??:0
#24 _PyEval_EvalFrameDefault from ??:0
#25 _PyObject_FastCallDictTstate from ??:0
#26 _PyStack_AsDict from ??:0
#27 _PyObject_MakeTpCall from ??:0
#28 _PyEval_EvalFrameDefault from ??:0
#29 _PyFunction_Vectorcall from ??:0
#30 _PyEval_EvalFrameDefault from ??:0
#31 _PyFunction_Vectorcall from ??:0
#32 _PyEval_EvalFrameDefault from ??:0
#33 _PyFunction_Vectorcall from ??:0
#34 _PyEval_EvalFrameDefault from ??:0
#35 PyFrame_GetCode from ??:0
#36 PyNumber_Xor from ??:0
#37 PyObject_Str from ??:0
#38 PyFile_WriteObject from ??:0
#39 _PyWideStringList_AsList from ??:0
#40 _PyDict_NewPresized from ??:0
#41 _PyEval_EvalFrameDefault from ??:0
#42 PyEval_EvalCode from ??:0
#43 PyEval_EvalCode from ??:0
#44 PyUnicode_Tailmatch from ??:0
#45 PyInit__collections from ??:0
#46 PyUnicode_Tailmatch from ??:0
#47 _PyRun_SimpleFileObject from ??:0
#48 _PyRun_AnyFileObject from ??:0
#49 Py_RunMain from ??:0
#50 Py_BytesMain from ??:0
#51 __libc_init_first from ??:0
#52 __libc_start_main from ??:0
#53 _start from ??:0

Captured error code is 710
```
Pull Request resolved: #152023
Approved by: https://github.com/eqy, https://github.com/mradmila, https://github.com/ngimel
ghstack dependencies: #154436
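The construction sequence listed above can be sketched in a few lines of CPython C API. This is a hedged illustration of the pattern only, not the actual `torch::detail::_new_accelerator_error_object` implementation; `set_accelerator_error` and the `exc_type` parameter (standing in for the `torch.AcceleratorError` type object) are assumptions made for the example.

```cpp
#include <Python.h>

// Sketch of the PyErr_SetString-like pattern described above.
// `exc_type` is assumed to be the exception type object (e.g. AcceleratorError).
static void set_accelerator_error(PyObject* exc_type, const char* msg, long error_code) {
  // 1. Convert the C string into a Python string.
  PyObject* py_msg = PyUnicode_FromString(msg);
  if (py_msg == nullptr) {
    return;  // allocation failed; an exception is already set
  }
  // 2. Create the exception object by calling the type with one argument.
  PyObject* exc = PyObject_CallOneArg(exc_type, py_msg);
  Py_DECREF(py_msg);
  if (exc == nullptr) {
    return;
  }
  // 3. Attach the error code as an attribute.
  PyObject* py_code = PyLong_FromLong(error_code);
  if (py_code != nullptr) {
    PyObject_SetAttrString(exc, "error_code", py_code);
    Py_DECREF(py_code);
  }
  // 4. Raise the exception and drop the temporary reference.
  PyErr_SetObject(exc_type, exc);
  Py_DECREF(exc);
}
```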
pytorchmergebot pushed a commit that referenced this pull request on Jun 24, 2025
…56600)

Don't call `sum()` on a tensor that is default-constructed.

Previously we could call `sum()` on a tensor that was default-constructed. That would lead to an error like this:
```
Traceback (most recent call last):
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/ahmads/personal/pytorch/torch/testing/_internal/common_utils.py", line 3191, in wrapper
    method(*args, **kwargs)
  File "/home/ahmads/personal/pytorch/test/test_nn.py", line 7235, in test_layer_norm_backwards_eps
    ln_out_cuda.backward(grad_output_cuda)
  File "/home/ahmads/personal/pytorch/torch/_tensor.py", line 647, in backward
    torch.autograd.backward(
  File "/home/ahmads/personal/pytorch/torch/autograd/__init__.py", line 354, in backward
    _engine_run_backward(
  File "/home/ahmads/personal/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: tensor does not have a device
Exception raised from device_default at /home/ahmads/personal/pytorch/c10/core/TensorImpl.h:1265 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#7 at::TensorBase::options() const from :0
#8 at::meta::resize_reduction(at::impl::MetaBase&, at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::ScalarType, bool) from :0
#9 at::meta::structured_sum_dim_IntList::meta(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#10 at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#11 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>), &at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#12 at::_ops::sum_dim_IntList::call(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#13 void at::native::(anonymous namespace)::LaunchGammaBetaBackwardCUDAKernel<float, float>(float const*, float const*, float const*, float const*, long, long, at::Tensor*, at::Tensor*, CUstream_st*) from ??:0
#14 void at::native::(anonymous namespace)::LayerNormBackwardKernelImplInternal<float>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#15 at::native::(anonymous namespace)::LayerNormBackwardKernelImpl(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#16 at::native::layer_norm_backward_cuda(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from ??:0
#17 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__native_layer_norm_backward(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from RegisterCUDA_0.cpp:0
```

Now we only call `sum(0)` on tensors that are defined and properly guard the `sum(0)` and assignment.

Pull Request resolved: #156600
Approved by: https://github.com/eqy, https://github.com/ngimel
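The guarding pattern the fix describes can be sketched as follows. This is a hedged illustration, not the actual LayerNorm backward kernel code from the PR; `accumulate_if_defined`, `partial`, and `maybe_out` are hypothetical names introduced for the example.

```cpp
#include <ATen/ATen.h>

// Only reduce into an output tensor when it was actually requested.
// A default-constructed at::Tensor has no storage and no device, so calling
// sum() on it raises "tensor does not have a device".
void accumulate_if_defined(const at::Tensor& partial, at::Tensor* maybe_out) {
  if (maybe_out == nullptr || !maybe_out->defined() || !partial.defined()) {
    return;  // output not requested (or nothing to reduce); skip both sum(0) and the assignment
  }
  *maybe_out = partial.sum(0);  // safe: both tensors are defined
}
```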
ethanwee1 added a commit to ethanwee1/pytorch that referenced this pull request on Jun 26, 2025
…:warp_size() (pytorch#2293)

Fixes SWDEV-540240, SWDEV-540309, SWDEV-539989

### Error
```
#24 437.7   what():  HIP error: no ROCm-capable device is detected
#24 437.7 HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
#24 437.7 For debugging consider passing AMD_SERIALIZE_KERNEL=3
#24 437.7 Device-side assertions were explicitly omitted for this error check; the error probably arose while initializing the DSA handlers.
#24 437.7 Exception raised from c10_hip_check_implementation at /pytorch/c10/hip/HIPException.cpp:44 (most recent call first):
#24 437.7 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x88 (0x7f272de18738 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
#24 437.7 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x55 (0x7f272ddb42ed in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
...
#24 437.7 frame #7: at::cuda::getCurrentDeviceProperties() + 0x9 (0x7f270b5874e9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_hip.so)
#24 437.7 frame #8: at::cuda::warp_size() + 0x9 (0x7f270b587509 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_hip.so)
#24 437.7 frame #9: <unknown function> + 0x81ac8b (0x7f2709c27c8b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_hip.so)
```

### Explanation
ROCm@80cca70 created a static global variable that used `at::cuda::warp_size()` to initialize its value, which needs GPUs to be visible in order to query device properties. However, GPUs are not present on CPU-only build systems.

### Solution
Convert the static variable into a static function, thus preventing static initialization.

### Validation
http://rocm-ci.amd.com/job/pyt_whl_docker_mainline/1461/artifact/build_artifacts.txt/*view*/

Ran a microbenchmark to confirm basic functionality:
```
root@ubb4-rack-22:/var/lib/jenkins/pytorch-micro-benchmarking# python3 micro_benchmarking_pytorch.py --network resnet50
INFO: running forward and backward for warmup.
INFO: running the benchmark..
OK: finished running benchmark..
--------------------SUMMARY--------------------------
Microbenchmark for network : resnet50
Num devices: 1
Dtype: FP32
Mini batch size [img] : 64
Time per mini-batch : 0.10158218145370483
Throughput [img/sec] : 630.0317544289736=
```
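The solution above is the usual eager-global versus lazy function-local-static trade-off. Below is a minimal self-contained sketch of that pattern, with `query_warp_size` standing in for `at::cuda::warp_size()`; it is an assumption for illustration, not the actual PyTorch code.

```cpp
#include <cstdio>

// Hypothetical stand-in for at::cuda::warp_size(); in the real code this
// queries device properties and therefore requires a visible GPU.
static int query_warp_size() {
  std::puts("querying device properties...");
  return 64;
}

// Problematic pattern: the initializer runs at library load time,
// before any GPU may be visible (e.g. on a CPU-only build machine).
// static const int kWarpSize = query_warp_size();

// Fixed pattern: a static function with a function-local static, so the
// device query is deferred until the first call at runtime.
static int warp_size_cached() {
  static const int ws = query_warp_size();  // initialized on first use only
  return ws;
}

int main() {
  std::printf("warp size = %d\n", warp_size_cached());
  return 0;
}
```

Because the function-local static is initialized on first call, a CPU-only machine never triggers the device query unless something actually asks for the warp size.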
jagadish-amd pushed a commit to jagadish-amd/pytorch that referenced this pull request on Jul 9, 2025
pytorchmergebot pushed a commit that referenced this pull request on Jul 19, 2025
For a tensor with a non-zero offset, the offset must be multiplied by the element size.

Add a regression test by creating a Tensor in an array of 6 elements with offset 3, which before the fix crashed with
```
C++ exception with description "setStorage: sizes [3, 3], strides [0, 1], storage offset 3, and itemsize 4 requiring a storage size of 24 are out of bounds for storage of size 15
Exception raised from checkInBoundsForStorage at /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/Resize.h:123 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) + 56 (0x104a9cd44 in libc10.dylib)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) + 120 (0x104a9a05c in libc10.dylib)
frame #2: void at::native::checkInBoundsForStorage<long long>(c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long, caffe2::TypeMeta const&, c10::Storage const&) + 656 (0x111dbd314 in libtorch_cpu.dylib)
frame #3: void at::native::setStrided<long long>(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, long long) + 152 (0x111dcd22c in libtorch_cpu.dylib)
frame #4: at::native::as_strided_tensorimpl(at::Tensor const&, c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) + 312 (0x111dccf98 in libtorch_cpu.dylib)
frame #5: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CPU__as_strided(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>>>, at::Tensor (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 104 (0x1129a1e94 in libtorch_cpu.dylib)
frame #6: at::_ops::as_strided::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>, std::__1::optional<c10::SymInt>) + 476 (0x112200ad0 in libtorch_cpu.dylib)
frame #7: at::Tensor::as_strided(c10::ArrayRef<long long>, c10::ArrayRef<long long>, std::__1::optional<long long>) const + 236 (0x1115db098 in libtorch_cpu.dylib)
frame #8: at::native::expand(at::Tensor const&, c10::ArrayRef<long long>, bool) + 348 (0x111dcc0d4 in libtorch_cpu.dylib)
frame #9: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::ADInplaceOrView::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 116 (0x1157ac410 in libtorch_cpu.dylib)
frame #10: c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool), &torch::autograd::VariableType::(anonymous namespace)::expand(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool>>, at::Tensor (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 992 (0x114e8b010 in libtorch_cpu.dylib)
frame #11: at::_ops::expand::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, bool) + 316 (0x112743c90 in libtorch_cpu.dylib)
frame #12: at::expand_size(at::Tensor const&, c10::ArrayRef<long long>) + 164 (0x1047d82b4 in basic)
frame #13: BasicTest_TestForBlobResizeCPU_Test::TestBody() + 284 (0x1047d8048 in basic)
```
Pull Request resolved: #158690
Approved by: https://github.com/angelayi
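For illustration, the bounds arithmetic behind the quoted error can be recomputed with a small sketch. `required_storage_bytes` is a hypothetical helper, not the actual `checkInBoundsForStorage` code; the point is that the storage offset is counted in elements and therefore has to be scaled by the item size.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Compute the number of bytes a strided view needs from its storage.
// The storage offset is in elements, so it is scaled by itemsize together
// with the rest of the extent.
int64_t required_storage_bytes(const std::vector<int64_t>& sizes,
                               const std::vector<int64_t>& strides,
                               int64_t storage_offset,
                               int64_t itemsize) {
  // Largest linear element index reachable by the view, relative to the storage base.
  int64_t max_index = storage_offset;
  for (size_t d = 0; d < sizes.size(); ++d) {
    if (sizes[d] == 0) return 0;  // an empty view needs no storage
    max_index += (sizes[d] - 1) * strides[d];
  }
  return (max_index + 1) * itemsize;
}

int main() {
  // The case from the quoted error: sizes [3, 3], strides [0, 1],
  // storage offset 3, itemsize 4.
  std::printf("%lld\n",
              static_cast<long long>(required_storage_bytes({3, 3}, {0, 1}, 3, 4)));
  return 0;
}
```

For the quoted case this prints 24, matching the "requiring a storage size of 24" in the error message above.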