
Conversation

[ghstack-poisoned]
facebook-github-bot (Contributor) commented on Sep 6, 2022


❌ 1 New Failure

As of commit d6ec123 (more details on the Dr. CI page):

  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build periodic / linux-bionic-cuda11.6-py3.9-gcc7 / test (multigpu, 1, 1, linux.16xlarge.nvidia.gpu) (1/1)

Step: "Get workflow job id" (full log | diagnosis details)

2022-09-07T02:50:51.1439984Z frame #37: clone + 0x3f (0x7f3a4d28661f in /lib/x86_64-linux-gnu/libc.so.6)
2022-09-07T02:50:51.1440333Z 
2022-09-07T02:50:51.1440357Z 
2022-09-07T02:50:51.1440508Z On WorkerInfo(id=3, name=worker3):
2022-09-07T02:50:51.1452734Z RuntimeError('Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!\nException raised from compute_types at /var/lib/jenkins/workspace/aten/src/ATen/TensorIterator.cpp:484 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f6807b0bcab in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)\nframe #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xce (0x7f6807b0767e in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)\nframe #2: at::TensorIteratorBase::compute_types(at::TensorIteratorConfig const&) + 0xbbb (0x7f68121159bb in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #3: at::TensorIteratorBase::build(at::TensorIteratorConfig&) + 0x7f (0x7f6812116ddf in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #4: at::TensorIteratorBase::build_borrowing_binary_op(at::TensorBase const&, at::TensorBase const&, at::TensorBase const&) + 0xf2 (0x7f6812118482 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #5: at::meta::structured_add_Tensor::meta(at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x2e (0x7f68122f575e in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #6: <unknown function> + 0x2a489ae (0x7f680a7959ae in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cu.so)\nframe #7: <unknown function> + 0x2a48ab6 (0x7f680a795ab6 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cu.so)\nframe #8: at::_ops::add_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x98 (0x7f6812d19d88 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #9: <unknown function> + 0x324a12a (0x7f68144a612a in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #10: <unknown function> + 0x324a899 (0x7f68144a6899 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #11: at::_ops::add_Tensor::call(at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x172 (0x7f6812d4dd42 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #12: <unknown function> + 0x32b5c7 (0x7f681ef685c7 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)\nframe #13: <unknown function> + 0x32b8e6 (0x7f681ef688e6 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)\nframe #14: <unknown function> + 0x1ddc68 (0x5610128bfc68 in /opt/conda/bin/python)\nframe #15: <unknown function> + 0x199499 (0x56101287b499 in /opt/conda/bin/python)\nframe #16: <unknown function> + 0x1995fa (0x56101287b5fa in /opt/conda/bin/python)\nframe #17: PyNumber_Add + 0x41 (0x5610128274b1 in /opt/conda/bin/python)\nframe #18: _PyEval_EvalFrameDefault + 0x1008 (0x5610128c4098 in /opt/conda/bin/python)\nframe #19: <unknown function> + 0x18f742 (0x561012871742 in /opt/conda/bin/python)\nframe #20: _PyObject_Call + 0x20a (0x561012829faa in /opt/conda/bin/python)\nframe #21: _PyEval_EvalFrameDefault + 0x26e4 (0x5610128c5774 in /opt/conda/bin/python)\nframe #22: <unknown function> + 0x18f742 (0x561012871742 in /opt/conda/bin/python)\nframe #23: _PyObject_Call + 0x20a (0x561012829faa in /opt/conda/bin/python)\nframe #24: <unknown function> + 
0xa2577a (0x7f681f66277a in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)\nframe #25: torch::distributed::rpc::PythonRpcHandler::runPythonUdf(pybind11::object const&) + 0x7d (0x7f681f6609bd in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)\nframe #26: torch::distributed::rpc::RequestCallbackImpl::runPythonFunction(pybind11::object const&, std::vector<c10::Stream, std::allocator<c10::Stream> >, bool) const + 0x85 (0x7f681f663b55 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)\nframe #27: torch::distributed::rpc::RequestCallbackImpl::processPythonCall(torch::distributed::rpc::RpcCommandBase&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x96 (0x7f681f6676f6 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_python.so)\nframe #28: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x10c (0x7f681587117c in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)\nframe #29: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x65 (0x7f681f663835 in /opt/conda/lib/python3.10/site-packages/torch/lib/libto
2022-09-07T02:50:51.1460266Z Traceback (most recent call last):
2022-09-07T02:50:51.1460829Z   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/rpc/internal.py", line 206, in _run_function
2022-09-07T02:50:51.1461297Z     result = python_udf.func(*python_udf.args, **python_udf.kwargs)
2022-09-07T02:50:51.1461937Z   File "/opt/conda/lib/python3.10/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 5911, in _gpu_add_wrong_gpus
2022-09-07T02:50:51.1462374Z     return x.cpu() + y.cuda()
2022-09-07T02:50:51.1462773Z RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
2022-09-07T02:50:51.1463338Z Exception raised from compute_types at /var/lib/jenkins/workspace/aten/src/ATen/TensorIterator.cpp:484 (most recent call first):
2022-09-07T02:50:51.1464224Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6b (0x7f6807b0bcab in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
2022-09-07T02:50:51.1465223Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xce (0x7f6807b0767e in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
2022-09-07T02:50:51.1466139Z frame #2: at::TensorIteratorBase::compute_types(at::TensorIteratorConfig const&) + 0xbbb (0x7f68121159bb in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
2022-09-07T02:50:51.1466943Z frame #3: at::TensorIteratorBase::build(at::TensorIteratorConfig&) + 0x7f (0x7f6812116ddf in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
2022-09-07T02:50:51.1467852Z frame #4: at::TensorIteratorBase::build_borrowing_binary_op(at::TensorBase const&, at::TensorBase const&, at::TensorBase const&) + 0xf2 (0x7f6812118482 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
2022-09-07T02:50:51.1468851Z frame #5: at::meta::structured_add_Tensor::meta(at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x2e (0x7f68122f575e in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
2022-09-07T02:50:51.1469605Z frame #6: <unknown function> + 0x2a489ae (0x7f680a7959ae in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cu.so)
2022-09-07T02:50:51.1470271Z frame #7: <unknown function> + 0x2a48ab6 (0x7f680a795ab6 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda_cu.so)
2022-09-07T02:50:51.1471119Z frame #8: at::_ops::add_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&) + 0x98 (0x7f6812d19d88 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

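For context on the failure above: the RuntimeError is PyTorch's standard device-mismatch check, which the RPC test helper `_gpu_add_wrong_gpus` appears to trigger intentionally by adding a CPU tensor to a CUDA tensor (`return x.cpu() + y.cuda()`). The sketch below is illustrative only, not code from this PR, and reproduces the same error outside of RPC; it assumes a machine with at least one CUDA device.

```python
import torch

# Illustrative sketch of the device-mismatch RuntimeError seen in the CI log above.
# Standalone example, not code from this PR; requires at least one CUDA device.
if torch.cuda.is_available():
    x = torch.ones(2, 2)          # stays on the CPU
    y = torch.ones(2, 2).cuda()   # placed on cuda:0

    try:
        x + y                     # binary ops refuse to mix cpu and cuda tensors
    except RuntimeError as err:
        print(err)                # "Expected all tensors to be on the same device, ..."

    # Putting both operands on the same device makes the addition valid.
    result = x.cuda() + y
```

In the RPC setting, the same exception is raised on the remote worker and re-raised on the caller, which is why the log shows both the worker-side C++ stack (`On WorkerInfo(id=3, name=worker3): RuntimeError(...)`) and the Python traceback from `torch/distributed/rpc/internal.py`.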

facebook-github-bot added the oncall: distributed label on Sep 6, 2022
mrshenli added the ciflow/trunk and ciflow/periodic labels and removed the oncall: distributed label on Sep 6, 2022
mrshenli added a commit that referenced this pull request Sep 6, 2022
ghstack-source-id: 883f59e
Pull Request resolved: #84604
facebook-github-bot pushed a commit that referenced this pull request Sep 8, 2022
Summary:
Pull Request resolved: #84604
Approved by: https://github.com/wanchaol

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/f43c38bdc820650ad974bb1c48360b0c6931961a

Reviewed By: izaitsevfb

Differential Revision: D39308908

Pulled By: mrshenli

fbshipit-source-id: 69e03ce1fdd9feafd7da65b9273e9d353cc58d3a
facebook-github-bot deleted the gh/mrshenli/340/head branch on September 10, 2022 at 14:19

Labels

ciflow/periodic · ciflow/trunk · cla signed · release notes: distributed (c10d)
