Revert "Revert D25547962: Make tls_local_dispatch_key_set inlineable" #49604

Closed
malfet wants to merge 1 commit from the malfet/reinstate-tls-change branch

Conversation

malfet (Contributor) commented Dec 18, 2020

This reverts commit 19dc5e9.
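For context, the re-landed change in the title ("Make tls_local_dispatch_key_set inlineable") follows a common pattern. The following is a minimal, hypothetical C++ sketch of that pattern only; the struct and variable names are illustrative and do not match PyTorch's actual c10 internals. The idea is to expose the thread-local state and its accessor in a header and mark the accessor inline, so hot-path callers read TLS directly instead of making an out-of-line call into another translation unit.

struct LocalDispatchKeySet {
  unsigned long long included_ = 0;
  unsigned long long excluded_ = 0;
};

// Declared in the header; defined exactly once in a single .cpp file.
extern thread_local LocalDispatchKeySet raw_local_dispatch_key_set;

// Inline accessor: the compiler can fold this into a direct TLS read at
// each call site instead of emitting a function call.
inline LocalDispatchKeySet tls_local_dispatch_key_set() {
  return raw_local_dispatch_key_set;
}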

facebook-github-bot (Contributor) commented Dec 18, 2020

💊 CI failures summary and remediations

As of commit d562bef (more details on the Dr. CI page):



🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_macos_10_13_py3_test (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Dec 18 20:13:51 test_udf_remote_message_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:636] Received error while processing request type 261: false INTERNAL ASSERT FAILED at "../torch/csrc/distributed/rpc/rref_context.cpp":379, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Dec 18 20:13:11 frame #10: c10::ThreadPool::main_loop(unsigned long) + 569 (0x1040ceae9 in libc10.dylib)
Dec 18 20:13:11 frame #11: void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, c10::ThreadPool::ThreadPool(int, int, std::__1::function<void ()>)::$_0> >(void*) + 67 (0x1040cf163 in libc10.dylib)
Dec 18 20:13:11 frame #12: _pthread_start + 148 (0x7fff69254109 in libsystem_pthread.dylib)
Dec 18 20:13:11 frame #13: thread_start + 15 (0x7fff6924fb8b in libsystem_pthread.dylib)
Dec 18 20:13:11 
Dec 18 20:13:11 ok (3.840s)
Dec 18 20:13:27   test_rpc_builtin_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (15.652s)
Dec 18 20:13:36   test_rpc_script_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (9.464s)
Dec 18 20:13:40   test_rref_to_here_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (3.687s)
Dec 18 20:13:48   test_udf_remote_message_delay_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (7.852s)
Dec 18 20:13:51   test_udf_remote_message_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:636] Received error while processing request type 261: false INTERNAL ASSERT FAILED at "../torch/csrc/distributed/rpc/rref_context.cpp":379, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Dec 18 20:13:51 Exception raised from getOwnerRRef at ../torch/csrc/distributed/rpc/rref_context.cpp:379 (most recent call first):
Dec 18 20:13:51 frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 98 (0x123bea0e2 in libc10.dylib)
Dec 18 20:13:51 frame #1: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 1492 (0x11c527304 in libtorch_cpu.dylib)
Dec 18 20:13:51 frame #2: torch::distributed::rpc::RequestCallbackImpl::processPythonRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::__1::function<void (torch::distributed::rpc::Message)> const&, long long, std::__1::shared_ptr<torch::utils::Future<torch::distributed::rpc::Message> > const&) const + 138 (0x118b9deca in libtorch_python.dylib)
Dec 18 20:13:51 frame #3: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, long long, std::__1::shared_ptr<torch::utils::Future<torch::distributed::rpc::Message> > const&) const + 603 (0x11c51798b in libtorch_cpu.dylib)
Dec 18 20:13:51 frame #4: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, long long, std::__1::shared_ptr<torch::utils::Future<torch::distributed::rpc::Message> > const&) const + 37 (0x118b9ff05 in libtorch_python.dylib)
Dec 18 20:13:51 frame #5: std::__1::__function::__func<torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&) const::$_0, std::__1::allocator<torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&) const::$_0>, void ()>::operator()() + 175 (0x11c51c80f in libtorch_cpu.dylib)
Dec 18 20:13:51 frame #6: torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&) const + 473 (0x11c516db9 in libtorch_cpu.dylib)
Dec 18 20:13:51 frame #7: torch::distributed::rpc::RequestCallback::operator()(torch::distributed::rpc::Message&) const + 15 (0x11c516a7f in libtorch_cpu.dylib)
Dec 18 20:13:51 frame #8: torch::distributed::rpc::ProcessGroupAgent::handleRecv(torch::distributed::rpc::RecvWork&) + 169 (0x118b78079 in libtorch_python.dylib)

❄️ 1 failure tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_build (1/1)

Step: "Set Up System Environment" (full log | diagnosis details | 🔁 rerun) ❄️

gpg: no valid OpenPGP data found.
+ curl --retry 3 -s -L https://packagecloud.io/circleci/trusty/gpgkey
+ sudo apt-key add -
gpg: no valid OpenPGP data found.


Exited with code exit status 2

🚧 1 fixed upstream failure: this was probably caused by an upstream breakage that was already fixed.
Please rebase on the viable/strict branch.

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

This comment has been revised 10 times.

facebook-github-bot (Contributor) left a comment

@malfet has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

mruberry (Collaborator) commented Dec 18, 2020

@malfet Would you elaborate on the latest findings from your investigation to help us understand the context for this revert of a revert? I'm guessing you've identified #49364 as the source of the Windows failures?

malfet (Contributor, Author) commented Dec 18, 2020

@mruberry, sure, I will try to summarize it all in #49558.
In short, in #49587 I reverted all changes except the TLS one and was still able to reproduce the stack-overflow failure (SEH 0xC00000FD is STATUS_STACK_OVERFLOW), which makes the TLS change an unlikely culprit; that is what I'm validating in this PR.
As for #49364, as @swolchok pointed out, TensorList is actually an ArrayRef, so it technically should be safe to pass by value.
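To illustrate the ArrayRef point, here is a hedged C++ sketch using a stand-in type (SimpleArrayRef is illustrative, not the real c10::ArrayRef): an ArrayRef is just a non-owning pointer-plus-length view, so passing it by value copies two words rather than the tensors it refers to.

#include <cstddef>
#include <vector>

// Minimal stand-in for c10::ArrayRef<T> (illustrative only): a non-owning
// pointer-plus-length view. Copying it copies the small view object, not the
// elements, which is why passing a TensorList (an ArrayRef of Tensors) by
// value is cheap.
template <typename T>
class SimpleArrayRef {
 public:
  SimpleArrayRef(const T* data, std::size_t size) : data_(data), size_(size) {}
  SimpleArrayRef(const std::vector<T>& vec) : data_(vec.data()), size_(vec.size()) {}

  const T* begin() const { return data_; }
  const T* end() const { return data_ + size_; }
  std::size_t size() const { return size_; }

 private:
  const T* data_;      // borrowed, not owned
  std::size_t size_;
};

// Pass-by-value: only the small view is copied, never the underlying storage.
double sum(SimpleArrayRef<double> values) {
  double total = 0.0;
  for (double v : values) {
    total += v;
  }
  return total;
}

int main() {
  std::vector<double> xs = {1.0, 2.0, 3.0};
  return sum(xs) == 6.0 ? 0 : 1;  // vector converts implicitly to the view
}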

mruberry (Collaborator) commented:

Interesting that this PR seems to have triggered our other intermittent Windows failure. See:

https://app.circleci.com/pipelines/github/pytorch/pytorch/253164/workflows/827eb258-9967-41a7-8900-dd7d24e76b37/jobs/9741875

cc #49596

malfet closed this on Feb 2, 2021.
malfet deleted the malfet/reinstate-tls-change branch on February 2, 2021 at 00:34.