[CUDA graphs] Make sure graph mempool cudaMalloc_count decrement pairs with cudaFree for all allocations #61567

Closed
wants to merge 2 commits

Conversation

mcarilli (Collaborator) commented Jul 13, 2021

Graph mempools aren't deleted until all of their allocations have been cudaFreed. PrivatePool::cudaMalloc_count tracks the number of outstanding (not-yet-cudaFreed) allocations.

#44742 moves the cudaFree call into release_block, while the cudaMalloc_count decrement (when needed) remains in a caller (release_blocks). But I noticed there's another path (release_available_cached_blocks) that calls release_block without going through release_blocks; in other words, it calls cudaFree but skips any potential cudaMalloc_count decrement.

In practice, given how the code is currently organized, I don't think this second path can turn a pool into a zombie whose cudaMalloc_count never reaches zero: that could only happen if release_available_cached_blocks were called on a private pool, it would only be called on a private pool while a capture is underway, and while a capture is underway the cudaFree call hard errors anyway. Regardless, I feel much more comfortable keeping the cudaMalloc_count decrement right next to the cudaFree.
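For illustration, here is a minimal sketch of the bookkeeping this PR keeps paired, not the actual CUDACachingAllocator code: Block, PrivatePool, and release_block below are simplified stand-ins for the real structs, but they show why placing the decrement inside release_block covers every call path that reaches cudaFree.

```cpp
// Sketch only: simplified stand-ins for the allocator's Block / PrivatePool types.
#include <cuda_runtime.h>
#include <memory>

struct PrivatePool {
  // Number of outstanding (not-yet-cudaFreed) allocations owned by this pool.
  // A graph's private pool can only be destroyed once this reaches zero.
  int cudaMalloc_count = 0;
};

struct Block {
  void* ptr = nullptr;
  std::shared_ptr<PrivatePool> owner_pool;  // null for the ordinary (non-graph) pool
};

// Frees the block's device memory. Because the decrement lives here, every caller
// that ends up cudaFreeing a block (release_blocks, release_available_cached_blocks,
// etc.) also updates cudaMalloc_count, instead of relying on one particular caller.
void release_block(Block& block) {
  cudaFree(block.ptr);
  if (block.owner_pool) {
    --block.owner_pool->cudaMalloc_count;
  }
  block.ptr = nullptr;
}
```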

facebook-github-bot (Contributor) commented Jul 13, 2021

💊 CI failures summary and remediations

As of commit ef19382 (more details on the Dr. CI page and at hud.pytorch.org/pr/61567):


  • 5/5 failures introduced in this PR

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (1/3)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
Auto-merging .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/job-specs/binary-job-specs.yml
Auto-merging .circleci/verbatim-sources/job-specs/binary-job-specs.yml
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/commands.yml
Auto-merging .circleci/verbatim-sources/commands.yml
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py
Auto-merging .circleci/cimodel/data/pytorch_build_data.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_macos_10_13_py3_test (2/3)

Step: "Test"

Jul 19 22:20:33 test_remote_message_script_de...yUniqueId(created_on=0, local_id=0) to be created.
Jul 19 22:20:05 frame #12: std::__1::__function::__func<std::__1::__bind<torch::distributed::rpc::ProcessGroupAgent::enqueueRecv(torch::distributed::rpc::RecvWork)::$_6, torch::distributed::rpc::RecvWork>, std::__1::allocator<std::__1::__bind<torch::distributed::rpc::ProcessGroupAgent::enqueueRecv(torch::distributed::rpc::RecvWork)::$_6, torch::distributed::rpc::RecvWork> >, void ()>::operator()() + 42 (0x11351abea in libtorch_cpu.dylib)
Jul 19 22:20:05 frame #13: c10::ThreadPool::main_loop(unsigned long) + 569 (0x10dbea389 in libc10.dylib)
Jul 19 22:20:05 frame #14: void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, c10::ThreadPool::ThreadPool(int, int, std::__1::function<void ()>)::$_0> >(void*) + 67 (0x10dbeaa33 in libc10.dylib)
Jul 19 22:20:05 frame #15: _pthread_start + 148 (0x7fff70fe1109 in libsystem_pthread.dylib)
Jul 19 22:20:05 frame #16: thread_start + 15 (0x7fff70fdcb8b in libsystem_pthread.dylib)
Jul 19 22:20:05 
Jul 19 22:20:05 ok (4.069s)
Jul 19 22:20:14   test_remote_message_dropped_pickle (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.405s)
Jul 19 22:20:22   test_remote_message_dropped_pickle_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.167s)
Jul 19 22:20:29   test_remote_message_script_delay_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (7.035s)
Jul 19 22:20:33   test_remote_message_script_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:555] Received error while processing request type 260: falseINTERNAL ASSERT FAILED at "../torch/csrc/distributed/rpc/rref_context.cpp":390, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Jul 19 22:20:33 Exception raised from getOwnerRRef at ../torch/csrc/distributed/rpc/rref_context.cpp:390 (most recent call first):
Jul 19 22:20:33 frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 98 (0x114e3d6d2 in libc10.dylib)
Jul 19 22:20:33 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 106 (0x114e3be4a in libc10.dylib)
Jul 19 22:20:33 frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 64 (0x114e3c080 in libc10.dylib)
Jul 19 22:20:33 frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 1711 (0x11d3ab0ef in libtorch_cpu.dylib)
Jul 19 22:20:33 frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> >) const + 86 (0x11d395946 in libtorch_cpu.dylib)
Jul 19 22:20:33 frame #5: torch::distributed::rpc::RequestCallbackImpl::processScriptRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 376 (0x1144370e8 in libtorch_python.dylib)
Jul 19 22:20:33 frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 437 (0x11d394595 in libtorch_cpu.dylib)
Jul 19 22:20:33 frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 74 (0x114437e5a in libtorch_python.dylib)
Jul 19 22:20:33 frame #8: c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> > c10::ivalue::Future::thenAsync<torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const::$_1>(torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const::$_1, std::__1::shared_ptr<c10::Type>)::'lambda'(c10::ivalue::Future&)::operator()(c10::ivalue::Future&) + 223 (0x11d39c25f in libtorch_cpu.dylib)

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (3/3)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
Auto-merging .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/job-specs/binary-job-specs.yml
Auto-merging .circleci/verbatim-sources/job-specs/binary-job-specs.yml
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/commands.yml
Auto-merging .circleci/verbatim-sources/commands.yml
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py
Auto-merging .circleci/cimodel/data/pytorch_build_data.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1


2 failures not recognized by patterns:

  • GitHub Actions Lint / quick-checks: Ensure no trailing spaces
  • GitHub Actions Lint / shellcheck: Assert that regenerating the workflows didn't change them

Preview docs built from this PR


@mcarilli requested review from ngimel and ezyang on July 13, 2021 02:19
@mcarilli added the module: cuda graphs (Ability to capture and then replay streams of CUDA kernels) and module: cuda (Related to torch.cuda, and CUDA support in general) labels on Jul 13, 2021
@H-Huang added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Jul 13, 2021
facebook-github-bot (Contributor) commented:

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
facebook-github-bot (Contributor) commented:

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot (Contributor) commented:

@ezyang merged this pull request in ffd2e60.

Labels
cla signed · Merged · open source · module: cuda graphs (Ability to capture and then replay streams of CUDA kernels) · module: cuda (Related to torch.cuda, and CUDA support in general) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)