[CUDA graphs] Make sure graph mempool cudaMalloc_count decrement pairs with cudaFree for all allocations #61567

Closed
wants to merge 2 commits

Conversation

mcarilli (Collaborator) commented Jul 13, 2021

Graph mempools aren't deleted until all of their allocations have been cudaFreed. PrivatePool::cudaMalloc_count tracks the number of outstanding (not-yet-cudaFreed) allocations.

#44742 moves the cudaFree call into release_block, while the cudaMalloc_count decrement (when needed) remains in a caller (release_blocks). But I noticed there's another path (release_available_cached_blocks) that calls release_block without going through release_blocks; in other words, it calls cudaFree but skips any potential cudaMalloc_count decrement.

In practice, given how the code is currently organized, I don't think this second path can turn a pool into a zombie whose cudaMalloc_count never reaches zero: that could only happen if release_available_cached_blocks were called on a private pool, it would only be called on a private pool while a capture is underway, and while a capture is underway the cudaFree call hard errors anyway. Regardless, I feel much more comfortable keeping the cudaMalloc_count decrement right next to the cudaFree.
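For illustration, here is a minimal sketch of the bookkeeping this PR keeps paired, not the actual CUDACachingAllocator code: Block, PrivatePool, and release_block below are simplified stand-ins for the real structs, but they show why placing the decrement inside release_block covers every call path that reaches cudaFree.

```cpp
// Sketch only: simplified stand-ins for the allocator's Block / PrivatePool types.
#include <cuda_runtime.h>
#include <memory>

struct PrivatePool {
  // Number of outstanding (not-yet-cudaFreed) allocations owned by this pool.
  // A graph's private pool can only be destroyed once this reaches zero.
  int cudaMalloc_count = 0;
};

struct Block {
  void* ptr = nullptr;
  std::shared_ptr<PrivatePool> owner_pool;  // null for the ordinary (non-graph) pool
};

// Frees the block's device memory. Because the decrement lives here, every caller
// that ends up cudaFreeing a block (release_blocks, release_available_cached_blocks,
// etc.) also updates cudaMalloc_count, instead of relying on one particular caller.
void release_block(Block& block) {
  cudaFree(block.ptr);
  if (block.owner_pool) {
    --block.owner_pool->cudaMalloc_count;
  }
  block.ptr = nullptr;
}
```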

facebook-github-bot (Contributor) commented Jul 13, 2021

💊 CI failures summary and remediations

As of commit ef19382 (more details on the Dr. CI page and at hud.pytorch.org/pr/61567):


  • 5/5 failures introduced in this PR

🕵️ 3 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (1/3)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
Auto-merging .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/job-specs/binary-job-specs.yml
Auto-merging .circleci/verbatim-sources/job-specs/binary-job-specs.yml
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/commands.yml
Auto-merging .circleci/verbatim-sources/commands.yml
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py
Auto-merging .circleci/cimodel/data/pytorch_build_data.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1

See CircleCI build pytorch_macos_10_13_py3_test (2/3)

Step: "Test"

Jul 19 22:20:33 test_remote_message_script_de...yUniqueId(created_on=0, local_id=0) to be created.
Jul 19 22:20:05 frame #12: std::__1::__function::__func<std::__1::__bind<torch::distributed::rpc::ProcessGroupAgent::enqueueRecv(torch::distributed::rpc::RecvWork)::$_6, torch::distributed::rpc::RecvWork>, std::__1::allocator<std::__1::__bind<torch::distributed::rpc::ProcessGroupAgent::enqueueRecv(torch::distributed::rpc::RecvWork)::$_6, torch::distributed::rpc::RecvWork> >, void ()>::operator()() + 42 (0x11351abea in libtorch_cpu.dylib)
Jul 19 22:20:05 frame #13: c10::ThreadPool::main_loop(unsigned long) + 569 (0x10dbea389 in libc10.dylib)
Jul 19 22:20:05 frame #14: void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, c10::ThreadPool::ThreadPool(int, int, std::__1::function<void ()>)::$_0> >(void*) + 67 (0x10dbeaa33 in libc10.dylib)
Jul 19 22:20:05 frame #15: _pthread_start + 148 (0x7fff70fe1109 in libsystem_pthread.dylib)
Jul 19 22:20:05 frame #16: thread_start + 15 (0x7fff70fdcb8b in libsystem_pthread.dylib)
Jul 19 22:20:05 
Jul 19 22:20:05 ok (4.069s)
Jul 19 22:20:14   test_remote_message_dropped_pickle (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.405s)
Jul 19 22:20:22   test_remote_message_dropped_pickle_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (8.167s)
Jul 19 22:20:29   test_remote_message_script_delay_timeout (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... ok (7.035s)
Jul 19 22:20:33   test_remote_message_script_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTestWithSpawn) ... [E request_callback_no_python.cpp:555] Received error while processing request type 260: falseINTERNAL ASSERT FAILED at "../torch/csrc/distributed/rpc/rref_context.cpp":390, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Jul 19 22:20:33 Exception raised from getOwnerRRef at ../torch/csrc/distributed/rpc/rref_context.cpp:390 (most recent call first):
Jul 19 22:20:33 frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 98 (0x114e3d6d2 in libc10.dylib)
Jul 19 22:20:33 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 106 (0x114e3be4a in libc10.dylib)
Jul 19 22:20:33 frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 64 (0x114e3c080 in libc10.dylib)
Jul 19 22:20:33 frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 1711 (0x11d3ab0ef in libtorch_cpu.dylib)
Jul 19 22:20:33 frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> >) const + 86 (0x11d395946 in libtorch_cpu.dylib)
Jul 19 22:20:33 frame #5: torch::distributed::rpc::RequestCallbackImpl::processScriptRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 376 (0x1144370e8 in libtorch_python.dylib)
Jul 19 22:20:33 frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 437 (0x11d394595 in libtorch_cpu.dylib)
Jul 19 22:20:33 frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 74 (0x114437e5a in libtorch_python.dylib)
Jul 19 22:20:33 frame #8: c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> > c10::ivalue::Future::thenAsync<torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const::$_1>(torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const::$_1, std::__1::shared_ptr<c10::Type>)::'lambda'(c10::ivalue::Future&)::operator()(c10::ivalue::Future&) + 223 (0x11d39c25f in libtorch_cpu.dylib)

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (3/3)

Step: "(Optional) Merge target branch"

Automatic merge failed; fix conflicts and then commit the result.
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
Auto-merging .circleci/verbatim-sources/job-specs/pytorch-job-specs.yml
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/job-specs/binary-job-specs.yml
Auto-merging .circleci/verbatim-sources/job-specs/binary-job-specs.yml
CONFLICT (add/add): Merge conflict in .circleci/verbatim-sources/commands.yml
Auto-merging .circleci/verbatim-sources/commands.yml
CONFLICT (add/add): Merge conflict in .circleci/config.yml
Auto-merging .circleci/config.yml
CONFLICT (add/add): Merge conflict in .circleci/cimodel/data/pytorch_build_data.py
Auto-merging .circleci/cimodel/data/pytorch_build_data.py
Automatic merge failed; fix conflicts and then commit the result.


Exited with code exit status 1


2 failures not recognized by patterns:

  • GitHub Actions Lint / quick-checks: Ensure no trailing spaces
  • GitHub Actions Lint / shellcheck: Assert that regenerating the workflows didn't change them

Preview docs built from this PR


@mcarilli requested review from ngimel and ezyang on July 13, 2021 02:19
@mcarilli added the module: cuda graphs (Ability to capture and then replay streams of CUDA kernels) and module: cuda (Related to torch.cuda, and CUDA support in general) labels on Jul 13, 2021
@H-Huang added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Jul 13, 2021
facebook-github-bot (Contributor) commented:

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
facebook-github-bot (Contributor) commented:

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot (Contributor) commented:

@ezyang merged this pull request in ffd2e60.

Labels
cla signed · Merged · open source · module: cuda graphs (Ability to capture and then replay streams of CUDA kernels) · module: cuda (Related to torch.cuda, and CUDA support in general) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)