
[foxy] Remove finalization of guard_condition from GraphListener::__shutdown() #1401

Merged
merged 1 commit into from
Oct 13, 2020

Conversation

brawner
Contributor

@brawner brawner commented Oct 12, 2020

The unit test error_run_graph_listener_destroy_context (copied below) was added in #1330; it destroys a context before calling run() on the graph_listener (and ultimately before shutdown). This led to a memory read error on shutdown_guard_condition_ when trying to finalize it. A special case had been added in #906, which would try to finalize this guard condition in the event that the context was already destroyed.

This addresses #1391

However, I'm not sure if this PR is the correct solution. In #906, this situation was specifically called out in (#906 (comment)) and its reply.

@wjwwood

@ivanpauno well... there may still be a resource ownership issue here. The context owns the guard condition, and the wait set in the graph listener uses that guard condition. So by removing strong ownership of the context from the graph listener, the graph listener no longer has ownership of the guard condition it is using. This could lead to the context cleaning up the guard condition during its destruction, and the graph listener later trying to continue using the guard condition incorrectly...

@ivanpauno

The shutdown_guard_condition_ should still be valid, as the context only clears it if you call release_interrupt_guard_condition.
It's true that in cases where you can't call release_interrupt_guard_condition, rcl_guard_condition_fini should be called manually.

In fact, there is a memory leak in the error_run_graph_listener_destroy_context test with this PR (but no memory errors).

TEST_F(TestGraphListener, error_run_graph_listener_destroy_context) {
  auto context_to_destroy = std::make_shared<rclcpp::contexts::DefaultContext>();
  context_to_destroy->init(0, nullptr);
  auto graph_listener_error =
    std::make_shared<TestGraphListenerProtectedMethods>(context_to_destroy);
  context_to_destroy.reset();
  EXPECT_NO_THROW(graph_listener_error->run_protected());
}
==2997306== 
==2997306== HEAP SUMMARY:
==2997306==     in use at exit: 48,592 bytes in 102 blocks
==2997306==   total heap usage: 9,040 allocs, 8,938 frees, 4,426,257 bytes allocated
==2997306== 
==2997306== 144 (56 direct, 88 indirect) bytes in 1 blocks are definitely lost in loss record 23 of 35
==2997306==    at 0x483D7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==2997306==    by 0x566120C: __default_allocate (allocator.c:37)
==2997306==    by 0x5B2BF4E: __rcl_guard_condition_init_from_rmw_impl (guard_condition.c:73)
==2997306==    by 0x5B2C15E: rcl_guard_condition_init (guard_condition.c:108)
==2997306==    by 0x4FCD7BC: rclcpp::Context::get_interrupt_guard_condition(rcl_wait_set_t*) (context.cpp:404)
==2997306==    by 0x5025903: rclcpp::graph_listener::GraphListener::GraphListener(std::shared_ptr<rclcpp::Context>) (graph_listener.cpp:57)
==2997306==    by 0x5080E3D: std::shared_ptr<rclcpp::graph_listener::GraphListener> rclcpp::Context::get_sub_context<rclcpp::graph_listener::GraphListener, std::shared_ptr<rclcpp::Context> >(std::shared_ptr<rclcpp::Context>&&) (context.hpp:325)
==2997306==    by 0x5078DAB: rclcpp::node_interfaces::NodeGraph::NodeGraph(rclcpp::node_interfaces::NodeBaseInterface*) (node_graph.cpp:41)
==2997306==    by 0x505A7A4: rclcpp::Node::Node(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::NodeOptions const&) (node.cpp:108)
==2997306==    by 0x167C19: void __gnu_cxx::new_allocator<rclcpp::Node>::construct<rclcpp::Node, char const (&) [5], char const (&) [3]>(rclcpp::Node*, char const (&) [5], char const (&) [3]) (new_allocator.h:147)
==2997306==    by 0x167188: void std::allocator_traits<std::allocator<rclcpp::Node> >::construct<rclcpp::Node, char const (&) [5], char const (&) [3]>(std::allocator<rclcpp::Node>&, rclcpp::Node*, char const (&) [5], char const (&) [3]) (alloc_traits.h:484)
==2997306==    by 0x1661BA: std::_Sp_counted_ptr_inplace<rclcpp::Node, std::allocator<rclcpp::Node>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<char const (&) [5], char const (&) [3]>(std::allocator<rclcpp::Node>, char const (&) [5], char const (&) [3]) (shared_ptr_base.h:548)
==2997306== 
==2997306== 144 (56 direct, 88 indirect) bytes in 1 blocks are definitely lost in loss record 24 of 35
==2997306==    at 0x483D7F3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==2997306==    by 0x566120C: __default_allocate (allocator.c:37)
==2997306==    by 0x5B2BF4E: __rcl_guard_condition_init_from_rmw_impl (guard_condition.c:73)
==2997306==    by 0x5B2C15E: rcl_guard_condition_init (guard_condition.c:108)
==2997306==    by 0x4FCD7BC: rclcpp::Context::get_interrupt_guard_condition(rcl_wait_set_t*) (context.cpp:404)
==2997306==    by 0x5025903: rclcpp::graph_listener::GraphListener::GraphListener(std::shared_ptr<rclcpp::Context>) (graph_listener.cpp:57)
==2997306==    by 0x1609BB: TestGraphListenerProtectedMethods::TestGraphListenerProtectedMethods(std::shared_ptr<rclcpp::Context>) (test_graph_listener.cpp:120)
==2997306==    by 0x1684E8: void __gnu_cxx::new_allocator<TestGraphListenerProtectedMethods>::construct<TestGraphListenerProtectedMethods, std::shared_ptr<rclcpp::contexts::DefaultContext>&>(TestGraphListenerProtectedMethods*, std::shared_ptr<rclcpp::contexts::DefaultContext>&) (new_allocator.h:147)
==2997306==    by 0x16794B: void std::allocator_traits<std::allocator<TestGraphListenerProtectedMethods> >::construct<TestGraphListenerProtectedMethods, std::shared_ptr<rclcpp::contexts::DefaultContext>&>(std::allocator<TestGraphListenerProtectedMethods>&, TestGraphListenerProtectedMethods*, std::shared_ptr<rclcpp::contexts::DefaultContext>&) (alloc_traits.h:484)
==2997306==    by 0x166E44: std::_Sp_counted_ptr_inplace<TestGraphListenerProtectedMethods, std::allocator<TestGraphListenerProtectedMethods>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<std::shared_ptr<rclcpp::contexts::DefaultContext>&>(std::allocator<TestGraphListenerProtectedMethods>, std::shared_ptr<rclcpp::contexts::DefaultContext>&) (shared_ptr_base.h:548)
==2997306==    by 0x165BD2: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<TestGraphListenerProtectedMethods, std::allocator<TestGraphListenerProtectedMethods>, std::shared_ptr<rclcpp::contexts::DefaultContext>&>(TestGraphListenerProtectedMethods*&, std::_Sp_alloc_shared_tag<std::allocator<TestGraphListenerProtectedMethods> >, std::shared_ptr<rclcpp::contexts::DefaultContext>&) (shared_ptr_base.h:679)
==2997306==    by 0x164E8B: std::__shared_ptr<TestGraphListenerProtectedMethods, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<TestGraphListenerProtectedMethods>, std::shared_ptr<rclcpp::contexts::DefaultContext>&>(std::_Sp_alloc_shared_tag<std::allocator<TestGraphListenerProtectedMethods> >, std::shared_ptr<rclcpp::contexts::DefaultContext>&) (shared_ptr_base.h:1344)
==2997306== 
==2997306== LEAK SUMMARY:
==2997306==    definitely lost: 112 bytes in 2 blocks
==2997306==    indirectly lost: 176 bytes in 4 blocks
==2997306==      possibly lost: 0 bytes in 0 blocks
==2997306==    still reachable: 48,304 bytes in 96 blocks
==2997306==         suppressed: 0 bytes in 0 blocks
==2997306== Reachable blocks (those to which a pointer was found) are not shown.
==2997306== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==2997306== 
==2997306== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)


Signed-off-by: Stephen Brawner brawner@gmail.com

@ivanpauno
Member

Thanks for bringing this up @brawner !

Yes, this looks pretty wrong: the guard condition is stored in a std::unordered_map owned by the context, so accessing the gc memory after the context was destroyed is a memory error.
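The lifetime problem described above can be sketched in plain C++. The types below (GuardCondition, Context) are hypothetical stand-ins, not the real rclcpp classes: the context owns its guard conditions in a std::unordered_map and hands out raw pointers into it, so every such pointer dangles once the context is destroyed.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

// Stand-in for rcl_guard_condition_t; the flag lets us observe destruction.
struct GuardCondition {
  bool * alive;
  explicit GuardCondition(bool * flag) : alive(flag) { *alive = true; }
  ~GuardCondition() { *alive = false; }
};

// Stand-in for rclcpp::Context: guard conditions live in a map owned here.
struct Context {
  std::unordered_map<std::string, std::unique_ptr<GuardCondition>> guard_conditions;

  GuardCondition * get_interrupt_guard_condition(bool * flag) {
    auto & gc = guard_conditions["interrupt"];
    if (!gc) { gc = std::make_unique<GuardCondition>(flag); }
    // Raw pointer into the map: only valid while this Context is alive.
    return gc.get();
  }
};
```

Destroying the Context frees the map entry, after which any use of the previously returned raw pointer is a memory error, which is exactly what the special case in GraphListener::__shutdown() ran into.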

I think what makes things complicated is that the context owns a guard condition that it doesn't need to own.
Interrupt guard conditions are only triggered under the same conditions in which shutdown callbacks are called (see here).
It would be much easier if we removed the concept of interrupt guard conditions from the context entirely and used shutdown callbacks where needed.
That way, a "graph listener" would own its guard condition and register a shutdown callback that triggers it (the graph listener retains a shared_ptr to the guard condition and passes a weak_ptr to the callback).
The only other place interrupt guard conditions are used is in the executors, so making the change shouldn't be hard.
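The ownership pattern described above can be illustrated with a plain-C++ sketch. GuardCondition, Context, and GraphListener here are simplified mock-ups, not the real rclcpp types: the listener owns the guard condition via shared_ptr, and the context's shutdown callback holds only a weak_ptr, so a destroyed listener can never be touched by shutdown.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <vector>

// Stand-in for rcl_guard_condition_t.
struct GuardCondition {
  bool triggered = false;
  void trigger() { triggered = true; }
};

// Stand-in for rclcpp::Context, reduced to its shutdown-callback mechanism.
class Context {
public:
  void on_shutdown(std::function<void()> cb) { callbacks_.push_back(std::move(cb)); }
  void shutdown() {
    for (auto & cb : callbacks_) { cb(); }
  }
private:
  std::vector<std::function<void()>> callbacks_;
};

class GraphListener {
public:
  explicit GraphListener(const std::shared_ptr<Context> & context)
  : gc_(std::make_shared<GuardCondition>())
  {
    // The listener keeps the only shared_ptr; the callback captures a
    // weak_ptr, so it becomes a safe no-op once the listener is gone.
    std::weak_ptr<GuardCondition> weak_gc = gc_;
    context->on_shutdown([weak_gc]() {
      if (auto gc = weak_gc.lock()) {
        gc->trigger();
      }
    });
  }
  bool interrupted() const { return gc_->triggered; }
private:
  std::shared_ptr<GuardCondition> gc_;
};
```

With this arrangement there is nothing for the context to finalize on the listener's behalf, so the special case in __shutdown() would disappear.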

The other option is to return a std::shared_ptr (maybe with a custom deleter) from get_interrupt_guard_condition, instead of the current raw pointer.

Both options are API breaking, so between the two I prefer the first one.

If we want a non-API-breaking fix, I don't see any option other than leaking, as proposed here.
Not great, but also not a big issue.
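A minimal sketch of the second option (returning a shared_ptr with a custom deleter), again with hypothetical stand-in types rather than the real rclcpp API; the `finalized` flag stands in for the call to rcl_guard_condition_fini():

```cpp
#include <cassert>
#include <memory>

// Stand-in for rcl_guard_condition_t.
struct GuardCondition {
  bool in_use = true;
};

// Instead of a raw pointer whose lifetime is tied to the context's internal
// map, hand out a shared_ptr whose deleter performs finalization. Callers
// keep the guard condition alive exactly as long as they need it.
std::shared_ptr<GuardCondition> get_interrupt_guard_condition(bool * finalized)
{
  return std::shared_ptr<GuardCondition>(
    new GuardCondition(),
    [finalized](GuardCondition * gc) {
      *finalized = true;  // stand-in for rcl_guard_condition_fini()
      delete gc;
    });
}
```

Finalization then happens when the last holder drops its reference, so it can never race with context teardown.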

@ivanpauno ivanpauno added the bug Something isn't working label Oct 13, 2020
Member

@ivanpauno ivanpauno left a comment


I would like to hear @wjwwood opinion before this gets merged, but it looks ok (it is the only backportable change I see possible).

@brawner
Contributor Author

brawner commented Oct 13, 2020

I'm working on backporting things to foxy so this technically blocks #1383. How about I retarget this to foxy and then a more appropriate change can be applied to rolling?

I like moving things to a shared_ptr for get_interrupt_guard_condition because everything else is memory managed in Context.

@ivanpauno
Member

I'm working on backporting things to foxy so this technically blocks #1383. How about I retarget this to foxy and then a more appropriate change can be applied to rolling?

👍 I think that's fine

@brawner brawner force-pushed the brawner/rclcpp-fix-graph-listener branch from f9f860f to 127505e Compare October 13, 2020 18:42
@brawner brawner changed the base branch from master to foxy October 13, 2020 18:43
@brawner brawner changed the title Remove finalization of guard_condition from GraphListener::__shutdown() [foxy] Remove finalization of guard_condition from GraphListener::__shutdown() Oct 13, 2020
Signed-off-by: Stephen Brawner <brawner@gmail.com>
@brawner brawner force-pushed the brawner/rclcpp-fix-graph-listener branch from 127505e to e21e62c Compare October 13, 2020 18:44
@brawner
Contributor Author

brawner commented Oct 13, 2020

This PR is now targeting foxy.

Testing --packages-select rclcpp

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

Edit: rookie mistake

Member

@wjwwood wjwwood left a comment


LGTM, assuming you'll do something different in rolling. This is just leaking the guard condition, right?

@brawner
Contributor Author

brawner commented Oct 13, 2020 via email

@brawner
Contributor Author

brawner commented Oct 13, 2020

Thanks for the reviews. The failures in the Fpr job are unrelated to this PR and will probably be resolved in #1383, which does not show the same failures.

@irobot-ros
Contributor

Hi, we noticed that the memory leak mentioned by @brawner in the post is still present in the Foxy branch!
Is there a reason why it has not been addressed yet?

We have been able to reproduce it in a simple test file that repeatedly calls rclcpp::init(), then creates a node and then calls rclcpp::shutdown().

@ivanpauno
Member

Hi, we noticed that the memory leak mentioned by @brawner in the post is still present in the Foxy branch!

IIRC, we didn't find an ABI compatible fix.
Feel free to propose a fix or to create a new issue.

irobot-ros pushed a commit to irobot-ros/rclcpp that referenced this pull request May 17, 2021