Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestInteractiveMarkerClient.states fails when run with rmw_connextdds #40

Closed
asorbini opened this issue Apr 15, 2021 · 6 comments · Fixed by ros2/rclcpp#1668
Closed
Assignees
Labels
test-failure The issue or PR relates to a test failure
Projects

Comments

@asorbini
Copy link
Collaborator

asorbini commented Apr 15, 2021

Test TestInteractiveMarkerClient.states from package ros-visualization/interactive_markers fails when run with rmw_connextdds (ref):

[ RUN      ] TestInteractiveMarkerClient.states
[INFO] [1618402845.369215543] [test_interactive_marker_server_node]: Connected on namespace: test_namespace
[INFO] [1618402845.370914886] [test_interactive_marker_server_node]: Connected on namespace: test_namespace
[INFO] [1618402845.372802357] [test_interactive_marker_server_node]: Connected on namespace: test_namespace
[INFO] [1618402845.379741315] [mock_interactive_marker_server]: Mock server ready
[INFO] [1618402845.380105568] [test_interactive_marker_server_node]: Sending request for interactive markers
[INFO] [1618402845.390321054] [mock_interactive_marker_server]: Mock server destroyed
[WARN] [1618402845.392697448] [test_interactive_marker_server_node]: Server not available during initialization, resetting
[INFO] [1618402845.407828611] [mock_interactive_marker_server]: Mock server ready
[INFO] [1618402845.407945940] [test_interactive_marker_server_node]: Sending request for interactive markers
[INFO] [1618402845.418071273] [mock_interactive_marker_server]: Mock server destroyed
[INFO] [1618402845.421568472] [test_interactive_marker_server_node]: Connected on namespace: test_namespace
[WARN] [1618402845.422274371] [rcl.logging_rosout]: Publisher already registered for provided node name. If this is due to multiple nodes with the same name then all logs for that logger name will go out over the existing publisher. As soon as any node with that name is destructed it will unregister the publisher, preventing any further logs for that name from being published on the rosout topic.
[INFO] [1618402845.427342460] [mock_interactive_marker_server]: Mock server ready
[ERROR] [1618402845.427709873] [rclcpp]: Error in destruction of rcl service handle test_namespace/get_interactive_markers: the Node Handle was destructed too early. You will leak memory
[INFO] [1618402845.427803546] [test_interactive_marker_server_node]: Sending request for interactive markers
[INFO] [1618402845.468389763] [mock_interactive_marker_server]: Interactive markers request received
[INFO] [1618402845.499019875] [test_interactive_marker_server_node]: Service response received for initialization
[INFO] [1618402845.509249996] [mock_interactive_marker_server]: Mock server destroyed
RTI Data Distribution Service Non-commercial license is for academic, research, evaluation and personal use only. USE FOR COMMERCIAL PURPOSES IS PROHIBITED. See RTI_LICENSE.TXT for terms. Download free tools at rti.com/ncl. License issued to Non-Commercial User license@rti.com For non-production use only.
Expires on 00-jan-00 See www.rti.com for more information.
/tmp/ws/src/ros-visualization/interactive_markers/test/interactive_markers/test_interactive_marker_client.cpp:236: Failure
Expected equality of these values:
  client_->getState()
    Which is: 2
  ClientState::STATE_IDLE
    Which is: 0
[INFO] [1618402848.515853148] [test_interactive_marker_server_node]: Connected on namespace: test_namespace
[INFO] [1618402848.522049052] [mock_interactive_marker_server]: Mock server ready
[INFO] [1618402848.522210463] [test_interactive_marker_server_node]: Sending request for interactive markers
[INFO] [1618402848.562642973] [mock_interactive_marker_server]: Interactive markers request received
[INFO] [1618402848.593074129] [test_interactive_marker_server_node]: Service response received for initialization
[WARN] [1618402848.653861432] [test_interactive_marker_server_node]: Update sequence number is out of order. 1 (expected) vs. 2 (received)
[INFO] [1618402848.653926360] [test_interactive_marker_server_node]: Sending request for interactive markers
[INFO] [1618402848.664057004] [mock_interactive_marker_server]: Mock server destroyed
[WARN] [1618402849.664008944] [test_interactive_marker_server_node]: Did not receive response with interactive markers, resending request...
[INFO] [1618402849.664127018] [test_interactive_marker_server_node]: Sending request for interactive markers
[WARN] [1618402850.671448428] [test_interactive_marker_server_node]: Did not receive response with interactive markers, resending request...
[INFO] [1618402850.671568523] [test_interactive_marker_server_node]: Sending request for interactive markers
/tmp/ws/src/ros-visualization/interactive_markers/test/interactive_markers/test_interactive_marker_client.cpp:268: Failure
Expected equality of these values:
  client_->getState()
    Which is: 1
  ClientState::STATE_IDLE
    Which is: 0
[ERROR] [1618402851.674446619] [rmw_connextdds]: failed to delete DDS subscriber's entities
[ERROR] [1618402851.674483621] [rmw_connextdds]: failed to finalize node in context
[ERROR] [1618402851.674498396] [rclcpp]: Error in destruction of rcl node handle: failed to delete DDS subscriber's entities, at /tmp/ws/src/ros2/rmw_connextdds/rmw_connextdds_common/src/common/rmw_context.cpp:436, at /tmp/ws/src/ros2/rcl/rcl/src/rcl/node.c:384
[D0000|Sub(80000009)|DELETE_CONTAINED]PRESPsService_destroyAllLocalEndpointsInGroupWithCursor:!delete local reader
[D0000|Sub(80000009)|DELETE_CONTAINED]PRESPsService_destroyAllLocalEndpointsInGroup:!destroyAllLocalEndpointsInGroupWithCursor
[D0000|Sub(80000009)|DELETE_CONTAINED]DDS_Subscriber_delete_contained_entities:!delete contained entities
[  FAILED  ] TestInteractiveMarkerClient.states (6320 ms)

Both failures [1][2] are caused by an interactive_markers::InteractiveMarkerClient object not transitioning to ClientState::STATE_IDLE within a pretty generous timeout (3s). Both times, these state transitions would normally be triggered by rclcpp::Client::service_is_ready() called on member get_interactive_markers_client_ returning false [1][2]. This is expected to occur because both times the mock_server has been removed from the executor and deleted.

Unfortunately, the client never seems to "lose service", and the transitions never happen.

The key to why this is happening lies in the ERROR line:

[ERROR] [1618402845.427709873] [rclcpp]: Error in destruction of rcl service handle test_namespace/get_interactive_markers: the Node Handle was destructed too early. You will leak memory

This error is thrown by rclcpp when the node associated with an rcl_service_t has already been deleted at the time of the service's deletion [1]. Since the node handle is not valid anymore, rcl_service_fini() and rmw_destroy_service() are never called for that service.

I have been doing some digging around and I've determined that the reason for this leak lies in the following sequence of events:

  1. The rcl_service_t becomes active while the test is spinning the executor [1][2][3].
  2. A subscription is also active and is returned instead of the service [1]. The rcl_server_t remains referenced by AllocatorMemoryStrategy::service_handles_.
  3. The active subscription makes the client transition to RUNNING state [1] which causes the test to delete the mock_server. Since the rcl_service_t is still referenced, its "destructor" is not run.
  4. The test starts waiting for the client to transition to IDLE. This in turn causes the executor to finally notice that the rcl_service_t is associated with a deleted node, and finally clears its reference, causing the destructor to run, and the error to be printed [1].

Unfortunately the client will never transition to IDLE because in order to do this, its "request publisher" and "reply subscriber" would have to unmatch all endpoints from matching services [1], but this will never happen thanks to the leaked rcl_service_t (since rmw_destroy_service() was not called and the underlying DDS endpoints will never be deleted). I believe these leaked endpoints are also the reason for the finalization errors reported by Connext and rmw_connextdds at the end of the test.

Fix/Workaround

No changes are required in rmw_connextdds.

The only change required to make the test pass consistently is to modify Executor::get_next_ready_executable_from_map() in rclcpp/src/rclcpp/executor.cpp so that services are checked before subscriptions:

diff --git a/rclcpp/src/rclcpp/executor.cpp b/rclcpp/src/rclcpp/executor.cpp
index 630db50b..8b5b5f23 100644
--- a/rclcpp/src/rclcpp/executor.cpp
+++ b/rclcpp/src/rclcpp/executor.cpp
@@ -826,16 +826,16 @@ Executor::get_next_ready_executable_from_map(
     success = true;
   }
   if (!success) {
-    // Check the subscriptions to see if there are any that are ready
-    memory_strategy_->get_next_subscription(any_executable, weak_groups_to_nodes);
-    if (any_executable.subscription) {
+    // Check the services to see if there are any that are ready
+    memory_strategy_->get_next_service(any_executable, weak_groups_to_nodes);
+    if (any_executable.service) {
       success = true;
     }
   }
   if (!success) {
-    // Check the services to see if there are any that are ready
-    memory_strategy_->get_next_service(any_executable, weak_groups_to_nodes);
-    if (any_executable.service) {
+    // Check the subscriptions to see if there are any that are ready
+    memory_strategy_->get_next_subscription(any_executable, weak_groups_to_nodes);
+    if (any_executable.subscription) {
       success = true;
     }
   }

The reason why this small change fixes the leak, and the test in general, is because if the rcl_service_t is detected as active, it ends up being removed from AllocatorMemoryStrategy::service_handles_ and it can be properly finalized alongside its parent node. The subscription is still detected on the following wait, making the test progress as expected.

This is probably not The Fix, since this issue seems to outline a general memory management problem (i.e. the leak of rcl_service_t). While making the change in rclcpp will make the test pass, I suspect more problems will arise inevitably until the code is refactored so that all RMW resources are always guaranteed to be correctly finalized.

Sidenote

Turns out these errors have nothing to do with the reliability protocol's configuration, as I had suspected when I opened #26. I'm not quite sure why a very fast heartbeat rate is also able to make the test pass, but I suspect it has something to do with altering the timing and ordering in which the executor clears its associated objects.

EDIT: grammar and link to the correct related PR (#26)

@asorbini asorbini self-assigned this Apr 15, 2021
@asorbini asorbini added the test-failure The issue or PR relates to a test failure label Apr 15, 2021
@asorbini
Copy link
Collaborator Author

@ivanpauno @clalancette please take a look at the analysis and let me know if you have any thoughts on how to resolve the issue.

@ivanpauno
Copy link
Member

This error is thrown by rclcpp when the node associated with an rcl_service_t has already been deleted at the time of the service's deletion [1]. Since the node handle is not valid anymore, rcl_service_fini() and rmw_destroy_service() are never called for that service.

Uff 🤦‍♂️

please take a look at the analysis and let me know if you have any thoughts on how to resolve the issue.

The suggested patch seems to be only a workaround.
It might be fine to get the issue fixed temporarily, but we should manage ownership of objects correctly so this cannot happen.

@ivanpauno
Copy link
Member

I think ros2/rclcpp#1668 should fix the issue.

@asorbini
Copy link
Collaborator Author

asorbini commented May 13, 2021

@ivanpauno thank you for looking into this. Just wondering if you validated the fix with rmw_connextdds?

I don't see CI results (the one in ros2/rclcpp#1668 only tested Cyclone) and this error only occurred when RMW_IMPLEMENTATION=rmw_connextdds.

@ivanpauno
Copy link
Member

@ivanpauno thank you for looking into this. Just wondering if you validated the fix with rmw_connextdds?

Good question, I didn't sorry.
I'm confident that fixed the issue, but let's check in CI:

  • Linux Build Status

@asorbini
Copy link
Collaborator Author

Great, looks like it fixed the issue. Thanks again!

@clalancette clalancette moved this from To do to Needs Release in Galactic May 13, 2021
@clalancette clalancette moved this from Needs Release to Done in Galactic May 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
test-failure The issue or PR relates to a test failure
Projects
No open projects
Galactic
  
Done
Development

Successfully merging a pull request may close this issue.

2 participants