Nodes missing from ros2 node list after relaunch (#582)
Comments
I'm seeing something similar with gazebo + ros2_control as well. The interesting thing is that restarting the daemon makes the nodes show up again.
I think this is expected behavior for the ros2 daemon; it is well described in what-is-ros2-daemon.
Is it? I understood it as a cache of nodes and their subs/pubs/services, etc., that should be transparent to use. I could understand it keeping some nodes "alive" in the cache, since it takes some time of them being unresponsive before they are eliminated. But I am starting new nodes and they do not show up in any commands that use the daemon, even after waiting several minutes. I have to restart the daemon (or bypass it) before they appear.
Ah, I see. You are saying:
problem-1: the old cache can be seen and is never cleaned?
problem-2: the cache does not get updated?
Am I understanding correctly?
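The two problems above can be pictured with a tiny sketch (hypothetical names, not ros2cli code) of a daemon-style cache that takes a graph snapshot once and only refreshes it when the daemon restarts:

```python
class GraphCache:
    """Minimal sketch of a daemon-style node cache (hypothetical; not ros2cli code)."""

    def __init__(self, discover):
        self._discover = discover       # callable returning the live node set
        self._nodes = set(discover())   # snapshot taken once, at startup

    def node_list(self):
        # problem-1 / problem-2: the stale snapshot is returned as-is, so
        # killed nodes linger and newly started nodes never show up.
        return sorted(self._nodes)

    def restart(self):
        # What `ros2 daemon stop` (followed by any ros2cli command) effectively
        # does: throw the snapshot away and rediscover from scratch.
        self._nodes = set(self._discover())


live = {"/turtlesim"}                   # nodes actually on the network
cache = GraphCache(lambda: live)
live.clear()
live.add("/navigator")                  # relaunch: a different node set is up
print(cache.node_list())                # stale view: ['/turtlesim']
cache.restart()
print(cache.node_list())                # fresh view: ['/navigator']
```

The reports in this thread match this picture: the CLI view is stale until the daemon is restarted, while the nodes themselves communicate fine.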
Exactly, I've seen both issues. problem-1: the cache (daemon) retains nodes killed long ago. I'm trying to find reproducible examples; currently I can make it happen 100% of the time, but only on a complex setup involving ros2_control with 2 controllers plus launching and stopping navigation2. There may also be underlying rmw issues causing problem-2, since I've seen rviz2 not list the topics from the newly spawned nodes, and even though I haven't looked in depth, I believe rviz2 has no relation to ros2cli.
Probably related to ros2/rmw_fastrtps#509.
Could this be related to ros2/rmw_fastrtps#514, if the communication is localhost?
I'm seeing this bug on a project with five nodes, Fast-RTPS, native Ubuntu install. I'm using ros2 launch files; everything comes up nicely the first couple of times, but eventually rqt gets a bit weird. There were a few times when it seemed to find a different collection of topics and nodes than the CLI tools did.
If your problem is related to ros2/rmw_fastrtps#514, it would be really appreciated if you could try the https://github.com/ros2/ros2/tree/foxy branch to check whether you still hit the problem.
@fujitatomoya I'll try the following:
- install ros from source
- rebuild with colcon (including ros2 source packages)
That sets an order-of-magnitude baseline for how long to test the new branch. Does that sound about right?
I think that sounds okay; the whole procedure is at https://docs.ros.org/en/foxy/Installation/Linux-Development-Setup.html.
I have a result: not fixed. I built from source (55 minutes build time, after tracking down additional deps), and my build does contain ros2/rmw_fastrtps#514. To trigger this bug, I have to sigint the launch. Once the bug is triggered, I can load the same 5-node launch file and retrigger it, and the size of the listed subset gets smaller by one node each time. I can keep triggering it until no nodes from that launch file get listed, and eventually reloading rqt doesn't list any.
Recently I've hit this bug in my project, and here is what I found. I also have some questions: @nielsvd @BrettRD @v-lopez
The discovery protocol is implemented in the RMW implementation, so changing the rmw could solve the problem.
No, I do not think so; as in the previous comment, discovery depends on the underlying rmw implementation.
I cannot reproduce this issue in my local environment with the rolling branch.
@fujitatomoya thank you for your quick reply.
Thanks for your tips, I will give it a try.
OK, so rclcpp would not bypass the issue.
According to @v-lopez, only the complex launch causes this node list problem.
I have not noticed this bug in Galactic, but I encountered it immediately again when I moved to Humble.
@iuhilnehc-ynos @llapx can you check whether we can see this problem? I think there is no easily reproducible procedure currently, but we can check with #582 (comment).
@BrettRD the primary difference between Galactic and Humble/Foxy is the default rmw used.
From my test (#779 (comment)) and the comment from @v-lopez, rviz2 bypasses the node-missing issue. I believe the root cause is not in the rmw layer, so changing the rmw will not bypass the issue, and rclcpp/rviz2 does not see this problem.
OK, I'll take a look.
I have tested it on ros:rolling (docker), and built turtlebot3 and navigation2 from source (ros:rolling does not provide nav2 packages).
This issue is not easy to reproduce, but it must still be there, because I can reproduce it with rolling a few times (the reproducible steps are similar to #582 (comment)). After stopping the ros2 daemon in step 2 of #582 (comment), we immediately get the correct node list.
Notice that the navigation demo runs well even if the node list is incomplete.
To find the offending thread, here is the backtrace for thread id 8:
```cpp
void perform_listen_operation(
        Locator input_locator)
{
    Locator remote_locator;
    while (alive())
    {
        // Blocking receive.
        std::shared_ptr<SharedMemManager::Buffer> message;
        if (!(message = Receive(remote_locator)))
        // ^^^ I expect `Receive` to block when there is no data, but it
        // returns a nullptr message and is retried again and again.
        {
            continue;
        }
```

I don't know whether it's a bug or not, because I can't reproduce this issue the first time after clearing the related shm files.
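For illustration, here is a self-contained Python sketch of the loop above (hypothetical, not Fast-DDS code). If the receive call returns empty instead of blocking, the `continue` turns the loop into a busy spin; a blocking receive keeps the thread idle until data arrives:

```python
import queue
import threading
import time

def perform_listen_operation(buf, alive, received):
    """Sketch of the listen loop: blocking receive with a short timeout so the
    loop can still notice shutdown. A truly non-blocking get() here would
    reproduce the busy spin seen in the backtrace."""
    while alive.is_set():
        try:
            message = buf.get(timeout=0.05)  # blocks until data or timeout
        except queue.Empty:
            continue  # with a non-blocking get(), this path spins the CPU
        received.append(message)

alive = threading.Event()
alive.set()
buf, received = queue.Queue(), []
worker = threading.Thread(target=perform_listen_operation,
                          args=(buf, alive, received))
worker.start()
buf.put("hello")        # deliver one message to the listener
time.sleep(0.2)         # give the worker time to pick it up
alive.clear()           # signal shutdown
worker.join()
print(received)         # ['hello']
```

The design point is the same as in the C++ snippet: the correctness of the loop hinges on `Receive` actually blocking; if it does not, every empty pass burns a full iteration.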
Could be related to eProsima/Fast-DDS#2790.
@iuhilnehc-ynos a couple of questions.
Can you point out which node or process cannot exit normally? Is it a receive exception or a core crash?
I think this is a good step that we have found.
It shows a random node list, but if the issue happens, the node list is almost the same as the prior run.
No, I tried other approaches as well. BTW: I think it's not difficult to reproduce this issue.
I hope you guys can reproduce this issue on your machine; otherwise, nobody can help confirm even if I have a workaround patch 😄.
@JLBuenoLopez-eProsima @MiguelCompany any thoughts? I believe it is clear that the shared memory files or caches are involved.
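For anyone inspecting this locally, a small helper to list leftover segment files might look like the following (assuming Fast-DDS's default Linux location `/dev/shm` and the `fastrtps_` file-name prefix; both are assumptions to adjust for other setups):

```python
import glob
import os

def leftover_shm_files(prefix="fastrtps_"):
    """List shared-memory segment files left behind by Fast-DDS.

    Assumes the Linux default location (/dev/shm) and the 'fastrtps_'
    file-name prefix; returns an empty list when nothing is left over.
    """
    return sorted(glob.glob(os.path.join("/dev/shm", prefix + "*")))

print(leftover_shm_files())  # e.g. ['/dev/shm/fastrtps_...'] or [] when clean
```

Comparing this listing before and after a crashed launch is a quick way to see whether stale segments are accumulating.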
I had a similar issue. I tried other methods, such as stopping and restarting the daemon, and that seemed to work, but I felt apprehensive about that workaround as I don't fully understand the consequences. What I found worked was adding --spin-time. What does --spin-time SPIN_TIME actually do?
The downside could be discovery time for other nodes running on that host. The daemon caches and advertises the ROS 2 network graph, so while the daemon is running, other ROS 2 processes on the same host can query it for connectivity without waiting for the entire discovery.
We can use this option to wait for the ROS 2 network graph to update until the specified timeout expires, but it is only valid when the daemon is not running (or is bypassed).
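The waiting behavior described above boils down to a poll-until-deadline pattern. A rough sketch (an assumption for illustration, not the ros2cli source; the function and node names are made up):

```python
import time

def wait_for_node(get_nodes, name, timeout_s):
    """Poll the graph until `name` shows up or `timeout_s` expires — a sketch
    of the wait that a discovery timeout like --spin-time performs."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        nodes = get_nodes()
        if name in nodes:
            return nodes        # found it before the deadline
        time.sleep(0.05)
    return get_nodes()          # best-effort snapshot after the timeout

# Simulate a node that only becomes discoverable after ~0.2 s.
start = time.monotonic()
def get_nodes():
    return {"/talker"} if time.monotonic() - start > 0.2 else set()

print(wait_for_node(get_nodes, "/talker", timeout_s=2.0))  # {'/talker'}
```

This also shows why the option trades latency for completeness: a longer timeout catches slow-to-discover nodes, but every query waits up to that long when a node truly is not there.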
Currently having this problem as well, but restarting the daemon does not seem to solve it. No idea if it helps, but here is my output.
Again, not sure if helpful, but when I installed ROS 2, I added the following lines to ~/.bashrc:
In case this helps, I can also reproduce the issue as follows (the `$` prompt is a regular user, `root#` is the root user). With turtlesim running as the regular user:

```
$ ros2 node list -a
/_ros2cli_daemon_0_41534b5a8f1d43cfbb3b1ee12d408355
/turtlesim
$ ros2 topic list
/parameter_events
/rosout
/turtle1/cmd_vel
/turtle1/color_sensor
/turtle1/pose
```

Then, as root:

```
root# source /opt/ros/humble/setup.bash
root# ros2 run turtlesim turtlesim_node
```

Turtlesim is correctly launched (note that to reproduce the issue it's not enough to just source the environment).

```
root# ros2 node list -a
/_ros2cli_daemon_0_41534b5a8f1d43cfbb3b1ee12d408355
root# ros2 topic list
/parameter_events
/rosout
```

Back in the regular user's terminal, the same degraded graph is reported:

```
$ ros2 node list -a
/_ros2cli_daemon_0_41534b5a8f1d43cfbb3b1ee12d408355
$ ros2 topic list
/parameter_events
/rosout
```

After restarting the daemon (note the new daemon name), everything is listed again:

```
$ ros2 node list -a
/_ros2cli_daemon_0_b175e66117984230bf91ab71681160d6
/turtlesim
$ ros2 topic list
/parameter_events
/rosout
/turtle1/cmd_vel
/turtle1/color_sensor
/turtle1/pose
```

Here's my report:

```
PLATFORM INFORMATION
system        : Linux
platform info : Linux-5.19.0-46-generic-x86_64-with-glibc2.35
release       : 5.19.0-46-generic
processor     : x86_64

QOS COMPATIBILITY LIST
compatibility status : No publisher/subscriber pairs found

RMW MIDDLEWARE
middleware name : rmw_fastrtps_cpp

ROS 2 INFORMATION
distribution name   : humble
distribution type   : ros2
distribution status : active
release platforms   : {'debian': ['bullseye'], 'rhel': ['8'], 'ubuntu': ['jammy']}
```

Here are my environment variables (same for both regular and root users):

```
$ printenv | grep ROS
ROS_VERSION=2
ROS_PYTHON_VERSION=3
ROS_LOCALHOST_ONLY=0
ROS_DISTRO=humble
```

Edit: A couple of things I'd like to add for clarification:
Can you evaluate the 2 PRs introduced in ros2/rmw_fastrtps#699 (comment) with the reproducible procedure in this issue?
After testing many times, I believe this issue is fixed by eProsima/Fast-DDS#3753.
@iuhilnehc-ynos great news! Thanks for checking.
@iuhilnehc-ynos thanks for testing, I will go ahead and close.
Bug report
Required Info:
Steps to reproduce issue
1
From the workspace root, launch (e.g.) a TurtleBot3 simulation:
Then, in a second terminal, launch the navigation:
Print the node list:
Close (ctrl-c) the navigation and the simulation.
2
Relaunch from the same respective terminals, the simulation:
and the navigation:
Print the node list again (2nd time):
Close (ctrl-c) the navigation and the simulation. Stop the ros2 daemon.
3
Relaunch from the same respective terminals, the simulation:
and the navigation:
Print the node list again (3rd time):
Expected behavior
The node list should be the same all three times (up to some hash in the /transform_listener_impl_... nodes).
Actual behavior
The second time, the following nodes are missing (the remainder is practically the same):
The third time, after stopping the daemon, it works as expected again.
Note that everything else works fine; in the above navigation use case, the nodes are fully functional.
Additional information
This issue was raised here: ros-navigation/navigation2#2145.