Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only keep the last sample in the image tools by default. #238

Merged
merged 2 commits into from
Jul 19, 2018

Conversation

clalancette
Copy link
Contributor

Particularly for cam2image, running it without a subscriber
and KEEP_ALL will eventually cause it to exhaust virtual
memory space. Running it with KEEP_LAST lets it run indefinitely
without growing memory usage.

We saw this problem while just running cam2image for this tutorial https://github.com/ros2/ros2/wiki/Rosbag-with-ROS1-Bridge . We had left it running for a little while before running the bridge/rosbag, and it would crash after a few minutes. With this change in place, we can leave it running for as long as we like.

@clalancette
Copy link
Contributor Author

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@clalancette clalancette added the in progress Actively being worked on (Kanban column) label Jun 7, 2018
Copy link
Member

@dhood dhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're using fastrtps right? I wouldn't have expected the publisher to keep the samples if there are no subscribers (unless we use transient local durability, which we aren't). But if this is the behaviour that we observe, then I agree it makes sense to change the default of the demo.

The help text of the demo/title of the PR are misleading because keep last will keep up to the queue size, not just 1 sample. I added e8e401a to improve the docs to more closely match https://github.com/ros2/ros2/wiki/About-Quality-of-Service-Settings

We should delay/keep this change in mind wrt ros2/rmw_fastrtps#202 which is referencing the existing defaults

@clalancette
Copy link
Contributor Author

You're using fastrtps right? I wouldn't have expected the publisher to keep the samples if there are no subscribers (unless we use transient local durability, which we aren't). But if this is the behaviour that we observe, then I agree it makes sense to change the default of the demo.

Fast-RTPS, correct. Do you think the publisher keeping the samples even if there are no subscribers is a bug we should report to upstream Fast-RTPS?

The help text of the demo/title of the PR are misleading because keep last will keep up to the queue size, not just 1 sample. I added e8e401a to improve the docs to more closely match https://github.com/ros2/ros2/wiki/About-Quality-of-Service-Settings

👍

We should delay/keep this change in mind wrt ros2/rmw_fastrtps#202 which is referencing the existing defaults

Good point. I'll keep this in progress until that is merged, then I'll retest.

@dhood
Copy link
Member

dhood commented Jun 7, 2018

Do you think the publisher keeping the samples even if there are no subscribers is a bug we should report to upstream Fast-RTPS?

I'm not sure if the behaviour is within the spec or not, so don't know that it's a bug. We could ask eProsima.

@dhood
Copy link
Member

dhood commented Jun 7, 2018

Do you think the publisher keeping the samples even if there are no subscribers is a bug we should report to upstream Fast-RTPS?

@richiprosima did you have any input on if this is the expected behaviour?

@dirk-thomas
Copy link
Member

Do you think the publisher keeping the samples even if there are no subscribers is a bug

This is not a bug but expected behavior in DDS. The purpose is to provide all "historical" data to late joining subscribers.

@dhood
Copy link
Member

dhood commented Jun 7, 2018

even if using durability of volatile as mentioned?

@dirk-thomas
Copy link
Member

I wouldn't expect it to happen with volatile. Seems like the same thing we discussed earlier: ros2/rmw_fastrtps#68 (comment)

@richiware
Copy link

While publisher was running, was a subscriber running too? In this case, was the subscriber's application killed or stopped neatly?

My thought is that publisher doesn't know the subscriber is not anymore there and is storing the samples until the subscriber acknowledge them.

@mikaelarguedas
Copy link
Member

I can confirm that this is reproducible without any subscriber running.

Using master, without any other node running, I ran:
time ros2 run image_tools cam2image -b -x 640 -y 480 -f 400

After 14 sec, the process is using 4.5GB of RAM and crashes when trying to publish the 5001st sample

[INFO] [cam2image]: Publishing image #5000
[INFO] [cam2image]: Publishing image #5001
terminate called after throwing an instance of 'std::runtime_error'
  what():  failed to publish message: cannot publish data, at /home/mikael/work/ros2/bouncy_ws/src/ros2/rmw_fastrtps/rmw_fastrtps_cpp/src/rmw_publish.cpp:53, at /home/mikael/work/ros2/bouncy_ws/src/ros2/rcl/rcl/src/rcl/publisher.c:241

real	0m14.235s
user	0m4.079s
sys	0m2.804s

That is expected when we use KEEP_ALL from what I understand ? ros2/rmw_fastrtps#68
Or does @richiware mean that we should not face the 5000 message issue if we use volatile no subscriber was ever matched on that topic?


@clalancette The patch should be modified to use the defaults profile from rmw. This problem was faced and worked around in ros2/rmw#82. So I think that this demo should stop hard-coding the history and reliability policy (but keep defining the queue size) and use the default profiles instead

@richiware
Copy link

Although you use KEEP_ALL, in VOLATILE samples are stored until subscribers acknowledge them. As I see this mechanism is implemented when an ACKNACK message is received.

In your scenario where there aren't subscribers this mechanism is not called and the samples are not removed from the history. This is a bug. I will work on it. Thanks for the reporting.

@clalancette
Copy link
Contributor Author

Using master, without any other node running, I ran:
time ros2 run image_tools cam2image -b -x 640 -y 480 -f 400

After 14 sec, the process is using 4.5GB of RAM and crashes when trying to publish the 5001st sample

@mikaelarguedas Thanks for the additional testing; that is exactly what we saw in testing.

@clalancette The patch should be modified to use the defaults profile from rmw. This problem was faced and worked around in ros2/rmw#82. So I think that this demo should stop hard-coding the history and reliability policy (but keep defining the queue size) and use the default profiles instead

I think there is some value in having a demo where users can easily play with the "low-level" settings of history and reliability to see what the effects on the system are. That being said, I think that we should probably use the default profiles as a default, and only modify it if the user passes a flag. What do you think about that?

In your scenario where there aren't subscribers this mechanism is not called and the samples are not removed from the history. This is a bug. I will work on it. Thanks for the reporting.

@richiware Thanks for the confirmation.

@mikaelarguedas
Copy link
Member

mikaelarguedas commented Jun 8, 2018

I think that we should probably use the default profiles as a default, and only modify it if the user passes a flag. What do you think about that?

Yes, that is what I'm suggesting 👍

I think there is some value in having a demo where users can easily play with the "low-level" settings of history and reliability to see what the effects on the system are.

Removing the hard-coded defaults will not modify user's ability to tweak the QoS by passing command line arguments

@clalancette
Copy link
Contributor Author

I re-tested this with the latest Fast-RTPS changes, and this still "leaks" memory if there are no subscribers. Therefore, I've gone ahead nad rebased this, and changed it to stick with rmw_qos_profile_default unless the user explicitly requests something else via a command-line option.

I tested by first running ros2 run image_tools cam2image -- -b with no subscribers, and saw that it did not start using more and more memory (because of the KEEP_LAST default). Then I ran ros2 run image_tools cam2image -- -b -k 1, and saw that it used more and more memory (because of the KEEP_ALL requested plus the Fast-RTPS bug). I also ran it with ros2 run image_tools cam2image -- -b -d 400, and saw that it used up memory until it did the 400th publish, at which point it stop using more memory. So at this point, I think this is doing what we want (and working around the Fast-RTPS bug for now). I'm going to place it back in review, and new CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@clalancette clalancette added in review Waiting for review (Kanban column) and removed in progress Actively being worked on (Kanban column) labels Jun 14, 2018
@clalancette
Copy link
Contributor Author

@mikaelarguedas Pointed out that I should test against the release/1.6.0 branch of upstream Fast-RTPS. I rebuilt with that and ran the test again with KEEP_ALL, and no subscribers, and now there is no leaking. I think this PR is still valuable, so I'll leave it in review, but it is also less important now that the upstream Fast-RTPS bug has been fixed.

return 0;
}

// Initialize a ROS 2 node to publish images read from the OpenCV interface to the camera.
auto node = rclcpp::Node::make_shared("cam2image");
rclcpp::Logger node_logger = node->get_logger();

// Set the parameters of the quality of service profile. Initialize as the default profile
// and set the QoS parameters specified on the command line.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found these comments valuable even if they come at the cost of a few local variables.
Do you mind keeping reliability_policy and friends variables but initialize them with the default profile values instead of the previously hard-coded ones?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did basically this, but just re-used the fields of the structure. The code looks slightly weird this way, but at least has the documentation that you liked. I could re-introduce the local variables, but it seems like more of a way to make a mistake than anything else. Let me know what you think.

@mikaelarguedas
Copy link
Member

It's not a very big deal but while this allows to keep the comments, all the statements have not effect so I don't know if it's clearer for users.

I'd prefer keeping the temporary variables (initialized with the default values of the qos profile), modify them based on user provided input, and then assign them to the custom qos profile with the appropriate comment.

Then I don't see it as a blocker for getting this merged.

Particularly for cam2image, running it without a subscriber
and KEEP_ALL will eventually cause it to exhaust virtual
memory space.  Running it with KEEP_LAST lets it run indefinitely
without growing memory usage.

Signed-off-by: Chris Lalancette <clalancette@openrobotics.org>
@clalancette
Copy link
Contributor Author

OK, I've rebased now and switched to only using the default arguments for depth, reliability, and history. I think this should be ready for another review; one more CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

Copy link
Member

@mikaelarguedas mikaelarguedas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks for iterating

@mikaelarguedas
Copy link
Member

Can you look into the failures on windows ? it looks to be due to this change as it doesnt appear on master

@clalancette
Copy link
Contributor Author

Can you look into the failures on windows ? it looks to be due to this change as it doesnt appear on master

Yeah, just looking into it now. Odd failure, I'm not sure why it is happening.

Windows was getting upset at the Unicode ones.

Signed-off-by: Chris Lalancette <clalancette@openrobotics.org>
@clalancette
Copy link
Contributor Author

Got it, Windows didn't like the Unicode quotes. One more CI on Windows just to confirm that it is fixed:

  • Windows Build Status

@mikaelarguedas
Copy link
Member

So this was machine independant after all ? or is there something different on portable than the other machines?

@clalancette
Copy link
Contributor Author

So this was machine independant after all ? or is there something different on portable than the other machines?

It was not machine dependent. I made a mistake while trying to reproduce it, and finally found the real problem (the quotes).

@mikaelarguedas
Copy link
Member

👍

@clalancette
Copy link
Contributor Author

CI is passing, and I have two approvals, so I'm going to merge this in. Thanks!

@clalancette clalancette merged commit 71ab326 into master Jul 19, 2018
@clalancette clalancette deleted the image-tools-keep-last branch July 19, 2018 13:51
@clalancette clalancette removed the in review Waiting for review (Kanban column) label Jul 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants