When sending big messages, the node will eventually crash due to high delays (not using loaned messages) #5
You essentially run out of memory in your SHM pool; that's why the error message says it's unable to allocate space for a new message. The way I usually work around this is by increasing the shared memory pool size: the last entry there can be set to 10/20, which gives you more chunks for large messages. This is unfortunately far from optimal, as these values are currently hardcoded. Making them more flexible and settable at startup is on our list of features. With that configuration, I could send around 4K images without any delay. I am a bit puzzled about the increasing delay though. @michael-poehnl, any ideas?
@AlexisTM is NOT using the loaned messages API extension, so I would assume the increasing delay comes from the serialization, which takes longer the bigger the payload is. In rmw_iceoryx we take the queue size from the provided history QoS. @Karsten1987, is there another source for the queue size? Should the "10" that @AlexisTM provides in the create_publisher call be propagated down to the rmw layer as history QoS? Maybe the warning comes from another (built-in) subscriber that is using a queue size of 1000? If you want a queue size of 10 and can live with losing older chunks when the queue overflows, then this issue can be solved by increasing the number of 32 MB chunks (e.g. to 20). If we then ensure that your desired queue size of 10 is used on the iceoryx side (and not the 1000 coming from your subscriber), it should no longer crash. We currently have a fail-fast strategy: if your memory pool configuration is not sufficient to handle all the chunks that are blocked by queues and held on the user side, allocation fails and the node terminates. Having the memory pool configuration in a config file, and not only as a compile-time setting, is a feature that is quite on top of the stack.
Since I am not using loaned messages, I do expect delays from serialization. For comparison, the typical delays with other middlewares are (18 MByte messages):
The reason it crashes is a lack of memory: the required history depth keeps buffering messages because the listener doesn't receive the data fast enough (the delay is too high). Regarding the queue size of 1000, the subscriptions are using the depth:
But there is no mention of it on the publisher side:
So the good news is that we are 100 times faster when loaning messages. The bad news is that our "hack a thing to support non-memcopyable messages" serialization is 100 times slower. I'll check with @Karsten1987 whether we can find another solution there by reusing things that are already available in ROS 2. We currently have no use for the queue size on the publisher side. We plan to support a history QoS there in the future, but this is not a queue but rather a cache for messages. Currently we only support caching one message on the subscriber side, which corresponds to a latched topic in ROS 1. Could you check whether it still crashes after increasing the number of chunks in the 32 MB mempool?
NOTE: 0.1 ms is the fastest we could achieve on a non-RT-patched Linux.
@AlexisTM Could you share the code you used for benchmarking? I am trying to take a shot at this, and I'd love to have a setup similar to yours, to see how you produced these numbers and to come up with comparable ones on my end.
I am out of office (and don't have the code with me).
This was using the ROS 2 API (no loaned messages).
@AlexisTM We modified our ROSCon demo a little to cope with loaned messages as well as the "classic" transport. In neither case were we able to reproduce your behavior. It would be great if you could give that demo a shot and post some of your results here. To give you an idea of what we see on our machines, when sending 4K images at 15 Hz with loaned messages:
When sending 4K images at 15 Hz using the classic approach:
Even when adding a
@AlexisTM I am going to close this issue because I consider the problem addressed. Please feel free to re-open this ticket if you have further questions about it.
I sent messages of 33 MBytes with the publishers/subscribers set to a queue size of 10, without loaned messages, and the publisher crashed.
This means that if we are not using the zero-copy capability in all nodes (the loaned message methodology), the nodes will crash; I would therefore expect the global/local planner and default mapping nodes to have problems when running over iceoryx.
Publisher was:
Subscriber was:
When starting, it prints the following, although I expect the queue size to be 10.
After a few messages, the node crashes because memory can no longer be allocated.
This last error is, in my view, due to delays on the subscriber side, which prevent RouDi from reclaiming memory; the publisher dies even though it is the subscriber's fault.
Note that when I was testing raw iceoryx, the typical delay was 50-150 μs (18 MB messages), but using rmw_iceoryx_cpp, I get a steadily increasing delay, up to being a few messages late (at 1 Hz, without any processing, on an i9 machine).