
[Foxy] Add support for ROS tracing tools #9

Open
carlossvg opened this issue Jan 12, 2020 · 13 comments
Labels
enhancement (New feature or request)

Comments

@carlossvg
Contributor

Currently we track statistics to measure real-time performance using different tools. Some of them are included in the project (pendulum_tools) and others are external (OSRF memory tools). To read these statistics we publish them using a ROS 2 publisher, which can be intrusive because it generates additional traffic, ROS 2 executables, etc.

In Eloquent, the ros2 tracing tools were introduced. These tools would allow us to track most of the statistics we are currently tracking, and more. The advantages of this approach are that we rely on only one tool and can remove all the statistics-specific code from the project. Also, the overhead of the tracing tools is minimal, which makes this a less intrusive solution than the current approach.

These are some of the steps needed to add support for the ROS 2 tracing tools:

  • Update the real-time instructions to explain how to use the tracing tools.
  • Create a launch file to run the demo with tracing enabled (see the example, and the sketch after this list).
  • Add several Jupyter notebooks (see the example) to:
    • Measure callback duration and jitter.
    • Count the number of messages received and lost.
    • Measure message latency.
    • Measure the number of memory allocations.
    • Measure the number of page faults (this requires support enabled when building the kernel).
    • Capture the node's on_activate event to process some statistics only in the active state.
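
As a starting point for the launch-file item above, here is a minimal sketch (not the actual demo launch file; the package, executable, and session names are placeholders) of enabling tracing with the Trace action from tracetools_launch:

```python
# Sketch: launch the demo with a ros2_tracing session enabled.
# 'pendulum_demo' and the session name are placeholders for this project.
from launch import LaunchDescription
from launch_ros.actions import Node
from tracetools_launch.action import Trace


def generate_launch_description():
    return LaunchDescription([
        # Start the tracing session before the nodes.
        Trace(
            session_name='pendulum-tracing',
            # ROS 2 userspace events plus libc memory events; the libc events
            # may also require the LTTng libc wrapper to be preloaded.
            events_ust=['ros2:*', 'lttng_ust_libc:*'],
            # Keep kernel events empty unless needed (e.g. for page faults);
            # they make trace processing much slower.
            events_kernel=[],
        ),
        Node(
            package='pendulum_demo',
            executable='pendulum_demo',
            output='screen',
        ),
    ])
```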

Some disadvantages:

  • For the moment it is only compatible with Linux because it uses LTTng under the hood.
  • For the moment it is not possible to track statistics at runtime (aka live tracing).

Links:

@carlossvg added the enhancement (New feature or request) label Jan 12, 2020
@carlossvg
Contributor Author

Hi @christophebedard. We plan to replace "OSRF memory tools" with ros2_tracing to analyze memory allocations. Could you point us in the right direction?

The goal is to detect any malloc (and similar) calls only when a lifecycle node is in the active state.

  1. Do you think it is possible to do this just by using a Jupyter notebook? Or does it make sense to add some tracepoints in the lifecycle callbacks to make this easier?
  2. How can we classify each memory allocation depending on the layer in which it is created?
  3. Do you have any similar example code to analyze memory allocations?

Thanks.

@christophebedard

  1. Do you think it is possible to do this just by using a Jupyter notebook? Or does it make sense to add some tracepoints in the lifecycle callbacks to make this easier?

Yeah, you can't do this without having a way to know which memory allocations happened during the active state and which didn't, so you will indeed need to add tracepoints to track lifecycle state changes.

I think support for this should be added to tracetools (and instrumentation added to rclcpp_lifecycle + support for it added to the processing tools), so I can take care of that if you want.

  2. How can we classify each memory allocation depending on the layer in which it is created?

So this is unrelated to the first point? You would need to track the execution flow across layers, e.g. by putting a tracepoint at each layer boundary. Then you use those to keep track of the "layer state." Of course this can get ugly if the execution flow is complex, i.e. if it triggers a lot of tracepoints 😆. I'm guessing you mean layers like rclcpp, rcl, rmw, rmw implementation?

Another way could be to gather callstack information along with the memory allocation trace events (with something like this, since this isn't a standard LTTng feature), look at the function that called malloc/free/etc. (and/or the executable/shared library that function comes from), and figure out which layer it belongs to. This is probably overly complex to set up, but since you don't expect to have memory allocations (at least during the "active" state), it might actually have less performance impact, because you wouldn't need to keep track of layers using tracepoints. It's a trade-off: fewer triggered tracepoints/generated events versus collecting more information for each memory allocation event and doing more processing work afterwards.
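
To illustrate just the classification step described above (everything here is hypothetical: the caller_library column assumes callstack information has already been collected and resolved, and the library-to-layer mapping is only an example):

```python
# Hypothetical post-processing sketch: classify allocation events by ROS 2 layer
# based on the shared library of the calling function. None of these columns
# exist in the current trace data; this only shows the classification logic.
import pandas as pd

LAYER_BY_LIBRARY = {  # illustrative mapping only
    'librclcpp.so': 'rclcpp',
    'librcl.so': 'rcl',
    'librmw_implementation.so': 'rmw',
    'libfastrtps.so': 'rmw implementation',
}

allocations = pd.DataFrame({
    'timestamp': [1, 2, 3],          # placeholder timestamps
    'size': [128, 64, 256],          # placeholder allocation sizes (bytes)
    'caller_library': ['librclcpp.so', 'libfastrtps.so', 'librcl.so'],
})

allocations['layer'] = allocations['caller_library'].map(LAYER_BY_LIBRARY).fillna('other')
print(allocations.groupby('layer')['size'].agg(['count', 'sum']))
```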

  3. Do you have any similar example code to analyze memory allocations?

Of course! Here are a few links:

  1. Example Jupyter notebook plotting memory consumption over time: https://gitlab.com/micro-ROS/ros_tracing/tracetools_analysis/-/blob/master/tracetools_analysis/analysis/memory_usage.ipynb
  2. Example launch file to get the data for the notebook above: https://gitlab.com/micro-ROS/ros_tracing/tracetools_analysis/-/blob/master/tracetools_analysis/launch/memory_usage.launch.py
  3. Maybe you can re-use the data that is used in the example notebook, e.g. use the lifecycle state tracepoints to identify "active" time intervals and check those intervals for any changes in consumption (a rough sketch of this flow follows this list). If you only want to check for calls to malloc/free/etc., take a look at the processing code used for the example notebook: https://gitlab.com/micro-ROS/ros_tracing/tracetools_analysis/-/blob/master/tracetools_analysis/tracetools_analysis/processor/memory_usage.py
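
A rough sketch of that flow, assuming the API used in the linked memory_usage notebook (module, class, and column names below are taken from that notebook and should be double-checked against it; the "active interval" filtering is hypothetical until the lifecycle tracepoints exist):

```python
# Sketch: load a converted trace, build per-thread memory usage dataframes,
# then (hypothetically) restrict them to "active" lifecycle intervals.
from tracetools_analysis.loading import load_file
from tracetools_analysis.processor import Processor
from tracetools_analysis.processor.memory_usage import UserspaceMemoryUsageHandler
from tracetools_analysis.utils.memory_usage import MemoryUsageDataModelUtil

events = load_file('converted_trace_file')  # placeholder: path to the converted trace

memory_handler = UserspaceMemoryUsageHandler()
Processor(memory_handler).process(events)

memory_util = MemoryUsageDataModelUtil(userspace=memory_handler.data)
# Assumed: one dataframe per thread, with 'timestamp' and 'memory_usage' columns
usage_dfs = memory_util.get_absolute_userspace_memory_usage_by_tid()

# Hypothetical: once lifecycle state-transition tracepoints are available,
# 'active_intervals' would hold (start, end) timestamps of the active state.
active_intervals = [(1_000_000_000, 2_000_000_000)]  # placeholder nanosecond values
for tid, df in usage_dfs.items():
    for start, end in active_intervals:
        active = df[(df['timestamp'] >= start) & (df['timestamp'] <= end)]
        # Any change in usage within the interval implies allocations/frees happened
        print(tid, active['memory_usage'].max() - active['memory_usage'].min())
```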

I hope that was helpful. Let me know if you need more information!

@carlossvg
Contributor Author

I think support for this should be added to tracetools (and instrumentation added to rclcpp_lifecycle + support for it added to the processing tools), so I can take care of that if you want.

OK, cool. In the meantime I can test checking for memory allocations over the whole tracing session.

Another way could be to gather callstack information along with the memory allocation trace events

Yes, I think this fits better with what I had in mind. Plotting memory consumption over time just for the active state could be interesting too, at least as a POC. I will start with the easiest one and then ask for help setting up the callstack example.

@carlossvg changed the title from "Add support for ROS tracing tools" to "[Foxy] Add support for ROS tracing tools" on Aug 16, 2020
@christophebedard

christophebedard commented Aug 16, 2020

@carlossvg I've started working on this (see: https://gitlab.com/micro-ROS/ros_tracing/ros2_tracing/-/issues/99) and I was wondering if you can think of other things that would be good to instrument for lifecycle nodes (other than state transitions). I can't think of anything myself, but if there is anything, it would be good to do it at the same time.

@carlossvg
Contributor Author

@christophebedard That's great. I think the state transition traces cover most of my use case. After looking at the rclcpp/rcl lifecycle implementation I can't find any other clear places to put additional tracepoints. For the moment, tracing the state transitions seems to be more than enough for me.

@carlossvg
Contributor Author

@christophebedard I started working on the callback duration scripts in this branch. I have some questions you might be able to help me with:

  1. The long time needed to process all the data is probably the biggest pain point for this use case. I traced for a few minutes and the processing took several minutes to complete. Do you know if there is some way to mitigate this (e.g. filtering events, plotting the babeltrace output with another tool, etc.)?
  2. I would like to plot the delta time between callbacks, for example to measure the jitter of a timer callback. I think the code would be very similar to the one in the callback duration notebook. Unfortunately, I'm not familiar with notebooks and pandas. Could you point me to how I could modify the callback duration notebook to implement this?

@christophebedard

christophebedard commented Aug 19, 2020

  1. Yeah, it's a big problem currently. I see you're keeping the default values for most params. If you don't need kernel events (at least for now), you should remove them, i.e. events_kernel=[]. That should reduce the processing time significantly. I just opened an issue to work on improving it. Also, you might want to put the Trace object first in the array for LaunchDescription; I haven't actually tested whether that changes anything, though (but let me know if you find out).
  2. The dataframe returned by get_callback_durations() contains both the start timestamp of each callback and its duration. That dataframe is used to build a ColumnDataSource object (itself used for plotting), where the timestamp is the X axis and the duration is the Y axis. I'm not sure exactly what you want your jitter plot to look like, but I'm guessing you want to remove the duration column, add an "index" column that can act as the X axis, and make the timestamp column the Y axis (I think the X axis column is simply the first column and the Y axis column is the second one). Then you need to transform/process the dataframe to turn the timestamps into time deltas; there are examples online, and a small sketch follows below.
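
A minimal pandas sketch of that last step, assuming a 'timestamp' column of callback start times in nanoseconds (the column name, the sample values, and the 10 ms expected period are assumptions for illustration):

```python
# Sketch: turn callback start timestamps into inter-callback deltas (jitter).
import pandas as pd

# Placeholder data: callback start times in nanoseconds, as in the
# callback-duration dataframe discussed above.
df = pd.DataFrame({'timestamp': [0, 10_100_000, 19_900_000, 30_200_000]})

# Delta between consecutive callback starts, converted to milliseconds.
df['period_ms'] = df['timestamp'].diff() * 1e-6

# Jitter: deviation from the expected timer period (10 ms assumed here).
expected_period_ms = 10.0
df['jitter_ms'] = df['period_ms'] - expected_period_ms

print(df)
```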

@carlossvg
Contributor Author

@christophebedard Thanks for the answers. I added the changes you suggested and created a script for the timestamp differences in PR #39. Feel free to provide your input and drop me a comment there if you think something can be improved. Thanks for your support.

@christophebedard

christophebedard commented Aug 22, 2020

@carlossvg I'll take a look!

I also wanted to mention (just in case) that the modifications (to rcl, tracetools, tracetools_analysis, etc.) won't be merged into Foxy, since they are a new feature and Foxy has already been released. They'll go into Rolling (which will eventually become Galactic). Did you consider targeting Rolling instead of Foxy? Although I'm guessing the pendulum demo is meant to be used with the latest release rather than the latest development state.

@carlossvg
Contributor Author

@christophebedard

Although I'm guessing the pendulum demo is meant to be used with the latest release rather than the latest development state.

That's right, for the moment we prefer to target the latest release. In the meantime, we could create a Rolling branch to test some of the new features.

@LanderU What's your input on this?

@LanderU
Member

LanderU commented Aug 23, 2020

@carlossvg we can create a rolling branch and start testing features as they are integrated.

@christophebedard

@carlossvg @LanderU so then you'll maintain forks/branches for rcl/tracetools/tracetools_analysis with the new instrumentation/modifications backported to Foxy?

@christophebedard

also, just to link to the various related MRs/branches from here:

the MRs are still WIP, but they should be ready soon.
