Parallel testing was enabled in ci.ros2.org in ros2/ci#723. Since then, many of the tracing-related tests in ros2_tracing have been failing. For example, see the failed tests for March 6th here: osrf/buildfarm-tools#31 (comment).
There are 2 issues:
1. There is a fundamental issue with running tracing tests in parallel with other tests (any ROS 2 app/executable). When one of the tracing tests starts tracing, it collects data from any running application. As written, the tests do not expect to collect unrelated data, and will often fail because of it (e.g., unexpected number of trace events). This can probably be improved and/or fundamentally fixed.
2. Some tracing tests do not expect other tests to control the tracer at the same time as them, and will fail if that happens. This constraint was originally meant to circumvent issue 1, but it means that two packages containing tracing tests (and there is more than one such package) cannot run simultaneously. Once issue 1 is fixed, this is easy to change.
In #93, I changed the tracing test code to provide more information when a tracing test fails due to incorrect/unexpected trace data. The result is visible in the March 6th nightlies, which confirm issue 1 (see the test_lifecycle_node.TestLifecycleNode test log below).
We need to let other ROS 2 tests/applications run while the tracing test runs & traces its test application(s), so we need to either only trace the relevant applications or only consider tracing data from the relevant applications. I can think of a few solutions:
Be more aggressive about filtering out unrelated trace data in the test code
This would most likely require a lot more work/test code and it would be repetitive; doesn't scale well
We would only be able to filter on the procname value, but there could be collisions, since the same test executables are used by many different tests and procname is limited to 15 characters
While filtering on PID would be ideal, the process ID isn't known before we launch the application, and we need to start tracing before we launch it
Also, the recording event rules feature is not exposed through the lttngpy Python bindings, so we would need to add support for it
Only consider trace events of the test process' child processes
This doesn't work because of course the child processes have terminated by the time we read the trace data
The only way this would work is if the parent process gets a list of its child processes at the right time, when they're still alive; I don't know how fragile this would be
And this assumes that the test process directly spawned the traced processes, which may be a fragile assumption
Set procname of spawned processes through Node(exec_name='...') and only consider events with that procname
Pretty simple, however this is limited to 15 characters, and some tests already expect the procname to correspond to the executable name
Use an "active" marker in the trace data itself. The test generates a unique ID and passes it, through an environment variable, to the test executable processes it launches. The test application then records that unique ID in the trace using the simple lttng_ust_tracef API, and the test code filters out trace data from any process that hasn't recorded that unique ID
This would be the most robust solution
We would need to modify code, but it's only a few lines and is limited to the test applications (all executables in the test_tracetools package)
All tracing tests use test applications/executables from the test_tracetools package, but not all tracing tests are done with tracetools_test for test_tracetools; I probably need to apply this filtering logic to other tests as well (mainly test_ros2trace?)
Any other solution?
This will then allow us to remove the constraint that led to issue 2.
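As a rough illustration of the unique-ID marker idea, here is a minimal sketch of the test side in Python. It is not the ros2_tracing implementation: the environment variable name, the marker event name and field (`lttng_ust_tracef:event` with a `msg` field), and the dict-based event layout are all assumptions made for the example. The test application side (reading the environment variable and recording it with lttng_ust_tracef) is not shown.

```python
import os
import uuid

# Hypothetical names, for illustration only (not taken from ros2_tracing).
MARKER_ENV_VAR = "TRACETOOLS_TEST_MARKER_ID"
# Assumed name/field of the event produced by lttng_ust_tracef.
MARKER_EVENT_NAME = "lttng_ust_tracef:event"
MARKER_FIELD = "msg"


def make_marker_env(base_env=None):
    """Generate a unique ID and an environment that passes it to launched processes."""
    marker_id = uuid.uuid4().hex
    env = dict(os.environ if base_env is None else base_env)
    env[MARKER_ENV_VAR] = marker_id
    return marker_id, env


def filter_events_by_marker(events, marker_id):
    """Keep only events from processes (vpid) that recorded the marker ID."""
    marked_vpids = {
        event["vpid"]
        for event in events
        if event["name"] == MARKER_EVENT_NAME
        and event.get(MARKER_FIELD) == marker_id
    }
    return [event for event in events if event["vpid"] in marked_vpids]


# Simplified trace: process 100 recorded the marker, process 200 is unrelated.
marker_id, env = make_marker_env({})
events = [
    {"name": MARKER_EVENT_NAME, "vpid": 100, MARKER_FIELD: marker_id},
    {"name": "ros2:rcl_node_init", "vpid": 100},
    {"name": "ros2:rcl_node_init", "vpid": 200},
]
relevant = filter_events_by_marker(events, marker_id)
```

With this kind of filter, a test that expects a single rcl_node_init event would only see the one from process 100, regardless of what other tests are running. Because the marker is tied to a PID, the sketch assumes PIDs are not reused within a single trace.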
Test log from the March 6th nightlies confirming issue 1: test_lifecycle_node.TestLifecycleNode
The test expects 1 event but gets 2 events. We can see that the PIDs (vpid) are different, and that the procname of the second event is simple_lifecycl with a test_lifecycle_node node name, which corresponds to this ros2lifecycle test: https://github.com/ros2/ros2cli/blob/d0c8c35f7255f826b542f0d9283d9b706cb2b522/ros2lifecycle/test/test_cli.py#L137-L150.
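The procname of the second event in this log, simple_lifecycl, is cut off at 15 characters, which is why filtering on procname alone can collide. The sketch below shows the truncation and a simple collision check. The full executable names are assumptions for illustration: simple_lifecycle_node is a guess at the untruncated name, and simple_lifecycle_talker is made up.

```python
PROCNAME_MAX_LEN = 15  # limit mentioned in this issue


def procname_for(executable_name):
    """Return the procname as it would appear in the trace (truncated)."""
    return executable_name[:PROCNAME_MAX_LEN]


def find_collisions(executable_names):
    """Map each truncated procname to the executables sharing it, if more than one."""
    by_procname = {}
    for name in executable_names:
        by_procname.setdefault(procname_for(name), set()).add(name)
    return {
        procname: sorted(names)
        for procname, names in by_procname.items()
        if len(names) > 1
    }


# Assumed/hypothetical executable names, for illustration only.
names = ["simple_lifecycle_node", "simple_lifecycle_talker", "test_ping"]
truncated = procname_for("simple_lifecycle_node")
collisions = find_collisions(names)
```

Truncation is only one source of collisions: two different tests launching the same executable produce identical procname values even when the name is short.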
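For reference, an LTTng recording event rule that filters on procname, as discussed in the solutions above, looks like this: lttng enable-event -k event_name --filter '$ctx.procname == "string"'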