Allow tracing tests to be run in parallel with tracing tests & normal tests #94

christophebedard · 2024-03-06T22:41:36Z

Parallel testing was enabled in ci.ros2.org in ros2/ci#723. Since then, many of the tracing-related tests in ros2_tracing have been failing. For example, see the failed tests for March 6th here: osrf/buildfarm-tools#31 (comment).

There are 2 issues:

There is a fundamental issue with running tracing tests in parallel with other tests (any ROS 2 app/executable). When one of the tracing tests starts tracing, it collects data from any running application. As they are written, the tests do not expect to collect unrelated data, and will often fail because of it (e.g., unexpected number of trace events). This can probably be improved and/or fundamentally fixed.
Some tracing tests do not expect other tests to control the tracer at the same time as them, and will fail because of it. This constraint was originally meant to circumvent issue 1, but it means that two packages containing tracing tests (which there are) cannot run simultaneously. Once issue 1 is fixed, then this is easy to change.

I changed the tracing test code to provide more information when a tracing test fails due to incorrect/unexpected trace data in #93. We can see the result in the March 6th nightlies, which confirms issue 1:

Test log: test_lifecycle_node.TestLifecycleNode

23:01:14 6: =================================== FAILURES ===================================
23:01:14 6: __________________________ TestLifecycleNode.test_all __________________________
23:01:14 6: 
23:01:14 6: self = <test_lifecycle_node.TestLifecycleNode testMethod=test_all>
23:01:14 6: 
23:01:14 6:     def test_all(self):
23:01:14 6:         # Check events as set
23:01:14 6:         self.assertEventsSet(self._events_ros)
23:01:14 6:     
23:01:14 6:         # Check that the lifecycle node has an init event just like a normal node
23:01:14 6:         rcl_node_init_events = self.get_events_with_name(tp.rcl_node_init)
23:01:14 6:         lifecycle_node_matches = self.get_events_with_field_value(
23:01:14 6:             'node_name',
23:01:14 6:             'test_lifecycle_node',
23:01:14 6:             rcl_node_init_events,
23:01:14 6:         )
23:01:14 6: >       self.assertNumEventsEqual(lifecycle_node_matches, 1)
23:01:14 6: 
23:01:14 6: test/test_lifecycle_node.py:49: 
23:01:14 6: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
23:01:14 6: ../../../../install/tracetools_test/lib/python3.10/site-packages/tracetools_test/case.py:327: in assertNumEventsEqual
23:01:14 6:     self.assertEqual(len(events), expected_number, msg)
23:01:14 6: E   AssertionError: 2 != 1 : unexpected number of events: [{'_name': 'ros2:rcl_node_init', '_timestamp': 1709708441732661082, 'vpid': 319439, 'rmw_handle': 94411761626944, 'vtid': 319439, 'namespace': '/', 'cpu_id': 1, 'node_handle': 94411760046688, 'procname': 'test_lifecycle_', 'node_name': 'test_lifecycle_node'}, {'_name': 'ros2:rcl_node_init', '_timestamp': 1709708447683372035, 'vpid': 321945, 'rmw_handle': 94716619878512, 'vtid': 321945, 'namespace': '/', 'cpu_id': 2, 'node_handle': 94716619735216, 'procname': 'simple_lifecycl', 'node_name': 'test_lifecycle_node'}]

The test expects 1 event but gets 2 events. We can see that the PIDs (vpid) are different, and that the procname of the second event is simple_lifecycl with a test_lifecycle_node node name, which corresponds to this ros2lifecycle test: https://github.com/ros2/ros2cli/blob/d0c8c35f7255f826b542f0d9283d9b706cb2b522/ros2lifecycle/test/test_cli.py#L137-L150.

We need to let other ROS 2 tests/applications run while the tracing test runs & traces its test application(s), so we need to either only trace the relevant applications or only consider tracing data from the relevant applications. I can think of a few solutions:

Be more aggressive about filtering out unrelated trace data in the test code
- This would most likely require a lot more work/test code and it would be repetitive; doesn't scale well
Use LTTng's recording event rules feature, e.g., lttng enable-event -k event_name --filter '$ctx.procname == "string"'
- We would only be able to filter on the procname value, but there could be collisions, since the same test executables are used by many different tests and procname is limited to 15 characters
- While filtering on PID would be ideal, the process ID isn't known before we launch the application, and we need to start tracing before we launch it
- Also, the recording event rules feature is not exposed through the lttngpy Python bindings, so we would need to add support for it
Only consider trace events of the test process' child processes
- This doesn't work because of course the child processes have terminated by the time we read the trace data
- The only way this would work is if the parent process gets a list of its child processes at the right time, when they're still alive; I don't know how fragile this would be
- And this assumes that the test process directly spawned the traced processes, which may be a fragile assumption
Set procname of spawned processes through Node(exec_name='...') and only consider events with that procname
- Pretty simple, however this is limited to 15 characters, and some tests already expect the procname to correspond to the executable name
Use an "active" marker in the trace data itself: have the test generate a unique ID and pass it to the test executable processes being launched by the test using an environment variable, then make the test application record that unique ID in the trace using the simple lttng_ust_tracef API, and have the test code filter out trace data from any process that hasn't recorded that unique ID
- This would be the most robust solution
- We would need to modify code, but it's only a few lines and is limited to the test applications (all executables in the test_tracetools package)
- All tracing tests use test applications/executables from the test_tracetools package, but not all tracing tests are done with tracetools_test for test_tracetools; I need probably need to make this filtering logic to other tests (mainly test_ros2trace?)
Any other solution?

This will then allow us to remove the constraint that led to issue 2.

The text was updated successfully, but these errors were encountered:

christophebedard added the enhancement New feature or request label Mar 6, 2024

christophebedard self-assigned this Mar 6, 2024

This was referenced Mar 12, 2024

Allow tracing tests to be run in parallel with other tests #95

Merged

Fix interference between test_tracetools and ros2lifecycle tests #96

Merged

clalancette closed this as completed in #95 Mar 20, 2024

christophebedard mentioned this issue Mar 25, 2024

Fix LTTng issue with test_ros2trace and test_tracetools_launch so that they can run tracing tests in parallel #99

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow tracing tests to be run in parallel with tracing tests & normal tests #94

Allow tracing tests to be run in parallel with tracing tests & normal tests #94

christophebedard commented Mar 6, 2024 •

edited

Loading

Allow tracing tests to be run in parallel with tracing tests & normal tests #94

Allow tracing tests to be run in parallel with tracing tests & normal tests #94

Comments

christophebedard commented Mar 6, 2024 • edited Loading

christophebedard commented Mar 6, 2024 •

edited

Loading