Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow tracing tests to be run in parallel with tracing tests & normal tests #94

Closed
christophebedard opened this issue Mar 6, 2024 · 0 comments · Fixed by #95
Closed
Assignees
Labels
enhancement New feature or request

Comments

@christophebedard
Copy link
Member

christophebedard commented Mar 6, 2024

Parallel testing was enabled in ci.ros2.org in ros2/ci#723. Since then, many of the tracing-related tests in ros2_tracing have been failing. For example, see the failed tests for March 6th here: osrf/buildfarm-tools#31 (comment).

There are 2 issues:

  1. There is a fundamental issue with running tracing tests in parallel with other tests (any ROS 2 app/executable). When one of the tracing tests starts tracing, it collects data from any running application. As they are written, the tests do not expect to collect unrelated data, and will often fail because of it (e.g., unexpected number of trace events). This can probably be improved and/or fundamentally fixed.
  2. Some tracing tests do not expect other tests to control the tracer at the same time as them, and will fail because of it. This constraint was originally meant to circumvent issue 1, but it means that two packages containing tracing tests (which there are) cannot run simultaneously. Once issue 1 is fixed, then this is easy to change.

I changed the tracing test code to provide more information when a tracing test fails due to incorrect/unexpected trace data in #93. We can see the result in the March 6th nightlies, which confirms issue 1:

Test log: test_lifecycle_node.TestLifecycleNode
23:01:14 6: =================================== FAILURES ===================================
23:01:14 6: __________________________ TestLifecycleNode.test_all __________________________
23:01:14 6: 
23:01:14 6: self = <test_lifecycle_node.TestLifecycleNode testMethod=test_all>
23:01:14 6: 
23:01:14 6:     def test_all(self):
23:01:14 6:         # Check events as set
23:01:14 6:         self.assertEventsSet(self._events_ros)
23:01:14 6:     
23:01:14 6:         # Check that the lifecycle node has an init event just like a normal node
23:01:14 6:         rcl_node_init_events = self.get_events_with_name(tp.rcl_node_init)
23:01:14 6:         lifecycle_node_matches = self.get_events_with_field_value(
23:01:14 6:             'node_name',
23:01:14 6:             'test_lifecycle_node',
23:01:14 6:             rcl_node_init_events,
23:01:14 6:         )
23:01:14 6: >       self.assertNumEventsEqual(lifecycle_node_matches, 1)
23:01:14 6: 
23:01:14 6: test/test_lifecycle_node.py:49: 
23:01:14 6: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
23:01:14 6: ../../../../install/tracetools_test/lib/python3.10/site-packages/tracetools_test/case.py:327: in assertNumEventsEqual
23:01:14 6:     self.assertEqual(len(events), expected_number, msg)
23:01:14 6: E   AssertionError: 2 != 1 : unexpected number of events: [{'_name': 'ros2:rcl_node_init', '_timestamp': 1709708441732661082, 'vpid': 319439, 'rmw_handle': 94411761626944, 'vtid': 319439, 'namespace': '/', 'cpu_id': 1, 'node_handle': 94411760046688, 'procname': 'test_lifecycle_', 'node_name': 'test_lifecycle_node'}, {'_name': 'ros2:rcl_node_init', '_timestamp': 1709708447683372035, 'vpid': 321945, 'rmw_handle': 94716619878512, 'vtid': 321945, 'namespace': '/', 'cpu_id': 2, 'node_handle': 94716619735216, 'procname': 'simple_lifecycl', 'node_name': 'test_lifecycle_node'}]

The test expects 1 event but gets 2 events. We can see that the PIDs (vpid) are different, and that the procname of the second event is simple_lifecycl with a test_lifecycle_node node name, which corresponds to this ros2lifecycle test: https://github.com/ros2/ros2cli/blob/d0c8c35f7255f826b542f0d9283d9b706cb2b522/ros2lifecycle/test/test_cli.py#L137-L150.


We need to let other ROS 2 tests/applications run while the tracing test runs & traces its test application(s), so we need to either only trace the relevant applications or only consider tracing data from the relevant applications. I can think of a few solutions:

  1. Be more aggressive about filtering out unrelated trace data in the test code
    • This would most likely require a lot more work/test code and it would be repetitive; doesn't scale well
  2. Use LTTng's recording event rules feature, e.g., lttng enable-event -k event_name --filter '$ctx.procname == "string"'
    • We would only be able to filter on the procname value, but there could be collisions, since the same test executables are used by many different tests and procname is limited to 15 characters
    • While filtering on PID would be ideal, the process ID isn't known before we launch the application, and we need to start tracing before we launch it
    • Also, the recording event rules feature is not exposed through the lttngpy Python bindings, so we would need to add support for it
  3. Only consider trace events of the test process' child processes
    • This doesn't work because of course the child processes have terminated by the time we read the trace data
    • The only way this would work is if the parent process gets a list of its child processes at the right time, when they're still alive; I don't know how fragile this would be
    • And this assumes that the test process directly spawned the traced processes, which may be a fragile assumption
  4. Set procname of spawned processes through Node(exec_name='...') and only consider events with that procname
    • Pretty simple, however this is limited to 15 characters, and some tests already expect the procname to correspond to the executable name
  5. Use an "active" marker in the trace data itself: have the test generate a unique ID and pass it to the test executable processes being launched by the test using an environment variable, then make the test application record that unique ID in the trace using the simple lttng_ust_tracef API, and have the test code filter out trace data from any process that hasn't recorded that unique ID
    • This would be the most robust solution
    • We would need to modify code, but it's only a few lines and is limited to the test applications (all executables in the test_tracetools package)
    • All tracing tests use test applications/executables from the test_tracetools package, but not all tracing tests are done with tracetools_test for test_tracetools; I need probably need to make this filtering logic to other tests (mainly test_ros2trace?)
  6. Any other solution?

This will then allow us to remove the constraint that led to issue 2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant