
Allow tracing tests to be run in parallel with other tests #95

Merged: clalancette merged 2 commits into rolling from christophebedard/parallel-tracing-tests on Mar 20, 2024

Conversation

christophebedard (Member)

Resolves #94

This allows tracing tests to be run in parallel with (a) other tracing tests (e.g., from another package) and (b) normal tests (e.g., any ROS 2 tests). It achieves that by doing two things in two separate commits:

  1. Only consider test trace data corresponding to the current test
     1. If we configure tracing in a tracing test while other tests (and therefore other "ROS 2 applications") run in parallel, we will collect trace data from all running applications and end up with irrelevant data. #94 (Allow tracing tests to be run in parallel with tracing tests & normal tests) provides an overview of the possible solutions along with their pros and cons. The chosen (and most robust) solution, illustrated in the first sketch after this list, is:
        1. The tracing test passes a trace test ID to the test application processes it spawns through an environment variable
           1. The trace test ID contains the tracing test's PID and the test name (for easy identification)
        2. The test application processes spawned by the test (all in test_tracetools) retrieve that trace test ID from the environment and mark their process by triggering an lttng_ust_tracef:event event with that trace test ID in its payload; this links the trace test ID to that process ID, since the trace event includes the PID
           1. lttng_ust_tracef:event was chosen because it is simple and doesn't require defining a new tracepoint; it simply collects a string msg value (see the lttng_ust_tracef API)
        3. The tracing test reads those lttng_ust_tracef:event events, collects the PIDs of the events matching its test ID, and filters out trace data from all other processes, leaving only relevant trace data
     2. This only concerns test_tracetools and test_ros2trace
  2. Allow having existing tracing sessions during tracing tests
     1. Now that multiple tests (in this case multiple tracing tests) can run in parallel, we can no longer expect there to be only one tracing session at any given time
        1. Since we no longer check that no tracing session exists before and after each test, I modified the tests to check that they do not leave their own tracing sessions behind (see the second sketch after this list)
     2. This only concerns test_ros2trace
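As a rough illustration of the first mechanism, here is a minimal Python sketch of the test side. The environment variable name, helper names, and event dictionary keys ('_name', 'vpid', 'msg') are assumptions for illustration, not the actual tracetools_test/tracetools_read API:

```python
import os

# Hypothetical environment variable name, for illustration only.
TRACE_TEST_ID_ENV_VAR = 'TRACE_TEST_ID'


def get_trace_test_id(test_name: str) -> str:
    # The ID combines the tracing test's PID and the test name,
    # for easy identification.
    return f'{os.getpid()}-{test_name}'


def env_for_spawned_process(trace_test_id: str) -> dict:
    # The tracing test passes the ID to the test application processes
    # it spawns through an environment variable.
    env = dict(os.environ)
    env[TRACE_TEST_ID_ENV_VAR] = trace_test_id
    return env


def filter_events_by_trace_test_id(events: list, trace_test_id: str) -> list:
    # Collect the PIDs of the lttng_ust_tracef:event events whose payload
    # matches our test ID, then keep only events from those processes.
    test_pids = {
        event['vpid']
        for event in events
        if event.get('_name') == 'lttng_ust_tracef:event'
        and event.get('msg') == trace_test_id
    }
    return [event for event in events if event.get('vpid') in test_pids]
```

On the application side, the spawned test processes (C++ in test_tracetools) read that same environment variable and trigger an lttng_ust_tracef event with the ID as its msg payload, which is what makes the PID collection above possible.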
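For the second mechanism, here is a hedged sketch of how a test could verify that it leaves no tracing sessions behind without requiring an empty session list up front; lttngpy.get_session_names() is an assumed accessor for illustration (only destroy_all_sessions() is quoted later in this thread):

```python
import unittest

import lttngpy  # get_session_names() below is an assumption, not a confirmed API


class TraceTestCase(unittest.TestCase):

    def setUp(self) -> None:
        # Tests running in parallel may legitimately own tracing sessions,
        # so record what already exists instead of asserting that no
        # session exists.
        self._preexisting_sessions = set(lttngpy.get_session_names() or [])

    def tearDown(self) -> None:
        # The test must clean up its own sessions, but must not touch
        # sessions owned by concurrently running tests.
        remaining = set(lttngpy.get_session_names() or [])
        self.assertEqual(self._preexisting_sessions, remaining)
```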

@christophebedard added the enhancement label on Mar 12, 2024
@christophebedard self-assigned this on Mar 12, 2024
@clalancette (Contributor) left a comment

Overall, this is absolutely fantastic. Thanks for taking the time to do this.

I've left a few non-blocking nits inline. I'm still going to approve, and leave it to you to decide whether to respond to them, and then run full CI on this.

Review comments (outdated, resolved) on:
  • test_tracetools/CMakeLists.txt
  • tracetools_read/tracetools_read/trace.py
  • tracetools_test/tracetools_test/case.py
@christophebedard force-pushed the christophebedard/parallel-tracing-tests branch from cdab11e to edeaa94 on March 12, 2024 at 18:27
@christophebedard (Member, Author) commented on Mar 12, 2024

Full CI, removing --executor sequential from the test args to run tests in parallel (I think that's the only thing I need to do, judging by ros2/ci#757):

  • Linux Build Status
    • Rebuild: Linux Build Status
  • Linux-aarch64 Build Status
    • Rebuild: Linux-aarch64 Build Status
  • Windows Build Status

If the above jobs look good, I will run a RHEL job too.

@christophebedard (Member, Author) commented on Mar 12, 2024

The ci_linux-aarch64 job passed, but the ci_linux job is failing with a weird LTTng error that I've never seen before when trying to destroy tracing sessions:

PERROR - 19:49:43.719443455 [372535/372535]: recvmsg: Connection reset by peer (in lttcomm_recv_unix_sock() at unix.c:205)

It's a direct failure of recvmsg(). @clalancette are you aware of any weird things going on with sockets in CI?

@clalancette (Contributor)

> It's a direct failure of recvmsg(). @clalancette are you aware of any weird things going on with sockets in CI?

Not off-hand, no. Also, the aarch64 jobs should be exactly equivalent, in the sense that they use the exact same Dockerfile. Let's see what happens on the rebuild, which was launched on a different building host. If we still see it, then we can work with the infrastructure team to try some things out and narrow it down.

@christophebedard (Member, Author)

Still failing.

I did realize that an lttngpy test tries to destroy all tracing sessions (because that's part of its API), which could affect other tests:

    self.assertEqual(0, lttngpy.destroy_all_sessions())

However, tests of downstream packages shouldn't be running concurrently, as seen in the CI logs. It also wouldn't lead to the error we're getting now.

@christophebedard (Member, Author)

I re-triggered the aarch64 job and it failed. I'll try to figure out what tests were running in parallel with the failing tests for the Linux jobs; maybe that will give us a clue.

@clalancette (Contributor)

> I re-triggered the aarch64 job and it failed. I'll try to figure out what tests were running in parallel with the failing tests for the Linux jobs; maybe that will give us a clue.

OK, that sounds good.

If needed, one other thing I can do is run the tests before and after this change back-to-back on the same machine. That would at least tell us whether or not it is an environmental problem. But I'll hold off on doing that until you do some more investigation.

@christophebedard (Member, Author)

If my script is correct, these are the packages that are running tests concurrently with test_ros2trace every time its tests fail but not when they pass:

  • lifecycle_py
  • test_quality_of_service
  • test_tf2
  • test_tracetools_launch

And for test_tracetools_launch:

  • test_ros2trace

The ci_linux-aarch64 job that passed indeed didn't test test_ros2trace and test_tracetools_launch concurrently. This suggests interference from, or unexpected behaviour of, LTTng itself, but I cannot reproduce it locally. I'll keep digging into that.

@christophebedard (Member, Author)

I rebased this after merging #96. Full parallel CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

I don't expect anything different except that test_tracetools's test_lifecycle_node test should not fail, but we'll see.

@christophebedard (Member, Author)

  • Linux Build Status
  • Linux-aarch64 Build Status

For both jobs, test_ros2trace and test_tracetools_launch were running in parallel, but tests only failed in the aarch64 job. Could just be that the tests had less time overlap in the x86_64 job.

Commits:
  • Signed-off-by: Christophe Bedard <christophe.bedard@apex.ai>
  • But make sure the tests do not leave tracing sessions behind.
    Signed-off-by: Christophe Bedard <christophe.bedard@apex.ai>
@clalancette force-pushed the christophebedard/parallel-tracing-tests branch from 91acfea to faf81fe on March 19, 2024 at 20:56
@clalancette (Contributor)

So... I made a mistake.

Based on a comment from @christophebedard, I was going to add a change to this PR making test_ros2trace depend on test_tracetools_launch. It is a bit of a hack, but it should, in theory, force the tracing tests to not run in parallel, thus sidestepping our current parallel problems.

However, by accident I actually pushed that to rolling without any review. Normally I would open a revert PR, but since it is already done and it is a simple one-line change, I've decided to instead rebase this one on top. I'm going to run CI on this again to see if that improves the situation. Sorry for the bit of a mess.

@clalancette (Contributor)

Parallel CI tests:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@clalancette (Contributor)

Given that this is much better than before, I'm going to go ahead and merge this one in. I'm sure the improvement is a combination of this PR plus what I accidentally pushed, but that is good enough for me at the moment.

@clalancette merged commit dfeb3d1 into rolling on Mar 20, 2024 (7 of 9 checks passed)
@clalancette deleted the christophebedard/parallel-tracing-tests branch on March 20, 2024 at 11:59