
Allow tracing tests to be run in parallel with other tests #95

Merged: clalancette merged 2 commits into rolling from christophebedard/parallel-tracing-tests on Mar 20, 2024

Conversation

christophebedard (Member)

Resolves #94

This allows tracing tests to be run in parallel with (a) other tracing tests (e.g., from another package) and (b) normal tests (e.g., any ROS 2 tests). It achieves that by doing two things in two separate commits:

  1. Only consider test trace data corresponding to the current test
     1. If we configure tracing in a tracing test while other tests (and therefore other "ROS 2 applications") run in parallel, we will collect trace data from all running applications and end up with irrelevant data. #94 (Allow tracing tests to be run in parallel with tracing tests & normal tests) provides an overview of the possible solutions along with their pros and cons. The chosen (and most robust) solution, illustrated in the first sketch after this list, is:
        1. The tracing test passes a trace test ID to the test application processes it spawns through an environment variable
           1. The trace test ID contains the tracing test's PID and the test name (for easy identification)
        2. The test application processes spawned by the test (all in test_tracetools) retrieve that trace test ID from the environment and mark their process by triggering an lttng_ust_tracef:event event with that trace test ID in its payload; this links the trace test ID to that process ID, since the trace event includes the PID
           1. lttng_ust_tracef:event was chosen because it is simple and doesn't require defining a new tracepoint; it simply collects a string msg value (see the lttng_ust_tracef API)
        3. The tracing test reads those lttng_ust_tracef:event events, collects the PIDs of the events matching its test ID, and filters out trace data from all other processes, leaving only relevant trace data
     2. This only concerns test_tracetools and test_ros2trace
  2. Allow having existing tracing sessions during tracing tests
     1. Now that multiple tests (in this case multiple tracing tests) can run in parallel, we can no longer expect there to be only one tracing session at any given time
        1. Since we no longer check that no tracing session exists before and after each test, I modified the tests to check that they do not leave their own tracing sessions behind (see the second sketch after this list)
     2. This only concerns test_ros2trace
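As a rough illustration of the first mechanism, here is a minimal Python sketch of the test side. The environment variable name, helper names, and event dictionary keys ('_name', 'vpid', 'msg') are assumptions for illustration, not the actual tracetools_test/tracetools_read API:

```python
import os

# Hypothetical environment variable name, for illustration only.
TRACE_TEST_ID_ENV_VAR = 'TRACE_TEST_ID'


def get_trace_test_id(test_name: str) -> str:
    # The ID combines the tracing test's PID and the test name,
    # for easy identification.
    return f'{os.getpid()}-{test_name}'


def env_for_spawned_process(trace_test_id: str) -> dict:
    # The tracing test passes the ID to the test application processes
    # it spawns through an environment variable.
    env = dict(os.environ)
    env[TRACE_TEST_ID_ENV_VAR] = trace_test_id
    return env


def filter_events_by_trace_test_id(events: list, trace_test_id: str) -> list:
    # Collect the PIDs of the lttng_ust_tracef:event events whose payload
    # matches our test ID, then keep only events from those processes.
    test_pids = {
        event['vpid']
        for event in events
        if event.get('_name') == 'lttng_ust_tracef:event'
        and event.get('msg') == trace_test_id
    }
    return [event for event in events if event.get('vpid') in test_pids]
```

On the application side, the spawned test processes (C++ in test_tracetools) read that same environment variable and trigger an lttng_ust_tracef event with the ID as its msg payload, which is what makes the PID collection above possible.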
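For the second mechanism, here is a hedged sketch of how a test could verify that it leaves no tracing sessions behind without requiring an empty session list up front; lttngpy.get_session_names() is an assumed accessor for illustration (only destroy_all_sessions() is quoted later in this thread):

```python
import unittest

import lttngpy  # get_session_names() below is an assumption, not a confirmed API


class TraceTestCase(unittest.TestCase):

    def setUp(self) -> None:
        # Tests running in parallel may legitimately own tracing sessions,
        # so record what already exists instead of asserting that no
        # session exists.
        self._preexisting_sessions = set(lttngpy.get_session_names() or [])

    def tearDown(self) -> None:
        # The test must clean up its own sessions, but must not touch
        # sessions owned by concurrently running tests.
        remaining = set(lttngpy.get_session_names() or [])
        self.assertEqual(self._preexisting_sessions, remaining)
```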

@christophebedard added the enhancement label on Mar 12, 2024
@christophebedard self-assigned this on Mar 12, 2024
@clalancette (Contributor) left a comment

Overall, this is absolutely fantastic. Thanks for taking the time to do this.

I've left a few non-blocking nits inline. I'm still going to approve, and leave it to you to decide whether to respond to them, and then run full CI on this.

Review comments (outdated, resolved) on:
  • test_tracetools/CMakeLists.txt
  • tracetools_read/tracetools_read/trace.py
  • tracetools_test/tracetools_test/case.py
@christophebedard force-pushed the christophebedard/parallel-tracing-tests branch from cdab11e to edeaa94 on March 12, 2024 at 18:27
@christophebedard (Member, Author) commented on Mar 12, 2024

Full CI, removing --executor sequential from the test args to run tests in parallel (I think that's the only thing I need to do, judging by ros2/ci#757):

  • Linux Build Status
    • Rebuild: Linux Build Status
  • Linux-aarch64 Build Status
    • Rebuild: Linux-aarch64 Build Status
  • Windows Build Status

If the above jobs look good, I will run a RHEL job too.

@christophebedard (Member, Author) commented on Mar 12, 2024

The ci_linux-aarch64 job passed, but the ci_linux job is failing with a weird LTTng error that I've never seen before when trying to destroy tracing sessions:

PERROR - 19:49:43.719443455 [372535/372535]: recvmsg: Connection reset by peer (in lttcomm_recv_unix_sock() at unix.c:205)

It's a direct failure of recvmsg(). @clalancette are you aware of any weird things going on with sockets in CI?

@clalancette (Contributor)

> It's a direct failure of recvmsg(). @clalancette are you aware of any weird things going on with sockets in CI?

Not off-hand, no. Also, the aarch64 jobs should be exactly equivalent, in the sense that they use the exact same Dockerfile. Let's see what happens on the rebuild, which was launched on a different building host. If we still see it, then we can work with the infrastructure team to try some things out and narrow it down.

@christophebedard (Member, Author)

Still failing.

I did realize that an lttngpy test tries to destroy all tracing sessions (because that's part of its API), which could affect other tests:

    self.assertEqual(0, lttngpy.destroy_all_sessions())

However, tests of downstream packages shouldn't be running concurrently, as seen in the CI logs. It also wouldn't lead to the error we're getting now.

@christophebedard (Member, Author)

I re-triggered the aarch64 job and it failed. I'll try to figure out what tests were running in parallel with the failing tests for the Linux jobs; maybe that will give us a clue.

@clalancette (Contributor)

> I re-triggered the aarch64 job and it failed. I'll try to figure out what tests were running in parallel with the failing tests for the Linux jobs; maybe that will give us a clue.

OK, that sounds good.

If needed, one other thing I can do is run the tests before and after this change back-to-back on the same machine. That would at least tell us whether or not it is an environmental problem. But I'll hold off on doing that until you do some more investigation.

@christophebedard (Member, Author)

If my script is correct, these are the packages that are running tests concurrently with test_ros2trace every time its tests fail but not when they pass:

  • lifecycle_py
  • test_quality_of_service
  • test_tf2
  • test_tracetools_launch

And for test_tracetools_launch:

  • test_ros2trace

The ci_linux-aarch64 job that passed indeed didn't test test_ros2trace and test_tracetools_launch concurrently. This suggests interference from, or unexpected behaviour of, LTTng itself, but I cannot reproduce it locally. I'll keep digging into that.

@christophebedard (Member, Author)

I rebased this after merging #96. Full parallel CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

I don't expect anything different except that test_tracetools's test_lifecycle_node test should not fail, but we'll see.

@christophebedard (Member, Author)

  • Linux Build Status
  • Linux-aarch64 Build Status

For both jobs, test_ros2trace and test_tracetools_launch were running in parallel, but tests only failed in the aarch64 job. Could just be that the tests had less time overlap in the x86_64 job.

Commits:
  • Signed-off-by: Christophe Bedard <christophe.bedard@apex.ai>
  • But make sure the tests do not leave tracing sessions behind.
    Signed-off-by: Christophe Bedard <christophe.bedard@apex.ai>
@clalancette force-pushed the christophebedard/parallel-tracing-tests branch from 91acfea to faf81fe on March 19, 2024 at 20:56
@clalancette (Contributor)

So... I made a mistake.

Based on a comment from @christophebedard, I was going to add a change to this PR making test_ros2trace depend on test_tracetools_launch. It is a bit of a hack, but it should, in theory, force the tracing tests to not run in parallel, thus sidestepping our current parallel problems.

However, by accident I actually pushed that to rolling without any review. Normally I would open a revert PR, but since it is already done and it is a simple one-line change, I've decided to instead rebase this one on top. I'm going to run CI on this again to see if that improves the situation. Sorry for the bit of a mess.

@clalancette (Contributor)

Parallel CI tests:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@clalancette (Contributor)

Given that this is much better than before, I'm going to go ahead and merge this one in. I'm sure the improvement is a combination of this PR plus what I accidentally pushed, but that is good enough for me at the moment.

@clalancette merged commit dfeb3d1 into rolling on Mar 20, 2024 (7 of 9 checks passed)
@clalancette deleted the christophebedard/parallel-tracing-tests branch on March 20, 2024 at 11:59