Use --isolated if no ROS_DOMAIN_ID is set to help parallel testing #251

pbaughman · 2019-05-22T23:44:31Z

Description

Some background discussion can be found here: #240

Prior to PR 240, launch_testing would coordinate a random domain ID with all other launch_testing instances so that two tests run at the same time didn't conflict with one another.

This is a problem if you're also using the ROS_DOMAIN_ID to prevent multiple machines on the same network from talking to one-another. When launch_testing ran with a random domain ID, this broke the machine-to-machine isolation that OSRF had set up because launch_testing would pick a random domain ID that belonged to some other machine. Domain isolation was changed to be off by default.

This PR lets launch_testing use the ROS_DOMAIN_ID for process-to-process isolation when running as part of colcon test only if no ROS_DOMAIN_ID environment variable is set. Note the environment variable is checked when you do colcon build not colcon test. Some may find this confusing or unintuitive.

To address @tfoote 's concern here I'm also ditching the code that picks a random ROS_DOMAIN_ID in an effort to get a unique ID on the first try. This was meant to make selecting a domain ID faster, but it turns out opening sockets is so fast, even if we do it the 'slow' way it only takes about 1 millisecond longer.

Changes

launch_testing_ament_cmake - pass the --isolated flag to launch_testing if no ROS_DOMAIN_ID environment variable is set
launch_testing - When picking isolated domains, always start at 1 and count up until an unused domain is found.
launch_testing - Make the isolated ROS_DOMAIN_ID selection into an 'info' level log message instead of a 'debug' level log message so it's captured in the logs saved when you do colcon test

Testing:

I ran the following commands with ROS_DOMAIN_ID set, and unset and verified the correct behavior by looking at the logs

cd /launch
colcon build --cmake-args -DBUILD_TESTING=1
colcon test --packages-select launch_testing_ament_cmake 
cat log/latest_test/launch_testing_ament_cmake/stdout_stderr.log

Conclusion:

If you want to use ROS_DOMAIN_ID to isolate multiple computers, launch_testing will not interfere with that. In this case, you may not be able to run launch_test in parallel.

If you have another mechanism to isolate multiple computers, you can leave ROS_DOMAIN_ID unset and launch_test will isolate itself from other launch_test processes when run as part of colcon test

pbaughman · 2019-05-22T23:46:53Z

@dirk-thomas @tfoote I think I've addressed your concerns from the discussion on #240. Please let me know if there's anything else you'd like.

dirk-thomas · 2019-05-23T00:10:55Z

launch_testing_ament_cmake/cmake/add_launch_test.cmake

+      "--isolated"
+    )
+  endif()
+


Since this is determining the choice at CMake configure time it would be wrong if the env var is not set when building the package but set later when actually running the test. Therefore I think this check should be moved into the Python script.

Does the --isolated argument still make sense then? Is there a use case to force isolation to be turned on when ROS domain ID is set, or would the --isolation flag only do something when no domain ID is already set?

I guess we could also change the flag to be --isolated=0 or --isolated=1 to force it on or off, and you get the automatic behavior if you don't pass the flag.

I would be fine to not have any command line option and make the behavior only depend on the presence of the environment variable.

If you would like to have a command line option to explicitly control the behavior that is fine too. Maybe two mutually exclusive options would be most readable (since isolated doesn't convey much about the how):

--domain-id (type=int) (since the very same could be achieve by setting the env var I am not sure how valuable this one is...)

--random-domain-id (action=store_true)

launch_testing_ament_cmake/cmake/add_launch_test.cmake

Signed-off-by: Pete Baughman <pete.baughman@apex.ai>

- On my linux machine running the ROS nightly docker image, it makes domain selection take about 1ms on average when coordinating 50 isolated domains - If there's another platform where opening sockets is significantly slower, the old random seed code is still here, but commented out Signed-off-by: Pete Baughman <pete.baughman@apex.ai>

- This respects the value of ROS_DOMAIN_ID that's set when the test is run. - Setting this in launch_testing_ament_cmake used the value at build time, which was unintuitive Signed-off-by: Pete Baughman <pete.baughman@apex.ai>

pbaughman · 2019-05-23T23:02:12Z

@dirk-thomas

Ok, I changed the flab back to an option to disable the automatic isolation. There are now two ways to disable isolation - by setting a ROS_DOMAIN_ID before running launch_test, or by passing the --disable-isolation flag.

I made the flag a disable flag instead of an enable flag so that people running without ROS_DOMAIN_ID set can debug a launch_test by using the flag and then running 'ros2 topic echo' or something in another terminal window. Without the --disable-isolation flag, they would have to set up two new terminal windows with matching domain IDs in order to debug a launch_test.

I don't have a use case for "There is already a ROS_DOMAIN_ID set, but I want launch_testing to pick a sequential one starting at 1 anyway" so there's no flag to force isolation to be on.

tfoote

Overall I think that this gets us back to the base line for the known use cases discussed in #240

We need to revisit our strategy for isolating tests overall. I think that we will need to implement something more comprehensive for supporting parallel tests and fanning out domain id's such as setting up virtual networks or interfaces that are actually truely isolated. Where we can then use the DOMAIN_IDs without worrying about interacting with a live system.

tfoote · 2019-05-25T00:05:03Z

launch_testing/README.md

+This allows multiple instances to run in parallel (the default with `colcon test`).
+Note that `launch_test` cannot coordinate unique domains across multiple hosts.  
+If the ROS_DOMAIN_ID environment variable is already set, `launch_test` respects the environment variable and won't attempt to select a different ID.
+In this case it's the responsibility of the user to design tests that can be safely run in parallel, or not use parallel test workers.


This seems like a reasonable solution for now resolving the issues we've seen with. But this is still quite a bit of a risk that tests that were working single threaded will start failing if parallelization is turned on.

Might it be possible for us to prevent parallelization when the DOMAIN_ID is set?

@tfoote What if launch_testing detected that other launch_testing processes were running and printed a warning if ROS_DOMAIN_ID ls set?

I don't know of a platform independent way to do that that doesn't require the installation of additional python packages though. . .

What if launch_testing detected that other launch_testing processes were running and printed a warning if ROS_DOMAIN_ID ls set?

That would not be a reliable approach since a process might be about to be started but not running yet and that case wouldn't be captured.

@dirk-thomas It's a warning, though - it would probably work well enough to tell the user that there's a problem with what they're trying to do.

@dirk-thomas Furthermore, if you implemented this as a file lock (linux), or a named mutex (windows) where launch_test locks holds a resource for the duration, you don't have a start-up race.

Should the current patch be merged as is or do you want to keep iterating on it? (Note: distro freeze for the Dashing release is tomorrow.)

@dirk-thomas We should merge this - I don't have a good platform independent way to implement the above right now

Signed-off-by: Dirk Thomas <dirk-thomas@users.noreply.github.com>

dirk-thomas · 2019-05-29T00:03:54Z

I tested the patch locally 👍

I didn't run CI for this since all CI nodes have a domain id set so this patch won't have any effect.

…os2#251) * Make auto selected domain ID show up in test log Signed-off-by: Pete Baughman <pete.baughman@apex.ai> * Use deterministic selection for domain isolation - On my linux machine running the ROS nightly docker image, it makes domain selection take about 1ms on average when coordinating 50 isolated domains - If there's another platform where opening sockets is significantly slower, the old random seed code is still here, but commented out Signed-off-by: Pete Baughman <pete.baughman@apex.ai> * Put logic to disable isolation in launch_test - This respects the value of ROS_DOMAIN_ID that's set when the test is run. - Setting this in launch_testing_ament_cmake used the value at build time, which was unintuitive Signed-off-by: Pete Baughman <pete.baughman@apex.ai> * update description Signed-off-by: Dirk Thomas <dirk-thomas@users.noreply.github.com> * combine conditions Signed-off-by: Dirk Thomas <dirk-thomas@users.noreply.github.com>

pbaughman changed the title ~~Fix testing in parallel if no ROS_DOMAIN_ID is set~~ Allow testing in parallel if no ROS_DOMAIN_ID is set May 22, 2019

pbaughman changed the title ~~Allow testing in parallel if no ROS_DOMAIN_ID is set~~ Use --isolated if no ROS_DOMAIN_ID is set to help parallel testing May 22, 2019

dirk-thomas added the enhancement New feature or request label May 23, 2019

dirk-thomas reviewed May 23, 2019

View reviewed changes

Pete Baughman added 2 commits May 23, 2019 15:02

Make auto selected domain ID show up in test log

c0e88b2

Signed-off-by: Pete Baughman <pete.baughman@apex.ai>

pbaughman force-pushed the isolate_if_no_domain_id branch from b473eac to a95fefd Compare May 23, 2019 22:46

Put logic to disable isolation in launch_test

7e8f3a0

- This respects the value of ROS_DOMAIN_ID that's set when the test is run. - Setting this in launch_testing_ament_cmake used the value at build time, which was unintuitive Signed-off-by: Pete Baughman <pete.baughman@apex.ai>

pbaughman force-pushed the isolate_if_no_domain_id branch from a95fefd to 7e8f3a0 Compare May 23, 2019 22:56

tfoote reviewed May 25, 2019

View reviewed changes

dirk-thomas added 2 commits May 28, 2019 17:00

update description

32f6536

Signed-off-by: Dirk Thomas <dirk-thomas@users.noreply.github.com>

combine conditions

425036d

Signed-off-by: Dirk Thomas <dirk-thomas@users.noreply.github.com>

dirk-thomas force-pushed the isolate_if_no_domain_id branch from c192625 to 425036d Compare May 29, 2019 00:01

dirk-thomas merged commit 066009b into ros2:master May 29, 2019

pbaughman deleted the isolate_if_no_domain_id branch May 29, 2019 00:31

This was referenced May 29, 2019

I would like to extend run_test.py to provide ROS_DOMAIN_ID isolation ament/ament_cmake#172

Closed

Add ament_add_pytest_isolated and ament_add_gtest_isolated ros2/ament_cmake_ros#3

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use --isolated if no ROS_DOMAIN_ID is set to help parallel testing #251

Use --isolated if no ROS_DOMAIN_ID is set to help parallel testing #251

pbaughman commented May 22, 2019

pbaughman commented May 22, 2019

dirk-thomas May 23, 2019

pbaughman May 23, 2019

dirk-thomas May 23, 2019

pbaughman commented May 23, 2019

tfoote left a comment

tfoote May 25, 2019

pbaughman May 28, 2019 •

edited

Loading

dirk-thomas May 28, 2019

pbaughman May 28, 2019

pbaughman May 28, 2019

dirk-thomas May 28, 2019

pbaughman May 28, 2019

dirk-thomas commented May 29, 2019

Use --isolated if no ROS_DOMAIN_ID is set to help parallel testing #251

Use --isolated if no ROS_DOMAIN_ID is set to help parallel testing #251

Conversation

pbaughman commented May 22, 2019

Description

Changes

Testing:

Conclusion:

pbaughman commented May 22, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pbaughman commented May 23, 2019

tfoote left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pbaughman May 28, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dirk-thomas commented May 29, 2019

pbaughman May 28, 2019 •

edited

Loading