Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use --isolated if no ROS_DOMAIN_ID is set to help parallel testing #251

Merged
merged 5 commits into from
May 29, 2019

Conversation

pbaughman
Copy link
Contributor

Description

Some background discussion can be found here: #240

Prior to PR 240, launch_testing would coordinate a random domain ID with all other launch_testing instances so that two tests run at the same time didn't conflict with one another.

This is a problem if you're also using the ROS_DOMAIN_ID to prevent multiple machines on the same network from talking to one-another. When launch_testing ran with a random domain ID, this broke the machine-to-machine isolation that OSRF had set up because launch_testing would pick a random domain ID that belonged to some other machine. Domain isolation was changed to be off by default.

This PR lets launch_testing use the ROS_DOMAIN_ID for process-to-process isolation when running as part of colcon test only if no ROS_DOMAIN_ID environment variable is set. Note the environment variable is checked when you do colcon build not colcon test. Some may find this confusing or unintuitive.

To address @tfoote 's concern here I'm also ditching the code that picks a random ROS_DOMAIN_ID in an effort to get a unique ID on the first try. This was meant to make selecting a domain ID faster, but it turns out opening sockets is so fast, even if we do it the 'slow' way it only takes about 1 millisecond longer.

Changes

  • launch_testing_ament_cmake - pass the --isolated flag to launch_testing if no ROS_DOMAIN_ID environment variable is set
  • launch_testing - When picking isolated domains, always start at 1 and count up until an unused domain is found.
  • launch_testing - Make the isolated ROS_DOMAIN_ID selection into an 'info' level log message instead of a 'debug' level log message so it's captured in the logs saved when you do colcon test

Testing:

I ran the following commands with ROS_DOMAIN_ID set, and unset and verified the correct behavior by looking at the logs

cd /launch
colcon build --cmake-args -DBUILD_TESTING=1
colcon test --packages-select launch_testing_ament_cmake 
cat log/latest_test/launch_testing_ament_cmake/stdout_stderr.log

Conclusion:

If you want to use ROS_DOMAIN_ID to isolate multiple computers, launch_testing will not interfere with that. In this case, you may not be able to run launch_test in parallel.

If you have another mechanism to isolate multiple computers, you can leave ROS_DOMAIN_ID unset and launch_test will isolate itself from other launch_test processes when run as part of colcon test

@pbaughman pbaughman changed the title Fix testing in parallel if no ROS_DOMAIN_ID is set Allow testing in parallel if no ROS_DOMAIN_ID is set May 22, 2019
@pbaughman pbaughman changed the title Allow testing in parallel if no ROS_DOMAIN_ID is set Use --isolated if no ROS_DOMAIN_ID is set to help parallel testing May 22, 2019
@pbaughman
Copy link
Contributor Author

@dirk-thomas @tfoote I think I've addressed your concerns from the discussion on #240. Please let me know if there's anything else you'd like.

@dirk-thomas dirk-thomas added the enhancement New feature or request label May 23, 2019
"--isolated"
)
endif()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is determining the choice at CMake configure time it would be wrong if the env var is not set when building the package but set later when actually running the test. Therefore I think this check should be moved into the Python script.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the --isolated argument still make sense then? Is there a use case to force isolation to be turned on when ROS domain ID is set, or would the --isolation flag only do something when no domain ID is already set?

I guess we could also change the flag to be --isolated=0 or --isolated=1 to force it on or off, and you get the automatic behavior if you don't pass the flag.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be fine to not have any command line option and make the behavior only depend on the presence of the environment variable.

If you would like to have a command line option to explicitly control the behavior that is fine too. Maybe two mutually exclusive options would be most readable (since isolated doesn't convey much about the how):

  • --domain-id (type=int) (since the very same could be achieve by setting the env var I am not sure how valuable this one is...)
  • --random-domain-id (action=store_true)

launch_testing_ament_cmake/cmake/add_launch_test.cmake Outdated Show resolved Hide resolved
Pete Baughman added 2 commits May 23, 2019 15:02
Signed-off-by: Pete Baughman <pete.baughman@apex.ai>
  - On my linux machine running the ROS nightly docker image, it makes domain
    selection take about 1ms on average when coordinating 50 isolated domains
  - If there's another platform where opening sockets is significantly slower, the
    old random seed code is still here, but commented out

Signed-off-by: Pete Baughman <pete.baughman@apex.ai>
  - This respects the value of ROS_DOMAIN_ID that's set when the test is run.
  - Setting this in launch_testing_ament_cmake used the value at build time, which was unintuitive

Signed-off-by: Pete Baughman <pete.baughman@apex.ai>
@pbaughman
Copy link
Contributor Author

@dirk-thomas

Ok, I changed the flab back to an option to disable the automatic isolation. There are now two ways to disable isolation - by setting a ROS_DOMAIN_ID before running launch_test, or by passing the --disable-isolation flag.

I made the flag a disable flag instead of an enable flag so that people running without ROS_DOMAIN_ID set can debug a launch_test by using the flag and then running 'ros2 topic echo' or something in another terminal window. Without the --disable-isolation flag, they would have to set up two new terminal windows with matching domain IDs in order to debug a launch_test.

I don't have a use case for "There is already a ROS_DOMAIN_ID set, but I want launch_testing to pick a sequential one starting at 1 anyway" so there's no flag to force isolation to be on.

Copy link
Contributor

@tfoote tfoote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think that this gets us back to the base line for the known use cases discussed in #240

We need to revisit our strategy for isolating tests overall. I think that we will need to implement something more comprehensive for supporting parallel tests and fanning out domain id's such as setting up virtual networks or interfaces that are actually truely isolated. Where we can then use the DOMAIN_IDs without worrying about interacting with a live system.

This allows multiple instances to run in parallel (the default with `colcon test`).
Note that `launch_test` cannot coordinate unique domains across multiple hosts.
If the ROS_DOMAIN_ID environment variable is already set, `launch_test` respects the environment variable and won't attempt to select a different ID.
In this case it's the responsibility of the user to design tests that can be safely run in parallel, or not use parallel test workers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a reasonable solution for now resolving the issues we've seen with. But this is still quite a bit of a risk that tests that were working single threaded will start failing if parallelization is turned on.

Might it be possible for us to prevent parallelization when the DOMAIN_ID is set?

Copy link
Contributor Author

@pbaughman pbaughman May 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tfoote What if launch_testing detected that other launch_testing processes were running and printed a warning if ROS_DOMAIN_ID ls set?

I don't know of a platform independent way to do that that doesn't require the installation of additional python packages though. . .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if launch_testing detected that other launch_testing processes were running and printed a warning if ROS_DOMAIN_ID ls set?

That would not be a reliable approach since a process might be about to be started but not running yet and that case wouldn't be captured.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dirk-thomas It's a warning, though - it would probably work well enough to tell the user that there's a problem with what they're trying to do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dirk-thomas Furthermore, if you implemented this as a file lock (linux), or a named mutex (windows) where launch_test locks holds a resource for the duration, you don't have a start-up race.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the current patch be merged as is or do you want to keep iterating on it? (Note: distro freeze for the Dashing release is tomorrow.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dirk-thomas We should merge this - I don't have a good platform independent way to implement the above right now

Signed-off-by: Dirk Thomas <dirk-thomas@users.noreply.github.com>
Signed-off-by: Dirk Thomas <dirk-thomas@users.noreply.github.com>
@dirk-thomas
Copy link
Member

I tested the patch locally 👍

I didn't run CI for this since all CI nodes have a domain id set so this patch won't have any effect.

@dirk-thomas dirk-thomas merged commit 066009b into ros2:master May 29, 2019
@pbaughman pbaughman deleted the isolate_if_no_domain_id branch May 29, 2019 00:31
piraka9011 pushed a commit to aws-ros-dev/launch that referenced this pull request Aug 16, 2019
…os2#251)

* Make auto selected domain ID show up in test log

Signed-off-by: Pete Baughman <pete.baughman@apex.ai>

* Use deterministic selection for domain isolation

  - On my linux machine running the ROS nightly docker image, it makes domain
    selection take about 1ms on average when coordinating 50 isolated domains
  - If there's another platform where opening sockets is significantly slower, the
    old random seed code is still here, but commented out

Signed-off-by: Pete Baughman <pete.baughman@apex.ai>

* Put logic to disable isolation in launch_test

  - This respects the value of ROS_DOMAIN_ID that's set when the test is run.
  - Setting this in launch_testing_ament_cmake used the value at build time, which was unintuitive

Signed-off-by: Pete Baughman <pete.baughman@apex.ai>

* update description

Signed-off-by: Dirk Thomas <dirk-thomas@users.noreply.github.com>

* combine conditions

Signed-off-by: Dirk Thomas <dirk-thomas@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants