-
Notifications
You must be signed in to change notification settings - Fork 136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use --isolated if no ROS_DOMAIN_ID is set to help parallel testing #251
Conversation
@dirk-thomas @tfoote I think I've addressed your concerns from the discussion on #240. Please let me know if there's anything else you'd like. |
"--isolated" | ||
) | ||
endif() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is determining the choice at CMake configure time it would be wrong if the env var is not set when building the package but set later when actually running the test. Therefore I think this check should be moved into the Python script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the --isolated argument still make sense then? Is there a use case to force isolation to be turned on when ROS domain ID is set, or would the --isolation flag only do something when no domain ID is already set?
I guess we could also change the flag to be --isolated=0 or --isolated=1 to force it on or off, and you get the automatic behavior if you don't pass the flag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be fine to not have any command line option and make the behavior only depend on the presence of the environment variable.
If you would like to have a command line option to explicitly control the behavior that is fine too. Maybe two mutually exclusive options would be most readable (since isolated
doesn't convey much about the how):
--domain-id
(type=int
) (since the very same could be achieve by setting the env var I am not sure how valuable this one is...)--random-domain-id
(action=store_true
)
Signed-off-by: Pete Baughman <pete.baughman@apex.ai>
- On my linux machine running the ROS nightly docker image, it makes domain selection take about 1ms on average when coordinating 50 isolated domains - If there's another platform where opening sockets is significantly slower, the old random seed code is still here, but commented out Signed-off-by: Pete Baughman <pete.baughman@apex.ai>
b473eac
to
a95fefd
Compare
- This respects the value of ROS_DOMAIN_ID that's set when the test is run. - Setting this in launch_testing_ament_cmake used the value at build time, which was unintuitive Signed-off-by: Pete Baughman <pete.baughman@apex.ai>
a95fefd
to
7e8f3a0
Compare
Ok, I changed the flab back to an option to disable the automatic isolation. There are now two ways to disable isolation - by setting a ROS_DOMAIN_ID before running launch_test, or by passing the --disable-isolation flag. I made the flag a disable flag instead of an enable flag so that people running without ROS_DOMAIN_ID set can debug a launch_test by using the flag and then running 'ros2 topic echo' or something in another terminal window. Without the --disable-isolation flag, they would have to set up two new terminal windows with matching domain IDs in order to debug a launch_test. I don't have a use case for "There is already a ROS_DOMAIN_ID set, but I want launch_testing to pick a sequential one starting at 1 anyway" so there's no flag to force isolation to be on. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I think that this gets us back to the base line for the known use cases discussed in #240
We need to revisit our strategy for isolating tests overall. I think that we will need to implement something more comprehensive for supporting parallel tests and fanning out domain id's such as setting up virtual networks or interfaces that are actually truely isolated. Where we can then use the DOMAIN_IDs without worrying about interacting with a live system.
This allows multiple instances to run in parallel (the default with `colcon test`). | ||
Note that `launch_test` cannot coordinate unique domains across multiple hosts. | ||
If the ROS_DOMAIN_ID environment variable is already set, `launch_test` respects the environment variable and won't attempt to select a different ID. | ||
In this case it's the responsibility of the user to design tests that can be safely run in parallel, or not use parallel test workers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a reasonable solution for now resolving the issues we've seen with. But this is still quite a bit of a risk that tests that were working single threaded will start failing if parallelization is turned on.
Might it be possible for us to prevent parallelization when the DOMAIN_ID is set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tfoote What if launch_testing detected that other launch_testing processes were running and printed a warning if ROS_DOMAIN_ID ls set?
I don't know of a platform independent way to do that that doesn't require the installation of additional python packages though. . .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if launch_testing detected that other launch_testing processes were running and printed a warning if ROS_DOMAIN_ID ls set?
That would not be a reliable approach since a process might be about to be started but not running yet and that case wouldn't be captured.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dirk-thomas It's a warning, though - it would probably work well enough to tell the user that there's a problem with what they're trying to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dirk-thomas Furthermore, if you implemented this as a file lock (linux), or a named mutex (windows) where launch_test locks holds a resource for the duration, you don't have a start-up race.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the current patch be merged as is or do you want to keep iterating on it? (Note: distro freeze for the Dashing release is tomorrow.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dirk-thomas We should merge this - I don't have a good platform independent way to implement the above right now
Signed-off-by: Dirk Thomas <dirk-thomas@users.noreply.github.com>
Signed-off-by: Dirk Thomas <dirk-thomas@users.noreply.github.com>
c192625
to
425036d
Compare
I tested the patch locally 👍 I didn't run CI for this since all CI nodes have a domain id set so this patch won't have any effect. |
…os2#251) * Make auto selected domain ID show up in test log Signed-off-by: Pete Baughman <pete.baughman@apex.ai> * Use deterministic selection for domain isolation - On my linux machine running the ROS nightly docker image, it makes domain selection take about 1ms on average when coordinating 50 isolated domains - If there's another platform where opening sockets is significantly slower, the old random seed code is still here, but commented out Signed-off-by: Pete Baughman <pete.baughman@apex.ai> * Put logic to disable isolation in launch_test - This respects the value of ROS_DOMAIN_ID that's set when the test is run. - Setting this in launch_testing_ament_cmake used the value at build time, which was unintuitive Signed-off-by: Pete Baughman <pete.baughman@apex.ai> * update description Signed-off-by: Dirk Thomas <dirk-thomas@users.noreply.github.com> * combine conditions Signed-off-by: Dirk Thomas <dirk-thomas@users.noreply.github.com>
Description
Some background discussion can be found here: #240
Prior to PR 240, launch_testing would coordinate a random domain ID with all other launch_testing instances so that two tests run at the same time didn't conflict with one another.
This is a problem if you're also using the ROS_DOMAIN_ID to prevent multiple machines on the same network from talking to one-another. When launch_testing ran with a random domain ID, this broke the machine-to-machine isolation that OSRF had set up because
launch_testing
would pick a random domain ID that belonged to some other machine. Domain isolation was changed to be off by default.This PR lets launch_testing use the ROS_DOMAIN_ID for process-to-process isolation when running as part of
colcon test
only if no ROS_DOMAIN_ID environment variable is set. Note the environment variable is checked when you docolcon build
notcolcon test
. Some may find this confusing or unintuitive.To address @tfoote 's concern here I'm also ditching the code that picks a random ROS_DOMAIN_ID in an effort to get a unique ID on the first try. This was meant to make selecting a domain ID faster, but it turns out opening sockets is so fast, even if we do it the 'slow' way it only takes about 1 millisecond longer.
Changes
colcon test
Testing:
I ran the following commands with ROS_DOMAIN_ID set, and unset and verified the correct behavior by looking at the logs
Conclusion:
If you want to use ROS_DOMAIN_ID to isolate multiple computers, launch_testing will not interfere with that. In this case, you may not be able to run
launch_test
in parallel.If you have another mechanism to isolate multiple computers, you can leave ROS_DOMAIN_ID unset and
launch_test
will isolate itself from otherlaunch_test
processes when run as part ofcolcon test