ROS 2 Launch System #163

Open. Wants to merge 10 commits into base: gh-pages.

Conversation

@wjwwood (Member) commented Feb 13, 2018:

This is a WIP. It's not ready for review or comment right now (not even in a readable state I would say).

When it's ready for an early review I'll post a comment. When it's ready for wider review I'll post an RFC on discourse.ros.org.

wip

@wjwwood wjwwood added the in progress label Feb 13, 2018

@wjwwood wjwwood self-assigned this Feb 13, 2018

wjwwood added some commits Feb 16, 2018


In order for the launch system to execute a described system, it needs to understand how it can achieve the description.
"Calling convention" is an existing phrase in Computer Science[^calling_convention_wikipedia], but this section is not talking specifically about the compiler-defined calling convention, though it is appropriating the term to describe a similar relationship.
In this case, the phrase "calling conventions" is meant to describe the "interface" or "contract" the launch system has with the entities it is executing and monitoring.

@sloretz (Contributor) commented Feb 27, 2018:

Might be a good idea to define what an entity is before this section. I assume "entities" means the types of nodes: a process with one node, a process manually composed with multiple nodes, a library containing a dynamically composable node, or all of these but with a lifecycle node.

@wjwwood (Author, Member) commented Feb 27, 2018:

I was hoping this sentence would imply "entities are anything that the launch system executes and/or monitors", rather than me specifically defining entities. I used "entities" because "things" sounded unsophisticated.

Perhaps I could reword it to be "... the launch system has with entities, which are anything it executes or monitors."? If not, do you have a suggestion as to how to reword it?

@sloretz (Contributor) commented Feb 27, 2018:

Oops, I assumed "entity" was going to be a special term. If it's not going to be used later on, then I suggest shortening to "... the launch system has with anything it executes or monitors.".

#### Termination

Managed ROS Nodes have some additional observable effects when terminating (the node, not necessarily the process containing it).
A managed node enters the `Finalized` state after passing through the `ShuttingDown` transition state on termination.

@sloretz (Contributor) commented Feb 27, 2018:

During termination, will ROS 2 launch try to transition the node to the Finalized state, or just send signals and hope handling code in the node takes care of it?

@wjwwood (Author, Member) commented Feb 27, 2018:

A good question, to which I do not know the answer. My gut reaction would be to say it sends a signal, and the launch system tries not to send state transition requests itself unless directed to do so by the user. So maybe it defaults to signals, and the user can ask the launch system to transition first, then signal. I'll mention it and add an RFC block about the topic.

@jack-oquin commented Feb 28, 2018:

I wonder if these things should be recursively nestable, with roslaunch itself serving as a managed ROS node?

@wjwwood (Author, Member) commented Mar 10, 2018:

It's possible, and I do expect the launch system will be a Python ROS node itself, though it might not be a managed one to start with, partially because there's no lifecycle in Python atm and because it's simpler to start without that.

But I can imagine calling the launch system from the launch system in a nesting fashion. This just has to be balanced against including other launch files instead. I think that might be more transparent than calling the launch system's executable multiple times.

@wjwwood (Author, Member) commented Mar 10, 2018:

For instance, in ROS 1, the capability server is often called by roslaunch and itself in turn calls roslaunch multiple times. If I were to reimplement that in ROS 2, I would hope to instead have the capability server ask an existing launch system to add something dynamically, rather than starting a new instance. That is if the capability server is not an instance of the launch system itself or rolled into it entirely.

For example, changing the namespace of the single node could be expressed with the command line argument `__ns:=new_namespace`.

Even though there is only one node in the process, that node does not need to start when the process starts, nor does the process need to end when the node is shut down and/or destroyed.
If it is a managed node, the lifecycle of the node is best tracked using the lifecycle events.
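
As a rough sketch of how a launch system might assemble the command line for such a single-node process, including the `__ns:=` remap shown above (illustrative only; the executable name and the helper function are made up):

```python
import subprocess

def spawn_single_node_process(executable, namespace=None, remaps=None):
    """Build the argv for a single-node process, appending ROS remapping arguments."""
    cmd = [executable]
    if namespace is not None:
        # e.g. `talker __ns:=/robot1`
        cmd.append('__ns:={}'.format(namespace))
    for from_name, to_name in (remaps or {}).items():
        cmd.append('{}:={}'.format(from_name, to_name))
    return subprocess.Popen(cmd)

# Usage (hypothetical executable name):
# spawn_single_node_process('talker', namespace='/robot1', remaps={'chatter': 'conversation'})
```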

@sloretz (Contributor) commented Feb 27, 2018:

Is ROS 2 launch responsible for transitioning a node to Active? If so, how would it know if a process contained a normal node or a managed one? If not, how would the user manage the node? By launching a separate node that triggers the transitions?

@wjwwood (Author, Member) commented Feb 27, 2018:

> Is ROS 2 launch responsible for transitioning a node to Active?

Again, not sure. My initial reaction is "no, the node should transition itself unless directed to do otherwise". But I haven't fleshed out how that would work. We (me, Karsten, perhaps others) have spoken in the past about lifecycle profiles which would be basically an option to lifecycle nodes which determines what they do automatically and what they wait to do based on external requests, e.g. a default profile might transition all the way to active, then stay there until shutdown (sigint or something else), or a managed profile that goes automatically until inactive then waits for external transition requests or shutdown.

> If so, how would it know if a process contained a normal node or a managed one?

I believe the user will already have to inform the launch system if a process contains a node and if so whether or not it is a managed node. So, the user would have to tell it, otherwise it will assume it's a normal node or a normal process.

> If not, how would the user manage the node? By launching a separate node that triggers the transitions?

Yes, this gets into one of the trickier issues, for which I've been trying to come up with the right taxonomy (meronomy?) of relationships. Basically, where does the launch system end and general purpose lifecycle management begin? The launch system could use the lifecycle to verify the system and/or do more sophisticated startup procedures (e.g. when nodes A and B get to state inactive, launch node C...), but how is that different from "if node A goes to the error state, move node B to inactive"? Should it do both, or neither, or some grey line in between?

If the answer is neither or some grey line in between, then yes the user would simply start another node to manage the nodes that need external transitions.
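
For reference, a minimal sketch of what such an external manager node could look like, using the `lifecycle_msgs` interfaces from the managed-nodes design (the target node name is made up, and nothing here is part of the launch system proposal itself):

```python
import rclpy
from rclpy.node import Node
from lifecycle_msgs.msg import Transition
from lifecycle_msgs.srv import ChangeState

class SimpleLifecycleManager(Node):
    """Requests lifecycle transitions on a managed node, instead of the launch system doing it."""

    def __init__(self, target_node_name):
        super().__init__('simple_lifecycle_manager')
        self._client = self.create_client(
            ChangeState, '{}/change_state'.format(target_node_name))

    def transition(self, transition_id):
        self._client.wait_for_service()
        request = ChangeState.Request(transition=Transition(id=transition_id))
        future = self._client.call_async(request)
        rclpy.spin_until_future_complete(self, future)
        return future.result().success

def main():
    rclpy.init()
    manager = SimpleLifecycleManager('/my_managed_node')  # made-up node name
    # Walk the node from 'unconfigured' through 'inactive' to 'active'.
    manager.transition(Transition.TRANSITION_CONFIGURE)
    manager.transition(Transition.TRANSITION_ACTIVATE)
    rclpy.shutdown()

if __name__ == '__main__':
    main()
```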

@sloretz (Contributor) commented Feb 27, 2018:

It sort of sounds like you're saying your gut reaction is to separate the concern of launching nodes from that of managing their lifecycle.

Assuming a separate process/project for managing lifecycle nodes, it seems like that project would need an API into ROS 2 launch rather than the other way around. E.g. a managing process might ask ROS 2 launch to start some unmanaged log-backup node when a managed node transitions to cleanup.

@wjwwood (Author, Member) commented Mar 10, 2018:

> Assuming a separate process/project for managing lifecycle nodes, it seems like that project would need an API into ROS 2 launch rather than the other way around. E.g. a managing process might ask ROS 2 launch to start some unmanaged log-backup node when a managed node transitions to cleanup.

Yeah, I expect we'll have that interface regardless. It should be possible to interact with a running launch system via API (probably Python), or via ROS interfaces like services, or maybe even via a different SOA like REST or XMLRPC (just ideas, no concrete plans to do that).


It would be more efficient because you don't need to have a node in each proxy to actually make the service call to load the node the user wants to run, and you don't need to maintain a bond between the container and the proxy processes.

So I'm interested in what others think about coming up with a standardized executable which can load and configure many nodes at once from command line arguments or a config file.
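
To make that idea concrete, here is one possible shape for such a container executable, sketched with rclpy (the `module:ClassName` argument format, and the assumption that each node class takes no constructor arguments, are inventions of this sketch, not a proposal of the document):

```python
#!/usr/bin/env python3
"""Illustrative container: load several node classes into one process from the command line."""
import argparse
import importlib

import rclpy
from rclpy.executors import MultiThreadedExecutor

def load_node_class(spec):
    # spec is 'module.path:ClassName' -- a format invented for this sketch.
    module_path, class_name = spec.split(':')
    return getattr(importlib.import_module(module_path), class_name)

def main():
    parser = argparse.ArgumentParser(description='Run many nodes in one process.')
    parser.add_argument('node_specs', nargs='+', help='e.g. my_pkg.talker:Talker')
    args = parser.parse_args()

    rclpy.init()
    nodes = [load_node_class(spec)() for spec in args.node_specs]
    executor = MultiThreadedExecutor()
    for node in nodes:
        executor.add_node(node)
    try:
        executor.spin()
    finally:
        for node in nodes:
            node.destroy_node()
        rclpy.shutdown()

if __name__ == '__main__':
    main()
```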

@sloretz (Contributor) commented Feb 27, 2018:

The above XML could be syntactic sugar for a system similar to nodelets. However, I think that means ROS 2 launch would have a dependency on the container process/package, which would have a dependency on rclpy/rclcpp. Third parties implementing different language client APIs may desire a way to be launched into a container process without making ROS 2 launch depend on them. If that's supported, then rclcpp/rclpy should probably use the same API.

@wjwwood (Author, Member) commented Feb 27, 2018:

That's a fair point; in ROS 1 this was worked around because `<node_container_process name="my_container_process">` was more like `<node pkg="nodelet" type="nodelet_manager" name="my_container_process" ...>`, i.e. it was just another node to run. If we remove the syntactic sugar, this would just be another process, and if someone wanted, e.g., a Rust-based container for Rust plugins, then they would just make a new package and a new executable.

A separate concern is what happens if you put a node which is implemented as a Python script within the `<node_container_process>` tag? I suppose it will try to load it and fail since it's not a plugin, but it brings up the point: should we have a way to express that a `<node>` tag refers to a "process", or a dynamically loadable thing, or both?

@jack-oquin commented Feb 28, 2018:

I like the idea of a standardized executable for loading and configuring each supported language.

That assumes we're not expecting to mix C++ and Python nodes in a single process. Whether other languages provide similar support would be up to them.

@wjwwood (Author, Member) commented Mar 10, 2018:

> That assumes we're not expecting to mix C++ and Python nodes in a single process. Whether other languages provide similar support would be up to them.

Yes, for now, this only helps C++, but it would in principle be possible to have an executable which could combine multiple Python nodes into one interpreter, or even run C++ nodes and Python nodes in a manually instantiated Python interpreter within the same process (that doesn't necessarily imply intra-process communication cross language). But I don't plan to do that right now.

The coarse breakdown is like so:

- Calling Conventions for Processes and Various Styles of Nodes
- System Description

@jack-oquin commented Feb 28, 2018:

Is the system description static? Or, will there be ways to launch specific components following dynamic device discovery?

@wjwwood (Author, Member) commented Mar 10, 2018:

That's a good question. What has been consuming my time when working on this document is finding the right partition between the value of a static format that can be edited with tools and introspected without running it, versus a fully dynamic system that can do anything, but is also unpredictable without executing it (something like the halting problem, but not quite). It could be addressed by having a limited, declarative style for the description, but implemented within a more expressive syntax than XML or YAML. There are whole projects on this topic, basically using something like Python or Ruby to make a DSL for a specific topic or domain.

I've collected a lot of information on this subject, but I'm still working out how to present it most clearly and come up with a suggestion on how to proceed before discussing it.

To summarize, I think it's clear we need dynamic behavior (as you said "ways to launch specific components following dynamic device discovery"), but the question for me is "does the static description need to support that?" or "should that use case be solved by using the launch system's API, completely outside of the static description?".
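
One way to picture the "limited, declarative description hosted in a more expressive language" option (every name here is hypothetical, not the `ros2/launch` API):

```python
# Hypothetical sketch: the description stays a plain, introspectable data
# structure, while the hosting language (Python) supplies the dynamic parts.
from collections import namedtuple

Node = namedtuple('Node', ['package', 'executable', 'namespace', 'parameters'])

def generate_description(use_sim_time=False):
    description = [
        Node('demo_nodes_cpp', 'talker', '/robot1', {'use_sim_time': use_sim_time}),
        Node('demo_nodes_cpp', 'listener', '/robot1', {'use_sim_time': use_sim_time}),
    ]
    # Conditional or computed entries live in the host language, but the
    # result is still a flat list of records a tool could statically inspect.
    return description
```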

This is the section where the Python API in `ros2/launch` and the XML file format from ROS 1 overlap.
Basically, how do you describe the system in a way that is flexible, but also verifiable and ideally statically analyzable (at least for something like the XML format).

I have compare-and-contrast style notes for other systems like upstart, systemd, and launchd.

@jack-oquin commented Feb 28, 2018:

Those are good things to compare.

Is there some thought of integrating with your capabilities package at some point? Does that even make sense?

@wjwwood (Author, Member) commented Mar 10, 2018:

> Is there some thought of integrating with your capabilities package at some point? Does that even make sense?

I don't know, I'll have to think about it more, but I think it would be good to mention in this document, if only to better describe how the launch system is different from a system like that.

I think it makes sense to consider it, but I don't know off hand if it makes sense to integrate something like the capabilities server into the launch system, or to instead build it on top.


Missing from this list is the user that should be used to execute the process.
It's possible that it would be necessary, or at least useful, to change the user based on the launch description.
However, it can always be done in a user-written script, and supporting it in our Python implementation in a portable way looks to be difficult.

@jack-oquin commented Feb 28, 2018:

Adding a user name would also raise the complex question of how to authenticate.

I would avoid that here, unless ROS 2 somehow evolves a system-wide security design which depends on it.

@wjwwood (Author, Member) commented Mar 10, 2018:

I tend to agree, but I do think it would be a useful feature at some point. However, I'm concerned it would make launch files less reusable, e.g. you don't want a driver to try to run as "camera_user" when that might be specific to the machine it was originally developed on. In this way, it would need to be used more cautiously and at integration time, like the remote machine functionality.

<div class="alert alert-warning" markdown="1">
RFC:

There are several small decisions here that I made somewhat arbitrarily, e.g. the default time until escalation and the propagation of `SIGKILL` when the launch system receives `SIGTERM`.

@jack-oquin commented Feb 28, 2018:

Sending SIGKILL on receipt of SIGTERM feels a bit abrupt. Clients have no opportunity to catch it and clean up anything. Are you assuming that SIGINT was already used, and the user got impatient?

Doing the SIGTERM; wait 10 sec.; SIGKILL sequence might work better, but would not satisfy impatient users.

@stonier commented Mar 9, 2018:

Per-node parameterisation of those parameters would be nice. We were at times 1) impatient, but also at times 2) requiring more shutdown time to be made available to some processes.

@wjwwood (Author, Member) commented Mar 10, 2018:

> Sending SIGKILL on receipt of SIGTERM feels a bit abrupt.

So to be clear, if you SIGINT the launch system, then processes run by the launch system will get SIGINT too, before being escalated to SIGTERM and then finally SIGKILL.

If you SIGTERM the launch system, it will immediately SIGKILL child processes and try to exit as quickly as possible. So only do that if you're impatient or already tried SIGINT.

If you SIGKILL the launch system, I don't know what happens to the child processes, it might be OS dependent, but it could result in "zombie" processes.

So, given that graduated set of signals the user can send to the launch system, I felt that these were appropriate.

> Clients have no opportunity to catch it and clean up anything.

Well, that's the idea with SIGTERM to the launch system. If you want the child processes to be able to react, then SIGINT the launch system and let it do the normal SIGINT->SIGTERM->SIGKILL escalation.

That all being said, I'm open to suggestions (and rationale) about what the launch system should do instead when receiving SIGTERM.

> Are you assuming that SIGINT was already used, and the user got impatient?

Yes, or that something is wrong in the launch system itself, so I was aiming for an "as simple as possible" SIGTERM signal handler for the launch system, which basically sends SIGKILL to all child processes and then exits itself immediately.

Doing a SIGTERM to child processes, waiting 10 seconds or until all child processes exit, and then sending SIGKILL to any child processes still up might be more complicated and therefore more likely to also hang.

Basically, you'd want to remove any incentive to SIGKILL the launch system by making SIGTERM reliable and quick to exit.

> Doing the SIGTERM; wait 10 sec.; SIGKILL sequence might work better, but would not satisfy impatient users.

That might be better, I honestly don't know. Anyone else want to weigh in on that?


> Per-node parameterisation of those parameters would be nice. We were at times 1) impatient, but also at times 2) requiring more shutdown time to be made available to some processes.

Yeah, that's a good idea, and I was thinking of that myself, though I'll have to double check the document to see if that's called out clearly.

However, one concern is that it might break some expected behavior of the launch system for users. For example, if a user expects the launch system to exit and escalate at set time intervals, but some nodes aren't affected by this and instead have different schedules, it might be very confusing to them why the launch system is suddenly behaving differently or taking so long to shut down. So we'll have to be careful about logging what's happening when extra shutdown constraints are made on a per-process basis.
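
For illustration, a minimal sketch of that escalation with an optional per-process timeout override (POSIX signals and the Python standard library only; the default timeout is a placeholder, not a value decided in this document):

```python
import signal
import time

DEFAULT_ESCALATION_TIMEOUT = 5.0  # placeholder, not a decided default

def shutdown_process(process, escalation_timeout=DEFAULT_ESCALATION_TIMEOUT):
    """Escalate SIGINT -> SIGTERM -> SIGKILL on a subprocess.Popen, waiting between steps."""
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGKILL):
        process.send_signal(sig)
        deadline = time.monotonic() + escalation_timeout
        while time.monotonic() < deadline:
            if process.poll() is not None:
                return process.returncode
            time.sleep(0.1)
    return process.wait()

def sigterm_handler_for_launch(processes):
    """The 'as simple as possible' behavior described above: SIGKILL children, then exit."""
    for process in processes:
        process.send_signal(signal.SIGKILL)
    raise SystemExit(1)
```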

@stonier commented Mar 12, 2018:

Aye, +1 to clear logging on shutdown. I wouldn't be confused if roslaunch printed a table every few seconds during shutdown, reporting which processes remain and when the next escalation is due to occur.

@jack-oquin commented Mar 14, 2018:

> Are you assuming that SIGINT was already used, and the user got impatient?
>
> Yes, or that something is wrong in the launch system itself, so I was aiming for an "as simple as possible" SIGTERM signal handler for the launch system, which basically sends SIGKILL to all child processes and then exits itself immediately.
>
> Basically, you'd want to remove any incentive to SIGKILL the launch system by making SIGTERM reliable and quick to exit.

That seems reasonable to me, and the rationale you give would help other readers.


This is the section where the Python API in `ros2/launch` and the XML file format from ROS 1 overlap.
Basically, how do you describe the system in a way that is flexible, but also verifiable and ideally statically analyzable (at least for something like the XML format).

I have compare-and-contrast style notes for other systems like upstart, systemd, and launchd.

@jack-oquin commented Feb 28, 2018:

Is there any notion of supporting something like the ROS 1 capabilities package?

@stonier commented Mar 12, 2018:

I like this idea also - capabilities can greatly simplify the launch for applications further up the stack, but it can probably go in as an extension, and only then after capabilities have been reworked for ROS 2.

## Requirements

<div class="alert alert-warning" markdown="1">
TODO: Reformat requirements list, possibly combine/reconcile with "separation of concerns section"

@jack-oquin commented Feb 28, 2018:

Normally, requirements would go near the beginning of a design document.



This pattern is used by the "nodelets" in ROS 1.
It's a useful pattern when you're running nodes by hand but you want them to share a process and don't want to write your own program to do that.

There are a few other forms this can take, but the common thread between them is that a process instantiates nodes dynamically based on asynchronous input from external actors (proxies), and that the configuration for those nodes is communicated through something other than command line arguments and environment variables.

@jack-oquin commented Feb 28, 2018:

While this looks useful for porting existing ROS 1 nodelet usages, I prefer the `<node_container_process>` approach for native ROS 2 programming. I don't see the advantage of the proxy approach, but perhaps there are some and they should be mentioned explicitly.

<br/>
</wjwwood's opinion>

This will come up again in the "System Description" section, because I believe the mindset around the "Execution of the System Description" (basically the Python library that runs things) will be "personal responsibility".

@jack-oquin commented Feb 28, 2018:

Is there a general ROS 2 design decision to move away from the ROS 1 concept that each node should be written in a way that is independent of the order in which other nodes start?

Many people question that approach, but experience suggests that it was a good way to write reusable nodes.

If not, then the use case of "wait for A to get to Active then launch B" may not be very important. It plays into a simple-minded system design paradigm that we may not want to encourage.

@jack-oquin commented Feb 28, 2018:

To me, the more important problem is providing recovery, availability and serviceability in the face of node or process failures. That could involve:

- if a node fails, restart it
  - what do other nodes do in the interim? Just wait?
- if a node cannot be restarted (device failure perhaps?), run something else as a fallback
- if everything is FUBAR, capture as much relevant information as possible for off-line analysis

@stonier commented Mar 12, 2018:

+1 for focusing on recovery, availability, and serviceability in the face of post-apocalyptic failures.

One item that would have been very useful is for roslaunch to report out (with relevant stdout/stderr) on launch failures or crashes, either to a watchdog node or a graphical interface to the robot. We used to track a node's existence from watchdog programs, but were not able to extract stdout/stderr.

Perhaps for this, a useful feature would be the ability to provide callbacks for node launch failures/crashes? This could enable a variety of patterns, such as some of those suggested by Jack above. Common use cases could be implemented as extensions later.
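
One hypothetical shape such a hook could take (every name and parameter below is invented for this sketch; nothing like it is specified in the document yet):

```python
# Hypothetical callback signature for "a launched process exited unexpectedly",
# receiving the exit code and the captured stdout/stderr mentioned above.
def on_unexpected_exit(process_description, exit_code, stdout, stderr):
    print('{} exited with code {}'.format(process_description.name, exit_code))
    print(stderr)
    # ...notify a watchdog node, trigger a restart, or start a fallback here.

# launch_description.add_process(
#     executable='my_driver',              # invented API and parameter names
#     on_unexpected_exit=on_unexpected_exit,
# )
```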

@jack-oquin commented Feb 28, 2018:

The current description abstracts the actual syntax of the "system configuration". That's nice at this design stage.

However, I find myself intensely curious whether XML, Python or some other syntax is under consideration.

@gbiggs (Contributor) commented Mar 1, 2018:

Are we reviewing this now? I was holding off until @wjwwood said go for it, but would be happy to start commenting. 😄

@jack-oquin commented Mar 1, 2018:

I am commenting because I find the document and "RFC" sections worth discussing, and probably because I don't know any better. 😄

@wjwwood (Member, Author) commented Mar 1, 2018:

> Are we reviewing this now? I was holding off until @wjwwood said go for it, but would be happy to start commenting. 😄

I don't mind comments now, but it's not "ready" for review yet. I'm trying to push out a completed document as soon as possible, at which point I will solicit feedback actively, first on the pr and then on discourse. But until then, feel free to discuss, though I might not get back to you right away.

@wjwwood (Member, Author) commented Mar 2, 2018:

Good morning :), I pushed a set of changes for the "context" section, where it compares the launch system to ROS 1 and considers what might be different. I'm close, but not quite finished with the system description and event sections, at which point I think it will be in a relatively good place to start discussion on the PR (not quite ready for Discourse still, though).

@jack-oquin I'll try to respond to your comments asap, but I'll have to switch gears a bit before I have time to do that.

@jack-oquin commented Mar 2, 2018:

No hurry on my account.

title: ROS 2 Launch System
permalink: articles/roslaunch.html
abstract:
The launch system in ROS is responsible for helping the user describe the configuration of their system and then execute it as described. The configuration of the system includes what programs to run, what arguments to pass them, and ROS specific conventions which make it easy to reuse components throughout the system by giving them each different configurations. Also, because the launch system is the process (or the set of processes) which executes the user's processes, it is responsible for monitoring the state of the processes it launched, as well as reporting and/or reacting to changes in the state of those processes.

@clalancette (Contributor) commented Mar 9, 2018:

(edit: oh, you talk about this below, sorry; maybe we should add a little something to the abstract anyway?)

I hate to make a complicated subject even more complicated, but what do we think about executing processes across multiple different compute modules? After all, a reasonably complex robot will have multiple processing modules, and hence successful launching of the "system" really involves launching processes on all of the modules. There's also the related, even larger question about launching swarms of robots, but that gets a little pie-in-the-sky.

If it is not our intention to handle either of those cases here, I think that is fine, but maybe we want to explicitly say that it is a non-goal of this, and larger orchestration may come in a later design document?

@wjwwood (Author, Member) commented Mar 9, 2018:

I think it is a goal of the overall system and therefore this design doc (you've noticed that it's mentioned below), but not a feature we have to implement immediately if time doesn't allow, so long as we design for it.

@iluetkeb commented Mar 23, 2018:

At the moment (i.e., roslaunch 1) the system specifies several things:

- what to run
  - including launch-prefixes
- how to run (largely implicitly, but worth mentioning)
- placement (where to run)
  - including remote-node access
- configuration
  - parameters (in several different ways)
  - command-line arguments
  - env-handlers
- data connectivity (remap, group)
- error handling (respawn, required)
- test definition

Arguably, this was not really necessary, but just came about because it was a convenient place to stick all the stuff.

This definitely caused usability issues already in ROS1. For example, I have rewritten countless launch files, instead of just including them, because one of these things was defined in a way that did not match requirements. Also, grouping is problematic -- you could group, but not ungroup. The various ways to set parameters, and the point in time when they apply, is also often unclear.

As you eloquently discuss in the remainder of the document, each of these issues becomes even more complex with ROS2.

Thus, we might want to revisit whether we can break this up, and create a combination of simpler tools.

@stonier commented Mar 14, 2018:

> Configuring the Quality of Service for Connections (reliable, unreliable, ...)

In ROS1 you could only do this in the code itself. It would have been awesome if you could configure this at the launch level. You could rewire connections, but could not reconfigure the kind of connection. That always struck me as odd.

We did, however, manage to get by on workarounds. If it was our code, we parameterised the node to provide QoS configuration. If it wasn't, we got by with relays. Though for the product, we wrote specialised nodes to handle a conglomeration of relays with various capabilities (subsampling, QoS, ...) to minimise the impact of so many extra processes. We even had ghosts on the server to manage the unreliable connections to the robot and present an API to the rest of the server that was always 'up'.

How hard would this be to do for ROS 2's launch system? In scope?

@wjwwood (Member, Author) commented Mar 20, 2018:

> In ROS1 you could only do this in the code itself. It would have been awesome if you could configure this at the launch level. You could rewire connections, but could not reconfigure the kind of connection. That always struck me as odd.

So, DDS does have a way to do this with XML files which can affect the QoS of entities in a program or globally. And we could do something similar.

However, I personally dislike this kind of runtime configuration, because when the person writing the code originally is not the person doing the system integration, changing QoS settings might break expectations.

As an example, a developer might create a node with a publisher which expects certain blocking behavior from the publish call (e.g. non-blocking), but if you change the QoS settings, that might cause the publish call's blocking behavior to change, which breaks the expectation of the original programmer and thus subtly breaks the program itself.

Another example of this might be that a developer creates a program which pre-allocates 10 messages and sets the publish history to keep last and depth 10. If you externally change that to, say, 100, then maybe that breaks the program, because now the history is larger than the resources pre-allocated by the developer.

Remapping of topic names is similar, and so the previous paragraphs might sound hypocritical, but I would argue that the important difference is that changing a topic name doesn't change the behavior of any of the APIs in ways that might cause the program to break (as far as I can imagine).

> We did, however, manage to get by on workarounds. If it was our code, we parameterised the node to provide QoS configuration. If it wasn't, we got by with relays. Though for the product, we wrote specialised nodes to handle a conglomeration of relays with various capabilities (subsampling, QoS, ...) to minimise the impact of so many extra processes.

I actually very much prefer this approach of having developer-defined configurations which system integrators can tweak. By only allowing changes to settings the developer exposes, the developer can assume the others are not changing under their feet (so it avoids the subtle changes in behavior when QoS changes), and for the ones they do expose, they can also put constraints on them, e.g. if the developer wanted to make the depth configurable, then they could expose a ROS parameter for that, and then constrain it to be within the range 10-100 (sketched below).
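
A small rclpy sketch of that pattern (illustrative only; the parameter API shown here may postdate the rclpy that existed at the time of this discussion):

```python
import rclpy
from rclpy.node import Node
from rclpy.qos import QoSProfile
from std_msgs.msg import String

class ConfigurableDepthTalker(Node):
    def __init__(self):
        super().__init__('talker')
        # The developer exposes only the history depth, and constrains it to 10-100,
        # so the other QoS assumptions baked into the code remain valid.
        depth = self.declare_parameter('history_depth', 10).value
        depth = max(10, min(100, int(depth)))
        self._pub = self.create_publisher(String, 'chatter', QoSProfile(depth=depth))

def main():
    rclpy.init()
    rclpy.spin(ConfigurableDepthTalker())
    rclpy.shutdown()
```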

As for relays which rename, throttle, or downsample, I think having these be reusable nodes which can be added to the process of other user-defined nodes is a better approach. In ROS 1 you had to have a separate process for each, and that caused a lot of overhead.

So that's a bit opinionated from me about the way systems should be constructed, but I'm open to being convinced that runtime configuration of QoS is a good idea.

> How hard would this be to do for ROS 2's launch system? In scope?

It would be hard as described because we have no way to control this right now in the C++/Python API. Like if a developer hard codes the history depth to 10 with keep last, then we have no way to intercept that and change it as it is being created. We could add that, it just isn't planned as far as I know.

I do think this is out of scope for the launch system, because the only impact it would have on the launch system is whether or not it's represented in the launch description and, if so, how it is communicated to the process containing the node. So I'd mention it here after we decide to support that use case, but I'd put it in the same bucket as static remapping and parameters, which are mentioned here but are not part of the actual launch system itself.

@stonier commented Mar 20, 2018:

Going to take a step back and say I completely agree with you. That feature idea was from a few years ago, yet when working on the product we aggressively found ways to very carefully manage robot-server connections using a variety of methods, only some of which I have mentioned above. If I think about where we might apply roslaunch QoS configuration, it would be for almost trivial situations and not where the critical problems lie.

These machines all have SSH, which is the mechanism specifically called out to be used when launching processes on remote machines.
It also played a role in defining what you specify, and how, when configuring `roslaunch` in ROS 1 to launch processes on remote machines.

In ROS 2, Windows has been added to the list of targeted platforms, and as of the writing of this document it does not support SSH natively.

@josephduchesne commented Sep 30, 2018:

Would requiring the execution of a launch agent process on each additional available machine be a good solution for this? The agent could register its hostname, IP, etc. (via a normal ROS topic maybe?), which could then be used by the primary launch process to delegate launches on other machines.

Would people be OK with each additional computer requiring some sort of ros2 launch client running? There could be some system that allows waiting for worker clients and handling events when an expected 2nd(+) computer fails to be present (e.g. timeout, connection loss).

Another option would be to allow launching the remote launch agent remotely (like Jenkins does with its agents). That way we could have a ssh launched worker on linux/mac, psexec launched processes on Windows. This does raise further questions about cross platform support.

If the default behavior is that the agent is launched somehow (daemon, cron, service, manually, etc.) on each secondary machine, cross-platform behavior would be identical. This would be portable, but a departure from the convenience of ROS 1's ability to launch everything at once (despite the shortcomings of that specific implementation in error handling and recovery).

If there is a system to automatically launch such agents from the primary machine via an OS specific handler, things might get very complicated to implement and test (windows->linux launches for example).

@hidmic referenced this pull request on Jan 14, 2019: XML/YAML Front-end #163 (Open)

Proposal for launching dynamically composable nodes (#206)
* Proposal for dynamically composed nodes

* allow multiple extra_arguments

* Allow node_name and namespace to be empty

* Human readable error message

* Update articles/150_roslaunch.md

Co-Authored-By: sloretz <shane.loretz@gmail.com>

* Assign nodes unique ids, but still forbid duplicates

* Update articles/150_roslaunch.md

Co-Authored-By: sloretz <shane.loretz@gmail.com>

* Update articles/150_roslaunch.md

Co-Authored-By: sloretz <shane.loretz@gmail.com>

* Section to list

* More generic wording about container processes

* namespace -> node_namespace

* _launch/ -> ~/_container/

Signed-off-by: Shane Loretz <sloretz@osrfoundation.org>