
ROS 2 Launch System #163

Open. Wants to merge 12 commits into base: gh-pages.
more work in progress

wjwwood committed Feb 16, 2018
commit cd833dde960785d8b3ba03edc4095a6615a287df
@@ -31,24 +31,26 @@ The launch system can be considered in parts, separated by concern.
The coarse breakdown is like so:

- Calling Conventions for Processes and Various Styles of Nodes
- System Description
- Execution of the System Description
- System Description and Static Analysis
- Execution and Verification of the System Description
- Reporting System for Events
- Testing

The purpose of the following sections is to enumerate what the launch system could interact with or do; they are not the requirements list for the launch system in ROS 2.
The requirements for the launch system will be enumerated in a section below based on what's possible in these sections.

## Calling Conventions

In order for the launch system to execute a described system, it needs to understand how it can achieve the description.
This is an existing phrase in Computer Science[^calling_convention_wikipedia], but this section is not talking specifically about the compiler-defined calling convention; rather, it is appropriating the term to describe a similar relationship.
In this case, the phrase "calling conventions" is meant to describe the "interface" or "contract" the launch system has with the entities it is executing and monitoring.

@sloretz (Contributor), Feb 27, 2018:

Might be a good idea to define what an entity is before this section. I assume entities means the types of nodes: a process with one node, a process manually composed with multiple nodes, a library containing a dynamically composable node, or all of these but with a lifecycle node.

@wjwwood (Author, Member), Feb 27, 2018:

I was hoping this sentence would imply "entities are anything that the launch system executes and/or monitors", rather than me specifically defining entities. I used "entities" because "things" sounded unsophisticated.

Perhaps I could reword it to be "... the launch system has with entities, which are anything it executes or monitors."? If not, do you have a suggestion as how to reword it?

@sloretz (Contributor), Feb 27, 2018:

Oops I assumed entity was going to be a special term. If it's not going to be used later on then I suggest shortening to "... the launch system has with anything it executes or monitors.".

This contract covers initial execution, activity during runtime, signal handling and behavior of the launch system, and shutdown.

### Operating System Processes

The most basic version of these entities, and the foundation for the other entities, are operating system processes.

#### Execution

For these, the launch system needs to know how to execute them, and to do that it needs:

@@ -68,7 +70,7 @@ However, it can always be done in a user written script and supporting it in our

With this information the launch system can execute any arbitrary operating system process on the local machine.
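Taken together, those inputs map directly onto a conventional process-spawning API. Here is a minimal sketch in Python; the function name and its defaults are illustrative only, not the actual launch API:

```python
import os
import subprocess

def execute_process(cmd, args=(), cwd=None, env=None):
    """Spawn an arbitrary OS process, as a launch system might.

    cmd:  name or path of the executable (resolved via PATH)
    args: command line arguments
    cwd:  working directory for the child
    env:  extra environment variables (merged over the parent's)
    """
    full_env = dict(os.environ, **(env or {}))
    return subprocess.Popen([cmd, *args], cwd=cwd, env=full_env)

proc = execute_process("echo", ["hello"], env={"MY_VAR": "1"})
proc.wait()
print(proc.returncode)  # 0 on success
```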

#### Runtime

During runtime, the launch system may monitor each operating system process's:

@@ -86,7 +88,7 @@ In addition, the launch system may interact with, or allow the user to interact

Regardless of how the user uses the launch system to interact with these items, they should be exposed by the launch system, which is the only entity that can interact with them directly.

#### Termination

If the operating system process terminates, and therefore returns a return code, the launch system will report this event and it can be handled in a user defined way.
Termination covers expected termination (e.g. returning from `main()` or calling `exit()`) and unexpected termination (e.g. the abort trap, a segmentation fault, or a bus error).
@@ -97,6 +99,32 @@ Historically, ROS 1's `roslaunch` allowed a few common exit handling cases:
- `respawn=true`: if this process exits (for any reason), restart it with the same settings as at startup
- `respawn_delay=N`: if restarting it, delay a number of seconds between attempts
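These exit handling policies amount to a small supervision loop around the child process. A hypothetical sketch of the `respawn`/`respawn_delay` behavior follows; the `max_respawns` cap is added here only to keep the example finite and is not a `roslaunch` feature:

```python
import subprocess
import sys
import time

def supervise(cmd, respawn=False, respawn_delay=0.0, max_respawns=3):
    """Run cmd; if respawn is set, restart it after each exit."""
    attempts = 0
    while True:
        code = subprocess.call(cmd)
        if not respawn or attempts >= max_respawns:
            return code
        attempts += 1
        time.sleep(respawn_delay)  # the respawn_delay=N behavior

# A child that always exits non-zero, restarted a couple of times:
code = supervise([sys.executable, "-c", "raise SystemExit(1)"],
                 respawn=True, respawn_delay=0.1, max_respawns=2)
print(code)  # 1
```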

The launch system may initiate the termination of an operating system process.
This starts with the signaling of `SIGINT` on the child process.
If this does not result in the termination of the process, then one of a few things can happen based on the configuration of the launch system:

- after a period of time, signal `SIGTERM`
- after a period of time, signal `SIGKILL`
- nothing

By default, the launch system will:

- send `SIGINT`
- after 10 seconds, send `SIGTERM`
- after 10 additional seconds, send `SIGKILL`

The latter two steps can be skipped, or the time until escalation can be adjusted, on a per process basis.

The launch system will initiate this process when an event (built-in or user generated) initiates shutdown, e.g. when a process with the equivalent of the `required=true` exit handler terminates, or when the launch system itself receives the `SIGINT` signal.

If the launch system itself receives the `SIGTERM` signal it will send the `SIGKILL` signal to all child processes and exit immediately.
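The default escalation can be sketched as a small helper; the function name is hypothetical and the timeouts are shortened below for illustration:

```python
import signal
import subprocess
import sys
import time

def escalate_shutdown(proc, escalation_timeout=10.0):
    """Send SIGINT, then SIGTERM, then SIGKILL, waiting between steps.

    Mirrors the default described above; per the text, the wait time is
    configurable and the latter two steps could be skipped per process.
    POSIX only.
    """
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGKILL):
        proc.send_signal(sig)
        try:
            proc.wait(timeout=escalation_timeout)
            return proc.returncode
        except subprocess.TimeoutExpired:
            continue  # escalate to the next signal
    return proc.wait()

# A child that just sleeps; SIGINT alone terminates it.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
time.sleep(0.5)  # give the child time to start
code = escalate_shutdown(child, escalation_timeout=2.0)
print(code is not None)  # True: the child is gone
```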

<div class="alert alert-warning" markdown="1">
RFC:

There are several small decisions here that I made somewhat arbitrarily, e.g. the default time until escalation and the propagation of `SIGKILL` when the launch system receives `SIGTERM`.

@jack-oquin, Feb 28, 2018:

Sending SIGKILL on receipt of SIGTERM feels a bit abrupt. Clients have no opportunity to catch it and clean up anything. Are you assuming that SIGINT was already used, and the user got impatient?

Doing the SIGTERM; wait 10 sec.; SIGKILL sequence might work better, but would not satisfy impatient users.

@stonier, Mar 9, 2018:

Per node parameterisation of those parameters would be nice. We were at times 1) impatient, but also at times 2) requiring more shutdown time to be made available to some processes

@wjwwood (Author, Member), Mar 10, 2018:

> Sending SIGKILL on receipt of SIGTERM feels a bit abrupt.

So to be clear, if you SIGINT the launch system, then processes run by the launch system will get SIGINT too, before being escalated to SIGTERM and then finally SIGKILL.

If you SIGTERM the launch system, it will immediately SIGKILL child processes and try to exit as quickly as possible. So only do that if you're impatient or already tried SIGINT.

If you SIGKILL the launch system, I don't know what happens to the child processes, it might be OS dependent, but it could result in "zombie" processes.

So, given that gradation of signals the user can send the launch system, I felt that these were appropriate.

> Clients have no opportunity to catch it and clean up anything.

Well, that's the idea with SIGTERM to the launch system. If you want the child processes to be able to react, then SIGINT the launch system and let it do the normal SIGINT->SIGTERM->SIGKILL escalation.

That all being said, I'm open to suggestions (and rationale) about what the launch system should do instead when receiving SIGTERM.

> Are you assuming that SIGINT was already used, and the user got impatient?

Yes, or that something is wrong in the launch system itself, so I was aiming for an "as simple as possible" SIGTERM signal handler for the launch system, which basically sends SIGKILL to all child processes and then exits itself immediately.

Doing a SIGTERM to child processes, waiting 10 seconds or until all child processes exit, and then sending SIGKILL to any child processes still up might be more complicated and therefore more likely to also hang.

Basically, you'd want to remove any incentive to SIGKILL the launch system by making SIGTERM reliable and quick to exit.

> Doing the SIGTERM; wait 10 sec.; SIGKILL sequence might work better, but would not satisfy impatient users.

That might be better, I honestly don't know. Anyone else want to weigh in on that?


> Per node parameterisation of those parameters would be nice. We were at times 1) impatient, but also at times 2) requiring more shutdown time to be made available to some processes

Yeah, that's a good idea, and I was thinking of that myself, though I'll have to double check the document to see if that's called out clearly.

However, one concern is that it might break some expected behavior of the launch system for users. For example, if a user expects the launch system to exit and escalate at set time intervals, but some nodes aren't affected by this and instead have different schedules, it might be very confusing to them why the launch system is suddenly behaving differently or taking so long to shutdown. So we'll have to be careful about logging what's happening when extra shutdown constraints are made on a per process basis.

@stonier, Mar 12, 2018:

Aye, +1 to clear logging on shutdown. I'd be unconfused if roslaunch printed a table every few seconds on shutdown reporting out on what processes remain and when the next escalation is due to occur.

@jack-oquin, Mar 14, 2018:

> Are you assuming that SIGINT was already used, and the user got impatient?

> Yes, or that something is wrong in the launch system itself, so I was aiming for an "as simple as possible" SIGTERM signal handler for the launch system, which basically sends SIGKILL to all child processes and then exits itself immediately.

> Basically, you'd want to remove any incentive to SIGKILL the launch system by making SIGTERM reliable and quick to exit.

That seems reasonable to me, and the rationale you give would help other readers.

</div>

#### Shell Evaluation

A special case of operating system processes, shell evaluation would simply be passing shell script code as an argument to the default system shell.
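In Python terms, shell evaluation is the difference between exec'ing an argument vector and handing a single string to the system shell. A minimal sketch:

```python
import subprocess

# Shell evaluation: the command is a single string interpreted by the
# default system shell (/bin/sh on POSIX), so shell features such as
# variable expansion and pipes are available to the launched code.
result = subprocess.run("echo $HOME | wc -l", shell=True,
                        capture_output=True, text=True)
print(result.stdout.strip())  # "1": echo produced exactly one line
```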
@@ -110,23 +138,79 @@ An in between entity is an operating system process which uses shell evaluation
TODO: Describe this as a "mix-in" which can convert anything based on an "operating system process" into a remote one by adding ssh/user/setup/etc... information needed to do so, see how ROS 1's roslaunch does it.
</div>

### ROS Nodes

Any operating system process can become ROS specific by having at least one ROS Node within it.
Having one or more "plain" ROS nodes in a process doesn't add new ways to get information into or out of the operating system process that contains them.
It does, however, add some specific kinds of inputs during execution, and it can also affect how the process reacts to signals.

#### Execution

In addition to the "Execution" subsection of the "Operating System Processes" section, processes with ROS Nodes in them may need to consider additional elements, like:

- Package name + executable name rather than executable name + PATH (i.e. `ros2 run` equivalent)
- ROS specific environment variables (e.g. `ROS_DOMAIN_ID`, `RMW_IMPLEMENTATION`, console output formatting, etc...)
- ROS specific command line arguments
- Varies for single Node processes and multi-Node processes
- Change node name or namespace
- Remap topics, services, actions, parameters, etc...
- Initialize parameter values

The specific syntax of these extra environment variables and command line arguments is defined in other documents[^logging_wiki] [^static_remapping].

In each of these cases, these are simply ROS-specific ways to extend or add to the existing mechanisms described by the equivalent section for "operating system processes"; i.e. there is no new ROS-specific way to pass more information into the initial execution of the process.
Here is also the first opportunity for the launch system to take ROS specific declarations, e.g. "remap 'image' to 'left/image'", and convert them implicitly into terms that a normal operating system process can consume like environment variables or command line arguments, e.g. adding `image:=left/image` to the command line arguments.
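The implicit lowering described above could look like the following hypothetical helper. The `name:=new_name`, `__node`, and `__ns` forms come from the static remapping design doc[^static_remapping]; the parameter syntax shown is only a placeholder, and the function name is illustrative:

```python
def ros_arguments(node_name=None, namespace=None, remappings=None,
                  parameters=None):
    """Lower ROS-specific launch declarations to command line arguments.

    Illustrative only; see the static remapping design doc for the
    authoritative remapping syntax.
    """
    args = []
    if node_name:
        args.append("__node:=" + node_name)      # change node name
    if namespace:
        args.append("__ns:=" + namespace)        # change namespace
    for src, dst in (remappings or {}).items():
        args.append(f"{src}:={dst}")             # e.g. image:=left/image
    for key, value in (parameters or {}).items():
        args.append(f"_{key}:={value}")          # placeholder param syntax
    return args

print(ros_arguments(namespace="/left", remappings={"image": "left/image"}))
```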

#### Runtime

During runtime, a "plain" ROS node doesn't expose anything beyond what an operating system process does.

It also does not react to any additional signals, though processes containing ROS nodes tend to install a signal handler for `SIGINT` that performs a more graceful shutdown; this is not enforced.
Sending the `SIGINT` signal typically causes most nodes to shut down if they are monitoring `rclcpp::ok()`, as recommended.

#### Termination

Termination of a ROS Node (the node, not the process) is not externally observable beyond what is observed with an operating system process (the return code).

### Managed ROS Nodes

For ROS nodes that have a lifecycle, a.k.a. Managed ROS Nodes[^lifecycle], each node will have additional runtime state, which the launch system could access and either utilize, pass through to the event system, or aggregate before passing it through the event system.

Building yet again on previous entities, Managed ROS Nodes inherit all of the execution, runtime, and termination characteristics of normal ROS nodes, and therefore of operating system processes where applicable.

#### Execution

Managed ROS Nodes do not add any additional inputs or specific configurations at execution time on top of what "plain" ROS nodes add, at least not at this time.
In the future this might change, so refer to the design doc[^lifecycle] or future documentation on the subject.

#### Runtime

During runtime, a Managed ROS node emits events anytime the state of the node changes.
This is at least emitted on a topic, but could also be captured, aggregated, and/or communicated in other ways too.

#### Termination

Managed ROS Nodes do have some additional observable effects when terminating (again, the node, not necessarily the process containing it).
A managed node enters the `Finalized` state after passing through the `ShuttingDown` transition state on termination.
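The observable behavior described here, an event on every state change ending in `Finalized` via `ShuttingDown`, can be modeled with a small observer sketch. This is pure illustration; the real mechanism is the lifecycle event system and its topic, per the managed nodes design doc:

```python
class ManagedNodeModel:
    """Toy model of a managed node that notifies observers on every
    lifecycle state change (state names from the lifecycle design doc)."""

    def __init__(self):
        self.state = "Unconfigured"
        self._observers = []

    def on_transition(self, callback):
        self._observers.append(callback)

    def _transition(self, *states):
        for state in states:
            self.state = state
            for callback in self._observers:
                callback(state)  # the "event" a launch system could consume

    def shutdown(self):
        # Termination passes through ShuttingDown and ends in Finalized.
        self._transition("ShuttingDown", "Finalized")

events = []
node = ManagedNodeModel()
node.on_transition(events.append)
node.shutdown()
print(events)  # ['ShuttingDown', 'Finalized']
```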

@sloretz (Contributor), Feb 27, 2018:

During termination will ROS 2 launch try to transition to the Finalized state, or just send signals and hope handling code in the node takes care of it?

@wjwwood (Author, Member), Feb 27, 2018:

A good question, to which I do not know the answer. My gut reaction would be to say it sends a signal and the launch system tries not to send state transition requests itself, unless directed to do so by the user. So maybe, defaults to signals, the user can ask the launch system to transition first, then signal. I'll mention it and add an RFC block about the topic.

@jack-oquin, Feb 28, 2018:

I wonder if these things should be recursively nestable, with the roslaunch itself serving as a managed ROS node?

@wjwwood (Author, Member), Mar 10, 2018:

It's possible, and I do expect the launch system will be a Python ROS node itself, though it might not be a managed one to start with, partially because there's no lifecycle in Python atm and because it's simpler to start without that.

But I can imagine calling the launch system from the launch system in a nesting fashion. This just has to be balanced against including other launch files instead; I think that might be more transparent than calling the launch system's executable multiple times.

@wjwwood (Author, Member), Mar 10, 2018:

For instance, in ROS 1, the capability server is often called by roslaunch and itself in turn calls roslaunch multiple times. If I were to reimplement that in ROS 2, I would hope to instead have the capability server ask an existing launch system to add something dynamically, rather than starting a new instance. That is if the capability server is not an instance of the launch system itself or rolled into it entirely.

Since these are state transitions, they are observable via the lifecycle event system, at least through the ROS topic `lifecycle_state` (subject to change; always reference the managed nodes design document[^lifecycle]).

### Process with a Single Node

This was likely the most commonly used type of entity launched in ROS 1, as you could only have one node per process in ROS 1.
In ROS 2, this will likely be less common, but will still be used quite a bit in the form of quickly developed scripts and drivers or GUI tools which might require control over the main thread.


### Process with Multiple Nodes

### Dynamically loaded Nodes

#### By Configuration

#### By Proxy

## System Description

<div class="alert alert-warning" markdown="1">
TODO: Restructure notes on this and put them here.
@@ -139,7 +223,7 @@ Basically how do you describe the system in a way that is flexible, but also ver
I have compare-and-contrast style notes for other systems like upstart, systemd, and launchd.

@jack-oquin, Feb 28, 2018:

Those are good things to compare.

Is there some thought of integrating with your capabilities package at some point? Does that even make sense?

@wjwwood (Author, Member), Mar 10, 2018:

> Is there some thought of integrating with your capabilities package at some point? Does that even make sense?

I don't know, I'll have to think about it more, but I think it would be good to mention in this document, if only to better describe how the launch system is different from a system like that.

I think it makes sense to consider it, but I don't know off hand if it makes sense to integrate something like the capabilities server into the launch system, or to instead build it on top.

</div>

## Event System

<div class="alert alert-warning" markdown="1">
TODO: Restructure notes on this and put them here.
@@ -168,3 +252,10 @@ TODO: This will outline what we have and what we need to build and how it should
<div class="alert alert-warning" markdown="1">
TODO: Anything we choose not to support in the requirements vs. the "separation of concern section", and also any alternatives which we considered but rejected in the reference implementation proposal.
</div>

## References

[^calling_convention_wikipedia]: [https://en.wikipedia.org/wiki/Calling_convention](https://en.wikipedia.org/wiki/Calling_convention)
[^logging_wiki]: [https://github.com/ros2/ros2/wiki/Logging#console-output-configuration](https://github.com/ros2/ros2/wiki/Logging#console-output-configuration)
[^static_remapping]: [http://design.ros2.org/articles/static_remapping.html#remapping-rule-syntax](http://design.ros2.org/articles/static_remapping.html#remapping-rule-syntax)
[^lifecycle]: [http://design.ros2.org/articles/node_lifecycle.html](http://design.ros2.org/articles/node_lifecycle.html)