
doc/devel: detail execution flow between stages
s-urbaniak committed Jul 19, 2016
1 parent 64e931a commit c33b220
Showing 7 changed files with 209 additions and 10 deletions.
81 changes: 71 additions & 10 deletions Documentation/devel/architecture.md
@@ -13,8 +13,22 @@ Facilities like file-locking are used to ensure co-operation and mutual exclusion

Execution with rkt is divided into several distinct stages.

_**NB** The goal is for the ABI between stages to be relatively fixed, but while rkt is still under heavy development this is still evolving.
Until https://github.com/coreos/rkt/issues/572 is resolved, this should be considered in flux and the description below may not be authoritative._
**NB** The goal is for the ABI between stages to be relatively fixed, but while rkt is still under heavy development this is still evolving.

After invoking `rkt`, the execution chain follows the numbering of the stages, in the following general order:

![execution-flow](execution-flow.png)

1. invoking process -> stage0:
The invoking process uses its own mechanism to invoke the rkt binary (stage0). When started via a regular shell or a supervisor, stage0 is usually forked and exec'ed, becoming a child process of the invoking shell or supervisor.

2. stage0 -> stage1:
An ordinary `exec(3)` is used to replace the stage0 process with the stage1 entrypoint. The entrypoint is referenced by the `coreos.com/rkt/stage1/run` annotation in the stage1 image manifest (a sketch follows below this list).

3. stage1 -> stage2:
The stage1 entrypoint uses its own mechanism to invoke the stage2 app executables. The app executables are referenced by the `apps.app.exec` settings in the stage2 image manifest.

The details of the execution flow vary across the different stage1 implementations.
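
The following is a minimal sketch of the stage0 -> stage1 handoff from step 2 above. It only illustrates that `exec(3)` replaces the calling process instead of forking a new one; the entrypoint path is a made-up placeholder, not rkt's real layout:

```
// Minimal sketch of the stage0 -> stage1 handoff: exec(3) replaces the
// current process image instead of forking, so stage1 keeps the PID the
// invoking shell or supervisor already knows. The entrypoint path below
// is a placeholder, not rkt's real layout.
package main

import (
	"log"
	"os"
	"syscall"
)

func main() {
	// In rkt the path is resolved from the "coreos.com/rkt/stage1/run"
	// annotation in the stage1 image manifest; here it is hard-coded.
	entrypoint := "/var/lib/rkt/pods/run/<uuid>/stage1/rootfs/init"

	// On success, Exec never returns.
	if err := syscall.Exec(entrypoint, []string{entrypoint}, os.Environ()); err != nil {
		log.Fatalf("exec %s: %v", entrypoint, err)
	}
}
```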

### Stage 0

@@ -60,11 +74,27 @@ At this point the stage0 execs `/stage1/rootfs/init` with the current working directory

### Stage 1

The next stage is a binary that the user trusts to set up cgroups, execute processes, and perform other operations as root on the host.
This stage has the responsibility of taking the pod filesystem that was created by stage0 and creating the necessary cgroups, namespaces and mounts to launch the pod.
The next stage is a binary that the user trusts; it has the responsibility of taking the pod filesystem that was created by stage0 and creating the necessary container isolation, network, and mounts to launch the pod.
Specifically, it must:

- Read the Image and Pod Manifests. The Image Manifest defines the default `exec` specifications of each application; the Pod Manifest defines the ordering of the units, as well as any overrides (see the sketch after this list).
- Set up/execute the actual isolation environment for the target pod, called the "stage1 flavor". Currently there are three flavors implemented:
  - fly: a simple chroot-only environment.
  - systemd/nspawn: a cgroup/namespace-based isolation environment using systemd and systemd-nspawn.
  - kvm: a fully isolated kvm environment.
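
Below is a rough sketch, not rkt's actual code, of reading the two manifests mentioned in the first item. It uses ad-hoc structs covering only the fields discussed in this document (the `coreos.com/rkt/stage1/run` annotation and each app's `exec` line); the manifest paths are assumptions:

```
// Rough sketch: reading the stage1 image manifest and the pod manifest
// with ad-hoc structs for just the fields discussed in this document.
// The manifest locations used here are assumptions.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type annotation struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

type imageManifest struct {
	Annotations []annotation `json:"annotations"`
}

type podApp struct {
	Name string `json:"name"`
	App  struct {
		Exec []string `json:"exec"`
	} `json:"app"`
}

type podManifest struct {
	Apps []podApp `json:"apps"`
}

func main() {
	// The stage1 entrypoint is referenced by an annotation in the
	// stage1 image manifest.
	raw, err := os.ReadFile("stage1/manifest") // assumed location
	if err != nil {
		panic(err)
	}
	var im imageManifest
	if err := json.Unmarshal(raw, &im); err != nil {
		panic(err)
	}
	for _, a := range im.Annotations {
		if a.Name == "coreos.com/rkt/stage1/run" {
			fmt.Println("stage1 entrypoint:", a.Value)
		}
	}

	// The pod manifest carries each app's (possibly overridden) exec line.
	raw, err = os.ReadFile("pod") // assumed location
	if err != nil {
		panic(err)
	}
	var pm podManifest
	if err := json.Unmarshal(raw, &pm); err != nil {
		panic(err)
	}
	for _, app := range pm.Apps {
		fmt.Printf("app %s exec: %v\n", app.Name, app.App.Exec)
	}
}
```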

### Stage 2

The final stage, stage2, is the actual environment in which the applications run, as launched by stage1.

## Flavors
### systemd/nspawn flavors

The "host", "src", and "coreos" flavors (referenced to as systemd/nspawn flavors) use `systemd-nspawn`, and `systemd` to set up the execution chain.
They include a very minimal systemd that takes care of launching the apps in each pod, apply per-app resource isolators and makes sure the apps finish in an orderly manner.

These flavors will:
- Read the image and pod manifests
- Generate systemd unit files from those manifests (see the sketch below)
- Create and enter the network namespace if rkt is not started with `--net=host`
- Start systemd-nspawn (which takes care of the following steps)
@@ -74,10 +104,6 @@ Specifically, it must:

This process is slightly different for the qemu-kvm stage1, but it follows a similar workflow, starting with `exec()`'ing kvm instead of systemd-nspawn.
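
As a hedged illustration of the "generate systemd unit files" step, the sketch below writes one minimal service unit per app from its exec line. The unit directory, naming scheme, and option set are assumptions; a real generator emits many more directives (isolators, dependencies, and so on):

```
// Hedged sketch of "generate systemd unit files from those manifests":
// one minimal service unit per app, built from its exec line. Paths,
// unit naming and the option set are assumptions.
package main

import (
	"fmt"
	"os"
	"strings"
)

func writeAppUnit(unitDir, appName string, exec []string) error {
	unit := fmt.Sprintf(`[Unit]
Description=Application %s

[Service]
ExecStart=%s
Restart=no
`, appName, strings.Join(exec, " "))

	return os.WriteFile(fmt.Sprintf("%s/%s.service", unitDir, appName), []byte(unit), 0644)
}

func main() {
	// Assumed unit directory inside the stage1 rootfs.
	err := writeAppUnit("stage1/rootfs/usr/lib/systemd/system", "nginx",
		[]string{"/usr/sbin/nginx", "-g", "daemon off;"})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```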

### Stage 1 systemd Architecture

rkt's stage1 includes a very minimal systemd that takes care of launching the apps in each pod, applying per-app resource isolators, and making sure the apps finish in an orderly manner.

We will now detail how the starting, shutdown, and exit status collection of the apps in a pod are implemented internally.

![rkt-systemd](rkt-systemd.png)
Expand All @@ -99,6 +125,41 @@ In this case, the failed app's exit status will get propagated to rkt.
A [*Conflicts*](http://www.freedesktop.org/software/systemd/man/systemd.unit.html#Conflicts=) dependency was also added between each reaper service and the halt and poweroff targets (they are triggered when the pod is stopped from the outside when rkt receives `SIGINT`).
This will activate all the reaper services when one of the targets is activated, causing the exit statuses to be saved and the pod to finish like it was described in the previous paragraph.
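
For illustration only, a unit along the lines of the reaper services described above could look like the output of the sketch below. The unit layout, `Type=`, and the `/reaper.sh` helper are assumptions; the `Conflicts=` relationship with the halt and poweroff targets is the part taken from the text above:

```
// Illustration only: roughly what one per-app "reaper" unit might look
// like. Unit layout, Type=, and the /reaper.sh helper are assumptions;
// the Conflicts= relationship with the halt and poweroff targets is
// described in the documentation above.
package main

import "fmt"

func reaperUnit(appName string) string {
	return fmt.Sprintf(`[Unit]
Description=%s exit-status reaper
# Stopping the pod activates this unit via Conflicts=, so the app's exit
# status is recorded before the pod shuts down.
Conflicts=halt.target poweroff.target

[Service]
Type=oneshot
# Hypothetical helper that saves the app's exit status for stage0.
ExecStart=/reaper.sh %s
`, appName, appName)
}

func main() {
	fmt.Print(reaperUnit("nginx"))
}
```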

### Stage 2

The final stage, stage2, is the actual environment in which the applications run, as launched by stage1.

We will now detail the execution chain for the stage1 systemd/nspawn flavors. The entrypoint is implemented in the `stage1/init/init.go` binary and sets up the following execution chain:

1. "ld-linux-*.so.*": Depending on the architecture, the appropriate loader helper in the stage1 rootfs is invoked using "exec". This makes sure that subsequent binaries load shared libraries from the stage1 rootfs and not from the host file system.

2. "systemd-nspawn": Used for starting the actual container. systemd-nspawn registers the started container in "systemd-machined" on the host, if available. It is parametrized with the `--boot` option to instruct it to "fork+exec" systemd as the supervisor in the started container.

3. "systemd": Used as the supervisor in the started container. Similar as on a regular host system, it uses "fork+exec" to execute the child app processes.

The following diagram illustrates the execution chain:

![execution-flow-systemd](execution-flow-systemd.png)

The resulting process tree reveals the parent-child relationships. Note that processes which only "exec" (without forking) do not appear as separate entries in the tree:

```
$ ps auxf
...
\_ -bash
\_ stage1/rootfs/usr/lib/ld-linux-x86-64.so.2 stage1/rootfs/usr/bin/systemd-nspawn
\_ /usr/lib/systemd/systemd
\_ /usr/lib/systemd/systemd-journald
\_ nginx
```
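
The sketch below, with assumed paths and a trimmed-down flag set, shows how steps 1 and 2 above fit together: the stage1 entrypoint exec(3)s the dynamic loader from the stage1 rootfs, which in turn loads systemd-nspawn with `--boot`:

```
// Sketch, with assumed paths and a trimmed-down flag set, of steps 1-2:
// exec(3) the dynamic loader from the stage1 rootfs so systemd-nspawn
// and its libraries come from stage1 rather than the host, and pass
// --boot so nspawn fork+execs systemd as the container's supervisor.
package main

import (
	"log"
	"os"
	"syscall"
)

func main() {
	ld := "stage1/rootfs/usr/lib/ld-linux-x86-64.so.2" // arch-specific, assumed path
	nspawn := "stage1/rootfs/usr/bin/systemd-nspawn"   // assumed path

	args := []string{
		ld,
		"--library-path", "stage1/rootfs/usr/lib", // resolve libraries inside stage1
		nspawn,
		"--boot",                    // nspawn boots systemd as the in-container supervisor
		"--directory=stage1/rootfs", // illustrative; the real invocation carries many more flags
	}
	// exec(3): the stage1 entrypoint is replaced by the loader/nspawn chain.
	if err := syscall.Exec(ld, args, os.Environ()); err != nil {
		log.Fatalf("exec %s: %v", ld, err)
	}
}
```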

### fly flavor

The "fly" flavor uses a very simple mechanism being limited to only execute one child app process. The entrypoint is implemented in `stage1_fly/run/main.go`. After setting up a chroot'ed environment it simply exec's the target app without any further internal supervision:

![execution-flow-fly](execution-flow-fly.png)

The resulting example process tree shows the target process as a direct child of the invoking process:

```
$ ps auxf
...
\_ -bash
\_ nginx
```
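
A minimal sketch of this fly-style launch follows, with an assumed rootfs layout and app command; the point is that chroot(2) plus exec(3) leaves the app as a direct child of the invoking process, matching the tree above:

```
// Minimal sketch of the fly-style launch: chroot(2) into the prepared
// rootfs, then exec(3) the target app with no further supervision, so
// the app stays a direct child of whatever invoked rkt. The rootfs
// layout and the app command are assumptions.
package main

import (
	"log"
	"os"
	"syscall"
)

func main() {
	rootfs := "stage1/rootfs/opt/stage2/nginx/rootfs" // assumed layout

	if err := syscall.Chroot(rootfs); err != nil {
		log.Fatalf("chroot: %v", err)
	}
	if err := os.Chdir("/"); err != nil {
		log.Fatalf("chdir: %v", err)
	}

	// exec(3) replaces this process; compare the process tree above.
	app := []string{"/usr/sbin/nginx", "-g", "daemon off;"}
	if err := syscall.Exec(app[0], app, os.Environ()); err != nil {
		log.Fatalf("exec: %v", err)
	}
}
```
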
29 changes: 29 additions & 0 deletions Documentation/devel/execution-flow-fly.dot
@@ -0,0 +1,29 @@
digraph G {
graph [fontname = "helvetica"];
node [fontname = "Arial", fillcolor="#FFE599", style="filled"];
edge [fontname = "monospace"];

{
invokingProcess [shape="node", label=<<B>bash/systemd/kubelet</B><BR/>invoking process>]
stage0 [shape="node", label=<<B>stage0</B><BR/>rkt>]
entrypoint [shape="node", label=<entrypoint<BR/>"coreos.com/rkt/stage1/run">]
app1 [shape="node", label=<<B>stage2</B><BR/>"apps.app.exec"<BR/>app1>]
}

invokingProcess -> stage0
stage0 -> entrypoint [label="exec(3)"]

subgraph cluster_1 {
label=<<B>stage1</B>>
labeljust="left"
entrypoint

subgraph cluster_2 {
label=<<B>stage2</B>>
labeljust="left"
app1
}
}

entrypoint -> app1 [label="chroot(2)+\nexec(3)"]
}
Binary file added Documentation/devel/execution-flow-fly.png
66 changes: 66 additions & 0 deletions Documentation/devel/execution-flow-systemd.dot
@@ -0,0 +1,66 @@
digraph G {
graph [fontname = "helvetica"];
node [fontname = "Arial", fillcolor="#FFE599", style="filled"];
edge [fontname = "monospace"];

{
invokingProcess [shape="node",
label=<<B>systemd-run / service unit</B><BR/>invoking process>, fillcolor="#FFF4D4"]
stage0 [shape="node", label=<<B>stage0</B><BR/>rkt>, fillcolor="#FFF4D4"]
init [shape="node", label=</init<BR/>"coreos.com/rkt/stage1/run">]
ld [shape="node", label="ld-linux-x86-64"]
systemdNspawn [shape="node", label=<systemd-nspawn>]
systemdMachined [shape="node", label=<systemd-machined>, fillcolor="#FFF4D4"]

systemd [shape="node",
label=<systemd>]

app1 [shape="node",
label=<"apps.app.exec"<BR/>app1>]

app2 [shape="node",
label=<"apps.app.exec"<BR/>app2>]

journal [shape="node",
label=<systemd-journal>]
}

invokingProcess -> stage0 [label="fork(2)+exec(3)"]
stage0 -> init [label="exec(3)"]
systemdNspawn -> systemd [label="fork(2)+exec(3)"]
systemdNspawn -> systemdMachined [label="register",
fontname="Arial"]
init -> ld [label="exec(3)"]
ld -> systemdNspawn
systemd -> app1
systemd -> app2
systemd -> journal [label="fork(2)+exec(3)"]

invokingProcess
stage0
systemdMachined

subgraph cluster_1 {
label=<<B>stage1</B>>
labeljust="left"

init
ld
systemdNspawn
systemd
journal

subgraph cluster_2 {
label=<<B>stage2</B>>
labeljust="left"
app1
}

subgraph cluster_3 {
label=<<B>stage2</B>>
labeljust="left"
app2
}
}

}
Binary file added Documentation/devel/execution-flow-systemd.png
43 changes: 43 additions & 0 deletions Documentation/devel/execution-flow.dot
@@ -0,0 +1,43 @@
digraph G {
graph [fontname = "helvetica"];
node [fontname = "Arial", fillcolor="#FFE599", style="filled"];
edge [fontname = "monospace"];

{
invokingProcess [shape="node",
label=<<B>bash/systemd/kubelet</B><BR/>invoking process>]
stage0 [shape="node",
label=<<B>stage0</B><BR/>rkt>]
entrypoint [shape="node",
label=<entrypoint<BR/>"coreos.com/rkt/stage1/run">]
app1 [shape="node",
label=<"apps.app.exec"<BR/>app1>]
app2 [shape="node",
label=<"apps.app.exec"<BR/>app2>]
}

invokingProcess -> stage0 [label="fork(2)+exec(3)"]
stage0 -> entrypoint [label="exec(3)"]

subgraph cluster_1 {
label=<<B>stage1</B>>
labeljust="left"

entrypoint

subgraph cluster_2 {
label=<<B>stage2</B>>
labeljust="left"
app1
}

subgraph cluster_3 {
label=<<B>stage2</B>>
labeljust="right"
app2
}
}

entrypoint -> app1
entrypoint -> app2
}
Binary file added Documentation/devel/execution-flow.png
