This repository has been archived by the owner on Feb 24, 2020. It is now read-only.

Commit

Merge pull request #2945 from s-urbaniak/execflow
doc/devel: detail execution flow between stages
Sergiusz Urbaniak committed Aug 18, 2016
2 parents 222565f + 8b859cd commit e29600e
Showing 11 changed files with 247 additions and 32 deletions.
104 changes: 72 additions & 32 deletions Documentation/devel/architecture.md
@@ -13,8 +13,22 @@ Facilities like file-locking are used to ensure co-operation and mutual exclusio

Execution with rkt is divided into several distinct stages.

_**NB** The goal is for the ABI between stages to be relatively fixed, but while rkt is still under heavy development this is still evolving.
Until https://github.com/coreos/rkt/issues/572 is resolved, this should be considered in flux and the description below may not be authoritative._
**NB** The goal is for the ABI between stages to be relatively fixed, but it is still evolving while rkt is under heavy development.

After `rkt` is invoked, the execution chain follows the numbering of the stages, in the following general order:

![execution-flow](execution-flow.png)

1. invoking process -> stage0:
The invoking process uses its own mechanism to invoke the rkt binary (stage0). When started via a regular shell or a supervisor, stage0 is usually forked and exec'ed, becoming a child process of the invoking shell or supervisor.

2. stage0 -> stage1:
An ordinary `exec(3)` replaces the stage0 process with the stage1 entrypoint. The entrypoint is referenced by the `coreos.com/rkt/stage1/run` annotation in the stage1 image manifest.

3. stage1 -> stage2:
The stage1 entrypoint uses its own mechanism to invoke the stage2 app executables. The app executables are referenced by the `apps.app.exec` settings in the stage2 image manifest.

The details of the execution flow vary across the different stage1 implementations.
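The difference between "fork+exec" (step 1) and a plain exec (step 2) can be observed from the process ID. The following is a minimal sketch, assuming a POSIX system and Python 3.7 or later; the names `STAGE0`, `STAGE1`, and `run_chain` are made up for this illustration and are not part of rkt. A child process prints its PID, replaces itself via exec, and the replacement program prints its PID again.

```python
import subprocess
import sys

# Stand-in for the stage1 entrypoint: it only reports its own PID.
STAGE1 = "import os; print(os.getpid())"

# Stand-in for stage0: report the PID, then replace this process via exec.
# flush=True matters because exec discards any unflushed stdio buffer.
STAGE0 = (
    "import os, sys; print(os.getpid(), flush=True); "
    "os.execv(sys.executable, [sys.executable, '-c', sys.argv[1]])"
)


def run_chain():
    """Start STAGE0 as a child (fork+exec) and return the PIDs it reports
    before and after it replaces itself with STAGE1 (exec only)."""
    proc = subprocess.run(
        [sys.executable, "-c", STAGE0, STAGE1],
        capture_output=True,
        text=True,
        check=True,
    )
    before, after = (int(line) for line in proc.stdout.split())
    return before, after


if __name__ == "__main__":
    before, after = run_chain()
    print("PID before exec:", before)
    print("PID after exec: ", after)
```

Both PIDs are the same because exec replaces the program image inside an existing process instead of creating a new one, which is why a process that only exec's does not show up as an extra level in a process tree.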

### Stage 0

@@ -60,11 +74,27 @@ At this point the stage0 execs `/stage1/rootfs/init` with the current working di

### Stage 1

The next stage is a binary that the user trusts to set up cgroups, execute processes, and perform other operations as root on the host.
This stage has the responsibility of taking the pod filesystem that was created by stage0 and creating the necessary cgroups, namespaces and mounts to launch the pod.
The next stage is a binary that the user trusts. It is responsible for taking the pod filesystem that was created by stage0 and creating the necessary container isolation, network, and mounts to launch the pod.
Specifically, it must:

- Read the Image and Pod Manifests. The Image Manifest defines the default `exec` specifications of each application; the Pod Manifest defines the ordering of the units, as well as any overrides.
- Set up/execute the actual isolation environment for the target pod, called the "stage1 flavor". Currently there are three flavors implemented:
  - fly: a simple chroot-only environment.
  - systemd/nspawn: a cgroup/namespace-based isolation environment using systemd and systemd-nspawn.
  - kvm: a fully isolated kvm environment.
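The first responsibility above (image manifests supply the default `exec`, the pod manifest supplies ordering and overrides) can be sketched as follows. This is an illustration only: the dictionaries are heavily simplified stand-ins for the real manifests, the `"image"` key used to look up an image is a hypothetical shortcut, and `resolve_exec` is not an rkt function.

```python
def resolve_exec(image_manifests, pod_manifest):
    """Return {app name: argv} in pod order.

    An `exec` given for an app in the pod manifest overrides the default
    `exec` taken from that app's image manifest.
    """
    resolved = {}
    for entry in pod_manifest["apps"]:
        default = image_manifests[entry["image"]]["app"]["exec"]
        override = entry.get("app", {}).get("exec")
        resolved[entry["name"]] = list(override or default)
    return resolved


if __name__ == "__main__":
    images = {
        "img-nginx": {"app": {"exec": ["/usr/sbin/nginx"]}},
        "img-redis": {"app": {"exec": ["/bin/redis-server"]}},
    }
    pod = {
        "apps": [
            {"name": "web", "image": "img-nginx"},
            {
                "name": "cache",
                "image": "img-redis",
                "app": {"exec": ["/bin/redis-server", "--port", "7000"]},
            },
        ]
    }
    for name, argv in resolve_exec(images, pod).items():
        print(name, argv)
```

The result keeps the order of the `apps` list, mirroring the role of the pod manifest in ordering the units.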

### Stage 2

The final stage, stage2, is the actual environment in which the applications run, as launched by stage1.

## Flavors
### systemd/nspawn flavors

The "host", "src", and "coreos" flavors (referred to as systemd/nspawn flavors) use `systemd-nspawn` and `systemd` to set up the execution chain.
They include a very minimal systemd that takes care of launching the apps in each pod, applies per-app resource isolators, and makes sure the apps finish in an orderly manner.

These flavors will:
- Read the image and pod manifests
- Generate systemd unit files from those Manifests
- Create and enter network namespace if rkt is not started with `--net=host`
- Start systemd-nspawn (which takes care of the following steps)
@@ -74,10 +104,6 @@ Specifically, it must:

This process is slightly different for the qemu-kvm stage1, which follows a similar workflow but `exec()`s kvm instead of nspawn.
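The "generate systemd unit files from those manifests" step can be pictured with a small sketch. It assumes a deliberately minimal unit layout with only `Description=` and `ExecStart=`; the units rkt really generates contain more than this and are not reproduced here, and `render_unit` is a hypothetical helper.

```python
def render_unit(app_name, argv):
    """Render a minimal systemd service unit that starts one app.

    Arguments containing whitespace are rejected so that the example does
    not have to implement systemd's quoting rules.
    """
    if not argv:
        raise ValueError("argv must not be empty")
    for arg in argv:
        if not arg or any(ch.isspace() for ch in arg):
            raise ValueError("unsupported argument: %r" % (arg,))
    lines = [
        "[Unit]",
        "Description=Application=" + app_name,
        "",
        "[Service]",
        "ExecStart=" + " ".join(argv),
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_unit("nginx", ["/usr/sbin/nginx", "-c", "/etc/nginx/nginx.conf"]))
```

Generating one unit per app lets the systemd instance inside the pod handle start ordering, supervision, and shutdown.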

### Stage 1 systemd Architecture

rkt's Stage1 includes a very minimal systemd that takes care of launching the apps in each pod, apply per-app resource isolators and make sure the apps finish in an orderly manner.

We will now detail how the starting, shutdown, and exit status collection of the apps in a pod are implemented internally.

![rkt-systemd](rkt-systemd.png)
@@ -99,47 +125,61 @@ In this case, the failed app's exit status will get propagated to rkt.
A [*Conflicts*](http://www.freedesktop.org/software/systemd/man/systemd.unit.html#Conflicts=) dependency was also added between each reaper service and the halt and poweroff targets (they are triggered when the pod is stopped from the outside when rkt receives `SIGINT`).
This will activate all the reaper services when one of the targets is activated, causing the exit statuses to be saved and the pod to finish as described in the previous paragraph.

### Stage 2
We will now detail the execution chain for the stage1 systemd/nspawn flavors. The entrypoint is implemented in `stage1/init/init.go` and sets up the following execution chain:

The final stage, stage2, is the actual environment in which the applications run, as launched by stage1.
1. "ld-linux-*.so.*": Depending on the architecture, the appropriate loader helper in the stage1 rootfs is invoked using "exec". This makes sure that subsequent binaries load shared libraries from the stage1 rootfs and not from the host file system.

2. "systemd-nspawn": Used for starting the actual container. systemd-nspawn registers the started container in "systemd-machined" on the host, if available. It is invoked with the `--boot` option, which instructs it to "fork+exec" systemd as the supervisor in the started container.

3. "systemd": Used as the supervisor in the started container. As on a regular host system, it uses "fork+exec" to execute the child app processes.
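To make the first two links of this chain concrete, the sketch below assembles an argument vector of the form "loader, then systemd-nspawn, then `--boot`". The loader file names for `x86_64` and `aarch64`, the `usr/lib` and `usr/bin` locations inside the stage1 rootfs, and the use of `--directory=` are assumptions made for this example; `nspawn_argv` is hypothetical, and the full set of flags rkt passes is not listed in this document.

```python
import posixpath

# Assumed dynamic loader file names per machine architecture.
LOADERS = {
    "x86_64": "ld-linux-x86-64.so.2",
    "aarch64": "ld-linux-aarch64.so.1",
}


def nspawn_argv(stage1_rootfs, machine, container_root):
    """Build the argv used to exec the loader, which in turn runs nspawn.

    Running systemd-nspawn through the loader shipped in the stage1 rootfs
    is what keeps library resolution away from the host file system.
    """
    loader = posixpath.join(stage1_rootfs, "usr/lib", LOADERS[machine])
    nspawn = posixpath.join(stage1_rootfs, "usr/bin/systemd-nspawn")
    return [loader, nspawn, "--boot", "--directory=" + container_root]


if __name__ == "__main__":
    argv = nspawn_argv("stage1/rootfs", "x86_64", "stage1/rootfs")
    print(" ".join(argv))
    # A real entrypoint would hand over control with os.execv(argv[0], argv).
```

Because the entrypoint would exec this argv rather than fork, the loader and systemd-nspawn take over the entrypoint's process instead of becoming children of it.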

rkt commands like prepare and run, as a first step, need to retrieve all the images requested in the command line and prepare the stage2 directories with the application contents.
The following diagram illustrates the execution chain:

This is done with the following chain:
![execution-flow-systemd](execution-flow-systemd.png)

The resulting process tree reveals the parent-child relationships. Note that "exec"ing processes do not appear in the tree:

```
----------- ----------- ------------
| | | | | |
| Fetch |--------->| Store |--------->| Render |
| | | | | |
----------- ----------- ------------
$ ps auxf
...
\_ -bash
    \_ stage1/rootfs/usr/lib/ld-linux-x86-64.so.2 stage1/rootfs/usr/bin/systemd-nspawn
        \_ /usr/lib/systemd/systemd
            \_ /usr/lib/systemd/systemd-journald
            \_ nginx
```

### fly flavor

The "fly" flavor uses a very simple mechanism and is limited to executing a single child app process. The entrypoint is implemented in `stage1_fly/run/main.go`. After setting up a chroot'ed environment, it simply exec's the target app without any further internal supervision:

![execution-flow-fly](execution-flow-fly.png)

The resulting example process tree shows the target process as a direct child of the invoking process:

```
$ ps auxf
...
\_ -bash
    \_ nginx
```
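The "chroot, then exec" mechanism of the fly flavor can be sketched in a few lines. This is not the code from `stage1_fly/run/main.go`; `fly_plan` and `fly_exec` are hypothetical names, and the `chdir("/")` step is an assumption based on common practice after `chroot(2)`. Running `fly_exec` requires root privileges and never returns on success, so the example only prints the planned steps.

```python
import os


def fly_plan(rootfs, argv):
    """Return the system calls, in order, that turn the current process
    into the target app inside a chroot'ed root file system."""
    return [
        ("chroot", rootfs),               # confine the file system view
        ("chdir", "/"),                   # assumed: move into the new root
        ("execv", argv[0], list(argv)),   # replace this process with the app
    ]


def fly_exec(rootfs, argv):
    """Execute the plan. Needs root; does not return if exec succeeds."""
    for name, *args in fly_plan(rootfs, argv):
        getattr(os, name)(*args)


if __name__ == "__main__":
    for step in fly_plan("/var/lib/example-pod/rootfs", ["/usr/sbin/nginx"]):
        print(step)
```

Since no fork is involved and no supervisor stays behind, the app ends up as a direct child of whatever invoked rkt.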

### Image lifecycle

rkt commands like prepare and run, as a first step, need to retrieve all the images requested in the command line and prepare the stage2 directories with the application contents.

This is done with the following chain:

![image-chain](image-chain.png)

* Fetch: in the fetch phase rkt retrieves the requested images. The fetching implementation depends on the provided image argument such as an image string/hash/https URL/file (e.g. `example.com/app:v1.0`).
* Store: in the store phase the fetched images are saved to the local store. The local store is a cache for fetched images and related data.
* Render: in the render phase, a renderer pulls the required images from the store and renders them so they can be easily used as stage2 content.
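The three phases can be illustrated with a toy, in-memory pipeline. Everything here is an assumption made for the sketch: images are plain tar archives, the "remote" is a dictionary, the store key is a SHA-512 digest with a `sha512-` prefix, and rendering yields a path-to-content mapping rather than a directory on disk. None of the names below are rkt APIs.

```python
import hashlib
import io
import tarfile


def make_image(files):
    """Build an in-memory tar archive standing in for a remote image."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def fetch(remote, image_name):
    """Fetch phase: retrieve the raw image bytes for a requested name."""
    return remote[image_name]


class ImageStore:
    """Store phase: a content-addressed cache of fetched images."""

    def __init__(self):
        self._blobs = {}

    def put(self, blob):
        key = "sha512-" + hashlib.sha512(blob).hexdigest()
        self._blobs[key] = blob
        return key

    def get(self, key):
        return self._blobs[key]


def render(store, key):
    """Render phase: expand a stored image into usable file contents."""
    with tarfile.open(fileobj=io.BytesIO(store.get(key)), mode="r") as tar:
        return {
            member.name: tar.extractfile(member).read()
            for member in tar.getmembers()
            if member.isfile()
        }


if __name__ == "__main__":
    remote = {"example.com/app:v1.0": make_image({"rootfs/bin/app": b"#!/bin/sh\n"})}
    store = ImageStore()
    key = store.put(fetch(remote, "example.com/app:v1.0"))
    print(key[:23], sorted(render(store, key)))
```

Keying the store by content digest means that fetching the same image twice does not create a second copy, which matches the store's role as a cache.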


These three logical blocks are implemented inside rkt in this way:

```
------------ --------------- ------------- ------------------------
| | | | | | overlayfs | |
| Fetchers |--------->| Image Store |<-------------| TreeStore |<-----------------| Stage1-2 fs contents |
| | | |<---- | | -------| |
------------ --------------- \ ------------- / ------------------------
\ /
\ ----------------- /
\ | | /
-----| Direct stage1-2 |---
| renderer |
| |
-----------------
```
![image-logical-blocks](image-logical-blocks.png)

Currently rkt implements the [appc][appc-spec] specification internally, converting to it from other container image formats for compatibility. In the future, additional formats like the [OCI image spec][oci-img-spec] may be added to rkt, keeping the same basic scheme for fetching, storing, and rendering application container images.

29 changes: 29 additions & 0 deletions Documentation/devel/execution-flow-fly.dot
@@ -0,0 +1,29 @@
digraph G {
graph [fontname = "helvetica"];
node [fontname = "Arial", fillcolor="#FFE599", style="filled"];
edge [fontname = "monospace"];

{
invokingProcess [shape="node", label=<<B>bash/systemd/kubelet</B><BR/>invoking process>]
stage0 [shape="node", label=<<B>stage0</B><BR/>rkt>]
entrypoint [shape="node", label=<entrypoint<BR/>"coreos.com/rkt/stage1/run">]
app1 [shape="node", label=<<B>stage2</B><BR/>"apps.app.exec"<BR/>app1>]
}

invokingProcess -> stage0
stage0 -> entrypoint [label="exec(3)"]

subgraph cluster_1 {
label=<<B>stage1</B>>
labeljust="left"
entrypoint

subgraph cluster_2 {
label=<<B>stage2</B>>
labeljust="left"
app1
}
}

entrypoint -> app1 [label="chroot(2)+\nexec(3)"]
}
Binary file added Documentation/devel/execution-flow-fly.png
66 changes: 66 additions & 0 deletions Documentation/devel/execution-flow-systemd.dot
@@ -0,0 +1,66 @@
digraph G {
graph [fontname = "helvetica"];
node [fontname = "Arial", fillcolor="#FFE599", style="filled"];
edge [fontname = "monospace"];

{
invokingProcess [shape="node",
label=<<B>systemd-run / service unit</B><BR/>invoking process>, fillcolor="#FFF4D4"]
stage0 [shape="node", label=<<B>stage0</B><BR/>rkt>, fillcolor="#FFF4D4"]
init [shape="node", label=</init<BR/>"coreos.com/rkt/stage1/run">]
ld [shape="node", label="ld-linux-x86-64"]
systemdNspawn [shape="node", label=<systemd-nspawn>]
systemdMachined [shape="node", label=<systemd-machined>, fillcolor="#FFF4D4"]

systemd [shape="node",
label=<systemd>]

app1 [shape="node",
label=<"apps.app.exec"<BR/>app1>]

app2 [shape="node",
label=<"apps.app.exec"<BR/>app2>]

journal [shape="node",
label=<systemd-journal>]
}

invokingProcess -> stage0 [label="fork(2)+exec(3)"]
stage0 -> init [label="exec(3)"]
systemdNspawn -> systemd [label="fork(2)+exec(3)"]
systemdNspawn -> systemdMachined [label="register",
fontname="Arial"]
init -> ld [label="exec(3)"]
ld -> systemdNspawn
systemd -> app1
systemd -> app2
systemd -> journal [label="fork(2)+exec(3)"]

invokingProcess
stage0
systemdMachined

subgraph cluster_1 {
label=<<B>stage1</B>>
labeljust="left"

init
ld
systemdNspawn
systemd
journal

subgraph cluster_2 {
label=<<B>stage2</B>>
labeljust="left"
app1
}

subgraph cluster_3 {
label=<<B>stage2</B>>
labeljust="left"
app2
}
}

}
Binary file added Documentation/devel/execution-flow-systemd.png
43 changes: 43 additions & 0 deletions Documentation/devel/execution-flow.dot
@@ -0,0 +1,43 @@
digraph G {
graph [fontname = "helvetica"];
node [fontname = "Arial", fillcolor="#FFE599", style="filled"];
edge [fontname = "monospace"];

{
invokingProcess [shape="node",
label=<<B>bash/systemd/kubelet</B><BR/>invoking process>]
stage0 [shape="node",
label=<<B>stage0</B><BR/>rkt>]
entrypoint [shape="node",
label=<entrypoint<BR/>"coreos.com/rkt/stage1/run">]
app1 [shape="node",
label=<"apps.app.exec"<BR/>app1>]
app2 [shape="node",
label=<"apps.app.exec"<BR/>app2>]
}

invokingProcess -> stage0 [label="fork(2)+exec(3)"]
stage0 -> entrypoint [label="exec(3)"]

subgraph cluster_1 {
label=<<B>stage1</B>>
labeljust="left"

entrypoint

subgraph cluster_2 {
label=<<B>stage2</B>>
labeljust="left"
app1
}

subgraph cluster_3 {
label=<<B>stage2</B>>
labeljust="right"
app2
}
}

entrypoint -> app1
entrypoint -> app2
}
Binary file added Documentation/devel/execution-flow.png
16 changes: 16 additions & 0 deletions Documentation/devel/image-chain.dot
@@ -0,0 +1,16 @@
digraph G {
rankdir="LR";

graph [fontname = "helvetica"];
node [fontname = "Arial", fillcolor="#FFE599", style="filled"];
edge [fontname = "monospace"];

{
fetch [shape="node", label=<Fetch>]
store [shape="node", label=<Store>]
render [shape="node", label=<Render>]
}

fetch -> store
store -> render
}
Binary file added Documentation/devel/image-chain.png
21 changes: 21 additions & 0 deletions Documentation/devel/image-logical-blocks.dot
@@ -0,0 +1,21 @@
digraph G {
rankdir="LR";

graph [fontname = "helvetica"];
node [fontname = "Arial", fillcolor="#FFE599", style="filled"];
edge [fontname = "monospace"];

{
fetchers [shape="node", label=<Fetchers>, pos="0,0!"]
image_store [shape="node", label=<Image Store>, pos="2,0!"]
tree_store [shape="node", label=<Tree Store>, pos="4,0!"]
fs_contents [shape="node", label=<Stage1-2 fs contents>, pos="6,0!"]
renderer [shape="node", label=<Direct Stage1-2 renderer>, pos="4,-1!"]
}

fetchers -> image_store
tree_store -> image_store
renderer -> image_store
fs_contents -> tree_store
renderer -> fs_contents [dir=none]
}
Binary file added Documentation/devel/image-logical-blocks.png
