Merge pull request #2945 from s-urbaniak/execflow

doc/devel: detail execution flow between stages
rkt · Aug 18, 2016 · e29600e · e29600e
2 parents 222565f + 8b859cd
commit e29600e
Show file tree

Hide file tree

Showing 11 changed files with 247 additions and 32 deletions.
diff --git a/Documentation/devel/architecture.md b/Documentation/devel/architecture.md
@@ -13,8 +13,22 @@ Facilities like file-locking are used to ensure co-operation and mutual exclusio
 
 Execution with rkt is divided into several distinct stages.
 
-_**NB** The goal is for the ABI between stages to be relatively fixed, but while rkt is still under heavy development this is still evolving.
-Until https://github.com/coreos/rkt/issues/572 is resolved, this should be considered in flux and the description below may not be authoritative._
+**NB** The goal is for the ABI between stages to be relatively fixed, but while rkt is still under heavy development this is still evolving.
+
+After calling `rkt` the execution chain follows the numbering of stages, having the following general order:
+
+![execution-flow](execution-flow.png)
+
+1. invoking process -> stage0:
+The invoking process uses its own mechanism to invoke the rkt binary (stage0). When started via a regular shell or a supervisor, stage0 is usually forked and exec'ed becoming a child process of the invoking shell or supervisor.
+
+2. stage0 -> stage1:
+An ordinary `exec(3)` is being used to replace the stage0 process with the stage1 entrypoint. The entrypoint is referenced by the `coreos.com/rkt/stage1/run` annotation in the stage1 image manifest.
+
+3. stage1 -> stage2:
+The stage1 entrypoint uses its mechanism to invoke the stage2 app executables. The app executables are referenced by the `apps.app.exec` settings in the stage2 image manifest.
+
+The details of the execution flow varies across the different stage1 implementations.
 
 ### Stage 0
 
@@ -60,11 +74,27 @@ At this point the stage0 execs `/stage1/rootfs/init` with the current working di
 
 ### Stage 1
 
-The next stage is a binary that the user trusts to set up cgroups, execute processes, and perform other operations as root on the host.
-This stage has the responsibility of taking the pod filesystem that was created by stage0 and creating the necessary cgroups, namespaces and mounts to launch the pod.
+The next stage is a binary that the user trusts, and has the responsibility of taking the pod filesystem that was created by stage0, create the necessary container isolation, network, and mounts to launch the pod.
 Specifically, it must:
 
 - Read the Image and Pod Manifests. The Image Manifest defines the default `exec` specifications of each application; the Pod Manifest defines the ordering of the units, as well as any overrides.
+- Set up/execute the actual isolation environment for the target pod, called the "stage1 flavor". Currently there are three flavors implemented:
+  - fly: a simple chroot only environment.
+  - systemd/nspawn: a cgroup/namespace based isolation environment using systemd, and systemd-nspawn.
+  - kvm: a fully isolated kvm environment.
+
+### Stage 2
+
+The final stage, stage2, is the actual environment in which the applications run, as launched by stage1.
+
+## Flavors
+### systemd/nspawn flavors
+
+The "host", "src", and "coreos" flavors (referenced to as systemd/nspawn flavors) use  `systemd-nspawn`, and `systemd` to set up the execution chain.
+They include a very minimal systemd that takes care of launching the apps in each pod, apply per-app resource isolators and makes sure the apps finish in an orderly manner.
+
+These flavors will:
+- Read the image and pod manifests
 - Generate systemd unit files from those Manifests
 - Create and enter network namespace if rkt is not started with `--net=host`
 - Start systemd-nspawn (which takes care of the following steps)
@@ -74,10 +104,6 @@ Specifically, it must:
 
 This process is slightly different for the qemu-kvm stage1 but a similar workflow starting at `exec()`'ing kvm instead of an nspawn.
 
-### Stage 1 systemd Architecture
-
-rkt's Stage1 includes a very minimal systemd that takes care of launching the apps in each pod, apply per-app resource isolators and make sure the apps finish in an orderly manner.
-
 We will now detail how the starting, shutdown, and exist status collection of the apps in a pod are implemented internally.
 
 ![rkt-systemd](rkt-systemd.png)
@@ -99,47 +125,61 @@ In this case, the failed app's exit status will get propagated to rkt.
 A [*Conflicts*](http://www.freedesktop.org/software/systemd/man/systemd.unit.html#Conflicts=) dependency was also added between each reaper service and the halt and poweroff targets (they are triggered when the pod is stopped from the outside when rkt receives `SIGINT`).
 This will activate all the reaper services when one of the targets is activated, causing the exit statuses to be saved and the pod to finish like it was described in the previous paragraph.
 
-### Stage 2
+We will now detail the execution chain for the stage1 systemd/nspawn flavors. The entrypoint is implemented in the `stage1/init/init.go` binary and sets up the following execution chain:
 
-The final stage, stage2, is the actual environment in which the applications run, as launched by stage1.
+1. "ld-linux-*.so.*": Depending on the architecture the appropriate loader helper in the stage1 rootfs is invoked using "exec". This makes sure that subsequent binaries load shared libraries from the stage1 rootfs and not from the host file system.
 
+2. "systemd-nspawn": Used for starting the actual container. systemd-nspawn registers the started container in "systemd-machined" on the host, if available. It is parametrized with the `--boot` option to instruct it to "fork+exec" systemd as the supervisor in the started container.
 
-### Image lifecycle
+3. "systemd": Used as the supervisor in the started container. Similar as on a regular host system, it uses "fork+exec" to execute the child app processes.
 
-rkt commands like prepare and run, as a first step, need to retrieve all the images requested in the command line and prepare the stage2 directories with the application contents.
+The following diagram illustrates the execution chain:
 
-This is done with the following chain:
+![execution-flow-systemd](execution-flow-systemd.png)
+
+The resulting process tree reveals the parent-child relationships. Note that "exec"ing processes do not appear in the tree:
 
 ```
- -----------            -----------            ------------
-|           |          |           |          |            |
-|   Fetch   |--------->|   Store   |--------->|   Render   |
-|           |          |           |          |            |
- -----------            -----------            ------------
+$ ps auxf
+...
+\_ -bash
+    \_ stage1/rootfs/usr/lib/ld-linux-x86-64.so.2 stage1/rootfs/usr/bin/systemd-nspawn
+        \_ /usr/lib/systemd/systemd
+            \_ /usr/lib/systemd/systemd-journald
+            \_ nginx
+```
+
+### fly flavor
 
+The "fly" flavor uses a very simple mechanism being limited to only execute one child app process. The entrypoint is implemented in `stage1_fly/run/main.go`. After setting up a chroot'ed environment it simply exec's the target app without any further internal supervision:
+
+![execution-flow-fly](execution-flow-fly.png)
+
+The resulting example process tree shows the target process as a direct child of the invoking process:
+
+```
+$ ps auxf
+...
+\_ -bash
+    \_ nginx
 ```
 
+### Image lifecycle
+
+rkt commands like prepare and run, as a first step, need to retrieve all the images requested in the command line and prepare the stage2 directories with the application contents.
+
+This is done with the following chain:
+
+![image-chain](image-chain.png)
+
 * Fetch: in the fetch phase rkt retrieves the requested images. The fetching implementation depends on the provided image argument such as an image string/hash/https URL/file (e.g. `example.com/app:v1.0`).
 * Store: in the store phase the fetched images are saved to the local store. The local store is a cache for fetched images and related data.
 * Render: in the render phase, a renderer pulls the required images from the store and renders them so they can be easily used as stage2 content.
 
 
 These three logical blocks are implemented inside rkt in this way:
 
-```
- ------------            ---------------                -------------                    ------------------------
-|            |          |               |              |             |     overlayfs    |                        |
-|  Fetchers  |--------->|  Image Store  |<-------------|  TreeStore  |<-----------------|  Stage1-2 fs contents  |
-|            |          |               |<----         |             |           -------|                        |
- ------------            ---------------      \         -------------           /        ------------------------
-                                               \                               /
-                                                \       -----------------     /
-                                                 \     |                 |   /
-                                                  -----| Direct stage1-2 |---
-                                                       |   renderer      |
-                                                       |                 |
-                                                        -----------------
-```
+![image-logical-blocks](image-logical-blocks.png)
 
 Currently rkt implements the [appc][appc-spec] internally, converting to it from other container image formats for compatibility. In the future, additional formats like the [OCI image spec][oci-img-spec] may be added to rkt, keeping the same basic scheme for fetching, storing, and rendering application container images.
 

diff --git a/Documentation/devel/execution-flow-fly.dot b/Documentation/devel/execution-flow-fly.dot
@@ -0,0 +1,29 @@
+digraph G {
+    graph [fontname = "helvetica"];
+    node [fontname = "Arial", fillcolor="#FFE599", style="filled"];
+    edge [fontname = "monospace"];
+
+    {
+        invokingProcess [shape="node", label=<<B>bash/systemd/kubelet</B><BR/>invoking process>]
+        stage0 [shape="node", label=<<B>stage0</B><BR/>rkt>]
+        entrypoint [shape="node", label=<entrypoint<BR/>"coreos.com/rkt/stage1/run">]
+        app1 [shape="node", label=<<B>stage2</B><BR/>"apps.app.exec"<BR/>app1>]
+    }
+
+    invokingProcess -> stage0
+    stage0 -> entrypoint [label="exec(3)"]
+
+    subgraph cluster_1 {
+        label=<<B>stage1</B>>
+        labeljust="left"
+        entrypoint
+
+        subgraph cluster_2 {
+            label=<<B>stage2</B>>
+            labeljust="left"
+            app1
+        }
+    }
+
+    entrypoint -> app1 [label="chroot(2)+\nexec(3)"]
+}
diff --git a/Documentation/devel/execution-flow-fly.png b/Documentation/devel/execution-flow-fly.png
diff --git a/Documentation/devel/execution-flow-systemd.dot b/Documentation/devel/execution-flow-systemd.dot
@@ -0,0 +1,66 @@
+digraph G {
+    graph [fontname = "helvetica"];
+    node [fontname = "Arial", fillcolor="#FFE599", style="filled"];
+    edge [fontname = "monospace"];
+
+    {
+        invokingProcess [shape="node",
+                       label=<<B>systemd-run / service unit</B><BR/>invoking process>, fillcolor="#FFF4D4"]
+        stage0 [shape="node", label=<<B>stage0</B><BR/>rkt>, fillcolor="#FFF4D4"]
+        init [shape="node", label=</init<BR/>"coreos.com/rkt/stage1/run">]
+        ld [shape="node", label="ld-linux-x86-64"]
+        systemdNspawn [shape="node", label=<systemd-nspawn>]
+        systemdMachined [shape="node", label=<systemd-machined>, fillcolor="#FFF4D4"]
+
+        systemd [shape="node",
+                 label=<systemd>]
+
+        app1 [shape="node",
+              label=<"apps.app.exec"<BR/>app1>]
+
+        app2 [shape="node",
+              label=<"apps.app.exec"<BR/>app2>]
+
+        journal [shape="node",
+                 label=<systemd-journal>]
+    }
+
+    invokingProcess -> stage0 [label="fork(2)+exec(3)"]
+    stage0 -> init [label="exec(3)"]
+    systemdNspawn -> systemd [label="fork(2)+exec(3)"]
+    systemdNspawn -> systemdMachined [label="register",
+                                      fontname="Arial"]
+    init -> ld [label="exec(3)"]
+    ld -> systemdNspawn
+    systemd -> app1
+    systemd -> app2
+    systemd -> journal [label="fork(2)+exec(3)"]
+
+    invokingProcess
+    stage0
+    systemdMachined
+
+    subgraph cluster_1 {
+        label=<<B>stage1</B>>
+        labeljust="left"
+
+        init
+        ld
+        systemdNspawn
+        systemd
+        journal
+
+        subgraph cluster_2 {
+            label=<<B>stage2</B>>
+            labeljust="left"
+            app1
+        }
+
+        subgraph cluster_3 {
+            label=<<B>stage2</B>>
+            labeljust="left"
+            app2
+        }
+    }
+
+}
diff --git a/Documentation/devel/execution-flow-systemd.png b/Documentation/devel/execution-flow-systemd.png
diff --git a/Documentation/devel/execution-flow.dot b/Documentation/devel/execution-flow.dot
@@ -0,0 +1,43 @@
+digraph G {
+    graph [fontname = "helvetica"];
+    node [fontname = "Arial", fillcolor="#FFE599", style="filled"];
+    edge [fontname = "monospace"];
+
+    {
+        invokingProcess [shape="node",
+                         label=<<B>bash/systemd/kubelet</B><BR/>invoking process>]
+        stage0 [shape="node",
+                label=<<B>stage0</B><BR/>rkt>]
+        entrypoint [shape="node",
+                    label=<entrypoint<BR/>"coreos.com/rkt/stage1/run">]
+        app1 [shape="node",
+              label=<"apps.app.exec"<BR/>app1>]
+        app2 [shape="node",
+              label=<"apps.app.exec"<BR/>app2>]
+    }
+
+    invokingProcess -> stage0 [label="fork(2)+exec(3)"]
+    stage0 -> entrypoint [label="exec(3)"]
+
+    subgraph cluster_1 {
+        label=<<B>stage1</B>>
+        labeljust="left"
+
+        entrypoint
+
+        subgraph cluster_2 {
+            label=<<B>stage2</B>>
+            labeljust="left"
+            app1
+        }
+
+        subgraph cluster_3 {
+            label=<<B>stage2</B>>
+            labeljust="right"
+            app2
+        }
+    }
+
+    entrypoint -> app1
+    entrypoint -> app2
+}
diff --git a/Documentation/devel/execution-flow.png b/Documentation/devel/execution-flow.png
diff --git a/Documentation/devel/image-chain.dot b/Documentation/devel/image-chain.dot
@@ -0,0 +1,16 @@
+digraph G {
+    rankdir="LR";
+
+    graph [fontname = "helvetica"];
+    node [fontname = "Arial", fillcolor="#FFE599", style="filled"];
+    edge [fontname = "monospace"];
+
+    {
+        fetch [shape="node", label=<Fetch>]
+        store [shape="node", label=<Store>]
+        render [shape="node", label=<Render>]
+    }
+
+    fetch -> store
+    store -> render
+}
diff --git a/Documentation/devel/image-chain.png b/Documentation/devel/image-chain.png
diff --git a/Documentation/devel/image-logical-blocks.dot b/Documentation/devel/image-logical-blocks.dot
@@ -0,0 +1,21 @@
+digraph G {
+    rankdir="LR";
+
+    graph [fontname = "helvetica"];
+    node [fontname = "Arial", fillcolor="#FFE599", style="filled"];
+    edge [fontname = "monospace"];
+
+    {
+        fetchers [shape="node", label=<Fetchers>, pos="0,0!"]
+        image_store [shape="node", label=<Image Store>, pos="2,0!"]
+        tree_store [shape="node", label=<Tree Store>, pos="4,0!"]
+        fs_contents [shape="node", label=<Stage1-2 fs contents>, pos="6,0!"]
+        renderer [shape="node", label=<Direct Stage1-2 renderer>, pos="4,-1!"]
+    }
+
+    fetchers -> image_store
+    tree_store -> image_store
+    renderer -> image_store
+    fs_contents -> tree_store
+    renderer -> fs_contents [dir=none]
+}
diff --git a/Documentation/devel/image-logical-blocks.png b/Documentation/devel/image-logical-blocks.png