
Add support for runtime init image volumes #801

@dbkegley

Description

Kubernetes 1.35+ enables the image volume feature gate by default. @plascaray has analyzed the init-container copy workflow, and content startup time is significantly degraded by the amount of data copied from the init container (about 300-500 MB per job). This also appears to degrade containerd performance in surprising ways under load, because copying that data onto an overlayfs consumes a large share of IO bandwidth.

In the near term we can look at reducing the size of the init container image; however, we're fairly limited there, since the image only contains the runtime components needed to execute all of our known content types.

A better long-term solution is to support image volumes and eliminate the runtime copy operation altogether. I think we can do this today in the Helm chart and allow users to opt in to the new image-volume init.
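To make this opt-in, the chart could gate the behavior behind a values flag. A hypothetical sketch of the values shape (the key names here are illustrative, not the chart's actual schema):

```yaml
# values.yaml (illustrative only; actual key names TBD in the chart)
launcher:
  runtimeInit:
    # "initContainer" keeps the current copy-based behavior;
    # "imageVolume" opts in to the new read-only image volume mount.
    mode: imageVolume
    image:
      registry: ghcr.io
      repository: rstudio/rstudio-connect-content-runtime
      tag: ubuntu2204-daily
      pullPolicy: Always
```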

This would require:

  1. A new init image with the runtime components installed at the root:

```dockerfile
FROM scratch
COPY --from=ghcr.io/rstudio/rstudio-connect-content-init-preview:ubuntu2204-daily /opt/rstudio-connect-runtime/ /
```

     This is required for EKS, which only supports containerd 2.1 (containerd 2.2 is required to support volume mount subPaths).
  2. Update the Helm chart to create an image volume mount instead of an init container to provide the runtime components. The rendered job.tpl should look something like this:

Note that the mounted image volume is read-only, which conflicts with Connect's default mnt/ directories, such as /opt/rstudio-connect/mnt/app, which we use as the content's CWD. We'll need to decouple the mount targets from the Connect runtime installation path somehow.

```yaml
apiVersion: batch/v1
kind: Job
spec:
  template:
    spec:
      containers:
        - name: "rs-launcher-container"
          # ...
          volumeMounts:
            - mountPath: /opt/rstudio-connect
              name: rsc-volume
      volumes:
        - name: rsc-volume
          image:
            reference: ghcr.io/rstudio/rstudio-connect-content-runtime:ubuntu2204-daily
            pullPolicy: Always
```
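One possible way to resolve the read-only conflict (a sketch, not a settled design) is to mount the runtime at a dedicated read-only path and keep the writable mnt/ tree on an emptyDir, so the two no longer share /opt/rstudio-connect. This assumes Connect can be pointed at a runtime path separate from its writable mnt/ tree:

```yaml
# Sketch only: assumes the runtime install path and the writable
# mnt/ tree can be configured independently in Connect.
          volumeMounts:
            - mountPath: /opt/rstudio-connect-runtime   # read-only runtime
              name: rsc-volume
            - mountPath: /opt/rstudio-connect/mnt       # writable scratch/CWD
              name: rsc-mnt
      volumes:
        - name: rsc-volume
          image:
            reference: ghcr.io/rstudio/rstudio-connect-content-runtime:ubuntu2204-daily
            pullPolicy: Always
        - name: rsc-mnt
          emptyDir: {}
```

On clusters with containerd >= 2.2, a volumeMount subPath could instead select a subtree of the runtime image, removing the need for the re-rooted scratch image described in step 1.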

The full report is pasted below:

D2K Load Test: Slim Init Image vs Standard Init Image

Date: 2026-03-07
Stack: shared-ally-d2k-loadtest
Duration: 03:58 UTC - 05:02 UTC (~64 minutes)

Change Under Test

The init container image was changed from the standard connect-content-init to a slim variant:

|            | Standard (baseline) | Slim (this test) |
|------------|---------------------|------------------|
| Init image | ghcr.io/rstudio/rstudio-connect-content-init:ubuntu2204-daily | ghcr.io/plascaray/connect-content-init-slim:ubuntu2204-daily |

Hypothesis: The standard init image causes IO contention during pod startup (copying runtime binaries to the EmptyDir volume). A lighter image reduces this contention, allowing more pods to start concurrently without degrading performance.

Test Configuration (unchanged from baseline)

| Parameter | Value |
|-----------|-------|
| Architecture | AWS EKS (aws/k8s) |
| Node type | m6i.2xlarge (8 vCPU, 32 GiB) |
| Node count | 4 |
| Connect replicas | 3 (direct execution mode, no launcher) |
| Connect version | 2026.03.0-dev |
| Content type | add bundle (simple Python addition script) |
| Content items | 160 |
| ScheduleConcurrency | 75 per replica (225 total) |
| Job CPU request/limit | 150m / 250m |
| Job memory request/limit | 128Mi / 256Mi |
| EFS throughput | Provisioned 150 MiB/s |

Results: Side-by-Side Comparison

Render Latency (p95 seconds)

| Schedules | Standard p95 | Slim p95 | Improvement |
|-----------|--------------|----------|-------------|
| 1 | 29.3 | 29.3 | - |
| 5 | 58.3 | 29.3 | 2.0x |
| 10 | 285.7 | 29.3 | 9.8x |
| 20 | 271.6 | 54.6 | 5.0x |
| 40 | 270.1 | 56.1 | 4.8x |
| 80 | 287.3 | 58.2 | 4.9x |
| 120 | 288.0 | 59.1 | 4.9x |
| 160 | 288.0 | 284.4 | 1.0x |
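The Improvement column is simply the ratio of standard to slim p95 at each load level; a quick sketch reproducing it from the table values (data copied from the table above):

```python
# Reproduce the "Improvement" column: ratio of standard p95 to slim p95.
# Values are copied verbatim from the render-latency table.
schedules = [1, 5, 10, 20, 40, 80, 120, 160]
standard_p95 = [29.3, 58.3, 285.7, 271.6, 270.1, 287.3, 288.0, 288.0]
slim_p95 = [29.3, 29.3, 29.3, 54.6, 56.1, 58.2, 59.1, 284.4]

improvement = [round(std / slim, 1) for std, slim in zip(standard_p95, slim_p95)]
for n, ratio in zip(schedules, improvement):
    print(f"{n:>3} schedules: {ratio}x")
```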

Throughput (jobs/min)

| Schedules | Standard | Slim | Improvement |
|-----------|----------|------|-------------|
| 1 | 3.0 | 1.0 | - |
| 5 | 14.5 | 15.5 | 1.1x |
| 10 | 19.5 | 28.5 | 1.5x |
| 20 | 39.0 | 62.5 | 1.6x |
| 40 | 78.5 | 115.5 | 1.5x |
| 80 | 115.5 | 242.0 | 2.1x |
| 120 | 147.0 | 264.3 | 1.8x |
| 160 | 108.0 | 291.0 | 2.7x |

Scheduling Pressure (pending pods)

| Schedules | Standard | Slim |
|-----------|----------|------|
| 1 | 1 | 0 |
| 5 | 4 | 0 |
| 10 | 1 | 0 |
| 20 | 1 | 0 |
| 40 | 17 | 0 |
| 80 | 60 | 0 |
| 120 | 91 | 16 |
| 160 | 19 | 0 |

Full Metrics (Slim Init)

| Schedules | p50 (s) | p95 (s) | p99 (s) | Jobs/min | Avg Node CPU % | Pending Pods |
|-----------|---------|---------|---------|----------|----------------|--------------|
| 1 | 22.5 | 29.3 | 29.9 | 1.0 | 0.5% | 0 |
| 5 | 22.5 | 29.3 | 29.9 | 15.5 | 3.0% | 0 |
| 10 | 22.5 | 29.3 | 29.9 | 28.5 | 2.8% | 0 |
| 20 | 25.4 | 54.6 | 58.9 | 62.5 | 7.0% | 0 |
| 40 | 26.6 | 56.1 | 59.2 | 115.5 | 14.2% | 0 |
| 80 | 42.1 | 58.2 | 59.6 | 242.0 | 30.0% | 0 |
| 120 | 45.3 | 59.1 | 192.4 | 264.3 | 44.2% | 16 |
| 160 | 144.4 | 284.4 | 296.9 | 291.0 | 38.5% | 0 |

Key Observations

1. Performance cliff shifted from 10 to 160 schedules

With the standard init image, p95 render time jumped from 58s to 286s at just 10 concurrent schedules. With the slim init image, the cluster maintained p95 under 60s all the way up to 120 schedules. The cliff only appeared at 160 schedules (p95=284s), representing a 16x increase in the load level before degradation.

2. p95 plateau at ~55-59s from 20 to 120 schedules

The slim init image shows a gentle, predictable rise in p95 from 29s (1-10 schedules) to 59s (80-120 schedules), rather than the immediate jump to ~288s seen with the standard image. This indicates the cluster is genuinely processing jobs without significant queue-induced latency.

3. Throughput increased 2-3x across the board

At every load level above 5 schedules, the slim image delivered higher throughput. The peak was 291 jobs/min at 160 schedules (slim) vs 147 jobs/min at 120 schedules (standard) — a 2x improvement in peak sustainable throughput.

4. Near-zero scheduling pressure up to 120 schedules

Pending pod count stayed at 0 through 80 schedules and only reached 16 at 120. The standard image showed 60 pending pods at 80 schedules and 91 at 120. This confirms the hypothesis: the standard init image's IO during pod startup was the primary bottleneck creating scheduling contention.

5. CPU utilization now scales proportionally with load

With the slim image, node CPU climbed steadily: 0.5% -> 3% -> 7% -> 14% -> 30% -> 44%. This is healthy — the cluster is actually doing useful work rather than being blocked on IO. The standard image showed flat, low CPU (under 10%) because most time was spent waiting on init containers.

6. The 160-schedule cliff is a real capacity limit

At 160 schedules, p95 jumped to 284s — similar to the standard image's plateau. This is likely the genuine scheduling capacity of a 4-node cluster with ScheduleConcurrency=75 (225 total slots). The slim image didn't eliminate this limit; it simply pushed the bottleneck from IO contention to actual CPU/scheduling capacity.

7. p99 spike at 120 schedules (192s)

While p95 was still healthy at 59s, the p99 jumped to 192s at 120 schedules. This suggests a small fraction of jobs were hitting longer queue wait times, serving as an early warning of the cliff at 160.

8. Some job errors appeared

The job stats export shows 89 jobs with errors out of 3,468 total (~2.6%). The standard test had zero failures across 4,390 jobs. This warrants investigation — the errors may be related to the slim image missing something the content needs, or could be transient scheduling issues at high load.
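For reference, the quoted error rate follows directly from the export counts (counts copied from the text above; the parquet schema itself is not shown here):

```python
# Error rates from the job stats exports (counts quoted in the report text).
slim_total, slim_failed = 3_468, 89
standard_total, standard_failed = 4_390, 0

slim_error_pct = round(100 * slim_failed / slim_total, 1)
print(f"slim: {slim_failed}/{slim_total} jobs failed ({slim_error_pct}%)")
print(f"standard: {standard_failed}/{standard_total} jobs failed")
```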

Cluster Health

All 3 Connect server pods remained Running and Ready (1/1) throughout the test. All 4 nodes stayed in Ready state. The monitoring stack remained healthy after the pre-test restart.

Data Files

| File | Description |
|------|-------------|
| metrics.csv | Per-step metrics (8 rows) for charting |
| job_stats.parquet | Full job history (3,468 jobs) from Connect API |
| job_stats_metadata.json | Metadata for the parquet file |
| lgtm-export.tar.gz | Prometheus TSDB data from the test window (97MB) |
| test_start_time.txt | UTC timestamp when the test began |

Baseline data is in reports/d2k-loadtest/ (original report and metrics-standard-init.csv backup).

Tools Used

Same as baseline test. The load test script (reports/d2k-loadtest/run_load_test.sh) was reused with OUTPUT_DIR=reports/d2k-loadtest-slim.
