
Add support for runtime init image volumes #801

@dbkegley

Description

Kubernetes 1.35+ enables the image volume feature gate by default. @plascaray has analyzed the init-container copy workflow, and content startup time is significantly degraded by the amount of data copied from the init container (about 300-500 MB per job). This also appears to degrade containerd performance in surprising ways under load, because copying that data onto an overlayfs consumes a large share of IO bandwidth.

In the near term we can look at reducing the size of the init container image; however, we're fairly limited there, since the image only contains the runtime components needed to execute all of our known content types.

A better long-term solution is to support image volumes and eliminate the runtime copy operation altogether. I think we can do this today in the Helm chart and allow users to opt in to the new image-volume init.
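To make this opt-in, the chart could gate the behavior behind a values flag. A hypothetical sketch of the values shape (the key names here are illustrative, not the chart's actual schema):

```yaml
# values.yaml (illustrative only; actual key names TBD in the chart)
launcher:
  runtimeInit:
    # "initContainer" keeps the current copy-based behavior;
    # "imageVolume" opts in to the new read-only image volume mount.
    mode: imageVolume
    image:
      registry: ghcr.io
      repository: rstudio/rstudio-connect-content-runtime
      tag: ubuntu2204-daily
      pullPolicy: Always
```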

This would require:

  1. A new init image with the runtime components installed at the root:

```dockerfile
FROM scratch
COPY --from=ghcr.io/rstudio/rstudio-connect-content-init-preview:ubuntu2204-daily /opt/rstudio-connect-runtime/ /
```

     This is required for EKS, which only supports containerd 2.1 (containerd 2.2 is required to support volume mount subPaths).
  2. Update the Helm chart to create an image volume mount instead of an init container to provide the runtime components. The rendered job.tpl should look something like this:

Note that the mounted image volume is read-only, which conflicts with Connect's default mnt/ directories, such as /opt/rstudio-connect/mnt/app, which we use as the content's CWD. We'll need to decouple the mount targets from the Connect runtime installation path somehow.

```yaml
apiVersion: batch/v1
kind: Job
spec:
  template:
    spec:
      containers:
        - name: "rs-launcher-container"
          # ...
          volumeMounts:
            - mountPath: /opt/rstudio-connect
              name: rsc-volume
      volumes:
        - name: rsc-volume
          image:
            reference: ghcr.io/rstudio/rstudio-connect-content-runtime:ubuntu2204-daily
            pullPolicy: Always
```
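One possible way to resolve the read-only conflict (a sketch, not a settled design) is to mount the runtime at a dedicated read-only path and keep the writable mnt/ tree on an emptyDir, so the two no longer share /opt/rstudio-connect. This assumes Connect can be pointed at a runtime path separate from its writable mnt/ tree:

```yaml
# Sketch only: assumes the runtime install path and the writable
# mnt/ tree can be configured independently in Connect.
          volumeMounts:
            - mountPath: /opt/rstudio-connect-runtime   # read-only runtime
              name: rsc-volume
            - mountPath: /opt/rstudio-connect/mnt       # writable scratch/CWD
              name: rsc-mnt
      volumes:
        - name: rsc-volume
          image:
            reference: ghcr.io/rstudio/rstudio-connect-content-runtime:ubuntu2204-daily
            pullPolicy: Always
        - name: rsc-mnt
          emptyDir: {}
```

On clusters with containerd >= 2.2, a volumeMount subPath could instead select a subtree of the runtime image, removing the need for the re-rooted scratch image described in step 1.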

The full report is pasted below:

D2K Load Test: Slim Init Image vs Standard Init Image

Date: 2026-03-07
Stack: shared-ally-d2k-loadtest
Duration: 03:58 UTC - 05:02 UTC (~64 minutes)

Change Under Test

The init container image was changed from the standard connect-content-init to a slim variant:

|            | Standard (baseline) | Slim (this test) |
|------------|---------------------|------------------|
| Init image | ghcr.io/rstudio/rstudio-connect-content-init:ubuntu2204-daily | ghcr.io/plascaray/connect-content-init-slim:ubuntu2204-daily |

Hypothesis: The standard init image causes IO contention during pod startup (copying runtime binaries to the EmptyDir volume). A lighter image reduces this contention, allowing more pods to start concurrently without degrading performance.

Test Configuration (unchanged from baseline)

| Parameter | Value |
|-----------|-------|
| Architecture | AWS EKS (aws/k8s) |
| Node type | m6i.2xlarge (8 vCPU, 32 GiB) |
| Node count | 4 |
| Connect replicas | 3 (direct execution mode, no launcher) |
| Connect version | 2026.03.0-dev |
| Content type | add bundle (simple Python addition script) |
| Content items | 160 |
| ScheduleConcurrency | 75 per replica (225 total) |
| Job CPU request/limit | 150m / 250m |
| Job memory request/limit | 128Mi / 256Mi |
| EFS throughput | Provisioned 150 MiB/s |

Results: Side-by-Side Comparison

Render Latency (p95 seconds)

| Schedules | Standard p95 | Slim p95 | Improvement |
|-----------|--------------|----------|-------------|
| 1 | 29.3 | 29.3 | - |
| 5 | 58.3 | 29.3 | 2.0x |
| 10 | 285.7 | 29.3 | 9.8x |
| 20 | 271.6 | 54.6 | 5.0x |
| 40 | 270.1 | 56.1 | 4.8x |
| 80 | 287.3 | 58.2 | 4.9x |
| 120 | 288.0 | 59.1 | 4.9x |
| 160 | 288.0 | 284.4 | 1.0x |
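The Improvement column is simply the ratio of standard to slim p95 at each load level; a quick sketch reproducing it from the table values (data copied from the table above):

```python
# Reproduce the "Improvement" column: ratio of standard p95 to slim p95.
# Values are copied verbatim from the render-latency table.
schedules = [1, 5, 10, 20, 40, 80, 120, 160]
standard_p95 = [29.3, 58.3, 285.7, 271.6, 270.1, 287.3, 288.0, 288.0]
slim_p95 = [29.3, 29.3, 29.3, 54.6, 56.1, 58.2, 59.1, 284.4]

improvement = [round(std / slim, 1) for std, slim in zip(standard_p95, slim_p95)]
for n, ratio in zip(schedules, improvement):
    print(f"{n:>3} schedules: {ratio}x")
```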

Throughput (jobs/min)

| Schedules | Standard | Slim | Improvement |
|-----------|----------|------|-------------|
| 1 | 3.0 | 1.0 | - |
| 5 | 14.5 | 15.5 | 1.1x |
| 10 | 19.5 | 28.5 | 1.5x |
| 20 | 39.0 | 62.5 | 1.6x |
| 40 | 78.5 | 115.5 | 1.5x |
| 80 | 115.5 | 242.0 | 2.1x |
| 120 | 147.0 | 264.3 | 1.8x |
| 160 | 108.0 | 291.0 | 2.7x |

Scheduling Pressure (pending pods)

| Schedules | Standard | Slim |
|-----------|----------|------|
| 1 | 1 | 0 |
| 5 | 4 | 0 |
| 10 | 1 | 0 |
| 20 | 1 | 0 |
| 40 | 17 | 0 |
| 80 | 60 | 0 |
| 120 | 91 | 16 |
| 160 | 19 | 0 |

Full Metrics (Slim Init)

| Schedules | p50 (s) | p95 (s) | p99 (s) | Jobs/min | Avg Node CPU % | Pending Pods |
|-----------|---------|---------|---------|----------|----------------|--------------|
| 1 | 22.5 | 29.3 | 29.9 | 1.0 | 0.5% | 0 |
| 5 | 22.5 | 29.3 | 29.9 | 15.5 | 3.0% | 0 |
| 10 | 22.5 | 29.3 | 29.9 | 28.5 | 2.8% | 0 |
| 20 | 25.4 | 54.6 | 58.9 | 62.5 | 7.0% | 0 |
| 40 | 26.6 | 56.1 | 59.2 | 115.5 | 14.2% | 0 |
| 80 | 42.1 | 58.2 | 59.6 | 242.0 | 30.0% | 0 |
| 120 | 45.3 | 59.1 | 192.4 | 264.3 | 44.2% | 16 |
| 160 | 144.4 | 284.4 | 296.9 | 291.0 | 38.5% | 0 |

Key Observations

1. Performance cliff shifted from 10 to 160 schedules

With the standard init image, p95 render time jumped from 58s to 286s at just 10 concurrent schedules. With the slim init image, the cluster maintained p95 under 60s all the way up to 120 schedules. The cliff only appeared at 160 schedules (p95=284s), representing a 16x increase in the load level before degradation.

2. p95 plateau at ~55-59s from 20 to 120 schedules

The slim init image shows a gentle, predictable rise in p95 from 29s (1-10 schedules) to 59s (80-120 schedules), rather than the immediate jump to ~288s seen with the standard image. This indicates the cluster is genuinely processing jobs without significant queue-induced latency.

3. Throughput increased 2-3x across the board

At every load level above 5 schedules, the slim image delivered higher throughput. The peak was 291 jobs/min at 160 schedules (slim) vs 147 jobs/min at 120 schedules (standard) — a 2x improvement in peak sustainable throughput.

4. Near-zero scheduling pressure up to 120 schedules

Pending pod count stayed at 0 through 80 schedules and only reached 16 at 120. The standard image showed 60 pending pods at 80 schedules and 91 at 120. This confirms the hypothesis: the standard init image's IO during pod startup was the primary bottleneck creating scheduling contention.

5. CPU utilization now scales proportionally with load

With the slim image, node CPU climbed steadily: 0.5% -> 3% -> 7% -> 14% -> 30% -> 44%. This is healthy — the cluster is actually doing useful work rather than being blocked on IO. The standard image showed flat, low CPU (under 10%) because most time was spent waiting on init containers.

6. The 160-schedule cliff is a real capacity limit

At 160 schedules, p95 jumped to 284s — similar to the standard image's plateau. This is likely the genuine scheduling capacity of a 4-node cluster with ScheduleConcurrency=75 (225 total slots). The slim image didn't eliminate this limit; it simply pushed the bottleneck from IO contention to actual CPU/scheduling capacity.

7. p99 spike at 120 schedules (192s)

While p95 was still healthy at 59s, the p99 jumped to 192s at 120 schedules. This suggests a small fraction of jobs were hitting longer queue wait times, serving as an early warning of the cliff at 160.

8. Some job errors appeared

The job stats export shows 89 jobs with errors out of 3,468 total (~2.6%). The standard test had zero failures across 4,390 jobs. This warrants investigation — the errors may be related to the slim image missing something the content needs, or could be transient scheduling issues at high load.
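For reference, the quoted error rate follows directly from the export counts (counts copied from the text above; the parquet schema itself is not shown here):

```python
# Error rates from the job stats exports (counts quoted in the report text).
slim_total, slim_failed = 3_468, 89
standard_total, standard_failed = 4_390, 0

slim_error_pct = round(100 * slim_failed / slim_total, 1)
print(f"slim: {slim_failed}/{slim_total} jobs failed ({slim_error_pct}%)")
print(f"standard: {standard_failed}/{standard_total} jobs failed")
```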

Cluster Health

All 3 Connect server pods remained Running and Ready (1/1) throughout the test. All 4 nodes stayed in Ready state. The monitoring stack remained healthy after the pre-test restart.

Data Files

| File | Description |
|------|-------------|
| metrics.csv | Per-step metrics (8 rows) for charting |
| job_stats.parquet | Full job history (3,468 jobs) from Connect API |
| job_stats_metadata.json | Metadata for the parquet file |
| lgtm-export.tar.gz | Prometheus TSDB data from the test window (97MB) |
| test_start_time.txt | UTC timestamp when the test began |

Baseline data is in reports/d2k-loadtest/ (original report and metrics-standard-init.csv backup).

Tools Used

Same as baseline test. The load test script (reports/d2k-loadtest/run_load_test.sh) was reused with OUTPUT_DIR=reports/d2k-loadtest-slim.
