Eliminate cloud instance cold-start delays with pre-warmed, instantly-ready nodes.
Documentation • Issues • Quick Start
Stratos is a Kubernetes operator that eliminates cloud instance cold-start delays by maintaining pools of pre-warmed, stopped instances ready to start in seconds. Instead of waiting 3-5 minutes for new nodes to provision, boot, and initialize, Stratos enables sub-minute scale-up times by keeping instances in a "warm standby" state.
Spinning up a new cloud instance typically takes 3-5 minutes:
- Instance provisioning - Cloud provider allocates resources
- OS boot - Operating system initialization
- Kubernetes join - Node registers with the cluster
- CNI setup - Network plugin initialization
- Application initialization - User data scripts, image pulls
For time-sensitive workloads like CI/CD pipelines, autoscaling events, or burst traffic handling, this delay is unacceptable.
Stratos maintains a pool of pre-warmed, stopped instances using a four-phase lifecycle:
```
warmup --> standby --> running --> stopping
              ^                        |
              |________________________|
```
- Warmup - Stratos launches instances that run initialization scripts (join cluster, pull images, configure networking) and self-stop when ready
- Standby - Stopped instances wait in the pool, costing only storage (no compute charges)
- Running - When pods are pending, Stratos instantly starts standby nodes (seconds, not minutes)
- Stopping - Empty nodes are drained and returned to standby for reuse
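As a rough sketch of that cycle, the Go snippet below encodes the allowed phase transitions as a lookup table. The phase names follow the list above, but the types and package layout are illustrative assumptions, not Stratos's actual internals.

```go
package main

import "fmt"

// Phase names mirror the four lifecycle stages described above.
// These types are illustrative only; they are not Stratos's real API.
type Phase string

const (
	Warmup   Phase = "warmup"
	Standby  Phase = "standby"
	Running  Phase = "running"
	Stopping Phase = "stopping"
)

// transitions captures the cycle: warmup feeds the standby pool once, then
// nodes loop standby -> running -> stopping -> standby as they are reused.
var transitions = map[Phase][]Phase{
	Warmup:   {Standby},
	Standby:  {Running},
	Running:  {Stopping},
	Stopping: {Standby},
}

func main() {
	for from, to := range transitions {
		fmt.Printf("%s -> %v\n", from, to)
	}
}
```

The warmup phase runs once per instance; after that, nodes cycle standby -> running -> stopping -> standby without being re-provisioned.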
- Sub-minute scale-up - Start pre-warmed nodes in seconds instead of minutes
- Cost efficient - Stopped instances only incur storage costs, not compute
- Kubernetes native - Declarative NodePool and NodeClass CRDs, integrates with existing clusters
- CNI-aware - Properly handles startup taints for VPC CNI, Cilium, Calico
- Automatic maintenance - Pool replenishment, node recycling, state synchronization
- Observable - Prometheus metrics for all operations
- Kubernetes cluster (1.26+)
- Helm 3.x
- AWS credentials configured (for EC2 operations)
```bash
helm install stratos oci://ghcr.io/stratos-sh/charts/stratos \
  --namespace stratos-system --create-namespace \
  --set clusterName=my-cluster
```

AWSNodeClass (cloud-specific configuration):
```yaml
apiVersion: stratos.sh/v1alpha1
kind: AWSNodeClass
metadata:
  name: workers
spec:
  region: us-east-1
  instanceType: m5.large
  ami: ami-0123456789abcdef0
  subnetIds: ["subnet-12345678"]
  securityGroupIds: ["sg-12345678"]
  iamInstanceProfile: arn:aws:iam::123456789:instance-profile/node-role
  userData: |
    #!/bin/bash
    /etc/eks/bootstrap.sh my-cluster \
      --kubelet-extra-args '--register-with-taints=node.eks.amazonaws.com/not-ready=true:NoSchedule'
    until curl -sf http://localhost:10248/healthz; do sleep 5; done
    sleep 30
    poweroff
```

NodePool (references the AWSNodeClass):
```yaml
apiVersion: stratos.sh/v1alpha1
kind: NodePool
metadata:
  name: workers
spec:
  poolSize: 10
  minStandby: 3
  template:
    nodeClassRef:
      kind: AWSNodeClass
      name: workers
    labels:
      stratos.sh/pool: workers
    startupTaints:
      - key: node.eks.amazonaws.com/not-ready
        value: "true"
        effect: NoSchedule
```

Verify:
```bash
kubectl get awsnodeclasses,nodepools
```

```
+------------------+     +-------------------+     +------------------+
|   NodePool CRD   | --> | Stratos Controller| --> |  Cloud Provider  |
|  (Desired State) |     |   (Reconciler)    |     |    (AWS EC2)     |
+------------------+     +-------------------+     +------------------+
         |                         |
         v                         v
+------------------+        +-------------+
| AWSNodeClass CRD |        |  K8s Nodes  |
|  (Cloud Config)  |        |  (Managed)  |
+------------------+        +-------------+
```
NodePools reference a cloud-specific NodeClass (e.g., AWSNodeClass) that contains instance configuration. This separation allows multiple NodePools to share the same cloud configuration.
The controller watches for:
- NodePool changes - Create/update/delete pools
- NodeClass changes - Cloud configuration updates (e.g., AWSNodeClass)
- Pending pods - Trigger scale-up when pods can't be scheduled (see the sketch after this list)
- Node state changes - Track node lifecycle and health
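As a simplified illustration of the pending-pod trigger, the client-go sketch below lists pods stuck in Pending and keeps those the scheduler marked Unschedulable. It is a stand-in for the controller's watch-based reconciliation, not Stratos's actual code.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// unschedulablePods returns pending pods the scheduler could not place --
// the condition that would prompt Stratos to start a standby node.
func unschedulablePods(ctx context.Context, cs kubernetes.Interface) ([]corev1.Pod, error) {
	pending, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "status.phase=Pending",
	})
	if err != nil {
		return nil, err
	}
	var out []corev1.Pod
	for _, pod := range pending.Items {
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodScheduled &&
				cond.Status == corev1.ConditionFalse &&
				cond.Reason == "Unschedulable" {
				out = append(out, pod)
			}
		}
	}
	return out, nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	pods, err := unschedulablePods(context.Background(), cs)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d unschedulable pod(s) would trigger a scale-up\n", len(pods))
}
```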
Traditional autoscalers don't just make you wait for a node to boot — they give you a completely cold environment. Every pipeline run pulls all DaemonSet images from scratch, then pulls the CI agent image, and every docker build or npm install starts with an empty cache. Stratos nodes come pre-warmed with all DaemonSet images already pulled, and since nodes are reused (stopped and restarted rather than terminated), build caches, Docker layer caches, and package manager caches persist across runs. Your second pipeline run is dramatically faster than the first.
Large model images (often 10-50GB+) make cold starts painfully slow. Downloading a model, loading it into GPU memory, and running health checks can take 10+ minutes before the first request is served. With Stratos, the model image is pre-pulled during the warmup phase and persists on the node's EBS volume. When demand spikes, a standby node starts in seconds with the model image already on disk — cutting startup time from minutes to seconds.
Stratos's ~20-second pending-to-running time (when properly configured) makes true scale-to-zero viable for latency-sensitive services. Pair a simple ingress doorman with a 30-second timeout: when a request arrives at a scaled-down service, the doorman holds the connection while Stratos starts a standby node, and the request completes within the timeout window. No idle compute costs, no cold-start frustration.
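Stratos does not ship a doorman, but one can be very small. The Go sketch below is a hypothetical example: a reverse proxy that, on each request, waits up to 30 seconds for an assumed backend address to accept connections before forwarding, covering the window in which Stratos starts a standby node.

```go
package main

import (
	"log"
	"net"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

const (
	backendAddr = "my-service.default.svc.cluster.local:80" // assumed backend; adjust for your Service
	waitBudget  = 30 * time.Second                          // matches the timeout discussed above
)

// waitForBackend polls the backend's TCP port until it accepts connections
// or the budget is exhausted.
func waitForBackend(addr string, budget time.Duration) bool {
	deadline := time.Now().Add(budget)
	for time.Now().Before(deadline) {
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err == nil {
			conn.Close()
			return true
		}
		time.Sleep(time.Second)
	}
	return false
}

func main() {
	target, err := url.Parse("http://" + backendAddr)
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Hold the connection while the backend (and its node) comes up.
		if !waitForBackend(backendAddr, waitBudget) {
			http.Error(w, "backend did not become ready in time", http.StatusGatewayTimeout)
			return
		}
		proxy.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```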
Full documentation is available at stratos-sh.github.io/stratos
- Getting Started - Installation and quickstart
- Concepts - Architecture and node lifecycle
- Guides - AWS setup, scaling policies, monitoring
- API Reference - NodePool and AWSNodeClass CRDs
```bash
cd docs
npm install
npm start
```

The documentation site will be available at http://localhost:3000/stratos/.
```bash
make build
```

```bash
# With fake cloud provider (for testing)
go run ./cmd/stratos/main.go --cluster-name=main --cloud-provider=fake

# With AWS
go run ./cmd/stratos/main.go --cluster-name=main --cloud-provider=aws
```

```bash
# Unit tests
make test

# Integration tests (requires envtest setup)
make test-integration

# Coverage report
make coverage
```

Stratos is currently in alpha development. The API may change between versions.
Apache License 2.0
