
fix(k8s): correct manifests and documentation for kubernetes deployment#4347

Merged

NathanFlurry merged 2 commits into main from NathanFlurry/fix-k8s-docs on Mar 4, 2026

Conversation

@NathanFlurry
Member

Summary

Fixed critical issues with the Kubernetes manifests and documentation that prevented deployment. The engine was crashing due to an invalid `--except-services singleton` argument (`singleton` is not a real service name). Also removed unused NATS manifests, corrected resource sizing to match production values, and added production-grade reliability features.

Changes

Manifests:

- Fixed the broken singleton deployment
- Removed unused NATS manifests
- Reduced resource requests to 2000m CPU / 2Gi memory (requests) and 3000m / 4Gi (limits)
- Added a `preStop` lifecycle hook for graceful shutdown
- Added a PodDisruptionBudget
- Cleaned up datacenter labels, renamed "local" to "default", and renumbered files sequentially
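For reference, the resource sizing described here would look roughly like the following fragment of the engine Deployment's container spec. This is a sketch of the values from the PR description; the surrounding manifest layout is illustrative, not the exact file.

```yaml
# Sketch only: resource values from the PR description.
resources:
  requests:
    cpu: "2000m"
    memory: "2Gi"
  limits:
    cpu: "3000m"
    memory: "4Gi"
```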

Documentation:

- Updated all docs to match the working manifests
- Fixed the health check port from 6420 to 6421
- Fixed rivet.gg references to rivet.dev
- Added config update instructions

Testing

Tested the deployment in a k3d cluster: engine pods start and become ready, and health checks pass on port 6421.

🤖 Generated with Claude Code

@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4347 March 4, 2026 02:03 Destroyed
@railway-app

railway-app bot commented Mar 4, 2026

🚅 Deployed to the rivet-pr-4347 environment in rivet-frontend

| Service | Status | Updated (UTC) |
| --- | --- | --- |
| mcp-hub | 🕗 Deploying | Mar 4, 2026 at 3:23 am |
| website | 🕒 Building | Mar 4, 2026 at 3:22 am |
| frontend-inspector | 😴 Sleeping | Mar 4, 2026 at 2:15 am |
| frontend-cloud | ❌ Build Failed | Mar 4, 2026 at 2:06 am |
| ladle | ❌ Build Failed | Mar 4, 2026 at 2:03 am |

@claude

claude bot commented Mar 4, 2026

PR Review: fix(k8s): correct manifests and documentation for kubernetes deployment

This is a solid fix for the broken Kubernetes deployment. The critical crash fix and domain reference cleanup are well-executed. Below are my findings organized by severity.


Critical / Blockers

1. Singleton deployment removed without confirmation this is safe

The PR removes the separate rivet-engine-singleton Deployment entirely. Previously, the engine had two deployments:

  • Main: `--except-services singleton`
  • Singleton: `--services singleton --services api-peer`

Now there is only one deployment running `start` with no filter flags. If singleton services truly require exactly-one-replica semantics (e.g., schedulers with distributed locks), running them on 2+ replicas could cause data races or duplicate execution. The PR description says `--except-services singleton` was invalid because `singleton` is not a real service name, but that raises the question: how do singleton services get their exclusive scheduling guarantee now?

If the engine handles singleton coordination internally (e.g., via distributed leader election), this is fine, but it should be documented or confirmed in the PR.
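For context, the pre-PR two-deployment split could be sketched as below. The flags come from the review discussion above; the image name and the rest of the pod spec are assumptions for illustration only.

```yaml
# Illustrative only: the separate singleton Deployment this PR removes.
# The replica count, not the flags, is what enforced exactly-one semantics.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rivet-engine-singleton
spec:
  replicas: 1                    # exactly-one-replica guarantee
  selector:
    matchLabels:
      app: rivet-engine-singleton
  template:
    metadata:
      labels:
        app: rivet-engine-singleton
    spec:
      containers:
        - name: rivet-engine
          image: rivet/engine    # assumed image name
          args: ["start", "--services", "singleton", "--services", "api-peer"]
```

If the engine now does internal leader election instead, the single merged Deployment is safe and this split is unnecessary.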


Medium Issues

2. engine.sh does not create the rivet-secrets secret for local dev

The deployment manifest at 03-rivet-engine-deployment.yaml references a rivet-secrets secret for the admin token, but engine.sh never creates this secret. Engine pods will crash-loop with CreateContainerConfigError until it is manually created. The script should either create a dev secret automatically or print a clear prerequisite warning before deploying.
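A minimal dev-only secret that engine.sh could create might look like the sketch below. The key name `admin-token` and the namespace are assumptions; they must match whatever the secretKeyRef in 03-rivet-engine-deployment.yaml actually expects.

```yaml
# Hypothetical dev-only secret; key name and namespace are assumed and
# must match the deployment's secretKeyRef. Never use this value in prod.
apiVersion: v1
kind: Secret
metadata:
  name: rivet-secrets
  namespace: rivet          # assumed namespace
type: Opaque
stringData:
  admin-token: dev-only-token-change-me
```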

3. 2-node NATS cluster with maxUnavailable: 1 PDB risks losing quorum

The NATS cluster is reduced from 3 to 2 replicas (07-nats-statefulset.yaml), and 08-nats-pdb.yaml allows maxUnavailable: 1. With a 2-node cluster, losing 1 node during a drain/rolling update leaves a single node unable to form a majority. NATS JetStream requires n/2+1 nodes for quorum, so for 2 nodes, losing 1 makes the cluster unavailable.

Options:

  • Use 3 replicas with maxUnavailable: 1 (true HA), or
  • Set maxUnavailable: 0 on the PDB (blocks drains but prevents split-brain)

The README describes this as "2-node HA" which is slightly misleading.
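The second option (blocking voluntary evictions on the 2-node cluster) could be sketched as the PDB below; the selector label is an assumption. With 3 replicas, `minAvailable: 2` would be the equivalent true-HA variant.

```yaml
# Sketch: blocks voluntary evictions entirely for a 2-node cluster,
# trading drain convenience for quorum safety. Selector label assumed.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nats
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: nats             # assumed label
```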


Minor Issues

4. Ordering hazard in docs: kubectl apply -f . before secret creation

The setup guide correctly tells users to create the namespace and secret (step 4) before running kubectl apply -f . (step 5). However, if users skip step 4 or run the apply command too early, engine pods will fail silently. A brief note reinforcing that the secret must exist before deploying would help.

5. NATS readiness probe missing initialDelaySeconds

07-nats-statefulset.yaml has no initialDelaySeconds on the readinessProbe. NATS starts quickly so this is unlikely to cause issues, but it is a minor inconsistency with the startupProbe pattern used on the engine.
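If the delay were added for consistency, the probe fragment might look like this. The port shown is NATS's default HTTP monitoring port and /healthz is its standard health endpoint; the timing values are illustrative.

```yaml
# Sketch: readiness probe against the NATS monitoring endpoint,
# with an explicit initial delay to match the engine's probe pattern.
readinessProbe:
  httpGet:
    path: /healthz
    port: 8222              # NATS default monitoring port
  initialDelaySeconds: 5    # illustrative value
  periodSeconds: 10
```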

6. Namespace manifest stripped of all labels

The new 00-namespace.yaml has no labels at all. Removing the incorrect datacenter: YOUR_DATACENTER_KEY placeholder is good, but basic labels like app: rivet-engine for observability tooling would be worth keeping.


What Is Good

  • Critical bug fixed: Removing the nonexistent --except-services singleton argument was the right call.
  • YAML structure fix in 06-nats-configmap.yaml: The old file had kind: ConfigMap at the bottom. The corrected apiVersion/kind/metadata order at the top is cleaner and consistent.
  • imagePullPolicy: Never moved to the dev script: Good separation. Production manifests now reference a registry image; engine.sh patches it locally with kubectl patch.
  • Lifecycle preStop hook + terminationGracePeriodSeconds: 25: Clean graceful shutdown pattern (15s sleep + 10s buffer before SIGKILL).
  • PodDisruptionBudgets added: Both NATS and engine PDBs improve rolling update safety.
  • Admin token via Secret: Correctly uses secretKeyRef instead of hardcoding credentials in the ConfigMap.
  • rivet.gg → rivet.dev cleanup: Dockerfile labels, docs, and README all updated. The dashboard.rivet.dev → hub.rivet.dev fix in connect/kubernetes.mdx is correct per project conventions.
  • Simplified docs: New docs pointing to repo files via giget are far more maintainable than embedding full YAML blocks that can drift out of sync with manifests.
  • CLAUDE.md addition: The new "Kubernetes manifest changes" sync rule is a good preventative for future documentation drift.
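The graceful-shutdown pattern praised above can be sketched as the pod spec fragment below. The 15s/25s numbers come from the review; the container name is an assumption.

```yaml
# Sketch: preStop drains for 15s before SIGTERM is sent, leaving up to
# 10s (25 - 15) for the process to exit cleanly before SIGKILL.
spec:
  terminationGracePeriodSeconds: 25
  containers:
    - name: rivet-engine    # assumed name
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "15"]
```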

Summary

The PR fixes a real deployment-blocking bug and is well-structured overall. The main open questions are: (1) whether removing the singleton deployment is architecturally safe, (2) whether engine.sh needs to bootstrap the rivet-secrets secret for local dev workflows, and (3) whether the 2-node NATS quorum risk is acceptable. The documentation simplification is a clear improvement.

🤖 Generated with Claude Code

- Fixed broken singleton deployment concept that caused engine crash
  (`--except-services singleton` is not a valid service name)
- Removed unused NATS manifests (engine uses PostgreSQL NOTIFY for pub/sub)
- Reduced resource requests to match production (2000m/2Gi request, 3000m/4Gi limit)
- Added preStop lifecycle hook (15s graceful drain before SIGTERM)
- Added PodDisruptionBudget to prevent all pods evicting at once
- Cleaned up datacenter labels and renamed "local" to "default"
- Fixed rivet.gg references to rivet.dev in docs and Dockerfile
- Fixed health check endpoint (port 6421, not 6420)
- Renumbered manifests to be sequential (removed gap at 06)
- Updated all documentation to match working manifests
- Added config update instructions to README and docs

Tested deployment in k3d cluster - engine pods start and become ready.
@NathanFlurry NathanFlurry force-pushed the NathanFlurry/fix-k8s-docs branch from b09501a to 8136944 on March 4, 2026 02:05
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4347 March 4, 2026 02:05 Destroyed
@pkg-pr-new

pkg-pr-new bot commented Mar 4, 2026

More templates

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@4347

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@4347

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@4347

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@4347

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@4347

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@4347

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@4347

@rivetkit/sqlite-vfs

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sqlite-vfs@4347

@rivetkit/traces

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/traces@4347

@rivetkit/workflow-engine

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/workflow-engine@4347

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@4347

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@4347

commit: c93cfe7

…simplified docs

Add NATS cluster (2-node HA) to k8s deployment, add admin token
secret support, move manifests to self-host/k8s/, simplify kubernetes
docs to use giget for downloading manifests, and update engine.sh
for local dev.
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4347 March 4, 2026 03:22 Destroyed
@NathanFlurry NathanFlurry merged commit 710759a into main Mar 4, 2026
10 of 21 checks passed
@NathanFlurry NathanFlurry deleted the NathanFlurry/fix-k8s-docs branch March 4, 2026 03:23
