
fix(k8s): correct manifests and documentation for kubernetes deployment#4347

Merged

NathanFlurry merged 2 commits into main from NathanFlurry/fix-k8s-docs on Mar 4, 2026

Conversation

@NathanFlurry
Member

Summary

Fixed critical issues with the Kubernetes manifests and documentation that prevented deployment. The engine was crashing due to an invalid `--except-services singleton` argument (`singleton` is not a real service name). Also removed unused NATS manifests, corrected resource sizing to match production values, and added production-grade reliability features.

Changes

Manifests:

- Fixed the broken singleton deployment
- Removed unused NATS manifests
- Reduced resource requests to 2000m CPU / 2Gi memory (requests) and 3000m / 4Gi (limits)
- Added a `preStop` lifecycle hook for graceful shutdown
- Added a PodDisruptionBudget
- Cleaned up datacenter labels, renamed "local" to "default", and renumbered files sequentially
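For reference, the resource sizing described here would look roughly like the following fragment of the engine Deployment's container spec. This is a sketch of the values from the PR description; the surrounding manifest layout is illustrative, not the exact file.

```yaml
# Sketch only: resource values from the PR description.
resources:
  requests:
    cpu: "2000m"
    memory: "2Gi"
  limits:
    cpu: "3000m"
    memory: "4Gi"
```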

Documentation:

- Updated all docs to match the working manifests
- Fixed the health check port from 6420 to 6421
- Fixed rivet.gg references to rivet.dev
- Added config update instructions

Testing

Tested the deployment in a k3d cluster: engine pods start and become ready, and health checks pass on port 6421.

🤖 Generated with Claude Code

@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4347 March 4, 2026 02:03 Destroyed
@railway-app

railway-app bot commented Mar 4, 2026

🚅 Deployed to the rivet-pr-4347 environment in rivet-frontend

| Service | Status | Updated (UTC) |
| --- | --- | --- |
| mcp-hub | 🕗 Deploying | Mar 4, 2026 at 3:23 am |
| website | 🕒 Building | Mar 4, 2026 at 3:22 am |
| frontend-inspector | 😴 Sleeping | Mar 4, 2026 at 2:15 am |
| frontend-cloud | ❌ Build Failed | Mar 4, 2026 at 2:06 am |
| ladle | ❌ Build Failed | Mar 4, 2026 at 2:03 am |

@claude

claude bot commented Mar 4, 2026

PR Review: fix(k8s): correct manifests and documentation for kubernetes deployment

This is a solid fix for the broken Kubernetes deployment. The critical crash fix and domain reference cleanup are well-executed. Below are my findings organized by severity.


Critical / Blockers

1. Singleton deployment removed without confirmation this is safe

The PR removes the separate rivet-engine-singleton Deployment entirely. Previously, the engine had two deployments:

  • Main: `--except-services singleton`
  • Singleton: `--services singleton --services api-peer`

Now there is only one deployment running `start` with no filter flags. If singleton services truly require exactly-one-replica semantics (e.g., schedulers with distributed locks), running them on 2+ replicas could cause data races or duplicate execution. The PR description says `--except-services singleton` was invalid because `singleton` is not a real service name, but that raises the question: how do singleton services get their exclusive scheduling guarantee now?

If the engine handles singleton coordination internally (e.g., via distributed leader election), this is fine, but it should be documented or confirmed in the PR.
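For context, the pre-PR two-deployment split could be sketched as below. The flags come from the review discussion above; the image name and the rest of the pod spec are assumptions for illustration only.

```yaml
# Illustrative only: the separate singleton Deployment this PR removes.
# The replica count, not the flags, is what enforced exactly-one semantics.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rivet-engine-singleton
spec:
  replicas: 1                    # exactly-one-replica guarantee
  selector:
    matchLabels:
      app: rivet-engine-singleton
  template:
    metadata:
      labels:
        app: rivet-engine-singleton
    spec:
      containers:
        - name: rivet-engine
          image: rivet/engine    # assumed image name
          args: ["start", "--services", "singleton", "--services", "api-peer"]
```

If the engine now does internal leader election instead, the single merged Deployment is safe and this split is unnecessary.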


Medium Issues

2. engine.sh does not create the rivet-secrets secret for local dev

The deployment manifest at 03-rivet-engine-deployment.yaml references a rivet-secrets secret for the admin token, but engine.sh never creates this secret. Engine pods will crash-loop with CreateContainerConfigError until it is manually created. The script should either create a dev secret automatically or print a clear prerequisite warning before deploying.
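A minimal dev-only secret that engine.sh could create might look like the sketch below. The key name `admin-token` and the namespace are assumptions; they must match whatever the secretKeyRef in 03-rivet-engine-deployment.yaml actually expects.

```yaml
# Hypothetical dev-only secret; key name and namespace are assumed and
# must match the deployment's secretKeyRef. Never use this value in prod.
apiVersion: v1
kind: Secret
metadata:
  name: rivet-secrets
  namespace: rivet          # assumed namespace
type: Opaque
stringData:
  admin-token: dev-only-token-change-me
```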

3. 2-node NATS cluster with maxUnavailable: 1 PDB risks losing quorum

The NATS cluster is reduced from 3 to 2 replicas (07-nats-statefulset.yaml), and 08-nats-pdb.yaml allows maxUnavailable: 1. With a 2-node cluster, losing 1 node during a drain/rolling update leaves a single node unable to form a majority. NATS JetStream requires n/2+1 nodes for quorum, so for 2 nodes, losing 1 makes the cluster unavailable.

Options:

  • Use 3 replicas with maxUnavailable: 1 (true HA), or
  • Set maxUnavailable: 0 on the PDB (blocks drains but prevents split-brain)

The README describes this as "2-node HA" which is slightly misleading.
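The second option (blocking voluntary evictions on the 2-node cluster) could be sketched as the PDB below; the selector label is an assumption. With 3 replicas, `minAvailable: 2` would be the equivalent true-HA variant.

```yaml
# Sketch: blocks voluntary evictions entirely for a 2-node cluster,
# trading drain convenience for quorum safety. Selector label assumed.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nats
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: nats             # assumed label
```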


Minor Issues

4. Ordering hazard in docs: kubectl apply -f . before secret creation

The setup guide correctly tells users to create the namespace and secret (step 4) before running kubectl apply -f . (step 5). However, if users skip step 4 or run the apply command too early, engine pods will fail silently. A brief note reinforcing that the secret must exist before deploying would help.

5. NATS readiness probe missing initialDelaySeconds

07-nats-statefulset.yaml has no initialDelaySeconds on the readinessProbe. NATS starts quickly so this is unlikely to cause issues, but it is a minor inconsistency with the startupProbe pattern used on the engine.
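If the delay were added for consistency, the probe fragment might look like this. The port shown is NATS's default HTTP monitoring port and /healthz is its standard health endpoint; the timing values are illustrative.

```yaml
# Sketch: readiness probe against the NATS monitoring endpoint,
# with an explicit initial delay to match the engine's probe pattern.
readinessProbe:
  httpGet:
    path: /healthz
    port: 8222              # NATS default monitoring port
  initialDelaySeconds: 5    # illustrative value
  periodSeconds: 10
```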

6. Namespace manifest stripped of all labels

The new 00-namespace.yaml has no labels at all. Removing the incorrect datacenter: YOUR_DATACENTER_KEY placeholder is good, but basic labels like app: rivet-engine for observability tooling would be worth keeping.


What Is Good

  • Critical bug fixed: Removing the nonexistent --except-services singleton argument was the right call.
  • YAML structure fix in 06-nats-configmap.yaml: The old file had kind: ConfigMap at the bottom. The corrected apiVersion/kind/metadata order at the top is cleaner and consistent.
  • imagePullPolicy: Never moved to the dev script: Good separation. Production manifests now reference a registry image; engine.sh patches it locally with kubectl patch.
  • Lifecycle preStop hook + terminationGracePeriodSeconds: 25: Clean graceful shutdown pattern (15s sleep + 10s buffer before SIGKILL).
  • PodDisruptionBudgets added: Both NATS and engine PDBs improve rolling update safety.
  • Admin token via Secret: Correctly uses secretKeyRef instead of hardcoding credentials in the ConfigMap.
  • rivet.gg → rivet.dev cleanup: Dockerfile labels, docs, and README all updated. The dashboard.rivet.dev → hub.rivet.dev fix in connect/kubernetes.mdx is correct per project conventions.
  • Simplified docs: New docs pointing to repo files via giget are far more maintainable than embedding full YAML blocks that can drift out of sync with manifests.
  • CLAUDE.md addition: The new "Kubernetes manifest changes" sync rule is a good preventative for future documentation drift.
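The graceful-shutdown pattern praised above can be sketched as the pod spec fragment below. The 15s/25s numbers come from the review; the container name is an assumption.

```yaml
# Sketch: preStop drains for 15s before SIGTERM is sent, leaving up to
# 10s (25 - 15) for the process to exit cleanly before SIGKILL.
spec:
  terminationGracePeriodSeconds: 25
  containers:
    - name: rivet-engine    # assumed name
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "15"]
```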

Summary

The PR fixes a real deployment-blocking bug and is well-structured overall. The main open questions are: (1) whether removing the singleton deployment is architecturally safe, (2) whether engine.sh needs to bootstrap the rivet-secrets secret for local dev workflows, and (3) whether the 2-node NATS quorum risk is acceptable. The documentation simplification is a clear improvement.

🤖 Generated with Claude Code

- Fixed broken singleton deployment concept that caused engine crash
  (`--except-services singleton` is not a valid service name)
- Removed unused NATS manifests (engine uses PostgreSQL NOTIFY for pub/sub)
- Reduced resource requests to match production (2000m/2Gi request, 3000m/4Gi limit)
- Added preStop lifecycle hook (15s graceful drain before SIGTERM)
- Added PodDisruptionBudget to prevent all pods evicting at once
- Cleaned up datacenter labels and renamed "local" to "default"
- Fixed rivet.gg references to rivet.dev in docs and Dockerfile
- Fixed health check endpoint (port 6421, not 6420)
- Renumbered manifests to be sequential (removed gap at 06)
- Updated all documentation to match working manifests
- Added config update instructions to README and docs

Tested deployment in k3d cluster - engine pods start and become ready.
@NathanFlurry NathanFlurry force-pushed the NathanFlurry/fix-k8s-docs branch from b09501a to 8136944 on March 4, 2026 02:05
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4347 March 4, 2026 02:05 Destroyed
@pkg-pr-new

pkg-pr-new bot commented Mar 4, 2026

More templates

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@4347

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@4347

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@4347

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@4347

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@4347

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@4347

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@4347

@rivetkit/sqlite-vfs

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sqlite-vfs@4347

@rivetkit/traces

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/traces@4347

@rivetkit/workflow-engine

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/workflow-engine@4347

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@4347

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@4347

commit: c93cfe7

…simplified docs

Add NATS cluster (2-node HA) to k8s deployment, add admin token
secret support, move manifests to self-host/k8s/, simplify kubernetes
docs to use giget for downloading manifests, and update engine.sh
for local dev.
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4347 March 4, 2026 03:22 Destroyed
@NathanFlurry NathanFlurry merged commit 710759a into main Mar 4, 2026
10 of 21 checks passed
@NathanFlurry NathanFlurry deleted the NathanFlurry/fix-k8s-docs branch March 4, 2026 03:23
