Skip to content

fix(pegboard): align drain grace defaults with stop threshold#4958

Merged
NathanFlurry merged 1 commit intomainfrom
runner-config/drain-grace-threshold
May 5, 2026
Merged

fix(pegboard): align drain grace defaults with stop threshold#4958
NathanFlurry merged 1 commit intomainfrom
runner-config/drain-grace-threshold

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

@NathanFlurry NathanFlurry commented May 4, 2026

Stack Context

This PR is stacked on the actor shutdown work in abort-signal/finalize-once.

What?

  • Reject serverless runner configs when drain_grace_period exceeds actor_stop_threshold.
  • Default serverless runner-config drain grace and actor stop threshold to 30 minutes.
  • Default serverless request lifespan to 60 minutes so the 30 minute drain window does not begin immediately.
  • Update docs for the new shutdown and Kubernetes grace-period defaults.

Why?

The engine can mark an actor lost if its configured drain grace exceeds the stop threshold. Making the default values consistent prevents default runner configs from entering an invalid or unsafe timing setup.

Testing

  • cargo test -p rivet-engine --test mod runner::api_runner_configs_upsert -- --nocapture
  • pnpm --dir rivetkit-typescript/packages/rivetkit run check-types

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 4, 2026

PR Review: fix(pegboard): align drain grace defaults with stop threshold

Summary

This PR raises the actor_stop_threshold default from 30 seconds to 30 minutes, aligns the serverless runner-config drain_grace_period default from 10 seconds to 30 minutes, raises the default request_lifespan from 15 minutes to 60 minutes, adds a validation that rejects configs where drain_grace_period > actor_stop_threshold, adds a corresponding test, and updates documentation. The change is logically well-motivated.


Issues Found

1. Test added to deprecated runner/ directory

The new test upsert_runner_config_serverless_drain_grace_period_exceeds_actor_stop_threshold was added to:

engine/packages/engine/tests/runner/api_runner_configs_upsert.rs

Per engine/CLAUDE.md:

Write new actor-hosting engine tests under engine/packages/engine/tests/envoy/; do not add new legacy runner tests under engine/packages/engine/tests/runner/.

The runner/ test directory covers the deprecated runner path. If the validation (pegboard_runner_config_upsert) is shared infrastructure, the new test should live under the envoy/ test tree.

2. Stale comment on drain_grace_period default (two files)

In api-types/src/namespaces/runner_configs.rs and engine/sdks/rust/data/src/versioned/namespace_runner_config.rs, the comment reads:

// Default to deprecated config value (config.pegboard.serverless_drain_grace_period)
drain_grace_period: drain_grace_period.unwrap_or(30 * 60),

The deprecated config value defaulted to 10 seconds. The new 30 * 60 (1,800 seconds) is the new default, not the deprecated one. The comment is now factually incorrect and will mislead future readers.

3. Missing validation: drain_grace_period should be < request_lifespan

The new validation rejects drain_grace_period > actor_stop_threshold, which is correct. However, there is no guard that drain_grace_period < request_lifespan. In pegboard-outbound/src/lib.rs:

let sleep_until_drain =
    Duration::from_secs(request_lifespan.saturating_sub(drain_grace_period) as u64);

u32::saturating_sub silently clamps to zero when drain_grace_period >= request_lifespan, meaning drain would begin immediately at second 0. A user who sets drain_grace_period = 3600 and request_lifespan = 30 would pass validation today and get a silently broken config. A complementary check drain_grace_period < request_lifespan in the same upsert.rs block would close this.

4. railway.mdx graceful-shutdown guidance is now inconsistent

The Railway self-hosting doc still recommends a drain timeout of 60 seconds. With actor_stop_threshold now 30 minutes, 60 seconds is severely inadequate. This should be updated or cross-referenced with the updated Kubernetes recommendation of 1800 seconds.


Minor Notes

Magic number duplication. The value 30 * 60 appears in three separate files without a shared named constant:

  • engine/packages/api-types/src/namespaces/runner_configs.rs:71
  • engine/sdks/rust/data/src/versioned/namespace_runner_config.rs:312
  • engine/packages/config/src/config/pegboard.rs:176 (as 30 * 60 * 1000)

A named constant (e.g., DEFAULT_DRAIN_GRACE_PERIOD_SECS: u32 = 30 * 60) would make future changes easier and prevent these from drifting apart again.

terminationGracePeriodSeconds buffer. The old Kubernetes comment noted a buffer over the stop timeout. The new comment says the runner waits up to 30 minutes, with terminationGracePeriodSeconds: 1800 and no buffer. Consider bumping to 1830 or noting that a small buffer is advisable.


Test Coverage

The new test correctly exercises the rejection path. Consider adding:

  • A boundary-equality test (drain_grace_period == actor_stop_threshold should succeed, since the check is > not >=).
  • A happy-path test confirming omitted drain_grace_period stores the new 30-minute default.

Summary

Category Finding Severity
Convention New test placed in deprecated runner/ directory, not envoy/ Medium
Correctness Misleading comment on drain_grace_period default in two files Low
Logic No validation that drain_grace_period < request_lifespan Medium
Docs railway.mdx still recommends 60-second drain window (too short now) Low
Style Magic 30 * 60 duplicated across three files without a named constant Minor

Copy link
Copy Markdown
Member Author

NathanFlurry commented May 4, 2026

This stack of pull requests is managed by Graphite. Learn more about stacking.

@NathanFlurry NathanFlurry force-pushed the runner-config/drain-grace-threshold branch from f2aaf93 to 89798df Compare May 4, 2026 16:11
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4958 to abort-signal/finalize-once May 4, 2026 16:11
@NathanFlurry NathanFlurry changed the base branch from abort-signal/finalize-once to graphite-base/4958 May 4, 2026 16:13
@NathanFlurry NathanFlurry force-pushed the runner-config/drain-grace-threshold branch from 89798df to e1be3d6 Compare May 4, 2026 16:13
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4958 to bridge-errors/preserve-message May 4, 2026 16:13
Copy link
Copy Markdown
Contributor

@MasterPtato MasterPtato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't 60 minute actor stop a temporary fix? It should not be so high

@NathanFlurry NathanFlurry force-pushed the runner-config/drain-grace-threshold branch from f8261ca to 4bec251 Compare May 5, 2026 12:00
Base automatically changed from bridge-errors/preserve-message to main May 5, 2026 14:58
@NathanFlurry NathanFlurry merged commit 4bec251 into main May 5, 2026
6 of 13 checks passed
@NathanFlurry NathanFlurry deleted the runner-config/drain-grace-threshold branch May 5, 2026 14:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants