DO NOT MERGE: serverless restart race condition #4997

Draft
NathanFlurry wants to merge 1 commit into main from 05-07-do_not_merge_serverless_restart_race_condition
Conversation

@NathanFlurry
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app

railway-app Bot commented May 7, 2026

🚅 Deployed to the rivet-pr-4997 environment in rivet-frontend

Service | Status | Web | Updated (UTC)
website | 😴 Sleeping | Web | May 9, 2026 at 1:16 am
mcp-hub | ✅ Success | Web | May 7, 2026 at 9:42 am
frontend-inspector | ❌ Build Failed | Web | May 7, 2026 at 9:41 am
frontend-cloud | ❌ Build Failed | Web | May 7, 2026 at 9:41 am
ladle | ❌ Build Failed | Web | May 7, 2026 at 9:40 am
kitchen-sink | ❌ Build Failed | Web | May 7, 2026 at 9:40 am

Member Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Contributor

claude Bot commented May 7, 2026

Code Review

Note: This is a DRAFT / DO NOT MERGE PR — the review is scoped to the diagnostic tooling, not an assessment of whether this should land.


Overview

This PR adds a standalone investigation harness for a serverless engine-restart race condition. It consists of:

  • engine-restart-serverless.md — thorough reproduction notes documenting the ~2.4–3.25 s post-restart window where existing-key gateway traffic hangs
  • engine-restart-serverless.ts — the orchestrator script (spawns engine + runtime, drives scenarios, sweeps delays)
  • engine-restart-serverless-runtime.ts — the serverless actor fixture (SQLite counter + heartbeat + gateway endpoints)

The investigation is solid and the finding is clear: a successful engine /health response does not mean that gateway routing has settled for previously warmed actor keys. New keys succeed immediately, while existing keys hang during the stale-routing window. SQLite is not the culprit.


Issues

High: Inconsistent indentation in main()

The main() function body has badly mixed indentation — some blocks at 2-space, some at 4-space, some at what looks like a tab width. Example:

async function main() {
    const namespace = ...
    const poolName = ...
        const actorKey = ...     // ← extra indent
        const engine = ...       // ← extra indent
        let runtime = ...        // ← extra indent

And later inside the try block:

        client = createClient({
            ...
            });                  // ← closing brace at wrong level

                const actorHandle = ...  // ← 4 more spaces than expected

This is likely a consequence of editing/rearranging code and is worth fixing before the harness is shared with others — it's hard to read.

Same problem in ServerlessRuntime.startProcess:

                    RIVETKIT_TEST_ENDPOINT: input.endpoint,
                    RIVETKIT_HEARTBEAT_MODE: HEARTBEAT_MODE,
                    RIVETKIT_TEST_HOST: HOST,
                RIVETKIT_TEST_POOL_NAME: input.poolName,   // ← outdented 4 spaces

And in getHeartbeatStats:

            case "heartbeat_sql_err":    // ← 4 extra spaces vs surrounding cases
                stats.sqlErr += 1;

Medium: withTimeout leaks the underlying promise on timeout

function withTimeout<T>(promise: Promise<T>, timeoutMs: number, message: string): Promise<T> {
    return new Promise<T>((resolve, reject) => {
        const timeout = setTimeout(() => {
            reject(new Error(message));  // ← original promise still runs
        }, timeoutMs);
        ...
    });
}

When the timeout fires, the original promise is not cancelled — it continues running to completion (or forever) in the background. For single-shot usage this is fine, but runPostRestartProbes runs four probeActor calls with Promise.allSettled, each wrapping a withTimeout. If probes time out, the underlying actor handles remain open and in-flight, which could interfere with later assertions (especially the gateway sweep runs). Consider threading an AbortController through client calls or accepting this as a known limitation of the diagnostic harness.
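One way to thread cancellation through, sketched under assumptions: the caller passes a factory that receives an AbortSignal, and the wrapper aborts it when the timer fires. The name withAbortableTimeout is hypothetical, and whether the client calls used by probeActor accept an AbortSignal is not shown in this PR, so the sketch only helps where the wrapped work honors the signal.

```typescript
// Hypothetical variant of withTimeout. Instead of a ready-made promise it takes
// a factory, so the work can observe an AbortSignal and stop when time runs out.
async function withAbortableTimeout<T>(
    run: (signal: AbortSignal) => Promise<T>,
    timeoutMs: number,
    message: string,
): Promise<T> {
    const controller = new AbortController();
    let timer: ReturnType<typeof setTimeout> | undefined;

    // Rejects after timeoutMs and tells the in-flight work to stop.
    const timedOut = new Promise<never>((_, reject) => {
        timer = setTimeout(() => {
            controller.abort();
            reject(new Error(message));
        }, timeoutMs);
    });

    try {
        // Whichever settles first wins; the loser is either aborted or ignored.
        return await Promise.race([run(controller.signal), timedOut]);
    } finally {
        // Always clear the timer so a fast result leaves no pending timeout behind.
        clearTimeout(timer);
    }
}
```

Work that ignores the signal still runs to completion in the background, so this narrows the leak rather than eliminating it.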


Medium: getGatewayUrl() is an undocumented test-only method

const url = buildActorRequestUrl(await handle.getGatewayUrl(), "health");

getGatewayUrl doesn't appear in the public client type. If it's a test-only surface, it should either be documented or the harness should construct the gateway URL from engine config + actor key directly so the method doesn't need to be test-exported.


Low: ensureTables called on every heartbeat tick

runHeartbeatSql calls ensureTables on every invocation (fires every second). CREATE TABLE IF NOT EXISTS is cheap but adds a round-trip per tick to an already-instrumented path. Fine for a diagnostic script, but worth noting if the timing data needs to be tight.
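If the extra round-trip matters for timing, the table setup could be memoized so it runs once per SQL handle. The following is a minimal sketch: SqlRunner, once, and the table schema are placeholders, since the real ctx.sql type and the fixture's statements are not shown in this review.

```typescript
// Hypothetical minimal surface standing in for ctx.sql.
interface SqlRunner {
    execute(statement: string): Promise<unknown>;
}

// Wraps a setup routine so it runs at most once per runner, even if ticks overlap.
function once(
    setup: (sql: SqlRunner) => Promise<void>,
): (sql: SqlRunner) => Promise<void> {
    const done = new WeakMap<SqlRunner, Promise<void>>();
    return (sql) => {
        let pending = done.get(sql);
        if (!pending) {
            pending = setup(sql).catch((error) => {
                done.delete(sql); // allow a retry on the next tick after a failure
                throw error;
            });
            done.set(sql, pending);
        }
        return pending;
    };
}

const ensureTablesOnce = once(async (sql) => {
    // Placeholder schema, not the fixture's actual table.
    await sql.execute("CREATE TABLE IF NOT EXISTS heartbeat (ts INTEGER NOT NULL)");
});

async function runHeartbeatSql(sql: SqlRunner): Promise<void> {
    await ensureTablesOnce(sql); // no round-trip after the first successful call
    await sql.execute("INSERT INTO heartbeat (ts) VALUES (1)");
}
```

Keying the cache on the runner object means a handle recreated after a restart triggers setup again, which is probably the desired behavior for this harness.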


Low: getPort race window

Ports are reserved with getPort then passed to child processes. There is an inherent window between getPort returning a port and the child process binding it where another process could grab it. This is a known limitation of this pattern and tolerable in test infrastructure, but worth a comment so future readers understand why the harness occasionally fails with "address in use."
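Beyond a comment, the harness could retry with a fresh port when the race is lost. A rough sketch follows; startWithPortRetry and isAddressInUse are hypothetical names, and it assumes the start callback rejects with a Node-style error whose code is "EADDRINUSE". For a child process, that would require translating the child's bind failure into such an error.

```typescript
// Node-style system errors carry a string `code`; EADDRINUSE signals a lost port race.
function isAddressInUse(error: unknown): boolean {
    return (
        typeof error === "object" &&
        error !== null &&
        (error as { code?: unknown }).code === "EADDRINUSE"
    );
}

// Picks a port, tries to start, and retries with a new port only on EADDRINUSE.
async function startWithPortRetry<T>(
    pickPort: () => Promise<number>,
    start: (port: number) => Promise<T>,
    maxAttempts = 3,
): Promise<T> {
    let lastError: unknown;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const port = await pickPort();
        try {
            return await start(port);
        } catch (error) {
            if (!isAddressInUse(error)) throw error; // unrelated failures surface immediately
            lastError = error;
        }
    }
    throw lastError;
}
```

Here pickPort would wrap the existing getPort call and start would spawn the engine or runtime on that port.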


Low: rawSqlDatabaseProvider stub in fixture

const rawSqlDatabaseProvider = {
    createClient: async () => ({
        execute: async () => [],
        close: async () => {},
    }),
    onMigrate: async () => {},
};

This is supplied to db: but the actor uses ctx.sql for actual SQLite access. The stub is intentional (it satisfies the type while the real SQLite goes through ctx.sql), but it isn't obvious why. A one-line comment would help.


Positive notes

  • The .md investigation doc is excellent — sweep tables, narrowing runs, control cases, and the key conclusion are all clearly laid out. This is the right way to document a race-window investigation.
  • OwnedEngine / ServerlessRuntime class design cleanly separates lifecycle (start, stopProcess, cleanup) and makes the test readable.
  • stopChildProcess with SIGKILL fallback timeout is the correct pattern.
  • The heartbeat stats parser correctly filters by timestamp (event.ts < sinceTs) so "during restart" and "after health" windows can be distinguished independently.
  • parseDelayList correctly validates and sorts the env-var delay list.
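The timestamp filter praised above can be illustrated with a small sketch. It is not the harness's actual parser: the event shape, the millisecond unit, and the "heartbeat_sql_ok" kind with its sqlOk counter are assumptions, while "heartbeat_sql_err", stats.sqlErr, and the event.ts < sinceTs check come from the review.

```typescript
interface HeartbeatEvent {
    ts: number; // assumed to be epoch milliseconds
    kind: string; // e.g. "heartbeat_sql_err"
}

interface HeartbeatStats {
    sqlOk: number; // hypothetical counter for successful ticks
    sqlErr: number;
}

// Counts only events at or after sinceTs, so each window is measured independently.
function getHeartbeatStats(events: HeartbeatEvent[], sinceTs: number): HeartbeatStats {
    const stats: HeartbeatStats = { sqlOk: 0, sqlErr: 0 };
    for (const event of events) {
        if (event.ts < sinceTs) continue; // belongs to an earlier window
        switch (event.kind) {
            case "heartbeat_sql_ok": // hypothetical event kind
                stats.sqlOk += 1;
                break;
            case "heartbeat_sql_err":
                stats.sqlErr += 1;
                break;
        }
    }
    return stats;
}
```

Calling it once with the restart timestamp and once with the timestamp of the first healthy /health response gives separate "during restart" and "after health" counts from the same event log.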

Summary

The investigation is thorough and the finding is solid. Before this harness gets referenced by others or turned into a real regression test, the indentation in main() and startProcess should be fixed (it's genuinely confusing), and the withTimeout leak behavior should at minimum be documented. Everything else is low-priority for a diagnostic script.
