Skip to content

fix: driver test suite & sqlite reliability#4559

Open
NathanFlurry wants to merge 6 commits into04-06-fix_remove_serverless_tokenfrom
04-05-wip_fix_whatever_is_going_on_with_envoys
Open

fix: driver test suite & sqlite reliability#4559
NathanFlurry wants to merge 6 commits into04-06-fix_remove_serverless_tokenfrom
04-05-wip_fix_whatever_is_going_on_with_envoys

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app bot commented Apr 5, 2026

🚅 Deployed to the rivet-pr-4559 environment in rivet-frontend

Service Status Web Updated (UTC)
kitchen-sink ❌ Build Failed (View Logs) Web Apr 7, 2026 at 5:08 pm
frontend-cloud ❌ Build Failed (View Logs) Web Apr 7, 2026 at 1:35 am
frontend-inspector ❌ Build Failed (View Logs) Web Apr 7, 2026 at 1:35 am
mcp-hub ✅ Success (View Logs) Web Apr 5, 2026 at 11:57 am
website ❌ Build Failed (View Logs) Web Apr 5, 2026 at 11:57 am
ladle ❌ Build Failed (View Logs) Web Apr 5, 2026 at 11:57 am

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 5, 2026

@claude
Copy link
Copy Markdown

claude bot commented Apr 5, 2026

PR Review: fix: driver test suite & sqlite reliability

Overall Assessment: Solid multi-fix PR addressing real bugs across the envoy client, KV channel, native SQLite adapter, WebSocket handling, and AgentOS process callbacks. The fixes are well-targeted and the test infrastructure improvements are valuable. A few issues warrant attention before landing.


High Priority Issues

Bug: NativeSqliteDatabase.close() error handling is incorrect

The close() method catches an error, calls #recoverFromStaleKvChannel, but then re-checks isStaleKvChannelError on the original error variable -- which was already checked inside the recovery method. If recovery throws, the original error is lost. On close, recovery is also questionable since the database is being torn down anyway. The fix should swallow stale-channel errors without recovery:

async close(): Promise<void> {
    try {
        await this.#module.closeDatabase(this.#nativeDb);
    } catch (error) {
        if (!isStaleKvChannelError(error)) {
            throw error;
        }
        // Stale channel on close is expected; swallow and continue teardown.
    }
}

Bug: open_actors still uses Arc<Mutex<HashSet>>

CLAUDE.md says to prefer scc::HashMap over Arc<Mutex<HashMap<...>>>. The open_actors set in engine/packages/pegboard-kv-channel/src/lib.rs is Arc<Mutex<HashSet<String>>>, locked on every incoming message per connection. This should be replaced with Arc<scc::HashSet<String>>.

Design concern: single-writer lock removal changes the documented concurrency contract

CLAUDE.md states: "The KV channel enforces single-writer locks per actor." The removal of KvChannelState.actor_locks eliminates this guarantee. Either update CLAUDE.md to document where the invariant now lives (e.g., at the SQLite VFS layer) or restore enforcement at a different layer. The current state leaves the documented guarantee and the implementation out of sync.


Medium Priority Issues

Duplicate named-parameter extraction with a behavioral difference

extractNamedSqliteParameters and the binding-resolution helpers exist independently in both native-adapter.ts and native-sqlite.ts. The two implementations differ: native-sqlite.ts deduplicates repeated parameter names via a seen Set, while native-adapter.ts does not. SQL like WHERE a = :x AND b = :x will behave differently between the two paths. One implementation should be authoritative and shared.

Fragile string-regex stale-channel detection

The pattern /kv channel (?:connection closed|shut down)/i in native-adapter.ts matches on free-text error messages from the Rust native addon. Any phrasing change on the Rust side silently breaks reconnection. A typed error code or sentinel property on the error object would be more robust.

Misleading fallback warning in open-database.ts

If nativeSqliteAvailable() returns true but ctx.nativeSqliteConfig is falsy (driver forgot to configure it), the warning says "native SQLite not available" when the addon is actually available but unconfigured. The two conditions should produce distinct warning messages.


Minor Issues

  • INITIAL_SLEEP_TIMEOUT_MS = 250 in actor-driver.ts has no comment explaining why 250 ms is the floor or what wake scenario this guards against.
  • The open_actors comment was simplified but now omits context. The set gates KV ops and enforces MAX_ACTORS_PER_CONNECTION, not just cleanup on disconnect.
  • Module-level shared state in engine tests (sharedNamespacePromise, etc.) relies on describe.sequential ordering with no explanatory comment.
  • destroyNamespaceActors outer loop is serial over actor names. Collecting all actor IDs in parallel first would speed up cleanup.

What Is Correct and Well-Done

  • Pegboard-envoy copy-paste fix: reading last_ping_ts_key instead of create_ts_key twice is a clear correctness fix.
  • Envoy lostTimeout return value fix: the outer variable was never updated when cleared inside handleConnMessage/handleConnClose.
  • connHandle.abort() in envoy cleanup: without aborting, the connection coroutine could outlive the envoy loop.
  • Conn.disconnect() single-flight fix: wrapping in #disconnectPromise prevents double-disconnect races idiomatically.
  • Non-awaited action processing in WebSocket (router-websocket-endpoints.ts): the added comment clearly explains why actions are intentionally not awaited while subscription updates remain serial.
  • broadcastProcessEvent catching ActorStopping in agent-os/actor/process.ts: correct pattern for shutdown-safe callbacks.
  • openActorDatabase unification: the shared call eliminates a class of mode-specific bugs where raw and Drizzle paths diverged.
  • actor_request_task cache change from single cached_actor to HashMap<String, (Id, String)>: correctly handles a connection multiplexing multiple actors.
  • Removing the committed .node binary and turbo.json for sqlite-native: correct housekeeping.
  • Legacy v2-format value removal in epoxy/src/ops/kv/read_value.rs: safe cleanup.

Summary

Priority Item
High Fix NativeSqliteDatabase.close() error handling logic
High Clarify/restore single-writer actor lock guarantee and update CLAUDE.md
High Replace Arc<Mutex<HashSet>> with Arc<scc::HashSet> in pegboard-kv-channel
Medium Deduplicate extractNamedSqliteParameters between native-adapter.ts and native-sqlite.ts
Medium Replace string-regex stale-channel detection with a typed error
Medium Fix misleading fallback warning in open-database.ts
Low Document INITIAL_SLEEP_TIMEOUT_MS = 250
Low Parallelize destroyNamespaceActors outer loop

Generated with Claude Code

@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 28d5a1f to 8af4472 Compare April 5, 2026 12:01
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 8af4472 to 930ee95 Compare April 5, 2026 12:11
@NathanFlurry NathanFlurry changed the base branch from main to graphite-base/4559 April 6, 2026 02:47
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 930ee95 to 6d9fd88 Compare April 6, 2026 02:47
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4559 to 04-05-fix_cache_make_in_memory_cache_global April 6, 2026 02:47
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 6d9fd88 to 2f73a59 Compare April 6, 2026 17:23
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 2f73a59 to 5b25ad0 Compare April 6, 2026 17:26
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 5b25ad0 to b8143ad Compare April 6, 2026 17:27
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from b8143ad to 33436b9 Compare April 6, 2026 17:33
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 33436b9 to 6e18bf9 Compare April 6, 2026 17:37
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 6e18bf9 to 7597a94 Compare April 6, 2026 17:45
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 7597a94 to c6d8ae5 Compare April 6, 2026 18:55
@MasterPtato MasterPtato force-pushed the 04-05-fix_cache_make_in_memory_cache_global branch from 9f06a14 to e815bd1 Compare April 6, 2026 19:54
@NathanFlurry NathanFlurry changed the base branch from 04-06-fix_misc_token_fixes to graphite-base/4559 April 7, 2026 01:33
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 5373668 to ea0f72e Compare April 7, 2026 01:33
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4559 to 04-06-fix_remove_serverless_token April 7, 2026 01:33
@NathanFlurry NathanFlurry changed the base branch from 04-06-fix_remove_serverless_token to graphite-base/4559 April 7, 2026 05:12
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from ea0f72e to 8b3a33e Compare April 7, 2026 05:12
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4559 to 04-06-fix_misc_token_fixes April 7, 2026 05:12
Reverts all file-system driver workarounds, test timing hacks,
client retry/polling logic, sleep timeout overrides, DB proxy
shutdown budgets, and connection-manager counters that were only
needed for FS driver tests. Keeps engine-relevant fixes: action
fire-and-forget in WS endpoints, conn single-flight disconnect,
DB unification, KV channel multiplexing, envoy bugfixes.
Normal (envoy-based) runner configs never had protocol_version set
by the metadata poller since that only runs for serverless configs.
This caused all actors on normal configs to be created with the v1
workflow, which uses the old runner architecture and never becomes
connectable through envoys.

The fix treats normal runner configs as v2 since they inherently
use the envoy protocol.
Normal (envoy-based) runner configs never had protocol_version set
by the metadata poller since that only runs for serverless configs.
This caused all actors on normal configs to be created with the v1
workflow, which uses the old runner architecture and never becomes
connectable through envoys.

The fix treats normal runner configs as v2 since they inherently
use the envoy protocol.
The cache module changes (global OnceLock for InMemoryDriver and
in_flight map) are not required for engine driver tests. Reverted
to main.

Added guidance to engine/CLAUDE.md about using concurrent containers
instead of Mutex<HashMap<...>>.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant