Skip to content

fix(mcp): fix race condition in concurrent MCP initialisation#3181

Merged
amitksingh1490 merged 4 commits intotailcallhq:mainfrom
toxicafunk:fix/mcp-init-race-condition
Apr 28, 2026
Merged

fix(mcp): fix race condition in concurrent MCP initialisation#3181
amitksingh1490 merged 4 commits intotailcallhq:mainfrom
toxicafunk:fix/mcp-init-race-condition

Conversation

@toxicafunk
Copy link
Copy Markdown
Contributor

Summary

Fix a race condition in ForgeMcpService where two concurrent init_mcp callers could leave the tool map empty, causing spurious "Tool not found" errors at startup.

Context

When Forge starts and multiple requests trigger MCP initialisation simultaneously, the old code wrote previous_config_hash at the top of update_mcp — before any server connections were established. A second concurrent caller would then see the hash already updated, conclude there was nothing to do, and return early against an empty self.tools map. Any subsequent tool call would fail with "Tool not found" despite the MCP server being correctly configured.

Changes

  • Added init_lock: Arc<Mutex<()>> to ForgeMcpService to serialise concurrent init_mcp calls
  • Implemented double-checked locking: fast path skips the lock when the hash is already current; slow path acquires the lock then re-checks before calling update_mcp
  • Moved previous_config_hash write to after all MCP connections (successful or failed) complete in update_mcp, so any waiter released from init_lock sees a fully populated tool map

Key Implementation Details

The double-checked locking pattern in init_mcp:

  1. Fast path — check is_config_modified before acquiring init_lock; return immediately if unchanged (no contention on the common case)
  2. Slow path — acquire init_lock, re-check is_config_modified (another task may have completed init while we waited), then call update_mcp
  3. update_mcp stamps previous_config_hash only after join_all finishes, guaranteeing self.tools is fully populated when any waiter is released

Testing

cargo test -p forge_services

New test added: test_concurrent_init_does_not_race — fires two concurrent get_mcp_servers() calls on a shared Arc<ForgeMcpService> and asserts that after both settle, every registered tool is callable without error.

- Add  to  to serialise
  concurrent calls to .
- Move  write from the top of  to
  after  completes, so any waiter on  only sees
  the hash updated once  is fully populated.
- Apply double-check pattern in : fast-path check without the
  lock, acquire lock, then re-check before calling . This
  prevents a second concurrent caller from running  while
  the first is still connecting to servers, which would cause
  intermittent "Tool not found" errors.
- Add  to verify the fix.

Co-Authored-By: ForgeCode <noreply@forgecode.dev>
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Apr 27, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions Bot added the type: fix Iterations on existing features or infrastructure. label Apr 27, 2026
…t test

- refresh_cache now holds init_lock before clearing tools and resetting
  the config hash, closing the race where reload_mcp could wipe the tool
  map while an in-flight update_mcp was still populating it
- Switch test_concurrent_init_does_not_race to multi_thread flavor
  (worker_threads = 2) so true parallelism exercises the timing windows
  that the cooperative scheduler cannot produce

Co-Authored-By: ForgeCode <noreply@forgecode.dev>
@ericnoam
Copy link
Copy Markdown
Contributor

ericnoam commented Apr 28, 2026

Fixed 1 issue and improved 1 test

Fix 1: refresh_cache acquiring init_lock — the more important fix

The race is real. The breaking sequence:

  1. Caller A holds init_lock, runs clear_tools(), and join_all starts populating self.tools
  2. Caller B (via reload_mcp → refresh_cache) runs concurrently with no lock — calls clear_tools(), wiping whatever A just inserted
  3. A finishes join_all, writes the hash
  4. Hash now says "up to date" but self.tools is empty

Any subsequent init_mcp caller hits the fast path (hash matches) and returns against an empty map — exactly the original bug but through a different code path. Fix applied: refresh_cache now acquires init_lock before mutating shared state.

Fix 2: multi_thread test flavor — warranted but less critical

Since all critical sections cross .await points (Mutex::lock().await, RwLock::write().await), the cooperative scheduler would likely exercise the interleavings even in single-threaded mode. However, "likely" is not "guaranteed" — true parallelism under flavor = "multi_thread" closes that gap at zero cost. Fix applied.

@amitksingh1490 amitksingh1490 enabled auto-merge (squash) April 28, 2026 09:36
@amitksingh1490 amitksingh1490 merged commit 9f97aee into tailcallhq:main Apr 28, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

type: fix Iterations on existing features or infrastructure.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants