fix: stop infinite reconnect storm on multi-session (close code 4001)#57

Merged
raysonmeng merged 7 commits into master from fix/close-code-4001
Apr 13, 2026

Conversation

@raysonmeng raysonmeng commented Mar 31, 2026

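Summary

Fixes the infinite reconnect loop triggered by multiple Claude Code sessions and makes the approval request lifecycle more reliable.

Part 1: Close Code 4001 (stop-the-bleeding fix)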

When a second Claude Code session connects to the daemon, the old session is kicked with close code 4001. Previously the kicked client auto-reconnected, creating an infinite reconnect storm.

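Root cause: in daemon-client.ts, the onclose handler did not distinguish close code 4001 (session replaced) from other close codes (for example, a daemon crash). It emitted "disconnect" in every case, which triggered a reconnect.

Fix:

  • control-protocol.ts: exports the CLOSE_CODE_REPLACED = 4001 constant
  • daemon-client.ts: onclose checks event.code === 4001 and emits "replaced" instead of "disconnect"
  • bridge.ts: listens for the "replaced" event and enters a permanent dormant state (the recovery poller is not started, to avoid ping-pong)
  • daemon.ts: uses the CLOSE_CODE_REPLACED constant instead of a magic number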
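
The close-code handling described above can be sketched roughly as follows. This is a minimal illustration, not the actual daemon-client.ts or bridge.ts code: only CLOSE_CODE_REPLACED = 4001 and the "replaced" / "disconnect" event names come from this PR, while eventForCloseCode, BridgeSketch, and the state names are hypothetical.

```typescript
// Hypothetical sketch: only CLOSE_CODE_REPLACED = 4001 and the two event
// names are taken from this PR; everything else is invented for illustration.
const CLOSE_CODE_REPLACED = 4001;

type ClientCloseEvent = "replaced" | "disconnect";
type BridgeState = "connected" | "reconnecting" | "dormant";

// Map a WebSocket close code to the event the client should emit.
function eventForCloseCode(code: number): ClientCloseEvent {
  return code === CLOSE_CODE_REPLACED ? "replaced" : "disconnect";
}

// Minimal stand-in for the bridge: "replaced" is terminal, while
// "disconnect" moves the bridge into a reconnecting state.
class BridgeSketch {
  private state: BridgeState = "connected";

  getState(): BridgeState {
    return this.state;
  }

  handleClose(code: number): void {
    // Once dormant, later close events never trigger a reconnect.
    if (this.state === "dormant") return;
    this.state =
      eventForCloseCode(code) === "replaced" ? "dormant" : "reconnecting";
  }
}
```

The point of the split is that a kicked session has a terminal path that never feeds back into the reconnect logic, which is what breaks the storm.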

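Part 2: Approval Lifecycle Reliability (aligned with the codex-plugin-cc protocol)

Following the connection lifecycle management pattern in codex-plugin-cc, this part improves the reliability of approval request passthrough in AgentBridge:

  • TUI disconnect replay: after the TUI disconnects, in-flight server requests are requeued and automatically replayed when a new TUI connects (the TTL timer is removed, so valid requests are no longer dropped on timeout)
  • Response buffering: if the app-server has already disconnected when the user approves, the response is buffered and flushed automatically once the app-server reconnects
  • Deduplication: while an app-server reconnect is pending, duplicate approval responses are ignored
  • State split: clearTransientResponseTrackingState() and clearResponseTrackingState() are separated, so an app-server disconnect does not clear buffered responses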
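
As an illustration of the buffering and deduplication bullets above, here is a minimal sketch. All names (ApprovalResponse, ApprovalResponseBuffer, trySend) are hypothetical and the real codex-adapter logic is not shown in this PR page. A later commit in this PR clears approval state when the app-server connection closes, so treat this purely as a picture of the buffer-then-flush idea.

```typescript
// Hypothetical types and names; not the AgentBridge implementation.
type ApprovalResponse = { requestId: number; decision: "approve" | "deny" };
type SubmitResult = "sent" | "buffered" | "duplicate";

class ApprovalResponseBuffer {
  private readonly buffered = new Map<number, ApprovalResponse>();
  private readonly trySend: (response: ApprovalResponse) => boolean;

  // trySend returns false when the app-server is unreachable.
  constructor(trySend: (response: ApprovalResponse) => boolean) {
    this.trySend = trySend;
  }

  // Deliver immediately if possible, otherwise keep the response for later.
  // A repeated response for an already-buffered request is ignored.
  submit(response: ApprovalResponse): SubmitResult {
    if (this.buffered.has(response.requestId)) return "duplicate";
    if (this.trySend(response)) return "sent";
    this.buffered.set(response.requestId, response);
    return "buffered";
  }

  // Call after the app-server reconnects; returns how many were delivered.
  flush(): number {
    let delivered = 0;
    for (const response of Array.from(this.buffered.values())) {
      if (this.trySend(response)) {
        this.buffered.delete(response.requestId);
        delivered++;
      }
    }
    return delivered;
  }

  pendingCount(): number {
    return this.buffered.size;
  }
}
```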

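Part 3: Bridge Disabled State Improvements

  • bridge-disabled-state.ts: extracts the BridgeDisabledReason type ("killed" | "replaced") and returns a different error message for each reason
  • A replaced session no longer starts the recovery poller (it stays permanently dormant), so it does not fight the new session for the connection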
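
A reason-tagged disabled state could look something like the sketch below. Only the BridgeDisabledReason union is taken from this PR. The message texts and both helper functions are invented for illustration, and the assumption that a "killed" bridge still starts the recovery poller is inferred from the description rather than confirmed.

```typescript
// Only this union type is taken from the PR description.
type BridgeDisabledReason = "killed" | "replaced";

// Hypothetical message texts; the real wording is not shown in the PR.
const DISABLED_MESSAGES: Record<BridgeDisabledReason, string> = {
  killed: "Bridge was stopped; it will resume when the daemon is available.",
  replaced: "Another session took over; this session stays dormant.",
};

function disabledMessage(reason: BridgeDisabledReason): string {
  return DISABLED_MESSAGES[reason];
}

// Assumption: only a replaced session skips the recovery poller, so that
// it never tries to win the connection back from the newer session.
function shouldStartRecoveryPoller(reason: BridgeDisabledReason): boolean {
  return reason !== "replaced";
}
```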

Part 4: CI / Infra

  • scripts/verify-plugin-sync.cjs: new plugin sync verification script that ensures build artifacts match the source
  • .github/workflows/ci.yml: CI now runs bun run check as a single entry point (typecheck + test + plugin sync + version check)
  • package.json: adds the verify:plugin-sync script

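Test plan

  • bun run typecheck: passes
  • bun test src/: all 166 tests pass
  • Manual test: open two Claude Code sessions at once and confirm the first one enters dormant gracefully
  • Manual test: approval requests replay correctly after a TUI disconnect and reconnect
  • Codex review

New and updated tests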

Close code 4001 (daemon-client):

  1. emits replaced (not disconnect) when server closes with code 4001
  2. emits disconnect (not replaced) for non-4001 close codes
  3. pending replies rejected on replaced close (code 4001)

Approval lifecycle (codex-adapter):

  4. approval response buffered when app-server disconnected, flushed on reconnect
  5. approval response send failure is buffered for retry
  6. requeues in-flight server requests on TUI disconnect and replays them on reconnect
  7. new TUI connection replays in-flight server requests before the old socket closes

Bridge disabled state:

  8. bridge-disabled-state.test.ts: tests for the disabled reason type

Closes #55 (Phase 1)
Relates to #39, #58

🤖 Generated with Claude Code

Fix the infinite reconnect loop caused by multiple Claude sessions

When a second Claude Code session connects to the daemon, the old session
is kicked with close code 4001. Previously, the kicked client treated this
as a generic disconnect and auto-reconnected, displacing the new session
and creating an infinite loop.

Now the client distinguishes close code 4001 ("replaced") from other close
codes. A replaced session enters a dormant disabled state instead of
reconnecting. The existing disabledRecoveryPoller handles recovery if the
replacing session later disconnects.

Changes:
- control-protocol.ts: export CLOSE_CODE_REPLACED = 4001
- daemon-client.ts: emit "replaced" (not "disconnect") on code 4001
- bridge.ts: handle "replaced" event via enterDisabledState()
- daemon.ts: use CLOSE_CODE_REPLACED constant instead of magic number
- daemon-client.test.ts: 3 new tests for 4001 vs non-4001 behavior

Closes #55 (Phase 1)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
rayson951005@gmail.com added 5 commits April 2, 2026 15:25
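Improve approval request lifecycle reliability, aligned with the protocol handling in codex-plugin-cc:
- After a TUI disconnect, requeue in-flight server requests and replay them on reconnect
- Buffer approval responses while the app-server is disconnected and flush them after reconnect
- Remove the TTL timer in favor of a requeue strategy, so valid requests are never dropped
- Split the bridge disabled state to distinguish the killed and replaced reasons
- A replaced session stays permanently dormant and no longer starts the recovery poller, avoiding ping-pong
- Add the verify:plugin-sync script; CI now uses bun run check as a single entry point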
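
Fix a correctness bug found in Codex review: after an app-server disconnect, approval request/response state was wrongly retained, so a response carrying an old server ID could be flushed to the new connection.

- Extract handleAppServerClose(), which calls clearResponseTrackingState() to clear all approval state (serverRequestToProxy, pendingServerRequests, pendingServerResponses)
- Add a regression test covering the app-server close cleanup path
- The requeue/replay logic for TUI reconnects is unaffected

Per the constraint in the issue-37 design doc: approval IDs are session-scoped and become invalid after the app-server reconnects, so approval state should be discarded.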
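
The cleanup path in this commit might be pictured as below. The method and field names (handleAppServerClose, clearResponseTrackingState, serverRequestToProxy, pendingServerRequests, pendingServerResponses) are quoted from the commit message, but the field types and the surrounding class are assumptions made only so the sketch is self-contained.

```typescript
// Hypothetical container class; field types are assumed, not from the PR.
class ApprovalTrackingSketch {
  serverRequestToProxy = new Map<number, number>();
  pendingServerRequests: number[] = [];
  pendingServerResponses: number[] = [];

  // Drop every piece of approval state tied to the old app-server session.
  clearResponseTrackingState(): void {
    this.serverRequestToProxy.clear();
    this.pendingServerRequests.length = 0;
    this.pendingServerResponses.length = 0;
  }

  // Approval IDs are session-scoped, so nothing may survive a close:
  // a stale response must never be flushed to the next connection.
  handleAppServerClose(): void {
    this.clearResponseTrackingState();
  }
}
```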
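
Move all unit test files from the src/ root into src/unit-test/ to reduce clutter in the source directory.
Add a src/unit-test/e2e/ directory that records manual E2E test steps per PR.

- Migrate 13 .test.ts files and update their relative import paths
- Add the pr-57-close-code-4001.md E2E test document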
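
Change the multi-session design direction: a new Claude connection is rejected and the existing session is unaffected.

- daemon.ts: attachClaude() checks readyState !== CLOSED (which includes CLOSING) and rejects the newcomer instead of kicking the old connection
- Rename replaced → rejected end to end: types, event names, error messages, tests
- Update the E2E document to the new "reject new connection" semantics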
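
The "reject the newcomer" check might look roughly like this. It is a sketch under assumptions: attachClaude and the readyState !== CLOSED condition come from the commit message, while SocketLike, DaemonSketch, and REJECT_CLOSE_CODE are hypothetical. The PR does not say whether the numeric close code changed with the rename, so 4001 is reused here only as a placeholder.

```typescript
// WebSocket readyState values: CONNECTING = 0, OPEN = 1, CLOSING = 2, CLOSED = 3.
const READY_STATE_CLOSED = 3;
// Assumption: the PR does not state the code used after the rename.
const REJECT_CLOSE_CODE = 4001;

// Minimal structural type so the sketch runs without a real WebSocket.
interface SocketLike {
  readyState: number;
  close(code?: number, reason?: string): void;
}

class DaemonSketch {
  private current: SocketLike | null = null;

  // Returns true if the socket was attached, false if it was rejected.
  attachClaude(socket: SocketLike): boolean {
    // Anything other than CLOSED (including CLOSING) still counts as an
    // active session, so the newcomer is turned away.
    if (this.current !== null && this.current.readyState !== READY_STATE_CLOSED) {
      socket.close(REJECT_CLOSE_CODE, "another session is already attached");
      return false;
    }
    this.current = socket;
    return true;
  }
}
```

Rejecting the newcomer leaves the existing session untouched, so there is no kicked client left to attempt a reconnect.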

abg dev now runs bun run build:plugin automatically first, so the plugin artifacts stay in sync with the source and new code is not tested against a stale build.
@raysonmeng raysonmeng force-pushed the fix/close-code-4001 branch from d5f82ef to efde6df Compare April 2, 2026 09:01
fix: support the secondary WebSocket connection used during Codex TUI resume

Codex TUI uses two parallel WebSocket connections during thread resume:
a picker connection (secondary) and the main session connection (primary).
The proxy's "latest connection wins" model was dropping thread/resume
messages from the primary connection, causing the resume flow to freeze.

Changes:
- Add secondary connection support with dedicated app-server WS per
  secondary connection (raw passthrough, no id remapping)
- Add app-server generation counter to prevent stale close handlers
- Add fresh-session reconnect: buffer TUI messages during app-server
  reconnect on initialize, replay after reconnection
- Fix orphaned app-server WS if picker disconnects before onopen
- Fix zombie secondary if app-server WS closes first
- Clean up verbose diagnostic logging (app-server → proxy, stale
  message content preview, [track] per-message logging)
- Update "replaced" → "rejected" semantics for multi-session handling
- Sync compiled plugin files (bridge-server.js, daemon.js)
- Add codex-plugin-cc/ to .gitignore
- Update verify-plugin-sync script

Tests: 171 pass, 0 fail
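
The app-server generation counter mentioned in the changes above is a common guard against stale close handlers. A minimal sketch follows, with hypothetical names (AppServerLink, connect, isConnected) and no claim to match the proxy's actual code.

```typescript
// Hypothetical sketch of a generation counter guarding close handlers.
class AppServerLink {
  private generation = 0;
  private connected = false;

  isConnected(): boolean {
    return this.connected;
  }

  // Opens a new logical connection and returns its close handler.
  // The handler remembers which generation it belongs to.
  connect(): () => void {
    const myGeneration = ++this.generation;
    this.connected = true;
    return () => {
      // A handler from an older connection must not tear down the new one.
      if (myGeneration !== this.generation) return;
      this.connected = false;
    };
  }
}
```

With this guard, a late close event from a superseded socket becomes a no-op instead of marking the current connection as down.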
@raysonmeng raysonmeng merged commit 714704b into master Apr 13, 2026
1 check failed

Development

Successfully merging this pull request may close these issues.

feat: multi-session & multi-workspace support / 多会话 + 多项目并行支持
