Skip to content

fix: process hygiene - MCP orphan watchdog, live-PID update locks, PATH-hijack-proof registration#385

Merged
RaghavChamadiya merged 1 commit into
mainfrom
fix/process-hygiene
Jun 5, 2026
Merged

fix: process hygiene - MCP orphan watchdog, live-PID update locks, PATH-hijack-proof registration#385
RaghavChamadiya merged 1 commit into
mainfrom
fix/process-hygiene

Conversation

@RaghavChamadiya
Copy link
Copy Markdown
Member

Problem

An audit of a real machine found three structural process-hygiene issues, all pointing at one repo:

  1. Orphaned MCP stdio servers (every Windows user). repowise mcp --transport stdio has no parent-death handling, and Windows never kills children when the parent dies — so every crashed / force-quit / abnormally-ended agent session leaks its MCP server pair, each holding wiki.db handles that contend with later repowise update runs (WAL writer locks).
  2. Wedged update locks (any OS, rapid commits). read_update_lock treated a lock as stale only after 30 min of wall clock and never checked whether the owning PID was alive. A SIGKILLed/crashed update — the paths atexit cannot cover — blocked further updates for the full window, and its leftover lock also suppressed the augment hook's stale-wiki warnings.
  3. PATH hijack of MCP registrations (multi-install users). All registrations wrote the bare command "repowise", resolved via PATH at session start — observed in the wild: a stale conda shadow install serving a repo indexed by the venv install.

Fix

New repowise.core.procutils — stdlib-only, fail-open process probes (no psutil, no subprocess on hot paths):

  • pid_alive — kernel32 OpenProcess + WaitForSingleObject on Windows (os.kill(pid, 0) on Windows terminates the target, so it is never used as a probe there), os.kill(pid, 0) on POSIX
  • process_create_token — process identity by creation time, to defeat PID reuse
  • ancestor_chain — single Toolhelp32 snapshot on Windows, /proc on Linux, one-shot ps on macOS

1. Parent-death watchdog (mcp_server/_watchdog.py, stdio only). A plain getppid() watch would fix nothing: console-script installs run as a launcher chain (client -> repowise.exe shim -> venv python -> real python on Windows), and the immediate parent is a shim that waits on us and survives the client. The watchdog walks the ancestor chain at startup, skips launcher shims (python*, repowise*, uv*) and shell wrappers (cmd, sh, pwsh, ...), and watches every recorded ancestor up to and including the first non-launcher — the client — identity-checked by creation token. It never watches above the client (a terminal may die while the client legitimately lives), fails open on any probe uncertainty, and can be disabled with REPOWISE_MCP_NO_WATCHDOG=1. 5s-poll daemon thread; os._exit(0) on client death.

2. Live-PID lock staleness. Locks now record the owner's pid_create_token; a lock whose owner is dead or recycled is stale immediately. Applied to both lock implementations — cli/helpers.py and its workspace mirror in core/workspace/update.py (formats kept in sync) — and the augment hook's _read_in_flight_marker now delegates to the canonical reader. Legacy token-less locks remain honored (liveness + wall clock), and the 30-min ceiling still reaps hung-but-alive updates.

3. Scope-split registration. Per-user configs (~/.claude/settings.json, claude_desktop_config.json) now pin the absolute path of the install that ran init — resolved from the running interpreter's scripts dir, never PATH — and are refreshed on every re-registration, so a moved venv self-heals on the next init/update. Repo-shared files (.mcp.json, .codex/config.toml) deliberately keep the bare name: they may be committed, and one contributor's absolute path would break every other checkout. Transient locations (uvx cache, temp dirs) fall back to the bare name.

Tests

  • tests/unit/test_procutils.py — real probes against self and reaped children
  • tests/unit/cli/test_update_lock.py — staleness matrix (dead / recycled / legacy / unknown / wall clock) for both lock implementations
  • tests/unit/cli/test_mcp_command_resolution.py — per-user absolute vs repo-shared bare; transient-location fallbacks; moved-venv self-heal
  • tests/unit/server/test_mcp_watchdog.py — launcher/client classification, watch-set boundaries, fail-open behavior, and an end-to-end orphan test: a client process spawns a watchdogged server and exits; the server self-terminates within seconds
  • Two existing augment-staleness tests fabricated locks with fake PIDs to mean "update running"; they now use a live PID, plus a new regression test pinning that a crashed update's lock no longer suppresses stale warnings

Full unit + provider suites pass (3,932 tests). Note for review: the Windows code paths were exercised locally end-to-end; Linux /proc and macOS ps paths are covered by the same tests and need a CI run to confirm — everything fails open (unknown => alive => no action), so the worst case is the previous behavior, never a wrongly-killed server or wrongly-broken lock.

Not in this PR (follow-ups)

  • Hook commands (repowise-augment, repowise-rewrite, post-commit command -v repowise) stay bare: their installers are has-hook-already idempotent, so an absolute path would be pinned forever and break when the venv moves — needs an update-in-place re-registration story first.
  • Doctor process audit (enumerate live repowise mcp/update processes, flag shadow-install servers and stale registrations) — separate PR as planned; one logical MCP server is 2–3 OS processes (shim chain), so the audit should count sets.

…TH-hijack-proof registration

Three structural fixes for leaked/wedged repowise processes, observed on
real machines and confirmed structural (hit every Windows user, and the
lock issue hits anyone with rapid commits on any OS):

1. Orphaned MCP stdio servers: `repowise mcp --transport stdio` never
   exited when the MCP client died abnormally; on Windows children are
   never killed with their parent, so every crashed/force-quit session
   leaked a server pair holding wiki.db handles. New parent-death
   watchdog walks the ancestor chain past launcher shims (repowise.exe →
   venv python → real python; uv/uvx; cmd/sh wrappers) to the actual
   client and exits when it dies. A plain getppid() watch would see only
   the shim, which survives the client. Opt out: REPOWISE_MCP_NO_WATCHDOG=1.

2. Wedged update locks: read_update_lock treated locks as stale only
   after 30 min wall clock; a SIGKILLed/crashed update (the paths atexit
   cannot cover) blocked further updates for the whole window. Locks now
   record the owner's process-creation token and are stale immediately
   when the owning PID is dead or recycled. Applied to both lock
   implementations (cli/helpers + the workspace mirror in
   core/workspace/update) and to the augment hook's in-flight marker, so
   a crashed update can't suppress stale-wiki warnings either. Legacy
   token-less locks remain honored; the wall-clock ceiling still reaps
   hung-but-alive updates.

3. PATH hijack of MCP registrations: bare repowise commands resolve
   via PATH at session start, so shadow installs (conda, old pip, pipx)
   hijack the server against the indexed repo. Per-user configs
   (~/.claude/settings.json, claude_desktop_config.json) now pin the
   absolute path of the install that ran init, refreshed on every
   re-registration so a moved venv self-heals. Repo-shared files
   (.mcp.json, .codex/config.toml) deliberately keep the bare name —
   they may be committed and must not carry one machine's paths.
   Transient locations (uvx cache, temp dirs) fall back to bare.

All probes live in new stdlib-only repowise.core.procutils: kernel32 via
ctypes on Windows (os.kill(pid, 0) on Windows TERMINATES the target —
never used as a probe), /proc on Linux, one-shot ps on macOS; fail-open
(unknown ⇒ alive) so uncertainty can never kill a live server or break a
live lock. No new dependencies; no subprocess calls on hot paths.

Tests: procutils probes, lock staleness matrix (dead/recycled/legacy/
unknown), per-user-absolute vs repo-shared-bare registration, watchdog
unit + end-to-end orphan test (client dies ⇒ server self-terminates).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@RaghavChamadiya RaghavChamadiya requested a review from swati510 as a code owner June 5, 2026 15:02
@repowise-bot
Copy link
Copy Markdown

repowise-bot Bot commented Jun 5, 2026

✅ Health: 7.0 (unchanged)
1 file moved · 5 hotspots · 5 hidden couplings · 5 with fix history · 2 dead-code findings

🚨 Change risk: high (riskier than 81% of this repo's commits · raw 9.6/10)
This change's risk is driven by:

  • large diff (many lines added)
  • scattered, high-entropy change

🩹 Review priority (files here with the most recent bug-fix history — defects cluster, so review these first)

File Score Δ Why
.../workspace/update.py 5.1 → 4.3 ▼ -0.8 🔻 introduced nested complexity, complex method · ✅ resolved dry violation

💡 .../workspace/update.py: Flatten the control flow. Pull early-return guards to the top, extract the deepest branch into a helper, and consider replacing nested conditionals with a strategy table or dispatch dict.

🔥 Hotspots touched (5)
  • .../workspace/update.py — 4 commits/90d, 8 dependents · primary owner: Raghav Chamadiya (100%)
  • .../mcp_server/_server.py — 9 commits/90d, 5 dependents · primary owner: Raghav Chamadiya (95%)
  • .../cli/test_augment_staleness.py — 3 commits/90d, 0 dependents · primary owner: Raghav Chamadiya (100%)
2 more
  • .../cli/helpers.py — 20 commits/90d, 40 dependents · primary owner: Raghav Chamadiya (72%)
  • .../commands/augment_cmd.py — 8 commits/90d, 5 dependents · primary owner: Raghav Chamadiya (100%)
🔗 Hidden coupling (3 files)
  • .../cli/helpers.py co-changes with these files (not in this PR):
    • .../cli/test_helpers.py (9× — 🟢 routine)
    • .../commands/update_cmd.py (8× — 🟢 routine)
    • README.md (5× — 🟢 routine)
  • .../commands/augment_cmd.py co-changes with README.md (5× — 🟢 routine) — not in this PR.
  • .../cli/mcp_config.py co-changes with .../cli/main.py (5× — 🟢 routine) — not in this PR.
💀 Dead code (2 findings)
  • 💀 .../cli/helpers.py ensure_db (confidence 1.00)
  • 💀 .../cli/helpers.py is_interactive_session (confidence 1.00)

📊 Full report · ⭐ Star Repowise · 📥 Install bot · Last updated 2026-06-05 15:03 UTC
Silence on a single PR with [skip repowise] in the title · Per-repo toggle on repowise.dev/settings?tab=bot

@RaghavChamadiya RaghavChamadiya merged commit 1b9c72d into main Jun 5, 2026
5 checks passed
@RaghavChamadiya RaghavChamadiya deleted the fix/process-hygiene branch June 5, 2026 15:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants