feat(windows): un-skip Docker driver tests and add basic NTFS ACL vault check#1281
feat(windows): un-skip Docker driver tests and add basic NTFS ACL vault check#1281
Conversation
…lt check Stream A of the Windows-GA push (follows #1278's Stream 0 regression net). **Docker driver tests (work item 1):** Replaces the `#!/bin/sh` mock-script approach with a TypeScript-based mock launched through a tiny platform-specific shim — `.cmd` on Windows, chmod-able `.sh` on POSIX, both invoking `deno run` against the same mock TS source. Un-skips 10 tests that were previously gated on `Deno.build.os === "windows"`. Production driver behavior is unchanged; only the stub binary form changed. The Stream-0 SIGTERM and stream interleaving tests remain POSIX-only because they exercise POSIX signal semantics. **Vault NTFS ACL check (work item 2):** Adds `src/infrastructure/security/file_security_check.ts` with `checkFileNotBroadlyReadable(path)` — POSIX path replicates the existing `(stat.mode & 0o077) !== 0` rule, Windows path shells out to PowerShell `Get-Acl` and walks each ACE looking for Allow entries that grant Read/ReadAndExecute/FullControl to a small set of broad principals (Everyone, Authenticated Users, Anonymous Logon, BUILTIN\Users, plus their SID forms). Scope is intentionally Option-C (basic broad-principal check, not full ACL semantics): the helper does not walk inheritance, honour Deny ACEs, inspect alternate data streams, or resolve nested group membership. The intent is to match the bar the POSIX check already sets — refuse files that are obviously broadly readable for vault key material — and surface a clear remediation hint (`icacls /remove:g`) when something is flagged. The vault provider's `validateSshKeyPermissions` now delegates to the helper, preserving the historical `SSH key '<path>' has insecure permissions ...` error format on POSIX so existing callers/tests still parse it. Adds parallel Windows-only test cases that grant `Everyone:(R)` via `icacls` and assert the vault rejects the key. A TODO is added in `src/domain/audit/doctor/check.ts` for a future `swamp doctor vault` surface that discloses what the platform check covers — no doctor vault command exists today, so we deliberately defer that work rather than expand scope. Verification: `deno fmt`, `deno lint`, `deno check`, `deno run test` (5154 passed, 0 failed, 27 ignored) and `deno run compile` all green on macOS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The timeout mock used `await new Promise<void>(() => {})` to block the
mock "docker" child forever and let the driver's timeout SIGTERM kill
it. That works on POSIX shells but not under deno: the runtime detects
no scheduled work, terminates the process before the SIGTERM arrives,
and emits "Top-level await promise never resolved" on stderr. The
docker driver captures that stderr and surfaces it as the error
message, so the test's `assertStringIncludes(error, "timed out")`
fails on ubuntu-latest CI.
Switch to `setInterval(() => {}, 60_000)` instead. setInterval keeps
the event loop alive without a top-level await, so deno sits idle
waiting for the timer until the driver's SIGTERM (or SIGKILL on
Windows) kills the process silently with no diagnostic on stderr.
No production driver behaviour changes — only the test mock body.
Verified locally on macOS that the timeout test now passes; the test
takes ~205ms (~the configured 200ms timeout plus a small kill delay)
and produces the expected "Docker command timed out after 200ms"
error.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ure on GHA
GitHub Actions windows-latest runners cannot reliably auto-load the
`Microsoft.PowerShell.Security` module that hosts `Get-Acl`, so every
PowerShell invocation in `checkFileNotBroadlyReadable` fails with:
Get-Acl : The 'Get-Acl' command was found in the module
'Microsoft.PowerShell.Security', but the module could not be loaded.
...
FullyQualifiedErrorId : CouldNotAutoloadMatchingModule
Our helper falls through to the fail-closed "Could not verify ACL"
branch, which in turn causes `validateSshKeyPermissions` to reject
every SSH key on Windows — taking 13 vault tests with it.
Switch the Windows path to `icacls`, a native Win32 binary that is
always present on every Windows install (no module loading required).
The new helper:
- shells out to `icacls <path>`,
- parses the multi-line ACE listing into `{ principal, rights }` pairs,
- flags any ACE whose principal is one of {Everyone, BUILTIN\\Users,
NT AUTHORITY\\Authenticated Users, NT AUTHORITY\\Anonymous Logon} (or
the matching SID) AND whose rights string contains any read-granting
token: `(R)`, `(RX)`, `(M)`, `(F)`, or the SDDL generic-rights
aliases `(GR)`, `(GE)`, `(GA)`.
The error-message format ("'<path>' has insecure ACL: principal
'<name>' has Read or higher access. ...") is preserved so vault tests
asserting on `assertStringIncludes(..., "insecure ACL")` continue to
pass without change.
The parser is exposed as `parseIcaclsOutput` and the matcher as
`matchBroadAce` (both marked `@internal`) so cross-platform unit tests
can exercise them without needing a Windows runner. Added 15 new
parser/matcher tests covering: typical multi-line output, inheritance
flags inline (`(OI)(CI)(R)`), localised summary lines, LF-only line
endings, case-mismatched path prefixes, SID-form principals,
case-insensitive principal matching, and rejection of write-only ACEs.
Locally verified on macOS:
- POSIX path unchanged (3 existing tests still green).
- All 15 new parser/matcher tests pass.
- Vault provider tests continue to pass.
CI on windows-latest is required to confirm the icacls integration
end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n gaps PR #1281 CI showed the timeout test passes on Linux (~206ms) but hangs for 15+ minutes on Windows. Root cause: the deno mock launches via a `.cmd` shim, and the docker driver's `process.kill("SIGTERM")` lands on the cmd parent — Windows has no POSIX-style process group, so the signal does NOT propagate down to the deno child running `setInterval(...)`. The mock keeps running forever, the driver never observes `process.status` resolving, and the test wedges. Fix: add a defensive `setTimeout(() => Deno.exit(0), 5000)` self-exit inside the timeout-test mock. The driver's own 200ms timeout fires its "timed out" error well before the 5s self-exit (POSIX behavior is unchanged — kill propagates and the mock dies in ~200ms). On Windows, when the kill is swallowed by the shim, the mock still terminates after 5s, the launcher exits, `process.status` resolves, and the already-set `timedOut` flag surfaces the expected "timed out" assertion. Audit: only the timeout test uses a long-running mock pattern (`setInterval`). Every other deno mock in this file is print-and-exit (short-lived), so no other mocks need the safety. The Stream-0 regression-net tests at the bottom of the file use shell scripts and are POSIX-only (`ignore: Deno.build.os === "windows"`), so they're unaffected. Verified: full suite green on macOS (5170 passed, 0 failed). Windows behavior depends on CI — the self-exit is bounded so worst case is a 5s test instead of a 15m hang.
…t hack The previous iteration added `setTimeout(() => Deno.exit(0), 5000)` to the timeout-test mock as a defensive self-exit on Windows where `process.kill()` doesn't propagate from a `.cmd` shim to a deno child. That worked for the hang, but introduced a race: under load, if the driver's 200ms timeout doesn't fire before the 5s self-exit, the mock exits cleanly with code 0, the driver sees `success: true`, and the test fails for the wrong reason. Replace the hack with an honest `ignore: Deno.build.os === "windows"` gate on the timeout test specifically. The other 9 newly-un-skipped docker driver tests stay enabled on Windows — they're print-and-exit mocks that don't depend on kill propagation. Re-enabling the timeout test on Windows requires either Job Object-based kills (not exposed by Deno) or a parent-death-detecting mock pattern; tracked as a follow-up to Stream A. Local verification on macOS: 39 passed, 0 failed in the docker driver test file; full suite stays at 5170 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
Well-structured PR. The security check extraction is clean, the icacls parser is well-tested with cross-platform unit tests, and the fail-closed approach throughout checkWindowsAcl is the right call for security-sensitive code. The createMockDriverCommand abstraction for cross-platform test mocking is a good pattern.
Blocking Issues
None.
Suggestions
-
Multi-line comment blocks — CLAUDE.md convention is "one short line max" for comments. The 11-line TODO in
src/domain/audit/doctor/check.ts:32-42, the 8-line JSDoc oncreateMockDriverCommandin the test file, and the 24-line module docstring onfile_security_check.tsall exceed that. The module docstring documents the security scope/threat model so it's arguably the right trade-off for a security file, but the TODO and test helper docblock could be trimmed to a single line each (e.g.// TODO(windows-ga): add swamp doctor vault surface — see PR #1281 for scope). -
No timeout on
icaclssubprocess —checkWindowsAclawaitscmd.output()with noAbortSignal. Whileicaclsis a local system binary that should return instantly, a defensive timeout (e.g. 10s) would prevent a hung subprocess from blocking the vault indefinitely. CLAUDE.md mentions AbortSignal for network calls specifically, so this is just a hardening suggestion. -
(R)substring match edge case —READ_GRANTING_TOKENSuses.includes("(R)")against the rights string. This is correct for icacls's current output format, but if icacls ever emits individual SDDL atoms like(RC)(Read Control), the(R)token wouldn't false-positive because the)must followR. Just noting the reasoning is sound — no change needed.
There was a problem hiding this comment.
Adversarial Review
Critical / High
None found.
Medium
-
Non-English Windows locale bypasses the ACL check (
src/infrastructure/security/file_security_check.ts:56-61,273-279).
TheBROAD_WINDOWS_PRINCIPALSlist uses English names (Everyone,BUILTIN\Users, etc.). On localized Windows,icaclsprints localized names — e.g.,Jeder(German),Tout le monde(French). ThelocaleComparematches case variants but cannot bridge"Jeder"→"Everyone". The SID fallback (p.sid) only fires when Windows cannot resolve the SID to a name, which is rare on a functioning localized system — icacls will almost always print the localized name rather than the raw SID. The net effect is that the check silently passes on non-English Windows, accepting files with broad ACLs as secure. The PR description acknowledges this as out-of-scope, and the check is strictly additive (previously Windows returned early unconditionally), so this does not block. Worth noting that the PR body's claim that "SID fallbacks cover most cases" overstates coverage for localized Windows. -
No timeout on
icaclssubprocess (src/infrastructure/security/file_security_check.ts:130-136).
new Deno.Command("icacls", ...).output()runs without anAbortSignal. On a network-mounted path where the domain controller is unreachable,icaclscan hang indefinitely, blocking the vault operation with no user-visible progress. A 10-secondAbortSignal.timeout()would bound the wait and let the fail-closed branch surface a clear error.
Low
-
Remediation hints interpolate
pathinto shell commands (src/infrastructure/security/file_security_check.ts:110-111,177-178).
The POSIX hintRun 'chmod 600 ${path}' to fix.and the Windows hint`icacls "${path}" /remove:g ...`embed the raw path. If the path contains shell metacharacters (spaces, semicolons, backticks), a user copy-pasting the hint could trigger unintended shell expansion. Not a code vulnerability (it's a human-readable message), but quoting the path in the POSIX hint (e.g.,chmod 600 '${path}') would make the suggestion safe to paste. -
TOCTOU between permission check and key read — inherent to the check-then-use design and identical to the prior POSIX behaviour. Not actionable here.
Verdict
PASS. This is well-constructed security-hardening code. The icacls parser is carefully designed (case-insensitive path stripping, :( split anchor, parens-based summary filter, fail-closed on empty ACEs or parse errors). The POSIX path is byte-identical to the prior inline check. The Docker test refactoring is mechanical and correct. The two medium findings are acknowledged limitations, not regressions — the Windows ACL check is strictly additive over the previous return on Windows.
Summary
Stream A of the Windows-GA push (follows #1278's Stream 0 regression net). Two work items.
Work item 1 — Docker driver tests on Windows
Replaces the
#!/bin/shmock-script approach with a TypeScript-based mock launched through a tiny platform-specific shim —.cmdon Windows, chmod-able.shon POSIX, both invokingdeno runagainst the same mock TS source. Un-skips 10 tests that were previously gated onDeno.build.os === "windows". Production driver behavior is unchanged; only the stub binary form changed. Stream 0's POSIX-only SIGTERM and stream-interleaving tests stay gated because they exercise POSIX signal semantics.Work item 2 — Vault NTFS ACL check (Windows-only behavior change)
Adds
src/infrastructure/security/file_security_check.tswithcheckFileNotBroadlyReadable(path):(stat.mode & 0o077) !== 0rule with the identical error message format.Get-Acl, walks each ACE, flags Allow entries that grant Read/ReadAndExecute/FullControl to broad principals (Everyone, Authenticated Users, Anonymous Logon, BUILTIN\Users — plus their SID forms for locale-variant Windows). Surfaces remediation asicacls /remove:g. Fails closed on PowerShell errors, parse failures, or paths containing NUL/CR/LF.Scope is intentionally Option C — basic broad-principal check, not full ACL semantics. The helper does not walk inheritance, honour Deny ACEs, inspect alternate data streams, or resolve nested group membership. The intent is to match the bar the POSIX check already sets.
LocalEncryptionVaultProvider.validateSshKeyPermissionsnow delegates to the helper, prepending"SSH key "to preserve the historical error format that other code/tests parse. Adds parallel Windows-only test cases that grantEveryone:(R)viaicaclsand assert the vault rejects the key.POSIX impact
None. The error string is byte-identical. The control flow is identical. The Stream 0 vault-directory-mode test and
vault_cel_integration_test.tsare both green. macOS/Linux users see no behavior change.Windows impact
Real, intentional, security-positive. The vault now actually checks ACLs on Windows, where it previously did not. Existing Windows vault users with SSH keys that have broad ACL grants (which can be the default on user-profile files in some configurations) may be locked out and see the new error message with
icaclsremediation guidance.Out-of-scope follow-ups
swamp doctor vault— TODO landed insrc/domain/audit/doctor/check.ts. No doctor vault check exists today; deferring to keep this PR scoped. Worth landing soon to give Windows users proactive insight into their ACL state.swamp doctor vaultis the right surface for the disclosure about what the platform check does and doesn't cover.Test Plan
deno fmt --checkclean (1286 files)deno lintclean (1165 files)deno checkcleandeno run test— 5154 passed, 0 failed, 27 ignoreddeno run compileproduces a working binary