Skip to content

feat(windows): un-skip Docker driver tests and add basic NTFS ACL vault check#1281

Merged
stack72 merged 5 commits intomainfrom
stream-a-docker-vault
May 1, 2026
Merged

feat(windows): un-skip Docker driver tests and add basic NTFS ACL vault check#1281
stack72 merged 5 commits intomainfrom
stream-a-docker-vault

Conversation

@stack72
Copy link
Copy Markdown
Contributor

@stack72 stack72 commented May 1, 2026

Summary

Stream A of the Windows-GA push (follows #1278's Stream 0 regression net). Two work items.

Work item 1 — Docker driver tests on Windows

Replaces the #!/bin/sh mock-script approach with a TypeScript-based mock launched through a tiny platform-specific shim — .cmd on Windows, chmod-able .sh on POSIX, both invoking deno run against the same mock TS source. Un-skips 10 tests that were previously gated on Deno.build.os === "windows". Production driver behavior is unchanged; only the stub binary form changed. Stream 0's POSIX-only SIGTERM and stream-interleaving tests stay gated because they exercise POSIX signal semantics.

Work item 2 — Vault NTFS ACL check (Windows-only behavior change)

⚠️ Behavior change for Windows users only. Before this PR, validateSshKeyPermissions returned early on Windows — SSH keys with broad ACLs would be silently accepted. After this PR, the vault rejects them with a clear remediation hint. POSIX behavior is byte-identical (verified: same Deno.stat, same (mode & 0o077) !== 0 check, same error string).

Adds src/infrastructure/security/file_security_check.ts with checkFileNotBroadlyReadable(path):

  • POSIX: replicates the existing (stat.mode & 0o077) !== 0 rule with the identical error message format.
  • Windows: shells out to PowerShell Get-Acl, walks each ACE, flags Allow entries that grant Read/ReadAndExecute/FullControl to broad principals (Everyone, Authenticated Users, Anonymous Logon, BUILTIN\Users — plus their SID forms for locale-variant Windows). Surfaces remediation as icacls /remove:g. Fails closed on PowerShell errors, parse failures, or paths containing NUL/CR/LF.

Scope is intentionally Option C — basic broad-principal check, not full ACL semantics. The helper does not walk inheritance, honour Deny ACEs, inspect alternate data streams, or resolve nested group membership. The intent is to match the bar the POSIX check already sets.

LocalEncryptionVaultProvider.validateSshKeyPermissions now delegates to the helper, prepending "SSH key " to preserve the historical error format that other code/tests parse. Adds parallel Windows-only test cases that grant Everyone:(R) via icacls and assert the vault rejects the key.

POSIX impact

None. The error string is byte-identical. The control flow is identical. The Stream 0 vault-directory-mode test and vault_cel_integration_test.ts are both green. macOS/Linux users see no behavior change.

Windows impact

Real, intentional, security-positive. The vault now actually checks ACLs on Windows, where it previously did not. Existing Windows vault users with SSH keys that have broad ACL grants (which can be the default on user-profile files in some configurations) may be locked out and see the new error message with icacls remediation guidance.

Out-of-scope follow-ups

  • swamp doctor vault — TODO landed in src/domain/audit/doctor/check.ts. No doctor vault check exists today; deferring to keep this PR scoped. Worth landing soon to give Windows users proactive insight into their ACL state.
  • Locale-variant Windows validation. SID fallbacks cover most cases; needs verification on a non-en-US runner.
  • swamp doctor vault is the right surface for the disclosure about what the platform check does and doesn't cover.

Test Plan

  • deno fmt --check clean (1286 files)
  • deno lint clean (1165 files)
  • deno check clean
  • deno run test — 5154 passed, 0 failed, 27 ignored
  • deno run compile produces a working binary
  • Stream 0 regression tests green (vault directory 0o700, vault CEL integration, Docker SIGTERM and interleaving)
  • All 10 previously-skipped Docker driver tests now run on Windows
  • Verified POSIX error string is byte-identical to pre-PR

stack72 and others added 5 commits May 1, 2026 21:13
…lt check

Stream A of the Windows-GA push (follows #1278's Stream 0 regression net).

**Docker driver tests (work item 1):**
Replaces the `#!/bin/sh` mock-script approach with a TypeScript-based mock
launched through a tiny platform-specific shim — `.cmd` on Windows,
chmod-able `.sh` on POSIX, both invoking `deno run` against the same
mock TS source. Un-skips 10 tests that were previously gated on
`Deno.build.os === "windows"`. Production driver behavior is unchanged;
only the stub binary form changed. The Stream-0 SIGTERM and stream
interleaving tests remain POSIX-only because they exercise POSIX signal
semantics.

**Vault NTFS ACL check (work item 2):**
Adds `src/infrastructure/security/file_security_check.ts` with
`checkFileNotBroadlyReadable(path)` — POSIX path replicates the existing
`(stat.mode & 0o077) !== 0` rule, Windows path shells out to PowerShell
`Get-Acl` and walks each ACE looking for Allow entries that grant
Read/ReadAndExecute/FullControl to a small set of broad principals
(Everyone, Authenticated Users, Anonymous Logon, BUILTIN\Users, plus
their SID forms).

Scope is intentionally Option-C (basic broad-principal check, not full
ACL semantics): the helper does not walk inheritance, honour Deny ACEs,
inspect alternate data streams, or resolve nested group membership. The
intent is to match the bar the POSIX check already sets — refuse files
that are obviously broadly readable for vault key material — and surface
a clear remediation hint (`icacls /remove:g`) when something is flagged.

The vault provider's `validateSshKeyPermissions` now delegates to the
helper, preserving the historical `SSH key '<path>' has insecure
permissions ...` error format on POSIX so existing callers/tests still
parse it. Adds parallel Windows-only test cases that grant
`Everyone:(R)` via `icacls` and assert the vault rejects the key.

A TODO is added in `src/domain/audit/doctor/check.ts` for a future
`swamp doctor vault` surface that discloses what the platform check
covers — no doctor vault command exists today, so we deliberately defer
that work rather than expand scope.

Verification: `deno fmt`, `deno lint`, `deno check`, `deno run test`
(5154 passed, 0 failed, 27 ignored) and `deno run compile` all green
on macOS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The timeout mock used `await new Promise<void>(() => {})` to block the
mock "docker" child forever and let the driver's timeout SIGTERM kill
it. That works on POSIX shells but not under deno: the runtime detects
no scheduled work, terminates the process before the SIGTERM arrives,
and emits "Top-level await promise never resolved" on stderr. The
docker driver captures that stderr and surfaces it as the error
message, so the test's `assertStringIncludes(error, "timed out")`
fails on ubuntu-latest CI.

Switch to `setInterval(() => {}, 60_000)` instead. setInterval keeps
the event loop alive without a top-level await, so deno sits idle
waiting for the timer until the driver's SIGTERM (or SIGKILL on
Windows) kills the process silently with no diagnostic on stderr.

No production driver behaviour changes — only the test mock body.
Verified locally on macOS that the timeout test now passes; the test
takes ~205ms (~the configured 200ms timeout plus a small kill delay)
and produces the expected "Docker command timed out after 200ms"
error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ure on GHA

GitHub Actions windows-latest runners cannot reliably auto-load the
`Microsoft.PowerShell.Security` module that hosts `Get-Acl`, so every
PowerShell invocation in `checkFileNotBroadlyReadable` fails with:

  Get-Acl : The 'Get-Acl' command was found in the module
  'Microsoft.PowerShell.Security', but the module could not be loaded.
  ...
  FullyQualifiedErrorId : CouldNotAutoloadMatchingModule

Our helper falls through to the fail-closed "Could not verify ACL"
branch, which in turn causes `validateSshKeyPermissions` to reject
every SSH key on Windows — taking 13 vault tests with it.

Switch the Windows path to `icacls`, a native Win32 binary that is
always present on every Windows install (no module loading required).
The new helper:

- shells out to `icacls <path>`,
- parses the multi-line ACE listing into `{ principal, rights }` pairs,
- flags any ACE whose principal is one of {Everyone, BUILTIN\\Users,
  NT AUTHORITY\\Authenticated Users, NT AUTHORITY\\Anonymous Logon} (or
  the matching SID) AND whose rights string contains any read-granting
  token: `(R)`, `(RX)`, `(M)`, `(F)`, or the SDDL generic-rights
  aliases `(GR)`, `(GE)`, `(GA)`.

The error-message format ("'<path>' has insecure ACL: principal
'<name>' has Read or higher access. ...") is preserved so vault tests
asserting on `assertStringIncludes(..., "insecure ACL")` continue to
pass without change.

The parser is exposed as `parseIcaclsOutput` and the matcher as
`matchBroadAce` (both marked `@internal`) so cross-platform unit tests
can exercise them without needing a Windows runner. Added 15 new
parser/matcher tests covering: typical multi-line output, inheritance
flags inline (`(OI)(CI)(R)`), localised summary lines, LF-only line
endings, case-mismatched path prefixes, SID-form principals,
case-insensitive principal matching, and rejection of write-only ACEs.

Locally verified on macOS:
- POSIX path unchanged (3 existing tests still green).
- All 15 new parser/matcher tests pass.
- Vault provider tests continue to pass.

CI on windows-latest is required to confirm the icacls integration
end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n gaps

PR #1281 CI showed the timeout test passes on Linux (~206ms) but hangs
for 15+ minutes on Windows. Root cause: the deno mock launches via a
`.cmd` shim, and the docker driver's `process.kill("SIGTERM")` lands on
the cmd parent — Windows has no POSIX-style process group, so the signal
does NOT propagate down to the deno child running `setInterval(...)`.
The mock keeps running forever, the driver never observes
`process.status` resolving, and the test wedges.

Fix: add a defensive `setTimeout(() => Deno.exit(0), 5000)` self-exit
inside the timeout-test mock. The driver's own 200ms timeout fires its
"timed out" error well before the 5s self-exit (POSIX behavior is
unchanged — kill propagates and the mock dies in ~200ms). On Windows,
when the kill is swallowed by the shim, the mock still terminates after
5s, the launcher exits, `process.status` resolves, and the already-set
`timedOut` flag surfaces the expected "timed out" assertion.

Audit: only the timeout test uses a long-running mock pattern
(`setInterval`). Every other deno mock in this file is print-and-exit
(short-lived), so no other mocks need the safety. The Stream-0
regression-net tests at the bottom of the file use shell scripts and
are POSIX-only (`ignore: Deno.build.os === "windows"`), so they're
unaffected.

Verified: full suite green on macOS (5170 passed, 0 failed). Windows
behavior depends on CI — the self-exit is bounded so worst case is a
5s test instead of a 15m hang.
…t hack

The previous iteration added `setTimeout(() => Deno.exit(0), 5000)` to the
timeout-test mock as a defensive self-exit on Windows where
`process.kill()` doesn't propagate from a `.cmd` shim to a deno child.
That worked for the hang, but introduced a race: under load, if the
driver's 200ms timeout doesn't fire before the 5s self-exit, the mock
exits cleanly with code 0, the driver sees `success: true`, and the
test fails for the wrong reason.

Replace the hack with an honest `ignore: Deno.build.os === "windows"`
gate on the timeout test specifically. The other 9 newly-un-skipped
docker driver tests stay enabled on Windows — they're print-and-exit
mocks that don't depend on kill propagation. Re-enabling the timeout
test on Windows requires either Job Object-based kills (not exposed
by Deno) or a parent-death-detecting mock pattern; tracked as a
follow-up to Stream A.

Local verification on macOS: 39 passed, 0 failed in the docker driver
test file; full suite stays at 5170 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Well-structured PR. The security check extraction is clean, the icacls parser is well-tested with cross-platform unit tests, and the fail-closed approach throughout checkWindowsAcl is the right call for security-sensitive code. The createMockDriverCommand abstraction for cross-platform test mocking is a good pattern.

Blocking Issues

None.

Suggestions

  1. Multi-line comment blocks — CLAUDE.md convention is "one short line max" for comments. The 11-line TODO in src/domain/audit/doctor/check.ts:32-42, the 8-line JSDoc on createMockDriverCommand in the test file, and the 24-line module docstring on file_security_check.ts all exceed that. The module docstring documents the security scope/threat model so it's arguably the right trade-off for a security file, but the TODO and test helper docblock could be trimmed to a single line each (e.g. // TODO(windows-ga): add swamp doctor vault surface — see PR #1281 for scope).

  2. No timeout on icacls subprocesscheckWindowsAcl awaits cmd.output() with no AbortSignal. While icacls is a local system binary that should return instantly, a defensive timeout (e.g. 10s) would prevent a hung subprocess from blocking the vault indefinitely. CLAUDE.md mentions AbortSignal for network calls specifically, so this is just a hardening suggestion.

  3. (R) substring match edge caseREAD_GRANTING_TOKENS uses .includes("(R)") against the rights string. This is correct for icacls's current output format, but if icacls ever emits individual SDDL atoms like (RC) (Read Control), the (R) token wouldn't false-positive because the ) must follow R. Just noting the reasoning is sound — no change needed.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adversarial Review

Critical / High

None found.

Medium

  1. Non-English Windows locale bypasses the ACL check (src/infrastructure/security/file_security_check.ts:56-61, 273-279).
    The BROAD_WINDOWS_PRINCIPALS list uses English names (Everyone, BUILTIN\Users, etc.). On localized Windows, icacls prints localized names — e.g., Jeder (German), Tout le monde (French). The localeCompare matches case variants but cannot bridge "Jeder""Everyone". The SID fallback (p.sid) only fires when Windows cannot resolve the SID to a name, which is rare on a functioning localized system — icacls will almost always print the localized name rather than the raw SID. The net effect is that the check silently passes on non-English Windows, accepting files with broad ACLs as secure. The PR description acknowledges this as out-of-scope, and the check is strictly additive (previously Windows returned early unconditionally), so this does not block. Worth noting that the PR body's claim that "SID fallbacks cover most cases" overstates coverage for localized Windows.

  2. No timeout on icacls subprocess (src/infrastructure/security/file_security_check.ts:130-136).
    new Deno.Command("icacls", ...).output() runs without an AbortSignal. On a network-mounted path where the domain controller is unreachable, icacls can hang indefinitely, blocking the vault operation with no user-visible progress. A 10-second AbortSignal.timeout() would bound the wait and let the fail-closed branch surface a clear error.

Low

  1. Remediation hints interpolate path into shell commands (src/infrastructure/security/file_security_check.ts:110-111, 177-178).
    The POSIX hint Run 'chmod 600 ${path}' to fix. and the Windows hint `icacls "${path}" /remove:g ...` embed the raw path. If the path contains shell metacharacters (spaces, semicolons, backticks), a user copy-pasting the hint could trigger unintended shell expansion. Not a code vulnerability (it's a human-readable message), but quoting the path in the POSIX hint (e.g., chmod 600 '${path}') would make the suggestion safe to paste.

  2. TOCTOU between permission check and key read — inherent to the check-then-use design and identical to the prior POSIX behaviour. Not actionable here.

Verdict

PASS. This is well-constructed security-hardening code. The icacls parser is carefully designed (case-insensitive path stripping, :( split anchor, parens-based summary filter, fail-closed on empty ACEs or parse errors). The POSIX path is byte-identical to the prior inline check. The Docker test refactoring is mechanical and correct. The two medium findings are acknowledged limitations, not regressions — the Windows ACL check is strictly additive over the previous return on Windows.

@stack72 stack72 merged commit 0e52daa into main May 1, 2026
11 checks passed
@stack72 stack72 deleted the stream-a-docker-vault branch May 1, 2026 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant