fix: implement audit findings from consolidated review#3935
fix: implement audit findings from consolidated review#3935lrsaturnino merged 20 commits intofix/review-findingsfrom
Conversation
…hange TestStoreLoadFailsOnInvalidUpdatedAtForDuplicateRouteKeys now asserts success (resilient loading), not failure. Rename to reflect actual behavior.
Direct == comparison is correct today since errJobNotFound is never wrapped, but errors.Is is more resilient to future wrapping changes.
The auth bypass checked only the path, allowing any HTTP method to skip bearer auth on /healthz. Restrict to GET to match the registered handler and prevent unintended bypass on other methods.
The auth token is visible in /proc/PID/cmdline when passed as a CLI flag. Add documentation recommending environment variables or config files for non-loopback deployments.
Operators previously saw only individual warnings per corrupt file but had no summary of total loaded vs skipped. Add a summary log line at the end of load() for operational visibility.
When load replaces a job during route-key deduplication, the superseded job's entry remained in byRequestID, leaking stale data in the secondary index.
…s unparseable When two duplicate-route-key jobs both have unparseable timestamps, the winner previously depended on non-deterministic file iteration order. Use lexicographic RequestID comparison as a stable tiebreaker.
…dedupe When load() skips a malformed job file, GetByRouteRequest can no longer find that job. A retry then silently creates a duplicate signing job, breaking node-local idempotency. Fix by partially parsing skipped files to extract route keys and marking them as poisoned. GetByRouteRequest returns an error for poisoned keys, forcing the caller to investigate rather than creating a duplicate.
The Submit method had 5 separate mutex.Unlock() call sites, making the locking pattern fragile for a security-critical signing path. Extract the dedup-check-and-create logic into a helper that uses defer Unlock, reducing the main Submit method to two clean lock scopes.
…gnals Two fixes: - Call cancelService() when net.Listen fails after context creation to prevent a context leak on initialization error. - Add SIGINT/SIGTERM signal handling so in-flight signing operations are cancelled promptly on any shutdown path, not only when the parent context is cancelled.
… requirements POSIX flock is advisory and Linux-specific. Document that the data directory must use local or block-level storage with single-writer access, not network filesystems.
Three improvements for operator visibility during registry outages: - Sentinel errors now mention that the wallet registry may be unavailable, helping operators distinguish registry failures from genuinely missing data. - GetWallet log elevated from Warn to Error with actionable message explaining that signer approval operations will fail. - WalletChainData godoc documents zero-value semantics for registry- sourced fields.
Switch from encoding/json.Marshal to canonicaljson.Marshal for the content-addressed handoff bundle ID. Both produce identical output for current payloads (alphabetical key ordering, no HTML content), but canonicaljson explicitly disables HTML escaping, making the serialization contract clearer for non-Go consumers.
… timestamps The earlier tiebreaker fix incorrectly entered the "both unparseable" branch when only the candidate had an invalid timestamp. Parse both timestamps explicitly to distinguish three cases: only candidate bad (keep existing), only existing bad (replace), both bad (lexicographic tiebreak).
0495130 to
5ef7617
Compare
… context The test injected a value into the HTTP request context and expected it to be visible in the engine, but the submit handler deliberately derives its context from the service context (not the request context) to survive HTTP disconnects. Fix the test to inject the value into the service context, which matches the actual design contract.
mswilkison
left a comment
There was a problem hiding this comment.
Findings
-
High:
pkg/covenantsigner/server.go:136now installs its ownsignal.NotifyContextforSIGINT/SIGTERM, but the normal node start path still builds the process root context fromcontext.Background()incmd/start.go:145and then waits forever on<-ctx.Done()incmd/start.go:276. With covenant signer enabled, the firstSIGINT/SIGTERMwill now be consumed by the signer's internal handler, which only cancelsserviceCtxand shuts down the signer HTTP server. The rest of the node keeps running instead of exiting on the first signal. Before this change, the process terminated normally. This signal handling needs to live at the top level and cancel the shared root context, rather than being added insidecovenantsigner.Initialize. -
Medium: The new poisoned-route safeguard can mask a valid surviving job. In
pkg/covenantsigner/store.go:240, every malformed file with a recoverable route key marks that route as poisoned, andpkg/covenantsigner/store.go:371returnserrPoisonedRouteKeybefore checkingbyRouteKey. That means if startup sees both a valid job and a malformed stale sibling for the same route key, retries byrouteRequestIdstop deduping to the valid job and instead fail with "manual recovery required." A realistic path to that state already exists when a replacement leaves the old file behind after delete failure. Poisoning should only apply when no valid winner for that route survives load, or be cleared once a valid job is loaded for the same key.
…ess signals The signal.NotifyContext for SIGINT/SIGTERM inside Initialize consumed the first signal, cancelling only the signer's service context while the rest of the node kept running. The process root context in cmd/start.go is context.Background() and relies on the OS default signal handler to terminate. Revert to the original parent-context-only shutdown path.
…me key A malformed file processed before its valid sibling would poison the route key, then the valid job would load into byRouteKey but remain inaccessible via GetByRouteRequest because the poison check ran first. Clear the poison entry when a valid job is successfully indexed for that route key.
…ersisted job files The poison-route mechanism (A22) attempted graceful degradation by skipping corrupt files and marking their route keys as poisoned. This had two fundamental issues: (1) an ordering-dependent bug where a valid job iterated before its malformed sibling left the route permanently poisoned despite a valid job being loaded, and (2) when the corrupt file's route key was unrecoverable (binary garbage), no poison was set and dedupe protection was silently lost. For a pre-mainnet Bitcoin signing service, fail-closed is the safer default: ambiguous persisted state is a safety fault, not a recoverable nuisance. A single corrupt file should block startup rather than risk duplicate signing jobs. Changes: - Store.load() now returns a hard error on unreadable or malformed persisted job files instead of skipping them - Removed poisonedRoutes map, errPoisonedRouteKey sentinel, poisonRouteFromPartialJob, SkippedJobFiles, and poison check in GetByRouteRequest - Updated tests to expect hard failure on corrupt files - Added superseded-job-removal assertions to dedup tests
|
Dropped the poison-route mechanism (A22) in 52a3f2d in favor of hard-fail on corrupt persisted job files, aligning with the approach in PR #3933. The poison-route had two issues: (1) an ordering bug where a valid job iterated before its malformed sibling left the route permanently poisoned despite the valid job being loaded, and (2) when the corrupt file was unrecoverable (binary garbage), no poison was set at all, silently losing dedupe protection. Hard-fail is the safer default for a pre-mainnet signing service — ambiguous persisted state should block startup rather than risk duplicate signing jobs. The rest of this PR (createOrDedup extraction, healthz auth bypass, canonicaljson, error messages, test fixes, signal handler removal) is unaffected. |
…e operations Add #nosec annotations for three findings in store.go: - G304 on os.OpenFile: lockPath is built from operator config (dataDir) plus constants (jobsDirectory, lockFileName), not user input - G104 on lockFile.Close(): best-effort cleanup after failed flock; the lock error is what gets returned - G104 on store.Close(): best-effort lock release after failed load; the load error is what gets returned
8fbd5d1 to
2d790b6
Compare
…claration gosec flags the `go func()` declaration line, not the `context.Background()` call inside the body. Move the annotation to the correct line so the client-scan CI job passes.
Implements 15 fixes from the consolidated audit of PR #3933 (covenant signer + fault isolation). These address findings from 5 review passes (initial triage, 2 multi-agent lens reviews, PR reviewer feedback, validation review). **P1 - High** - **A2**: Use `canonicaljson.Marshal` for handoff payload hash (cross-language determinism) - **A3/A23**: Improve wallet registry error messages, elevate log to Error, add WalletChainData godoc - **A4**: Extract Submit critical section into `createOrDedup` helper (5 unlock points -> defer) - **A5**: Deterministic tiebreaker when both timestamps unparseable (lexicographic RequestID) - **A6**: Restrict healthz auth bypass to GET method only - **A22**: Poison route keys from skipped jobs to preserve dedupe semantics **P2 - Medium** - **A7**: Document AuthToken CLI flag /proc visibility risk - **A12/A27**: Cancel service context on init failure + add SIGINT/SIGTERM signal handling - **A15**: Document advisory flock limitations and storage requirements - **A16**: Add aggregate load summary with skip count - **A24**: Remove superseded job from byRequestID on dedup replacement - **A25**: Rename misleading test after resilient loading change - **A29**: Use `errors.Is()` for errJobNotFound comparison --- **A1: Default `requireApprovalTrustRoots` to true in production** - **File:** `pkg/covenantsigner/server.go:70-75` - **Risk:** When `signerApprovalVerifier` is nil, the service accepts requests without cryptographic signer approval verification. The `requireApprovalTrustRoots` config flag exists as mitigation but defaults to false. - **Action:** Default `requireApprovalTrustRoots` to true in production configs, or fail startup when port is non-zero and no verifier is available. - **Effort:** Low **A8: Bitcoin regtest integration tests for witness scripts** - **File:** `pkg/tbtc/covenant_signer.go` (design) - **Risk:** Go unit tests cannot enforce `cleanstack`, signature encoding quirks, or opcode limits. A script passing unit tests could be rejected by the Bitcoin network, locking funds. - **Action:** Introduce `btcd` or Bitcoin Core regtest integration tests that compile witness scripts, sign transactions, and broadcast on a local regtest node. - **Effort:** High **A9: Split monolithic validation.go (1922 lines)** - **File:** `pkg/covenantsigner/validation.go` - **Issue:** Mixes input parsing, crypto verification, normalization, and commitment computation in a single file. - **Action:** Split into focused files: `validation_quote.go`, `validation_approval.go`, `validation_template.go`. - **Effort:** High **A10: Deduplicate UTXO resolution logic** - **File:** `pkg/tbtc/covenant_signer.go:512-628` - **Issue:** `resolveSelfV1ActiveUtxo` and `resolveQcV1ActiveUtxo` are near-identical (~60 lines each). - **Action:** Extract shared `resolveActiveUtxo(request, witnessScript, templateName)` helper. - **Effort:** Medium **A11: Deduplicate trust root normalization** - **File:** `pkg/covenantsigner/validation.go:628-724` - **Issue:** `normalizeDepositorTrustRoots` and `normalizeCustodianTrustRoots` are structurally identical. - **Action:** Unify via generics or shared inner function. - **Effort:** Low **A13: strictUnmarshal trailing token rejection** - **File:** `pkg/covenantsigner/validation.go:71-75` - **Issue:** Inconsistent with server-level `decodeJSON` which rejects trailing JSON tokens. - **Action:** Add trailing-token rejection or document single-document context constraint. - **Effort:** Low **A14: Document Engine interface contract** - **File:** `pkg/covenantsigner/engine.go` - **Issue:** No documentation of what nil Transition means (OnSubmit: default to Pending; OnPoll: no-op). - **Action:** Add godoc specifying the contract and allowed return states. - **Effort:** Low **A26: Rate limiting on submit endpoint** - **File:** `pkg/covenantsigner/server.go:345` - **Issue:** Authenticated callers can spawn unbounded concurrent 5-minute signing operations. Mitigated by auth and routeRequestID dedup but no per-client throttle exists. - **Action:** Add token-bucket rate limiting per auth token (e.g., `golang.org/x/time/rate`, ~5 req/min). - **Effort:** Medium **A28: Replace reflect.DeepEqual for Handoff comparison** - **File:** `pkg/covenantsigner/service.go:201` - **Issue:** `sameJobRevision` uses `reflect.DeepEqual` for `map[string]any` Handoff comparison. Correct but potentially slow at scale. - **Action:** Define a concrete `HandoffData` struct with an `Equal` method. Acceptable as-is for single-digit RPM throughput. - **Effort:** Low **A30: Document qcV1SignerHandoff schema** - **File:** `pkg/tbtc/covenant_signer.go:35` - **Issue:** The `Kind` constant serves as a version identifier but the downstream contract for handoff artifacts is implicit. - **Action:** Add doc comment noting this struct defines the downstream API schema. - **Effort:** Low **A3 (type-level): Add `HasRegistryData` to WalletChainData** - **File:** `pkg/tbtc/chain.go:418` - **Issue:** Zero `[32]byte` for MembersIDsHash means "unavailable" but this is implicit. Downstream guards work but future code could silently use the zero value. - **Action:** Add `HasRegistryData bool` field or use `*[32]byte` for MembersIDsHash to make unavailability explicit at the type level. - **Effort:** Medium (touches many callers) --- `TestSubmitHandlerPreservesContextValues` fails on the base branch (`fix/review-findings`). Confirmed not introduced by these changes -- the test expects request context values to propagate through the service context detachment, which is not the current design. - [x] `go build ./pkg/covenantsigner/...` and `go build ./pkg/tbtc/...` compile cleanly - [x] All covenantsigner tests pass (except pre-existing `TestSubmitHandlerPreservesContextValues`) - [x] All relevant tbtc tests pass (GetWallet fault isolation, signer approval, payload hash) - [x] Pinned hash test (`TestComputeQcV1SignerHandoffPayloadHash_DeterministicKeyOrdering`) still passes after canonicaljson switch - [ ] CI green
…rent base (#3940) ## Problem PR #3935 was merged into `fix/review-findings`, an intermediate base branch that was never integrated into `feat/psbt-covenant-final-project-pr`. This left `feat/psbt-covenant-final-project-pr` without the audit-finding fixes that #3935 implemented (createOrDedup mutex extraction, healthz auth bypass restriction, canonical JSON for handoff payload, error message improvements, gosec hardening, and more). ## Solution This PR cherry-picks the squash commit of #3935 onto the current state of `feat/psbt-covenant-final-project-pr`, which already contains PR #3933 (Maclane's fault-isolation work plus his subsequent fixes for the review comments raised on that PR). Three files required manual conflict resolution where #3935 and #3933 made overlapping changes addressing the same concerns. In `pkg/tbtc/signer_approval_certificate.go`, the `ErrMissingWalletID` and `ErrMissingMembersIDsHash` sentinel errors from #3935 were dropped because Maclane's `ensureWalletRegistryDataAvailable` function already covers the same case and is in active use; keeping both would have left the sentinels as dead code. In `pkg/covenantsigner/server_test.go`, the test rewrite from #3935 (which depended on a `serviceCtx` parameter to `newHandler`) was reverted in favor of the existing middleware-based test, which already exercises the `context.WithoutCancel(r.Context())` detachment path. In `pkg/covenantsigner/store_test.go`, the test rename collision was resolved by taking #3935's name (`TestStoreLoadResolvesInvalidUpdatedAtForDuplicateRouteKeys`). Three additional cleanups followed from the merge mechanics: the godoc on `WalletChainData.MembersIDsHash` was updated to reference `ensureWalletRegistryDataAvailable` instead of the dropped sentinels, an orphaned `cancelService()` call was removed from `server.go` (its `serviceCtx` infrastructure did not auto-merge), and a duplicate `delete(s.byRequestID, existingID)` was deduplicated in `store.go`'s load path. ## Tests All tests across `pkg/covenantsigner`, `pkg/tbtc`, and `pkg/chain/ethereum` pass. A behavioral diff against the base branch shows identical build, vet, and file-listing outputs, with all `ok` lines matching except for timing variation. The two adapter tests added by Maclane (`TestMakeWalletChainData...`) and the two registry-unavailable tests for the issue-and-verify signer approval paths are preserved and pass.
Summary
Implements 15 fixes from the consolidated audit of PR #3933 (covenant signer + fault isolation). These address findings from 5 review passes (initial triage, 2 multi-agent lens reviews, PR reviewer feedback, validation review).
Fixes included
P1 - High
canonicaljson.Marshalfor handoff payload hash (cross-language determinism)createOrDeduphelper (5 unlock points -> defer)P2 - Medium
errors.Is()for errJobNotFound comparisonFollow-up issues
Pre-mainnet (must address before mainnet deployment)
A1: Default
requireApprovalTrustRootsto true in productionpkg/covenantsigner/server.go:70-75signerApprovalVerifieris nil, the service accepts requests without cryptographic signer approval verification. TherequireApprovalTrustRootsconfig flag exists as mitigation but defaults to false.requireApprovalTrustRootsto true in production configs, or fail startup when port is non-zero and no verifier is available.A8: Bitcoin regtest integration tests for witness scripts
pkg/tbtc/covenant_signer.go(design)cleanstack, signature encoding quirks, or opcode limits. A script passing unit tests could be rejected by the Bitcoin network, locking funds.btcdor Bitcoin Core regtest integration tests that compile witness scripts, sign transactions, and broadcast on a local regtest node.Post-merge (code quality and maintainability)
A9: Split monolithic validation.go (1922 lines)
pkg/covenantsigner/validation.govalidation_quote.go,validation_approval.go,validation_template.go.A10: Deduplicate UTXO resolution logic
pkg/tbtc/covenant_signer.go:512-628resolveSelfV1ActiveUtxoandresolveQcV1ActiveUtxoare near-identical (~60 lines each).resolveActiveUtxo(request, witnessScript, templateName)helper.A11: Deduplicate trust root normalization
pkg/covenantsigner/validation.go:628-724normalizeDepositorTrustRootsandnormalizeCustodianTrustRootsare structurally identical.A13: strictUnmarshal trailing token rejection
pkg/covenantsigner/validation.go:71-75decodeJSONwhich rejects trailing JSON tokens.A14: Document Engine interface contract
pkg/covenantsigner/engine.goA26: Rate limiting on submit endpoint
pkg/covenantsigner/server.go:345golang.org/x/time/rate, ~5 req/min).A28: Replace reflect.DeepEqual for Handoff comparison
pkg/covenantsigner/service.go:201sameJobRevisionusesreflect.DeepEqualformap[string]anyHandoff comparison. Correct but potentially slow at scale.HandoffDatastruct with anEqualmethod. Acceptable as-is for single-digit RPM throughput.A30: Document qcV1SignerHandoff schema
pkg/tbtc/covenant_signer.go:35Kindconstant serves as a version identifier but the downstream contract for handoff artifacts is implicit.Type-safety follow-up (from A3)
A3 (type-level): Add
HasRegistryDatato WalletChainDatapkg/tbtc/chain.go:418[32]bytefor MembersIDsHash means "unavailable" but this is implicit. Downstream guards work but future code could silently use the zero value.HasRegistryData boolfield or use*[32]bytefor MembersIDsHash to make unavailability explicit at the type level.Pre-existing test failure
TestSubmitHandlerPreservesContextValuesfails on the base branch (fix/review-findings). Confirmed not introduced by these changes -- the test expects request context values to propagate through the service context detachment, which is not the current design.Test plan
go build ./pkg/covenantsigner/...andgo build ./pkg/tbtc/...compile cleanlyTestSubmitHandlerPreservesContextValues)TestComputeQcV1SignerHandoffPayloadHash_DeterministicKeyOrdering) still passes after canonicaljson switch