Skip to content

fix: implement audit findings from consolidated review#3935

Merged
lrsaturnino merged 20 commits intofix/review-findingsfrom
fix/audit-findings
Apr 16, 2026
Merged

fix: implement audit findings from consolidated review#3935
lrsaturnino merged 20 commits intofix/review-findingsfrom
fix/audit-findings

Conversation

@piotr-roslaniec
Copy link
Copy Markdown
Collaborator

@piotr-roslaniec piotr-roslaniec commented Apr 9, 2026

Summary

Implements 15 fixes from the consolidated audit of PR #3933 (covenant signer + fault isolation). These address findings from 5 review passes (initial triage, 2 multi-agent lens reviews, PR reviewer feedback, validation review).

Fixes included

P1 - High

  • A2: Use canonicaljson.Marshal for handoff payload hash (cross-language determinism)
  • A3/A23: Improve wallet registry error messages, elevate log to Error, add WalletChainData godoc
  • A4: Extract Submit critical section into createOrDedup helper (5 unlock points -> defer)
  • A5: Deterministic tiebreaker when both timestamps unparseable (lexicographic RequestID)
  • A6: Restrict healthz auth bypass to GET method only
  • A22: Poison route keys from skipped jobs to preserve dedupe semantics

P2 - Medium

  • A7: Document AuthToken CLI flag /proc visibility risk
  • A12/A27: Cancel service context on init failure + add SIGINT/SIGTERM signal handling
  • A15: Document advisory flock limitations and storage requirements
  • A16: Add aggregate load summary with skip count
  • A24: Remove superseded job from byRequestID on dedup replacement
  • A25: Rename misleading test after resilient loading change
  • A29: Use errors.Is() for errJobNotFound comparison

Follow-up issues

Pre-mainnet (must address before mainnet deployment)

A1: Default requireApprovalTrustRoots to true in production

  • File: pkg/covenantsigner/server.go:70-75
  • Risk: When signerApprovalVerifier is nil, the service accepts requests without cryptographic signer approval verification. The requireApprovalTrustRoots config flag exists as mitigation but defaults to false.
  • Action: Default requireApprovalTrustRoots to true in production configs, or fail startup when port is non-zero and no verifier is available.
  • Effort: Low

A8: Bitcoin regtest integration tests for witness scripts

  • File: pkg/tbtc/covenant_signer.go (design)
  • Risk: Go unit tests cannot enforce cleanstack, signature encoding quirks, or opcode limits. A script passing unit tests could be rejected by the Bitcoin network, locking funds.
  • Action: Introduce btcd or Bitcoin Core regtest integration tests that compile witness scripts, sign transactions, and broadcast on a local regtest node.
  • Effort: High

Post-merge (code quality and maintainability)

A9: Split monolithic validation.go (1922 lines)

  • File: pkg/covenantsigner/validation.go
  • Issue: Mixes input parsing, crypto verification, normalization, and commitment computation in a single file.
  • Action: Split into focused files: validation_quote.go, validation_approval.go, validation_template.go.
  • Effort: High

A10: Deduplicate UTXO resolution logic

  • File: pkg/tbtc/covenant_signer.go:512-628
  • Issue: resolveSelfV1ActiveUtxo and resolveQcV1ActiveUtxo are near-identical (~60 lines each).
  • Action: Extract shared resolveActiveUtxo(request, witnessScript, templateName) helper.
  • Effort: Medium

A11: Deduplicate trust root normalization

  • File: pkg/covenantsigner/validation.go:628-724
  • Issue: normalizeDepositorTrustRoots and normalizeCustodianTrustRoots are structurally identical.
  • Action: Unify via generics or shared inner function.
  • Effort: Low

A13: strictUnmarshal trailing token rejection

  • File: pkg/covenantsigner/validation.go:71-75
  • Issue: Inconsistent with server-level decodeJSON which rejects trailing JSON tokens.
  • Action: Add trailing-token rejection or document single-document context constraint.
  • Effort: Low

A14: Document Engine interface contract

  • File: pkg/covenantsigner/engine.go
  • Issue: No documentation of what nil Transition means (OnSubmit: default to Pending; OnPoll: no-op).
  • Action: Add godoc specifying the contract and allowed return states.
  • Effort: Low

A26: Rate limiting on submit endpoint

  • File: pkg/covenantsigner/server.go:345
  • Issue: Authenticated callers can spawn unbounded concurrent 5-minute signing operations. Mitigated by auth and routeRequestID dedup but no per-client throttle exists.
  • Action: Add token-bucket rate limiting per auth token (e.g., golang.org/x/time/rate, ~5 req/min).
  • Effort: Medium

A28: Replace reflect.DeepEqual for Handoff comparison

  • File: pkg/covenantsigner/service.go:201
  • Issue: sameJobRevision uses reflect.DeepEqual for map[string]any Handoff comparison. Correct but potentially slow at scale.
  • Action: Define a concrete HandoffData struct with an Equal method. Acceptable as-is for single-digit RPM throughput.
  • Effort: Low

A30: Document qcV1SignerHandoff schema

  • File: pkg/tbtc/covenant_signer.go:35
  • Issue: The Kind constant serves as a version identifier but the downstream contract for handoff artifacts is implicit.
  • Action: Add doc comment noting this struct defines the downstream API schema.
  • Effort: Low

Type-safety follow-up (from A3)

A3 (type-level): Add HasRegistryData to WalletChainData

  • File: pkg/tbtc/chain.go:418
  • Issue: Zero [32]byte for MembersIDsHash means "unavailable" but this is implicit. Downstream guards work but future code could silently use the zero value.
  • Action: Add HasRegistryData bool field or use *[32]byte for MembersIDsHash to make unavailability explicit at the type level.
  • Effort: Medium (touches many callers)

Pre-existing test failure

TestSubmitHandlerPreservesContextValues fails on the base branch (fix/review-findings). Confirmed not introduced by these changes -- the test expects request context values to propagate through the service context detachment, which is not the current design.

Test plan

  • go build ./pkg/covenantsigner/... and go build ./pkg/tbtc/... compile cleanly
  • All covenantsigner tests pass (except pre-existing TestSubmitHandlerPreservesContextValues)
  • All relevant tbtc tests pass (GetWallet fault isolation, signer approval, payload hash)
  • Pinned hash test (TestComputeQcV1SignerHandoffPayloadHash_DeterministicKeyOrdering) still passes after canonicaljson switch
  • CI green

…hange

TestStoreLoadFailsOnInvalidUpdatedAtForDuplicateRouteKeys now asserts
success (resilient loading), not failure. Rename to reflect actual
behavior.
Direct == comparison is correct today since errJobNotFound is never
wrapped, but errors.Is is more resilient to future wrapping changes.
The auth bypass checked only the path, allowing any HTTP method to
skip bearer auth on /healthz. Restrict to GET to match the registered
handler and prevent unintended bypass on other methods.
The auth token is visible in /proc/PID/cmdline when passed as a CLI
flag. Add documentation recommending environment variables or config
files for non-loopback deployments.
Operators previously saw only individual warnings per corrupt file but
had no summary of total loaded vs skipped. Add a summary log line at
the end of load() for operational visibility.
When load replaces a job during route-key deduplication, the superseded
job's entry remained in byRequestID, leaking stale data in the
secondary index.
…s unparseable

When two duplicate-route-key jobs both have unparseable timestamps, the
winner previously depended on non-deterministic file iteration order.
Use lexicographic RequestID comparison as a stable tiebreaker.
…dedupe

When load() skips a malformed job file, GetByRouteRequest can no longer
find that job. A retry then silently creates a duplicate signing job,
breaking node-local idempotency. Fix by partially parsing skipped files
to extract route keys and marking them as poisoned. GetByRouteRequest
returns an error for poisoned keys, forcing the caller to investigate
rather than creating a duplicate.
The Submit method had 5 separate mutex.Unlock() call sites, making the
locking pattern fragile for a security-critical signing path. Extract
the dedup-check-and-create logic into a helper that uses defer Unlock,
reducing the main Submit method to two clean lock scopes.
…gnals

Two fixes:
- Call cancelService() when net.Listen fails after context creation to
  prevent a context leak on initialization error.
- Add SIGINT/SIGTERM signal handling so in-flight signing operations are
  cancelled promptly on any shutdown path, not only when the parent
  context is cancelled.
… requirements

POSIX flock is advisory and Linux-specific. Document that the data
directory must use local or block-level storage with single-writer
access, not network filesystems.
Three improvements for operator visibility during registry outages:
- Sentinel errors now mention that the wallet registry may be
  unavailable, helping operators distinguish registry failures from
  genuinely missing data.
- GetWallet log elevated from Warn to Error with actionable message
  explaining that signer approval operations will fail.
- WalletChainData godoc documents zero-value semantics for registry-
  sourced fields.
Switch from encoding/json.Marshal to canonicaljson.Marshal for the
content-addressed handoff bundle ID. Both produce identical output for
current payloads (alphabetical key ordering, no HTML content), but
canonicaljson explicitly disables HTML escaping, making the
serialization contract clearer for non-Go consumers.
… timestamps

The earlier tiebreaker fix incorrectly entered the "both unparseable"
branch when only the candidate had an invalid timestamp. Parse both
timestamps explicitly to distinguish three cases: only candidate bad
(keep existing), only existing bad (replace), both bad (lexicographic
tiebreak).
… context

The test injected a value into the HTTP request context and expected it
to be visible in the engine, but the submit handler deliberately
derives its context from the service context (not the request context)
to survive HTTP disconnects. Fix the test to inject the value into the
service context, which matches the actual design contract.
Copy link
Copy Markdown

@mswilkison mswilkison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • High: pkg/covenantsigner/server.go:136 now installs its own signal.NotifyContext for SIGINT/SIGTERM, but the normal node start path still builds the process root context from context.Background() in cmd/start.go:145 and then waits forever on <-ctx.Done() in cmd/start.go:276. With covenant signer enabled, the first SIGINT/SIGTERM will now be consumed by the signer's internal handler, which only cancels serviceCtx and shuts down the signer HTTP server. The rest of the node keeps running instead of exiting on the first signal. Before this change, the process terminated normally. This signal handling needs to live at the top level and cancel the shared root context, rather than being added inside covenantsigner.Initialize.

  • Medium: The new poisoned-route safeguard can mask a valid surviving job. In pkg/covenantsigner/store.go:240, every malformed file with a recoverable route key marks that route as poisoned, and pkg/covenantsigner/store.go:371 returns errPoisonedRouteKey before checking byRouteKey. That means if startup sees both a valid job and a malformed stale sibling for the same route key, retries by routeRequestId stop deduping to the valid job and instead fail with "manual recovery required." A realistic path to that state already exists when a replacement leaves the old file behind after delete failure. Poisoning should only apply when no valid winner for that route survives load, or be cleared once a valid job is loaded for the same key.

piotr-roslaniec and others added 3 commits April 9, 2026 17:46
…ess signals

The signal.NotifyContext for SIGINT/SIGTERM inside Initialize consumed
the first signal, cancelling only the signer's service context while
the rest of the node kept running. The process root context in
cmd/start.go is context.Background() and relies on the OS default
signal handler to terminate. Revert to the original parent-context-only
shutdown path.
…me key

A malformed file processed before its valid sibling would poison the
route key, then the valid job would load into byRouteKey but remain
inaccessible via GetByRouteRequest because the poison check ran first.
Clear the poison entry when a valid job is successfully indexed for that
route key.
…ersisted job files

The poison-route mechanism (A22) attempted graceful degradation by
skipping corrupt files and marking their route keys as poisoned. This
had two fundamental issues: (1) an ordering-dependent bug where a valid
job iterated before its malformed sibling left the route permanently
poisoned despite a valid job being loaded, and (2) when the corrupt
file's route key was unrecoverable (binary garbage), no poison was set
and dedupe protection was silently lost.

For a pre-mainnet Bitcoin signing service, fail-closed is the safer
default: ambiguous persisted state is a safety fault, not a recoverable
nuisance. A single corrupt file should block startup rather than risk
duplicate signing jobs.

Changes:
- Store.load() now returns a hard error on unreadable or malformed
  persisted job files instead of skipping them
- Removed poisonedRoutes map, errPoisonedRouteKey sentinel,
  poisonRouteFromPartialJob, SkippedJobFiles, and poison check in
  GetByRouteRequest
- Updated tests to expect hard failure on corrupt files
- Added superseded-job-removal assertions to dedup tests
Copy link
Copy Markdown
Member

@lrsaturnino lrsaturnino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lrsaturnino
Copy link
Copy Markdown
Member

Dropped the poison-route mechanism (A22) in 52a3f2d in favor of hard-fail on corrupt persisted job files, aligning with the approach in PR #3933.

The poison-route had two issues: (1) an ordering bug where a valid job iterated before its malformed sibling left the route permanently poisoned despite the valid job being loaded, and (2) when the corrupt file was unrecoverable (binary garbage), no poison was set at all, silently losing dedupe protection. Hard-fail is the safer default for a pre-mainnet signing service — ambiguous persisted state should block startup rather than risk duplicate signing jobs.

The rest of this PR (createOrDedup extraction, healthz auth bypass, canonicaljson, error messages, test fixes, signal handler removal) is unaffected.

…e operations

Add #nosec annotations for three findings in store.go:
- G304 on os.OpenFile: lockPath is built from operator config (dataDir)
  plus constants (jobsDirectory, lockFileName), not user input
- G104 on lockFile.Close(): best-effort cleanup after failed flock;
  the lock error is what gets returned
- G104 on store.Close(): best-effort lock release after failed load;
  the load error is what gets returned
…claration

gosec flags the `go func()` declaration line, not the
`context.Background()` call inside the body. Move the annotation to
the correct line so the client-scan CI job passes.
@lrsaturnino lrsaturnino merged commit 0c3d950 into fix/review-findings Apr 16, 2026
17 checks passed
@lrsaturnino lrsaturnino deleted the fix/audit-findings branch April 16, 2026 04:09
lrsaturnino added a commit that referenced this pull request Apr 16, 2026
Implements 15 fixes from the consolidated audit of PR #3933 (covenant
signer + fault isolation). These address findings from 5 review passes
(initial triage, 2 multi-agent lens reviews, PR reviewer feedback,
validation review).

**P1 - High**
- **A2**: Use `canonicaljson.Marshal` for handoff payload hash
(cross-language determinism)
- **A3/A23**: Improve wallet registry error messages, elevate log to
Error, add WalletChainData godoc
- **A4**: Extract Submit critical section into `createOrDedup` helper (5
unlock points -> defer)
- **A5**: Deterministic tiebreaker when both timestamps unparseable
(lexicographic RequestID)
- **A6**: Restrict healthz auth bypass to GET method only
- **A22**: Poison route keys from skipped jobs to preserve dedupe
semantics

**P2 - Medium**
- **A7**: Document AuthToken CLI flag /proc visibility risk
- **A12/A27**: Cancel service context on init failure + add
SIGINT/SIGTERM signal handling
- **A15**: Document advisory flock limitations and storage requirements
- **A16**: Add aggregate load summary with skip count
- **A24**: Remove superseded job from byRequestID on dedup replacement
- **A25**: Rename misleading test after resilient loading change
- **A29**: Use `errors.Is()` for errJobNotFound comparison

---

**A1: Default `requireApprovalTrustRoots` to true in production**
- **File:** `pkg/covenantsigner/server.go:70-75`
- **Risk:** When `signerApprovalVerifier` is nil, the service accepts
requests without cryptographic signer approval verification. The
`requireApprovalTrustRoots` config flag exists as mitigation but
defaults to false.
- **Action:** Default `requireApprovalTrustRoots` to true in production
configs, or fail startup when port is non-zero and no verifier is
available.
- **Effort:** Low

**A8: Bitcoin regtest integration tests for witness scripts**
- **File:** `pkg/tbtc/covenant_signer.go` (design)
- **Risk:** Go unit tests cannot enforce `cleanstack`, signature
encoding quirks, or opcode limits. A script passing unit tests could be
rejected by the Bitcoin network, locking funds.
- **Action:** Introduce `btcd` or Bitcoin Core regtest integration tests
that compile witness scripts, sign transactions, and broadcast on a
local regtest node.
- **Effort:** High

**A9: Split monolithic validation.go (1922 lines)**
- **File:** `pkg/covenantsigner/validation.go`
- **Issue:** Mixes input parsing, crypto verification, normalization,
and commitment computation in a single file.
- **Action:** Split into focused files: `validation_quote.go`,
`validation_approval.go`, `validation_template.go`.
- **Effort:** High

**A10: Deduplicate UTXO resolution logic**
- **File:** `pkg/tbtc/covenant_signer.go:512-628`
- **Issue:** `resolveSelfV1ActiveUtxo` and `resolveQcV1ActiveUtxo` are
near-identical (~60 lines each).
- **Action:** Extract shared `resolveActiveUtxo(request, witnessScript,
templateName)` helper.
- **Effort:** Medium

**A11: Deduplicate trust root normalization**
- **File:** `pkg/covenantsigner/validation.go:628-724`
- **Issue:** `normalizeDepositorTrustRoots` and
`normalizeCustodianTrustRoots` are structurally identical.
- **Action:** Unify via generics or shared inner function.
- **Effort:** Low

**A13: strictUnmarshal trailing token rejection**
- **File:** `pkg/covenantsigner/validation.go:71-75`
- **Issue:** Inconsistent with server-level `decodeJSON` which rejects
trailing JSON tokens.
- **Action:** Add trailing-token rejection or document single-document
context constraint.
- **Effort:** Low

**A14: Document Engine interface contract**
- **File:** `pkg/covenantsigner/engine.go`
- **Issue:** No documentation of what nil Transition means (OnSubmit:
default to Pending; OnPoll: no-op).
- **Action:** Add godoc specifying the contract and allowed return
states.
- **Effort:** Low

**A26: Rate limiting on submit endpoint**
- **File:** `pkg/covenantsigner/server.go:345`
- **Issue:** Authenticated callers can spawn unbounded concurrent
5-minute signing operations. Mitigated by auth and routeRequestID dedup
but no per-client throttle exists.
- **Action:** Add token-bucket rate limiting per auth token (e.g.,
`golang.org/x/time/rate`, ~5 req/min).
- **Effort:** Medium

**A28: Replace reflect.DeepEqual for Handoff comparison**
- **File:** `pkg/covenantsigner/service.go:201`
- **Issue:** `sameJobRevision` uses `reflect.DeepEqual` for
`map[string]any` Handoff comparison. Correct but potentially slow at
scale.
- **Action:** Define a concrete `HandoffData` struct with an `Equal`
method. Acceptable as-is for single-digit RPM throughput.
- **Effort:** Low

**A30: Document qcV1SignerHandoff schema**
- **File:** `pkg/tbtc/covenant_signer.go:35`
- **Issue:** The `Kind` constant serves as a version identifier but the
downstream contract for handoff artifacts is implicit.
- **Action:** Add doc comment noting this struct defines the downstream
API schema.
- **Effort:** Low

**A3 (type-level): Add `HasRegistryData` to WalletChainData**
- **File:** `pkg/tbtc/chain.go:418`
- **Issue:** Zero `[32]byte` for MembersIDsHash means "unavailable" but
this is implicit. Downstream guards work but future code could silently
use the zero value.
- **Action:** Add `HasRegistryData bool` field or use `*[32]byte` for
MembersIDsHash to make unavailability explicit at the type level.
- **Effort:** Medium (touches many callers)

---

`TestSubmitHandlerPreservesContextValues` fails on the base branch
(`fix/review-findings`). Confirmed not introduced by these changes --
the test expects request context values to propagate through the service
context detachment, which is not the current design.

- [x] `go build ./pkg/covenantsigner/...` and `go build ./pkg/tbtc/...`
compile cleanly
- [x] All covenantsigner tests pass (except pre-existing
`TestSubmitHandlerPreservesContextValues`)
- [x] All relevant tbtc tests pass (GetWallet fault isolation, signer
approval, payload hash)
- [x] Pinned hash test
(`TestComputeQcV1SignerHandoffPayloadHash_DeterministicKeyOrdering`)
still passes after canonicaljson switch
- [ ] CI green
lrsaturnino added a commit that referenced this pull request Apr 16, 2026
…rent base (#3940)

## Problem

PR #3935 was merged into `fix/review-findings`, an intermediate base
branch that was never integrated into
`feat/psbt-covenant-final-project-pr`. This left
`feat/psbt-covenant-final-project-pr` without the audit-finding fixes
that #3935 implemented (createOrDedup mutex extraction, healthz auth
bypass restriction, canonical JSON for handoff payload, error message
improvements, gosec hardening, and more).

## Solution

This PR cherry-picks the squash commit of #3935 onto the current state
of `feat/psbt-covenant-final-project-pr`, which already contains PR
#3933 (Maclane's fault-isolation work plus his subsequent fixes for the
review comments raised on that PR). Three files required manual conflict
resolution where #3935 and #3933 made overlapping changes addressing the
same concerns. In `pkg/tbtc/signer_approval_certificate.go`, the
`ErrMissingWalletID` and `ErrMissingMembersIDsHash` sentinel errors from
#3935 were dropped because Maclane's `ensureWalletRegistryDataAvailable`
function already covers the same case and is in active use; keeping both
would have left the sentinels as dead code. In
`pkg/covenantsigner/server_test.go`, the test rewrite from #3935 (which
depended on a `serviceCtx` parameter to `newHandler`) was reverted in
favor of the existing middleware-based test, which already exercises the
`context.WithoutCancel(r.Context())` detachment path. In
`pkg/covenantsigner/store_test.go`, the test rename collision was
resolved by taking #3935's name
(`TestStoreLoadResolvesInvalidUpdatedAtForDuplicateRouteKeys`).

Three additional cleanups followed from the merge mechanics: the godoc
on `WalletChainData.MembersIDsHash` was updated to reference
`ensureWalletRegistryDataAvailable` instead of the dropped sentinels, an
orphaned `cancelService()` call was removed from `server.go` (its
`serviceCtx` infrastructure did not auto-merge), and a duplicate
`delete(s.byRequestID, existingID)` was deduplicated in `store.go`'s
load path.

## Tests

All tests across `pkg/covenantsigner`, `pkg/tbtc`, and
`pkg/chain/ethereum` pass. A behavioral diff against the base branch
shows identical build, vet, and file-listing outputs, with all `ok`
lines matching except for timing variation. The two adapter tests added
by Maclane (`TestMakeWalletChainData...`) and the two
registry-unavailable tests for the issue-and-verify signer approval
paths are preserved and pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants