[CRE] [3/5] Allow capability DONs to discover remote capabilities#21640
[CRE] [3/5] Allow capability DONs to discover remote capabilities#21640
Conversation
|
👋 nadahalli, thanks for creating this pull request! To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team. Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks! |
|
I see you updated files related to
|
|
✅ No conflicts with other open PRs targeting |
There was a problem hiding this comment.
Pull request overview
Risk Rating: HIGH — changes touch capability discovery/routing, add new gateway handler logic, and update core dependencies used in security-sensitive paths (attestation).
Purpose: Improve capability discovery behavior across single-DON and cross-DON topologies (CRE), and introduce confidential relay handling components needed for confidential workflow execution.
Changes:
- Fix capability serving in single-DON topologies by passing a combined workflow-DON list when serving capabilities.
- Enable capability DONs (not just workflow DONs) to discover and add remote capabilities.
- Add confidential relay gateway/capability handler implementations + tests, and bump related Go dependencies.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
core/capabilities/launcher.go |
Adjusts DON discovery/serving logic; adds capability-DON remote discovery; modifies don2don connection update behavior. |
core/capabilities/launcher_test.go |
Adds regression test for single-DON capability serving; updates don2don stream config expectations. |
core/services/gateway/handlers/confidentialrelay/handler.go |
New gateway handler: fan-out to DON nodes + quorum aggregation + timeout cleanup. |
core/services/gateway/handlers/confidentialrelay/aggregator.go |
New quorum aggregator based on matching response digests. |
core/services/gateway/handlers/confidentialrelay/handler_test.go |
Tests for fan-out, quorum, timeouts, rate limiting, and edge cases. |
core/capabilities/confidentialrelay/handler.go |
New capability-side gateway connector handler: verifies attestation and proxies secrets/capability execution. |
core/capabilities/confidentialrelay/handler_test.go |
Unit tests for request handling, attestation failure, payload decoding, and lifecycle. |
core/capabilities/confidentialrelay/service.go |
Lifecycle wrapper to delay handler creation until gateway connector is available. |
go.mod / go.sum |
Dependency bumps/additions (notably chainlink-common, teeattestation, CBOR, etc.). |
Scrupulous human review recommended for:
core/capabilities/launcher.go: the updated control flow aroundserveCapabilities+addRemoteCapabilities, and any knock-on effects in multi-DON topologies.core/capabilities/confidentialrelay/handler.go: attestation hashing/validation domain separation and trusted PCR selection behavior.core/services/gateway/handlers/confidentialrelay/*: quorum/timeout behavior and error handling paths under partial failures.
Reviewer recommendations (per CODEOWNERS):
core/capabilities/**:@smartcontractkit/keystone,@smartcontractkit/capabilities-teamgo.mod,go.sum(and otherwise uncovered areas):@smartcontractkit/core,@smartcontractkit/foundations
core/capabilities/launcher.go
Outdated
| if w.don2donSharedPeer != nil { | ||
| donPairs := w.donPairsToUpdate(w.myPeerID, localRegistry) | ||
| err := w.don2donSharedPeer.UpdateConnectionsByDONs(ctx, donPairs, w.p2pStreamConfig) | ||
| err := w.don2donSharedPeer.UpdateConnectionsByDONs(ctx, donPairs, defaultStreamConfig) |
There was a problem hiding this comment.
UpdateConnectionsByDONs is being called with defaultStreamConfig, which ignores the streamConfig passed into NewLauncher and the computed w.p2pStreamConfig. This looks like an accidental regression (configurable P2P stream limits/rate-limiters will no longer take effect for don2don connections). Consider passing w.p2pStreamConfig here instead so the launcher honors node configuration and the stored field isn't dead code.
| err := w.don2donSharedPeer.UpdateConnectionsByDONs(ctx, donPairs, defaultStreamConfig) | |
| err := w.don2donSharedPeer.UpdateConnectionsByDONs(ctx, donPairs, w.p2pStreamConfig) |
There was a problem hiding this comment.
Fixed. The launcher.go diff has been reduced to only the two intended fixes (allWorkflowDONs and addRemoteCapabilities). The defaultStreamConfig change was an artifact from the source branch and is no longer in this PR.
| remainingResponses := donMembersCount - len(resps) | ||
| if maxShaToCount+remainingResponses < requiredQuorum { | ||
| l.Warnw("quorum unattainable for request", "requiredQuorum", requiredQuorum, "remainingResponses", remainingResponses, "maxShaToCount", maxShaToCount) | ||
| return nil, errors.New(errQuorumUnobtainable.Error() + ". RequiredQuorum=" + strconv.Itoa(requiredQuorum) + ". maxShaToCount=" + strconv.Itoa(maxShaToCount) + " remainingResponses=" + strconv.Itoa(remainingResponses)) | ||
| } |
There was a problem hiding this comment.
errQuorumUnobtainable is declared as a sentinel, but the returned error is created via errors.New(...), so callers can’t reliably detect the condition with errors.Is. Consider returning an error that wraps errQuorumUnobtainable (and includes the extra details) so higher layers can branch on it consistently.
There was a problem hiding this comment.
This file has been removed from this PR. It belongs to #21638 [1/5].
go.mod
Outdated
| github.com/smartcontractkit/chainlink-ccv v0.0.0-20260320145055-eb20b529ff95 | ||
| github.com/smartcontractkit/chainlink-common v0.11.0 | ||
| github.com/smartcontractkit/chainlink-common v0.11.1-0.20260323163826-2c5b95089478 | ||
| github.com/smartcontractkit/chainlink-common/keystore v1.0.2 | ||
| github.com/smartcontractkit/chainlink-common/pkg/chipingress v0.0.10 | ||
| github.com/smartcontractkit/chainlink-common/pkg/teeattestation v0.0.0-20260316172927-2c727f64397c |
There was a problem hiding this comment.
The PR description says this change is limited to “Two fixes in launcher.go”, but this PR also updates module dependencies (e.g. chainlink-common, teeattestation) and introduces a new confidential relay capability/handler implementation. Please update the PR description (or split the PR) so reviewers can accurately assess scope and risk.
There was a problem hiding this comment.
Fixed. The PR now only contains launcher.go and launcher_test.go changes. No dependency changes needed.
CORA - Analysis SkippedReason: The number of code owners (2) is less than the minimum required (5) and/or the number of CODEOWNERS entries with changed files (1) is less than the minimum required (2). |
0c3f1f3 to
bb05a15
Compare
bb05a15 to
d57a4b0
Compare
core/capabilities/launcher.go
Outdated
|
|
||
| belongsToACapabilityDON := len(myCapabilityDONs) > 0 | ||
| if belongsToACapabilityDON { | ||
| // Include both remote workflow DONs and the node's own workflow DONs. |
There was a problem hiding this comment.
Once again, setting isLocal to true for the capability in the capabilities registry should fix this! If you are running into issues with this in your E2E test, just set this param when you set your capability config on-chain.
I believe there are no code changes needed here, actually.
There was a problem hiding this comment.
Yes. But I remember running into some other DON discovery trouble after setting IsLocal = true. This fix had helped then I think. Let me do an E2E test with this PR disabled and see if it works.
There was a problem hiding this comment.
cc @cedric-cordenier for help here, who pointed out that flag to me in the first place and might have some good insights on this particular bit of logic.
There was a problem hiding this comment.
Have updated the PR description. Here's what actually happened.
LocalOnly: true fixed the "failed to add action server-side receiver" error. But then the relay DON still couldn't find vault@1.0.0 because addRemoteCapabilities was never called for capability DONs. That was a completely separate failure that happened after LocalOnly was set.
The timeline was: set LocalOnly -> "action server-side receiver" error gone -> relay DON still can't find vault -> traced to addRemoteCapabilities only running for workflow DONs -> added the fix -> also needed exposes_remote_capabilities = true on the workflow DON TOML -> both together made vault discoverable by the rekay don.
There was a problem hiding this comment.
It makes no sense to me why this change is needed. How is our workflow DON able to call Vault DON as-is? We're missing something IMO.
Also, we should make the relay DON == the workflow DON to start, so this should be moot anyway. localOnly should fix our issue and then the relayDON can be the same thing as the workflow DON.
There was a problem hiding this comment.
I would really prefer getting rid of this PR and just updating the topology of the E2E tests to make relay DON == workflow DON.
A capability DON (e.g. relay DON) that needs to call another capability DON (e.g. vault DON for secret fetching) could not discover it because addRemoteCapabilities was only called for workflow DONs. Add addRemoteCapabilities call inside the belongsToACapabilityDON block so capability DONs also discover cross-DON capabilities. Part of #21635
d57a4b0 to
3353da8
Compare
|
|
Closing. The relay DON will be configured as a workflow DON instead, which gets addRemoteCapabilities for free. The mock capability moves to the workflow DON. Config-only change in CC's E2E TOML, no launcher code change needed. |




Context
Part of #21635 (confidential workflow execution). [3/5] in the series.
Can be reviewed and merged independently.
What this does
Adds
addRemoteCapabilitiesinside thebelongsToACapabilityDONblockso capability DONs can discover capabilities on other DONs.
Previously, only workflow DONs called
addRemoteCapabilities. Acapability DON (e.g. relay DON) that needs to call another capability
DON (e.g. vault DON for secret fetching) could not discover it.
Dependencies
None.