[CRE] [3/5] Allow capability DONs to discover remote capabilities#21640

Closed
nadahalli wants to merge 1 commit into develop from tejaswi/cw-3-launcher-fix

@nadahalli nadahalli commented Mar 23, 2026

Context

Part of #21635 (confidential workflow execution). [3/5] in the series.
Can be reviewed and merged independently.

What this does

Adds an addRemoteCapabilities call inside the belongsToACapabilityDON
block so capability DONs can discover capabilities on other DONs.

Previously, only workflow DONs called addRemoteCapabilities. A
capability DON (e.g. relay DON) that needs to call another capability
DON (e.g. vault DON for secret fetching) could not discover it.
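
A minimal sketch of the control flow described above, not the actual launcher code. The node type, discoversRemote, the withFix flag, and belongsToAWorkflowDON are hypothetical names used only for illustration; only addRemoteCapabilities and belongsToACapabilityDON appear in this PR.

```go
package main

import "fmt"

// node is a hypothetical stand-in for the launcher's view of the DONs
// this node belongs to.
type node struct {
	workflowDONs   []string
	capabilityDONs []string
}

// discoversRemote reports whether the node would run remote capability
// discovery (the addRemoteCapabilities step). withFix toggles the change
// proposed in this PR.
func discoversRemote(n node, withFix bool) bool {
	belongsToAWorkflowDON := len(n.workflowDONs) > 0
	belongsToACapabilityDON := len(n.capabilityDONs) > 0

	discovered := false
	if belongsToAWorkflowDON {
		// Existing behaviour: workflow DONs discover remote capabilities.
		discovered = true
	}
	if belongsToACapabilityDON && withFix {
		// Proposed behaviour: capability DONs discover them too.
		discovered = true
	}
	return discovered
}

func main() {
	relay := node{capabilityDONs: []string{"relay"}}
	fmt.Println(discoversRemote(relay, false)) // prints false
	fmt.Println(discoversRemote(relay, true))  // prints true

	workflow := node{workflowDONs: []string{"workflow"}}
	fmt.Println(discoversRemote(workflow, false)) // prints true
}
```

In this simplified model, a node that belongs only to a capability DON never reaches discovery without the change, while any node in a workflow DON already does.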

Dependencies

None.

Copilot AI review requested due to automatic review settings March 23, 2026 17:00
@nadahalli nadahalli requested review from a team as code owners March 23, 2026 17:00
@github-actions

👋 nadahalli, thanks for creating this pull request!

To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team.

Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!

@github-actions

I see you updated files related to core. Please run make gocs in the root directory to add a changeset, and include at least one of the following tags in the changeset text:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

github-actions bot commented Mar 23, 2026

✅ No conflicts with other open PRs targeting develop

Copilot AI left a comment

Pull request overview

Risk Rating: HIGH — changes touch capability discovery/routing, add new gateway handler logic, and update core dependencies used in security-sensitive paths (attestation).

Purpose: Improve capability discovery behavior across single-DON and cross-DON topologies (CRE), and introduce confidential relay handling components needed for confidential workflow execution.

Changes:

  • Fix capability serving in single-DON topologies by passing a combined workflow-DON list when serving capabilities.
  • Enable capability DONs (not just workflow DONs) to discover and add remote capabilities.
  • Add confidential relay gateway/capability handler implementations + tests, and bump related Go dependencies.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 3 comments.

File | Description
core/capabilities/launcher.go | Adjusts DON discovery/serving logic; adds capability-DON remote discovery; modifies don2don connection update behavior.
core/capabilities/launcher_test.go | Adds regression test for single-DON capability serving; updates don2don stream config expectations.
core/services/gateway/handlers/confidentialrelay/handler.go | New gateway handler: fan-out to DON nodes + quorum aggregation + timeout cleanup.
core/services/gateway/handlers/confidentialrelay/aggregator.go | New quorum aggregator based on matching response digests.
core/services/gateway/handlers/confidentialrelay/handler_test.go | Tests for fan-out, quorum, timeouts, rate limiting, and edge cases.
core/capabilities/confidentialrelay/handler.go | New capability-side gateway connector handler: verifies attestation and proxies secrets/capability execution.
core/capabilities/confidentialrelay/handler_test.go | Unit tests for request handling, attestation failure, payload decoding, and lifecycle.
core/capabilities/confidentialrelay/service.go | Lifecycle wrapper to delay handler creation until gateway connector is available.
go.mod / go.sum | Dependency bumps/additions (notably chainlink-common, teeattestation, CBOR, etc.).

Scrupulous human review recommended for:

  • core/capabilities/launcher.go: the updated control flow around serveCapabilities + addRemoteCapabilities, and any knock-on effects in multi-DON topologies.
  • core/capabilities/confidentialrelay/handler.go: attestation hashing/validation domain separation and trusted PCR selection behavior.
  • core/services/gateway/handlers/confidentialrelay/*: quorum/timeout behavior and error handling paths under partial failures.

Reviewer recommendations (per CODEOWNERS):

  • core/capabilities/**: @smartcontractkit/keystone, @smartcontractkit/capabilities-team
  • go.mod, go.sum (and otherwise uncovered areas): @smartcontractkit/core, @smartcontractkit/foundations

  if w.don2donSharedPeer != nil {
  	donPairs := w.donPairsToUpdate(w.myPeerID, localRegistry)
- 	err := w.don2donSharedPeer.UpdateConnectionsByDONs(ctx, donPairs, w.p2pStreamConfig)
+ 	err := w.don2donSharedPeer.UpdateConnectionsByDONs(ctx, donPairs, defaultStreamConfig)

Copilot AI Mar 23, 2026

UpdateConnectionsByDONs is being called with defaultStreamConfig, which ignores the streamConfig passed into NewLauncher and the computed w.p2pStreamConfig. This looks like an accidental regression (configurable P2P stream limits/rate-limiters will no longer take effect for don2don connections). Consider passing w.p2pStreamConfig here instead so the launcher honors node configuration and the stored field isn't dead code.

Suggested change
- err := w.don2donSharedPeer.UpdateConnectionsByDONs(ctx, donPairs, defaultStreamConfig)
+ err := w.don2donSharedPeer.UpdateConnectionsByDONs(ctx, donPairs, w.p2pStreamConfig)

Contributor Author

Fixed. The launcher.go diff has been reduced to only the two intended fixes (allWorkflowDONs and addRemoteCapabilities). The defaultStreamConfig change was an artifact from the source branch and is no longer in this PR.

Comment on lines +46 to +50
remainingResponses := donMembersCount - len(resps)
if maxShaToCount+remainingResponses < requiredQuorum {
	l.Warnw("quorum unattainable for request", "requiredQuorum", requiredQuorum, "remainingResponses", remainingResponses, "maxShaToCount", maxShaToCount)
	return nil, errors.New(errQuorumUnobtainable.Error() + ". RequiredQuorum=" + strconv.Itoa(requiredQuorum) + ". maxShaToCount=" + strconv.Itoa(maxShaToCount) + " remainingResponses=" + strconv.Itoa(remainingResponses))
}

Copilot AI Mar 23, 2026

errQuorumUnobtainable is declared as a sentinel, but the returned error is created via errors.New(...), so callers can’t reliably detect the condition with errors.Is. Consider returning an error that wraps errQuorumUnobtainable (and includes the extra details) so higher layers can branch on it consistently.
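
One way to follow this suggestion, sketched under assumptions: checkQuorum and isQuorumUnobtainable are hypothetical helpers written for illustration, not the code in aggregator.go, and the sentinel's message text is invented. The sketch counts matching response digests, then wraps the sentinel with fmt.Errorf and %w so that errors.Is still matches while the extra details remain in the message.

```go
package main

import (
	"errors"
	"fmt"
)

// errQuorumUnobtainable mirrors the sentinel named in the review comment;
// its message text here is an assumption.
var errQuorumUnobtainable = errors.New("quorum unobtainable")

// checkQuorum is a hypothetical helper. digests holds the response digests
// received so far. It returns the winning digest once requiredQuorum
// matching digests exist, or a wrapped sentinel error once quorum can no
// longer be reached. If neither holds yet, it returns reached=false, err=nil.
func checkQuorum(digests []string, donMembersCount, requiredQuorum int) (winner string, reached bool, err error) {
	counts := make(map[string]int)
	maxShaToCount := 0
	for _, d := range digests {
		counts[d]++
		if counts[d] > maxShaToCount {
			maxShaToCount = counts[d]
		}
		if counts[d] >= requiredQuorum {
			return d, true, nil
		}
	}
	remainingResponses := donMembersCount - len(digests)
	if maxShaToCount+remainingResponses < requiredQuorum {
		// %w keeps the sentinel in the error chain for errors.Is.
		return "", false, fmt.Errorf("%w: requiredQuorum=%d maxShaToCount=%d remainingResponses=%d",
			errQuorumUnobtainable, requiredQuorum, maxShaToCount, remainingResponses)
	}
	return "", false, nil
}

// isQuorumUnobtainable shows how a caller could branch on the condition.
func isQuorumUnobtainable(err error) bool {
	return errors.Is(err, errQuorumUnobtainable)
}

func main() {
	// Three distinct digests from a 4-node DON: a quorum of 3 is out of reach.
	_, _, err := checkQuorum([]string{"a", "b", "c"}, 4, 3)
	fmt.Println(isQuorumUnobtainable(err)) // prints true
	fmt.Println(err)
}
```

With errors.New and string concatenation, the same check in isQuorumUnobtainable would return false, because the new error has no link to the sentinel.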

Contributor Author

This file has been removed from this PR. It belongs to #21638 [1/5].

go.mod Outdated
Comment on lines +87 to +91
github.com/smartcontractkit/chainlink-ccv v0.0.0-20260320145055-eb20b529ff95
- github.com/smartcontractkit/chainlink-common v0.11.0
+ github.com/smartcontractkit/chainlink-common v0.11.1-0.20260323163826-2c5b95089478
github.com/smartcontractkit/chainlink-common/keystore v1.0.2
github.com/smartcontractkit/chainlink-common/pkg/chipingress v0.0.10
github.com/smartcontractkit/chainlink-common/pkg/teeattestation v0.0.0-20260316172927-2c727f64397c

Copilot AI Mar 23, 2026

The PR description says this change is limited to “Two fixes in launcher.go”, but this PR also updates module dependencies (e.g. chainlink-common, teeattestation) and introduces a new confidential relay capability/handler implementation. Please update the PR description (or split the PR) so reviewers can accurately assess scope and risk.

Contributor Author

Fixed. The PR now only contains launcher.go and launcher_test.go changes. No dependency changes needed.

@nadahalli nadahalli requested a review from a team as a code owner March 23, 2026 17:19
github-actions bot commented Mar 23, 2026

CORA - Analysis Skipped

Reason: The number of code owners (2) is less than the minimum required (5) and/or the number of CODEOWNERS entries with changed files (1) is less than the minimum required (2).

@nadahalli nadahalli force-pushed the tejaswi/cw-3-launcher-fix branch from 0c3f1f3 to bb05a15 March 23, 2026 17:25
trunk-io bot commented Mar 23, 2026

Failed tests:

  • Test_CCIP_Messaging_EVM2Aptos/Hello_World_Message_-_Should_Succeed
  • Test_CCIP_Messaging_EVM2Aptos

@nadahalli nadahalli force-pushed the tejaswi/cw-3-launcher-fix branch from bb05a15 to d57a4b0 March 24, 2026 11:32

belongsToACapabilityDON := len(myCapabilityDONs) > 0
if belongsToACapabilityDON {
	// Include both remote workflow DONs and the node's own workflow DONs.

Contributor

Once again, setting isLocal to true for the capability in the capabilities registry should fix this! If you are running into issues with this in your E2E test, just set this param when you set your capability config on-chain.

I believe there are no code changes needed here, actually.

Contributor Author

Yes. But I remember running into some other DON discovery trouble after setting IsLocal = true, and I think this fix helped then. Let me run an E2E test with this PR's change disabled and see if it works.

Contributor

cc @cedric-cordenier for help here, who pointed out that flag to me in the first place and might have some good insights on this particular bit of logic.

Contributor Author

Have updated the PR description. Here's what actually happened.

LocalOnly: true fixed the "failed to add action server-side receiver" error. But then the relay DON still couldn't find vault@1.0.0 because addRemoteCapabilities was never called for capability DONs. That was a completely separate failure that happened after LocalOnly was set.

The timeline was: set LocalOnly -> "action server-side receiver" error gone -> relay DON still can't find vault -> traced to addRemoteCapabilities only running for workflow DONs -> added the fix -> also needed exposes_remote_capabilities = true on the workflow DON TOML -> both together made vault discoverable by the relay DON.

Contributor

It makes no sense to me why this change is needed. How is our workflow DON able to call Vault DON as-is? We're missing something IMO.

Also, we should make the relay DON == the workflow DON to start, so this should be moot anyway. localOnly should fix our issue and then the relayDON can be the same thing as the workflow DON.

Contributor

I would really prefer getting rid of this PR and just updating the topology of the E2E tests to make relay DON == workflow DON.

A capability DON (e.g. relay DON) that needs to call another capability
DON (e.g. vault DON for secret fetching) could not discover it because
addRemoteCapabilities was only called for workflow DONs.

Add addRemoteCapabilities call inside the belongsToACapabilityDON block
so capability DONs also discover cross-DON capabilities.

Part of #21635
@nadahalli nadahalli force-pushed the tejaswi/cw-3-launcher-fix branch from d57a4b0 to 3353da8 March 25, 2026 12:03
@nadahalli nadahalli changed the title from "[CRE] [3/5] Fix capability discovery for single-DON and cross-DON topologies" to "[CRE] [3/5] Allow capability DONs to discover remote capabilities" Mar 25, 2026
@cl-sonarqube-production

Quality Gate failed

Failed conditions
15.56% Technical Debt Ratio on New Code (required ≤ 4%)
C Maintainability Rating on New Code (required ≥ A)


@nadahalli

Closing. The relay DON will be configured as a workflow DON instead, which gets addRemoteCapabilities for free. The mock capability moves to the workflow DON. Config-only change in CC's E2E TOML, no launcher code change needed.

@nadahalli nadahalli closed this Mar 26, 2026
3 participants