Skip to content

Abstract Airlift Discovery from InternalNodeManager #26083

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

dain
Copy link
Member

@dain dain commented Jun 27, 2025

This PR encapsulates Airlfit discovery behind a new NodeInventory API. This is pre-work to replace discovery with a simpler system and to support K8s discovery directly.

The only significant change is that the information about nodes is now loaded directly from the node itself instead of relying on discovery announcements. This additional information is fetched from the node when the node state is being fetched, so it does not place any new additional burden on the nodes.

Release notes

(X) This is not user-visible or is docs only, and no release notes are required.

@dain dain requested a review from electrum June 27, 2025 01:49
@cla-bot cla-bot bot added the cla-signed label Jun 27, 2025
@github-actions github-actions bot added the postgresql PostgreSQL connector label Jun 27, 2025
@dain dain marked this pull request as draft June 27, 2025 04:48
@wendigo
Copy link
Contributor

wendigo commented Jun 27, 2025

Thanks @dain for doing that!

@dain dain force-pushed the discovery-node-manager branch 5 times, most recently from ff703ad to 2f93c9d Compare June 28, 2025 19:21
@dain dain marked this pull request as ready for review June 28, 2025 20:10
dain added 17 commits June 30, 2025 13:16
The testing HTTP endpoint was invalid so state was always active
When new nodes are discovered the initial state is set to ACTIVE, but
a node must be probed to get the initial state. It appears this bug was
introduced when worker reactivation was added.

Additional, nodes without the correct version are not tracked at all
Create InternalNode directly for current node
Add bound CurrentNodeState which is a supplier of NodeState for the
current node. This is requires changes to NodeStateManager to avoid
circular Guice dependencies.
Instead of relying on discovery, node information can be pulled from the
node when state is being fetched.
@dain dain force-pushed the discovery-node-manager branch from 2f93c9d to 5e2de86 Compare June 30, 2025 20:56
@wendigo wendigo requested a review from Copilot July 1, 2025 08:32
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR abstracts Airlift-based discovery behind a new NodeInventory API, migrating node-related types from the io.trino.metadata package into io.trino.node, and updating consumers to use the new interfaces. It also refactors NodeStateManager to use the new CurrentNodeState supplier and enhances tests to wait for asynchronous node updates.

  • Introduce io.trino.node package with NodeInventory, InternalNode, NodeState, DiscoveryNodeManager, and related classes.
  • Refactor NodeStateManager to delegate state/version tracking to CurrentNodeState.
  • Update imports across the codebase and add polling loops in tests to stabilize asynchronous node visibility.

Reviewed Changes

Copilot reviewed 108 out of 108 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
testing/trino-tests/src/test/java/io/trino/tests/TestMinWorkerRequirement.java Add @Timeout and polling loop to wait for node drop before assert
testing/trino-testing/src/main/java/io/trino/testing/DistributedQueryRunner.java Replace immediate verify with a polling loop and timeout
core/trino-main/src/main/java/io/trino/node/NodeInventory.java New interface to abstract node discovery
core/trino-main/src/main/java/io/trino/node/DiscoveryNodeManager.java New discovery manager using NodeInventory and HTTP polling
core/trino-main/src/main/java/io/trino/server/NodeStateManager.java Refactor to remove internal VersionedState and use CurrentNodeState
Comments suppressed due to low confidence (2)

core/trino-main/src/main/java/io/trino/server/NodeStateManager.java:316

  • [nitpick] This nested VersionedState class duplicates logic also found in the outer NodeStateManager refactor. Consider extracting it into a shared static class or moving it into CurrentNodeState as a static nested class to reduce duplication and improve clarity.
    public static class CurrentNodeState

core/trino-main/src/main/java/io/trino/node/AirliftNodeInventory.java:49

  • [nitpick] Consider adding unit tests for AirliftNodeInventory.getNodes() to verify that it correctly filters out failed services and extracts URIs based on the httpsRequired flag.
    public Set<URI> getNodes()

// If the result is null, mark the node as INACTIVE to prevent work from being scheduled on it
NodeState nodeState;
if (result == null || !result.hasValue()) {
nodeState = ACTIVE;
Copy link
Preview

Copilot AI Jul 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment above indicates marking the node as INACTIVE when the result is null, but the code sets it to ACTIVE. It should set nodeState = INACTIVE to match the intended logic.

Suggested change
nodeState = ACTIVE;
nodeState = INACTIVE;

Copilot uses AI. Check for mistakes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it is right :)

@@ -121,7 +123,10 @@ public void testInsufficientWorkerNodesAfterDrop()
assertThat(queryRunner.getCoordinator().refreshNodes().getActiveNodes()).hasSize(4);

queryRunner.getServers().get(0).close();
assertThat(queryRunner.getCoordinator().refreshNodes().getActiveNodes()).hasSize(3);
while (queryRunner.getCoordinator().refreshNodes().getActiveNodes().size() > 3) {
Copy link
Preview

Copilot AI Jul 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Instead of a manual polling loop, consider using a testing utility (e.g., Awaitility) to await the expected active node count. This can make the test clearer and reduce boilerplate.

Copilot uses AI. Check for mistakes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla-signed postgresql PostgreSQL connector
Development

Successfully merging this pull request may close these issues.

2 participants