[Enhancement]: New Search Architecture Based On Streaming Service #40451

Open
1 of 8 tasks
chyezh opened this issue Mar 7, 2025 · 0 comments
Assignees
Labels
feature/streaming node streaming node feature kind/enhancement Issues or changes related to enhancement
chyezh commented Mar 7, 2025

Is there an existing issue for this?

  • I have searched the existing issues

What would you like to be added?

New Search Architecture Based On Streaming Service

Why is this needed?

The current Streaming Service supports an embedded QueryNode to implement search/query (see also #38399).
However, the old delegator logic is too heavy for the streaming node, and the current delegator architecture does not allow batch and streaming processing to be fully separated:

  • The flush process and the search build process cannot be merged, so recovery always consumes the WAL twice, and memory usage doubles once a collection is loaded.
  • Meta management for growing data cannot be moved entirely onto the streaming node, which prevents building a lightweight coordinator for historical meta.
  • The delegator's forwarding RPC cannot be removed to reduce the streaming node's workload.

Here's the new distributed architecture for search/query process based on streaming service:

Image

The query process is implemented as shown in the diagram above, following a two-phase query approach:

  • The Coordinator generates a globally versioned query view, syncs it to all streaming nodes and query nodes, and keeps it consistent through a cross-node state machine.

  • Each QueryNode subscribes to the pure delete stream from the streaming node and applies the delete requests to all segments it holds.

  • Phase 1: The Proxy generates a shard-level query plan using the highest version of the QueryView from the StreamingNode:

    • Includes MVCC.
    • Query optimization (BM25, segment filtering, etc.).
    • Query view versioning.
  • Phase 2: The Proxy sends the query plan to both the StreamingNode and QueryNode:

    • The StreamingNode and QueryNode execute all query operations on their respective segments based on the specified view version (similar to the current SearchSegments process, but using version numbers instead of a segment list).
  • Final Step: The Proxy reduces all results and returns them to the user.
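
The two phases can be sketched as a toy in-memory model. This is only an illustration of the idea and not the Milvus implementation: the names `QueryView`, `Node`, `ViewExpired`, `plan`, and `two_phase_search` are hypothetical, hits are plain `(score, primary_key)` tuples, and MVCC and query optimization are left out.

```python
from dataclasses import dataclass, field
import heapq


@dataclass(frozen=True)
class QueryView:
    """Hypothetical view: a version plus the segments a node serves under it."""
    version: int
    segments: tuple


class ViewExpired(Exception):
    """Raised when a node does not hold the requested view version."""


@dataclass
class Node:
    """Stand-in for a StreamingNode or QueryNode (assumed structure)."""
    name: str
    views: dict = field(default_factory=dict)  # version -> QueryView
    data: dict = field(default_factory=dict)   # segment ID -> [(score, pk)]

    def sync_view(self, view):
        # In the proposal the Coordinator pushes views; here we just store them.
        self.views[view.version] = view

    def search(self, version, top_k):
        # Phase 2 on one node: resolve segments from the version, not a segment list.
        view = self.views.get(version)
        if view is None:
            raise ViewExpired(f"{self.name} has no view v{version}")
        hits = [hit for seg in view.segments for hit in self.data.get(seg, [])]
        return heapq.nlargest(top_k, hits)


def plan(streaming_node):
    # Phase 1: pick the highest view version known to the StreamingNode.
    return max(streaming_node.views)


def two_phase_search(streaming_node, query_nodes, top_k):
    version = plan(streaming_node)
    partial = []
    for node in [streaming_node, *query_nodes]:
        partial.extend(node.search(version, top_k))
    # Final step: the Proxy reduces the per-node results.
    return heapq.nlargest(top_k, partial)


sn = Node("streaming-node", data={"growing-1": [(0.9, 1), (0.4, 2)]})
qn = Node("query-node", data={"sealed-1": [(0.8, 3)], "sealed-2": [(0.7, 4)]})
for version, qn_segments in [(1, ("sealed-1",)), (2, ("sealed-1", "sealed-2"))]:
    sn.sync_view(QueryView(version, ("growing-1",)))
    qn.sync_view(QueryView(version, qn_segments))

result = two_phase_search(sn, [qn], top_k=3)
print(result)
```

Because the plan carries only a version number, the Proxy does not need to know which segments each node holds. A node that has not received that version raises `ViewExpired` rather than silently searching a different segment set.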

During this process, if a node crashes or the view becomes invalid, the process is canceled and the query operation is retried.
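
The cancel-and-retry behaviour can be sketched as a small loop on the Proxy side. This is a minimal sketch under assumptions: `ViewInvalid`, `search_with_retry`, `fetch_version`, `execute`, and the `max_attempts` limit are hypothetical names and parameters chosen for illustration, and a node crash is modelled as Python's built-in `ConnectionError`.

```python
class ViewInvalid(Exception):
    """Hypothetical error: the requested view version is no longer valid."""


def search_with_retry(fetch_version, execute, max_attempts=3):
    """Re-plan and re-run the query when a node fails or the view is invalid.

    fetch_version: callable returning the latest view version (phase 1).
    execute: callable running the query for a given version (phase 2).
    Returns (result, attempts_used).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        version = fetch_version()  # always re-plan against the newest view
        try:
            return execute(version), attempt
        except (ViewInvalid, ConnectionError) as err:
            last_error = err  # cancel this attempt and try again
    raise RuntimeError(f"query failed after {max_attempts} attempts") from last_error


# Toy scenario: version 1 is invalidated during the first attempt.
current = {"version": 1}


def fetch_version():
    return current["version"]


def execute(version):
    if version < 2:
        current["version"] = 2  # a newer view is published in the meantime
        raise ViewInvalid(f"view v{version} is no longer served")
    return [f"hit@v{version}"]


result, attempts = search_with_retry(fetch_version, execute)
print(result, attempts)
```

Fetching the version again on every attempt matters: retrying with the stale version would fail the same way.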

Here's the TODO list:

Anything else?

No response
