RFC: DAG Structured Query Plan #28
Conversation
- CTE. At the binder/planner stage, we can create a `Share` operator for the table expression and reuse it.
- Subquery unnesting. We can calculate the domain by creating a `Share` operator over the LHS of the apply operator.
- The same source operator appears multiple times in a query. Introducing the `Share` operator either at the binder/planner stage or at the end of the optimizer is fine.
- Common sub-plan detection. We can compute a `digest` field for each sub-plan; if two sub-plans have the same `digest`, they are probably the same sub-plan. The `digest` is just like a hash code. We also need a `deepEqual` method on each plan node to compare it with another.
IIUC we can implement `Hash` and `Eq` for them directly.
It depends. Every plan node has its own plan id by default, so if we put plan nodes into a `HashSet` or another collection, they will be regarded as distinct items. Once we override these methods, the behavior will change.
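To illustrate the trade-off discussed above, here is a minimal sketch (not RisingWave's actual types; `PlanNode`, `plan_id`, and `digest` are hypothetical stand-ins) of overriding equality and hashing to use the structural `digest` instead of the identity-based plan id, so that structurally equal sub-plans collide in a `HashSet`:

```rust
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

// Hypothetical plan node: `plan_id` is unique per node, while `digest`
// summarizes the sub-plan's structure (like a hash code).
#[derive(Debug, Clone)]
struct PlanNode {
    plan_id: u64,
    digest: u64,
}

// Wrapper that compares by structural digest instead of node identity.
struct ByDigest(PlanNode);

impl PartialEq for ByDigest {
    fn eq(&self, other: &Self) -> bool {
        // A real implementation should fall back to a deepEqual-style
        // comparison here to rule out digest collisions.
        self.0.digest == other.0.digest
    }
}
impl Eq for ByDigest {}

impl Hash for ByDigest {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.0.digest.hash(state);
    }
}

fn main() {
    let a = PlanNode { plan_id: 1, digest: 42 };
    let b = PlanNode { plan_id: 2, digest: 42 }; // same shape, different id
    let mut set = HashSet::new();
    set.insert(ByDigest(a));
    set.insert(ByDigest(b));
    // The two nodes are deduplicated by digest, so the set holds one entry.
    assert_eq!(set.len(), 1);
}
```

With identity-based equality (the default plan-id behavior), the set would instead hold two entries, which is exactly why overriding changes the behavior.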
### Execution

We only discuss execution for streaming queries here, because a batch query needs to do materialization/buffering, which is complicated, and it is hard to tell whether DAG plan execution is better there. The `Share` operator has multiple downstreams, and we can dispatch its data to each downstream just like `ChainNode`.
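The dispatch described above can be sketched as a broadcast to multiple downstream channels. This is a toy model using standard-library channels, not RisingWave's executor API; `share_dispatch` is a hypothetical name:

```rust
use std::sync::mpsc;

// Share-like operator: broadcast every input item to all downstream
// channels, so each downstream sees the full stream (as ChainNode does).
fn share_dispatch<T: Clone>(input: Vec<T>, downstreams: &[mpsc::Sender<T>]) {
    for item in input {
        for tx in downstreams {
            // Each downstream receives its own copy of the item.
            tx.send(item.clone()).unwrap();
        }
    }
}

fn main() {
    let (tx1, rx1) = mpsc::channel();
    let (tx2, rx2) = mpsc::channel();
    // Two downstreams consume the same shared stream.
    share_dispatch(vec![1, 2, 3], &[tx1, tx2]);
    assert_eq!(rx1.iter().collect::<Vec<_>>(), vec![1, 2, 3]);
    assert_eq!(rx2.iter().collect::<Vec<_>>(), vec![1, 2, 3]);
}
```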
FYI, executing a graph with multiple edges is currently not a simple deal in the compute node. 🥵
https://www.notion.so/risingwave-labs/Multiple-Edges-in-the-Stream-Graph-f0ab595523d74823a39edaebdea2be16
Yes, I remember this issue. Based on this RFC, if a node has multiple downstreams, it must be the `Share` node. The fragmenter can split the `Share` node just like it splits the exchange node; in this way, we can separate the upstream and the downstreams into different fragments. The last thing we need to deal with is ensuring all the downstreams of the `Share` node are in different fragments: if multiple edges connect to the same downstream fragment, use a shuffle to separate them.
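The check described above can be sketched as follows. This is a simplified illustration, not the actual fragmenter code; `plan_shuffles` and the fragment-id representation are hypothetical:

```rust
use std::collections::HashSet;

// For a split Share node, every downstream edge should land in a distinct
// fragment. Return the fragment ids whose extra edges need a shuffle
// (exchange) inserted to separate them.
fn plan_shuffles(downstream_fragments: &[u32]) -> Vec<u32> {
    let mut seen = HashSet::new();
    let mut needs_shuffle = Vec::new();
    for &frag in downstream_fragments {
        if !seen.insert(frag) {
            // Second (or later) edge into the same fragment: shuffle it.
            needs_shuffle.push(frag);
        }
    }
    needs_shuffle
}

fn main() {
    // Two edges into fragment 7 -> one of them needs a shuffle.
    assert_eq!(plan_shuffles(&[3, 7, 7, 9]), vec![7]);
}
```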
For a batch query, a DAG means we need to read the shared input multiple times, which inevitably leads to buffering or materialization. The cost of the DAG is determined by how many times it is read; normally the first read costs more than later ones because of the materialization. Since it is hard to tell whether a DAG helps a batch query, I think we can always convert the DAG back to a tree for the batch executor.
For a streaming query, it is not a big deal: we support buffering and dispatching naturally.
Interesting observation. Indeed, for streaming, execution of a DAG plan doesn't introduce any additional cost; no extra buffering is needed.
Completed at risingwavelabs/risingwave#6955
Is there anything that needs to be edited? If not, let's merge it.