
RFC: DML Design #17

Merged
merged 8 commits into from Jan 19, 2023

Conversation

Contributor

@st1page st1page commented Nov 7, 2022

Contributor

@BowenXiao1999 BowenXiao1999 left a comment


BTW, if we do not support primary key definitions on tables, we should forbid this behavior and report an error to the user (currently it seems we do not do that).


#### Materialize check

The consistency and constraint checks will be done in the materialize executor. For every operation from upstream, the materialize executor will look up its pk in storage and fix the change if it conflicts with the storage's data. Because of these lookup queries, a cache is needed.
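The conflict fixing described here could be sketched roughly as follows. This is an illustrative model only, not RisingWave's actual code: the `MaterializeCache` type and `Op` enum are hypothetical, and a plain `HashMap` stands in for the storage lookup plus its cache.

```rust
use std::collections::HashMap;

// Hypothetical sketch of the conflict handling described above: look up
// each incoming row's pk and rewrite operations that conflict with the
// data already stored.
#[derive(Debug, Clone, PartialEq)]
pub enum Op {
    Insert(String),
    Update(String),
    Delete,
}

#[derive(Default)]
pub struct MaterializeCache {
    /// pk -> current row value; stands in for the storage lookup + cache.
    rows: HashMap<u64, String>,
}

impl MaterializeCache {
    /// Apply one upstream change, fixing it if it conflicts: an Insert on
    /// an existing pk becomes an Update, an Update on a missing pk becomes
    /// an Insert, and a Delete of a missing pk is dropped entirely.
    pub fn apply(&mut self, pk: u64, op: Op) -> Option<Op> {
        match op {
            Op::Insert(v) | Op::Update(v) => {
                let fixed = if self.rows.contains_key(&pk) {
                    Op::Update(v.clone())
                } else {
                    Op::Insert(v.clone())
                };
                self.rows.insert(pk, v);
                Some(fixed)
            }
            Op::Delete => self.rows.remove(&pk).map(|_| Op::Delete),
        }
    }
}
```

Under this model, downstream executors never observe an inconsistent sequence such as two Inserts for the same pk, which is what prevents panics in other components.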
Contributor


How to design the cache?

If we do this for every operation (INSERT/UPDATE/DELETE), the cost seems huge and hard to mitigate... Also, I think this looks similar to the Concurrent DELETE doc proposed by @fuyufjh a long time ago.

Contributor Author


Yes, that is what the doc mentions. The check is to ensure the external data is correct. We should do the check to make sure the data is valid and does not cause other components in our system to panic.

rfcs/0017-DML.md Outdated

#### DML manager

The `TableSourceManager` will be renamed to `DMLManager`. It will be indexed by `table_id` instead of the `source_id` of the table source, so the DML batch executors can easily get the `TableSource` for DML operations and the `Mutex<RowIdGenerator>`.
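A minimal sketch of what such a manager might look like, under illustrative assumptions: the `DmlManager` API (`register`, `get`) and the placeholder `TableSource` and `RowIdGenerator` types are hypothetical stand-ins, not the actual implementations.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

pub type TableId = u32;

// Placeholder for the real channel-backed table source.
#[derive(Default)]
pub struct TableSource;

// Plain-counter stand-in for the real row id generator.
#[derive(Default)]
pub struct RowIdGenerator {
    next: u64,
}

impl RowIdGenerator {
    pub fn next_id(&mut self) -> u64 {
        let id = self.next;
        self.next += 1;
        id
    }
}

// Entries are keyed by table_id rather than by the table source's
// source_id, so a DML batch executor fetches both the TableSource and
// the shared RowIdGenerator for a table in one lookup.
#[derive(Default)]
pub struct DmlManager {
    tables: HashMap<TableId, (Arc<TableSource>, Arc<Mutex<RowIdGenerator>>)>,
}

impl DmlManager {
    pub fn register(&mut self, table_id: TableId) {
        self.tables.entry(table_id).or_insert_with(|| {
            (
                Arc::new(TableSource),
                Arc::new(Mutex::new(RowIdGenerator::default())),
            )
        });
    }

    pub fn get(
        &self,
        table_id: TableId,
    ) -> Option<(Arc<TableSource>, Arc<Mutex<RowIdGenerator>>)> {
        self.tables.get(&table_id).cloned()
    }
}
```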
Contributor


DDL on MV will not go through SourceManager? 🤔

Contributor Author


🤔

Co-authored-by: Bowen <36908971+BowenXiao1999@users.noreply.github.com>

Contributor

fuyufjh commented Nov 8, 2022

LGTM.

rfcs/0017-DML.md Outdated

#### Generate row_id in BatchInsert

When a user does an insert on a table without a PK, we should generate the row_id for each record. Currently we can only generate row_id in the SourceExecutor, but now we might need to generate row ids for both external data and DML inserts' data. So the `BatchInsert` executor and the `SourceExecutor` will share the `RowIdGenerator`.
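One way to share a generator between the two executors is to wrap it in `Arc<Mutex<...>>`. This is a hedged sketch: the executor structs are simplifications, and the plain-counter generator stands in for the real one (which also encodes timestamp/vnode information).

```rust
use std::sync::{Arc, Mutex};

// Plain-counter stand-in for the real RowIdGenerator.
#[derive(Default)]
pub struct RowIdGenerator {
    next: u64,
}

impl RowIdGenerator {
    pub fn next_id(&mut self) -> u64 {
        let id = self.next;
        self.next += 1;
        id
    }
}

// Both executors hold a clone of the same Arc<Mutex<..>>, so rows from
// external data and from DML inserts draw from one id sequence and
// never collide.
pub struct SourceExecutor {
    pub row_id_gen: Arc<Mutex<RowIdGenerator>>,
}

pub struct BatchInsert {
    pub row_id_gen: Arc<Mutex<RowIdGenerator>>,
}

impl SourceExecutor {
    pub fn assign_row_id(&self) -> u64 {
        self.row_id_gen.lock().unwrap().next_id()
    }
}

impl BatchInsert {
    pub fn assign_row_id(&self) -> u64 {
        self.row_id_gen.lock().unwrap().next_id()
    }
}
```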
Contributor


> So the `BatchInsert` executor and `SourceExecutor` will share the `RowIdGenerator`

Please allow me to clarify that I don't insist on this part. Our current approach is to let `BatchInsert` insert rows with row_id NULL, which is acceptable to me. I just hope to keep it as simple as possible, so let's see which approach is better when writing the code. 😋

Contributor Author


This makes me change my mind: we cannot ensure that we will be able to reorder the DML messages in the future. Treating Insert and the other DML operations differently and separating them into two different channels would break any happens-before relation between them. I am not sure we should support strict transactions now, because our other components (such as the frontend and batch) do not support them yet. But I'd like to keep the possibility open at a small price.
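The happens-before argument can be illustrated with a toy example (not RisingWave code; the `DmlOp` enum and function names are made up): a single channel delivers all DML operations in issue order, so an Insert followed by a Delete of the same row can never be observed in the reverse order, whereas two separate channels would give the consumer no such guarantee.

```rust
use std::sync::mpsc;

#[derive(Debug, Clone, PartialEq)]
pub enum DmlOp {
    Insert(u64),
    Delete(u64),
}

// Push every DML op through one mpsc channel and collect what the
// consumer observes. std's mpsc channels are FIFO per sender, so the
// output order always equals the issue order.
pub fn run_single_channel(ops: Vec<DmlOp>) -> Vec<DmlOp> {
    let (tx, rx) = mpsc::channel();
    for op in ops {
        tx.send(op).unwrap();
    }
    // Drop the sender so the receiving iterator terminates.
    drop(tx);
    rx.into_iter().collect()
}
```

With two channels (one for inserts, one for deletes), the consumer would have to interleave them itself, and nothing would stop it from seeing the Delete of a row before the Insert that created it.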

@fuyufjh fuyufjh changed the title DML RFC: DML Design Nov 8, 2022
Co-authored-by: Eric Fu <fuyufjh@gmail.com>
Contributor

fuyufjh commented Nov 11, 2022

Besides, we can move Exchange before DMLExec if we can:

  1. Schedule batch DML executors according to the PKs
  2. Generate RowIDs of a certain vnode

Contributor

fuyufjh commented Nov 11, 2022

By the way, in the future we may need to optionally support persisting users' writes somewhere, to ensure no inserts are lost even on failure recovery. To achieve that, we would refactor the channel between the batch DML executor and the streaming DMLExec into a highly available MQ like Kafka or SQS, which seems to be another reason for using a single channel for insert/delete/update instead of two.


@st1page st1page requested a review from xx01cyx January 19, 2023 09:12

@xx01cyx xx01cyx left a comment


The first query description in the figure should be "no user defined pk". Rest LGTM.

Contributor Author

st1page commented Jan 19, 2023

[Figure: updated DML design diagram (dml_new drawio)]

@st1page st1page merged commit 64f3772 into main Jan 19, 2023
@ice1000 ice1000 deleted the sts/DML branch February 1, 2023 05:10