
RFC: DML Design #17

Merged
merged 8 commits into from Jan 19, 2023

Conversation

Contributor

@st1page st1page commented Nov 7, 2022

Contributor

@BowenXiao1999 BowenXiao1999 left a comment


BTW, if we do not support primary key definitions on tables, we should forbid this behavior and report an error to the user (currently it seems we do not do that).


#### Materialize check

The consistency and constraint checks will be done in the materialize executor. For every operation from upstream, the materialize executor will look up its pk in storage and fix the change if it conflicts with the storage's data. Because of these lookup queries, a cache is needed.
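The conflict fixing described here could be sketched roughly as follows. This is an illustrative model only, not RisingWave's actual code: the `MaterializeCache` type and `Op` enum are hypothetical, and a plain `HashMap` stands in for the storage lookup plus its cache.

```rust
use std::collections::HashMap;

// Hypothetical sketch of the conflict handling described above: look up
// each incoming row's pk and rewrite operations that conflict with the
// data already stored.
#[derive(Debug, Clone, PartialEq)]
pub enum Op {
    Insert(String),
    Update(String),
    Delete,
}

#[derive(Default)]
pub struct MaterializeCache {
    /// pk -> current row value; stands in for the storage lookup + cache.
    rows: HashMap<u64, String>,
}

impl MaterializeCache {
    /// Apply one upstream change, fixing it if it conflicts: an Insert on
    /// an existing pk becomes an Update, an Update on a missing pk becomes
    /// an Insert, and a Delete of a missing pk is dropped entirely.
    pub fn apply(&mut self, pk: u64, op: Op) -> Option<Op> {
        match op {
            Op::Insert(v) | Op::Update(v) => {
                let fixed = if self.rows.contains_key(&pk) {
                    Op::Update(v.clone())
                } else {
                    Op::Insert(v.clone())
                };
                self.rows.insert(pk, v);
                Some(fixed)
            }
            Op::Delete => self.rows.remove(&pk).map(|_| Op::Delete),
        }
    }
}
```

Under this model, downstream executors never observe an inconsistent sequence such as two Inserts for the same pk, which is what prevents panics in other components.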
Contributor


How to design the cache?

If we do this for every operation (INSERT/UPDATE/DELETE), the cost seems huge and hard to mitigate... Also, I think this looks similar to the Concurrent DELETE doc proposed by @fuyufjh a long time ago.

Contributor Author


Yes, that is what the doc mentions. The check is to ensure the external data is correct. We should do the check to make sure the data is valid and does not cause other components in our system to panic.

rfcs/0017-DML.md Outdated

#### DML manager

The `TableSourceManager` will be renamed to `DMLManager`. It will be indexed by `table_id` instead of the `source_id` of the table source, so the DML batch executors can easily get the `TableSource` for DML operations and the `Mutex<RowIdGenerator>`.
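A minimal sketch of what such a manager might look like, under illustrative assumptions: the `DmlManager` API (`register`, `get`) and the placeholder `TableSource` and `RowIdGenerator` types are hypothetical stand-ins, not the actual implementations.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

pub type TableId = u32;

// Placeholder for the real channel-backed table source.
#[derive(Default)]
pub struct TableSource;

// Plain-counter stand-in for the real row id generator.
#[derive(Default)]
pub struct RowIdGenerator {
    next: u64,
}

impl RowIdGenerator {
    pub fn next_id(&mut self) -> u64 {
        let id = self.next;
        self.next += 1;
        id
    }
}

// Entries are keyed by table_id rather than by the table source's
// source_id, so a DML batch executor fetches both the TableSource and
// the shared RowIdGenerator for a table in one lookup.
#[derive(Default)]
pub struct DmlManager {
    tables: HashMap<TableId, (Arc<TableSource>, Arc<Mutex<RowIdGenerator>>)>,
}

impl DmlManager {
    pub fn register(&mut self, table_id: TableId) {
        self.tables.entry(table_id).or_insert_with(|| {
            (
                Arc::new(TableSource),
                Arc::new(Mutex::new(RowIdGenerator::default())),
            )
        });
    }

    pub fn get(
        &self,
        table_id: TableId,
    ) -> Option<(Arc<TableSource>, Arc<Mutex<RowIdGenerator>>)> {
        self.tables.get(&table_id).cloned()
    }
}
```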
Contributor


DDL on MV will not go through SourceManager? 🤔

Contributor Author


🤔

Co-authored-by: Bowen <36908971+BowenXiao1999@users.noreply.github.com>

Contributor

fuyufjh commented Nov 8, 2022

LGTM.

rfcs/0017-DML.md Outdated

#### Generate row_id in BatchInsert

When a user does an insert on a table without a PK, we should generate the row_id for each record. Currently we can only generate row_id in the SourceExecutor, but now we might need to generate row ids for both external data and DML inserts' data. So the `BatchInsert` executor and the `SourceExecutor` will share the `RowIdGenerator`.
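One way to share a generator between the two executors is to wrap it in `Arc<Mutex<...>>`. This is a hedged sketch: the executor structs are simplifications, and the plain-counter generator stands in for the real one (which also encodes timestamp/vnode information).

```rust
use std::sync::{Arc, Mutex};

// Plain-counter stand-in for the real RowIdGenerator.
#[derive(Default)]
pub struct RowIdGenerator {
    next: u64,
}

impl RowIdGenerator {
    pub fn next_id(&mut self) -> u64 {
        let id = self.next;
        self.next += 1;
        id
    }
}

// Both executors hold a clone of the same Arc<Mutex<..>>, so rows from
// external data and from DML inserts draw from one id sequence and
// never collide.
pub struct SourceExecutor {
    pub row_id_gen: Arc<Mutex<RowIdGenerator>>,
}

pub struct BatchInsert {
    pub row_id_gen: Arc<Mutex<RowIdGenerator>>,
}

impl SourceExecutor {
    pub fn assign_row_id(&self) -> u64 {
        self.row_id_gen.lock().unwrap().next_id()
    }
}

impl BatchInsert {
    pub fn assign_row_id(&self) -> u64 {
        self.row_id_gen.lock().unwrap().next_id()
    }
}
```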
Contributor


> So the `BatchInsert` executor and `SourceExecutor` will share the `RowIdGenerator`

Please allow me to clarify that I don't insist on this part. Our current approach is to let `BatchInsert` insert rows with row_id NULL, which is acceptable to me. I just hope to keep it as simple as possible, so let's see which approach is better when writing the code. 😋

Contributor Author


This makes me change my mind: we cannot ensure that we will be able to reorder the DML messages in the future. Treating Insert and the other DML operations differently and separating them into two different channels would break any happens-before relation between them. I am not sure we should support strict transactions now, because our other components (such as the frontend and batch) do not support them yet. But I'd like to keep the possibility open at a small price.
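The happens-before argument can be illustrated with a toy example (not RisingWave code; the `DmlOp` enum and function names are made up): a single channel delivers all DML operations in issue order, so an Insert followed by a Delete of the same row can never be observed in the reverse order, whereas two separate channels would give the consumer no such guarantee.

```rust
use std::sync::mpsc;

#[derive(Debug, Clone, PartialEq)]
pub enum DmlOp {
    Insert(u64),
    Delete(u64),
}

// Push every DML op through one mpsc channel and collect what the
// consumer observes. std's mpsc channels are FIFO per sender, so the
// output order always equals the issue order.
pub fn run_single_channel(ops: Vec<DmlOp>) -> Vec<DmlOp> {
    let (tx, rx) = mpsc::channel();
    for op in ops {
        tx.send(op).unwrap();
    }
    // Drop the sender so the receiving iterator terminates.
    drop(tx);
    rx.into_iter().collect()
}
```

With two channels (one for inserts, one for deletes), the consumer would have to interleave them itself, and nothing would stop it from seeing the Delete of a row before the Insert that created it.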

@fuyufjh fuyufjh changed the title DML RFC: DML Design Nov 8, 2022
Co-authored-by: Eric Fu <fuyufjh@gmail.com>
Contributor

fuyufjh commented Nov 11, 2022

Besides, we can move Exchange before DMLExec if we can:

  1. Schedule batch DML executors according to the PKs
  2. Generate RowIDs of a certain vnode

Contributor

fuyufjh commented Nov 11, 2022

By the way, in the future we may need to optionally support persisting users' writes somewhere, to ensure no inserts are lost even on failure recovery. To achieve that, we would refactor the channel between the batch DML executor and the streaming DMLExec into a highly available MQ like Kafka or SQS, which seems to be another reason for using a single channel for insert/delete/update instead of two.


@st1page st1page requested a review from xx01cyx January 19, 2023 09:12

@xx01cyx xx01cyx left a comment


The first query description in the figure should be "no user defined pk". Rest LGTM.

Contributor Author

st1page commented Jan 19, 2023

[Figure: updated DML design diagram (dml_new drawio)]

@st1page st1page merged commit 64f3772 into main Jan 19, 2023
@ice1000 ice1000 deleted the sts/DML branch February 1, 2023 05:10