
discussion: allow stream query on creating mv? #12771

Open
st1page opened this issue Oct 11, 2023 · 12 comments
@st1page (Contributor) commented Oct 11, 2023:

Discussed offline with @wyhyhyhyh.
Currently, a creating MV is invisible, and the user cannot run batch or streaming queries on it.

dev=> SET BACKGROUND_DDL=true;
SET_VARIABLE
dev=> CREATE TABLE t (v1 int);
CREATE_TABLE
dev=> INSERT INTO t select * from generate_series(1, 300000);
INSERT 0 300000
dev=> FLUSH;
FLUSH
dev=> CREATE MATERIALIZED VIEW m1 as SELECT * FROM t;
CREATE_MATERIALIZED_VIEW
dev=> CREATE MATERIALIZED VIEW m2 as SELECT * FROM m1;
ERROR:  QueryError: Catalog error: table or source not found: m1

Complex data processing pipelines are usually layered, comprising many materialized views that depend on each other. When creating a stream query on an existing MV, RW backfills all the historical data in the upstream MV and unions it with the incoming changes.
Consider an MvA, an MvB on MvA, and an MvC on MvB (S -> MvA -> MvB -> MvC). Under the current design, the user must create the materialized views one by one: create MvA, wait for all its historical data to be backfilled, and only after that create MvB. But if the whole pipeline could be constructed at once, much of this backfilling would be unnecessary.
We could achieve this in two ways:

  1. Allow users to create multiple materialized views atomically, with syntax like "transactional DDL".
  2. Allow users to create a stream query on a creating MV, so that the downstream MV consumes the historical data faster (because there is less data than in the base table) and catches up with the upstream; see the sketch after this list.
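
To make option 2 concrete, here is a sketch of the session it would enable. The names follow the example above; the success of the second CREATE is the proposed behavior, not what RW does today:

dev=> SET BACKGROUND_DDL=true;
SET_VARIABLE
dev=> CREATE MATERIALIZED VIEW m1 AS SELECT * FROM t;
CREATE_MATERIALIZED_VIEW
dev=> -- m1 is still backfilling here; under the proposal this succeeds
dev=> CREATE MATERIALIZED VIEW m2 AS SELECT * FROM m1;
CREATE_MATERIALIZED_VIEW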
github-actions bot added this to the release-1.4 milestone Oct 11, 2023
@fuyufjh: This comment was marked as outdated.

@st1page (Contributor, Author) commented Oct 11, 2023:

Also, for user experience, I think we should allow stream queries on a creating MV; otherwise the user must wait for the MV to backfill all the historical data.

@fuyufjh: This comment was marked as outdated.

@chenzl25 (Contributor) commented:

> allow users to create a stream query on a creating MV, so that the downstream MV consumes the historical data faster (because there is less data than in the base table) and catches up with the upstream.

Personally, I prefer the second one, because it is more practical and the concept of transactional DDL is too big. BTW, recoverable backfill is also needed in this case. We could return from the DDL immediately to make the creating MV visible to streaming and batch queries. But for batch query consistency, we need to block a batch query until all of the MV's upstream backfilling MVs are finished.
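
A sketch of the proposed batch-side semantics, continuing the session from the issue description (the blocking behavior is the proposal, not current behavior):

dev=> CREATE MATERIALIZED VIEW m2 AS SELECT * FROM m1;
CREATE_MATERIALIZED_VIEW
dev=> -- proposed: blocks until m1 and m2 both finish backfilling, then returns a consistent result
dev=> SELECT count(*) FROM m2;
 count
--------
 300000
(1 row)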

@kwannoel (Contributor) commented May 12, 2024:

> > allow users to create a stream query on a creating MV, so that the downstream MV consumes the historical data faster (because there is less data than in the base table) and catches up with the upstream.
>
> Personally, I prefer the second one, because it is more practical and the concept of transactional DDL is too big. BTW, recoverable backfill is also needed in this case. We could return from the DDL immediately to make the creating MV visible to streaming and batch queries. But for batch query consistency, we need to block a batch query until all of the MV's upstream backfilling MVs are finished.

I think this sounds reasonable, with a session variable to configure it. For implementation, we just need to sync the catalog back to the frontend.
Note that we must differentiate the Finished and Creating states of the MV within the catalog, so that we know whether to expose an MV to the batch side.

I'm already planning to work on this part (syncing the catalog to the frontend) so that we can unify Drop and Cancel, so the work should have some overlap.

@kwannoel self-assigned this May 12, 2024
@xxchan (Member) commented May 12, 2024:

It looks weird to me that a creating MV can't be SELECTed but can be referenced to create a new MV. From the user's perspective, the MV is still unavailable.

BTW, if we choose this approach, maybe we could return a notice on the second CREATE MATERIALIZED VIEW (maybe also for SET BACKGROUND_DDL=true, perhaps linking to a doc page) to explain the behavior.

@kwannoel (Contributor) commented:

BTW, we also need a synchronizing mechanism, like WAIT, to indicate when the MV has finished backfilling. The current implementation of WAIT polls the meta node; we could let it subscribe to the observer instead.
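
For reference, a sketch of that flow using WAIT (assuming RisingWave's WAIT statement, which blocks until in-flight DDL jobs finish; whether it polls meta or subscribes to the observer is not visible at this level):

dev=> SET BACKGROUND_DDL=true;
SET_VARIABLE
dev=> CREATE MATERIALIZED VIEW m1 AS SELECT * FROM t;
CREATE_MATERIALIZED_VIEW
dev=> -- blocks until m1's backfill completes
dev=> WAIT;
WAIT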

@fuyufjh (Member) commented May 13, 2024:

> > allow users to create a stream query on a creating MV, so that the downstream MV consumes the historical data faster (because there is less data than in the base table) and catches up with the upstream.
>
> Personally, I prefer the second one, because it is more practical and the concept of transactional DDL is too big. BTW, recoverable backfill is also needed in this case. We could return from the DDL immediately to make the creating MV visible to streaming and batch queries. But for batch query consistency, we need to block a batch query until all of the MV's upstream backfilling MVs are finished.

+1 for this idea.

Think a step further (assuming MV2 depends on MV1 and both are creating):

  1. If MV2 completes backfilling first, it doesn't mean MV2 has been created; it only means MV2 has caught up with MV1's progress. Thus, it's still invisible.
  2. If recovery happens while MV1 is creating, both MV1 and MV2 fail to create and need to be removed from metadata. More precisely, an MV must not be considered completed before its dependencies complete; if MV1 completes while MV2 is ongoing, it's okay to mark MV1 as succeeded.

Furthermore, MV2->MV1 is the simplest form. The actual implementation needs to handle a DAG of creating MVs, which seems to impose a lot of complexity on the stream manager.
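
A hypothetical SHOW JOBS view of point 1 (the column layout and progress values are illustrative, not actual RisingWave output):

dev=> SHOW JOBS;
  Id  |                 Statement                  | Progress
------+--------------------------------------------+----------
 1001 | CREATE MATERIALIZED VIEW mv1 AS SELECT ... | 40.0%
 1002 | CREATE MATERIALIZED VIEW mv2 AS SELECT ... | 100.0%

Even though mv2 shows 100%, it has only caught up with mv1's current progress, so it stays invisible until mv1 also finishes.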

@kwannoel (Contributor) commented:

We should only provide this feature for background DDL, I suppose, because in many cases users rely on DBT to handle the creation of the stream job DAG.

For a normal stream job, we only return a response once it is done. With background DDL, we return a response immediately upon firing the command, so DBT can immediately continue to create the next MVs.

@kwannoel (Contributor) commented:

> Furthermore, MV2->MV1 is the simplest form. The actual implementation needs to handle a DAG of creating MVs, which seems to impose a lot of complexity on the stream manager.

For each MV we create, we now also need to watch its upstream MVs, and only mark its state as Finished once the upstream MVs are also Finished.
This should not affect foreground jobs, since this feature should only be provided for background DDL.

In terms of cancelling / dropping the streaming DAG, once we unify cancel / drop, we can reuse the cascade logic of drop to handle it.

@st1page (Contributor, Author) commented May 15, 2024:

> We should only provide this feature for background DDL, I suppose, because in many cases users rely on DBT to handle the creation of the stream job DAG.
>
> For a normal stream job, we only return a response once it is done. With background DDL, we return a response immediately upon firing the command, so DBT can immediately continue to create the next MVs.

IIUC, the DBT driver does not do anything special here, so it does not use background DDL. Could and should it use background DDL by default? cc @chenzl25

@chenzl25 (Contributor) commented:

I think we should not enable background DDL by default for DBT. DBT has different models, e.g. materialized_view and table. Imagine models with the following dependency: mv1 -> table2 -> mv3. If the backfilling is blocking, then table2 will contain all data from mv1, which I think is the most straightforward behavior. If we use background DDL, mv1 will be created immediately, but table2 could read just a small portion of the data of the still-backfilling mv1. Finally, mv3 would consume only a small portion of the data as well.
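
In plain SQL, the chain above might look like this (the names and the CTAS form for the table model are illustrative):

dev=> CREATE MATERIALIZED VIEW mv1 AS SELECT * FROM src;
dev=> -- blocking DDL: this CTAS runs only after mv1's backfill completes, so it sees all of mv1's data;
dev=> -- with background DDL it could run while mv1 is still backfilling and capture only a partial snapshot
dev=> CREATE TABLE table2 AS SELECT * FROM mv1;
dev=> CREATE MATERIALIZED VIEW mv3 AS SELECT * FROM table2;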
