Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: create source without consuming data until a start command #13103

Open
chenzl25 opened this issue Oct 27, 2023 · 5 comments
Open

Comments

@chenzl25
Copy link
Contributor

Currently, when users create a source and then create some Mvs on it, it will immediately consume data. However, in some circumstances, it will be better if we don't consume it first.

Case1:
Users may want to rebuild the entire streaming job without waiting for each individual Mv to be backfilled. To achieve this, they can create a source without consuming the data. Once the pipeline is built, they can then start consuming the source data.

Case2:
I have found that a user used temporal join to enforce some invariants for the output data. To ensure that the invariants are valid for all output, we need to ensure that there is no data consumption between Mv latest_b_per_kind and a_b. Otherwise, the temporal join may lookup future data.

SET streaming_parallelism = 1;
​
CREATE TABLE events (seq bigint, event_type int,  kind varchar) APPEND ONLY
WITH (
    connector = 'datagen',
    fields.seq.kind = 'sequence',
    fields.seq.end = '9223372036854775807',
    fields.event_type.kind = 'random',
    fields.event_type.min = '1',
    fields.event_type.max = '2',
    fields.event_type.seed = 1,
    fields.kind.kind = 'random',
    fields.kind.length = 2,
    fields.kind.seed = 1,
    datagen.rows.per.second = 10000
) FORMAT PLAIN ENCODE JSON;
​
​
CREATE MATERIALIZED VIEW IF NOT EXISTS a AS (
    SELECT
        DISTINCT ON (seq) seq,
        kind
    FROM
        events
    WHERE
        event_type = 1
);
​
CREATE MATERIALIZED VIEW IF NOT EXISTS b AS (
    SELECT
        DISTINCT ON (seq) seq,
        kind
    FROM
        events
    WHERE
        event_type = 2
);
​
CREATE MATERIALIZED VIEW IF NOT EXISTS latest_b_per_kind AS (
    SELECT
        kind,
        seq
    FROM
        (
            SELECT
                kind,
                seq,
                row_number() OVER (
                    PARTITION BY kind
                    ORDER BY
                        seq DESC
                ) as rank
            FROM
                b
        )
    WHERE
        rank = 1
);
​
CREATE MATERIALIZED VIEW IF NOT EXISTS a_b AS (
    SELECT
        a.seq as a_seq,
        a.kind as kind,
        latest_b_per_kind.seq as b_seq
    FROM
        a
        LEFT JOIN latest_b_per_kind FOR SYSTEM_TIME AS OF PROCTIME() ON a.kind = latest_b_per_kind.kind
);

Check invariant:

SELECT count(1) from a_b where b_seq is not null and a_seq < b_seq;
@github-actions github-actions bot added this to the release-1.4 milestone Oct 27, 2023
@hzxa21
Copy link
Collaborator

hzxa21 commented Oct 27, 2023

This can also be a very useful admin operations for troubleshooting and issue mitigation. Related: #12997

@BugenZhao
Copy link
Member

There's a similar issue long ago: #3073

@StrikeW
Copy link
Contributor

StrikeW commented Oct 27, 2023

For case1, I think transactional DDL makes more sense.

@st1page
Copy link
Contributor

st1page commented Oct 27, 2023

some discussion about case1: #12771

@fuyufjh fuyufjh removed this from the release-1.4 milestone Nov 8, 2023
Copy link
Contributor

github-actions bot commented Jan 8, 2024

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants