Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should we ban full outer join for streaming query? #8084

Closed
chenzl25 opened this issue Feb 21, 2023 · 23 comments
Closed

Should we ban full outer join for streaming query? #8084

chenzl25 opened this issue Feb 21, 2023 · 23 comments
Assignees
Labels
type/bug Something isn't working
Milestone

Comments

@chenzl25
Copy link
Contributor

Describe the bug

A null row from either left or right side produces the same row (null, null) to the downstream.

create table t (a int primary key);
insert into t values(null);
create materialized view v as select t1.* from t as t1 full join t as t2 on t1.a = t2.a; -- panic

To Reproduce

No response

Expected behavior

No response

Additional context

No response

@chenzl25 chenzl25 added the type/bug Something isn't working label Feb 21, 2023
@github-actions github-actions bot added this to the release-0.1.18 milestone Feb 21, 2023
@chenzl25
Copy link
Contributor Author

cc @yuhao-su

@fuyufjh
Copy link
Contributor

fuyufjh commented Feb 21, 2023

What's the cause of panic?

@chenzl25
Copy link
Contributor Author

What's the cause of panic?

left side: +[null] --> Full Join -> +[null, null]
right side: +[null] --> Full Join -> +[null, null]

From the downstream's view, the full join operator inserts the same row twice.

@yuhao-su
Copy link
Contributor

yuhao-su commented Feb 21, 2023

The correct behavior should be:
left side: +[null] --> Full Join -> +[null, null]
left side: +[null] --> Full Join -> -[null, null] +[null, null]
It might be a bug

@chenzl25
Copy link
Contributor Author

The correct behavior should be: left side: +[null] --> Full Join -> +[null, null] left side: +[null] --> Full Join -> -[null, null] +[null, null] It might be a bug

I don't think it is a bug. The batch query will output 2 rows instead of one row.

@yuhao-su
Copy link
Contributor

I don't think it is a bug. The batch query will output 2 rows instead of one row.

You are right! I can't think of any easy way to fix this. Maybe we should ban it for now. cc. @st1page

@st1page
Copy link
Contributor

st1page commented Feb 21, 2023

😇 in fact, I prefer to ban all null primary keys. The only origin of that is Materialized view with GROUP BY K where k could be null. But if there any users need that behavior?

@BugenZhao
Copy link
Member

Link to #8059.

@fuyufjh
Copy link
Contributor

fuyufjh commented Feb 21, 2023

😇 in fact, I prefer to ban all null primary keys. The only origin of that is Materialized view with GROUP BY K where k could be null. But if there any users need that behavior?

Group by CUBE will generate a NULL group by default. :lark-cry:

@fuyufjh
Copy link
Contributor

fuyufjh commented Feb 21, 2023

I prefer to keep the full outer join. In batch query it's almost useless but I guess it might be more useful for stream joining stream. Just guess.

But I don't have any idea to fix it now 🤔

@chenzl25
Copy link
Contributor Author

chenzl25 commented Feb 21, 2023

Maybe we can use a trick like I used in union all operator before. Use a project plus a constant (indicating which side) to extend the stream key of both sides. In this way, we can tell the difference between left side +[null] and right side +[null], because they will become +[null, 0] and +[null, 1].

dev=> explain create materialized view v as select * from t union all select * from t;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 StreamMaterialize { columns: [a, 0:Int32(hidden)], pk_columns: [a, 0:Int32] }
 └─StreamUnion { all: true }
   ├─StreamExchange { dist: HashShard(t.a, 0:Int32) }
   | └─StreamProject { exprs: [t.a, 0:Int32] }
   |   └─StreamTableScan { table: t, columns: [a] }
   └─StreamExchange { dist: HashShard(t.a, 1:Int32) }
     └─StreamProject { exprs: [t.a, 1:Int32] }
       └─StreamTableScan { table: t, columns: [a] }
(8 rows)

@fuyufjh
Copy link
Contributor

fuyufjh commented Feb 22, 2023

Maybe we can use a trick like I used in union all operator before. Use a project plus a constant (indicating which side) to extend the stream key of both sides. In this way, we can tell the difference between left side +[null] and right side +[null], because they will became +[null, 0] and +[null, 1].

I think this is the correct direction. Let me explain more about my thoughts.

Theoretically, you may consider a full outer join as a union all which combines the results from 3 ways:

  1. All the results of the "inherent" inner join i.e. [left_row, right_row]
  2. Result of [left_row, NULLs] for those not matched from left side
  3. Result of [NULLs, right_row] for those not matched from right side

For the left outer join, only 1 & 2 exists. Luckily, they must be non-conflict because a left_row must belong to either 1 or 2, not both, so the left_row ensures the uniqueness of left_pk in the result's PK (which is [left_pk, right_pk])

While, for the full outer join, the problem happened because result rows in 2 & 3 can be conflicted, as @chenzl25's example shows.

Thus, I think adding a column to mark the "source" (1/2/3, as explained above) of the result row is the correct solution, but might be too heavy.

@fuyufjh
Copy link
Contributor

fuyufjh commented Feb 22, 2023

Another way to mitigate the problem is to forbid null PKs on base tables. I know this cannot solve the problem completely because you can construct a MView with aggregation, but it can reduce the odds hopefully.

By the way, PG also rejects null PK:

dev=# create table t1 (pk int, jk int, primary key (pk));
CREATE TABLE
dev=# insert into t1 values (null, 5);
ERROR:  null value in column "pk" of relation "t1" violates not-null constraint

@yuhao-su
Copy link
Contributor

yuhao-su commented Feb 22, 2023

By the way, PG also rejects null PK

Yes, we can simply remove the pk constraint and get the same result. I think PG will add a hidden column on the source in this case.

I can't think of any way to fully solve this problem by banning null pk from the source since we have agg. So I prefer adding a column to mark the "source" solution. The cost of adding 1 column on two sides only in full outer join sound acceptable to me.

@yuhao-su yuhao-su self-assigned this Mar 3, 2023
@st1page
Copy link
Contributor

st1page commented Mar 8, 2023

Another way to mitigate the problem is to forbid null PKs on base tables. I know this cannot solve the problem completely because you can construct a MView with aggregation, but it can reduce the odds hopefully.

By the way, PG also rejects null PK:

dev=# create table t1 (pk int, jk int, primary key (pk));
CREATE TABLE
dev=# insert into t1 values (null, 5);
ERROR:  null value in column "pk" of relation "t1" violates not-null constraint

How about just adding a filter to ignore NULL stream key for the full outer join as a workaround.

@yuhao-su
Copy link
Contributor

yuhao-su commented Mar 8, 2023

How about just adding a filter to ignore NULL stream key for the full outer join.

This will provide incorrect result 🥵

@st1page
Copy link
Contributor

st1page commented Mar 8, 2023

we can control the incorrect field with a more narrow predicate on the output of the outer join. just add a filter to remove the outer join's output rows where the output stream key is NULL.

SELECT * from t1 full outer join t2 on t1.pk = t2.pk;
/*will be transformed to*/
SELECT * from (
    SELECT * from t1 full outer join t2 on t1.pk = t2.pk;
) where NOT (t1.pk IS NULL AND t2.pk IS NULL)

@liurenjie1024
Copy link
Contributor

Why we don't just check every derived pk in plan node and check that it can't be all nullable?

@st1page
Copy link
Contributor

st1page commented Mar 9, 2023

innocent in fact, I prefer to ban all null primary keys. The only origin of that is Materialized view with GROUP BY K where k could be null. But if there any users need that behavior?

Group by CUBE will generate a NULL group by default. :lark-cry:

@liurenjie1024 :lark_cry

@liurenjie1024
Copy link
Contributor

I think this is a bug in our optimizer to determine the primary key of each streaming plan node. In full outer join, the pk + join key may not be unique when pk can be null, and in this case we may need to add extra column to ensure uniqueness.

@fuyufjh
Copy link
Contributor

fuyufjh commented Mar 13, 2023

I think this is a bug in our optimizer to determine the primary key of each streaming plan node. In full outer join, the pk + join key may not be unique when pk can be null, and in this case we may need to add extra column to ensure uniqueness.

Correct, but we need to trade off between this very rare cases and complexity. I agree with @st1page that we can tolerate this incorrect behavior (i.e. just don't panic) in an off-line discussion

@liurenjie1024
Copy link
Contributor

Correct, but we need to trade off between this very rare cases and complexity. I agree with @st1page that we can tolerate this incorrect behavior (i.e. just don't panic) in an off-line discussion

I feel weird to allow incorrect behavior. Why not just ban such kind of query which may cause this wrong behavior? I think this is much effort.

@st1page
Copy link
Contributor

st1page commented Mar 14, 2023

We will ban it when we have "not null" property in the optimizer and now workaround in #8520

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants