Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: on conflict clause #48

Closed
wants to merge 10 commits into from
Closed

RFC: on conflict clause #48

wants to merge 10 commits into from

Conversation

st1page
Copy link
Contributor

@st1page st1page commented Feb 7, 2023

No description provided.

Co-authored-by: xxchan <xxchan22f@gmail.com>
Co-authored-by: xiangjinwu <17769960+xiangjinwu@users.noreply.github.com>
(1 row)
```

The most strange part is that the input parameter and return value of the "reduce function" must have the same schema. But it can be easily workaround with constant expressions.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious about how schema change will handle this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the mechanism to handle the outdated schema has no conflict with this RFC. 🤔 We have the same issue before we implement this RFC

COALESCE(EXCLUDED.b, b));
```

This feature will be useful when user use wide-table schema data model.
Copy link
Member

@fuyufjh fuyufjh Feb 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it, this is similar to Apache Doris' pre-aggregate data model, where the data of the same PK can be automatically merged according to some custom rules. However, as one of its selling points, Doris has a specially designed storage to do this pre-aggregation efficiently, but we don't have it.

Apparently, our storage is optimized for non-conflict use cases, which means that most rows will not have a conflict on PK, and the PK check in MaterializeExecutor is just a precaution to avoid a crash. On the contrary, this RFC assumes that incoming rows will have many conflicts by design, and hopes to resolve them in MaterializeExecutor. I don't think it's a good idea to encourage users to do this, and it's not worth optimizing for.

Also, as the proposal of it, I am not very confident that checking in MaterializeExecutor is a good idea. Maybe it will become a painpoint at some point and maybe users will ask us to remove it. I don't really recommend relying on it much.

Copy link
Contributor Author

@st1page st1page Feb 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, as the proposal of it, I am not very confident that checking in MaterializeExecutor is a good idea. Maybe it will become a painpoint at some point and maybe users will ask us to remove it. I don't really recommend relying on it much.

We can just treat the check in the MaterializeExecutor as a pluggable feature. Now it is implemented as a flag in MaterializeExecutor, so if users ask us to remove it we can remove it easily.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it, this is similar to Apache Doris' pre-aggregate data model, where the data of the same PK can be automatically merged according to some custom rules. However, as one of its selling points, Doris has a specially designed storage to do this pre-aggregation efficiently, but we don't have it.

https://doris.apache.org/docs/data-table/data-model/#aggregate-model just supports some limited behaviors.

AggregationTypecurrently has the following four ways of aggregation:
SUM: Sum, multi-line Value accumulation.
REPLACE: Instead, Values in the next batch of data will replace Values in rows previously imported.
MAX: Keep the maximum.
MIN: Keep the minimum.

In fact, if we just want to support this behavior, RisingWave has already been able to express it with a Create Materialized view with an aggregate query.
The on conflicts can take much more expressiveness to users.


## Summary

allow user to declare the conflict behavior when the newly inserted row break the unique constraint of primary key.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by the way, when do we do constraint checks? every time a row is materialized?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the logic will be implemented in the materialize executor of the table.

rfcs/0048-on-conflict-clause-in-create-table-statement.md Outdated Show resolved Hide resolved

### Future possibilities

We can support the `ON CONFLICT` clause in the Insert statement if we support attach the on conflict description for the chunk in `BatchInsert` executor and `StreamDML` executor to let `Materialize` executor know how to handle the conflict. But the on conflict clause is always needed in create table statement for the table with connector.
Copy link
Member

@BugenZhao BugenZhao Feb 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still "stream path" between the DML and Materialize, also Exchange if there's user-specified PK, so eventually it seems we need to attach this "description" to each row, which sounds like an overkill. 🤔

rfcs/0048-on-conflict-clause-in-create-table-statement.md Outdated Show resolved Hide resolved

```SQL
CREATE table t(k int primary key, int cnt)
ON CONFLICT DO UPDATE SET cnt = t.cnt + EXCLUDED.cnt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we don't use aggregation call here:

select k, sum(cnt) from t;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is just a trivial example here to show that this grammar can let user declare their own aggregation logic with SQL. And the loigic could be more complex.

st1page and others added 2 commits February 16, 2023 20:01
[ WHERE condition ]
```

- **Defined when `CREATE TABLE` instead of `INSERT`**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can consider this as a config for the data from connector. i.e. What command the connector should use to write data - REPLACE or INSERT or anything else.

@st1page
Copy link
Contributor Author

st1page commented Mar 1, 2024

details and description in the RFC is out-of-date and the behavior of the feature is written in risingwavelabs/risingwave-docs#1860

@st1page st1page closed this Mar 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants