Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

expr: give a clear definition of the "volatility" for all expressions #9030

Closed
BugenZhao opened this issue Apr 6, 2023 · 6 comments
Closed
Labels
component/common Common components, such as array, data chunk, expression. type/enhancement Improvements to existing implementation.

Comments

@BugenZhao
Copy link
Member

BugenZhao commented Apr 6, 2023

Recently there're more and more features that require expressions to be categorized, here're some examples:

I think it's time for us to give a clear definition of the "volatility" for all expressions, just like the definition in Postgres: https://www.postgresql.org/docs/current/xfunc-volatility.html.

Every function has a volatility classification, with the possibilities being VOLATILE, STABLE, or IMMUTABLE. VOLATILE is the default if the CREATE FUNCTION command does not specify a category. The volatility category is a promise to the optimizer about the behavior of the function:

A VOLATILE function can do anything, including modifying the database. It can return different results on successive calls with the same arguments. The optimizer makes no assumptions about the behavior of such functions. A query using a volatile function will re-evaluate the function at every row where its value is needed.

A STABLE function cannot modify the database and is guaranteed to return the same results given the same arguments for all rows within a single statement. This category allows the optimizer to optimize multiple calls of the function to a single call. In particular, it is safe to use an expression containing such a function in an index scan condition. (Since an index scan will evaluate the comparison value only once, not once at each row, it is not valid to use a VOLATILE function in an index scan condition.)

An IMMUTABLE function cannot modify the database and is guaranteed to return the same results given the same arguments forever. This category allows the optimizer to pre-evaluate the function when a query calls it with constant arguments. For example, a query like SELECT ... WHERE x = 2 + 2 can be simplified on sight to SELECT ... WHERE x = 4, because the function underlying the integer addition operator is marked IMMUTABLE.

Based on these definitions, we have:

  • Currently, our expression framework is only able to execute IMMUTABLE expressions (except for UDF).
  • For STABLE and VOLATILE functions, we need to persist the result in case of deletes.
  • Only IMMUTABLE functions are allowed in functional indexes and non-NULL default values.

cc @xiangjinwu @wangrunji0408 @chenzl25

@BugenZhao BugenZhao added type/enhancement Improvements to existing implementation. component/common Common components, such as array, data chunk, expression. labels Apr 6, 2023
@github-actions github-actions bot added this to the release-0.19 milestone Apr 6, 2023
@st1page
Copy link
Contributor

st1page commented Apr 7, 2023

STABLE and VOLATILE functions can also be supported to

  • batch query
  • append-only stream
  • functional indexes on append-only table

@CAJan93
Copy link
Contributor

CAJan93 commented Apr 11, 2023

Do/can we check if a user provided the correct category in CREATE FUNCTION? Can a user mark a function that manipulates data as IMMUTABLE?

@xxchan
Copy link
Member

xxchan commented Apr 11, 2023

can we check if a user provided the correct category ...

No we can't 😄

@xxchan
Copy link
Member

xxchan commented Apr 12, 2023

Which functions are allowed to be constant-folded.

How is volatility related with constant folding? I think it's done in frontend at runtime, so it should be fine. Oh, unless we have plan cache?

Ignore me, I'm stupid. I get it:

select RANDOM() from (values (1),(2)) t;
+---------------------+
| random              |
|---------------------|
| 0.24088808981450427 |
| 0.42239150263949554 |
+---------------------+

select NOW() from (values (1),(2)) t;
+-------------------------------+
| now                           |
|-------------------------------|
| 2023-04-12 23:33:34.738809+02 |
| 2023-04-12 23:33:34.738809+02 |
+-------------------------------+

(So NOW is STABLE and RANDOM is VOLATILE

I'm asking this while wondering how many optimizations should consider volatility, after seeing #9133.

BTW, CSE should be one. 🥸 https://twitter.com/zty0826/status/1609584138490904577

@BugenZhao BugenZhao removed their assignment Jun 12, 2023
@BugenZhao BugenZhao removed this from the release-0.20 milestone Jun 12, 2023
@BugenZhao
Copy link
Member Author

#9133 has roughly divided all expressions into impure and pure. Based on this property, several optimizations have become more serious these days, and it seems good enough. Let's distinguish STABLE and VOLATILE if we find it needed in the future.

@xxchan
Copy link
Member

xxchan commented Jun 12, 2023

I guess currently impure = VOLATILE, so it's conservative and maybe good enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/common Common components, such as array, data chunk, expression. type/enhancement Improvements to existing implementation.
Projects
None yet
Development

No branches or pull requests

5 participants