RFC: lambda expression #69

st1page · 2023-08-02T03:51:01Z

Use lambda expressions to enhance complex processing of arrays.

xiangjinwu · 2023-08-02T04:26:15Z

rfcs/0069-lambda-expression.md

+
+### level 2: scalar lambda expression
+
+Implement a lambda expression function so that users can define their own `transform` logic for each element of an array.


risingwavelabs/risingwave#11123
Another part of this solution is to implement the transform and reduce functions, whose inputs include both a data column and the lambda expression.
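As a rough illustration of a function that takes both a data column and a lambda, here is a minimal Rust sketch. The name `transform`, its signature, and the choice to pass NULL elements (modeled as `None`) through unchanged are assumptions made for this example, not the RFC's definition.

```rust
/// Hypothetical `transform`: apply `f` to every element of every array in a column.
/// `None` models a SQL NULL element and is passed through unchanged (an assumption).
fn transform<T, U>(column: &[Vec<Option<T>>], f: impl Fn(&T) -> U) -> Vec<Vec<Option<U>>> {
    column
        .iter()
        .map(|arr| arr.iter().map(|v| v.as_ref().map(&f)).collect())
        .collect()
}
```

Calling `transform(&column, |x| x * 2)` on a column of integer arrays would double each non-NULL element while keeping the array shapes intact.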

chenzl25 · 2023-08-02T10:10:01Z

I find that we can unnest this kind of subquery in some way. Here is a PR on how to transpose Apply with ProjectSet. risingwavelabs/risingwave#11390.
If we support the following two optimizer rules later, we can resolve this issue.

- Transposing Apply with TableFunction, which I think can be reduced to ProjectSet + Value in the future.
- Reducing the trivial nested-loop join generated by subquery unnesting.

st1page · 2023-08-03T05:34:16Z

> I find that we can unnest this kind of subquery in some way. Here is a PR on how to transpose Apply with ProjectSet. risingwavelabs/risingwave#11390.
> If we support the following two optimizer rules later, we can resolve this issue.

LGTM, but still not the best, because it introduces an unnecessary Join operator 🤔

chenzl25 · 2023-08-03T05:39:07Z

> > I find that we can unnest this kind of subquery in some way. Here is a PR on how to transpose Apply with ProjectSet. risingwavelabs/risingwave#11390.
> > If we support the following two optimizer rules later, we can resolve this issue.
>
> LGTM, but still not the best, because it introduces an unnecessary Join operator 🤔

True. It could be a temporary solution. From a performance perspective, this RFC could do better.

TennyZhuang · 2023-08-14T07:46:48Z

rfcs/0069-lambda-expression.md

+```SQL
+select *,
+  array_sum(arr_i),
+  array_sum(transform(arr_json, v -> (v->'x')::int))


It'll be a little confusing to use `->` for two purposes, and it increases the complexity of the parser and binder.

I'd propose using `=>` for lambda functions.

TennyZhuang · 2023-09-01T03:09:43Z

The latest progress:

TL;DR

- We use the `|x| x * 2` syntax for the lambda function MVP.
- `array_reduce` is very heavy in our system, so we implement the utility array functions `array_{sum|min|max}` first.

The lambda syntax

In this RFC, we proposed using the arrow symbol “->” as the syntax for lambda expressions, such as x -> x * 2, since it's widely supported by different systems. However, this symbol conflicts with JsonAccess and creates significant ambiguity.

A bad case: how should f(ARRAY[1,2,3], x -> 'y' or z) be parsed? As (x -> 'y') or z, or as x -> ('y' or z)? The two readings produce very different ASTs after the parser phase.
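To make the two readings concrete, the sketch below builds both trees by hand. The `Expr` enum and helper names are hypothetical and exist only for this illustration; they are not the parser's actual AST types.

```rust
// Hypothetical mini-AST, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Ident(String),
    Str(String),
    /// `lhs -> rhs` read as JSON access.
    JsonAccess(Box<Expr>, Box<Expr>),
    /// `lhs or rhs`
    Or(Box<Expr>, Box<Expr>),
    /// `param -> body` read as a lambda.
    Lambda(String, Box<Expr>),
}

fn ident(s: &str) -> Expr {
    Expr::Ident(s.to_string())
}

fn string(s: &str) -> Expr {
    Expr::Str(s.to_string())
}

/// Reading 1: `(x -> 'y') or z`, where `->` is JSON access.
fn as_json_access() -> Expr {
    Expr::Or(
        Box::new(Expr::JsonAccess(Box::new(ident("x")), Box::new(string("y")))),
        Box::new(ident("z")),
    )
}

/// Reading 2: `x -> ('y' or z)`, where `->` introduces a lambda.
fn as_lambda() -> Expr {
    Expr::Lambda(
        "x".to_string(),
        Box::new(Expr::Or(Box::new(string("y")), Box::new(ident("z")))),
    )
}
```

The same token sequence yields an `Or` node at the root in one reading and a `Lambda` node in the other, so the parser cannot decide without extra context.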

We also proposed the `=>` symbol, which is used by JavaScript and is also widely accepted by programmers. Unfortunately, it still conflicts with a PostgreSQL syntax, named notation.

-- Lambda function
SELECT f(ARRAY[1,2,3], x => x * 2);
-- Named notation
SELECT abs(a => -1);
-- ???
SELECT f(apply => (x => x * 2), array => ARRAY[1,2,3]);

There are definitely many compiler techniques that can solve the problem of ambiguity in the above syntax. After all, it is much simpler compared to C++ syntax. However, as an experimental feature, we do not want to introduce very complex refactors into the parser for it. We have decided to look for a syntax that is not ambiguous and relatively simple to support this feature. The final choice is a Rust-like syntax |x| x * 2.

In the current PostgreSQL 15, no built-in operator uses `|` as a prefix operator. It may conflict with custom operators, but we do not currently support that feature. Even if we support it later, it is still acceptable to forbid `|` as a unary operator.
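Because a leading `|` cannot start any other expression, recognizing a lambda is straightforward. The following is a minimal string-level sketch under that assumption; `split_lambda` is a hypothetical helper, and a real parser would work on tokens and parse the body as a full expression.

```rust
/// Hypothetical helper: split `|a, b| body` into parameter names and body text.
/// Returns `None` when the input does not start with `|` or lacks a closing `|`.
fn split_lambda(src: &str) -> Option<(Vec<String>, String)> {
    // A lambda must begin with the opening pipe.
    let rest = src.trim_start().strip_prefix('|')?;
    // Everything up to the next pipe is the parameter list.
    let (params, body) = rest.split_once('|')?;
    let params: Vec<String> = params
        .split(',')
        .map(|p| p.trim().to_string())
        .filter(|p| !p.is_empty())
        .collect();
    Some((params, body.trim().to_string()))
}
```

For example, `split_lambda("|x| x * 2")` yields the parameter list `["x"]` and the body `x * 2`, while an arrow-style input such as `x -> x * 2` is rejected.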

Of course, another important reason is that we like Rust.

Choosing the Rust-like syntax for the MVP does not mean we will never change it. Because the cost of implementing it is very low, even if we eventually choose another syntax, we can simply deprecate the Rust-like one and keep it compatible indefinitely without much effort.

If PostgreSQL eventually supports lambda functions with another syntax, or our users really prefer the `->` symbol as used by DuckDB or PrestoDB, we can support that (with a huge refactor of the parser module).

The `reduce` performance issue

The RFC also proposed a reduce function to create an aggregate function or a scalar function which aggregates an array. Unfortunately, the implementation is heavy since we can't pass a column-based chunk to the lambda function.

-- Calculate the sum of an array
select array_reduce(ARRAY[1,2,3], (x, y) => x + y);

The lambda arg will eventually be converted to a BoxedExpression in the backend. However, we cannot invoke the eval(DataChunk) method and can only choose the inefficient eval_row(OwnedRow) method instead. The accumulation logic is very unfriendly to chunk-based evaluation.

There are many overheads when evaluating one row at a time:

- The chained dynamic dispatch through Box<dyn Expression>.
- Boxing and unboxing of OwnedRow and Datum.

The overhead is very heavy, and the only solution is to support query compilation for expressions, which is equivalent to fully rewriting the expression framework.
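The difference between the two evaluation styles can be sketched as follows. The `Expression` trait here is a simplified, hypothetical stand-in with made-up method signatures over `i64` values; it is not the real backend trait.

```rust
/// Hypothetical stand-in for a boxed expression; not the actual backend trait.
trait Expression {
    /// Evaluate a single row of input values.
    fn eval_row(&self, row: &[i64]) -> i64;

    /// Evaluate many rows behind one dynamic call.
    fn eval_chunk(&self, chunk: &[Vec<i64>]) -> Vec<i64> {
        chunk.iter().map(|row| self.eval_row(row)).collect()
    }
}

struct Add;
impl Expression for Add {
    fn eval_row(&self, row: &[i64]) -> i64 {
        row[0] + row[1]
    }
}

struct Double;
impl Expression for Double {
    fn eval_row(&self, row: &[i64]) -> i64 {
        row[0] * 2
    }
}

/// reduce: every step needs the previous accumulator, so the elements
/// cannot be batched and each one costs a dynamic call.
fn array_reduce(arr: &[i64], init: i64, f: &dyn Expression) -> i64 {
    let mut acc = init;
    for &v in arr {
        acc = f.eval_row(&[acc, v]);
    }
    acc
}

/// transform: elements are independent, so the whole array can be
/// handed over in one dynamic call.
fn array_transform(arr: &[i64], f: &dyn Expression) -> Vec<i64> {
    let chunk: Vec<Vec<i64>> = arr.iter().map(|&v| vec![v]).collect();
    f.eval_chunk(&chunk)
}
```

In this sketch `array_reduce` pays the dispatch cost once per element, whereas `array_transform` pays it once per array.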

In conclusion, even if we supported array_reduce, we would still need to implement array_{sum|min|max|etc} to optimize performance. So we decided to implement the utility functions first instead of introducing array_reduce.

The bridge between aggregate functions and array scalar functions

The logic is almost identical for the scalar function array_sum and the aggregate function sum. Should we call the sum batch aggregator directly when implementing array_sum?

My personal opinion is to first try to implement array_sum manually, observe the commonalities, and then draw conclusions.

wangrunji0408 · 2023-09-01T04:23:01Z

It is possible to introduce another macro to generate array scalar functions from aggregate definitions. It may look like:

#[aggregate("sum(int64) -> int64")]                // original aggregate
#[array_function("array_sum(int64[]) -> int64")]   // the new macro
fn sum(state: i64, input: i64) -> i64 {
    state + input
}

This approach requires no extra code for each function; instead, it puts all the complexity into the macro.
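Conceptually, such a macro would wrap one step function in two entry points. The macro-free sketch below shows the idea; the wrapper names are hypothetical, and NULL handling (for example, the result for an empty input) is left out.

```rust
/// The per-value step shared by both entry points (mirrors `fn sum` above).
fn sum_step(state: i64, input: i64) -> i64 {
    state + input
}

/// Hypothetical aggregate wrapper: folds the values of a column.
fn sum_aggregate(column: impl IntoIterator<Item = i64>) -> i64 {
    column.into_iter().fold(0, sum_step)
}

/// Hypothetical array-function wrapper: folds the elements of one array.
fn array_sum_from_step(array: &[i64]) -> i64 {
    array.iter().copied().fold(0, sum_step)
}
```

Both wrappers differ only in where the values come from, which is the commonality the macro would exploit.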

By contrast, manually implementing each array_* function is a bit tedious but trivial:

#[function("array_sum(int64[]) -> int64")]
fn array_sum(array: ListRef<'_>) -> i64 {
    array.as_int64().iter().sum()
}

So I would also prefer manually implementing each array function as the first step.

xiangjinwu · 2024-01-31T06:26:18Z

Just ran into this:
https://www.postgresql.org/docs/current/intarray.html

st1page added 4 commits August 1, 2023 20:19

add Background

15abb8a

investigation

ebd4ed5

fin

bb05310

rename

db68201

xiangjinwu reviewed Aug 2, 2023

st1page mentioned this pull request Aug 2, 2023

Shall we implement higher order function map (Presto Transform)? risingwavelabs/risingwave#11123

Closed

TennyZhuang reviewed Aug 3, 2023

Update rfcs/0069-lambda-expression.md

0084b77

Co-authored-by: TennyZhuang <zty0826@gmail.com>

CAJan93 reviewed Aug 11, 2023

fix typo

0c39fa0

Co-authored-by: CAJan93 <jan@singularity-data.com>

TennyZhuang reviewed Aug 14, 2023

fuyufjh changed the title ~~WIP RFC: lambda expression~~ RFC: lambda expression Aug 22, 2023

TennyZhuang mentioned this pull request Aug 22, 2023

Tracking: Lambda function risingwavelabs/risingwave#11837

Closed

3 tasks

TennyZhuang mentioned this pull request Sep 1, 2023

Implement array_{min/max/sum/sort} risingwavelabs/risingwave#11996

Closed
