Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(udf): support embedded python udf #15168

Merged
merged 6 commits into from
Feb 28, 2024
Merged

Conversation

wangrunji0408
Copy link
Contributor

@wangrunji0408 wangrunji0408 commented Feb 21, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

This PR allows user to define Python UDFs that will be run in an embedded Python interpreter in RisingWave.

# scalar function
dev=> create function gcd(a int, b int) returns int language python as $$
def gcd(a, b):
    while b != 0:
        a, b = b, a % b
    return a
$$;
CREATE_FUNCTION
dev=> select gcd(15, 25);
 gcd 
-----
   5
(1 row)

# table function
dev=> create function series(n int) returns table (x int) language python as $$
def series(n):
    for i in range(n):
        yield i
$$;
CREATE_FUNCTION
dev=> select series(5);
 series 
--------
      0
      1
      2
      3
      4
(5 rows)

This is very similar to #14513. We use pyo3 to embed a Python library into RisingWave. The minimum supported Python version is 3.12. Because we require sub-interpreters and per-interpreter GIL to run different UDFs in parallel.

For safety reasons, functions are limited to pure computational logic. We only allow importing packages in a whitelist and have removed some builtin functions such as exit and open. The goal is to create a sandbox for untrusted code. However, it is not absolute safe, at least for now. It seems that Python interpreter doesn't provide a way to limit CPU or memory usage. The system can be easily blocked by an infinite loop in UDF. Until this problem is resolved, it is be better to only open this feature to admin users.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

Support embedded Python UDFs. Functions will run inside RisingWave and don't require external UDF servers.

Documentations will be updated later.

Signed-off-by: Runji Wang <wangrunji0408@163.com>
@wangrunji0408 wangrunji0408 changed the title feat(udf): support inline python udf feat(udf): support inlined python udf Feb 21, 2024
@fuyufjh
Copy link
Member

fuyufjh commented Feb 21, 2024

Would this Python interpreter be able to execute any command like os.system(command)?

@wangrunji0408
Copy link
Contributor Author

Would this Python interpreter be able to execute any command like os.system(command)?

No. We only allow importing packages in a whitelist and have removed dangerous functions such as exit and open, before running untrusted code from user.

https://github.com/risingwavelabs/arrow-udf/blob/273c8c50ddca10a8988691e8d95374f3af371548/arrow-udf-python/src/lib.rs#L58

@wangrunji0408 wangrunji0408 marked this pull request as ready for review February 22, 2024 08:16
@wangrunji0408 wangrunji0408 requested a review from a team as a code owner February 22, 2024 08:16
@BugenZhao
Copy link
Member

Our embedded UDF is getting more and more exciting. 🤩

BTW, may I ask about the status of Python WASM ecosystem? I suppose it's not that functional so that we chose to run it on a Python interpreter and handle the isolation and security ourselves.

@wangrunji0408
Copy link
Contributor Author

wangrunji0408 commented Feb 23, 2024

BTW, may I ask about the status of Python WASM ecosystem? I suppose it's not that functional so that we chose to run it on a Python interpreter and handle the isolation and security ourselves.

We have been able to run Python UDFs on WASM using a prebuilt Python WASM image provided by vmware labs. The ecosystem looks fine and WASM does provide better isolation. However, my biggest concern is performance. According to our benchmark result, Python is already very slow (100x slower than native code), running it on WASM makes it 10x slower. I think the performance impact is far more than security benefits, making it less useful in practice. So I don't choose WASM for Python UDF. 😢

Signed-off-by: Runji Wang <wangrunji0408@163.com>
@wangrunji0408 wangrunji0408 force-pushed the wrj/inline-python-udf branch 8 times, most recently from b728aa5 to d98f7ac Compare February 23, 2024 09:09
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Copy link
Member

@fuyufjh fuyufjh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ci/Dockerfile Show resolved Hide resolved
e2e_test/udf/python_udf.slt Outdated Show resolved Hide resolved
Signed-off-by: Runji Wang <wangrunji0408@163.com>
Signed-off-by: Runji Wang <wangrunji0408@163.com>
@wangrunji0408 wangrunji0408 added this pull request to the merge queue Feb 27, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 27, 2024
@wangrunji0408 wangrunji0408 added this pull request to the merge queue Feb 28, 2024
Merged via the queue into main with commit 4b5e4b3 Feb 28, 2024
28 of 29 checks passed
@wangrunji0408 wangrunji0408 deleted the wrj/inline-python-udf branch February 28, 2024 02:57
@wangrunji0408 wangrunji0408 changed the title feat(udf): support inlined python udf feat(udf): support embedded python udf Feb 28, 2024
@neverchanje neverchanje added the user-facing-changes Contains changes that are visible to users label Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/feature user-facing-changes Contains changes that are visible to users
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants