Replace pybind11 #6395

Closed
PGZXB opened this issue Oct 20, 2022 · 8 comments

@PGZXB
Contributor

PGZXB commented Oct 20, 2022

(Related issue: #4830)

Motivation

pybind11 adds high per-call overhead (see ailzhang/c_ext_for_py), so we need a more efficient way to export the Taichi core APIs to Python.

Preliminary Solution

  • Replace pybind11 with ctypes or the CPython C API (or others) for the HOT Taichi core APIs first, e.g., make_const_expr_int, expr_*, and so on.
  • THINKING...

TODO

THINKING...

Performance

...

Appendix

API call counts collected while running the examples and tests (the charts are very ugly; ignore get_max_num_indices and pop_python_print_buffer):
Source code: PGZXB:dev-profile-ticore-APIs
[image: API call count charts]
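
A minimal sketch of how such per-API call counts could be gathered, assuming the binding functions are reachable as attributes of a single module-like object. `fake_core`, `count_calls`, and `expr_add` are hypothetical stand-ins for illustration; this is not the code in the linked branch.

```python
import functools
import types
from collections import Counter

call_counts = Counter()


def count_calls(namespace, counts):
    """Wrap every public callable in `namespace` so each call is tallied in `counts`."""

    def make_wrapper(func, key):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            counts[key] += 1
            return func(*args, **kwargs)

        return wrapper

    for name in dir(namespace):
        attr = getattr(namespace, name)
        if name.startswith("_") or not callable(attr):
            continue
        setattr(namespace, name, make_wrapper(attr, name))


# Hypothetical stand-in for the native binding module.
fake_core = types.SimpleNamespace(
    make_const_expr_int=lambda value: ("const", value),
    expr_add=lambda a, b: ("add", a, b),
)
count_calls(fake_core, call_counts)

# A tiny workload that exercises the wrapped functions.
a = fake_core.make_const_expr_int(1)
b = fake_core.make_const_expr_int(2)
fake_core.expr_add(a, b)

for name, n in call_counts.most_common():
    print(f"{name}: {n}")
```

Sorting by count makes the hot APIs stand out, which is the information needed to decide what to migrate first.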

@k-ye
Member

k-ye commented Oct 20, 2022

FYI, is it possible to quickly try nanobind as suggested in #4830 (comment)?

@PGZXB
Contributor Author

PGZXB commented Oct 20, 2022

> FYI, is it possible to quickly try nanobind as suggested in #4830 (comment)?

Thanks. @k-ye

> Python 3.8+: nanobind heavily relies on PEP 590 vector calls that were introduced in version 3.8.

But nanobind only supports Python 3.8+?

I'm also wondering whether we should standardize the APIs used to build new frontends, which would require a stable C API. If we had a C API, we could bind it to Python using ctypes. Of course, with a C API we can't conveniently export a C++ class as a Python class.
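
A minimal sketch of what binding a plain C function through ctypes looks like. No Taichi C API library is assumed here: `ctypes.pythonapi` (the CPython C API, which is always loadable) is used purely as a stand-in library, and `make_const_int` is a hypothetical wrapper name.

```python
import ctypes

# Stand-in for a real shared library such as one loaded with ctypes.CDLL(path).
lib = ctypes.pythonapi

# Declare the C signature once: PyObject *PyLong_FromLong(long).
_from_long = lib.PyLong_FromLong
_from_long.argtypes = [ctypes.c_long]
_from_long.restype = ctypes.py_object


def make_const_int(value):
    """Thin Python wrapper over the C function (hypothetical name)."""
    return _from_long(value)


result = make_const_int(42)
print(result)
```

With a real C API the functions would typically take and return opaque handles (for example `ctypes.c_void_p`), which is why exporting a C++ class as a Python class is less convenient this way.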

@PGZXB PGZXB added the discussion Welcome discussion! label Oct 20, 2022
@PGZXB PGZXB self-assigned this Oct 20, 2022
@bobcao3
Collaborator

bobcao3 commented Oct 21, 2022

https://github.com/bobcao3/taichi/blob/dart-native/c_api/include/taichi/frontend_ir.h

Here is a half-finished attempt to build a Taichi compiler C API for binding to Dart (and because it's a C API, it should be usable from anywhere).

@PGZXB
Contributor Author

PGZXB commented Oct 21, 2022

> https://github.com/bobcao3/taichi/blob/dart-native/c_api/include/taichi/frontend_ir.h
>
> Here was a half-made attempt to build a Taichi compiler C-API used to bind to Dart (and because it's CAPI it should be able to go everywhere)

Awesome work! BTW, I want to bind Taichi to a programming language I have been developing in my spare time, but we don't have standard, stable APIs for building the Taichi AST 😂.

@bobcao3
Collaborator

bobcao3 commented Oct 21, 2022

I think a critical part of the experience is reducing launch overhead. The API surface for launching kernels is quite a bit smaller than the AST APIs. Starting from there could be easier?

@PGZXB
Contributor Author

PGZXB commented Oct 21, 2022

Thanks for your suggestion!

> I think a critical part of experience is to reduce launch overhead.

Agree.

> The API surface for launching kernels is quite a bit smaller than the AST APIs. Starting from there could be easier?

Yes, starting from a small part of the APIs is easier.

> As a result I'd view this issue more to identify hotspots in py->c interaction and migrate them to cpython/ctypes step by step in a measurable way. We can probably employ cpython/ctypes in the critical parts for perf gain and keep some components in pybind11 to enjoy the ready-to-use C++ features. -- #4830 (comment)

My preliminary thinking is similar to @ailzhang's.
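
One possible way to identify such hotspots in a measurable way is Python's built-in `cProfile`. The sketch below profiles a hypothetical workload; `hot_call` and `workload` are stand-ins for binding calls and a Taichi program, not real Taichi functions.

```python
import cProfile
import io
import pstats


def hot_call(x):
    """Hypothetical stand-in for a frequently called binding function."""
    return x + 1


def workload():
    total = 0
    for _ in range(10_000):
        total = hot_call(total)
    return total


profiler = cProfile.Profile()
profiler.runcall(workload)

# Render the five most expensive entries by cumulative time into a string.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print(report)
```

Functions implemented in a native extension appear in the report as built-in calls, so their call counts and cumulative time can be compared before and after a migration.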

@PGZXB
Contributor Author

PGZXB commented Oct 21, 2022

I extended @ailzhang's c_ext_for_py to test nanobind (source code: PGZXB:c_ext_for_py).

The result is....🤔

pybind took 1.4078617095947266e-06s
ctypes took 6.830692291259766e-07s
cpython took 4.100799560546875e-07s
nanobind took 6.326436996459961e-06s

P.S. Test env: macOS, M1
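
For easier comparison, the timings above can be normalized against the fastest entry. This short sketch only reuses the four numbers reported above; the dictionary layout is an illustrative choice.

```python
# Per-call timings in seconds, copied from the results above.
timings_s = {
    "pybind": 1.4078617095947266e-06,
    "ctypes": 6.830692291259766e-07,
    "cpython": 4.100799560546875e-07,
    "nanobind": 6.326436996459961e-06,
}

baseline = timings_s["cpython"]  # fastest entry in this run
relative = {name: t / baseline for name, t in timings_s.items()}

for name, ratio in sorted(relative.items(), key=lambda kv: kv[1]):
    print(f"{name:9s} {timings_s[name] * 1e6:6.3f} us  {ratio:5.2f}x")
```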

@yuanming-hu
Member

yuanming-hu commented Oct 24, 2022

Note on the benchmark data: don't just test functions that take a simple std::vector<int> as input :-)

The JIT AST construction overhead analysis above looks good.

IIRC the pybind11 overhead mainly comes from RTTI (isinstance in pybind11 can take ~50 us). Such overhead may come from constructing the AST when JITing and from launching (testing argument types and casting to the Taichi kernel argument list). I don't remember whether the AST construction part involves smart pointers; if it does, we'd better test the libs against these cases with C++ types. (For ctypes/cython we can just use raw pointers, which will likely be faster.)

I guess using a simple C API can significantly reduce the overhead already since the calling mechanism becomes much easier compared to C++.

A while (2 years?) ago I wrote a simple test script:

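import time

import taichi as ti

ti.init()

@ti.kernel
def compute_div(a: ti.i32):
    pass

compute_div(0)  # first call triggers JIT compilation, so keep it out of the timed loop
print("starting...")
t = time.time()
for i in range(100000):
    compute_div(0)
# total seconds over 1e5 launches, multiplied by 10, gives microseconds per launch
print((time.time() - t) * 10, 'us')
exit(0)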

On my end (M1 Mac) such a kernel launch takes 8e-6 s, more than 5x the pybind11 call overhead in @PGZXB's result above. It is also worth looking into what else contributes to the 8e-6 s launch overhead.
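
A minimal sketch of a reusable per-call timing helper built on the standard `timeit` module, which repeats the measurement and keeps the best run to reduce noise. `per_call_us` and `noop` are hypothetical names; in practice the callable would be something like `lambda: compute_div(0)` after a warm-up call.

```python
import timeit


def per_call_us(func, number=100_000, repeat=5):
    """Return the best-of-`repeat` average time per call of `func`, in microseconds."""
    best_total_s = min(timeit.repeat(func, number=number, repeat=repeat))
    return best_total_s / number * 1e6


def noop():
    """Stand-in for the function under test."""
    pass


overhead_us = per_call_us(noop)
print(f"{overhead_us:.3f} us per call")
```

Timing an empty Python function this way gives a floor for the interpreter's own call cost, which helps separate binding overhead from everything else in the launch path.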

@PGZXB PGZXB mentioned this issue Dec 28, 2022
36 tasks
@PGZXB PGZXB changed the title Replace pybind11 (Tracer) Replace pybind11 Apr 25, 2023
@PGZXB PGZXB closed this as completed Sep 14, 2023