ENH: Allow JIT compilation with an internal API #61032

datapythonista · 2025-03-02T17:28:30Z

closes ENH: Add support for executing UDF's using Bodo as the engine #60668, supersedes [WIP] df.apply: add support for engine='bodo' #60622

I've been exploring what @MarcoGorelli proposed in #60622 (review) regarding allowing JIT compilers to implement their own apply engine. I think it is a much better approach than the current implementation. I've been testing different APIs, both the user facing one, and the internal (what Numba and Bodo have to implement that we will call), and this PR is what I think makes more sense...

The approach here is to use a jit parameter for any function that could make sense to JIT in pandas (DataFrame.apply, Series.map, SeriesGroupBy.transform...) that delegates to the JIT compiler (Numba or Bodo) 100% of the logic. So far I just implement DataFrame.apply for simplicity, but happy to add the rest once there is agreement.

Final user API would look like next examples:

df.apply(lambda x: x.A + x.B, axis=1, jit=bodo.njit)
df.apply(lambda x: x.A + x.B, axis=1, jit=bodo.jit(parallel=True))

Which I think it's very simple and intuitive, and at the same time it makes users import numba and bodo themselves, creating the right impression that they are using those libraries to JIT compile, and it's not something provided by pandas. Also, it makes users install the library themselves, which I think makes things easier (any installation or importing error is not reported to pandas, no need for a soft dependency, no need to even add them to our CI if we don't want).

I think this approach is very convenient for us, as maintaining the code in pandas is trivial. Which I think it should address the concerns @jbrockmendel expressed, and I think many of us share. It should also be very convenient for Numba and Bodo, which should be able to fix any issue much faster, as well as performance improvements, and support for new use cases . Numba and Bodo can probably release faster than us, so any needed work in the JIT functionality shouldn't consume pandas resources, make Numba and Bodo life easier, and end up in the users faster.

The exact internal API (the __pandas_udf__ function in this PR) can probably be improved. To make it looks simple and reasonable, but happy to know other points of view.

As said in the code, the bodo_patched.py file is only in the PR so there is more context on the API that we will call. But the code will be properly implemented in Numba and Bodo, and removed from this PR before it's merged.

@ehsantn @scott-routledge2 @pandas-dev/pandas-core

rhshadrach · 2025-03-02T18:35:00Z

This looks great to me!

The exact internal API (the __pandas_udf__ function in this PR) can probably be improved. To make it looks simple and reasonable, but happy to know other points of view.

I think it would be good to include result_dtype (can be None for current behavior), but that may be difficult with how flexible apply is. Offhand I would imagine it could be a single dtype (we'll raise if result has more than one column), list of dtypes (we'll raise if the number of columns is different from the length of the list), or a dictionary (we'll raise if key is missing). But this might require a bit more thought.

ehsantn · 2025-03-03T01:15:14Z

@DrTodd13 @sklam @stuartarchibald What do you think of implementing this interface in Numba?

sklam · 2025-03-03T14:41:10Z

@ehsantn, To clarify, do you mean that Bodo will be the main interface (given it has the support for pandas datastructure) and Numba may need to add things to properly support such use?

datapythonista · 2025-03-03T15:40:19Z

@sklam, the proposal here is that pandas provides a jit parameter to the functions that makes sense (DataFrame.apply, Series.map...). And if the user provides numba (the numba.jit decorator in the proposal, but can be discussed), then pandas will call numba with the parameters of the call (the dataframe, the function to be jitted, the axis...) and numba will be the one deciding how to execute that code and return the result to pandas.

Dr-Irv · 2025-03-03T17:15:43Z

@datapythonista wouldn't you need to add a test for this?

datapythonista · 2025-03-03T17:23:57Z

@datapythonista wouldn't you need to add a test for this?

Yes, thanks for pointing that out. I wanted to make the PR simple first, mainly in case there is feedback on the API between pandas and the JIT compilers. I think this first version of the PR should make reviewing the API simple. Once the API has been discussed and we are happy with it, I'll be adding the tests, a release note, and I'll review if there is anywhere else in the docs where this feature should be mentioned.

mroeschke · 2025-03-03T18:16:45Z

pandas still could cache the jit function under this protocol, correct? Currently pandas does that for numba to avoid overhead of jitting the same function more than once

jbrockmendel · 2025-03-03T19:20:17Z

pandas/core/frame.py

@@ -10345,6 +10346,15 @@ def apply(
            Pass keyword arguments to the engine.
            This is currently only used by the numba engine,
            see the documentation for the engine argument for more information.
+
+        jit : function, optional


what if the 3rd party implementation isn't a jit? e.g. it is just parallel?

Would you rename the parameter name to for example executor? I'm happy with that. I guess that would make naming more accurate if this is used for other use cases as running in parallel, which is possible with this interface. Do you have a specific use case in mind?

id just use 'engine', avoid having multiple similar keywords

Cool, that makes sense, thanka for the feedback. This is how it's implemented now in the PR, just using the existing parameter engine.

jbrockmendel · 2025-03-03T19:20:47Z

pandas/core/frame.py

+        significant amount of time to run. Fast functions are unlikely to run faster
+        with JIT compilation.
+        """
+        if hasattr(jit, "__pandas_udf__"):


what if jit is provided but doesnt have this attribute?

what if engine or enging_kwargs are passed?

Should error out I think.

The idea here is that the new parameter would make engine and engine_kwargs deprecated. While not removed, yes, I think we should raise an exception if both engine and jit/executor are provided.

Also, I agree, if the value of the parameter doesn't implement the interface, we should raise an exception. It should probably be quite specific, on what is expected and which versions of Numba or Bodo can be used. But I think it's easy to provide something that is clear to users if they pass something that doesn't implement this interface.

jbrockmendel · 2025-03-03T19:22:48Z

I think it would be good to include result_dtype [...]

Exactly what would be useful is going to differ by implementation, and may not be stable. This is a reason for the correct usage to be whatever.apply(func, obj, specific_keywords=...)

DrTodd13 · 2025-03-03T19:54:22Z

@sklam, the proposal here is that pandas provides a jit parameter to the functions that makes sense (DataFrame.apply, Series.map...). And if the user provides numba (the numba.jit decorator in the proposal, but can be discussed), then pandas will call numba with the parameters of the call (the dataframe, the function to be jitted, the axis...) and numba will be the one deciding how to execute that code and return the result to pandas.

I guess I have the same question as @sklam. When you say "parameters of the call" and list dataframe as an example then given that numba doesn't support dataframes, I question what you expect to happen. I can imagine a conversion layer in Python that could convert a dataframe to a set of numpy arrays and convert references to columns in the lambda to those individual arrays and then call numba to compile and run that code and convert the result back. (The rest of this message assumes this conversion layer is necessary. Please disregard if this is somehow already happening and I'm not aware of it.) The question is where would this conversion layer live. If in numba, wouldn't this require numba to take a pandas dependency? I'm pretty sure they wouldn't want to do that, partially for the same reason that pandas doesn't want to take a numba dependency but also because it would be considered out-of-scope for numba.

I think the issue of maintenance was hinted at in the first post. If this doesn't work for some reason, then I guess the pandas devs' hope would be for the user to complain directly to bodo or numba. Some may but I guarantee others won't. The pandas shim is pretty thin so it is unlikely an error would be there but it is theoretically possible. If there is a need for a conversion layer then there is a lot of room for error there in addition to the base numba compilation layers. It is sounding to me like neither numba or pandas is going to want to be responsible for this conversion layer. Perhaps a separate org or separate repo in the pandas org? If that separate package is listed in the pandas docs as a suggested option then you might get some uptake but still have to identify who is willing to maintain it.

datapythonista · 2025-03-05T16:11:54Z

pandas still could cache the jit function under this protocol, correct? Currently pandas does that for numba to avoid overhead of jitting the same function more than once

@mroeschke I think the way the interface is designed, and what makes more sense to me, is that are Numba and Bodo that cache their compiled functions. pandas will send the data, the function, and anything else needed to the execution backends, and will get the new data. For the Python/pandas backend there is nothing to cache, for Numba the function itself is jitted for what I've seen, and for Bodo they create a new function with the call to apply again, and that's what they jit. So, I think intuitively it could make sense that we take the jit decorator and the function in a generic way, and we cache it in pandas, but in practice I think it would make things very convoluted, compared to just let Numba and Bodo take care of it.

mroeschke · 2025-03-05T18:08:10Z

for Numba the function itself is jitted for what I've seen

and we create another jit function to apply the UDF over each column/row which is the jit function we cache i.e.

We jit the UDF passed from apply
We embed this jitted UDF in another jit function designed to apply this UDF over each row/column and cache this function

The specifics can be found in

pandas/pandas/core/_numba/executor.py

Line 21 in 56847c5

def generate_apply_looper(func, nopython=True, nogil=True, parallel=False):

(and as you can see, there are different flavors further down whether we're doing an apply or groupby.apply or rolling.apply)

So are you suggesting that Numba/Bodo cache the larger jitted function (2 in the list above) as well?

DrTodd13 · 2025-03-05T18:44:21Z

for Numba the function itself is jitted for what I've seen

and we create another jit function to apply the UDF over each column/row which is the jit function we cache i.e.

We jit the UDF passed from apply

We embed this jitted UDF in another jit function designed to apply this UDF over each row/column and cache this function

The specifics can be found in

pandas/pandas/core/_numba/executor.py

Line 21 in 56847c5

def generate_apply_looper(func, nopython=True, nogil=True, parallel=False):

(and as you can see, there are different flavors further down whether we're doing an apply or groupby.apply or rolling.apply)

So are you suggesting that Numba/Bodo cache the larger jitted function (2 in the list above) as well?

If you want the Numba/Bodo function to be cached, then you can just do functools.partial on the decorator you pass to jit/executor? However, note that if you cache an outer function that the inner function it calls is also included in that module even if it isn't given as cache=True. So, caching the inner one would help only in the case where the inner is the same but the outer function is different for some reason.

datapythonista · 2025-03-09T10:53:41Z

Some updates here:

I changed the approach here and instead of trying to fully replace the existing approach with a new jit parameter, I'm implementing it as a hybrid approach of what we have now with Numba, and the new interface. So, the engine parameter accepts the strings python and numba, as well as the numba.jit and bodo.jit decorators when they implement the interface. Also, I leave for a follow up making the python and numba engines use the new interface.
Instead of passing the value of raw to the engines, now when raw=True pandas will convert the Series/DataFrame to a NumPy array before passing it to the engine. While for now the Numba engine will stay the same, this should help move it to the Numba codebase without making them depend on pandas.
I improved the API that Numba and Bodo will implement, which before was just a function accepting what DataFrame.apply accepts. Now it's a base class with simpler methods.

Feedback welcome

pandas/core/frame.py

datapythonista · 2025-03-10T08:32:14Z

Thanks all for the feedback, it was very helpful. This should be now ready, I added tests, a release note and CI should be green (other than the unrelated pyarrow failure).

In a follow up, we should move the python and numba engine to the new interface (in the pandas codebase), which I think should make the code much simpler. And then have the discussion on moving the numba engine to the numba codebase, which I think it should be quite straightforward at that point.

Also, seems like blosc2 implemented a jit for numpy operations, that can probably also benefit from this new interface: https://www.blosc.org/python-blosc2/reference/autofiles/utilities/blosc2.jit.html

pandas/core/apply.py

pandas/core/frame.py

DrTodd13 · 2025-03-10T16:10:12Z

pandas/core/frame.py

+                # and likely result in an object 2D array.
+                # We should probably pass a list of 1D arrays instead, at
+                # lest for ``axis=0``
+                data = self.values


Numba doesn't support heterogeneous lists but does support heterogeneous tuples. I would say that it is so important to avoid 2D object arrays here that it should go in the first revision.

Also, how is the mapping from column name in "func" to index in the tuple supposed to happen?

Thanks for the comments, and for catching the typos.

Fully agree with your first comment. I don't want to change the existing behavior in the same PR I'm implementing a new interface, but converting the whole dataframe to a single type doesn't seem a great option. The raw parameter and this behavior is as old as pandas, I don't think we would design it this way as of today. It won't still be trivial, as passing Series is easy, but the internal data can be different things, in the simple case a single NumPy array, but also other structured such as an Arrow column, two NumPy arrays... I don't think it's trivial to decide how to pass the data in those cases.

Also, how is the mapping from column name in "func" to index in the tuple supposed to happen?

If I understand you correctly, this is great question, but also a bit tricky. For the simple case, the function is called for every column. In the case the same function needs to be applied to every column, I don't think there is an issue. If the function receives Series (raw=False), then this can be done:

def func(series): if series.name == "column_1": return series.str.upper() return series.str.lower()

But when raw=True, in general the input of the function will be a numpy array, and it's not known which was the name of the column or the index. Nothing in this PR is preventing the engines from passing extra parameters to the function, but I don't think we should encourage that behavior, since users will in general expect that changing the value of engine doesn't require changing the function, and in particular their signature.

This is when a column at a time is passed. When axis=1, a row at a time is passed. When raw=False, then a Series is passed, which again is not ideal, since an upcast to probably object will happen. But this is how it works now. For raw=True and Numpy objects, the case it's exactly the same as before, just the axis changes.

I think historical reasons made the signatures or .map() and .apply() functions inconsistent, and in some cases too complex and with strange behaviors. There have been several improvements, like removing applymap for example. But several other things can be improved. I personally think the internal API here makes things simpler, and public APIs should follow the same idea. But this will take several iterations, and I think we don't want to make this PR too complex with any of the many possible improvements. I don't think you are proposing it, I really appreciate your feedback and question. Just commenting how I think we should move forward, given the amount of technical debt and complexity.

FrancescAlted · 2025-03-10T18:31:59Z

Also, seems like blosc2 implemented a jit for numpy operations, that can probably also benefit from this new interface: https://www.blosc.org/python-blosc2/reference/autofiles/utilities/blosc2.jit.html

Thanks for the intro, @datapythonista. FWIW, I did a quick micro-benchmark comparing speeds of blosc2 vs numba vs numpy and run it on my desktop machine.

One of the advantages of the new compute engine in blosc2 is the reduced compile time (at least when compared to numba; see first point in plot above). Then, thanks to its CPU cache-sensible blocking, it also scales quite well when operands are large. The operators and functions that are accelerated are essentially those in numexpr (+, -, *, **, /, sin, cos...), plus reductions (sum, prod, mean, std, var, min, max, all, any); others not in that subset should fallback and executed via plain numpy. Last, but not least, dtype inference mimics more numpy (although not perfectly yet) than the numexpr one.

In case you find something that you may need to use blosc2.jit from pandas, tell us and we will try to help.

pandas/core/frame.py

datapythonista · 2025-03-12T08:32:19Z

@DrTodd13 when you have the chance, can you check if your concerns for this PR are addressed (we surely need few follow ups to leave all this perfect). And if it's the case, remove the "requested chanches" tick.

@pandas-dev/pandas-core please let me know if there is any additional feedback worth discussing before merging this, and starting work on the follow up PRs mentioned above. Otherwise I'll be merging this soon.

datapythonista · 2025-03-14T15:04:54Z

Thanks all for the reviews and feedback. Merging this so we can continue with the follow ups. If there is any further comment, please let me know, happy to discuss and address in a follow up.

datapythonista added 2 commits March 2, 2025 17:43

[PoC] Allow JIT compilation with an internal API

7ec827e

Improving the documentation

bc2a178

datapythonista added the Apply Apply, Aggregate, Transform, Map label Mar 2, 2025

CI

8b420cc

jbrockmendel reviewed Mar 3, 2025

View reviewed changes

Better execution engine API

6a9ee5a

Fixing test

444de67

Dr-Irv reviewed Mar 9, 2025

View reviewed changes

pandas/core/frame.py Show resolved Hide resolved

datapythonista added 2 commits March 10, 2025 14:00

Added tests, fixed some bugs and added a release note

7e1e855

Removed temporary bodo decorator example

58fb30d

datapythonista changed the title ~~[WIP] Allow JIT compilation with an internal API~~ Allow JIT compilation with an internal API Mar 10, 2025

make mypy happy

c239fc9

DrTodd13 suggested changes Mar 10, 2025

View reviewed changes

Typos

9567152

Merge remote-tracking branch 'upstream/main' into bodo_frame_apply

2ff333f

datapythonista changed the title ~~Allow JIT compilation with an internal API~~ ENH: Allow JIT compilation with an internal API Mar 11, 2025

mroeschke reviewed Mar 11, 2025

View reviewed changes

pandas/core/frame.py Show resolved Hide resolved

DrTodd13 approved these changes Mar 12, 2025

View reviewed changes

datapythonista merged commit 8943c97 into pandas-dev:main Mar 14, 2025
42 checks passed

This was referenced Mar 15, 2025

ENH: Supporting third-party engines for all map and apply methods #61125

Open

DOC: Write user guide page on apply/map/transform methods #61126

Closed

[WIP] df.apply: add support for engine='bodo' #60622

Closed

simonjayhawkins added this to the 3.0 milestone Apr 14, 2025

simonjayhawkins added the Enhancement label Apr 14, 2025

scott-routledge2 mentioned this pull request Apr 16, 2025

[BSE-4737] Implement interface for using bodo as engine in Pandas UDFs. bodo-ai/Bodo#410

Merged

3 tasks

datapythonista mentioned this pull request May 19, 2025

Use BaseExecutionEngine for Python and Numba engines #61458

Open

scott-routledge2 mentioned this pull request Jun 20, 2025

Add Bodo Recipe conda-forge/staged-recipes#30214

Merged

10 tasks

Uh oh!

ENH: Allow JIT compilation with an internal API #61032

ENH: Allow JIT compilation with an internal API #61032

Uh oh!

Conversation

datapythonista commented Mar 2, 2025

Uh oh!

rhshadrach commented Mar 2, 2025

Uh oh!

ehsantn commented Mar 3, 2025

Uh oh!

sklam commented Mar 3, 2025

Uh oh!

datapythonista commented Mar 3, 2025

Uh oh!

Dr-Irv commented Mar 3, 2025

Uh oh!

datapythonista commented Mar 3, 2025

Uh oh!

mroeschke commented Mar 3, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jbrockmendel commented Mar 3, 2025

Uh oh!

DrTodd13 commented Mar 3, 2025

Uh oh!

datapythonista commented Mar 5, 2025

Uh oh!

mroeschke commented Mar 5, 2025

Uh oh!

DrTodd13 commented Mar 5, 2025

Uh oh!

datapythonista commented Mar 9, 2025

Uh oh!

Uh oh!

datapythonista commented Mar 10, 2025

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

FrancescAlted commented Mar 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

datapythonista commented Mar 12, 2025

Uh oh!

Uh oh!

datapythonista commented Mar 14, 2025

Uh oh!

Uh oh!

FrancescAlted commented Mar 10, 2025 •

edited

Loading