
[WIP] Allow JIT compilation with an internal API #61032

Open · wants to merge 3 commits into main
Conversation

datapythonista
Member

I've been exploring what @MarcoGorelli proposed in #60622 (review) regarding allowing JIT compilers to implement their own apply engine. I think it is a much better approach than the current implementation. I've been testing different APIs, both the user-facing one and the internal one (what Numba and Bodo have to implement and what we will call), and this PR is what I think makes the most sense...

The approach here is to add a jit parameter to any function where JIT compilation could make sense in pandas (DataFrame.apply, Series.map, SeriesGroupBy.transform...) and to delegate 100% of the logic to the JIT compiler (Numba or Bodo). So far I have only implemented DataFrame.apply for simplicity, but I'm happy to add the rest once there is agreement.

The final user API would look like the following examples:

df.apply(lambda x: x.A + x.B, axis=1, jit=bodo.njit)
df.apply(lambda x: x.A + x.B, axis=1, jit=bodo.jit(parallel=True))

Which I think is very simple and intuitive. At the same time it makes users import numba or bodo themselves, creating the right impression that they are using those libraries to JIT compile and that this is not something provided by pandas. It also makes users install the library themselves, which I think makes things easier (any installation or import error is not reported to pandas, there is no need for a soft dependency, and no need to even add them to our CI if we don't want to).

I think this approach is very convenient for us, as maintaining the code in pandas is trivial. This should address the concerns @jbrockmendel expressed, which I think many of us share. It should also be very convenient for Numba and Bodo, which should be able to ship bug fixes, performance improvements, and support for new use cases much faster. Numba and Bodo can probably release faster than we can, so any needed work on the JIT functionality shouldn't consume pandas resources, should make Numba and Bodo's lives easier, and should reach users faster.

The exact internal API (the __pandas_udf__ function in this PR) can probably be improved. To me it looks simple and reasonable, but I'm happy to hear other points of view.
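As a rough illustration of the idea (the hook name __pandas_udf__ is from this PR, but the signature, the helper names, and the dispatch code below are my guesses, not the PR's exact API), the delegation could look something like this: pandas checks for the hook on whatever was passed as jit and hands the whole call over to it.

```python
import pandas as pd

class PlainPythonEngine:
    """Stand-in for a decorator like numba.jit or bodo.jit. A real
    backend would compile the UDF; this one just runs plain Python."""

    def __call__(self, func):
        return func  # a real JIT would return compiled code here

    def __pandas_udf__(self, obj, func, axis=0):
        # The backend receives the full call and decides how to
        # execute it, then returns the result to pandas.
        return obj.apply(self(func), axis=axis)

def apply_with_jit(df, func, axis=0, jit=None):
    # Sketch of the pandas-side dispatch; apply_with_jit is a
    # hypothetical stand-in for DataFrame.apply with a jit parameter.
    if jit is not None and hasattr(jit, "__pandas_udf__"):
        return jit.__pandas_udf__(df, func, axis=axis)
    return df.apply(func, axis=axis)

df = pd.DataFrame({"A": [1, 2], "B": [10, 20]})
result = apply_with_jit(df, lambda row: row.A + row.B, axis=1,
                        jit=PlainPythonEngine())
print(result.tolist())  # [11, 22]
```

The point of the shape above is that pandas only ever sees the hook; everything about compilation and execution lives behind it.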

As noted in the code, the bodo_patched.py file is only in this PR to give more context on the API that we will call. The code will be properly implemented in Numba and Bodo, and removed from this PR before it's merged.

@ehsantn @scott-routledge2 @pandas-dev/pandas-core

@datapythonista datapythonista added the Apply Apply, Aggregate, Transform, Map label Mar 2, 2025
@rhshadrach
Member

This looks great to me!

The exact internal API (the __pandas_udf__ function in this PR) can probably be improved. To me it looks simple and reasonable, but I'm happy to hear other points of view.

I think it would be good to include result_dtype (can be None for current behavior), but that may be difficult with how flexible apply is. Offhand I would imagine it could be a single dtype (we'll raise if result has more than one column), list of dtypes (we'll raise if the number of columns is different from the length of the list), or a dictionary (we'll raise if key is missing). But this might require a bit more thought.
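The three accepted shapes suggested above could be validated roughly like this. This is only a sketch of the suggestion, not part of the PR; validate_result_dtype and its exact behavior are hypothetical.

```python
import numpy as np
import pandas as pd

def validate_result_dtype(result: pd.DataFrame, result_dtype) -> None:
    """Hypothetical check of an apply result against result_dtype."""
    if result_dtype is None:
        return  # current behavior: no validation
    if isinstance(result_dtype, dict):
        # dictionary: raise if a key is missing or a dtype mismatches
        for col, dtype in result_dtype.items():
            if col not in result.columns:
                raise KeyError(f"result is missing column {col!r}")
            if result[col].dtype != np.dtype(dtype):
                raise TypeError(f"column {col!r} is {result[col].dtype}, expected {dtype}")
    elif isinstance(result_dtype, (list, tuple)):
        # list of dtypes: raise if the column count differs
        if len(result_dtype) != len(result.columns):
            raise ValueError("length of result_dtype does not match number of columns")
        for col, dtype in zip(result.columns, result_dtype):
            if result[col].dtype != np.dtype(dtype):
                raise TypeError(f"column {col!r} is {result[col].dtype}, expected {dtype}")
    else:
        # single dtype: raise if the result has more than one column
        if len(result.columns) != 1:
            raise ValueError("a single dtype requires a single-column result")
        if result.iloc[:, 0].dtype != np.dtype(result_dtype):
            raise TypeError("result has an unexpected dtype")

df = pd.DataFrame({"x": np.array([1, 2], dtype="int64")})
validate_result_dtype(df, {"x": "int64"})  # passes silently
```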

@ehsantn
Contributor

ehsantn commented Mar 3, 2025

@DrTodd13 @sklam @stuartarchibald What do you think of implementing this interface in Numba?

@sklam
Contributor

sklam commented Mar 3, 2025

@ehsantn, to clarify: do you mean that Bodo will be the main interface (given that it has support for pandas data structures), and that Numba may need to add things to properly support such use?

@datapythonista
Member Author

@sklam, the proposal here is that pandas provides a jit parameter in the functions where it makes sense (DataFrame.apply, Series.map...). If the user provides numba (the numba.jit decorator in the proposal, but this can be discussed), then pandas will call numba with the parameters of the call (the dataframe, the function to be jitted, the axis...), and numba will be the one deciding how to execute that code and return the result to pandas.

@Dr-Irv
Contributor

Dr-Irv commented Mar 3, 2025

@datapythonista wouldn't you need to add a test for this?

@datapythonista
Member Author

@datapythonista wouldn't you need to add a test for this?

Yes, thanks for pointing that out. I wanted to keep the PR simple at first, mainly in case there is feedback on the API between pandas and the JIT compilers; I think this first version of the PR should make reviewing that API simple. Once the API has been discussed and we are happy with it, I'll add the tests and a release note, and I'll review whether this feature should be mentioned anywhere else in the docs.

@mroeschke
Member

pandas could still cache the jitted function under this protocol, correct? Currently pandas does that for numba to avoid the overhead of jitting the same function more than once.

@@ -10345,6 +10346,15 @@ def apply(
Pass keyword arguments to the engine.
This is currently only used by the numba engine,
see the documentation for the engine argument for more information.

jit : function, optional
Member

what if the 3rd party implementation isn't a jit? e.g. it is just parallel?

Member Author

Would you rename the parameter to, for example, executor? I'm happy with that. I guess that would make the naming more accurate if this is used for other use cases, such as running in parallel, which is possible with this interface. Do you have a specific use case in mind?

significant amount of time to run. Fast functions are unlikely to run faster
with JIT compilation.
"""
if hasattr(jit, "__pandas_udf__"):
Member

what if jit is provided but doesnt have this attribute?

Member

what if engine or engine_kwargs are passed?


Should error out I think.

Member Author

The idea here is that the new parameter would make engine and engine_kwargs deprecated. While they are not yet removed, yes, I think we should raise an exception if both engine and jit/executor are provided.

Also, I agree: if the value of the parameter doesn't implement the interface, we should raise an exception. It should probably be quite specific about what is expected and which versions of Numba or Bodo can be used. But I think it's easy to provide a message that is clear to users when they pass something that doesn't implement this interface.
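The two checks described here could look roughly like the following. This is only a sketch; resolve_executor and the error messages are hypothetical, not code from the PR.

```python
def resolve_executor(jit=None, engine=None, engine_kwargs=None):
    """Hypothetical validation of the new parameter against the
    deprecated engine/engine_kwargs ones."""
    if jit is None:
        return None
    # raise if both the old and new parameters are provided
    if engine is not None or engine_kwargs is not None:
        raise ValueError(
            "cannot specify 'engine' or 'engine_kwargs' together with 'jit'"
        )
    # raise a specific error if the value doesn't implement the interface
    if not hasattr(jit, "__pandas_udf__"):
        raise TypeError(
            f"{type(jit).__name__!r} does not implement the __pandas_udf__ "
            "interface; a version of a supported JIT compiler (e.g. Numba "
            "or Bodo) that provides it is required"
        )
    return jit
```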

@jbrockmendel
Member

I think it would be good to include result_dtype [...]

Exactly what would be useful is going to differ by implementation, and may not be stable. This is a reason for the correct usage to be whatever.apply(func, obj, specific_keywords=...)

@DrTodd13

DrTodd13 commented Mar 3, 2025

@sklam, the proposal here is that pandas provides a jit parameter in the functions where it makes sense (DataFrame.apply, Series.map...). If the user provides numba (the numba.jit decorator in the proposal, but this can be discussed), then pandas will call numba with the parameters of the call (the dataframe, the function to be jitted, the axis...), and numba will be the one deciding how to execute that code and return the result to pandas.

I guess I have the same question as @sklam. When you say "parameters of the call" and list the dataframe as an example, then given that numba doesn't support dataframes, I question what you expect to happen. I can imagine a conversion layer in Python that converts a dataframe to a set of numpy arrays, converts references to columns in the lambda into those individual arrays, then calls numba to compile and run that code and converts the result back. (The rest of this message assumes this conversion layer is necessary; please disregard if this is somehow already happening and I'm not aware of it.) The question is where this conversion layer would live. If in numba, wouldn't this require numba to take a pandas dependency? I'm pretty sure they wouldn't want to do that, partially for the same reason that pandas doesn't want to take a numba dependency, but also because it would be considered out of scope for numba.
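To make the conversion layer being discussed concrete, a minimal sketch might look like this (apply_via_arrays is hypothetical; a real backend would jit-compile the loop with numba rather than run it in Python):

```python
import numpy as np
import pandas as pd

def apply_via_arrays(df, row_kernel):
    """Hypothetical conversion layer: decompose the DataFrame into plain
    numpy arrays (which numba can consume), run a per-row kernel over
    them, and rebuild a pandas result from the output array."""
    cols = [df[name].to_numpy() for name in df.columns]
    out = np.empty(len(df))
    for i in range(len(df)):  # a real backend would jit this loop
        out[i] = row_kernel(*(col[i] for col in cols))
    return pd.Series(out, index=df.index)

df = pd.DataFrame({"A": [1.0, 2.0], "B": [10.0, 20.0]})
print(apply_via_arrays(df, lambda a, b: a + b).tolist())  # [11.0, 22.0]
```

Whoever owns this layer also owns the mapping from column references in the UDF to positional array arguments, which is where much of the complexity (and room for error) would live.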

I think the issue of maintenance was hinted at in the first post. If this doesn't work for some reason, then I guess the pandas devs' hope would be for the user to complain directly to bodo or numba. Some may, but I guarantee others won't. The pandas shim is pretty thin, so it is unlikely an error would be there, though it is theoretically possible. If a conversion layer is needed, there is a lot of room for error there in addition to the base numba compilation layers. It sounds to me like neither numba nor pandas is going to want to be responsible for this conversion layer. Perhaps a separate org, or a separate repo in the pandas org? If that separate package is listed in the pandas docs as a suggested option then you might get some uptake, but you still have to identify who is willing to maintain it.

@datapythonista
Member Author

pandas still could cache the jit function under this protocol, correct? Currently pandas does that for numba to avoid overhead of jitting the same function more than once

@mroeschke I think the way the interface is designed, and what makes the most sense to me, is that it is Numba and Bodo that cache their compiled functions. pandas will send the data, the function, and anything else needed to the execution backends, and will get the new data back. For the Python/pandas backend there is nothing to cache; for Numba, the function itself is jitted from what I've seen; and for Bodo, they create a new function with the call to apply, and that's what they jit. So, while intuitively it could make sense for us to take the jit decorator and the function in a generic way and cache the result in pandas, in practice I think that would make things very convoluted compared to just letting Numba and Bodo take care of it.

@mroeschke
Member

for Numba the function itself is jitted for what I've seen

and we create another jit function that applies the UDF over each column/row, which is the jit function we cache, i.e.

  1. We jit the UDF passed from apply
  2. We embed this jitted UDF in another jit function designed to apply this UDF over each row/column and cache this function

The specifics can be found in

def generate_apply_looper(func, nopython=True, nogil=True, parallel=False):

(and as you can see, there are different flavors further down whether we're doing an apply or groupby.apply or rolling.apply)

So are you suggesting that Numba/Bodo cache the larger jitted function (2 in the list above) as well?

@DrTodd13

DrTodd13 commented Mar 5, 2025

for Numba the function itself is jitted for what I've seen

and we create another jit function that applies the UDF over each column/row, which is the jit function we cache, i.e.

  1. We jit the UDF passed from apply
  2. We embed this jitted UDF in another jit function designed to apply this UDF over each row/column and cache this function

The specifics can be found in

def generate_apply_looper(func, nopython=True, nogil=True, parallel=False):

(and as you can see, there are different flavors further down whether we're doing an apply or groupby.apply or rolling.apply)

So are you suggesting that Numba/Bodo cache the larger jitted function (2 in the list above) as well?

If you want the Numba/Bodo function to be cached, can't you just use functools.partial on the decorator you pass to jit/executor? However, note that if you cache an outer function, the inner function it calls is also included even if it isn't itself given cache=True. So, caching the inner one would only help in the case where the inner function is the same but the outer function differs for some reason.
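The caching scheme being discussed (cache the outer "looper" keyed on the UDF and the compilation options, so the same function is only compiled once) can be sketched like this. fake_jit and get_apply_looper are stand-ins I made up for illustration; the real pandas helper is generate_apply_looper and the real decorator is numba.jit.

```python
import functools

compilations = []  # record each "compilation" so the cache effect is visible

def fake_jit(func):
    """Stand-in for numba.jit: pretends to compile and records the call."""
    compilations.append(func)
    return func

@functools.lru_cache(maxsize=None)
def get_apply_looper(udf, parallel=False):
    """Return (and cache) the outer looper built around a jitted UDF,
    keyed on the UDF and the compilation options."""
    jitted = fake_jit(udf)

    def looper(rows):
        return [jitted(row) for row in rows]

    return looper

def add_one(x):
    return x + 1

looper_a = get_apply_looper(add_one)
looper_b = get_apply_looper(add_one)  # cache hit: nothing recompiled
print(looper_a is looper_b, len(compilations))  # True 1
print(looper_a([1, 2, 3]))  # [2, 3, 4]
```

Because the cache key includes the options, calling get_apply_looper(add_one, parallel=True) would trigger a second compilation, which matches the point above: caching helps only when both the inner UDF and the outer configuration repeat.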

Labels
Apply Apply, Aggregate, Transform, Map
Development

Successfully merging this pull request may close these issues.

ENH: Add support for executing UDF's using Bodo as the engine
8 participants