[WIP] Allow JIT compilation with an internal API #61032
base: main
Conversation
This looks great to me!
I think it would be good to include
@DrTodd13 @sklam @stuartarchibald What do you think of implementing this interface in Numba?
@ehsantn, to clarify, do you mean that Bodo will be the main interface (given it has support for pandas data structures) and Numba may need to add things to properly support such use?
@sklam, the proposal here is that pandas provides a `jit` parameter to the functions where it makes sense (
@datapythonista wouldn't you need to add a test for this?
Yes, thanks for pointing that out. I wanted to make the PR simple first, mainly in case there is feedback on the API between pandas and the JIT compilers. I think this first version of the PR should make reviewing the API simple. Once the API has been discussed and we are happy with it, I'll add the tests and a release note, and I'll review whether there is anywhere else in the docs where this feature should be mentioned.
pandas could still cache the jitted function under this protocol, correct? Currently pandas does that for numba to avoid the overhead of jitting the same function more than once.
@@ -10345,6 +10346,15 @@ def apply(
        Pass keyword arguments to the engine.
        This is currently only used by the numba engine,
        see the documentation for the engine argument for more information.

    jit : function, optional
what if the 3rd party implementation isn't a jit? e.g. it is just parallel?
Would you rename the parameter to, for example, `executor`? I'm happy with that. I guess that would make the naming more accurate if this is used for other use cases, such as running in parallel, which is possible with this interface. Do you have a specific use case in mind?
        significant amount of time to run. Fast functions are unlikely to run faster
        with JIT compilation.
        """
        if hasattr(jit, "__pandas_udf__"):
what if `jit` is provided but doesn't have this attribute?
what if `engine` or `engine_kwargs` are passed?
Should error out I think.
The idea here is that the new parameter would make `engine` and `engine_kwargs` deprecated. While they are not removed, yes, I think we should raise an exception if both `engine` and `jit`/`executor` are provided.
Also, I agree: if the value of the parameter doesn't implement the interface, we should raise an exception. It should probably be quite specific about what is expected and which versions of Numba or Bodo can be used. But I think it's easy to provide something that is clear to users if they pass something that doesn't implement this interface.
Exactly what would be useful is going to differ by implementation, and may not be stable. This is a reason for the correct usage to be
I guess I have the same question as @sklam. When you say "parameters of the call" and list dataframe as an example, then given that numba doesn't support dataframes, I question what you expect to happen. I can imagine a conversion layer in Python that could convert a dataframe to a set of numpy arrays, convert references to columns in the lambda to those individual arrays, then call numba to compile and run that code, and convert the result back. (The rest of this message assumes this conversion layer is necessary. Please disregard if this is somehow already happening and I'm not aware of it.) The question is where this conversion layer would live. If in numba, wouldn't this require numba to take a pandas dependency? I'm pretty sure they wouldn't want to do that, partially for the same reason that pandas doesn't want to take a numba dependency, but also because it would be considered out of scope for numba. I think the issue of maintenance was hinted at in the first post. If this doesn't work for some reason, then I guess the pandas devs' hope would be for the user to complain directly to bodo or numba. Some may, but I guarantee others won't. The pandas shim is pretty thin, so it is unlikely an error would be there, but it is theoretically possible. If there is a need for a conversion layer, then there is a lot of room for error there in addition to the base numba compilation layers. It is sounding to me like neither numba nor pandas is going to want to be responsible for this conversion layer. Perhaps a separate org, or a separate repo in the pandas org? If that separate package is listed in the pandas docs as a suggested option, then you might get some uptake, but you still have to identify who is willing to maintain it.
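To make the "conversion layer" idea above concrete, here is a toy, dependency-free sketch: decompose a column-oriented table (standing in for a DataFrame) into plain per-column sequences, run a row kernel over them, and collect the results. In a real layer the columns would be numpy arrays that Numba can compile against; plain lists are used here only to keep the sketch self-contained.

```python
# Toy sketch of a DataFrame -> arrays conversion layer. In practice the
# columns would be numpy arrays and the inner loop would be what a JIT
# compiler such as Numba actually compiles; this is illustrative only.
def apply_rowwise(columns, row_func):
    """columns: dict mapping column name -> list of values (equal length).
    Returns a list with row_func applied to each row."""
    names = list(columns)
    arrays = [columns[n] for n in names]       # "decompose" the table
    n_rows = len(arrays[0]) if arrays else 0
    out = []
    for i in range(n_rows):                    # the loop a JIT would compile
        row = {name: arr[i] for name, arr in zip(names, arrays)}
        out.append(row_func(row))
    return out
```

The open question in the comment above is exactly where such a layer lives (pandas, numba, or a separate package), not whether it is mechanically feasible.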
@mroeschke I think the way the interface is designed, and what makes more sense to me, is that it is Numba and Bodo that cache their compiled functions. pandas will send the data, the function, and anything else needed to the execution backends, and will get the new data back. For the Python/pandas backend there is nothing to cache; for Numba the function itself is jitted, from what I've seen; and for Bodo they create a new function with the call to apply again, and that's what they jit. So, I think intuitively it could make sense that we take the jit decorator and the function in a generic way and cache it in pandas, but in practice I think it would make things very convoluted, compared to just letting Numba and Bodo take care of it.
and we create another jit function to apply the UDF over each column/row, which is the jit function we cache, i.e.
the specifics can be found in `pandas/core/_numba/executor.py`, line 21 at 56847c5
(and as you can see, there are different flavors further down depending on what we're doing). So are you suggesting that Numba/Bodo cache the larger jitted function (2 in the list above) as well?
If you want the Numba/Bodo function to be cached, then you can just use functools.partial on the decorator you pass to jit/executor. However, note that if you cache an outer function, then the inner function it calls is also included, even if it isn't given cache=True. So, caching the inner one would help only in the case where the inner function is the same but the outer function is different for some reason.
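The `functools.partial` suggestion above can be illustrated without installing a compiler. `fake_jit` below is a stand-in for a JIT decorator such as `numba.jit` (which does accept a `cache=True` option); binding the option with `functools.partial` yields a decorator that can be passed wherever the `jit`/`executor` argument is expected.

```python
import functools

# Illustrative stand-in for a JIT decorator like numba.jit; a real
# compiler accepts options (e.g. cache=True) in the same keyword style.
def fake_jit(func=None, *, cache=False):
    if func is None:  # called as fake_jit(cache=True): return a decorator
        return functools.partial(fake_jit, cache=cache)
    func.cached = cache  # pretend we compiled (and optionally cached) it
    return func

# As suggested in the comment: pre-bind the caching option, then pass
# the resulting decorator as the jit/executor argument.
caching_jit = functools.partial(fake_jit, cache=True)

@caching_jit
def add_one(x):
    return x + 1
```

With a real compiler the usage would look like `functools.partial(numba.jit, cache=True)` passed to the proposed parameter.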
`df.apply`: add support for `engine='bodo'` #60622

I've been exploring what @MarcoGorelli proposed in #60622 (review) regarding allowing JIT compilers to implement their own apply engine. I think it is a much better approach than the current implementation. I've been testing different APIs, both the user-facing one and the internal one (what Numba and Bodo have to implement and what we will call), and this PR is what I think makes more sense...
The approach here is to use a `jit` parameter for any function that could make sense to JIT in pandas (`DataFrame.apply`, `Series.map`, `SeriesGroupBy.transform`...) that delegates 100% of the logic to the JIT compiler (Numba or Bodo). So far I've only implemented `DataFrame.apply` for simplicity, but I'm happy to add the rest once there is agreement.

The final user API would look like the following examples:
I think this is very simple and intuitive, and at the same time it makes users import `numba` and `bodo` themselves, creating the right impression that they are using those libraries to JIT compile, and that it's not something provided by pandas. It also makes users install the library themselves, which I think makes things easier (any installation or importing error is not reported to pandas, there is no need for a soft dependency, and no need to even add them to our CI if we don't want to).

I think this approach is very convenient for us, as maintaining the code in pandas is trivial, which should address the concerns @jbrockmendel expressed, and which I think many of us share. It should also be very convenient for Numba and Bodo, which should be able to fix any issue much faster, as well as ship performance improvements and support for new use cases. Numba and Bodo can probably release faster than us, so any needed work on the JIT functionality shouldn't consume pandas resources, should make Numba's and Bodo's lives easier, and should reach users faster.
The exact internal API (the `__pandas_udf__` function in this PR) can probably be improved. I tried to make it look simple and reasonable, but I'm happy to hear other points of view.
As said in the code, the `bodo_patched.py` file is only in the PR to give more context on the API that we will call. The code will be properly implemented in Numba and Bodo, and removed from this PR before it's merged.

@ehsantn @scott-routledge2 @pandas-dev/pandas-core