-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
ENH: Support Plugin Accessors Via Entry Points #61499
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…9076) Allows external libraries to register DataFrame accessors using the 'pandas_dataframe_accessor' entry point group. This enables plugins to be automatically used without explicit import. Co-authored-by: Afonso Antunes <afonso.antunes@tecnico.ulisboa.pt>
Most of the errors are from: |
I'm personally happy to add this, but it's kind of a big change in terms of the code users can write. @pandas-dev/pandas-core, thoughts here? If this moves forward, you'll want to fix the CI and add documentation for this. |
Hard to understand without documentation that illustrates a use case. |
We allow creating pandas accessors with df.ip.is_ipv6 If we allow registration via Python entry points, the Same idea as what PDEP-9 proposed for the |
Couldn't that be really expensive if lots of packages were installed in the environment? What if there were conflicts in naming? |
Agreed this would need docs, but I'm generally +1 on using entry points rather than import-time side effects.
It's scoped to projects that declare an entrypoint, so it doesn't scale with every package installed. But it would be good to measure the performance impact on
That's probably defined somewhere in |
Co-authored-by: Afonso Antunes <afonso.antunes@tecnico.ulisboa.pt>
Since the implementation seems now stable, would it make sense to start working on the documentation already, or would you prefer we wait for further maintainer input? |
I think by default, the last package found for the entry point with that name will overwrite the previous. As Tom says, this is the same as with imports now, the second import will overwrite the accessor of the first. But we have control over it when registering the entry points. We could keep the behaviour but show a warning, raise an exception and ask the user to remove one of the packages (probably not a great option), let the user decide which package has higher priority in the config... Since this should be very rare, I would go for the simplest solution that doesn't "fail" silently, which would be show a warning saying something like
Up to you @PedroM4rques. The more complete is this PR the easier is for everybody to understand what it's proposed. But if at the end there is no agreement to add this, you'll be spending time in a PR that won't get merged. |
From what I could test locally, this is true.
I think raising an exception is the better approach, as this is a critical error. It’s likely a rare scenario, and if it does occur intentionally, the user can always handle it explicitly ( |
I don't think we should raise an exception. Imagine a case where someone workimg with dna has two packages installed that provide a dna accessor. The user doesn't even care about the accessors, it's using the packages independently of pandas. Raising meams that the user needs to uninstall one of the packagea they need in order to use pandas. It doesn't make any sense in my opinion. Ideally we would just inform which accessor pandas will use, in case the user cares. And how to change it if needed. Which probably should be with an option, but since at present is an extfemely rare scenario I wouldn't make things complex to implement it. |
I agree, that wouldn't make any sense. Unless there are any objections, we'll implement the warning system. |
Are there other mainstream packages using entrypoints? |
We already use entrypoints in pandas for the plotting backends. Besides what Tom said, if I'm not wrong many projects using commands (e.g. |
Thanks @TomAugspurger. Looking at prior art, all options to deal with collisions exist. My personal preference would be to warn. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess there are no objections to this. Added some comments.
This will need proper documentation, so libraries can use this, and users can understand what's going on.
Also, this will need better tests.
Do you mind letting us know which third-party accessor are you planning to implement this on? Or what's the motivation for this work.
Thanks!
pandas/__init__.py
Outdated
|
||
from pandas.core.accessor import DataFrameAccessorLoader | ||
|
||
DataFrameAccessorLoader.load() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why a class with a single method?
And why only doing this for DataFrame
, not for Series
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Me and @PedroM4rques are currently planning to use this for a third-party accessor related to Vaex (see related discussion in issue #29076 . Our main motivation is to provide a structured and maintainable way to register external accessors without cluttering the core codebase.
We initially opted for a class with a single method (load) mostly as a pragmatic choice, since we're not yet deeply familiar with all the internals of Pandas. It seemed like a clean and extensible way to isolate the registration logic. That said, if a standalone function would be preferred, we’re absolutely open to changing it.
Regarding the focus on DataFrame only: we started with that use case since it was our immediate need, but extending this to Series or other objects makes sense and could certainly be part of the plan going forward. Would you recommend covering that already in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you recommend covering that already in this PR?
Yes, I'd say so. It won't be much more complicated and any libraries that provide both Series and DataFrame accessors wanting to will want these at the same time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably add the same for Index too.
Better to use a function than class with a single method. I doubt it'll never have more methods, but if it does, we can always change later, as this is private.
- Added tests - created doc .rst file Co-authored-by: Afonso Antunes <afonso.antunes@tecnico.ulisboa.pt>
Above, I raised 3 issues:
I don't think (1) or (2) have been discussed. For (3), the suggestion of warning if a conflict occurs is fine with me. |
Tom commented about 2. The entry points are a registry. I think the cost is just a lookup of the entry point name in a hash table. It shouldn't depend on the amount of packages installed. So it's just the loop over the packages that register an accessor that exist in the user environment. Even if this becomes popular, I'd be surprised the number is more than around 5. I don't think there should be any impact in practice. But worth benchmarking, better to be sure. |
- Added 1 test for no packages Co-authored-by: Afonso Antunes <afonso.antunes@tecnico.ulisboa.pt>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes. Added some comments that hopefully can be helpful. I'll have another look when the accessors for all datastructures are implemented, which will change this PR significantly.
@@ -0,0 +1 @@ | |||
TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the main document you want to update: https://pandas.pydata.org/docs/development/extending.html
Probably a comment here too: https://pandas.pydata.org/docs/reference/series.html#accessors (and for dataframe and index maybe).
I think for most users, if they are using a package providing an accessor, they'll already get the idea on how this works from the package documentation.
pandas/__init__.py
Outdated
|
||
from pandas.core.accessor import DataFrameAccessorLoader | ||
|
||
DataFrameAccessorLoader.load() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably add the same for Index too.
Better to use a function than class with a single method. I doubt it'll never have more methods, but if it does, we can always change later, as this is private.
pandas/core/accessor.py
Outdated
class DataFrameAccessorLoader: | ||
"""Loader class for registering DataFrame accessors via entry points.""" | ||
|
||
ENTRY_POINT_GROUP: str = "pandas_dataframe_accessor" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want different entrypoints for each data type, so pandas_accessor
should be better.
pandas/core/accessor.py
Outdated
|
||
|
||
class DataFrameAccessorLoader: | ||
"""Loader class for registering DataFrame accessors via entry points.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind adding proper documentation here? Explaining how the entrypoints need to be implemented, when this is called, what happens with conflicts...
pandas/core/accessor.py
Outdated
|
||
if name in names: # Verifies duplicated package names | ||
warnings.warn( | ||
f"Warning: you have two packages with the same name: '{name}'. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like The accessor 'foo' has already been registered by the package 'bar', so the accessor provided by the package 'foobar' is not being registered
seems like more useful to the users. If I remember correctly, it was a bit tricky but possible to get the pip name of the package registering the entry point.
pandas/core/accessor.py
Outdated
def load(cls) -> None: | ||
"""loads and registers accessors defined by 'pandas_dataframe_accessor'.""" | ||
eps = entry_points(group=cls.ENTRY_POINT_GROUP) | ||
names: set[str] = set() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use more descriptive variable names please? names
doesn't mean much. This is registered_accessor_names
I guess?
Also eps
, ep
aren't great names in my opinion.
- Added DocStrs - Fixed small typo in test file name Co-authored-by: Afonso Antunes <afonso.antunes@tecnico.ulisboa.pt>
TLDR: Allows external libraries to register DataFrame accessors using the 'pandas_dataframe_accessor' entry point group. This enables plugins to be automatically used without explicit import.
I'm working on this PR collaboratively with @afonso-antunes .
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Proposal
We propose implementing an entrypoint system similar to Vaex (#29076) to allow easy access to the functionalities of any installed plugin without requiring explicit imports. The idea is to make all installed packages available for use, only being "imported" when they are needed in the program, in a seamless manner.
Current Behavior
Currently, each plugin must be explicitly imported:
Proposed Behavior
With our feature implemented, the code would be simplified to: