Skip to content

ENH: Support Plugin Accessors Via Entry Points #61499

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

PedroM4rques
Copy link
Contributor

@PedroM4rques PedroM4rques commented May 26, 2025

TLDR: Allows external libraries to register DataFrame accessors using the 'pandas_dataframe_accessor' entry point group. This enables plugins to be automatically used without explicit import.

I'm working on this PR collaboratively with @afonso-antunes .

Proposal

We propose implementing an entrypoint system similar to Vaex (#29076) to allow easy access to the functionalities of any installed plugin without requiring explicit imports. The idea is to make all installed packages available for use, only being "imported" when they are needed in the program, in a seamless manner.

Current Behavior

Currently, each plugin must be explicitly imported:

import pandas as pd
import vaex.graphql  # required to enable .graphql (.graphql is compatible with pd.DataFrames)

df = pd.DataFrame(...)
df.graphql.query(...)  # only works after the import

Proposed Behavior

With our feature implemented, the code would be simplified to:

import pandas as pd

df = pd.DataFrame(...)
df.graphql.query(...)  # works directly if the plugin is installed via pip

PedroM4rques and others added 2 commits May 26, 2025 20:07
…9076)

Allows external libraries to register DataFrame accessors
using the 'pandas_dataframe_accessor' entry point group.
This enables plugins to be automatically used without
explicit import.

Co-authored-by: Afonso Antunes <afonso.antunes@tecnico.ulisboa.pt>
@afonso-antunes
Copy link

afonso-antunes commented May 26, 2025

Most of the errors are from:
from importlib_metadata import entry_points

@datapythonista datapythonista added API Design Needs Discussion Requires discussion from core team before further action labels May 27, 2025
@datapythonista
Copy link
Member

I'm personally happy to add this, but it's kind of a big change in terms of the code users can write. @pandas-dev/pandas-core, thoughts here?

If this moves forward, you'll want to fix the CI and add documentation for this.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented May 27, 2025

I'm personally happy to add this, but it's kind of a big change in terms of the code users can write. @pandas-dev/pandas-core, thoughts here?

If this moves forward, you'll want to fix the CI and add documentation for this.

Hard to understand without documentation that illustrates a use case.

@datapythonista
Copy link
Member

Hard to understand without documentation that illustrates a use case.

We allow creating pandas accessors with register_dataframe_accessor. For cyberpandas for example, when you do import cyberpandas they'd call that, and you'll be able to use the accessor:

df.ip.is_ipv6

If we allow registration via Python entry points, the import cyberpandas won't be needed anymore. On import pandas we will check the packages the user has installed in their environment that provide an accessor, and we will register them automatically.

Same idea as what PDEP-9 proposed for the read_* and to_* functions/methods if you remember that.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented May 27, 2025

On import pandas we will check the packages the user has installed in their environment that provide an accessor, and we will register them automatically.

Couldn't that be really expensive if lots of packages were installed in the environment? What if there were conflicts in naming?

@TomAugspurger
Copy link
Contributor

Agreed this would need docs, but I'm generally +1 on using entry points rather than import-time side effects.

Couldn't that be really expensive if lots of packages were installed in the environment?

It's scoped to projects that declare an entrypoint, so it doesn't scale with every package installed. But it would be good to measure the performance impact on import pandas here, both for an environment with several and without any entrypoints installed.

What if there were conflicts in naming?

That's probably defined somewhere in importlib, but the same issue would affect plugins using import-time registration today.

PedroM4rques and others added 2 commits May 27, 2025 17:53
Co-authored-by: Afonso Antunes <afonso.antunes@tecnico.ulisboa.pt>
@afonso-antunes
Copy link

Since the implementation seems now stable, would it make sense to start working on the documentation already, or would you prefer we wait for further maintainer input?

@datapythonista
Copy link
Member

What if there were conflicts in naming?

I think by default, the last package found for the entry point with that name will overwrite the previous. As Tom says, this is the same as with imports now, the second import will overwrite the accessor of the first. But we have control over it when registering the entry points. We could keep the behaviour but show a warning, raise an exception and ask the user to remove one of the packages (probably not a great option), let the user decide which package has higher priority in the config... Since this should be very rare, I would go for the simplest solution that doesn't "fail" silently, which would be show a warning saying something like Both packageA and packageB provide the accessor foo. packageA is being used, please uninstall the package you don't want to use to remove this warning.

would it make sense to start working on the documentation already, or would you prefer we wait for further maintainer input?

Up to you @PedroM4rques. The more complete is this PR the easier is for everybody to understand what it's proposed. But if at the end there is no agreement to add this, you'll be spending time in a PR that won't get merged.

@PedroM4rques
Copy link
Contributor Author

I think by default, the last package found for the entry point with that name will overwrite the previous

From what I could test locally, this is true.

We could keep the behaviour but show a warning [or] raise an exception and ask the user to remove one of the packages

I think raising an exception is the better approach, as this is a critical error. It’s likely a rare scenario, and if it does occur intentionally, the user can always handle it explicitly (try-catch and pass for example). I think the plugin system would be safer this way.
I can also imagine that it's possible implement a system where the user chooses the plugins to throw away in the catch block, would that be desirable?

@datapythonista
Copy link
Member

I don't think we should raise an exception. Imagine a case where someone workimg with dna has two packages installed that provide a dna accessor. The user doesn't even care about the accessors, it's using the packages independently of pandas. Raising meams that the user needs to uninstall one of the packagea they need in order to use pandas. It doesn't make any sense in my opinion. Ideally we would just inform which accessor pandas will use, in case the user cares. And how to change it if needed. Which probably should be with an option, but since at present is an extfemely rare scenario I wouldn't make things complex to implement it.

@PedroM4rques
Copy link
Contributor Author

Raising means that the user needs to uninstall one of the packages they need in order to use pandas

I agree, that wouldn't make any sense.

Unless there are any objections, we'll implement the warning system.

@rhshadrach
Copy link
Member

Are there other mainstream packages using entrypoints?

@TomAugspurger
Copy link
Contributor

Yeah, xarray, fsspec, pytest are a few.

Entry points can be a good option anytime you have som sort of plugin system that requires coordinating how a "framework" (pandas in this case) loads code provided by a plugin.

@datapythonista
Copy link
Member

We already use entrypoints in pandas for the plotting backends. Besides what Tom said, if I'm not wrong many projects using commands (e.g. jupyter <command>, black <command>, flake8 <command>) are implemented with entrypoints. Airflow plugins are also using them. I don't think they are super popular, but surely not experimental or rare.

@rhshadrach
Copy link
Member

rhshadrach commented May 31, 2025

Thanks @TomAugspurger. Looking at prior art, all options to deal with collisions exist.

My personal preference would be to warn.

Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess there are no objections to this. Added some comments.

This will need proper documentation, so libraries can use this, and users can understand what's going on.

Also, this will need better tests.

Do you mind letting us know which third-party accessor are you planning to implement this on? Or what's the motivation for this work.

Thanks!


from pandas.core.accessor import DataFrameAccessorLoader

DataFrameAccessorLoader.load()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a class with a single method?

And why only doing this for DataFrame, not for Series?

Copy link

@afonso-antunes afonso-antunes Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Me and @PedroM4rques are currently planning to use this for a third-party accessor related to Vaex (see related discussion in issue #29076 . Our main motivation is to provide a structured and maintainable way to register external accessors without cluttering the core codebase.

We initially opted for a class with a single method (load) mostly as a pragmatic choice, since we're not yet deeply familiar with all the internals of Pandas. It seemed like a clean and extensible way to isolate the registration logic. That said, if a standalone function would be preferred, we’re absolutely open to changing it.

Regarding the focus on DataFrame only: we started with that use case since it was our immediate need, but extending this to Series or other objects makes sense and could certainly be part of the plan going forward. Would you recommend covering that already in this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you recommend covering that already in this PR?

Yes, I'd say so. It won't be much more complicated and any libraries that provide both Series and DataFrame accessors wanting to will want these at the same time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add the same for Index too.

Better to use a function than class with a single method. I doubt it'll never have more methods, but if it does, we can always change later, as this is private.

@datapythonista datapythonista removed the Needs Discussion Requires discussion from core team before further action label Jun 1, 2025
- Added tests
- created doc .rst file

Co-authored-by: Afonso Antunes <afonso.antunes@tecnico.ulisboa.pt>
@PedroM4rques PedroM4rques changed the title ENH: Support plugin DataFrame accessor via entry points (#29076) ENH: Support plugin DataFrame accessor via entry points Jun 3, 2025
@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jun 3, 2025

Above, I raised 3 issues:

  1. Documentation is needed
  2. Concern about performance when people don't have packages using entry points, and import pandas as pd just takes longer because there are a lot of packages installed. This probably has more to do with the performance of importlb.entry_points() than anything else.
  3. Concern about duplicates and conflicts.

I don't think (1) or (2) have been discussed.

For (3), the suggestion of warning if a conflict occurs is fine with me.

@datapythonista
Copy link
Member

Tom commented about 2. The entry points are a registry. I think the cost is just a lookup of the entry point name in a hash table. It shouldn't depend on the amount of packages installed. So it's just the loop over the packages that register an accessor that exist in the user environment. Even if this becomes popular, I'd be surprised the number is more than around 5. I don't think there should be any impact in practice.

But worth benchmarking, better to be sure.

- Added 1 test for no packages

Co-authored-by: Afonso Antunes <afonso.antunes@tecnico.ulisboa.pt>
Copy link
Member

@datapythonista datapythonista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes. Added some comments that hopefully can be helpful. I'll have another look when the accessors for all datastructures are implemented, which will change this PR significantly.

@@ -0,0 +1 @@
TODO
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main document you want to update: https://pandas.pydata.org/docs/development/extending.html

Probably a comment here too: https://pandas.pydata.org/docs/reference/series.html#accessors (and for dataframe and index maybe).

I think for most users, if they are using a package providing an accessor, they'll already get the idea on how this works from the package documentation.


from pandas.core.accessor import DataFrameAccessorLoader

DataFrameAccessorLoader.load()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add the same for Index too.

Better to use a function than class with a single method. I doubt it'll never have more methods, but if it does, we can always change later, as this is private.

class DataFrameAccessorLoader:
"""Loader class for registering DataFrame accessors via entry points."""

ENTRY_POINT_GROUP: str = "pandas_dataframe_accessor"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want different entrypoints for each data type, so pandas_accessor should be better.



class DataFrameAccessorLoader:
"""Loader class for registering DataFrame accessors via entry points."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind adding proper documentation here? Explaining how the entrypoints need to be implemented, when this is called, what happens with conflicts...


if name in names: # Verifies duplicated package names
warnings.warn(
f"Warning: you have two packages with the same name: '{name}'. "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like The accessor 'foo' has already been registered by the package 'bar', so the accessor provided by the package 'foobar' is not being registered seems like more useful to the users. If I remember correctly, it was a bit tricky but possible to get the pip name of the package registering the entry point.

def load(cls) -> None:
"""loads and registers accessors defined by 'pandas_dataframe_accessor'."""
eps = entry_points(group=cls.ENTRY_POINT_GROUP)
names: set[str] = set()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use more descriptive variable names please? names doesn't mean much. This is registered_accessor_names I guess?

Also eps, ep aren't great names in my opinion.

PedroM4rques and others added 3 commits June 4, 2025 14:52
- Added DocStrs
- Fixed small typo in test file name

Co-authored-by: Afonso Antunes <afonso.antunes@tecnico.ulisboa.pt>
@PedroM4rques PedroM4rques changed the title ENH: Support plugin DataFrame accessor via entry points ENH: Support Plugin Accessors Via Entry Points Jun 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GraphQL support / accessor plugin system
6 participants