ENH: Allow third-party packages to register IO engines #61642

Merged
merged 12 commits into from
Jul 3, 2025

Conversation

datapythonista
Member

Added the new system to the Iceberg connection only to keep this smaller. The idea is to add the decorator to all other connectors, happy to do it here or in a follow up PR.

@datapythonista datapythonista requested a review from noatamir as a code owner June 12, 2025 21:58
@datapythonista datapythonista added IO Data IO issues that don't fit into a more specific label API Design labels Jun 12, 2025
method on it with the arguments provided by the user (except the ``engine`` parameter).

To avoid conflicts in the names of engines, we keep an "IO engines" section in our
[Ecosystem page](https://pandas.pydata.org/community/ecosystem.html#io-engines).
Member

This will need different formatting since rst hyperlink syntax is different from md

Member Author

True, thanks for the heads up. I updated it.
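
For illustration, a minimal sketch of what a registered third-party engine class might look like under the mechanism described above (pandas looks up the engine and calls the matching read_*/to_* method with the user's arguments, minus ``engine``). The module, class, and method names here are hypothetical and only mirror the ``empty_data:EmptyDataEngine`` example used later in this PR; the exact protocol is defined by the PR itself:

```python
# empty_data.py -- hypothetical third-party package exposing a pandas IO engine.
# Assumes pandas dispatches read_iceberg(..., engine="empty") to a same-named
# method on the registered class, forwarding the user's arguments.
import pandas as pd


class EmptyDataEngine:
    @classmethod
    def read_iceberg(cls, table_identifier, **kwargs):
        # A real engine would open the table and return its contents;
        # this placeholder just returns an empty DataFrame.
        return pd.DataFrame()

    @classmethod
    def to_iceberg(cls, df, table_identifier, **kwargs):
        # A real engine would write ``df`` to the given table.
        pass
```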

@@ -52,6 +56,10 @@ def read_iceberg(
scan_properties : dict of {str: obj}, optional
Additional Table properties as a dictionary of string key value pairs to use
for this scan.
engine : str, optional
Member

Should the read_* and to_* signatures also have an engine_kwargs: dict[str, Any] | None argument to allow specific engine arguments to be passed per implementation?

Member Author

Very good point. In read_parquet we already have **kwargs for engine-specific arguments. In map, apply... it's a normal engine_kwargs, since **kwargs is used in some cases for the UDF keyword arguments. I think for IO readers/writers, **kwargs as read_parquet does it is fine.

I didn't want to add the engine to all connectors in this PR to keep it simpler, but I'm planning to follow up with another PR that adds it, and adds **kwargs for connectors where it's not there already. Surely happy to add both things here if you prefer, just thought it would make reviewing simpler to keep the implementation separate from all the changes to parameters.

Member

if engine-specific kwargs are needed, isn't that a good reason to use engine.read_whatever(path, **kwargs) instead of pd.read_[...]?

Member Author

This is a good point. For readers we don't care much about, I think what you propose is the best choice. And this PR doesn't really prevent that from happening anyway. But for readers we cared enough about to include in pandas, I think this new interface offers an advantage. For example, there was some discussion on whether we should move the fastparquet engine out of pandas; Patrick suggested it. I think this interface allows moving the fastparquet engine to the fastparquet package: users with fastparquet installed will still have it available in the same way as now, but we can forget about it.

Of course discussions about moving readers out of pandas will have to happen later. But this interface seems quite useful and it's very simple, so in my opinion it's a good deal.

@datapythonista
Member Author

/preview

Contributor

Website preview of this PR available at: https://pandas.pydata.org/preview/pandas-dev/pandas/61642/

@datapythonista
Member Author

@mroeschke I addressed your comments and I think this should be ready when you've got time. Thanks!

Member

@mroeschke mroeschke left a comment

Overall I think this is a good idea. Might be good to get another opinion to vet the idea.

I think still having a **kwargs-like argument so users can pass engine-specific arguments without manually expanding the pandas signatures would be good. But since this is for the new IO method iceberg, it's not as critical now.

@datapythonista
Member Author

Fully agree, I just didn't want to make this PR too big by allowing the engines and adding **kwargs everywhere here. I'll do it in a follow up.

@pandas-dev/pandas-core any opinion or comment before merging this?

@WillAyd
Member

WillAyd commented Jun 26, 2025

Thanks for the ping. I haven't been involved enough to really block, but I'm curious what the advantage of this is over leveraging the Arrow PyCapsule interface for data exchange; I feel like the latter would be a better thing to build against, given it is a rather well-adopted standard in the ecosystem.

@datapythonista
Member Author

Good point, thanks for the feedback. I think this is a higher-level interface that still allows using the Arrow PyCapsule interface internally. Surely an option would be to get rid of I/O in pandas, and have an ecosystem of readers that can be used via a single pd.read(engine) or something similar. But I think it's better to use the current API, allow third-party engines as implemented here, and let the engines decide how the data is exchanged. Or what is your idea?

@WillAyd
Member

WillAyd commented Jun 26, 2025

It would depend on a resolution to #59631, but for demo purposes let's assume we decide to implement a new DataFrame.from_pycapsule class method.

So instead of an I/O method like:

df = pd.read_iceberg(..., kwargs)

You could construct a dataframe like:

df = pd.DataFrame.from_pycapsule(
    # whatever the iceberg API is here to create a table
    pyiceberg.table(..., kwargs)
)

The main downside is verbosity, but the upsides are:

  1. Generic approach to read any datasource
  2. pandas itself has to do very little development (third parties are responsible for integration)
  3. polars, cudf, etc... all benefit in kind

So yea this could be extended to say even Delta if they decided to implement the PyCapsule interface (whether they do currently or not, I don't know):

df = pd.DataFrame.from_pycapsule(
    # whatever the delta API is here to create a table
    DeltaLake.table(..., kwargs)
)

and if polars decided on the same API you could create that dataframe as well:

df = pl.DataFrame.from_pycapsule(
    # whatever the delta API is here to create a table
    DeltaLake.table(..., kwargs)
)

@WillAyd
Member

WillAyd commented Jun 26, 2025

As I said though, I haven't been involved enough to really block, so if this PR has some support happy to roll with it and can clean up later if it comes to it. Thanks for giving it consideration @datapythonista

@jbrockmendel
Member

I don't know anything about pycapsules. Is that effectively a pd.from_arrow_table method that isn't pyarrow-specific?

@WillAyd
Member

WillAyd commented Jun 26, 2025

Yea kind of. In a generic sense, a PyCapsule is just a Python construct that can wrap generic memory. The Arrow PyCapsule interface further defines lifetime semantics for passing Arrow data across a PyCapsule:

https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html

Somewhat separately there is the question of how you would read and write the data in a capsule. For pandas that is pyarrow, but other libraries may choose a tool like nanoarrow for a smaller dependency
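
As a rough illustration of the consumption side, any object exposing the Arrow PyCapsule stream interface (an ``__arrow_c_stream__`` method) can already be turned into a pandas DataFrame by going through pyarrow. This is only a sketch of the general pattern, assuming a recent pyarrow (14+); it is not the hypothetical ``from_pycapsule`` API discussed above:

```python
# Sketch: consume any Arrow PyCapsule stream exporter via pyarrow, then convert
# to pandas. ``source`` stands in for e.g. a pyiceberg scan result or any other
# object implementing __arrow_c_stream__ (an assumption, not a specific API).
import pyarrow as pa


def dataframe_from_arrow_stream(source):
    # pa.table() accepts __arrow_c_stream__ / __arrow_c_array__ exporters
    table = pa.table(source)
    return table.to_pandas()
```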

@datapythonista
Member Author

@WillAyd what you propose seems reasonable, but I guess we aren't planning to remove all pandas IO anytime soon. And if we keep our readers and writers with multiengine support, I think this interface is going to be useful, even if long term we move into the pyarrow capsule reader you propose. Also, since pandas is not arrow based yet, this PR could be used to move the xarray connectors to the xarray package, while using pycapsule wouldn't be ideal for pandas/xarray interchange, as they are numpy based.

Dr-Irv
Dr-Irv previously requested changes Jun 26, 2025
Comment on lines +531 to +536
it. This is done in ``pyproject.toml``:

```toml
[project.entry-points."pandas.io_engine"]
empty = empty_data:EmptyDataEngine
```
Contributor

I'm not 100% sure if this can happen, but what if the project isn't using pyproject.toml for some reason. Is there another way to do the configuration or is using pyproject.toml required?

Member Author

Entry points existed before pyproject.toml, and can also be added in setup.py. It makes no difference how the package defines them: pip or conda will add the entry point to the environment registry, and pandas will be able to find them regardless of how the project created them.
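
For reference, a rough sketch of the equivalent declaration with legacy setuptools; the group and object names simply mirror the pyproject.toml example above, and this is illustrative rather than part of the PR:

```python
# setup.py -- legacy setuptools equivalent of the pyproject.toml entry point above.
from setuptools import setup

setup(
    name="empty-data",
    entry_points={
        "pandas.io_engine": [
            "empty = empty_data:EmptyDataEngine",
        ],
    },
)
```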

Contributor

The language here suggests that the only way to add the entry point is via pyproject.toml. If this is the recommended way, we can say that. Or if other ways are supported, we should show that too.

Member Author

pyproject.toml is the way to do it, setup.py is how it was done in the past. I'm sure people reading this will be able to figure out how this was done in the past if their code is still using setup.py

Would having the entry point be a module variable, like in the UDF engine PR, address some of the concerns about using pyproject.toml?

@@ -90,6 +90,7 @@ Other enhancements
- Support passing a :class:`Iterable[Hashable]` input to :meth:`DataFrame.drop_duplicates` (:issue:`59237`)
- Support reading Stata 102-format (Stata 1) dta files (:issue:`58978`)
- Support reading Stata 110-format (Stata 7) dta files (:issue:`47176`)
- Third-party packages can now register engines that can be used in pandas I/O operations :func:`read_iceberg` and :meth:`DataFrame.to_iceberg` (:issue:`61584`)
Contributor

This sentence makes it seem that it only applies to read_iceberg. But doesn't the engine comment apply to ANY of the current IO routines that have an engine argument?

Member Author

Good point. This PR creates the new system for third-party engines in a generic way, and the idea is to use it everywhere, but the PR only applies it to Iceberg for now. The reason is to make reviewing easier, as adding the engine keyword to many connectors would make the PR significantly bigger.

My idea is to add the whatsnew note for what's delivered in this PR, and in the follow up PR update it to what you suggest.

Contributor

Actually, since read_iceberg() and to_iceberg() are new for 3.0 anyway, I don't think you need a whatsnew item for this PR, because, as you say, it is just functionality that applies to iceberg right now. Then, if it is accepted, you can add a whatsnew to point to all the readers/writers that support it, if you add support for those readers/writers.

Member Author

I think it's good practice to make PRs atomic and not assume things about other PRs. If we were going to release just after this commit, things would be correct. As said, the follow-up PR will update the whatsnew.

Comment on lines +1342 to +1349
package_name = entry_point.dist.metadata["Name"]
else:
package_name = None
if entry_point.name in _io_engines:
_io_engines[entry_point.name]._packages.append(package_name)
else:
_io_engines[entry_point.name] = entry_point.load()
_io_engines[entry_point.name]._packages = [package_name]
Contributor

I have to wonder if it is better to just get the entry points here but NOT load them, and then load them on demand. So the dict would just have EntryPoint objects, or a tuple of EntryPoint and package names.

Member Author

Sounds like a good idea, I didn't think about it before. I think it'll make the code slightly more complex, but not loading the code of unused connectors would be nice, in case a package takes a long time to load.

I won't be updating this PR, as I don't think it's likely that it'll be merged, so not worth the effort. But I'd be happy to implement it in a follow up.
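
A possible shape for that follow-up is sketched below: keep the EntryPoint objects around and only call .load() when an engine is first requested. The function and variable names are assumptions for illustration, not the PR's actual implementation (this uses the Python 3.10+ importlib.metadata API):

```python
# Sketch of lazy engine loading: keep EntryPoint objects around and only
# import the third-party code when an engine is actually requested.
from importlib.metadata import entry_points

_io_engine_entry_points = {}  # name -> list of EntryPoint objects
_loaded_io_engines = {}       # name -> loaded engine class


def _discover_io_engines():
    for ep in entry_points(group="pandas.io_engine"):
        _io_engine_entry_points.setdefault(ep.name, []).append(ep)


def get_io_engine(name):
    if name not in _loaded_io_engines:
        candidates = _io_engine_entry_points.get(name, [])
        if not candidates:
            raise ValueError(f"Unknown IO engine: {name!r}")
        if len(candidates) > 1:
            raise ValueError(f"IO engine {name!r} is provided by multiple packages")
        _loaded_io_engines[name] = candidates[0].load()
    return _loaded_io_engines[name]
```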

@jbrockmendel
Member

And if we keep our readers and writers with multiengine support

ATM that's just csv and parquet? And the parquet one plausibly is not needed?

AFAICT this adds a bunch of new tests/code/docs, complicates the import with entrypoints, and lets 3rd parties hijack our namespace. All when there's a perfectly good option of using their own namespace.

Also if we ever did change defaults like #61618 or PDEP16, that would Break The World for any 3rd party readers that we are implicitly committed to supporting.

-1.

@WillAyd
Member

WillAyd commented Jun 26, 2025

Also, since pandas is not arrow based yet, this PR could be used to move the xarray connectors to the xarray package, while using pycapsule wouldn't be ideal for pandas/xarray interchange, as they are numpy based.

Definitely not an expert, but I want to point out that DLPack also offers a PyCapsule for data exchange. See https://dmlc.github.io/dlpack/latest/python_spec.html

So depending on how generic we want things to be, PyCapsule support doesn't just mean consuming Arrow data, but could mean dlpack data as well (which I assume xarray can do, if it doesn't already)

@datapythonista
Member Author

ATM that's just csv and parquet? And the parquet one plausibly is not needed?

Any reader could have an engine option, this would allow having xarray, sas... as other packages.

AFAICT this adds a bunch of new tests/code/docs, complicates the import with entrypoints, and lets 3rd parties hijack our namespace. All when there's a perfectly good option of using their own namespace.

This adds minimal tests, code, and docs; any other PR adds as many as this. We already allow entry points in pandas, and 3rd parties cannot use this to change the namespace directly, just to allow engine="foo".

Also if we ever did change defaults like #61618 or PDEP16, that would Break The World for any 3rd party readers that we are implicitly committed to supporting.

We could expose the value of the flag that sets the types to use through the interface, so third parties can transition in the same way we do.

-1.

Being honest, I think you'll block anything that is Bodo related. I think it would have been best to be -1 on accepting their money when they offered it. In any case, I'll close this, and I'll open a separate issue to discuss returning the remaining funds, as I think it can make sense at this point.

@Dr-Irv
Contributor

Dr-Irv commented Jul 2, 2025

I have comments above that still should be addressed:

@datapythonista Just to clarify, I'm not for or against adding this particular feature. The comments that I was hoping you would respond to were about documentation and performance. With respect to the documentation, I think it can be improved (and would be helpful if this were to be reopened and merged in). With respect to performance, you acknowledged that delayed loading is a good idea, but you didn't want to do it because you didn't think the PR would be accepted.

I'm not making the decision of accepting and merging the PR if you were to reopen it. I will leave that to others.

@jbrockmendel
Member

But with the engine param and corresponding docs, they could discover changing the engine might simplify this workflow:

  1. I don't think we're putting the docs for third party engines in our docs.
  2. A third party engine with behavior significantly different from the pandas API doubly doesn't belong in the pandas namespace.

@datapythonista
Member Author

Thanks @Dr-Irv for your comment. I think there are two separate things we are talking about.

About the specific feedback, I think loading the engines lazily is great feedback, and I surely want to incorporate it here. As said, I don't think it makes sense to spend the time on it just now, since Brock has some other concerns. And I guess nobody else has probably spent enough time fixing the CI for read_clipboard, or trying to make the conda solver resolve our dependencies fast, to share my excitement about moving some of the IO connectors to third-party packages. There wasn't much interest when I opened PDEP-9, and there hasn't been anyone very positive about this minimal version of it. So, this is not a problem. I think it makes more sense to take care of it in a follow-up, but if this PR is merged, we'll load the engines lazily, as it's surely better than the current implementation.

Removing the release note here and adding it in the follow-up PR is also fine. So is adding information on how to add an entry point for setup.py users. I personally prefer how things are now, and I understand that you have a preference for what you suggest. I don't think either option is clearly better than the other, it's just taste, and anything would be fine.

So, the second thing we're talking about, is about blocking other people's PRs. I think blocking PRs is a useful tool we have, and I surely use it when needed. For example, if a PR is going to cause us future trouble, in our CI, with too much technical debt, with a poor experience for the users... I think it's great that we all can block PRs until the concerns are resolved. But for some reason, you seem to be abusing blocking PRs recently (at least in my opinion). If you agree with my comments about this PR above, you are literally not letting this PR be merged until I implement your (in my opinion) very opinionated and not very important comments of removing a release note that as I told you will be updated later. And to add an explanation about how to also add entry points with the legacy setup.py. If you share those as suggestions, I appreciate them, as I think it's good feedback. If other people agree with you, I'll be happy to update the PR with them. But if you block a PR for such trivial topics, it feels like you're making pandas your personal project, and that no work in pandas will be possible unless it follows your personal preferences. If that's the case, and pandas is happy with this sort of behavior, I surely don't want to continue contributing to it.

And this already happened recently in the df.select("col1", "col2") PR, which I think is exactly the same case.

I don't think we're putting the docs for third party engines in our docs.
A third party engine with behavior significantly different from the pandas API doubly doesn't belong in the pandas namespace.

Agree on not having third-party engines in the docs (except for the Ecosystem page on our website).

As opposed to PDEP-9, this PR tries to unify the behavior of readers/writers of the same format. I agree it'd be better for users if third-party connectors were consistent in how they handle the file path (accepting URLs, compressed files, cloud files...). I'm not too sure if enforcing it is feasible or the best option. Maybe recommending it in the docs, with info on how to support the standard pandas syntactic sugar, is the best approach.

@jbrockmendel
Member

read_clipboard

Ironically I'd be fine with deprecating this (xref #56039)

is about blocking other people's PRs

I don't see why it is relevant whether someone has hit the "block" button, since the informal norm is to not merge until there is consensus. For all their faults, PDEPs have a lower bar to acceptance than the informal system.

@datapythonista
Member Author

I don't see why it is relevant whether someone has hit the "block" button, since the informal norm is to not merge until there is consensus.

If I start blocking every PR over small details that I would do differently, probably nothing is going to be merged in pandas again. And we aren't too far from this situation lately. Any PR that gets enough attention is likely not to move forward.

This is what all this is about to me. If a small PR like this one, with no backward-compatibility issues, small and simple code, and with a huge potential (IMHO) can't be merged because there are different opinions on when to add the release note, I think we are wasting our time.

@jbrockmendel
Member

jbrockmendel commented Jul 3, 2025

It can’t be merged because I’m strongly against it and you need to get a consensus of the not-me team members to get me to say “oh well”

@datapythonista
Member Author

I know, this is a different discussion than what I'm saying about @Dr-Irv blocking PRs.

But it is clearly an equal impediment to pandas' progress. We've already discussed this, and it seems obvious to me that you'll be strongly against anything related to Bodo, including this case, where Bodo can benefit from it even though it is clearly independent of any specific use case.

In any case, thanks for not blocking it if everyone else is happy to get this merged.

@jbrockmendel
Member

The only bodo things I’ve opposed have been engine keywords in places they aren’t needed.

@datapythonista
Member Author

The only bodo things I’ve opposed have been engine keywords in places they aren’t needed.

You didn't oppose adding the engine keyword to apply. That was done by Thomas. You opposed making it more generic, so not only numba would benefit from it.

@jbrockmendel
Member

Good clarification. I opposed there the same thing I oppose here, and for the same correct reasons.

@datapythonista
Member Author

Not sure if it makes sense to iterate much longer, but just a last comment on why I'm not convinced by your reasons (stated in this comment #61642 (comment)). And I continue to think your strong opposition to this is based on other, less objective reasons.

No users have asked for this

This is true for lots of things, in particular for work like this one, where one of the main motivations is to make our CI easier, by being able to move code that is tricky to test to separate packages in a way that is transparent to users.

When a library such as xarray or fastparquet is not installed, we silently stop testing the connector. It happened in the past (several years ago) that we spent weeks not testing IO connectors. The way we implement IO in pandas is wrong. Having the IO live outside of pandas is an option, as you say. But we aren't removing read_sas, read_spss, read_clipboard, read_xarray from our API. This PR allows a better architecture without having to remove those and cause backward compatibility issues. In particular, this PR, followed up by related work, would reduce the maintenance burden of pandas, as IO code could live elsewhere. It would avoid the import_or_skip usage, which can easily lead to silently not testing code. It would simplify our dependencies and their resolution.

And of course, it would also allow adding new IO connectors at literally no cost for pandas. Polars and DuckDB have the fastest CSV readers. Should we add them as engines to pandas? Yes, if we think about making read_csv faster. But probably not if we need to add Polars and DuckDB as optional dependencies, maintain the wrappers around them... What's proposed in this PR immediately solves this at a minimal cost.

All downsides, no upsides

Exactly the opposite. I personally think you are repeatedly ignoring the upsides. I know you say that we can do the same without a pandas interface. But for good or for bad, this is how pandas is designed. We could get rid of pandas.read_xarray, or pandas.read_clipboard. You tried with the latter but it's not happening. So, saying we can just have pandas_clipboard.read_clipboard is not realistic.

this complicates the API

I disagree. This was true for PDEP-9, but not here. The API is the same. If users have fastparquet installed they will be able to use the engine, otherwise they won't and will get an error like now.

Users will have to learn that engines require packages to be installed, as it's the case now even if in a subtly different way. But the API is exactly the same.

more docs and more code and more tests increase burdens

The diff here is minimal; if you used this argument consistently, nothing would ever be added to pandas again.

What's more paradoxical is that if you consider this PR from the perspective that it'd allow moving connectors to other packages, then this is exactly the opposite. If we just move the fastparquet connector to their package, the burden already decreases. Or the clipboard, but I guess that would live in a new package, and that's a bit trickier since it's not obvious who would maintain it. Not a big deal in my opinion, but that was a concern when PDEP-9 was discussed.

Users know what we do and don't maintain

A valid point, but I'm not sure how this PR is different from the status quo. So, if fastparquet is not installed, read_parquet with the fastparquet engine won't work. If there is any problem in the data returned, or an exception, it's very likely a problem in fastparquet. Do users know it? I think they do in most cases, actually. This PR probably helps in communicating this, as the fastparquet engine won't be mentioned in our API docs but in the ecosystem page, making it clearer that they are using something not in pandas.

The main difference here is that now we do maintain wrappers that map our signature to the engine signature. Do users know that we maintain this tiny wrapper, and do they know whether a problem with the connector is caused by the wrapper we maintain or by the external library? Probably not. So, once more, this PR probably helps solve your concern more than it causes it.

You can of course hold any opinion about this you want. And in the case of apply, I think some of your feedback was valuable and made the work there better. But I still have the feeling that you want to oppose this just because you don't like the way the agreement with Bodo works, and you are looking for reasons for your opposition, more than opposing this because of the reasons you presented. I can tell not only because the reasons don't seem factual, and don't seem important enough for strongly opposing it, but because of the double standards compared to other PRs. For example, you didn't have any concern when Thomas added engine to apply, then you presented all the concerns you could have there to me when I made what he implemented simpler and more generic. Or you don't have any opposition that I've seen to using entry points for accessors, which would be by far a better example of using the pandas namespace for others' code than this engine system.

In any case, I genuinely appreciate your flexibility to be ok with this if there is broad support from other team members. I don't think I'll be able to make you like this, as in my mind your opposition is more ideological or emotional. But that flexibility and team work attitude is what allows the project to move forward. That's why I was more concerned about Irv's block, even if technically the disagreement is way smaller.

@WillAyd
Member

WillAyd commented Jul 3, 2025

Thanks all for some of the philosophical discussions around PR review. I know it's frustrating at times, but I think there are some positives in bringing those to light. Improving PR review is something we can continue to work on as a team, but at this point that is probably best to discuss in one of our core or governance team meetings to progress.

For the time being, I've given this PR some more thought and I'd be happy to throw my vote in as a +1. Yes, there are a few things I may not like or be unsure about, but stepping back I have a hard time envisioning this being very problematic. On the flip side, it helps at least one use case for now, and is done in a way that is generic to expand to others. If things really fall apart, it doesn't seem all that difficult to have to deprecate and move on :-)

@Dr-Irv
Contributor

Dr-Irv commented Jul 3, 2025

So, the second thing we're talking about, is about blocking other people's PRs. I think blocking PRs is a useful tool we have, and I surely use it when needed. For example, if a PR is going to cause us future trouble, in our CI, with too much technical debt, with a poor experience for the users... I think it's great that we all can block PRs until the concerns are resolved. But for some reason, you seem to be abusing blocking PRs recently (at least in my opinion). If you agree with my comments about this PR above, you are literally not letting this PR be merged until I implement your (in my opinion) very opinionated and not very important comments of removing a release note that as I told you will be updated later. And to add an explanation about how to also add entry points with the legacy setup.py. If you share those as suggestions, I appreciate them, as I think it's good feedback. If other people agree with you, I'll be happy to update the PR with them. But if you block a PR for such trivial topics, it feels like you're making pandas your personal project, and that no work in pandas will be possible unless it follows your personal preferences. If that's the case, and pandas is happy with this sort of behavior, I surely don't want to continue contributing to it.

And this already happened recently in the df.select("col1", "col2") PR, which I think is exactly the same case.

I'm not sure what happened here to make you think I was "blocking" the PR. I did NOTHING proactively to make the PR blocked except I made some comments that I'd like you to address.

What I think has happened is that GitHub is now blocking PR's until all conversations are resolved. (This is something that I've seen with Gitlab on a customer project). So in this case, there were 3 conversations where I made the last comment - you can respond to those comments, and if I'm fine to move on from them, I can resolve the comments, and I think then the PR will become unblocked. Now I'm not sure what to do if you reopen this PR and others want to move it forward, and I don't resolve those conversations, as I will be disconnected for the next 2 weeks.

In the case of the select() PR, the typing you wanted to do was incorrect. And since the typing gets also reflected in pandas-stubs that I maintain, I think it is appropriate for me to ask you to get the typing declarations to be correct. But again, I did nothing to proactively block the PR. It appears blocked because of how GitHub is doing things.

@jbrockmendel
Member

jbrockmendel commented Jul 3, 2025

[Will] it helps at least one use case for now

Which one is that? I seriously would love to see a single use case where bodo.to_iceberg(df) isn't fully functional, clearer as to what's being called, clearer as to where to look for docs, clearer as to who is responsible for maintaining it, ... Is it really "save a few keystrokes for users who are pathologically addicted to method chaining"?

[Me] No users have asked for this

[Marc] This is true for lots of things, in particular for work like this one, where one of the main motivations is to make our CI easier

As long as we're not-believing each other, I don't believe that this is one of the main motivations. If anything, it'll just mean that we'll be asked to add testing the bodo engine to our CI and further complicate things.

As I said before, if you want to move a reader/writer out of pandas, just open an issue about that explicitly.

[Me] this complicates the API

[Marc] I disagree

It adds a keyword to one method and suggests you're going to follow up by adding a keyword to a bunch more methods. And options for that keyword to methods that already have it. Yes, that is API complication.

Polars and DuckDB have the fastest CSV readers. Should we add them as engines to pandas?

I think that merits its own issue, but I don't anticipate having a problem with it as long as we are clear-eyed about the costs of maintaining that support.

The main difference here is that now we do maintain wrappers that map our signature to the engine signature. Do users know that we maintain this tiny wrapper, and do they know whether a problem with the connector is caused by the wrapper we maintain or by the external library? Probably not. So, once more, this PR probably helps solve your concern more than it causes it.

They complain to us regardless. When we're the ones who maintain the wrapper, we are actually the correct people to complain to. Even when it turns out to be an upstream issue, by maintaining the wrapper we actively decided to take on that responsibility. This would not be the case for 3rd party engines.

For example, you didn't have any concern when Thomas added engine to apply

IIRC we maintain the wrapper around numba, just like we do for the groupby.apply numba paths (which I paid more attention to). If I'm wrong on that and it'll make you feel better, I'll happily post some comments there about why it should be deprecated in favor of explicitly-3rd-party solution. And FWIW I am consistent about wanting to get rid of the engine keyword xref #53239.

Or you don't have any opposition that I've seen to using entry points for accessors

Honestly not wild about that, but AFAICT 3rd party accessors are mostly used for in-house or personal use cases, as opposed to libraries (maybe geopandas?), so c'est la vie.

@datapythonista
Member Author

Thanks for the comment @Dr-Irv. It's my understanding that the convention is that when you review and set the flag "Requested changes", that's a block. At least to me, such a PR isn't mergeable, and GitHub will also disable the merge button (it becomes red and you need an extra check box). For things that are not a blocker, I add the comments and review as a "Comment". That would in general also require the reviewer to have a second look before merging, but at least to me not in the same way: if comments are addressed or answered, in general it's fine. If you don't follow this same procedure, that can be where the misunderstanding comes from, and maybe it'd be worth making sure we're all on the same page. Maybe it's me who has been doing things differently.

Sorry @jbrockmendel, I guess I misunderstood, and it's just that we have exact opposite views on how pandas should play with other packages. To me pandas is too big and would benefit from "plugins" in many places, and you are more into monorepos, or not sure how to describe wanting to have everything in pandas. If that's the issue, we'll surely have disagreements in most of what I do. From the work on the plotting plugins, to the UDF executors, to third-party IO, most of what I do in pandas that is not maintenance is this sort of divide and conquer work. I guess there is no easy solution for it.

I'm not too bothered by a bodo writer being written as bodo.to_iceberg. As you say, it's functionally the same. I'm bothered that we don't then use pyarrow.read_parquet, xarray.to_pandas... As you suggested, I could open issues for those, and I guess I'd get your support there, but I'm sure there won't be consensus. I'll leave that battle to someone else; the approach I want to see in pandas is the one in this PR or PDEP-9.

And I do agree that adding engine is a change in the API. If done consistently everywhere, it seems like a small change to me, and users are already familiar with it if they have used read_csv or read_parquet. But it does introduce a change to the API, you are right about that.

@datapythonista datapythonista reopened this Jul 3, 2025
@WillAyd WillAyd dismissed Dr-Irv’s stale review July 3, 2025 16:10

Dismissing as this feedback is not meant to be a blocker, and the requestor is going on vacation. Confirmed by @Dr-Irv on Slack

@Dr-Irv
Contributor

Dr-Irv commented Jul 3, 2025

Thanks for the comment @Dr-Irv. It's my understanding that the convention is that when you review and set the flag "Requested changes", that's a block. At least to me, such a PR isn't mergeable, and GitHub will also disable the merge button (it becomes red and you need an extra check box). For things that are not a blocker, I add the comments and review as a "Comment". That would in general also require the reviewer to have a second look before merging, but at least to me not in the same way: if comments are addressed or answered, in general it's fine. If you don't follow this same procedure, that can be where the misunderstanding comes from, and maybe it'd be worth making sure we're all on the same page. Maybe it's me who has been doing things differently.

I think this might be a project setting that recently changed, because with pandas-stubs, which I manage, saying "requested changes" doesn't block the merge. So I think this is the misunderstanding because of how the project settings are. I see that @WillAyd unblocked my approval, which is fine. Although I do think there should be more comments from other maintainers, because I do see the comments from @jbrockmendel as being relevant. I'm indifferent.

@WillAyd WillAyd merged commit 9dcce63 into pandas-dev:main Jul 3, 2025
87 of 88 checks passed
@WillAyd
Member

WillAyd commented Jul 3, 2025

Thanks @datapythonista

@jbrockmendel
Member

No, Will, we are waiting for more votes, per Irv's comment.

jbrockmendel added a commit that referenced this pull request Jul 3, 2025
jbrockmendel added a commit that referenced this pull request Jul 3, 2025
Revert "ENH: Allow third-party packages to register IO engines (#61642)"

This reverts commit 9dcce63.
@WillAyd
Member

WillAyd commented Jul 3, 2025

The core team was pinged over a week ago and there are 2 approvals with 0 blockers (at least I think). What is it exactly that we are looking for?

@jbrockmendel
Member

What is it exactly that we are looking for?

If @jorisvandenbossche gives a +1 I'll drop my objection and merge it myself. And buy everyone a beer for their trouble.

@phofl
Member

phofl commented Jul 3, 2025

A few comments first before I comment on the proposal itself.

The core team was pinged over a week ago and there are 2 approvals with 0 blockers (at least I think). What is it exactly that we are looking for?

This PR was opened and closed multiple times and didn't give off the impression that it would go anywhere with how the discussion was going. So I held off on commenting and really digging through things after these incidents; I suspect that others might have felt similar.

I think this might be a project setting that recently changed,

This has been the case for at least 1.5 years. I remember rejecting one of your reviews around that time that was docs-related as well and went stale. Not an issue on my side, just wanted to clarify that this is not new.

About the actual topic:

While I like the general concept that more libraries offer features to load into pandas DataFrames (this is very helpful in multiple places to me personally), I don't fully understand or share why this has to be through an engine keyword instead of a separate package like the BigQuery reader.

There is a fundamental difference between using an engine to just dispatch somewhere and what we do for the things we maintain ourselves. For example, Arrow's CSV parsing is modified by us in many places to make it fit the pandas read_csv interface, and a bunch of keywords are still not supported, last I checked. What I am trying to say is that just calling the Arrow reader is not equivalent. (fastparquet might be a bit different, but I would prefer ripping that out anyway; that's a different discussion.)

Swapping out the readers via an engine keyword implies to me as a user that the APIs are the same, i.e. same keywords supported, same input returns same output. I would argue that this is generally NOT the case (Bodo is a good example here). Making it explicit to users that they call something else will make this distinction very clear, the opposite is just a mess. This is all very confusing to casual users, while probably not offering any advantages for more advanced folks since they would know these tools anyway.

What happens if we decide to deprecate a keyword or behavior? Do we need consensus with the engine maintainers, which is a mess all around? Otherwise we silently break stuff if no consensus, not great either. Again, having separate APIs doesn't even get us into that mess.

Additionally, while moving out some of our IO readers sounds like a good idea in principle, most of them would be unmaintained and break after a few releases if we do that. I've worked with tools here and there that were based around a plugin system and most of the ones not living in the main library didn't work after a few releases, which is super annoying. While it's not great that a bunch of them live with us, removing them would basically mean killing them. Straight up deprecating them is probably a better option.

Re Polars and DuckDB CSV Readers as an example: There are again two approaches we could take

  • just use their readers and call to_pandas, shouldn't live with us since users can do it themselves and the behavior will be significantly different
  • Add a lot of patching on top of it to make it run like the pandas csv reader -> we would want to maintain this ourselves anyway and this would live inside of pandas.

I don't see much value here to most users and a lot of potential confusion around this.

@datapythonista
Member Author

Thanks @phofl for the feedback. Good to have so much detail from a fresh point of view. Maybe I'm just too optimistic with this, since your view is mostly aligned with Brock's from what I understand. But let me share my point of view on your concerns.

I don't fully understand or share why this has to be through an engine keyword instead of a different package like the big query reader.

It allows having the connectors in pandas (as we do now), with a (mostly) unified API if more than one engine exists for the same format (CSV, Excel, Parquet), but we don't necessarily need to maintain the code or test the connectors in our CI (which in many instances has been annoying). Personally, I'd also be a big fan of the GBQ connector approach, if all other connectors worked that way. But providing read_sas, read_clipboard and read_xarray tells users we're batteries included, and then we make users look in other packages for connectors that are IMHO potentially way more popular.

For example, arrows csv parsing is modified by us in many places to make it fit the pandas read_csv interface and a bunch of keywords

What I'd personally like (and it's not clear if other packages will onboard) is that what is modified to fit the pandas read_csv interface is implemented in the other packages. So, our pyarrow CSV code would ideally live in pyarrow. If pandas is important enough for many of these projects to have a to_pandas, I think the pandas read_csv, read_parquet or read_excel signature could be important enough for packages implementing the actual connectors to also implement this "map" between their signature and our signature.

What happens if we decide to deprecate a keyword or behavior?

The idea is that connectors have a standard signature, and then support **kwargs for engine-specific keywords. Deprecating a keyword wouldn't be an issue for engines that want to support it. Changes in behavior could be trickier, but I think our connectors are in general stable, and I feel API changes would be rare and the number of connectors small, so agreeing on and coordinating changes wouldn't be an issue if they are needed.

most of them would be unmaintained and break after a few releases

That's possible. For PyArrow, fastparquet, xarray, and possibly others, if they are happy to host the connectors there, I guess those wouldn't be an issue. For SAS, IIRC it was mostly maintained by people not in our core team; maybe it'd be better maintained by them in a separate project. For Excel, if any of the engines stops being maintained, I guess it's because there isn't much interest in that one, and becoming unmaintained may be a positive thing. Of course we're just guessing, but personally, if a connector needs to become unmaintained, better that than having to maintain it within pandas, which for very long periods of time has only been maintained because of the heroic efforts of Jeff and now Matt.
