ENH: Implement to_iceberg #61507

Open. Wants to merge 14 commits into base: main.

Conversation

datapythonista (Member)

@datapythonista datapythonista added the IO Data IO issues that don't fit into a more specific label label May 27, 2025
"""
Write a DataFrame to an Apache Iceberg table.

.. versionadded:: 3.0.0
Member:

Could you add an experimental tag to this API as well like we did with read_iceberg?

Member Author


Absolutely, I forgot about that. Added it now. I also expanded the Iceberg user guide docs with to_iceberg, which I had also forgotten. Thanks for the review and the feedback!

*,
catalog_properties: dict[str, Any] | None = None,
location: str | None = None,
snapshot_properties: dict[str, str] | None = None,

@IsaacWarren IsaacWarren Jun 3, 2025


Any thoughts on adding an `append` parameter to match `to_parquet`? Something like:

append: bool = False

Then this could default to `table.overwrite` instead of `table.append`. I think it might be confusing if this doesn't match the other `to_*` functions.
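A minimal sketch of how such a flag might dispatch between PyIceberg's two write methods. The helper name and the `Protocol` stand-in (used so the sketch runs without a configured Iceberg catalog) are assumptions for illustration, not the PR's actual code:

```python
from typing import Any, Protocol


class IcebergTableLike(Protocol):
    """Minimal stand-in for the two pyiceberg Table methods discussed here."""

    def append(self, df: Any) -> None: ...
    def overwrite(self, df: Any) -> None: ...


def write_arrow(table: IcebergTableLike, arrow_table: Any, append: bool = False) -> None:
    # Proposed semantics: the default overwrites the table contents
    # (matching how to_parquet replaces an existing file), while
    # append=True adds the rows on top of the existing data.
    if append:
        table.append(arrow_table)
    else:
        table.overwrite(arrow_table)
```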

Member Author


How does PyIceberg support it?


With the table.overwrite method

Member Author


Of course, I didn't think about it. I'll add it, thanks for the feedback.

identifier=table_identifier,
schema=arrow_table.schema,
location=location,
# we could add `partition_spec`, `sort_order` and `properties` in the future


I definitely think these would be great to have, but I don't really have any ideas on how to do it without just using PyIceberg objects.

Member Author


Adding them later is easy if we think of a good signature. That's why I didn't worry too much about adding them.

@datapythonista (Member Author)

Added the append parameter. I think it's a great addition, thanks for the feedback @IsaacWarren.

I was thinking that, for the parameters that receive PyIceberg objects, one option is to use a generic `**kwargs` like `to_parquet` does, forwarded to the engine (only PyIceberg so far). This wouldn't directly expose PyIceberg details in our API, and those parameters could still be supported. It would be very simple if only one PyIceberg method received extra parameters, but there are a couple in `overwrite` where the same could be done, and that makes it a bit trickier. I think it's still best to leave this to a follow-up PR, so it can be analyzed and discussed in greater detail. And even if it's done after this is released in 3.0, it's no big deal, since there are no backward compatibility problems.
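A sketch of why a single flat `**kwargs` gets tricky once more than one engine method accepts extras. The key names and the routing below are purely illustrative assumptions, not a proposed design:

```python
from typing import Any

# Hypothetical: which engine kwargs would belong to table creation vs. the
# write call. These names are assumptions for illustration only.
_CREATE_KWARGS = {"partition_spec", "sort_order", "properties"}


def split_engine_kwargs(kwargs: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Route a flat **kwargs between the create step and the write step.

    If only one engine method took extra options, forwarding **kwargs as-is
    would suffice; with several, each option must be routed explicitly,
    which is the ambiguity that makes this worth a follow-up PR.
    """
    create = {k: v for k, v in kwargs.items() if k in _CREATE_KWARGS}
    write = {k: v for k, v in kwargs.items() if k not in _CREATE_KWARGS}
    return create, write
```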

Labels: IO
3 participants