
Conversation

bradhe (Contributor) commented on May 5, 2025:

This PR introduces a collection of new methods on the Table interface to support common functionality (a usage sketch follows the list).

  • upsert allows users to upsert records based on the specified join_cols (which are column names)
  • delete allows users to delete records. You can supply a set of predicates (filters) as pyarrow.compute.Expression values to match the records to delete
  • rows_affected returns information about the rows affected by your operations
  • schema returns the table's PyArrow schema
  • to_polars returns a polars.LazyFrame that can be used for creating data frames. This is meant to replace the read function, which will be deprecated.
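
A minimal usage sketch of the new surface area (illustrative only: the tower.tables entry point, the records variable, and whether rows_affected is a method are assumptions; join_cols and the pyarrow.compute filters come from the descriptions above):

import pyarrow as pa
import pyarrow.compute as pc
import tower

# Hypothetical way to obtain a Table handle; the real entry point may differ.
table = tower.tables("users")

# Some records to write; an Arrow table with an "id" join column.
records = pa.table({"id": [1, 2], "status": ["active", "inactive"]})

# Upsert records, matching existing rows on the named join columns.
table.upsert(records, join_cols=["id"])

# Delete records matching a pyarrow.compute predicate.
table.delete(pc.field("status") == "inactive")

# Inspect the affected-row information and the PyArrow schema.
print(table.rows_affected())
print(table.schema())

# Lazily read the table as a Polars LazyFrame (the read() replacement),
# collecting only when a concrete DataFrame is needed.
df = table.to_polars().collect()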

TODO

  • Add some simple tests for Table and friends

bradhe requested review from Copilot, datancoffee, and sankroh on May 5, 2025 at 15:21
Copilot AI (Contributor) left a comment:

Pull Request Overview

This PR adds several new functionalities to the Table interface, including upsert, delete, rows_affected, schema conversion, and a new to_polars method. The changes enhance schema interoperability between PyIceberg and PyArrow and expand the data manipulation operations.

  • Introduces conversion functions between PyArrow and PyIceberg types and expressions.
  • Adds methods for upsert, delete, and rows affected tracking.
  • Implements schema conversion and a replacement for the deprecated read functionality.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File summary:
  src/tower/utils/pyarrow.py - Adds conversion methods and supports expression conversion.
  src/tower/_tables.py - Implements new Table methods: to_polars, upsert, delete, and rows_affected.
Comments suppressed due to low confidence (3)

src/tower/utils/pyarrow.py:152

  • The function 'convert_expression' is not defined; it appears you intended to use 'convert_pyarrow_expression' for recursive conversion.
return And(convert_expression(expr.args[0]), convert_expression(expr.args[1]))

src/tower/_tables.py:127

  • When 'filters' is a list, use 'convert_pyarrow_expressions' instead of 'convert_pyarrow_expression' to correctly convert all expressions.
next_filters = convert_pyarrow_expression(filters)

src/tower/_tables.py:55

  • The 'to_polars' method is declared to return a LazyFrame but collects the data into a DataFrame; remove '.collect()' to align the implementation with the declared return type.
return pl.scan_iceberg(self._table).collect()
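
For context on the first two comments: the conversion helpers are building PyIceberg boolean expressions. A minimal sketch of the target representation, assuming a list of filters is AND-combined (the PR itself doesn't show the combination semantics):

import functools

from pyiceberg.expressions import And, EqualTo, GreaterThanOrEqual

# Individual PyIceberg predicates, the kind convert_pyarrow_expression should return
# for simple comparisons like pc.field("status") == "inactive".
predicates = [EqualTo("status", "inactive"), GreaterThanOrEqual("age", 30)]

# One plausible behavior for convert_pyarrow_expressions on a list: AND-combine them.
combined = functools.reduce(And, predicates)
print(combined)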

datancoffee (Contributor) left a comment:

Looks good!
Two suggestions: don't call collect() in the lazy read, and use as_arrow() to get the Arrow schema.

def schema(self) -> pa.Schema:
# We take an Iceberg Schema and we need to convert it into a PyArrow Schema
iceberg_schema = self._table.schema()
return convert_iceberg_schema(iceberg_schema)
Contributor commented:

you can get the Arrow schema by calling as_arrow()
See https://py.iceberg.apache.org/reference/pyiceberg/schema/#pyiceberg.schema.Schema.as_arrow
Unless you want to implement some special mapping, of course

You can reduce this to:

iceberg_schema = self._table.schema()
return iceberg_schema.as_arrow()

bradhe (Contributor, Author) replied:

Great, thanks. Converted this to use PyIceberg's methodology instead of our own.

raise ValueError(f"Unsupported Arrow type: {arrow_type}")


def iceberg_to_arrow_type(iceberg_type):
Contributor commented:

you can get the Arrow schema by calling as_arrow()
See https://py.iceberg.apache.org/reference/pyiceberg/schema/#pyiceberg.schema.Schema.as_arrow
Unless you want to implement some special mapping, of course

You can reduce this to:

iceberg_schema.as_arrow()

return IcebergSchema(*fields)


def convert_iceberg_schema(iceberg_schema: IcebergSchema) -> pa.Schema:
Contributor commented:

we might not need all of this

you can get the Arrow schema by calling as_arrow()
See https://py.iceberg.apache.org/reference/pyiceberg/schema/#pyiceberg.schema.Schema.as_arrow
Unless you want to implement some special mapping, of course

You can reduce this to:

iceberg_schema.as_arrow()

Converts the table to a Polars LazyFrame. This is useful when you
understand Polars and you want to do something more complicated.
"""
return pl.scan_iceberg(self._table).collect()
Contributor commented:

Shouldn't this be without the .collect() call?

return pl.scan_iceberg(self._table)

We want to return a LazyFrame, ya?

bradhe (Contributor, Author) replied:

Good catch, fixed. Thanks!

bradhe requested review from Copilot and datancoffee on May 6, 2025 at 13:21
Copilot AI (Contributor) left a comment:

Pull Request Overview

This PR adds several new methods to the Table interface and enhances utility modules to support upsert, delete, schema retrieval, and conversion operations between PyArrow and PyIceberg, along with comprehensive tests.

  • Introduces new Table methods such as upsert, delete, rows_affected, schema, and to_polars.
  • Enhances the utilities to convert PyArrow expressions into PyIceberg expressions.
  • Updates the TableReference and related re-export modules for improved catalog handling and lazy data frame support.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Summary per file:
  tests/tower/test_tables.py - Adds tests for reading, writing, upserting, deleting, and schema retrieval on tables.
  src/tower/utils/pyarrow.py - Introduces functions for converting PyArrow fields, schemas, and expressions to PyIceberg equivalents.
  src/tower/pyiceberg.py - Provides a dynamic dispatch wrapper for pyiceberg, re-exporting its attributes (a sketch of this pattern follows the table).
  src/tower/pyarrow.py - Re-exports pyarrow functionality.
  src/tower/polars.py - Re-exports polars functionality for creating LazyFrames.
  src/tower/_tables.py - Extends the Table interface with new methods and stats, and updates catalog handling in TableReference.
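
The re-export wrapper described for src/tower/pyiceberg.py can be written a few ways; a minimal sketch of one common pattern (a PEP 562 module-level __getattr__), offered as an assumption about the approach rather than the PR's actual code:

# Contents of a hypothetical pyiceberg.py wrapper module.
import pyiceberg

def __getattr__(name):
    # Forward any attribute lookup not defined here to the real pyiceberg package (PEP 562).
    return getattr(pyiceberg, name)

def __dir__():
    # Keep dir()/tab-completion in sync with the wrapped package.
    return dir(pyiceberg)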

Comment on lines +155 to +171
left_expr = None # You'd need to extract this
right_expr = None # You'd need to extract this
return And(
convert_pyarrow_expression(left_expr),
convert_pyarrow_expression(right_expr)
)
elif "or" in expr_str.lower() and isinstance(expr, pc.Expression):
# Similar simplification
left_expr = None # You'd need to extract this
right_expr = None # You'd need to extract this
return Or(
convert_pyarrow_expression(left_expr),
convert_pyarrow_expression(right_expr)
)
elif "not" in expr_str.lower() and isinstance(expr, pc.Expression):
# Similar simplification
inner_expr = None # You'd need to extract this
Copilot AI commented on May 6, 2025:

The logical operation branch in 'convert_pyarrow_expression' assigns None to sub-expressions which may lead to runtime errors if logical expressions are passed. Consider either raising a NotImplementedError for these cases or implementing proper extraction of the sub-expressions.

Suggested change (replacing the placeholder extraction shown above with explicit extraction and error handling):
try:
left_expr, right_expr = expr.arguments # Extract sub-expressions
except (AttributeError, ValueError):
raise ValueError(f"Failed to extract sub-expressions from 'and' operation: {expr}")
return And(
convert_pyarrow_expression(left_expr),
convert_pyarrow_expression(right_expr)
)
elif "or" in expr_str.lower() and isinstance(expr, pc.Expression):
# Similar simplification
try:
left_expr, right_expr = expr.arguments # Extract sub-expressions
except (AttributeError, ValueError):
raise ValueError(f"Failed to extract sub-expressions from 'or' operation: {expr}")
return Or(
convert_pyarrow_expression(left_expr),
convert_pyarrow_expression(right_expr)
)
elif "not" in expr_str.lower() and isinstance(expr, pc.Expression):
# Similar simplification
try:
(inner_expr,) = expr.arguments # Extract the single sub-expression
except (AttributeError, ValueError):
raise ValueError(f"Failed to extract sub-expression from 'not' operation: {expr}")

bradhe merged commit e7ff695 into develop on May 6, 2025
16 checks passed
bradhe deleted the feature/tow-382-add-new-operations-to-the-table-interface branch on May 6, 2025 at 14:01