[data] DataSourceV2: V2 ARROW-5030 nested-type fallback#63175
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the FileReader to utilize per-fragment scanners instead of a single aggregate scanner, which prevents issues with cross-fragment casts for variable-shape tensor extensions. It also ports the nested-type fallback for Parquet files (addressing ARROW-5030) to the DataSourceV2 path and includes version-specific handling for pyarrow readahead parameters. The review feedback identifies a critical issue where the current retry logic could lead to duplicate data if a transient error occurs during fragment iteration. Additionally, suggestions were made to optimize logging keys to prevent log flooding and to document performance trade-offs when using the fallback reader with filters.
db3fce8 to
9638cec
Compare
|
The refactor is structurally sensible (per-fragment scanners with a swappable |
9638cec to
97c1268
Compare
5779966 to
ce7a3e1
Compare
d456a1b to
02cdef8
Compare
02cdef8 to
19b8856
Compare
19b8856 to
e5411fa
Compare
a9b419f to
3a7a85a
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 3a7a85a. Configure here.
63b30ac to
52ec75e
Compare
## Why
Parquet files with nested columns (e.g. `list<struct<..., string>>`) whose row groups exceed Arrow's ~2 GB chunking threshold hit `ArrowNotImplementedError` at decode time (ARROW-5030). V1 already has a metadata-only fallback that detects this and switches to `pq.ParquetFile.iter_batches`. This PR ports it to V2 and makes the decision filter-aware.
## What
**Port V1's nested-type fallback to V2.** `FileReader` grows an `_iter_fragment_tables` hook; `ParquetFileReader` overrides it with V1's `_needs_nested_type_fallback` metadata check, falling back to `pq.ParquetFile.iter_batches` (with safe batch sizing, row-group pushdown via `fragment.subset`, and per-batch row-level filtering) when the check fires.
**Make the fallback decision filter-aware.** Previously the check looked only at projected columns. A filter that touches a large nested column *outside* the projection would still force the scanner to decode it for row-level evaluation — and hit ARROW-5030. The check now sees the union of projected + filter-referenced columns:
```python
ds.read_parquet(path).select_columns(["id"]).filter(col("nested_col").is_not_null())
# ^^^^ projection excludes nested_col
# ^^^^ but filter references it
# → fallback must trigger
```
**Carry the predicate as a Ray `Expr` instead of a pyarrow expression.** `pyarrow.compute.Expression` is opaque (no public visitor), so we can't extract filter columns from it after the fact. Keeping the Ray `Expr` as the source of truth — and converting to pyarrow once, at the scanner-kwargs boundary — lets the reader call `get_column_references` for the union above. Touches `ArrowFileScanner.predicate`, `FileReader.predicate`, and `push_filters` (now ANDs Ray `Expr`s).
**Drop the legacy `filter=` kwarg on V2.** `read_parquet(filter=pc.field("x") > 5)` is already deprecated. Since it carries a raw pyarrow expression that can't be introspected, it's silently stripped on the V2 path. Callers should use `read_parquet(path).filter(expr=...)`.
## Tests
- `test_read_parquet_nested_type_arrow_not_implemented_fallback` — V2 skip removed (regression for [ray-project#61675](ray-project#61675)).
- `test_read_parquet_nested_fallback_triggered_when_filter_references_nested_column` — new, V2-only. Projects a flat column and filters on the large nested column; asserts the fallback is invoked.
Signed-off-by: Goutam <goutam@anyscale.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
52ec75e to
f46ae78
Compare
| get_column_references, | ||
| ) | ||
|
|
||
| columns = scanner_kwargs.get("columns") |
There was a problem hiding this comment.
Is this also pc.Expression?
| yield from super()._iter_fragment_tables(fragment, scanner_kwargs) | ||
| return | ||
|
|
||
| if log_once(f"parquet_nested_fallback_v2:{fragment.path}"): |
There was a problem hiding this comment.
Yea left this on purpose so that the user knows the file for which we used the fallback path
|
|
||
| # Scope the safe batch-size calculation to the columns actually being | ||
| # decoded so we don't shrink batches based on columns we won't read. | ||
| leaf_indices = ( |
There was a problem hiding this comment.
I think the name for leaf_indices is unclear to me?
There was a problem hiding this comment.
For example, for struct cols the leaf indices refers to the key in the struct.
If I had address: struct<street: string, zip: int32>
then the leaves would be address.street, address.zip etc.
| # in that case and let the per-batch filter (post null-fill) do | ||
| # all the row-dropping. | ||
| fragment_physical_columns = set(fragment.physical_schema.names) | ||
| filter_touches_missing_column = filter_columns is not None and any( |
There was a problem hiding this comment.
what about inverting this like "filter_columns_all_present_in_schema" or smth like that and then below u do `if filter_Expr is not None and "filter_columns_all_present_in_schema"?
| # hit ARROW-5030 in the normal scanner path. | ||
| filter_columns = ( | ||
| get_column_references(self._predicate) | ||
| if self._predicate is not None |
There was a problem hiding this comment.
I see these if is None checks. I think it's fine when u have a couple, but i think it might be easier to read if separated out like this when u have many
filter_columns=None
if not None
filter_columns = ..There was a problem hiding this comment.
Will do as a follow up clean up change.
…63175) ## Why Parquet files with nested columns (e.g. `list<struct<..., string>>`) whose row groups exceed Arrow's ~2 GB chunking threshold hit `ArrowNotImplementedError` at decode time (ARROW-5030). V1 already has a metadata-only fallback that detects this and switches to `pq.ParquetFile.iter_batches`. This PR ports it to V2 and makes the decision filter-aware. ## What **Port V1's nested-type fallback to V2.** `FileReader` grows an `_iter_fragment_tables` hook; `ParquetFileReader` overrides it with V1's `_needs_nested_type_fallback` metadata check, falling back to `pq.ParquetFile.iter_batches` (with safe batch sizing, row-group pushdown via `fragment.subset`, and per-batch row-level filtering) when the check fires. **Make the fallback decision filter-aware.** Previously the check looked only at projected columns. A filter that touches a large nested column *outside* the projection would still force the scanner to decode it for row-level evaluation — and hit ARROW-5030. The check now sees the union of projected + filter-referenced columns: ```python ds.read_parquet(path).select_columns(["id"]).filter(col("nested_col").is_not_null()) # ^^^^ projection excludes nested_col # ^^^^ but filter references it # → fallback must trigger ``` **Carry the predicate as a Ray `Expr` instead of a pyarrow expression.** `pyarrow.compute.Expression` is opaque (no public visitor), so we can't extract filter columns from it after the fact. Keeping the Ray `Expr` as the source of truth — and converting to pyarrow once, at the scanner-kwargs boundary — lets the reader call `get_column_references` for the union above. Touches `ArrowFileScanner.predicate`, `FileReader.predicate`, and `push_filters` (now ANDs Ray `Expr`s). **Drop the legacy `filter=` kwarg on V2.** `read_parquet(filter=pc.field("x") > 5)` is already deprecated. Since it carries a raw pyarrow expression that can't be introspected, it's silently stripped on the V2 path. Callers should use `read_parquet(path).filter(expr=...)`. ## Tests - `test_read_parquet_nested_type_arrow_not_implemented_fallback` — V2 skip removed (regression for [ray-project#61675](ray-project#61675)). - `test_read_parquet_nested_fallback_triggered_when_filter_references_nested_column` — new, V2-only. Projects a flat column and filters on the large nested column; asserts the fallback is invoked. Signed-off-by: Goutam <goutam@anyscale.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --- [//]: # (BEGIN SAPLING FOOTER) Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/ray-project/ray/pull/63175). * ray-project#63326 * __->__ ray-project#63175 Signed-off-by: Goutam <goutam@anyscale.com> Co-authored-by: Goutam V. <> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: anindyam1969 <amukherjee@kinetica.com>

Why
Parquet files with nested columns (e.g.
list<struct<..., string>>) whose row groups exceed Arrow's ~2 GB chunking threshold hitArrowNotImplementedErrorat decode time (ARROW-5030). V1 already has a metadata-only fallback that detects this and switches topq.ParquetFile.iter_batches. This PR ports it to V2 and makes the decision filter-aware.What
Port V1's nested-type fallback to V2.
FileReadergrows an_iter_fragment_tableshook;ParquetFileReaderoverrides it with V1's_needs_nested_type_fallbackmetadata check, falling back topq.ParquetFile.iter_batches(with safe batch sizing, row-group pushdown viafragment.subset, and per-batch row-level filtering) when the check fires.Make the fallback decision filter-aware. Previously the check looked only at projected columns. A filter that touches a large nested column outside the projection would still force the scanner to decode it for row-level evaluation — and hit ARROW-5030. The check now sees the union of projected + filter-referenced columns:
Carry the predicate as a Ray
Exprinstead of a pyarrow expression.pyarrow.compute.Expressionis opaque (no public visitor), so we can't extract filter columns from it after the fact. Keeping the RayExpras the source of truth — and converting to pyarrow once, at the scanner-kwargs boundary — lets the reader callget_column_referencesfor the union above. TouchesArrowFileScanner.predicate,FileReader.predicate, andpush_filters(now ANDs RayExprs).Drop the legacy
filter=kwarg on V2.read_parquet(filter=pc.field("x") > 5)is already deprecated. Since it carries a raw pyarrow expression that can't be introspected, it's silently stripped on the V2 path. Callers should useread_parquet(path).filter(expr=...).Tests
test_read_parquet_nested_type_arrow_not_implemented_fallback— V2 skip removed (regression for #61675).test_read_parquet_nested_fallback_triggered_when_filter_references_nested_column— new, V2-only. Projects a flat column and filters on the large nested column; asserts the fallback is invoked.Signed-off-by: Goutam goutam@anyscale.com
Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com
Stack created with Sapling. Best reviewed with ReviewStack.