[Data] Making get_parquet_dataset configurable in # of fragments to scan #61670
alexeykudinkin merged 5 commits into master
Code Review
This pull request introduces a valuable feature to configure the number of fragments scanned for schema inference in get_parquet_dataset. The implementation is well-structured, involving a significant refactoring that correctly handles PyArrow version differences and schema promotion for null types. Additionally, the PR includes important correctness fixes for predicate evaluation and filesystem compatibility checks. My feedback is minor and focuses on a small code cleanup opportunity.
    _block_udf: Optional[Callable[[Block], Block]] = None,
    filesystem: Optional["pyarrow.fs.FileSystem"] = None,
    schema: Optional[Union[type, "pyarrow.lib.Schema"]] = None,
    schema: Optional[Union["pyarrow.lib.Schema"]] = None,
Cursor Bugbot has reviewed your changes and found 1 potential issue.
    schema=schema,
    filesystem=filesystem,
    **dataset_kwargs,
    )
Explicit schema parameter breaks dataset_kwargs containing schema
Low Severity
The new pq.ParquetDataset call passes schema=schema explicitly alongside **dataset_kwargs. Previously, schema was not an explicit keyword argument, so users could pass "schema" inside dataset_kwargs and it would be forwarded correctly. Now, if dataset_kwargs contains a "schema" key, it will raise TypeError: got multiple values for keyword argument 'schema'. This is a regression even though dataset_kwargs is deprecated.
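The failure mode described above can be reproduced without PyArrow. The following is a minimal sketch, not the code from this PR: open_dataset is a hypothetical stand-in for pq.ParquetDataset, and call_naively / call_defensively are illustrative names. The defensive variant shows one possible mitigation (popping "schema" out of dataset_kwargs and letting the explicit argument win); whether the PR adopted this approach is not stated here.

```python
def open_dataset(paths, schema=None, filesystem=None, **kwargs):
    """Hypothetical stand-in for pq.ParquetDataset; it only echoes its arguments."""
    return {"paths": paths, "schema": schema, "filesystem": filesystem, **kwargs}


def call_naively(paths, schema=None, filesystem=None, dataset_kwargs=None):
    dataset_kwargs = dataset_kwargs or {}
    # If dataset_kwargs also carries "schema", Python raises
    # TypeError: got multiple values for keyword argument 'schema'.
    return open_dataset(paths, schema=schema, filesystem=filesystem, **dataset_kwargs)


def call_defensively(paths, schema=None, filesystem=None, dataset_kwargs=None):
    # Copy so the caller's dict is not mutated.
    dataset_kwargs = dict(dataset_kwargs or {})
    # Pull a legacy "schema" entry out before forwarding the rest.
    legacy_schema = dataset_kwargs.pop("schema", None)
    if schema is None:
        # Assumption: the explicit argument takes precedence when both are given.
        schema = legacy_schema
    return open_dataset(paths, schema=schema, filesystem=filesystem, **dataset_kwargs)
```

With dataset_kwargs={"schema": ...}, call_naively raises the TypeError quoted in the comment, while call_defensively forwards the schema exactly once.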
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
668c1ea to 69bc1a0

