[Data] Making get_parquet_dataset configurable in # of fragments to scan #61670

Merged
alexeykudinkin merged 5 commits into master from ak/pq-ds-clup on Mar 13, 2026
@alexeykudinkin alexeykudinkin commented Mar 11, 2026

Description

  • Making get_parquet_dataset configurable in # of fragments to scan
  • Minor clean ups

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@alexeykudinkin alexeykudinkin requested a review from a team as a code owner March 11, 2026 23:45
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable feature to configure the number of fragments scanned for schema inference in get_parquet_dataset. The implementation is well-structured, involving a significant refactoring that correctly handles PyArrow version differences and schema promotion for null types. Additionally, the PR includes important correctness fixes for predicate evaluation and filesystem compatibility checks. My feedback is minor and focuses on a small code cleanup opportunity.

  _block_udf: Optional[Callable[[Block], Block]] = None,
  filesystem: Optional["pyarrow.fs.FileSystem"] = None,
- schema: Optional[Union[type, "pyarrow.lib.Schema"]] = None,
+ schema: Optional[Union["pyarrow.lib.Schema"]] = None,
medium

The Union in Optional[Union["pyarrow.lib.Schema"]] is redundant. You can simplify this to Optional["pyarrow.lib.Schema"] for better readability.

Suggested change
- schema: Optional[Union["pyarrow.lib.Schema"]] = None,
+ schema: Optional["pyarrow.lib.Schema"] = None,
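
The simplification is safe because `typing` collapses a `Union` with a single member to that member, so both annotations describe the same type. A minimal sketch of this, using a hypothetical stand-in class `Schema` in place of `pyarrow.lib.Schema` so that it runs without PyArrow installed:

```python
from typing import Optional, Union


class Schema:
    """Hypothetical stand-in for pyarrow.lib.Schema (an assumption for illustration)."""


# A Union with exactly one member collapses to that member.
single_member_union = Union[Schema]

# Consequently the verbose and the concise annotations compare equal.
verbose = Optional[Union[Schema]]
concise = Optional[Schema]
annotations_match = verbose == concise
```

The suggested change therefore only affects readability; type checkers should treat the two spellings identically.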

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

schema=schema,
filesystem=filesystem,
**dataset_kwargs,
)
Explicit schema parameter breaks dataset_kwargs containing schema

Low Severity

The new pq.ParquetDataset call passes schema=schema explicitly alongside **dataset_kwargs. Previously, schema was not an explicit keyword argument, so users could pass "schema" inside dataset_kwargs and it would be forwarded correctly. Now, if dataset_kwargs contains a "schema" key, it will raise TypeError: got multiple values for keyword argument 'schema'. This is a regression even though dataset_kwargs is deprecated.
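
The collision described above can be reproduced without PyArrow. In the sketch below, `make_dataset`, `open_unguarded`, and `open_guarded` are hypothetical names introduced for illustration. `make_dataset` stands in for the constructor being called, and the guard is one possible mitigation, not necessarily what the PR does:

```python
from typing import Any, Dict, Optional


def make_dataset(path: str, *, schema: Optional[Any] = None, **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical stand-in for the dataset constructor (not pq.ParquetDataset)."""
    return {"path": path, "schema": schema, **kwargs}


def open_unguarded(path: str, schema: Optional[Any] = None,
                   dataset_kwargs: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    # Same pattern as in the comment: explicit schema= next to **dataset_kwargs.
    return make_dataset(path, schema=schema, **(dataset_kwargs or {}))


def open_guarded(path: str, schema: Optional[Any] = None,
                 dataset_kwargs: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    # One possible guard (an assumption): pull "schema" out of dataset_kwargs
    # and fall back to it only when no explicit schema was given.
    kwargs = dict(dataset_kwargs or {})
    legacy_schema = kwargs.pop("schema", None)
    if schema is not None and legacy_schema is not None:
        raise ValueError("Pass schema either directly or via dataset_kwargs, not both")
    effective = schema if schema is not None else legacy_schema
    return make_dataset(path, schema=effective, **kwargs)


# The unguarded call fails as soon as dataset_kwargs carries a "schema" key.
try:
    open_unguarded("data.parquet", dataset_kwargs={"schema": "legacy"})
    collision_error = None
except TypeError as exc:
    collision_error = exc

# The guarded variant forwards the legacy value instead of raising.
guarded_result = open_guarded("data.parquet", dataset_kwargs={"schema": "legacy"})
```

Raising when both sources supply a schema keeps the ambiguity visible to the caller rather than silently preferring one value.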

@ray-gardener ray-gardener bot added the data Ray Data-related issues label Mar 12, 2026
@alexeykudinkin alexeykudinkin added the go add ONLY when ready to merge, run all tests label Mar 12, 2026
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…nfigurable

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) March 12, 2026 22:09
@github-actions github-actions bot disabled auto-merge March 12, 2026 22:09
@alexeykudinkin alexeykudinkin merged commit 6d34c30 into master Mar 13, 2026
7 checks passed
@alexeykudinkin alexeykudinkin deleted the ak/pq-ds-clup branch March 13, 2026 16:59