
[data] add read_parquet metadata fetch memory regression test#63376

Merged
justinvyu merged 3 commits into
ray-project:master from
justinvyu:infer_schema_regression_test
May 19, 2026

Contributor

@justinvyu justinvyu commented May 15, 2026

Description

  • Add regression test for parquet schema inference memory scaling
  • Verifies that read_parquet memory usage doesn't grow linearly with file count when the first file has a pa.null() column (which triggers the _infer_schema fallback path on PyArrow < 22.0)
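The shape of such a check can be sketched without Ray or PyArrow. This is only an illustration under stated assumptions: `tracemalloc` stands in for whatever memory measurement the real test uses, and `fake_metadata_fetch` is a hypothetical workload, not the `read_parquet` code path. The threshold of 2 mirrors the `assert ratio < 2` shown in the CI log later in this thread.

```python
import tracemalloc
from typing import Callable

RATIO_LIMIT = 2  # same threshold as the `assert ratio < 2` in the CI log


def peak_memory_delta(workload: Callable[[], None]) -> int:
    """Return the peak number of bytes Python allocated while `workload` ran."""
    tracemalloc.start()
    try:
        baseline, _ = tracemalloc.get_traced_memory()
        workload()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak - baseline


def fake_metadata_fetch(num_files: int, retain: bool) -> None:
    """Hypothetical stand-in for fetching one metadata blob per file.

    retain=True keeps every blob alive at once (memory grows with file count);
    retain=False drops each blob after use (peak stays roughly constant).
    """
    kept = []
    for _ in range(num_files):
        metadata = bytearray(10_000)  # pretend this is one file's metadata
        if retain:
            kept.append(metadata)


def growth_ratio(retain: bool) -> float:
    """Compare peak memory growth for 100 files against 1000 files."""
    delta_small = peak_memory_delta(lambda: fake_metadata_fetch(100, retain))
    delta_large = peak_memory_delta(lambda: fake_metadata_fetch(1000, retain))
    return delta_large / delta_small


leaky_ratio = growth_ratio(retain=True)     # grows with file count
bounded_ratio = growth_ratio(retain=False)  # stays near 1
print(f"leaky={leaky_ratio:.1f}, bounded={bounded_ratio:.1f}")
```

A regression test built this way fails when memory scales with the number of files and passes when the peak is independent of it, which is the property the description asks for.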

Test plan

python -m pytest python/ray/data/tests/datasource/test_parquet.py::test_read_parquet_memory_growth -xvs

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu requested a review from a team as a code owner May 15, 2026 19:45
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a regression test, test_read_parquet_memory_growth, to ensure that memory usage during Parquet schema inference does not grow linearly with the number of files. The test compares memory deltas between reading small and large sets of files. Feedback was provided regarding the potential for a ZeroDivisionError in the ratio calculation if the memory delta for the small file set is zero, as well as the risk of misleading results if the delta is negative due to system memory management.
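One way to address both concerns is to floor the denominator and clamp negative deltas before dividing. The helper below is a minimal sketch, not the change that was actually made in the PR; the name `memory_growth_ratio` and the 1 MiB default floor are assumptions chosen for illustration.

```python
MIB = 1024 * 1024


def memory_growth_ratio(
    delta_small: int, delta_large: int, floor_bytes: int = MIB
) -> float:
    """Ratio of large-run to small-run memory growth, safe for tiny deltas.

    The denominator is floored at `floor_bytes` so a zero or negative
    small-run delta cannot raise ZeroDivisionError or flip the sign.
    A negative large-run delta is treated as no growth.
    """
    denominator = max(delta_small, floor_bytes)
    numerator = max(delta_large, 0)
    return numerator / denominator


print(memory_growth_ratio(0, 5 * MIB))          # zero small delta
print(memory_growth_ratio(-3 * MIB, 2 * MIB))   # negative small delta
print(memory_growth_ratio(10 * MIB, 20 * MIB))  # ordinary case
```

The trade-off is that a floor hides real growth smaller than `floor_bytes`, so it should be set well below the amount of growth the test is meant to catch.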

Comment thread python/ray/data/tests/datasource/test_parquet.py Outdated
Comment thread python/ray/data/tests/datasource/test_parquet.py
@ray-gardener ray-gardener Bot added the data Ray Data-related issues label May 16, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit adf870f.

Comment thread python/ray/data/tests/datasource/test_parquet.py
@justinvyu justinvyu added the go add ONLY when ready to merge, run all tests label May 18, 2026
Contributor Author

justinvyu commented May 19, 2026

Sanity check ✅

Fails on premerge before merging in datasource v2 from master:

[2026-05-18T22:42:30Z] >       assert ratio < 2, (
[2026-05-18T22:42:30Z]             f"Memory grew too much with more files: ratio={ratio:.1f}\n"
[2026-05-18T22:42:30Z]             f"delta_small={delta_small / 1024 / 1024:.1f} MiB, "
[2026-05-18T22:42:30Z]             f"delta_large={delta_large / 1024 / 1024:.1f} MiB"
[2026-05-18T22:42:30Z]         )
[2026-05-18T22:42:30Z] E       AssertionError: Memory grew too much with more files: ratio=12.6
[2026-05-18T22:42:30Z] E         delta_small=9.1 MiB, delta_large=114.8 MiB
[2026-05-18T22:42:30Z] E       assert 12.581335616438356 < 2
[2026-05-18T22:42:30Z]
[2026-05-18T22:42:30Z] python/ray/data/tests/datasource/test_parquet.py:1741: AssertionError

Passes on premerge after merging in datasource v2.
I ran it locally with a print statement to produce this result:

  - delta_small: 2.58 MiB (100 files)
  - delta_large: 0.72 MiB (1000 files)
  - ratio: 0.28
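As a quick cross-check, the reported ratios can be recomputed from the rounded MiB values quoted in this comment. This is just arithmetic on the numbers above; the exact CI value (12.58...) was computed from unrounded byte counts, so only the rounded figures are expected to agree.

```python
# Deltas in MiB, copied from the failing CI log and the local passing run.
before_small, before_large = 9.1, 114.8  # before merging datasource v2
after_small, after_large = 2.58, 0.72    # after merging datasource v2

ratio_before = before_large / before_small
ratio_after = after_large / after_small

print(f"before: {ratio_before:.1f}")  # prints "before: 12.6"
print(f"after: {ratio_after:.2f}")    # prints "after: 0.28"
```

A ratio below 1 means the 1000-file run grew less than the 100-file run, which suggests the remaining deltas are mostly measurement noise rather than per-file cost.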

@justinvyu justinvyu enabled auto-merge (squash) May 19, 2026 20:53
@justinvyu justinvyu merged commit 07efa6e into ray-project:master May 19, 2026
7 checks passed
@justinvyu justinvyu deleted the infer_schema_regression_test branch May 19, 2026 22:32

Labels

data: Ray Data-related issues
go: add ONLY when ready to merge, run all tests


2 participants