[data] add `read_parquet` metadata fetch memory regression test by justinvyu · Pull Request #63376 · ray-project/ray

justinvyu · 2026-05-15T19:45:41Z

Description

Add a regression test for Parquet schema inference memory scaling.
Verifies that `read_parquet` memory usage doesn't grow linearly with file count when the first file has a `pa.null()` column (which triggers the `_infer_schema` fallback path on PyArrow < 22.0).

Test plan

python -m pytest python/ray/data/tests/datasource/test_parquet.py::test_read_parquet_memory_growth -xvs

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces a regression test, test_read_parquet_memory_growth, to ensure that memory usage during Parquet schema inference does not grow linearly with the number of files. The test compares memory deltas between reading small and large sets of files. Feedback was provided regarding the potential for a ZeroDivisionError in the ratio calculation if the memory delta for the small file set is zero, as well as the risk of misleading results if the delta is negative due to system memory management.
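The guard described in this review can be sketched as a small helper. This is a minimal sketch under stated assumptions, not the test's actual code: the helper name `growth_ratio`, the `floor_bytes` parameter, and its 1 MiB default are hypothetical and chosen only for illustration.

```python
MIB = 1024 * 1024


def growth_ratio(delta_small: int, delta_large: int, floor_bytes: int = MIB) -> float:
    """Return delta_large / delta_small with both deltas clamped to a floor.

    Hypothetical helper: the floor value is an assumption, not taken from the PR.
    Clamping keeps the denominator positive, so a zero or negative RSS delta
    (for example, memory returned to the OS between measurements) cannot raise
    ZeroDivisionError or flip the sign of the ratio.
    """
    if floor_bytes <= 0:
        raise ValueError("floor_bytes must be positive")
    small = max(delta_small, floor_bytes)
    large = max(delta_large, floor_bytes)
    return large / small


print(growth_ratio(0, 5 * MIB))           # zero delta: no ZeroDivisionError
print(growth_ratio(-3 * MIB, 2 * MIB))    # negative delta is clamped to the floor
print(growth_ratio(10 * MIB, 120 * MIB))  # ordinary case, unaffected by the floor
```

One trade-off of this approach: any delta below the floor is treated as equal to the floor, so the floor needs to stay well under the size of the regression the test is meant to catch.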

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit adf870f.

justinvyu · 2026-05-19T20:43:40Z

Sanity check ✅

Fails on premerge before merging in datasource v2 from master:

[2026-05-18T22:42:30Z] >       assert ratio < 2, (
[2026-05-18T22:42:30Z]             f"Memory grew too much with more files: ratio={ratio:.1f}\n"
[2026-05-18T22:42:30Z]             f"delta_small={delta_small / 1024 / 1024:.1f} MiB, "
[2026-05-18T22:42:30Z]             f"delta_large={delta_large / 1024 / 1024:.1f} MiB"
[2026-05-18T22:42:30Z]         )
[2026-05-18T22:42:30Z] E       AssertionError: Memory grew too much with more files: ratio=12.6
[2026-05-18T22:42:30Z] E         delta_small=9.1 MiB, delta_large=114.8 MiB
[2026-05-18T22:42:30Z] E       assert 12.581335616438356 < 2
[2026-05-18T22:42:30Z]
[2026-05-18T22:42:30Z] python/ray/data/tests/datasource/test_parquet.py:1741: AssertionError

Passes on premerge after merging in datasource v2.
I ran the test locally with an added print statement to produce this result:

  - delta_small: 2.58 MiB (100 files)
  - delta_large: 0.72 MiB (1000 files)
  - ratio: 0.28
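As a quick check of the figures above, the ratio is `delta_large / delta_small`, compared against the threshold of 2 from the assertion. The sketch below recomputes it from the rounded MiB values reported in this comment. The variable names and dictionary layout are illustrative only, and because the inputs are rounded, the failing ratio comes out slightly different from the exact 12.58 shown in the log.

```python
THRESHOLD = 2  # from `assert ratio < 2` in the test

# (delta_small MiB, delta_large MiB), as reported in the comment above.
runs = {
    "before datasource v2": (9.1, 114.8),
    "after datasource v2": (2.58, 0.72),
}

results = {}
for label, (delta_small, delta_large) in runs.items():
    ratio = delta_large / delta_small
    results[label] = (ratio, ratio < THRESHOLD)
    print(f"{label}: ratio={ratio:.2f}, passes={ratio < THRESHOLD}")
```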

add regression test

71e3cfa

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu requested a review from goutamvenkat-anyscale May 15, 2026 19:45

justinvyu requested a review from a team as a code owner May 15, 2026 19:45

gemini-code-assist Bot reviewed May 15, 2026

Comment thread python/ray/data/tests/datasource/test_parquet.py Outdated

cursor Bot reviewed May 15, 2026

Comment thread python/ray/data/tests/datasource/test_parquet.py

ray-gardener Bot added the `data` label (Ray Data-related issues) May 16, 2026

clamp RSS deltas to avoid ZeroDivisionError in memory growth test

adf870f

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>

cursor Bot reviewed May 18, 2026

Comment thread python/ray/data/tests/datasource/test_parquet.py

justinvyu added the `go` label (add ONLY when ready to merge, run all tests) May 18, 2026

Merge branch 'master' of https://github.com/ray-project/ray into infe…

ba649ff

…r_schema_regression_test

goutamvenkat-anyscale approved these changes May 19, 2026

justinvyu enabled auto-merge (squash) May 19, 2026 20:53

justinvyu merged commit 07efa6e into ray-project:master May 19, 2026
7 checks passed

justinvyu deleted the infer_schema_regression_test branch May 19, 2026 22:32

justinvyu mentioned this pull request May 20, 2026

[data] Deflake infer_schema regression test #63537

Open

justinvyu merged 3 commits into ray-project:master from justinvyu:infer_schema_regression_test
