[data] add read_parquet metadata fetch memory regression test #63376
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Code Review
This pull request introduces a regression test, test_read_parquet_memory_growth, to ensure that memory usage during Parquet schema inference does not grow linearly with the number of files. The test compares memory deltas between reading small and large sets of files. Feedback was provided regarding the potential for a ZeroDivisionError in the ratio calculation if the memory delta for the small file set is zero, as well as the risk of misleading results if the delta is negative due to system memory management.
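The ratio concern raised in the review can be illustrated with a minimal sketch. It assumes the test ends up with two integer memory deltas in bytes, one for the small file set and one for the large one. The helper name `memory_growth_ratio` and the `floor_bytes` parameter are hypothetical and are not taken from the PR; the actual test may guard the calculation differently.

```python
def memory_growth_ratio(
    small_delta: int,
    large_delta: int,
    floor_bytes: int = 1024 * 1024,  # hypothetical noise floor of 1 MiB
) -> float:
    """Return large_delta / small_delta, clamping both deltas to a floor.

    Clamping avoids a ZeroDivisionError when the small delta is zero and
    avoids a misleading negative ratio when a delta is negative (for
    example, because the OS reclaimed memory between measurements).
    """
    small = max(small_delta, floor_bytes)
    large = max(large_delta, floor_bytes)
    return large / small


MIB = 1024 * 1024

# A zero delta for the small file set no longer raises.
ratio_zero = memory_growth_ratio(0, 50 * MIB)

# A negative delta is treated as noise rather than flipping the sign.
ratio_negative = memory_growth_ratio(-5 * MIB, 10 * MIB)

# Ordinary case: memory doubled between the two runs.
ratio_normal = memory_growth_ratio(10 * MIB, 20 * MIB)
```

Clamping treats any delta below the floor as measurement noise, so the ratio stays finite and positive. A test built on this would then assert that the ratio stays below some chosen threshold instead of scaling with the file count.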
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit adf870f.
Sanity check ✅
Fails on premerge before merging in datasource v2 from master.
Passes on premerge after merging in datasource v2.
…r_schema_regression_test

Description
Adds a regression test checking that read_parquet memory usage doesn't grow linearly with file count when the first file has a pa.null() column (which triggers the _infer_schema fallback path on PyArrow < 22.0).

Test plan
python -m pytest python/ray/data/tests/datasource/test_parquet.py::test_read_parquet_memory_growth -xvs