[data] Deflake infer_schema regression test #63537
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…schema is not called Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Code Review
This pull request replaces the memory-growth regression test for Parquet reads with a more deterministic unit test that verifies _infer_schema is not called, which prevents O(N) metadata reads. Feedback was provided to update the mock's return value to include the complete schema of the generated Parquet files (both data and null_col columns) to improve the test's robustness.
    "ray.data._internal.datasource.parquet_datasource._infer_schema",
    mock,
)
mock.return_value = pa.schema({"data": pa.int64()})
To improve the test's robustness, the mock's return value should reflect the complete schema of the data being tested. The generated Parquet files will have a unified schema including both data and null_col columns. Providing an incomplete schema could lead to confusing failures if the code under test changes and the mock is unexpectedly called.
- mock.return_value = pa.schema({"data": pa.int64()})
+ mock.return_value = pa.schema([pa.field("data", pa.int64()), pa.field("null_col", pa.int64())])
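The pattern under discussion (patch the expensive function, give the mock a complete return value, then assert it was never called) can be sketched without Ray or pyarrow. Everything below is a hypothetical stand-in: `fake_datasource`, `read_files`, and the dict used in place of a real `pa.schema` are assumptions for illustration, not Ray's actual code.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock, patch


def _infer_schema(paths):
    # Hypothetical expensive step: stands in for an O(N) metadata read.
    raise RuntimeError("expensive O(N) metadata read")


def read_files(paths, use_fast_path=True):
    # Hypothetical reader: the fast path must not touch _infer_schema.
    if use_fast_path:
        return {"num_files": len(paths)}
    return {
        "num_files": len(paths),
        "schema": fake_datasource._infer_schema(paths),
    }


# A namespace playing the role of the datasource module being patched.
fake_datasource = SimpleNamespace(
    _infer_schema=_infer_schema, read_files=read_files
)


def run_check():
    mock = MagicMock()
    # Return the full schema (both columns), so that if the mock is ever
    # called unexpectedly, the failure comes from assert_not_called()
    # rather than from a confusing missing-column error downstream.
    mock.return_value = {"data": "int64", "null_col": "int64"}
    with patch.object(fake_datasource, "_infer_schema", mock):
        result = fake_datasource.read_files(["a.parquet", "b.parquet"])
    mock.assert_not_called()
    return result
```

Calling `run_check()` passes only while the fast path avoids `_infer_schema`; switching the hypothetical `read_files` to `use_fast_path=False` would make `assert_not_called()` raise.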
mock = MagicMock()
monkeypatch.setattr(
    "ray.data._internal.datasource.parquet_datasource._infer_schema",
    mock,
Once we nuke out v1, not sure I see the value of this test. Also in V2, this function is not invoked.
Description
The regression test to check that read_parquet doesn't use too much memory was not very reliable. It used the RSS difference before/after the read_parquet call, which doesn't work consistently on CI runners. Plus, it's not guaranteed to report the peak memory usage, which is what needs to be checked (since the infer_schema memory gets freed by the time we measure rss_after).
Switch to just checking that infer_schema is not called in the new codepath.
Related issues
#63376
Additional information
I tried using wall-clock time and tracemalloc peak memory, but neither worked well here.
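The description's point that a before/after reading can miss a transient peak can be illustrated with a small stdlib-only sketch. This is not the removed test: `measure` and `transient_allocation` are hypothetical names, and because tracemalloc only tracks allocations made through Python's own allocators, the toy workload is a `bytearray` rather than a Parquet read.

```python
import tracemalloc


def measure(work):
    """Return (after - before, peak - before) in bytes for a callable."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    tracemalloc.reset_peak()  # Python 3.9+: measure the peak from here on
    work()
    after, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before, peak - before


def transient_allocation():
    # Allocate 10 MiB, then free it before returning, mimicking memory
    # that is released by the time an "after" reading is taken.
    buf = bytearray(10 * 1024 * 1024)
    del buf


delta, peak_delta = measure(transient_allocation)
```

Here `delta` stays near zero while `peak_delta` reflects the 10 MiB buffer, which is the gap a before/after comparison cannot see.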