Skip to content

[data] Deflake infer_schema regression test#63537

Open
justinvyu wants to merge 2 commits into
ray-project:masterfrom
justinvyu:deflake-memory-test
Open

[data] Deflake infer_schema regression test#63537
justinvyu wants to merge 2 commits into
ray-project:masterfrom
justinvyu:deflake-memory-test

Conversation

@justinvyu
Copy link
Copy Markdown
Contributor

Description

The regression test to check that read_parquet doesn't use too much memory was not very reliable. It used rss difference before/after the read_parquet call which doesn't work consistently on CI runners. Plus, it's not guaranteed to report the peak memory usage which is what needs to be checked (since the infer_schema memory gets freed by the time we measure the rss_after.

Switch to just checking that infer_schema is not called in the new codepath.

Related issues

#63376

Additional information

I tried using wall-clock time, tracemalloc peak memory but nothing works well here.

justinvyu and others added 2 commits May 19, 2026 23:27
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…schema is not called

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
@justinvyu justinvyu requested a review from a team as a code owner May 20, 2026 07:45
@justinvyu justinvyu changed the title Deflake memory test [data] Deflake infer_schema regression test May 20, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces the memory-growth regression test for Parquet reads with a more deterministic unit test that verifies _infer_schema is not called, which prevents O(N) metadata reads. Feedback was provided to update the mock's return value to include the complete schema of the generated Parquet files (both data and null_col columns) to improve the test's robustness.

"ray.data._internal.datasource.parquet_datasource._infer_schema",
mock,
)
mock.return_value = pa.schema({"data": pa.int64()})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To improve the test's robustness, the mock's return value should reflect the complete schema of the data being tested. The generated Parquet files will have a unified schema including both data and null_col columns. Providing an incomplete schema could lead to confusing failures if the code under test changes and the mock is unexpectedly called.

Suggested change
mock.return_value = pa.schema({"data": pa.int64()})
mock.return_value = pa.schema([pa.field("data", pa.int64()), pa.field("null_col", pa.int64())])

Comment on lines +1765 to +1768
mock = MagicMock()
monkeypatch.setattr(
"ray.data._internal.datasource.parquet_datasource._infer_schema",
mock,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we nuke out v1, not sure I see the value of this test. Also in V2, this function is not invoked.

@ray-gardener ray-gardener Bot added the data Ray Data-related issues label May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants