[data] Deflake infer_schema regression test by justinvyu · Pull Request #63537 · ray-project/ray

justinvyu · 2026-05-20T07:45:00Z

Description

The regression test to check that read_parquet doesn't use too much memory was not very reliable. It used rss difference before/after the read_parquet call which doesn't work consistently on CI runners. Plus, it's not guaranteed to report the peak memory usage which is what needs to be checked (since the infer_schema memory gets freed by the time we measure the rss_after.

Switch to just checking that infer_schema is not called in the new codepath.

Related issues

#63376

Additional information

I tried using wall-clock time, tracemalloc peak memory but nothing works well here.

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

…schema is not called Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Justin Yu <justinvyu@anyscale.com>

gemini-code-assist

Code Review

This pull request replaces the memory-growth regression test for Parquet reads with a more deterministic unit test that verifies _infer_schema is not called, which prevents O(N) metadata reads. Feedback was provided to update the mock's return value to include the complete schema of the generated Parquet files (both data and null_col columns) to improve the test's robustness.

gemini-code-assist · 2026-05-20T07:48:16Z

+        "ray.data._internal.datasource.parquet_datasource._infer_schema",
+        mock,
    )
+    mock.return_value = pa.schema({"data": pa.int64()})


To improve the test's robustness, the mock's return value should reflect the complete schema of the data being tested. The generated Parquet files will have a unified schema including both data and null_col columns. Providing an incomplete schema could lead to confusing failures if the code under test changes and the mock is unexpectedly called.

Suggested change

mock.return_value = pa.schema({"data": pa.int64()})

mock.return_value = pa.schema([pa.field("data", pa.int64()), pa.field("null_col", pa.int64())])

goutamvenkat-anyscale · 2026-05-20T07:59:07Z

+    mock = MagicMock()
+    monkeypatch.setattr(
+        "ray.data._internal.datasource.parquet_datasource._infer_schema",
+        mock,


Once we nuke out v1, not sure I see the value of this test. Also in V2, this function is not invoked.

justinvyu and others added 2 commits May 19, 2026 23:27

use wall clock time instead

afa7ea5

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

[data] Simplify read_parquet memory regression test to assert _infer_…

8f3fa81

…schema is not called Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu requested a review from a team as a code owner May 20, 2026 07:45

justinvyu changed the title ~~Deflake memory test~~ [data] Deflake infer_schema regression test May 20, 2026

gemini-code-assist Bot reviewed May 20, 2026

View reviewed changes

goutamvenkat-anyscale reviewed May 20, 2026

View reviewed changes

ray-gardener Bot added the data Ray Data-related issues label May 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[data] Deflake infer_schema regression test#63537

[data] Deflake infer_schema regression test#63537
justinvyu wants to merge 2 commits into
ray-project:masterfrom
justinvyu:deflake-memory-test

justinvyu commented May 20, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 20, 2026

Uh oh!

goutamvenkat-anyscale May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	mock.return_value = pa.schema({"data": pa.int64()})
	mock.return_value = pa.schema([pa.field("data", pa.int64()), pa.field("null_col", pa.int64())])

Conversation

justinvyu commented May 20, 2026

Description

Related issues

Additional information

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

goutamvenkat-anyscale May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants