Skip to content

Comments

[core] Make testable stream redirection#51191

Merged
edoakes merged 17 commits intoray-project:masterfrom
dentiny:hjiang/testable-stream-redirection
Mar 21, 2025
Merged

[core] Make testable stream redirection#51191
edoakes merged 17 commits intoray-project:masterfrom
dentiny:hjiang/testable-stream-redirection

Conversation

@dentiny
Copy link
Contributor

@dentiny dentiny commented Mar 9, 2025

This PR is stacked upon #51179, to make redirection stream unit testable.
Basically a no-op change, to extract redirection logic out into a separate file, and leave exit hook and global registry where they're now.

dentiny added 2 commits March 9, 2025 05:46
Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: dentiny <dentinyhao@gmail.com>
@dentiny dentiny added the go add ONLY when ready to merge, run all tests label Mar 9, 2025
@dentiny dentiny requested a review from jjyao March 9, 2025 06:58
@dentiny dentiny requested a review from a team as a code owner March 9, 2025 06:58
@dentiny dentiny force-pushed the hjiang/testable-stream-redirection branch from d52564e to b333b43 Compare March 9, 2025 07:08
Signed-off-by: dentiny <dentinyhao@gmail.com>
@dentiny dentiny force-pushed the hjiang/testable-stream-redirection branch from b333b43 to bbab2c0 Compare March 9, 2025 10:10
Signed-off-by: dentiny <dentinyhao@gmail.com>
Comment on lines 77 to 78
// TODO(hjiang): Current implementation is flaky intrinsically, sleep for a while to
// make sure pipe content has been read over to spdlog.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand the path to making it deterministic? Is it "impossible" or just requires some more work?

we should avoid sleeps in unit tests as much as possible for two reasons:

  1. it makes them intrinsically flaky (as you noted)
  2. it makes the unit tests slow. In an ideal world, for developer productivity we would be able to run the full suite of C++ unit tests in the codebase very quickly. Having individual tests with multi-second sleeps gets in the way of that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a minimum, you should be able to mitigate both of these issues by restructuring the test to repeatedly check for the expected behavior on a short interval (or in a busy loop) and only fail if the expected behavior is not observed after a longer timeout.

This is a common pattern in other Ray tests.

Copy link
Contributor Author

@dentiny dentiny Mar 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand the path to making it deterministic? Is it "impossible" or just requires some more work?

A lot more work.

In terms of effort to make it deterministic, you need to

  • Write a special indicator to a uni-directional pipe in thread thread (main thread, thread-1);
  • Listen thread (thread-2) checks whether received string contains the indicator;
  • Listen thread asks dump thread (thread-3) to do a flush;
  • The tricky thing is we use uni-directional pipe (since in production there's no meaning for the reserve direction) for main thread and background threads, there's no directly way to echo back;
  • You could use filesystem for IPC (i.e. write a file) but it takes busy polling which is also not good

In short, we need a channel to propagate "flush complete" signal from dump thread to main thread.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it makes the unit tests slow

Yes. Current testing for ray, a C++ test pipeline usually takes several hours, within which C++ unit testing itself only takes minutes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a minimum, you should be able to mitigate both of these issues by restructuring the test to repeatedly check for the expected behavior on a short interval (or in a busy loop) and only fail if the expected behavior is not observed after a longer timeout.

I doubt it.

We use gtest to capture stdout and stderr,
https://github.com/google/googletest/blob/main/googletest/src/gtest-port.cc#L1201-L1209

stream object gets deleted after one invocation, cannot be repeated called.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Current testing for ray, a C++ test pipeline usually takes several hours, within which C++ unit testing itself only takes minutes.

CI is one consideration, but one of the main motivations for making unit tests fast is for rapid iteration locally. If we can run a large swath of unit tests quickly on our laptops, it increases developer productivity. I have worked in codebases where this is done well and it's very empowering.

And yes, while the current state of the C++ tests and CI is poor, we should be striving to make it better and set a new standard with each added PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree, I illustrate the difficulty to make it unit testable above.
Do you want me to add the reasoning to the comment?

// TODO(hjiang): Current implementation is naive, which directly flushes on spdlog logger
// and could miss those in the pipe; it's acceptable because we only use it in the unit
// test for now.
void FlushOnRedirectedStream(RedirectionHandleWrapper &redirection_handle_wrapper);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only used for testing, so it isn't really part of the redirection utils interface, and it's only 1 line of code. So prefer to make it a utility inside the test file and promote it only if it is required in the real implementation.

Copy link
Contributor Author

@dentiny dentiny Mar 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have two tests using the same flush function, do you prefer to copy & paste it twice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Link:

RedirectStderrForOnce(opts);

RedirectStderrForOnce(opts);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually it looks like the comment might be wrong and this is actually used in the top-level stream_redirection_utils? https://github.com/ray-project/ray/pull/51191/files#diff-2a7e096be7505545545d270bcc45f035b23dfa071acddaf0f67fd0a8f011d5a2R66

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All flush related functions are only used in unit tests, but not production code.

Signed-off-by: dentiny <dentinyhao@gmail.com>
Copy link
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still struggling a bit to understand the motivation behind the change and the structure of the code. We are adding an additional shallow layer of abstraction for something quite simple, and having 2x duplicate files with the same name (stream_direction_utils.{cc,h}) is likely to be confusing to future readers. The test cases also seem to be nearly copy-pasted from tests in the existing stream_redirection_utils_test.cc file.

I also don't think it makes sense to add a new namespace (ray::internal) for this very tiny change, which is another concept that someone reading the code needs to keep in their head. Note that all of the cpp code in src/ray/ is "internal", there is no public API exposed directly here.

What is it that you weren't able to test previously that this enables?

Comment on lines 77 to 78
// TODO(hjiang): Current implementation is flaky intrinsically, sleep for a while to
// make sure pipe content has been read over to spdlog.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Current testing for ray, a C++ test pipeline usually takes several hours, within which C++ unit testing itself only takes minutes.

CI is one consideration, but one of the main motivations for making unit tests fast is for rapid iteration locally. If we can run a large swath of unit tests quickly on our laptops, it increases developer productivity. I have worked in codebases where this is done well and it's very empowering.

And yes, while the current state of the C++ tests and CI is poor, we should be striving to make it better and set a new standard with each added PR

Comment on lines 36 to 43
void SyncOnStreamRedirection(RedirectionHandleWrapper &redirection_handle_wrapper) {
redirection_handle_wrapper.scoped_dup2_wrapper = nullptr;
redirection_handle_wrapper.redirection_file_handle.Close();
}

void FlushOnRedirectedStream(RedirectionHandleWrapper &redirection_handle_wrapper) {
redirection_handle_wrapper.redirection_file_handle.Flush();
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, these look quite a lot like methods, what's the reason to make them functions that take the RedirectionHandleWrapper instead of defining RedirectionHandleWrapper as a class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The external functions people use are exposed as functions, not class.

// TODO(hjiang): Current implementation is naive, which directly flushes on spdlog logger
// and could miss those in the pipe; it's acceptable because we only use it in the unit
// test for now.
void FlushOnRedirectedStream(RedirectionHandleWrapper &redirection_handle_wrapper);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually it looks like the comment might be wrong and this is actually used in the top-level stream_redirection_utils? https://github.com/ray-project/ray/pull/51191/files#diff-2a7e096be7505545545d270bcc45f035b23dfa071acddaf0f67fd0a8f011d5a2R66

@dentiny
Copy link
Contributor Author

dentiny commented Mar 10, 2025

I'm still struggling a bit to understand the motivation behind the change and the structure of the code. We are adding an additional shallow layer of abstraction for something quite simple, and having 2x duplicate files with the same name (stream_direction_utils.{cc,h}) is likely to be confusing to future readers. The test cases also seem to be nearly copy-pasted from tests in the existing stream_redirection_utils_test.cc file.

For the current implementation, everytime we do stream redirection, we hook the destruction to process-wise exit hook, which is not ideal for unit tests, because if we want to test multiple scenarios, we have to write multiple testing files, but not multiple test cases.

The exit hook is made intentional to mimic RAII, so users don't forget to call the sync function themselves.
Also it's not easy to provide a pybinder for the data structure, because there's RAII object inside of RAII object...

The PR tries to improve unit testing (that's why I name the PR as "testable"); in principle it's a noop change, simply extracts the stream redirection data structure out, so we could unit test only on the internal functions.

@dentiny
Copy link
Contributor Author

dentiny commented Mar 10, 2025

I also don't think it makes sense to add a new namespace (ray::internal) for this very tiny change, which is another concept that someone reading the code needs to keep in their head. Note that all of the cpp code in src/ray/ is "internal", there is no public API exposed directly here.

internal folder and namespace is pretty common, almost every abseil folder does that:
https://github.com/abseil/abseil-cpp/tree/master/absl/base/internal
I also add bazel visibility rule to prevent external usage.

@edoakes
Copy link
Collaborator

edoakes commented Mar 10, 2025

I also don't think it makes sense to add a new namespace (ray::internal) for this very tiny change, which is another concept that someone reading the code needs to keep in their head. Note that all of the cpp code in src/ray/ is "internal", there is no public API exposed directly here.

internal folder and namespace is pretty common, almost every abseil folder does that: https://github.com/abseil/abseil-cpp/tree/master/absl/base/internal I also add bazel visibility rule to prevent external usage.

This makes sense for abseil because it exposes a public C++ API. It's similar to how our Python code is structured to have a _private directory. But for the src/ray/ codebase, there are no public APIs so an internal/ distinction is completely arbitrary.

@edoakes
Copy link
Collaborator

edoakes commented Mar 11, 2025

For the current implementation, everytime we do stream redirection, we hook the destruction to process-wise exit hook, which is not ideal for unit tests, because if we want to test multiple scenarios, we have to write multiple testing files, but not multiple test cases.

I see; this helps clarify it a bit. Let me do another pass in depth and see if I have a concrete suggestion for how to structure it that reduces some redundancy.

@dentiny
Copy link
Contributor Author

dentiny commented Mar 20, 2025

Chatted with Edward offline, the goal is to make a RAII class for stream redirection, so we could reduce layers of indirection and public functions exposed.

dentiny added 5 commits March 20, 2025 02:13
Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: dentiny <dentinyhao@gmail.com>
@dentiny
Copy link
Contributor Author

dentiny commented Mar 20, 2025

@edoakes I checked all existing classes / functions under src/ray/utils directory, I don't find other valid use cases for an internal folder, so I feel following abseil's coding standard as the new coding standard makes sense. Let me know your thoughts!

@dentiny dentiny requested a review from edoakes March 20, 2025 02:30
Copy link
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks way cleaner! Left a few more suggestions that I think would further improve the readability & extensibility of the code.

@@ -0,0 +1,40 @@
// Copyright 2025 The Ray Authors.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to start following the standard of an internal/ directory, do you think it makes sense to push it to the top level like ray/internal/util/ or prefer ray/util/internal/ ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the later, that's what abseil does:


RedirectionHandleWrapper::RedirectionHandleWrapper(MEMFD_TYPE_NON_UNIQUE stream_fd,
const StreamRedirectionOption &opt) {
RedirectionFileHandle handle = CreateRedirectionFileHandle(opt);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is RedirectionFileHandle used anywhere else? If not, it might make sense to consider combining the two classes (again in the direction of "deep" interfaces).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or at a minimum, I'd suggestion putting them in the same internal/ file

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also suggest moving the definition of StreamRedirectionOption here as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The redirection file handle is defined here:

// File handle requires active destruction via owner calling [Close].

It's vague for pipe logger class / file, because:

  • It's an independent class, which functions well itself; which means it's legal and ok to use the class alone;
  • meanwhile, as you mentioned, it's not used elsewhere, so it's reasonable to consider it internal implementation details as of now.

I usually prefer to leave these class external, because there're quite a few times when I find some internal classes inside abseil useful, and people are not willing to use it merely because it's internal.

Example-1: OStream in abseil abseil/abseil-cpp#1827
Example-2: LRU cache in boost https://www.boost.org/doc/libs/1_67_0/boost/compute/detail/lru_cache.hpp

Putting it into public utils folder means we need to maintain the API stability in some sense, but I feel it's ok for PipeLogger.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also suggest moving the definition of StreamRedirectionOption here as well.

On redirection option, I place it in a separate class / file because:

  • It's used by both internal and public functions / classes, so it should be placed (1) either as a separate file; (2) or as part of internal file;
  • It's exposed as public interface (user need to specify the option when they do stream redirection), so I prefer (1)

Signed-off-by: dentiny <dentinyhao@gmail.com>
Copy link
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for integrating all the feedback

Signed-off-by: dentiny <dentinyhao@gmail.com>
@edoakes edoakes enabled auto-merge (squash) March 20, 2025 22:18
Signed-off-by: dentiny <dentinyhao@gmail.com>
@github-actions github-actions bot disabled auto-merge March 20, 2025 23:12
@edoakes edoakes enabled auto-merge (squash) March 20, 2025 23:32
Signed-off-by: dentiny <dentinyhao@gmail.com>
@github-actions github-actions bot disabled auto-merge March 21, 2025 11:03
@edoakes edoakes merged commit c6639d2 into ray-project:master Mar 21, 2025
5 checks passed
dhakshin32 pushed a commit to dhakshin32/ray that referenced this pull request Mar 27, 2025
This PR is stacked upon ray-project#51179,
to make redirection stream unit testable.
Basically a no-op change, to extract redirection logic out into a
separate file, and leave exit hook and global registry where they're
now.

---------

Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: Dhakshin Suriakannu <d_suriakannu@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-backlog go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants