[core] Make testable stream redirection #51191
Signed-off-by: dentiny <dentinyhao@gmail.com>
// TODO(hjiang): Current implementation is flaky intrinsically, sleep for a while to
// make sure pipe content has been read over to spdlog.
Can you help me understand the path to making it deterministic? Is it "impossible" or just requires some more work?
We should avoid sleeps in unit tests as much as possible for two reasons:
- it makes them intrinsically flaky (as you noted)
- it makes the unit tests slow. In an ideal world, for developer productivity we would be able to run the full suite of C++ unit tests in the codebase very quickly. Having individual tests with multi-second sleeps gets in the way of that.
At a minimum, you should be able to mitigate both of these issues by restructuring the test to repeatedly check for the expected behavior on a short interval (or in a busy loop) and only fail if the expected behavior is not observed after a longer timeout.
This is a common pattern in other Ray tests.
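A minimal sketch of the poll-until-timeout pattern described above, assuming the observed state can be checked repeatedly. `PollUntil` is a hypothetical helper named for illustration, not Ray's actual test utility:

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: checks `condition` every `interval` until it returns
// true or `timeout` elapses. Returns true if the condition was observed in
// time, false otherwise.
inline bool PollUntil(
    const std::function<bool()> &condition,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds interval = std::chrono::milliseconds(10)) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (true) {
    if (condition()) {
      return true;  // Expected behavior observed; no need to wait further.
    }
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;  // Only fail once the longer timeout has expired.
    }
    std::this_thread::sleep_for(interval);
  }
}
```

In the common case the test returns as soon as the condition holds, so the long timeout is only paid on failure.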
> Can you help me understand the path to making it deterministic? Is it "impossible" or just requires some more work?
A lot more work.
In terms of effort to make it deterministic, you need to:
- Write a special indicator to a uni-directional pipe from the writing thread (main thread, thread-1);
- Have the listen thread (thread-2) check whether the received string contains the indicator;
- Have the listen thread ask the dump thread (thread-3) to do a flush;
- The tricky part is that we use a uni-directional pipe between the main thread and the background threads (in production the reverse direction has no meaning), so there's no direct way to echo back;
- You could use the filesystem for IPC (i.e. write a file), but that takes busy polling, which is also not good.
In short, we need a channel to propagate the "flush complete" signal from the dump thread to the main thread.
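One way such a "flush complete" channel could look in-process is a `std::promise`/`std::future` pair. The sketch below is an illustration under simplifying assumptions, not the PR's implementation: the listen and dump threads are merged into one reader thread, a `std::string` stands in for the spdlog sink, and `kFlushIndicator`, `DrainUntilIndicator`, and `WriteAndWaitForFlush` are hypothetical names.

```cpp
#include <unistd.h>

#include <future>
#include <string>
#include <thread>
#include <utility>

// Hypothetical marker; assumed never to appear in real output.
const std::string kFlushIndicator = "<<flush-indicator>>";

// Reads from `read_fd` until the indicator shows up, stores everything before
// it into `sink` (standing in for the logger flush), then signals `flushed`.
inline void DrainUntilIndicator(int read_fd,
                                std::string *sink,
                                std::promise<void> flushed) {
  std::string buffer;
  char chunk[256];
  while (true) {
    const ssize_t n = read(read_fd, chunk, sizeof(chunk));
    if (n <= 0) {
      break;  // EOF or error: stop reading.
    }
    buffer.append(chunk, static_cast<size_t>(n));
    const size_t pos = buffer.find(kFlushIndicator);
    if (pos != std::string::npos) {
      *sink = buffer.substr(0, pos);
      break;
    }
  }
  flushed.set_value();  // The back-channel the one-way pipe cannot provide.
}

// Writes `content` plus the indicator into a pipe, then blocks until the
// reader thread reports completion. Returns what the reader captured.
inline std::string WriteAndWaitForFlush(const std::string &content) {
  int fds[2];
  if (pipe(fds) != 0) {
    return "";
  }
  std::string sink;
  std::promise<void> flushed;
  std::future<void> done = flushed.get_future();
  std::thread reader(DrainUntilIndicator, fds[0], &sink, std::move(flushed));

  const std::string payload = content + kFlushIndicator;
  const ssize_t written = write(fds[1], payload.data(), payload.size());
  if (written != static_cast<ssize_t>(payload.size())) {
    close(fds[1]);  // Error path: EOF unblocks the reader.
    fds[1] = -1;
  }
  done.wait();  // No sleep: returns as soon as the reader signals.
  reader.join();
  close(fds[0]);
  if (fds[1] >= 0) {
    close(fds[1]);
  }
  return sink;
}
```

The pipe stays one-directional; only the completion signal travels back, and it does so through shared process memory rather than through the pipe.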
> it makes the unit tests slow
Yes. Currently in Ray, a C++ test pipeline usually takes several hours, of which C++ unit testing itself only takes minutes.
> At a minimum, you should be able to mitigate both of these issues by restructuring the test to repeatedly check for the expected behavior on a short interval (or in a busy loop) and only fail if the expected behavior is not observed after a longer timeout.
I doubt it.
We use gtest to capture stdout and stderr:
https://github.com/google/googletest/blob/main/googletest/src/gtest-port.cc#L1201-L1209
The stream object gets deleted after one invocation, so it cannot be called repeatedly.
> Yes. Current testing for ray, a C++ test pipeline usually takes several hours, within which C++ unit testing itself only takes minutes.
CI is one consideration, but one of the main motivations for making unit tests fast is for rapid iteration locally. If we can run a large swath of unit tests quickly on our laptops, it increases developer productivity. I have worked in codebases where this is done well and it's very empowering.
And yes, while the current state of the C++ tests and CI is poor, we should be striving to make it better and set a new standard with each added PR.
I totally agree; I illustrated above why it is difficult to make it unit testable.
Do you want me to add the reasoning to the comment?
// TODO(hjiang): Current implementation is naive, which directly flushes on spdlog logger
// and could miss those in the pipe; it's acceptable because we only use it in the unit
// test for now.
void FlushOnRedirectedStream(RedirectionHandleWrapper &redirection_handle_wrapper);
This is only used for testing, so it isn't really part of the redirection utils interface, and it's only 1 line of code. So prefer to make it a utility inside the test file and promote it only if it is required in the real implementation.
We have two tests using the same flush function, do you prefer to copy & paste it twice?
Link:
actually it looks like the comment might be wrong and this is actually used in the top-level stream_redirection_utils? https://github.com/ray-project/ray/pull/51191/files#diff-2a7e096be7505545545d270bcc45f035b23dfa071acddaf0f67fd0a8f011d5a2R66
All flush-related functions are used only in unit tests, not in production code.
edoakes
left a comment
I'm still struggling a bit to understand the motivation behind the change and the structure of the code. We are adding an additional shallow layer of abstraction for something quite simple, and having 2x duplicate files with the same name (stream_redirection_utils.{cc,h}) is likely to be confusing to future readers. The test cases also seem to be nearly copy-pasted from tests in the existing stream_redirection_utils_test.cc file.
I also don't think it makes sense to add a new namespace (ray::internal) for this very tiny change, which is another concept that someone reading the code needs to keep in their head. Note that all of the cpp code in src/ray/ is "internal", there is no public API exposed directly here.
What is it that you weren't able to test previously that this enables?
void SyncOnStreamRedirection(RedirectionHandleWrapper &redirection_handle_wrapper) {
  redirection_handle_wrapper.scoped_dup2_wrapper = nullptr;
  redirection_handle_wrapper.redirection_file_handle.Close();
}

void FlushOnRedirectedStream(RedirectionHandleWrapper &redirection_handle_wrapper) {
  redirection_handle_wrapper.redirection_file_handle.Flush();
}
btw, these look quite a lot like methods, what's the reason to make them functions that take the RedirectionHandleWrapper instead of defining RedirectionHandleWrapper as a class?
The external interface people use is exposed as functions, not as a class.
In the current implementation, every time we do stream redirection we hook the destruction to a process-wide exit hook, which is not ideal for unit tests: to test multiple scenarios, we have to write multiple test files rather than multiple test cases. The exit hook is intentional, to mimic RAII so that users don't forget to call the sync function themselves. This PR tries to improve unit testing (that's why I named the PR "testable"); in principle it's a no-op change that simply extracts the stream redirection data structure, so we can unit test the internal functions alone.
This makes sense for abseil because it exposes a public C++ API. It's similar to how our Python code is structured to have a
I see; this helps clarify it a bit. Let me do another pass in depth and see if I have a concrete suggestion for how to structure it that reduces some redundancy.
Chatted with Edward offline; the goal is to make an RAII class for stream redirection, so we can reduce the layers of indirection and the number of public functions exposed.
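To illustrate the RAII idea in general terms, here is a minimal POSIX sketch. `ScopedFdRedirect` is a hypothetical class written for this example, not the class added in this PR: it redirects one file descriptor to another on construction and restores the original on destruction, so no explicit sync or cleanup call is needed.

```cpp
#include <unistd.h>

// Hypothetical RAII guard: redirects `target_fd` to `new_fd` for the lifetime
// of the object, then restores the original destination of `target_fd`.
class ScopedFdRedirect {
 public:
  ScopedFdRedirect(int target_fd, int new_fd)
      : target_fd_(target_fd), saved_fd_(dup(target_fd)) {
    if (saved_fd_ >= 0) {
      dup2(new_fd, target_fd_);  // `target_fd_` now points at `new_fd`'s file.
    }
  }

  ~ScopedFdRedirect() {
    if (saved_fd_ >= 0) {
      dup2(saved_fd_, target_fd_);  // Restore the original destination.
      close(saved_fd_);
    }
  }

  // Non-copyable: exactly one owner restores the descriptor.
  ScopedFdRedirect(const ScopedFdRedirect &) = delete;
  ScopedFdRedirect &operator=(const ScopedFdRedirect &) = delete;

 private:
  int target_fd_;
  int saved_fd_;
};
```

Because cleanup is tied to scope rather than to a process exit hook, several redirection scenarios can live as separate test cases in one test file.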
@edoakes I checked all existing classes / functions under
edoakes
left a comment
This looks way cleaner! Left a few more suggestions that I think would further improve the readability & extensibility of the code.
@@ -0,0 +1,40 @@
// Copyright 2025 The Ray Authors.
If we are going to start following the standard of an internal/ directory, do you think it makes sense to push it to the top level like ray/internal/util/ or prefer ray/util/internal/ ?
I prefer the latter; that's what abseil does:
- For example, there's an internal folder for container: https://github.com/abseil/abseil-cpp/tree/master/absl/container/internal
- And an internal folder inside functional: https://github.com/abseil/abseil-cpp/tree/master/absl/functional/internal
RedirectionHandleWrapper::RedirectionHandleWrapper(MEMFD_TYPE_NON_UNIQUE stream_fd,
                                                   const StreamRedirectionOption &opt) {
  RedirectionFileHandle handle = CreateRedirectionFileHandle(opt);
Is RedirectionFileHandle used anywhere else? If not, it might make sense to consider combining the two classes (again in the direction of "deep" interfaces).
Or at a minimum, I'd suggest putting them in the same internal/ file.
I would also suggest moving the definition of StreamRedirectionOption here as well.
The redirection file handle is defined here:
ray/src/ray/util/pipe_logger.h
Line 35 in ffd7cc6
It's ambiguous for the pipe logger class / file, because:
- It's an independent class that functions well by itself, which means it's legal and OK to use the class alone;
- Meanwhile, as you mentioned, it's not used elsewhere, so it's reasonable to consider it an internal implementation detail as of now.
I usually prefer to leave these classes external, because quite a few times I have found internal classes inside abseil useful, but people are unwilling to use them merely because they're internal.
Example-1: OStream in abseil abseil/abseil-cpp#1827
Example-2: LRU cache in boost https://www.boost.org/doc/libs/1_67_0/boost/compute/detail/lru_cache.hpp
Putting it into public utils folder means we need to maintain the API stability in some sense, but I feel it's ok for PipeLogger.
> I would also suggest moving the definition of StreamRedirectionOption here as well.
On the redirection option, I placed it in a separate class / file because:
- It's used by both internal and public functions / classes, so it should be placed either (1) in a separate file or (2) in the internal file;
- It's exposed as a public interface (users need to specify the option when they do stream redirection), so I prefer (1).
edoakes
left a comment
Looks good, thanks for integrating all the feedback
This PR is stacked upon ray-project#51179, to make redirection stream unit testable.
Basically a no-op change, to extract redirection logic out into a separate file, and leave exit hook and global registry where they're now.
Signed-off-by: dentiny <dentinyhao@gmail.com>
Signed-off-by: Dhakshin Suriakannu <d_suriakannu@apple.com>