-
Notifications
You must be signed in to change notification settings - Fork 6.5k
[Core] Use fd instead of handle for windows log redirection #53852
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
if (handle == INVALID_HANDLE_VALUE) { | ||
return Status::IOError("") << "Fails to get file handle for flushing"; | ||
} | ||
if (!FlushFileBuffers(handle)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For flush, there is no corresponding C API so we need to covert fd back to HANDLE and use the WinAPI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will mixing the APIs in this way cause issues like the one this PR intends to fix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the testing, it didn't cause issues.
I actually didn't fully understand the root cause. I initially started the PR just trying to simplify things by uniformly using fd and then realized that it fixed the issue.
One thing I noticed is that converting from HANDLE to fd transfer the ownership. But getting a HANDLE back from fd (_get_osfhandle
) should be safe as long as we don't close the handle directly. All these are based on my very limited Windows knowledge.
The _open_osfhandle call transfers ownership of the Win32 file handle to the file descriptor. To close a file opened by using _open_osfhandle, call [_close](https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/close?view=msvc-170). The underlying OS file handle is also closed by a call to _close. Don't call the Win32 function CloseHandle on the original handle. If the file descriptor is owned by a FILE * stream, then a call to [fclose](https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/fclose-fcloseall?view=msvc-170) closes both the file descriptor and the underlying handle. In this case, don't call _close on the file descriptor or CloseHandle on the original handle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't pretend to understand this and the PR is merged, but could you please leave a comment in compat.h
that describes your current understanding and can aid in future debugging? This might break again in the future.
Have you verified that this fixes the reported issue? If so, can you detail how you tested it? |
I tried it and it works, (ray-dev) PS C:\Users\czgdp1807\Anyscale\ray\python> ray start --head
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
Local node IP: 127.0.0.1
--------------------
Ray runtime started.
--------------------
Next steps
To connect to this Ray cluster:
import ray
ray.init()
To terminate the Ray runtime, run
ray stop
To view the status of the cluster, use
ray status
(ray-dev) PS C:\Users\czgdp1807\Anyscale\ray\python> ray status
======== Autoscaler status: 2025-06-17 17:50:13.618875 ========
Node status
---------------------------------------------------------------
Active:
1 node_1f5d161361f94b4077b0d9ee34c485990b85c77bd55bd6db82d8766a
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Total Usage:
0.0/4.0 CPU
0B/7.21GiB memory
0B/3.09GiB object_store_memory
Total Constraints:
(no request_resources() constraints)
Total Demands:
(no resource demands) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like fd
s are bing directly used in this PR and that's why its working on Windows 11.
Could we get a cherry-pick for that PR to be integrated in recently published 2.47 release? |
@PhilippWillms we will release it as 2.48 which should be released in 2-3 weeks. Would that work? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the change fully, but it might be worth documenting this in the code to aid future debugging.
Thanks for debugging and fixing this!
if (handle == INVALID_HANDLE_VALUE) { | ||
return Status::IOError("") << "Fails to get file handle for flushing"; | ||
} | ||
if (!FlushFileBuffers(handle)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't pretend to understand this and the PR is merged, but could you please leave a comment in compat.h
that describes your current understanding and can aid in future debugging? This might break again in the future.
Status CompleteWrite(int fd, const char *data, size_t len) { | ||
const int ret = _write(fd, data, len); | ||
if (ret == -1) { | ||
return Status::IOError("") << "Fails to write to file because " << strerror(errno); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Fails/Failed/
Status CompleteWrite(int fd, const char *data, size_t len) { | ||
const int ret = _write(fd, data, len); | ||
if (ret == -1) { | ||
return Status::IOError("") << "Fails to write to file because " << strerror(errno); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Fails/Failed/
@@ -64,36 +64,39 @@ Status Flush(MEMFD_TYPE_NON_UNIQUE fd) { | |||
} | |||
return Status::OK(); | |||
} | |||
Status Close(MEMFD_TYPE_NON_UNIQUE fd) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this MEMFD_TYPE_NON_UNIQUE
before and isn't that just an alias for int
anyway?
} | ||
if ((DWORD)len != bytes_written) { | ||
if (ret != static_cast<int>(len)) { | ||
return Status::IOError("") << "Fails to write all requested bytes, requests to write " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Fails/Failed/
s/write/wrote/
auto ostream = | ||
std::make_shared<boost::iostreams::stream<boost::iostreams::file_descriptor_sink>>( | ||
std::move(fd_sink)); | ||
#if defined(__APPLE__) || defined(__linux__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably also deserves a comment that explains the edge case that caused this issue.
Ok, fine for me. |
…ect#53852) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: Scott Lee <scott.lee@rebellions.ai>
Why are these changes needed?
Windows has two sets of APIs for working with files. One is the WinAPI which works with
HANDLE
. The other is C API which works withfd
(just like Unix). Currently in master we are mix using both and seems have caused issues like #52739. In this PR, we use the C API instead which is more consistent with the Unix code path.ray start
previously failed on my windows machine now passes with this PR.Related issue number
Closes #52739
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.