Skip to content

[Core] Use fd instead of handle for windows log redirection #53852

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jun 18, 2025

Conversation

jjyao
Copy link
Collaborator

@jjyao jjyao commented Jun 16, 2025

Why are these changes needed?

Windows has two sets of APIs for working with files. One is the WinAPI which works with HANDLE. The other is C API which works with fd (just like Unix). Currently in master we are mix using both and seems have caused issues like #52739. In this PR, we use the C API instead which is more consistent with the Unix code path.

ray start previously failed on my windows machine now passes with this PR.

Related issue number

Closes #52739

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao added the go add ONLY when ready to merge, run all tests label Jun 16, 2025
jjyao added 3 commits June 16, 2025 09:35
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
if (handle == INVALID_HANDLE_VALUE) {
return Status::IOError("") << "Fails to get file handle for flushing";
}
if (!FlushFileBuffers(handle)) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For flush, there is no corresponding C API so we need to covert fd back to HANDLE and use the WinAPI

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will mixing the APIs in this way cause issues like the one this PR intends to fix?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the testing, it didn't cause issues.

I actually didn't fully understand the root cause. I initially started the PR just trying to simplify things by uniformly using fd and then realized that it fixed the issue.

One thing I noticed is that converting from HANDLE to fd transfer the ownership. But getting a HANDLE back from fd (_get_osfhandle) should be safe as long as we don't close the handle directly. All these are based on my very limited Windows knowledge.

The _open_osfhandle call transfers ownership of the Win32 file handle to the file descriptor. To close a file opened by using _open_osfhandle, call [_close](https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/close?view=msvc-170). The underlying OS file handle is also closed by a call to _close. Don't call the Win32 function CloseHandle on the original handle. If the file descriptor is owned by a FILE * stream, then a call to [fclose](https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/fclose-fcloseall?view=msvc-170) closes both the file descriptor and the underlying handle. In this case, don't call _close on the file descriptor or CloseHandle on the original handle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't pretend to understand this and the PR is merged, but could you please leave a comment in compat.h that describes your current understanding and can aid in future debugging? This might break again in the future.

@jjyao jjyao marked this pull request as ready for review June 17, 2025 05:44
@jjyao jjyao requested a review from a team as a code owner June 17, 2025 05:44
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@edoakes
Copy link
Collaborator

edoakes commented Jun 17, 2025

Have you verified that this fixes the reported issue? If so, can you detail how you tested it?

@czgdp1807
Copy link
Contributor

I tried it and it works,

(ray-dev) PS C:\Users\czgdp1807\Anyscale\ray\python> ray start --head
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 127.0.0.1

--------------------
Ray runtime started.
--------------------

Next steps

  To connect to this Ray cluster:
    import ray
    ray.init()

  To terminate the Ray runtime, run
    ray stop

  To view the status of the cluster, use
    ray status
(ray-dev) PS C:\Users\czgdp1807\Anyscale\ray\python> ray status
======== Autoscaler status: 2025-06-17 17:50:13.618875 ========
Node status
---------------------------------------------------------------
Active:
 1 node_1f5d161361f94b4077b0d9ee34c485990b85c77bd55bd6db82d8766a
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0.0/4.0 CPU
 0B/7.21GiB memory
 0B/3.09GiB object_store_memory

Total Constraints:
 (no request_resources() constraints)
Total Demands:
 (no resource demands)

Copy link
Contributor

@czgdp1807 czgdp1807 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like fds are bing directly used in this PR and that's why its working on Windows 11.

@jjyao jjyao requested a review from edoakes June 17, 2025 20:24
@jjyao jjyao merged commit 906db98 into ray-project:master Jun 18, 2025
5 checks passed
@jjyao jjyao deleted the jjyao/fdwin branch June 18, 2025 04:47
@PhilippWillms
Copy link

Could we get a cherry-pick for that PR to be integrated in recently published 2.47 release?

@jjyao
Copy link
Collaborator Author

jjyao commented Jun 18, 2025

@PhilippWillms we will release it as 2.48 which should be released in 2-3 weeks. Would that work?

Copy link
Contributor

@israbbani israbbani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the change fully, but it might be worth documenting this in the code to aid future debugging.

Thanks for debugging and fixing this!

if (handle == INVALID_HANDLE_VALUE) {
return Status::IOError("") << "Fails to get file handle for flushing";
}
if (!FlushFileBuffers(handle)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't pretend to understand this and the PR is merged, but could you please leave a comment in compat.h that describes your current understanding and can aid in future debugging? This might break again in the future.

Status CompleteWrite(int fd, const char *data, size_t len) {
const int ret = _write(fd, data, len);
if (ret == -1) {
return Status::IOError("") << "Fails to write to file because " << strerror(errno);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Fails/Failed/

Status CompleteWrite(int fd, const char *data, size_t len) {
const int ret = _write(fd, data, len);
if (ret == -1) {
return Status::IOError("") << "Fails to write to file because " << strerror(errno);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Fails/Failed/

@@ -64,36 +64,39 @@ Status Flush(MEMFD_TYPE_NON_UNIQUE fd) {
}
return Status::OK();
}
Status Close(MEMFD_TYPE_NON_UNIQUE fd) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this MEMFD_TYPE_NON_UNIQUE before and isn't that just an alias for int anyway?

}
if ((DWORD)len != bytes_written) {
if (ret != static_cast<int>(len)) {
return Status::IOError("") << "Fails to write all requested bytes, requests to write "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Fails/Failed/
s/write/wrote/

auto ostream =
std::make_shared<boost::iostreams::stream<boost::iostreams::file_descriptor_sink>>(
std::move(fd_sink));
#if defined(__APPLE__) || defined(__linux__)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably also deserves a comment that explains the edge case that caused this issue.

@PhilippWillms
Copy link

@PhilippWillms we will release it as 2.48 which should be released in 2-3 weeks. Would that work?

Ok, fine for me.

rebel-scottlee pushed a commit to rebellions-sw/ray that referenced this pull request Jun 21, 2025
…ect#53852)

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Scott Lee <scott.lee@rebellions.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Core] ray.init() and ray start fails on Windows 11 in ray 2.45+
5 participants