Compiler hangs occasionally on many-core CPUs on Windows #73532

Closed
hjyamauchi opened this issue May 8, 2024 · 6 comments · Fixed by swiftlang/swift-corelibs-foundation#4954
Labels
bug A deviation from expected or documented behavior. Also: expected but undesirable behavior. triage needed This issue needs more specific labels

Comments

@hjyamauchi
Contributor

hjyamauchi commented May 8, 2024

Description

The Swift compiler occasionally hangs during a build. This is seen more frequently on many-core (> 16 cores) machines, in particular AMD Threadripper CPUs with 32 cores / 64 threads. When it happens, several swiftc processes are left running and making no progress, with no (child) swift-frontend processes.

Reproduction

This happens in a large internal app build on Windows. On some particular AMD Threadripper machines, it happens 100% of the time. We have seen hangs on other machines much less frequently, which may be the same issue.

Expected behavior

The build finishes instead of hanging forever.

Environment

Windows

Additional information

No response

@hjyamauchi hjyamauchi added bug A deviation from expected or documented behavior. Also: expected but undesirable behavior. triage needed This issue needs more specific labels labels May 8, 2024
hjyamauchi added a commit to hjyamauchi/swift-corelibs-foundation that referenced this issue May 9, 2024
Increase the initial size of __CFReadSocketsFds to reduce the chance
of resizing it which appears to lead to compiler hangs.
@hjyamauchi
Contributor Author

hjyamauchi commented May 9, 2024

This has been seen to happen only in rare cases (a few random processes) in our large internal Swift application build.

The symptom is that the compiler driver thread is stuck waiting forever in the while loop in Process.waitUntilExit after the child process that it is waiting for has already finished.

Based on inspections with the debugger, the self.isRunning flag is true, which indicates that the CFSocketCreateWithNative callback never fired, even though the child process already finished.

Fortunately, adding a small amount of logging doesn't change the reproducibility, but unfortunately adding too much logging makes it go away. So far it has only been reproducible in a large Swift build that involves many, many invocations of swiftc/swift-frontend processes, so it is hard to know in which one it occurs and to attach a debugger in real time. Most of the debugging so far has relied on a limited amount of logging.

Further investigation suggests that the callback never fires because some arbitrary socket file descriptors occasionally get dropped (their bits in the bit vector are cleared) for unknown reasons after they are put into the __CFReadSocketsFds bit vector. Those file descriptors are then never tested by the select call, so the above callback never fires. It seems to happen around the time the bit vector gets resized via the CFDataIncreaseLength call in __CFSocketFdSet. If I effectively turn off the resizing by increasing the initial size of the bit vector, this hang reliably goes away. So I suspected a bug in the resize code, but I couldn't spot one, and I verified with extra debugging code that the bit vector contents are identical immediately before and after the resize, still within the same critical section. However, when a different thread accesses the same bit vector in a subsequent critical section, it occasionally finds that some bits have been dropped.
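As a point of reference for the resize step described above, here is a minimal sketch of a growable bit vector of descriptors. It is not the CoreFoundation code: `BitVector`, `bv_init`, `bv_test`, and `bv_set` are hypothetical names, and plain `realloc` stands in for the CFMutableData growth. It only shows the invariant one would expect to hold, namely that growing the vector zero-fills the new tail and leaves previously set bits intact.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a data-backed bit vector of descriptors. */
typedef struct {
    unsigned char *bytes;
    size_t length; /* size of the vector in bytes */
} BitVector;

/* Allocates a zeroed vector of initial_bytes bytes. */
static bool bv_init(BitVector *bv, size_t initial_bytes) {
    assert(initial_bytes > 0);
    bv->bytes = calloc(initial_bytes, 1);
    bv->length = bv->bytes ? initial_bytes : 0;
    return bv->bytes != NULL;
}

/* Returns true if bit fd is set; bits beyond the current length read as 0. */
static bool bv_test(const BitVector *bv, size_t fd) {
    size_t byte = fd / CHAR_BIT;
    if (byte >= bv->length) return false;
    return (bv->bytes[byte] >> (fd % CHAR_BIT)) & 1u;
}

/* Sets bit fd, growing the vector if needed.
 * Returns true if the bit changed, false if it was already set
 * or the allocation failed. */
static bool bv_set(BitVector *bv, size_t fd) {
    size_t byte = fd / CHAR_BIT;
    if (byte >= bv->length) {
        size_t new_length = byte + 1;
        unsigned char *grown = realloc(bv->bytes, new_length);
        if (!grown) return false;
        /* Zero only the newly added tail; old bits must survive. */
        memset(grown + bv->length, 0, new_length - bv->length);
        bv->bytes = grown;
        bv->length = new_length;
    }
    if (bv_test(bv, fd)) return false;
    bv->bytes[byte] |= (unsigned char)(1u << (fd % CHAR_BIT));
    return true;
}
```

In this single-threaded model, a bit set before a growth is still set afterwards, which matches the observation that the contents were identical right before and after the resize. It says nothing about what another thread or a different interpretation of the same memory might do later.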

I also checked that access to the bit vector is properly synchronized and found no issues. I also instrumented the other points in the code where bits in the bit vector could potentially be cleared, but didn't find anything suspicious. This looks like data corruption of some kind, and my current theory is some sort of racy data/heap corruption in lower-level code, such as a race-condition bug in the underlying memory allocators (CFDataAllocator, etc.) or the lock implementation (CFLock, etc.), unless it's a broken CPU/hardware or something like that.

@hjyamauchi
Contributor Author

hjyamauchi commented May 9, 2024

swiftlang/swift-corelibs-foundation#4951 is a suggested workaround that reliably avoids this hang by allocating a larger initial size, which reduces the chance of the bit vector being resized. Ideally we'd fix the root cause, but given the cost/benefit tradeoff and the fact that this code is deprecated and is going to be replaced by swift-foundation, I hope this will at least unblock us and allow us to further experiment around this issue.

@lxbndr
Contributor

lxbndr commented May 10, 2024

I am afraid that on Windows it is even more complicated. I did some research on this a while ago, because we noticed that creating too many CFSockets makes a test app hang. The reason for this weird behavior is that fd_set is not a bit set on Windows, while the CoreFoundation code is written with a bit set in mind. I guess we have no choice other than to rewrite some parts to use platform-specific fd_set handling to make everything work correctly. This is quite a challenging task, as a bit set gives some advantages and simplifies a lot of things (like the capacity growth you mentioned).

I stopped working on this because the only issue I noticed was one synthetic test. It is unfortunate that this issue affects the compiler in such a drastic way 😞

Here is my WIP commit with the initial fix I made, just for reference. To be honest, I don't even remember all the "how and why"s, but I hope it describes the idea at least. And it fixes the fd_set growth problem in vitro.
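To illustrate the layout difference, here is a small portable sketch. Winsock declares `fd_set` as a count (`fd_count`) followed by an array of socket handles (`fd_array`), not as a bitmap. The types and helpers below (`model_win_fd_set`, `model_fd_isset`, `model_fd_set`, `bitset_style_set`, `MODEL_FD_SETSIZE`) are hypothetical stand-ins written for this example so that it compiles without the Windows headers. They are not the CoreFoundation code or the WIP fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_FD_SETSIZE 64 /* assumption: models a 64-entry set */

/* Portable model of the Winsock layout: a count plus an array of handles. */
typedef struct {
    unsigned int fd_count;
    uintptr_t fd_array[MODEL_FD_SETSIZE];
} model_win_fd_set;

/* Membership is a linear search over the stored handles. */
static bool model_fd_isset(const model_win_fd_set *set, uintptr_t sock) {
    /* Clamp so that a corrupted count cannot walk past the array. */
    unsigned int n = set->fd_count < MODEL_FD_SETSIZE
                         ? set->fd_count : MODEL_FD_SETSIZE;
    for (unsigned int i = 0; i < n; i++) {
        if (set->fd_array[i] == sock) return true;
    }
    return false;
}

/* Appends sock if it is not already present; fails when the array is full. */
static bool model_fd_set(model_win_fd_set *set, uintptr_t sock) {
    if (model_fd_isset(set, sock)) return true;
    if (set->fd_count >= MODEL_FD_SETSIZE) return false;
    set->fd_array[set->fd_count++] = sock;
    return true;
}

/* What bitmap-style code would do to the same memory:
 * treat it as raw bytes and flip bit number `sock`. */
static void bitset_style_set(void *raw, size_t raw_size, uintptr_t sock) {
    size_t byte = sock / 8;
    assert(byte < raw_size);
    ((unsigned char *)raw)[byte] |= (unsigned char)(1u << (sock % 8));
}
```

With this layout, the capacity limits how many sockets fit in the set rather than how large a handle value may be, so growing the storage "by highest descriptor number" does not carry over. Writing a bit at a handle's index into such a structure changes `fd_count` or some array slot instead of recording the handle, which is one way code written for a bitmap could misbehave here.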

@hjyamauchi
Contributor Author

@lxbndr Oh my... thanks for posting and the patch :) I'm intrigued by the fact that it works to this extent despite this issue 🤯

I confirmed that your WIP commit reliably fixes the hang in our internal build, as is.

Would you be willing to put up a PR out of it? That would definitely unblock us. It'd be great if we can merge it.

@lxbndr
Contributor

lxbndr commented May 11, 2024

@hjyamauchi I guess we can do that, even if it is not perfect. If it makes sense and fixes real issues, it is worth a try.

lxbndr added a commit to readdle/swift-corelibs-foundation that referenced this issue May 12, 2024
Fixes swiftlang/swift#73532.

On Windows, socket handles in a `fd_set` are not represented as
bit flags as in Berkeley sockets. While we have no `fd_set` dynamic
growth in this implementation, the `FD_SETSIZE` defined as 1024
in `CoreFoundation_Prefix.h` should be enough for majority of tasks.
lxbndr added a commit to readdle/swift-corelibs-foundation that referenced this issue May 13, 2024
Fixes swiftlang/swift#73532.

On Windows, socket handles in a `fd_set` are not represented as
bit flags as in Berkeley sockets. While we have no `fd_set` dynamic
growth in this implementation, the `FD_SETSIZE` defined as 1024
in `CoreFoundation_Prefix.h` should be enough for majority of tasks.
@hjyamauchi
Contributor Author

@lxbndr thanks for the fix!

2 participants