`MonitorFdHup`: replace `pthread_cancel` trick with a notification pipe (#12714)
Conversation
Force-pushed from e35b2c9 to 0d8a35f (compare)
on macOS:
Will check if I can reproduce this.
Force-pushed from 0d8a35f to b39ad3a (compare)
Can reproduce this locally. Normal macOS tests work but this is a
fyi this is https://gerrit.lix.systems/c/lix/+/1605
Similar fix; also, this commit predates the commit in Lix by a year. However, the race condition on macOS uncovered here seems not to be tested in Lix, as there is no equivalent to nix-daemon-compat-tests, so I'm unsure whether it also happens there. I can reproduce the issue without the commits here by running tests in isolation on macOS, and it goes away if I add sleeps before the garbage-collection calls. The issue is that there are still temp roots held by the nix-daemon, preventing garbage collection temporarily.
macOS failure reproduced: https://termbin.com/5mgd
src/libutil/unix/monitor-fd.hh
Outdated
```cpp
/* This will only happen on macOS. We sleep a bit to
   avoid waking up too often if the client is sending
   input. */
```
My best guess is that the problem is this preexisting macOS-only workaround (dating to 9fc8d00). If I understand this correctly, it means that after the client "hangs up", the daemon (the fork for that connection) will wait up to one second before itself exiting. This causes a race condition, which causes the test failures.
you likely have a second bug (sorry i am on my phone so can't link the lix fix but I've mentioned it multiple times before incl on the "why is hydra running lix" thread) where the daemon loses its monitor fd hup thread when it forks, since fork kills all non main threads.
this thread is only a phenomenon on the client without fixing that bug.
Do you mean https://gerrit.lix.systems/c/lix/+/2086 ?
> where the daemon loses its monitor fd hup thread when it forks, since fork kills all non main threads.

I don't think you mean the `MonitorFdHup` thread, since that is made in `processConnection` after the forks. But there is still the (separate) signal handler thread; do you think that is involved with this?
oh, right, i am mixing up those two. i am not confident that it is strictly that but it IS related to fd hup handling! since fd hup is treated as ctrl c/interrupt, and if you have interrupt handling on the daemon being broken also, it seems quite likely to cause some shit to be broken? idk.
Random results:

micro-sleeping

```diff
- sleep(1);
+ usleep(10);
```

does pass the tests. It may be a compromise between correctness and a hot loop.

poll mask POLLPRI

```diff
- POLLRDNORM
+ POLLPRI
```

passes the tests as well. This seems to be an event that is far less prevalent and less likely to cause a race.
Note: on Darwin lhr-aarch64-darwin-01 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:06:23 PDT 2024; root:xnu-11215.41.3~3/RELEASE_ARM64_T8132 arm64
Should probably just do both tbh.
https://git.lix.systems/alois31/lix/commit/69e2ee5b25752ba5fd8644cef56fb9d627ca4a64?style=unified&whitespace=ignore-change&show-outdated= oh there is this. Yes, we saw `POLLPRI` too just now. I guess that is a bit more per the spec than `POLLHUP`, which macOS's own man page says is not supposed to work.
note that lix has a comment there also questioning whether the alleged bug still exists
It's a pretty small diff, so let's just start formatting before we make other changes.
Of interest: https://openradar.appspot.com/37537852. It seems like the high likelihood of a POLLRDNORM combined with the long sleep creates a high likelihood of a race condition. The sleep was needed, XY-problem style, to deal with too many POLLRDNORMs. Moving to POLLPRI seems to give both backwards compatibility with the buggy poll() and less spinning on newer kernels.
Syscalls can fail for many reasons and we don't want to lose the errno and error context.
Better than just putting `1` in multiple spots.
On NixOS#8946, we faced surprising behaviour wrt. exceptions when using pthread_cancel. In a nutshell, when a thread is inside a catch block and is hit by pthread_cancel from another thread, the original exception is bubbled up and crashes the process. We now poll on the notification pipe from the thread and exit when the main thread closes its end. This solution does not exhibit surprising behaviour wrt. exceptions. Co-authored-by: Mic92 <joerg@thalheim.io> Fixes NixOS#8946 See also Lix https://gerrit.lix.systems/c/lix/+/1605, which is very similar by coincidence. Pulled a comment from that.
This was filed as NixOS#7584, but as far as I can tell, the previous solution of POLLHUP works just fine on macOS 14. I've also tested on an ancient machine with macOS 10.15.7, where POLLHUP also works correctly. It's possible this might regress some older versions of macOS that have a kernel bug, but I went looking through the history on the sources and didn't find anything that looked terribly convincingly like a bug fix between 2020 and today. If such a broken version exists, it seems pretty reasonable to suggest simply updating the OS. Change-Id: I178a038baa000f927ea2cbc4587d69d8ab786843 Based off of commit 69e2ee5. Ericson2314 added additional information.
After the previous commit it should not be necessary. Furthermore, if we *do* sleep, we'll exacerbate a race condition (in conjunction with getting rid of the thread cancellation) that will cause test failures.
Force-pushed from b39ad3a to 49f486d (compare)
…2714 `MonitorFdHup`: replace `pthread_cancel` trick with a notification pipe (backport #12714)
```cpp
    }
});
};

~MonitorFdHup()
{
    pthread_cancel(thread.native_handle());
    close(notifyPipe.writeSide.get());
```
Shouldn't this be `notifyPipe.writeSide.close()`? Otherwise the `notifyPipe.writeSide` destructor will close the same FD again, which is bad.
Done in #12736
Apparently this is sometimes a problem for tests containing race conditions, since it caused the daemon processes to stick around for a second. Doesn't make writing such tests any less racy and foolish, but we can stop doing the silly thing regardless. CC: NixOS/nix#12714 (comment) Change-Id: Iad6e55cf78c4a4517082194fa00a30d921224457
Motivation
On #8946, we faced surprising behaviour wrt. exceptions when using
pthread_cancel. In a nutshell, when a thread is inside a catch block and
is cancelled via pthread_cancel by another thread, the original
exception is bubbled up and crashes the process.
We now poll on the notification pipe from the thread and exit when the
main thread closes its end. This solution does not exhibit surprising
behaviour wrt. exceptions.
Context
#8946