`MonitorFdHup`: replace `pthread_cancel` trick with a notification pipe #12714

picnoir · 2025-03-21T16:05:14Z

Motivation

On #8946, we ran into surprising behaviour around exceptions when
using pthread_cancel. In a nutshell, when a thread is inside a catch
block and another thread calls pthread_cancel on it, the original
exception is bubbled up and crashes the process.

We now poll on the notification pipe from the thread and exit when the
main thread closes its end. This solution does not exhibit surprising
behaviour wrt. exceptions.
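
A minimal sketch of the notification-pipe idea described above, under stated assumptions: `HangupMonitor`, its `onHangup` callback, and the field names are hypothetical and are not the actual `MonitorFdHup` code. The poll mask is left at 0 for simplicity, and an unexpected `poll` failure just stops the thread here rather than being reported.

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <functional>
#include <stdexcept>
#include <thread>
#include <utility>

#include <poll.h>
#include <unistd.h>

// Hypothetical stand-in for MonitorFdHup (illustration only).
class HangupMonitor
{
    int notifyRead = -1;  // polled by the worker thread
    int notifyWrite = -1; // closed by the owner to ask the worker to exit
    std::thread worker;

public:
    HangupMonitor(int fd, std::function<void()> onHangup)
    {
        int p[2];
        if (pipe(p) == -1)
            throw std::runtime_error("pipe() failed");
        notifyRead = p[0];
        notifyWrite = p[1];

        worker = std::thread([fd, readEnd = notifyRead, cb = std::move(onHangup)] {
            for (;;) {
                struct pollfd fds[2];
                fds[0].fd = fd;
                fds[0].events = 0; // POLLHUP is reported even when not requested
                fds[0].revents = 0;
                fds[1].fd = readEnd;
                fds[1].events = 0;
                fds[1].revents = 0;

                if (poll(fds, 2, -1) == -1) {
                    if (errno == EINTR)
                        continue; // interrupted by a signal, retry
                    return;       // assumption: give up on any other error
                }
                if (fds[0].revents & POLLHUP) {
                    cb(); // the monitored peer hung up
                    return;
                }
                if (fds[1].revents & POLLHUP)
                    return; // the owner closed its end: plain shutdown
                if (fds[0].revents || fds[1].revents)
                    return; // POLLERR / POLLNVAL: stop instead of spinning
            }
        });
    }

    HangupMonitor(const HangupMonitor &) = delete;
    HangupMonitor & operator=(const HangupMonitor &) = delete;

    ~HangupMonitor()
    {
        close(notifyWrite); // the read end now reports POLLHUP to the worker
        worker.join();      // no cancellation, so nothing unwinds unexpectedly
        close(notifyRead);
    }
};
```

The destructor only closes a descriptor and joins, so the worker always leaves through an ordinary `return`, whatever the owning thread is doing at that moment.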

Context

#8946


Mic92 · 2025-03-21T16:40:12Z

on macOS:

       >  13/189 nix-functional-tests:main / multiple-outputs                        FAIL            5.78s   exit status 1
       >  47/189 nix-functional-tests:main / dependencies                            FAIL            4.57s   exit status 1
       >  97/189 nix-functional-tests:main / build-delete                            FAIL            4.62s   exit status 1
       >

Will check if I can reproduce this.

Mic92 · 2025-03-21T18:32:29Z

Can reproduce this locally. The normal macOS tests work, but the failing one is a nix-daemon-compat-tests-2.27.0pre20250321_b39ad3a-with-daemon-2.27.0pre20250321_b39ad3a test.

lf- · 2025-03-21T18:55:28Z

fyi this is https://gerrit.lix.systems/c/lix/+/1605

Mic92 · 2025-03-22T09:33:43Z

Similar fix; also, this commit predates the commit in Lix by a year. However, the race condition on macOS uncovered here does not seem to be tested in Lix, as there is no equivalent to nix-daemon-compat-tests, so I am unsure whether it also happens there. I can reproduce the issue without the commits here by running the tests in isolation on macOS, and the failures go away if I add sleeps before the garbage collection calls.

The issue is that temproots are still shown as held by the nix-daemon, which temporarily prevents the garbage collection.

tomberek · 2025-03-23T17:37:52Z

OSX failure reproduced: https://termbin.com/5mgd

Ericson2314 · 2025-03-23T19:37:15Z

src/libutil/unix/monitor-fd.hh

                /* This will only happen on macOS. We sleep a bit to
                   avoid waking up too often if the client is sending
                   input. */


My best guess is that the problem is this pre-existing macOS-only workaround (dating to 9fc8d00). If I understand this correctly, this means that after the client "hangs up", the daemon (the fork for that connection) will wait up to 1 second before itself exiting. This causes a race condition which causes the test failures.

you likely have a second bug (sorry i am on my phone so can't link the lix fix but I've mentioned it multiple times before incl on the "why is hydra running lix" thread) where the daemon loses its monitor fd hup thread when it forks, since fork kills all non main threads.

this thread is only a phenomenon on the client without fixing that bug.

Do you mean https://gerrit.lix.systems/c/lix/+/2086 ?

where the daemon loses its monitor fd hup thread when it forks, since fork kills all non main threads.

I don't think you mean the MonitorFdHup thread, since that is made in processConnection after the forks. But there is still the (separately) signal handler thread, and you think that is involved with this?

oh, right, i am mixing up those two. i am not confident that it is strictly that but it IS related to fd hup handling! since fd hup is treated as ctrl c/interrupt, and if you have interrupt handling on the daemon being broken also, it seems quite likely to cause some shit to be broken? idk.

Random results

micro-sleeping

- sleep(1);
+ usleep(10);

Does pass the tests. It may be a compromise between correctness and a hot-loop.

poll mask POLLPRI

- POLLRDNORM
+ POLLPRI

passes tests as well. This seems to be a signal that will be far less prevalent and less likely to cause a race.

Note: on Darwin lhr-aarch64-darwin-01 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:06:23 PDT 2024; root:xnu-11215.41.3~3/RELEASE_ARM64_T8132 arm64

Should probably just do both tbh.

https://git.lix.systems/alois31/lix/commit/69e2ee5b25752ba5fd8644cef56fb9d627ca4a64?style=unified&whitespace=ignore-change&show-outdated= oh there is this. Yes we saw POLLPRI too just now. I guess that is a bit more per the spec than POLLHUP, which macOS's own man page says is not supposed to work.

apple-oss-distributions/xnu@e13b1fa#diff-a5aa0b0e7f4d866ca417f60702689fc797e9cdfe33b601b05ccf43086c35d395R1468

note that lix has a comment there also questioning whether the alleged bug still exists

It's a pretty small diff, so let's just start formatting before we make other changes.

tomberek · 2025-03-23T22:07:32Z

Of interest: https://openradar.appspot.com/37537852

Seems like the high likelihood of a POLLRDNORM and high sleep creates a high likelihood of a race condition. The sleep was needed in XY-fashion to deal with too many POLLRDNORMs. Moving to POLLPRI seems to have both backwards compat with buggy poll() and less spinning on newer kernels.
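
To make the mask discussion concrete, here is a small sketch, assuming a plain pipe on Linux; the helper name `pollOnce` is hypothetical and not part of Nix. It illustrates why a `POLLRDNORM` mask wakes the poller whenever input is pending, while a `POLLPRI` mask stays quiet until the hangup arrives. How macOS behaves for these masks is exactly what is in question in this thread, so treat this as an illustration of the intent rather than a statement about Darwin.

```cpp
#include <cassert>
#include <cerrno>
#include <system_error>

#include <poll.h>
#include <unistd.h>

// Hypothetical helper: poll a single descriptor once with the given mask
// and return the reported events (0 if the timeout expired).
short pollOnce(int fd, short events, int timeoutMs)
{
    struct pollfd p;
    p.fd = fd;
    p.events = events;
    p.revents = 0;

    int n;
    do {
        n = poll(&p, 1, timeoutMs);
    } while (n == -1 && errno == EINTR); // retry if interrupted by a signal

    if (n == -1)
        // Keep errno so the failure reason is not lost.
        throw std::system_error(errno, std::generic_category(), "poll");

    return n == 0 ? 0 : p.revents;
}
```

With data pending on the read end, `pollOnce(fd, POLLRDNORM, 0)` reports readability on every call, which is the situation the old sleep was papering over. `pollOnce(fd, POLLPRI, 0)` reports nothing until the writer goes away, at which point `POLLHUP` is set regardless of the requested mask.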

Syscalls can fail for many reasons and we don't want to lose the errno and error context.

Better than just putting `1` in multiple spots.

On NixOS#8946, we faced a surprising behaviour wrt. exception when using pthread_cancel. In a nutshell when a thread is inside a catch block and it's getting pthread_cancel by another one, then the original exception is bubbled up and crashes the process.

We now poll on the notification pipe from the thread and exit when the main thread closes its end. This solution does not exhibit surprising behaviour wrt. exceptions.

Co-authored-by: Mic92 <joerg@thalheim.io>

Fixes NixOS#8946

See also Lix https://gerrit.lix.systems/c/lix/+/1605 which is very similar by coincidence. Pulled a comment from that.

This was filed as NixOS#7584, but as far as I can tell, the previous solution of POLLHUP works just fine on macOS 14. I've also tested on an ancient machine with macOS 10.15.7, which also has POLLHUP work correctly.

It's possible this might regress some older versions of macOS that have a kernel bug, but I went looking through the history on the sources and didn't find anything that looked terribly convincingly like a bug fix between 2020 and today. If such a broken version exists, it seems pretty reasonable to suggest simply updating the OS.

Change-Id: I178a038baa000f927ea2cbc4587d69d8ab786843

Based off of commit 69e2ee5. Ericson2314 added additional other information.

After the previous commit it should not be necessary. Furthermore, if we *do* sleep, we'll exacerbate a race condition (in conjunction with getting rid of the thread cancellation) that will cause test failures.

edolstra · 2025-03-24T10:44:38Z

src/libutil/unix/monitor-fd.hh

            }
        });
    };

    ~MonitorFdHup()
    {
-        pthread_cancel(thread.native_handle());
+        close(notifyPipe.writeSide.get());


Shouldn't this be notifyPipe.writeSide.close()? Otherwise the notifyPipe.writeSide destructor will close the same FD again, which is bad.
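
The concern can be sketched as follows, assuming a minimal RAII owner; `OwnedFd` and `isOpen` are hypothetical names and do not reproduce Nix's own descriptor wrapper. Calling the raw `::close()` on `get()` leaves the owner still believing it holds the descriptor, so its destructor closes the same number a second time, by which point the number may already belong to an unrelated file. A member `close()` that resets the stored value avoids that.

```cpp
#include <cassert>

#include <fcntl.h>
#include <unistd.h>

// Hypothetical minimal descriptor owner (illustration only).
class OwnedFd
{
    int fd;

public:
    explicit OwnedFd(int fd) : fd(fd) {}
    OwnedFd(const OwnedFd &) = delete;
    OwnedFd & operator=(const OwnedFd &) = delete;

    ~OwnedFd() { close(); }

    int get() const { return fd; }

    // Close at most once: afterwards the destructor has nothing left to do.
    void close()
    {
        if (fd != -1) {
            ::close(fd);
            fd = -1;
        }
    }
};

// Hypothetical helper, only used to observe whether a descriptor number is open.
bool isOpen(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}
```

After `w.close()`, a descriptor opened later may reuse the old number, and it stays open when `w` is destroyed. With `::close(w.get())` instead, the destructor would close that unrelated descriptor.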

Done in #12736

Apparently this is sometimes a problem for tests containing race conditions, since it caused the daemon processes to stick around for a second. Doesn't make writing such tests any less racey and foolish, but we can stop doing the silly thing regardless.

CC: NixOS/nix#12714 (comment)

Change-Id: Iad6e55cf78c4a4517082194fa00a30d921224457

picnoir requested a review from edolstra as a code owner March 21, 2025 16:05

Mic92 force-pushed the pic/monitorhup-fix-pthread-cancellation branch 2 times, most recently from e35b2c9 to 0d8a35f Compare March 21, 2025 16:11

Mic92 added the infra, backport 2.24-maintenance, backport 2.26-maintenance, and backport 2.27-maintenance labels Mar 21, 2025

Mic92 enabled auto-merge March 21, 2025 16:13

Mic92 force-pushed the pic/monitorhup-fix-pthread-cancellation branch from 0d8a35f to b39ad3a Compare March 21, 2025 17:36

Ericson2314 reviewed Mar 23, 2025

monitor-fd.hh: Format

041394b

It's a pretty small diff, so let's just start formatting before we make other changes.

Mic92 and others added 6 commits March 23, 2025 18:22

MonitorFdHup: raise explicit SysError rather unreachable

8e0bc2c

Syscalls can fail for many reasons and we don't want to lose the errno and error context.

MonitorFdHup: Cleanup a bit with designated initializers

d028bb4

MonitorFdHup: introduce a num_fds variable

cb95791

Better than just putting `1` in multiple spots.

MonitorFdHup: Don't sleep anymore

49f486d

After the previous commit it should not be necessary. Furthermore, if we *do* sleep, we'll exacerbate a race condition (in conjunction with getting rid of the thread cancellation) that will cause test failures.

Ericson2314 force-pushed the pic/monitorhup-fix-pthread-cancellation branch from b39ad3a to 49f486d Compare March 23, 2025 23:15

Ericson2314 changed the title ~~MonitorFdHup: replace pthread_cancel trick with a notification pipe~~ MonitorFdHup: replace pthread_cancel trick with a notification pipe Mar 23, 2025

Ericson2314 approved these changes Mar 23, 2025

Ericson2314 mentioned this pull request Mar 23, 2025

Crash in Nix 2.16.1 #8946

Closed

Mic92 merged commit 648c095 into NixOS:master Mar 23, 2025
13 checks passed

This was referenced Mar 23, 2025

MonitorFdHup: replace pthread_cancel trick with a notification pipe (backport #12714) #12731

Merged

MonitorFdHup: replace pthread_cancel trick with a notification pipe (backport #12714) #12732

Merged

mergify bot mentioned this pull request Mar 23, 2025

MonitorFdHup: replace pthread_cancel trick with a notification pipe (backport #12714) #12733

Merged

Ericson2314 added a commit that referenced this pull request Mar 24, 2025

Merge pull request #12732 from NixOS/mergify/bp/2.26-maintenance/pr-1…

65873d1

…2714 `MonitorFdHup`: replace `pthread_cancel` trick with a notification pipe (backport #12714)

Ericson2314 added a commit that referenced this pull request Mar 24, 2025

Merge pull request #12733 from NixOS/mergify/bp/2.27-maintenance/pr-1…

1a87f12

…2714 `MonitorFdHup`: replace `pthread_cancel` trick with a notification pipe (backport #12714)

Ericson2314 added a commit that referenced this pull request Mar 24, 2025

Merge pull request #12731 from NixOS/mergify/bp/2.24-maintenance/pr-1…

71ab003

…2714 `MonitorFdHup`: replace `pthread_cancel` trick with a notification pipe (backport #12714)

edolstra reviewed Mar 24, 2025

picnoir deleted the pic/monitorhup-fix-pthread-cancellation branch March 24, 2025 12:38

tomberek added the backport 2.25-maintenance label Mar 24, 2025

mergify bot mentioned this pull request Mar 24, 2025

MonitorFdHup: replace pthread_cancel trick with a notification pipe (backport #12714) #12744

Merged

tomberek added a commit that referenced this pull request Mar 24, 2025

Merge pull request #12744 from NixOS/mergify/bp/2.25-maintenance/pr-1…

7e9be2b

…2714 `MonitorFdHup`: replace `pthread_cancel` trick with a notification pipe (backport #12714)
