Skip to content

MonitorFdHup: replace pthread_cancel trick with a notification pipe #12714

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Mar 23, 2025

Conversation

picnoir
Copy link
Member

@picnoir picnoir commented Mar 21, 2025

Motivation

On #8946, we faced a surprising
behaviour wrt. exception when using pthread_cancel. In a nutshell when
a thread is inside a catch block and it's getting pthread_cancel by
another one, then the original exception is bubbled up and crashes the
process.

We now poll on the notification pipe from the thread and exit when the
main thread closes its end. This solution does not exhibit surprising
behaviour wrt. exceptions.

Context

#8946


Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

@picnoir picnoir requested a review from edolstra as a code owner March 21, 2025 16:05
@Mic92 Mic92 force-pushed the pic/monitorhup-fix-pthread-cancellation branch 2 times, most recently from e35b2c9 to 0d8a35f Compare March 21, 2025 16:11
@Mic92 Mic92 added infra Issue affecting the @NixOS/infra team backport 2.24-maintenance Automatically creates a PR against the branch backport 2.26-maintenance Automatically creates a PR against the branch backport 2.27-maintenance Automatically creates a PR against the branch labels Mar 21, 2025
@Mic92 Mic92 enabled auto-merge March 21, 2025 16:13
@Mic92
Copy link
Member

Mic92 commented Mar 21, 2025

on macOS:

       >  13/189 nix-functional-tests:main / multiple-outputs                        FAIL            5.78s   exit status 1
       >  47/189 nix-functional-tests:main / dependencies                            FAIL            4.57s   exit status 1
       >  97/189 nix-functional-tests:main / build-delete                            FAIL            4.62s   exit status 1
       >

Will check if I can reproduce this.

@Mic92 Mic92 force-pushed the pic/monitorhup-fix-pthread-cancellation branch from 0d8a35f to b39ad3a Compare March 21, 2025 17:36
@Mic92
Copy link
Member

Mic92 commented Mar 21, 2025

Can reproduce this locally. Normal macOS tests works but this is a nix-daemon-compat-tests-2.27.0pre20250321_b39ad3a-with-daemon-2.27.0pre20250321_b39ad3a. test

@lf-
Copy link
Member

lf- commented Mar 21, 2025

fyi this is https://gerrit.lix.systems/c/lix/+/1605

@Mic92
Copy link
Member

Mic92 commented Mar 22, 2025

Similar fix also this commit predates the commit in Lix by a year. However the race condition in macOS uncovered here seems to be not tested in Lix as there is no equivalent to nix-daemon-compat-tests, so unsure if it also happens there. I can reproduce it the issue without the commits here by running tests in isolation on macOS and they go away if I add sleeps before the garbage collection calls.

The issue is that there are still temproots shown as hold by the nix-daemon preventing the garbage collection temporarily.

@tomberek
Copy link
Contributor

OSX failure reproduced: https://termbin.com/5mgd

Comment on lines 74 to 76
/* This will only happen on macOS. We sleep a bit to
avoid waking up too often if the client is sending
input. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My best guess is that the problem is this preeexisting macos-only workaround (dating to 9fc8d00). If I understand this correctly, this means that after client "hangs up", the daemon (fork for that connection) will wait up to one second 1 second before itself exiting. This causes a race condition which causes the test failures.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you likely have a second bug (sorry i am on my phone so can't link the lix fix but I've mentioned it multiple times before incl on the "why is hydra running lix" thread) where the daemon loses its monitor fd hup thread when it forks, since fork kills all non main threads.

this thread is only a phenomenon on the client without fixing that bug.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where the daemon loses its monitor fd hup thread when it forks, since fork kills all non main threads.

I don't think you mean the MonitorFdHup thread, since that is made in processConnection after the forks. But there is still the (separately) signal handler thread, and you think that is involved with this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, right, i am mixing up those two. i am not confident that it is strictly that but it IS related to fd hup handling! since fd hup is treated as ctrl c/interrupt, and if you have interrupt handling on the daemon being broken also, it seems quite likely to cause some shit to be broken? idk.

Copy link
Contributor

@tomberek tomberek Mar 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Random results

micro-sleeping

- sleep(1);
+ usleep(10);

Does pass the tests. It may be a compromise between correctness and a hot-loop.

poll mask POLLPRI

- POLLRDNORM
+ POLLPRI

passes tests as well. This seems to be a signal that will be far less prevalent and less likely to cause a race.

Note: on Darwin lhr-aarch64-darwin-01 24.1.0 Darwin Kernel Version 24.1.0: Thu Oct 10 21:06:23 PDT 2024; root:xnu-11215.41.3~3/RELEASE_ARM64_T8132 arm64

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably just do both tbh.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://git.lix.systems/alois31/lix/commit/69e2ee5b25752ba5fd8644cef56fb9d627ca4a64?style=unified&whitespace=ignore-change&show-outdated= oh there is this. Yes we saw POLLPRI too just now. I guess that is a bit more per the spec that POLLHUP, which macOS's own man page says is not supposed to work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that lix has a comment there also questioning whether the alleged bug still exists

It's a pretty small diff, so let's just start formatting before we make
other changes.
@tomberek
Copy link
Contributor

tomberek commented Mar 23, 2025

Of interest: https://openradar.appspot.com/37537852

Seems like the high likelihood of a POLLRDNORM and high sleep creates a high likelihood of a race condition. The sleep was needed in XY-fashion to deal with too many POLLRDNORMs. Moving to POLLPRI seems to have both backwards compat with buggy poll() and less spinning on newer kernels.

Mic92 and others added 6 commits March 23, 2025 18:22
Syscalls can fail for many reasons and we don't want to loose the errno
and error context.
Better than just putting `1` in multiple spots.
On NixOS#8946, we faced a surprising
behaviour wrt. exception when using pthread_cancel. In a nutshell when
a thread is inside a catch block and it's getting pthread_cancel by
another one, then the original exception is bubbled up and crashes the
process.

We now poll on the notification pipe from the thread and exit when the
main thread closes its end. This solution does not exhibit surprising
behaviour wrt. exceptions.

Co-authored-by: Mic92 <joerg@thalheim.io>

Fixes NixOS#8946

See also Lix https://gerrit.lix.systems/c/lix/+/1605 which is very
similar by coincidence. Pulled a comment from that.
This was filed as NixOS#7584, but as far
as I can tell, the previous solution of POLLHUP works just fine on macOS
14. I've also tested on an ancient machine with macOS 10.15.7, which
also has POLLHUP work correctly.

It's possible this might regress some older versions of macOS that have
a kernel bug, but I went looking through the history on the sources and
didn't find anything that looked terribly convincingly like a bug fix
between 2020 and today. If such a broken version exists, it seems pretty
reasonable to suggest simply updating the OS.

Change-Id: I178a038baa000f927ea2cbc4587d69d8ab786843

Based off of commit 69e2ee5. Ericson2314 added
additional other information.
After the previous commit it should not be necessary. Furthermore, if we
*do* sleep, we'll exacerbate a race condition (in conjunction with
getting rid of the thread cancellation) that will cause test failures.
@Ericson2314 Ericson2314 force-pushed the pic/monitorhup-fix-pthread-cancellation branch from b39ad3a to 49f486d Compare March 23, 2025 23:15
@Ericson2314 Ericson2314 changed the title MonitorFdHup: replace pthread_cancel trick with a notification pipe MonitorFdHup: replace pthread_cancel trick with a notification pipe Mar 23, 2025
@Ericson2314 Ericson2314 mentioned this pull request Mar 23, 2025
@Mic92 Mic92 merged commit 648c095 into NixOS:master Mar 23, 2025
13 checks passed
Ericson2314 added a commit that referenced this pull request Mar 24, 2025
…2714

`MonitorFdHup`: replace `pthread_cancel` trick with a notification pipe (backport #12714)
Ericson2314 added a commit that referenced this pull request Mar 24, 2025
…2714

`MonitorFdHup`: replace `pthread_cancel` trick with a notification pipe (backport #12714)
Ericson2314 added a commit that referenced this pull request Mar 24, 2025
…2714

`MonitorFdHup`: replace `pthread_cancel` trick with a notification pipe (backport #12714)
}
});
};

~MonitorFdHup()
{
pthread_cancel(thread.native_handle());
close(notifyPipe.writeSide.get());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be notifyPipe.writeSide.close()? Otherwise the notifyPipe.writeSide destructor will close the same FD again, which is bad.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in #12736

@picnoir picnoir deleted the pic/monitorhup-fix-pthread-cancellation branch March 24, 2025 12:38
@tomberek tomberek added the backport 2.25-maintenance Automatically creates a PR against the branch label Mar 24, 2025
tomberek added a commit that referenced this pull request Mar 24, 2025
…2714

`MonitorFdHup`: replace `pthread_cancel` trick with a notification pipe (backport #12714)
lf- added a commit to lix-project/lix that referenced this pull request Apr 1, 2025
Apparently this is sometimes a problem for tests containing race
conditions, since it caused the daemon processes to stick around for a
second. Doesn't make writing such tests any less racey and foolish, but
we can stop doing the silly thing regardless.

CC: NixOS/nix#12714 (comment)
Change-Id: Iad6e55cf78c4a4517082194fa00a30d921224457
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.24-maintenance Automatically creates a PR against the branch backport 2.25-maintenance Automatically creates a PR against the branch backport 2.26-maintenance Automatically creates a PR against the branch backport 2.27-maintenance Automatically creates a PR against the branch infra Issue affecting the @NixOS/infra team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants