
Multithreaded fork appears flaky on OSX #14232

Closed
alexcrichton opened this Issue May 15, 2014 · 24 comments

10 participants
@alexcrichton (Member) commented May 15, 2014

In the following program, a number of threads are created, and each thread forks off a child that sleeps forever, then immediately kills it. I would expect this program to succeed continuously, but on OSX it occasionally wedges: kill() reports successful signal delivery, yet the signal apparently never reaches the child.

#include <stdlib.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <assert.h>
#include <sys/wait.h> /* for waitpid, WIFEXITED */

void *work(void *arg) {
#define M 100
  int i;
  for (i = 0; i < M; i++) {
    int a = fork();
    assert(a >= 0);
    if (a == 0) {
      sleep(1000000000);
      return arg;
    }
    assert(kill(a, SIGTERM) == 0);
    int ret;
    assert(waitpid(a, &ret, 0) == a);
    assert(!WIFEXITED(ret));
  }
  return arg;
}

int main() {
#define N 8
  pthread_t c[N];
  int i;
  for (i = 0; i < N; i++) {
    assert(pthread_create(&c[i], NULL, work, (void*) (size_t) i) == 0);
  }
  for (i = 0; i < N; i++) {
    assert(pthread_join(c[i], NULL) == 0);
  }
}

This is essentially how we fork() in libnative, and it's how we're using fork from libgreen. I'm trying to investigate a solution to this, but I'm starting to think that multithreaded fork is just fundamentally broken on basically every platform except Linux.

This issue has appeared as various forms of flakiness on the bots, which is why I started investigating.

@thestinger (Contributor) commented Sep 19, 2014

It should really just use posix_spawn. It's significantly faster and has fewer issues on non-Linux platforms.

@steveklabnik (Member) commented Feb 14, 2015

@alexcrichton with the runtime changes, is this still relevant?

@alexcrichton (Member, Author) commented Feb 16, 2015

Unfortunately, yes.

@alexcrichton (Member, Author) commented Mar 5, 2015

I... don't think there's anything we can do about this, so I think it's just something we're going to have to live with unfortunately.

@thestinger (Contributor) commented Mar 5, 2015

Intentionally providing broken functionality rather than using posix_spawn? At least document that Rust's subprocess management doesn't work.

@aturon (Member) commented Mar 5, 2015

I'm uncomfortable just closing this outright without some further discussion; I'm going to re-open as needs-decision.

@aturon aturon reopened this Mar 5, 2015

@aturon (Member) commented Mar 5, 2015

triage: P-backcompat-libs (1.0 beta)

@rust-highfive rust-highfive added this to the 1.0 beta milestone Mar 5, 2015

@aturon (Member) commented Mar 24, 2015

triage: P-backcompat-libs (1.0)

@pnkfelix (Member) commented Apr 2, 2015

nominating for discussion at next week's triage.

@pnkfelix pnkfelix added the I-nominated label Apr 2, 2015

@alexcrichton alexcrichton removed this from the 1.0 milestone Apr 2, 2015

@alexcrichton alexcrichton removed the A-runtime label Apr 9, 2015

@pnkfelix (Member) commented Apr 9, 2015

There are many open questions here, and many ways that we might try to resolve or at least work around this in future versions of the stdlib.

One fundamental issue is that some usage patterns supported by a separate fork followed by exec are not supported out of the box by posix_spawn.

We should however:

  • investigate where we can do a best-effort attempt to use posix_spawn when possible
  • document the current behavior, with its flakiness, along with our plans for potential changes in the future.

The former bullet need not happen for 1.0, but the latter bullet definitely should happen for 1.0.

@aturon (Member) commented Apr 9, 2015

(In particular, posix_spawn does not support setting the working directory for the child process)

@pnkfelix (Member) commented Apr 9, 2015

assigning to @aturon to follow through on the 1.0 parts.

@pnkfelix pnkfelix assigned aturon and unassigned alexcrichton Apr 9, 2015

@pnkfelix pnkfelix added this to the 1.0 milestone Apr 9, 2015

@pnkfelix pnkfelix removed the I-nominated label Apr 9, 2015

@rprichard (Contributor) commented Apr 10, 2015

FWIW: I modified the test program to call work directly from main, without creating any threads. It still hung fairly reliably on my OS X 10.10.2 laptop. I increased M from 100 to 20000. It also printed the (!WIFEXITED(ret)) assertion failure, but infrequently.

@alexcrichton (Member, Author) commented Apr 10, 2015

@rprichard fascinating! I have also been able to reproduce that, but I haven't gotten the assertion to trigger yet, it just deadlocks when the child doesn't receive the signal. I'm getting more and more suspicious of this over time...

@brson brson removed this from the 1.0 milestone Apr 24, 2015

@brson brson added P-medium and removed P-backcompat-libs labels Apr 24, 2015

@brson (Contributor) commented Apr 24, 2015

Hopefully this is a bug that can be fixed in the fullness of time, because the API is not changing now.

@pnkfelix (Member) commented Apr 28, 2015

In the notes from triage, I noted that this should happen for 1.0:

document the current behavior, with its flakiness, along with our plans for potential changes in the future.

Has this documentation actually happened? Should we open a separate issue?

@alexcrichton (Member, Author) commented Apr 28, 2015

The documentation has not changed yet, but it's not something I'd want to block the release on, so I'm not sure a new issue is worth it (it would still be nice to do, though).

@geofft (Contributor) commented May 15, 2015

I can reproduce the gist of @rprichard's results on an iMac13,2 ("27-inch, Late 2012") running 10.10.3: if I replace main with work(NULL), deadlocks are rarer, but certainly possible (especially if I run eight parallel shell loops running the program 500 times). So this isn't (solely) about threads: single-threaded programs can hit the same bug too.

Also, if I replace the fork call with a posix_spawn of sleep 100000, deadlocks are rarer, but still not impossible: I've observed at least one hang where manually sending SIGTERM to the sleep process let the program terminate normally. So this isn't (solely) about fork: posix_spawn can hit the same bug too.

Looking at the xnu kernel source for kill (see bsd/kern/kern_sig.c), it's implemented via psignal / psignal_internal, both of which return void and have no error reporting. It's certainly possible there's something racy inside xnu itself between setting up a new process and returning to the caller of fork / posix_spawn. I don't fully understand the Mach-plus-BSD setup in xnu, but I could imagine that there's a period of time when the process exists enough to have a BSD pid, but not enough (missing a Mach task or thread?) to get signals.

If I add a 1-millisecond sleep in the parent, the original version of the program (threads and fork) no longer seems to deadlock at all, but it'd be useful if someone else can confirm that. Which brings to mind a bad solution: we could just sleep a millisecond or so in library code after forking on OS X. I wonder if that would help the issues with @alexcrichton's flaky bots.

Can we rename this issue to something more like "signals immediately after process creation are racy on OS X", since it's not strictly related to multiple threads? I think the problem only appears worse with threads because more cores are in use, which exercises races in the kernel more.

@alexcrichton (Member, Author) commented May 15, 2015

Wow, thanks for the thorough investigation @geofft! I think at this point it's pretty clear that there's not a whole lot we can do about this (if it's a kernel bug of some form), beyond your sleeping suggestion, but unfortunately even that wouldn't be rock-solid.

I'm going to nominate this bug for closure at this point, however, as the prospects for solving it seem to be getting bleaker every day.

@pnkfelix (Member) commented May 16, 2015

This concern of mine (from an earlier comment) has not yet been addressed, AFAICT:

In the notes from triage, I noted that this should happen for 1.0:

document the current behavior, with its flakiness, along with our plans for potential changes in the future.

Has this documentation actually happened? Should we open a separate issue?

I would argue against closing this until that's addressed.

@geofft (Contributor) commented May 16, 2015

The thinking earlier in the thread was that this could be resolved by switching to a more reliable OS X API, but as far as I can tell, no such API exists: this is a fundamental bug in how OS X signal delivery works, regardless of whether you're using fork or posix_spawn or whatever. I think it might have made sense to document something like "Currently, Rust's subprocess mechanism uses a less-reliable API, but we're looking at switching to a more reliable one" (or "but you should use posix_spawn manually if you care", or something). But I don't think it's necessary to document straight-up platform bugs.

We should probably report this to Apple, though. I could do that.

@comex (Contributor) commented Jun 4, 2015

I'm curious why a bug that supposedly only occurs when a new process is killed near-instantly after starting up is showing up as an issue in practice (e.g. on the bots) - why are you starting a process if you don't want it to do anything? Anyway, if it's a child process, a waitpid-with-WNOHANG and kill loop would probably work around any cases where the process doesn't actually die.

@alexcrichton alexcrichton added T-libs A-docs and removed I-nominated labels Jun 9, 2015

alexcrichton added a commit to alexcrichton/rust that referenced this issue Aug 4, 2015

std: Ignore test_process_mask on OSX
The investigation into rust-lang#14232 discovered that it's possible that signal delivery
to a newly spawned process is racy on OSX. This test has been failing spuriously
on the OSX bots for some time now, so ignore it as we don't currently know a
solution and it looks like it may be out of our control.


@steveklabnik (Member) commented Aug 5, 2015

Can someone help me out with what exactly needs to be documented here, and what our plans are? It's been a long time since the discussion happened, and I don't want something incorrect.

@alexcrichton (Member, Author) commented Aug 5, 2015

OK, I'm going to close this in favor of #27537. With @geofft's investigation it's becoming clearer that there's basically not much we can do here, so the "documentation bug" should just be documenting what we're doing. I don't really want to add a clause to Command::spawn saying something along the lines of "we think there's a race in the OSX kernel"; rather, I'd add "here are the syscalls we're calling to spawn a process".

As a result, this is basically not-a-bug, and the documentation is covered by #27537.

bors added a commit that referenced this issue Aug 6, 2015

Auto merge of #27516 - alexcrichton:osx-flaky-zomg, r=brson
The investigation into #14232 discovered that it's possible that signal delivery
to a newly spawned process is racy on OSX. This test has been failing spuriously
on the OSX bots for some time now, so ignore it as we don't currently know a
solution and it looks like it may be out of our control.

bors added a commit that referenced this issue Aug 10, 2015

Auto merge of #27516 - alexcrichton:osx-flaky-zomg, r=brson
The investigation into #14232 discovered that it's possible that signal delivery
to a newly spawned process is racy on OSX. This test has been failing spuriously
on the OSX bots for some time now, so ignore it as we don't currently know a
solution and it looks like it may be out of our control.