Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore EINTR on unix::OsIpcReceiverSet::select #137

Merged
merged 2 commits into from Feb 11, 2017

Conversation

@dlrobertson
Copy link
Collaborator

dlrobertson commented Feb 2, 2017

Ignore EINTR on Poll::poll in unix::OsIpcReceiverSet::select. Returning
EINTR as an error on select causes panics in servo.

Resolves: #131

@@ -472,7 +472,12 @@ impl OsIpcReceiverSet {
let mut selection_results = Vec::new();
match self.poll.poll(&mut self.events, None) {
Ok(sz) if sz > 0 => {},
_ => { return Err(UnixError::last()); }
Err(ref e) if e.kind() != ErrorKind::Interrupted => {
return Ok(selection_results);

This comment has been minimized.

@dlrobertson

dlrobertson Feb 2, 2017

Author Collaborator

Debated about what to return here. I think a empty vec makes the most sense.

This comment has been minimized.

@antrik

antrik Feb 2, 2017

Contributor

Well, the canonical way of handling EINTR is using a retry loop.

In most cases it's probably not strictly necessary to handle it here, since typical users calling select() in a loop and iterating over the result each time will generally just work -- not sure though we should rely on this, as we never return an empty result otherwise as far as I can tell. On a more conceptual level, it seems less ideal to pass control back to the caller unnecessarily -- better just handle it transparently I'd say.

This comment has been minimized.

@dlrobertson

dlrobertson Feb 2, 2017

Author Collaborator

Good point... I like the retry better. The update will include this

@@ -472,7 +472,12 @@ impl OsIpcReceiverSet {
let mut selection_results = Vec::new();
match self.poll.poll(&mut self.events, None) {
Ok(sz) if sz > 0 => {},
_ => { return Err(UnixError::last()); }
Err(ref e) if e.kind() != ErrorKind::Interrupted => {

This comment has been minimized.

@antrik

antrik Feb 2, 2017

Contributor

Why !=?

This comment has been minimized.

@dlrobertson

dlrobertson Feb 2, 2017

Author Collaborator

Good catch. I changed the control flow right before committing and forgot to change this.

This comment has been minimized.

@antrik

antrik Feb 2, 2017

Contributor

I wonder whether it would be realistic to add a test case for this? :-)

This comment has been minimized.

@dlrobertson

dlrobertson Feb 3, 2017

Author Collaborator

Another good point. This one is definitely difficult to reproduce as a unit test. I made my best attempt at it with b03e9db. Parts of it feel a bit hacky, like the "kill sled", but it does cause older versions to panic. It also caught a bug in the code! Hence the new changes to OsIpcReceiverSet::select

@antrik
Copy link
Contributor

antrik commented Feb 2, 2017

While the solution is probably fine, what I'd like to understand is why this only came up with mio, and wasn't a problem with poll()?

@dlrobertson dlrobertson force-pushed the dlrobertson:ignore_eintr branch 4 times, most recently from 579c77d to b03e9db Feb 3, 2017
@dlrobertson
Copy link
Collaborator Author

dlrobertson commented Feb 3, 2017

While the solution is probably fine, what I'd like to understand is why this only came up with mio, and wasn't a problem with poll()?

Agreed. This is really interesting. I'm not really sure that I'll be able to definitively answer this. It may have been just a timing issue? Pre-mio versions do fail the unit test I just added, which causes me to lean towards the timing theory.

@dlrobertson dlrobertson force-pushed the dlrobertson:ignore_eintr branch 2 times, most recently from 7e94d53 to 56b66bb Feb 3, 2017
_ => { return Err(UnixError::last()); }
Ok(sz) => {
if sz < 1 {
return self.select();

This comment has been minimized.

@antrik

antrik Feb 3, 2017

Contributor

Why is this necessary?

This comment has been minimized.

@dlrobertson

dlrobertson Feb 5, 2017

Author Collaborator

In testing poll always returned 0 after each call that returned EINTR. I need to do some more testing and debugging around this.

},
Err(ref e) => {
if e.kind() == ErrorKind::Interrupted {
return self.select();

This comment has been minimized.

@antrik

antrik Feb 3, 2017

Contributor

Sneaky ;-)

I think I'd actually be more comfortable with an explicit loop -- not sure how others feel about this...

This comment has been minimized.

@jdm

jdm Feb 3, 2017

Member

I agree.

This comment has been minimized.

@antrik

antrik Feb 4, 2017

Contributor

Indeed this is not even a real tail call (as @mbrubeck pointed out on IRC), since the Vec has a destructor -- so even in optimised builds this clutters the stack... Not much of an issue here, as it's a rare exceptional situation, so the worst thing I guess is the bloated backtrace -- but it's a good thing to keep in mind in general I guess :-)

This comment has been minimized.

@dlrobertson

dlrobertson Feb 5, 2017

Author Collaborator

Good call. I replaced the recursive calls with a while loop.

let parent_pid = unsafe {
libc::getpid()
};
unsafe {

This comment has been minimized.

@antrik

antrik Feb 3, 2017

Contributor

Not sure it's a good idea to have consecutive unsafe blocks... Generally we just use a larger enclosing block, if there are no (non-trivial) safe operations in between.

libc::signal(libc::SIGTERM,
sigterm_handler as fn(libc::c_int) as *mut libc::c_void as libc::sighandler_t);
}
let _ = unsafe { fork(|| {

This comment has been minimized.

@antrik

antrik Feb 3, 2017

Contributor

No need to explicitly ignore the return value here.

BTW, I wonder whether you could send the signal the other way instead, so only the child process is affected, avoiding problematic interaction with other tests? (Admittedly, this would complicate the code, as the child would have to report back the result I guess...)

This comment has been minimized.

@dlrobertson

dlrobertson Feb 5, 2017

Author Collaborator

I wonder whether you could send the signal the other way instead

Overall, I like sending the signal to the child better. It could make debugging a little harder if an unexpected panic occurs in the select. Overall much better though.

}
tx0.send(b"A", vec![], vec![]).unwrap();
})};
for _ in rx_set.select().unwrap() {}

This comment has been minimized.

@antrik

antrik Feb 3, 2017

Contributor

Not sure how useful it is to iterate the results without actually checking them... I'd suggest invoking next() to get the first result, like some other receiver_set() tests are already doing. (Possibly followed by another next() to assert that it's None.)

This comment has been minimized.

@dlrobertson

dlrobertson Feb 5, 2017

Author Collaborator

Good point

@antrik
Copy link
Contributor

antrik commented Feb 3, 2017

I was thinking of a slightly different test: rather then sending SIGTERM and catching it, you could just send SIGSTOP followed by SIGSTART (after a short delay), like this:

libc::kill(parent_pid, libc::SIGSTOP);
thread::sleep(Duration::from_millis(3));
libc::kill(parent_pid, libc::SIGCONT);

(Possibly with another wait before the first signal -- though it doesn't seem necessary in my testing...)

AIUI, that's more or less what happens when running in a debugger. Unlike the SIGTERM approach, it appears to be fully reliable in triggering the failure with mio, even without any looping. It never triggers with the poll() implementation though, so indeed this seems to make a difference.

(The only way I got a failure with the old implementation when running multi-threaded is by using a humongous number of SIGTERM iterations... It's much easier to trigger with just one thread, but still needs many iterations.)

Another option I considered is using SIGALRM. The advantage would be that it doesn't necessarily need a separate process -- though I'm not sure how it interacts with threads; so it might actually still need looping or a separate process for a reliable result. (I haven't actually tested this approach.)

BTW, did you see that the test timed out on the MacOS builder? I wonder whether it actually hangs there for some reason -- or sending many signals just takes very long on MacOS...

@dlrobertson dlrobertson force-pushed the dlrobertson:ignore_eintr branch 4 times, most recently from e2500fd to 9dae7c1 Feb 5, 2017
@dlrobertson
Copy link
Collaborator Author

dlrobertson commented Feb 5, 2017

I was thinking of a slightly different test: rather then sending SIGTERM and catching it, you could just send SIGSTOP followed by SIGSTART (after a short delay), like this:

I like this a lot better. The current implementation uses this.

BTW, did you see that the test timed out on the MacOS builder? I wonder whether it actually hangs there for some reason -- or sending many signals just takes very long on MacOS...

I did notice this, and it does appear to still be happening. I'll try to reproduce it locally on my mac.

(The only way I got a failure with the old implementation when running multi-threaded is by using a humongous number of SIGTERM iterations... It's much easier to trigger with just one thread, but still needs many iterations.)

I also found this to be true.

Another option I considered is using SIGALRM. The advantage would be that it doesn't necessarily need a separate process -- though I'm not sure how it interacts with threads

Hadn't really thought about using SIGALRM. The current implementation doesn't use it, but I'll continue to look into the best options for producing a reproducible test case.

Ok(_) => tx0.send(b"\x00", vec![], vec![]).unwrap(),
Err(e) => {
let err = format!("{:?}", e);
tx0.send(err.into_bytes().as_slice(), vec![], vec![]).unwrap();

This comment has been minimized.

@antrik

antrik Feb 6, 2017

Contributor

You can just use as_bytes() instead of into_bytes().as_slice().

Note that this is not exactly the same, since into_bytes() consumes the original string, and the resulting temporary byte string is dropped at the end of the expression; while as_bytes() keeps the original string in place. In this case (and many others), this doesn't really make a difference though :-)

This comment has been minimized.

@dlrobertson

dlrobertson Feb 6, 2017

Author Collaborator

Fixed in update

}
unsafe {
// reap the child
libc::waitpid(child_pid, &mut stat, 0);

This comment has been minimized.

@antrik

antrik Feb 6, 2017

Contributor

Why not use the child_pid.wait(), like other tests are doing?

This comment has been minimized.

@dlrobertson

dlrobertson Feb 6, 2017

Author Collaborator

Old habits die hard 😄

tx0.send(b"B", vec![], vec![]).unwrap();
// Send the result of the select back to the parent
match rx_set.select() {
Ok(_) => tx0.send(b"\x00", vec![], vec![]).unwrap(),

This comment has been minimized.

@antrik

antrik Feb 6, 2017

Contributor

You are not checking at all any more whether the select() actually yields the expected result?

This comment has been minimized.

@dlrobertson

dlrobertson Feb 6, 2017

Author Collaborator

Fixed in update

// Receive the results from the child and panic if the the
// returned value from select was an error.
result_set.add(rx0).unwrap();
for result in result_set.select().unwrap() {

This comment has been minimized.

@antrik

antrik Feb 6, 2017

Contributor

I don't see a need to use a set here: this channel is only for getting back the result; the actual select test is done when sending to the child...

This comment has been minimized.

@dlrobertson

dlrobertson Feb 6, 2017

Author Collaborator

Agreed. Fixed in update.

Ok(sz) if sz > 0 => {},
_ => { return Err(UnixError::last()); }
};
let mut sz = 0;

This comment has been minimized.

@antrik

antrik Feb 6, 2017

Contributor

Nitpick: I generally prefer more meaningful variable names, especially when they have a non-trivial scope -- which sz now has after moving it out of the match.

This comment has been minimized.

@dlrobertson

dlrobertson Feb 6, 2017

Author Collaborator

Good call. I swapped num_events and sz.

_ => { return Err(UnixError::last()); }
};
let mut sz = 0;
while sz <=0 {

This comment has been minimized.

@antrik

antrik Feb 6, 2017

Contributor

Nitpick: missing blank after <=.

BTW, since this is usize, I'd say it would be less misleading to just check == 0?

This comment has been minimized.

@dlrobertson

dlrobertson Feb 6, 2017

Author Collaborator

Two good catches. This is now cleaner.

Ok(num_events) => {
if num_events > 0 {
sz = num_events;
}

This comment has been minimized.

@antrik

antrik Feb 6, 2017

Contributor

There is no need for this if: you are checking the same thing in the while condition.

(Or you could use a loop instead of while, and do an explicit break when getting results; thus keeping the condition here. Not sure though whether this would be clearer, or less clear...)

@dlrobertson dlrobertson force-pushed the dlrobertson:ignore_eintr branch 2 times, most recently from a1f7daf to b5c702b Feb 6, 2017
@dlrobertson
Copy link
Collaborator Author

dlrobertson commented Feb 6, 2017

Cleaned up the test a bit and addressed comments in the review. I also went back and found the minimal example needed to trigger the mac hang. The following should cause us to hang on mac

#[test]
fn basic() {
    let (tx0, rx0) = platform::channel().unwrap();
    let child_pid = unsafe {
        fork(|| {
            // child panics
            tx0.send(b"\x00", vec![], vec![]).unwrap();
        })
    };
    let (res, _, _) = rx0.recv().unwrap();
    assert!(res == b"\x00");
    child_pid.wait();
}
Copy link
Contributor

antrik left a comment

Looks good now apart from one remaining issue with the test case.

Ok(result) => {
for evt in result {
if let OsIpcSelectionResult::DataReceived(_, data, _, _) = evt {
assert!(data == b"\x00");

This comment has been minimized.

@antrik

antrik Feb 6, 2017

Contributor

I'm sorry to say that the error checking is both overly complicated and incomplete: it doesn't check for wrong number of results (including 0); nor for unexpected ChannelClosed instead of DataReceived...

As I mentioned before, you can just do let (id, data, _, _) = result.into_iter().next().unwrap().unwrap(), like existing test cases.

(In case it's not clear, the first unwrap() is for the next() returning Some(...), while the second one is a custom method of OsIpcSelectionResult.)

Or being really thorough:

let mut result_iter = result.into_iter();
let (id, data, _, _) = result_iter.next().unwrap().unwrap();
assert!(id == rx_id);
assert!(data == b"\x00");
assert!(result_iter.next().is_none());

This comment has been minimized.

@dlrobertson

dlrobertson Feb 7, 2017

Author Collaborator

I tried to follow this, but instead of using asserts I formatted strings to send back to the parent so that we get informative error messages if we hit one of these.

This comment has been minimized.

@antrik

antrik Feb 8, 2017

Contributor

Well, I see your point -- but if you really want to go down this road, you'd have to match all the result values manually in the child process, rather than using any asserts/unwraps etc. Or maybe you could use something with set_hook() and/or catch_unwind(), which would allow sticking with the usual assert/unwrap stuff -- but that seems tricky to get right as well.

To be honest, I'm not sure either is worth the effort. Just send back the correct result I'd say, and panic the child process otherwise -- like the other multi-threaded/multi-process tests do. Running into a test failure is not the common case; when someone does, they kinda just have to figure out that they need to run the tests with -- --nocapture to see the additional error output... Not ideal, but good enough for the tests.

This comment has been minimized.

@dlrobertson

dlrobertson Feb 9, 2017

Author Collaborator

Ah! Another good point... I'll just stick with the assert!

@antrik
Copy link
Contributor

antrik commented Feb 7, 2017

@danlrobertson the other cross-process tests are using OsIpcOneShotServer. I suspect the problem is that raw Mach ports -- unlike FDs -- are not automatically cloned on fork(); so the child never actually gets the sender.

(This is actually quite unsurprising -- we have the problem with raw Mach ports in GNU Hurd...)

@antrik antrik mentioned this pull request Feb 7, 2017
@dlrobertson dlrobertson force-pushed the dlrobertson:ignore_eintr branch from b5c702b to f789c23 Feb 7, 2017
@dlrobertson
Copy link
Collaborator Author

dlrobertson commented Feb 7, 2017

I suspect the problem is that raw Mach ports -- unlike FDs -- are not automatically cloned on fork(); so the child never actually gets the sender.

Interesting. I changed the cfg for the test so it doesn't run on mac.

@antrik
Copy link
Contributor

antrik commented Feb 8, 2017

@danlrobertson well, it shouldn't be too hard to change the test to use OsIpcOneShotServer so it works with the MacOS back-end as well? I think it would be worthwhile.

Though I just realised that there is also another problem there: any messages sent at the platform level will be padded to a multiple of 4 bytes on MacOS -- so you need to avoid odd lengths, or truncate the results, like other platform tests do...

(Or maybe you could consider implementing this as a top-level test instead of a platform test? Automatic serialisation of the messages would avoid such pitfalls, while otherwise not changing much in this case I think; and I'd don't see any obvious obstacles from the top of my head...)

@dlrobertson dlrobertson force-pushed the dlrobertson:ignore_eintr branch from f789c23 to ab37d6d Feb 9, 2017
Ignore EINTR on Poll::poll in unix::OsIpcReceiverSet::select. Returning
EINTR as an error on select causes panics in servo.
@dlrobertson dlrobertson force-pushed the dlrobertson:ignore_eintr branch from ab37d6d to 75805c1 Feb 9, 2017
@dlrobertson
Copy link
Collaborator Author

dlrobertson commented Feb 9, 2017

Updated the unit test to use OsIpcOneShotServer, removed some unneeded clutter from the test, and extended everything sent to a multiple of 4 bytes.

fn receiver_set_eintr() {
let (server, name) = OsIpcOneShotServer::new().unwrap();
let (tx1, rx1) = platform::channel().unwrap();
let mut rx_set = OsIpcReceiverSet::new().unwrap();

This comment has been minimized.

@antrik

antrik Feb 9, 2017

Contributor

Are you sure rx_set actually survives the fork() on MacOS? I had assumed that you would have to create the main channel and receiver set in the child process, and send tx1 to the parent process through tx0 along with the "ready" message...

(In fact I don't think forking a receiver or receiver set is a good idea even with the Unix back-end using sockets, since that breaks the guarantee of having a unique receiver I believe.)

This comment has been minimized.

@dlrobertson

dlrobertson Feb 11, 2017

Author Collaborator

create the main channel and receiver set in the child process, and send tx1 to the parent process through tx0 along with the "ready" message

Good point. The update does this.

assert_eq!(rx_id, id);
assert_eq!(vec![0, 0, 0, 0], data);
assert!(result_iter.next().is_none());
})

This comment has been minimized.

@antrik

antrik Feb 9, 2017

Contributor

Uh... Seems I have been unclear again: you still need to send an indicator to the parent that the child received the expected result successfully before it exits -- otherwise the parent (and thus test harness) never knows whether the test actually succeeded or not... My point was just that you don't really have to send anything back when the child detects some failure, since the parent never getting the success indication is sufficient to make the test case fail (by timing out).

This comment has been minimized.

@dlrobertson

dlrobertson Feb 11, 2017

Author Collaborator

Your right, my bad... I wasn't thinking when I removed the last send 😄

let mut result_iter = result.into_iter();
let (id, data, _, _) = result_iter.next().unwrap().unwrap();
assert_eq!(rx_id, id);
assert_eq!(vec![0, 0, 0, 0], data);

This comment has been minimized.

@antrik

antrik Feb 9, 2017

Contributor

Nitpick: it's rather inconsistent to use a vector here, while you use a byte string slice in send()... You could use assert_eq!(b"\x00\x00\x00\x00".as_ref().to_owned(), data) or assert_eq!(b"\x00\x00\x00\x00".as_ref(), &*data).

(And while at it, I'd say [0; 4] would be more readable than b"\x00\x00\x00\x00", if you want to stick with the 0s. Probably better though to use byte string slices of four actual characters instead -- preferably different one in each send(), to tell them apart...)

This comment has been minimized.

@dlrobertson

dlrobertson Feb 11, 2017

Author Collaborator

assert_eq!(b"\x00\x00\x00\x00".as_ref(), &*data) I like this best. The update uses this and uses more descriptive strings.

@dlrobertson dlrobertson force-pushed the dlrobertson:ignore_eintr branch from 75805c1 to a52087b Feb 11, 2017
#[cfg(not(any(feature = "force-inprocess", target_os = "windows", target_os = "android")))]
fn receiver_set_eintr() {
let (start_server, start_name) = OsIpcOneShotServer::new().unwrap();
let (end_server, end_name) = OsIpcOneShotServer::new().unwrap();

This comment has been minimized.

@antrik

antrik Feb 11, 2017

Contributor

Why are you using a separate server (channel) for the final result, rather than just sending the success message over the first channel as well?

Note that this isn't really wrong; so I wouldn't mind merging it as is -- it just makes the test more complicated than necessary...

This comment has been minimized.

@dlrobertson

dlrobertson Feb 11, 2017

Author Collaborator

Why are you using a separate server (channel) for the final result, rather than just sending the success message over the first channel as well?

Sorry, just hadn't used OneShotServers before and didn't read the code close enough. Updated in the latest changes.

// Wait until the child is ready
let (_, res, mut channels, _) = start_server.accept().unwrap();
assert!(res == b" Ready! ");
assert!(channels.len() == 1);

This comment has been minimized.

@antrik

antrik Feb 11, 2017

Contributor

I don't think that last assert is really necessary: as this receive is not really part of the stuff we are testing here (rather just needed for the test harness), there is no need to be too thorough here. If we don't get the channel we require for the test, the next unwrap() will panic anyway.

This comment has been minimized.

@dlrobertson

dlrobertson Feb 11, 2017

Author Collaborator

True, this is a complicated enough test, so IMO any removal of unnecessary asserts or code is a win.

If Poll::poll is interrupted by a signal, it returns EINTR. A test
should be created that sends a signal that will be caught while blocking
on OsIpcReceiverSet::select.
@dlrobertson dlrobertson force-pushed the dlrobertson:ignore_eintr branch from a52087b to 71c5b8a Feb 11, 2017
@antrik
Copy link
Contributor

antrik commented Feb 11, 2017

@bors-servo
Copy link
Contributor

bors-servo commented Feb 11, 2017

📌 Commit 71c5b8a has been approved by antrik

@bors-servo
Copy link
Contributor

bors-servo commented Feb 11, 2017

Testing commit 71c5b8a with merge c15142c...

bors-servo added a commit that referenced this pull request Feb 11, 2017
Ignore EINTR on unix::OsIpcReceiverSet::select

Ignore `EINTR` on Poll::poll in `unix::OsIpcReceiverSet::select`. Returning
`EINTR` as an error on select causes panics in servo.

Resolves: #131
@bors-servo
Copy link
Contributor

bors-servo commented Feb 11, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: antrik
Pushing c15142c to master...

@bors-servo bors-servo merged commit 71c5b8a into servo:master Feb 11, 2017
3 checks passed
3 checks passed
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
homu Test successful
Details
@dlrobertson dlrobertson deleted the dlrobertson:ignore_eintr branch Feb 11, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants
You can’t perform that action at this time.