
Retry macOS select operations instead of panicking #119

Merged: 1 commit, merged into master Feb 10, 2017

Conversation

jdm (Member) commented Nov 10, 2016

#99 made the calculation of receive buffer sizes work correctly, but there appears to be no guarantee that the reported size will remain the same when we retry. This can lead to a panic in the current code (as observed in servo/servo#14146 (comment)). The test passes if we retry the operation until it succeeds, which matches other code found in the wild, such as https://github.com/openmach/openmach/blob/a185c84f7f5052b7b9f5531ab3612aaf88d3802e/libmach/flick_mach3mig_client.c#L88 .
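
The retry strategy described here can be sketched as follows. This is a minimal hypothetical mock, not the actual ipc-channel code: `MockPort`, `RecvResult`, and `recv_retrying` are illustrative stand-ins for `mach_msg` with `MACH_RCV_LARGE`, where an oversized receive fails while reporting the needed size and leaves the message queued.

```rust
use std::collections::VecDeque;

// One receive attempt either returns the message or reports the size
// that would have been needed; the message stays queued on failure,
// as with MACH_RCV_LARGE.
enum RecvResult {
    Received(Vec<u8>),
    TooLarge(usize),
}

struct MockPort {
    queue: VecDeque<Vec<u8>>,
}

impl MockPort {
    fn try_recv(&mut self, buf_len: usize) -> RecvResult {
        let front_len = self.queue.front().expect("empty queue").len();
        if front_len > buf_len {
            RecvResult::TooLarge(front_len)
        } else {
            RecvResult::Received(self.queue.pop_front().unwrap())
        }
    }
}

// Retry with a larger buffer until the receive succeeds, instead of
// panicking when the reported size changes between attempts.
fn recv_retrying(port: &mut MockPort) -> Vec<u8> {
    let mut buf_len = 4096; // the default stack buffer size used by recv
    loop {
        match port.try_recv(buf_len) {
            RecvResult::Received(msg) => return msg,
            RecvResult::TooLarge(needed) => buf_len = needed,
        }
    }
}

fn main() {
    // A 12520-byte message (a size seen in the logs below) needs one regrow.
    let mut port = MockPort {
        queue: VecDeque::from(vec![vec![0u8; 12520]]),
    };
    assert_eq!(recv_retrying(&mut port).len(), 12520);
    println!("received after retry");
}
```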

antrik (Contributor) commented Nov 10, 2016

@jdm do you have some idea what actually triggers this situation? Or, more specifically, how we could test for this in the ipc-channel test suite?...

jdm (Member Author) commented Nov 10, 2016

I'm trying to see if I can come up with an isolated testcase as we speak.

jdm (Member Author) commented Nov 10, 2016

Going to dump my investigations here:
When I add printlns to the code that allocates a larger buffer (and right before panicking), I see this output from a non-panicking run:

provided 4096, got 12520
provided 4096, got 12084
provided 4096, got 5972
provided 4096, got 11892
provided 4096, got 12024
provided 4096, got 4440
provided 4096, got 169808
provided 4096, got 5956
provided 4096, got 5956
provided 4096, got 5956

A panicking run shows me this:

provided 4096, got 12520
provided 4096, got 12084
provided 4096, got 5972
provided 4096, got 11892
provided 4096, got 12024
provided 4096, got 4440
provided 4512, got 5956
message was bigger than we were told (thread <unnamed>, at /Users/jdm/src/ipc-channel/src/platform/macos/mod.rs:600)

Interestingly, 169808 / 5956 ≈ 28.5.

jrmuizel (Contributor) commented Nov 11, 2016

What size is the message after it fails the first time?

jdm (Member Author) commented Nov 11, 2016

That's the "got N".

jdm (Member Author) commented Nov 11, 2016

More interesting results from adding a println to the send implementation when the message size exceeds 4096 (the default stack buffer size used by recv):
Successful run:

sending 12520 on port 28931
provided 4096, got 12520 on port 5891
sending 12084 on port 9223
provided 4096, got 12084 on port 9223
sending 5972 on port 8711
provided 4096, got 5972 on port 5891
sending 11892 on port 7427
provided 4096, got 11892 on port 29187
sending 12024 on port 148227
provided 4096, got 12024 on port 5891
sending 4440 on port 148227
provided 4096, got 4440 on port 5891
sending 5956 on port 8711
provided 4096, got 5956 on port 5891
sending 5952 on port 8711
provided 4096, got 5952 on port 5891
sending 169808 on port 6147
provided 4096, got 169808 on port 6147
sending 5956 on port 8711
provided 4096, got 5956 on port 5891
sending 5956 on port 8711
provided 4096, got 5956 on port 5891

Panicking run:

sending 12520 on port 28947
provided 4096, got 12520 on port 5891
sending 12084 on port 9223
provided 4096, got 12084 on port 9223
sending 5972 on port 8711
provided 4096, got 5972 on port 5891
sending 11892 on port 7427
provided 4096, got 11892 on port 30723
sending 12024 on port 47363
provided 4096, got 12024 on port 5891
sending 5956 on port 8711
sending 4440 on port 47363
provided 4096, got 4440 on port 5891
provided 4512, got 5956 on port 5891
message was bigger than we were told (thread <unnamed>, at /Users/jdm/src/ipc-channel/src/platform/macos/mod.rs:603)

Suspicious that the sizes involved in the final two sends are the same sizes observed in the erroneous case.
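
One hypothetical reading of the log above: the first failed attempt reported the 4440-byte message (4512 including trailer space), the buffer was grown to exactly that size, and the retry then surfaced the 5956-byte message instead. That grow-once-then-assert behavior can be sketched with a mock (all names are illustrative, not the real implementation; trailer sizes are ignored):

```rust
use std::collections::VecDeque;

enum RecvResult {
    Received(Vec<u8>),
    TooLarge(usize),
}

struct MockPortSet {
    // Messages queued across the ports in the set.
    queue: VecDeque<Vec<u8>>,
}

impl MockPortSet {
    fn try_recv(&mut self, buf_len: usize) -> RecvResult {
        let front_len = self.queue.front().expect("empty queue").len();
        if front_len > buf_len {
            // The message stays queued, but model the suspected port-set
            // behavior: the next attempt may surface a different message.
            let msg = self.queue.pop_front().unwrap();
            self.queue.push_back(msg);
            RecvResult::TooLarge(front_len)
        } else {
            RecvResult::Received(self.queue.pop_front().unwrap())
        }
    }
}

// The old strategy: grow the buffer once to the reported size, then
// expect the retried receive to fit.
fn recv_grow_once(set: &mut MockPortSet) -> Result<Vec<u8>, &'static str> {
    match set.try_recv(4096) {
        RecvResult::Received(msg) => Ok(msg),
        RecvResult::TooLarge(needed) => match set.try_recv(needed) {
            RecvResult::Received(msg) => Ok(msg),
            RecvResult::TooLarge(_) => Err("message was bigger than we were told"),
        },
    }
}

fn main() {
    // The sizes from the panicking run: the 4440-byte message is reported
    // first, but the 5956-byte one is surfaced on the retry.
    let mut set = MockPortSet {
        queue: VecDeque::from(vec![vec![0u8; 4440], vec![0u8; 5956]]),
    };
    assert_eq!(
        recv_grow_once(&mut set),
        Err("message was bigger than we were told")
    );
    println!("reproduced the panic condition");
}
```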

jdm (Member Author) commented Nov 11, 2016

More interesting information - the port number showing the issue in the previous run corresponds to an IpcReceiverSet (which is only used in the resource thread to select over the public and private network channels).

antrik added a commit to antrik/ipc-channel that referenced this pull request Nov 14, 2016
… sizes

If I'm reading @jdm's analysis in
servo#119 correctly, this should
hopefully be able to reproduce the problem discussed there.
antrik (Contributor) commented Nov 15, 2016

@jdm I'm a bit confused now: does this patch solve the problem or not?...

jdm (Member Author) commented Nov 15, 2016

@antrik The panic does not occur. I no longer believe this is the right solution though, since we may be silently dropping a message.

antrik (Contributor) commented Nov 16, 2016

@jdm since we are using MACH_RCV_LARGE, this shouldn't ever drop a message according to my reading of http://web.mit.edu/darwin/src/modules/xnu/osfmk/man/mach_msg.html . Rather, the message remains queued, and we can reattempt to get it.

The problem with port sets is probably that when several ports in the set have messages queued, there is apparently no guarantee that the next attempt will surface the same message as the last failed one... And if the new one needs a larger buffer than the previous one, we fail again. However, since we grow the buffer each time an attempt fails, we will ultimately succeed.

Of course this is just guesswork... Too bad my attempt at creating a test case for this failed to expose the problem :-(

(BTW, another thing I found in the doc is that apparently we do get back the actual trailer size as well on a failed attempt... Though considering the relatively small sizes involved, trying precise allocation here is probably not worth the effort.)
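
The "we will ultimately succeed" argument can be illustrated abstractly. In this hypothetical model (not the real code), each failed attempt may surface a different queued message, but since the buffer only ever grows to the reported size, the number of retries is bounded:

```rust
// surfaced[i] is the message size reported by the i-th attempt; an
// attempt succeeds once the current buffer can hold it.
fn retries_until_fit(mut buf: usize, surfaced: &[usize]) -> usize {
    let mut attempts = 0;
    for &size in surfaced {
        attempts += 1;
        if size <= buf {
            return attempts; // this attempt succeeds
        }
        buf = size; // grow to the reported size before retrying
    }
    attempts
}

fn main() {
    // Sizes from the panicking log: fail on 4440, grow; fail on 5956,
    // grow; the third attempt fits whichever of the two is surfaced.
    assert_eq!(retries_until_fit(4096, &[4440, 5956, 5956]), 3);
    // A message that already fits succeeds on the first attempt.
    assert_eq!(retries_until_fit(4096, &[1000]), 1);
    println!("buffer growth bounds the number of retries");
}
```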

jdm (Member Author) commented Nov 16, 2016

My readings agreed with you (and I noticed the trailer size bit as well), but it had not occurred to me that port sets may return a different message each time. I'll see if I can rustle up any further documentation about port sets.

jdm (Member Author) commented Nov 23, 2016

Haven't found any supporting documentation yet, so I've started reading the source. That being said, I realized a complication with the theory - the code in question is an IpcReceiverSet, but it's selecting over the public and private resource thread receivers, while only the public channel is ever used by the test. Naively, it's surprising that a different message could be retrieved, given that the same port from the set would be selected every time.

antrik (Contributor) commented Dec 30, 2016

@jdm how shall we proceed with this? Do you want to further investigate why this is happening? Do you want to try coming up with a minimal test case? Or shall we just merge it as is, hoping it really fixes whatever the actual problem is, and it won't come up again?...

jdm mentioned this pull request Feb 1, 2017
jdm (Member Author) commented Feb 7, 2017

I've spent more time looking for ideas on and off since this comment. I don't have any good ones. I think we should merge this.

jrmuizel (Contributor) commented Feb 7, 2017

This should get a better comment describing how we don't really understand what's going on.

… sizes.

We have observed the following situation in the wild:
* a thread repeatedly selects over an IpcReceiverSet containing two receivers (A and B)
* two other threads use senders that are connected to receiver A
* the senders for A send messages to receiver A that are both larger than the default receive buffer size
* receiver A is selected and the message is discovered to be too large for the receive buffer
* a new buffer is allocated that is large enough for the last message that was encountered
* the select operation is repeated with the new buffer, causing receiver A to be selected again
* the second message is returned this time, which is larger than the buffer that was allocated for the first message

There is no documented reason why this situation should occur, nor did reading the source of the Mach implementation
provide any explanation. The solution in this PR is to continue retrying selection operations while Mach reports
that the receive buffer was of insufficient size for the received message.

Multiple attempts to reproduce this problem in isolated unit tests were unsuccessful.
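
The scenario in the commit message, together with the retry-until-success fix, can be simulated with a mock (assumption: a failed receive may surface a different queued message on the next attempt; `MockPortSet` is illustrative, not the actual Mach API):

```rust
use std::collections::VecDeque;

enum RecvResult {
    Received(Vec<u8>),
    TooLarge(usize),
}

struct MockPortSet {
    queue: VecDeque<Vec<u8>>,
}

impl MockPortSet {
    fn try_recv(&mut self, buf_len: usize) -> RecvResult {
        let front_len = self.queue.front().expect("empty queue").len();
        if front_len > buf_len {
            // The message stays queued; model the unexplained behavior
            // where the next attempt may surface a different message.
            let msg = self.queue.pop_front().unwrap();
            self.queue.push_back(msg);
            RecvResult::TooLarge(front_len)
        } else {
            RecvResult::Received(self.queue.pop_front().unwrap())
        }
    }
}

// The fix: keep retrying with a grown buffer for as long as the
// receive reports an insufficient buffer, instead of asserting after
// a single regrow.
fn recv_with_retry(set: &mut MockPortSet) -> Vec<u8> {
    let mut buf_len = 4096;
    loop {
        match set.try_recv(buf_len) {
            RecvResult::Received(msg) => return msg,
            RecvResult::TooLarge(needed) => buf_len = needed,
        }
    }
}

fn main() {
    // Two messages to receiver A, both larger than the 4096-byte default,
    // as in the observed situation.
    let mut set = MockPortSet {
        queue: VecDeque::from(vec![vec![0u8; 4440], vec![0u8; 5956]]),
    };
    assert_eq!(recv_with_retry(&mut set).len(), 4440);
    assert_eq!(recv_with_retry(&mut set).len(), 5956);
    println!("both messages received despite changing sizes");
}
```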
jdm (Member Author) commented Feb 9, 2017

I have added a more detailed commit message describing the erroneous situation.

bors-servo (Contributor) commented Feb 9, 2017

📌 Commit 0dabb37 has been approved by antrik

bors-servo (Contributor) commented Feb 9, 2017

Testing commit 0dabb37 with merge ab68740...

bors-servo added a commit that referenced this pull request Feb 9, 2017
Retry macOS select operations instead of panicking
bors-servo (Contributor) commented Feb 10, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: antrik
Pushing ab68740 to master...

bors-servo merged commit 0dabb37 into master Feb 10, 2017
6 checks passed:
* continuous-integration/appveyor/branch: AppVeyor build succeeded
* continuous-integration/appveyor/pr: AppVeyor build succeeded
* continuous-integration/travis-ci/pr: The Travis CI build passed
* continuous-integration/travis-ci/push: The Travis CI build passed
* dependency-ci: Dependencies checked
* homu: Test successful
nox deleted the retry branch Feb 10, 2017