Retry macOS select operations instead of panicking #119
|
cc @jrmuizel |
|
@jdm do you have some idea what actually triggers this situation? Or, more specifically, how we could test for this in the |
|
I'm trying to see if I can come up with an isolated testcase as we speak. |
|
Going to dump my investigations here:
A panicking run shows me this:
Interestingly, 169808 / 5956 = 28. |
|
What size is the message after it fails the first time? |
|
That's the "got N". |
|
More interesting results from adding a println to the send implementation when the message size exceeds 4096 (the default stack buffer size used by recv):
Panicking run:
Suspicious that the sizes involved in the final two sends are the same sizes observed in the erroneous case. |
|
More interesting information - the port number showing the issue in the previous run corresponds to an IpcReceiverSet (which is only used in the resource thread to select over the public and private network channels). |
|
@jdm I'm a bit confused now: does this patch solve the problem or not?... |
|
@antrik The panic does not occur. I no longer believe this is the right solution though, since we may be silently dropping a message. |
|
@jdm since we are using The problem with port sets is probably that when several ports in the set have messages queued, there is apparently no guarantee that on the next attempt we will get the same one as on the last fail... And if the new one needs a larger buffer than the previous one, we fail again. However, since we grow the buffer each time an attempt fails, we will succeed ultimately. Of course this is just guesswork... Too bad my attempt at creating a test case for this failed to expose the problem :-( (BTW, another thing I found in the doc is that apparently we do get back the actual trailer size as well on a failed attempt... Though considering the relatively small sizes involved, trying precise allocation here is probably not worth the effort.) |
|
My readings agreed with you (and I noticed the trailer size bit as well), but it had not occurred to me that port sets may return a different message each time. I'll see if I can rustle up any further documentation about port sets. |
|
Haven't found any supporting documentation yet, so I've started reading the source. That being said, I realized a complication with the theory - the code in question is an IpcReceiverSet, but it's selecting over the public and private resource thread receivers, while only the public channel is ever used by the test. Naively, it's surprising that a different message could be retrieved, given that the same port from the set would be selected every time. |
|
@jdm how shall we proceed with this? Do you want to further investigate why this is happening? Do you want to try coming up with a minimal test case? Or shall we just merge it as is, hoping it really fixes whatever the actual problem is, and it won't come up again?... |
|
I've spent more time looking for ideas on and off since this comment. I don't have any good ones. I think we should merge this. |
|
This should get a better comment describing how we don't really understand what's going on. |
… sizes.

We have observed the following situation in the wild:
* a thread repeatedly selects over an IpcReceiverSet containing two receivers (A and B)
* two other threads use senders that are connected to receiver A
* the senders for A send messages to receiver A that are both larger than the default receive buffer size
* receiver A is selected and the message is discovered to be too large for the receive buffer
* a new buffer is allocated that is large enough for the last message that was encountered
* the select operation is repeated with the new buffer, causing receiver A to be selected again
* the second message is returned this time, which is larger than the buffer that was allocated for the first message

There is no documented reason why this situation should occur, nor did reading the source of the Mach implementation provide any explanation. The solution in this PR is to continue retrying selection operations while Mach reports that the receive buffer was of insufficient size for the received message.

Multiple attempts to reproduce this problem in isolated unit tests were unsuccessful.
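For readers who want the retry idea in concrete form, here is a minimal sketch in Rust. It is not the code from this PR and it does not call Mach at all: `FakePortSet`, `RecvError`, and `recv_with_retry` are hypothetical names invented for illustration, and the fake set deliberately offers a different message after each failed attempt in order to model the behaviour described above (which, as noted, we could not reproduce in isolation).

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the status of one receive attempt.
#[derive(Debug, PartialEq)]
enum RecvError {
    /// The offered message did not fit; carries the size it needs.
    TooLarge(usize),
    /// Nothing is queued.
    Empty,
}

/// Hypothetical model of a port set with queued messages.
/// Assumption: after a failed attempt, a *different* message may be offered.
struct FakePortSet {
    pending: VecDeque<Vec<u8>>,
}

impl FakePortSet {
    /// Try to copy the message at the head of the queue into `buf`.
    fn try_recv(&mut self, buf: &mut Vec<u8>) -> Result<usize, RecvError> {
        let needed = match self.pending.front() {
            Some(msg) => msg.len(),
            None => return Err(RecvError::Empty),
        };
        if needed > buf.len() {
            // Model the surprising behaviour: the next attempt sees another message.
            self.pending.rotate_left(1);
            return Err(RecvError::TooLarge(needed));
        }
        let msg = self.pending.pop_front().expect("front() was Some");
        buf[..msg.len()].copy_from_slice(&msg);
        Ok(msg.len())
    }
}

/// Keep retrying while the receive reports an undersized buffer, growing the
/// buffer to the reported size each time. Returns the message and attempt count.
fn recv_with_retry(
    set: &mut FakePortSet,
    initial_size: usize,
) -> Result<(Vec<u8>, usize), RecvError> {
    let mut buf = vec![0u8; initial_size];
    let mut attempts = 0;
    loop {
        attempts += 1;
        match set.try_recv(&mut buf) {
            Ok(len) => {
                buf.truncate(len);
                return Ok((buf, attempts));
            }
            // Instead of panicking on a second failure, grow and try again.
            Err(RecvError::TooLarge(needed)) => buf.resize(needed, 0),
            Err(other) => return Err(other),
        }
    }
}
```

With two queued messages of 5000 and 9000 bytes and an initial buffer of 4096, the first attempt fails on the 5000-byte message, the second fails on the 9000-byte one, and the third succeeds. Because the buffer only grows after a failure, the loop terminates once it is as large as the biggest queued message.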
|
I have added a more detailed commit message detailing the erroneous situation. |
|
@bors-servo r+ |
|
|
Retry macOS select operations instead of panicking
|
|
jdm commented Nov 10, 2016
#99 made the calculation for buffer sizes for receives work correctly, but there appears to be no guarantee that the size will remain the same when we retry. This can lead to a panic in the current code (as observed in servo/servo#14146 (comment)). The test passes if we retry the operation until it succeeds, which matches other code I've found in the wild like https://github.com/openmach/openmach/blob/a185c84f7f5052b7b9f5531ab3612aaf88d3802e/libmach/flick_mach3mig_client.c#L88 .