Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unix: Degregister descriptors that return ECONNRESET #149

Merged
merged 1 commit into from Feb 16, 2017

Conversation

@dlrobertson
Copy link
Collaborator

dlrobertson commented Feb 15, 2017

Deregister descriptors that return ECONNRESET on recv.

Resolves: #133

@dlrobertson
Copy link
Collaborator Author

dlrobertson commented Feb 15, 2017

CC @antrik @nox

Entirely my fault. I should have caught this in the initial PR.

@dlrobertson dlrobertson force-pushed the dlrobertson:ignore_err_and_hup branch 2 times, most recently from 580f9d6 to d6e0d08 Feb 15, 2017
@@ -500,6 +501,8 @@ impl OsIpcReceiverSet {
}
Err(err) if err.channel_is_closed() => {
self.pollfds.remove(&evt_token).unwrap();
let io = EventedFd(&poll_entry.fd);
self.poll.deregister(&io).unwrap();

This comment has been minimized.

@dlrobertson

dlrobertson Feb 15, 2017

Author Collaborator

After I thought about it and looked at the code more, I realized the real problem is that we don't deregister the descriptor when we hit an error here. Theoretically just deregistering the descriptor here should resolve the issue.

This comment has been minimized.

@antrik

antrik Feb 15, 2017

Contributor

Well, we actually discussed this in the original PR IIRC. My assumption was that we don't need to remove it explicitly, since closing the FD should do so automatically -- but I wasn't aware of the duplicate FD special case back then :-( This makes total sense now: after the sender end is closed, and no more messages outstanding, recv() returns the channel_is_closed status (i.e. ECONNRESET), so we close the original FD and drop the token -- but a duplicate FD might still be open, in which case epoll_wait() will diligently keep reporting the hup status in following calls.

Explicitly removing the FD should indeed fix this for good. Have you tried whether this alone fixes the failures?

(BTW, is there any particular reason why you introduced the io temporary? It doesn't seem more readable to me than as a one-liner... Might be different perhaps if the temporary name was more informative -- but io doesn't really tell me anything here...)

This comment has been minimized.

@dlrobertson

dlrobertson Feb 15, 2017

Author Collaborator

This makes total sense now

I know right. After I saw this, it seemed plain as day.

Explicitly removing the FD should indeed fix this for good. Have you tried whether this alone fixes the failures?

Yes. Just removing the fd causes the tests to pass.

BTW, is there any particular reason why you introduced the io temporary?

gdb if I remember right. Had a breakpoint that printed its value. I can remove it.

@nox
Copy link
Member

nox commented Feb 15, 2017

@antrik Does that look ok to you? It made the try run succeed at least.

Copy link
Contributor

antrik left a comment

The later change looks good; the original one I have doubts about... See comments.

BTW, any chance we could create an ipc-channel test case for this? I am aware that getting this reliable might be tricky, since in my current understanding of the situation, it only happens if, after a receiver is transferred over another IPC channel, the receiving process goes on to use this receiver, before the sending process ever got the chance to finalise the send() operation, and thus drop the original FD... The way scheduling works in the face of IPC might actually help us here (I guess we just need to make sure we close the sender end even before sending the receiver end over the other channel?) -- but if it doesn't, I'm not sure there is much we can do there...

@@ -500,6 +501,8 @@ impl OsIpcReceiverSet {
}
Err(err) if err.channel_is_closed() => {
self.pollfds.remove(&evt_token).unwrap();
let io = EventedFd(&poll_entry.fd);
self.poll.deregister(&io).unwrap();

This comment has been minimized.

@antrik

antrik Feb 15, 2017

Contributor

Well, we actually discussed this in the original PR IIRC. My assumption was that we don't need to remove it explicitly, since closing the FD should do so automatically -- but I wasn't aware of the duplicate FD special case back then :-( This makes total sense now: after the sender end is closed, and no more messages outstanding, recv() returns the channel_is_closed status (i.e. ECONNRESET), so we close the original FD and drop the token -- but a duplicate FD might still be open, in which case epoll_wait() will diligently keep reporting the hup status in following calls.

Explicitly removing the FD should indeed fix this for good. Have you tried whether this alone fixes the failures?

(BTW, is there any particular reason why you introduced the io temporary? It doesn't seem more readable to me than as a one-liner... Might be different perhaps if the temporary name was more informative -- but io doesn't really tell me anything here...)

panic!("Readable event for unknown token: {:?}", evt_token)
// Do not panic for readable events that are errors or hup. We
// have already closed the descriptor.
if !(evt_kind.is_error() || evt_kind.is_hup()) {

This comment has been minimized.

@antrik

antrik Feb 15, 2017

Contributor

I have some doubts about this. So this means its possible for the event to be readable and at the same time hup or error? (Is this documented anywhere?) If that's the case, maybe we need to rethink the way we handle error and/or hup status in general?... With all that uncertainty around this, I'd rather defer such changes, unless it is strictly necessary to fix the issue at hand.

In either case, I still don't see though how we could ever get a readable event after dropping the token, since we only do that after we get informed that the sender closed the channel... So I'd rather we still panic if that should ever happen. That's what uncovered the actual bug above, after all...

@metajack metajack removed their assignment Feb 15, 2017
@nox
Copy link
Member

nox commented Feb 15, 2017

BTW, any chance we could create an ipc-channel test case for this?

This would be highly valuable, but I wouldn't mind if this lands later.

@dlrobertson dlrobertson force-pushed the dlrobertson:ignore_err_and_hup branch from d6e0d08 to c2ca2a2 Feb 15, 2017
Copy link
Collaborator Author

dlrobertson left a comment

I removed the second part and only left the deregistering of the fd after the ECONNRESET.

BTW, any chance we could create an ipc-channel test case for this?

I can try. I'll submit that as a separate PR so that I am no longer blocking the PR

@nox can you try your PR with the updated changes. Sorry to keep asking you to do this.

@@ -500,6 +501,8 @@ impl OsIpcReceiverSet {
}
Err(err) if err.channel_is_closed() => {
self.pollfds.remove(&evt_token).unwrap();
let io = EventedFd(&poll_entry.fd);
self.poll.deregister(&io).unwrap();

This comment has been minimized.

@dlrobertson

dlrobertson Feb 15, 2017

Author Collaborator

This makes total sense now

I know right. After I saw this, it seemed plain as day.

Explicitly removing the FD should indeed fix this for good. Have you tried whether this alone fixes the failures?

Yes. Just removing the fd causes the tests to pass.

BTW, is there any particular reason why you introduced the io temporary?

gdb if I remember right. Had a breakpoint that printed its value. I can remove it.

After we return ECONNRESET from recv, we remove the descriptors token
from the pollfds HashMap. Ensure that we also deregister the descriptor
when this occurs.
@dlrobertson dlrobertson force-pushed the dlrobertson:ignore_err_and_hup branch from c2ca2a2 to c16ac1d Feb 15, 2017
@dlrobertson dlrobertson changed the title Do not panic on readable events that are errors Unix: Degregister descriptors that return ECONNRESET Feb 15, 2017
@dlrobertson dlrobertson force-pushed the dlrobertson:ignore_err_and_hup branch from c16ac1d to e0f660c Feb 15, 2017
@antrik
Copy link
Contributor

antrik commented Feb 15, 2017

@bors-servo
Copy link
Contributor

bors-servo commented Feb 15, 2017

📌 Commit e0f660c has been approved by antrik

@bors-servo
Copy link
Contributor

bors-servo commented Feb 15, 2017

Testing commit e0f660c with merge 7c7fd16...

bors-servo added a commit that referenced this pull request Feb 15, 2017
Unix: Degregister descriptors that return ECONNRESET

Deregister descriptors that return `ECONNRESET` on `recv`.

Resolves: #133
@bors-servo
Copy link
Contributor

bors-servo commented Feb 15, 2017

💔 Test failed - status-travis

@nox
Copy link
Member

nox commented Feb 15, 2017

@bors-servo r=antrik

I Git bad.

@bors-servo
Copy link
Contributor

bors-servo commented Feb 15, 2017

💡 This pull request was already approved, no need to approve it again.

  • This pull request previously failed. You should add more commits to fix the bug, or use retry to trigger a build again.
@bors-servo
Copy link
Contributor

bors-servo commented Feb 15, 2017

📌 Commit e0f660c has been approved by antrik

@nox
Copy link
Member

nox commented Feb 15, 2017

@bors-servo retry

I also Homu bad.

@bors-servo
Copy link
Contributor

bors-servo commented Feb 15, 2017

Testing commit e0f660c with merge 4f928dd...

bors-servo added a commit that referenced this pull request Feb 15, 2017
Unix: Degregister descriptors that return ECONNRESET

Deregister descriptors that return `ECONNRESET` on `recv`.

Resolves: #133
@bors-servo
Copy link
Contributor

bors-servo commented Feb 16, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: antrik
Pushing 4f928dd to master...

@bors-servo bors-servo merged commit e0f660c into servo:master Feb 16, 2017
3 checks passed
3 checks passed
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
homu Test successful
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

6 participants
You can’t perform that action at this time.