netlink: fix/support stopping goroutines reading netlink raw sockets#8101
mvo5 merged 4 commits into canonical:master
During the review of PR#8085 we discovered that netlink socket reads can block for an arbitrarily long time, which means that stopping a udev monitor is not reliable. This PR (by Samuele) fixes this by using a new netlink.RawSockStopper(). Note that from go-1.11 onwards we can simplify this by just setting the netlink fd to non-blocking and then wrapping it with os.NewFile. With that the regular os.File.Close() will work as expected.
osutil/udev/netlink/rawsockstop.go
Outdated
| // fd is readable or stop was called. | ||
| // TODO: with go 1.11+ it should be possible to just switch to setting | ||
| // fd to non-blocking and then wrapping the socket via os.NewFile and | ||
| // use Close to force a read to stop. |
|
Thank you, I'm going to review this. Nb, it's worth running these changes manually through tests/nested/{core,classic}/hotplug spread tests. |
| // Gather devices from udevadm info output (enumeration on startup). | ||
| devices, parseErrors, err := hotplug.EnumerateExistingDevices() | ||
| if err != nil { | ||
| m.disconnect() |
Should we also check/include err from m.disconnect() here?
|
It seems to be failing on google-nested:ubuntu-18.04-64:tests/nested/classic/hotplug, I'll add some debug and investigate. |
stolowski
left a comment
Needs a fix detailed in the comment.
| // use Close to force a read to stop. | ||
| // c.f. https://github.com/golang/go/commit/ea5825b0b64e1a017a76eac0ad734e11ff557c8e | ||
| func RawSockStopper(fd int) (readableOrStop func() (bool, error), stop func(), err error) { | ||
| stopR, stopW, err := os.Pipe() |
As mentioned yesterday, it currently fails on nested hotplug tests with "udevmon.go:148: udev event error: Internal error: bad file descriptor" when processing normal events. What was confusing is that this error occurred at random moments: at some point the udevmon would not register an add or remove event.
The problem is that the *os.File values returned by os.Pipe() go out of scope here and eventually get garbage-collected, which closes the descriptors and makes them invalid. Here is one way of fixing it by wrapping them in a "stopper" object: https://paste.ubuntu.com/p/nkhPXFNRXP/
(it can probably be slightly simplified)
You should use runtime.KeepAlive to ensure lifecycle is handled correctly. It is used for most raw file descriptor code.
OK, the fix for that should be simple: close over stopR instead of reading out the Fd once.
|
I'm somewhat skeptical about this. To be reliable you should open a file descriptor, switch it to non-blocking mode, select / epoll it (epoll is preferred because it is easier to use correctly and is not constrained on fd numbers), read from it and handle EAGAIN. The kernel is full of nasty surprises, like getting a "readable-descriptor" notification where the read then only returns EAGAIN (because the descriptor wasn't really readable), and handling that with blocking code is impossible.
|
@zyga we can turn the fd non-blocking; what we cannot do is let Go manage it for us before Go 1.11 |
| t.Fatal("readableOrStop: expected nothing to read, just stopped") | ||
| } | ||
|
|
||
| } |
Perhaps these tests could use loops around the ops and explicitly call GC in between iterations to verify the fix for the lifecycle.
also other robustness improvements by turning the raw socket fd to non-blocking mode
|
@zyga @stolowski I pushed the fixes we discussed |
|
the one after will need a similar change about EAGAIN|EWOULDBLOCK |
syscall.* usually returns syscall.Errno for errors, which has predicates to detect EINTR, EAGAIN, EWOULDBLOCK etc. in a more high-level way
| // both stopR and stopW must be kept alive otherwise the corresponding | ||
| // file descriptors will get closed | ||
| readableOrStop = func() (bool, error) { | ||
| return stopperSelectReadable(fd, int(stopR.Fd())) |
Right, makes sense and is way simpler than I envisioned in my earlier comment, thank you. I think the comment should even be expanded and made more explicit about keeping stopR within the scope of readableOrStop to prevent GC.
stolowski
left a comment
Looks good, and nested hotplug tests are now happy, thank you! Please see my other suggestions.
| r.Bits[fdIdx] = 1 << fdShift | ||
| r.Bits[stopFdIdx] |= 1 << stopFdShift | ||
| _, err := syscall.Select(maxFd+1, &r, nil, nil, stopperSelectTimeout) | ||
| if errno, ok := err.(syscall.Errno); ok && errno.Temporary() { |
What does errno.Temporary stand here for? EAGAIN?
the definition has changed a bit over time (strangely enough) but since 1.9 up to current go master it covers the things relevant here:
EINTR and also (for the other places we use it) EAGAIN, EWOULDBLOCK
plus other errors that the syscalls we care about here don't generate (things like EMFILE or ETIMEDOUT)
stolowski
left a comment
Thanks for working on this! My last remarks are not a blocker, it's fine to potentially investigate them in a followup where they make sense. +1
|
👍 from me as well. I opened the PR but there is no code from me in this so I consider my +1 a real one and merge this. |