Skip to content

netlink: fix/support stopping goroutines reading netlink raw sockets#8101

Merged
mvo5 merged 4 commits intocanonical:masterfrom
mvo5:netlink-blocking-1
Feb 21, 2020
Merged

netlink: fix/support stopping goroutines reading netlink raw sockets#8101
mvo5 merged 4 commits intocanonical:masterfrom
mvo5:netlink-blocking-1

Conversation

@mvo5
Copy link
Copy Markdown
Contributor

@mvo5 mvo5 commented Feb 6, 2020

During the review of PR#8085 we discoverd that the netlink socket
reads will block for an arbitrary long time which means that stoping
a udev monitor is not reliable. This PR (by Samuele) fixes this by
using a new netlink.RawSockStopper().

Note that from go-1.11 onwards we can simplify this by just setting
the netlink fd to non-blocking and then wrapping it with os.NewFile.
With that the regular os.File.Close() will work as expected.

During the review of PR#8085 we discoverd that the netlink socket
reads will block for an arbitrary long time which means that stoping
a udev monitor is not reliable. This PR (by Samuele) fixes this by
using a new netlink.RawSockStopper().

Note that from go-1.11 onwards we can simplify this by just setting
the netlink fd to non-blocking and then wrapping it with os.NewFile.
With that the regular os.File.Close() will work as expected.
// fd is readable or stop was called.
// TODO: with go 1.11+ it should be possible to just switch to setting
// fd to non-blocking and then wrapping the socket via os.NewFile and
// use Closeq to force a read to stop.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo Closeq

@stolowski
Copy link
Copy Markdown
Contributor

Thank you, I'm going to review this. Nb, it's worth running these changes manually through tests/nested/{core,classic}/hotplug spread tests.

@stolowski stolowski self-requested a review February 6, 2020 13:54
// Gather devices from udevadm info output (enumeration on startup).
devices, parseErrors, err := hotplug.EnumerateExistingDevices()
if err != nil {
m.disconnect()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also check/include err from m.disconnect() here?

@stolowski
Copy link
Copy Markdown
Contributor

It seems to be failing on google-nested:ubuntu-18.04-64:tests/nested/classic/hotplug, I'll add some debug and investigate.

Copy link
Copy Markdown
Contributor

@stolowski stolowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a fix detailed in the comment.

// use Close to force a read to stop.
// c.f. https://github.com/golang/go/commit/ea5825b0b64e1a017a76eac0ad734e11ff557c8e
func RawSockStopper(fd int) (readableOrStop func() (bool, error), stop func(), err error) {
stopR, stopW, err := os.Pipe()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned yesterday it currently fails on nested hotplug tests with "udevmon.go:148: udev event error: Internal error: bad file descriptor" when processing normal events (what was confusing is that this error occured in random moments) - at some point the udevmon would not register an add or remove event.

The problem is with os.File*s returned by os.Pipe() going out of scope here and eventually getting garbage-collected, making the descriptors invalid. Here is one way of fixing it by wrapping them in a "stopper" object: https://paste.ubuntu.com/p/nkhPXFNRXP/
(it can probably be slightly simplified)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should use runtime.KeepAlive to ensure lifecycle is handled correctly. It is used for most raw file descriptor code.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, the fix for that should be simple, close over stopR instead of reading out the Fd once

@zyga
Copy link
Copy Markdown
Contributor

zyga commented Feb 7, 2020

I'm somewhat skeptical about this. To be reliable you should open a file descriptor, switch it to non-blocking mode, select / epoll it (epoll is preferred because it is easier to use correctly and is not constrained on fd numbers), read from it and handle EAGAIN.

The kernel is full of nasty surprises like getting a "readable-descriptor" notification that only ends up with, EAGAIN (because it wasn't really) and handling that with blocking code is impossible.

@pedronis
Copy link
Copy Markdown
Contributor

pedronis commented Feb 7, 2020

@zyga we can turn the fd non-bocking, what we cannot do is let go manage it for us before go 1.11+

t.Fatal("readableOrStop: expected nothing to read, just stopped")
}

}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps these tests could use loops around the ops and explicitly call GC inbetween iterations to verify the fix for lifecycle.

also other robustness improvements by turning the raw socket fd
to non-blocking mode
@pedronis
Copy link
Copy Markdown
Contributor

pedronis commented Feb 7, 2020

@zyga @stolowski I pushed the fixes we discussed

@pedronis
Copy link
Copy Markdown
Contributor

pedronis commented Feb 7, 2020

the one after will need a similar change about EAGAIN|EWOULDBLOCK

syscall.* usually returns syscall.Errno for errors, which has
predicates to detect EINTR, EAGAIN, EWOULDBLOCK etc in more a
high-level way
// both stopR and stopW must be kept alive otherwise the corresponding
// file descriptors will get closed
readableOrStop = func() (bool, error) {
return stopperSelectReadable(fd, int(stopR.Fd()))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, makes sense and is way simpler than I envisioned in earlier comment, thank you. I think the comment should even be expanded and made more explicit about keeping stopR within the scope of readableOrStop to prevent GC.

Copy link
Copy Markdown
Contributor

@stolowski stolowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, and nested hotplug tests are now happy, thank you! Please see my other suggestions.

@stolowski stolowski self-requested a review February 10, 2020 09:14
r.Bits[fdIdx] = 1 << fdShift
r.Bits[stopFdIdx] |= 1 << stopFdShift
_, err := syscall.Select(maxFd+1, &r, nil, nil, stopperSelectTimeout)
if errno, ok := err.(syscall.Errno); ok && errno.Temporary() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does errno.Temporary stand here for? EAGAIN?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the definition has changed a bit over time (strangely enough) but since 1.9 up to current go master it covers the things relevant here:

EINTR and also (for the other places we use it) EAGAIN, EWOULDBLOCK

plus other errors that the syscalls we care about here don't generate (things like EMFILE or ETIMEDOUT)

@pedronis pedronis dismissed stolowski’s stale review February 10, 2020 15:30

the code was fixed since

Copy link
Copy Markdown
Contributor

@stolowski stolowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! My last remarks are not a blocker, it's fine to potentially investigate them in a followup where they make sense. +1

@pedronis pedronis requested a review from bboozzoo February 21, 2020 08:47
@mvo5
Copy link
Copy Markdown
Contributor Author

mvo5 commented Feb 21, 2020

👍 from me as well. I opened the PR but there is no code from me in this so I consider my +1 a real one and merge this.

@mvo5 mvo5 merged commit 2851658 into canonical:master Feb 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants