Windows event notification #52

Closed
njsmith opened this issue Feb 12, 2017 · 30 comments

njsmith commented Feb 12, 2017

Problem

Windows has 3 incompatible families of event notification APIs: IOCP, select/WSAPoll, and WaitForMultipleObjects-and-variants. They each have unique capabilities. This means: if you want to be able to react to all the different possible events that Windows can signal, then you must use all 3 of these. Needless to say, this creates a challenge for event loop design. There are a number of potentially viable ways to arrange these pieces; the question is which one we should use.

(Actually, all 3 together still isn't sufficient, b/c there are some things that still require threads – like console IO – and I'm ignoring GUI events entirely because Trio isn't a GUI library. But never mind. Just remember that when someone tells you that Windows' I/O subsystem is great, their statement isn't wrong, but it does require taking a certain narrow perspective...)

Considerations

The WaitFor*Event family

The Event-related APIs are necessary to, for example, wait for a notification that a child process has exited. (The job object API provides a way to request IOCP notifications about process death, but the docs warn that the notifications are lossy and therefore useless...) Otherwise though they're very limited – in particular they have both O(n) behavior and a max of 64 objects in an interest set – so you definitely don't want to use these as your primary blocking call. We're going to be calling these in a background thread of some kind. The two natural architectures are to use WaitForSingleObject(Ex) and allocate one thread per event, or else use WaitForMultipleObjects(Ex) and try to coalesce up to 64 events into each thread (substantially more complicated to implement, but with 64x less memory overhead for thread stacks, if it matters). This is orthogonal to the rest of this issue, so it gets its own thread: #233
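
For concreteness, the one-thread-per-handle variant is roughly this (a ctypes sketch, purely illustrative; the function name is made up):

    import ctypes
    import threading
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    INFINITE = 0xFFFFFFFF
    WAIT_OBJECT_0 = 0

    def call_when_signalled(handle, callback):
        # Park one thread per waited-on handle; run the callback when the
        # kernel object (e.g. a process handle) becomes signalled.
        def waiter():
            result = kernel32.WaitForSingleObject(
                wintypes.HANDLE(handle), wintypes.DWORD(INFINITE))
            if result == WAIT_OBJECT_0:
                callback()
        threading.Thread(target=waiter, daemon=True).start()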

IOCP

IOCP is the crown jewel of the Windows I/O subsystem, and what you generally hear recommended. It follows a natively asynchronous model where you just go ahead and issue a read or write or whatever, and it runs in the background until eventually the kernel tells you it's done. It provides an O(1) notification mechanism. It's pretty slick. But... it's not as obvious a choice as everyone makes it sound. (Did you know the Chrome team has mostly given up on trying to make it work?)

Issues:

  • When doing a UDP send, the send is only notified as complete once the packet hits the wire; i.e., using IOCP for UDP totally removes in-kernel buffering/flow-control. So to get decent throughput you must implement your own buffering system allowing multiple UDP sends to be in flight at once (but not too many because you don't want to introduce arbitrary latency). Or you could just use the non-blocking API and the kernel worries about this for you. (This hit Chrome hard; they switched to using non-blocking IO for UDP on Windows. ref1, ref2.)

  • When doing a TCP receive with a large buffer, apparently the kernel does a Nagle-like thing where it tries to hang onto the data for a while before delivering it to the application, thus introducing pointless latency. (This also bit Chrome hard; they switched to using non-blocking IO for TCP receive on Windows. ref1, ref2)

  • Sometimes you really do want to check whether a socket is readable before issuing a read: in particular, apparently outstanding IOCP receive buffers get pinned into kernel memory or some such nonsense, so it's possible to exhaust system resources by trying to listen to a large number of mostly-idle sockets.

  • Sometimes you really do want to check whether a socket is writable before issuing a write: in particular, because it allows adaptive protocols to provide lower latency if they can delay deciding what bytes to write until the last moment.

  • Python provides a complete non-blocking API out-of-the-box, and we use this API on other platforms, so using non-blocking IO on Windows as well is much MUCH simpler for us to implement than IOCP, which requires us to pretty much build our own wrappers from scratch.

On the other hand, IOCP is the only way to do a number of things like: non-blocking IO to the filesystem, or monitoring the filesystem for changes, or non-blocking IO on named pipes. (Named pipes are popular for talking to subprocesses – though it's also possible to use a socket if you set it up right.)

select/WSAPoll

You can also use select/WSAPoll. This is the only documented way to check if a socket is readable/writable. However:

  • As is well known, these are O(n) APIs, which sucks if you have lots of sockets. It's not clear how much it sucks exactly -- just copying the buffer into kernel-space probably isn't a big deal for realistic interest set sizes -- but clearly it's not as nice as O(1). On my laptop, select.select on 3 sets of 512 idle sockets takes <200 microseconds, so I don't think this will, like, immediately kill us. Especially since people mostly don't run big servers on Windows? OTOH an empty epoll on the same laptop returns in ~0.6 microseconds, so there is some difference...

  • select.select is limited to 512 sockets, but this is trivially overcome; the Windows fd_set structure is just an array of SOCKETs + a length field, which you can allocate in any size you like, as in the sketch after this list (Windows: wait_{read,writ}able limited to 512 sockets #3). (This is a nice side-effect of Windows never having had a dense fd space. This also means WSAPoll doesn't have much reason to exist. Unlike other platforms where poll beats select because poll uses an array and select uses a bitmap, WSAPoll is not really any more efficient than select. Its only advantage is that it's similar to how poll works on other platforms... but it's gratuitously incompatible. The one other interesting feature is that you can do an alertable wait with it, which gives a way to cancel it from another thread without using an explicit wakeup socket, via QueueUserAPC.)

  • Non-blocking IO on windows is apparently a bit inefficient because it adds an extra copy. (I guess they don't have zero-copy enqueueing of data to receive buffers? And on send I guess it makes sense that you can do that legitimately zero-copy with IOCP but not with nonblocking, which is nice.) Again I'm not sure how much this matters given that we don't have zero-copy byte buffers in Python to start with, but it's a thing.

  • select only works for sockets; you still need IOCP etc. for responding to other kinds of notifications.
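
To make the fd_set point above concrete, here's a rough ctypes sketch of calling ws2_32's select directly with an over-allocated fd_set. It assumes the standard winsock2.h layouts of fd_set and timeval; FD_SETSIZE and big_select are just names for this illustration, and it's untested:

    import ctypes
    import socket  # importing socket ensures WSAStartup has been called
    from ctypes import wintypes

    ws2_32 = ctypes.WinDLL("ws2_32", use_last_error=True)

    FD_SETSIZE = 4096         # pick whatever capacity we need
    SOCKET = ctypes.c_size_t  # SOCKET is a UINT_PTR

    class FDSet(ctypes.Structure):
        # winsock2.h: a count followed by a flat array of SOCKET handles.
        _fields_ = [("fd_count", ctypes.c_uint),
                    ("fd_array", SOCKET * FD_SETSIZE)]

    class Timeval(ctypes.Structure):
        _fields_ = [("tv_sec", ctypes.c_long), ("tv_usec", ctypes.c_long)]

    def big_select(rsocks, wsocks, timeout_s):
        def make_set(socks):
            assert len(socks) <= FD_SETSIZE
            fdset = FDSet()
            fdset.fd_count = len(socks)
            for i, sock in enumerate(socks):
                fdset.fd_array[i] = sock.fileno()
            return fdset

        rset, wset = make_set(rsocks), make_set(wsocks)
        tv = Timeval(int(timeout_s), int((timeout_s - int(timeout_s)) * 1_000_000))
        # The first (nfds) argument is ignored on Windows; the sets are
        # modified in place to contain only the ready sockets.
        rc = ws2_32.select(0, ctypes.byref(rset), ctypes.byref(wset), None,
                           ctypes.byref(tv))
        if rc < 0:
            raise ctypes.WinError(ctypes.get_last_error())
        readable = {rset.fd_array[i] for i in range(rset.fd_count)}
        writable = {wset.fd_array[i] for i in range(wset.fd_count)}
        return ([s for s in rsocks if s.fileno() in readable],
                [s for s in wsocks if s.fileno() in writable])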

Options

Given all of the above, our current design is a hybrid that uses select and non-blocking IO for sockets, with IOCP available when needed. We run select in the main thread, and IOCP in a worker thread, with a wakeup socket to notify when IOCP events occur. This is vastly simpler than doing it the other way around, because you can trivially queue work to an IOCP from any thread, while modifying select's interest set from another thread is a mess. As an initial design, this makes a lot of sense: it allows us to provide full features (including e.g. wait_writable for adaptive protocols), avoids the tricky issues that IOCP creates for sockets, and requires a minimum of special code.
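
To make the wakeup-socket idea concrete, here's a minimal sketch of that hybrid shape; get_completion, handle_completion, and handle_ready are hypothetical stand-ins for the real machinery, and this is not Trio's actual implementation:

    import select
    import socket
    import threading

    def run_hybrid_loop(read_socks, write_socks, get_completion,
                        handle_completion, handle_ready):
        # select() runs in this (main) thread; a worker thread waits on the
        # IOCP and pokes a wakeup socket whenever a completion arrives.
        wakeup_recv, wakeup_send = socket.socketpair()
        completions = []
        lock = threading.Lock()

        def iocp_worker():
            # Stand-in for a GetQueuedCompletionStatus(Ex) loop.
            while True:
                completion = get_completion()
                with lock:
                    completions.append(completion)
                wakeup_send.send(b"\x00")  # make select() return

        threading.Thread(target=iocp_worker, daemon=True).start()

        while True:
            r, w, _ = select.select(list(read_socks) + [wakeup_recv],
                                    list(write_socks), [])
            if wakeup_recv in r:
                wakeup_recv.recv(4096)  # drain the wakeup byte(s)
                r.remove(wakeup_recv)
                with lock:
                    pending = completions[:]
                    completions.clear()
                for completion in pending:
                    handle_completion(completion)
            handle_ready(r, w)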

The other attractive option would be if we could solve the issues with IOCP and switch to using it alone – this would be simpler and would get rid of the O(n) select. However, as we can see above, there is a whole list of challenges that would need to be overcome first.

Working around IOCP's limitations

UDP sends

I'm not really sure what the best approach here is. One option is just to limit the number of outstanding UDP data to some fixed amount (maybe tunable through a "virtual" (i.e. implemented by us) sockopt), and drop packets or return errors if we exceed that. This is clearly solvable in principle, it's just a bit annoying to figure out the details.
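
As a strawman, the bookkeeping for such a cap might look something like this; the limit, the names, and the drop-or-error policy are all made up for illustration:

    MAX_OUTSTANDING_UDP_SENDS = 16  # hypothetical tunable, maybe a "virtual" sockopt

    class UDPSendThrottle:
        def __init__(self, limit=MAX_OUTSTANDING_UDP_SENDS):
            self.limit = limit
            self.outstanding = 0

        def try_start_send(self):
            # Call before issuing an overlapped send; returning False means
            # "drop the packet or report EWOULDBLOCK" rather than queueing
            # an unbounded amount of data (and latency) inside our process.
            if self.outstanding >= self.limit:
                return False
            self.outstanding += 1
            return True

        def on_send_complete(self):
            # Call from the IOCP completion handler for the send.
            self.outstanding -= 1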

Spurious extra latency in TCP receives

I think that using the MSG_PUSH_IMMEDIATE flag should solve this.

Checking readability / writability

It turns out that IOCP actually can check readability! It's not mentioned on MSDN at all, but there's a well-known bit of folklore about the "zero-byte read". If you issue a zero-byte read, it won't complete until there's data ready to read. ref1 (← official MS docs! also note this is ch. 6 of "NPfMW", referenced below), ref2, ref3.
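
For future reference, here's a rough ctypes sketch of the zero-byte read trick. To keep it self-contained it signals completion through an event rather than an IOCP, and it assumes the socket is overlapped-capable (the default for sockets created by Python's socket module on Windows). The function name is just for illustration and this hasn't been battle-tested:

    import ctypes
    import socket
    from ctypes import wintypes

    ws2_32 = ctypes.WinDLL("ws2_32", use_last_error=True)
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.CreateEventW.restype = wintypes.HANDLE

    class WSABUF(ctypes.Structure):
        _fields_ = [("len", wintypes.ULONG), ("buf", ctypes.c_char_p)]

    class OVERLAPPED(ctypes.Structure):
        _fields_ = [("Internal", ctypes.c_void_p),
                    ("InternalHigh", ctypes.c_void_p),
                    ("Offset", wintypes.DWORD),
                    ("OffsetHigh", wintypes.DWORD),
                    ("hEvent", wintypes.HANDLE)]

    WSA_IO_PENDING = 997  # same value as ERROR_IO_PENDING
    INFINITE = 0xFFFFFFFF

    def wait_readable_via_zero_byte_read(sock: socket.socket) -> None:
        # Issue an overlapped WSARecv with a single empty buffer; per the
        # refs above it should only complete once there is data waiting on
        # this SOCK_STREAM socket. Real code must keep buf/ov alive until
        # the completion has actually been delivered.
        buf = WSABUF(0, None)
        ov = OVERLAPPED()
        ov.hEvent = kernel32.CreateEventW(None, True, False, None)
        flags = wintypes.DWORD(0)
        received = wintypes.DWORD(0)
        rc = ws2_32.WSARecv(
            wintypes.HANDLE(sock.fileno()), ctypes.byref(buf), 1,
            ctypes.byref(received), ctypes.byref(flags),
            ctypes.byref(ov), None)
        try:
            if rc != 0 and ctypes.get_last_error() != WSA_IO_PENDING:
                raise ctypes.WinError(ctypes.get_last_error())
            kernel32.WaitForSingleObject(ov.hEvent, wintypes.DWORD(INFINITE))
        finally:
            kernel32.CloseHandle(ov.hEvent)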

That's for SOCK_STREAM sockets. What about SOCK_DGRAM? libuv does zero-byte reads with MSG_PEEK set (to avoid consuming the packet, truncating it to zero bytes in the process). MSDN explicitly says that this doesn't work (MSG_PEEK and overlapped IO supposedly don't work together), but I guess I trust libuv more than MSDN? I don't 100% trust either – this would need to be verified.

What about writability? Empirically, if you have a non-blocking socket on windows with a full send buffer and you do a zero-byte send, it returns EWOULDBLOCK. (This is weird; other platforms don't do this.) If this behavior also translates to IOCP sends, then this zero-byte send trick would give us a way to use IOCP to check writability on SOCK_STREAM sockets.
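
As a quick sanity check of that claim, the non-blocking version is easy to poke at from pure Python; this is just an experimental probe of the behavior described above, not something to rely on until it's verified:

    import socket

    def zero_byte_send_probe(sock: socket.socket) -> bool:
        # On Windows, a zero-byte send on a non-blocking socket with a full
        # send buffer reportedly raises BlockingIOError (WSAEWOULDBLOCK);
        # if it succeeds, the socket is presumably writable.
        sock.setblocking(False)
        try:
            sock.send(b"")
        except BlockingIOError:
            return False
        return True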

For writability of SOCK_DGRAM I don't think there's any trick, but it's not clear how meaningful SOCK_DGRAM writability is anyway. If we do our own buffering then presumably we can implement it there.

Alternatively, there is a remarkable piece of undocumented sorcery, where you reach down directly to make syscalls, bypassing the Winsock userland, and apparently can get OVERLAPPED notifications when a socket is readable/writable: ref1, ref2, ref3, ref4, ref5. I guess this is how select is implemented? The problem with this is that it only works if your sockets are implemented directly in the kernel, which is apparently not always the case (because of like... antivirus tools and other horrible things that can interpose themselves into your socket API). So I'm inclined to discount this as unreliable. [Edit: or maybe not, see below]

Implementing all this junk

I actually got a ways into this. Then I ripped it out when I realized how many nasty issues there were beyond just typing in long and annoying API calls. But it could easily be resurrected; see 7e7a809 and its parent.

TODO

If we do want to switch to using IOCP in general, then the sequence would go something like:

  • check whether zero-byte sends give a way to check TCP writability via IOCP – this is probably the biggest determinant of whether going to IOCP-only is even possible (might be worth checking what doing UDP sends with MSG_PARTIAL does too while we're at it)
  • check whether you really can do zero-byte reads on UDP sockets like libuv claims
  • figure out what kind of UDP send buffering strategy makes sense (or if we decide that UDP sends can just drop packets instead of blocking then I guess the non-blocking APIs remain viable even if we can't do wait_socket_writable on UDP sockets)

At this point we'd have the information to decide whether we can/should go ahead. If so, then the plan would look something like:

  • migrate away from select for the cases that can't use IOCP readable/writable checking: [Not necessary, AFD-based select should work for these too]
    • connect
    • accept
  • implement wait_socket_readable and wait_socket_writable on top of IOCP and get rid of select (but at this point we're still doing non-blocking I/O on sockets, just using IOCP as a select replacement)
  • (optional / someday) switch to using IOCP for everything instead of non-blocking I/O

New plan:

  • Use the tricks from the thread below to reimplement wait_socket_{readable,writable} using AFD, and confirm it works
  • Add LSP testing to our Windows CI
  • Consider whether we want to switch to using IOCP in more cases, e.g. send/recv. Not sure it's worth bothering.

njsmith commented Mar 10, 2017

Something to think about in this process: according to the docs on Windows, with overlapped I/O you can still get WSAEWOULDBLOCK ("too many outstanding overlapped requests"). No one on the internet seems to have any idea when this actually occurs or why. Twisted has a FIXME b/c they don't handle it, just propagate the error out. Maybe we can do better? Or maybe not.

(I bet if you don't throttle UDP sends at all then that's one way to get this? Though libuv AFAICT doesn't throttle UDP sends at all and I guess they do ok... it'd be worth trying this just to see how Windows handles it.)

njsmith commented Mar 7, 2018

Small note to remember: if we start using GetQueuedCompletionStatusEx directly with timeouts, then we'll need to make sure that we're careful about how we round those timeouts: https://bugs.python.org/issue20311
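
Concretely, something along these lines when converting a seconds timeout into the dwMilliseconds argument (a sketch; the point, per the bug above, is to round up rather than down):

    import math

    INFINITE = 0xFFFFFFFF

    def timeout_to_milliseconds(timeout):
        # Round *up* to whole milliseconds: rounding down would turn e.g.
        # 0.0004 s into 0 ms, i.e. a non-blocking poll, and we'd spin.
        if timeout is None:
            return INFINITE
        return max(0, math.ceil(timeout * 1000))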

njsmith commented Apr 19, 2018

Hmm, someone on stackoverflow says that they wrote a test program to see if zero-byte IOCP sends could be used to check for writability, and that it didn't work for them :-/

tmm1 commented Sep 18, 2018

That was me, and afaict the IOCP 0-byte trick only works for readability.

According to https://blog.grijjy.com/2018/08/29/creating-high-performance-udp-servers-on-windows-and-linux/ the 0-byte read trick does work with MSG_PEEK (even though the MSDN docs say it only works for non-overlapped sockets?).

To initiate a zero-byte read operation for UDP, you simply pass an overlapped IO event with an empty buffer. The key is to include the MSG_PEEK flag in the WsaRecv() overlapped api call. This signals the underlying completion logic to raise an overlapped event, but not to pass any data.

njsmith commented Sep 22, 2018

@tmm1 Hello! 👋 Thank you for that :-).

Looking at this again 18 months later, and with that information in mind... there's still no urgent need to change what we're doing, though if we move forward with some of the wilder ideas in #399 then that might make IOCP more attractive (in particular see #399 (comment)).

How bad would it be to make wait_socket_writable on Windows unimplemented, or a weird expensive thing that exists but we generally don't use (e.g., it could call select in a thread, like how WaitForSingleObject is implemented)? On Windows, SocketStream.wait_send_all_might_not_block could just return immediately, for a small degradation in UX, or maybe we'll remove wait_send_all_might_not_block anyway (#371).

Another use case is wrapping external libraries. ZMQ requires wait_socket_readable, but doesn't need wait_socket_writable. The postgres client library libpq's async mode absolutely assumes that you have wait_socket_writable, unfortunately (e.g.), so that limitation is inherited by psycopg2 and wrappers like @Fuyukai's riopg. Another user is @agronholm's hyperio. Given hyperio's goals and that asyncio also doesn't guarantee any way to wait for readable/writable sockets on Windows, I suspect hyperio would be better off building on top of asyncio's protocols/transports and Trio's Stream layer, which would avoid this issue, but for now that's not what it does.
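
(For concreteness, the "call select in a thread" fallback would be roughly the following, written against the modern trio.to_thread.run_sync name; note there's no good way to cancel it once the thread is blocked in select, which is part of what makes it a weird expensive thing.)

    import select
    import trio

    async def wait_socket_writable_in_thread(sock):
        # Push the blocking select() call onto a worker thread.
        await trio.to_thread.run_sync(lambda: select.select([], [sock], []))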

I guess we could dig into the "undocumented sorcery" mentioned above and see how bad it is...

njsmith commented Sep 26, 2018

I guess we could dig into the "undocumented sorcery" mentioned above and see how bad it is...

I got curious about this and did do a bit more digging.

So when you call select on Windows, that's a function in ws2_32.dll. I haven't found the source code for this available anywhere (though I haven't checked whether it comes with visual studio or anything... the CRT source code does, so it's possible winsock does as well).

Normally, running on a standard out-of-the-box install of Windows, select then uses the "AFD" kernel subsystem to actually do what it needs to do. This works by opening a magic filename, and sending requests using the magic DeviceIoControl function (which is essentially the equivalent of Unix ioctl, i.e., "shove some arbitrary requests into the kernel using an idiosyncratic format, I hope you know what you're doing").

That part is actually fine. It's arcane and low-level, but documented well enough in the source code of projects like libuv and wepoll, and basically sensible. You put together an array of structs listing the sockets and the events you want to wait for, then submit it to the kernel, and get back a regular IOCP notification when it's ready.

However! This is not the only thing select can do. There is also something called "layered service providers". The idea here is that ws2_32.dll allows random userspace dlls to interpose themselves on socket calls, and there's a whole elaborate system for how these dlls can intercept calls on the way in, pass them down to the lower level, pass them back out again, etc. The best article online about how this works seems to be Unraveling the Mysteries of Writing a Winsock 2 Layered Service Provider from the MS Systems Journal May 1999, which is no longer available from MS but is still available through sketchy mirror sites (archive.org backup) (henceforth: UtMoWaW2LSP). Or, there's a book from 2002 called Network Programming for Microsoft Windows, which has a chapter on this, and might be available from your favorite ebook pirate site, not that you have such a thing (henceforth: NPfMW).

Like most unstructured hooking APIs, the designers had high hopes that this would be used for all kinds of fancy things, but in practice it suffers from the deadly flaw of lack of composability:

LSPs have tremendous potential for value-added networking services. But the current Winsock 2 specification does not answer an important question: where to insert an LSP in the protocol chain if there's another one installed [...] the widespread success of an LSP is effectively prohibited because the only safe installation is an LSP over a base provider and making the new chain the default provider for the protocol. Such an approach guarantees the service of the new LSP, but removes the existing LSP as the default provider in the chain. – UtMoWaW2LSP

And in fact, select is particularly broken, because it gets passed multiple sockets at the same time. And with the way the LSP hooking interface works, different sockets might have different hooks installed. So what happens if you pass sockets belonging to two different LSPs into the same select call? How does ws2_32.dll handle this?

Apparently it picks one of the hooks at random and forces it to handle all of the sockets. Which can't possibly work, but nonetheless:

As a consequence of the architecture of Winsock 2 service providers, select in user applications can only correctly handle sockets from the same service provider in a single call. – UtMoWaW2LSP

Even though the Winsock specification explicitly states that only sockets from the same provider may be passed into select, many applications ignore this [well obviously, what else are they supposed to do? -njs] and frequently pass down both TCP and UDP handles together. The base Microsoft providers do not verify that all socket handles are from the same provider. In addition, the Microsoft providers will correctly handle sockets from multiple providers in a single select call. This is a problem with LSPs because an LSP may be layered over just a single entry such as TCP. In this case, the LSP's WSPSelect is invoked with FD_SETs that contain their own sockets plus sockets from other providers (such as a UDP socket from the Microsoft provider). When the LSP is translating the socket handles and comes upon the UDP handle, the context query will fail. At this point, it may return an error (WSAENOTSOCK) or pass the socket down unmodified. If an error is returned, then for the case of an LSP layered only over UDP/IPv4 (or TCP/IPv4), Internet Explorer will no longer function. A workaround is to always install the LSP over all providers for a given address family (such as for IPv4, install over TCP/IPv4, UDP/IPv4, and RAW/IPv4). No Microsoft application or service currently passes socket handles from multiple address families into a single select call, although LSASS on Windows NT 4.0 used to pass IPX and IPv4 sockets together (this has been fixed in the latest service packs for Windows NT 4.0). – NPfMW

Also, the LSP design means that you can inject arbitrary code into any process by modifying the registry keys that configure the LSPs. This tends to break sandboxes, and means that buggy LSPs cause crashes in critical system processes like winlogon.exe.

So this whole design is fundamentally broken, and MS is actively trying to get rid of it: as of Windows 8 / Windows Server 2012, it's officially deprecated, and to help nudge people away from it, Windows 8 "Metro" apps simply ignore any installed LSPs.

Nonetheless, it is apparently still used by old firewalls, anti-virus programs, and malware (I guess they must just blindly hook everything, to work around the composability problems?), so some percentage of Windows machines in the wild do still have LSPs installed.

Another fun aspect of LSPs: there are some functions that are not "officially" part of Winsock, but are rather "Microsoft extensions", like ConnectEx (the IOCP version of connect), TransmitFile (the Windows version of sendfile, see #45), etc. Given that MS owns the Winsock standard I have no idea what the distinction here is supposed to be, but I guess it made sense to someone at some point. Anyway, the extension functions have a funny API: there isn't a public function in ws2_32.dll you can just call; instead, you have to first call WSAIoctl with the magic code SIO_GET_EXTENSION_FUNCTION_POINTER, and that returns a pointer to ConnectEx or whatever. How does that interact with LSPs? Well, WSAIoctl actually takes a socket as an argument. So there's no way for LSPs to hook the normal ConnectEx; instead, LSPs are allowed to simply replace the normal ConnectEx on their sockets, by causing WSAIoctl to return a different function pointer entirely.

Twisted and asyncio both get this wrong, btw – they call WSAIoctl once-per-process, and assume that the function pointer that's returned can be used for any socket. But this is wrong – you're actually supposed to re-fetch the function pointers for every socket you want to call them on. Libuv does this correctly. It's interesting that Twisted and asyncio seem to get away with this, though.

You can detect whether an LSP has hooked any particular socket by using getsockopt(socket, SOL_SOCKET, SO_PROTOCOL_INFOW, ...), and then checking protocol_info->ProviderId, which is a GUID labeling which "provider" handles this socket; the built-in MS providers have well-known GUIDs. (libuv checks this: see uv_msafd_provider_ids, uv__fast_poll_get_peer_socket, etc.).

You can also blatantly violate the whole LSP abstraction layer by passing SIO_BASE_HANDLE to WSAIoctl to fetch the "base service provider" socket, i.e., the one that's at the bottom of the whole LSP stack. In fact, AFAICT libuv actually does this unconditionally, and then uses the "base handle" for all future socket-related calls, including sending and receiving data. So I guess all libuv apps just... ignore LSPs entirely?

Given that libuv seems to get away with this in practice, and Windows 8 Metro apps do something like this, and that Twisted and asyncio are buggy in the presence of LSPs and seem to get away with it in practice, I think we can probably use SIO_BASE_HANDLE and then pretend LSPs don't exist?
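
For reference, fetching the base handle is a single WSAIoctl call; here's a rough ctypes sketch, where SIO_BASE_HANDLE's value is the mswsock.h constant (_WSAIOR(IOC_WS2, 34)) and the function name is just for illustration:

    import ctypes
    import socket  # importing socket ensures WSAStartup has been called
    from ctypes import wintypes

    ws2_32 = ctypes.WinDLL("ws2_32", use_last_error=True)
    SIO_BASE_HANDLE = 0x48000022  # _WSAIOR(IOC_WS2, 34)

    def get_base_handle(sock: socket.socket) -> int:
        # Ask Winsock for the "base service provider" handle underneath any
        # layered service providers stacked on this socket.
        base = wintypes.HANDLE()
        bytes_returned = wintypes.DWORD()
        rc = ws2_32.WSAIoctl(
            wintypes.HANDLE(sock.fileno()), wintypes.DWORD(SIO_BASE_HANDLE),
            None, 0,
            ctypes.byref(base), ctypes.sizeof(base),
            ctypes.byref(bytes_returned),
            None, None)
        if rc != 0:
            raise ctypes.WinError(ctypes.get_last_error())
        return base.value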

But... having used SIO_BASE_HANDLE to get a "base provider" handle, are you guaranteed to be able to use the AFD magic stuff? What is a "base provider" anyway?

NPfMW says:

A base provider exposes a Winsock interface that directly implements a protocol such as the Microsoft TCP/IP provider.

Typically, a base provider has a kernel mode protocol driver associated with it. For example, the Microsoft TCP and UDP providers require the TCP/IP driver TCPIP.SYS to ultimately function. It is also possible to develop your own base providers, but that is beyond the scope of this book. For more information about base providers, consult the Windows Driver Development Kit (DDK).

On a Windows system, you can get a list of "providers" (both LSPs and base providers) by running netsh winsock show catalog. Here's the output on a Win 10 VM (from modern.ie):

C:\Users\IEUser
λ netsh winsock show catalog

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        Hyper-V RAW
Provider ID:                        {1234191B-4BF7-4CA7-86E0-DFD7C32B5445}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1001
Version:                            2
Address Family:                     34
Max Address Length:                 36
Min Address Length:                 36
Socket Type:                        1
Protocol:                           1
Service Flags:                      0x20026
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        MSAFD Tcpip [TCP/IP]
Provider ID:                        {E70F1AA0-AB8B-11CF-8CA3-00805F48A192}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1006
Version:                            2
Address Family:                     2
Max Address Length:                 16
Min Address Length:                 16
Socket Type:                        1
Protocol:                           6
Service Flags:                      0x20066
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        MSAFD Tcpip [UDP/IP]
Provider ID:                        {E70F1AA0-AB8B-11CF-8CA3-00805F48A192}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1007
Version:                            2
Address Family:                     2
Max Address Length:                 16
Min Address Length:                 16
Socket Type:                        2
Protocol:                           17
Service Flags:                      0x20609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        MSAFD Tcpip [RAW/IP]
Provider ID:                        {E70F1AA0-AB8B-11CF-8CA3-00805F48A192}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1008
Version:                            2
Address Family:                     2
Max Address Length:                 16
Min Address Length:                 16
Socket Type:                        3
Protocol:                           0
Service Flags:                      0x20609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        MSAFD Tcpip [TCP/IPv6]
Provider ID:                        {F9EAB0C0-26D4-11D0-BBBF-00AA006C34E4}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1009
Version:                            2
Address Family:                     23
Max Address Length:                 28
Min Address Length:                 28
Socket Type:                        1
Protocol:                           6
Service Flags:                      0x20066
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        MSAFD Tcpip [UDP/IPv6]
Provider ID:                        {F9EAB0C0-26D4-11D0-BBBF-00AA006C34E4}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1010
Version:                            2
Address Family:                     23
Max Address Length:                 28
Min Address Length:                 28
Socket Type:                        2
Protocol:                           17
Service Flags:                      0x20609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        MSAFD Tcpip [RAW/IPv6]
Provider ID:                        {F9EAB0C0-26D4-11D0-BBBF-00AA006C34E4}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1011
Version:                            2
Address Family:                     23
Max Address Length:                 28
Min Address Length:                 28
Socket Type:                        3
Protocol:                           0
Service Flags:                      0x20609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        MSAFD Irda [IrDA]
Provider ID:                        {3972523D-2AF1-11D1-B655-00805F3642CC}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1012
Version:                            2
Address Family:                     26
Max Address Length:                 32
Min Address Length:                 8
Socket Type:                        1
Protocol:                           1
Service Flags:                      0x20006
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        RSVP TCPv6 Service Provider
Provider ID:                        {9D60A9E0-337A-11D0-BD88-0000C082E69A}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1017
Version:                            2
Address Family:                     23
Max Address Length:                 28
Min Address Length:                 28
Socket Type:                        1
Protocol:                           6
Service Flags:                      0x22066
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        RSVP TCP Service Provider
Provider ID:                        {9D60A9E0-337A-11D0-BD88-0000C082E69A}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1018
Version:                            2
Address Family:                     2
Max Address Length:                 16
Min Address Length:                 16
Socket Type:                        1
Protocol:                           6
Service Flags:                      0x22066
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        RSVP UDPv6 Service Provider
Provider ID:                        {9D60A9E0-337A-11D0-BD88-0000C082E69A}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1019
Version:                            2
Address Family:                     23
Max Address Length:                 28
Min Address Length:                 28
Socket Type:                        2
Protocol:                           17
Service Flags:                      0x22609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        RSVP UDP Service Provider
Provider ID:                        {9D60A9E0-337A-11D0-BD88-0000C082E69A}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1020
Version:                            2
Address Family:                     2
Max Address Length:                 16
Min Address Length:                 16
Socket Type:                        2
Protocol:                           17
Service Flags:                      0x22609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        Hyper-V RAW
Provider ID:                        {1234191B-4BF7-4CA7-86E0-DFD7C32B5445}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1001
Version:                            2
Address Family:                     34
Max Address Length:                 36
Min Address Length:                 36
Socket Type:                        1
Protocol:                           1
Service Flags:                      0x20026
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        MSAFD Tcpip [TCP/IP]
Provider ID:                        {E70F1AA0-AB8B-11CF-8CA3-00805F48A192}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1006
Version:                            2
Address Family:                     2
Max Address Length:                 16
Min Address Length:                 16
Socket Type:                        1
Protocol:                           6
Service Flags:                      0x20066
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        MSAFD Tcpip [UDP/IP]
Provider ID:                        {E70F1AA0-AB8B-11CF-8CA3-00805F48A192}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1007
Version:                            2
Address Family:                     2
Max Address Length:                 16
Min Address Length:                 16
Socket Type:                        2
Protocol:                           17
Service Flags:                      0x20609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        MSAFD Tcpip [RAW/IP]
Provider ID:                        {E70F1AA0-AB8B-11CF-8CA3-00805F48A192}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1008
Version:                            2
Address Family:                     2
Max Address Length:                 16
Min Address Length:                 16
Socket Type:                        3
Protocol:                           0
Service Flags:                      0x20609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        MSAFD Tcpip [TCP/IPv6]
Provider ID:                        {F9EAB0C0-26D4-11D0-BBBF-00AA006C34E4}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1009
Version:                            2
Address Family:                     23
Max Address Length:                 28
Min Address Length:                 28
Socket Type:                        1
Protocol:                           6
Service Flags:                      0x20066
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        MSAFD Tcpip [UDP/IPv6]
Provider ID:                        {F9EAB0C0-26D4-11D0-BBBF-00AA006C34E4}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1010
Version:                            2
Address Family:                     23
Max Address Length:                 28
Min Address Length:                 28
Socket Type:                        2
Protocol:                           17
Service Flags:                      0x20609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        MSAFD Tcpip [RAW/IPv6]
Provider ID:                        {F9EAB0C0-26D4-11D0-BBBF-00AA006C34E4}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1011
Version:                            2
Address Family:                     23
Max Address Length:                 28
Min Address Length:                 28
Socket Type:                        3
Protocol:                           0
Service Flags:                      0x20609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        MSAFD Irda [IrDA]
Provider ID:                        {3972523D-2AF1-11D1-B655-00805F3642CC}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1012
Version:                            2
Address Family:                     26
Max Address Length:                 32
Min Address Length:                 8
Socket Type:                        1
Protocol:                           1
Service Flags:                      0x20006
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        RSVP TCPv6 Service Provider
Provider ID:                        {9D60A9E0-337A-11D0-BD88-0000C082E69A}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1017
Version:                            2
Address Family:                     23
Max Address Length:                 28
Min Address Length:                 28
Socket Type:                        1
Protocol:                           6
Service Flags:                      0x22066
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        RSVP TCP Service Provider
Provider ID:                        {9D60A9E0-337A-11D0-BD88-0000C082E69A}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1018
Version:                            2
Address Family:                     2
Max Address Length:                 16
Min Address Length:                 16
Socket Type:                        1
Protocol:                           6
Service Flags:                      0x22066
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        RSVP UDPv6 Service Provider
Provider ID:                        {9D60A9E0-337A-11D0-BD88-0000C082E69A}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1019
Version:                            2
Address Family:                     23
Max Address Length:                 28
Min Address Length:                 28
Socket Type:                        2
Protocol:                           17
Service Flags:                      0x22609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        RSVP UDP Service Provider
Provider ID:                        {9D60A9E0-337A-11D0-BD88-0000C082E69A}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1020
Version:                            2
Address Family:                     2
Max Address Length:                 16
Min Address Length:                 16
Socket Type:                        2
Protocol:                           17
Service Flags:                      0x22609
Protocol Chain Length:              1

(This only has the service providers; the command also lists "name space providers", which is a different extension mechanism, but I left those out.)

So if we group by entries with the same GUID, there's one base provider that does IPv4 (TCP, UDP, and RAW), another that does IPv6 (TCP, UDP, and RAW), one that handles Hyper-V sockets, one that handles IrDA sockets, and one that handles "RSVP" (whatever that is).

The three special GUIDs that libuv thinks can work with the AFD magic seem to correspond to the IPv4 and IPv6 providers, and the third one, from some googling, appears to be MS's standard Bluetooth provider. (I guess this VM doesn't have the Bluetooth drivers installed.)

So: I think libuv on Windows always ends up using the AFD magic when it's working with IPv4, IPv6, or Bluetooth sockets. When it gets something else – meaning IrDA, Hyper-V, RSVP, or something even more exotic involving a third-party driver – it falls back on calling select in a thread, similar to how we implement WaitForSingleObject.

But AFAIK, on Windows, Python's socket module only supports IPv4 and IPv6 anyway. And I actually kinda suspect that IrDA and Hyper-V use AFD too, since they're official Microsoft-maintained providers.

Final conclusion

I think we can use SIO_BASE_HANDLE + the freaky AFD magic to implement wait_socket_writable on top of IOCP, with no fallback needed.

piscisaureus commented Sep 27, 2018

@njsmith

I randomly stumbled upon this message. Since this stuff is so arcane that it's actually kinda fun, some extra pointers:

So when you call select on Windows, that's a function in ws2_32.dll. I haven't found the source code for this available anywhere (though I haven't checked whether it comes with visual studio or anything... the CRT source code does, so it's possible winsock does as well).

cough

https://github.com/pustladi/Windows-2000/blob/661d000d50637ed6fab2329d30e31775046588a9/private/net/sockets/winsock2/wsp/msafd/select.c#L59-L655

https://github.com/metoo10987/WinNT4/blob/f5c14e6b42c8f45c20fe88d14c61f9d6e0386b8e/private/ntos/afd/poll.c#L68-L707
(Note that this source code is very old. I have reasons to believe some things have changed significantly.)

The three special GUIDs that libuv thinks can work with the AFD magic seem to correspond to the IPv4 and IPv6 providers, and then the third one from some googling appears to MS's standard Bluetooth provider. (I guess this VM doesn't have the Bluetooth drivers installed.)

This is actually kind of a bug in libuv -- this logic makes libuv use its "slow" mode too eagerly (libuv has a fallback mechanism where it simply calls select() in a thread). If a socket is bound to an IFS LSP, it'll have a different GUID that's not one of those 3 on the whitelist, but the socket would still work with IOCTL_AFD_POLL. Since all base service providers are built into Windows and they all use AFD, it's enough to simply check whether the protocol chain length (in the WSAProtocol_InfoW structure) equals 1.

I think we can use SIO_BASE_HANDLE + the freaky AFD magic to implement wait_socket_writable on top of IOCP, with no fallback needed.

Note that Windows itself actually does this most of the time: since Windows Vista, select() cannot be intercepted by LSPs unless the LSP does all of the following:

  • It's a non-IFS LSP
  • It hooks the SIO_BSP_HANDLE_SELECT ioctl
  • It installs itself over all protocols.

I have yet to encounter the first LSP that actually does this.
See https://blogs.msdn.microsoft.com/wndp/2006/07/13/the-new-trouble-with-select-and-lsps/

If you wanted to be super pedantic about it, you could call both SIO_BASE_HANDLE and SIO_BSP_HANDLE_SELECT and check if they are the same.

So if we group by entries with the same GUID, there's one base provider that does IPv4 (TCP, UDP, and RAW), another that does IPv6 (TCP, UDP, and RAW), one that handles Hyper-V sockets, one that handles IrDA sockets, and one that handles "RSVP" (whatever that is).

There's really no need to do any of this grouping in practice, they all use the same code path in the AFD driver. I removed it from wepoll in piscisaureus/wepoll@e7e8385.

Also note that RSVP is deprecated and no longer supported, and that since the latest Windows 10 update there is AF_UNIX, which also works with IOCTL_AFD_POLL just fine.

njsmith commented Sep 28, 2018

@piscisaureus Oh hey, thanks for dropping by! I was wondering why wepoll didn't seem to check GUIDs, and was completely baffled by the docs for SIO_BSP_HANDLE_SELECT... that clarifies a lot!

What do you think of unconditionally calling SIO_BASE_HANDLE and using that instead of the original socket for everything, the way libuv seems to? I guess this will probably have the effect of circumventing any LSPs, but... will anyone care? It seems like it might be simpler than messing with SIO_BSP_HANDLE_SELECT...

And do you happen to know what happens if you try to use the AFD poll stuff with a handle that isn't a MSAFD base provider handle? Does it give a sensible error, or...?

piscisaureus commented Sep 30, 2018

What do you think of unconditionally calling SIO_BASE_HANDLE and using that instead of the original socket for everything, the way libuv seems to?

Libuv doesn't do that. It uses SIO_BASE_HANDLE to determine whether it can use some optimizations, but it doesn't attempt to bypass LSPs.

If you want to bypass LSPs outright, then you should do it when you create the socket. This is done by using WSASocketW() instead of socket() and passing it the appropriate WSAPROTOCOL_INFOW structure associated with the base service provider.

When you create a non-IFS-LSP-bound socket and then get and use the base handle everywhere, it would likely cause memory leaks and other weirdness (in particular if you also use the base handle when closing the socket with closesocket()).

I guess this will probably have the effect of circumventing any LSPs, but... will anyone care?

Not too many people will care, probably. The only LSP I've seen people use intentionally is Proxifier. That's an IFS LSP, so there's no need to bypass it; the socket is a valid AFD handle.

And do you happen to know what happens if you try to use the AFD poll stuff with a handle that isn't a MSAFD base provider handle? Does it give a sensible error, or...?

You'll get a STATUS_INVALID_HANDLE error.
https://github.com/metoo10987/WinNT4/blob/f5c14e6b42c8f45c20fe88d14c61f9d6e0386b8e/private/ntos/afd/poll.c#L155

njsmith commented Oct 1, 2018

Libuv doesn't do that. It uses SIO_BASE_HANDLE to determine whether it can use some optimizations, but it doesn't attempt to bypass LSPs.

Oh, you're right. I was looking at uv_poll_init, which does use SIO_BASE_HANDLE for everything, but uv_poll_t is only used for checking readability/writability – normally libuv networking is through the uv_tcp_t and uv_udp_t paths, which don't care about this stuff because they just use IOCP send/receive.

It looks like the only place uv_tcp_t uses SIO_BASE_HANDLE is when calling CancelIo. Odd, I wonder why they do that. Hmm, and it looks like it was added in libuv/libuv@6e8eb33 by... @piscisaureus :-). Do you remember why you did it that way? ...oh ugh, is it because LSPs can't override CancelIo, so it just fails on any non-IFS handle?

If you want to bypass LSPs outright, then you should do it when you create the socket. This is done by using WSASocketW() instead of socket() and passing it the appropriate WSAPROTOCOL_INFOW structure associated with the base service provider.

When you create a non-IFS-LSP-bound socket and then get and use the base handle everywhere, it would likely cause memory leaks and other weirdness (in particular if you also use the base handle when closing the socket with closesocket()).

I see, excellent points, thank you. And on further thought, it probably wouldn't have worked anyway, since for the sockets we control we can use IOCP send; the sockets where we have to poll for writability are the ones from some 3rd-party library like libpq where we can't control what they do anyway.

I guess the main reason this seemed attractive is that it would let us have a single code path, without treating LSP sockets differently. I'm kind of terrified of that, because I have no idea how to test the LSP code paths. Do you by chance happen to know any reliable way to set up LSPs for testing? ("OK, so to make sure we're testing our code under realistic conditions, we're going to install malware on all our Windows test systems...")

Though, it's also probably not a huge deal, because it sounds like we can have a single straight-line code path in all cases anyway, where we do the SIO_BASE_HANDLE dance just when checking for writability, and otherwise use whatever random socket the OS gives us.

Thanks for all the information, by the way, this is super helpful.

piscisaureus commented Oct 2, 2018

It looks like the only place uv_tcp_t uses SIO_BASE_HANDLE is when calling CancelIo. Do you remember why you did it that way? ...oh ugh, is it because LSPs can't override CancelIo, so it just fails on any non-IFS handle?

Indeed CancelIo/CancelIoEx doesn't work with non-IFS sockets. IIRC libuv also works around issues with SetFileCompletionNotificationModes not working in the presence of non-IFS LSPs. Come to think of it, I'm kind of surprised that CreateIoCompletionPort does work; it seems unlikely that LSPs can trap that?

I see, excellent points, thank you. And on further thought, it probably wouldn't have worked anyway, since for the sockets we control we can use IOCP send; the sockets where we have to poll for writability are the ones from some 3rd-party library like libpq where we can't control what they do anyway.

This was also the main reason we added uv_poll -- to integrate with c-ares (dns library).
Although I think if I were to do it again today, I'd have used it also for uv_tcp/uv_udp so the different backends could share a lot more code.

I'm kind of terrified of that, because I have no idea how to test the LSP code paths. Do you by chance happen to know any reliable way to set up LSPs for testing? ("OK, so to make sure we're testing our code under realistic conditions, we're going to install malware on all our Windows test systems...")

https://github.com/piscisaureus/wepoll/blob/0857889a0fccdba62654e11ae743dd96d85cc711/.appveyor.yml#L73-L95

Proxifier, as mentioned earlier, is an IFS LSP I have seen in the wild.
The PC Tools thing contains a non-IFS LSP. The product itself is obsolete. I think it actually breaks the internet on AppVeyor, but I can still test on the loopback interface.

IIRC older versions of the Windows SDK also contained a reference implementation for different LSP types. You could compile and install it. I've never gotten around to doing that.

njsmith commented Oct 2, 2018

Indeed CancelIo/CancelIoEx doesn't work with non IFS sockets. IIRC libuv also works around issues with SetFileCompletionNotificationModes not working in the presence of non-IFS LSPs. Come to think of it, I'm kind of surprised that CreateIoCompletionPort does work, it seems unlikely that LSPs can trap that?

Looking again at Network Programming for Microsoft Windows, 2nd ed., I think the way LSPs handle IO completion ports is that when you call CreateIoCompletionPort, then ws2_32.dll somehow gets notified and it internally stores an association between the LSP socket and the relevant completion port. Then when an LSP intercepts a call to WSASend or whatever, it's supposed to notice that there's a non-NULL OVERLAPPED, and then when the operation completes it calls another magic ws2_32.dll function called WPUCompleteOverlappedRequest, which looks up the appropriate IO completion port / event object / etc. and handles the notification delivery. So... that's fine. But CancelIo didn't exist yet when this was written, and I can't see any way they could have retrofitted it besides the SIO_BASE_HANDLE hack that libuv's already using.

That said, I have no idea why this doesn't work with FILE_SKIP_COMPLETION_PORT_ON_SUCCESS. It's true that there's no reasonable way for WPUCompleteOverlappedRequest to know whether the operation succeeded immediately (and in fact the sample LSP in that book always fails all overlapped operations with WSA_IO_PENDING because it makes the code simpler). So they can't skip notifications just when they're redundant – either they have to skip all notifications, even when they're necessary, or they have to deliver all notifications, even when they're redundant. And apparently they chose the first option? Bizarre.

This was also the main reason we added uv_poll -- to integrate with c-ares (dns library). Although I think if I were to do it again today, I'd have used it also for uv_tcp/uv_udp so the different backends could share a lot more code.

That is definitely what we're going to do in Trio, at least to start :-).

https://github.com/piscisaureus/wepoll/blob/0857889a0fccdba62654e11ae743dd96d85cc711/.appveyor.yml#L73-L95

Oh beautiful, we are totally going to steal that wholesale. Thank you!


New plan

So it sounds like at this point we now know everything we need to reimplement select on top of IOCP. This means we can dramatically simplify trio/_core/_io_windows.py (we can get rid of the background thread!), while keeping its external API exactly the same (in particular, trio/_socket.py won't need to change at all), and probably making it faster too. This is great! We should do this.

njsmith commented Oct 3, 2018

This little article has some useful information on the poorly-documented FILE_SKIP_SET_EVENT_ON_HANDLE (apparently you just always want this in modern software).
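
For reference, setting it is one call; a small ctypes sketch using the documented flag values (and, per the discussion above, this is one of the calls that can fail if a non-IFS LSP is in the way):

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    FILE_SKIP_COMPLETION_PORT_ON_SUCCESS = 0x1
    FILE_SKIP_SET_EVENT_ON_HANDLE = 0x2

    def skip_set_event_on_handle(handle):
        # Tell the kernel not to signal the handle's internal event object
        # when overlapped operations complete; we only care about the IOCP.
        ok = kernel32.SetFileCompletionNotificationModes(
            wintypes.HANDLE(handle), FILE_SKIP_SET_EVENT_ON_HANDLE)
        if not ok:
            raise ctypes.WinError(ctypes.get_last_error())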

njsmith commented Oct 25, 2019

I put up a PR for switching to AFD_POLL: #1269

Here's an interesting fact about the AFD_POLL stuff: if you try to issue multiple AFD_POLLs on the same socket at the same time, then Windows gets them confused. For example, if you issue one poll for AFD_POLL_RECEIVE, and a second poll for AFD_POLL_SEND, then Windows might report that the second poll has completed when the socket becomes readable, which is nonsense.

I think @piscisaureus knows about this, because I see some stuff in wepoll that in retrospect seems to be working around it. But it threw me for a loop, so I wanted to leave a note here for future archaeologists :-).

The solution of course is that if you want to do multiple polls simultaneously on the same socket, then you need to coalesce them into a single AFD_POLL operation, and then demux them again afterwards. Kind of annoying, but epoll works the same way, so not a huge deal once you know it's necessary.
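
The bookkeeping for that coalescing is straightforward; here's a pure-Python sketch of just the combine/demux logic (the AFD_POLL_* values shown are the undocumented ones wepoll and libuv use, and the class is purely illustrative):

    # Undocumented flag values, as used by wepoll/libuv:
    AFD_POLL_RECEIVE = 0x0001
    AFD_POLL_SEND = 0x0004

    class AfdWaiters:
        """Tracks which tasks are waiting for which events on one socket."""

        def __init__(self):
            self.waiters = {}  # task -> event mask it cares about

        def combined_mask(self):
            # The event mask to put in the single outstanding AFD_POLL.
            mask = 0
            for events in self.waiters.values():
                mask |= events
            return mask

        def demux(self, reported_events):
            # When the AFD_POLL completes, figure out which tasks to wake.
            return [task for task, events in self.waiters.items()
                    if events & reported_events]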

Hmm, looking at that wepoll code again, I'm noticing that when it wants to cancel an old AFD_POLL and submit a new one, it waits for the cancellation to be confirmed before it issues a new one. In my PR, I call CancelIoEx and then immediately submit the new one, then when the old operation's completion notification arrives I just discard it. That passes our test suite, so it isn't obviously broken, but maybe there's some weird edge case that @piscisaureus ran into that made him switch to the more slow-and-careful approach? OTOH maybe it's just because wepoll wants to only allocate one completion object per socket, and that wouldn't work if you could have two operations in flight at the same time.

There's also this intriguing Exclusive boolean field on the AFD_POLL submission struct, which I guess might automatically cancel the previous operation, rather than making us do it by hand through CancelIoEx? That would be handy if true. But since this is all completely undocumented I have no idea what it actually does. I don't even know where the name Exclusive came from.

@piscisaureus
Copy link

piscisaureus commented Oct 30, 2019

@njsmith

Some notes on polling the socket multiple times:

  • I can confirm your observation that polling a socket multiple times (without setting the Exclusive flag) is buggy. I have never caught it reporting the wrong events; in my experience, only events associated with the last submitted AFD_POLL operation would complete, while the earlier ones just get stuck and never report any events.

  • However, when you submit multiple poll operations which are associated with separate \Device\Afd\Bla handles, and those handles are associated with distinct completion ports, then it all seems to work fine. While the usefulness of supporting this may not be super obvious, there are real use cases for this, e.g. polling a server socket for AFD_POLL_ACCEPT events from multiple threads which each have their own completion port, as a simple form of load balancing. This strategy is used by wepoll and it is tested in https://github.com/piscisaureus/wepoll/blob/d5f8f5f1b1be1a4ba8adb51eb4ee4de7a305a9c8/test/test-multi-poll.c .

  • When the Exclusive flag is used, the poll operation becomes very exclusive: it will cancel all poll operations on the same socket, across all driver handles and completion ports. Libuv does use the Exclusive flag and therefore does not support the form of load balancing mentioned above. The reason we did this is that at the time we had to support Windows XP, which does not support the CancelIoEx() syscall, so leveraging the Exclusive flag was the only way we could cancel pending poll operations.

In my PR, I call CancelIoEx and then immediately submit the new one, then when the the old operation's completion notification arrives I just discard it.

This strategy may be viable but it relies on two assumptions:

  1. It assumes that CancelIoEx() is always synchronous, which, according to the docs, is not guaranteed. If it turns out not to be synchronous, the kernel will run an "asynchronous procedure call" in the thread that started the operation and update the OVERLAPPED struct at a later time, after CancelIoEx() has already returned. If you recycle the OVERLAPPED memory allocation immediately after calling CancelIoEx without checking that it is no longer pending, you'll get heap corruption.

  2. The second assumption is that you can reliably determine whether a completion is for the 'old' vs the 'new' operation. In single-threaded operation this is trivial (you can simply use a counter). In multi-threaded operation I don't believe it is possible to do this reliably. Since wepoll purports to be thread-safe, it always waits for the 'old' cancelled operation to complete before submitting a new one.

@njsmith
Copy link
Member Author

njsmith commented Oct 30, 2019

@piscisaureus Once again, thanks so much for sharing your hard-earned wisdom :-)

I can confirm your observation that polling a socket multiple times (without setting the Exclusive flag) is buggy. I have never caught it reporting the wrong events; in my experience, only events associated with the last submitted AFD_POLL operation would complete, while the earlier ones just get stuck and never report any events.

I made a standalone demo just to convince myself I wasn't seeing things... if you set up a socket so that it's neither readable nor writable, then submit a poll for SEND, then submit a poll for RECEIVE, then send a byte through to make it readable... the SEND poll completes with the RECEIVE flag set. Nothing about this makes the slightest bit of sense (note that it's actually the earlier poll that completes first in this case!), and then it gets even weirder from there. If you're morbidly curious, you can see it here: https://github.com/python-trio/trio/blob/7818f589326220c21f9b15cb9f7d257c6e14d44b/notes-to-self/afd-lab.py

However, when you submit multiple poll operations which are associated with separate \Device\Afd\Bla handles, and those handles are associated to distinct completion ports, then it all seems to work fine.

Just to confirm: when you say "separate \Device\Afd\Bla handles", you mean, separate calls to CreateFile, right, not necessarily distinct path strings? Wepoll always uses \Device\Afd\Wepoll and the multi-poll thing still works?

I would also like to allow multiple independent event loops to poll the same socket at the same time, and right now I just cargo-culted from wepoll, so every event loop instance re-opens the same hard-coded path (\Device\Afd\Trio); if that's going to cause problems then I want to know :-).

...Probably the real answer is that I should write some tests.

(By the way, does the \Device\Afd\Blabla path actually matter for anything? Is there any deep reason you use \Wepoll?)

When the Exclusive flag is used, the poll operation becomes very exclusive: it will cancel all poll operations on the same socket, across all driver handles and completion ports.

Ah, OK, good to know. Guess I don't want to use that then!

This strategy may be viable but it relies on two assumptions:

I allocate a new OVERLAPPED object for every operation (this is Python, we're malloc'ing constantly anyway), and I'm careful not to deallocate the old OVERLAPPED object until the cancellation notification arrives. So I think that avoids both of the problems you mentioned.
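A minimal sketch of that lifetime bookkeeping, with hypothetical helper names (allocate_overlapped(), issue_afd_poll(), address_of(), deliver_events(), and afd_handle are stand-ins, not Trio's actual internals; CancelIoEx stands for the corresponding kernel32 call):

# Sketch only: one fresh OVERLAPPED per operation, kept alive via this dict
# until its completion notification is dequeued, so a late completion from a
# cancelled operation can never touch freed memory.
in_flight = {}  # overlapped_address -> (overlapped_object, state)

def start_poll(sock, flags):
    ov = allocate_overlapped()              # hypothetical helper
    in_flight[address_of(ov)] = (ov, "live")
    issue_afd_poll(sock, flags, ov)         # hypothetical helper
    return ov

def cancel_poll(ov):
    CancelIoEx(afd_handle, ov)              # may complete asynchronously
    in_flight[address_of(ov)] = (ov, "cancelled")
    # note: ov is *not* freed here; we wait for its completion to arrive

def on_completion(overlapped_address, reported_flags):
    ov, state = in_flight.pop(overlapped_address)
    if state == "cancelled":
        return                              # stale notification: just discard it
    deliver_events(reported_flags)          # hypothetical helper
    # ov can now be dropped safely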

The big assumption that I am making, that I'm not sure is valid, is that if I CancelIoEx on IOCTL_AFD_POLL and then submit a new independent IOCTL_AFD_POLL, I won't hit the weird bugs that are triggered by having multiple IOCTL_AFD_POLL operations outstanding at the same time.

I can at least confirm that Trio's test suite passes with my current approach. That definitely wasn't true with my first draft where I issued separate SEND and RECEIVE polls simultaneously. But like... it's really hard to know what other weird bugs are lurking.

@piscisaureus
Copy link

Just to confirm: when you say "separate \Device\Afd\Bla handles", you mean, separate calls to CreateFile, right, not necessarily distinct path strings? Wepoll always uses \Device\Afd\Wepoll and the multi-poll thing still works?

Correct. The file name doesn't matter and can be the same every time, I'm talking about different handles.

The big assumption that I am making, that I'm not sure is valid, is that if I CancelIoEx on IOCTL_AFD_POLL and then submit a new independent IOCTL_AFD_POLL, I won't hit the weird bugs that are triggered by having multiple IOCTL_AFD_POLL operations outstanding at the same time.

I don't think that's gonna be a problem. But who knows 🤷‍♂

@piscisaureus
Copy link

@njsmith

There is one more thing that I forgot to mention:

In wepoll I never associate more than 32 sockets with one \Device\Afd\Wepoll handle (see https://github.com/piscisaureus/wepoll/blob/d5f8f5f1b1be1a4ba8adb51eb4ee4de7a305a9c8/src/poll-group.c#L13).

The reason for this is that CancelIoEx() seems to get slower as the number of outstanding poll requests related to that driver handle goes up; I suspect that CancelIoEx walks some linked list of IRPs to find the one to cancel. In a well-designed program you usually won't change the event set very often, so in effect CancelIoEx won't be called very often. But in scenarios where you do end up calling it, the resulting O(number_of_sockets^2) time complexity becomes very noticeable.

With 32 sockets per AFD handle, the overhead of CancelIoEx is unmeasurable. I felt that the overhead of opening 1.03 handles per socket (on average) was worth it to avoid this problem.
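To make the strategy concrete, here's a rough sketch of that kind of poll-group allocator; open_afd_helper_handle() is a hypothetical stand-in for the CreateFile call on a \Device\Afd\... path, and this is not wepoll's actual code:

# Sketch of wepoll's poll-group idea: cap the number of sockets that share one
# \Device\Afd helper handle, so CancelIoEx never has to search through more
# than MAX_GROUP_SIZE outstanding poll requests.
MAX_GROUP_SIZE = 32

class PollGroup:
    def __init__(self):
        self.handle = open_afd_helper_handle()  # hypothetical helper
        self.size = 0

poll_groups = []

def acquire_poll_group():
    # Reuse the newest group if it still has room; otherwise open another handle.
    if poll_groups and poll_groups[-1].size < MAX_GROUP_SIZE:
        group = poll_groups[-1]
    else:
        group = PollGroup()
        poll_groups.append(group)
    group.size += 1
    return group

def release_poll_group(group):
    group.size -= 1
    # (a real implementation would also close or recycle empty groups)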

@njsmith
Copy link
Member Author

njsmith commented Oct 31, 2019

@piscisaureus whoa, yikes! Trio doesn't expose a register/unregister API, it exposes a one-off wait-for-this-socket API, so we end up registering/unregistering sockets constantly, so that's very important to know about!

@njsmith
Copy link
Member Author

njsmith commented Oct 31, 2019

OK, can confirm; with a little script that creates N sockets, registers READ polls on all of them, and then cancels the polls, on a virtualized Windows 10 v1607 (OS build 14393.2214), I get:

-- 10 sockets --
socket creation: 1.57 ms, 156.62 µs/socket
spawning wait tasks: 1.03 ms, 102.72 µs/socket
cancelling wait tasks: 0.63 ms, 63.14 µs/socket
closing sockets: 0.45 ms, 44.99 µs/socket

-- 100 sockets --
socket creation: 14.45 ms, 144.51 µs/socket
spawning wait tasks: 5.46 ms, 54.63 µs/socket
cancelling wait tasks: 3.50 ms, 34.96 µs/socket
closing sockets: 3.08 ms, 30.84 µs/socket

-- 1000 sockets --
socket creation: 136.73 ms, 136.73 µs/socket
spawning wait tasks: 59.74 ms, 59.74 µs/socket
cancelling wait tasks: 46.98 ms, 46.98 µs/socket
closing sockets: 36.55 ms, 36.55 µs/socket

-- 10000 sockets --
socket creation: 1360.23 ms, 136.02 µs/socket
spawning wait tasks: 1028.86 ms, 102.89 µs/socket
cancelling wait tasks: 4302.78 ms, 430.28 µs/socket
closing sockets: 697.85 ms, 69.78 µs/socket

-- 20000 sockets --
socket creation: 2723.91 ms, 136.20 µs/socket
spawning wait tasks: 3930.50 ms, 196.53 µs/socket
cancelling wait tasks: 41185.01 ms, 2059.25 µs/socket
closing sockets: 2224.82 ms, 111.24 µs/socket

-- 30000 sockets --
socket creation: 4760.19 ms, 158.67 µs/socket
spawning wait tasks: 8201.80 ms, 273.39 µs/socket
cancelling wait tasks: 129445.00 ms, 4314.83 µs/socket
closing sockets: 5533.64 ms, 184.45 µs/socket

This is using the current version of #1269, which uses a single handle for all AFD_POLL operations. (Note the socket creation times are a bit slow b/c they include setting up loopback TCP connections.)

So there's a small super-linearity in the issue times, and a huge super-linearity in the cancel times, though it doesn't get severe until >1000 sockets.

For comparison, here's the same script on Linux (using epoll):

-- 10 sockets --
socket creation: 0.03 ms, 3.38 µs/socket
spawning wait tasks: 0.38 ms, 38.25 µs/socket
cancelling wait tasks: 0.33 ms, 32.79 µs/socket
closing sockets: 0.04 ms, 4.11 µs/socket

-- 100 sockets --
socket creation: 0.29 ms, 2.86 µs/socket
spawning wait tasks: 2.74 ms, 27.43 µs/socket
cancelling wait tasks: 2.41 ms, 24.10 µs/socket
closing sockets: 0.28 ms, 2.78 µs/socket

-- 1000 sockets --
socket creation: 2.40 ms, 2.40 µs/socket
spawning wait tasks: 19.11 ms, 19.11 µs/socket
cancelling wait tasks: 25.10 ms, 25.10 µs/socket
closing sockets: 2.44 ms, 2.44 µs/socket

-- 10000 sockets --
socket creation: 30.91 ms, 3.09 µs/socket
spawning wait tasks: 332.19 ms, 33.22 µs/socket
cancelling wait tasks: 389.10 ms, 38.91 µs/socket
closing sockets: 25.25 ms, 2.52 µs/socket

-- 20000 sockets --
socket creation: 83.32 ms, 4.17 µs/socket
spawning wait tasks: 1221.53 ms, 61.08 µs/socket
cancelling wait tasks: 768.57 ms, 38.43 µs/socket
closing sockets: 50.18 ms, 2.51 µs/socket

-- 30000 sockets --
socket creation: 166.94 ms, 5.56 µs/socket
spawning wait tasks: 2095.34 ms, 69.84 µs/socket
cancelling wait tasks: 1236.32 ms, 41.21 µs/socket
closing sockets: 78.62 ms, 2.62 µs/socket

Test script:

import time
import trio
import trio.testing
import socket

async def main():
    for total in [10, 100, 1_000, 10_000, 20_000, 30_000]:
        def pt(start, end, desc):
            total_ms = (end - start) * 1000
            per_sock_us = total_ms * 1000 / total
            print(f"{desc}: {total_ms:.2f} ms, {per_sock_us:.2f} µs/socket")

        print(f"\n-- {total} sockets --")
        t_start = time.perf_counter()
        sockets = []
        for _ in range(total // 2):
            a, b = socket.socketpair()
            sockets += [a, b]
        t_sockets = time.perf_counter()
        pt(t_start, t_sockets, "socket creation")
        async with trio.open_nursery() as nursery:
            for s in sockets:
                nursery.start_soon(trio.hazmat.wait_readable, s)
            await trio.testing.wait_all_tasks_blocked()
            t_registered = time.perf_counter()
            pt(t_sockets, t_registered, "spawning wait tasks")
            nursery.cancel_scope.cancel()
        t_cancelled = time.perf_counter()
        pt(t_registered, t_cancelled, "cancelling wait tasks")
        for sock in sockets:
            sock.close()
        pt(t_cancelled, time.perf_counter(), "closing sockets")

trio.run(main)

@njsmith
Copy link
Member Author

njsmith commented Oct 31, 2019

Hmm, and here's something embarrassing... if I run that script with 500 sockets, then on current Trio master (select-based), I get:

-- 500 sockets --
socket creation: 71.15 ms, 142.30 µs/socket
spawning wait tasks: 9.89 ms, 19.77 µs/socket
cancelling wait tasks: 14.21 ms, 28.41 µs/socket
closing sockets: 17.15 ms, 34.29 µs/socket

While with the #1269's IOCP-based polling, I get:

-- 500 sockets --
socket creation: 68.05 ms, 136.11 µs/socket
spawning wait tasks: 27.85 ms, 55.69 µs/socket
cancelling wait tasks: 19.83 ms, 39.66 µs/socket
closing sockets: 16.16 ms, 32.33 µs/socket

So switching to IOCP makes this like... ~2x slower. I'm guessing this is mostly the overhead of having moved more code into Python and using FFI to generate C calls at runtime, versus the select.select wrapper that's built into Python and written directly in C?

Of course, the reason I used 500 sockets is that select.select crashes if you give it >512 sockets, and to work around that we'd need to write our own wrapper in Python, which would presumably increase the overhead again. And with the select version every loop iteration is O(total sockets), while with IOCP each loop iteration is only O(active sockets); in this microbenchmark every socket is active, but in real programs total sockets >> active sockets.

So I think the conclusions for now are:

  • The new IOCP backend is still probably faster than the old select backend for real world usage
  • There are probably opportunities for micro-optimizations in the IOCP backend. I can speculate about some of them (moving the CFFI usage to ahead-of-time mode? don't allocate a new OVERLAPPED for every socket? maybe skip calling WSAIoctl(SIO_BASE_HANDLE) on every poll operation somehow?), but I don't have a working profiler on Windows currently so these are just wild guesses.
  • Splitting up AFD_POLL operations over multiple helper handles is important for scaling to >1000 sockets, but since the current backend crashes if you have >500 sockets, the AFD_POLL branch is still an improvement over the status quo; the quadratic (?) CancelIoEx shouldn't be considered a regression, just an opportunity for future improvements.

@njsmith
Copy link
Member Author

njsmith commented Oct 31, 2019

And with the select version every loop iteration is O(total sockets), while with IOCP each loop iteration is only O(active sockets); in this microbenchmark every socket is active, but in real programs total sockets >> active sockets.

OK, confirmed this empirically – on the old select backend, a pure no-op schedule operation gets ~4x slower with 500 sockets compared to 10 sockets:

-- 10 sockets --
socket creation: 1.40 ms, 139.83 µs/socket
spawning wait tasks: 0.80 ms, 80.20 µs/socket
scheduling 1000 times: 23.00 ms, 23.00 µs/schedule   <----
cancelling wait tasks: 0.67 ms, 66.81 µs/socket
closing sockets: 0.55 ms, 54.93 µs/socket

-- 500 sockets --
socket creation: 66.06 ms, 132.12 µs/socket
spawning wait tasks: 10.76 ms, 21.53 µs/socket
scheduling 1000 times: 83.72 ms, 83.72 µs/schedule  <----
cancelling wait tasks: 11.57 ms, 23.15 µs/socket
closing sockets: 18.91 ms, 37.82 µs/socket

And on the IOCP backend there's no slowdown, like we'd expect:

-- 10 sockets --
socket creation: 1.51 ms, 151.12 µs/socket
spawning wait tasks: 1.27 ms, 127.41 µs/socket
scheduling 1000 times: 18.27 ms, 18.27 µs/schedule  <----
cancelling wait tasks: 0.92 ms, 91.87 µs/socket
closing sockets: 0.63 ms, 63.03 µs/socket

-- 500 sockets --
socket creation: 68.47 ms, 136.94 µs/socket
spawning wait tasks: 26.55 ms, 53.10 µs/socket
scheduling 1000 times: 17.71 ms, 17.71 µs/schedule  <----
cancelling wait tasks: 18.57 ms, 37.13 µs/socket
closing sockets: 21.82 ms, 43.65 µs/socket

@njsmith
Copy link
Member Author

njsmith commented Oct 31, 2019

@piscisaureus btw, based on these measurements the cost of each individual CancelIoEx appears to grow super-linearly with the number of outstanding polls... so somehow they're doing something O(n^2) inside each call? It's even worse than just iterating over an O(n) list...

@piscisaureus
Copy link

@piscisaureus whoa, yikes! Trio doesn't expose a register/unregister API, it exposes a one-off wait-for-this-socket API

So maybe this is an optimization to consider: instead of calling epoll_ctl() or NtDeviceIoControlFile() immediately from wait_readable()/wait_writable(), you can store the new interest set somewhere and defer making the actual syscall until just before you hit epoll_wait()/GetQueuedCompletionStatusEx(). This saves you from making a syscall to register interests and then cancelling them right after. I don't know how big the net impact would be in Trio's case, since there is probably some overhead associated with the bookkeeping itself. I do know that libev and libuv and wepoll use this optimization -- note that these are C libraries, so bookkeeping is cheaper for them.
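A rough sketch of that deferral, with apply_interest_change() and block_for_events() as hypothetical stand-ins for the real epoll_ctl()/NtDeviceIoControlFile() and epoll_wait()/GetQueuedCompletionStatusEx() calls:

# Sketch only: wait_readable()/wait_writable() just record the desired interest
# set; the syscalls happen once, right before the blocking wait, and an interest
# that was added and dropped within the same tick costs no syscall at all.
pending = {}     # sock -> interest flags wanted for the next tick
registered = {}  # sock -> interest flags currently registered with the kernel

def request_interest(sock, flags):
    pending[sock] = pending.get(sock, registered.get(sock, 0)) | flags

def drop_interest(sock, flags):
    pending[sock] = pending.get(sock, registered.get(sock, 0)) & ~flags

def flush_and_wait(timeout):
    for sock, flags in pending.items():
        if flags != registered.get(sock, 0):    # skip no-op changes entirely
            apply_interest_change(sock, flags)  # hypothetical syscall wrapper
            registered[sock] = flags
    pending.clear()
    return block_for_events(timeout)            # hypothetical blocking wait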

So switching to IOCP makes this like... ~2x slower. I'm guessing this is mostly the overhead of having moved more code into Python and using FFI to generate C calls at runtime, versus the select.select wrapper that's built into Python and written directly in C?

It could also be overhead associated with making more syscalls.

Personally I wouldn't be too worried about it. Polling 500 sockets where all sockets are active is right in the sweet spot for select() efficiency, because you end up making only one syscall to retrieve a large number of events, and the overhead of creating and scanning the poll set that typically causes select's poor scalability is absent in this case (there are events for all sockets in the poll set, so no cycles are wasted). I wouldn't be surprised if select on Linux is also faster than epoll in this particular scenario. Real-life workloads look very different; in particular, in each event loop cycle you'll typically only retrieve a handful of events.

@piscisaureus btw, based on these measurements the cost of each individual CancelIoEx appears to grow super-linearly with the number of outstanding polls...

I had no idea it was that bad; I just noticed that CancelIoEx scaled badly. Thanks for sharing.

so somehow they're doing something O(n^2) inside each call? It's even worse than just iterating over an O(n) list...

I don't know what they're doing. CancelIoEx did not appear until Windows Vista so there's no source code that we can look at.

(The win2k source code for NtCancelIoFile is here. It looks like this function walks a per-thread linked list of IRPs; if the CancelIoEx implementation did something similar, spreading out IRPs over multiple AFD device handles would not make any difference, so I suppose they use a different data structure now.)

@njsmith
Copy link
Member Author

njsmith commented Nov 5, 2019

So maybe this is an optimization to consider: instead of calling epoll_ctl() or NtDeviceIoControlFile() immediately from wait_readable()/wait_writable(), you can store the new interest set somewhere and defer making the actual syscall until just before you hit epoll_wait()/GetQueuedCompletionStatusEx(). This saves you from making a syscall to register interests and then cancelling them right after.

Hmm. I'm having trouble visualizing how this would be a noticeable win.

For something like kqueue or io_uring, where you can submit multiple events in a single syscall, obviously batching makes sense. (We should look into doing that for kqueue, in fact...) For epoll or AFD, the only win would be if you would have cancelled the submission on the same iteration of the event loop where you submitted it, like you say.

I think the only two cases where this would happen are:

  1. if the wait_* gets cancelled (like via Trio's cancellation system) on the same iteration where it started. This does happen, because if there was a pending cancellation, it'll get delivered as soon as the task goes to sleep. But (a) I have trouble imagining a workload where the bottleneck is delivering Trio-level cancellations to wait_*, and (b) we could catch the common cases by doing a very cheap check for pending cancellation before we submit the wait in the first place, without any complex batching logic. So the only case where batching would help is if some other task calls cancel on the exact same iteration of the event loop where the first task entered the wait_*, which has to be vanishingly rare.

  2. If separate calls to wait_readable and wait_writable are submitted on the same loop iteration, so we first submit one interest alone, and then have to replace that with a combined interest for both. Again, I guess this could happen sometimes, but it seems very uncommon... in practice wait_writable only gets called when we're writing a substantial chunk of data and start getting backpressure, and there's no reason that wait_readable would sync up with that. (Though hmm, I guess it could be more common if we switch to always calling wait_writable after sending data, as considered in Should send_all automatically do wait_send_all_might_not_block after sending the data? #371.)

Am I missing something? It's not obvious to me why this would be a noticeable win for libuv either – your API encourages a different pattern for registering/unregistering waiters, but it still seems like it would be pretty rare for the same socket to get registered and then unregistered again within a single tick. Yet you say that libuv took the trouble to implement this. So that seems like more evidence that I might be missing something.

@piscisaureus
Copy link

@njsmith

Hmm. I'm having trouble visualizing how this would be a noticeable win.

I am not sure myself that it would actually be a win for Trio. I suggested it because... (emphasis mine)

Trio doesn't expose a register/unregister API, it exposes a one-off wait-for-this-socket API, so we end up registering/unregistering sockets constantly

... gave me the impression that cancellation is a thing that happens often.

On Linux it's easy to see how this might happen. Let's consider the case where a user wants to send a large buffer over a socket:

  1. User calls r = write(s, buffer, length). It fails with EAGAIN or returns r < length to indicate a partial write.
  2. User calls wait_writable(s) -> Framework calls epoll_ctl(epfd, EPOLL_CTL_ADD, s, &event /* with EPOLLOUT */)
  3. Framework calls epoll_wait(). One of the returned events indicates that s is writable again.
  4. Framework calls epoll_ctl(epfd, EPOLL_CTL_DEL, s, NULL) because it is supposed to report an event only once.
  5. Repeat, go to step 1.

So in every 'write cycle' the user sees EAGAIN and subsequently registers interest in the EPOLLOUT event, and the framework deregisters the event because it doesn't know what user code is going to do next. Both can be avoided by deferring the actual epoll_ctl() syscall so it happens just before epoll_wait().
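In Trio terms, the loop that produces this register/deregister churn looks roughly like the following sketch (assuming sock is a non-blocking socket.socket; this is not Trio's actual SocketStream/send_all code):

import trio

async def send_all(sock, data):
    remaining = memoryview(data)
    while remaining:
        try:
            sent = sock.send(remaining)  # non-blocking send; may be partial
        except BlockingIOError:
            # Kernel buffer is full: register write interest, sleep until
            # writable (the framework deregisters it once it fires), retry.
            await trio.hazmat.wait_writable(sock)
            continue
        remaining = remaining[sent:]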

Again, I'm not actually campaigning for this change, and I don't know for sure whether it is applicable to Trio.
It's also (at best) an optimization; #1280 seems more important.

@GSPP
Copy link

GSPP commented Nov 5, 2019

The NtCancelIoFile source code indicates that this routine waits for cancellation in 10ms increments:

    // Delay execution for a time and let the request
    // finish.  The delay time is 10ms.

That could be a performance killer if that is still the current behavior.

@oremanj
Copy link
Member

oremanj commented May 12, 2020

#1269 landed a while ago, so I'm going to close this. I doubt we'll have any trouble finding it if we need to reference something. :-)

@njsmith
Copy link
Member Author

njsmith commented Aug 5, 2021

Apparently Windows is adding Yet Another I/O Management System. Basically io_uring-but-for-windows, which is neat. Unfortunately, though, it goes through WaitForMultipleEvents, not IOCP: https://windows-internals.com/i-o-rings-when-one-i-o-operation-is-not-enough/
Maybe they'll fix it? Currently it's very early days.
